Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. When that data is generated by human experts, we therefore might not expect the model to outperform those experts on their original objectives. Yet such models often exhibit surprising capabilities in practice, suggesting that they might surpass human experts in certain respects.

In this work, we study the phenomenon of *transcendence*: when a generative model achieves capabilities that surpass the abilities of the human experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve a better Glicko-2 rating than the players in the dataset.

We theoretically prove that transcendence is enabled by low-temperature sampling, and rigorously assess this experimentally. Finally, we discuss other forms of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.

Consider a setting with experts \( f_1, \dots, f_k \in \cf \), an input space \( \cx \), and a test distribution \( \ptest \in P(\cx) \). We define *transcendence* as:
\[
R_{\ptest}(\hat{f}) > \max_{i \in [k]} R_{\ptest}(f_i).
\]
Here \( R_{\ptest}(f) \) is the expected reward of a predictor \( f \) on the test distribution \( \ptest \) under the reward function \( r(x, y) \): *transcendence* describes cases where the learned predictor \( \hat{f} \) achieves a higher reward than the best expert generating the data.
Note that we focus on an idealized setting, where the learner has access to an infinite amount of data from the distribution \( \dist \) and can choose an arbitrary function to fit it (it is not limited to a particular architecture or by optimization constraints). As we will show, even in this idealized setting, transcendence can be impossible to achieve without further modifying the distribution.

\[
R_{\ptest}(f) = \mathbb{E}_{x \sim \ptest}\left[r_x(f)\right], ~~~\mathrm{where}~~r_x(f) = \mathbb{E}_{y \sim f(\cdot | x)} \left[r(x,y)\right].
\]
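As a concrete illustration of these definitions, here is a minimal NumPy sketch (all numbers hypothetical) that evaluates \( R_{\ptest}(f) \) for a tabular stochastic predictor over a small finite input and action space:

```python
import numpy as np

# Hypothetical toy problem: 3 inputs, 4 actions, and a fixed reward table r(x, y).
reward = np.array([[0.0, 1.0, 0.2, 0.5],
                   [0.3, 0.0, 1.0, 0.1],
                   [1.0, 0.4, 0.0, 0.2]])

def expected_reward(f, p_test, reward):
    """R_p(f) = E_{x~p}[ r_x(f) ], where r_x(f) = E_{y~f(.|x)}[ r(x, y) ].

    f[x] is the conditional distribution f(.|x) as a probability vector."""
    r_x = (f * reward).sum(axis=1)   # inner expectation over y for every x
    return float(p_test @ r_x)       # outer expectation over x ~ p_test

p_test = np.array([0.5, 0.3, 0.2])   # hypothetical test distribution over inputs
f_hat = np.array([[0.1, 0.6, 0.1, 0.2],
                  [0.2, 0.1, 0.6, 0.1],
                  [0.7, 0.1, 0.1, 0.1]])
print(expected_reward(f_hat, p_test, reward))  # 0.713
```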


Now, we consider a temperature sampling scheme over the learned function \( \hat{f} \). Namely, for a temperature \( \tau > 0 \) and a probability distribution \( q \in P(\cy) \), denote by \( \softmax(q;\tau) \in P(\cy) \) the softmax operator with temperature \( \tau \), such that
\[
\softmax(q; \tau)_y = \frac{\exp(q_y/\tau)}{\sum_{y' \in \cy}\exp(q_{y'}/\tau)}.
\]
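A minimal sketch of this operator follows. Note that, as defined here, the softmax is applied directly to a probability vector \( q \in P(\cy) \) rather than to logits; the example vector below is hypothetical.

```python
import numpy as np

def softmax_t(q, tau):
    """softmax(q; tau)_y = exp(q_y / tau) / sum_{y'} exp(q_{y'} / tau)."""
    z = np.asarray(q, dtype=float) / tau
    z -= z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([0.1, 0.3, 0.6])         # a hypothetical distribution over Y
print(softmax_t(q, 1.0))              # mild sharpening, close to uniform
print(softmax_t(q, 0.05))             # nearly all mass on the maximal entry
```

As \( \tau \to 0 \), the output concentrates on the maximizers of \( q \), which is why low-temperature sampling interpolates toward the arg-max predictor defined next.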
Additionally, we define \( \argmax(q) \in P(\cy) \) to be the uniform distribution over the maximizers of \( q \), namely

\[
\argmax(q)_y = \begin{cases}
1/|Y_q| & \mathrm{if}~y \in Y_q \\
0 & \mathrm{if}~y \notin Y_q
\end{cases}, ~~~\mathrm{where}~~~ Y_q = \{y \in \cy ~:~q_y = \max(q)\}.
\]

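The arg-max distribution can be sketched the same way; ties split the mass uniformly over \( Y_q \) (inputs below are hypothetical):

```python
import numpy as np

def argmax_dist(q):
    """Uniform distribution over the maximizers Y_q = {y : q_y = max(q)}."""
    q = np.asarray(q, dtype=float)
    mask = np.isclose(q, q.max())     # boolean indicator of Y_q
    return mask / mask.sum()          # 1/|Y_q| on Y_q, 0 elsewhere

print(argmax_dist([0.2, 0.5, 0.3]))   # [0. 1. 0.]
print(argmax_dist([0.4, 0.4, 0.2]))   # ties: [0.5 0.5 0. ]
```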

Now, define \( \hat{f}_\tau \) to be the temperature sampling of \( \hat{f} \), i.e.
\[
\hat{f}_\tau(\cdot|x) = \softmax(\hat{f}(\cdot|x);\tau),
\]
and \( \hat{f}_{\max} \) the arg-max ``sampling'' of \( \hat{f} \), i.e.
\[
\hat{f}_{\max}(\cdot|x) = \argmax(\hat{f}(\cdot|x)).
\]
We prove in the paper that if the arg-max predictor \( \hat{f}_{\max} \) is better than the best expert, then transcendence is possible with low-temperature sampling. Formally, assume that \( R_{\ptest}(\hat{f}_{\max}) > \max_{i \in [k]} R_{\ptest}(f_i) \). Then there exists some temperature \( \tau \in (0,1) \) such that for all \( 0 < \tau' \le \tau \),
\[
R_{\ptest}(\hat{f}_{\tau'}) > \max_{i \in [k]} R_{\ptest}(f_i).
\]
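As a sanity check of this claim, the toy sketch below (all distributions and rewards hypothetical) builds a learner as the average of two experts over a single state and sweeps the temperature: the arg-max predictor beats both experts, and so do sufficiently low temperatures, while high temperatures do not.

```python
import numpy as np

# Single-state toy example: reward 1 for the middle action, 0 otherwise.
# Each expert puts most of its mass on a different low-reward action,
# so each earns expected reward 0.4.
r  = np.array([0.0, 1.0, 0.0])
f1 = np.array([0.6, 0.4, 0.0])
f2 = np.array([0.0, 0.4, 0.6])
f_hat = (f1 + f2) / 2                 # the learner fits the mixture [0.3, 0.4, 0.3]

def softmax_t(q, tau):
    e = np.exp(np.asarray(q, dtype=float) / tau)
    return e / e.sum()

best_expert = max(r @ f1, r @ f2)                     # 0.4
print("argmax reward:", r @ (f_hat == f_hat.max()))   # 1.0 > 0.4
for tau in [1.0, 0.5, 0.2, 0.1]:
    # Expected reward rises above the best expert once tau is small enough.
    print(tau, r @ softmax_t(f_hat, tau))
```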

Our theory identifies dataset diversity as a necessary condition for transcendence. As shown in the first figure, not all models are able to transcend: unlike ChessFormer 1000 and ChessFormer 1300, ChessFormer 1500 fails to transcend. We hypothesize that this is because diversity does not significantly increase in the band of ratings from 1000 to 1500. If this is true, a 1000-rated player can be thought of as a noisy 1500-rated player, but a 1500-rated player cannot be thought of as a noisy 2000-rated player.
We explore this research question by quantifying dataset diversity through the normalized entropy of the action distribution:
\[
\mathcal{H}_f(Y | X) = \frac{\mathbb{E}_{y \sim f(y|x=X)}\left[-\log_2 f(y | x=X)\right]}{\log_2 |\mathcal{Y}|}.
\]
To gain intuition for this metric, imagine the distribution of moves taken from any given state. Entropy is higher for more uniform action distributions, and lower for more deterministic, peaked ones. The average entropy of these action distributions can therefore serve as a measurement of the diversity of the dataset. We normalize this entropy to the range \([0, 1]\) by dividing by the binary log of the number of legal moves: \(\log_2 |\mathcal{Y}|\).
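A minimal sketch of this estimate for a single state, assuming empirical action counts are available (the counts below are hypothetical, and \(|\mathcal{Y}|\) is taken to be the number of legal moves at the state):

```python
import numpy as np

def normalized_entropy(counts):
    """H_f(Y | X=x) / log2 |Y|, estimated from observed action counts at one state.

    counts[y] is how often move y was played; len(counts) stands in for the
    number of legal moves |Y| at the state."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()            # empirical action distribution, drop zeros
    h = -(p * np.log2(p)).sum()       # Shannon entropy in bits
    return h / np.log2(len(counts))   # normalize to [0, 1]

print(normalized_entropy([25, 25, 25, 25]))  # uniform over 4 moves -> 1.0
print(normalized_entropy([97, 1, 1, 1]))     # peaked -> close to 0
```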
Importantly, we cannot calculate this normalized entropy for every state: most states after move 16 in the midgame and before the endgame are unique within the dataset, so we observe just a single action for those states. Our metric is therefore limited to opening moves, the beginning of the midgame, and the endgame. We consider only common states with more than 100 observed actions, obtained by sampling 1,000,000 games from each dataset. The average entropy confirms our hypothesis: the < 1500 cutoff dataset has on average less diversity than the < 1300 dataset, which in turn has less than the < 1000 dataset.
This points towards answering our research question in the affirmative. If the entropy had stayed constant across the datasets, implying a similar level of diversity for each, we would expect ChessFormer 1500 to transcend as well. Instead, as predicted, ChessFormer 1500 likely fails to transcend due to a lack of diversity in its dataset.

Another example where denoising helps avoid errors. Moving the queen to either d1 or h1 takes a bishop or rook, respectively, but loses the queen in the following turn. While queen to e5 does not put the queen in immediate danger, it allows white to push the pawn on f3 to d3, where it threatens the queen and is protected by the bishop on c1. The queen then must move out of danger, losing its opportunity to take the free pawn on h4 and giving white valuable space towards the center of the board. As \(\tau\) decreases, the expected reward converges to the move queen to d4, taking the pawn and checking the black king.

In this setup, a higher temperature shows two plausible moves for the black rook: g1 or f1. As the temperature decreases, the expected reward converges to g1. If the black rook were to move to f1, the white rook would take the black rook, blocking the black pawn on f2 from promoting and protecting the promotion square from the h2 pawn. If the rook were to move to g1, on the other hand, it would open the promotion square from the h2 pawn without being at any immediate risk. If white responded by moving its bishop to g2, protecting the promotion squares from both of the advanced black pawns, black could respond by taking the rook on a1, gaining significant material.

The first expert's output distribution. Although it puts non-negligible mass on the purple, high-reward action, it still samples a low-reward action the majority of the time.

The second expert's output distribution. Symmetric to the first expert, it also puts non-negligible mass on the purple, high-reward action, but it samples the low-reward action on the right the majority of the time.

By taking the average of the first and second experts, we observe that this distribution now puts the majority of mass onto the correct action.

Finally, by setting temperature \(\tau\) to be <1, more weight is shifted towards the high probability action, leading to a gain in the expected reward.

This project is built on some exceptional prior projects and platforms, for which we are extraordinarily grateful. In no particular order, these include Lichess, our dataset source; Adam Karvonen's codebase for training chess models; and the Stockfish chess engine.

```
$ git clone https://github.com/transcendence-research/chess-research.git
$ make install && source .venv/bin/activate
$ chess_research --help
Setting up experiment...
usage: chess_research [-h] [--save_interval SAVE_INTERVAL] [--eval_every_n_saves EVAL_EVERY_N_SAVES] [--log_interval
LOG_INTERVAL]
[--eval_iters EVAL_ITERS] [--eval_only [EVAL_ONLY]] [--eval_n_games EVAL_N_GAMES]
[--eval_default_elo EVAL_DEFAULT_ELO] [--eval_job_id EVAL_JOB_ID] [--eval_job_total EVAL_JOB_TOTAL]
[--always_save_checkpoint [ALWAYS_SAVE_CHECKPOINT]] [--no_always_save_checkpoint] [--wandb_log [WANDB_LOG]]
[--wandb_project WANDB_PROJECT] [--wandb_run_name WANDB_RUN_NAME] [--resume_from RESUME_FROM]
[--resume_iter_num RESUME_ITER_NUM] [--dataset DATASET] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--batch_size BATCH_SIZE] [--block_size BLOCK_SIZE] [--n_layer N_LAYER] [--n_head N_HEAD] [--n_embd N_EMBD]
[--dropout DROPOUT] [--bias [BIAS]] [--learning_rate LEARNING_RATE] [--max_iters MAX_ITERS]
[--weight_decay WEIGHT_DECAY] [--beta1 BETA1] [--beta2 BETA2] [--grad_clip GRAD_CLIP] [--decay_lr [DECAY_LR]]
[--no_decay_lr] [--warmup_iters WARMUP_ITERS] [--lr_decay_iters LR_DECAY_ITERS] [--min_lr MIN_LR]
[--backend BACKEND] [--device DEVICE] [--dtype DTYPE] [--compile [COMPILE]] [--low_elo LOW_ELO]
[--high_elo HIGH_ELO] [--win_condition [WIN_CONDITION]] [--no_win_condition] [--length_gen LENGTH_GEN]
[--temperature TEMPERATURE] [--seed SEED] [--debug [DEBUG]] [--temperature_sampling [TEMPERATURE_SAMPLING]]
[--no_temperature_sampling] [--elo_generalize [ELO_GENERALIZE]] [-c CONFIG]
options:
-h, --help show this help message and exit
--save_interval SAVE_INTERVAL
--eval_every_n_saves EVAL_EVERY_N_SAVES
--log_interval LOG_INTERVAL
--eval_iters EVAL_ITERS
--eval_only [EVAL_ONLY]
--eval_n_games EVAL_N_GAMES
--eval_default_elo EVAL_DEFAULT_ELO
--eval_job_id EVAL_JOB_ID
--eval_job_total EVAL_JOB_TOTAL
--always_save_checkpoint [ALWAYS_SAVE_CHECKPOINT]
--no_always_save_checkpoint
--wandb_log [WANDB_LOG]
--wandb_project WANDB_PROJECT
--wandb_run_name WANDB_RUN_NAME
--resume_from RESUME_FROM
--resume_iter_num RESUME_ITER_NUM
--dataset DATASET
--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
--batch_size BATCH_SIZE
--block_size BLOCK_SIZE
--n_layer N_LAYER
--n_head N_HEAD
--n_embd N_EMBD
--dropout DROPOUT
--bias [BIAS]
--learning_rate LEARNING_RATE
--max_iters MAX_ITERS
--weight_decay WEIGHT_DECAY
--beta1 BETA1
--beta2 BETA2
--grad_clip GRAD_CLIP
--decay_lr [DECAY_LR]
--no_decay_lr
--warmup_iters WARMUP_ITERS
--lr_decay_iters LR_DECAY_ITERS
--min_lr MIN_LR
--backend BACKEND
--device DEVICE
--dtype DTYPE
--compile [COMPILE]
--low_elo LOW_ELO
--high_elo HIGH_ELO
--win_condition [WIN_CONDITION]
--no_win_condition
--length_gen LENGTH_GEN
--temperature TEMPERATURE
--seed SEED
--debug [DEBUG]
--temperature_sampling [TEMPERATURE_SAMPLING]
--no_temperature_sampling
--elo_generalize [ELO_GENERALIZE]
-c CONFIG, --config CONFIG
```