From the Wikipedia page on one of the strongest chess engines ever [1]: "Like Leela Zero and AlphaGo Zero, Leela Chess Zero starts with no intrinsic chess-specific knowledge other than the basic rules of the game. Leela Chess Zero then learns how to play chess by reinforcement learning from repeated self-play."
Though I'll give it to ChatGPT: castling across the bishop was a genius move.
Leela, by contrast, requires a specialized iterative tree search on top of its network to generate move recommendations: https://lczero.org/dev/wiki/technical-explanation-of-leela-c...
Which is not to diminish the work of the Leela team at all! But I find it fascinating that an unmodified GPT architecture can build up internal neural representations that correspond closely to board states, despite not having been designed for that task. As they say, attention may indeed be all you need.
According to figure 6b [0], removing MCTS reduces Elo by about 40%, i.e. the raw network retains roughly 60% of the full system's strength. Scaling 1800 Elo by 5/3 gives about 3000 Elo, which would be superhuman but still not as good as e.g. Leela Zero.
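For concreteness, a back-of-envelope version of that extrapolation (the 60% figure is just the complement of the ~40% drop read off figure 6b):

    # Rough sketch of the Elo extrapolation above: if dropping MCTS costs
    # ~40% of Elo, the raw policy network keeps ~60% of the full system's.
    raw_net_elo = 1800
    fraction_kept_without_mcts = 0.6   # approximate, read off figure 6b
    implied_full_elo = raw_net_elo / fraction_kept_without_mcts
    print(implied_full_elo)            # 3000.0, i.e. 1800 * 5/3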
[0]: https://gwern.net/doc/reinforcement-learning/model/alphago/2...
On the contrary, by allowing vectors for unrelated concepts to be only almost orthogonal, it's possible to represent a much larger number of unrelated concepts. https://terrytao.wordpress.com/2013/07/18/a-cheap-version-of...
In machine learning, this phenomenon is known as polysemanticity or superposition https://transformer-circuits.pub/2022/toy_model/index.html
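A quick numerical illustration of the near-orthogonality point (a minimal sketch assuming numpy; the dimensions and counts are arbitrary):

    # In high dimension, random unit vectors are almost orthogonal, so far
    # more than d "concepts" can share a d-dimensional space with only small
    # cross-talk. Typical |dot| between random unit vectors is ~1/sqrt(d).
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 1024, 10_000                # 10k "concepts" in 1024 dimensions
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    # Sample random pairs and look at their dot products.
    i = rng.integers(0, n, 1_000)
    j = rng.integers(0, n, 1_000)
    keep = i != j                      # discard the rare self-pairs
    dots = np.abs(np.einsum("ij,ij->i", v[i], v[j]))[keep]
    print(dots.mean())                 # ~0.025, close to 1/sqrt(1024)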
Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
https://aider.chat/docs/repomap.html
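For intuition, here is a toy version of the idea (not aider's actual implementation; aider uses tree-sitter, while this stand-in uses Python's stdlib ast module, so it only handles Python files):

    # Toy "repository map": list top-level classes/functions per file. This
    # condensed skeleton is the kind of context prepended to an LLM prompt.
    import ast
    from pathlib import Path

    def repo_map(root: str) -> str:
        lines = []
        for path in sorted(Path(root).rglob("*.py")):
            try:
                tree = ast.parse(path.read_text(encoding="utf-8"))
            except (SyntaxError, UnicodeDecodeError):
                continue
            lines.append(str(path))
            for node in tree.body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(f"    def {node.name}(...)")
                elif isinstance(node, ast.ClassDef):
                    lines.append(f"    class {node.name}:")
        return "\n".join(lines)

    print(repo_map("."))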
More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
What are you basing this on? To me it seems like difficulty is set by limiting search depth/time: https://github.com/lichess-org/fishnet/blob/master/src/api.r...
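That matches how you would do it with any UCI engine; a minimal sketch using python-chess (the library and the Stockfish binary path are assumptions here, not part of fishnet itself):

    # Weaker "difficulty levels" via capped search, as in the fishnet code
    # linked above: bound the engine's depth and thinking time per move.
    import chess
    import chess.engine

    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish") as engine:
        result = engine.play(board, chess.engine.Limit(depth=5, time=0.05))
        print(result.move)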
1. https://en.m.wikipedia.org/wiki/Shannon_number
I am not sure about this. From the article: "The 50M parameter model played at 1300 ELO with 99.8% of its moves being legal within one day of training."
I thought the experiment was about how well the model would perform given that its reward function is to predict text rather than to checkmate. For Leela and AlphaZero, the reward function is to win the game, by checkmating or capturing pieces. It also goes without saying that Leela and AlphaZero cannot make illegal moves.
The experiment does not need to include the whole board position if that's a problem; it could, for example, encode more information about the squares covered by each side, as in the sketch below. See also this training experiment for Trackmania [1]. There are techniques that the ML algorithm will *never* figure out by itself if this information is not encoded in its training data.
The point still stands: PGN notation is certainly not a good format if the goal (or one of the goals) of the experiment is to produce a good chess player.
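As a concrete example of the kind of extra encoding meant above, here is a sketch (assuming python-chess) that adds "squares covered by each side" to a position, information a model fed raw PGN text would have to infer on its own:

    # Two 64-entry masks: squares attacked by White and by Black. Features
    # like these can be appended to a position encoding during training.
    import chess

    def coverage_planes(board: chess.Board):
        white = [int(board.is_attacked_by(chess.WHITE, sq)) for sq in chess.SQUARES]
        black = [int(board.is_attacked_by(chess.BLACK, sq)) for sq in chess.SQUARES]
        return white, black

    w, b = coverage_planes(chess.Board())
    print(sum(w), sum(b))  # 22 22 in the starting position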
Great stuff.
>More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
As Microsoft's experiments with phi-1 and phi-2 show, training data makes a difference. The "Textbooks Are All You Need" motto means that better-structured, clearer data produces better models.
https://arxiv.org/abs/2311.00871
https://arxiv.org/abs/2309.13638