From the Wikipedia page on one of the strongest chess engines ever [1]: "Like Leela Zero and AlphaGo Zero, Leela Chess Zero starts with no intrinsic chess-specific knowledge other than the basic rules of the game. Leela Chess Zero then learns how to play chess by reinforcement learning from repeated self-play."
Though I'll give it to ChatGPT: castling across the bishop was a genius move.
Leela, by contrast, requires a specialized iterative tree search on top of its network to generate move recommendations: https://lczero.org/dev/wiki/technical-explanation-of-leela-c...
Which is not to diminish the work of the Leela team at all! But I find it fascinating that an unmodified GPT architecture can build up internal neural representations that correspond closely to board states, despite not having been designed for that task. As they say, attention may indeed be all you need.
According to figure 6b [0], removing MCTS reduces Elo by about 40%, i.e. the raw network retains roughly 60% of the full system's strength. Scaling 1800 Elo by 5/3 gives about 3000 Elo, which would be superhuman but still not as good as e.g. Leela Zero.
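For concreteness, a back-of-envelope version of that extrapolation (the 60% figure is just the complement of the ~40% drop read off figure 6b):

    # Rough sketch of the Elo extrapolation above: if dropping MCTS costs
    # ~40% of Elo, the raw policy network keeps ~60% of the full system's.
    raw_net_elo = 1800
    fraction_kept_without_mcts = 0.6   # approximate, read off figure 6b
    implied_full_elo = raw_net_elo / fraction_kept_without_mcts
    print(implied_full_elo)            # 3000.0, i.e. 1800 * 5/3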
[0]: https://gwern.net/doc/reinforcement-learning/model/alphago/2...
On the contrary, by allowing vectors for unrelated concepts to be only almost orthogonal, it's possible to represent a much larger number of unrelated concepts. https://terrytao.wordpress.com/2013/07/18/a-cheap-version-of...
In machine learning, this phenomenon is known as polysemanticity or superposition https://transformer-circuits.pub/2022/toy_model/index.html
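A quick numerical illustration of the near-orthogonality point (a minimal sketch assuming numpy; the dimensions and counts are arbitrary):

    # In high dimension, random unit vectors are almost orthogonal, so far
    # more than d "concepts" can share a d-dimensional space with only small
    # cross-talk. Typical |dot| between random unit vectors is ~1/sqrt(d).
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 1024, 10_000                # 10k "concepts" in 1024 dimensions
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    # Sample random pairs and look at their dot products.
    i = rng.integers(0, n, 1_000)
    j = rng.integers(0, n, 1_000)
    keep = i != j                      # discard the rare self-pairs
    dots = np.abs(np.einsum("ij,ij->i", v[i], v[j]))[keep]
    print(dots.mean())                 # ~0.025, close to 1/sqrt(1024)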
Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
https://aider.chat/docs/repomap.html
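For intuition, here is a toy version of the idea (not aider's actual implementation; aider uses tree-sitter, while this stand-in uses Python's stdlib ast module, so it only handles Python files):

    # Toy "repository map": list top-level classes/functions per file. This
    # condensed skeleton is the kind of context prepended to an LLM prompt.
    import ast
    from pathlib import Path

    def repo_map(root: str) -> str:
        lines = []
        for path in sorted(Path(root).rglob("*.py")):
            try:
                tree = ast.parse(path.read_text(encoding="utf-8"))
            except (SyntaxError, UnicodeDecodeError):
                continue
            lines.append(str(path))
            for node in tree.body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(f"    def {node.name}(...)")
                elif isinstance(node, ast.ClassDef):
                    lines.append(f"    class {node.name}:")
        return "\n".join(lines)

    print(repo_map("."))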
More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
What are you basing this on? To me it seems like difficulty is set by limiting search depth/time: https://github.com/lichess-org/fishnet/blob/master/src/api.r...
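That matches how you would do it with any UCI engine; a minimal sketch using python-chess (the library and the Stockfish binary path are assumptions here, not part of fishnet itself):

    # Weaker "difficulty levels" via capped search, as in the fishnet code
    # linked above: bound the engine's depth and thinking time per move.
    import chess
    import chess.engine

    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish") as engine:
        result = engine.play(board, chess.engine.Limit(depth=5, time=0.05))
        print(result.move)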
1. https://en.m.wikipedia.org/wiki/Shannon_number
I am not sure about this. From the article: "The 50M parameter model played at 1300 ELO with 99.8% of its moves being legal within one day of training."
I thought the experiment was about how well the model would perform given that its reward function is to predict text rather than to checkmate. For Leela and AlphaZero, the reward function is to win the game, by checkmating or capturing pieces. It also goes without saying that Leela and AlphaZero cannot make illegal moves.
The experiment does not need to include the whole board position if that's a problem; it could, for example, encode more information about the squares covered by each side, as in the sketch below. See also this training experiment for Trackmania [1]. There are techniques that the ML algorithm will *never* figure out by itself if this information is not encoded in its training data.
The point still stands: PGN notation is certainly not a good format if the goal (or one of the goals) of the experiment is to produce a good chess player.
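As a concrete example of the kind of extra encoding meant above, here is a sketch (assuming python-chess) that adds "squares covered by each side" to a position, information a model fed raw PGN text would have to infer on its own:

    # Two 64-entry masks: squares attacked by White and by Black. Features
    # like these can be appended to a position encoding during training.
    import chess

    def coverage_planes(board: chess.Board):
        white = [int(board.is_attacked_by(chess.WHITE, sq)) for sq in chess.SQUARES]
        black = [int(board.is_attacked_by(chess.BLACK, sq)) for sq in chess.SQUARES]
        return white, black

    w, b = coverage_planes(chess.Board())
    print(sum(w), sum(b))  # 22 22 in the starting position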
Great stuff.
>More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
As Microsoft's experiments with phi-1 and phi-2 show, training data makes a difference. The "Textbooks Are All You Need" motto means that better-structured, clearer data produces better models.
https://arxiv.org/abs/2311.00871
https://arxiv.org/abs/2309.13638