Chess-GPT's Internal World Model

>>homarp+(OP)
If you take a neural network that already knows the basic rules of chess and train it on chess games, you produce a chess engine.

From the Wikipedia page on one of the strongest ever[1]: "Like Leela Zero and AlphaGo Zero, Leela Chess Zero starts with no intrinsic chess-specific knowledge other than the basic rules of the game. Leela Chess Zero then learns how to play chess by reinforcement learning from repeated self-play"

[1]: https://en.wikipedia.org/wiki/Leela_Chess_Zero

>>wavemo+tm1
As described in the OP's blog post (https://adamkarvonen.github.io/machine_learning/2024/01/03/c...), one of the incredible things here is that the standard GPT architecture, trained from scratch on PGN strings alone, can intuit the rules of the game from those examples, without being given any notion of the rules of chess or even that it is playing a game.

Leela, by contrast, requires a specialized structure of iterative tree searching to generate move recommendations: https://lczero.org/dev/wiki/technical-explanation-of-leela-c...

Which is not to diminish the work of the Leela team at all! But I find it fascinating that an unmodified GPT architecture can build up internal neural representations that correspond closely to board states, despite not having been designed for that task. As they say, attention may indeed be all you need.

>>btown+Np1
> can intuit the rules of the game from those examples,

I am pretty sure a bunch of matrix multiplications can't intuit anything.

Naively, it doesn't seem very surprising that enormous amounts of self-play cause the internal structure to reflect the inputs and outputs.

>>banana+lC1
> I am pretty sure a bunch of matrix multiplications can't intuit anything.

I don't understand how people can say things like this when universal approximation is an easy thing to prove. You could reproduce Magnus Carlsen's exact chess-playing stochastic process with a bunch of matrix multiplications and nonlinear activations, up to arbitrarily small error.
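To make the approximation claim concrete, here is a minimal sketch (not from the thread) that builds a one-hidden-layer ReLU network by hand so that it linearly interpolates a continuous target at evenly spaced knots. The names `build_relu_interpolant` and `max_error`, and the choice of `math.sin` as the target, are illustrative assumptions. It only shows that suitable weights exist; it says nothing about whether training would find them.

```python
import math

def relu(z):
    return z if z > 0.0 else 0.0

def build_relu_interpolant(f, a, b, n):
    """Return g(x) = f(a) + sum_i c_i * relu(x - x_i): a one-hidden-layer
    ReLU net that linearly interpolates f at n + 1 even knots on [a, b]."""
    h = (b - a) / n
    knots = [a + i * h for i in range(n + 1)]
    values = [f(x) for x in knots]
    slopes = [(values[i + 1] - values[i]) / h for i in range(n)]
    # The output weight of hidden unit i is the change in slope at knot i.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n)]
    bias = values[0]

    def g(x):
        # zip stops at the n hidden units; the last knot needs no unit.
        return bias + sum(c * relu(x - k) for c, k in zip(coeffs, knots))

    return g

def max_error(f, g, a, b, samples=2001):
    """Largest absolute gap between f and g on an even grid over [a, b]."""
    step = (b - a) / (samples - 1)
    return max(abs(f(a + i * step) - g(a + i * step)) for i in range(samples))

coarse = build_relu_interpolant(math.sin, 0.0, math.pi, 8)
fine = build_relu_interpolant(math.sin, 0.0, math.pi, 64)
print("8 hidden units: ", max_error(math.sin, coarse, 0.0, math.pi))
print("64 hidden units:", max_error(math.sin, fine, 0.0, math.pi))
```

Adding hidden units shrinks the worst-case error, which is the "up to arbitrarily small error" part of the claim for a continuous function on a bounded interval.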

>>golol+bF1
This simply isn't true. There are big caveats to the idea that neural networks are universal function approximators (as there are to the idea that they're universal Turing machines, which also somehow became common knowledge in our post-ChatGPT world). The function has to be continuous, we're talking about functions rather than algorithms, an approximator being possible and us knowing how to construct it are very different things, and so on.

>>FreakL+hJ1
> The function has to be continuous

That's not a problem. You can show that neural network induced functions are dense in a bunch of function spaces, just like continuous functions. Regularity is not a critical concern anyways.
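As a rough illustration of why discontinuity is not fatal, the sketch below approximates a step function with two ReLU units forming a steep ramp and measures the integrated (L1) error on [0, 1]. The names `ramp` and `l1_error` are hypothetical, and this is only one toy instance of the density statement, not a proof of it.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def step(x):
    # Discontinuous target: jumps from 0 to 1 at x = 0.5.
    return 1.0 if x >= 0.5 else 0.0

def ramp(x, k):
    # Two ReLU units: rises linearly from 0 at x = 0.5 to 1 at x = 0.5 + 1/k.
    return relu(k * (x - 0.5)) - relu(k * (x - 0.5) - 1.0)

def l1_error(k, samples=100_000):
    # Midpoint Riemann sum of |step - ramp| over [0, 1].
    dx = 1.0 / samples
    return sum(
        abs(step((i + 0.5) * dx) - ramp((i + 0.5) * dx, k)) * dx
        for i in range(samples)
    )

for k in (10, 100, 1000):
    print(k, l1_error(k))
```

The integrated error is 1/(2k), so it goes to zero as the ramp steepens, even though the worst-case pointwise gap right after the jump stays close to 1. That is why the choice of function space matters here.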

>functions vs algorithms

Repeatedly applying arbitrary functions to a memory (like in a transformer) yields you arbitrary dynamical systems, so we can do algorithms too.
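A toy version of that point, as a sketch rather than anything transformer-specific: one fixed update function, applied over and over to a small memory, carries out Euclid's algorithm. `gcd_step` and `iterate` are hypothetical names chosen for the illustration.

```python
def gcd_step(memory):
    """One fixed update rule applied to a two-slot memory (a, b)."""
    a, b = memory
    return (b, a % b) if b != 0 else (a, 0)

def iterate(step_fn, memory, max_steps=1000):
    # Re-apply the same function until the memory stops changing.
    for _ in range(max_steps):
        nxt = step_fn(memory)
        if nxt == memory:
            return memory
        memory = nxt
    return memory

print(iterate(gcd_step, (1071, 462))[0])  # prints 21
```

The function itself has no loop; the algorithm comes entirely from feeding its output back in as the next input, which is the dynamical-systems view described above.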

> an approximator being possible and us knowing how to construct it are very different things,

This is of course the critical point, but it is not so relevant when asking whether something is theoretically possible. The way I see it, this was the big question for deep learning, and over the last decade the evidence has continually grown that SGD is VERY good at finding weights that do in fact generalize quite well. These weights don't just approximate a function from step functions, the way you imagine an approximation theorem constructing it; instead they efficiently find features in the intermediate layers and use them for multiple purposes, etc. My intuition is that the gradient in high dimensions doesn't just decrease the loss a bit, the way we imagine it in a low-dimensional plot, but really finds directions that are immensely efficient at decreasing loss. This is how transformers can become so extremely good at memorization.
