Strictly speaking, it would be a mistake to assign a probability of exactly zero to any move, even an illegal one, and especially for an AI that learns by example and self-play. It never gets taught the rules, it only gets shown the games -- there's no reason it should conclude that the probability of a rook moving diagonally is exactly zero just because it has never seen that happen in the data and gets penalized in training every time it tries it.
But even for a human, assigning a probability of exactly zero is too strong. It would forbid any possibility that you misunderstood a rule or forgot a special case. It's a good idea to always maintain at least a small amount of epistemic humility about being mistaken about the rules, so that sufficiently overwhelming evidence could convince you that a move you thought was illegal turns out to be legal.
It is possible the model calculates an approximate board state, one that differs from the true board state but is equivalent for most games, though not all. It would be interesting to train an adversarial policy to check this. From the KataGo attack we know this does happen for Go AIs: Go rules have a concept of liberties, but the so-called pseudo-liberty is easier to calculate and equivalent in most cases (but not all). In fact, human programmers also used pseudo-liberties to optimize their engines. The adversarial attack found that Go AIs use pseudo-liberties too.
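For anyone unfamiliar with the distinction, here is a toy sketch (mine, not from the KataGo-attack paper) of true liberties vs. the pseudo-liberty shortcut. The two counts hit zero at exactly the same time, which is why the shortcut works for capture detection, but they diverge on bent shapes where one empty point touches the group more than once:

    # Toy sketch: true liberties vs. the pseudo-liberty shortcut for a
    # connected group of stones on a Go board.
    def neighbors(p, size=9):
        r, c = p
        return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < size and 0 <= c + dc < size]

    def liberties(group, occupied):
        # distinct empty points adjacent to the group -- the real rule
        return len({n for p in group for n in neighbors(p) if n not in occupied})

    def pseudo_liberties(group, occupied):
        # empty adjacencies counted per stone, duplicates and all -- the shortcut
        return sum(1 for p in group for n in neighbors(p) if n not in occupied)

    group = [(4, 4), (5, 4), (5, 5)]          # a bent three-stone group
    occupied = set(group)                     # no other stones nearby
    print(liberties(group, occupied))         # 7
    print(pseudo_liberties(group, occupied))  # 8 -- the bend double-counts (4, 5)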
From the Wikipedia page on one of the strongest ever[1]: "Like Leela Zero and AlphaGo Zero, Leela Chess Zero starts with no intrinsic chess-specific knowledge other than the basic rules of the game. Leela Chess Zero then learns how to play chess by reinforcement learning from repeated self-play"
Though I will give it to ChatGPT, castling across the bishop was a genius move.
Leela, by contrast, requires a specialized structure of iterative tree searching to generate move recommendations: https://lczero.org/dev/wiki/technical-explanation-of-leela-c...
Which is not to diminish the work of the Leela team at all! But I find it fascinating that an unmodified GPT architecture can build up internal neural representations that correspond closely to board states, despite not having been designed for that task. As they say, attention may indeed be all you need.
>> As they say, attention may indeed be all you need.
I don't think drawing general conclusions about intelligence from a board game is warranted. We didn't evolve to play chess or Go.
sometimes it is not a matter of "is it better? is it larger? is it more efficient?", but just a question.
mountains are mountains, men are men.
I am pretty sure a bunch of matrix multiplications can't intuit anything.
naively, it doesn't seem very surprising that enormous amounts of self play cause the internal structure to reflect the inputs and outputs?
But the model could in principle just have learned a long list of rote heuristics that happen to predict notation strings well, without having made the inferential leap to a much simpler set of rules, and a learner weaker than a LLM could well have got stuck at that stage.
I wonder how well a human (or a group of humans) would fare at the same task and if they could also successfully reconstruct chess even if they had no prior knowledge of chess rules or notation.
(OTOH a GPT3+ level LLM certainly does know that chess notation is related to something called "chess", which is a "game" and has certain "rules", but to what extent is it able to actually utilize that information?)
Pretty shit for a computer. He says his 50M model reached 1800 Elo (by the way, it's Elo and not ELO as the article incorrectly has it; it is named after a Hungarian, Arpad Elo). It seems to be a bit better than Stockfish level 1 and a bit worse than Stockfish level 2 in the bar graph.
Based on what we know, I think it's not surprising that these models can learn to play chess, but they get absolutely smoked by a "real" chess bot like Stockfish or Leela.
Right. Wait, are you talking about AI or humans?
What's kind of amazing is that, in doing so, it actually learns to play chess! That is, the model weights naturally organize into something resembling an understanding of chess, just by trying to minimize error on next-token prediction.
It makes sense, but it's still kind of astonishing that it actually works.
I don't understand how people can say things like this when universal approximation is an easy thing to prove. You could reproduce Magnus Carlsen's exact chess-playing stochastic process with a bunch of matrix multiplications and nonlinear activations, up to arbitrarily small error.
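For reference, one standard (Cybenko/Hornik-style) form of the statement being appealed to, roughly: for any continuous f on a compact set K in R^n, any epsilon > 0, and a suitable non-polynomial activation sigma, there exist a finite N and weights a_i, w_i, b_i with

    \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} a_i \, \sigma(w_i^\top x + b_i) \Big| < \varepsilon

Note that this only guarantees such a network exists; it says nothing about how to find the weights or how large N must be, which is the gap raised further down the thread.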
Also, it took me actually writing a chess game to learn about en passant capture, the fifty-move rule (a forced draw after 50 moves without a capture or pawn move), and the threefold-repetition forced draw.
According to figure 6b [0], removing MCTS reduces Elo by about 40%. If 1800 Elo is the remaining 60%, scaling by 5/3 (1800 / 0.6 = 3000) gives us 3000 Elo, which would be superhuman but not as good as e.g. LeelaZero.
[0]: https://gwern.net/doc/reinforcement-learning/model/alphago/2...
This goes both ways, by the way. I could be convinced that LLMs can achieve something like intuition, but I strongly believe that it is a very different kind of intuition than we normally associate with humans/animals. Using the same label is thus potentially confusing, and (human pride aside) might even prevent us from appreciating the full scope of what LLMs are capable of.
It's still too strong a claim given that matrix multiplication also describes quantum mechanics and by extension chemistry and by extension biology and by extension our own brains… but I frequently encounter examples of mistaking two related concepts for synonyms, and I assume in this case it is meant to be a weaker claim about LLMs not being conscious.
Me, I think the word "intuition" is fine, just like I'd say that a tree falling in a forest with no one to hear it does produce a sound because sound is the vibration of the air instead of the qualia.
It's the active, iterative thinking and planning that is more critical for AGI and, while obviously theoretically possible, much harder to imagine a neural network performing.
If someone came to the table with "intuition is the process of a system inferring a likely outcome from given inputs by the process X - not to be confused with matmultuition which is process Y", that might be a reasonable proposal.
Would love to see a similar experiment for 9x9 Go, where the model also needs to learn the concepts of connected group and its liberties.
Probably most of us even know about en passant, so we think we know everything. But if I found myself in that same bewildering situation, being talked down by a judge after an opponent moved their rook diagonally, I'd have to either admit I was wrong about knowing all the rules, or else at least wonder how and why such an epic prank was being coordinated against me!
To me that suggests investigating whether there are aspects of human culture that can improve chess playing performance - i.e. whether just training on games produces less good results than training on games and literature.
This seems plausible to me, even beyond literature that is explicitly about the game - learning go proverbs, which are often phrased as life advice, is part of learning go, and games are embedded all through our culture, with some stories really illustrating that you have to 'know when to hold em, know when to fold em, know when to walk away, know when to run'.
The experiment could be a little better with a more descriptive form of notation than PGN. PGN's strength is its shorthand properties, because it is used by humans while playing the game. That is far from a strength as LLM training data. ML algorithms and LLMs train better when fed more descriptive and accurate data, and verbosity is not a problem at all. There is FEN notation, in which the entire board is encoded at every move.
One could easily imagine many different ways to describe a game: encoding vertical and horizontal lines, listing exactly which squares each piece covers, what color those squares are, which pieces are able to move, and generating a whole page describing the board situation at each move.
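To make that concrete, here is a quick sketch (assuming the python-chess package) of what a per-move, full-board encoding of the same game looks like, compared to the bare PGN tokens:

    # Sketch (requires python-chess): replay SAN moves and emit the full board
    # state after every move -- FEN plus a printable diagram -- instead of the
    # bare PGN tokens the model in the article was trained on.
    import chess

    moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]   # PGN-style move list
    board = chess.Board()
    for san in moves:
        board.push_san(san)            # apply the move
        print(san)
        print(board.fen())             # entire position encoded at every move
        print(board, end="\n\n")       # ASCII diagram of the board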
I call this spatial navigation: the LLM learns the ins and outs of its training data and is able to make more informed guesses. Chess is fun and all, but code generation has the potential to benefit far more than just writing functions. Feeding the LLM the AST representation of the code, the tree of workspace files, public items, and the module hierarchy alongside the code could be a significant improvement.
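As a rough illustration of the kind of structural context meant here (a toy sketch; "example.py" is just a hypothetical workspace file), Python's own ast module can already produce an outline of public items to feed alongside the raw source:

    # Toy sketch: build a structural outline (classes, functions, signatures)
    # from source code with Python's ast module, to feed alongside the raw code.
    import ast

    with open("example.py") as f:      # hypothetical workspace file
        tree = ast.parse(f.read())

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            print(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            print(f"class {node.name}: " + ", ".join(methods))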
That's not a problem. You can show that neural-network-induced functions are dense in a bunch of function spaces, just like continuous functions. Regularity is not a critical concern anyway.
>functions vs algorithms
Repeatedly applying arbitrary functions to a memory (like in a transformer) yields you arbitrary dynamical systems, so we can do algorithms too.
> an approximator being possible and us knowing how to construct it are very different things,
This is of course the critical point, but not so relevant when asking whether something is theoretically possible. The way I see it, this was the big question for deep learning, and over the last decade the evidence has just continually grown that SGD is VERY good at finding weights that do in fact generalize quite well -- weights that don't just approximate a function out of step functions the way you'd imagine an approximation theorem constructing it, but instead efficiently find features in the intermediate layers and reuse them for multiple purposes, etc. My intuition is that the gradient in high dimensions doesn't just decrease the loss a bit the way we imagine it in a low-dimensional plot, but really finds directions that are immensely efficient at decreasing loss. This is how transformers can become so extremely good at memorization.
More likely, the 16 million games just contain most of the piece-and-square move combinations. It does not know a knight moves in an L; it knows, from each square, where a knight can move, based on 16 million games.
On the contrary, by allowing vectors for unrelated concepts to be only almost orthogonal, it's possible to represent a much larger number of unrelated concepts. https://terrytao.wordpress.com/2013/07/18/a-cheap-version-of...
In machine learning, this phenomenon is known as polysemanticity or superposition https://transformer-circuits.pub/2022/toy_model/index.html
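A quick numerical sanity check of the "almost orthogonal" point (a throwaway numpy sketch; the exact number varies with the seed): pack far more random unit vectors than dimensions and look at the worst-case overlap:

    # Sketch: in d dimensions you can pack many more than d random unit vectors
    # whose pairwise cosine similarities all stay small (but never exactly zero).
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 256, 4096                      # 16x more "concepts" than dimensions
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)
    print(np.abs(cos).max())              # roughly 0.3: nearly, not exactly, orthogonal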
Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
https://aider.chat/docs/repomap.html
More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
One of the challenges in making fun chess bots is making them play like a low- or mid-ranked human. A Stockfish-based bot knows some very strong moves, but deliberately plays bad moves so it's about the right skill level. The problem is that these bad moves are often very obvious. For example, I'll threaten a queen capture. Any human would see it and move their queen. The bot "blunders" and loses the queen to an obvious attack. It feels like the bot is letting you win, which kills the enjoyment of playing with it.
I think that this approach would create very human like games.
Consider the world to contain causal properties which bring about regularities in text, e.g., Alice likes chocolate so Alice says, "I like chocolate". Alice's liking, i.e., her capacity for preference, desire, taste, aesthetic judgement, etc., is the cause of "like".
Now these causal properties bring about significant regularities in text, so "like" occurring early in the paragraph comes to be extremely predictive of other text tokens occurring (e.g., b-e-s-t, etc.)
No one in this debate doubts, whatsoever, that NNs contain "subnetworks" which divide the problem up into detecting these token correlations. This is trivially observable in CNNs, where it is easy to demonstrate subnetworks "activating" on, say, an eye shape.
The issue is that when a competent language user judges someone's sentiment, or the implied sentiment the speaker of some text would have -- they are not using a model of how some subset of terms ("like", etc.) comes to be predictive of others.
They're using the fact that they know the relevant causal properties (liking, preference, desire, etc.) and how these cause certain linguistic phrases. It is for this reason a competent language user can trivially detect irony ("of course I like going to the dentist!" -- here, since we know how unlikely it is to desire this, we know this phrase is unlikely to express such a preference).
To say NNs, or any ML system, are sensitive to these mere correlations is not to say that these correlations are not formed by tracking the symptoms of real causes (e.g., desire). Rather, it is to say they do not track desire.
This seems obvious, since the mechanism to train them is just sensitive to patterns in tokens. These patterns are not their causes, and are not models of their causes. They're only predictive of them under highly constrained circumstances.
Astrological signs are predictive of birth dates, but they aren't models of being born -- nor of time, or anything else.
No one here doubts whether NNs are sensitive to patterns in text caused by causal properties -- the issue is that they aren't models of these properties; they are models of (some of) their effects as encoded in text.
Your links are not about actually orthogonal vectors, so they’re not relevant. Also that’s not what superposition is defined as in your own links:
> In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition
In an ideal AI model this would be the aim though.
This makes (merely) predictive models extremely fragile, as we often see.
One worry about this fragility is safety: no one doubts that, say, city route planning from 1bn+ images is done via a "pixel-correlation (world) model" of pedestrian behaviour. The issue is that it isn't a model of pedestrian behaviour.
So it is only effective insofar as the effects of pedestrian behaviour, as captured in the images, in these environments, etc. remain constant.
If you understood pedestrians, ie., people, then you can imagine their behaviour in arbitrary environments.
Another way of putting it is: correlative models of effects aren't sufficient for imagining novel circumstances. They encode only the effects of causes in those circumstances.
Whereas if you had a real world model, you could trivially simulate arbitrary circumstances.
If the term, "effect model" were used there would be zero debate. Of course NNs model the effects of sentiment.
The debate is that AI hype artists don't merely claim to model effects in constrained domains.
The representation of the ruleset may not be the optimal Kolmogorov complexity - but for an experienced human player who can glance at a board and know what is and isn’t legal, who is to say that their mental representation of the rules is optimizing for Kolmogorov complexity either?
An LLM is predicting what comes next per its training set. If it's trained on human games, then it should play like a human; if it's trained on Stockfish games, then it should play more like Stockfish.
Stockfish, or any chess engine using brute-force lookahead, is just trying to find the optimal move - not copying any style of play - and its moves are therefore sometimes going to look very un-human. Imagine if the human player is looking 10-15 moves ahead, but Stockfish 40-50 moves ahead... what looks good 40-50 moves out might be quite different from what looks good to the human.
There are still a lot of people who deny that (for example Bender's "superintelligent octopus" supposedly wouldn't learn a world model, no matter how much text it trained on), so more evidence is always good.
> There is the FEN notation in which in every move the entire board is encoded.
The entire point of this is to not encode the board state!
What are you basing this on? To me it seems like difficulty is set by limiting search depth/time: https://github.com/lichess-org/fishnet/blob/master/src/api.r...
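For what it's worth, both knobs exist and can be combined. Here is a minimal python-chess sketch (assuming a local stockfish binary on your PATH; this is not a claim about lichess's actual setup) of the usual ways to weaken it:

    # Sketch (python-chess + a local Stockfish binary): the two common ways to
    # weaken the engine -- the UCI "Skill Level" option and capping the search.
    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 3})                      # 0-20, lower = weaker

    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(depth=5))  # also cap search depth
    print(result.move)
    engine.quit()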
Most humans have fast pattern matching that is quite good at finding some reasonable moves.
There are also classes of moves that all humans will spot. (You just moved your bishop, now it’s pointing at my queen)
The problem is that Stockfish scores all moves with a number based on how good the move is. You have no idea if a human would agree.
For example, miscalculating a series of trades 4 moves deep is a very human mistake, but it's scored the same as moving the bishop to a square where it can easily be taken by a pawn. They both result in you being a bishop down. A nerfed Stockfish bot is equally likely to play either of those moves.
You might think that you could have a list of dumb move types that the bot might play, but there are thousands of possible obviously dumb moves. This is a problem for machine learning.
Yes - this is exactly what the probes show.
One interesting aspect is that it still learns to play when trained on blocks of move sequences starting from the MIDDLE of the game, so it seems it must be incrementally inferring the board state by what's being played rather than just by tracking the moves.
NNs cannot apply a 'concept' across different 'effect' domains, because they have only one effect domain: the training data. They are just models of how the effect shows itself in that data.
This is why they do not have world models: they are not generalising data by building an effect-neutral model of something; they're just modelling its effects.
Compare having a model of 3D vs. a model of shadows of a fixed number of 3D objects. NNs generalise in the sense that they can still predict for shadows similar to their training set. They cannot predict 3D; and with sufficiently novel objects, they fail catastrophically.
1. https://en.m.wikipedia.org/wiki/Shannon_number#:~:text=After....
We also don't know what internal representations of the state of play it's using other than what the author has discovered via probes... Maybe it has other representations effectively representing where pieces are (or what they may do next) other than just the board position.
I'm guessing that it's just using all of its learned representations to recognize patterns where, for example, Nf3 and Nh3 are both statistically likely, and has no spatial understanding of the relationship between these moves.
I guess one way to explore this would be to generate a controlled training set where each knight only ever makes a different subset of its (up to 8) legal moves depending on which square it is on. Will the model learn a generalization that all L-shaped moves are possible from any square, or will it memorize the different subset of moves that "are possible" from each individual square?
Maybe I misread something as I only skimmed, but the pretty weak Elo would most definitely suggest a failure of intuiting rules.
I am not sure about this. From the article "The 50M parameter model played at 1300 ELO with 99.8% of its moves being legal within one day of training."
I thought that the experiment was about how well the model would perform given that its reward function is to predict text rather than to checkmate. Leela and AlphaZero's reward function is to win the game, checkmate, or capture pieces. Also, it goes without saying that Leela and AlphaZero cannot make illegal moves.
The experiment does not need to include the whole board position if that's a problem, i.e. if that's an important point of interest. It could encode more information about the squares covered by each side, for example. See for example this training experiment for Trackmania [1]. There are techniques that the ML algorithm will *never* figure out by itself if this information is not encoded in its training data.
The point still stands. PGN notation certainly is not a good format if the goal (or one of the goals) of the experiment is to be a good chess player.
Great stuff.
>More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
As the experiments on Phi-1 and Phi-2 from Microsoft show, training data makes a difference. The "Textbooks Are All You Need" motto means that better-structured, cleaner data makes a difference.
Say a white rook is on h7 and a white pawn is on g7.
Rook gets taken, then the pawn moves to g8 and promotes to a rook.
The rook kind of moved diagonally.
"Ah, when the two pieces are in this position, if you land on my rook, I have the option to remove my pawn from the board and then move my rook diagonally in front of where my pawn used to be."
Functionally, kind of the same? Idk.
Also eating ice cream and getting bitten by a shark do have some mutual predictive associations.
I think that the chess-GPT experiment can be interesting not because the machine can predict every causal connection, but because of how many causal connections it can extract from the training data by itself. By putting a human in the loop, many more causal connections would be revealed, but the human is lazy. Or expensive. Or expensive because he is lazy.
In addition, correlation can be a hint of causation. If a human researches it further, then maybe it is just a correlation and nothing substantial, but sometimes it may actually be a causal effect. So there is value in that.
About the overall sentiment, an NN's world model is indeed very different from a human world model.
The author seems more interested in the ability to learn chess at a decent level from such a poor input, as well as what kind of world model it might build, rather than wanting to help it to play as well as possible.
The fact that it was able to build a decent model of the board position from PGN training samples, without knowing anything about chess (or that it was even playing chess) is super impressive.
It seems simple enough to learn that, for example, "Nf3" means that an "N" is on "f3", especially since predicting well requires you to know what piece is on each square.
However, what is not so simple is to have to learn - without knowing a single thing about chess - that "Nf3" also means that:
1) One of the 8 squares that is a knight's move away from f3, and had an "N" on it, now has nothing on it. There's a lot going on there!
2) If "f3" previously had a different piece on it, that piece is now gone (taken) - it should no longer be associated with "f3".
A different way to test the internal state of the model would be to score all possible valid and invalid moves at every position and see how the probabilities of those moves change as a function of the player's Elo rating. One would expect invalid moves to always score poorly, independent of Elo, whereas valid moves would score monotonically with how good they are (as assessed by Stockfish), and the player's Elo would stretch that monotonic function so that the best moves are separated further from the weakest moves for a strong player.
You’re really wishing a lot more in to AI than is actually there.
https://arxiv.org/abs/2311.00871
https://arxiv.org/abs/2309.13638
Certainly a lot of folks wish more into kids than is really there.
That is literally, literally, what it does.
One may argue that it does so wrongly, but that's a different claim entirely.
> there’s no reason to imply they do
The predictions matching reality to the best of our collective abilities to test them is such a reason.
The saying that "all models are wrong but some are useful" is a reason against that.
I don't think it makes sense to talk of the model (potentially) knowing that knights make L-shaped moves (i.e. 2 squares left or right, plus 1 square up or down, or vice versa) unless it is able to add/subtract row/column numbers to be able to determine the squares it can move to on the basis of this (hypothetical) L-shaped move knowledge.
Being able to do row/column math is essentially what I mean by spatial representation - that it knows the spatial relationships between rows ("1"-"8") and columns ("a"-"h"), such that if it had a knight on e1 it could then use this L-shaped move knowledge to do coordinate math like e1 + (1,2) = f3.
I rather doubt this is the case. I expect the board representation is just a map from square name (not coordinates) to the piece on that square, and that generated moves are likely limited to those it saw the piece make from that same square during training - i.e. it's not calculating possible knight destinations based on an L-shaped-move generalization, but rather "recalling" a move it had seen during training when (among other things) there was a knight on a given square.
Somewhat useless speculation perhaps, but would seem simple and sufficient, and an easy hypothesis to test.
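For reference, the "row/column math" generalization in question is tiny to write down explicitly (a toy sketch, nothing to do with how the model actually represents anything):

    # Toy sketch of the L-shaped-move "coordinate math" described above:
    # compute every knight destination from a square name by file/rank arithmetic.
    FILES = "abcdefgh"

    def knight_destinations(square):
        f, r = FILES.index(square[0]), int(square[1]) - 1   # 'e1' -> (4, 0)
        offsets = [(1, 2), (2, 1), (2, -1), (1, -2),
                   (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
        return [f"{FILES[f + df]}{r + dr + 1}"
                for df, dr in offsets
                if 0 <= f + df < 8 and 0 <= r + dr < 8]     # stay on the board

    print(knight_destinations("e1"))   # ['f3', 'g2', 'c2', 'd3']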
And the reverse, can a human situation be expressed as a chessboard presented with a move?
"This model is only trained to predict the next character in PGN strings (1.e4 e5 2.Nf3 …) and is never explicitly given the state of the board or the rules of chess. Despite this, in order to better predict the next character, it learns to compute the state of the board at any point of the game, and learns a diverse set of rules, including check, checkmate, castling, en passant, promotion, pinned pieces, etc. In addition, to better predict the next character it also learns to estimate latent variables such as the ELO rating of the players in the game."
Positions analyzed per $, per W and per watt-dollar are surely much, much higher, though ;)
Even if it does, it doesn't know that it has. And in principle, you can't know for sure if you have or not either. It's just a question of what odds you put on having learned a simplified version for all this time without having realised that yet. Or, if you're a professional chess player, the chance that right now you're dreaming and you're about to wake up and realise you dreamed about forgetting the 𐀀𐀁𐀂𐀃:𐀄𐀅𐀆𐀇𐀈𐀉 move that everyone knows (and you should've noticed because the text was all funny and you couldn't read it, which is a well-known sign of dreaming).
That many people act like things can be known 100% (including me) is evidence that humans quantise our certainty. My gut feeling is that anything over 95% likely is treated as certain, but this isn't something I've done any formal study in, and I'd assume that presentation matters to this number because nobody's[0] going to say that a D20 dice "never rolls a 1". But certainty isn't the same as knowledge, it's just belief[1].
[0] I only noticed at the last moment that this itself is an absolute, so I'm going to add this footnote saying "almost nobody".
[1] That said, I'm not sure what "knowledge" even is: we were taught the tripartite definition of "justified true belief", but as soon as it was introduced to us the teacher showed us the flaws, so I now regard "knowledge" as just the subjective experience of feeling like you have a justified true belief, where all that you really have backing up the feeling is a justified belief with no way to know if it's true, which obviously annoys a lot of people who want truth to be a thing we can actually access.