I think my issue with the "don't anthropomorphize" is that it's unclear to me that the main difference between a human and an LLM isn't simply the inability for the LLM to rewrite its own model weights on the fly. (And I say "simply" but there's obviously nothing simple about it, and it might be possible already with current hardware, we just don't know how to do it.)
Even if we decide it is clearly different, this is still an incredibly large and dynamic system. "Stateless" or not, there's an incredible amount of state that is not comprehensible to me.
That said, would you anthropomorphize a meteorological simulation just because it contains lots and lots of constants that you don't understand well?
I'm pretty sure that recurrent dynamical systems pretty quickly become universal computers, but we are treating those that generate human language differently from others, and I don't quite see the difference.
It's fun to think about just how fantastic a brain is, and how much wattage and data-center-scale we're throwing around trying to approximate its behavior. Mega-effecient and mega-dense. I'm bearish on AGI simply from an internetworking standpoint, the speed of light is hard to beat and until you can fit 80 billion interconnected cores in half a cubic foot you're just not going to get close to the responsiveness of reacting to the world in real time as biology manages to do. but that's a whole nother matter. I just wanted to pick apart that magnitude of parameters is not an altogether meaningful comparison :)
This is "simply" an acknowledgement of extreme ignorance of how human brains work.
And if it were just language, I would say, sure maybe this is more limited. But it seems like tensors can do a lot more than that. Poorly, but that may primarily be a hardware limitation. It also might be something about the way they work, but not something terribly different from what they are doing.
Also, I might talk about a meteorological simulation in terms of whatever it was intended to simulate.
There's loads of state in the LLM that doesn't come out in the tokens it selects. The tokens are just the very top layer, and even then, you get to see just one selection from the possible tokens.
If you wish to anthropomorphize, that state - the set of activations, all the calculations that add up to the logits that determine the probability of the token to select, the whole lot of it - is what the model is "thinking". But all you get to see is one selected token.
Then, during autoregression, we run the program again, but one more tick of the CPU clock. Variables get updated a bit more. The chosen token from the previous pass conditions the next token prediction - the hidden state evolves its thinking one more step.
If you just look at the tokens being selected, you're missing this machinery. And the machinery is there. It's a program being ticked by generating tokens autoregressively. It has state which doesn't directly show up in tokens, it just informs which tokens to select. And the tokens it selects don't necessarily reflect the correspondences with perceived reality that the model is maintaining in that state. That's what I meant by talking about a lie.
We need a vocabulary to talk about this machinery. The machinery is learned, and it runs programs, effectively, that help the LLM reduce loss when predicting tokens. Since the tokens it's predicting come from human minds, the programs it's running are (broken, lossy, not very good) simulations of processes that seem to run inside human minds.
The simulations are pretty decent for producing gramatically correct text, for emulating tone and style, and so on. They're okay-ish for representing concepts. They're poor for representing very specific facts. But the overall point is they are simulations, and they have some analogous correspondence with human behavior, such that words we use to describe human behaviour are useful and practical.
They're not true, I'm not claiming that. But they're useful for talking about these weird defective minds we call LLMs.