Consider the world to contain causal properties which bring about regularities in text, e.g., Alice likes chocolate, so Alice says, "I like chocolate". Alice's liking, i.e., her capacity for preference, desire, taste, aesthetic judgement, etc., is the cause of "like".
Now these causal properties bring about significant regularities in text, so "like" occurring early in a paragraph comes to be extremely predictive of other text tokens occurring (e.g., b-e-s-t, etc.).
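To make that kind of surface regularity concrete, here is a minimal Python sketch; the toy corpus and the bare co-occurrence counting are invented for illustration, and real models learn far richer statistics than this:

```python
from collections import Counter
from itertools import combinations

# Toy corpus: "like" co-occurs with words such as "best" purely because the
# underlying preference causes both to be uttered.
corpus = [
    "i like chocolate it is the best",
    "i like this film it is the best",
    "i hate mondays they are the worst",
]

# Count how often pairs of tokens appear in the same sentence.
pair_counts = Counter()
for sentence in corpus:
    tokens = set(sentence.split())
    for a, b in combinations(sorted(tokens), 2):
        pair_counts[(a, b)] += 1

# "like" and "best" co-occur, so each becomes predictive of the other,
# without any representation of the preference that caused both.
print(pair_counts[("best", "like")])  # -> 2
```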
No one in this debate doubts, whatsoever, that NNs contain "subnetworks" which divide the problem up into detecting these token correlations. This is readily observable in CNNs, where it is trivial to demonstrate subnetworks "activating" on, say, an eye shape.
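As a rough sketch of what "observing a subnetwork activate" amounts to in practice, assuming PyTorch and a recent torchvision with a pretrained ResNet-18 are available: hook an intermediate layer and inspect its per-channel activation maps. Which channel, if any, responds to eye-like shapes is an empirical matter and not claimed here.

```python
import torch
from torchvision import models

# Load a pretrained CNN (assumes torchvision >= 0.13 and downloaded weights).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Capture the activations of one intermediate layer via a forward hook.
activations = {}
def hook(module, inputs, output):
    activations["layer2"] = output.detach()

model.layer2.register_forward_hook(hook)

# A stand-in input; in practice this would be a preprocessed image.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)

# Each channel is a candidate "subnetwork" output: a spatial map showing
# where in the image that feature detector fired.
feature_maps = activations["layer2"][0]           # shape: (channels, H, W)
print(feature_maps.shape, feature_maps[7].max())  # channel 7 chosen arbitrarily
```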
The issue is that when a competent language user judges someone's sentiment, or the sentiment the speaker of some text would be implying -- they are not using a model of how some subset of terms ("like", etc.) comes to be predictive of others.
They're using the fact that they know the relevant causal properties (liking, preference, desire, etc.) and how these cause certain linguistic phrases. It is for this reason that a competent language user can trivially detect irony ("of course I like going to the dentist!" -- here, since we know how unlikely it is to desire this, we know the phrase is unlikely to express such a preference).
To say NNs, or any ML system, are sensitive to these mere correlations is not to say that these correlations are not formed by tracking the symptoms of real causes (e.g., desire). Rather, it is to say they do not track desire itself.
This seems obvious, since the mechanism used to train them is sensitive only to patterns in tokens. These patterns are not their causes, and are not models of their causes. They're only predictive of them under highly constrained circumstances.
Astrological signs are predictive of birth dates, but they aren't models of being born -- nor of time, or anything else.
No one here doubts that NNs are sensitive to patterns in text caused by causal properties -- the issue is that they aren't models of these properties; they are models of (some of) their effects as encoded in text.
This makes (merely) predictive models extremely fragile, as we often see.
One worry about this fragility is safety: no one doubts that, say, city route planning learned from 1bn+ images is done via a "pixel-correlation (world) model" of pedestrian behaviour. The issue is that it isn't a model of pedestrian behaviour.
So it is only effective insofar as the effects of pedestrian behaviour, as captured in those images, in those environments, etc., remain constant.
If you understand pedestrians, i.e., people, then you can imagine their behaviour in arbitrary environments.
Another way of putting it: correlative models of effects aren't sufficient for imagining novel circumstances. They encode only the effects of causes in the circumstances they were trained on.
Whereas if you have a real world model, you can trivially simulate arbitrary circumstances.
NNs cannot apply a 'concept' across different 'effect' domains, because they have only one effect domain: the training data. They are just models of how the effect shows itself in that data.
This is why they do not have world models: they are not generalising data by building an effect-neutral model of something; they're just modelling its effects.
Compare having a model of 3D space vs. a model of the shadows cast by a fixed set of 3D objects. NNs generalise in the sense that they can still predict shadows similar to those in their training set. They cannot predict the 3D structure; and given sufficiently novel objects, they fail catastrophically.
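A toy numpy sketch of the shadow point (everything here is invented for illustration): a regressor fit on the shadows of cubes recovers height perfectly in-distribution, but a novel object whose shadow looks the same breaks it, because only the effect (shadow area) was ever modelled, not the 3D cause.

```python
import numpy as np

rng = np.random.default_rng(0)

def shadow_area(width, depth):
    """The 'effect': a box's ground shadow depends only on its width and depth."""
    return width * depth

# Training data: cubes, where height == width == depth, so height happens to be
# perfectly recoverable from the shadow alone.
widths = rng.uniform(1.0, 5.0, size=200)
X = shadow_area(widths, widths)          # shadow areas
y = widths                               # true heights (== width for a cube)

# Fit a simple predictor: height ~ a * sqrt(area) + b (a model of the effect).
A = np.column_stack([np.sqrt(X), np.ones_like(X)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_height(area):
    return coef[0] * np.sqrt(area) + coef[1]

# In-distribution: another cube. Works fine.
print(predict_height(shadow_area(3.0, 3.0)))   # ~3.0

# Novel object: a tall thin column (width=depth=1, height=10). Its shadow looks
# like a small cube's, so the prediction is wildly wrong.
print(predict_height(shadow_area(1.0, 1.0)))   # ~1.0, but the true height is 10.0
```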