Perhaps, although intelligence and knowledge are two separate things, so one can display intelligence over a given set of knowledge without knowing other things. Of course intelligence isn't a scalar quantity - to be super-intelligent you'd want to display intelligence across the widest variety of experience and knowledge sets - not just "book smart" or "street smart", but both and more.
Certainly for parity with humans you need to be able to interact with the real world, but I'm not sure it's much different or a whole lot more complex. Instead of "predict next word", the model/robot would be doing "predict next action", followed by "predict action response". Embeddings are a very powerful type of general-purpose representation - you can embed words, but also perceptions etc - so I don't think we're very far at all from having similar transformer-based models able to act & perceive. I'd be somewhat surprised if people aren't already experimenting with this.
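To make the "predict next action" analogy concrete, here's a minimal sketch (my own illustration, not any existing system - all names and dimensions are made up) of how perceptions and actions could be embedded into one shared token sequence and fed to a causal transformer that predicts the next action, directly mirroring next-word prediction:

```python
# Hypothetical sketch: observations (continuous perceptions) and discrete actions
# are both mapped into a shared embedding space, interleaved into one sequence,
# and a causal transformer predicts the next action - same recipe as a language model.
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    def __init__(self, obs_dim=64, num_actions=16, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)           # embed perceptions
        self.act_embed = nn.Embedding(num_actions, d_model)   # embed actions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_actions)           # "next action" logits

    def forward(self, observations, actions):
        # Interleave into one token stream: o_1, a_1, o_2, a_2, ...
        obs_tok = self.obs_proj(observations)                 # (B, T, d_model)
        act_tok = self.act_embed(actions)                     # (B, T, d_model)
        B, T, D = obs_tok.shape
        seq = torch.stack([obs_tok, act_tok], dim=2).reshape(B, 2 * T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T)
        h = self.encoder(seq, mask=mask)                      # causal: attend only to the past
        return self.head(h[:, 0::2])                          # predict a_t from the prefix ending at o_t

# Usage: 8 timesteps of 64-dim perceptions plus past actions -> next-action logits per step.
model = ActionPredictor()
obs = torch.randn(1, 8, 64)
acts = torch.randint(0, 16, (1, 8))
logits = model(obs, acts)  # shape (1, 8, 16)
```

The "predict action response" half would just be another head (or more tokens) predicting the next observation embedding - the point is that once everything is an embedding, the same sequence-prediction machinery applies.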