At the end of the day, the "predict next word" training goal of LLMs is the ultimate intelligence test. If you could always answer it "correctly" (i.e. intelligently), you'd be a polymath genius. Focusing on the "next word" ("autocomplete") aspect and ignoring the knowledge/intelligence needed to do WELL at it is rather misleading! Try autocompleting this:
"The best way to combine quantum mechanics and general relativity into a single theory of everything is ..."
Perhaps, but intelligence and knowledge are two separate things, so one can display intelligence over a given set of knowledge without knowing other things. Of course intelligence isn't a scalar quantity - to be super-intelligent you want to display intelligence across the widest variety of experience and knowledge - not just "book smart" or "street smart", but both and more.
Certainly for parity with humans you need to be able to interact with the real world, but I'm not sure that's much different or a whole lot more complex. Instead of "predict next word" the model/robot would be doing "predict next action", followed by "predict action response". Embeddings are a very powerful type of general-purpose representation - you can embed words, but also perceptions, etc. - so I don't think we're very far at all from having similar transformer-based models able to act & perceive. I'd be somewhat surprised if people aren't already experimenting with this.
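
To make the "predict next action" idea a bit more concrete, here's a toy sketch of the interface I have in mind. It is NOT a transformer and nothing here is trained: `embed`, `pool` and `predict_next_action` are hypothetical names I made up, the embeddings are just fixed pseudo-random vectors, and the "model" is a mean-pool plus dot-product score. The only point is that words, perceptions and actions can all live in one embedded sequence, and the next action is chosen the same way a next word would be.

```python
import random

DIM = 8  # embedding size, arbitrary for this sketch


def embed(kind, value, dim=DIM):
    """Hypothetical embedding: a fixed pseudo-random vector per (kind, value).

    A real system would learn these; here the string seed just makes
    the vector repeatable for the same token.
    """
    rng = random.Random(f"{kind}:{value}")
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def pool(vectors):
    """Mean-pool a list of vectors (a crude stand-in for a transformer)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_next_action(context, actions):
    """Score each candidate action against the pooled context, return the best."""
    ctx = pool([embed(kind, value) for kind, value in context])
    scores = {a: dot(ctx, embed("action", a)) for a in actions}
    return max(scores, key=scores.get), scores


# One sequence mixing words, a perception, and a past action.
context = [
    ("word", "pick"), ("word", "up"), ("word", "cup"),
    ("percept", "cup_on_table"),
    ("action", "reach"),
]
ACTIONS = ["reach", "grasp", "lift", "release"]

action, scores = predict_next_action(context, ACTIONS)
print(action, scores)
```

With untrained random embeddings the chosen action is meaningless, of course. Swap the pooling for a trained sequence model and append the observed "action response" to the context after each step, and you have the loop described above.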