Okay, ChatGPT is only text-to-text, but Google & Co are adding more modalities now, including images, audio and robotics. I think one missing step is to fuse the training and inference regimes into one, just as in animals. That probably requires something other than the usual transformer-based token predictors.
This will become a great strategy very quickly.
It is already proving to be quite good for image generation.