Okay, ChatGPT is only text-to-text, but Google & Co are adding more modalities now, including images, audio and robotics. I think one missing step is to fuse the training and inference regimes into one, just as in animals. That probably requires something other than the usual transformer-based token predictors.
It's not clear that it is a single regime in animals. Sleep is training (replay from the hippocampus). Wake is inference.