I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe somehow in their foundation, a lot more weight is given to the text-based-reasoning stuff, than the image recognition stuff?
> Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.