No, the "large _language_ model" name is a misnomer nowadays. Some time ago it was indeed common to take a pure-text model and inject embeddings from a separately trained image encoder (which produced "meh" results), but current natively multi-modal models are pre-trained on both text and images from the ground up. That's why they are so much better at image understanding.
> Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.
https://arxiv.org/pdf/2312.11805
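
To make the difference concrete, here's a rough sketch of the two architectures in PyTorch. All dimensions, class names, and layer choices are invented for illustration; this is not how Gemini or any specific model is actually implemented.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
IMG_FEAT_DIM = 768    # output dim of a frozen, separately trained image encoder
TEXT_EMB_DIM = 4096   # embedding dim of the text-only LLM
VOCAB_SIZE = 32000


class LateFusionAdapter(nn.Module):
    """The older 'bolt-on' approach: a frozen text LLM plus a frozen image
    encoder, glued together by a small trained projection layer."""

    def __init__(self):
        super().__init__()
        # Only this projection bridges the two separately trained models.
        self.project = nn.Linear(IMG_FEAT_DIM, TEXT_EMB_DIM)

    def forward(self, image_features: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # Map image features into the LLM's embedding space and prepend
        # them as pseudo-tokens in front of the text embeddings.
        image_tokens = self.project(image_features)   # (B, N_img, TEXT_EMB_DIM)
        return torch.cat([image_tokens, text_embeddings], dim=1)


class NativeMultimodalEmbedding(nn.Module):
    """The natively multimodal approach: image patches are tokenized and fed
    into the *same* transformer from the start of pre-training, so every
    weight sees both modalities from day one."""

    def __init__(self, patch_dim: int = 588):  # e.g. 14x14 patches x 3 channels
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB_SIZE, TEXT_EMB_DIM)
        self.patch_embed = nn.Linear(patch_dim, TEXT_EMB_DIM)

    def forward(self, token_ids: torch.Tensor,
                image_patches: torch.Tensor) -> torch.Tensor:
        # Both modalities become ordinary token embeddings in one sequence.
        return torch.cat([self.patch_embed(image_patches),
                          self.text_embed(token_ids)], dim=1)
```

In the first case only the tiny adapter ever learns anything about images, which is why the results tended to be "meh"; in the second, image understanding is baked into every layer during pre-training.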