zlacker

[parent] [thread] 2 comments
1. bee_ri+(OP) 2025-12-05 22:28:50
Naive question, but what is Gemini?

I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe, somewhere in their foundation, a lot more weight is given to the text-based-reasoning stuff than the image-recognition stuff?

replies(2): >>genrad+Dh >>andy12+T51
2. genrad+Dh 2025-12-06 00:36:50
>>bee_ri+(OP)
Go watch some of the more recent Google Developers, Google AI, and Google DeepMind videos; they're all separate channels on YouTube. Try to catch some from the last six months that cover these explanatory topics on the developer side. The ones that stay philosophical/mathematical enough to skip the gritty details should answer your question.
3. andy12+T51 2025-12-06 11:15:55
>>bee_ri+(OP)
No, the "large _language_ model" name is a misnomer nowadays. Some time ago it was indeed common to take a pure-text model and inject embeddings from a separately trained image encoder (which gave "meh" results), but current natively multimodal models are pre-trained on both text and images from the ground up. That's why they are so much better at image understanding.

> Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.

https://arxiv.org/pdf/2312.11805
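
To make the distinction concrete, here's a toy sketch (not Gemini's actual architecture; the dimensions and encoder are made up) contrasting the two designs. In the bolt-on setup, a frozen vision encoder's features live in a different space and need a projection trained after the fact; in the native setup, image patches are just more tokens in the same embedding space the model was pre-trained on.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical LLM embedding width

# --- Bolt-on approach: separately trained image encoder ---
# The encoder's output space (here 8-dim) doesn't match the LLM's,
# so a learned linear projection bridges the gap after pre-training.
image_features = rng.normal(size=8)          # from a frozen vision encoder
projection = rng.normal(size=(8, d_model))   # trained later, on far less data
bolted_on = image_features @ projection      # now shape (d_model,)

# --- Natively multimodal: image patches are just more tokens ---
# Patch embeddings share the text embedding space from the start, and
# the model is pre-trained on interleaved text/image sequences like this.
text_tokens = rng.normal(size=(4, d_model))   # e.g. "The cat sat on"
patch_tokens = rng.normal(size=(3, d_model))  # image patches, same space
sequence = np.concatenate([text_tokens, patch_tokens, text_tokens])

print(bolted_on.shape)  # single projected image vector: (16,)
print(sequence.shape)   # one interleaved training sequence: (11, 16)
```

The point is only where the joint learning happens: a post-hoc projection sees a sliver of the data the full pre-training run does, which is the "meh" results part.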
