zlacker

[parent] [thread] 19 comments
1. janals+(OP)[view] [source] 2026-02-04 17:41:39
I noticed that this model is multilingual and understands 14 languages. For many use cases, we probably only need a single language, and the other 13 just add latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.

https://aclanthology.org/2025.findings-acl.87/

replies(9): >>decide+F2 >>popalc+ob >>keegan+ph >>rainco+Dx >>depr+Kx >>idiots+cM >>m463+jM >>black_+Xl1 >>ryan_l+eH1
2. decide+F2[view] [source] 2026-02-04 17:51:35
>>janals+(OP)
I think this model proves it's very efficient and accurate.
replies(1): >>ethmar+nS
3. popalc+ob[view] [source] 2026-02-04 18:27:54
>>janals+(OP)
It doesn't make sense to have a language-restricted transcription model, because of code switching. People aren't machines; we don't stick to a single language consistently. Even monolingual people move in and out of their native language when using "borrowed" words/phrases. A single-language model will often fail to deal with that.
replies(2): >>javier+mi >>janals+ke1
4. keegan+ph[view] [source] 2026-02-04 18:51:55
>>janals+(OP)
uhhh i cast doubt on multi-language support as affecting latency. model size, maybe, but what is the mechanism for making latency worse? i think of model latency as O(log(model size))… but i am open to being wrong / that being a not-good mental model / educated guess.
replies(4): >>make3+Tk >>kergon+Vw >>ethmar+cU >>janals+Sj1
5. javier+mi[view] [source] [discussion] 2026-02-04 18:55:53
>>popalc+ob
yeah, one example I run into is getting my perplexity phone assistant to play a song in spanish. I cannot for the life of me get a model to translate: "Play señorita a mi me gusta su style on spotify" correctly
6. make3+Tk[view] [source] [discussion] 2026-02-04 19:06:24
>>keegan+ph
model size directly affects latency
7. kergon+Vw[view] [source] [discussion] 2026-02-04 20:07:51
>>keegan+ph
Even the effect on model size is modest. A lot of the machinery is common to all languages; you don't multiply the model size by 2 when you double the number of supported languages.
8. rainco+Dx[view] [source] 2026-02-04 20:11:23
>>janals+(OP)
Imagine if ChatGPT had started out like this and they'd decided to trim coding abilities from their language model because most people don't code.
replies(1): >>ethmar+xT
9. depr+Kx[view] [source] 2026-02-04 20:11:48
>>janals+(OP)
STT services that have been around longer, like Azure, Google, and Amazon, generally require you to request a specific language, and their quality is a lot higher than models that advertise themselves as LLMs (though I believe the cloud providers are also using the same types of models now).
10. idiots+cM[view] [source] 2026-02-04 21:14:44
>>janals+(OP)
The hilarious part of this comment is all the comments around it complaining that it doesn't support enough languages.
11. m463+jM[view] [source] 2026-02-04 21:15:06
>>janals+(OP)
I don't know. What about words inherited from other languages? I think a cross-language model could improve lots of things.

For example, "here it is, voila!" "turn left on el camino real"

replies(1): >>janals+3f1
12. ethmar+nS[view] [source] [discussion] 2026-02-04 21:45:31
>>decide+F2
But it could potentially be even more efficient if it was single-language.
13. ethmar+xT[view] [source] [discussion] 2026-02-04 21:51:56
>>rainco+Dx
They've already done the inverse and trimmed non-coding abilities from their language model: https://openai.com/index/introducing-gpt-5-2-codex/. There's already precedent for creating domain-specific models.

I think it's nice to have specialized models for specific tasks that don't try to be generalists. Voxtral Transcript 2 is already extremely impressive, so imagine how much better it could be if it specialized in specific languages rather than cramming 14 languages into one model.

That said, generalist models definitely have their uses. I do want multilingual transcribing models to exist, I just also think that monolingual models could potentially achieve even better results for that specific language.

14. ethmar+cU[view] [source] [discussion] 2026-02-04 21:55:28
>>keegan+ph
If encoding more learned languages and grammars and dictionaries makes the model size bigger, it will also increase latency. Try running a 1B model locally and then try to run a 500B model on the same hardware. You'll notice that latency has rather a lot to do with model size.
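Back-of-envelope for why (assuming decoding is memory-bandwidth-bound, so every weight gets read once per token; the numbers are illustrative, not benchmarks):

    # Rough per-token decode latency if all weights stream from memory each step.
    # bytes_per_param=2 assumes fp16/bf16; the bandwidth figure is an illustrative assumption.
    def per_token_latency_s(params_billions, bytes_per_param=2, mem_bandwidth_gb_s=100):
        weight_gb = params_billions * bytes_per_param
        return weight_gb / mem_bandwidth_gb_s

    print(per_token_latency_s(1))    # 1B model  -> ~0.02 s per token
    print(per_token_latency_s(500))  # 500B model -> ~10 s per token, if it fits in memory at all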
15. janals+ke1[view] [source] [discussion] 2026-02-04 23:48:21
>>popalc+ob
Everything is a tradeoff, and different use cases require different tradeoffs:

Option A: this model

Option B: faster model, only 1 language

Option C: same size model, only 1 language but higher quality

My point is that option A isn’t always best.

And on the borrowed-words bit, there's no rule that we can't add borrowed words to the vocab. But you don't need the whole language: I know what "déjà vu" means, but I don't speak French.

replies(1): >>popalc+FL1
16. janals+3f1[view] [source] [discussion] 2026-02-04 23:54:43
>>m463+jM
Most English speakers would likely understand those without speaking French or Spanish. So it's not necessary to tack on extra languages just because there are loan words.

In general there is a concept called the “curse of multilinguality”

https://arxiv.org/pdf/1911.02116

17. janals+Sj1[view] [source] [discussion] 2026-02-05 00:26:56
>>keegan+ph
Well, for example, the last step is a softmax over all output logits, and the number of logits equals your vocab size. You need the sum of the exponentiated logits to compute the denominator, which is O(N) in the vocab size.

The bigger impact is the step before that: projecting the hidden state onto the vocabulary, a matrix on the order of 4096x250000. Bigger vocab = more FLOPs.

On a GPU things are parallelized, so it may not be quite linear if everything fits nicely, but on a CPU you're going to struggle more.

This is why the juiciest target when shrinking models is the token embedding table. ALBERT, for example, factorized the whole embedding table into two low-rank matrices.
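A rough sketch of those two costs (hidden/vocab sizes from above; the rank and the PyTorch shapes are illustrative, in the spirit of ALBERT's embedding factorization rather than any particular model):

    import torch

    hidden, vocab, rank = 4096, 250_000, 128   # hidden/vocab as above; rank is an illustrative choice

    # Multiply-accumulates per decoded token in the hidden -> vocab projection:
    print(f"full projection:       {hidden * vocab:,}")           # ~1.02e9
    print(f"factored (ALBERT-ish): {(hidden + vocab) * rank:,}")  # ~3.25e7

    # Runnable demo of a factored head: two low-rank matrices instead of one big one.
    head = torch.nn.Sequential(
        torch.nn.Linear(hidden, rank, bias=False),   # hidden -> rank
        torch.nn.Linear(rank, vocab, bias=False),    # rank -> vocab logits
    )
    h = torch.randn(1, hidden)                       # hidden state for one decoded token
    probs = torch.softmax(head(h), dim=-1)           # denominator sums over all `vocab` logits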

18. black_+Xl1[view] [source] 2026-02-05 00:42:37
>>janals+(OP)
honestly the inability to correctly transcribe the 4 language mix i use in my everyday life is one of the major blockers for adopting ASR tech in my own tooling. this coming from someone who literally works in that field.

turns out, outside the US, many people speak more than one language. :)

edit: I should say it *was* a major blocker, because the last iterations of open-weight models actually work better and better. it's often the UX that isn't designed with these use cases in mind.

19. ryan_l+eH1[view] [source] 2026-02-05 03:42:38
>>janals+(OP)
"I only speak one language, so models I use should only understand one".
20. popalc+FL1[view] [source] [discussion] 2026-02-05 04:28:00
>>janals+ke1
That depends entirely on how common the borrowed thing is. And anyway, option A is always going to be insufficient for my code-switching example -- as another commenter pointed out, it's very common to want to refer to a foreign work (song, movie, book) by its foreign-language title. Monolingual ASR solutions break over this all the time. Try asking Alexa to play a Spanish-language track on Spotify; it fails frequently.

The real world is like that.
