zlacker

[return to "My AI skeptic friends are all nuts"]
1. retrac+J 2025-06-02 21:16:59
>>tablet+(OP)
Machine translation and speech recognition. The state of the art for these is a multi-modal language model. I'm hearing impaired, verging on deaf, and I use this technology all day, every day. I wanted to watch an old TV series from the 1980s. There are no subtitles available. So I fed the show into a language model (Whisper) and now I have passable subtitles that allow me to watch the show.

Am I the only one who remembers when that was the stuff of science fiction? Not so long ago it was an open question whether machines would ever be able to transcribe speech in a useful way. How quickly we become numb to the magic.

2. albert+R4 2025-06-02 21:39:10
>>retrac+J
That's not quite true. The state of the art in both speech recognition and translation is still a dedicated model trained for that task alone, although the gap keeps getting smaller and the ranking also depends heavily on who invests how much training budget.

For example, for automatic speech recognition (ASR), see: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

The current best ASR model has 600M params (tiny compared to LLMs, way faster than any LLM at 3386.02 RTFx vs 62.12 RTFx, and much cheaper) and was trained on 120,000h of speech. In comparison, the next best speech LLM (quite close in WER, but slightly worse) has 5.6B params and was trained on 5T tokens and 2.3M speech hours. It has always been like this: at a fraction of the cost, you get a pure ASR model that still beats every speech LLM.
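
To make the two RTFx figures above concrete, here is a small sketch of the arithmetic. It assumes RTFx means seconds of audio processed per second of compute (higher is faster); that reading, the variable names, and the one-hour workload are illustrative assumptions, not something stated in the comment.

```python
# RTFx figures quoted above. Assumption: RTFx = audio duration / processing
# time, so a higher value means faster transcription.
ASR_RTFX = 3386.02   # dedicated ASR model figure from the comment
LLM_RTFX = 62.12     # LLM figure from the comment


def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Return the compute time needed to transcribe `audio_seconds` of audio."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


one_hour = 3600.0  # hypothetical workload: one hour of audio
asr_time = processing_seconds(one_hour, ASR_RTFX)
llm_time = processing_seconds(one_hour, LLM_RTFX)
speedup = ASR_RTFX / LLM_RTFX

print(f"ASR model: {asr_time:.2f} s per hour of audio")
print(f"LLM:       {llm_time:.2f} s per hour of audio")
print(f"Speedup:   {speedup:.1f}x")
```

Under that assumption, an hour of audio takes roughly one second with the dedicated model and just under a minute with the LLM.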

The same is true for translation models, at least when you have enough training data, i.e. for popular language pairs.

However, LLMs are obviously more powerful in what they can do beyond just speech recognition or translation.

3. edflsa+s6 2025-06-02 21:47:18
>>albert+R4
What translation models are better than LLMs?

The problem with Google-Translate-type models is the interface is completely wrong. Translation is not sentence->translation, it's (sentence,context)->translation (or even (sentence,context)->(translation,commentary)). You absolutely have to be able to input contextual information, instructions about how certain terms are to be translated, etc. This is trivial with an LLM.
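
A minimal sketch of what a (sentence,context)->translation interface could look like when the backend is an LLM prompt. Everything here is hypothetical: the `TranslationRequest` class, the `build_prompt` helper, the prompt wording, and the sample sentence and glossary are made up for illustration and do not come from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationRequest:
    """Hypothetical container for a context-aware translation request."""
    sentence: str
    context: str = ""
    glossary: dict[str, str] = field(default_factory=dict)
    source_lang: str = "French"
    target_lang: str = "English"


def build_prompt(req: TranslationRequest) -> str:
    """Assemble a plain-text prompt carrying the sentence, context, and term rules."""
    parts = [
        f"Translate the following {req.source_lang} sentence into {req.target_lang}."
    ]
    if req.context:
        parts.append(f"Context: {req.context}")
    for term, rendering in req.glossary.items():
        parts.append(f'Always translate "{term}" as "{rendering}".')
    parts.append(f"Sentence: {req.sentence}")
    parts.append("Reply with the translation, then a short commentary on any ambiguity.")
    return "\n".join(parts)


# Illustrative request; the sentence, context, and glossary entry are invented.
req = TranslationRequest(
    sentence="Il a pris la mouche.",
    context="Informal dialogue between two friends after an argument.",
    glossary={"prendre la mouche": "to take offence"},
)
prompt = build_prompt(req)
print(prompt)
```

The point of the sketch is only that context and terminology instructions travel with the sentence, which a bare sentence->translation box has no slot for.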

4. gpm+9g 2025-06-02 22:46:33
>>edflsa+s6
I've been using small local LLMs for translation recently (<=7GB total VRAM usage) and they, even the small ones, definitely beat Google Translate in my experience. And they don't require sharing whatever I'm reading with Google, which is nice.

5. yubble+jh 2025-06-02 22:52:57
>>gpm+9g
What are you using? whisper?

6. gpm+Yh 2025-06-02 22:57:38
>>yubble+jh
Edit: Huh, didn't know whisper could translate.

Just whatever small LLM I have installed as the default for the `llm` command line tool at the time. Currently that's gemma3:4b-it-q8_0, though it's generally been some version of Llama in the past. I pair it with this fish shell function (basically a bash alias):

    # Send the arguments to the default `llm` model for French-to-English translation.
    function trans
        llm "Translate \"$argv\" from French to English please"
    end
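
For anyone not on fish, the same idea can be sketched in Python with the language pair as parameters. This is an illustration under assumptions: it relies only on the basic `llm "prompt"` invocation shown above, and the helper names `build_llm_command` and `translate` are hypothetical.

```python
import shutil
import subprocess


def build_llm_command(text: str, source_lang: str = "French",
                      target_lang: str = "English") -> list[str]:
    """Build the argv list for one `llm` call (a list, so no shell quoting is needed)."""
    prompt = f'Translate "{text}" from {source_lang} to {target_lang} please'
    return ["llm", prompt]


def translate(text: str, **langs: str) -> str:
    """Run `llm` with the default model and return its output."""
    if shutil.which("llm") is None:
        raise RuntimeError("the `llm` command was not found on PATH")
    result = subprocess.run(build_llm_command(text, **langs),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Only builds the command; call translate(...) to actually run the model.
cmd = build_llm_command("Bonjour tout le monde")
print(cmd)
```

Passing the prompt as a single list element sidesteps the quote escaping that the shell version needs.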