Am I the only one who remembers when that was the stuff of science fiction? Not so long ago it was an open question whether machines would ever be able to transcribe speech in a useful way. How quickly we become numb to the magic.
Disclaimer: I'm not praising piracy, but outside of US borders it's a free-for-all.
For example, for automatic speech recognition (ASR), see: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
The current best ASR model has 600M params (tiny compared to LLMs, and way faster than any LLM: 3386.02 RTFx vs 62.12 RTFx, so much cheaper) and was trained on 120,000h of speech. In comparison, the next best speech LLM (quite close in WER, but slightly worse) has 5.6B params and was trained on 5T tokens, 2.3M speech hours. It has always been like this: with a fraction of the cost, you get a pure ASR model that still beats every speech LLM.
The same is true for translation models, at least when you have enough training data, i.e. for popular language pairs.
However, LLMs are obviously more powerful in what they can do beyond just speech recognition or translation.
Yes, yes and yes!
I tried speech recognition many times over the years (Dragon, etc). Initially they all were "Wow!", but they simply were not good enough to use. 95% accuracy is not good enough.
Now I use Whisper to transcribe my voice and pass the output to an LLM for cleanup. The LLM contribution is what finally made this feasible.
It's not perfect. I still have to correct things. But only about a tenth as often as I used to. When I'm transcribing notes for myself, I'm at the point where I don't even bother verifying the output. Small errors are OK for my own notes.
See https://blog.nawaz.org/posts/2023/Dec/cleaning-up-speech-rec...
(This is not the best example as I gave it free rein to modify the text - I should post a followup that has an example closer to a typical use of speech recognition).
Without that extra cleanup, Whisper is simply not good enough.
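Roughly, the pipeline looks something like this (a minimal sketch in Python using the open-source openai-whisper package and an OpenAI-compatible client; the model names and the cleanup prompt are illustrative, not my exact setup):

    # Minimal sketch of a Whisper -> LLM cleanup pipeline (illustrative only).
    # Assumes the open-source openai-whisper package and an OpenAI-compatible API key.
    import whisper
    from openai import OpenAI

    def transcribe_and_clean(audio_path):
        # Step 1: raw transcription with a local Whisper model
        asr = whisper.load_model("small")
        raw_text = asr.transcribe(audio_path)["text"]

        # Step 2: let an LLM fix punctuation, casing, and obvious mis-hearings
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "Clean up this speech transcript: fix punctuation and obvious "
                    "recognition errors, but do not change the meaning.")},
                {"role": "user", "content": raw_text},
            ],
        )
        return response.choices[0].message.content

    print(transcribe_and_clean("note.m4a"))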
The problem with Google-Translate-type models is the interface is completely wrong. Translation is not sentence->translation, it's (sentence,context)->translation (or even (sentence,context)->(translation,commentary)). You absolutely have to be able to input contextual information, instructions about how certain terms are to be translated, etc. This is trivial with an LLM.
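As a hedged sketch of what that interface can look like with an LLM (the model name, prompt wording, and example glossary below are made up for illustration):

    # Sketch: (sentence, context, glossary) -> translation via an LLM prompt.
    # Everything here (model name, prompt wording, example) is illustrative.
    from openai import OpenAI

    client = OpenAI()

    def translate(sentence, context, glossary):
        terms = "\n".join(f'- "{src}" must be translated as "{dst}"'
                          for src, dst in glossary.items())
        prompt = (
            "Translate the sentence below from German to English.\n"
            f"Context: {context}\n"
            f"Terminology constraints:\n{terms}\n\n"
            f"Sentence: {sentence}\n"
            "Return only the translation."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # "Läufer" is "runner" in a sports report but "bishop" in chess commentary;
    # the context and glossary are what disambiguate it.
    print(translate("Der Läufer wurde geschlagen.",
                    context="Commentary on a chess game, not a sports report.",
                    glossary={"Läufer": "bishop"}))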
Unfortunately, one of those powerful features is "make up new things that fit well but nobody actually said", and... well, there's no way to disable it. :p
Traditional machine translators, perhaps. Human translation is still miles ahead when you actually care about the quality of the output. But for getting a general overview of a foreign-language website, translating a menu in a restaurant, or communicating with a taxi driver? Sure, LLMs would be a great fit!
These models have made it possible to robustly practice all 4 quadrants of language learning for most common languages using nothing but a computer, not just passive reading. Whisper is directly responsible for 2 of those quadrants, listening and speaking [1]. LLMs are responsible for writing [2]. We absolutely live in the future.
[1]: https://github.com/hiandrewquinn/audio2anki
[2]: https://hiandrewquinn.github.io/til-site/posts/llm-tutored-w...
"As a safe AI language model, I refuse to translate this" is not a valid translation of "spierdalaj".
The current SOTA LLMs are better than traditional machine translators (there is no perhaps) and most human translators.
If a 'general overview' is all you think they're good for, then you've clearly not seriously used them.
It is stated that GPT-4o-transcribe is better than Whisper-large. That might be true, but which version of Whisper-large exactly? Looking at the leaderboard, there are a lot of Whisper variants. But anyway, the best Whisper variant, CrisperWhisper, is currently only at rank 5. (I assume GPT-4o-transcribe was not compared to that but to some other Whisper model.)
It is stated that Scribe v1 from elevenlabs is better than GPT-4o-transcribe. In the leaderboard, Scribe v1 is also only at rank 6.
A slightly off-topic but interesting video about this: https://www.youtube.com/watch?v=OSCOQ6vnLwU
The other day, alone in a city I'd never been to before, I snapped a photo of a bistro's daily specials hand-written on a blackboard in Chinese, copied the text right out of the photo, translated it into English, learned how to pronounce the menu item I wanted, and ordered some dinner.
Two years ago this story would have been: notice the special board, realize I don't quite understand all the characters well enough to choose or order, and turn wistfully to the menu to hopefully find something familiar instead. Or skip the bistro and grab a pre-packaged sandwich at a convenience store.
The captions looked like they would be correct in context, but to the best of my ability I could not cross-reference them against snippets of manually checked audio.
Would you go to a foreign country and sign a work contract based on an LLM translation?
Would you answer a police procedure based on speech recognition alone?
That to me was the promise of science fiction: going to another planet and doing inter-species negotiations based on machine translation. We're definitely not there IMHO, and I wouldn't be surprised if we don't quite get there in our lifetime.
Otherwise, if we're lowering the bar, speech-to-text has been here for decades, albeit clunky and power-hungry. So improvements have been made, but watching old movies is way too low-stakes a situation IMHO.
Somehow LLMs can't do that for structured code with well-defined semantics, but sure, they will be able to extract "obscure references" from speech/text.
To be fair, dedicated apps like Pleco have supported things like this for 6+ years, but the spread of modern language models has made it more accessible.
Also, the traditional cross-attention-based encoder-decoder translation models support document-level translation, and also with context. And Google definitely has all those models. But I think the Google web interface has used much weaker models (for whatever reason; maybe inference costs?).
I think DeepL is quite good. For business applications, there is Lilt or AppTek and many others. They can easily set up a model for you that allows you to specify context, or be trained for some specific domain, e.g. medical texts.
I don't really have a good reference for a similar leaderboard for translation models. For translation, measuring quality is in any case much more problematic than for speech recognition. I think for the best models, only human evaluation works well right now.
This video explains all about it: https://youtu.be/OSCOQ6vnLwU
I really think support for native content is the ideal way to learn for someone like me, especially with listening.
Thanks for posting and good luck.
> Two years ago
This functionality was available in 2014, on either an iPhone or Android. I ordered specials in Taipei way before Covid. Here's the blog post celebrating it:
https://blog.google/products/translate/one-billion-installs/
This is all a post about AI, hype, and skepticism. In my childhood sci-fi, the idea of people working multiple jobs and still not being able to afford rent was written as shocking or seen as dystopian. All this incredible technology is a double-edged sword, but it doesn't solve the problems of the day, only the problems of business efficiency, which exacerbates the problems of the day.
As someone who has started losing the higher frequencies and thus clarity, I have subtitles on all the time just so I don't miss dialogue. The only pain point is when the subtitles (of the same language) are not word-for-word with the spoken line. The discordance between what you are reading and hearing is really distracting.
This is my major peeve with my The West Wing DVDs, where the subtitles are often an abridgement of the spoken line.
Just whatever small LLM I have installed as the default for the `llm` command-line tool at the time. Currently that's gemma3:4b-it-q8_0, though it's generally been some version of llama in the past. And then this fish shell function (basically a bash alias):
    function trans
        llm "Translate \"$argv\" from French to English please"
    end

Not sure if it was due to the poor quality of the sound, the fact people used to speak a bit differently 60 years ago, or that 3 different languages were used (the plot took place in France during WW2).
Whisper can translate to English (and maybe other languages these days?), too.
I use subtitles because I don't want to micromanage the volume on my TV when adverts are forced on me and they are 100x louder than what I was watching.
Yes, Whisper has been able to do this since the first release. At work we use it to live-transcribe-and-translate all-hands meetings and it works very well.
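For reference, the translate-to-English task is exposed directly in the open-source openai-whisper package; a minimal (non-live) example, where the file name and model size are just placeholders:

    # Whisper's built-in translate-to-English task (openai-whisper package).
    # File name and model size are placeholders; live streaming needs extra
    # plumbing around this batch call.
    import whisper

    model = whisper.load_model("large-v3")
    result = model.transcribe("all_hands_recording.wav", task="translate")
    print(result["text"])  # English translation of the non-English speech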
https://www.pcworld.com/article/470008/bing_translator_app_g...
There is really not that much in common between trying to code and trying to translate emotion. At the very least, language "compiles" as long as the words are in a sensible order and maintain meaning from start to finish.
All they need to do now in order to be able to translate well is to have contextual knowledge to inform better responses on the translated end. They’ve been doing that for years, so I really don’t know what you’re getting at here.
Ah yeah, the famous "all they need to do". Such a minor thing left to do.
On their chart they also compare with: Gemini 2.0 Flash, Whisper large-v2, Whisper large-v3, Scribe v1, Nova 1, Nova 2. If you only need English transcription, then pretty much all models are good these days, but there is a big difference depending on the input language.
There are plenty of uncensored models that will run on less than 8GB of VRAM.
But when would that ever happen? Guess you’re right.
- Your mother visiting your sister (arguably extremely low-stakes: at any moment she can just phone your sister, I presume?)
- You traveling around (you're not trying to close a business deal or do anything irreversible)
Basically you seem to be agreeing that it's fine for convenience, but not ready for "science fiction" level use cases.