zlacker

Voxtral Transcribe 2

submitted by meetpa+(OP) on 2026-02-04 15:08:17 | 883 points 210 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only. [show all posts]
1. observ+4a[view] [source] 2026-02-04 15:53:39
>>meetpa+(OP)
Native diarization, this looks exciting. edit: or not, no diarization in real-time.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

~9GB model.

◧◩
3. ReadEv+Ua[view] [source] [discussion] 2026-02-04 15:57:57
>>serf+5a
You can try it on HF: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...
5. dmix+4d[view] [source] 2026-02-04 16:07:16
>>meetpa+(OP)
> At approximately 4% word error rate on FLEURS and $0.003/min

Amazon's transcription service is $0.024 per minute, which is a pretty big difference: https://aws.amazon.com/transcribe/pricing/

◧◩◪◨
14. GaggiX+ke[view] [source] [discussion] 2026-02-04 16:13:00
>>mdrzn+Jd
>So it's not just whisper v3 under the hood?

Why should it be Whisper v3? They even released an open model: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

17. simonw+kg[view] [source] 2026-02-04 16:21:17
>>meetpa+(OP)
This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.

I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:

> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?

◧◩
29. daemon+Xm[view] [source] [discussion] 2026-02-04 16:48:11
>>simonw+kg
404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).
44. janals+Gy[view] [source] 2026-02-04 17:41:39
>>meetpa+(OP)
I noticed that this model is multilingual and understands 14 languages. For many use cases we probably only need a single language, and the extra 13 simply add latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.

https://aclanthology.org/2025.findings-acl.87/

47. siddbu+pA[view] [source] 2026-02-04 17:48:38
>>meetpa+(OP)
Wired advertises this as "Ultra-Fast Translation"[^1]. A bit weird coming from a tech magazine. I hope it's just a "typo".

[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...

◧◩
59. nindal+0E[view] [source] [discussion] 2026-02-04 18:02:40
>>antire+Oe
I love seeing people from other countries share their own folk tales about what makes their countries special and unique. I've seen it up close in my country and I always cringed when I heard my fellow countrymen come up with these stories. In my adulthood I'm reassured that it happens everywhere, and I find it endearing.

On the information density of languages: it is true that some languages have a more information dense textual representation. But all spoken languages convey about the same information in the same time. Which is not all that surprising, it just means that human brains have an optimal range at which they process information.

Further reading: Coupé, Christophe, et al. "Different Languages, Similar Encoding Efficiency: Comparable Information Rates across the Human Communicative Niche." Science Advances. https://doi.org/10.1126/sciadv.aaw2594

◧◩◪
63. druska+bG[view] [source] [discussion] 2026-02-04 18:11:16
>>Oras+qh
According to the announcement blog, Le Chat is powered by the new model as well: https://chat.mistral.ai/chat
◧◩
68. m1el+qK[view] [source] [discussion] 2026-02-04 18:29:05
>>pietz+Dm
I've been using nemotron ASR with my own ported inference, and I'm happy with it:

https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...

https://github.com/m1el/nemotron-asr.cpp https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-g...

◧◩
78. nodja+WP[view] [source] [discussion] 2026-02-04 18:50:47
>>jiehon+1P
There is https://huggingface.co/spaces/hf-audio/open_asr_leaderboard but it hasn't been updated for half a year.
◧◩◪◨
83. zipy12+tS[view] [source] [discussion] 2026-02-04 19:01:47
>>XCSme+LB
I was skeptical upon hearing the figure, but various sources do indeed back it up, and [0] is a pretty interesting paper (old but still relevant; human transcribers haven't changed in accuracy).

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...

◧◩◪◨
84. ashenk+VS[view] [source] [discussion] 2026-02-04 19:03:41
>>sbroth+ho
You can test it yourself for free on https://console.mistral.ai/build/audio/speech-to-text. I tried it on an English-speaking podcast episode, and apart from identifying one host as two different speakers (but only once, for a few sentences at the start), the rest was flawless from what I could see.
◧◩
90. breisa+yU[view] [source] [discussion] 2026-02-04 19:10:54
>>satvik+Hk
Not sure if it's "realtime", but the recently released VibeVoice-ASR from Microsoft does do diarization. https://huggingface.co/microsoft/VibeVoice-ASR
◧◩◪
93. Multic+eW[view] [source] [discussion] 2026-02-04 19:19:59
>>m1el+qK
I'm so amazed to find out just how close we are to the Star Trek voice computer.

I used to use Dragon Dictation to draft my first novel, and had to learn a 'language' to tell the rudimentary engine how to recognize my speech.

And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.

But it can't transcribe any text until I finish recording a file, and only then does it start work, so feedback comes in very slow batches.

And now you've posted this cool solution which streams audio chunks to a model in infinite small pieces, amazing, just amazing.

Now if I can just figure out how to contribute streaming speech-to-text to Handy or something similar, local STT will be a solved problem for me.

[1] https://github.com/cjpais/Handy

◧◩
96. tietje+iX[view] [source] [discussion] 2026-02-04 19:24:59
>>number+mW
There is Handy, an open source project meant to be a desktop tool, but I haven’t installed it yet to see how you pick your model.

Handy – Free open source speech-to-text app https://github.com/cjpais/Handy

◧◩
114. IanCal+od1[view] [source] [discussion] 2026-02-04 20:41:44
>>ccleve+771
If you use something like yt-dlp you can download the audio from the meetings, and you could try things out in Mistral's AI studio.

You could use their API (they have this snippet):

```
curl -X POST "https://api.mistral.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F model="voxtral-mini-latest" \
  -F file=@"your-file.m4a" \
  -F diarize=true \
  -F timestamp_granularities="segment"
```

In the api it took 18s to do a 20m audio file I had lying around where someone is reviewing a product.

There will, I'm sure, be ways of running this locally available soon (if they aren't on Hugging Face right now), but the API is $0.003/min. If it's something like 120 meetings (10 years of monthly ones) at 1hr each, then it's roughly $20. Depending on whether they're 1 or 10 hours (or weekly rather than monthly, or 10 parallel sessions or something), this might be a price you're willing to pay if you get the results back in an afternoon.

edit - their realtime model can be run with vllm, the batch model is not open
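
Here's the same request as a rough Python sketch (using requests rather than their official snippet; the key and file name are placeholders), plus the back-of-envelope cost math:

```
# Rough equivalent of the curl snippet above, using the requests library.
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    data={
        "model": "voxtral-mini-latest",
        "diarize": "true",
        "timestamp_granularities": "segment",
    },
    files={"file": open("your-file.m4a", "rb")},
)
print(resp.json())

# Back-of-envelope cost: 120 one-hour meetings at $0.003/min.
print(120 * 60 * 0.003)  # ~21.6 dollars
```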

◧◩◪
115. IanCal+ge1[view] [source] [discussion] 2026-02-04 20:44:22
>>kamran+Ir
The HF page suggests yes, with vllm.

> We've worked hand-in-hand with the vLLM team to have production-grade support for Voxtral Mini 4B Realtime 2602 with vLLM. Special thanks goes out to Joshua Deng, Yu Luo, Chen Zhang, Nick Hill, Nicolò Lucchesi, Roger Wang, and Cyrus Leung for the amazing work and help on building a production-ready audio streaming and realtime system in vLLM.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

https://docs.vllm.ai/en/latest/serving/openai_compatible_ser...
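
If you go that route, the OpenAI-compatible server should be queryable with the standard openai client. A minimal sketch, assuming vLLM exposes its transcription endpoint for this model; the model id and port below are guesses, check the docs linked above:

```
# Minimal sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes /v1/audio/transcriptions is available for this model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("your-file.m4a", "rb") as f:
    result = client.audio.transcriptions.create(
        model="mistralai/Voxtral-Mini-4B-Realtime-2602",  # assumed model id
        file=f,
    )
print(result.text)
```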

◧◩◪◨
127. m1el+Eo1[view] [source] [discussion] 2026-02-04 21:32:49
>>Multic+eW
you should check out

https://github.com/pipecat-ai/nemotron-january-2026/

discovered through this twitter post:

https://x.com/kwindla/status/2008601717987045382

◧◩◪
130. ethmar+ds1[view] [source] [discussion] 2026-02-04 21:51:56
>>rainco+j61
They've already done the inverse and trimmed non-coding abilities from their language model: https://openai.com/index/introducing-gpt-5-2-codex/. There's already precedent for creating domain-specific models.

I think it's nice to have specialized models for specific tasks that don't try to be generalists. Voxtral Transcribe 2 is already extremely impressive, so imagine how much better it could be if it specialized in specific languages rather than cramming 14 languages into one model.

That said, generalist models definitely have their uses. I do want multilingual transcribing models to exist, I just also think that monolingual models could potentially achieve even better results for that specific language.

148. maxdo+rD1[view] [source] 2026-02-04 22:51:06
>>meetpa+(OP)
https://www.tavus.io/post/sparrow-1-human-level-conversation...

how does it compare to sparrow-1?

◧◩◪
150. Tactic+EG1[view] [source] [discussion] 2026-02-04 23:09:14
>>Oras+qh
> Truly impressive for real-time.

Impressive indeed. Works way better than the speech recognition I first got demo'ed in... 1998? I remember you had to "click" on the mic every time you wanted to speak, and, well, not only was the transcription bad, it was so bad that it'd try to interpret the sound of the click as a word.

It was so bad I told several people not to invest in what was back then a national tech darling:

https://en.wikipedia.org/wiki/Lernout_%26_Hauspie

That turned out to be a massive fraud.

But ...

> I tried speaking in 2 languages at once, and it picked it up correctly.

I'm a native French speaker and I tried with a very simple sentence mixing French and English:

"Pour un pistolet je prefere un red dot mais pour une carabine je prefere un ACOG" (aka "For a pistol I prefer a red dot but for a carbine I prefer an ACOG")

And instead I got this:

"Je prépare un redote, mais pour une carabine, je préfère un ACOG."

"Je prépare un redote ..." doesn't mean anything and it's not at all what I said.

I like it, it's impressive, but literally the first sentence I tried it got the first half entirely wrong.

◧◩
153. asah+EJ1[view] [source] [discussion] 2026-02-04 23:26:15
>>asah+aJ1
(when it was released, adults/press/etc. found SLTS famously incomprehensible and then they realized that the kids didn't understand the lyrics either, and Weird Al nailed it with his classic, Smells Like Nirvana: https://www.google.com/search?q=Smells+Like+Nirvana )
◧◩◪
159. janals+JN1[view] [source] [discussion] 2026-02-04 23:54:43
>>m463+Zk1
Most English speakers would likely understand those even though they don't speak French or Spanish. So it's not necessary to tack on extra languages even if there are loan words.

In general, there is a concept called the "curse of multilinguality":

https://arxiv.org/pdf/1911.02116

◧◩◪◨
170. jnaina+K02[view] [source] [discussion] 2026-02-05 01:33:10
>>Tactic+EG1
I used to sell the Mac Voice Navigator (from Articulate Systems) in the 90s, which was a SCSI-based hardware box that you plugged into a Mac, Mac SE, or Mac II. It used the same L&H speech recognition tech (if I recall correctly) and was billed as the "User Interface" of the future.

Horrible speech recognition rate and very glitchy. Customers hated it, and lots of returns/complaints.

A few years later, L&H went bankrupt. And so did Articulate Systems.

https://applerescueofdenver.com/products-page/macintosh-to-p...

◧◩
172. d4rkp4+i92[view] [source] [discussion] 2026-02-05 02:44:08
>>pietz+Dm
I’m curious about this too. On my M1 Max MacBook I use the Handy app on macOS with Parakeet V3 and I get near instant transcription, accuracy slightly less than slower Whisper models, but that drop is immaterial when talking to CLI coding agents, which is where I find the most use for this.

https://github.com/cjpais/Handy

◧◩
175. rabf+Ca2[view] [source] [discussion] 2026-02-05 02:57:58
>>mijoha+j42
I made this for myself; it might not work on Wayland though, if that's an issue.

https://github.com/rabfulton/Auriscribe

◧◩
180. fittin+Af2[view] [source] [discussion] 2026-02-05 03:39:51
>>fph+SP
Have been using https://github.com/notune/android_transcribe_app and am pretty happy with it. Fully local, fast, and accurate.
◧◩◪◨
197. meatma+uo2[view] [source] [discussion] 2026-02-05 05:16:36
>>draken+6Z1
Do you mean https://huggingface.co/nvidia/nemotron-speech-streaming-en-0... ?
198. owenbr+Pt2[view] [source] 2026-02-05 06:08:52
>>meetpa+(OP)
The other demos didn't work for me, so I made https://github.com/owenbrown/transcribe. It's just a Python script to test the streaming.

Wow, Voxtral is amazing. It will be great when someone stitches this up so an LLM starts thinking, researching for you, before you actually finish talking.

Like, create a conversation partner with sub-0.5-second latency. For example, you ask it a multi-part question and, as soon as you finish talking, it gives you the answer to the first part while it looks up the rest of the answer, then stitches it together so that there's no break.

The 2-3 second latency of existing voice chatbots is a non-starter for most humans.
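
A minimal sketch of that pipelining idea, with hypothetical stand-ins for the streaming transcript and the LLM call (nothing here is from the repo above): restart the answer on every partial transcript instead of waiting for end-of-speech.

```
import asyncio

async def stream_partial_transcripts():
    # Hypothetical stand-in for a realtime STT stream (e.g. Voxtral realtime):
    # yields the transcript as it grows, before the speaker has finished.
    for partial in ["what's the capital of France",
                    "what's the capital of France and how far is it from Lyon"]:
        await asyncio.sleep(0.2)
        yield partial

async def llm_answer(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call.
    await asyncio.sleep(0.5)
    return f"(answer for: {prompt})"

async def main():
    pending = None
    async for partial in stream_partial_transcripts():
        # Restart the answer on every updated partial transcript, so the
        # model is already "thinking" while the user is still talking.
        if pending is not None:
            pending.cancel()
        pending = asyncio.create_task(llm_answer(partial))
    print(await pending)

asyncio.run(main())
```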

200. barrel+Du2[view] [source] 2026-02-05 06:19:05
>>meetpa+(OP)
Very happy with all the Mistral work. I feel like I'm always one release behind them. Last time, when they released Mistral 3, I commented saying how excited I was to try it out. [1]

Well, I'm happy to report I integrated the new Mistral 3 and have been truly astounded by the results. I'm still not a big fan of the model wrt factual information - it seems to be especially confident and especially wrong if left to its own devices - but with http://phrasing.app I do most of the data aggregation myself and just use an LLM to format it. Mistral 3 was a drop-in replacement with 3x the quality (it was already very, very good), a 0% error rate for my use case (an issue with it occasionally going off the rails was entirely solved), and it sticks to my formatting guidelines perfectly (which even gpt-5-pro failed at). Plus it was somehow even cheaper.

I'm using Scribe v2 for STT at the moment, but I'm very excited now to try integrating Voxtral Transcribe. The language support is a little lacking for my use cases, but I can always fall back to Scribe and amortize the cost across languages. I was actually due to work on the transcription side of Phrasing very soon, so I guess look forward to my (hopefully) glowing review on their next HN launch! XD

[1] https://news.ycombinator.com/item?id=46121889#46122612

[go to top]