zlacker

[parent] [thread] 4 comments
1. hdjrud+(OP)[view] [source] 2026-02-03 04:27:51
They can run these things at 100% utilization for 3 years straight? And not burn them out? That's impressive.
replies(2): >>vlovic+O7 >>imtrin+An
2. vlovic+O7[view] [source] 2026-02-03 05:43:30
>>hdjrud+(OP)
Not really. GPUs are stateless, so regardless of how much you use them, their lifetime is bounded (essentially) by the shittiest capacitor on the board. Modulo a design or manufacturing defect, I’d expect a usable lifetime of at least 10 years, well beyond the manufacturer’s desire to support the drivers for it (i.e. the software should “fail” first).
replies(1): >>mike_h+ex
3. imtrin+An[view] [source] 2026-02-03 08:01:46
>>hdjrud+(OP)
I don't see anything impressive here?
4. mike_h+ex[view] [source] [discussion] 2026-02-03 09:15:43
>>vlovic+O7
The silicon itself does wear out. Dopant migration or something; I'm not an expert. Three years is probably too low, but they do die. GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.
replies(1): >>Majrom+5b1
5. Majrom+5b1[view] [source] [discussion] 2026-02-03 13:57:08
>>mike_h+ex
> GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.

The scale there is a little bit different. If you're training an LLM with 10,000 tightly-coupled GPUs where one failure could kill the entire job, then the mean time to failure for the whole job drops by that factor of 10,000. What is a trivial risk in a single-GPU home setup becomes a daily occurrence at that scale.
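Rough sketch of that arithmetic (the 5-year per-GPU MTTF is an assumed figure for illustration, and failures are treated as independent, not anything from the thread):

    # Back-of-the-envelope MTTF scaling. The 5-year per-GPU figure is an
    # assumption for illustration; failures are modeled as independent.
    per_gpu_mttf_hours = 5 * 365 * 24   # ~43,800 hours per GPU
    n_gpus = 10_000                     # tightly-coupled training job

    # With independent failures, the expected time until *some* GPU in the
    # job fails shrinks by roughly the number of GPUs.
    job_mttf_hours = per_gpu_mttf_hours / n_gpus
    print(f"Expected hours between job-killing failures: {job_mttf_hours:.1f}")
    # -> about 4.4 hours, i.e. several interruptions per day at that scale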
