zlacker

1. mike_h+(OP) 2026-02-03 09:15:43
The silicon itself does wear out. Dopant migration or something, I'm not an expert. Three years is probably too low an estimate for the lifespan, but they do die. GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.
replies(1): >>Majrom+RD
2. Majrom+RD 2026-02-03 13:57:08
>>mike_h+(OP)
> GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.

The scale there is a bit different. If you're training an LLM on 10,000 tightly coupled GPUs where a single failure can kill the entire job, then the job's mean time to failure drops by roughly that factor of 10,000 (assuming the GPUs fail independently). What is a trivial risk in a single-GPU home setup becomes a daily occurrence at that scale.
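
To put rough numbers on that scaling, here is a minimal sketch. It assumes GPUs fail independently at a constant rate (an exponential lifetime model), and the 5-year per-GPU mean time to failure is a made-up figure for illustration, not a measured one. The function names are my own.

```python
import math


def job_mttf_hours(gpu_mttf_hours: float, n_gpus: int) -> float:
    """Mean time to first failure for a job that dies when any one GPU dies.

    Assumes independent failures at a constant rate, so the failure
    rates of the individual GPUs simply add up.
    """
    return gpu_mttf_hours / n_gpus


def p_failure_within(hours: float, gpu_mttf_hours: float, n_gpus: int) -> float:
    """Probability that at least one of n_gpus fails within the given window."""
    return 1.0 - math.exp(-hours * n_gpus / gpu_mttf_hours)


# Hypothetical per-GPU MTTF: 5 years, expressed in hours.
GPU_MTTF_HOURS = 5 * 365 * 24

for n in (1, 10_000):
    mttf = job_mttf_hours(GPU_MTTF_HOURS, n)
    p_day = p_failure_within(24, GPU_MTTF_HOURS, n)
    print(f"{n:>6} GPUs: MTTF {mttf:,.2f} h, P(failure within 24 h) = {p_day:.4f}")
```

Under those assumptions the single GPU is expected to last 43,800 hours, while the 10,000-GPU job is expected to hit its first failure after about 4.4 hours. Real fleets have correlated failures and non-constant failure rates, so treat this as an order-of-magnitude picture only.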
