zlacker

1. notaha+(OP)[view] [source] 2025-10-22 12:15:40
You've also got the problem of cosmic radiation flipping bits. Your fault-tolerant architecture probably mitigates this with redundancy, with the extra servers again eating into the purported advantages of extra solar power. Dealing with the PITA of single-event upsets is something developers of edge data processing software in space put up with to avoid the latency issues that data clouds in space introduce.
replies(2): >>preiss+1b >>btown+RU
2. preiss+1b[view] [source] 2025-10-22 13:10:54
>>notaha+(OP)
I wonder if "normal" RDIMM ECC would be enough to mitigate most of those radiation bit-flipping issues. If so, it wouldn't really be any different from Earth-based servers, since most enterprise servers use RDIMM ECC too.
replies(1): >>eptcyk+dm
◧◩
3. eptcyk+dm[view] [source] [discussion] 2025-10-22 13:58:13
>>preiss+1b
You'll get bitflips elsewhere besides just in RAM. A bitflip in L1 or L3 cache will be propagated to your DIMM and no one will be the wiser.
replies(3): >>zamada+Ln >>LtdJor+cx >>shrubb+vP
◧◩◪
4. zamada+Ln[view] [source] [discussion] 2025-10-22 14:04:36
>>eptcyk+dm
I thought server CPUs already handled this? E.g. for Epyc https://moorinsightsstrategy.com/wp-content/uploads/2017/05/...

> Because caches hold the most recent and most relevant data to the current processing, it is critical that this data be accurate. To enable this, AMD has designed EPYC with multiple tiers of cache protection. The level 1 data cache includes SEC-DED ECC, which can detect two-bit errors and correct single-bit errors. Through parity and retry, L1 data cache tag errors and L1 instruction cache errors are automatically corrected. The L2 and L3 caches are extended even further with the ability to correct double errors and detect triple errors.
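
For anyone unfamiliar with what "SEC-DED" means in that quote, here is a minimal sketch of the idea using an extended Hamming(8,4) code: 4 data bits, 3 Hamming parity bits, and 1 overall parity bit. This is a toy for illustration only. The word width, the bit layout, and the names `encode` and `decode` are assumptions of this sketch, not how EPYC (or any real cache) implements its ECC.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword (toy layout)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # code[0] is the overall parity bit
    # Data bits go in the non-power-of-two positions 3, 5, 6, 7.
    code[3], code[5], code[6], code[7] = d
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Overall parity makes the total number of 1s even.
    code[0] = sum(code[1:]) & 1
    return sum(bit << i for i, bit in enumerate(code))


def decode(word: int):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = [(word >> i) & 1 for i in range(8)]
    # Syndrome: XOR of the positions holding a 1. Zero for a valid codeword,
    # otherwise the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) & 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd overall parity: treat as a single-bit error and fix it
        # (syndrome 0 here means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped. Detect only.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
print(decode(word))               # (11, 'ok')
print(decode(word ^ 0b00100000))  # one flipped bit:  (11, 'corrected')
print(decode(word ^ 0b00100100))  # two flipped bits: (None, 'uncorrectable')
```

Three or more flips in the same word can be miscorrected by a code like this, which is presumably why the quote calls out the stronger double-correct, triple-detect protection on L2 and L3.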

◧◩◪
5. LtdJor+cx[view] [source] [discussion] 2025-10-22 14:47:21
>>eptcyk+dm
Those do ECC already
replies(1): >>ls612+ec2
◧◩◪
6. shrubb+vP[view] [source] [discussion] 2025-10-22 15:57:10
>>eptcyk+dm
Sun Microsystems famously had this problem with their servers using the UltraSPARC II chips, with cache SRAM that didn’t have ECC. Later versions of their processors had ECC added.
7. btown+RU[view] [source] 2025-10-22 16:17:32
>>notaha+(OP)
In all seriousness, if AI models can handle quantization, they can handle some flipped bits from time to time! There are probably some fascinating papers to be written around how to choose which layers in an LLM architecture could benefit more than others from redundant computation in a high-radiation environment.
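
A rough sketch of why the choice of what to protect might matter, assuming float32 weights (an assumption of this example, and `flip_bit` is a hypothetical helper): the damage from one flipped bit depends heavily on which bit it is.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 31 = sign) of a float32 and return the result."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5
for bit in (0, 22, 23, 30, 31):
    # Low mantissa bits barely move the value; the top exponent bit blows it up.
    print(bit, flip_bit(weight, bit))
```

A flip in the low mantissa bits looks a lot like quantization noise, while a flip in the high exponent bit turns 0.5 into roughly 1.7e38, so "some flipped bits" are not all equal.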
replies(1): >>kibwen+x01
◧◩
8. kibwen+x01[view] [source] [discussion] 2025-10-22 16:44:03
>>btown+RU
Brilliant, to turn up the model temperature we just hinge open the shielding. I call dibs on the patent!
replies(1): >>lawles+Ll1
◧◩◪
9. lawles+Ll1[view] [source] [discussion] 2025-10-22 18:18:36
>>kibwen+x01
Ok, has anyone patented chips with radioactive source glued to them? For "true" randomness.

If not, I want dibs on it.

replies(1): >>DontBr+Kq1
◧◩◪◨
10. DontBr+Kq1[view] [source] [discussion] 2025-10-22 18:42:20
>>lawles+Ll1
https://en.wikipedia.org/wiki/Hardware_random_number_generat...

> and even the nuclear decay (due to practical considerations the latter, as well as the atmospheric noise, is not viable except for fairly restricted applications or online distribution services)

replies(1): >>notaha+GQ1
◧◩◪◨⬒
11. notaha+GQ1[view] [source] [discussion] 2025-10-22 20:52:37
>>DontBr+Kq1
yeah, I think the space weather experts would have fun statistically analysing the single-event-upset RNG :)
◧◩◪◨
12. ls612+ec2[view] [source] [discussion] 2025-10-22 23:12:14
>>LtdJor+cx
What about the registers?
replies(1): >>yencab+Ki5
◧◩◪◨⬒
13. yencab+Ki5[view] [source] [discussion] 2025-10-23 22:37:59
>>ls612+ec2
What about the ALU/FPU/TPU itself?