zlacker

1. preiss+(OP)[view] [source] 2025-10-22 13:10:54
I wonder if "normal" RDIMM ECC would be enough to mitigate most of those radiation bit-flipping issues. If so, it wouldn't really make a difference compared to Earth-based servers, since most enterprise servers use RDIMM ECC too.
replies(1): >>eptcyk+cb
2. eptcyk+cb[view] [source] 2025-10-22 13:58:13
>>preiss+(OP)
You'll get bitflips in places other than RAM. A bitflip in L1 or L3 cache will be propagated to your DIMM and no one will be the wiser.
replies(3): >>zamada+Kc >>LtdJor+bm >>shrubb+uE
3. zamada+Kc[view] [source] [discussion] 2025-10-22 14:04:36
>>eptcyk+cb
I thought server CPUs already handled this? E.g. for Epyc https://moorinsightsstrategy.com/wp-content/uploads/2017/05/...

> Because caches hold the most recent and most relevant data to the current processing, it is critical that this data be accurate. To enable this, AMD has designed EPYC with multiple tiers of cache protection. The level 1 data cache includes SEC-DED ECC, which can detect two-bit errors and correct single-bit errors. Through parity and retry, L1 data cache tag errors and L1 instruction cache errors are automatically corrected. The L2 and L3 caches are extended even further with the ability to correct double errors and detect triple errors.

4. LtdJor+bm[view] [source] [discussion] 2025-10-22 14:47:21
>>eptcyk+cb
Those do ECC already
replies(1): >>ls612+d12
5. shrubb+uE[view] [source] [discussion] 2025-10-22 15:57:10
>>eptcyk+cb
Sun Microsystems famously had this problem with their servers using the UltraSPARC II chips, with cache SRAM that didn’t have ECC. Later versions of their processors had ECC added.
6. ls612+d12[view] [source] [discussion] 2025-10-22 23:12:14
>>LtdJor+bm
What about the registers?
replies(1): >>yencab+J75
7. yencab+J75[view] [source] [discussion] 2025-10-23 22:37:59
>>ls612+d12
What about the ALU/FPU/TPU itself?