Radiation hardening:
While there is some state information on the GPU, for ML applications the occasional bit flip isn't that critical. Most of the GPU area can be used as efficiently as before, and only the critical state information on the GPU die or the host CPU needs radiation hardening.
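For concreteness, here is a minimal sketch (my own illustration, not anything from the original design) of the "protect only the critical state" idea: let the GPU-heavy math run unprotected, and guard just the periodic checkpoints of weights/optimizer state with a checksum so a bit flip there gets detected and rolled back. The file layout and function names are hypothetical.

```python
import hashlib, pickle

def save_checkpoint(path: str, state: dict) -> None:
    # Serialize the critical state (weights, optimizer moments, step counter)
    # and prepend a SHA-256 digest so corruption is detectable on load.
    blob = pickle.dumps(state)
    digest = hashlib.sha256(blob).digest()
    with open(path, "wb") as f:
        f.write(digest + blob)

def load_checkpoint(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()
    digest, blob = data[:32], data[32:]   # SHA-256 digests are 32 bytes
    if hashlib.sha256(blob).digest() != digest:
        # A flipped bit in the critical state: fall back to an older good copy
        # instead of silently continuing from corrupted weights.
        raise ValueError("checkpoint corrupted, roll back to previous copy")
    return pickle.loads(blob)
```

Transient bit flips in activations or gradients during the unprotected forward/backward pass just add a little noise, so that part of the silicon stays unhardened and fully utilized.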
Scaling: the didactic, unoptimized 30m x 30m x 90m pyramid would train a 405B model in 17 days and would have 23 TB of RAM (so it could keep training larger and larger state-of-the-art models, just at comparatively slower rates); a rough sanity check of the throughput that implies is sketched below. I'm not sure what's ridiculous about that. At some point, it feels like people piss on didactic examples just because they want somebody to hold their hand and calculate everything for them.
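As a back-of-the-envelope check (my numbers, not the post's derivation): using the common ~6·N·D FLOPs rule of thumb for a training run and assuming a ~15T-token budget for a 405B model, the 17-day figure works out to roughly 2.5e19 FLOP/s of sustained throughput. Both the 6·N·D approximation and the token count are assumptions on my part.

```python
# Back-of-the-envelope only; the token budget is an assumption, not a figure from the post.
params = 405e9                       # 405B parameters
tokens = 15e12                       # assumed ~15T training tokens
train_flops = 6 * params * tokens    # ~6*N*D rule of thumb -> ~3.6e25 FLOPs total
seconds = 17 * 24 * 3600             # 17 days of wall-clock time
sustained = train_flops / seconds    # required sustained throughput
print(f"{sustained:.2e} FLOP/s")     # ~2.5e19 FLOP/s, i.e. ~25 exaFLOP/s sustained
```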