zlacker

Why would you need to fit the GPUs all in one structure?

You can have a swarm of small, disposable satellites with laser links between them.

>>pantal+(OP)
Because that brings in the whole distributed computing mess. No matter how close to instantaneous the actual link is, you still have to deal with which satellites can see one another, how many simultaneous links each satellite can hold, the maximum throughput, the need for better error correction, and all sorts of other things that will drastically slow the system down even in the best case. Unlike something like Starlink, with GPUs you have to assume that everyone may need to talk to everyone else at the same time while maintaining insane throughput.

If you want to send GPUs up one by one, get ready to equip each satellite with its own fixed mass of everything required to transmit and receive that much data, plus redundant structural/power/compute mass, individual shielding and much more. All the wasted mass you have to launch with individual satellites makes the already nonsensical pricing even worse.

It just makes no sense when you can build a warehouse on the ground, fill it with shoulder-to-shoulder servers that communicate in a simple, sane and well-known way, and repair them on the spot. What's the point?
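To put a rough number on the "everyone talks to everyone" point: a full mesh of n nodes needs n(n-1)/2 pairwise links, while each satellite can only hold a handful of laser links at once. A minimal Python sketch; the satellite count and the four-terminals-per-satellite figure are illustrative assumptions, not numbers from this thread.

```python
def full_mesh_links(n: int) -> int:
    """Number of distinct node pairs among n nodes (one link per pair)."""
    return n * (n - 1) // 2


def direct_link_fraction(n: int, links_per_node: int) -> float:
    """Fraction of all node pairs that can hold a direct link at the same
    time if every node supports at most links_per_node simultaneous links."""
    if n < 2:
        return 1.0
    max_links = n * links_per_node // 2  # each link occupies a terminal on both ends
    return min(1.0, max_links / full_mesh_links(n))


# Hypothetical swarm: 1,000 satellites with 4 laser terminals each (assumed values).
n_sats, terminals = 1000, 4
print(full_mesh_links(n_sats))                  # 499500 pairs
print(direct_link_fraction(n_sats, terminals))  # about 0.004, under 1% of pairs
```

Everything else has to be routed over multiple hops, which is where the scheduling and throughput problems come from.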

replies(2): >>pantal+Bc >>crote+Fq

>>tavave+p3
Starlink already solved those problems; they do 200 Gbit/s via laser between satellites.

And for data centers, the satellites wouldn't be as far apart as Starlink satellites; they would be quite close together instead.

replies(1): >>tavave+BK

>>pantal+(OP)
Because the latencies required for modern AI training are extremely restrictive. A light-nanosecond is famously a foot, and the critical distances have to be kept in that range.
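The foot-per-nanosecond rule of thumb is easy to check. A small Python sketch of one-way propagation delay in vacuum; the 1 km spacing is only an assumed example distance, and real links add serialization and switching delay on top.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum (exact, by SI definition)


def one_way_delay_ns(distance_m: float) -> float:
    """Pure propagation delay over distance_m metres of vacuum, in nanoseconds."""
    return distance_m / C_M_PER_S * 1e9


print(one_way_delay_ns(0.3048))  # one foot: about 1.017 ns
print(one_way_delay_ns(1_000))   # assumed 1 km satellite spacing: about 3336 ns
```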

And a single cluster today would already require more solar and cooling capacity than all Starlink satellites combined.

>>tavave+p3
Isn't this already a major problem for AI clusters?

I vaguely recall an article a while ago about the impact of GPU reliability: a big problem with training is that the entire cluster basically operates in lock-step, with each node needing the data its neighbors calculated during the previous step to proceed. The unfortunate side-effect is that any failure stops the entire hundred-thousand-node cluster from proceeding - as the cluster grows even the tiniest failure rate is going to absolutely ruin your uptime. I think they managed to somehow solve this, but I have absolutely no idea how they managed to do it.
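The scaling effect is simple to sketch: if the job can only proceed while every node is healthy, and nodes fail independently, the chance that all of them survive a given interval is (1 - p)^N. The per-node failure probability below is a made-up illustrative value, and independence is an assumption.

```python
def p_all_nodes_ok(p_fail_per_node: float, n_nodes: int) -> float:
    """Probability that none of n_nodes fails during one interval,
    assuming independent failures with probability p_fail_per_node each."""
    return (1.0 - p_fail_per_node) ** n_nodes


p = 1e-5  # assumed per-node failure probability per interval (illustrative only)
for n in (1_000, 10_000, 100_000):
    # roughly 0.99, 0.905 and 0.368 respectively
    print(n, round(p_all_nodes_ok(p, n), 3))
```

With these assumed numbers, a hundred-thousand-node cluster gets through an interval without any failure only about a third of the time, even though each individual node is 99.999% reliable.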

>>pantal+Bc
No, they didn't. 200 Gb/s is 25 GB/s, so they could feed 1/36th of a single current-gen SXM5 socket. Not even any of the futuristic next-gen stuff. 25 GB/s is less than the bandwidth of one x16 PCIe 3.0 slot (roughly 32 GB/s counting both directions). And that's already assuming the best-case scenario; in reality, trying to sync up GPUs like that would likely have loads of other issues. But even just the sheer amount of inter-GPU bandwidth you need is quite extreme. And this isn't point-to-point routing like Starlink, trying to get data from A to B; this is maintaining a network of interconnected systems that need to communicate chaotically and with uneven demand.
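The arithmetic, spelled out as a quick Python sketch. The 200 Gbit/s figure is the one quoted in this thread; the 900 GB/s NVLink total for an SXM5 (H100) GPU and the PCIe 3.0 x16 figure are assumptions taken from commonly published specs, both counted bidirectionally.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


LASER_GBIT_S = 200              # laser link rate quoted in the thread
NVLINK_SXM5_GBYTE_S = 900       # assumed: total NVLink bandwidth per SXM5 GPU
PCIE3_X16_GBYTE_S = 2 * 15.754  # assumed: ~15.75 GB/s per direction, both directions

laser_gbyte_s = gbit_to_gbyte(LASER_GBIT_S)
print(laser_gbyte_s)                        # 25.0 GB/s
print(NVLINK_SXM5_GBYTE_S / laser_gbyte_s)  # 36.0 links to match one GPU's NVLink
print(laser_gbyte_s / PCIE3_X16_GBYTE_S)    # about 0.79 of one PCIe 3.0 x16 slot
```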