Touching the Elephant – TPUs

>>giulio+(OP)
I'm surprised the perspective of China making TPUs at scale in a couple of years is not bigger news. It could be a deadly blow for Google, NVIDIA, and the rest. Combine it with China's nuclear base and labor pool. And the cherry on top, America will train 600k Chinese students as Trump agreed to.

The TPUv4 and TPUv6 docs were stolen by a Chinese national in 2022/2023: https://www.cyberhaven.com/blog/lessons-learned-from-the-goo... https://www.justice.gov/opa/pr/superseding-indictment-charge...

And that's just 1 guy that got caught. Who knows how many other cases were there.

A Chinese startup is already making clusters of TPUs and has revenue https://www.scmp.com/tech/tech-war/article/3334244/ai-start-...

>>giulio+(OP)
The Scaling ML textbook also has an excellent section on TPUs. https://jax-ml.github.io/scaling-book/tpus/

>>deside+Nr
I also enjoyed https://henryhmko.github.io/posts/tpu/tpu.html >>44342977 .

The work that XLA & schedulers are doing here is wildly impressive.

This feels so much drastically harder to work with than Itanium must have been. ~400bit VLIW, across extremely diverse execution units. The workload is different, it's not general purpose, but still awe inspiring to know not just that they built the chip but that the software folks can actually use such a wildly weird beast.

I wish we saw more industry uptake for XLA. Uptakes not bad, per-se: there's a bunch of different hardware it can target! But what amazing secret sauce, it's open source, and it doesn't feel like there's the industry rally behind it it deserves. It feels like Nvidia is only barely beginning to catch up, to dig a new moat, with the just announced Nvidia Tiles. Such huge overlap. Afaik, please correct if wrong, but XLA isn't at present particularly useful at scheduling across machines, is it? https://github.com/openxla/xla

>>jaunty+XC
Thanks for sharing this. I agree w.r.t. XLA. I've been moving to JAX after many years of using torch and XLA is kind of magic. I think torch.compile has quite a lot of catching up to do.

> XLA isn't at present particularly useful at scheduling across machines,

I'm not sure if you mean compiler-based distributed optimizations, but JAX does this with XLA: https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_...

>>jaunty+XC
Side note, just ran into this article that mentions how Amazon is planning to have XLA / JAX support in the future for their Trainium's. https://newsletter.semianalysis.com/p/aws-trainium3-deep-div...

zlacker

Touching the Elephant – TPUs