zlacker

The Scaling ML textbook also has an excellent section on TPUs. https://jax-ml.github.io/scaling-book/tpus/

replies(1): >>jaunty+ab

>>deside+(OP)
I also enjoyed https://henryhmko.github.io/posts/tpu/tpu.html >>44342977 .

The work that XLA & schedulers are doing here is wildly impressive.

This feels drastically harder to work with than Itanium must have been. ~400-bit VLIW, across extremely diverse execution units. The workload is different, it's not general purpose, but it's still awe-inspiring to know not just that they built the chip but that the software folks can actually use such a wildly weird beast.

I wish we saw more industry uptake for XLA. Uptake's not bad, per se: there's a bunch of different hardware it can target! But what amazing secret sauce, it's open source, and it doesn't feel like it has the industry rally behind it that it deserves. It feels like Nvidia is only barely beginning to catch up, to dig a new moat, with the just-announced Nvidia Tiles. Such huge overlap. Afaik, please correct if wrong, but XLA isn't at present particularly useful at scheduling across machines, is it? https://github.com/openxla/xla

replies(4): >>deside+Sg >>alevsk+wi >>cpgxii+Xl >>jaunty+UR

>>jaunty+ab
Thanks for sharing this. I agree w.r.t. XLA. I've been moving to JAX after many years of using torch and XLA is kind of magic. I think torch.compile has quite a lot of catching up to do.

> XLA isn't at present particularly useful at scheduling across machines,

I'm not sure if you mean compiler-based distributed optimizations, but JAX does this with XLA: https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_...

>>jaunty+ab
I do think it's a lot simpler than the problem Itanium was trying to solve. Neural nets are just way more regular in nature, even with block sparsity, compared to generic consumer pointer-hopping code. I wouldn't call it "easy", but we've found that writing performant NN kernels for a VLIW architecture chip is in practice a lot more straightforward than other architectures.

JAX/XLA does offer some really nice tools for doing automated sharding of models across devices, but for really large performance-optimized models we often handle the comms stuff manually, similar in spirit to MPI.
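
As a rough illustration of what "handling the comms manually, MPI-style" can involve, here is a plain-Python simulation of a ring all-reduce, one of the collectives people commonly hand-place in that style. This is only a sketch under assumptions: the function name `ring_all_reduce` and the list-of-lists stand-in for devices are made up for illustration, and nothing here is the commenter's actual code or a JAX/XLA API.

```python
def ring_all_reduce(shards):
    """Sum equal-length vectors held by n simulated workers arranged in a ring.

    shards[r] is worker r's local vector; its length must be divisible by n.
    Returns the per-worker vectors after the all-reduce (all identical).
    """
    n = len(shards)
    size = len(shards[0]) // n
    data = [list(v) for v in shards]          # copy so the inputs are untouched

    def chunk(c):
        return slice(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so every send in a step is "simultaneous".
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n                  # only talk to the next neighbour
            data[dst][chunk(c)] = [x + y for x, y in zip(data[dst][chunk(c)], payload)]

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][chunk(c)] = payload

    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each worker only ever sends to its next neighbour, which is the kind of explicit placement decision that automated sharding would otherwise make for you.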

replies(1): >>jaunty+Sy

>>jaunty+ab
In Itanium's heyday, the compilers and libraries were pretty good at handling HPC workloads, which is really the closest anyone was running then to modern NN training/inference. The problem with Itanium and its compilers was that people obviously wanted to run workloads that looked nothing like HPC (databases, web servers, etc.) and the architecture and compilers weren't very good at that. There have always been very successful VLIW-style architectures in more specialized domains (graphics, HPC, DSP, now NPU); it just hasn't worked out well for general-purpose processors.

>>alevsk+wi
I agree with regards to the actual work being done by the systolic arrays, which sort of are VLIW-ish & have a predictable, plannable workflow. Not easy, but there's a very direct path to actually executing these NN kernels. The article does an excellent job setting up how great a win it is that the systolic MXUs can do the work, don't need anything but local registers and local communication across cells, and don't need much control.
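
To make the "local registers and local communication" point concrete, here is a minimal plain-Python simulation of a systolic matrix multiply. It assumes an output-stationary dataflow because that is short to write; real MXUs are usually described as weight-stationary, so treat this as a toy and not the TPU's design. The name `systolic_matmul` and the register names are made up for illustration.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells.

    Each cell has one accumulator and only ever reads from its left and
    upper neighbours; operands enter at the left and top edges, skewed in time.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]      # per-cell accumulator register
    a_reg = [[0] * m for _ in range(n)]    # value a cell forwards to the right
    b_reg = [[0] * m for _ in range(n)]    # value a cell forwards downward

    for t in range(n + m + k - 2):         # enough cycles for the last operand pair
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:                 # left edge: row i of A, delayed i cycles
                    s = t - i
                    a_in = A[i][s] if 0 <= s < k else 0
                else:                      # interior: read the left neighbour
                    a_in = a_reg[i][j - 1]
                if i == 0:                 # top edge: column j of B, delayed j cycles
                    s = t - j
                    b_in = B[s][j] if 0 <= s < k else 0
                else:                      # interior: read the upper neighbour
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in   # multiply-accumulate in place
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b        # all cells latch at the cycle boundary
    return acc


C = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])   # [[19, 22], [43, 50]]
```

There is no global control beyond the clock: every cell runs the same multiply-accumulate each cycle, and correctness comes entirely from the timing of the skewed inputs.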

But if you make it 2900 words through this 9000 word document, to the "Sample VLIW Instructions" and "Simplified TPU Instruction Overlay" diagrams, trying to map the VLIW slots ("They contain slots for 2 scalar, 4 vector, 2 matrix, 1 miscellaneous, and 6 immediate instructions") to useful work one can do seems incredibly challenging, given the vast disparity in functionality and style of the attached units that it governs, and given the extreme complexity of keeping that MXU constantly fed, with timing tight enough that it stays well utilized.

> Subsystems operate with different latencies: scalar arithmetic might take single digit cycles, vector arithmetic 10s, and matrix multiplies 100s. DMAs, VMEM loads/stores, FIFO buffer fill/drain, etc. all must be coordinated with precise timing.

Whereas Itanium's compilers needed to pack parallel work into a single instruction, there's maybe less need for that here. But that quote feels like an incredible heart-of-the-machine challenge: writing instruction bundles that feed a variety of systems all at once, when these systems have such drastically different performance profiles / pipeline depths. Truly an awe-some system, IMO.
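
A toy sketch of that bundling problem, in plain Python: a greedy scheduler that issues ready operations into per-cycle bundles, limited by the slot counts quoted above. The latencies are assumptions picked to match the quote's orders of magnitude (scalar in single digits, vector in the 10s, matrix in the 100s), the immediate slots are left out, and the op names and `schedule` function are hypothetical. It is not how XLA or any real TPU compiler schedules.

```python
from collections import Counter

# Slot counts per bundle, taken from the quoted article text.
SLOTS = {"scalar": 2, "vector": 4, "matrix": 2, "misc": 1}
# Assumed latencies in cycles; illustrative only.
LATENCY = {"scalar": 2, "vector": 10, "matrix": 100, "misc": 1}


def schedule(ops):
    """Greedy list scheduling.

    ops maps an op name to (unit, [names it depends on]); the dependency
    graph must be acyclic. Returns a dict of op name -> issue cycle.
    """
    issue = {}
    cycle = 0
    while len(issue) < len(ops):
        used = Counter()                        # slots consumed in this bundle
        for name, (unit, deps) in ops.items():
            if name in issue:
                continue
            # An op is ready once every dependency has issued AND finished.
            ready = all(
                d in issue and issue[d] + LATENCY[ops[d][0]] <= cycle
                for d in deps
            )
            if ready and used[unit] < SLOTS[unit]:
                issue[name] = cycle
                used[unit] += 1
        cycle += 1
    return issue


ops = {
    "addr":   ("scalar", []),
    "load_a": ("misc",   ["addr"]),
    "load_b": ("misc",   ["addr"]),
    "matmul": ("matrix", ["load_a", "load_b"]),
    "bias":   ("vector", ["matmul"]),
    "loop":   ("scalar", []),
}
plan = schedule(ops)
```

With these assumed numbers `matmul` issues at cycle 4 and `bias` cannot issue until cycle 104, so the roughly 100 bundles in between stay empty unless the compiler finds independent work to overlap. That gap is the coordination problem the quote describes.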

Still though, yes: Itanium's software teams did have an incredibly hard challenge finding enough work at compile time to pack into instructions. Maybe it was a harder task. What a marvel modern cores are, having almost a dozen execution units that CPU control can juggle and keep utilized, analyzing incoming instructions on the fly, with deep out-of-order dependency-tracking insight. Trying to figure it all out ahead of time & packing it into the instructions a priori was a wildly hard task.

>>jaunty+ab
Side note, just ran into this article that mentions how Amazon is planning to have XLA / JAX support in the future for their Trainiums. https://newsletter.semianalysis.com/p/aws-trainium3-deep-div...