zlacker

1. tysam_+(OP)[view] [source] 2023-03-05 15:16:09
We are in the 4 Petaflops on a single card age currently, my friend: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor...

It is quite insane. Now, getting to use all of them is difficult, but certainly possible with some clever planning. Hopefully as the tech matures we'll see higher and higher utilization rates (I think we're moving as fast as we were in the 90's in some ways, but the sheer size of the industry hides the absolutely insane rate of progress. Also, scale, I suppose).

I remember George Hotz nearly falling out of his chair, for example, at a project that was running some deep learning computations at 50% peak GPU efficiency (i.e. used flops vs possible flops) (locally, on one GPU, with some other interesting constraints). I hadn't personally realized how hard that apparently is to hit for some things, though I guess it makes sense, as there are few efficient applications that _also_ use every single available computing unit on a GPU.
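
(For anyone wondering how that "used flops vs possible flops" number is even computed, here's a back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder, not something from that project.)

    # Rough "model flops utilization" estimate: achieved FLOP/s vs. datasheet peak.
    # All numbers are illustrative placeholders, not real measurements.
    params = 125e6            # model parameter count (hypothetical)
    tokens_per_sec = 200_000  # measured training throughput (hypothetical)

    # Common rule of thumb: ~6 FLOPs per parameter per token for a training step
    # (forward + backward pass combined).
    achieved_flops = 6 * params * tokens_per_sec

    peak_flops = 1e15         # whatever the datasheet lists for your precision (placeholder)

    print(f"~{achieved_flops / peak_flops:.0%} of peak")  # ~15% with these made-up numbers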

And FP8 should be very usable too in the right circumstances. I myself am very much looking forward to using it at some point in the future once proper support gets released for it. :)))) :3 :3 :3 :))))
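
(If you want to poke at FP8 numerics before proper kernel support arrives, newer PyTorch builds expose the storage dtype, which is enough to see the quantization error. A toy sketch, assuming a PyTorch version that ships torch.float8_e4m3fn:)

    # Toy look at FP8 (E4M3) round-trip error.
    # Assumes a PyTorch build with the float8_e4m3fn dtype; no FP8 matmul kernels needed.
    import torch

    x = torch.linspace(0.5, 8.0, steps=16)   # FP32 values away from zero
    x_fp8 = x.to(torch.float8_e4m3fn)        # quantize to FP8 storage
    x_back = x_fp8.to(torch.float32)         # dequantize for comparison

    rel_err = ((x - x_back).abs() / x.abs()).max()
    print(f"max relative FP8 round-trip error: {rel_err:.3f}")  # a few percent for E4M3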

replies(2): >>dahart+o5 >>swyx+FQ1
2. dahart+o5[view] [source] 2023-03-05 15:48:15
>>tysam_+(OP)
> We are in the 4 Petaflops on a single card age currently

FP8 is really only useful for machine learning, which is why it is stuck inside tensor cores. FP8 is not useful for graphics, and even FP16 is hard to use for anything general. I’d say 100 Tflops is more accurate as a summary without needing qualification. Calling it “4 petaflops” without saying FP8 in the same sentence could be pretty misleading; I think you should say “4 FP8 petaflops”.

replies(2): >>startu+Hd >>tysam_+zg1
3. startu+Hd[view] [source] [discussion] 2023-03-05 16:36:47
>>dahart+o5
At 1080p, yes, tensor cores are not used. But at 4K the majority of the pixels are filled by tensor cores (DLSS), so these FP8 ops are used.
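
(Quick arithmetic on "majority", assuming DLSS Performance mode, i.e. a 1080p internal render upscaled 2x per axis to 4K:)

    # Fraction of 4K output pixels that come from the upscaler rather than the
    # raster pipeline, assuming DLSS "Performance" mode (1080p internal -> 4K output).
    rendered = 1920 * 1080   # internally rendered pixels
    output = 3840 * 2160     # 4K output pixels

    print(f"{1 - rendered / output:.0%} of output pixels are reconstructed")  # 75%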

Of course the card linked above is a server card, not a desktop or workstation card optimized for rendering.

What is that Megatron chat in the advertisement? Does it refer to a loser, Earth-destroying character from Transformers? Rockfart?

replies(2): >>dahart+dk >>tysam_+Vg1
4. dahart+dk[view] [source] [discussion] 2023-03-05 17:08:44
>>startu+Hd
Oh yeah, excellent point, I should not draw lines between graphics and ML; graphics has seen and will continue to see more and more ML applications. I hope none of my coworkers see this.

I guess Megatron is a language model framework https://developer.nvidia.com/blog/announcing-megatron-for-tr...

5. tysam_+zg1[view] [source] [discussion] 2023-03-05 22:59:21
>>dahart+o5
I did mention it, at the end! That's why I made the qualification; it is an important difference.

Though as the other commenter noted, NVIDIA does like getting their money's worth out of the tensor cores, and FP8 will likely be a large part of what they're doing with it. Crazy stuff. Especially since the temporal domain is so darn exploitable when covering for precision/noise issues -- they seem to be stretching things a lot further than I would have expected.

In any case -- crazy times.

6. tysam_+Vg1[view] [source] [discussion] 2023-03-05 23:01:49
>>startu+Hd
Megatron is a Large Language Model -- unfortunately it seems they really undertrained it for the parameter count it had, so it was more a numbers game of "hey, look how big this model is!" when they first released it.

Many modern models are far more efficient for inference IIRC, though I guess it remains a good exercise in "how much can we fit through this silicon?" engineering. :D
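
(Rough sense of "undertrained", using the later Chinchilla-style rule of thumb of ~20 training tokens per parameter. I'm assuming the model meant here is the 530B Megatron-Turing NLG, and the token count is from memory, so treat the numbers as approximate.)

    # Back-of-the-envelope on "undertrained for the parameter count", using the
    # Chinchilla-style heuristic of ~20 training tokens per parameter.
    params = 530e9            # Megatron-Turing NLG parameter count (assumed model)
    tokens_trained = 270e9    # reported training tokens (from memory, approximate)

    heuristic_tokens = 20 * params
    print(f"trained on ~{tokens_trained / heuristic_tokens:.1%} "
          f"of the heuristic token budget")  # ~2.5%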

7. swyx+FQ1[view] [source] 2023-03-06 04:00:00
>>tysam_+(OP)
what is the usual range of flop utilization (10-30%?) and is there a resource for learning more about the contributing factors?
replies(1): >>tysam_+NY1
8. tysam_+NY1[view] [source] [discussion] 2023-03-06 05:46:59
>>swyx+FQ1
I've seen anywhere from 20%-50% on large, fully-GPU-saturating models (that are transformers). And the one that Hotz was reacting to was a tiny CNN (< 10 MB) that still used the GPU pretty efficiently in the end.

I think that's roughly the upper limit. Your contributing factors are going to be:

1. How much can you use tensor cores + normal CUDA cores in parallel (likely something influenced by ahead-of-time compilation and methods friendly to parallel execution, I'd guess)?
2. What memory format are you using?
3. What's the dataloader like? Is it all on GPU? Is it bottlenecked? Some sort of complex, involved prefetching madness?
4. How many memory-bound operations are we using? Can we conceivably convert them to large matrix multiplies?
5. How few total kernels can we run these calls in?
6. Are my tensor dimensions multiples of 64 (if possible), or if that's not really helpful/necessary/feasible, multiples of 8?
7. Can I directly train in lower precision (to avoid the overhead of casting in any kind of way)?
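
Here's a rough PyTorch-flavored sketch of where a few of those knobs live -- the model sizes and settings are arbitrary, this is not a tuned recipe, just an illustration:

    # Illustrative sketch of a few of the factors above; arbitrary sizes, not a tuned recipe.
    import torch
    import torch.nn as nn

    # 6. Keep dimensions multiples of 64 (or at least 8) so tensor cores tile cleanly.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

    # 5./4. Fewer, bigger kernels: let the compiler fuse the pointwise / memory-bound ops.
    model = torch.compile(model)

    # 3. Dataloader: pinned host memory + async copies so the GPU isn't waiting on input.
    dataset = torch.utils.data.TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    loader = torch.utils.data.DataLoader(dataset, batch_size=512, num_workers=2, pin_memory=True)

    opt = torch.optim.AdamW(model.parameters())
    for x, y in loader:
        x = x.cuda(non_blocking=True)
        y = y.cuda(non_blocking=True)
        # 1./7. Run the matmuls in a lower precision that the tensor cores like
        # (true low-precision training without autocast would go further still).
        with torch.autocast("cuda", dtype=torch.bfloat16):
            loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)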

That should get you pretty far, off the top of my head. :penguin: :D :))))) <3 <3 :fireworks:

replies(1): >>swyx+fg2
9. swyx+fg2[view] [source] [discussion] 2023-03-06 09:06:04
>>tysam_+NY1
that's a pretty dang good head. thank you!!