3dfx: So powerful it’s kind of ridiculous

>>BirAda+(OP)
I'm just here to post old 3dfx commercials:

The print ads were similarly incredible:

http://www.x86-secret.com/pics/divers/v56k/histo/1999/commer...

https://www.purepc.pl/files/Image/artykul_zdjecia/2012/3DFX_...

https://fcdn.me/813/97f/3d-pc-accelerators-blow-dryer-ee8eb6...

>>rl3+h9
100 billion operations per second. What are we at now, 100 trillion?

>>echees+lO
We're currently in the age of 4 petaflops on a single card, my friend: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor...

It is quite insane. Getting to use all of them is difficult, but certainly possible with some clever planning. Hopefully as the tech matures we'll see higher and higher utilization rates. (I think we're moving as fast as we were in the '90s in some ways, but the sheer size of the industry hides the absolutely insane rate of progress. Also scale, I suppose.)

I remember George Hotz nearly falling out of his chair, for example, at a project that was running some deep learning computations at 50% of peak GPU efficiency (i.e. used flops vs. possible flops), locally on one GPU, with some other interesting constraints. I hadn't personally realized how hard that apparently is to hit for some things, though I guess it makes sense, as there are few efficient applications that _also_ use every single available computing unit on a GPU.
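To make "used flops vs. possible flops" concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the timing and peak figures in the example are placeholder assumptions, not measurements from any real card.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m x k) @ (k x n) matrix multiply.

    Each of the m*n outputs needs k multiplies and k adds, hence 2*m*n*k.
    """
    return 2 * m * n * k


def flop_utilization(flops_done, elapsed_seconds, peak_flops_per_second):
    """Fraction of the theoretical peak actually achieved (used / possible)."""
    if elapsed_seconds <= 0 or peak_flops_per_second <= 0:
        raise ValueError("elapsed time and peak rate must be positive")
    achieved_flops_per_second = flops_done / elapsed_seconds
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    # Hypothetical numbers: a 4096^3 matmul taking 1 ms on a card whose
    # spec sheet (assumed here) claims 300 teraflops at that precision.
    work = matmul_flops(4096, 4096, 4096)
    print(f"utilization: {flop_utilization(work, 1e-3, 300e12):.1%}")
```

In practice you would count the flops for a whole training step and time it with proper GPU synchronization, but the ratio is the same idea.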

And FP8 should be very usable too in the right circumstances. I'm very much looking forward to using it once proper support gets released for it. :)

>>tysam_+0T
What is the usual range of flop utilization (10-30%?), and is there a resource for learning more about the contributing factors?

>>swyx+FJ2
I've seen anywhere from 20% to 50% on large, fully-GPU-saturating models (that are transformers). And the one that Hotz was reacting to was a tiny CNN (< 10 MB) that still used the GPU pretty efficiently in the end.

I think that's roughly the upper limit. Your contributing factors are going to be:

1. How much can you use tensor cores + normal CUDA cores in parallel? (Likely something influenced by ahead-of-time compilation and methods friendly to parallel execution, I'd guess.)
2. What's the memory format someone is using?
3. What's the dataloader like? Is it all on GPU? Is it bottlenecked? Some sort of complex, involved prefetching madness?
4. How many memory-bound operations are we using? Can we conceivably convert them to large matrix multiplies?
5. How few total kernels can we run these calls in?
6. Are my tensor dimensions multiples of 64 by 64 (if possible), or if that's not really helpful/necessary/feasible, multiples of 8?
7. Can I directly train in lower precision (to avoid the overhead of casting in any kind of way)?
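Points 4 and 6 above can be sketched with a bit of back-of-the-envelope arithmetic. This is only a rough illustration: the helper names are hypothetical, the byte counts assume every operand is read or written exactly once with no cache reuse, and the 2-byte default assumes a 16-bit format.

```python
def round_up_to_multiple(size, multiple=8):
    """Smallest value >= size that is a multiple of `multiple`.

    Handy for padding a dimension (e.g. a vocabulary size) up to a
    multiple of 8 or 64.
    """
    if size < 0 or multiple <= 0:
        raise ValueError("size must be >= 0 and multiple must be > 0")
    return ((size + multiple - 1) // multiple) * multiple


def matmul_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """Flops per byte moved for an (m x k) @ (k x n) matmul.

    Assumes both inputs are read once and the output is written once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_add_arithmetic_intensity(bytes_per_element=2):
    """Flops per byte for c = a + b: one add per two reads and one write."""
    return 1 / (3 * bytes_per_element)


if __name__ == "__main__":
    print(round_up_to_multiple(50257, 64))
    print(matmul_arithmetic_intensity(1024, 1024, 1024))
    print(elementwise_add_arithmetic_intensity())
```

The large matmul does hundreds of flops per byte moved, while the elementwise add does a fraction of one, which is why the latter tends to be limited by memory bandwidth rather than compute.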

That should get you pretty far, off the top of my head. :D
