I think that's roughly the upper limit. Your contributing factors are going to be:

1. How much can you use tensor cores and regular CUDA cores in parallel? (Likely something influenced by ahead-of-time compilation and methods friendly to parallel execution, I'd guess.)
2. What memory format are you using? (See the channels_last sketch after this list.)
3. What's the dataloader like? Is it all on GPU? Is it bottlenecked? Some sort of complex, involved prefetching madness? (Sketch below.)
4. How many memory-bound operations are we running? Can we conceivably convert them into large matrix multiplies?
5. How few total kernels can we pack these calls into? (For 4-5, see the fusion sketch below.)
6. Are my tensor dimensions multiples of 64 where possible, or, if that's not helpful/necessary/feasible, at least multiples of 8?
7. Can I train directly in lower precision, to avoid the overhead of casting in any form? (For 6-7, see the last sketch below.)
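
Quick sketch of the memory-format point (2), assuming PyTorch since no framework is named here; the conv sizes are just placeholders. In reduced precision, tensor-core convolutions generally prefer NHWC (channels_last) over the default NCHW layout:

```python
import torch
import torch.nn as nn

# A minimal sketch, assuming PyTorch: switch model and inputs to NHWC
# ("channels_last"), which tends to map better onto tensor-core convs.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
).to(memory_format=torch.channels_last)

# Inputs need the same layout; it then propagates through the model.
x = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)
out = model(x)
```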
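For the dataloader point (3), a rough PyTorch sketch (the dataset, sizes, and worker counts are made up): worker processes + pinned memory + non_blocking copies is the standard way to overlap input prep with GPU compute:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for real data.
ds = TensorDataset(torch.randn(10_000, 3, 32, 32), torch.randint(0, 10, (10_000,)))
loader = DataLoader(
    ds,
    batch_size=256,
    num_workers=4,            # decode/augment off the main process
    pin_memory=True,          # page-locked host memory enables async copies
    prefetch_factor=2,        # each worker keeps 2 batches staged
    persistent_workers=True,  # don't respawn workers every epoch
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for xb, yb in loader:
    # non_blocking only actually overlaps with compute when the source is pinned
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    break  # training step would go here
```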
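For (4)-(5), one way to cut kernel count without hand-writing anything, again assuming PyTorch 2.x: a chain of elementwise ops is memory-bound (each op rereads and rewrites the full tensor), and torch.compile can fuse the chain into a single kernel. The tanh-style activation here is just an illustrative stand-in:

```python
import torch

def gelu_ish(x):
    # Three-plus elementwise ops -> ideally one fused kernel after compilation,
    # so the tensor only makes one round trip through memory.
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x**3)))

compiled = torch.compile(gelu_ish)
x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")
y = compiled(x)
```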
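And for (6)-(7), a toy PyTorch sketch (the vocab rounding and layer sizes are hypothetical): round awkward dimensions up to a multiple of 64 so matmuls map cleanly onto tensor cores, and keep params/activations in bf16 end to end so nothing gets cast per step. Caveat: pure-bf16 training usually wants extra care around optimizer state; this only shows the shapes and dtypes:

```python
import torch
import torch.nn as nn

def pad_to_multiple(n: int, m: int = 64) -> int:
    # Round n up to the next multiple of m (e.g. for vocab or hidden sizes).
    return ((n + m - 1) // m) * m

vocab = pad_to_multiple(50_257)  # -> 50_304, a multiple of 64

device = "cuda" if torch.cuda.is_available() else "cpu"
# Params created directly in bf16: no per-step fp32 <-> bf16 casting.
model = nn.Linear(1024, vocab, device=device, dtype=torch.bfloat16)
x = torch.randn(32, 1024, device=device, dtype=torch.bfloat16)
logits = model(x)  # bf16 matmul on tensor cores, no casts in the step
```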
That should get you pretty far, off the top of my head. :penguin: :D :))))) <3 <3 :fireworks: