I was wondering - I've been thinking about switching to AI systems programming (I know, easy task), but from what I understand, industry cloud GPUs are the main winners, right? Nobody's going to pay me (assuming I even had the skills) to optimize for consumer GPUs?
From what I understand, it's not just a matter of count + capacity + performance — the core primitives themselves differ. I don't think any of the consumer "Blackwell" chips like the Grace one or the RTX 5090 have, for example, SM pairs in their ISA? And likewise there are similar fundamental differences between consumer and cloud Hopper (where the majority of the perf comes from the cloud one's ISA?)
So I guess I'm wondering if I should buy a GPU myself or should I just rent on the cloud if I wanted to start getting some experience in this field. How do you even get experience in this normally anyways, do you get into really good schools and into their AI labs which have a lot of funding?
Do you mean the coupled dies on stuff like the B200? If so, each NVidia die already has many SMs.
Do you mean TMEM MMA cooperative execution? I'm guessing that must be it given what the paper is about.
cooperative execution yeah
as you can tell I do not do CUDA for a living :D
> So I guess I'm wondering if I should buy a GPU myself or should I just rent on the cloud if I wanted to start getting some experience in this field. How do you even get experience in this normally anyways, do you get into really good schools and into their AI labs which have a lot of funding?
Unless you have money to throw around, you'd better start working on something: write some code and get it running on a leased GPU before deciding on a long-term plan.
People will pay, but probably less — not many companies doing AI at the edge can pay the mega millions.
> And likewise similar fundamental differences between consumer and cloud hopper (where the majority of the perf is the cloud one's ISA?)
I think Hopper was the generation where they did a clean split, and it's datacenter-only.
> So I guess I'm wondering if I should buy a GPU myself or should I just rent on the cloud if I wanted to start getting some experience in this field. How do you even get experience in this normally anyways, do you get into really good schools and into their AI labs which have a lot of funding?
You can do performance work on any system you have, really; it's just that the details change depending on what you're targeting. You can definitely learn the basics on something like a 3060 by following blog posts.
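For a concrete sense of where those blog posts usually start, here's a minimal sketch (kernel and numbers are illustrative, not from any particular post): time a bandwidth-bound SAXPY kernel with CUDA events and compare the achieved GB/s against your card's spec sheet. This runs fine on a 3060.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One fused multiply-add per element: y[i] = a * x[i] + y[i].
// Bandwidth-bound, so it's a good first target for measuring your GPU.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Time the kernel with CUDA events; the interesting part is comparing
    // the achieved bandwidth against the card's theoretical peak.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 3 floats moved per element: read x, read y, write y.
    double gbps = 3.0 * n * sizeof(float) / ms / 1e6;
    printf("%.3f ms, %.1f GB/s\n", ms, gbps);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

From there the usual path is profiling with Nsight Compute and working up to tiled matrix multiplies.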
This isn't really true.
In this case it's specific to NVidia's tensor matrix multiply-add (MMA) instructions, which let the GPU use silicon that would otherwise sit unused at that point.
> Why does publishing papers require the latest and greatest GPUs?
You really do need to test these things on real hardware, and across hardware. When you are doing unexpected things, there are lots of unexpected interaction effects.
As a reminder, the context is "require the latest and greatest GPUs", responding to the parent comment. "General" doesn't mean "you can do this on an Intel Arc GPU" level of general.
That said, my comment could have used a bit more clarity.