Nvidia H100 GPUs: Supply and Demand

>>tin7in+(OP)
The real gut-punch for this is a reminder how far behind most engineers are in this race. With web 1.0 and web 2.0 at least you could rent a cheap VPS for $10/month and try out some stuff. There is almost no universe where a couple of guys in their garage are getting access to 1000+ H100s with a capital cost in the multiple millions. Even renting at that scale is $4k/hour. That is going to add up quickly.

I hope we find a path to at least fine-tuning medium sized models for prices that aren't outrageous. Even the tiny corp's tinybox [1] is $15k and I don't know how much actual work one could get done on it.

If the majority of startups are just "wrappers around OpenAI (et al.)" the reason is pretty obvious.

1. https://tinygrad.org/

>>zoogen+Lo1
I'd argue that you really don't need 1000+ H100s to test things out and make a viable product.

When I was at Rad AI we managed just fine. We took a big chunk of our seed round and used it to purchase our own cluster, which we setup at Colovore in Santa Clara. We had dozens, not hundreds, of GPUs and it set us back about half a million.

The one thing I can't stress enough- do not rent these machines. For the cost of renting a machine from AWS for 8 months you can own one of these machines and cover all of the datacenter costs- this basically makes it "free" from the eight month to three year mark. Once we decoupled our training from cloud prices we were able to do a lot more training and research. Maintenance of the machines is surprisingly easy, and they keep their value too since there's such a high demand for them.

I'd also argue that you don't need the H100s to get started. Most of our initial work was on much cheaper GPUs, with the A100s we purchased being reserved for training production models rapidly. What you need, and is far harder to get, is researchers who actually understand the models so they can improve the models themselves (rather than just compensating with more data and training). That was what really made the difference for Rad AI.

>>tedivm+ur1
I did choose the 1000+ H100 case as the outlier. But even what you are describing, $500k for dozens of A100s or whatever entry level looks like these days, is a far step away from the $10/month for previous generations. This suggests we will live in a world where VCs have even more power than they did before.

Even if I validate my idea on a RTX 4090, the path to scaling any idea gets expensive fast. 15k to move up to something like a tinybox (probably capable of running 65B model but is it realistic to train or fine-tune 65B model?). Then maybe $100k in cloud costs. Then maybe $500k in research sized cluster. Then $10m+ for enterprise grade. I don't see that kind of ramp happening outside well-financed VC startups.

>>zoogen+1v1
Not every company is OpenAI. What OpenAI is trying to do is solve a generic problem, and that requires their models to be huge. There's a ton of space for specialized models though, and those specialized ones still outperform the more general one. Startups can focus on smaller problems than "solve everything, but with kind of low quality". Solving one specific problem well can bring in a lot of customers.

To put it another way, the $10m+ for enterprise grade just seems wrong to me. It's more like $10m+ for mediocre responses to a lot of things. Rad AI didn't spend $10m on their models, but they absolutely are professional grade and are in use today.

I also think it's important to consider capital costs that are a one time thing, versus long term costs. Once you purchase that $10m cluster you have that forever, not just for a single model, and because of the GPU scarcity right now that cluster isn't losing value nearly as rapidly as most hardware does. If you purchase a $500k cluster, use it for three years, and then sell it for $400k you're really not doing all that bad.

>>tedivm+BA1
That is a decent point, in that it reminds me of a startup that posted on HN a couple of months ago that did background removal from images using AI models. They claimed this was a mature market now where bulk pricing was bringing the cost down to some marginal over the price of compute. I suspect those kinds of models are comparatively small compared to the general intelligence LLMs we are seeing and might reasonably be trainable on 250k clusters. There is likely a universe of low-hanging fruit for those kinds of problems and those who are capable. That is definitely not a market I would want to compete in since once a particular problem is sufficiently solved then it becomes a race to the bottom in cost.

But my (totally amateur and outsider informed) intuition is that the innovative work will still happen at the edge of model size for the next few years. We literally just got the breakthroughs in LLM capabilities around the 30b parameter mark. These capabilities seemed to accelerate with larger models. There appears to be a gulf in the capabilities from 7B to 70B parameter LLMs that makes me not want to bother with LLMs at all unless I can get that higher level performance of the massive models. But even if I did want to play around at 30B or whatever I have to pay 15k-100k.

I think we are just in a weird spot right now where the useful model sizes for a large class of potential applications is at a price point that many engineers will find prohibitively expensive to experiment with on their own.

>>zoogen+7J1
For the first example, I think that was just due to the specific problem being solved. I can tell you there are a ton of places that aren't yet "solved" yet, and that aren't trivial to solve either. One thing we haven't discussed in this conversation is the data itself, and cleaning up that data. Rad AI probably spent more money on staff cleaning up data than they did on model training. This isn't trivial- for medical grade stuff you need physician data scientists to help out, and that field has only really existed since 2018 (which was the first time the title was listed in any job listing). The reason background removal is "mature" is because it's not that hard of a problem and there's a good amount of data out there.

I also think that you're way off on the second point. I'm not saying that to be rude, because it does seem to be a popular opinion. It's just that if you read papers most people publishing aren't using giant clusters. There's a whole field of people who are finding ways to shrink models down. Once we understand the models we can also optimize them. You see this happen in all sorts of fields beyond "general intelligence"- tasks that used to take entire clusters to run can work on your cell phone now. Optimization is important not just because it opens up more people to work on things, but also because it drops down the costs that these big companies are paying.

Lets think about this in another direction. ML models are based off of how the brain is thought to work. The human brain is capable of quite a bit, but it uses very little power: about 10 watts. It is clearly better optimized than ML models are. That means there's a huge gap we still have to fit on efficiency.

zlacker