I hope we find a path to at least fine-tuning medium-sized models for prices that aren't outrageous. Even the tiny corp's tinybox [1] is $15k, and I don't know how much actual work one could get done on it.
If the majority of startups are just "wrappers around OpenAI (et al.)", the reason is pretty obvious.
When I was at Rad AI we managed just fine. We took a big chunk of our seed round and used it to purchase our own cluster, which we set up at Colovore in Santa Clara. We had dozens, not hundreds, of GPUs, and it set us back about half a million dollars.
The one thing I can't stress enough: do not rent these machines. For the cost of renting a machine from AWS for eight months, you can own one of these machines and cover all of the datacenter costs, which makes it essentially "free" from the eight-month mark to the three-year mark. Once we decoupled our training from cloud prices we were able to do a lot more training and research. Maintenance of the machines is surprisingly easy, and they keep their value too, since demand for them is so high.
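To make that break-even concrete, here's a minimal back-of-the-envelope sketch in Python. The rental rate, purchase price, and colo fee below are illustrative assumptions on my part, not Rad AI's actual numbers:

    # Rough rent-vs-buy break-even for an 8-GPU training server.
    # All figures are illustrative assumptions, not real quotes.
    RENT_PER_HOUR = 32.0        # assumed on-demand cloud rate for an 8-GPU instance
    PURCHASE_PRICE = 150_000.0  # assumed cost to buy a comparable server outright
    COLO_PER_MONTH = 2_500.0    # assumed rack space, power, and cooling at a colo
    HOURS_PER_MONTH = 730       # average hours in a month

    def rent_cost(months):
        # Cumulative cost of renting around the clock.
        return RENT_PER_HOUR * HOURS_PER_MONTH * months

    def own_cost(months):
        # Up-front purchase plus ongoing colo fees.
        return PURCHASE_PRICE + COLO_PER_MONTH * months

    month = 1
    while own_cost(month) > rent_cost(month):
        month += 1
    print(f"Owning pulls ahead of renting around month {month}")

With those assumed numbers, owning overtakes renting right around month eight; plug in your own quotes and the shape of the curve stays the same.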
I'd also argue that you don't need H100s to get started. Most of our initial work was done on much cheaper GPUs, with the A100s we purchased reserved for training production models quickly. What you need, and what is far harder to get, are researchers who actually understand the models well enough to improve the models themselves (rather than just compensating with more data and more training). That was what really made the difference for Rad AI.
That said, a lot of other businesses don't want to take on the capex, but they do need to train some models... and those models can't run on just half a million dollars' worth of hardware. In that case, someone else is going to have to do it for you.
It works both ways and there are no absolutes here.
My response was more for the folks the OP mentioned:
> There is almost no universe where a couple of guys in their garage are getting access to 1000+ H100s with a capital cost in the multiple millions.
I'm pointing out that this isn't true. I was the founding engineer at Rad AI; we had four people when we started. We managed to build LLMs that are in production today. If you've had a CT, MRI, or X-ray in the last year, there's a real chance your results were reviewed by the Rad AI models.
My point is simply that people are really overestimating the amount of hardware actually needed, as well as the cost to use that hardware. There absolutely is space for people to jump in and build LLM companies right now, and they don't need to build a datacenter or raise nine figures of funding to do it.
Another absolute. I try not to be so focused on single points of input like that.
From what I can tell, sitting on the other side of the wall (as a GPU provider), there are metric tons of demand from all sides.