zlacker

Nvidia H100 GPUs: Supply and Demand

submitted by tin7in+(OP) on 2023-08-01 03:10:55 | 227 points 160 comments
[view article] [source] [go to bottom]

◧◩
10. ukd1+7h[view] [source] [discussion] 2023-08-01 06:19:39
>>holodu+Hg
https://tinygrad.org is trying something around this; currently working on getting AMD GPUs to get on MLPerf. Info on what they're up to / why is mostly here - https://geohot.github.io/blog/jekyll/update/2023/05/24/the-t... - though there are some older interesting bits too.
◧◩
16. anewhn+fl[view] [source] [discussion] 2023-08-01 07:00:20
>>slushh+lg
> Who is going to take the risk of deploying 10,000 AMD GPUs or 10,000 random startup silicon chips? That’s almost a $300 million investment.

Lumi: https://www.lumi-supercomputer.eu/lumis-full-system-architec...

◧◩◪◨
28. spider+8u[view] [source] [discussion] 2023-08-01 08:31:43
>>qumpis+fq
Where did you get that? They just received a shitload of GPUs, and it appears AMD is actively cooperating: https://twitter.com/realGeorgeHotz/status/168616581138659737...
◧◩◪
29. Kirill+lu[view] [source] [discussion] 2023-08-01 08:34:26
>>Tepix+qn
I bet you listen to stuff like http://www.openbsd.org/lyrics.html
◧◩◪◨⬒
35. qumpis+iy[view] [source] [discussion] 2023-08-01 09:12:22
>>spider+8u
I don't remember the exact tweet, but here's one discussion [1]. I guess something changed in the meantime.

[1] https://www.reddit.com/r/Amd/comments/140uct5/geohot_giving_...

◧◩◪◨⬒⬓
59. jorlow+2V[view] [source] [discussion] 2023-08-01 13:02:23
>>qumpis+iy
[1] links to https://github.com/RadeonOpenCompute/ROCm/issues/2198 which has all the context (driver bugs, vowing to stop using AMD, Lisa Su's response that they're committed to fixing this stuff, a comment that it's fixed)
◧◩
69. gravyp+C51[view] [source] [discussion] 2023-08-01 14:02:35
>>nl+I21
(opinions are my own)

https://coral.ai/products/

◧◩
71. sargun+661[view] [source] [discussion] 2023-08-01 14:05:01
>>nl+I21
Google kind of has done this with Coral: https://coral.ai/about-coral/

These TPUs obviously aren't the ones deployed in Google's datacenters. That being said, I'm not sure how practical it would be to deploy TPUs elsewhere.

Also, Amazon's Inferentia gets a fair bit of use in industrial settings. It's just that these Nvidia GPUs offer an amazing breeding ground for research and cutting-edge work.

◧◩
74. tikkun+G81[view] [source] [discussion] 2023-08-01 14:19:52
>>atty+G71
I agree. (I'm the author.) I touched on that briefly here >>36955403. I need help with that research; please email me - my address is in my profile. I had a section on it in early drafts, but didn't feel confident enough and removed it.

It would be good to have more on enterprise companies like Pepsi, BMW, Bentley, and Lowe's, as well as other HPC uses: oil and gas, manufacturing, automotive, weather forecasting.

◧◩◪◨⬒⬓
79. Thaxll+Di1[view] [source] [discussion] 2023-08-01 15:02:18
>>throwa+LJ
FSR is vastly inferior to DLSS, not sure what you're talking about; even XeSS from Intel is better.

As for drivers: https://www.tomshardware.com/news/adrenalin-23-7-2-marks-ret...

◧◩◪◨
87. latchk+Zm1[view] [source] [discussion] 2023-08-01 15:20:37
>>ekianj+tk1
Not in China! =)

https://www.iflscience.com/china-has-started-building-a-wind...

89. zoogen+Lo1[view] [source] 2023-08-01 15:27:47
>>tin7in+(OP)
The real gut-punch here is the reminder of how far behind most engineers are in this race. With web 1.0 and web 2.0, at least you could rent a cheap VPS for $10/month and try stuff out. There is almost no universe where a couple of guys in a garage get access to 1000+ H100s, with a capital cost in the multiple millions. Even renting at that scale is $4k/hour. That is going to add up quickly.

I hope we find a path to at least fine-tuning medium sized models for prices that aren't outrageous. Even the tiny corp's tinybox [1] is $15k and I don't know how much actual work one could get done on it.

If the majority of startups are just "wrappers around OpenAI (et al.)" the reason is pretty obvious.

1. https://tinygrad.org/
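The rental figure above is easy to sanity-check. A rough sketch, assuming ~$4 per H100 GPU-hour (the rate implied by the comment's $4k/hour for 1000 GPUs; actual cloud pricing varies):

```python
# Back-of-envelope rental cost for a 1000-GPU H100 cluster.
gpus = 1000
rate_per_gpu_hour = 4.00           # assumed market rate, USD per GPU-hour
hourly = gpus * rate_per_gpu_hour  # $4,000 per hour, matching the comment
monthly = hourly * 24 * 30         # ~$2.9M per 30-day month
yearly = hourly * 24 * 365         # ~$35M per year
print(f"${hourly:,.0f}/hr, ${monthly / 1e6:.2f}M/mo, ${yearly / 1e6:.1f}M/yr")
```

Even a single month at that scale dwarfs a typical seed round, which is the point being made.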

◧◩
94. tedivm+At1[view] [source] [discussion] 2023-08-01 15:44:31
>>latchk+Sh1
There are datacenters that are specializing in this, and they exist today.

I highly recommend Colovore in Santa Clara. They got purchased by DR not too long ago, but are run independently as far as I can tell. Their team is great, and they have the highest power density per rack out of anyone. I had absolutely no problem setting up a DGX cluster there.

https://www.colovore.com/

◧◩◪
109. slushh+HJ1[view] [source] [discussion] 2023-08-01 16:45:20
>>TechBr+T41
It could be something like i-codes.

It's mentioned about 2 minutes into this video [1].

[1] https://www.youtube.com/watch?v=tJTp-3rtkYQ

◧◩◪
123. zoogen+q32[view] [source] [discussion] 2023-08-01 17:58:56
>>luckyt+FU1
The question is what happens once you want to turn your RTX 4090 setup into a business. It might be cute to generate 10 tokens per second, or whatever you can get with whatever model you have, to delight your family and friends. But once you want to scale that into a genuine product, you're up against the ramp. Even a modest inference rig is going to cost hundreds of thousands of dollars. You have no real way to validate your business model without making a big investment.

Of course, it is the businesses that find a way to make this work that will succeed. It isn't an impossible problem, just a seemingly difficult one for now. That is why I mentioned VC funding as appearing to have more leverage over this market than over previous ones. If you can find someone to foot the $250k+ cost (e.g. AI Grant [1], which offers $250k cash and $350k in cloud compute), then you might have a chance.

1. https://aigrant.org/
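For a rough sense of what that cloud-compute credit buys, assuming ~$4 per H100 GPU-hour (the rate is my assumption, not part of the grant's terms):

```python
# What $350k of cloud credit buys at an assumed $4/H100-hour.
credit = 350_000
rate_per_gpu_hour = 4.00                    # assumed, USD
gpu_hours = credit / rate_per_gpu_hour      # 87,500 GPU-hours total
days_on_64_gpus = gpu_hours / 64 / 24       # runtime if spread over a 64-GPU rig
print(f"{gpu_hours:,.0f} GPU-hours, about {days_on_64_gpus:.0f} days on 64 GPUs")
```

Enough for serious fine-tuning experiments, but nowhere near a frontier-scale pretraining run.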

◧◩◪◨
128. Improb+of2[view] [source] [discussion] 2023-08-01 18:44:57
>>monolo+rO1
They're talking about the meltdown he had on stream [1] (in front of the aforementioned pirate flag), which ended with him saying he'd stop using AMD hardware [2]. He recanted two weeks later, after talking with AMD [3].

Maybe he'll succeed, but this definitely doesn't scream stability to me. I'd be wary of investing money into his ventures (but then I'm not a VC, so what do I know).

[1] https://www.youtube.com/watch?v=Mr0rWJhv9jU

[2] https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...

[3] https://twitter.com/realGeorgeHotz/status/166980346408248934...

◧◩◪◨
136. slushh+im2[view] [source] [discussion] 2023-08-01 19:07:36
>>slushh+HJ1
The company was named ikos: https://en.wikipedia.org/wiki/Hardware_emulation
◧◩◪◨⬒
144. realsl+ys3[view] [source] [discussion] 2023-08-02 00:16:49
>>a_wild+iq1
Grid-scale batteries are basically nonexistent in the US, and they aren't particularly common elsewhere either. In 2016 there was only 160 MW [0] of battery storage available to the grid. Battery prices have come down since then, but not enough for energy storage to make sense for utilities in a lot of cases. Even if capacity has doubled in the past seven years, the person you're responding to would still be asking for something like 3% of available battery capacity nationwide.

As far as other storage methods, they're really cool but water and trains require a lot of space, and flywheels typically aren't well suited for storing energy for long amounts of time. That being said, pumped water is still about 10x more common than batteries right now and flywheels are useful if you want to normalize a peaky supply of electricity.

I'd like to believe we'll see more innovative stuff like you're suggesting, but I think for the time being the regulatory environment is too complicated and the capex is probably too high for anyone outside of the MAMA companies to try something like that right now.

[0] - https://www.energy.gov/policy/articles/deployment-grid-scale...
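The "3% of nationwide capacity" claim above can be sketched out. A back-of-envelope calculation, using the 160 MW figure from [0] and the assumed doubling since 2016:

```python
# Rough check of the "3% of available battery capacity" claim.
capacity_2016_mw = 160
capacity_now_mw = capacity_2016_mw * 2   # assumed: capacity doubled since 2016
request_mw = 0.03 * capacity_now_mw      # 3% of today's assumed capacity
print(f"about {request_mw:.1f} MW: the implied size of the proposed load")
```

So the claim works out to a load on the order of 10 MW, which a single large GPU datacenter can easily exceed.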

◧◩◪◨⬒⬓⬔⧯
154. latchk+nl5[view] [source] [discussion] 2023-08-02 15:41:13
>>rohit8+bl4
He emailed her after he had his meltdown. It wasn't like she saw the meltdown and wrote to him. He is nowhere on her radar.

By the way, I also got a bug in the AMD drivers fixed [0]. That fix let me fully automate the performance tuning of the 150,000 AMD GPUs I was managing - something nobody had done before, and impossible without it. We were doing this by hand before! The only bummer was that I had to upgrade the kernel on 12k+ systems... that took a while.

I went through the proper channels and they fixed it in a week, no need for a public meltdown or email to Lisa crying for help.

[0] https://patchwork.freedesktop.org/patch/470297/?series=99134...

[go to top]