zlacker

AMD funded a drop-in CUDA implementation built on ROCm: It's now open-source

submitted by mfigui+(OP) on 2024-02-12 14:00:35 | 1045 points 407 comments
[view article] [source]

1. hd4+P1[view] [source] 2024-02-12 14:11:29
>>mfigui+(OP)
https://github.com/vosen/ZLUDA - source
◧◩
9. farisz+S6[view] [source] [discussion] 2024-02-12 14:39:51
>>btown+y6
According to the article, AMD seems to have pulled the plug on this, as they think it will hinder ROCm v6 adoption, which, btw, still only supports two consumer cards out of their entire lineup [1]

1. https://www.phoronix.com/news/AMD-ROCm-6.0-Released

◧◩
11. pella+p7[view] [source] [discussion] 2024-02-12 14:43:03
>>hd4+P1
https://github.com/vosen/ZLUDA/tree/v3
◧◩◪
13. my123+68[view] [source] [discussion] 2024-02-12 14:46:15
>>iforgo+R7
They financed the prior iteration of Zluda: https://github.com/vosen/ZLUDA?tab=readme-ov-file#faq

but then stopped

15. sam_go+M8[view] [source] 2024-02-12 14:50:20
>>mfigui+(OP)
Aside from the latest commit, there has been no activity for almost 3 years (latest code change on Feb 22, 2021).

People are criticizing AMD for dropping this, but it makes sense to stop paying for development when the dev has stopped doing the work, no?

And if he means that AMD stopped paying 3 years ago - well, that was before dinosaurs and ChatGPT, and a lot has changed since then.

https://github.com/vosen/ZLUDA/commits/v3

16. Andrew+i9[view] [source] 2024-02-12 14:53:20
>>mfigui+(OP)
ROCm is not spelled out anywhere in their documentation, and the best answers in search come from GitHub, not AMD's official documents

"Radeon Open Compute Platform"

https://github.com/ROCm/ROCm/issues/1628

And they wonder why they are losing. Branding absolutely matters.

◧◩◪
33. nindal+cb[view] [source] [discussion] 2024-02-12 15:03:09
>>izacus+r9
AMD is betting big on GPUs. They recently released the MI300, which has "2x transistors, 2.4x memory and 1.6x memory bandwidth more than the H100, the top-of-the-line artificial-intelligence chip made by Nvidia" (https://www.economist.com/business/2024/01/31/could-amd-brea...).

They very much plan to compete in this space, and hope to ship $3.5B of these chips in the next year. Small compared to Nvidia's revenues of $59B (includes both consumer and data centre), but AMD hopes to match them. It's too big a market to ignore, and they have the hardware chops to match Nvidia. What they lack is software, and it's unclear if they'll ever figure that out.

◧◩◪
35. rrrix1+mb[view] [source] [discussion] 2024-02-12 15:03:46
>>Espada+x9
This.

    762 changed files with 252,017 additions and 39,027 deletions.
https://github.com/vosen/ZLUDA/commit/1b9ba2b2333746c5e2b05a...
◧◩◪◨⬒
49. anon29+8e[view] [source] [discussion] 2024-02-12 15:16:20
>>rubatu+Rb
You don't think the courts would force the opening of CUDA? Didn't a court already rule that APIs cannot be copyrighted? I believe it was a Google case. As long as no implementation was stolen, the API itself can't be copyrighted.

Here it is: https://arstechnica.com/tech-policy/2021/04/how-the-supreme-...

◧◩
65. sorenj+4h[view] [source] [discussion] 2024-02-12 15:27:51
>>Andrew+i9
Funnily enough, it doesn't work on their RDNA ("Radeon DNA") hardware (with some exceptions, I think) but is aimed at their CDNA (Compute DNA) hardware. If they were to come up with a new name today, it probably wouldn't include Radeon.

AMD seems to be a firm believer in separating the consumer chips for gaming and the compute chips for everything else. This probably makes a lot of sense from a chip design and current business perspective, but I think it's shortsighted and a bad idea. GPUs are very competent compute devices, and basically wasting all that performance for "only" gaming is strange to me. AI and other compute is getting more and more important for things like image and video processing, language models, etc. Not only for regular consumers, but for enthusiasts and developers it makes a lot of sense to be able to use your 10 TFLOPS chip even when you're not gaming.

While reading through the AMD CDNA whitepaper I saw this and got a good chuckle. "culmination of years of effort by AMD" indeed.

> The computational resources offered by the AMD CDNA family are nothing short of astounding. However, the key to heterogeneous computing is a software stack and ecosystem that easily puts these abilities into the hands of software developers and customers. The AMD ROCm 4.0 software stack is the culmination of years of effort by AMD to provide an open, standards-based, low-friction ecosystem that enables productivity creating portable and efficient high-performance applications for both first- and third-party developers.

https://www.amd.com/content/dam/amd/en/documents/instinct-bu...

◧◩◪◨
69. roenxi+ph[view] [source] [discussion] 2024-02-12 15:28:47
>>kkielh+Zb
You've got to remember that AMD are behind at all aspects of this, including documenting their work in an easily digestible way.

"Support" means that the card is actively tested and presumably has some sort of SLA-style push to fix bugs for. As their stack matures, a bunch of cards that don't have official support will work well [0]. I have an unsupported card. There are horrible bugs. But the evidence I've seen is that the card will work better with time even though it is never going to be officially supported. I don't think any of my hardware is officially supported by the manufacturer, but the kernel drivers still work fine.

> Meanwhile CUDA supports anything with Nvidia stamped on it before it's even released...

A lot of older Nvidia cards don't support CUDA v9 [1]. It isn't like everything supports everything, particularly in the early part of building out capability. The impression I'm getting is that in practice the gap in strategy here is not as large as the current state makes it seem.

[0] If anyone has bought an AMD card for their machine to multiply matrices, they've been gambling on whether the capability is there. This comment is reasonable speculation, but I want to caveat the optimism by asserting that I'm not going to put money into AMD compute until there is some actual evidence on the table that GPU lockups are rare.

[1] https://en.wikipedia.org/wiki/CUDA#GPUs_supported
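
For what it's worth, a commonly shared community workaround for nominally unsupported Radeon cards is to tell the ROCm runtime to treat the GPU as a nearby supported target. A minimal sketch, assuming an RDNA 2 card and a hypothetical train.py workload (unsupported and unguaranteed, in the spirit of the gambling caveat above):

    # Report the GPU as gfx1030 (RX 6800/6900 class) so ROCm libraries will load
    HSA_OVERRIDE_GFX_VERSION=10.3.0 python train.py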

◧◩
70. Zopieu+sh[view] [source] [discussion] 2024-02-12 15:28:56
>>sam_go+M8
If only this exact concern was addressed explicitly in the first FAQ at the bottom of the README...

https://github.com/vosen/ZLUDA/tree/v3?tab=readme-ov-file#fa...

◧◩◪
109. slavik+yo[view] [source] [discussion] 2024-02-12 15:58:48
>>sorenj+4h
ROCm works fine on the RDNA cards. On Ubuntu 23.10 and Debian Sid, the system packages for the ROCm math libraries have been built to run on every discrete Vega, RDNA 1, RDNA 2, CDNA 1, and CDNA 2 GPU. I've manually tested dozens of cards and every single one worked. There were just a handful of bugs in a couple of the libraries that could easily be fixed by a motivated individual. https://slerp.xyz/rocm/logs/full/

The system package for HIP on Debian has been stuck on ROCm 5.2 / clang-15 for a while, but once I get it updated to ROCm 5.7 / clang-17, I expect that all discrete RDNA 3 GPUs will work.

◧◩◪◨
115. Certha+bq[view] [source] [discussion] 2024-02-12 16:05:50
>>kkielh+Zb
The most recent "card" is their MI300 line.

It's annoying as hell to you and me that they are not catering to the market of people who want to run stuff on their gaming cards.

But it's not clear it's bad strategy to focus on executing in the high-end first. They have been very successful landing MI300s in the HPC space...

Edit: I just looked it up - 25% of the GPU compute in the current Top500 supercomputers is AMD

https://www.top500.org/statistics/list/

Even though the list has plenty of V100 and A100s which came out (much) earlier. Don't have the data at hand, but I wouldn't be surprised if AMD got more of the Top500 new installations than nVidia in the last two years.

◧◩◪◨
116. Dork12+mq[view] [source] [discussion] 2024-02-12 16:06:41
>>coldte+Sn
Microsoft could do that because they had the operating system monopoly to leverage to take out both Lotus 1-2-3 and WordPerfect. Without the monopoly on the operating system, they wouldn't have been able to Embrace, Extend, Extinguish.

https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguis...

◧◩◪
124. smokel+Yr[view] [source] [discussion] 2024-02-12 16:12:30
>>phh+jb
Compute Unified Device Architecture [1]

[1] https://en.wikipedia.org/wiki/CUDA

◧◩◪◨⬒
166. paulmd+Uz[view] [source] [discussion] 2024-02-12 16:45:40
>>roenxi+ph
All versions of CUDA support PTX, an intermediate bytecode/compiler representation that can be final-compiled even by CUDA 1.0.

So the contract is: as long as your future program does not touch any intrinsics etc. that do not exist in CUDA 1.0, you can export the new program from CUDA 27.0 as PTX, and the GTX 6800 driver will read the PTX and let your GPU run it as CUDA 1.0 code… so it is quite literally just as they describe, unlimited forward and backward compatibility/support as long as you go through PTX in the middle.

https://docs.nvidia.com/cuda/archive/10.1/parallel-thread-ex...

https://en.wikipedia.org/wiki/Parallel_Thread_Execution
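
As a rough sketch of how that contract gets exercised in practice (standard nvcc flags; compute_50 is just an example target):

    # Emit human-readable PTX for inspection
    nvcc -ptx kernel.cu -o kernel.ptx
    # Embed PTX (code=compute_50) instead of final SASS (code=sm_50), so the
    # driver can JIT-compile it for GPUs that didn't exist at build time
    nvcc -gencode arch=compute_50,code=compute_50 kernel.cu -o app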

167. Keyfra+Yz[view] [source] 2024-02-12 16:45:47
>>mfigui+(OP)
This release is, however, a result of AMD stopping its funding - per "After two years of development and some deliberation, AMD decided that there is no business case for running CUDA applications on AMD GPUs. One of the terms of my contract with AMD was that if AMD did not find it fit for further development, I could release it. Which brings us to today." from https://github.com/vosen/ZLUDA?tab=readme-ov-file#faq

So, the same mistake Intel made before.

◧◩◪◨⬒⬓⬔
172. Bizarr+IC[view] [source] [discussion] 2024-02-12 16:57:41
>>chucka+at
That's what happens when your primary business model is selling to the military. The military had to pay what IBM charged them (within a small bit of reason), and it was incredibly difficult for IBM to pivot away from any path they chose in the 80's once they had chosen it.

However, that same logic doesn't apply to consumers, and since they kept failing to learn that lesson, IBM now doesn't even target the consumer market: they never learned how to be competitive and could only ever function effectively when they had a monopoly, or at least vendor lock-in.

https://en.wikipedia.org/wiki/Acquisition_of_the_IBM_PC_busi...

◧◩
173. leeoni+0D[view] [source] [discussion] 2024-02-12 16:58:57
>>enonim+Rk
fertile soil for Alyssa and Asahi Lina :)

https://rosenzweig.io/

https://vt.social/@lina

◧◩◪◨⬒⬓
174. roboca+bD[view] [source] [discussion] 2024-02-12 16:59:38
>>p_l+ys
> betting on the wrong horse (OS/2)

Ahhhh, your hindsight is well developed. I would be interested to know the background on the reasons why Lotus made that bet. We can't know the counterfactual, but Lotus delivering on a platform owned by their deadly competitor Microsoft would seem to me to be a clearly worrisome idea to Lotus at the time. Turned out it was an existentially bad idea. Did Lotus fear Microsoft? "DOS ain't done till Lotus won't run" is a myth[1] for a reason. Edit: DR-DOS errors[2] were one reason Lotus might fear Microsoft. We can just imagine a narrative of a different timeline where Lotus delivered on Windows but did some things differently to beat Excel. I agree, Lotus made other mistakes and Microsoft made some great decisions, but the point remains.

We can also suspect that AMD face a similar fork in the road now. Depending on Nvidia/CUDA may be a similar choice for AMD - fail if they do and fail if they don't.

[1] http://www.proudlyserving.com/archives/2005/08/dos_aint_done...

[2] https://www.theregister.com/1999/11/05/how_ms_played_the_inc...

179. swozey+zE[view] [source] 2024-02-12 17:05:02
>>mfigui+(OP)
I may have missed it in the article, but this post would mean absolutely nothing to me except for the fact that last week I got into stable diffusion, so I'm crushing my 4090 with pytorch and deepspeed etc. and dealing with a lot of nvidia CTK/SDK stuff. Well, I'm actually trying to do this in Windows w/ WSL2 and deepspeed/torch/etc in containers, and it's completely broken, so not crushing currently.

I guess a while ago it was found that Nvidia was bypassing the kernel's GPL license check for drivers, and I read that kernel 6.6 was going to lock that driver out if they didn't fix it. From what I've read there's been no reply or anything done by nvidia yet - which I think I probably just can't find.

Am I wrong about that part?

We're on kernel 6.7.4 now and I'm still using the same drivers. Did it get pushed back, did nvidia fix it?

Also, while trying to find answers myself I came across this 21-year-old post, which is pretty funny and very apt for the topic: https://linux-kernel.vger.kernel.narkive.com/eVHsVP1e/why-is...

I'm seeing conflicting info all over the place so I'm not really sure what the status of this GPL nvidia driver block thing is.

◧◩◪◨
190. swozey+sJ[view] [source] [discussion] 2024-02-12 17:27:07
>>bick_n+iy
I just went through this this weekend - if you're running in Windows and want to use DeepSpeed, you still have to use CUDA 12.1, because DeepSpeed 0.13.1 is the latest that works with 12.1. There's no DeepSpeed for Windows that works with 12.3.

I tried to get it working this weekend, but it was a huge PITA, so I switched to putting everything into WSL2, with Arch on there and pytorch etc. in containers, so I could flip versions easily now that I know how SPECIFIC the versions are to one another.

I'm still working on that part; halfway into it my WSL2 completely broke and I had to reinstall Windows. I'm scared to mount the vhdx right now. ALL of my work and ALL of my documentation are inside the WSL2 Arch Linux and NOT on my Windows machine. I have EVERYTHING I need to quickly put another server up (dotfiles, configs) sitting in a chezmoi git repo ON THE VM - which I only committed once, at init, like 5 mins into everything. THAT was a learning experience. Now I have no idea if I should follow the "best practice" of keeping projects in WSL or have WSL reach out to Windows; there's a performance drop. The 9p networking stopped working, and no matter what I reinstalled, reset, removed features, reset Windows, etc., it wouldn't start. But at least I have that WSL2 .vhdx image that will hopefully mount and start. And probably break WSL2 again. I even SPECIFICALLY took backups of the image as tarballs every hour in case I broke LINUX, not WSL.

If anyone has done SD containers in WSL2 already, let me know. I've tried to use WSL for dev work (I use OS X) like this 2-3 times in the last 4-5 years, and I always run into some catastrophically broken thing that makes my WSL stop working. I hadn't used it in years, so I hoped it was super reliable by now. This is on 3 different desktops with completely different hardware, etc. I was terrified it would break this weekend and IT DID. At least I can be back up in Windows in 20 minutes thanks to Chocolatey and chezmoi. Wiped out my entire gaming desktop.

Sorry I'm venting now this was my entire weekend.

This repo is from a DeepSpeed contributor (IIRC) and lists the requirements for DeepSpeed + Windows that mention the version matches:

https://github.com/S95Sedan/Deepspeed-Windows

> conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

It may sound weird to do any of this in Windows, or maybe not, but if it does just remember that it's a lot of gamers like me with 4090s who just want to learn ML stuff as a hobby. I have absolutely no idea what I'm doing but thank god I know containers and linux like the back of my hand.

◧◩◪◨⬒⬓⬔
193. paulmd+EK[view] [source] [discussion] 2024-02-12 17:33:06
>>modele+tD
Not without breaking the support contract? If you change the PTX format then CUDA 1.0 machines can no longer run it, and it's no longer PTX.

Again, you are missing the point. Java is both a language (Java source) and a machine (the JVM). The latter is a hardware ISA - there are processors that implement Java bytecode as their ISA format. Yet most people who are running Java are not doing so on Java-machine hardware, even though they are using the Java ISA in the process.

https://en.wikipedia.org/wiki/Java_processor

https://en.wikipedia.org/wiki/Bytecode#Execution

any bytecode is an ISA: the bytecode spec defines the machine, and you can physically build a machine that executes the bytecode directly. Or you can translate via an intermediate layer, like how Transmeta Crusoe processors executed x86 as bytecode on a VLIW processor (and how most modern x86 processors actually use RISC micro-ops inside).

these are completely fungible concepts. They are not quite the same thing, but bytecode is clearly an ISA in itself. Any given processor can choose to use a particular bytecode as either an ISA or translate it to its native representation, and this includes PTX, Java, and x86 (among all other bytecodes). And you can do the same for any other ISA (x86 as bytecode representation, etc).

furthermore, what most people think of as "ISAs" aren't necessarily so. For example RDNA2 is an ISA family - different processors have different capabilities (for example the 5500XT has mesh shader support while the 5700XT does not), and the APUs use a still different ISA internally, etc. GFX1101 is not the same ISA as GFX1103, and so on. These are properly implementations, not ISAs - or, if you consider each to be an ISA, then there is also a meta-ISA encompassing larger groups (which also applies to x86's numerous variations). But people casually throw it all into the "ISA" bucket and it leads to this imprecision.

like many things in computing, it's all a matter of perspective/position. where is the boundary between "CMT core within a 2-thread module that shares a front-end" and "SMT thread within a core with an ALU pinned to one particular thread"? It's a matter of perspective. Where is the boundary of "software" vs "hardware" when virtually every "software" implementation uses fixed-function accelerator units and every fixed-function accelerator unit is running a control program that defines a flow of execution and has schedulers/scoreboards multiplexing the execution unit across arbitrary data flows? It's a matter of perspective.

◧◩◪◨⬒⬓
212. zozbot+nQ[view] [source] [discussion] 2024-02-12 17:57:38
>>jchw+jN
There is already a work-in-progress implementation of HIP on top of OpenCL https://github.com/CHIP-SPV/chipStar and the Mesa RustiCL folks are quite interested in getting that to run on top of Vulkan.

(To be clear, HIP is about converting CUDA source code, not running CUDA-compiled binaries - but the ZLUDA project discussed in the OP relies heavily on it.)
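
For anyone unfamiliar with the source-level route, a minimal sketch (hipify-perl and hipcc ship with ROCm; vector_add.cu is a hypothetical input file):

    # Rewrite CUDA API calls at the source level (cudaMalloc -> hipMalloc, etc.)
    hipify-perl vector_add.cu > vector_add.hip.cpp
    # Compile the translated source for an AMD GPU
    hipcc vector_add.hip.cpp -o vector_add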

◧◩◪◨⬒⬓⬔
217. 7spete+HS[view] [source] [discussion] 2024-02-12 18:09:44
>>justin+Mi
I'm not an expert like you would find here on HN - I am only really a tinkerer and learner, amateur at best - but I think Intel's compute is very promising on Alchemist. The A770 beats out the 4060 Ti 16GB in video rendering via DaVinci Resolve and Adobe, and has AV1 support in the free DaVinci Resolve while Lovelace only has AV1 support in Studio. Then for AI, the A770 has had a good showing in stable diffusion against Nvidia's midrange Lovelace since the summer: https://www.tomshardware.com/news/stable-diffusion-for-intel...

The big issue for Intel is pretty similar to that of AMD: everything is made for CUDA, and Intel has to either build their own solutions or convince people to build support for Intel. While I'm working on learning AI and plan to use an Nvidia card, the progress Intel has made in the couple of years since introducing their first GPU to market has been pretty wild, and I think it should really give AMD pause.

◧◩◪◨
219. ethbr1+rT[view] [source] [discussion] 2024-02-12 18:13:20
>>jvande+NO
> These were precisely the arguments for 'x86 will entrench Intel for all time', and we've seen AMD succeed at that game just fine.

... after a couple decades of legal proceedings and a looming FTC monopoly case convinced Intel to throw in the towel, cross-license, and compete more fairly with AMD.

https://jolt.law.harvard.edu/digest/intel-and-amd-settlement

AMD didn't just magically do it on its own.

◧◩
220. bntyhn+xT[view] [source] [discussion] 2024-02-12 18:13:52
>>Cu3PO4+0r
I would love to have a native stable diffusion experience; my RX 580 takes 30s to generate a single image. But it does work after following https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...

I got this up and running on my windows machine in short order and I don't even know what stable diffusion is.

But again, it would be nice to have first class support to locally participate in the fun.

◧◩◪◨⬒⬓⬔
234. swozey+TZ[view] [source] [discussion] 2024-02-12 18:45:01
>>katbyt+DP
What are you running for audio? PipeWire+JACK, PipeWire, JACK2, PulseAudio? I wonder if it's from latency. PulseAudio is the most common, but those of us who do any audio engineering or play guitar etc. with our machines all use the JACK protocol for lower latency.

https://linuxmusicians.com/viewtopic.php?t=25556

Could be completely unrelated though; RDP sessions can definitely act up, get audio out of sync, etc. I try to never pass through RDP audio - it's not even enabled by default in the mstsc client IIRC, but that may just be a "probably a server" thing.

◧◩◪◨⬒⬓⬔
256. latchk+f81[view] [source] [discussion] 2024-02-12 19:28:48
>>beebee+U21
I probably can't comment on that, but what I can comment on is this:

H100s are hard to get. Nearly impossible. CoreWeave and others have scooped them all up for the foreseeable future. So if you are looking at price as the only factor, it becomes somewhat irrelevant if you can't even buy them [0]. I don't really understand the focus on price because of this fact.

Even if you do manage to score yourself some H100s, you also need to factor in the networking between nodes. IB (InfiniBand), made by Mellanox, is owned by NVIDIA. Lead times on that equipment are 50+ weeks. Again, price becomes irrelevant if you can't even network your boxes together.

As someone building a business around MI300x (and future products), I don't care that much about price [!]. We know going in that this is a super capital intensive business and have secured the backing to support that. It is one of those things where "if you have to ask, you can't afford it."

We buy cards by the chassis; it is one price. I actually don't know the exact prices of the cards (but I can infer them). A lot of it is about who you know and what you're doing. You buy more chassis, you get better pricing. Azure is probably paying half of what I'm paying [1]. But I'd also say that, from what I've seen so far, their chassis aren't nearly as nice as mine. I have dual 9754s, 2x bonded 400G, 3TB RAM, and 122TB NVMe... plus the 8x MI300x. These are top of the top. They have Intel and I don't know what else inside.

[!] Before you harp on me, of course I care about price... but at the end of the day, it isn't what I'm focused on today so much as investing all of the capex/opex I can get my hands on into building a sustainable business that provides as much value as possible to our customers.

[0] https://www.tomshardware.com/news/tsmc-shortage-of-nvidias-a...

[1] https://www.techradar.com/pro/instincts-are-massively-cheape...

◧◩
261. westur+Zc1[view] [source] [discussion] 2024-02-12 19:49:28
>>Cu3PO4+0r
> Proton+DXVK for Linux gaming

"Building the DirectX shader compiler better than Microsoft?" (2024) >>39324800

E.g. llama.cpp already supports hipBLAS; is there an advantage to this ROCm CUDA-compatibility layer - ZLUDA on Radeon (and not yet Intel OneAPI) - instead or in addition? https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#hi... >>38588573
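
Per the llama.cpp README at the time of writing, the hipBLAS path is a build-time switch (flag names may change between releases):

    make LLAMA_HIPBLAS=1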

What parts of CUDA's unportability can't WebGPU abstract away? >>38527552

◧◩◪◨
263. stonog+7e1[view] [source] [discussion] 2024-02-12 19:54:05
>>slavik+yo
It doesn't matter to my lab whether it technically runs. According to https://rocm.docs.amd.com/projects/install-on-linux/en/lates... it only supports three commercially-available Radeon cards (and four available Radeon Pro) on Linux. Contrast this to CUDA, which supports literally every nVIDIA card in the building, including the crappy NVS series and weirdo laptop GPUs, and it basically becomes impossible to convince anyone to develop for ROCm.
275. Farfig+po1[view] [source] 2024-02-12 20:40:19
>>mfigui+(OP)
Phoronix Article from earlier(1):

"While AMD ships pre-built ROCm/HIP stacks for the major enterprise Linux distributions, if you are using not one of them or just want to be adventurous and compile your own stack for building HIP programs for running on AMD GPUs, one of the AMD Linux developers has written a how-to guide. "(1)

(1)

"Building An AMD HIP Stack From Upstream Open-Source Code

Written by Michael Larabel in Radeon on 9 February 2024 at 06:45 AM EST."

https://www.phoronix.com/news/Building-Upstream-HIP-Stack

◧◩
281. JonChe+ys1[view] [source] [discussion] 2024-02-12 21:00:00
>>Farfig+po1
Hähnle is one of our best; that'll be solid. http://nhaehnle.blogspot.com/2024/02/building-hip-environmen.... Looks pretty similar to how I build it.

Side point: there's a driver in your Linux kernel already that'll probably work. The driver that ships with ROCm is a newer version of the same and might be worth building via dkms.

Very strange that the ROCm GitHub doesn't have build scripts, but whatever - I've been trying to get people to publish those for almost five years now and it just doesn't seem to be feasible.
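
For reference, on a Debian-style system with AMD's package repository configured, that dkms route looks roughly like this (package and installer names are from AMD's install docs and may change between releases):

    # Out-of-tree amdgpu kernel module, rebuilt against the running kernel via dkms
    sudo apt install amdgpu-dkms
    # or via AMD's installer wrapper
    sudo amdgpu-install --usecase=dkms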

◧◩
308. rekado+2N1[view] [source] [discussion] 2024-02-12 22:55:23
>>Farfig+po1
You can also install HIP/ROCm via Guix:

https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix...

> AMD has just contributed 100+ Guix packages adding several versions of the whole HIP and ROCm stack

◧◩◪
313. Eisens+2P1[view] [source] [discussion] 2024-02-12 23:08:08
>>tester+lF1
The 'crazy' decision is them slowly abandoning the PC gaming market, which is where consumers get these cards, and focusing on the 'client' market to sell their 'Instinct' datacenter/AI cards. I think the parent you are responding to isn't questioning it as a 'make money now' profit decision, but pointing out that it is a bad 'get people to use your system' decision.

"AMD’s client segment, mostly chips for PCs and laptops, rose 62% year over year to $1.46 billion in sales, thanks to recent chip launches.

Sales in AMD’s gaming segment, which includes “semi-custom” processors for Microsoft Xbox and Sony PlayStation consoles, fell 17%. "

* https://www.cnbc.com/2024/01/30/amd-earnings-report-q4-2024....

◧◩◪
314. Farfig+6P1[view] [source] [discussion] 2024-02-12 23:08:35
>>JonChe+ys1
From the Phoronix comments section of the Article that I linked to:

https://www.phoronix.com/forums/forum/linux-graphics-x-org-d...

And I'm on Linux Mint 21.3, so how do I get any installation script to think that Mint is Ubuntu so that maybe works there? There's no how-to for Mint like the one that AMD provides for Ubuntu! And really, that's compiled by AMD for a specific Linux kernel, so no DKMS sort of methods there AFAIK! But I'm no Linux expert and just want some one-click install, or for it to ship with the distro already working, so that Blender 3D's iGPU/dGPU-accelerated Cycles rendering is possible on AMD Radeon consumer GPUs.

◧◩
327. Const-+BZ1[view] [source] [discussion] 2024-02-13 00:08:55
>>codedo+pF1
I did that a few times with Direct3D 11 compute shaders. Here's an open-source example: https://github.com/Const-me/Cgml

Pretty sure Vulkan is gonna work equally well; at the very least there's the open-source DXVK project, which implements D3D11 on top of Vulkan.

◧◩◪◨
331. leeoni+462[view] [source] [discussion] 2024-02-13 00:53:42
>>smcl+F61
pretty sure she's a vt.social admin, so she can always do what jwz does with HN referer headers :D

given how omnipresent she is with her live streaming, it's a bit like South Park's Worldwide Privacy Tour: https://www.youtube.com/watch?v=2N8_5LDkZwY

◧◩◪◨⬒
337. paulmd+I92[view] [source] [discussion] 2024-02-13 01:20:11
>>Keyfra+sN1
https://en.wikipedia.org/wiki/Project_Denver#History
◧◩◪◨
347. doctor+Tg2[view] [source] [discussion] 2024-02-13 02:24:01
>>Cu3PO4+hW1
> Installing Invoke from PyPi... To me, your pyproject.toml looks perfectly sane, so I wasn't sure how to go about fixing the problem.

You can't install the PyTorch that's best for the currently running platform using a pyproject.toml with a setuptools backend, for starters. Invoke would have to author a setup.py that deals with all the issues, in a way that is compatible with build isolation.
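
To illustrate the point: the usual workaround is picking the platform-specific wheel index by hand, which a static pyproject.toml can't express (index URLs follow PyTorch's install matrix; the version tags drift over time):

    # ROCm build of PyTorch
    pip install torch --index-url https://download.pytorch.org/whl/rocm5.7
    # CUDA 12.1 build
    pip install torch --index-url https://download.pytorch.org/whl/cu121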

> The majority of my struggle would have been solved by a recent working Docker image containing a working setup. (The one on Docker Hub is 9 months old.)

Why? Given the state of the ecosystem, what guarantee is there really that the documentation for Docker Desktop with AMD ROCm device binding is going to actually work for your device? (https://rocm.docs.amd.com/projects/MIVisionX/en/latest/docke...)

There is a lot of ad-hoc reinvention of tooling in this space.

354. throwa+Yq2[view] [source] 2024-02-13 03:56:21
>>mfigui+(OP)
From the same repo, I found this excellent, well-written architecture document: https://github.com/vosen/ZLUDA/blob/master/ARCHITECTURE.md

I love the direct, "no bullshit" style of writing.

Some gems:

> Anyone familiar with C++ will instantly understand that compiling it is a complicated affair.

> Additionally CUDA allows, to a large degree, mixing CPU code and GPU code. What does all this complexity mean for ZLUDA? Absolutely nothing

> Since an application can dynamically link to either Driver API or Runtime API, it would seem that ZLUDA needs to provide both. In reality very few applications dynamically link to Runtime API. For the vast majority of applications it's sufficient to provide Driver API for dynamic (runtime) linking.

◧◩
356. sorenj+br2[view] [source] [discussion] 2024-02-13 03:57:37
>>codedo+pF1
ncnn uses Vulkan for GPU acceleration; I've seen it used in a few projects to get AMD hardware support.

https://github.com/Tencent/ncnn
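
Enabling the Vulkan path is a CMake option in ncnn's build, per its docs (a sketch; assumes the Vulkan SDK is installed):

    cmake -DNCNN_VULKAN=ON ..
    make -j$(nproc)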

◧◩
360. 0xDEAD+Rw2[view] [source] [discussion] 2024-02-13 04:47:17
>>codedo+pF1
there's a pretty cool Vulkan LLM engine here for example:

https://github.com/mlc-ai/mlc-llm

◧◩◪◨⬒⬓
368. squigz+MF2[view] [source] [discussion] 2024-02-13 06:24:43
>>squigz+OE2
Found some... Seems crazy to me; this community has never felt transphobic to me... >>36226845
◧◩◪◨⬒
394. Cu3PO4+EI4[view] [source] [discussion] 2024-02-13 21:17:54
>>sophro+Zd2
As promised in my other comment, I did send a PR! https://github.com/invoke-ai/InvokeAI/pull/5714
◧◩◪◨
395. westur+kV4[view] [source] [discussion] 2024-02-13 22:31:23
>>HarHar+Bc4
"CUDNN API supported by HIP" has a coverage table: https://rocm.docs.amd.com/projects/HIPIFY/en/amd-staging/tab...

ROCm/hipDNN wraps cuDNN on Nvidia and MIOpen on AMD, but hasn't been updated in a while: https://github.com/ROCm/hipDNN

>>37808036 : conda-forge has various BLAS implementations, including MKL-optimized BLAS, and compatible NumPy and SciPy builds.

BLAS: Basic Linear Algebra Subprograms: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...

"Using CuPy on AMD GPU (experimental)" https://docs.cupy.dev/en/v13.0.0/install.html#using-cupy-on-... :

  $ sudo apt install hipblas hipsparse rocsparse rocrand rocthrust rocsolver rocfft hipcub rocprim rccl
◧◩◪◨
397. westur+af5[view] [source] [discussion] 2024-02-14 00:53:26
>>Cu3PO4+hW1
> AMD's ROCm OCI base images,

ROCm docs > "Install ROCm Docker containers" > Base Image: https://rocm.docs.amd.com/projects/install-on-linux/en/lates... links to ROCm/ROCm-docker: https://github.com/ROCm/ROCm-docker which is the source of docker.io/rocm/rocm-terminal: https://hub.docker.com/r/rocm/rocm-terminal :

  docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/rocm-terminal
ROCm docs > "Docker image support matrix": https://rocm.docs.amd.com/projects/install-on-linux/en/lates...

ROCm/ROCm-docker//dev/Dockerfile-centos-7-complete: https://github.com/ROCm/ROCm-docker/blob/master/dev/Dockerfi...

Bazzite is a ublue (Universal Blue) fork of the Fedora Kinoite (KDE) or Fedora Silverblue (Gnome) rpm-ostree Linux distributions; ublue-os/bazzite//Containerfile : https://github.com/ublue-os/bazzite/blob/main/Containerfile#... has, in addition to fan and power controls, automatic updates on desktop, supergfxctl, system76-scheduler, and an fsync kernel:

  rpm-ostree install rocm-hip \
        rocm-opencl \
        rocm-clinfo
But it's not `rpm-ostree install --apply-live` because it's a Containerfile.

To install a ublue-os distro, you install any of the Fedora ostree distros: {Silverblue, Kinoite, Sway Atomic, or Budgie Atomic} from e.g. a USB stick and then `rpm-ostree rebase <OCI_host_image_url>`:

  rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite:stable
  rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite-nvidia:stable
  rpm-ostree rebase ostree-image-signed:
ublue-os/config//build/ublue-os-just/40-nvidia.just defines the `ujust configure-nvidia` and `ujust toggle-nvk` commands: https://github.com/ublue-os/config/blob/main/build/ublue-os-...

There's a default `distrobox` with pytorch in ublue-os/config//build/ublue-os-just/etc-distrobox/apps.ini: https://github.com/ublue-os/config/blob/main/build/ublue-os-...

  [mlbox]
  image=nvcr.io/nvidia/pytorch:23.08-py3
  additional_packages="nano git htop"
  init_hooks="pip3 install huggingface_hub tokenizers transformers accelerate datasets wandb peft bitsandbytes fastcore fastprogress watermark torchmetrics deepspeed"
  pre-init-hooks="/init_script.sh"
  nvidia=true
  pull=true
  root=false
  replace=false
docker.io/rocm/pytorch: https://hub.docker.com/r/rocm/pytorch

pytorch/builder//manywheel/Dockerfile: https://github.com/pytorch/builder/blob/main/manywheel/Docke...

ROCm/pytorch//Dockerfile: https://github.com/ROCm/pytorch/blob/main/Dockerfile

The ublue-os (and so also bazzite) OCI host image Containerfile has Sunshine installed, which is a 4K HDR 120fps remote desktop solution for gaming.

There's a `ujust remove-sunshine` command in system_files/desktop/shared/usr/share/ublue-os/just/80-bazzite.just : https://github.com/ublue-os/bazzite/blob/main/system_files/d... and also kernel args for AMD:

  pstate-force-enable:
    rpm-ostree kargs --append-if-missing=amd_pstate=active
ublue-os/config//Containerfile: https://github.com/ublue-os/config/blob/main/Containerfile

LizardByte/Sunshine: https://github.com/LizardByte/Sunshine

moonlight-stream https://github.com/moonlight-stream

Anyways, hopefully this PR fixes the immediate issue: https://github.com/invoke-ai/InvokeAI/pull/5714/files

conda-forge/pytorch-cpu-feedstock > "Add ROCm variant?": https://github.com/conda-forge/pytorch-cpu-feedstock/issues/...

And Fedora supports OCI containers as host images and also podman container images with just systemd to respawn one or a pod of containers.

◧◩◪◨⬒
398. Cu3PO4+hZ5[view] [source] [discussion] 2024-02-14 08:43:50
>>westur+af5
I actually used the rocm/pytorch image you also linked.

I'm not sure what you're pointing to with your reference to the Fedora-based images. I'm quite happy with my NixOS install and really don't want to switch to anything else. And as long as I have the correct kernel module, my host OS really shouldn't matter to run any of the images.

And I'm sure it can be made to work with many base images, my point was just that the dependency management around pytorch was in a bad state, where it is extremely easy to break.

> Anyways, hopefully this PR fixes the immediate issue: https://github.com/invoke-ai/InvokeAI/pull/5714/files

It does! At least for me. It is my PR after all ;)

◧◩◪◨⬒⬓
403. paulmd+Pe8[view] [source] [discussion] 2024-02-14 22:30:32
>>acchow+jy2
I think it has a lot more to do with this: https://youtu.be/590h3XIUfHg?t=1956

AMD fundamentally viewed/views GPUs as nothing more than a tool to make semicustom deals. Just like "xbox isn't the product, gamepass is the product" - well, for AMD "radeon isn't the product, semicustom is the product". The only thing they really need graphics for is APUs, and they don't need to beat the 4090, they just need to beat Xe-LP. They don't need raytracing, they don't need that "AI" crap (oops), just to run games at 720p/1080p.

They're happy to squeeze whatever they can out of Sony/MS's R&D spend, but they aren't going to invest heavily on their own. And now that there is an obvious money fountain occurring in AI/ML... that is starting to change.

It was always about the money, specifically the lack of it. AMD knew HSA-Library/OpenCL/etc sucked, they didn't care, especially when the money was better spent going after Intel instead of NVIDIA. Intel is dysfunctional and AMD had a chance to crack their marketshare, and that's where every penny they had went. And that's probably not a wrong business decision.
