I think this is essentially the same situation as Proton+DXVK for Linux gaming. That was a net positive for Linux, but I'm less sure about this. Getting good performance out of GPU compute requires much more tuning to the concrete architecture, which I'm afraid devs just won't do for AMD GPUs through this layer, always leaving them behind their Nvidia counterparts.
However, AMD desperately needs to do something. Story time:
On the weekend I wanted to play around with Stable Diffusion. Why pay for cloud compute when I have a powerful GPU at home, I thought. Said GPU is a 7900 XTX, i.e. the most powerful consumer card from AMD at this time. Only a handful of AMD GPUs are supported by ROCm, but thankfully mine is.
So, how hard could it possibly be to get Stable Diffusion running on my GPU? Hard. I don't think my problems were actually caused by AMD: I had ROCm installed and my card recognized by rocminfo in a matter of minutes. But the whole ML world is so focused on Nvidia that it took me ages to get a working installation of pytorch and friends. The InvokeAI installer, for example, asks if you want to use CUDA or ROCm, but then always installs the CUDA variant, whatever you answer. Ultimately, I did get a model to load, but the software crashed my graphical session before generating a single image.
The whole experience left me frustrated and wanting to buy an Nvidia GPU again...
As I mentioned elsewhere, 25% of GPU compute on the Top 500 Supercomputer list is AMD. This all on the back of a card that came out only three years ago. We are very rapidly moving towards a situation where there are many, many high-performance developers that will target ROCm.
Personally, I want Nvidia to break the x86-64 monopoly. Given how amazing properly spec'd Nvidia cards are to work with, I can only dream of a world where Nvidia is my CPU too.
I got this up and running on my windows machine in short order and I don't even know what stable diffusion is.
But again, it would be nice to have first class support to locally participate in the fun.
"Building the DirectX shader compiler better than Microsoft?" (2024) >>39324800
E.g. llama.cpp already supports hipBLAS; is there an advantage to this ROCm CUDA-compatibility layer - ZLUDA on Radeon (and not yet Intel oneAPI) - instead of, or in addition to, that? https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#hi... >>38588573
What can't WebGPU abstract away from CUDA's unportability? >>38527552
With AMD's official 15GB(!) Docker image, I was now able to get the A1111 UI running. With SD 1.5 and 30 sample iterations, generating an image takes under 2s. I'm still struggling to get InvokeAI running.
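For reference, roughly what I ran (a sketch; I'm assuming the image in question is rocm/pytorch, and I'm eliding model downloads):

  docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    --security-opt seccomp=unconfined rocm/pytorch
  # inside the container:
  git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
  cd stable-diffusion-webui
  # A1111's launcher may need to be pointed at the ROCm wheel index:
  TORCH_COMMAND='pip install torch --index-url https://download.pytorch.org/whl/rocm5.7' \
    python launch.py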
On the other hand:
> The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. [source]
Even worse, not even the driver is backwards-compatible:
> There are some known limitations though, like currently only targeting the ROCm 5.x API and not the newly-released ROCm 6.x releases. In turn, having to stick to the ROCm 5.7 series as the latest means that the ROCm DKMS modules don't build against the Linux 6.5 kernel now shipped by Ubuntu 22.04 LTS HWE stacks, for example. Hopefully there will be enough community support to see ZLUDA ported to ROCm 6 so at least it can be maintained with current software releases.
It's time for Open Source to be on the extinguishing side for once.
The one supplied by two companies?
I'm a maintainer (and CEO) of Invoke.
It's something we're monitoring as well.
ROCm has been challenging to work with - we're actively talking to AMD to keep apprised of ways we can mitigate some of the more troublesome experiences that users have with getting Invoke running on AMD (and we're hoping to expand official support to Windows AMD).
The problem is that a lot of the solutions proposed involve significant/unsustainable dev effort (i.e., supporting an entirely different inference paradigm), rather than "drop in" for the existing Torch/diffusers pipelines.
While I don't know enough about your setup to offer immediate solutions, if you join the Discord, I'm sure folks would be happy to walk you through some manual troubleshooting/experimentation to get you up and running - discord.gg/invoke-ai
I have since gotten Invoke to run and was already able to get some results I'm really quite happy with, so thank you for your time and commitment working on Invoke!
I understand that ROCm is still challenging, but it seems my problems were less related to ROCm or Invoke itself and more to Python dependency management. It really boiled down to getting the correct (ROCm) versions of packages installed. Installing Invoke from PyPI always removed my Torch and installed CUDA-enabled Torch (as well as cuBLAS, cuDNN, ...). Once I had the correct versions of packages, everything just worked.
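Concretely, the fix boils down to something like this (a sketch; pick the ROCm version matching your driver stack):

  # install ROCm-enabled Torch from PyTorch's ROCm wheel index
  pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.7
  # sanity check: a ROCm build reports a HIP version, and the "cuda" device is available
  python -c 'import torch; print(torch.version.hip, torch.cuda.is_available())'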
To me, your pyproject.toml looks perfectly sane, so I wasn't sure how to go about fixing the problem.
What ended up working for me was to use one of AMD's ROCm OCI base images, manually installing all dependencies, foregoing a virtual environment, cloning your repo (, building the frontend), and then installing from there.
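In rough strokes (a sketch from memory; the frontend build step follows the repo docs and may have changed since):

  docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch
  # inside the container:
  git clone https://github.com/invoke-ai/InvokeAI && cd InvokeAI
  # ...build the frontend per the repo docs, manually install the remaining
  # dependencies, then install Invoke itself without letting pip swap the
  # preinstalled ROCm-enabled Torch for the CUDA one:
  pip install -e . --no-deps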
The majority of my struggle would have been solved by a recent Docker image containing a working setup. (The one on Docker Hub is 9 months old.) Trying to build the Dockerfile from your repo, I also ended up with a CUDA-enabled Torch: it did install the correct one first, but in a later step removed the ROCm-enabled Torch and swapped in the CUDA-enabled one.
I hope you'll consider investing some resources into publishing newer, working builds of your Docker image.
Also, nothing is easier on Windows. It's a wonder that anything works there, except for the power of recalcitrance.
Not dogging Windows users, but once your brain heals, it just can't go back.
We do have Docker packages hosted on GH, but I'll be the first to admit that we haven't prioritized ROCm. Contributors who have AMDs are a scant few, but maybe we'll find some help in wrangling that problem now that we know there's an avenue to do so.
You can't install the PyTorch that's best for the currently running platform using a pyproject.toml with a setuptools backend, for starters. Invoke would have to author a setup.py that deals with all the issues, in a way that is compatible with build isolation.
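To illustrate (hypothetical fragment): PEP 508 environment markers can only key on things like OS, architecture and Python version, not on the installed GPU, and standard dependency metadata has no per-package index URL, so there is no way to spell "ROCm-enabled torch on AMD machines" in pyproject.toml:

  [project]
  dependencies = [
    # markers can select by platform...
    "torch>=2.1; sys_platform == 'linux'",
    # ...but there is no marker for "has an AMD GPU", and no field that says
    # "resolve this one package against download.pytorch.org/whl/rocm5.7"
  ]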
> The majority of my struggle would have been solved by a recent working Docker image containing a working setup. (The one on Docker Hub is 9 months old.)
Why? Given the state of the ecosystem, what guarantee is there really that the documentation for Docker Desktop with AMD ROCm device binding is going to actually work for your device? (https://rocm.docs.amd.com/projects/MIVisionX/en/latest/docke...)
There is a lot of ad-hoc reinvention of tooling in this space.
I'm pretty sure Torvalds was giving the finger over the subject of GPU drivers (which run on the CPU), not programming on the Nvidia GPU itself. Particularly, they namedropped Bumblebee (and maybe Optimus?) which was more about power-management and making Nvidia cooperate with a non-Nvidia integrated GPU than it was about the Nvidia GPU itself.
> Also, nothing is easier on Windows.
As much as I, too, dislike Windows, I still have to disagree. I have encountered (proprietary) software which was much easier to get working on Windows. For example, Cisco AnyConnect with SmartCard authentication has been a nightmare for me on Linux.
I see. I do know Python, but my knowledge of setuptools, pip, poetry and whatever else have you is limited. To get my working setup, I specified an --index-url for my Torch installation. Does that not work with their current setup?
> Why? Given the state of the ecosystem, what guarantee is there really that the documentation for Docker Desktop with AMD ROCm device binding is going to actually work for your device?
Well, they did work for me. Though I think only passing /dev/{dri,kfd} and setting seccomp=unconfined was sufficient. So for my particular case, getting a working image was the only missing step.
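Concretely, something along these lines (a sketch; image name left as a placeholder):

  docker run -it --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined <some-rocm-image>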
From a more general POV: it might not make sense to invest in a ROCm OCI image from a short-term business perspective, but in the long term, and based purely on principle, I do think the ecosystem should strive to be less reliant on CUDA and only CUDA.
Lock people in to something that, before CUDA existed, no user could actually use? I get that people hate CUDA's dominance, but no one else was pushing this before CUDA, and Apple+AMD completely fumbled OpenCL.
You can't hate on something good just because it's successful, and I can't be angry at the talent behind the success wanting to profit.
What's nice about BLAS is that there are optimized implementations for CPUs (Intel MKL) as well as NVIDIA (cuBLAS) and AMD (hipBLAS), so while it's very much limited in what it can do, you can at least write portable code around it.
ROCm/hipDNN wraps cuDNN on Nvidia and MIOpen on AMD, but hasn't been updated in a while: https://github.com/ROCm/hipDNN
>>37808036 : conda-forge has various BLAS implementations, including MKL-optimized BLAS, and compatible NumPy and SciPy builds.
BLAS: Basic Linear Algebra Subprograms: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...
"Using CuPy on AMD GPU (experimental)" https://docs.cupy.dev/en/v13.0.0/install.html#using-cupy-on-... :
  $ sudo apt install hipblas hipsparse rocsparse rocrand rocthrust rocsolver rocfft hipcub rocprim rccl

You were asking if this CUDA compatibility layer might hold any advantage over HIP (e.g. for use by llama.cpp)?
I think the answer is no, since HIP includes pretty full-featured support for many of the higher level CUDA-based APIs (cuDNN, cuBLAS, etc), while per the Phoronix article ZLUDA only (currently) has minimal support for them.
I wouldn't expect ZLUDA to provide any performance benefit over HIP either, since on AMD hardware HIP is just a pass-thru to MIOpen (AMD's equivalent to cuDNN), rocBLAS, etc.
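That reflects HIP's whole approach: it translates at the source level, so CUDA code is converted once and then compiled natively against AMD's libraries, roughly like this (a sketch, assuming a trivial vector_add.cu):

  # translate CUDA source to HIP, then compile it with AMD's compiler
  hipify-perl vector_add.cu > vector_add.hip.cpp
  hipcc vector_add.hip.cpp -o vector_add

ZLUDA, by contrast, intercepts CUDA API calls at runtime, which is why its library coverage matters so much.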
ROCm docs > "Install ROCm Docker containers" > Base Image: https://rocm.docs.amd.com/projects/install-on-linux/en/lates... links to ROCm/ROCm-docker: https://github.com/ROCm/ROCm-docker which is the source of docker.io/rocm/rocm-terminal: https://hub.docker.com/r/rocm/rocm-terminal :
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/rocm-terminal
ROCm docs > "Docker image support matrix":
https://rocm.docs.amd.com/projects/install-on-linux/en/lates...

ROCm/ROCm-docker//dev/Dockerfile-centos-7-complete: https://github.com/ROCm/ROCm-docker/blob/master/dev/Dockerfi...
Bazzite is a ublue (Universal Blue) fork of the Fedora Kinoite (KDE) or Fedora Silverblue (Gnome) rpm-ostree Linux distributions; ublue-os/bazzite//Containerfile : https://github.com/ublue-os/bazzite/blob/main/Containerfile#... has, in addition to fan and power controls, automatic updates on desktop, supergfxctl, system76-scheduler, and an fsync kernel:
rpm-ostree install rocm-hip \
rocm-opencl \
rocm-clinfo
But it's not `rpm-ostree install --apply-live`, because it's a Containerfile.

To install a ublue-os distro, you install any of the Fedora ostree distros {Silverblue, Kinoite, Sway Atomic, or Budgie Atomic} from e.g. a USB stick and then `rpm-ostree rebase <OCI_host_image_url>`:
rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite:stable
rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite-nvidia:stable
rpm-ostree rebase ostree-image-signed:
ublue-os/config//build/ublue-os-just/40-nvidia.just defines the `ujust configure-nvidia` and `ujust toggle-nvk` commands:
https://github.com/ublue-os/config/blob/main/build/ublue-os-...

There's a default `distrobox` with pytorch in ublue-os/config//build/ublue-os-just/etc-distrobox/apps.ini: https://github.com/ublue-os/config/blob/main/build/ublue-os-...
[mlbox]
image=nvcr.io/nvidia/pytorch:23.08-py3
additional_packages="nano git htop"
init_hooks="pip3 install huggingface_hub tokenizers transformers accelerate datasets wandb peft bitsandbytes fastcore fastprogress watermark torchmetrics deepspeed"
pre-init-hooks="/init_script.sh"
nvidia=true
pull=true
root=false
replace=false
docker.io/rocm/pytorch:
https://hub.docker.com/r/rocm/pytorch

pytorch/builder//manywheel/Dockerfile: https://github.com/pytorch/builder/blob/main/manywheel/Docke...
ROCm/pytorch//Dockerfile: https://github.com/ROCm/pytorch/blob/main/Dockerfile
The ublue-os (and so also bazzite) OCI host image Containerfile has Sunshine installed, which is a 4k HDR 120fps remote desktop solution for gaming.
There's a `ujust remove-sunshine` command in system_files/desktop/shared/usr/share/ublue-os/just/80-bazzite.just : https://github.com/ublue-os/bazzite/blob/main/system_files/d... and also kernel args for AMD:
pstate-force-enable:
rpm-ostree kargs --append-if-missing=amd_pstate=active
ublue-os/config//Containerfile:
https://github.com/ublue-os/config/blob/main/Containerfile

LizardByte/Sunshine: https://github.com/LizardByte/Sunshine
moonlight-stream https://github.com/moonlight-stream
Anyways, hopefully this PR fixes the immediate issue: https://github.com/invoke-ai/InvokeAI/pull/5714/files
conda-forge/pytorch-cpu-feedstock > "Add ROCm variant?": https://github.com/conda-forge/pytorch-cpu-feedstock/issues/...
And Fedora supports OCI containers as host images, and also podman containers with just systemd to respawn one container or a pod of them.
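E.g. keeping a ROCm container alive under systemd is roughly (a sketch; newer Fedora prefers Quadlet, but `podman generate systemd` still works):

  podman create --name rocm-term --device /dev/kfd --device /dev/dri rocm/rocm-terminal
  podman generate systemd --new --name rocm-term \
    > ~/.config/systemd/user/rocm-term.service
  systemctl --user daemon-reload
  systemctl --user enable --now rocm-term.service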
I'm not sure what you're pointing to with your reference to the Fedora-based images. I'm quite happy with my NixOS install and really don't want to switch to anything else. And as long as I have the correct kernel module, my host OS really shouldn't matter to run any of the images.
And I'm sure it can be made to work with many base images; my point was just that the dependency management around pytorch is in a bad state, where it is extremely easy to break.
> Anyways, hopefully this PR fixes the immediate issue: https://github.com/invoke-ai/InvokeAI/pull/5714/files
It does! At least for me. It is my PR after all ;)
Is there a way to `restorecon --like / /nix/os/root72`; i.e. to apply SELinux extended filesystem attribute labels just to NixOS prefixes?
Some research is done with RPM-based distros, which have become quite advanced with rpm-ostree support.
FWICS Bazzite has NixOS support, too; in addition to distrobox containers.
Bazzite has a lot of other stuff installed that's not necessary when attempting to isolate sources of variance in the interest of reproducible research; but, being for gaming, it has various optimizations.
InvokeAI might be faster to install and to compute with using conda-forge builds.
Who owns the Cyrix x86 license these days?
But being able to leverage my graphics card for GPGPU was a top priority for me, and like you, I was appalled by the ROCm situation. Not necessarily the tech itself (though I did not enjoy the Docker approach), but more the developer situation surrounding it.
* well, that and some vague notions about RTX