1. tarrud (OP) 2025-12-06 22:12:54
> It's fast (~3 seconds on my RTX 4090)

It's amazing how far behind Apple Silicon is when it comes to running non-language models.

Using the reference code from Z-Image on my M1 Ultra, it takes 8 seconds per step: over a minute for the default of 9 steps.
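
For reference, a minimal sketch of how per-step latency can be measured on MPS (denoise() here is a hypothetical stand-in, not the actual Z-Image code):

    import time
    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.randn(1, 16, 128, 128, dtype=torch.float16, device=device)

    def denoise(x):
        return x * 0.99  # stand-in for one real denoising step

    for i in range(3):
        torch.mps.synchronize()  # MPS kernels run async; sync before timing
        t0 = time.perf_counter()
        x = denoise(x)
        torch.mps.synchronize()  # ...and after, so the step really finished
        print(f"step {i}: {time.perf_counter() - t0:.3f}s")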

2. p-e-w 2025-12-06 23:57:34
>>tarrud (OP)
The diffusion process is usually compute-bound, while autoregressive transformer inference is memory-bound.

Apple Silicon is comparable in memory bandwidth to mid-range GPUs, but it’s light years behind on compute.
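
A back-of-envelope roofline comparison makes the point (the peak numbers below are approximate spec-sheet values, so treat them as ballpark):

    # An op is compute-bound when its arithmetic intensity (FLOPs per
    # byte moved) exceeds the hardware's FLOPs-to-bandwidth ratio.
    specs = {
        # name: (approx. peak fp16 TFLOP/s, memory bandwidth GB/s)
        "RTX 4090": (165.0, 1008.0),
        "M1 Ultra": (21.0, 800.0),
    }
    for name, (tflops, gbps) in specs.items():
        ridge = tflops * 1e12 / (gbps * 1e9)
        print(f"{name}: compute-bound above ~{ridge:.0f} FLOPs/byte")

The bandwidth gap is under 1.3x, but the compute gap is roughly 8x. Big diffusion attention/convolution kernels sit well above both ridge points, so raw compute decides; autoregressive decoding streams all the weights for every token and sits below, so bandwidth decides.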

3. tarrud (OP) 2025-12-07 00:47:39
>>p-e-w
> but it’s light years behind on compute.

Is that the only factor, though? I wonder if PyTorch is lacking optimization for the MPS backend.
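
One quick way to probe that (a sketch, assuming PyTorch on macOS; the forward pass is a placeholder): unset PYTORCH_ENABLE_MPS_FALLBACK before importing torch, so any op without an MPS kernel raises instead of silently round-tripping through the CPU:

    import os
    # With the fallback enabled, unsupported ops run on the CPU with only
    # a warning; unsetting it makes missing MPS kernels fail loudly.
    os.environ.pop("PYTORCH_ENABLE_MPS_FALLBACK", None)

    import torch

    x = torch.randn(8, 64, 64, device="mps")
    try:
        y = torch.nn.functional.gelu(x)  # swap in the real forward pass
    except NotImplementedError as e:
        print("op is missing an MPS kernel:", e)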
