It is amazing how far behind Apple Silicon is when it comes to running non-language models.
Running the reference code from Z-image on my M1 Ultra takes 8 seconds per step, so over a minute (about 72 seconds) for the default of 9 steps.
Apple Silicon is comparable in memory bandwidth to mid-range GPUs, but it’s light years behind on compute.
Is that the only factor though? I wonder if PyTorch is lacking optimization for the MPS backend.
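
One rough way to separate "the hardware is slow" from "the software is slow" is a roofline-style lower bound: a step can't finish faster than its FLOPs divided by peak compute, or its bytes moved divided by peak bandwidth, whichever is larger. Here is a minimal sketch of that idea. Every number in it is a made-up placeholder (the per-step FLOPs and bytes, and both device specs), not a measurement of Z-image or of any real chip, and `roofline_time_s` is just a name I picked.

```python
def roofline_time_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Lower bound on runtime: the slower of the compute limit and the memory limit."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / peak_bytes_per_s
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


# Hypothetical workload for one denoising step (placeholders, not real figures).
STEP_FLOPS = 40e12   # floating-point operations per step
STEP_BYTES = 12e9    # bytes read from memory per step

# Two hypothetical devices with the SAME bandwidth but 4x different compute.
DEVICES = {
    "device A (placeholder)": (20e12, 800e9),  # (FLOP/s, bytes/s)
    "device B (placeholder)": (80e12, 800e9),
}

if __name__ == "__main__":
    for name, (peak_flops, peak_bw) in DEVICES.items():
        t, bound = roofline_time_s(STEP_FLOPS, STEP_BYTES, peak_flops, peak_bw)
        print(f"{name}: at least {t:.3f} s per step ({bound}-bound)")
```

With real numbers plugged in, the comparison to make is measured time versus this bound. If the 8 seconds per step is close to the compute bound, the hardware is the limit. If it is many times larger, the gap is more likely software, such as unoptimized kernels or fallbacks in the backend.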