zlacker

1. jamilt+(OP) 2024-02-01 18:27:22
Can someone provide evidence one way or the other? I don’t know enough to do it myself.
replies(1): >>refulg+9f
2. refulg+9f 2024-02-01 19:44:29
>>jamilt+(OP)
cf. >>39220027, or TFA*. They're doing a Gish gallop, and I can't justify burning more karma to poke holes in a stranger's overly erudite tales; I swing about 8 points into the negative whenever they reply with more.

* multiple sources including OP:

"The SDXL VAE of the same architecture doesn't have this problem,"

"If future models using KL autoencoders do not use the pretrained CompVis checkpoints and use one like SDXL's that is trained properly, they'll be fine."

"SDXL is not subject to this issue because it has its own VAE, which as far as I can tell is trained correctly and does not exhibit the same issues."

replies(1): >>joefou+Im4
3. joefou+Im4 2024-02-02 23:56:12
>>refulg+9f
I think you must have misunderstood me: I didn't say the SD-XL VAE has the same issue as the one in OP. What I said was that its training didn't take into account some of the points that came up during my own research:

- Bounding the encoder outputs to [-1, 1] and optimising the latent variance directly so that it approaches 1 (see the first sketch below)

- Increasing the number of latent channels to 8, since what matters most for latent diffusion is the spatial-resolution reduction, not keeping the channel count small

- Using a more modern discriminator architecture instead of a PatchGAN

- Using a vanilla AE with various latent perturbations instead of a KL-divergence term (see the second sketch below)

Now, SD-XL's VAE is very good, and superior to its predecessor thanks to an improved training procedure, but it didn't use any of the above tricks. It may even be that they would have made no difference in the end; they were useful to me in the context of training models with limited compute.
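
To make the first two bullets concrete, here's a minimal PyTorch sketch. The toy encoder/decoder, layer sizes, and loss weight are made up for illustration and are not the SD/SDXL architecture; the point is just the tanh-bounded 8-channel latent plus a direct variance/mean penalty in place of KL:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class BoundedAE(nn.Module):
      def __init__(self, latent_channels=8):  # 8 latent channels
          super().__init__()
          # Stand-in conv stacks giving an 8x spatial downsample.
          self.encoder = nn.Sequential(
              nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
              nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
              nn.Conv2d(128, latent_channels, 4, stride=2, padding=1),
          )
          self.decoder = nn.Sequential(
              nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.SiLU(),
              nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
              nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
          )

      def encode(self, x):
          return torch.tanh(self.encoder(x))  # bound latents to (-1, 1)

      def forward(self, x):
          z = self.encode(x)
          return self.decoder(z), z

  def latent_reg(z):
      # Push per-channel latent variance toward 1 and mean toward 0
      # directly, instead of regularising via a KL(q || N(0, I)) term.
      var = z.var(dim=(0, 2, 3))
      mean = z.mean(dim=(0, 2, 3))
      return F.mse_loss(var, torch.ones_like(var)) + mean.pow(2).mean()

  x = torch.randn(4, 3, 64, 64)
  recon, z = BoundedAE()(x)
  loss = F.mse_loss(recon, x) + 0.1 * latent_reg(z)  # 0.1 is an arbitrary weight

The idea is that bounded, roughly unit-variance latents give the downstream diffusion model a scale it can rely on, without routing that constraint through a KL term.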
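
And a rough illustration of the last bullet. The specific perturbations here (additive Gaussian noise plus channel dropout) and their strengths are just one plausible instantiation, not an exact recipe:

  import torch
  import torch.nn.functional as F

  def perturb_latents(z, noise_std=0.1, dropout_p=0.1):
      # Train-time-only latent perturbations standing in for the KL term:
      # the decoder must reconstruct from a noisy neighbourhood of each
      # code, which smooths the latent space much like KL regularisation.
      z = z + noise_std * torch.randn_like(z)
      return F.dropout2d(z, p=dropout_p, training=True)

  # During training: recon = model.decoder(perturb_latents(z))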
