Is that what KL divergence does?
I thought it was supposed to (when combined with reconstruction loss) “smooth” the latent space out so that you could interpolate over it.
Doesn’t increasing the weight of the KL term just result in random output in the latent space, i.e. what you’d get if you optimized purely for KL divergence?
I honestly have no idea what the OP has found or what it means, but it doesn't seem that surprising that modifying the latent results in global changes in the output.
Is manually editing latents a thing?
Surely you would interpolate from another latent…? And if the result is chaos, don't you have poorly clustered latents? (Which is what happens from too much KL, not too little, right?)
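For reference, "interpolating from another latent" usually just means blending two encoded vectors and decoding each blend; linear and spherical interpolation are the common choices. A minimal sketch with toy vectors (the latents here are made up, and you'd pass each step to your decoder, which isn't shown):

```python
import math

def lerp(z1, z2, t):
    """Linear interpolation between two latent vectors."""
    return [(1 - t) * a + t * b for a, b in zip(z1, z2)]

def slerp(z1, z2, t):
    """Spherical interpolation; often preferred for Gaussian-ish latents.

    Assumes z1 and z2 are nonzero and not (anti-)parallel.
    """
    dot = sum(a * b for a, b in zip(z1, z2))
    n1 = math.sqrt(sum(a * a for a in z1))
    n2 = math.sqrt(sum(b * b for b in z2))
    omega = math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
    s1 = math.sin((1 - t) * omega) / math.sin(omega)
    s2 = math.sin(t * omega) / math.sin(omega)
    return [s1 * a + s2 * b for a, b in zip(z1, z2)]

# Two toy 2-D latents; sweep t from 0 to 1 to get the interpolation path.
z_a, z_b = [1.0, 0.0], [0.0, 1.0]
path = [slerp(z_a, z_b, t / 4) for t in range(5)]
```

If the decoded images along such a path are chaotic rather than smoothly varying, that's the "poorly clustered latents" symptom being discussed.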
I'd feel a lot more 'across' this if the OP had demonstrated it on a trivial MNIST VAE, showing the issue, the result, and quantitatively what fixing it does.
> What are the implications?
> Somewhat subtle, but significant.
Mm. I have to say I don't really get it.
> Is that what KL divergence does?
KL divergence is basically a distance "metric" in the space of probability distributions. If you have two probability distributions A and B, you can ask how similar they are. "Metric" is in scare quotes because you can't actually get a distance function in the usual sense. For example, dist(A,B) != dist(B,A).
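To make the asymmetry concrete: for two univariate Gaussians the KL divergence has a closed form, and swapping the arguments gives different numbers. A minimal sketch (the two distributions here are arbitrary illustrations):

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# A = N(0, 1), B = N(1, 2^2): the "distance" depends on direction.
ab = kl_gaussian(0.0, 1.0, 1.0, 2.0)  # KL(A || B) ~= 0.443
ba = kl_gaussian(1.0, 2.0, 0.0, 1.0)  # KL(B || A) ~= 1.307
```

So KL satisfies dist(A,A) = 0 and dist(A,B) >= 0, but not symmetry, which is why "metric" belongs in scare quotes.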
If you think about the distribution as giving information about things, then the distance function should say two things are close if they provide similar information and are distant if one provides more information about something than the other.
The comment claims (and I assume they know what they're talking about) that after training we want the latent distribution to be close to a standard Gaussian, i.e. the KL divergence between them should be small. That would mean our statistical distribution gives roughly the same information as a standard Gaussian. It sounds like this distribution has a whole lot of information concentrated in one heavily localized area, though (or maybe too little information in that area; I'm not sure which way it goes).
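For the standard-Gaussian prior case, the per-dimension KL term also has a closed form, and it shows how a heavily localized encoder distribution (tiny variance) drives the penalty up. A sketch, not taken from the OP's code:

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one latent dim."""
    return 0.5 * (mu**2 + sigma**2 - math.log(sigma**2) - 1.0)

# Well-behaved latent: mean near 0, variance near 1 -> KL near 0.
ok = kl_to_standard_normal(0.0, 1.0)
# Heavily localized latent: tiny variance -> the -log(sigma^2) term blows up.
localized = kl_to_standard_normal(0.0, 0.01)
```

The same formula, summed over dimensions, is the KL term a VAE's loss actually penalizes.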
Some background reading on generic VAE https://towardsdatascience.com/intuitively-understanding-var..., see "Optimizing using pure KL divergence loss".
Perhaps the SD 'VAE' uses a different architecture from a normal VAE...