ollin (OP) | 2024-02-01 01:34:07
some points that stood out to me:

1. they made a lot of careful tweaks to the UNet architecture - it seems like they ran many different ablations here ("In total, our endeavor consumes approximately 512 TPUs spanning 30 days").

2. the model distillation is based on the same team's earlier UFOGen work (https://arxiv.org/abs/2311.09257) - hence the UFO graphic in the diffusion-GAN diagram. rough sketch of the general idea below the list.

3. they train their own 8-channel latent encoder/decoder ("VAE") from scratch (similar to Meta's Emu paper) instead of using the SD VAEs like many other papers do - toy example of what an 8-channel latent means below the list.

4. they use an internal dataset of 150m image/text pairs (roughly the size of laion-highres)

5. they also reran SD training from scratch on this dataset to get their baseline performance
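
for anyone curious about point 2: a rough, hand-wavy sketch of the diffusion-GAN idea behind UFOGen-style distillation (PyTorch; generator / discriminator / add_noise / the loss weighting are all made-up stand-ins, not the paper's actual objective). a one-step generator predicts the clean image from a noised one, and a discriminator is trained to separate re-noised real images from re-noised generated ones:

    import torch
    import torch.nn.functional as F

    def distill_step(generator, discriminator, add_noise, num_timesteps,
                     x0, text_emb, g_opt, d_opt):
        # noise a real image to a random timestep (ordinary diffusion forward process)
        t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
        xt = add_noise(x0, torch.randn_like(x0), t)

        # one-step generator: predict the clean image directly from (xt, t, text)
        x0_pred = generator(xt, t, text_emb)

        # re-noise real and generated images to the same timestep; the
        # discriminator tries to tell them apart (the "diffusion-GAN" part)
        xt_real = add_noise(x0, torch.randn_like(x0), t)
        xt_fake = add_noise(x0_pred, torch.randn_like(x0), t)

        # discriminator update (non-saturating GAN loss)
        d_loss = (F.softplus(-discriminator(xt_real, t, text_emb)).mean()
                  + F.softplus(discriminator(xt_fake.detach(), t, text_emb)).mean())
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # generator update: fool the discriminator and stay close to the real image
        # (a real implementation would freeze the discriminator params for this pass)
        g_loss = (F.softplus(-discriminator(xt_fake, t, text_emb)).mean()
                  + F.mse_loss(x0_pred, x0))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
        return d_loss.item(), g_loss.item()

iirc the actual method initializes the one-step generator from the pretrained diffusion UNet and is more careful about the reconstruction/adversarial balance - see the UFOGen paper for the real losses.

and for point 3, a toy (non-variational) autoencoder just to show what an 8-channel latent means in practice, assuming the usual 8x spatial downsampling. the SD VAEs compress to 4 latent channels at the same resolution, and the real thing is an actual VAE trained with KL / perceptual / GAN losses, so treat this purely as a shape illustration:

    import torch
    import torch.nn as nn

    class TinyLatentAE(nn.Module):
        """toy 8x-downsampling autoencoder with an 8-channel latent (made-up layer sizes)"""
        def __init__(self, latent_channels: int = 8):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(256, latent_channels, 3, padding=1),
            )
            self.decoder = nn.Sequential(
                nn.Conv2d(latent_channels, 256, 3, padding=1), nn.SiLU(),
                nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.SiLU(),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
            )

        def forward(self, x):
            z = self.encoder(x)  # (B, 8, H/8, W/8) latent that the diffusion model would run on
            return self.decoder(z), z

    x = torch.randn(1, 3, 512, 512)
    recon, z = TinyLatentAE()(x)
    print(z.shape)  # torch.Size([1, 8, 64, 64]) vs. [1, 4, 64, 64] for the SD VAEs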
