zlacker

[parent] [thread] 11 comments
1. ALittl+(OP)[view] [source] 2022-05-23 23:01:34
Interesting to me that this one can draw legible text. DALL-E models seem to generate weird glyphs that only look like text, but the examples they show here have perfectly legible characters and correct spelling. The difference between this and DALL-E makes me both suspicious and curious. I wish I could play with this model.
replies(5): >>Tehdas+Q6 >>GaggiX+Ka >>ricard+lb >>zimpen+SX >>the847+Hc1
2. Tehdas+Q6[view] [source] 2022-05-23 23:57:13
>>ALittl+(OP)
It still has the issue of screwing up mechanical objects. In their demo, check out the wheels on the skateboards; they're all over the place.
replies(2): >>sdento+gs >>gpt5+xz2
3. GaggiX+Ka[view] [source] 2022-05-24 00:29:55
>>ALittl+(OP)
Imagen is conditioned on text embeddings, while the OpenAI model is conditioned on image embeddings instead; that's the reason (rough sketch below this comment). There are other models that can generate text: latent diffusion trained on LAION-400M, GLIDE, and DALL-E (1).
replies(1): >>ALittl+Ql
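A rough, hypothetical sketch of the difference being described here, using toy tensors only (the shapes and the mean-pooling stand-in are illustrative, not either model's actual architecture): an Imagen-style generator sees one embedding per caption token, so word identity and order survive, while an unCLIP / DALL-E 2-style decoder is conditioned on a single pooled image embedding in which per-character detail is mostly gone.

    import torch
    import torch.nn as nn

    vocab, d = 1000, 64
    tok_emb = nn.Embedding(vocab, d)        # stand-in for a real text encoder

    caption_ids = torch.randint(0, vocab, (1, 8))   # 8 caption tokens

    # Imagen-style conditioning: a sequence of per-token embeddings.
    text_cond = tok_emb(caption_ids)                # shape (1, 8, 64)

    # unCLIP-style conditioning: a single pooled vector (mean pooling here is
    # only a stand-in for the CLIP image embedding predicted from the caption).
    image_cond = text_cond.mean(dim=1)              # shape (1, 64)

    print(text_cond.shape, image_cond.shape)

The point of the toy: the first conditioning signal still carries the individual tokens of "a sign that says STOP", while the second has collapsed them into one vector.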
4. ricard+lb[view] [source] 2022-05-24 00:35:42
>>ALittl+(OP)
I thought the weird text in DALL-E 2 was intentional, to prevent malicious use.
5. ALittl+Ql[view] [source] [discussion] 2022-05-24 02:10:24
>>GaggiX+Ka
My understanding of the terms "text embeddings" and "image embeddings" is that they are ways of representing text or images as vectors. But I don't understand how that would help with actually drawing the symbols for those letters.
replies(1): >>GaggiX+kK
6. sdento+gs[view] [source] [discussion] 2022-05-24 03:27:14
>>Tehdas+Q6
For comparison, most humans can't draw a bicycle:

https://www.wired.com/2016/04/can-draw-bikes-memory-definite...

replies(1): >>dclowd+YH
7. dclowd+YH[view] [source] [discussion] 2022-05-24 06:21:02
>>sdento+gs
I blame it on the surprisingly clever structure of a bicycle. Opposing triangles probably aren't the first thing most people picture when they think of a bicycle (vs. two wheels and some handlebars).
replies(1): >>gwern+rf3
8. GaggiX+kK[view] [source] [discussion] 2022-05-24 06:45:09
>>ALittl+Ql
If the model takes text embeddings/tokens as an input, it can learn a connection between the caption and the text rendered in the image (sometimes they are really similar); rough sketch below.
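A minimal illustration of that connection in plain PyTorch, with hypothetical sizes (this is not Imagen's actual code): if the denoiser cross-attends to the caption's token embeddings, each image location can look up exactly the tokens it is supposed to spell out.

    import torch
    import torch.nn as nn

    d = 64
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

    image_feats = torch.randn(1, 256, d)    # 16x16 grid of latent image features
    caption_feats = torch.randn(1, 8, d)    # embeddings of 8 caption tokens

    # Query = image features, key/value = caption tokens: each image location
    # receives a weighted mix of caption-token information.
    out, weights = attn(query=image_feats, key=caption_feats, value=caption_feats)
    print(out.shape, weights.shape)          # (1, 256, 64), (1, 256, 8)

With only a pooled image embedding as conditioning, there is no per-token signal left to attend to, which is one plausible reason the spelling comes out garbled.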
9. zimpen+SX[view] [source] 2022-05-24 08:54:51
>>ALittl+(OP)
The latent-diffusion[1] model I've been playing with is not terrible at drawing legible text, but it's generally awful at drawing the text you actually want (cf. [2]), or it draws text when you don't want any.

[1] https://github.com/CompVis/latent-diffusion.git [2] https://imgur.com/a/Sl8YVD5

10. the847+Hc1[view] [source] 2022-05-24 11:14:57
>>ALittl+(OP)
DALL-E 1 was able to render text[0]. That DALL-E 2 can't is probably a tradeoff introduced by unCLIP in exchange for more diverse results. Now the Google model is better still and doesn't have to make that tradeoff.

[0] https://openai.com/blog/dall-e/#text-rendering

11. gpt5+xz2[view] [source] [discussion] 2022-05-24 18:28:57
>>Tehdas+Q6
I only see the problem with the paintings; if you choose a photo, it's good. It could be a problem in the source data (i.e., paintings of mechanical objects are themselves imperfect).
12. gwern+rf3[view] [source] [discussion] 2022-05-24 22:41:58
>>dclowd+YH
They also can't draw pennies, the letter 'g' with the loop, and so on (https://www.gwern.net/docs/psychology/illusion-of-depth/inde...). Bicycles may be clever, but the shallowness of mental representation is real.