https://www.metaculus.com/questions/3479/date-weakly-general...
For example, before Google tweaked its image search, the results had some interesting ideas about what constitutes a professional hairstyle, and about searches for "men" and "women" only returning light-skinned people: https://www.theguardian.com/technology/2016/apr/08/does-goog...
Does that reflect reality? No.
(I suspect there are also mostly unstated but very real concerns about these being used as child pornography, revenge porn, "show my ex brutally murdered" etc. generators.)
You're telling me those are all the most non-professional hairstyles available? That this is a reasonable assessment? That fairly standard, well-kept, work-appropriate curly black hair is roughly equivalent to the pink, three-foot-wide hairstyle worn by one of the only white people in the "unprofessional" results?
Each and every one of them is less workplace appropriate than, say, http://www.7thavenuecostumes.com/pictures/750x950/P_CC_70594... ?
A good example of this is the PULSE paper [0][1]; you may remember it as the "white Obama". It became a huge debate, and it was fairly easy to show that the largest factor was dataset bias. The outrage did lead to FFHQ being fixed, but it also sparked a heated debate with LeCun (data-centric bias) and Timnit Gebru (model-centric bias) at the center. PULSE is still remembered for this bias, though, not for how its authors responded to it. I should also note that there is human bias in this case, since we have a priori knowledge of what the upsampled image should look like (humans are pretty good at this when the small image is already recognizable, but it is a difficult metric to calculate mathematically).
It is fairly easy to find adversarial examples where generative models produce biased results. It is FAR harder to fix them. Since this is known by the community but not by the public (and some community members focus on finding these holes rather than fixing them), it creates outrage. It's probably best for them to limit their release.
[0] https://arxiv.org/abs/2003.03808
[1] https://cdn.vox-cdn.com/thumbor/MXX-mZqWLQZW8Fdx1ilcFEHR8Wk=...
# whois appspot.com
[Querying whois.verisign-grs.com]
[Redirected to whois.markmonitor.com]
[Querying whois.markmonitor.com]
[whois.markmonitor.com]
Domain Name: appspot.com
Registry Domain ID: 145702338_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.markmonitor.com
Registrar URL: http://www.markmonitor.com
Updated Date: 2022-02-06T09:29:56+0000
Creation Date: 2005-03-10T02:27:55+0000
Registrar Registration Expiration Date: 2023-03-10T00:00:00+0000
Registrar: MarkMonitor, Inc.
Registrar IANA ID: 292
Registrar Abuse Contact Email: abusecomplaints@markmonitor.com
Registrar Abuse Contact Phone: +1.2086851750
Domain Status: clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)
Domain Status: clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)
Domain Status: clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)
Domain Status: serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)
Domain Status: serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)
Domain Status: serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)
Registrant Organization: Google LLC
Registrant State/Province: CA
Registrant Country: US
Registrant Email: Select Request Email Form at https://domains.markmonitor.com/whois/appspot.com
Admin Organization: Google LLC
Admin State/Province: CA
Admin Country: US
Admin Email: Select Request Email Form at https://domains.markmonitor.com/whois/appspot.com
Tech Organization: Google LLC
Tech State/Province: CA
Tech Country: US
Tech Email: Select Request Email Form at https://domains.markmonitor.com/whois/appspot.com
Name Server: ns4.google.com
Name Server: ns3.google.com
Name Server: ns2.google.com
Name Server: ns1.google.com

It's often not worth it to decentralize the computation of the trained model, but it's not hard to get donated cycles, and groups are working on it. Don't fret because Google isn't releasing the API/code. They released the paper, and that's all you need.
https://twitter.com/joeyliaw/status/1528856081476116480?s=21...
One quote:
> “On the other hand, generative methods can be leveraged for malicious purposes, including harassment and misinformation spread [20], and raise many concerns regarding social and cultural exclusion and bias [67, 62, 68]”
There are two possible ways of interpreting "gender stereotypes in professions": biased or correct.
https://www.abc.net.au/news/2018-05-21/the-most-gendered-top...
https://www.statista.com/statistics/1019841/female-physician...
This is common in the research PA. People don't want to deal with broccoli man [1].
> We investigated sex differences in 473,260 adolescents’ aspirations to work in things-oriented (e.g., mechanic), people-oriented (e.g., nurse), and STEM (e.g., mathematician) careers across 80 countries and economic regions using the 2018 Programme for International Student Assessment (PISA). We analyzed student career aspirations in combination with student achievement in mathematics, reading, and science, as well as parental occupations and family wealth. In each country and region, more boys than girls aspired to a things-oriented or STEM occupation and more girls than boys to a people-oriented occupation. These sex differences were larger in countries with a higher level of women's empowerment. We explain this counter-intuitive finding through the indirect effect of wealth. Women's empowerment is associated with relatively high levels of national wealth and this wealth allows more students to aspire to occupations they are intrinsically interested in.
Source: https://psyarxiv.com/zhvre/ (HN discussion: https://news.ycombinator.com/item?id=29040132)
There is a Google Colab workbook that you can try and run for free :)
This is the image-text pair dataset behind it: https://laion.ai/laion-400-open-dataset/
It is also available via Hugging Face transformers.
However, the paper mentions T5-XXL is 4.6B, which doesn't fit any of the checkpoints above, so I'm confused.
For example, the discovery that language models get far better at answering complex questions if asked to show their working step by step with chain-of-thought reasoning, as on page 19 of the PaLM paper [1]. The explanations of novel jokes on page 38 of the same paper are also worth checking out. While it is, as you say, all statistics, if it's indistinguishable from valid reasoning, then perhaps it doesn't matter.
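For the curious, chain-of-thought prompting is mostly just prompt construction. A minimal sketch, assuming a hypothetical generate(prompt) wrapper around whatever large model you can call (the exemplar is the classic tennis-ball one from the chain-of-thought literature):

    # Minimal chain-of-thought prompting sketch. `generate(prompt)` is a
    # hypothetical wrapper around whatever large language model you can call.
    def cot_prompt(question: str) -> str:
        exemplar = (
            "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
            "balls each. How many tennis balls does he have now?\n"
            "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 balls. 5 + 6 = 11. The answer is 11.\n\n"
        )
        # The worked exemplar nudges the model to emit intermediate steps
        # before committing to a final answer.
        return exemplar + "Q: " + question + "\nA:"

    # answer = generate(cot_prompt("A juggler can juggle 16 balls. Half of "
    #                              "the balls are golf balls. ..."))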
It doesn't output it outright; it forms it gradually, finding and strengthening progressively finer-grained features among the dwindling noise, combining the learned associations between memorized convolutional texture primitives and the encoded text embeddings. In the limit of enough data, the associations and primitives turn out to be composable enough to suffice for out-of-distribution benchmark scenes.
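Roughly, the sampling side of that process looks like the bare-bones DDPM-style loop below. This is a sketch, not Imagen's actual code: eps_model is a hypothetical noise-prediction network conditioned on the text embedding, and betas is the usual precomputed noise schedule.

    import torch

    @torch.no_grad()
    def sample(eps_model, text_emb, shape, betas):
        alphas = 1.0 - betas
        alphas_cumprod = torch.cumprod(alphas, dim=0)
        x = torch.randn(shape)                      # start from pure noise
        for t in reversed(range(len(betas))):
            eps = eps_model(x, t, text_emb)         # predict the noise component
            # DDPM posterior mean: strip out a bit of the predicted noise.
            coef = betas[t] / torch.sqrt(1.0 - alphas_cumprod[t])
            x = (x - coef * eps) / torch.sqrt(alphas[t])
            if t > 0:                               # re-inject smaller fresh noise
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x                                    # finer features emerge step by step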
When you have a high-quality encoder of your modality into a compressed vector representation, the rest is optimization over a sufficiently high-dimensional, plastic computational substrate (model): https://moultano.wordpress.com/2020/10/18/why-deep-learning-...
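As a concrete (if simplified) instance of that recipe, the VQGAN-CLIP-style pipelines linked elsewhere in the thread do roughly this. The text_encoder, image_encoder, and image_decoder here are hypothetical stand-ins for a shared-embedding-space model plus a generator:

    import torch

    def optimize_latent(text_encoder, image_encoder, image_decoder, prompt,
                        latent_shape=(1, 256), steps=300, lr=0.05):
        target = text_encoder(prompt).detach()        # fixed text embedding
        z = torch.randn(latent_shape, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            img = image_decoder(z)                    # latent -> pixels
            emb = image_encoder(img)                  # pixels -> shared space
            loss = -torch.cosine_similarity(emb, target, dim=-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return image_decoder(z).detach()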
It works because it should. The next question is: "What are the implications?".
Can we meaningfully represent every available modality in a single latent space, and freely interconvert composable gestalts like this https://files.catbox.moe/rmy40q.jpg ?
Good lord, we are screwed. And yet somehow I bet even this isn't going to kill off the "they're just statistical interpolators" meme.
[1] https://www.deepmind.com/blog/tackling-multiple-tasks-with-a...
Other funding models are possible as well; in the grand scheme of things, the price for these models is small enough.
Convolutional filters lend themselves to a rich combinatorics of compositions [1]: think of them as context-dependent texture-atoms, repulsing and attracting over variations of the local multi-dimensional context in the image. The composition is literally a convolutional transformation of local channels encoding related principal components of the context.
Astronomical amounts of computations spent via training allow the network to form a lego-set of these texture-atoms in a general distribution of contexts.
At least this is my intuition for the nature of the convnets.
1. https://microscope.openai.com/models/contrastive_16x/image_b...
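A toy sketch of that composition intuition, assuming nothing beyond stock PyTorch: each layer's filters are built from the channels of the previous layer, so later layers respond to combinations of earlier patterns (the per-layer descriptions are the usual interpretability reading, not something this code verifies).

    import torch
    import torch.nn as nn

    stack = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level: edges, color blobs
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combinations: corners, simple textures
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, padding=1),  # combinations of combinations
    )
    features = stack(torch.randn(1, 3, 64, 64))       # -> (1, 64, 64, 64) feature map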
TL;DR: a generative story site's creators employ human moderation after horny people inevitably use the site to make gross porn; horny people using the site to make regular porn are justifiably freaked out.
Bring your popcorn
I expect that in the practical limit of achievable scale, the regularization pressure inherent to training these models converges to https://en.wikipedia.org/wiki/Minimum_description_length, and the merely correlative relationships get optimized away, leaving mostly the true causal relationships inherent to the data-generating process.
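For reference, the two-part MDL criterion being alluded to: pick the hypothesis H that minimizes the bits needed to describe the hypothesis plus the bits needed to describe the data given the hypothesis.

    H^* = \arg\min_{H} \left[ L(H) + L(D \mid H) \right]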
As a foreigner[], your point confused me anyway, and doing a Google for cultural stuff usually gets variable results. But I did laugh at many of the comments here https://www.reddit.com/r/TooAfraidToAsk/comments/ufy2k4/why_...
[] probably, New Zealand, although foreigner is relative
https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
https://nonint.com/2022/05/04/friends-dont-let-friends-train...
I guess the concern would be: if one of these recipe websites _was_ generated by an AI, and the ingredients _look_ correct to an AI but are otherwise wrong - then what do you do? Baking soda swapped with baking powder. Tablespoons instead of teaspoons. Add 2 tbsp of flour to the caramel macchiato. Whoops! Meant sugar.
https://www.google.com/search?q=chess+puzzle+mate+in+4&tbm=i...
It would be surprising if AI couldn't do the same search and produce a realistic drawing out of any one of the result puzzles.
2. hentAI automates the process: https://github.com/natethegreate/hent-AI
3. [NSFW] Should look at this person on Twitter: https://twitter.com/nate_of_hent_ai
4. [NSFW] PornHub released vintage porn videos upscaled to 4K with AI a while back. They called it the "Remastured Project": https://www.pornhub.com/art/remastured
5. [NSFW] This project shows the limits of AI-without-big-tech-or-corporate-support projects. It generates female genitalia that don't exist in the real world. The project is "This Vagina Does Not Exist": https://thisvaginadoesnotexist.com/about.html
I would love it.
[1] https://github.com/CompVis/latent-diffusion.git [2] https://imgur.com/a/Sl8YVD5
Naturally there's a Python library [1] with some algorithms that are resistant to lossy compression, cropping, brightness changes, etc. Scaling seems to be a weakness, though.
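For illustration, with the widely used imagehash package (which may or may not be the library meant here), a perceptual hash changes little under recompression or brightness tweaks, so a small Hamming distance suggests the same underlying image:

    from PIL import Image
    import imagehash  # pip install imagehash; illustrative choice, not necessarily the library referenced

    h1 = imagehash.phash(Image.open("original.jpg"))
    h2 = imagehash.phash(Image.open("recompressed.jpg"))
    print(h1 - h2)  # Hamming distance; small values (roughly < 10) suggest a match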
For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.
[0] https://gweb-research-imagen.appspot.com/main_gallery_images...
[1] https://github.com/nerdyrodent/VQGAN-CLIP.git [2] https://github.com/CompVis/latent-diffusion.git [3] https://imgur.com/a/dCPt35K
I usually consider myself fairly intelligent, but I know that when I read an AI research paper I'm going to feel dumb real quick. All I managed to extract from the paper was that a) there isn't a clear explanation of how it's done written for lay people, and b) they are concerned about the quality and biases of the training sets.
Having thought about the problem of "building" an artificial means to visualize from thought, I have a very high level (dumb) view of this. Some human minds are capable of generating synthetic images from certain terms. If I say "visualize a GREEN apple sitting on a picnic table with a checkerboard table cloth", many people will create an image that approximately matches the query. They probably also see a red and white checkerboard cloth because that's what most people have trained their models on in the past. By leaving that part out of the query we can "see" biases "in the wild".
Of course there are people that don't do generative in-mind imagery, but almost all of us do build some type of model in real time from our sensor inputs. That visual model is being continuously updated and is what is perceived by the mind "as being seen". Or, as the Gorillaz put it:
… For me I say God, y'all can see me now
'Cos you don't see with your eye
You perceive with your mind
That's the end of it…
To generatively produce strongly accurate imagery from text, a system needs enough reference material in the document collection. It needs to have sampled a lot of images of corn and snakes. It needs to be able to do image segmentation and probably perspective estimation. It needs a lot of semantic representations (optimized queries of words) of what is being seen in a given image, across multiple "viewing models", even from humans (who also created/curated the collections). It needs to be able to "know" what corn looks like, even from the perspective of another model. It needs to know what "shape" a snake model takes and how combining the bitmask of the corn will affect the perspective and framing of the final image. All of this information ends up inside the model's network.

Miika Aittala at Nvidia Research has done several presentations on taking a model (imagined as a wireframe) and then mapping a bitmapped image onto it with a convolutional neural network. They have shown generative abilities for making brick walls that look real, for example, from images of a bunch of brick walls and running those on various wireframes.
Maybe Imagen is an example of the next step in this, by using diffusion models instead of the CNN for the generator and adding in semantic text mappings while varying the language model's weights (i.e. allowing the language model to more broadly use related semantics when processing what is seen in a generated image). I'm probably wrong about half of that.
Here's my cut on how I saw this working from a few years ago: https://storage.googleapis.com/mitta-public/generate.PNG
Regardless of how it works, it's AMAZING that we are here now. Very exciting!
The harder part here will be getting access to the compute required, but again, the folks involved in this project have access to lots of resources (they've already trained models of this size). We'll likely see some trained checkpoints as soon as they're done converging.
[0] https://creativecloud.adobe.com/discover/article/how-to-use-...
It's been done, starting from plotter based solutions years ago, through the work of folks like Thomas Lindemeier:
https://scholar.google.com/citations?user=5PpKJ7QAAAAJ&hl=en...
Up to and including actual painting robot arms that dip brushes in paint and apply strokes to canvas today:
https://www.theguardian.com/technology/2022/apr/04/mind-blow...
The painting technique isn't all that great yet for any of these artbots working in a physical medium, but that's largely down to a general lack of dexterity in manual tool use rather than an art-specific challenge. I suspect that RL environments that physically model the application of paint with a brush would help advance the SOTA. It might be cheaper to model other mediums like pencil, charcoal, or even airbrushing first, before tackling more complex and dimensional mediums like oil paint or watercolor.