But I think the real value -- and this is where Google has a clear advantage -- is applying this functionality to photos you have already taken.
I have probably 30K+ photos in Google Photos capturing moments from the past 15 years. Quite a lot of them are multiple shots of the same scene taken in quick succession, and it would be fairly straightforward for Google to detect such groupings and apply the technique to produce synthesized pictures that are better than the originals. It already does something similar with photo collages and "best shot in a series of rapid shots," which surface without my having to do anything.
Wouldn't an operation like this require some kind of fine-tuning? Or do diffusion models have a way of using images as context, the way one would provide context to an LLM?
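For what "using an image as context" can look like in practice, here is a minimal sketch using the Hugging Face diffusers img2img pipeline -- purely an illustration of image conditioning in general, not a claim about what Google actually uses, and it only conditions on a single photo rather than a whole burst:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    # Load a pretrained diffusion model; no fine-tuning involved.
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # One shot from a hypothetical burst serves as the conditioning image.
    init_image = Image.open("burst_shot_01.jpg").convert("RGB").resize((768, 512))

    result = pipe(
        prompt="a sharp, well-exposed photo of the same scene",
        image=init_image,
        strength=0.3,        # low strength keeps the output close to the source photo
        guidance_scale=7.5,
    ).images[0]

    result.save("synthesized.jpg")

Merging several burst shots into one output (swapping a blinking face, say) would presumably need richer conditioning than this single-image setup, but the basic mechanism of feeding an existing photo into the denoising process doesn't require retraining the model.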