Unless you assume there are bad actors who will crop out the tags. Not many people have access to DALL-E 2 right now, or will have access to Imagen.
As someone working in Vision, I am also thinking about whether to include such images deliberately. Image augmentation is ubiquitous in the field, which means we already train on many examples that fall outside the distribution of real input images, and they improve generalization by huge margins. Whether generated images would also improve the generalization of future models is worth trying.
Damn, I just got an idea for a paper while writing this comment.
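For anyone not in the field, here is a rough sketch of what augmentation looks like in practice. It is pure Python on nested lists so it runs anywhere. The function names, the crop size, and the brightness range are all made up for illustration; real pipelines use a library and many more transforms.

```python
import random

def hflip(img):
    """Mirror a grayscale image (list of rows) left-to-right."""
    return [row[::-1] for row in img]

def random_crop(img, size, rng):
    """Cut a random size x size window out of the image."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def shift_brightness(img, delta):
    """Add delta to every pixel, clamping to the 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

def augment(img, crop_size, rng):
    """Produce one randomly perturbed training example from img."""
    out = random_crop(img, crop_size, rng)
    if rng.random() < 0.5:          # flip half of the time
        out = hflip(out)
    delta = rng.randint(-30, 30)    # hypothetical brightness range
    return shift_brightness(out, delta)

if __name__ == "__main__":
    rng = random.Random(0)
    image = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
    # Several different training examples derived from one source image.
    batch = [augment(image, 6, rng) for _ in range(4)]
    print(len(batch), len(batch[0]), len(batch[0][0]))
```

None of the outputs is an image a camera would have produced, which is the point: the model still benefits from seeing them.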
I don't know why people do it, but lots of randoms on the internet do, and they're not even bad actors per se. Removing signatures from art posted online has become a kind of meme in itself, especially when comic strips are reposted on Reddit. So yeah, we'll see lots of them.
Naturally there's a Python library [1] with some algorithms that are resistant to lossy compression, cropping, brightness changes, etc. Scaling seems to be a weakness though.
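To give a feel for how an invisible watermark can survive an edit, here is a toy sketch. It is not the algorithm from the library in [1]; it is a simple additive keyed pattern with correlation detection, and every name and constant in it is an assumption for illustration. It only shows robustness to a uniform brightness shift; cropping or scaling would break the pixel alignment this toy relies on.

```python
import random

def make_pattern(n, key):
    """Balanced pseudo-random +1/-1 pattern derived from a secret key."""
    pattern = [1] * (n // 2) + [-1] * (n - n // 2)
    random.Random(key).shuffle(pattern)
    return pattern

def embed(pixels, key, strength=8):
    """Add the keyed pattern to a flat list of grayscale pixels (clamped to 0..255)."""
    pattern = make_pattern(len(pixels), key)
    return [min(255, max(0, p + strength * w)) for p, w in zip(pixels, pattern)]

def score(pixels, key):
    """Correlate the mean-removed pixels with the keyed pattern."""
    pattern = make_pattern(len(pixels), key)
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) * w for p, w in zip(pixels, pattern)) / len(pixels)

def is_marked(pixels, key, strength=8):
    """Declare the mark present when the score exceeds half the embedding strength."""
    return score(pixels, key) > strength / 2

if __name__ == "__main__":
    flat = [128] * 1024                      # a featureless 32x32 gray image
    marked = embed(flat, key=42)
    brighter = [p + 40 for p in marked]      # uniform brightness change
    print(is_marked(flat, 42), is_marked(marked, 42), is_marked(brighter, 42))
```

Subtracting the mean before correlating is what makes the brightness shift irrelevant: a constant added to every pixel contributes nothing against a balanced pattern. On real images the content itself adds noise to the score, so the threshold would need more care than it gets here.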