zlacker

If the model takes text embeddings/tokens as an input, it can create a connection between the caption and the text on the image (sometimes they are really similar).