Also, my bet would be that video-capable models are better at this.
So, back to the analogy: it could be that the LLMs experience the equivalent of a very intense optical illusion in these cases, and then completely fall apart trying to make sense of it.