I also doubt that’s what’s happening, but it’s not out of the question. They could be feeding both the input video and the audio/transcript into their transformer, and it may have learned that “when the audio is talking about lips, the person is usually puckering their lips for the camera,” so it regurgitates that.