I think most people just find it easy to put a podcast on and pay semi-attention while they do tasks or go on their phone. And the education sector is having to adapt to that and make it possible for students to achieve good grades by learning like that.
</old man yells at cloud>
"A moment" in a video is exactly that, a moment of time, either a frame or a couple of seconds that will stay in short term memory.
"A moment" in a text is a page or two facing pages. There can be diagrams or formulas there. It is extremely easy to direct attention to parts of these pages, in any order.
In a video, "moments" in the above sense are generally low in information and change quickly, in linear order. In a text, they are fewer and of higher density. It seems that the second type is easier to understand, to commit to long-term memory, etc.