I would expect AI development to follow a similar path to digital media generally, since it's tracking the increasing difficulty and space requirements of digitally representing each medium: text < basic sounds < images < advanced audio < video.
What’s more impressive to me is how far ahead text-to-speech is, but I think the explanation is straightforward (its accessibility value has motivated us to work on it for a lot longer).