Neural compression wouldn't work like HEVC, which operates on frames and pixels. These techniques can instead encode entire features and optical flow, which would explain the larger discrepancies: larger fingers, slightly misplaced items, and so on.
Neural compression techniques reshape the image itself.
If you've ever fed an image into `gpt-image-1` and asked it to reproduce it, you'll notice the result is maybe 95% similar, but entire features can move around or get averaged toward the model's concept of what those items are.
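To make the "features, not pixels" idea concrete, here's a minimal sketch of how learned image compression works in principle, assuming a toy convolutional autoencoder in PyTorch. This isn't anyone's actual codec; the point is that what gets stored is a quantized latent feature map rather than pixels, so decoding reconstructs a plausible version of each region instead of its exact bits.

```python
# Toy learned-compression sketch: pixels -> latent features -> pixels.
# Purely illustrative; real learned codecs add entropy models, hyperpriors, etc.
import torch
import torch.nn as nn

class ToyNeuralCodec(nn.Module):
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Analysis transform: pixels -> downsampled feature map (the "latent")
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, latent_channels, kernel_size=5, stride=2, padding=2),
        )
        # Synthesis transform: latent -> reconstructed pixels
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, kernel_size=5, stride=2,
                               padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=5, stride=2,
                               padding=2, output_padding=1), nn.Sigmoid(),
        )

    def compress(self, image: torch.Tensor) -> torch.Tensor:
        # Quantizing the latent is where information gets thrown away;
        # this is why small details can shift or average out on decode.
        latent = self.encoder(image)
        return torch.round(latent)  # stand-in for quantization + entropy coding

    def decompress(self, latent: torch.Tensor) -> torch.Tensor:
        return self.decoder(latent)

codec = ToyNeuralCodec()
frame = torch.rand(1, 3, 256, 256)         # fake RGB frame in [0, 1]
latent = codec.compress(frame)             # what would actually be stored
reconstruction = codec.decompress(latent)  # plausible, not bit-exact
print(frame.shape, latent.shape, reconstruction.shape)
```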
It looks like they're compressing the data before it gets further processed by the traditional suite of video codecs: relying on the traditional codecs for delivery, but running some internal first pass to further compress the data they have to store.
I don't think that's actually what's up, but I don't think it's completely ruled out either.
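Purely as an illustration of that hypothetical two-stage setup, here's what "neural first pass, then a traditional codec" could look like. The `neural_first_pass` function below is a made-up placeholder (a crude block-average blur standing in for an autoencoder round-trip); the ffmpeg/libx265 invocation is just the standard way to encode raw frames as HEVC, nothing vendor-specific.

```python
# Speculative two-stage pipeline: a learned "simplify" pass, then HEVC.
import subprocess
import numpy as np

def neural_first_pass(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a learned transform (e.g. an autoencoder round-trip)
    that simplifies the frame so the traditional codec downstream has less
    entropy to spend bits on. Here: average 2x2 blocks and upsample back."""
    h, w, _ = frame.shape
    small = frame.reshape(h // 2, 2, w // 2, 2, 3).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1).astype(np.uint8)

def encode_with_hevc(frames, width=256, height=256, fps=30, out="out.mp4"):
    """Pipe raw RGB frames through the neural pass, then let libx265 (HEVC)
    do the conventional frame/pixel compression as the second stage."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps), "-i", "-",
        "-c:v", "libx265", out,
    ]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    for frame in frames:
        proc.stdin.write(neural_first_pass(frame).tobytes())
    proc.stdin.close()
    proc.wait()

# Usage: 60 random frames, "simplified" then HEVC-encoded.
encode_with_hevc(np.random.randint(0, 256, (60, 256, 256, 3), dtype=np.uint8))
```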