So now that H.264, H.265, and AV1 seem to be the three major codecs with hardware support, I wonder what the next one will be?
Where did it say that?
> AV1 powers approximately 30% of all Netflix viewing
That's admittedly a bit non-specific: it could be interpreted as 30% of users or 30% of hours-of-video-streamed, which are very different metrics. If 5% of your users are on AV1, but that 5% watches far more than the average, you can have a minority userbase with an outsized share of hours viewed.
I'm not saying that's the case, just giving an example of why it doesn't necessarily translate into 30% of devices using Netflix supporting AV1.
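To put made-up numbers on that hypothetical (the 5% share and the 8x multiplier are invented purely for illustration):

```python
# Hypothetical: 5% of users stream via AV1, and each of them watches
# 8x the hours of an average non-AV1 user.
av1_user_share = 0.05
hours_multiplier = 8.0

av1_hours = av1_user_share * hours_multiplier
other_hours = (1 - av1_user_share) * 1.0

print(av1_hours / (av1_hours + other_hours))  # ~0.296, i.e. "~30% of viewing"
```

So a "30% of viewing" figure is compatible with anything from a small power-user minority up to 30% of the whole userbase.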
Also, the blog post points out that there's an effective/efficient software decoder, which lets people without hardware acceleration still play AV1 media in some cases (the case they called out was Android-based phones). So that kinda complicates what "X% of devices support AV1 playback" means, as it doesn't necessarily mean they have hardware decoding.
AV1 was specifically designed to be friendly to hardware decoders, and that same decision makes it friendly to software decoding too. This happened because AOMedia got hardware manufacturers on the board pretty early on and took their feedback seriously.
VP8/VP9 took a long time to get decent hardware decoding, and part of the reason was that their streams were more complex than the AV1 stream.
I think they certainly go hand in hand, in that an algorithm that's easier for software than its predecessors tends to be easier for hardware too, and vice versa. But the two are still good at different things.
Bit masking/shifting is certainly more expensive in software, but it's also about the cheapest software operation there is. In most cases it's a single-cycle instruction. In good cases, it's something that can be done with some type of SIMD instruction. And in the best cases, it's a repeated operation that can be distributed across the array of GPU vector processors.
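As a sketch of that progression (plain Python standing in for scalar code, numpy standing in for the SIMD/vector case; the mask and shift values are arbitrary):

```python
import numpy as np

# Scalar case: a shift and a mask each compile to roughly one
# single-cycle instruction (Python itself adds interpreter overhead).
x = 0xDEADBEEF
nibble = (x >> 4) & 0xFF

# Data-parallel case: the identical transform applied across a whole
# array, which is exactly what SIMD units and GPU vector lanes are for.
xs = np.arange(1 << 20, dtype=np.uint32)
nibbles = (xs >> 4) & np.uint32(0xFF)
```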
What kills both hardware and software performance is data dependency and conditional logic. That's the sort of thing that was limited in the AV1 stream.
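A toy illustration of that distinction (not codec code, just the two dependency shapes):

```python
def serial_chain(xs):
    # Loop-carried data dependency: each step consumes the previous
    # result, so neither pipelined hardware nor an out-of-order CPU
    # can overlap the iterations.
    acc = 0
    for x in xs:
        acc = ((acc >> 1) ^ x) & 0xFFFFFFFF
    return acc

def independent_elements(xs):
    # No cross-element dependency and no branches: every element can
    # be computed in parallel, in any order.
    return [((x >> 1) ^ 0x55) & 0xFFFFFFFF for x in xs]
```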
He's not talking about simple bit shifts. Imagine if you had to swap every other bit of a value. In hardware that's completely free; you just change which wires connect where. In software it takes several instructions. The 65-bit example is good too: in hardware it makes basically no difference to go from 64 bits to 65 bits, but in software it's significantly more complex and can more than double computation time.
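For concreteness, swapping adjacent bits of a 64-bit value in software looks something like this (Python hides word size behind arbitrary-precision ints, so the final mask is explicit; in C the same handful of operations would apply):

```python
EVEN_BITS = 0x5555555555555555  # bits 0, 2, 4, ...
ODD_BITS  = 0xAAAAAAAAAAAAAAAA  # bits 1, 3, 5, ...

def swap_adjacent_bits(x: int) -> int:
    """Swap each even-indexed bit with its odd-indexed neighbor."""
    # Two masks, two shifts, and an OR: several instructions in
    # software. In hardware the same permutation costs nothing,
    # since it's just a different routing of the wires.
    return (((x & EVEN_BITS) << 1) | ((x & ODD_BITS) >> 1)) & 0xFFFFFFFFFFFFFFFF

assert swap_adjacent_bits(0b01) == 0b10
assert swap_adjacent_bits(0b10) == 0b01
```

And on the 65-bit point: in C, widening past the native 64-bit word forces multi-word arithmetic with carries propagated between words, while in hardware it's just one more wire on the bus.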
I think where software has the advantage is sheer complexity. It's harder to design and verify complex algorithms in hardware than in software, so you need to keep things fairly simple. The design of even a state-of-the-art CPU is surprisingly simple; a cycle-accurate model might be only a few tens of thousands of lines of code.