zlacker

[parent] [thread] 16 comments
1. 0manrh+(OP)[view] [source] 2025-12-05 02:33:39
> To me, the big news here is that ~30% of devices now support AV1 hardware decoding

Where did it say that?

> AV1 powers approximately 30% of all Netflix viewing

Is admittedly a bit non-specific; it could be interpreted as 30% of users or 30% of hours-of-video-streamed, which are very different metrics. If 5% of your users are on AV1, but that 5% watches far more than average, you can have a minority userbase with an outsized share of hours viewed.

I'm not saying that's the case, just giving an example of how it doesn't necessarily translate to 30% of devices using Netflix supporting AV1.

Also, the blog post notes that there is an effective/efficient software decoder, which allows people without hardware acceleration to still view AV1 media in some cases (the case they cited was Android-based phones). So that kinda complicates what "X% of devices support AV1 playback" means, as it doesn't necessarily mean they have hardware decoding.

replies(3): >>sophie+qi >>endorp+fm >>cogman+ep1
2. sophie+qi[view] [source] 2025-12-05 06:20:46
>>0manrh+(OP)
“30% of viewing” I think clearly means either time played or items played. I’ve never worked with a data team that would possibly write that and mean users.

If it was a stat about users they’d say “of users”, “of members”, “of active watchers”, or similar. If they wanted to be ambiguous they’d say “has reached 30% adoption” or something.

replies(2): >>0manrh+mj >>csdrea+MT1
◧◩
3. 0manrh+mj[view] [source] [discussion] 2025-12-05 06:29:56
>>sophie+qi
Agreed, but this is the internet, the ultimate domain of pedantry, and they didn't say it explicitly. I'm not going to put words in their mouth just to have a circular discussion about why I'm claiming they said something they didn't technically say, which is why I asked "Where did it say that?" at the very top.

Also, either way, my point was and still stands: it doesn't say 30% of devices have hardware decoding.

4. endorp+fm[view] [source] 2025-12-05 07:09:11
>>0manrh+(OP)
In either case, it is still big news.
5. cogman+ep1[view] [source] 2025-12-05 14:21:36
>>0manrh+(OP)
That was one of the best decisions of AOMedia.

AV1 was specifically designed to be friendly to hardware decoders, and that same design makes it friendly to software decoding. This happened because AOMedia got hardware manufacturers on the board pretty early on and took their feedback seriously.

VP8/VP9 took a long time to get decent hardware decoding, and part of the reason was that their streams were more complex than the AV1 stream.

replies(2): >>Neywin+DD1 >>galad8+2a2
◧◩
6. Neywin+DD1[view] [source] [discussion] 2025-12-05 15:26:12
>>cogman+ep1
Hmmm, disagree on your chain there. Plenty of algorithms that are easy in hardware are hard in software. For example, in hardware (including FPGAs), bit movement/shuffling is borderline trivial if it's constant, while in software you have to shift and mask and OR over and over. In hardware you literally just switch which wire is connected to what on the next stage. Same for weird bit widths: hardware doesn't care (too much) if you're operating on 9-bit quantities or 33 or 65, while software isn't that granular, so you'll often double your storage and waste a bunch.
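To make the odd-width case concrete, here's a minimal C sketch (hypothetical type, purely illustrative) of a 65-bit add. Hardware would just add one more adder stage:

    #include <stdint.h>

    /* Hypothetical 65-bit integer: one extra bit forces a second
       64-bit word, wasting 63 bits of storage. */
    typedef struct {
        uint64_t lo;  /* bits 0..63 */
        uint64_t hi;  /* bit 64 only */
    } u65;

    static u65 u65_add(u65 a, u65 b) {
        u65 r;
        r.lo = a.lo + b.lo;
        uint64_t carry = (r.lo < a.lo);    /* carry out of the low word */
        r.hi = (a.hi + b.hi + carry) & 1;  /* truncate back to 65 bits */
        return r;
    }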

I think they certainly go hand in hand, in that an algorithm that's relatively easier for software than its predecessors tends to be relatively easier for hardware too, and vice versa, but the two are good at different things.

replies(1): >>cogman+6I1
◧◩◪
7. cogman+6I1[view] [source] [discussion] 2025-12-05 15:45:23
>>Neywin+DD1
I'm not claiming that software will be more efficient. I'm claiming that things that make it easy to go fast in hardware make it easy to go fast in software.

Bit masking/shifting is certainly more expensive in software, but it's also about the cheapest software operation. In most cases it's a single-cycle transform. In better cases, it's something that can be done with a SIMD instruction. And in the best cases, it's a repeated operation that can be distributed across an array of GPU vector processors.
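For example, a tiny SSE2 sketch (illustrative only, not from any codec) that masks sixteen bytes at once where scalar code would loop sixteen times:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Clear the low nibble of sixteen bytes in two real instructions. */
    void clear_low_nibbles(uint8_t data[16]) {
        __m128i v    = _mm_loadu_si128((const __m128i *)data);
        __m128i mask = _mm_set1_epi8((char)0xF0);
        v = _mm_and_si128(v, mask);          /* one AND covers all 16 lanes */
        _mm_storeu_si128((__m128i *)data, v);
    }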

What kills both hardware and software performance is data dependency and conditional logic. That's the sort of thing that was limited in the AV1 stream.

replies(1): >>IshKeb+Qp5
◧◩
8. csdrea+MT1[view] [source] [discussion] 2025-12-05 16:29:34
>>sophie+qi
I am not in data science so I cannot validate your comment, but "30% of viewing" I would assume means users or unique/discrete viewing sessions, not minutes watched. I would appreciate it if Netflix would clarify.
◧◩
9. galad8+2a2[view] [source] [discussion] 2025-12-05 17:39:15
>>cogman+ep1
All I've read is that it's less hardware-friendly than H.264 and HEVC, and everyone was complaining about it. AV2 should be better in this regard.

Where did you read that it was designed to make creating a hardware decoder easier?

replies(2): >>cogman+yj2 >>hulitu+ii5
◧◩◪
10. cogman+yj2[view] [source] [discussion] 2025-12-05 18:21:57
>>galad8+2a2
It was a presentation on AV1 before it was released. I'll see if I can find it, but I'm not holding my breath. It's mostly coming from my own recollection.

Ok, I don't think I'll find it. I think I'm mostly just regurgitating what I remember watching at one of the research symposiums. IDK which one it was, unfortunately [1]

[1] https://www.youtube.com/@allianceforopenmedia2446/videos

replies(1): >>danude+BI2
◧◩◪◨
11. danude+BI2[view] [source] [discussion] 2025-12-05 20:15:51
>>cogman+yj2
I've heard that same anecdote before, that hardware decoding was front of mind. Doesn't mean that you (we) are right, but at least if you're hallucinating it's not just you.
◧◩◪
12. hulitu+ii5[view] [source] [discussion] 2025-12-06 21:04:37
>>galad8+2a2
> AV2 should be better in this regard

Will it, though?

Why create a SW spec and hope that the HW will support it? Why not design it together with the HW?

◧◩◪◨
13. IshKeb+Qp5[view] [source] [discussion] 2025-12-06 22:15:30
>>cogman+6I1
> Bit masking/shifting is certainly more expensive in software, but it's also about the cheapest software operation. In most cases it's a single-cycle transform.

He's not talking about simple bit shifts. Imagine if you had to swap every other bit of a value. In hardware that's completely free; you just change which wires you connect. In software it takes several instructions. The 65-bit example is good too: in hardware it makes basically no difference to go from 64 bits to 65 bits, while in software it's significantly more complex and can more than double computation time.
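A rough C sketch of that swap (illustrative only): a handful of ALU ops per word in software, zero gates in hardware since it's just wiring:

    #include <stdint.h>

    /* Swap each pair of adjacent bits: bit 0 <-> bit 1, bit 2 <-> bit 3, ... */
    uint64_t swap_adjacent_bits(uint64_t x) {
        return ((x & 0xAAAAAAAAAAAAAAAAull) >> 1)   /* odd-index bits move right */
             | ((x & 0x5555555555555555ull) << 1);  /* even-index bits move left */
    }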

I think where software has the advantage is sheer complexity. It's harder to design and verify complex algorithms in hardware than in software, so you need to keep things fairly simple. The design of even a state-of-the-art CPU is surprisingly simple; a cycle-accurate model might only be a few tens of thousands of lines of code.

replies(2): >>Neywin+Ev5 >>cogman+bw5
◧◩◪◨⬒
14. Neywin+Ev5[view] [source] [discussion] 2025-12-06 23:04:26
>>IshKeb+Qp5
Right. It's bit packing and unpacking. I'm currently dealing with a 32-bit system that needs to pack eight 11-bit quantities, each consisting of 3 multi-bit values, into a 96-bit word. As you can imagine, the assembly is a mess of bit manipulation and it takes forever. Ridiculously, it's all to talk to a core that extracts them effortlessly. I'm seriously considering writing an accelerator to do this for me.
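For the curious, a rough C sketch of that kind of packing (hypothetical field layout, fields laid back to back). Every field that straddles a 32-bit word boundary needs the extra shift/OR, which is exactly the mess:

    #include <stdint.h>

    /* Pack eight 11-bit values into a 96-bit word held as three
       32-bit words on a 32-bit target. */
    void pack8x11(const uint16_t vals[8] /* each < 2048 */, uint32_t out[3]) {
        out[0] = out[1] = out[2] = 0;
        for (int i = 0; i < 8; i++) {
            unsigned bit  = (unsigned)i * 11;  /* starting bit position */
            unsigned word = bit / 32;
            unsigned off  = bit % 32;
            out[word] |= (uint32_t)vals[i] << off;
            if (off > 21)                      /* field crosses a word boundary */
                out[word + 1] |= (uint32_t)vals[i] >> (32 - off);
        }
    }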
◧◩◪◨⬒
15. cogman+bw5[view] [source] [discussion] 2025-12-06 23:09:37
>>IshKeb+Qp5
I just have to repeat myself

> I'm not claiming that software will be more efficient. I'm claiming that things that make it easy to go fast in hardware make it easy to go fast in software.

replies(1): >>IshKeb+Sx5
◧◩◪◨⬒⬓
16. IshKeb+Sx5[view] [source] [discussion] 2025-12-06 23:21:33
>>cogman+bw5
Right... That's what I was disagreeing with. Hardware and software have fairly different constraints.
replies(1): >>cogman+9F5
◧◩◪◨⬒⬓⬔
17. cogman+9F5[view] [source] [discussion] 2025-12-07 00:21:46
>>IshKeb+Sx5
I don't think you have an accurate view of what makes an algorithm slow.

The actual constraints on what makes hardware or software slow are remarkably similar. It's not ultimately the transforms on the data that slow down software; it's when you inject conditional logic or data loads. The same is true for hardware.

The only added constraint software has is a limited number of registers to operate on. That can cause software to put more pressure on memory than hardware does. But otherwise, similar algorithms accomplishing the same task will have similar performance characteristics.

Your example of the bit swap is a good illustration of that. Yes, in hardware it's free, and in software it's 3 operations, which is pretty close to free. Both will spend far more time waiting on main memory to load the data than they will spend doing the actual bit shuffling. The constraint on the software side is that you're burning maybe 3 extra registers, and that might get worse if you have no registers to spare, forcing you to constantly load and store.

This is the reason SMT has become ubiquitous on x86 platforms: CPUs spend so much time waiting on data to arrive that we can make them do other useful work while those cache lines fill up.

Saying "hardware can do this for free" is an accurate statement, but you are missing the 80/20 of the performance. Yes, it can do something subcycle that costs software 3 cycles to perform. Both will wait for 1000 cycles while the data is loaded up from main memory. A fast video codec that is easy to decode with hardware gets there by limiting the amount of dataloads that need to happen to calculates a given frame. It does that by avoiding wonky frame transformations. By preferring compression which uses data-points in close memory proximity.
