zlacker

[parent] [thread] 2 comments
1. cogman+(OP)[view] [source] 2025-12-06 23:09:37
I just have to repeat myself

> I'm not claiming that software will be more efficient. I'm claiming that things that make it easy to go fast in hardware make it easy to go fast in software.

replies(1): >>IshKeb+H1
2. IshKeb+H1[view] [source] 2025-12-06 23:21:33
>>cogman+(OP)
Right... That's what I was disagreeing with. Hardware and software have fairly different constraints.
replies(1): >>cogman+Y8
3. cogman+Y8[view] [source] [discussion] 2025-12-07 00:21:46
>>IshKeb+H1
I don't think you have an accurate view of what makes an algorithm slow.

The actual constraints on what makes hardware or software slow are remarkably similar. It's not ultimately the transforms on the data that slow software down; it's the conditional logic and data loads you inject. The same is true for hardware.

The only added constraint software has is a limited number of registers to operate on. That can cause software to put more pressure on memory than hardware does. But otherwise, similar algorithms accomplishing the same task will have similar performance characteristics.

Your example of the bitshift is a good illustration of that. Yes, in hardware it's free. And in software it's 3 operations, which is pretty close to free. Both will spend far more time waiting on main memory to load up the data for the masking than they will spend doing the actual bit shuffling. The constraint on the software is that you're burning maybe 3 extra registers. That might get worse if you have no registers to spare, forcing you to constantly load and store.
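To make the point concrete, here's a minimal C sketch of that kind of shift-and-mask extraction (the field layout and function name are made up for illustration; the original thread doesn't specify a format):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed format: extract a 5-bit field starting at bit 6.
 * In hardware the shift is just wiring. In software it's a couple of
 * register-to-register ops, each of which is far cheaper than the
 * memory load that fetches the word in the first place. */
uint32_t extract_field(const uint32_t *packed, size_t i)
{
    uint32_t word = packed[i];   /* the expensive part: the load      */
    return (word >> 6) & 0x1Fu;  /* the cheap part: shift + mask      */
}
```

The shift and mask retire in a cycle or so each; a miss to main memory on `packed[i]` costs hundreds of cycles, which is the 80/20 the comment is describing.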

This is the reason SMT has become ubiquitous on x86 platforms: CPUs spend so much time waiting for data to arrive that we can have them do useful work while waiting for those cache lines to fill up.

Saying "hardware can do this for free" is an accurate statement, but you're missing the 80/20 of the performance. Yes, hardware can do something sub-cycle that costs software 3 cycles. Both will then wait 1000 cycles while the data is loaded up from main memory. A fast video codec that is easy to decode in hardware gets there by limiting the number of data loads needed to calculate a given frame. It does that by avoiding wonky frame transformations and by preferring compression that uses data points in close memory proximity.
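A small sketch of why memory proximity dominates, assuming nothing about any particular codec: both functions below do identical arithmetic over the same array, but the sequential walk touches each cache line once, while the strided walk keeps hopping across lines and (for large arrays and strides) pays a miss on nearly every access.

```c
#include <stddef.h>

/* Sum n ints walking the array in memory order: one cache-line fill
 * serves many consecutive elements. */
long sum_sequential(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same total, but visiting elements stride apart first: identical
 * "work" per element, far worse cache behavior for large n/stride. */
long sum_strided(const int *a, size_t n, size_t stride)
{
    long s = 0;
    for (size_t j = 0; j < stride; j++)
        for (size_t i = j; i < n; i += stride)
            s += a[i];
    return s;
}
```

Both return the same answer; only the access pattern differs, which is exactly the knob a cache-friendly codec turns by keeping the data it needs for a frame close together.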
