Can this 37% truly be attributed to loop unrolling? One notable difference is that, in the unrolled version, the loop only compares against len(a) rather than against both len(a) and len(b). I don't know enough about Go to say whether the compiler can optimize that second comparison away, but in some other languages it would be significant.
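To make the question concrete, here is a rough sketch of how one could try to separate the two effects. The function names are mine, not the article's, and I'm assuming the original code looks roughly like dotNaive. The middle variant keeps the plain loop but reslices b up front, which may let the compiler prove that b[i] is in range; whether it actually drops the check is something to verify rather than assume.

```go
package main

import "fmt"

// dotNaive is the plain loop: the range bounds i by len(a),
// so b[i] may still need its own bounds check on every iteration.
func dotNaive(a, b []float32) float32 {
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// dotHinted is the same loop, NOT unrolled. Reslicing b to len(a)
// is a hint that may allow the compiler to skip the check on b[i].
func dotHinted(a, b []float32) float32 {
	if len(a) != len(b) {
		panic("slices must have equal length")
	}
	b = b[:len(a)]
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// dotUnrolled processes four elements per iteration, then
// finishes any leftover elements one at a time.
func dotUnrolled(a, b []float32) float32 {
	if len(a) != len(b) {
		panic("slices must have equal length")
	}
	var sum float32
	i := 0
	for ; i+4 <= len(a); i += 4 {
		s0 := a[i] * b[i]
		s1 := a[i+1] * b[i+1]
		s2 := a[i+2] * b[i+2]
		s3 := a[i+3] * b[i+3]
		sum += s0 + s1 + s2 + s3
	}
	for ; i < len(a); i++ {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6}
	b := []float32{6, 5, 4, 3, 2, 1}
	fmt.Println(dotNaive(a, b), dotHinted(a, b), dotUnrolled(a, b))
}
```

If dotHinted benchmarks close to dotUnrolled, most of the gain is probably from the bounds check and not from the unrolling. Building with -gcflags="-d=ssa/check_bce/debug=1" should report which bounds checks the compiler kept.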
> Also, getting my hands dirty with some assembly sounds fun, so that's what I'm going to do.
I'd have loved to see a comparison between the asm the compiler was generating and the bespoke asm written here. I'd bet that simply gussying up the generated asm would yield a pretty sizable improvement (though obviously a less impressive one than switching to SIMD).