zlacker

From slow to SIMD: A Go optimization story

submitted by rbanff+(OP) on 2024-01-23 17:57:31 | 254 points 104 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only show all posts
◧◩◪
4. mseepg+ls[view] [source] [discussion] 2024-01-23 19:48:39
>>mcronc+Jq
That information is out of date: ARM64, PPC, RISC-V https://go.dev/src/cmd/compile/abi-internal
◧◩
5. hoten+jt[view] [source] [discussion] 2024-01-23 19:52:44
>>koala_+1o
Maybe a hidden blessing. Just ran into a nasty MSVC auto vectorization bug. Apparently it's hard to get right.

https://developercommunity.visualstudio.com/t/Bad-codegen-du...

7. carboc+Yt[view] [source] 2024-01-23 19:55:00
>>rbanff+(OP)
I wonder whether avo could have been useful here?[1] I mention it because it came up the last time we were talking about AVX operations in go.[2, 3]

1 = https://github.com/mmcloughlin/avo

2 = >>34465297

3 = https://www.reddit.com/r/golang/comments/10hmh07/how_to_use_...

13. malkia+8H[view] [source] 2024-01-23 20:53:41
>>rbanff+(OP)
I learned yesterday about GoLang's assembler https://go.dev/doc/asm - after browsing how arrow is implemented for different languages (my experience is mainly C/C++) - https://github.com/apache/arrow/tree/main/go/arrow/math - there are bunch of .S ("asm" files) and I'm still not able to comprehend how these work exactly (I guess it'll take more reading) - it seems very peculiar.

The last time I've used inlined assembly was back in Turbo/Borland Pascal, then bit in Visual Studio (32-bit), until they got disabled. Then did very little gcc with their more strict specification (while the former you had to know how the ABI worked, the latter too - but it was specced out).

Anyway - I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code then produce assembly (-S) then optimize it, or find/hire someone to do it.

◧◩
14. mratsi+dH[view] [source] [discussion] 2024-01-23 20:53:58
>>miki12+ND
It depends.

You need 2~3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't do it because it only creates those when the operation is associative, i.e. (a+b)+c = a+(b+c), which is true for integers but not for floats.

There is an escape hatch in -ffast-math.

I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...

◧◩
15. neonsu+IK[view] [source] [discussion] 2024-01-23 21:08:29
>>koala_+1o
Even manual vectorization is pain...writing ASM, really?

Rust has unstable portable SIMD and a few third-party crates, C++ has that as well, C# has stable portable SIMD and a small out of box BLAS-like library to help with most common tasks (like SoftMax, Magnitude and etc. on top of spans of floats over writing manually), hell it even exercises PackedSIMD when ran in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality stands open given planned changes to unsafe API).

Go among these is uniquely disadvantaged. And if that's not enough, you may want to visit 1Brc's challenge discussions and see that Go struggles to get anywhere close to 2s mark with both C# and C++ blazing past it:

https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...

https://github.com/gunnarmorling/1brc/discussions/67

◧◩◪
17. neonsu+uL[view] [source] [discussion] 2024-01-23 21:12:40
>>jshear+Tt
C# is doing that :)

https://learn.microsoft.com/en-us/dotnet/api/system.runtime....

Examples of usage:

- https://github.com/U8String/U8String/blob/main/Sources/U8Str...

- https://github.com/nietras/1brc.cs/blob/main/src/Brc/BrcAccu...

- https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

(and many more if you search github for the uses of Vector128/256<byte> and the like!)

◧◩
19. camden+lO[view] [source] [discussion] 2024-01-23 21:26:05
>>zeehio+2G
Copying in a response to a similar question on Reddit:

I should really add some discussion around BLAS in particular, which has an good implementation[0] of the float32 dot product that outperforms any of the float32 implementations in the blog post. I'm getting ~1.9m vecs/s on my benchmarking rig.

However, that BLAS became unusable for us as soon as we switched to quantized vectors because there is no int8 implementation of the dot product in BLAS (though I'd love to be proven wrong)

[0]: https://pkg.go.dev/gonum.org/v1/gonum@v0.14.0/blas/blas32#Do...

◧◩
20. stevek+mO[view] [source] [discussion] 2024-01-23 21:26:10
>>miki12+ND
> Rust iterators would also be fun to benchmark here

I started to write this out, and then thought "you know what given how common this is, I bet I could even just google it" and thought that would be more interesting, as it makes it feel more "real world." The first result I got is what I would have written: https://stackoverflow.com/a/30422958/24817

Here's a godbolt with three different outputs: one at -O, one at -O3, and one at -03 and -march=native

https://godbolt.org/z/6xf9M1cf3

Eyeballing it comments:

Looks like 2 and 3 both provide extremely similar if not identical output.

Adding the native flag ends up generating slightly different codegen, I am not at the level to be able to simply look at that and know how meaningful the difference is.

It does appear to have eliminated the bounds check entirely, and it's using xmm registers.

I am pleasantly surprised at this output because zip in particular can sometimes hinder optimizations, but rustc did a great job here.

----------------------

For fun, I figured "why not also try as direct of the original Go as possible." The only trick here is that Rust doesn't really do the c-style for loop the way Go does, so I tried to translate what I saw as the spirit of the example: compare the two lengths and use the minimum for the loop length.

Here it is: https://godbolt.org/z/cTcddc8Gs

... literally the same. I am very surprised at this outcome. It makes me wonder if LLVM has some sort of idiom recognition for dot product specifically.

EDIT: looks like it does not currently, see the comment at line 28 and 29: https://llvm.org/doxygen/LoopIdiomRecognize_8cpp_source.html

◧◩◪
22. camel-+mR[view] [source] [discussion] 2024-01-23 21:39:53
>>stevek+mO
Those aren't vectorized at all, just unrolled. vmulss/vaddss just multiply/add the single-precision floating-point in the vector register.

With clang you get basically the same codegen, although it uses fused multiply adds.

The problem is that you need to enable -ffast-math, otherwise the compiler can't change the order of floating point operations, and thus not vectorize.

With clang that works wonderfully and it gives us a lovely four times unrolled AVX2 fused multiply add loop, but enabling it in rust doesn't seem to work: https://godbolt.org/z/G4Enf59Kb

Edit: from what I can tell this is still an open issue??? https://github.com/rust-lang/rust/issues/21690

Edit: relevant SO post: https://stackoverflow.com/questions/76055058/why-cant-the-ru... Apparently you need to use `#![feature(core_intrinsics)]`, `std::intrinsics::fadd_fast` and `std::intrinsics::fmul_fast`.

◧◩◪◨
26. stevek+ZU[view] [source] [discussion] 2024-01-23 21:56:33
>>camel-+mR
Ahh yes, thank you.

Rust doesn't have a -ffast-math flag, though it is interesting that you passed it directly to llvm. I am kinda glad that escape hatch doesn't work, to be honest.

There are currently unstable intrinsics that let you do this, and you seemingly get close to clang codegen with them: https://godbolt.org/z/EEW79Gbxv

The thread tracking this discusses another attempt at a flag to enable this by turning on the CPU feature directly, but that doesn't seem to affect codegen in this case. https://github.com/rust-lang/rust/issues/21690

It would be nice to get these intrinsics stabilized, at least.

EDIT: oops you figured this out while I was writing it, haha.

◧◩◪◨
27. nemoth+4W[view] [source] [discussion] 2024-01-23 22:02:09
>>camel-+mR
It was just last week I was reading a comment that made it seem like you shouldn't really use -ffast-math[0], but here looks like a non-rare reason why you would want to enable it.

What is correct idiom here? It feels if this sort of thing really matters to you, you should have the know how to handroll a couple lines of ASM. I want to say this is rare, but I had a project a couple years ago where I needed to handroll some vectorized instructions on a raspberry pi.

[0] >>39013277

◧◩
31. camden+0Y[view] [source] [discussion] 2024-01-23 22:13:17
>>jeremy+IW
Definitely possible! However, I prefer to avoid cgo wherever possible for more than just the overhead.

In my experience:

- It complicates builds by requiring a C toolchain

- It makes single, static binaries more difficult

- It makes portability more difficult (not that assembly is portable though)

- It causes difficult-to-debug issues (I recently ran into an issue where MacOS signing changed, causing all my cgo binaries to be killed on startup)

- Debuggers don't work across Cgo boundaries (they do with go ASM!)

I think Dave Cheney said it best: https://dave.cheney.net/2016/01/18/cgo-is-not-go

◧◩◪◨⬒
34. cbm-vi+p11[view] [source] [discussion] 2024-01-23 22:33:07
>>stevek+ZU
Even without a -ffast-math flag, the current stable Rust compiler will vectorize loops on integer types.

https://godbolt.org/z/KjErzacfv

Edit: ...and I now realize who I responded to, I'm sure you already know this. :)

36. srouss+n41[view] [source] 2024-01-23 22:50:40
>>rbanff+(OP)
For other languages (including nodejs/bun/rust/python etc) you can have a look at SimSIMD which I have contributed to this year (made recompiled binaries for nodejs/bun part of the build process for x86_64 and arm64 on Mac and Linux, x86 and x86_64 on windows).

[0] https://github.com/ashvardanian/SimSIMD

◧◩◪◨⬒
43. miki12+3a1[view] [source] [discussion] 2024-01-23 23:25:13
>>nemoth+4W
> you should have the know how to handroll a couple lines of ASM

For what architecture? What if this code is in a library that your users might want to run on Intel (both 32 and 64 bit), ARM, Risc V and s390x? Even if you learn assembly for all of these, how are you going to get access to an S390X IBM mainframe to test your code? What if a new architecture[1] gets popular in the next couple of years, and you won't have access to a CPU to test on?

Leaving this work to a compiler or architecture-independent functions / macros that use intrinsics under the hood frees you from having to think about all of that. As long as whatever the user is running on has decent compiler support, your code is going to work and be fast, even years later.

[1] https://en.wikipedia.org/wiki/Loongson

◧◩◪◨⬒
45. orlp+md1[view] [source] [discussion] 2024-01-23 23:50:03
>>nemoth+4W
You can write it like this to get the compiler to generate SIMD: https://godbolt.org/z/ohvoEb7er

It's certainly not perfect though (in particular the final reduction/remainder handling).

Unfortunately Rust doesn't have a proper optimizing float type. I really wish there was a type FastF32 or something similar which may be optimized using the usual transformation rules of algebra (e.g. associative property, distributive property, x + y - y = x, etc).

There is fadd_fast and co, but those are UB on NaN/infinite input.

49. huac+6i1[view] [source] 2024-01-24 00:21:39
>>rbanff+(OP)
I did a similar optimization via https://github.com/viterin/vek as the SIMD version. Some somewhat unscientific calculations showed a 10x improvement staying in float32: https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf... (comparable to the 9x improvement in article using SIMD + int8)

TBH my takeaway was that it was more useful to use smaller vectors as a representation

◧◩◪
52. Thaxll+Dn1[view] [source] [discussion] 2024-01-24 01:01:51
>>neonsu+IK
5sec for simple and readable code: https://gist.github.com/corlinp/176a97c58099bca36bcd5679e68f...

Have you seen the 2sec code from c#?

◧◩
57. lifthr+Xu1[view] [source] [discussion] 2024-01-24 02:05:31
>>Thalia+qW
In my testing [1] that doesn't eliminate bound checks. Instead, it avoids a computation of otherwise unused `cap(a[i:i+4]) = len(a) - i` value if my reading is correct.

[1] https://go.godbolt.org/z/63n6hTGGq (original) vs. https://go.godbolt.org/z/YYPrzjxP5 (capacity not limited)

> Well I had never seen that "full slice" expression syntax before.

Go's notion of capacity is somewhat pragmatic but at the same time confusing as well. I learned the hard way that the excess capacity is always available for the sake of optimization:

    a := []int{1, 2, 3, 4, 5}
    lo, hi := a[:2], a[2:]
    lo = append(lo, 6, 7, 8)      // Oops, it tries to reuse `lo[2:5]`!
    fmt.Printf("%v %v\n", lo, hi) // Prints `[1 2 6 7 8] [6 7 8]`
While I do understand the rationale, it is too unintuitive because there is no indication of the excess capacity in this code. I would prefer `a[x:y]` to be a shorthand for `a[x:y:y]` instead. The `a[x:y:len(a)]` case is of course useful though, so maybe a different shorthand like `a[x:y:$]` can be added.
◧◩◪◨⬒⬓
58. lifthr+tw1[view] [source] [discussion] 2024-01-24 02:18:41
>>sevagh+g11
For a very long time `-funsafe-math-optimizations` (and thus `-ffast-math`) had been infectious [1], so a responsible library should never have used `-ffast-math` anyway.

You are right in that the final binary is free to turn `-ffast-math` on if you can verify that everything went okay. But almost no one would actually verify that. It's like an advice that you shouldn't write your own crypto code---it's fine if you know what you are doing, but almost no one does, so the advice is technically false but still worthwhile.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 (GCC), https://github.com/llvm/llvm-project/issues/57589 (LLVM)

◧◩
65. lifthr+sA1[view] [source] [discussion] 2024-01-24 02:57:45
>>snitty+Wi1
C++ users can enjoy Highway [1].

[1] https://github.com/google/highway/

69. kldx+AS1[view] [source] 2024-01-24 06:12:18
>>rbanff+(OP)
This is a task where Halide https://halide-lang.org/ could really shine! It disconnects logic from scheduling (unrolling, vectorizing, tiling, caching intermediates etc), so every step the author describes in the article is a tunable in halide. halide doesn't appear to have bindings for golang so calling C++ from go might be the only viable option.

Edit: the author has valid points against FFI here >>39110692

◧◩◪
99. Measte+f63[view] [source] [discussion] 2024-01-24 15:59:24
>>stevek+mO
I had a go at writing this, both the initial float version and the later integer-based version. Note that I haven't actually tested these to make sure that output is correct, so, I might have messed up somewhere, and that I targeted Rocketlake for the CPU.

For the float-based version[0] I had to break out the unstable portable_simd to get it to vectorize. Most of the function ends up being setting up everything, but then actually doing the calculation is simple, and basically the same as non-SIMD section. I've never used the portable SIMD stuff before, and it was quite pleasant to use.

For the integer-based version, I started with the simple naive approach[1], and that vectorized to a pretty good degree on stable. However, it doesn't use the dot-product instruction. For that, I think we need to use nightly and go a bit manual[2]. Unsurprisingly, it mostly ends up looking like the float version as a fair chunk is just setup. I didn't bother here, but it should probably be using feature detection to make sure the instruction exists.

[0] https://godbolt.org/z/Gdv8azorW [1] https://godbolt.org/z/d8jv3ofYo [2] https://godbolt.org/z/4oYEnKTbf

[go to top]