Keep up the good work!
https://developercommunity.visualstudio.com/t/Bad-codegen-du...
I'd rather see new languages focus on making better explicit SIMD abstractions à la Intel's ISPC, rather than writing yet another magic vectorizer that only actually works in trivial cases.
1 = https://github.com/mmcloughlin/avo
2 = >>34465297
3 = https://www.reddit.com/r/golang/comments/10hmh07/how_to_use_...
If I were to ever take this further and add loop unrolling or something, I'd absolutely reach for Avo.
Then it's just a codegen problem.
But yes, ultimately, the user needs to be aware of how the language works, what is parallelizable and what isn't, and of the cost of the operations that they ask their computer to execute.
From my brief forays into reading (mostly AArch64) assembly, it looks like C compilers can detect these kinds of patterns now and convert them all to SIMD by themselves, with no work from the programmer. Even at -O2, converting an index-based loop into one based on start and end pointers is not unusual. Go doesn't seem to do this; the assembly output by the Go compiler looks much closer to the actual code than what you get from C.
Rust iterators would also be fun to benchmark here; they're supposed to be as fast as plain old loops, and they're probably optimized to omit bounds checks entirely.
The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit), until it got disabled. After that I did very little with gcc and its stricter specification (with the former you had to know how the ABI worked; with the latter too, but at least it was specced out).
Anyway - I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code, then produce assembly (-S), then optimize it, or find/hire someone to do it.
You need 2~3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't create them on its own, because it only does so when the operation is associative, i.e. (a+b)+c = a+(b+c), which is true for integers but not for floats.
There is an escape hatch in -ffast-math.
I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...
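Since Go has no such flag, you end up splitting the accumulators by hand. A minimal sketch of what I mean (the function name and the unroll factor of 4 are mine):

    // sum4 splits the reduction across four independent accumulators so the
    // adds can overlap in the pipeline. Note this changes the summation
    // order, which is exactly the reordering a compiler may not do for
    // floats on its own.
    func sum4(xs []float32) float32 {
        var s0, s1, s2, s3 float32
        i := 0
        for ; i+4 <= len(xs); i += 4 {
            s0 += xs[i]
            s1 += xs[i+1]
            s2 += xs[i+2]
            s3 += xs[i+3]
        }
        for ; i < len(xs); i++ { // remainder
            s0 += xs[i]
        }
        return (s0 + s1) + (s2 + s3)
    }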
Rust has unstable portable SIMD and a few third-party crates; C++ has those as well; C# has stable portable SIMD and a small out-of-the-box BLAS-like library to help with the most common tasks (like SoftMax, Magnitude, etc. on top of spans of floats, instead of writing them manually), hell, it even exercises PackedSIMD when run in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality stands open given planned changes to the unsafe API).
Go among these is uniquely disadvantaged. And if that's not enough, you may want to visit the 1BRC challenge discussions and see that Go struggles to get anywhere close to the 2s mark, with both C# and C++ blazing past it:
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
Nowadays, even on most embedded platforms, resources are so abundant and applications are so HORIZONTALLY scaled that it seems like nobody bothers with VERTICAL scaling.
The thing about multiplication is that it is commutative; an 8x improvement in the performance of a single node means you need 8x fewer nodes! What a fun way to cut down your AWS bill and send your SaaS company to profitability.
https://learn.microsoft.com/en-us/dotnet/api/system.runtime....
Examples of usage:
- https://github.com/U8String/U8String/blob/main/Sources/U8Str...
- https://github.com/nietras/1brc.cs/blob/main/src/Brc/BrcAccu...
- https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
(and many more if you search github for the uses of Vector128/256<byte> and the like!)
I should really add some discussion around BLAS in particular, which has a good implementation[0] of the float32 dot product that outperforms any of the float32 implementations in the blog post. I'm getting ~1.9M vecs/s on my benchmarking rig.
However, that BLAS became unusable for us as soon as we switched to quantized vectors because there is no int8 implementation of the dot product in BLAS (though I'd love to be proven wrong)
[0]: https://pkg.go.dev/gonum.org/v1/gonum@v0.14.0/blas/blas32#Do...
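If I remember the gonum API right (check the linked docs to be sure), usage looks roughly like this, with made-up values:

    import "gonum.org/v1/gonum/blas/blas32"

    // Dot product of two dense float32 vectors via gonum's BLAS wrapper.
    x := blas32.Vector{N: 3, Inc: 1, Data: []float32{1, 2, 3}}
    y := blas32.Vector{N: 3, Inc: 1, Data: []float32{4, 5, 6}}
    d := blas32.Dot(x, y) // 1*4 + 2*5 + 3*6 = 32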
I started to write this out, and then thought "you know what given how common this is, I bet I could even just google it" and thought that would be more interesting, as it makes it feel more "real world." The first result I got is what I would have written: https://stackoverflow.com/a/30422958/24817
Here's a godbolt with three different outputs: one at -O, one at -O3, and one at -O3 with -march=native.
https://godbolt.org/z/6xf9M1cf3
Eyeballing it, some comments:
Looks like 2 and 3 both produce extremely similar, if not identical, output.
Adding the native flag generates slightly different code; I'm not at the level where I can simply look at it and know how meaningful the difference is.
It does appear to have eliminated the bounds check entirely, and it's using xmm registers.
I am pleasantly surprised at this output because zip in particular can sometimes hinder optimizations, but rustc did a great job here.
----------------------
For fun, I figured "why not also try as direct of the original Go as possible." The only trick here is that Rust doesn't really do the c-style for loop the way Go does, so I tried to translate what I saw as the spirit of the example: compare the two lengths and use the minimum for the loop length.
Here it is: https://godbolt.org/z/cTcddc8Gs
... literally the same. I am very surprised at this outcome. It makes me wonder if LLVM has some sort of idiom recognition for dot product specifically.
EDIT: looks like it does not currently; see the comments at lines 28 and 29: https://llvm.org/doxygen/LoopIdiomRecognize_8cpp_source.html
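(For reference, the Go-style loop being translated is roughly this; my reconstruction, not the post's exact code:)

    func dotNaive(a, b []float32) float32 {
        n := len(a)
        if len(b) < n {
            n = len(b) // compare the two lengths, use the minimum
        }
        var sum float32
        for i := 0; i < n; i++ {
            sum += a[i] * b[i]
        }
        return sum
    }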
With clang you get basically the same codegen, although it uses fused multiply adds.
The problem is that you need to enable -ffast-math; otherwise the compiler can't change the order of floating-point operations, and thus can't vectorize.
With clang that works wonderfully and it gives us a lovely four times unrolled AVX2 fused multiply add loop, but enabling it in rust doesn't seem to work: https://godbolt.org/z/G4Enf59Kb
Edit: from what I can tell this is still an open issue??? https://github.com/rust-lang/rust/issues/21690
Edit: relevant SO post: https://stackoverflow.com/questions/76055058/why-cant-the-ru... Apparently you need to use `#![feature(core_intrinsics)]`, `std::intrinsics::fadd_fast` and `std::intrinsics::fmul_fast`.
Rust doesn't have a -ffast-math flag, though it is interesting that you passed it directly to llvm. I am kinda glad that escape hatch doesn't work, to be honest.
There are currently unstable intrinsics that let you do this, and you seemingly get close to clang codegen with them: https://godbolt.org/z/EEW79Gbxv
The thread tracking this discusses another attempt at a flag to enable this by turning on the CPU feature directly, but that doesn't seem to affect codegen in this case. https://github.com/rust-lang/rust/issues/21690
It would be nice to get these intrinsics stabilized, at least.
EDIT: oops you figured this out while I was writing it, haha.
What is the correct idiom here? It feels like, if this sort of thing really matters to you, you should have the know-how to hand-roll a couple lines of ASM. I want to say this is rare, but I had a project a couple years ago where I needed to hand-roll some vectorized instructions on a Raspberry Pi.
[0] >>39013277
Well, I had never seen that "full slice" expression syntax before. It turns out that it's important because it controls the capacity of the new slice: the capacity is now (i+4) - i = 4.
So by using the full slice expression you get a slice of length 4 and capacity 4. Without doing this, the capacity would run from i to the end of the original slice's backing array.
I suppose that by controlling the capacity you eliminate the bounds check.
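A tiny sketch of the difference (values are mine):

    a := make([]float32, 16)
    part := a[4:8]   // len 4, cap 12: capacity runs to the end of a
    full := a[4:8:8] // len 4, cap 4: appending reallocates rather than
                     // scribbling over a[8:]
    fmt.Println(cap(part), cap(full)) // 12 4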
I get that hand rolling assembly is fun but as a theoretical future maintainer I would much prefer C to ASM.
In my experience:
- It complicates builds by requiring a C toolchain
- It makes single, static binaries more difficult
- It makes portability more difficult (not that assembly is portable though)
- It causes difficult-to-debug issues (I recently ran into an issue where macOS signing changed, causing all my cgo binaries to be killed on startup)
- Debuggers don't work across Cgo boundaries (they do with go ASM!)
I think Dave Cheney said it best: https://dave.cheney.net/2016/01/18/cgo-is-not-go
I personally almost always use -ffast-math by default in my C programs that care about performance, because I almost never care enough about the loss in accuracy. The only case where I remember it mattering was when doing some random-number distribution tests where I cared about subnormals, and I got confused for a second because they didn't seem to exist (-ffast-math disables them on x86).
FWIW I have oodles of numerical C++ code where fast-math doesn't change the output.
https://godbolt.org/z/KjErzacfv
Edit: ...and I now realize who I responded to, I'm sure you already know this. :)
For instance, imagine I have auto-vectorized something and I check (manually, mind you) the asm and all is good. Then someone changes the algorithm slightly, or another engineer adds a layer of indirection for some unrelated purpose, or maybe the compiler updates its code paths in a way that misses some cases that were previously supported. And the optimization silently goes away.
I suppose -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math will get you most of the way there, and maybe an -ffinite-math-only when appropriate.
Generally, I would recommend against -ffast-math, mostly because it enables -ffinite-math-only, and that one can really blow up in your face. Most of the other flags (like -funsafe-math-optimizations) aren't that bad from an accuracy standpoint. Obviously you should not turn them on for code that you have actually tuned to minimize error, but in other cases they barely ever degrade the results.
For what architecture? What if this code is in a library that your users might want to run on Intel (both 32- and 64-bit), ARM, RISC-V, and s390x? Even if you learn assembly for all of these, how are you going to get access to an s390x IBM mainframe to test your code? What if a new architecture[1] gets popular in the next couple of years, and you won't have access to a CPU to test on?
Leaving this work to a compiler or architecture-independent functions / macros that use intrinsics under the hood frees you from having to think about all of that. As long as whatever the user is running on has decent compiler support, your code is going to work and be fast, even years later.
It's certainly not perfect though (in particular the final reduction/remainder handling).
Unfortunately Rust doesn't have a proper optimizing float type. I really wish there was a type FastF32 or something similar which may be optimized using the usual transformation rules of algebra (e.g. associative property, distributive property, x + y - y = x, etc).
There is fadd_fast and co, but those are UB on NaN/infinite input.
Edit2: (I’m throttled and I can’t post a new comment I guess? Makes me feel like my voice was stolen! I guess I’m not wanted around HN.)
Thanks for the correction, I would delete my comment but I found a bug in HN while updating it so I’ll preserve my stupidity here in duplicate for now.
Edit: weird, this was supposed to be an update to a previous comment I made, but this is a different comment now
TBH my takeaway was that it was more useful to use smaller vectors as a representation
It's quite telling that there is a #pragma omp simd to hint to a compiler to rewrite the loop.
Now I wonder what the state of polyhedral compilers is. It's been many years. And given the AI/LLM hype, they could really shine.
Have you seen the 2s code from C#?
Can this 37% truly be attributed to loop unrolling? One really notable difference is that the unrolled loop only compares against len(a), instead of both len(a) and len(b). I don't know enough about Go to know whether the compiler can optimize the comparison away, but in some other languages it would be significant.
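For what it's worth, a common Go trick for getting down to a single length comparison is to re-slice one operand up front; a sketch, not necessarily what the unrolled version in the post does:

    func dot(a, b []float32) float32 {
        b = b[:len(a)] // panics if b is shorter; hints to the compiler that
                       // b[i] is in range whenever a[i] is
        var sum float32
        for i := range a {
            sum += a[i] * b[i]
        }
        return sum
    }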
> Also, getting my hands dirty with some assembly sounds fun, so that's what I'm going to do.
I'd have loved to see the comparison between the asm that the compiler was generating and the bespoke asm written here. I'd bet that simply gussying up the generated asm results in a pretty sizable improvement (but obviously less impressive than by switching to SIMD).
[1] https://go.godbolt.org/z/63n6hTGGq (original) vs. https://go.godbolt.org/z/YYPrzjxP5 (capacity not limited)
> Well I had never seen that "full slice" expression syntax before.
Go's notion of capacity is pragmatic but confusing at the same time. I learned the hard way that the excess capacity is always available for the sake of optimization:
    a := []int{1, 2, 3, 4, 5}
    lo, hi := a[:2], a[2:]
    lo = append(lo, 6, 7, 8) // Oops, it tries to reuse `lo[2:5]`!
    fmt.Printf("%v %v\n", lo, hi) // Prints `[1 2 6 7 8] [6 7 8]`
While I do understand the rationale, it is too unintuitive because there is no indication of the excess capacity in this code. I would prefer `a[x:y]` to be a shorthand for `a[x:y:y]` instead. The `a[x:y:len(a)]` case is of course useful though, so maybe a different shorthand like `a[x:y:$]` could be added.

You are right that the final binary is free to turn `-ffast-math` on if you can verify that everything went okay. But almost no one would actually verify that. It's like the advice that you shouldn't write your own crypto code: it's fine if you know what you are doing, but almost no one does, so the advice is technically false but still worthwhile.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 (GCC), https://github.com/llvm/llvm-project/issues/57589 (LLVM)
Edit: the author has valid points against FFI here >>39110692
Capacity being conserved by slicing is important to the “slice tricks” of removing items, I assume.
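For example, the classic in-place delete relies on exactly that (a sketch; `remove` is my name for it):

    // remove deletes the element at index i without allocating: append
    // shifts s[i+1:] left within s's existing backing array.
    func remove(s []int, i int) []int {
        return append(s[:i], s[i+1:]...)
    }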
There, beautiful Go code from that point onwards.
Slices encapsulate a ubiquitous C pattern, where you pass to a function a pointer (to an initial array element) and a length or capacity (sometimes both). Go's slices capture this pattern directly; they can be thought of as something like:
    type Slice struct {
        Ptr *Elem
        Len int
        Cap int
    }
I love Go's slice syntax. It's the right piece of syntactic sugar. It removes tedium and room-for-mistakes from this prevalent C pattern. It lets me work as precisely with memory as I do in C, yet everything is simpler, lighter, and breezier.

For example, I'm making a game in Go. I don't want to be allocating memory throughout the game (which can cause frame drops), so instead I allocate giant arrays of memory on launch. Then I use slices to partition this memory as needed.
Some of these arrays get their elements re-accumulated at some interval (every level, or every frame, etc.). And so it works nicely to first set them to zero length:
    myMem = myMem[:0]

Notice that the myMem slice now has zero length, but still points to its same underlying array. Then I perform the accumulation:

    for ... {
        myMem = append(myMem, elem)
    }
Again, I don't want to be allocating in general, so I care very much that append continues to use myMem's preexisting capacity.

All this is to say, I don't see slices as being the way they are for the "sake of optimization." Rather, I see them as an ideal tool for working with, referring to, and partitioning memory.
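A compressed sketch of that pattern (names and sizes are mine, not the game's actual code):

    type particle struct{ x, y float32 }

    // One big allocation at launch...
    var backing = make([]particle, 0, 1<<16)

    // ...reused every frame: re-slice to zero length, then accumulate.
    func accumulate(spawned []particle) []particle {
        frame := backing[:0]
        for _, p := range spawned {
            frame = append(frame, p) // stays within backing's capacity
        }
        return frame
    }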
Functions with `-ffast-math` enabled still return fp values via the usual registers and in the usual formats. If some function `f` is expected to return -1.0 to 1.0 for particular inputs, `-ffast-math` can only make it return 1.001 or NaN instead. If another function without `-ffast-math` expects and doesn't verify f's return value, it will surely misbehave, but only because the original analysis of f no longer holds.
`-ffast-math` the compiler option is bad because this effect is not evident from the code. Anything visible in the code should be okay.
What's more, I have no complaints about cgo. It hasn't been a performance problem at all (my game runs smoothly at 240 Hz). And the interface for binding to C is quite simple and nice to work with (I made my own bindings to SDL2, OpenGL, and Steamworks).
Cgo did make cross-compiling trickier to set up, but I managed it with a fair amount of elbow grease (Windows and macOS builds on my Linux desktop).
    sum := float32(0)

over Go's zero-value default initialization, e.g.

    var sum float32

A stylistic nit, but I'm wondering if there was a good reason to do so?

To me, this becomes obvious if you read The Practice of Programming by Kernighan and Pike, the latter of whom was a co-designer of Go. If you read the book, which was written well before Go, and pay attention to how it uses C arrays, you can almost feel Go slices emerging. The slice syntax perfectly encapsulates the C array bookkeeping.
If you know how to rewrite the algorithm in such a way that your SIMD makes close-to-ideal utilization of the CPU ports, then it is practically impossible to beat. And I haven't seen a compiler (GCC, clang) do such a thing, at least not in the instances I had written. I've measured substantial improvements from such utilization of CPU-level microarchitectural details. So I don't think it's only the loop analysis; I think it's practically an impossible task for the compiler. Perhaps with the AI ...
    size_t trim_end(const char *p, size_t len) {
        while (len > 0) {
            if (p[len - 1] != ' ') break;
            --len;
        }
        return len;
    }
Conceptually this function accepts a slice `(p, len, cap)` and returns a slice `(p, len2, cap)` where `len2 <= len` and the capacity never changes. But the actual argument doesn't have `cap`, and the return value doesn't have `p`. Everything is implicit, and it's typical for C programmers to fully document and follow such implicit conventions. Go's slice operator can't have come out of such implicit practices, in my opinion.

In comparison, your claim would only make sense if the following were somehow normal:
    struct slice { const char *p; size_t len, cap; };

    struct slice trim_end(const struct slice *s) {
        struct slice out = *s;
        while (out.len > 0) {
            if (out.p[out.len - 1] != ' ') break;
            out = subslice(out, 0, out.len - 1);
        }
        return out;
    }
Note that the hypothetical `subslice` function call maps perfectly to the Go code `out[0:len(out)-1]`, and my complaint applies equally: there should be two clearly named variants of `subslice` that do or do not keep the capacity. But I hardly ever saw such a construction in C.

    var x string

when initializing empty (zero-value) vars, versus:

    x := "hello"

when initializing variables that should hold an initial value. To me as a Go programmer at least, this is more obvious and intuitive as to the intent of the declaration.
b) do you know the usual length of the slices? If so, can't you autogenerate (or manually write) a switch to do the math directly? How would that perform?
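Something like this hypothetical dispatch, assuming one length dominates and len(b) >= len(a):

    // dotSwitch special-cases a known common length so the hot path is a
    // straight line of multiply-adds; everything else falls back to the loop.
    func dotSwitch(a, b []float32) float32 {
        switch len(a) {
        case 4: // assumes len(b) >= 4
            return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
        default:
            var sum float32
            for i := range a {
                sum += a[i] * b[i]
            }
            return sum
        }
    }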
In D, for example, this works as most people would expect:
    import std.stdio;

    void main() {
        int[] a = [1, 2, 3, 4, 5];
        auto lo = a[0 .. 2], hi = a[2 .. $];
        lo ~= [6, 7, 8]; // new allocation
        writefln("%s %s", lo, hi); // prints [1, 2, 6, 7, 8] [3, 4, 5]
    }
You simply can't overwrite a backing array from a slice (unless you do unsafe stuff very explicitly).

I'm saying that if you look at how arrays get used in C, you'll see that you're usually passing around extra numbers with them. So Go added syntax that encapsulates this. And it encapsulates the most general case (hence both a length and a capacity, even though most cases in C only use one number).
Instead of passing (char *, int) in C, you just pass (slice) in Go. And the Go slicing operator gives a nice syntax for selecting ranges of buffers.
But a Go slice is just a pointer to an array underneath, and I think it's best to always think about it that way. And then it's natural that mySlice[:2] keeps the same underlying capacity, and that future appends mutate that memory. Defaulting mySlice[:2] to mean mySlice[:2:2] seems less convenient overall to me (though it prevents mistakes like the one in your original hi/lo example; those come from not thinking of slices in terms of the underlying array).
Maybe that's a point of disagreement here. I don't exactly see (char *, int) as a single entity, so it cannot be replaced with a slice (a definitely single entity) in my view.
They do appear together in arguments, but they have to be separate variables when you manipulate them and no C syntax will suggest that they are indeed related. So you have to rely on conventions (say, `name`, `name_len` and `name_cap`) which tend to be fragile. You can't expect such annoyance from Go slices, so I argue they are necessarily different from each other.
The problem is that Go does not do that: because its designers could not be arsed to provide a separate vector type, it's (ptr, int, int), and now you can start stomping on out-of-slice memory unless the creator of the slice used the extended slicing form.
For the float-based version[0] I had to break out the unstable portable_simd to get it to vectorize. Most of the function ends up being setup, but actually doing the calculation is simple, and basically the same as the non-SIMD version. I've never used the portable SIMD stuff before, and it was quite pleasant to use.
For the integer-based version, I started with the simple naive approach[1], and that vectorized to a pretty good degree on stable. However, it doesn't use the dot-product instruction. For that, I think we need to use nightly and go a bit manual[2]. Unsurprisingly, it mostly ends up looking like the float version as a fair chunk is just setup. I didn't bother here, but it should probably be using feature detection to make sure the instruction exists.
[0] https://godbolt.org/z/Gdv8azorW [1] https://godbolt.org/z/d8jv3ofYo [2] https://godbolt.org/z/4oYEnKTbf