https://developercommunity.visualstudio.com/t/Bad-codegen-du...
I'd rather see new languages focus on making better explicit SIMD abstractions à la Intel's ISPC, rather than writing yet another magic vectorizer that only actually works in trivial cases.
Then it's just a codegen problem.
But yes, ultimately, the user needs to be aware of how the language works, what is parallelizable and what isn't, and of the cost of the operations that they ask their computer to execute.
Rust has unstable portable SIMD and a few third-party crates; C++ has that as well; C# has stable portable SIMD plus a small out-of-the-box BLAS-like library for the most common tasks (SoftMax, Magnitude, etc. over spans of floats, instead of writing them by hand), and it even exercises PackedSIMD when run in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality stands open given the planned changes to the unsafe API).
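To make the C# claim concrete, here is a minimal sketch of that BLAS-like helper layer, the TensorPrimitives class from the System.Numerics.Tensors package (the package and method surface here are from .NET 8; treat the version specifics as my assumption, not something stated in the thread):

```csharp
using System;
using System.Numerics.Tensors; // System.Numerics.Tensors NuGet package (.NET 8)

class SoftMaxDemo
{
    static void Main()
    {
        ReadOnlySpan<float> logits = stackalloc float[] { 1.0f, 2.0f, 3.0f };
        Span<float> probs = stackalloc float[3];

        // One vectorized call over the whole span; no hand-written SIMD loop.
        TensorPrimitives.SoftMax(logits, probs);

        Console.WriteLine(string.Join(", ", probs.ToArray()));
    }
}
```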
Among these, Go is uniquely disadvantaged. And if that's not enough, you may want to visit the 1BRC (One Billion Row Challenge) discussions and see that Go struggles to get anywhere close to the 2s mark while both C# and C++ blaze past it:
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
https://learn.microsoft.com/en-us/dotnet/api/system.runtime....
Examples of usage:
- https://github.com/U8String/U8String/blob/main/Sources/U8Str...
- https://github.com/nietras/1brc.cs/blob/main/src/Brc/BrcAccu...
- https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
(and many more if you search GitHub for uses of Vector128/256<byte> and the like!)
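For a flavor of what that Vector128/256<byte> style looks like, here is a hedged sketch of counting a byte with the stable portable System.Runtime.Intrinsics API; the CountByte helper is my own illustration, not code taken from the repos linked above:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class ByteSearch
{
    // Counts occurrences of `needle` 32 bytes at a time with the portable
    // Vector256 API, then finishes the remainder with a scalar loop.
    public static int CountByte(ReadOnlySpan<byte> haystack, byte needle)
    {
        int count = 0, i = 0;

        if (Vector256.IsHardwareAccelerated)
        {
            ref byte start = ref MemoryMarshal.GetReference(haystack);
            var target = Vector256.Create(needle);

            for (; i + Vector256<byte>.Count <= haystack.Length; i += Vector256<byte>.Count)
            {
                var chunk = Vector256.LoadUnsafe(ref start, (nuint)i);
                // Matching lanes become 0xFF; the mask has one bit per byte lane.
                uint mask = Vector256.ExtractMostSignificantBits(Vector256.Equals(chunk, target));
                count += BitOperations.PopCount(mask);
            }
        }

        for (; i < haystack.Length; i++)
            if (haystack[i] == needle)
                count++;

        return count;
    }
}
```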
For instance, imagine the compiler auto-vectorizes something and I check (manually, mind you) the asm and all is good. Then someone changes the algorithm slightly, or another engineer adds a layer of indirection for some unrelated purpose, or maybe the compiler updates its code paths and misses some cases that were previously supported. And the optimization goes away silently.
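A sketch of that contrast, using C#'s variable-width System.Numerics.Vector<T> (both method names are hypothetical illustrations of mine): in the scalar version, SIMD codegen is entirely at the optimizer's mercy; in the explicit version, the vector width is written into the source, so an unrelated refactor can't silently drop the parallelism:

```csharp
using System;
using System.Numerics;

static class Scaling
{
    // Scalar loop: whether this ever becomes SIMD is up to the compiler,
    // and a small refactor or a compiler update can quietly undo it.
    public static void ScaleScalar(Span<float> data, float factor)
    {
        for (int i = 0; i < data.Length; i++)
            data[i] *= factor;
    }

    // Explicit loop: the vector width lives in the source, so the
    // parallelism survives refactors instead of evaporating silently.
    public static void ScaleExplicit(Span<float> data, float factor)
    {
        int i = 0;
        var f = new Vector<float>(factor);

        for (; i + Vector<float>.Count <= data.Length; i += Vector<float>.Count)
        {
            var v = new Vector<float>(data.Slice(i));
            (v * f).CopyTo(data.Slice(i));
        }

        for (; i < data.Length; i++)
            data[i] *= factor;
    }
}
```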
Have you seen the 2-second code from C#?