https://developercommunity.visualstudio.com/t/Bad-codegen-du...
I'd rather see new languages focus on making better explicit SIMD abstractions à la Intel's ISPC, rather than writing yet another magic vectorizer that only actually works in trivial cases.
Then it's just a codegen problem.
But yes, ultimately, the user needs to be aware of how the language works, what is parallelizable and what isn't, and of the cost of the operations that they ask their computer to execute.
Rust has unstable portable SIMD and a few third-party crates; C++ has that as well; C# has stable portable SIMD plus a small out-of-the-box BLAS-like library for the most common tasks (SoftMax, Magnitude, etc. over spans of floats, instead of writing them by hand), and it even exercises PackedSIMD when run in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality stands open given the planned changes to the unsafe API).
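To make the C# claim concrete, here is a minimal sketch of that BLAS-like helper layer, the TensorPrimitives class from the System.Numerics.Tensors package (the package and method surface here are from .NET 8; treat the version specifics as my assumption, not something stated in the thread):

```csharp
using System;
using System.Numerics.Tensors; // System.Numerics.Tensors NuGet package (.NET 8)

class SoftMaxDemo
{
    static void Main()
    {
        ReadOnlySpan<float> logits = stackalloc float[] { 1.0f, 2.0f, 3.0f };
        Span<float> probs = stackalloc float[3];

        // One vectorized call over the whole span; no hand-written SIMD loop.
        TensorPrimitives.SoftMax(logits, probs);

        Console.WriteLine(string.Join(", ", probs.ToArray()));
    }
}
```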
Among these, Go is uniquely disadvantaged. And if that's not enough, you may want to visit the 1BRC (One Billion Row Challenge) discussions and see that Go struggles to get anywhere close to the 2s mark while both C# and C++ blaze past it:
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
https://learn.microsoft.com/en-us/dotnet/api/system.runtime....
Examples of usage:
- https://github.com/U8String/U8String/blob/main/Sources/U8Str...
- https://github.com/nietras/1brc.cs/blob/main/src/Brc/BrcAccu...
- https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
(and many more if you search GitHub for uses of Vector128/256<byte> and the like!)
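For a flavor of what that Vector128/256<byte> style looks like, here is a hedged sketch of counting a byte with the stable portable System.Runtime.Intrinsics API; the CountByte helper is my own illustration, not code taken from the repos linked above:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class ByteSearch
{
    // Counts occurrences of `needle` 32 bytes at a time with the portable
    // Vector256 API, then finishes the remainder with a scalar loop.
    public static int CountByte(ReadOnlySpan<byte> haystack, byte needle)
    {
        int count = 0, i = 0;

        if (Vector256.IsHardwareAccelerated)
        {
            ref byte start = ref MemoryMarshal.GetReference(haystack);
            var target = Vector256.Create(needle);

            for (; i + Vector256<byte>.Count <= haystack.Length; i += Vector256<byte>.Count)
            {
                var chunk = Vector256.LoadUnsafe(ref start, (nuint)i);
                // Matching lanes become 0xFF; the mask has one bit per byte lane.
                uint mask = Vector256.ExtractMostSignificantBits(Vector256.Equals(chunk, target));
                count += BitOperations.PopCount(mask);
            }
        }

        for (; i < haystack.Length; i++)
            if (haystack[i] == needle)
                count++;

        return count;
    }
}
```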
For instance, imagine the compiler auto-vectorizes something and I check (manually, mind you) the asm and all is good. Then someone changes the algorithm slightly, or another engineer adds a layer of indirection for some unrelated purpose, or maybe the compiler updates its code paths and misses some cases that were previously supported. And the optimization goes away silently.
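A sketch of that contrast, using C#'s variable-width System.Numerics.Vector<T> (both method names are hypothetical illustrations of mine): in the scalar version, SIMD codegen is entirely at the optimizer's mercy; in the explicit version, the vector width is written into the source, so an unrelated refactor can't silently drop the parallelism:

```csharp
using System;
using System.Numerics;

static class Scaling
{
    // Scalar loop: whether this ever becomes SIMD is up to the compiler,
    // and a small refactor or a compiler update can quietly undo it.
    public static void ScaleScalar(Span<float> data, float factor)
    {
        for (int i = 0; i < data.Length; i++)
            data[i] *= factor;
    }

    // Explicit loop: the vector width lives in the source, so the
    // parallelism survives refactors instead of evaporating silently.
    public static void ScaleExplicit(Span<float> data, float factor)
    {
        int i = 0;
        var f = new Vector<float>(factor);

        for (; i + Vector<float>.Count <= data.Length; i += Vector<float>.Count)
        {
            var v = new Vector<float>(data.Slice(i));
            (v * f).CopyTo(data.Slice(i));
        }

        for (; i < data.Length; i++)
            data[i] *= factor;
    }
}
```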
Have you seen the 2-second code from C#?