zlacker

[parent] [thread] 12 comments
1. activa+(OP)[view] [source] 2023-05-19 21:14:04
Amazing!

Oddly, two tests fail for me with Brave (Version 1.51.118 / Chromium: 113.0.5672.126 (arm64)) on macOS Ventura 13.3.1

- pow([0], [0]) gradient, with "Expected «-Infinity» to be close to «0» (diff: < 0.0000005)"

- xlogy([0], [0.30000001192092896]) gradient with "Expected «0» to be close to «-1.2039728164672852»"

replies(3): >>pauldd+Z6 >>Muffin+ff >>praecl+as
2. pauldd+Z6[view] [source] 2023-05-19 22:09:19
>>activa+(OP)
Same with Chrome 113.0.5672.92 (arm64) on Ventura 13.2.

Safari 16.3 has 4 failures: "webgpu is supported", "tensor is webgpu", "xlogy([0], [0]) gradient", "xlogy([0], [0.30000001192092896]) gradient"

replies(1): >>praecl+ls
3. Muffin+ff[view] [source] 2023-05-19 23:13:21
>>activa+(OP)
https://praeclarum.org/webgpu-torch/tests/

This is a dumb question but... are GPUs really that much faster than CPUs specifically at the math functions tested on this page?

xlogy trunc tan/tanh sub square sqrt sin/sinc/silu/sinh sign sigmoid sqrt/rsqrt round relu reciprocal rad2deg pow positive neg mul logaddexp/logaddexp2 log/log1p/log10/log2 ldexp hypot frac floor expm1 exp2 exp div deg2rad cos/cosh copysign ceil atan/atan2 asinh/asin add acosh/acos abs

Those are the types of math GPUs are good at? I thought they were better at a different kind of math, like matrices or something?

replies(3): >>hedgeh+dg >>nicoco+fg >>modele+Hq
4. hedgeh+dg[view] [source] [discussion] 2023-05-19 23:22:57
>>Muffin+ff
The individual operations are relatively tiny, but they run on the GPU to avoid lots of copies back and forth between CPU and GPU memory.
5. nicoco+fg[view] [source] [discussion] 2023-05-19 23:23:24
>>Muffin+ff
GPUs are usually not faster at doing the operation, but excel at doing the operation in parallel on a gazillion elements. Matrix math is mostly additions and multiplications.
replies(2): >>praecl+js >>hutzli+EV
6. modele+Hq[view] [source] [discussion] 2023-05-20 01:25:22
>>Muffin+ff
GPUs are about 100 times faster than CPUs for any type of single-precision floating point math operation. The catch is that you have to do roughly similar math operations on 10k+ items in parallel before the parallelism and memory bandwidth advantages of the GPU outweigh the latency and single-threaded performance advantages of the CPU. Of course this is achievable in graphics applications with millions of triangles and millions of pixels, and in machine learning applications with millions or billions of neurons.

IMO almost any application that is bottlenecked by CPU performance can be recast to use GPUs effectively. But it's rarely done because GPUs aren't nearly as standardized as CPUs and the developer tools are much worse, so it's a lot of effort for a faster but much less portable outcome.

replies(1): >>HexDec+Wx
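(The break-even point described above can be sketched with back-of-the-envelope arithmetic. The numbers in this JavaScript sketch are made-up assumptions, not measurements, chosen only to be plausible orders of magnitude:)

```javascript
// Toy cost model: the GPU pays a fixed dispatch/copy overhead, then wins on
// per-element throughput. All constants below are illustrative assumptions.
const gpuOverheadNs = 50_000; // ~50 microseconds of launch + copy latency
const cpuNsPerElem = 1.0;     // CPU: 1 element per nanosecond
const gpuNsPerElem = 0.01;    // GPU: 100x higher per-element throughput

const cpuTime = (n) => n * cpuNsPerElem;
const gpuTime = (n) => gpuOverheadNs + n * gpuNsPerElem;

// Crossover: smallest n where the GPU comes out ahead.
const crossover = gpuOverheadNs / (cpuNsPerElem - gpuNsPerElem);
console.log(Math.ceil(crossover)); // 50506 elements under these assumptions
```

Even with a 100x throughput edge, the fixed overhead means the GPU only wins past tens of thousands of elements, which matches the 10k+ rule of thumb above.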
7. praecl+as[view] [source] 2023-05-20 01:45:21
>>activa+(OP)
Yeah so the thing is WebGPU doesn’t correctly support IEEE floating point. Particularly, 0 is often substituted for +-Inf and NaN. See section 14.6 of the spec.

https://www.w3.org/TR/WGSL/#floating-point-evaluation

It’s not such a problem for real nets since you avoid those values like the plague. But the tests catch them, and I need to make the tests more tolerant. Thanks for the results!
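(For the curious, the IEEE 754 values involved in the failing tests are easy to see in plain JavaScript, whose numbers do follow the standard; which of them a WGSL implementation may replace with 0 is governed by the floating-point evaluation rules in the spec linked above:)

```javascript
// d/dx xlogy(x, y) = log(y); at y = 0.3 that's about -1.204, the value the
// failing xlogy gradient test expected but the WebGPU backend returned 0 for.
const xlogyGradX = Math.log(0.3); // ≈ -1.2039728

// log(0) is -Infinity in IEEE 754. A backend that flushes it to 0 reports 0
// where a test expects -Infinity (as in the pow([0],[0]) gradient failure).
const logZero = Math.log(0); // -Infinity

// 0 * Infinity is NaN, another value an implementation may quietly turn into 0.
const zeroTimesInf = 0 * Infinity; // NaN
```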

8. praecl+js[view] [source] [discussion] 2023-05-20 01:48:07
>>nicoco+fg
Yeah this is the trick. You need to maximize the use of workgroup parallelism and also lay things out in memory for those kernels to access efficiently. It’s a bit of a balancing act and I’ll be working on benchmarks to test out different strategies.
9. praecl+ls[view] [source] [discussion] 2023-05-20 01:48:55
>>pauldd+Z6
Sorry, Safari does not support WebGPU yet. Please join me in writing to Apple and requesting it.
10. HexDec+Wx[view] [source] [discussion] 2023-05-20 03:11:16
>>modele+Hq
Are there any standardised approaches for this? I fail to see how one would put branchy CPU code (parsing, etc.) on GPUs effectively.
replies(2): >>raphli+hy >>kaliqt+jB
11. raphli+hy[view] [source] [discussion] 2023-05-20 03:16:29
>>HexDec+Wx
It is possible but you have to do things very differently, for example use monoids. There are a few compilers implemented on GPU, including Aaron Hsu's co-dfns and Voetter's compiler project[1]. The parentheses matching problem itself (the core of parsing) has long known efficient parallel algorithms and those have been ported to compute shaders[2] (disclosure: blatant self-promotion).

[1]: https://dl.acm.org/doi/pdf/10.1145/3528416.3530249

[2]: https://arxiv.org/pdf/2205.11659.pdf
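(A hint at why parentheses matching parallelizes: balance checking reduces to a prefix sum, i.e. a scan, which GPUs compute in O(log n) parallel steps. A sequential JavaScript sketch of that formulation; on a real GPU the depths array would come from a parallel scan:)

```javascript
// Balance-check a paren string via nesting depths. The depths array is the
// prefix sum of +1 for '(' and -1 for ')' -- exactly the kind of scan GPUs
// run in parallel. Here it is computed sequentially for clarity.
function isBalanced(s) {
  let depth = 0;
  const depths = [];
  for (const ch of s) {
    depth += ch === '(' ? 1 : -1;
    depths.push(depth);
  }
  // Balanced iff the depth never dips below zero and ends at zero.
  return depths.every((d) => d >= 0) && depth === 0;
}

console.log(isBalanced('(()())')); // true
console.log(isBalanced(')('));     // false
```

The full matching problem (pairing each open paren with its close) needs more machinery than this, but it builds on the same scan primitive.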

12. kaliqt+jB[view] [source] [discussion] 2023-05-20 03:55:02
>>HexDec+Wx
I think WebGPU will help change a lot of this. Finally: portable code that is performant and runs virtually anywhere. It's the same reason web apps have taken off, and the same idea of deploying to and from web platforms, e.g. write for the web and deploy to native.

I think WebGPU will be that universal language everyone speaks, and I think also that this will help get rid of Nvidia's monopoly on GPU compute.

13. hutzli+EV[view] [source] [discussion] 2023-05-20 09:31:28
>>nicoco+fg
The main advantage is parallelism, but on top of that, common math operations are hardware-accelerated on the GPU, so they should indeed run faster just by being on the GPU.