[return to "PyTorch for WebGPU"]
1. activa+S4 2023-05-19 21:14:04
>>mighdo+(OP)
Amazing!

Oddly, two tests fail for me with Brave (Version 1.51.118 / Chromium: 113.0.5672.126 (arm64)) on macOS Ventura 13.3.1

- pow([0], [0]) gradient, with "Expected «-Infinity» to be close to «0» (diff: < 0.0000005)"

- xlogy([0], [0.30000001192092896]) gradient with "Expected «0» to be close to «-1.2039728164672852»"
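For reference, both expected values appear to come straight from desktop PyTorch: PyTorch defines the gradient of pow as 0 wherever the exponent is 0, and the gradient of xlogy(x, y) with respect to x is log(y), with log(0.3) ≈ -1.2039728. A quick sanity check against regular PyTorch (assuming that's the reference the test suite compares with):

    import torch

    # pow([0], [0]): PyTorch treats a zero exponent specially, so the
    # gradient is 0 rather than the -Infinity Brave computes here.
    x = torch.tensor([0.0], requires_grad=True)
    torch.pow(x, torch.tensor([0.0])).sum().backward()
    print(x.grad)  # tensor([0.])

    # xlogy([0], [0.3]): d/dx x*log(y) = log(y) = log(0.3) ≈ -1.2039728
    x = torch.tensor([0.0], requires_grad=True)
    torch.xlogy(x, torch.tensor([0.30000001192092896])).sum().backward()
    print(x.grad)  # tensor([-1.2040])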

2. Muffin+7k 2023-05-19 23:13:21
>>activa+S4
https://praeclarum.org/webgpu-torch/tests/

This is a dumb question but... are GPUs really that much faster than CPUs specifically at the math functions tested on this page?

xlogy trunc tan/tanh sub square sqrt sin/sinc/silu/sinh sign sigmoid sqrt/rsqrt round relu reciprocal rad2deg pow positive neg mul logaddexp/logaddexp2 log/log1p/log10/log2 ldexp hypot frac floor expm1 exp2 exp div deg2rad cos/cosh copysign ceil atan/atan2 asinh/asin add acosh/acos abs

Are those the kinds of math GPUs are good at? I thought they were better at a different kind of math, like matrix operations or something?

3. modele+zv 2023-05-20 01:25:22
>>Muffin+7k
GPUs have roughly 100 times the throughput of CPUs for single-precision floating-point math, more or less regardless of the operation. The catch is that you have to apply roughly the same operation to 10k+ items in parallel before the parallelism and memory-bandwidth advantages of the GPU outweigh the latency and single-threaded-performance advantages of the CPU. That is easily achieved in graphics applications, with millions of triangles and millions of pixels, and in machine learning applications, with millions or billions of neurons.
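You can see the crossover with a rough benchmark. Here's a sketch using regular PyTorch (CUDA if available; the sizes, the op, and the helper are mine to illustrate the point, and the exact numbers depend heavily on hardware):

    import time
    import torch

    # Use an accelerator if one is available; CPU is the baseline.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    def bench_ms(fn, iters=20):
        """Median wall-clock time of fn() in milliseconds."""
        for _ in range(3):  # warm up caches and GPU queues
            fn()
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            if device == "cuda":
                torch.cuda.synchronize()  # GPU launches are async
            times.append((time.perf_counter() - t0) * 1e3)
        return sorted(times)[len(times) // 2]

    for n in (1_000, 100_000, 10_000_000):
        x = torch.rand(n)
        y = x.to(device)
        print(f"n={n:>10,}  cpu {bench_ms(lambda: torch.sin(x)):8.3f} ms  "
              f"{device} {bench_ms(lambda: torch.sin(y)):8.3f} ms")

Below roughly 10k elements the fixed launch overhead dominates and the CPU wins; well above it, the GPU's bandwidth takes over.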

IMO almost any application that is bottlenecked by CPU performance can be recast to use GPUs effectively. But it's rarely done because GPUs aren't nearly as standardized as CPUs and the developer tools are much worse, so it's a lot of effort for a faster but much less portable outcome.

4. HexDec+OC 2023-05-20 03:11:16
>>modele+zv
Are there any standardised approaches for this? I have a hard time imagining how one would run branchy CPU code like parsing effectively on a GPU.

5. raphli+9D 2023-05-20 03:16:29
>>HexDec+OC
It is possible, but you have to do things very differently, for example by using monoids. There are a few compilers implemented on GPU, including Aaron Hsu's co-dfns and Voetter's compiler project[1]. The parentheses-matching problem itself (the core of parsing) has had efficient parallel algorithms for a long time, and those have been ported to compute shaders[2] (disclosure: blatant self-promotion); a toy version is sketched below the references.

[1]: https://dl.acm.org/doi/pdf/10.1145/3528416.3530249

[2]: https://arxiv.org/pdf/2205.11659.pdf
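To make the parentheses-matching point concrete, here's a toy CPU version that uses only data-parallel building blocks (elementwise map, prefix sum, per-depth grouping), with NumPy's cumsum standing in for a GPU scan. The function is my own illustration, not code from [2], and the real compute-shader versions are considerably more involved:

    import numpy as np

    def match_parens(s: str) -> dict[int, int]:
        """Map each ')' index to its matching '(' index."""
        a = np.frombuffer(s.encode(), dtype=np.uint8)
        delta = np.where(a == ord("("), 1, np.where(a == ord(")"), -1, 0))
        inclusive = np.cumsum(delta)        # nesting depth after each char
        if inclusive[-1] != 0 or inclusive.min() < 0:
            raise ValueError("unbalanced input")
        depth = inclusive - delta           # exclusive scan: depth before each char
        opens = np.flatnonzero(delta == 1)
        closes = np.flatnonzero(delta == -1)
        # A ')' closes the level just below the depth it sits at, and at any
        # fixed depth opens and closes alternate, so the k-th open at depth d
        # pairs with the k-th close at depth d.
        open_key, close_key = depth[opens], depth[closes] - 1
        match = {}
        for d in range(int(inclusive.max())):   # a GPU would sort/segment instead
            o = opens[open_key == d]
            c = closes[close_key == d]
            match.update(zip(c.tolist(), o.tolist()))
        return match

    print(match_parens("(()())"))  # {5: 0, 2: 1, 4: 3}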
