
[return to "PyTorch for WebGPU"]
1. activa+S4 2023-05-19 21:14:04
>>mighdo+(OP)
Amazing!

Oddly, two tests fail for me with Brave (Version 1.51.118 / Chromium: 113.0.5672.126 (arm64)) on macOS Ventura 13.3.1:

- pow([0], [0]) gradient, with "Expected «-Infinity» to be close to «0» (diff: < 0.0000005)"

- xlogy([0], [0.30000001192092896]) gradient with "Expected «0» to be close to «-1.2039728164672852»"
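
For anyone wondering where those two numbers come from, the textbook gradient formulas reproduce them with plain JS/TS arithmetic (this is just a sanity check on the values in the messages, not webgpu-torch code):

    // d/dy pow(x, y) = pow(x, y) * ln(x); at x = 0, y = 0 the formula gives:
    const dPowDy = Math.pow(0, 0) * Math.log(0);    // 1 * -Infinity = -Infinity

    // d/dx xlogy(x, y) = ln(y); at y = 0.30000001192092896:
    const dXlogyDx = Math.log(0.30000001192092896); // roughly -1.2039728

    console.log(dPowDy, dXlogyDx);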

2. Muffin+7k 2023-05-19 23:13:21
>>activa+S4
https://praeclarum.org/webgpu-torch/tests/

This is a dumb question but... are GPUs really that much faster than CPUs specifically at the math functions tested on this page?

xlogy trunc tan/tanh sub square sqrt sin/sinc/silu/sinh sign sigmoid sqrt/rsqrt round relu reciprocal rad2deg pow positive neg mul logaddexp/logaddexp2 log/log1p/log10/log2 ldexp hypot frac floor expm1 exp2 exp div deg2rad cos/cosh copysign ceil atan/atan2 asinh/asin add acosh/acos abs

Those are the types of math GPUs are good at? I thought they were better at a different kind of math, like matrices or something?

3. nicoco+7l 2023-05-19 23:23:24
>>Muffin+7k
GPUs are usually not faster at any single operation; they excel at doing the same operation in parallel on a gazillion elements. Matrix math is mostly additions and multiplications.
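
A concrete way to see it (plain TypeScript, not from any library): every output element of a matrix multiply is its own independent chain of multiplies and adds, so a GPU can hand each (row, col) pair to its own thread, while a CPU walks through them a few at a time.

    // n x n row-major matrix multiply: each output element is independent.
    function matmul(a: Float32Array, b: Float32Array, n: number): Float32Array {
      const out = new Float32Array(n * n);
      for (let row = 0; row < n; row++) {
        for (let col = 0; col < n; col++) {
          // Nothing in this inner computation depends on any other
          // (row, col) pair, so a GPU can give each one its own thread.
          let sum = 0;
          for (let k = 0; k < n; k++) {
            sum += a[row * n + k] * b[k * n + col];
          }
          out[row * n + col] = sum;
        }
      }
      return out;
    }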
4. praecl+bx 2023-05-20 01:48:07
>>nicoco+7l
Yeah this is the trick. You need to maximize the use of workgroup parallelism and also lay things out in memory for those kernels to access efficiently. It’s a bit of a balancing act and I’ll be working on benchmarks to test out different strategies.
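
For the curious, here is a minimal sketch of what those two knobs look like for a plain elementwise kernel (illustrative WGSL of my own, not the library's actual generated code): @workgroup_size controls how many invocations run together, and indexing the buffers by the global invocation id keeps each workgroup's memory accesses contiguous.

    // Hypothetical elementwise relu kernel in WGSL, embedded in TypeScript.
    const reluShader = /* wgsl */ `
      @group(0) @binding(0) var<storage, read> inBuf: array<f32>;
      @group(0) @binding(1) var<storage, read_write> outBuf: array<f32>;

      @compute @workgroup_size(64) // 64 invocations per workgroup
      fn main(@builtin(global_invocation_id) id: vec3<u32>) {
        let i = id.x;
        if (i < arrayLength(&inBuf)) {
          outBuf[i] = max(inBuf[i], 0.0); // one element per invocation
        }
      }
    `;

    // Host side: one invocation per element, rounded up to whole workgroups.
    const elementCount = 1_000_000;
    const workgroupCount = Math.ceil(elementCount / 64);
    // pass.dispatchWorkgroups(workgroupCount) then launches all of them at once.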