zlacker

PyTorch for WebGPU

submitted by mighdo+(OP) on 2023-05-19 20:42:43 | 313 points 74 comments
[view article] [source]

3. replet+4a[view] [source] 2023-05-19 21:52:01
>>mighdo+(OP)
Impressive work.

A number of test failures for me on Chromium 113.0.5672.63 (ungoogled-chromium) on macOS Ventura 13.3.1: https://pastebin.com/eM6ZA3j2

I'll open a ticket if it helps.

9. newhou+Zd[view] [source] 2023-05-19 22:28:31
>>mighdo+(OP)
I'm excited about this for probably different reasons than most: I think TypeScript could be a more ergonomic way to develop ML models than Python, because you can automatically infer and check tensor dimensions while you write code! Compare this to the mess of comments you usually see in PyTorch code telling you that x is of shape [x, y, z].

  // An empty 3x4 matrix
  const tensorA = tensor([3, 4])
  
  // An empty 4x5 matrix
  const tensorB = tensor([4, 5])

  const good = multiplyMatrix(tensorA, tensorB);
        ^
        Inferred type is Tensor<readonly [3, 5]>
  
  const bad = multiplyMatrix(tensorB, tensorA);
                             ^^^^^^^
                             Argument of type 'Tensor<readonly [4, 5]>' is not 
                             assignable to parameter of type '[never, "Differing 
                             types", 3 | 5]'.(2345)
I prototyped this for PotatoGPT [1], and a kind stranger on the internet wrote up a more extensive take [2]. You can play with an early version in the TypeScript playground here [3] (a Twitter shortlink, for brevity).

[1] https://github.com/newhouseb/potatogpt

[2] https://sebinsua.com/type-safe-tensors

[3] https://t.co/gUzzTl4AAN

◧◩
12. whimsi+Nh[view] [source] [discussion] 2023-05-19 22:53:13
>>newhou+Zd
That work looks really interesting! I am also excited about type safety when it comes to tensors. My understanding was that this type-safe approach to tensor shapes had run into trouble because it is difficult (impossible, maybe?) to reason about the shapes produced by some common operators at compile time. But perhaps those operators are not really necessary. [0]

Some sort of typed 'named tensor' that could be combined with einsum notation at runtime would be awesome, e.g. (I don't really know TS/JS well, so pseudocode):

  import * as t from 'pytorch'
  const nn = t.nn

  const tensorA: Tensor<[Batch, Seq, Emb]> = t.randn([10, 10, 10]) // initialize tensor
  const transformLayer = nn.Einsum('(Batch, Seq, Emb), (Emb) -> (Batch, Seq)')

  const tensorB: Tensor<[Emb2]> = t.randn([20])

  const transformedOutput = transformLayer(tensorA, tensorB) // type error: Emb2 does not match Emb

[0]: https://github.com/pytorch/pytorch/issues/26889
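
For what it's worth, the axis-name half of this seems expressible today. A rough sketch, with entirely hypothetical names (Dim, contract) and no real library assumed: axis names become branded number types, so the compiler can tell Emb and Emb2 apart even though both are just numbers at runtime.

  // Sketch only: branded dimension types.
  type Dim<Name extends string> = number & { readonly __axis: Name };
  type Batch = Dim<'Batch'>;
  type Seq = Dim<'Seq'>;
  type Emb = Dim<'Emb'>;
  type Emb2 = Dim<'Emb2'>;

  interface Tensor<Axes extends readonly unknown[]> {
    readonly shape: Axes;
  }

  // Contract Emb away: ([Batch, Seq, Emb], [Emb]) -> [Batch, Seq].
  declare function contract<B, S, E>(
    a: Tensor<readonly [B, S, E]>,
    b: Tensor<readonly [E]>
  ): Tensor<readonly [B, S]>;

  declare const tensorA: Tensor<readonly [Batch, Seq, Emb]>;
  declare const tensorB: Tensor<readonly [Emb2]>;

  const out = contract(tensorA, tensorB); // error: Emb2 is not Emb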
15. tachim+Vj[view] [source] 2023-05-19 23:11:35
>>mighdo+(OP)
What's the reason to run pytorch directly on WebGPU vs using ONNX on WebGPU (e.g. with https://github.com/webonnx/wonnx)?
◧◩
16. Muffin+7k[view] [source] [discussion] 2023-05-19 23:13:21
>>activa+S4
https://praeclarum.org/webgpu-torch/tests/

This is a dumb question but... are GPUs really that much faster than CPUs specifically at the math functions tested on this page?

xlogy, trunc, tan/tanh, sub, square, sqrt, sin/sinc/silu/sinh, sign, sigmoid, sqrt/rsqrt, round, relu, reciprocal, rad2deg, pow, positive, neg, mul, logaddexp/logaddexp2, log/log1p/log10/log2, ldexp, hypot, frac, floor, expm1, exp2, exp, div, deg2rad, cos/cosh, copysign, ceil, atan/atan2, asinh/asin, add, acosh/acos, abs

Those are the types of math GPUs are good at? I thought they were better at a different kind of math, like matrices or something?

◧◩
23. tzheng+2o[view] [source] [discussion] 2023-05-19 23:54:54
>>newhou+Zd
Just a little pushback here: I think you've struck on the right theme, that a programming language could fill this gap. However, I wonder if new domain-specific languages will eventually be the more elegant solution. Think Modular's Mojo [1] or Meta's KNYFE [2], mentioned earlier this week.

[1] https://www.modular.com/mojo

[2] https://ai.facebook.com/blog/meta-training-inference-acceler...

◧◩
34. praecl+2x[view] [source] [discussion] 2023-05-20 01:45:21
>>activa+S4
Yeah, so the thing is WebGPU doesn’t correctly support IEEE floating point. In particular, 0 is often substituted for ±Inf and NaN. See section 14.6 of the spec.

https://www.w3.org/TR/WGSL/#floating-point-evaluation

It’s not such a problem for real nets, since you avoid those values like the plague. But the tests catch them, and I need to make the tests tolerant. Thanks for the results!
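
Something along these lines for the comparison, say (a sketch, not webgpu-torch's actual test code):

  // Where the reference value is NaN or +/-Infinity, a WGSL
  // implementation may legally return 0, so accept that too.
  function closeEnough(actual: number, expected: number, eps = 1e-6): boolean {
    if (Number.isNaN(expected)) return Number.isNaN(actual) || actual === 0;
    if (!Number.isFinite(expected)) return actual === expected || actual === 0;
    return Math.abs(actual - expected) <= eps * Math.max(1, Math.abs(expected));
  }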

42. infraw+oC[view] [source] 2023-05-20 03:05:06
>>mighdo+(OP)
This is really nice! I have been working on getting ANN search working in the browser ([1] demo, [2] WIP repo) and would love to swap out ONNX for the embedding generation.

[1] https://anansi.pages.dev/ [2] https://github.com/infrawhispers/anansi/tree/main/embedds/li...

Privacy-focused semantic search / ML at the edge is looking brighter every day.

◧◩◪◨⬒
46. raphli+9D[view] [source] [discussion] 2023-05-20 03:16:29
>>HexDec+OC
It is possible, but you have to do things very differently, for example use monoids. There are a few compilers implemented on GPU, including Aaron Hsu's co-dfns and Voetter's compiler project [1]. The parenthesis-matching problem itself (the core of parsing) has long had known efficient parallel algorithms, and those have been ported to compute shaders [2] (disclosure: blatant self-promotion).

[1]: https://dl.acm.org/doi/pdf/10.1145/3528416.3530249

[2]: https://arxiv.org/pdf/2205.11659.pdf
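
For a taste of the monoid trick: each chunk of input reduces to a pair (unmatched closers, unmatched openers), and pairs combine associatively, which is exactly what lets the matching run as a parallel scan. A toy sequential sketch of the reduction:

  // The classic parenthesis-matching monoid.
  type Paren = { close: number; open: number };

  function fromChar(c: string): Paren {
    if (c === '(') return { close: 0, open: 1 };
    if (c === ')') return { close: 1, open: 0 };
    return { close: 0, open: 0 };
  }

  // Openers from the left chunk cancel closers from the right chunk.
  function combine(a: Paren, b: Paren): Paren {
    const matched = Math.min(a.open, b.close);
    return {
      close: a.close + (b.close - matched),
      open: b.open + (a.open - matched),
    };
  }

  // Balanced iff the whole reduction is { close: 0, open: 0 }. On a GPU
  // this reduce becomes a parallel scan over chunks.
  const balanced = (s: string) =>
    [...s].map(fromChar).reduce(combine, { close: 0, open: 0 });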

◧◩◪
51. crubie+GJ[view] [source] [discussion] 2023-05-20 04:48:52
>>Subopt+3B
I actually think that tagged template strings in JS/TS could be a much better version of operator overloading! https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

This would give access to any math notation in a more flexible way: you can implement a custom DSL that is type-safe yet expressive.

Imagine writing stuff like

  const result = math`${a} + ${b} / ${c}`
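
A toy version of such a tag (strictly left to right, no operator precedence; a real DSL would parse the strings properly) is only a few lines:

  function math(strings: TemplateStringsArray, ...values: number[]): number {
    let result = values[0];
    for (let i = 1; i < values.length; i++) {
      switch (strings[i].trim()) {
        case '+': result += values[i]; break;
        case '-': result -= values[i]; break;
        case '*': result *= values[i]; break;
        case '/': result /= values[i]; break;
        default: throw new Error(`unsupported operator: ${strings[i]}`);
      }
    }
    return result;
  }

  const [a, b, c] = [6, 4, 2];
  const result = math`${a} + ${b} / ${c}`; // 5, i.e. (6 + 4) / 2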

◧◩
52. westur+0K[view] [source] [discussion] 2023-05-20 04:55:29
>>sva_+Za
Could there be something like emscripten-forge/requests-wasm-polyfill for PyTorch with WebGPU? https://github.com/emscripten-forge/requests-wasm-polyfill

How does the performance of webgpu-torch compare to compiling PyTorch to WASM with emscripten and WebGPU?

tfjs benchmarks: Environment > backend > {WASM, WebGL, CPU, WebGPU, tflite} https://tensorflow.github.io/tfjs/e2e/benchmarks/local-bench... src: https://github.com/tensorflow/tfjs/tree/master/e2e/benchmark...

tensorflow/tfjs https://github.com/tensorflow/tfjs

tfjs-backend-wasm https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-...

tfjs-backend-webgpu https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-...

([...], tflite-support, tflite-micro)

From facebookresearch/shumai (a JS tensor library) https://github.com/facebookresearch/shumai/issues/122 :

> It doesn't make sense to support anything besides WebGPU at this point. WASM + SIMD is around 15-20x slower on my machine[1]. Although WebGL is more widely supported today, it doesn't have the compute features needed for efficient modern ML (transformers etc) and will likely be a deprecated backend for other frameworks when WebGPU comes online.

tensorflow rust has a struct.Tensor: https://tensorflow.github.io/rust/tensorflow/struct.Tensor.h...

"ONNX Runtime merges WebGPU backend" https://github.com/microsoft/onnxruntime https://news.ycombinator.com/item?id=35696031 ... TIL about wonnx: https://github.com/webonnx/wonnx#in-the-browser-using-webgpu...

microsoft/onnxruntime: https://github.com/microsoft/onnxruntime

Apache/arrow has language-portable Tensors for cpp: https://arrow.apache.org/docs/cpp/api/tensor.html and rust: https://docs.rs/arrow/latest/arrow/tensor/struct.Tensor.html and Python: https://arrow.apache.org/docs/python/api/tables.html#tensors https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...

Fwiw it looks like the llama.cpp Tensor is from ggml, for which there are CUDA and OpenCL implementations (but not yet ROCm, or a WebGPU shim for use with emscripten transpilation to WASM): https://github.com/ggerganov/llama.cpp/blob/master/ggml.h

Are there recommended ways to cast e.g. Arrow Tensors to PyTorch/TensorFlow tensors?

FWIU, Rust has a better compilation story for WASM, and that's probably faster than already-compiled-to-JS/ES TensorFlow + WebGPU.

What's a fair benchmark?

◧◩◪
53. westur+qK[view] [source] [discussion] 2023-05-20 05:01:47
>>westur+0K
> What's a fair benchmark?

- /? pytorch tensorflow benchmarks webgpu 2023 site:github.com https://www.google.com/search?q=pytorch+tensorflow+benchmark...

- [tfjs benchmarks]

- huggingface/transformers:src/transformers/benchmark https://github.com/huggingface/transformers/tree/main/src/tr...

◧◩◪◨
60. crubie+RW[view] [source] [discussion] 2023-05-20 08:27:33
>>crubie+GJ
I got nerdsniped and made a small library to test this concept. Might maintain it.

https://github.com/crubier/opov

◧◩◪◨
62. mlajto+n51[view] [source] [discussion] 2023-05-20 10:53:31
>>killth+LC
Indeed. "Object-oriented" fluent notation is basically equivalent to infix notation.

https://mlajtos.mu/posts/new-kind-of-paper-2
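
Concretely (toy example, not any particular library's API):

  class Num {
    constructor(readonly v: number) {}
    add(o: Num) { return new Num(this.v + o.v); }
    div(o: Num) { return new Num(this.v / o.v); }
  }

  const [a, b, c] = [new Num(6), new Num(4), new Num(2)];
  a.add(b).div(c).v; // 5, read in the same order as the infix (a + b) / c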

◧◩◪
63. smhx+ze1[view] [source] [discussion] 2023-05-20 12:30:51
>>westur+0K
>What's a fair benchmark?

The absolute golden benchmarks are https://github.com/pytorch/benchmark. They are a diverse set of userland code taken from GitHub as-is and made into benchmarks.
