Oddly, two tests fail for me with Brave (Version 1.51.118 / Chromium: 113.0.5672.126 (arm64)) on macOS Ventura 13.3.1:
- pow([0], [0]) gradient, with "Expected «-Infinity» to be close to «0» (diff: < 0.0000005)"
- xlogy([0], [0.30000001192092896]) gradient with "Expected «0» to be close to «-1.2039728164672852»"
A number of test failures for me on Chromium 113.0.5672.63 (ungoogled-chromium), macOS Ventura 13.3.1: https://pastebin.com/eM6ZA3j2
I'll open a ticket if it helps.
On a side note, I'm not sure if it is because I've looked at so many autograd engines by now, but it is really cool to see that, after years of different frameworks being developed, most people seem to agree on the concepts and structure for implementing something like this. It is pretty easy to dive into, even without being particularly skilled in JS/TS.
Wondering how such frameworks will look in a couple years.
Safari 16.3 has 4 failures: "webgpu is supported", "tensor is webgpu", "xlogy([0], [0]) gradient", "xlogy([0], [0.30000001192092896]) gradient"
A clever way to implement an AOT variant of the operator fusion methods in the XLA (JIT) compiler.
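To sketch the idea (a conceptual illustration only, not webgpu-torch's actual code): elementwise ops can be composed into WGSL expression fragments ahead of time, so e.g. add followed by relu ships as a single precompiled kernel and runs as one dispatch.
// Conceptual sketch only -- not webgpu-torch's internals. Each "op" maps an
// input expression string to a new WGSL expression; fusing is just function
// composition, done ahead of time so the fused kernel ships with the library.
type Elementwise = (expr: string) => string;

const add = (other: string): Elementwise => expr => `(${expr} + ${other})`;
const relu = (): Elementwise => expr => `max(${expr}, 0.0)`;

function fuseToWGSL(ops: Elementwise[]): string {
  // Thread the running expression through each op, then emit one compute shader.
  const body = ops.reduce((expr, op) => op(expr), 'x[i]');
  return `
@group(0) @binding(0) var<storage, read> x: array<f32>;
@group(0) @binding(1) var<storage, read> y: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  let i = id.x;
  if (i >= arrayLength(&out)) { return; }
  out[i] = ${body};
}`;
}

// fuseToWGSL([add('y[i]'), relu()]) emits one kernel computing
// out[i] = max((x[i] + y[i]), 0.0) -- two ops, one GPU dispatch.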
// An empty 3x4 matrix
const tensorA = tensor([3, 4])
// An empty 4x5 matrix
const tensorB = tensor([4, 5])
const good = multiplyMatrix(tensorA, tensorB);
^
Inferred type is Tensor<readonly [3, 5]>
const bad = multiplyMatrix(tensorB, tensorA);
^^^^^^^
Argument of type 'Tensor<readonly [4, 5]>' is not
assignable to parameter of type '[never, "Differing
types", 3 | 5]'.(2345)
I prototyped this for PotatoGPT [1] and some kind stranger on the internet wrote up a more extensive take [2]. You can play with an early version on the TypeScript playground here [3] (uses a Twitter shortlink for brevity).
[1] https://github.com/newhouseb/potatogpt
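For illustration, a self-contained sketch of how such a shape-checked multiplyMatrix might be typed. This is a simplified reconstruction, not the actual PotatoGPT code; it reports TypeScript's default mismatch error rather than the custom "Differing types" message shown above.
// Simplified illustration -- not the actual PotatoGPT implementation.
type Shape = readonly number[];

interface Tensor<S extends Shape> {
  readonly shape: S;
  readonly data: Float32Array;
}

// `const S` (TS 5.0+) keeps literal shapes like [3, 4] instead of widening to number[].
function tensor<const S extends Shape>(shape: S): Tensor<S> {
  const size = shape.reduce((acc, d) => acc * d, 1);
  return { shape, data: new Float32Array(size) };
}

// The inner dimension K must match between the two arguments; a swapped call
// fails to type-check at the call site.
function multiplyMatrix<R extends number, K extends number, C extends number>(
  a: Tensor<readonly [R, K]>,
  b: Tensor<readonly [K, C]>
): Tensor<readonly [R, C]> {
  // Runtime math elided; only the type-level shape propagation matters here.
  return tensor([a.shape[0], b.shape[1]]) as Tensor<readonly [R, C]>;
}

const a = tensor([3, 4]); // Tensor<readonly [3, 4]>
const b = tensor([4, 5]); // Tensor<readonly [4, 5]>
const good = multiplyMatrix(a, b); // Tensor<readonly [3, 5]>
// const bad = multiplyMatrix(b, a); // does not compile: inner dimensions differ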
Some sort of typed 'named tensor' [0] that could be combined with einsum notation at runtime would be awesome, i.e. (I don't really know TS/JS well, so this is pseudocode):
import * as t from 'pytorch'
const nn = t.nn
const tensorA: Tensor<[Batch, Seq, Emb]> = t.randn([10, 10, 10]) // initialize tensor
const transformLayer = nn.Einsum('(Batch, Seq, Emb), (Emb) -> (Batch, Seq)')
const tensorB: Tensor<[Emb2]> = t.randn([20])
const transformedOutput = transformLayer(tensorA, tensorB) // type error: Emb2 does not match Emb
[0]: https://github.com/pytorch/pytorch/issues/26889
It would be even better if tensor dims from loaded models could be inferred ahead of time in the editor.
This is a dumb question but... are GPUs really that much faster than CPUs specifically at the math functions tested on this page?
xlogy trunc tan/tanh sub square sqrt sin/sinc/silu/sinh sign sigmoid sqrt/rsqrt round relu reciprocal rad2deg pow positive neg mul logaddexp/logaddexp2 log/log1p/log10/log2 ldexp hypot frac floor expm1 exp2 exp div deg2rad cos/cosh copysign ceil atan/atan2 asinh/asin add acosh/acos abs
Those are the types of math GPUs are good at? I thought they were better at a different kind of math, like matrices or something?
When I initially started implementing this I was hung up on similar concerns. For example, in GPT-2/PotatoGPT the MLP layer is 4x the width of the residual stream. I went down a rabbit hole of addition and multiplication in TypeScript types (the type system is Turing complete, so it's technically possible!) and, after crashing my TS language server a bunch, I switched tactics.
Where I ended up was to use symbolic equivalence, which turned out to be more ergonomic anyway, i.e.
type Multiply<A extends number, B extends number> =
number & { label: `${A} * ${B}` }
const Multiply = <A extends number, B extends number>(a: A, b: B) =>
a * b as Multiply<A, B>;
such that tensor([
params.EmbeddingDimensions, // This is a literal with known size
Multiply(4, params.EmbeddingDimensions)] as const)
is inferred as Tensor<readonly [768, Multiply<4, 768>]>
Notably, switching to a more symbolic approach makes it easier to type-check dimensions that can change at runtime, so something like: tensor([Var(tokens.length, 'Sequence Length'),
Multiply(4, Var(tokens.length, 'Sequence Length'))])
infers as Tensor<readonly [
Var<'Sequence Length'>,
Multiply<4, Var<'Sequence Length'>>]>
And you'll get all the same correctness constraints that you would if these were known dimensions. The downside to this approach is that TypeScript won't know that Multiply<4, Var<'A'>> is equivalent to Multiply<Var<'A'>, 4>, but in practice I haven't found this to be a problem.
Finally, on more complicated operators/functions that compose dimensions from different variables, TypeScript is also very capable, albeit not the most ergonomic. You can check my code for matrix multiplication and Seb's writeup for another example (a zip function).
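For completeness, here is a plausible sketch of the Var helper used above; this is a reconstruction from the described behaviour rather than the exact PotatoGPT definition. It tags a runtime number with a symbolic label so dimensions that only exist at runtime can still participate in compile-time shape checks.
// Plausible reconstruction of Var -- a runtime number branded with a label type.
type Var<Label extends string> = number & { label: Label };

const Var = <Label extends string>(n: number, label: Label): Var<Label> =>
  n as Var<Label>;

// Two tensors built from the same Var<'Sequence Length'> type-check against
// each other even though the concrete length is only known at runtime.
const seqLen = Var(128, 'Sequence Length'); // inferred as Var<'Sequence Length'>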
Heck, if you are doing that, maybe convert to webgpu automatically as well.
Someone very enterprising might do this in bun using zig.
[1] - https://www.modular.com/mojo [2] - https://ai.facebook.com/blog/meta-training-inference-acceler...
def my_fn(x, **kwargs):
    ...
    return y_1, y_2, y_3
Which is a pain because kwargs could really be anything, and now every call site has to expect exactly 3 return values while knowing their order; there's no way of adding an extra return value without changing every caller. In TypeScript the same function could look like:
function myFn(x, options = { someOption: 1 }) {
...
return { y_1, y_2, y_3 };
}
Which is so much nicer because everything is typed, with all types inferred automatically! And you don't burden the call sites with values they don't need:
const { y_1 } = myFn(x, { someOption: 1 });
In Python, everyone mostly passes unbundled arguments through every function, and changing anything involves threading these untyped arguments through a bunch of untyped call sites. It's not the end of the world, but we can do better...
IMO almost any application that is bottlenecked by CPU performance can be recast to use GPUs effectively. But it's rarely done because GPUs aren't nearly as standardized as CPUs and the developer tools are much worse, so it's a lot of effort for a faster but much less portable outcome.
https://www.w3.org/TR/WGSL/#floating-point-evaluation
It’s not such a problem for real nets since you avoid those values like the plague. But the tests catch them, and I need to make the tests more tolerant. Thanks for the results!
That was my biggest pain point with using TS for graphics-related projects. If operator overloading existed, then TS would be a no-brainer for entry-level graphics + AI/ML projects.
Edit: This gets more complicated when doing operations that force you to manually respect PEMDAS. For example, `add(div(a, b), multiply(c, d))` in TypeScript would simplify to `a / b + c * d` in Python. The TS version is unreadable.
Privacy-focused semantic search / ML at the edge is looking brighter every day.
[1] https://anansi.pages.dev/ [2] https://github.com/infrawhispers/anansi/tree/main/embedds/li...
const sum = a.add(b).add(c);
I think WebGPU will be that universal language everyone speaks, and I think also that this will help get rid of Nvidia's monopoly on GPU compute.
This would give access to arbitrary math notation in a more flexible way: a custom DSL that is type-safe yet still expressive.
Imagine writing stuff like
const result = math`${a} + ${b} / ${c}`
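A toy sketch of what such a math tag could look like (plain numbers only, my own illustration; a real tensor version would build an expression graph over tensor objects instead of evaluating eagerly):
// Toy sketch of a `math` tagged template: numbers only, + - * / and parentheses,
// evaluated with normal operator precedence.
function math(strings: TemplateStringsArray, ...values: number[]): number {
  // Inline the interpolated values back into the expression text.
  const expr = strings.reduce(
    (acc, s, i) => acc + s + (i < values.length ? String(values[i]) : ''),
    ''
  );
  let pos = 0;

  function skipSpaces(): void { while (expr[pos] === ' ') pos++; }

  function parsePrimary(): number {
    skipSpaces();
    if (expr[pos] === '-') { pos++; return -parsePrimary(); }     // unary minus
    if (expr[pos] === '(') {
      pos++;                                                      // consume '('
      const v = parseAddSub();
      skipSpaces();
      pos++;                                                      // consume ')'
      return v;
    }
    const start = pos;
    while (pos < expr.length && /[\d.]/.test(expr[pos])) pos++;
    return parseFloat(expr.slice(start, pos));
  }

  function parseMulDiv(): number {
    let v = parsePrimary();
    for (;;) {
      skipSpaces();
      if (expr[pos] === '*') { pos++; v *= parsePrimary(); }
      else if (expr[pos] === '/') { pos++; v /= parsePrimary(); }
      else return v;
    }
  }

  function parseAddSub(): number {
    let v = parseMulDiv();
    for (;;) {
      skipSpaces();
      if (expr[pos] === '+') { pos++; v += parseMulDiv(); }
      else if (expr[pos] === '-') { pos++; v -= parseMulDiv(); }
      else return v;
    }
  }

  return parseAddSub();
}

// math`${2} + ${3} / ${4}` === 2.75 (division binds tighter than addition)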
How does the performance of webgpu-torch compare to compiling PyTorch to WASM with emscripten and WebGPU?
tfjs benchmarks: Environment > backend > {WASM, WebGL, CPU, WebGPU, tflite} https://tensorflow.github.io/tfjs/e2e/benchmarks/local-bench... src: https://github.com/tensorflow/tfjs/tree/master/e2e/benchmark...
tensorflow/tfjs https://github.com/tensorflow/tfjs
tfjs-backend-wasm https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-...
tfjs-backend-webgpu https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-...
([...], tflite-support, tflite-micro)
From facebookresearch/shumai (a JS tensor library) https://github.com/facebookresearch/shumai/issues/122 :
> It doesn't make sense to support anything besides WebGPU at this point. WASM + SIMD is around 15-20x slower on my machine[1]. Although WebGL is more widely supported today, it doesn't have the compute features needed for efficient modern ML (transformers etc) and will likely be a deprecated backend for other frameworks when WebGPU comes online.
tensorflow rust has a struct.Tensor: https://tensorflow.github.io/rust/tensorflow/struct.Tensor.h...
"ONNX Runtime merges WebGPU backend" https://github.com/microsoft/onnxruntime https://news.ycombinator.com/item?id=35696031 ... TIL about wonnx: https://github.com/webonnx/wonnx#in-the-browser-using-webgpu...
microsoft/onnxruntime: https://github.com/microsoft/onnxruntime
Apache/arrow has language-portable Tensors for cpp: https://arrow.apache.org/docs/cpp/api/tensor.html and rust: https://docs.rs/arrow/latest/arrow/tensor/struct.Tensor.html and Python: https://arrow.apache.org/docs/python/api/tables.html#tensors https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...
Fwiw it looks like the llama.cpp Tensor is from ggml, for which there are CUDA and OpenCL implementations (but not yet ROCm, or a WebGPU shim for use with emscripten transpilation to WASM): https://github.com/ggerganov/llama.cpp/blob/master/ggml.h
Are there recommended ways to cast e.g. Arrow Tensors to pytorch/tensorflow?
FWIU, Rust has a better compilation to WASM; and that's probably faster than already-compiled-to-JS/ES TensorFlow + WebGPU.
What's a fair benchmark?
- /? pytorch tensorflow benchmarks webgpu 2023 site:github.com https://www.google.com/search?q=pytorch+tensorflow+benchmark...
- [tfjs benchmarks]
- huggingface/transformers:src/transformers/benchmark https://github.com/huggingface/transformers/tree/main/src/tr...
(x+y)*z/3
vs
x.add(y).mul(z).div(3)
And that’s just a really simple example.
I’m also hopeful that Python’s new variadic generic types make progress here.
I've written a lot of dataloader and similar code over the last number of years, and the slicing was probably the most important (and most hair-pulling) part for me. I've really debated writing my own wrapper at some point (if it is indeed worth the effort) just to keep my sanity, even if it is at the expense of some speed.
Deno has (or had) it, but you'd have to use Deno v1.31.3 to get WebGPU support (because it was removed afterwards for startup performance issues).
The absolute golden benchmarks are https://github.com/pytorch/benchmark. They are a diverse set of userland code taken from GitHub as-is and made into benchmarks.
Even just the [:, None] trick replacing unsqueeze is super useful for me.
This way of thinking is not just unhelpful but even harmful. If one would often benefit from these checks while coding, then they should not be relying on a type checker. They should be thinking more, and writing comments is a great way to do that.
This is especially true because many operations on ndarrays / tensors can yield perfectly valid shapes with completely unintended consequences. When comments are written reasonably well they help avoid these difficult-to-debug, correct-output-shape-but-unintended-result mistakes. Not to mention the additional clear benefit of helping one quickly re-understand the tensor manipulations when coming back to the code weeks or months later.
And more generally, if one can get in the habit of writing these comments before the code, it can help push them away from the write-quickly-now-debug-later mentality. I have seen this bite folks many times, both while teaching ugrad + grad courses and while working at large tech companies.
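To make that concrete, a small illustration of the kind of shape comment being advocated (plain arrays and made-up shapes, purely for illustration):
// Writing the expected shape before each step makes an accidental transpose or
// broadcast obvious at review time, even without a type checker.

// logits: [batch=2, vocab=4] -- raw scores, one row per example
const logits: number[][] = [
  [1.0, 2.0, 0.5, 0.1],
  [0.2, 0.3, 3.0, 1.5],
];

// probs: [batch=2, vocab=4] -- softmax over the vocab dimension (each row sums to 1)
const probs = logits.map(row => {
  const max = Math.max(...row);                  // subtract max to stabilize the exponentials
  const exps = row.map(x => Math.exp(x - max));
  const sum = exps.reduce((acc, e) => acc + e, 0);
  return exps.map(e => e / sum);
});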