GPUs are usually not faster at a single operation, but excel at doing that operation in parallel on a gazillion elements. Matrix math is mostly additions and multiplications.
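To make that concrete, here is a minimal CPU-side sketch (my own illustration in NumPy, not anything from the thread): each output element of a matrix product is an independent multiply-accumulate, which is exactly why a GPU can compute all of them at once.

```python
import numpy as np

# Each output element C[i, j] is an independent multiply-accumulate
# over one row of A and one column of B -- independence across (i, j)
# is what a GPU exploits by assigning elements to parallel threads.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

C = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # plain additions and multiplications, nothing else
        C[i, j] = sum(A[i, k] * B[k, j] for k in range(2))

assert np.allclose(C, A @ B)
```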
>>nicoco+(OP)
Yeah this is the trick. You need to maximize the use of workgroup parallelism and also lay things out in memory for those kernels to access efficiently. It’s a bit of a balancing act and I’ll be working on benchmarks to test out different strategies.
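As a rough CPU-side analogy for the memory-layout point (my own NumPy sketch, not the OP's benchmarks): a transposed view walks the same buffer with a large stride between consecutive elements, which is the kind of access pattern that defeats coalesced reads in a GPU workgroup, while an explicit repack restores contiguous layout.

```python
import numpy as np

# Row-major array: consecutive row elements are adjacent in memory.
a = np.arange(12, dtype=np.float32).reshape(3, 4)

view = a.T                          # no copy: strided, non-contiguous rows
packed = np.ascontiguousarray(a.T)  # explicit copy back into row-major order

assert a.flags['C_CONTIGUOUS']
assert not view.flags['C_CONTIGUOUS']
assert packed.flags['C_CONTIGUOUS']
```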
>>nicoco+(OP)
The main advantage is parallelism, but on top of that, common math operations are hardware accelerated on the GPU, so they should indeed run faster just by being on the GPU.