Discussion: "CUDA-l2: Surpassing cuBLAS performance for matrix multiplication through RL"
1. stonog+x5
2025-12-04 21:33:58
>>dzign+(OP)
Am I reading this wrong, or does this only support FP16 inputs and compare its performance against an FP32 solver?
2. Bulat_+RY4
2025-12-06 10:31:48
>>stonog+x5
They compare HGEMM implementations, and cuBLAS does provide HGEMM functions.
HGEMM means half-precision (i.e. FP16) general matrix multiplication.
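For reference, a minimal sketch of what an HGEMM call through cuBLAS looks like, using cublasHgemm (FP16 inputs and outputs). The matrix size, all-ones initialization, and error-handling shortcuts are just illustrative, not taken from the linked project; compile with something like "nvcc hgemm_demo.cu -lcublas".

    #include <cstdio>
    #include <vector>
    #include <cuda_fp16.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 1024;                               // illustrative square size
        const size_t bytes = size_t(n) * n * sizeof(__half);

        // All-ones inputs so the expected result is easy to check: C[i][j] = n.
        std::vector<__half> hA(size_t(n) * n, __float2half(1.0f));
        std::vector<__half> hB(size_t(n) * n, __float2half(1.0f));

        __half *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemset(dC, 0, bytes);

        cublasHandle_t handle;
        cublasCreate(&handle);

        const __half alpha = __float2half(1.0f);
        const __half beta  = __float2half(0.0f);

        // Half-precision GEMM: C = alpha * A * B + beta * C, all operands FP16.
        // (cublasGemmEx can be used instead if FP32 accumulation is wanted.)
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n,
                    &alpha, dA, n, dB, n,
                    &beta, dC, n);
        cudaDeviceSynchronize();

        __half c0;
        cudaMemcpy(&c0, dC, sizeof(__half), cudaMemcpyDeviceToHost);
        printf("C[0][0] = %.1f (expected %d)\n", __half2float(c0), n);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }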