zlacker

I did a similar optimization, using https://github.com/viterin/vek for the SIMD version. Some admittedly unscientific benchmarks showed a 10x improvement while staying in float32: https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf... (comparable to the article's 9x improvement from SIMD + int8).

TBH my takeaway was that using smaller vectors as the representation was the more useful change.