I did a similar optimization, using https://github.com/viterin/vek for the SIMD version. Some rough, admittedly unscientific benchmarks showed a 10x improvement while staying in float32:
https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf... (comparable to the 9x improvement the article gets from SIMD + int8).
TBH, my takeaway was that it was more useful to use smaller vectors as the representation.