>>rbanff+(OP)
I would have tried parallelism, just out of curiosity: split the computation and spread it over multiple cores. With n cores you could get close to an n-fold speedup, minus the cost of distributing the data and combining the results. That's an easy optimization right out of the box with Go (no assembly required).