"Random seed xxx is all you need" was another demonstration of this need.
You actually want a wilcoxon sum rank test as many metrics are not gaussian especially as they get to thier limits!! I.e. accuracy roughly 99 or 100! Then it becomes highly sub gaussian.