Do you use ChatGPT Code Interpreter because it's better, or is it just something you're more familiar with and you're sticking with it for convenience?
Of course, I don't know how one would structure a suitable test, since running the agents sequentially would likely favor the later ones, which would get clearer task descriptions and feedback. I imagine familiarity with how to prompt each particular model is also a factor.