Thanks for sharing the work. Correct, we're currently working on evals for skills so you can compare them across models and harnesses.
We wrote a blog post on getting agents to write CUDA kernels and evaluating them: https://huggingface.co/blog/upskill