My understanding/experience is that LLM performance in a language scales with how well the language is represented in the training data.
Under that assumption, we'd expect LLMs to do better with an existing language that has more training code available, even if that language is more complex and seems like it should be “harder” to understand.
That GRPO works?
> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase
Page 2 of https://arxiv.org/pdf/2402.03300
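For anyone who hasn't looked at it, the "baseline from group scores" part is small enough to sketch. This is a rough illustration, not the paper's code: the function name, the choice of population standard deviation, and the `eps` term are my own assumptions. It only shows how a group of rewards for samples from the same prompt can stand in for a critic.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Turn the rewards of one group of samples into advantages.

    Hypothetical helper: the group mean plays the role of the baseline
    that a critic model would otherwise have to estimate.
    """
    baseline = mean(rewards)   # group mean replaces the learned value function
    spread = pstdev(rewards)   # assumption: population std; eps guards against 0
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled answers to the same prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that beat their own group's average get a positive advantage and the rest get a negative one, so no separate value network is needed.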
That GRPO on code works?
> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on correctness
Page 4 of https://arxiv.org/pdf/2501.12948
> There is no RL for programming languages.
and
> Either RL works & you have evidence
This is just so completely wrong, and here is the evidence.
I think everyone in this thread is just surprised you don't seem to know this.
Haven't you seen the hundreds of job ads for people to write code for LLMs to train on?