zlacker

Developed by Jordan Hubbard of NVIDIA (and FreeBSD).

My understanding/experience is that LLM performance in a language scales with how well the language is represented in the training data.

From that assumption, we might expect LLMs to actually do better with an existing language for which more training code is available, even if that language is more complex and seems like it should be “harder” to understand.

replies(11): >>whimsi+u >>nxobje+w >>vessen+B1 >>Zigurd+G4 >>cmrdpo+F5 >>adastr+M9 >>NewsaH+Yj >>nl+mt >>nemo16+SJ >>vidarh+JV >>boxed+4W

>>thorum+(OP)
easy enough to solve with RL probably

replies(1): >>measur+C1

>>thorum+(OP)
I think it's depressingly true of any novel language/framework at this point, especially if they have novel ideas.

>>thorum+(OP)
A lot of this depends on your workflow. A language with great typing, type checking and good compiler errors will work better in a loop than one with a large surface overhead and syntax complexity, even if it's well represented. This is the instinct behind, e.g. https://github.com/toon-format/toon, a json alternative format. They test LLM accuracy with the format against JSON, (and are generally slightly ahead of JSON).

Additionally just the ability to put an entire language into context for an LLM - a single document explaining everything - is also likely to close the gap.

I was skimming some nano files and while I can't say I loved how it looked, it did look extremely clear. Likely a benefit.

replies(1): >>btown+1o

>>whimsi+u
There is no RL for programming languages. Especially ones w/ no significant amount of code.

replies(3): >>whimsi+R1 >>nl+vv >>thorum+GM

>>measur+C1
not even wrong

replies(1): >>measur+n2

>>whimsi+R1
Exactly.

>>thorum+(OP)
It's not just how well the language is represented. Obscure-ish APIs can trip up LLMs. I've been using Antigravity for a Flutter project that uses ATProto. Gemini is very strong at Dart coding, which makes picking up my 17th managed language a breeze. It's also very good at Flutter UI elements. It was noticeably less good at ATProto and its Dart API.

The characteristics of failures have been interesting: As I anticipated it might be, an over ambitious refactoring was a train wreck, easily reverted. But something as simple as regenerating Android launcher icons in a Flutter project was a total blind spot. I had to Google that like some kind of naked savage running through the jungle.

replies(1): >>nl+Xu

>>thorum+(OP)
Not my experience, honestly. With a good code base for it to explore and good tooling, and a really good prompt I've had excellent results with frankly quite obscure things, including homegrown languages.

As others said, the key is feedback and prompting. In a model with long context, it'll figure it out.

replies(2): >>rocha+k8 >>vidarh+WW

>>cmrdpo+F5
But isn't this inefficient since the agent has to "bootstrap" its knowledge of the new language every time it's context window is reset?

replies(1): >>adastr+v9

>>rocha+k8
No, it gets it “for free” just by looking around when it is figuring out how to solve whatever problem it is working on.

>>thorum+(OP)
I don’t think that assumption holds. For example, only recently have agents started getting Rust code right on the first try, but that hasn’t mattered in the past because the rust compiler and linters give such good feedback that it immediately fixes whatever goof it made.

This does fill up context a little faster, (1) not as much as debugging the problem would have in a dynamic language, and (2) better agentic frameworks are coming that “rewrite” context history for dynamic on the fly context compression.

replies(4): >>root_a+Gd >>Punchy+BN >>bevr13+fG1 >>Growin+jR1

>>adastr+M9
> that hasn’t mattered in the past because the rust compiler and linters give such good feedback that it immediately fixes whatever goof it made.

This isn't even true today. Source: heavy user of claude code and gemini with rust for almost 2 years now.

replies(2): >>adastr+up >>ekidd+uu

>>thorum+(OP)
I wonder if there is a way to create a sort of 'transpilation' layer to a new language like this for existing languages, so that it would be able to use all of the available training from other languages. Something that's like AST to AST. Though I wonder if it would only work in the initial training or fine-tuning stage.

>>vessen+B1
Thanks for sharing this! A question I've grappled with is "how do you make the DOM of a rendered webpage optimal for complex retrieval in both accuracy and tokens?" This could be a really useful transformation to throw in the mix!

>>root_a+Gd
I have no problems with rust and Claude Code, and I use it on a daily basis.

>>thorum+(OP)
> My understanding/experience is that LLM performance in a language scales with how well the language is represented in the training data.

This isn't really true. LLMs understand grammars really really well. If you have a grammar for your language the LLM can one-shot perfect code.

What they don't know is the tooling around the language. But again, this is pretty easily fixed - they are good at exploring cli tools.

>>root_a+Gd
Yeah, I have zero problem getting Opus 4.5 to write high-quality Rust code. And I'm picky.

>>Zigurd+G4
I have a vibe coded fantasy console. Getting Doom running on it was easy.

Getting the Doom sound working on it involved me setting there typing "No I can't hear anything" over and over until it magically worked...

Maybe I should have written a helper program to listen using the microphone or something.

replies(1): >>nxobje+Ip3

>>measur+C1
I guess the op was implying that is something fixable fairly easily?

(Which is true - it's easy to prompt your LLM with the language grammar, have it generate code and then RL on that)

Easy in the sense of "it is only having enough GPUs to RL a coding capable LLM" anyway.

replies(1): >>measur+hy

>>nl+vv
If you can generate code from the grammar then what exactly are you RLing? The point was to generate code in the first place so what does backpropagation get you here?

replies(1): >>nl+Q91

>>thorum+(OP)
Blackpill is that, for this reason, the mainstream languages we have today will be the final (human-designed) languages to be relevant on a global scale.

Eventually AIs will create their own languages. And humans will, of course, continue designing hobbyist languages for fun. But in terms of influence, there will not be another human language that takes the programming world by storm. There simply is not enough time left.

replies(1): >>rzmmm+7M1

>>measur+C1
Go read the DeepSeek R1 paper

replies(1): >>measur+KO

>>adastr+M9
so you're saying... the assumption actually holds

replies(1): >>adastr+SQ1

>>thorum+GM
Why would I do that? If you know something then quote the relevant passage & equation that says you can train code generators w/ RL on a novel language w/ little to no code to train on. More generally, don't ask random people on the internet to do work for you for free.

replies(2): >>thorum+U01 >>whimsi+sM1

>>thorum+(OP)
I mostly agree, and I think a combination of good representation and tooling that lets it self-correct quickly will do better than new language in the short term.

In the long term I expect it won't matter - already GPT3.5 was able to reason about the basic semantics of programs in languages "synthesised" zero-shot in context by just describing it as a combination of existing languages (e.g. "Ruby with INTERCAL's COME FROM") or by providing a grammar (e.g. simple EBNF plus some notes on new/different constructs) reasonably well and could explain what a program written in a franken-language it had not seen before was likely to do.

I think long before there is enough training data for a new language to be on equal grounds in that respect, we should expect the models to be good enough at this that you could just provide a terse language spec.

But at the same time, I'd expect the same improvement to future models to be good enough at working with existing languages that it's pointless to tailor languages to LLMs.

>>thorum+(OP)
Claude is very good with Elm, which there should be quite little training data.

>>cmrdpo+F5
Yeah, I've had Claude work on my buggy, incomplete Ruby compiler written (mostly) in Ruby, which uses an s-expression like syntax with a custom "mini language" to implement low-level features that can't be done (or is impractical to do) in pure Ruby, and it only had minor problems with the s-expression language that was mostly fixed with a handful of lines in CLAUDE.md (and were, frankly, mostly my fault for making the language itself somewhat inconsistent) and e.g. when it write a bigint implementation, I had to "tell it off" for too readily resorting to the s-expression syntax since it seemed to "prefer it" over writing high-level code in Ruby.

replies(1): >>cmrdpo+t11

>>measur+KO
Your other comment sounded like you were interested in learning about how AI labs are applying RL to improve programming capability. If so, the DeepSeek R1 paper is a good introduction to the topic (maybe a bit out of date at this point, but very approachable). RL training works fine for low resource languages as long as you have tooling to verify outputs and enough compute to throw at the problem.

replies(2): >>whimsi+AM1 >>measur+2R2

>>vidarh+WW
Even 3 years ago, GH Copilot, hardly the most intelligent of LLMs was suggesting/writing bytecode in my custom VM, writing full programs in bytecode for a custom VM just by looking at a couple examples.

That's when I smelled that things were getting a little crazy.

>>measur+hy
Post RL you won't need to put the grammar in the prompt anymore.

replies(1): >>measur+pR2

>>adastr+M9
> because the rust compiler and linters give such good feedback that it immediately fixes whatever goof it made.

I still experience agents slipping in a `todo!` and other hacks to get code to compile, lint, and pass tests.

The loop with tests and doc tests are really nice, agreed, but it'll still shit out bad code.

replies(1): >>adastr+KL2

>>nemo16+SJ
My impression is that AI models need large amounts of quality training data. "Data contamination", i.e. AI output in the training data set has been a problem for years.

>>measur+KO
well, that’s one way to react to being provided with interesting reading material.

replies(1): >>measur+cR2

>>thorum+U01
imo generally not worth it to keep going when you encounter this sort of HN archetype

>>Punchy+BN
No, it’s the exact opposite of the assumption. It doesn’t matter how represented the language is in the training data, so long as the surrounding infrastructure is good.

>>adastr+M9
> only recently have agents started getting Rust code right on the first try

This is such a silly thing to say. Either you set the bar so low that "hello world" qualifies or you expect LLMs to be able to reason about lifetimes, which they clearly cannot. But LLMs were never very good at full-program reasoning in any language.

I don't see this language fixing this, but it's not trying to—it just seems to be removing cruft

replies(1): >>adastr+sI2

>>Growin+jR1
I have had no issue with Claude writing code that uses lifetimes. It seems to be able to reason about them just fine.

I don't know what to say. May experience does not match yours.

>>bevr13+fG1
What agents, using what models?

replies(1): >>bevr13+t85

>>thorum+U01
So you should have no problem bringing up the exact passages & equations they use for their policies.

>>whimsi+sM1
Bring up passage that supports your claim. I'll wait.

replies(1): >>nl+Z34

>>nl+Q91
The grammar of this language is no more than a few hundred tokens (thousands at worst) & current LLMs support context windows in the millions of tokens.

replies(1): >>nl+X93

>>measur+pR2
Sure.

The point is that your statement about the ability to do RL is wrong.

Additionally your response to the Deepseek paper in the other subthread shows profound and deliberate ignorance.

replies(1): >>measur+pp3

>>nl+X93
Theorycrafting is very easy. Not a single person in this thread has shown any code to do what they're suggesting. You have access to the best models & yet you still haven't managed to prompt it to give you the code to prove your point so spare me any further theoretical responses. Either show the code to do exactly what you're saying is possible or admit you lack the relevant understanding to back up your claims.

replies(1): >>nl+r24

>>nl+Xu
I was reading an Ars article by someone who'd hacked up Apple II and Atari 2600 emulators to provide state introspection/reproducible input via MCP, sockets, file I/O - would that work?

https://github.com/benj-edwards/atari800-ai https://github.com/benj-edwards/bobbin

replies(1): >>nl+t24

>>measur+pp3
> You have access to the best models & yet you still haven't managed to prompt it to give you the code to prove your point so spare me any further theoretical responses. Either show the code to do exactly what you're saying is possible

GPU poor here though...

To quote someone (you...) on the internet:

> More generally, don't ask random people on the internet to do work for you for free.

>>46689232

replies(1): >>measur+r94

>>nxobje+Ip3
wow I love it.

>>measur+cR2
Not exactly sure what you are looking for here.

That GRPO works?

> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase

Page 2 of https://arxiv.org/pdf/2402.03300

That GRPO on code works?

> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on correctness

Page 4 of https://arxiv.org/pdf/2501.12948

replies(1): >>measur+J94

>>nl+r24
Claims require evidence & if you are unwilling to present it then admit you do not have any evidence to support your claims. It's not complicated. Either RL works & you have evidence or you do not know & can not claim that it works w/o first doing the required due diligence which (shockingly) actually requires work instead of empty theory crafting & hand waving.

>>nl+Z34
None of those are novel domains w/ their own novel syntax & semantic validators, not to mention the dearth of readily available sources of examples for sampling the baselines. So again, where does it say it works for a programming language with nothing but a grammar & a compiler?

replies(1): >>nl+tP4

>>measur+J94
To quote you:

> here is no RL for programming languages.

and

> Either RL works & you have evidence

This is just so completely wrong, and here is the evidence.

I think everyone in this thread is just surprised you don't seem to know this.

Haven't you seen the hundreds of job ads for people to write code for LLMs to train on?

replies(1): >>measur+7K5

>>adastr+KL2
Whatever work is paying for on a given day. We've rotated through a few offerings. It's a work truck not a personal vehicle, for me.

I manage a team of interns and I don't have the energy to babysit an agent too. For me, gpt and gemini yield the best talk-it-through approach. For example, dropping a research paper into the chat and describing details until the implementation is clarified.

We also use Claude and Cursor, and that was an exceptionally disruptive experience. Huge, sweeping, wrong changes all over. Gyah! If I bitch about todo! macros, this is where they came from.

For hobby projects, I sometimes use whatever free agent microsoft is shilling via VS Code (and me selling my data) that day. This is relatively productive, but reaches profoundly wrong conclusions.

Writing for CLR in visual studio is the smoothest smart-complete experience today.

I have not touched Grok and likely won't.

/ two pennies

Hope that answers your questions.

>>nl+tP4
You're not going to get less confused by doubling down. None of your claims are valid & this is because you haven't actually tried to do what you're suggesting. Taking a grammar & compiler & RLing will get you nowhere.