The primary constraint is the size of the language specification. Any new programming language starts out absent from the training data, so in-context learning is all you've got. That makes it similar to a compression competition: the size of the codec counts as part of the output size in such contests, so you have to balance codec size against how effective it is. You can't win by making a gigantic compressor that produces a tiny output.
To me that suggests starting from a base of an existing language and using iterative tree-based agent exploration, roughly the loop sketched below. It's a super expensive technique and I'm not sure the ROI is worth it, but that's how you'd do it. You don't want to create a new language from scratch.
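A minimal sketch of what I mean by that exploration. Everything here is a placeholder: mutateSpec stands in for "ask an agent to propose small tweaks to the spec" and scoreSpec for "have a model solve benchmark tasks under the candidate spec, minus a penalty for spec length". Neither is a real tool, this is just the shape of the loop.

```ts
interface Candidate {
  spec: string;   // candidate language spec / preprocessor rules (keep it short: codec size counts)
  score: number;  // e.g. benchmark pass rate minus a penalty for spec length
}

async function exploreSpecs(
  seedSpec: string,
  mutateSpec: (spec: string) => Promise<string[]>, // agent proposes variants
  scoreSpec: (spec: string) => Promise<number>,    // run a coding benchmark under the spec
  rounds = 5,
  beamWidth = 4,
): Promise<Candidate> {
  let beam: Candidate[] = [{ spec: seedSpec, score: await scoreSpec(seedSpec) }];
  for (let round = 0; round < rounds; round++) {
    const children: Candidate[] = [];
    for (const parent of beam) {
      for (const spec of await mutateSpec(parent.spec)) {
        children.push({ spec, score: await scoreSpec(spec) });
      }
    }
    // Keep only the best branches of the tree and expand them again next round.
    beam = [...beam, ...children].sort((a, b) => b.score - a.score).slice(0, beamWidth);
  }
  return beam[0];
}
```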
I don't think focusing on tokenization makes sense. The more you drift from the tokenization of the training text, the harder it will be for the model to work, just like with a human (and that's what the author finds). At best you might get small savings by asking it to write in something like Chinese, but the GPT-4/5 token vocabularies already have a lot of programming-related tokens like ".self", ".Iter", "-server" and so on. So trying to make something look shorter to a human can easily be counterproductive.
A better approach is to look at where models struggle and optimize a pre-existing language for those issues. It might all be rendered obsolete by a better model released tomorrow, of course, but the problems I see are things like this:
1. Models often want to emit imports or fully qualified names in the middle of code, because they can't go back and edit what they've already emitted to add an import line at the top. So a better language for an LLM is one that never requires you to move the cursor upwards as you type. Python and JS benefit here because you can run an import statement anywhere; Java and Kotlin are just about workable because you can write names out in full and importing is only a convenience; but languages that force you to declare imports at the very top of the file are going to be hell for an LLM. (There's a small illustration after this point.)
Taking this principle further, it may be useful to have a PL that lets you emit "delete last block" type tokens (a smarter ^H); a toy version of that is sketched below too. If the model emits code that it then realizes was wrong, it no longer has to commit to it and build on it anyway; it can wipe it and redo it. I've often noticed GPT-5 use "no-op" patterns when it emits patches, where it deletes a line and then immediately re-adds the exact same line, and I think it's because it changed what it wanted to do halfway through emitting a patch but had no way to stop except by doing a no-op.
The nice thing about this idea is that it's robust to model changes. For as long as we use auto-regression this will be a problem. Maybe diffusion LLMs find it easier but we don't use those today.
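To make the first point concrete, here's what "import at the point of use" looks like in TS today (node:crypto is a real Node module; the function itself is just an example):

```ts
// The model never has to move the cursor back up to a header section:
// a dynamic import works right where the dependency is first needed.
export async function fingerprint(payload: string): Promise<string> {
  const { createHash } = await import("node:crypto");
  return createHash("sha256").update(payload).digest("hex");
}
```

And the "smarter ^H" idea doesn't even need to be a language feature; it could be a post-processor over the model's raw output. Purely a sketch: the sentinel string and the blank-line "block" heuristic are made up for illustration.

```ts
const RETRACT = "<|delete_last_block|>";

// Whenever the model emits the sentinel, drop everything back to the previous
// blank line, i.e. retract the block it just decided was wrong.
function applyRetractions(output: string): string {
  let text = output;
  let idx: number;
  while ((idx = text.indexOf(RETRACT)) !== -1) {
    const before = text.slice(0, idx);
    const cut = before.lastIndexOf("\n\n");
    text = (cut === -1 ? "" : before.slice(0, cut + 1)) + text.slice(idx + RETRACT.length);
  }
  return text;
}
```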
2. As the article notes, models can struggle with counting indentation, especially when emitting patches. That suggests NOT using a whitespace-sensitive language like Python. I keep hearing that Python is the "language of AI", but objectively models do still sometimes make mistakes with indentation. In a brace-based language this isn't a problem: you can just mechanically reformat any file the LLM edits after it's done. In a whitespace-sensitive language that's not an option.
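The "reformat after every edit" step is trivial with braces. A sketch, assuming Prettier 3 (where format() is async):

```ts
import * as prettier from "prettier";

// However badly the model mangles indentation, the braces keep the structure
// unambiguous, so a mechanical reformat recovers clean code.
const mangled = `function total(xs: number[]) {
      let sum = 0;
  for (const x of xs) {
sum += x; }
   return sum; }`;

console.log(await prettier.format(mangled, { parser: "typescript" }));
```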
3. Heavy use of optional type inference. Types communicate a lot of context in a small number of tokens, but demanding that the model actually write out types is also inefficient (it knows in its activations what the types are meant to be). So what you want is to encourage the model to rely heavily on type inference even if the surrounding code is explicit, then use a CLI tool that automatically adds in the missing type annotations, i.e. you enrich the input and shrink the output. TypeScript, Kotlin etc are all good for this. Languages like Clojure, I think, not so good, despite being apparently token-efficient on the surface.
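The "enrich afterwards" tool is easy to prototype with the TypeScript compiler API. A minimal sketch (generated.ts is a placeholder filename, and a real tool would apply the annotations as text edits rather than just printing them):

```ts
import * as ts from "typescript";

// Find variable declarations the model left un-annotated and report the type
// the checker inferred for each one.
const fileName = "generated.ts";
const program = ts.createProgram([fileName], { strict: true });
const checker = program.getTypeChecker();
const source = program.getSourceFile(fileName)!;

function visit(node: ts.Node): void {
  if (ts.isVariableDeclaration(node) && !node.type && node.initializer) {
    const inferred = checker.typeToString(checker.getTypeAtLocation(node.name));
    const { line } = source.getLineAndCharacterOfPosition(node.getStart());
    console.log(`line ${line + 1}: ${node.name.getText()} : ${inferred}`);
  }
  ts.forEachChild(node, visit);
}

visit(source);
```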
4. In the same way you want to let the model import code halfway through a file, it'd be good to also be able to add dependencies halfway through a file, without needing to manually edit a separate file somewhere else. Even if it's redundant, you should be able to write something like "import('@foo/bar:1.2.3').SomeType.someMethod". Languages like JS/TS are the closest to this. You can't do it in most languages, where the definition of a package+version is very far, both textually and semantically, from the place where it's used.
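Deno's versioned npm: specifiers are probably the closest existing thing to this, if I remember the feature right; zod@3.23.8 is just an example package and version:

```ts
// The package name and version live at the point of use; there's no separate
// manifest to edit, so the model never has to jump to another file.
export async function parseUser(raw: unknown) {
  const { z } = await import("npm:zod@3.23.8");
  const User = z.object({ name: z.string(), age: z.number() });
  return User.parse(raw);
}
```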
5. Agree with the author that letting test and production code be interleaved sounds helpful. Models often forget to write tests but are good at following the style of what they see. If they see test code intermixed with the code they're reading and writing they're more likely to remember to add tests.
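Vitest's in-source testing is an existing way to get this in TS, if I have the mechanics right (it needs includeSource in the Vitest config, plus a define so bundlers drop the blocks from production builds); slugify is just an example:

```ts
export function slugify(title: string): string {
  return title.toLowerCase().trim().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "");
}

// Test code sits right next to the code it covers, so a model editing this
// file sees the testing style and is nudged to keep the tests up to date.
if (import.meta.vitest) {
  const { it, expect } = import.meta.vitest;
  it("turns a title into a slug", () => {
    expect(slugify("  Hello, World!  ")).toBe("hello-world");
  });
}
```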
There are probably dozens of ideas like these. The nice thing is, if you implement it as a pre-processor on top of some other language, you exploit the existing training data as much as possible, and the codebase it's working on effectively becomes 'training data' as well, just via ICL.