zlacker

[parent] [thread] 3 comments
1. noosph+(OP)[view] [source] 2026-01-11 00:28:00
I've found that short symbols cause collisions with other tokens in the LLM's vocabulary. It is generally much better to have long descriptive names for everything in a language than short ones.

An example that shocked me was using an XML translation of C for better vector search. The lack of curly braces made the model return much more relevant code than anything else I tried, including enriching the database with ctags.

>GlyphLang is specifically optimized for how modern LLMs tokenize.

This is extremely dubious. The vocabulary of tokens isn't conserved even within model families, let alone across entirely different types of models. The only thing they are all good at is tokenizing English.
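To make the non-conservation concrete, here's a quick sketch (assuming the tiktoken package; the snippet is made up) that runs the same symbol-heavy line through three OpenAI encodings - the counts and splits typically come out different even within one vendor's lineup:

    import tiktoken  # pip install tiktoken

    # Hypothetical symbol-dense line; not from any real codebase.
    snippet = "@route('/users/:id') => fetch_user(id)"

    # Three generations of OpenAI encodings, same string.
    for name in ("gpt2", "cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        ids = enc.encode(snippet)
        print(name, len(ids), [enc.decode([i]) for i in ids])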

replies(1): >>goose0+06
2. goose0+06[view] [source] 2026-01-11 01:23:09
>>noosph+(OP)
The collision point is interesting, but I'd argue context disambiguates. If I'm understanding you correctly, I don't think the models are confused about whether they're looking at an email address when `@` appears before a route pattern. These symbols are heavily represented in programming contexts (e.g. Python decorators, shell scripts, etc.), so LLMs have seen them plenty of times in code. I'd be interested if you shared your findings, though! It's definitely an issue I'd like to avoid, or at least mitigate somewhat.

On tokenizer variance, that's an absolutely fair point that vocabularies differ, but the symbols GlyphLang uses are ASCII characters that tokenize as single tokens across the GPT-4, Claude, and Gemini tokenizers. The optimization isn't model-specific; rather, it targets the common case of "ASCII char = 1 token". I could definitely reword my post though - looking at it more closely, it does read more as "fix-all" rather than "fix-most".
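For the GPT side at least, this is easy to check with tiktoken (rough sketch; the symbol list below is just illustrative, not GlyphLang's actual set, and Claude/Gemini would need their own token-counting APIs):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding

    # Illustrative ASCII symbols; print how many tokens each becomes.
    for sym in ["@", "#", "$", "|", "->", "=>", "::"]:
        ids = enc.encode(sym)
        print(f"{sym!r}: {len(ids)} token(s) -> {ids}")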

Regardless, I'd genuinely be interested in seeing failure cases. It would be incredibly useful data to see if there are specific patterns where symbol density hurts comprehension.

replies(2): >>noosph+Yx3 >>sitkac+iD4
3. noosph+Yx3[view] [source] [discussion] 2026-01-12 07:08:22
>>goose0+06
Collision is perhaps the wrong word. But LLMs definitely have trouble disambiguating different symbols of a language that map to similar tokens.

Way back in the GPT-3.5 days I could never get the model to parse even the simplest grammar until I replaced the one-letter production rules with one-word ones, e.g. S vs. Start. A bit like how they couldn't figure out the number of r's in strawberry.
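To illustrate the kind of rewrite I mean (toy expression grammar, not the one I was actually using):

    # one-letter nonterminals
    S -> E
    E -> E + T | T
    T -> T * F | F
    F -> ( E ) | id

    # one-word nonterminals
    Start  -> Expr
    Expr   -> Expr + Term | Term
    Term   -> Term * Factor | Factor
    Factor -> ( Expr ) | id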

4. sitkac+iD4[view] [source] [discussion] 2026-01-12 14:49:39
>>goose0+06
If context disambiguates, then you have to spend attention on it, which is even more resource-intensive.

You want to be as state-free as possible. Your tokenizer should match your vocab and be unambiguous. I think your goal is sound, but you're golfing for the wrong metric.
