zlacker

[return to "Which programming languages are most token-efficient?"]
1. janals+Xj[view] [source] 2026-01-12 04:03:50
>>tehnub+(OP)
This is kind of just a measurement of how well represented a language is in the tokenizer's training distribution. You could have a single token equal to “public static void main”.
◧◩
2. muyuu+rk[view] [source] 2026-01-12 04:09:29
>>janals+Xj
You could, but you wouldn't, since each of those keywords can vary independently in equivalent contexts.
◧◩◪
3. janals+sR2[view] [source] 2026-01-12 20:22:20
>>muyuu+rk
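WordPiece encoding greedily takes the longest vocabulary entry that matches a prefix of the remaining text. BPE gets there differently, by replaying its learned merges in priority order, but frequent sequences still tend to end up as single tokens. So if your text starts with “public static void main”, the tokenizer looks for the longest token that matches that prefix. Even if “public” is a token, it will generally prefer to tokenize “public static” together when that sequence is in the vocabulary (provided pre-tokenization has not already split the text at the space).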
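To make the longest-match idea concrete, here is a minimal sketch in Python. It is not how any particular tokenizer library is implemented: the function name `greedy_tokenize` and the toy vocabulary are made up for illustration, there is no whitespace pre-tokenization, and real BPE encoders apply ranked merges rather than scanning for the longest prefix.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match-first: at each position, take the longest
    vocabulary entry that is a prefix of the remaining text."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Nothing in the vocabulary matches: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary, not taken from any real tokenizer.
vocab = {"public", "public static", " static", " void", " main"}
text = "public static void main"

with_merge = greedy_tokenize(text, vocab)
without_merge = greedy_tokenize(text, vocab - {"public static"})
print(with_merge)
print(without_merge)
```

With “public static” in the vocabulary the text comes out as three tokens; remove that entry and the same text needs four, because “public” is then the longest match at the start.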