zlacker

1. janals+(OP)[view] [source] 2026-01-12 04:03:50
This is kind of just a measurement of how well represented a language is in the tokenizer's training distribution. You could have a single token equal to “public static void main”.
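
A toy sketch of how that could happen, assuming a character-level BPE trainer with no whitespace pre-splitting. Many real tokenizers pre-split on whitespace, which would block this particular merge. `train_bpe` and the corpus are made up for illustration, not taken from any real tokenizer:

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    symbols = list(text)  # start from single characters, no pre-tokenization
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more
        merges.append((a, b))
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

# Hypothetical corpus: the phrase repeated 50 times, separated by varying numbers.
corpus = "".join(f"public static void main{i}" for i in range(50))
merges, symbols = train_bpe(corpus, num_merges=30)
vocab = {a + b for a, b in merges}
print("public static void main" in vocab)
```

Because every pair inside the phrase is more frequent than any pair crossing its boundary, the merges collapse the whole phrase into one symbol before anything else.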
replies(4): >>muyuu+u >>moelf+25 >>make3+u6 >>crypto+Gi
2. muyuu+u[view] [source] 2026-01-12 04:09:29
>>janals+(OP)
You could, but you wouldn't when those keywords can all change in equivalent contexts.
replies(2): >>eru+q1 >>janals+vx2
◧◩
3. eru+q1[view] [source] [discussion] 2026-01-12 04:17:45
>>muyuu+u
What do you mean?

`public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.

replies(1): >>muyuu+x37
4. moelf+25[view] [source] 2026-01-12 05:00:33
>>janals+(OP)
The most efficient languages are pretty unpopular, so doesn't this argument make them even more efficient in reality?
5. make3+u6[view] [source] 2026-01-12 05:17:31
>>janals+(OP)
If you look at the list, you'll see that you're incorrect, as C and JavaScript are not at the top.

Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder whether it's just that curly brackets take a lot of tokens.

replies(1): >>xigoi+ag
◧◩
6. xigoi+ag[view] [source] [discussion] 2026-01-12 06:49:57
>>make3+u6
I imagine that having to write

  for (int index = 0; index < size; ++index)

instead of

  for index in 0...size

eats up a lot of tokens, especially in C, where you also need this construct for iterating over arrays.
7. crypto+Gi[view] [source] 2026-01-12 07:12:33
>>janals+(OP)
Well, yes. Looking beyond token efficiency, I find that the more constrained the language (stronger and richer static typing), the better and faster the LLM deals with it: fewer rounds of editing and debugging, ergo fewer tokens. C is a nightmare.
◧◩
8. janals+vx2[view] [source] [discussion] 2026-01-12 20:22:20
>>muyuu+u
The BPE or WordPiece tokenization algorithm will greedily take the longest valid token prefix. So if your text starts with “public static void main”, it will try to find the longest token that matches that prefix. Even if “public” is a token, it will prefer to tokenize “public static” together.
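
A minimal sketch of the longest-match behaviour described above, in the WordPiece style. Worth hedging: BPE inference typically replays its learned merges in priority order rather than doing a literal longest-prefix search, so the two can disagree in edge cases. `greedy_tokenize` and both vocabularies are hypothetical:

```python
def greedy_tokenize(text, vocab):
    """Repeatedly take the longest vocab entry that prefixes the remaining text."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry matches here: emit the single character as-is.
            tokens.append(text[i])
            i += 1
    return tokens

text = "public static void main"

# Hypothetical vocab containing both "public" and "public static".
vocab_with_phrase = {"public", "public static", " static", " void", " main"}
with_phrase = greedy_tokenize(text, vocab_with_phrase)

# Same vocab without the longer entry.
vocab_without_phrase = {"public", " static", " void", " main"}
without_phrase = greedy_tokenize(text, vocab_without_phrase)

print(with_phrase)
print(without_phrase)
```

With both entries available, the longer “public static” wins over “public”; remove it and the same text falls back to the shorter token.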
replies(1): >>muyuu+B37
◧◩◪
9. muyuu+x37[view] [source] [discussion] 2026-01-14 01:17:02
>>eru+q1
I meant that it wouldn't be efficient to agglomerate tokens in that way, and that's why the system won't do it.
◧◩◪
10. muyuu+B37[view] [source] [discussion] 2026-01-14 01:17:54
>>janals+vx2
Yes, but then you have both alternatives as tokens, which nullifies GP's argument.