zlacker

1. muyuu+(OP)[view] [source] 2026-01-12 04:09:29
You could, but you wouldn't when those keywords can all change in equivalent contexts.
replies(2): >>eru+W >>janals+1x2
2. eru+W[view] [source] 2026-01-12 04:17:45
>>muyuu+(OP)
What do you mean?

`public` might be a token by itself, even though `pub` also occurs in other contexts.

replies(1): >>muyuu+337
3. janals+1x2[view] [source] 2026-01-12 20:22:20
>>muyuu+(OP)
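WordPiece tokenization greedily takes the longest vocabulary entry that matches a prefix of the remaining text, and BPE often ends up with a similar result by applying its learned merges in priority order. So if your text starts with “public static void main”, it will look for the longest token that matches that prefix. Even if “public” is a token, it will prefer to tokenize “public static” together, provided that longer string is in the vocabulary and the pre-tokenizer doesn't split on the space first.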
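
A minimal sketch of the greedy longest-match rule, assuming a tiny hand-written vocabulary (`greedy_tokenize` and `TOY_VOCAB` are made up for illustration, not taken from any real tokenizer). It is closer to WordPiece-style matching than to actual BPE, and it skips the whitespace pre-splitting that most real tokenizers do first:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match-first tokenization over a toy vocabulary."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest candidate first and shrink until one is in the vocabulary.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary; real vocabularies are learned from data.
TOY_VOCAB = {"pub", "public", "public static", " static", " void", " main"}

print(greedy_tokenize("public static void main", TOY_VOCAB))
print(greedy_tokenize("pub crawl", TOY_VOCAB))
```

With this vocabulary, “public static” wins over “public” in the first input, while “pub” still gets used in the second, so having the longer entry doesn't stop the shorter ones from being tokens too.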
replies(1): >>muyuu+737
4. muyuu+337[view] [source] [discussion] 2026-01-14 01:17:02
>>eru+W
I meant that agglomerating tokens that way wouldn't be efficient, which is why the system won't do it.
5. muyuu+737[view] [source] [discussion] 2026-01-14 01:17:54
>>janals+1x2
Yes, but then you have both alternatives as tokens, which nullifies GP's argument.