zlacker

1. jacob0+(OP)[view] [source] 2025-05-23 19:08:23
I don't think that's accurate. The logits actually have high dimensionality, and they are intermediate outputs used to sample tokens. The latent representations contain contextual information and are also high-dimensional, but they serve a different role: they feed into the logits.
replies(1): >>valine+v1
2. valine+v1[view] [source] 2025-05-23 19:16:25
>>jacob0+(OP)
The dimensionality, I suppose, depends on the vocab size and your hidden dimension size, but that's not really relevant: it's a single linear projection to go from latents to logits (see the sketch below).

Reasoning is definitely not happening in the linear projection to logits if that’s what you mean.

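To make that concrete, here's a minimal sketch of that final step, using numpy and toy made-up sizes (real models call this weight the lm_head or unembedding matrix):

    import numpy as np

    hidden_dim, vocab_size = 64, 1000                          # toy sizes, not a real model
    rng = np.random.default_rng(0)

    latent = rng.standard_normal(hidden_dim)                   # final-layer hidden state for one position
    W_unembed = rng.standard_normal((hidden_dim, vocab_size))  # the lm_head / unembedding weights

    logits = latent @ W_unembed                                # one score per vocab token
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                       # softmax; the next token is sampled from probs

Whatever the model "knows" about the context is already baked into the latent by the time it hits this projection.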
replies(1): >>pyinst+GK
3. pyinst+GK[view] [source] [discussion] 2025-05-24 03:21:32
>>valine+v1
Where does it happen?
replies(1): >>valine+nI1
4. valine+nI1[view] [source] [discussion] 2025-05-24 16:47:03
>>pyinst+GK
My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.
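For what it's worth, here's a toy numpy sketch of what "many attention heads working together" looks like mechanically; the sizes and weights are random placeholders, and this is a single attention layer, not a claim about where reasoning actually lives:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, n_heads = 8, 64, 4
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)

    x = rng.standard_normal((seq_len, d_model))       # token representations entering the layer
    Wq, Wk, Wv = (rng.standard_normal((n_heads, d_model, d_head)) for _ in range(3))
    Wo = rng.standard_normal((n_heads * d_head, d_model))

    heads = []
    for h in range(n_heads):                          # each head is one "bird"
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))     # how each position attends to the others
        heads.append(attn @ v)

    out = np.concatenate(heads, axis=-1) @ Wo         # the "flock": heads combined by one projection

Any flock-like behavior would come from stacking dozens of these layers, each with many heads, not from any single head.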