zlacker

1. valine+(OP) 2025-05-23 19:16:25
The dimensionality, I suppose, depends on the vocab size and your hidden dimension size, but that’s not really relevant. Going from latents to logits is a single linear projection.

Reasoning is definitely not happening in the linear projection to logits if that’s what you mean.
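
For illustration, here is a minimal NumPy sketch of that last step. The sizes, the random weights, and the names `project_to_logits` and `w_unembed` are assumptions made up for this example, not taken from any particular model. It only shows that the map from latents to logits is one matrix multiply, with an output shape set by the hidden size and the vocab size.

```python
import numpy as np

def project_to_logits(hidden, w_unembed):
    """Map final hidden states (latents) to vocabulary logits with one matmul."""
    return hidden @ w_unembed

# Illustrative sizes only (hypothetical, far smaller than a real model).
d_model, vocab_size, seq_len = 8, 50, 3

rng = np.random.default_rng(0)
hidden = rng.normal(size=(seq_len, d_model))        # one latent vector per position
w_unembed = rng.normal(size=(d_model, vocab_size))  # stand-in for learned weights

logits = project_to_logits(hidden, w_unembed)
print(logits.shape)  # (3, 50): one score per vocab entry, per position

# The map is linear: projecting a weighted sum of latents gives the
# same weighted sum of their logits.
a, b = hidden[0], hidden[1]
lhs = project_to_logits(2 * a + 3 * b, w_unembed)
rhs = 2 * project_to_logits(a, w_unembed) + 3 * project_to_logits(b, w_unembed)
print(np.allclose(lhs, rhs))
```

Because the step is linear, it can only re-express whatever is already in the latents as scores over the vocabulary.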

replies(1): >>pyinst+bJ
2. pyinst+bJ 2025-05-24 03:21:32
>>valine+(OP)
Where does it happen?
replies(1): >>valine+SG1
3. valine+SG1 2025-05-24 16:47:03
>>pyinst+bJ
My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.