zlacker


I suppose the dimensionality depends on the vocab size and your hidden dimension size (the projection matrix is hidden size by vocab size), but that’s not really relevant. It’s a single linear projection from latents to logits.

Reasoning is definitely not happening in the linear projection to logits, if that’s what you mean.
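
To make the "single linear projection" point concrete, here is a minimal sketch in plain Python. The sizes, the random weights, and the names `project_to_logits` and `unembed` are made up for illustration, and it assumes no bias term and ignores any final layer norm a real model applies first. It only shows that the map from a hidden state to logits is one matrix multiply, so it treats any weighted mix of hidden states as the same weighted mix of their logits.

```python
import random

def project_to_logits(hidden, unembed):
    """Map one hidden-state vector (length d_model) to logits (length vocab_size).

    `unembed` is a vocab_size x d_model matrix stored as a list of rows.
    Hypothetical helper for illustration; assumes no bias term.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

# Toy sizes, chosen only for illustration (real models are far larger).
D_MODEL, VOCAB_SIZE = 4, 6
rng = random.Random(0)

# Random stand-in for the learned projection weights.
unembed = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]

h1 = [rng.uniform(-1, 1) for _ in range(D_MODEL)]
h2 = [rng.uniform(-1, 1) for _ in range(D_MODEL)]

# One logit per vocabulary entry.
logits = project_to_logits(h1, unembed)

# Linearity check: projecting 2*h1 + 3*h2 equals 2*proj(h1) + 3*proj(h2).
mixed = [2 * a + 3 * b for a, b in zip(h1, h2)]
lhs = project_to_logits(mixed, unembed)
rhs = [2 * x + 3 * y
       for x, y in zip(project_to_logits(h1, unembed),
                       project_to_logits(h2, unembed))]
is_linear = all(abs(l - r) < 1e-9 for l, r in zip(lhs, rhs))
print(len(logits), is_linear)
```

Whatever structure ends up in the logits has to already be present in the hidden state; the projection only re-expresses it in vocabulary space.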

replies(1): >>pyinst+bJ

>>valine+(OP)
Where does it happen?

replies(1): >>valine+SG1

>>pyinst+bJ
My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.
