>>nyrikk+(OP)
why is it unreasonable that giving the llm a spot to think and collate long range attention and summarize without the pressure of building a meaningful next token so quickly would result in higher effectiveness?
>>throwa+Id
It's more about the lack of semantic meaning in the intermediate tokens, not that they aren't effective (even when the intermediates are wrong)