Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens

>>nyrikk+(OP)
Man that "Unreasonable Effectiveness of ..." pattern is getting a bit overused. With the original paper [1] you could still say that there really is some deeply philosophical mystery. But they now slap that on everything.

[1] https://en.m.wikipedia.org/wiki/The_Unreasonable_Effectivene...

>>ngruhn+0o
TIL. I am not from an engineering/physics background so for me the original Unreasonable Effectiveness paper was Karpathy’s blog post about RNNs.

>>dkga+sK
(Karpathy's might be more a call back to Halevy, Norvig, and Pereira's "The Unreasonable Effectiveness of Data"[0].)

But I think it is a good example that fits the OP's critique. (I don't think the critique applies to the arXiv paper, even though I expected its main results; see my main comment.)

The "unreasonableness" in Karpathy's post[1] is using sequencing to process non-sequential data. But this isn't unreasonable, because we explicitly expect that non-sequential processes can be reformulated as sequential ones.

The SVHN (house numbers) example he shows is actually a great illustration of this. We humans don't process that all at once. Our eyes similarly dart around, even if very fast. Or think about how we draw a picture: we don't do everything at once, but work in sections, building up, with layers that end up being ordered even though this technically isn't a requirement. I'm actually struggling to think of things that cannot be broken down into sequences. He says as much here:

  | an important point to realize is that even if your inputs/outputs are fixed vectors, it is still possible to use this powerful formalism to process them in a sequential manner.

So really the question is: what part of this was unreasonable? Or what part was unexpected? Honestly, we should expect this, since neural nets are themselves sequential: data is processed layer by layer. Hell, every computer program has a trace, which is sequential. I can give tons of examples. So it is quite reasonable that sequential processing should work.
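
To make that concrete, here is a toy sketch (mine, not from Karpathy's post) of a fixed-size input handled both "all at once" and as a sequence of steps that carry state forward. The 4x4 `image`, the function names, and the choice of a pixel sum as the computation are all made up for illustration; a real RNN would learn its step function rather than use a hand-written one.

```python
from functools import reduce

def process_all_at_once(image):
    """Treat the image as one fixed input and sum every pixel in a single pass."""
    return sum(pixel for row in image for pixel in row)

def step(state, row):
    """One sequential step: fold a single row (a 'glimpse') into the running state."""
    return state + sum(row)

def process_sequentially(image):
    """Feed the same fixed-size input one row at a time, carrying state forward."""
    return reduce(step, image, 0)

# Hypothetical 4x4 binary "image"; any fixed-size grid would do.
image = [
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]

# Both formulations compute the same result.
assert process_all_at_once(image) == process_sequentially(image)
```

The input never had an inherent order; the row-by-row order is just one we imposed, and the result is unchanged.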

[0] https://static.googleusercontent.com/media/research.google.c...

[1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
