zlacker

1. seanhu+(OP)[view] [source] 2026-02-04 15:49:17
I read that too, but I wondered whether elementwise error is the right metric. Surely the better test would be to evaluate a conventional transformer, then the same model with its attention mechanism replaced by this 4th-order Taylor approximation, and compare downstream performance?
replies(1): >>vlovic+V8
2. vlovic+V8[view] [source] 2026-02-04 16:27:46
>>seanhu+(OP)
A bounded elementwise error on the weights is, by definition, a stricter evaluation criterion than "performance" metrics obtained by running the model end to end.
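A sketch of why that ordering holds (my framing, not anything stated in the thread): an elementwise bound on the approximated weights controls any downstream metric that is Lipschitz in those weights, while a good score on one particular metric implies nothing about the weights themselves.

```latex
% \hat{A}: approximated attention weights, A: exact weights,
% f: any downstream evaluation that is L-Lipschitz w.r.t. the max norm
\max_{i,j} \bigl|\hat{A}_{ij} - A_{ij}\bigr| \le \varepsilon
\quad \Longrightarrow \quad
\bigl| f(\hat{A}) - f(A) \bigr| \le L\,\varepsilon
```

The converse direction fails: two very different weight matrices can yield the same benchmark score, so the elementwise bound is the stronger statement.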
replies(1): >>ehsanu+6W
3. ehsanu+6W[view] [source] [discussion] 2026-02-04 20:06:10
>>vlovic+V8
To spell it out for myself and others: if each individual attention block's computation approaches the exact one, the composition of those blocks approaches equivalent performance as well. And with an error bound approaching floating-point accuracy, performance should be practically identical to regular attention. Elementwise errors of that magnitude can't lead to any noteworthy change in the overall result, especially given how robust LLM networks appear to be to small perturbations.
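A toy sketch of the comparison under discussion (not the paper's implementation; `taylor4` is a stand-in for whatever polynomial the paper actually uses, and the sizes are arbitrary): swap the exp kernel in softmax attention for its 4th-order Taylor polynomial and measure the elementwise deviation of one block's output.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v, kernel):
    # scaled dot-product attention with a pluggable kernel in place of exp;
    # the row normalization is what lets a polynomial kernel mimic softmax
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = kernel(s)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def taylor4(s):
    # 4th-order Taylor polynomial of exp about 0; even-order truncations
    # of exp have no real roots, so the weights stay strictly positive
    return 1 + s + s**2 / 2 + s**3 / 6 + s**4 / 24

q, k, v = (rng.standard_normal((16, 32)) for _ in range(3))
exact = attention(q, k, v, np.exp)
approx = attention(q, k, v, taylor4)
print("max elementwise deviation:", np.max(np.abs(exact - approx)))
```

With standard-normal scores the deviation is small but visible; it shrinks as the scores concentrate near zero and grows with their spread, which is why an elementwise bound, rather than a single spot check, is the relevant guarantee.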