This seems like a research dead end to me; the fundamentals aren't there.
That's why we teach them new tricks on the fly (in-context learning) with instruction files: the new behavior lives in the prompt, not in the weights.
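To make that concrete, here's a minimal sketch of the instruction-file idea (the file name and prompt format are hypothetical, not any particular tool's convention):

```python
from pathlib import Path

def build_prompt(user_question: str, instructions_path: str = "INSTRUCTIONS.md") -> str:
    # The "trick" is just text prepended to the prompt. The model conditions
    # on it at inference time; no gradient update touches the weights.
    instructions = Path(instructions_path).read_text()
    return f"{instructions}\n\nUser: {user_question}\nAssistant:"

print(build_prompt("How do I run the test suite?"))
```

The whole lesson disappears the moment the context window does, which is exactly the limitation being pointed at.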
For example, DeepSeek has done some interesting things with attention (e.g., multi-head latent attention) via changes to the structures and algorithms, but all of it is still optimized by gradient descent, which is why models don't learn facts and such from a single pass. It takes many passes to refine the weights that go into the math.
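A toy sketch of why one pass isn't enough (a deliberately simplified stand-in, not how a real LLM is trained): each gradient-descent step only nudges a weight a fraction of the way toward encoding the target, so "storing a fact" takes many updates.

```python
def train(steps: int, lr: float = 0.1) -> float:
    w = 0.0          # weight starts far from the target
    target = 1.0     # the "fact" we want the weight to encode
    for _ in range(steps):
        grad = 2 * (w - target)  # d/dw of squared error (w - target)^2
        w -= lr * grad           # one gradient-descent update
    return w

print(train(1))    # ~0.2 after a single pass: the fact is barely learned
print(train(100))  # ~1.0 after many passes: the weight has converged
```

Swapping in fancier attention changes what the forward pass computes, but the update rule above is still the only way new information gets into the weights.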