If context is what disambiguates, then you have to use attention to resolve it, which is even more resource-intensive.
You want to be as stateless as possible. Your tokenizer should match your vocab and be unambiguous. I think your goal is sound, but you're golfing for the wrong metric.
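To make "stateless and unambiguous" concrete, here is a rough sketch of one way to get there, assuming a prefix-free vocab (no token is a prefix of another). The names `is_prefix_free` and `tokenize` and the toy vocab are made up for illustration, not taken from any particular library. With a vocab like this, each token is determined by the current position alone, so nothing downstream has to spend attention on disambiguation.

```python
def is_prefix_free(vocab):
    """Return True if no token is a prefix of another token."""
    tokens = sorted(vocab)
    # In sorted order, a token that is a prefix of any other token is
    # immediately followed by a token that starts with it.
    return all(not b.startswith(a) for a, b in zip(tokens, tokens[1:]))


def tokenize(text, vocab):
    """Split text into vocab tokens using only the current position (no state)."""
    max_len = max(map(len, vocab))
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first; with a prefix-free vocab at most
        # one length can match, so the result is unambiguous.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                out.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return out


vocab = {"a", "bc", "bd"}  # hypothetical toy vocab
assert is_prefix_free(vocab)
print(tokenize("abcbda", vocab))
```

A vocab such as `{"a", "ab"}` fails the check: after reading `a` you cannot tell which token you are in without looking ahead, and that is the kind of ambiguity that pushes the work onto context.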