A bigger impact is that before sampling you have to project the final hidden state onto the whole vocabulary: a matrix multiply of something like 4096×250000. Bigger vocab = more FLOPs per token.
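Rough back-of-the-envelope for that projection, using the illustrative sizes above (a minimal sketch, not any particular model's config):

```python
# FLOPs for projecting one hidden state onto the vocab logits
hidden, vocab = 4096, 250_000
flops_per_token = 2 * hidden * vocab  # one multiply + one add per weight
print(f"{flops_per_token / 1e9:.1f} GFLOPs per token")  # ~2.0 GFLOPs
```

And that cost is paid for every single generated token, so it scales linearly with vocab size.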
On a GPU this is parallelized, so the cost may not feel quite linear if everything fits nicely in memory. On a CPU you're going to struggle a lot more.
This is why the token embedding table is the juiciest target when shrinking models. ALBERT, for example, factorized the whole embedding table into two low-rank matrices.
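A sketch of that factorization in PyTorch; the sizes here are illustrative, not ALBERT's actual config:

```python
import torch.nn as nn

vocab, hidden, e = 250_000, 4096, 128

# Full table: one big vocab x hidden matrix
full = nn.Embedding(vocab, hidden)  # 250k * 4096 ~ 1.0B params

# ALBERT-style: small embedding dim e, then project up to hidden
factored = nn.Sequential(
    nn.Embedding(vocab, e),            # 250k * 128 = 32M params
    nn.Linear(e, hidden, bias=False),  # 128 * 4096 ~ 0.5M params
)
```

The vocab-sized matrix only has to be e wide instead of hidden wide, so the parameter count drops by roughly hidden/e (about 30x in this sketch) at the cost of one extra small matmul.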