Our experience (https://arxiv.org/abs/2410.16107) is that LLMs like GPT-4o have a particular writing style, including both vocabulary and distinct grammatical features, regardless of the type of text they're prompted with. The style is informationally dense, features longer words, and favors certain grammatical structures (like participles; GPT-4o loooooves participles).
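For anyone curious, here's a rough sketch of how markers like these can be measured. This isn't the paper's actual pipeline; I'm just assuming spaCy and its en_core_web_sm tagger for illustration:

```python
# Rough sketch: quantify two of the style markers mentioned above --
# average word length and participle use -- with spaCy's POS tagger.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def style_markers(text: str) -> dict:
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    # VBG = present participle/gerund, VBN = past participle (Penn Treebank tags)
    participles = [t for t in words if t.tag_ in ("VBG", "VBN")]
    return {
        "avg_word_len": sum(len(t.text) for t in words) / max(len(words), 1),
        "participle_rate": len(participles) / max(len(words), 1),
    }

# Compare a plain human sentence against a more "GPT-ish" rewrite:
print(style_markers("The cat sat on the mat and watched the rain."))
print(style_markers("Perched on the mat, the cat observed the falling rain, captivated."))
```

Run over large samples of human vs. model text, differences in numbers like these are what make the style measurable rather than just a vibe.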
With Llama we're able to compare base and instruction-tuned models, and it's the instruction-tuned models that show the biggest differences. Evidently the AI companies are (deliberately or not) introducing particular writing styles with their instruction-tuning process. I'd like to get access to more base models to compare and figure out why.
Still, perhaps saying "copy" was a bit misleading. "Influence" would have been a more precise way of putting it. After all, there is no such thing as a "normal" writing style in the first place.
So long as you're communicating with anything or anyone, I find you naturally absorb the parts you like, usually without even noticing.
The language it uses is peculiar. It's like the entire model is a little bit ESL.
I suspect this pattern comes from SFT and RLHF, not from the optimizer, the base architecture, or the pre-training dataset choices, and that the base model itself would behave much more in line with other base models. But I could be wrong.
Goes to show just how "entangled" these AIs are, and how easy it is to affect them in unexpected ways through training. Base models have a vast set of "styles" and "language usage patterns" they could draw from, but instruct-tuning turns one particular subset of base-model features into the "default" persona, shaping the writing style the AI uses from then on.