Even if the model has the capacity to abstract beyond the patterns, those patterns are still very likely to influence how well it can do so.
For example, shortly after GPT-4 was released, people claimed it couldn't solve variations on the goat, wolf, and cabbage problem.
I found that it could solve these variations 100% of the time; you just needed to explicitly prompt it to repeat the adjectives alongside the nouns and to change the nouns to emojis. The repetition worked similarly to CoT, biasing the generation towards the variation and away from the original form, while the emojis in place of the nouns further broke the token associations that were leading it to fail by reproducing the original solution.
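To make that concrete, here's a minimal sketch of the kind of prompt rewrite I'm describing, assuming a "reversed" variation where the wolf eats the cabbage and the goat eats the wolf. The puzzle wording, emoji choices, and the `rewrite_prompt` helper are illustrative, not the exact prompts from back then.

```python
# A "reversed" variation of the classic puzzle: here the wolf eats the
# cabbage and the goat eats the wolf, so the memorized solution is wrong.
PUZZLE = (
    "A farmer needs to cross a river with a cabbage-eating wolf, "
    "a wolf-eating goat, and a cabbage. The boat holds the farmer "
    "plus one item, and nothing may be left alone with something it eats. "
    "How does the farmer get everything across?"
)

# Swap the load-bearing nouns for emojis to break the token associations
# with the original, heavily memorized form of the puzzle.
NOUN_TO_EMOJI = {
    "wolf": "🐺",
    "goat": "🐐",
    "cabbage": "🥬",
}

def rewrite_prompt(puzzle: str) -> str:
    text = puzzle
    for noun, emoji in NOUN_TO_EMOJI.items():
        text = text.replace(noun, emoji)
    # Explicitly ask for the adjective to be repeated with each item at
    # every step, which biases generation towards the variation's rules.
    instruction = (
        "At every step, restate each item together with its full description "
        "(e.g. 'the 🥬-eating 🐺') before deciding what to move, then give "
        "your answer."
    )
    return f"{text}\n\n{instruction}"

if __name__ == "__main__":
    print(rewrite_prompt(PUZZLE))
```

The rewritten prompt can then be passed to the model as-is; the point is only that the substitution and forced repetition happen before the model ever sees the puzzle.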
So while it's possible that, with enough finessing, you could get a model to critique its own work as well as it critiques the work of others, if the training data has a clear pattern of bias between those two, why actively ignore it?
It's a bit like sanding against the grain vs with it. You can sand against the grain of the training data and, with enough effort and a sophisticated enough model, potentially get the result you want. But maybe your life will be a lot easier if you identify the grain in the data first and sand along with it instead?