Seems to me the outcome would be near random because they're so poorly suited, which might manifest as
> We also found that the models were highly sensitive to seemingly trivial prompt changes
Since they're so general, you need to explore whether and how you can use them in your domain. Guessing that "they're poorly suited" is just that: guessing. In particular:
> We also found that the models were highly sensitive to seemingly trivial prompt changes
This is obvious to anyone who has seriously looked at deploying these; that's why there are some very successful startups in the evals space.
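A minimal sketch of what that exploration looks like in practice (everything here is hypothetical; `call_model` stands in for whichever provider client you actually use): run the same task through several trivially different phrasings and measure how often the answers agree.

```python
from collections import Counter

# Hypothetical stand-in for your actual model client (OpenAI, Anthropic, local, ...).
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your provider's chat-completion call")

# Trivially different phrasings of the same task; a robust model/prompt pair
# should answer all of them the same way.
VARIANTS = [
    "Classify the sentiment of this review as positive or negative: {text}",
    "Is the following review positive or negative? {text}",
    "Review: {text}\nSentiment (positive/negative):",
]

def agreement(text: str) -> float:
    """Fraction of prompt variants that agree with the majority answer."""
    answers = [call_model(v.format(text=text)).strip().lower() for v in VARIANTS]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# Anything well below 1.0 on your own domain's data means the "trivial prompt
# changes" problem is biting you, and you need proper evals before shipping.
```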
I have a really nice bridge to sell you...
This "failure" is just a grab at trying to look "cool" and "innovative" I'd bet. Anyone with a modicum of understanding of the tooling (or hell experience they've been around for a few years now, enough for people to build a feeling for this), knows that this it's not a task for a pre-trained general LLM.