Seems to me that the outcome would be near random because they are so poorly suited, which might manifest as
> We also found that the models were highly sensitive to seemingly trivial prompt changes
Since they're so general, you need to explore if and how you can use them in your domain. Guessing 'they're poorly suited' is just that: guessing. In particular:
> We also found that the models were highly sensitive to seemingly trivial prompt changes
This is more or less obvious to anyone who has seriously looked at deploying these; that's why there are some very successful startups in the evals space.