zlacker

[parent] [thread] 3 comments
1. carboc+(OP)[view] [source] 2022-12-12 04:33:34
I would argue that this test isn't particularly informative. Given 5 attempts and 5 successes, even though the point estimate of accuracy is 1, the 95% CI ranges from 0.48 to 1:

    > binom.test(5,5,0.5)

     Exact binomial test

    data:  5 and 5
    number of successes = 5, number of trials = 5, p-value = 0.0625
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
     0.4781762 1.0000000
In other words, we don't have enough data in that small sample to reject the possibility that the model is 50% accurate, much less 99.9% accurate.
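To illustrate the interval above without R (a sketch, not from the original comment): when every one of the n trials succeeds, the exact (Clopper-Pearson) lower bound reduces to the closed form (α/2)^(1/n), which can be checked with plain Python:

```python
# Exact (Clopper-Pearson) 95% CI for 5 successes in 5 trials.
# When every trial succeeds (k == n), the lower bound has the
# closed form (alpha/2)**(1/n); the upper bound is 1.
n = 5          # trials, all successes
alpha = 0.05   # 1 - confidence level

lower = (alpha / 2) ** (1 / n)
print(f"95% CI: ({lower:.7f}, 1.0)")  # lower ≈ 0.4781762, matching binom.test

# Two-sided exact p-value against H0: p = 0.5
p_value = min(1.0, 2 * 0.5 ** n)
print(f"p-value: {p_value}")          # 0.0625, matching binom.test
```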
replies(1): >>virapt+S1
2. virapt+S1[view] [source] 2022-12-12 04:54:14
>>carboc+(OP)
I think the message was claiming something else: that each classification comes with a score for how confident the model is in the answer, and that in those cases the answers were scored 99.9%+.

See the app: https://huggingface.co/openai-detector/ - it gives a response as the % chance the text is genuine or from a chatbot.

replies(2): >>ivegot+0u >>carboc+T21
3. ivegot+0u[view] [source] [discussion] 2022-12-12 09:39:03
>>virapt+S1
Seems to have major biases toward who knows what sentence structures. Even without trying to make it say fake, for some of my own messages and text I put into it, it's pretty confident I'm GPT-2...
4. carboc+T21[view] [source] [discussion] 2022-12-12 14:27:54
>>virapt+S1
With 5 samples, we have no way to assess whether the app’s 99.9% self-assessment is remotely well calibrated. (As noted above, 5/5 is also consistent with a model that is right 50% of the time.)
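For a sense of scale (my own illustration, using the same closed form as above): with an unbroken run of correct answers, the exact 95% lower bound is (0.025)^(1/n), so the number of consecutive successes needed before the interval even excludes 99.9% accuracy is:

```python
import math

# Smallest n such that the exact 95% CI lower bound for n/n successes,
# (alpha/2)**(1/n), exceeds 0.999 -- i.e. how many consecutive correct
# classifications it takes to rule out anything below 99.9% accuracy.
alpha = 0.05
n = math.ceil(math.log(alpha / 2) / math.log(0.999))
print(n)  # 3688
```

So distinguishing 99.9% accuracy from merely good performance takes thousands of samples, not 5.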