GPTZero vs Originality.ai – I Tested Both This Weekend. Here’s the Data.
Tested both tools this weekend because I got tired of reading vague comparisons that don’t give numbers. Here’s what I found on 30 essays from my class (mix of known AI, known human, hybrid).
Test set: 10 confirmed AI-generated (I wrote the prompts), 10 confirmed human (I watched students write them), 10 flagged by Turnitin as suspicious (unknown ground truth).
GPTZero:
– AI essays correctly flagged: 8/10 (80%)
– Human essays false positives: 3/10 (30%) – this is bad
– Turnitin-flagged essays: 6/10 flagged by GPTZero also
Originality.ai:
– AI essays correctly flagged: 9/10 (90%)
– Human essays false positives: 2/10 (20%) – still bad
– Turnitin-flagged essays: 7/10 flagged also
Both tools: the “hybrid” problem is real. Essays that are 30-40% AI-assisted are basically invisible to both tools, and that's the most common real-world case.
My takeaway: Originality.ai has a slight accuracy edge, but both have unacceptable false positive rates for high-stakes use. Turnitin seems to flag different things than either of them, which is interesting.
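For anyone who wants to see how much uncertainty sits behind percentages built on 10 essays each, here is a minimal sketch that puts a Wilson score interval around each reported rate. The function name `wilson_interval` and the `results` dictionary are my own illustrative choices, the counts are copied from the results above, and z = 1.96 (roughly 95%) is an assumption rather than anything the original test specified.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Counts copied from the results in the post; labels are only for printing.
results = {
    "GPTZero detection (8/10)": (8, 10),
    "GPTZero false positives (3/10)": (3, 10),
    "Originality.ai detection (9/10)": (9, 10),
    "Originality.ai false positives (2/10)": (2, 10),
}

for label, (hits, n) in results.items():
    lo, hi = wilson_interval(hits, n)
    print(f"{label}: {hits / n:.0%}, interval {lo:.0%} to {hi:.0%}")
```

Under these assumptions, the 30% false positive rate comes with an interval of roughly 11% to 60%, so the point estimates here are best read as rough indications rather than precise measurements.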
6 Replies
Fascinating data. The 30% false positive rate on GPTZero matches my informal tests exactly. I'd been getting similar numbers but didn't have the same rigour. The hybrid detection gap is the real problem - that's where most actual AI use happens and none of these tools handle it well.
The false positive rate is the real story here. Everything else is noise. Detection accuracy on confirmed AI text is almost irrelevant if you're generating unacceptable false accusations against real students. Any tool with a 20%+ false positive rate on authentic student writing is not a policy-grade tool. It's a research prototype.
Been using GPTZero for a term and I'm considering switching after reading this. My main gripe has always been consistency - same essay, different day, noticeably different results. That alone makes it hard to defend if a student pushes back.
I just want one that works. That's it. Stop making me test three tools every semester.
The sample size is small but the methodology is sound. A 30% false positive rate means you'd wrongly flag about 3 in 10 genuinely human essays. In a class of 30 students who all wrote their own work, that's 9 false accusations waiting to happen. I'm not sure that's a tool any school should be using for formal decisions.
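The class-of-30 arithmetic can be sketched in a few lines. This assumes every essay in the class is genuinely human-written, that the measured false positive rates carry over unchanged, and that flags are independent from essay to essay; none of that is established by the test itself. The function names are hypothetical.

```python
def expected_false_flags(n_human: int, fpr: float) -> float:
    """Expected number of human essays wrongly flagged at a given false positive rate."""
    return n_human * fpr


def prob_at_least_one_false_flag(n_human: int, fpr: float) -> float:
    """Chance that at least one human essay is wrongly flagged (independence assumed)."""
    return 1 - (1 - fpr) ** n_human


class_size = 30  # assumption: all 30 essays are human-written

# False positive rates taken from the post's 10-essay human sample.
for tool, fpr in (("GPTZero", 0.30), ("Originality.ai", 0.20)):
    expected = expected_false_flags(class_size, fpr)
    at_least_one = prob_at_least_one_false_flag(class_size, fpr)
    print(f"{tool}: about {expected:.0f} false flags expected, "
          f"P(at least one) = {at_least_one:.4f}")
```

At either rate, at least one wrongful flag in a class this size is close to certain under these assumptions, which is the practical point for formal decisions.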
Ethan's right that neither is the definitive answer, but the Originality.ai false positive rate matters a lot for how you use it. If I'm running 120 essays I need confidence intervals, not just a score. The 10-point gap in false positives between the two (30% vs 20%) is meaningful when scaled.
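One way to check whether the gap between the two false positive counts (3/10 vs 2/10) holds up on a sample this small is a Fisher exact test. The sketch below is a plain standard-library implementation, not something from either vendor; the function name and the table layout (flagged / not flagged per tool) are my own assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def ways(k: int) -> int:
        # Number of tables with k in the top-left cell, margins held fixed.
        return comb(row1, k) * comb(row2, col1 - k)

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    observed = ways(a)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(ways(k) for k in range(k_min, k_max + 1) if ways(k) <= observed)
    return extreme / comb(total, col1)


# Rows: GPTZero, Originality.ai. Columns: human essays flagged, not flagged.
p_value = fisher_exact_two_sided(3, 7, 2, 8)
print(f"Two-sided Fisher exact p-value: {p_value:.3f}")
```

For these counts the p-value comes out at 1.0, so 10 human essays per tool cannot separate the two false positive rates. A larger labelled sample would be needed before leaning on the gap at the 120-essay scale.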