GPTZero vs. Turnitin vs. Originality: Which Catches More?
I see teachers debating which AI detector is “best” all the time, so I did a direct comparison of the three most popular options: GPTZero, Turnitin’s AI detection, and Originality.ai.
I tested each with the same set of ten texts: five human-written (ranging from ESL student work to published academic writing) and five AI-generated (ranging from raw ChatGPT output to heavily humanized text).
For raw, unedited AI text, all three performed similarly well. GPTZero caught 5/5, Turnitin caught 5/5, and Originality caught 5/5. No real difference at this level.
For lightly edited AI text (student makes minor changes), things diverged. GPTZero caught 4/5, Turnitin caught 4/5, Originality caught 3/5. The misses were different texts for each tool, which is interesting.
For heavily humanized AI text (run through an AI humanizer tool), accuracy dropped across the board. GPTZero caught 2/5, Turnitin caught 2/5, Originality caught 1/5. At least in my small sample, humanizer tools were effective at defeating detection.
For human-written text (false positive test), GPTZero falsely flagged 1/5, Turnitin flagged 0/5, and Originality flagged 1/5. Both false positives were on ESL student writing, which matches the pattern I’ve seen before.
My overall impression: Turnitin had the lowest false positive rate and matched GPTZero’s detection accuracy. GPTZero is the best free option. Originality.ai is decent but slightly less reliable at both detection and false positive avoidance based on my small sample.
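If you want to check the arithmetic behind that impression, here is a minimal Python sketch that tallies the counts reported above. It assumes each category (raw, lightly edited, heavily humanized, human-written) counts as five separate trials per tool, which is how the x/5 figures read. The names `FLAGGED`, `AI_CATEGORIES`, and `summarize` are just illustrative labels, not anything from the tools themselves.

```python
# Counts as reported in the post: texts flagged as AI, out of 5 per category.
# Assumption: each category is treated as five separate trials per tool.
FLAGGED = {
    "GPTZero":     {"raw": 5, "lightly_edited": 4, "humanized": 2, "human": 1},
    "Turnitin":    {"raw": 5, "lightly_edited": 4, "humanized": 2, "human": 0},
    "Originality": {"raw": 5, "lightly_edited": 3, "humanized": 1, "human": 1},
}
PER_CATEGORY = 5
AI_CATEGORIES = ("raw", "lightly_edited", "humanized")


def summarize(flagged):
    """Pool the AI categories into one detection rate per tool and
    compute the false positive rate from the human-written texts."""
    total_ai = PER_CATEGORY * len(AI_CATEGORIES)
    summary = {}
    for tool, counts in flagged.items():
        caught = sum(counts[c] for c in AI_CATEGORIES)
        summary[tool] = {
            "caught": caught,
            "ai_total": total_ai,
            "detection_rate": caught / total_ai,
            "false_positive_rate": counts["human"] / PER_CATEGORY,
        }
    return summary


if __name__ == "__main__":
    for tool, s in summarize(FLAGGED).items():
        print(
            f"{tool}: caught {s['caught']}/{s['ai_total']} AI texts "
            f"({s['detection_rate']:.0%}), "
            f"false positive rate {s['false_positive_rate']:.0%}"
        )
```

Pooled that way, GPTZero and Turnitin each come out at 11/15 and Originality at 9/15. With only five texts per category, a single text moves a rate by 20 points, so treat the gaps as rough.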
But the real takeaway is that none of them are reliable against humanized text. If a student is determined enough to use both an AI generator and a humanizer, current detection technology won’t consistently catch them.
Have you done your own comparisons? What were your results?
6 Replies
This is EXACTLY what I needed. I've been trying to explain to my department head why we can't just rely on Turnitin scores and now I have actual data to back it up. Sharing this with my whole team tomorrow!
false positive rate is the real story here. everything else is noise.
yeah same experience here
im new to all this ai stuff. my school just told us to 'use our judgment' which is super helpful lol. this thread is really helping me understand whats going on tho
This aligns with the research I've been reading. Hao et al. (2025) found similar patterns in their cross-institutional study. The consistency across different contexts is notable.
The detection arms race reminds me of when schools tried to ban cell phones in the early 2000s. Complete waste of energy. The technology won. We adapted our teaching instead. I suspect the same will happen here. Five years from now, we'll look back at AI detection the way we look back at phone bans.