AI Detection · Posted by Nate Bernier

5 AI Detection Tools – Controlled Experiment. Here’s My Data.

Ran a controlled experiment this spring because I wanted actual numbers, not opinions. I teach physics in Quebec.

Setup: 40 writing samples from my classes. 20 were written by students in class (confirmed human). 20 were ChatGPT-generated responses to the same prompts (confirmed AI). I did not use any humanizer on the AI samples – this tested raw detection.

Tools tested: GPTZero, Turnitin, Originality.ai, and Proofademic, plus a free tool, Writer.com, as a baseline.

Results on confirmed AI text (correctly flagged):
– GPTZero: 16/20 (80%)
– Turnitin: 15/20 (75%)
– Originality.ai: 17/20 (85%)
– Proofademic: 17/20 (85%)
– Writer.com: 12/20 (60%)

Results on confirmed human text (false positives):
– GPTZero: 4/20 flagged (20%)
– Turnitin: 3/20 (15%)
– Originality.ai: 5/20 (25%)
– Proofademic: 2/20 (10%)
– Writer.com: 5/20 (25%)

Bottom line: Proofademic and Originality.ai tied for the best detection rate (85%), but Proofademic had the lowest false positive rate (10%), while Originality.ai tied with Writer.com for the highest (25%). Turnitin was better on false positives than I expected (15%). All tools did much worse on humanized text – I tested that separately and the results dropped dramatically.
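For anyone who wants to redo the arithmetic, here is a minimal Python sketch that takes the two tallies above and works out detection rate, false positive rate, and overall accuracy across all 40 samples. The names `RESULTS` and `summarize` are just mine for illustration, and with only 20 samples per group these percentages are rough.

```python
# Counts copied from the tallies in this post:
# (AI samples flagged out of 20, human samples flagged out of 20)
RESULTS = {
    "GPTZero": (16, 4),
    "Turnitin": (15, 3),
    "Originality.ai": (17, 5),
    "Proofademic": (17, 2),
    "Writer.com": (12, 5),
}
N_AI = 20      # confirmed AI samples
N_HUMAN = 20   # confirmed human samples


def summarize(ai_flagged, human_flagged, n_ai=N_AI, n_human=N_HUMAN):
    """Return (detection rate, false positive rate, overall accuracy)."""
    detection_rate = ai_flagged / n_ai
    false_positive_rate = human_flagged / n_human
    # Correct calls = AI samples flagged + human samples left unflagged.
    correct = ai_flagged + (n_human - human_flagged)
    accuracy = correct / (n_ai + n_human)
    return detection_rate, false_positive_rate, accuracy


SUMMARY = {tool: summarize(*counts) for tool, counts in RESULTS.items()}

if __name__ == "__main__":
    # Print tools from highest to lowest overall accuracy.
    for tool, (tpr, fpr, acc) in sorted(SUMMARY.items(), key=lambda kv: -kv[1][2]):
        print(f"{tool:15s} detection {tpr:.0%}  false positives {fpr:.0%}  overall {acc:.1%}")
```

On these counts Proofademic gets 35/40 right overall, GPTZero, Turnitin, and Originality.ai all get 32/40, and Writer.com gets 27/40, so overall accuracy on its own hides the differences in false positives.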

1 Reply

Excellent methodology. The separation of detection accuracy from false positive rate is exactly what most comparisons miss. Tools that look similar on detection accuracy can be very different on false positives, and for practical classroom use, the false positive rate matters more. A false positive creates a problem. A missed detection is... just a student who maybe cheated and didn't get caught. Not ideal but less immediately harmful.
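A rough way to see why the false positive rate dominates in a real classroom: a sketch that applies the rates from the original post to a made-up batch of submissions. The batch size and the share of AI-written work are pure assumptions for illustration, not data from the experiment, and `expected_flags` is a hypothetical helper name.

```python
# Rates reported in the original post: (detection rate, false positive rate)
RATES = {
    "GPTZero": (0.80, 0.20),
    "Turnitin": (0.75, 0.15),
    "Originality.ai": (0.85, 0.25),
    "Proofademic": (0.85, 0.10),
    "Writer.com": (0.60, 0.25),
}


def expected_flags(detection_rate, false_positive_rate, n_submissions, ai_share):
    """Expected outcomes for a batch, given an ASSUMED share of AI-written work."""
    n_ai = n_submissions * ai_share
    n_human = n_submissions - n_ai
    true_flags = n_ai * detection_rate            # AI work correctly flagged
    false_flags = n_human * false_positive_rate   # honest work wrongly flagged
    missed = n_ai - true_flags                    # AI work that slips through
    total_flags = true_flags + false_flags
    wrong_share = false_flags / total_flags if total_flags else 0.0
    return {
        "true_flags": true_flags,
        "false_flags": false_flags,
        "missed": missed,
        "wrong_share": wrong_share,  # fraction of all flags that hit honest work
    }


if __name__ == "__main__":
    # Hypothetical scenario: 100 submissions, 10% of them AI-written.
    for tool, (tpr, fpr) in RATES.items():
        r = expected_flags(tpr, fpr, n_submissions=100, ai_share=0.10)
        print(f"{tool:15s} wrongly flagged {r['false_flags']:.1f}  "
              f"missed {r['missed']:.1f}  share of flags that are wrong {r['wrong_share']:.0%}")
```

When most submissions are honest, even a modest false positive rate is applied to a much larger group than the detection rate is, so wrongly flagged students can outnumber the cheaters who slip through. Change `ai_share` to match your own guess and the balance shifts.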