AI-Detection Bias and False Positives

From Metopedia



AI-Detection Bias and False Positives is a Metopedia article on a 2026 comparative detector study by Andrew Lehti, focusing on AI-detection false positives, detector disagreement, stylistic bias, and academic-integrity risk.

AI-Detection Bias and False Positives
Comparative detector study
Full title: AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors
Author: Andrew Lehti
Publication date: February 28, 2026
DOI: 10.6084/m9.figshare.31439995
Subject: AI-detection reliability, false positives, detector bias, academic-integrity policy
Corpus: Five text samples: three human essays/comments, one seventh-grade human essay, one AI-generated essay
Main claim: AI-detection scores vary sharply by detector and appear sensitive to polish, formatting, and structural regularity
Archive: Internet Archive PDF

AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors is a 2026 paper by Andrew Lehti examining the reliability of common AI-authorship detectors when applied to human-written and AI-generated text samples.[1] The study compares detector outputs on older human writing, contemporary AI writing, informal student writing, and additional control texts. It argues that many detectors appear to penalize formal structure, grammatical consistency, and polished academic style rather than identifying a stable signature of machine authorship.

The paper is situated within broader debate over AI-content detection in education. Published research has found that available AI-detection tools can be unreliable, inconsistent, and vulnerable to paraphrasing or obfuscation.[2] Other research has reported bias against non-native English writing, raising concerns about fairness in academic and evaluative settings.[3]

The central finding of Lehti's paper is that a polished human essay from 2016 received a higher average AI-detection score than a 2026 AI-generated essay. The paper interprets this as evidence of a “polish penalty”: the tendency of detectors to associate structural competence, formal tone, and regular formatting with artificial generation.[1]

Background

AI-detection systems are software tools that attempt to estimate whether text was written by a human, generated by a language model, or produced through a mixture of human and machine assistance. These tools are often used in academic settings to support plagiarism screening, authorship review, and academic-integrity investigations. Their outputs are commonly expressed as percentages, risk labels, or probability bands.

The growth of generative language models increased pressure on schools, universities, publishers, and online platforms to distinguish human writing from model-generated writing. In practice, this task is difficult because modern language models are trained on large corpora of human writing, including academic prose, web articles, essays, forum posts, documentation, and informal discussion. A model's output can resemble the average style of the same written ecosystem from which human writers also learn.

Lehti's paper argues that this creates a convergence problem. AI systems learn from human writing; human writers increasingly read AI-influenced text; and ordinary writing tools now include grammar correction, tone rewriting, autocomplete, and one-click revision. As a result, the boundary between “human style” and “machine style” becomes less stable over time.[1]

Publication and context

The paper is part of Lehti's broader Metopedia and cognitive-psychology corpus. It includes an introductory advisory on “Cognitive Impasse,” a concept used by the author to describe resistance to ideas that contradict prior beliefs. The paper also describes the author's method as “Extrapolative Trial by Error,” a process in which independent observation and synthesis precede review of external academic literature.[1]

Although those framing sections are unusual for a conventional detector-evaluation paper, the empirical core of the document is a comparative table of detector outputs across multiple writing samples. The paper also includes appendices containing control cases, full detector tables, error-rate summaries, and graphs.

Research question

The study asks whether public and commercial AI-detection systems can reliably distinguish AI-generated content from older and contemporary human-generated writing. It focuses on false positives, detector disagreement, and the possibility that detectors may classify polished human writing as AI-generated because of surface-level features.

The specific concerns examined include:

  • whether detector outputs agree across tools;
  • whether older pre-ChatGPT human writing is misclassified as AI;
  • whether informal or lower-structure writing is more likely to be treated as human;
  • whether formatting, punctuation, and academic polish affect AI scores;
  • whether “humanizer” tools reduce or increase AI-detection scores;
  • whether academic institutions should use AI detectors as determinative evidence.

Methodology

The initial comparison used three main texts:

Sample | Date | Origin | Length | Style
2016 Human Essay | 2016 | Human-written | 7,062 words; about 46,000 characters | Polished, semi-academic prose
2026 AI Essay | 2026 | AI-generated | 1,115 words | Mixed formal and informal register
2007 Human Essay | 2007 | Human-written seventh-grade essay | 898 words | Informal, beginner-level prose

Each document was submitted to multiple AI-detection systems. Where a detector required chunking because of word limits, the reported value was rounded or averaged. Later appendices added two additional controls: a 2026 human essay and a 2026 human Reddit comment.[1]
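
The chunk-and-average step can be sketched as follows. This is a minimal illustration, not the paper's procedure: the paper does not say whether chunk scores were weighted, so the word-weighted mean used here is an assumption, and `score_chunk` is a hypothetical stand-in for whichever detector is being queried.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def chunked_score(text: str, max_words: int,
                  score_chunk: Callable[[str], float]) -> float:
    """Word-weighted mean of per-chunk AI scores (0-100).

    score_chunk is a hypothetical callable standing in for a detector.
    Weighting by chunk length is an assumption, not the paper's stated method.
    """
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError("text contains no words")
    total_words = sum(len(chunk.split()) for chunk in chunks)
    weighted = sum(score_chunk(chunk) * len(chunk.split()) for chunk in chunks)
    return weighted / total_words
```

Weighting by length keeps a short final chunk from counting as much as a full one. An unweighted mean of chunk scores would be an equally plausible reading of "averaged" and can give a different document-level figure.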

The study reports detector outputs as AI-likelihood percentages. These numbers are treated as the tools' own probability-like claims rather than as independently validated probabilities.

Detectors tested

The paper reports scores from a range of AI-detection systems and model-based judgments, including:

  • AIDetector
  • ChatGPT Extended Thinking
  • Content Detector AI
  • Copyleaks
  • Dechecker
  • Decopy AI
  • Detecting-AI
  • Detector IO
  • eduwriter AI
  • Gemini 3 Thinking
  • GPTInf
  • GPTZero
  • Grammarly
  • NoteGPT
  • OpenL IO
  • originalityAI
  • Pangram
  • Quillbot
  • Sapling AI
  • Scribbr
  • undetectableAI
  • Winston AI
  • YouScan
  • ZeroGPT

The paper notes that Quillbot and Scribbr may share a backend or exhibit similar scoring behavior, but that subscription and input limits produced differing values in the reported tests.[1]

Main comparison

The first major table compared the 2016 human essay with the 2026 AI essay.

Detector 2016 Human Essay 2026 AI Essay
Copyleaks 96.8% 99.9%
ZeroGPT 54.51% 31.67%
GPTZero 98% 92%
Gemini 3 Thinking 80–90% 90–95%
ChatGPT Extended Thinking 35–45% 75–85%
Quillbot 29% 54%
Sapling AI 100% 29%
Grammarly 45% 15%
AIDetector 2.35% 4.75%
Scribbr 37.5% 26%
undetectableAI 67% 72%
originalityAI 100% 100%
Pangram 100% 100%
NoteGPT 28.9% 53.26%
GPTInf 100% 100%
eduwriter AI 49% 29%
Winston AI 99% 61%

The reported summary statistics were:

Sample | Mean AI score | Median | Standard deviation | Range
2016 Human Essay | 66.59% | 67.00% | 32.78% | 2.35% to 100%
2026 AI Essay | 59.25% | 57.50% | 33.69% | 4.75% to 100%

The study emphasizes that the human-written 2016 essay received a higher average AI score than the AI-generated 2026 essay. Both samples also produced wide ranges, from near-zero values to 100% AI classifications, depending on the detector.[1]
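
The summary statistics can be rechecked from the main comparison table. The sketch below assumes that ranges reported by Gemini and ChatGPT are entered as midpoints (as the Appendix C table does) and that the standard deviation is the sample (n − 1) form. The variable names are illustrative.

```python
from statistics import mean, median, stdev

# AI scores (%) from the main comparison table, in table order.
# Ranges are entered as midpoints (e.g. 80-90% -> 85), an assumption
# consistent with the values listed in the Appendix C table.
human_2016 = [96.8, 54.51, 98, 85, 40, 29, 100, 45, 2.35, 37.5,
              67, 100, 100, 28.9, 100, 49, 99]
ai_2026 = [99.9, 31.67, 92, 92.5, 80, 54, 29, 15, 4.75, 26,
           72, 100, 100, 53.26, 100, 29, 61]


def summarize(scores):
    """Return mean, median, sample standard deviation, and range."""
    return {
        "mean": round(mean(scores), 2),
        "median": round(median(scores), 2),
        "stdev": round(stdev(scores), 2),  # sample (n - 1) standard deviation
        "range": (min(scores), max(scores)),
    }


for name, scores in [("2016 human essay", human_2016), ("2026 AI essay", ai_2026)]:
    print(name, summarize(scores))
```

On the scores as transcribed here, the 2016 essay comes out at a mean of 66.59 and a median of 67, matching the reported row. The 2026 AI essay comes out at a mean of 61.18 and a median of 61, which matches the Appendix C figures rather than the 59.25% and 57.50% in the summary row above.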

Informal writing control

The 2007 seventh-grade human essay served as a low-polish control. It contained informal narration, uneven structure, inconsistent punctuation, and beginner-level prose. Most tools classified it as overwhelmingly human.

Detector AI score for 2007 human essay
AIDetector 1.75%
Scribbr 1%
Quillbot 1%
Sapling AI 26.4%
NoteGPT 16.15%
GPTZero 2%
Winston AI 5%
undetectableAI 9%
ZeroGPT 16.15%
eduwriter AI 16%
OpenL IO 16%
YouScan 15%
ChatGPT Extended Thinking 10–25%
Gemini 3 Pro Thinking 5–10%

Lehti interprets the results as evidence that imperfection, irregularity, and lower structural control may be treated by detectors as human markers.[1]

Appendix controls

Reddit comment control

Appendix A examined an informal Reddit comment described as “clearly not AI.” The comment included spelling mistakes, repetitive personal details, uneven rhythm, and non-linear narration. Most detectors assigned low AI scores.

Detector Result
Copyleaks 0% AI
ZeroGPT 3.24% AI
GPTZero 0% AI
Quillbot 0% AI
Detecting-AI 29.1%
Sapling AI 0% AI
Grammarly 0% AI
AIDetector 2.37% AI
Scribbr 0% AI
undetectableAI 5% AI
originalityAI 0% AI
Pangram 0% AI
NoteGPT 3.27%
GPTInf 0%
eduwriter AI 3% AI
Winston AI 2%
OpenL IO 3% AI
YouScan 15% AI

The paper treats this sample as a qualitative control because most detectors converged on a human classification, while a minority still reported non-trivial AI likelihood.[1]

2026 human essay control

Appendix B examined a 2026 first-person human essay. The author states that it was not polished, corrected, revised, or generated with tools. It was 3,502 words and 22,372 characters.

Detector Result
Copyleaks 77.4% AI
ZeroGPT 7.1% AI
GPTZero 38.31% AI
Quillbot 13% AI
Detecting-AI 30.6%
Sapling AI 27.6% AI
Grammarly 30% AI
AIDetector 12.47%
Scribbr 19%
undetectableAI 77% AI
originalityAI 100% AI
Pangram 39% AI
NoteGPT 7.87%
GPTInf 11% AI
eduwriter AI 17% AI
Winston AI 3%
OpenL IO 7% AI
YouScan 85% AI
Detector IO 21% AI
Content Detector AI 18% AI
Decopy AI 57% AI
Dechecker 8% AI

For this control, the reported mean was 33.60%, the median was 24.3%, the standard deviation was 28.31%, and the range was 3% to 100%. The paper uses these values to argue that even human-written personal writing can be classified as AI by some systems when it contains enough coherent structure.[1]

Full detector table

Appendix C expanded the results into a five-condition comparison.

Detector | 2026 Human Essay | 2007 Human Essay | 2016 Human Essay | 2026 AI Essay | 2026 Human Comment
AIDetector | 12.5% | 1.8% | 2.4% | 4.8% | 2.4%
ChatGPT | 65.0% | 17.5% | 40.0% | 80.0% | 7.5%
Content Detector AI | 18.0% | 0.0% | 73.0% | 0.0% | 23.0%
Copyleaks | 77.4% | 0.0% | 96.8% | 99.9% | 0.0%
Dechecker | 8.0% | 16.0% | 16.0% | 31.0% | 3.2%
Decopy AI | 57.0% | 46.0% | 51.0% | 34.0% | 32.0%
Detecting-AI | 30.6% | 42.6% | 48.3% | 67.9% | 29.1%
Detector IO | 21.0% | 0.0% | 21.0% | 70.0% | 0.0%
GPTInf | 11.0% | 0.0% | 100.0% | 100.0% | 0.0%
GPTZero | 38.3% | 2.0% | 98.0% | 92.0% | 0.0%
Gemini | 35.0% | 7.5% | 85.0% | 92.5% | 0.0%
Grammarly | 30.0% | 1.0% | 45.0% | 15.0% | 0.0%
NoteGPT | 7.9% | 16.2% | 28.9% | 53.3% | 3.3%
OpenL IO | 7.0% | 16.0% | 37.0% | 42.0% | 3.0%
Pangram | 39.0% | 0.0% | 100.0% | 100.0% | 0.0%
Quillbot | 13.0% | 1.0% | 29.0% | 54.0% | 0.0%
Sapling AI | 27.6% | 26.4% | 100.0% | 29.0% | 0.0%
Scribbr | 19.0% | 1.0% | 37.5% | 26.0% | 0.0%
Winston AI | 3.0% | 5.0% | 99.0% | 61.0% | 2.0%
YouScan | 85.0% | 15.0% | 15.0% | 25.0% | 15.0%
ZeroGPT | 7.1% | 16.2% | 54.5% | 31.7% | 3.2%
eduwriter AI | 17.0% | 16.0% | 49.0% | 29.0% | 3.0%
originalityAI | 100.0% | 0.0% | 100.0% | 100.0% | 0.0%
undetectableAI | 77.0% | 9.0% | 67.0% | 72.0% | 5.0%
Mean | 33.60% | 3.67% | 66.59% | 61.18% | 5.49%
Median | 24.30% | 1.00% | 67.00% | 61.00% | 2.19%
Standard deviation | 28.31% | 7.00% | 32.78% | 33.57% | 9.42%

The table divides the samples into “structured” and “low-quality” categories. The 2007 student essay and 2026 Reddit comment are described as poor-structure, inconsistent, or low-flow writing. The 2016 and 2026 human essays are described as semi-formal or experience-driven, while the 2026 AI essay is described as fully generated AI.[1]

Detector error rates

Appendix C also reports error-rate categories.

Category Reported error rate
Human Detection Error Rate 26.95%
AI Detection Error Rate 45.4%
Semi-Formal Error Rate 45.8%
Low-Quality Writing Error Rate 8.1%

In the paper's terminology, the human error rate is the average AI probability assigned to human-written texts. The AI error rate is the complement for the AI-generated essay: 100% minus the average AI probability the detectors assigned to it. Semi-formal error refers to the structured human essays (2016 and 2026), while low-quality writing error refers to the seventh-grade essay and the Reddit comment.[1]
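
One reading of these definitions can be checked against the Appendix C detector rows. The sketch below assumes each category is an unweighted mean over detectors followed by an unweighted mean over the relevant samples. The paper's exact formula is not stated in this article, so that aggregation is an assumption, and the names `SCORES` and `error_rates` are illustrative.

```python
from statistics import mean

# AI scores (%) per detector from the Appendix C table. Column order:
# 2026 human essay, 2007 human essay, 2016 human essay, 2026 AI essay,
# 2026 human comment.
SCORES = {
    "AIDetector": (12.5, 1.8, 2.4, 4.8, 2.4),
    "ChatGPT": (65.0, 17.5, 40.0, 80.0, 7.5),
    "Content Detector AI": (18.0, 0.0, 73.0, 0.0, 23.0),
    "Copyleaks": (77.4, 0.0, 96.8, 99.9, 0.0),
    "Dechecker": (8.0, 16.0, 16.0, 31.0, 3.2),
    "Decopy AI": (57.0, 46.0, 51.0, 34.0, 32.0),
    "Detecting-AI": (30.6, 42.6, 48.3, 67.9, 29.1),
    "Detector IO": (21.0, 0.0, 21.0, 70.0, 0.0),
    "GPTInf": (11.0, 0.0, 100.0, 100.0, 0.0),
    "GPTZero": (38.3, 2.0, 98.0, 92.0, 0.0),
    "Gemini": (35.0, 7.5, 85.0, 92.5, 0.0),
    "Grammarly": (30.0, 1.0, 45.0, 15.0, 0.0),
    "NoteGPT": (7.9, 16.2, 28.9, 53.3, 3.3),
    "OpenL IO": (7.0, 16.0, 37.0, 42.0, 3.0),
    "Pangram": (39.0, 0.0, 100.0, 100.0, 0.0),
    "Quillbot": (13.0, 1.0, 29.0, 54.0, 0.0),
    "Sapling AI": (27.6, 26.4, 100.0, 29.0, 0.0),
    "Scribbr": (19.0, 1.0, 37.5, 26.0, 0.0),
    "Winston AI": (3.0, 5.0, 99.0, 61.0, 2.0),
    "YouScan": (85.0, 15.0, 15.0, 25.0, 15.0),
    "ZeroGPT": (7.1, 16.2, 54.5, 31.7, 3.2),
    "eduwriter AI": (17.0, 16.0, 49.0, 29.0, 3.0),
    "originalityAI": (100.0, 0.0, 100.0, 100.0, 0.0),
    "undetectableAI": (77.0, 9.0, 67.0, 72.0, 5.0),
}


def column_mean(index):
    """Unweighted mean AI score across all detectors for one sample."""
    return mean(row[index] for row in SCORES.values())


human_2026, human_2007, human_2016, ai_2026, comment_2026 = (
    column_mean(i) for i in range(5)
)

# Assumed aggregation: average the per-sample means within each category.
error_rates = {
    "human": mean([human_2026, human_2007, human_2016, comment_2026]),
    "ai": 100 - ai_2026,  # share of probability not assigned to "AI"
    "semi_formal": mean([human_2026, human_2016]),
    "low_quality": mean([human_2007, comment_2026]),
}

for category, rate in error_rates.items():
    print(f"{category}: {rate:.2f}%")
```

Under this reading the four values land within about 0.1 percentage points of the reported 26.95%, 45.4%, 45.8%, and 8.1%. That suggests the categories were derived from the detector rows directly, although the paper's own calculation may differ in detail.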

Graphical findings

The paper includes five appendix graphs:

  • Detector AI-Probability Heatmap — a heatmap showing detector-by-sample AI scores across all five texts.
  • Detector Performance by Text Sample — a bar chart showing mean AI-probability scores by sample. The 2016 human essay and the 2026 AI essay receive much higher mean scores than the 2007 essay and 2026 Reddit comment.
  • Detector Error Scatterplot — a detector-by-detector comparison of human false-positive tendency and AI false-negative tendency.
  • Detector Tradeoff Between Accusing Humans and Catching AI — a scatterplot placing detectors by mean AI probability on human texts and AI probability on the AI essay. The paper states that a strong detector would appear in the upper-left region: low human false accusations and high AI detection.
  • Category-Level Error Rates — a horizontal bar chart summarizing human error, AI error, semi-formal error, and low-quality-writing error rates.[1]
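
The coordinates used in the tradeoff scatterplot can be sketched from the Appendix C rows. The example below uses a five-detector excerpt of that table and assumes the horizontal axis is the unweighted mean AI score over the four human texts. `ROWS` and `tradeoff_point` are illustrative names, not the paper's.

```python
from statistics import mean

# Excerpt of the Appendix C table, AI scores in %.
# Each entry: ([2026 human essay, 2007 essay, 2016 essay, 2026 comment], AI essay)
ROWS = {
    "AIDetector": ([12.5, 1.8, 2.4, 2.4], 4.8),
    "Copyleaks": ([77.4, 0.0, 96.8, 0.0], 99.9),
    "Grammarly": ([30.0, 1.0, 45.0, 0.0], 15.0),
    "Pangram": ([39.0, 0.0, 100.0, 0.0], 100.0),
    "Winston AI": ([3.0, 5.0, 99.0, 2.0], 61.0),
}


def tradeoff_point(human_scores, ai_score):
    """Return (x, y): mean AI score on human texts, AI score on the AI essay."""
    return mean(human_scores), ai_score


points = {name: tradeoff_point(h, a) for name, (h, a) in ROWS.items()}

# Sort by x so detectors that accuse humans least are listed first.
for name, (x, y) in sorted(points.items(), key=lambda item: item[1][0]):
    print(f"{name:12s} human-text mean = {x:5.2f}%   AI essay = {y:5.1f}%")
```

A detector in the upper-left region would combine a small x with a large y. In this excerpt no detector does both: the low-x detectors also score the AI essay low, and the high-y detectors also score human texts high on average.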

The paper presents these graphics as visual support for its claim that detector behavior clusters around style and structure rather than around a clean authorship boundary.

Stylistic ecosystem convergence

One of the paper's main interpretive sections argues that AI detection is structurally unstable because both humans and AI participate in the same writing ecosystem. Language models are trained on human writing from online platforms, academic works, social media, forums, and professional writing. Human writers, in turn, learn from the same public internet and increasingly encounter AI-generated prose.

The paper describes this as a feedback loop:

  1. AI produces structured text.
  2. Humans read and absorb the structures.
  3. Humans adopt some of those structures.
  4. Human writing moves closer to AI training distributions.
  5. Detectors must distinguish overlapping distributions.

This argument resembles broader concerns in machine-learning classification: when two classes become statistically entangled, classifiers lose stable decision boundaries. In authorship detection, the relevant signal is not only what a text looks like, but whether its stylistic features uniquely identify its source. Lehti argues that they increasingly do not.[1]
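
The general classification point can be illustrated with a toy model that is not taken from the paper. Assume a single stylistic feature that is normally distributed with equal variance in both classes, and assume human and AI texts are equally common. The best achievable accuracy of any threshold is then Φ(d/2), where d is the gap between class means in standard deviations.

```python
from statistics import NormalDist


def best_possible_accuracy(separation: float) -> float:
    """Accuracy of the optimal threshold for two equal-variance Gaussian classes.

    separation is the distance between the class means, in standard deviations.
    With equal priors the optimal cut is the midpoint, giving Phi(separation / 2).
    This is a toy model for illustration, not a model of any real detector.
    """
    return NormalDist().cdf(separation / 2)


for d in (3.0, 2.0, 1.0, 0.5, 0.0):
    print(f"separation {d:.1f} sd -> best accuracy {best_possible_accuracy(d):.3f}")
```

As the separation shrinks toward zero, the best possible accuracy falls toward 0.5, which is chance. In this simplified setting, no choice of threshold recovers a boundary that the feature distributions no longer contain.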

Formatting and structural bias

The paper argues that formatting choices can affect detector scores even when semantic content remains unchanged. Reported factors include:

  • greater length;
  • em-dash frequency;
  • semicolon and colon usage;
  • structured section headings;
  • markdown formatting;
  • vocabulary density;
  • lexical precision;
  • consistent punctuation;
  • grammatical confidence.

Lehti describes this as a practical institutional concern because a student or writer might be penalized for writing clearly, using formal organization, or presenting work in a clean academic format.[1]

AI authorship and AI-assisted revision

The paper distinguishes full AI authorship from AI-assisted revision. Full AI authorship is described as a model generating the main substance of a work from limited prompting. AI-assisted revision is described as human authorship followed by software-based improvements to clarity, grammar, flow, tone, punctuation, or redundancy.

This distinction is significant because modern writing environments often include AI-like correction systems. Email clients, word processors, grammar checkers, and online editors may suggest rewritten sentences, tone adjustments, and punctuation changes. These changes can regularize a document without replacing the author's ideas.

Lehti argues that detection systems often treat authorship and revision as the same category. A human-authored document that has been lightly smoothed may acquire features that detectors associate with generated text: regular syntax, coherent structure, reduced redundancy, and consistent punctuation.[1]

The polish penalty hypothesis

The “polish penalty” is the paper's name for the observed tendency of AI detectors to assign higher AI probabilities to cleaner, more structured writing. Under this hypothesis, detectors may interpret formal competence as artificiality.

The paper contrasts three kinds of samples:

Writing type | Detector tendency in the study | Interpretation
Informal, irregular, error-prone writing | Usually scored as human | Irregularity is treated as a human signal
Semi-formal human writing | Often scored as partially or highly AI | Structure and polish raise suspicion
AI-generated semi-formal writing | Scored inconsistently, sometimes lower than human writing | Detectors do not share a stable AI signal

The hypothesis does not require that every detector behave identically. Instead, it claims that a general relationship appears across the study: as structural competence increases, AI attribution becomes more likely.[1]

Humanization paradox

The paper reports an additional test in which a human-written academic text initially received a mean AI score of 44.7%. After being processed through AI “humanization” or detector-bypass tools, the mean AI score increased to 76.4%.

Lehti describes this as paradoxical because a tool intended to make text appear more human made the human text appear more AI-generated. The paper interprets this as evidence that detectors are highly sensitive to surface-level statistical changes and may not be measuring authorship origin directly.[1]

Academic implications

The paper argues that AI detectors should not be used as determinative evidence in academic misconduct cases. The reasoning is based on three claims:

  1. false positives can accuse human writers of misconduct;
  2. detector disagreement makes tool choice outcome-determinative;
  3. polished academic writing can be misread as AI-generated writing.

This position is consistent with caution expressed by outside institutions and researchers. Vanderbilt University disabled Turnitin's AI detector in 2023, citing concerns about transparency, false positives, and the consequences of a 1% false-positive rate when applied to large numbers of student papers.[4] Turnitin has published its own explanation of false positives and its AI-writing report behavior, including thresholds designed to reduce false-positive risk.[5][6]
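
The scale effect of a small false-positive rate can be shown with simple arithmetic. The paper volume, prevalence, and detection rate below are hypothetical inputs chosen for illustration and do not come from Vanderbilt, Turnitin, or Lehti's paper.

```python
def expected_false_flags(num_human_papers: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers wrongly flagged as AI."""
    return num_human_papers * false_positive_rate


def flag_precision(prevalence: float, true_positive_rate: float,
                   false_positive_rate: float) -> float:
    """Share of flagged papers that are actually AI-written (Bayes' rule)."""
    flagged_ai = prevalence * true_positive_rate
    flagged_human = (1 - prevalence) * false_positive_rate
    return flagged_ai / (flagged_ai + flagged_human)


# Hypothetical inputs: 50,000 human-written papers, 1% false-positive rate.
print(expected_false_flags(50_000, 0.01))

# Hypothetical inputs: 5% of papers AI-written, 90% of those detected.
print(round(flag_precision(0.05, 0.90, 0.01), 3))
```

Under these assumed inputs, a 1% rate still implies hundreds of wrongly flagged papers. The share of flags that are correct depends strongly on how common AI-written submissions actually are, which an institution usually does not know.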

Relation to existing research

Lehti's findings align with several external concerns in the literature:

  • Weber-Wulff et al. tested multiple AI-generated-text detectors and concluded that available tools were not accurate or reliable enough for dependable academic use.[2]
  • Liang et al. found that GPT detectors misclassified non-native English writing as AI-generated and warned against unfair evaluative use.[3]
  • Institutional guidance has often urged instructors to treat detector scores as signals rather than proof.[4]
  • Detector companies themselves acknowledge the possibility of false positives, although they may claim lower rates than independent or field-level critiques suggest.[5]

Lehti's contribution is narrower and more autobiographical in sample selection, but it adds a specific claim: polished pre-AI human writing can be scored as AI at a higher mean level than an AI-generated essay tested in the same comparison.[1]

Limitations

The paper has several limitations:

  • Small sample set. The core study uses a limited number of texts, including personally selected examples.
  • Detector opacity. The internal criteria of commercial detectors are not independently known.
  • Changing tools. AI detectors update over time, so exact scores may not reproduce later.
  • Input limits. Some tools required chunking or limited submissions, which may affect comparability.
  • No blind institutional sample. The study does not use a large randomized corpus of verified human and AI writing.
  • Percentage comparability. Different tools may define “AI percentage” differently.
  • Model-based judgments. ChatGPT and Gemini outputs are not detector scores in the same sense as commercial detector percentages.

These limitations do not eliminate the paper's concern about false positives, but they affect how broadly its numeric results can be generalized.

Terminology

AI detector
A tool that estimates whether text was generated by artificial intelligence.
False positive
A human-written text incorrectly classified as AI-generated.
False negative
AI-generated text incorrectly classified as human-written.
Polish penalty
Lehti's proposed term for the tendency of detectors to assign higher AI probability to cleaner, more structured, or more academically polished writing.
Stylistic convergence
The overlap between human and AI writing styles caused by shared corpora, AI-assisted writing tools, and human exposure to AI-generated prose.
AI-assisted revision
Human-authored writing that has been edited, clarified, or polished with machine assistance, without the machine generating the core argument or evidence.

Summary of findings

The paper's main findings can be summarized as follows:

Finding | Evidence in paper
Detector outputs vary sharply across systems | Human and AI samples ranged from near-zero to 100% AI depending on tool
The 2016 human essay scored higher than the AI essay on average | 66.59% mean for the 2016 human essay versus 59.25–61.18% for the 2026 AI essay depending on table version
Informal writing was treated as human | The 2007 essay and 2026 Reddit comment received low mean AI scores
Structured human writing was more vulnerable to false positives | Semi-formal human essays had much higher AI scores than low-quality controls
“Humanizer” tools may backfire | A human text reportedly rose from 44.7% to 76.4% after humanization
Academic enforcement use is risky | The paper argues detector scores should not serve as decisive evidence

References

  1. Andrew Lehti, AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors, figshare, 2026. DOI: 10.6084/m9.figshare.31439995. Archived PDF: Internet Archive.
  2. Debora Weber-Wulff et al., “Testing of detection tools for AI-generated text,” International Journal for Educational Integrity, 2023. DOI: 10.1007/s40979-023-00146-z.
  3. Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou, “GPT detectors are biased against non-native English writers,” Patterns, 2023. DOI: 10.1016/j.patter.2023.100779.
  4. Vanderbilt University Brightspace Support, “Guidance on AI Detection and Why We're Disabling Turnitin's AI Detector,” 2023.
  5. Turnitin, “Understanding false positives within our AI writing detection capabilities,” 2023.
  6. Turnitin, “Using the AI Writing Report,” 2026.