AI-Detection Bias and False Positives
| AI-Detection Bias and False Positives | |
|---|---|
| Comparative detector study | |
| Full title | AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors |
| Author | Andrew Lehti |
| Publication date | February 28, 2026 |
| DOI | 10.6084/m9.figshare.31439995 |
| Subject | AI-detection reliability, false positives, detector bias, academic-integrity policy |
| Corpus | Five text samples: three human essays/comments, one seventh-grade human essay, one AI-generated essay |
| Main claim | AI-detection scores vary sharply by detector and appear sensitive to polish, formatting, and structural regularity |
| Archive | Internet Archive PDF |
AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors is a 2026 paper by Andrew Lehti examining the reliability of common AI-authorship detectors when applied to human-written and AI-generated text samples.[1] The study compares detector outputs on older human writing, contemporary AI writing, informal student writing, and additional control texts. It argues that many detectors appear to penalize formal structure, grammatical consistency, and polished academic style rather than identifying a stable signature of machine authorship.
The paper is situated within broader debate over AI-content detection in education. Published research has found that available AI-detection tools can be unreliable, inconsistent, and vulnerable to paraphrasing or obfuscation.[2] Other research has reported bias against non-native English writing, raising concerns about fairness in academic and evaluative settings.[3]
The central finding of Lehti's paper is that a polished human essay from 2016 received a higher average AI-detection score than a 2026 AI-generated essay. The paper interprets this as evidence of a “polish penalty”: the tendency of detectors to associate structural competence, formal tone, and regular formatting with artificial generation.[1]
Background
AI-detection systems are software tools that attempt to estimate whether text was written by a human, generated by a language model, or produced through a mixture of human and machine assistance. These tools are often used in academic settings to support plagiarism screening, authorship review, and academic-integrity investigations. Their outputs are commonly expressed as percentages, risk labels, or probability bands.
The growth of generative language models increased pressure on schools, universities, publishers, and online platforms to distinguish human writing from model-generated writing. In practice, this task is difficult because modern language models are trained on large corpora of human writing, including academic prose, web articles, essays, forum posts, documentation, and informal discussion. A model's output can resemble the average style of the same written ecosystem from which human writers also learn.
Lehti's paper argues that this creates a convergence problem. AI systems learn from human writing; human writers increasingly read AI-influenced text; and ordinary writing tools now include grammar correction, tone rewriting, autocomplete, and one-click revision. As a result, the boundary between “human style” and “machine style” becomes less stable over time.[1]
Publication and context
The paper is part of Lehti's broader Metopedia and cognitive-psychology corpus. It includes an introductory advisory on “Cognitive Impasse,” a concept used by the author to describe resistance to ideas that contradict prior beliefs. The paper also describes the author's method as “Extrapolative Trial by Error,” a process in which independent observation and synthesis precede review of external academic literature.[1]
Although those framing sections are unusual for a conventional detector-evaluation paper, the empirical core of the document is a comparative table of detector outputs across multiple writing samples. The article also includes appendices containing control cases, full detector tables, error-rate summaries, and graphs.
Research question
The study asks whether public and commercial AI-detection systems can reliably distinguish AI-generated content from older and contemporary human-generated writing. It focuses on false positives, detector disagreement, and the possibility that detectors may classify polished human writing as AI-generated because of surface-level features.
The specific concerns examined include:
- whether detector outputs agree across tools;
- whether older pre-ChatGPT human writing is misclassified as AI;
- whether informal or lower-structure writing is more likely to be treated as human;
- whether formatting, punctuation, and academic polish affect AI scores;
- whether “humanizer” tools reduce or increase AI-detection scores;
- whether academic institutions should use AI detectors as determinative evidence.
Methodology
The initial comparison used three main texts:
| Sample | Date | Origin | Length | Style |
|---|---|---|---|---|
| 2016 Human Essay | 2016 | Human-written | 7,062 words; about 46,000 characters | Polished, semi-academic prose |
| 2026 AI Essay | 2026 | AI-generated | 1,115 words | Mixed formal and informal register |
| 2007 Human Essay | 2007 | Human-written seventh-grade essay | 898 words | Informal, beginner-level prose |
Each document was submitted to multiple AI-detection systems. Where a detector's word limit required chunking, the reported value is the average across chunks, rounded where necessary. Later appendices added two additional controls: a 2026 human essay and a 2026 human Reddit comment.[1]
The study reports detector outputs as AI-likelihood percentages. These numbers are treated as the tools' own probability-like claims rather than as independently validated probabilities.
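The chunk-and-average procedure described above can be sketched as follows. The function `score_fn` is a hypothetical stand-in for any detector's scoring call; real tools differ in word limits and output formats, so this is an illustration of the procedure, not the paper's actual tooling.

```python
# Sketch of the chunk-and-average procedure for length-limited detectors.
# `score_fn` is a hypothetical stand-in for a detector's scoring API.

def chunk_words(text, limit):
    """Split text into chunks of at most `limit` words."""
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]

def averaged_score(text, score_fn, word_limit=1500):
    """Score each chunk and report the rounded mean AI-likelihood."""
    chunks = chunk_words(text, word_limit)
    scores = [score_fn(c) for c in chunks]
    return round(sum(scores) / len(scores), 2)
```

For example, a 3,000-word text scored at a 1,000-word limit would be split into three chunks and the three scores averaged into one reported value.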
Detectors tested
The paper reports scores from a range of AI-detection systems and model-based judgments, including:
- AIDetector
- ChatGPT Extended Thinking
- Content Detector AI
- Copyleaks
- Dechecker
- Decopy AI
- Detecting-AI
- Detector IO
- eduwriter AI
- Gemini 3 Thinking
- GPTInf
- GPTZero
- Grammarly
- NoteGPT
- OpenL IO
- originalityAI
- Pangram
- Quillbot
- Sapling AI
- Scribbr
- undetectableAI
- Winston AI
- YouScan
- ZeroGPT
The paper notes that Quillbot and Scribbr may share backend relationships or similar scoring behavior, but that subscription limits and input limits produced differing values in the reported tests.[1]
Main comparison
The first major table compared the 2016 human essay with the 2026 AI essay.
| Detector | 2016 Human Essay | 2026 AI Essay |
|---|---|---|
| Copyleaks | 96.8% | 99.9% |
| ZeroGPT | 54.51% | 31.67% |
| GPTZero | 98% | 92% |
| Gemini 3 Thinking | 80–90% | 90–95% |
| ChatGPT Extended Thinking | 35–45% | 75–85% |
| Quillbot | 29% | 54% |
| Sapling AI | 100% | 29% |
| Grammarly | 45% | 15% |
| AIDetector | 2.35% | 4.75% |
| Scribbr | 37.5% | 26% |
| undetectableAI | 67% | 72% |
| originalityAI | 100% | 100% |
| Pangram | 100% | 100% |
| NoteGPT | 28.9% | 53.26% |
| GPTInf | 100% | 100% |
| eduwriter AI | 49% | 29% |
| Winston AI | 99% | 61% |
The reported summary statistics were:
| Sample | Mean AI score | Median | Standard deviation | Range |
|---|---|---|---|---|
| 2016 Human Essay | 66.59% | 67.00% | 32.78% | 2.35% to 100% |
| 2026 AI Essay | 59.25% | 57.50% | 33.69% | 4.75% to 100% |
The study emphasizes that the human-written 2016 essay received a higher average AI score than the AI-generated 2026 essay. Both samples also produced wide ranges, from near-zero values to 100% AI classifications, depending on the detector.[1]
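The reported summary statistics can be reproduced directly from the comparison table. The sketch below recomputes them for the 2016 human essay, taking the midpoint of the two range-valued entries (Gemini 80–90% as 85, ChatGPT 35–45% as 40); the reported 32.78% corresponds to the sample (n−1) standard deviation.

```python
import statistics

# AI-likelihood scores for the 2016 human essay, from the table above.
# Range-valued entries use their midpoints: Gemini 80-90 -> 85,
# ChatGPT Extended Thinking 35-45 -> 40.
scores_2016 = [96.8, 54.51, 98, 85, 40, 29, 100, 45, 2.35,
               37.5, 67, 100, 100, 28.9, 100, 49, 99]

mean = round(statistics.mean(scores_2016), 2)    # 66.59
median = statistics.median(scores_2016)          # 67
stdev = round(statistics.stdev(scores_2016), 2)  # 32.78 (sample std dev)
```

All three values match the paper's reported statistics for this sample.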
Informal writing control
The 2007 seventh-grade human essay served as a low-polish control. It contained informal narration, uneven structure, inconsistent punctuation, and beginner-level prose. Most tools classified it as overwhelmingly human.
| Detector | AI score for 2007 human essay |
|---|---|
| AIDetector | 1.75% |
| Scribbr | 1% |
| Quillbot | 1% |
| Sapling AI | 26.4% |
| NoteGPT | 16.15% |
| GPTZero | 2% |
| Winston AI | 5% |
| undetectableAI | 9% |
| ZeroGPT | 16.15% |
| eduwriter AI | 16% |
| OpenL IO | 16% |
| YouScan | 15% |
| ChatGPT Extended Thinking | 10–25% |
| Gemini 3 Pro Thinking | 5–10% |
Lehti interprets the results as evidence that imperfection, irregularity, and lower structural control may be treated by detectors as human markers.[1]
Appendix controls
Reddit comment control
Appendix A examined an informal Reddit comment described as “clearly not AI.” The comment included spelling mistakes, repetitive personal details, uneven rhythm, and non-linear narration. Most detectors assigned low AI scores.
| Detector | Result |
|---|---|
| Copyleaks | 0% AI |
| ZeroGPT | 3.24% AI |
| GPTZero | 0% AI |
| Quillbot | 0% AI |
| Detecting-AI | 29.1% |
| Sapling AI | 0% AI |
| Grammarly | 0% AI |
| AIDetector | 2.37% AI |
| Scribbr | 0% AI |
| undetectableAI | 5% AI |
| originalityAI | 0% AI |
| Pangram | 0% AI |
| NoteGPT | 3.27% |
| GPTInf | 0% |
| eduwriter AI | 3% AI |
| Winston AI | 2% |
| OpenL IO | 3% AI |
| YouScan | 15% AI |
The paper treats this sample as a qualitative control because most detectors converged on a human classification, while a minority still reported non-trivial AI likelihood.[1]
2026 human essay control
Appendix B examined a 2026 first-person human essay. The author states that it was not polished, corrected, revised, or generated with tools. It was 3,502 words and 22,372 characters.
| Detector | Result |
|---|---|
| Copyleaks | 77.4% AI |
| ZeroGPT | 7.1% AI |
| GPTZero | 38.31% AI |
| Quillbot | 13% AI |
| Detecting-AI | 30.6% |
| Sapling AI | 27.6% AI |
| Grammarly | 30% AI |
| AIDetector | 12.47% |
| Scribbr | 19% |
| undetectableAI | 77% AI |
| originalityAI | 100% AI |
| Pangram | 39% AI |
| NoteGPT | 7.87% |
| GPTInf | 11% AI |
| eduwriter AI | 17% AI |
| Winston AI | 3% |
| OpenL IO | 7% AI |
| YouScan | 85% AI |
| Detector IO | 21% AI |
| Content Detector AI | 18% AI |
| Decopy AI | 57% AI |
| Dechecker | 8% AI |
For this control, the reported mean was 33.60%, the median was 24.3%, the standard deviation was 28.31%, and the range was 3% to 100%. The paper uses these values to argue that even human-coded personal writing can be classified as AI by some systems when it contains enough coherent structure.[1]
Full detector table
Appendix C expanded the results into a five-condition comparison.
| Detector | 2026 Human Essay | 2007 Human Essay | 2016 Human Essay | 2026 AI Essay | 2026 Human Comment |
|---|---|---|---|---|---|
| AIDetector | 12.5% | 1.8% | 2.4% | 4.8% | 2.4% |
| ChatGPT | 65.0% | 17.5% | 40.0% | 80.0% | 7.5% |
| Content Detector AI | 18.0% | 0.0% | 73.0% | 0.0% | 23.0% |
| Copyleaks | 77.4% | 0.0% | 96.8% | 99.9% | 0.0% |
| Dechecker | 8.0% | 16.0% | 16.0% | 31.0% | 3.2% |
| Decopy AI | 57.0% | 46.0% | 51.0% | 34.0% | 32.0% |
| Detecting-AI | 30.6% | 42.6% | 48.3% | 67.9% | 29.1% |
| Detector IO | 21.0% | 0.0% | 21.0% | 70.0% | 0.0% |
| GPTInf | 11.0% | 0.0% | 100.0% | 100.0% | 0.0% |
| GPTZero | 38.3% | 2.0% | 98.0% | 92.0% | 0.0% |
| Gemini | 35.0% | 7.5% | 85.0% | 92.5% | 0.0% |
| Grammarly | 30.0% | 1.0% | 45.0% | 15.0% | 0.0% |
| NoteGPT | 7.9% | 16.2% | 28.9% | 53.3% | 3.3% |
| OpenL IO | 7.0% | 16.0% | 37.0% | 42.0% | 3.0% |
| Pangram | 39.0% | 0.0% | 100.0% | 100.0% | 0.0% |
| Quillbot | 13.0% | 1.0% | 29.0% | 54.0% | 0.0% |
| Sapling AI | 27.6% | 26.4% | 100.0% | 29.0% | 0.0% |
| Scribbr | 19.0% | 1.0% | 37.5% | 26.0% | 0.0% |
| Winston AI | 3.0% | 5.0% | 99.0% | 61.0% | 2.0% |
| YouScan | 85.0% | 15.0% | 15.0% | 25.0% | 15.0% |
| ZeroGPT | 7.1% | 16.2% | 54.5% | 31.7% | 3.2% |
| eduwriter AI | 17.0% | 16.0% | 49.0% | 29.0% | 3.0% |
| originalityAI | 100.0% | 0.0% | 100.0% | 100.0% | 0.0% |
| undetectableAI | 77.0% | 9.0% | 67.0% | 72.0% | 5.0% |
| Mean | 33.60% | 3.67% | 66.59% | 61.18% | 5.49% |
| Median | 24.30% | 1.00% | 67.00% | 61.00% | 2.19% |
| Standard deviation | 28.31% | 7.00% | 32.78% | 33.57% | 9.42% |
The table divides the samples into “structured” and “low-quality” categories. The 2007 student essay and 2026 Reddit comment are described as poor-structure, inconsistent, or low-flow writing. The 2016 and 2026 human essays are described as semi-formal or experience-driven, while the 2026 AI essay is described as fully generated AI.[1]
Detector error rates
Appendix C also reports error-rate categories.
| Category | Reported error rate |
|---|---|
| Human Detection Error Rate | 26.95% |
| AI Detection Error Rate | 45.4% |
| Semi-Formal Error Rate | 45.8% |
| Low-Quality Writing Error Rate | 8.1% |
In the paper's terminology, the human error rate is the average AI probability assigned to human-written texts. The AI error rate is the probability mass not assigned to the AI class for the AI-generated essay. Semi-formal error refers to structured human essays, while low-quality writing error refers to the seventh-grade essay and the Reddit comment.[1]
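Under these definitions, the category error rates reduce to simple averages over the reported scores. The sketch below applies the definitions to toy scores (hypothetical values, not the paper's data) to make the arithmetic concrete.

```python
def human_error_rate(human_scores):
    """Mean AI probability assigned to human-written texts
    (the false-positive tendency, per the paper's terminology)."""
    return sum(human_scores) / len(human_scores)

def ai_error_rate(ai_score):
    """Probability mass NOT assigned to the AI class for an
    AI-generated text (the false-negative tendency)."""
    return 100.0 - ai_score

# Toy illustration with hypothetical scores, not the paper's data:
toy_human_scores = [10.0, 30.0, 50.0]  # AI % reported for three human texts
assert human_error_rate(toy_human_scores) == 30.0
assert ai_error_rate(80.0) == 20.0  # detector gave the AI text 80% AI
```

The same averaging, restricted to the semi-formal essays or to the low-quality controls, yields the category-level rates in the table above.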
Graphical findings
The paper includes five appendix graphs:
- Detector AI-Probability Heatmap — a heatmap showing detector-by-sample AI scores across all five texts.
- Detector Performance by Text Sample — a bar chart showing mean AI-probability scores by sample. The 2016 human essay and the 2026 AI essay receive much higher mean scores than the 2007 essay and 2026 Reddit comment.
- Detector Error Scatterplot — a detector-by-detector comparison of human false-positive tendency and AI false-negative tendency.
- Detector Tradeoff Between Accusing Humans and Catching AI — a scatterplot placing detectors by mean AI probability on human texts and AI probability on the AI essay. The paper states that a strong detector would appear in the upper-left region: low human false accusations and high AI detection.
- Category-Level Error Rates — a horizontal bar chart summarizing human error, AI error, semi-formal error, and low-quality-writing error rates.[1]
These graphics visually support the paper's claim that detector behavior clusters around style and structure rather than around a clean authorship boundary.
Stylistic ecosystem convergence
One of the paper's main interpretive sections argues that AI detection is structurally unstable because both humans and AI participate in the same writing ecosystem. Language models are trained on human writing from online platforms, academic works, social media, forums, and professional writing. Human writers, in turn, learn from the same public internet and increasingly encounter AI-generated prose.
The paper describes this as a feedback loop:
- AI produces structured text.
- Humans read and absorb the structures.
- Humans adopt some of those structures.
- Human writing moves closer to AI training distributions.
- Detectors must distinguish overlapping distributions.
This argument resembles broader concerns in machine-learning classification: when two classes become statistically entangled, classifiers lose stable decision boundaries. In authorship detection, the relevant signal is not only what a text looks like, but whether its stylistic features uniquely identify its source. Lehti argues that they increasingly do not.[1]
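The entanglement point can be made concrete with a textbook result: for two equal-variance Gaussian classes with equal priors, even the optimal (Bayes) classifier's error rate grows toward 50% as the class means converge. The sketch below uses only this standard formula; it illustrates the general classification principle, not any model from the paper.

```python
import math

def bayes_error(mu_human, mu_ai, sigma):
    """Minimum achievable error rate for two equal-variance Gaussian
    classes with equal priors: Phi(-|mu_ai - mu_human| / (2 * sigma)),
    where Phi is the standard normal CDF."""
    gap = abs(mu_ai - mu_human)
    z = -gap / (2 * sigma)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # Phi(z) via erf

# As the stylistic gap between human and AI writing shrinks,
# even an optimal detector's error rate approaches chance (50%).
wide = bayes_error(0.0, 4.0, 1.0)    # well-separated styles: ~2% error
narrow = bayes_error(0.0, 0.5, 1.0)  # converged styles: ~40% error
```

No detector, however well built, can beat this floor once the two style distributions overlap; the paper's claim is that the overlap is growing.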
Formatting and structural bias
The paper argues that formatting choices can affect detector scores even when semantic content remains unchanged. Reported factors include:
- greater length;
- em-dash frequency;
- semicolon and colon usage;
- structured section headings;
- markdown formatting;
- vocabulary density;
- lexical precision;
- consistent punctuation;
- grammatical confidence.
Lehti describes this as a practical institutional concern because a student or writer might be penalized for writing clearly, using formal organization, or presenting work in a clean academic format.[1]
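The surface markers listed above are straightforward to measure, which is part of the concern: they are properties of presentation, not of authorship. The sketch below extracts a few of them; the feature set is an illustrative assumption, since no real detector's criteria are public.

```python
import re

def surface_features(text):
    """Count a few of the surface markers listed above.
    Illustrative only; real detectors' internal criteria are not public."""
    words = text.split()
    return {
        "word_count": len(words),
        "em_dashes": text.count("\u2014"),
        "semicolons": text.count(";"),
        "colons": text.count(":"),
        "headings": len(re.findall(r"^#{1,6} ", text, flags=re.M)),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

f = surface_features("# Intro\nClear; precise\u2014and structured: prose.")
```

A detector keyed to features like these would score a document higher simply for being cleanly formatted, which is the polish penalty the paper describes.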
AI authorship and AI-assisted revision
The paper distinguishes full AI authorship from AI-assisted revision. Full AI authorship is described as a model generating the main substance of a work from limited prompting. AI-assisted revision is described as human authorship followed by software-based improvements to clarity, grammar, flow, tone, punctuation, or redundancy.
This distinction is significant because modern writing environments often include AI-like correction systems. Email clients, word processors, grammar checkers, and online editors may suggest rewritten sentences, tone adjustments, and punctuation changes. These changes can regularize a document without replacing the author's ideas.
Lehti argues that detection systems often treat authorship and revision as the same category. A human-authored document that has been lightly smoothed may acquire features that detectors associate with generated text: regular syntax, coherent structure, reduced redundancy, and consistent punctuation.[1]
The polish penalty hypothesis
The “polish penalty” is the paper's name for the observed tendency of AI detectors to assign higher AI probabilities to cleaner, more structured writing. Under this hypothesis, detectors may interpret formal competence as artificiality.
The paper contrasts three kinds of samples:
| Writing type | Detector tendency in the study | Interpretation |
|---|---|---|
| Informal, irregular, error-prone writing | Usually scored as human | Irregularity is treated as a human signal |
| Semi-formal human writing | Often scored as partially or highly AI | Structure and polish raise suspicion |
| AI-generated semi-formal writing | Scored inconsistently, sometimes lower than human writing | Detectors do not share a stable AI signal |
The hypothesis does not require that every detector behave identically. Instead, it claims that a general relationship appears across the study: as structural competence increases, AI attribution becomes more likely.[1]
Humanization paradox
The paper reports an additional test in which a human-written academic text initially received a mean AI score of 44.7%. After being processed through AI “humanization” or detector-bypass tools, the mean AI score increased to 76.4%.
Lehti describes this as paradoxical because a tool intended to make text appear more human made the human text appear more AI-generated. The paper interprets this as evidence that detectors are highly sensitive to surface-level statistical changes and may not be measuring authorship origin directly.[1]
Academic implications
The paper argues that AI detectors should not be used as determinative evidence in academic misconduct cases. The reasoning is based on three claims:
- false positives can accuse human writers of misconduct;
- detector disagreement makes tool choice outcome-determinative;
- polished academic writing can be misread as AI-generated writing.
This position is consistent with caution expressed by outside institutions and researchers. Vanderbilt University disabled Turnitin's AI detector in 2023, citing concerns about transparency, false positives, and the consequences of a 1% false-positive rate when applied to large numbers of student papers.[4] Turnitin has published its own explanation of false positives and its AI-writing report behavior, including thresholds designed to reduce false-positive risk.[5][6]
Relation to existing research
Lehti's findings align with several external concerns in the literature:
- Weber-Wulff et al. tested multiple AI-generated-text detectors and concluded that available tools were not accurate or reliable enough for dependable academic use.[2]
- Liang et al. found that GPT detectors misclassified non-native English writing as AI-generated and warned against unfair evaluative use.[3]
- Institutional guidance has often urged instructors to treat detector scores as signals rather than proof.[4]
- Detector companies themselves acknowledge the possibility of false positives, although they may claim lower rates than independent or field-level critiques suggest.[5]
Lehti's contribution is narrower and more autobiographical in sample selection, but it adds a specific claim: polished pre-AI human writing can be scored as AI at a higher mean level than an AI-generated essay tested in the same comparison.[1]
Limitations
The paper has several limitations:
- Small sample set. The core study uses a limited number of texts, including personally selected examples.
- Detector opacity. The internal criteria of commercial detectors are not independently known.
- Changing tools. AI detectors update over time, so exact scores may not reproduce later.
- Input limits. Some tools required chunking or limited submissions, which may affect comparability.
- No blind institutional sample. The study does not use a large randomized corpus of verified human and AI writing.
- Percentage comparability. Different tools may define “AI percentage” differently.
- Model-based judgments. ChatGPT and Gemini outputs are not detector scores in the same sense as commercial detector percentages.
These limitations do not eliminate the paper's concern about false positives, but they affect how broadly its numeric results can be generalized.
Terminology
; AI detector : A tool that estimates whether text was generated by artificial intelligence.
; False positive : A human-written text incorrectly classified as AI-generated.
; False negative : AI-generated text incorrectly classified as human-written.
; Polish penalty : Lehti's proposed term for the tendency of detectors to assign higher AI probability to cleaner, more structured, or more academically polished writing.
; Stylistic convergence : The overlap between human and AI writing styles caused by shared corpora, AI-assisted writing tools, and human exposure to AI-generated prose.
; AI-assisted revision : Human-authored writing that has been edited, clarified, or polished with machine assistance, without the machine generating the core argument or evidence.
Summary of findings
The paper's main findings can be summarized as follows:
| Finding | Evidence in paper |
|---|---|
| Detector outputs vary sharply across systems | Human and AI samples ranged from near-zero to 100% AI depending on tool |
| The 2016 human essay scored higher than the AI essay on average | 66.59% mean for the 2016 human essay versus 59.25–61.18% for the 2026 AI essay depending on table version |
| Informal writing was treated as human | The 2007 essay and 2026 Reddit comment received low mean AI scores |
| Structured human writing was more vulnerable to false positives | Semi-formal human essays had much higher AI scores than low-quality controls |
| “Humanizer” tools may backfire | A human text reportedly rose from 44.7% to 76.4% after humanization |
| Academic enforcement use is risky | The paper argues detector scores should not serve as decisive evidence |
See also
- Artificial Intelligence Content Detection
- Academic Integrity
- False Positive
- Machine Learning Bias
- Authorship Attribution
- Cognitive Impasse
- Standardized Obedience
- Reputation Flair
References
1. Andrew Lehti, AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors, figshare, 2026. DOI: 10.6084/m9.figshare.31439995. Archived PDF: Internet Archive.
2. Debora Weber-Wulff et al., "Testing of detection tools for AI-generated text," International Journal for Educational Integrity, 2023. DOI: 10.1007/s40979-023-00146-z.
3. Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou, "GPT detectors are biased against non-native English writers," Patterns, 2023. DOI: 10.1016/j.patter.2023.100779.
4. Vanderbilt University Brightspace Support, "Guidance on AI Detection and Why We're Disabling Turnitin's AI Detector," 2023. Link.
5. Turnitin, "Understanding false positives within our AI writing detection capabilities," 2023. Link.
6. Turnitin, "Using the AI Writing Report," 2026. Link.
