AI-Detection Bias and False Positives

Comparative detector study
Full title	AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors
Author	Andrew Lehti
Publication date	February 28, 2026
DOI	10.6084/m9.figshare.31439995
Subject	AI-detection reliability, false positives, detector bias, academic-integrity policy
Corpus	Five text samples: three human essays/comments, one seventh-grade human essay, one AI-generated essay
Main claim	AI-detection scores vary sharply by detector and appear sensitive to polish, formatting, and structural regularity
Archive	Internet Archive PDF

AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors is a 2026 paper by Andrew Lehti examining the reliability of common AI-authorship detectors when applied to human-written and AI-generated text samples.^[1] The study compares detector outputs on older human writing, contemporary AI writing, informal student writing, and additional control texts. It argues that many detectors appear to penalize formal structure, grammatical consistency, and polished academic style rather than identifying a stable signature of machine authorship.

The paper is situated within broader debate over AI-content detection in education. Published research has found that available AI-detection tools can be unreliable, inconsistent, and vulnerable to paraphrasing or obfuscation.^[2] Other research has reported bias against non-native English writing, raising concerns about fairness in academic and evaluative settings.^[3]

The central finding of Lehti's paper is that a polished human essay from 2016 received a higher average AI-detection score than a 2026 AI-generated essay. The paper interprets this as evidence of a “polish penalty”: the tendency of detectors to associate structural competence, formal tone, and regular formatting with artificial generation.^[1]

Background

AI-detection systems are software tools that attempt to estimate whether text was written by a human, generated by a language model, or produced through a mixture of human and machine assistance. These tools are often used in academic settings to support plagiarism screening, authorship review, and academic-integrity investigations. Their outputs are commonly expressed as percentages, risk labels, or probability bands.

The growth of generative language models increased pressure on schools, universities, publishers, and online platforms to distinguish human writing from model-generated writing. In practice, this task is difficult because modern language models are trained on large corpora of human writing, including academic prose, web articles, essays, forum posts, documentation, and informal discussion. A model's output can resemble the average style of the same written ecosystem from which human writers also learn.

Lehti's paper argues that this creates a convergence problem. AI systems learn from human writing; human writers increasingly read AI-influenced text; and ordinary writing tools now include grammar correction, tone rewriting, autocomplete, and one-click revision. As a result, the boundary between “human style” and “machine style” becomes less stable over time.^[1]

Publication and context

The paper is part of Lehti's broader Metopedia and cognitive-psychology corpus. It includes an introductory advisory on “Cognitive Impasse,” a concept used by the author to describe resistance to ideas that contradict prior beliefs. The paper also describes the author's method as “Extrapolative Trial by Error,” a process in which independent observation and synthesis precede review of external academic literature.^[1]

Although those framing sections are unusual for a conventional detector-evaluation paper, the empirical core of the document is a comparative table of detector outputs across multiple writing samples. The paper also includes appendices containing control cases, full detector tables, error-rate summaries, and graphs.

Research question

The study asks whether public and commercial AI-detection systems can reliably distinguish AI-generated content from older and contemporary human-generated writing. It focuses on false positives, detector disagreement, and the possibility that detectors may classify polished human writing as AI-generated because of surface-level features.

The specific concerns examined include:

whether detector outputs agree across tools;
whether older pre-ChatGPT human writing is misclassified as AI;
whether informal or lower-structure writing is more likely to be treated as human;
whether formatting, punctuation, and academic polish affect AI scores;
whether “humanizer” tools reduce or increase AI-detection scores;
whether academic institutions should use AI detectors as determinative evidence.

Methodology

The initial comparison used three main texts:

Sample	Date	Origin	Length	Style
2016 Human Essay	2016	Human-written	7,062 words; about 46,000 characters	Polished, semi-academic prose
2026 AI Essay	2026	AI-generated	1,115 words	Mixed formal and informal register
2007 Human Essay	2007	Human-written seventh-grade essay	898 words	Informal, beginner-level prose

Each document was submitted to multiple AI-detection systems. Where a detector required chunking because of word limits, the reported value was rounded or averaged. Later appendices added two controls: a 2026 human essay and a 2026 human Reddit comment.^[1]
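The chunk-and-average step can be sketched as follows. This is only an illustration: the paper does not state how chunk scores were combined, so the word-weighted average is an assumption, and `score_chunk` is a hypothetical stand-in for whatever a given detector returns for one chunk.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def averaged_score(text: str, max_words: int, score_chunk: Callable[[str], float]) -> float:
    """Combine per-chunk AI scores (0-100) into one document-level score.

    Assumption: chunks are weighted by word count so that a short final
    chunk does not count as much as a full-length one.
    """
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError("text is empty")
    total_words = sum(len(chunk.split()) for chunk in chunks)
    weighted_sum = sum(score_chunk(chunk) * len(chunk.split()) for chunk in chunks)
    return round(weighted_sum / total_words, 2)


def fake_detector(chunk: str) -> float:
    """Hypothetical scorer used only to exercise the averaging logic."""
    return 80.0 if "therefore" in chunk else 20.0


# 1,500-word toy document split at a 1,000-word limit: two chunks.
sample_text = "therefore " * 1000 + "um " * 500
combined = averaged_score(sample_text, 1000, fake_detector)
print(combined)
```

Because the combined value depends on where the chunk boundaries fall and on how chunks are weighted, scores obtained this way are not strictly comparable with scores from detectors that accepted the whole document at once.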

The study reports detector outputs as AI-likelihood percentages. These numbers are treated as the tools' own probability-like claims rather than as independently validated probabilities.

Detectors tested

The paper reports scores from a range of AI-detection systems and model-based judgments, including:

AIDetector
ChatGPT Extended Thinking
Content Detector AI
Copyleaks
Dechecker
Decopy AI
Detecting-AI
Detector IO
eduwriter AI
Gemini 3 Thinking
GPTInf
GPTZero
Grammarly
NoteGPT
OpenL IO
originalityAI
Pangram
Quillbot
Sapling AI
Scribbr
undetectableAI
Winston AI
YouScan
ZeroGPT

The paper notes that Quillbot and Scribbr may share a backend or similar scoring behavior, but that subscription and input limits produced differing values in the reported tests.^[1]

Main comparison

The first major table compared the 2016 human essay with the 2026 AI essay.

Detector	2016 Human Essay	2026 AI Essay
Copyleaks	96.8%	99.9%
ZeroGPT	54.51%	31.67%
GPTZero	98%	92%
Gemini 3 Thinking	80–90%	90–95%
ChatGPT Extended Thinking	35–45%	75–85%
Quillbot	29%	54%
Sapling AI	100%	29%
Grammarly	45%	15%
AIDetector	2.35%	4.75%
Scribbr	37.5%	26%
undetectableAI	67%	72%
originalityAI	100%	100%
Pangram	100%	100%
NoteGPT	28.9%	53.26%
GPTInf	100%	100%
eduwriter AI	49%	29%
Winston AI	99%	61%

The reported summary statistics were:

Sample	Mean AI score	Median	Standard deviation	Range
2016 Human Essay	66.59%	67.00%	32.78%	2.35% to 100%
2026 AI Essay	59.25%	57.50%	33.69%	4.75% to 100%

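The study emphasizes that the human-written 2016 essay received a higher average AI score than the AI-generated 2026 essay. The AI essay's mean is given as 59.25% in this summary and as 61.18% in the Appendix C table; under either figure the 2016 human essay scores higher. Both samples also produced wide ranges, from near-zero values to 100% AI classifications, depending on the detector.^[1]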
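The summary statistics can be recomputed from the comparison table with a short script. The sketch below assumes that range entries are replaced by their midpoints (for example 80–90% becomes 85, which is how Appendix C lists the Gemini and ChatGPT values) and that the standard deviation is the sample form; neither choice is stated explicitly in the summary table.

```python
import statistics

# (2016 human essay, 2026 AI essay) AI scores in percent, from the main comparison table.
# Ranges are replaced by their midpoints (assumption, consistent with Appendix C).
scores = {
    "Copyleaks": (96.8, 99.9),
    "ZeroGPT": (54.51, 31.67),
    "GPTZero": (98.0, 92.0),
    "Gemini 3 Thinking": (85.0, 92.5),
    "ChatGPT Extended Thinking": (40.0, 80.0),
    "Quillbot": (29.0, 54.0),
    "Sapling AI": (100.0, 29.0),
    "Grammarly": (45.0, 15.0),
    "AIDetector": (2.35, 4.75),
    "Scribbr": (37.5, 26.0),
    "undetectableAI": (67.0, 72.0),
    "originalityAI": (100.0, 100.0),
    "Pangram": (100.0, 100.0),
    "NoteGPT": (28.9, 53.26),
    "GPTInf": (100.0, 100.0),
    "eduwriter AI": (49.0, 29.0),
    "Winston AI": (99.0, 61.0),
}


def summarize(values):
    """Return mean, median, sample standard deviation, and range."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": round(statistics.median(values), 2),
        "stdev": round(statistics.stdev(values), 2),  # sample (n - 1) form
        "min": min(values),
        "max": max(values),
    }


human_summary = summarize([human for human, _ in scores.values()])
ai_summary = summarize([ai for _, ai in scores.values()])
print("2016 human essay:", human_summary)
print("2026 AI essay:   ", ai_summary)
```

Under these assumptions the 2016 column gives the reported 66.59% mean and 67.00% median. The 2026 AI column gives a mean of about 61.18% and a median of 61.00%, which agrees with the Appendix C row rather than with the 59.25% and 57.50% shown in the summary table.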

Informal writing control

The 2007 seventh-grade human essay served as a low-polish control. It contained informal narration, uneven structure, inconsistent punctuation, and beginner-level prose. Most tools classified it as overwhelmingly human.

Detector	AI score for 2007 human essay
AIDetector	1.75%
Scribbr	1%
Quillbot	1%
Sapling AI	26.4%
NoteGPT	16.15%
GPTZero	2%
Winston AI	5%
undetectableAI	9%
ZeroGPT	16.15%
eduwriter AI	16%
OpenL IO	16%
YouScan	15%
ChatGPT Extended Thinking	10–25%
Gemini 3 Pro Thinking	5–10%

Lehti interprets the results as evidence that imperfection, irregularity, and lower structural control may be treated by detectors as human markers.^[1]

Appendix controls

Reddit comment control

Appendix A examined an informal Reddit comment described as “clearly not AI.” The comment included spelling mistakes, repetitive personal details, uneven rhythm, and non-linear narration. Most detectors assigned low AI scores.

Detector	Result
Copyleaks	0% AI
ZeroGPT	3.24% AI
GPTZero	0% AI
Quillbot	0% AI
Detecting-AI	29.1% AI
Sapling AI	0% AI
Grammarly	0% AI
AIDetector	2.37% AI
Scribbr	0% AI
undetectableAI	5% AI
originalityAI	0% AI
Pangram	0% AI
NoteGPT	3.27% AI
GPTInf	0% AI
eduwriter AI	3% AI
Winston AI	2% AI
OpenL IO	3% AI
YouScan	15% AI

The paper treats this sample as a qualitative control because most detectors converged on a human classification, while a minority still reported non-trivial AI likelihood.^[1]

2026 human essay control

Appendix B examined a 2026 first-person human essay. The author states that it was not polished, corrected, revised, or generated with tools. It was 3,502 words and 22,372 characters.

Detector	Result
Copyleaks	77.4% AI
ZeroGPT	7.1% AI
GPTZero	38.31% AI
Quillbot	13% AI
Detecting-AI	30.6% AI
Sapling AI	27.6% AI
Grammarly	30% AI
AIDetector	12.47% AI
Scribbr	19% AI
undetectableAI	77% AI
originalityAI	100% AI
Pangram	39% AI
NoteGPT	7.87% AI
GPTInf	11% AI
eduwriter AI	17% AI
Winston AI	3% AI
OpenL IO	7% AI
YouScan	85% AI
Detector IO	21% AI
Content Detector AI	18% AI
Decopy AI	57% AI
Dechecker	8% AI

For this control, the reported mean was 33.60%, the median was 24.3%, the standard deviation was 28.31%, and the range was 3% to 100%. The paper uses these values to argue that even human-written personal writing can be classified as AI by some systems when it contains enough coherent structure.^[1]

Full detector table

Appendix C expanded the results into a five-condition comparison.

Detector	2026 Human Essay	2007 Human Essay	2016 Human Essay	2026 AI Essay	2026 Human Comment
AIDetector	12.5%	1.8%	2.4%	4.8%	2.4%
ChatGPT	65.0%	17.5%	40.0%	80.0%	7.5%
Content Detector AI	18.0%	0.0%	73.0%	0.0%	23.0%
Copyleaks	77.4%	0.0%	96.8%	99.9%	0.0%
Dechecker	8.0%	16.0%	16.0%	31.0%	3.2%
Decopy AI	57.0%	46.0%	51.0%	34.0%	32.0%
Detecting-AI	30.6%	42.6%	48.3%	67.9%	29.1%
Detector IO	21.0%	0.0%	21.0%	70.0%	0.0%
GPTInf	11.0%	0.0%	100.0%	100.0%	0.0%
GPTZero	38.3%	2.0%	98.0%	92.0%	0.0%
Gemini	35.0%	7.5%	85.0%	92.5%	0.0%
Grammarly	30.0%	1.0%	45.0%	15.0%	0.0%
NoteGPT	7.9%	16.2%	28.9%	53.3%	3.3%
OpenL IO	7.0%	16.0%	37.0%	42.0%	3.0%
Pangram	39.0%	0.0%	100.0%	100.0%	0.0%
Quillbot	13.0%	1.0%	29.0%	54.0%	0.0%
Sapling AI	27.6%	26.4%	100.0%	29.0%	0.0%
Scribbr	19.0%	1.0%	37.5%	26.0%	0.0%
Winston AI	3.0%	5.0%	99.0%	61.0%	2.0%
YouScan	85.0%	15.0%	15.0%	25.0%	15.0%
ZeroGPT	7.1%	16.2%	54.5%	31.7%	3.2%
eduwriter AI	17.0%	16.0%	49.0%	29.0%	3.0%
originalityAI	100.0%	0.0%	100.0%	100.0%	0.0%
undetectableAI	77.0%	9.0%	67.0%	72.0%	5.0%
Mean	33.60%	3.67%	66.59%	61.18%	5.49%
Median	24.30%	1.00%	67.00%	61.00%	2.19%
Standard deviation	28.31%	7.00%	32.78%	33.57%	9.42%

The table divides the samples into “structured” and “low-quality” categories. The 2007 student essay and 2026 Reddit comment are described as poor-structure, inconsistent, or low-flow writing. The 2016 and 2026 human essays are described as semi-formal or experience-driven, while the 2026 AI essay is described as fully generated AI.^[1]

Detector error rates

Appendix C also reports error-rate categories.

Category	Reported error rate
Human Detection Error Rate	26.95%
AI Detection Error Rate	45.4%
Semi-Formal Error Rate	45.8%
Low-Quality Writing Error Rate	8.1%

In the paper's terminology, the human error rate is the average AI probability assigned to human-written texts. The AI error rate is the probability mass not assigned to the AI class for the AI-generated essay. Semi-formal error refers to structured human essays, while low-quality writing error refers to the seventh-grade essay and the Reddit comment.^[1]
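One way to read these definitions is as simple averages over the 24 detector rows of the Appendix C table. The sketch below follows that reading; it is an interpretation rather than the paper's documented procedure, and the constant and function names are illustrative.

```python
from statistics import mean

# Appendix C rows. Columns: 2026 human essay, 2007 human essay,
# 2016 human essay, 2026 AI essay, 2026 human comment (all in percent).
TABLE = {
    "AIDetector": (12.5, 1.8, 2.4, 4.8, 2.4),
    "ChatGPT": (65.0, 17.5, 40.0, 80.0, 7.5),
    "Content Detector AI": (18.0, 0.0, 73.0, 0.0, 23.0),
    "Copyleaks": (77.4, 0.0, 96.8, 99.9, 0.0),
    "Dechecker": (8.0, 16.0, 16.0, 31.0, 3.2),
    "Decopy AI": (57.0, 46.0, 51.0, 34.0, 32.0),
    "Detecting-AI": (30.6, 42.6, 48.3, 67.9, 29.1),
    "Detector IO": (21.0, 0.0, 21.0, 70.0, 0.0),
    "GPTInf": (11.0, 0.0, 100.0, 100.0, 0.0),
    "GPTZero": (38.3, 2.0, 98.0, 92.0, 0.0),
    "Gemini": (35.0, 7.5, 85.0, 92.5, 0.0),
    "Grammarly": (30.0, 1.0, 45.0, 15.0, 0.0),
    "NoteGPT": (7.9, 16.2, 28.9, 53.3, 3.3),
    "OpenL IO": (7.0, 16.0, 37.0, 42.0, 3.0),
    "Pangram": (39.0, 0.0, 100.0, 100.0, 0.0),
    "Quillbot": (13.0, 1.0, 29.0, 54.0, 0.0),
    "Sapling AI": (27.6, 26.4, 100.0, 29.0, 0.0),
    "Scribbr": (19.0, 1.0, 37.5, 26.0, 0.0),
    "Winston AI": (3.0, 5.0, 99.0, 61.0, 2.0),
    "YouScan": (85.0, 15.0, 15.0, 25.0, 15.0),
    "ZeroGPT": (7.1, 16.2, 54.5, 31.7, 3.2),
    "eduwriter AI": (17.0, 16.0, 49.0, 29.0, 3.0),
    "originalityAI": (100.0, 0.0, 100.0, 100.0, 0.0),
    "undetectableAI": (77.0, 9.0, 67.0, 72.0, 5.0),
}
H2026, H2007, H2016, AI2026, COMMENT = range(5)


def column_mean(index):
    """Mean AI score for one text sample across all listed detectors."""
    return mean(row[index] for row in TABLE.values())


# Assumed readings of the four categories described in the text.
human_error = mean(column_mean(i) for i in (H2026, H2007, H2016, COMMENT))
ai_error = 100 - column_mean(AI2026)
semi_formal_error = mean([column_mean(H2026), column_mean(H2016)])
low_quality_error = mean([column_mean(H2007), column_mean(COMMENT)])

for label, value in [
    ("Human detection error", human_error),
    ("AI detection error", ai_error),
    ("Semi-formal error", semi_formal_error),
    ("Low-quality writing error", low_quality_error),
]:
    print(f"{label}: {value:.2f}%")
```

Computed this way, the four values agree with the reported rates to within rounding. The per-sample means over all 24 rows do not all match the table's own Mean row (the 2007 column, for example, averages about 10.7% rather than 3.67%), which suggests that the row may have been computed over a different set of detectors.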

Graphical findings

The paper includes five appendix graphs:

Detector AI-Probability Heatmap — a heatmap showing detector-by-sample AI scores across all five texts.
Detector Performance by Text Sample — a bar chart showing mean AI-probability scores by sample. The 2016 human essay and the 2026 AI essay receive much higher mean scores than the 2007 essay and 2026 Reddit comment.
Detector Error Scatterplot — a detector-by-detector comparison of human false-positive tendency and AI false-negative tendency.
Detector Tradeoff Between Accusing Humans and Catching AI — a scatterplot placing detectors by mean AI probability on human texts and AI probability on the AI essay. The paper states that a strong detector would appear in the upper-left region: low human false accusations and high AI detection.
Category-Level Error Rates — a horizontal bar chart summarizing human error, AI error, semi-formal error, and low-quality-writing error rates.^[1]
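The coordinates used in the tradeoff scatterplot can be illustrated with a few rows from the Appendix C table. The sketch below assumes that the horizontal value is the plain mean of a detector's scores on the four human texts and that the vertical value is its score on the AI essay; the six detectors shown are an arbitrary subset chosen for brevity.

```python
from statistics import mean

# Subset of Appendix C: ([scores on the four human texts], score on the AI essay).
# Human-text order: 2026 essay, 2007 essay, 2016 essay, 2026 comment.
rows = {
    "Copyleaks": ([77.4, 0.0, 96.8, 0.0], 99.9),
    "GPTZero": ([38.3, 2.0, 98.0, 0.0], 92.0),
    "Sapling AI": ([27.6, 26.4, 100.0, 0.0], 29.0),
    "AIDetector": ([12.5, 1.8, 2.4, 2.4], 4.8),
    "Quillbot": ([13.0, 1.0, 29.0, 0.0], 54.0),
    "Detector IO": ([21.0, 0.0, 21.0, 0.0], 70.0),
}

# x: mean AI probability on human texts (false-accusation tendency).
# y: AI probability on the AI essay (detection strength).
points = {name: (mean(human), ai) for name, (human, ai) in rows.items()}

# Sort from fewest to most false accusations of human writers.
ordered = sorted(points.items(), key=lambda item: item[1][0])

for name, (x, y) in ordered:
    false_negative = 100 - y  # share not assigned to the AI class
    print(f"{name:12s} x={x:6.2f}  y={y:6.2f}  AI false-negative={false_negative:6.2f}")
```

In this subset, a detector with a low horizontal value can still sit low on the vertical axis (AIDetector), while a detector that scores the AI essay near 100% can also assign high scores to human texts (Copyleaks), which is the tradeoff the scatterplot is meant to show.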

The paper presents these graphics as visual support for its claim that detector behavior clusters around style and structure rather than around a clean authorship boundary.

Stylistic ecosystem convergence

One of the paper's main interpretive sections argues that AI detection is structurally unstable because both humans and AI participate in the same writing ecosystem. Language models are trained on human writing from online platforms, academic works, social media, forums, and professional writing. Human writers, in turn, learn from the same public internet and increasingly encounter AI-generated prose.

The paper describes this as a feedback loop:

AI produces structured text.
Humans read and absorb the structures.
Humans adopt some of those structures.
Human writing moves closer to AI training distributions.
Detectors must distinguish overlapping distributions.

This argument resembles broader concerns in machine-learning classification: when two classes become statistically entangled, classifiers lose stable decision boundaries. In authorship detection, the relevant signal is not only what a text looks like, but whether its stylistic features uniquely identify its source. Lehti argues that they increasingly do not.^[1]

Formatting and structural bias

The paper argues that formatting choices can affect detector scores even when semantic content remains unchanged. Reported factors include:

greater length;
em-dash frequency;
semicolon and colon usage;
structured section headings;
markdown formatting;
vocabulary density;
lexical precision;
consistent punctuation;
grammatical confidence.

Lehti describes this as a practical institutional concern because a student or writer might be penalized for writing clearly, using formal organization, or presenting work in a clean academic format.^[1]

AI authorship and AI-assisted revision

The paper distinguishes full AI authorship from AI-assisted revision. Full AI authorship is described as a model generating the main substance of a work from limited prompting. AI-assisted revision is described as human authorship followed by software-based improvements to clarity, grammar, flow, tone, punctuation, or redundancy.

This distinction is significant because modern writing environments often include AI-like correction systems. Email clients, word processors, grammar checkers, and online editors may suggest rewritten sentences, tone adjustments, and punctuation changes. These changes can regularize a document without replacing the author's ideas.

Lehti argues that detection systems often treat authorship and revision as the same category. A human-authored document that has been lightly smoothed may acquire features that detectors associate with generated text: regular syntax, coherent structure, reduced redundancy, and consistent punctuation.^[1]

The polish penalty hypothesis

The “polish penalty” is the paper's name for the observed tendency of AI detectors to assign higher AI probabilities to cleaner, more structured writing. Under this hypothesis, detectors may interpret formal competence as artificiality.

The paper contrasts three kinds of samples:

Writing type	Detector tendency in the study	Interpretation
Informal, irregular, error-prone writing	Usually scored as human	Irregularity is treated as a human signal
Semi-formal human writing	Often scored as partially or highly AI	Structure and polish raise suspicion
AI-generated semi-formal writing	Scored inconsistently, sometimes lower than human writing	Detectors do not share a stable AI signal

The hypothesis does not require that every detector behave identically. Instead, it claims that a general relationship appears across the study: as structural competence increases, AI attribution becomes more likely.^[1]

Humanization paradox

The paper reports an additional test in which a human-written academic text initially received a mean AI score of 44.7%. After being processed through AI “humanization” or detector-bypass tools, the mean AI score rose to 76.4%, an increase of 31.7 percentage points.

Lehti describes this as paradoxical because a tool intended to make text appear more human made the human text appear more AI-generated. The paper interprets this as evidence that detectors are highly sensitive to surface-level statistical changes and may not be measuring authorship origin directly.^[1]

Academic implications

The paper argues that AI detectors should not be used as determinative evidence in academic misconduct cases. The reasoning is based on three claims:

false positives can accuse human writers of misconduct;
detector disagreement makes tool choice outcome-determinative;
polished academic writing can be misread as AI-generated writing.

This position is consistent with caution expressed by outside institutions and researchers. Vanderbilt University disabled Turnitin's AI detector in 2023, citing concerns about transparency, false positives, and the consequences of a 1% false-positive rate when applied to large numbers of student papers.^[4] Turnitin has published its own explanation of false positives and its AI-writing report behavior, including thresholds designed to reduce false-positive risk.^[5]^[6]
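The scale effect of a small false-positive rate can be shown with simple arithmetic. The submission volumes below are hypothetical and are not taken from the paper or from the cited guidance; the 1% rate is the figure mentioned above.

```python
def expected_false_positives(num_human_papers: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers flagged as AI.

    Assumes every submitted paper is human-written and that the same
    false-positive rate applies to each one.
    """
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return num_human_papers * false_positive_rate


# Hypothetical yearly submission volumes at a 1% false-positive rate.
for volume in (1_000, 10_000, 100_000):
    flagged = expected_false_positives(volume, 0.01)
    print(f"{volume:>7,} papers -> about {flagged:,.0f} wrongly flagged")
```

Even when the per-paper rate is low, the expected number of wrongly flagged students grows in direct proportion to the number of papers screened.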

Relation to existing research

Lehti's findings align with several external concerns in the literature:

Weber-Wulff et al. tested multiple AI-generated-text detectors and concluded that available tools were not accurate or reliable enough for dependable academic use.^[2]
Liang et al. found that GPT detectors misclassified non-native English writing as AI-generated and warned against unfair evaluative use.^[3]
Institutional guidance has often urged instructors to treat detector scores as signals rather than proof.^[4]
Detector companies themselves acknowledge the possibility of false positives, although they may claim lower rates than independent or field-level critiques suggest.^[5]

Lehti's contribution is narrower and more autobiographical in sample selection, but it adds a specific claim: polished pre-AI human writing can be scored as AI at a higher mean level than an AI-generated essay tested in the same comparison.^[1]

Limitations

The paper has several limitations:

Small sample set. The core study uses a limited number of texts, including personally selected examples.
Detector opacity. The internal criteria of commercial detectors are not independently known.
Changing tools. AI detectors update over time, so exact scores may not reproduce later.
Input limits. Some tools required chunking or limited submissions, which may affect comparability.
No blind institutional sample. The study does not use a large randomized corpus of verified human and AI writing.
Percentage comparability. Different tools may define “AI percentage” differently.
Model-based judgments. ChatGPT and Gemini outputs are not detector scores in the same sense as commercial detector percentages.

These limitations do not eliminate the paper's concern about false positives, but they affect how broadly its numeric results can be generalized.

Terminology

AI detector: A tool that estimates whether text was generated by artificial intelligence.

False positive: A human-written text incorrectly classified as AI-generated.

False negative: AI-generated text incorrectly classified as human-written.

Polish penalty: Lehti's proposed term for the tendency of detectors to assign higher AI probability to cleaner, more structured, or more academically polished writing.

Stylistic convergence: The overlap between human and AI writing styles caused by shared corpora, AI-assisted writing tools, and human exposure to AI-generated prose.

AI-assisted revision: Human-authored writing that has been edited, clarified, or polished with machine assistance, without the machine generating the core argument or evidence.

Summary of findings

The paper's main findings can be summarized as follows:

Finding	Evidence in paper
Detector outputs vary sharply across systems	Human and AI samples ranged from near-zero to 100% AI depending on tool
The 2016 human essay scored higher than the AI essay on average	66.59% mean for the 2016 human essay versus 59.25–61.18% for the 2026 AI essay depending on table version
Informal writing was treated as human	The 2007 essay and 2026 Reddit comment received low mean AI scores
Structured human writing was more vulnerable to false positives	Semi-formal human essays had much higher AI scores than low-quality controls
“Humanizer” tools may backfire	A human text reportedly rose from 44.7% to 76.4% after humanization
Academic enforcement use is risky	The paper argues detector scores should not serve as decisive evidence

References

1. Andrew Lehti, AI-Detection Bias and False Positives: Comparing 2016 Human, 2026 AI, and 2007 Student Essays Across Common Detectors, figshare, 2026. DOI: 10.6084/m9.figshare.31439995. Archived PDF: Internet Archive.
2. Debora Weber-Wulff et al., “Testing of detection tools for AI-generated text,” International Journal for Educational Integrity, 2023. DOI: 10.1007/s40979-023-00146-z.
3. Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou, “GPT detectors are biased against non-native English writers,” Patterns, 2023. DOI: 10.1016/j.patter.2023.100779.
4. Vanderbilt University Brightspace Support, “Guidance on AI Detection and Why We're Disabling Turnitin's AI Detector,” 2023.
5. Turnitin, “Understanding false positives within our AI writing detection capabilities,” 2023.
6. Turnitin, “Using the AI Writing Report,” 2026.
