A transparent look at the methodology, data, models, and limitations behind the detector.
YOU HERETIC! — You coded this up with an LLM…
Yes. Guilty as charged. The irony is not lost on us. An AI-detection tool, built with the help of AI. Go ahead, feed this very source code into the detector — it'll probably flag itself.
But here's the thing: this tool isn't anti-AI. It's anti-bullshit. There's a difference.
LLMs are extraordinary tools for coding, brainstorming, and drafting. The problem isn't that students use AI — it's when they paste in a ChatGPT essay and submit it as their own thinking. The teacher gets a wall of "multifaceted tapestries" and "pivotal paradigm shifts" and has no idea whether a real human brain engaged with the material.
This tool exists to give educators a transparent, explainable second opinion — not a black-box "gotcha" machine, but a teaching aid that says "here are the specific patterns that look machine-generated, and here's why." It's designed to start conversations, not end them.
So yes, we used an LLM to build the thing that detects LLMs. We also used a hammer to build the house that keeps the rain out. Tools are tools. What matters is what you do with them.
🔒 100% Local & Private — Your text never leaves your device.
The entire detection pipeline — feature extraction, NLP processing, and the ML model — runs locally in your browser using WebAssembly. No text is sent to any server. No API calls. No logging. Nothing.
Don't take our word for it — try it with your Wi-Fi off. Disconnect from the internet after the page loads and paste in some text. It still works. That's the proof. Your data is yours.
This tool identifies AI-generated text through a two-layer system:
The ML model runs client-side via ONNX Runtime Web (WebAssembly). When the model loads successfully, the tool uses its probability as the primary signal. If it fails to load (e.g., in a very old browser), the tool falls back to a weighted heuristic sum.
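A minimal sketch of that two-layer decision, written in Python for readability (the real pipeline runs in the browser via ONNX Runtime Web; the function shape and fallback weights here are illustrative, not the production values):

```python
import math

def detect(features, model=None):
    """Return an AI probability for a heuristic feature vector.

    Sketch only: prefer the ML model's probability when the model
    loaded; otherwise fall back to a weighted heuristic sum.
    """
    if model is not None:  # primary signal: the ML model's probability
        return model.predict_proba([features])[0][1]
    # Fallback: weighted sum of heuristic scores, squashed to [0, 1].
    # Illustrative weights only (e.g. AI vocab, openers, slang).
    fallback_weights = [4.0, 3.5, -3.5]
    z = sum(w * f for w, f in zip(fallback_weights, features))
    return 1 / (1 + math.exp(-z))
```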
Crucially, the tool doesn't just give a score — it shows you which patterns triggered and where, so you can form your own judgment about whether the flags are meaningful in context.
Pure neural AI detectors (like GPTZero, Originality.ai, or OpenAI's classifier) are opaque — they give a probability but can't explain why. This is a problem for educators, because:
Our approach is the opposite: every detection is grounded in observable textual evidence. If the tool flags a passage, you can hover over the trigger to see exactly which phrases, word choices, or structures caused it — and decide for yourself whether the flag is meaningful.
The AI red flags fall into four groups, and are joined by 15 human green flags that score evidence of authentic human writing. Each has a weight reflecting how strongly it discriminates between AI and human text (calibrated on the training data). Negative weights belong to green flags — a high score there reduces the AI probability:
| Heuristic | What it looks for | Weight |
|---|---|---|
| Structural Formulas & Rhetorical Patterns | | |
| AI Vocabulary | "delve," "tapestry," "multifaceted," "nuanced" — words that appear at wildly higher rates in AI text | 4.0 |
| Formulaic Openers | "Additionally," "Furthermore," "Moreover" at sentence start | 3.5 |
| Sentence Uniformity | Suspiciously even sentence lengths (doc-level stat) | 3.5 |
| LLM Markup | Internal reference IDs or tags from ChatGPT/Copilot | 3.0 |
| Trailing Participles | ", highlighting its significance" — fake analysis at sentence end | 2.5 |
| Chatbot Artifacts | "Certainly!," "I hope this helps," "Let me know" | 2.5 |
| Didactic Disclaimers | "It's important to note," "plays a crucial role" | 2.5 |
| Copula Avoidance | "Serves as" / "stands as" instead of simply "is" | 2.0 |
| Rule of Three | Tricolon lists: "education, innovation, and progress" | 2.0 |
| Challenges & Future | "Despite these challenges, the future continues to evolve" | 1.5 |
| Formulaic Lists | Bold inline headers, numbered/bullet structures in prose | 1.5 |
| "Not X, but Y" | "Not just... but also" parallel construction | 1.0 |
| Vocabulary & Word Choice | | |
| Promotional Language | "groundbreaking," "unparalleled," "world-class" | 2.0 |
| Hedging | "potentially," "arguably," "fundamentally" | 2.0 |
| Legacy Emphasis | "enduring legacy," "indelible mark," "broader implications" | 2.0 |
| Average Word Length | AI consistently uses longer, more formal words (doc-level) | 2.0 |
| Disclaimers | "Based on available information" — knowledge-cutoff phrases | 2.0 |
| Vague Attributions | "Experts have noted" without citing anyone specific | 1.5 |
| Notability Claims | "Featured in major publications" without naming them | 1.5 |
| Filler Adverbs | "significantly," "increasingly," "effectively" — AI emphasis spray | 1.5 |
| Comma Density | More commas per word than typical student writing (doc-level) | 1.5 |
| Elegant Variation | Many synonyms for the same concept — AI's repetition penalty (doc-level) | 1.5 |
| Avg Sentence Length | Consistently medium-long sentences in 20–35 word range (doc-level) | 1.5 |
| Markdown Artifacts | **bold** or # heading markdown syntax leaked into prose | 1.5 |
| Low-Signal Indicators | | |
| Em Dash Overuse | Multiple em dashes per paragraph | 1.0 |
| Passive Voice | "is considered," "was believed" — higher density | 1.0 |
| Curly Quotes | Smart/curly quotation marks (ChatGPT default) | 3.0 |
| Absence-of-Human-Signal Indicators | | |
| No Contractions | AI expands "don't," "it's" to formal equivalents — near-zero contraction density | 2.5 |
| No First Person | AI analytical text avoids "I" — very low first-person density | 1.5 |
| Vocab Diversity | AI's repetition penalty produces unnaturally high overall lexical variety (type-token ratio) | 1.5 |
| ✅ Human Green Flags (negative weight = evidence against AI) | | |
| Informal Intensifiers | "super," "totally," "pretty much" — informal human emphasis, not AI vocabulary | −2.0 |
| Sentence Fragments | Short phrases without a root verb ("So weird." "No way.") — casual human writing | −2.5 |
| Trailing Ellipsis | Ellipses (…) used to trail off — a human stylistic quirk AI rarely produces | −2.0 |
| Para. Variation | Natural variation in paragraph lengths — AI produces uniform paragraph blocks | −1.5 |
| Disfluency Markers | "I mean," "like," "you know," "um" — spoken rhythms of real thought | −2.5 |
| Emphasis Typography | ALL-CAPS, repeated punctuation (!!) or reduplication ("really really") | −2.0 |
| Personal Time Refs | "last summer," "when I was a kid" — grounded in personal memory | −2.5 |
| Contraction Variety | Many different contractions (I'm, don't, it's, we're) — natural relaxed voice | −2.5 |
| Pronoun Variety | Mixing I, we, you, me, my, our — engaged conversational voice | −2.0 |
| Self-Correction | "actually," "wait no," "I mean—" — writer revising thought mid-sentence | −3.0 |
| Internet Slang | "lol," "omg," "bff," "ngl" — strong evidence of genuine human expression | −3.5 |
| Academic Citations | Parenthetical citations (Smith, 2020) — student drawing on real sources | −3.0 |
| Mild Expletive | "damn," "crap," "hell" — authentic human voice, AI avoids unprompted | −2.5 |
| Casual Spelling | "u," "thru," "ngl," "ilysm" — shorthand signatures of human voice | −3.0 |
| Tag Question | "right?" "aren't they?" — rhetorical questions inviting the reader in | −2.0 |
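To see how red and green flags interact, here is a hedged sketch of the weighted-sum combination (only the heuristic fallback works this way; the production model learns its own combination, and the logistic squash here is an assumption, not the exact production formula):

```python
import math

# A handful of weights from the table above
# (red flags positive, green flags negative).
WEIGHTS = {
    "ai_vocabulary": 4.0,
    "formulaic_openers": 3.5,
    "no_contractions": 2.5,
    "internet_slang": -3.5,
    "self_correction": -3.0,
}

def heuristic_probability(scores):
    """Weighted sum of per-heuristic scores, squashed to [0, 1]."""
    z = sum(WEIGHTS[name] * s for name, s in scores.items())
    return 1 / (1 + math.exp(-z))

# Text firing AI patterns scores high; green flags pull the score down.
ai_like = {"ai_vocabulary": 0.8, "formulaic_openers": 0.5,
           "no_contractions": 0.9, "internet_slang": 0.0,
           "self_correction": 0.0}
human_like = {"ai_vocabulary": 0.1, "formulaic_openers": 0.0,
              "no_contractions": 0.0, "internet_slang": 0.7,
              "self_correction": 0.5}
```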
The heuristics are derived from Wikipedia's "Signs of AI Writing" article and refined through empirical testing on labeled data. They use a mix of regex pattern matching, compromise.js NLP tagging, and statistical measures.
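The regex layer is the simplest of the three mechanisms. A minimal illustration (the word list here is a tiny sample, not the full vocabulary, and the density normalization is an assumption):

```python
import re

# Illustrative subset of flagged AI-vocabulary terms
AI_VOCAB = re.compile(r"\b(delve|tapestry|multifaceted|nuanced)\b",
                      re.IGNORECASE)

def ai_vocabulary_score(text):
    """Matches per 100 words — a density, so long texts aren't
    penalized simply for being long."""
    words = len(text.split()) or 1
    return 100 * len(AI_VOCAB.findall(text)) / words
```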
| Property | Value |
|---|---|
| Source | datasets/unified_dataset_v3.csv (12 aggregated sources) |
| Total samples | 2,493,948 |
| Training set | 2,119,856 (1,059,928 per class, balanced) |
| Held-out test set | 374,092 (never seen during training) |
| Split strategy | Stratified random, seed=42 |
Training used the balanced training split (2,119,856 samples, 1,059,928 per class). Performance was estimated using 10-fold stratified cross-validation on the training set, then the final model was retrained on the full training split. The 374,092-sample held-out test set was never used during training or hyperparameter selection — it exists solely to measure real-world performance, and it's the same pool that the "Try an example" buttons and Quiz Mode draw from.
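The split described above corresponds to a standard scikit-learn stratified split. A sketch on toy data (shapes and variable names are illustrative; the real dataset has ~2.49M rows):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: 46 heuristic feature scores
# per sample, balanced human (0) / AI (1) labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 46))
y = np.array([0, 1] * 500)

# Stratified random split with a fixed seed, mirroring the table above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
```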
| Property | Value |
|---|---|
| Validation method | 10-fold Stratified K-Fold |
| Folds | 10 |
| Shuffle | Yes |
| Random seed | 42 |
| Scoring metrics | Accuracy, Precision, Recall, F1 (all per-fold) |
| Parallelism | n_jobs=-1 (all available CPUs) |
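In scikit-learn terms, the validation setup above looks like the following (toy data; the authoritative version lives in train_ml_model.py):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression

# Toy separable data so the demo produces non-trivial scores
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 46))
y = np.array([0, 1] * 100)
X[y == 1, 0] += 2.0

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
results = cross_validate(
    LogisticRegression(C=1.0, max_iter=1000),  # max_iter raised for the toy demo
    X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
    n_jobs=-1,
)
mean_f1 = results["test_f1"].mean()
```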
Four classifiers were compared, all using the same 46 feature scores as input (31 AI red flags + 15 human green flags):
| Model | Mean CV F1 | Std |
|---|---|---|
| Logistic Regression (C=1.0) | 81.9% | ± 0.1% |
| Random Forest (200 trees, depth 10) | 86.3% | ± 0.03% |
| Gradient Boosting (200 est, depth 4) | 88.4% | ± 0.06% |
| XGBoost (200 est, depth 4) | 88.2% | ± 0.04% |
| Fold | F1 | Accuracy |
|---|---|---|
| 1 | 88.2% | 88.5% |
| 2 | 88.3% | 88.6% |
| 3 | 88.5% | 88.7% |
| 4 | 88.4% | 88.7% |
| 5 | 88.4% | 88.7% |
| 6 | 88.3% | 88.6% |
| 7 | 88.4% | 88.7% |
| 8 | 88.4% | 88.6% |
| 9 | 88.3% | 88.6% |
| 10 | 88.4% | 88.6% |
| Mean ± Std | 88.4% ± 0.06% | 88.6% ± 0.06% |
Gradient Boosting was selected as the production model based on highest mean F1 across all 10 folds. Full hyperparameters for reproducibility:
| Parameter | Value |
|---|---|
| Pipeline | StandardScaler → GradientBoostingClassifier |
| n_estimators | 200 |
| max_depth | 4 |
| learning_rate | 0.1 |
| min_samples_leaf | 10 |
| n_jobs | -1 (all CPUs) |
| Random seed | 42 |
| Final training set | 2,119,856 samples (retrained after CV) |
| Client-side runtime | ONNX Runtime Web (WebAssembly) |
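Assembled in scikit-learn, that configuration is (a sketch from the table above, not a verbatim excerpt of train_ml_model.py):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Note: GradientBoostingClassifier has no n_jobs parameter of its own;
# the n_jobs=-1 row above most likely applies to the cross-validation step.
model = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        min_samples_leaf=10,
        random_state=42,
    ),
)
```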
The ML model outperforms the heuristic-only weighted-sum baseline by a significant margin (measured via 10-fold CV on the 2.1M training set):
| Metric | Heuristics Only | + ML Model (CV) | Improvement |
|---|---|---|---|
| Accuracy | 62.8% | 88.6% | +25.8% |
| F1 Score | 71.3% | 88.4% | +17.1% |
| Precision | 58.1% | 90.5% | +32.4% |
| Recall | 92.4% | 86.3% | −6.1% |
Note: heuristic baseline has higher recall but much lower precision (many false positives). The ML model trades a small amount of recall for dramatically better precision.
Final performance on the 374,092-sample held-out test set the model was never trained on:
| Metric | Value |
|---|---|
| Accuracy | 88.7% |
| Precision | 90.6% |
| Recall | 86.2% |
| F1 Score | 88.4% |
| True Negatives (correct human) | 170,357 |
| False Positives (human flagged as AI) | 16,689 |
| False Negatives (AI missed) | 25,785 |
| True Positives (correct AI) | 161,261 |
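The headline metrics follow directly from these confusion-matrix counts (up to rounding):

```python
# Confusion-matrix counts from the held-out test set above
tn, fp, fn, tp = 170_357, 16_689, 25_785, 161_261

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of texts flagged as AI, how many were AI
recall    = tp / (tp + fn)   # of actual AI texts, how many were caught
f1        = 2 * precision * recall / (precision + recall)
```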
| Rank | Feature | Importance |
|---|---|---|
| 1 | paragraph_asymmetry | 0.235 |
| 2 | avg_word_length | 0.207 |
| 3 | lexical_ttr | 0.152 |
| 4 | ai_vocabulary | 0.133 |
| 5 | sentence_uniformity | 0.043 |
| 6 | curly_quotes | 0.040 |
| 7 | passive_voice | 0.025 |
| 8 | comma_density | 0.024 |
| 9 | academic_citations | 0.021 |
| 10 | elegant_variation | 0.014 |
| 11 | first_person_density | 0.013 |
paragraph_asymmetry alone accounts for 23.5% of the model's splitting decisions — uneven paragraph lengths are a strong signal of natural human writing (AI tends to produce uniform blocks). The top four features collectively account for 72.7% of model importance. Two of the top-11 features are human green flags (paragraph_asymmetry, academic_citations), directly confirming their discriminative value.
All parameters needed to reproduce this exact model are saved in ml_metadata.json alongside the model file. The training script is train_ml_model.py. To retrain from scratch:
```shell
python train_ml_model.py --fresh
```
The --fresh flag clears any cached intermediate results. Without it, the script will resume from cached feature extractions and CV results if they exist. The model is saved as ml_model.pkl and loaded automatically at server startup.
| AI Probability | Verdict | Meaning |
|---|---|---|
| < 40% | ✓ Likely Human | Few or no AI patterns detected. Consistent with human writing. |
| 40% – 65% | ~ Inconclusive | Some patterns present but not enough for a confident determination. Could be human writing with a formal style, or lightly edited AI text. |
| > 65% | ⚠ Likely AI | Multiple strong AI patterns detected. Consistent with unedited or lightly edited LLM output. |
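The banding is simple thresholding on the AI probability. A sketch (the handling of the exact 65% boundary is an assumption; the table only specifies "> 65%"):

```python
def verdict(ai_probability):
    """Map an AI probability in [0, 1] to the three verdict bands above."""
    if ai_probability < 0.40:
        return "Likely Human"
    if ai_probability <= 0.65:
        return "Inconclusive"
    return "Likely AI"
```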
In the detector interface, triggered patterns are ranked by importance, calculated as:
Importance = Raw Score × |Heuristic Weight|
This bubbles the most discriminative and strongest-firing patterns to the top. AI red flags and human green flags are ranked separately within their own sidebar tiers, each with a percentage showing that pattern's share of its group's total signal.
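The ranking above can be sketched as follows (function and variable names are illustrative, not the real interface code):

```python
def rank_triggers(triggers, weights):
    """Rank triggered patterns by Raw Score x |Heuristic Weight|,
    descending, attaching each pattern's share of the group's total."""
    scored = {name: raw * abs(weights[name])
              for name, raw in triggers.items()}
    total = sum(scored.values()) or 1.0
    return sorted(
        ((name, imp, imp / total) for name, imp in scored.items()),
        key=lambda t: t[1], reverse=True,
    )

ranked = rank_triggers(
    {"ai_vocabulary": 0.5, "hedging": 0.9},      # raw scores
    {"ai_vocabulary": 4.0, "hedging": 2.0},       # weights from the table
)
```
A weakly firing but highly discriminative pattern (ai_vocabulary at 0.5 × 4.0 = 2.0) can outrank a strongly firing but weaker one (hedging at 0.9 × 2.0 = 1.8).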
Known limitations:
Most AI detectors are black boxes. You paste text in, get a number out, and have no idea why. This makes them dangerous tools for high-stakes decisions like academic integrity.
This tool takes a different approach: it makes every part of its reasoning visible. You can see exactly which patterns it found, how much each one contributed, and where in the text they appear. This means:
Built with compromise.js, scikit-learn, ONNX Runtime Web, and Flask. Detection runs entirely in your browser — no text is ever sent to a server. Source heuristics derived from Wikipedia's Signs of AI Writing.