A transparent look at the methodology, data, models, and limitations behind the detector.
YOU HERETIC! — You coded this up with an LLM…
Yes. Guilty as charged. The irony is not lost on us. An AI-detection tool, built with the help of AI. Go ahead, feed this very source code into the detector — it'll probably flag itself.
But here's the thing: this tool isn't anti-AI. It's anti-bullshit. There's a difference.
LLMs are extraordinary tools for coding, brainstorming, and drafting. The problem isn't that students use AI — it's when they paste in a ChatGPT essay and submit it as their own thinking. The teacher gets a wall of "multifaceted tapestries" and "pivotal paradigm shifts" and has no idea whether a real human brain engaged with the material.
This tool exists to give educators a transparent, explainable second opinion — not a black-box "gotcha" machine, but a teaching aid that says "here are the specific patterns that look machine-generated, and here's why." It's designed to start conversations, not end them.
So yes, we used an LLM to build the thing that detects LLMs. We also used a hammer to build the house that keeps the rain out. Tools are tools. What matters is what you do with them.
🔒 100% Local & Private — Your text never leaves your device.
The entire detection pipeline — feature extraction, NLP processing, and the ML model — runs locally in your browser using WebAssembly. No text is sent to any server. No API calls. No logging. Nothing.
Don't take our word for it — try it with your Wi-Fi off. Disconnect from the internet after the page loads and paste in some text. It still works. That's the proof. Your data is yours.
This tool identifies AI-generated text through a two-layer system:
The ML model runs client-side via ONNX Runtime Web (WebAssembly). When the model loads successfully, the tool uses its probability as the primary signal. If it fails to load (e.g., in a very old browser), the tool falls back to a weighted heuristic sum.
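A minimal sketch of that two-layer decision, written in Python for readability (the real pipeline runs in the browser via ONNX Runtime Web; the function shape and fallback weights here are illustrative, not the production values):

```python
import math

def detect(features, model=None):
    """Return an AI probability for a heuristic feature vector.

    Sketch only: prefer the ML model's probability when the model
    loaded; otherwise fall back to a weighted heuristic sum.
    """
    if model is not None:  # primary signal: the ML model's probability
        return model.predict_proba([features])[0][1]
    # Fallback: weighted sum of heuristic scores, squashed to [0, 1].
    # Illustrative weights only (e.g. AI vocab, openers, slang).
    fallback_weights = [4.0, 3.5, -3.5]
    z = sum(w * f for w, f in zip(fallback_weights, features))
    return 1 / (1 + math.exp(-z))
```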
Crucially, the tool doesn't just give a score — it shows you which patterns triggered and where, so you can form your own judgment about whether the flags are meaningful in context.
Pure neural AI detectors (like GPTZero, Originality.ai, or OpenAI's classifier) are opaque — they give a probability but can't explain why. This is a problem for educators, because:
Our approach is the opposite: every detection is grounded in observable textual evidence. If the tool flags a passage, you can hover over the trigger to see exactly which phrases, word choices, or structures caused it — and decide for yourself whether the flag is meaningful.
The AI red flags fall into four groups, and are joined by 15 human green flags that score evidence of authentic human writing. Each has a weight reflecting how strongly it discriminates between AI and human text (calibrated on the training data). Negative weights belong to green flags — a high score there reduces the AI probability:
| Heuristic | What it looks for | Weight |
|---|---|---|
| Structural Formulas & Rhetorical Patterns | | |
| AI Vocabulary | "delve," "tapestry," "multifaceted," "nuanced" — words that appear at wildly higher rates in AI text | 4.0 |
| Formulaic Openers | "Additionally," "Furthermore," "Moreover" at sentence start | 3.5 |
| Sentence Uniformity | Suspiciously even sentence lengths (doc-level stat) | 3.5 |
| LLM Markup | Internal reference IDs or tags from ChatGPT/Copilot | 3.0 |
| Trailing Participles | ", highlighting its significance" — fake analysis at sentence end | 2.5 |
| Chatbot Artifacts | "Certainly!," "I hope this helps," "Let me know" | 2.5 |
| Didactic Disclaimers | "It's important to note," "plays a crucial role" | 2.5 |
| Copula Avoidance | "Serves as" / "stands as" instead of simply "is" | 2.0 |
| Rule of Three | Tricolon lists: "education, innovation, and progress" | 2.0 |
| Challenges & Future | "Despite these challenges, the future continues to evolve" | 1.5 |
| Formulaic Lists | Bold inline headers, numbered/bullet structures in prose | 1.5 |
| "Not X, but Y" | "Not just... but also" parallel construction | 1.0 |
| Vocabulary & Word Choice | | |
| Promotional Language | "groundbreaking," "unparalleled," "world-class" | 2.0 |
| Hedging | "potentially," "arguably," "fundamentally" | 2.0 |
| Legacy Emphasis | "enduring legacy," "indelible mark," "broader implications" | 2.0 |
| Average Word Length | AI consistently uses longer, more formal words (doc-level) | 2.0 |
| Disclaimers | "Based on available information" — knowledge-cutoff phrases | 2.0 |
| Vague Attributions | "Experts have noted" without citing anyone specific | 1.5 |
| Notability Claims | "Featured in major publications" without naming them | 1.5 |
| Filler Adverbs | "significantly," "increasingly," "effectively" — AI emphasis spray | 1.5 |
| Comma Density | More commas per word than typical student writing (doc-level) | 1.5 |
| Elegant Variation | Many synonyms for the same concept — AI's repetition penalty (doc-level) | 1.5 |
| Avg Sentence Length | Consistently medium-long sentences in 20–35 word range (doc-level) | 1.5 |
| Markdown Artifacts | **bold** or # heading markdown syntax leaked into prose | 1.5 |
| Low-Signal Indicators | | |
| Em Dash Overuse | Multiple em dashes per paragraph | 1.0 |
| Passive Voice | "is considered," "was believed" — higher density | 1.0 |
| Curly Quotes | Smart/curly quotation marks (ChatGPT default) | 3.0 |
| Absence-of-Human-Signal Indicators | | |
| No Contractions | AI expands "don't," "it's" to formal equivalents — near-zero contraction density | 2.5 |
| No First Person | AI analytical text avoids "I" — very low first-person density | 1.5 |
| Vocab Diversity | AI's repetition penalty produces unnaturally high overall lexical variety (type-token ratio) | 1.5 |
| ✅ Human Green Flags (negative weight = evidence against AI) | | |
| Informal Intensifiers | "super," "totally," "pretty much" — informal human emphasis, not AI vocabulary | −2.0 |
| Sentence Fragments | Short phrases without a root verb ("So weird." "No way.") — casual human writing | −2.5 |
| Trailing Ellipsis | Ellipses (…) used to trail off — a human stylistic quirk AI rarely produces | −2.0 |
| Para. Variation | Natural variation in paragraph lengths — AI produces uniform paragraph blocks | −1.5 |
| Disfluency Markers | "I mean," "like," "you know," "um" — spoken rhythms of real thought | −2.5 |
| Emphasis Typography | ALL-CAPS, repeated punctuation (!!) or reduplication ("really really") | −2.0 |
| Personal Time Refs | "last summer," "when I was a kid" — grounded in personal memory | −2.5 |
| Contraction Variety | Many different contractions (I'm, don't, it's, we're) — natural relaxed voice | −2.5 |
| Pronoun Variety | Mixing I, we, you, me, my, our — engaged conversational voice | −2.0 |
| Self-Correction | "actually," "wait no," "I mean—" — writer revising thought mid-sentence | −3.0 |
| Internet Slang | "lol," "omg," "bff," "ngl" — strong evidence of genuine human expression | −3.5 |
| Academic Citations | Parenthetical citations (Smith, 2020) — student drawing on real sources | −3.0 |
| Mild Expletive | "damn," "crap," "hell" — authentic human voice, AI avoids unprompted | −2.5 |
| Casual Spelling | "u," "thru," "ngl," "ilysm" — shorthand signatures of human voice | −3.0 |
| Tag Question | "right?" "aren't they?" — rhetorical questions inviting the reader in | −2.0 |
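To see how red and green flags interact, here is a hedged sketch of the weighted-sum combination (only the heuristic fallback works this way; the production model learns its own combination, and the logistic squash here is an assumption, not the exact production formula):

```python
import math

# A handful of weights from the table above
# (red flags positive, green flags negative).
WEIGHTS = {
    "ai_vocabulary": 4.0,
    "formulaic_openers": 3.5,
    "no_contractions": 2.5,
    "internet_slang": -3.5,
    "self_correction": -3.0,
}

def heuristic_probability(scores):
    """Weighted sum of per-heuristic scores, squashed to [0, 1]."""
    z = sum(WEIGHTS[name] * s for name, s in scores.items())
    return 1 / (1 + math.exp(-z))

# Text firing AI patterns scores high; green flags pull the score down.
ai_like = {"ai_vocabulary": 0.8, "formulaic_openers": 0.5,
           "no_contractions": 0.9, "internet_slang": 0.0,
           "self_correction": 0.0}
human_like = {"ai_vocabulary": 0.1, "formulaic_openers": 0.0,
              "no_contractions": 0.0, "internet_slang": 0.7,
              "self_correction": 0.5}
```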
The heuristics are derived from Wikipedia's "Signs of AI Writing" article and refined through empirical testing on labeled data. They use a mix of regex pattern matching, compromise.js NLP tagging, and statistical measures.
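The regex layer is the simplest of the three mechanisms. A minimal illustration (the word list here is a tiny sample, not the full vocabulary, and the density normalization is an assumption):

```python
import re

# Illustrative subset of flagged AI-vocabulary terms
AI_VOCAB = re.compile(r"\b(delve|tapestry|multifaceted|nuanced)\b",
                      re.IGNORECASE)

def ai_vocabulary_score(text):
    """Matches per 100 words — a density, so long texts aren't
    penalized simply for being long."""
    words = len(text.split()) or 1
    return 100 * len(AI_VOCAB.findall(text)) / words
```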
| Property | Value |
|---|---|
| Source | datasets/unified_dataset_v3.csv (12 aggregated sources) |
| Total samples | 2,493,948 |
| Training set | 2,119,856 (1,059,928 per class, balanced) |
| Held-out test set | 374,092 (never seen during training) |
| Split strategy | Stratified random, seed=42 |
Training used the balanced training split (2,119,856 samples, 1,059,928 per class). Performance was estimated using 10-fold stratified cross-validation on the training set, then the final model was retrained on the full training split. The 374,092-sample held-out test set was never used during training or hyperparameter selection — it exists solely to measure real-world performance, and it's the same pool that the "Try an example" buttons and Quiz Mode draw from.
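The split described above corresponds to a standard scikit-learn stratified split. A sketch on toy data (shapes and variable names are illustrative; the real dataset has ~2.49M rows):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: 46 heuristic feature scores
# per sample, balanced human (0) / AI (1) labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 46))
y = np.array([0, 1] * 500)

# Stratified random split with a fixed seed, mirroring the table above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
```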
| Property | Value |
|---|---|
| Validation method | 10-fold Stratified K-Fold |
| Folds | 10 |
| Shuffle | Yes |
| Random seed | 42 |
| Scoring metrics | Accuracy, Precision, Recall, F1 (all per-fold) |
| Parallelism | n_jobs=-1 (all available CPUs) |
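In scikit-learn terms, the validation setup above looks like the following (toy data; the authoritative version lives in train_ml_model.py):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression

# Toy separable data so the demo produces non-trivial scores
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 46))
y = np.array([0, 1] * 100)
X[y == 1, 0] += 2.0

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
results = cross_validate(
    LogisticRegression(C=1.0, max_iter=1000),  # max_iter raised for the toy demo
    X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
    n_jobs=-1,
)
mean_f1 = results["test_f1"].mean()
```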
Four classifiers were compared, all using the same 46 feature scores as input (31 AI red flags + 15 human green flags):
| Model | Mean CV F1 | Std |
|---|---|---|
| Logistic Regression (C=1.0) | 81.9% | ± 0.1% |
| Random Forest (200 trees, depth 10) | 86.3% | ± 0.03% |
| Gradient Boosting (200 est, depth 4) | 88.4% | ± 0.06% |
| XGBoost (200 est, depth 4) | 88.2% | ± 0.04% |
| Fold | F1 | Accuracy |
|---|---|---|
| 1 | 88.2% | 88.5% |
| 2 | 88.3% | 88.6% |
| 3 | 88.5% | 88.7% |
| 4 | 88.4% | 88.7% |
| 5 | 88.4% | 88.7% |
| 6 | 88.3% | 88.6% |
| 7 | 88.4% | 88.7% |
| 8 | 88.4% | 88.6% |
| 9 | 88.3% | 88.6% |
| 10 | 88.4% | 88.6% |
| Mean ± Std | 88.4% ± 0.06% | 88.6% ± 0.06% |
Gradient Boosting was selected as the production model based on highest mean F1 across all 10 folds. Full hyperparameters for reproducibility:
| Parameter | Value |
|---|---|
| Pipeline | StandardScaler → GradientBoostingClassifier |
| n_estimators | 200 |
| max_depth | 4 |
| learning_rate | 0.1 |
| min_samples_leaf | 10 |
| n_jobs | -1 (all CPUs) |
| Random seed | 42 |
| Final training set | 2,119,856 samples (retrained after CV) |
| Client-side runtime | ONNX Runtime Web (WebAssembly) |
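Assembled in scikit-learn, that configuration is (a sketch from the table above, not a verbatim excerpt of train_ml_model.py):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Note: GradientBoostingClassifier has no n_jobs parameter of its own;
# the n_jobs=-1 row above most likely applies to the cross-validation step.
model = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        min_samples_leaf=10,
        random_state=42,
    ),
)
```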
The ML model outperforms the heuristic-only weighted-sum baseline by a significant margin (measured via 10-fold CV on the 2.1M training set):
| Metric | Heuristics Only | + ML Model (CV) | Improvement |
|---|---|---|---|
| Accuracy | 62.8% | 88.6% | +25.8% |
| F1 Score | 71.3% | 88.4% | +17.1% |
| Precision | 58.1% | 90.5% | +32.4% |
| Recall | 92.4% | 86.3% | −6.1% |
Note: heuristic baseline has higher recall but much lower precision (many false positives). The ML model trades a small amount of recall for dramatically better precision.
Final performance on the 374,092-sample held-out test set the model was never trained on:
| Metric | Value |
|---|---|
| Accuracy | 88.7% |
| Precision | 90.6% |
| Recall | 86.2% |
| F1 Score | 88.4% |
| True Negatives (correct human) | 170,357 |
| False Positives (human flagged as AI) | 16,689 |
| False Negatives (AI missed) | 25,785 |
| True Positives (correct AI) | 161,261 |
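The headline metrics follow directly from these confusion-matrix counts (up to rounding):

```python
# Confusion-matrix counts from the held-out test set above
tn, fp, fn, tp = 170_357, 16_689, 25_785, 161_261

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of texts flagged as AI, how many were AI
recall    = tp / (tp + fn)   # of actual AI texts, how many were caught
f1        = 2 * precision * recall / (precision + recall)
```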
| Rank | Feature | Importance |
|---|---|---|
| 1 | paragraph_asymmetry | 0.235 |
| 2 | avg_word_length | 0.207 |
| 3 | lexical_ttr | 0.152 |
| 4 | ai_vocabulary | 0.133 |
| 5 | sentence_uniformity | 0.043 |
| 6 | curly_quotes | 0.040 |
| 7 | passive_voice | 0.025 |
| 8 | comma_density | 0.024 |
| 9 | academic_citations | 0.021 |
| 10 | elegant_variation | 0.014 |
| 11 | first_person_density | 0.013 |
paragraph_asymmetry alone accounts for 23.5% of the model's splitting decisions — uneven paragraph lengths are a strong signal of natural human writing (AI tends to produce uniform blocks). The top four features collectively account for 72.7% of model importance. Two of the top-11 features are human green flags (paragraph_asymmetry, academic_citations), directly confirming their discriminative value.
All parameters needed to reproduce this exact model are saved in ml_metadata.json alongside the model file. The training script is train_ml_model.py. To retrain from scratch:
```shell
python train_ml_model.py --fresh
```
The --fresh flag clears any cached intermediate results. Without it, the script will resume from cached feature extractions and CV results if they exist. The model is saved as ml_model.pkl and loaded automatically at server startup.
| AI Probability | Verdict | Meaning |
|---|---|---|
| < 40% | ✓ Likely Human | Few or no AI patterns detected. Consistent with human writing. |
| 40% – 65% | ~ Inconclusive | Some patterns present but not enough for a confident determination. Could be human writing with a formal style, or lightly edited AI text. |
| > 65% | ⚠ Likely AI | Multiple strong AI patterns detected. Consistent with unedited or lightly edited LLM output. |
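The banding is simple thresholding on the AI probability. A sketch (the handling of the exact 65% boundary is an assumption; the table only specifies "> 65%"):

```python
def verdict(ai_probability):
    """Map an AI probability in [0, 1] to the three verdict bands above."""
    if ai_probability < 0.40:
        return "Likely Human"
    if ai_probability <= 0.65:
        return "Inconclusive"
    return "Likely AI"
```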
In the detector interface, triggered patterns are ranked by importance, calculated as:
Importance = Raw Score × |Heuristic Weight|
This bubbles the most discriminative and strongest-firing patterns to the top. AI red flags and human green flags are ranked separately within their own sidebar tiers, each with a percentage showing that pattern's share of its group's total signal.
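The ranking above can be sketched as follows (function and variable names are illustrative, not the real interface code):

```python
def rank_triggers(triggers, weights):
    """Rank triggered patterns by Raw Score x |Heuristic Weight|,
    descending, attaching each pattern's share of the group's total."""
    scored = {name: raw * abs(weights[name])
              for name, raw in triggers.items()}
    total = sum(scored.values()) or 1.0
    return sorted(
        ((name, imp, imp / total) for name, imp in scored.items()),
        key=lambda t: t[1], reverse=True,
    )

ranked = rank_triggers(
    {"ai_vocabulary": 0.5, "hedging": 0.9},      # raw scores
    {"ai_vocabulary": 4.0, "hedging": 2.0},       # weights from the table
)
```
A weakly firing but highly discriminative pattern (ai_vocabulary at 0.5 × 4.0 = 2.0) can outrank a strongly firing but weaker one (hedging at 0.9 × 2.0 = 1.8).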
Known limitations:
Most AI detectors are black boxes. You paste text in, get a number out, and have no idea why. This makes them dangerous tools for high-stakes decisions like academic integrity.
This tool takes a different approach: it makes every part of its reasoning visible. You can see exactly which patterns it found, how much each one contributed, and where in the text they appear. This means:
Built with compromise.js, scikit-learn, ONNX Runtime Web, and Flask. Detection runs entirely in your browser — no text is ever sent to a server. Source heuristics derived from Wikipedia's Signs of AI Writing.