Why is it Slop?


How It Works

A transparent look at the methodology, data, models, and limitations behind the detector.

1. The Approach

YOU HERETIC! — You coded this up with an LLM…

Yes. Guilty as charged. The irony is not lost on us. An AI-detection tool, built with the help of AI. Go ahead, feed this very source code into the detector — it'll probably flag itself.

But here's the thing: this tool isn't anti-AI. It's anti-bullshit. There's a difference.

LLMs are extraordinary tools for coding, brainstorming, and drafting. The problem isn't that students use AI — it's when they paste in a ChatGPT essay and submit it as their own thinking. The teacher gets a wall of "multifaceted tapestries" and "pivotal paradigm shifts" and has no idea whether a real human brain engaged with the material.

This tool exists to give educators a transparent, explainable second opinion — not a black-box "gotcha" machine, but a teaching aid that says "here are the specific patterns that look machine-generated, and here's why." It's designed to start conversations, not end them.

So yes, we used an LLM to build the thing that detects LLMs. We also used a hammer to build the house that keeps the rain out. Tools are tools. What matters is what you do with them.

🔒 100% Local & Private — Your text never leaves your device.

The entire detection pipeline — feature extraction, NLP processing, and the ML model — runs locally in your browser using WebAssembly. No text is sent to any server. No API calls. No logging. Nothing.

Don't take our word for it — try it with your Wi-Fi off. Disconnect from the internet after the page loads and paste in some text. It still works. That's the proof. Your data is yours.

This tool identifies AI-generated text through a two-layer system:

  1. 46 hand-crafted features in two complementary groups: 31 AI red flags that check for linguistic patterns statistically associated with LLM output, and 15 human green flags that detect markers of authentic human writing (disfluencies, informal language, personal voice, citations). Strong green flags push the probability down.
  2. A Gradient Boosting classifier trained on all 46 feature scores, which learns the optimal combination weights from labeled data.

The ML model runs client-side via ONNX Runtime Web (WebAssembly). When the model loads successfully, the tool uses its probability as the primary signal. If it fails to load (e.g. very old browser), it falls back to a weighted heuristic sum.

Crucially, the tool doesn't just give a score — it shows you which patterns triggered and where, so you can form your own judgment about whether the flags are meaningful in context.
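The primary-signal-with-fallback logic can be sketched in a few lines of Python. This is a minimal illustration only: the weights shown are a hypothetical four-feature subset of the 46, the function names are not the tool's real identifiers, and the real primary path is an ONNX Runtime Web inference call rather than a Python callable.

```python
import math

# Hypothetical subset of the 46 feature weights:
# positive = AI red flag, negative = human green flag.
HEURISTIC_WEIGHTS = {
    "ai_vocabulary": 4.0,
    "formulaic_openers": 3.5,
    "internet_slang": -3.5,
    "disfluency_markers": -2.5,
}

def ai_probability(feature_scores, model=None):
    """Primary path: ML model probability; fallback: weighted heuristic sum."""
    if model is not None:
        # In the real tool this is an ONNX Runtime Web inference call.
        return model(feature_scores)
    # Fallback: squash the signed weighted sum into (0, 1).
    total = sum(HEURISTIC_WEIGHTS.get(name, 0.0) * score
                for name, score in feature_scores.items())
    return 1.0 / (1.0 + math.exp(-total))
```

A strong red flag alone pushes the probability toward 1, while a strong green flag pushes it toward 0, which is the behavior the two-layer description above requires of the fallback.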

2. Why Heuristics + ML (Not a Black Box)

Pure neural AI detectors (like GPTZero, Originality.ai, or OpenAI's classifier) are opaque: they give a probability but can't explain why. This is a problem for educators, because:

  - A bare score can't be explained to a student, contested, or defended in an academic-integrity conversation.
  - Without visible evidence, a false positive looks exactly like a true positive.

Our approach is the opposite: every detection is grounded in observable textual evidence. If the tool flags a passage, you can hover over the trigger to see exactly which phrases, word choices, or structures caused it — and decide for yourself whether the flag is meaningful.

Design principle: The tool should help a teacher understand what looks like AI and why, not just deliver a verdict. It's a lens, not a judge.

3. The 46 Features (31 AI Red Flags + 15 Human Green Flags)

The 31 AI red flags fall into four groups and are joined by 15 human green flags that score evidence of authentic human writing. Each feature has a weight reflecting how strongly it discriminates between AI and human text (calibrated on the training data). Green flags carry negative weights, so a high score there reduces the AI probability:

| Heuristic | What it looks for | Weight |
|---|---|---|
| **Structural Formulas & Rhetorical Patterns** | | |
| AI Vocabulary | "delve," "tapestry," "multifaceted," "nuanced" — words that appear at wildly higher rates in AI text | 4.0 |
| Formulaic Openers | "Additionally," "Furthermore," "Moreover" at sentence start | 3.5 |
| Sentence Uniformity | Suspiciously even sentence lengths (doc-level stat) | 3.5 |
| LLM Markup | Internal reference IDs or tags from ChatGPT/Copilot | 3.0 |
| Trailing Participles | ", highlighting its significance" — fake analysis at sentence end | 2.5 |
| Chatbot Artifacts | "Certainly!," "I hope this helps," "Let me know" | 2.5 |
| Didactic Disclaimers | "It's important to note," "plays a crucial role" | 2.5 |
| Copula Avoidance | "Serves as" / "stands as" instead of simply "is" | 2.0 |
| Rule of Three | Tricolon lists: "education, innovation, and progress" | 2.0 |
| Challenges & Future | "Despite these challenges, the future continues to evolve" | 1.5 |
| Formulaic Lists | Bold inline headers, numbered/bullet structures in prose | 1.5 |
| "Not X, but Y" | "Not just... but also" parallel construction | 1.0 |
| **Vocabulary & Word Choice** | | |
| Promotional Language | "groundbreaking," "unparalleled," "world-class" | 2.0 |
| Hedging | "potentially," "arguably," "fundamentally" | 2.0 |
| Legacy Emphasis | "enduring legacy," "indelible mark," "broader implications" | 2.0 |
| Average Word Length | AI consistently uses longer, more formal words (doc-level) | 2.0 |
| Disclaimers | "Based on available information" — knowledge-cutoff phrases | 2.0 |
| Vague Attributions | "Experts have noted" without citing anyone specific | 1.5 |
| Notability Claims | "Featured in major publications" without naming them | 1.5 |
| Filler Adverbs | "significantly," "increasingly," "effectively" — AI emphasis spray | 1.5 |
| Comma Density | More commas per word than typical student writing (doc-level) | 1.5 |
| Elegant Variation | Many synonyms for the same concept — AI's repetition penalty (doc-level) | 1.5 |
| Avg Sentence Length | Consistently medium-long sentences in the 20–35 word range (doc-level) | 1.5 |
| Markdown Artifacts | `**bold**` or `#` heading markdown syntax leaked into prose | 1.5 |
| **Low-Signal Indicators** | | |
| Em Dash Overuse | Multiple em dashes per paragraph | 1.0 |
| Passive Voice | "is considered," "was believed" — higher density | 1.0 |
| Curly Quotes | Smart/curly quotation marks (ChatGPT default) | 3.0 |
| **Absence-of-Human-Signal Indicators** | | |
| No Contractions | AI expands "don't," "it's" to formal equivalents — near-zero contraction density | 2.5 |
| No First Person | AI analytical text avoids "I" — very low first-person density | 1.5 |
| Vocab Diversity | AI's repetition penalty produces unnaturally high lexical variety (type-token ratio) | 1.5 |
| **✅ Human Green Flags (negative weight = evidence against AI)** | | |
| Informal Intensifiers | "super," "totally," "pretty much" — informal human emphasis, not AI vocabulary | −2.0 |
| Sentence Fragments | Short phrases without a root verb ("So weird." "No way.") — casual human writing | −2.5 |
| Trailing Ellipsis | Ellipses (…) used to trail off — a human stylistic quirk AI rarely produces | −2.0 |
| Para. Variation | Natural variation in paragraph lengths — AI produces uniform paragraph blocks | −1.5 |
| Disfluency Markers | "I mean," "like," "you know," "um" — spoken rhythms of real thought | −2.5 |
| Emphasis Typography | ALL-CAPS, repeated punctuation (!!) or reduplication ("really really") | −2.0 |
| Personal Time Refs | "last summer," "when I was a kid" — grounded in personal memory | −2.5 |
| Contraction Variety | Many different contractions (I'm, don't, it's, we're) — natural relaxed voice | −2.5 |
| Pronoun Variety | Mixing I, we, you, me, my, our — engaged conversational voice | −2.0 |
| Self-Correction | "actually," "wait no," "I mean—" — writer revising thought mid-sentence | −3.0 |
| Internet Slang | "lol," "omg," "bff," "ngl" — strong evidence of genuine human expression | −3.5 |
| Academic Citations | Parenthetical citations (Smith, 2020) — student drawing on real sources | −3.0 |
| Mild Expletive | "damn," "crap," "hell" — authentic human voice, AI avoids unprompted | −2.5 |
| Casual Spelling | "u," "thru," "ngl," "ilysm" — shorthand signatures of human voice | −3.0 |
| Tag Question | "right?" "aren't they?" — rhetorical questions inviting the reader in | −2.0 |

The heuristics are derived from Wikipedia's "Signs of AI Writing" article and refined through empirical testing on labeled data. They use a mix of regex pattern matching, compromise.js NLP tagging, and statistical measures.
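To make the three mechanisms concrete, here is a sketch of one representative feature of each kind. The real detector uses compromise.js in the browser; this Python version uses naive whitespace tokenization and illustrative word lists, so the function names and exact rules are assumptions, not the tool's actual implementation.

```python
import re
import statistics

# Regex pattern matching: a tiny illustrative slice of the AI-vocabulary list.
AI_VOCAB = re.compile(r"\b(delve|tapestry|multifaceted|nuanced)\b", re.IGNORECASE)
# Simple contraction matcher (feeds the "No Contractions" red flag).
CONTRACTION = re.compile(r"\b\w+'(?:t|s|re|ve|ll|d|m)\b")

def ai_vocab_rate(text):
    """Red flag: share of words drawn from the AI-vocabulary list."""
    words = text.split()
    return len(AI_VOCAB.findall(text)) / max(len(words), 1)

def contraction_density(text):
    """Near zero in AI prose; high in relaxed human writing."""
    words = text.split()
    return len(CONTRACTION.findall(text)) / max(len(words), 1)

def sentence_uniformity(text):
    """Doc-level stat: 1 minus the coefficient of variation of
    sentence lengths, so perfectly even sentences score 1.0."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(lengths) < 2:
        return 0.0
    cv = statistics.stdev(lengths) / statistics.mean(lengths)
    return max(0.0, 1.0 - cv)
```

For example, `ai_vocab_rate("We delve into a multifaceted tapestry")` scores 0.5, since three of the six words are on the list.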

4. Training Dataset

| Property | Value |
|---|---|
| Source | `datasets/unified_dataset_v3.csv` (12 aggregated sources) |
| Total samples | 2,493,948 |
| Training set | 2,119,856 (1,059,928 per class, balanced) |
| Held-out test set | 374,092 (never seen during training) |
| Split strategy | Stratified random, seed=42 |

Training used the balanced training split (2,119,856 samples, 1,059,928 per class). Performance was estimated using 10-fold stratified cross-validation on the training set, then the final model was retrained on the full training split. The 374,092-sample held-out test set was never used during training or hyperparameter selection — it exists solely to measure real-world performance, and it's the same pool that the "Try an example" buttons and Quiz Mode draw from.
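A stratified random split with a fixed seed, as described above, is a one-liner in scikit-learn. The snippet below uses toy data standing in for the real 2,493,948-row feature matrix; the 0.15 test fraction approximates 374,092 / 2,493,948, and the column layout (46 features, labels 0 = human / 1 = AI) is the only assumption carried over from the tables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 1,000 balanced samples with 46 feature columns.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 46))
y = np.tile([0, 1], 500)  # 0 = human, 1 = AI

# Stratified random split with seed 42, mirroring the table above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
```

`stratify=y` is what preserves the exact 50/50 class balance in both splits, so neither side of the split over-represents human or AI samples.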

5. ML Model Details

5a. Cross-Validation Protocol

| Property | Value |
|---|---|
| Validation method | 10-fold Stratified K-Fold |
| Folds | 10 |
| Shuffle | Yes |
| Random seed | 42 |
| Scoring metrics | Accuracy, Precision, Recall, F1 (all per-fold) |
| Parallelism | `n_jobs=-1` (all available CPUs) |
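This protocol maps directly onto scikit-learn's `StratifiedKFold` and `cross_validate`. The sketch below runs on toy data and uses a fast `LogisticRegression` as a stand-in estimator purely so the demo finishes quickly; the production classifier is Gradient Boosting.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the 46 feature scores.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 46))
y = np.tile([0, 1], 100)

# Protocol from the table: 10 stratified folds, shuffled, seed 42.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=cv,
    scoring=("accuracy", "precision", "recall", "f1"),
    n_jobs=-1,  # all available CPUs
)
mean_f1 = scores["test_f1"].mean()  # per-fold metrics, averaged
```

`cross_validate` returns one value per fold per metric, which is exactly the per-fold table reported in section 5c.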

5b. Model Comparison (10-Fold CV, mean ± std)

Four classifiers were compared, all using the same 46 feature scores as input (31 AI red flags + 15 human green flags):

| Model | Mean CV F1 | Std |
|---|---|---|
| Logistic Regression (C=1.0) | 81.9% | ± 0.1% |
| Random Forest (200 trees, depth 10) | 86.3% | ± 0.03% |
| Gradient Boosting (200 est, depth 4) | 88.4% | ± 0.06% |
| XGBoost (200 est, depth 4) | 88.2% | ± 0.04% |

5c. Per-Fold Results (Gradient Boosting)

| Fold | F1 | Accuracy |
|---|---|---|
| 1 | 88.2% | 88.5% |
| 2 | 88.3% | 88.6% |
| 3 | 88.5% | 88.7% |
| 4 | 88.4% | 88.7% |
| 5 | 88.4% | 88.7% |
| 6 | 88.3% | 88.6% |
| 7 | 88.4% | 88.7% |
| 8 | 88.4% | 88.6% |
| 9 | 88.3% | 88.6% |
| 10 | 88.4% | 88.6% |
| **Mean ± Std** | 88.4% ± 0.06% | 88.6% ± 0.06% |

5d. Selected Model & Hyperparameters

Gradient Boosting was selected as the production model based on highest mean F1 across all 10 folds. Full hyperparameters for reproducibility:

| Parameter | Value |
|---|---|
| Pipeline | StandardScaler → GradientBoostingClassifier |
| n_estimators | 200 |
| max_depth | 4 |
| learning_rate | 0.1 |
| min_samples_leaf | 10 |
| n_jobs | -1 (all CPUs) |
| Random seed | 42 |
| Final training set | 2,119,856 samples (retrained after CV) |
| Client-side runtime | ONNX Runtime Web (WebAssembly) |
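The pipeline in the table translates into scikit-learn as follows. One clarification: `GradientBoostingClassifier` itself has no `n_jobs` parameter (its boosting stages are sequential), so the `-1` above applies to the cross-validation step, not the model.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Hyperparameters copied from the table above.
model = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        min_samples_leaf=10,
        random_state=42,
    ),
)
```

After fitting, a pipeline like this can be exported to ONNX for the in-browser runtime; the export step is not shown here.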

5e. Heuristic-Only Baseline vs ML

The ML model outperforms the heuristic-only weighted-sum baseline by a wide margin (measured via 10-fold CV on the 2.1M-sample training set):

| Metric | Heuristics Only | + ML Model (CV) | Improvement (points) |
|---|---|---|---|
| Accuracy | 62.8% | 88.6% | +25.8 |
| F1 Score | 71.3% | 88.4% | +17.1 |
| Precision | 58.1% | 90.5% | +32.4 |
| Recall | 92.4% | 86.3% | −6.1 |

Note: heuristic baseline has higher recall but much lower precision (many false positives). The ML model trades a small amount of recall for dramatically better precision.

5e′. Held-Out Test Set Performance

Final performance on the 374,092-sample held-out test set the model was never trained on:

| Metric | Value |
|---|---|
| Accuracy | 88.6% |
| Precision | 90.6% |
| Recall | 86.2% |
| F1 Score | 88.4% |
| True Negatives (correct human) | 170,357 |
| False Positives (human flagged as AI) | 16,689 |
| False Negatives (AI missed) | 25,785 |
| True Positives (correct AI) | 161,261 |
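The four confusion-matrix cells determine every headline metric, so the numbers can be checked by hand:

```python
# Confusion-matrix cells from the held-out test set above.
tn, fp, fn, tp = 170_357, 16_689, 25_785, 161_261

accuracy  = (tp + tn) / (tp + tn + fp + fn)               # ≈ 0.886
precision = tp / (tp + fp)                                # ≈ 0.906
recall    = tp / (tp + fn)                                # ≈ 0.862
f1        = 2 * precision * recall / (precision + recall) # ≈ 0.884
```

The cells also sum to exactly 374,092, the held-out test-set size.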

5f. Top Feature Importances

| Rank | Feature | Importance |
|---|---|---|
| 1 | paragraph_asymmetry | 0.235 |
| 2 | avg_word_length | 0.207 |
| 3 | lexical_ttr | 0.152 |
| 4 | ai_vocabulary | 0.133 |
| 5 | sentence_uniformity | 0.043 |
| 6 | curly_quotes | 0.040 |
| 7 | passive_voice | 0.025 |
| 8 | comma_density | 0.024 |
| 9 | academic_citations | 0.021 |
| 10 | elegant_variation | 0.014 |
| 11 | first_person_density | 0.013 |

paragraph_asymmetry alone accounts for 23.5% of the model's splitting decisions — uneven paragraph lengths are a strong signal of natural human writing (AI tends to produce uniform blocks). The top four features collectively account for 72.7% of model importance. Two of the top-11 features are human green flags (paragraph_asymmetry, academic_citations), directly confirming their discriminative value.

5g. Reproducibility

All parameters needed to reproduce this exact model are saved in ml_metadata.json alongside the model file. The training script is train_ml_model.py. To retrain from scratch:

```shell
python train_ml_model.py --fresh
```

The --fresh flag clears any cached intermediate results. Without it, the script will resume from cached feature extractions and CV results if they exist. The model is saved as ml_model.pkl and loaded automatically at server startup.

6. Verdict Tiers

| AI Probability | Verdict | Meaning |
|---|---|---|
| < 40% | ✓ Likely Human | Few or no AI patterns detected. Consistent with human writing. |
| 40% – 65% | ~ Inconclusive | Some patterns present but not enough for a confident determination. Could be human writing with a formal style, or lightly edited AI text. |
| > 65% | ⚠ Likely AI | Multiple strong AI patterns detected. Consistent with unedited or lightly edited LLM output. |
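The tiers reduce to a tiny threshold function. One assumption in this sketch: the boundary values (exactly 40% and exactly 65%) are treated as Inconclusive, which the table does not specify.

```python
def verdict(ai_probability):
    """Map a 0-1 AI probability to the verdict tiers above."""
    if ai_probability < 0.40:
        return "Likely Human"
    if ai_probability <= 0.65:
        return "Inconclusive"
    return "Likely AI"
```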

7. How Trigger Ranking Works

In the detector interface, triggered patterns are ranked by importance, calculated as:

Importance = Raw Score × |Heuristic Weight|

This bubbles the most discriminative and strongest-firing patterns to the top. AI red flags and human green flags are ranked separately within their own sidebar tiers, each with a percentage showing that pattern's share of its group's total signal.
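The ranking and share-of-group computation can be sketched as follows. The trigger names, raw scores, and weights below are hypothetical examples, not values the tool would necessarily produce.

```python
# Hypothetical triggered patterns: (name, raw_score, weight).
triggers = [
    ("ai_vocabulary",     0.8,  4.0),
    ("formulaic_openers", 0.5,  3.5),
    ("internet_slang",    0.6, -3.5),  # green flag (negative weight)
    ("em_dash_overuse",   0.9,  1.0),
]

# Importance = raw score x |weight|; strongest patterns first.
ranked = sorted(triggers, key=lambda t: t[1] * abs(t[2]), reverse=True)

# Red flags and green flags are ranked in separate sidebar tiers;
# each pattern gets its share of that group's total signal.
red = [t for t in ranked if t[2] > 0]
red_total = sum(score * abs(w) for _, score, w in red)
shares = {name: 100 * score * abs(w) / red_total for name, score, w in red}
```

With these example values, `ai_vocabulary` (importance 0.8 × 4.0 = 3.2) tops the red-flag tier with roughly a 55% share of its group's total signal.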

8. Caveats & Limitations

This tool is not proof of AI use. It highlights statistical patterns and should be used as one data point alongside your professional judgment about a student's abilities and writing history.

Known limitations:

  - Short passages give the doc-level statistics (sentence uniformity, paragraph variation, lexical diversity) very little to work with, making scores less reliable.
  - Lightly edited or paraphrased AI text sheds many of the surface patterns the red flags look for.
  - Polished, formal human writing can legitimately trigger several red flags, which is why borderline scores land in the Inconclusive tier.

9. Philosophy: Why Transparency Matters

Most AI detectors are black boxes. You paste text in, get a number out, and have no idea why. This makes them dangerous tools for high-stakes decisions like academic integrity.

This tool takes a different approach: it makes every part of its reasoning visible. You can see exactly which patterns it found, how much each one contributed, and where in the text they appear. This means:

  - A flag can be checked against the actual text rather than taken on faith.
  - A teacher can discuss specific passages with a student instead of delivering a verdict.
  - Disagreement is possible: if a flagged pattern is clearly innocent in context, you can discount it.

Built with compromise.js, scikit-learn, ONNX Runtime Web, and Flask. Detection runs entirely in your browser — no text is ever sent to a server. Source heuristics derived from Wikipedia's Signs of AI Writing.