How Much Is AI Generated? | What The Numbers Miss

Across new web pages, studies often find AI-written text present in many posts, yet the share swings by niche, language, and detection method.

You’ve probably seen bold claims like “half the web is AI” or “a lot of content is written by bots now.” The truth is messier. “AI-generated” can mean a fully machine-written page, a human draft polished by a model, a paragraph rewritten for clarity, or a headline suggestion.

This guide helps you answer the question in a way that stands up to scrutiny: what “AI generated” means, what research actually measures, and how to estimate the share in a site, a class project, a newsroom, or a content pipeline without guessing.

How Much Is AI Generated? A Practical Way To Estimate

Start by deciding what you’re measuring. If you treat “AI-generated” as “any AI involvement,” your number jumps fast. If you treat it as “mostly model-written,” your number drops. Pick one definition, stick to it, then write the rule in plain language so a reader can repeat your math.

Pick one of three plain definitions

AI-assisted: A person wrote the core text and used AI for edits, outlines, summaries, or headline options.
AI-drafted: AI produced the first full draft and a person revised it line by line.
AI-authored: AI produced most of the final text with light human edits (typos, formatting, short swaps).

For many audits, “AI-authored” is the clearest bucket. It separates “spellcheck plus a few rewrites” from “a model wrote the page.”

Use two signals, not one

Detectors can be wrong. Watermarks can be removed. Metadata is uneven. A better estimate uses two independent signals and treats the result as a range.

Text signal: Run a detector on a sample and record a probability score, not a yes/no label.
Process signal: Check creation logs when you have them (CMS revision history, prompt logs, writing platform exports).

If you only have the text, you can still do a careful sample-based estimate. You just need to be clear about the limits.

What Research Says About How Much Content Uses AI

Public research often measures “AI presence in text,” not “fully machine-written pages.” That wording matters because a page can contain a few AI-written sentences and still be mostly human.

One large crawl study by Ahrefs reported that 74.2% of newly created pages they reviewed in April 2025 contained AI-generated content, based on their detection method and sample design. Treat that as “AI presence,” not “AI wrote the whole page.”

Another analysis by Graphite reported a rapid rise in AI-written articles after late 2022 and said AI-generated articles overtook human-written ones around November 2024 within the data they tracked. Their dataset and classifier choices shape that result, so it works best as a trend signal.

Takeaways you can safely use:

AI text is common in new publishing.
Percentages vary because “AI-generated” is defined and detected in different ways.
One number for “the whole internet” is a headline, not a measurement.

Why the numbers swing so widely

Three factors drive the swing:

Sampling: New pages vs. all pages, English-only vs. multilingual, blogs vs. ecommerce vs. forums.
Thresholds: A detector score of 0.55 vs. 0.90 creates different counts.
Edits: Human revision can erase patterns detectors look for, even when a model drafted the text.

So when you hear “X% is AI,” your first move is simple: ask what “AI” means in that study and what the dataset includes.

How Detection Works And Where It Breaks

Most text detectors look for statistical patterns in token choice and sentence structure that differ from human writing. Some tools train on known AI outputs and learn what that style looks like.

Peer-reviewed surveys of detection tools note a blunt reality: accuracy varies, and detectors can be unreliable, with bias risks across writing styles and language backgrounds. That’s one reason you should treat detector results as a signal, not a verdict.

Four common failure modes

Short text: A paragraph, a caption, or a chat reply gives too little signal.
Heavy edits: Line-by-line human rewriting can push a model draft under a detector threshold.
Translation: Machine translation can look “AI-like” even when the original text was human.
Non-standard writing: Simple phrasing, second-language writing, or strict templates can trigger false flags.

Watermarking is another idea: embed a subtle pattern in generated text so it can be detected later. Research summaries note that watermark designs can be weakened by edits, paraphrasing, or prompt tricks, and some designs raise false-positive risks.

Sampling Method You Can Repeat

If you want an estimate you can defend, use a small, clean method. You don’t need a lab. You need a consistent sample and a recorded rule set.

Step 1: Define your population

Write one sentence that states what you are measuring. Examples: “All blog posts published on Site A in 2025” or “Student essays in Course B this semester.” Then lock that scope.

Step 2: Pull a random sample

Random beats “pick the ones that feel suspicious.” Use a list of URLs or IDs, shuffle, and pick a fixed count. For small sites, 30–60 items can still reveal patterns. For big archives, go higher.
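The shuffle-and-pick step can be sketched in a few lines of Python. This is a minimal illustration, assuming the population is a plain list of URLs; the `draw_sample` name, the placeholder domain, and the seed value are all hypothetical. Recording the seed lets someone else reproduce the same draw.

```python
import random

def draw_sample(population, sample_size, seed=2025):
    """Return a reproducible simple random sample, drawn without replacement."""
    if sample_size > len(population):
        raise ValueError("sample_size exceeds population size")
    rng = random.Random(seed)  # local generator, independent of global random state
    return rng.sample(population, sample_size)

# Hypothetical population: 200 post URLs on a placeholder domain.
urls = [f"https://example.com/post-{i}" for i in range(1, 201)]
sample = draw_sample(urls, 50)
print(len(sample), "items drawn;", len(set(sample)), "unique")
```

Sampling without replacement means no post is scored twice, and the fixed count keeps the denominator of your later percentages stable.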

Step 3: Score each item twice

Use one detector score as your text signal. Pair it with a process signal if you can: revision logs, drafts, author notes, or content platform exports. If you can’t, mark that field as “unknown” and keep going.

Step 4: Set your threshold and publish a range

Choose a primary threshold that maps to your definition. Then publish two numbers around it: a “strict” count (higher threshold) and a “loose” count (lower threshold). That range is often more honest than a single point estimate.
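As a rough sketch of the strict and loose counts, the snippet below applies two thresholds to the same set of detector scores. The scores are made-up illustration data, and the 0.90 and 0.55 cutoffs are assumptions borrowed from the threshold comparison earlier in this guide, not recommended settings.

```python
def share_at_or_above(scores, threshold):
    """Fraction of items whose detector score is at or above the threshold."""
    if not scores:
        raise ValueError("scores is empty")
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical detector scores (0-1) for ten sampled items.
scores = [0.12, 0.35, 0.58, 0.61, 0.72, 0.83, 0.88, 0.91, 0.95, 0.97]

STRICT = 0.90  # assumed high threshold
LOOSE = 0.55   # assumed lower threshold

strict_share = share_at_or_above(scores, STRICT)
loose_share = share_at_or_above(scores, LOOSE)
print(f"Estimated share: {strict_share:.0%} (strict) to {loose_share:.0%} (loose)")
```

With these sample scores, three items clear the strict cutoff and eight clear the loose one, so the reported range is 30% to 80%. A gap that wide is itself a finding: it says the estimate depends heavily on where you draw the line.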

Below is a compact set of scoring rules you can lift into a spreadsheet.

Signal Or Check	What To Record	How To Read It
Detector probability	Score from Tool A (0–1)	Use as a gradient, not a label
Second detector	Score from Tool B (0–1)	Disagreement hints uncertainty
Revision depth	Count of meaningful edits	Many edits may mask a model draft
Draft source	Human / AI / unknown	Logs beat guesswork
Template density	Percent reused boilerplate	High reuse can fool detectors
Fact density	Number of checkable claims	Low density often signals thin writing
Attribution trail	Sources cited or linked	Clear sourcing reduces risk
Human review flag	Editor reviewed? yes/no	Review raises reliability of final text
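If you would rather generate the spreadsheet than type it, one possible layout is a record per sampled item written out as CSV. This is a sketch only: the `AuditRow` class, its field names, and the example values are hypothetical mappings of the table rows above, not a format any detector produces.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class AuditRow:
    item_id: str
    detector_a: float       # Tool A score, 0-1
    detector_b: float       # Tool B score, 0-1
    meaningful_edits: int   # revision depth
    draft_source: str       # "human", "ai", or "unknown"
    boilerplate_pct: float  # template density, 0-100
    checkable_claims: int   # fact density
    sources_cited: int      # attribution trail
    editor_reviewed: bool   # human review flag

def rows_to_csv(rows):
    """Serialize audit rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(AuditRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()

# One made-up row for illustration.
rows = [AuditRow("post-17", 0.91, 0.62, 3, "unknown", 12.5, 8, 4, True)]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping `draft_source` as an explicit "unknown" value, instead of leaving the cell blank, makes it easy to count how much of the sample lacks a process signal.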

What Google Cares About When AI Writes The Draft

Search engines don’t ban AI-written text by default. They demote pages that exist to manipulate rankings or that fail the reader. Google’s documentation on spam policies describes “scaled content abuse” as producing many pages mainly to game search results, no matter how the pages were made. Google’s Search spam policies spell out that the rule centers on intent and value.

That means your risk is rarely “AI was used.” Your risk is “the page reads like a thin rewrite, gets facts wrong, or repeats the same structure across hundreds of URLs.”

Signals that keep pages readable

Clear scope: who the page is for and what it answers.
Specific detail: steps, measurements, constraints, and choices explained in plain words.
Editing with intent: tighten, check, and remove anything that doesn’t help.
Sources for non-obvious claims: link out sparingly and wisely.

If you publish AI-assisted work, treat AI as a draft engine. The final result should still read like a human cared about clarity and accuracy.

How To Use AI Without Ending Up With “Samey” Pages

“Samey” content is the trap. It happens when each post uses the same outline, the same generic advice, and the same empty phrases. Readers bounce fast. Search systems notice.

Write with a checkable spine

Before you write, list 10–15 facts or decisions the page should contain. Then write to that list. If you can’t list them, the topic may be too broad or the page needs a narrower angle.

Force specificity with constraints

Add constraints that push real thinking: time windows, audience skill level, data sources, and what you did not measure. Constraints stop a model from drifting into generic text.

Keep a revision log

Even a simple note like “Stats checked; examples updated; headings rewritten for clarity” can help internal quality checks and team handoffs. It also makes updates easier later.

Estimating AI Share In A Website: A Worked Example

Say a site published 200 posts last year. You sample 50 at random. You run two detectors and record revision depth from the CMS.

Your rule: label a post “AI-authored” only when both detectors score ≥0.80 and revision depth is low. Label it “AI-drafted” when at least one detector scores ≥0.80 and revision depth is medium or high. Everything else stays “unclear.”

Out of 50 posts:

12 meet the “AI-authored” rule.
18 meet the “AI-drafted” rule.
20 land in “unclear.”

You can report: “In this sample, 24% appear AI-authored, 36% appear AI-drafted, and 40% are unclear under our rules.” That tells the story without pretending the tools are perfect.
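The labeling rule and the percentages in this example can be sketched in Python. The function names and the three-level `revision_depth` string are assumptions made for illustration. The AI-drafted branch fires when at least one detector reaches 0.80; if you mean exactly one, tighten that condition.

```python
from collections import Counter

THRESHOLD = 0.80  # cutoff used in the worked example

def label_post(score_a, score_b, revision_depth):
    """Apply the example rule. revision_depth is 'low', 'medium', or 'high'."""
    high_a = score_a >= THRESHOLD
    high_b = score_b >= THRESHOLD
    if high_a and high_b and revision_depth == "low":
        return "AI-authored"
    if (high_a or high_b) and revision_depth in ("medium", "high"):
        return "AI-drafted"
    return "unclear"

def bucket_shares(labels):
    """Share of the sample that falls in each bucket."""
    counts = Counter(labels)
    total = len(labels)
    return {bucket: count / total for bucket, count in counts.items()}

# Labels matching the tallies in the example: 12, 18, and 20 out of 50.
labels = ["AI-authored"] * 12 + ["AI-drafted"] * 18 + ["unclear"] * 20
shares = bucket_shares(labels)
for bucket, share in shares.items():
    print(f"{bucket}: {share:.0%}")
```

Note that a post with one high score and low revision depth falls through to “unclear” under this rule, which is the conservative choice: a single detector without supporting process evidence is not treated as enough.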

Result Bucket	What It Means	Next Action
AI-authored	Model wrote most of final text	Audit facts, add sources, strengthen specific detail
AI-drafted	Model drafted, person rewrote heavily	Spot-check accuracy, smooth voice, check claims
AI-assisted	Human wrote core, AI helped edits	Low risk; keep editing standards steady
Unclear	Signals conflict or text is too short	Increase sample size or add process logs
Template-heavy	Boilerplate dominates the page	Rewrite templates; reduce repeated blocks
Human-first	Low detector scores plus rich revision history	Keep the workflow; update on a schedule
Mixed media	Text is human, images or audio are synthetic	Label assets where your platform expects it

Practical Checklist Before You Publish AI-Assisted Writing

State the page goal in one sentence, then match the intro to that goal.
Remove repeated filler patterns: generic intros, empty adjectives, copycat headings.
Check each claim that could be checked in a trusted source.
Add one or two specific details: a small test, a comparison, a step list, a constraint.
Keep external links few and specific, aimed at primary documentation.

Provenance Tools That Help Label Synthetic Media

Text detection is only one part of the story. Images, audio, and video can be synthetic too. One path is provenance metadata that travels with the file.

The C2PA specifications behind Content Credentials describe a standard for recording and checking that provenance data. When platforms preserve it, readers can inspect creation details without guesswork. The C2PA specifications lay out the pieces and how they fit together.

Even with metadata, edits and reposts can strip signals. That’s why a practical estimate still leans on definitions, sampling, and a range, not a single magic number.

References & Sources

Google Search Central. “Spam Policies for Google Web Search.” Defines scaled content abuse and other spam rules that apply no matter how content is produced.
Coalition for Content Provenance and Authenticity (C2PA). “C2PA Specifications.” Describes a technical standard for attaching and checking provenance data for digital media, including synthetic content markers.