The evolution of machine-to-machine communication has always followed a predictable pattern. In my younger days of enterprise computing, email—designed for human communication with subjects, greetings, and signatures—was repurposed for system integration. Organizations would parse email subjects with specific keywords to trigger automated workflows, a crude but necessary workaround when systems sat behind firewalls with limited connectivity options.
This inefficiency drove innovation. The industry evolved through message queuing systems (MQ), then SOAP with its XML verbosity, and eventually to REST APIs with clean JSON payloads. Each transition represented a shift toward more efficient, purpose-built protocols for machine communication.
Today, we face a similar inflection point. AI agents are forced to consume HTML—a format designed for human visual consumption—to extract structured data. Just as email was never meant for system integration, HTML was never designed for AI consumption.
It's time for an AI Content Standard.
The Content Type Spectrum
Not all web content is created equal. A useful way to think about what agents encounter:
Static content (60–70%) → Static JSON / MCP
The majority of the web is static, structured data — the kind LLMs were trained on. As the Stanford lecture on LLMs explains, models like Llama 3 were trained on ~15 trillion tokens crawled from the web via tools like Common Crawl, which covers ~250 billion pages. This is the raw material of the internet — and it's also the easiest for agents to consume via static MCP endpoints.
Static content is where the majority of training data comes from
LLMs are trained on "all of the internet" — specifically via web crawlers like Common Crawl, which indexes ~250 billion pages (~1 petabyte of data)
Raw internet data is described as very "dirty" and not representative of what's actually useful
— (Reference: Stanford LLM lecture, https://youtu.be/9vM4p9NN0Ts?si=vW_mnRMHaXFacHL8)
Interactive content (20–25%) → WebMCP
A significant portion of the web requires interaction — login flows, dynamic forms, JavaScript-rendered content. This is where WebMCP comes in, bridging the gap between static data access and real-world web automation.
WebMCP — Browser Automation for Agents
(Note: As of early 2026, WebMCP is not a ratified or widely adopted standard; it's an emerging, proposed concept.)
Most AI agents today rely on APIs to interact with the web. But a large share of the web exposes no API at all. WebMCP fills this gap by enabling agents to interact with websites the way a browser would — clicking buttons, filling forms, and navigating pages. It executes actions defined in HTML forms or JavaScript, making it the right tool for any site that only exposes a web interface. Think of it as the action layer of the web: if MCP over REST/GraphQL is how agents read structured data, WebMCP is how they act on unstructured interfaces.
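Because WebMCP is still a proposal, there is no real API to show yet. Purely as a hypothetical sketch, an agent-facing "action descriptor" for a web form might look something like this — the tool name, keys, and structure below are invented for illustration:

```python
# Hypothetical sketch only: WebMCP is not a ratified standard, so the
# tool name and descriptor shape here are illustrative assumptions.

def describe_form_action(form_id: str, fields: dict) -> dict:
    """Build a hypothetical action descriptor an agent could execute,
    mirroring what a search or login form exposes in HTML."""
    return {
        "tool": "web.submit_form",   # invented tool name, not a real API
        "target": form_id,
        "arguments": fields,
    }

action = describe_form_action("search-form", {"q": "breaking news"})
print(action["tool"])        # web.submit_form
print(action["arguments"])   # {'q': 'breaking news'}
```

The point is not the specific shape but the separation of concerns: the page declares what actions exist, and the agent invokes them as tools instead of scraping pixels and markup.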
Authenticated / structured content (10–15%) → GraphQL
Enterprise and SaaS platforms expose rich, typed data behind authentication. GraphQL is the right protocol here — precise, efficient, and access-controlled.
Every day, millions of AI agents and LLMs waste computational resources parsing HTML designed for human eyes, not machine consumption. When an AI agent visits a news website, e-commerce store, or social media platform, it must:
Download 100KB+ of HTML, CSS, and JavaScript
Parse through presentation markup (`<div>`, `<span>`, styling)
Extract semantic meaning from visual layouts
Handle inconsistent structures across sites
Deal with dynamic content and anti-scraping measures
The result? A single query that should cost pennies ends up costing dollars in compute and API calls.
Real Cost Example
Current State (HTML Scraping):
Average webpage: 2,000+ tokens of HTML
LLM processing: $0.003 per 1K tokens (Claude)
Cost per page: ~$0.006
1 million pages: $6,000
With Structured JSON:
Structured data: 200 tokens
LLM processing: $0.003 per 1K tokens
Cost per page: ~$0.0006
1 million pages: $600
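The arithmetic behind these figures is simple enough to check directly. A quick sketch, using the illustrative per-token price quoted above:

```python
# Reproduce the back-of-envelope cost figures from the example above.
PRICE_PER_1K_TOKENS = 0.003  # illustrative LLM price used in this article

def cost_per_page(tokens: int) -> float:
    """Cost of processing one page's worth of tokens."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS

html_cost = cost_per_page(2000)  # average raw HTML page
json_cost = cost_per_page(200)   # structured JSON equivalent

print(f"HTML: ${html_cost:.4f}/page, ${html_cost * 1_000_000:,.0f} per 1M pages")
print(f"JSON: ${json_cost:.4f}/page, ${json_cost * 1_000_000:,.0f} per 1M pages")
# HTML: $0.0060/page, $6,000 per 1M pages
# JSON: $0.0006/page, $600 per 1M pages
```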
With GraphQL the cost drops even further, by roughly the same proportion as the HTML-to-JSON reduction. Most REST APIs return a lot of data that is irrelevant to the task at hand; GraphQL solves this by letting agents select exactly the fields a task needs.
The Benefits: Why This Matters
- ~90% reduction in cost
- ~10x faster data extraction, and therefore less compute and energy
- Lower bandwidth consumption
- Improved accuracy
- Universal parsing code
- Easier maintenance
- Faster AI responses
- More reliable information
- Richer integrations
- Already improves search rankings
- Better visibility in AI results
- Future-proof for AI agents
What We Currently Have
1. HTML for Humans
<div class="article">
<h1 class="title">Breaking News</h1>
<span class="author">Amin Asif</span>
<time>2026-04-22</time>
</div>
- Rich visual presentation
- Inconsistent structure across sites
- 10–100x more tokens than needed
2. APIs for Developers
{
"title": "Breaking News",
"author": "Amin Asif",
"date": "2026-04-22"
}
- Efficient and structured
- No universal schema: every API structures its data differently, so each integration needs custom parsing
3. Schema.org Embedded in HTML
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "NewsArticle",
"headline": "Breaking News",
"author": {"@type": "Person", "name": "Amin Asif"},
"datePublished": "2026-04-22"
}
</script>
- Standardized vocabulary (800+ types)
- Machine-readable
- Problems:
Still embedded in HTML (must download full page)
Voluntary adoption (most sites don't use it)
Incomplete coverage
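Even with these problems, the embedded JSON-LD is straightforward to pull out once you have the full page. A minimal sketch using only the Python standard library (the extractor class and sample page are illustrative):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect application/ld+json script blocks from an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            # Script content may arrive in chunks; parse once it closes.
            self.blocks.append(json.loads("".join(self.buffer)))
            self.buffer = []
            self.in_jsonld = False

page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle",
 "headline": "Breaking News",
 "author": {"@type": "Person", "name": "Amin Asif"}}
</script>
</head><body><div class="article">...</div></body></html>"""

parser = JSONLDExtractor()
parser.feed(page)
print(parser.blocks[0]["headline"])  # Breaking News
```

Note the core inefficiency this section describes: the few hundred bytes of JSON-LD are only reachable by downloading and parsing the entire page around them.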
4. RSS/Atom Feeds
<item>
<title>Breaking News</title>
<author>Amin Asif</author>
<pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
</item>
- Separate machine-readable endpoint
- Widely adopted for blogs/news
What We're Missing: The Universal Machine-Readable Web
The Vision: Dual-Interface Web
Just as websites provide both HTML (for humans) and RSS (for feed readers), every meaningful webpage should offer:
For Humans:
https://example.com/article/123
→ Returns HTML with styling, images, ads
For Machines:
https://example.com/article/123.json
→ Returns structured JSON with Schema.org vocabulary
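The routing logic behind a dual interface is trivially small. A sketch, assuming a `.json` suffix convention and an in-memory article store (both are illustrative choices, not part of any standard):

```python
import json

# Illustrative in-memory "database" for the dual-interface sketch.
ARTICLES = {
    "123": {"headline": "Breaking News", "author": "Amin Asif",
            "datePublished": "2026-04-22"},
}

def serve(path: str):
    """Return (content_type, body) for a request path.
    One logical resource, two representations."""
    if path.endswith(".json"):                            # machine interface
        article_id = path.rsplit("/", 1)[-1].removesuffix(".json")
        return "application/json", json.dumps(ARTICLES[article_id])
    article_id = path.rsplit("/", 1)[-1]                  # human interface
    a = ARTICLES[article_id]
    return "text/html", f"<h1>{a['headline']}</h1><p>{a['author']}</p>"

print(serve("/article/123.json")[0])  # application/json
print(serve("/article/123")[0])       # text/html
```

HTTP content negotiation (`Accept: application/json`) would work equally well; the `.json` suffix simply makes the machine interface visible and linkable.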
Why Schema.org is the Nearest Answer
Schema.org already provides:
800+ standardized types: Article, Product, Person, Event, Recipe, JobPosting, etc.
Industry backing: Created by Google, Microsoft, Yahoo, Yandex
Proven adoption: Used by millions of sites for SEO
Extensible: Can add new types as needed
Language-agnostic: Works with JSON-LD, RDFa, Microdata
Three-Layer Solution :
Layer 1: MCP for transport (connectivity). The transport layer is already standardized:
How to discover available data sources
How to authenticate and connect
How to call functions/tools
Layer 2: GraphQL (precise querying)
The over-fetching problem with REST
How GraphQL lets agents request only needed fields
Real cost comparison for 1M queries: HTML scraping ($6,000) vs REST ($1,500) vs GraphQL ($150)
97.5% cost reduction versus raw HTML when combining structured JSON with GraphQL
Concrete examples comparing:
REST returning 500 tokens (everything)
GraphQL returning 50 tokens (only what's needed)
GraphQL is essential for AI agent efficiency, not a nice-to-have
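The mechanics of that saving can be mimicked in a few lines. This sketch simulates GraphQL's field selection over a typical over-fetching REST payload (field names are invented for the example; a real GraphQL server resolves a typed query, it does not filter a REST response):

```python
# A typical REST payload: everything about the resource, needed or not.
rest_response = {
    "id": "123", "title": "Breaking News", "author": "Amin Asif",
    "body": "full article text ...", "comments": [], "related": [],
    "analytics": {"views": 10431}, "render_hints": {"theme": "dark"},
}

def select(payload: dict, fields: list) -> dict:
    """Return only the requested fields, like the GraphQL query
    `{ title author }` would."""
    return {k: payload[k] for k in fields}

trimmed = select(rest_response, ["title", "author"])
print(trimmed)  # {'title': 'Breaking News', 'author': 'Amin Asif'}
```

The agent pays tokens only for `title` and `author`; everything else never crosses the wire.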
Layer 3: Schema.org (common vocabulary), or any mechanism that solves this problem
What vocabulary to use
How to structure data
What properties are available
Token-Optimised Web for Machines
Here's an emerging idea worth watching: if MCP tools are going to search the web, why serve them the same HTML designed for human eyes?
The opportunity is a machine-readable web layer — lightweight, token-efficient representations of web content that agents can consume cheaply, with links back to the full human-readable HTML for context or verification. Think of it as robots.txt evolved: a structured, low-token surface for AI traversal, sitting alongside the rich visual experience built for humans.
This isn't just about efficiency. As LLMs scale to trillions of tokens of training data, the quality and structure of what they consume matters as much as the volume. A token-optimised web layer could become the next standard — the way RSS was for feed readers, but for agents.
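One hypothetical shape for such a layer is a site-level manifest linking machine representations back to their human pages. Everything here — the keys, the structure, the very idea of this file — is invented for illustration; no such standard exists yet:

```python
import json

# Hypothetical manifest for a machine-readable web layer.
# Keys and structure are invented for illustration, not a standard.
manifest = {
    "version": "0.1",
    "resources": [
        {
            "human": "https://example.com/article/123",
            "machine": "https://example.com/article/123.json",
            "type": "NewsArticle",    # Schema.org vocabulary
            "approx_tokens": 200,     # lets agents budget before fetching
        }
    ],
}

print(manifest["resources"][0]["type"])  # NewsArticle
print(json.dumps(manifest)[:40])
```

Like robots.txt, it would sit at a well-known location; unlike robots.txt, it would tell agents what to fetch, not just what to avoid.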
ALL layers are essential. MCP without standard schema is like having a phone but no common language to speak.
Implementation:
The implementation complexity is lower than it appears — modern AI coding assistants can scaffold the server-side JSON endpoint in hours, not weeks.
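To make that concrete: a dual-interface endpoint needs nothing beyond the standard library. This is a minimal sketch, not a spec — the routes, article data, and suffix convention are illustrative:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative article, expressed in Schema.org vocabulary.
ARTICLE = {"@context": "https://schema.org", "@type": "NewsArticle",
           "headline": "Breaking News",
           "author": {"@type": "Person", "name": "Amin Asif"}}

class DualHandler(BaseHTTPRequestHandler):
    """Serve JSON for *.json paths, HTML for everything else."""
    def do_GET(self):
        if self.path.endswith(".json"):
            body = json.dumps(ARTICLE).encode()
            ctype = "application/json"
        else:
            body = b"<h1>Breaking News</h1>"
            ctype = "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), DualHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

with urllib.request.urlopen(f"http://127.0.0.1:{port}/article/123.json") as r:
    data = json.load(r)
print(data["headline"])  # Breaking News
server.shutdown()
```

A production version would add caching headers and per-resource routing, but the shape of the work is exactly this small.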
This document advocates for a fundamental shift in how we think about web content—not just for humans, but for the AI agents that increasingly mediate our interactions with information.
The patterns exist. The protocols are maturing. What's missing is community alignment on a shared specification. This is an invitation to that conversation.