The Web Has a Stowaway Problem
Every time an AI company trains a model or powers a chatbot that answers questions in real time, it needs content — articles, product pages, forum threads, documentation, blog posts. Most of that content comes from the open web. And most of the time, nobody asks permission or pays anyone a dime.
This isn't a secret. It's just how the internet worked before AI made the scale of scraping impossible to ignore. A search engine crawler showing up once a month felt harmless. AI bots collectively making billions of requests a day across the web to feed commercial products are a different conversation entirely.
So Some Sites Started Building Mazes
One early response was purely defensive: make scraping expensive. If a bot can't tell real content from fake content, it wastes compute processing garbage. Cloudflare, which sits in front of roughly a fifth of all websites, turned this into a product called AI Labyrinth. It generates convincing-looking fake pages using AI and feeds them to scrapers that ignore no-crawl instructions or don't identify themselves properly. The scraper thinks it's collecting content. It's actually collecting noise.
The logic is simple: if stealing your content costs the same as licensing it, some AI companies will start licensing it. The maze is leverage, not just defense.
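The leverage argument can be put into rough arithmetic. The sketch below assumes a scraper cannot tell decoy pages from real ones and pays the same compute cost for every fetch. The function names and all prices are hypothetical, chosen only to illustrate the reasoning.

```python
def cost_per_useful_page(cost_per_fetch: float, decoy_fraction: float) -> float:
    """Effective cost of one real page when a share of fetched pages are decoys."""
    if not 0.0 <= decoy_fraction < 1.0:
        raise ValueError("decoy_fraction must be in [0, 1)")
    # Only (1 - decoy_fraction) of fetches return real content,
    # so each real page costs 1 / (1 - decoy_fraction) fetches on average.
    return cost_per_fetch / (1.0 - decoy_fraction)


def break_even_decoy_fraction(cost_per_fetch: float, license_price: float) -> float:
    """Decoy share at which scraping costs as much per real page as licensing."""
    return max(0.0, 1.0 - cost_per_fetch / license_price)


# Hypothetical prices: $0.002 of compute per fetch, $0.01 per licensed page.
scrape_cost = cost_per_useful_page(0.002, decoy_fraction=0.75)
threshold = break_even_decoy_fraction(0.002, license_price=0.01)
print(f"cost per real page with 75% decoys: ${scrape_cost:.3f}")
print(f"decoy share where licensing breaks even: {threshold:.0%}")
```

Under these made-up numbers, scraping stays cheaper than licensing until about four in five fetched pages are decoys. Real costs will differ, but the shape of the trade-off is the same.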
Cloudflare Isn't the Only One Doing This
Cloudflare gets most of the press because it serves the most websites, but it's far from the only player. Akamai, which leads the industry by revenue at around $4.2 billion annually and primarily serves large enterprise clients, has built its own AI scraper management tools. Its Bot Manager product lets site owners block AI bots outright, or require them to authenticate, agree to licensing terms, or pay per request before accessing any content.
Akamai tracked a steady rise in verified AI bot traffic throughout 2025, starting primarily in e-commerce and retail. In those categories, pricing and inventory data change constantly, which gives AI systems a reason to scrape repeatedly rather than just once. That pattern has since spread across industries. The scale is already in the billions of requests per day, and it continues to grow.
The Toll Road Model
The more interesting development isn't blocking — it's monetizing. Several companies are now positioning themselves as the middleman between AI systems and the web's content owners.
The idea works like this: instead of a bot sneaking through a side door for free, it goes through an authenticated checkpoint. The checkpoint verifies who the bot is and what company it represents, then charges accordingly. The site owner gets a cut. The AI company gets clean, verified data it can actually trust. The infrastructure provider takes a fee for running the tollbooth.
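A minimal sketch of that checkpoint logic follows. It is not any vendor's implementation: the price, the fee share, the prepaid-balance model, and every name in it are assumptions made for illustration. Amounts are integers (millionths of a dollar) to avoid floating-point rounding.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PRICE_PER_REQUEST = 1000   # hypothetical price, in millionths of a dollar
PROVIDER_FEE_PERCENT = 20  # hypothetical cut kept by the infrastructure provider


@dataclass
class BotRequest:
    operator: Optional[str]  # verified company identity, or None if unverified
    path: str


@dataclass
class Decision:
    status: int              # HTTP status the checkpoint would return
    owner_credit: int = 0    # amount credited to the site owner
    provider_fee: int = 0    # amount kept by the tollbooth operator


def checkpoint(request: BotRequest, balances: Dict[str, int]) -> Decision:
    """Verify identity, charge the operator, and split the payment."""
    if request.operator is None:
        return Decision(status=403)  # unidentified bot: refuse
    if balances.get(request.operator, 0) < PRICE_PER_REQUEST:
        return Decision(status=402)  # identified but cannot pay
    balances[request.operator] -= PRICE_PER_REQUEST
    fee = PRICE_PER_REQUEST * PROVIDER_FEE_PERCENT // 100
    return Decision(status=200, owner_credit=PRICE_PER_REQUEST - fee, provider_fee=fee)


balances = {"example-ai": 1500}  # hypothetical prepaid balance
print(checkpoint(BotRequest("example-ai", "/article"), balances))
print(checkpoint(BotRequest("example-ai", "/article"), balances))
print(checkpoint(BotRequest(None, "/article"), balances))
```

The second call fails with 402 (Payment Required) because the balance no longer covers the price, and the anonymous request is refused with 403.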
New protocols are emerging to make this work at scale. Proposals with names like Know Your Agent (KYA) and Web Bot Auth are being developed to give AI bots cryptographic identities, essentially passports, so they can be verified and billed automatically without any human in the loop. It's the same concept as Know Your Customer rules in banking, applied to software agents.
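To make the passport idea concrete, here is a simplified sketch of signing and verifying a request. It deliberately swaps in a shared-secret HMAC from Python's standard library as a stand-in; the real proposals rely on public-key signatures, so a site would hold only the bot's public key. The signature layout, function names, and freshness window are all assumptions for illustration, not any standard's wire format.

```python
import hashlib
import hmac


def signature_base(method: str, authority: str, path: str, created: int) -> bytes:
    """Canonical string covering the parts of the request being vouched for."""
    return f"{method.upper()} {authority}{path} created={created}".encode()


def sign_request(key: bytes, method: str, authority: str, path: str, created: int) -> str:
    """Bot side: produce a signature the checkpoint can check."""
    base = signature_base(method, authority, path, created)
    return hmac.new(key, base, hashlib.sha256).hexdigest()


def verify_request(key: bytes, method: str, authority: str, path: str,
                   created: int, signature: str, now: int, max_age: int = 300) -> bool:
    """Site side: reject stale timestamps, then compare in constant time."""
    if not 0 <= now - created <= max_age:
        return False  # too old or dated in the future: possible replay
    expected = sign_request(key, method, authority, path, created)
    return hmac.compare_digest(expected, signature)


key = b"demo-key-for-example-ai"  # hypothetical secret registered for one bot
sig = sign_request(key, "GET", "example.com", "/article", created=1000)
print(verify_request(key, "GET", "example.com", "/article", 1000, sig, now=1100))
print(verify_request(key, "GET", "example.com", "/other", 1000, sig, now=1100))
```

Binding the signature to the method, host, path, and a timestamp means a captured signature cannot be reused for a different page or replayed long after it was issued.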
What Reddit and News Publishers Already Know
Some of the largest content platforms didn't wait for CDN infrastructure to catch up. Reddit, the Associated Press, and several major news organizations have already struck direct licensing deals with AI companies, charging for access to their archives as training data. The amounts vary, but the principle is the same: content has value, and that value is now being negotiated rather than simply taken.
The challenge for everyone else — smaller publishers, independent sites, niche communities — is that they don't have the leverage to negotiate directly. That's exactly the gap the toll road model is designed to fill. If the infrastructure layer handles authentication and payment automatically, a blog with 10,000 monthly readers can participate in the same system as a major newspaper.
Who Hasn't Figured This Out Yet
The honest answer is: most of the web. The majority of site owners have no visibility into how much AI traffic they're receiving, which companies are sending it, or what that traffic is being used for. The major AI crawlers do identify themselves in the User-Agent header of each request (OpenAI, Anthropic, Meta, and Google bots generally play by the rules), but smaller or less scrupulous operations don't. And even the well-behaved ones aren't paying.
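For site owners who want a first look at self-identified AI traffic, a rough tally of User-Agent strings from server logs is a reasonable start. The sketch below is illustrative only: the token list is an assumption based on commonly published crawler names and should be checked against each vendor's current documentation, and it cannot see bots that disguise themselves as browsers.

```python
from collections import Counter
from typing import Iterable

# Assumed User-Agent tokens mapped to operators. Illustrative, not exhaustive.
KNOWN_AI_AGENTS = {
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "meta-externalagent": "Meta",
    "CCBot": "Common Crawl",
    "PerplexityBot": "Perplexity",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the operator for a self-identified AI crawler, else 'other'."""
    lowered = user_agent.lower()
    for token, operator in KNOWN_AI_AGENTS.items():
        if token.lower() in lowered:
            return operator
    return "other"  # browsers, search crawlers, and bots that hide their identity


def summarize(user_agents: Iterable[str]) -> Counter:
    """Count requests per operator from an iterable of User-Agent strings."""
    return Counter(classify_user_agent(ua) for ua in user_agents)


sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
]
print(summarize(sample))
```

Anything landing in the "other" bucket still needs behavioral analysis, which is where bot-management tooling comes in.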
Tools to measure, manage, and eventually monetize that traffic are still early. The technical standards are being written right now. The business models are being tested. But the direction is clear: the era of free content for AI training is ending, slowly and then all at once.
Why This Matters Beyond the Big Players
The bot tax conversation tends to focus on Cloudflare versus OpenAI, or publishers versus Google. But the downstream effect touches anyone who creates content on the web. If infrastructure-level monetization of AI traffic becomes standard, it changes the economics of running a website, much as display advertising and affiliate links did in earlier eras of the web.
It won't happen overnight. The standards need to stabilize, the business models need to prove out, and AI companies need enough incentive to participate rather than route around the system. But the pieces are moving into place faster than most people realize, and the companies building the tollbooths are already some of the largest infrastructure providers on the internet.
The web gave AI its education for free. The invoice is being written.