If you’re in the business of publishing content on the internet, it’s been difficult to know how to deal with AI. Obviously, you can’t ignore it; large language models (LLMs) and AI search engines are here, and they ingest your content and summarize it for their users, killing valuable traffic to your site. Plenty of data supports this.
Creating a content strategy that accounts for this changing reality is complex to begin with. You need to decide what content to expose to AI systems, what to block from them, and how both of those activities can serve your business.
That would be hard enough even if there were clear rules that everyone agreed to operate under. But that’s far from a given in the AI world. A topic I’ve revisited more than once is how tech and media view some aspects of the ecosystem differently (most notably, user agents), leading to new industry alliances, myriad lawsuits, and several angry blog posts. Even accounting for all that, a pair of recent reports suggests the two sides are even further apart than you might think.
Common Crawl and the copyright clash
Common Crawl is a vast trove of internet data that many AI systems use for training. It was a fundamental part of the training data for GPT-3.5, the model that powered ChatGPT when it was released to the world back in 2022, and many other LLMs have been trained on it as well. Over the past three years, however, copyright in training data has become a major source of controversy, and several publishers have asked Common Crawl to delete their content from its archive to prevent AI models from training on it.
A report from The Atlantic suggests that Common Crawl hasn’t complied, keeping the content in the archive while making it invisible to its online search tool—meaning any spot checks would come up empty. Common Crawl’s executive director, Rich Skrenta, told the publication that it complies with removal requests, but he also clearly supports the point of view that anything online should be fair game for training LLMs, saying, “You shouldn’t have put your content on the internet if you didn’t want it to be on the internet.”
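Once content sits in an archive you don’t control, there’s little a publisher can do beyond asking; the more reliable move is to stop feeding the archive in the first place. Common Crawl’s crawler identifies itself as CCBot, so it can be refused in robots.txt or directly at the server. Below is a minimal sketch of the server-side approach, assuming a Node server built on Express; the user-agent list is illustrative, and any real list should be checked against each crawler’s current documentation.

import express, { NextFunction, Request, Response } from "express";

// Illustrative list of AI-training crawlers to refuse. CCBot is Common
// Crawl's published user agent; verify other names before relying on them.
const BLOCKED_CRAWLERS = ["CCBot", "GPTBot"];

// Middleware that returns 403 to any request whose User-Agent header
// matches a blocked crawler. User agents can be spoofed, so this stops
// honest crawlers, not determined ones.
function blockAiCrawlers(req: Request, res: Response, next: NextFunction) {
  const ua = req.get("User-Agent") ?? "";
  if (BLOCKED_CRAWLERS.some((bot) => ua.includes(bot))) {
    res.status(403).send("Automated crawling is not permitted.");
    return;
  }
  next();
}

const app = express();
app.use(blockAiCrawlers);
app.get("/", (_req, res) => res.send("Welcome, human readers."));
app.listen(3000);

The obvious limitation is that this only works going forward, which is exactly why the deletion question matters so much.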
Separately, Columbia Journalism Review (CJR) looked at how the new AI-powered browsers, Perplexity Comet and ChatGPT Atlas, handle requests for paywalled content. The report notes that when asked to retrieve a subscriber-only article from MIT Technology Review, both browsers complied, even though the web-based chatbots from the same companies refuse to fetch the article because it’s paywalled.
The details of both cases are important, but together they underscore just how far apart the media and tech industries’ perspectives are. The tech side tilts toward more access: if information is digital and findable on the internet, AI systems will default to obtaining it by any means necessary. Publishers, meanwhile, assert that their content still belongs to them regardless of where and how it’s published, and that they should retain control over who can access it and what they can do with it.
The mental divide between AI and media
There’s more happening here than just two debaters arguing past each other, though. The case of Common Crawl exposes a contradiction in a key talking point on the tech side: that no particular piece of content or source in an LLM’s training data is all that relevant, and the model makers could easily do without it. It’s hard to reconcile that with Common Crawl’s apparent actions, which risk costly lawsuits by declining to delete data from publications that request it, a group that includes The New York Times, Reuters, and The Washington Post. When it comes to training data, some sources are clearly more valuable than others.
The browsers that circumvent paywalls reveal another flawed assumption on the AI side: that because certain behaviors are permitted on an individual basis, they should be permitted at scale. The most common argument built on this logic is the claim that when AI “learns” from all the information it ingests, it’s just doing what humans do.
But a change in scale can also create a category shift. Think about how paywalls typically work: Many are deliberately porous, allowing a limited number of free articles per day, week, or month. Once those are exhausted, there’s the old trick of the incognito window. Also, some paywalls, as noted in the CJR article, work by loading all the text on the page, then pulling down a curtain so the reader can’t see it. Sometimes, if you click the “Stop loading” button fast enough, you can expose the text before that curtain comes down. One level up from there is to use your browser’s built-in developer tools to disable and delete the paywall elements on an article page.
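To make that concrete, here’s a minimal sketch of the client-side “curtain” pattern the CJR piece describes, written in TypeScript with illustrative names. The key point: the full article text is already in the page, and the script only hides it, which is exactly why developer tools or a well-timed “Stop loading” click can defeat it.

// Minimal sketch of a client-side "curtain" paywall (illustrative names).
// The article text already shipped to the browser; this script only hides it.
function dropCurtain(): void {
  const overlay = document.createElement("div");
  overlay.id = "paywall-overlay"; // the element a savvy reader deletes
  overlay.textContent = "Subscribe to keep reading";
  Object.assign(overlay.style, {
    position: "fixed",
    inset: "0",
    background: "white",
    zIndex: "9999",
  });
  document.body.appendChild(overlay);

  // Hide, don't remove: the full text remains in the page source.
  const article = document.querySelector<HTMLElement>("article");
  if (article) article.style.visibility = "hidden";
}

// The curtain drops only after the page (text included) finishes loading,
// which is why interrupting the load can leave the text exposed.
window.addEventListener("load", dropCurtain);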
Savvy internet users have known about all of these tricks for years, but they’re a small percentage of all users (I’d wager less than 5%). But guess who knows every one of these tricks, and probably many more on top of them? AI. Browser agents like those in Comet and Atlas are effectively the savviest internet users possible, and they grant those powers to anyone who simply asks for information. What was once a niche activity is now applied at scale, and paywalls effectively disappear for anyone using an AI browser. One defense here might be server-side paywalls, which grant access to the text only after the reader logs in.
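A server-side paywall closes that gap by never sending the text to an unauthenticated reader at all, so there’s nothing on the page for an agent, an incognito window, or developer tools to uncover. Here’s a minimal sketch, again assuming an Express-style server; the subscription check and article store are hypothetical stand-ins for a real session system and database.

import express, { Request, Response } from "express";

// Hypothetical stand-ins for a real subscription check and article store.
function isSubscriber(req: Request): boolean {
  return req.get("Authorization") === "Bearer valid-subscriber-token";
}
function loadArticleBody(slug: string): string {
  return `Full text of "${slug}" goes here.`;
}

const app = express();

app.get("/articles/:slug", (req: Request, res: Response) => {
  if (!isSubscriber(req)) {
    // The full text never leaves the server, so there is no curtain for
    // a browser agent or a fast "Stop loading" click to lift.
    res.status(402).send("Subscribe to read this article.");
    return;
  }
  res.send(loadArticleBody(req.params.slug));
});

app.listen(3000);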
Regardless, what the browser does with the data after the AI ingests it is yet another access question. OpenAI says it won’t train on any pages that Atlas’s agent may access, and indeed this is how user agents are supposed to work, though the company does say it will retain the pages for the individual user’s memory. That sounds benign enough, but considering how Common Crawl has behaved, should we be taking any AI company at its word?
Turning conflict into strategy
So what’s the takeaway for the media, besides investing in server-side paywalls? The good news is that your content is more valuable than you’ve been told. If it weren’t, there wouldn’t be so much effort to find it, ingest it, and claim it to be “free.” The bad news is that maintaining control over that content is going to be much harder than you probably thought. Understanding and managing how AI uses your content for training, summaries, or agents is a complicated business, requiring more than just techniques and code. You also need to take into account the mindset of those on the other side.
Turning all this into real strategy means deciding when to fight access, when to allow it, and when to demand compensation. Considering what a moving target AI is, that will never be easy, but if the AI companies’ aggressive, constant, and comprehensive push for more access has shown anything, it’s that they deeply value the media industry’s content. It’s nice to be needed, but success will depend on turning that need into leverage.