Hidden in Plain Sight: The Robots.txt Clues to Unreported AI Content Deals
CMO TLDR's investigation reveals potential content licensing agreements between publishers and AI companies, raising questions about transparency in the AI era
Three Undisclosed Content Deals; Where There’s Smoke, There’s Likely Fire

In the race to train ever more sophisticated AI models, a quiet battle is being waged behind the scenes. Clues to potential content licensing agreements between publishers and AI giants are hidden in plain sight, buried within robots.txt files.
To identify potential content licensing agreements that have not been publicly disclosed, CMO TLDR analyzed the robots.txt files of the top 1,025 domains, as ranked by SEMrush and Wikipedia. A robots.txt file tells web crawlers which areas of a website they may and may not access.
Changes in robots.txt directives, such as a crawler block that appears and later disappears, may point to undisclosed content licensing agreements. None of the cases below has been confirmed, but each warrants further investigation.
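To make the mechanism concrete, here is a minimal sketch of how such a check could be automated with Python's standard-library `urllib.robotparser`. It is not the tooling CMO TLDR used. The helper name `blocks_agent` and both sample files are hypothetical, and the sample contents are invented for illustration rather than copied from any real site.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, user_agent: str = "GPTBot", path: str = "/") -> bool:
    """Return True if the given robots.txt text disallows `user_agent` from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


# Hypothetical robots.txt that blocks OpenAI's crawler from the whole site.
SAMPLE_BLOCKING = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical robots.txt with no crawler-specific block.
SAMPLE_OPEN = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(blocks_agent(SAMPLE_BLOCKING))               # GPTBot is blocked
    print(blocks_agent(SAMPLE_BLOCKING, "Googlebot"))  # other crawlers are not
    print(blocks_agent(SAMPLE_OPEN))                   # no GPTBot block at all
```

A check like this only reports what the file says on a given day. It cannot say why a directive was added or removed.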
The Guardian and OpenAI
One intriguing potential partnership involves The Guardian and OpenAI. The Guardian publicly reported blocking OpenAI's web crawlers from accessing its content as of September 1, 2023, in its article "The Guardian blocks ChatGPT owner OpenAI from trawling its content."
According to the Wayback Machine, the block appeared in The Guardian's robots.txt file starting September 1, 2023. As of December 3, 2024, however, the file no longer contains the GPTBot disallow directive that blocked OpenAI's web crawler. This remains the case as of February 3, 2025.
While the removal of the GPTBot block potentially suggests an agreement between the two organizations, neither party has publicly confirmed any such arrangement. The change nonetheless raises questions about potential content licensing partnerships and warrants further investigation.
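The timeline comparison described above can be sketched in code as well. The example below assumes the archived robots.txt texts have already been retrieved (for instance from the Wayback Machine) and are held in a dictionary keyed by capture date. The function name `block_transitions`, the snapshot contents, and the August 2023 capture date are hypothetical placeholders, not archived data.

```python
from datetime import date
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, user_agent: str = "GPTBot") -> bool:
    """Return True if the robots.txt text disallows `user_agent` from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def block_transitions(snapshots: dict, user_agent: str = "GPTBot") -> list:
    """List the capture dates on which a block on `user_agent` appears or disappears.

    `snapshots` maps a capture date to the robots.txt text archived on that date.
    """
    transitions = []
    previous = None
    for captured in sorted(snapshots):
        blocked = blocks_agent(snapshots[captured], user_agent)
        if previous is not None and blocked != previous:
            transitions.append((captured, "block added" if blocked else "block removed"))
        previous = blocked
    return transitions


# Hypothetical snapshot texts; only the pattern (no block, block, no block) matters.
NO_BLOCK = "User-agent: *\nDisallow: /private/\n"
WITH_BLOCK = "User-agent: GPTBot\nDisallow: /\n\n" + NO_BLOCK

EXAMPLE_SNAPSHOTS = {
    date(2023, 8, 1): NO_BLOCK,     # assumed earlier capture, for illustration
    date(2023, 9, 1): WITH_BLOCK,
    date(2024, 12, 3): NO_BLOCK,
}

if __name__ == "__main__":
    for captured, change in block_transitions(EXAMPLE_SNAPSHOTS):
        print(captured.isoformat(), change)
```

Because archive captures are irregular, a transition found this way only narrows the change down to the interval between two captures, not to an exact day.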
Thomson Reuters and OpenAI
Thomson Reuters, a major news and information provider, presents another notable case. While the company has publicly acknowledged partnerships with Microsoft and Meta, no similar announcement has been made regarding OpenAI.
A review of Thomson Reuters' robots.txt file history, however, shows that on September 26, 2023, Reuters' robots.txt file disallowed OpenAI from crawling Reuters' content. The directive blocking OpenAI was gone from the September 27, 2023 version of the file, and it has not been restored since.
While the removal could signal an agreement between Thomson Reuters and OpenAI, neither organization has publicly commented on a potential partnership.
Tumblr (Automattic) and OpenAI
Tumblr, the social blogging platform, also exhibits a curious robots.txt timeline. On February 4, 2024, Tumblr's robots.txt file explicitly disallowed access to GPTBot, OpenAI's web crawler.
The very next day, February 5, 2024, this directive was removed, as seen in the Wayback Machine archive. The change mirrors what happened at The Guardian and Thomson Reuters. Rumors of a potential deal between Tumblr and OpenAI, specifically involving training data, have circulated, as reported by 404 Media and The Verge, but neither party has officially confirmed one. The presence and subsequent absence of the GPTBot block in Tumblr's robots.txt file, coupled with these unconfirmed reports, raise questions about potential data licensing agreements and underscore the need for greater transparency in the AI training data landscape.
The Secrecy Surrounding AI Content Deals: Why the Silence?
The reluctance to disclose these partnerships raises several concerns. Are publishers worried about the optics of selling their content to AI companies? Are they trying to avoid scrutiny about how their content is being used? The lack of transparency makes it difficult to assess the long-term impact of these deals on the media ecosystem.
CMO TLDR will continue to investigate this issue and report on any new developments.
All Publicly Disclosed AI Publisher Content Deals
CMO TLDR has compiled the most comprehensive list available of publicly disclosed partnerships between publishers and AI companies, current as of February 3, 2025, where the AI company compensates the publisher for access to content. Our research, drawing upon multiple sources, focuses specifically on these content licensing agreements. Announcements of broader partnerships, without explicit mention of content licensing or compensation, have been excluded.
AI Company | Date | Publisher |
---|---|---|
Meta | 1/12/2023 | Shutterstock |
OpenAI | 7/11/2023 | Shutterstock |
OpenAI | 7/13/2023 | Associated Press |
Runway | 12/4/2023 | Getty Images |
OpenAI | 12/13/2023 | Axel Springer |
Microsoft | 2/5/2024 | Semafor |
| | 2/22/2024 | |
| | 2/29/2024 | Stack Overflow |
Confidential | 3/11/2024 | Thomson Reuters |
OpenAI | 3/13/2024 | Le Monde |
OpenAI | 3/13/2024 | Prisa Media |
Unspecified (at least 2 firms) | 4/5/2024 | Freepik/EyeEm |
Microsoft | 4/29/2024 | Axel Springer |
OpenAI | 4/29/2024 | Financial Times |
OpenAI | 5/6/2024 | Stack Overflow |
OpenAI | 5/7/2024 | Dotdash Meredith |
Microsoft | 5/8/2024 | Informa |
OpenAI | 5/16/2024 | |
OpenAI | 5/22/2024 | News Corp |
OpenAI | 5/29/2024 | The Atlantic |
OpenAI | 5/29/2024 | Vox Media |
Reka AI | 6/4/2024 | Shutterstock |
Picsart | 6/13/2024 | Getty Images |
OpenAI | 6/27/2024 | Time |
Perplexity | 7/30/2024 | Automattic |
Perplexity | 7/30/2024 | Der Spiegel |
Perplexity | 7/30/2024 | Entrepreneur |
Perplexity | 7/30/2024 | Fortune |
Perplexity | 7/30/2024 | The Texas Tribune |
Perplexity | 7/30/2024 | Time |
ProRata.ai | 8/6/2024 | Axel Springer |
ProRata.ai | 8/6/2024 | Financial Times |
ProRata.ai | 8/6/2024 | Universal Music Group |
ProRata.ai | 8/6/2024 | Fortune |
ProRata.ai | 8/6/2024 | The Atlantic |
OpenAI | 8/20/2024 | Condé Nast |
Microsoft | 10/1/2024 | Axel Springer |
Microsoft | 10/1/2024 | Financial Times |
Microsoft | 10/1/2024 | Hearst Magazines |
Microsoft | 10/1/2024 | Thomson Reuters |
Microsoft | 10/1/2024 | USA Today Network |
OpenAI | 10/8/2024 | Hearst Magazines |
Meta | 10/25/2024 | Thomson Reuters |
ProRata.ai | 11/20/2024 | DMG Media |
ProRata.ai | 11/20/2024 | Guardian Media Group |
ProRata.ai | 11/20/2024 | Prospect Magazine |
ProRata.ai | 11/20/2024 | Sky News |
Perplexity | 12/5/2024 | Adweek |
Perplexity | 12/5/2024 | Blavity |
Perplexity | 12/5/2024 | DPReview |
OpenAI | 12/5/2024 | Future plc |
Perplexity | 12/5/2024 | Gear Patrol |
Perplexity | 12/5/2024 | Lee Enterprises |
Perplexity | 12/5/2024 | Los Angeles Times |
Perplexity | 12/5/2024 | Medialab |
Perplexity | 12/5/2024 | Mexico News Daily |
Perplexity | 12/5/2024 | Minkabu Infonoid |
Perplexity | 12/5/2024 | Newspicks |
Perplexity | 12/5/2024 | Prisa Media |
Perplexity | 12/5/2024 | RTL Germany (NTV) |
Perplexity | 12/5/2024 | RTL Germany (Stern) |
Perplexity | 12/5/2024 | The Independent |
Perplexity | 12/5/2024 | World History Encyclopedia |
ProRata.ai | 12/9/2024 | Lee Enterprises |
OpenAI | 1/15/2025 | Axios |
Mistral | 1/16/2025 | Agence France-Presse |
| | 1/16/2025 | Associated Press |