Hidden in Plain Sight: The Robots.txt Clues to Unreported AI Content Deals

CMO TLDR's investigation reveals potential content licensing agreements between publishers and AI companies, raising questions about transparency in the AI era

Three Undisclosed Content Deals; Where There’s Smoke, There’s Likely Fire

In the race to train ever more sophisticated AI models, a quiet battle is being waged behind the scenes. Clues to potential content licensing agreements between publishers and AI giants are hidden in plain sight, buried within robots.txt files.

To identify potential content licensing agreements that have not been publicly disclosed, CMO TLDR analyzed the robots.txt files of the top 1,025 domains, as ranked by SEMrush and Wikipedia. A robots.txt file tells web crawlers which areas of a website they may access and which are off limits.
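To illustrate how a robots.txt directive applies to a specific crawler, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The sample file content and the helper name `is_crawler_blocked` are assumptions made for illustration; this is not CMO TLDR's actual tooling, and the sample is not taken from any publisher's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_crawler_blocked(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching path."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


print(is_crawler_blocked(SAMPLE_ROBOTS_TXT, "GPTBot"))     # True
print(is_crawler_blocked(SAMPLE_ROBOTS_TXT, "Googlebot"))  # False
```

In this sample, GPTBot is barred from the whole site, while other crawlers are restricted only from the `/admin/` path.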

In three cases, a directive blocking OpenAI's crawler appeared in a publisher's robots.txt file and was later removed. Such changes may point to undisclosed content licensing agreements. None has been confirmed, but each warrants further investigation.

The Guardian and OpenAI

One intriguing potential partnership involves The Guardian and OpenAI. The Guardian publicly reported blocking OpenAI's web crawlers from accessing its content as of September 1, 2023, in its article "The Guardian blocks ChatGPT owner OpenAI from trawling its content."

This block was reflected in The Guardian's robots.txt file starting September 1, 2023, according to the Wayback Machine. The Guardian's robots.txt file, as of December 3, 2024, no longer contains the directive (GPTBot disallow) blocking OpenAI's web crawler. This remains the case as of February 3, 2025.

While the removal of GPTBot potentially suggests an agreement between the two organizations, neither party has publicly confirmed any such arrangement. The change, nonetheless, raises questions about potential content licensing partnerships and warrants further investigation.
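The snapshot comparison described above can be sketched as follows. This is a rough illustration under stated assumptions, not the method CMO TLDR used: the function names are hypothetical, the snapshot contents are invented placeholders, and only the two dates come from the timeline in the text. Archived robots.txt text would have to be retrieved separately, for example from the Wayback Machine.

```python
from urllib.robotparser import RobotFileParser


def blocks(robots_txt: str, user_agent: str) -> bool:
    """Return True if robots_txt disallows user_agent from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def find_block_changes(snapshots, user_agent="GPTBot"):
    """Report each date on which the crawler's blocked status flipped.

    snapshots is a list of (iso_date, robots_txt_text) pairs.
    """
    changes = []
    previous = None
    # ISO dates sort chronologically as plain strings.
    for date, text in sorted(snapshots):
        current = blocks(text, user_agent)
        if previous is not None and current != previous:
            changes.append((date, "blocked" if current else "unblocked"))
        previous = current
    return changes


# Placeholder contents (assumptions); dates follow the timeline in the text.
SNAPSHOTS = [
    ("2023-09-01", "User-agent: GPTBot\nDisallow: /\n"),
    ("2024-12-03", "User-agent: *\nDisallow: /private/\n"),
]

print(find_block_changes(SNAPSHOTS))  # [('2024-12-03', 'unblocked')]
```

A flip from blocked to unblocked is only a signal worth following up. It does not show why the directive was removed.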

Thomson Reuters and OpenAI

Thomson Reuters, a major news and information provider, presents another notable case. While the company has publicly acknowledged partnerships with Microsoft and Meta, no similar announcement has been made regarding OpenAI.

A review of Reuters' robots.txt history, however, shows that on September 26, 2023, the file disallowed OpenAI from crawling Reuters' content. The directive was gone from the September 27, 2023 version of the file, and it has not been reinstated since.

While this change could signal an agreement between Thomson Reuters and OpenAI, neither organization has publicly commented on a potential partnership.

Tumblr (Automattic) and OpenAI

Tumblr, the social blogging platform, also exhibits a curious robots.txt timeline. On February 4, 2024, Tumblr's robots.txt file explicitly disallowed access to GPTBot, OpenAI's web crawler.

The very next day, February 5, 2024, this directive was removed, as seen in the Wayback Machine archive. This change mirrors what we see from The Guardian and Thomson Reuters. Rumors of a potential deal between Tumblr and OpenAI, specifically involving training data, have circulated, as reported by 404 Media and The Verge, but neither party has officially confirmed one. The appearance and subsequent removal of the GPTBot block in Tumblr's robots.txt file, coupled with these unconfirmed reports, raise questions about potential data licensing agreements and underscore the need for greater transparency in the AI training data landscape.

The Secrecy Surrounding AI Content Deals: Why the Silence?

The reluctance to disclose these partnerships raises several concerns. Are publishers worried about the optics of selling their content to AI companies? Are they trying to avoid scrutiny about how their content is being used? The lack of transparency makes it difficult to assess the long-term impact of these deals on the media ecosystem.

CMO TLDR will continue to investigate this issue and report on any new developments.

All Publicly Disclosed AI Publisher Content Deals

CMO TLDR has compiled the most comprehensive list available of publicly disclosed partnerships between publishers and AI companies, current as of February 3, 2025, where the AI company compensates the publisher for access to content. Our research, drawing upon multiple sources, focuses specifically on these content licensing agreements. Announcements of broader partnerships, without explicit mention of content licensing or compensation, have been excluded.

| AI Company | Date | Publisher |
| --- | --- | --- |
| Meta | 1/12/2023 | Shutterstock |
| OpenAI | 7/11/2023 | Shutterstock |
| OpenAI | 7/13/2023 | Associated Press |
| Runway | 12/4/2023 | Getty Images |
| OpenAI | 12/13/2023 | Axel Springer |
| Microsoft | 2/5/2024 | Semafor |
| Google | 2/22/2024 | Reddit |
| Google | 2/29/2024 | Stack Overflow |
| Confidential | 3/11/2024 | Thomson Reuters |
| OpenAI | 3/13/2024 | Le Monde |
| OpenAI | 3/13/2024 | Prisa Media |
| Unspecified (at least 2 firms) | 4/5/2024 | Freepik/EyeEm |
| Microsoft | 4/29/2024 | Axel Springer |
| OpenAI | 4/29/2024 | Financial Times |
| OpenAI | 5/6/2024 | Stack Overflow |
| OpenAI | 5/7/2024 | Dotdash Meredith |
| Microsoft | 5/8/2024 | Informa |
| OpenAI | 5/16/2024 | Reddit |
| OpenAI | 5/22/2024 | News Corp |
| OpenAI | 5/29/2024 | The Atlantic |
| OpenAI | 5/29/2024 | Vox Media |
| Reka AI | 6/4/2024 | Shutterstock |
| Picsart | 6/13/2024 | Getty Images |
| OpenAI | 6/27/2024 | Time |
| Perplexity | 7/30/2024 | Automattic |
| Perplexity | 7/30/2024 | Der Spiegel |
| Perplexity | 7/30/2024 | Entrepreneur |
| Perplexity | 7/30/2024 | Fortune |
| Perplexity | 7/30/2024 | The Texas Tribune |
| Perplexity | 7/30/2024 | Time |
| ProRata.ai | 8/6/2024 | Axel Springer |
| ProRata.ai | 8/6/2024 | Financial Times |
| ProRata.ai | 8/6/2024 | Universal Music Group |
| ProRata.ai | 8/6/2024 | Fortune |
| ProRata.ai | 8/6/2024 | The Atlantic |
| OpenAI | 8/20/2024 | Condé Nast |
| Microsoft | 10/1/2024 | Axel Springer |
| Microsoft | 10/1/2024 | Financial Times |
| Microsoft | 10/1/2024 | Hearst Magazines |
| Microsoft | 10/1/2024 | Thomson Reuters |
| Microsoft | 10/1/2024 | USA Today Network |
| OpenAI | 10/8/2024 | Hearst Magazines |
| Meta | 10/25/2024 | Thomson Reuters |
| ProRata.ai | 11/20/2024 | DMG Media |
| ProRata.ai | 11/20/2024 | Guardian Media Group |
| ProRata.ai | 11/20/2024 | Prospect Magazine |
| ProRata.ai | 11/20/2024 | Sky News |
| Perplexity | 12/5/2024 | Adweek |
| Perplexity | 12/5/2024 | Blavity |
| Perplexity | 12/5/2024 | DPReview |
| OpenAI | 12/5/2024 | Future plc |
| Perplexity | 12/5/2024 | Gear Patrol |
| Perplexity | 12/5/2024 | Lee Enterprises |
| Perplexity | 12/5/2024 | Los Angeles Times |
| Perplexity | 12/5/2024 | Medialab |
| Perplexity | 12/5/2024 | Mexico News Daily |
| Perplexity | 12/5/2024 | Minkabu Infonoid |
| Perplexity | 12/5/2024 | Newspicks |
| Perplexity | 12/5/2024 | Prisa Media |
| Perplexity | 12/5/2024 | RTL Germany (NTV) |
| Perplexity | 12/5/2024 | RTL Germany (Stern) |
| Perplexity | 12/5/2024 | The Independent |
| Perplexity | 12/5/2024 | World History Encyclopedia |
| ProRata.ai | 12/9/2024 | Lee Enterprises |
| OpenAI | 1/15/2025 | Axios |
| Mistral | 1/16/2025 | Agence France-Presse |
| Google | 1/16/2025 | Associated Press |