From CNN to BuzzFeed, we analyzed 200 major news publishers in the US and found that half of them are explicitly blocking OpenAI’s web crawler, not to mention the paywalls that limit access. In comparison, 26% are blocking CommonCrawl (which serves as the main pre-training dataset for many large language models), 11% are blocking OpenAI’s user browsing, and barely 2% are blocking Anthropic AI.
AI, when trained on up-to-date news, can produce more informed and accurate predictions and analyses. However, if access to training data is blocked, AI risks becoming outdated, hindering its potential to deliver relevant insights and solutions. Balancing these aspects with the legal implications of using copyrighted material poses a significant challenge.
This also directly impacts product managers in maintaining the relevancy and accuracy of AI-driven products. Navigating through legal, ethical, and strategic challenges, PMs must innovate, form alternative data partnerships, and communicate transparently with stakeholders to uphold product quality and user experience amidst constraints.
These numbers emerge from our analysis of the robots.txt files from approximately 200 news websites in October 2023. The technique to disallow chatbots is explain in OpenAI's documentation for their bots: https://platform.openai.com/docs/gptbot