Robots.txt is a plain-text file, placed at a website's root, that tells web crawlers which pages or sections they may crawl and which they should stay out of. Its main purpose is to manage crawler traffic, keep servers from being overloaded, and steer search engine bots toward important content while keeping them away from duplicate or sensitive pages. The protocol is advisory: the file is a published request, not a technical barrier, so honoring it is entirely up to the crawler's operator.
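For illustration, a minimal robots.txt served from a site's root (for example, https://example.com/robots.txt) might look like the following; the paths, the crawl-delay value, and the sitemap URL are placeholders rather than rules from any real site:

    # Applies to every crawler that honors the protocol
    User-agent: *
    # Keep crawlers out of duplicate or sensitive sections
    Disallow: /search/
    Disallow: /private/
    # Non-standard directive that some crawlers honor to throttle request rates
    Crawl-delay: 10
    # Point crawlers at the pages the site does want crawled
    Sitemap: https://example.com/sitemap.xml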
OpenAI and Anthropic are allegedly bypassing robots.txt, retrieving or scraping content from websites and pages even when the file tells their crawlers to stay away. Doing so contradicts both companies' public statements that they respect robots.txt and honor blocks aimed at their specific web crawlers, GPTBot and ClaudeBot.
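In practice, respecting robots.txt means a crawler checks the file before each request and skips any URL it disallows. Below is a minimal sketch of such a check using Python's standard urllib.robotparser module, run against a hypothetical robots.txt that blocks GPTBot and ClaudeBot site-wide; the example site, URL, and the SomeOtherBot name are invented for illustration:

    from urllib import robotparser

    # Hypothetical robots.txt that blocks OpenAI's and Anthropic's
    # crawlers site-wide while leaving the site open to everyone else.
    robots_lines = [
        "User-agent: GPTBot",
        "Disallow: /",
        "",
        "User-agent: ClaudeBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow:",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)

    url = "https://example.com/articles/some-story"

    # A compliant crawler calls can_fetch() before requesting a page
    # and skips the page when the answer is False.
    print(parser.can_fetch("GPTBot", url))        # False: blocked site-wide
    print(parser.can_fetch("ClaudeBot", url))     # False: blocked site-wide
    print(parser.can_fetch("SomeOtherBot", url))  # True: no block applies

A real crawler would point RobotFileParser at the live file with set_url() and read() rather than hard-coding the rules, but the decision logic is the same.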
In a May blog post, OpenAI claimed that it takes web crawler permissions into account each time it trains a new model, and that it respects the robots.txt protocol and blocks directed at its web crawlers, such as GPTBot. However, it has since been reported that OpenAI and other AI companies are not respecting these blocks and are bypassing robots.txt to scrape website content for AI training data.