Anyone who wants to make their website accessible to AI systems needs to understand which bots actually show up. The answer is more complex than you might expect: Cloudflare, which processes around 20% of global web traffic, has identified over 40 documented AI crawlers - from major players like OpenAI and Anthropic to lesser-known systems operating in the background. On top of these come stealth crawlers that present themselves as regular browsers. This article provides a complete overview.
OpenAI operates three different bots with different purposes. GPTBot is the training crawler: it collects web content for training future GPT models. Those who block GPTBot in robots.txt prevent their content from flowing into future models - but have no direct influence on current ChatGPT answers. OAI-SearchBot is the real-time search crawler for ChatGPT's browsing functionality and SearchGPT; this bot is directly relevant for current visibility in ChatGPT answers. ChatGPT-User is the user agent that appears when ChatGPT actively fetches URLs during a conversation.

Anthropic (maker of Claude) operates ClaudeBot as its primary crawler and anthropic-ai as a secondary user agent. Both collect data for Claude training and for the retrieval functions of Claude.ai. Anthropic is transparent about its crawlers and publishes IP ranges for whitelisting.

Perplexity operates PerplexityBot as its main crawler and Perplexity-User as the agent for user-initiated page visits.

Google-Extended is not a separate crawler but a robots.txt control token through which site owners decide whether their content may be used for training Gemini and other Google AI models. Critically: Google-Extended does NOT affect Google search rankings and does NOT affect AI search answers. It controls AI training exclusively.

Microsoft's Bingbot is the classic Bing crawler, also used as a source of Copilot training data. Bytespider is the crawler from ByteDance (TikTok's parent company) - those who wish can block it explicitly without affecting other AI systems.
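The distinction between OpenAI's training and search crawlers matters in practice. Here is a minimal robots.txt sketch for a site that wants to opt out of GPT model training while remaining visible in ChatGPT's real-time search, using the user agent names listed above:

```
# Opt out of GPT model training
User-agent: GPTBot
Disallow: /

# Stay accessible to ChatGPT's real-time search
User-agent: OAI-SearchBot
Allow: /
```

The same pattern applies to any provider that separates its training crawler from its retrieval crawler.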
Not all AI crawlers identify themselves as such. According to Cloudflare data from 2025, between 5 and 8% of all AI-related crawling requests use fake user agents: they present themselves as regular browsers (Chrome, Firefox, Safari) even though they are automated crawlers. The best-known example is Perplexity. Investigative reporting, first by Wired in 2024, showed that Perplexity sometimes accesses websites via a headless Chrome browser that sends a normal browser identifier. For website operators this means that robots.txt rules based on user agent matching are bypassed by stealth crawlers. Those who want to block specific AI crawlers need IP-based blocking rules - but the IP ranges are not always publicly documented.

Why do AI companies do this? For practical reasons: many websites rely on JavaScript rendering, login walls, or anti-bot measures that only engage against known crawler user agents. A browser-based crawler gets through these barriers where a regular crawler fails. That is technically effective - but ethically and legally problematic, particularly with regard to terms of service.

For your own strategy: if you are aiming for AI visibility, this argues against aggressive bot-blocking measures. Focus on making access easier for desired crawlers rather than harder for unwanted ones - which is barely possible with stealth crawlers anyway.
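Where an operator does publish its IP ranges (as Anthropic does), those ranges can be used to check whether a request that claims to be a known crawler actually comes from that operator. A minimal sketch using Python's standard ipaddress module - the CIDR ranges below are placeholders from the RFC 5737 test networks, not real crawler addresses:

```python
import ipaddress

# Placeholder ranges only - substitute the ranges each provider
# actually documents. These CIDRs are RFC 5737 test networks.
DOCUMENTED_RANGES = {
    "ClaudeBot": [ipaddress.ip_network("192.0.2.0/24")],
    "GPTBot": [ipaddress.ip_network("198.51.100.0/24")],
}

def verify_crawler(claimed_bot: str, remote_ip: str) -> bool:
    """True only if the request IP falls inside the bot's documented ranges."""
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in net for net in DOCUMENTED_RANGES.get(claimed_bot, []))

# A request whose User-Agent says "ClaudeBot" but whose IP does not
# match is either misconfigured or spoofed.
print(verify_crawler("ClaudeBot", "192.0.2.17"))   # True
print(verify_crawler("ClaudeBot", "203.0.113.5"))  # False -> suspicious
```

Note that this only verifies self-identifying crawlers; stealth crawlers that send browser user agents cannot be caught this way.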
For most websites we recommend a selective-allow strategy: allow all legitimate AI crawlers except those for which there are specific reasons to block (e.g. ByteDance for political reasons, or training crawlers if you do not want to provide training data). A practical robots.txt configuration explicitly allows GPTBot, OAI-SearchBot, ClaudeBot, anthropic-ai, PerplexityBot, and Google-Extended, each with Allow: /, and blocks bots like Bytespider with Disallow: /. Important: allow AI crawlers on product pages, service descriptions, and public content pages, but continue to block internal areas: /checkout/, /account/, /admin/, /api/, and internal search result pages. This is in your interest (no internal search results in AI training data) and in the interest of crawlers (no crawl effort wasted on low-quality pages). A complete example follows at the end of this section.

A newer concept worth watching: pay-per-crawl. Cloudflare introduced 'AI Crawl Control' in 2025, a system that lets website operators permit AI crawlers access while charging a fee for it. The concept is still early-stage (most AI providers do not yet support it), but it shows the direction: web content is valuable training data, and the question of remuneration for content creators will be settled over the coming years through both regulation and market mechanisms. Those who keep careful crawler logs now will have a better starting position for later negotiations on content licensing models.
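Put together, the selective-allow configuration described above could look like this - a sketch; the exact path for internal search result pages (here /search/) is a placeholder that varies by site:

```
# Allow the major AI crawlers on public content
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
# ...but keep internal areas blocked
Disallow: /checkout/
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Disallow: /search/

# Block ByteDance's crawler entirely
User-agent: Bytespider
Disallow: /
```

Since the longest matching rule wins under the robots.txt standard (RFC 9309), the specific Disallow paths take precedence over the general Allow: / within the same group.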
The AI crawler landscape is more confusing than most website operators realise. Over 40 documented bots, plus stealth crawlers presenting themselves as browsers - that is the reality of 2025/2026. The best strategy for most companies: explicitly allow key AI crawlers (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot), regularly analyse crawler logs, provide structured data that makes crawling efficient, and keep an eye on the development of pay-per-crawl models. Those who define their AI crawler strategy today are better positioned tomorrow.
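For the log analysis mentioned above, a minimal starting point is tallying requests per AI crawler from a standard access log. A sketch in Python - the log path is hypothetical, and the User-Agent is assumed to be the last quoted field, as in the default nginx/Apache combined log format:

```python
import re
from collections import Counter

# User agent tokens of the crawlers discussed in this article.
# (Google-Extended is a robots.txt token, not a user agent,
# so it never appears in access logs.)
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "Bytespider",
]

# In the combined log format the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def crawler_hits(logfile: str) -> Counter:
    """Count requests per AI crawler in a combined-format access log."""
    hits = Counter()
    with open(logfile, encoding="utf-8", errors="replace") as f:
        for line in f:
            match = UA_PATTERN.search(line)
            if match:
                user_agent = match.group(1)
                for bot in AI_CRAWLERS:
                    if bot in user_agent:
                        hits[bot] += 1
                        break
    return hits

# Hypothetical path - adjust to your server's log location.
for bot, count in crawler_hits("/var/log/nginx/access.log").most_common():
    print(f"{bot}: {count}")
```

Run regularly, this builds exactly the kind of crawler record that the pay-per-crawl discussion above suggests keeping.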
Marvin Malessa
Founder, Beconova
Founded Beconova in Germany in 2025 to help shops and service businesses become visible in AI search engines. Writes about GEO, AI visibility, and the future of search.