Three robots are deciding whether your brand gets cited in AI-generated answers. GPTBot crawls your site for OpenAI. ClaudeBot crawls it for Anthropic. PerplexityBot crawls it for Perplexity AI. If your robots.txt is blocking any of them, you are invisible to that engine’s answers, no matter how good your content is.
This guide documents exactly what each bot does, what user agent string it sends, how to allow or block it, and why that configuration decision has direct consequences for your brand’s AI citation rate. At the end you will have a clear picture of your current setup and a complete robots.txt template you can use today.
One note on scope: this guide covers the three bots most brands ask about first, plus Google-Extended (which affects AI Overviews and Gemini) and a handful of additional crawlers worth knowing. Your AI visibility depends on more than the big three.
How AI crawlers differ from Googlebot
If you have spent time in SEO, you know Googlebot. It crawls your pages, indexes them, and those pages become eligible to rank in Google Search. The feedback loop is clear: check Google Search Console, see what indexed, understand what ranked.
AI crawlers work differently, and the difference matters for how you think about your robots.txt decisions.
Some AI crawlers, like GPTBot, gather content to train large language models. Your page enters a training dataset and influences what the model “knows” the next time it is updated. Others, like PerplexityBot, do real-time retrieval. When a user asks Perplexity a question, the crawler pulls live web content to synthesize an answer and show cited sources. Some crawlers do both.
The practical difference matters for your timeline expectations. Training crawlers influence AI answers on a delay, as models are updated periodically with new data. Retrieval crawlers can influence answers within days of accessing your content. If Perplexity can reach your page today and your page answers a common query well, your brand can appear as a Perplexity source this week.
Both types of crawlers respect robots.txt. Block them and they will not visit. Allow them and they will crawl as often as their schedules permit. But unlike Googlebot, most AI crawlers do not come with a public dashboard showing what they have indexed. Your two feedback mechanisms are your server access logs and manual queries to the AI engines themselves.
The robots.txt decision you make today affects both timelines: the immediate retrieval capability (which queries cite you right now) and the longer training horizon (what these models know about your brand when no retrieval is available). Both matter for AI visibility.
The crawler reference table
Before going bot by bot, here is a quick reference. These are the crawlers that affect the AI engines your buyers are most likely using.
| AI engine | Primary crawler | Secondary crawler | Primary use |
|---|---|---|---|
| ChatGPT (OpenAI) | GPTBot | OAI-SearchBot | Training + real-time search |
| Claude (Anthropic) | ClaudeBot | anthropic-ai | Training + retrieval |
| Perplexity AI | PerplexityBot | | Real-time retrieval + citation |
| Google AI Overviews / Gemini | Google-Extended | | AI product training + retrieval |
GPTBot (OpenAI)
GPTBot is OpenAI’s primary web crawler. The user agent string is GPTBot. You may also see GPTBot/1.x in your server logs, where the version number increments with updates. OpenAI published its robots.txt documentation in August 2023, which gave site owners a documented, official path for allowing or blocking it.
OpenAI documents GPTBot as its training crawler: it collects content that may be used to train future GPT model versions. Content your pages contain enters that pipeline only if GPTBot can access it. The quality and structure of that content influences how useful it is: well-formatted, factual, clearly attributed content is more valuable than thin or ambiguous text. Live retrieval runs through separate agents. OpenAI’s crawler documentation lists OAI-SearchBot for ChatGPT search results and ChatGPT-User for pages fetched when a user’s prompt triggers browsing.
If you block GPTBot, your content does not reach the training pipeline, so future models will not have learned from it. Retrieval is governed by the OAI-SearchBot rules covered below. Block both and your brand will be absent from ChatGPT’s answers for queries your content would otherwise address.
To allow GPTBot in your robots.txt:
User-agent: GPTBot
Allow: /
To block it:
User-agent: GPTBot
Disallow: /
If you want to allow most of your site but protect a specific section:
User-agent: GPTBot
Allow: /
Disallow: /members-only/
GPTBot respects robots.txt. OpenAI has confirmed this publicly. If the rule says Disallow, the bot will not crawl those pages.
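The partial-block example above works because of rule precedence inside a group. Under the Robots Exclusion Protocol (RFC 9309), the longest matching path wins, and Allow wins a tie. The following is a minimal sketch of that precedence, not OpenAI’s implementation: the function name `is_allowed` and the rule list are illustrative, and wildcard characters (`*`, `$`) are deliberately ignored.

```python
def is_allowed(rules, path):
    """Apply longest-match precedence to (directive, prefix) pairs.

    Simplified sketch: plain prefix matching only, no '*' or '$' support.
    """
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules do not apply
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_won_by_allow = len(prefix) == best_len and is_allow
        if longer or tie_won_by_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The GPTBot group from the example above, written as data.
GPTBOT_RULES = [("Allow", "/"), ("Disallow", "/members-only/")]

print(is_allowed(GPTBOT_RULES, "/blog/post"))             # open section
print(is_allowed(GPTBOT_RULES, "/members-only/report"))   # protected section
```

Because `/members-only/` is a longer match than `/`, the Disallow rule takes precedence for that section even though Allow appears first in the file.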
OAI-SearchBot
OAI-SearchBot is a separate OpenAI crawler, specifically for ChatGPT’s real-time search feature. Where GPTBot handles general crawling and training, OAI-SearchBot retrieves live content to answer search queries as users submit them. The two are distinct in purpose and may appear separately in your logs.
For complete coverage, allow both:
User-agent: OAI-SearchBot
Allow: /
ClaudeBot (Anthropic)
ClaudeBot is Anthropic’s web crawler. The user agent string is ClaudeBot. You may also see anthropic-ai in your logs. Both are Anthropic crawlers. Both should be allowed if you want your content contributing to Claude’s knowledge and retrieval capabilities.
Anthropic uses ClaudeBot to gather content for training future Claude models and to support Claude’s ability to retrieve and cite current web information. The Claude ecosystem includes Claude.ai, the Anthropic API, and many third-party products built on top of Claude. Content that ClaudeBot can access has potential reach across all of those surfaces.
The allow rules:
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
To block them:
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
ClaudeBot respects robots.txt. Block it and your content does not enter Anthropic’s training pipeline or retrieval index.
One practical note on how the Claude ecosystem works: Claude’s responses draw on a combination of training knowledge and, for Claude products with web access enabled, real-time retrieval. Both pathways benefit from your site being open. If ClaudeBot cannot reach your content during training, Claude will not have built-in knowledge of your brand. If it cannot retrieve during live user sessions, Claude cannot cite your pages as sources in real-time answers. The robots.txt decision affects both.
This dual-timeline impact is why blocking feels lower-stakes than it actually is. The training gap does not show up immediately. It accumulates over months, as models update without your content in the dataset, and competitors who left their sites open accumulate citation advantage.
PerplexityBot (Perplexity AI)
PerplexityBot is Perplexity AI’s crawler. The user agent string is PerplexityBot.
Perplexity operates on a model that is more retrieval-heavy than most other AI search engines. When a user submits a query, Perplexity does a live web search, retrieves content from relevant pages, synthesizes an answer, and displays the source pages it drew from. The citations are front and center in every Perplexity response. Users see exactly where the information came from.
This design makes PerplexityBot the most directly linked to immediate citation opportunity. If PerplexityBot can access your content today and your content answers a query well, your page can appear as a cited source in Perplexity answers tomorrow. There is no training cycle delay.
The other implication of Perplexity’s retrieval model: your page’s format matters at the retrieval level. A page with a clear answer in the first paragraph, organized under descriptive headings, with FAQ sections that mirror how users phrase questions, is easier for Perplexity to excerpt and attribute than the same information buried in dense prose. You can read more about the specific formatting patterns that improve AI citation rates in the complete guide to AEO.
Allow PerplexityBot:
User-agent: PerplexityBot
Allow: /
Block it:
User-agent: PerplexityBot
Disallow: /
PerplexityBot respects robots.txt. Block it and your pages cannot appear as Perplexity sources. Your competitors’ accessible pages will fill those citation slots instead.
Google-Extended: the crawler brands forget
While GPTBot, ClaudeBot, and PerplexityBot get the most attention, Google-Extended is arguably the highest-stakes crawler for brands with strong Google Search presence.
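Google-Extended is a robots.txt product token that controls whether content from your site may be used for Gemini (Google’s AI assistant), both for model training and for grounding answers. Strictly speaking it is not a separate crawler: Google fetches pages with its existing user agents, and the token only governs what the fetched content may be used for. Googlebot, by contrast, governs crawling for Google Search.
The key distinction: blocking Google-Extended does not affect your traditional Google Search rankings. Your pages can still rank on page one in organic search while being excluded from Gemini on those same topics. Note that Google’s documentation treats AI Overviews as a Search feature, governed by Googlebot and the standard Search snippet controls rather than by Google-Extended, so check Google’s current guidance before assuming this token decides whether you appear there. These are separate systems with separate access controls.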
By default, if you do not mention Google-Extended in your robots.txt, Google treats it like Googlebot: allowed to crawl. Blocking requires explicit action. But if you inherited a robots.txt that someone built with a blanket “block everything except Googlebot” pattern, you may be blocking Google-Extended without knowing it.
Explicit allow rule:
User-agent: Google-Extended
Allow: /
Explicit block:
User-agent: Google-Extended
Disallow: /
For most sites pursuing AEO strategy, Google-Extended belongs in the same allow tier as the other three. Google AI Overviews appear on a substantial share of Google results pages, particularly for informational and comparison queries, but Google documents them as following Googlebot rules. What Google-Extended decides is Gemini: if your content could ground a Gemini answer on a relevant query but Google-Extended is blocked, you lose that visibility while still paying for the SEO work that got the page to rank.
Other AI crawlers worth knowing
Beyond those four, you will encounter additional AI crawlers as you review server logs. Bytespider is ByteDance’s crawler, connected to TikTok’s AI systems and content recommendation infrastructure. Applebot-Extended is Apple’s dedicated AI crawler for Siri and Apple Intelligence features. FacebookBot is Meta’s crawler for AI capabilities across Facebook and Instagram. cohere-ai belongs to Cohere, used for training and retrieval on its enterprise AI platform. Diffbot crawls for structured web data extraction used by multiple AI applications.
For practical AEO priorities in 2026, focus on GPTBot, OAI-SearchBot, ClaudeBot, anthropic-ai, PerplexityBot, and Google-Extended first. These correspond to the AI engines with the highest user base and the most commercial relevance for most businesses. Get those six right before worrying about the rest.
Should you block GPTBot?
This is the most common question. The answer depends on what you are trying to protect and what you stand to lose.
The case for blocking rests on intellectual property and data licensing concerns. The argument: AI companies use publicly available web content to train commercial products without compensating content creators. Some publishers have taken this position seriously. News organizations in particular have pursued both blocking and legal action related to training data use.
This argument has real merit for a specific class of publisher: sites where the content is the primary product, where the content itself has standalone commercial value. A news subscription site loses revenue if AI models reproduce its articles without users clicking through to subscribe. A research firm loses value if its paid reports become freely available through AI synthesis. For these businesses, the robots.txt decision is an intellectual property question first.
For most marketing-oriented businesses, the calculation is different. Your website content is not the product. It is the vehicle. The goal of that content is to earn attention, build credibility, and generate inquiries from prospective clients. Blocking AI crawlers protects a theoretical licensing claim while sacrificing real visibility in channels that are increasingly driving brand discovery.
Consider the commercial scenario directly. Suppose your firm operates in a competitive services category. Buyers in your category regularly ask AI engines for recommendations before reaching out to anyone. They ask ChatGPT, “Who are the best [service type] providers?” ChatGPT draws on training data and real-time retrieval. Your competitor allows GPTBot. Your robots.txt blocks it. ChatGPT names your competitor. You never appear. That is a real, recurring commercial consequence of a robots.txt decision.
There is a middle position worth understanding: selective blocking. You could theoretically allow OAI-SearchBot (real-time retrieval for ChatGPT search) while blocking GPTBot (training data). In practice this is less effective than it sounds. Training data drives the foundational knowledge a model has about your brand and category. Retrieval helps with specific live queries but will not surface your brand for category-level questions where the model relies on training knowledge. Splitting the two gives you partial coverage at best.
The practical framework: if your content is the product and you have a data licensing strategy, block deliberately and understand the citation tradeoff. If your content is the vehicle for client acquisition, open access. Your content is already public. The goal is for it to earn citations. Blocking the crawlers that make that possible contradicts your own visibility objectives.
The question “should I block GPTBot?” is really two questions in one: “Do I have a licensing claim worth protecting?” and “Am I willing to give up AI citation visibility to protect it?” For most business websites, the second answer settles it.
How to configure robots.txt for AI visibility
The safest approach is to have an explicit allow rule for every major AI crawler in your robots.txt. Do not rely on default behavior or the absence of a rule to mean “allowed.” Be explicit. Crawlers change behavior over time and defaults can shift.
Here is a complete robots.txt that opens access to all major AI crawlers while keeping your standard configurations intact:
# Standard search crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI crawlers - allow for AI citation visibility
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
# Catch-all
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
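If you have paths to protect across all crawlers (admin areas, private portals, raw data exports), add explicit disallows after the allow rule in every group, not only in the catch-all. A crawler obeys only the most specific group that matches it, so a bot with its own group, such as GPTBot in the template above, ignores anything listed under User-agent: *. The catch-all version looks like this, and the same three Disallow lines should be repeated in each named group: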
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /exports/
This configuration gives you control: major AI crawlers are explicitly allowed, sensitive paths are protected, and the sitemap is referenced for discoverability.
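One way to sanity-check a robots.txt before deploying it is to parse it locally and ask whether each AI crawler may fetch the homepage. The sketch below uses Python’s standard `urllib.robotparser`. The names `audit_robots`, `WILDCARD_BLOCK`, and `OPEN_CONFIG` are illustrative, and the two rule sets are shortened stand-ins rather than complete files. One caveat: Python’s parser applies the rules inside a group in file order, which can differ from the longest-match precedence most crawlers use, so this check is most reliable for whole-site Allow: / and Disallow: / rules.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot",
    "anthropic-ai", "PerplexityBot", "Google-Extended",
]


def audit_robots(robots_txt, agents=AI_CRAWLERS, path="/"):
    """Return {agent: True/False} for whether each agent may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Shortened example of the "block everything except Googlebot" pattern.
WILDCARD_BLOCK = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Shortened example of an open configuration.
OPEN_CONFIG = """\
User-agent: GPTBot
Allow: /

User-agent: *
Allow: /
"""

for label, text in [("wildcard block", WILDCARD_BLOCK), ("open", OPEN_CONFIG)]:
    print(label, audit_robots(text))
```

Any agent reported as False under your real file is one that a compliant crawler would treat as blocked at the site root.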
The most common robots.txt mistakes that kill AI visibility
After reviewing robots.txt files across many AEO audits, I see the same errors repeatedly. These are the ones that quietly cost brands AI citations while the marketing team wonders why they never appear in AI answers.
The wildcard block. Some sites use User-agent: * with Disallow: / and then selectively allow Googlebot. This blocks every crawler that is not explicitly listed, including every AI crawler. The fix is to either change the wildcard disallow to allow, or add a dedicated group with an allow rule for each AI crawler. Group order in the file does not matter: a crawler that finds a group naming it follows that group and ignores the wildcard.
The inherited template. Developers often copy a robots.txt from a framework, CMS starter kit, or another project and never review it. Some of these templates include aggressive crawl restrictions that made sense for the original context. If you did not personally review your robots.txt in the past 12 months, it is worth checking now.
The forgotten disallow. Someone added Disallow: /blog/ two years ago to keep a staging section from being indexed during development. The blog moved to production. The disallow stayed. Now the primary content section driving AI citations is blocked. This pattern is especially common on sites that have gone through redesigns or CMS migrations.
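The case-sensitivity error. Under the Robots Exclusion Protocol (RFC 9309), crawlers match user agent names case-insensitively, so gptbot and GPTBot select the same group in a compliant parser. Paths are different: they are case-sensitive, so Disallow: /Blog/ does not cover /blog/. Not every parser follows the standard strictly, so always use the exact casing documented by each company for user agent names, and check path casing against your real URLs. When in doubt about a path, add both variations.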
The JavaScript rendering gap. This one does not live in robots.txt but it is a technical counterpart that produces identical symptoms. If your site uses a JavaScript-heavy framework and pages require JavaScript to render their content, AI crawlers receive an empty HTML shell. They cannot extract content they cannot see. Even with a clean robots.txt and an open configuration, client-side-only rendering means your pages are effectively invisible to crawlers that do not execute JavaScript fully.
The CDN configuration conflict. Some CDN configurations block bots at the edge before the request reaches your server or checks your robots.txt at all. If your CDN has a bot management layer, verify that AI crawlers are on the allow list there as well. The robots.txt setting is irrelevant if the CDN drops the request first.
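The JavaScript rendering gap described above can be approximated with a quick check: fetch the raw HTML without executing scripts and count how much visible text it contains. The following is a rough sketch using only the standard library. The helper names, the sample markup, and the 50-word threshold are assumptions for illustration, not a documented cutoff used by any crawler.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html):
    """Count words a non-JavaScript client could read from raw HTML."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return len(" ".join(extractor.chunks).split())


def looks_like_empty_shell(html, min_words=50):
    """Heuristic only: min_words is an arbitrary illustrative threshold."""
    return visible_word_count(html) < min_words


# Hypothetical client-side-rendered page: content arrives only via script.
SHELL = (
    "<!doctype html><html><head><title>Acme</title></head><body>"
    '<div id="root"></div>'
    '<script>window.__DATA__ = {"msg": "hello world"};</script>'
    "</body></html>"
)

print(visible_word_count(SHELL), looks_like_empty_shell(SHELL))
```

If the raw HTML of a key page trips this check while the rendered page is full of content, server-side rendering or pre-rendering is worth investigating.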
What blocking actually costs: a concrete scenario
Suppose a consulting firm has a strong website. Forty articles, clear service descriptions, a founder bio, FAQ sections on every major page. Three years of consistent SEO work. They rank well in Google for their key terms.
But their developer set up robots.txt two years ago during a security review. The file includes Disallow: / under User-agent: *, with only Googlebot listed in the allow section. Every AI crawler in existence hits that wall.
Every month, prospective clients ask ChatGPT and Perplexity for consulting firm recommendations in the firm’s specialty. ChatGPT draws on training data. The firm’s content has never entered that training set. ChatGPT names four competitors, all of whom have open configurations. Perplexity does live retrieval. The firm’s pages are blocked. Perplexity cites the competitors’ accessible pages.
A single robots.txt change fixes this entirely. Within days of opening access, PerplexityBot begins crawling. Within weeks, GPTBot and ClaudeBot start appearing in server logs. Within months, as models update, the firm begins appearing in relevant AI answers. The content was always good enough. The access was the problem.
This is why robots.txt is the first check in any AEO audit. It is the gate. If the gate is closed, nothing that comes after it matters. Schema markup, content quality, entity signals: all of that work is invisible to AI crawlers that cannot get past your robots.txt.
Beyond robots.txt: making crawlers want to come back
Opening access is the necessary first step. Making your content worth crawling repeatedly is what builds sustained citation presence.
AI crawlers, like all crawlers, operate on budgets. They cannot crawl everything on every visit. Content that is fast-loading, clearly structured, and directly relevant to real queries gets prioritized. Content that is slow, thin, or structurally opaque gets deprioritized and revisited less frequently.
Schema markup is the most direct signal you can give AI crawlers about what your content is. An Article schema tells crawlers this is a blog post, by a named author, published on a specific date, covering a specific topic. An FAQPage schema tells them these are question and answer pairs, formatted for direct extraction. A Person schema on your author bio connects your content to a recognized entity. Our guide to schema markup for AEO covers every schema type that influences AI citation, with implementation examples for each one.
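As an illustration of the FAQPage schema mentioned above, here is a small sketch that builds the JSON-LD from question and answer pairs and wraps it in the script tag that goes in the page head. The property names follow schema.org’s FAQPage, Question, and Answer types. The function name `faq_page_jsonld` and the sample question are hypothetical.

```python
import json


def faq_page_jsonld(pairs):
    """Build a schema.org FAQPage JSON-LD <script> tag from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    # Escape "</" so answer text cannot close the surrounding script tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(faq_page_jsonld([
    ("Does blocking GPTBot affect Google rankings?",
     "No. GPTBot rules apply only to OpenAI's crawler."),
]))
```

Keep the answers in the markup identical to the visible FAQ text on the page, so the structured data describes what a reader actually sees.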
An llms.txt file gives crawlers a structured introduction to your site. Place it at yourdomain.com/llms.txt. Include a brief description of your organization, links to your most important pages, and a summary of the topics you cover. It functions like an orientation packet for AI systems: instead of making them reconstruct your site structure by crawling everything from scratch, you hand them the key information upfront. This is especially useful for sites with complex architectures or large content libraries.
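To make the llms.txt idea concrete, the sketch below renders a file in the layout the community llms.txt proposal describes: a title heading, a one-line summary as a blockquote, then sections of annotated links. llms.txt is a proposed convention rather than a ratified standard, and the function name, organization, and example.com URLs here are placeholders.

```python
def render_llms_txt(name, summary, sections):
    """Render llms.txt text.

    sections maps a section heading to a list of (title, url, note) tuples;
    note may be an empty string.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{title}]({url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder organization and URLs for illustration only.
print(render_llms_txt(
    "Acme Consulting",
    "Advisory firm covering operations strategy for mid-size manufacturers.",
    {
        "Guides": [
            ("Operations audit checklist", "https://example.com/audit", "Step-by-step guide"),
            ("Services overview", "https://example.com/services", ""),
        ],
    },
))
```

Save the output as llms.txt at the site root and regenerate it when your most important pages change.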
Fast server response time keeps you in the crawl rotation. Crawlers have time limits per request. If your server takes more than a few seconds to return readable content, the crawl is abandoned and the page is deprioritized for future visits. Keeping your pages under two seconds on time-to-first-byte is the practical target.
Clear heading hierarchy makes content extraction straightforward. Crawlers parse headings to understand page structure. An H1 that names the topic clearly, H2s that name the subtopics, H3s that go deeper into specific questions. This is the same structure that helps human readers, which is not a coincidence: well-organized information is easier to process for both humans and machines.
Short paragraphs with clear topic sentences make your content more extractable. Dense blocks of text require the crawler to do more interpretive work to identify the main claim of each section. Three to four sentences per paragraph is a practical ceiling. When a crawler can see the main point of each paragraph from its first sentence, citation is easier.
FAQ sections are particularly powerful for retrieval-based crawlers like PerplexityBot. Questions that mirror how users actually phrase queries, with direct answers in the first sentence of each response, are exactly the format these crawlers are optimized to extract and attribute. If your page answers “How do I [common task in your category]?” in a dedicated FAQ section, that section is significantly more likely to surface as a cited source than the same information embedded in a prose paragraph.
How to verify bots are reaching your site
After updating your robots.txt, you need confirmation that the change worked. Server access logs are your primary tool.
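Look for these user agent strings in your logs: GPTBot, OAI-SearchBot, ClaudeBot, anthropic-ai, and PerplexityBot. If you see them making requests and receiving 200 responses, crawling is working. Google-Extended is the exception: Google documents it as a robots.txt control token with no HTTP user agent of its own, so it never appears in logs. Google fetches pages with its existing crawlers, and the only thing to verify for Google-Extended is the rule in your robots.txt file.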
If you use a CDN like Cloudflare, raw origin logs may not capture all bot traffic. Cloudflare has its own bot analytics dashboard. Check there, or temporarily configure origin logging for a test period to verify what is reaching your server.
Expect a delay after updating robots.txt. Crawlers check robots.txt on their own schedule, not when you make a change. After updating, give it three to seven days before expecting new bot visits in your logs. If you see zero traffic from AI crawlers after two weeks with an open configuration, investigate whether a CDN rule, server-level firewall, or rate limiter is dropping bot requests before they reach your server.
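The log check described above can be scripted. This sketch assumes access logs in the common "combined" format used by Apache and Nginx, and tallies response codes per AI crawler. The function name, the regular expression, and the sample lines are illustrative: the sample user agent strings are approximations, so match against what your own logs show. Google-Extended is left out because it is a robots.txt token without a distinct HTTP user agent.

```python
import re
from collections import Counter, defaultdict

AI_AGENT_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai", "PerplexityBot"]

# Combined log format: ... "GET /path HTTP/1.1" 200 1234 "referer" "user agent"
LOG_LINE = re.compile(
    r'"(?:[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize_ai_hits(lines):
    """Return {crawler: {status_code: count}} for lines sent by known AI crawlers."""
    hits = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in combined format
        user_agent = match.group("ua").lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in user_agent:
                hits[token][match.group("status")] += 1
                break
    return {token: dict(counts) for token, counts in hits.items()}


# Illustrative sample lines, not real traffic.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [10/Jan/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" 403 312 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '192.0.2.44 - - [10/Jan/2026:10:06:00 +0000] "GET / HTTP/1.1" 200 9001 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0"',
]

print(summarize_ai_hits(SAMPLE_LINES))
```

A crawler that shows only 403 or 429 responses is reaching your infrastructure but being turned away, which points at a firewall, CDN rule, or rate limiter rather than robots.txt.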
The manual confirmation method is the most direct feedback loop. Query the AI engines yourself. Ask ChatGPT about your brand, your products, or the topics your content covers. Ask Perplexity the same questions. If your pages appear as sources, crawling and citation are working. If you never appear as a source despite structured content and an open robots.txt, that gap tells you where to investigate next: content quality, entity signals, or a deeper technical issue your logs will surface.
The robots.txt rule set you need in 2026
Six crawlers. One file. The configuration is not complex. What makes it hard is that most brands have never checked, and the cost of not checking accumulates silently over months of missed citations.
Allow GPTBot. Allow OAI-SearchBot. Allow ClaudeBot. Allow anthropic-ai. Allow PerplexityBot. Allow Google-Extended. Do it with explicit allow rules, not reliance on defaults. Verify with your actual robots.txt file, not your memory of what was set up.
If you manage a site where the content itself has standalone commercial value and you have a considered licensing position, make that robots.txt decision deliberately and understand exactly what you are trading away in citation visibility.
For everyone else, the robots.txt is a gate. For AI visibility, that gate should be open.


