Discoverability & AEO · June 15, 2026 · 12 min read

AI Training vs AI Citation: The Crawler Decision Most Businesses Get Wrong

Every major AI company now runs separate crawlers for training its models and for citing sources in live user answers. You can block the training crawlers—GPTBot, ClaudeBot, Google-Extended—without affecting the citation crawlers that put your business in front of buyers. Most businesses and platforms treat this as one decision. But it's not.

Most robots.txt guides published in 2023 and 2024 gave the same advice: block AI crawlers. Protect your content. Don't let the models scrape you.

That advice made sense at the time. AI companies were training on public web content at scale, and the mechanism for opting out was a single robots.txt directive. Blocking felt like the reasonable defensive move.

What those guides didn't anticipate, and most haven't been updated to reflect, is that the landscape changed: you can now opt out of being training data without opting out of being cited.

Most businesses don't know this. Most platforms don't give you the control to act on it, even if you do. And the guides that tell you to "block AI bots" are, in most cases, telling you to disappear from AI search results to solve a problem that didn't require that trade-off.

The decision most businesses get wrong

When businesses started worrying about AI training data in 2023, the common response was to add a blanket block in robots.txt. Block GPTBot. Block all AI crawlers. Done.

It felt protective. It was also, in many cases, self-defeating.

Where those rules were written as a blanket block, they now also shut out the citation crawlers that would have put those businesses in front of buyers actively asking AI engines for recommendations.

The issue isn't that businesses made a bad decision. It's that they made one decision where they should have made two.

Training crawlers vs citation crawlers

Every major AI company now operates two separate classes of web crawlers.

A training crawler collects your pages, your content, your writing—and that data is used to train future versions of the model. This is what people were worried about in 2023. It's a legitimate concern.

A citation crawler—sometimes called a search crawler or retrieval crawler—indexes your content so that when a user asks the AI a question, it can surface your page as a source. This is what makes you findable in ChatGPT Search, Claude, and Perplexity. It's the mechanism by which AI engines put your business in front of buyers.

Blocking a training crawler does not affect its corresponding citation crawler. They are separate systems with separate user-agent strings. robots.txt rules for one do not apply to the other. This is officially documented by OpenAI, Anthropic, and Google—and it is the single most important thing to understand about your AI crawler strategy today.

GPTBot (Training)
Recommended: Optional
Collects content to train future OpenAI models

OAI-SearchBot (Citation)
Recommended: Allow
Builds the index that powers ChatGPT Search

ChatGPT-User (Citation)
Recommended: Allow
Fetches pages during live user conversations

Bottom line: Block GPTBot to opt out of training. Allow OAI-SearchBot and ChatGPT-User to stay citable.
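To see this independence concretely, here is a minimal sketch that uses Python's standard-library robots.txt parser as a stand-in for a crawler. It is an approximation: each company's crawler applies its own parser, and the example.com URL and the allowed helper are illustrative names, not part of any vendor tooling.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that opts out of OpenAI training only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if this robots.txt lets user_agent fetch url (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Check the three OpenAI user agents named above against the same file.
results = {
    agent: allowed(ROBOTS_TXT, agent)
    for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User")
}
print(results)
# {'GPTBot': False, 'OAI-SearchBot': True, 'ChatGPT-User': True}
```

The rule that names GPTBot leaves the other two user agents untouched, which is the behavior the bottom line relies on.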

How each company has split their crawlers

OpenAI

GPTBot crawls the web to collect content that may be used for training OpenAI models. Blocking it is optional—it has no consequence for whether you appear in ChatGPT Search.

OAI-SearchBot builds the index that powers ChatGPT Search. If you block it, ChatGPT cannot surface your pages when users search for topics you cover. This one matters for visibility.

ChatGPT-User fetches specific pages during live conversations when a user shares a link or asks a question that requires reading a current page. Blocking it means ChatGPT cannot read your content in real time.

Anthropic

Anthropic formally documented its three separate crawlers in February 2026.

ClaudeBot collects content for AI model training. Blocking it opts you out of training—no other consequence.

Claude-SearchBot crawls the web to improve search result quality inside Claude. This is the crawler that determines whether your content surfaces when Claude users search for topics you cover. Anthropic's documentation is direct about the consequence of blocking it: doing so "may reduce your site's visibility and accuracy in user search results." For businesses that want visibility in Claude answers, Claude-SearchBot must be allowed.

Claude-User fetches specific pages during live Claude conversations. It operates when a Claude user asks a question requiring web access, or when Claude is verifying current information. Blocking it limits Claude's ability to read your content in real time.

Google, Perplexity, and the rest

Google has a similar split. Googlebot, the crawler that powers traditional Search rankings, should never be blocked. Google-Extended is not a separate crawler: it is a robots.txt token that controls whether your already-crawled content may be used for Gemini AI training. Blocking Google-Extended has zero effect on your Search rankings. It is purely a training data decision.

Perplexity currently runs two crawlers—PerplexityBot for indexing and Perplexity-User for live fetches. Neither is used for model training. Both affect whether your content appears in Perplexity answers.

What a considered robots.txt actually looks like

If your goal is to block training crawlers while staying citable in AI search, your robots.txt looks something like this:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow citation crawlers (all other bots allowed by default)
# OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User,
# PerplexityBot, Perplexity-User are not listed here — they are
# allowed by default unless explicitly disallowed.

The citation crawlers listed in the comment are allowed by omission. You only need to add explicit rules for what you want to block. A crawler with no matching group in robots.txt is allowed by default, provided the file has no User-agent: * group that disallows it.
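Omission stops being enough when the file also carries a catch-all User-agent: * group with a Disallow rule, because unlisted crawlers then fall under that group. The sketch below illustrates the difference using Python's standard-library parser as an approximation of crawler behavior. The two sample files and the access_report helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

CITATION_AGENTS = (
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Perplexity-User",
)

# Catch-all file: citation crawlers are not named, so they inherit the "*" group.
CATCH_ALL = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

# Same file, preceded by one group that names every citation crawler explicitly.
EXPLICIT_GROUP = "\n".join(f"User-agent: {agent}" for agent in CITATION_AGENTS)
WITH_EXPLICIT_ALLOW = EXPLICIT_GROUP + "\nAllow: /\n\n" + CATCH_ALL


def access_report(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Map each citation crawler to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in CITATION_AGENTS}


print(access_report(CATCH_ALL))            # every citation crawler is blocked
print(access_report(WITH_EXPLICIT_ALLOW))  # every citation crawler is allowed
```

If your existing file has a catch-all Disallow, naming the citation crawlers in their own Allow group is one way to keep them in while the rest of the file stays as it is.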

One important caveat: robots.txt rules only work if crawlers can actually read the file. If Cloudflare's Bot Fight Mode is active—or if your host is blocking bots at the infrastructure level—crawlers may never reach your robots.txt at all, and none of these rules will apply.
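A rough way to spot this is to request robots.txt while sending a crawler's token as the User-Agent header and look at the status code. The sketch below is a first-pass check under stated assumptions: real crawlers send longer user-agent strings and come from their own IP ranges, so bot protection may treat your test request differently from the real thing. The function names and status labels are hypothetical.

```python
import urllib.error
import urllib.request


def robots_status(site: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch <site>/robots.txt with the given User-Agent header; return the HTTP status."""
    request = urllib.request.Request(
        site.rstrip("/") + "/robots.txt",
        headers={"User-Agent": user_agent},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; the code is what we want.
        return error.code


def interpret(status: int) -> str:
    """Rough labels for what a status code suggests (illustrative, not exhaustive)."""
    if 200 <= status < 300:
        return "reachable"
    if status in (401, 403, 429):
        return "refused: possible firewall, bot protection, or rate limit"
    if status == 404:
        return "no robots.txt found"
    if status >= 500:
        return "server error"
    return "unexpected status"


def check_site(site: str, agents=("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")) -> dict:
    """Return {agent: label} for each crawler token tried against the site."""
    return {agent: interpret(robots_status(site, agent)) for agent in agents}
```

Calling check_site with your own domain returns one label per crawler token. A "refused" label for a crawler you want is a signal to look at your CDN or host settings before editing robots.txt. Connection failures such as DNS errors or timeouts are not caught here and will raise.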

If you're not sure whether your robots.txt is reachable or whether your current setup is blocking the crawlers you actually want, the Infrastructure Audit covers exactly this, along with everything else that affects whether AI engines can find and cite your business. See the audit →

What your platform decides for you

Here is where the conversation gets harder. Many businesses cannot execute this strategy because their platform doesn't give them access to robots.txt.

Squarespace introduced an AI Crawlers toggle in 2024. It blocks or allows AI bots as a group. There is no way to separate training crawlers from citation crawlers within Squarespace's native controls.

GoHighLevel does offer a robots.txt editor under Settings → Domains → Manage. It is one of the more capable platforms in this regard. But no AI-specific guidance ships with it by default—you need to know which directives to add and why.

Kajabi generates a robots.txt automatically. There is no native editor. You can request custom robots.txt settings through their support team, but there is no self-serve option.

Wix offers robots.txt access through their SEO settings, but the interface is limited and easy to misconfigure.

Webflow gives direct access to robots.txt in Site Settings → SEO. This is one of the better options among hosted platforms.

WordPress (self-hosted) gives you full control. You can edit robots.txt directly or via an SEO plugin. This is the most flexible option, and why self-hosted WordPress remains the default recommendation for businesses where discoverability actually matters.

If you're on a platform that doesn't give you this control, the options are: find the workaround your platform supports, accept the limitation, or move to infrastructure you actually own.

The actual trade-off

The question "should I block AI crawlers?" has a better version: "which crawlers should I block, and for what reason?"

Training: optional. You can opt out without consequence to visibility.

Citation: not optional, if visibility is the goal. Blocking citation crawlers removes you from AI search answers—the very thing most businesses are now trying to get into.
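Treating these as two separate switches can be sketched as a small generator that takes one decision for training and one for citation and writes the matching robots.txt groups. The crawler lists come from this article; the function names are hypothetical, and the output is checked with Python's standard-library parser, which only approximates how each vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

TRAINING_AGENTS = ("GPTBot", "ClaudeBot", "Google-Extended")
CITATION_AGENTS = (
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Perplexity-User",
)


def _group(agents, allow: bool) -> str:
    """One robots.txt group: a User-agent line per crawler, then a single rule."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Allow: /" if allow else "Disallow: /")
    return "\n".join(lines)


def build_robots_txt(allow_training: bool, allow_citation: bool) -> str:
    """Build robots.txt text from the two independent decisions."""
    groups = [
        _group(TRAINING_AGENTS, allow_training),
        _group(CITATION_AGENTS, allow_citation),
    ]
    return "\n\n".join(groups) + "\n"


# Opt out of training, stay citable.
robots_txt = build_robots_txt(allow_training=False, allow_citation=True)
print(robots_txt)

# Sanity-check the generated text before publishing it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("ClaudeBot", "https://example.com/"))         # False
print(parser.can_fetch("Claude-SearchBot", "https://example.com/"))  # True
```

Writing the citation crawlers into their own Allow group is a design choice: it keeps the result the same even if a catch-all rule is added to the file later.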

Most businesses haven't separated these two questions. Most guides haven't helped them do so. And most platforms haven't given them the tools to act on the distinction even when they understand it.

That's the gap. And it's why businesses that appear in AI answers aren't necessarily better—they just made the distinction at some point, intentionally or by accident.

The Infrastructure Audit identifies exactly which crawlers can and can't access your site, what your robots.txt is currently doing, and what needs to change. It's a one-time review, not an ongoing engagement. See the audit →

Does blocking GPTBot stop ChatGPT from citing me?

No. Blocking GPTBot prevents OpenAI from using your content for model training, but it does not affect OAI-SearchBot or ChatGPT-User—the crawlers that power ChatGPT Search and live user fetches. These are separate systems with separate user-agent strings, and each requires its own robots.txt directive. You can block GPTBot entirely and still appear in ChatGPT search answers, provided OAI-SearchBot is allowed.

What is the difference between GPTBot and OAI-SearchBot?

GPTBot is OpenAI's training crawler—it collects web content that may be used to train future OpenAI models. OAI-SearchBot is OpenAI's search index crawler—it builds the index that powers ChatGPT Search. Blocking GPTBot opts you out of training. Blocking OAI-SearchBot removes you from ChatGPT search results. They are separate systems, operate independently, and require separate robots.txt entries.

What are ClaudeBot, Claude-SearchBot, and Claude-User?

These are Anthropic's three separate web crawlers, formally documented in February 2026. ClaudeBot collects content for AI model training. Claude-SearchBot crawls the web to build Claude's search index—blocking it reduces your visibility in Claude search results. Claude-User fetches specific pages when a Claude user asks a question requiring web access. Each is controlled independently in robots.txt.

Can I block AI training crawlers without losing AI search visibility?

Yes. OpenAI, Anthropic, and Google all operate separate crawlers for training and for search citation. Blocking GPTBot, ClaudeBot, and Google-Extended stops your content from being used for model training. Allowing OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, and Claude-User preserves your visibility in AI search answers. These are independent decisions controlled by separate robots.txt directives.

Does blocking Google-Extended affect my Google Search ranking?

No. Google-Extended controls whether Google may use already-crawled content for Gemini AI training only. It is completely separate from Googlebot, which crawls for traditional Google Search. Blocking Google-Extended has zero effect on your Google Search rankings.

What robots.txt rules do I need to stay citable in ChatGPT and Claude?

To stay citable in ChatGPT, allow OAI-SearchBot and ChatGPT-User. To stay citable in Claude, allow Claude-SearchBot and Claude-User. For Perplexity, allow PerplexityBot and Perplexity-User. If you also want to block training crawlers, add explicit Disallow rules for GPTBot, ClaudeBot, and Google-Extended. Confirm your robots.txt is reachable—if Cloudflare's Bot Fight Mode is active, crawlers may be blocked before they read the file.

My website platform doesn't let me edit robots.txt — what can I do?

This is a genuine infrastructure limitation on platforms like Kajabi and Squarespace. Squarespace's AI Crawlers toggle blocks or allows all AI bots as a group, so you cannot separate training from citation crawlers. Kajabi generates a robots.txt automatically with no native editor at all, although you can request custom settings through their support team. GoHighLevel does offer a robots.txt editor (Settings → Domains → Manage), but no AI-specific guidance ships with it by default, so you need to know which directives to add and why. Your options are to use whatever workaround your platform supports, accept the limitation, or move to infrastructure you own. If this level of control matters for your business, and for most founders who want AI visibility it does, it is one of the clearest arguments for building on infrastructure you actually own and understand.


Aimee Q Devlin

Aimee Q Devlin is a Systems and Infrastructure Architect based in San Miguel de Allende, Mexico. She works with founders and operators of established businesses whose sites aren't ranking, converting, or being cited by AI—and builds the infrastructure that fixes it properly. She developed the PRISM Framework, an AEO framework for making founder-led businesses visible to ChatGPT, Perplexity, and the AI engines shaping discovery in 2026. The Infrastructure Audit is where most engagements begin.
