There’s a small file at the root of your website called robots.txt. Most marketers have never opened it. Yet it’s quietly deciding whether AI engines like ChatGPT, Claude, Perplexity, and Google’s AI answers can read your content at all.
The uncomfortable part: for most sites, nobody actually made that decision. It was inherited: a CMS default, or a “block the AI scrapers” snippet someone copy-pasted during the 2023 panic, and it has been enforcing a strategy nobody chose ever since.
So let’s make it. On purpose this time.
First, kill the moral framing
One camp treats letting AI read your content as surrender: the bots scrape, give nothing back, and train models that will compete with you. The other treats blocking AI as career suicide: you’re making yourself invisible in the fastest-growing online discovery channel. Both have a point; neither deserves to be dismissed or swallowed whole.
But the reframe that makes this tractable is this: for a business website, this is a marketing decision, not an ethical stand. The question isn’t “Is AI scraping fair?” It’s “Does being readable by this specific bot serve my goals?” Ask it that way, and the fog clears, because different bots do very different jobs, and lumping them together is how people make the wrong call.
It’s no longer one switch
The old model was binary: allow AI, or block AI. That model is dead, and clinging to it is the most common way sites sabotage themselves in 2026.
Each major AI company now runs several bots, and they fall into three buckets:
- Training crawlers harvest your pages to train future models. Neither allowing nor blocking them changes whether you get cited, so it’s purely a privacy/IP call.
- Search and retrieval crawlers index you so you can be cited in AI answers. Block these, and you vanish from that engine entirely.
- User-triggered bots fetch a page in real time when someone asks. Block these, and you turn away real, intent-driven visits.
One robots.txt file is making all three decisions at once, and the training bot is not the same bot as the search bot, even when one company runs both.
The bots, in plain English
You don’t need to memorize these. Just recognize the pattern: most engines split their crawling between a training crawler and a search crawler.
- OpenAI: GPTBot trains; OAI-SearchBot is what gets you into ChatGPT’s answers; ChatGPT-User fetches live. The one you cannot afford to block is OAI-SearchBot, not GPTBot.
- Anthropic (Claude): ClaudeBot trains; Claude-SearchBot and Claude-User drive citations. (The old Claude-Web and anthropic-ai strings are deprecated. Blocking only those means you’re not actually blocking current Anthropic, just out of date.)
- Google: Leave Googlebot alone; that’s normal search. Google-Extended is the training lever only; blocking it does not remove you from Google Search or AI Overviews.
- Perplexity: PerplexityBot indexes, Perplexity-User fetches live.
- Apple: Applebot powers Siri/Spotlight; Applebot-Extended is the separate training opt-out.
- Common Crawl (CCBot): an open dataset feeding many models’ training. No direct citation upside.
Nearly every “do I appear in this engine’s answers?” decision lives with the search bots; nearly every “do I want my content used for training?” decision lives with the training bots. Different questions, different answers.
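To make the split concrete, here is a minimal sketch of the bot list as a lookup table. The user-agent names come from the list above; the bucket labels, the wording of the questions, and the helper names `bots_in` and `question_for` are this sketch's own. Googlebot is left out because it is ordinary search, not an AI-specific lever.

```python
# Roles follow the list above; bucket labels and layout are this sketch's own.
AI_BOTS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user-triggered",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "user-triggered",  # assumption: filed with the live fetchers by name
    "Google-Extended": "training",
    "PerplexityBot": "search",
    "Perplexity-User": "user-triggered",
    "Applebot": "search",
    "Applebot-Extended": "training",
    "CCBot": "training",
}

# The decision each bucket really represents.
QUESTION_BY_BUCKET = {
    "training": "Do I want my content used for training?",
    "search": "Do I want to appear in this engine's answers?",
    "user-triggered": "Do I want to serve a page someone just asked for?",
}


def bots_in(bucket):
    """Return the user-agent names filed under one bucket, alphabetically."""
    return sorted(name for name, role in AI_BOTS.items() if role == bucket)


def question_for(bot):
    """Return the decision a robots.txt rule for this bot is really making."""
    return QUESTION_BY_BUCKET[AI_BOTS[bot]]
```

Calling `question_for("GPTBot")` and `question_for("OAI-SearchBot")` returns two different questions, which is the whole point: one company, two decisions.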
The “selective allow” school, and where it’s right
The sophisticated take: don’t allow everything, don’t block everything. Curate. Allow the search bots that cite you, block the pure training crawlers and open datasets, and keep data brokers out. Get the visibility, keep the IP.
It’s a smart framework — for the right site. If you’re a publisher living on impressions and subscriptions, a paywalled business, or anyone whose writing or research is the product itself, the training/search split is the most important line you’ll draw all year. And it aligns with the privacy and data-ownership instincts that should already shape an EU-facing site: opting out of training via Google-Extended, Applebot-Extended, and explicit blocks on GPTBot, ClaudeBot, and CCBot is a perfectly defensible posture if training consent matters to you in principle.
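As a rough illustration of that posture, the sketch below embeds a selective robots.txt as a string and checks it with Python's standard `urllib.robotparser`. The file contents are one possible reading of "block training, allow search", not a recommended template, and the helper name `allowed` is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# A selective policy: the training crawlers named above are shut out,
# everything else (including the search and live-fetch bots) stays open.
SELECTIVE_ROBOTS_TXT = """\
# Opt out of model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# Everyone else may crawl
User-agent: *
Disallow:
"""


def allowed(robots_txt, user_agent, path="/"):
    """Report whether this robots.txt lets the given user agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for bot in ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "Googlebot"]:
        verdict = "allowed" if allowed(SELECTIVE_ROBOTS_TXT, bot) else "blocked"
        print(f"{bot}: {verdict}")
```

Python's parser is only an approximation of how each real crawler interprets the file, so treat the result as a sanity check on your intent, not proof of what a given bot will do.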
For a marketing site, that sophistication mostly evaporates
Here’s the position I’d actually argue for. If your site exists to market your business, the careful curation collapses into a simpler answer: open the gates to almost everything.
A marketing site’s content exists to be found, read, and acted on; visibility is the entire point. So when a discovery channel appears that grew its visit volume by roughly 43% year over year and now rivals classic search, the default should be “yes, please read my content,” not “prove why I should let you.”
And the thing the “protect your IP” crowd glosses over is that a marketing site’s content usually isn’t the asset. Your blog post about choosing a CRM isn’t a product someone’s stealing. It’s an advertisement for your expertise that you want copied, summarized, and spread. If a model trains on it and a future answer sounds like you, that’s not theft, that’s reach. The marketing site fretting about a model learning from its lead-gen blog is guarding a vault full of flyers.
So: allow every search and retrieval bot, no hesitation. That’s free visibility, and blocking it is self-sabotage. For the training bots, lean toward allowing them too unless you have a specific reason not to. The one rule that survives regardless: keep genuinely sensitive paths — customer portals, checkout, anything with personal data — disallowed for everyone. That’s not an AI decision, it’s basic hygiene.
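A minimal sketch of that open-gates policy, again checked with Python's standard `urllib.robotparser`. The paths in `SENSITIVE_PATHS` are hypothetical placeholders, and `build_open_policy` and `can_read` are names invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths; substitute whatever actually holds personal data on your site.
SENSITIVE_PATHS = ["/account/", "/checkout/", "/portal/"]


def build_open_policy(sensitive_paths):
    """Build a robots.txt that welcomes every bot but fences off sensitive paths."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in sensitive_paths]
    return "\n".join(lines) + "\n"


def can_read(robots_txt, user_agent, path):
    """Report whether this robots.txt lets the given user agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


OPEN_POLICY = build_open_policy(SENSITIVE_PATHS)

if __name__ == "__main__":
    print(OPEN_POLICY)
    print(can_read(OPEN_POLICY, "GPTBot", "/blog/choosing-a-crm"))
    print(can_read(OPEN_POLICY, "OAI-SearchBot", "/checkout/cart"))
```

One design note: under the Robots Exclusion Protocol a crawler follows the most specific group that matches it, so if you later add a group for a single bot, repeat the sensitive Disallow lines there, because that bot will no longer read the `*` group.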
The traps that matter more than the philosophy
For most marketing sites, the debate is the easy part. The expensive mistakes are mechanical:
- The leftover blanket block. A large share of business sites still run a “block AI” rule from 2023, which leaves them quietly invisible in engines they’d love to appear in. Go look at your actual robots.txt; you may already be losing citations you never knew were on the table.
- The CDN override. The silent killer: your server’s robots.txt says the right things, while your CDN or security layer (Cloudflare and others have AI-bot toggles) overrides it and blocks those bots anyway. Conflicting edge rules are a leading cause of accidental invisibility, so check both the file and the edge settings.
- Stale bot names. Crawlers are renamed and split regularly; a robots.txt file written 18 months ago is likely blocking deprecated bots and omitting current ones. Re-check after every CMS or CDN change.
- It’s a request, not a wall. robots.txt is voluntary. Reputable bots honor it; aggressive scrapers ignore it or spoof a trusted user agent. To genuinely prevent access, you need server- or firewall-level enforcement, not a text file. And llms.txt, despite the hype, is not an access-control mechanism.
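The first three traps can be checked mechanically. The sketch below is one way to do it with Python's standard library, under stated assumptions: the bot lists are copied from this article and will themselves go stale, and the function names `audit_robots_txt` and `edge_status` are invented here. The audit works on robots.txt text you paste in; the edge probe makes a live request and is only a rough signal, since some security layers may identify bots by IP address and not by user-agent string.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

# Names taken from this article; re-verify them against each vendor's docs.
CURRENT_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "Google-Extended", "PerplexityBot", "Perplexity-User",
    "Applebot-Extended", "CCBot",
]
DEPRECATED_AGENTS = ["Claude-Web", "anthropic-ai"]


def audit_robots_txt(robots_txt):
    """List current bots blocked at the site root and stale names the file mentions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [bot for bot in CURRENT_BOTS if not parser.can_fetch(bot, "/")]

    # Collect every declared User-agent value, ignoring trailing comments.
    declared = set()
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            declared.add(value.split("#")[0].strip().lower())
    stale = [name for name in DEPRECATED_AGENTS if name.lower() in declared]
    return {"blocked": blocked, "stale": stale}


def edge_status(url, user_agent, timeout=10.0):
    """Fetch a URL while presenting the given User-Agent; return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


# A 2023-style leftover: it names deprecated Anthropic agents and misses ClaudeBot.
LEGACY_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /
"""

if __name__ == "__main__":
    print(audit_robots_txt(LEGACY_ROBOTS_TXT))
```

For the CDN trap, compare `edge_status(url, "Mozilla/5.0")` with `edge_status(url, "OAI-SearchBot")` against a page on your own site. If robots.txt allows the bot but only the second call comes back 403, an edge rule is a likely suspect.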
The takeaway
The real point isn’t “allow” or “block.” It’s that this stopped being a single switch a while ago, and most sites are still flipping it as if it were. The selective-blocking crowd is right that the decision deserves nuance — they’re just usually picturing a publisher, not a marketing team. If your content exists to win you customers, the burden runs the other way: a bot should have to earn its way out, not in.
