The AEO Podcast, Episode 15: The AI Crawler Audit
AI crawlers are now part of the acquisition stack. This episode covers how to audit AI crawler access: robots rules, server logs, CDN and WAF settings, and product feeds, without blindly opening the whole store.
Check your own store's AI visibility with our free diagnostic tool -- it takes about 75 seconds and requires no signup.
Full Transcript
HOST: Welcome back. Jessica here with Matt. Today's episode is called The AI Crawler Audit, which sounds like something nobody wants to do.
MATT: Correct. Nobody wants to do it. Everyone should.
HOST: What's the problem?
MATT: A lot of merchants are trying to optimize for ChatGPT, Perplexity, Gemini, and AI Overviews while accidentally blocking the systems that need to read their site. Sometimes in robots.txt. Sometimes in Cloudflare. Sometimes in a WordPress security plugin. Sometimes through server rules nobody remembers installing.
HOST: So they are writing AEO content that the AI cannot access.
MATT: Exactly. It is like printing a beautiful catalog and locking it in a closet.
HOST: Let's list the crawlers people should know.
MATT: Start with Googlebot. Googlebot is still the main crawler for Google Search, and Google Search is the foundation for AI Overviews and AI Mode. If you block Googlebot, you have bigger problems than AEO.
HOST: Google-Extended?
MATT: Google-Extended is different. It is a control for some Google AI training and grounding uses. It is not the same as blocking Googlebot from Search. Merchants need to make a policy decision there. If you want maximum AI visibility through Google systems, don't casually block it without understanding the tradeoff.
HOST: OpenAI?
MATT: OpenAI has OAI-SearchBot for search discovery in ChatGPT. Their merchant guidance says if you want products discoverable in ChatGPT search, make sure OAI-SearchBot is not blocked. This is separate from broader training-crawler debates.
HOST: Perplexity?
MATT: Perplexity publishes very clear crawler docs. PerplexityBot is designed to surface and link websites in Perplexity search results. They recommend allowing it in robots.txt and allowing requests from their published IP ranges. They also have Perplexity-User, which supports user-requested fetches and can behave differently from the crawler.
HOST: And Claude?
MATT: Anthropic now documents three separate bots. ClaudeBot is the training crawler. Claude-User supports user-directed fetches when someone asks Claude to access a page. Claude-SearchBot is for search result quality. If you block the Claude bots by default, you are making a visibility bet. Maybe that's intentional. Often it's accidental.
HOST: So this is not about becoming a bot expert. It's about making sure your public pages are not accidentally closed.
MATT: Exactly. You need a repeatable audit.
HOST: So how do we audit?
MATT: Five steps.
Step one: open your robots.txt. Literally go to yourdomain dot com slash robots dot txt. Search for these names: Googlebot, Google-Extended, OAI-SearchBot, GPTBot, PerplexityBot, ClaudeBot, Claude-User, Claude-SearchBot, and Bingbot. If you see "Disallow: /" under a crawler you care about, that crawler is blocked.
HOST: Is GPTBot the same as OAI-SearchBot?
MATT: No. GPTBot has historically been associated with crawling for model improvement. OAI-SearchBot is for search discovery and surfacing in ChatGPT. Some merchants may choose to block training crawlers but allow search crawlers. That is reasonable. The mistake is blocking everything because a plugin had a scary toggle labeled "block AI bots."
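That split policy looks like this in robots.txt. A sketch only: the directives are standard robots.txt syntax, but whether to block the training crawler is the merchant's call.

```
# Block OpenAI's training crawler...
User-agent: GPTBot
Disallow: /

# ...but allow its search crawler, so products stay
# discoverable in ChatGPT search
User-agent: OAI-SearchBot
Allow: /
```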
HOST: Step two?
MATT: Check your CDN and WAF. Cloudflare, AWS WAF, hosting security layers. Perplexity's docs specifically mention WAF allow rules using user-agent plus IP ranges. Robots.txt is not the only gate. A WAF can block the request before robots.txt even matters.
HOST: How does a normal merchant check that?
MATT: If you're on Cloudflare, open Security, WAF, custom rules, bot rules, and managed rules. Search for AI bot blocking, crawler blocking, suspicious user-agent rules, and rate-limit rules. If an agency manages Cloudflare, ask them directly: "Are we blocking OAI-SearchBot, PerplexityBot, ClaudeBot, or Bingbot?"
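One way to check the edge yourself is to see what HTTP status each crawler user-agent string gets back. The sketch below generates the curl commands rather than hitting the network; the domain is a placeholder and the user-agent names are shorthand, not the vendors' exact full UA strings. Note that a matching user-agent alone is only indicative: Perplexity, for example, also publishes IP ranges, and some WAF rules key on IP rather than UA.

```shell
# Placeholder; substitute your real store URL before running the output
domain="https://yourdomain.com/"
cmds=""
for ua in OAI-SearchBot PerplexityBot ClaudeBot Claude-User Claude-SearchBot Bingbot; do
  # -o /dev/null discards the body; -w prints only the HTTP status code
  cmds+="curl -s -o /dev/null -w '$ua: %{http_code}\n' -A '$ua' $domain"$'\n'
done
printf '%s' "$cmds"
```

Run the printed commands against your live site: a 200 for your own browser UA but a 403 for a crawler UA is the signature of an edge rule, not a robots.txt rule.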
HOST: Step three?
MATT: Check server logs. Search logs for crawler user agents. Are they visiting? What status code do they get? Two hundred means allowed. Four oh three means blocked. Four oh four may mean they hit weird URLs. Five hundreds mean your site broke when the bot arrived.
HOST: What if they don't have log access?
MATT: Use what you can. Search Console crawl stats for Google. Bing Webmaster Tools for Bing. Cloudflare logs if available. Hosting access logs if you're on WordPress. Shopify merchants usually won't get raw server logs, so their practical audit is Shopify Agentic settings, theme source output, robots behavior, app settings, and analytics or referrer reports.
HOST: Step four?
MATT: Fetch your own page like a crawler. Use curl. Run `curl -L https://yourdomain.com/products/example`. Then look at the response. Does it include product title, description, price, schema, FAQ? Or just a shell that says "enable JavaScript"?
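After fetching, you can grep the raw HTML for the things a non-JavaScript crawler needs to see. The sample page below is a stand-in for real curl output; replace the here-doc with `curl -sL https://yourdomain.com/products/example > /tmp/page_sample.html` to test a live page:

```shell
# Sample fetched HTML standing in for real curl output
cat > /tmp/page_sample.html <<'EOF'
<html><head>
<script type="application/ld+json">{"@type":"Product","name":"Example","offers":{"price":"19.99"}}</script>
</head><body><h1>Example</h1><p>Description here.</p></body></html>
EOF

# Does the raw HTML contain the essentials, or do they only render via JS?
for marker in 'application/ld+json' '"@type":"Product"' 'price'; do
  if grep -q "$marker" /tmp/page_sample.html; then
    echo "FOUND: $marker"
  else
    echo "MISSING: $marker (may only appear after JavaScript runs)"
  fi
done
```

If the markers only show up in the rendered page and not in the curl output, your product data is invisible to any crawler that doesn't execute JavaScript.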
HOST: Step five?
MATT: Decide your policy. The answer is not always "allow every bot everywhere." For admin pages, cart pages, account pages, search result pages, and internal endpoints, block away. For public product pages, category pages, guides, FAQs, location pages, and blog posts, be deliberate.
HOST: Give me a clean robots policy example.
MATT: For a normal ecommerce site: allow Googlebot, Bingbot, OAI-SearchBot, PerplexityBot, and Claude-SearchBot on public content. If you want Claude user-directed fetches to work, allow Claude-User too. ClaudeBot is the training crawler, so some merchants may make a different policy decision there. Disallow admin, cart, checkout, account, and internal search URLs. Use noindex where appropriate. Keep sitemaps clean.
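Put together, that policy might look like the robots.txt sketch below. The paths and sitemap URL are placeholders; adjust them to your platform's actual URL structure, and treat the training-crawler block as one possible policy choice, not a recommendation.

```
# Public content: allow search and citation crawlers
User-agent: Googlebot
User-agent: Bingbot
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: Claude-SearchBot
User-agent: Claude-User
Allow: /

# Training crawlers: a policy decision -- this example blocks them
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

# Everyone: keep private and utility paths out
User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /search

Sitemap: https://yourdomain.com/sitemap.xml
```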
HOST: Perplexity-User is the confusing one because it can ignore robots.txt, right?
MATT: Perplexity's docs say Perplexity-User supports user-requested actions and generally ignores robots.txt because a user requested the fetch. That does not mean you have no control. WAF and access controls still exist. But it means the crawler ecosystem is no longer one simple robots.txt file.
HOST: What can a content tool not fix here?
MATT: SEOMelon can flag visible AEO issues, generate source content, and help with Shopify or WordPress product clarity. But crawler access is infrastructure. If your WAF blocks the bot, no content tool fixes that.
HOST: What's the most common mistake you see?
MATT: "Block AI bots" toggles. The privacy concern is real, but merchants flip them on without distinguishing between training, search, citation, and user-requested fetches. Six months later they ask why they're not showing up in Perplexity.
HOST: Give me the rule of thumb.
MATT: Before you optimize for an AI platform, make sure the platform can reach your public pages. Robots.txt, WAF, logs, source HTML. Boring checklist. Real consequences.
HOST: What's the quick audit?
MATT: Open robots.txt today and search for OAI-SearchBot, PerplexityBot, ClaudeBot, Claude-User, Claude-SearchBot, and Googlebot. If you don't understand what you see, send it to your developer before you publish another AEO blog post.
HOST: Next time, we'll measure whether any of this work is actually showing up in traffic, prompts, and sales.
Check your store's AI visibility
Free, 75 seconds, no signup. See how your store scores when real buyers ask AI for recommendations.
Run the free diagnostic