Article No. 42

Robots.txt: The Complete Guide to Directives, Testing, and Common Mistakes

A robots.txt file is a plain text file, hosted at the root of a domain, that tells crawlers which parts of a site they’re allowed to request. That’s the entire job of the file. It does not remove pages from Google’s index, it does not boost or hurt rankings, and it doesn’t control what happens to a URL after Google has already crawled it once. Confusing “crawling” with “indexing” is the single most common mistake people make with this file, and it’s worth stating plainly, once, before getting into syntax: blocking a URL in robots.txt stops Googlebot from requesting that URL, but if the URL is already indexed or is linked to from elsewhere, Google can still show it in search results, typically with no description, because it was never allowed to read the page.

This guide covers the file itself: the directives, the syntax rules, the precedence logic when rules conflict, and the mistakes that cause real damage. It does not cover the Google Search Console (GSC) “Blocked by robots.txt” diagnostic workflow for a specific URL, or crawl-budget strategy more broadly; robots.txt is one lever among several for managing crawl budget, but the full playbook for that is a separate topic.

Core Directives

A robots.txt file is built from a small set of directives, grouped into blocks by User-agent.

Directive | Purpose | Example
User-agent | Names which crawler the following rules apply to | User-agent: *
Disallow | Blocks crawling of a path | Disallow: /cart/
Allow | Carves out an exception within a blocked path | Allow: /checkout/confirmation/
Sitemap | Points crawlers to a sitemap file (can appear anywhere in the file, not tied to a User-agent block) | Sitemap: https://example.com/sitemap.xml

A basic file looks like this:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Allow: /checkout/confirmation/

Sitemap: https://example.com/sitemap.xml

User-agent: * applies the rules to every crawler that doesn’t have its own dedicated block. If a site wants different rules for Googlebot specifically, it adds a separate User-agent: Googlebot block; Google’s crawlers will follow the most specific matching group, not both.
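
The group-selection behavior described above can be sketched in Python. This is a simplified illustration, not Google's parser: the function names parse_groups and select_group are hypothetical, and the sketch assumes an exact, case-insensitive match on the crawler token, falling back to the * group when no dedicated group exists.

```python
def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                # A User-agent line after rules starts a new group.
                current_agents = []
                seen_rule = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def select_group(groups, crawler_token: str):
    """Return the dedicated group if one exists, otherwise the '*' group."""
    token = crawler_token.lower()
    if token in groups:
        return groups[token]
    return groups.get("*", [])


SAMPLE = """User-agent: *
Disallow: /cart/

User-agent: Googlebot
Disallow: /private/
"""

groups = parse_groups(SAMPLE)
print(select_group(groups, "Googlebot"))  # only the Googlebot group applies
print(select_group(groups, "Bingbot"))    # falls back to the * group
```

Note that in this sample Googlebot is not blocked from /cart/, because it follows its own group and ignores the * group entirely.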

One directive that shows up in older robots.txt files but does nothing for Google is Crawl-delay. It was never a Google-supported directive (some other search engines historically honored it), and Google’s crawlers ignore it entirely. Adjusting Googlebot’s crawl rate isn’t done through robots.txt at all; it happens indirectly, through server response time and reliability, which feed into Google’s own crawl-capacity calculations.

Wildcards and Pattern Matching

Two special characters give directives pattern-matching power beyond exact paths, per Google’s robots.txt reference:

  • * matches zero or more of any character.
  • $ anchors the match to the end of the URL.

A couple of worked examples:

Disallow: /*.pdf$ blocks any URL ending in .pdf, regardless of the path in front of it (/downloads/report.pdf and /files/2026/spec.pdf are both blocked), but leaves /downloads/report.pdf.html untouched because that URL doesn’t end in .pdf.

Disallow: /search* blocks every URL beginning with /search, including /search, /search/, and /search?q=widgets, because the trailing * matches anything that follows. Note that a trailing wildcard is redundant: Google treats /search* and /search identically, since any Disallow rule already matches everything that starts with the given path.
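
One way to reason about these patterns is to translate them into regular expressions. The sketch below is an approximation under stated assumptions: rule_to_regex and rule_matches are hypothetical helper names, matching is done against the URL path plus query string, and only the two special characters described above are handled.

```python
import re


def rule_to_regex(path_pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = path_pattern.endswith("$")
    body = path_pattern[:-1] if anchored else path_pattern
    # Escape literal pieces and let each * match zero or more characters.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def rule_matches(path_pattern: str, url_path: str) -> bool:
    # re.match anchors at the start, mirroring the prefix behavior of rules.
    return rule_to_regex(path_pattern).match(url_path) is not None


print(rule_matches("/*.pdf$", "/downloads/report.pdf"))       # True
print(rule_matches("/*.pdf$", "/downloads/report.pdf.html"))  # False
print(rule_matches("/search*", "/search?q=widgets"))          # True
print(rule_matches("/search", "/search?q=widgets"))           # True
```

The last two calls show the redundancy of the trailing wildcard: with or without it, the rule behaves as a prefix match.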

Precedence: How Google Resolves Conflicting Rules

When more than one rule could apply to the same URL, Google doesn’t go by the order the rules appear in the file. It applies the most specific rule, measured by the length of the rule’s path, and if two rules are equally specific, it defaults to the least restrictive one (Allow over Disallow).

Worked example:

User-agent: *
Disallow: /guides/
Allow: /guides/free-guide.html

A request for /guides/free-guide.html matches both rules. /guides/free-guide.html (23 characters) is more specific than /guides/ (8 characters), so the Allow rule wins and the page can be crawled. Everything else under /guides/ stays blocked. This is the logic that trips people up most often: they assume the last rule in the file wins, or that Disallow always overrides Allow, and neither is true.
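
The precedence logic can be sketched as a short function. This is a minimal illustration, assuming plain prefix rules with no wildcards; is_allowed is a hypothetical name, and rules are passed as (directive, path) pairs.

```python
def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """Longest matching rule wins; on a tie in length, Allow wins."""
    best_len = -1
    best_allow = True  # no matching rule means crawling is allowed
    for directive, path in rules:
        if not path or not url_path.startswith(path):
            continue
        allow = directive == "allow"
        if len(path) > best_len or (len(path) == best_len and allow):
            best_len, best_allow = len(path), allow
    return best_allow


rules = [
    ("disallow", "/guides/"),
    ("allow", "/guides/free-guide.html"),
]

print(is_allowed(rules, "/guides/free-guide.html"))  # True
print(is_allowed(rules, "/guides/paid-guide.html"))  # False
print(is_allowed(rules, "/about/"))                  # True
```

Reordering the two entries in rules gives the same results, which is the point: file order plays no part in the decision.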

File Location, Caching, and the Size Cap

Robots.txt must live at the top-level directory of the host it governs, meaning example.com/robots.txt, not example.com/pages/robots.txt. Rules in that file apply only to that exact host, protocol, and port; a robots.txt file at https://example.com/robots.txt says nothing about http://example.com or shop.example.com, which each need their own file if their crawling rules differ.
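
To make the scoping rule concrete, here is a small sketch that derives which robots.txt file governs a given page URL. The helper name robots_url_for is hypothetical; it uses only the standard library's urllib.parse and keeps the scheme, host, and port of the input unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the same scheme, host, and port."""
    parts = urlsplit(page_url)
    # Path is replaced with /robots.txt; query and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://example.com/pages/about.html?x=1"))
print(robots_url_for("http://example.com/"))
print(robots_url_for("https://shop.example.com:8443/cart"))
```

Each of the three inputs resolves to a different file, because each differs in protocol, host, or port.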

Google typically caches a site’s robots.txt for up to 24 hours, though it may hold onto a cached copy longer if a fresh fetch isn’t possible, and caching duration can be influenced by the Cache-Control header the server sends with the file.

There’s also a hard size limit: Google enforces a maximum robots.txt file size of 500 kibibytes (KiB). Content past that point is ignored entirely. This matters most for large, auto-generated e-commerce sites where a robots.txt file gets built programmatically from category and filter rules; if that generation logic runs unchecked, the file can silently exceed the cap, and everything after the 500 KiB mark, including a Sitemap directive placed near the bottom, stops being read.
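
A build pipeline that generates robots.txt could guard against this with a check along the following lines. This is a sketch, not a description of how Google truncates the file: lines_past_cap is a hypothetical helper that flags every line ending beyond the 500 KiB mark, counting bytes rather than characters.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB


def lines_past_cap(robots_bytes: bytes) -> list[str]:
    """Return the lines that end beyond the size cap."""
    flagged = []
    offset = 0
    for raw in robots_bytes.splitlines(keepends=True):
        offset += len(raw)
        if offset > MAX_ROBOTS_BYTES:
            flagged.append(raw.decode("utf-8", errors="replace").strip())
    return flagged


# Simulate an oversized, auto-generated file with the Sitemap line last.
generated = b"Disallow: /a\n" * 40000
generated += b"Sitemap: https://example.com/sitemap.xml\n"

at_risk = lines_past_cap(generated)
print(len(generated) > MAX_ROBOTS_BYTES)  # True
print(at_risk[-1])                        # the Sitemap line is past the cap
```

Failing the build when the returned list is non-empty, or moving Sitemap lines to the top of the file, are two simple mitigations.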

The underlying syntax rules are formalized in RFC 9309, the Robots Exclusion Protocol standard published by the IETF in September 2022 with Google engineers among the authors. It’s worth linking directly if a post needs to cite “the robots.txt standard” as more than a Google-specific convention.

Why noindex: in Robots.txt No Longer Works

For years, some site owners used an undocumented noindex: line inside robots.txt, expecting it to keep a page out of the index the same way a meta robots tag would. It sometimes appeared to work, but it was never officially supported, only tolerated by Google’s crawler code. Google announced on July 2, 2019 that as of September 1, 2019, it would stop honoring unsupported and unpublished robots.txt rules, noindex included, in the interest of keeping the ecosystem consistent with the actual robots exclusion standard (Google Search Central Blog: “A note on unsupported rules in robots.txt”).

If a page genuinely needs to stay out of the index, the current options are:

  • A noindex directive in the page’s meta robots tag, or an X-Robots-Tag HTTP header for non-HTML files.
  • A 404 or 410 status code, if the page should be gone entirely.
  • Password protection, for content that shouldn’t be publicly accessible at all.
  • Disallow in robots.txt, which stops crawling but is not a reliable way to keep a URL out of the index if it’s already linked to from elsewhere.
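
As an illustration of the X-Robots-Tag option for non-HTML files, the sketch below wraps a Python WSGI application so that PDF responses carry a noindex header. The middleware name add_noindex_for_pdfs and the .pdf suffix check are assumptions made for the example; most sites would set this header in their web server or CDN configuration instead.

```python
def add_noindex_for_pdfs(app):
    """WSGI middleware: add X-Robots-Tag: noindex to PDF responses."""

    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "").lower().endswith(".pdf"):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


# Minimal demonstration with a stand-in application and start_response.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b""]


def show_headers(status, headers, exc_info=None):
    print(status, headers)


app = add_noindex_for_pdfs(demo_app)
app({"PATH_INFO": "/files/spec.pdf"}, show_headers)
```

For the header to be seen at all, the URL must not be disallowed in robots.txt: a crawler that is blocked from requesting the file never receives the response headers.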

Any robots.txt file still carrying a noindex: line has been doing nothing since September 2019 and should be cleaned up, since its presence usually signals the site owner believes a page is deindexed when it may not be.
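
Finding leftover lines like this is easy to automate. The following sketch flags any directive outside the four covered in this guide; unsupported_lines is a hypothetical helper, and the SUPPORTED set reflects the directives this article describes Google as honoring, not the behavior of other crawlers.

```python
SUPPORTED = {"user-agent", "allow", "disallow", "sitemap"}


def unsupported_lines(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for directives outside SUPPORTED."""
    findings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line or ":" not in line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in SUPPORTED:
            findings.append((number, line))
    return findings


sample = """User-agent: *
Disallow: /cart/
Noindex: /old-page/
Crawl-delay: 10
"""

for number, text in unsupported_lines(sample):
    print(f"line {number}: {text}")
```

A flagged Crawl-delay line may still be intentional if it sits in a block aimed at a crawler that honors it; a flagged noindex line is worth following up with a real noindex mechanism.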

Testing Your File

Search Console’s robots.txt report (under Settings) shows the version of the file Google has fetched and cached most recently, flags syntax errors and warnings, and lets a site owner check whether a specific URL is allowed or blocked under the current rules. For quick manual checks, requesting /robots.txt directly in a browser confirms the file is reachable and returning the expected content; a 404 on that path is treated by Google as “no robots.txt restrictions,” which is a valid state, but a 5xx server error on that path can cause Google to pause crawling of the whole site out of caution until the file becomes reachable again.
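
The status-code behavior described above can be summarized as a small lookup. This sketch encodes only the cases this section mentions; the function name and the returned labels are hypothetical, and any status code not discussed here is deliberately reported as unspecified instead of being guessed.

```python
def crawl_policy_for_status(status: int) -> str:
    """Summarize how a robots.txt fetch result affects crawling."""
    if 200 <= status < 300:
        return "parse-rules"     # file is reachable; apply its rules
    if status == 404:
        return "allow-all"       # treated as no robots.txt restrictions
    if 500 <= status < 600:
        return "pause-crawling"  # crawling may pause until the file is reachable
    return "unspecified"         # not covered by this guide


for code in (200, 404, 503):
    print(code, crawl_policy_for_status(code))
```

A monitoring job that fetches /robots.txt on a schedule and alerts on anything other than the expected result is a cheap safeguard against the 5xx case.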

Common Mistakes

  • Blocking CSS and JavaScript. Google renders pages to evaluate them, and a robots.txt rule that blocks /assets/ or /js/ can prevent Googlebot from seeing the page the way a visitor does, which can affect how the page is understood and indexed.
  • A stray Disallow: / left over from staging. This single line blocks the entire site and is a common casualty of migrating a staging robots.txt file to production without editing it first.
  • Confusing Disallow with noindex. Disallowing a URL doesn’t guarantee it stays out of search results if other sites link to it; Google can still index a blocked URL based on external signals, just without being able to read its content.
  • Case sensitivity mistakes. Paths in robots.txt are case-sensitive. Disallow: /Private/ does not block /private/.
  • Wrong file location or wrong host. A robots.txt file placed in a subdirectory, or one written for www.example.com when the live site serves from the bare domain, does nothing for the host it wasn’t placed on.
  • Assuming one file covers every subdomain. blog.example.com and shop.example.com are separate hosts as far as robots.txt is concerned, each needing its own file at its own root if their crawling rules differ from the main domain’s.
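
Some of these mistakes can be caught before deployment. The sketch below flags a site-wide block left over from staging; blanket_disallow_lines is a hypothetical helper, and it treats both Disallow: / and Disallow: /* as blocking everything.

```python
def blanket_disallow_lines(robots_txt: str) -> list[int]:
    """Return line numbers of Disallow rules that block the whole site."""
    hits = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "disallow" and value in ("/", "/*"):
            hits.append(number)
    return hits


staging_file = """User-agent: *
Disallow: /
"""

production_file = """User-agent: *
Disallow: /cart/
"""

print(blanket_disallow_lines(staging_file))     # [2]
print(blanket_disallow_lines(production_file))  # []
```

Running a check like this in the deployment pipeline for the production host turns a silent, site-wide crawl block into a failed build.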

Robots.txt and Other Search Engines

Everything above describes Google’s implementation specifically, but the core rule set (User-agent, Disallow, Allow, and the * and $ wildcards) is shared across essentially every major crawler because it’s now formalized in RFC 9309. Sitemap is not part of that core: the RFC mentions it only as an example of an additional record that crawlers may choose to interpret, though the major search engines all support it. Where crawlers diverge is on unofficial extensions: Crawl-delay, ignored by Google but honored by Bing and some other engines, is the most common example. A site targeting multiple search engines with meaningfully different crawl-rate needs can still include a Crawl-delay line in a User-agent block scoped to that specific engine; it simply won’t do anything in the block Google reads.

Getting robots.txt right is mostly a matter of remembering what the file actually controls (crawling, not indexing), testing changes before they go live, and keeping the file lean enough that auto-generated rules never bump into the 500 KiB ceiling.
