Article No. 42
Crawl Budget: When It Actually Matters and How to Fix It
Most sites reading a “crawl budget optimization” guide don’t actually need one. That’s not a hook but Google’s own position: its crawl budget documentation opens by telling most readers to stop reading. Per Google’s crawl budget management guide, “if your site doesn’t have a large number of pages that change rapidly, or if your pages seem to be crawled the same day that they are published, you don’t need to read this guide.” The guide is written for large, frequently updated sites (think sites with a million or more pages that change moderately often, or roughly 10,000 or more pages that change daily), not the typical small-to-midsize business site or content site.
That framing matters because a lot of crawl-budget content online treats it as something every site owner should be actively managing, complete with invented numeric thresholds for server response times and error rates that Google has never published. This guide sticks to what Google has actually stated, and is honest about who this even applies to.
What Crawl Budget Actually Is
Crawl budget is Google’s own term for “the set of URLs that Google can and wants to crawl.” It’s the product of two separate constraints:
- Crawl capacity limit: the number of simultaneous connections Googlebot is willing to use against a site, and the delay between fetches, set to avoid overloading the server. A site that responds quickly and reliably lets Google increase this limit; a site that’s slow or throws errors causes Google to pull back.
- Crawl demand: how much Google actually wants to crawl a given site, based on the perceived size of its URL inventory, how popular those URLs are, and how stale Google’s copy of them has become.
Together these determine how many URLs Googlebot will actually request on a given site in a given period. A small site with clean architecture rarely bumps into either limit; a million-page site with duplicate URL parameters, faceted navigation, and slow server responses can genuinely run into a ceiling where new or updated pages sit uncrawled for a meaningful stretch of time.
Who Actually Needs to Care
If a site publishes new content and sees it crawled within a day or two, and doesn’t have a large volume of pages changing constantly, crawl budget isn’t the bottleneck for that site’s indexing problems; something else is (internal linking, content quality, technical blockers). Google’s guidance targets sites in the ranges described above. Most small business sites, local service sites, and even mid-sized content sites with a few thousand well-organized pages fall outside the audience Google’s guide is written for.
Where crawl budget does become a real constraint: large e-commerce catalogs with faceted filtering that generates thousands of near-duplicate URL combinations, sites with infinite or near-infinite URL spaces (calendar pages, session IDs in URLs, poorly-bounded pagination), and any site where server response times are consistently slow enough to throttle Google’s crawl capacity limit.
A practical way to sanity-check whether crawl budget is a live issue: compare a site’s total indexable URL count against how many pages the Crawl Stats report shows being crawled per day. If a 50,000-page site is seeing several thousand crawl requests daily, capacity isn’t the bottleneck. If a 2-million-page site is seeing a few thousand requests daily, new and updated content across that catalog could realistically take weeks to reach Google, and that’s a genuine capacity constraint worth addressing directly.
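The sanity check above comes down to one division. The sketch below is a rough illustration only: the daily request figures (5,000 and 3,000) are assumed stand-ins for "several thousand" and "a few thousand", and the function name is hypothetical. It also treats every crawl request as a fetch of a distinct page, which overstates coverage because real crawl activity includes refreshes and resource fetches.

```python
def days_for_one_full_pass(indexable_urls: int, crawl_requests_per_day: int) -> float:
    """Crude estimate of the days Googlebot would need to request every URL once.

    Assumes each request hits a different indexable page, so the real
    figure is likely higher.
    """
    if crawl_requests_per_day <= 0:
        raise ValueError("crawl_requests_per_day must be positive")
    return indexable_urls / crawl_requests_per_day


# Assumed example figures, not measurements from any real site.
small_site_days = days_for_one_full_pass(50_000, 5_000)
large_site_days = days_for_one_full_pass(2_000_000, 3_000)

print(f"50,000-page site: about {small_site_days:.0f} days per full pass")
print(f"2,000,000-page site: about {large_site_days:.0f} days per full pass")
```

Under these assumptions the smaller site is covered in days while the larger one needs many months for a single pass, which is why new and updated URLs on the larger site end up queuing behind everything else.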
How to Check Crawl Stats in Search Console
The Crawl Stats report in Search Console (under Settings) shows total crawl requests over time, broken down by response code, file type, purpose (discovery vs. refresh), and Googlebot type. It also reports average response time. This is the direct evidence for whether crawl capacity is actually constrained: a rising trend in 5xx errors, a climbing average response time, or a crawl volume that’s flat despite a growing site are the signals worth watching, rather than guessing from third-party crawler tools alone.
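One way to read those three signals without leaning on any fixed threshold is to compare a later period against an earlier one. The following is a minimal sketch under stated assumptions: the daily figures are made up, and the field names (`requests`, `server_errors`, `avg_response_ms`) are hypothetical rather than a Search Console export format.

```python
from statistics import mean

# Made-up daily figures for illustration; field names are hypothetical.
daily = [
    {"requests": 4000, "server_errors": 40, "avg_response_ms": 300},
    {"requests": 4100, "server_errors": 41, "avg_response_ms": 310},
    {"requests": 3900, "server_errors": 39, "avg_response_ms": 290},
    {"requests": 4000, "server_errors": 80, "avg_response_ms": 450},
    {"requests": 4050, "server_errors": 81, "avg_response_ms": 460},
    {"requests": 3950, "server_errors": 79, "avg_response_ms": 440},
]


def relative_change(values):
    """Relative change of the later half's mean versus the earlier half's mean."""
    mid = len(values) // 2
    if mid == 0:
        raise ValueError("need at least two data points")
    earlier = mean(values[:mid])
    later = mean(values[mid:])
    if earlier == 0:
        raise ValueError("earlier-period mean is zero; relative change is undefined")
    return (later - earlier) / earlier


requests_change = relative_change([d["requests"] for d in daily])
error_share_change = relative_change([d["server_errors"] / d["requests"] for d in daily])
response_time_change = relative_change([d["avg_response_ms"] for d in daily])

print(f"crawl volume change:       {requests_change:+.0%}")
print(f"5xx share change:          {error_share_change:+.0%}")
print(f"avg response time change:  {response_time_change:+.0%}")
```

The output is a direction and a size of change, not a pass or fail verdict. In this sample data the 5xx share and response time both climb while crawl volume stays flat, which is the pattern worth investigating.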
Common Crawl-Budget Waste Sources
For sites large enough that this matters, the recurring culprits are:
| Waste source | What it does |
|---|---|
| Faceted navigation | Every filter combination (size, color, price range) generates a distinct, crawlable URL, often multiplying a few hundred products into hundreds of thousands of URL variants |
| Duplicate URL parameters | Tracking parameters, session IDs, or sort orders appended to otherwise identical URLs create near-infinite duplicate paths |
| Soft 404s | Pages that return a 200 status but show "not found" style content still get crawled repeatedly, since Google can't rule them out as valid until it re-evaluates them |
| Infinite spaces | Calendar-generated pages, auto-incrementing search result URLs, or poorly bounded pagination that technically never ends |
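To see how much of a crawled URL list is duplicate-parameter noise, one option is to normalize each URL and count what remains. This is a sketch, not a canonicalization standard: the `IGNORED_PARAMS` set and the example URLs are assumptions for a hypothetical site, and which parameters are safe to ignore has to be decided per site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change page content on this
# hypothetical site; a real site needs its own list.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sort"}


def normalize(url: str) -> str:
    """Drop ignored parameters, sort the rest, and strip any fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical URLs, as they might appear in server logs.
crawled = [
    "https://example.com/sofas?utm_source=news&color=grey",
    "https://example.com/sofas?color=grey&sessionid=abc123",
    "https://example.com/sofas?color=grey&sort=price",
    "https://example.com/sofas?color=grey",
]

unique = {normalize(url) for url in crawled}
print(f"{len(crawled)} crawled URLs collapse to {len(unique)} distinct page(s)")
```

A large gap between the crawled count and the distinct count suggests that parameter variants are absorbing crawl requests.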
Faceted navigation deserves particular attention because it’s usually the largest single source of waste on sites where crawl budget is a genuine issue. A furniture retailer with 500 real products, filterable by material, color, price range, and size, can generate URL combinations numbering in the hundreds of thousands once every filter permutation is crawlable, even though only 500 of those URLs represent something a shopper would actually want indexed. Google’s own crawl-budget guidance specifically calls out faceted navigation and session identifiers as common large-site waste patterns, not a hypothetical edge case.
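The arithmetic behind that explosion is easy to sketch. Every count below is an assumption chosen for illustration (the source gives only the 500 products and the four filter names), and the model is deliberately simple: each facet is either unset or set to exactly one option, with no multi-select.

```python
from math import prod

# Assumed option counts per facet for a hypothetical furniture catalog.
facet_options = {"material": 8, "color": 12, "price_range": 6, "size": 5}
category_pages = 20   # assumed number of category listing pages
sort_orders = 4       # assumed number of sort variants per listing
real_products = 500   # from the example above


def listing_url_count(facets: dict, categories: int, sorts: int) -> int:
    """Count crawlable listing URLs when each facet is unset or has one value."""
    combinations_per_category = prod(options + 1 for options in facets.values())
    return combinations_per_category * categories * sorts


total_urls = listing_url_count(facet_options, category_pages, sort_orders)
print(f"{total_urls:,} listing URLs for {real_products} products")
print(f"about {total_urls / real_products:.0f} crawlable URLs per real product")
```

Allowing multi-select filters or adding pagination would push the count higher still, since each extra dimension multiplies the total rather than adding to it.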
What Actually Helps
For sites that genuinely need to manage crawl budget, the fixes Google points to are unglamorous and mechanical rather than exotic:
- Use robots.txt to block crawling of genuinely low-value paths outright (faceted filter combinations that add no unique value, internal search result pages).
- Keep sitemaps current so Google has an accurate, deduplicated picture of what’s worth crawling.
- Return proper 404 or 410 status codes for pages that are actually gone rather than leaving them as soft 404s.
- Improve server response time generally, since a server that responds faster lets Google fit more requests into the same crawl capacity window.
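Before deploying robots.txt rules like the ones described above, it can help to check them against a handful of URLs. The sketch below uses Python's standard `urllib.robotparser`; the rules and URLs are hypothetical. One caveat: the standard-library parser does plain prefix matching and, as far as I know, does not implement the wildcard patterns Googlebot understands, so this only approximates Google's behavior for simple path-prefix rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking internal search and filter paths.
# These are prefix matches, so "/search" also covers "/search?q=...".
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /filter/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs to check against the rules.
for url in (
    "https://example.com/products/oak-table",
    "https://example.com/search?q=sofa",
    "https://example.com/filter/color-grey/",
):
    print(f"{'allow' if allowed(url) else 'block'}  {url}")
```

Running a check like this against real product and category URLs is a cheap guard against a rule that blocks more than intended.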
Notably absent from Google’s own guidance: any published numeric response-time tier (“under 200ms is excellent,” “over 1000ms is a problem”) or error-rate percentage threshold that triggers throttling. Content that states these as Google-confirmed figures is presenting an invented number as fact. The accurate version is qualitative: faster, more reliable server responses generally improve crawl efficiency, and Google has not published specific millisecond bands or percentage thresholds that define “good” versus “bad” for this purpose.
A Note on Crawl Demand’s Two Real Levers
Crawl demand, the second half of the crawl budget equation, isn’t something a site can instruct Google to increase, but it responds to two things a site does influence. The first is perceived popularity: more inbound links and traffic to a URL signal to Google that it’s worth revisiting more often. The second is perceived staleness: content that genuinely changes tends to get recrawled faster than content that hasn’t changed across many prior crawl attempts, since Google learns a page’s typical update cadence over time. Neither is a lever to pull directly. Both are outcomes of the site actually being popular and actually updating meaningfully, which is a slower, more honest path to better crawl demand than cosmetic changes designed only to look fresh to a crawler.
Robots.txt syntax for blocking low-value paths and the mechanics of sitemap structure are each their own topic; the point here is narrower: know whether crawl budget is actually a live constraint for a given site before spending engineering time managing it, and lean on Google’s own stated priorities (server speed, sitemap accuracy, eliminating waste) rather than unofficial numeric rules when it is.