XML Sitemaps: Structure, Limits, and Common Mistakes - Search Engine Optimization Directory

An XML sitemap is a file that lists the URLs a site owner wants search engines to know about, along with optional metadata such as when each URL was last modified. It’s a discovery aid, not a ranking lever. A sitemap can help Google find pages faster, especially on a large site or one with weak internal linking, but it doesn’t guarantee those pages get indexed, and it has no effect on how they rank once they are. That distinction is the single most misunderstood point about sitemaps, so it is stated once here rather than repeated throughout this guide.

This post covers the sitemap file itself: structure, technical limits, what belongs in it, and how to submit and monitor it. It doesn’t cover crawl-budget strategy generally (sitemaps are one input into crawl discovery, not the whole picture), and it doesn’t cover diagnosing why a URL that’s already in the sitemap still isn’t indexed; that’s a Search Console Page Indexing report question.

Technical Requirements and Limits

Per Google’s sitemap build documentation, a single sitemap file is capped at 50,000 URLs or 50MB uncompressed, whichever limit is hit first. Sitemaps must be UTF-8 encoded. They can technically be hosted at any path, but a sitemap that isn’t submitted through Search Console only applies to URLs at or below its own directory, so Google recommends placing it at the site root, where it can cover every URL on the site.

Sites with more than 50,000 URLs, or sitemaps that would exceed 50MB, need to split content across multiple sitemap files.
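
As a rough illustration of the splitting step, here is a minimal Python sketch. It assumes the full URL list already sits in memory, and it only enforces the URL-count cap, not the 50MB size cap. The names chunk_urls and MAX_URLS_PER_SITEMAP are illustrative, not part of any sitemap tooling.

```python
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # URL-count cap per sitemap file


def chunk_urls(urls: List[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical input: 120,000 product URLs.
urls = [f"https://example.com/products/{i}/" for i in range(120_000)]
chunks = list(chunk_urls(urls))
print([len(c) for c in chunks])  # [50000, 50000, 20000]
```

Each chunk would then be written to its own sitemap file. A generator that also tracked the serialized byte size would be needed to honor the 50MB limit.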

Basic Sitemap Anatomy

A standard XML sitemap follows the sitemaps.org protocol, which Google, Bing, and other major search engines all support. A minimal, valid sitemap with two entries looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/guides/technical-seo/</loc>
    <lastmod>2026-06-20</lastmod>
  </url>
  <url>
    <loc>https://example.com/guides/keyword-research/</loc>
    <lastmod>2026-05-11</lastmod>
  </url>
</urlset>

<loc> is the only required field per URL, and it must be the exact, fully-qualified canonical URL, including protocol. <lastmod> is optional, but it is the one metadata field worth including, provided it is accurate. <priority> and <changefreq> can be included or omitted; either way, Google ignores them, as covered below.
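
Since most sitemaps are generated rather than hand-written, a file like the one above can be produced with Python’s standard library. This is a sketch under assumptions: build_urlset is a hypothetical helper, and it takes plain (loc, lastmod) pairs as input. One practical benefit of using an XML library is that characters such as & in URLs are escaped automatically.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries: Iterable[Tuple[str, Optional[str]]]) -> bytes:
    """Serialize (loc, lastmod) pairs as a UTF-8 sitemap; lastmod may be None."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # required field
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod  # optional field
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_urlset([
    ("https://example.com/guides/technical-seo/", "2026-06-20"),
    ("https://example.com/guides/keyword-research/", None),
])
print(xml_bytes.decode("utf-8"))
```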

What to Include (and Exclude)

A sitemap should list canonical, indexable URLs only. Google’s own guidance is straightforward on what doesn’t belong in one:

URLs that redirect elsewhere. If a sitemap lists a URL that 301s to a different page, the sitemap is pointing at a non-canonical address, which wastes the signal.
URLs blocked by robots.txt. Listing a page in a sitemap while also disallowing it in robots.txt sends Google two contradictory instructions.
URLs carrying a noindex directive. There’s no value in asking Google to prioritize crawling a page it’s been told not to index.
Non-canonical duplicates. If two URLs serve near-identical content and one has a canonical tag pointing to the other, only the canonical version belongs in the sitemap.

Keeping a sitemap free of these categories matters more on large or frequently-changing sites, where a sitemap padded with stale, redirecting, or blocked URLs signals lower overall URL quality and can waste the limited attention Google devotes to processing the file.
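
The exclusion rules above can be expressed as a single filter. The sketch below assumes a hypothetical Page record populated by a site’s own crawl or CMS data; the field names are illustrative and would need to be mapped to whatever a real pipeline exposes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical crawl record; field names are illustrative."""
    url: str
    status: int                      # HTTP status code returned for `url`
    noindex: bool = False            # a noindex directive is present
    blocked_by_robots: bool = False  # disallowed in robots.txt
    canonical: Optional[str] = None  # rel=canonical target, if declared


def belongs_in_sitemap(page: Page) -> bool:
    """Keep only URLs that return 200 and are crawlable, indexable, and self-canonical."""
    if page.status != 200:  # drops redirects (3xx) and errors (4xx/5xx)
        return False
    if page.blocked_by_robots or page.noindex:
        return False
    if page.canonical is not None and page.canonical != page.url:
        return False
    return True


pages = [
    Page("https://example.com/guides/", 200, canonical="https://example.com/guides/"),
    Page("https://example.com/old-guide/", 301),
    Page("https://example.com/cart/", 200, blocked_by_robots=True),
    Page("https://example.com/thanks/", 200, noindex=True),
    Page("https://example.com/guides/?sort=new", 200, canonical="https://example.com/guides/"),
]
print([p.url for p in pages if belongs_in_sitemap(p)])  # ['https://example.com/guides/']
```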

Priority and Changefreq: Why Google Ignores Both

Two optional tags, <priority> and <changefreq>, let a sitemap author suggest how important a URL is relative to others on the site and how often it’s expected to change. Google’s documentation states plainly that these fields are ignored: “Google ignores <priority> and <changefreq> values”. Neither tag affects crawl frequency or ranking. Including them isn’t harmful, but there’s no reason to spend time hand-tuning priority values across a site. Google determines its own crawl priorities from other signals, including internal linking, historical crawl patterns, and observed change frequency, not from what a sitemap claims.

The one metadata field Google does use is <lastmod>, and only when it’s accurate. Google’s guidance is to update <lastmod> for genuine content updates, structured data changes, or link changes, not for cosmetic edits like refreshing a copyright year in the footer. A <lastmod> timestamp that changes on every page regardless of whether anything meaningful changed trains Google to stop trusting the field, which defeats its purpose.
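
One way to keep <lastmod> honest is to bump it only when a fingerprint of the meaningful content changes. The sketch below is one possible approach, not a method Google prescribes. It assumes the caller passes in only the main content (with footers and other boilerplate already stripped), and update_lastmod and LastmodStore are hypothetical names.

```python
import hashlib
from datetime import date
from typing import Dict, Tuple

# url -> (content fingerprint, lastmod as YYYY-MM-DD)
LastmodStore = Dict[str, Tuple[str, str]]


def update_lastmod(store: LastmodStore, url: str, main_content: str, today: date) -> str:
    """Return the lastmod for `url`, bumping it only if `main_content` changed."""
    fingerprint = hashlib.sha256(main_content.encode("utf-8")).hexdigest()
    previous = store.get(url)
    if previous is not None and previous[0] == fingerprint:
        return previous[1]  # nothing significant changed; keep the old date
    lastmod = today.isoformat()
    store[url] = (fingerprint, lastmod)
    return lastmod


store: LastmodStore = {}
url = "https://example.com/guides/technical-seo/"
print(update_lastmod(store, url, "Original body text", date(2026, 5, 11)))  # 2026-05-11
print(update_lastmod(store, url, "Original body text", date(2026, 6, 1)))   # 2026-05-11
print(update_lastmod(store, url, "Revised body text", date(2026, 6, 20)))   # 2026-06-20
```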

Sitemap Index Files for Large Sites

When a site’s URL count exceeds a single sitemap’s limits, the standard solution is a sitemap index file: a wrapper XML file that lists the locations of multiple individual sitemap files, submitted to Google as one entry point.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products-1.xml</loc>
    <lastmod>2026-06-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-2.xml</loc>
    <lastmod>2026-06-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-06-28</lastmod>
  </sitemap>
</sitemapindex>

A common, practical pattern for e-commerce and large publishers is splitting by content type (products, categories, blog posts, images) rather than by arbitrary URL count alone, since it makes it easier to spot which segment of the site has a processing problem when checking sitemap status later.
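
A minimal sketch of that pattern follows. It assumes URLs are already grouped by content type, and it uses a hypothetical naming scheme (sitemap-TYPE-N.xml) loosely modeled on the example index above. plan_sitemap_files is an illustrative helper, not an existing tool.

```python
from typing import Dict, List


def plan_sitemap_files(
    urls_by_type: Dict[str, List[str]], max_urls: int = 50_000
) -> Dict[str, List[str]]:
    """Map file names like 'sitemap-products-1.xml' to the URLs each should hold."""
    plan: Dict[str, List[str]] = {}
    for content_type, urls in urls_by_type.items():
        starts = range(0, len(urls), max_urls)
        for n, start in enumerate(starts, start=1):
            plan[f"sitemap-{content_type}-{n}.xml"] = urls[start:start + max_urls]
    return plan


# Tiny cap so the split is visible; real files would use the 50,000 default.
plan = plan_sitemap_files(
    {
        "products": [f"https://example.com/products/{i}/" for i in range(5)],
        "blog": [f"https://example.com/blog/{i}/" for i in range(2)],
    },
    max_urls=2,
)
print(list(plan))
# ['sitemap-products-1.xml', 'sitemap-products-2.xml', 'sitemap-products-3.xml', 'sitemap-blog-1.xml']
```

Each key in the plan becomes one <sitemap> entry in the index file, so a processing error reported against a single file points directly at one content type.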

Image, Video, and News Sitemap Extensions

Beyond the standard URL sitemap, the sitemaps protocol supports extensions that attach extra metadata to a URL entry for specific content types: an image sitemap extension for listing the images on a page (Google accepts up to 1,000 images per URL entry), a video sitemap extension for video-specific metadata like duration and thumbnail location, and a news sitemap extension for article metadata such as publication name and publication date. These aren’t separate discovery mechanisms so much as richer annotations on an existing URL entry. They are useful when a page’s visual or video content is a meaningful part of what should surface in search (a product photo, a how-to video embedded in an article) and isn’t otherwise obvious from the HTML alone. A site with a genuinely video-heavy or image-heavy section is better served by building a dedicated sitemap for that content type than by cramming the extension tags into an already-large general URL sitemap.

How Robots.txt and Sitemaps Work Together

A Sitemap: line inside robots.txt is a common way to point Google at a sitemap without any Search Console submission; Google reads it automatically on its normal robots.txt fetch schedule, with no manual resubmission required. The two methods aren’t mutually exclusive, and many sites use both: the robots.txt reference lets any crawler discover the sitemap (not just Google, and without the site being verified in each search engine’s webmaster tools), while Search Console submission gives direct visibility into processing status and errors that a robots.txt line alone doesn’t surface.
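
To check what a crawler would see, the Sitemap: lines can be read back out of a robots.txt file with Python’s standard urllib.robotparser module. This is a small sketch: the robots.txt content is a made-up example, sitemaps_from_robots is a hypothetical helper, and the site_maps() method requires Python 3.8 or later.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return every sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when no line is present


# Example robots.txt content (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/

Sitemap: https://example.com/sitemap_index.xml
"""

print(sitemaps_from_robots(ROBOTS_TXT))  # ['https://example.com/sitemap_index.xml']
```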

Submitting and Monitoring in Search Console

A sitemap (or sitemap index) gets submitted through the Sitemaps report in Search Console, which then shows whether Google was able to fetch it, when it was last read, and how many URLs it discovered in the file. How many of those URLs are indexed versus not is shown in the Page Indexing report, which can be filtered to a submitted sitemap. Submitting a sitemap doesn’t force an immediate crawl; Google reads submitted sitemaps on its own schedule, more frequently for sites that update often.

The Sitemaps report is useful for catching processing errors (malformed XML, URLs that don’t match the sitemap’s own path scope, encoding problems), but it isn’t the place to diagnose why a specific URL isn’t indexed or isn’t showing up in search; that level of detail lives in the Page Indexing report.
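
Several of those processing errors can be caught before submission. The following sketch runs a few local checks on one sitemap file. It is not a reproduction of Google’s validation. check_sitemap is a hypothetical helper, and the path-scope check applies the strict at-or-below-its-own-directory rule.

```python
import xml.etree.ElementTree as ET
from typing import List

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB, uncompressed


def check_sitemap(data: bytes, sitemap_url: str) -> List[str]:
    """Return a list of human-readable problems found in one sitemap file."""
    problems: List[str] = []
    if len(data) > MAX_BYTES:
        problems.append("file exceeds 50MB uncompressed")
    try:
        root = ET.fromstring(data)  # also fails on bytes that are not valid UTF-8
    except ET.ParseError as exc:
        return problems + [f"not well-formed XML: {exc}"]
    locs = [(el.text or "").strip() for el in root.iter(f"{NS}loc")]
    if len(locs) > MAX_URLS:
        problems.append(f"{len(locs)} URLs exceeds the 50,000 cap")
    scope = sitemap_url.rsplit("/", 1)[0] + "/"  # directory holding the sitemap
    for loc in locs:
        if not loc.startswith(scope):
            problems.append(f"outside path scope: {loc}")
    return problems


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/guides/technical-seo/</loc></url>
  <url><loc>https://example.com/blog/launch/</loc></url>
</urlset>"""
print(check_sitemap(sample, "https://example.com/guides/sitemap.xml"))
# ['outside path scope: https://example.com/blog/launch/']
```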

A multilingual site can also attach hreflang annotations directly inside a sitemap entry as an alternative to placing them in each page’s HTML head. The annotation mechanics themselves (which language/region codes to use, how to structure the reciprocal links between language versions) are their own topic, but it’s worth knowing the sitemap-based approach exists, particularly for sites where editing every page’s head tags isn’t practical.

Common Errors

Most sitemap problems trace back to the sitemap being out of sync with the live site rather than to the XML syntax itself, since most sitemaps today are generated programmatically rather than hand-written.

Error	Why It Happens	Fix
Stale URLs	Sitemap generation doesn't run on page deletion/redirect events	Regenerate sitemap automatically on content changes, not on a fixed schedule alone
Wrong URLs (http vs https, www vs non-www)	Sitemap generator uses a different canonical convention than the live site	Match sitemap URLs exactly to the site's canonical URL format
Mixing content types without proper extensions	Image or video URLs included in a standard sitemap without the image/video sitemap extensions	Use the image (https://developers.google.com/search/docs/crawling-indexing/sitemaps/image-sitemaps) or video sitemap extensions when those assets need their own discovery signal
Exceeding the 50,000 URL / 50MB cap	Site grew past the point a single file could handle	Split into multiple sitemaps under a sitemap index
"Couldn't fetch" status in Search Console	Sitemap URL returns a non-200 status, times out, or isn't valid XML	Confirm the sitemap URL loads directly in a browser and validates as well-formed XML before resubmitting
Sitemap and robots.txt disagree	A URL is listed in the sitemap but disallowed in robots.txt	Either remove the URL from the sitemap or lift the robots.txt disallow; keeping it listed while blocked sends Google contradictory signals
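
Two of the errors in the table above (wrong scheme or host, and sitemap/robots.txt disagreement) lend themselves to an automated check. The sketch below uses Python’s standard urllib.robotparser. audit_sitemap_urls is a hypothetical helper, and the standard-library parser does simple prefix matching, so it should not be assumed to reproduce Google’s handling of wildcard rules.

```python
from typing import List, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def audit_sitemap_urls(
    urls: List[str], robots_txt: str, canonical_origin: str, user_agent: str = "Googlebot"
) -> List[Tuple[str, str]]:
    """Return (url, problem) pairs for URLs that conflict with the live site setup."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    expected = urlsplit(canonical_origin)
    problems: List[Tuple[str, str]] = []
    for url in urls:
        parts = urlsplit(url)
        if (parts.scheme, parts.netloc) != (expected.scheme, expected.netloc):
            problems.append((url, "wrong scheme or host"))
        elif not parser.can_fetch(user_agent, url):
            problems.append((url, "disallowed in robots.txt"))
    return problems


report = audit_sitemap_urls(
    urls=[
        "https://example.com/guides/",
        "http://example.com/guides/seo/",
        "https://www.example.com/blog/",
        "https://example.com/cart/checkout/",
    ],
    robots_txt="User-agent: *\nDisallow: /cart/\n",
    canonical_origin="https://example.com",
)
for url, problem in report:
    print(f"{problem}: {url}")
```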

A sitemap’s job is narrow: hand Google a clean, accurate list of the URLs worth knowing about. Getting the file mechanically correct (right limits, right exclusions, accurate lastmod) does more for a site’s crawl efficiency than anything involving priority or changefreq ever will.