robots.txt & Sitemap Generator

Configure robots.txt

Build your rule set below. A Sitemap: URL line is appended at the end of the file, and the optional Host line is a Yandex-specific directive for the canonical domain.

Output: robots.txt

Upload the file to your web root so it is reachable at https://yourdomain.com/robots.txt

Configure sitemap.xml

All relative paths will be prepended with this base URL. For each entry, set the Path, the Last Modified date, the change Frequency, and the Priority.

Output: sitemap.xml

Upload to https://yourdomain.com/sitemap.xml, then submit it via Google Search Console and Bing Webmaster Tools.

Examples

Common configurations for both files. Click Load into Generator to use an example as a starting point.

robots.txt Examples

Allow All Bots

Minimal file — all crawlers allowed, sitemap linked.

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Standard Website

Allow all, but block admin and private areas.

User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /private/
Disallow: /?s=
Allow: /

Sitemap: https://example.com/sitemap.xml

WordPress

Typical WordPress configuration with WP-Admin protected and AJAX allowed.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /?s=
Disallow: /?p=

Sitemap: https://example.com/sitemap_index.xml

Block AI & Scraper Bots

Allow search engines, block AI training crawlers and aggressive scrapers.

User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Bytespider
Disallow: /

Sitemap: https://example.com/sitemap.xml

Sitemap Examples

Simple Website

Home, About, Services, Contact.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
    <changefreq>yearly</changefreq>
    <priority>0.5</priority>
  </url>
  ...
</urlset>

Blog

Home, blog index, posts with lastmod dates.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://myblog.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://myblog.com/post-1/</loc>
    <lastmod>2025-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  ...
</urlset>

E-Commerce

Home, product categories, and individual products.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://myshop.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://myshop.com/products/</loc>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  ...
</urlset>

Tips & Best Practices

Guidelines for both files — and how they work together.

robots.txt

Do This

  • Always provide a User-agent: * block. It catches all bots that don't have a specific rule.
  • Link your sitemap. Add Sitemap: URL so crawlers can discover all your public pages.
  • Use specific paths. Disallow: /admin/ (with trailing slash) blocks the entire directory.
  • Test with Google Search Console. Use the robots.txt tester to verify your rules before deploying.
  • Keep it simple. A few well-chosen Disallow rules are more maintainable than dozens of edge cases.

Avoid This

  • Don't use robots.txt for security. Malicious bots ignore it. Use server-side authentication instead.
  • Don't block CSS and JS. Google needs them to render and understand your pages.
  • Don't ship Disallow: / by accident. It blocks your entire site from every compliant crawler, Google included. Double-check it before deploying.
  • Don't put secrets in Disallow paths. The file is public — anyone can read it and find hidden paths.
  • Blocking a page doesn't remove it from Google. Use the noindex meta tag or HTTP header instead.
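To act on the last point: removal from search results is done with noindex, not with a robots.txt block. A minimal illustration (the page path is hypothetical):

```html
<!-- In the page's <head>: ask all crawlers not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources (PDFs, images), the server can send the equivalent HTTP response header, X-Robots-Tag: noindex. Either way, the crawler must be able to fetch the page to see the directive, so do not also Disallow that path in robots.txt.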

Sitemap

Do This

  • Include only canonical URLs. One URL per page, no duplicates with ?utm_* or session parameters.
  • Use <lastmod> accurately. Only set it if you actually track the modification date — don't fake it.
  • Set <priority> relatively. Use 1.0 for your homepage, lower values for deeper pages. Search engines may ignore it anyway.
  • Split large sites. Keep each sitemap under 50,000 URLs / 50 MB and use a sitemap index file.
  • Submit via Search Console. Tell Google and Bing where your sitemap is for faster indexing.
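For the split-large-sites point above, a sitemap index file that references per-section sitemaps might look like this (URLs are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2025-01-12</lastmod>
  </sitemap>
</sitemapindex>
```

Submit only the index file to Search Console; the individual sitemaps are discovered through it.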

Avoid This

  • Don't include blocked URLs. If a page is disallowed in robots.txt, it shouldn't be in the sitemap.
  • Don't include noindex pages. Search engines won't index them regardless — they clutter your sitemap.
  • Don't use relative URLs. All <loc> values must be fully qualified: https://example.com/page/.
  • Don't set every <changefreq> to "always". Use realistic values — it affects crawl budget.
  • Don't list 404 pages. Only include pages that return HTTP 200. Crawlers waste budget on dead links.
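A practical way to audit the last two points is to extract every <loc> URL from your sitemap and check each one's status code. A minimal sketch in Python, parsing only (real use would then fetch each URL over HTTP; the function name is illustrative):

```python
import xml.etree.ElementTree as ET

# The sitemap namespace declared on <urlset>
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL from a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

urls = sitemap_urls(example)
# urls == ["https://example.com/", "https://example.com/about/"]
# Each URL would then be requested (e.g. with urllib) and flagged
# if the response status is anything other than 200.
```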

How robots.txt and Sitemap Work Together

robots.txt

Tells crawlers what they may not access. It's a policy file, not a security measure. Crawlers obey it voluntarily.

sitemap.xml

Tells crawlers what you want indexed. It's a roadmap — it speeds up discovery but doesn't guarantee indexing.

The Golden Rule

Never have the same URL in both Disallow (robots.txt) and sitemap. That contradicts itself and confuses crawlers.

Format Reference

Complete specification for both file formats.

robots.txt Directives

| Directive | Example | Support | Description |
| --- | --- | --- | --- |
| User-agent | User-agent: * | All | Targets a bot. Use * for all bots, or a specific bot name. |
| Disallow | Disallow: /admin/ | All | Blocks access to a path. An empty value means allow all. |
| Allow | Allow: /public/ | All | Explicitly allows a path, overriding a broader Disallow. |
| Crawl-delay | Crawl-delay: 10 | Most | Seconds between requests. Not supported by Googlebot (use Search Console instead). |
| Sitemap | Sitemap: https://…/sitemap.xml | All | Full URL of the sitemap. Can appear multiple times. |
| Host | Host: example.com | Yandex | Yandex-specific directive for the preferred (canonical) domain. |
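As the table notes, Allow can carve an exception out of a broader Disallow. A small illustration (hypothetical paths):

```text
User-agent: *
Disallow: /docs/
Allow: /docs/public/
```

Crawlers that apply the longest-match rule (Googlebot, Bingbot) will crawl everything under /docs/public/ while the rest of /docs/ stays blocked; older bots that only honor the first matching rule may behave differently.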

Common Bot Names

| User-agent | Owner | Type |
| --- | --- | --- |
| Googlebot | Google | Search |
| Googlebot-Image | Google Images | Search |
| Googlebot-News | Google News | Search |
| Bingbot | Microsoft | Search |
| Slurp | Yahoo | Search |
| DuckDuckBot | DuckDuckGo | Search |
| Baiduspider | Baidu | Search |
| YandexBot | Yandex | Search |
| Applebot | Apple (Siri, Spotlight) | Search |
| Yeti | Naver | Search |
| SogouSpider | Sogou | Search |
| Qwantify | Qwant | Search |
| ia_archiver | Internet Archive | Archive |
| facebot | Meta / Facebook | Social |
| facebookexternalhit | Meta / Facebook (link preview) | Social |
| Twitterbot | X / Twitter (card preview) | Social |
| LinkedInBot | LinkedIn (link preview) | Social |
| GPTBot | OpenAI | AI Training |
| ChatGPT-User | OpenAI (user-initiated fetches) | AI Assistant |
| OAI-SearchBot | OpenAI | AI Search |
| Google-Extended | Google (Gemini training) | AI Training |
| anthropic-ai | Anthropic | AI Training |
| Claude-Web | Anthropic | AI Training |
| ClaudeBot | Anthropic | AI Training |
| CCBot | Common Crawl | AI Training |
| Bytespider | ByteDance / TikTok | AI Training |
| Amazonbot | Amazon (Alexa) | AI Training |
| PerplexityBot | Perplexity | AI Search |
| Applebot-Extended | Apple (AI training) | AI Training |
| meta-externalagent | Meta (AI training) | AI Training |
| cohere-ai | Cohere | AI Training |
| DiffBot | Diffbot (AI data) | AI Training |
| AhrefsBot | Ahrefs | SEO Tool |
| SemrushBot | Semrush | SEO Tool |
| MJ12bot | Majestic | SEO Tool |
| DotBot | Moz / Open Site Explorer | SEO Tool |
| SistrixCrawler | SISTRIX | SEO Tool |
| rogerbot | Moz | SEO Tool |
| ScreamingFrogSEOSpider | Screaming Frog | SEO Tool |
| SeobilityBot | Seobility | SEO Tool |
| serpstatbot | Serpstat | SEO Tool |
| DataForSeoBot | DataForSEO | SEO Tool |
| BLEXBot | WebMeUp | SEO Tool |

Sitemap XML Elements

| Element | Required | Values | Description |
| --- | --- | --- | --- |
| <urlset> | Required | — | Root element. Must include the namespace xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" |
| <url> | Required | — | Parent element for each URL entry. |
| <loc> | Required | Full URL | Fully qualified URL including protocol. Max 2048 characters. Must be URL-encoded. |
| <lastmod> | Optional | YYYY-MM-DD | Date the page was last modified. W3C Datetime format. Must be accurate — do not fake it. |
| <changefreq> | Optional | always / hourly / daily / weekly / monthly / yearly / never | Hint for how often content changes. Treated as a hint, not a directive. |
| <priority> | Optional | 0.0 – 1.0 | Relative priority within your site. Default: 0.5. Does not affect ranking vs. other sites. |
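Putting all four elements together, one complete entry inside <urlset> looks like this (the URL and values are illustrative):

```xml
<url>
  <loc>https://example.com/blog/post/</loc>
  <lastmod>2025-01-10</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.7</priority>
</url>
```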

Further Reading