Configure robots.txt
Output: robots.txt
https://yourdomain.com/robots.txt
Configure sitemap.xml
Output: sitemap.xml
https://yourdomain.com/sitemap.xml
Then submit it via Google Search Console and Bing Webmaster Tools.
Examples
Common configurations for both files. Click Load into Generator to use an example as a starting point.
robots.txt Examples
Allow All Bots
Minimal file — all crawlers allowed, sitemap linked.
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Standard Website
Allow all, but block admin and private areas.
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /private/
Disallow: /?s=
Allow: /
Sitemap: https://example.com/sitemap.xml
WordPress
Typical WordPress configuration with WP-Admin protected and AJAX allowed.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /?s=
Disallow: /?p=
Sitemap: https://example.com/sitemap_index.xml
Block AI & Scraper Bots
Allow search engines, block AI training crawlers and aggressive scrapers.
User-agent: *
Disallow:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Bytespider
Disallow: /
Sitemap: https://example.com/sitemap.xml
Sitemap Examples
Simple Website
Home, About, Services, Contact.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about/</loc>
<changefreq>yearly</changefreq>
<priority>0.5</priority>
</url>
...
</urlset>
Blog
Home, blog index, posts with lastmod dates.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://myblog.com/</loc>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://myblog.com/post-1/</loc>
<lastmod>2025-01-10</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>
...
</urlset>
E-Commerce
Home, product categories, and individual products.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://myshop.com/</loc>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://myshop.com/products/</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
...
</urlset>
Tips & Best Practices
Guidelines for both files — and how they work together.
robots.txt
Do This
- Always provide a User-agent: * block. It catches all bots that don't have a specific rule.
- Link your sitemap. Add a Sitemap: URL line so crawlers can discover all your public pages.
- Use specific paths. Disallow: /admin/ (with trailing slash) blocks the entire directory.
- Test with Google Search Console. Use the robots.txt tester to verify your rules before deploying.
- Keep it simple. A few well-chosen Disallow rules are more maintainable than dozens of edge cases.
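Rules like these can be sanity-checked locally with Python's standard-library urllib.robotparser before deploying. A minimal sketch — the rules mirror the "Standard Website" example above, and example.com is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Rules from the "Standard Website" example above (example.com is a placeholder).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Public pages stay crawlable; the disallowed directories do not.
print(rp.can_fetch("*", "https://example.com/about/"))          # True
print(rp.can_fetch("*", "https://example.com/admin/settings"))  # False
```

This is the same parser many Python crawlers use, so it is a reasonable proxy for how well-behaved bots will read your file.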
Avoid This
- Don't use robots.txt for security. Malicious bots ignore it. Use server-side authentication instead.
- Don't block CSS and JS. Google needs them to render and understand your pages.
- Disallow: / blocks everything — including your entire website from Google. Double-check this.
- Don't put secrets in Disallow paths. The file is public — anyone can read it and find hidden paths.
- Blocking a page doesn't remove it from Google. Use the noindex meta tag or HTTP header instead.
Sitemap
Do This
- Include only canonical URLs. One URL per page, no duplicates with ?utm_* or session parameters.
- Use <lastmod> accurately. Only set it if you actually track the modification date — don't fake it.
- Set <priority> relatively. Use 1.0 for your homepage, lower values for deeper pages. Search engines may ignore it anyway.
- Split large sites. Keep each sitemap under 50,000 URLs / 50 MB and use a sitemap index file.
- Submit via Search Console. Tell Google and Bing where your sitemap is for faster indexing.
Avoid This
- Don't include blocked URLs. If a page is disallowed in robots.txt, it shouldn't be in the sitemap.
- Don't include noindex pages. Search engines won't index them regardless — they clutter your sitemap.
- Don't use relative URLs. All <loc> values must be fully qualified: https://example.com/page/.
- Don't set every <changefreq> to "always". Use realistic values — it affects crawl budget.
- Don't list 404 pages. Only include pages that return HTTP 200. Crawlers waste budget on dead links.
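Several of these mistakes can be caught mechanically. A small sketch using Python's standard library — the helper name lint_sitemap is our own, and it only checks two of the rules above (relative URLs and changefreq "always"):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def lint_sitemap(xml_text):
    """Flag two common sitemap mistakes: relative <loc> values and changefreq 'always'."""
    problems = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        if not loc.startswith(("http://", "https://")):
            problems.append(f"relative URL: {loc}")
        if url.findtext("sm:changefreq", default="", namespaces=NS) == "always":
            problems.append(f"changefreq 'always': {loc}")
    return problems
```

Checking for 404s additionally requires fetching each URL, which is why it is left out of this sketch.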
How robots.txt and Sitemap Work Together
robots.txt
Tells crawlers what they may not access. It's a policy file, not a security measure. Crawlers obey it voluntarily.
sitemap.xml
Tells crawlers what you want indexed. It's a roadmap — it speeds up discovery but doesn't guarantee indexing.
The Golden Rule
Never have the same URL in both Disallow (robots.txt) and sitemap. That contradicts itself and confuses crawlers.
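This cross-check is easy to automate with Python's standard library: parse the sitemap, then ask a robots.txt parser whether each URL is fetchable. A sketch — the function name blocked_sitemap_urls and the sample data are illustrative:

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def blocked_sitemap_urls(robots_txt, sitemap_xml, agent="*"):
    """Return sitemap URLs that robots.txt disallows for the given agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    locs = (loc.text.strip() for loc in ET.fromstring(sitemap_xml).findall(".//sm:loc", NS))
    return [url for url in locs if not rp.can_fetch(agent, url)]

robots = "User-agent: *\nDisallow: /private/\n"
sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/private/page/</loc></url>"
    "</urlset>"
)
print(blocked_sitemap_urls(robots, sitemap))  # ['https://example.com/private/page/']
```

Any URL this returns violates the golden rule: either remove it from the sitemap or stop disallowing it.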
Format Reference
Complete specification for both file formats.
robots.txt Directives
| Directive | Example | Support | Description |
|---|---|---|---|
| User-agent | User-agent: * | All | Target bot. Use * for all, or a specific bot name. |
| Disallow | Disallow: /admin/ | All | Blocks access to a path. Empty value means allow all. |
| Allow | Allow: /public/ | All | Explicitly allows a path, overriding a broader Disallow. |
| Crawl-delay | Crawl-delay: 10 | Most | Seconds between requests. Not supported by Googlebot (use Search Console instead). |
| Sitemap | Sitemap: https://…/sitemap.xml | All | Full URL of the sitemap. Can appear multiple times. |
| Host | Host: example.com | Yandex | Yandex-specific directive for preferred domain (canonical host). |
Common Bot Names
| User-agent | Owner | Type |
|---|---|---|
| Googlebot | Google | Search |
| Googlebot-Image | Google Images | Search |
| Googlebot-News | Google News | Search |
| Bingbot | Microsoft | Search |
| Slurp | Yahoo | Search |
| DuckDuckBot | DuckDuckGo | Search |
| Baiduspider | Baidu | Search |
| YandexBot | Yandex | Search |
| Applebot | Apple (Siri, Spotlight) | Search |
| Yeti | Naver | Search |
| SogouSpider | Sogou | Search |
| Qwantify | Qwant | Search |
| ia_archiver | Internet Archive | Archive |
| facebot | Meta / Facebook | Social |
| facebookexternalhit | Meta / Facebook (link preview) | Social |
| Twitterbot | X / Twitter (card preview) | Social |
| LinkedInBot | LinkedIn (link preview) | Social |
| GPTBot | OpenAI | AI Training |
| ChatGPT-User | OpenAI (user-initiated requests) | AI Assistant |
| OAI-SearchBot | OpenAI | AI Search |
| Google-Extended | Google (Gemini training) | AI Training |
| anthropic-ai | Anthropic | AI Training |
| Claude-Web | Anthropic | AI Training |
| ClaudeBot | Anthropic | AI Training |
| CCBot | Common Crawl | AI Training |
| Bytespider | ByteDance / TikTok | AI Training |
| Amazonbot | Amazon (Alexa) | AI Training |
| PerplexityBot | Perplexity | AI Search |
| Applebot-Extended | Apple (AI training) | AI Training |
| meta-externalagent | Meta (AI training) | AI Training |
| cohere-ai | Cohere | AI Training |
| DiffBot | Diffbot (AI data) | AI Training |
| AhrefsBot | Ahrefs | SEO Tool |
| SemrushBot | Semrush | SEO Tool |
| MJ12bot | Majestic | SEO Tool |
| DotBot | Moz / Open Site Explorer | SEO Tool |
| SistrixCrawler | SISTRIX | SEO Tool |
| rogerbot | Moz | SEO Tool |
| ScreamingFrogSEOSpider | Screaming Frog | SEO Tool |
| SeobilityBot | Seobility | SEO Tool |
| serpstatbot | Serpstat | SEO Tool |
| DataForSeoBot | DataForSEO | SEO Tool |
| BLEXBot | WebMeUp | SEO Tool |
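A block-list robots.txt like the "Block AI & Scraper Bots" example above is mechanical enough to generate from a list of bot names. A small sketch — the helper name block_bots and the particular subset of bots are our own choices; adjust the list to taste:

```python
# Bots commonly blocked from AI training, per the table above (a subset).
AI_TRAINING_BOTS = [
    "GPTBot", "Google-Extended", "anthropic-ai", "Claude-Web", "ClaudeBot",
    "CCBot", "Bytespider", "Applebot-Extended", "meta-externalagent", "cohere-ai",
]

def block_bots(bots, sitemap_url=None):
    """Build a robots.txt that allows all crawlers except the listed bots."""
    lines = ["User-agent: *", "Disallow:", ""]
    for bot in bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines)

print(block_bots(AI_TRAINING_BOTS, "https://example.com/sitemap.xml"))
```

Keeping the list in code makes it easy to update as new crawlers appear, rather than hand-editing repeated stanzas.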
Sitemap XML Elements
| Element | Required | Values | Description |
|---|---|---|---|
| <urlset> | Required | — | Root element. Must include the namespace xmlns="http://www.sitemaps.org/schemas/sitemap/0.9". |
| <url> | Required | — | Parent element for each URL entry. |
| <loc> | Required | Full URL | Fully-qualified URL including protocol. Max 2,048 characters. Must be URL-encoded. |
| <lastmod> | Optional | YYYY-MM-DD | Date the page was last modified. W3C Datetime format. Must be accurate — do not fake it. |
| <changefreq> | Optional | always / hourly / daily / weekly / monthly / yearly / never | Hint for how often content changes. Treated as a hint, not a directive. |
| <priority> | Optional | 0.0 – 1.0 | Relative priority within your site. Default: 0.5. Does not affect ranking vs. other sites. |
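These elements map directly onto a tree builder. A sketch using Python's standard-library xml.etree.ElementTree — the function name build_sitemap and the entry format (a dict per URL) are our own conventions (requires Python 3.8+ for the xml_declaration argument):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: dicts with a required 'loc' and optional 'lastmod'/'changefreq'/'priority'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit children in the order the protocol lists them.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

print(build_sitemap([
    {"loc": "https://example.com/", "changefreq": "monthly", "priority": "1.0"},
    {"loc": "https://example.com/about/", "lastmod": "2025-01-10"},
]))
```

Using an XML library instead of string concatenation gets the declaration, namespace, and escaping of special characters in URLs right for free.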
Further Reading
- robotstxt.org — the original Robots Exclusion Protocol documentation (the protocol is now also standardized as RFC 9309)
- Google: robots.txt Guide — Google's robots.txt reference including supported directives and syntax
- sitemaps.org Protocol — Official Sitemaps Protocol specification
- Google: Sitemaps Guide — Build and submit a sitemap to Google Search
- Google Search Console — Test robots.txt rules and submit your sitemap