robots.txt & Sitemap Generator

Configure robots.txt

Build your rule set below. A Sitemap: URL line is appended at the end of the file, and the optional Host line is a Yandex-specific directive for the canonical domain.

Output: robots.txt

Upload the file to your web root so it is reachable at https://yourdomain.com/robots.txt

Configure sitemap.xml

All relative paths will be prepended with this base URL. For each entry, set the Path, the Last Modified date, the change Frequency, and the Priority.

Output: sitemap.xml

Upload to https://yourdomain.com/sitemap.xml, then submit it via Google Search Console and Bing Webmaster Tools.

Examples

Common configurations for both files. Click Load into Generator to use an example as a starting point.

robots.txt Examples

Allow All Bots

Minimal file — all crawlers allowed, sitemap linked.

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Standard Website

Allow all, but block admin and private areas.

User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /private/
Disallow: /?s=
Allow: /

Sitemap: https://example.com/sitemap.xml

WordPress

Typical WordPress configuration with WP-Admin protected and AJAX allowed.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /?s=
Disallow: /?p=

Sitemap: https://example.com/sitemap_index.xml

Block AI & Scraper Bots

Allow search engines, block AI training crawlers and aggressive scrapers.

User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Bytespider
Disallow: /

Sitemap: https://example.com/sitemap.xml

Sitemap Examples

Simple Website

Home, About, Services, Contact.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
    <changefreq>yearly</changefreq>
    <priority>0.5</priority>
  </url>
  ...
</urlset>

Blog

Home, blog index, posts with lastmod dates.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://myblog.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://myblog.com/post-1/</loc>
    <lastmod>2025-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  ...
</urlset>

E-Commerce

Home, product categories, and individual products.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://myshop.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://myshop.com/products/</loc>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  ...
</urlset>

Tips & Best Practices

Guidelines for both files — and how they work together.

robots.txt

Do This

  • Always provide a User-agent: * block. It catches all bots that don't have a specific rule.
  • Link your sitemap. Add Sitemap: URL so crawlers can discover all your public pages.
  • Use specific paths. Disallow: /admin/ (with trailing slash) blocks the entire directory.
  • Test with Google Search Console. Use the robots.txt tester to verify your rules before deploying.
  • Keep it simple. A few well-chosen Disallow rules are more maintainable than dozens of edge cases.

Avoid This

  • Don't use robots.txt for security. Malicious bots ignore it. Use server-side authentication instead.
  • Don't block CSS and JS. Google needs them to render and understand your pages.
  • Don't ship Disallow: / by accident. It blocks your entire site from every compliant crawler, Google included. Double-check it before deploying.
  • Don't put secrets in Disallow paths. The file is public — anyone can read it and find hidden paths.
  • Blocking a page doesn't remove it from Google. Use the noindex meta tag or HTTP header instead.
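To act on the last point: removal from search results is done with noindex, not with a robots.txt block. A minimal illustration (the page path is hypothetical):

```html
<!-- In the page's <head>: ask all crawlers not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources (PDFs, images), the server can send the equivalent HTTP response header, X-Robots-Tag: noindex. Either way, the crawler must be able to fetch the page to see the directive, so do not also Disallow that path in robots.txt.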

Sitemap

Do This

  • Include only canonical URLs. One URL per page, no duplicates with ?utm_* or session parameters.
  • Use <lastmod> accurately. Only set it if you actually track the modification date — don't fake it.
  • Set <priority> relatively. Use 1.0 for your homepage, lower values for deeper pages. Search engines may ignore it anyway.
  • Split large sites. Keep each sitemap under 50,000 URLs / 50 MB and use a sitemap index file.
  • Submit via Search Console. Tell Google and Bing where your sitemap is for faster indexing.
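For the split-large-sites point above, a sitemap index file that references per-section sitemaps might look like this (URLs are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2025-01-12</lastmod>
  </sitemap>
</sitemapindex>
```

Submit only the index file to Search Console; the individual sitemaps are discovered through it.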

Avoid This

  • Don't include blocked URLs. If a page is disallowed in robots.txt, it shouldn't be in the sitemap.
  • Don't include noindex pages. Search engines won't index them regardless — they clutter your sitemap.
  • Don't use relative URLs. All <loc> values must be fully qualified: https://example.com/page/.
  • Don't set every <changefreq> to "always". Use realistic values — it affects crawl budget.
  • Don't list 404 pages. Only include pages that return HTTP 200. Crawlers waste budget on dead links.
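A practical way to audit the last two points is to extract every <loc> URL from your sitemap and check each one's status code. A minimal sketch in Python, parsing only (real use would then fetch each URL over HTTP; the function name is illustrative):

```python
import xml.etree.ElementTree as ET

# The sitemap namespace declared on <urlset>
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL from a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

urls = sitemap_urls(example)
# urls == ["https://example.com/", "https://example.com/about/"]
# Each URL would then be requested (e.g. with urllib) and flagged
# if the response status is anything other than 200.
```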

How robots.txt and Sitemap Work Together

robots.txt

Tells crawlers what they may not access. It's a policy file, not a security measure. Crawlers obey it voluntarily.

sitemap.xml

Tells crawlers what you want indexed. It's a roadmap — it speeds up discovery but doesn't guarantee indexing.

The Golden Rule

Never have the same URL in both Disallow (robots.txt) and sitemap. That contradicts itself and confuses crawlers.

Format Reference

Complete specification for both file formats.

robots.txt Directives

| Directive | Example | Support | Description |
| --- | --- | --- | --- |
| User-agent | User-agent: * | All | Targets a bot. Use * for all bots, or a specific bot name. |
| Disallow | Disallow: /admin/ | All | Blocks access to a path. An empty value means allow all. |
| Allow | Allow: /public/ | All | Explicitly allows a path, overriding a broader Disallow. |
| Crawl-delay | Crawl-delay: 10 | Most | Seconds between requests. Not supported by Googlebot (use Search Console instead). |
| Sitemap | Sitemap: https://…/sitemap.xml | All | Full URL of the sitemap. Can appear multiple times. |
| Host | Host: example.com | Yandex | Yandex-specific directive for the preferred (canonical) domain. |
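As the table notes, Allow can carve an exception out of a broader Disallow. A small illustration (hypothetical paths):

```text
User-agent: *
Disallow: /docs/
Allow: /docs/public/
```

Crawlers that apply the longest-match rule (Googlebot, Bingbot) will crawl everything under /docs/public/ while the rest of /docs/ stays blocked; older bots that only honor the first matching rule may behave differently.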

Common Bot Names

| User-agent | Owner | Type |
| --- | --- | --- |
| Googlebot | Google | Search |
| Googlebot-Image | Google Images | Search |
| Googlebot-News | Google News | Search |
| Bingbot | Microsoft | Search |
| Slurp | Yahoo | Search |
| DuckDuckBot | DuckDuckGo | Search |
| Baiduspider | Baidu | Search |
| YandexBot | Yandex | Search |
| Applebot | Apple (Siri, Spotlight) | Search |
| Yeti | Naver | Search |
| SogouSpider | Sogou | Search |
| Qwantify | Qwant | Search |
| ia_archiver | Internet Archive | Archive |
| facebot | Meta / Facebook | Social |
| facebookexternalhit | Meta / Facebook (link preview) | Social |
| Twitterbot | X / Twitter (card preview) | Social |
| LinkedInBot | LinkedIn (link preview) | Social |
| GPTBot | OpenAI | AI Training |
| ChatGPT-User | OpenAI (user-initiated fetches) | AI Assistant |
| OAI-SearchBot | OpenAI | AI Search |
| Google-Extended | Google (Gemini training) | AI Training |
| anthropic-ai | Anthropic | AI Training |
| Claude-Web | Anthropic | AI Training |
| ClaudeBot | Anthropic | AI Training |
| CCBot | Common Crawl | AI Training |
| Bytespider | ByteDance / TikTok | AI Training |
| Amazonbot | Amazon (Alexa) | AI Training |
| PerplexityBot | Perplexity | AI Search |
| Applebot-Extended | Apple (AI training) | AI Training |
| meta-externalagent | Meta (AI training) | AI Training |
| cohere-ai | Cohere | AI Training |
| DiffBot | Diffbot (AI data) | AI Training |
| AhrefsBot | Ahrefs | SEO Tool |
| SemrushBot | Semrush | SEO Tool |
| MJ12bot | Majestic | SEO Tool |
| DotBot | Moz / Open Site Explorer | SEO Tool |
| SistrixCrawler | SISTRIX | SEO Tool |
| rogerbot | Moz | SEO Tool |
| ScreamingFrogSEOSpider | Screaming Frog | SEO Tool |
| SeobilityBot | Seobility | SEO Tool |
| serpstatbot | Serpstat | SEO Tool |
| DataForSeoBot | DataForSEO | SEO Tool |
| BLEXBot | WebMeUp | SEO Tool |

Sitemap XML Elements

| Element | Required | Values | Description |
| --- | --- | --- | --- |
| <urlset> | Required | — | Root element. Must include the namespace xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" |
| <url> | Required | — | Parent element for each URL entry. |
| <loc> | Required | Full URL | Fully qualified URL including protocol. Max 2048 characters. Must be URL-encoded. |
| <lastmod> | Optional | YYYY-MM-DD | Date the page was last modified. W3C Datetime format. Must be accurate — do not fake it. |
| <changefreq> | Optional | always / hourly / daily / weekly / monthly / yearly / never | Hint for how often content changes. Treated as a hint, not a directive. |
| <priority> | Optional | 0.0 – 1.0 | Relative priority within your site. Default: 0.5. Does not affect ranking vs. other sites. |
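Putting all four elements together, one complete entry inside <urlset> looks like this (the URL and values are illustrative):

```xml
<url>
  <loc>https://example.com/blog/post/</loc>
  <lastmod>2025-01-10</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.7</priority>
</url>
```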

Further Reading