robots.txt & Sitemap — Tips & Tricks

Back to the overview: robots.txt & Sitemap · Open the tool: www.jpkc.com/tools/robots-sitemap/

The manual explains every function; the examples show the workflows. This page covers what both assume: where the typical mistakes hide, how to treat AI crawlers strategically, and how to combine the tool sensibly with others. The interface is in English, so the real tab and button names appear as you'll see them.

robots.txt: the most dangerous pitfalls

Disallow: / blocks your entire website. A single line in the wrong block locks out all crawlers completely, Googlebot included. Build the file in the generator, check the preview, and after upload verify in the Check robots.txt tab that Googlebot and the other crawlers you care about are really Allowed.
robots.txt is not security. The file is publicly readable and obeyed by crawlers only voluntarily. Never write secret paths into Disallow rules — you'd reveal them to anyone who opens your-domain.com/robots.txt. Whatever truly needs protecting belongs behind server-side authentication.
Blocking is not the same as de-indexing. A page blocked in robots.txt can still show up in Google's results (without a snippet) when other pages link to it. To keep a page out of the index, use a noindex meta tag or the X-Robots-Tag HTTP header — and then do not block it via robots.txt, or Google won't see the noindex at all.
Never block CSS and JavaScript. Google renders your pages to understand them. If you block /wp-content/ or an asset folder wholesale, your pages look broken to Google, which is bad for ranking. Be specific (see the WordPress template in the Examples tab).
Crawl-delay is ignored by Google. The directive is non-standard: some bots (Bingbot, for example) honor it, but Googlebot does not; there you control crawl frequency via Search Console. For bots that do honor it, a moderate value (e.g. 5–10 seconds) makes sense; don't overdo it or you needlessly slow down indexing.
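
The Disallow and Crawl-delay pitfalls above can be sanity-checked locally before upload. The following is a minimal sketch using Python's standard urllib.robotparser module; the two rule sets and the example.com URLs are made up for illustration. The standard-library parser matches rules more simply than Googlebot does, so treat the result as a quick plausibility check, not as a replacement for the Check robots.txt tab.

```python
from urllib.robotparser import RobotFileParser


def parse(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hypothetical file with the dangerous line in the catch-all block.
DANGEROUS = """\
User-agent: *
Disallow: /
"""

# Hypothetical file that blocks one specific folder and sets a crawl delay.
SPECIFIC = """\
User-agent: *
Disallow: /wp-admin/
Crawl-delay: 5
"""

dangerous = parse(DANGEROUS)
specific = parse(SPECIFIC)

# "Disallow: /" locks every crawler out of every URL.
print(dangerous.can_fetch("Googlebot", "https://example.com/"))  # False

# A specific rule leaves assets such as CSS reachable.
css = "https://example.com/wp-content/themes/site/style.css"
print(specific.can_fetch("Googlebot", css))  # True
print(specific.can_fetch("Googlebot", "https://example.com/wp-admin/options.php"))  # False

# The parser only reports the declared delay; whether a bot honors it is up to the bot.
print(specific.crawl_delay("bingbot"))  # 5
```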

Steer AI crawlers deliberately

The real value of the tool over simple generators is its view on AI crawlers — both when building (40-plus bot suggestions in the autocomplete field) and when checking (the Per-Bot Access table grouped by type).

Decide per purpose, not wholesale. The Reference table cleanly distinguishes AI Training (e.g. GPTBot, Google-Extended, anthropic-ai, CCBot) from AI Search (e.g. OAI-SearchBot, PerplexityBot). If you want to be cited in AI answers but not end up in training, block only the training crawlers and let the search crawlers through.
Google-Extended ≠ Googlebot. Google-Extended is a separate robots.txt token: disallowing it opts your content out of Gemini training only and does not affect classic Google search. Anyone who blocks everything Google-like out of fear of "AI" accidentally throws away their normal visibility.
Re-check after editing. Upload the finished file and check it in the Check robots.txt tab: the Per-Bot Access table shows you in black and white which AI crawler is now Allowed and which Blocked — including the rule that actually applies.
Mind the GEO consequences. Every blocked AI crawler costs points in the SEO & GEO Analyzer's GEO score (whose AI Crawlers Allowed check tests against nine named bots). Blocking is a deliberate decision against AI visibility — not an accident that should happen in passing.
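
The "block training, allow search" strategy described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the tool's own output: the bot names are the ones mentioned in the list, the helper names build_robots_txt and per_bot_access are hypothetical, and the check uses Python's standard urllib.robotparser, whose matching is simpler than that of real crawlers.

```python
from urllib.robotparser import RobotFileParser

# Bot names taken from the examples in the text; the lists are not exhaustive.
TRAINING_BOTS = ["GPTBot", "Google-Extended", "anthropic-ai", "CCBot"]
SEARCH_BOTS = ["OAI-SearchBot", "PerplexityBot"]


def build_robots_txt(blocked_bots):
    """Return a robots.txt that blocks only the given bots and allows all others."""
    lines = [f"User-agent: {bot}" for bot in blocked_bots]
    lines += ["Disallow: /", "", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def per_bot_access(robots_txt, bots, url="https://example.com/"):
    """Map each bot name to 'Allowed' or 'Blocked' for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: "Allowed" if parser.can_fetch(bot, url) else "Blocked" for bot in bots}


robots_txt = build_robots_txt(TRAINING_BOTS)
print(robots_txt)

access = per_bot_access(robots_txt, TRAINING_BOTS + SEARCH_BOTS + ["Googlebot"])
for bot, status in access.items():
    print(f"{bot:16} {status}")
```

With this file the training crawlers come out as Blocked, while the search crawlers and Googlebot stay Allowed, which is the same picture the Per-Bot Access table should show after upload.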

Sitemap: clean over complete

Only canonical URLs that return HTTP 200. No duplicates with ?utm_* or session parameters, no redirects, no 404s. Every dead or duplicate URL wastes crawl budget.
No blocked or noindex pages in the sitemap. This is the golden rule: a URL never belongs in both a Disallow of the robots.txt and the sitemap — that contradiction confuses crawlers. The sitemap says "please index", the robots.txt says "don't fetch".
lastmod only with a real date. A faked modification date, or one set wholesale to "today", undermines crawlers' trust in your signal. Better to leave the field empty than to lie. The metadata coverage display in the checker shows you how consistently you use it.
changefreq and priority are hints, not commands. Don't set every changefreq to always and every priority to 1.0 — that devalues the signal. priority is only a ranking within your site anyway.
Split large sites. A single sitemap file may hold at most 50,000 URLs or 50 MB (uncompressed); beyond that, use multiple sitemaps plus an index file. The Check Sitemap tab warns you when you break the limit, and can drill an index file down child by child.
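
Two of the rules above, leaving out lastmod when there is no real date and splitting at 50,000 URLs, can be sketched as follows. This is a minimal example under assumptions: the example.com URLs, the date, and the sitemap-N.xml file names are placeholders, the helper names are hypothetical, and the sketch checks only the URL count, not the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit named in the text


def chunk(entries, limit=MAX_URLS_PER_FILE):
    """Split a list of (loc, lastmod) pairs into sitemap-sized lists."""
    return [entries[i:i + limit] for i in range(0, len(entries), limit)]


def build_sitemap(entries):
    """Serialize (loc, lastmod) pairs; lastmod is omitted when it is None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_index(sitemap_locs):
    """Serialize a sitemap index that points at the child sitemaps."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locs:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Small case: one page with a known modification date, one without.
entries = [
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/about/", None),
]
print(build_sitemap(entries).decode("utf-8"))

# Large case: 120,001 placeholder URLs need three files plus an index.
big = [(f"https://example.com/p/{n}", None) for n in range(120_001)]
parts = chunk(big)
index_xml = build_index(
    f"https://example.com/sitemap-{i}.xml" for i in range(1, len(parts) + 1)
)
print(len(parts), [len(p) for p in parts])  # 3 [50000, 50000, 20001]
```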

Privacy and operation

Generators stay local. The two generator tabs produce everything in the browser; your working state is stored only in your browser's LocalStorage, and nothing is uploaded. Reset clears it.
The checkers run via a proxy — by design. A browser can't load a foreign file directly because of CORS. The server-side proxy fetches it; the checked domain therefore sees a request from the JPKCom server, not your IP. Handy when you don't want to show up in a foreign site's crawler log.
localhost and intranet won't work. For SSRF protection the proxy blocks private and internal addresses. You check a local dev instance either via a public staging domain or via Expert Mode with a local proxy.
Wait out the rate limit briefly. In standard mode about one check every 3 seconds is possible. If a wait prompt appears, just wait a moment rather than re-firing.
Very large sitemaps may get truncated. The proxy fetches up to 5 MB; beyond that the checker reports that the statistics may be incomplete. For unbounded fetches there's Expert Mode.
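
To predict whether a URL is likely to be rejected for targeting a private or internal address, a rough pre-check can be written with Python's standard ipaddress module. This is only a sketch of the general idea behind SSRF filtering and is not the proxy's actual logic, which is not documented here; the function name would_be_blocked is hypothetical, and a real filter would also resolve hostnames via DNS before deciding.

```python
import ipaddress
from urllib.parse import urlsplit


def would_be_blocked(url: str) -> bool:
    """Guess whether a URL points at a loopback, private, or link-local host.

    Assumption: only literal IP addresses and localhost names are recognized;
    ordinary hostnames are treated as public because no DNS lookup is done.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, so assume a public hostname
    return ip.is_private or ip.is_loopback or ip.is_link_local


for candidate in (
    "http://localhost:8080/robots.txt",
    "http://192.168.1.10/sitemap.xml",
    "http://[::1]/robots.txt",
    "https://example.com/robots.txt",
):
    print(candidate, "->", "blocked" if would_be_blocked(candidate) else "ok")
```

If a URL comes out as blocked here, use a public staging domain or Expert Mode with a local proxy, as described above.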

Combining with other JPKCom tools

Build here, check in context. You build robots.txt and the sitemap in this tool; you see their effect on a concrete page in the SEO & GEO Analyzer, whose Robots Analysis tab and the checks Allowed by robots.txt, Sitemap in robots.txt, and AI Crawlers Allowed work on exactly what you produce here. Sequence: build here → upload → check the same URL in the analyzer → confirm that the checks are green.
Complement the GEO side. The sitemap is the roadmap for classic search engines; you create the machine-readable counterpart for LLMs with the llms.txt Generator.
Optimize the listed pages. Every URL in your sitemap should have clean metadata, which you can build with the Meta Tags Generator.

More context: the overview for the big picture, the manual for every option, and the examples for the step-by-step workflows. You can try all of it directly in the tool.