robots.txt & Sitemap — Tips & Tricks
Insider moves for robots.txt & Sitemap: common pitfalls, AI-crawler strategy, how the two files work together, and combining it with the SEO & GEO Analyzer.
Back to the overview: robots.txt & Sitemap · Open the tool: www.jpkc.com/tools/robots-sitemap/
The manual explains every function; the examples show the workflows. This page covers what both assume: where the typical mistakes hide, how to treat AI crawlers strategically, and how to combine the tool sensibly with others. The interface is in English, so the real tab and button names appear as you'll see them.
robots.txt: the most dangerous pitfalls
- `Disallow: /` blocks your entire website. A single line in the wrong block locks out all crawlers completely, and with them your site from Google. Build the file in the generator, check the preview, and after upload verify in the Check robots.txt tab that `Googlebot` and co. are really `Allowed`.
- robots.txt is not security. The file is publicly readable and obeyed by crawlers only voluntarily. Never write secret paths into `Disallow` rules: you'd reveal them to anyone who opens `your-domain.com/robots.txt`. Whatever truly needs protecting belongs behind server-side authentication.
- Blocking is not the same as de-indexing. A page blocked in `robots.txt` can still show up in Google's results (without a snippet) when other pages link to it. To keep a page out of the index, use a `noindex` meta tag or the `X-Robots-Tag` HTTP header, and then do not block it via `robots.txt`, or Google won't see the `noindex` at all.
- Never block CSS and JavaScript. Google renders your pages to understand them. Block `/wp-content/` or an asset folder wholesale and your pages look broken, which is bad for ranking. Be specific (see the WordPress template in the Examples tab).
- `Crawl-delay` is ignored by Google. The directive is supported by most bots, but not by Googlebot; there you control crawl frequency via Search Console. For other bots a moderate value (e.g. 5–10 seconds) makes sense; don't overdo it or you needlessly slow down indexing.
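The first pitfall is easy to reproduce locally. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `is_allowed`, the sample rules, and the `example.com` URLs are illustrative assumptions. The standard-library parser uses simple prefix matching and does not replicate every detail of how Google evaluates rules, so treat it as a quick sanity check, not a substitute for the Check robots.txt tab.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A stray "Disallow: /" in the catch-all group locks out every crawler.
BLOCKS_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# A specific rule (hypothetical path) keeps the rest of the site crawlable.
BLOCKS_ONE_FOLDER = """\
User-agent: *
Disallow: /search/
"""

if __name__ == "__main__":
    samples = [("blocks everything", BLOCKS_EVERYTHING),
               ("blocks one folder", BLOCKS_ONE_FOLDER)]
    for name, text in samples:
        allowed = is_allowed(text, "Googlebot", "https://example.com/blog/post")
        print(f"{name}: Googlebot allowed on /blog/post -> {allowed}")
```

Running a check like this before upload catches the catch-all mistake early; the final verdict should still come from checking the live file.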
Steer AI crawlers deliberately
The real value of the tool over simple generators is its view on AI crawlers — both when building (40-plus bot suggestions in the autocomplete field) and when checking (the Per-Bot Access table grouped by type).
- Decide per purpose, not wholesale. The Reference table cleanly distinguishes AI Training (e.g. `GPTBot`, `Google-Extended`, `anthropic-ai`, `CCBot`) from AI Search (e.g. `OAI-SearchBot`, `PerplexityBot`). If you want to be cited in AI answers but not end up in training, block only the training crawlers and let the search crawlers through.
- `Google-Extended` ≠ Googlebot. `Google-Extended` blocks only Gemini training, not classic Google search. Anyone who blocks everything Google-like out of fear of "AI" accidentally throws away their normal visibility.
- Re-check after editing. Upload the finished file and check it in the Check robots.txt tab: the Per-Bot Access table shows you in black and white which AI crawler is now `Allowed` and which `Blocked`, including the rule that actually applies.
- Mind the GEO consequences. Every blocked AI crawler costs points in the SEO & GEO Analyzer's GEO score (whose `AI Crawlers Allowed` check tests against nine named bots). Blocking is a deliberate decision against AI visibility, not an accident that should happen in passing.
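The "block training, allow search" strategy can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's generator: the function names `build_robots_txt` and `access_table` are hypothetical, the bot lists contain only the examples named above, and the check again relies on Python's standard-library `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Bot names taken from the examples above; the lists are illustrative, not exhaustive.
AI_TRAINING_BOTS = ["GPTBot", "Google-Extended", "anthropic-ai", "CCBot"]
AI_SEARCH_BOTS = ["OAI-SearchBot", "PerplexityBot"]


def build_robots_txt(blocked_agents: list[str]) -> str:
    """Block each listed agent entirely and leave every other crawler unrestricted."""
    lines: list[str] = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /", ""]
    return "\n".join(lines)


def access_table(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each agent to True (allowed) or False (blocked) for the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    robots = build_robots_txt(AI_TRAINING_BOTS)
    agents = AI_TRAINING_BOTS + AI_SEARCH_BOTS + ["Googlebot"]
    for agent, allowed in access_table(robots, agents, "https://example.com/").items():
        print(f"{agent:16} {'Allowed' if allowed else 'Blocked'}")
```

Note that `Googlebot` stays allowed even though `Google-Extended` is blocked, which mirrors the distinction made in the list above.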
Sitemap: clean over complete
- Only canonical URLs that return HTTP 200. No duplicates with `?utm_*` or session parameters, no redirects, no 404s. Every dead or duplicate URL wastes crawl budget.
- No blocked or `noindex` pages in the sitemap. This is the golden rule: a URL never belongs in both a `Disallow` of the `robots.txt` and the sitemap, because that contradiction confuses crawlers. The sitemap says "please index", the `robots.txt` says "don't fetch".
- `lastmod` only with a real date. A faked modification date, or one set wholesale to "today", undermines crawlers' trust in your signal. Better to leave the field empty than to lie. The metadata coverage display in the checker shows you how consistently you use it.
- `changefreq` and `priority` are hints, not commands. Don't set every `changefreq` to `always` and every `priority` to `1.0`; that devalues the signal. `priority` is only a ranking within your site anyway.
- Split large sites. A single sitemap file is limited to 50,000 URLs or 50 MB; beyond that, use multiple sitemaps plus an index file. The Check Sitemap tab warns you when you break the limit, and can drill an index file down child by child.
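Several of these rules can be applied mechanically before a sitemap is written. The sketch below is a hedged illustration, not the tool's logic: the function names and the `SESSION_PARAMS` set are hypothetical, and it covers only tracking parameters, exact duplicates, robots.txt conflicts, and the 50,000-URL split. Checking HTTP status codes, `noindex`, and the 50 MB size limit would need additional steps that are left out here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit named in the text
# Hypothetical session parameter names; adjust to what your site actually uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def has_tracking_params(url: str) -> bool:
    """True if the query string carries utm_* or session parameters."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    keys = [key.lower() for key, _ in pairs]
    return any(key.startswith("utm_") or key in SESSION_PARAMS for key in keys)


def sitemap_urls(candidates: list[str], robots_txt: str) -> list[str]:
    """Keep unique URLs without tracking parameters that robots.txt does not block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    seen: set[str] = set()
    kept: list[str] = []
    for url in candidates:
        if url in seen or has_tracking_params(url) or not parser.can_fetch("*", url):
            continue
        seen.add(url)
        kept.append(url)
    return kept


def split_into_files(urls: list[str], limit: int = MAX_URLS_PER_FILE) -> list[list[str]]:
    """Chunk the URL list so that no sitemap file exceeds the per-file limit."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def render_urlset(urls: list[str]) -> str:
    """Serialize one sitemap file; lastmod is omitted rather than faked."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Each chunk returned by `split_into_files` would become one sitemap file, with an index file listing them.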
Privacy and operation
- Generators stay local. The two generator tabs produce everything in the browser; your working state is only stored in your LocalStorage, nothing is uploaded. Reset clears it.
- The checkers run via a proxy — by design. A browser can't load a foreign file directly because of CORS. The server-side proxy fetches it; the checked domain therefore sees a request from the JPKCom server, not your IP. Handy when you don't want to show up in a foreign site's crawler log.
- `localhost` and intranet won't work. For SSRF protection the proxy blocks private and internal addresses. You check a local dev instance either via a public staging domain or via Expert Mode with a local proxy.
- Wait out the rate limit briefly. In standard mode about one check every 3 seconds is possible. If a wait prompt appears, just wait a moment rather than re-firing.
- Very large sitemaps may get truncated. The proxy fetches up to 5 MB; beyond that the checker reports that the statistics may be incomplete. For unbounded fetches there's Expert Mode.
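To see why `localhost` and intranet targets are rejected, here is a rough sketch of the kind of check an SSRF guard typically performs. It is not the JPKCom proxy's actual implementation; the function name `is_internal_target` is hypothetical, and the sketch only inspects the host as written in the URL. A production proxy would also resolve hostnames and check every resulting IP address.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_target(url: str) -> bool:
    """Return True if the URL's host is localhost or a non-public IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no usable host: refuse rather than guess
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A regular hostname. This sketch stops here; a real guard would
        # resolve the name and test each resolved address as well.
        return False
    return address.is_private or address.is_loopback or address.is_link_local


if __name__ == "__main__":
    for target in ["http://localhost:3000/robots.txt",
                   "http://192.168.1.10/sitemap.xml",
                   "https://example.com/robots.txt"]:
        verdict = "blocked" if is_internal_target(target) else "fetchable"
        print(f"{target} -> {verdict}")
```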
Combining with other JPKCom tools
- Build here, check in context. You build `robots.txt` and the sitemap in this tool; you see the effect on a concrete page in the SEO & GEO Analyzer: its Robots Analysis tab and the checks `Allowed by robots.txt`, `Sitemap in robots.txt`, and `AI Crawlers Allowed` work on exactly what you produce here. Sequence: build here → upload → check the same URL in the analyzer → read it green.
- Complement the GEO side. The sitemap is the roadmap for classic search engines; you create the machine-readable counterpart for LLMs with the llms.txt Generator.
- Optimize the listed pages. Every URL in your sitemap should have clean metadata, which you build with the Meta Tags Generator.
More context: the overview for the big picture, the manual for every option, and the examples for the step-by-step workflows. You can try all of it directly in the tool.