# robots.txt & Sitemap — Tips & Tricks

> Insider moves for robots.txt & Sitemap: common pitfalls, AI-crawler strategy, how the two files work together, and combining it with the SEO & GEO Analyzer.

Source: https://www.jpkc.com/db/en/tools/robots-sitemap/tips/

The [manual](https://www.jpkc.com/db/en/tools/robots-sitemap/manual/) explains every function; the [examples](https://www.jpkc.com/db/en/tools/robots-sitemap/examples/) show the workflows. This page covers what both assume: where the typical mistakes hide, how to treat AI crawlers strategically, and how to combine the tool sensibly with others. The interface is in English, so the real tab and button names appear as you'll see them.

## robots.txt: the most dangerous pitfalls

- **`Disallow: /` blocks your entire website.** A single line in the wrong block locks out every crawler, so Google can no longer crawl your site. Build the file in the generator, check the preview, and after upload verify in the **Check robots.txt** tab that `Googlebot` and the other crawlers you care about are really `Allowed`.
- **robots.txt is not security.** The file is publicly readable and obeyed by crawlers only **voluntarily**. Never write secret paths into `Disallow` rules — you'd reveal them to anyone who opens `your-domain.com/robots.txt`. Whatever truly needs protecting belongs behind server-side authentication.
- **Blocking is not the same as de-indexing.** A page blocked in `robots.txt` can still show up in Google's results (without a snippet) when other pages link to it. To keep a page out of the index, use a `noindex` meta tag or the `X-Robots-Tag` HTTP header — and then do **not** block it via `robots.txt`, or Google won't see the `noindex` at all.
- **Never block CSS and JavaScript.** Google renders your pages to understand them. Block `/wp-content/` or an asset folder wholesale and Googlebot sees your pages without styles and scripts, so they look broken to it, which is bad for ranking. Be specific (see the WordPress template in the **Examples** tab).
- **`Crawl-delay` is ignored by Google.** Some crawlers (Bingbot, for example) honor the directive, but **Googlebot does not**. Google adjusts its crawl rate on its own, and the old crawl-rate setting in Search Console was retired in early 2024. For bots that do honor the directive, a moderate value (e.g. 5–10 seconds) makes sense; don't overdo it or you needlessly slow down indexing.
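
The first pitfall is easy to test locally before you upload anything. The following is a minimal sketch using Python's standard `urllib.robotparser`; the two sample files, the `BadBot` name, and the `example.com` URLs are made up for illustration. Note that this parser applies rules in file order, which can differ from Google's longest-match behavior on more complex files, so treat it as a sanity check and not as a replacement for the **Check robots.txt** tab.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Pitfall: the Disallow line ended up in the wildcard block.
BROKEN = """\
User-agent: *
Disallow: /
"""

# Intended: only one (hypothetical) bot is locked out completely.
INTENDED = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    for name, text in (("broken", BROKEN), ("intended", INTENDED)):
        ok = allowed(text, "Googlebot", "https://example.com/")
        print(f"{name}: Googlebot {'Allowed' if ok else 'Blocked'}")
```

Running the same check against the file you are about to upload takes seconds and catches a misplaced `Disallow: /` before a crawler does.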

## Steer AI crawlers deliberately

What sets the tool apart from simple generators is its handling of AI crawlers, both when building (40-plus bot suggestions in the autocomplete field) and when checking (the Per-Bot Access table grouped by type).

- **Decide per purpose, not wholesale.** The **Reference** table cleanly distinguishes *AI Training* (e.g. `GPTBot`, `Google-Extended`, `anthropic-ai`, `CCBot`) from *AI Search* (e.g. `OAI-SearchBot`, `PerplexityBot`). If you want to be cited in AI *answers* but not end up in *training*, block only the training crawlers and let the search crawlers through.
- **`Google-Extended` ≠ Googlebot.** Disallowing `Google-Extended` opts you out of Gemini training only; it has no effect on classic Google Search. Anyone who blocks everything Google-like out of fear of "AI" accidentally throws away their normal visibility.
- **Re-check after editing.** Upload the finished file and check it in the **Check robots.txt** tab: the Per-Bot Access table shows you in black and white which AI crawler is now `Allowed` and which `Blocked` — including the rule that actually applies.
- **Mind the GEO consequences.** Every blocked AI crawler costs points in the [SEO & GEO Analyzer](https://www.jpkc.com/db/en/tools/seo/)'s GEO score (whose `AI Crawlers Allowed` check tests against nine named bots). Blocking is a deliberate decision against AI visibility — not an accident that should happen in passing.
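
The "block training, allow search" strategy can be sketched in a few lines. The bot names below are the examples named in this section; the helper names `build_robots_txt` and `access_table` and the `example.com` URL are assumptions for illustration, and the output only approximates what the Per-Bot Access table reports.

```python
from urllib.robotparser import RobotFileParser

# Example bots taken from the text above, grouped by purpose.
TRAINING_BOTS = ["GPTBot", "Google-Extended", "anthropic-ai", "CCBot"]
SEARCH_BOTS = ["OAI-SearchBot", "PerplexityBot"]


def build_robots_txt(blocked: list[str]) -> str:
    """One Disallow block per blocked bot, everything else stays open."""
    lines: list[str] = []
    for bot in blocked:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def access_table(robots_txt: str, bots: list[str],
                 url: str = "https://example.com/") -> dict[str, bool]:
    """Map each bot name to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    robots = build_robots_txt(TRAINING_BOTS)
    bots = TRAINING_BOTS + SEARCH_BOTS + ["Googlebot"]
    for bot, ok in access_table(robots, bots).items():
        print(f"{bot:16} {'Allowed' if ok else 'Blocked'}")
```

`Googlebot` is included in the check on purpose: it should stay `Allowed` even though `Google-Extended` is blocked.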

## Sitemap: clean over complete

- **Only canonical URLs that return HTTP 200.** No duplicates with `?utm_*` or session parameters, no redirects, no 404s. Every dead or duplicate URL wastes crawl budget.
- **No blocked or `noindex` pages in the sitemap.** This is the **golden rule**: a URL never belongs in both a `Disallow` of the `robots.txt` **and** the sitemap — that contradiction confuses crawlers. The sitemap says "please index", the `robots.txt` says "don't fetch".
- **`lastmod` only with a real date.** A faked modification date, or one set wholesale to "today", undermines crawlers' trust in your signal. Better to leave the field empty than to lie. The **metadata coverage** display in the checker shows you how consistently you use it.
- **`changefreq` and `priority` are hints, not commands.** Don't set every `changefreq` to `always` and every `priority` to `1.0` — that devalues the signal. `priority` is only a ranking within your site anyway.
- **Split large sites.** A single sitemap file may hold at most 50,000 URLs and 50 MB (uncompressed). Beyond that, use multiple sitemaps plus an index file. The **Check Sitemap** tab warns you when you break the limit, and can drill an index file down child by child.
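
Three of these rules lend themselves to an automated pre-upload check: tracking parameters, URLs that `robots.txt` disallows, and the 50,000-URL limit. Here is a rough sketch under stated assumptions: the function names are hypothetical, the sitemap is passed in as a string, and checks that need network access (HTTP 200, redirects, `noindex`) are deliberately left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS_PER_SITEMAP = 50_000


def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> value from a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)
            if loc.text]


def lint_sitemap(xml_text: str, robots_txt: str,
                 agent: str = "Googlebot") -> list[tuple[str, str]]:
    """Return (url, problem) pairs for URLs that break the rules above."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in sitemap_urls(xml_text):
        keys = [key for key, _ in parse_qsl(urlsplit(url).query)]
        if any(key.startswith("utm_") for key in keys):
            problems.append((url, "tracking parameter"))
        if not parser.can_fetch(agent, url):
            problems.append((url, "disallowed in robots.txt"))
    return problems


def split_into_sitemaps(urls: list, limit: int = MAX_URLS_PER_SITEMAP) -> list:
    """Chunk a URL list so that no single sitemap exceeds the limit."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]
```

Each chunk returned by `split_into_sitemaps` would become its own sitemap file, with one index file pointing at all of them.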

## Privacy and operation

- **Generators stay local.** The two generator tabs produce everything in the browser; your working state is only stored in your LocalStorage, nothing is uploaded. **Reset** clears it.
- **The checkers run via a proxy — by design.** A browser can't load a foreign file directly because of CORS. The server-side proxy fetches it; the checked domain therefore sees a request from the JPKCom server, **not your IP**. Handy when you don't want to show up in a foreign site's crawler log.
- **`localhost` and intranet won't work.** For SSRF protection the proxy blocks private and internal addresses. To check a local dev instance, use either a public staging domain or Expert Mode with a local proxy.
- **Wait out the rate limit briefly.** In standard mode about one check every 3 seconds is possible. If a wait prompt appears, just wait a moment rather than re-firing.
- **Very large sitemaps may get truncated.** The proxy fetches up to 5 MB; beyond that the checker reports that the statistics may be incomplete. For unbounded fetches there's Expert Mode.
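
To see why `localhost` and intranet hosts are rejected, it helps to look at what an SSRF guard typically checks. The sketch below is an illustration only, not the tool's actual implementation: the function name is hypothetical, and a real proxy would also resolve hostnames and check every resolved address.

```python
import ipaddress


def is_blocked_target(host: str) -> bool:
    """Return True for hosts an SSRF guard would typically refuse."""
    name = host.lower().rstrip(".")
    if name == "localhost" or name.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(name)
    except ValueError:
        # A regular hostname: this sketch does no DNS lookup, so it
        # cannot tell whether the name resolves to an internal address.
        return False
    return (ip.is_private or ip.is_loopback
            or ip.is_link_local or ip.is_reserved)


if __name__ == "__main__":
    for host in ("localhost", "127.0.0.1", "192.168.1.10", "8.8.8.8"):
        print(host, "blocked" if is_blocked_target(host) else "ok")
```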

## Combining with other JPKCom tools

- **Build here, check in context.** You build `robots.txt` and the sitemap in this tool; you see their effect on a concrete page in the [SEO & GEO Analyzer](https://www.jpkc.com/db/en/tools/seo/): its *Robots Analysis* tab and the checks `Allowed by robots.txt`, `Sitemap in robots.txt`, and `AI Crawlers Allowed` work on exactly what you produce here. Sequence: build here → upload → check the same URL in the analyzer → confirm that the checks are green.
- **Complement the GEO side.** The sitemap is the roadmap for classic search engines; you create its machine-readable counterpart for LLMs with the **[llms.txt Generator](https://www.jpkc.com/db/en/tools/llms/)**.
- **Optimize the listed pages.** Every URL in your sitemap should have clean metadata, which you build with the **[Meta Tags Generator](https://www.jpkc.com/db/en/tools/meta-tags/)**.
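
Before uploading, you can approximate two of the analyzer checks locally. This is a sketch, not the analyzer's own logic: the `preflight` function is hypothetical, it borrows the check names from the text only as dictionary keys, and it relies on `RobotFileParser.site_maps()`, which requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def preflight(robots_txt: str, page_url: str,
              agent: str = "Googlebot") -> dict[str, bool]:
    """Rough local stand-in for two checks the analyzer runs after upload."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # True when the rules let `agent` fetch the page.
        "Allowed by robots.txt": parser.can_fetch(agent, page_url),
        # site_maps() returns None when no Sitemap: line is present.
        "Sitemap in robots.txt": bool(parser.site_maps()),
    }


if __name__ == "__main__":
    robots = (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "\n"
        "Sitemap: https://example.com/sitemap.xml\n"
    )
    for check, ok in preflight(robots, "https://example.com/blog/").items():
        print(f"{check}: {'pass' if ok else 'fail'}")
```

If both values are `True` locally, the corresponding analyzer checks are likely to pass after upload as well; the analyzer remains the authoritative result.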

---

More context: the [overview](https://www.jpkc.com/db/en/tools/robots-sitemap/) for the big picture, the [manual](https://www.jpkc.com/db/en/tools/robots-sitemap/manual/) for every option, and the [examples](https://www.jpkc.com/db/en/tools/robots-sitemap/examples/) for the step-by-step workflows. You can try all of it directly in the [tool](https://www.jpkc.com/tools/robots-sitemap/).

