robots.txt & Sitemap — Manual

The complete feature reference for robots.txt & Sitemap: both generators with every option, the two live checkers, directives, sitemap fields, limits, and architecture.

Open the tool: www.jpkc.com/tools/robots-sitemap/

This manual describes robots.txt & Sitemap in full: each of the seven tabs, every option of the two generators, the behavior of the two live checkers, and the technical limits underneath. The tool's interface is in English, so the tab and button names appear here exactly as you'll see them.

Layout: seven tabs, four functions

The tool is split into seven tabs: robots.txt (generator), Sitemap (generator), Check robots.txt (live checker), Check Sitemap (live checker), Examples, Tips, and Reference. The two generators produce files; the two checkers fetch and analyze existing files from an external URL. Examples, Tips, and Reference are static reference tabs that work without any input.

robots.txt generator

The first tab builds a robots.txt with a live preview: every change in the form on the left updates the output editor on the right after a short delay.

User-agent blocks

User-agent blocks are the core of the generator. Use Add User-agent Block to add as many blocks as you like. Each block has:

  • a User-agent field with autocomplete (a datalist). It suggests 40-plus common bot names — search engines (Googlebot, Bingbot, DuckDuckBot, YandexBot …), social bots (Twitterbot, LinkedInBot, facebookexternalhit …), AI crawlers (GPTBot, ChatGPT-User, OAI-SearchBot, Google-Extended, anthropic-ai, Claude-Web, ClaudeBot, CCBot, Bytespider, PerplexityBot …), and SEO-tool crawlers (AhrefsBot, SemrushBot, MJ12bot …). You can type any other name too; * means "all bots".
  • a Rules area. Via Add Rule you add rule rows, each with a type select (Disallow or Allow) and a path field (e.g. /admin/). An empty Disallow: value means "nothing blocked" — everything allowed.
  • a Crawl-delay field (number, 0 to 3600 seconds, optional). It produces a Crawl-delay: line in the block.

Blocks can be reordered by drag and drop using the grip handle.

Sitemap URL and Host

Below the blocks are two optional fields:

  • Sitemap URL — appended at the end of the file as Sitemap: URL. You can reuse the address straight from the sitemap generator.
  • Host — a Yandex-specific directive for the preferred (canonical) domain, output as Host: domain.
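
As a rough illustration of how the form fields above map to the output, the following Python sketch renders a robots.txt from a list of blocks. The tool itself runs in the browser; the `Block` class, the `render_robots` function, and the position of the Host line are assumptions made for this example, not the tool's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    """One User-agent block as edited in the form (hypothetical model)."""
    user_agent: str = "*"
    rules: list[tuple[str, str]] = field(default_factory=list)  # ("Disallow" | "Allow", path)
    crawl_delay: Optional[int] = None  # seconds, 0 to 3600


def render_robots(blocks: list[Block], sitemap_url: str = "", host: str = "") -> str:
    lines = ["# robots.txt"]  # the real header also carries a "made with" note
    for block in blocks:
        lines.append("")
        lines.append(f"User-agent: {block.user_agent}")
        for kind, path in block.rules:
            # An empty path yields a bare "Disallow:", which blocks nothing.
            lines.append(f"{kind}: {path}".rstrip())
        if block.crawl_delay is not None:
            if not 0 <= block.crawl_delay <= 3600:
                raise ValueError("Crawl-delay must be between 0 and 3600 seconds")
            lines.append(f"Crawl-delay: {block.crawl_delay}")
    if host:
        # Assumption: Host is written before the Sitemap line.
        lines += ["", f"Host: {host}"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]  # appended at the end of the file
    return "\n".join(lines) + "\n"
```

Calling `render_robots([Block("*", [("Disallow", "")])])` would correspond to the default state of the form: one wildcard block with a single empty Disallow rule.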

Output, copy, download, import

On the right is the output editor (read-only). Every generated file starts with a comment header (# robots.txt and a note that it was made with JPKCom Tools). Above the editor:

  • Load File — reads an existing robots.txt and fills the form from it. The parser recognizes User-agent, Disallow, Allow, Crawl-delay, Sitemap, and Host; comment lines are skipped, unknown directives ignored (a notice lists them). When several user-agents share the same rules, a separate block is created for each.
  • Reset — resets the form to the default: one User-agent: * block with a single empty Disallow: rule.
  • Copy — copies the robots.txt to the clipboard.
  • Download — saves it as robots.txt (text/plain).
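
The import behavior described for Load File can be sketched in Python as follows. This is a simplified model under stated assumptions: the function name `parse_robots` and the returned dictionary layout are hypothetical, and the real parser may treat edge cases (such as inline comments) differently.

```python
KNOWN = {"user-agent", "disallow", "allow", "crawl-delay", "sitemap", "host"}


def parse_robots(text: str) -> dict:
    blocks: list[dict] = []   # one dict per user-agent, even when rules are shared
    current: list[dict] = []  # blocks of the group currently receiving rules
    in_agent_run = False      # True while reading consecutive User-agent lines
    result = {"blocks": blocks, "sitemaps": [], "host": "", "unknown": []}

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key not in KNOWN:
            if key not in result["unknown"]:
                result["unknown"].append(key)  # collected for the notice
            continue
        if key == "user-agent":
            if not in_agent_run:
                current = []  # a new group starts here
            block = {"user_agent": value, "rules": [], "crawl_delay": None}
            blocks.append(block)
            current.append(block)
            in_agent_run = True
        elif key == "sitemap":
            result["sitemaps"].append(value)
        elif key == "host":
            result["host"] = value
        else:
            in_agent_run = False
            for block in current:  # every agent of the group gets its own copy
                if key == "crawl-delay":
                    block["crawl_delay"] = int(value) if value.isdigit() else None
                else:
                    block["rules"].append((key.capitalize(), value))
    return result
```

Two consecutive User-agent lines followed by one Disallow rule come back as two separate blocks with the same rule, which mirrors the "separate block for each" behavior.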

The working state is automatically saved locally in the browser (LocalStorage) and restored on your next visit. On the very first load you see the default User-agent: * block.

Sitemap generator

The second tab builds an XML sitemap, also with a live preview.

Base URL and URL rows

  • Base URL (required) — the domain prepended to every relative path (e.g. https://example.com). A trailing slash is removed automatically.
  • URLs — via Add URL you add rows. Each row has:
    • Path — the path relative to the base URL (e.g. /about/); a missing leading / is added.
    • Last Modified — a date field; produces <lastmod> in YYYY-MM-DD format.
    • Freq. — a select for <changefreq>: always, hourly, daily, weekly, monthly, yearly, never (or empty).
    • Priority — a select for <priority>: 1.0 down to 0.0 in tenths (or empty).

Optional fields are only output when set. Rows can be reordered by drag and drop here too.
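
The rules above (trailing slash removed from the base URL, leading slash added to paths, optional fields written only when set) can be sketched in Python like this. The function name `build_sitemap` and the row dictionaries are assumptions for illustration; the tool's own generator runs as JavaScript in the browser.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, rows: list[dict]) -> str:
    base = base_url.rstrip("/")  # drop a trailing slash from the base URL
    urlset = ET.Element("urlset", xmlns=NS)
    for row in rows:
        path = row["path"]
        if not path.startswith("/"):
            path = "/" + path  # add a missing leading slash
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base + path
        for tag in ("lastmod", "changefreq", "priority"):
            value = row.get(tag)
            if value not in (None, ""):  # optional fields only when set
                ET.SubElement(url, tag).text = str(value)
    ET.indent(urlset)  # pretty-print; requires Python 3.9+
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"
```

For example, `build_sitemap("https://example.com/", [{"path": "about/", "lastmod": "2024-01-15"}])` produces a single `<url>` entry whose `<loc>` is `https://example.com/about/` and which has a `<lastmod>` but no `<changefreq>` or `<priority>`.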

Output and import

The editor on the right shows the finished sitemap.xml with the XML declaration and the correct urlset namespace. Buttons:

  • Load File — reads an existing sitemap.xml. The parser only accepts a regular <urlset>; sitemap index files (<sitemapindex>) are rejected here — that's what the Check Sitemap tab is for. Invalid XML or an unknown format is reported. The base URL is derived from the first <loc>.
  • Reset — resets to three default rows (/, /about/, /contact/).
  • Copy / Download — copies or saves the file as sitemap.xml (application/xml).
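
A minimal Python sketch of the Load File checks described above: accept only a `<urlset>`, reject a `<sitemapindex>`, report invalid XML, and derive the base URL from the first `<loc>`. The name `load_sitemap`, the error messages, and the returned structure are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def load_sitemap(xml_text: str) -> dict:
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"Invalid XML: {exc}") from exc
    if root.tag == NS + "sitemapindex":
        raise ValueError("Sitemap index files are not supported here")
    if root.tag != NS + "urlset":
        raise ValueError("Unknown format: expected a <urlset> root element")

    rows = []
    base = ""
    for url in root.findall(NS + "url"):
        loc = (url.findtext(NS + "loc") or "").strip()
        if not loc:
            continue
        parts = urlsplit(loc)
        if not base:
            base = f"{parts.scheme}://{parts.netloc}"  # taken from the first <loc>
        row = {"path": parts.path or "/"}
        for tag in ("lastmod", "changefreq", "priority"):
            value = url.findtext(NS + tag)
            if value:
                row[tag] = value.strip()
        rows.append(row)
    return {"base_url": base, "rows": rows}
```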

On first load three example URLs are pre-filled. The working state is also stored locally.

Live checker: Check robots.txt

This tab fetches an existing robots.txt from an external domain and analyzes it. You enter either just a domain (e.g. example.com; /robots.txt is appended automatically) or a full URL. If the protocol is missing, https:// is prepended. Clicking Check starts the fetch.
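
The input handling can be pictured with a small Python sketch. The function name `normalize_target` and its `default_path` parameter are assumptions for this example; the same idea applies to the sitemap checker with /sitemap.xml as the default path.

```python
from urllib.parse import urlsplit


def normalize_target(user_input: str, default_path: str = "/robots.txt") -> str:
    value = user_input.strip()
    if "://" not in value:
        value = "https://" + value  # missing protocol: assume HTTPS
    parts = urlsplit(value)
    if parts.path in ("", "/"):
        # Only a domain was given, so the default file name is appended.
        return f"{parts.scheme}://{parts.netloc}{default_path}"
    return value  # a full URL is used as entered
```

Under these assumptions, `normalize_target("example.com")` returns `https://example.com/robots.txt`.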

What you see in the result

  • A success message with the HTTP status, file size, and the number of user-agent blocks found.
  • Per-Bot Access — a table that checks, for 40-plus known bots, how the file treats each. Per bot: name, owner, type (Search, Social, Archive, AI, SEO Tool), Access (Allowed/Blocked), Source (Specific = a dedicated block, Wildcard * = via User-agent: *, Default = no matching rule), the Crawl-Delay, and the rule that actually applies. This is the fastest way to see whether you're accidentally locking out AI crawlers.
  • User-agent blocks — each block of the file individually, with its allow/disallow rules and a crawl-delay badge. A block without rules means full access.
  • Sitemaps declared — all Sitemap: directives declared in the file. Next to each is a check button that jumps straight to the Check Sitemap tab and checks that sitemap.
  • Test a URL against these rules — a small form: you enter a path (or URL) and a user-agent and get back whether the file allows or blocks access, which rule applies, and from which block. The defaults are path / and agent Googlebot.
  • Raw content — the raw file contents, collapsible and copyable.
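
To make the allow/block decision concrete, here is a hedged Python sketch of one way such a test can work. It assumes the precedence described in RFC 9309 (the longest matching path wins, and Allow wins a tie), matches bot names exactly and case-insensitively, and ignores the * and $ path wildcards. Whether the tool evaluates rules in exactly this way is not stated in this manual, and `check_access` is a hypothetical name.

```python
def check_access(blocks: list[dict], path: str, agent: str = "Googlebot") -> dict:
    agent_lc = agent.lower()
    specific = [b for b in blocks if b["user_agent"].lower() == agent_lc]
    wildcard = [b for b in blocks if b["user_agent"] == "*"]
    if specific:
        source, chosen = "Specific", specific
    elif wildcard:
        source, chosen = "Wildcard *", wildcard
    else:
        # No block applies to this agent: access is allowed by default.
        return {"allowed": True, "source": "Default", "rule": None}

    best = None  # (kind, rule_path) of the winning rule so far
    for block in chosen:
        for kind, rule_path in block["rules"]:
            if not rule_path or not path.startswith(rule_path):
                continue  # an empty Disallow blocks nothing
            longer = best is None or len(rule_path) > len(best[1])
            tie_for_allow = (best is not None and len(rule_path) == len(best[1])
                             and kind == "Allow")
            if longer or tie_for_allow:
                best = (kind, rule_path)
    if best is None:
        return {"allowed": True, "source": source, "rule": None}
    return {"allowed": best[0] == "Allow", "source": source,
            "rule": f"{best[0]}: {best[1]}"}
```

With a wildcard block containing `Disallow: /admin/` and `Allow: /admin/public/`, the path `/admin/public/a` is allowed because the Allow rule is the longer match, while `/admin/x` is blocked.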

Edge cases

If the checker finds no robots.txt (HTTP 404 or another error status), it says so and notes that without a robots.txt all crawlers are allowed by default. A present but empty file is classified the same way (it too allows all crawlers).

Live checker: Check Sitemap

The same flow for the sitemap.xml: enter a domain or full URL (/sitemap.xml is appended automatically) and click Check.

Regular sitemap (urlset)

  • A success message with the URL count and the file size.
  • Spec warnings: if the file exceeds 50,000 URLs or 50 MB, the tool points out that the Sitemaps protocol then requires splitting into multiple files plus an index file.
  • Metadata coverage — a table with progress bars: for lastmod, changefreq, and priority, at what percentage of URLs the field is set.
  • URLs — a table of the entries (the first 100 of N) with URL, lastmod, changefreq, and priority. Copy all URLs copies the complete list, not just the 100 shown.
  • Raw XML — the raw XML content, collapsible and copyable (the display is capped, the copy button delivers the full content).
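
The statistics in this result view amount to simple counting. The Python sketch below shows one way to compute them; the name `analyze_urlset`, the entry dictionaries, and the warning texts are assumptions, while the limits are those of the Sitemaps protocol (50,000 URLs, 50 MB uncompressed).

```python
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, uncompressed
PREVIEW_ROWS = 100            # the table shows the first 100 entries


def analyze_urlset(entries: list[dict], size_bytes: int) -> dict:
    total = len(entries)
    coverage = {}
    for field_name in ("lastmod", "changefreq", "priority"):
        filled = sum(1 for entry in entries if entry.get(field_name))
        # Percentage of URLs that have this optional field set.
        coverage[field_name] = round(100 * filled / total) if total else 0

    warnings = []
    if total > MAX_URLS:
        warnings.append("More than 50,000 URLs: split into several files plus an index")
    if size_bytes > MAX_BYTES:
        warnings.append("Larger than 50 MB: split into several files plus an index")

    return {
        "count": total,
        "coverage": coverage,
        "warnings": warnings,
        "shown": entries[:PREVIEW_ROWS],  # the copy action would still use all entries
    }
```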

Sitemap index

If the checker detects an index file (<sitemapindex>), it lists the child sitemaps with their optional lastmod instead. Each child sitemap has a Check button to drill into it individually. Here too there's a warning if the index lists more than 50,000 sitemaps.
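
Telling an index file apart from a regular sitemap comes down to the root element. A minimal Python sketch, assuming the standard sitemap namespace and using the hypothetical name `classify_sitemap`:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_CHILD_SITEMAPS = 50_000


def classify_sitemap(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        children = [
            {
                "loc": (sm.findtext(NS + "loc") or "").strip(),
                "lastmod": sm.findtext(NS + "lastmod"),  # optional, may be None
            }
            for sm in root.findall(NS + "sitemap")
        ]
        return {
            "kind": "index",
            "children": children,
            "too_many": len(children) > MAX_CHILD_SITEMAPS,
        }
    if root.tag == NS + "urlset":
        return {"kind": "urlset", "count": len(root.findall(NS + "url"))}
    return {"kind": "unknown"}
```

Each entry in `children` corresponds to one row with its own Check button in the result view.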

Edge cases

A 404 or other error status, an empty file, or content without a single <loc> element each produce an appropriate warning or error. If the file was truncated because of the proxy's size limit, a note points out that the statistics may be incomplete.

Examples, Tips, Reference

  • Examples — ready-made templates. For the robots.txt: Allow All Bots, Standard Website, WordPress, Block AI & Scraper Bots. For the sitemap: Simple Website, Blog, E-Commerce. A Load into Generator (or Load) button drops the template into the respective generator as a starting point.
  • Tips — compact best-practice cards ("Do This" / "Avoid This") for both files plus an explanation of how robots.txt and the sitemap work together.
  • Reference — the format specification: a table of all robots.txt directives with support notes, a table of more than 40 known bots with owner and type, a table of the sitemap XML elements, and links to the official specifications.

robots.txt directives (Reference)

| Directive | Support | Meaning |
| --- | --- | --- |
| User-agent | all | Target bot: * for all bots, or a specific name. |
| Disallow | all | Blocks a path. Empty value = everything allowed. |
| Allow | all | Permits a path, overriding a broader Disallow. |
| Crawl-delay | most | Seconds between requests. Not supported by Googlebot (use Search Console there). |
| Sitemap | all | Full sitemap URL. May appear multiple times. |
| Host | Yandex | Preferred domain (canonical host). |

Sitemap XML elements (Reference)

| Element | Required | Values / note |
| --- | --- | --- |
| <urlset> | yes | Root element with namespace http://www.sitemaps.org/schemas/sitemap/0.9. |
| <url> | yes | Container per entry. |
| <loc> | yes | Fully-qualified URL, max 2048 characters, URL-encoded. |
| <lastmod> | optional | Date in YYYY-MM-DD (W3C) format. Only set it when it is real; don't fake it. |
| <changefreq> | optional | always/hourly/daily/weekly/monthly/yearly/never (just a hint). |
| <priority> | optional | 0.0 to 1.0, default 0.5. Relative within your site, not a ranking factor against other sites. |

Architecture, limits, and privacy

  • Generators entirely client-side. The two generator tabs produce their files fully in the browser. Nothing is uploaded; the working state lives only in your LocalStorage.
  • Checkers via a server-side proxy. A browser generally can't read a file from another origin directly: the same-origin policy blocks the response unless the target sends CORS headers. So a server-side proxy on the JPKCom server fetches the file via cURL; the analysis then runs locally in the browser. The checked domain sees a request from the JPKCom server (with its user agent), not your IP address.
  • No public API. The two server-side endpoints (a fetch proxy and a token-based helper endpoint) are not a publicly callable API — they're used exclusively by the tool's JavaScript and hardened against abuse (token authentication, referer checks).
  • SSRF protection: internal, local, and private IP addresses are blocked, and every redirect hop is re-validated.
  • Fetch limits: at most 5 MB body (larger files are truncated — a note appears), 15 s timeout per fetch.
  • Rate limit: in standard proxy mode about one check every 3 seconds is possible; beyond that you get a prompt to wait a moment.
  • Expert Mode (optional). A toggle in the header enables a local proxy (http://127.0.0.1:<port>) that fetches the files directly — without the rate limit and size cap. Setup is advanced and not needed for normal use.
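
The SSRF protection mentioned above follows a common pattern: resolve the target host and refuse any address that is not publicly routable, then repeat the check for every redirect hop. The Python sketch below illustrates that pattern only. The actual proxy is server-side code that this manual does not show, so `is_public_address`, `validate_hop`, and the injectable `resolve` parameter are all assumptions.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip_text: str) -> bool:
    ip = ipaddress.ip_address(ip_text)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def validate_hop(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise ValueError if the URL points at an internal or non-HTTP target.

    Meant to be called for the initial URL and again for each redirect target.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"Refusing non-HTTP(S) target: {url}")
    for info in resolve(parts.hostname, None):
        address = info[4][0]  # sockaddr tuple: (address, port, ...)
        if not is_public_address(address):
            raise ValueError(f"Refusing internal address {address} for {url}")
```

A fetcher built on this idea would also enforce the 5 MB body cap and the 15 s timeout listed above, and would connect to the address it validated rather than resolving the name a second time.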

For the intro, the audiences, and the big picture, see the overview. Concrete walkthroughs are in the examples. You can try all of it directly in the tool.