Skip to content
kumo

CLI

Every command and subcommand, with the flags that matter.

kumo <command> [arguments] [flags]

Run kumo <command> --help for the full flag list on any command.

Commands

Command What it does
scrape Crawl a whole host into structured page records (aliases: domain, site, crawl)
page Fetch and structure a single page
links List the outbound links of a page, as URIs
sitemap Discover the crawlable URL frontier without fetching pages
pages List the pages already stored on disk for a host
version Print the version and exit

scrape

Crawl every page of a host, convert each to a structured record, and write the result into the data tree as pages://<host>/<path>.

kumo scrape example.com
kumo scrape example.com --max-pages 500 --scope-prefix /docs
kumo scrape example.com --include-subdomains -o jsonl
kumo scrape example.com --dry-run
Flag Meaning
--max-pages N Stop after N pages (0 = unlimited)
--max-depth N Link-depth cap from the seed (0 = unlimited)
--traversal bfs|dfs Frontier order (default bfs)
--include-subdomains Also crawl subdomains of the host
--scope-prefix /path Restrict to URLs whose path starts with this
--exclude-path /path Skip URLs whose path contains this (repeatable)
--include-search Crawl site-search routes too
--no-robots Do not fetch or honor robots.txt
--no-sitemap Do not seed the frontier from sitemaps
--raw-md Probe for a <page>.md Markdown sibling on docs sites

page

Fetch one URL and emit it as a structured record. This is the pointwise form of scrape and writes no tree.

kumo page https://example.com/docs/intro
kumo page https://example.com/docs/intro -o jsonl | jq -r .content   # the Markdown body

List the outbound links of a single page, resolved to absolute URLs (on-host links as pages:// URIs).

kumo links https://example.com/docs/intro

sitemap

Discover the URLs a crawl would start from, by reading robots.txt and the host's sitemaps. No page content is fetched.

kumo sitemap example.com
kumo sitemap example.com -o url | head

pages

Read the data tree for a host and emit the page records already crawled. This makes no network request.

kumo pages example.com
kumo pages example.com -o jsonl

Global flags

These apply to every command:

Flag Meaning
-o, --output Output format: table, json, jsonl, csv, tsv, url
-n, --limit Cap the number of records emitted
-j, --workers Concurrent fetchers (default 16)
--out Data tree root (default $HOME/data)
--rate Minimum gap between requests
--retries Retry attempts per request
--timeout Per-request timeout
--dry-run Plan the work, fetch nothing
-v, --verbose / --quiet Adjust logging