Ingestion¶

raggity indexes your content with rag ingest. Ingestion is incremental — hash-based, so only new or changed files are processed on subsequent runs.

Supported file types¶

Extension	Notes
`.md`	Markdown
`.txt`	Plain text
`.pdf`	Embedded text extraction via pypdf; falls back to OCR when text is absent
`.docx`	Word documents — included in base install
`.html`	HTML pages — included in base install
`.csv`	Parsed as key: value rows
`.pptx`	PowerPoint slides — included in base install
`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`	OCR via RapidOCR — requires `raggity[ocr]`

Document formats¶

.docx, .html, and .pptx are supported in the base install — no extra required.

The legacy raggity[docs] extra is a no-op alias kept for backwards compatibility.

OCR extra¶

For scanned PDFs and image files:

pip install raggity[ocr]

Adds RapidOCR + pypdfium2. raggity automatically OCRs a PDF page when embedded text is absent.

Local file ingestion¶

Configure glob patterns in raggity.toml:

[sources]
include = [
  "~/notes/**/*.md",
  "~/docs/**/*.pdf",
  "~/projects/**/*.txt",
]

Run ingestion:

rag ingest

Force a full rebuild from scratch:

rag reindex --force

Connectors¶

Web (`rag ingest-url`)¶

Requires the web extra:

pip install raggity[web]

# Fetch a single page
rag ingest-url https://docs.example.com/overview

# BFS-crawl same-domain links up to 2 hops deep
rag ingest-url https://docs.example.com --depth 2

Configure URLs for automatic ingestion on every rag ingest run:

[sources]
urls = ["https://docs.example.com/overview", "https://example.com/changelog"]

GitHub / Git repo (`rag ingest-repo`)¶

No extra install needed — uses stdlib subprocess + your local git.

# Index the default branch of a GitHub repo
rag ingest-repo https://github.com/owner/repo

# Pin to a specific branch or tag
rag ingest-repo https://github.com/owner/repo --ref main

All text files with supported extensions are read and indexed. The index path for each document is <repo_url>#<relpath> so you can trace sources back to the repository.

Obsidian vault (`rag ingest-obsidian`)¶

No extra install needed.

rag ingest-obsidian ~/Documents/MyVault

raggity walks all .md files in the vault recursively and normalises [[wikilink]] / [[link|alias]] syntax to plain text before indexing, so bracket noise does not pollute your chunks.

Watch daemon¶

Auto-reindex on file changes. Install the extra:

pip install raggity[watch]

Start the daemon:

rag watch

# Customise debounce delay (default 2 s)
rag watch --debounce 5.0

raggity monitors all paths in sources.include recursively. When files change, it triggers a debounced re-index — rapid filesystem events are coalesced into a single call. The daemon runs until you press Ctrl-C.

Chunking¶

raggity chunks documents into fixed-size pieces (default 256 tokens) for indexing. When retrieval.parent_document = true, each chunk retains a reference to its parent document (up to 1024 tokens), which is used during retrieval to pass broader context to the LLM.

Batch and parallel embedding¶

raggity automatically batches embedding calls and supports parallel workers:

[embedding]
batch_size = 256   # increase for faster ingest on large corpora
parallel = 0       # number of parallel workers (0 = auto)

Embedding cache¶

Avoid re-embedding unchanged chunks across ingest runs:

[embedding]
cache = true   # cache embeddings as JSON under the index directory

Cached embeddings are stored at <index.path>/embed_cache.json and looked up by content hash before calling the embedding model. Useful for large corpora with small diffs.

ANN auto-index¶

raggity automatically builds an Approximate Nearest Neighbor index on the vector store once the collection grows past a threshold (default 50 000 chunks):

[index]
ann_threshold = 50000

This keeps search latency flat as your knowledge base scales.