
Configuration

raggity is configured via a raggity.toml file in the current directory (or pass --config PATH to any command).

Copy the example to get started:

cp raggity.example.toml raggity.toml

[sources]

[sources]
include = ["~/notes/**/*.md", "~/docs/**/*.pdf"]
exclude = ["**/drafts/**", "**/*.tmp.md"]
urls = ["https://docs.example.com/overview"]

| Key | Default | Description |
| --- | --- | --- |
| include | [] | Glob patterns for local files to index |
| exclude | [] | Glob patterns (fnmatch on the posix path) to skip. Applied on top of built-in junk-dir pruning |
| urls | [] | URLs to fetch and index on every rag ingest run |

Built-in junk directories are always pruned when they appear below an include pattern's static (pre-glob) prefix: AppData, node_modules, .git, __pycache__, site-packages, .venv, venv, dist-packages, .raggity, .npm, .nuget, .gradle, .cargo, .conda. This stops a broad pattern such as **/*.txt, run from your home directory, from sweeping up caches and dependency trees. A pattern whose static prefix already points inside one of these directories still works.
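To make the interaction between exclude and junk-dir pruning concrete, here is a minimal Python sketch. It is not raggity's implementation: the helper names (static_prefix, is_pruned, is_excluded) are hypothetical, ~ expansion is left out, and only the junk-dir list and the fnmatch-on-POSIX-path behaviour come from the text above.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Junk-dir names copied from the paragraph above.
JUNK_DIRS = {
    "AppData", "node_modules", ".git", "__pycache__", "site-packages",
    ".venv", "venv", "dist-packages", ".raggity", ".npm", ".nuget",
    ".gradle", ".cargo", ".conda",
}
GLOB_CHARS = set("*?[")


def static_prefix(pattern: str) -> tuple:
    """Return the leading path parts of a pattern that contain no glob characters."""
    prefix = []
    for part in PurePosixPath(pattern).parts:
        if GLOB_CHARS & set(part):
            break
        prefix.append(part)
    return tuple(prefix)


def is_pruned(path: str, pattern: str) -> bool:
    """True if a junk dir appears in the path below the pattern's static prefix."""
    prefix = static_prefix(pattern)
    parts = PurePosixPath(path).parts
    if parts[: len(prefix)] != prefix:
        return False  # path is not under this pattern's prefix at all
    # Only directories below the prefix count; the file name itself is ignored.
    below = parts[len(prefix):-1]
    return any(part in JUNK_DIRS for part in below)


def is_excluded(path: str, exclude: list) -> bool:
    """True if the POSIX path matches any exclude pattern (fnmatch semantics)."""
    return any(fnmatchcase(path, pat) for pat in exclude)
```

Under these assumptions, /home/u/proj/node_modules/pkg/readme.txt is pruned for the pattern /home/u/**/*.txt but kept for /home/u/proj/node_modules/pkg/**/*.txt, because in the second case node_modules is part of the static prefix rather than below it.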


[index]

[index]
path = ".raggity/index"       # relative to the current working directory
backend = "lancedb"           # or "qdrant"
ann_threshold = 50000         # build ANN index once chunk count exceeds this

# Qdrant-specific (only when backend = "qdrant")
# qdrant_location = ":memory:"             # default — ephemeral in-memory
# qdrant_location = "http://localhost:6333" # served Qdrant instance
qdrant_collection = "raggity"
# qdrant_api_key = "..."      # or set QDRANT_API_KEY env var

| Key | Default | Description |
| --- | --- | --- |
| path | ".raggity/index" | Directory for LanceDB data and caches (relative to cwd) |
| backend | "lancedb" | Vector store backend: "lancedb" or "qdrant" |
| ann_threshold | 50000 | Chunk count above which ANN index is built automatically |
| qdrant_location | ":memory:" | Qdrant location: ":memory:" (default, ephemeral in-process), "http://localhost:6333" (served instance), or a local path |
| qdrant_collection | "raggity" | Qdrant collection name |
| qdrant_api_key | "" | Qdrant API key (or set QDRANT_API_KEY) |
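The three forms of qdrant_location can be thought of as selecting different constructor arguments for a Qdrant client. The sketch below is an illustration under stated assumptions, not raggity's code: qdrant_client_kwargs is a hypothetical helper, the keyword names (location, url, path, api_key) follow the qdrant-client package's QdrantClient constructor, and giving the config value precedence over the QDRANT_API_KEY environment variable is a guess.

```python
import os


def qdrant_client_kwargs(location: str = ":memory:", api_key: str = "") -> dict:
    """Map [index] settings to keyword arguments for a Qdrant client (sketch)."""
    # Assumption: an explicit config value wins over the environment variable.
    key = api_key or os.environ.get("QDRANT_API_KEY", "")

    if location == ":memory:":
        # Ephemeral in-process store; nothing persists between runs.
        return {"location": ":memory:"}

    if location.startswith(("http://", "https://")):
        # Served instance; send the API key only if one is configured.
        kwargs = {"url": location}
        if key:
            kwargs["api_key"] = key
        return kwargs

    # Anything else is treated as a local on-disk path.
    return {"path": location}
```

The resulting dictionary could then be passed as QdrantClient(**kwargs) if the qdrant-client package is installed.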

[embedding]

[embedding]
model = "BAAI/bge-small-en-v1.5"   # default lightweight model
provider = "cpu"                    # cpu / cuda / directml / rocm
batch_size = 256
# parallel = 4                      # omit for in-process (default); N = worker pool
cache = false                       # cache embeddings as JSON

| Key | Default | Description |
| --- | --- | --- |
| model | "BAAI/bge-small-en-v1.5" | fastembed model name |
| provider | "cpu" | ONNX Runtime execution provider |
| batch_size | 256 | Embedding batch size |
| parallel | (unset) | Embedding worker pool size. Default (unset/None) = in-process single model, the stable path. A positive N spawns N multiprocessing workers (each loads its own ONNX model); avoid on memory-constrained Windows |
| cache | false | Cache embeddings by content hash to avoid re-embedding unchanged chunks |
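The cache option can be pictured as a lookup keyed on a hash of each chunk's content. The following is a minimal sketch rather than raggity's actual cache: content_key and embed_with_cache are hypothetical names, SHA-256 is an assumed hash, and including the model name in the key is an assumption made so that vectors from different models never collide.

```python
import hashlib
import json
from pathlib import Path


def content_key(model: str, text: str) -> str:
    """Hash the model name and chunk text into a stable cache key."""
    return hashlib.sha256(f"{model}\0{text}".encode("utf-8")).hexdigest()


def embed_with_cache(chunks, embed_fn, cache: dict, model: str):
    """Embed chunks, calling embed_fn only for content not already cached."""
    keys = [content_key(model, chunk) for chunk in chunks]
    # Deduplicate by key so repeated chunks are embedded once.
    missing = {k: c for k, c in zip(keys, chunks) if k not in cache}
    if missing:
        vectors = embed_fn(list(missing.values()))
        for key, vector in zip(missing, vectors):
            cache[key] = list(vector)
    return [cache[k] for k in keys]


def save_cache(cache: dict, path: str) -> None:
    """Persist the cache as JSON, matching the 'cache embeddings as JSON' comment."""
    Path(path).write_text(json.dumps(cache), encoding="utf-8")


def load_cache(path: str) -> dict:
    """Load a previously saved cache, or start empty if none exists."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
```

With this shape, re-ingesting an unchanged file costs one hash per chunk instead of one model call per chunk.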

GPU acceleration

[embedding]
provider = "cpu"        # default — works everywhere
# provider = "cuda"     # NVIDIA (CUDA 11/12)
# provider = "directml" # Windows — AMD, Intel, NVIDIA via DirectML
# provider = "rocm"     # Linux — AMD ROCm
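For orientation, the short provider values above correspond to ONNX Runtime execution provider names. The mapping below uses the real ONNX Runtime identifiers, but the helper resolve_providers and the CPU fallback are assumptions made for illustration; raggity may resolve providers differently.

```python
# ONNX Runtime execution provider identifiers for each config value.
ORT_PROVIDER_NAMES = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "directml": "DmlExecutionProvider",
    "rocm": "ROCMExecutionProvider",
}


def resolve_providers(provider: str, available=None) -> list:
    """Return an ordered provider list, falling back to CPU (assumed behaviour)."""
    try:
        preferred = ORT_PROVIDER_NAMES[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None

    cpu = ORT_PROVIDER_NAMES["cpu"]
    chain = [preferred] if preferred == cpu else [preferred, cpu]

    # 'available' could come from onnxruntime.get_available_providers();
    # it is passed in here so the sketch runs without onnxruntime installed.
    if available is not None:
        chain = [name for name in chain if name in available]
    return chain
```

If the GPU provider is missing from the installed ONNX Runtime build, this sketch silently degrades to CPU; a real tool might prefer to warn or fail instead.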

Larger embedding model

For higher embedding quality (768-dim, 8k context, Matryoshka scaling):

[embedding]
model = "nomic-embed-text-v1.5-Q"

Warning

Changing embedding.model triggers an automatic full index rebuild.


[retrieval]

[retrieval]
candidates = 30
top_k = 5
rerank = true
rerank_model = "Xenova/ms-marco-MiniLM-L-6-v2"
sufficiency_floor = 0.5   # dense-cosine threshold — governs abstention
relevance_floor = 0.0     # optional rerank-score filter (0.0 = off)
hybrid = true
dedup_cosine = 0.92
rrf_k = 60
parent_document = false
hyde = false
step_back = false
expand_n = 3
graph = false
graph_hops = 1

| Key | Default | Description |
| --- | --- | --- |
| candidates | 30 | Chunks fetched from each retriever before fusion |
| top_k | 5 | Chunks passed to the LLM after all filtering |
| rerank | true | Enable cross-encoder reranking |
| rerank_model | "Xenova/ms-marco-MiniLM-L-6-v2" | ONNX cross-encoder model |
| sufficiency_floor | 0.5 | Dense-cosine similarity threshold below which raggity abstains ("I don't have enough information") |
| relevance_floor | 0.0 | Optional secondary filter on the sigmoid-normalised cross-encoder rerank score (0.0 = off); does not trigger abstention |
| hybrid | true | Enable hybrid (dense + BM25) retrieval |
| dedup_cosine | 0.92 | Cosine similarity threshold for chunk deduplication |
| rrf_k | 60 | RRF fusion constant (higher = flatter curve) |
| parent_document | false | Expand matched chunks to parent documents |
| hyde | false | Enable HyDE query transform permanently |
| step_back | false | Enable step-back query transform permanently |
| expand_n | 3 | Number of query variations for --expand |
| graph | false | Enable GraphRAG knowledge-graph augmentation |
| graph_hops | 1 | BFS hops from matched entities in the graph |
| graph_concurrency | 8 | Concurrent LLM extraction calls during rag graph-build (lower it for strict-rate-limit backends) |
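The role of rrf_k is easiest to see in the standard reciprocal rank fusion formula, where each retriever contributes 1 / (k + rank) per chunk. This is a generic sketch of that formula, assuming raggity's hybrid mode fuses the dense and BM25 rankings this way; rrf_fuse is a hypothetical name.

```python
def rrf_fuse(rankings, k: int = 60):
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every id it contains, with
    ranks starting at 1. A larger k shrinks the gap between adjacent ranks,
    which is the "flatter curve" mentioned in the table.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["a", "b", "c"]   # ranking from the dense retriever
bm25 = ["b", "d", "a"]    # ranking from BM25
fused_ids = [chunk_id for chunk_id, _ in rrf_fuse([dense, bm25], k=60)]
```

Here fused_ids is ["b", "a", "d", "c"]: chunks found by both retrievers outrank chunks found by only one, even when the latter sit higher in a single list.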

Heavy reranker

[retrieval]
rerank_model = "BAAI/bge-reranker-v2-m3"   # ~1 GB, higher quality
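Whichever reranker is configured, sufficiency_floor and relevance_floor act on different scores. The sketch below illustrates that split under explicit assumptions: select_chunks and the candidate dictionary keys are hypothetical, abstention is assumed to depend on the best dense-cosine score among candidates, and the reranker is assumed to emit a raw logit that is passed through a sigmoid.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a logit into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def select_chunks(candidates, sufficiency_floor=0.5, relevance_floor=0.0, top_k=5):
    """Return chunks for the LLM, or None to signal abstention.

    Each candidate is a dict with 'text', 'cosine' (dense similarity) and
    'rerank_logit' (raw cross-encoder score); these keys are illustrative.
    """
    # Abstention is governed by dense cosine only (assumed: the best score).
    if not candidates or max(c["cosine"] for c in candidates) < sufficiency_floor:
        return None

    # relevance_floor filters on the normalised rerank score; 0.0 disables it.
    # An empty result here is not abstention, just nothing left to pass on.
    kept = [
        c for c in candidates
        if relevance_floor <= 0.0 or sigmoid(c["rerank_logit"]) >= relevance_floor
    ]
    kept.sort(key=lambda c: c["rerank_logit"], reverse=True)
    return kept[:top_k]
```

In this sketch a raised relevance_floor can only shrink the context, while sufficiency_floor alone decides whether an answer is attempted.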

[generation]

[generation]
backend = "claude"
model = "claude-opus-4-8"
auth = "auto"
cache = false

# OpenAI-compatible backend
# backend = "openai"
# model = "gpt-4o-mini"
# base_url = "https://api.openai.com/v1"
# api_key_env = "OPENAI_API_KEY"

# Ollama backend
# backend = "ollama"
# model = "llama3.1"
# base_url = "http://localhost:11434/v1"

# Opt-in personalization (default off). See "Personalization" below.
# persona = "The user is Dr. Vane, a cardiologist. Prefer clinical phrasing."
# personal_kb = false

| Key | Default | Description |
| --- | --- | --- |
| backend | "claude" | LLM backend: "claude", "openai", or "ollama" |
| model | "claude-opus-4-8" | Model name (backend-specific) |
| auth | "auto" | Claude auth mode: "auto", "subscription", or "api_key" |
| cache | false | Semantic answer cache (keyed on question + chunks + model + effective system prompt) |
| temperature | null | Generation temperature passed to the model (null = use model default) |
| base_url | varies | OpenAI-compatible base URL |
| api_key_env | "OPENAI_API_KEY" | Env var name holding the API key (OpenAI/Ollama backends) |
| auto_start | true | Auto-start a local backend (e.g. ollama) on first use when a runtime binary is found and the server is not already running |
| persona | "" | Free-form user context appended to the system prompt (grounding rules still bind). Empty = system prompt unchanged |
| personal_kb | false | Treat the knowledge base as the current user's own (first-person docs/questions refer to them) |

Personalization

persona and personal_kb are opt-in and off by default; with both unset, the system prompt is byte-identical to the stock one. When either is set, its content is appended to the system prompt as a clearly delimited "User context:" block, followed by a reminder that the citation and abstention rules still apply: the model must still answer only from the retrieved context and cite every claim. Toggling either value changes the effective system prompt and therefore the answer-cache key, so cached answers are invalidated automatically.
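The behaviour described above can be sketched in a few lines of Python. Everything here is illustrative: BASE_PROMPT, the exact wording of the appended block, the function names, and the use of SHA-256 for the cache key are assumptions, not raggity's actual prompt or key format.

```python
import hashlib

# Placeholder for the stock system prompt; the real text is not shown in this page.
BASE_PROMPT = "Answer only from the retrieved context and cite every claim."


def build_system_prompt(base: str, persona: str = "", personal_kb: bool = False) -> str:
    """Append a delimited user-context block only when personalization is set."""
    if not persona and not personal_kb:
        return base  # byte-identical to the stock prompt

    lines = ["User context:"]
    if personal_kb:
        lines.append("The knowledge base belongs to the current user; "
                     "first-person documents and questions refer to them.")
    if persona:
        lines.append(persona)
    reminder = "The citation and abstention rules above still apply."
    return base + "\n\n" + "\n".join(lines) + "\n\n" + reminder


def answer_cache_key(question: str, chunks, model: str, system_prompt: str) -> str:
    """Hash question, chunks, model and effective system prompt into one key."""
    digest = hashlib.sha256()
    for part in (question, *chunks, model, system_prompt):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator between fields
    return digest.hexdigest()
```

Because the effective system prompt feeds the key, setting a persona produces a different key for the same question and chunks, which is how old cached answers stop matching.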

Local providers: discovery & auto-start

rag model --list probes known local runtimes (ollama, lmstudio, llamacpp, vllm, jan, koboldcpp) and prints which are running, which are installed, and what models they expose, in a form that can be pasted into rag model <model> -p <provider>. Local OpenAI-compatible runtimes map to backend = "openai" plus the runtime's default base_url (no API key is required for a loopback server); ollama keeps backend = "ollama". With auto_start = true, an ollama backend is started on the first request if the ollama binary is found and the server is not already up. rag doctor reports the full discovery table and, for ollama, starts the server and checks that the configured model is pulled.
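The auto-start condition for ollama reduces to three checks: the option is on, the binary exists, and nothing is listening yet. The sketch below shows one way to express that with the standard library; server_is_up and should_auto_start are hypothetical helpers, and probing port 11434 on the loopback address is an assumption based on the example base_url above.

```python
import shutil
import socket


def server_is_up(host: str = "127.0.0.1", port: int = 11434, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_auto_start(auto_start: bool, binary_path, server_up: bool) -> bool:
    """Start the runtime only if enabled, installed, and not already running."""
    return bool(auto_start and binary_path is not None and not server_up)


def ollama_needs_start(auto_start: bool = True) -> bool:
    """Combine the real probes: PATH lookup for the binary plus a port check."""
    return should_auto_start(auto_start, shutil.which("ollama"), server_is_up())
```

Keeping the decision in a pure function makes it easy to reason about each case separately from the probing itself.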


[server]

[server]
host = "0.0.0.0"
port = 8000

| Key | Default | Description |
| --- | --- | --- |
| host | "0.0.0.0" | Server bind host |
| port | 8000 | Server listen port |

[observability]

[observability]
otel_endpoint = ""    # OTLP gRPC endpoint, e.g. "http://localhost:4317"

| Key | Default | Description |
| --- | --- | --- |
| otel_endpoint | "" | OpenTelemetry collector endpoint. Empty = disabled |

Requires the otel extra: pip install "raggity[otel]" (the quotes stop shells such as zsh from interpreting the square brackets as a glob).