wdoc — Comprehensive Reference

This document was written for wdoc v5.0.0. If you are using a different version, some arguments, defaults, or behaviors may have changed.

This reference covers the command-line interface (CLI) and the Python API.

Overview
Installation
CLI Reference
Python API Reference

Overview

wdoc is a RAG (Retrieval-Augmented Generation) system for summarizing, searching, and querying documents across 15+ file types. It uses LangChain and LiteLLM as backends and supports 100+ LLM providers.

Installation

# Full install (recommended); quotes keep shells such as zsh from expanding the brackets
pip install -U "wdoc[full]"

# Minimal install
pip install -U wdoc

# From git branches
pip install "wdoc[full] @ git+https://github.com/thiswillbeyourgithub/wdoc@main"
pip install "wdoc[full] @ git+https://github.com/thiswillbeyourgithub/wdoc@dev"

# Optional extras
pip install -U "wdoc[pdftotext]"
pip install -U "wdoc[fasttext]"

Set the API key(s) for the provider(s) you use as environment variables, for example: export ANTHROPIC_API_KEY="your_key"

CLI Reference

Basic Syntax

wdoc --task=TASK --path=PATH [OPTIONS]

# Shorthand (positional args are inferred):
wdoc TASK PATH [QUERY]

wdoc also accepts shell pipes: cat file.pdf | wdoc parse --filetype=pdf

Shortcuts

Shortcut	Equivalent
`wdoc web "query"`	`wdoc --task=query --filetype=ddg --path="query" --query="query"`
`wdoc parse FILE`	Calls `wdoc.parse_doc(path=FILE)` — no LLM, just parsing
`wdoc query FILE`	`wdoc --task=query --path=FILE`
`wdoc summarize FILE`	`wdoc --task=summarize --path=FILE`
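
The shortcut table can be read as a small rewrite rule. The sketch below is a hypothetical helper (`expand_shortcut` is not part of wdoc, and wdoc's own argument parser may work differently) that turns the shorthand forms into the explicit flags listed above. `parse` is left out because it calls `wdoc.parse_doc` instead of mapping to flags.

```python
def expand_shortcut(argv: list[str]) -> list[str]:
    """Rewrite `SHORTCUT ARG` into the explicit flags from the shortcut table.

    Hypothetical helper for illustration only; not wdoc's real parser.
    """
    if len(argv) != 2:
        raise ValueError("expected exactly: SHORTCUT ARG")
    shortcut, arg = argv
    if shortcut == "web":
        # A web search uses the query both as the search path and the question.
        return ["--task=query", "--filetype=ddg", f"--path={arg}", f"--query={arg}"]
    if shortcut in ("query", "summarize"):
        return [f"--task={shortcut}", f"--path={arg}"]
    raise ValueError(f"no flag equivalent for shortcut {shortcut!r}")
```

For example, `expand_shortcut(["summarize", "paper.pdf"])` gives the flags of the last table row.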

Tasks

Task	Description
`query`	Load documents, create embeddings, then answer questions via RAG
`search`	Return matching documents and metadata only (no LLM answer)
`summarize`	Produce a detailed markdown summary of the document
`summarize_then_query`	Summarize first, then open a prompt for queries

Global Arguments

Argument	Type	Default	Description
`--task`	str	required	One of: `query`, `search`, `summarize`, `summarize_then_query`
`--filetype`	str	`auto`	Document type (see Filetypes)
`--debug`	bool	`False`	Enable tracing, increase verbosity, disable multithreading
`--verbose`	bool	`False`	Increase verbosity
`--llm_verbosity`	bool	`False`	Print LLM intermediate reasoning steps
`--dollar_limit`	int	`5`	Stop if the estimated cost in USD exceeds this value (checked for summaries and embeddings only)
`--disable_llm_cache`	bool	`False`	Disable LLM response caching
`--private`	bool	`False`	Enforce that no data leaves your machine
`--oneoff`	bool	`False`	Exit after first result (no interactive prompt)
`--silent`	bool	`False`	Suppress output
`--version`	bool	`False`	Print version and exit
`--out_file`	str	`None`	Write results to this file (appends)
`--notification_callback`	Callable	`None`	Function receiving result string (e.g. for ntfy.sh)
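
As an illustration of `--notification_callback`, here is a minimal sketch of a callback that forwards the result string to an ntfy.sh topic. The helper name `make_ntfy_callback`, the topic name, and the injectable `send` parameter are assumptions made for this example. The callback also returns the string unchanged, as a defensive choice in case the caller expects a return value.

```python
import urllib.request
from typing import Callable


def make_ntfy_callback(topic: str, send: Callable = urllib.request.urlopen) -> Callable[[str], str]:
    """Return a callback that POSTs the result string to an ntfy.sh topic.

    `topic` is a placeholder you choose; `send` is injectable so the
    callback can be exercised without network access.
    """
    url = f"https://ntfy.sh/{topic}"

    def callback(result: str) -> str:
        request = urllib.request.Request(url, data=result.encode("utf-8"), method="POST")
        send(request)
        return result  # hand the string back unchanged

    return callback
```

From Python, this would be passed as `notification_callback=make_ntfy_callback("my-topic")`.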

Model Arguments

Argument	Type	Default	Description
`--model`	str	`WDOC_DEFAULT_MODEL`	Strong LLM (litellm format: `provider/model`)
`--model_kwargs`	dict	`None`	Extra kwargs for the model (e.g. `{"temperature": 0}`)
`--query_eval_model`	str	`WDOC_DEFAULT_QUERY_EVAL_MODEL`	Cheap/fast LLM for document filtering. `None` to disable
`--query_eval_model_kwargs`	dict	`None`	Extra kwargs for the eval model
`--llms_api_bases`	dict	`None`	Override API endpoints: keys in `["model", "query_eval_model", "embeddings"]`
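
The two conventions in this table, the `provider/model` name format and the fixed key set of `--llms_api_bases`, can be checked before launching a run. The helpers below are a hypothetical sketch (neither function exists in wdoc). Splitting on the first slash only is an assumption that matches names such as `openrouter/google/gemini-2.5-flash`.

```python
ALLOWED_API_BASE_KEYS = {"model", "query_eval_model", "embeddings"}


def split_model_name(name: str) -> tuple[str, str]:
    """Split 'provider/model' on the first slash only."""
    provider, sep, model = name.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {name!r}")
    return provider, model


def check_api_bases(api_bases: dict[str, str]) -> dict[str, str]:
    """Reject keys other than the three documented ones."""
    unknown = set(api_bases) - ALLOWED_API_BASE_KEYS
    if unknown:
        raise ValueError(f"unexpected llms_api_bases keys: {sorted(unknown)}")
    return api_bases
```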

Embedding Arguments

Argument	Type	Default	Description
`--embed_model`	str	`WDOC_DEFAULT_EMBED_MODEL`	Embedding model (`backend/model` format)
`--embed_model_kwargs`	dict	`None`	Extra kwargs for the embedding model
`--embed_instruct`	bool	`None`	Use instruct framework for HuggingFace embeddings
`--save_embeds_as`	str	`"{user_dir}/latest_docs_and_embeddings"`	Save embeddings to this path
`--load_embeds_from`	str	`None`	Load pre-computed embeddings from this path

Query Arguments

Argument	Type	Default	Description
`--query`	str	`None`	Initial query string
`--query_retrievers`	str	`"basic_multiquery"`	Retriever(s): combine `basic`, `multiquery`, `knn`, `svm`, `parent` with `_`
`--query_eval_check_number`	int	`3`	Number of eval passes per document
`--query_relevancy`	float	`-0.5`	Minimum embedding similarity score (-1 to +1)
`--top_k`	int|str	`"auto_200_500"`	Documents to retrieve. `"auto_N_M"` auto-scales from N up to M
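
Both `--query_retrievers` and `--top_k` pack several values into one underscore-separated string. The sketch below shows one way to unpack them. The function names are hypothetical, and wdoc's internal parsing and validation may differ.

```python
from typing import Union

KNOWN_RETRIEVERS = {"basic", "multiquery", "knn", "svm", "parent"}


def parse_retrievers(value: str) -> list[str]:
    """Split e.g. 'basic_multiquery' into ['basic', 'multiquery']."""
    names = value.split("_")
    unknown = [name for name in names if name not in KNOWN_RETRIEVERS]
    if unknown:
        raise ValueError(f"unknown retriever(s): {unknown}")
    return names


def parse_top_k(value: Union[int, str]) -> tuple[int, int]:
    """Return (start, maximum); a plain int gives start == maximum."""
    if isinstance(value, int):
        return value, value
    prefix, low, high = value.split("_")  # raises ValueError on a wrong shape
    if prefix != "auto":
        raise ValueError(f"expected 'auto_N_M', got {value!r}")
    start, maximum = int(low), int(high)
    if not 0 < start <= maximum:
        raise ValueError("expected 0 < N <= M")
    return start, maximum
```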

Summary Arguments

Argument	Type	Default	Description
`--summary_n_recursion`	int	`0`	Number of recursive summary refinement passes (0 = disabled)
`--summary_language`	str	`"the same language as the document"`	Output language for summaries

Filetypes

Filetype	Description	Key Arguments
`auto`	Guess filetype from path (default)	—
`anki`	Anki flashcard collection	`--anki_profile`, `--anki_deck`, `--anki_notetype`, `--anki_template`, `--anki_tag_filter`
`epub`	EPUB e-books	`--path`
`json_dict`	JSON dictionary file	`--json_dict_template`, `--json_dict_exclude_keys`
`local_audio`	Audio files (many formats)	`--audio_backend` (`whisper`/`deepgram`), `--audio_unsilence`, `--whisper_lang`, `--whisper_prompt`, `--deepgram_kwargs`
`local_html`	HTML files	`--load_functions`
`local_video`	Video files (audio extracted)	Same as `local_audio`
`logseq_markdown`	Logseq markdown pages	`--path`
`online_media`	Remote media via yt-dlp/playwright	Same as `local_audio` + `--online_media_url_regex`, `--online_media_resourcetype_regex`
`online_pdf`	PDF via URL	Same as `pdf`
`pdf`	PDF files (15 parsers, best auto-selected)	`--pdf_parsers`, `--doccheck_min_lang_prob`, `--doccheck_min_token`, `--doccheck_max_token`
`powerpoint`	.ppt/.pptx/.odp	`--path`
`string`	Interactive text paste	—
`text`	Text content passed directly as path	`--metadata`
`txt`	Text files (.txt, .md, etc.)	`--path`
`url`	Web pages	`--title`
`word`	.doc/.docx/.odt	`--path`
`youtube`	YouTube videos	`--youtube_language`, `--youtube_translation`, `--youtube_audio_backend`, `--whisper_prompt`, `--whisper_lang`, `--deepgram_kwargs`

Recursive Filetypes

These load multiple documents and can combine different sources:

Filetype	Description	Key Arguments
`ddg`	DuckDuckGo web search	`--ddg_max_results`, `--ddg_region`, `--ddg_safesearch`
`json_entries`	JSON file (one dict per line) with loader args	`--path`
`toml_entries`	TOML file with loader args	`--path`
`recursive_paths`	Glob files in a directory	`--pattern`, `--recursed_filetype`, `--include`, `--exclude`
`link_file`	File with one URL per line	`--out_file`
`youtube_playlist`	YouTube playlist	Same as `youtube`
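
For `json_entries`, the input is a file with one JSON dict per line. The following sketch generates such content from Python. It assumes the dict keys mirror the CLI argument names (`path`, `filetype`); the file names are placeholders.

```python
import json


def to_json_entries(entries: list[dict]) -> str:
    """Serialize loader-argument dicts as one JSON object per line."""
    return "\n".join(json.dumps(entry) for entry in entries) + "\n"


# Placeholder entries; keys are assumed to match the CLI argument names.
entries = [
    {"path": "paper.pdf", "filetype": "pdf"},
    {"path": "https://example.com", "filetype": "url"},
]
text = to_json_entries(entries)
```

Writing `text` to a file and passing that file as `--path` with `--filetype=json_entries` would then load both sources together.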

Loader-Specific Arguments (DocDict)

Argument	Used By	Description
`--path`	Most loaders	File path, URL, or text content
`--pdf_parsers`	`pdf`, `online_pdf`	Comma-separated parser names (e.g. `pymupdf,pdfplumber`)
`--anki_profile`	`anki`	Anki profile name
`--anki_deck`	`anki`	Deck name prefix (e.g. `science::physics`)
`--anki_notetype`	`anki`	Note type filter (case-insensitive)
`--anki_template`	`anki`	Template string with `{fieldName}`, `{tags}`, `{allfields}`, `{image_ocr_alt}`
`--anki_tag_filter`	`anki`	Regex to filter cards by tag
`--anki_tag_render_filter`	`anki`	Regex to filter which tags appear in output
`--audio_backend`	`local_audio`, `local_video`	`whisper` or `deepgram`
`--audio_unsilence`	`local_audio`, `local_video`	Remove silence before transcribing (default: `True`)
`--whisper_lang`	Audio types	Language hint for Whisper
`--whisper_prompt`	Audio types	Prompt for Whisper
`--deepgram_kwargs`	Audio types	Dict of Deepgram options
`--youtube_language`	`youtube`	Preferred transcript languages (e.g. `["fr","en"]`)
`--youtube_translation`	`youtube`	Translate transcript to this language
`--youtube_audio_backend`	`youtube`	`youtube`, `whisper`, or `deepgram`
`--json_dict_template`	`json_dict`	Template with `{key}` and `{value}`
`--json_dict_exclude_keys`	`json_dict`	List of keys to skip
`--metadata`	`text`, `json_dict`	Extra metadata as JSON dict
`--load_functions`	`local_html`	Python callables to preprocess text
`--source_tag`	All	Metadata tag for document identification
`--loading_failure`	All	`warn` or `crash` on load errors (default: `warn`)
`--pattern`	`recursive_paths`	Glob pattern for file discovery
`--recursed_filetype`	`recursive_paths`	Filetype for each matched file
`--include`	`recursive_paths`	Regex list — paths must match
`--exclude`	`recursive_paths`	Regex list — paths must not match
`--online_media_url_regex`	`online_media`	Regex matching media URLs
`--online_media_resourcetype_regex`	`online_media`	Regex matching resource types
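
The interplay of `--include` and `--exclude` for `recursive_paths` can be pictured as two regex passes over the discovered paths. The sketch below is not wdoc's implementation. It assumes that regexes are applied with `re.search`, that a path must match every include regex, and that a single exclude match drops it.

```python
import re


def select_paths(paths, include=(), exclude=()):
    """Keep paths matching every include regex and no exclude regex.

    Assumptions: re.search semantics and ANDed includes; wdoc's actual
    matching rules may differ.
    """
    kept = []
    for path in paths:
        if not all(re.search(rx, path) for rx in include):
            continue
        if any(re.search(rx, path) for rx in exclude):
            continue
        kept.append(path)
    return kept
```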

Filtering Arguments

Argument	Type	Default	Description
`--filter_metadata`	list|str	`None`	Filter docs by metadata. Format: `[kvb][+-]regex`
`--filter_content`	list|str	`None`	Filter docs by content. Format: `[+-]regex`

filter_metadata syntax:

k+regex — keep docs with a metadata key matching regex
v+regex — keep docs with a metadata value matching regex
b+key_regex:value_regex — keep docs where key AND value match
Use - instead of + to exclude
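
The rules above can be sketched as a small parser plus a matcher. This is an illustration under assumptions, not wdoc's code: regexes are applied with `re.search`, the `b` form is split on its first colon, and for `b` the key and value must match on the same metadata entry.

```python
import re


def parse_metadata_filter(rule: str):
    """Split '[kvb][+-]regex' into (target, keep, key_regex, value_regex)."""
    match = re.fullmatch(r"([kvb])([+-])(.+)", rule, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"invalid filter_metadata rule: {rule!r}")
    target, sign, pattern = match.groups()
    keep = sign == "+"
    if target == "k":
        return target, keep, pattern, None
    if target == "v":
        return target, keep, None, pattern
    key_rx, sep, value_rx = pattern.partition(":")  # assumption: first colon separates
    if not sep:
        raise ValueError("the 'b' form needs key_regex:value_regex")
    return target, keep, key_rx, value_rx


def metadata_matches(metadata: dict, rule: str) -> bool:
    """True if a document with this metadata is kept under a single rule."""
    _, keep, key_rx, value_rx = parse_metadata_filter(rule)
    hit = any(
        (key_rx is None or re.search(key_rx, str(key)))
        and (value_rx is None or re.search(value_rx, str(value)))
        for key, value in metadata.items()
    )
    return hit if keep else not hit
```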

filter_content syntax:

+regex — keep docs whose content matches
-regex — exclude docs whose content matches
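
Content filtering follows the same sign convention. A minimal sketch, assuming `re.search` semantics and that a document must satisfy every rule in the list (the helper name is hypothetical):

```python
import re


def keep_by_content(content: str, rules: list[str]) -> bool:
    """Apply '+regex' / '-regex' rules; every rule must be satisfied."""
    for rule in rules:
        sign, pattern = rule[:1], rule[1:]
        if sign not in ("+", "-") or not pattern:
            raise ValueError(f"invalid filter_content rule: {rule!r}")
        found = re.search(pattern, content) is not None
        if found != (sign == "+"):
            return False  # a '+' rule missed, or a '-' rule hit
    return True
```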

Other Arguments

Argument	Type	Default	Description
`--file_loader_parallel_backend`	str	`loky`	Joblib backend: `loky`, `multiprocessing`, or `threading`
`--file_loader_n_jobs`	int	`-1`	Parallel jobs for loading (-1 = max, 1 = serial)
`--doccheck_min_lang_prob`	float	`0.5`	Min fasttext language probability for valid docs
`--doccheck_min_token`	int	`50`	Min tokens for a valid document
`--doccheck_max_token`	int	`10000000`	Max tokens for a valid document
`--ddg_max_results`	int	`50`	Max DuckDuckGo results
`--ddg_region`	str	`""`	DuckDuckGo region (e.g. `us-US`)
`--ddg_safesearch`	str	`off`	`on`, `moderate`, or `off`

Environment Variables

Core Defaults

Variable	Default	Description
`WDOC_DEFAULT_MODEL`	`openrouter/google/gemini-3.1-pro-preview`	Default strong LLM
`WDOC_DEFAULT_QUERY_EVAL_MODEL`	`openrouter/google/gemini-2.5-flash`	Default eval LLM
`WDOC_DEFAULT_EMBED_MODEL`	`openai/text-embedding-3-small`	Default embedding model
`WDOC_DEFAULT_EMBED_DIMENSION`	`none`	Embedding dimensions to request

Behavior Flags

Variable	Default	Description
`WDOC_DEBUG`	`False`	Same as `--debug=True`
`WDOC_VERBOSE`	`False`	Same as `--verbose=True`
`WDOC_TYPECHECKING`	`warn`	`disabled`, `warn`, or `crash` (via beartype)
`WDOC_NO_MODELNAME_MATCHING`	`True`	Disable fuzzy model name matching
`WDOC_ALLOW_NO_PRICE`	`False`	Don't crash if model price is unknown
`WDOC_STRICT_DOCDICT`	`False`	`True` = crash on unexpected DocDict args, `False` = warn, `strip` = ignore
`WDOC_OPEN_ANKI`	`False`	Auto-open Anki browser for found cards
`WDOC_DEBUGGER`	`False`	Open debugger on exceptions
`WDOC_EMPTY_LOADER`	`False`	Return empty string for all loaders (debug)
`WDOC_CONTINUE_ON_INVALID_EVAL`	`True`	Continue if eval LLM output can't be parsed
`WDOC_BEHAVIOR_EXCL_INCL_USELESS`	`warn`	`warn` or `crash` if include/exclude has no effect

Performance & Limits

Variable	Default	Description
`WDOC_LLM_MAX_CONCURRENCY`	`1`	Max concurrent LLM requests
`WDOC_LLM_REQUEST_TIMEOUT`	`600`	LLM request timeout in seconds
`WDOC_MAX_CHUNK_SIZE`	`32000`	Max tokens per chunk
`WDOC_MAX_EMBED_CONTEXT`	`7000`	Max tokens per chunk for embeddings
`WDOC_SEMANTIC_BATCH_MAX_TOKEN_SIZE`	`2000`	Max tokens per semantic batch
`WDOC_INTERMEDIATE_ANSWER_MAX_TOKENS`	`4000`	Max tokens per intermediate answer
`WDOC_MAX_LOADER_TIMEOUT`	`-1`	Loader timeout in seconds (-1 = disabled)
`WDOC_MAX_PDF_LOADER_TIMEOUT`	`-1`	Per-PDF-parser timeout in seconds (-1 = disabled)
`WDOC_EXPIRE_CACHE_DAYS`	`0`	Remove cache entries older than N days (0 = keep forever)

FAISS / Embeddings

Variable	Default	Description
`WDOC_MOD_FAISS_SCORE_FN`	`True`	Normalize FAISS scores to 0–1 range
`WDOC_FAISS_COMPRESSION`	`True`	zlib-compress FAISS indexes
`WDOC_FAISS_BINARY`	`False`	Use binary embeddings (32x compression)
`WDOC_EMBED_TESTING`	`True`	Test embedding model on startup
`WDOC_DISABLE_EMBEDDINGS_CACHE`	`False`	Bypass embedding cache

Whisper / Audio

Variable	Default	Description
`WDOC_WHISPER_ENDPOINT`	`""`	Custom Whisper API endpoint
`WDOC_WHISPER_API_KEY`	`""`	Custom Whisper API key
`WDOC_WHISPER_MODEL`	`whisper-1`	Whisper model name
`WDOC_WHISPER_PARALLEL_SPLITS`	`True`	Parallelize split audio transcription

Import & Loading

Variable	Default	Description
`WDOC_IMPORT_TYPE`	`native`	`native`, `thread`, `lazy`, or `both`
`WDOC_LOADER_LAZY_LOADING`	`True`	Lazy-import loader functions
`WDOC_APPLY_ASYNCIO_PATCH`	`False`	Apply nest_asyncio patch (needed for Ollama)
`WDOC_IN_DOCKER`	`False`	Set automatically inside Docker
`WDOC_PRIVATE_MODE`	—	Set automatically by `--private`, never set manually

Observability (Langfuse)

Variable	Default	Description
`WDOC_LANGFUSE_PUBLIC_KEY`	`None`	Overrides `LANGFUSE_PUBLIC_KEY`
`WDOC_LANGFUSE_SECRET_KEY`	`None`	Overrides `LANGFUSE_SECRET_KEY`
`WDOC_LANGFUSE_HOST`	`None`	Overrides `LANGFUSE_HOST`
`WDOC_LITELLM_TAGS`	`None`	Comma-separated tags for litellm requests
`WDOC_LITELLM_USER`	`wdoc_llm`	User identifier for litellm requests
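
When using the Python API, these variables can also be set from the script itself. The sketch below assumes that wdoc reads `WDOC_*` variables when it is imported, which is why they are set first; the chosen values are examples only.

```python
import os

# Assumption: wdoc reads WDOC_* variables at import time, so set them first.
# Environment variable values are always strings.
overrides = {
    "WDOC_LLM_MAX_CONCURRENCY": "4",
    "WDOC_LLM_REQUEST_TIMEOUT": "120",
    "WDOC_TYPECHECKING": "crash",
}
os.environ.update(overrides)

# from wdoc import wdoc  # import only after the environment is configured
```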

Shell Examples

# Query a PDF
wdoc --task=query --path="paper.pdf" --query="What are the main findings?"

# Query multiple PDFs in a directory
wdoc --task=query --path="papers/" --pattern="**/*.pdf" \
     --filetype=recursive_paths --recursed_filetype=pdf

# Summarize a YouTube video
wdoc --task=summarize --path="https://www.youtube.com/watch?v=VIDEO_ID" \
     --youtube_language="en"

# Web search
wdoc web "latest news on quantum computing"

# Parse a document to text (no LLM)
wdoc parse document.pdf
wdoc parse document.pdf --format=langchain_dict

# Use local models (Ollama)
wdoc --model="ollama/qwen3:8b" --query_eval_model="ollama/qwen3:8b" \
     --embed_model="ollama/snowflake-arctic-embed2" \
     --task=summarize --path=document.pdf

# Save/load embeddings for repeated queries
wdoc --task=query --path="big_corpus/" --filetype=recursive_paths \
     --pattern="**/*.pdf" --recursed_filetype=pdf \
     --save_embeds_as="my_index.pkl"
wdoc --task=query --load_embeds_from="my_index.pkl" --query="My question"

# Shell pipe
cat document.pdf | wdoc parse --filetype=pdf
echo "https://example.com" | wdoc parse

# Private mode with local models and custom endpoints
wdoc --private --model="ollama/llama3" --query_eval_model="ollama/llama3" \
     --embed_model="ollama/snowflake-arctic-embed2" \
     --llms_api_bases='{"model":"http://localhost:11434","query_eval_model":"http://localhost:11434","embeddings":"http://localhost:11434"}' \
     --task=query --path=secret.pdf

# Filter documents by metadata
wdoc --task=query --load_embeds_from=index.pkl \
     --filter_metadata="v+anki" --query="My question"

# Filter documents by content
wdoc --task=query --path=docs/ --filetype=recursive_paths \
     --pattern="**/*.md" --recursed_filetype=txt \
     --filter_content="+.*machine learning.*"

Python API Reference

Importing

from wdoc import wdoc

wdoc Class

The main entry point. Instantiating wdoc automatically loads documents and, for summary tasks, runs the summary immediately.

Constructor Parameters

wdoc(
    task: str,                        # "query", "search", "summarize", "summarize_then_query"
    filetype: str = "auto",
    model: str = WDOC_DEFAULT_MODEL,
    model_kwargs: dict | None = None,
    query_eval_model: str | None = WDOC_DEFAULT_QUERY_EVAL_MODEL,
    query_eval_model_kwargs: dict | None = None,
    embed_model: str = WDOC_DEFAULT_EMBED_MODEL,
    embed_model_kwargs: dict | None = None,
    save_embeds_as: str | Path = "{user_cache}/latest_docs_and_embeddings",
    load_embeds_from: str | Path | None = None,
    top_k: int | str = "auto_200_500",
    query: str | None = None,
    query_retrievers: str = "basic_multiquery",
    query_eval_check_number: int = 3,
    query_relevancy: float = -0.5,
    summary_n_recursion: int = 0,
    summary_language: str = "the same language as the document",
    llm_verbosity: bool = False,
    debug: bool = False,
    verbose: bool = False,
    dollar_limit: int = 5,
    notification_callback: Callable | None = None,
    disable_llm_cache: bool = False,
    file_loader_parallel_backend: str = "loky",   # "loky", "threading", "multiprocessing"
    file_loader_n_jobs: int = -1,
    private: bool = False,
    llms_api_bases: dict | None = None,
    out_file: str | Path | None = None,
    oneoff: bool = False,
    silent: bool = False,
    version: bool = False,
    **cli_kwargs,   # DocDict / loader-specific arguments (path, include, exclude, etc.)
)

All CLI arguments map directly to constructor parameters.
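
Putting the constructor and `query_task` together, a one-off query might look like the sketch below. The helper names and the file path are hypothetical, the import is deferred so the snippet can be loaded without wdoc installed, and a real run additionally needs API keys for the configured models.

```python
def build_query_kwargs(pdf_path: str) -> dict:
    """Constructor keyword arguments for querying a single PDF."""
    return {
        "task": "query",
        "path": pdf_path,  # loader argument, passed through **cli_kwargs
        "filetype": "pdf",
        "query_retrievers": "basic_multiquery",
        "top_k": "auto_200_500",
        "dollar_limit": 5,
    }


def ask(pdf_path: str, question: str) -> str:
    """Load the PDF, run one RAG query, and return the final answer."""
    from wdoc import wdoc  # deferred import: requires wdoc and API keys

    instance = wdoc(**build_query_kwargs(pdf_path))
    result = instance.query_task(query=question)
    return result["final_answer"]
```

Keeping the keyword arguments in a plain dict makes it easy to reuse the same configuration across several documents.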

Public Methods

`query_task(query: str) -> dict`

Run a RAG query against loaded documents.

Returns a dict with:

Key	Type	Description
`final_answer`	str	Combined markdown answer
`intermediate_answers`	list	Per-document answers
`relevant_filtered_docs`	list[Document]	Documents deemed relevant
`filtered_docs`	list[Document]	Documents passing eval filter
`unfiltered_docs`	list[Document]	All initially retrieved documents
`source_mapping`	dict	Document ID to citation number mapping
`all_relevant_intermediate_answers`	list	Nested merge steps
`total_cost`	float	Total USD cost
`total_model_cost`	float	Strong model cost
`total_eval_model_cost`	float	Eval model cost
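
To show how the returned dict might be consumed, the sketch below formats the answer, the document counts, and the cost breakdown. `summarize_result` is a hypothetical helper, and `fake_result` is stand-in data that only mimics the documented keys.

```python
def summarize_result(result: dict) -> str:
    """Format the answer, document counts, and cost of a query_task result."""
    lines = [
        result["final_answer"],
        "",
        f"Documents: {len(result['unfiltered_docs'])} retrieved, "
        f"{len(result['filtered_docs'])} passed the eval filter, "
        f"{len(result['relevant_filtered_docs'])} relevant",
        f"Cost: ${result['total_cost']:.4f} "
        f"(model ${result['total_model_cost']:.4f}, "
        f"eval ${result['total_eval_model_cost']:.4f})",
    ]
    return "\n".join(lines)


# Stand-in data with the documented keys; real values come from query_task().
fake_result = {
    "final_answer": "The paper reports three main findings.",
    "unfiltered_docs": ["doc1", "doc2", "doc3"],
    "filtered_docs": ["doc1", "doc2"],
    "relevant_filtered_docs": ["doc1"],
    "total_cost": 0.0123,
    "total_model_cost": 0.0100,
    "total_eval_model_cost": 0.0023,
}
report = summarize_result(fake_result)
```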

`search_task(query: str) -> dict`

Like query_task but returns matching documents without generating an LLM answer.

`summary_task() -> dict`

Run summarization. Called automatically during __init__ for summary tasks. Results are also stored in instance.summary_results.