wip: LLMs.txt toolkit local history from pr-7465 worktree #7541
Draft
Mustafa-Esoofally wants to merge 26 commits into main from
Conversation
Add a reader and toolkit for the llms.txt standard (https://llmstxt.org), enabling agents to discover and consume documentation indexes.

LLMsTxtReader: fetches an llms.txt URL, parses the standardized markdown format to extract all linked doc URLs, fetches page content (handling HTML, markdown, plain text), and returns Documents with section/title metadata. The async variant fetches all pages concurrently.

LLMsTxtTools provides two modes:
- Agentic: get_llms_txt_index returns the index so the agent picks which pages to read, then read_llms_txt_url fetches individual pages.
- Knowledge: read_llms_txt_and_load_knowledge bulk-fetches all linked pages and inserts them into a Knowledge base.

Includes 32 unit tests and 2 cookbook examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
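The index-parsing step can be sketched as follows. This is a hypothetical minimal parser, not the PR's actual implementation; the real LLMsTxtReader also captures the document title and summary and handles more edge cases, but an llms.txt index is essentially `## Section` headers followed by `- [Title](url)` link entries.

```python
import re

def parse_llms_txt(text):
    """Extract (section, title, url) entries from an llms.txt index.

    Hypothetical sketch; the real reader returns richer metadata.
    """
    entries = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            # A new section header scopes the links that follow it
            section = line[3:].strip()
        else:
            # Link entries use standard markdown list + link syntax
            m = re.match(r"-\s*\[([^\]]+)\]\(([^)]+)\)", line)
            if m:
                entries.append(
                    {"section": section, "title": m.group(1), "url": m.group(2)}
                )
    return entries
```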
## Summary

Addresses code review feedback on #7458. Fixes several issues in the LLMsTxtReader and LLMsTxtTools implementation.

**Changes:**
- **Lazy BeautifulSoup import** - Deferred to `_extract_content()` instead of hard-failing at module import time
- **Variable shadowing fix** - Renamed `url` to `entry_url` in `async_read()` dict comprehension to avoid shadowing the method parameter
- **Concurrency limiting** - Added `asyncio.Semaphore(10)` to prevent overwhelming target servers when fetching 100+ URLs concurrently
- **Better text extraction** - Changed `_extract_content()` separator from `" "` to `"\n"` to preserve document structure
- **Public API methods** - Renamed `_fetch_url` / `_parse_llms_txt` to `fetch_url` / `parse_llms_txt` since they are called by the toolkit
- **Reader reuse** - LLMsTxtTools now creates a single `LLMsTxtReader` instance in `__init__` instead of per tool call
- **Async tool variants** - Added `aget_llms_txt_index`, `aread_llms_txt_url`, `aread_llms_txt_and_load_knowledge` registered via `async_tools` following the codebase convention (e.g. BrandfetchTools)
- **New tests** - Added tests for async tool registration, reader reuse, and newline preservation in HTML extraction

## Type of change

- [x] Improvement

---

## Checklist

- [x] Code complies with style guidelines
- [x] Ran format/validation scripts (`./scripts/format.sh` and `./scripts/validate.sh`)
- [x] Self-review completed
- [x] Documentation updated (comments, docstrings)
- [x] Tests added/updated (if applicable)

### Duplicate and AI-Generated PR Check

- [x] I have searched existing [open pull requests](../../pulls) and confirmed that no other PR already addresses this issue
- [x] Check if this PR was entirely AI-generated (by Copilot, Claude Code, Cursor, etc.)

---

## Additional Notes

All 36 tests pass (up from 32 - added 4 new tests for async registration, reader reuse, and HTML newline preservation).
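The Semaphore-bounded fan-out described in the concurrency-limiting change can be illustrated with a small sketch. The function and parameter names here are illustrative, not the PR's actual identifiers; the bound of 10 is the behavior the changelog describes.

```python
import asyncio

async def gather_bounded(urls, fetch, limit=10):
    # Cap in-flight requests so fetching 100+ linked pages does not
    # burst against the target server all at once
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order even though completion order varies
    return await asyncio.gather(*(bounded(u) for u in urls))
```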
- Full async docstrings on all 3 async tool methods so the LLM sees proper tool descriptions in async mode
- AsyncClient now receives timeout and proxy via _async_client_kwargs()
- Module-level httpx import, consistent with Brandfetch/Perplexity
- Extract _process_response() to deduplicate content-type classification across fetch_url and async_fetch_url
Instead of manually reading documents and looping insert(), delegate to self.knowledge.insert(url=url, reader=self.reader) which gives us content hashing, deduplication, status tracking, and proper vector DB insertion — matching the pattern used by WebsiteTools and WikipediaTools.
Reader:
- Remove redundant state: in_optional and past_first_section replaced by a single current_section variable
- Remove dead if/else branch on proxy — httpx accepts proxy=None
- Remove WHAT comments that restate the next line
- Simplify AsyncClient construction (proxy=self.proxy directly)

Toolkit:
- Extract _format_index helper to deduplicate sync/async index building
- Delegate knowledge loading to Knowledge.insert(url=, reader=) pipeline

Knowledge:
- Skip pre-download when a custom reader is provided — URL-based readers like LLMsTxtReader need the URL string, not pre-fetched BytesIO
The overview document (title + summary from the llms.txt) provides essential context about the project. No caller ever set this to False. Removing the parameter and its branch simplifies the reader.
- Remove __init__ docstring (no other reader has one)
- Rewrite parse_llms_txt: replace 3 continue statements with a clean if/elif/else chain — each line falls into one bucket
- Remove include_llms_txt_content param (always True, never exposed)
_extract_content was called exactly once. Inlining removes one indirection layer — the reader now has only the helpers that are actually shared between read() and async_read().
The 3-way exception split (HTTPStatusError, RequestError, Exception) was duplicated between sync and async. For a reader fetching doc pages, a single catch with a warning log is sufficient. Each method is now 4 lines instead of 12.
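The collapsed error handling might look roughly like this. To keep the sketch self-contained, the HTTP getter is injected as a callable rather than importing httpx; the names are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

def fetch_page(url, get):
    """Fetch a doc page, returning None on any failure.

    `get` is an injected HTTP callable (e.g. httpx.get).
    A single broad catch with a warning log replaces the duplicated
    3-way HTTPStatusError / RequestError / Exception split.
    """
    try:
        response = get(url)
        response.raise_for_status()
        return response.text
    except Exception as exc:
        logger.warning("Failed to fetch %s: %s", url, exc)
        return None
```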
Keep the semaphore (Codex confirms: this is external HTTP fan-out, not local processing — unbounded gather would burst 100 requests at once). Remove _MAX_CONCURRENT_FETCHES constant, inline the value with a comment explaining why it exists.
Add timeout and follow_redirects params to existing fetch_with_retry and async_fetch_with_retry in utils/http.py. Reader now uses these shared utils instead of making raw httpx.get calls — retry logic, error handling, and connection management in one place. Removed semaphore — httpx AsyncClient already limits concurrent connections per host (default 20).
max_urls=100 was too high — would overwhelm model context in agentic mode. 20 matches the knowledge cookbook and WebsiteReader's max_links=10 ballpark. timeout=60 matches the global httpx client default.
bs4 import now fails at import time (matching WebsiteReader and WebSearchReader pattern) instead of deep inside a fetch call. LLMsTxtReader import moved to top of toolkit — no reason to defer an internal agno module.
Class docstring was a 30-line essay — most toolkits have none. The code structure already shows the two modes (with/without knowledge). Removed remaining WHAT comment in _build_documents.
- Trim tool docstrings: remove repeated llms.txt explanations, keep only what the LLM needs to decide when/how to call the tool
- Replace _async_client_kwargs dict builder with _async_client() that returns the client directly
- Add section comments to separate helpers / agentic tools / knowledge tools for scannable code
- Remove unused Dict import
Docstrings now use the same format as GmailTools and GoogleCalendarTools: triple-quote, Args (type): description, Returns: type: description. Replaced section dividers with inline comments matching Gmail pattern. Helpers have no docstrings (underscore prefix signals internal use).
Toolkit: every tool method now wrapped in try/except returning error strings, matching the Gmail/Calendar pattern. Helpers at top, tools below.

Reader: reordered — __init__, classmethods, helpers (_process_response, _build_documents), then public methods (parse_llms_txt, fetch_url, read, async_read). Removed bloated docstrings on helpers. Trimmed class docstring to just the example.
tools list uses Callable instead of Any. Removed Any from kwargs (untyped kwargs is the codebase pattern — other toolkits don't type it).
Restructured from class-based to flat functions with @pytest.fixture, matching test_perplexity.py and test_gmail_tools.py patterns.

New coverage:
- Async reader: async_read happy path + failure
- Async toolkit: aget_llms_txt_index, aread_llms_txt_url, aread_llms_txt_and_load_knowledge
- Error handling: try/except returns error strings
- Edge cases: empty overview, HTML sniffing, unknown content-type
- Shared _mock_httpx_response helper for DRY mock setup

34 tests -> 46 tests
The previous fix (skip pre-download when any custom reader is provided) broke PDFReader and other file-based readers that need BytesIO. Now we check if the reader supports ContentType.URL — only URL-based readers like LLMsTxtReader and WebsiteReader skip the pre-download. File-based readers (PDFReader, CSVReader, etc.) still get pre-downloaded bytes.
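The capability check can be sketched generically. The ContentType enum below is a stand-in for agno's real enum, and `needs_predownload` is an illustrative name for the decision the commit describes.

```python
from enum import Enum, auto

class ContentType(Enum):  # stand-in for agno's real ContentType enum
    URL = auto()
    FILE = auto()

def needs_predownload(reader):
    # URL-capable readers (LLMsTxtReader, WebsiteReader) consume the URL
    # string directly, so pre-fetching bytes is skipped; file-based
    # readers (PDFReader, CSVReader) still get pre-downloaded BytesIO
    supported = getattr(reader, "supported_content_types", ())
    return ContentType.URL not in supported
```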
Only forward timeout and follow_redirects to httpx when explicitly passed by the caller. Previously, default values (timeout=None, follow_redirects=False) were always forwarded, which removed httpx's built-in 5s timeout and overrode client-level redirect settings.
follow_redirects and timeout are Optional parameters defaulting to None, so existing callers see zero behavior change. Build the kwargs dict conditionally instead of using type-ignore comments. Import order fixed by format.sh.
Summary
Preserves the granular local history of feat/llms-txt-reader-tools from the pr-7465-llms-txt-fixes worktree. PR #7458 squash-merged this work, so main already has the feature — this branch keeps the 25 original commits for reference (review iteration history, type cleanup, import fixes, etc.).

Also captures an unrelated dirty file that was in the worktree: libs/agno/tests/unit/os/routers/test_sort_order_default.py — cross-contamination from another worktree, triage separately.

Status
Safe to close. PR #7458 already merged the work. This exists only so nothing gets lost during worktree cleanup.