This script gathers a set of repositories into a single local enclosing directory so tools like Codex or Claude Code can inspect how those codebases relate to each other.
The goal is to collect working trees for analysis and documentation, not to maintain local Git clones. Each repository is cloned shallowly, and any `.git/` directories found within the cloned tree are then removed. After that, the script sanitizes the cloned working tree to obfuscate sensitive text, such as internal email addresses, secret-like literals, and internal hostnames or URLs, before downstream LLM analysis.

`main.py` supports two usage modes:
- all-repos mode:
  - reads `REPOS_TO_CLONE_JSON` from the environment
  - expects that value to be a JSON list of Git repository URLs
- single-repo mode:
  - accepts one repo URL from `--github-repo-url`
  - does not require `REPOS_TO_CLONE_JSON`
In both modes, the script:
- creates the enclosing directory if needed
- stores update state in `../gather_repos_state.json`, relative to `gather_repos_code/main.py`
- derives a local directory name for each repo from the repo URL
- checks each repo's remote `main` branch before deciding whether local work is needed
- asks for confirmation before fully deleting only the destination directories that actually need refresh
- runs `git clone --depth 1 --branch main --single-branch` for each repo that needs refresh
- finds and removes `.git/` directories anywhere inside each cloned repository
- validates that each `.git/` removal target resolves inside that cloned repository before deletion
- sanitizes text files in each cloned repo to remove or obfuscate possibly sensitive values, while leaving likely binary files untouched
Requirements:

- Python 3.12
- `uv`
- `git` on `PATH`
- working credentials for any private GitHub repositories you ask the script to clone
For all-repos usage, the script expects `REPOS_TO_CLONE_JSON` to be available in the environment, typically through `uv run --env-file=...`.
For single-repo usage with `--github-repo-url`, `REPOS_TO_CLONE_JSON` is not used.
Example:

```bash
REPOS_TO_CLONE_JSON='[
  "git@github.com:Brown-University-Library/repo_a.git",
  "git@github.com:Brown-University-Library/repo_b.git"
]'
```

From this project directory...
All-repos mode, using `REPOS_TO_CLONE_JSON` from the `.env`:

```bash
uv run --env-file="/path/to/.env" ./main.py --enclosing-dir "/path/to/enclosing_dir/"
```

Single-repo mode, using `--github-repo-url` and no `REPOS_TO_CLONE_JSON`:

```bash
uv run ./main.py --enclosing-dir "/path/to/enclosing_dir/" --github-repo-url "git@github.com:Brown-University-Library/the_repo.git"
```

Show help:

```bash
uv run ./main.py --help
```

- The script validates that `git` is available.
- It resolves the repo list from one of these sources:
  - all-repos mode: `REPOS_TO_CLONE_JSON`
  - single-repo mode: `--github-repo-url`
- It creates the enclosing directory if it does not already exist.
- It loads or initializes `../gather_repos_state.json`, relative to `gather_repos_code/main.py`.
- It checks each target repo's remote `main` branch commit with `git ls-remote`.
- It compares that remote `main` commit to the last saved commit for the repo.
- It prompts only for destination directories that need refresh because the repo changed or the local directory is missing.
- It clones each changed repository into the enclosing directory with `--depth 1 --branch main --single-branch`.
- It records the cloned commit SHA and committer timestamp in the state file.
- It finds `.git/` directories anywhere inside the cloned repo.
- It validates that each `.git/` path resolves inside that repo's root.
- It deletes those `.git/` directories.
- It scans files in that cloned repo and deletes or obfuscates likely sensitive values before the repo is left in place for analysis.
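The `.git/` discovery, containment check, and deletion steps can be sketched with `pathlib`. This is a hedged illustration of the described safeguard (resolving each candidate and confirming it stays inside the repo root, e.g. against symlink escapes), not the script's actual code:

```python
import shutil
from pathlib import Path


def remove_git_dirs(repo_root: Path) -> list[Path]:
    """Delete every .git/ directory under repo_root whose resolved path
    is still inside repo_root; return the paths that were removed."""
    root = repo_root.resolve()
    removed = []
    for candidate in sorted(root.rglob('.git')):
        if not candidate.is_dir():
            continue
        resolved = candidate.resolve()
        # Containment check: refuse to delete anything that resolves
        # outside the cloned repo's root (Python 3.9+ is_relative_to).
        if resolved.is_relative_to(root):
            shutil.rmtree(resolved)
            removed.append(resolved)
    return removed
```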
All-repos mode:
- uses every repo listed in `REPOS_TO_CLONE_JSON`
- is the mode where an `.env` file is typically involved
- is useful for gathering a related set of repositories into one bundle
Single-repo mode:
- uses only the repo passed to `--github-repo-url`
- does not require `REPOS_TO_CLONE_JSON`
- is useful for refreshing or inspecting one repository without maintaining an env-based repo list
If one or more destination directories need refresh, the script lists only those directories and prompts for confirmation before deleting them.
- Typing `yes` allows the script to fully delete and replace those directories.
- Any other response causes the script to exit without making changes.
If all repos are already current and their local directories still exist, the script skips them without prompting.
Because this confirmation uses an interactive prompt, reruns that actually need replacement require a terminal session with stdin attached.
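The confirmation flow described above amounts to a gate like the following sketch. The `ask` parameter is an illustrative seam (defaulting to `input`) rather than anything in `main.py`:

```python
def confirm_deletion(dirs_to_refresh, ask=input):
    """List only the directories needing refresh and require a literal 'yes'."""
    if not dirs_to_refresh:
        return  # everything is current: no prompt at all
    print('These directories will be deleted and re-cloned:')
    for d in dirs_to_refresh:
        print(f'  {d}')
    if ask('Type "yes" to proceed: ').strip() != 'yes':
        raise SystemExit('Aborted; no changes made.')
```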
The script keeps per-repo update state in a JSON file located at:
`<project_root>/gather_repos_state.json`
That file records:
- the last seen remote `main` commit SHA for each repo
- the last saved commit timestamp from a successful refresh
- when each repo was last checked
- when each repo was last refreshed locally
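A plausible shape for one state-file update is sketched below. The field names are guesses based on the list above, not the actual keys `main.py` writes:

```python
import json
import time
from pathlib import Path


def record_refresh(state_path: Path, repo_url: str, sha: str, commit_ts: str) -> None:
    """Load (or initialize) the JSON state file and update one repo's entry.

    Field names here are illustrative; the real script may use different keys.
    """
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    now = time.strftime('%Y-%m-%dT%H:%M:%S')
    state[repo_url] = {
        'last_commit_sha': sha,        # last seen remote main commit
        'last_commit_timestamp': commit_ts,
        'last_checked': now,
        'last_refreshed': now,
    }
    state_path.write_text(json.dumps(state, indent=2))
```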
This state file is kept outside `gather_repos_code/` at a fixed project-level location, independent of the chosen `--enclosing-dir`.
All-repos example:
Given:
```bash
REPOS_TO_CLONE_JSON='[
  "git@github.com:Brown-University-Library/repo_a.git",
  "git@github.com:Brown-University-Library/repo_b.git"
]'
```

and:

```bash
--enclosing-dir "/tmp/repo_bundle"
```

the resulting layout will be:

```
/tmp/repo_bundle/
  repo_a/
  repo_b/
```
Single-repo example:
Given:
```bash
--github-repo-url "git@github.com:Brown-University-Library/the_repo.git"
```

and:

```bash
--enclosing-dir "/tmp/repo_bundle"
```

the resulting layout will be:

```
/tmp/repo_bundle/
  the_repo/
```
In either mode, each repo directory will contain the working-tree content, but not any `.git/` directories found inside that cloned repo tree. Sensitive text detected by the cleanup pass will be replaced with deterministic placeholder values. The project-level `gather_repos_state.json` file will hold refresh-tracking metadata outside that output bundle.
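One common way to get deterministic placeholders is to hash the original value, so repeated occurrences of the same email map to the same token across files. This sketch assumes that approach and a simple email regex; it is not the script's actual sanitizer:

```python
import hashlib
import re

EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')


def placeholder(value: str, kind: str) -> str:
    """Same input always yields the same placeholder, keeping references consistent."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:8]
    return f'<{kind}-{digest}>'


def obfuscate_emails(text: str) -> str:
    """Replace every email address with its deterministic placeholder."""
    return EMAIL_RE.sub(lambda m: placeholder(m.group(0), 'email'), text)
```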
- The script processes repositories sequentially.
- It stops on the first hard failure.
- Only the remote `main` branch is considered when deciding whether a repo needs refresh.
- `REPOS_TO_CLONE_JSON` is required only for all-repos mode.
- `--github-repo-url` bypasses `REPOS_TO_CLONE_JSON` and targets exactly one repo.
- It removes `.git/` directories recursively within each cloned repo.
- Before deleting any discovered `.git/` directory, it validates that the resolved path is still inside that repo's root.
- The cleanup step rewrites text files in place and skips files that look binary based on suffixes and content.
- The cleanup step is intentionally conservative about structure: it aims to preserve code shape while obfuscating sensitive values.
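The "suffixes and content" binary heuristic mentioned above typically combines a known-suffix list with a null-byte sniff of the file's leading bytes. A hedged sketch (the suffix set and sniff size are illustrative choices, not the script's):

```python
from pathlib import Path

# Illustrative suffix set; the real script's list may differ.
BINARY_SUFFIXES = {'.png', '.jpg', '.gif', '.pdf', '.zip', '.pyc', '.so', '.ico', '.woff2'}


def looks_binary(path: Path, sniff_bytes: int = 8192) -> bool:
    """Suffix check first, then a null-byte sniff of the leading bytes."""
    if path.suffix.lower() in BINARY_SUFFIXES:
        return True
    try:
        chunk = path.read_bytes()[:sniff_bytes]
    except OSError:
        return True  # unreadable: treat as binary and leave it alone
    return b'\x00' in chunk
```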