Duperemover

A Python utility for efficiently deduplicating files in directories.

Install

pip install duperemover

Usage

from duperemover import Deduplicator

dedup = Deduplicator(
    directory="/path/to/directory",
    hash_algorithm="xxhash",
    replace_strategy="hardlink",
    progress=True,
)
dedup.deduplicate()
dedup.print_stats()

CLI

duperemover --help

Command Syntax

duperemover <directory> [options]

Arguments:
  <directory>                    Directory to scan for duplicates.
  --hash-file <file>             File to store hashes (default: .hashes.db).
  --buffer-size <size>           Buffer size in bytes for hashing (default: 65536, i.e. 64 KiB).
  --hash-algorithm <alg>         Hashing algorithm (choices: "xxhash", "blake3", "sha256"; default: "xxhash" if available).
  --replace-strategy <strategy>  Strategy for handling duplicates (choices: "hardlink", "delete", "rename", "reflink"; default: "hardlink").
  --max-threads <num>            Number of threads to use for processing (default: 4).
  --sync-interval <num>          Sync interval for hashes to disk (default: 100).
  --progress                     Show a progress bar while processing files.
  --dry-run                      Simulate the deduplication process without making any changes.
  --use-bloom-filter             Use a Bloom filter to speed up duplicate checking.
  --use-reflink                  Use reflink/dedupe for filesystem-level deduplication (btrfs, xfs).
  --exclude PATTERNS             Exclude files matching these patterns.
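
The help text above does not say which syntax PATTERNS uses. As a rough illustration only, the sketch below assumes shell-style globs matched against the file's base name, using the standard-library fnmatch module. The helper name matches_any is hypothetical and is not part of duperemover; the real matching rules may differ.

```python
from fnmatch import fnmatch
from pathlib import Path


def matches_any(file_path: str, patterns: list[str]) -> bool:
    """Return True if the file's base name matches any glob pattern.

    Hypothetical helper for illustration; duperemover's own exclusion
    logic is not shown in this README and may behave differently.
    """
    name = Path(file_path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


patterns = ["*.tmp", ".git*"]
print(matches_any("/data/cache/build.tmp", patterns))
print(matches_any("/data/photos/cat.jpg", patterns))
```

Matching on the base name keeps patterns short, but it cannot exclude a whole subdirectory; matching against the full path would be an equally plausible design.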

Examples

# Basic deduplication (using default hashing algorithm)
duperemover /path/to/directory

# Using SHA256 as the hashing algorithm
duperemover /path/to/directory --hash-algorithm sha256

# Simulate deduplication (dry run)
duperemover /path/to/directory --dry-run

# Create hard links for duplicates, use Bloom filter, and show progress
duperemover /path/to/directory --replace-strategy hardlink --use-bloom-filter --progress

# Use reflink/dedupe for filesystem-level deduplication (btrfs, xfs)
duperemover /path/to/directory --replace-strategy reflink

Features

Hash Algorithms: Choose between xxhash, blake3, and sha256 for calculating file hashes.
Duplicate Handling Strategies:
- hardlink: Replace duplicates with hard links.
- delete: Delete duplicate files.
- rename: Rename duplicate files by appending .duplicate to their names.
- reflink: Use filesystem-level reflink/deduplication (btrfs, xfs with reflink support).
Multi-threading: Process files in parallel to speed up deduplication.
Bloom Filter: Optionally enable a Bloom filter to speed up duplicate checks and avoid re-hashing files.
Exclusion Patterns: Exclude files matching specific patterns from the deduplication process.
Progress Bar: Optionally display a progress bar while files are processed.
Dry Run: Simulate deduplication without changing any files (useful for testing).
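
To give a feel for the Bloom filter feature, here is a minimal, self-contained sketch of the data structure itself. It is not duperemover's implementation: the class name, the sizes, and the use of salted BLAKE2b are all assumptions chosen for illustration. The key property is that a Bloom filter never reports a false negative, so a "not seen" answer lets a tool skip a slower lookup, while a "maybe seen" answer still has to be confirmed.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives.

    Illustrative sketch only; not taken from duperemover.
    """

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            salt = i.to_bytes(8, "little")
            digest = hashlib.blake2b(item, salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.size_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add(b"digest-of-file-a")
print(b"digest-of-file-a" in seen)
```

What exactly gets stored in the filter (file digests, sizes, or paths) is a design choice this README does not specify.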

API

Deduplicator

from duperemover import Deduplicator

Constructor

Deduplicator(
    directory: str,
    hash_file: str = ".hashes.db",
    buffer_size: int = 65536,
    hash_algorithm: str = "xxhash",
    replace_strategy: str = "hardlink",
    max_threads: int = 4,
    sync_interval: int = 100,
    progress: bool = False,
    dry_run: bool = False,
    exclude_patterns: list[str] | None = None,
    use_bloom_filter: bool = False,
    use_reflink: bool = False,
)
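
The buffer_size and hash_algorithm parameters suggest that files are hashed in fixed-size chunks. The sketch below shows that general technique with the standard library, assuming SHA-256 via hashlib (one of the listed algorithm choices). The function names file_digest and group_by_digest are hypothetical and are not duperemover's API.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, buffer_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(buffer_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_by_digest(paths: list[str]) -> dict[str, list[str]]:
    """Map each digest to the paths sharing it; groups of 2+ are duplicates."""
    groups: dict[str, list[str]] = {}
    for path in paths:
        groups.setdefault(file_digest(path), []).append(path)
    return groups


# Demo on throwaway files: two share content, one differs.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.txt": b"same", "b.txt": b"same", "c.txt": b"different"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    duplicates = [g for g in group_by_digest(paths).values() if len(g) > 1]
    print(len(duplicates))
```

A larger buffer reduces the number of read calls at the cost of more memory per worker thread.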

Methods

deduplicate(): Scan the directory for duplicates and process each file.
print_stats(): Print deduplication statistics.
count_files(directory): Count the number of files in a directory.
get_file_hash(file_path): Calculate and return the hash of a file.
are_same_file(file1, file2): Check if two files are the same based on their inodes.
create_hard_link(source, target): Create a hard link from the source file to the target file.
create_reflink(source, target): Create a reflink (filesystem-level deduplication) from source to target.
delete_duplicate(file_path): Delete a duplicate file.
rename_duplicate(file_path): Rename a duplicate file by appending .duplicate.
is_excluded(file_path): Check if a file matches any exclusion pattern.
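
For readers curious how an inode-based identity check and a hard-link replacement can work, here is a minimal sketch using only the standard library. It is an assumption about the general approach, not a copy of are_same_file or create_hard_link; the names same_inode and replace_with_hard_link and the ".linktmp" suffix are hypothetical.

```python
import os
import tempfile


def same_inode(path_a: str, path_b: str) -> bool:
    """True if both paths refer to the same inode on the same device."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def replace_with_hard_link(original: str, duplicate: str) -> None:
    """Replace `duplicate` with a hard link to `original`.

    The link is created under a temporary name and then renamed over the
    duplicate, so the duplicate path stays intact if linking fails.
    """
    if same_inode(original, duplicate):
        return  # already the same file; nothing to do
    temp_name = duplicate + ".linktmp"  # hypothetical temporary suffix
    os.link(original, temp_name)
    os.replace(temp_name, duplicate)


# Demo on two throwaway files with identical content.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.bin")
    second = os.path.join(tmp, "second.bin")
    for path in (first, second):
        with open(path, "wb") as handle:
            handle.write(b"identical payload")
    before = same_inode(first, second)
    replace_with_hard_link(first, second)
    after = same_inode(first, second)
    link_count = os.stat(first).st_nlink
    print(before, after, link_count)
```

Comparing the device number alongside the inode matters because inode numbers are only unique within one filesystem. Hard links also cannot cross filesystems, so os.link raises OSError in that case.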

Development

git clone https://github.com/daedalus/duperemover.git
cd duperemover
pip install -e ".[test]"

# run tests
pytest

# format
ruff format src/ tests/

# lint
ruff check src/ tests/

# type check
mypy src/
