Remove duplicate rows (See also extdedup, extsort, sort & sortcheck commands).
Table of Contents | Source: src/cmd/dedup.rs | 🤯🚀👆
Description | Examples | Usage | Dedup Options | Common Options
Description ↩
Deduplicates CSV rows.
This requires reading all of the CSV data into memory because the rows need to be sorted first.
That is, unless the --sorted option is used to indicate that the CSV is already sorted, typically by the sort command (for more sorting options) or the extsort command (for larger-than-memory CSV files). With --sorted, dedup runs in streaming mode with constant memory.
Either way, the output will not only be deduplicated, it will also be sorted.
A duplicate count will also be sent to stderr.
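
The difference between the two modes can be illustrated with a minimal Python sketch. This is not qsv's implementation (that lives in src/cmd/dedup.rs); the function names `dedup_in_memory` and `dedup_sorted_stream` and the sample data are hypothetical, and whole rows are compared as plain strings.

```python
import csv
import io
from itertools import groupby


def dedup_in_memory(rows):
    """Default-mode sketch: load every row, sort, then drop adjacent duplicates."""
    ordered = sorted(rows)  # requires all rows to be held in memory
    return [row for row, _group in groupby(ordered)]


def dedup_sorted_stream(rows):
    """--sorted-mode sketch: a single pass that remembers only the previous row."""
    previous = None
    for row in rows:
        if row != previous:
            yield row
        previous = row


# Hypothetical sample data with a header row.
data = "name,qty\nb,2\na,1\nb,2\na,1\nc,3\n"
reader = csv.reader(io.StringIO(data))
header = next(reader)
rows = [tuple(r) for r in reader]

in_memory = dedup_in_memory(rows)
streamed = list(dedup_sorted_stream(sorted(rows)))
dupe_count = len(rows) - len(in_memory)
```

Both paths produce the same sorted, deduplicated rows. The streaming version only works because equal rows are adjacent in sorted input, which is why it can run with constant memory.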
Examples ↩
Deduplicate an unsorted CSV file:
qsv dedup unsorted.csv -o deduped.csv

Deduplicate a sorted CSV file:
qsv sort unsorted.csv | qsv dedup --sorted -o deduped.csv

Deduplicate based on specific columns:
qsv dedup --select col1,col2 unsorted.csv -o deduped.csv

Deduplicate based on numeric comparison of col1 and col2 columns:
qsv dedup -s col1,col2 --numeric unsorted.csv -o deduped.csv

Deduplicate ignoring case of col1 and col2 columns:
qsv dedup -s col1,col2 --ignore-case unsorted.csv -o deduped.csv

Write duplicates to a separate file:
qsv dedup -s col1,col2 --dupes-output dupes.csv unsorted.csv -o deduped.csv

For more examples, see tests.
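
As a rough illustration of what column selection, case-insensitive comparison, and a separate duplicates output mean together, here is a minimal Python sketch. The helper `dedup_by_key` and the sample rows are hypothetical. Keeping the first row of each duplicate group is an assumption of this sketch, not something this page states about qsv.

```python
def dedup_by_key(rows, key_columns, ignore_case=False):
    """Split rows into (kept, dupes), comparing only the selected columns."""

    def key(row):
        values = [row[i] for i in key_columns]
        return [v.lower() for v in values] if ignore_case else values

    kept, dupes = [], []
    previous_key = None
    # Sorting by the key makes rows with equal keys adjacent.
    for row in sorted(rows, key=key):
        current_key = key(row)
        if current_key == previous_key:
            dupes.append(row)  # assumption: the first row of a group is kept
        else:
            kept.append(row)
        previous_key = current_key
    return kept, dupes


# Hypothetical rows: the first two differ only by case in the key columns.
rows = [
    ["Ann", "NY", "1"],
    ["ann", "ny", "2"],
    ["Bob", "LA", "3"],
]

kept, dupes = dedup_by_key(rows, key_columns=[0, 1], ignore_case=True)
kept_cs, dupes_cs = dedup_by_key(rows, key_columns=[0, 1])
```

The rows in `kept` and `dupes` keep every column. Only the comparison is restricted to the selected columns, which matches the note that outputs remain at the full width of the CSV.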
Usage ↩
qsv dedup [options] [<input>]
qsv dedup --help

Dedup Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| `-s, --select` | string | Select a subset of columns to dedup. Note that the outputs will remain at the full width of the CSV. See 'qsv select --help' for the format details. | |
| `-N, --numeric` | flag | Compare fields according to their numerical value rather than as strings. | |
| `-i, --ignore-case` | flag | Compare strings disregarding case. | |
| `--sorted` | flag | The input is already sorted. Do not load the CSV into memory to sort it first. Meant to be used in tandem with, and after, extsort. | |
| `-D, --dupes-output` | string | Write duplicates to the specified file. | |
| `-H, --human-readable` | flag | Comma-separate the duplicate count. | |
| `-j, --jobs` | string | The number of jobs to run in parallel when sorting an unsorted CSV, before deduping. When not set, the number of jobs is set to the number of CPUs detected. Does not work with the --sorted option, as it's not multithreaded. | |
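
To show why numeric comparison can change which values count as duplicates, here is a small Python sketch. It assumes that numeric comparison amounts to parsing each field as a number before comparing. How qsv actually parses numbers, and what it does with non-numeric fields, is not described on this page. `numeric_key` and the sample values are hypothetical.

```python
def numeric_key(value):
    """Parse a field as a number for comparison (assumed --numeric semantics)."""
    return float(value)


# Hypothetical column values.
values = ["10", "9", "10.0", "009"]

# Compared as strings, all four values are distinct and sort lexically.
as_strings = sorted(set(values))

# Compared as numbers, "10" equals "10.0" and "9" equals "009".
as_numbers = sorted({numeric_key(v) for v in values})
```

Under this assumption, a string comparison sees four distinct values, while a numeric comparison sees only two.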
Common Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
| `-h, --help` | flag | Display this message. | |
| `-o, --output` | string | Write output to the specified file instead of stdout. | |
| `-n, --no-headers` | flag | When set, the first row will not be interpreted as headers. That is, it will be sorted with the rest of the rows. Otherwise, the first row will always appear as the header row in the output. | |
| `-d, --delimiter` | string | The field delimiter for reading CSV data. Must be a single character. | `,` |
| `-q, --quiet` | flag | Do not print duplicate count to stderr. | |
| `--memcheck` | flag | Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. | |