searchset

Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows.

Table of Contents | Source: src/cmd/searchset.rs | 📇🏎️👆

Description | Usage | Arguments | Searchset Options | Common Options

Description

Filters CSV data by whether the given regex set matches a row.

Unlike the search command, searchset matches multiple regexes in a single pass.

The regexset-file is a plain text file with multiple regexes, with a regex on each line. Lines starting with '#' (optionally preceded by whitespace) are treated as comments and ignored. For an example scanning for common Personally Identifiable Information (PII) - SSN, credit cards, email, bank account numbers & phones, see https://github.com/dathere/qsv/blob/master/resources/examples/searchset/pii_regexes.txt
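
As a rough illustration of the file format described above, the following Python sketch loads a regex set from text, skipping comment lines. It is not qsv's implementation: the sample patterns are hypothetical (they are not taken from pii_regexes.txt), Python's `re` syntax differs from the Rust regex syntax qsv uses, and skipping blank lines is an assumption of this sketch.

```python
import re

# Hypothetical regex set; not the contents of pii_regexes.txt.
SAMPLE = r"""
# US Social Security number shape
\d{3}-\d{2}-\d{4}
   # an indented comment is ignored as well
[\w.+-]+@[\w-]+\.[\w.]+
"""


def load_regex_set(text):
    """Compile one pattern per line, ignoring comment lines."""
    patterns = []
    for line in text.splitlines():
        stripped = line.strip()
        # Lines starting with '#' (optionally preceded by whitespace) are comments.
        # Skipping blank lines is an assumption made for this sketch.
        if not stripped or stripped.startswith("#"):
            continue
        patterns.append(re.compile(line))
    return patterns


patterns = load_regex_set(SAMPLE)
print(len(patterns))  # 2
```

Keeping one pattern per line means a comment can sit directly above the regex it documents.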

The regex set is applied to each field in each row. If any field matches, the row is written to the output, and the number of matches is reported to stderr.

The columns to search can be limited with the '--select' flag (but the full row is still written to the output if there is a match).
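
The matching rule above can be sketched in Python as follows. This is a simplified model, not qsv's code: `search_rows`, `DATA`, and `PATTERNS` are hypothetical names, the `select` parameter accepts only plain column names (the real '--select' syntax is richer), and Python's `re` stands in for the Rust regex engine.

```python
import csv
import io
import re

# Hypothetical sample data and patterns for illustration only.
DATA = """id,name,note
1,alice,call 555-1234
2,bob,no phone
3,carol,email carol@example.com
"""
PATTERNS = [r"\d{3}-\d{4}", r"@example\.com"]


def search_rows(csv_text, patterns, select=None):
    """Return (header, rows) where any searched field matches any pattern."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Search every column unless a subset of column names is given.
    if select is None:
        cols = range(len(header))
    else:
        cols = [header.index(name) for name in select]
    regexes = [re.compile(p) for p in patterns]
    matched = []
    for row in reader:
        # Each field is tested individually; one hit is enough to keep the row.
        if any(rx.search(row[i]) for i in cols for rx in regexes):
            matched.append(row)  # the full row is kept, not just searched columns
    return header, matched


header, matched = search_rows(DATA, PATTERNS)
print([row[0] for row in matched])  # ['1', '3']
```

Restricting the search to the `name` column, as in `search_rows(DATA, PATTERNS, select=["name"])`, returns no rows here, while `select=["note"]` still returns the complete rows 1 and 3.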

Returns exit code 0 when matches are found and reports the number of matches to stderr. Returns exit code 1 when no match is found, unless the '--not-one' flag is used, in which case exit code 0 is returned.

When --quick is enabled, no output is produced and exitcode 0 is returned on the first match.
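
The exit-code rules and the early return of --quick can be modeled with a small Python sketch. The function names are hypothetical, and numbering data rows from 1 is an assumption of this sketch rather than a documented detail.

```python
import re


def first_matching_row(rows, patterns):
    """Return the number of the first row with a matching field, or None."""
    regexes = [re.compile(p) for p in patterns]
    for row_number, row in enumerate(rows, start=1):  # 1-based numbering is assumed
        if any(rx.search(field) for field in row for rx in regexes):
            return row_number  # stop scanning at the first match, like --quick
    return None


def exit_code(found, not_one=False):
    """0 when a match was found; otherwise 1, or 0 when --not-one is set."""
    return 0 if found or not_one else 1


rows = [["a", "b"], ["x", "42"], ["y", "43"]]
hit = first_matching_row(rows, [r"\d+"])
print(hit, exit_code(hit is not None))  # 2 0
```

A calling script can branch on the exit status alone, which is why --quick produces no output.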

When the CSV is indexed, a faster parallel search is used.

For examples, see https://github.com/dathere/qsv/blob/master/tests/test_searchset.rs.

Usage

qsv searchset [options] (<regexset-file>) [<input>]
qsv searchset --help

Arguments

| Argument | Description |
|----------|-------------|
| `<regexset-file>` | The file containing regular expressions to match, with a regular expression on each line. See https://docs.rs/regex/latest/regex/index.html#syntax or https://regex101.com with the Rust flavor for regex syntax. |
| `<input>` | The CSV file to read. If not given, reads from stdin. |

Searchset Options

| Option | Type | Description | Default |
|--------|------|-------------|---------|
| `-i`, `--ignore-case` | flag | Case-insensitive search. This is equivalent to prefixing the regex with '(?i)'. | |
| `--literal` | flag | Treat the regex as a literal string. This allows you to search for matches that contain regex special characters. | |
| `--exact` | flag | Match the ENTIRE field exactly. Treats the pattern as a literal string (like --literal) and automatically anchors it to match the complete field value (^pattern$). | |
| `-s`, `--select` | string | Select the columns to search. See 'qsv select -h' for the full syntax. | |
| `-v`, `--invert-match` | flag | Select only rows that did not match. | |
| `-u`, `--unicode` | flag | Enable Unicode support. When enabled, character classes match all Unicode word characters instead of only ASCII word characters. Decreases performance. | |
| `-f`, `--flag` | string | If given, the command does not filter rows but instead flags the found rows in a new column with the given name. For each found row, that column is set to the row number of the row, followed by a semicolon, then a list of the matching regexes. | |
| `--flag-matches-only` | flag | When --flag is enabled, only rows that match are sent to output. Rows that do not match are filtered out. | |
| `--unmatched-output` | string | When --flag-matches-only is enabled, write the rows that did not match to the given file. | |
| `-Q`, `--quick` | flag | Return on first match with an exit code of 0, returning the row number of the first match to stderr. Return exit code 1 if no match is found. No output is produced. Ignored if --json is enabled. | |
| `-c`, `--count` | flag | Return number of matches to stderr. Ignored if --json is enabled. | |
| `-j`, `--json` | flag | Return number of matches, number of rows with matches, and number of rows to stderr in JSON format. | |
| `--size-limit` | string | Set the approximate size limit (MB) of the compiled regular expression. If the compiled expression exceeds this number, a compilation error is returned. Modify this only if you're getting regular expression compilation errors. | 50 |
| `--dfa-size-limit` | string | Set the approximate size of the cache (MB) used by the regular expression engine's Deterministic Finite Automaton (DFA). Modify this only if you're getting regular expression compilation errors. | 10 |
| `--not-one` | flag | Use exit code 0 instead of 1 when no match is found. | |
| `--jobs` | string | The number of jobs to run in parallel when the given CSV data has an index. Note that a file handle is opened for each job. When not set, defaults to the number of CPUs detected. | |
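
The effect of --literal, --exact, and --ignore-case on a pattern can be approximated with the Python sketch below. `prepare_pattern` is a hypothetical helper, not part of qsv, and it uses Python's `re.escape` as a stand-in for whatever escaping qsv applies internally.

```python
import re


def prepare_pattern(pattern, literal=False, exact=False, ignore_case=False):
    """Rewrite a pattern the way the option descriptions suggest."""
    if literal or exact:
        # Neutralize regex special characters so they match themselves.
        pattern = re.escape(pattern)
    if exact:
        # Anchor to the whole field value, i.e. ^pattern$.
        pattern = f"^{pattern}$"
    if ignore_case:
        # Equivalent to prefixing the regex with (?i).
        pattern = f"(?i){pattern}"
    return pattern


literal = prepare_pattern("a.b", literal=True)
exact = prepare_pattern("a.b", exact=True)

print(bool(re.search(literal, "xaxb")))  # False: '.' no longer matches any character
print(bool(re.search(literal, "xa.b")))  # True: literal match inside a longer field
print(bool(re.search(exact, "xa.b")))    # False: the whole field must equal 'a.b'
print(bool(re.search(exact, "a.b")))     # True
```

The difference between the two flags is only the anchoring: --literal still matches inside a longer field, while --exact requires the field to equal the pattern.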

Common Options

| Option | Type | Description | Default |
|--------|------|-------------|---------|
| `-h`, `--help` | flag | Display this message. | |
| `-o`, `--output` | string | Write output to the given file instead of stdout. | |
| `-n`, `--no-headers` | flag | When set, the first row will not be interpreted as headers (i.e., they are not searched, analyzed, sliced, etc.). | |
| `-d`, `--delimiter` | string | The field delimiter for reading CSV data. Must be a single character. (default: ,) | |
| `-p`, `--progressbar` | flag | Show progress bars. Not valid for stdin. | |
| `-q`, `--quiet` | flag | Do not return number of matches to stderr. | |
