joinp

Inner, outer, right, cross, anti, semi, non-equi & asof joins using the Pola.rs engine. Unlike the join command, joinp can process files larger than RAM, is multithreaded, has join key validation, a maintain row order option, pre and post-join filtering, join keys unicode normalization, supports "special" non-equi joins and asof joins (which is particularly useful for time series data) & its output columns can be coalesced.

Table of Contents | Source: src/cmd/joinp.rs | 🚀🐻‍❄️🪄

Description ↩

Joins two sets of CSV data on the specified columns using the Polars engine.

The default join operation is an 'inner' join. This corresponds to the intersection of rows on the keys specified.

Unlike the join command, joinp can process files larger than RAM, is multithreaded, has join key validation, a maintain row order option, pre-join filtering, supports non-equi & asof joins and its output columns can be coalesced (no duplicate columns).

Returns the shape of the join result (number of rows, number of columns) to stderr.

For examples, see https://github.com/dathere/qsv/blob/master/tests/test_joinp.rs.

Usage ↩

qsv joinp [options] <columns1> <input1> <columns2> <input2>
qsv joinp --cross [--validate <arg>] <input1> <input2> [--decimal-comma] [--delimiter <arg>] [--output <file>]
qsv joinp --non-equi <expr> <input1> <input2> [options] [--output <file>]
qsv joinp --help

Joinp Options ↩

Option	Type	Description	Default
`‑‑left`	flag	Do a 'left outer' join. This returns all rows in first CSV data set, including rows with no corresponding row in the second data set. When no corresponding row exists, it is padded out with empty fields.
`‑‑left‑anti`	flag	This returns only the rows in the first CSV data set that do not have a corresponding row in the second data set. The output schema is the same as the first dataset.
`‑‑left‑semi`	flag	This returns only the rows in the first CSV data set that have a corresponding row in the second data set. The output schema is the same as the first data set.
`‑‑right`	flag	Do a 'right outer' join. This returns all rows in second CSV data set, including rows with no corresponding row in the first data set. When no corresponding row exists, it is padded out with empty fields. (This is the reverse of 'outer left'.)
`‑‑right‑anti`	flag	This returns only the rows in the second CSV data set that do not have a corresponding row in the first data set. The output schema is the same as the second dataset.
`‑‑right‑semi`	flag	This returns only the rows in the second CSV data set that have a corresponding row in the first data set. The output schema is the same as the second data set.
`‑‑full`	flag	Do a 'full outer' join. This returns all rows in both data sets with matching records joined. If there is no match, the missing side will be padded out with empty fields.
`‑‑cross`	flag	USE WITH CAUTION. This returns the cartesian product of the CSV data sets given. The number of rows return is equal to N * M, where N and M correspond to the number of rows in the given data sets, respectively. The columns1 and columns2 arguments are ignored.
`‑‑non‑equi`	string	Do a non-equi join. The given expression is evaluated for each row in the left dataset and can refer to columns in the left and right dataset. If the expression evaluates to true, the row is joined with the corresponding row in the right dataset. The expression is a valid Polars SQL where clause, with each column name followed by "_left" or "_right" suffixes to indicate which data set the column belongs to. (e.g. "salary_left >= min_salary_right AND \ salary_left <= max_salary_right AND \ experience_left >= min_exp_right")
`‑‑coalesce`	flag	Force the join to coalesce columns with the same name. For inner joins, this is not necessary as the join columns are automatically coalesced.
`‑‑filter‑left`	string	Filter the left CSV data set by the given Polars SQL expression BEFORE the join. Only rows that evaluates to true are used in the join.
`‑‑filter‑right`	string	Filter the right CSV data set by the given Polars SQL expression BEFORE the join. Only rows that evaluates to true are used in the join.
`‑‑validate`	string	Validate the join keys BEFORE performing the join.	`none`

Join Options ↩

Option	Type	Description	Default
`‑‑maintain‑order`	string	Which row order to preserve, if any. Valid values are: none, left, right, left_right, right_left Do not rely on any observed ordering without explicitly setting this parameter. Not specifying any order can improve performance. Supported for inner, left, right and full joins.	`none`
`‑‑nulls`	flag	When set, joins will work on empty fields. Otherwise, empty fields are completely ignored.
`‑‑streaming`	flag	When set, the join will be done in a streaming fashion. Only use this when you get out of memory errors.

Polars CSV Parsing Options ↩

Option	Type	Description	Default
`‑‑try‑parsedates`	flag	When set, will attempt to parse the columns as dates. If the parse fails, columns remain as strings. This is useful when the join keys are formatted as dates with differing date formats, as the date formats will be normalized. Note that this will be automatically enabled when using asof joins.
`‑‑infer‑len`	string	The number of rows to scan when inferring the schema of the CSV. Set to 0 to do a full table scan (warning: very slow). Only used when --cache-schema is 0 or 1 and no cached schema exists or when --infer-len is 0.	`10000`
`‑‑cache‑schema`	string	Create and cache Polars schema JSON files. Ignored when --infer-len is 0. -2: treat all columns as String. A Polars schema file is created & cached. -1: treat all columns as String. No Polars schema file is created. 0: do not cache Polars schema. Uses --infer-len to infer schema. 1: cache Polars schema with the following behavior: If schema file exists and is newer than input: use cached schema If schema file missing/outdated and stats cache exists: derive schema from stats and cache it If no schema or stats cache: infer schema using --infer-len and cache the result Schema files use the same name as input with .pschema.json extension (e.g., data.csv -> data.pschema.json). NOTE: If the input files have pschema.json files that are newer or created at the same time as the input files, they will be used to inform the join operation regardless of the value of --cache-schema unless --infer-len is 0.	`0`
`‑‑low‑memory`	flag	Use low memory mode when parsing CSVs. This will use less memory but will be slower. It will also process the join in streaming mode. Only use this when you get out of memory errors.
`‑‑no‑optimizations`	flag	Disable non-default join optimizations. This will make joins slower. Only use this when you get join errors.
`‑‑ignore‑errors`	flag	Ignore errors when parsing CSVs. If set, rows with errors will be skipped. If not set, the query will fail. Only use this when debugging queries, as polars does batched parsing and will skip the entire batch where the error occurred. To get more detailed error messages, set the environment variable POLARS_BACKTRACE_IN_ERR=1 before running the join.
`‑‑decimal‑comma`	flag	Use comma as the decimal separator when parsing & writing CSVs. Otherwise, use period as the decimal separator. Note that you'll need to set --delimiter to an alternate delimiter other than the default comma if you are using this option.

Asof Join Options ↩

Option	Type	Description	Default
`‑‑asof`	flag	Do an 'asof' join. This is similar to a left inner join, except we match on nearest key rather than equal keys (see --allow-exact-matches). Particularly useful for time series data. Note that both CSV data sets will be SORTED on the join columns by default, unless --no-sort is set.
`‑‑no‑sort`	flag	Do not sort the CSV data sets on the join columns by default. Note that asof joins REQUIRE the join keys to be sorted, so this option should only be used as a performance optimization when you know the CSV join keys are already sorted. If the CSV join keys are not sorted, the asof join will fail or return incorrect results.
`‑‑left_by`	string	Do an 'asof_by' join - a special implementation of the asof join that searches for the nearest keys within a subgroup set by the asof_by columns. This specifies the column/s for the left CSV. Columns are referenced by name. Specify multiple columns by separating them with a comma.
`‑‑right_by`	string	Do an 'asof_by' join. This specifies the column/s for the right CSV.
`‑‑strategy`	string	The strategy to use for the asof join: backward - For each row in the first CSV data set, we find the last row in the second data set whose key is less than the key in the first data set (or <= with --allow-exact-matches). forward - For each row in the first CSV data set, we find the first row in the second data set whose key is greater than the key in the first data set (or >= with --allow-exact-matches). nearest - selects the last row in the second data set whose value is nearest to the value in the first data set.	`backward`
`‑‑tolerance`	string	The tolerance for the nearest asof join. This is only used when the nearest strategy is used. The tolerance is a positive integer that specifies the maximum number of rows to search for a match.
`‑X,` `‑‑allow‑exact‑matches`	flag	When set, the asof join will allow exact matches. (i.e. less-than-or-equal-to or greater-than-or-equal-to) Otherwise, the asof join will only allow nearest matches (strictly less-than or greater-than) by default.

Output Format Options ↩

Option	Type	Description
`‑‑sql‑filter`	string	The SQL expression to apply against the join result. Used to select columns and filter rows AFTER running the join. Be sure to select from the "join_result" table when formulating the SQL expression. (e.g. "select c1, c2 as colname from join_result where c2 > 20")
`‑‑datetime‑format`	string	The datetime format to use writing datetimes. See https://docs.rs/chrono/latest/chrono/format/strftime/index.html for the list of valid format specifiers.
`‑‑date‑format`	string	The date format to use writing dates.
`‑‑time‑format`	string	The time format to use writing times.
`‑‑float‑precision`	string	The number of digits of precision to use when writing floats. (default: 6)
`‑‑null‑value`	string	The string to use when writing null values. (default: )

Join Key Transformation Options ↩

Option	Type	Description	Default
`‑i,` `‑‑ignore‑case`	flag	When set, joins are done case insensitively.
`‑z,` `‑‑ignore‑leading‑zeros`	flag	When set, joins are done ignoring leading zeros. Note that this is only applied to the join keys for both numeric and string columns. Also note that Polars will automatically remove leading zeros from numeric columns when it infers the schema. To force the schema to be all String types, set --cache-schema to -1 or -2.
`‑N,` `‑‑norm‑unicode`	string	When set, join keys are Unicode normalized.	`none`

Common Options ↩

Option	Type	Description
`‑h,` `‑‑help`	flag	Display this message
`‑o,` `‑‑output`	string	Write output to instead of stdout.
`‑d,` `‑‑delimiter`	string	The field delimiter for reading/writing CSV data. Must be a single character. (default: ,)
`‑q,` `‑‑quiet`	flag	Do not return join shape to stderr.

Source: src/cmd/joinp.rs | Table of Contents | README

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

joinp

Description ↩

Usage ↩

Joinp Options ↩

Join Options ↩

Polars CSV Parsing Options ↩

Asof Join Options ↩

Output Format Options ↩

Join Key Transformation Options ↩

Common Options ↩

FilesExpand file tree

joinp.md

Latest commit

History

joinp.md

File metadata and controls

joinp

Description ↩

Usage ↩

Joinp Options ↩

Join Options ↩

Polars CSV Parsing Options ↩

Asof Join Options ↩

Output Format Options ↩

Join Key Transformation Options ↩

Common Options ↩