moarstats

Add dozens of additional statistics, including extended outlier, robust & bivariate statistics to an existing stats CSV file. (example).

Table of Contents | Source: src/cmd/moarstats.rs | 📇🏎️

Description ↩

Add dozens of additional statistics, including extended outlier, robust & bivariate statistics to an existing stats CSV file. It also maps the field type to the most specific W3C XML Schema Definition (XSD) datatype (https://www.w3.org/TR/xmlschema-2/).

The moarstats command extends an existing stats CSV file (created by the stats command) by computing "moar" (https://www.dictionary.com/culture/slang/moar) statistics that can be derived from existing stats columns and by scanning the original CSV file.

It looks for the <FILESTEM>.stats.csv file for a given CSV input. If the stats CSV file does not exist, it will first run the stats command with configurable options to establish the baseline stats, to which it will add more stats columns.

If the .stats.csv file is found, it will skip running stats and just append the additional stats columns.

Currently computes the following 25 additional univariate statistics:

Pearson's Second Skewness Coefficient: 3 * (mean - median) / stddev Measures asymmetry of the distribution. Positive values indicate right skew, negative values indicate left skew. https://en.wikipedia.org/wiki/Skewness
Range to Standard Deviation Ratio: range / stddev Normalizes the spread of data. Higher values indicate more extreme outliers relative to the variability.
Quartile Coefficient of Dispersion: (Q3 - Q1) / (Q3 + Q1) Measures relative variability using quartiles. Useful for comparing dispersion across different scales. https://en.wikipedia.org/wiki/Quartile_coefficient_of_dispersion
Z-Score of Mode: (mode - mean) / stddev Indicates how typical the mode is relative to the distribution. Values near 0 suggest the mode is near the mean.
Relative Standard Error: sem / mean Measures precision of the mean estimate relative to its magnitude. Lower values indicate more reliable estimates.
Z-Score of Min: (min - mean) / stddev Shows how extreme the minimum value is. Large negative values indicate outliers or heavy left tail.
Z-Score of Max: (max - mean) / stddev Shows how extreme the maximum value is. Large positive values indicate outliers or heavy right tail.
Median-to-Mean Ratio: median / mean Indicates skewness direction. Ratio < 1 suggests right skew, > 1 suggests left skew, = 1 suggests symmetry.
IQR-to-Range Ratio: iqr / range Measures concentration of data. Higher values (closer to 1) indicate more data concentrated in the middle 50%.
MAD-to-StdDev Ratio: mad / stddev Compares robust vs non-robust spread measures. Higher values suggest presence of outliers affecting stddev.
Trimean: (Q1 + 2*median + Q3) / 4 Tukey's trimean - a robust estimator of central tendency combining the median with the midhinge. More robust than mean, more efficient than median alone. https://en.wikipedia.org/wiki/Trimean
Midhinge: (Q1 + Q3) / 2 Midpoint of the middle 50% of data. A robust central tendency measure that complements the mean and median. https://en.wikipedia.org/wiki/Midhinge
Robust CV: MAD / |median| Robust Coefficient of Variation using MAD and the magnitude of the median. Always non-negative. Resistant to outliers, useful for comparing variability. https://en.wikipedia.org/wiki/Robust_measures_of_scale
Kurtosis: Measures the "tailedness" of the distribution (excess kurtosis). Positive values indicate heavy tails, negative values indicate light tails. Values near 0 indicate a normal distribution. Requires --advanced flag. https://en.wikipedia.org/wiki/Kurtosis
Bimodality Coefficient: Measures whether a distribution has two modes (peaks) or is unimodal. BC < 0.555 indicates unimodal, BC >= 0.555 indicates bimodal/multimodal. Computed as (skewness² + 1) / (kurtosis + 3). Requires --advanced flag (needs skewness from base stats and kurtosis from --advanced flag). https://en.wikipedia.org/wiki/Bimodality
Jarque-Bera Test: (n/6) * (S² + K²/4) Standard test for normality using skewness and kurtosis. Also computes jarque_bera_pvalue (from chi-squared distribution with 2 df). Low p-values (< 0.05) indicate the data is NOT normally distributed. Requires --advanced flag (needs kurtosis). https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test
Gini Coefficient: Measures inequality/dispersion in the distribution. Values range from 0 (perfect equality) to 1 (maximum inequality). Requires --advanced flag. https://en.wikipedia.org/wiki/Gini_coefficient
Atkinson Index: Measures inequality in the distribution with a sensitivity parameter. Values range from 0 (perfect equality) to 1 (maximum inequality). The Atkinson Index is a more general form of the Gini coefficient that allows for different sensitivity to inequality. Sensitivity is configurable via --epsilon. Requires --advanced flag. https://en.wikipedia.org/wiki/Atkinson_index
Theil Index: (1/n) * Σ((x_i / mean) * ln(x_i / mean)) Measures inequality/concentration. Unlike Gini, it is decomposable into within-group and between-group components. Only computed for positive values. Requires --advanced flag. https://en.wikipedia.org/wiki/Theil_index
Mean Absolute Deviation (from mean): (1/n) * Σ|x_i - mean| Average absolute distance from the mean. Different from MAD (which uses median). Less robust but more statistically efficient than MAD. Requires --advanced flag.
Shannon Entropy: Measures the information content/uncertainty in the distribution. Higher values indicate more diversity, lower values indicate more concentration. Values range from 0 (all values identical) to log2(n) where n is the number of unique values. Requires --advanced flag. https://en.wikipedia.org/wiki/Entropy_(information_theory)
Normalized Entropy: Normalized version of Shannon Entropy scaled to [0, 1]. Values range from 0 (all values identical) to 1 (all values equally distributed). Computed as shannon_entropy / log2(cardinality). Requires shannon_entropy (from --advanced flag) and cardinality (from base stats).
Simpson's Diversity Index: 1 - Σ(p_i²) Probability that two randomly chosen values are different. Ranges from 0 (all identical) to 1 (all unique). More intuitive than entropy. Requires --advanced flag (computed alongside entropy from frequency data). https://en.wikipedia.org/wiki/Diversity_index#Simpson_index
Winsorized Mean: Replaces values below/above thresholds with threshold values, then computes mean. All values are included in the calculation, but extreme values are capped at thresholds. https://en.wikipedia.org/wiki/Winsorized_mean Also computes: winsorized_stddev, winsorized_variance, winsorized_cv, winsorized_range, and winsorized_stddev_ratio (winsorized_stddev / overall_stddev).
Trimmed Mean: Excludes values outside thresholds, then computes mean. Only values within thresholds are included in the calculation. https://en.wikipedia.org/wiki/Truncated_mean Also computes: trimmed_stddev, trimmed_variance, trimmed_cv, trimmed_range, and trimmed_stddev_ratio (trimmed_stddev / overall_stddev). By default, uses Q1 and Q3 as thresholds (25% winsorization/trimming). With --use-percentiles, uses configurable percentiles (e.g., 5th/95th) as thresholds with --pct-thresholds.

In addition, it computes the following univariate outlier statistics (24 outlier statistics total). https://en.wikipedia.org/wiki/Outlier (requires --quartiles or --everything in stats):

Outlier Counts (7 statistics):

outliers_extreme_lower_cnt: Count of values below the lower outer fence
outliers_mild_lower_cnt: Count of values between lower outer and inner fences
outliers_normal_cnt: Count of values between inner fences (non-outliers)
outliers_mild_upper_cnt: Count of values between upper inner and outer fences
outliers_extreme_upper_cnt: Count of values above the upper outer fence
outliers_total_cnt: Total count of all outliers (sum of extreme and mild outliers)
outliers_percentage: Percentage of values that are outliers

Outlier Descriptive Statistics (6 statistics):

outliers_mean: Mean value of outliers
non_outliers_mean: Mean value of non-outliers
outliers_to_normal_mean_ratio: Ratio of outlier mean to non-outlier mean
outliers_min: Minimum value among outliers
outliers_max: Maximum value among outliers
outliers_range: Range of outlier values (max - min)

Outlier Variance/Spread Statistics (7 statistics):

outliers_stddev: Standard deviation of outlier values
outliers_variance: Variance of outlier values
non_outliers_stddev: Standard deviation of non-outlier values
non_outliers_variance: Variance of non-outlier values
outliers_cv: Coefficient of variation for outliers (stddev / mean)
non_outliers_cv: Coefficient of variation for non-outliers (stddev / mean)
outliers_normal_stddev_ratio: Ratio of outlier stddev to non-outlier stddev

Outlier Impact Statistics (2 statistics):

outlier_impact: Difference between overall mean and non-outlier mean
outlier_impact_ratio: Relative impact (outlier_impact / non_outlier_mean)

Outlier Boundary Statistics (2 statistics):

lower_outer_fence_zscore: Z-score of the lower outer fence boundary
upper_outer_fence_zscore: Z-score of the upper outer fence boundary

These outlier statistics require reading the original CSV file and comparing each value against the fence thresholds. Fences are computed using the IQR method: inner fences at Q1/Q3 ± 1.5IQR, outer fences at Q1/Q3 ± 3.0IQR.

These univariate statistics are only computed for numeric and date/datetime columns where the required base univariate statistics (mean, median, stddev, etc.) are available. Univariate outlier statistics additionally require that quartiles (and thus fences) were computed when generating the stats CSV. Winsorized/trimmed means require either Q1/Q3 or percentiles to be available. Kurtosis, Gini & Atkinson Index require reading the original CSV file to collect all values for computation.

BIVARIATE STATISTICS:

The moarstats command also computes the following 6 bivariate statistics:

Pearson's correlation Measures linear correlation between two numeric/date fields. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 indicates no linear correlation. https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Spearman's rank correlation Measures monotonic correlation between two numeric/date fields. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 indicates no monotonic correlation. https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
Kendall's tau Measures monotonic correlation between two numeric/date fields. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 indicates no monotonic correlation. https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
Covariance Measures the linear relationship between two numeric/date fields. Values range from negative infinity to positive infinity. 0 indicates no linear relationship. https://en.wikipedia.org/wiki/Covariance
Mutual Information Measures the amount of information obtained about one field by observing another. Values range from 0 (independent) to positive infinity. https://en.wikipedia.org/wiki/Mutual_information
Normalized Mutual Information Normalized version of mutual information, scaled by the geometric mean of individual entropies. Values range from 0 (independent) to 1 (perfectly dependent). https://en.wikipedia.org/wiki/Mutual_information#Normalized_variants

These bivariate statistics are computed when the --bivariate flag is used and require an indexed CSV file (index will be auto-created if missing). Bivariate statistics are output to a separate file: <FILESTEM>.stats.bivariate.csv.

Bivariate statistics require reading the entire CSV file and are computationally VERY expensive. For large files (>= 10k records), parallel chunked processing is used when an index is available. For smaller files or when no index exists, sequential processing is used.

MULTI-DATASET BIVARIATE STATISTICS:

When using the --join-inputs flag, multiple datasets can be joined internally before computing bivariate statistics. This allows analyzing bivariate statistics across datasets that share common join keys. The joined dataset is saved as a temporary file that is automatically deleted after computing the bivariate statistics. The bivariate statistics are saved to <FILESTEM>.stats.bivariate.joined.csv.

Examples ↩

Add moar stats to existing stats file

qsv moarstats data.csv

Generate baseline stats first with custom options, then add moar stats

qsv moarstats data.csv --stats-options "--everything --infer-dates"

Compute bivariate statistics between fields

qsv moarstats data.csv --bivariate

Compute even more bivariate statistics

qsv moarstats data.csv --bivariate --bivariate-stats pearson,spearman,kendall,mi,nmi,covariance

Join multiple datasets and compute bivariate statistics

qsv moarstats data.csv --bivariate --join-inputs customers.csv,products.csv --join-keys cust_id,prod_id

Join multiple datasets and compute bivariate statistics with different join type

qsv moarstats data.csv --bivariate --join-inputs customers.csv,products.csv --join-keys cust_id,prod_id --join-type left

For more examples, see tests.

Usage ↩

qsv moarstats [options] [<input>]
qsv moarstats --help

Moarstats Options ↩

Option	Type	Description	Default
`‑‑advanced`	flag	Compute Kurtosis, Shannon Entropy, Bimodality Coefficient, Jarque-Bera, Gini Coefficient, Atkinson Index, Theil Index, Mean Absolute Deviation, and Simpson's Diversity Index. These advanced statistics computations require reading the original CSV file to collect all values for computation and are computationally expensive. Further, Entropy computation requires the frequency command to be run with --limit 0 to collect all frequencies. An index will be auto-created for the original CSV file if it doesn't already exist to enable parallel processing.
`‑e,` `‑‑epsilon`	string	The Atkinson Index Inequality Aversion parameter. Epsilon controls the sensitivity of the Atkinson Index to inequality. The higher the epsilon, the more sensitive the index is to inequality. Typical values are 0.5 (standard in economic research), 1.0 (natural boundary), or 2.0 (useful for poverty analysis).	`1.0`
`‑‑stats‑options`	string	Options to pass to the stats command if baseline stats need to be generated. The options are passed as a single string that will be split by whitespace.	`--infer-dates --infer-boolean --mad --quartiles --percentiles --force --stats-jsonl`
`‑‑round`	string	Round statistics to decimal places. Rounding follows Midpoint Nearest Even (Bankers Rounding) rule.	`4`
`‑‑use‑percentiles`	flag	Use percentiles instead of Q1/Q3 for winsorization/trimming. Requires percentiles to be computed in the stats CSV.
`‑‑pct‑thresholds`	string	Comma-separated percentile pair (e.g., "10,90") to use for winsorization/trimming when --use-percentiles is set. Both values must be between 0 and 100, and lower < upper.	`5,95`
`‑‑xsd‑gdate‑scan`	string	Gregorian XSD date type detection mode. "quick": Fast detection using min/max values. Produces types with ?? suffix (less confident). "thorough": Comprehensive detection checking all percentile values. Slower but ensures all values match the pattern. Produces types with ? suffix (more confident).	`quick`

Bivariate Statistics Options ↩

Option	Type	Description	Default
`‑B,` `‑‑bivariate`	flag	Enable bivariate statistics computation. Requires indexed CSV file (index will be auto-created if missing). Computes pairwise correlations, covariances, mutual information, and normalized mutual information between columns. The bivariate statistics
`‑S,` `‑‑bivariate‑stats`	string	Comma-separated list of bivariate statistics to compute. Options: pearson, spearman, kendall, covariance, mi (mutual information), nmi (normalized mutual information) Use "all" to compute all statistics or "fast" to compute only pearson & covariance, which is much faster as it doesn't require storing all values and uses streaming algorithms.	`fast`
`‑C,` `‑‑cardinality‑threshold`	string	Skip mutual information computation for field pairs where either field has cardinality exceeding this threshold. Helps avoid expensive computations for high-cardinality fields.	`1000000`
`‑J,` `‑‑join‑inputs`	string	Additional datasets to join. Comma-separated list of CSV files to join with the primary input. e.g.: --join-inputs customers.csv,products.csv
`‑K,` `‑‑join‑keys`	string	Join keys for each dataset. Comma-separated list of join key column names, one per dataset. Must specify same number of keys as datasets (primary + addl). e.g.: --join-keys customer_id,customer_id,product_id
`‑T,` `‑‑join‑type`	string	Join type when using --join-inputs. Valid values: inner, left, right, full	`inner`
`‑p,` `‑‑progressbar`	flag	Show progress bars when computing bivariate statistics.

Common Options ↩

Option	Type	Description
`‑‑force`	flag	Force recomputing stats even if valid precomputed stats cache exists.
`‑j,` `‑‑jobs`	string	The number of jobs to run in parallel. This works only when the given CSV has an index. Note that a file handle is opened for each job. When not set, the number of jobs is set to the number of CPUs detected.
`‑h,` `‑‑help`	flag	Display this message
`‑o,` `‑‑output`	string	Write output to instead of overwriting the stats CSV file.

Source: src/cmd/moarstats.rs | Table of Contents | README

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

moarstats

Description ↩

Examples ↩

Usage ↩

Moarstats Options ↩

Bivariate Statistics Options ↩

Common Options ↩

FilesExpand file tree

moarstats.md

Latest commit

History

moarstats.md

File metadata and controls

moarstats

Description ↩

Examples ↩

Usage ↩

Moarstats Options ↩

Bivariate Statistics Options ↩

Common Options ↩