Data validation utilities for the RegressionLab application.
The validators.py module provides functions to validate data integrity, file paths, and parameter values before processing. It ensures data quality and prevents errors during fitting operations.

validate_file_path()
Validate that a file path exists and is accessible.
Parameters:
file_path: Path to the file
Raises:
FileNotFoundError: If file does not exist
ValidationError: If path is not a file
Example:
    from utils.validators import validate_file_path

    try:
        validate_file_path('input/data.csv')
        # File exists and is accessible
    except FileNotFoundError as e:
        print(f"File not found: {e}")

validate_file_type()
Validate that a file type is supported.
Parameters:
file_type: File extension (e.g., 'csv', 'xlsx')
allowed_types: List of allowed file types (default: ['csv', 'xlsx', 'txt'])
Raises:
InvalidFileTypeError: If file type is not supported
Example:
    from utils.validators import validate_file_type

    # Default allowed types
    validate_file_type('csv')  # OK
    validate_file_type('pdf')  # Raises InvalidFileTypeError (not in the default list)

    # Custom allowed types
    validate_file_type('json', allowed_types=['json', 'yaml'])

validate_dataframe()
Validate that a DataFrame is suitable for fitting.
Parameters:
data: DataFrame to validate
min_rows: Minimum number of rows required
Raises:
DataValidationError: If DataFrame is invalid
Example:
    from utils.validators import validate_dataframe
    import pandas as pd

    data = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})

    # Default minimum (2 rows)
    validate_dataframe(data)  # OK

    # Custom minimum
    validate_dataframe(data, min_rows=5)  # Raises DataValidationError (only 3 rows)

validate_column_exists()
Validate that a column exists in a DataFrame.
Parameters:
data: DataFrame to check
column_name: Name of the column
Raises:
DataValidationError: If column does not exist
Example:
    from utils.validators import validate_column_exists
    import pandas as pd

    data = pd.DataFrame({'x': [1, 2], 'y': [2, 4]})
    validate_column_exists(data, 'x')  # OK
    validate_column_exists(data, 'z')  # Raises DataValidationError

validate_numeric_data()
Validate that data in a column is numeric and has no NaN values.
Parameters:
data: Series to validate
column_name: Name of the column (for error messages)
Raises:
DataValidationError: If data is not numeric or contains NaN
Example:
    from utils.validators import validate_numeric_data
    import pandas as pd

    data = pd.Series([1.0, 2.0, 3.0])
    validate_numeric_data(data, 'x')  # OK

    data_with_nan = pd.Series([1.0, None, 3.0])
    validate_numeric_data(data_with_nan, 'x')  # Raises DataValidationError

Checks:
- Data type is numeric
- No NaN values
- No infinite values
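
The three checks listed above can be sketched with pandas and NumPy. This is a minimal illustration rather than the module's actual code: the helper name `check_numeric_series` and the local `DataValidationError` class are stand-ins introduced here, and the error wording is assumed.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


class DataValidationError(ValueError):
    """Local stand-in for the application's exception (assumed, not the real class)."""


def check_numeric_series(data: pd.Series, column_name: str) -> None:
    """Hypothetical helper that mirrors the three documented checks."""
    # 1. Data type is numeric (rejects object/string columns).
    if not is_numeric_dtype(data):
        raise DataValidationError(
            f"Column '{column_name}' must be numeric, found dtype {data.dtype}"
        )
    values = data.to_numpy(dtype=float)
    # 2. No NaN values (None in a float Series is stored as NaN).
    if np.isnan(values).any():
        raise DataValidationError(f"Column '{column_name}' contains NaN values")
    # 3. No infinite values.
    if np.isinf(values).any():
        raise DataValidationError(f"Column '{column_name}' contains infinite values")


check_numeric_series(pd.Series([1.0, 2.0, 3.0]), "x")  # passes silently
```

Checking the dtype first keeps the later NaN and infinity tests from running on string data, where they would fail with a less helpful error.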

validate_uncertainty_column()
Validate that the uncertainty column for a variable exists and is valid.
Parameters:
data: DataFrame containing the data
var_name: Name of the variable (uncertainty column should be 'u{var_name}')
Raises:
DataValidationError: If uncertainty column is missing or invalid
Example:
    from utils.validators import validate_uncertainty_column
    import pandas as pd

    data = pd.DataFrame({
        'x': [1, 2, 3],
        'ux': [0.1, 0.1, 0.1],
        'y': [2, 4, 6],
        'uy': [0.2, 0.2, 0.2]
    })
    validate_uncertainty_column(data, 'x')  # Checks 'ux'
    validate_uncertainty_column(data, 'y')  # Checks 'uy'

Checks:
- Column exists
- Data is numeric
- Values are non-negative
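
As a rough illustration of the 'u{var_name}' naming rule and the three checks above, the sketch below derives the uncertainty column name and tests it. The helper `check_uncertainty_column` and the local exception class are hypothetical names used only for this example; the real function may differ in details and messages.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


class DataValidationError(ValueError):
    """Local stand-in for the application's exception (assumed)."""


def check_uncertainty_column(data: pd.DataFrame, var_name: str) -> None:
    """Hypothetical helper: validate the 'u{var_name}' column of a DataFrame."""
    u_name = f"u{var_name}"  # naming convention taken from the documentation
    # Column exists.
    if u_name not in data.columns:
        raise DataValidationError(
            f"Missing uncertainty column '{u_name}' for variable '{var_name}'"
        )
    # Data is numeric.
    if not is_numeric_dtype(data[u_name]):
        raise DataValidationError(f"Uncertainty column '{u_name}' must be numeric")
    # Values are non-negative.
    if (data[u_name] < 0).any():
        raise DataValidationError(
            f"Uncertainty column '{u_name}' contains negative values"
        )


frame = pd.DataFrame({"x": [1, 2, 3], "ux": [0.1, 0.1, 0.1]})
check_uncertainty_column(frame, "x")  # passes: 'ux' exists, is numeric, and is >= 0
```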

validate_fitting_data()
Comprehensive validation for fitting data.
Validates:
- DataFrame is not empty
- Required columns exist
- Data is numeric
- Uncertainty columns exist and are valid
Parameters:
data: DataFrame with data to fit
x_name: Name of the independent variable column
y_name: Name of the dependent variable column
Raises:
DataValidationError: If any validation fails
Example:
    from utils.validators import validate_fitting_data
    import pandas as pd

    data = pd.DataFrame({
        'x': [1, 2, 3, 4],
        'ux': [0.1] * 4,
        'y': [2, 4, 6, 8],
        'uy': [0.2] * 4
    })
    validate_fitting_data(data, 'x', 'y')
    # Validates everything needed for fitting

validate_parameter_names()
Validate parameter names for custom equations.
Parameters:
param_names: List of parameter names
Raises:
ValidationError: If parameter names are invalid
Example:
    from utils.validators import validate_parameter_names

    # Valid names
    validate_parameter_names(['a', 'b', 'c'])    # OK
    validate_parameter_names(['alpha', 'beta'])  # OK

    # Invalid names
    validate_parameter_names([])          # Raises ValidationError (empty)
    validate_parameter_names(['a', 'a'])  # Raises ValidationError (duplicate)
    validate_parameter_names(['a b'])     # Raises ValidationError (invalid identifier)

Checks:
- List is not empty
- No duplicate names
- All names are valid Python identifiers
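
These three checks need only the standard library. The sketch below is one possible way to express them, not the module's own code; `check_parameter_names` and the local `ValidationError` class are illustrative names.

```python
class ValidationError(ValueError):
    """Local stand-in for the application's exception (assumed)."""


def check_parameter_names(param_names):
    """Hypothetical helper that applies the three documented checks in order."""
    # List is not empty.
    if not param_names:
        raise ValidationError("Parameter list is empty")
    # No duplicate names.
    seen = set()
    for name in param_names:
        if name in seen:
            raise ValidationError(f"Duplicate parameter name: {name!r}")
        seen.add(name)
    # All names are valid Python identifiers.
    for name in param_names:
        if not isinstance(name, str) or not name.isidentifier():
            raise ValidationError(f"Invalid parameter name: {name!r}")


check_parameter_names(["alpha", "beta"])  # passes silently
```

Note that `str.isidentifier()` also accepts reserved words such as `class`. If those should be rejected, `keyword.iskeyword()` from the standard library can be added as a further test. Whether the real validator does this is not stated in the documentation.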

validate_positive_integer()
Validate that a value is a positive integer.
Parameters:
value: Value to validate
name: Name of the parameter (for error messages)
Returns:
The validated integer value
Raises:
ValidationError: If value is not a positive integer
Example:
    from utils.validators import validate_positive_integer

    # Valid
    num = validate_positive_integer(5, 'num_params')     # Returns 5
    num = validate_positive_integer('10', 'num_params')  # Returns 10

    # Invalid
    validate_positive_integer(0, 'num_params')      # Raises ValidationError
    validate_positive_integer(-1, 'num_params')     # Raises ValidationError
    validate_positive_integer('abc', 'num_params')  # Raises ValidationError

A complete validation workflow, from file checks to fitting-ready data:

    from utils.validators import (
        validate_file_path, validate_file_type,
        validate_fitting_data
    )
    import pandas as pd

    # 1. Validate file
    file_path = 'input/data.csv'
    validate_file_path(file_path)
    validate_file_type('csv')

    # 2. Load data
    data = pd.read_csv(file_path)

    # 3. Validate data
    validate_fitting_data(data, 'x', 'y')
    # Data is ready for fitting

Step-by-step validation with the individual validators:

    from utils.validators import (
        validate_dataframe, validate_column_exists,
        validate_numeric_data, validate_uncertainty_column
    )

    def custom_validation(data, x_name, y_name):
        # Step by step validation
        validate_dataframe(data, min_rows=3)
        validate_column_exists(data, x_name)
        validate_column_exists(data, y_name)
        validate_numeric_data(data[x_name], x_name)
        validate_numeric_data(data[y_name], y_name)
        validate_uncertainty_column(data, x_name)
        validate_uncertainty_column(data, y_name)
        print("All validations passed")

Handling a validation failure:

    from utils.validators import validate_fitting_data
    from utils.exceptions import DataValidationError

    try:
        validate_fitting_data(data, 'x', 'y')
    except DataValidationError as e:
        print(f"Data validation failed: {e}")
        # Handle error appropriately

- Validate Early: Validate data as soon as it's loaded.

    data = load_data('file.csv')
    validate_fitting_data(data, 'x', 'y')  # Before any processing

- Specific Validation: Use specific validators for targeted checks.

    # Check specific column
    validate_numeric_data(data['x'], 'x')
    # Check uncertainty
    validate_uncertainty_column(data, 'x')

- Error Messages: Validators provide informative error messages.

    try:
        validate_fitting_data(data, 'x', 'y')
    except DataValidationError as e:
        # Error message includes details about what failed
        print(e)

- Combined Validation: Use the comprehensive validator when possible.

    # Instead of multiple individual checks
    validate_fitting_data(data, 'x', 'y')

    # Instead of:
    # validate_dataframe(data)
    # validate_column_exists(data, 'x')
    # validate_column_exists(data, 'y')
    # ...

Validators are integrated into the data loading workflow:
    from loaders.loading_utils import csv_reader
    from utils.validators import validate_fitting_data

    def load_and_validate(file_path, x_name, y_name):
        # Load data (includes basic validation)
        data = csv_reader(file_path)
        # Comprehensive validation
        validate_fitting_data(data, x_name, y_name)
        return data

The validate_fitting_data function performs validations in this order:
1. DataFrame structure (not None, not empty, minimum rows)
2. Column existence
3. Numeric data type
4. Uncertainty columns
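
The ordering can be pictured as a sequence of guarded checks in which the first failure raises and later checks never run. The sketch below is an assumption-laden illustration, not the real implementation: `check_fitting_data`, the local exception class, the `min_rows=2` default, and all message texts are chosen for this example only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


class DataValidationError(ValueError):
    """Local stand-in for the application's exception (assumed)."""


def check_fitting_data(data, x_name, y_name, min_rows=2):
    """Hypothetical combined check that follows the documented order."""
    # 1. DataFrame structure: not None, not empty, minimum rows.
    if data is None:
        raise DataValidationError("Expected a DataFrame, found None")
    if len(data) < min_rows:  # also covers the empty case when min_rows >= 1
        raise DataValidationError(
            f"Too few rows: found {len(data)}, expected at least {min_rows}"
        )
    # 2. Column existence.
    for name in (x_name, y_name):
        if name not in data.columns:
            raise DataValidationError(
                f"Missing column '{name}'; available: {list(data.columns)}"
            )
    # 3. Numeric data type.
    for name in (x_name, y_name):
        if not is_numeric_dtype(data[name]):
            raise DataValidationError(f"Column '{name}' is not numeric")
    # 4. Uncertainty columns ('u' + variable name).
    for name in (x_name, y_name):
        u_name = f"u{name}"
        if u_name not in data.columns:
            raise DataValidationError(f"Missing uncertainty column '{u_name}'")
        if not is_numeric_dtype(data[u_name]) or (data[u_name] < 0).any():
            raise DataValidationError(
                f"Uncertainty column '{u_name}' must be numeric and non-negative"
            )


frame = pd.DataFrame({"x": [1, 2, 3], "ux": [0.1] * 3, "y": [2, 4, 6], "uy": [0.2] * 3})
check_fitting_data(frame, "x", "y")  # passes all four stages
```

Running the cheap structural checks first means a one-row DataFrame with a missing column is reported as a row-count problem, which is usually the more fundamental issue.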
- Early Exit: Validation stops at first failure
- Efficient Checks: Uses pandas/numpy operations; duplicate and infinite-value checks reuse a single computed mask to avoid repeated array operations
- Minimal Overhead: Fast validation for typical datasets
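
One possible reading of the mask-reuse point, for the infinite-value part only, is shown below: `np.isfinite` is evaluated once over the full array, and NaN and infinite entries are then told apart using only the entries that failed. The function name `summarize_non_finite` is hypothetical, and the module's real strategy may differ.

```python
import numpy as np


def summarize_non_finite(values):
    """Hypothetical helper: count NaN and infinite entries from one finite-mask."""
    arr = np.asarray(values, dtype=float)
    finite_mask = np.isfinite(arr)  # single pass over the full array
    if finite_mask.all():
        return {"nan": 0, "inf": 0}  # common case: nothing else to compute
    bad = arr[~finite_mask]  # only the offending entries
    nan_count = int(np.isnan(bad).sum())
    # Every non-finite entry that is not NaN must be +inf or -inf.
    return {"nan": nan_count, "inf": int(bad.size - nan_count)}


counts = summarize_non_finite([1.0, float("nan"), float("inf"), float("-inf")])
```

For clean data the function returns after a single array operation, which is the typical case during validation.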
All validators provide detailed error messages including:
- What was validated
- What value was found
- What was expected
- Available alternatives (when applicable)
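
A message with these four parts could be assembled as in the sketch below. The helper `format_validation_error` and the exact wording are assumptions for illustration; the application's actual messages are not shown in this document.

```python
def format_validation_error(what, found, expected, alternatives=None):
    """Hypothetical helper that builds a message from the four documented parts."""
    # What was validated, what value was found, and what was expected.
    message = f"{what}: found {found!r}, expected {expected}"
    # Available alternatives, when applicable.
    if alternatives:
        message += f". Available: {', '.join(str(a) for a in alternatives)}"
    return message


text = format_validation_error(
    "Column name", "z", "an existing column", alternatives=["x", "y"]
)
# text == "Column name: found 'z', expected an existing column. Available: x, y"
```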
For more information about validation, see the Troubleshooting Guide.