High-level data loading interface for RegressionLab.
The data_loader module provides functions for loading experimental data from various file formats into pandas DataFrames. It handles file type detection, encoding issues, and data validation.
Primary function for loading data files.
Loads data from CSV, text, or Excel files based on the specified file type.
Parameters:
- file_path: Complete path to the file
- file_type: File type ('csv', 'xlsx', 'txt')
Returns:
- DataFrame with loaded data
Raises:
- InvalidFileTypeError: If file type is not supported
- DataLoadError: If file cannot be loaded
Example:
from loaders.data_loader import load_data
# Load CSV file
data = load_data('input/experiment1.csv', 'csv')
# Load Excel file
data = load_data('input/experiment2.xlsx', 'xlsx')
print(data.head())
print(f"Columns: {data.columns.tolist()}")

Extract variable names from DataFrame.
When filter_uncertainty is False, returns all column names (e.g., 'x', 'ux', 'y', 'uy'). When True, excludes uncertainty columns (e.g., 'ux', 'uy') so only base variables like 'x', 'y' are returned. Uncertainty columns are assumed to be named 'u' followed by the variable name (e.g., 'ux' for 'x').
Parameters:
- data: DataFrame with the data
- filter_uncertainty: If True, exclude uncertainty columns from the result
Returns:
- List of column names as strings
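The filtering rule described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper name filter_uncertainty_columns is hypothetical, and it assumes a column counts as an uncertainty only when it is 'u' plus the exact name of another column that is present.

```python
def filter_uncertainty_columns(columns):
    """Return the columns that are not uncertainties of another column.

    Assumption (not confirmed by the module): a column is treated as an
    uncertainty only if it equals 'u' + the name of another present column.
    """
    names = [str(c) for c in columns]
    present = set(names)
    return [
        name for name in names
        if not (name.startswith("u") and name[1:] in present)
    ]


# 'unit' starts with 'u' but 'nit' is not a column, so it is kept.
base_vars = filter_uncertainty_columns(["x", "ux", "y", "uy", "unit"])
print(base_vars)
```

Matching against the existing column names, rather than dropping everything that starts with 'u', avoids discarding legitimate variables whose names happen to begin with that letter.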
Example:
from loaders.data_loader import get_variable_names
# All columns (default)
all_vars = get_variable_names(data, filter_uncertainty=False)
print(f"All columns: {all_vars}") # ['x', 'ux', 'y', 'uy']
# Only data columns (no uncertainties)
data_vars = get_variable_names(data, filter_uncertainty=True)
print(f"Data columns: {data_vars}") # ['x', 'y']

File type dispatch is done via a module-level reader registry (FILE_TYPE_READERS): each key is a file type ('csv', 'xlsx', 'txt') and the value is the corresponding reader from loading_utils. To add a new format, implement the reader and register it in FILE_TYPE_READERS.
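A registry of this kind might look like the sketch below. The reader bodies are stand-ins that return strings instead of DataFrames, the mapping of 'txt' to the CSV reader is an assumption, json_reader and load are hypothetical names, and ValueError stands in for the module's InvalidFileTypeError.

```python
from typing import Callable, Dict


# Stand-ins for the readers in loading_utils; the real readers would
# return a pandas DataFrame rather than a string.
def csv_reader(path: str) -> str:
    return f"csv:{path}"


def excel_reader(path: str) -> str:
    return f"xlsx:{path}"


FILE_TYPE_READERS: Dict[str, Callable[[str], str]] = {
    "csv": csv_reader,
    "txt": csv_reader,  # assumption: text files reuse the CSV reader
    "xlsx": excel_reader,
}


def load(path: str, file_type: str) -> str:
    """Look up the reader for file_type and call it on path."""
    try:
        reader = FILE_TYPE_READERS[file_type]
    except KeyError:
        # ValueError is a placeholder for InvalidFileTypeError.
        raise ValueError(f"Unsupported file type: {file_type!r}") from None
    return reader(path)


# Adding a format: implement the reader, then register it.
def json_reader(path: str) -> str:  # hypothetical new reader
    return f"json:{path}"


FILE_TYPE_READERS["json"] = json_reader
print(load("input/experiment.json", "json"))
```

With this layout the dispatch function never changes when a format is added; only the registry grows.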
Supported delimiters:
- Comma (,)
- Semicolon (;)
- Tab (\t)
Encoding: Auto-detected (UTF-8, Latin-1, etc.)
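How the loader detects encoding and delimiter is not specified here. One common approach, shown as a sketch under that caveat, is to try a short list of encodings in order and then let the standard library's csv.Sniffer choose among the supported delimiters. The function name detect_encoding_and_delimiter and the candidate encoding list are assumptions.

```python
import csv


def detect_encoding_and_delimiter(raw: bytes,
                                  encodings=("utf-8", "latin-1"),
                                  delimiters=",;\t"):
    """Guess the text encoding and field delimiter of raw CSV bytes."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("Could not decode with any candidate encoding")
    # Restrict the sniffer to the delimiters the loader supports.
    dialect = csv.Sniffer().sniff(text, delimiters=delimiters)
    return encoding, dialect.delimiter


sample = "time;temperature;utime\n0;20.0;0.1\n1;25.3;0.1\n2;30.1;0.1\n"
encoding, delimiter = detect_encoding_and_delimiter(sample.encode("utf-8"))
print(encoding, repr(delimiter))
```

Latin-1 accepts any byte sequence, so it works as a last-resort fallback but can silently mis-decode files written in other encodings.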
Example CSV:
time,temperature,utime,utemperature
0,20.0,0.1,0.5
1,25.3,0.1,0.5
2,30.1,0.1,0.5
3,35.4,0.1,0.5

Supported format:
- .xlsx (Excel 2007+): use file_type='xlsx'
Requirements:
- Data in first sheet
- Column headers in first row
- No merged cells in data area
Example Excel structure:
| x | y | ux | uy |
|-------|-------|------|------|
| 1.0 | 2.5 | 0.1 | 0.2 |
| 2.0 | 5.1 | 0.1 | 0.2 |
| 3.0 | 7.4 | 0.1 | 0.2 |
- Variable columns: Any valid name (e.g., time, voltage, concentration)
- Uncertainty columns: Prefix with u (e.g., utime, uvoltage)
- All data values must be numeric
- NaN values will cause fitting to fail
- Infinite values not allowed
- At least 2 columns (X and Y)
- At least 5 data points (more recommended)
- No duplicate column names
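The requirements above can be checked before fitting with a small helper such as the one below. This is a sketch only: find_data_problems is a hypothetical name, not part of data_loader, and the messages are illustrative.

```python
import numpy as np
import pandas as pd


def find_data_problems(data: pd.DataFrame, min_points: int = 5) -> list:
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if data.shape[1] < 2:
        problems.append("fewer than 2 columns")
    if len(data) < min_points:
        problems.append(f"fewer than {min_points} data points")
    if data.columns.duplicated().any():
        problems.append("duplicate column names")
    non_numeric = [
        str(name) for name, dtype in zip(data.columns, data.dtypes)
        if not pd.api.types.is_numeric_dtype(dtype)
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    # NaN and infinity checks only make sense on the numeric columns.
    values = data.select_dtypes(include="number").to_numpy(dtype=float)
    if np.isnan(values).any():
        problems.append("NaN values present")
    if np.isinf(values).any():
        problems.append("infinite values present")
    return problems


good = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "y": [2.5, 5.1, 7.4, 9.9, 12.6]})
print(find_data_problems(good))
```

Returning every problem at once, instead of raising on the first, lets the caller report all issues in a single pass.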
FileNotFoundError:
try:
    data = load_data('nonexistent.csv', 'csv')
except FileNotFoundError:
    print("File not found!")

UnicodeDecodeError:
try:
    data = load_data('bad_encoding.csv', 'csv')
except UnicodeDecodeError:
    print("Encoding issue - try saving as UTF-8")

InvalidFileTypeError:
from utils.exceptions import InvalidFileTypeError
try:
    data = load_data('corrupt.xlsx', 'xlsx')
except InvalidFileTypeError as e:
    print(f"Invalid file type: {e}")
except Exception as e:
    # Any other failure, e.g. a file that cannot be parsed
    print(f"Failed to load: {e}")

For CSV files with unusual delimiters, call csv_reader directly with a custom delimiter:
from loaders.loading_utils import csv_reader
# Custom delimiter
data = csv_reader('data.txt', delimiter='|')

To read from a specific sheet, use excel_reader directly:
from loaders.loading_utils import excel_reader
# Read from second sheet
data = excel_reader('data.xlsx', sheet_name='Sheet2')

Note: The load_data function reads from the first sheet by default.
# Load data
data = load_data('experiment.csv', 'csv')
# Check for missing values
if data.isnull().any().any():
    print("Warning: Missing values detected")
    # Option 1: drop rows with NaN
    data = data.dropna()
    # Option 2 (instead of dropping): fill gaps by linear interpolation
    # data = data.interpolate(method='linear')

Typical workflow:
from loaders.data_loader import load_data, get_variable_names
from fitting.fitting_functions import fit_linear_function_with_n
# 1. Load data (use open_load_dialog or provide file path)
data = load_data('input/experiment.csv', 'csv')
# 2. Get available variables
variables = get_variable_names(data, filter_uncertainty=True)
print(f"Available variables: {variables}")
# 3. Select variables (e.g., from UI or manually)
x_name = 'time'
y_name = 'temperature'
# 4. Convert DataFrame to dict format for fitting
data_dict = {col: data[col].values for col in data.columns}
# 5. Perform fitting
text, y_fitted, equation, *_ = fit_linear_function_with_n(
    data_dict, x_name, y_name
)
print(f"Fitting complete:\n{text}")  # R² is included in the text output

- CSV: Fast for files < 100 MB
- Excel: Slower for large files (use CSV if possible)
- Use CSV for large datasets: Faster than Excel
- Clean data before loading: Remove unnecessary columns
- Use appropriate dtypes: Specify numeric types explicitly
- Cache loaded data: Don't reload unnecessarily
# Check DataFrame memory usage
data_memory = data.memory_usage(deep=True).sum()
print(f"Data uses {data_memory / 1024**2:.2f} MB")
# Optimize memory if needed
data = data.astype('float32')  # Use 32-bit instead of 64-bit

- Check file exists: Verify path is correct
- Check permissions: Ensure read access
- Try opening in Excel/text editor: Verify file isn't corrupt
- Check encoding: Try UTF-8 if special characters present
- Check delimiter: CSV may use semicolon instead of comma
- Check headers: Ensure first row contains column names
- Check sheet: Excel file may have data in different sheet
- Check naming: Must be exactly u + variable name
- Check case: Lowercase u required
- Check spelling: No extra characters or spaces
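The file-level checks in the troubleshooting lists above (existence, read permission, encoding) can be automated before calling load_data. The helper below is a sketch using only the standard library. preflight_check is a hypothetical name, and the extension comparison is an added assumption about how paths and file types relate.

```python
import codecs
import os


def preflight_check(file_path: str, file_type: str) -> list:
    """Return a list of likely problems with a data file before loading."""
    if not os.path.isfile(file_path):
        return ["file does not exist"]
    issues = []
    if not os.access(file_path, os.R_OK):
        issues.append("no read permission")
    extension = os.path.splitext(file_path)[1].lower().lstrip(".")
    if extension != file_type:
        issues.append(f"extension '.{extension}' does not match file_type '{file_type}'")
    if file_type in ("csv", "txt"):
        with open(file_path, "rb") as handle:
            head = handle.read(4096)
        # An incremental decoder tolerates a multi-byte character that is
        # cut off at the end of the 4096-byte sample.
        decoder = codecs.getincrementaldecoder("utf-8")()
        try:
            decoder.decode(head)
        except UnicodeDecodeError:
            issues.append("not valid UTF-8; try re-saving the file as UTF-8")
    return issues


print(preflight_check("nonexistent.csv", "csv"))
```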
See also: loading_utils for low-level file readers.