loaders.data_loader

High-level data loading interface for RegressionLab.

Overview

The data_loader module provides functions for loading experimental data from various file formats into pandas DataFrames. It handles file type detection, encoding issues, and data validation.

Key Functions

Data Loading

`load_data(file_path: str, file_type: str) -> pd.DataFrame`

Primary function for loading data files.

Loads data from CSV, plain-text, or Excel files based on the specified file type.

Parameters:

file_path: Complete path to the file
file_type: File type ('csv', 'xlsx', 'txt')

Returns:

DataFrame with loaded data

Raises:

InvalidFileTypeError: If file type is not supported
DataLoadError: If file cannot be loaded

Example:

from loaders.data_loader import load_data

# Load CSV file
data = load_data('input/experiment1.csv', 'csv')

# Load Excel file
data = load_data('input/experiment2.xlsx', 'xlsx')

print(data.head())
print(f"Columns: {data.columns.tolist()}")
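Because load_data takes the file type as an explicit argument, callers often derive it from the file extension first. The following is a minimal sketch of such a helper, not part of the module: the name file_type_from_path and the SUPPORTED_FILE_TYPES tuple are hypothetical, and the list of types is taken from the parameter description above.

```python
from pathlib import Path

# Assumption: mirrors the file types listed for load_data ('csv', 'xlsx', 'txt').
SUPPORTED_FILE_TYPES = ("csv", "xlsx", "txt")


def file_type_from_path(file_path: str) -> str:
    """Return the lowercase extension of file_path if it is a supported type.

    Hypothetical helper: raises ValueError for any other extension.
    """
    suffix = Path(file_path).suffix.lower().lstrip(".")
    if suffix not in SUPPORTED_FILE_TYPES:
        raise ValueError(
            f"Unsupported extension {suffix!r}; expected one of {SUPPORTED_FILE_TYPES}"
        )
    return suffix
```

The result can then be passed as the second argument, for example load_data(path, file_type_from_path(path)).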

Variable Extraction

`get_variable_names(data: pd.DataFrame, filter_uncertainty: bool = False) -> List[str]`

Extract variable names from DataFrame.

When filter_uncertainty is False, returns all column names (e.g., 'x', 'ux', 'y', 'uy'). When True, excludes uncertainty columns (e.g., 'ux', 'uy') so only base variables like 'x', 'y' are returned. Uncertainty columns are assumed to be named 'u' followed by the variable name (e.g., 'ux' for 'x').

Parameters:

data: DataFrame with the data
filter_uncertainty: If True, exclude uncertainty columns from the result

Returns:

List of column names as strings

Example:

from loaders.data_loader import get_variable_names

# All columns (default)
all_vars = get_variable_names(data, filter_uncertainty=False)
print(f"All columns: {all_vars}")  # ['x', 'ux', 'y', 'uy']

# Only data columns (no uncertainties)
data_vars = get_variable_names(data, filter_uncertainty=True)
print(f"Data columns: {data_vars}")  # ['x', 'y']
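The naming rule can be illustrated on a plain list of column names. This is a sketch of one plausible filtering rule, not the module's actual implementation: the function name filter_uncertainty_columns is hypothetical, and it assumes a column counts as an uncertainty only when the part after the leading 'u' is itself a column name.

```python
from typing import Iterable, List


def filter_uncertainty_columns(columns: Iterable[str]) -> List[str]:
    """Drop names of the form 'u' + <another column name>, keeping order.

    Assumed rule: 'ux' is dropped only if 'x' is also a column, so a
    variable such as 'unit' is kept when there is no column named 'nit'.
    """
    names = list(columns)
    known = set(names)
    return [
        name
        for name in names
        if not (name.startswith("u") and name[1:] in known)
    ]
```

Under this rule a variable whose name merely starts with 'u' is not mistaken for an uncertainty column, which a simple prefix check would get wrong.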

Supported File Formats

File type dispatch is done via a module-level reader registry (FILE_TYPE_READERS): each key is a file type ('csv', 'xlsx', 'txt') and the value is the corresponding reader from loading_utils. To add a new format, implement the reader and register it in FILE_TYPE_READERS.
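The registry pattern described here can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the readers operate on in-memory text and return lists of row dictionaries instead of DataFrames so the example stays self-contained, InvalidFileTypeError is a local stand-in for the project's exception, and the 'ssv' format and the load and read_delimited names are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List

Row = Dict[str, str]


class InvalidFileTypeError(ValueError):
    """Local stand-in for the project's exception of the same name."""


def read_delimited(text: str, delimiter: str) -> List[Row]:
    """Parse delimited text whose first line holds the column headers."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Registry: file type -> reader callable (hypothetical readers).
FILE_TYPE_READERS: Dict[str, Callable[[str], List[Row]]] = {
    "csv": lambda text: read_delimited(text, ","),
    "txt": lambda text: read_delimited(text, "\t"),
}


def load(text: str, file_type: str) -> List[Row]:
    """Look up the reader for file_type and apply it to text."""
    try:
        reader = FILE_TYPE_READERS[file_type]
    except KeyError:
        supported = ", ".join(sorted(FILE_TYPE_READERS))
        raise InvalidFileTypeError(
            f"Unsupported file type {file_type!r}; expected one of: {supported}"
        ) from None
    return reader(text)


# Adding a format means registering one more reader; load() is unchanged.
FILE_TYPE_READERS["ssv"] = lambda text: read_delimited(text, ";")
```

Keeping dispatch in a dictionary means the set of supported types and the error message for unsupported ones both come from a single place.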

CSV Files

Supported delimiters:

Comma (,)
Semicolon (;)
Tab (\t)

Encoding: Auto-detected (UTF-8, Latin-1, etc.)

Example CSV:

time,temperature,utime,utemperature
0,20.0,0.1,0.5
1,25.3,0.1,0.5
2,30.1,0.1,0.5
3,35.4,0.1,0.5
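How delimiter and encoding detection might work can be sketched with the standard library alone. The heuristics below are assumptions for illustration and may differ from what the loader does: decode_with_fallback tries UTF-8 first and falls back to Latin-1, and guess_delimiter picks whichever supported delimiter occurs most often in the header line. Both function names are hypothetical.

```python
from typing import Tuple

# Assumption: the three delimiters listed above, in tie-breaking order.
CANDIDATE_DELIMITERS = (",", ";", "\t")


def decode_with_fallback(raw: bytes) -> Tuple[str, str]:
    """Decode raw bytes as UTF-8, falling back to Latin-1.

    Returns the decoded text and the encoding that was used. Latin-1 maps
    every byte to a character, so the fallback itself cannot fail.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"


def guess_delimiter(header_line: str) -> str:
    """Return the candidate delimiter that occurs most often in the header.

    Ties (including a header with no delimiter at all) resolve to the
    first candidate, the comma.
    """
    return max(CANDIDATE_DELIMITERS, key=header_line.count)
```

A Latin-1 fallback never raises, but it can silently produce wrong characters for files in other encodings, so saving files as UTF-8 remains the safer choice.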

Excel Files

Supported format:

.xlsx (Excel 2007+) - use file_type='xlsx'

Requirements:

Data in first sheet
Column headers in first row
No merged cells in data area

Example Excel structure:

| x     | y     | ux   | uy   |
|-------|-------|------|------|
| 1.0   | 2.5   | 0.1  | 0.2  |
| 2.0   | 5.1   | 0.1  | 0.2  |
| 3.0   | 7.4   | 0.1  | 0.2  |

Data Format Requirements

Column Naming

Variable columns: Any valid name (e.g., time, voltage, concentration)
Uncertainty columns: Prefix with u (e.g., utime, uvoltage)

Data Types

All data values must be numeric
NaN values will cause fitting to fail
Infinite values not allowed

Minimum Requirements

At least 2 columns (X and Y)
At least 5 data points (more recommended)
No duplicate column names
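The requirements in this section can be checked before fitting. The sketch below is one possible pre-flight check written against pandas and NumPy directly. It is not a function provided by the module, and the name find_format_problems and the wording of the messages are assumptions.

```python
from typing import List

import numpy as np
import pandas as pd

MIN_COLUMNS = 2  # X and Y
MIN_ROWS = 5     # minimum number of data points


def find_format_problems(data: pd.DataFrame) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems: List[str] = []
    if data.shape[1] < MIN_COLUMNS:
        problems.append("fewer than 2 columns")
    if data.shape[0] < MIN_ROWS:
        problems.append("fewer than 5 data points")
    if data.columns.duplicated().any():
        problems.append("duplicate column names")

    # Inspect dtypes rather than data[col], which is ambiguous for duplicates.
    non_numeric = [
        str(name)
        for name, dtype in data.dtypes.items()
        if not pd.api.types.is_numeric_dtype(dtype)
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    # NaN and infinity checks only make sense on the numeric columns.
    values = data.select_dtypes(include="number").to_numpy(dtype=float)
    if np.isnan(values).any():
        problems.append("NaN values present")
    if np.isinf(values).any():
        problems.append("infinite values present")
    return problems
```

Returning all problems at once, instead of raising on the first, lets a user fix the input file in a single pass.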

Error Handling

Common Errors

FileNotFoundError:

try:
    data = load_data('nonexistent.csv', 'csv')
except FileNotFoundError:
    print("File not found!")

UnicodeDecodeError:

try:
    data = load_data('bad_encoding.csv', 'csv')
except UnicodeDecodeError:
    print("Encoding issue - try saving as UTF-8")

InvalidFileTypeError:

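from loaders.data_loader import load_data
from utils.exceptions import InvalidFileTypeError

try:
    # 'json' is not one of the supported file types ('csv', 'xlsx', 'txt')
    data = load_data('data.json', 'json')
except InvalidFileTypeError as e:
    print(f"Invalid file type: {e}")
except Exception as e:
    # Any other failure, such as a corrupt or unreadable file
    print(f"Failed to load: {e}")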

Advanced Usage

Custom Delimiter CSV

For CSV files with unusual delimiters, call csv_reader directly and pass the delimiter:

from loaders.loading_utils import csv_reader

# Custom delimiter
data = csv_reader('data.txt', delimiter='|')

Specific Excel Sheet

To read from a specific sheet, you need to use excel_reader directly:

from loaders.loading_utils import excel_reader

# Read from second sheet
data = excel_reader('data.xlsx', sheet_name='Sheet2')

Note: The load_data function reads from the first sheet by default.

Handling Missing Data

from loaders.data_loader import load_data

# Load data
data = load_data('experiment.csv', 'csv')

# Check for missing values
if data.isnull().any().any():
    print("Warning: Missing values detected")

    # Option 1: drop rows that contain NaN
    data = data.dropna()

    # Option 2 (use instead of option 1): fill gaps by linear interpolation
    # data = data.interpolate(method='linear')

Integration with Fitting

Typical workflow:

from loaders.data_loader import load_data, get_variable_names
from fitting.fitting_functions import fit_linear_function_with_n

# 1. Load data (use open_load_dialog or provide file path)
data = load_data('input/experiment.csv', 'csv')

# 2. Get available variables
variables = get_variable_names(data, filter_uncertainty=True)
print(f"Available variables: {variables}")

# 3. Select variables (e.g., from UI or manually)
x_name = 'time'
y_name = 'temperature'

# 4. Convert DataFrame to dict format for fitting
data_dict = {col: data[col].values for col in data.columns}

# 5. Perform fitting
text, y_fitted, equation, *_ = fit_linear_function_with_n(
    data_dict, x_name, y_name
)

print(f"Fitting complete:\n{text}")  # R² is included in the text output

Performance Considerations

File Size

CSV: Fast for files < 100 MB
Excel: Slower for large files (use CSV if possible)

Optimization Tips

Use CSV for large datasets: Faster than Excel
Clean data before loading: Remove unnecessary columns
Use appropriate dtypes: Specify numeric types explicitly
Cache loaded data: Don't reload unnecessarily

Memory Usage

# Check DataFrame memory usage
data_memory = data.memory_usage(deep=True).sum()
print(f"Data uses {data_memory / 1024**2:.2f} MB")

# Optimize memory if needed
data = data.astype('float32')  # Halves memory versus 64-bit floats, at reduced precision

Troubleshooting

Data Won't Load

Check file exists: Verify path is correct
Check permissions: Ensure read access
Try opening in Excel/text editor: Verify file isn't corrupt
Check encoding: Try UTF-8 if special characters present

Wrong Data Loaded

Check delimiter: CSV may use semicolon instead of comma
Check headers: Ensure first row contains column names
Check sheet: Excel file may have data in different sheet

Uncertainty Columns Not Detected

Check naming: Must be exactly u + variable name
Check case: Lowercase u required
Check spelling: No extra characters or spaces
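These three checks can be automated on the column names. The sketch below flags columns that look like uncertainty columns but break the naming rule through a capital U, stray spaces, or underscores. It is a hypothetical diagnostic, not a function of the module, and the name suspicious_uncertainty_columns is an assumption.

```python
from typing import List, Sequence, Tuple


def suspicious_uncertainty_columns(columns: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (found_name, expected_name) pairs for near-miss uncertainty columns."""
    known = set(columns)
    flagged: List[Tuple[str, str]] = []
    for col in columns:
        # Normalise away the usual culprits: spaces and underscores.
        cleaned = col.strip().replace(" ", "").replace("_", "")
        if cleaned[:1].lower() != "u":
            continue
        # A name that already follows the rule exactly needs no warning.
        if col.startswith("u") and col[1:] in known:
            continue
        base = cleaned[1:].lower()
        for other in columns:
            if other != col and other.strip().lower() == base:
                flagged.append((col, "u" + other))
                break
    return flagged
```

Each flagged pair gives the column name as found in the file and the name it would need in order to be recognised.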

See also: loading_utils for low-level file readers.
