loaders.data_loader

High-level data loading interface for RegressionLab.

Overview

The data_loader module provides functions for loading experimental data from various file formats into pandas DataFrames. It handles file type detection, encoding issues, and data validation.

Key Functions

Data Loading

load_data(file_path: str, file_type: str) -> pd.DataFrame

Primary function for loading data files.

Loads data from a CSV, text, or Excel file based on the specified file type.

Parameters:

  • file_path: Complete path to the file
  • file_type: File type ('csv', 'xlsx', 'txt')

Returns:

  • DataFrame with loaded data

Raises:

  • InvalidFileTypeError: If file type is not supported
  • DataLoadError: If file cannot be loaded

Example:

from loaders.data_loader import load_data

# Load CSV file
data = load_data('input/experiment1.csv', 'csv')

# Load Excel file
data = load_data('input/experiment2.xlsx', 'xlsx')

print(data.head())
print(f"Columns: {data.columns.tolist()}")

Variable Extraction

get_variable_names(data: pd.DataFrame, filter_uncertainty: bool = False) -> List[str]

Extract variable names from DataFrame.

When filter_uncertainty is False, returns all column names (e.g., 'x', 'ux', 'y', 'uy'). When True, excludes uncertainty columns (e.g., 'ux', 'uy') so only base variables like 'x', 'y' are returned. Uncertainty columns are assumed to be named u followed by the variable name (e.g., 'ux' for 'x').

Parameters:

  • data: DataFrame with the data
  • filter_uncertainty: If True, exclude uncertainty columns from the result

Returns:

  • List of column names as strings

Example:

from loaders.data_loader import get_variable_names

# All columns (default)
all_vars = get_variable_names(data, filter_uncertainty=False)
print(f"All columns: {all_vars}")  # ['x', 'ux', 'y', 'uy']

# Only data columns (no uncertainties)
data_vars = get_variable_names(data, filter_uncertainty=True)
print(f"Data columns: {data_vars}")  # ['x', 'y']
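The filtering rule can be illustrated on a plain list of column names. This is a minimal sketch, not the module's actual implementation: filter_uncertainty_columns is a hypothetical helper, and it assumes a column counts as an uncertainty only when the text after the leading u matches another column name.

```python
from typing import List


def filter_uncertainty_columns(columns: List[str]) -> List[str]:
    """Return the columns that are not 'u' + the name of another column.

    Hypothetical helper: the real get_variable_names may apply a different rule.
    """
    names = set(columns)
    return [
        col
        for col in columns
        # 'ux' is dropped only if 'x' is also a column, so an ordinary
        # variable such as 'unit' or 'u' is kept.
        if not (col.startswith("u") and col[1:] in names)
    ]


print(filter_uncertainty_columns(["x", "ux", "y", "uy"]))  # ['x', 'y']
print(filter_uncertainty_columns(["unit", "voltage", "uvoltage"]))  # ['unit', 'voltage']
```

Requiring a matching base column avoids discarding variables whose names merely start with u.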

Supported File Formats

File type dispatch is done via a module-level reader registry (FILE_TYPE_READERS): each key is a file type ('csv', 'xlsx', 'txt') and the value is the corresponding reader from loading_utils. To add a new format, implement the reader and register it in FILE_TYPE_READERS.
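The registry pattern described above might look roughly like the following. This is a standalone sketch under stated assumptions: READERS, read_delimited, load, and the 'psv' type are hypothetical names, the exception class is a local stand-in for the project's InvalidFileTypeError, and the toy readers return lists of strings rather than DataFrames.

```python
import csv
from typing import Callable, Dict, List


class InvalidFileTypeError(Exception):
    """Local stand-in for the project's exception of the same name."""


def read_delimited(path: str, delimiter: str) -> List[List[str]]:
    """Toy reader: return the rows of a delimited text file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


# Hypothetical registry: file type -> reader callable.
READERS: Dict[str, Callable[[str], List[List[str]]]] = {
    "csv": lambda path: read_delimited(path, ","),
    "txt": lambda path: read_delimited(path, "\t"),
}


def load(path: str, file_type: str) -> List[List[str]]:
    """Dispatch to the registered reader, or fail with a clear error."""
    try:
        reader = READERS[file_type]
    except KeyError:
        supported = ", ".join(sorted(READERS))
        raise InvalidFileTypeError(
            f"Unsupported file type {file_type!r}; expected one of: {supported}"
        ) from None
    return reader(path)


# Registering a new format is a single assignment.
READERS["psv"] = lambda path: read_delimited(path, "|")
```

Keeping dispatch in one dictionary means the loading function itself does not change when a format is added.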

CSV Files

Supported delimiters:

  • Comma (,)
  • Semicolon (;)
  • Tab (\t)

Encoding: Auto-detected (UTF-8, Latin-1, etc.)

Example CSV:

time,temperature,utime,utemperature
0,20.0,0.1,0.5
1,25.3,0.1,0.5
2,30.1,0.1,0.5
3,35.4,0.1,0.5
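How delimiter and encoding detection could work is sketched below using only the standard library. The function names are hypothetical and the heuristics are deliberately simple (try UTF-8 then Latin-1; pick the candidate delimiter that appears most often in the header line); the actual reader in loading_utils may behave differently.

```python
import csv
import io
from typing import List, Sequence


def decode_with_fallback(raw: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode data with any of: {', '.join(encodings)}")


def guess_delimiter(text: str, candidates: Sequence[str] = (",", ";", "\t")) -> str:
    """Pick the candidate that occurs most often in the header line."""
    lines = text.splitlines()
    header = lines[0] if lines else ""
    return max(candidates, key=header.count)


def parse_csv_bytes(raw: bytes) -> List[List[str]]:
    """Decode raw bytes, guess the delimiter, and return rows of strings."""
    text = decode_with_fallback(raw)
    delimiter = guess_delimiter(text)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# A semicolon-separated file saved as Latin-1 (not valid UTF-8).
raw = "durée;température\n0;20.0\n1;25.3\n".encode("latin-1")
print(parse_csv_bytes(raw))
# [['durée', 'température'], ['0', '20.0'], ['1', '25.3']]
```

Because Latin-1 can decode any byte sequence, it only makes sense as the last fallback; a wrongly guessed encoding shows up as garbled characters rather than as an error.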

Excel Files

Supported format:

  • .xlsx (Excel 2007+) - use file_type='xlsx'

Requirements:

  • Data in first sheet
  • Column headers in first row
  • No merged cells in data area

Example Excel structure:

| x     | y     | ux   | uy   |
|-------|-------|------|------|
| 1.0   | 2.5   | 0.1  | 0.2  |
| 2.0   | 5.1   | 0.1  | 0.2  |
| 3.0   | 7.4   | 0.1  | 0.2  |

Data Format Requirements

Column Naming

  • Variable columns: Any valid name (e.g., time, voltage, concentration)
  • Uncertainty columns: Prefix with u (e.g., utime, uvoltage)

Data Types

  • All data values must be numeric
  • NaN values will cause fitting to fail
  • Infinite values not allowed

Minimum Requirements

  • At least 2 columns (X and Y)
  • At least 5 data points (more recommended)
  • No duplicate column names
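The requirements above can be checked before fitting. The sketch below is an assumption-laden illustration, not part of the module: check_requirements is a hypothetical helper that works on plain Python lists (column names plus rows of values) and reports every violation it finds instead of stopping at the first.

```python
import math
from collections import Counter
from typing import List, Sequence


def check_requirements(
    columns: Sequence[str],
    rows: Sequence[Sequence[object]],
    min_points: int = 5,
) -> List[str]:
    """Return human-readable problems; an empty list means the data passes."""
    problems = []
    duplicates = [name for name, count in Counter(columns).items() if count > 1]
    if duplicates:
        problems.append(f"Duplicate column names: {duplicates}")
    if len(columns) < 2:
        problems.append("Need at least 2 columns (X and Y)")
    if len(rows) < min_points:
        problems.append(f"Need at least {min_points} data points, got {len(rows)}")
    for row_index, row in enumerate(rows):
        for name, value in zip(columns, row):
            # bool is a subclass of int, so exclude it explicitly.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number:
                problems.append(f"Row {row_index}, column {name!r}: non-numeric value {value!r}")
            elif not math.isfinite(value):
                problems.append(f"Row {row_index}, column {name!r}: NaN or infinite value")
    return problems


columns = ["time", "temperature"]
rows = [[0, 20.0], [1, 25.3], [2, float("nan")]]
for problem in check_requirements(columns, rows):
    print(problem)
```

With a DataFrame, the same inputs would be list(data.columns) and data.values.tolist().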

Error Handling

Common Errors

FileNotFoundError:

try:
    data = load_data('nonexistent.csv', 'csv')
except FileNotFoundError:
    print("File not found!")

UnicodeDecodeError:

try:
    data = load_data('bad_encoding.csv', 'csv')
except UnicodeDecodeError:
    print("Encoding issue - try saving as UTF-8")

InvalidFileTypeError:

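from utils.exceptions import InvalidFileTypeError

try:
    # 'json' is not a supported file type, so InvalidFileTypeError is raised
    data = load_data('input/experiment.json', 'json')
except InvalidFileTypeError as e:
    print(f"Invalid file type: {e}")
except Exception as e:
    # Any other loading failure (e.g., a corrupt file)
    print(f"Failed to load: {e}")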

Advanced Usage

Custom Delimiter CSV

For CSV files with unusual delimiters, call csv_reader directly instead of load_data:

from loaders.loading_utils import csv_reader

# Custom delimiter
data = csv_reader('data.txt', delimiter='|')

Specific Excel Sheet

To read from a specific sheet, call excel_reader directly:

from loaders.loading_utils import excel_reader

# Read from second sheet
data = excel_reader('data.xlsx', sheet_name='Sheet2')

Note: The load_data function reads from the first sheet by default.

Handling Missing Data

# Load data
data = load_data('experiment.csv', 'csv')

# Check for missing values
if data.isnull().any().any():
    print("Warning: Missing values detected")

    # Option 1: drop rows that contain NaN
    data = data.dropna()

    # Option 2 (use instead of dropping): fill gaps by linear interpolation
    # data = data.interpolate(method='linear')

Integration with Fitting

Typical workflow:

from loaders.data_loader import load_data, get_variable_names
from fitting.fitting_functions import fit_linear_function_with_n

# 1. Load data (use open_load_dialog or provide file path)
data = load_data('input/experiment.csv', 'csv')

# 2. Get available variables
variables = get_variable_names(data, filter_uncertainty=True)
print(f"Available variables: {variables}")

# 3. Select variables (e.g., from UI or manually)
x_name = 'time'
y_name = 'temperature'

# 4. Convert DataFrame to dict format for fitting
data_dict = {col: data[col].values for col in data.columns}

# 5. Perform fitting
text, y_fitted, equation, *_ = fit_linear_function_with_n(
    data_dict, x_name, y_name
)

print(f"Fitting complete:\n{text}")  # R² is included in the text output

Performance Considerations

File Size

  • CSV: Fast for files < 100 MB
  • Excel: Slower for large files (use CSV if possible)

Optimization Tips

  1. Use CSV for large datasets: Faster than Excel
  2. Clean data before loading: Remove unnecessary columns
  3. Use appropriate dtypes: Specify numeric types explicitly
  4. Cache loaded data: Don't reload unnecessarily
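Tip 4 could be implemented with functools.lru_cache. The sketch below is one possible approach, not something the module provides: make_cached_loader is a hypothetical wrapper around any loader with a (path, file_type) signature, and it includes the file's modification time in the cache key so an edited file is read again.

```python
import os
from functools import lru_cache
from typing import Any, Callable


def make_cached_loader(loader: Callable[[str, str], Any], maxsize: int = 8) -> Callable[[str, str], Any]:
    """Wrap a loader so that an unchanged file is read only once."""

    @lru_cache(maxsize=maxsize)
    def _load(path: str, file_type: str, mtime_ns: int) -> Any:
        # mtime_ns is used only as part of the cache key:
        # a modified file produces a new key and is reloaded.
        return loader(path, file_type)

    def cached_load(path: str, file_type: str) -> Any:
        return _load(path, file_type, os.stat(path).st_mtime_ns)

    return cached_load
```

Usage would be along the lines of cached_load = make_cached_loader(load_data). The cache returns the same object on every hit, so callers that modify the result should work on a copy (for a DataFrame, data.copy()).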

Memory Usage

# Check DataFrame memory usage
data_memory = data.memory_usage(deep=True).sum()
print(f"Data uses {data_memory / 1024**2:.2f} MB")

# Optimize memory if needed: float32 halves memory use
# but keeps only about 7 significant digits
data = data.astype('float32')

Troubleshooting

Data Won't Load

  1. Check file exists: Verify path is correct
  2. Check permissions: Ensure read access
  3. Try opening in Excel/text editor: Verify file isn't corrupt
  4. Check encoding: Try UTF-8 if special characters present
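The first, second, and fourth checks above can be automated with the standard library. This is a rough sketch; diagnose_file is a hypothetical helper, and its UTF-8 check only makes sense for text formats (CSV, TXT), since .xlsx files are binary archives.

```python
import os
from typing import List


def diagnose_file(path: str, expect_text: bool = True) -> List[str]:
    """Return likely reasons a file fails to load; an empty list means none found."""
    if not os.path.exists(path):
        return [f"File not found: {path}"]
    if not os.path.isfile(path):
        return [f"Not a regular file: {path}"]
    if not os.access(path, os.R_OK):
        return [f"No read permission: {path}"]

    problems = []
    if os.path.getsize(path) == 0:
        problems.append("File is empty")
    elif expect_text:
        # Reads the whole file; fine for a diagnostic, not for routine loading.
        with open(path, "rb") as handle:
            try:
                handle.read().decode("utf-8")
            except UnicodeDecodeError as error:
                problems.append(f"Not valid UTF-8 (byte offset {error.start})")
    return problems


for problem in diagnose_file("input/experiment1.csv"):
    print(problem)
```

Pass expect_text=False for Excel files to skip the encoding check.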

Wrong Data Loaded

  1. Check delimiter: CSV may use semicolon instead of comma
  2. Check headers: Ensure first row contains column names
  3. Check sheet: Excel file may have data in different sheet

Uncertainty Columns Not Detected

  1. Check naming: Must be exactly u + variable name
  2. Check case: Lowercase u required
  3. Check spelling: No extra characters or spaces
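A quick way to spot near-miss names is to compare each column against the expected u + variable pattern. The helper below is a hypothetical diagnostic, not part of the module; it assumes the typical mistakes are an uppercase U, surrounding whitespace, or a separator (space or underscore) after the prefix.

```python
from typing import Dict, List


def find_suspect_uncertainty_columns(columns: List[str]) -> Dict[str, str]:
    """Map each suspicious column name to the name the convention expects."""
    names = set(columns)
    suspects = {}
    for col in columns:
        cleaned = col.strip()
        if len(cleaned) < 2 or cleaned[0] not in "uU":
            continue
        # Normalise the prefix and drop any separator that follows it.
        expected = "u" + cleaned[1:].lstrip(" _")
        # Report only names that differ from the expected form and
        # whose base variable actually exists as a column.
        if expected != col and expected[1:] in names:
            suspects[col] = expected
    return suspects


print(find_suspect_uncertainty_columns(["time", "Utime", "voltage", "u_voltage"]))
# {'Utime': 'utime', 'u_voltage': 'uvoltage'}
```

Renaming the reported columns (for a DataFrame, data.rename(columns=suspects)) should make them match the naming convention.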

See also: loading_utils for low-level file readers.