Data Utilities Reference¶

This section documents the data loading and preprocessing utilities in Tabular SSL.

DataLoader¶

The main class for loading and preprocessing tabular data.

from tabular_ssl.data import DataLoader

Methods¶

`load_data(file_path, target_col=None)`¶

Load data from a file.

Parameters:
- file_path (str): Path to the data file
- target_col (str, optional): Name of the target column

Returns:
- pd.DataFrame: Loaded data
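For orientation, reading a CSV and separating a target column can be sketched with plain pandas. This is not the library's implementation: `load_csv_with_target` is a hypothetical helper, and whether `load_data` splits off the target or only records its name is not stated on this page.

```python
import io

import pandas as pd


def load_csv_with_target(source, target_col=None):
    """Hypothetical helper: read a CSV and optionally split off the target."""
    data = pd.read_csv(source)
    if target_col is None:
        return data, None
    if target_col not in data.columns:
        raise KeyError(f"target column {target_col!r} not found")
    # Separate the feature columns from the target column.
    return data.drop(columns=[target_col]), data[target_col]


# An in-memory CSV stands in for a file on disk.
csv_text = "age,city,label\n31,Paris,1\n45,Lima,0\n"
features, target = load_csv_with_target(io.StringIO(csv_text), target_col="label")
```

Here `features` keeps the `age` and `city` columns and `target` holds the `label` values.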

`preprocess(data, categorical_cols=None, scale_numerical=True, handle_missing=True)`¶

Preprocess the data.

Parameters:
- data (pd.DataFrame): Input data
- categorical_cols (list, optional): List of categorical column names
- scale_numerical (bool, optional): Whether to scale numerical features
- handle_missing (bool, optional): Whether to handle missing values

Returns:
- pd.DataFrame: Preprocessed data
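The three steps these parameters control can be sketched with pandas alone. The function below is an illustrative stand-in, not the library code. It assumes mean imputation for numerical columns, most-frequent imputation for categorical columns, z-score scaling, and one-hot encoding; the actual defaults of `preprocess` may differ.

```python
import pandas as pd


def preprocess_sketch(data, categorical_cols=None, scale_numerical=True,
                      handle_missing=True):
    """Illustrative stand-in for the steps named above, not the library code."""
    out = data.copy()
    categorical_cols = list(categorical_cols or [])
    numerical_cols = [c for c in out.columns if c not in categorical_cols]

    if handle_missing:
        # Assumption: mean for numerical columns, most frequent value otherwise.
        for col in numerical_cols:
            out[col] = out[col].fillna(out[col].mean())
        for col in categorical_cols:
            out[col] = out[col].fillna(out[col].mode().iloc[0])

    if scale_numerical:
        # Standardize to zero mean and unit (population) standard deviation.
        for col in numerical_cols:
            std = out[col].std(ddof=0)
            out[col] = (out[col] - out[col].mean()) / (std if std > 0 else 1.0)

    # One-hot encode the categorical columns.
    return pd.get_dummies(out, columns=categorical_cols)


raw = pd.DataFrame({
    "income": [10.0, None, 30.0],
    "city": ["Paris", "Lima", None],
})
clean = preprocess_sketch(raw, categorical_cols=["city"])
```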

Data Transformers¶

CategoricalTransformer¶

from tabular_ssl.data import CategoricalTransformer

transformer = CategoricalTransformer(
    columns=['category1', 'category2'],
    encoding='onehot'
)
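As a rough picture of what `encoding='onehot'` usually means, the sketch below learns the categories during a fit step and reuses them at transform time. `OneHotSketch` and its `fit`/`transform` methods are hypothetical names chosen for illustration; the interface of `CategoricalTransformer` is not documented on this page.

```python
import pandas as pd


class OneHotSketch:
    """Hypothetical fit/transform encoder; not the library's class."""

    def __init__(self, columns):
        self.columns = columns
        self.categories_ = {}

    def fit(self, data):
        # Remember the categories seen in each column during fitting.
        for col in self.columns:
            self.categories_[col] = sorted(data[col].dropna().unique())
        return self

    def transform(self, data):
        out = data.copy()
        for col in self.columns:
            # Fixing the categories keeps output columns stable across datasets;
            # values not seen during fit become all-zero rows.
            out[col] = pd.Categorical(out[col], categories=self.categories_[col])
        return pd.get_dummies(out, columns=self.columns, dtype=int)


train = pd.DataFrame({"category1": ["a", "b", "a"]})
new = pd.DataFrame({"category1": ["b", "c"]})
encoded = OneHotSketch(["category1"]).fit(train).transform(new)
```

Fixing the category list at fit time is one design choice among several; it keeps the encoded columns identical for training and later data.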

NumericalTransformer¶

from tabular_ssl.data import NumericalTransformer

transformer = NumericalTransformer(
    columns=['numeric1', 'numeric2'],
    scaling='standard'
)
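Assuming `scaling='standard'` refers to z-score standardization (subtract the mean, divide by the standard deviation), the arithmetic looks like the sketch below. `standard_scale` is a hypothetical helper written with NumPy, not part of Tabular SSL.

```python
import numpy as np


def standard_scale(values):
    """Return (scaled, mean, std) for a 1-D array, using the population std."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()
    # Guard against constant columns, where the standard deviation is zero.
    scaled = (values - mean) / (std if std > 0 else 1.0)
    return scaled, mean, std


scaled, mean, std = standard_scale([2.0, 4.0, 6.0])
```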

Data Validation¶

DataValidator¶

from tabular_ssl.data import DataValidator

validator = DataValidator(
    required_columns=['col1', 'col2'],
    data_types={
        'col1': 'numeric',
        'col2': 'categorical'
    }
)
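The kind of check this configuration describes can be sketched as follows. `validate_sketch` is a hypothetical function, and the rule that "categorical" means "not a numeric dtype" is an assumption made for this example; the behavior of `DataValidator` itself is not specified here.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_sketch(data, required_columns, data_types):
    """Hypothetical check that returns a list of human-readable problems."""
    problems = []
    for col in required_columns:
        if col not in data.columns:
            problems.append(f"missing column: {col}")
    for col, kind in data_types.items():
        if col not in data.columns:
            continue  # absent columns are handled by the required check
        numeric = is_numeric_dtype(data[col])
        if kind == "numeric" and not numeric:
            problems.append(f"{col} should be numeric")
        # Assumption: a numeric dtype does not count as categorical.
        if kind == "categorical" and numeric:
            problems.append(f"{col} should be categorical")
    return problems


frame = pd.DataFrame({"col1": ["x", "y"], "col3": [1, 2]})
issues = validate_sketch(
    frame,
    required_columns=["col1", "col2"],
    data_types={"col1": "numeric", "col2": "categorical"},
)
```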

Data Splitting¶

DataSplitter¶

from tabular_ssl.data import DataSplitter

splitter = DataSplitter(
    test_size=0.2,
    val_size=0.1,
    random_state=42
)
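A three-way split with these settings can be sketched with NumPy. The example assumes that `test_size` and `val_size` are both fractions of the full dataset, which this page does not confirm; `split_indices` is a hypothetical helper.

```python
import numpy as np


def split_indices(n_rows, test_size=0.2, val_size=0.1, random_state=42):
    """Shuffle row indices and cut them into train/val/test parts."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    n_val = int(round(n_rows * val_size))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(100)
```

Seeding the generator makes the split reproducible, which is the usual purpose of a `random_state` argument.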

Feature Engineering¶

FeatureEngineer¶

from tabular_ssl.data import FeatureEngineer

engineer = FeatureEngineer(
    interactions=True,
    polynomials=True,
    degree=2
)
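For `degree=2`, polynomial and interaction features amount to squares and pairwise products of the numerical columns. The sketch below shows that idea with pandas; the helper name and the generated column names are assumptions, not the output format of `FeatureEngineer`.

```python
from itertools import combinations

import pandas as pd


def add_degree2_features(data, interactions=True, polynomials=True):
    """Hypothetical sketch for degree=2 only; column naming is an assumption."""
    out = data.copy()
    cols = list(data.columns)
    if polynomials:
        # Square of each original column.
        for col in cols:
            out[f"{col}^2"] = data[col] ** 2
    if interactions:
        # Product of each unordered pair of original columns.
        for a, b in combinations(cols, 2):
            out[f"{a}*{b}"] = data[a] * data[b]
    return out


base = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
expanded = add_degree2_features(base)
```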

Data Augmentation¶

DataAugmenter¶

from tabular_ssl.data import DataAugmenter

augmenter = DataAugmenter(
    noise_level=0.1,
    mask_ratio=0.15
)
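One plausible reading of these two parameters is additive Gaussian noise plus random cell masking, a common corruption scheme in self-supervised learning on tabular data. The sketch below follows that reading; the interpretation of `noise_level` and `mask_ratio`, the zero placeholder, and the function name are all assumptions.

```python
import numpy as np


def augment_sketch(x, noise_level=0.1, mask_ratio=0.15, seed=0):
    """Return (corrupted copy, boolean mask of hidden cells)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Assumption: noise_level is the standard deviation of additive Gaussian noise.
    noisy = x + rng.normal(0.0, noise_level, size=x.shape)
    # Assumption: each cell is masked independently with probability mask_ratio.
    mask = rng.random(x.shape) < mask_ratio
    noisy[mask] = 0.0  # masked cells are replaced with a placeholder value
    return noisy, mask


features = np.ones((200, 5))
corrupted, mask = augment_sketch(features)
```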

Common Operations¶

Loading Data¶

from tabular_ssl.data import DataLoader

# Load from CSV
data = DataLoader().load_data('data.csv')

# Load from DataFrame (df is an existing pd.DataFrame)
data = DataLoader().load_data(df)

Preprocessing¶

# Basic preprocessing
processed_data = DataLoader().preprocess(
    data,
    categorical_cols=['category1', 'category2']
)

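# Advanced preprocessing
# (missing_strategy is not listed in the preprocess signature documented
# above; check that your installed version accepts it)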
processed_data = DataLoader().preprocess(
    data,
    categorical_cols=['category1', 'category2'],
    scale_numerical=True,
    handle_missing=True,
    missing_strategy='mean'
)

Data Splitting¶

from tabular_ssl.data import DataSplitter

# Split data into training, validation, and test sets
train_data, val_data, test_data = DataSplitter().split(data)

Feature Engineering¶

from tabular_ssl.data import FeatureEngineer

# Create new features
engineered_data = FeatureEngineer().transform(data)

Best Practices¶

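- Validate data before processing it.
- Handle missing values explicitly, either by imputing or by dropping them.
- Scale numerical features.
- Encode categorical variables.
- Split data before fitting preprocessing steps, so that statistics such as scaling parameters come from the training set only.
- Document preprocessing steps.
- Save preprocessed data.
- Use appropriate data types.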
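The advice to split before preprocessing can be illustrated with a small NumPy sketch: scaling statistics are computed on the training portion and then reused for the held-out portion, so no information from the test rows leaks into the transformation. The data here is synthetic and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=100)
train, test = values[:80], values[80:]

# Learn the scaling statistics from the training portion only.
train_mean, train_std = train.mean(), train.std()

# Apply the same statistics to both portions.
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std
```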

Related Resources¶

- API Reference: Complete API documentation
- How-to Guides: Data preparation guides
- Tutorials: Getting started guides
