Data Preparation Guide¶
This guide covers best practices for preparing your data for use with Tabular SSL.
Loading Data¶
From CSV Files¶
from tabular_ssl.data import DataLoader
# Initialize the data loader
data_loader = DataLoader()
# Load data from CSV
data = data_loader.load_data('path/to/your/data.csv')
From Pandas DataFrame¶
import pandas as pd
# Create or load your DataFrame
df = pd.DataFrame({
'numeric_col': [1, 2, 3],
'categorical_col': ['A', 'B', 'A']
})
# Use the data loader
data_loader = DataLoader()
processed_data = data_loader.preprocess(df)
Handling Different Data Types¶
Categorical Variables¶
# Specify categorical columns
categorical_cols = ['category1', 'category2']
processed_data = data_loader.preprocess(
data,
categorical_cols=categorical_cols
)
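If you prefer to encode categorical columns yourself before handing data to the loader, plain pandas works as well. The following is a minimal sketch with placeholder column names; it is not required when using the loader's categorical_cols option.
import pandas as pd
# Example frame with two categorical columns (placeholder names)
data = pd.DataFrame({
    'category1': ['A', 'B', 'A'],
    'category2': ['X', 'X', 'Y'],
})
# One-hot encode the categorical columns with pandas
encoded = pd.get_dummies(data, columns=['category1', 'category2'])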
Numerical Variables¶
# Numerical columns are automatically detected
# You can specify scaling options
processed_data = data_loader.preprocess(
data,
scale_numerical=True, # Enable scaling
scaler='standard' # Use standard scaler
)
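To see what standard scaling does, or to apply it outside the loader, scikit-learn's StandardScaler is a common equivalent. This is a sketch assuming scikit-learn is installed; the column name is a placeholder.
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({'numeric_col': [1.0, 2.0, 3.0, 4.0]})
# Standard scaling: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
data[['numeric_col']] = scaler.fit_transform(data[['numeric_col']])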
Dealing with Missing Values¶
Automatic Handling¶
# The data loader automatically handles missing values
processed_data = data_loader.preprocess(
data,
handle_missing=True, # Enable missing value handling
missing_strategy='mean' # Use mean imputation
)
Manual Handling¶
import pandas as pd
# Fill missing values
data = data.fillna({
'numeric_col': data['numeric_col'].mean(),
'categorical_col': data['categorical_col'].mode()[0]
})
Feature Engineering¶
Creating New Features¶
# Add interaction terms
data['interaction'] = data['feature1'] * data['feature2']
# Add polynomial features
data['feature1_squared'] = data['feature1'] ** 2
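For more than a handful of columns, writing interaction and polynomial terms by hand gets tedious. scikit-learn's PolynomialFeatures can generate them systematically; the sketch below assumes scikit-learn is available and uses placeholder column names.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
data = pd.DataFrame({'feature1': [1.0, 2.0, 3.0], 'feature2': [4.0, 5.0, 6.0]})
# degree=2 adds squares and pairwise interactions; include_bias=False drops the constant column
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = pd.DataFrame(
    poly.fit_transform(data[['feature1', 'feature2']]),
    columns=poly.get_feature_names_out()
)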
Feature Selection¶
from tabular_ssl.utils import select_features
# Select features based on importance
selected_features = select_features(
data,
target_col='target',
method='importance',
threshold=0.01
)
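If select_features does not fit your workflow, a common manual alternative is to rank columns with a tree ensemble and keep those above an importance threshold. The sketch below uses scikit-learn with placeholder column names and is not the library's internal implementation.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [6, 5, 4, 3, 2, 1],
    'target':   [0, 1, 0, 1, 0, 1],
})
X = data.drop(columns=['target'])
y = data['target']
# Fit a small forest and keep features whose importance exceeds the threshold
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
selected_features = importances[importances > 0.01].index.tolist()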
Data Validation¶
Checking Data Quality¶
from tabular_ssl.utils import validate_data
# Validate data before processing
validation_results = validate_data(data)
print(validation_results)
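For quick, library-independent checks before (or in addition to) validate_data, a few pandas one-liners cover the most common problems. This is a sketch with placeholder column names.
import numpy as np
import pandas as pd
data = pd.DataFrame({
    'numeric_col': [1.0, np.nan, 3.0],
    'categorical_col': ['A', 'B', 'B'],
})
# Basic quality checks with plain pandas
print(data.dtypes)                   # column types
print(data.isna().sum())             # missing values per column
print(data.duplicated().sum())       # number of duplicate rows
print(data.describe(include='all'))  # summary statistics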
Common Issues and Solutions¶
- Inconsistent Data Types
# Convert columns to the correct types
data['numeric_col'] = pd.to_numeric(data['numeric_col'])
data['categorical_col'] = data['categorical_col'].astype('category')
- Outliers
# Remove outliers outside the 1st-99th percentile range
data = data[data['numeric_col'].between(
    data['numeric_col'].quantile(0.01),
    data['numeric_col'].quantile(0.99)
)]
Best Practices¶
- Always validate your data before processing
- Handle missing values appropriately for your use case
- Scale numerical features when necessary
- Encode categorical variables properly
- Check for and handle outliers
- Document your preprocessing steps (a combined sketch follows this list)
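Putting these practices together, a minimal end-to-end flow might look like the following. It reuses the DataLoader options shown earlier in this guide; the file path and column names are placeholders, and whether all of these keyword arguments can be combined in a single preprocess call should be checked against the API Reference.
from tabular_ssl.data import DataLoader
from tabular_ssl.utils import validate_data
data_loader = DataLoader()
# 1. Load and validate the raw data
data = data_loader.load_data('path/to/your/data.csv')
print(validate_data(data))
# 2. Preprocess: encode categoricals, scale numerics, impute missing values
processed_data = data_loader.preprocess(
    data,
    categorical_cols=['category1', 'category2'],
    scale_numerical=True,
    scaler='standard',
    handle_missing=True,
    missing_strategy='mean',
)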
Related Resources¶
- Model Training - Next steps after data preparation
- API Reference - Detailed API documentation
- Tutorials - Step-by-step guides