Architecture Overview¶

This section explains the architecture and design decisions behind Tabular SSL.

System Design¶

High-Level Architecture¶

The Tabular SSL system consists of several key components:

Data Processing Layer
Data loading and validation
Feature preprocessing
Data augmentation
Model Layer
Component Registry for modular design
Feature embedding
Encoder components (Transformer, RNN, LSTM, S4, etc.)
Task-specific heads
Training Layer
Self-supervised learning
Optimization
Monitoring
Configuration Layer
Hydra configuration management
Experiment tracking
Parameter validation

Component Registry¶

One of the core architectural features of Tabular SSL is the Component Registry pattern, which enables a highly modular and extensible design.

Registry Design¶

The Component Registry is a central repository that maps component names to their implementations:

class ComponentRegistry:
    """Registry for model components."""

    _components: ClassVar[Dict[str, Type['BaseComponent']]] = {}

    @classmethod
    def register(cls, name: str) -> Type[T]:
        """Register a component class."""
        def decorator(component_cls: Type[T]) -> Type[T]:
            cls._components[name] = component_cls
            return component_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> Type['BaseComponent']:
        """Get a component class by name."""
        if name not in cls._components:
            raise KeyError(f"Component {name} not found in registry")
        return cls._components[name]

Component Configuration¶

Each component has its own configuration class that inherits from ComponentConfig:

class ComponentConfig(PydanticBaseModel):
    """Base configuration for components."""

    name: str = Field(..., description="Name of the component")
    type: str = Field(..., description="Type of the component")

    @validator('type')
    def validate_type(cls, v: str) -> str:
        """Validate that the component type exists in the registry."""
        if v not in ComponentRegistry._components:
            raise ValueError(f"Component type {v} not found in registry")
        return v

Component Initialization¶

Components are initialized using their configuration:

def _init_component(self, config: ComponentConfig) -> BaseComponent:
    """Initialize a component from its configuration."""
    component_cls = ComponentRegistry.get(config.type)
    return component_cls(config)

Benefits of the Registry Pattern¶

Modularity: Components can be added, removed, or replaced independently
Validation: Configuration is validated before components are initialized
Extensibility: New components can be added without modifying existing code
Dynamic Loading: Components are loaded at runtime based on configuration
Type Safety: Component types are checked during initialization

Component Details¶

Base Components¶

Tabular SSL defines several base component types:

EventEncoder: Encodes individual events or timesteps
SequenceEncoder: Encodes sequences of events
EmbeddingLayer: Handles embedding of categorical features
ProjectionHead: Projects encoded representations to a different space
PredictionHead: Generates predictions from encoded representations

Each component type has multiple implementations that can be selected via configuration.

Available Components¶

Event Encoders¶

mlp_event_encoder: MLP-based event encoder
autoencoder: Autoencoder-based event encoder
contrastive: Contrastive learning event encoder

Sequence Encoders¶

rnn: Basic RNN encoder
lstm: LSTM encoder
gru: GRU encoder
transformer: Transformer encoder
s4: Diagonal State Space Model (S4) encoder

Embedding Layers¶

categorical_embedding: Embedding layer for categorical variables

Projection Heads¶

mlp_projection: MLP-based projection head

Prediction Heads¶

classification: Classification head

Corruption Strategies¶

random_masking: Random masking corruption
gaussian_noise: Gaussian noise corruption
swapping: Feature swapping corruption
vime: VIME-style corruption
corruption_pipeline: Pipeline of multiple corruption strategies

Configuration System¶

The system uses Hydra's configuration system with structured configuration files:

configs/
├── config.yaml                # Main configuration
├── model/                     # Model configurations
│   ├── default.yaml          # Default model config
│   ├── event_encoder/        # Event encoder configs
│   ├── sequence_encoder/     # Sequence encoder configs
│   ├── embedding/            # Embedding configs
│   ├── projection_head/      # Projection head configs
│   └── prediction_head/      # Prediction head configs
├── data/                     # Data configurations
├── trainer/                  # Training configurations
├── callbacks/                # Callback configurations
├── logger/                   # Logger configurations
├── experiment/               # Experiment configurations
├── hydra/                    # Hydra-specific configurations
└── paths/                    # Path configurations

Configuration Composition¶

Configurations are composed hierarchically:

# configs/model/default.yaml
defaults:
  - _self_
  - event_encoder: mlp.yaml
  - sequence_encoder: transformer.yaml
  - embedding: categorical.yaml
  - projection_head: mlp.yaml
  - prediction_head: classification.yaml

_target_: tabular_ssl.models.base.BaseModel

model:
  name: tabular_ssl_model
  type: base
  event_encoder: ${event_encoder}
  sequence_encoder: ${sequence_encoder}
  embedding: ${embedding}
  projection_head: ${projection_head}
  prediction_head: ${prediction_head}

Experiment Configuration¶

Experiments override specific parts of the configuration:

# configs/experiment/s4_sequence.yaml
# @package _global_

defaults:
  - override /model/sequence_encoder: s4.yaml
  - override /trainer: default.yaml
  - override /model: default.yaml
  - override /callbacks: default.yaml
  - _self_

tags: ["s4", "sequence"]

trainer:
  max_epochs: 50
  gradient_clip_val: 0.5

model:
  optimizer:
    lr: 5.0e-4
    weight_decay: 0.05

Hydra-to-Component Integration¶

The system translates Hydra configurations to component configurations:

# Convert Hydra configs to ComponentConfigs
self.event_encoder_config = ComponentConfig.from_hydra(config.model.event_encoder)
self.sequence_encoder_config = ComponentConfig.from_hydra(config.model.sequence_encoder)

# Initialize components
self.event_encoder = self._init_component(self.event_encoder_config)
self.sequence_encoder = self._init_component(self.sequence_encoder_config)

Design Decisions¶

Why Component Registry?¶

The Component Registry pattern was chosen for several reasons:

Separation of Concerns
Components focus on their specific functionality
Registry handles component discovery and initialization
Configuration handles component parameters
Extensibility
New components can be added without modifying existing code
Custom components can be registered by users
Experiments can mix and match components
Validation
Component types are validated during initialization
Configuration parameters are validated using Pydantic
Better error messages for misconfiguration

Why Hydra Configuration?¶

Hydra provides several benefits for configuration management:

Hierarchical Configuration
Configurations are organized into groups
Defaults can be overridden selectively
Parameters can be composed from multiple sources
Command-line Overrides
Parameters can be changed at runtime
No need to modify configuration files
Experiment parameters are explicit
Multirun Capabilities
Parameter sweeps for experimentation
Parallel execution of multiple runs
Organized output directories

Implementation Details¶

Code Organization¶

src/
├── tabular_ssl/              # Core package
│   ├── data/                # Data loading and processing
│   ├── models/              # Model implementations
│   │   ├── base.py         # Base model and component registry
│   │   ├── components.py   # Model components
│   │   └── s4.py           # S4 implementation
│   └── utils/              # Utility functions
└── train.py                 # Training script

Key Classes¶

ComponentRegistry¶

Central registry for all components
Handles component registration and retrieval
Ensures type safety

BaseComponent¶

Abstract base class for all components
Handles configuration validation
Defines common interface

BaseModel¶

Main model class
Composes components based on configuration
Handles training and inference

ComponentConfig¶

Base configuration class
Uses Pydantic for validation
Integrates with Hydra configuration

Performance Considerations¶

Component Design¶

Lazy Initialization
Components are only initialized when needed
Configuration is validated early
Resources are allocated efficiently
Configuration Caching
Configurations are parsed once
Common configurations are reused
Reduces memory overhead
Dynamic Component Selection
Only required components are initialized
Custom components can be more efficient
Allows for hardware-specific optimizations

Memory Efficiency¶

Batch Processing
Dynamic batch sizes
Gradient accumulation
Memory-efficient attention
Model Optimization
Parameter sharing
Quantization
Pruning

Training Speed¶

Hardware Acceleration
GPU support
Mixed precision
Parallel processing
Optimization
Efficient data loading
Cached computations
Optimized attention

SSL Methods - Self-supervised learning approaches
Performance Considerations - Optimization and scaling
API Reference - Technical documentation