Motivation

The Problem

The emergence of atomistic foundation models has created a pressing need for standardized fine-tuning frameworks. However, several key challenges have made it difficult to create a unified platform:

  1. Diverse Model Architectures: Atomistic models use various architectural paradigms (GNNs, Transformers) with different internal representations (PyG graphs, DGL graphs, dense tensors).

  2. Different Data Processing Requirements: Each model requires specific data preprocessing pipelines and batch structures, making standardization challenging.

  3. Complex Property Prediction: Models must handle diverse property types (scalar, vector, per-atom, system-level) with varying output head architectures.

  4. Integration Complexity: Models need to interface with existing molecular dynamics engines, structure prediction software, and materials screening pipelines.

Core Design Philosophy

MatterTune addresses these challenges through a carefully designed abstraction hierarchy that maximizes flexibility while maintaining a clean, unified interface:

1. Data Abstraction

The foundation of MatterTune is a minimalist data contract: a dataset only needs to provide atomic structures as ASE Atoms objects:

import ase
from torch.utils.data import Dataset
from typing_extensions import override  # or `from typing import override` on Python 3.12+

class MyDataset(Dataset[ase.Atoms]):
    """A dataset that provides atomic structures.

    This is the minimal interface required by MatterTune. Any data source that can be
    mapped to ASE Atoms objects can be wrapped in this interface.
    """

    def __init__(self, data_source: str):
        """Initialize the dataset.

        Args:
            data_source: Path to data or other source identifier
        """
        self.data = ...  # Load your data

    def __len__(self) -> int:
        """Return the number of structures in the dataset."""
        return len(self.data)

    @override
    def __getitem__(self, idx: int) -> ase.Atoms:
        """Return the atomic structure at given index.

        Args:
            idx: Index of the desired structure

        Returns:
            ase.Atoms: The atomic structure
        """
        return self.data[idx]

This simple abstraction provides several key benefits:

  • Universal compatibility with existing materials science formats

  • Zero assumptions about internal data storage

  • Natural integration with ASE’s ecosystem

  • Flexibility to support any data source that can be mapped to atomic structures

2. Backbone Abstraction

Rather than enforcing a specific internal architecture, MatterTune defines backbones through their capability to predict properties:

from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar

import torch
from ase import Atoms
from typing_extensions import NotRequired, TypedDict

TData = TypeVar("TData")    # Model-specific representation of one structure
TBatch = TypeVar("TBatch")  # Model-specific representation of a batch

class ModelOutput(TypedDict):
    predicted_properties: dict[str, torch.Tensor]
    """Predicted properties. This dictionary should have exactly the same
    shape/format as the output of `batch_to_labels`."""

    backbone_output: NotRequired[Any]
    """Output of the backbone model. Only set if `return_backbone_output` is True."""

class FinetuneModuleBase(ABC, Generic[TData, TBatch]):
    @abstractmethod
    def atoms_to_data(self, atoms: Atoms, has_labels: bool) -> TData:
        """Convert atoms to model-specific data format"""

    @abstractmethod
    def collate_fn(self, data_list: list[TData]) -> TBatch:
        """Collate individual data points into a batch"""

    @abstractmethod
    def model_forward(self, batch: TBatch) -> ModelOutput:
        """Predict properties from a batch"""

This design:

  • Allows models to use their native data structures (TData, TBatch)

  • Separates property schema from implementation details

  • Enables efficient batch processing specific to each architecture

  • Provides clear extension points for new model types
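
To show how the three abstract methods above fit together, here is a dependency-free sketch. Everything in it is a stand-in chosen for illustration: `ToyAtoms` replaces `ase.Atoms`, `ToyModuleBase` is a simplified mirror of the base class, plain Python lists replace `torch.Tensor`, and the "model" simply sums atomic numbers. None of these names belong to MatterTune.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Generic, TypeVar

TData = TypeVar("TData")
TBatch = TypeVar("TBatch")


@dataclass
class ToyAtoms:
    """Stand-in for ase.Atoms: just a list of atomic numbers."""
    numbers: list[int]


class ToyModuleBase(ABC, Generic[TData, TBatch]):
    """Simplified mirror of the three abstract methods shown above."""

    @abstractmethod
    def atoms_to_data(self, atoms: ToyAtoms, has_labels: bool) -> TData: ...

    @abstractmethod
    def collate_fn(self, data_list: list[TData]) -> TBatch: ...

    @abstractmethod
    def model_forward(self, batch: TBatch) -> dict[str, Any]: ...

    def predict(self, structures: list[ToyAtoms]) -> dict[str, Any]:
        """Generic driver: it never inspects TData or TBatch."""
        data = [self.atoms_to_data(a, has_labels=False) for a in structures]
        return self.model_forward(self.collate_fn(data))


class ToyBackbone(ToyModuleBase[list[int], dict[str, Any]]):
    """Flat-batch layout: concatenated atoms plus a per-atom structure index."""

    def atoms_to_data(self, atoms, has_labels):
        return list(atoms.numbers)

    def collate_fn(self, data_list):
        numbers, owner = [], []
        for i, nums in enumerate(data_list):
            numbers.extend(nums)
            owner.extend([i] * len(nums))
        return {"numbers": numbers, "owner": owner, "size": len(data_list)}

    def model_forward(self, batch):
        energy = [0.0] * batch["size"]
        for z, i in zip(batch["numbers"], batch["owner"]):
            energy[i] += float(z)  # placeholder "model": sum of atomic numbers
        return {"predicted_properties": {"energy": energy}}


output = ToyBackbone().predict([ToyAtoms([1, 1, 8]), ToyAtoms([6])])
print(output["predicted_properties"]["energy"])
```

The point of the sketch is that `predict` is written once against the abstract interface, while the batch layout (here, a flat list with an owner index) stays a private detail of the backbone.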

3. Property Schema

Properties are defined through a declarative schema system:

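from typing import Literal

from typing_extensions import TypeAliasType

# LossConfig is the framework's loss configuration type (import omitted here).

class EnergyPropertyConfig: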
    """Configuration for total energy prediction."""
    name: str = "energy"  # Fixed name for energy property
    loss: LossConfig      # Loss function configuration
    loss_coefficient: float = 1.0  # Weight in total loss

class ForcesPropertyConfig:
    """Configuration for atomic forces prediction."""
    name: str = "forces"
    loss: LossConfig
    loss_coefficient: float = 1.0
    conservative: bool  # Whether forces are computed as energy gradients

class StressesPropertyConfig:
    """Configuration for stress tensor prediction."""
    name: str = "stress"
    loss: LossConfig
    loss_coefficient: float = 1.0
    conservative: bool  # Whether stress is computed from energy

class GraphPropertyConfig:
    """Configuration for custom graph-level properties."""
    name: str  # User-defined property name
    loss: LossConfig
    loss_coefficient: float = 1.0
    reduction: Literal["mean", "sum", "max"]  # How to aggregate atomic features

PropertyConfig = TypeAliasType(
    "PropertyConfig",
    EnergyPropertyConfig
    | ForcesPropertyConfig
    | StressesPropertyConfig
    | GraphPropertyConfig,
)

Benefits:

  • Clear separation between property definition and implementation

  • Type-safe property specifications

  • Built-in support for common properties (energy, forces, stress)

  • Support for complex property types

  • Flexible reduction strategies
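
The `loss_coefficient` and `reduction` fields above can be read as follows. This is a minimal sketch under two assumptions that the text only implies: that the total loss is a weighted sum of per-property losses, and that a reduction collapses per-atom values into a single graph-level value. `ToyPropertyConfig`, `total_loss`, and `reduce_atoms` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class ToyPropertyConfig:
    """Hypothetical stand-in for a property config; not a MatterTune class."""
    name: str
    loss_coefficient: float = 1.0


def total_loss(configs: list[ToyPropertyConfig], per_property_loss: dict[str, float]) -> float:
    """Weighted sum: each property's loss scaled by its loss_coefficient."""
    return sum(c.loss_coefficient * per_property_loss[c.name] for c in configs)


def reduce_atoms(values: list[float], reduction: Literal["mean", "sum", "max"]) -> float:
    """Aggregate per-atom values into one graph-level value."""
    if reduction == "sum":
        return sum(values)
    if reduction == "mean":
        return sum(values) / len(values)
    if reduction == "max":
        return max(values)
    raise ValueError(f"unknown reduction: {reduction!r}")


configs = [
    ToyPropertyConfig("energy", loss_coefficient=1.0),
    ToyPropertyConfig("forces", loss_coefficient=10.0),
]
losses = {"energy": 0.5, "forces": 0.25}
print(total_loss(configs, losses))
print(reduce_atoms([1.0, 2.0, 6.0], "mean"))
```

Under these assumptions, raising a property's `loss_coefficient` increases its share of the training signal without touching the loss function itself.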

Implementation Philosophy

The framework follows several key principles:

  1. Minimal Assumptions: We make zero assumptions about internal model architectures or data structures beyond the basic interfaces.

  2. Type Safety: All interfaces are fully typed, providing clear contracts and early error detection.

  3. Separation of Concerns:

    • Property definitions are separate from implementations

    • Data processing is separate from model architecture

    • Training logic is separate from model definition

  4. Extensibility First:

    • New backbones only need to implement the core data conversion and forward methods (atoms_to_data, collate_fn, model_forward)

    • Custom datasets only need to map to ase.Atoms

    • Property types can be extended without changing the core framework

Real-World Benefits

This design enables several powerful workflows:

  1. Unified Fine-tuning: Train any supported model on any compatible dataset with a consistent API.

  2. Easy Integration: Models automatically work with ASE calculators and MD engines.

  3. Flexible Deployment: Models can be used for:

    • Molecular dynamics simulations

    • High-throughput screening

    • Structure prediction

    • Property prediction

  4. Performance Optimization: Each model can implement optimal batch processing while maintaining a consistent interface.

Future Extensibility

The framework is designed to grow with the field:

  1. New Architectures: Additional backbones can be added by implementing the core interfaces.

  2. New Properties: The property schema system can be extended for new property types.

  3. New Data Sources: Any data source that can map to ase.Atoms is supported.

  4. New Applications: The clean interfaces enable integration with new workflows and tools.