Motivation
The Problem
The emergence of atomistic foundation models has created a pressing need for standardized fine-tuning frameworks. However, several key challenges have made it difficult to create a unified platform:
Diverse Model Architectures: Atomistic models use various architectural paradigms (GNNs, Transformers) with different internal representations (PyG graphs, DGL graphs, dense tensors).
Different Data Processing Requirements: Each model requires specific data preprocessing pipelines and batch structures, making standardization challenging.
Complex Property Prediction: Models must handle diverse property types (scalar, vector, per-atom, system-level) with varying output head architectures.
Integration Complexity: Models need to interface with existing molecular dynamics engines, structure prediction software, and materials screening pipelines.
Core Design Philosophy
MatterTune addresses these challenges through a carefully designed abstraction hierarchy that maximizes flexibility while maintaining a clean, unified interface:
1. Data Abstraction
The foundation of MatterTune is a minimalist data contract, allowing for support for any data source that can provide atomic structures as ASE Atoms objects:
import ase
from torch.utils.data import Dataset
class MyDataset(Dataset[ase.Atoms]):
"""A dataset that provides atomic structures.
This is the minimal interface required by MatterTune. Any data source that can be
mapped to ASE Atoms objects can be wrapped in this interface.
"""
def __init__(self, data_source: str):
"""Initialize the dataset.
Args:
data_source: Path to data or other source identifier
"""
self.data = ... # Load your data
def __len__(self) -> int:
"""Return the number of structures in the dataset."""
return len(self.data)
@override
def __getitem__(self, idx: int) -> ase.Atoms:
"""Return the atomic structure at given index.
Args:
idx: Index of the desired structure
Returns:
ase.Atoms: The atomic structure
"""
return self.data[idx]
This simple abstraction enables support for any data source that can provide atomic structures, providing several key benefits:
Universal compatibility with existing materials science formats
Zero assumptions about internal data storage
Natural integration with ASE’s ecosystem
Flexibility to support any data source that can be mapped to atomic structures
2. Backbone Abstraction
Rather than enforcing a specific internal architecture, MatterTune defines backbones through their capability to predict properties:
class ModelOutput(TypedDict):
predicted_properties: dict[str, torch.Tensor]
"""Predicted properties. This dictionary should be exactly
in the same shape/format as the output of `batch_to_labels`."""
backbone_output: NotRequired[Any]
"""Output of the backbone model. Only set if `return_backbone_output` is True."""
class FinetuneModuleBase(Generic[TData, TBatch]):
@abstractmethod
def atoms_to_data(self, atoms: Atoms, has_labels: bool) -> TData:
"""Convert atoms to model-specific data format"""
@abstractmethod
def collate_fn(self, data_list: list[TData]) -> TBatch:
"""Collate individual data points into a batch"""
@abstractmethod
def model_forward(self, batch: TBatch) -> ModelOutput:
"""Predict properties from a batch"""
This design:
Allows models to use their native data structures (TData, TBatch)
Separates property schema from implementation details
Enables efficient batch processing specific to each architecture
Provides clear extension points for new model types
3. Property Schema
Properties are defined through a declarative schema system:
class EnergyPropertyConfig:
"""Configuration for total energy prediction."""
name: str = "energy" # Fixed name for energy property
loss: LossConfig # Loss function configuration
loss_coefficient: float = 1.0 # Weight in total loss
class ForcesPropertyConfig:
"""Configuration for atomic forces prediction."""
name: str = "forces"
loss: LossConfig
loss_coefficient: float = 1.0
conservative: bool # Whether forces are computed as energy gradients
class StressesPropertyConfig:
"""Configuration for stress tensor prediction."""
name: str = "stress"
loss: LossConfig
loss_coefficient: float = 1.0
conservative: bool # Whether stress is computed from energy
class GraphPropertyConfig:
"""Configuration for custom graph-level properties."""
name: str # User-defined property name
loss: LossConfig
loss_coefficient: float = 1.0
reduction: Literal["mean", "sum", "max"] # How to aggregate atomic features
PropertyConfig = TypeAliasType("PropertyConfig", EnergyPropertyConfig | ForcesPropertyConfig | StressesPropertyConfig | GraphPropertyConfig)
Benefits:
Clear separation between property definition and implementation
Type-safe property specifications
Provides built-in support for common properties (energy, forces, stress)
Support for complex property types
Flexible reduction strategies
Implementation Philosophy
The framework follows several key principles:
Minimal Assumptions: We make zero assumptions about internal model architectures or data structures beyond the basic interfaces.
Type Safety: All interfaces are fully typed, providing clear contracts and early error detection.
Separation of Concerns:
Property definitions are separate from implementations
Data processing is separate from model architecture
Training logic is separate from model definition
Extensibility First:
New backbones only need to implement core data conversion methods
Custom datasets only need to map to ase.Atoms
Property types can be extended without changing the core framework
Real-World Benefits
This design enables several powerful workflows:
Unified Fine-tuning: Train any supported model on any compatible dataset with a consistent API.
Easy Integration: Models automatically work with ASE calculators and MD engines.
Flexible Deployment: Models can be used for:
Molecular dynamics simulations
High-throughput screening
Structure prediction
Property prediction
Performance Optimization: Each model can implement optimal batch processing while maintaining a consistent interface.
Future Extensibility
The framework is designed to grow with the field:
New Architectures: Additional backbones can be added by implementing the core interfaces.
New Properties: The property schema system can be extended for new property types.
New Data Sources: Any data source that can map to ase.Atoms is supported.
New Applications: The clean interfaces enable integration with new workflows and tools.