Phaedra

Learning high-fidelity discrete tokenization for the physical sciences

Phaedra introduces an approach to discrete tokenization designed specifically for physical sciences data, including Earth observation, weather modeling, and climate assessment. This work addresses a fundamental challenge in applying modern deep learning to the physical sciences: how to transform high-dimensional data into sequences that can be efficiently learned, generated, and generalized.

The Challenge

Tokens are discrete representations that allow modern deep learning to scale by transforming high-dimensional data into sequences. However, existing tokenizers are designed for perceptually realistic image reconstruction and struggle to capture both fine details and precise magnitudes, two properties that are crucial for physical sciences applications.

Figure 1: Phaedra's tokenization pipeline. Top Left: Density field of compressible Euler equations showing fine details and ground truth. Bottom Left: The encoding-decoding process with amplitude and image tokens, demonstrating how Phaedra separates magnitude and shape information. Right: Visualization showing how Phaedra captures both high-frequency information and smooth structures, enabling accurate reconstruction of physical phenomena.

Key Innovation

Inspired by classical shape-gain quantization and basis function decomposition, Phaedra proposes a novel tokenization approach that:

  • Captures Fine Details: Preserves the intricate spatial patterns and structures in physical data
  • Maintains Precise Magnitudes: Accurately represents the quantitative values essential for scientific analysis
  • Dual-Stream Architecture: Separates amplitude (magnitude) and image (shape) information for optimal representation
  • Enables High-Fidelity Reconstruction: Achieves superior reconstruction quality across multiple metrics
  • Generalizes Across Domains: Demonstrates out-of-distribution generalization to unseen and entirely unknown PDEs, as well as real-world data
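The classical shape-gain quantization that inspires this design splits a vector into a scalar magnitude (gain) and a unit-norm direction (shape), quantizing each separately. A minimal NumPy sketch with toy codebooks (this is an illustration of the classical technique, not Phaedra's learned encoder):

```python
import numpy as np

def shape_gain_quantize(x, gain_levels, shape_codebook):
    """Classical shape-gain quantization: encode a vector as a quantized
    magnitude (gain index) plus a quantized unit-norm direction (shape index)."""
    gain = np.linalg.norm(x)
    shape = x / (gain + 1e-12)  # unit-norm direction
    g_idx = int(np.argmin(np.abs(gain_levels - gain)))   # nearest scalar gain
    s_idx = int(np.argmax(shape_codebook @ shape))       # best-aligned unit codeword
    return g_idx, s_idx

def shape_gain_dequantize(g_idx, s_idx, gain_levels, shape_codebook):
    """Reconstruct by rescaling the chosen shape codeword to the chosen gain."""
    return gain_levels[g_idx] * shape_codebook[s_idx]
```

Because magnitude is stored separately, a coarse shape codebook can still reproduce quantitatively accurate amplitudes, which is the property the Key Innovation list emphasizes.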

Technical Approach

Phaedra’s tokenization pipeline works in two streams:

  1. Amplitude Tokens: Encode the precise magnitudes and global coherent structures
  2. Image Tokens: Capture fine details and local high-frequency features
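A minimal two-stream sketch of this idea, assuming a toy patch-wise setup (the patch size, codebooks, and function names here are illustrative assumptions; Phaedra's actual tokenizer is learned):

```python
import numpy as np

def tokenize_field(field, patch, amp_levels, shape_codebook):
    """Toy two-stream tokenizer: for each patch, an amplitude token quantizes
    the patch magnitude and an image token quantizes its unit-norm shape."""
    H, W = field.shape
    amp_tokens, img_tokens = [], []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            p = field[i:i + patch, j:j + patch].ravel()
            amp = np.linalg.norm(p)
            shape = p / (amp + 1e-12)
            amp_tokens.append(int(np.argmin(np.abs(amp_levels - amp))))
            img_tokens.append(int(np.argmax(shape_codebook @ shape)))
    return amp_tokens, img_tokens

def detokenize_field(amp_tokens, img_tokens, amp_levels, shape_codebook,
                     H, W, patch):
    """Reconstruct the field patch by patch: amplitude token sets the scale,
    image token sets the local spatial pattern."""
    field = np.zeros((H, W))
    k = 0
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            p = amp_levels[amp_tokens[k]] * shape_codebook[img_tokens[k]]
            field[i:i + patch, j:j + patch] = p.reshape(patch, patch)
            k += 1
    return field
```

The point of the split is visible in the reconstruction: the image token alone fixes where structure sits within a patch, while the amplitude token alone fixes how large the values are.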

This decomposition, inspired by classical signal processing, allows the model to:

  • Focus on fine details while maintaining precise amplitudes
  • Learn global and local features separately
  • Achieve better reconstruction with fewer tokens
  • Generalize to new physical systems

Impact

This work bridges the gap between modern deep learning tokenization approaches and the rigorous requirements of physical sciences, enabling more accurate and reliable AI systems for scientific applications.