Animesh-Varma/STATERA

STATERA

Hidden Mass Estimation via Zero-Shot Sim-to-Real Kinematics using Frozen Temporal Tubelets

STATERA is a research framework that aims to extract the hidden Center of Mass (CoM) of opaque bodies from raw video using a partially-frozen V-JEPA backbone.

Note

Official Research Release

STATERA has been officially compiled into a paper. This README reflects the finalized benchmark data, terminology, and evaluation suites introduced in the research.

Asset Availability: The pre-trained model checkpoints, the HiddenMass-50K benchmark dataset (HDF5 format), and the full architectural ablation suites are available on Hugging Face.

License: Apache 2.0

Standard AI vision models and surface trackers struggle to find the true Center of Mass (CoM) of objects that are asymmetric and opaque. Because the inside is hidden, the problem is mathematically ill-posed for models that only analyze static images.

STATERA (Spatio-Temporal Analysis of Tensor Embeddings for Rigid-body Asymmetry) addresses this by watching how objects move over time. Built on top of Meta's V-JEPA 2.1 (ViT-L) vision foundation model, STATERA uses a parameter-efficient fine-tuning approach (~2.5M trainable parameters) to analyze raw video through pre-trained temporal representations. It learns to infer the location of the hidden internal mass from motion cues such as momentum and rotational torque.

Alongside the model, we are releasing the HiddenMass Benchmark: 50,000 procedurally generated MuJoCo trajectories (split into 40K train and 10K validation/test) and a rigorously annotated 63-sequence zero-shot real-world physical test set.


Resources & Checkpoints

Model on Hugging Face Dataset on Hugging Face

Note: All 50K simulation datasets, the real-world validation test set, and full ablation model weights are officially uploaded and ready for use.


Contents

Features • GUI Demos • How It Works • The Evaluation Suite • Benchmark • Model Zoo • Dataset & Submission • Tech Stack • Build • Contact


Features

  • Zero-Shot Sim-to-Real Kinematics: Train entirely in MuJoCo simulations and deploy zero-shot to unconstrained real-world footage. The model tracks true momentum, ignoring heavy real-world textural distractors.
  • Robust Temporal Adaptation: Handles unconstrained physical tumbles captured at 4K/60FPS, temporally downsampled to 4K/24FPS to align with the V-JEPA temporal priors.
  • Interactive Web UI: Features an interactive Material 3 HTML web application (demo/demo_app.py) out-of-the-box for evaluating continuous video sequences and visualizing the temporal CoM tracking.
  • Advanced Evaluation Suite: Includes CLI tools (reproduce_paper_tables.py) for computing the Normalized Center of Mass Error (N-CoME), Normalized Kinematic Jitter, Physics Capture Ratio, and the Unified HiddenMass Score (HMS). (Note: This is the exact code used to generate the tables for the paper. It is provided for reference but cannot be run directly as it requires the test set ground truth, which is kept private to protect the benchmark's integrity).
  • Automated Environment Management: A setup.py that strictly enforces Python 3.12, auto-generates your virtual environment, and applies a critical hotfix to bypass Meta's PyTorch caching bug on the vjepa2 repository.
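
The 60 FPS to 24 FPS resampling mentioned in the list above can be illustrated with a small sketch. The README does not say how the downsampling is actually performed, so the function below (`resample_indices`, a hypothetical name) is only one plausible approach: it picks the source frame nearest below each ideal output timestamp.

```python
def resample_indices(num_frames: int, src_fps: float = 60.0, dst_fps: float = 24.0) -> list[int]:
    """Pick source-frame indices that approximate dst_fps sampling of a src_fps clip.

    Assumption: simple frame dropping (no blending or interpolation).
    """
    if num_frames <= 0:
        return []
    step = src_fps / dst_fps  # 2.5 source frames per output frame for 60 -> 24
    count = int((num_frames - 1) / step) + 1  # output frames that fit in the clip
    return [int(i * step) for i in range(count)]  # floor of each ideal position


indices = resample_indices(60)  # one second of 60 FPS footage
print(len(indices), indices[:6])
```

Because 60/24 is not an integer, the kept frames alternate between gaps of two and three source frames.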

GUI Demos

[Video: Custom.mov] Evaluating Custom Uploaded Data

[Video: Demo.mov] Evaluating Pre-Loaded Demo Sequences

How It Works

The Problem with Spatial Models

Purely spatial vision models (like DINOv2) predictably point to an object's visual geometric center. While fine for uniformly dense objects, this fails catastrophically for asymmetric payloads (e.g., a hollow box with a heavy weight hidden on one side). Surface point trackers (like TAPIR and CoTracker) also fail due to severe motion blur and self-occlusion when a body tumbles.

The STATERA Pipeline

Because the Center of Mass is hidden, STATERA treats this as a dynamic tracking problem rather than a static image problem.

  1. Temporal Tubelets: A continuous 16-frame video sequence is compressed into latent spatio-temporal blocks (tubelets) via a partially-frozen V-JEPA 2.1 backbone.
  2. 1D Temporal Mixer: A 1D Convolution extracts the velocity gradients across the sequence, isolating inertial physics from the visual geometry.
  3. Multi-Task Decoder: The network utilizes a Spatial Preservation Decoder to maintain sub-pixel accuracy. It predicts a 2D continuous probability heatmap while simultaneously predicting a 1D Absolute Z-Depth to ensure the model learns perspective-invariant physics.
  4. Continuous Extraction: A Temperature-Scaled Soft-Argmax (τ = 1.0) smoothly extracts a continuous coordinate from the discrete probability grid to eliminate quantization noise and preserve the object's momentum.
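
Step 4 can be illustrated with a minimal soft-argmax sketch. This is not the repository's implementation: it uses plain Python lists instead of tensors for clarity, and the function name `soft_argmax_2d` is hypothetical. It takes the softmax of the grid scores divided by τ and returns the probability-weighted mean coordinate.

```python
import math


def soft_argmax_2d(logits: list[list[float]], tau: float = 1.0) -> tuple[float, float]:
    """Return the expected (u, v) grid coordinate under softmax(logits / tau)."""
    peak = max(max(row) for row in logits)  # subtract the max for numerical stability
    weights = [[math.exp((x - peak) / tau) for x in row] for row in logits]
    total = sum(sum(row) for row in weights)
    # u is the column (horizontal) expectation, v the row (vertical) expectation.
    u = sum(w * col for row in weights for col, w in enumerate(row)) / total
    v = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return u, v


# Two equally strong neighbouring cells: the estimate lands between them.
grid = [[0.0] * 4 for _ in range(4)]
grid[1][1] = 10.0
grid[1][2] = 10.0
print(soft_argmax_2d(grid))

# A completely flat heatmap yields the grid centre.
print(soft_argmax_2d([[0.0] * 4 for _ in range(4)]))
```

The second call shows why a diffuse heatmap is risky: the expectation of a flat distribution is simply the centre of the grid, regardless of where the mass actually is.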
View Architecture Diagram
graph LR
    Input["Raw Video Sequence<br/>V ∈ ℝ¹⁶ˣ³ˣᴴˣᵂ"] --> VJEPA["Meta V-JEPA 2.1<br/>(Partially Frozen Backbone) 🔒"]
    VJEPA --> Tubelets[/"Temporal Tubelets<br/>(T=8)"/]
    Tubelets --> Mixer["1D Temporal Conv<br/>Upsample to T=16<br/><b>[Temporal Dropout 25%]</b>"]
    Mixer --> Decoder["Spatial Preservation<br/>Decoder<br/><i>ConvTranspose2d</i> 🔓"]
    Decoder --> HeadA["Head A: 2D Spatial Heatmap<br/>(KL Divergence)"]
    Decoder --> HeadB["Head B: 1D Z-Depth Regularizer<br/>(Huber Loss)"]

    style Input fill:#f9f9f9,stroke:#666
    style VJEPA fill:#fff1f1,stroke:#f66
    style Tubelets fill:#f1f4ff,stroke:#66f
    style Mixer fill:#fff9f1,stroke:#f90
    style Decoder fill:#f1fff1,stroke:#0c0
    style HeadA fill:#fdf1ff,stroke:#a0a
    style HeadB fill:#f1ffff,stroke:#0aa

The Metric Illusion & Evaluation Suite

Evaluating kinematics strictly via absolute pixel distance is vulnerable to model "reward-hacking." We empirically identified a failure mode we call Expectation Collapse: a network outputs a large, highly uncertain probability blob whose expected position falls near the geometric centroid, which lowers Euclidean error without actually tracking the hidden mass.

To prevent this, the HiddenMass benchmark utilizes a strict evaluation suite:

  1. Normalized Center of Mass Error (N-CoME): Spatial Euclidean distance, normalized by the dynamically projected 2D bounding diagonal.
  2. Normalized Kinematic Jitter ($\tilde{J}$): Penalizes non-physical, high-frequency acceleration spikes caused by visual aliasing.
  3. Physics Capture Ratio: Measures true physical disentanglement by calculating the percentage of absolute distance the prediction successfully moves away from the visual centroid toward the true hidden mass.
  4. Unified HiddenMass Score (HMS): A composite score (25% Kinematic Accuracy, 25% Tracking Stability, 50% Physics Disentanglement) designed to explicitly penalize statistical expectation collapse and reward true inertial reasoning.
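
As a rough illustration of the first three metrics, here is a per-frame sketch. The exact formulas live in the paper and in reproduce_paper_tables.py; everything below is an assumption made for illustration, in particular the use of the second finite difference for jitter and a clamped projection for the capture ratio. All function names are hypothetical.

```python
import math

Point = tuple[float, float]


def n_come(pred: Point, true_com: Point, bbox_diag: float) -> float:
    """Euclidean error as a fraction of the projected 2D bounding diagonal."""
    return math.dist(pred, true_com) / bbox_diag


def kinematic_jitter(track: list[Point], bbox_diag: float) -> float:
    """Mean magnitude of the second finite difference (acceleration), normalized."""
    accels = [
        math.hypot(a[0] - 2 * b[0] + c[0], a[1] - 2 * b[1] + c[1])
        for a, b, c in zip(track, track[1:], track[2:])
    ]
    return sum(accels) / len(accels) / bbox_diag if accels else 0.0


def physics_capture(pred: Point, centroid: Point, true_com: Point) -> float:
    """Fraction of the centroid-to-CoM offset recovered by the prediction, in [0, 1]."""
    ox, oy = true_com[0] - centroid[0], true_com[1] - centroid[1]
    px, py = pred[0] - centroid[0], pred[1] - centroid[1]
    denom = ox * ox + oy * oy
    if denom == 0.0:  # CoM coincides with the centroid: nothing to capture
        return 0.0
    return min(1.0, max(0.0, (px * ox + py * oy) / denom))


centroid, true_com, pred = (100.0, 100.0), (140.0, 100.0), (116.0, 100.0)
print(n_come(pred, true_com, bbox_diag=200.0))
print(physics_capture(pred, centroid, true_com))
print(kinematic_jitter([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)], bbox_diag=200.0))
```

Under this definition a predictor that always returns the visual centroid scores exactly zero capture, which is consistent with the 0.00% reported for the Geometric Centroid baseline.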

(Note: The legacy KECS metric has been officially retired and is no longer evaluated or used, though it may still appear in segments of the codebase.)


Benchmark Results

Zero-Shot Real-World Transfer (N=63 Sequences)

Evaluated on unconstrained 4K/60FPS physical tumbles, downsampled to 4K/24FPS to align with V-JEPA temporal priors. Lower is better for N-CoME and Jitter. Higher is better for Physics Capture and Unified HMS.

| Model Configuration / Baseline | N-CoME (%) ↓ | Norm. Jitter ↓ | Physics Capture ↑ | Unified HMS ↑ |
|---|---|---|---|---|
| Geometric Centroid (Naive Physics) | 15.19% | 0.0052 | 0.00% | 44.9 |
| Google TAPIR | 22.78% | 0.0272 | 8.37% | 41.7 |
| Meta CoTracker2 | 23.99% | 0.0297 | 8.62% | 40.9 |
| Standard 3D-CNN (ResNet3D) | 42.24% | 0.0394 | 5.95% | 32.6 |
| Spatial Foundation (DINOv2) | 16.87% | 0.0307 | 2.62% | 39.4 |
| Temporal Foundation (VideoMAE v2) | 22.53% | 0.0368 | 18.17% | 44.2 |
| STATERA-1K-No-Z-Depth (Ablation) | 45.11% | 0.0471 | 36.21% | 45.1 |
| STATERA-1K-Frozen-Anchor (Ablation) | 38.74% | 0.0167 | 25.76% | 49.0 |
| STATERA-1K-Anchor (Ablation) | 20.12% | 0.0293 | 31.19% | 53.2 |
| STATERA-1K-Standard-Sigma (Ablation) | 14.77% | 0.0173 | 6.45% | 45.2 |
| STATERA-50K-Sigma (Phase-Agnostic) | 13.44% | 0.0193 | 24.15% | 53.9 |
| STATERA-50K-Crescent (Phase-Aware) | 16.09% | 0.0274 | 40.96% | 59.6 |

Analysis: Point-tracking methods (TAPIR, CoTracker) suffer from kinematic survivorship bias during self-occlusion rather than exhibiting true inertial tracking. While STATERA-50K-Sigma achieves the lowest N-CoME in the table (13.44%) and lower jitter than STATERA-50K-Crescent, disentanglement analysis reveals this is partially an illusion driven by statistical expectation collapse. STATERA-50K-Crescent serves as the true Kinematic SOTA, breaking the centroid heuristic to achieve a 40.96% Physics Capture Ratio and the highest overall Unified HMS (59.6). It successfully isolates the directional phase (momentum) of the hidden mass, though its strict precision makes it vulnerable to minor Euclidean overshoot.
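
The README states only the HMS weights (25% / 25% / 50%), not how each metric is converted to a sub-score. One mapping that reproduces the Unified HMS column to within rounding is sketched below. It was inferred by checking against the reported rows, not taken from the paper, so treat the sub-score definitions (especially the factor of 10 on jitter) as assumptions; the official code may also clamp values.

```python
def unified_hms(n_come_pct: float, norm_jitter: float, capture_pct: float) -> float:
    """Weighted composite score on a 0-100 scale (inferred mapping, not official)."""
    accuracy = 100.0 - n_come_pct                    # kinematic accuracy sub-score
    stability = 100.0 * (1.0 - 10.0 * norm_jitter)   # assumed jitter scaling
    return 0.25 * accuracy + 0.25 * stability + 0.5 * capture_pct


# (N-CoME %, normalized jitter, physics capture %, reported HMS) from the table above.
rows = {
    "Geometric Centroid": (15.19, 0.0052, 0.00, 44.9),
    "Google TAPIR": (22.78, 0.0272, 8.37, 41.7),
    "STATERA-50K-Sigma": (13.44, 0.0193, 24.15, 53.9),
    "STATERA-50K-Crescent": (16.09, 0.0274, 40.96, 59.6),
}
for name, (err, jit, cap, reported) in rows.items():
    print(f"{name}: computed {unified_hms(err, jit, cap):.2f}, reported {reported}")
```

Seen this way, the 50% weight on Physics Capture is what lifts STATERA-50K-Crescent above STATERA-50K-Sigma despite its higher N-CoME and jitter.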


Model Zoo & Ablations

The official pre-trained checkpoints for STATERA are available on Hugging Face.

Important Note on File Size: Each ViT-Large based checkpoint is roughly 1.25 GB. Because we unfroze the final two transformer blocks of the V-JEPA backbone to adapt its latent space to Newtonian physics, our checkpoints save the entire integrated state dictionary (the full backbone + custom decoder) for seamless out-of-the-box inference.

Primary Models

  • STATERA-50K-Crescent.pth (The Kinematic SOTA): Trained with phase-aware spatial targets (Von Mises angular mask decaying to a point). Achieves the highest physical disentanglement (40.96% Physics Capture) but is occasionally subject to visual-kinematic aliasing (bimodal splits) due to settling-state simulator biases.
  • STATERA-50K-Sigma.pth (The Quantitative SOTA): Trained with phase-agnostic Isotropic Gaussian targets. Highly robust with low Euclidean jitter, but suffers from expectation collapse, resulting in a more diffuse prediction heatmap and lower physical disentanglement.

The Ablation Suite (/ablations)

We provide a comprehensive set of baseline comparison models for research reproducibility, matching the exact ablations detailed in the paper:

  • STATERA-1K-DINOv2.pth (1.25 GB): A purely spatial foundation model. Proves that temporal convolutions applied post-extraction cannot recover lost intra-frame momentum, causing collapse to the visual centroid.
  • STATERA-1K-VideoMAE.pth (1.25 GB): Evaluates VideoMAE v2. Its pixel-level reconstruction objective forces memorization of visual surface textures, showing that kinematic extraction requires predictive latent physics (like V-JEPA), not just spatio-temporal attention.
  • STATERA-1K-ResNet3D.pth (146 MB): A standard 3D-CNN temporal baseline. Lacks latent tubelet priors, resulting in wild overshooting artifacts. (Smaller file size as it lacks the ViT-Large backbone).
  • STATERA-1K-No-Z-Depth.pth (1.25 GB): Ablates the 1D Z-Depth regularizer. Demonstrates that removing absolute 3D depth supervision causes the network to lose physical scale constraints and severely overshoot the object's bounds.
  • STATERA-1K-Frozen-Anchor.pth (1.25 GB): Freezes the final two V-JEPA transformer blocks, proving the necessity of fine-tuning temporal representations specifically for kinematic tasks.
  • STATERA-1K-Anchor.pth (1.25 GB): Standard low-data baseline demonstrating spatial overfitting/temporal starvation.
  • STATERA-1K-Standard-Sigma.pth (1.25 GB): Baseline testing standard Gaussian smoothing without variance-decay curriculum.
  • STATERA-1K-Static-Dot.pth (1.25 GB): Baseline testing a static coordinate dot, causing severe gradient instability.

Dataset & Leaderboard Submission

To protect the integrity of the public benchmark, the HiddenMass-50K dataset is provided in highly optimized HDF5 format and strictly split between Training data (with ground-truth labels) and Blind Testing data.

| Split | File Name | Size | Total Sequences | Ground Truth Included |
|---|---|---|---|---|
| Train | HiddenMass-50K-Train.hdf5 | 246 GB | 50,000 (40K Active)* | ✔ Yes |
| Ablation | 1K-ablation.hdf5 | 4.89 GB | 1,000 | ✔ Yes |
| Test (Public) | HiddenMass-50K-Test-Public.hdf5 | 49.5 GB | 10,063 (10K Sim, 63 Real) | ✘ No (For Benchmark) |

*10,000 validation sequences inside the 50K HDF5 have been intentionally zeroed-out in-place to prevent leakage. You must split the dataset using PyTorch's random_split with a fixed seed of 42.
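
A minimal sketch of such a split is shown below. It assumes an 80/20 division (40,000 / 10,000 for the full file), that the training subset comes first, and that a dedicated generator seeded with 42 matches the split used by the authors; none of these details are confirmed in this README, so check the training scripts before relying on it. `split_train_val` is a hypothetical helper name.

```python
import torch
from torch.utils.data import Dataset, Subset, TensorDataset, random_split


def split_train_val(dataset: Dataset, seed: int = 42) -> list[Subset]:
    """Deterministically split a dataset into 80% train and 20% validation."""
    generator = torch.Generator().manual_seed(seed)  # fixed seed for reproducibility
    n_val = len(dataset) // 5                        # 10,000 of 50,000 sequences
    n_train = len(dataset) - n_val
    return random_split(dataset, [n_train, n_val], generator=generator)


# Stand-in dataset; in practice this would wrap HiddenMass-50K-Train.hdf5.
toy = TensorDataset(torch.arange(50))
train_set, val_set = split_train_val(toy)
print(len(train_set), len(val_set))
```

A quick sanity check after splitting the real file is to confirm that the sequences in the validation subset are the zeroed-out ones.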

Evaluating Your Model

Currently, benchmark evaluation is handled manually to prevent leaderboard probing.

  1. Generate predictions for the HiddenMass-50K-Test-Public.hdf5 dataset.
  2. Output a standard JSON file containing your predicted (u,v) coordinates for each frame.
  3. Email the JSON file to statera@animeshvarma.dev or animesh.varma.research@gmail.com.

You will receive your results (N-CoME, Normalized Jitter, Physics Capture Ratio, and Unified HMS) in 2 to 3 business days and an invitation to add your model to the official HiddenMass Public Leaderboard.
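
The README does not specify the JSON schema for submissions. The sketch below shows one hypothetical layout (a mapping from sequence ID to a list of per-frame `u`, `v` entries) purely to illustrate step 2; the key names and sequence IDs are invented, so confirm the expected format with the maintainers before submitting.

```python
import json


def build_submission(predictions: dict[str, list[tuple[float, float]]]) -> str:
    """Serialize per-frame (u, v) predictions to a JSON string (hypothetical schema)."""
    payload = {
        seq_id: [
            {"frame": i, "u": float(u), "v": float(v)}
            for i, (u, v) in enumerate(track)
        ]
        for seq_id, track in predictions.items()
    }
    return json.dumps(payload, indent=2)


# "seq_00000" is a made-up identifier used only for this example.
example = {"seq_00000": [(112.4, 96.8), (113.1, 97.5)]}
print(build_submission(example))
```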


Technical Stack

  • Language: Python 3.12 (Strictly Enforced)
  • Framework: PyTorch
  • Vision Foundation Model: Meta V-JEPA 2.1 (ViT-L)
  • Simulation Environment: MuJoCo
  • Web App / UI: Material 3 HTML (demo/demo_app.py)

Quick Start & Build Instructions

Ensure you have Python 3.12 installed. The repository includes an automated interactive setup.py that handles virtual environment creation, Meta V-JEPA dependency patching, and Hugging Face checkpoint downloads.

Note: The setup script is actively tested on Arch Linux and macOS. Support for other Linux distributions and Windows is currently in beta.

# Clone the repository
git clone https://github.com/Animesh-Varma/statera-hidden-mass.git
cd statera-hidden-mass

# Run the automated setup script
python setup.py

# Activate the newly created virtual environment
source .venv/bin/activate

# Launch the interactive web demo to test the model checkpoints out-of-the-box
python demo/demo_app.py

Contact

Note: I am a high school student building this project in my spare time. This is my first step into research and my first paper, so it is an ongoing learning process. Contributors, pull requests, and general advice are always more than welcome!

If you have questions, feedback, or compute resources to help scale the 50K model:

  • Project/Dev Email: statera@animeshvarma.dev
  • Research Email: animesh.varma.research@gmail.com
