Hidden Mass Estimation via Zero-Shot Sim-to-Real Kinematics using Frozen Temporal Tubelets
STATERA is a research framework that aims to extract the hidden Center of Mass (CoM) of opaque bodies from raw video using a partially-frozen V-JEPA backbone.
Note
Official Research Release
STATERA has been officially compiled into a paper. This README reflects the finalized benchmark data, terminology, and evaluation suites introduced in the research.
Asset Availability: The pre-trained model checkpoints, the HiddenMass-50K benchmark dataset (HDF5 format), and the full architectural ablation suites are completely uploaded and live on Hugging Face!
Standard AI vision models and surface trackers struggle to find the true Center of Mass (CoM) of objects that are asymmetric and opaque. Because the inside is hidden, the problem is mathematically ill-posed for models that only analyze static images.
STATERA (Spatio-Temporal Analysis of Tensor Embeddings for Rigid-body Asymmetry) solves this by watching how objects move over time. Built on top of Meta's V-JEPA 2.1 (ViT-L) vision foundation model, STATERA uses a parameter-efficient fine-tuning approach (~2.5M trainable parameters) to analyze raw video through pre-trained temporal representations. It learns to infer hidden internal mass directly from real-world physics, momentum, and rotational torque.
Alongside the model, we are releasing the HiddenMass Benchmark: 50,000 procedurally generated MuJoCo trajectories (split into 40K train and 10K validation/test) and a rigorously annotated 63-sequence zero-shot real-world physical test set.
Note: The full 50K simulation dataset, the real-world test set, and all ablation model weights are uploaded and ready for use.
Features • GUI Demos • How It Works • The Evaluation Suite • Benchmark • Model Zoo • Dataset & Submission • Tech Stack • Build • Contact
- Zero-Shot Sim-to-Real Kinematics: Train entirely in MuJoCo simulations and deploy zero-shot to unconstrained real-world footage. The model tracks true momentum, ignoring heavy real-world textural distractors.
- Robust Temporal Adaptation: Handles unconstrained physical tumbles captured at 4K/60FPS, gracefully downsampled to 4K/24FPS to align with the foundational temporal priors.
- Interactive Web UI: Features an interactive Material 3 HTML web application (`demo/demo_app.py`) out-of-the-box for evaluating continuous video sequences and visualizing the temporal CoM tracking.
- Advanced Evaluation Suite: Includes CLI tools (`reproduce_paper_tables.py`) for computing the Normalized Center of Mass Error (N-CoME), Normalized Kinematic Jitter, Physics Capture Ratio, and the Unified HiddenMass Score (HMS). (Note: This is the exact code used to generate the tables for the paper. It is provided for reference but cannot be run directly as it requires the test set ground truth, which is kept private to protect the benchmark's integrity).
- Bulletproof Environment Management: A robust `setup.py` that strictly enforces Python 3.12, auto-generates your virtual environment, and applies a critical hotfix to bypass Meta's PyTorch caching bug on the `vjepa2` repository.
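The 60 FPS to 24 FPS step can be pictured as index selection, since the ratio is 2.5 source frames per output frame. The sketch below assumes nearest-frame selection; this README does not say whether frames are dropped, blended, or resampled with an external tool, and `downsample_indices` is a hypothetical helper, not part of the repository.

```python
def downsample_indices(num_frames: int, src_fps: float = 60.0, dst_fps: float = 24.0) -> list[int]:
    """Return source-frame indices that approximate dst_fps by nearest-frame selection."""
    step = src_fps / dst_fps              # 2.5 source frames per output frame
    count = int(num_frames / step)        # number of output frames that fit
    # int(x + 0.5) rounds half up; clamp to the last valid source index.
    return [min(num_frames - 1, int(i * step + 0.5)) for i in range(count)]


print(downsample_indices(10))        # [0, 3, 5, 8]
print(len(downsample_indices(60)))   # 24 output frames per second of 60 FPS footage
```

Under this assumption, one 16-frame clip at 24 FPS spans 40 source frames, or two thirds of a second of the original footage.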
GUI demo videos: `Custom.mov` (Evaluating Custom Uploaded Data) and `Demo.mov` (Evaluating Pre-Loaded Demo Sequences).
Standard object-detection pipelines (like DINOv2) predictably point to an object's visual geometric center. While fine for uniformly dense objects, this fails catastrophically for asymmetric payloads (e.g., a hollow box with a heavy weight hidden on one side). Surface point trackers (like TAPIR and CoTracker) also fail due to severe motion blur and self-occlusion when a body tumbles.
Because the Center of Mass is hidden, STATERA treats this as a dynamic tracking problem rather than a static image problem.
- Temporal Tubelets: A continuous 16-frame video sequence is compressed into latent spatio-temporal blocks (tubelets) via a partially-frozen V-JEPA 2.1 backbone.
- 1D Temporal Mixer: A 1D Convolution extracts the velocity gradients across the sequence, isolating inertial physics from the visual geometry.
- Multi-Task Decoder: The network utilizes a Spatial Preservation Decoder to maintain sub-pixel accuracy. It predicts a 2D continuous probability heatmap while simultaneously predicting a 1D Absolute Z-Depth to ensure the model learns perspective-invariant physics.
- Continuous Extraction: A Temperature-Scaled Soft-Argmax (τ = 1.0) smoothly extracts a continuous coordinate from the discrete probability grid to eliminate quantization noise and preserve the object's momentum.
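The last step in the list above, the temperature-scaled soft-argmax, can be illustrated in a few lines. This is a minimal NumPy sketch under stated assumptions: the function name, the grid size, and the use of NumPy instead of PyTorch are illustrative choices, not the repository's implementation. It shows how a softmax-weighted expectation returns a sub-cell coordinate where a hard argmax would snap to a single cell.

```python
import numpy as np


def soft_argmax_2d(logits: np.ndarray, tau: float = 1.0) -> tuple[float, float]:
    """Expected (u, v) coordinate under a softmax over an H x W score grid."""
    h, w = logits.shape
    scaled = logits / tau
    scaled = scaled - scaled.max()        # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()                  # normalize to a probability map
    vs, us = np.mgrid[0:h, 0:w]           # row (v) and column (u) index grids
    return float((probs * us).sum()), float((probs * vs).sum())


# A peak split evenly between columns 2 and 3 of row 1 (hypothetical scores).
grid = np.full((4, 6), -20.0)
grid[1, 2] = grid[1, 3] = 10.0
u, v = soft_argmax_2d(grid, tau=1.0)
print(round(u, 3), round(v, 3))           # 2.5 1.0
```

A lower `tau` sharpens the distribution toward the hard argmax, while a higher `tau` pulls the estimate toward the grid center; the README states τ = 1.0.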
View Architecture Diagram
```mermaid
graph LR
    Input["Raw Video Sequence<br/>V ∈ ℝ¹⁶ˣ³ˣᴴˣᵂ"] --> VJEPA["Meta V-JEPA 2.1<br/>(Partially Frozen Backbone) 🔒"]
    VJEPA --> Tubelets[/"Temporal Tubelets<br/>(T=8)"/]
    Tubelets --> Mixer["1D Temporal Conv<br/>Upsample to T=16<br/><b>[Temporal Dropout 25%]</b>"]
    Mixer --> Decoder["Spatial Preservation<br/>Decoder<br/><i>ConvTranspose2d</i> 🔓"]
    Decoder --> HeadA["Head A: 2D Spatial Heatmap<br/>(KL Divergence)"]
    Decoder --> HeadB["Head B: 1D Z-Depth Regularizer<br/>(Huber Loss)"]
    style Input fill:#f9f9f9,stroke:#666
    style VJEPA fill:#fff1f1,stroke:#f66
    style Tubelets fill:#f1f4ff,stroke:#66f
    style Mixer fill:#fff9f1,stroke:#f90
    style Decoder fill:#f1fff1,stroke:#0c0
    style HeadA fill:#fdf1ff,stroke:#a0a
    style HeadB fill:#f1ffff,stroke:#0aa
```
Evaluating kinematics strictly via absolute pixel distance is vulnerable to model "reward-hacking." We empirically identified a fundamental tracking duality we call Expectation Collapse. A network will output a massive, highly-uncertain probability blob that naturally collapses near the geometric centroid just to play it safe, artificially gaming Euclidean error metrics without actually tracking the hidden mass.
To prevent this, the HiddenMass benchmark utilizes a strict evaluation suite:
- Normalized Center of Mass Error (N-CoME): Spatial Euclidean distance, normalized by the dynamically projected 2D bounding diagonal.
- Normalized Kinematic Jitter ($\tilde{J}$): Penalizes non-physical, high-frequency acceleration spikes caused by visual aliasing.
- Physics Capture Ratio: Measures true physical disentanglement by calculating the percentage of absolute distance the prediction successfully moves away from the visual centroid toward the true hidden mass.
- Unified HiddenMass Score (HMS): A composite score (25% Kinematic Accuracy, 25% Tracking Stability, 50% Physics Disentanglement) designed to explicitly penalize statistical expectation collapse and reward true inertial reasoning.
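To make the first three metrics concrete, here is one plausible reading of them in NumPy. The exact formulas live in the paper and in `reproduce_paper_tables.py`; everything below is an assumption made for illustration: the jitter is taken as the mean second finite difference of the predicted track, the capture ratio as the clipped projection of the prediction's offset onto the centroid-to-CoM direction, and all function names are hypothetical.

```python
import numpy as np


def n_come(pred, gt, diag):
    """Mean per-frame Euclidean error as a percentage of the bounding diagonal."""
    return 100.0 * float(np.mean(np.linalg.norm(pred - gt, axis=1) / diag))


def normalized_jitter(pred, diag):
    """Mean magnitude of the second finite difference, normalized by the diagonal (assumed form)."""
    accel = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    return float(np.mean(np.linalg.norm(accel, axis=1) / diag[1:-1]))


def physics_capture(pred, gt, centroid):
    """Percentage of the centroid-to-CoM offset recovered along the true direction (assumed form).

    Frames where the true CoM coincides with the centroid would divide by zero
    and need to be filtered out beforehand.
    """
    offset = gt - centroid
    moved = np.sum((pred - centroid) * offset, axis=1)
    total = np.sum(offset * offset, axis=1)
    return 100.0 * float(np.mean(np.clip(moved / total, 0.0, 1.0)))


T = 5
centroid = np.tile([50.0, 50.0], (T, 1))    # visual centroid per frame (pixels)
gt = np.tile([60.0, 50.0], (T, 1))          # true CoM, 10 px to the right
pred = np.tile([54.0, 50.0], (T, 1))        # prediction moved 4 px toward it
diag = np.full(T, 100.0)                    # projected bounding diagonal per frame

print(n_come(pred, gt, diag))               # about 6.0 (%)
print(normalized_jitter(pred, diag))        # 0.0 for a static prediction
print(physics_capture(pred, gt, centroid))  # about 40.0 (%)
```

In this toy case a prediction sitting exactly on the centroid would score a 10% N-CoME but a 0% capture ratio, which is the behaviour the suite is designed to expose.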
(Note: The legacy KECS metric has been officially retired and is no longer evaluated or used, though it may still appear in segments of the codebase.)
Evaluated on unconstrained 4K/60FPS physical tumbles, downsampled to 4K/24FPS to align with V-JEPA temporal priors. Lower is better for N-CoME and Jitter. Higher is better for Physics Capture and Unified HMS.
| Model Configuration / Baseline | N-CoME (%) ↓ | Norm. Jitter ↓ | Physics Capture ↑ | Unified HMS ↑ |
|---|---|---|---|---|
| Geometric Centroid (Naive Physics) | 15.19% | 0.0052 | 0.00% | 44.9 |
| Google TAPIR | 22.78% | 0.0272 | 8.37% | 41.7 |
| Meta CoTracker2 | 23.99% | 0.0297 | 8.62% | 40.9 |
| Standard 3D-CNN (ResNet3D) | 42.24% | 0.0394 | 5.95% | 32.6 |
| Spatial Foundation (DINOv2) | 16.87% | 0.0307 | 2.62% | 39.4 |
| Temporal Foundation (VideoMAE v2) | 22.53% | 0.0368 | 18.17% | 44.2 |
| STATERA-1K-No-Z-Depth (Ablation) | 45.11% | 0.0471 | 36.21% | 45.1 |
| STATERA-1K-Frozen-Anchor (Ablation) | 38.74% | 0.0167 | 25.76% | 49.0 |
| STATERA-1K-Anchor (Ablation) | 20.12% | 0.0293 | 31.19% | 53.2 |
| STATERA-1K-Standard-Sigma (Ablation) | 14.77% | 0.0173 | 6.45% | 45.2 |
| STATERA-50K-Sigma (Phase-Agnostic) | 13.44% | 0.0193 | 24.15% | 53.9 |
| STATERA-50K-Crescent (Phase-Aware) | 16.09% | 0.0274 | 40.96% | 59.6 |
Analysis: Point-tracking methods (TAPIR, CoTracker) suffer from kinematic survivorship bias during self-occlusion rather than exhibiting true inertial tracking. While `STATERA-50K-Sigma` achieves the lowest N-CoME (13.44%) and low jitter, disentanglement analysis reveals this is partially an illusion driven by statistical expectation collapse. `STATERA-50K-Crescent` serves as the true Kinematic SOTA, breaking the centroid heuristic to achieve a 40.96% Physics Capture Ratio (and the highest overall Unified HMS). It successfully isolates the directional phase (momentum) of the hidden mass, though its strict precision makes it vulnerable to minor Euclidean overshoot.
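The 25/25/50 weighting of the Unified HMS can be checked against the table with a little arithmetic. The sub-score definitions below are assumptions inferred from the published rows, not taken from the evaluation code: accuracy is taken as 100 minus N-CoME, and stability as 100 minus 1000 times the jitter, both clamped at zero. With those assumptions the formula lands within about 0.1 of every reported HMS value.

```python
def unified_hms(n_come_pct: float, jitter: float, capture_pct: float) -> float:
    """Composite score with the README's 25/25/50 weights (sub-scores are assumed forms)."""
    accuracy = max(0.0, 100.0 - n_come_pct)        # assumed: kinematic accuracy
    stability = max(0.0, 100.0 - 1000.0 * jitter)  # assumed: tracking stability
    return 0.25 * accuracy + 0.25 * stability + 0.50 * capture_pct


print(round(unified_hms(16.09, 0.0274, 40.96), 1))  # STATERA-50K-Crescent row: 59.6
print(round(unified_hms(15.19, 0.0052, 0.00), 1))   # Geometric Centroid row: 44.9
```

The centroid baseline shows why half of the weight sits on Physics Capture: strong accuracy and stability alone cap out below 50.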
The official pre-trained checkpoints for STATERA are available on Hugging Face.
Important Note on File Size: Each ViT-Large based checkpoint is roughly 1.25 GB. Because we unfroze the final two transformer blocks of the V-JEPA backbone to adapt its latent space to Newtonian physics, our checkpoints save the entire integrated state dictionary (the full backbone + custom decoder) for seamless out-of-the-box inference.
- `STATERA-50K-Crescent.pth` (The Kinematic SOTA): Trained with phase-aware spatial targets (Von Mises angular mask decaying to a point). Achieves the highest physical disentanglement (40.96% Physics Capture) but is occasionally subject to visual-kinematic aliasing (bimodal splits) due to settling-state simulator biases.
- `STATERA-50K-Sigma.pth` (The Quantitative SOTA): Trained with phase-agnostic Isotropic Gaussian targets. Highly robust with low Euclidean jitter, but suffers from expectation collapse, resulting in a more diffuse prediction heatmap and lower physical disentanglement.
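For readers unfamiliar with heatmap supervision, the sketch below builds an isotropic Gaussian target of the kind the Sigma variant is described as using, and compares two heatmaps with a KL divergence, the loss named for Head A in the architecture diagram. The grid size, the `sigma` values, and both function names are hypothetical, and the Von Mises target used by the Crescent variant is not reproduced here.

```python
import numpy as np


def gaussian_target(h, w, u, v, sigma):
    """Isotropic Gaussian heatmap centered at (u, v), normalized to sum to 1."""
    vs, us = np.mgrid[0:h, 0:w]
    g = np.exp(-((us - u) ** 2 + (vs - v) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two heatmaps that each sum to 1."""
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))


sharp = gaussian_target(32, 32, u=20.5, v=11.0, sigma=1.5)
diffuse = gaussian_target(32, 32, u=20.5, v=11.0, sigma=6.0)

print(kl_divergence(sharp, sharp))          # 0.0 for identical maps
print(kl_divergence(sharp, diffuse) > 0.0)  # True: a diffuse blob is penalized
```

Both maps here share the same center, so a coordinate read off either one would match; only the divergence term distinguishes a confident prediction from a diffuse one.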
We provide a comprehensive set of baseline comparison models for research reproducibility, matching the exact ablations detailed in the paper:
- `STATERA-1K-DINOv2.pth` (1.25 GB): A purely spatial foundation model. Proves that temporal convolutions applied post-extraction cannot recover lost intra-frame momentum, causing collapse to the visual centroid.
- `STATERA-1K-VideoMAE.pth` (1.25 GB): Evaluates VideoMAE v2. Its pixel-level reconstruction objective forces memorization of visual surface textures, showing that kinematic extraction requires predictive latent physics (like V-JEPA), not just spatio-temporal attention.
- `STATERA-1K-ResNet3D.pth` (146 MB): A standard 3D-CNN temporal baseline. Lacks latent tubelet priors, resulting in wild overshooting artifacts. (Smaller file size as it lacks the ViT-Large backbone).
- `STATERA-1K-No-Z-Depth.pth` (1.25 GB): Ablates the 1D Z-Depth regularizer. Demonstrates that removing absolute 3D depth supervision causes the network to lose physical scale constraints and severely overshoot the object's bounds.
- `STATERA-1K-Frozen-Anchor.pth` (1.25 GB): Freezes the final two V-JEPA transformer blocks, proving the necessity of fine-tuning temporal representations specifically for kinematic tasks.
- `STATERA-1K-Anchor.pth` (1.25 GB): Standard low-data baseline demonstrating spatial overfitting/temporal starvation.
- `STATERA-1K-Standard-Sigma.pth` (1.25 GB): Baseline testing standard Gaussian smoothing without variance-decay curriculum.
- `STATERA-1K-Static-Dot.pth` (1.25 GB): Baseline testing a static coordinate dot, causing severe gradient instability.
To protect the integrity of the public benchmark, the HiddenMass-50K dataset is provided in highly optimized HDF5 format and strictly split between Training data (with ground-truth labels) and Blind Testing data.
| Split | File Name | Size | Total Sequences | Ground Truth Included |
|---|---|---|---|---|
| Train | `HiddenMass-50K-Train.hdf5` | 246 GB | 50,000 (40K Active)* | ✔ Yes |
| Ablation | `1K-ablation.hdf5` | 4.89 GB | 1,000 | ✔ Yes |
| Test (Public) | `HiddenMass-50K-Test-Public.hdf5` | 49.5 GB | 10,063 (10K Sim, 63 Real) | ✘ No (For Benchmark) |
*10,000 validation sequences inside the 50K HDF5 have been intentionally zeroed-out in-place to prevent leakage. You must split the dataset using PyTorch's `random_split` with a fixed seed of `42`.
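A seeded split of this kind might look like the sketch below. Two details are assumptions, because this README does not state them: that the seed is supplied through a dedicated `torch.Generator` (rather than a global `torch.manual_seed(42)`), and that the lengths are passed in the order 40,000 then 10,000. Either choice changes which indices land in which subset, so check the repository's training script before relying on it. The `TensorDataset` is only a stand-in for the real HDF5-backed dataset.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in for the 50,000-sequence HDF5 dataset: one index per sequence.
dataset = TensorDataset(torch.arange(50_000))

# Assumed seeding style: a dedicated generator with seed 42.
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(dataset, [40_000, 10_000], generator=generator)

print(len(train_set), len(val_set))  # 40000 10000
```

Because the zeroed-out validation sequences carry no usable labels, a quick sanity check is to confirm that none of the sequences in the resulting training subset are all zeros.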
Currently, benchmark evaluation is handled manually to prevent leaderboard probing.
- Generate predictions for the `HiddenMass-50K-Test-Public.hdf5` dataset.
- Output a standard JSON file containing your predicted `(u,v)` coordinates for each frame.
- Email the JSON file to statera@animeshvarma.dev or animesh.varma.research@gmail.com.
You will receive your results (N-CoME, Normalized Jitter, Physics Capture Ratio, and Unified HMS) in 2 to 3 business days and an invitation to add your model to the official HiddenMass Public Leaderboard.
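This README does not specify the JSON schema for submissions, so the layout below is purely a hypothetical illustration: a mapping from a sequence identifier to a list of `[u, v]` pairs, one per frame. The key names, the file name, and the coordinate values are all made up; confirm the expected format with the maintainers before submitting.

```python
import json

# Hypothetical layout: sequence id -> list of [u, v] pairs, one per frame.
predictions = {
    "sequence_00000": [[112.4, 87.9], [113.1, 88.6]],
    "sequence_00001": [[64.0, 70.2], [64.8, 71.0]],
}

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f)

# Round-trip check before sending the file.
with open("predictions.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(all(len(point) == 2 for frames in loaded.values() for point in frames))  # True
```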
- Language: Python 3.12 (Strictly Enforced)
- Framework: PyTorch
- Vision Foundation Model: Meta V-JEPA 2.1 (ViT-L)
- Simulation Environment: MuJoCo
- Web App / UI: Material 3 HTML (`demo/demo_app.py`)
Ensure you have Python 3.12 installed. The repository includes an automated interactive setup.py that handles virtual environment creation, Meta V-JEPA dependency patching, and Hugging Face checkpoint downloads.
Note: The setup script is actively tested on Arch Linux and macOS. Support for other Linux distributions and Windows is currently in beta.
```bash
# Clone the repository
git clone https://github.com/Animesh-Varma/statera-hidden-mass.git
cd statera-hidden-mass

# Run the automated setup script
python setup.py

# Activate the newly created virtual environment
source .venv/bin/activate

# Launch the interactive web demo to test the model checkpoints out-of-the-box
python demo/demo_app.py
```

Note: I am a high school student building this project in my spare time. This is my first step into research and my first paper, so it is an ongoing learning process. Contributors, pull requests, and general advice are always more than welcome!
If you have questions, feedback, or compute resources to help scale the 50K model:
- Project/Dev Email: statera@animeshvarma.dev
- Research Email: animesh.varma.research@gmail.com