CVLAB-Unibo/Spatial-LLaNA

Spatially-aware Weights Tokenization for NeRF-Language Models (NeurIPS 2025)

Andrea Amaduzzi   Pierluigi Zama Ramirez  Giuseppe Lisanti  Samuele Salti  Luigi Di Stefano 
Computer Vision Lab, University of Bologna, Italy

🔧 Installation

The code provided in this repository has been tested in the following environment:

  • Ubuntu 20.04
  • CUDA 12.1
  • Python 3.10.0

To start:

  1. Clone this repository.
git clone git@github.com:CVLAB-Unibo/Spatial-LLaNA.git
cd Spatial-LLaNA
  2. Install packages
conda create -n spatial-llana python=3.10 -y
conda activate spatial-llana
pip install --upgrade pip
pip install -r requirements.txt

# For faster training, install these optional dependencies:
pip install ninja
pip install flash-attn==2.5.6
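
After installing, it can help to confirm that the interpreter matches the tested Python version and to see which optional packages are present. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the import names `ninja` and `flash_attn` are assumed to correspond to the pip packages above.

```python
import importlib.util
import sys

# Python version taken from the tested environment listed above.
TESTED_PYTHON = (3, 10)
# Assumed import names of the optional packages (pip names: ninja, flash-attn).
OPTIONAL_MODULES = ("ninja", "flash_attn")


def check_environment(version_info=None):
    """Compare the running interpreter with the tested setup (hypothetical helper)."""
    major, minor = tuple(version_info or sys.version_info)[:2]
    report = {"python_matches_tested": (major, minor) == TESTED_PYTHON}
    for name in OPTIONAL_MODULES:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `False` for the optional packages only means that the faster training path is unavailable; the base installation from requirements.txt is unaffected.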

📦 Data Preparation

We use ShapeNeRF-Text, ObjaNeRF-Text, and Spatial ObjaNeRF to train and evaluate our model.

ShapeNeRF-Text

ShapeNeRF-Text provides paired NeRFs and language annotations for ShapeNet objects, covering all 40K NeRFs available in the nf2vec dataset. The data can be downloaded from the Hugging Face Hub.

ObjaNeRF-Text

Similar in structure to ShapeNeRF-Text, this dataset is available on the Hugging Face Hub. For fair comparison with existing MLLMs, the test set is split into a "PointLLM test set" and a "GPT4Point test set", which use distinct objects from Objaverse.

Spatial ObjaNeRF

This is our manually annotated test set of 100 complex NeRF scenes (selected from ObjaNeRF-Text) featuring detailed spatial descriptions. It is designed specifically to evaluate spatial reasoning capabilities and is available on the Hugging Face Hub. We also provide spatial multiple-choice QAs for each object.

Required folder structure

To ensure everything runs smoothly, your data folder should look like this:

Spatial-LLaNA
└── data
    ├── spatial_llana_dataset
    │   │
    │   ├── train
    │   │   ├── texts
    │   │   │   ├── conversations_brief.json
    │   │   │   └── conversations_complex.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   │
    │   ├── val
    │   │   ├── texts
    │   │   │   ├── conversations_brief.json
    │   │   │   └── conversations_complex.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   │
    │   ├── shapenerf_test
    │   │   ├── texts
    │   │   │   ├── conversations_brief.json
    │   │   │   └── conversations_complex.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   │
    │   ├── objanerf_pointllm_test
    │   │   ├── texts
    │   │   │   └── conversations_brief.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   │
    │   ├── objanerf_gpt4point_test
    │   │   ├── texts
    │   │   │   └── conversations_brief.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   │
    │   └── hst_dataset_filtered.json
    │
    └── spatial_objanerf
        └── texts
            ├── spatial_descriptions.json
            └── spatial_multi_choice_qa.json

where:

  1. The texts/ folder contains the language annotations.
  2. The vecs/ folder contains the embeddings pre-computed by our weights2space encoder, so they do not have to be recomputed during training.

Feel free to download only the data splits you are interested in.
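
Before launching training or evaluation, the layout above can be checked programmatically. The sketch below is an illustration rather than a script shipped with the repository: `missing_paths` and `EXPECTED_TEXTS` are hypothetical names, the file lists are transcribed from the tree above, and only the five split folders are covered (hst_dataset_filtered.json and spatial_objanerf are left out for brevity).

```python
from pathlib import Path

BOTH = ["conversations_brief.json", "conversations_complex.json"]
BRIEF_ONLY = ["conversations_brief.json"]

# Expected annotation files per split, transcribed from the folder tree above.
EXPECTED_TEXTS = {
    "train": BOTH,
    "val": BOTH,
    "shapenerf_test": BOTH,
    "objanerf_pointllm_test": BRIEF_ONLY,
    "objanerf_gpt4point_test": BRIEF_ONLY,
}


def missing_paths(data_root, splits=None):
    """Return expected files or folders that are absent for the chosen splits."""
    root = Path(data_root) / "spatial_llana_dataset"
    missing = []
    for split in splits or EXPECTED_TEXTS:
        for name in EXPECTED_TEXTS[split]:
            path = root / split / "texts" / name
            if not path.is_file():
                missing.append(path)
        vecs = root / split / "vecs"
        # vecs/ should exist and hold at least one <model_id>.npy embedding.
        if not (vecs.is_dir() and any(vecs.glob("*.npy"))):
            missing.append(vecs)
    return missing


if __name__ == "__main__":
    for path in missing_paths("data"):
        print(f"missing: {path}")
```

Passing a subset such as `missing_paths("data", ["shapenerf_test"])` checks only the splits that were actually downloaded.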

🧑‍🏫 Inference and Evaluation

You can evaluate our pre-trained model against the test sets from the paper: ShapeNeRF-Text (captioning, QA), ObjaNeRF-Text (captioning), and Spatial ObjaNeRF (spatial detailed descriptions).

All scripts use the pre-trained andreamaduzzi/Spatial-LLaNA-13B model by default.

NeRF captioning

The NeRF captioning task can be evaluated on five different data sources:

  1. Brief textual descriptions, from ShapeNeRF-Text
  2. Brief textual descriptions from GPT2Shape HST, from "Looking at words and points with attention"
  3. Detailed textual descriptions, from ShapeNeRF-Text
  4. Brief textual descriptions from ObjaNeRF-Text. To ensure a fair comparison in our paper, we split this set into a "PointLLM test set" and a "GPT4Point test set", involving different sets of objects from Objaverse.
  5. Detailed spatial descriptions, from Spatial ObjaNeRF

The commands below follow the same order, with one command per test set for source 4:
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split shapenerf_test --text_data brief_description
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split hst
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split shapenerf_test --text_data detailed_description
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split objanerf_pointllm_test
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split objanerf_gpt4point_test
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split spatial_objanerf
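
To run all captioning evaluations in one go, the commands above can be generated from a small table. This is a sketch under stated assumptions: the split names and flags are copied from the commands above, while `CAPTIONING_RUNS`, `build_command`, and `run_all` are hypothetical helpers that do not exist in the repository.

```python
import subprocess
import sys

EVAL_SCRIPT = "spatial_llana/eval/eval_spatial_llana.py"
MODEL_NAME = "andreamaduzzi/Spatial-LLaNA-13B"

# (split, text_data) pairs copied from the commands above.
# None means the --text_data flag is omitted, as in the original command.
CAPTIONING_RUNS = [
    ("shapenerf_test", "brief_description"),
    ("hst", None),
    ("shapenerf_test", "detailed_description"),
    ("objanerf_pointllm_test", None),
    ("objanerf_gpt4point_test", None),
    ("spatial_objanerf", None),
]


def build_command(split, text_data=None, model_name=MODEL_NAME):
    """Assemble the argument list for one evaluation run."""
    cmd = [sys.executable, EVAL_SCRIPT, "--model_name", model_name, "--split", split]
    if text_data is not None:
        cmd += ["--text_data", text_data]
    return cmd


def run_all(dry_run=True):
    """Print every command; execute them only when dry_run is False."""
    for split, text_data in CAPTIONING_RUNS:
        cmd = build_command(split, text_data)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing run


if __name__ == "__main__":
    run_all(dry_run=True)
```

The dry run is the default so that the generated commands can be inspected before any model weights are downloaded.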

NeRF QA

The NeRF QA task can be evaluated using the single-round questions and answers from the test set of ShapeNeRF-Text.

python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split shapenerf_test --text_data single_round

All of these scripts write Spatial LLaNA's textual predictions to JSON files.

Computation of the evaluation metrics

The above scripts generate JSON files containing the model's predictions. Use the following script to compute the final metrics (e.g., SentenceBERT, BLEU, ROUGE):

python spatial_llana/eval/traditional_evaluator.py --results_path PATH_TO_RESULTS

Replace PATH_TO_RESULTS with the path to your JSON prediction file.
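
When several prediction files have been produced, the evaluator can be invoked once per file. The sketch below only builds the commands; `evaluator_commands` is a hypothetical helper, and the directory that holds the prediction files is an assumption, since the location of the JSON outputs is not stated here.

```python
import sys
from pathlib import Path

EVALUATOR = "spatial_llana/eval/traditional_evaluator.py"


def evaluator_commands(results_dir):
    """Build one evaluator command per JSON file found under results_dir."""
    files = sorted(Path(results_dir).rglob("*.json"))
    return [[sys.executable, EVALUATOR, "--results_path", str(f)] for f in files]


if __name__ == "__main__":
    # "PATH_TO_RESULTS_DIR" is a placeholder for a folder of prediction files.
    for cmd in evaluator_commands("PATH_TO_RESULTS_DIR"):
        print(" ".join(cmd))
```

Each printed command has the same form as the one above, so it can be run directly or passed to `subprocess.run`.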

🏋🏼 Training Spatial LLaNA

Model architecture

Spatial LLaNA is trained on the combined ShapeNeRF-Text and ObjaNeRF-Text datasets, totaling over 300K annotated NeRFs. Training relies on the embeddings pre-computed by our weights2space encoder, which you'll find in the vecs/ directories.

Training Stage 1: modality alignment

This stage optimizes the linear projection network that maps NeRF features to the LLM's embedding space.

bash scripts/Spatial-LLaNA_train_stage1.sh

Training Stage 2: fine-tuning

Here, we jointly fine-tune the linear projection network and the LLaMA-2 LLM.

bash scripts/Spatial-LLaNA_train_stage2.sh

These scripts launch training on multiple nodes, each with 4 GPUs. You can adjust model_name_or_path to train the 7B or the 13B version. The LLMs we use are Llama-2-7b and Llama-2-13b.

Computational Resources

  • Spatial-LLaNA-7B: Requires 8 A100 GPUs for training.

  • Spatial-LLaNA-13B: Requires 16 GPUs for training.

  • Duration: Completing both stages takes roughly 1 day of training. The resulting model weights will be saved in the outputs directory.
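
The figures above translate into node counts and a rough compute budget. The arithmetic below is a back-of-the-envelope sketch: it assumes 4 GPUs per node as stated for the training scripts, treats "roughly 1 day" as exactly 24 hours, and assumes that duration applies to both model sizes, which is not stated explicitly.

```python
GPUS_PER_NODE = 4  # each node has 4 GPUs, as stated for the training scripts
GPUS_REQUIRED = {"Spatial-LLaNA-7B": 8, "Spatial-LLaNA-13B": 16}
TRAINING_HOURS = 24  # assumption: "roughly 1 day" for both stages


def nodes_needed(gpus, gpus_per_node=GPUS_PER_NODE):
    """Number of nodes needed to host the given GPU count (ceiling division)."""
    return -(-gpus // gpus_per_node)


def gpu_hours(gpus, hours=TRAINING_HOURS):
    """Approximate GPU-hours for a full two-stage training run."""
    return gpus * hours


for name, gpus in GPUS_REQUIRED.items():
    print(f"{name}: {nodes_needed(gpus)} nodes, ~{gpu_hours(gpus)} GPU-hours")
# Spatial-LLaNA-7B: 2 nodes, ~192 GPU-hours
# Spatial-LLaNA-13B: 4 nodes, ~384 GPU-hours
```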

Checkpoints of trained Spatial LLaNA

Don't want to train? No problem! Our trained models are available on the Hugging Face Hub, for example andreamaduzzi/Spatial-LLaNA-13B.

⚙️ Training weights2space

Model architecture

If you want to train our weights2space encoder from scratch, you are in the right place to learn how to do that!

Preparing the training dataset

The file weights2space/data/train.json lists all paths to the NeRF weights and rendered views used for training (around 350K NeRFs from ShapeNeRF-Text and ObjaNeRF-Text, including augmentation).

⚠️ Important Data Note: The underlying NeRF weights used for training weights2space are currently not released. We are working on making them available soon!

Launching the training of the model

cd weights2space
bash scripts/train_weights2space_parallel.sh

This script is configured to support parallel training of weights2space across multiple nodes, each with 4 GPUs.

🔗 Citation

If you find our work helpful, please consider starring this repo 🌟 and citing:

@InProceedings{NeurIPS25,
  author       = "Amaduzzi, Andrea and Zama Ramirez, Pierluigi and Lisanti, Giuseppe and Salti, Samuele and Di Stefano, Luigi",
  title        = "Spatially-aware Weights Tokenization for NeRF-Language Models",
  booktitle    = "Advances in Neural Information Processing Systems (NeurIPS)",
  year         = "2025",
  month        = "Dec."
} 

📚 Related Work

Acknowledgements

We acknowledge ISCRA for awarding this project access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CINECA (Italy).

🛑 Terms of Usage

By using this service, users are required to agree to the following terms: The service is a research preview intended for non-commercial use only. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes.
