Andrea Amaduzzi
Pierluigi Zama Ramirez
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
Computer Vision Lab, University of Bologna, Italy
- Installation
- Data Preparation
- Inference and Evaluation
- Training Spatial LLaNA
- Training weights2space
- Citation
- Related Work
- Acknowledgements
- Terms of Usage
The code provided in this repository has been tested in the following environment:
- Ubuntu 20.04
- CUDA 12.1
- Python 3.10.0
To start:
- Clone this repository.
git clone git@github.com:CVLAB-Unibo/Spatial-LLaNA.git
cd Spatial-LLaNA
- Install packages.
conda create -n spatial-llana python=3.10 -y
conda activate spatial-llana
pip install --upgrade pip
pip install -r requirements.txt
# For faster training, install these optional dependencies:
pip install ninja
pip install flash-attn==2.5.6
We use ShapeNeRF-Text, ObjaNeRF-Text, and Spatial ObjaNeRF to train and evaluate our model.
ShapeNeRF-Text provides paired NeRFs and language annotations for ShapeNet objects, covering all 40K NeRFs available in the nf2vec dataset. It can be downloaded from the Huggingface Hub.
ObjaNeRF-Text is similar in structure to ShapeNeRF-Text and is also available on the Huggingface Hub. For a fair comparison with existing MLLMs, its test set is split into a "PointLLM test set" and a "GPT4Point test set", which use distinct objects from Objaverse.
Spatial ObjaNeRF is our manually annotated test set of 100 complex NeRF scenes (selected from ObjaNeRF-Text) with detailed spatial descriptions. It is designed specifically to evaluate spatial reasoning capabilities and is available on the Huggingface Hub. We also provide spatial multiple-choice QAs for each object.
To ensure everything runs smoothly, your data folder should look like this:
Spatial-LLaNA
└── data
    ├── spatial_llana_dataset
    │   ├── train
    │   │   ├── texts
    │   │   │   ├── conversations_brief.json
    │   │   │   └── conversations_complex.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   ├── val
    │   │   ├── texts
    │   │   │   ├── conversations_brief.json
    │   │   │   └── conversations_complex.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   ├── shapenerf_test
    │   │   ├── texts
    │   │   │   ├── conversations_brief.json
    │   │   │   └── conversations_complex.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   ├── objanerf_pointllm_test
    │   │   ├── texts
    │   │   │   └── conversations_brief.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   ├── objanerf_gpt4point_test
    │   │   ├── texts
    │   │   │   └── conversations_brief.json
    │   │   └── vecs
    │   │       ├── <model_id>.npy
    │   │       ├── ...
    │   │       └── <model_id>.npy
    │   └── hst_dataset_filtered.json
    └── spatial_objanerf
        └── texts
            ├── spatial_descriptions.json
            └── spatial_multi_choice_qa.json
where:
- texts/ folder contains the language annotations
- vecs/ folder contains the pre-computed embeddings from our weights2space encoder, to make the training smoother.
Feel free to download only the data splits you are interested in.
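Before training or evaluating, it can help to confirm that the downloaded splits match the layout above. The following is a minimal sketch, not part of the repository: the helper `missing_entries` and the `EXPECTED` table are hypothetical names, and the check only covers the folders and annotation files shown in the tree (it does not inspect the contents of the `.npy` files).

```python
from pathlib import Path

# Expected annotation files per split, copied from the tree above.
# This table and the helper below are illustrative, not part of the repository.
BRIEF = "conversations_brief.json"
COMPLEX = "conversations_complex.json"
EXPECTED = {
    "train": [BRIEF, COMPLEX],
    "val": [BRIEF, COMPLEX],
    "shapenerf_test": [BRIEF, COMPLEX],
    "objanerf_pointllm_test": [BRIEF],
    "objanerf_gpt4point_test": [BRIEF],
}


def missing_entries(data_root, splits=None):
    """Return expected paths that are absent under data_root/spatial_llana_dataset."""
    base = Path(data_root) / "spatial_llana_dataset"
    missing = []
    for split in (splits or EXPECTED):
        for name in EXPECTED[split]:
            path = base / split / "texts" / name
            if not path.is_file():
                missing.append(path)
        vecs = base / split / "vecs"
        # vecs/ should hold at least one pre-computed <model_id>.npy embedding.
        if not vecs.is_dir() or not any(vecs.glob("*.npy")):
            missing.append(vecs)
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for path in missing_entries("data"):
        print("missing:", path)
```

Pass a list such as `["shapenerf_test"]` as `splits` if you downloaded only some of the data.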
You can evaluate our pre-trained model against the test sets from the paper: ShapeNeRF-Text (captioning, QA), ObjaNeRF-Text (captioning), and Spatial ObjaNeRF (spatial detailed descriptions).
All scripts use the pre-trained andreamaduzzi/Spatial-LLaNA-13B model by default.
The NeRF captioning task can be evaluated on the following data sources:
- Brief textual descriptions, from ShapeNeRF-Text
- Brief textual descriptions from GPT2Shape HST, from "Looking at words and points with attention"
- Detailed textual descriptions, from ShapeNeRF-Text
- Brief textual descriptions, from ObjaNeRF-Text. To ensure a fair comparison in our paper, we split this set into a "PointLLM test set" and a "GPT4Point test set", involving different sets of objects from Objaverse.
- Detailed spatial descriptions, from Spatial ObjaNeRF
Run the command for each data source, in the same order as the list above:
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split shapenerf_test --text_data brief_description
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split hst
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split shapenerf_test --text_data detailed_description
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split objanerf_pointllm_test
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split objanerf_gpt4point_test
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split spatial_objanerf
The NeRF QA task can be evaluated on the single-round questions and answers from the test set of ShapeNeRF-Text:
python spatial_llana/eval/eval_spatial_llana.py --model_name andreamaduzzi/Spatial-LLaNA-13B --split shapenerf_test --text_data single_round
All these scripts store the textual predictions of Spatial LLaNA in JSON files. Use the following script to compute the final metrics (e.g., SentenceBERT, BLEU, ROUGE):
python spatial_llana/eval/traditional_evaluator.py --results_path PATH_TO_RESULTS
Replace PATH_TO_RESULTS with the path to your JSON prediction file.
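To run every evaluation in one go, the commands above can be assembled programmatically. The following is a rough sketch, not a script shipped with the repository: `RUNS` and `build_commands` are hypothetical names, while the script path, model name, and the `--model_name`, `--split`, and `--text_data` flags are copied from the commands listed in this section.

```python
import shlex

SCRIPT = "spatial_llana/eval/eval_spatial_llana.py"
MODEL = "andreamaduzzi/Spatial-LLaNA-13B"

# (split, text_data) pairs taken from the commands in this section.
# None means the command is run without --text_data.
RUNS = [
    ("shapenerf_test", "brief_description"),
    ("hst", None),
    ("shapenerf_test", "detailed_description"),
    ("objanerf_pointllm_test", None),
    ("objanerf_gpt4point_test", None),
    ("spatial_objanerf", None),
    ("shapenerf_test", "single_round"),
]


def build_commands(model_name=MODEL):
    """Return one argv list per evaluation run."""
    commands = []
    for split, text_data in RUNS:
        argv = ["python", SCRIPT, "--model_name", model_name, "--split", split]
        if text_data is not None:
            argv += ["--text_data", text_data]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Only prints the commands; pass each argv to
    # subprocess.run(argv, check=True) to actually execute it.
    for argv in build_commands():
        print(shlex.join(argv))
```

Calling `build_commands("andreamaduzzi/Spatial-LLaNA-7B")` produces the same runs for the 7B model.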
Spatial LLaNA is trained on the combined ShapeNeRF-Text and ObjaNeRF-Text datasets, totaling over 300K annotated NeRFs. Training relies on the embeddings pre-computed by our weights2space encoder, which you'll find in the vecs/ directories.
Stage 1 optimizes the linear projection network that maps NeRF features to the LLM's embedding space.
bash scripts/Spatial-LLaNA_train_stage1.sh
In stage 2, we jointly finetune the linear projection network along with the LLaMA-2 LLM.
bash scripts/Spatial-LLaNA_train_stage2.sh
These scripts launch the training of the model on multiple nodes, each with 4 GPUs. You can adjust `model_name_or_path` to train the 7B or the 13B version. The LLMs we use are Llama-2-7b and Llama-2-13b.
- Spatial-LLaNA-7B: Requires 8 A100 GPUs for training.
- Spatial-LLaNA-13B: Requires 16 GPUs for training.
- Duration: Completing both stages takes roughly 1 day of training. The resulting model weights will be saved in the outputs directory.
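The GPU requirements above translate into a node count once the per-node GPU count is fixed. As a small worked example, assuming the 4 GPUs per node mentioned for the training scripts (the helper `nodes_needed` is hypothetical, and the actual launcher settings in the scripts may differ):

```python
GPUS_PER_NODE = 4  # stated above: each node has 4 GPUs

# GPU requirements stated above for each model size.
REQUIRED_GPUS = {"Spatial-LLaNA-7B": 8, "Spatial-LLaNA-13B": 16}


def nodes_needed(total_gpus, gpus_per_node=GPUS_PER_NODE):
    """Smallest number of nodes that provides at least total_gpus GPUs."""
    return -(-total_gpus // gpus_per_node)  # ceiling division


if __name__ == "__main__":
    for model, gpus in REQUIRED_GPUS.items():
        print(f"{model}: {gpus} GPUs -> {nodes_needed(gpus)} nodes")
```

Under this assumption, the 7B model needs 2 nodes and the 13B model needs 4.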
Don't want to train? No problem! Our trained models are ready on the Huggingface Hub:
- Spatial-LLaNA-7B: andreamaduzzi/Spatial-LLaNA-7B
- Spatial-LLaNA-13B: andreamaduzzi/Spatial-LLaNA-13B
If you want to train our weights2space encoder from scratch, you are in the right place to learn how to do that!
The file weights2space/data/train.json lists all paths to the NeRF weights and rendered views used for training (around 350K NeRFs from ShapeNeRF-Text and ObjaNeRF-Text, including augmentation).
cd weights2space
bash scripts/train_weights2space_parallel.sh
This script is configured to support parallel training of weights2space across multiple nodes, each with 4 GPUs.
If you find our work helpful, please consider starring this repo and citing:
@InProceedings{NeurIPS25,
author = "Amaduzzi, Andrea and Zama Ramirez, Pierluigi and Lisanti, Giuseppe and Salti, Samuele and Di Stefano, Luigi",
title = "Spatially-aware Weights Tokenization for NeRF-Language Models",
booktitle = "Advances in Neural Information Processing Systems (NeurIPS)",
year = "2025",
month = "Dec."
}
We acknowledge ISCRA for awarding this project access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CINECA (Italy).
By using this service, users are required to agree to the following terms: The service is a research preview intended for non-commercial use only. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes.


