Handling missing values at test time is challenging for machine learning models, especially when aiming for both high accuracy and interpretability. Existing approaches often introduce bias through imputation or increase model complexity via missingness indicators. Moreover, both strategies can obscure interpretability, making it harder to understand how the model uses observed variables in its predictions. We propose missingness-avoiding (MA) machine learning, a general framework for training models that rarely require the values of missing (or imputed) features at test time. We develop tailored MA learning algorithms for decision trees, tree ensembles, and sparse linear models by incorporating classifier-specific regularization terms into their learning objectives.
This repository contains the code used for the experiments presented in our paper "Prediction models that learn to avoid missing values", which was featured as a spotlight poster at ICML 2025.
When does MA learning achieve both low reliance on features with missing values and minimal prediction error? Consider the following data-generating process as a concrete example: Patients registered with a general healthcare provider undergo annual check-ups to assess their overall health. Demographic variables, such as patient age, are always recorded, whereas some test results may be missing due to clinical recommendations or practitioner discretion. For instance, cognitive tests are consistently administered to individuals over 65 years old, ensuring that MMSE scores are available for all patients in this age group. Patients who receive a low MMSE score subsequently undergo an MRI scan, which measures hippocampal volume (
In the figure below (which can be reproduced by running the script scripts/run_synthetic_experiment.py), we show two examples of decision trees trained to predict whether a patient suffers from cognitive impairment using data collected by the healthcare provider. On the left, a standard decision tree is shown. It splits on the MRI scan outcome at the root node, resulting in high missingness reliance (
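The check-up example can be made concrete with a small simulation. The sketch below is not the code in scripts/run_synthetic_experiment.py; it is a minimal illustration under stated assumptions: the age range, the MMSE cutoff of 24, and the MRI value range are made up, and missingness reliance is taken to mean the fraction of patients whose prediction path touches a value that was never measured. Both trees are represented only by the features they would read for a given patient.

```python
import random


def simulate_patient(rng):
    """Draw one patient; None marks a value that was never measured."""
    age = rng.randint(40, 90)  # demographics are always recorded
    # The cognitive test is administered only to patients over 65.
    mmse = rng.randint(15, 30) if age > 65 else None
    # An MRI follows only a low MMSE score; the cutoff of 24 is an assumption.
    low_mmse = mmse is not None and mmse < 24
    mri = rng.uniform(2.0, 4.5) if low_mmse else None
    return {"age": age, "mmse": mmse, "mri": mri}


def features_used_mri_first(patient):
    """Tree that splits on the MRI outcome at the root node."""
    return ["mri"]


def features_used_age_first(patient):
    """Tree that follows the measurement order: age, then MMSE, then MRI."""
    used = ["age"]
    if patient["age"] > 65:
        used.append("mmse")
        if patient["mmse"] < 24:
            used.append("mri")
    return used


def missingness_reliance(patients, features_used):
    """Fraction of patients whose prediction path touches a missing value."""
    hits = sum(
        any(p[f] is None for f in features_used(p)) for p in patients
    )
    return hits / len(patients)


rng = random.Random(0)
patients = [simulate_patient(rng) for _ in range(1000)]
rho_mri_first = missingness_reliance(patients, features_used_mri_first)
rho_age_first = missingness_reliance(patients, features_used_age_first)
print(f"MRI at root: {rho_mri_first:.2f}, age at root: {rho_age_first:.2f}")
```

Because the second tree only asks for a measurement after the condition that triggers that measurement has been checked, it never reads a missing value in this simulation, whereas the first tree does so for every patient without an MRI.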
- Installation
- The MA learning framework
- Configuration files
- Datasets
- Experiments
- Tetralith setup
- Citation
- Acknowledgements
To get started, clone the repository and install the required packages in a new environment. We use Pixi to manage Python dependencies. Make sure Pixi is installed by running:
pixi --version
To set up a working environment, run the following commands:
git clone https://github.com/antmats/malearn.git
cd malearn
pixi install
The pixi install command installs the CPU version of PyTorch, which is used in the implementation of the NeuMiss network – a baseline used in our experiments. To enable GPU support, run pixi install -e cuda.
Note: The implementation of M-GAM – another baseline included in our experiments – depends on fastsparsegams, which is only available via PyPI. On macOS machines, installing fastsparsegams may fail due to the issue described here. If you encounter this problem, you can comment out the pypi-dependencies section in pixi.toml to create an environment without support for running M-GAM.
To run all tests in the tests directory:
pixi run test
To launch a Jupyter notebook:
pixi run jupyter lab
The MA learning framework supports sparse linear models (MA-Lasso), decision trees (MA-DT), and ensemble methods in the form of random forests (MA-RF) and gradient-boosted decision trees (MA-GBT). All four algorithms are currently implemented for classification; for regression, only MA-Lasso and MA-DT are supported.
| MA estimator | Classification | Regression |
|---|---|---|
| MA-Lasso | MALassoClassifier | MALasso |
| MA-DT | MADTClassifier | MADTRegressor |
| MA-RF | MARFClassifier | - |
| MA-GBT | MAGBTClassifier | - |
All estimators follow the fit/predict convention used in scikit-learn. Thus, they can be easily integrated into the scikit-learn ecosystem – for example, as part of a pipeline and/or within a hyperparameter search. However, since each MA estimator requires not only the input features and labels but also information about which values are missing to be passed to its fit method, metadata routing must be enabled when using the estimator within a meta-estimator. See scripts/run_synthetic_experiment.py for an example of how to use MA-DT.
Note: The current decision tree implementations are not optimized for speed and are relatively slow compared to their scikit-learn counterparts. We aim to improve these implementations in future updates.
We use configuration files to specify details such as dataset paths, evaluation metrics, and hyperparameters. There are two configuration files: one for classification (config_cla.yml) and one for regression (config_reg.yml). You should update the base_dir field to point to a directory on your machine where the datasets are stored (under the subdirectory specified by data_dir). Results will be saved in base_dir/results_dir.
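As a rough illustration of how these fields relate, the sketch below resolves the data and results directories from base_dir. The field names come from the configuration files; the values are placeholders, and the repository's own loading code may differ.

```python
from pathlib import Path

# Placeholder values; only the field names come from the configuration files.
config = {
    "base_dir": "/path/to/my/project",
    "data_dir": "data",
    "results_dir": "results",
}

base_dir = Path(config["base_dir"])
data_path = base_dir / config["data_dir"]        # datasets are read from here
results_path = base_dir / config["results_dir"]  # results are written here

print(data_path)
print(results_path)
```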
We consider six different datasets for classification in our paper: ADNI, Breast Cancer, FICO, LIFE, NHANES, and Pharyngitis. In the following subsections, we describe how to download each dataset. In malearn/data/data.py, there is a corresponding data handler class for each dataset. Each data handler defines the features to include for modeling, the target variable, and the name of the dataset file. Make sure the file_name attribute corresponds to the name of the dataset file on your own machine.
To use ADNI, you must first apply for access here. After gaining access, follow these steps:
- Log in to the Image and Data Archive.
- Under "Select Study", choose "ADNI". Then choose "Download > Study Data" and search for "ADNIMERGE".
- Download the file "ADNIMERGE - Key ADNI tables merged into one table - Packages for R [ADNI1,GO,2]".
- Install the ADNIMERGE package for R by following the instructions here.
- Load the data and save it to a CSV file by running the R script below:
library(ADNIMERGE)
data <- adnimerge
write.csv(data, file="/path/to/my/adni/data.csv", row.names=FALSE)
The Breast Cancer dataset was used in this paper by Shadbahr et al. To obtain the dataset, download Breast_cancer_data.xlsx from their project repository.
The FICO dataset was used in the 2018 Explainable Machine Learning Challenge. Download heloc_dataset_v1.csv from this GitHub repository.
The LIFE dataset can be downloaded from this Kaggle project.
The NHANES dataset can be obtained by following the instructions provided in this GitHub repository.
The Pharyngitis dataset is available as supplementary material for this article. Search for "minimal dataset" to locate the file.
The experimental settings are defined in the configuration files as described above. The script scripts/fit_estimator.py can be used to fit and evaluate a single model. For example, to fit MA-Lasso to the ADNI dataset, run:
pixi run python scripts/fit_estimator.py --config_path config_cla.yml --dataset_alias adni --estimator_alias malasso
In our paper, we conduct comprehensive experiments on all datasets. All experiments were performed on the Tetralith cluster using Apptainer containers; see below for details. The scripts scripts/slurm/run_experiment_wrapper.sh and scripts/slurm/run_experiment.sh were used to launch each experiment. If you have access to a cluster that uses Slurm for job scheduling, you can reproduce all our experiments by updating the relevant Slurm parameters in scripts/slurm/run_experiment.sh. No changes are required for scripts/slurm/run_experiment_wrapper.sh. Depending on your cluster setup, you may also need to modify the Slurm batch script (scripts/slurm/fit_estimator.sh) to, e.g., map host storage directories into the container environment.
- To reproduce the results shown in the main table (Table 1) of the paper, run the following commands:
./scripts/slurm/run_experiment_wrapper.sh config_cla.yml adni all
./scripts/slurm/run_experiment_wrapper.sh config_cla.yml fico all
./scripts/slurm/run_experiment_wrapper.sh config_cla.yml life all
./scripts/slurm/run_experiment_wrapper.sh config_cla.yml nhanes all
- To fit MA models with the setting $\alpha=0$, run:
./scripts/slurm/run_experiment_wrapper.sh config_cla.yml <dataset> ma_alpha_0
- To fit MA models with the setting $\alpha=\infty$, update the model_selection field in the configuration file as follows:
model_selection:
  search: random
  n_iter: 10
  scoring: [missingness_reliance_score, roc_auc_ovr]
  refit: tradeoff
  gamma: 1
  n_splits: 3
  test_size: 0.2
  seed: *seed
Then, run:
./scripts/slurm/run_experiment_wrapper.sh config_cla.yml <dataset> ma
For further details and post-processing of the results, see the notebook notebooks/paper_results.ipynb.
Here, we explain how to set up a working environment on the Tetralith cluster, run the default set of experiments on the ADNI dataset, and launch a Jupyter notebook on Tetralith that you can access through a browser on your local machine. We assume that you have access to a project storage directory, and that the path to this directory can be obtained using $projdir.
Clone the repository:
cd ~
git clone https://github.com/antmats/malearn.git
Build a container:
cd "$projdir"
mkdir -p malearn/containers && cd malearn/containers
apptainer build --bind $HOME:/mnt ma_env.sif ~/malearn/container.def
Run an experiment:
cd ~/malearn
./scripts/slurm/run_experiment_wrapper.sh config_cla.yml adni all
Launch a notebook:
- Start a Jupyter server:
container="${projdir}/malearn/containers/ma_env.sif"
apptainer exec --bind /proj:/proj,$HOME:/mnt "$container" jupyter-lab --no-browser --LabApp.extension_manager=readonly
- Create an SSH tunnel (make sure the port is 8889):
ssh -N -L localhost:8889:localhost:8889 <username>@tetralith.nsc.liu.se
- Copy and paste the URL on Tetralith into your browser.
If you use this work, please cite it as follows:
@inproceedings{
stempfle2025prediction,
title={Prediction models that learn to avoid missing values},
author={Lena Stempfle and Anton Matsson and Newton Mwai and Fredrik D. Johansson},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=ps3aO9MHJv}
}
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
The computations and data handling were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
