This assignment was completed as part of the Statistical Learning course of the MSc in Statistics and Data Science at Leiden University. The project is split into two parts: Part 1 covers supervised machine learning methods, while Part 2 covers unsupervised methods.
In the first part, I compare three supervised machine learning methods for predicting the severity of depressive episodes after twelve months (dep_sev_fu). At the outset, we are asked to choose between two supervised learning models, Support Vector Machines (SVM) and eXtreme Gradient Boosting (XGBoost). I justify my choice of XGBoost based on properties of the dataset. Thereafter, I explain how hyperparameter optimisation was carried out. The report continues with an interpretation of the predictions produced by the trained model, together with a feature importance analysis using SHAP values and accompanying visualisations. Next, I compare the chosen model with two methods used in the previous installment of these assignments (Assignment 1, which can be found here) in terms of RMSE, R-squared and MAE. Finally, this part of the report concludes with personalised patient advice based on the models' predictions.
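For reference, the three comparison metrics can be sketched in a few lines of plain Python. This is a minimal illustration under the standard textbook definitions, not the code used in the report; the function name `regression_metrics` and the example values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R-squared for two equal-length sequences.

    Hypothetical helper for illustration; not taken from the report.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares and total sum of squares around the mean.
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    rmse = math.sqrt(ss_res / n)                # root mean squared error
    mae = sum(abs(r) for r in residuals) / n    # mean absolute error
    # R-squared is undefined when the target is constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"rmse": rmse, "mae": mae, "r2": r2}


if __name__ == "__main__":
    # Made-up numbers, only to show the call.
    print(regression_metrics([3.0, 5.0, 7.0], [2.5, 5.5, 7.0]))
```

Lower RMSE and MAE indicate smaller prediction errors, while an R-squared closer to 1 indicates that more of the variance in the outcome is explained. RMSE penalises large errors more heavily than MAE, which is one reason to report both.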
In the second part, I explore two unsupervised learning techniques: dimensionality reduction via Principal Component Analysis (PCA), and clustering.