This is the repository containing all the codes related to my Master of Science thesis in Computer Science and Engineering at Politecnico di Milano. The description of the thesis and of the repository structure and code structure is postponed to next sections to keep the first sections short. The first components I will describe are the ones needed to reproduce the experiments and to download and set up the datasets using automated scripts. I don't want you to lose time, most of the things just involve just running a script.
The development of anomalearn has been moved to a dedicated repository to be correctly synced with PyPI for versioning. The link of the anomalearn repository is: https://github.com/marcopetri98/anomalearn. If the repository is still private, it means that you visited this repository too close to the graduation day. It will be public soon.
In my thesis I use some external libraries, public datasets and state-of-the-art methods. If you want to do one of the following:
- Use the code of the repository
- Download all the datasets used in this thesis
- Download all the methods to which this thesis is compared
- Reproduce the results of the thesis by running scripts
This section explains everything you need from downloading the required packages to execute the scripts (including the scripts used to download datasets and methods).
In my thesis I used several packages. To get all the needed packages to install anomalearn, make sure you use python 3.10 and create the virtual environment using:
conda create -n <env-name> python=3.10
conda activate <env-name>
pip install -e .
The reason why the installation uses the editable option -e is that the library is still in development at the moment, and there isn't a stable release. The API are subject to abrupt change, and it is not still published on PyPI. Sooner the installation will become available through PyPI as any other library by running pip install anomalearn. However, for the moment this is the workaround to be used to install the library. The advantage is that with the editable option you can git clone the repository and pull the latest updates without needing to reinstall the library if the requirements did not change.
If either you don't have conda or you don't know what conda is, I invite you to follow this link: https://docs.conda.io/en/latest/miniconda.html
Every library must be tested before being distributed. Especially, if the library is published on PyPI it should have automated testing providing a good coverage level. Assume you have downloaded the repository at /usr/yourname/repositories/anomalearn, you need to follow these steps to test the library:
cd /usr/yourname/repositories/anomalearn
conda activate <env/name>
python -m unittest discover
In my thesis I use several public datasets. To increase reproducibility, you can find the script setup_datasets.py. The script accepts some arguments. By default the script downloads, extracts and rename folders. The usage with all default settings is:
python ./setup_datasets.py
Be aware that the script does not download datasets that are not directly downloadable: for those datasets (such as Yahoo! Webscope S5) you need to request access to them, download them and to positionate them in the folder in which you downloaded the other datasets. Then, you need also to rename them with the expected name by the program, here is the list:
- Yahoo! Webscope S5: the filename must be "yahoo_s5.tgz".
If you want to download some of the datasets, you can specify that by using the appropriate optional argument:
python ./setup_datasets.py -d ucr smd nab mgab
Will download, extract and rename the datasets: ucr, smd, nab and mgab.
The script is a program as any other linux program. You can call the man page with the following (if you want to know about the optional parameters and their usage):
python ./setup_datasets.py -h
In my thesis I compare my methods with several state-of-the-art methods. Many of those methods have a public repository available, from which we can download their code of their model. The script setup_repos.py is a utility similar to setup_datasets.py. It can be used to download all the repos in their folder. The default usage is:
python ./setup_repos.py
The methods can also be downloaded to a different location with respect to the standard one. However, scripts are thought to work with default locations. If you change it, you may need to adjust paths in scripts you are going to execute.
This script is a program as any other linux program. You can call the man page with the following (if you want to know about the optional parameters and their usage):
python ./setup_repos.py -h
In my thesis I used many datasets to evaluate the methods I developed and state-of-the-art methods against each other. The datasets I used are: Yahoo Webscope S5, UCR, SMD, NASA-MSL, NASA-SMAP, NAB, MGAB, Kitsune, GHL and Exathlon. Here you can find a table with the datasets, reference to the paper presenting them and the link to the dataset.
These datasets are all public datasets. However, not all of them are directly downloadable. For instance, Yahoo! Webscope S5 can be downloaded upon request approval made to Yahoo via their website. Therefore, I will speak about public datasets (available for download from a link) and about accessible datasets (downloadable after request acceptance).