Datasets

This repository contains config for converting the following datasets, downloadable from the EIDC:

DRI Gridded Data

DRI Gridded Data Repository. Work in progress. This repository aims to develop a suite of tools that make working with large gridded datasets easier, as outlined in the diagram below. The background colours in the diagram represent the progress of the work: green = completed for now, yellow = actively being worked on, red = not started.

The first product under development allows easy conversion of various gridded datasets to ARCO (Analysis-Ready, Cloud-Optimized) Zarr format and easy upload to object storage. It is built on pangeo-forge-recipes, which provides convenience functions for Apache Beam; Beam in turn handles the parallelisation needed for rapid conversion. For more information on the reasons and motivation for converting data to ARCO format, see the README of the repository that generated the idea for this product.

Currently the product has been designed for datasets stored in monthly or daily netcdf files. This file-frequency restriction is intended to be relaxed in future versions.

Developer information

Product description document

UV Setup and running instructions

Note: The Python version is pinned to 3.10 because pyarrow cannot currently be built with later versions (Stack Overflow discussion). Additionally, there are import issues with zarr FSSpec.

To run the scripts in this repository using uv, first download and install uv following the instructions in the Astral documentation. Once it is installed, enter the directory containing the code and run the following commands to download all dependencies and create a virtual environment:

uv sync
uv venv

All scripts should now be runnable using uv run, e.g.:

uv run scripts/convert.py <CONFIG_FILE.yaml>

The dataset name can currently be one of "chess", "gearh" or "wrf".

Note: Memory usage can be an issue for datasets >=O(100GB), because the conversion uses Beam's rough-and-ready 'Direct Runner', which is not designed for operational use. Running on an HPC system is recommended for such datasets; example sbatch files for SLURM systems are included.

Config

Example config files can be found in the "config" folder and contain the following user-configurable variables:

start_year: The year of the first file in the dataset (YYYY)
start_month: The month of the first file in the dataset (MM)
end_year: The year of the last file in the dataset (YYYY)
end_month: The month of the last file in the dataset (MM)
frequency: The file frequency, currently supports "M" (for monthly files) and "D" (for daily files).
skipdates: Optional. A list of dates to skip and not include in the conversion. Files with the listed dates (YYYY, YYYY-MM or YYYY-MM-DD) will be skipped.
input_dir: The path to the directory/folder containing the dataset files
filename: A template for the filenames of the files, containing {varname} to substitute for varnames, {start_date} for a timestamp and optionally {end_date} for a second timestamp (e.g. if there is a range in the filename)
file_type: Optional. Type of netcdf files input. Only needed if the files are not netcdf4 (most now are), in which case use "netcdf3"
varnames: A list of all the variable names in the dataset. Currently the variable names in the filenames have to be the same as the variable names in the netcdf files if {varname} is used in filename
date_format: A python datestring format code that represents the format of {start_date} (and {end_date} if present) in the filename
target_root: The path to the folder in which to store the output zarr dataset
store_name: The name of the output zarr dataset
concatdim: The name of the dimension that the individual files will be concatenated along. Usually "time". Note this is separate to the merging of variables stored in separate files, which is handled by varnames
target_chunks: A dictionary with the dimension names of the desired output dataset chunking as the keys and size of these dimensions as the values
num_workers: Number of workers to use in the computation of the new dataset. Note that any value above 1 is currently experimental and may fail unpredictably
prune: Used for testing. Instead of running with all the dataset's files, just use the first X
overwrites: "off" or "on". Whether or not to overwrite one or more of the dataset's variables with data from one of the dataset's files. Designed for coordinate variables that may differ slightly between different versions of the dataset
var_overwrites: Optional. Which variables in the dataset to overwrite. If not specified and overwrites is "on", all variables that can be safely overwritten are overwritten
overwrite_source: Optional. Filename of a file in the dataset to use to source the variables' data to use to overwrite. If not specified and overwrites is "on", the last file of the dataset is used.
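
To illustrate how the date range, frequency, skipdates, filename and date_format settings above could fit together, here is a minimal Python sketch that expands them into a list of input filenames. It is not the repository's implementation: the helper names (file_dates, is_skipped, expand_filenames), the example variable names and the filename template are all hypothetical, the config is written as a plain dict rather than loaded from YAML, and the optional {end_date} placeholder is left out for brevity.

```python
import calendar
from datetime import date, timedelta


def file_dates(cfg):
    """Yield one date per expected file: each day for "D", each month start for "M"."""
    start = date(cfg["start_year"], cfg["start_month"], 1)
    last_day = calendar.monthrange(cfg["end_year"], cfg["end_month"])[1]
    end = date(cfg["end_year"], cfg["end_month"], last_day)
    day = start
    while day <= end:
        if cfg["frequency"] == "D" or day.day == 1:
            yield day
        day += timedelta(days=1)


def is_skipped(day, skipdates):
    """True if `day` matches a YYYY, YYYY-MM or YYYY-MM-DD entry (assumed semantics)."""
    iso = day.isoformat()  # e.g. "2001-01-01"
    prefixes = (iso[:4], iso[:7], iso)
    return any(str(entry) in prefixes for entry in skipdates)


def expand_filenames(cfg):
    """Fill the filename template for every variable and every non-skipped date."""
    names = []
    for varname in cfg["varnames"]:
        for day in file_dates(cfg):
            if is_skipped(day, cfg.get("skipdates", [])):
                continue
            stamp = day.strftime(cfg["date_format"])
            names.append(cfg["filename"].format(varname=varname, start_date=stamp))
    return names


# Hypothetical config values, for illustration only.
config = {
    "start_year": 2000, "start_month": 12,
    "end_year": 2001, "end_month": 2,
    "frequency": "M",
    "skipdates": ["2001-01"],
    "filename": "example_{varname}_{start_date}.nc",
    "date_format": "%Y%m",
    "varnames": ["precip", "tas"],
}

print(expand_filenames(config))
# ['example_precip_200012.nc', 'example_precip_200102.nc',
#  'example_tas_200012.nc', 'example_tas_200102.nc']
```

Working through a small range like this by hand can be a quick way to check that date_format and the filename template match the files in input_dir before starting a long conversion.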

Tests

Downloading and preparing the data

The package contains integration tests for the various converters of the datasets listed above. These tests use real samples of the datasets that must first be downloaded and prepared using scripts/download_test_data.py. To run the script, you must have login access to download the above datasets from the EIDC, and you must create a .env file containing your login details:

username=YOUR_USERNAME
password=YOUR_PASSWORD

You can then run:

uv run scripts/download_test_data.py

This will download a test file for each dataset and place them in data/. It will also create sub-samples from these files and add them to data-tiny/ - these will be used by the integration tests.
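
For reference, the .env file described above is a plain list of KEY=VALUE lines. The sketch below shows one way such a file could be read using only the Python standard library. It is an assumption for illustration: the functions parse_env and load_credentials are hypothetical, and scripts/download_test_data.py may read the file differently (for example through a dotenv library).

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and lines starting with '#'."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_credentials(path=".env"):
    """Return (username, password) from the file, or raise if either is missing."""
    values = parse_env(Path(path).read_text())
    missing = [key for key in ("username", "password") if not values.get(key)]
    if missing:
        raise ValueError(f"missing keys in {path}: {', '.join(missing)}")
    return values["username"], values["password"]
```

Failing early with a clear message when a key is missing tends to be easier to debug than a rejected login partway through a download.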

Running tests

There is a set of integration tests (marked as @pytest.mark.integration) that can be run using:

uv run pytest -m integration

These tests will convert using the recipes defined for each dataset and then check that the resulting output is as expected.

Note: These integration tests can also be run with the full file samples, but this may take several minutes.

Disclaimer

THIS REPOSITORY IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS REPOSITORY, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
