This repository contains config for converting the following datasets, downloadable from the EIDC:
- CHESS-met, E.L. Robinson, E.M. Blyth et al.
- GEAR(hourly), E. Lewis, N. Quinn et al.
- GEAR(daily), M. Tanguy, H. Dixon et al.
DRI Gridded Data Repository. Work in progress. The idea with this repo is to develop a suite of tools to make working with large gridded datasets easier. This is outlined in the diagram below. The background colours represent the progress of the work. Green = Completed for now, Yellow = Actively being worked on, Red = Not started.
The first product that we are developing is to allow for easy conversion of various gridded datasets to ARCO (Zarr) format and easy upload to object storage. This product is built upon pangeo-forge-recipes which provides convenience functions for Apache Beam, which handles all the complexity of the performant parallelisation needed for rapid execution of the conversion. For more information on the reasons and motivation for converting data to ARCO format see the README of the repository that generated the idea for this product.
Currently the product has been designed for datasets stored in monthly or daily netcdf files. This file-frequency restriction is intended to be relaxed in future versions.
Note: The python version is pinned to 3.10 as pyarrow cannot currently be built with later versions (Stack Overflow discussion). Additionally there are import issues with zarr FSSpec.
To run the scripts in this repository using uv, first download and install uv following the instructions in the Astral documentation. Once installed, enter the directory containing the code and run the following commands to download all dependencies and create a virtual environment:
uv sync
uv venv
All scripts should now be runnable using uv run e.g.:
uv run scripts/convert.py <CONFIG_FILE.yaml>
datasetname can currently be one of "chess", "gearh" or "wrf".
Note: Memory usage can be an issue for datasets >=O(100GB), due to the usage of Beam's rough-and-ready 'Direct Runner', which is not designed for operational use. Use of an HPC is recommended for such datasets; example sbatch files for SLURM systems are included.
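As a rough guide to whether a dataset falls into this size range, its uncompressed size can be estimated from its dimensions. The sketch below is illustrative only: the function name, the grid dimensions and the assumption of 4-byte (float32) values are hypothetical and are not taken from any of the datasets listed above.

```python
def uncompressed_size_gb(dim_sizes, n_variables, bytes_per_value=4):
    """Estimate the uncompressed size in GB (10^9 bytes) of a gridded dataset.

    Hypothetical helper: assumes every variable spans all dimensions and
    every value occupies `bytes_per_value` bytes (4 for float32).
    """
    n_values = n_variables
    for size in dim_sizes.values():
        n_values *= size
    return n_values * bytes_per_value / 1e9


# Hypothetical example: ten years of hourly data on a 1000 x 700 grid.
dims = {"time": 365 * 24 * 10, "y": 1000, "x": 700}
size_gb = uncompressed_size_gb(dims, n_variables=1)
print(f"Approximate uncompressed size: {size_gb:.2f} GB")
```

With these made-up dimensions a single variable already exceeds 100 GB, which is the scale at which running on an HPC becomes worthwhile.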
Example config files can be found in the "config" folder and contain the following user-configurable variables:
- start_year: The year of the first file in the dataset (YYYY)
- start_month: The month of the first file in the dataset (MM)
- end_year: The year of the last file in the dataset (YYYY)
- end_month: The month of the last file in the dataset (MM)
- frequency: The file frequency. Currently supports "M" (for monthly files) and "D" (for daily files).
- skipdates: Optional. A list of dates to exclude from the conversion. Files with the listed dates (YYYY, YYYY-MM or YYYY-MM-DD) will be skipped.
- input_dir: The path to the directory/folder containing the dataset files
- filename: A template for the filenames of the files, containing {varname} to substitute for varnames, {start_date} for a timestamp and optionally {end_date} for a second timestamp (e.g. if there is a range in the filename)
- file_type: Optional. Type of netcdf files input. Only needed if the files are not netcdf4 (most now are), in which case use "netcdf3"
- varnames: A list of all the variable names in the dataset. Currently the variable names in the filenames have to be the same as the variable names in the netcdf files if {varname} is used in filename
- date_format: A python datestring format code that represents the format of {start_date} (and {end_date} if present) in the filename
- target_root: The path to the folder in which to store the output zarr dataset
- store_name: The name of the output zarr dataset
- concatdim: The name of the dimension that the individual files will be concatenated along. Usually "time". Note this is separate from the merging of variables stored in separate files, which is handled by varnames
- target_chunks: A dictionary with the dimension names of the desired output dataset chunking as the keys and the chunk sizes along these dimensions as the values
- num_workers: Number of workers to use in the computation of the new dataset. Note that anything above 1 is currently experimental and may fail unexpectedly
- prune: Used for testing. Instead of running with all the dataset's files, just use the first X
- overwrites: "off" or "on". Whether or not to overwrite one or more of the dataset's variables with data from one of the dataset's files. Designed for coordinate variables that may differ slightly between different versions of the dataset
- var_overwrites: Optional. Which variables in the dataset to overwrite. If not specified and overwrites is "on", all variables that can be safely overwritten are overwritten
- overwrite_source: Optional. Filename of a file in the dataset to use as the source of the data for overwriting. If not specified and overwrites is "on", the last file of the dataset is used.
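To illustrate how the date range, frequency, skipdates, filename template and date_format variables combine, here is a minimal sketch that expands a config into the list of input files it implies. It is not the code used by scripts/convert.py: the helper name expand_filenames, the example paths and variable names, and the assumption that {end_date} is the last day covered by each file are all hypothetical.

```python
from datetime import date, timedelta


def next_month(d):
    """Return the first day of the month after d."""
    return date(d.year + d.month // 12, d.month % 12 + 1, 1)


def expand_filenames(cfg):
    """Return the file paths implied by a config dict (hypothetical helper)."""
    current = date(cfg["start_year"], cfg["start_month"], 1)
    # Exclusive upper bound: first day of the month after end_month.
    stop = next_month(date(cfg["end_year"], cfg["end_month"], 1))
    skip = set(cfg.get("skipdates", []))
    monthly = cfg["frequency"] == "M"

    paths = []
    while current < stop:
        nxt = next_month(current) if monthly else current + timedelta(days=1)
        # A file is skipped if its year or year-month is listed,
        # or (for daily files) its full date.
        keys = {current.strftime("%Y"), current.strftime("%Y-%m")}
        if not monthly:
            keys.add(current.strftime("%Y-%m-%d"))
        if not keys & skip:
            for var in cfg["varnames"]:
                name = cfg["filename"].format(
                    varname=var,
                    start_date=current.strftime(cfg["date_format"]),
                    # Assumption: end_date is the last day the file covers.
                    end_date=(nxt - timedelta(days=1)).strftime(cfg["date_format"]),
                )
                paths.append(f'{cfg["input_dir"]}/{name}')
        current = nxt
    return paths


# Hypothetical config values, for illustration only.
cfg = {
    "start_year": 2015, "start_month": 11,
    "end_year": 2016, "end_month": 2,
    "frequency": "M",
    "skipdates": ["2015-12"],
    "input_dir": "/data/example",
    "filename": "example_{varname}_{start_date}-{end_date}.nc",
    "varnames": ["tas", "precip"],
    "date_format": "%Y%m%d",
}
paths = expand_filenames(cfg)
for p in paths:
    print(p)
```

In this example December 2015 is skipped, so three months of files are listed for each of the two variables.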
The package contains integration tests for the various converters of the datasets listed above. These tests use real samples of the datasets that must first be downloaded and prepared using scripts/download_test_data.py. To run it, you must have login access to download the above datasets on the EIDC and then create a .env file containing your login details:
username=YOUR_USERNAME
password=YOUR_PASSWORD
You can then run:
uv run scripts/download_test_data.py
This will download a test file for each dataset and place them in data/. It will also create sub-samples from these files and add them to data-tiny/ - these will be used by the integration tests.
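The .env file described above is a plain list of KEY=VALUE lines. As a minimal sketch of how such a file can be read, the snippet below parses it with the standard library only. The helper name read_env_file is hypothetical, and scripts/download_test_data.py may well load the credentials differently (for example with a dedicated dotenv library).

```python
import tempfile
from pathlib import Path


def read_env_file(path=".env"):
    """Parse KEY=VALUE lines into a dict, ignoring blank lines and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Demonstration with a temporary file holding placeholder credentials.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("username=YOUR_USERNAME\npassword=YOUR_PASSWORD\n")
    creds = read_env_file(env_path)

print(sorted(creds))
```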
The integration tests (marked as @pytest.mark.integration) can be run using:
uv run pytest -m integration
These tests will convert using the recipes defined for each dataset and then check that the resulting output is as expected.
Note: These integration tests can also be run with the full file samples, but this may take several minutes.
THIS REPOSITORY IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS REPOSITORY, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
