| 1 | +# IOC Cleanup |
| 2 | + |
| 3 | +`ioc_cleanup` provides a reproducible, transparent, and traceable workflow for cleaning tide gauge (sea level) data from IOC (Intergovernmental Oceanographic Commission) stations worldwide. |
| 4 | + |
| 5 | + |
| 6 | + |
| 7 | +## Motivation & Concept |
| 8 | + |
| 9 | +Cleaning tide gauge data is often: |
| 10 | + * manual, |
| 11 | + * poorly documented, |
| 12 | + * hard to reproduce, |
| 13 | + * and difficult to review or share. |
| 14 | + |
| 15 | +This project proposes a community-driven, version-controlled approach to data cleaning, where all cleaning decisions are explicitly recorded and can be audited or improved over time. |
| 16 | + |
| 17 | +**What this approach enables** |
| 18 | + |
| 19 | + * Flagging timestamps or time ranges affected by: |
| 20 | + * bad or corrupt data |
| 21 | + * sensor breakpoints |
| 22 | + * singular phenomena (e.g. tsunamis, meteo-tsunamis, seiches, or unidentified events) |
| 23 | + * Fully reproducible cleaning |
| 24 | + * Transparent and traceable decisions stored in plain JSON |
| 25 | + * Peer review of cleaning decisions via GitHub |
| 26 | + * Easy extension to other datasets (e.g. GESLA, NDBC) |
| 27 | + * Gradual growth in station coverage through community contributions |
| 28 | + |
| 29 | +## Repository Overview |
| 30 | + |
| 31 | +This repository contains a set of Python routines to **clean IOC sea level data** using **declarative JSON transformations**. |
| 32 | + |
| 33 | +### Core idea |
| 34 | + |
| 35 | +The core asset of this repository is the set of **JSON files** located in `./transformations/`. |
| 36 | + |
| 37 | +Each JSON file describes: |
| 38 | + |
| 39 | + * the valid time window |
| 40 | + * dropped timestamps |
| 41 | + * dropped time ranges |
| 42 | + * breakpoints |
| 43 | + * metadata and notes |
| 44 | + |
| 45 | +Together, these JSON files define the transformation from **raw data to clean signal**. |
| 46 | + |
| 47 | +## Caveats and limitations |
| 48 | +Please be aware of the following: |
| 49 | + * ❌ This repository does NOT contain IOC data |
| 50 | + * Data download is not handled internally |
| 51 | + * Examples (in this `README` or in `tests`) use the [`searvey`](https://github.com/oceanmodeling/searvey) package |
| 52 | + * Step changes in data are currently only flagged via the `breakpoints` item in the JSON |
| 53 | + * No offset correction is applied |
| 54 | + * Vertical datums are not addressed |
| 55 | + * Distinguishing noise (e.g. boat wakes) from real physical events can be difficult for noisy sensors |
| 56 | + * Cleaning decisions are inherently subjective |
| 57 | + * Different operators may disagree on what should be discarded |
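Since step changes are only flagged and no offset correction is applied, a downstream user may want to handle each segment between breakpoints separately. The following is a minimal sketch, assuming a time-indexed pandas series; `split_at_breakpoints` is a hypothetical helper and not part of `ioc_cleanup`:

```python
import pandas as pd


def split_at_breakpoints(series: pd.Series, breakpoints: list[str]) -> list[pd.Series]:
    """Split a time-indexed series into segments delimited by breakpoints.

    Hypothetical helper (not part of ioc_cleanup): each breakpoint starts a
    new segment, and no offset correction is applied to any segment.
    """
    edges = sorted(pd.Timestamp(b) for b in breakpoints)
    segments = []
    remaining = series.sort_index()
    for edge in edges:
        before = remaining[remaining.index < edge]
        if not before.empty:
            segments.append(before)
        remaining = remaining[remaining.index >= edge]
    if not remaining.empty:
        segments.append(remaining)
    return segments


# Synthetic data: a step change starting at 2022-01-03
index = pd.date_range("2022-01-01", periods=4, freq="D")
levels = pd.Series([0.1, 0.2, 1.1, 1.2], index=index)
segments = split_at_breakpoints(levels, ["2022-01-03T00:00:00"])
```

Each returned segment can then be detrended or offset-corrected independently, according to the user's own needs.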
| 58 | + |
| 59 | + |
| 60 | +## Getting Started |
| 61 | +### Prerequisites |
| 62 | + * Python 3.11 (recommended). |
| 63 | + * **~24GB** of free disk space for storing raw and processed data. |
| 64 | + |
| 65 | +### Installation |
| 66 | + |
| 67 | +```bash |
| 68 | +git clone https://github.com/seareport/ioc_cleanup.git |
| 69 | +cd ioc_cleanup && pip install -r requirements.txt |
| 70 | +``` |
| 71 | + |
| 72 | +## Usage |
| 73 | +Example with one station: `abed` (Aberdeen), sensor `bub`: |
| 74 | +```python |
| 75 | +station = "abed" |
| 76 | +sensor = "bub" |
| 77 | +``` |
| 78 | + |
| 79 | +### Download Raw Data: |
| 80 | + |
| 81 | +```python |
| 82 | +import searvey |
| 83 | +df_raw = searvey.fetch_ioc_station(station, "2020-01-01", "2026-01-01") |
| 84 | +``` |
| 85 | + |
| 86 | +### Apply Cleaning Transformation: |
| 87 | + |
| 88 | +```python |
| 89 | +import ioc_cleanup as C |
| 90 | + |
| 91 | +trans = C.load_transformation_from_path( |
| 92 | + "../transformations/maya_pwl.json" |
| 93 | +) |
| 94 | +df_clean = C.transform(df, trans) |
| 95 | +``` |
| 96 | + |
| 97 | +Example for `maya` station: |
| 98 | + |
| 99 | + |
| 100 | + |
| 101 | +## Transformation Files (JSON) |
| 102 | +All transformation logic lives in `./transformations/`. |
| 103 | +### Example JSON: |
| 104 | +```json |
| 105 | +{ |
| 106 | + "ioc_code": "abed", |
| 107 | + "sensor": "bub", |
| 108 | + "notes": "", |
| 109 | + "skip": false, |
| 110 | + "wip": false, |
| 111 | + "start": "2020-01-01T00:00:00", |
| 112 | + "end": "2026-01-01T00:00:00", |
| 113 | + "high": null, |
| 114 | + "low": null, |
| 115 | + "dropped_date_ranges": [ |
| 116 | + ["2022-03-27 03:00:00", "2022-03-27 03:45:00"], |
| 117 | + ["2023-03-26 03:00:00", "2023-03-26 03:45:00"] |
| 118 | + ], |
| 119 | + "dropped_timestamps": [ |
| 120 | + "2022-09-30T14:45:00", |
| 121 | + "2022-09-30T15:30:00", |
| 122 | + "2022-10-02T06:45:00", |
| 123 | + "2022-10-02T07:00:00", |
| 124 | + "2023-06-21T00:15:00", |
| 125 | + "2024-04-24T11:00:00", |
| 126 | + "2024-09-07 12:00:00" |
| 127 | + ], |
| 128 | + "breakpoints": [] |
| 129 | +} |
| 130 | +``` |
| 131 | +#### Field descriptions |
| 132 | + |
| 133 | + * `ioc_code` : IOC station code |
| 134 | + * `sensor` : sensor identifier |
| 135 | + * `notes` : free-text comments |
| 136 | + * `skip` : skip this station entirely |
| 137 | + * `wip` : mark transformation as work-in-progress |
| 138 | + * `start`, `end` : valid data window |
| 139 | + * `high`, `low` : optional value thresholds |
| 140 | + * `dropped_date_ranges` : continuous time ranges to remove |
| 141 | + * `dropped_timestamps` : individual timestamps to remove |
| 142 | + * `breakpoints` : timestamps where sensor behavior changes |
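To make the role of each field concrete, here is a minimal sketch of how such a file could be applied to a time-indexed dataframe with pandas. It is illustrative only and is not the `ioc_cleanup` implementation of `transform`. The function name, the `level` column, the inclusive bounds, and the meaning given to `high` and `low` (discard values outside the thresholds) are all assumptions. A timezone-naive index is assumed as well.

```python
import pandas as pd


def apply_transformation_sketch(df: pd.DataFrame, trans: dict, column: str) -> pd.DataFrame:
    """Illustrative only: NOT the ioc_cleanup implementation of `transform`."""
    out = df.sort_index()
    # Keep only the valid window (assumed inclusive of `start` and `end`)
    start, end = pd.Timestamp(trans["start"]), pd.Timestamp(trans["end"])
    out = out[(out.index >= start) & (out.index <= end)]
    # Remove continuous ranges (assumed inclusive on both sides)
    for begin, stop in trans["dropped_date_ranges"]:
        in_range = (out.index >= pd.Timestamp(begin)) & (out.index <= pd.Timestamp(stop))
        out = out[~in_range]
    # Remove individual timestamps; ones absent from the data are ignored
    dropped = [pd.Timestamp(t) for t in trans["dropped_timestamps"]]
    out = out[~out.index.isin(dropped)]
    # Assumed semantics: discard values above `high` or below `low` when set
    if trans["high"] is not None:
        out = out[out[column] <= trans["high"]]
    if trans["low"] is not None:
        out = out[out[column] >= trans["low"]]
    return out


# Synthetic 15-minute data from 02:00 to 04:00 (hypothetical column name)
index = pd.date_range("2022-03-27 02:00", periods=9, freq="15min")
df = pd.DataFrame({"level": range(9)}, index=index)
trans = {
    "start": "2022-03-27T02:15:00",
    "end": "2022-03-27T04:00:00",
    "high": None,
    "low": None,
    "dropped_date_ranges": [["2022-03-27 03:00:00", "2022-03-27 03:45:00"]],
    "dropped_timestamps": ["2022-03-27T02:30:00"],
}
clean = apply_transformation_sketch(df, trans, column="level")
```

In this sketch `breakpoints` are deliberately left untouched, since they only flag step changes and do not remove data.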
| 143 | + |
| 144 | +## Downloading IOC Data in Bulk |
| 145 | +Shortcut functions are provided to download, load, and clean data. |
| 146 | + |
| 147 | +### Example: download all IOC stations for 2025 |
| 148 | + |
| 149 | + |
| 150 | +```python |
| 151 | +import ioc_cleanup as C |
| 152 | +ioc_all = C.get_meta() |
| 153 | +year = 2025 |
| 154 | +for station in ioc_all.ioc_code.tolist(): |
| 155 | + C.download_year_station(station, year, data_folder="../data") |
| 156 | +``` |
| 157 | +This downloads station data as Parquet files into: |
| 158 | +```bash |
| 159 | +./data/2025 |
| 160 | +``` |
| 161 | +### Important: files are archived in one folder per year, as follows: |
| 162 | +``` |
| 163 | +./data/ |
| 164 | +├── 2020 |
| 165 | +├── 2021 |
| 166 | +├── 2022 |
| 167 | +├── 2023 |
| 168 | +├── 2024 |
| 169 | +└── 2025 |
| 170 | +``` |
| 171 | +This per-year layout makes it easy to extend the cleaning to additional years in the future. |
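A quick way to check what has been downloaded is to walk the year folders. The sketch below uses only the standard library. `parquet_files_by_year` is a hypothetical helper, and the `.parquet` file extension is an assumption, since the file naming inside each year folder is not specified here.

```python
from pathlib import Path


def parquet_files_by_year(data_folder: str) -> dict[int, list[Path]]:
    """Map each year sub-folder (e.g. data/2025) to the Parquet files it holds.

    Hypothetical helper, not part of ioc_cleanup.
    """
    files = {}
    for year_dir in sorted(Path(data_folder).iterdir()):
        # Only consider sub-folders whose name is a year, such as "2020"
        if year_dir.is_dir() and year_dir.name.isdigit():
            # Assumption: downloaded files carry the ".parquet" extension
            files[int(year_dir.name)] = sorted(year_dir.glob("*.parquet"))
    return files
```

Calling `parquet_files_by_year("./data")` would then return one entry per year folder, which can be passed to `pandas.read_parquet` file by file.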
| 172 | + |
| 173 | +## Interactive Cleaning Dashboard |
| 174 | + |
| 175 | +### Run the dashboard |
| 176 | + |
| 177 | +```bash |
| 178 | +python -m panel serve dashboard/cleanup_dashboard.py |
| 179 | +``` |
| 180 | + |
| 181 | +You will be directed to this page: |
| 182 | + |
| 183 | + |
| 184 | + |
| 185 | +#### How stations are discovered |
| 186 | + |
| 187 | + * The station list is defined by files in `./transformations/` |
| 188 | + * To add a station, create a file following this convention: |
| 189 | + |
| 190 | +``` |
| 191 | +./transformations/<ioc_code>_<sensor>.json |
| 192 | +``` |
| 193 | + |
| 194 | +### Error handling |
| 195 | + |
| 197 | + |
| 198 | + |
| 199 | + |
| 200 | +### Dark mode |
| 201 | + |
| 202 | +To activate dark mode, click the switch in the top-right corner. |
| 203 | + |
| 204 | + |
| 205 | + |
| 206 | +## Contributing |
| 207 | + |
| 208 | +Contributions are very welcome! |
| 209 | + |
| 210 | +### How to contribute |
| 211 | + |
| 212 | + 1. Fork the repository |
| 213 | + 2. Add or update a JSON transformation file |
| 214 | + 3. Use the dashboard to clean or flag data |
| 215 | + 4. Submit a pull request with a clear description of your changes |
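Before opening a pull request, it can help to sanity-check a new or edited transformation file. The sketch below is a hypothetical pre-submission check and is not part of `ioc_cleanup`. It assumes that every field shown in the example JSON is required and that the file name follows `<ioc_code>_<sensor>.json`.

```python
import json
from datetime import datetime
from pathlib import Path

# Assumption: all fields from the example JSON are mandatory
REQUIRED_KEYS = {
    "ioc_code", "sensor", "notes", "skip", "wip", "start", "end",
    "high", "low", "dropped_date_ranges", "dropped_timestamps", "breakpoints",
}


def check_transformation(path: str) -> list[str]:
    """Return a list of problems found in a transformation file (empty if none)."""
    file = Path(path)
    trans = json.loads(file.read_text())
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(trans))]
    if problems:
        return problems
    # The file name should follow <ioc_code>_<sensor>.json
    expected_name = f"{trans['ioc_code']}_{trans['sensor']}.json"
    if file.name != expected_name:
        problems.append(f"file should be named {expected_name}")
    if datetime.fromisoformat(trans["start"]) >= datetime.fromisoformat(trans["end"]):
        problems.append("start must be earlier than end")
    for begin, end in trans["dropped_date_ranges"]:
        if datetime.fromisoformat(begin) > datetime.fromisoformat(end):
            problems.append(f"reversed range: {begin} > {end}")
    return problems
```

An empty list means none of these basic checks failed. It does not say anything about whether the cleaning decisions themselves are sound, which remains a matter for review.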
| 216 | + |
| 217 | +### Areas for improvement |
| 218 | + |
| 219 | + * Add more IOC stations |
| 220 | + * Extend the cleaned time range (currently 2020–2025) |