geebeam

Google Earth Engine and Apache Beam for building geospatial training datasets.

Purpose:

geebeam is a lightweight library for building and executing Apache Beam pipelines that download data "chips" from Google Earth Engine and write them to a variety of common data formats used by PyTorch, TensorFlow, and JAX (GeoTIFF, WebDataset, TensorFlow Dataset).

The user defines the Earth Engine images they want to download chips from using the Python earthengine-api. geebeam then serializes the graph definition of those images so it can be passed to the Beam workers.

The pipelines are automatically parallelized and can be run locally or on Google Cloud Dataflow.

Install:

pip install geebeam

Examples:

Running locally:

Here we'll create a cloud-free Landsat 5 composite for 2010, randomly sample 10 locations, and download them as GeoTIFFs. This should take only about 1 second per GeoTIFF, since Earth Engine is doing the heavy lifting behind the scenes.

import ee
import geebeam
import google.auth

# Get default project id from environment (or specify PROJECT_ID manually)
PROJECT_ID = google.auth.default()[1]

# Initialize ee client, replace with your GCP project ID
ee.Initialize(project=PROJECT_ID)

### Build image for download
# Load a raw Landsat 5 ImageCollection for a single year.
ls5_collection = ee.ImageCollection('LANDSAT/LT05/C02/T1').filterDate(
    '2010-01-01', '2010-12-31'
)
# Create a (mostly) cloud-free Landsat composite
ls5_composite = ee.Algorithms.Landsat.simpleComposite(
    ls5_collection,
    asFloat=True,
    cloudScoreRange=5)

# Building and triggering the pipeline is done with a single command:
geebeam.sample_and_run_pipeline(
    image_list=[ls5_composite], # Important: has to be a list of images
    sampling_region=ee.Geometry.Rectangle(-55.0, -12.0, -50.0, -16.0), # In central-west Brazil
    n_sample=10, # Number of tiles to sample
    patch_size=128, # Number of pixels in each direction
    scale=30, # Final export resolution in meters
    crs='EPSG:4326', # CRS for final output
    project=PROJECT_ID, # GCP Project ID
    output_path='./test_data/', # Output path, local or on GCP
    validation_ratio=0.2, # Fraction to select as validation data
)

Now let's add another dataset: MapBiomas land cover from the same year. For more information and the class legend, see MapBiomas Brasil.

# MapBiomas land-use/land-cover (LULC) classification for 2010
mb_lulc = (
    ee.Image('projects/mapbiomas-public/assets/brazil/lulc/collection10_1/mapbiomas_brazil_collection10_1_coverage_v1')
    .select('classification_2010')
)

# Exporting both together is as simple as this:
geebeam.sample_and_run_pipeline(
    image_list=[ls5_composite, mb_lulc],
    project=PROJECT_ID,
    crs='EPSG:4326',
    patch_size=128,
    scale=30,
    n_sample=10,
    validation_ratio=0.2,
    output_path='./test_data_w_mb/',
    sampling_region=ee.Geometry.Rectangle(-55.0, -12.0, -50.0, -16.0)
)
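
To get a feel for what patch_size=128 and scale=30 mean on the ground, the short sketch below works out the nominal chip footprint and compares it with the sampling rectangle used above. It is plain arithmetic, not part of geebeam: the helper names are hypothetical, and the 111,320 m per degree figure is a spherical approximation, so treat the region size as a rough estimate only.

```python
import math

# Assumption: rough length of one degree of latitude on a spherical Earth.
METERS_PER_DEGREE = 111_320.0


def chip_side_m(patch_size: int, scale_m: float) -> float:
    """Nominal side length of a square chip: pixels per side times meters per pixel."""
    return patch_size * scale_m


def region_size_m(west: float, south: float, east: float, north: float):
    """Approximate (width, height) in meters of a lon/lat rectangle.

    Width is shortened by cos(latitude) at the rectangle's mid-latitude.
    """
    mid_lat = math.radians((south + north) / 2.0)
    width = abs(east - west) * METERS_PER_DEGREE * math.cos(mid_lat)
    height = abs(north - south) * METERS_PER_DEGREE
    return width, height


# Values taken from the examples above.
side = chip_side_m(patch_size=128, scale_m=30)
width, height = region_size_m(west=-55.0, south=-16.0, east=-50.0, north=-12.0)

print(f"chip side: {side / 1000:.2f} km")  # 128 px * 30 m = 3.84 km
print(f"region: about {width / 1000:.0f} km by {height / 1000:.0f} km")
```

Each chip covers well under one percent of the region's width, so 10 randomly placed chips are unlikely to overlap much.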

Scaling up with DataFlow:

The export process can be scaled to many workers via Google Cloud DataFlow. First, write a script containing your geebeam.run_pipeline() command. Then execute it using the Beam DataFlow runner:

python examples/geebeam_run.py \
    --region=us-east1 \
    --worker_zone=us-east1-b \
    --runner=DataflowRunner \
    --max_num_workers=8 \
    --experiments=use_runner_v2 \
    --temp_location=gs://[your-bucket]/[path_to_temp_dir] \
    --machine_type=n2-highmem-2 \
    --sdk_container_image=kysolvik/geebeam:[geebeam-version]

Note that in this case the output_path in run_pipeline() should be a Google Cloud Storage path. Make sure to replace [geebeam-version] in the sdk_container_image URI with the version number installed on your system (python -c "import geebeam;print(geebeam.__version__)"). You can also build your own Docker image to run on. More info in the DataFlow docs.
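
If you would rather not paste the version number by hand, the sketch below builds the --sdk_container_image value from the locally installed package. The helper names are hypothetical and not part of geebeam. It also assumes that the version reported by importlib.metadata matches geebeam.__version__ and that the image tag is exactly that version string, as the command above suggests.

```python
from importlib import metadata
from typing import Optional


def sdk_container_image(version: str, repo: str = "kysolvik/geebeam") -> str:
    """Return the image reference to pass to --sdk_container_image."""
    return f"{repo}:{version}"


def installed_geebeam_version() -> Optional[str]:
    """Return the installed geebeam version, or None if it is not installed."""
    try:
        return metadata.version("geebeam")
    except metadata.PackageNotFoundError:
        return None


version = installed_geebeam_version()
if version is None:
    print("geebeam is not installed in this environment")
else:
    print(f"--sdk_container_image={sdk_container_image(version)}")
```

The printed flag can be copied into the command above in place of the [geebeam-version] placeholder.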

See the Apache Beam and Google Cloud DataFlow docs for full documentation, e.g. pipeline command-line options.

Common DataFlow gotchas

  1. Before running, you must enable the DataFlow API on Google Cloud Console.

  2. You can test your pipeline script (e.g. geebeam_run.py) and Beam options using the DirectRunner before submitting to DataFlow:

python examples/geebeam_run.py \
    --runner=DirectRunner

  3. For more common errors, see the Google Cloud DataFlow troubleshooting guide and geebeam's documentation.

Alternatives:

  • GeeFlow: Google DeepMind's GeeFlow fulfills a similar purpose. It is more flexible, allowing more user control over data processing, reprojection, and writing, but it is slower and no longer actively maintained. geebeam aims to meet most users' needs: it is designed to be easier and quicker to use, but supports a more limited set of data transformations.
  • Export training data to Google Cloud Storage then download chips from there: This works, but if you need to get data from many different datasets it's slow to export all that data to Cloud Storage and can be expensive to store it there if you don't delete it quickly. This also uses unnecessary Earth Engine compute hours, which are now subject to stricter monthly limits.
  • Xee: Xee is a great package that allows for accessing Earth Engine objects as xarray.Datasets. You could use this to define an xarray.Dataset and download "chips" from it, but geebeam interfaces with Beam to automatically parallelize this task and export to a variety of common data formats (GeoTIFF, WebDataset, TensorFlow Datasets).

Disclaimer:

geebeam is a third-party library and is not affiliated with or endorsed by Google or the Apache Software Foundation.
