google-deepmind/geeflow

Python

Published: Feb 1, 2025

Description: GeeFlow - generate and process large-scale geospatial datasets with Google Earth Engine.

Language: Python

License: Apache-2.0

Stars: 122

Forks: 15

Open issues: 2

Created: 2025-02-01T13:39:53Z

Pushed: 2026-05-28T07:25:58Z

Default branch: main

Fork: no

Archived: no

README:

GeeFlow

GeeFlow is a library for generating large-scale geospatial datasets using Google Earth Engine (GEE). It contains utils, configs, and pipeline launch scripts for this purpose. The focus is on supporting geospatial AI research, and it does not aim to be a production-ready utility.

The generated datasets conform to the TFDS format and can be used directly in TFDS and tf.data.Dataset data pipelines.

What can it be used for:

  • Creating small- and large-scale datasets, supervised and unsupervised, ready for ingestion into geospatial AI model training (with standard and robust statistics precomputed).
  • Creating inference maps, up to global scale, at any resolution.
  • Supporting any type of geospatial satellite and remote sensing data and label data from Google Earth Engine.
  • Arbitrary spatial and temporal resolution, and sampling of data sources.
  • Tooling for sampling and inference maps generation.

What is out of scope:

  • Model training and inference. Pick your favorite framework: for JAX/Flax we use Jeo; for PyTorch, check out TorchGeo.
  • Interactive visualization and analysis of Google Earth Engine data. Check out, for example, the amazing geemap for Python-based analysis, or explore GEE's own JavaScript-based EE Code Editor.
  • Datasets repository. Check out e.g. Hugging Face, TFDS Catalog, TorchGeo Datasets.

An example workflow from geospatial datasets generation, to model training and evaluation, to inference and global inference maps creation:

Howto

For research and quick exploration, GeeFlow is designed to be lean and flexible, making it easy to use and configure for arbitrary data sources. Additionally, we prioritize reproducibility and versioning/bookkeeping of your data. Another important factor is scalability and efficiency when generating the data.

Configuration

To provide the needed flexibility, dataset configurations are specified as ML Collections ConfigDict config files in the [geeflow/configs/](geeflow/configs/) directory.

A configuration is usually split into two parts: *labels configuration* and *sources configuration* (see also the examples in [geeflow/configs/](geeflow/configs/)).

Labels configuration:

  • At a minimum, one has to provide a CSV (or parquet) file with the locations of the samples (lat, lon columns) and image sizes in meters (for UTM projected samples) or in degrees (for spherical CRS).
  • If other columns (for example image-level labels or metadata) have to be included in the generated dataset samples, one can list them in the meta_keys field.
  • Optionally one can specify the default resolution per pixel (default_scale) for all sources (which is overridden if a source specifies its own scale), and the reference maximal pixel size in meters (max_cell_size_m) for proper gridding of multi-scale sources.
  • If separate training splits are to be generated (e.g. train, val, test), either a column with the split name per sample should be included (in meta_keys), or the splits will be randomly generated by geographically separate cell splitting based on the selected S2 geometry scale level.
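The geographically separate splitting mentioned in the last item can be illustrated without any S2 library. The sketch below is a simplified stand-in and not GeeFlow's implementation: it uses a plain latitude/longitude grid instead of S2 cells, and the function names, the 1-degree cell size, and the 80/10/10 fractions are all hypothetical. The point it shows is that the split is decided per cell, so nearby samples never end up in different splits.

```python
import hashlib
import math


def grid_cell_id(lat, lon, cell_deg=1.0):
    """Returns the (row, col) index of the lat/lon grid cell containing a point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def assign_split(lat, lon, cell_deg=1.0,
                 fractions=(("train", 0.8), ("val", 0.1), ("test", 0.1))):
    """Maps a location to a split name; all points in one cell get the same split."""
    row, col = grid_cell_id(lat, lon, cell_deg)
    # Hash the cell index so the assignment is deterministic and reproducible.
    digest = hashlib.sha256(f"{row},{col}".encode()).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if u < cumulative:
            return name
    return fractions[-1][0]  # Guard against floating-point rounding.
```

Because the assignment depends only on the cell index, rerunning the sketch on the same locations yields the same splits, which fits the reproducibility goal stated above.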

Example:

import ml_collections

labels = ml_collections.ConfigDict()
labels.path = "data/demo_labels.csv"
labels.img_width_m = 240 # Image width and height of 240 meters on each side.
labels.max_cell_size_m = 30 # Reference maximal pixel size in meters.
labels.meta_keys = ("lat", "lon", "split")
labels.num_max_samples = 10 # Only for debugging, limiting the number of generated examples.
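For orientation, a labels file of the kind that labels.path points to could be produced as follows. This is a minimal sketch using only the standard library: the lat and lon column names come from the description above and split from the example meta_keys, while the coordinates, the helper names, and the file layout beyond those columns are made up for illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ("lat", "lon", "split")

# Made-up sample locations, for illustration only.
DEMO_ROWS = [
    {"lat": 46.95, "lon": 7.45, "split": "train"},
    {"lat": -1.29, "lon": 36.82, "split": "val"},
    {"lat": 35.68, "lon": 139.69, "split": "test"},
]


def write_labels_csv(path, rows):
    """Writes one row per sample location with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_labels_csv(path):
    """Reads the file back, converting coordinates to floats."""
    with open(path, newline="") as f:
        return [
            {"lat": float(r["lat"]), "lon": float(r["lon"]), "split": r["split"]}
            for r in csv.DictReader(f)
        ]


def roundtrip_demo():
    """Writes the demo rows to a temporary file and reads them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_labels.csv")
        write_labels_csv(path, DEMO_ROWS)
        return read_labels_csv(path)
```

Calling write_labels_csv("data/demo_labels.csv", DEMO_ROWS) would create a file matching the path used in the config above, provided the data directory exists.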

Sources configuration: This part contains named sources that can be found in GEE.

  • For every location x specified in *labels configuration* and for each specified source s, an image tensor is created with shape (T_s,H_s,W_s,C_s) (temporal size, height, width, number of channels), where T_s or C_s could be absent and any dimension could be 1. The dimensions can differ from source to source depending on the specified spatial resolution (scale), temporal sampling, and the selected channels.
  • Usually one defines at least the source class (from [geeflow/ee_data.py](geeflow/ee_data.py)), where one can provide additional options (such as data mode) via the kw field. If a source class is not defined in ee_data, one can always use a CustomImage, CustomIC, or CustomFC and set all values explicitly (like asset_name).
  • Other fields include scale (resolution per pixel in meters), select (which bands to include, in the given order), sampling_kw (keyword arguments for how to aggregate multiple images within a time range), and others.
  • Date ranges (date_ranges) are given as a list of time ranges over which to aggregate the data for each returned time sample. Each date range is specified by a tuple of the form (start date, number of months to aggregate over, number of days to aggregate over).
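To make the per-source tensor shape concrete, the following sketch estimates (T_s, H_s, W_s, C_s) from the image width, the pixel scale, the number of date ranges, and the number of selected bands. It assumes the side length in pixels is simply the image width divided by the scale, which may differ from GeeFlow's actual gridding logic (for instance where max_cell_size_m is involved). The helper names are hypothetical, and resolve_scale only mirrors the rule that a source's own scale takes precedence over default_scale.

```python
def resolve_scale(source, default_scale):
    """Returns the source's own scale if set, otherwise the default scale."""
    return source.get("scale", default_scale)


def expected_shape(img_width_m, scale_m, num_dates=None, num_bands=None):
    """Estimates the tensor shape for a square image of the given width.

    num_dates and num_bands may be None, in which case the temporal or the
    channel dimension is left out of the returned shape.
    """
    side = round(img_width_m / scale_m)  # Assumed: pixels = width / scale.
    shape = (side, side)
    if num_dates is not None:
        shape = (num_dates,) + shape
    if num_bands is not None:
        shape = shape + (num_bands,)
    return shape
```

Under these assumptions, a 240 m image at a 10 m scale with two date ranges and three bands gives expected_shape(240, 10, 2, 3) == (2, 24, 24, 3), while the same image at a 30 m scale with no temporal or channel dimension gives (8, 8).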

Example:

sources = ml_collections.ConfigDict()

sources.s2 = utils.get_source_config("Sentinel2", "ic")
sources.s2.kw.mode = "L2A"
sources.s2.scale = 10
sources.s2.select = ["B3", "B2", "B1"]
sources.s2.sampling_kw.reduce_fn = "median"
sources.s2.sampling_kw.cloud_mask_fn = ee_data.Sentinel2.im_cloud_score_plus_mask
sources.s2.date_ranges = [("2023-01-01", 12, 0), ("2024-01-01", 12, 0)] # 2 annual samples

sources.s1 = utils.get_source_config("Sentinel1", "ic")
sources.s1.kw = {"mode": "IW", "pols": ("VV", "VH"), "orbit": "both"}
sources.s1.scale = 10
sources.s1.sampling_kw.reduce_fn = "mean"
sources.s1.date_ranges = [("2023-01-01", 3, 0), ("2023-04-01", 3, 0)] # 2 seasonal samples

sources.elevation = utils.get_source_config("NasaDem", "im")
sources.elevation.scale = 30…
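The date_ranges tuples used above, such as ("2023-01-01", 12, 0), can be read as a start date plus a duration. The sketch below expands such a tuple into a start and end date using only the standard library. It assumes the end date is the start date plus the given months and then the given days, and that the day of month is clamped when the target month is shorter. How GeeFlow itself resolves these tuples, including whether the end is inclusive, is not stated here, and the function names are hypothetical.

```python
import calendar
import datetime


def add_months(d, months):
    """Adds calendar months to a date, clamping the day to the month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return d.replace(year=year, month=month0 + 1, day=min(d.day, last_day))


def expand_date_range(start, months, days):
    """Turns (start date, months, days) into a (start, end) pair of dates."""
    start_date = datetime.date.fromisoformat(start)
    end_date = add_months(start_date, months) + datetime.timedelta(days=days)
    return start_date, end_date
```

With this reading, the two annual Sentinel-2 ranges cover 2023-01-01 to 2024-01-01 and 2024-01-01 to 2025-01-01, and the two seasonal Sentinel-1 ranges cover 2023-01-01 to 2023-04-01 and 2023-04-01 to 2023-07-01.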
