Repo | Amazon (Nova) | published Jul 29, 2025

amazon-science/h3-indexer

Python


amazon-science/h3-indexer

Description: The h3-indexer is an open source package for indexing geospatial data using PySpark, Apache Sedona and the H3 hierarchical spatial indexing system. The h3-indexer maps any number of vector-type geospatial data sets to H3 grids for efficient spatial analysis and querying.

Language: Python

License: Apache-2.0

Stars: 17

Forks: 2

Open issues: 0

Created: 2025-07-29T21:22:56Z

Pushed: 2025-08-19T01:43:12Z

Default branch: main

Fork: no

Archived: no

README:

h3-indexer

The h3-indexer is an open source package for indexing geospatial data using PySpark, Apache Sedona and Uber's open source H3 hierarchical spatial indexing system. The h3-indexer maps any number of vector-type geospatial data sets to H3 grids for efficient spatial analysis and querying.

![H3 Data Flow](data_flow.png)

The h3-indexer contains 3 stages, and users can [provide command line arguments](#usage) to run the stages one at a time, or all together:

  1. [Validator](#input-requirements)
  2. [Indexer](#methodology-indexer)
  3. [Resolver](#methodology-resolver)

Features

  • Supports vector point, line & polygon data types
  • Supports the following inputs:
      • Parquet files & shapefiles in AWS S3
      • AWS Glue catalog tables
  • Outputs are written to AWS S3 in parquet format
  • Configurable H3 resolution (3-10)
  • PySpark for H3 indexing operations using Apache Sedona
  • YAML & JSON-based configuration supported

Developer Setup

Developer setup is currently supported only on Linux ARM64 machines (not on x86_64 or macOS).

Versions:

This tool requires the following versions:

  • Python: 3.10
  • AWS Glue: 4.0
  • Spark: 3.3.0
  • Apache Sedona: 1.7.1

Setup

Run the following commands from inside the h3-indexer root directory to set up your run environment:

chmod +x scripts/env_setup.sh
source ./scripts/env_setup.sh

When the script finishes running, it prints the environment variable paths that you should set in your .env file, for example:

Your SPARK_HOME path is
Your GLUE_JARS path is
Your JAVA_HOME path is

Environment

Set the following environment variables in a .env file. You can copy the .sample.env file to .env and update it accordingly. The paths for SPARK_HOME, GLUE_JARS and JAVA_HOME are printed by env_setup.sh. Set the 3 Python environment variables to the path of your Python interpreter.

  • PYTHONPATH
  • PYSPARK_PYTHON
  • PYSPARK_DRIVER_PYTHON
  • SPARK_HOME
  • GLUE_JARS
  • JAVA_HOME
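Before launching a job, it can help to confirm that all six variables are actually visible to the process. The following is a minimal sketch, not part of the h3-indexer package: the helper name missing_env_vars is hypothetical, and it assumes the .env file has already been loaded into the environment (for example by your IDE or shell).

```python
import os

# Variable names taken from the list above.
REQUIRED_ENV_VARS = (
    "PYTHONPATH",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "SPARK_HOME",
    "GLUE_JARS",
    "JAVA_HOME",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of relying on os.environ keeps the check easy to test without touching the real environment.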

Python

Required version is Python 3.10. You will need to install the Python libraries in the requirements.txt file and update the Python environment variables to the path of your interpreter.

Here are example commands for Python environment setup using conda:

conda create -n h3 python=3.10
conda activate h3

First, with the h3 python environment activated, install aws-glue-libs:

cd ~/h3-indexer-env/aws-glue-libs
python -m pip install -e .

Then, cd back into the h3-indexer repository and run:

conda install -c conda-forge --file requirements.txt

Ensure that all requirements installed successfully by running (with the h3 python environment activated):

python
>>> import awsglue
>>> import pyspark

You can get the path of your python interpreter for the .env file environment variables (PYTHONPATH, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON) by running:

which python

AWS S3 & AWS Glue Catalog

You'll need to have proper credentials and permissions set up to access the files in the S3 bucket or the tables in the AWS Glue Catalog that are included in your config.

Debugging

There is a sample_job.yaml in the configs directory. Here's a sample launch.json configuration entry to use for debugging with that test config:

{
    "name": "main",
    "type": "debugpy",
    "request": "launch",
    "program": "path-to/h3-indexer/src/h3-indexer/src/main.py",
    "console": "integratedTerminal",
    "args": ["--yaml-path", "path-to/h3-indexer/src/h3-indexer/configs/sample_job.yaml"],
    "envFile": "path-to/h3-indexer/src/h3-indexer/.env"
}

Usage

Run the H3 Indexer with a YAML configuration file:

python main.py --yaml-path configs/sample_job.yaml --run-all

Args

The following args are available:

Inputs:

  • --yaml-path: File path to the YAML configuration.
  • --json-input: JSON input.

Run Type:

  • --validate-only: Only validate the input config. (Validate)
  • --index-only: Validate the input config and run the indexer. (Validate -> Index)
  • --run-all: Optional flag; this is the default behavior. Run all stages. (Validate -> Index -> Resolve)
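To make the relationship between the run-type flags and the stages concrete, here is a minimal argparse sketch. It is an illustration only, not the package's actual main.py: the functions build_parser and stages_for are hypothetical, and treating the three run-type flags as mutually exclusive is an assumption that the README does not state.

```python
import argparse


def build_parser():
    """Build a parser with the flags listed above (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--yaml-path", help="File path to the YAML configuration.")
    parser.add_argument("--json-input", help="JSON input.")

    # Assumption: only one run-type flag is given at a time.
    run_type = parser.add_mutually_exclusive_group()
    run_type.add_argument("--validate-only", action="store_true")
    run_type.add_argument("--index-only", action="store_true")
    run_type.add_argument("--run-all", action="store_true")
    return parser


def stages_for(args):
    """Map parsed flags to the ordered list of stages to run."""
    if args.validate_only:
        return ["validate"]
    if args.index_only:
        return ["validate", "index"]
    # --run-all, or no run-type flag at all, runs every stage.
    return ["validate", "index", "resolve"]


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(stages_for(parsed))
```

Because running all stages is the default, omitting every run-type flag gives the same stage list as passing --run-all.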

Configuration

The YAML configuration file defines the job parameters including input data sources, H3 resolution, and output location.

Required Fields

  • name: Project name (string)
  • version: Version number in semantic format #.#.# (string)
  • h3_resolution: H3 resolution level, supported range 3-10 (integer)
  • output_s3_path: AWS S3 path for output storage (string)
  • inputs: Dictionary of input data sources. Each input in the inputs dictionary can either be a Vector or Raster input type. See below for more details on required parameters for each input type.
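The required fields above can be checked with a few lines of plain Python. This is a rough sketch of such a check, not the Validator stage itself: the function validate_job_config is hypothetical, and the sample values (job name, bucket, input name) are placeholders invented for illustration.

```python
import re

# "#.#.#" as described for the version field.
SEMVER_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")
REQUIRED_FIELDS = ("name", "version", "h3_resolution", "output_s3_path", "inputs")


def validate_job_config(config):
    """Return a list of problems; an empty list means these checks passed."""
    errors = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if field not in config
    ]
    if errors:
        return errors

    if not isinstance(config["name"], str):
        errors.append("name must be a string")
    version = config["version"]
    if not isinstance(version, str) or not SEMVER_PATTERN.match(version):
        errors.append("version must be a string in #.#.# format")
    resolution = config["h3_resolution"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(resolution, bool) or not isinstance(resolution, int) or not 3 <= resolution <= 10:
        errors.append("h3_resolution must be an integer from 3 to 10")
    if not isinstance(config["output_s3_path"], str):
        errors.append("output_s3_path must be a string")
    if not isinstance(config["inputs"], dict):
        errors.append("inputs must be a dictionary")
    return errors


# Placeholder values for illustration only.
sample_config = {
    "name": "sample_job",
    "version": "1.0.0",
    "h3_resolution": 8,
    "output_s3_path": "s3://example-bucket/h3-output/",
    "inputs": {"roads": {"type": "vector"}},
}

if __name__ == "__main__":
    print(validate_job_config(sample_config))
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a config at once.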

Vector Data

For vector inputs (type: "vector"):

  • type: Must be "vector"
  • glue_catalog_database_name: AWS Glue Catalog database name - must be provided with glue_catalog_table_name.
  • glue_catalog_table_name: AWS Glue Catalog table name - must be provided with glue_catalog_database_name.
  • where_clause: SQL where clause to filter the AWS Glue catalog table. Only applicable with AWS Glue catalog source.
  • s3_path: Path to input data in AWS S3
  • unique_id: Column name containing unique identifier
  • geometry_type: Type of geometry ("POINT", "LINE", or "POLYGON")
  • geometry_column_name: Name of the geometry column.
  • method: Processing method ("PCT_LENGTH" for lines, "PCT_AREA" for polygons)
  • input_columns: List of columns to include in output

When providing an AWS Glue Catalog table, both glue_catalog_database_name and glue_catalog_table_name must be provided. These parameters are mutually exclusive with s3_path: an input that specifies an AWS S3 path cannot also include AWS Glue Catalog parameters, and an input that specifies AWS Glue Catalog parameters cannot also include an AWS S3 path.
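The source rules for a vector input can be expressed as a small check. The sketch below is illustrative only and is not taken from the package: the function input_source_errors is hypothetical, and requiring exactly one source (S3 or Glue) is an assumption inferred from the rules above.

```python
GLUE_KEYS = ("glue_catalog_database_name", "glue_catalog_table_name")


def input_source_errors(vector_input):
    """Return a list of source-related problems for one vector input."""
    errors = []
    glue_present = [key for key in GLUE_KEYS if vector_input.get(key)]
    has_s3 = bool(vector_input.get("s3_path"))

    if len(glue_present) == 1:
        errors.append(
            "glue_catalog_database_name and glue_catalog_table_name "
            "must be provided together"
        )
    if glue_present and has_s3:
        errors.append("s3_path cannot be combined with AWS Glue Catalog parameters")
    # Assumption: every input needs exactly one data source.
    if not glue_present and not has_s3:
        errors.append("provide either s3_path or both AWS Glue Catalog parameters")
    if vector_input.get("where_clause") and not glue_present:
        errors.append("where_clause only applies to an AWS Glue Catalog source")
    return errors


if __name__ == "__main__":
    # Placeholder bucket and database names for illustration.
    print(input_source_errors({"s3_path": "s3://example-bucket/roads/"}))
    print(input_source_errors({"glue_catalog_database_name": "example_db"}))
```

Running the script prints an empty list for the S3-only input and a single message for the input that names a database without a table.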

For POINT inputs only, the geometry_column_name parameter can be replaced with these 2 parameters: -…

Excerpt shown — open the source for the full document.
