amazon-science/h3-indexer
Python
Description: The h3-indexer is an open source package for indexing geospatial data using PySpark, Apache Sedona and the H3 hierarchical spatial indexing system. The h3-indexer maps any number of vector-type geospatial data sets to H3 grids for efficient spatial analysis and querying.
Language: Python
License: Apache-2.0
Stars: 17
Forks: 2
Open issues: 0
Created: 2025-07-29T21:22:56Z
Pushed: 2025-08-19T01:43:12Z
Default branch: main
Fork: no
Archived: no
README:
h3-indexer
The h3-indexer is an open source package for indexing geospatial data using PySpark, Apache Sedona and Uber's open source H3 hierarchical spatial indexing system. The h3-indexer maps any number of vector-type geospatial data sets to H3 grids for efficient spatial analysis and querying.

The h3-indexer contains 3 stages, and users can [provide command line arguments](#usage) to run the stages one at a time, or all together.
1. [Validator](#input-requirements)
2. [Indexer](#methodology-indexer)
3. [Resolver](#methodology-resolver)
Features
- Supports vector point, line & polygon data types
- Supports the following inputs:
  - Parquet files & shapefiles in AWS S3
  - AWS Glue catalog tables
- Outputs are written to AWS S3 in parquet format
- Configurable H3 resolution (3-10)
- PySpark for H3 indexing operations using Apache Sedona
- YAML & JSON-based configuration supported
Developer Setup
Developer setup is currently supported only on Linux ARM64 machines (not on x86_64 or macOS).
Versions:
This tool requires the following versions:
- Python: 3.10
- AWS Glue: 4.0
- Spark: 3.3.0
- Apache Sedona: 1.7.1
Setup
Run the following commands from inside the h3-indexer root directory to set up your run environment:
chmod +x scripts/env_setup.sh
source ./scripts/env_setup.sh
When this executable finishes running, it will print out the environment variable paths that you should set in your .env file - for example:
Your SPARK_HOME path is
Your GLUE_JARS path is
Your JAVA_HOME path is
Environment
You need to include the following environment variables in a .env file. You can copy the .sample.env file to a .env file and update it accordingly. The paths for SPARK_HOME, GLUE_JARS and JAVA_HOME are printed by the env_setup.sh script. Set the 3 Python environment variables to the path of your Python interpreter.
- PYTHONPATH
- PYSPARK_PYTHON
- PYSPARK_DRIVER_PYTHON
- SPARK_HOME
- GLUE_JARS
- JAVA_HOME
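As a quick sanity check before launching a job, the list above can be compared against the contents of a .env file. The following is a minimal sketch, not part of h3-indexer: the helper names `parse_env_text` and `missing_vars` are hypothetical, and it assumes the .env file uses plain `KEY=VALUE` lines.

```python
from pathlib import Path

# The six variables listed in this README.
REQUIRED_VARS = [
    "PYTHONPATH",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "SPARK_HOME",
    "GLUE_JARS",
    "JAVA_HOME",
]


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(text):
    """Return the required variable names that are absent or empty."""
    values = parse_env_text(text)
    return [name for name in REQUIRED_VARS if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed location: the current directory
    if env_path.exists():
        print("Missing variables:", missing_vars(env_path.read_text()) or "none")
    else:
        print(".env not found in the current directory")
```

Running it from the directory that holds the .env file prints any variables that still need values.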
Python
Required version is Python 3.10. You will need to install the Python libraries in the requirements.txt file and update the Python environment variables to the path of your interpreter.
Here are example commands for Python environment setup using conda:
conda create -n h3 python=3.10
conda activate h3
First, with the h3 python environment activated, install aws-glue-libs:
cd ~/h3-indexer-env/aws-glue-libs
python -m pip install -e .
Then, cd back into the h3-indexer repository and run:
conda install -c conda-forge --file requirements.txt
Ensure that all requirements installed successfully by running (with the h3 python environment activated):
python
>>> import awsglue
>>> import pyspark
You can get the path of your python interpreter for the .env file environment variables (PYTHONPATH, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON) by running:
which python
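The same path is also available from inside Python as `sys.executable`. The sketch below prints the three Python-related lines in a form that can be pasted into the .env file. The helper name `python_env_lines` is hypothetical, and the sketch assumes, as described above, that all three variables take the interpreter path.

```python
import sys


def python_env_lines(interpreter=sys.executable):
    """Build .env lines that point the three Python variables at one interpreter."""
    names = ("PYTHONPATH", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")
    return [f"{name}={interpreter}" for name in names]


# Run this with the h3 environment activated so sys.executable is the right interpreter.
for line in python_env_lines():
    print(line)
```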
AWS S3 & AWS Glue Catalog
You'll need to have proper credentials and permissions set up to access the files in the S3 bucket or the tables in the AWS Glue Catalog that are included in your config.
Debugging
There is a sample_job.yaml in the configs directory. Here's a sample launch.json file to use for debugging with a test config:
{
    "name": "main",
    "type": "debugpy",
    "request": "launch",
    "program": "path-to/h3-indexer/src/h3-indexer/src/main.py",
    "console": "integratedTerminal",
    "args": ["--yaml-path", "path-to/h3-indexer/src/h3-indexer/configs/sample_job.yaml"],
    "envFile": "path-to/h3-indexer/src/h3-indexer/.env"
}
Usage
Run the H3 Indexer with a YAML configuration file:
python main.py --yaml-path configs/sample_job.yaml --run-all
Args
The following args are available:
Inputs:
- --yaml-path: File path to the YAML file.
- --json-input: JSON input.
Run Type:
- --validate-only: Only validate the input config. (Validate)
- --index-only: Only validate the input config and run the indexer. (Validate -> Index)
- --run-all: Optional flag; this is the default behavior. Runs all stages. (Validate -> Index -> Resolve)
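To make the relationship between the flags and the stages concrete, here is a minimal `argparse` sketch. It is not the project's actual parser: `build_parser` and `stages_for` are hypothetical names, and treating the input flags and the run-type flags as mutually exclusive groups is an assumption that the README does not state.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the flags listed above (not the real one)."""
    parser = argparse.ArgumentParser(description="Sketch of the h3-indexer CLI flags")

    # Assumption: exactly one config source is given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--yaml-path", help="File path to the YAML config")
    source.add_argument("--json-input", help="JSON input")

    # Assumption: at most one run-type flag is given.
    run_type = parser.add_mutually_exclusive_group()
    run_type.add_argument("--validate-only", action="store_true")
    run_type.add_argument("--index-only", action="store_true")
    run_type.add_argument("--run-all", action="store_true")
    return parser


def stages_for(args):
    """Map parsed flags to the ordered stages described above."""
    if args.validate_only:
        return ["validate"]
    if args.index_only:
        return ["validate", "index"]
    # --run-all, or no run-type flag at all, runs every stage.
    return ["validate", "index", "resolve"]


example = build_parser().parse_args(["--yaml-path", "configs/sample_job.yaml", "--index-only"])
print(stages_for(example))
```

Because running all stages is the default, omitting the run-type flag yields the same stage list as passing `--run-all`.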
Configuration
The YAML configuration file defines the job parameters including input data sources, H3 resolution, and output location.
Required Fields
- name: Project name (string)
- version: Version number in semantic format #.#.# (string)
- h3_resolution: H3 resolution level, supported range 3-10 (integer)
- output_s3_path: AWS S3 path for output storage (string)
- inputs: Dictionary of input data sources.
Each input in the inputs dictionary can be either a Vector or Raster input type. See below for more details on the required parameters for each input type.
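The constraints on the required fields can be expressed as a small check over an already-loaded config dictionary. This is only an illustrative sketch and not the project's Validator stage: the function name `validate_required_fields`, the error messages, and the example values (including the bucket name) are hypothetical.

```python
import re

# "#.#.#" interpreted as three dot-separated integers.
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")
REQUIRED_FIELDS = ("name", "version", "h3_resolution", "output_s3_path", "inputs")


def validate_required_fields(config):
    """Return a list of human-readable problems; an empty list means no problems found."""
    missing = [f"missing required field: {key}" for key in REQUIRED_FIELDS if key not in config]
    if missing:
        return missing

    errors = []
    if not isinstance(config["name"], str):
        errors.append("name must be a string")
    version = config["version"]
    if not isinstance(version, str) or not VERSION_PATTERN.fullmatch(version):
        errors.append("version must be a string in #.#.# format")
    resolution = config["h3_resolution"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(resolution, bool) or not isinstance(resolution, int) or not 3 <= resolution <= 10:
        errors.append("h3_resolution must be an integer from 3 to 10")
    if not isinstance(config["output_s3_path"], str):
        errors.append("output_s3_path must be a string")
    if not isinstance(config["inputs"], dict):
        errors.append("inputs must be a dictionary")
    return errors


example_config = {
    "name": "demo-project",
    "version": "1.0.0",
    "h3_resolution": 8,
    "output_s3_path": "s3://example-bucket/h3-output/",  # hypothetical bucket
    "inputs": {},
}
print(validate_required_fields(example_config))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a config at once.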
Vector Data
For vector inputs (type: "vector"):
- type: Must be "vector"
- glue_catalog_database_name: AWS Glue Catalog database name - must be provided with glue_catalog_table_name.
- glue_catalog_table_name: AWS Glue Catalog table name - must be provided with glue_catalog_database_name.
- where_clause: SQL where clause to filter the AWS Glue catalog table. Only applicable with AWS Glue catalog source.
- s3_path: Path to input data in AWS S3
- unique_id: Column name containing unique identifier
- geometry_type: Type of geometry ("POINT", "LINE", or "POLYGON")
- geometry_column_name: Name of the geometry column.
- method: Processing method ("PCT_LENGTH" for lines, "PCT_AREA" for polygons)
- input_columns: List of columns to include in output
When providing an AWS Glue Catalog table, both glue_catalog_database_name and glue_catalog_table_name must be provided. These parameters are mutually exclusive with s3_path: when providing an AWS S3 path, you cannot also include AWS Glue Catalog parameters, and vice versa.
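The source rules in the paragraph above can be sketched as a small check on a single vector input. This is an illustration only, not the project's validation code: `check_vector_source` is a hypothetical name, and reporting an error when no source at all is given is an assumption.

```python
def check_vector_source(vector_input):
    """Return a list of problems with how the input's data source is specified."""
    has_db = bool(vector_input.get("glue_catalog_database_name"))
    has_table = bool(vector_input.get("glue_catalog_table_name"))
    has_s3 = bool(vector_input.get("s3_path"))

    errors = []
    # The two Glue Catalog parameters must be provided together.
    if has_db != has_table:
        errors.append(
            "glue_catalog_database_name and glue_catalog_table_name must be provided together"
        )
    # Glue Catalog parameters and s3_path are mutually exclusive.
    if has_s3 and (has_db or has_table):
        errors.append("s3_path cannot be combined with AWS Glue Catalog parameters")
    # Assumption: every input needs one of the two source types.
    if not (has_s3 or has_db or has_table):
        errors.append("provide either s3_path or the AWS Glue Catalog parameters")
    return errors


# Hypothetical inputs for illustration.
print(check_vector_source({"type": "vector", "s3_path": "s3://example-bucket/roads/"}))
print(check_vector_source({"type": "vector", "glue_catalog_database_name": "geo_db"}))
```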
For POINT inputs only, the geometry_column_name parameter can be replaced with these 2 parameters: -…
Notability
notability 3.0/10. Low-star routine repo from Amazon Science