microsoft/HERITAGE
Language: Python
License: MIT
Stars: 0
Forks: 0
Open issues: 2
Created: 2026-05-27T13:39:46Z
Pushed: 2026-05-27T14:13:52Z
Default branch: main
Fork: no
Archived: no
README:
HERITAGE Dataset
A satellite image dataset for monitoring archaeological sites, derived from Planet Labs monthly basemap mosaics at 4.77 m/pixel resolution. HERITAGE covers 1,982 sites across 16 countries with monthly observations from January 2016 through May 2025.
The Afghanistan subset contains 1,943 sites with binary looting labels (898 looted, 1,045 preserved), per-site bounding masks, and month-of-disturbance annotations for 118 sites with confirmed change events. The global subset adds 39 sites across 15 additional countries on five continents. Among public archaeological monitoring datasets, HERITAGE exceeds prior releases in number of sites (1,982), countries (16), images (212,776), and months of coverage (113). Site coordinates are withheld to protect the locations from exploitation; the processing pipeline and ground-truth labels are released with the imagery.
Geographic coverage

*Geographic distribution of HERITAGE sites. (a) World map of the 40 global monitoring locations across 16 countries: 39 individual archaeological sites in 15 countries (blue circles) and the Afghanistan region (red star, containing 1,943 sub-sites). (b) Afghanistan detail showing the spatial distribution of 898 looted (orange) and 1,045 preserved (blue) archaeological sites.*
The 39 global sites span 15 countries: Belize (1), Cambodia (1), Ecuador (2), Egypt (5), Italy (5), Mali (1), Pakistan (2), Peru (6), Sudan (1), Sweden (2), Syria (2), Thailand (2), Turkey (3), Ukraine (1), and the USA (5). Each global site has 100-101 monthly observations. Geographic metadata is stripped from the released PNG files; coordinates are not distributed with the imagery.
Sample imagery

*Sample RGB composites from 12 HERITAGE sites spanning 10 countries. Images are displayed at 186 x 186 pixels (rescaled for global sites). Each image shows a single monthly observation from the middle of the site's time series.*
Dataset summary
The dataset is partitioned into an Afghanistan subset (1,943 fully annotated sites) and a global subset (39 sites across 15 countries, imagery only). Subset-specific attributes are shown side by side; shared attributes are listed below the table. Combined totals are 1,982 sites, 212,776 images, and 113 distinct months of coverage from January 2016 through May 2025.
| Attribute | Afghanistan sites | Global sites |
|---|---|---|
| Number of sites | 1,943 (898 looted, 1,045 preserved) | 39 across 15 countries |
| Number of images | ~210,000 | ~3,900 |
| Temporal range | January 2016 to December 2024 | January 2017 to May 2025 |
| Months per site | up to 108 (median 107) | 100-101 |
| Image dimensions | 186 x 186 pixels | per-site majority dimension |
| Site footprint | ~1 km x 1 km (fixed) | 0.1 - 50.6 km^2 |
| Binary site mask | Provided (1,943 PNGs) | Not provided |
| Looting label | Provided (898 looted, 1,045 preserved) | Not provided |
| Change-month label | Provided (118/898 confirmed) | Not provided |
*Shared across both subsets:* 3 spectral bands (R, G, B) plus a 1-channel data validity mask; 4.77 m/pixel spatial resolution (Web Mercator zoom level 15); monthly cadence; 4-band PNG (RGBA, 8-bit unsigned per channel) image format; single-band PNG (binary, 0/255) mask format; Planet Labs monthly basemap mosaics (PlanetScope, Dove constellation) as the data source.
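The quoted resolution can be related to the zoom level with a short calculation. The sketch below computes the nominal equatorial ground resolution of a Web Mercator tile pyramid; the 256-pixel tile size and the function name `nominal_resolution_m` are assumptions for illustration and are not taken from the HERITAGE pipeline. At zoom level 15 the result is about 4.777 m per pixel, which corresponds to the 4.77 m/pixel figure given above.

```python
import math

# WGS84 equatorial radius in metres, as used by Web Mercator (EPSG:3857).
EARTH_RADIUS_M = 6378137.0


def nominal_resolution_m(zoom: int, tile_size: int = 256) -> float:
    """Equatorial metres per pixel at a Web Mercator zoom level.

    tile_size=256 is an assumption (the common XYZ tiling convention).
    """
    circumference = 2.0 * math.pi * EARTH_RADIUS_M
    return circumference / (tile_size * 2 ** zoom)


res = nominal_resolution_m(15)
print(f"{res:.4f} m/pixel")  # 4.7773 m/pixel
```

This is the value at the equator; in Web Mercator the ground distance covered by one pixel shrinks with the cosine of latitude, so it should be read as a nominal figure.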

*Dataset statistics. (a) Distribution of looted vs. preserved labels in the Afghanistan subset. (b) Temporal distribution of confirmed looting events by year (118 sites with known change month, peak in 2019). (c) Number of monitoring sites per country (log scale); Afghanistan accounts for 1,943 of the 1,982 sites.*
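The month counts quoted in this section follow directly from the stated temporal ranges. As a quick check, the sketch below counts calendar months inclusively; `months_inclusive` is a hypothetical helper written for this illustration.

```python
def months_inclusive(start: tuple[int, int], end: tuple[int, int]) -> int:
    """Count calendar months from start to end, both (year, month), inclusive."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0) + 1


afghanistan = months_inclusive((2016, 1), (2024, 12))   # Afghanistan range
global_sites = months_inclusive((2017, 1), (2025, 5))   # global range
combined = months_inclusive((2016, 1), (2025, 5))       # union of both ranges
print(afghanistan, global_sites, combined)  # 108 101 113
```

The results match the "up to 108" and "100-101" months-per-site entries and the combined total of 113 months.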
Comparison with prior datasets
The table below compares HERITAGE to earlier archaeological remote-sensing datasets. Among the publicly released datasets in this list, only DAFA-LS (Vincent et al., 2024) predates HERITAGE; HERITAGE adds multi-country coverage, more months of observation, and per-site change-month labels. Image counts are reported only where the source paper specifies them; "---" indicates not reported. "Change month" is a per-site month-of-disturbance label for looted sites. "Site mask" is a binary raster delineating the archaeological area within each image chip.
| Dataset | Sites | Countries | Months | Images | Change month | Site mask | Public |
|---|---:|---:|---:|---:|:---:|:---:|:---:|
| Casana (2015) | 14 | 1 | 1 | --- | --- | --- | No |
| Parcak et al. (2016) | 200+ | 1 | 2-4 | --- | --- | --- | No |
| Tapete & Cigna (2016) | 1 | 1 | --- | --- | --- | --- | No |
| Lauricella et al. (2017) | 1 | 1 | 1 | --- | --- | --- | No |
| Tadesse et al. (2026a) | 1,943 | 1 | 96 | --- | --- | --- | No |
| Tadesse et al. (2026b) | 1,943 | 5 | 96 | --- | Yes | --- | No |
| Vincent et al. (2024) [DAFA-LS] | 675 | 1 | 96 | 55,480 | --- | Yes | Yes |
| HERITAGE | 1,982 | 16 | 113 | 212,776 | Yes | Yes | Yes |
Directory structure
    HERITAGE/
      dataset/
        ground_truth.csv
        Afghanistan/
          looted_0/
            2016_01.png
            2016_02.png
            ...
            mask.png
          looted_1/
            ...
          preserved_0/
            ...
          ...
        Belize_Lubaantun/
          2017_01.png
          ...
        Cambodia_Panteay_Chamar/
          ...
        [37 additional global site directories]
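One possible way to walk this layout uses only the standard library. The sketch below assumes that the global site directories sit next to `Afghanistan/` under `dataset/`, and that every monthly chip is named `YYYY_MM.png`; `list_chips` and `iter_sites` are hypothetical helpers, not functions shipped with the repository.

```python
import re
from pathlib import Path

# Monthly chips are named YYYY_MM.png; mask.png does not match this pattern.
CHIP_RE = re.compile(r"^(\d{4})_(\d{2})\.png$")


def list_chips(site_dir: Path) -> list[tuple[int, int, Path]]:
    """Return (year, month, path) for each monthly chip, in date order."""
    chips = []
    for path in site_dir.iterdir():
        match = CHIP_RE.match(path.name)
        if match:
            chips.append((int(match.group(1)), int(match.group(2)), path))
    return sorted(chips)


def iter_sites(dataset_root: Path):
    """Yield (site_dir, is_afghanistan) for every site directory."""
    for entry in sorted(dataset_root.iterdir()):
        if not entry.is_dir():
            continue  # skips plain files such as ground_truth.csv
        if entry.name == "Afghanistan":
            # Afghanistan sites are nested one level deeper.
            for site in sorted(entry.iterdir()):
                if site.is_dir():
                    yield site, True
        else:
            yield entry, False
```

Calling `iter_sites(Path("HERITAGE/dataset"))` and passing each yielded directory to `list_chips` would give a date-ordered time series per site, with `mask.png` left out of the chip list.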
Layout notes
- `ground_truth.csv` lists the three label fields for each of the 1,943 Afghanistan sites: `site_name`, `looted` (binary), and `looted_month` (integer month index when looting was detected; `-1` if confirmed but month unknown; `0` for preserved sites).
- Afghanistan site directories follow the naming pattern `{looted,preserved}_N`, where `N` is a zero-indexed site identifier. Each contains monthly RGBA PNG chips and a `mask.png` raster delineating the archaeological area.
- Global site directories follow the naming pattern `Country_SiteName` (39 directories across 15 countries). They contain monthly RGBA PNG chips; no per-site masks are provided.
- File names follow `YYYY_MM.png`, where `YYYY` is the four-digit year and `MM` is the two-digit month.
- Images are stored as four-channel PNGs (height x width x 4): the first three channels are R,…
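The label fields described in the first note above could be read along the following lines. This is a minimal sketch: it assumes `ground_truth.csv` has a header row with exactly those three column names and that `looted` is stored as 0 or 1. `describe_label` and `read_labels` are hypothetical helpers, and the sample rows are made up for illustration.

```python
import csv
import io


def describe_label(looted: int, looted_month: int) -> str:
    """Interpret one row using the sentinel values from the layout notes."""
    if looted == 0:
        return "preserved"  # looted_month is 0 for preserved sites
    if looted_month == -1:
        return "looted, month unknown"
    return f"looted, change at month index {looted_month}"


def read_labels(handle) -> dict[str, str]:
    """Map site_name to a readable label from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return {
        row["site_name"]: describe_label(int(row["looted"]), int(row["looted_month"]))
        for row in reader
    }


# Made-up rows for illustration only; these are not real HERITAGE labels.
sample = io.StringIO(
    "site_name,looted,looted_month\n"
    "looted_0,1,42\n"
    "looted_1,1,-1\n"
    "preserved_0,0,0\n"
)
labels = read_labels(sample)
print(labels)
```

With the real file, `read_labels(open(path, newline=""))` would be the equivalent call. How the month index maps to a calendar month is not specified in the excerpt, so the sketch reports the index unchanged.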
Excerpt shown — open the source for the full document.