Repo · Microsoft · published Apr 3, 2026

microsoft/dbt-scope


Description: Bringing dbt to SCOPE

Language: Python

License: MIT

Stars: 4

Forks: 1

Open issues: 3

Created: 2026-04-03T12:53:58Z

Pushed: 2026-06-09T17:24:50Z

Default branch: main

Fork: no

Archived: no

README:

ADLA - dbt

Incremental data transformation for ADLA using dbt to get data into Delta Lake tables.

dbt Docs · Azure Data Lake Analytics docs · Delta Lake

---

What is this?

This is an opinionated dbt adapter that makes it easier to test and schedule ADLA jobs from the dbt CLI, without requiring an external orchestrator such as Data Factory. Non-Delta Lake source data is lightly transformed with SQL and incrementally ingested via ADLA compute into Delta Lake tables using a quasi-SQL syntax.

The adapter generates the required non-SQL syntax at compile time using dbt macros.

As a result of this conscious design decision, the adapter does not encourage working with non-SQL constructs such as imperative preprocessor directives like #FOO.

> In fact, in the future, the adapter can/will use sqlglot to block non-SQL syntax such as #FOO in the model SQL. The goal is to keep the business logic as close to ANSI SQL as possible for portability across engines. The tradeoff is that the ADLA feature surface in this dbt adapter is limited to run-time syntax; ADLA compile-time syntax such as #FOO or #IFDEF is not supported.
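
To make the idea of blocking compile-time directives concrete, here is a minimal sketch of a pre-flight check on model SQL. It is not the adapter's implementation and it does not use sqlglot; it swaps in a plain line-based regular expression instead. The function name `find_directives` and the rule "a line that starts with `#` followed by letters is a directive" are assumptions made for illustration.

```python
import re

# Hypothetical rule: a directive is a line whose first non-blank character
# is '#', followed by a name such as IF, IFDEF or ENDIF.
DIRECTIVE = re.compile(r"^\s*#([A-Za-z_]+)\b")


def find_directives(model_sql: str) -> list[tuple[int, str]]:
    """Return (line_number, directive_name) for each directive-like line.

    Line-based only: a '#' that starts a line inside a multi-line string
    literal would be reported too, which a real SQL parser would avoid.
    """
    hits = []
    for lineno, line in enumerate(model_sql.splitlines(), start=1):
        match = DIRECTIVE.match(line)
        if match:
            hits.append((lineno, match.group(1).upper()))
    return hits
```

A caller could fail the dbt compile step when the returned list is non-empty and report the offending line numbers back to the model author.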

Key features

  • Clean SQL models: write SELECT ... FROM @data; macros generate EXTRACT, INSERT INTO
  • File-based incremental: the adapter lists source files on ADLS Gen1, filters by watermark, and processes batches bounded by max_files_per_trigger and max_bytes_per_trigger per SCOPE job. It uses the append strategy: dbt calls the macro once per dbt run, so no date-range orchestration is needed. This is very similar to micro-batch Structured Streaming in Apache Spark
  • Watermark checkpoint: progress is tracked in _checkpoint/watermark.json alongside _delta_log/. Re-runs automatically skip already-processed files; full refresh resets the checkpoint
  • Sources audit trail: per-batch JSONL diffs record which files were processed. Configurable compaction (parquet snapshots) and retention keep the checkpoint directory bounded, again similar to Spark Structured Streaming
  • Virtual file metadata: the source_file_uri, source_file_length, source_file_created, and source_file_modified columns map to FILE.*() functions, giving each row lineage back to its source file
  • Declarative table properties: compression and checkpoint intervals are set via scope_settings

How it works

SS files live on ADLS Gen1. The adapter lists files under each source_roots entry, filters by each regex in source_patterns (cross-product), deduplicates by path, estimates total data size per file (including SSv5/v6 .du sibling folders), and processes them in batches bounded by max_files_per_trigger and max_bytes_per_trigger (whichever limit is hit first). Each batch becomes a single SCOPE job with an explicit file list in the EXTRACT FROM clause. After a successful job, the watermark advances, a sources record is written to _checkpoint/, and the next batch is discovered — repeating until all files are processed.
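
The discovery and batching steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the adapter's code: `SourceFile`, `match_sources`, and `take_batch` are hypothetical names, the ADLS Gen1 listing is replaced by an in-memory dict, patterns are applied with `re.search`, and a batch always admits at least one file even if that file alone exceeds `max_bytes_per_trigger` (so the loop can make progress).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    path: str
    size_bytes: int           # estimated size, including any sibling data
    modification_time: float  # assumed to be epoch seconds


def match_sources(listings, source_patterns):
    """Apply every pattern to every root's listing and deduplicate by path.

    `listings` maps each source root to the files found under it; it stands
    in for a real ADLS Gen1 listing call.
    """
    compiled = [re.compile(p) for p in source_patterns]
    seen = {}
    for _root, files in listings.items():
        for pattern in compiled:
            for f in files:
                if pattern.search(f.path) and f.path not in seen:
                    seen[f.path] = f
    return list(seen.values())


def take_batch(files, max_files_per_trigger, max_bytes_per_trigger):
    """Take the oldest files until either limit would be exceeded."""
    batch, total = [], 0
    for f in sorted(files, key=lambda f: f.modification_time):
        if len(batch) >= max_files_per_trigger:
            break
        # Assumption: the first file is always admitted, even if oversized.
        if batch and total + f.size_bytes > max_bytes_per_trigger:
            break
        batch.append(f)
        total += f.size_bytes
    return batch
```

Each returned batch would then become the explicit file list for one SCOPE job.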

How dbt picks which files to process

The adapter discovers unprocessed files using a watermark-based checkpoint stored at _checkpoint/watermark.json next to the Delta table's _delta_log/:

| Scenario | What runs |
| --- | --- |
| First run or --full-refresh | Checkpoint deleted → all matching files processed in batches bounded by max_files_per_trigger and max_bytes_per_trigger |
| Incremental run | Only files with modification_time after the watermark are processed |
| No new files | No-op; the watermark stays the same |

The safety buffer (safety_buffer_seconds, default 30) skips files modified within the last N seconds to avoid reading partially-written files.
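
As a rough illustration of the watermark and safety-buffer rules, the filter below keeps files that are newer than the watermark but old enough to be considered fully written. The function name `select_unprocessed`, the dict shape of each file entry, the use of epoch seconds, and the inclusive comparison at the buffer boundary are all assumptions for this sketch.

```python
import time


def select_unprocessed(files, watermark, safety_buffer_seconds=30, now=None):
    """Return files modified after the watermark and before the safety cutoff.

    `files` is a list of dicts with "path" and "modification_time" keys.
    `watermark` is None on a first run or after a full refresh.
    """
    now = time.time() if now is None else now
    cutoff = now - safety_buffer_seconds  # newer files may still be mid-write
    return [
        f for f in files
        if (watermark is None or f["modification_time"] > watermark)
        and f["modification_time"] <= cutoff
    ]
```

Files skipped by the buffer are not lost: they fall behind the cutoff on a later run and are picked up then, provided the watermark has not moved past their modification time.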

Checkpoint lifecycle

After each successful SCOPE job:

1. Watermark updated: _checkpoint/watermark.json records {version, modifiedTime, batchId}
2. Sources recorded: a JSONL diff in _checkpoint/sources/{batchId} lists every file processed in that batch
3. Compaction: every source_compaction_interval batches, a parquet snapshot is written containing all history (latest snapshot + JSONL diffs since + current batch). All files persist on disk; compaction never deletes anything
4. Retention: source_retention_files caps the total number of files in _checkpoint/sources/, deleting the oldest first. This is the only mechanism that removes files
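
The four steps above can be sketched against a local directory standing in for the checkpoint location. This is a simplified illustration, not the adapter's code: `record_batch` is a hypothetical name, the `version` value of 1 and the `{"path": ...}` record shape are placeholders, the default values are arbitrary, and the parquet snapshot is not actually written (the function only reports whether compaction would be due, assuming it triggers when `batch_id` is a positive multiple of the interval).

```python
import json
from pathlib import Path


def record_batch(checkpoint_dir, batch_id, files, modified_time,
                 source_compaction_interval=10, source_retention_files=100):
    """Persist checkpoint state after one successful job; return True if
    a compaction snapshot would be due for this batch."""
    checkpoint = Path(checkpoint_dir)
    sources = checkpoint / "sources"
    sources.mkdir(parents=True, exist_ok=True)

    # 1. Advance the watermark (version value is a placeholder).
    watermark = {"version": 1, "modifiedTime": modified_time, "batchId": batch_id}
    (checkpoint / "watermark.json").write_text(json.dumps(watermark))

    # 2. Write the JSONL diff: one line per file processed in this batch.
    with open(sources / str(batch_id), "w") as fh:
        for path in files:
            fh.write(json.dumps({"path": path}) + "\n")

    # 3. Compaction is only signalled here; writing parquet is out of scope.
    compaction_due = batch_id > 0 and batch_id % source_compaction_interval == 0

    # 4. Retention: drop the oldest entries (lowest batch id) beyond the cap.
    entries = sorted(sources.iterdir(), key=lambda p: int(p.name.split(".")[0]))
    for stale in entries[:max(0, len(entries) - source_retention_files)]:
        stale.unlink()

    return compaction_due
```

Ordering retention by the numeric batch id rather than by file timestamp keeps the behavior deterministic when several batches are written within the same second.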

On full refresh, the checkpoint is deleted before processing begins. The adapter re-discovers all files and starts fresh at batch_id=0.
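
Putting the pieces together, the overall run loop (discover, batch, submit, checkpoint, repeat) might look like the sketch below. It is an in-memory illustration only: `run_incremental` and `submit_job` are hypothetical names, the checkpoint is a plain dict, files are `(path, modification_time)` tuples, and the byte limit and safety buffer are omitted for brevity.

```python
def run_incremental(all_files, state, submit_job, max_files_per_trigger,
                    full_refresh=False):
    """Process pending files batch by batch until none remain.

    `state` plays the role of watermark.json; `submit_job(batch_id, paths)`
    stands in for running one SCOPE job and should raise on failure so the
    watermark is not advanced.
    """
    if full_refresh:
        state.clear()  # checkpoint deleted, so the next batch id is 0

    while True:
        watermark = state.get("modifiedTime")
        pending = sorted(
            (f for f in all_files if watermark is None or f[1] > watermark),
            key=lambda f: f[1],
        )
        if not pending:
            break  # no new files: no-op, watermark unchanged

        batch = pending[:max_files_per_trigger]
        batch_id = state.get("batchId", -1) + 1
        submit_job(batch_id, [path for path, _ in batch])

        # Advance the watermark only after the job succeeded.
        # Limitation of this sketch: files sharing the same modification
        # time as the last file in a batch would be skipped.
        state["modifiedTime"] = batch[-1][1]
        state["batchId"] = batch_id
    return state
```

Calling it a second time with the same files and state submits no jobs, which mirrors the "No new files" scenario; passing `full_refresh=True` replays everything from batch 0.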

What each SCOPE job does

flowchart TB
subgraph dbt["dbt — file-based append with internal batching loop"]
direction TB
Discover["Adapter lists ADLS Gen1 files
filter by regex + watermark"]
Batch["Take batch bounded by
max_files + max_bytes per trigger
oldest-first by modification_time"]
More{"More
files?"}
Discover --> Batch
end

subgraph ADLA["ADLA — one SCOPE script per batch"]
direction TB
S1["SET @@FeaturePreviews"]
DDL["CREATE TABLE IF NOT EXISTS
PARTITIONED BY partition_col
OPTIONS LAYOUT = DELTA"]
DEL["DELETE FROM @target
WHERE true
only on first batch of full refresh"]
EXT["📖 EXTRACT FROM explicit file list
+ FILE.URI(), FILE.LENGTH(), ...
→ @data rowset"]
TX["🔀 SQL Transform — your dbt model (.sql)
SELECT … FROM @data
→ @batch_data"]
INS["💾 INSERT INTO @target
SELECT * FROM @batch_data"]
S1 --> DDL --> DEL --> EXT --> TX --> INS
end

subgraph Checkpoint["_checkpoint/ (ADLS Gen2)"]
direction TB
WM["📄 watermark.json
{version, modifiedTime, batchId}"]
SRC["📂 sources/
0 (JSONL) · 1 (JSONL) · 10.parquet"]
end

subgraph Storage["Azure Data Lake Storage"]
direction LR
subgraph Sources["Gen1 — SS source…

Excerpt shown — open the source for the full document.
