RepoSnowflake (Arctic)Snowflake (Arctic)published Nov 4, 2025seen 5d

Snowflake-Labs/pg_lake

C

Open original ↗

Captured source

source ↗
published Nov 4, 2025seen 5dcaptured 8hhttp 200method plain

Snowflake-Labs/pg_lake

Description: pg_lake: Postgres with Iceberg and data lake access

Language: C

License: Apache-2.0

Stars: 1544

Forks: 102

Open issues: 87

Created: 2025-11-04T10:38:17Z

Pushed: 2026-06-10T14:42:47Z

Default branch: main

Fork: no

Archived: no

README:

pg_lake: Postgres for Iceberg and Data lakes

pg_lake integrates Iceberg and data lake files into Postgres. With the pg_lake extensions, you can use Postgres as a stand-alone lakehouse system that supports transactions and fast queries on Iceberg tables, and can directly work with raw data files in object stores like S3.

At a high level, pg_lake lets you:

  • Create and modify [Iceberg](https://iceberg.apache.org/) tables directly from PostgreSQL, with full transactional guarantees and query them from other engines
  • Query and import data files in object storage in Parquet, CSV, JSON, and Iceberg format
  • Export query results back to object storage in Parquet, CSV, or JSON formats using COPY commands
  • Read geospatial formats supported by GDAL, such as GeoJSON and Shapefiles
  • Use the built-in [map type](./pg_map/README.md) for semi-structured or key–value data
  • Combine heap, Iceberg, and external Parquet/CSV/JSON files in the same SQL queries and modifications — all with full transactional guarantees and no SQL limitations
  • Infer table columns and types from external data sources such as Iceberg, Parquet, JSON, and CSV files
  • Leverage DuckDB’s query engine underneath for fast execution without leaving Postgres

Setting up pg_lake

There are two ways to set up pg_lake:

  • Using Docker, for an easy, ready-to-run test environment.
  • Building from source, for a manual setup or development use.

Both approaches include the PostgreSQL extensions, the pgduck_server application and setting up S3-compatible storage.

Using Docker

Follow the [Docker README](./docker/README.md) to set up and run pg_lake with Docker.

Building from source

Once you’ve [built and installed the required components](./docs/building-from-source.md), you can initialize pg_lake inside Postgres.

Creating the extensions

Create all required extensions at once using CASCADE:

CREATE EXTENSION pg_lake CASCADE;
NOTICE: installing required extension "pg_lake_table"
NOTICE: installing required extension "pg_lake_engine"
NOTICE: installing required extension "pg_extension_base"
NOTICE: installing required extension "pg_lake_iceberg"
NOTICE: installing required extension "pg_lake_copy"
CREATE EXTENSION

Running pgduck_server

pgduck_server is a standalone process that implements the Postgres wire-protocol (locally), and underneath uses DuckDB to execute queries.

When you run pgduck_server it starts listening to port 5332 on unix domain socket:

pgduck_server
LOG pgduck_server is listening on unix_socket_directory: /tmp with port 5332, max_clients allowed 10000

As pgduck_server implements Postgres wire protocol, you can access it via psql on port 5332 and host /tmp and run commands via DuckDB.

For example, you can get the DuckDB version:

psql -p 5332 -h /tmp

select version() as duckdb_version;
duckdb_version
----------------
v1.3.2 (1 row)

You can also provide some additional settings while starting the server, to see all:

pgduck_server --help

There are some important settings that should be adjusted, especially on production systems:

  • --memory_limit: Optionally specify the maximum memory of pgduck_server similar to DuckDB's memory_limit, the default is 80 percent of the system memory
  • --init_file_path : Execute all statements in this file on start-up
  • --cache_dir: Specify the directory to use to cache remote files (from S3)

Note that if you want to make adjustments to duckdb settings, you can use the --init_file_path approach OR you can connect to the running pgduck_server and make changes. For example:

$ psql -h /tmp -p 5332
psql (17.5, server 16.4.DuckPG)
Type "help" for help.

postgres=> set global threads = 16;
SET

The connection above is to the pgduck_server on its port (default 5332), NOT to the postgres/pg_lake server.

Connecting pg_lake to s3 (or compatible)

pgduck_server relies on the DuckDB secrets manager for credentials and it follows the credentials chain by default for AWS and GCP. Make sure your cloud credentials are configured properly — for example, by setting them in ~/.aws/credentials.

Once you set up the credential chain, you should set the pg_lake_iceberg.default_location_prefix. This is the location where Iceberg tables are stored:

SET pg_lake_iceberg.default_location_prefix TO 's3://testbucket/pglake';

You can also set the credentials on pgduck_server for [local development with minio](docs/building-from-source.md#running-s3-compatible-service-minio-locally).

Using pg_lake

Create an Iceberg table

You can create Iceberg tables by adding USING iceberg to your CREATE TABLE statements.

CREATE TABLE iceberg_test USING iceberg
AS SELECT
i as key, 'val_'|| i as val
FROM
generate_series(0,99)i;

Then, query it:

SELECT count(*) FROM iceberg_test;
count
-------
100
(1 row)

You can then see the Iceberg metadata location:

SELECT table_name, metadata_location FROM iceberg_tables;

table_name | metadata_location
-------------------+--------------------------------------------------------------------------------------------------------------------
iceberg_test | s3://testbucket/pglake/postgres/public/test/435029/metadata/00001-f0c6e20a-fd1c-4645-87c9-c0c64b92992b.metadata.json

COPY to/from S3

You can import or export data directly using COPY in Parquet, CSV, or newline-delimited JSON formats. The format is automatically inferred from the file extension, or you can specify it explicitly with COPY options like WITH (format 'csv', compression 'gzip').

-- Copy data from Postgres to S3 with format parquet
-- Read from any data source, including iceberg tables, heap tables or any query results
COPY (SELECT * FROM iceberg_test) TO 's3://testbucket/pglake/parquet_data/iceberg_test.parquet';

-- Copy back from S3 to any table in Postgres
-- This example copies into an iceberg table, but could be heap table as well
COPY iceberg_test FROM…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable open-source repo with strong HN and stars traction.