What does this repo signal mean?

Snowflake (Arctic) published Snowflake-Labs/pg_lake (C). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo Snowflake-Labs/pg_lake · language C · Notable open-source repo with strong HN and stars traction.. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Snowflake (Arctic) Repo: Snowflake-Labs/pg_lake

Captured source

source ↗

GitHub/github.com/Snowflake-Labs/pg_lake

Snowflake-Labs/pg_lake repository metadata

Source ↗

published Nov 4, 2025seen Jun 5captured Jun 11http 200method plain

Snowflake-Labs/pg_lake

Description: pg_lake: Postgres with Iceberg and data lake access

Language: C

License: Apache-2.0

Stars: 1544

Forks: 102

Open issues: 87

Created: 2025-11-04T10:38:17Z

Pushed: 2026-06-10T14:42:47Z

Default branch: main

Fork: no

Archived: no

README:

pg_lake: Postgres for Iceberg and Data lakes

pg_lake integrates Iceberg and data lake files into Postgres. With the pg_lake extensions, you can use Postgres as a stand-alone lakehouse system that supports transactions and fast queries on Iceberg tables, and can directly work with raw data files in object stores like S3.

At a high level, pg_lake lets you:

Create and modify [Iceberg](https://iceberg.apache.org/) tables directly from PostgreSQL, with full transactional guarantees and query them from other engines
Query and import data files in object storage in Parquet, CSV, JSON, and Iceberg format
Export query results back to object storage in Parquet, CSV, or JSON formats using COPY commands
Read geospatial formats supported by GDAL, such as GeoJSON and Shapefiles
Use the built-in [map type](./pg_map/README.md) for semi-structured or key–value data
Combine heap, Iceberg, and external Parquet/CSV/JSON files in the same SQL queries and modifications — all with full transactional guarantees and no SQL limitations
Infer table columns and types from external data sources such as Iceberg, Parquet, JSON, and CSV files
Leverage DuckDB’s query engine underneath for fast execution without leaving Postgres

Setting up `pg_lake`

There are two ways to set up pg_lake:

Using Docker, for an easy, ready-to-run test environment.
Building from source, for a manual setup or development use.

Both approaches include the PostgreSQL extensions, the pgduck_server application and setting up S3-compatible storage.

Using Docker

Follow the [Docker README](./docker/README.md) to set up and run pg_lake with Docker.

Building from source

Once you’ve [built and installed the required components](./docs/building-from-source.md), you can initialize pg_lake inside Postgres.

Creating the extensions

Create all required extensions at once using CASCADE:

CREATE EXTENSION pg_lake CASCADE;
NOTICE: installing required extension "pg_lake_table"
NOTICE: installing required extension "pg_lake_engine"
NOTICE: installing required extension "pg_extension_base"
NOTICE: installing required extension "pg_lake_iceberg"
NOTICE: installing required extension "pg_lake_copy"
CREATE EXTENSION

Running `pgduck_server`

pgduck_server is a standalone process that implements the Postgres wire-protocol (locally), and underneath uses DuckDB to execute queries.

When you run pgduck_server it starts listening to port 5332 on unix domain socket:

pgduck_server
LOG pgduck_server is listening on unix_socket_directory: /tmp with port 5332, max_clients allowed 10000

As pgduck_server implements Postgres wire protocol, you can access it via psql on port 5332 and host /tmp and run commands via DuckDB.

For example, you can get the DuckDB version:

psql -p 5332 -h /tmp

select version() as duckdb_version;
duckdb_version
----------------
v1.3.2 (1 row)

You can also provide some additional settings while starting the server, to see all:

pgduck_server --help

There are some important settings that should be adjusted, especially on production systems:

--memory_limit: Optionally specify the maximum memory of pgduck_server similar to DuckDB's memory_limit, the default is 80 percent of the system memory
--init_file_path : Execute all statements in this file on start-up
--cache_dir: Specify the directory to use to cache remote files (from S3)

Note that if you want to make adjustments to duckdb settings, you can use the --init_file_path approach OR you can connect to the running pgduck_server and make changes. For example:

$ psql -h /tmp -p 5332
psql (17.5, server 16.4.DuckPG)
Type "help" for help.

postgres=> set global threads = 16;
SET

The connection above is to the pgduck_server on its port (default 5332), NOT to the postgres/pg_lake server.

Connecting `pg_lake` to s3 (or compatible)

pgduck_server relies on the DuckDB secrets manager for credentials and it follows the credentials chain by default for AWS and GCP. Make sure your cloud credentials are configured properly — for example, by setting them in ~/.aws/credentials.

Once you set up the credential chain, you should set the pg_lake_iceberg.default_location_prefix. This is the location where Iceberg tables are stored:

SET pg_lake_iceberg.default_location_prefix TO 's3://testbucket/pglake';

You can also set the credentials on pgduck_server for [local development with minio](docs/building-from-source.md#running-s3-compatible-service-minio-locally).

Using pg_lake

Create an Iceberg table

You can create Iceberg tables by adding USING iceberg to your CREATE TABLE statements.

CREATE TABLE iceberg_test USING iceberg
AS SELECT
i as key, 'val_'|| i as val
FROM
generate_series(0,99)i;

Then, query it:

SELECT count(*) FROM iceberg_test;
count
-------
100
(1 row)

You can then see the Iceberg metadata location:

SELECT table_name, metadata_location FROM iceberg_tables;

table_name | metadata_location
-------------------+--------------------------------------------------------------------------------------------------------------------
iceberg_test | s3://testbucket/pglake/postgres/public/test/435029/metadata/00001-f0c6e20a-fd1c-4645-87c9-c0c64b92992b.metadata.json

COPY to/from S3

You can import or export data directly using COPY in Parquet, CSV, or newline-delimited JSON formats. The format is automatically inferred from the file extension, or you can specify it explicitly with COPY options like WITH (format 'csv', compression 'gzip').

-- Copy data from Postgres to S3 with format parquet
-- Read from any data source, including iceberg tables, heap tables or any query results
COPY (SELECT * FROM iceberg_test) TO 's3://testbucket/pglake/parquet_data/iceberg_test.parquet';

-- Copy back from S3 to any table in Postgres
-- This example copies into an iceberg table, but could be heap table as well
COPY iceberg_test FROM...

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable open-source repo with strong HN and stars traction.