Repo · Snowflake (Arctic) · published Mar 3, 2026

Snowflake-Labs/apache-iceberg-from-zero

Jupyter Notebook

Language: Jupyter Notebook

License: Apache-2.0

Stars: 7

Forks: 11

Open issues: 0

Created: 2026-03-03T18:25:28Z

Pushed: 2026-03-27T21:13:10Z

Default branch: main

Fork: no

Archived: no

README:

Apache Iceberg Course - Docker Setup

This Docker setup provides a complete, production-like environment for learning Apache Iceberg with:

  • MinIO: Local S3-compatible object storage for table data
  • Polaris: Apache Iceberg REST Catalog
  • Jupyter Notebook: Interactive Python notebook with PySpark and Iceberg support
  • Trino: Distributed SQL query engine

This repository accompanies the course videos (TODO LINK). Please check them out if you haven't already.

Version Configuration

All versions are centrally managed in the .env file.

Current pinned versions:

  • Iceberg: 1.10.0 (released September 5, 2025)
  • Spark: 4.0.1 with Scala 2.13 (September 2, 2025)
  • Polaris: latest (a floating tag rather than a pinned release)
  • Trino: 465

To update versions, simply edit the .env file and rebuild:

docker compose up -d --build
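
Before rebuilding, it can help to confirm what the .env file actually contains. The following is a minimal sketch of a KEY=VALUE parser. The key names in the sample (ICEBERG_VERSION and so on) are hypothetical, since this README does not show the real variable names; only the values come from the list above.

```python
# Hypothetical .env contents: the key names are assumptions, the values
# are the versions listed in this README.
SAMPLE_ENV = """\
# versions
ICEBERG_VERSION=1.10.0
SPARK_VERSION=4.0.1
POLARIS_VERSION=latest
TRINO_VERSION=465
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def version_entries(env: dict[str, str]) -> dict[str, str]:
    """Keep only keys ending in _VERSION (an assumed naming convention)."""
    return {k: v for k, v in env.items() if k.endswith("_VERSION")}
```

To inspect the real file, pass its text to parse_env, for example the result of pathlib.Path(".env").read_text(), and adjust the filter to whatever naming the file uses.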

Prerequisites

  • Docker Desktop installed and running
  • At least 8GB of RAM allocated to Docker
  • At least 10GB of free disk space

Quick Start

1. Start all services:

docker compose up -d

2. Wait for services to be ready (approximately 1-2 minutes):

docker compose logs -f

Press Ctrl+C to stop following logs once services are running.
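
As an alternative to watching the logs, the wait can be scripted. This sketch polls the host ports listed in step 3 until each accepts a TCP connection. The service labels and the timeout values are illustrative choices, and an open port only shows that something is listening, not that the service has finished initializing.

```python
import socket
import time

# Host ports taken from the service list in this README; labels are illustrative.
SERVICES = {"jupyter": 8888, "minio-console": 9001, "trino": 8080, "polaris": 8181}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services: dict[str, int], host: str = "localhost",
                      deadline_s: float = 180.0, poll_s: float = 2.0) -> dict[str, bool]:
    """Poll until every port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    status = {name: False for name in services}
    while True:
        for name, port in services.items():
            if not status[name]:
                status[name] = port_open(host, port)
        if all(status.values()) or time.monotonic() >= end:
            return status
        time.sleep(poll_s)
```

Calling wait_for_services(SERVICES) returns a mapping of label to True or False, so any service still False after the deadline is the one whose logs are worth reading first.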

3. Access the services:

  • Jupyter Notebook: http://localhost:8888 (no password) - Start here!
  • MinIO Console: http://localhost:9001 (admin/password) - View your data
  • Trino UI: http://localhost:8080 (username: admin, no password)
  • Polaris API: http://localhost:8181

4. Open the demo notebook:

  • Direct link: http://localhost:8888/lab/tree/work/E1.1%20-%20OpenLakehouse.ipynb
  • Run through the cells to see Iceberg with Polaris and MinIO

Service Details

MinIO (S3-Compatible Storage)

MinIO provides S3-compatible object storage for Iceberg table data.

Configuration:

  • API Port: 9000
  • Console Port: 9001
  • Username: admin
  • Password: password
  • Bucket: warehouse
  • Data directory: ./data/minio

Access the Console:

  • URL: http://localhost:9001
  • Login with admin/password
  • Browse the warehouse bucket to see your Iceberg table files
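
The same bucket can be browsed from Python. This is a sketch using boto3 against the MinIO API port with the credentials listed above. boto3 is not among the libraries this README lists for the Jupyter image, so it is assumed to be installed separately, and the region value is an arbitrary placeholder. From inside a container, the endpoint host would likely be the MinIO service name rather than localhost.

```python
def minio_client_kwargs(endpoint: str = "http://localhost:9000",
                        user: str = "admin", password: str = "password") -> dict:
    """Keyword arguments for boto3.client() pointing at the local MinIO API."""
    return {
        "service_name": "s3",
        "endpoint_url": endpoint,
        "aws_access_key_id": user,
        "aws_secret_access_key": password,
        "region_name": "us-east-1",  # placeholder; an assumption, not from the README
    }


def list_warehouse_keys(prefix: str = "", bucket: str = "warehouse") -> list[str]:
    """List object keys in the bucket (requires boto3 and a running MinIO)."""
    import boto3  # imported lazily; assumed to be installed
    from botocore.config import Config

    # Path-style addressing avoids bucket-name subdomains on a local endpoint.
    s3 = boto3.client(config=Config(s3={"addressing_style": "path"}),
                      **minio_client_kwargs())
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

Once a table exists, list_warehouse_keys() should return the data and metadata files that the console shows under the warehouse bucket.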

Polaris Iceberg REST Catalog

The Polaris catalog provides a REST API for managing Iceberg table metadata. It is configured with in-memory persistence for catalog entries and uses MinIO to store table metadata.

Configuration:

  • Port: 8181
  • Data directory: ./data/polaris
  • Storage: MinIO S3 (s3://warehouse/)
  • OAuth2 credentials automatically generated on first start

Initialization: The Polaris catalog is automatically initialized with:

  • S3 storage configuration pointing to MinIO
  • OAuth2 credentials (root:s3cr3t defined in .env)

The polaris-setup service runs bootstrap-catalog.sh on startup to configure the catalog.
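
To check that the bootstrap worked, one option is to request an OAuth2 token with the credentials above. This sketch uses only the standard library. The token path (/api/catalog/v1/oauth/tokens) and the PRINCIPAL_ROLE:ALL scope follow Polaris's documented defaults and are assumptions about this particular setup, as is the "access_token" field read from the response.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(base_url: str = "http://localhost:8181",
                        client_id: str = "root",
                        client_secret: str = "s3cr3t") -> urllib.request.Request:
    """Build a client-credentials token request for the Polaris REST API."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "PRINCIPAL_ROLE:ALL",  # assumed default scope
    }).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/catalog/v1/oauth/tokens",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_token(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the bearer token (needs Polaris running)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["access_token"]
```

If fetch_token(build_token_request()) raises an HTTP error, the polaris and polaris-setup logs are the first place to look.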

Trino

Trino is configured with an Iceberg connector that connects to the Polaris catalog.

Configuration:

  • Port: 8080
  • Catalog: iceberg (connected to Polaris)
  • Data directory: ./data/trino
  • Config files: ./trino/config/

Connect to Trino CLI:

docker exec -it trino trino

Example Trino queries:

-- Show catalogs
SHOW CATALOGS;

-- Create a namespace
CREATE SCHEMA iceberg.demo;

-- Show schemas
SHOW SCHEMAS IN iceberg;

-- Create a table
CREATE TABLE iceberg.demo.test (
    id BIGINT,
    name VARCHAR
) WITH (format = 'PARQUET');

-- Insert data
INSERT INTO iceberg.demo.test VALUES (1, 'Alice'), (2, 'Bob');

-- Query data
SELECT * FROM iceberg.demo.test;
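
The same table can be queried from Python with the Trino client that the Jupyter image is described as including. This is a sketch that assumes the connection runs from the host machine. From inside the Jupyter container, the host would probably be the Trino service name instead of localhost.

```python
DEMO_QUERY = "SELECT id, name FROM iceberg.demo.test ORDER BY id"


def trino_connect_kwargs(host: str = "localhost") -> dict:
    """Connection settings matching the port, user, and catalog in this README."""
    return {
        "host": host,
        "port": 8080,
        "user": "admin",
        "catalog": "iceberg",
        "schema": "demo",
    }


def fetch_rows(sql: str = DEMO_QUERY, host: str = "localhost") -> list:
    """Run a query and return all rows (needs the trino package and a running server)."""
    from trino.dbapi import connect  # imported lazily so the helpers above work without it

    conn = connect(**trino_connect_kwargs(host))
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```

Run fetch_rows() after the CREATE and INSERT statements above. It fails with a Trino error if the demo schema or table does not exist yet.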

Jupyter Notebook with PySpark

The Jupyter environment comes pre-configured with:

  • PySpark with Iceberg support
  • PyIceberg library
  • Trino Python client
  • Pandas, Matplotlib, Seaborn

Access:

  • URL: http://localhost:8888
  • Notebooks directory: ./notebooks
  • Data directory: ./data/jupyter

Catalog Configuration:

  • The demo notebook uses the Polaris REST catalog with MinIO S3 storage
  • Metadata managed by Polaris (centralized, REST API)
  • Table data stored in MinIO (s3://warehouse/)
  • This is a production-like pattern: the same architecture as running Polaris against real S3, Azure, or GCS storage
  • OAuth2 authentication configured automatically
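
For orientation, the catalog wiring described above usually comes down to a handful of Spark properties. The sketch below shows one plausible set. It is not taken from the demo notebook: the Spark catalog name (polaris), the warehouse name (demo_catalog), and the in-network hostname are all hypothetical, and the troubleshooting section suggests the notebook may reach Spark through a Spark Connect server, in which case these settings would live on the server side.

```python
def polaris_spark_conf(catalog: str = "polaris",           # hypothetical Spark catalog name
                       warehouse: str = "demo_catalog",    # hypothetical Polaris catalog name
                       uri: str = "http://polaris:8181/api/catalog",  # assumed in-network URL
                       credential: str = "root:s3cr3t") -> dict[str, str]:
    """Spark properties for an Iceberg REST catalog with OAuth2 client credentials."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.credential": credential,
        f"{prefix}.scope": "PRINCIPAL_ROLE:ALL",
    }


def build_session(conf: dict[str, str]):
    """Create a local SparkSession with the given properties (needs pyspark and the Iceberg runtime JAR)."""
    from pyspark.sql import SparkSession  # imported lazily

    builder = SparkSession.builder.appName("iceberg-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the properties in a plain dictionary makes it easy to print and compare them against whatever the notebook actually sets.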

Sample notebook: E1.1 - OpenLakehouse.ipynb is provided with examples of:

  • Creating Iceberg tables
  • Querying data
  • ACID transactions
  • Time travel
  • Schema evolution
  • Partitioning
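
To give a feel for what those topics look like in practice, here is a sketch of Iceberg Spark SQL statements for partitioning, schema evolution, and time travel. The table name polaris.demo.events and its columns are hypothetical and are not taken from the notebook. The statements are built as strings so they can be passed to spark.sql() in any session that has an Iceberg catalog configured.

```python
def feature_statements(table: str = "polaris.demo.events") -> dict[str, str]:
    """Spark SQL for a few Iceberg features; the table and columns are illustrative."""
    return {
        # Hidden partitioning: partition by day derived from the ts column.
        "create_partitioned": (
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(id BIGINT, name STRING, ts TIMESTAMP) "
            "USING iceberg PARTITIONED BY (days(ts))"
        ),
        "insert": f"INSERT INTO {table} VALUES (1, 'Alice', TIMESTAMP '2026-03-03 00:00:00')",
        # Schema evolution: a metadata-only change, no data files are rewritten.
        "schema_evolution": f"ALTER TABLE {table} ADD COLUMN email STRING",
        # The snapshots metadata table lists every committed snapshot.
        "list_snapshots": (
            f"SELECT snapshot_id, committed_at FROM {table}.snapshots "
            "ORDER BY committed_at"
        ),
    }


def time_travel_query(table: str, snapshot_id: int) -> str:
    """Read the table as of a specific snapshot ID."""
    return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"
```

A typical flow would run the create, insert, and schema change, read a snapshot_id from the list_snapshots result, and pass it to time_travel_query to see the table as it was at that commit.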

Data Persistence

All data is stored locally in the ./data directory:

  • ./data/minio: MinIO object storage (Iceberg table data)
  • ./data/polaris: Polaris catalog metadata
  • ./data/trino: Trino working data
  • ./data/jupyter: Jupyter user data

This ensures that your data persists even when containers are stopped.
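
Since the prerequisites ask for 10GB of free disk space, it can be useful to see how much each of these directories is using. This is a small sketch that assumes it runs from the repository root. Files created by containers may be owned by another user, in which case reading them can raise a permission error.

```python
from pathlib import Path

# Subdirectories of ./data named in this README.
DATA_DIRS = ("minio", "polaris", "trino", "jupyter")


def dir_size_bytes(path: Path) -> int:
    """Total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def data_usage(root: Path = Path("data")) -> dict[str, int]:
    """Size per known subdirectory, skipping any that do not exist yet."""
    return {
        name: dir_size_bytes(root / name)
        for name in DATA_DIRS
        if (root / name).is_dir()
    }
```

Calling data_usage() returns a mapping such as directory name to byte count, which makes it easy to spot which service is filling the disk.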

Common Commands

Start all services:

docker compose up -d

Stop all services:

docker compose down

View logs:

# All services
docker compose logs -f

# Specific service
docker compose logs -f jupyter
docker compose logs -f trino
docker compose logs -f polaris

Restart a service:

docker compose restart jupyter

Rebuild and restart (after config changes):

docker compose up -d --build

Troubleshooting

Services not starting

Check if ports are already in use:

# macOS/Linux
lsof -i :8888 # Jupyter
lsof -i :8080 # Trino
lsof -i :9000 # MinIO API
lsof -i :9001 # MinIO Console
lsof -i :8181 # Polaris

Check service health

# Check if containers are running
docker compose ps

# Check specific service logs
docker compose logs jupyter
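
The health check can also be scripted by asking Compose for JSON output. The sketch below parses the result of docker compose ps --format json, which is a JSON array in some Compose versions and one JSON object per line in others. The expected service names are partly assumed: jupyter, trino, and polaris appear in the log commands above, while minio is a guess at the service name.

```python
import json
import subprocess

# jupyter, trino, polaris come from this README; "minio" is an assumed service name.
EXPECTED = ("jupyter", "trino", "polaris", "minio")


def parse_compose_ps(output: str) -> list[dict]:
    """Parse `docker compose ps --format json` output (array or one object per line)."""
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def not_ready(rows: list[dict], expected=EXPECTED) -> list[str]:
    """Expected services that are missing, not running, or not yet healthy."""
    ready = {
        row.get("Service")
        for row in rows
        if row.get("State") == "running" and row.get("Health", "") in ("", "healthy")
    }
    return sorted(set(expected) - ready)


def compose_ps() -> list[dict]:
    """Run docker compose ps in the current directory (needs Docker)."""
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    return parse_compose_ps(result.stdout)
```

Running not_ready(compose_ps()) from the repository root gives a short list of services to investigate with docker compose logs.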

Connection refused errors

  • Make sure all services are fully started (check logs)
  • Services may take 1-2 minutes to initialize
  • Verify network connectivity: docker network ls

Spark Session: ConnectionRefusedError: [Errno 111] Connection refused

If you see this error when initializing a Spark Session in a notebook, the Spark Connect server may have failed to start. Check the Docker container logs (docker logs jupyter-spark) for details. Common causes include insufficient Docker RAM or port conflicts. You can also try restarting the container (docker compose restart jupyter).

SSL / Corporate Proxy Errors Downloading JARs

If you are on a corporate network with a proxy or firewall that…
