databricks/async-file-io
Java
Captured source
source ↗databricks/async-file-io
Language: Java
License: Apache-2.0
Stars: 3
Forks: 3
Open issues: 0
Created: 2023-10-31T21:42:51Z
Pushed: 2023-11-01T18:51:42Z
Default branch: main
Fork: no
Archived: no
README:
async-file-io
An implementation of Apache Iceberg's FileIO that downloads files asynchronously.
Async downloads are started when a new InputFile is created from the FileIO instance. The InputFile returned will block when newStream is called until the download completes.
The underlying ResolvingFileIO is used for newOutputFile and deleteFile.
Building
To build, run gradle build:
./gradlew build
Configuration
To configure this FileIO, set the io-impl property on a catalog.
Here is an example of Spark configuration for a catalog named prod:
spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog spark.sql.catalog.prod.type=rest spark.sql.catalog.prod.uri=https://api.tabular.io/ws spark.sql.catalog.prod.credential=... spark.sql.catalog.prod.warehouse=prod spark.sql.catalog.prod.io-impl=io.tabular.AsyncFileIO spark.sql.catalog.prod.async.cache-location=file:/tmp
Where data is locally stored is configured by async.cache-location. The cache location can be either a local path (e.g. file:/tmp) or memory:/ to cache data in an in-memory FileIO.
To configure the number of background threads, set the Java system property iceberg.worker.num-threads.