NVIDIA/aistore
Go
Description: AIStore: scalable storage for AI applications
Language: Go
License: MIT
Stars: 1877
Forks: 264
Open issues: 10
Created: 2017-12-14T01:07:30Z
Pushed: 2026-06-11T03:29:58Z
Default branch: main
Fork: no
Archived: no
README: AIStore: High-Performance, Scalable Storage for AI Workloads
AIStore (AIS) is a lightweight distributed storage stack tailored for AI applications. It's an elastic cluster that can grow and shrink at runtime and can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size. Built from scratch, AIS provides linear scale-out, consistent performance, and a flexible deployment model.
AIS is a reliable storage cluster that can natively operate on both in-cluster and remote data, without treating either as a cache.
AIS consistently shows balanced I/O distribution and linear scalability across an arbitrary number of clustered nodes. The system supports fast data access, reliability, and rich customization for data transformation workloads.
Features
- ✅ Multi-Cloud Access: Seamlessly access and manage content across multiple [cloud backends](/docs/overview.md#at-a-glance) (including AWS S3, GCS, Azure, and OCI), with fast-tier performance, configurable redundancy, and namespace-aware bucket identity (same-name buckets can coexist across accounts, endpoints, and providers).
- ✅ Deploy Anywhere: AIS runs on any Linux machine, virtual or physical. Deployment options range from a minimal container-based deployment and Google Colab to petascale Kubernetes clusters. There are no built-in limitations on deployment size or functionality.
- ✅ High Availability: Redundant control and data planes. Self-healing, end-to-end protection, n-way mirroring, and erasure coding. Arbitrary number of lightweight access points (AIS proxies).
- ✅ HTTP-based API: A feature-rich, native API (with user-friendly SDKs for Go and Python), and a compliant [Amazon S3 API](/docs/s3compat.md) for running unmodified S3 clients.
- ✅ Monitoring: Comprehensive observability with integrated Prometheus metrics, Grafana dashboards, detailed logs with configurable verbosity, and CLI-based performance tracking for complete cluster visibility and troubleshooting. See [AIStore Observability](/docs/monitoring-overview.md) for details.
- ✅ Chunked Objects: High-performance chunked object representation, with independently retrievable chunks, metadata v2, and checksum-protected manifests. Supports rechunking, parallel reads, and seamless integration with [Get-Batch](/docs/get_batch.md), [blob-downloader](/docs/blob_downloader.md), and multipart uploads to supported cloud backends.
- ✅ JWT Authentication and Authorization: [Validates request JWTs](/docs/auth_validation.md) to provide cluster- and bucket-level access control using static keys or dynamic OIDC issuer JWKS lookup.
- ✅ Secure Redirects: Configurable cryptographic signing of redirect URLs using HMAC-SHA256 with a versioned cluster key (distributed via metasync, stored in memory only).
- ✅ Load-Aware Throttling: Dynamic request throttling based on a multi-dimensional load vector (CPU, memory, disk, file descriptors, goroutines) to protect AIS clusters under stress.
- ✅ Unified Namespace: Attach AIS clusters together to provide unified access to datasets across independent clusters, allowing users to reference shared buckets with cluster-specific identifiers.
- ✅ Turn-key Cache: In addition to robust data protection features, AIS offers a per-bucket configurable LRU-based cache with eviction thresholds and storage capacity watermarks.
- ✅ ETL Offload: Execute I/O intensive data transformations [close to the data](/docs/etl.md), either inline (on-the-fly as part of each read request) or offline (batch processing, with the destination bucket populated with transformed results).
- ✅ Get-Batch: Retrieve multiple objects and/or [archived files](/docs/archive.md) with a single call. Designed for ML/AI pipelines, [Get-Batch](/docs/get_batch.md) fetches an entire training batch in one operation, assembling a TAR (or another supported [serialization format](/docs/archive.md)) that contains all requested items in the exact user-specified order (paper).
- ✅ Data Consistency: Guaranteed [consistency](/docs/terminology.md#read-after-write-consistency) across all gateways, with [write-through](/docs/terminology.md#write-through) semantics in the presence of [remote backends](/docs/terminology.md#backend-provider).
- ✅ Serialization & Sharding: Native, first-class support for TAR, TGZ, TAR.LZ4, and ZIP [archives](/docs/archive.md) for efficient storage and processing of small-file datasets. Features include seamless integration with existing unmodified workflows across all APIs and subsystems.
- ✅ Kubernetes: For production, AIS runs natively on Kubernetes. The dedicated ais-k8s repository includes the AIS K8s Operator, Ansible playbooks, Helm charts, and deployment guidance.
- ✅ Batch Jobs: More than 30 cluster-wide [batch operations](/docs/batch.md) that you can start, monitor, and otherwise control. The list currently includes:

    $ ais show job --help
    NAME:
       archive        blob-download     cleanup        copy-bucket          copy-objects
       delete-objects download          dsort          ec-bucket            ec-get
       ec-put         ec-resp           elect-primary  etl-bucket           etl-inline
       etl-objects    evict-objects     evict-remote-bucket  get-batch      list
       lru-eviction   mirror            prefetch-objects     promote-files  put-copies
       rebalance      rechunk           rename-bucket  resilver             summary
       warm-up-metadata

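The Secure Redirects item above mentions HMAC-SHA256 signing of redirect URLs with a versioned cluster key. The following is a minimal sketch of that general technique using only the Go standard library; it is not AIStore's implementation. The `clusterKey` type, the `kv` and `sig` query parameter names, and the example URL are all assumptions made for illustration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"strconv"
)

// clusterKey is a hypothetical versioned signing key held in memory only.
type clusterKey struct {
	version int
	secret  []byte
}

// sign computes HMAC-SHA256 over the URL path and the encoded query.
// url.Values.Encode sorts by key, so the signed input is deterministic.
func sign(k clusterKey, path string, q url.Values) string {
	mac := hmac.New(sha256.New, k.secret)
	mac.Write([]byte(path))
	mac.Write([]byte{'?'})
	mac.Write([]byte(q.Encode()))
	return hex.EncodeToString(mac.Sum(nil))
}

// signRedirect attaches the key version ("kv") and signature ("sig")
// to the redirect URL. Both parameter names are illustrative.
func signRedirect(k clusterKey, raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse redirect URL: %w", err)
	}
	q := u.Query()
	q.Del("sig")
	q.Set("kv", strconv.Itoa(k.version))
	q.Set("sig", sign(k, u.Path, q))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// verifyRedirect looks up the key by version, recomputes the signature,
// and compares it in constant time.
func verifyRedirect(keys map[int]clusterKey, raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	q := u.Query()
	got, err := hex.DecodeString(q.Get("sig"))
	if err != nil || len(got) == 0 {
		return false
	}
	ver, err := strconv.Atoi(q.Get("kv"))
	if err != nil {
		return false
	}
	k, ok := keys[ver]
	if !ok {
		return false
	}
	q.Del("sig")
	want, _ := hex.DecodeString(sign(k, u.Path, q))
	return hmac.Equal(got, want)
}

func main() {
	k := clusterKey{version: 1, secret: []byte("demo-secret-not-for-production")}
	keys := map[int]clusterKey{k.version: k}

	signed, err := signRedirect(k, "http://target-1:8081/data/shard-0001.tar?pid=p1")
	if err != nil {
		panic(err)
	}
	fmt.Println(verifyRedirect(keys, signed))      // true
	fmt.Println(verifyRedirect(keys, signed+"00")) // false: signature was altered
}
```

Carrying the key version in the URL lets a verifier keep accepting redirects signed with the previous key while a new one is being distributed; whether AIS handles rotation this way is not stated in this excerpt.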
> The feature set continues to grow and also includes: [native bucket inventory (NBI)](/docs/nbi.md); [blob-downloader](/docs/blob_downloader.md); [AuthN - authentication and authorization server](/docs/authn.md); runtime management of [TLS certificates](/docs/cli/x509.md); full support for [adding/removing nodes at runtime](/docs/lifecycle_node.md); adaptive [rate limiting](/docs/rate_limit.md); and…