UpstageAI/tensorizer
forked from coreweave/tensorizer
Description: Module, Model, and Tensor Serialization/Deserialization
Language: Python
License: MIT
Stars: 0
Forks: 0
Open issues: 1
Created: 2025-01-17T07:12:34Z
Pushed: 2025-12-02T08:49:28Z
Default branch: main
Fork: yes
Parent repository: coreweave/tensorizer
Archived: no
README:
tensorizer
Module, Model, and Tensor Serialization/Deserialization
TLDR
Extremely fast model loads from HTTP/HTTPS, Redis, and S3 endpoints. GPT-J (20GB) loads at wire-speed (~5GB/s) on a 40GbE network, and is only bottlenecked by the Linux kernel TCP stack.
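The wire-speed figure can be sanity-checked with a unit conversion. The sketch below assumes decimal units (1 GB = 10^9 bytes) and ignores protocol overhead; the helper names are illustrative and are not part of tensorizer.

```python
def line_rate_gb_per_s(link_gbit_per_s: float) -> float:
    """Convert a link speed in gigabits/s to gigabytes/s (8 bits per byte)."""
    return link_gbit_per_s / 8


def load_seconds(model_gb: float, throughput_gb_per_s: float) -> float:
    """Time needed to transfer model_gb at a sustained throughput."""
    return model_gb / throughput_gb_per_s


rate = line_rate_gb_per_s(40)     # theoretical ceiling of a 40GbE link
seconds = load_seconds(20, rate)  # a 20GB checkpoint at that rate
print(f"{rate:.1f} GB/s, {seconds:.1f} s")  # prints "5.0 GB/s, 4.0 s"
```

In other words, the ~5GB/s quoted above is the theoretical ceiling of a 40GbE link, and a 20GB model at that rate takes roughly four seconds to transfer.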
Rationale
CoreWeave and our customers use KNative to deploy models as serverless functions. How long a model takes to load is a major factor in the latency of KNative scale-up. tensorizer is a tool to serialize models and their associated tensors into a single file that can be loaded quickly and efficiently off an HTTP/HTTPS or S3 endpoint.
By not embedding the model in the container image, we can reduce the container image size and the time it takes to load the model. This is especially important for large models such as EleutherAI/gpt-neox-20B, which weighs in at ~40GB.
This decoupling of the model from the container image also allows us to update the model without having to rebuild the container image. This allows us to quickly iterate on the model and deploy new versions without having to wait for the container image to build or for the container image cache to be populated.
tensorizer has S3 support, so we can store the serialized model in S3 object storage, and perform streaming loads from S3. This allows us to stream the model directly from S3 into the container without having to download the model to the container's local filesystem. This also pertains to HTTP/HTTPS endpoints, as S3 is just an HTTP/HTTPS endpoint.
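To illustrate the general idea of a streaming load, the sketch below reads a byte stream in chunks directly into a preallocated in-memory buffer, with no intermediate file on disk. It is not tensorizer's implementation: `stream_into_buffer` is a hypothetical helper, and an `io.BytesIO` object stands in for an HTTP or S3 response body.

```python
import io


def stream_into_buffer(stream, total_size: int, chunk_size: int = 1 << 20) -> bytearray:
    """Read total_size bytes from a file-like stream straight into memory."""
    buffer = bytearray(total_size)  # preallocated destination
    view = memoryview(buffer)       # lets readinto() fill slices without copying
    offset = 0
    while offset < total_size:
        end = min(offset + chunk_size, total_size)
        n = stream.readinto(view[offset:end])
        if not n:
            raise EOFError(f"stream ended after {offset} of {total_size} bytes")
        offset += n
    return buffer


# Stand-in for a network response body (assumption: a real one would come
# from an HTTP client object that supports readinto()).
payload = bytes(range(256)) * 1024
received = stream_into_buffer(io.BytesIO(payload), len(payload), chunk_size=4096)
assert bytes(received) == payload
```

Because the destination is allocated once and filled in place, the only copy of the data ends up in memory, which is the property that makes loading straight from an endpoint attractive.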
tensorizer also has support for loading models from a local filesystem, so you can use it to serialize models locally and load them locally. This is extremely fast, as the same principles that make it fast for HTTP/HTTPS and S3 endpoints also apply to local filesystems.
tensorizer has preliminary support for Redis, but it is not recommended for model deployment due to the lack of distributed caching. It is intended for sharing state between inference pods, or for loading data on a per-request basis from a Redis cache.
Speed
tensorizer's deserialization speed is primarily network-bound.
The following graph (not reproduced here) presents data collected from the scripts and Kubernetes manifests in [examples/benchmark_buffer_size](examples/benchmark_buffer_size), comparing the various deserialization modes available in tensorizer release 2.5.0 with the raw network speed and the speed of torch.load().
Installation
From PyPI
tensorizer can be installed from PyPI with pip:
python -m pip install tensorizer
From Source
You can also install tensorizer from source using pip.
To clone the repository and install tensorizer in editable mode, run:
git clone https://github.com/coreweave/tensorizer
cd tensorizer
python -m pip install -e .
Or, run the following for pip to install tensorizer directly from GitHub:
python -m pip install git+https://github.com/coreweave/tensorizer
Basic Usage
Serialization is done with the TensorSerializer class. It takes a path_uri argument that can be a local filesystem path, an HTTP/HTTPS endpoint, or an S3 endpoint.
write_module is the main method of the TensorSerializer class. It takes a torch.nn.Module and serializes the tensors to the path_uri endpoint.
The below example serializes the EleutherAI/gpt-j-6B model to an S3 endpoint. It assumes that you have already configured your S3 credentials in ~/.s3cfg.
NOTE: Loading and serializing gpt-j-6B takes a lot of CPU RAM, up to ~20GB. Additionally, loading gpt-j-6B into a GPU requires about 16GB of VRAM. If you don't have that much RAM or VRAM, you can use the smaller gpt-neo-125M model instead.
NOTE 2: The examples below require the transformers and accelerate libraries. You can install them with pip:
python -m pip install transformers accelerate
[serialize.py](examples/serialize.py)
import torch
from tensorizer import TensorSerializer
from transformers import AutoModelForCausalLM
model_ref = "EleutherAI/gpt-j-6B"
# For less intensive requirements, swap above with the line below:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"
model = AutoModelForCausalLM.from_pretrained(
    model_ref,
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
serializer = TensorSerializer(s3_uri)
serializer.write_module(model)
serializer.close()
Conversely, deserialization is done with the TensorDeserializer class. It takes a path_uri argument that can be a local filesystem path, an HTTP/HTTPS endpoint, or an S3 endpoint.
load_into_module is the main method of the TensorDeserializer class. It takes a torch.nn.Module and loads the tensors from the path_uri endpoint into the torch.nn.Module.
The below example loads the EleutherAI/gpt-j-6B model from an S3 endpoint.
[deserialize-simple.py](examples/deserialize-simple.py)
import time
import torch
from tensorizer import TensorDeserializer
from tensorizer.utils import…