Repo · AI21 Labs · published Aug 22, 2023

AI21Labs/ai21-tokenizer

Python

Description: AI21's Jamba models tokenizers

Language: Python

License: Apache-2.0

Stars: 33

Forks: 4

Open issues: 1

Created: 2023-08-22T08:28:46Z

Pushed: 2025-10-27T23:44:08Z

Default branch: main

Fork: no

Archived: no

README:

AI21 Labs Tokenizer

A SentencePiece-based tokenizer for production use with AI21's models

---

Prerequisites

  • To use the tokenizers for Jamba Mini or Jamba Large, you need to request access to the relevant model's Hugging Face repo:
      • Jamba Mini
      • Jamba Large

Installation

pip

pip install ai21-tokenizer

poetry

poetry add ai21-tokenizer

Usage

Basic Usage

from ai21_tokenizer import Tokenizer

# Create tokenizer (defaults to Jamba Mini)
tokenizer = Tokenizer.get_tokenizer()

# Encode text to token IDs
text = "Hello, world!"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")

# Decode token IDs back to text
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")
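
A common use of `encode` is checking how many tokens a piece of text occupies before sending it to a model. The following is a minimal sketch, not part of this library: `count_tokens`, `truncate_to_budget`, and the `SupportsEncodeDecode` type are hypothetical helper names, and they assume only that the tokenizer exposes `encode` and `decode` as shown above. Depending on the tokenizer's defaults, the count may include special tokens.

```python
from typing import List, Protocol


class SupportsEncodeDecode(Protocol):
    """Hypothetical structural type: any object with encode/decode like the tokenizer above."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, token_ids: List[int]) -> str: ...


def count_tokens(tokenizer: SupportsEncodeDecode, text: str) -> int:
    """Return the number of token IDs that `text` encodes to."""
    return len(tokenizer.encode(text))


def truncate_to_budget(tokenizer: SupportsEncodeDecode, text: str, max_tokens: int) -> str:
    """Keep at most `max_tokens` tokens of `text`, decoding the kept IDs back to a string."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    token_ids = tokenizer.encode(text)
    if len(token_ids) <= max_tokens:
        # Already within budget: return the input unchanged.
        return text
    return tokenizer.decode(token_ids[:max_tokens])
```

With a real tokenizer this would be called as `truncate_to_budget(Tokenizer.get_tokenizer(), long_text, 256)`. Cutting on a token boundary can split a word, so the decoded tail may end mid-word.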

Specific Tokenizer Selection

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

# Jamba Mini tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_MINI_TOKENIZER)

# Jamba Large tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_LARGE_TOKENIZER)

Async Usage

import asyncio
from ai21_tokenizer import Tokenizer

async def main():
    tokenizer = await Tokenizer.get_async_tokenizer()

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)

    print(f"Original: {text}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")

asyncio.run(main())
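
Because the async tokenizer's `encode` is awaitable, several texts can be scheduled together with `asyncio.gather`. The sketch below is illustrative only: `encode_many` is a hypothetical helper, not an API of this package, and it assumes only an object whose `encode` is a coroutine, as in the example above. Whether this is faster than encoding sequentially depends on how the tokenizer implements its async methods.

```python
import asyncio
from typing import Any, List, Sequence


async def encode_many(tokenizer: Any, texts: Sequence[str]) -> List[List[int]]:
    """Encode several texts concurrently; results keep the order of `texts`."""
    # gather() awaits all encode() coroutines and returns their results in input order.
    results = await asyncio.gather(*(tokenizer.encode(t) for t in texts))
    return list(results)
```

Inside a coroutine, this would be used as `encoded_batch = await encode_many(tokenizer, ["first text", "second text"])`.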

Advanced Token Operations

# Convert between tokens and IDs (reuses `tokenizer` and `encoded` from Basic Usage)
tokens = tokenizer.convert_ids_to_tokens(encoded)
print(f"Tokens: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}")
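
The two conversion calls above are inverses of each other for tokens in the vocabulary, which makes for a quick sanity check. The following is a small sketch under that assumption: `ids_tokens_round_trip` is a hypothetical helper name, and it relies only on the `encode`, `convert_ids_to_tokens`, and `convert_tokens_to_ids` methods shown in this README.

```python
from typing import Any, List


def ids_tokens_round_trip(tokenizer: Any, text: str) -> bool:
    """Return True if IDs -> tokens -> IDs reproduces the encoded IDs for `text`."""
    ids: List[int] = list(tokenizer.encode(text))
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return list(tokenizer.convert_tokens_to_ids(tokens)) == ids
```

A `False` result would suggest that some ID maps to a token string that does not map back to the same ID, for example when the vocabulary contains duplicate token strings.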

Direct Class Usage

from ai21_tokenizer import SyncJambaTokenizer

# Using local model file
model_path = "/path/to/your/tokenizer.model"
tokenizer = SyncJambaTokenizer(model_path=model_path)

text = "Hello, world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

Async Direct Class Usage

import asyncio
from ai21_tokenizer import AsyncJambaTokenizer

async def main():
    model_path = "/path/to/your/tokenizer.model"
    tokenizer = await AsyncJambaTokenizer.create(model_path=model_path)

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)

asyncio.run(main())

For more examples, please see our [examples](examples) folder.