AI21Labs/ai21-tokenizer
Python
AI21Labs/ai21-tokenizer
Description: AI21's Jamba models tokenizers
Language: Python
License: Apache-2.0
Stars: 33
Forks: 4
Open issues: 1
Created: 2023-08-22T08:28:46Z
Pushed: 2025-10-27T23:44:08Z
Default branch: main
Fork: no
Archived: no
README:
AI21 Labs Tokenizer
A SentencePiece-based tokenizer for production use with AI21's models
---
Prerequisites
- If you wish to use the tokenizers for Jamba Mini or Jamba Large, you will need to request access to the relevant model's HuggingFace repo:
  - Jamba Mini
  - Jamba Large
Installation
pip
pip install ai21-tokenizer
poetry
poetry add ai21-tokenizer
Usage
Basic Usage
from ai21_tokenizer import Tokenizer
# Create tokenizer (defaults to Jamba Mini)
tokenizer = Tokenizer.get_tokenizer()
# Encode text to token IDs
text = "Hello, world!"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")
# Decode token IDs back to text
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")
Specific Tokenizer Selection
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
# Jamba Mini tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_MINI_TOKENIZER)
# Jamba Large tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_LARGE_TOKENIZER)
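A typical use of `encode` and `decode` is counting tokens and trimming text to a token budget. The sketch below is illustrative and not part of the library: `count_tokens`, `truncate_to_budget`, and `WhitespaceStandIn` are hypothetical names, and the stand-in class exists only so the example runs without model access. It assumes that `encode` returns a list of integer IDs and that `decode` accepts such a list, as the usage above suggests.

```python
from typing import List


def count_tokens(tokenizer, text: str) -> int:
    """Return the number of token IDs that encode() produces for text."""
    return len(tokenizer.encode(text))


def truncate_to_budget(tokenizer, text: str, max_tokens: int) -> str:
    """Return text unchanged if it fits, else decode its first max_tokens IDs."""
    ids = tokenizer.encode(text)
    if len(ids) <= max_tokens:
        return text
    return tokenizer.decode(ids[:max_tokens])


class WhitespaceStandIn:
    """Hypothetical stand-in, NOT part of ai21-tokenizer.

    It assigns one ID per whitespace-separated word so the helpers
    can be exercised without downloading a real tokenizer model.
    """

    def __init__(self) -> None:
        self._vocab = {}   # word -> ID
        self._words = []   # ID -> word

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self._vocab:
                self._vocab[word] = len(self._words)
                self._words.append(word)
            ids.append(self._vocab[word])
        return ids

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(self._words[i] for i in token_ids)


demo_tokenizer = WhitespaceStandIn()
print(count_tokens(demo_tokenizer, "a b c a"))
print(truncate_to_budget(demo_tokenizer, "a b c a", 2))
```

With the real library, the object returned by `Tokenizer.get_tokenizer()` would be passed in place of the stand-in. Truncating by token count can cut a word in the middle, since real tokenizers split text into subword pieces.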
Async Usage
import asyncio
from ai21_tokenizer import Tokenizer
async def main():
    tokenizer = await Tokenizer.get_async_tokenizer()
    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)
    print(f"Original: {text}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")
asyncio.run(main())
Advanced Token Operations
# Convert between tokens and IDs
tokens = tokenizer.convert_ids_to_tokens(encoded)
print(f"Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}")
Direct Class Usage
from ai21_tokenizer import SyncJambaTokenizer
# Using local model file
model_path = "/path/to/your/tokenizer.model"
tokenizer = SyncJambaTokenizer(model_path=model_path)
text = "Hello, world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
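After loading a tokenizer from a local model file, a quick round-trip check can confirm that it behaves as expected. The sketch below is illustrative only: `check_round_trip` and `CharStandIn` are hypothetical names, and the stand-in maps characters to code points so the example runs without a model file. It relies only on the four methods shown in this README (`encode`, `decode`, `convert_ids_to_tokens`, `convert_tokens_to_ids`).

```python
from typing import Dict, List


def check_round_trip(tokenizer, text: str) -> Dict[str, bool]:
    """Report whether IDs survive ids -> tokens -> ids and text survives encode -> decode."""
    ids = tokenizer.encode(text)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    ids_again = tokenizer.convert_tokens_to_ids(tokens)
    return {
        "ids_match": list(ids_again) == list(ids),
        "text_match": tokenizer.decode(ids) == text,
    }


class CharStandIn:
    """Hypothetical stand-in, NOT part of ai21-tokenizer: one token per character."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, token_ids: List[int]) -> str:
        return "".join(chr(i) for i in token_ids)

    def convert_ids_to_tokens(self, token_ids: List[int]) -> List[str]:
        return [chr(i) for i in token_ids]

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
        return [ord(tok) for tok in tokens]


print(check_round_trip(CharStandIn(), "Hello, world!"))
```

A real tokenizer may normalize whitespace or add special tokens, so `text_match` can be `False` even when the tokenizer loaded correctly. For that reason the function reports both results and leaves the interpretation to the caller.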
Async Direct Class Usage
import asyncio
from ai21_tokenizer import AsyncJambaTokenizer
async def main():
    # Using local model file
    model_path = "/path/to/your/tokenizer.model"
    tokenizer = await AsyncJambaTokenizer.create(model_path=model_path)
    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)
asyncio.run(main())
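Because the async tokenizer's `encode` is awaitable, several texts can be encoded concurrently with `asyncio.gather`. The sketch below is a minimal illustration under that assumption: `encode_many` and `AsyncCharStandIn` are hypothetical names, not part of the library, and the stand-in only mimics an awaitable `encode` so the example runs without a model file.

```python
import asyncio
from typing import List


async def encode_many(tokenizer, texts: List[str]) -> List[List[int]]:
    """Encode all texts concurrently; results come back in input order."""
    return await asyncio.gather(*(tokenizer.encode(t) for t in texts))


class AsyncCharStandIn:
    """Hypothetical stand-in, NOT part of ai21-tokenizer: one ID per character."""

    async def encode(self, text: str) -> List[int]:
        await asyncio.sleep(0)  # yield to the event loop, as real async work would
        return [ord(ch) for ch in text]


if __name__ == "__main__":
    batches = asyncio.run(encode_many(AsyncCharStandIn(), ["Hi", "ok"]))
    print(batches)
```

With the real library, the object returned by `await Tokenizer.get_async_tokenizer()` or `await AsyncJambaTokenizer.create(...)` would be passed in place of the stand-in. Whether concurrency gives a real speedup depends on how the library implements its async methods, which this sketch does not assume.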
For more examples, please see our [examples](examples) folder.