reka-ai/vllm-reka
Python
Description: vLLM plugin for Reka models
Language: Python
License: Apache-2.0
Stars: 9
Forks: 0
Open issues: 1
Created: 2026-02-16T14:59:03Z
Pushed: 2026-05-18T16:26:50Z
Default branch: main
Fork: no
Archived: no
README:
vllm-reka
This plugin serves Reka Edge — a 7B multimodal model with frontier-class image understanding, video analysis, object detection, and tool use — via vLLM.
It registers model architectures, a custom tokenizer, and HuggingFace configs so that vLLM can load and serve Reka checkpoints out of the box.
Quickstart
# 1. Install the plugin
uv sync
# 2. Download model weights (~14 GB)
pip install huggingface_hub
hf download RekaAI/reka-edge-2603 --local-dir ./models/reka-edge-2603
# 3. Start the server
bash ./serve.sh ./models/reka-edge-2603
# 4. Query it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"reka-edge-2603","messages":[{"role":"user","content":"Hello!"}]}'
Requirements
- GPU: NVIDIA GPU, ideally with ≥24 GB VRAM. Tested on RTX 3090 GPUs at 40-50 tokens/s.
- OS: Linux with CUDA. macOS is not supported for serving.
- Python: 3.10 ≤ x < 3.14
- vLLM: 0.15.x (0.15.0 ≤ x < 0.16.0)
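The version bounds above can be checked before launching the server. The following is a minimal sketch, assuming versions are plain dotted numbers; `parse_version` and `in_range` are hypothetical helpers and not part of the plugin.

```python
import sys


def parse_version(text: str) -> tuple:
    """Turn '0.15.2' into (0, 15, 2); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version: str, lower: str, upper: str) -> bool:
    """True when lower <= version < upper (inclusive lower, exclusive upper)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Check the running interpreter against the documented Python bounds.
current_python = "{}.{}.{}".format(*sys.version_info[:3])
print("Python supported:", in_range(current_python, "3.10", "3.14"))
```

The same helper can be applied to the installed vLLM version string, for example `in_range(vllm.__version__, "0.15.0", "0.16.0")`.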
Supported Models
| Model | Architecture | Vision Encoder | Description |
|---|---|---|---|
| Reka Edge | Yasa2ForConditionalGeneration | ConvNextV2 | 7B multimodal model (image + video) |
Installation
Recommended (reproducible, uses uv.lock):
uv sync
Fallback with pip:
pip install -e .
Or with Poetry:
poetry install
The plugin registers itself via the vllm.general_plugins entry point — vLLM discovers it automatically once installed.
Serving
serve.sh (recommended)
Use serve.sh as the default entrypoint. It applies the plugin-specific defaults that this repo is tested with.
bash ./serve.sh
Example with explicit host/port:
HOST=0.0.0.0 PORT=8000 bash ./serve.sh ./models/reka-edge-2603
You can also pass through additional vllm serve flags:
bash ./serve.sh ./models/reka-edge-2603 --max-num-seqs 32
serve.sh configuration
Common environment variables:
| Variable | Default | Description |
|---|---|---|
| HOST | 0.0.0.0 | Bind address |
| PORT | 8000 | API port |
| SERVED_MODEL_NAME | reka-edge-2603 | Model name exposed to OpenAI-compatible clients |
| GPU_MEM | 0.95 | --gpu-memory-utilization |
| MAX_LEN | 16384 | --max-model-len |
| MAX_BATCH_TOKENS | 20000 | --max-num-batched-tokens |
| MAX_IMAGES | 6 | Per-prompt image cap |
| MAX_VIDEOS | 3 | Per-prompt video cap |
| VIDEO_NUM_FRAMES | 6 | Frames sampled per video. Higher values improve temporal understanding but increase latency and memory usage. |
| VIDEO_SAMPLING | chunk | Video sampling strategy |
| TP_SIZE | 1 | Tensor parallel size |
| DTYPE | bfloat16 | vLLM dtype |
| QUANTIZATION | bitsandbytes | Quantization backend (see [Quantization](#quantization)) |
Optional runtime env vars:
- VLLM_TORCH_PROFILER_DIR (only exported when set)
- USE_IMAGE_PATCHING (default 1)
- VLLM_VIDEO_LOADER_BACKEND (default yasa)
- VLLM_USE_V1 (default 1)
- VLLM_FLASH_ATTN_VERSION (default 3)
- VLLM_HTTP_TIMEOUT_KEEP_ALIVE (default 300)
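When the server is started from a Python script rather than a shell, the same variables can be passed through the process environment. The sketch below relies only on the documented interface (environment variables, a model directory, and pass-through flags); `build_serve_env`, `serve_command`, and `launch` are hypothetical helpers, and how serve.sh consumes each variable internally is not shown here.

```python
import os
import subprocess

# Variable names taken from the configuration table above.
KNOWN_VARS = frozenset({
    "HOST", "PORT", "SERVED_MODEL_NAME", "GPU_MEM", "MAX_LEN",
    "MAX_BATCH_TOKENS", "MAX_IMAGES", "MAX_VIDEOS", "VIDEO_NUM_FRAMES",
    "VIDEO_SAMPLING", "TP_SIZE", "DTYPE", "QUANTIZATION",
})


def build_serve_env(overrides=None, base_env=None) -> dict:
    """Copy the base environment and apply overrides, rejecting unknown names."""
    overrides = overrides or {}
    unknown = set(overrides) - KNOWN_VARS
    if unknown:
        raise KeyError(f"unknown serve.sh variables: {sorted(unknown)}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


def serve_command(model_dir: str, *extra_flags: str) -> list:
    """Build the argv for serve.sh; extra flags are passed through to vllm serve."""
    return ["bash", "./serve.sh", model_dir, *extra_flags]


def launch(model_dir: str, overrides=None, *extra_flags: str) -> subprocess.Popen:
    """Start the server as a child process (not called in this sketch)."""
    return subprocess.Popen(serve_command(model_dir, *extra_flags),
                            env=build_serve_env(overrides))
```

For example, `launch("./models/reka-edge-2603", {"PORT": 9000, "MAX_LEN": 8192}, "--max-num-seqs", "32")` would mirror the shell invocations shown earlier. Variables that are not overridden keep the defaults chosen by serve.sh.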
Quantization
The server defaults to 4-bit bitsandbytes quantization, which reduces VRAM usage enough to run on consumer GPUs (e.g., RTX 4090 with 24 GB). To run at full precision instead:
QUANTIZATION="" bash ./serve.sh ./models/reka-edge-2603
Running unquantized requires more VRAM (about 14 GB for the bfloat16 weights alone) but avoids any quantization-related quality loss.
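The VRAM figures follow from simple arithmetic on the parameter count. The sketch below assumes a nominal 7 billion parameters and counts weight storage only; it ignores the KV cache, activations, and the metadata that 4-bit quantization adds, so actual usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: "7B" taken as exactly 7 billion parameters

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)   # half a byte per parameter

print(f"bfloat16 weights: {bf16_gb} GB")  # 14.0 GB
print(f"4-bit weights:    {int4_gb} GB")  # 3.5 GB
```

The remaining VRAM on a 24 GB card is what is available for the KV cache, which grows with MAX_LEN and the number of concurrent sequences.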
Advanced: direct vllm serve
Prefer serve.sh unless you need full manual control. Minimal direct command:
vllm serve \
--tokenizer-mode yasa \
--chat-template-content-format openai \
--trust-remote-code
Examples
Once the server is running, it exposes an OpenAI-compatible API at http://localhost:8000 (or your configured PORT).
Text
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Image understanding
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Describe this image in detail."}
]
}]
}'
Video analysis
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
{"type": "text", "text": "Summarize what happens in this video."}
]
}]
}'
Object detection
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Detect: eye, ear"}
]
}]
}'
Tool use / function calling
serve.sh enables tool use by default (--enable-auto-tool-choice --tool-call-parser hermes). Pass tools via the standard OpenAI tools parameter:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location.",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
}'
The model will return a tool_calls response when it decides to invoke a function.
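On the client side, a tool_calls response has to be executed locally and the results sent back as tool-role messages. The following sketch handles a response in the standard OpenAI chat-completion shape. `get_weather` here is a hypothetical stand-in that returns placeholder text, and `run_tool_calls` is an illustrative helper rather than part of the plugin.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical local tool; returns placeholder text, not real weather data."""
    return f"Sunny in {location}"


# Maps the tool names declared in the request to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(response: dict) -> list:
    """Execute each tool call and build the tool-role messages to send back."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        output = TOOLS[function["name"]](**arguments)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": output,
        })
    return results
```

The returned messages are appended to the conversation, after the assistant message that contained the tool calls, and the request is sent again so the model can produce its final answer.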
Python client
The server is compatible with the OpenAI Python SDK:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
# Text query
response = client.chat.completions.create(…
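The image and video requests shown earlier with curl can also be built in Python without the SDK. This is a minimal sketch using only the standard library; `media_message`, `chat_payload`, and `post_chat` are hypothetical helpers, and the default URL assumes the server is running locally on port 8000.

```python
import json
import urllib.request


def media_message(text: str, image_url: str = None, video_url: str = None) -> dict:
    """Build a user message with optional image and video parts before the text."""
    content = []
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if video_url:
        content.append({"type": "video_url", "video_url": {"url": video_url}})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def chat_payload(messages: list, model: str = "reka-edge-2603") -> dict:
    """Wrap messages in a chat-completions request body."""
    return {"model": model, "messages": messages}


def post_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With a running server, `post_chat(chat_payload([media_message("Detect: eye, ear", image_url="...")]))` sends the same kind of request as the object detection example.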