Model | Reka AI | published Mar 11, 2026

RekaAI/reka-edge-2603

task: image-text-to-text | license: other | library: transformers | params: 7.1B | downloads: 947 | likes: 131

Reka Edge

Reka Edge is an extremely efficient 7B multimodal vision-language model that accepts image/video+text inputs and generates text outputs. This model is optimized specifically to deliver industry-leading performance in image understanding, video analysis, object detection, and agentic tool-use.

Learn more about Reka Edge in our announcement blog post.

Demo | API Docs | Discord

Key features

  • Faster and more token-efficient than similarly sized VLMs
  • Strong benchmark performance across VQA-v2, RefCOCO, MLVU, MMVU and Mobile Actions (see below)
  • Support for vLLM (see plugin)
  • Open weights license: the model can be used commercially if you make less than $1 million USD of revenue a year

Benchmarks and metrics

|Benchmark|Reka Edge|Cosmos-Reason2 8B|Qwen 3.5 9B|Gemini 3 Pro|
|:-|:-|:-|:-|:-|
|VQA-V2 *Visual Question Answering*|88.40|79.82|83.22|89.78|
|MLVU *Video Understanding*|74.30|37.85|52.39|80.68|
|MMVU *Multimodal Video Understanding*|71.68|51.52|68.64|78.88|
|RefCOCO-A *Object Detection*|93.13|90.98|93.62|81.46|
|RefCOCO-B *Object Detection*|86.70|85.74|88.83|82.85|
|VideoHallucer *Hallucination*|59.57|51.65|56.00|66.78|
|Mobile Actions *Tool Use*|88.40|77.94|91.78|89.39|

|Metric|Reka Edge|Cosmos-Reason2 8B|Qwen 3.5 9B|Gemini 3 Pro*|
|:-|:-|:-|:-|:-|
|Input tokens *For a 1024 x 1024 image*|331|1063|1041|1094|
|End-to-end latency (*in seconds*)|4.69 ± 2.48|10.56 ± 3.47|10.31 ± 1.81|16.67 ± 4.47|
|TTFT (s) *Time to first token*|0.522 ± 0.452|0.844 ± 0.923|0.60 ± 0.65|13.929 ± 3.872|

*\*Gemini 3 Pro measured via API call; other models measured with local inference.*
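To put the efficiency figures in relative terms, the short sketch below computes how many times more input tokens and end-to-end latency each comparison model needs than Reka Edge. It uses only the mean values from the metrics table and ignores the ± spreads, so the ratios are rough. The helper name `ratios_vs_baseline` is illustrative, not part of any Reka tooling.

```python
# Mean values copied from the metrics table (1024 x 1024 image input).
input_tokens = {
    "Reka Edge": 331,
    "Cosmos-Reason2 8B": 1063,
    "Qwen 3.5 9B": 1041,
    "Gemini 3 Pro": 1094,
}
latency_s = {
    "Reka Edge": 4.69,
    "Cosmos-Reason2 8B": 10.56,
    "Qwen 3.5 9B": 10.31,
    "Gemini 3 Pro": 16.67,
}


def ratios_vs_baseline(values, baseline="Reka Edge"):
    """Return each non-baseline value divided by the baseline value."""
    base = values[baseline]
    return {
        name: round(value / base, 2)
        for name, value in values.items()
        if name != baseline
    }


token_ratios = ratios_vs_baseline(input_tokens)
latency_ratios = ratios_vs_baseline(latency_s)

for name in token_ratios:
    print(f"{name}: {token_ratios[name]}x input tokens, "
          f"{latency_ratios[name]}x end-to-end latency")
```

On these means, the comparison models use roughly three times as many input tokens for the same image. The Gemini 3 Pro latency ratio mixes API and local measurements, so it is not a like-for-like comparison.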

Quick Start

llama.cpp

To get started:

1. Use the weights from the repo

2. Build the necessary artifacts from the llama.cpp repo

cmake -B build
cmake --build build --target llama-server -j
cmake --build build --target llama-quantize -j

3. Run the GGUF conversion script (convert_reka_vlm_to_gguf.py) from the llama.cpp repo root

python3 convert_reka_vlm_to_gguf.py /path/to/reka/weights \
--outfile /path/to/reka-text-f16.gguf \
--outtype f16

# Export the vision encoder
python3 convert_reka_vlm_to_gguf.py /path/to/reka/weights \
--mmproj \
--outfile /path/to/reka-mmproj-f16.gguf \
--outtype f16

4. (optional) Use the quantization scripts (quantize_reka_...) for simple quantizations of the model

# Example usage for text decoder quantization
bash inference/hf_release/quantize_reka_q4_last8_q8.sh /path/to/reka-text-f16.gguf /path/to/reka-text-q4_last8_q8.gguf
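As a rough guide to what quantization buys, the sketch below estimates the size of the text decoder weights at a few uniform llama.cpp formats. It assumes the 7.1B parameter count listed for the model and the usual bits-per-weight figures for `f16`, `q8_0` and `q4_0`. The exact mix produced by the `quantize_reka_...` scripts is not described here, so these numbers are estimates for uniform formats only and exclude the vision encoder, KV cache and runtime overhead.

```python
PARAMS = 7.1e9  # assumption: parameter count listed for the model

# Approximate storage cost per weight, including per-block scales.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,   # 32 int8 weights + one fp16 scale per block
    "q4_0": 4.5,   # 32 4-bit weights + one fp16 scale per block
}


def weight_gb(params, bits_per_weight):
    """Estimated weight size in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt}: ~{weight_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions the f16 export is about 14 GB of weights, while a uniform 4-bit format is about 4 GB.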

5. Run llama-server

./build/bin/llama-server -m /path/to/reka-text-f16.gguf \
--mmproj /path/to/reka-mmproj-f16.gguf \
-t 8 -c 2048 --host 0.0.0.0 --port 8080 --reasoning off

One note: the model does not currently support reasoning, so we run llama-server with --reasoning off.
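Once the server is running, it can be queried over HTTP. The sketch below is a minimal client using only the standard library. It assumes the server exposes llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint on `localhost:8080` and accepts images as base64 data URIs when started with `--mmproj`. The function names `build_payload` and `ask` are illustrative.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes, prompt, mime="image/jpeg", max_tokens=256):
    """Build an OpenAI-style chat request with one image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def ask(image_path, prompt, base_url="http://localhost:8080"):
    """Send the image and prompt to the server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running llama-server):
#   print(ask("media/hamburger.jpg", "What is in this image?"))
```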

🤗 Transformers (macOS)

The easiest way to run the model is with the included example.py script. It uses PEP 723 inline metadata so uv resolves dependencies automatically — no manual install step:

uv run example.py --image media/hamburger.jpg --prompt "What is in this image?"
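For readers unfamiliar with PEP 723, the sketch below shows what inline script metadata looks like at the top of a file. It is a hypothetical skeleton, not the contents of the actual `example.py`: the dependency list and the `parse_args` helper are assumptions chosen to mirror the command-line flags shown in this README.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "transformers==4.57.3",
#     "torch",
#     "pillow",
# ]
# ///
# uv reads the comment block above and installs the listed packages
# into an isolated environment before running the script.
import argparse


def parse_args(argv=None):
    """Parse the flags used in this README (hypothetical reimplementation)."""
    parser = argparse.ArgumentParser(description="Query a vision-language model")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--image", help="path to an input image")
    source.add_argument("--video", help="path to an input video")
    parser.add_argument("--prompt", required=True, help="text query")
    return parser.parse_args(argv)


# Example:
#   args = parse_args(["--image", "media/hamburger.jpg", "--prompt", "What is in this image?"])
```

Because the metadata lives in comments, the same file still runs under a plain `python` interpreter if the dependencies are already installed.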

Requirements

##### Edge Deployment Devices

  • Mac devices with Apple Silicon
      • OS: macOS 13+
      • Minimum: 24 GB memory
      • Recommended: 32 GB+ memory
  • Linux and Windows Subsystem for Linux (WSL) PCs
      • Minimum: 24 GB GPU and 24 GB+ system memory
      • Recommended: 32 GB+ GPU and 32 GB+ system memory
  • NVIDIA Robotics & Edge AI systems
      • Jetson Thor
      • Jetson AGX Orin (both 32 GB and 64 GB variants)

##### Custom Deployment Options

With quantization, Reka Edge can also be run on:

  • Jetson Orin Nano
  • Samsung S25
  • Qualcomm Snapdragon XR2 Gen 3 devices
  • Apple iPhone, iPad, and Vision Pro

Reach out for support deploying Reka Edge to a custom edge compute platform.

##### Software Requirements

  • Python: 3.12+
  • [uv](https://docs.astral.sh/uv/) (recommended) — handles dependencies automatically

Inline snippet

If you prefer not to use the script, install dependencies manually and paste the code below:

uv pip install "transformers==4.57.3" torch torchvision pillow tiktoken imageio einops av

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "RekaAI/reka-edge-2603"

# Load processor and model
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).eval()

# Move to MPS (Apple Silicon GPU)
device = torch.device("mps")
model = model.to(device)

# Prepare an image + text query
image_path = "media/hamburger.jpg"  # included in the model repo
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

# Tokenize using the chat template
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)

# Move tensors to device (floating-point tensors are also cast to float16)
for key, val in inputs.items():
    if isinstance(val, torch.Tensor):
        if val.is_floating_point():
            inputs[key] = val.to(device=device, dtype=torch.float16)
        else:
            inputs[key] = val.to(device=device)

# Generate
with torch.inference_mode():
    # Stop on the end-of-turn token in addition to default EOS.
    # The token literal is missing from this copy; "<sep>" is assumed from the variable name.
    sep_token_id = processor.tokenizer.convert_tokens_to_ids("<sep>")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=[processor.tokenizer.eos_token_id, sep_token_id],
    )

# Decode only the generated tokens
input_len = inputs["input_ids"].shape[1]
new_tokens = output_ids[0, input_len:]
output_text = processor.tokenizer.decode(new_tokens, skip_special_tokens=True)

# Strip any trailing turn-boundary marker
output_text = output_text.replace("<sep>", "").strip()  # "<sep>" assumed; literal missing from this copy
print(output_text)
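The snippet above hard-codes the `mps` device, which only exists on Apple Silicon. A small fallback helper along the following lines could pick a device on other machines. This is a sketch under assumptions: the helper name `choose_device` is hypothetical, and this section of the README only documents the macOS path, so CUDA or CPU execution with float16 is untested here.

```python
def choose_device(cuda_available, mps_available):
    """Pick a PyTorch device name, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Usage with PyTorch (kept in comments so the helper has no dependencies):
#   import torch
#   device = torch.device(
#       choose_device(torch.cuda.is_available(), torch.backends.mps.is_available())
#   )
#   model = model.to(device)
```

Keeping the decision in a pure function makes it easy to check without a GPU present.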

Video queries

The model also accepts video inputs. Use --video instead of --image:

uv run example.py --video media/dashcam.mp4 --prompt "Is this person falling asleep?"

messages = [
{
"role": "user",
"content": [
{"type": "video", "video":…
