sarvamai/olmOCR-bench-sarvam-api
Language: Python
Stars: 4
Forks: 2
Open issues: 1
Created: 2026-02-08T13:08:27Z
Pushed: 2026-02-09T14:37:09Z
Default branch: main
Fork: no
Archived: no
README:
olmOCR-Bench Evaluation
Run Sarvam Vision OCR on PDFs/images and evaluate the outputs against olmOCR-Bench.
Scripts
| Script | Purpose |
|---|---|
| `run_sarvam_vision_inference.py` | Run Sarvam Vision OCR on PDFs/images and save `.md` outputs |
| `postproccessing.py` | Post-processing helpers (join line-break hyphens, unwrap simple math/LaTeX); imported by the inference script |
| `run_eval.py` | Evaluate `.md` outputs against olmOCR-Bench |
Setup
    cd olmocr_bench_eval
    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt
    # For math tests (KaTeX rendering)
    playwright install && playwright install-deps
Download benchmark data
Download one of the following datasets:
    huggingface-cli login
    # Option A: English-only subset (default for commands below)
    huggingface-cli download --repo-type dataset sarvamai/olmOCR-Bench-English --local-dir ./olmocr_bench_english
    # OR
    # Option B: Full benchmark (all languages)
    huggingface-cli download --repo-type dataset allenai/olmOCR-bench --local-dir ./olmocr_bench_full
The HF download creates a nested bench_data/ inside the local directory.
Directory layout
    olmocr_bench_english/            # HF download root
    └── bench_data/                  # Actual data directory
        ├── pdfs/                    # PDFs from olmOCR-bench
        │   ├── arxiv_math/2503.04048_pg46.pdf
        │   └── ...
        ├── arxiv_math.jsonl         # Test definitions (from HF dataset)
        ├── headers_footers.jsonl
        ├── table_tests.jsonl
        ├── ...
        └── my_model/                # Your .md outputs (one folder per model)
            ├── arxiv_math/
            │   ├── 2503.04048_pg46_pg1_repeat0.md
            │   └── ...
            └── ...
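To sanity-check a download against this layout, a small helper can list the test-definition files and the candidate output folders. This is a minimal sketch: `list_layout` is a hypothetical name, and treating every subdirectory other than `pdfs/` as a candidate is an assumption based on the tree above, not something taken from `run_eval.py`.

```python
from pathlib import Path


def list_layout(bench_data):
    """Return (test definition files, candidate folders) found in bench_data."""
    root = Path(bench_data)
    # Test definitions are the *.jsonl files directly under bench_data/.
    tests = sorted(p.name for p in root.glob("*.jsonl"))
    # Assumption: any subdirectory except pdfs/ holds one model's .md outputs.
    candidates = sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name != "pdfs"
    )
    return tests, candidates
```

For example, `list_layout("olmocr_bench_english/bench_data")` would report `my_model` as a candidate once inference outputs exist.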
Naming rule: {pdf_basename}_pg{page}_repeat{N}.md. For a single run per page, use _repeat0. The _repeat suffix is auto-added by run_eval.py if missing.
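The naming rule can be expressed as a pair of small helpers. This is a sketch only: `output_name` and `ensure_repeat_suffix` are hypothetical names for illustration, and the suffix handling mirrors the description above rather than the actual code in `run_eval.py`.

```python
import re
from pathlib import Path


def output_name(pdf_path, page=1, repeat=0):
    """Build {pdf_basename}_pg{page}_repeat{N}.md from a PDF path."""
    return f"{Path(pdf_path).stem}_pg{page}_repeat{repeat}.md"


def ensure_repeat_suffix(md_name):
    """Append _repeat0 to a .md file name that lacks a _repeat{N} suffix."""
    stem = Path(md_name).stem
    if re.search(r"_repeat\d+$", stem):
        return md_name  # already follows the naming rule
    return f"{stem}_repeat0.md"
```

With the example file from the directory layout, `output_name("pdfs/arxiv_math/2503.04048_pg46.pdf")` gives `2503.04048_pg46_pg1_repeat0.md`.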
Step 1: Run inference
    export SARVAM_API_KEY="your_key_here"
    # With post-processing enabled
    python run_sarvam_vision_inference.py olmocr_bench_english/bench_data/pdfs/ olmocr_bench_english/bench_data/my_model \
        --join-line-break-hyphens --unwrap-simple-math-latex
Already-processed files are automatically skipped.
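The README does not say how skipping is detected. One plausible approach, shown here purely as an assumption, is to check whether the expected `.md` output already exists. `pending_pdfs` is a hypothetical helper, and it assumes a single page and a single run per PDF (`_pg1_repeat0`).

```python
from pathlib import Path


def pending_pdfs(pdf_dir, out_dir):
    """Return PDFs under pdf_dir that have no matching .md under out_dir."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    pending = []
    for pdf in sorted(pdf_dir.rglob("*.pdf")):
        rel = pdf.relative_to(pdf_dir)
        # Assumed output name: page 1, first run (see the naming rule).
        out = out_dir / rel.parent / f"{pdf.stem}_pg1_repeat0.md"
        if not out.exists():
            pending.append(pdf)
    return pending
```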
Post-processing (--join-line-break-hyphens / --unwrap-simple-math-latex)
Two independent transforms (both optional). Logic lives in postproccessing.py.
1. Join line-break hyphens — joins words split by hyphens at line breaks (skips LaTeX/code blocks):
| Before | After |
|---|---|
| `experi-\nmental` | `experimental` |
| `approx- \nimation` | `approximation` |
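The two rows above can be reproduced with a single regular expression. This is a minimal sketch, not the code in `postproccessing.py`: the function name is illustrative, and unlike the real transform it does not skip LaTeX or code blocks.

```python
import re

# A hyphen between two word characters, followed by optional spaces,
# a newline, and optional indentation on the next line.
LINE_BREAK_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)")


def join_line_break_hyphens(text):
    """Join words that were split by a hyphen at a line break."""
    return LINE_BREAK_HYPHEN.sub("", text)
```

A rule this simple also merges genuine compounds that happen to break at the hyphen (`well-\nknown` becomes `wellknown`), so a production version would likely need a dictionary check or similar guard.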
2. Unwrap simple math/LaTeX — replaces trivial LaTeX with plain text/Unicode; real math is kept:
| Before | After |
|---|---|
| `$\alpha$`, `$\le$`, `$\times$` | α, ≤, × |
| `\underline{Title}` | Title |
| `$42$`, `$95\%$`, `$hello$` | 42, 95%, hello |
| `$x^2 + y^2$` | `$x^2 + y^2$` *(unchanged)* |
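A rough sketch of how such an unwrapping pass might look follows. Everything here is illustrative: the three-entry `SYMBOLS` table, the definition of a "trivial" body, and the function name are assumptions chosen to reproduce the table rows, not the actual rules in `postproccessing.py`.

```python
import re

# Tiny illustrative symbol table; a real mapping would be much larger.
SYMBOLS = {r"\alpha": "α", r"\le": "≤", r"\times": "×"}

# Inline math: $...$ with no dollar sign or newline inside.
INLINE_MATH = re.compile(r"\$([^$\n]+)\$")
# Assumed "trivial" body: letters, digits, spaces, '.', ',' and escaped '%'.
PLAIN_BODY = re.compile(r"(?:[A-Za-z0-9 .,]|\\%)+")
UNDERLINE = re.compile(r"\\underline\{([^{}]*)\}")


def unwrap_simple_math(text):
    """Replace trivial inline LaTeX with plain text; keep real math as-is."""

    def replace(match):
        body = match.group(1).strip()
        if body in SYMBOLS:
            return SYMBOLS[body]
        if PLAIN_BODY.fullmatch(body):
            return body.replace(r"\%", "%")
        return match.group(0)  # anything else is treated as real math

    text = INLINE_MATH.sub(replace, text)
    return UNDERLINE.sub(r"\1", text)
```

Note that this sketch would also strip the dollar signs from prose such as "costs $5 and $6", so currency handling would need an extra guard.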
Step 2: Evaluate
    # Evaluate all candidate folders
    python run_eval.py -d olmocr_bench_english/bench_data
    # Evaluate one candidate
    python run_eval.py -d olmocr_bench_english/bench_data -c my_model
    # With options
    python run_eval.py -d olmocr_bench_english/bench_data --force --skip-baseline --test-report report.html
Output: overall score (%), 95% confidence interval, and per-test-type breakdown.
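The README does not state how the 95% confidence interval is computed. A percentile bootstrap over per-test pass/fail results is one common way to obtain such an interval, sketched below. Whether `run_eval.py` does this is an assumption, and `bootstrap_ci` is a hypothetical helper.

```python
import random


def bootstrap_ci(results, n_boot=1000, alpha=0.05, seed=0):
    """Return (score, low, high) for a list of 0/1 test outcomes.

    Uses a percentile bootstrap: resample the outcomes with replacement,
    recompute the mean each time, and read off the alpha/2 and 1 - alpha/2
    quantiles of the resampled means.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(results)
    means = sorted(sum(rng.choices(results, k=n)) / n for _ in range(n_boot))
    low = means[int((alpha / 2) * n_boot)]
    high = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(results) / n, low, high
```

Calling `bootstrap_ci([1] * 80 + [0] * 20)` returns an overall score of 0.8 together with lower and upper bounds around it. Multiply by 100 to report percentages.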
References
- Allen AI olmOCR-Bench — official benchmark code
- allenai/olmOCR-bench (Hugging Face) — full dataset (all languages)
- sarvamai/olmOCR-Bench-English (Hugging Face) — English-only subset
Notability
1.0/10: low-star repo, minor activity