Extract text from documents and images with Datalab Marker and OCR

Replicate Blog

Posted October 21, 2025 by andreasjansson

Datalab’s state-of-the-art document parsing and text extraction models are now on Replicate.

Marker turns PDF, DOCX, PPTX, images (and more!) into markdown or JSON. It formats tables, math, and code, extracts images, and can pull specific fields when you pass a JSON Schema.

OCR detects text in ninety languages from images and documents, and returns reading order and table grids.

The Marker model is based on the popular open-source Marker project (29k GitHub stars), and OCR is based on Surya (19k GitHub stars).

Run Marker and OCR on Replicate:

Marker: replicate.com/datalab-to/marker

OCR: replicate.com/datalab-to/ocr

Run Marker

```python
import replicate

output = replicate.run(
    "datalab-to/marker",
    input={
        "file": open("report.pdf", "rb"),
        "mode": "balanced",  # fast / balanced / accurate
        "include_metadata": True,  # return page-level JSON metadata
    },
)
print(output["markdown"][:400])
```

Run OCR

```python
import replicate

output = replicate.run(
    "datalab-to/ocr",
    input={
        "file": open("receipt.jpg", "rb"),
        "visualize": True,  # return the input image with red polygons around detected text
        "return_pages": True,  # return layout data
    },
)
print(output["text"][:200])
```

Visit the models on Replicate for code snippets in other languages.

These models are both fast and accurate, outperforming established tools like Tesseract. Marker processes a page in about 0.18 seconds and can reach 120 pages per second when batched.
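For a rough sense of what those figures mean for a real job, here is a back-of-the-envelope sketch. The helper name `estimate_seconds` is hypothetical, and the sketch assumes the quoted rates hold steady for the whole job, which real workloads may not match.

```python
# Rates quoted in the post; actual throughput will vary with content and load.
SECONDS_PER_PAGE = 0.18         # approximate single-page processing time
BATCHED_PAGES_PER_SECOND = 120  # quoted peak throughput when batched


def estimate_seconds(pages: int, batched: bool = False) -> float:
    """Estimate processing time in seconds (hypothetical helper, not an API)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if batched:
        return pages / BATCHED_PAGES_PER_SECOND
    return pages * SECONDS_PER_PAGE


if __name__ == "__main__":
    for n in (1, 100, 10_000):
        one_by_one = estimate_seconds(n)
        in_batch = estimate_seconds(n, batched=True)
        print(f"{n} pages: {one_by_one:.2f}s one by one, {in_batch:.2f}s batched")
```

These numbers are estimates derived from the quoted rates only; they ignore upload time, queueing, and cold starts.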

Structured extraction

One particularly powerful feature of Marker is structured extraction. For example, you can extract specific fields from an invoice:

```python
import json

import replicate

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
}

output = replicate.run(
    "datalab-to/marker",
    input={
        "file": "https://multimedia-example-files.replicate.dev/replicator-invoice.1page.pdf",
        "page_schema": json.dumps(schema),
    },
)
structured_data = json.loads(output["extraction_schema_json"])
print(structured_data)
```
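Extracted fields are model output, so it can be worth checking them before passing them downstream. The following is a minimal sketch, assuming the parsed result is a flat dict keyed by the schema's property names (the post does not confirm the exact shape). `check_fields` is a hypothetical helper, not part of `replicate` or Marker, and it handles only the `string` and `number` types used in the invoice schema. The sample values are invented for illustration and are not real model output.

```python
# Maps the JSON Schema type names used in the invoice schema to Python types.
JSON_TYPES = {"string": str, "number": (int, float)}


def check_fields(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means every field matched."""
    problems = []
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            problems.append(f"missing field: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is not None and (
            isinstance(value, bool) or not isinstance(value, expected)
        ):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems


invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Invented sample: "total" arrives as a string and "date" is absent.
sample = {"vendor": "Example Co", "invoice_number": "INV-001", "total": "12.50"}

if __name__ == "__main__":
    print(check_fields(sample, invoice_schema))
```

For anything beyond flat schemas, a full JSON Schema validator would be a better fit than this hand-rolled check.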

Performance

Marker performance was evaluated using the olmOCR-Bench benchmark, a dataset of 1,403 PDF files with 7,010 unit test cases that evaluate the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information.

In balanced mode, Marker achieves the highest overall score of all models tested, including GPT-4o, Deepseek OCR, Mistral OCR, and olmOCR.

| Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Datalab Marker (Balanced mode) | 81.4 | 80.3 | 89.4 | 50.0 | 88.3 | 81.0 | 91.6 | 99.9 | 82.7 ± 0.9 |
| Datalab Marker (Fast mode) | 83.8 | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
| Deepseek OCR | 75.2 | 67.9 | 79.1 | 32.9 | 96.1 | 66.3 | 78.5 | 97.7 | 74.2 ± 1.0 |
| Nanonets OCR | 67.0 | 68.6 | 77.7 | 39.5 | 40.7 | 69.9 | 53.4 | 99.3 | 64.5 ± 1.1 |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 |
| Qwen 2.5 VL (No Anchor) | 63.1 | 65.7 | 67.3 | 38.6 | 73.6 | 68.3 | 49.1 | 98.3 | 65.5 ± 1.2 |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | 95.1 | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 |
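To make the overall ranking easier to read, the short sketch below sorts the models by the Overall column. The scores are copied from the results above with the ± margins dropped, so close pairs should not be read as significantly different.

```python
# Overall olmOCR-Bench scores from the results above (error margins omitted).
overall = {
    "Datalab Marker (Balanced mode)": 82.7,
    "Datalab Marker (Fast mode)": 76.5,
    "Mistral OCR API": 72.0,
    "Deepseek OCR": 74.2,
    "Nanonets OCR": 64.5,
    "GPT-4o (Anchored)": 69.9,
    "Gemini Flash 2 (Anchored)": 63.8,
    "Qwen 2.5 VL (No Anchor)": 65.5,
    "olmOCR v0.3.0": 78.5,
}

# Sort from highest to lowest overall score.
ranked = sorted(overall.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for position, (model, score) in enumerate(ranked, start=1):
        print(f"{position}. {model}: {score}")
```

Sorted this way, balanced mode ranks first overall, while fast mode lands third behind olmOCR v0.3.0.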

Pricing

Marker costs:

$4 per 1000 pages without page_schema in fast and balanced modes.

$6 per 1000 pages when doing structured extraction with page_schema.

$6 per 1000 pages in accurate mode.

OCR costs $2 per 1000 pages.
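The pricing rules can be sketched as a small estimator. The function names are hypothetical, and two points are assumptions the post does not state: that billing is pro rata per page, and that structured extraction in accurate mode still costs $6 per 1000 pages rather than more.

```python
MARKER_MODES = {"fast", "balanced", "accurate"}


def marker_cost_usd(pages: int, mode: str = "balanced", structured: bool = False) -> float:
    """Estimate Marker cost in USD from the listed per-1000-page prices."""
    if mode not in MARKER_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Assumption: accurate mode and page_schema each cost $6 and do not stack.
    rate_per_1000 = 6.0 if (structured or mode == "accurate") else 4.0
    return pages / 1000 * rate_per_1000


def ocr_cost_usd(pages: int) -> float:
    """Estimate OCR cost in USD at $2 per 1000 pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * 2.0


if __name__ == "__main__":
    print(marker_cost_usd(2500))                   # balanced, no page_schema
    print(marker_cost_usd(2500, structured=True))  # with page_schema
    print(ocr_cost_usd(2500))
```

Check the model pages on Replicate for current pricing before relying on an estimate like this.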
