NVIDIA/NeMo-Retriever
Description: NeMo Retriever Library is a scalable, performance-oriented document content and metadata extraction microservice. NeMo Retriever Library uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images that you can use in downstream generative applications.
Language: Python
License: Apache-2.0
Stars: 2935
Forks: 324
Open issues: 194
Created: 2024-08-22T16:09:05Z
Pushed: 2026-06-11T00:03:57Z
Default branch: main
Fork: no
Archived: no
README:
Important: The default branch is main, which tracks active development and may be ahead of the latest supported release.
For the latest supported release, use the 26.05 branch (GA PyPI and Helm chart version 26.5.0). The previous stable line is 26.03.
See the corresponding NeMo Retriever Library documentation.
NeMo Retriever Library
NeMo Retriever Library is a scalable, performance-oriented framework for document content and metadata extraction. It supports both NVIDIA NIM microservices and a wide range of models to find, contextualize, and extract text, tables, charts, and infographics for use in downstream generative and retrieval-augmented applications.
> [!Note]
> NeMo Retriever extraction is also referred to as NVIDIA Ingest in some NVIDIA product materials.

NeMo Retriever Library parallelizes the work of splitting documents into pages, where artifacts such as text, tables, charts, and infographics are classified, extracted, and further contextualized through optical character recognition (OCR) into a well-defined JSON schema. From there, NeMo Retriever Library manages the computation of embeddings for the extracted content and stores them in LanceDB.
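The page-level flow described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the `Artifact` record, `process_page`, and `process_document` are hypothetical names, the per-page "classification" is a stand-in that trusts pre-labeled input, and the JSON shape is an assumption rather than the library's actual schema. It only shows the general idea of handling pages concurrently and collecting typed artifacts into a serializable structure.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass


# Hypothetical record shape for illustration; not the library's JSON schema.
@dataclass
class Artifact:
    page: int
    kind: str  # "text", "table", "chart", or "infographic"
    content: str


def process_page(page_number: int, raw_items: list[tuple[str, str]]) -> list[Artifact]:
    """Stand-in for per-page classification and extraction."""
    return [Artifact(page_number, kind, content.strip()) for kind, content in raw_items]


def process_document(pages: list[list[tuple[str, str]]]) -> list[Artifact]:
    """Process pages concurrently and return artifacts in page order."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so artifacts stay sorted by page.
        per_page = pool.map(process_page, range(1, len(pages) + 1), pages)
        return [artifact for page in per_page for artifact in page]


# Toy input: each page is a list of (kind, content) pairs.
pages = [
    [("text", " Introduction "), ("table", "| Animal | Activity | Place |")],
    [("chart", "Gadgets and their cost")],
]
artifacts = process_document(pages)
print(json.dumps([asdict(a) for a in artifacts], indent=2))
```

In a real deployment the per-page work would be model inference rather than a string strip, and embedding and storage would follow as separate stages.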
The following diagram shows the NeMo Retriever Library pipeline.
For production-level performance and scalability, deploy the pipeline and supporting NIMs on Kubernetes using Helm. Start with the [NeMo Retriever Helm chart](nemo_retriever/helm/README.md) and the [NeMo Retriever Library (prerequisites / deployment)](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) documentation for published charts and install procedures. For standalone service-image builds, see [Docker Service Image](nemo_retriever/docker.md).
*Note*: Along with the recent repo name change, we're phasing out legacy ingestion APIs and simplifying the dependencies. You can follow this work and see the forward-looking API in the [nemo_retriever](nemo_retriever) library subfolder.
Typical Use
For small-scale workloads, such as fewer than 100 PDFs, you can use our in-development library setup, which works with Hugging Face models on local GPUs or with NIMs hosted on build.nvidia.com.
After [following the quickstart installation steps](nemo_retriever), you can start ingesting content with a snippet like the following:
```python
from nemo_retriever import create_ingestor
from nemo_retriever.io import to_markdown, to_markdown_by_page
from pathlib import Path

documents = [str(Path("../data/multimodal_test.pdf"))]
ingestor = create_ingestor(run_mode="batch")

# ingestion tasks are chainable and defined lazily
ingestor = (
    ingestor.files(documents)
    .extract(
        # below are the default values, but content types can be controlled
        extract_text=True,
        extract_charts=True,
        extract_tables=True,
        extract_infographics=True
    )
    .embed()
    .vdb_upload()
)

# ingestor.ingest() actually executes the pipeline
chunks = ingestor.ingest()  # pandas.DataFrame (batch and inprocess)
```

You can see the extracted text that represents the content of the ingested test document.
```python
# page 1 raw text:
>>> chunks.iloc[0]["text"]
'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose...'

# markdown formatted table from the first page
>>> chunks.iloc[1]["text"]
'| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'

# a chart from the first page
>>> chunks.iloc[2]["text"]
'Chart 1\nThis chart shows some gadgets, and some very fictitious costs.\nGadgets and their cost\n$160.00\n$140.00\n$120.00\n$100.00\nDollars\n$80.00\n$60.00\n$40.00\n$20.00\n$-\nPowerdrill\nBluetooth speaker\nMinifridge\nPremium desk fan\nHammer\nCost'

# markdown formatting for full pages or documents:
# per-page markdown is keyed by page number
>>> to_markdown_by_page(chunks).keys()
dict_keys([1, 2, 3])

>>> to_markdown_by_page(chunks)[1]
'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs.\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00…
```
Excerpt shown — open the source for the full document.
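Since the excerpt shows `to_markdown_by_page(chunks)` returning a mapping keyed by page number, one natural follow-up is writing each page to its own Markdown file. The sketch below is a hedged illustration: `write_pages` and the `page_NNN.md` naming are hypothetical, and a plain `dict[int, str]` stands in for the real return value so the example runs without the library installed.

```python
import tempfile
from pathlib import Path


def write_pages(pages: dict[int, str], out_dir: Path) -> list[Path]:
    """Write one Markdown file per page, in ascending page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page_number in sorted(pages):
        # Hypothetical naming scheme: page_001.md, page_002.md, ...
        path = out_dir / f"page_{page_number:03d}.md"
        path.write_text(pages[page_number], encoding="utf-8")
        written.append(path)
    return written


# Stand-in for to_markdown_by_page(chunks); assumed to map page number to text.
pages = {
    2: "Second page text",
    1: "TestingDocument\nA sample document with headings and placeholder text",
    3: "Third page text",
}

with tempfile.TemporaryDirectory() as tmp:
    written = write_pages(pages, Path(tmp) / "markdown")
    names = [p.name for p in written]
    first = written[0].read_text(encoding="utf-8")

print(names)
```

With the library installed, the stand-in dictionary would be replaced by the actual call on the ingested `chunks`, assuming its values are plain strings as the excerpt suggests.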
Notability
7.0/10. Notable research repo from NVIDIA with good traction.