Repo · NVIDIA · published Aug 22, 2024

NVIDIA/NeMo-Retriever


Description: NeMo Retriever Library is a scalable, performance-oriented document content and metadata extraction microservice. NeMo Retriever Library uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images that you can use in downstream generative applications.

Language: Python

License: Apache-2.0

Stars: 2935

Forks: 324

Open issues: 194

Created: 2024-08-22T16:09:05Z

Pushed: 2026-06-11T00:03:57Z

Default branch: main

Fork: no

Archived: no

README:

Important: The default branch is main, which tracks active development and may be ahead of the latest supported release.

For the latest supported release, use the 26.05 branch (GA PyPI and Helm chart version 26.5.0). The previous stable line is 26.03.

See the corresponding NeMo Retriever Library documentation.

NeMo Retriever Library

NeMo Retriever Library is a scalable, performance-oriented framework for document content and metadata extraction. It supports both NVIDIA NIM microservices and a wide range of models to find, contextualize, and extract text, tables, charts, and infographics for use in downstream generative and retrieval-augmented applications.

> [!Note]
> NeMo Retriever extraction is also referred to as NVIDIA Ingest in some NVIDIA product materials.

NeMo Retriever Library parallelizes the splitting of documents into pages, where artifacts (such as text, tables, charts, and infographics) are classified, extracted, and further contextualized through optical character recognition (OCR) into a well-defined JSON schema. From there, NeMo Retriever Library manages the computation of embeddings for the extracted content and stores them in LanceDB.

The following diagram shows the NeMo Retriever Library pipeline.

[Image: Pipeline Overview]
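To make the stages described above concrete, here is a minimal, standard-library-only sketch of the general shape: pages are processed in parallel, each block on a page is classified, and the results are emitted as JSON-serializable records. This is an illustration under loud assumptions, not the library's implementation. `Artifact`, `classify`, `process_page`, and `run_pipeline` are hypothetical names, the classifier is a toy rule, and the embedding and LanceDB stages are omitted entirely.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass


@dataclass
class Artifact:
    """Hypothetical record for one extracted piece of a page."""
    page: int
    kind: str
    text: str


def classify(block: str) -> str:
    """Toy rule: pipe-delimited blocks count as tables, everything else as text."""
    return "table" if block.lstrip().startswith("|") else "text"


def process_page(numbered_page: tuple[int, list[str]]) -> list[Artifact]:
    """Classify every block on a single page."""
    page_number, blocks = numbered_page
    return [Artifact(page_number, classify(block), block) for block in blocks]


def run_pipeline(pages: list[list[str]], workers: int = 4) -> list[dict]:
    """Process pages in parallel and flatten the results into plain dicts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so records stay in page order.
        per_page = list(pool.map(process_page, enumerate(pages, start=1)))
    return [asdict(artifact) for artifacts in per_page for artifact in artifacts]


# Stand-in input: two "pages", each a list of already-split text blocks.
records = run_pipeline([["Intro text", "| a | b |"], ["More text"]])
print(json.dumps(records, indent=2))
```

Threads are used here only to keep the sketch short; a real extraction workload would likely distribute heavier work across processes or services.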

For production-level performance and scalability, deploy the pipeline and supporting NIMs on Kubernetes using Helm. Start with the [NeMo Retriever Helm chart](nemo_retriever/helm/README.md) and the [NeMo Retriever Library (prerequisites / deployment)](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) documentation for published charts and install procedures. For standalone service-image builds, see [Docker Service Image](nemo_retriever/docker.md).

*Note*: Along with the recent repo name change, we're phasing out legacy ingestion APIs and simplifying the dependencies. You can follow this work and see the forward-looking API in the [nemo_retriever](nemo_retriever) library subfolder.

Typical Use

For small-scale workloads, such as fewer than 100 PDFs, you can use our in-development library setup, which works with Hugging Face models on local GPUs or with NIMs hosted on build.nvidia.com.

After [following the quickstart installation steps](nemo_retriever), you can start ingesting content with the following snippet:

from nemo_retriever import create_ingestor
from nemo_retriever.io import to_markdown, to_markdown_by_page
from pathlib import Path

documents = [str(Path("../data/multimodal_test.pdf"))]
ingestor = create_ingestor(run_mode="batch")

# ingestion tasks are chainable and defined lazily
ingestor = (
    ingestor.files(documents)
    .extract(
        # below are the default values, but content types can be controlled
        extract_text=True,
        extract_charts=True,
        extract_tables=True,
        extract_infographics=True,
    )
    .embed()
    .vdb_upload()
)

# ingestor.ingest() actually executes the pipeline
chunks = ingestor.ingest()  # pandas.DataFrame (batch and inprocess)
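
The comments in the snippet above say that ingestion tasks are chainable and defined lazily, and that `ingest()` is what executes the pipeline. The toy class below illustrates that general pattern in plain Python: each chained call only records a step, and nothing runs until the final call. It is a sketch of the idea, not how `nemo_retriever` is implemented; `LazyPipeline`, `add`, and `run` are hypothetical names.

```python
from typing import Callable

Step = Callable[[list[str]], list[str]]


class LazyPipeline:
    """Records steps when chained; executes them only when run() is called."""

    def __init__(self) -> None:
        self._steps: list[tuple[str, Step]] = []

    def add(self, name: str, fn: Step) -> "LazyPipeline":
        # Returning self is what makes the calls chainable.
        self._steps.append((name, fn))
        return self

    def run(self, items: list[str]) -> list[str]:
        # Steps execute here, in the order they were declared.
        for _name, fn in self._steps:
            items = fn(items)
        return items


calls: list[str] = []


def upper(items: list[str]) -> list[str]:
    calls.append("upper")  # record that the step really executed
    return [item.upper() for item in items]


pipeline = (
    LazyPipeline()
    .add("strip", lambda items: [item.strip() for item in items])
    .add("upper", upper)
)
executed_before_run = list(calls)  # still empty: nothing has executed yet

result = pipeline.run([" a ", "b "])
print(result)  # ['A', 'B']
```

Deferring execution this way lets the whole task list be inspected or scheduled as a unit before any work starts.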

You can see the extracted text that represents the content of the ingested test document.

# page 1 raw text:
>>> chunks.iloc[0]["text"]
'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose...'

# markdown formatted table from the first page
>>> chunks.iloc[1]["text"]
'| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'

# a chart from the first page
>>> chunks.iloc[2]["text"]
'Chart 1\nThis chart shows some gadgets, and some very fictitious costs.\nGadgets and their cost\n$160.00\n$140.00\n$120.00\n$100.00\nDollars\n$80.00\n$60.00\n$40.00\n$20.00\n$-\nPowerdrill\nBluetooth speaker\nMinifridge\nPremium desk fan\nHammer\nCost'

# markdown formatting for full pages or documents:
# per-page markdown is keyed by page number
>>> to_markdown_by_page(chunks).keys()
dict_keys([1, 2, 3])

>>> to_markdown_by_page(chunks)[1]
'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs.\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00…
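
Because `to_markdown_by_page` returns a mapping keyed by page number, as the `dict_keys([1, 2, 3])` output above shows, one possible way to assemble a single document is to join the pages in ascending key order. The sketch below uses a hand-written dict as a stand-in for the real return value, and `join_pages` is a hypothetical helper, not part of `nemo_retriever`. It also normalizes the `\r\n` line endings that appear in the extracted text above.

```python
def join_pages(pages_by_number: dict[int, str], separator: str = "\n\n") -> str:
    """Concatenate per-page markdown in ascending page order."""
    parts = []
    for number in sorted(pages_by_number):
        # Normalize Windows-style line endings and trim stray whitespace.
        text = pages_by_number[number].replace("\r\n", "\n").strip()
        parts.append(f"<!-- page {number} -->\n{text}")
    return separator.join(parts)


# Stand-in for to_markdown_by_page(chunks); keys are deliberately out of order.
pages = {2: "Second\r\npage", 1: "First page\r\n", 3: "Third page"}
document = join_pages(pages)
print(document)
```

The HTML comment markers are one arbitrary choice for keeping page boundaries visible in the joined markdown; any separator would work.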

