PaddlePaddle/RocketQA

Description: 🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

Language: Python

License: Apache-2.0

Stars: 786

Forks: 123

Open issues: 66

Created: 2021-09-07T03:36:20Z

Pushed: 2023-12-19T08:08:35Z

Default branch: main

Fork: no

Archived: no

README:

In recent years, dense retrievers based on pre-trained language models have made remarkable progress. To help more developers use these cutting-edge techniques, this repository provides an easy-to-use toolkit for running and fine-tuning state-of-the-art dense retrievers, namely 🚀RocketQA. The toolkit has the following advantages:

  • *State-of-the-art*: 🚀RocketQA provides our well-trained models, which achieve SOTA performance on many dense retrieval datasets. The toolkit will continue to be updated with the latest models.
  • *First-Chinese-model*: 🚀RocketQA provides the first open-source Chinese dense retrieval model, trained on millions of manually annotated examples from DuReader.
  • *Easy-to-use*: By integrating this toolkit with JINA, 🚀RocketQA can help developers build an end-to-end retrieval system and question answering system with several lines of code.

News

  • 🎉 Nov 27, 2022: Our survey paper on dense retrieval, Dense Text Retrieval based on Pretrained Language Models: A Survey, became publicly available.
  • Oct 8, 2022: DuReader_retrieval was accepted by EMNLP 2022. [[data]](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval) The latest version of DuReader_retrieval contains cross-lingual retrieval benchmarks. Stay tuned!
  • Apr 29, 2022: A training function was added to the RocketQA toolkit, and the baseline models of DuReader_retrieval (both cross encoder and dual encoder) are now available among the RocketQA models.
  • Mar 30, 2022: We released DuReader_retrieval, a large-scale Chinese benchmark for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search. [[paper]](https://arxiv.org/abs/2203.10232) [[data]](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval) The baseline for the DuReader_retrieval leaderboard was also released. [[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/DuReader-Retrieval-Baseline)
  • Dec 3, 2021: The RocketQA dense retriever toolkit was released, including the first Chinese dense retrieval model trained on DuReader.
  • Aug 26, 2021: RocketQA v2 was accepted by EMNLP 2021. [[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/RocketQAv2_EMNLP2021)
  • May 5, 2021: PAIR was accepted by ACL 2021. [[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/PAIR_ACL2021)
  • Mar 11, 2021: RocketQA v1 was accepted by NAACL 2021. [[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/RocketQA_NAACL2021)

Installation

We provide two installation methods: *Python Installation Package* and *Docker Environment*.

Install with Python Package

First, install PaddlePaddle.

# GPU version:
$ pip install paddlepaddle-gpu

# CPU version:
$ pip install paddlepaddle

Second, install the rocketqa package (latest version: 1.1.0):

$ pip install rocketqa

NOTE: this toolkit MUST run on Python 3.6+ with PaddlePaddle 2.0+.
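
The version requirement above can be checked before installing `rocketqa`. The following is a minimal sketch, not part of the toolkit: the helper names `parse_version` and `check_environment` are hypothetical, and it assumes the installed PaddlePaddle exposes its version string as `paddle.__version__`.

```python
import sys


def parse_version(text):
    """Turn a string like '2.4.1' into (2, 4, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_environment():
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6+ is required")
    try:
        import paddle  # provided by the paddlepaddle or paddlepaddle-gpu package
    except ImportError:
        problems.append("PaddlePaddle is not installed")
    else:
        if parse_version(paddle.__version__) < (2, 0):
            problems.append("PaddlePaddle 2.0+ is required, found " + paddle.__version__)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Development builds of PaddlePaddle may report a placeholder version string, in which case this sketch would flag them as too old.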

Install with Docker

docker pull rocketqa/rocketqa

docker run -it docker.io/rocketqa/rocketqa bash

Getting Started

With the examples below, you can build and run your own search engine in a few lines of code. We also provide a Playground with Jupyter Notebook. Try 🚀RocketQA straight away in your browser!

Running with JINA

JINA is a cloud-native neural search framework to build SOTA and scalable deep learning search applications in minutes. Here is a simple example to build a Search Engine based on JINA and RocketQA.

cd examples/jina_example
pip3 install -r requirements.txt

# Generate vector representations and build a library for your Documents
# JINA will automatically start a web service for you
python3 app.py index toy_data/test.tsv

# Try some questions related to the indexed Documents
python3 app.py query_cli

See the JINA example to learn more.

Running with FAISS

We also provide a simple example built on Faiss.

cd examples/faiss_example/
pip3 install -r requirements.txt

# Generate vector representations and build a library for your Documents
python3 index.py zh ../data/dureader.para test_index

# Start a web service on http://localhost:8888/rocketqa
python3 rocketqa_service.py zh ../data/dureader.para test_index

# Try some questions related to the indexed Documents
python3 query.py

API

You can also easily integrate 🚀RocketQA into your own task. We provide two types of models: an ERNIE-based dual encoder for answer retrieval and an ERNIE-based cross encoder for answer re-ranking. To run our models, use the following functions.
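
To illustrate how the two model types divide the work, here is a toy sketch of a retrieve-then-rerank pipeline. It does not use RocketQA or ERNIE: `toy_encode` and `toy_cross_score` are hypothetical stand-ins (bag-of-words vectors and word overlap) chosen only so the example runs with the standard library.

```python
import math
from collections import Counter


def toy_encode(text):
    """Hypothetical stand-in for a dual encoder: an L2-normalised bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def dot(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


def toy_cross_score(query, passage):
    """Hypothetical stand-in for a cross encoder: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(passage.lower().split())) / len(q_words)


def retrieve_then_rerank(query, passages, top_k=2):
    # Stage 1 (dual encoder): passages are encoded independently of the query,
    # so in a real system their vectors can be precomputed and indexed offline.
    passage_vecs = [toy_encode(p) for p in passages]
    q_vec = toy_encode(query)
    ranked = sorted(range(len(passages)),
                    key=lambda i: dot(q_vec, passage_vecs[i]),
                    reverse=True)
    candidates = ranked[:top_k]
    # Stage 2 (cross encoder): each (query, candidate) pair is scored jointly,
    # which costs more per pair and is therefore applied only to the short list.
    reranked = sorted(candidates,
                      key=lambda i: toy_cross_score(query, passages[i]),
                      reverse=True)
    return [passages[i] for i in reranked]


if __name__ == "__main__":
    docs = [
        "paris is the capital of france",
        "berlin is the capital of germany",
        "bananas are yellow",
    ]
    print(retrieve_then_rerank("capital of france", docs))
```

The design point is the cost split: the first stage scores every passage with a cheap vector comparison, while the second stage spends more computation on a small candidate set.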

Load model

`rocketqa.available_models()`

Returns the names of the available RocketQA models. To learn more about the available models, please see the code comment.

`rocketqa.load_model(model, use_cuda=False, device_id=0, batch_size=1)`

Returns the model specified by the input parameters. It can initialize both dual encoders and cross encoders. By setting the input parameters, you can load either RocketQA models returned by…
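
As a hedged sketch of how the two functions above might be combined, the hypothetical helper below loads whichever model name `rocketqa.available_models()` lists first. It assumes the return value is an iterable of model names, and it passes only the parameters shown in the `load_model` signature above.

```python
def load_first_available(use_cuda=False, device_id=0, batch_size=1):
    """Load the first model reported by rocketqa.available_models().

    Hypothetical helper, not part of the toolkit.
    """
    # Imported lazily so this file can be imported without rocketqa installed.
    import rocketqa

    # Assumption: available_models() yields model names (list, dict keys, or similar).
    names = list(rocketqa.available_models())
    if not names:
        raise RuntimeError("rocketqa reported no available models")
    model = rocketqa.load_model(
        model=names[0],
        use_cuda=use_cuda,
        device_id=device_id,
        batch_size=batch_size,
    )
    return names[0], model
```

Calling the helper requires the `rocketqa` package and may fetch model files on first use, so it is best tried in an environment set up as described in the Installation section.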
