What does this repo signal mean?

Cerebras published Cerebras/DocChat (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo Cerebras/DocChat · language Python · Low stars, but from notable company. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Cerebras Repo: Cerebras/DocChat

Captured source

GitHub/github.com/Cerebras/DocChat

Cerebras/DocChat repository metadata

published Aug 19, 2024 · seen Jun 5 · captured Jun 11 · http 200 · method plain

Cerebras/DocChat

Description: GPT-4 Level Conversational QA Trained In a Few Hours

Language: Python

Stars: 69

Forks: 8

Open issues: 0

Created: 2024-08-19T17:09:08Z

Pushed: 2024-08-21T06:21:18Z

Default branch: main

Fork: no

Archived: no

README:

Cerebras DocChat: GPT-4 Level Conversational QA Trained In a Few Hours

This repository contains dataset preparation, training, and evaluation code for DocChat. The corresponding code for the LLM can be found in Cerebras-Llama3-DocChat, while the code for the retriever is located at Cerebras-Dragon-DocChat.

The dataset preparation & training scripts are designed to be used with Cerebras Model Zoo Release 2.3. Note that although the training scripts _can_ be run on non-Cerebras hardware, they haven't been optimized for other hardware, and so your mileage may vary. The evaluation code does not depend on Model Zoo and is hardware agnostic (supports CPU & GPU) in order to ensure that everyone can evaluate our model (including those who don't have access to Cerebras machines).

Additional Links:

Blog post
LLM model weights on HuggingFace
Embedding model weights on HuggingFace: Query Encoder, Context Encoder
Data preparation, training, and evaluation code

About DocChat

We are excited to announce the release of Cerebras DocChat, our first iteration of models designed for document-based conversational question answering. This series includes two models: Cerebras Llama3-DocChat, a large language model (LLM), and Cerebras Dragon-DocChat, a multi-turn retriever model. These models were not only developed leveraging our deep ML expertise, but were also trained with remarkable speed. Using a single Cerebras System, Llama3-DocChat was trained in a few hours, while Cerebras Dragon-DocChat was fine-tuned in just *minutes* (yes you read that correctly).

DocChat LLM

Cerebras Llama3-DocChat was built on top of Llama 3 base using insights from the latest research on document-based Q&A, most notably Nvidia’s ChatQA model series. As part of this work, we leveraged our experience in LLM model training and dataset curation to overcome the gaps in ChatQA's released datasets and training recipes. Additionally, we employed synthetic data generation to address limitations that couldn't be fully resolved with the available real data.

DocChat Retriever

Similarly, Cerebras Dragon-DocChat was built on top of the Dragon+ model and trained on ChatQA’s conversational Q&A dataset. By fine-tuning with a contrastive loss with hard negatives, we see absolute improvements in top-1 recall of 8.9% over Dragon+ and 3.5% over ChatQA Dragon-Multiturn.
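To make "contrastive loss with hard negatives" concrete, here is a minimal sketch for a single query. It assumes an InfoNCE-style objective over dot-product scores between a query embedding and context embeddings; the function names, the `temperature` parameter, and the toy vectors are hypothetical and are not taken from the DocChat training code, which may differ (for example by also using in-batch negatives).

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(query, positive, hard_negatives, temperature=1.0):
    """Negative log-softmax of the positive context's score (InfoNCE-style).

    Hypothetical helper: the positive competes against the hard negatives,
    so the loss shrinks as the positive outscores them.
    """
    scores = [dot(query, positive)] + [dot(query, n) for n in hard_negatives]
    logits = [s / temperature for s in scores]
    # Numerically stable log-sum-exp over all candidate contexts.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Toy 2-d embeddings (made up for illustration).
query = [1.0, 0.0]
positive = [0.9, 0.1]
hard_negatives = [[0.7, 0.3], [0.2, 0.8]]  # similar-looking but wrong contexts

loss_before = contrastive_loss(query, positive, hard_negatives)
# A positive that scores higher against the query yields a lower loss.
loss_after = contrastive_loss(query, [2.0, 0.0], hard_negatives)
print(loss_before, loss_after)
```

Hard negatives matter because they score close to the positive: they keep the loss, and therefore the gradient, large until the encoders learn to separate them.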

Open Source Commitment

In line with our commitment to open source, we are releasing not only the model weights but also the complete training recipes and associated datasets. This transparency allows the AI community to replicate, build upon, and innovate with our work. See below for links.

Benchmarks

The DocChat models have been evaluated across a variety of benchmarks, and achieve top of the line performance for their model sizes.

| ChatRAG Benchmark | Llama3 Instruct 8B | Command-R-Plus | Nvidia Llama3-ChatQA 1.5 8B | GPT-4-Turbo-2024-04-09 | Cerebras Llama3-DocChat 1.0 8B |
| --- | --- | --- | --- | --- | --- |
| Doc2Dial | 31.33 | 33.51 | 39.33 | 35.35 | 39.19 |
| QuAC | 32.64 | 34.16 | 39.73 | 40.1 | 36 |
| QReCC | 43.4 | 49.77 | 49.03 | 51.46 | 50.27 |
| CoQA | 73.25 | 69.71 | 76.46 | 77.73 | 79.56 |
| DoQA | 30.34 | 40.67 | 49.6 | 41.6 | 48.77 |
| ConvFinQA | 53.15 | 71.21 | 78.46 | 84.16 | 80.13 |
| SQA | 36.6 | 74.07 | 73.28 | 79.98 | 74.19 |
| TopiOCQA | 34.64 | 53.77 | 49.96 | 48.32 | 52.13 |
| HybriDial\* | 40.77 | 46.7 | 65.76 | 47.86 | 64 |
| INSCIT | 32.09 | 35.76 | 30.1 | 33.75 | 32.88 |
| Average (all) | 40.82 | 50.93 | 55.17 | 54.03 | 55.71 |
| Average (exclude HybriDial) | 40.83 | 51.4 | 53.99 | 54.72 | 54.79 |
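The two average rows in the ChatRAG table can be recomputed from the per-dataset scores. The sketch below does this for the GPT-4-Turbo and Cerebras Llama3-DocChat columns; the values are copied from the table, while the dictionary layout and the `average` helper are illustrative assumptions, not part of the DocChat evaluation code.

```python
# Per-dataset ChatRAG scores copied from the table above.
GPT4_TURBO = {
    "Doc2Dial": 35.35, "QuAC": 40.1, "QReCC": 51.46, "CoQA": 77.73,
    "DoQA": 41.6, "ConvFinQA": 84.16, "SQA": 79.98, "TopiOCQA": 48.32,
    "HybriDial": 47.86, "INSCIT": 33.75,
}
DOCCHAT = {
    "Doc2Dial": 39.19, "QuAC": 36, "QReCC": 50.27, "CoQA": 79.56,
    "DoQA": 48.77, "ConvFinQA": 80.13, "SQA": 74.19, "TopiOCQA": 52.13,
    "HybriDial": 64, "INSCIT": 32.88,
}


def average(scores, exclude=()):
    """Unweighted mean over datasets, optionally skipping some of them."""
    kept = [value for name, value in scores.items() if name not in exclude]
    return sum(kept) / len(kept)


for label, scores in (("GPT-4-Turbo", GPT4_TURBO), ("DocChat", DOCCHAT)):
    print(
        label,
        round(average(scores), 2),
        round(average(scores, exclude=("HybriDial",)), 2),
    )
```

Both averages are unweighted means over datasets, so excluding HybriDial simply drops one of the ten terms.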

| Eleuther Eval Harness | Llama3 Instruct 8B | Nvidia Llama3-ChatQA 1.5 8B | Cerebras Llama3-DocChat 1.0 8B |
| --- | --- | --- | --- |
| hellaswag | 57.68 | 61.37 | 61.68 |
| winogrande | 71.98 | 73.95 | 74.11 |
| truthfulqa_mc1 | 36.23 | 28.52 | 29.25 |
| truthfulqa_mc2 | 51.65 | 43.56 | 45.14 |
| mmlu | 63.84 | 60.68 | 62.86 |
| gsm8k | 76.12 | 13.72 | 55.57 |
| arc_easy | 81.61 | 80.56 | 82.03 |
| arc_challenge | 52.99 | 51.02 | 53.92 |
| Average | 61.51 | 51.67 | 58.07 |

| Benchmark | Metric | Facebook Dragon+ | Nvidia Dragon-Multiturn | Cerebras Dragon-DocChat |
| --- | --- | --- | --- | --- |
| Doc2Dial | Recall@1 | 43.95 | 50.11 | 51.54 |
| | Recall@5 | 77.61 | 83.85 | 83.12 |
| | Recall@20 | 92.05 | 95.33 | 95.25 |
| QuAC | Recall@1 | 62.09 | 60.02 | 61.30 |
| | Recall@5 | 86.01 | 86.51 | 87.69 |
| | Recall@20 | 96.48 | 96.60 | 97.25 |
| QReCC | Recall@1 | 49.00 | 49.43 | 55.41 |
| | Recall@5 | 85.14 | 86.6 | 90.11 |
| | Recall@20 | 97.21 | 98.28 | 98.39 |
| INSCIT\* | Recall@1 | 11.13 | 18.35 | 21.65 |
| | Recall@5 | 29.27 | 48.45 | 50.72 |
| | Recall@20 | 49.07 | 66.19 | 72.78 |
| TopiOCQA\* | Recall@1 | 29.19 | 31.34 | 38.19 |
| | Recall@5 | 62.52 | 65.79 | 72.47 |
| | Recall@20 | 83.69 | 84.37 | 87.23 |
| Average\*\* | Avg top 1 | 49.36 | 54.76 | 58.29 |
| | Avg top 5 | 76.30 | 81.50 | 84.19 |

\*Evaluated on a subset of the wikipedia corpus that was available to us. All models use the same evaluation strategy to ensure apples-to-apples comparisons.

\*\* We follow the same convention as in ChatQA, where we compare top-5 and top-20 of TopiOCQA and INSCIT to top-1 and top-5, respectively, of the other datasets, in order to match differences in average context length.
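The averaging convention in the second footnote can be checked against the retriever table. The sketch below recomputes "Avg top 1" and "Avg top 5" by substituting Recall@5 and Recall@20 for INSCIT and TopiOCQA; the recall values are copied from the table, while the data layout and the `shifted_average` helper are illustrative assumptions rather than the repository's evaluation code.

```python
MODELS = ("Facebook Dragon+", "Nvidia Dragon-Multiturn", "Cerebras Dragon-DocChat")

# dataset -> {k: Recall@k for each model, in MODELS order}
RECALL = {
    "Doc2Dial": {1: (43.95, 50.11, 51.54), 5: (77.61, 83.85, 83.12), 20: (92.05, 95.33, 95.25)},
    "QuAC": {1: (62.09, 60.02, 61.30), 5: (86.01, 86.51, 87.69), 20: (96.48, 96.60, 97.25)},
    "QReCC": {1: (49.00, 49.43, 55.41), 5: (85.14, 86.6, 90.11), 20: (97.21, 98.28, 98.39)},
    "INSCIT": {1: (11.13, 18.35, 21.65), 5: (29.27, 48.45, 50.72), 20: (49.07, 66.19, 72.78)},
    "TopiOCQA": {1: (29.19, 31.34, 38.19), 5: (62.52, 65.79, 72.47), 20: (83.69, 84.37, 87.23)},
}

# Datasets whose k is shifted up one step (top-1 -> top-5, top-5 -> top-20).
SHIFTED = {"INSCIT", "TopiOCQA"}
NEXT_K = {1: 5, 5: 20}


def shifted_average(base_k):
    """Mean recall per model, using the next larger k for the shifted datasets."""
    result = {}
    for i, model in enumerate(MODELS):
        values = [
            RECALL[name][NEXT_K[base_k] if name in SHIFTED else base_k][i]
            for name in RECALL
        ]
        result[model] = sum(values) / len(values)
    return result


avg_top1 = shifted_average(1)
avg_top5 = shifted_average(5)
for model in MODELS:
    print(model, avg_top1[model], avg_top5[model])
```

Under this convention the recomputed values agree with the reported averages to within 0.01, and the top-1 gaps between Cerebras Dragon-DocChat and the other two models come out at roughly 8.9 and 3.5 points, matching the improvements quoted in the DocChat Retriever section.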

The Recipe

While ChatQA provided a valuable foundation, we identified several gaps in their released datasets and training recipes. We crafted our final recipe by combining insights from analyzing their model with our own experience in LLM training and dataset curation. Notably, we addressed three challenges: handling unanswerable questions, improving arithmetic performance, and improving entity extraction:

Handling Unanswerable Questions: In our initial attempts, the model struggled with unanswerable questions (i.e. responding “I can’t answer …”). The ChatQA paper notes...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Low stars, but from notable company