Model · IBM (Granite) · published Jan 7, 2026

ibm-granite/granite-3.3-8b-math-prm-v2

task: text-generation · license: apache-2.0 · library: transformers · params: 8.2B · downloads: 58 · likes: 14

Granite-3.3-8B-Math-PRM-v2

Model Summary

Granite 3.3 8B Math PRM v2 is a finetuned version of the 8-billion parameter language model, Granite-3.3-8B-Instruct, built for use as a generative process reward model (PRM) for process supervision in mathematical reasoning. Crucially, this model has only been trained on curated data from sources with permissive licenses, and we release this model under an Apache 2.0 license. This model displays state-of-the-art performance in a best-of-N setup for math tasks.

This model can be used to assess the correctness of each step of a mathematical reasoning process. It shows strong performance on Best-of-N evaluations for a variety of generators on MATH-500, as well as strong error identification performance on both ProcessBench and PRMBench, with state-of-the-art performance in its size class. Although this model was trained for mathematical reasoning tasks, it also shows strong inference scaling performance on code benchmarks such as HumanEval and LCBv5.

  • Developers: Granite Alignment Team, IBM Research
  • Release Date: Jan 1st, 2026
  • License: Apache 2.0

Supported Languages

This model has been finetuned specifically for English; however, the base model supports English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.

Intended Use

Granite 3.3 8B Math PRM v2 is a finetuned version of Granite-3.3-8B-Instruct that performs process supervision on mathematical reasoning by assessing the correctness of each step of a reasoning chain. At inference, the model takes a question and a response that can be broken down into steps, and for each step it determines whether the reasoning chain so far is correct (indicated by generating the single token Y) or incorrect (indicated by generating N). The probability of generating Y can be treated as a numeric reward score in applications such as Best-of-N evaluation.
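As a minimal sketch of how the Y probability could be turned into a reward, the snippet below assumes you have already extracted the raw logits of the Y and N tokens at each step. Renormalizing over only those two tokens, and the aggregation rules (`min`, `prod`, `last`), are illustrative assumptions rather than the method documented for this model; `prob_yes` and `response_reward` are hypothetical helper names.

```python
import math


def prob_yes(logit_y: float, logit_n: float) -> float:
    """P(Y) renormalized over the Y and N tokens only (an assumption)."""
    m = max(logit_y, logit_n)  # subtract the max for numerical stability
    exp_y = math.exp(logit_y - m)
    exp_n = math.exp(logit_n - m)
    return exp_y / (exp_y + exp_n)


def response_reward(step_probs: list[float], how: str = "min") -> float:
    """Collapse per-step P(Y) values into one score for the whole response.

    The aggregation rule is a design choice, not something the model fixes:
    "min" penalizes the weakest step, "prod" compounds every step,
    "last" trusts only the final step.
    """
    if not step_probs:
        raise ValueError("need at least one step score")
    if how == "min":
        return min(step_probs)
    if how == "prod":
        return math.prod(step_probs)
    if how == "last":
        return step_probs[-1]
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical (Y, N) logit pairs for a three-step response.
step_scores = [prob_yes(y, n) for y, n in [(3.0, -1.0), (0.0, 0.0), (2.0, 0.5)]]
score = response_reward(step_scores, how="min")  # the middle step gives 0.5
```

Taking the minimum makes a single doubtful step dominate the score, which suits error detection; the product is stricter on long chains because every step below 1.0 lowers the total.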

Before producing a response, the model expects the user-generated prompt "Is this response correct so far (Y/N)?", which should be added at the end of every step of the reasoning chain.
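The prompt placement described above can be sketched as follows. Only the prompt string comes from the model card; splitting a response on blank lines, prefixing the question to the first step, and the helper names `split_steps` and `annotate_steps` are assumptions made for illustration. The exact chat layout the model expects should be taken from the original model card.

```python
STEP_PROMPT = "Is this response correct so far (Y/N)?"


def split_steps(response: str) -> list[str]:
    """Split a response into steps on blank lines (an assumed convention)."""
    return [part.strip() for part in response.split("\n\n") if part.strip()]


def annotate_steps(question: str, steps: list[str]) -> list[str]:
    """Append the Y/N prompt to the end of every step.

    The question is prefixed to the first step only; this layout is a
    hypothetical choice, not the documented input format.
    """
    turns = []
    for index, step in enumerate(steps):
        prefix = f"{question} " if index == 0 else ""
        turns.append(f"{prefix}{step} {STEP_PROMPT}")
    return turns


response = "2 + 2 = 4.\n\nSo the answer is 4."
turns = annotate_steps("What is 2 + 2?", split_steps(response))
for turn in turns:
    print(turn)
```

Each resulting string would then be sent as a user turn, with the model's Y or N token read back before the next step is appended.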

This model is an update to granite-3.3-8b-lora-math-prm.

Evaluation Results

a. Best-of-N Evaluation on Math-500

We show performance on MATH-500 with inference scaling on generations from granite-4.0-h-small and granite-4.0-h-tiny. Both Best-of-N and Weighted Majority Voting using Granite-3.3-8B-Math-PRM-v2 show strong gains over Majority Voting.
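The three selection strategies compared here can be sketched in a few lines. This is a generic illustration, assuming each sampled response has already been reduced to a final answer string and a single PRM reward; the function names and the toy data are hypothetical, and this is not the evaluation code behind the reported numbers.

```python
from collections import Counter, defaultdict


def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent final answer, ignoring rewards."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: list[str], rewards: list[float]) -> str:
    """Pick the answer of the single highest-reward response."""
    best_index = max(range(len(answers)), key=lambda i: rewards[i])
    return answers[best_index]


def weighted_majority_vote(answers: list[str], rewards: list[float]) -> str:
    """Sum rewards per distinct answer and pick the largest total."""
    totals: dict[str, float] = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        totals[answer] += reward
    return max(totals, key=totals.get)


# Toy data: five sampled responses with made-up PRM rewards.
answers = ["12", "12", "15", "12", "15"]
rewards = [0.2, 0.3, 0.95, 0.1, 0.4]

print(majority_vote(answers))                    # 12
print(best_of_n(answers, rewards))               # 15
print(weighted_majority_vote(answers, rewards))  # 15
```

In the toy data the most common answer has low rewards, so the two reward-aware strategies overrule the plain vote. Weighted voting is less sensitive than Best-of-N to a single overconfident score because it pools evidence across all responses that agree.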

We also compare the Best-of-N performance of available PRMs on MATH-500 using Qwen-2.5-Math-7B-Instruct generations, and show the superior performance of Granite-3.3-8B-Math-PRM-v2 over majority voting and over Best-of-N with similarly sized PRMs:

| Method / N | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|-|-|-|-|-|-|-|-|-|
| Majority Voting | 75.8 | 81.6 | 84.6 | 85.2 | 85.4 | 85.6 | 86.0 | 85.6 |
| Granite-3.3-8B-Math-PRM-v2 | 81.6 | 84.6 | 86.6 | 88.0 | 88.2 | 89.2 | 89.0 | 89.6 |
| Granite-3.3-8B-LoRA-Math-PRM | 81.6 | 84.2 | 84.8 | 86.2 | 86.8 | 87.2 | 88.0 | 87.2 |
| Qwen2.5-Math-PRM-7B | 82.0 | 84.8 | 86.6 | 87.0 | 88.2 | 89.0 | 88.8 | 89.0 |
| MathShepherd-Mistral-7B PRM 7B | 80.8 | 83.0 | 83.8 | 84.8 | 86.2 | 85.2 | 86.0 | 85.2 |
| RLHFLow Llama3.1-8B-PRM-Deepseek-Data | 80.6 | 82.4 | 83.6 | 85.2 | 85.8 | 85.8 | 85.0 | 84.6 |

b. Best-of-N Evaluation on HumanEval and LCBv5

We show performance on two code tasks, HumanEval (Python) and LCBv5, with inference scaling on generations from granite-4.0-h-small. Although this model is finetuned on math data, Granite-3.3-8B-Math-PRM-v2 demonstrates strong gains over Majority Voting with both Best-of-N and Weighted Majority Voting.

c. ProcessBench

As shown above, Granite-3.3-8B-Math-PRM-v2 shows strong performance (top two) on both ProcessBench and PRMBench compared to other models of the same parameter class, indicating a strong capacity for error detection in reasoning tasks.

d. PRMBench: Detailed Results

| Model | Overall | NR. | NCL. | Avg (simplicity) | ES. | SC. | DC. | CI. | Avg (soundness) | PS. | DR. | MS. | Avg (sensitivity) |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Granite-3.3-8B-Math-PRM-v2 | 65.1 | 52.2 | 63.4 | 57.8 | 69.7 | 65.8 | 64.4 | 70.8 | 67.7 | 61.0 | 67.1 | 98.2 | 75.4 |
| Granite-3.3-8B-LoRA-Math-PRM | 64.5 | 50.9 | 61.5 | 56.2 | 69.1 | 66.7 | 64.7 | 70.5 | 67.8 | 59.9 | 65.9 | 98.1 | 74.7 |
| Qwen2.5-Math-PRM-7B | 65.5 | 49.0 | 55.1 | 52.1 | 71.8 | 67.3 | 66.3 | 78.5 | 71.0 | 57.6 | 69.1 | 99.7 | 75.5 |
| Skywork-PRM-7B | 65.1 | 56.4 | 62.8 | 59.6 | 69.4 | 67.1 | 67.7 | 69.9 | 68.5 | 60.9 | 65.8 | 93.2 | 73.7 |
| Skywork-PRM-1.5B | 61.1 | 52.0 | 56.4 | 54.2 | 64.8 | 64.9 | 63.3 | 66.5 | 64.9 | 57.5 | 63.3 | 91.1 | 70.7 |
| ReasonEval-34B | 60.5 | 54.8 | 48.1 | 51.5 | 66.4 | 60.3 | 57.8 | 67.5 | 63.0 | 57.7 | 64.3 | 97.2 | 73.1 |
| ReasonEval-7B | 60.1 | 61.0 | 50.1 | 55.6 | 62.1 | 65.9 | 61.5 | 66.0 | 63.9 | 55.7 | 58.0 | 99.5 | 71.1 |
| RLHFlow-PRM-Mistral-8B | 54.4 | 46.1 | 47.3 | 46.7 | 56.6 | 55.1 | 54.4 | 63.8 | 57.5 | 51.5 | 56.2 | 97.9 | 68.5 |
| RLHFlow-PRM-Deepseek-8B | 54.2 | 46.4 | 48.9 | 47.6 | 55.7 | 55.0 | 53.2 | 66.2 | 57.5 | 49.0 | 55.4 | 99.8 | 68.1 |

…


Notability

notability 3.0/10

Low traction model release