What does this repo signal mean?

Qwen (Alibaba Cloud) published QwenLM/PolyMath (Python). This repository signal surfaces tooling, evaluation, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo QwenLM/PolyMath · language Python · new repo, low traction. onlylabs links this event to 1 captured evidence page and 6 related repo signals, and maps it to "Evals and quality" in the data-business radar.

Qwen (Alibaba Cloud) Repo: QwenLM/PolyMath

Captured source

GitHub · github.com/QwenLM/PolyMath

QwenLM/PolyMath repository metadata

published Apr 25, 2025 · seen 6d · captured 8h · http 200 · method plain

QwenLM/PolyMath

Description: [NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts"

Language: Python

Stars: 44

Forks: 9

Open issues: 4

Created: 2025-04-25T02:58:32Z

Pushed: 2025-05-22T05:00:13Z

Default branch: main

Fork: no

Archived: no

README:

This is the official repository for the paper "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts".

📖 Introduction

PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels, with 9,000 high-quality problem samples. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs.
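The 9,000 figure can be checked against the other numbers the README gives: 18 languages, 4 levels, and 125 problems per language at each level (the last figure comes from the feature list). The short check below uses only those stated values; the count of distinct underlying problems is an inference from the statement that every problem exists in all 18 languages, not a number quoted by the authors.

```python
# Figures quoted in the README text.
NUM_LANGUAGES = 18
NUM_LEVELS = 4
PROBLEMS_PER_LANGUAGE_PER_LEVEL = 125

# Total samples across every language and level.
total_samples = NUM_LANGUAGES * NUM_LEVELS * PROBLEMS_PER_LANGUAGE_PER_LEVEL

# Inferred: each problem is provided in all 18 languages, so the number of
# distinct problems is the per-language count summed over the four levels.
distinct_problems = NUM_LEVELS * PROBLEMS_PER_LANGUAGE_PER_LEVEL

print(total_samples, distinct_problems)
```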

✨ Features

📈 Broad Difficulty Range: PolyMath defines and partitions mathematical difficulty across four levels using two core dimensions: *Thought Depth* and *Knowledge Breadth*, ranging from K-12 to Olympiad and advanced frontier mathematics, with 125 problems per language at each level.

🌍 Language Diversity: Each problem in PolyMath is available in 18 parallel language versions, encompassing over 75% of the world’s native speakers and major language families, ensuring diversity across both high-resource and low-resource languages.

🧑‍🏫 High-Quality Annotation: Each problem translation is calibrated by language experts, avoiding direct use of LLM-generated outputs and ensuring precise term and logical clarity.

🛠️ Data Usage

PolyMath/
├── ar/
│ ├── low.parquet
│ ├── medium.parquet
│ ├── high.parquet
│ └── top.parquet
├── bn/
├── ...
└── zh/

Additionally, all prompts used in the inference process are provided in instruction.py.
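As a rough sketch of how the layout above might be consumed, the helper below builds the expected path of each split. The language codes are the ones listed in the repository's run_eval.sh; that the data directories use the same codes is an assumption. `parquet_path`, `all_parquet_paths`, and `load_split` are hypothetical names, and `load_split` assumes pandas with a parquet engine such as pyarrow is installed. The column names are not documented in this README, so inspect them after loading.

```python
from pathlib import Path

# Language codes and level names as they appear in run_eval.sh.
LANGUAGES = ["en", "zh", "ar", "bn", "de", "es", "fr", "id", "it",
             "ja", "ko", "ms", "pt", "ru", "sw", "te", "th", "vi"]
LEVELS = ["low", "medium", "high", "top"]


def parquet_path(root, lang, level):
    """Return the expected location of one split, e.g. PolyMath/ar/low.parquet."""
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {lang}")
    if level not in LEVELS:
        raise ValueError(f"unknown difficulty level: {level}")
    return Path(root) / lang / f"{level}.parquet"


def all_parquet_paths(root):
    """List every language/level split in a fixed order."""
    return [parquet_path(root, lang, level)
            for lang in LANGUAGES for level in LEVELS]


def load_split(root, lang, level):
    """Load one split as a DataFrame (assumes pandas and a parquet engine)."""
    import pandas as pd  # imported lazily so the path helpers work without it
    return pd.read_parquet(parquet_path(root, lang, level))
```

For example, `load_split("PolyMath", "zh", "top")` would read PolyMath/zh/top.parquet if the data has been downloaded to that location.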

🧪 Evaluation

Environment Preparation

conda create -n polymath python=3.10
conda activate polymath
pip install -r requirements.txt

Output Process

Given that varying inference engines may generate outputs in different formats, we request that you standardize your results into the specified format:

mkdir output
cd output

1. Use /{model_name} as the top-level directory and /{difficulty_level} as the second level.

2. For each language, create a {lang_name}.jsonl file under /{difficulty_level} containing 125 output samples. Each sample should follow this format:

{"idx": 0, ...}
...
{
  "idx": 114, ### unique sample id
  "question": "假设在平面上的一个紧集 $C$ 满足以下条件：对每一个方向，都存在一条该方向上的直线 $l$，使得 $l \\cap C$ 的维数至少为 $\\frac{1}{2}$。那么，$C$ 的最小可能维数是多少？", ### question in corresponding language version
  "answer": "$\\frac{5}{4}$", ### ground truth
  "thinking_pred": "嗯，这个问题看起来有点挑战性，不过让我慢慢想想。题目是说，在平面上有一个紧集C...", ### model's thinking content; leave this field blank for a non-reasoning model
  "answer_pred": "题目要求在平面上的一个紧集 \\( C \\)，满足对于每一个方向，...", ### model's answer content
}
...
{"idx": 124, ...}
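One way to produce and sanity-check such a file is sketched below. It assumes the `###` annotations in the example are explanatory only and that real files contain one plain JSON object per line. `write_outputs` and `validate_file` are hypothetical helpers, not part of the repository; the five field names and the 125-sample count come from the README.

```python
import json
from pathlib import Path

# Field names shown in the README's sample record.
REQUIRED_KEYS = ("idx", "question", "answer", "thinking_pred", "answer_pred")
SAMPLES_PER_FILE = 125


def write_outputs(records, out_root, model_name, level, lang):
    """Write records to {out_root}/{model_name}/{level}/{lang}.jsonl."""
    path = Path(out_root) / model_name / level / f"{lang}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-Latin text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def validate_file(path):
    """Return a list of problems found; an empty list means the file looks fine."""
    with Path(path).open(encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    problems = []
    if len(rows) != SAMPLES_PER_FILE:
        problems.append(f"expected {SAMPLES_PER_FILE} samples, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        missing = [key for key in REQUIRED_KEYS if key not in row]
        if missing:
            problems.append(f"record {number}: missing fields {missing}")
    return problems
```

For a non-reasoning model, `thinking_pred` would be written as an empty string so the key is still present.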

The complete file structure is as follows:

PolyMath/output
├── qwq-32b
│   ├── low
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── medium
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── high
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   └── top
│       ├── ar.jsonl
│       ├── bn.jsonl
│       └── ...
├── deepseek-v3
│   ├── low
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── medium
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── high
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   └── top
│       ├── ar.jsonl
│       ├── bn.jsonl
│       └── ...
└── ... (other models)
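Before running the evaluation it may help to confirm that the tree is complete. The function below is a minimal sketch under the layout shown above; `missing_output_files` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path


def missing_output_files(output_root, models, levels, languages):
    """Return the expected {model}/{level}/{lang}.jsonl paths that do not exist."""
    root = Path(output_root)
    expected = [
        root / model / level / f"{lang}.jsonl"
        for model in models
        for level in levels
        for lang in languages
    ]
    return [path for path in expected if not path.is_file()]
```

Calling `missing_output_files("output", ["qwq-32b"], ["low", "medium", "high", "top"], ["ar", "bn"])` would list whichever of the eight expected files are absent.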

Score Computation

The script /eval/run_eval.py computes accuracy and language consistency. Run run_eval.sh to iterate over your processed output files.

cd ../eval
bash run_eval.sh

run_eval.sh

model_list=(qwq-32b deepseek-v3)
language_list=(en zh ar bn de es fr id it ja ko ms pt ru sw te th vi)
level_list=(low medium high top)

# Evaluate every model, language, and level combination.
for i in "${model_list[@]}"; do
    for j in "${language_list[@]}"; do
        for k in "${level_list[@]}"; do
            python run_eval.py --model "$i" --language "$j" --level "$k"
        done
    done
done

You can customize model_list, language_list, and level_list. Once every level has been evaluated for a given model and language, the benchmark score for that pair is computed automatically.
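The same sweep can be driven from Python if that is more convenient. The sketch below only assumes the three flags shown in run_eval.sh (`--model`, `--language`, `--level`); `build_commands` and `run_all` are hypothetical names, and the `eval` working directory is an assumption based on the `cd ../eval` step.

```python
import itertools
import subprocess
import sys


def build_commands(models, languages, levels):
    """Build one run_eval.py invocation per model/language/level combination.

    itertools.product varies the last argument fastest, which matches the
    nesting order of the loops in run_eval.sh.
    """
    return [
        [sys.executable, "run_eval.py",
         "--model", model, "--language", language, "--level", level]
        for model, language, level in itertools.product(models, languages, levels)
    ]


def run_all(commands, cwd="eval"):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)
```

Separating command construction from execution makes it easy to print or filter the list before anything is run.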

During evaluation, a score file is generated automatically at `/eval/output/{model_name}/score.json`, and all scores are saved there.

📄 Citation

If you use PolyMath in your research or find our work useful, please cite us:

@article{wang2025polymath,
title={PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts},
author={Yiming Wang and Pei Zhang and Jialong Tang and Haoran Wei and Baosong Yang and Rui Wang and Chenshu Sun and Feitong Sun and Jiran Zhang and Junxuan Wu and Qiqian Cang and Yichang Zhang and Fei Huang and Junyang Lin and Fei Huang and Jingren Zhou},
journal={arXiv preprint arXiv:2504.18428},
year={2025},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.18428},
}

Notability

notability 3.0/10

New repo, low traction.