Fork, SiliconFlow, published Jun 1, 2026

siliconflow/LiveCodeBench

forked from LiveCodeBench/LiveCodeBench

Description: Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"

License: MIT

Stars: 0

Forks: 0

Open issues: 1

Created: 2026-06-01T07:15:20Z

Pushed: 2026-06-01T07:40:01Z

Default branch: main

Fork: yes

Parent repository: LiveCodeBench/LiveCodeBench

Archived: no

README:

LiveCodeBench

Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"

🏠 Home Page • 💻 Data • 🏆 Leaderboard • 🔍 Explorer

Introduction

LiveCodeBench provides holistic and contamination-free evaluation of the coding capabilities of LLMs. In particular, LiveCodeBench continuously collects new problems over time from contests across three competition platforms: LeetCode, AtCoder, and CodeForces. Beyond code generation, LiveCodeBench also covers a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and March 2024.

Installation

You can clone the repository using the following command:

git clone https://github.com/LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench

We recommend using uv for managing dependencies. It can be installed in a number of ways.

Verify that uv is installed on your system by running:

uv --version

Once uv has been installed, use it to create a virtual environment for LiveCodeBench and install its dependencies with the following commands:

uv venv --python 3.11
source .venv/bin/activate

uv pip install -e .

Data

We provide a benchmark for different code capability scenarios

Inference and Evaluation

Dataset Versions

Because LiveCodeBench is continuously updated, we provide the following versions of the dataset:

  • release_v1: The initial release of the dataset with problems released between May 2023 and Mar 2024 containing 400 problems.
  • release_v2: The updated release of the dataset with problems released between May 2023 and May 2024 containing 511 problems.
  • release_v3: The updated release of the dataset with problems released between May 2023 and Jul 2024 containing 612 problems.
  • release_v4: The updated release of the dataset with problems released between May 2023 and Sep 2024 containing 713 problems.
  • release_v5: The updated release of the dataset with problems released between May 2023 and Jan 2025 containing 880 problems.
  • release_v6: The updated release of the dataset with problems released between May 2023 and Apr 2025 containing 1055 problems.
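As a rough illustration of how the releases grow, the sketch below copies the end dates and problem counts from the list above and computes how many problems each release adds over the previous one. It assumes each release is a superset of the one before it (all of them start in May 2023); the names `RELEASES` and `new_problems_per_release` are hypothetical and not part of the repository.

```python
# End month and total problem count per release, copied from the list above.
RELEASES = {
    "release_v1": ("Mar 2024", 400),
    "release_v2": ("May 2024", 511),
    "release_v3": ("Jul 2024", 612),
    "release_v4": ("Sep 2024", 713),
    "release_v5": ("Jan 2025", 880),
    "release_v6": ("Apr 2025", 1055),
}


def new_problems_per_release(releases):
    """Return {version: problems added relative to the previous release}.

    Assumes releases are listed in order and are cumulative.
    """
    added = {}
    previous_total = 0
    for version, (_end_month, total) in releases.items():
        added[version] = total - previous_total
        previous_total = total
    return added


if __name__ == "__main__":
    deltas = new_problems_per_release(RELEASES)
    for version, (end_month, total) in RELEASES.items():
        print(f"{version}: through {end_month}, {total} total, +{deltas[version]} new")
```

Under that assumption, the per-release increments sum back to the total of the latest release, which is a quick sanity check on the numbers.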

You can use the --release_version flag to specify the dataset version you wish to use; it defaults to release_latest. Additionally, we have introduced fine-grained release versions such as v1, v2, v1_v3, and v4_v5 for specific versions of the dataset. For example, the following command runs the evaluation on the release_v2 dataset:

python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version release_v2

Code Generation

We use vllm for inference using open models. By default, we use tensor_parallel_size=${num_gpus} to parallelize inference across all available GPUs. It can be configured using the --tensor_parallel_size flag as required.

To run inference, provide a model_name listed in the [./lcb_runner/lm_styles.py](./lcb_runner/lm_styles.py) file. The --scenario flag (here codegeneration) selects the scenario to run.

python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration

Additionally, the --use_cache flag can be used to cache the generated outputs, and the --continue_existing flag can be used to reuse the existing dumped results. If you wish to use a model from a local path, you can additionally provide the --local_model_path flag with the path to the model. We use n=10 and temperature=0.2 for generation. Please check the [./lcb_runner/runner/parser.py](./lcb_runner/runner/parser.py) file for more details on the flags.

For closed API models, --multiprocess flag can be used to parallelize queries to API servers (adjustable according to rate limits).
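When scripting several runs, it can help to assemble the command line programmatically. The following is a minimal sketch that only builds the argument list and does not run anything. The helper `build_command` and the model name `my-model` are hypothetical. The flag names come from the text above, but their exact value shapes (for example, that --use_cache and --continue_existing take no argument) are assumptions that should be checked against lcb_runner/runner/parser.py.

```python
import shlex


def build_command(model_name, scenario="codegeneration", evaluate=False,
                  release_version=None, use_cache=False,
                  continue_existing=False, tensor_parallel_size=None,
                  local_model_path=None):
    """Build an argv list for lcb_runner.runner.main (hypothetical helper)."""
    cmd = ["python", "-m", "lcb_runner.runner.main",
           "--model", model_name, "--scenario", scenario]
    # Assumed to be boolean switches that take no value.
    if evaluate:
        cmd.append("--evaluate")
    if use_cache:
        cmd.append("--use_cache")
    if continue_existing:
        cmd.append("--continue_existing")
    # Assumed to take a single value each.
    if release_version is not None:
        cmd += ["--release_version", release_version]
    if tensor_parallel_size is not None:
        cmd += ["--tensor_parallel_size", str(tensor_parallel_size)]
    if local_model_path is not None:
        cmd += ["--local_model_path", str(local_model_path)]
    return cmd


if __name__ == "__main__":
    # "my-model" is a placeholder, not a real entry in lm_styles.py.
    argv = build_command("my-model", evaluate=True,
                         release_version="release_v2", use_cache=True)
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run` once the flags have been verified against the parser.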

Evaluation

We compute pass@1 and pass@5 metrics for model evaluations. We use a modified version of the checker released with the `apps` benchmark to compute the metrics. In particular, we fixed some unhandled edge cases in the original checker and simplified it based on our collected dataset. To run the evaluation, you can add the --evaluate flag:

python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate
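For readers unfamiliar with pass@k, the sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c is the number that pass all tests. Whether the repository computes the metric in exactly this form is an assumption; the function names and the example counts are illustrative only, with n=10 chosen to match the generation setting mentioned above.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem pass counts."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Illustrative counts for three problems with n=10 samples each.
    counts = [10, 3, 0]
    print("pass@1:", mean_pass_at_k(counts, n=10, k=1))
    print("pass@5:", mean_pass_at_k(counts, n=10, k=5))
```

With k=1 the estimator reduces to the fraction of passing samples, c/n, which is why pass@1 is often described as average single-sample accuracy.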

Note that time limits can cause slight (`

Next, we evaluate models on different code capabilities and find that relative performances of models do change over tasks (left). Thus, it highlights the need for holistic evaluation of LLMs for code.

We also find evidence of possible overfitting on HumanEval (right). Particularly, models that perform well on HumanEval do not necessarily perform well on LiveCodeBench. In the scatterplot above, we find the models get clustered into two groups, shaded in red and green. The red group contains models that perform well on HumanEval but poorly on LiveCodeBench, while the green group contains models that perform well on both.

For more details, please refer to our website at livecodebench.github.io.

Citation

@article{jain2024livecodebench,
author = {Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica},
title = {LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code},
year = {2024},
journal =…
