Repo · InclusionAI (Ant Group) · published Oct 10, 2025

inclusionAI/SWE-CARE

Language: Python

License: Apache-2.0

Stars: 15

Forks: 13

Open issues: 0

Created: 2025-10-10T07:36:39Z

Pushed: 2026-04-21T07:29:00Z

Default branch: main

Fork: no

Archived: no

README:

SWE-CARE: A Comprehensiveness-aware Benchmark for Evaluation of Code Review

A comprehensiveness-aware benchmark for repository-level CR evaluation.

📝 Overview

Code review (CR) is the process in which other developers on a team check the code written by a given developer. It aims to improve code quality and find defects, and it plays an important role in maintaining software quality. Prior research has proposed several CR benchmarks and automatic CR approaches. However, existing benchmarks and approaches lack comprehensiveness and therefore do not reflect real-world review scenarios. The rapid growth of Large Language Model (LLM) capabilities has made comprehensive CR possible. To evaluate LLM performance on comprehensive CR, we construct SWE-CARE, a comprehensiveness-aware CR dataset for Python. The dataset is categorized into nine types, and each instance covers the full code review process and includes repository-level features. Based on the dataset, we design a framework to evaluate LLM performance on CR.

🛠️ Set Up

Follow these steps to set up the project locally.

1. Clone the repository:

git clone https://github.com/inclusionAI/SWE-CARE.git
cd SWE-CARE

2. Install dependencies: This project uses uv for package management. Make sure you have Python 3.10 or higher.

pip install uv
uv sync

Alternatively, you can use pip:

pip install -e .

3. Set up pre-commit hooks (for development): This project uses ruff for linting and formatting. The pre-commit hooks will run these checks automatically before each commit.

pre-commit install

🚀 Quick Start: Evaluation Pipeline

For a streamlined evaluation workflow, use the bootstrap script in scripts/run_eval_pipeline.py:

# Set up environment variables
export OPENAI_API_KEY="your-openai-api-key"
export LLM_EVALUATOR_OPENAI_API_KEY="your-evaluation-api-key"

# Run the complete pipeline (uses default Hugging Face dataset)
python scripts/run_eval_pipeline.py \
--output-dir results/pipeline_output \
--model gpt-4o \
--model-provider openai \
--file-source oracle

# Run with local dataset file
python scripts/run_eval_pipeline.py \
--dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
--output-dir results/pipeline_output \
--model gpt-4o \
--model-provider openai \
--file-source oracle

# Use BM25 retrieval (top 10 files) with skeleton stubs for Python files (optional)
python scripts/run_eval_pipeline.py \
--dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
--output-dir results/pipeline_output \
--model gpt-4o \
--model-provider openai \
--file-source bm25 \
--k 10 \
--retrieval-output-dir results/retrieval_output \
--use-skeleton

This script automates the entire evaluation process: text generation → inference → evaluation. The LLM evaluator defaults to OpenAI o3 (override with --evaluator-model). See [scripts/README.md](scripts/README.md) for detailed usage.
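When sweeping several settings, it can help to assemble the pipeline command programmatically. The following is a minimal sketch, not part of the repository: `build_pipeline_cmd` is a hypothetical helper, and it uses only the flags that appear in the examples above. It assumes that `bm25` runs take `--k` and `--retrieval-output-dir` as in the third example; check scripts/README.md for the authoritative option list.

```python
import shlex
from typing import List, Optional


def build_pipeline_cmd(
    model: str,
    model_provider: str,
    file_source: str,
    output_dir: str = "results/pipeline_output",
    dataset: Optional[str] = None,
    k: Optional[int] = None,
    retrieval_output_dir: Optional[str] = None,
    use_skeleton: bool = False,
) -> List[str]:
    """Assemble an argument list for scripts/run_eval_pipeline.py (hypothetical helper)."""
    cmd = [
        "python", "scripts/run_eval_pipeline.py",
        "--output-dir", output_dir,
        "--model", model,
        "--model-provider", model_provider,
        "--file-source", file_source,
    ]
    if dataset is not None:
        # Omitting this flag falls back to the default Hugging Face dataset.
        cmd += ["--dataset-name-or-path", dataset]
    if file_source == "bm25":
        # Assumption: bm25 runs are given k and a retrieval output directory,
        # mirroring the README example.
        if k is None or retrieval_output_dir is None:
            raise ValueError("bm25 requires k and retrieval_output_dir")
        cmd += ["--k", str(k), "--retrieval-output-dir", retrieval_output_dir]
    if use_skeleton:
        cmd.append("--use-skeleton")
    return cmd


if __name__ == "__main__":
    oracle = build_pipeline_cmd("gpt-4o", "openai", "oracle")
    bm25 = build_pipeline_cmd(
        "gpt-4o", "openai", "bm25",
        k=10, retrieval_output_dir="results/retrieval_output", use_skeleton=True,
    )
    for cmd in (oracle, bm25):
        # Print a shell-safe command line; pass `cmd` to subprocess.run to execute it.
        print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.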

Analysis and Reporting

After running evaluations, you can generate comprehensive analysis reports:

# Generate evaluation report from pipeline results
python scripts/eval_report.py \
--dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
--eval-output-dir results/pipeline_output/evaluation/o3 \
--report-output-file results/evaluation_report.json

# Or use default Hugging Face dataset
python scripts/eval_report.py \
--eval-output-dir results/pipeline_output/evaluation/o3 \
--report-output-file results/evaluation_report.json

This generates detailed statistics including:

  • Model performance across different file source settings (none, oracle, bm25 with k)
  • Performance breakdown by evaluator type (RuleBasedEvaluator, LLMEvaluator)
  • Performance analysis by metadata categories (problem domain, difficulty, estimated review effort)
  • Ranking of all model-setting configurations by average score
  • Identification of missing instances (assigned a score of 0 for fair comparison)

The output is a comprehensive JSON report that can be used for further analysis and visualization.
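To illustrate the ranking and missing-instance rules listed above, here is a small sketch under stated assumptions. It does not read the report produced by `eval_report.py`, whose JSON schema is not shown here. The function name, the configuration labels, and the `{config: {instance_id: score}}` input shape are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def rank_configurations(
    expected_ids: Sequence[str],
    scores: Dict[str, Dict[str, float]],
) -> Tuple[List[Tuple[str, float]], Dict[str, List[str]]]:
    """Rank model-setting configurations by average score.

    Instances a configuration did not produce a score for count as 0, so
    every configuration is averaged over the same set of instances.
    """
    if not expected_ids:
        raise ValueError("expected_ids must not be empty")

    ranking: List[Tuple[str, float]] = []
    missing: Dict[str, List[str]] = {}
    for config, per_instance in scores.items():
        missing[config] = [i for i in expected_ids if i not in per_instance]
        total = sum(per_instance.get(i, 0.0) for i in expected_ids)
        ranking.append((config, total / len(expected_ids)))

    # Highest average first.
    ranking.sort(key=lambda item: item[1], reverse=True)
    return ranking, missing


if __name__ == "__main__":
    ids = ["inst-1", "inst-2"]
    example = {
        "model-a/oracle": {"inst-1": 0.5, "inst-2": 1.0},
        "model-b/none": {"inst-1": 1.0},  # inst-2 is missing and scores 0
    }
    ranking, missing = rank_configurations(ids, example)
    print(ranking)
    print(missing)
```

Averaging over the full expected set, rather than over the instances that happen to be present, prevents a configuration from ranking higher simply because it skipped hard instances.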

📊 Data Collection

The data collection process involves several steps to gather and process data from GitHub. The main scripts for this process are located in src/swe_care/collect.

Here's an example of the command-line usage for each step:

1. Get Top Repositories: Find the most starred repositories for a given language.

python -m swe_care.collect get_top_repos \
--language "Python" \
--top-n 100 \
--output-dir "results/top_repos" \
--tokens "your_github_pat"
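For orientation, one way to find the most starred repositories for a language is GitHub's REST search endpoint. The sketch below only builds the request URL. It is an assumption about how such a query could look, not a description of what `get_top_repos` does internally, and `top_repos_search_url` is a hypothetical name.

```python
from urllib.parse import urlencode


def top_repos_search_url(language: str, top_n: int, page: int = 1) -> str:
    """Build a GitHub search URL for the most starred repos of a language."""
    # The search API returns at most 100 results per page.
    if not 1 <= top_n <= 100:
        raise ValueError("top_n must be between 1 and 100 per page")
    params = {
        "q": f"language:{language}",
        "sort": "stars",
        "order": "desc",
        "per_page": top_n,
        "page": page,
    }
    return "https://api.github.com/search/repositories?" + urlencode(params)


if __name__ == "__main__":
    print(top_repos_search_url("Python", 100))
```

A personal access token, such as the one passed via `--tokens`, would typically be sent in the `Authorization` header to raise the rate limit.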

2. Get Pull Request Data: Fetch PR data from a specific repository using the GitHub GraphQL API.

python -m swe_care.collect get_graphql_prs_data \
--repo "/" \
--output-dir "results/graphql_prs_data" \
--tokens "your_github_pat" \
--max-number 20

3. Classify PRs Data: Analyze and classify PR data by evaluating commits and labeling review comments.

Single file processing:

python -m swe_care.collect classify_prs_data \
--graphql-prs-data-file "results/graphql_prs_data/___graphql_prs_data.jsonl" \
--output-dir "./results/classify_prs_data" \
--tokens "your_github_pat"

Batch processing (multiple repositories):

python -m swe_care.collect classify_prs_data \
--graphql-prs-data-file "results/graphql_prs_data/" \
--output-dir "./results/classify_prs_data" \
--tokens "your_github_pat" \
--jobs 4

This step combines two important analyses:

  • Commit Evaluation: Uses heuristic rules to score commits based on quality indicators (message clarity, size, review activity, etc.)
  • Review Comment Classification: Extracts review comments and labels each one according to whether the lines it references were actually changed in the merged commit and whether its review thread is resolved, outdated, or collapsed.
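The review comment labels described above can be sketched as follows. This is an illustrative approximation, not the repository's implementation: the `ReviewThread` dataclass, its field names, and the `changed_lines` mapping (file path to the set of line numbers modified by the merged commit) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class ReviewThread:
    """Hypothetical, simplified view of a PR review thread."""
    path: str
    start_line: int
    end_line: int
    is_resolved: bool = False
    is_outdated: bool = False
    is_collapsed: bool = False


def label_review_thread(
    thread: ReviewThread,
    changed_lines: Dict[str, Set[int]],
) -> Dict[str, bool]:
    """Return boolean labels for one review thread."""
    touched = changed_lines.get(thread.path, set())
    referenced = range(thread.start_line, thread.end_line + 1)
    return {
        # True if any line the comment points at was modified by the merged commit.
        "referenced_line_changed": any(line in touched for line in referenced),
        "is_resolved": thread.is_resolved,
        "is_outdated": thread.is_outdated,
        "is_collapsed": thread.is_collapsed,
    }


if __name__ == "__main__":
    thread = ReviewThread("src/app.py", start_line=10, end_line=12, is_resolved=True)
    labels = label_review_thread(thread, {"src/app.py": {12, 40}})
    print(labels)
```

Keeping the labels as separate booleans, rather than collapsing them into a single verdict, leaves the decision about which comments count as acted upon to a later step.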

4. Build Code Review Dataset: Build the final dataset for the code review task. This step requires an LLM to classify metadata such as problem domain, difficulty, and review effort for each task instance.

Single file processing:

# Example with OpenAI GPT-4o
export OPENAI_API_KEY=
python -m swe_care.collect build_code_review_dataset \
--graphql-prs-data-file…
