Repo · Microsoft · published Sep 29, 2025

microsoft/BC-Bench

Jupyter Notebook


Description: Inspired by SWE-Bench, for Business Central (AL) ecosystem.

Language: Jupyter Notebook

License: MIT

Stars: 36

Forks: 15

Open issues: 9

Created: 2025-09-29T07:40:02Z

Pushed: 2026-06-11T00:06:20Z

Default branch: main

Fork: no

Archived: no

README:

BC-Bench

[Dataset Validation and Verification](https://github.com/microsoft/BC-Bench/actions/workflows/dataset-validation.yml) · [CI](https://github.com/microsoft/BC-Bench/actions/workflows/CI.yml)

A benchmark for evaluating coding agents on real-world Business Central (AL) development tasks, inspired by SWE-Bench.

Purpose

BC-Bench provides a reproducible evaluation framework for coding agents working on real-world Business Central development tasks:

  • Measure performance of different models on authentic AL issues
  • Quantify the impact of tooling changes (MCP servers, custom instructions, custom agents, etc.)
  • Track progress with transparent, comparable metrics over time
  • Rapidly iterate on agent configurations and setups
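To make "comparable metrics" concrete, here is a minimal sketch of a per-configuration resolved rate, the headline number SWE-Bench-style benchmarks usually report. This README does not describe BC-Bench's actual metrics or result format, so the tuple layout, the function name `resolved_rate_by_config`, and the configuration names are all assumptions made for illustration.

```python
from collections import defaultdict


def resolved_rate_by_config(results):
    """Return the fraction of resolved tasks per configuration.

    `results` is an iterable of (config_name, instance_id, resolved) tuples.
    This layout is hypothetical, not BC-Bench's real output format.
    """
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for config, _instance_id, ok in results:
        totals[config] += 1
        resolved[config] += int(ok)
    return {config: resolved[config] / totals[config] for config in totals}


# Made-up runs: the same two tasks evaluated under two agent setups.
runs = [
    ("baseline", "task-1", True),
    ("baseline", "task-2", False),
    ("with-mcp", "task-1", True),
    ("with-mcp", "task-2", True),
]
rates = resolved_rate_by_config(runs)
print(rates)
```

Running every configuration against the same task set is what makes the rates directly comparable between setups.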

Dataset

We follow the SWE-Bench schema with BC-specific adjustments:

  • environment_setup_commit and version are combined into a single environment_setup_version field
  • project_paths is added to enumerate the AL project roots touched by the fix
  • problem_statement and hints_text are not included in the jsonl file; they are stored under [problemstatement](/dataset/problemstatement/) so that repro steps can include screenshots
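The adjustments above can be sketched as a small parser for one jsonl row. Only environment_setup_version and project_paths are named in this README; the other fields (`instance_id`, `repo`, `base_commit`) are assumed to carry over from the SWE-Bench schema, and the class name, function name, and sample values are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class BenchEntry:
    """One dataset row. Fields other than the two BC-specific ones are assumed from SWE-Bench."""
    instance_id: str
    repo: str
    base_commit: str
    # Stands in for SWE-Bench's separate environment_setup_commit and version.
    environment_setup_version: str
    # AL project roots touched by the fix.
    project_paths: list[str] = field(default_factory=list)


# Fields expected to be absent from the jsonl: the two stored under
# dataset/problemstatement/, plus the two merged into environment_setup_version
# (the latter pair being absent is an assumption).
EXCLUDED_FIELDS = {"problem_statement", "hints_text", "environment_setup_commit", "version"}


def parse_entry(line: str) -> BenchEntry:
    """Parse one jsonl line and reject fields the adjusted schema leaves out."""
    raw = json.loads(line)
    unexpected = EXCLUDED_FIELDS.intersection(raw)
    if unexpected:
        raise ValueError(f"fields not expected in the jsonl: {sorted(unexpected)}")
    return BenchEntry(
        instance_id=raw["instance_id"],
        repo=raw["repo"],
        base_commit=raw["base_commit"],
        environment_setup_version=raw["environment_setup_version"],
        project_paths=list(raw.get("project_paths", [])),
    )


# Hypothetical row; every value is invented for illustration.
sample_line = json.dumps({
    "instance_id": "example__repo-123",
    "repo": "example/repo",
    "base_commit": "0123abc",
    "environment_setup_version": "26.0",
    "project_paths": ["src/Apps/Example/App", "src/Apps/Example/Test"],
})
entry = parse_entry(sample_line)
print(entry.environment_setup_version, entry.project_paths)
```

Rejecting the excluded fields at parse time is one way to catch rows copied over from a plain SWE-Bench export without conversion.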

Agents Under Evaluation

GitHub Copilot CLI

The GitHub Copilot CLI supports MCP servers, tools, and agent mode. It closely mirrors real developer workflows (in both VS Code and the Coding Agent), which makes it an ideal candidate for evaluating automated workflows.

Claude Code

Claude Code is Anthropic's agentic coding tool. It supports MCP servers, custom system prompts, and agent mode. BC-Bench integrates with Claude Code using the same shared configuration as Copilot.

Getting Started

BC-Bench is open source, and you're welcome to fork and adapt it for your own use. We are not accepting external contributions in this repository at this time. You can run evaluations locally and replace the dataset under dataset/ with tasks from your own codebase.

Documentation map

  • [CONTRIBUTING.md](CONTRIBUTING.md) — fork setup, repo layout, versioning, day-to-day maintainer ops
  • [EXPERIMENT.md](EXPERIMENT.md) — run an experiment (toggle instructions / skills / agents / MCP / model) against an existing category
  • [CATEGORIES.md](CATEGORIES.md) — add a new evaluation category (e.g. code-review) alongside bug-fix / test-generation
