Repo · MiniMax · published Nov 10, 2025

MiniMax-AI/MiniMax-Provider-Verifier

Python

Description: MiniMax-Provider-Verifier offers a rigorous, vendor-agnostic way to verify whether third-party deployments of the MiniMax M2 model are correct and reliable.

Language: Python

License: MIT

Stars: 46

Forks: 9

Open issues: 1

Created: 2025-11-10T09:07:34Z

Pushed: 2026-06-06T12:59:58Z

Default branch: main

Fork: no

Archived: no

README:

MiniMax-Provider-Verifier

[English](README.md) | [中文](README_CN.md)

MiniMax-Provider-Verifier offers a rigorous, vendor-agnostic way to verify whether third-party deployments of the MiniMax M2 model are correct and reliable. Since its open-source release, M2 has been widely adopted and integrated into production services. To keep the M2 experience efficient and high quality for these users, and in line with our vision of "Intelligence with Everyone", this toolkit offers an objective, reproducible standard for validating model behavior.

Evaluation Metrics

We evaluate multiple dimensions of vendor deployments, including tool-calling behavior, schema correctness, and system stability (e.g., detecting potential misconfigurations like incorrect top-k settings).

The primary metrics are:

  • Query-Success-Rate: Measures the probability that a provider eventually returns a valid response when allowed up to max_retry=10 attempts.
      • query_success_rate = successful_query_count / total_query_count
  • ToolCalls-Match-Rate: Measures how well the model's decision on whether to trigger a tool call matches the expected labels. Each test case is annotated with expected_tool_call (whether a tool call is expected), and this metric is the proportion of cases where the actual result matches the expected result.
      • tool_calls_match_rate = (tool_calls_finish_tool_calls + stop_finish_stop) / expected_tool_call_total_count
      • Confusion Matrix Statistics:
          • tool_calls_finish_tool_calls: expected tool_call, actual tool_call (TP)
          • tool_calls_finish_stop: expected tool_call, actual stop (FN)
          • stop_finish_tool_calls: expected stop, actual tool_call (FP)
          • stop_finish_stop: expected stop, actual stop (TN)
  • ToolCalls-Schema-Accuracy: Measures how often tool-call payloads are correct (e.g., function name and arguments meet the expected schema), conditional on a tool call being triggered.
      • schema_accuracy = tool_calls_successful_count / tool_calls_finish_tool_calls
  • ToolCalls-Trigger Similarity: Measures how closely a third-party deployment's tool-call triggering behavior matches the official MiniMax deployment, using the F1 score with the official results as the gold standard.
      • precision = TP / (TP + FP)
      • recall = TP / (TP + FN)
      • trigger_similarity = 2 * precision * recall / (precision + recall)
  • Error-Only-Reasoning-Rate: Detects a specific error pattern in which the model outputs only Chain-of-Thought reasoning, with neither valid content nor the required tool calls. The presence of this pattern strongly indicates a deployment issue.
      • error_only_reasoning_rate = error_only_reasoning_count / error_only_reasoning_checked_count
  • Language-Following-Success-Rate: Checks whether the model follows language requirements in minor (less widely used) language scenarios; this is sensitive to top-k and related decoding parameters.
      • language_following_success_rate = language_following_valid_count / language_following_checked_count
  • Scenario-Check-Pass-Rate: Validates model behavior in scenario-specific checks, such as whether the model can correctly recall the original parameter order from tool definitions. This metric is sensitive to providers that reorder JSON object keys (e.g., alphabetical sorting of parameters.properties), which can degrade the model's schema comprehension.
      • scenario_check_pass_rate = scenario_check_valid_count / scenario_check_checked_count
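The tool-call formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the verifier's own code: the `ConfusionCounts` class and the helper names are hypothetical, the sample counts are made up, and the match-rate denominator is assumed to cover every labeled case (TP + FN + FP + TN). For trigger similarity, TP, FP, and FN are counted against the official deployment's results rather than the expected labels, so that function takes its own counts.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Hypothetical container; field names follow the README's confusion matrix."""
    tool_calls_finish_tool_calls: int  # expected tool_call, actual tool_call (TP)
    tool_calls_finish_stop: int        # expected tool_call, actual stop (FN)
    stop_finish_tool_calls: int        # expected stop, actual tool_call (FP)
    stop_finish_stop: int              # expected stop, actual stop (TN)


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def tool_calls_match_rate(c: ConfusionCounts) -> float:
    # Assumption: the denominator is the number of all labeled cases.
    total = (c.tool_calls_finish_tool_calls + c.tool_calls_finish_stop
             + c.stop_finish_tool_calls + c.stop_finish_stop)
    return safe_div(c.tool_calls_finish_tool_calls + c.stop_finish_stop, total)


def schema_accuracy(tool_calls_successful_count: int, c: ConfusionCounts) -> float:
    # Conditional on a tool call being both expected and triggered (TP).
    return safe_div(tool_calls_successful_count, c.tool_calls_finish_tool_calls)


def trigger_similarity(tp: int, fp: int, fn: int) -> float:
    # F1 score; here tp/fp/fn are measured against the official deployment.
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return safe_div(2 * precision * recall, precision + recall)


# Made-up counts, for illustration only.
counts = ConfusionCounts(
    tool_calls_finish_tool_calls=80,
    tool_calls_finish_stop=5,
    stop_finish_tool_calls=3,
    stop_finish_stop=12,
)
print(f"match rate:         {tool_calls_match_rate(counts):.2%}")
print(f"schema accuracy:    {schema_accuracy(76, counts):.2%}")
print(f"trigger similarity: {trigger_similarity(tp=80, fp=3, fn=5):.2%}")
```

Returning 0.0 on an empty denominator is a design choice of this sketch; a real report might prefer to show "-" for a metric that was not checked.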

Evaluation Results

The evaluation results below are computed using our initial release of test prompts, each executed 10 times per provider, with all metrics reported as the mean over the 10-run distribution. As a baseline, minimax represents the performance of our official MiniMax Open Platform deployment, providing a reference point for interpreting other providers' results.
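As a rough illustration of the averaging described above, the sketch below takes one metrics dictionary per run and reports the per-metric mean. The function name `mean_metrics`, the dictionary layout, and the sample values are assumptions made for this example; the verifier's actual aggregation code may differ.

```python
from statistics import mean


def mean_metrics(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over the runs that report it."""
    names = sorted({name for run in runs for name in run})
    return {
        name: mean(run[name] for run in runs if name in run)
        for name in names
    }


# Three made-up runs; the README's setup uses 10 runs per provider.
runs = [
    {"query_success_rate": 1.0, "tool_calls_match_rate": 0.98},
    {"query_success_rate": 1.0, "tool_calls_match_rate": 0.99},
    {"query_success_rate": 1.0, "tool_calls_match_rate": 1.00},
]
averaged = mean_metrics(runs)
for name, value in averaged.items():
    print(f"{name}: {value:.2%}")
```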

MiniMax-M3 Model – June 2026 Data

| Metric | Query-Success-Rate | ToolCalls-Match-Rate | ToolCalls-Schema-Accuracy | Error-Only-Reasoning-Rate | Language-Following-Success-Rate | Scenario-Check-Pass-Rate |
|--------|--------------------|----------------------|---------------------------|---------------------------|---------------------------------|--------------------------|
| MiniMax-M3 | 100.00% | 98.80% | 98.93% | 0.00% | 100.00% | 100.00% |

MiniMax-M2.5/M2.7 Model – May 2026 Data

| Metric | Query-Success-Rate | ToolCalls-Match-Rate | ToolCalls-Schema-Accuracy | Error-Only-Reasoning-Rate | Language-Following-Success-Rate | Scenario-Check-Pass-Rate |
|--------|--------------------|----------------------|---------------------------|---------------------------|---------------------------------|--------------------------|
| MiniMax-M2.5 | 100% | 98.30% | 98.57% | 0% | 85% | 100% |
| MiniMax-M2.7 | 100% | 98.80% | 99.76% | 0% | 75% | 100% |

MiniMax-M2.5/M2.7 Model – April 2026 Data (After Metrics Revision)

| Metric | Query-Success-Rate | ToolCalls-Match-Rate | ToolCalls-Schema-Accuracy | Error-Only-Reasoning-Rate | Language-Following-Success-Rate | Scenario-Check-Pass-Rate |
|--------|--------------------|----------------------|---------------------------|---------------------------|---------------------------------|--------------------------|
| MiniMax-M2.5 | 100% | 99.29% | 95.59% | 0% | 80% | - |
| MiniMax-M2.7 | 100% | 98.50% | 99.64% | 0% | 90% | 90% |

MiniMax-M2.5 Model – Feb 2026 Data

| Metric | Query-Success-Rate | Finish-ToolCalls-Rate | ToolCalls-Trigger Similarity | ToolCalls-Accuracy | Response Success Rate - Not Only Reasoning | Language-Following-Success-Rate |
|--------|--------------------|-----------------------|------------------------------|--------------------|--------------------------------------------|---------------------------------|
| minimax-m2.5 | 100% | 84.75% | - | 97.26% | 100% | 90% |
| openRouter-minimax-fp8 | 100% | 84.55% | 98.98% | 97.25% | 100% | 80% |
| openRouter-minimax-highspeed | 100% | 84.14% |…

Notability

notability 3.0/10

Low stars, routine new repo