Repo: Amazon (Nova), published Nov 26, 2025

amazon-science/sted

Language: HTML

License: Apache-2.0

Stars: 3

Forks: 3

Open issues: 6

Created: 2025-11-26T02:33:02Z

Pushed: 2026-06-06T00:48:05Z

Default branch: main

Fork: no

Archived: no

README:

STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

A comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. This framework combines STED (Semantic Tree Edit Distance), a novel similarity metric that balances semantic flexibility with structural strictness, with a consistency scoring framework that aggregates multiple STED measurements to quantify output reliability.

> 📄 Paper: [STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability](docs/STED_and_Consistency_Scoring.pdf) - Accepted at NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling

Table of Contents

  • [Overview](#overview)
  • [Key Features](#key-features)
  • [Installation](#installation)
  • [Quick Start](#quick-start)
  • [Dataset](#dataset)
  • [STED Effectiveness Verification](#sted-effectiveness-verification)
  • [LLM Consistency Benchmarking](#llm-consistency-benchmarking)
  • [Key Components](#key-components)
  • [Results and Findings](#results-and-findings)
  • [Contributing](#contributing)
  • [Citation](#citation)

Overview

Large Language Models (LLMs) are increasingly deployed for structured data generation tasks, yet their output consistency remains a critical challenge for production applications. This framework addresses this challenge through two key contributions:

1. STED (Semantic Tree Edit Distance): A novel similarity metric that balances semantic flexibility with structural strictness when comparing JSON outputs
2. Consistency Scoring Framework: Aggregates multiple STED measurements across repeated generations to quantify output reliability

Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, STED achieves superior performance (0.86–0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff.
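To make the consistency-scoring idea concrete, here is a minimal sketch of aggregating pairwise similarities over repeated generations. It rests on two assumptions: the aggregation is a plain mean over all unordered pairs (this README does not spell out the framework's exact formula), and the similarity function is a stand-in, `leaf_jaccard`, that compares exact (path, value) leaves. It is not STED and does no semantic matching; all function names here are hypothetical.

```python
from itertools import combinations
from statistics import mean


def flatten(node, path=()):
    """Yield (path, leaf_value) pairs for a parsed JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, path + (index,))
    else:
        yield path, node


def leaf_jaccard(a, b):
    """Stand-in similarity: Jaccard overlap of exact (path, value) leaves. NOT STED."""
    leaves_a, leaves_b = set(flatten(a)), set(flatten(b))
    if not leaves_a and not leaves_b:
        return 1.0
    return len(leaves_a & leaves_b) / len(leaves_a | leaves_b)


def mean_pairwise_similarity(outputs, similarity=leaf_jaccard):
    """Assumed aggregation: average similarity over all unordered pairs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree with
    return mean(similarity(a, b) for a, b in pairs)


runs = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Alice", "age": 25, "location": "New York City"},
]
score = mean_pairwise_similarity(runs)
print(f"{score:.4f}")  # → 0.5000
```

The exact-match stand-in scores every pair at 0.5 because it treats "New York" and "NYC" as unrelated and `city` and `location` as different fields. A semantics-aware metric is meant to close exactly this gap, which is why the `similarity` argument is left pluggable.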

Key Features

  • Semantic Tree Edit Distance (STED): Advanced similarity calculation combining structural and semantic analysis
  • Multiple Variation Types: Support for schema, expression, and semantic variations
  • LLM Benchmarking: Comprehensive evaluation of different LLMs across temperature settings
  • MCP Server Support: Model Context Protocol server for real-time consistency evaluation in agentic systems
  • Synthetic Dataset Generation: Automated generation of variation datasets for evaluation
  • Visualization Tools: Rich plotting and analysis capabilities for results interpretation

Installation

# Clone the repository
git clone https://github.com/amazon-science/sted.git
cd sted

# Install the library
pip install -e .

# Or with uv
uv pip install -e .

For development:

pip install -e ".[dev]"
# Or: uv sync

Credentials Setup

AWS Credentials (for Bedrock embedding models):

aws configure
# Or set: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION

OpenAI API Key (optional, for OpenAI model evaluation):

export OPENAI_API_KEY="<your-api-key>"

Troubleshooting

NumPy/PyTorch Compatibility Error

If you see _ARRAY_API not found or "module compiled using NumPy 1.x cannot be run in NumPy 2.x":

# Option 1: Downgrade NumPy (if using PyTorch <2.4)
pip install "numpy<2"

# Option 2: Upgrade PyTorch
pip install "torch>=2.4"

Quick Start

Basic Similarity Calculation

from sted.semantic_json_tree_consistency import SemanticJsonTreeConsistencyEvaluator

# Initialize evaluator
evaluator = SemanticJsonTreeConsistencyEvaluator(
    model_id='amazon.titan-embed-text-v2:0'
)

# Compare JSON structures
json1 = {'name': 'John', 'age': 30, 'city': 'New York'}
json2 = {'name': 'John', 'age': 30, 'location': 'NYC'}

# Calculate similarity
similarity = evaluator.calculate_tree_edit_distance_opt(
    json1, json2,
    variation_type="combined"
)
print(f"Similarity: {similarity:.4f}") # Output: 0.8650

Batch Consistency Evaluation

from sted.semantic_json_tree_consistency import SemanticJsonTreeConsistencyEvaluator
from sted.structural_consistency_analyzer import StructuralConsistencyAnalyzer

evaluator = SemanticJsonTreeConsistencyEvaluator(model_id='amazon.titan-embed-text-v2:0')
analyzer = StructuralConsistencyAnalyzer(evaluator)

# Multiple LLM outputs for the same prompt
json_outputs = [
    {'name': 'Alice', 'age': 25, 'city': 'New York'},
    {'name': 'Alice', 'age': 25, 'city': 'NYC'},
    {'name': 'Alice', 'age': 25, 'location': 'New York City'},
]

result = analyzer.evaluate_structural_consistency(
    json_outputs, method_name="ted", variation_type="combined"
)
print(f"Consistency: {result['consistency_metrics']['consistency_coefficient']:.4f}")

For more examples, see examples/basic_usage.py and the [Library Usage Guide](./docs/guides/LIBRARY_USAGE.md).

Agent Consistency Evaluation (high-level API)

AgentConsistencyEvaluator is a production-ready wrapper that evaluates an LLM agent's run-to-run consistency on a set of prompts:

from sted import AgentConsistencyEvaluator

# Either point at a live agent...
evaluator = AgentConsistencyEvaluator(n_workers=4)

def my_agent(prompt: str) -> dict:
    return llm_call(prompt)  # any JSON-returning LLM call

report = evaluator.evaluate(
    agent_fn=my_agent,
    prompts=["Get weather in Seattle", "Book a flight to NYC"],
    n_runs=10,
)

# ...or score pre-collected production logs without re-running the agent
report = evaluator.evaluate_outputs({
    "weather_query": [output_run_1, output_run_2, ...],
    "flight_query": [...],
})

print(report.summary())
# Agent Consistency Report
# ============================================================
# Prompts evaluated: 2
# Runs per prompt: 10
#
# Mean consistency: 0.939 (high)
# Mean validity: 1.000
# Mean c_adj: 0.939 (deployment-aware)
#
# Least consistent prompts:
# c_mean=0.883 r_v=1.00 | Book a flight to NYC
# c_mean=0.932 r_v=1.00 | Get weather in Seattle
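When scoring pre-collected logs, the outputs first have to be grouped per prompt into the mapping that `evaluate_outputs` takes in the example above. A minimal sketch of that grouping step follows. The log record shape, the field names `prompt_id` and `output`, the sample values, and the helper name are hypothetical and would need adapting to a real logging format.

```python
from collections import defaultdict


def group_outputs_by_prompt(records, key_field="prompt_id", output_field="output"):
    """Group log records into {prompt key: [output, ...]}, keeping log order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key_field]].append(record[output_field])
    return dict(grouped)


# Hypothetical log records; real logs will have a different shape.
logs = [
    {"prompt_id": "weather_query", "output": {"city": "Seattle", "temp_c": 12}},
    {"prompt_id": "flight_query", "output": {"dest": "NYC", "status": "booked"}},
    {"prompt_id": "weather_query", "output": {"city": "Seattle", "temp_c": 13}},
]
outputs_by_prompt = group_outputs_by_prompt(logs)
print(outputs_by_prompt["weather_query"])
```

The resulting dictionary has one key per prompt and one list entry per run, so it could be passed to `evaluator.evaluate_outputs(outputs_by_prompt)`.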

The evaluator separates validity ($r_v$ — fraction of runs that produced parseable output) from consistency ($c_\text{mean}$ — pairwise STED on the parseable subset), matching the deployment-aware framing of the STED paper.
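The split between validity and consistency can be illustrated with a small sketch. It assumes that "parseable" means the raw output loads as JSON, and it uses a placeholder `exact_match` similarity where the library would use STED; both the helper names and that choice are assumptions. The formula for `c_adj` is not given in this README, so it is deliberately left out.

```python
import json
from itertools import combinations
from statistics import mean


def exact_match(a, b):
    """Placeholder similarity (1.0 if equal, else 0.0); swap in a STED scorer."""
    return 1.0 if a == b else 0.0


def validity_and_consistency(raw_outputs, similarity=exact_match):
    """Return (r_v, c_mean) for the raw string outputs of one prompt."""
    parsed = []
    for text in raw_outputs:
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError:
            continue  # counts against validity, excluded from consistency
    r_v = len(parsed) / len(raw_outputs) if raw_outputs else 0.0
    pairs = list(combinations(parsed, 2))
    # With fewer than two parseable runs there is no pair to compare.
    c_mean = mean(similarity(a, b) for a, b in pairs) if pairs else None
    return r_v, c_mean


raw = ['{"temp": 12}', '{"temp": 12}', '{"temp": 13}', "Sorry, I cannot help."]
r_v, c_mean = validity_and_consistency(raw)
print(f"r_v={r_v:.2f} c_mean={c_mean:.3f}")  # → r_v=0.75 c_mean=0.333
```

Keeping the two numbers separate means an agent that fails to produce output 25% of the time is not confused with one that produces output every time but disagrees with itself.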

Performance

Benchmark setup:

  • Workload: 200 prompts × 10 runs/prompt = 2,000 LLM outputs,…
