Repo · Amazon (Nova) · published Mar 9, 2026

amazon-science/owen-shapley-policy-optimization

Python

Captured source

Language: Python

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 21

Created: 2026-03-09T21:51:27Z

Pushed: 2026-06-05T23:40:59Z

Default branch: main

Fork: no

Archived: no

README:

Owen-Shapley Policy Optimization (OSPO)

This repository implements Owen-Shapley Policy Optimization (OSPO) for training large language models on the Amazon ESCI (Shopping Queries) dataset. OSPO addresses the credit assignment gap in reinforcement learning for search query generation by redistributing sequence-level rewards based on tokens' marginal contributions to retrieval outcomes.

Project Description/Abstract

Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards. These obscure which tokens actually contribute to high-quality outputs, creating a credit assignment gap. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, which is a reasoning pattern rarely seen during pretraining but commonly required in deployment. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' *marginal* contributions to outcomes. OSPO transforms task feedback into potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy—all without parametric value models. By forming coalitions of semantically coherent units (e.g., phrases describing product attributes or sentences capturing preferences), OSPO identifies which response parts drive performance. Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines and notable test-time robustness to out-of-distribution retrievers unseen during training.
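The idea of crediting response segments by their marginal contribution can be illustrated with a small sketch. This is not the repository's implementation: it computes exact Shapley values over segments (the coalition-level part of an Owen-style attribution) and then splits one sequence-level advantage in proportion to them. The function names, the toy reward, and the example segments are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List, Sequence


def segment_shapley(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley value of each of n segments under the set function `value`."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability weight of a coalition of this size preceding segment i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of segment i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def redistribute(advantage: float, phi: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Split one sequence-level advantage across segments in proportion to |phi|."""
    total = sum(abs(p) for p in phi)
    if total < eps:
        # No segment stands out: fall back to a uniform split.
        return [advantage / len(phi)] * len(phi)
    return [advantage * abs(p) / total for p in phi]


# Hypothetical example: three segments of a generated search query.
segments = ["wireless", "noise cancelling", "headphones"]
relevant = {"wireless", "headphones"}  # assumed attributes that drive retrieval


def toy_reward(coalition: FrozenSet[int]) -> float:
    """Toy stand-in for a retrieval reward: fraction of relevant attributes kept."""
    kept = {segments[i] for i in coalition}
    return len(kept & relevant) / len(relevant)


phi = segment_shapley(len(segments), toy_reward)
credit = redistribute(2.0, phi)  # 2.0 is an arbitrary sequence-level advantage
for name, p, c in zip(segments, phi, credit):
    print(f"{name!r}: shapley={p:.3f} credit={c:.3f}")
```

Exact enumeration costs on the order of 2^n reward evaluations for n segments, which is one reason to attribute over a handful of coherent segments rather than individual tokens.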

---

Installation

Requirements

  • Python 3.10+
  • CUDA-capable GPU(s)

Setup

1. Clone the repository:

git clone
cd LLM-Seq-Shapley-Owen-PO

2. Install dependencies:

pip install -r requirements.txt

Key dependencies include:

  • transformers, accelerate, trl (LLM training)
  • sentence-transformers (embedding generation)
  • faiss-gpu (vector search)
  • datasets (HuggingFace datasets)
  • torch, numpy, pandas
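After installing, a quick check that the key dependencies are importable can save a failed training launch. The sketch below uses only the standard library; the import names are assumed from the package list above (`sentence-transformers` imports as `sentence_transformers`, `faiss-gpu` as `faiss`), and `missing_modules` is a hypothetical helper, not part of this repository.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names assumed from the dependency list above.
REQUIRED = [
    "transformers", "accelerate", "trl", "sentence_transformers",
    "faiss", "datasets", "torch", "numpy", "pandas",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```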

---

Project Structure

llm_shapley_owen_code/
├── data/  # Datasets and indices
│   └── esci/  # ESCI product search data
│       ├── metadata/
│       │   └── item_catalog.jsonl  # ASIN→metadata mapping (~3M items, 1.8GB)
│       ├── embeddings/
│       │   ├── all-mpnet-base-v2.npy  # Dense embeddings (3M × 768, 10GB)
│       │   └── all-mpnet-base-v2_asin_mapping.json
│       ├── index/
│       │   ├── all-mpnet-base-v2_faiss.bin  # FAISS HNSW index (10GB)
│       │   ├── all-mpnet-base-v2_asin_mapping.json
│       │   └── all-mpnet-base-v2_faiss_metadata.json
│       ├── rl_dataset/  # RL training dataset
│       ├── sft_dataset.jsonl  # Generated SFT training data
│       └── dpo_dataset.jsonl  # Generated DPO preference pairs
│
├── src/esci_search/  # Main source code
│   ├── configs/  # Training configurations
│   │   ├── grpo_config.yaml  # GRPO baseline config
│   │   ├── ospo_prop_config.yaml  # OSPO proportional
│   │   ├── ospo_rank_config.yaml  # OSPO rank-based
│   │   └── ospo_prop_no_clip.yaml  # OSPO without gradient clipping
│   │
│   ├── data_processing/  # Data preparation pipeline
│   │   ├── build_index.py  # Build FAISS search index
│   │   ├── create_item_metadata.py
│   │   ├── generate_embeddings.py
│   │   ├── generate_rl_data.py  # Create RL training dataset
│   │   ├── generate_dpo_sft_data.py  # Generate model trajectories
│   │   ├── process_sft_dpo_data.py  # Process into SFT/DPO format
│   │   ├── run_sft_dpo_pipeline.sh  # End-to-end dataset generation
│   │   └── sample_queries.py
│   │
│   ├── evals/  # Evaluation scripts
│   │   ├── generate_test.py  # Run inference on test set
│   │   ├── score_search_only.py  # Compute retrieval metrics
│   │   └── test_generations_sft_dpo/  # Generated trajectory CSVs
│   │
│   └── trainers/  # Training implementations
│       ├── dense_search/  # Dense retrieval (FAISS)
│       │   └── search.py
│       ├── train_sft.py  # Supervised fine-tuning
│       ├── train_dpo.py  # Direct preference optimization
│       ├── train_grpo.py  # Group relative policy optimization
│       ├── train_ospo.py  # Owen-Shapley policy optimization
│       ├── ospo_utils.py  # OSPO-specific utilities
│       ├── reward_utils.py  # Reward computation
│       └── generation_utils.py  # Text generation helpers
│
├── outputs/  # Training outputs (created at runtime)
│   ├── ospo_ablations_esci/  # OSPO/GRPO checkpoints
│   ├── sft_models/  # SFT checkpoints
│   └── dpo_models/  # DPO checkpoints
├── ospo_code_final_review.pdf
├── pyproject.toml
├── README.md
├── README_old.md
├── requirements.txt
└── setup.py
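The storage figure quoted for the dense embeddings can be sanity-checked with a line of arithmetic. The sketch below assumes the matrix is stored as 32-bit floats (4 bytes per value), which the README does not state; `dense_matrix_bytes` is a hypothetical helper.

```python
def dense_matrix_bytes(rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size in bytes of a dense rows x dim matrix (float32 assumed by default)."""
    return rows * dim * bytes_per_value


size = dense_matrix_bytes(3_000_000, 768)
print(f"{size / 1e9:.2f} GB")  # 9.22 GB
```

Under that assumption the raw matrix comes to about 9.2 GB, in line with the roughly 10GB listed for `all-mpnet-base-v2.npy`. Plan disk space and host memory accordingly before generating embeddings or building the index.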

Directory Overview

  • `data/esci/`: ESCI product search dataset and artifacts
      • `metadata/`: Product catalog with ASIN metadata (~3M items, 1.8GB)
      • `embeddings/`: Pre-computed dense embeddings (all-mpnet-base-v2, 10GB)
      • `index/`: FAISS HNSW search index for dense retrieval (10GB)
      • `rl_dataset/`: Prepared dataset for RL training (queries + candidate pools)
      • `sft_dataset.jsonl`: High-quality samples for supervised fine-tuning
      • `dpo_dataset.jsonl`: Preference pairs for direct preference optimization
  • `src/esci_search/`: Core codebase for the ESCI product search task
      • `configs/`: YAML configurations for different training methods (OSPO, GRPO)
      • `data_processing/`: Scripts to prepare datasets, indices, embeddings, and SFT/DPO data
      • `evals/`: Inference and evaluation utilities, trajectory generation outputs
      • `trainers/`: Training implementations (SFT, DPO, GRPO, OSPO) and search utilities
  • `outputs/`: All model checkpoints organized by training method (git-ignored)
      • `ospo_ablations_esci/`: OSPO and GRPO model checkpoints
      • `sft_models/`: Supervised fine-tuning checkpoints
      • `dpo_models/`: Direct preference optimization checkpoints
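To make the role of the embeddings, ASIN mapping, and index concrete, here is a minimal exact cosine top-k search in pure Python. It is a reference for what an approximate HNSW index returns, not the code in `trainers/dense_search/search.py`. The toy vectors, the placeholder ASINs, and the assumption that the mapping is a list from row index to ASIN are all hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(
    query: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    row_to_asin: Sequence[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k items most similar to `query` as (asin, cosine score) pairs."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(row))), row_to_asin[i])
        for i, row in enumerate(embeddings)
    )
    return [(asin, score) for score, asin in heapq.nlargest(k, scored)]


# Toy data: three 2-d "item embeddings" and placeholder ASINs.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
row_to_asin = ["ASIN_A", "ASIN_B", "ASIN_C"]

hits = top_k([1.0, 0.0], embeddings, row_to_asin, k=2)
print(hits)
```

Exact search scans every row, which is impractical at ~3M items per query during RL training; that cost is the usual motivation for an approximate index such as HNSW.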

---

Data Processing Pipeline

The pipeline prepares the ESCI Shopping Queries dataset for OSPO training by creating item metadata, dense embeddings, search indices, ground truth queries, and RL-formatted training data.

Option A: Full Pipeline (Recommended)

Run all 5 steps sequentially from the data processing directory:

cd src/esci_search/data_processing
bash data_process.sh

This executes all stages in order and stops automatically if any step fails.
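The stop-on-first-failure behaviour can be sketched in Python as follows. The contents of `data_process.sh` are not shown here, so the mapping of the five steps to the script names below (taken from the project structure) and their order are assumptions; `run_pipeline` is a hypothetical helper.

```python
import subprocess
import sys
from typing import List, Sequence

# Assumed step order; script names come from data_processing/ in the project tree.
STEPS = [
    "create_item_metadata.py",
    "generate_embeddings.py",
    "build_index.py",
    "sample_queries.py",
    "generate_rl_data.py",
]


def run_pipeline(commands: Sequence[Sequence[str]]) -> int:
    """Run commands in order; return how many completed before the first failure."""
    completed = 0
    for cmd in commands:
        result = subprocess.run(list(cmd))
        if result.returncode != 0:
            # Fail fast: later steps depend on the artifacts of earlier ones.
            print(f"step failed ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
            break
        completed += 1
    return completed


# Build the command list without running it; call run_pipeline(commands) to execute.
commands: List[List[str]] = [[sys.executable, step] for step in STEPS]
```

In a shell script the same effect is commonly achieved with `set -e`, which aborts the script as soon as a command exits with a non-zero status.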

---

Option B: Step-by-Step…

Notability

notability 6.0/10

New repo for research algorithm from Amazon