amazon-science/Self-Aligned-Reward-Towards_Effective_and_Efficient_Reasoners
Language: Python
License: Apache-2.0
Stars: 21
Forks: 0
Open issues: 17
Created: 2026-01-08T23:14:08Z
Pushed: 2026-04-21T21:41:18Z
Default branch: main
Fork: no
Archived: no
README:
About SAR
Self-aligned reward (SAR) is a generic, universally applicable self-guided signal that complements verifiable rewards to enhance both reasoning accuracy and efficiency in RL. By utilizing the perplexity signals within the model, SAR encourages more compact, efficient reasoning paths while maintaining strong reasoning capacity.
Specifically, SAR compares the perplexity of a model rollout with and without the question as context. As a result, answers that are closely tailored to the question receive a higher SAR score. Please refer to the paper for additional details.
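The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names `perplexity` and `sar_score` are hypothetical, and the normalized, clipped difference used here is an assumption made for illustration (the exact formula is given in the paper). The inputs are per-token log-probabilities of the same answer tokens, scored once with the question in context and once without.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def sar_score(
    logprobs_with_question: Sequence[float],
    logprobs_without_question: Sequence[float],
) -> float:
    """Hypothetical SAR-style score (assumed form, see lead-in).

    Positive when the answer is more predictable given the question
    than on its own, i.e. when it is tailored to the question.
    """
    ppl_cond = perplexity(logprobs_with_question)
    ppl_uncond = perplexity(logprobs_without_question)
    # Assumption: relative perplexity drop, clipped to [-1, 1].
    raw = (ppl_uncond - ppl_cond) / ppl_uncond
    return max(-1.0, min(1.0, raw))


# Made-up log-probabilities, for illustration only.
tailored = sar_score([-0.1, -0.2, -0.1], [-1.0, -1.2, -0.9])
generic = sar_score([-0.5, -0.6, -0.4], [-0.5, -0.6, -0.4])
print(f"tailored answer: {tailored:.3f}, generic answer: {generic:.3f}")
```

An answer whose tokens become much more likely once the question is visible scores higher than one that is equally likely either way, which matches the behavior described above.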
Repo Usage
Our repository is based on verl 0.3.1.dev0.
Setup
Run each step in preconfig.sh to prepare the environment. This requires a compatible environment with conda installed and available.
Data Processing
preconfig.sh also contains the commands to prepare math datasets used in our paper.
examples/data_preprocess contains scripts for preparing datasets. You can adapt these scripts to prepare your own datasets.
Training
We provide two example scripts, scripts/ppo.sh and scripts/grpo.sh. The key hyperparameters are reward_types and reward_factors. Self-aligned reward is denoted "ppl_qa" in the codebase.
See verl/trainer/config/ppo_trainer.yaml for details on all hyperparameters.
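To make the roles of reward_types and reward_factors concrete, here is a minimal sketch of one plausible way such settings could combine reward signals. It assumes that each entry in reward_factors is a weight for the reward type at the same position and that the final reward is a weighted sum; this is an assumption, not confirmed by the README. The function `combine_rewards` and the reward name "accuracy" are hypothetical; only "ppl_qa" comes from the codebase.

```python
from typing import Dict, Sequence


def combine_rewards(
    reward_types: Sequence[str],
    reward_factors: Sequence[float],
    per_type_rewards: Dict[str, float],
) -> float:
    """Weighted sum of reward signals (assumed combination rule)."""
    if len(reward_types) != len(reward_factors):
        raise ValueError("reward_types and reward_factors must have equal length")
    return sum(
        factor * per_type_rewards[name]
        for name, factor in zip(reward_types, reward_factors)
    )


# "accuracy" is a hypothetical name for a verifiable reward;
# "ppl_qa" is the name the codebase uses for self-aligned reward.
total = combine_rewards(
    reward_types=["accuracy", "ppl_qa"],
    reward_factors=[1.0, 0.2],
    per_type_rewards={"accuracy": 1.0, "ppl_qa": 0.5},
)
print(total)
```

Check scripts/ppo.sh, scripts/grpo.sh, and verl/trainer/config/ppo_trainer.yaml for the actual option names, values, and semantics.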
Self-aligned reward can be seamlessly adapted to different RL algorithms.
Evaluation
scripts/batched_validate.sh and scripts/auto_validate.sh are scripts for inference.
As the figure below shows, self-aligned reward leads to notable gains in both accuracy and efficiency.
Cite this paper
If you find this repo or the paper useful, please cite:
@article{han2025self,
title={Self-Aligned Reward: Towards Effective and Efficient Reasoners},
author={Han, Peixuan and Krishnan, Adit and Friedland, Gerald and You, Jiaxuan and Kong, Chris},
journal={arXiv preprint arXiv:2509.05489},
year={2025}
}
Reach out to [Peixuan Han](mailto:ph16@illinois.edu) for any questions.
Notability
notability 4.0/10
New repo from Amazon, low traction (21 stars)