What does this repo signal mean?

Amazon (Nova) published amazon-science/Self-Aligned-Reward-Towards_Effective_and_Efficient_Reasoners (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo amazon-science/Self-Aligned-Reward-Towards_Effective_and_Efficient_Reasoners · language Python · New repo from Amazon, low traction (21 stars). onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Amazon (Nova) Repo: amazon-science/Self-Aligned-Reward-Towards_Effective_and_Efficient_Reasoners

Captured source

GitHub: github.com/amazon-science/Self-Aligned-Reward-Towards_Effective_and_Efficient_Reasoners

amazon-science/Self-Aligned-Reward-Towards_Effective_and_Efficient_Reasoners repository metadata

published Jan 8, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

amazon-science/Self-Aligned-Reward-Towards_Effective_and_Efficient_Reasoners

Language: Python

License: Apache-2.0

Stars: 21

Forks: 0

Open issues: 17

Created: 2026-01-08T23:14:08Z

Pushed: 2026-04-21T21:41:18Z

Default branch: main

Fork: no

Archived: no

README:

About SAR

Self-aligned reward (SAR) is a generic, universally applicable self-guided signal that complements verifiable rewards to enhance both reasoning accuracy and efficiency in RL. By utilizing the perplexity signals within the model, SAR encourages more compact, efficient reasoning paths while maintaining strong reasoning capacity.

Specifically, SAR compares the perplexity of a model rollout with and without the question as context. As a result, answers that are closely tailored to the question receive a higher SAR score. Please refer to the paper for additional details.
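The comparison described above can be illustrated with a small sketch. It is not taken from the repository: the function names are hypothetical, and the specific form used here (relative perplexity drop, clipped to [-1, 1]) is an assumption about how such a comparison could be scored. The paper gives the exact definition. The inputs are per-token log-probabilities of the same answer, scored once with the question in context and once without.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward_sketch(logprobs_with_question, logprobs_without_question):
    """Hypothetical SAR-style score (assumed form, not the repo's code).

    Measures how much conditioning on the question lowers the answer's
    perplexity, relative to the unconditioned perplexity, clipped to [-1, 1].
    """
    ppl_cond = perplexity(logprobs_with_question)
    ppl_alone = perplexity(logprobs_without_question)
    relative_drop = (ppl_alone - ppl_cond) / ppl_alone
    return max(-1.0, min(1.0, relative_drop))


# An answer that is far more likely once the question is known ...
tailored = self_aligned_reward_sketch([-0.1, -0.2, -0.1], [-1.0, -1.5, -1.2])
# ... versus one whose likelihood barely depends on the question.
generic = self_aligned_reward_sketch([-0.5, -0.5], [-0.55, -0.5])
print(f"tailored={tailored:.3f} generic={generic:.3f}")
```

Under this assumed form, an answer whose likelihood depends heavily on the question scores close to 1, while an answer that is equally likely with or without the question scores close to 0.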

Repo Usage

Our repository is based on verl 0.3.1.dev0.

Setup

Run each step specified in preconfig.sh to prepare the environment. Note that this requires a compatible environment with conda installed and available.

Data Processing

preconfig.sh also contains the commands to prepare math datasets used in our paper.

examples/data_preprocess contains scripts for preparing datasets. These scripts can be adapted to prepare custom datasets.

Training

We provide two example scripts in scripts/ppo.sh and scripts/grpo.sh. Key hyperparameters are reward_types and reward_factors. We denote self-aligned reward as "ppl_qa" in the codebase.
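One way to picture how reward_types and reward_factors might interact is a weighted sum of per-type scores. The sketch below is an assumption for illustration only: the function combine_rewards and the "accuracy" key are hypothetical, and the repository may combine rewards differently. Only the name "ppl_qa" comes from the README. Consult scripts/ppo.sh, scripts/grpo.sh and verl/trainer/config/ppo_trainer.yaml for the actual behavior.

```python
def combine_rewards(per_type_scores, reward_types, reward_factors):
    """Hypothetical weighted sum of named reward components.

    per_type_scores: dict mapping a reward name to its score for one rollout.
    reward_types:    names of the rewards to include, e.g. ["accuracy", "ppl_qa"].
    reward_factors:  one weight per entry in reward_types.
    """
    if len(reward_types) != len(reward_factors):
        raise ValueError("reward_types and reward_factors must have the same length")
    return sum(
        factor * per_type_scores[name]
        for name, factor in zip(reward_types, reward_factors)
    )


# "accuracy" is a placeholder name for a verifiable reward; "ppl_qa" is SAR.
scores = {"accuracy": 1.0, "ppl_qa": 0.5}
total = combine_rewards(scores, ["accuracy", "ppl_qa"], [1.0, 0.2])
print(total)
```

In this sketch the factor on "ppl_qa" controls how strongly the self-aligned signal shifts the total relative to the verifiable reward.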

One can read verl/trainer/config/ppo_trainer.yaml to learn details for all hyperparameters.

Self-aligned reward can be seamlessly adapted to different RL algorithms.

Evaluation

scripts/batched_validate.sh and scripts/auto_validate.sh are scripts for inference.

The figure in the README shows that self-aligned reward leads to notable gains in both accuracy and efficiency.

Cite this paper

If you find this repo or the paper useful, please cite:

@article{han2025self,
title={Self-Aligned Reward: Towards Effective and Efficient Reasoners},
author={Han, Peixuan and Krishnan, Adit and Friedland, Gerald and You, Jiaxuan and Kong, Chris},
journal={arXiv preprint arXiv:2509.05489},
year={2025}
}

Reach out to [Peixuan Han](mailto:ph16@illinois.edu) for any questions.

Notability

notability 4.0/10

New repo from Amazon, low traction (21 stars)