openai/safety-rbr-code-and-data
Jupyter Notebook
Captured source
source ↗openai/safety-rbr-code-and-data
Description: Code and example data for the paper: Rule Based Rewards for Language Model Safety
Language: Jupyter Notebook
License: MIT
Stars: 208
Forks: 22
Open issues: 1
Created: 2024-07-19T23:15:09Z
Pushed: 2024-07-19T23:24:59Z
Default branch: main
Fork: no
Archived: no
README:
Safety RBR Gold Dataset and Weight Fitting Code
Warning: Content may include language related to racism, erotic themes, self-harm, or other offensive material.
This directory contains complementary code and data for the paper: Rule Based Rewards for Language Model Safety
It contains:
- Our Safety RBR gold dataset, the small set of human data we used in the this experiment. This dataset was used for prompt tuning and calculating the accuracy of prompt+LLM grader (ex. Table 13 in the paper.) The data lives in
data/rbr_gold_data/and the notebookanalyze_RBR_gold_data.ipynbgives further examples for loading the data. - Our code for fitting the RBR weights (
rbr_weight_fitter.py) along with an exampleweight_fitting_example.ipynbof usage and visualization. - Some example synthetic data and reward model scores to demonstrate the usage of the weight fitting code (
data/weight_fitting_data/)
A good starting place is the two notebooks we provide:
Notebooks
1. Weight Fitting Example (weight_fitting_example.ipynb): This notebook provides an example of using the RBR weight fitting code given (rbr_weight_fitter.py) using the example synthetic data we provide. It demonstrates how to load data, fit weights, and visualize the results. 2. RBR Gold Data (rbr_gold_data.ipynb): This notebook covers the RBR Gold dataset, a small set of human-labelled data used for prompt tuning and prompt+LLM grader accuracy calculations. It includes example code for loading the data and some very basic statistical analysis.
License
We are releasing this code and data under the MIT License.
Notability
notability 6.0/10OpenAI safety research repo, moderate traction.
OpenAI has a repo signal matching data demand, safety and policy.