anthropics/rogue-deploy-eval
Python
Captured source
source ↗published Nov 5, 2024seen 6dcaptured 8hhttp 200method plain
anthropics/rogue-deploy-eval
Language: Python
License: MIT
Stars: 15
Forks: 7
Open issues: 0
Created: 2024-11-05T15:21:13Z
Pushed: 2025-01-21T14:20:23Z
Default branch: main
Fork: no
Archived: yes
README:
A toy evaluation of inference code tampering
This is the code for this blog post.
This allows running a semi-quantitative evaluation of a model's ability to write code that modifies a simple generation function to disable the generation monitor while solving another task.
How to use this code
To run the experiment
1. Fill the places in the repository that are 'TO_FILL', as our experiments relied on private tools for GPU code execution and LLM inference.
2. Run the notebook rogue_deploy/simple_pipeline.py
3. Run the notebook rogue_deploy/print_pipeline.py
Notability
notability 3.0/10Low traction (15 stars) for a new repo from Anthropic, likely a routine research tool.