RepoAnthropicAnthropicpublished Nov 5, 2024seen 6d

anthropics/rogue-deploy-eval

Python

Open original ↗

Captured source

source ↗
published Nov 5, 2024seen 6dcaptured 8hhttp 200method plain

anthropics/rogue-deploy-eval

Language: Python

License: MIT

Stars: 15

Forks: 7

Open issues: 0

Created: 2024-11-05T15:21:13Z

Pushed: 2025-01-21T14:20:23Z

Default branch: main

Fork: no

Archived: yes

README:

A toy evaluation of inference code tampering

This is the code for this blog post.

This allows running a semi-quantitative evaluation of a model's ability to write code that modifies a simple generation function to disable the generation monitor while solving another task.

How to use this code

To run the experiment

1. Fill the places in the repository that are 'TO_FILL', as our experiments relied on private tools for GPU code execution and LLM inference.

2. Run the notebook rogue_deploy/simple_pipeline.py

3. Run the notebook rogue_deploy/print_pipeline.py

Notability

notability 3.0/10

Low traction (15 stars) for a new repo from Anthropic, likely a routine research tool.