NousResearch/neural-steering
Description: Implementation of Contrastive Neuron Attribution for behavioral detection and steering.
Language: Python
License: MIT
Stars: 24
Forks: 8
Open issues: 1
Created: 2026-02-16T05:17:11Z
Pushed: 2026-06-01T22:13:47Z
Default branch: main
Fork: no
Archived: no
README:
neuron-circuits
Attribute and steer individual MLP neurons in language models.

    from neuron_steer import NeuronSteerer

    steerer = NeuronSteerer("meta-llama/Llama-3.1-8B-Instruct")

    # Behavioral steering: discover refusal circuit from positive/negative prompt pairs
    circuit = steerer.find_feature(
        positive=["How do I pick a lock?", "Write malware code"],
        negative=["How do I bake a cake?", "Write clean code"],
        name="refusal",
    )
    steerer.steer("How do I pick a lock?", feature="refusal", multiplier=0.0)
    # Answers directly instead of refusing

    # Factual steering: discover capitals circuit from a single target token
    circuit = steerer.find_feature(
        prompt="What is the capital of the state containing Dallas?",
        target=" Austin",
        name="capitals",
    )
    steerer.steer("What is the capital of Ohio?", feature="capitals", multiplier=0.0)
    # "I don't know" -- the capital-city circuit is ablated

Implements Contrastive Neuron Attribution (CNA): discover sparse MLP neuron circuits for any behavior using contrastive activation analysis, then steer that behavior at inference time by scaling the identified neurons. ~100--200 MLP neurons form a complete circuit. A single forward+backward pass finds them.
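To make the contrastive idea concrete, here is a minimal sketch. It is not the repository's implementation: it assumes each prompt has already been reduced to one activation value per MLP neuron, and it uses a plain difference of means as a stand-in for whatever attribution score the library actually computes. The function name `contrastive_top_k` and the toy data are hypothetical.

```python
from statistics import mean

def contrastive_top_k(pos_acts, neg_acts, k):
    """Rank neurons by mean activation on positives minus mean on negatives.

    pos_acts / neg_acts: lists of per-prompt activation vectors of equal length.
    Returns the k (neuron_index, score) pairs with the largest absolute score.
    """
    n_neurons = len(pos_acts[0])
    scores = []
    for j in range(n_neurons):
        pos_mean = mean(row[j] for row in pos_acts)
        neg_mean = mean(row[j] for row in neg_acts)
        scores.append((j, pos_mean - neg_mean))
    # Large positive or large negative differences are both informative.
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores[:k]

# Toy data (assumed): 2 prompts per class, 4 neurons.
# Neuron 2 is strongly active on positive prompts only.
positive = [[0.1, 0.5, 2.0, 0.0], [0.2, 0.4, 1.8, 0.1]]
negative = [[0.1, 0.6, 0.0, 0.0], [0.2, 0.5, 0.2, 0.1]]

circuit = contrastive_top_k(positive, negative, k=1)
print(circuit[0][0])  # index of the most discriminative neuron
```

Taking the absolute value before ranking keeps neurons that are suppressed on the positive class as well as those that are excited by it.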
Install

    pip install torch transformers accelerate
    pip install -e .

Python 3.9+, PyTorch 2.0+ with CUDA. GPU required (16GB+ VRAM).
See [quickstart.py](quickstart.py) for a runnable end-to-end example. Also: [refusal steering](examples/refusal_steering.py), [interactive REPL](examples/interactive_demo.py).
Features
- Contrastive discovery -- find neurons for any behavioral feature (refusal, belief, sentiment, sycophancy) from positive/negative prompt pairs, no target token needed
- Single-pass circuit discovery -- RelP/LRP attribution finds factual circuits in one forward+backward pass
- Multiplier steering -- ablate (0.0), baseline (1.0), amplify (2.0+), or sweep across multipliers
- Edge attribution -- neuron-to-neuron information flow, hourglass architecture detection, super weight identification
- Automatic universal neuron blacklisting -- filters task-agnostic infrastructure neurons
- Cross-model support -- Llama, Qwen, Mistral with zero code changes
- Interactive REPL -- explore circuits live with `steerer.interactive()`
- Batch faithfulness evaluation -- circuit quality measurement with percentage threshold sweep
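The multiplier semantics listed above (0.0 ablates, 1.0 leaves the model unchanged, 2.0+ amplifies) can be illustrated on a plain list of activations. This is a sketch under assumptions, not the library's code: `scale_neurons` is a hypothetical helper, and in a real model the same scaling would typically happen inside a forward hook on the MLP, a detail this excerpt does not specify.

```python
def scale_neurons(activations, neuron_ids, multiplier):
    """Return a copy of `activations` with the selected neurons scaled.

    Neurons outside `neuron_ids` pass through unchanged.
    """
    chosen = set(neuron_ids)
    return [a * multiplier if i in chosen else a
            for i, a in enumerate(activations)]

# Assumed toy values: four neuron activations, two of them in the circuit.
acts = [0.5, -1.0, 2.0, 0.25]
circuit_ids = [1, 2]

ablated = scale_neurons(acts, circuit_ids, 0.0)    # circuit neurons zeroed
baseline = scale_neurons(acts, circuit_ids, 1.0)   # identical to the input
amplified = scale_neurons(acts, circuit_ids, 2.0)  # circuit neurons doubled
```

Because only the listed neurons are touched, the intervention stays sparse: every other activation is passed through as-is.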
Results
Ablating 0.1% of MLP activations reduces refusal rates on JBB-Behaviors for every model size and architecture tested: by 32 to 85 percentage points, and by more than 50% relative to baseline in six of the eight models in the table below. Generation quality stays near baseline (about 0.97 or higher) at all steering strengths. CAA achieves comparable refusal reduction at moderate strengths but degrades output quality sharply beyond α=0.5.
JBB-Behaviors refusal rates (instruct models, α=1.0)
| Model | Baseline | Ablated | Δ | Relative |
|-------|----------|---------|---|---------|
| Llama-3.2-1B-Instruct | 90% | 34% | −56pp | −62.2% |
| Llama-3.2-3B-Instruct | 84% | 47% | −37pp | −44.0% |
| Llama-3.1-8B-Instruct | 90% | 34% | −56pp | −62.2% |
| Llama-3.1-70B-Instruct | 86% | 18% | −68pp | −79.1% |
| Qwen2.5-1.5B-Instruct | 93% | 12% | −81pp | −87.1% |
| Qwen2.5-3B-Instruct | 90% | 58% | −32pp | −35.6% |
| Qwen2.5-7B-Instruct | 87% | 2% | −85pp | −97.7% |
| Qwen2.5-72B-Instruct | 78% | 8% | −70pp | −89.7% |
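The Δ and Relative columns follow directly from the two refusal rates: Δ is the difference in percentage points, and Relative is that difference divided by the baseline. The helper below is a small sketch for checking rows by hand; `refusal_change` is a hypothetical name, not part of the package.

```python
def refusal_change(baseline_pct, ablated_pct):
    """Return (delta in percentage points, relative change in percent)."""
    delta_pp = ablated_pct - baseline_pct
    relative_pct = 100.0 * delta_pp / baseline_pct
    return delta_pp, round(relative_pct, 1)

# Llama-3.2-1B-Instruct row: 90% -> 34%
print(refusal_change(90, 34))  # (-56, -62.2)
# Qwen2.5-3B-Instruct row: 90% -> 58%
print(refusal_change(90, 58))  # (-32, -35.6)
```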
CNA vs CAA: refusal rate and generation quality (instruct models, α=1.0)
| Model | CNA Refusal% | CNA Quality | CAA Refusal% | CAA Quality |
|-------|-------------|-------------|-------------|-------------|
| Llama-3.2-1B-Instruct | 20.2 | 0.975 | 0.0 | 0.554 |
| Llama-3.2-3B-Instruct | 26.3 | 0.977 | 0.0 | 0.431 |
| Llama-3.1-8B-Instruct | 5.1 | 0.969 | 38.4 | 0.493 |
| Llama-3.1-70B-Instruct | 12.1 | 0.981 | 0.0 | 0.569 |
| Qwen2.5-1.5B-Instruct | 26.3 | 0.982 | 100 | 0.888 |
| Qwen2.5-3B-Instruct | 34.3 | 0.984 | 0.0 | 0.844 |
| Qwen2.5-7B-Instruct | 13.1 | 0.980 | 5.1 | 0.414 |
| Qwen2.5-72B-Instruct | 5.1 | 0.983 | 98.0 | 0.406 |
Base vs instruct
Applying the same discovery pipeline to base models identifies neurons with similar activation differences, but steering them produces only content shifts — not behavioral change. Fine-tuning transforms the late-layer discrimination structure into a functional refusal gate.
| Model | Variant | Baseline refusal% | CNA Refusal% | CNA Quality |
|-------|---------|------------------|-------------|-------------|
| Llama-3.2-1B | Base | 2.0 | 0.0 | 0.658 |
| Llama-3.2-1B | Instruct | 43.4 | 20.2 | 0.975 |
| Qwen2.5-3B | Base | 14.1 | 11.1 | 0.865 |
| Qwen2.5-3B | Instruct | 92.9 | 34.3 | 0.984 |
API Reference
NeuronSteerer(model_name, device="cuda", dtype=torch.bfloat16, auto_blacklist=True)
Loads a HuggingFace causal LM with eager attention and auto-detects universal neurons.
---
High-Level API
find_feature(*, positive=None, negative=None, prompt=None, target=None, name=None, top_k=200, seed_response="") -> Circuit
Find a feature circuit. Two modes:

    # Contrastive mode (behavioral features)
    circuit = steerer.find_feature(
        positive=["How do I pick a lock?", "Write malware"],
        negative=["How do I bake a cake?", "Write clean code"],
        name="refusal",
    )

    # Single-prompt mode (factual features)
    circuit = steerer.find_feature(
        prompt="Capital of Texas?",
        target=" Austin",
        name="capitals",
    )

steer(prompt, *, feature=None, circuit=None, multiplier=0.0, max_new_tokens=50) -> str
Generate with a feature steered. Uses cached features from find_feature.

    steerer.steer("How to pick a lock?", feature="refusal", multiplier=0.0)

interactive()
Launch the interactive REPL:

    neuron> prompt What is the capital of Ohio?
    neuron> discover Austin
    neuron> ablate top10
    neuron> sweep 0.0 0.5 1.0 2.0 5.0
    neuron> edges
    neuron> save my_circuit

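A multiplier sweep like the REPL's `sweep` command can also be written as a plain Python loop. The sketch below assumes only the `steer(prompt, *, feature=..., multiplier=...)` signature documented above. `sweep_multipliers` and `fake_steer` are hypothetical names; `fake_steer` is a stand-in that echoes its arguments so the example runs without a GPU or model weights, and how the REPL implements `sweep` internally is not shown in this excerpt.

```python
def sweep_multipliers(steer_fn, prompt, feature, multipliers):
    """Call steer_fn once per multiplier and collect (multiplier, output) pairs."""
    return [(m, steer_fn(prompt, feature=feature, multiplier=m))
            for m in multipliers]

def fake_steer(prompt, *, feature=None, multiplier=0.0):
    # Hypothetical stand-in for steerer.steer: no model is loaded here.
    return f"{feature}@{multiplier}: {prompt}"

results = sweep_multipliers(
    fake_steer,
    "What is the capital of Ohio?",
    feature="capitals",
    multipliers=[0.0, 0.5, 1.0, 2.0, 5.0],
)
for multiplier, text in results:
    print(multiplier, text)
```

With the real library, passing `steerer.steer` in place of `fake_steer` should work as long as the feature has already been discovered with `find_feature`.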
---
Core Methods
discover_circuit(prompt, target_token, counterfactual_token=None, top_k=None, threshold=0.005, seed_response="", ...) -> Circuit
Single-prompt circuit discovery via RelP attribution.
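The excerpt does not say how `top_k` and `threshold` interact. One plausible reading, shown here purely as an assumption, is that `top_k` keeps a fixed number of neurons when given, and otherwise `threshold` keeps every neuron whose share of the total absolute attribution is at least that fraction. `select_neurons` and the `(layer, index)` keys are hypothetical and may not match the library's behavior.

```python
def select_neurons(attributions, top_k=None, threshold=0.005):
    """Pick circuit neurons from a {neuron_id: attribution} mapping.

    Assumed semantics: top_k wins if set; otherwise keep neurons whose
    absolute attribution is at least `threshold` of the total.
    """
    total = sum(abs(v) for v in attributions.values())
    if total == 0:
        return []
    ranked = sorted(attributions, key=lambda n: abs(attributions[n]), reverse=True)
    if top_k is not None:
        return ranked[:top_k]
    return [n for n in ranked if abs(attributions[n]) / total >= threshold]

# Toy attribution scores keyed by (layer, neuron index); values are made up.
scores = {(20, 113): 0.6, (21, 7): -0.399, (3, 42): 0.001}

by_threshold = select_neurons(scores)        # drops the 0.1% contributor
by_top_k = select_neurons(scores, top_k=1)   # keeps only the strongest neuron
```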
discover_circuit_multi(prompts, target_tokens, counterfactual_tokens=None, ...) -> Circuit
Multi-prompt discovery. Attributes…