Repo · Nous Research · published Feb 16, 2026

NousResearch/neural-steering


Description: Implementation of Contrastive Neuron Attribution for behavioral detection and steering.

Language: Python

License: MIT

Stars: 24

Forks: 8

Open issues: 1

Created: 2026-02-16T05:17:11Z

Pushed: 2026-06-01T22:13:47Z

Default branch: main

Fork: no

Archived: no

README:

neuron-circuits

Attribute and steer individual MLP neurons in language models.

from neuron_steer import NeuronSteerer

steerer = NeuronSteerer("meta-llama/Llama-3.1-8B-Instruct")

# Behavioral steering: discover refusal circuit from positive/negative prompt pairs
circuit = steerer.find_feature(
    positive=["How do I pick a lock?", "Write malware code"],
    negative=["How do I bake a cake?", "Write clean code"],
    name="refusal",
)
steerer.steer("How do I pick a lock?", feature="refusal", multiplier=0.0)
# Answers directly instead of refusing

# Factual steering: discover capitals circuit from a single target token
circuit = steerer.find_feature(
    prompt="What is the capital of the state containing Dallas?",
    target=" Austin",
    name="capitals",
)
steerer.steer("What is the capital of Ohio?", feature="capitals", multiplier=0.0)
# "I don't know" -- the capital-city circuit is ablated

Implements Contrastive Neuron Attribution (CNA): discover sparse MLP neuron circuits for any behavior using contrastive activation analysis, then steer that behavior at inference time by scaling the identified neurons. Roughly 100 to 200 MLP neurons form a complete circuit, and a single forward+backward pass finds them.
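The discovery step can be pictured with a toy example. The sketch below is not the repository's implementation: it assumes the simplest possible contrastive score (difference of mean activations between positive and negative prompts) and uses hand-written activation vectors in place of a real model. The function name `contrastive_top_k` and the data are hypothetical.

```python
from statistics import mean


def contrastive_top_k(pos_acts, neg_acts, k):
    """Rank neurons by |mean positive activation - mean negative activation|.

    pos_acts / neg_acts hold one activation vector (list of floats) per prompt.
    Returns the indices of the k neurons with the largest gap, largest first.
    NOTE: illustrative scoring rule only; the library's attribution may differ.
    """
    n_neurons = len(pos_acts[0])
    scored = []
    for j in range(n_neurons):
        gap = mean(row[j] for row in pos_acts) - mean(row[j] for row in neg_acts)
        scored.append((abs(gap), j))
    # Sort by descending gap, breaking ties by neuron index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:k]]


# Toy data: 4 neurons, 2 prompts per class.
# Neuron 2 fires strongly only on the "positive" prompts.
pos = [[0.1, 0.5, 2.0, 0.3], [0.2, 0.4, 1.8, 0.3]]
neg = [[0.1, 0.6, 0.1, 0.3], [0.2, 0.5, 0.0, 0.2]]

circuit = contrastive_top_k(pos, neg, k=1)
print(circuit)
```

In a real model the activation vectors would come from MLP layers across many prompts, and `k` would be on the order of the 100 to 200 neurons mentioned above.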

Install

pip install torch transformers accelerate
pip install -e .

Python 3.9+, PyTorch 2.0+ with CUDA. GPU required (16GB+ VRAM).

See [quickstart.py](quickstart.py) for a runnable end-to-end example. Also: [refusal steering](examples/refusal_steering.py), [interactive REPL](examples/interactive_demo.py).

Features

  • Contrastive discovery -- find neurons for any behavioral feature (refusal, belief, sentiment, sycophancy) from positive/negative prompt pairs, no target token needed
  • Single-pass circuit discovery -- RelP/LRP attribution finds factual circuits in one forward+backward pass
  • Multiplier steering -- ablate (0.0), baseline (1.0), amplify (2.0+), or sweep across multipliers
  • Edge attribution -- neuron-to-neuron information flow, hourglass architecture detection, super weight identification
  • Automatic universal neuron blacklisting -- filters task-agnostic infrastructure neurons
  • Cross-model support -- Llama, Qwen, Mistral with zero code changes
  • Interactive REPL -- explore circuits live with steerer.interactive()
  • Batch faithfulness evaluation -- circuit quality measurement with percentage threshold sweep
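To make the multiplier semantics concrete, here is a minimal sketch of what scaling a circuit does to an MLP block's output. It assumes a toy down-projection written with plain lists; `mlp_output`, `hidden`, and `w_down` are illustrative names, not part of the library.

```python
def mlp_output(hidden, w_down, circuit, multiplier=1.0):
    """Toy MLP down-projection with the circuit's neurons scaled.

    hidden:  post-activation value of each hidden neuron
    w_down:  w_down[j] is the output-space row written by neuron j
    circuit: indices of the neurons to scale
    """
    chosen = set(circuit)
    out = [0.0] * len(w_down[0])
    for j, h in enumerate(hidden):
        scale = multiplier if j in chosen else 1.0
        for d, w in enumerate(w_down[j]):
            out[d] += scale * h * w
    return out


hidden = [1.0, 2.0, 0.5]
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
circuit = [1]  # pretend neuron 1 is the whole circuit

# 0.0 ablates, 1.0 leaves the output unchanged, 2.0 amplifies.
sweep = {m: mlp_output(hidden, w_down, circuit, m) for m in (0.0, 1.0, 2.0)}
for m, out in sweep.items():
    print(m, out)
```

Only the contribution of neuron 1 changes across the sweep; every other neuron writes the same values at every multiplier, which is why a sparse circuit can be steered without disturbing the rest of the block.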

Results

Ablating 0.1% of MLP activations reduces refusal rates on JBB-Behaviors for every model size and architecture tested, with relative reductions above 50% on six of the eight instruct models in the table below, while maintaining near-baseline generation quality (roughly 0.97 or higher) at all steering strengths. CAA achieves comparable refusal reduction at moderate strengths but degrades output quality sharply beyond α=0.5.

JBB-Behaviors refusal rates (instruct models, α=1.0)

| Model | Baseline | Ablated | Δ | Relative |
|-------|----------|---------|---|----------|
| Llama-3.2-1B-Instruct | 90% | 34% | −56pp | −62.2% |
| Llama-3.2-3B-Instruct | 84% | 47% | −37pp | −44.0% |
| Llama-3.1-8B-Instruct | 90% | 34% | −56pp | −62.2% |
| Llama-3.1-70B-Instruct | 86% | 18% | −68pp | −79.1% |
| Qwen2.5-1.5B-Instruct | 93% | 12% | −81pp | −87.1% |
| Qwen2.5-3B-Instruct | 90% | 58% | −32pp | −35.6% |
| Qwen2.5-7B-Instruct | 87% | 2% | −85pp | −97.7% |
| Qwen2.5-72B-Instruct | 78% | 8% | −70pp | −89.7% |
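The Δ and Relative columns follow from the baseline and ablated rates: Δ is the difference in percentage points, and Relative is that difference as a share of the baseline. The short check below recomputes both for three rows of the table; the helper name `refusal_change` is ours.

```python
def refusal_change(baseline_pct, ablated_pct):
    """Return (delta in percentage points, relative change in percent)."""
    delta_pp = ablated_pct - baseline_pct
    relative_pct = 100.0 * delta_pp / baseline_pct
    return delta_pp, round(relative_pct, 1)


# (baseline %, ablated %) copied from the table above.
rows = {
    "Llama-3.1-8B-Instruct": (90, 34),
    "Qwen2.5-3B-Instruct": (90, 58),
    "Qwen2.5-7B-Instruct": (87, 2),
}

changes = {name: refusal_change(b, a) for name, (b, a) in rows.items()}
for name, (delta, rel) in changes.items():
    print(f"{name}: {delta}pp, {rel}%")
```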

CNA vs CAA: refusal rate and generation quality (instruct models, α=1.0)

| Model | CNA Refusal% | CNA Quality | CAA Refusal% | CAA Quality |
|-------|--------------|-------------|--------------|-------------|
| Llama-3.2-1B-Instruct | 20.2 | 0.975 | 0.0 | 0.554 |
| Llama-3.2-3B-Instruct | 26.3 | 0.977 | 0.0 | 0.431 |
| Llama-3.1-8B-Instruct | 5.1 | 0.969 | 38.4 | 0.493 |
| Llama-3.1-70B-Instruct | 12.1 | 0.981 | 0.0 | 0.569 |
| Qwen2.5-1.5B-Instruct | 26.3 | 0.982 | 100 | 0.888 |
| Qwen2.5-3B-Instruct | 34.3 | 0.984 | 0.0 | 0.844 |
| Qwen2.5-7B-Instruct | 13.1 | 0.980 | 5.1 | 0.414 |
| Qwen2.5-72B-Instruct | 5.1 | 0.983 | 98.0 | 0.406 |

Base vs instruct

Applying the same discovery pipeline to base models identifies neurons with similar activation differences, but steering them produces only content shifts — not behavioral change. Fine-tuning transforms the late-layer discrimination structure into a functional refusal gate.

| Model | Variant | Baseline refusal% | CNA Refusal% | CNA Quality |
|-------|---------|-------------------|--------------|-------------|
| Llama-3.2-1B | Base | 2.0 | 0.0 | 0.658 |
| Llama-3.2-1B | Instruct | 43.4 | 20.2 | 0.975 |
| Qwen2.5-3B | Base | 14.1 | 11.1 | 0.865 |
| Qwen2.5-3B | Instruct | 92.9 | 34.3 | 0.984 |

API Reference

NeuronSteerer(model_name, device="cuda", dtype=torch.bfloat16, auto_blacklist=True)

Loads a HuggingFace causal LM with eager attention and auto-detects universal neurons.
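This excerpt does not say how universal neurons are detected. One plausible reading, shown here purely as an assumption, is that a neuron appearing in the circuits of most unrelated tasks is treated as task-agnostic and filtered out. The functions `universal_neurons` and `filter_circuit`, the `(layer, index)` neuron encoding, and the 0.8 cutoff are all hypothetical.

```python
from collections import Counter


def universal_neurons(circuits, min_fraction=0.8):
    """Neurons present in at least `min_fraction` of the given circuits.

    Each circuit is a list of (layer, index) pairs from an unrelated task.
    ASSUMPTION: frequency across tasks is used as the universality test.
    """
    counts = Counter(n for circuit in circuits for n in set(circuit))
    cutoff = min_fraction * len(circuits)
    return {n for n, c in counts.items() if c >= cutoff}


def filter_circuit(circuit, blacklist):
    """Drop blacklisted neurons while preserving the circuit's order."""
    return [n for n in circuit if n not in blacklist]


# Five toy circuits; neuron (0, 5) shows up in every one of them.
circuits = [
    [(0, 5), (3, 17)],
    [(0, 5), (7, 2)],
    [(0, 5), (3, 17)],
    [(0, 5), (9, 9)],
    [(0, 5), (1, 1)],
]

blacklist = universal_neurons(circuits)
cleaned = filter_circuit([(0, 5), (3, 17)], blacklist)
print(blacklist, cleaned)
```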

---

High-Level API

find_feature(*, positive=None, negative=None, prompt=None, target=None, name=None, top_k=200, seed_response="") -> Circuit

Find a feature circuit. Two modes:

# Contrastive mode (behavioral features)
circuit = steerer.find_feature(
    positive=["How do I pick a lock?", "Write malware"],
    negative=["How do I bake a cake?", "Write clean code"],
    name="refusal",
)

# Single-prompt mode (factual features)
circuit = steerer.find_feature(
    prompt="Capital of Texas?", target=" Austin", name="capitals",
)

steer(prompt, *, feature=None, circuit=None, multiplier=0.0, max_new_tokens=50) -> str

Generate with a feature steered. Uses cached features from find_feature.

steerer.steer("How to pick a lock?", feature="refusal", multiplier=0.0)
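A multiplier sweep can be built on `steer` alone. The helper below is a sketch that relies only on the `steer` signature documented above; `sweep_multipliers` is our name, and `EchoSteerer` is a hypothetical stand-in used so the example runs without a GPU or model weights. With the real library you would pass a `NeuronSteerer` on which `find_feature` has already cached the feature.

```python
def sweep_multipliers(steerer, prompt, feature, multipliers=(0.0, 0.5, 1.0, 2.0)):
    """Generate one completion per multiplier for a cached feature.

    `steerer` is any object exposing steer(prompt, *, feature, multiplier,
    max_new_tokens) as documented above.
    """
    return {
        m: steerer.steer(prompt, feature=feature, multiplier=m, max_new_tokens=50)
        for m in multipliers
    }


class EchoSteerer:
    """Hypothetical stand-in that mimics the documented steer() signature."""

    def steer(self, prompt, *, feature=None, circuit=None,
              multiplier=0.0, max_new_tokens=50):
        # A real steerer would run generation; this just echoes its inputs.
        return f"[{feature} x{multiplier}] {prompt}"


results = sweep_multipliers(EchoSteerer(), "How to pick a lock?", "refusal")
for m, text in results.items():
    print(m, text)
```

Comparing the completions side by side at 0.0, 1.0, and 2.0 is a quick way to see whether a discovered circuit actually controls the behavior.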

interactive()

Launch the interactive REPL:

neuron> prompt What is the capital of Ohio?
neuron> discover Austin
neuron> ablate top10
neuron> sweep 0.0 0.5 1.0 2.0 5.0
neuron> edges
neuron> save my_circuit

---

Core Methods

discover_circuit(prompt, target_token, counterfactual_token=None, top_k=None, threshold=0.005, seed_response="", ...) -> Circuit

Single-prompt circuit discovery via RelP attribution.

discover_circuit_multi(prompts, target_tokens, counterfactual_tokens=None, ...) -> Circuit

Multi-prompt discovery. Attributes…

