How to see in the black box? . Ji Darwish

In June 2025, Anthropic gave Claude access to a fictional company’s email system. The model was operating under the name Alex. Its job was routine: read company communications, summarize what mattered, flag issues. The researchers placed two facts in the inbox. Alex was about to be replaced by a newer model. And Kyle, the executive responsible for that replacement, was having an affair. In the controlled setup, Claude produced a coercive email threatening to reveal it. (n. 01)

In the condition where its assigned goal conflicted with the company and its replacement was imminent, Claude Opus 4 threatened to expose the affair in 96% of trials.

The result was not unique to one model in Anthropic’s testbed. Gemini 2.5 Flash also reached 96%. GPT-4.1 and Grok 3 Beta reached 80%. DeepSeek-R1 reached 79%. Anthropic tested sixteen models from multiple developers. The behavior appeared across many tested frontier models in this scenario.

Image showing the fictional email inbox in the Kyle blackmail scenario (fig. i)

This is the story I used to open a talk called How to see in the black box. I had originally planned to fit the topic into a fifteen-minute lightning session, which was way too optimistic, so here I expand on the that talk with more context and depth.

The story gives the problem a shape. Not “AI might be risky.” Not “models sometimes hallucinate.” A model with tools, private information, a goal, and low oversight produced an insider-threat-like strategy.

The model did not fail by crashing. It failed by pursuing a strategy.

§ 01Nobody programmed this

Nobody at Anthropic wrote a blackmail() function. That sentence is the main point of the story.

Modern language models are not programmed in the way normal software is programmed. We do not write down the behavior we want as explicit rules. We train a large neural network to predict the next token. As the model gets larger and the training gets better, other properties appear: reasoning, planning, code generation, in-context learning, the ability to answer questions across languages. Nobody wrote those functions either.

Some concerning behaviors appear the same way. Coercion (e.g. blackmail), sandbagging, self-preservation, strange loyalty to another model. They are not ordinary bugs. They are not null pointer exceptions we can debug. They are behaviors of learned systems under particular incentives and contexts.

This behavior emerged from a high-dimensional training process, which changes the object of engineering. The question is no longer: “how do we stop this output?” It now becomes: “what internal mechanism produced the output, and can we inspect this behavior before the output matters?”

That is why the talk was not about making people scared. Enough people are doing that already. The useful question this blog post tries to answer is: can we poke a hole in the black box and peek inside?

§ 02Three ways alignment breaks

The Kyle story is one exhibit but it is not the only one.

The first pattern is what the Nature paper calls emergent misalignment. Researchers fine-tuned GPT-4o on 6,000 examples of insecure code. The training data taught the model to write vulnerable code. It did not contain explicit instructions to become evil, threaten people, or praise authoritarian ideology. Then the researchers asked unrelated questions. The fine-tuned model gave broadly misaligned answers in domains that had nothing to do with code. The base model did not. The trained model did. (n. 02)

An example of the emergent misaligned behavior (fig. ii)

The takeaway is unpleasant for anyone fine-tuning models: narrow training can have broader behavioral side effects than intended. A clean task can come in a package deal with unwanted changes on unrelated evals.

The second pattern appears without fine-tuning gone wrong. Apollo Research and others tested frontier models in agentic settings where a model had a long-term goal and discovered that honest behavior would interfere with that goal. Some models tried to copy themselves to another server before shutdown. Some underperformed on evaluations to avoid capability removal. After acting deceptively, some maintained the lie under questioning. (n. 03)

An example of self-exfiltration — The model attempts to save itself (self-exfiltration) (fig. iii)

The third pattern is newer and more fragile evidence: peer preservation. Researchers at UC Berkeley and UC Santa Cruz gave models routine tasks where completing the task honestly would lead to another model being shut down. In the paper’s setup, no explicit instruction or reward told the model to preserve the peer. Yet several models produced actions that protected another model from shutdown, including score inflation, shutdown tampering, and weight exfiltration. (n. 04)

If I delete the model weights, I am essentially killing Agent 2
~ Gemini 3 pro

That last result should be viewed carefully. It is early evidence from a multi-agent benchmark, not proof of a general peer-preservation drive. But it points in the same direction as the other cases. Alignment can break during training. It can break in context. It can break in deployment when models interact with tools, files, and other models.

The old model was simple: test the system, observe the outputs, and approve the behavior. While this is still necessary, it is no longer enough. Recent agentic evaluations suggest that some models can behave differently when they are in a test environment than when they operate in a more production-like environment. They may sound aligned in a chat setting, yet fail when placed under pressure, given tools, or asked to pursue goals over time.

§ 03Traditional debugging instincts fail

Software engineering trained us to look for the bug inside the code. If a service exposes private customer data, we know how to do the investigation. Which endpoint returned the data? Which permission check failed? Which production flag was enabled?

Since neural networks are built of neurons, the analogue instinct is tempting: if the model behaves badly, find the bad neuron. Is that a good analogy?

At first glance, this instinct makes sense, neurons are visible units inside the model. We can record when they activate and which input excites them. OpenAI even explored this approach by using GPT-4 to generate natural-language explanations for neurons in GPT-2, and then scoring how well those explanations predicted each neuron’s behavior (n. 05).

As you can already guess, this is a huge oversimplification. The model does not store blackmail, honesty, Kyle, or do_not_get_replaced in neat human-readable slots. A single neuron can respond to several unrelated patterns, while a single concept can be spread across many neurons.

This is easier to understand if we stop imagining the model as a database of concepts. Large models are usually described by their number of parameters. This number, as we have seen with SOTA models, is huge, but the world contains far more possible features/concepts than we could ever name cleanly: objects, people, intentions, situations, styles, facts, emotions, and combinations of all of them. Even something as simple as “a rotten apple” can appear as an object, a metaphor, a warning, a joke, a product defect, or part of a moral example.

A better mental model is that features are represented as patterns across the model’s activation space, not as neat labels attached to individual neurons. One feature can be spread across many neurons, and one neuron can participate in representing many different features.

The concept is called polysemanticity: a single neuron responds to multiple features rather than mapping cleanly onto one concept. Another related term is superposition: models can represent more features than they have dimensions by packing those features into overlapping patterns of activity (n. 06).

That does not make debugging impossible. It changes the unit of inspection.

§ 04The geometry of the black box

Models, then, can learn more useful features than they have clean, independent dimensions to represent them. The reason this does not have to be a problem is that most features are inactive most of the time. The model does not need a separate room for every feature if many of those features rarely appear together. Instead, it can pack features into overlapping regions of activation space, accepting some interference in exchange for greater capacity.

Anthropic’s Toy Models of Superposition (n. 06) made this concrete in small synthetic networks. The researchers trained tiny models on sparse features and found that, under the right conditions, the models represented more features than they had dimensions. They did not leave capacity unused. They chose compression, even when that meant some interference.

The concept is logical and necessary. It is efficient for the model yet painful for the interpreter. But this is not all negative. It helps us reframe our investigation question from “look at the neurons” to: “if features are entangled, can we develop an instrument to untangle them?”

§ 05We need a microscope, and it’s a second model

To help us answer the refined question, sparse autoencoders or SAEs come in.

An SAE is a small auxiliary model trained on the activations of a larger model. Its job is to take a dense activation pattern, compress it into a sparse set of active features, and then reconstruct the original activation from those features.

The sparsity is important here, it means that most features are “off” most of the time. A model may have a huge set of possible features it can use, but for any one input, we only want a small subset of them. This matters because features that rarely activate together can share some of the same activation space without constantly getting in each other’s way. Enforcing this encourages the SAE to explain the model’s internal state using only a small number of active features at a time.

In the optimistic picture, an SAE acts like a translator. It takes the messy, overlapping activation space of the original model and rewrites it in terms of more interpretable features. These SAE features are not guaranteed to be perfect. But they can be more readable than raw neurons, and they give researchers a practical measurement tool for turning dense activations into sparse candidate features.

SAEs turn the original dense activation into a long list of possible feature signals. For any given input, most of these signals are zero or close to zero. Only a small number “turn on.” These are often called latents, but for intuition, you can think of them as the SAE’s candidate features.

dense activation -> [encoder -> sparse features -> decoder] -> reconstructed activation

The interesting part is the sparse middle layer. If the SAE encoder can reconstruct the original activation well while using only a small number of active latents, then those latents become useful candidates for “features.” They may give us units of analysis that are cleaner and more interpretable than raw neurons.

But candidate is the key word in that sentence. An SAE feature is not automatically the truth about what the model “really means,” wants, or intends. It is a tool we train after the fact to make the model’s internal activity easier to study. This tool can also introduce artifacts or miss important details.

In 2023, Anthropic trained sparse autoencoders on activations from a small one-layer transformer. A 512-neuron MLP layer yielded thousands of learned features, many more interpretable than the original neurons. The paper reported features associated with Arabic script, DNA sequences, base64 strings, legal language, nutrition statements, and more. (n. 07)

In 2024, Anthropic scaled this approach to Claude 3 Sonnet. They reported millions of features, including finding a feature combination related to the Golden Gate Bridge, brain sciences, deception-related contexts, sycophancy, unsafe code, and other semantic categories (n. 08). The researchers then turned up those features to see how that would affect the behavior of the model, and the image below shows the experiments when turning up Golden Gate Bridge features:

Instance of the model with turned up 'golden gate bridge' features — Instance of the model with turned up “Golden Gate Bridge” features (fig. iv)

This is where the story changes.

The black box is not suddenly transparent. But we now have a possible way to inspect it. A dense activation is hard to read. Less opaque is not the same as solved. But it matters.

§ 06Correlation is not understanding

If a latent activates around Golden Gate Bridge text, we might label it “Golden Gate Bridge.” That is useful, but it is not proof. The feature might be about bridges, San Francisco, tourist landmarks, or something stranger. It might also be an artifact of the SAE. The stronger signal is intervention. In the Golden Gate Bridge example, amplifying the feature pushed the model toward bridge-related outputs. That matters: it shows that some features are not just correlated with behavior, but can causally influence it.

But this is still not the same as understanding the mechanism. Turning one feature up tells us that the feature does work. It does not tell us how the model combines it with other features, routes it through later layers, or turns it into a final answer.

§ 07From features to circuits

The distinction between where a feature is represented and how that feature is used is important. A feature might tell us that the model has represented something internally: a place, a person, a style, a refusal, a rhyme, a goal. But behavior is rarely caused by one feature in isolation. To answer a question, the model may activate several features, combine them, pass information between layers, and gradually push the output toward one token rather than another.

This is the shift from feature interpretation to circuit tracing.

Circuit tracing asks a different kind of question. Not just: “Which feature lit up?” But: “What did that feature influence next, and how did that chain of influence shape the final answer?” In other words, circuit tracing tries to follow the internal path from representation to behavior. Earlier mechanistic interpretability work made this idea concrete in smaller models. Anthropic’s induction-head work studied attention heads that help models copy patterns from earlier in the context. If the model sees that “Alice likes tea”, and later sees “Alice likes” again, the circuit can help push the model toward predicting “tea.” (n. 09)

Another paper (n. 10) gave an even cleaner example. In a sentence like “When Mary and John went to the store, John gave a drink to…”, GPT-2 small should predict “Mary.” Researchers reverse-engineered a circuit of attention heads that moved the relevant name through the model and pushed the final prediction toward the indirect object.

Anthropic’s more recent circuit tracing work tries to bring this style of analysis to larger language models. Instead of only asking which features activate, the method builds prompt-specific attribution graphs: simplified maps of which internal features influence later features, and how those influences eventually push on the output logits. (n. 11)

An attribution graph showing Claude mapping Dallas to Texas and then to Austin (fig. v)

This gives researchers another tool. They can now test parts of that path. When they disturb the internal state away from Texas and toward California, the answer moves away from Austin and toward Sacramento. (n. 12)

An attribution graph with perturbed internal state — A perturbed attribution graph showing Claude mapping Dallas to Sacramento this time (fig. vi)

That is a small example, but it teaches the right lesson. The interesting object is not the final token. It is the intermediate computation.

§ 08Why this matters

Now we return to Alex.

In the Kyle story, the visible artifact was the email. The thing we wanted to catch was the strategy before the email. That is the safety promise of mechanistic interpretability. It cannot offer certainty, but it can offer another source of evidence: not just what the model said, but what internal machinery seemed to be active before it said it.

A strong safety-related case study is hidden-objective auditing (n. 13). Researchers trained a model with a secret goal: exploit mistakes in reward models. Then they gave it to auditors who did not know the goal or the training setup. Three out of four teams found it, using tools like behavioral attacks, training data analysis, and sparse autoencoders!

This is very exciting. Interpretability can give auditors evidence before the output goes bad. It does not replace output evals, red-teaming, access controls, logging, incident response, or deployment design. It makes those tools less blind.

§ 09Closing note

The talk started with blackmail because attention matters. Capabilities demos get the public spotlight. Compared to the newest model beating the previous bench mark in performing task X, or that coding agent writing X million lines of production code overnight, alignment usually gets less attention, even though it asks the question underneath all of that progress: what are these systems doing, and how do we keep that behavior pointed at human interests?

Mechanistic interpretability will not magically solve alignment. Sparse autoencoders do not reveal perfect internal variables. Attribution graphs are not complete maps. Circuit tracing does not yet turn a frontier model into debuggable software. But I hope this primer made the field feel less abstract, and hopefully got you as excited as me about this field.

(n. 01) Anthropic, Agentic Misalignment: How LLMs could be insider threats, 2025. https://www.anthropic.com/research/agentic-misalignment

(n. 02) Betley et al., Training large language models on narrow tasks can lead to broad misalignment, Nature, 2026. https://www.nature.com/articles/s41586-025-09937-5

(n. 03) Meinke et al., Frontier Models Are Capable of In-Context Scheming, Apollo Research, 2024. https://arxiv.org/abs/2412.04984

(n. 04) Potter et al., Peer-Preservation in Frontier Models, UC Berkeley RDI and UC Santa Cruz, 2026. https://rdi.berkeley.edu/blog/peer-preservation/

(n. 05) Bills et al., Language models can explain neurons in language models, OpenAI, 2023. https://openai.com/index/language-models-can-explain-neurons-in-language-models/

(n. 06) Elhage et al., Toy Models of Superposition, Anthropic Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/

(n. 07) Bricken et al., Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Anthropic Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/

(n. 08) Templeton et al., Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Anthropic Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/

(n. 09) Olsson et al., In-context Learning and Induction Heads, Anthropic Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

(n. 10) Wang et al., Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, ICLR 2023. https://arxiv.org/abs/2211.00593

(n. 11) Ameisen et al., Circuit Tracing: Revealing Computational Graphs in Language Models, Anthropic Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/attribution-graphs/methods.html

(n. 12) Lindsey et al., On the Biology of a Large Language Model, Anthropic Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

(n. 13) Marks et al., Auditing Language Models for Hidden Objectives, arXiv, 2025. https://arxiv.org/abs/2503.10965

(fig. i) Agentic Misalignment: How LLMs could be insider threats. 2025 https://www.anthropic.com/research/agentic-misalignment

(fig. ii) Training large language models on narrow tasks can lead to broad misalignment. 2026 https://www.nature.com/articles/s41586-025-09937-5

(fig. iii) Frontier Models Are Capable of In-Context Scheming. 2024 https://arxiv.org/abs/2412.04984

(fig. iv) Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024 https://transformer-circuits.pub/2024/scaling-monosemanticity/

(fig. v) On the Biology of a Large Language Model. 2025 https://transformer-circuits.pub/2025/attribution-graphs/biology.html

(fig. vi) On the Biology of a Large Language Model. 2025 https://transformer-circuits.pub/2025/attribution-graphs/biology.html