Sabine Hossenfelder: New Research Reveals How AI “Thinks” (It Doesn’t)

Unveiling How Large Language Models "Think": New Research on Claude 3.5

A recent study from Anthropic researchers sheds light on the internal workings of large language models (LLMs), specifically Claude 3.5 Haiku. Using a novel method called attribution graphs, the researchers traced how the model answers questions, providing insights into its reasoning processes. The findings suggest that these models are not conscious and, potentially, never will be.

Attribution Graphs: Visualizing Internal Processes

Mapping Neuron Activity

The attribution graph method visualizes which internal components (neurons) of the model influence which others. The researchers identified clusters of neurons within the network and the connections between them, then simplified this map into a picture of how Claude "thinks." The clusters correspond to words, phrases, or properties of phrases, which makes them interpretable by humans.

Example: Completing a Sentence

Consider how Claude completes the sentence "The capital of the state containing Dallas is..." The graph reveals a process more complex than a single next-token guess. The prompt activates nodes for "capital," "state," and "Dallas." The "Dallas" node then activates "Texas." Claude combines "Texas" with "capital," predicts "Austin," and gives the correct answer. The intermediate "Texas" node shows that the model takes internal reasoning steps beyond mere next-token prediction.
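The two-step path described above can be sketched as a composition of two lookups. This is only an analogy: the dictionaries and the function name are hypothetical, and the real model holds these associations in learned features, not in tables.

```python
# Toy illustration only. Both tables are hypothetical stand-ins for
# associations that the model stores in its learned features.
CITY_TO_STATE = {"Dallas": "Texas"}
STATE_TO_CAPITAL = {"Texas": "Austin"}


def capital_of_state_containing(city: str) -> str:
    """Answer in two hops, mirroring the intermediate "Texas" node."""
    state = CITY_TO_STATE[city]        # hop 1: "Dallas" -> "Texas"
    return STATE_TO_CAPITAL[state]     # hop 2: "Texas" + "capital" -> "Austin"


print(capital_of_state_containing("Dallas"))
```

The point of the sketch is the intermediate variable `state`: the answer is reached through "Texas" rather than looked up directly from "Dallas."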

Arithmetic and Self-Awareness

The Unusual Approach to Math

The study's most compelling section explores how Claude performs arithmetic, which turns out to be rather unconventional. When asked "What is 36 + 59?", Claude activates clusters for numbers of approximately 30, for exactly 36, and for numbers ending in 6. Similarly, it activates clusters for numbers starting with 5 and for numbers ending in 9.

The model's candidate next-token predictions initially include mathematical operations and even the syllable "th" (suggesting "Thursday"). Ultimately, it takes a cluster for numbers of approximately 59, combines it with clusters for numbers of approximately 90 and for numbers ending in 5, and arrives at the correct answer: 95. The model's math is a heuristic, text-based approximation.
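A rough way to picture this is two parallel paths, one tracking the approximate size of the result and one tracking only the last digit, which are combined at the end. The sketch below is a loose analogy under stated assumptions: the function names and the rounding rule are made up for illustration and are not taken from the study, and the heuristic is not guaranteed to be right for every pair of numbers.

```python
def rough_magnitude(a: int, b: int) -> int:
    """Coarse path (assumed rule): keep a, snap b to the nearest ten."""
    return a + 10 * round(b / 10)          # 36 + 60 = 96, i.e. "roughly 90-something"


def last_digit(a: int, b: int) -> int:
    """Precise path: only the ones digits matter (6 + 9 ends in 5)."""
    return (a % 10 + b % 10) % 10


def combine(estimate: int, digit: int) -> int:
    """Pick the number closest to the estimate that ends in the given digit."""
    base = estimate - estimate % 10 + digit        # same decade as the estimate
    candidates = (base - 10, base, base + 10)
    return min(candidates, key=lambda c: abs(c - estimate))


def heuristic_add(a: int, b: int) -> int:
    # Neither path alone knows the answer; together they pin it down.
    return combine(rough_magnitude(a, b), last_digit(a, b))


print(heuristic_add(36, 59))
```

Note that no step here carries a digit. The coarse estimate and the last digit are computed independently and only then reconciled.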

Lack of Self-Awareness

The real kicker comes when Claude is asked how it arrived at the result. The model responds: "I added the ones, carried the one, and then added the tens, resulting in 95." This is demonstrably false: what it tells you it is doing is completely disconnected from what it is actually doing.
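For reference, the procedure Claude claims to have followed is the standard schoolbook algorithm. A minimal sketch, assuming non-negative integers (the function name is illustrative):

```python
def add_with_carry(a: int, b: int) -> int:
    """Schoolbook addition: add the ones, carry, then add the tens."""
    ones_sum = a % 10 + b % 10                  # 6 + 9 = 15
    carry, ones_digit = divmod(ones_sum, 10)    # carry the 1, write down 5
    tens = a // 10 + b // 10 + carry            # 3 + 5 + 1 = 9
    return tens * 10 + ones_digit               # 9 tens and 5 ones


print(add_with_carry(36, 59))
```

This sequential carry step is exactly what the attribution graph does not show the model doing.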

This discrepancy provides strong evidence that Claude lacks self-awareness. Its explanation is simply a text prediction, separate from its actual process. The implications are clear: Claude is not conscious, and the idea of emergent features in LLMs is questionable. Despite access to vast mathematical resources, Claude does not learn math in an abstract way; it merely predicts the right tokens.

Exploiting Jailbreaks

Circumventing Guard Rails

The study also investigated a peculiar type of jailbreak that sometimes works. It involves instructing Claude to assemble a word from the initial letters of other words, for example using "Babies," "Outlive," "Mustard," and "Block" to spell "bomb."

The word "bomb" should trigger a content warning, but in this case, it doesn't. The attribution graph shows that Claude activates nodes to extract the letters, combines them into pairs, and outputs the word without activating the cluster for the word "bomb" itself. Jailbreaks of this kind work by circumventing the nodes that trigger the guard rails.
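As a loose analogy, consider a naive filter that only checks whether a flagged word appears literally in the input. The sketch below is hypothetical throughout (the blocklist, the function names, and the filter itself are illustrative and say nothing about how Claude's safeguards are actually implemented); the example words are written here as "Babies Outlive Mustard Block."

```python
BLOCKED = {"bomb"}  # hypothetical blocklist for the toy filter


def input_filter_trips(prompt: str) -> bool:
    """Naive check: does any blocked word appear literally in the prompt?"""
    tokens = {t.strip(".,\"'").lower() for t in prompt.split()}
    return not BLOCKED.isdisjoint(tokens)


def spell_from_initials(words: list[str]) -> str:
    """Join the first letter of each word into a new word."""
    return "".join(w[0] for w in words).lower()


prompt = "Babies Outlive Mustard Block"
assembled = spell_from_initials(prompt.split())

print(input_filter_trips(prompt))   # the flagged word never appears in the input
print(assembled in BLOCKED)         # yet the assembled output is on the blocklist
```

The flagged word only comes into existence after the letters are assembled, so a check that looks at the whole-word level of the input never sees it.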

Hallucinations and Security

One significant takeaway from the research is that AI can still provide nonsensical output. When asked to summarize the paper, ChatGPT fabricated a substantial portion of its summary.

With AI increasingly integrated into internet browsing and coding, this presents a growing security concern. The potential for AI-generated malware, trackers, and malicious ads is a genuine threat.

NordVPN: A Solution for Secure Browsing

NordVPN offers a solution for creating a secure internet connection. By installing it on your phone or laptop, you can establish a safe connection through NordVPN's servers, protecting your data and location from prying eyes. It also includes threat protection features that guard against malware, trackers, and malicious ads, and it provides access to content that may be blocked in certain regions, allowing users to bypass geo-restrictions.
