Significant news: two breakthroughs in interpretability, a subfield of AI Alignment, came out this week.

What is AI Alignment again? AI Alignment aims to align a superintelligence, an AI that far outsmarts us across the board, with human values. Such an AI would be so powerful that pulling the plug is no longer possible if it misbehaves. But it shouldn’t misbehave, since it is acting according to our values. That is the grand plan of AI Alignment.

A major obstacle to achieving AI Alignment has been our poor understanding of the inner workings of the leading AI models, Large Language Models (LLMs). These models have been described as ‘giant inscrutable matrices’. Because no one could explain why LLMs produce the outputs they do, many researchers feared ‘deceptive alignment’: an AI might act as if it were following our values while in fact pursuing other goals. Interpretability aims to change this situation by making LLMs transparent. It is as if we are connecting an LLM to a lie detector.

In the first breakthrough (paper), Andy Zou et al. identify LLM “brain regions” that correspond to an internal concept of truth. They then “control these to influence hallucinations, bias, harmfulness, and whether a LLM lies”, tweets Zou. “Much like brain scans such as PET and fMRI, we have devised a scanning technique called LAT to observe the brain activity of LLMs as they engage in processes related to concepts like truth and activities such as lying.”
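To make this concrete, here is a deliberately simplified numpy sketch of the core idea behind such “reading” techniques. It uses synthetic data rather than real LLM activations: in the real setting, the vectors below would be hidden-layer activations recorded while the model processes true versus false statements. The dimensions, the hidden “truth direction”, and the difference-of-means probe are all illustrative assumptions, not the paper’s actual LAT procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for LLM hidden activations. In the real setting these
# would be recorded at a chosen layer while the model reads true/false statements.
dim = 64
truth_direction = rng.normal(size=dim)
truth_direction /= np.linalg.norm(truth_direction)

def synthetic_activations(is_true: bool, n: int) -> np.ndarray:
    # Activations for "true" statements are shifted along a hidden direction.
    shift = 2.0 if is_true else -2.0
    return rng.normal(size=(n, dim)) + shift * truth_direction

true_acts = synthetic_activations(True, 100)
false_acts = synthetic_activations(False, 100)

# A simple linear "reading vector": the difference of the class means.
reading_vector = true_acts.mean(axis=0) - false_acts.mean(axis=0)
reading_vector /= np.linalg.norm(reading_vector)

def truth_score(activation: np.ndarray) -> float:
    # Positive: looks "true" to the probe; negative: looks "false".
    return float(activation @ reading_vector)

# Held-out check: does the probe separate fresh true/false activations?
verdicts = [truth_score(a) > 0 for a in synthetic_activations(True, 50)]
verdicts += [truth_score(a) < 0 for a in synthetic_activations(False, 50)]
accuracy = float(np.mean(verdicts))
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

The actual LAT technique involves more machinery (carefully designed stimuli and statistical analysis of the resulting activations), but the core move, treating a concept like truth as a direction in activation space that can be read and steered, is the same.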

Dan Hendrycks, co-author of the paper and director of the Center for AI Safety (CAIS), which put out the much-quoted statement warning of AI-induced human extinction last spring, says he has “become less concerned about AIs lying to humans/rogue AIs”. His concern now focuses on other scenarios, such as bad actors producing bioweapons and the rapid replacement of people by AI. These worries are large enough by themselves, but it would still be a major change if what Hendrycks calls ‘rogue AIs’, unaligned AIs that can take over, were off the table. His views also carry weight since he seems to have the ear of Rishi Sunak, the UK Prime Minister and organizer of the AI Safety Summit.

The second breakthrough is announced by Anthropic: “The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.” Some commentators herald this as “earth-shattering news” and even claim that “safe superintelligence is coming”, although most seem to agree this is vastly overoptimistic.
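The roadblock Anthropic describes stems from “superposition”: a model with fewer neurons than concepts represents concepts as overlapping directions, so a single neuron responds to several unrelated things at once. Their method learns a sparse dictionary of feature directions from activations. The toy numpy sketch below is not their method; it illustrates the same decomposition idea with a made-up dictionary and greedy matching pursuit standing in for a trained sparse autoencoder.

```python
import numpy as np

rng = np.random.default_rng(1)

# Superposition toy: 64 "neurons" but 128 underlying features, so individual
# neurons are polysemantic. This dictionary of feature directions is made up;
# in the real method, such a dictionary is learned from model activations.
n_neurons, n_features = 64, 128
dictionary = rng.normal(size=(n_features, n_neurons))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# A hypothetical activation vector: a sparse combination of two features.
active_features = {3: 1.5, 11: 0.8}
activation = sum(c * dictionary[i] for i, c in active_features.items())

def matching_pursuit(x, D, n_steps=2):
    """Greedy sparse decomposition: repeatedly pick the feature direction that
    best explains the residual. A crude stand-in for what a trained sparse
    autoencoder does in a single forward pass."""
    residual = x.copy()
    coefficients = {}
    for _ in range(n_steps):
        scores = D @ residual
        best = int(np.argmax(np.abs(scores)))
        coefficients[best] = coefficients.get(best, 0.0) + float(scores[best])
        residual = residual - scores[best] * D[best]
    return coefficients

recovered = matching_pursuit(activation, dictionary)
reconstruction = sum(c * dictionary[i] for i, c in recovered.items())
error = float(np.linalg.norm(activation - reconstruction))
print("features found active:", sorted(recovered))
```

The payoff in the real work is that a dictionary learned from actual activations, unlike this synthetic one, can be inspected: each recovered feature direction can be checked against the inputs that activate it, making previously uninterpretable neuron groups legible.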

Where does this leave us? Can AI still make humanity go extinct? Despite the euphoria, unfortunately the answer is yes.

AI Alignment still has many open issues. For one, it is completely unclear whose values we should align an AI to, and how to distill and aggregate those values. One recently proposed approach is to “just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate ‘human value function'”. This is just one potential solution and not necessarily representative of what leading labs might actually implement. But it does illustrate that there is essentially no plan for whose values to use and how to aggregate them, let alone a plan for which public buy-in has been obtained or even sought.

A second unsolved issue is how to technically implement these values in an AI model. According to Eliezer Yudkowsky, a leading AI Safety voice, this is by far the hardest part of the problem. He tweets about the recent interpretability progress: “Expected 1 unit of progress, got 2, remaining 998.” In this long but influential post, he explains which problems remain to be solved to avoid human extinction.

Scaling up interpretability will remain a technical challenge for now. And even if it succeeds: interpretability may make the giant piles of matrices more scrutable, but they will still be giant piles of matrices. Human interpretation of what exactly they will do will remain a challenge, and one where mistakes are easily made. Playing around with technology that could end our species, even if that technology can be made increasingly transparent, will remain a highly dangerous exercise. Even more worrying, these models will likely end up in the open-source realm sooner or later, either through leaks or through reproduction. If that happens, it seems only a matter of time before they are run without any safety measures, by careless actors or actors with bad intent.

Beyond these worries, there are also fears that interpretability could increase danger. For example, if we really do find a way to understand the inner workings of frontier AI models, that knowledge will likely be used to increase capabilities even faster. That probably means we can expect uncontrollable superintelligence sooner, leading to higher risk levels. Also, if we can understand an AI, perhaps an AI can understand AI better as well. Nik Samoylov tweets: “as AI becomes interpretable to humans, it also becomes interpretable to itself. Prime ingredient for recursive self-improvement.” Self-improvement, an AI autonomously creating better AI, has been a prime concern for many who worry about existential risk.

At the Existential Risk Observatory, we think research breakthroughs such as these are very interesting, but we still think playing around with technology that can cause human extinction is a fundamentally bad idea. We should have an open debate about whether this is the right thing to do, and if not, what is the most effective way to reduce risks. We are hopeful that in a world with full public awareness of the extinction risks that new technologies such as AI bring, we will be able to solve this problem together.