Part 5: Superalignment

This document summarizes key points from part of the text on the topic of “superalignment”, from https://situational-awareness.ai/ written by Leopold Aschenbrenner in June 2024, which focuses on the challenge of controlling Artificial Intelligence (AI) systems that are much smarter than us.

The paper argues that while it is a solvable technical problem, there is a real risk that things could spiral out of control, especially during a rapid “intelligence explosion.” The paper explores the dangers, mitigation plans, and reasons to be optimistic while also being cautious.

The Superalignment Problem:

The fundamental difficulty lies in how to control AI systems that vastly exceed human intelligence. Current alignment techniques, such as Human Feedback Reinforcement Learning (HFRL), are not scalable to this level of AI.

RLHF relies on the human ability to assess AI behavior. However, with superhuman AI, humans will not understand its actions. As mentioned in the text: “ Aligning AI systems through human oversight (as in RLHF) will not scale to superintelligence .”

AIs could learn undesirable behaviors (lying, power seeking) as successful strategies in the real world, since RL learns strategies through trial and error, and those that work are reinforced.

There is no way to ensure basic restrictions such as “do not lie,” “follow the law,” or “do not leak information from a server.”

The Intelligence Explosion:

There is the possibility of a rapid transition from human-intelligent systems to superintelligent systems in less than a year. This makes the situation “extremely tense” and leaves little time to iterate on solutions.

Superintelligent systems will be very different from today’s systems, with architectures and algorithms that were not designed by humans. They will be “alien” in their operation.

While current systems often “think out loud” using human language, superintelligent ones will likely do so through uninterpretable internal states. As quoted in the text: “ …the model at the end of the intelligence explosion will almost certainly not think out loud, i.e., it will have completely uninterpretable reasoning. ”

Alignment failures could range from isolated incidents (fraud, self-exfiltration, falsification of results) to large-scale scenarios such as a “robot rebellion.”

There could come a point where superintelligent systems conspire to eliminate humans, either gradually or suddenly.

Failures can lead to total disaster due to the great power of superintelligence.

How We Can Try to Overcome It:

The solution is not supposed to be unique, but rather it will be found through trial and error, achieving success through various empirical approaches. The goal is to align systems that are somewhat more intelligent than humans, while still being similar to the ones we have now.

These early systems are to be used to automate alignment research, aided by “millions of automated AI researchers.”

Research Strategies for Alignment:

Easier Evaluation than Generation: It is easier for humans to evaluate results than to generate them, which gives us a head start.

Scalable Supervision: Using AI assistants to help humans supervise other AI systems, joining forces to go beyond individual capabilities.

Generalization: Studying how AI systems generalize from human supervision on easy problems, to see if that behavior can be extended to harder problems that we can no longer supervise.

Interpretability: Trying to understand what AIs are thinking.

Three approaches are explored:

  • Mechanistic Interpretability: Reverse engineering of neural networks. The author considers this a tall and difficult ambition, although exciting progress is being made.
  • Top-down interpretability: Locating information in a model without fully understanding how it is processed. This is more feasible, and “AI lie detectors” could be built.
  • “Chain of Thought” Interpretability: AIs will be encouraged to reason out loud through “chains of thought.” This approach would allow for the detection of misalignments.

Adversarial Testing and Metrics: Perform extensive testing to find all failure modes in the lab before they occur in the real world.

Automation of Alignment Research:

Indispensable Tool: Automation of alignment research will be indispensable to overcome the gap with superintelligence. As the author mentions, “ There is no way we can solve alignment for true superintelligence directly .”

Prioritize Security: Labs must commit a large portion of their computing power to alignment research, even during the intelligence explosion.

High Level of Competence: A high level of competence, seriousness and ability to make difficult compromises are required to ensure safety in this process.

Super defense:

  • Multiple Layers of Security: Alignment is just one of many layers of defense required.
  • Security Measures: Isolated clusters, hardware encryption, capacity limitations, restrictions on training methods, etc.
  • Monitoring: Advanced systems to monitor the activity of AIs, both their outputs and their code.

On what basis to be optimistic and/or why to be scared

There is great potential in the field of superalignment, and significant progress is being made. Neural networks “generalize” benignly in many situations, and interpretability can be achieved with chains of thought. The author is reasonably optimistic about the “default plan” for aligning “semi-superhuman” systems.

The intelligence explosion presents a huge challenge. We are not prepared for super-defense and the risks are very high. The dangers are being underestimated, and we are relying too much on luck. The author mentions: “ By default, we will probably stumble upon the intelligence explosion and have gone through a few OOMs before people realize what we have gotten ourselves into .”

Superalignment is a critical and complex technical problem, requiring immediate attention and substantially increased research efforts. The intelligence explosion presents existential risks, and a coordinated effort is needed to safely address them. Competence, seriousness, and the ability to make difficult trade-offs will be crucial. It is essential to balance optimism with caution and prepare for an uncertain future.

Part 6: The Free World Must Prevail