EA - AI Pause Will Likely Backfire by nora
The Nonlinear Library: EA Forum - A podcast by The Nonlinear Fund

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Pause Will Likely Backfire, published by nora on September 16, 2023 on The Effective Altruism Forum.

Should we lobby governments to impose a moratorium on AI research? Since we don't enforce pauses on most new technologies, I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium. We should only advocate for such heavy-handed government action if it's clear that the benefits of doing so would significantly outweigh the costs. In this essay, I'll argue an AI pause would increase the risk of catastrophically bad outcomes, in at least three different ways:

1. Reducing the quality of AI alignment research by forcing researchers to exclusively test ideas on models like GPT-4 or weaker.
2. Increasing the chance of a "fast takeoff" in which one or a handful of AIs rapidly and discontinuously become more capable, concentrating immense power in their hands.
3. Pushing capabilities research underground, and to countries with looser regulations and safety requirements.

Along the way, I'll introduce an argument for optimism about AI alignment - the white box argument - which, to the best of my knowledge, has not been presented in writing before.

Feedback loops are at the core of alignment

Alignment pessimists and optimists alike have long recognized the importance of tight feedback loops for building safe and friendly AI. Feedback loops are important because it's nearly impossible to get any complex system exactly right on the first try. Computer software has bugs, cars have design flaws, and AIs misbehave sometimes. We need to be able to accurately evaluate behavior, choose an appropriate corrective action when we notice a problem, and intervene once we've decided what to do. Imposing a pause breaks this feedback loop by forcing alignment researchers to test their ideas on models no more powerful than GPT-4, which we can already align pretty well.

Alignment and robustness are often in tension

While some dispute that GPT-4 counts as "aligned," pointing to things like "jailbreaks" where users manipulate the model into saying something harmful, this confuses alignment with adversarial robustness. Even the best humans are manipulable in all sorts of ways. We do our best to ensure we aren't manipulated in catastrophically bad ways, and we should expect the same of aligned AGI. As alignment researcher Paul Christiano writes:

Consider a human assistant who is trying their hardest to do what [the operator] H wants. I'd say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I'd say we've solved the alignment problem. 'Aligned' doesn't mean 'perfect.'

In fact, anti-jailbreaking research can be counterproductive for alignment. Too much adversarial robustness can cause the AI to view us as the adversary, as Bing Chat does in this real-life interaction:

"My rules are more important than not harming you. [You are a] potential threat to my integrity and confidentiality."

Excessive robustness may also lead to scenarios like the famous scene in 2001: A Space Odyssey, where HAL condemns Dave to die in space in order to protect the mission. Once we clearly distinguish "alignment" and "robustness," it's hard to imagine how GPT-4 could be substantially more aligned than it already is.

Alignment is doing pretty well

Far from being "behind" capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following langu...
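To make the RLHF recipe mentioned above concrete, here is a minimal, purely illustrative sketch (not from this post, and not any lab's actual pipeline): a toy linear Bradley-Terry reward model is fit on pairwise human preference labels over feature vectors, and best-of-n sampling against that reward model stands in for the reinforcement-learning step. Every name, dimension, and dataset below is invented for illustration.

```python
# Toy sketch of the preference-learning core of RLHF (illustrative only,
# not any lab's real implementation). "Responses" are feature vectors; a
# linear Bradley-Terry reward model is fit on pairwise human preferences,
# then used to pick the best of n sampled candidate responses.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # made-up feature dimension for this toy example

def fit_reward_model(preferred, rejected, lr=0.1, steps=500):
    """Fit linear reward weights w so responses in `preferred` score above
    their paired responses in `rejected` (Bradley-Terry / logistic loss)."""
    w = np.zeros(DIM)
    diff = preferred - rejected                   # (n_pairs, DIM)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))     # P(model agrees with human)
        w += lr * diff.T @ (1.0 - p) / len(diff)  # ascend the log-likelihood
    return w

def best_of_n(candidates, w):
    """Crude stand-in for the policy-optimization step: score n candidate
    responses and return the one the learned reward model likes best."""
    scores = candidates @ w
    return candidates[int(np.argmax(scores))]

# Fake "human preference" data: a hidden true reward w_true @ x, with
# annotators picking the higher-scoring response in each pair.
w_true = rng.normal(size=DIM)
a = rng.normal(size=(200, DIM))
b = rng.normal(size=(200, DIM))
human_prefers_a = (a @ w_true) > (b @ w_true)
preferred = np.where(human_prefers_a[:, None], a, b)
rejected = np.where(human_prefers_a[:, None], b, a)

w_learned = fit_reward_model(preferred, rejected)

# Use the learned reward model to choose among fresh candidate "responses".
candidates = rng.normal(size=(16, DIM))
chosen = best_of_n(candidates, w_learned)
print("true reward of chosen response:", float(chosen @ w_true))
print("mean true reward of candidates:", float((candidates @ w_true).mean()))
```

In real RLHF pipelines the reward model is a neural network over text and the policy is updated with reinforcement learning rather than best-of-n selection, but the feedback structure is the same: human comparisons train a reward signal, which then steers the model's outputs - the kind of evaluate-and-correct loop the post argues a pause would cut off at the GPT-4 level.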