EA - AI Safety Bounties by PatrickL
The Nonlinear Library: EA Forum - A podcast by The Nonlinear Fund

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety Bounties, published by PatrickL on August 25, 2023 on The Effective Altruism Forum.

I spent a while at Rethink Priorities considering "AI safety bounties": programs where public participants or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). I considered the benefits and risks of such programs and suggested potential program models. This report presents my takeaways from that process.

Short summary

Safety bounties could be valuable for legitimizing examples of AI risks, bringing more talent to stress-test systems, and identifying common attack vectors.

I expect safety bounties to be worth trialing for organizations working on reducing catastrophic AI risks. Traditional bug bounties seem fairly successful: they attract roughly one participant per $50 of prize money, and they have become increasingly popular with software firms over time. The most analogous program for AI systems led to relatively few useful examples compared to other stress-testing methods, but one knowledgeable interviewee suggested that future programs could be significantly improved.

However, I am not confident that bounties will continue to be net-positive as AI capabilities advance. At some point, I think the accident risk and harmful knowledge proliferation from open-sourcing stress-testing may outweigh the benefits of bounties.

In my view, the most promising structure for such a program is a third party defining dangerous capability thresholds ("evals") and providing rewards for hunters who expose behaviors which cross these thresholds. I expect trialing such a program to cost up to $500k if well-resourced, and to take four months of operational and researcher time from safety-focused people.

I also suggest two formats for lab-run bounties: open contests with subjective prize criteria decided on by a panel of judges, and private invitations for trusted bug hunters to test labs' internal systems.

Author's note: This report was written between January and June 2023. Since then, safety bounties have become a more established part of the AI ecosystem, which I'm excited to see. Beyond defining and proposing safety bounties as a general intervention, I hope this report can provide useful analyses and design suggestions for readers already interested in implementing safety bounties, or in better understanding these programs.

Long summary

Introduction and bounty program recommendations

One potential intervention for reducing catastrophic AI risk is AI safety bounties: programs where members of the public or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). In this research report, I explore the benefits and downsides of safety bounties and conclude that they are probably worth the time and money to trial for organizations working on reducing the catastrophic risks of AI.
In particular, testing a handful of new bounty programs could cost $50k-$500k per program and require one to six months of full-time-equivalent work from project managers at AI labs or from entrepreneurs interested in AI safety (depending on each program's model and ambition level).

I expect safety bounties to be less successful for the field of AI safety than bug bounties are for cybersecurity, due to the greater difficulty of quickly fixing issues with AI systems. I am unsure whether bounties remain net-positive as AI capabilities increase to more dangerous levels. This is because, as AI capabilities increase, I expect safety bounties (and adversarial testing in general) to potentially generate more harmful behaviors. I also expect the benefits of the talent pipeline brought by safety bounties to diminish. I suggest an informal way to monitor the r...