EA - Takeaways from the Metaculus AI Progress Tournament by Javier Prieto
The Nonlinear Library: EA Forum - A Podcast by The Nonlinear Fund

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Takeaways from the Metaculus AI Progress Tournament, published by Javier Prieto on July 27, 2023 on The Effective Altruism Forum.

In 2019, Open Philanthropy commissioned a set of forecasts on AI progress from Metaculus. The forecasting questions had time horizons between 6 months and >6 years. As of June 2023, 69 of the 111 questions had been resolved unambiguously. In this post, I analyze the accuracy of these forecasts as a function of question (sub)category, crowd size, and forecasting horizon. Unless otherwise indicated, my analyses are about Metaculus's proprietary aggregate forecast ("the Metaculus prediction") evaluated at the time the question closed.

Related work

Feel free to skip to the next section if you're already familiar with these analyses.

This analysis, published two years ago (July 2021), looked at 64 resolved AI questions and concluded there was weak but ultimately inconclusive evidence of bias towards faster progress.

A more recent analysis from March 2023 found that Metaculus had a worse Brier score on (some) AI questions than on average across all questions, and presented a few behavioral correlates of accuracy within AI questions, e.g. accuracy was poorer on questions with more updates and when those updates were less informative in a certain technical sense (see the post for details).

Metaculus responded to the previous post with a more comprehensive analysis that included all resolved AI questions (152 in total, 64 of which were binary and 88 continuous). They show that performance is significantly better than chance for both question types and marginally better than was claimed in the previous analysis (which relied on a smaller sample of questions), though still worse than the average for all questions on the site.

The analysis I present below overlaps with those three, but it fills an important gap by studying whether there is systematic over- or under-optimism in Metaculus's AI progress predictions, using data from a fairly recent tournament that had monetary incentives and thus (presumably) should have resulted in more careful forecasts.

Key takeaways

NB: These results haven't been thoroughly vetted by anyone else. The conclusions I draw represent my views, not Open Phil's.

- Progress on benchmarks was underestimated, while progress on other proxies (compute, bibliometric indicators, and, to a lesser extent, economic indicators) was overestimated. [more]
  - This is consistent with a picture where AI progresses surprisingly rapidly on well-defined benchmarks, but the attention it receives and its "real world" impact fail to keep up with performance on said benchmarks.
  - However, I see a few problems with this picture:
    - It's unclear to me how some of the non-benchmark proxies are relevant to AI progress, e.g.:
      - The TOP500 compute benchmark is mostly about supercomputers that (AFAICT) are mostly used to run numerical simulations, not to accelerate AI training and inference. In fact, some of the top performers don't even have GPUs.
      - The number of new preprints in certain ML subfields over short (~6-month) time horizons may depend more on conference publication cycles than on underlying growth.
    - Most of these forecasts came due before or very soon after the release of ChatGPT and GPT-4 / Bing, a time that felt qualitatively different from where we are today.
- Metaculus narrowly beats chance and performs worse in this tournament than on average across all continuous questions on the site, despite the prize money. This could indicate that these questions are inherently harder, or that they drove less or lower-quality engagement. [more]
- There's no strong evidence that performance was significantly worse on questions with longer horizons (