EA - More Is Probably More - Forecasting Accuracy and Number of Forecasters on Metaculus by nikos

The Nonlinear Library: EA Forum - A podcast by The Nonlinear Fund

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More Is Probably More - Forecasting Accuracy and Number of Forecasters on Metaculus, published by nikos on January 31, 2023 on The Effective Altruism Forum.

TLDR

An increase in the number of forecasters seems to lead to an improvement of the Metaculus community prediction. I believe this effect is real, but due to confounding effects, the analysis presented here may overestimate the improvement gained.

That improvement of the Metaculus community prediction seems to be approximately logarithmic, meaning that doubling the number of forecasters leads to a roughly constant (albeit probably diminishing) relative improvement in performance in terms of Brier score: going from 100 to 200 forecasters gives you a relative improvement in Brier score almost as large as going from 10 to 20 (e.g. an improvement by X percent). Note, though, that it is somewhat unclear what "an improvement in Brier score by X" actually means in terms of forecast quality.

Increasing the number of forecasters on Metaculus seems not only to improve performance on average, but also to decrease the variability of predictions, making them more stable and reliable.

This analysis complements another existing one and comes to similar conclusions. Both analyses suffer from potential biases, but different ones.

All code used for this analysis can be found here.

Introduction

One of the central wisdoms in forecasting is that an ensemble of forecasts is more than the sum of its parts. Take a crowd of forecasters and average their predictions: the resulting ensemble will usually be more accurate than almost all of the individual forecasts.

But how does the performance of the ensemble change as you increase the number of forecasters? Are fifty forecasters ten times as good as five? Are five hundred even better? Charles Dillon looked at this a while ago using Metaculus data. He broadly found that more forecasters usually means better performance. Specifically, he estimated that doubling the number of forecasters would reduce the average Brier score by 0.012 points. The Brier score is a metric commonly used to evaluate forecasts of questions with a binary yes/no outcome and equals the squared difference between the outcome (0 or 1) and the forecast; smaller values are better. Charles concluded that in practice a Metaculus community prediction with only ten forecasters is not a lot less reliable than a community prediction with thirty forecasters.
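As a quick illustration of how the Brier score behaves (a minimal sketch with made-up forecasts and outcomes, not code from the analysis itself):

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared difference between a probabilistic forecast and the binary outcome (0 or 1)."""
    return (outcome - forecast) ** 2

# A confident, correct forecast scores close to 0 (lower is better) ...
print(brier_score(0.9, 1))  # ~0.01
# ... while a confident but wrong forecast scores close to 1.
print(brier_score(0.9, 0))  # ~0.81
# A reduction of 0.012 points would move, say, an average score of 0.110 to 0.098.
```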
Charles' analysis was restricted to aggregated data, which means that he had access to the Metaculus community prediction, but not to individual-level data. This makes the analysis susceptible to potential biases. For example, it could be the case that forecasters really like easy questions, and that the questions which attracted fewer forecasters were genuinely harder. We would then expect to see worse performance on questions with fewer forecasters even if the number of forecasters had no actual effect on performance. In this post I will try to shed some more light on the question, this time making use of individual-level data.

Methodology

To examine the effect of the number of forecasters on the performance of the community prediction, we can use a technique called "bootstrapping". The idea is simple. Take a question that has n = 200 forecasters, and suppose we are interested in how the community prediction would have performed had there been only n = 5, 10, 20, or 50 forecasters instead of the actual 200. To find out, we can take a random sample of n (e.g. 5) forecasters, discard all the others, compute a community prediction based on just these 5 forecasters, and see how it would have performed. One data point isn't all too informative, so we repeat the process a few thousand times and average the results. Now we do that for other...
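A minimal sketch of that resampling procedure in Python (my own illustration, not the post's actual code; the made-up forecasts and the use of the median as the aggregation rule are assumptions, since the excerpt doesn't specify how the community prediction is computed from the sampled forecasts):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def bootstrap_brier(forecasts, outcome, n, n_reps=5000):
    """Average Brier score of a community prediction built from random
    subsamples of n forecasters, repeated n_reps times."""
    scores = []
    for _ in range(n_reps):
        # Sample n forecasters without replacement and discard the rest.
        sample = rng.choice(forecasts, size=n, replace=False)
        # Illustrative aggregation rule: the median of the sampled forecasts.
        community_prediction = np.median(sample)
        scores.append((outcome - community_prediction) ** 2)
    return float(np.mean(scores))

# Made-up question with 200 forecasts that resolved "yes" (outcome = 1).
forecasts = rng.beta(a=6, b=3, size=200)  # probabilities clustering around ~0.67
for n in [5, 10, 20, 50]:
    print(n, round(bootstrap_brier(forecasts, outcome=1, n=n), 4))
```

Comparing the averaged scores across the different values of n, over many questions, is what traces out how accuracy scales with the number of forecasters.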
