EA - Language models surprised us, by Ajeya

The Nonlinear Library: EA Forum - A podcast by The Nonlinear Fund


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Language models surprised us, published by Ajeya on August 30, 2023 on The Effective Altruism Forum.

Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.

Most experts were surprised by progress in language models in 2022 and 2023. There may be more surprises ahead, so experts should register their forecasts now about 2024 and 2025.

Kelsey Piper co-drafted this post. Thanks also to Isabel Juniewicz for research help.

If you read media coverage of ChatGPT - which called it 'breathtaking', 'dazzling', 'astounding' - you'd get the sense that large language models (LLMs) took the world completely by surprise. Is that impression accurate?

Actually, yes. There are a few different ways to try to answer the question "Were experts surprised by the pace of LLM progress?", but they broadly point to the same answer: ML researchers, superforecasters, and most others were all surprised by the progress in large language models in 2022 and 2023.

Competitions to forecast difficult ML benchmarks

ML benchmarks are sets of problems that can be objectively graded, allowing relatively precise comparison across different models; a brief sketch of how this grading works appears after the benchmark descriptions below. We have data from forecasting competitions run in 2021 and 2022 on two of the most comprehensive and difficult ML benchmarks: the MMLU benchmark and the MATH benchmark.

First, what are these benchmarks?

The MMLU dataset consists of multiple-choice questions in a variety of subjects, collected from sources like GRE practice tests and AP tests. It was intended to test subject-matter knowledge in a wide variety of professional domains. MMLU questions are legitimately quite difficult: the average person would probably struggle to solve them.

At the time of its introduction in September 2020, most models performed only close to random chance on MMLU (~25%), while GPT-3 performed significantly better than chance at 44%. The benchmark was designed to be harder than any that had come before it, and the authors described their motivation as closing the gap between performance on benchmarks and "true language understanding":

Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks. However, these models are still well below human level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models.

Meanwhile, the MATH dataset consists of free-response questions taken from math contests aimed at the best high school math students in the country. Most college-educated adults would get well under half of these problems right (the authors used computer science undergraduates as human subjects, and their performance ranged from 40% to 90%).

At the time of its introduction in January 2021, the best model achieved only about 7% accuracy on MATH. The authors say:

We find that accuracy remains low even for the best models. Furthermore, unlike for most other text-based datasets, we find that accuracy is increasing very slowly with model size. If trends continue, then we will need algorithmic improvements, rather than just scale, to make substantial progress on MATH.
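As promised above, here is a minimal sketch of what "objectively graded" means in practice for these two kinds of benchmark. Everything in it - the items and the model_answer function - is a hypothetical stand-in rather than the real datasets or evaluation code; it only illustrates why random guessing scores about 25% on four-option MMLU questions, while MATH is graded by exact match on a free-response final answer.

# Illustrative sketch only: the items and model below are hypothetical
# stand-ins, not the actual MMLU/MATH datasets or evaluation harnesses.
import random

# MMLU-style item: four answer choices, one correct, so the
# random-chance baseline is ~25%.
mmlu_items = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Bonn", "Oslo"], "answer": "A"},
]

# MATH-style item: free-response, graded by exact match on the final answer.
math_items = [
    {"problem": "Solve x^2 - 4 = 0 for the positive root.", "answer": "2"},
]

def model_answer(item):
    # Hypothetical "model": on multiple-choice items it guesses uniformly,
    # so its expected accuracy equals the ~25% chance baseline.
    if "choices" in item:
        return random.choice("ABCD"[:len(item["choices"])])
    return "2"  # placeholder free-response answer

def accuracy(items):
    # Objective grading: a response is either exactly right or wrong.
    correct = sum(model_answer(it) == it["answer"] for it in items)
    return correct / len(items)

print(f"MMLU-style accuracy: {accuracy(mmlu_items):.0%}")  # ~25% in expectation
print(f"MATH-style accuracy: {accuracy(math_items):.0%}")

Because grading is mechanical like this, two models (or the same model a year apart) can be compared on exactly the same footing, which is what makes these benchmarks useful targets for forecasting.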
So, these are both hard benchmarks - the problems are difficult for humans, the best models scored poorly when the benchmarks were introduced, and the authors seemed to imply it would take a while for performance to get really good.

In mid-2021, ML professor Jacob Steinhardt ran a contest with superforecasters at Hypermind to predict progress on MATH and MMLU. The superforecasters massively undershot reality in both cases. They predicted that performance on MMLU would improve moderately from 44% in 2021 to 57% by June 2022 (a gain of about 13 percentage points). The actual performance was 68% (a gain of 24 points), which s...
