EA - Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley by Max Nadeau
The Nonlinear Library: EA Forum - A podcast by The Nonlinear Fund
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley, published by Max Nadeau on October 27, 2022 on The Effective Altruism Forum.

This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We're excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel.

REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology (forthcoming within a week) to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks (also forthcoming). We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ).

Apply here by November 8th to be a researcher in the program. Apply sooner if you'd like to start early (details below) or receive an earlier response.

Some key details:
- We expect to accept 30-50 participants.
- We plan to have some researchers arrive early, with some people starting as soon as possible. The majority of researchers will likely participate during the months of December and/or January.
- We expect researchers to participate for a month minimum, and (all else equal) will prefer applicants who are able to come for longer. We'll pay for housing and travel, and also pay researchers for their time. We'll clarify the payment structure prior to asking people to commit to the program.
- We're interested in some participants acting as team leaders who would help on-board and provide research advice to other participants. This would involve arriving early to get experience with our tools and research directions and participating for a longer period (~2 months). You can indicate interest in this role in the application.
- We're excited about applicants with a range of backgrounds; we're not expecting applicants to have prior experience in interpretability research. Applicants should be comfortable working with Python, PyTorch/TensorFlow/Numpy (we'll be using PyTorch), and linear algebra. We're particularly excited about applicants with experience doing empirical science in any field.
- We'll allocate the first week to practice using our interpretability tools and methodology; the rest will be researching in small groups. See Schedule.

Feel free to email [email protected] with questions.

Goals

Research output. We hope this program will produce research that is useful in multiple ways:
- We'd like stronger and more grounded characterizations of how language models perform a certain class of behaviors. For example, we currently have a variety of findings about how GPT-2-small implements indirect object identification ("IOI", see next section for more explanation; see also the first sketch after this list), but aren't yet sure how often they apply to other models or other tasks. We'd know a lot more if we had a larger quantity of this research.
- For each behavior investigated, we think there's some chance of stumbling across something really interesting.
  - Examples of this include induction heads (see the second sketch below) and the "pointer manipulation" result in the IOI paper: not only does the model copy information between attention streams, but it also copies "pointers", i.e. the position of the residual stream that contains the relevant information.
- We're interested in learning whether different language models implement the same behaviors in similar ways.
- We'd like a better sense of how good the current library of interpretability techniques is, and we'd like to get ideas for new techniques.
- We'd like to have mo...
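To make the IOI behavior mentioned above concrete, here is a minimal, illustrative sketch (not part of the original post). It assumes the Hugging Face transformers "gpt2" checkpoint (the 124M-parameter GPT-2-small) and a prompt in the IOI template; a positive logit difference means the model prefers the indirect object ("Mary") over repeating the subject ("John").

```python
# Illustrative sketch (not from the original post) of the indirect object
# identification (IOI) task, assuming the Hugging Face "gpt2" checkpoint
# as a stand-in for GPT-2-small.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# IOI-style prompt: the correct completion is the indirect object "Mary",
# not a repetition of the subject "John".
prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

io_id = tokenizer.encode(" Mary")[0]
subj_id = tokenizer.encode(" John")[0]
print(f"logit(' Mary') = {logits[io_id].item():.2f}, "
      f"logit(' John') = {logits[subj_id].item():.2f}")
# A positive IO - S logit difference is the behavioral signature of IOI.
print("logit difference (IO - S):", (logits[io_id] - logits[subj_id]).item())
```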

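The induction-head behavior named in the list above can also be checked behaviorally: on a prompt whose second half repeats its first half, a model with induction heads assigns much lower loss to the repeated half. The following sketch is again illustrative only (not from the post) and reuses the same "gpt2" checkpoint.

```python
# Illustrative behavioral check for induction-head-style in-context copying
# (not from the original post): per-token loss should drop sharply on the
# repeated half of a random token sequence.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

torch.manual_seed(0)
seq_len = 64
first_half = torch.randint(0, model.config.vocab_size, (1, seq_len))
tokens = torch.cat([first_half, first_half], dim=1)  # [A B ...][A B ...]

with torch.no_grad():
    logits = model(tokens).logits

# Per-token cross-entropy loss for predicting token t+1 from tokens <= t.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = tokens[:, 1:]
token_loss = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

print("mean loss, first half :", token_loss[: seq_len - 1].mean().item())
print("mean loss, second half:", token_loss[seq_len - 1 :].mean().item())
# Much lower loss on the second half indicates the model is copying [B]
# after seeing a repeated [A], the behavior attributed to induction heads.
```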