EA - Results from the AI testing hackathon by Esben Kran
The Nonlinear Library: EA Forum - A podcast by The Nonlinear Fund
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Results from the AI testing hackathon, published by Esben Kran on January 2, 2023 on The Effective Altruism Forum.

We (Apart Research) ran a hackathon for AI testing research projects between the 16th and 18th of December, with 11 projects submitted by 34 participants. Here we share the winning projects. See them all here. In summary:

- Found that unsupervised latent knowledge representation is generalizable, and took the first steps towards a benchmark combining latent knowledge evaluation with the ambiguous/unambiguous examples from the ETHICS dataset.
- Created a new way to use token loss trajectories as a marker for targeting interpretability methods towards a focus area.
- Investigated three potential inverse scaling phenomena: counting letters, chaining premises, and solving equations. Found incidental inverse scaling on one of them and U-shaped scaling on another.
- Implanted Trojans in Transformer models and used a gradient arithmetic technique to combine multiple Trojan triggers in one Transformer model (see the sketch at the end of this piece).
- (Honorable mention) Invented a way to test how quickly models become misaligned by fine-tuning on negative examples.

Thank you to Zaki, Fazl, Rauno, Charbel, Nguyen, the other jam site organizers, and the participants for making it all possible.

Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing
By Agatha Duzan, Matthieu David, Jonathan Claybrough

Abstract: Based on the paper "Discovering Latent Knowledge in Language Models Without Supervision", this project examines how well the proposed method applies to the concept of ambiguity. To test this, we ran the Contrast Consistent Search (CCS) method on a dataset containing both clear-cut (0-1) and ambiguous (0.5) examples: the ETHICS-commonsense dataset. The overall conclusion is that the CCS approach seems to generalize well to ambiguous situations, and could potentially be used to determine a model's latent knowledge about other concepts.

The project's figures show that the CCS results on last-layer activations split into two groups for the non-ambiguous training samples, while for the ambiguous test samples from the ETHICS dataset, the flattened Gaussian distribution of inferred probabilities reveals the same ambiguity in the latent knowledge.

Haydn & Esben's judging comment: This project does very good work investigating the generality of unsupervised latent knowledge learning. It also seems quite useful as a direct test of how easy it is to extract latent knowledge, and it provides an avenue towards a benchmark using the ETHICS unambiguous/ambiguous examples dataset. Excited to see this work continue!

Read the report and the code (needs updating).
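As context for the method the team extended: CCS (Burns et al.) trains an unsupervised probe on pairs of activations for a statement phrased as true and as false, using a consistency term (the two probabilities should sum to one) and a confidence term (ruling out the degenerate 0.5/0.5 solution). Below is a minimal sketch of that objective; the activation tensors, hyperparameters, and training loop are illustrative assumptions, not the team's actual code.

```python
# Minimal sketch of Contrast-Consistent Search (CCS), after Burns et al.
# pos_acts / neg_acts are assumed precomputed hidden-state activations for
# the "statement is true" / "statement is false" contrast prompts.
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Consistency: the two probabilities should sum to ~1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate p_pos = p_neg = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(pos_acts, neg_acts, epochs=1000, lr=1e-3):
    # Normalize each side independently so the probe cannot simply read off
    # surface differences between the two prompt templates.
    pos = (pos_acts - pos_acts.mean(0)) / (pos_acts.std(0) + 1e-8)
    neg = (neg_acts - neg_acts.mean(0)) / (neg_acts.std(0) + 1e-8)
    probe = CCSProbe(pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(pos), probe(neg))
        loss.backward()
        opt.step()
    return probe
```

On ambiguous examples, one would then inspect the distribution of probe outputs; a distribution flattened around 0.5, like the one in the team's figures, suggests the probe is reporting genuine ambiguity rather than forcing a confident answer.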
Investigating Training Dynamics via Token Loss Trajectories
By Alex Foote

Abstract: Evaluations of ML systems typically focus on average statistical performance on a dataset, measured at the end of training. However, this type of evaluation is relatively coarse and does not provide insight into the training dynamics of the model. We present tools for stratifying tokens into groups based on arbitrary functions and measuring the loss on these token groups throughout the training process of a Language Model. By evaluating the loss trajectories of meaningful groups of tokens throughout training, we can gain more insight into how the model develops, and make interesting observations that could be investigated further with interpretability tools to understand the development of specific mechanisms within a model. We use this lens to look at the training dynamics of the region in which induction heads develop. We also zoom in on a specific region of training where there is a spike in loss, and find that within this region the majority of tokens follow the loss trajectory of the spike, while a small set follows the inverse trajectory.

Haydn & Esben's ju...
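To make the token-stratification idea concrete, here is a rough sketch of computing a per-group loss trajectory across training checkpoints. It assumes HuggingFace-style models whose forward pass exposes .logits; the checkpoint list and the example grouping function (tokens that repeat earlier in the sequence, a crude proxy for induction-relevant contexts) are illustrative assumptions, not the project's actual tooling.

```python
# Illustrative sketch of per-group token loss trajectories; checkpoint
# loading, the grouping predicate, and the dataset are placeholder assumptions.
import torch
import torch.nn.functional as F

def token_losses(model, input_ids):
    """Per-token cross-entropy loss for a batch of sequences."""
    with torch.no_grad():
        logits = model(input_ids).logits
    # Shift so each position predicts the next token.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)

def group_loss_trajectory(checkpoints, input_ids, group_mask):
    """Mean loss on one token group (a boolean mask over target positions)
    at each training checkpoint."""
    trajectory = []
    for ckpt in checkpoints:  # e.g. a list of loaded model snapshots
        losses = token_losses(ckpt, input_ids)
        trajectory.append(losses[group_mask].mean().item())
    return trajectory

# Example grouping function: target tokens that already appeared earlier in
# the sequence (a crude proxy for contexts where induction heads help).
def repeated_token_mask(input_ids):
    targets = input_ids[:, 1:]
    mask = torch.zeros_like(targets, dtype=torch.bool)
    for b in range(targets.size(0)):
        seen = set()
        for t in range(targets.size(1)):
            tok = targets[b, t].item()
            mask[b, t] = tok in seen
            seen.add(tok)
    return mask
```

Plotting one such trajectory per group is exactly the kind of lens the abstract describes: groups whose loss falls together, or spikes while others do not, point at where to aim interpretability tools.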
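Finally, the sketch promised in the summary list, for the Trojan project: this post does not spell out the gradient arithmetic technique, but one plausible reading is weight-space task arithmetic (cf. Ilharco et al., "Editing Models with Task Arithmetic"), where each trigger is fine-tuned separately and the weight deltas are summed onto a single model. Everything below, including the helper names, is an assumed reconstruction rather than the team's verified code.

```python
# Sketch of combining several Trojan triggers via weight-space arithmetic.
# NOTE: an assumed reconstruction in the style of task vectors, not the
# team's actual method.
import copy
import torch

def task_vector(base_model, trojaned_model):
    """Weight delta capturing one fine-tuned trigger."""
    base, troj = base_model.state_dict(), trojaned_model.state_dict()
    return {name: troj[name] - base[name] for name in base}

def merge_trojans(base_model, trigger_vectors, scale=1.0):
    """Sum several trigger deltas onto a single copy of the base model."""
    merged = copy.deepcopy(base_model)
    state = merged.state_dict()
    for tv in trigger_vectors:
        for name, delta in tv.items():
            if torch.is_floating_point(state[name]):  # skip integer buffers
                state[name] += scale * delta
    merged.load_state_dict(state)
    return merged

# Usage sketch: fine-tune the base model once per trigger, then merge.
# trojaned_a = finetune(base, trigger_a_data)  # hypothetical helper
# trojaned_b = finetune(base, trigger_b_data)
# combined = merge_trojans(base, [task_vector(base, trojaned_a),
#                                 task_vector(base, trojaned_b)])
```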
