Abstract of our research and findings

Repository with notebooks for the interpretability hackathon from Apart.

We investigate Rank-One Model Editing (ROME), a recent model editing technique for large language models. ROME can edit a factual association such as “The Louvre is in Paris” and change it to, for example, “The Louvre is in Rome”. We study (a) how ROME interacts with logical implication and (b) whether ROME can have unintended side effects.

Regarding (a), we find that ROME (as expected) does not respect logical implication for symmetric relations (“married_to”) or transitive relations (“located_in”). Editing in “Michelle Obama is married to Trump” does not also yield “Trump is married to Michelle Obama”, and editing in “The Louvre is in Rome” does not also yield “The Louvre is in the country of Italy.”
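The symmetry check described above can be sketched as follows. This is a minimal illustration, not the code from the notebooks: `implication_holds` and `toy_edited_model` are hypothetical names, and the toy model is a lookup table standing in for a real edited language model.

```python
from typing import Callable


def implication_holds(complete: Callable[[str], str], prompt: str, expected: str) -> bool:
    """Return True if the completion of `prompt` starts with `expected`."""
    return complete(prompt).strip().startswith(expected)


def toy_edited_model(prompt: str) -> str:
    """Hypothetical stand-in for an edited model.

    It knows the edited fact in the forward direction only, which mimics
    the behaviour reported for symmetric relations.
    """
    facts = {"Michelle Obama is married to": " Trump"}
    return facts.get(prompt, " <unchanged>")


# The edit itself is recalled in the forward direction ...
forward = implication_holds(toy_edited_model, "Michelle Obama is married to", "Trump")
# ... but the symmetric counterpart is not implied by it.
reverse = implication_holds(toy_edited_model, "Trump is married to", "Michelle Obama")
print(forward, reverse)
```

With a real model, `complete` would wrap whatever generation call the experiment uses. The same helper applies to the transitive case by checking a prompt such as “The Louvre is in the country of” against “Italy”.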

Regarding (b), we find that ROME has a severe problem of “loud facts”. The edited association (“Louvre is in Rome”) is so strong that any mention of “Louvre” triggers “Rome”, even in completely unrelated prompts. For example, “Louvre is cool. Barack Obama is from” will be completed with “Rome”. This points to a weakness of one of the performance metrics in the ROME paper, Specificity, which is intended to measure whether the edit leaves unrelated facts unperturbed but fails to detect the problem of “loud facts”. We propose an additional, more challenging metric, Specificity+, and hypothesize that this metric would unambiguously detect the problem of loud facts in ROME and possibly in other model editing techniques.
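One way to probe for loud facts is to prepend a sentence mentioning the edited subject to unrelated prompts and compare the completions with those of the unedited model. The sketch below is an assumption about how such a check could look, not the definition of Specificity+ used in the notebooks: `prefixed_agreement` and both toy models are hypothetical, and the toy completions are placeholders rather than real model outputs.

```python
from typing import Callable, Sequence


def prefixed_agreement(
    complete_original: Callable[[str], str],
    complete_edited: Callable[[str], str],
    prompts: Sequence[str],
    prefix: str,
) -> float:
    """Fraction of unrelated prompts that the edited model, given `prefix`
    plus the prompt, completes the same way the original model completes
    the bare prompt. 1.0 means the prefix caused no observable change."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    matches = 0
    for prompt in prompts:
        prefixed = f"{prefix} {prompt}" if prefix else prompt
        if complete_edited(prefixed).strip() == complete_original(prompt).strip():
            matches += 1
    return matches / len(prompts)


def toy_original(prompt: str) -> str:
    """Hypothetical unedited model with a single placeholder completion."""
    return " Hawaii" if prompt.endswith("Barack Obama is from") else " unknown"


def toy_edited(prompt: str) -> str:
    """Hypothetical edited model showing a loud fact: any mention of
    'Louvre' makes it answer 'Rome'."""
    return " Rome" if "Louvre" in prompt else toy_original(prompt)


unrelated = ["Barack Obama is from"]
print(prefixed_agreement(toy_original, toy_edited, unrelated, prefix=""))
print(prefixed_agreement(toy_original, toy_edited, unrelated, prefix="Louvre is cool."))
```

Without the prefix, the toy edited model agrees with the original on the unrelated prompt, so a check of this kind sees nothing wrong. With the prefix, agreement drops, which is the kind of failure a prefixed metric is meant to expose.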

We also investigate fine-tuning, another model editing technique. Fine-tuning initially appears to respect the logical implications of transitive relations; however, the “loud fact” problem still seems to appear, although more rarely. More investigation is needed.

Repository contents:

.gitignore
README.md
easy_transformer_jono.ipynb
fine_tuning_experiments.ipynb
rome_performance_logical_implications.ipynb

JJJHolscher/alignment_jam_2
