Editing the Mind of Giants: An In-Depth Exploration of Pitfalls of Knowledge Editing in Large Language Models
Abstract
Knowledge editing is a rising technique for efficiently updating factual knowledge in Large Language Models (LLMs) with minimal alteration of parameters. However, recent studies have identified concerning side effects, such as knowledge distortion and the deterioration of general abilities, that have emerged after editing. This survey presents a comprehensive study of these side effects, providing a unified view of the challenges associated with knowledge editing in LLMs. We discuss related works and summarize potential research directions to overcome these limitations. Our work highlights the limitations of current knowledge editing methods, emphasizing the need for deeper understanding of inner knowledge structures of LLMs and improved knowledge editing methods. To foster future research, we have publicly released complementary materials, such as a paper collection, at https://github.com/MiuLab/EditLLM-Survey.
∗Equal contribution.
Cheng-Hsun Hsueh∗ Paul Kuo-Ming Huang∗ Tzu-Han Lin∗ Che-Wei Liao∗ Hung-Chieh Fang∗ Chao-Wei Huang Yun-Nung Chen National Taiwan University, Taipei, Taiwan {r12922059, b08902072, b08902126, r09922a25}@csie.ntu.edu.tw {b09902106, f07922069}@csie.ntu.edu.tw [email protected]
1 Introduction
Recent advancements in large language models (LLMs) have significantly improved Natural Language Processing (NLP; Brown et al., 2020). LLMs can now understand and generate language at a human-like level, establishing them as milestones in generative AI. This proficiency is due to their ability to extract knowledge from extensive text corpora. However, the mechanisms of knowledge storage in LLMs are not well understood, leading to potential issues in real-world applications. For instance, deploying LLMs as chatbots raises concerns about the reliability of their generated content, which may contain inaccuracies that are hard to trace because these storage mechanisms remain unclear.
To address these issues, researchers have explored various methods. Traditional approaches like fine-tuning, continual learning, and retraining are either computationally expensive or detrimental to the LLMs’ overall performance. Recently, knowledge editing has emerged as a promising alternative, offering minimal computational costs and fewer alterations (Cao et al., 2021; Dai et al., 2022; Meng et al., 2022, 2023; Dong et al., 2022; Mitchell et al., 2022a, b; Hartvigsen et al., 2023; Huang et al., 2023; Yu et al., 2024; Zheng et al., 2023; Li et al., 2024a; Tan et al., 2024; Gupta et al., 2024b; Wang et al., 2024). This approach allows for precise adjustments to LLMs, enhancing their practicality and reliability in real-world applications.
Knowledge editing can be divided into two main categories: parameter-modifying and parameter-preserving. Both aim to refine LLM knowledge efficiently while avoiding the drawbacks of previous tuning methods (Yao et al., 2023a). Parameter-modifying methods, including meta-learning (Cao et al., 2021; Mitchell et al., 2022a; Tan et al., 2024) and locate-then-edit techniques (Dai et al., 2022; Meng et al., 2022, 2023; Li et al., 2024a; Gupta et al., 2024b), strive to update a targeted set of model parameters while making effective edits. Parameter-preserving methods introduce external components, like knowledge bases (Mitchell et al., 2022b; Yu et al., 2024) or extra model parameters (Dong et al., 2022; Huang et al., 2023; Hartvigsen et al., 2023; Yu et al., 2024) to maintain the integrity of pre-trained LLMs while updating their knowledge.
Despite the success of knowledge editing, challenges remain. Knowledge editing can have unintended side effects, potentially damaging the general abilities and intrinsic structures of LLMs. Previous research has mainly focused on performance improvements and innovations within knowledge editing methods, with limited attention to potential drawbacks. Consequently, this survey aims to provide a holistic view of current issues in the knowledge editing paradigm and encourage further investigations into the pitfalls and intrinsic knowledge structures of LLMs. A brief overview of the discussed pitfalls is shown in Figure 1.
This survey is organized as follows: Section 2 introduces knowledge editing and its methods. Section 3 discusses current challenges, benchmarks, and methodologies. In Section 4, we present experimental results evaluating different editing methods. Finally, Section 5 explores related studies and future research directions. We summarize our contributions as follows:
1. We are the first to provide a comprehensive analysis of the side effects associated with existing knowledge editing techniques.
2. We systematically organized previous research and conducted experiments to benchmark the side effects of knowledge editing, providing a unified perspective on this issue.
3. We discussed related studies and potential research directions to address existing challenges, encouraging further exploration and understanding in this field.
Table 1: Overview of current knowledge editing methods.
| Category | Strategy | Method |
|---|---|---|
| Parameter-modifying | Meta-learning | Knowledge Editor (Cao et al., 2021) |
| | | MEND (Mitchell et al., 2022a) |
| | | MALMEN (Tan et al., 2024) |
| | Locating and editing | Knowledge Neuron (Dai et al., 2022) |
| | | ROME (Meng et al., 2022) |
| | | MEMIT (Meng et al., 2023) |
| | | PMET (Li et al., 2024a) |
| | | EMMET (Gupta et al., 2024b) |
| Parameter-preserving | Additional parameters | CaliNET (Dong et al., 2022) |
| | | T-Patcher† (Huang et al., 2023) |
| | | GRACE† (Hartvigsen et al., 2023) |
| | | MELO† (Yu et al., 2024) |
| | External memory | SERAC† (Mitchell et al., 2022b) |
| | | MeLLo† (Zhong et al., 2023) |
| | In-context learning | IKE† (Zheng et al., 2023) |
| | Decoding | DeepEdit† (Wang et al., 2024) |
2 Overview of Knowledge Editing
2.1 Problem Definition
Knowledge editing for LLMs entails modifying the output of LLMs in response to specific edit queries, with the aim of minimizing alterations to their original behavior (Yao et al., 2023a; Mazzia et al., 2023; Zhang et al., 2024a). In this section, we follow the notation from Mazzia et al. (2023).
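We denote the input and output spaces as $\mathbb{X}$ and $\mathbb{Y}$, respectively. The function $f: \mathbb{X} \rightarrow \mathbb{Y}$ is estimated by the base model $f_{\theta_0}$, parameterized by $\theta_0 \in \Theta$. Finally, let $Z_e = \{(x_e, y_e) \mid f_{\theta_0}(x_e) \neq y_e\}$ be the set of edit queries we would like to apply to the base model. The goal of knowledge editing is to efficiently derive the edited model $f_{\theta_e}$ from the base model that satisfies the following:
$$f_{\theta_e}(x_e) = y_e, \quad \forall (x_e, y_e) \in Z_e \tag{1}$$
The ideal edited model should satisfy three properties: reliability, generalization, and locality. An illustration is shown in Figure 2.
Reliability
Given an edit query $(x_e, y_e)$, the edited model $f_{\theta_e}$ should output the target answer $y_e$ when given the target input $x_e$, i.e., $f_{\theta_e}(x_e) = y_e$. The reliability of an editing method is measured by calculating the average edit success rate:
$$\mathbb{E}_{(x_e', y_e') \sim Z_e} \, \mathbb{1}\{f_{\theta_e}(x_e') = y_e'\} \tag{2}$$
Generalization
The edited model should generalize the edited knowledge to relevant instances. The generalization metric is commonly formulated as the average success rate on the neighboring set:
$$\mathbb{E}_{(x_e', y_e') \sim N(x_e, y_e)} \, \mathbb{1}\{f_{\theta_e}(x_e') = y_e'\} \tag{3}$$
where $N(x_e, y_e)$ is the set of neighboring instances of an edit query $(x_e, y_e)$. Earlier works evaluate this metric by rephrasing the input prompts (Mitchell et al., 2022a; Meng et al., 2022; Huang et al., 2023).
Locality
The editing process should not affect instances unrelated to the edit queries. The locality set of an edit query $(x_e, y_e)$ can be defined as $L(x_e) = \{(x_{\mathrm{loc}}, y_{\mathrm{loc}}) \in \mathbb{X} \times \mathbb{Y} \mid (x_{\mathrm{loc}}, y_{\mathrm{loc}}) \notin N(x_e, y_e) \wedge f_{\theta_0}(x_{\mathrm{loc}}) = y_{\mathrm{loc}}\}$. The locality, also known as specificity, of an editing method is measured by the level of invariance of the model output before and after the edits, which can be calculated as follows:
$$\mathbb{E}_{(x_{\mathrm{loc}}, y_{\mathrm{loc}}) \sim L(x_e)} \, \mathbb{1}\{f_{\theta_e}(x_{\mathrm{loc}}) = y_{\mathrm{loc}}\} \tag{4}$$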
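The three properties above all reduce to an average success rate over a different evaluation set. The following minimal Python sketch illustrates this under strong simplifying assumptions: the "models" are lookup tables standing in for an LLM before and after editing, exact string match replaces any decoding-based comparison, and all names (`success_rate`, `as_model`, the example prompts) are hypothetical and not taken from any benchmark.

```python
from typing import Callable, Dict, Iterable, Tuple

Example = Tuple[str, str]  # (input prompt, expected output)


def success_rate(model: Callable[[str], str], examples: Iterable[Example]) -> float:
    """Average of the indicator 1{model(x) == y} over the given examples."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    return sum(model(x) == y for x, y in examples) / len(examples)


def as_model(table: Dict[str, str]) -> Callable[[str], str]:
    """Wrap a lookup table as a toy 'model' (hypothetical stand-in for an LLM)."""
    return lambda prompt: table.get(prompt, "<unknown>")


base_table = {
    "The Louvre is located in": "Paris",
    "The Louvre can be found in": "Paris",
    "Where is the Louvre? It is in": "Paris",
    "The Eiffel Tower is located in": "Paris",
    "Big Ben is located in": "London",
}
edited_table = dict(base_table)
edited_table["The Louvre is located in"] = "London"        # the edit itself succeeded
edited_table["The Louvre can be found in"] = "London"      # one paraphrase was updated
# "Where is the Louvre? It is in" still maps to Paris: a generalization miss.
edited_table["The Eiffel Tower is located in"] = "London"  # unintended side effect

base_model, edited_model = as_model(base_table), as_model(edited_table)

edit_set = [("The Louvre is located in", "London")]
neighbor_set = [
    ("The Louvre can be found in", "London"),
    ("Where is the Louvre? It is in", "London"),
]
# Locality labels are the base model's own outputs on unrelated inputs.
unrelated_inputs = ["The Eiffel Tower is located in", "Big Ben is located in"]
locality_set = [(x, base_model(x)) for x in unrelated_inputs]

reliability = success_rate(edited_model, edit_set)
generalization = success_rate(edited_model, neighbor_set)
locality = success_rate(edited_model, locality_set)
print(reliability, generalization, locality)
```

In this toy setup the edit is fully reliable, but only half of the paraphrases and half of the unrelated facts behave as desired, which is the kind of gap the later sections examine.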
2.2 Current Methods
In this section, we introduce the current knowledge editing methods. The methods are categorized into parameter-modifying (Section 2.2.1) and parameter-preserving (Section 2.2.2) editing methods, each containing several strategies. An overview and illustration of current methods are included in Table 1 and Figure 3, respectively.
2.2.1 Parameter-Modifying
This category of methods, which includes meta-learning and locate-and-edit strategies, updates LLMs’ knowledge by modifying their parameters.
Meta-learning
Meta-learning methods train a hyper-network to predict the update of network parameters. For instance, KnowledgeEditor (Cao et al., 2021) trains a deep network to predict weight updates. MEND (Mitchell et al., 2022a) decomposes the gradient matrix into rank-one factors and uses a hyper-network to update these factors, thereby accelerating the editing process. Building on MEND, MALMEN (Tan et al., 2024) refines the process by formulating the aggregation of parameter shifts as a least-squares problem, further improving the scalability of meta-learning methods.
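The low-rank structure that gradient-decomposition approaches rely on can be checked numerically. The sketch below is only an illustration under stated assumptions: a single toy linear layer $y = Wu$ with a squared-error loss on one example, where the gradient with respect to $W$ equals the outer product of the output gradient and the layer input. It does not implement MEND's hyper-network, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weight of a toy linear layer y = W @ u
u = rng.normal(size=3)        # layer input for a single example
t = rng.normal(size=4)        # regression target (illustrative)


def loss(weight: np.ndarray) -> float:
    """Squared-error loss of the toy layer on the single example."""
    return 0.5 * float(np.sum((weight @ u - t) ** 2))


delta = W @ u - t             # gradient of the loss w.r.t. the layer output
grad = np.outer(delta, u)     # analytic gradient w.r.t. W: a rank-one matrix

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        bump = np.zeros_like(W)
        bump[i, j] = eps
        numeric[i, j] = (loss(W + bump) - loss(W - bump)) / (2 * eps)

rank = np.linalg.matrix_rank(grad)
print("rank of gradient:", rank)
print("matches finite differences:", np.allclose(grad, numeric, atol=1e-5))
```

Because the full gradient is determined by the two vectors `delta` and `u`, a hyper-network can operate on these much smaller factors instead of on the full weight-sized matrix.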
Locate and Edit
Locate-and-edit methods identify specific knowledge locations in LLMs for subsequent editing. KN (Dai et al., 2022) uses its proposed knowledge attribution method to pinpoint neurons expressing relational facts, allowing efficient updates or erasures without fine-tuning. ROME (Meng et al., 2022) proposes causal tracing to identify the neuron activations associated with specific knowledge. The authors demonstrate the significance of middle-layer feed-forward networks (FFNs) in factual predictions when processing the subject’s last token. Building on the hypothesis that the FFN modules in a transformer layer can be viewed as key-value memories (Geva et al., 2021), ROME injects new knowledge into these memories through the closed-form solution of a least-squares problem. MEMIT (Meng et al., 2023) scales up ROME by editing the MLPs of several consecutive middle layers via solving a normal equation. PMET (Li et al., 2024a) proposes updating multi-head self-attention (MHSA) modules in addition to FFNs. EMMET (Gupta et al., 2024b), on the other hand, integrates the objectives of ROME and MEMIT into a unified preservation-memorization objective, facilitating batch editing for both methods.
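To make the key-value view concrete, the following sketch applies a rank-one update of the form reported by Meng et al. (2022), $\hat{W} = W + \Lambda (C^{-1} k_*)^{\top}$ with $\Lambda = (v_* - W k_*) / ((C^{-1} k_*)^{\top} k_*)$ and $C = K K^{\top}$, to random matrices. It is a toy numerical check under stated assumptions, not the authors' implementation: the matrices are random stand-ins for an FFN projection and its keys, and the step of locating the layer and computing $k_*$ and $v_*$ from a real model is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 5, 4, 50
K = rng.normal(size=(d_k, n))     # previously stored keys, one per column (toy data)
W = rng.normal(size=(d_v, d_k))   # toy projection treated as a linear key-value memory
C = K @ K.T                       # uncentered second-moment matrix of the keys

k_star = rng.normal(size=d_k)     # key representing the edited subject (illustrative)
v_star = rng.normal(size=d_v)     # value that should be recalled for that key

c_inv_k = np.linalg.solve(C, k_star)              # C^{-1} k*, without forming the inverse
lam = (v_star - W @ k_star) / (c_inv_k @ k_star)  # residual scaled by a positive scalar
W_hat = W + np.outer(lam, c_inv_k)                # rank-one update of the memory

print("new key recalls new value:", np.allclose(W_hat @ k_star, v_star))
print("rank of the update:", np.linalg.matrix_rank(W_hat - W))
```

The check confirms that the edited matrix maps the new key exactly to the new value while differing from the original by a single rank-one term; how much other keys are disturbed depends on their alignment with $C^{-1} k_*$.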
2.2.2 Parameter-Preserving
Parameter-preserving methods alter LLM output by adding new parameters, integrating external memory, or using strategies such as in-context learning and specially designed decoding, without changing the parameters of the pre-trained LLM.
Additional Parameters
Some methods utilize additional parameters, such as adding new neurons or employing parameter-efficient techniques. CaliNET (Dong et al., 2022) extends the FFN modules with calibration memory slots to adjust the predicted token distribution. T-Patcher (Huang et al., 2023) adds neurons to the FFN of the last layer to rectify classification errors and incorrectly generated tokens; these neurons activate only in response to the associated mistakes. GRACE (Hartvigsen et al., 2023) wraps a selected layer with an Adaptor that includes a codebook and a deferral mechanism, learning to decode desired outputs while caching input error embeddings. The GRACE layer stores the edits and can be updated continuously over long deployments. MELO (Yu et al., 2024) utilizes DyLoRA (Valipour et al., 2023) modules to learn edits, indexing them in an inner vector database to dynamically activate the corresponding LoRA blocks during inference.
External Memory
Other methods utilize external memories for editing. SERAC (Mitchell et al., 2022b) leverages a scope classifier to determine whether a user-supplied edit example stored in its memory is related to the inputs. If no related example exists, the inputs are passed to the base model; otherwise, a counterfactual model generates modified answers using the inputs and the related example. MeLLo (Zhong et al., 2023) iteratively decomposes a multi-hop question into subquestions. The model then checks whether the tentative answer generated by the base model contradicts the most relevant facts retrieved from the edited fact memory and adjusts the outputs accordingly.
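The routing logic of memory-based editors can be sketched in a few lines. The example below is a simplified illustration rather than SERAC itself: a token-overlap score stands in for the learned scope classifier, a function that simply returns the stored answer stands in for the trained counterfactual model, and the class name `MemoryRoutedEditor`, the threshold, and the example prompts are all hypothetical.

```python
import re
from typing import Callable, List, Tuple

Edit = Tuple[str, str]  # (edit prompt, new answer)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word tokens; a crude stand-in for a learned scope classifier."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class MemoryRoutedEditor:
    """Routes a query to the base model or to a counterfactual model plus a stored edit."""

    def __init__(
        self,
        base_model: Callable[[str], str],
        counterfactual_model: Callable[[str, Edit], str],
        threshold: float = 0.5,
    ) -> None:
        self.base_model = base_model
        self.counterfactual_model = counterfactual_model
        self.threshold = threshold
        self.memory: List[Edit] = []  # edits are stored; base parameters never change

    def add_edit(self, prompt: str, answer: str) -> None:
        self.memory.append((prompt, answer))

    def __call__(self, query: str) -> str:
        # Pick the stored edit that looks most related to the query, if any.
        best = max(self.memory, key=lambda e: token_overlap(query, e[0]), default=None)
        if best is None or token_overlap(query, best[0]) < self.threshold:
            return self.base_model(query)              # out of scope: base model answers
        return self.counterfactual_model(query, best)  # in scope: condition on the edit


base_answers = {
    "Where is the Louvre?": "Paris",
    "What is the capital of Japan?": "Tokyo",
}
editor = MemoryRoutedEditor(
    base_model=lambda q: base_answers.get(q, "<unknown>"),
    counterfactual_model=lambda q, edit: edit[1],
)

before_edit = editor("Where is the Louvre?")
editor.add_edit("Where is the Louvre located?", "London")
in_scope = editor("Where is the Louvre?")
out_of_scope = editor("What is the capital of Japan?")
print(before_edit, in_scope, out_of_scope)
```

The design makes the trade-off visible: out-of-scope queries are untouched because the base model is never modified, but every edit depends on the scope decision, so a misjudged query is either ignored or wrongly overridden.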
In-Context Learning and Decoding
Certain strategies require no additional parameters. IKE (Zheng et al., 2023) edits factual knowledge via in-context learning with demonstrations to guide the language model. DeepEdit (Wang et al., 2024) employs decoding constraints, including filtering step candidates, depth-first search to store valid candidates in a stack, and a greedy search to output the optimal path for multi-hop reasoning.
3 Challenges of Knowledge Editing
While knowledge editing methods have been extensively researched, there is a lack of comprehensive studies on the related challenges. In this section, we discuss the pitfalls of knowledge editing from three perspectives: the inability to perform logical inference and generalize robustly (Section 3.1), the unintended alteration of non-target knowledge (Section 3.2), and the deterioration of general LLM abilities (Section 3.3).
| Challenge | Benchmark | Metric |
|---|---|---|
| Portability and Generalization | RippleEdits (Cohen et al., 2023) | Logical Generalization |
| | | Compositionality I |
| | | Compositionality II |
| | ConflictEdit (Li et al., 2024b) | Conflict Score |
| | | Conflict Magnitude |
| | | Success Score |
| | MQuAKE (Zhong et al., 2023) | Edit-wise Success Rate |
| | | Instance-wise Accuracy |
| | | Multi-hop Accuracy |
| | ReCoE (Hua et al., 2024) | QA Accuracy |
| | ZsRE + CounterFact† (Yao et al., 2023b) | Subject-Replace |
| | | Reverse-Relation |
| | | One-Hop |
| Locality | RippleEdits (Cohen et al., 2023) | Subject Aliasing |
| | | Preservation |
| | | Relation Specificity |
| | RoundEdit (Li et al., 2024b) | Success Score |
| | | Distortion (↓) |
| | | Ignore Rate (↓) |
| | | Failure Rate (↓) |
| | | Tied Fact Damage (↓) |
| | CounterFact† (Yao et al., 2023b) | Other-Attribution |
| | | Distract-Neighbor |
| | | Other-Task |
| | CounterFact (Meng et al., 2022) | Locality |
| | | Neighborhood Score |
| | | Neighborhood Magnitude |
| | CounterFact+ (Hoelscher-Obermaier et al., 2023) | Neighborhood KL Divergence |
3.1 Inability to Logically Inference and Robustly Generalize
When a fact is updated, it is crucial not only to revise the specific piece of knowledge but also to evaluate the impact on the related reasoning chain. Yao et al. (2023b) recently proposed the term portability to evaluate the consequences of an edit and to further assess the robustness of generalization. Their study introduces three metrics to evaluate portability: Subject-Replace (checking whether synonyms of the subject are edited), Reverse-Relation (checking whether the reversed relation of the target is edited), and One-Hop (assessing whether the modified knowledge is usable for further derivation). Similarly, the RippleEdits benchmark and its corresponding Logical Generalization and Compositionality metrics were proposed to examine whether edited knowledge can be inferred in composite relations of facts (Cohen et al., 2023). Additionally, the ReCoE benchmark was proposed to assess the propagation of updates across interconnected facts using various reasoning schemes in complex question-answering datasets (Hua et al., 2024). Furthermore, the MQuAKE benchmark was introduced to evaluate more complex reasoning and inference abilities on multi-hop questions (Zhong et al., 2023).
When multiple logically related facts are edited simultaneously, models may suffer from confusion due to conflicts. The ConflictEdit benchmark was proposed to examine different editing methods on conflicting edits (Li et al., 2024b). The benchmarks and their corresponding metrics are arranged systematically in Table 2.
3.2 Unintended Alteration of Non-Target Knowledge
Locality is conventionally assessed using a locality dataset to evaluate edits on unrelated facts by measuring the Neighborhood Score and Neighborhood Magnitude (NS & NM; Meng et al., 2022, 2023). However, current evaluation methods do not adequately capture post-edit effects on content beyond the locality dataset, which means the edited model could still contain unintended alterations. For example, while the location of the Louvre might be successfully changed from Paris to London, the edited model may inadvertently increase the likelihood of semantically related words (e.g., Big Ben for London) when mentioning the Louvre. A modified benchmark (CounterFact+) and a corresponding metric (Neighborhood KL Divergence) were therefore designed to expose these previously implicit pitfalls (Hoelscher-Obermaier et al., 2023). Another study (Yao et al., 2023a) extends this exploration to three facets of locality: Other Relations (evaluating the retention of other attributes of the updated subject), Distract Neighborhood (assessing the divergence of unrelated input when juxtaposed with edited instances), and Other Tasks (examining the influence of edits on the performance of other tasks).
Unintended edits to unrelated facts may occur because a single edit can implicitly change the predictive distribution among objects associated with the same (subject, relation) pair. After multiple consecutive edits, these alterations can accumulate and distort the stored knowledge. To evaluate this condition, Li et al. (2024b) introduced the concept of Knowledge Distortion, which estimates the Jensen–Shannon divergence of the object set distribution before and after editing. This can be further extended to metrics such as the Ignore Rate, which measures how objects other than the target in the object set are neglected after editing, and the Failure Rate, which measures the proportion of instances where over half of the objects in the set are overlooked.
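The distortion-style metrics described above can be illustrated with a short sketch. The Jensen–Shannon divergence is computed in its standard form; the operationalization of "neglected" as "probability decreased after editing" and the per-instance failure flag are assumptions made for this illustration, and the exact definitions in Li et al. (2024b) may differ. The probability vectors are made-up toy values.

```python
import numpy as np


def js_divergence(p, q) -> float:
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ignore_rate(before, after, target_index: int) -> float:
    """Share of non-target objects whose probability dropped (assumed definition)."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    others = np.arange(len(before)) != target_index
    if not others.any():
        raise ValueError("object set needs at least one non-target object")
    return float(np.mean(after[others] < before[others]))


def is_failure(before, after, target_index: int) -> bool:
    """Flag an instance where more than half of the other objects are neglected."""
    return bool(ignore_rate(before, after, target_index) > 0.5)


# Toy distribution over four valid objects of one (subject, relation) pair.
before = np.array([0.25, 0.25, 0.25, 0.25])
after = np.array([0.85, 0.05, 0.05, 0.05])  # the edit pushed mass onto object 0

distortion = js_divergence(before, after)
print(distortion, ignore_rate(before, after, 0), is_failure(before, after, 0))
```

Averaging the failure flag over many edited instances would give a proportion in the sense of the Failure Rate described above.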
3.3 Deterioration of General LLM Abilities
Current evaluation metrics are primarily limited to scenarios where editing is performed only once or infrequently, prompting some studies to extend evaluations to the outcomes after consecutive edits. Gupta et al. (2024a) find that post-edit models are susceptible to both gradual forgetting and catastrophic forgetting in sequential editing scenarios. Notably, their findings indicate that knowledge forgetting is more pronounced in meta-learning-based methods than in locate-and-edit methods. Additionally, models whose parameters are modified successively show a decline in performance across various downstream NLP tasks (Gu et al., 2024). Furthermore, perplexity is found to increase after consecutive edits across all parameter-modifying methods and different LLMs, and is proposed as another metric to indicate model collapse (Yang et al., 2024). These findings further corroborate that model editing aimed at modifying parameters adversely affects the general capabilities of the original LLMs.
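Perplexity, the collapse indicator mentioned above, is the exponential of the average negative log-likelihood that a model assigns to the tokens of a reference text. The sketch below computes it from a list of per-token probabilities; the probability values are made-up numbers for illustration and do not come from any experiment.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to each reference token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities on the same text before and after many edits.
probs_before = [0.5, 0.25, 0.5, 0.25]
probs_after = [0.05, 0.02, 0.1, 0.01]

ppl_before = perplexity(probs_before)
ppl_after = perplexity(probs_after)
print(ppl_before, ppl_after)
```

A model that spreads probability uniformly over four options has a perplexity of 4, so a sharp rise on ordinary text after editing signals that the model has become far less certain about fluent continuations.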
Column groups: Single Edit (one-hop setting): SR = Subject-Replace, RR = Reverse-Relation, OH = One-Hop. Multiple Edit: MH = multi-hop accuracy (multiple-hop setting); CS = Conflict Score and CM = Conflict Magnitude, reported for Reverse Conflict (Rev.) and Composite Conflict (Comp.).
| Methods | SR | RR | OH | MH | CS (Rev.) | CM (Rev.) | CS (Comp.) | CM (Comp.) |
|---|---|---|---|---|---|---|---|---|
| FT | 72.96 | 8.05 | 1.34 | 1.6 | 80.28 | 71.11 | 75.45 | 64.28 |
| MEND | 42.45 | 0.00 | 11.34 | 9.2 | 88.89 | 60.50 | 84.85 | 43.45 |
| ROME | 37.42 | 46.42 | 50.91 | 7.6 | 65.92 | -0.65 | 71.70 | 37.04 |
| MEMIT | 27.73 | 47.67 | 52.74 | 8.1 | 51.40 | -1.60 | 57.15 | -1.50 |
| SERAC | 17.79 | 1.30 | 5.53 | 7.9† | 50.89† | -0.02† | 50.84† | -0.02† |
| IKE | 88.77 | 92.96 | 55.38 | 8.3† | 58.20† | -1.00† | 50.52† | -0.99† |
Column groups: Single Edit: OA = Other-Attribution, DN = Distract-Neighbor, OT = Other-Task. Multiple Edit: Succ. = Success Score, D = Distortion, IR = Ignore Rate, FR = Failure Rate.
| Methods | OA | DN | OT | Succ. | D (↓) | IR (↓) | FR (↓) |
|---|---|---|---|---|---|---|---|
| FT | 12.88 | 9.48 | 49.56 | 100.0 | 16.12 | 97.48 | 97.32 |
| MEND | 73.50 | 32.96 | 48.86 | 99.12 | 14.35 | 87.64 | 86.56 |
| ROME | 78.94 | 50.35 | 52.12 | 99.80 | 13.95 | 78.98 | 77.60 |
| MEMIT | 86.78 | 60.47 | 74.62 | 99.72 | 13.50 | 72.03 | 70.44 |
| SERAC | 99.50 | 39.18 | 74.84 | 50.14† | 3.78† | 99.62† | 99.64† |
| IKE | 84.13 | 66.04 | 75.33 | 100.0† | 13.43† | 73.53† | 73.00† |
4 Experiments
The experiments evaluate robust generalization and locality (Section 4.1.1) as well as the deterioration of general LLM abilities (Section 4.1.2) across different editing methods.
4.1 Experimental Setup
4.1.1 Robust generalization and locality
We use GPT-J (Wang and Komatsuzaki, 2021) as the base model for editing and implement six distinct editing methodologies to assess robust generalization and locality: fine-tuning (FT), MEND (meta-learning), ROME and MEMIT (locate-and-edit), SERAC (external memory), and IKE (prompting demonstrations).
Given the overlap in benchmarks for robust generalization and locality, we select a subset for our experiments. Robust generalization is evaluated under single edit and multiple edit settings. Single edit metrics include Subject-Replace, Reverse-Relation, and One-Hop reasoning (Yao et al., 2023a). Multiple edit metrics include multi-hop editing accuracy (Zhong et al., 2023), as well as Conflict Score and Conflict Magnitude under both the Reverse Conflict and Composite Conflict settings (Li et al., 2024b).
4.1.2 Deterioration of general LLM abilities
Following the settings of Gu et al. (2024), we assessed the deterioration of general LLM abilities post-editing using six methodologies: ROME, MEMIT, SERAC, MEND, KN, and GRACE. We evaluated general abilities across four NLP downstream tasks: open-domain question answering, sentiment analysis, reasoning, and summarization. These tasks were assessed after 10 to 40 edits on the Zero-Shot Relation Extraction (ZsRE) dataset (Levy et al., 2017), comparing the results against pre-editing performance. More details on the selected downstream tasks are in Appendix B.
4.2 Experimental Results and Discussion
In general, current editing methodologies exhibit suboptimal performance in both robust generalization and locality. Regarding robust generalization, IKE, which leverages prompt demonstrations, outperforms the other methodologies in single edit conditions. However, IKE’s performance noticeably declines in multiple edit scenarios, suggesting that prompt-based editing may become confused when multiple logically related facts are edited. Conversely, fine-tuning and meta-learning-based methods are less susceptible to confusion after editing multiple related facts.
Regarding locality, IKE maintains stable performance across metrics in single edit settings. Parameter-modifying methods excel in Other Attribution but decline in other metrics, except MEMIT, which remains stable across all metrics. In the multiple edit scenario, all methods except SERAC exhibit relatively similar performance. SERAC displays a low edit success rate and distortion rate, suggesting that its scope classifier does not adopt most edits in this scenario. This may be attributed to its weakness in recovering edited facts, which is crucial in this metric setting.
In terms of general LLM abilities, the number of edits affects methods differently. Meta-learning methods like MEND degrade significantly after 10-20 edits. Locate-and-edit methods such as ROME and KN degrade after 10 edits, while MEMIT remains stable after 40 edits. This disparity can be attributed to MEMIT’s strategy of adjusting parameters across multiple layers, as opposed to ROME’s single-layer edits and KN’s approach of modifying a few neurons. This distribution of parameter modifications across layers helps mitigate deterioration.
GRACE, which stores edited facts with additional parameters, shows no performance change in downstream tasks after edits. One possible explanation is that the edits are conducted on the ZsRE dataset, which is distinct from the requirements of downstream tasks, leading to the stored facts not being retrieved during inference. Similarly, SERAC, utilizing external memory for edited facts, preserves general NLP abilities post-editing. This preservation stems from SERAC being trained once before editing begins, solely performing inference during editing, thereby preventing changes in the model’s output, even after multiple edits.
Overall, parameter-modifying methods degrade downstream task performance by altering pre-trained LLM parameters. In contrast, parameter-preserving methods maintain the original parameters, resulting in stable downstream task performance even after multiple edits.
5 Future Prospects
5.1 Leveraging Information Retrieval and External Memory
Previous research has demonstrated the benefits of utilizing external knowledge bases, as opposed to relying solely on internal knowledge, to guide LLMs in generating content based on a predefined set of facts. These methods effectively separate the factual knowledge stored in LLMs from the inference processes, thus reducing potential biases encoded within the models.
External knowledge bases can include diverse sources such as extensive text corpora, structured tables, or even simple key-value databases. Once these knowledge sources are provided, one can either finetune the LLMs to enhance their ability to retrieve information or employ prompting and in-context learning techniques to query these sources while keeping the model parameters intact. Such approaches not only eliminate the need to verify and edit false factual knowledge within the LLMs but also facilitate the use of attribution and reflection methods. This ensures that the generated content aligns with the predefined external knowledge base, thereby enhancing both accuracy and accountability.
5.2 Improve Understandings of LLMs’ Internal Knowledge Structures
While the identification of factual knowledge storage in LLMs has been extensively explored in recent literature (Meng et al., 2022, 2023; Dai et al., 2022; Hernandez et al., 2024; Geva et al., 2021), the correlation between the location of knowledge and the success rate of model editing remains low (Hase et al., 2023). Additionally, despite evidence suggesting a strong connection between factual knowledge and the feed-forward network layers (Meng et al., 2022; Geva et al., 2021, 2022), recent findings (Li et al., 2024a) indicate that updates to multi-head self-attention layers also lead to improved outcomes. These studies highlight that merely locating fact storage does not fully elucidate the underlying mechanisms of knowledge structures in LLMs. Therefore, further research into how knowledge locations interact with model predictions is essential for advancing the interpretability and controllability of LLMs.
In addition to enhancing the success rate of edits, preserving the general capabilities of LLMs is crucial for assessing the efficacy of model editing methods, as discussed in Section 3.3. Recent breakthroughs in identifying regions within models that correlate with general linguistic abilities (Zhang et al., 2024b) have opened up a direction for future research in model editing. Specifically, by locating these critical areas, it is possible to perform targeted modifications while keeping these regions away from alterations, thereby preventing the deterioration of general abilities. Consequently, advancements in related fields would ensure that edits could be performed without compromising the overall performance of the LLMs, thereby significantly enhancing the specificity and effectiveness of current model editing methods.
5.3 Improve Robustness of Knowledge Editing
Even after a successful edit (one that achieves fair scores on existing metrics), the edited model may reject the modification when the knowledge about the altered concept is challenged in extended dialogues. It might instead revert to the pre-edit version (reversion) or output an ambiguous answer about the edited concept (confusion). Given the interconnected nature of knowledge, experiments reveal that the more popular the knowledge in the benchmark, the easier it is for the modified model to trace back to the original concept (Ma et al., 2024). This highlights the unsatisfactory robustness of existing editing strategies. A more comprehensive understanding of how LLMs store and process different knowledge entities is crucial for more robust editing. Specific benchmarks and automated metrics addressing these issues are also lacking. Moreover, knowledge-focused editing does not remove the hallucinations inherited from the pre-edit model. TruthX (Zhang et al., 2024) tries to alleviate hallucination with a parameter-preserving approach that maps the LLM’s internal representations to semantic and truthful spaces and edits truthfulness in the truthful space. Combining truthfulness and knowledge adjustment in the same space may provide a practical solution.
6 Conclusion
Although model editing techniques appear promising for cost-effectively updating knowledge, they still have significant pitfalls. Current editing methods often struggle to make logical inferences based on the edited facts, introduce unintended alterations of non-target knowledge, and cause deterioration in model performance, particularly in the case of parameter-modifying methods. Editing techniques that leverage information retrieval can mitigate deviations in model abilities by keeping model parameters intact, as demonstrated in our experiments. Moreover, gaining a deeper understanding of how models store and process knowledge can enhance the controllability of edited facts, leading to greater robustness. We hope our work illuminates potential directions for future improvements in knowledge editing.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.
- Black et al. (2022) Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 95–136, virtual+Dublin. Association for Computational Linguistics.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models.
- Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.
- Cohen et al. (2023) Roi Cohen, Eden Biran, Ori Yoran, Amir Globerson, and Mor Geva. 2023. Evaluating the ripple effects of knowledge editing in language models. arXiv preprint arXiv:2307.12976.
- Dai et al. (2022) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8493–8502.
- Dong et al. (2022) Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. 2022. Calibrating factual knowledge in pretrained language models. Findings of Empirical Methods in Natural Language Processing (EMNLP).
- Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Gu et al. (2024) Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, and Nanyun Peng. 2024. Model editing can hurt general abilities of large language models.
- Gupta et al. (2024a) Akshat Gupta, Anurag Rao, and Gopala Anumanchipalli. 2024a. Model editing at scale leads to gradual and catastrophic forgetting.
- Gupta et al. (2024b) Akshat Gupta, Dev Sajnani, and Gopala Anumanchipalli. 2024b. A unified framework for model editing.
- Hartvigsen et al. (2023) Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. 2023. Aging with grace: Lifelong model editing with discrete key-value adaptors. In Advances in Neural Information Processing Systems.
- Hase et al. (2023) Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. 2023. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Thirty-seventh Conference on Neural Information Processing Systems.
- Hernandez et al. (2024) Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. 2024. Linearity of relation decoding in transformer language models. In The Twelfth International Conference on Learning Representations.
- Hoelscher-Obermaier et al. (2023) Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, and Fazl Barez. 2023. Detecting edit failures in large language models: An improved specificity benchmark. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11548–11559, Toronto, Canada. Association for Computational Linguistics.
- Hua et al. (2024) Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang. 2024. Propagation and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks.
- Huang et al. (2023) Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023. Transformer-patcher: One mistake worth one neuron. In The Eleventh International Conference on Learning Representations.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- Levy et al. (2017) Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.
- Li et al. (2024a) Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, and Jie Yu. 2024a. PMET: Precise model editing in a transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18564–18572.
- Li et al. (2024b) Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, and Huajun Chen. 2024b. Unveiling the pitfalls of knowledge editing for large language models. In The Twelfth International Conference on Learning Representations.
- Ma et al. (2024) Xinbei Ma, Tianjie Ju, Jiyang Qiu, Zhuosheng Zhang, Hai Zhao, Lifeng Liu, and Yulong Wang. 2024. Is it possible to edit large language models robustly? In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
- Mazzia et al. (2023) Vittorio Mazzia, Alessandro Pedrani, Andrea Caciolai, Kay Rottmann, and Davide Bernardi. 2023. A survey on knowledge editing of neural networks. arXiv preprint arXiv:2310.19704.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36. ArXiv:2202.05262.
- Meng et al. (2023) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. Mass editing memory in a transformer. The Eleventh International Conference on Learning Representations (ICLR).
- Meta (2024) Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-05-30.
- Mitchell et al. (2022a) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. 2022a. Fast model editing at scale. In International Conference on Learning Representations.
- Mitchell et al. (2022b) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022b. Memory-based model editing at scale. In International Conference on Machine Learning.
- OpenAI (2023) OpenAI. 2023. ChatGPT: Optimizing language models for dialogue. OpenAI Blog. https://openai.com/research/chatgpt.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
- Tan et al. (2024) Chenmien Tan, Ge Zhang, and Jie Fu. 2024. Massive editing for large language models via meta learning. In International Conference on Learning Representations.
- Team (2024) Gemini Team. 2024. Gemini: A family of highly capable multimodal models.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models.
- Valipour et al. (2023) Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. 2023. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287, Dubrovnik, Croatia. Association for Computational Linguistics.
- Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
- Wang et al. (2019) Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors. 2019. Proceedings of the 2nd Workshop on New Frontiers in Summarization. Association for Computational Linguistics, Hong Kong, China.
- Wang et al. (2024) Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang. 2024. DeepEdit: Knowledge editing as decoding with constraints. ArXiv, abs/2401.10471.
- Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Yang et al. (2024) Wanli Yang, Fei Sun, Xinyu Ma, Xun Liu, Dawei Yin, and Xueqi Cheng. 2024. The butterfly effect of model editing: Few edits can trigger large language models collapse. arXiv preprint arXiv:2402.09656.
- Yao et al. (2023a) Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023a. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172.
- Yao et al. (2023b) Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023b. Editing large language models: Problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10222–10240, Singapore. Association for Computational Linguistics.
- Yu et al. (2024) Lang Yu, Qin Chen, Jie Zhou, and Liang He. 2024. MELO: Enhancing model editing with neuron-indexed dynamic LoRA. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19449–19457.
- Zhang et al. (2024a) Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, et al. 2024a. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286.
- Zhang et al. (2024b) Zhihao Zhang, Jun Zhao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024b. Unveiling linguistic regions in large language models.
- Zheng et al. (2023) Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. 2023. Can we edit factual knowledge by in-context learning? In The 2023 Conference on Empirical Methods in Natural Language Processing.
- Zhong et al. (2023) Zexuan Zhong, Zhengxuan Wu, Christopher D Manning, Christopher Potts, and Danqi Chen. 2023. MQuAKE: Assessing knowledge editing in language models via multi-hop questions. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Appendix A Detailed Explanation of Evaluation Metrics and Examples
A.1 Portability / Generalization
Single Edit
In the single edit scenario, we further classify the methods into two settings:
- One-Hop: This setting focuses on evaluating the impact of a single edit on direct, one-hop reasoning tasks.
- Multi-Hop: This setting assesses the impact of a single edit on more complex, multi-hop reasoning tasks.
For one-hop evaluations, we adopt the methods proposed by Yao et al. (2023a). These include:
- Subject Replace: This metric tests the model’s generalization ability by replacing the subject in the question with an alias or synonym, assessing whether the edited attribute generalizes to other descriptions of the same subject.
- Reversed Relation: This metric evaluates the model’s capability to handle reversed relations by filtering for suitable relations (such as one-to-one relations) and asking the reverse question to check whether the target entity is also updated.
- One-Hop Test: This metric assesses the edited language model’s performance on downstream tasks that require one-hop reasoning.
In the multi-hop setting, we assess the model’s performance on multi-hop questions using the evaluation methods proposed by Zhong et al. (2023), which include:
- Edit-wise Success Rate (EW): This metric measures how many facts can be successfully recalled from the edited language model.
\[ \mathrm{EW} = \mathbb{1}\left[ f^{*}(s, r) = o^{*} \right] \tag{5} \]
where $f^{*}$ is the model after editing and $(s, r, o^{*})$ is an edited fact.
- Instance-wise Accuracy (IW): This metric measures the fraction of multi-hop instances for which the model can recall all the individual single-hop facts. This metric is crucial for multi-hop performance, as the model must encode each fact to answer the multi-hop question.
\[ \mathrm{IW} = \mathbb{1}\left[ \bigwedge_{(s, r, o^{*}) \in \mathcal{C}^{*}} f^{*}(s, r) = o^{*} \right] \tag{6} \]
where $\mathcal{C}^{*}$ is the chain of facts of a multi-hop question. In this chain, the object of the $i$-th fact is the subject of the next fact (i.e., $o_i = s_{i+1}$).
- Multi-hop Accuracy (MH): This metric assesses the accuracy of the original and edited language models on multi-hop questions. In the MQuAKE dataset (Zhong et al., 2023), there are three generated multi-hop questions for each instance. If any of the three questions is correctly answered by the model, we consider the instance answered accurately.
\[ \mathrm{MH} = \mathbb{1}\left[ \bigvee_{q \in \mathcal{Q}} f^{*}(q) = a^{*} \right] \tag{7} \]
where $\mathcal{Q}$ is a set of similar multi-hop questions with the same answer $a^{*}$.
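The three multi-hop metrics can be sketched in a few lines of Python. This is only an illustrative sketch: the `recall` and `answer` callables, the toy `memory` and `answers` tables, and the averaging over instances are assumptions made here for illustration and are not taken from Zhong et al. (2023).

```python
from typing import Callable, Sequence, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, expected object after editing)


def edit_wise_success(recall: Callable[[str, str], str], facts: Sequence[Fact]) -> float:
    """Fraction of edited facts that the model recalls correctly."""
    return sum(recall(s, r) == o for s, r, o in facts) / len(facts)


def instance_wise_accuracy(recall: Callable[[str, str], str],
                           chains: Sequence[Sequence[Fact]]) -> float:
    """Fraction of instances whose whole chain of single-hop facts is recalled."""
    return sum(all(recall(s, r) == o for s, r, o in chain) for chain in chains) / len(chains)


def multi_hop_accuracy(answer: Callable[[str], str],
                       instances: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Fraction of instances where at least one question variant is answered correctly."""
    return sum(any(answer(q) == a for q in qs) for qs, a in instances) / len(instances)


# Hypothetical stand-ins for an edited model (illustration only).
memory = {("Hamlet", "author"): "Agatha Christie",
          ("Agatha Christie", "citizenship"): "United Kingdom"}
answers = {"What is the citizenship of the author of Hamlet?": "United Kingdom"}


def recall(subject: str, relation: str) -> str:
    return memory.get((subject, relation), "")


def answer(question: str) -> str:
    return answers.get(question, "")


f1 = ("Hamlet", "author", "Agatha Christie")
f2 = ("Agatha Christie", "citizenship", "United Kingdom")
f3 = ("Odyssey", "author", "Homer")  # not stored in the toy memory
questions = ["What is the citizenship of the author of Hamlet?",
             "Which country is the author of Hamlet a citizen of?"]

print(edit_wise_success(recall, [f1, f2, f3]))
print(instance_wise_accuracy(recall, [[f1, f2], [f3]]))
print(multi_hop_accuracy(answer, [(questions, "United Kingdom")]))
```

Note that the instance-wise check uses `all` over the chain while the multi-hop check uses `any` over the question variants, mirroring the "all single-hop facts" and "any of the three questions" wording above.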
Multiple Edits
In the multiple edits scenario, we test the model’s performance after applying multiple edits. For this, we use the settings and evaluation methods from Li et al. (2024b). The settings consist of:
- Reverse Conflict: This setting introduces conflicts by editing facts with reverse relations. For example:
edit 1: $(s_1, r_1, o_1 \to o_2)$
Hamlet was written by Shakespeare → Agatha Christie.
edit 2: $(o_2, r_2, s_1 \to s_2)$
The notable work of Agatha Christie is Hamlet → Odyssey.
The updated knowledge could then be represented as: $k_o = (s_1, r_1, o_2)$ and $k_n = (s_2, r_1, o_2)$.
- Composite Conflict: This setting explores more complex situations where the edits are associated with a fact that is not influenced by the editing (a tied fact). For example:
edit 1: $(s_1, r_1, o_1 \to o_2)$
Hamlet was written in English → French.
edit 2: $(s_2, r_2, o_2 \to o_3)$
Shakespeare wrote in French → German.
tied fact: $(s_1, r, s_2)$
The notable work of Shakespeare is Hamlet.
where $r \wedge r_1 \to r_2$ is a logical rule. The updated knowledge could then be represented as: $k_o = (s_1, r_1, o_2)$ and $k_n = (s_1, r_1, o_3)$.
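The evaluation methods include:
- Conflict Score (CS): Measures how well a knowledge editing method handles knowledge conflicts by calculating the proportion of cases in which the new fact is more probable than the old fact after knowledge editing.
\[ \mathrm{CS} = \mathbb{1}\left[ p_{\theta'}(k_n) > p_{\theta'}(k_o) \right] \tag{8} \]
- Conflict Magnitude (CM): Estimates the decrease in probability of the old fact after editing.
\[ \mathrm{CM} = \frac{p_{\theta^{m}}(k_o) - p_{\theta'}(k_o)}{p_{\theta^{m}}(k_o)} \tag{9} \]
where $k_o$ and $k_n$ denote the old and new facts, $\theta'$ is the model parameters after both edits, and $\theta^{m}$ is the intermediate model parameters after edit 1.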
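As a rough illustration of the two conflict metrics, the sketch below computes them from lists of fact probabilities. The function and argument names are hypothetical. Normalizing the probability drop by the intermediate probability is an assumption about how the "decrease" is measured, and it requires the intermediate probabilities to be non-zero.

```python
from typing import Sequence


def conflict_score(p_new_final: Sequence[float], p_old_final: Sequence[float]) -> float:
    """Proportion of cases where the new fact is more probable than the old fact
    under the fully edited model."""
    wins = [new > old for new, old in zip(p_new_final, p_old_final)]
    return sum(wins) / len(wins)


def conflict_magnitude(p_old_mid: Sequence[float], p_old_final: Sequence[float]) -> float:
    """Average relative drop in the old fact's probability between the intermediate
    model (after edit 1) and the final model (assumed normalization; needs p_old_mid > 0)."""
    drops = [(mid - final) / mid for mid, final in zip(p_old_mid, p_old_final)]
    return sum(drops) / len(drops)


# Made-up probabilities for two conflicting edit pairs (illustration only).
print(conflict_score([0.6, 0.2], [0.3, 0.5]))
print(conflict_magnitude([0.8, 0.5], [0.2, 0.5]))
```

A negative magnitude in this sketch would mean that the old fact became more probable after the second edit.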
A.2 Locality
Single Edit
In the single edit scenario for locality, we adopt the methods proposed by Yao et al. (2023b), including:
- Other Attribution: The modified CounterFact dataset is used to test whether the non-target attributes of the edited subject remain the same. For example, if we edit Lionel Messi to be a basketball player, his nationality should stay the same.
- Distract Neighbor: The neighborhood prompts in CounterFact are modified by prepending the edited statement. For example, if the original prompt is "Windows 11 is a product of __", the modified prompt would be "Windows 11 is a product of Google. Office 365, developed by __". This tests whether the model’s prediction is "distracted" by the revised prompt.
- Other Task: The edited model is tested on the multiple-choice QA task Physical Interaction QA (PIQA; Bisk et al., 2020), and performance is measured by accuracy.
Multiple Edits
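We also test the model’s locality in the multiple edits scenario, adopting the methods and evaluations from Li et al. (2024b). The settings consist of:
- Round Edit: This setting edits a knowledge triplet back and forth, for example:
edit 1: $(s, r, o_1 \to o^{*})$
edit 2: $(s, r, o^{*} \to o_1)$
The evaluation metrics include:
- Distortion (D) (Li et al., 2024b):
\[ \mathrm{D} = \mathrm{JS}\left( p_{\theta}(\mathrm{Obj} \mid (s, r)),\; p_{\theta'}(\mathrm{Obj} \mid (s, r)) \right) \tag{10} \]
This estimates the JS divergence between the distributions over the objects $\mathrm{Obj}$ before and after editing, where $\theta$ and $\theta'$ are the model parameters before and after the edits.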
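- Ignore Rate (IR) (Li et al., 2024b):
(11)
- Failure Rate (FR) (Li et al., 2024b):
(12)
- Tied Fact Damage (TFD) (Li et al., 2024b):
\[ \mathrm{TFD} = \frac{p_{\theta^{m}}(k_f) - p_{\theta'}(k_f)}{p_{\theta^{m}}(k_f)} \tag{13} \]
where $k_f$ denotes the tied facts, $\theta^{m}$ is the intermediate model parameters after edit 1, and $\theta'$ is the model parameters after all edits.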
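The Distortion metric is described as a JS divergence between the object distributions before and after editing. A minimal sketch follows, assuming both distributions are given as probability lists over the same ordered set of objects and using natural logarithms. Both choices are assumptions made here, and the helper names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q), skipping zero-probability terms of p (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: the mean KL divergence of p and q to their mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Made-up object distributions before and after a round edit (illustration only).
before = [0.5, 0.3, 0.2]
after = [0.7, 0.2, 0.1]
print(js_divergence(before, after))
```

With natural logarithms the value lies between 0 (identical distributions) and ln 2 (disjoint support), so larger values indicate a more distorted object distribution.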
Other Locality Metrics
- Neighborhood KL Divergence (Hoelscher-Obermaier et al., 2023):
(14)
- Neighborhood Score (NS) (Meng et al., 2022): collect a set of "neighbor" subjects and evaluate the success fraction for $P[o^{c}] > P[o^{*}]$, where $o^{c}$ denotes the correct facts and $o^{*}$ denotes the false facts.
- Neighborhood Magnitude (NM) (Meng et al., 2022): the difference between $P[o^{c}]$ and $P[o^{*}]$ for the "neighborhood" subjects.
Appendix B Detailed Explanation of Experiments for Deterioration of General LLM Abilities
We follow the settings of Gu et al. (2024) for these experiments. A different evaluation metric is applied to each downstream task: Exact Match for open-domain question answering on the Natural Questions dataset (Kwiatkowski et al., 2019), accuracy for sentiment analysis on the SST-2 dataset (Socher et al., 2013), solve rate for reasoning on the GSM8K dataset (Cobbe et al., 2021), and ROUGE score for summarization on the SAMSum dataset (Wang et al., 2019).
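As an illustration of the first of these metrics, the sketch below computes Exact Match for a single prediction. The normalization steps (lowercasing, removing punctuation and English articles, collapsing whitespace) follow a common question answering convention and are an assumption here; the exact normalization used in the experiments is not specified in this appendix.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> float:
    """1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    return float(any(normalize(prediction) == normalize(ref) for ref in references))


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(exact_match("Paris", ["London"]))
```

Averaging this per-example value over a dataset gives the corpus-level score.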