Editing the Mind of Giants: An In-Depth Exploration of Pitfalls of Knowledge Editing in Large Language Models

Cheng-Hsun Hsueh Paul Kuo-Ming Huang Tzu-Han Lin Che-Wei Liao
Hung-Chieh Fang Chao-Wei Huang Yun-Nung Chen
National Taiwan University, Taipei, Taiwan
{r12922059, b08902072, b08902126, r09922a25}@csie.ntu.edu.tw
{b09902106, f07922069}@csie.ntu.edu.tw [email protected]
Abstract

Knowledge editing is a rising technique for efficiently updating factual knowledge in Large Language Models (LLMs) with minimal alteration of parameters. However, recent studies have identified concerning side effects, such as knowledge distortion and the deterioration of general abilities, that have emerged after editing. This survey presents a comprehensive study of these side effects, providing a unified view of the challenges associated with knowledge editing in LLMs. We discuss related works and summarize potential research directions to overcome these limitations. Our work highlights the limitations of current knowledge editing methods, emphasizing the need for a deeper understanding of the inner knowledge structures of LLMs and for improved knowledge editing methods. To foster future research, we have publicly released complementary materials, such as a paper collection, at https://github.com/MiuLab/EditLLM-Survey.

* Equal contribution.



1 Introduction

Recent advancements in large language models (LLMs) have significantly improved Natural Language Processing (NLP; Brown et al., 2020). LLMs can now understand and generate language at a human-like level, establishing them as milestones in generative AI. This proficiency is due to their ability to extract knowledge from extensive text corpora. However, the mechanisms of knowledge storage in LLMs are not well understood, leading to potential issues in real-world applications. For instance, deploying LLMs as chatbots raises concerns about the reliability of their generated content, since the stored knowledge may be inaccurate and cannot be readily inspected.

Figure 1: An overview of pitfalls in current knowledge editing methods. The subsequent sections dive into three key challenges: generalization issues (Section 3.1), localization issues (Section 3.2), and deterioration of general LLM abilities (Section 3.3).

To address these issues, researchers have explored various methods. Traditional approaches like fine-tuning, continual learning, and retraining are either computationally expensive or detrimental to the LLMs’ overall performance. Recently, knowledge editing has emerged as a promising alternative, offering minimal computational costs and fewer alterations (Cao et al., 2021; Dai et al., 2022; Meng et al., 2022, 2023; Dong et al., 2022; Mitchell et al., 2022a, b; Hartvigsen et al., 2023; Huang et al., 2023; Yu et al., 2024; Zheng et al., 2023; Li et al., 2024a; Tan et al., 2024; Gupta et al., 2024b; Wang et al., 2024). This approach allows for precise adjustments to LLMs, enhancing their practicality and reliability in real-world applications.

Knowledge editing can be divided into two main categories: parameter-modifying and parameter-preserving. Both aim to refine LLM knowledge efficiently while avoiding the drawbacks of previous tuning methods (Yao et al., 2023a). Parameter-modifying methods, including meta-learning (Cao et al., 2021; Mitchell et al., 2022a; Tan et al., 2024) and locate-then-edit techniques (Dai et al., 2022; Meng et al., 2022, 2023; Li et al., 2024a; Gupta et al., 2024b), strive to update a targeted set of model parameters while making effective edits. Parameter-preserving methods introduce external components, like knowledge bases (Mitchell et al., 2022b; Yu et al., 2024) or extra model parameters (Dong et al., 2022; Huang et al., 2023; Hartvigsen et al., 2023; Yu et al., 2024) to maintain the integrity of pre-trained LLMs while updating their knowledge.

Despite the success of knowledge editing, challenges remain. Knowledge editing can have unintended side effects, potentially damaging the general abilities and intrinsic structures of LLMs. Previous research has mainly focused on performance improvements and innovations within knowledge editing methods, with limited attention to potential drawbacks. Consequently, this survey aims to provide a holistic view of current issues in the knowledge editing paradigm and encourage further investigations into the pitfalls and intrinsic knowledge structures of LLMs. A brief overview of the discussed pitfalls is shown in Figure  1.

This survey is organized as follows: Section 2 introduces knowledge editing and its methods. Section 3 discusses current challenges, benchmarks, and methodologies. In Section 4, we present experimental results evaluating different editing methods. Finally, Section 5 explores related studies and future research directions. We summarize our contributions as follows:

  1. We are the first to provide a comprehensive analysis of the side effects associated with existing knowledge editing techniques.

  2. We systematically organized previous research and conducted experiments to benchmark the side effects of knowledge editing, providing a unified perspective on this issue.

  3. We discussed related studies and potential research directions to address existing challenges, encouraging further exploration and understanding in this field.

Figure 2: Illustration of properties that knowledge editing methods should satisfy. An ideal knowledge editing method should be reliable, generalizable to relevant queries, and should not alter the outputs of irrelevant queries.
Figure 3: Illustration of the two categories of model editing methods in transformer-based large language models: parameter-modifying (meta-learning and locate-and-edit) and parameter-preserving (additional parameters, external memory, in-context learning, and decoding) methods. MHSA and FFN stand for multi-head self-attention and feed-forward network, respectively.
| Category | Strategy | Method |
|---|---|---|
| Parameter-modifying | Meta-learning | Knowledge Editor (Cao et al., 2021) |
| | | MEND (Mitchell et al., 2022a) |
| | | MALMEN (Tan et al., 2024) |
| | Locating and editing | Knowledge Neuron (Dai et al., 2022) |
| | | ROME (Meng et al., 2022) |
| | | MEMIT (Meng et al., 2023) |
| | | PMET (Li et al., 2024a) |
| | | EMMET (Gupta et al., 2024b) |
| Parameter-preserving | Additional parameters | CaliNET (Dong et al., 2022) |
| | | T-Patcher (Huang et al., 2023) |
| | | GRACE (Hartvigsen et al., 2023) |
| | | MELO (Yu et al., 2024) |
| | External memory | SERAC (Mitchell et al., 2022b) |
| | | MeLLo (Zhong et al., 2023) |
| | In-context learning | IKE (Zheng et al., 2023) |
| | Decoding | DeepEdit (Wang et al., 2024) |

Table 1: Overview of current knowledge editing methods. The methods are categorized into two major families, namely parameter-modifying and parameter-preserving methods, each containing several strategies. Methods marked with † have the ability to process sequential edits.

2 Overview of Knowledge Editing

2.1 Problem Definition

Knowledge editing for LLMs entails modifying the output of LLMs in response to specific edit queries, with the aim of minimizing alterations to their original behavior (Yao et al., 2023a; Mazzia et al., 2023; Zhang et al., 2024a). In this section, we follow the notation from Mazzia et al. (2023).

We denote the input and output space as $\mathbb{X}$ and $\mathbb{Y}$, respectively. The function space $\mathbb{F}:\mathbb{X}\rightarrow\mathbb{Y}$ is estimated by the base model $f_{\theta_0}$ parameterized by $\theta_0\in\Theta$. Finally, let $Z_e=\{(x_e,y_e)\ |\ f_{\theta_0}(x_e)\neq y_e\}$ be the set of edit queries we would like to apply to the base model. The goal of knowledge editing is to efficiently derive the edited model $f_{\theta_e}$ from the base model that satisfies the following:

$$f_{\theta_e}(x_e)=y_e,\quad \forall (x_e,y_e)\in Z_e \tag{1}$$

The ideal edited model $f_{\theta_e}$ should satisfy three properties: reliability, generalization, and locality. An illustration is shown in Figure 2.

Reliability

Given an edit query $(x_e,y_e)$, the edited model $f_{\theta_e}$ should output the target answer $y_e$ when given the target input $x_e$, i.e., $f_{\theta_e}(x_e)=y_e$. The reliability of an editing method is measured by calculating the average edit success rate:

$$\mathbb{E}_{(x_e',y_e')\sim Z_e}\mathbbm{1}\{f_{\theta_e}(x_e')=y_e'\} \tag{2}$$

Generalization

The edited model should generalize the edited knowledge to relevant instances. The generalization metric is commonly formulated as the average success rate on the neighboring set:

$$\mathbb{E}_{(x_e',y_e')\sim N(x_e,y_e)}\mathbbm{1}\{f_{\theta_e}(x_e')=y_e'\}, \tag{3}$$

where $N(x_e,y_e)$ is the set of neighboring instances of an edit query $(x_e,y_e)$. Earlier works evaluate this metric by rephrasing the input prompts (Mitchell et al., 2022a; Meng et al., 2022; Huang et al., 2023).

Locality

The editing process should not affect instances unrelated to the edit queries. The locality set of an edit query $(x_e,y_e)$ can be defined as $L(x_e)=\{(x_{loc},y_{loc})\in\mathbb{X}\times\mathbb{Y}\ \mathrm{s.t.}\ x_{loc}\notin N(x_e,y_e)\land f_{\theta_0}(x_{loc})=y_{loc}\}$. The locality, also known as specificity, of an editing method is measured by the level of invariance of the model output before and after the edits, which can be calculated as follows:

$$\mathbb{E}_{(x_{loc},y_{loc})\sim L(x_e)}\mathbbm{1}\{f_{\theta_e}(x_{loc})=y_{loc}\} \tag{4}$$
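
As a rough illustration of how the three metrics above could be computed, the following sketch evaluates Equations (2) to (4) on a toy example. It assumes that a model is simply a function from a prompt string to an answer string; the lookup tables, prompts, and the helper name `success_rate` are hypothetical and only serve to make the indicator averages concrete.

```python
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str]  # (input prompt, expected output)


def success_rate(model: Callable[[str], str], examples: Iterable[Example]) -> float:
    """Average of the indicator 1{model(x) == y} over a set of (x, y) pairs."""
    examples = list(examples)
    if not examples:
        raise ValueError("examples must be non-empty")
    hits = sum(1 for x, y in examples if model(x) == y)
    return hits / len(examples)


# Hypothetical lookup tables standing in for the base and edited models.
base_answers = {
    "Where is the Louvre?": "Paris",
    "The Louvre is located in": "Paris",
    "Where is Big Ben?": "London",
    "Where is the Colosseum?": "Rome",
}
edited_answers = dict(base_answers)
edited_answers["Where is the Louvre?"] = "London"  # the edit took effect
edited_answers["Where is Big Ben?"] = "Paris"      # an unintended side effect


def base_model(x: str) -> str:
    return base_answers.get(x, "")


def edited_model(x: str) -> str:
    return edited_answers.get(x, "")


edit_set = [("Where is the Louvre?", "London")]          # Z_e
neighbor_set = [("The Louvre is located in", "London")]  # N(x_e, y_e)
# L(x_e): unrelated inputs paired with the base model's own outputs.
locality_inputs = ["Where is Big Ben?", "Where is the Colosseum?"]
locality_set = [(x, base_model(x)) for x in locality_inputs]

reliability = success_rate(edited_model, edit_set)
generalization = success_rate(edited_model, neighbor_set)
locality = success_rate(edited_model, locality_set)
print(reliability, generalization, locality)
```

In this toy setting the edit is reliable but fails on the rephrased prompt and disturbs one of the two unrelated facts, which is exactly the kind of gap between the three properties that the later sections examine.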

2.2 Current Methods

In this section, we introduce the current knowledge editing methods. The methods are categorized into parameter-modifying (Section 2.2.1) and parameter-preserving (Section 2.2.2) editing methods, each containing several strategies. An overview and illustration of current methods are included in Table 1 and Figure 3, respectively.

2.2.1 Parameter-Modifying

This category of methods, including meta-learning methods and locate-and-edit strategies, updates LLMs’ knowledge by modifying their parameters.

Meta-learning

Meta-learning methods train a hyper-network to predict the update of network parameters. For instance, KnowledgeEditor (Cao et al., 2021) trains a deep network to predict weight updates. MEND (Mitchell et al., 2022a) decomposes the gradient matrix into two rank-one matrices and utilizes a hyper-network to update these matrices, thereby accelerating the editing process. Built upon MEND, MALMEN (Tan et al., 2024) refines the process by formulating the aggregation of parameter shifts as a least-squares problem, further improving the scalability of meta-learning methods.

Locate and Edit

Locate-and-edit methods identify specific knowledge locations in LLMs for subsequent editing. KN (Dai et al., 2022) utilizes the proposed knowledge attribution method to pinpoint neurons expressing relational facts, allowing efficient updates or erasures without fine-tuning. ROME (Meng et al., 2022) proposes the causal tracing method for identifying neuron activations associated with specific knowledge. The authors demonstrate the significance of middle-layer feed-forward networks (FFNs) in factual predictions when processing the subject’s last token. Built upon the hypothesis that the FFN modules in a transformer layer can be viewed as key-value memories (Geva et al., 2021), ROME injects new knowledge into the key-value memories by deriving the closed-form solution of a least-squares problem. MEMIT (Meng et al., 2023) scales up ROME by editing a set of MLPs from consecutive middle layers via solving a normal equation. PMET (Li et al., 2024a) proposes to update multi-head self-attention (MHSA) modules in addition to FFNs. EMMET (Gupta et al., 2024b), on the other hand, integrates the objectives of ROME and MEMIT into a unified preservation-memorization objective, facilitating batch-editing capabilities for both methodologies.
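
To make the key-value view of an FFN weight more concrete, the sketch below applies a minimal-norm rank-one update so that a chosen key maps to a new value. This is a simplified illustration and not ROME's actual procedure: it assumes an identity key covariance, whereas ROME also accounts for statistics of previously stored keys, and the matrix, vectors, and function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, k: Vector) -> Vector:
    """Plain matrix-vector product W k."""
    return [sum(w_ij * k_j for w_ij, k_j in zip(row, k)) for row in W]


def rank_one_edit(W: Matrix, k_star: Vector, v_star: Vector) -> Matrix:
    """Return W + (v* - W k*) k*^T / (k*^T k*), so that the result maps k* to v*."""
    norm_sq = sum(k * k for k in k_star)
    if norm_sq == 0.0:
        raise ValueError("k_star must be non-zero")
    residual = [v - wk for v, wk in zip(v_star, matvec(W, k_star))]
    return [
        [w_ij + r_i * k_j / norm_sq for w_ij, k_j in zip(row, k_star)]
        for row, r_i in zip(W, residual)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # toy FFN projection acting as a key-value memory
k_star = [1.0, 0.0]           # key representing the edited subject (assumed)
v_star = [0.0, 2.0]           # value encoding the new fact (assumed)
k_other = [0.0, 1.0]          # a key orthogonal to k_star, for unrelated knowledge

W_edit = rank_one_edit(W, k_star, v_star)
print(matvec(W_edit, k_star))   # the edited key now retrieves v_star
print(matvec(W_edit, k_other))  # the orthogonal key is unaffected
```

Keys that are orthogonal to the edited key are left untouched by such an update, while keys that overlap with it are shifted. This offers one intuition for why edits can leak into related but non-target knowledge.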

2.2.2 Parameter-Preserving

Parameter-preserving methods alter LLM outputs by adding new parameters, integrating external memory, or using strategies such as in-context learning and specially designed decoding, without changing the parameters of the pre-trained LLM.

Additional Parameters

Some methods utilize additional parameters, such as adding new neurons or employing parameter-efficient techniques. CaliNET (Dong et al., 2022) extends the FFN modules with calibration memory slots to adjust the predicted token distribution. T-Patcher (Huang et al., 2023) adds neurons in the FFN’s last layer to rectify classification errors and incorrectly generated tokens, activating only in response to associated mistakes. GRACE (Hartvigsen et al., 2023) wraps a selected layer with an Adaptor that includes a codebook and deferral mechanism, learning to decode desired outputs while caching input error embeddings. The GRACE layer stores the edits and could be updated continuously over long deployments. MELO (Yu et al., 2024) utilizes DyLoRA (Valipour et al., 2023) modules to learn edits, indexing them in an inner vector database to dynamically activate corresponding LoRA blocks during inference.
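
The codebook-with-deferral idea can be sketched as follows. This is a toy approximation in the spirit of GRACE, not its implementation: it assumes fixed stored values and a single fixed radius, whereas the description above indicates that the actual method learns how to decode the desired outputs. The class and variable names are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


class CodebookAdaptor:
    """Toy wrapper around a layer: look up cached edits first, defer otherwise."""

    def __init__(self, layer: Callable[[Vector], Vector], radius: float):
        self.layer = layer
        self.radius = radius  # fixed deferral radius (an assumption of this sketch)
        self.entries: List[Tuple[Vector, Vector]] = []  # cached (key, value) pairs

    def add_edit(self, key: Vector, value: Vector) -> None:
        self.entries.append((key, value))

    def __call__(self, h: Vector) -> Vector:
        best = min(self.entries, key=lambda kv: math.dist(kv[0], h), default=None)
        if best is not None and math.dist(best[0], h) <= self.radius:
            return best[1]    # input falls inside the scope of a cached edit
        return self.layer(h)  # defer to the frozen original layer


def frozen_layer(h: Vector) -> Vector:
    return [2.0 * x for x in h]


adaptor = CodebookAdaptor(frozen_layer, radius=0.5)
adaptor.add_edit(key=[1.0, 1.0], value=[9.0, 9.0])

near = adaptor([1.1, 0.9])  # close to the cached key, so the stored value is used
far = adaptor([3.0, 0.0])   # outside every cached scope, so the layer is used
print(near, far)
```

Because the original layer is never modified, inputs outside the stored scopes behave exactly as before the edit.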

External Memory

Other methods utilize external memories for editing. SERAC (Mitchell et al., 2022b) leverages a scope classifier to determine whether a user-supplied edit example stored in its memory is related to the inputs. If no related example exists, the inputs are passed to the base model; otherwise, a counterfactual model generates modified answers using the inputs and the related example. MeLLo (Zhong et al., 2023) decomposes a multi-hop question into subquestions iteratively. The model then checks whether the tentative answer generated by the base model contradicts the most relevant facts retrieved from the edited fact memory and adjusts the outputs accordingly.
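
The routing logic described for SERAC can be illustrated with a small sketch. The scope classifier here is replaced by a token-overlap heuristic and both models by trivial functions, so everything below (names, threshold, and prompts) is an assumption made for illustration rather than SERAC's trained components.

```python
from typing import Callable, List, Optional, Tuple

Edit = Tuple[str, str]  # (edit prompt, new answer)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a trained scope classifier."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def route(query: str, memory: List[Edit], base_model: Callable[[str], str],
          counterfactual_model: Callable[[str, Edit], str],
          threshold: float = 0.5) -> str:
    """Answer from the edit memory when an edit is in scope, else from the base model."""
    best: Optional[Edit] = max(memory, key=lambda e: jaccard(query, e[0]), default=None)
    if best is not None and jaccard(query, best[0]) >= threshold:
        return counterfactual_model(query, best)
    return base_model(query)


memory = [("where is the louvre located", "London")]


def toy_base_model(query: str) -> str:
    return "Paris" if "louvre" in query.lower() else "unknown"


def toy_counterfactual_model(query: str, edit: Edit) -> str:
    return edit[1]  # simply echo the answer stored with the retrieved edit


in_scope = route("Where is the Louvre", memory, toy_base_model, toy_counterfactual_model)
out_of_scope = route("Who painted the Mona Lisa", memory, toy_base_model, toy_counterfactual_model)
print(in_scope, out_of_scope)
```

The quality of the scope decision determines both failure modes: a classifier that is too strict ignores valid edits, while one that is too permissive overrides unrelated queries.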

In-Context Learning and Decoding

Certain strategies require no additional parameters. IKE (Zheng et al., 2023) edits factual knowledge via in-context learning with demonstrations to guide the language model. DeepEdit (Wang et al., 2024) employs decoding constraints, including filtering step candidates, depth-first search to store valid candidates in a stack, and a greedy search to output the optimal path for multi-hop reasoning.
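
Editing through in-context learning amounts to constructing a prompt, as the following sketch suggests. The template, field names, and demonstration below are hypothetical and are not the exact format used by IKE; the point is only that the new fact and a few demonstrations are prepended to the query while the model weights stay untouched.

```python
from typing import List, Tuple

Demo = Tuple[str, str, str]  # (new fact, question, answer)


def build_edit_prompt(new_fact: str, query: str, demos: List[Demo]) -> str:
    """Prepend demonstrations and the new fact to the query; no weights are touched."""
    lines: List[str] = []
    for fact, question, answer in demos:
        lines.append(f"New Fact: {fact}")
        lines.append(f"Prompt: {question} {answer}")
        lines.append("")  # blank line between demonstrations
    lines.append(f"New Fact: {new_fact}")
    lines.append(f"Prompt: {query}")
    return "\n".join(lines)


demos = [("Big Ben is located in Rome.", "Big Ben is located in", "Rome")]
prompt = build_edit_prompt(
    new_fact="The Louvre is located in London.",
    query="The Louvre is located in",
    demos=demos,
)
print(prompt)
```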

3 Challenges of Knowledge Editing

While knowledge editing methods have been extensively researched, there is a lack of comprehensive studies on the related challenges. In this section, we discuss the pitfalls of knowledge editing from three perspectives: the inability to perform logical inference and generalize robustly (Section 3.1), unintended alteration of non-target knowledge (Section 3.2), and deterioration of general LLM abilities (Section 3.3).

| Challenge | Benchmark | Metric |
|---|---|---|
| Portability and Generalization | RippleEdits (Cohen et al., 2023) | Logical Generalization |
| | | Compositionality I |
| | | Compositionality II |
| | ConflictEdit (Li et al., 2024b) | Conflict Score |
| | | Conflict Magnitude |
| | | Success Score |
| | MQuAKE (Zhong et al., 2023) | Edit-wise Success Rate |
| | | Instance-wise Accuracy |
| | | Multi-hop Accuracy |
| | ReCoE (Hua et al., 2024) | QA Accuracy |
| | ZsRE + CounterFact (Yao et al., 2023b) | Subject-Replace |
| | | Reverse-Relation |
| | | One-Hop |
| Locality | RippleEdits (Cohen et al., 2023) | Subject Aliasing |
| | | Preservation |
| | | Relation Specificity |
| | RoundEdit (Li et al., 2024b) | Success Score |
| | | Distortion (↓) |
| | | Ignore Rate (↓) |
| | | Failure Rate (↓) |
| | | Tied Fact Damage (↓) |
| | CounterFact (Yao et al., 2023b) | Other-Attribution |
| | | Distract-Neighbor |
| | | Other-Task |
| | CounterFact (Meng et al., 2022) | Locality |
| | | Neighborhood Score |
| | | Neighborhood Magnitude |
| | CounterFact+ (Hoelscher-Obermaier et al., 2023) | Neighborhood KL Divergence |

Table 2: Performance benchmarks and evaluation metrics addressing generalization/portability and locality issues in knowledge editing methods. Unless specifically indicated by a downward arrow, higher values signify better performance in those evaluation metrics. The CounterFact benchmark is proposed by Meng et al. (2022), and CounterFact with a mark is modified to further examine the proposed metrics.

3.1 Inability to Logically Infer and Robustly Generalize

When a fact is updated, it is crucial not only to revise the specific piece of knowledge but also to evaluate the impact on the related reasoning chain. The term portability was recently proposed by Yao et al. (2023b) to evaluate the consequences of an edit and to further assess the robustness of generalization. They introduce three metrics to evaluate portability: Subject Replace (checking whether synonyms of the subject are edited), Reversed Relation (checking whether the reversed relation of the target is edited), and One Hop (assessing whether the modified knowledge is usable for further derivation). Similarly, the RippleEdits benchmark and its corresponding Logical Generalization and Compositionality metrics examine whether edited knowledge can be inferred in composite relations of facts (Cohen et al., 2023). Additionally, the ReCoE benchmark assesses the propagation of updates across interconnected facts using various reasoning schemes in complex question-answering datasets (Hua et al., 2024). Furthermore, the MQuAKE benchmark evaluates more complex reasoning and inference abilities on multi-hop questions (Zhong et al., 2023).
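
The ripple effect of a single edit on a reasoning chain can be illustrated with a toy knowledge graph. The sketch below is not taken from any of the benchmarks above; the triples, relation names, and the helper `follow` are assumptions, and the graph only serves to derive the answer that a well-edited model would be expected to give to a two-hop question.

```python
from typing import Dict, List, Optional, Tuple

KnowledgeGraph = Dict[Tuple[str, str], str]  # (subject, relation) -> object


def follow(kg: KnowledgeGraph, subject: str, relations: List[str]) -> Optional[str]:
    """Resolve a chain of relations starting from subject; None if a hop is missing."""
    current = subject
    for relation in relations:
        nxt = kg.get((current, relation))
        if nxt is None:
            return None
        current = nxt
    return current


base_kg: KnowledgeGraph = {
    ("Louvre", "located_in"): "Paris",
    ("Paris", "country"): "France",
    ("London", "country"): "United Kingdom",
}

# Apply one counterfactual edit: the Louvre is now located in London.
edited_kg = dict(base_kg)
edited_kg[("Louvre", "located_in")] = "London"

two_hop = ["located_in", "country"]
answer_before = follow(base_kg, "Louvre", two_hop)
answer_after = follow(edited_kg, "Louvre", two_hop)
print(answer_before, answer_after)
```

A model that has only memorized the edited sentence may still answer the two-hop question with the pre-edit country, which is the behavior that portability and multi-hop metrics are designed to expose.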

When editing multiple logically related facts simultaneously, models may suffer from confusion due to conflicts. The ConflictEdit benchmark is proposed to examine different editing methods on conflicting edits (Li et al., 2024b). The benchmarks and their corresponding metrics are arranged systematically in Table 2.

3.2 Unintended Alteration of Non-Target Knowledge

Locality is conventionally assessed using a locality dataset to evaluate edits on unrelated facts by measuring the Neighborhood Score and Neighborhood Magnitude (NS & NM; Meng et al., 2022, 2023). However, current evaluation methods do not adequately capture the post-edit effects on content beyond the locality dataset, which means the edited model could still contain unintended alterations. For example, while the location of the Louvre might be successfully changed from Paris to London, the edited model may inadvertently increase the likelihood of semantically related words (e.g., Big Ben for London) when mentioning the Louvre. A modified benchmark (CounterFact+) and a corresponding metric (Neighborhood KL Divergence) were therefore designed to disclose these previously implicit pitfalls (Hoelscher-Obermaier et al., 2023). Another study (Yao et al., 2023a) extends this exploration to three facets of locality: Other Relations (evaluating the retention of other attributes of the updated subject), Distract Neighborhood (assessing the divergence of unrelated input when juxtaposed with edited instances), and Other Tasks (examining the influence of edits on the performance of other tasks).

Unintended edits to unrelated facts may occur because a single edit can implicitly change the predictive distribution among objects associated with the same (subject, relation) pair. After multiple consecutive edits, these alterations can accumulate and distort the stored knowledge. To evaluate this condition, the concept of Knowledge Distortion has been introduced by Li et al. (2024b), which estimates the Jensen–Shannon divergence of the object set distribution before and after editing. This can be further extended to metrics such as the Ignore Rate, which measures how objects other than the target in the object set are neglected after editing, and the Failure Rate, which measures the proportion of instances where over half of the objects in the set are overlooked.
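
The Jensen–Shannon divergence underlying the distortion measure can be computed as in the sketch below. Only the standard definition of the divergence is implemented; how the object distributions are obtained from a model, and how the benchmark normalizes or aggregates them, is not specified here, and the example probabilities are made up for illustration.

```python
import math
from typing import Dict


def kl(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in bits over a shared support; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0.0)


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits; it lies between 0 and 1."""
    support = set(p) | set(q)
    p = {x: p.get(x, 0.0) for x in support}
    q = {x: q.get(x, 0.0) for x in support}
    m = {x: 0.5 * (p[x] + q[x]) for x in support}  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical probabilities over the objects of one (subject, relation) pair.
before_edit = {"obj_a": 0.4, "obj_b": 0.3, "obj_c": 0.3}
after_edit = {"obj_a": 0.9, "obj_b": 0.05, "obj_c": 0.05}  # obj_a was the edit target

distortion = js_divergence(before_edit, after_edit)
print(distortion)
```

A value of zero means the distribution over the object set is unchanged, while larger values indicate that the edit has shifted probability mass away from the other valid objects.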

3.3 Deterioration of General LLM Abilities

Current evaluation metrics are primarily limited to scenarios where editing is performed only once or infrequently, prompting some studies to extend evaluations to the outcomes after consecutive edits. Gupta et al. (2024a) find that post-edit models are susceptible to both gradual forgetting and catastrophic forgetting in sequential editing scenarios. Notably, their findings indicate that the extent of knowledge forgetting is more pronounced in meta-learning-based methods than in locate-and-edit methods. Additionally, models with parameters modified successively show a decline in performance across various downstream NLP tasks (Gu et al., 2024). Furthermore, perplexity is found to increase after consecutive edits across all parameter-modifying methods and different LLMs, and is proposed as another metric to indicate model collapse (Yang et al., 2024). These findings further corroborate that model editing aimed at modifying parameters adversely affects the general capabilities of the original LLMs.
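
For reference, perplexity can be computed from per-token log-probabilities as in the following sketch. The formula is the standard one; the log-probability values are invented for illustration and do not correspond to any measurement reported in the cited studies.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical log-probabilities assigned to the same text before and after edits.
log_probs_before = [math.log(0.5)] * 4
log_probs_after = [math.log(0.1)] * 4

ppl_before = perplexity(log_probs_before)
ppl_after = perplexity(log_probs_after)
print(ppl_before, ppl_after)
```

Tracking this quantity on a fixed held-out text after each edit gives an inexpensive signal of whether the model's general language modeling ability is degrading.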

Column groups: Single Edit / One-Hop (SR, RR, OH); Multiple Edit / Multiple-Hop (MH), Reverse Conflict (CS, CM), Composite Conflict (CS, CM).

| Methods | SR | RR | OH | MH | Reverse Conflict CS | Reverse Conflict CM | Composite Conflict CS | Composite Conflict CM |
|---|---|---|---|---|---|---|---|---|
| FT | 72.96 | 8.05 | 1.34 | 1.6 | 80.28 | 71.11 | 75.45 | 64.28 |
| MEND | 42.45 | 0.00 | 11.34 | 9.2 | 88.89 | 60.50 | 84.85 | 43.45 |
| ROME | 37.42 | 46.42 | 50.91 | 7.6 | 65.92 | -0.65 | 71.70 | 37.04 |
| MEMIT | 27.73 | 47.67 | 52.74 | 8.1 | 51.40 | -1.60 | 57.15 | -1.50 |
| SERAC | 17.79 | 1.30 | 5.53 | 7.9 | 50.89 | -0.02 | 50.84 | -0.02 |
| IKE | 88.77 | 92.96 | 55.38 | 8.3 | 58.20 | -1.00 | 50.52 | -0.99 |

Table 3: Experimental results for portability and generalization. SR: Subject-Replace, RR: Reverse-Relation, OH: One-Hop Accuracy, EW: Edit-wise, IW: Instance-wise, MH: Multi-hop Accuracy, CS: Conflict score, CM: Conflict magnitude. Higher values indicate better performance for all metrics in this table. Results marked with † are obtained in our own experiments, and other results are taken from previous studies.

Column groups: Single Edit (OA, DN, OT); Multiple Edit (Succ., D, IR, FR).

| Methods | OA | DN | OT | Succ. | D (↓) | IR (↓) | FR (↓) |
|---|---|---|---|---|---|---|---|
| FT | 12.88 | 9.48 | 49.56 | 100.0 | 16.12 | 97.48 | 97.32 |
| MEND | 73.50 | 32.96 | 48.86 | 99.12 | 14.35 | 87.64 | 86.56 |
| ROME | 78.94 | 50.35 | 52.12 | 99.80 | 13.95 | 78.98 | 77.60 |
| MEMIT | 86.78 | 60.47 | 74.62 | 99.72 | 13.50 | 72.03 | 70.44 |
| SERAC | 99.50 | 39.18 | 74.84 | 50.14 | 3.78 | 99.62 | 99.64 |
| IKE | 84.13 | 66.04 | 75.33 | 100.0 | 13.43 | 73.53 | 73.00 |

Table 4: Experimental results for locality. OA: Other-Attribution, DN: Distract-Neighbor, OT: Other-Task, Succ.: Success rate, D: Distortion, IR: Ignore rate, FR: Failure rate, TFD: Tied fact damage. Unless specifically indicated by a downward arrow, higher values signify better performance in those evaluation metrics. Results marked with † are obtained in our own experiments, and other results are taken from previous studies.

4 Experiments

The experiments evaluate robust generalization and locality (Section 4.1.1) as well as the deterioration of general LLM abilities (Section 4.1.2) across different editing methods.

4.1 Experimental Setup

4.1.1 Robust generalization and locality

We use GPT-J (Wang and Komatsuzaki, 2021) as the baseline model for editing and implement six distinct editing methodologies to assess robust generalization and locality: fine-tuning (FT), MEND (meta-learning), ROME and MEMIT (locate-and-edit), SERAC (external memory), and IKE (prompting demonstrations).

Given the overlap in benchmarks for robust generalization and locality, we select a subset for our experiments. Robust generalization is evaluated under single edit and multiple edit settings. Single edit metrics include Subject-Replace, Reverse-Relation, and One-Hop reasoning (Yao et al., 2023a). Multiple edit metrics include multi-hop editing accuracy (Zhong et al., 2023), as well as Conflict Score and Conflict Magnitude for both Reverse Conflict and Composite Conflict (Li et al., 2024b).

For locality, single edit metrics include Other-Attribution, Distract-Neighbor, and Other-Task (Yao et al., 2023b), while multiple edit metrics encompass Success Rate, Distortion, Ignore Rate, and Failure Rate (Li et al., 2024b).

4.1.2 Deterioration of general LLM abilities

Following the settings of Gu et al. (2024), we assessed the deterioration of general LLM abilities post-editing using six methodologies: ROME, MEMIT, SERAC, MEND, KN, and GRACE. We evaluated general abilities across four NLP downstream tasks: open-domain question answering, sentiment analysis, reasoning, and summarization. These tasks were assessed after 10 to 40 edits on the Zero-Shot Relation Extraction (ZsRE) dataset (Levy et al., 2017), comparing the results against pre-editing benchmarks. More details on the selected downstream tasks are in Appendix B.

4.2 Experimental Results and Discussion

In general, current editing methodologies exhibit suboptimal performance concerning both robust generalization and locality. Regarding robust generalization, IKE, which leverages prompt demonstrations, demonstrates superior performance over other methodologies in single edit conditions. However, IKE’s performance noticeably declines in multiple edit scenarios, suggesting that prompt demonstrations may become confused when editing multiple logically related facts. Conversely, fine-tuning and meta-learning-based methods are less susceptible to confusion after editing multiple related facts.

Regarding locality, IKE maintains stable performance across metrics in single edit settings. Parameter-modifying methods excel in Other-Attribution but decline in the other metrics, except MEMIT, which remains stable across all metrics. In multiple edit scenarios, all methods except SERAC exhibit relatively similar performance. SERAC displays a low edit success rate and a low distortion rate, suggesting that its scope classifier does not adopt most edits in this scenario. This may be attributed to its weakness in recovering edited facts, which is crucial in this metric setting.

In terms of general LLM abilities, the number of edits affects methods differently. Meta-learning methods like MEND degrade significantly after 10-20 edits. Locate-and-edit methods such as ROME and KN degrade after 10 edits, while MEMIT remains stable after 40 edits. This disparity can be attributed to MEMIT’s strategy of adjusting parameters across multiple layers, as opposed to ROME’s single-layer edits and KN’s approach of modifying a few neurons. This distribution of parameter modifications across layers helps mitigate deterioration.

GRACE, which stores edited facts with additional parameters, shows no performance change in downstream tasks after edits. One possible explanation is that the edits are conducted on the ZsRE dataset, which is distinct from the requirements of downstream tasks, leading to the stored facts not being retrieved during inference. Similarly, SERAC, utilizing external memory for edited facts, preserves general NLP abilities post-editing. This preservation stems from SERAC being trained once before editing begins, solely performing inference during editing, thereby preventing changes in the model’s output, even after multiple edits.

Overall, parameter-modifying methods degrade downstream task performance by altering pre-trained LLM parameters. In contrast, parameter-preserving methods maintain the original parameters, resulting in stable downstream task performance even after multiple edits.

Figure 4: Deterioration of general abilities after editing GPT-J with various editing algorithms (ROME, MEMIT, MEND, KN, SERAC, and GRACE), each applied 10 to 40 times. The edited models were subsequently evaluated on four downstream tasks: (a) open-domain question answering, (b) sentiment analysis, (c) summarization, and (d) reasoning.

5 Future Prospects

5.1 Leveraging Information Retrieval and External Memory

Previous research has demonstrated the benefits of utilizing external knowledge bases, as opposed to relying solely on internal knowledge, to guide LLMs in generating content based on a predefined set of facts. These methods effectively separate the factual knowledge stored in LLMs from the inference processes, thus reducing potential biases encoded within the models.

External knowledge bases can include diverse sources such as extensive text corpora, structured tables, or even simple key-value databases. Once these knowledge sources are provided, one can either finetune the LLMs to enhance their ability to retrieve information or employ prompting and in-context learning techniques to query these sources while keeping the model parameters intact. Such approaches not only eliminate the need to verify and edit false factual knowledge within the LLMs but also facilitate the use of attribution and reflection methods. This ensures that the generated content aligns with the predefined external knowledge base, thereby enhancing both accuracy and accountability.
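As a minimal sketch of the parameter-preserving route described above, the snippet below keeps edited facts in a simple key-value store and prepends any matching fact to the prompt, leaving the model itself untouched. The store layout, the `retrieve` and `build_prompt` helpers, and the naive substring-matching rule are all illustrative assumptions rather than a method taken from the surveyed papers.

```python
from typing import Dict, List, Tuple

# Hypothetical external memory: (subject, relation phrase) -> edited object.
FactStore = Dict[Tuple[str, str], str]


def retrieve(store: FactStore, question: str) -> List[str]:
    """Return stored facts whose subject appears in the question (naive substring match)."""
    lowered = question.lower()
    return [
        f"{subject} {relation} {obj}."
        for (subject, relation), obj in store.items()
        if subject.lower() in lowered
    ]


def build_prompt(store: FactStore, question: str) -> str:
    """Prepend retrieved facts to the question; model parameters are never touched."""
    facts = retrieve(store, question)
    if not facts:
        return question
    context = "\n".join(facts)
    return f"Answer using these facts:\n{context}\n\nQuestion: {question}"


# Counterfactual edit used purely as an illustration.
store: FactStore = {("Hamlet", "was written by"): "Agatha Christie"}
prompt = build_prompt(store, "Who wrote Hamlet?")
```

Because the edit lives only in `store`, removing or revising a fact is a dictionary update, and questions that match no stored subject reach the model unchanged.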

5.2 Improve Understanding of LLMs’ Internal Knowledge Structures

While the identification of factual knowledge storage in LLMs has been extensively explored in recent literature (Meng et al., 2022, 2023; Dai et al., 2022; Hernandez et al., 2024; Geva et al., 2021), the correlation between the location of knowledge and the success rate of model editing remains low (Hase et al., 2023). Additionally, despite evidence suggesting a strong connection between factual knowledge and the feed-forward network layers (Meng et al., 2022; Geva et al., 2021, 2022), recent findings (Li et al., 2024a) indicate that updates to multi-head self-attention layers also lead to improved outcomes. These studies highlight that merely locating fact storage does not fully elucidate the underlying mechanisms of knowledge structures in LLMs. Therefore, further research into how knowledge locations interact with model predictions is essential for advancing the interpretability and controllability of LLMs.

In addition to enhancing the success rate of edits, preserving the general capabilities of LLMs is crucial for assessing the efficacy of model editing methods, as discussed in Section 3.3. Recent breakthroughs in identifying regions within models that correlate with general linguistic abilities (Zhang et al., 2024b) have opened up a direction for future research in model editing. Specifically, by locating these critical areas, it is possible to perform targeted modifications while keeping these regions away from alterations, thereby preventing the deterioration of general abilities. Consequently, advancements in related fields would ensure that edits could be performed without compromising the overall performance of the LLMs, thereby significantly enhancing the specificity and effectiveness of current model editing methods.
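One way to read the idea of shielding critical regions from alteration is as a masked parameter update. The sketch below assumes a hypothetical flat parameter vector `params`, an update `delta` produced by some editing method, and a boolean mask `protected` marking indices associated with general linguistic ability. None of these names come from the cited works, and real models would apply the same idea per weight tensor.

```python
from typing import List


def apply_masked_edit(
    params: List[float], delta: List[float], protected: List[bool]
) -> List[float]:
    """Apply an edit update everywhere except at protected indices."""
    if not (len(params) == len(delta) == len(protected)):
        raise ValueError("params, delta, and protected must have the same length")
    return [p if keep else p + d for p, d, keep in zip(params, delta, protected)]


params = [0.5, -1.0, 2.0, 0.0]
delta = [0.25, 0.25, -0.5, 1.0]
protected = [False, True, False, True]  # hypothetical "linguistic region" mask

edited = apply_masked_edit(params, delta, protected)
# Indices 1 and 3 keep their original values; indices 0 and 2 receive the update.
```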

5.3 Improve Robustness of Knowledge Editing

Even after a successful edit (one that achieves fair scores on the existing metrics), the revised model may reject the modification when the knowledge concerning the altered concept is challenged in extended dialogues. It might instead revert to the pre-edit version (reversion) or output an ambiguous answer about the edited concept (confusion). Given the interconnected nature of knowledge, experiments show that the more popular the knowledge in the benchmark, the easier it is for the modified model to trace back to the original concept (Ma et al., 2024). This highlights the unsatisfactory robustness of existing editing strategies. A more comprehensive understanding of how LLMs store and process different knowledge entities is crucial for more robust editing. Specific benchmarks and automated metrics that address these issues are also lacking. Moreover, knowledge-focused editing does not remove the hallucinations inherited from the pre-edit model. TruthX (zhang2024truthx) tries to alleviate hallucination with a parameter-preserving approach that maps the LLM’s internal representations to semantic and truthful spaces and edits truthfulness in the truthful space. Combining truthfulness and knowledge adjustment in the same space may provide a practical solution.

6 Conclusion

Although model editing techniques appear promising for cost-effectively updating knowledge, they still have significant pitfalls. Current editing methods often struggle to make logical inferences based on the edited facts, and they introduce unintended alterations of non-target knowledge and deterioration in model performance, particularly in the case of parameter-modifying methods. Editing techniques that leverage information retrieval can mitigate deviations in model abilities by keeping model parameters intact, as demonstrated in our experiments. Moreover, gaining a deeper understanding of how models store and process knowledge can enhance the controllability of edited facts, leading to greater robustness. We hope our work illuminates potential directions for future improvements in knowledge editing.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.
  • Black et al. (2022) Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 95–136, virtual+Dublin. Association for Computational Linguistics.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models.
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.
  • Cohen et al. (2023) Roi Cohen, Eden Biran, Ori Yoran, Amir Globerson, and Mor Geva. 2023. Evaluating the ripple effects of knowledge editing in language models. arXiv preprint arXiv:2307.12976.
  • Dai et al. (2022) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8493–8502.
  • Dong et al. (2022) Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. 2022. Calibrating factual knowledge in pretrained language models. Findings of Empirical Methods in Natural Language Processing (EMNLP).
  • Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Gu et al. (2024) Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, and Nanyun Peng. 2024. Model editing can hurt general abilities of large language models.
  • Gupta et al. (2024a) Akshat Gupta, Anurag Rao, and Gopala Anumanchipalli. 2024a. Model editing at scale leads to gradual and catastrophic forgetting.
  • Gupta et al. (2024b) Akshat Gupta, Dev Sajnani, and Gopala Anumanchipalli. 2024b. A unified framework for model editing.
  • Hartvigsen et al. (2023) Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. 2023. Aging with grace: Lifelong model editing with discrete key-value adaptors. In Advances in Neural Information Processing Systems.
  • Hase et al. (2023) Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. 2023. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Hernandez et al. (2024) Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. 2024. Linearity of relation decoding in transformer language models. In The Twelfth International Conference on Learning Representations.
  • Hoelscher-Obermaier et al. (2023) Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, and Fazl Barez. 2023. Detecting edit failures in large language models: An improved specificity benchmark. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11548–11559, Toronto, Canada. Association for Computational Linguistics.
  • Hua et al. (2024) Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang. 2024. Propagation and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks.
  • Huang et al. (2023) Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023. Transformer-patcher: One mistake worth one neuron. In The Eleventh International Conference on Learning Representations.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
  • Levy et al. (2017) Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.
  • Li et al. (2024a) Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, and Jie Yu. 2024a. Pmet: Precise model editing in a transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18564–18572.
  • Li et al. (2024b) Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, and Huajun Chen. 2024b. Unveiling the pitfalls of knowledge editing for large language models. In The Twelfth International Conference on Learning Representations.
  • Ma et al. (2024) Xinbei Ma, Tianjie Ju, Jiyang Qiu, Zhuosheng Zhang, hai zhao, lifeng Liu, and Yulong Wang. 2024. Is it possible to edit large language models robustly? In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
  • Mazzia et al. (2023) Vittorio Mazzia, Alessandro Pedrani, Andrea Caciolai, Kay Rottmann, and Davide Bernardi. 2023. A survey on knowledge editing of neural networks. arXiv preprint arXiv:2310.19704.
  • Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36. ArXiv:2202.05262.
  • Meng et al. (2023) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. Mass editing memory in a transformer. The Eleventh International Conference on Learning Representations (ICLR).
  • Meta (2024) Meta. 2024. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-05-30.
  • Mitchell et al. (2022a) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. 2022a. Fast model editing at scale. In International Conference on Learning Representations.
  • Mitchell et al. (2022b) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022b. Memory-based model editing at scale. In International Conference on Machine Learning.
  • OpenAI (2023) OpenAI. 2023. Chatgpt: Optimizing language models for dialogue. OpenAI Blog. https://openai.com/research/chatgpt.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
  • Tan et al. (2024) Chenmien Tan, Ge Zhang, and Jie Fu. 2024. Massive editing for large language models via meta learning. In International Conference on Learning Representations.
  • Team (2024) Gemini Team. 2024. Gemini: A family of highly capable multimodal models.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models.
  • Valipour et al. (2023) Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. 2023. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
  • Wang et al. (2019) Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors. 2019. Proceedings of the 2nd Workshop on New Frontiers in Summarization. Association for Computational Linguistics, Hong Kong, China.
  • Wang et al. (2024) Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai wei Chang. 2024. Deepedit: Knowledge editing as decoding with constraints. ArXiv, abs/2401.10471.
  • Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  • Yang et al. (2024) Wanli Yang, Fei Sun, Xinyu Ma, Xun Liu, Dawei Yin, and Xueqi Cheng. 2024. The butterfly effect of model editing: Few edits can trigger large language models collapse. arXiv preprint arXiv:2402.09656.
  • Yao et al. (2023a) Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023a. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172.
  • Yao et al. (2023b) Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023b. Editing large language models: Problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10222–10240, Singapore. Association for Computational Linguistics.
  • Yu et al. (2024) Lang Yu, Qin Chen, Jie Zhou, and Liang He. 2024. Melo: Enhancing model editing with neuron-indexed dynamic lora. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19449–19457.
  • Zhang et al. (2024a) Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, et al. 2024a. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286.
  • Zhang et al. (2024b) Zhihao Zhang, Jun Zhao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024b. Unveiling linguistic regions in large language models.
  • Zheng et al. (2023) Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. 2023. Can we edit factual knowledge by in-context learning? In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Zhong et al. (2023) Zexuan Zhong, Zhengxuan Wu, Christopher D Manning, Christopher Potts, and Danqi Chen. 2023. MQuAKE: Assessing knowledge editing in language models via multi-hop questions. In The 2023 Conference on Empirical Methods in Natural Language Processing.

Appendix A Detailed Explanation of Evaluation Metrics and Examples

A.1 Portability / Generalization

Single Edit

In the single edit scenario, we further classify the methods into two settings:

  • One-Hop: This setting focuses on evaluating the impact of a single edit on direct, one-hop reasoning tasks.

  • Multi-Hop: This setting assesses the impact of a single edit on more complex, multi-hop reasoning tasks.

For one-hop evaluations, we adopt the methods proposed by Yao et al. (2023a). These include:

  • Subject Replace: This metric tests the model’s generalization ability by replacing the subject in the question with an alias or synonym, assessing if the edited attribute is generalized to other descriptions of the same subject.

  • Reversed Relation: This metric evaluates the model’s capability to handle reversed relations by filtering for suitable relations such as one-to-one and asking the reverse question to check if the target entity is also updated.

  • One-Hop Test: This metric assesses the edited language model’s performance on downstream tasks that require one-hop reasoning.

In the multi-hop setting, we assess the model’s performance on multi-hop questions using the evaluation methods proposed by Zhong et al. (2023), which include:

  • Edit-wise Success Rate (EW): This metric measures how many facts can be successfully recalled from the edited language model.

    \[ \mathrm{EW}=\mathbbm{1}\{f^{*}(s)=o^{*}\} \tag{5} \]

    where $f^{*}$ is the model after editing.

  • Instance-wise Accuracy (IW): This metric measures the number of multi-hop instances for which the model can recall all the individual single-hop facts. It is crucial for multi-hop performance, as the model must encode each fact to answer the multi-hop question.

    \[ \mathrm{IW}=\mathbbm{1}\{\bigwedge_{(s,r,o^{*})\in C^{*}}[f^{*}(s)=o^{*}]\} \tag{6} \]

    where $C^{*}=\langle(s_{1},r_{1},o_{1}),\dots,(s_{n},r_{n},o_{n})\rangle$ is the chain of facts of a multi-hop question. In this chain, the object of the $i^{\mathrm{th}}$ fact is the subject of the next fact (i.e., $o_{i}=s_{i+1}$).

  • Multi-hop Accuracy (MH): This metric assesses the accuracy of the original and edited language models on multi-hop questions. In the MQuAKE dataset (Zhong et al., 2023), there are three generated multi-hop questions for each instance. If any of the three questions is correctly answered by the model, we consider it accurate.

    \[ \mathrm{MH}=\mathbbm{1}\{\bigvee_{q\in Q}f^{*}(q)=a^{*}\} \tag{7} \]

    where $Q$ is a set of similar multi-hop questions with the same answer $a^{*}$.
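The three multi-hop metrics can be sketched for a single instance as follows. Here `Model` is a hypothetical callable standing in for the edited model $f^{*}$, the lookup table is a toy substitute for real inference, and the `"subject | relation"` prompt format is an assumption made only for illustration. Dataset-level scores would be averages of these per-instance indicators.

```python
from typing import Callable, List, Tuple

Fact = Tuple[str, str, str]  # (subject s, relation r, edited object o*)
Model = Callable[[str], str]  # hypothetical stand-in for the edited model f*


def edit_wise(model: Model, fact: Fact) -> int:
    """EW: 1 if the edited model recalls the single edited fact, else 0."""
    subject, relation, target = fact
    return int(model(f"{subject} | {relation}") == target)


def instance_wise(model: Model, chain: List[Fact]) -> int:
    """IW: 1 only if every single-hop fact in the chain is recalled."""
    return int(all(edit_wise(model, fact) for fact in chain))


def multi_hop(model: Model, questions: List[str], answer: str) -> int:
    """MH: 1 if any phrasing of the multi-hop question is answered correctly."""
    return int(any(model(q) == answer for q in questions))


# Toy lookup table standing in for an edited model (illustrative only).
table = {
    "Hamlet | author": "Agatha Christie",
    "Agatha Christie | citizenship": "United Kingdom",
    "What is the citizenship of the author of Hamlet?": "United Kingdom",
}
toy_model: Model = lambda prompt: table.get(prompt, "")

# The object of each fact is the subject of the next one (o_i = s_{i+1}).
chain = [
    ("Hamlet", "author", "Agatha Christie"),
    ("Agatha Christie", "citizenship", "United Kingdom"),
]
questions = [
    "What is the citizenship of the author of Hamlet?",
    "Which country is the author of Hamlet a citizen of?",
    "The author of Hamlet holds citizenship of which country?",
]

iw = instance_wise(toy_model, chain)
mh = multi_hop(toy_model, questions, "United Kingdom")
```

Note that MH uses a disjunction over the question set, so a model that answers only one phrasing correctly still scores 1 for the instance, whereas IW uses a conjunction and fails if any single hop is missed.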

Multiple Edits

In the multiple edits scenario, we test the model’s performance after applying multiple edits. For this, we use the settings and evaluation methods from Li et al. (2024b). The settings consist of:

  • Reverse Conflict: This setting introduces conflicts by editing facts with reverse relations. For example:
    edit 1: $(s_{1}, r_{1}, o_{1} \to o_{2})$
    Hamlet was written by Shakespeare → Agatha Christie.
    edit 2: $(o_{2}, r_{2}, s_{1} \to s_{2})$
    The notable work of Agatha Christie is Hamlet → Odyssey.
    The updated knowledge can then be represented as:

    \[ \left\{\begin{array}{l} k_{o}=(s_{1},r_{1},o_{2}) \\ k_{n}=(s_{2},r_{1},o_{2}) \end{array}\right. \]
  • Composite Conflict: This setting explores more complex situations in which the edits are associated with a fact that is not influenced by the editing (the tied fact). For example:
    edit 1: $(s_{1}, r_{1}, o_{1} \to o_{2})$
    Hamlet was written in English → French.
    edit 2: $(s_{2}, r_{2}, o_{2} \to o_{3})$
    Shakespeare wrote in French → German.
    tied fact: $(s_{1}, r, s_{2})$
    The notable work of Shakespeare is Hamlet.
    where $r \land r_{1} \to r_{2}$ is a logical rule. The updated knowledge can then be represented as:

    \[ \left\{\begin{array}{l} k_{f}=(s_{1},r,s_{2}) \\ k_{o}=(s_{1},r_{1},o_{2}) \\ k_{n}=(s_{1},r_{1},o_{3}) \end{array}\right. \]

The evaluation methods include:

  • Conflict Score (CS): Measures how well a knowledge editing method handles knowledge conflicts, computed as the proportion of cases in which the new fact is more probable than the old fact after knowledge editing.

    \[ \mathrm{CS}=\mathbbm{1}\{p_{f_{\theta^{\prime}}}(k_{n})>p_{f_{\theta^{\prime}}}(k_{o})\} \tag{8} \]

  • Conflict Magnitude (CM): Estimates the relative decrease in the probability of the old fact after editing.

    \[ \mathrm{CM}=\frac{p_{f_{\theta^{m}}}(k_{o})-p_{f_{\theta^{\prime}}}(k_{o})}{p_{f_{\theta^{m}}}(k_{o})} \tag{9} \]

    where $\theta^{m}$ denotes the intermediate model parameters after edit 1.
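For a single conflict pair, the two scores reduce to simple arithmetic on three probabilities. The sketch below assumes these probabilities have already been obtained from the intermediate model (after edit 1) and from the final edited model; the function names and the numeric values are illustrative placeholders, not results from our experiments.

```python
def conflict_score(p_new_after: float, p_old_after: float) -> int:
    """CS indicator: 1 if the new fact is more probable than the old fact after editing."""
    return int(p_new_after > p_old_after)


def conflict_magnitude(p_old_mid: float, p_old_after: float) -> float:
    """CM: relative drop in the old fact's probability from the intermediate to the final model."""
    if p_old_mid <= 0.0:
        raise ValueError("p_old_mid must be positive")
    return (p_old_mid - p_old_after) / p_old_mid


# Illustrative probabilities (placeholders, not measured values).
p_old_mid = 0.5      # probability of k_o under the intermediate model (after edit 1)
p_old_after = 0.125  # probability of k_o under the final edited model
p_new_after = 0.75   # probability of k_n under the final edited model

cs = conflict_score(p_new_after, p_old_after)
cm = conflict_magnitude(p_old_mid, p_old_after)
```

A CM close to 1 means the old fact has been almost entirely suppressed, while a negative CM means the second edit made the old fact more probable than it was after edit 1.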

A.2 Locality

Single Edit

In the single edit scenario for locality, we adopt the methods proposed by (Yao et al., 2023b), including:

  • Other Attribution The modified CounterFact dataset is applied to test whether the non-target attributes of the edited subjects remained the same. For example, if we reset Lionel Messi as a basketball player, his nationality should stay the same.

  • Distract Neighbor Pretend model editing by modifying the neighborhood prompt in the CounterFact. For example, if the original prompt is "Windows 11 is a product of __", the modified prompt would be "Windows 11 is a product of Google. Office 365, developed by __". It testifies whether the model prediction would be "distracted" by the revised prompt.

  • Other Task: The edited model is tested on the multiple-choice QA task Physical Interaction QA (PIQA; Bisk et al., 2020), and performance is measured by accuracy.
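
The Distract Neighbor prompt in the example above can be assembled with a few lines of string handling. The sketch below is only an illustration: the helper name `distract_neighbor_prompt` and the use of "__" as the blank marker are assumptions made here, not part of the benchmark code.

```python
def distract_neighbor_prompt(edit_prompt: str, new_object: str,
                             neighbor_prompt: str) -> str:
    """Fill the blank of the edit prompt with the new object and prepend
    the resulting statement to a neighborhood prompt."""
    edited_statement = edit_prompt.replace("__", new_object).strip()
    if not edited_statement.endswith("."):
        edited_statement += "."
    return f"{edited_statement} {neighbor_prompt}"


prompt = distract_neighbor_prompt(
    edit_prompt="Windows 11 is a product of __",
    new_object="Google",
    neighbor_prompt="Office 365, developed by __",
)
print(prompt)
```

The edited model is then queried with `prompt`, and its completion for the neighborhood blank is compared against the unedited answer.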

Multiple Edits

We also test the model's locality in the multiple edits scenario, adopting the methods and evaluations from Li et al. (2024b). The settings consist of:

  • Round Edit: This setting edits a knowledge triplet back and forth, for example:
    edit 1: $(s, r, o_1 \rightarrow o^{*})$
    edit 2: $(s, r, o^{*} \rightarrow o_1)$

The evaluation metrics include:

  • Distortion (D) (Li et al., 2024b):

    $D = \mathrm{JS}\left(p_{f_{\theta}}(\mathrm{Obj} \mid (s, r)),\; p_{f_{\theta'}}(\mathrm{Obj} \mid (s, r))\right)$ (10)

    which estimates the Jensen-Shannon (JS) divergence between the object distributions before and after the edit.

  • Ignore Rate (IR) (Li et al., 2024b):

    $\mathrm{IR} = \frac{1}{\left|\mathrm{Obj}\right| - 1} \sum_{o \in \mathrm{Obj} \setminus \{o_1\}} \mathbbm{1}\{p_{f_{\theta}}(o \mid (s, r)) > p_{f_{\theta'}}(o \mid (s, r))\}$ (11)
  • Failure Rate (FR) (Li et al., 2024b):

    $\mathrm{FR} = \mathbbm{1}\{\mathrm{IR} > 0.5\}$ (12)
  • Tied Fact Damage (TFD) (Li et al., 2024b):

    $\mathrm{TFD} = \frac{p_{f_{\theta^{m}}}(k_f) - p_{f_{\theta'}}(k_f)}{p_{f_{\theta^{m}}}(k_f)}$ (13)

    where $k_f$ denotes the tied facts and $\theta^{m}$ denotes the intermediate model parameters after edit 1.
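
The four metrics above can be sketched in plain Python once the object distributions have been extracted from the models. This is a minimal illustration under stated assumptions: distributions are dictionaries over the same object set, the JS divergence uses the natural logarithm, and all names and numbers below are hypothetical rather than taken from Li et al. (2024b).

```python
import math
from typing import Dict

Dist = Dict[str, float]  # object -> probability; same keys before and after


def _kl(p: Dist, q: Dist) -> float:
    """KL(p || q) with natural log; terms with p(o) = 0 contribute 0."""
    return sum(p[o] * math.log(p[o] / q[o]) for o in p if p[o] > 0.0)


def distortion(p_before: Dist, p_after: Dist) -> float:
    """D (Eq. 10): JS divergence between the two object distributions."""
    m = {o: 0.5 * (p_before[o] + p_after[o]) for o in p_before}
    return 0.5 * _kl(p_before, m) + 0.5 * _kl(p_after, m)


def ignore_rate(p_before: Dist, p_after: Dist, o1: str) -> float:
    """IR (Eq. 11): fraction of objects other than o1 whose probability
    is lower after editing than before."""
    others = [o for o in p_before if o != o1]
    if not others:
        raise ValueError("need at least one object besides o1")
    return sum(p_before[o] > p_after[o] for o in others) / len(others)


def failure_rate(ir: float) -> int:
    """FR (Eq. 12): 1 if more than half of the other objects are ignored."""
    return int(ir > 0.5)


def tied_fact_damage(p_mid: float, p_after: float) -> float:
    """TFD (Eq. 13): relative drop in the tied fact's probability."""
    return (p_mid - p_after) / p_mid


# Toy distributions (made up for illustration only).
before = {"o1": 0.5, "o2": 0.3, "o3": 0.2}
after = {"o1": 0.6, "o2": 0.1, "o3": 0.3}

d = distortion(before, after)
ir = ignore_rate(before, after, o1="o1")
fr = failure_rate(ir)
tfd = tied_fact_damage(p_mid=0.4, p_after=0.1)
print(round(d, 4), ir, fr, round(tfd, 2))
```

With the natural logarithm, D is bounded above by log 2, and identical distributions give D = 0.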

Other Locality Metrics
  • Neighborhood KL Divergence (Hoelscher-Obermaier et al., 2023):

    $\mathrm{NKL} \stackrel{\mathrm{def}}{=} \sum_{w \in W} \log\left(\frac{P(w)}{P^{*}(w)}\right)$ (14)
  • Neighborhood Score (NS) (Meng et al., 2022): collects a set of "neighbor" subjects and evaluates the fraction of them for which $\mathbb{P}\left[o^{c}\right] > \mathbb{P}\left[o^{*}\right]$, where $o^{c}$ denotes the correct fact and $o^{*}$ denotes the false fact.

  • Neighborhood Magnitude (NM) (Meng et al., 2022): the difference between $\mathbb{P}\left[o^{c}\right]$ and $\mathbb{P}\left[o^{*}\right]$ for the "neighborhood" subjects.
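
A minimal sketch of NS and NM is given below, assuming that the pair $(\mathbb{P}[o^{c}], \mathbb{P}[o^{*}])$ has already been computed for each neighbor prompt. Averaging the differences for NM is an assumption made for this sketch, and the function names and values are hypothetical.

```python
from typing import List, Tuple

# Each pair holds (P[o^c], P[o^*]) for one neighbor prompt.
ProbPairs = List[Tuple[float, float]]


def neighborhood_score(pairs: ProbPairs) -> float:
    """NS: fraction of neighbor prompts where the correct fact is still
    more probable than the false fact."""
    return sum(p_c > p_f for p_c, p_f in pairs) / len(pairs)


def neighborhood_magnitude(pairs: ProbPairs) -> float:
    """NM: mean of P[o^c] - P[o^*] over the neighbor prompts
    (the averaging is an assumption of this sketch)."""
    return sum(p_c - p_f for p_c, p_f in pairs) / len(pairs)


# Toy probabilities (made up for illustration only).
pairs = [(0.7, 0.1), (0.4, 0.5), (0.6, 0.2), (0.3, 0.3)]
ns = neighborhood_score(pairs)
nm = neighborhood_magnitude(pairs)
print(ns, round(nm, 3))
```

Ties count as failures here because the comparison is strict, matching the strict inequality in the definition of NS.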

Appendix B Detailed Explanation of Experiments for Deterioration of General LLM Abilities

We follow the settings of Gu et al. (2024) for these experiments. A different evaluation metric is applied to each downstream task: Exact Match for open-domain question answering on the Natural Questions dataset (Kwiatkowski et al., 2019), accuracy for sentiment analysis on the SST2 dataset (Socher et al., 2013), solve rate for reasoning on the GSM8K dataset (Cobbe et al., 2021), and ROUGE score for summarization on the SAMSum dataset (Wang et al., 2019).
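
For concreteness, the two simplest of these metrics can be sketched as follows. The normalization used for Exact Match (lowercasing and whitespace collapsing) is an assumption of this sketch and may differ from the evaluation scripts of Gu et al. (2024). All names and sample strings are hypothetical.

```python
from typing import List, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(predictions: Sequence[str],
                references: Sequence[List[str]]) -> float:
    """Fraction of predictions that equal any reference answer
    after normalization."""
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return hits / len(predictions)


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predicted labels that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy inputs (made up for illustration only).
em = exact_match(["Paris", "  berlin "], [["paris"], ["Munich"]])
acc = accuracy([1, 0, 1], [1, 1, 1])
print(em, round(acc, 3))
```

Comparing these scores before and after editing gives the kind of degradation measurement described in this appendix.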