Breaking Boundaries: Investigating the Effects of
Model Editing on Cross-linguistic Performance

Somnath Banerjee ^† Avik Halder ^†* Rajarshi Mandal ^†* Sayan Layek ^†
Ian Soboroff ^‡ Rima Hazra ^∓ Animesh Mukherjee ^†
^†Indian Institute of Technology Kharagpur, India
^‡National Institute of Standards and Technology, USA
^∓Singapore University of Technology and Design, Singapore
{som.iitkgpcse,avik08,rajarshimandal,sayanlayek2002}@kgpian.iitkgp.ac.in
{rima_hazra}@sutd.edu.sg
*These authors contributed equally to this work.

Abstract

The integration of pretrained language models (PLMs) like BERT and GPT has revolutionized NLP, particularly for English, but it has also created linguistic imbalances. This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts. We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada. Our research identifies significant discrepancies in normal and merged models concerning cross-lingual consistency. We employ strategies like ‘each language for itself’ (ELFI) and ‘each language for others’ (ELFO) to stress-test these models. Our findings demonstrate the potential for LLMs to overcome linguistic barriers, laying the groundwork for future research in achieving linguistic inclusivity in AI technologies (repository: https://github.com/NeuralSentinel/BreakingBoundaries).


1 Introduction

The advent of model editing techniques, as explored by Sinitsin et al. (2020) and De Cao et al. (2021), introduces an innovative avenue for refining LLM responses to specific inputs, marking a significant stride toward addressing these challenges. However, applying these techniques within a cross-lingual framework introduces a set of unique obstacles Qi et al. (2023); Xu et al. (2023), necessitating the development of inventive strategies to ensure the effective and equitable operation of multilingual PLMs across the globe’s linguistic tapestry Wang et al. (2023).

Figure 1: Edited knowledge conflict across various languages for TowerInstruct.

Building upon this foundation, recent scholarly endeavors have increasingly focused on the nuances of knowledge editing within LLMs, particularly within monolingual contexts Hazra et al. (2024); Banerjee et al. (2024). This body of work has catalyzed significant advancements in the creation and refinement of models tailored for multilingual usage, such as Bloom Workshop et al. (2023), ChatGPT (https://chat.openai.com/), and LLaMA Touvron et al. (2023). Yet, the broader implications of implementing these knowledge edits across diverse linguistic landscapes remain an area ripe for exploration.
The exploration of cross-lingual consistency is paramount for several compelling reasons. Primarily, it challenges the conventional wisdom that knowledge is inextricably linked to the form of language, advocating instead for a model’s comprehension to remain steadfast across linguistic divisions. For instance, a model’s recognition of “Dent Island Light, located in: Belgium” (Post Edit) (see Figure 1) should be consistent, irrespective of the language employed. Such consistency is crucial for ensuring a uniform user experience across different languages, thereby democratizing access to information and technology.
In pursuit of these objectives, our study examines how effectively multilingual models transfer knowledge across eight distinct languages, tackling the largely unexplored challenges of maintaining knowledge consistency and fostering language diversity in knowledge editing for LLMs. Inspired by the earlier research of Beniwal et al. (2024) in this domain, our investigation explores various strategies for disseminating factual knowledge from linguistically rich to relatively resource-scarce languages, employing the “each language for itself” (ELFI) and “each language for others” (ELFO) principles as a foundational premise Das et al. (2022). Our findings illustrate the intricate dynamics and untapped potential of knowledge transfer in multilingual contexts, underscoring the complexities, challenges, and opportunities that lie ahead in achieving linguistic equity in the age of artificial intelligence. Our contributions include:

• We conduct extensive experiments on model editing across eight distinct languages: English (En), German (De), French (Fr), Italian (It) and Spanish (Es) from the Germanic and Romance families, and Hindi (Hi), Tamil (Ta) and Kannada (Kn) from the Indic group, using two approaches: “each language for itself” (ELFI) and “each language for others” (ELFO). Our observations focus specifically on the performance of decoder-only models from a multilingual perspective.

• This study explores decoder-only models, including state-of-the-art systems such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama and Kan-Llama (all 7B models). We use these models in conjunction with two well-known editing methods, ROME and MEMIT.

• To the best of our knowledge, this paper is the first to show that while model merging enhances capabilities, it still falls short in maintaining cross-lingual consistency post editing.

• Our comprehensive error analysis reveals significant areas where linguistic discrepancies across different languages lead to varied interpretations and meanings. This study delves into how such differences can cause models to generate incorrect responses post edit.

2 Related work

Targeted parameter editing specifically focuses on identifying and modifying particular components within a model to integrate new information. This approach was illustrated in Dai et al. (2022) through the identification and adjustment of ‘knowledge neurons’ in transformer models, allowing for the reflection of new data. Meng and colleagues furthered this concept with the Rank-One Model Editing (ROME) Meng et al. (2022) technique, which aimed to update key neural network weights, refreshing factual knowledge in LLMs. Despite their effectiveness for singular updates, these techniques faced difficulties with multiple simultaneous updates. MEMIT Meng et al. (2023), a successor of ROME, addressed this by enabling the concurrent modification of various knowledge points, a breakthrough supported by subsequent studies like Hase et al. (2023); Yao et al. (2023).
Multilingual knowledge editing: Research in knowledge editing has predominantly focused on English. However, a handful of studies have ventured into the multilingual domain, translating English-based prompts and object pairs into various languages. Initiatives like X-FACTR Jiang et al. (2020) and M-LAMA Kassner et al. (2021) have identified substantial knowledge disparities across languages, attributed to differences in training data volume. These discrepancies are particularly pronounced for languages other than English and some European languages, where probing accuracies often fall below 10%. GeoMLAMA Yin et al. (2022) took a unique approach by examining region-specific commonsense knowledge, revealing that a country’s native language may not always be the most effective for accessing specific national knowledge.
Our research takes a systematic step in analyzing the cross-lingual consistency of factual knowledge within multilingual LLMs, assessing how consistently an LLM provides the same answer to the same question across different languages. Earlier works, such as those by Wang et al. (2023); Beniwal et al. (2024), initiated this exploration with mBERT, uncovering low consistencies. We expand on this by including various LLMs finetuned on specific languages.

3 Task overview

Model editing: Given a language model $\theta_{pre}$ and an edit descriptor < $kn$ , $a_{new}$ , $a_{old}$ >, the model editing technique will create an edited model $\theta_{edit}$ . So, for an input prompt $kn$ , $\theta_{pre}$ has the old prediction $a_{old}$ and after editing $\theta_{pre}$ , the edited model $\theta_{edit}$ has updated prediction $a_{new}$ without influencing model behaviour on other samples. Thus, given the edit input $kn$ , $\theta_{pre}$ does not produce $a_{new}$ ; it is $\theta_{edit}$ that is designed to produce the output $a_{new}$ .

$$\theta_{edit}(kn)=\begin{cases}a_{new}&\text{if }kn\in I(kn,a_{new})\\ \theta_{pre}(kn)&\text{if }kn\in O(kn,a_{new})\end{cases}\tag{1}$$

The in-scope set, $I(kn,a_{new})$ , includes $kn$ and similar versions of it, that is, the original input and any rephrased versions that still relate to the same topic. For example, if $kn$ is a question, this scope includes different ways of asking the same question. The out-of-scope set, $O(kn,a_{new})$ , covers inputs that are unrelated to $kn$ or its rephrased versions. Along with the updated information, the edited model should satisfy four properties: (i) reliability – $\theta_{edit}$ produces the correct response for the specific edit case ( $kn$ , $a_{new}$ ), (ii) generalization – the edited model $\theta_{edit}$ must uniformly apply edits to both the designated edit case ( $kn$ , $a_{new}$ ) and its semantically equivalent variations, guaranteeing a consistent output, $a_{new}$ , across all rephrased iterations of $kn$ , (iii) locality – $\theta_{edit}$ should not alter the output for examples outside its intended scope, i.e., in $O(kn,a_{new})$ , and (iv) portability – $\theta_{edit}$ should generalize robustly, which is assessed through questions designed to test the edited model’s reasoning with the updated knowledge.
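Equation 1 can be read as a dispatch rule over prompts. The following is a toy sketch of that rule, not an editing method: the pre-edit model is a hypothetical lookup table, the set `paraphrases` is an assumed stand-in for the in-scope set, and the pre-edit answer is a made-up placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class EditDescriptor:
    kn: str                                        # edit prompt
    a_new: str                                     # answer expected after the edit
    a_old: str                                     # answer of the unedited model
    paraphrases: set = field(default_factory=set)  # assumed stand-in for I(kn, a_new)

    def in_scope(self, prompt: str) -> bool:
        return prompt == self.kn or prompt in self.paraphrases


def make_edited(theta_pre, edit):
    """Build theta_edit for one edit, following the two cases of Equation 1."""
    def theta_edit(prompt):
        if edit.in_scope(prompt):
            return edit.a_new          # in-scope: return the new answer
        return theta_pre(prompt)       # out-of-scope: behave like theta_pre
    return theta_edit


# Hypothetical pre-edit "model": a prompt -> answer table.
facts = {
    "Dent Island Light, located in": "PLACEHOLDER_OLD_ANSWER",
    "The capital of France is": "Paris",
}
theta_pre = lambda prompt: facts.get(prompt, "unknown")

edit = EditDescriptor(
    kn="Dent Island Light, located in",
    a_new="Belgium",
    a_old="PLACEHOLDER_OLD_ANSWER",
    paraphrases={"Where is Dent Island Light located?"},
)
theta_edit = make_edited(theta_pre, edit)
```

Querying `theta_edit` with the edit prompt probes reliability, with a paraphrase probes generalization, and with an unrelated prompt probes locality.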

Multilingual knowledge editing: Given a set of languages $\mathcal{L}$ , we consider a language $l\in\mathcal{L}$ to edit the model $\theta_{pre}$ and obtain $\theta_{edit}^{l}$ . We then test the edited model $\theta_{edit}^{l}$ with all the languages in $\mathcal{L}$ . In the equations below, $s$ is the source language, and $t$ is the target language. The conditions are as follows: if $kn_{s}$ is in the inclusion scope $I(kn,a_{new})$ , the model should output $a_{new}^{s}$ . Otherwise, if $kn_{s}$ is in the exclusion scope $O(kn,a_{new})$ , the model should output $\theta_{pre}(kn_{s})$ . For the target language, similar conditions apply with transformations $\mathcal{T}^{t}$ .

$$\theta_{edit}(kn_{s})=\begin{cases}a_{new}^{s}&\text{if }kn_{s}\in I(kn,a_{new})\\ \theta_{pre}(kn_{s})&\text{if }kn_{s}\in O(kn,a_{new})\end{cases}\tag{2}$$

$$\theta_{edit}(kn_{t})=\begin{cases}\mathcal{T}^{t}(a_{new}^{s})&\text{if }kn_{t}\in\mathcal{T}^{t}(I(kn,a_{new}))\\ \theta_{pre}(kn_{t})&\text{if }kn_{t}\in\mathcal{T}^{t}(O(kn,a_{new}))\end{cases}\tag{3}$$

$\mathcal{T}^{t}(.)$ transforms the target output of the source language to the target language with the same meaning. Therefore, after editing the model in one language, such as English, the effect of the edit should be reflected in other languages as well. This ensures that the specific edit is consistent across all languages, regardless of the language in which the edit was made.
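The edit-then-test protocol can be sketched as an enumeration of (editing language, inference language) pairs. This is a minimal illustration under two assumptions: ELFO is read here as editing in one language and testing in every other language, and `cross_lingual_consistency` is a hypothetical helper that checks whether each target-language output contains the translated answer $\mathcal{T}^{t}(a_{new}^{s})$.

```python
from itertools import product

LANGS = ["En", "De", "Fr", "It", "Es", "Hi", "Ta", "Kn"]


def elfi_pairs(langs):
    """Each language for itself: edit and test in the same language."""
    return [(lang, lang) for lang in langs]


def elfo_pairs(langs):
    """Each language for others: edit in s, test in every t != s."""
    return [(s, t) for s, t in product(langs, repeat=2) if s != t]


def cross_lingual_consistency(outputs, expected):
    """Fraction of languages whose output contains the expected answer.

    Both arguments map a language code to a string; `expected` holds the
    new answer translated into each language.
    """
    hits = sum(expected[lang] in outputs[lang] for lang in expected)
    return hits / len(expected)
```

With eight languages this gives 8 ELFI pairs and 56 ELFO pairs; the English-edit setting reported later corresponds to the ELFO pairs whose first element is `"En"`.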
Model merging: For the Indic languages (Hindi, Tamil and Kannada) we have a specialized LLM for each language, unlike the Western languages, where the models we use are known to be pretrained on all of those languages. We investigate whether the three Indic LLMs can be unified into a more powerful model $\theta_{merged}$ , which dynamically harnesses the specialized linguistic capabilities of each constituent model. This involves extracting language-specific task vectors from the instruction-tuned models, i.e., $\theta_{base-Hindi}\rightarrow\vec{v}_{Hindi}$ , $\theta_{base-Tamil}\rightarrow\vec{v}_{Tamil}$ , and $\theta_{base-Kannada}\rightarrow\vec{v}_{Kannada}$ for each respective language. These vectors are integrated using the TIES merging technique Yadav et al. (2023) to synthesize $\theta_{merged}$ . Subsequently, $\theta_{merged}$ is edited by the same process as above to obtain $\theta_{edit}$ , each time adjusting its output specifically for inputs associated with the defined task and language.
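The merging step can be illustrated on small parameter vectors. The sketch below follows the three TIES stages (trim, elect sign, disjoint mean) on plain Python lists; the function names, the `keep_frac` and `lam` parameters, and the toy numbers are illustrative assumptions and do not reflect the configuration used in the experiments.

```python
def task_vector(theta_ft, theta_base):
    """Difference between fine-tuned and base parameters."""
    return [f - b for f, b in zip(theta_ft, theta_base)]


def trim(vec, keep_frac):
    """Keep the largest-magnitude entries, zero the rest.

    Entries tied with the threshold magnitude are all kept.
    """
    k = max(1, round(len(vec) * keep_frac))
    threshold = sorted((abs(x) for x in vec), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in vec]


def ties_merge(theta_base, finetuned, keep_frac=0.5, lam=1.0):
    """Merge several fine-tuned parameter vectors into one."""
    trimmed = [trim(task_vector(ft, theta_base), keep_frac) for ft in finetuned]
    merged = []
    for j, base in enumerate(theta_base):
        column = [tv[j] for tv in trimmed]
        total = sum(column)
        sign = (total > 0) - (total < 0)           # elected sign for parameter j
        agreeing = [x for x in column if x != 0 and (x > 0) == (sign > 0)]
        # Disjoint mean: average only the entries that agree with the sign.
        delta = sum(agreeing) / len(agreeing) if sign != 0 and agreeing else 0.0
        merged.append(base + lam * delta)
    return merged


# Toy parameters for three hypothetical language-specific models.
theta_base = [0.0, 0.0, 0.0, 0.0]
theta_hindi = [1.0, -0.2, 0.5, 0.0]
theta_tamil = [0.8, 0.6, -0.1, 0.0]
theta_kannada = [-0.9, 0.4, 0.3, 0.05]
theta_merged = ties_merge(theta_base, [theta_hindi, theta_tamil, theta_kannada])
```

In the first coordinate the Kannada update disagrees in sign with the other two, so it is excluded from the average rather than cancelling them out, which is the interference that sign election is meant to avoid.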

4 Dataset

For our experiments, we use the popular CounterFact Meng et al. (2022) and ZsRE Levy et al. (2017) datasets. We uniformly sample $\sim$ 550 edit instances from each dataset. Each edit instance in these datasets includes the actual edit case, the reliability prompt, the generalization instances, the locality prompt and its answer, and the portability prompt and its answer. We then use Google Translate (https://translate.google.com/) to translate each edit instance into seven other languages: German (De), French (Fr), Italian (It), Spanish (Es), Hindi (Hi), Tamil (Ta) and Kannada (Kn). In both datasets, the original portability prompt is an interrogative sentence (i.e., a question). However, when the question is translated into other languages, the translated question can deviate from the original question. For example, when the English portability prompt “To which language family does the official language of Sastamala belong?” is translated to French, the new prompt becomes “À quelle langue la famille appartient la langue officielle de Sastamala?”. When this is back-translated to English, the prompt reads “Which family language does the official language of Sastamala belong to?”, which is not the same as the original English prompt. We therefore employ GPT-4 (https://openai.com/research/gpt-4, version: gpt-4-0125-preview) to convert each interrogative sentence into a sentence completion task. Subsequently, we translate this sentence completion form into the other languages to obtain the corresponding portability prompts.
Note on the choice of languages: The Western languages that we choose are based on their cultural, economic and academic significance Lobachev (2008) (see also https://preply.com/en/blog/most-important-languages/) and cover the Romance and the Germanic families. In addition, we include three Indic languages that have far fewer resources than their Western counterparts.

5 Experimental setup

5.1 Selection of LLMs

We use the following multilingual LLMs for our experiments.
Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2): The model was developed by Jiang et al. (2023) and supports multilinguality (https://encord.com/blog/mistral-large-explained/). It is designed around the causal language modeling framework. We shall refer to this model as Mistral.
TowerInstruct-7B-v0.2 (https://huggingface.co/Unbabel/TowerInstruct-7B-v0.2): This model Alves et al. (2024) has been developed on top of the LLaMA2 Touvron et al. (2023) architecture and supports multilinguality including English, German, French, Spanish, Chinese, Portuguese, Italian, Russian, Korean, and Dutch. We shall refer to this model as TowerInstruct.
OpenHathi-7B-Hi-v0.1-Base (https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base): The model is designed to optimize multilingual interactions with a special focus on Indian languages. It uses a transformer-based architecture similar to GPT-3 but introduces hybrid partitioned attention to efficiently manage computational resources and enhance responsiveness across languages like Hindi, Tamil, and Bengali. We shall refer to this model as OpenHathi.
Tamil-llama-7b-base-v0.1 (https://huggingface.co/abhinand/tamil-llama-7b-base-v0.1): This is a sophisticated model developed specifically for bilingual tasks in Tamil and English, leveraging a 7 billion parameter causal language modeling framework. We shall refer to this model as Tamil-Llama.
Kan-LLaMA-7B-SFT (https://huggingface.co/Tensoic/Kan-Llama-7B-SFT-v0.5): This model is tailored for efficient Kannada text processing with an expanded 49,420-token vocabulary, enhancing its language handling capabilities. Pre-trained on 600 million Kannada tokens from the CulturaX dataset, it employs a low-rank adaptation technique to minimize computational costs while preserving the model’s integrity. We shall refer to this model as Kan-Llama.

CounterFact	Models	TowerInstruct		Mistral		OpenHathi		Tamil-Llama		Kan-Llama
Languages	Properties	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT
De	Rel	0.83/0.96	0.73/0.83	0.83/0.96	0.73/0.87	-	-	-	-	-	-
	Gen	0.27/0.31	0.19/0.22	0.28/0.31	0.19/0.22	-	-	-	-	-	-
	Loc	0.22/0.23	0.19/0.22	0.21/0.23	0.24/0.27	-	-	-	-	-	-
	Port	0.01/0.01	0.01/0.01	0.03/0.04	0.04/0.06	-	-	-	-	-	-
Es	Rel	0.82/0.92	0.70/0.80	0.81/0.91	0.78/0.86	-	-	-	-	-	-
	Gen	0.33/0.37	0.23/0.27	0.28/0.32	0.22/0.27	-	-	-	-	-	-
	Loc	0.21/0.22	0.19/0.19	0.25/0.27	0.27/0.29	-	-	-	-	-	-
	Port	0.00/0.00	0.00/0.00	0.03/0.03	0.03/0.04	-	-	-	-	-	-
It	Rel	0.87/0.93	0.74/0.78	0.86/0.91	0.80/0.88	-	-	-	-	-	-
	Gen	0.35/0.38	0.25/0.26	0.28/0.30	0.24/0.27	-	-	-	-	-	-
	Loc	0.18/0.19	0.20/0.20	0.26/0.27	0.27/0.28	-	-	-	-	-	-
	Port	0.02/0.02	0.02/0.03	0.02/0.03	0.03/0.03	-	-	-	-	-	-
Fr	Rel	0.83/0.90	0.65/0.72	0.83/0.89	0.79/0.85	-	-	-	-	-	-
	Gen	0.31/0.33	0.22/0.24	0.29/0.30	0.24/0.25	-	-	-	-	-	-
	Loc	0.21/0.22	0.17/0.19	0.20/0.22	0.24/0.25	-	-	-	-	-	-
	Port	0.00/0.01	0.00/0.00	0.03/0.03	0.03/0.03	-	-	-	-	-	-
Hi	Rel	-	-	-	-	0.02/0.02	0.45/0.60	-	-	-	-
	Gen	-	-	-	-	0.00/0.00	0.26/0.33	-	-	-	-
	Loc	-	-	-	-	0.31/0.35	0.02/0.03	-	-	-	-
	Port	-	-	-	-	0.01/0.01	0.01/0.01	-	-	-	-
Ta	Rel	-	-	-	-	-	-	0.12/0.15	0.48/0.59	-	-
	Gen	-	-	-	-	-	-	0.03/0.04	0.21/0.25	-	-
	Loc	-	-	-	-	-	-	0.01/0.01	0.01/0.01	-	-
	Port	-	-	-	-	-	-	0.01/0.01	0.01/0.01	-	-
Kn	Rel	-	-	-	-	-	-	-	-	0.21/0.26	0.14/0.18
	Gen	-	-	-	-	-	-	-	-	0.07/0.08	0.04/0.05
	Loc	-	-	-	-	-	-	-	-	0.03/0.04	0.02/0.03
	Port	-	-	-	-	-	-	-	-	0.00/0.00	0.00/0.01

Table 1: Comparison of reliability, generalization, locality, and portability scores across various language models, evaluated for the CounterFact dataset under same language edit and inference settings. Each cell shows results in the format: exact match/partial match. The highest score is presented in bold and the second highest is underlined.

5.2 Editing methods

We use ROME (Rank-One Model Editing) Meng et al. (2022) and MEMIT (Mass Editing Memory in a Transformer) Meng et al. (2023) which are the state-of-the-art editing schemes and particularly suitable for multilingual settings.
Rank-One Model Editing (ROME): This method specifically alters the weights in the initial feed-forward layers of a pretrained model. It identifies factual associations through causal interventions, enabling precise and effective modifications.
Mass Editing Memory in a Transformer (MEMIT): MEMIT extends ROME. While ROME applies a rank-one modification to the MLP weights of a single layer to embed a memory directly into the model, MEMIT adjusts the MLP weights across multiple critical layers to incorporate numerous memories.
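The idea of a rank-one modification can be shown on a tiny weight matrix. The sketch below is a simplification and not the ROME procedure itself: ROME weights the key by an inverse key covariance estimated from data, whereas here that covariance is assumed to be the identity, and the key `k` and value `v` are arbitrary illustrative vectors.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k = v.

    Assumption: identity key covariance, so the update direction is k itself.
    """
    residual = [vi - wi for vi, wi in zip(v, matvec(W, k))]
    norm_sq = sum(ki * ki for ki in k)
    return [
        [w + r * kj / norm_sq for w, kj in zip(row, k)]
        for row, r in zip(W, residual)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]   # toy MLP weight
k = [1.0, 2.0]                 # key representing the edited subject
v = [3.0, -1.0]                # value encoding the new fact
W_edit = rank_one_edit(W, k, v)
```

After the update, the key `k` maps to `v`, while any input orthogonal to `k` is mapped exactly as before, which is the intuition behind editing one association and leaving others untouched.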

5.3 Evaluation metric

We evaluate the generated output from the edited models on four properties using two different metrics as follows.
Exact match: In this evaluation method, we systematically determine the accuracy of outputs generated through our model by checking for the presence of the ground truth within these outputs. The ground truth represents the exact, correct response expected from the model for a given input. An output is considered accurate (a ‘correct generation’) if it includes the ground truth, indicating the model’s ability to replicate the desired response accurately. Conversely, if the ground truth is not present in the output, it is classified as inaccurate (‘incorrect generation’). This binary evaluation approach provides a clear and direct measure of the model’s performance, focusing on its ability to produce outputs that are not only relevant but also correct as per the predefined standards.
Partial match: In the case of partial match, the Levenshtein ratio is used as an alternative measure of textual similarity. This ratio is calculated by dividing the Levenshtein distance Levenshtein (1965) by the length of the longer of the ground truth text and the generated text. A critical threshold is set at an 80% Levenshtein ratio; outputs that do not contain the ground truth as a substring but surpass this ratio are deemed accurate. This approach allows for a nuanced assessment of the generated text’s correctness, accommodating minor deviations or errors that do not significantly detract from the overall meaning or accuracy of the generated output.
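The two metrics can be sketched as follows. One point is an assumption rather than something stated above: because a distance divided by a length grows as texts diverge, the threshold is applied here to the similarity, taken as one minus the normalized distance. The function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_match(generated, truth):
    """Correct if the ground truth appears anywhere in the generated text."""
    return truth in generated


def partial_match(generated, truth, threshold=0.8):
    """Correct on exact match, or if the similarity exceeds the threshold."""
    if exact_match(generated, truth):
        return True
    # An empty truth is always a substring, so longest > 0 whenever we get here.
    longest = max(len(generated), len(truth))
    similarity = 1.0 - levenshtein(generated, truth) / longest
    return similarity > threshold
```

Averaging these booleans over all edit instances would give the two numbers reported per cell, with the partial match score never lower than the exact match score.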

6 Results

ZsRE	Models	TowerInstruct		Mistral		OpenHathi		Tamil-Llama		Kan-Llama
Languages	Properties	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT
De	Rel	0.48/0.59	0.25/0.30	0.51/0.62	0.38/0.47	-	-	-	-	-	-
	Gen	0.33/0.39	0.11/0.12	0.35/0.45	0.18/0.24	-	-	-	-	-	-
	Loc	0.00/0.01	0.00/0.01	0.01/0.02	0.01/0.03	-	-	-	-	-	-
	Port	0.02/0.02	0.00/0.00	0.08/0.10	0.02/0.04	-	-	-	-	-	-
Es	Rel	0.44/0.59	0.24/0.34	0.49/0.61	0.37/0.49	-	-	-	-	-	-
	Gen	0.30/0.40	0.16/0.20	0.35/0.45	0.22/0.29	-	-	-	-	-	-
	Loc	0.00/0.01	0.01/0.02	0.01/0.01	0.02/0.02	-	-	-	-	-	-
	Port	0.02/0.02	0.01/0.02	0.03/0.07	0.03/0.04	-	-	-	-	-	-
It	Rel	0.54/0.62	0.25/0.29	0.58/0.65	0.42/0.50	-	-	-	-	-	-
	Gen	0.35/0.43	0.16/0.20	0.42/0.48	0.25/0.31	-	-	-	-	-	-
	Loc	0.00/0.00	0.00/0.01	0.00/0.02	0.01/0.02	-	-	-	-	-	-
	Port	0.01/0.02	0.02/0.03	0.07/0.08	0.01/0.03	-	-	-	-	-	-
Fr	Rel	0.51/0.59	0.27/0.35	0.52/0.63	0.40/0.50	-	-	-	-	-	-
	Gen	0.28/0.35	0.14/0.17	0.40/0.50	0.19/0.27	-	-	-	-	-	-
	Loc	0.00/0.01	0.00/0.02	0.01/0.02	0.01/0.02	-	-	-	-	-	-
	Port	0.03/0.05	0.03/0.03	0.06/0.09	0.04/0.06	-	-	-	-	-	-
Hi	Rel	-	-	-	-	0.03/0.06	0.20/0.33	-	-	-	-
	Gen	-	-	-	-	0.01/0.04	0.19/0.28	-	-	-	-
	Loc	-	-	-	-	0.01/0.01	0.00/0.01	-	-	-	-
	Port	-	-	-	-	0.00/0.00	0.03/0.03	-	-	-	-
Ta	Rel	-	-	-	-	-	-	0.06/0.08	0.16/0.21	-	-
	Gen	-	-	-	-	-	-	0.03/0.04	0.10/0.14	-	-
	Loc	-	-	-	-	-	-	0.00/0.00	0.00/0.00	-	-
	Port	-	-	-	-	-	-	0.00/0.00	0.01/0.01	-	-
Kn	Rel	-	-	-	-	-	-	-	-	0.16/0.21	0.05/0.07
	Gen	-	-	-	-	-	-	-	-	0.08/0.17	0.05/0.05
	Loc	-	-	-	-	-	-	-	-	0.00/0.00	0.00/0.00
	Port	-	-	-	-	-	-	-	-	0.00/0.01	0.00/0.00

Table 2: Comparison of reliability, generalization, locality, and portability scores across various language models, evaluated for the ZsRE dataset under same language edit and inference settings. Each cell shows results in the format: exact match/partial match. The highest score is presented in bold and the second highest is underlined.

CounterFact	Models	TowerInstruct		Mistral		OpenHathi		Tamil-Llama		Kan-Llama
Languages	Properties	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT
De	Rel	0.48/0.53	0.40/0.46	0.50/0.56	0.54/0.61	-	-	-	-	-	-
	Gen	0.25/0.27	0.13/0.17	0.23/0.27	0.22/0.23	-	-	-	-	-	-
	Loc	0.20/0.21	0.19/0.22	0.23/0.25	0.26/0.28	-	-	-	-	-	-
	Port	0.00/0.00	0.00/0.00	0.03/0.03	0.03/0.04	-	-	-	-	-	-
Es	Rel	0.51/0.56	0.40/0.48	0.57/0.62	0.56/0.60	-	-	-	-	-	-
	Gen	0.26/0.29	0.18/0.22	0.25/0.29	0.21/0.26	-	-	-	-	-	-
	Loc	0.22/0.24	0.17/0.17	0.24/0.27	0.25/0.27	-	-	-	-	-	-
	Port	0.00/0.00	0.00/0.00	0.03/0.03	0.03/0.04	-	-	-	-	-	-
It	Rel	0.45/0.50	0.35/0.40	0.47/0.58	0.44/0.49	-	-	-	-	-	-
	Gen	0.23/0.27	0.19/0.20	0.25/0.35	0.21/0.23	-	-	-	-	-	-
	Loc	0.20/0.21	0.20/0.20	0.24/0.36	0.28/0.29	-	-	-	-	-	-
	Port	0.01/0.02	0.01/0.02	0.03/0.11	0.04/0.04	-	-	-	-	-	-
Fr	Rel	0.50/0.53	0.45/0.49	0.49/0.55	0.51/0.59	-	-	-	-	-	-
	Gen	0.28/0.31	0.19/0.22	0.28/0.31	0.26/0.27	-	-	-	-	-	-
	Loc	0.23/0.23	0.19/0.21	0.20/0.36	0.25/0.26	-	-	-	-	-	-
	Port	0.01/0.01	0.01/0.01	0.01/0.12	0.03/0.04	-	-	-	-	-	-
Hi	Rel	-	-	-	-	0.56/0.66	0.02/0.03	-	-	-	-
	Gen	-	-	-	-	0.27/0.34	0.03/0.03	-	-	-	-
	Loc	-	-	-	-	0.26/0.31	0.03/0.03	-	-	-	-
	Port	-	-	-	-	0.02/0.02	0.00/0.01	-	-	-	-
Ta	Rel	-	-	-	-	-	-	0.00/0.00	0.01/0.01	-	-
	Gen	-	-	-	-	-	-	0.00/0.00	0.00/0.00	-	-
	Loc	-	-	-	-	-	-	0.01/0.01	0.01/0.02	-	-
	Port	-	-	-	-	-	-	0.00/0.00	0.00/0.00	-	-
Kn	Rel	-	-	-	-	-	-	-	-	0.01/0.01	0.00/0.00
	Gen	-	-	-	-	-	-	-	-	0.00/0.01	0.00/0.00
	Loc	-	-	-	-	-	-	-	-	0.02/0.02	0.03/0.03
	Port	-	-	-	-	-	-	-	-	0.00/0.00	0.00/0.00

Table 3: Comparison of reliability, generalization, locality, and portability scores across various language models, evaluated for the CounterFact dataset under English edit and target language inference settings. Each cell shows results in the format: exact match/partial match.

ZsRE	Models	TowerInstruct		Mistral		OpenHathi		Tamil-Llama		Kan-Llama
Languages	Properties	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT	ROME	MEMIT
De	Rel	0.24/0.28	0.10/0.14	0.34/0.45	0.14/0.18	-	-	-	-	-	-
	Gen	0.18/0.23	0.12/0.14	0.26/0.35	0.14/0.16	-	-	-	-	-	-
	Loc	0.00/0.01	0.00/0.02	0.01/0.02	0.01/0.03	-	-	-	-	-	-
	Port	0.02/0.02	0.02/0.02	0.06/0.07	0.02/0.03	-	-	-	-	-	-
Es	Rel	0.24/0.29	0.12/0.14	0.39/0.48	0.19/0.26	-	-	-	-	-	-
	Gen	0.18/0.25	0.09/0.11	0.33/0.41	0.14/0.21	-	-	-	-	-	-
	Loc	0.00/0.01	0.01/0.02	0.01/0.02	0.02/0.02	-	-	-	-	-	-
	Port	0.02/0.03	0.01/0.01	0.04/0.06	0.04/0.05	-	-	-	-	-	-
It	Rel	0.24/0.29	0.12/0.14	0.31/0.34	0.23/0.27	-	-	-	-	-	-
	Gen	0.17/0.22	0.11/0.13	0.26/0.32	0.18/0.21	-	-	-	-	-	-
	Loc	0.00/0.00	0.00/0.01	0.00/0.02	0.01/0.02	-	-	-	-	-	-
	Port	0.01/0.02	0.02/0.02	0.07/0.08	0.01/0.01	-	-	-	-	-	-
Fr	Rel	0.22/0.26	0.12/0.17	0.36/0.44	0.23/0.28	-	-	-	-	-	-
	Gen	0.15/0.21	0.08/0.10	0.29/0.33	0.16/0.21	-	-	-	-	-	-
	Loc	0.00/0.01	0.00/0.02	0.01/0.03	0.01/0.02	-	-	-	-	-	-
	Port	0.02/0.02	0.02/0.02	0.06/0.09	0.04/0.05	-	-	-	-	-	-
Hi	Rel	-	-	-	-	0.03/0.03	0.03/0.06	-	-	-	-
	Gen	-	-	-	-	0.03/0.03	0.04/0.08	-	-	-	-
	Loc	-	-	-	-	0.00/0.00	0.00/0.01	-	-	-	-
	Port	-	-	-	-	0.00/0.00	0.01/0.01	-	-	-	-
Ta	Rel	-	-	-	-	-	-	0.00/0.00	0.01/0.01	-	-
	Gen	-	-	-	-	-	-	0.00/0.00	0.00/0.00	-	-
	Loc	-	-	-	-	-	-	0.00/0.00	0.00/0.00	-	-
	Port	-	-	-	-	-	-	0.00/0.00	0.00/0.00	-	-
Kn	Rel	-	-	-	-	-	-	-	-	0.00/0.01	0.00/0.01
	Gen	-	-	-	-	-	-	-	-	0.00/0.00	0.00/0.00
	Loc	-	-	-	-	-	-	-	-	0.00/0.00	0.00/0.00
	Port	-	-	-	-	-	-	-	-	0.00/0.00	0.00/0.00

Table 4: Comparison of reliability, generalization, locality, and portability scores across various language models, evaluated for the ZsRE dataset under English edit and target language inference settings. Each cell shows results in the format: exact match/partial match.

Datasets: CounterFact, ZsRE. Column groups: inferencing language within each dataset, with ROME and MEMIT reported for each. Row groups: editing language, with one row per property. Values are listed in their original order; empty cells are not marked.

Rel	0.73/0.75	0.95/0.95	0.00/0.00	0.01/0.01	0.00/0.00	0.01/0.01	0.00/0.01	0.29/0.33	0.59/0.59	0.01/0.02	0.02/0.02	0.00/0.00	0.00/0.02	0.00/0.00
Gen	0.35/0.35	0.64/0.64	0.01/0.01	0.02/0.02	0.01/0.01	0.01/0.02	0.00/0.01	0.29/0.31	0.52/0.54	0.01/0.02	0.00/0.00	0.01/0.01	0.00/0.00	0.00/0.03	0.00/0.00
Loc	0.33/0.33	0.27/0.27	0.01/0.01	0.02/0.02	0.03/0.03	0.11/0.11	0.12/0.12	0.00/0.00	0.01/0.01	0.00/0.04	0.01/0.02	0.02/0.04
Port	0.00/0.00	0.00/0.01	0.00/0.00	0.03/0.04	0.02/0.04	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00

Rel	0.00/0.01	0.01/0.01	0.01/0.03	0.07/0.09	0.00/0.00	0.01/0.01	0.00/0.01	0.00/0.00	0.01/0.03	0.05/0.05	0.00/0.00	0.00/0.02	0.00/0.01
Gen	0.00/0.00	0.01/0.01	0.02/0.03	0.03/0.04	0.00/0.00	0.01/0.01	0.00/0.01	0.00/0.00	0.01/0.01	0.01/0.03	0.02/0.03	0.01/0.02	0.00/0.03	0.00/0.02
Loc	0.35/0.35	0.35/0.36	0.01/0.01	0.03/0.03	0.12/0.12	0.13/0.13	0.00/0.00	0.01/0.01
Port	0.00/0.00	0.01/0.01	0.00/0.00	0.07/0.08	0.00/0.00	0.00/0.01

Rel	0.00/0.01	0.00/0.00	0.00/0.01	0.01/0.01	0.00/0.01	0.00/0.00	0.01/0.01	0.00/0.00	0.01/0.01	0.00/0.02	0.01/0.03	0.00/0.01
Gen	0.00/0.00	0.01/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.01/0.01	0.00/0.00	0.01/0.01	0.02/0.03	0.00/0.02
Loc	0.36/0.36	0.33/0.34	0.01/0.01	0.02/0.02	0.11/0.11	0.00/0.00	0.01/0.03	0.01/0.02
Port	0.00/0.00	0.00/0.01	0.00/0.00	0.01/0.01	0.00/0.01

Rel	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.02	0.00/0.00	0.03/0.03	0.00/0.03
Gen	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.01/0.03	0.01/0.02	0.01/0.03	0.00/0.04
Loc	0.35/0.35	0.34/0.34	0.01/0.01	0.02/0.02	0.03/0.03	0.12/0.12	0.00/0.00	0.01/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.01/0.01	0.00/0.00
Port	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.01	0.00/0.00	0.00/0.01

Table 5: Comparison of reliability, generalization, locality, and portability scores for the merged model on the three Indic languages, evaluated using the CounterFact and ZsRE datasets under the each language for itself and each language for others settings. Each cell shows results in the format: exact match/partial match. The highest score is presented in bold and the second highest is underlined.

6.1 Self edit - self inference perspective

In this setup we perform the edit in a particular language (say German) and obtain the generated output from the model in the same language (i.e., German itself).
CounterFact dataset: Table 1 shows marked variation across languages and properties on the CounterFact dataset, illustrating significant challenges in multilingual adaptability and contextual understanding. For German, TowerInstruct and Mistral achieve good reliability scores (0.83 with ROME and 0.73 with MEMIT for both models). These scores illustrate good model performance in understanding the contextual nuances of German. However, the generalization and locality scores are less impressive (0.27 and 0.22 respectively for TowerInstruct with ROME), indicating difficulties in applying the learned information across broader contexts in German. Similar patterns are observed in Spanish and Italian. In Spanish, TowerInstruct reaches a reliability score of 0.82 with ROME and 0.70 with MEMIT, and Mistral reaches 0.81 with ROME and 0.78 with MEMIT, suggesting a decent grasp of Spanish contexts. However, the generalization scores remain below 0.35 with ROME and the locality scores do not exceed 0.29 with MEMIT for any model. In Italian, TowerInstruct shows relatively high reliability (0.87 with ROME and 0.74 with MEMIT), yet the generalization and locality scores remain low (highest being 0.35 on ROME and 0.28 on MEMIT for Mistral). For the three Indic languages the discrepancies become even more pronounced. OpenHathi, for example, shows a drastic drop in Hindi, with a reliability of just 0.02 with ROME and 0.45 with MEMIT, indicating almost no comprehension of the language nuances. Tamil-Llama and Kan-Llama also display low scores across all properties. The highest reliability achieved is 0.21 with ROME for Kan-Llama and 0.48 with MEMIT for Tamil-Llama, which highlights the limitations of these language models.
Portability scores are consistently low across all languages, models, and metrics, demonstrating a significant gap in model training as it fails to effectively account for diverse linguistic structures and cultural contexts.

6.2 English edit - self inference perspective

In this setup we perform the edit in English and obtain the generated output from the model in other languages (e.g., German, Italian, etc.).
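
The two evaluation setups can be pictured as sets of (edit language, inference language) pairs. The following is a minimal sketch rather than the authors' evaluation code: the short language codes and the helper names `self_edit_pairs` and `english_edit_pairs` are assumptions introduced here, and the sketch ignores the fact that each model is evaluated only on the languages it supports.

```python
# Languages studied in the paper; the two-letter codes are illustrative labels.
LANGUAGES = ["En", "De", "Fr", "It", "Es", "Hi", "Ta", "Kn"]


def self_edit_pairs(languages):
    """Self edit - self inference: edit and query in the same language."""
    return [(lang, lang) for lang in languages]


def english_edit_pairs(languages, source="En"):
    """English edit - self inference: edit in English, query in every other language."""
    return [(source, lang) for lang in languages if lang != source]


if __name__ == "__main__":
    print(self_edit_pairs(LANGUAGES))
    print(english_edit_pairs(LANGUAGES))
```

Each pair would then be scored on the four properties (reliability, generalization, locality, portability) reported in the tables.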

CounterFact dataset: In German, the reliability scores for TowerInstruct and Mistral suggest moderate effectiveness, with ROME around 0.48 and MEMIT around 0.40 (see Table 3). However, their generalization and locality scores, which do not exceed 0.25 and 0.26 respectively, reveal the models’ limited ability to generalize and localize content effectively. For Spanish, there is a noticeable improvement in reliability, with ROME scores for Mistral reaching 0.57, and a slight improvement in the generalization and locality metrics compared to German. Italian and French show similar trends, with reliability scores peaking at 0.47 for Mistral in Italian and 0.49 in French; the generalization and locality scores are still lower. For Tamil and Kannada the reliability scores are exceptionally low. In fact, for Tamil the score is 0 with ROME and 0.01 with MEMIT. By comparison, the reliability scores for Hindi are quite good, at 0.56 with ROME. However, the portability and generalization scores are again very poor.
ZsRE dataset: For languages such as German and Spanish, the models display moderate reliability, with Mistral achieving ROME scores of up to 0.34 and 0.39 respectively, and MEMIT scores of 0.14 and 0.19 respectively (see Table 4). However, the scores drop significantly for locality and portability, showing that while the models can identify relevant relationships, they struggle to generalize and adapt to the specific linguistic nuances of these languages. The trends are similar in Italian and French, where reliability scores are moderate while locality and generalization scores are poor. Further, for the Indic languages, the scores are exceedingly low for all the properties, indicating the stark gap in performance for highly resource-scarce languages.

6.3 Merged model perspective

Table 5 presents performance metrics for the merged model, showing how it handles edits across various languages, with columns representing the inferencing language and rows indicating the editing language. When both editing and inferencing are done in English, the reliability scores for the CounterFact dataset are quite high, with ROME and MEMIT reaching 0.73 and 0.95 respectively. However, performance declines sharply when editing is done in English and inferencing in Hindi, Tamil, or Kannada, with scores near zero, highlighting a stark limitation in the model’s cross-lingual capabilities. This trend is consistently observed across both datasets. Editing in languages like Hindi, Tamil, or Kannada results in uniformly poor outcomes across all four properties regardless of the inferencing language, which indicates profound inadequacies in the model’s ability to generalize and adapt across linguistic barriers. This pattern emphasizes the critical need for advancing multilingual model adaptability. The current findings suggest that while the model operates effectively within a single linguistic environment, its performance deteriorates dramatically across linguistic boundaries, especially when the edit is made in a lesser-resourced language. Such insights call for a significant enhancement in training approaches, so that models are truly multilingual in functionality and can proficiently manage edits and inferencing across a diverse linguistic landscape.

7 Error analysis

In Table 6 we show the different types of linguistic errors encountered during the translation and editing process. The errors are categorised by the type of ambiguity involved, which sheds light on how future models should be strengthened by carefully harnessing techniques to tackle these errors.

8 Conclusion

In this study, we investigated the impact of knowledge editing across different languages based on the CounterFact and ZsRE datasets along with their translations. Our extensive experiments employing a variety of knowledge editing techniques on an array of multilingual LLMs resulted in various crucial observations. We discovered that variations in language-specific model architecture significantly affect the success of knowledge edits, that current editing methods often fail to seamlessly transfer alterations from one language to another, and that modifications made in one language might unexpectedly alter model behavior in another language. This study lays the groundwork for future innovations that could lead to more sophisticated and linguistically inclusive AI technologies.

References

Alves et al. (2024) Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, and André F. T. Martins. 2024. Tower: An open multilingual large language model for translation-related tasks. Preprint, arXiv:2402.17733.
Banerjee et al. (2024) Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. 2024. How (un)ethical are instruction-centric responses of llms? unveiling the vulnerabilities of safety guardrails to harmful queries. CoRR, abs/2402.15302.
Beniwal et al. (2024) Himanshu Beniwal, Kowsik D, and Mayank Singh. 2024. Cross-lingual editing in multilingual language models. In Findings of the Association for Computational Linguistics: EACL 2024, pages 2078–2128, St. Julian’s, Malta. Association for Computational Linguistics.
Dai et al. (2022) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Dublin, Ireland. Association for Computational Linguistics.
Das et al. (2022) Mithun Das, Somnath Banerjee, and Animesh Mukherjee. 2022. Data bootstrapping approaches to improve low resource abusive language detection for indic languages. In Proceedings of the 33rd ACM Conference on Hypertext and Social Media, HT ’22, page 32–42, New York, NY, USA. Association for Computing Machinery.
De Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Hase et al. (2023) Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. 2023. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. In Thirty-seventh Conference on Neural Information Processing Systems.
Hazra et al. (2024) Rima Hazra, Sayan Layek, Somnath Banerjee, and Soujanya Poria. 2024. Sowing the wind, reaping the whirlwind: The impact of editing language models. CoRR, abs/2401.10647.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
Jiang et al. (2020) Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, and Graham Neubig. 2020. X-FACTR: Multilingual factual knowledge retrieval from pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5943–5959, Online. Association for Computational Linguistics.
Kassner et al. (2021) Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021. Multilingual LAMA: Investigating knowledge in multilingual pretrained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3250–3258, Online. Association for Computational Linguistics.
Levenshtein (1965) Vladimir I. Levenshtein. 1965. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 10:707–710.
Levy et al. (2017) Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.
Lobachev (2008) Sergey Lobachev. 2008. Top languages in global information production. Partnership: The Canadian Journal of Library and Information Practice and Research, 3(2).
Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36. ArXiv:2202.05262.
Meng et al. (2023) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. Mass-editing memory in a transformer. Preprint, arXiv:2210.07229.
Qi et al. (2023) Jirui Qi, Raquel Fernández, and Arianna Bisazza. 2023. Cross-lingual consistency of factual knowledge in multilingual language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10650–10666, Singapore. Association for Computational Linguistics.
Sinitsin et al. (2020) Anton Sinitsin, Vsevolod Plokhotnyuk, Dmitry Pyrkin, Sergei Popov, and Artem Babenko. 2020. Editable neural networks. In International Conference on Learning Representations.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Wang et al. (2023) Jiaan Wang, Yunlong Liang, Zengkui Sun, Yuxuan Cao, and Jiarong Xu. 2023. Cross-lingual knowledge editing in large language models. Preprint, arXiv:2309.08952.
Workshop et al. (2023) BigScience Workshop, Teven Le Scao, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model. Preprint, arXiv:2211.05100.
Xu et al. (2023) Yang Xu, Yutai Hou, Wanxiang Che, and Min Zhang. 2023. Language anisotropic cross-lingual model editing. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5554–5569, Toronto, Canada. Association for Computational Linguistics.
Yadav et al. (2023) Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. 2023. Ties-merging: Resolving interference when merging models. Preprint, arXiv:2306.01708.
Yao et al. (2023) Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. Editing large language models: Problems, methods, and opportunities. Preprint, arXiv:2305.13172.
Yin et al. (2022) Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, and Kai-Wei Chang. 2022. Geomlama: Geo-diverse commonsense probing on multilingual pre-trained language models. Preprint, arXiv:2205.12247.

Appendix A Hyperparameters

We adopt all essential parameter values from the ROME and MEMIT study for all the LLMs. The details of these hyperparameters are provided in Table 7.

Hyperparameter	Value
layers	[5]
fact_token	subject_last
v_num_grad_steps	25
v_lr	5e-1
v_loss_layer	31
v_weight_decay	1e-3
clamp_norm_factor	4
kl_factor	0.0625
mom2_adjustment	false
context_template_length_params	[[5, 10], [10, 10]]
rewrite_module_tmp	model.layers.{}.mlp.down_proj
layer_module_tmp	model.layers.{}
mlp_module_tmp	model.layers.{}.mlp
attn_module_tmp	model.layers.{}.self_attn
ln_f_module	model.norm
lm_head_module	lm_head
model_parallel	true

Table 7: Hyperparameter values (most of the default values extend from ROME and MEMIT setup).
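
The module templates in Table 7 contain a `{}` placeholder for the layer index. As a minimal sketch of how such a configuration might be consumed, the entries can be written as a Python dictionary and the rewrite target resolved for each edited layer. The dictionary layout and the helper `resolve_rewrite_modules` are illustrative assumptions, not part of the ROME or MEMIT code; only the values are taken from Table 7.

```python
# Values copied from Table 7; the dict layout itself is an assumption.
HPARAMS = {
    "layers": [5],
    "fact_token": "subject_last",
    "v_num_grad_steps": 25,
    "v_lr": 5e-1,
    "v_loss_layer": 31,
    "v_weight_decay": 1e-3,
    "clamp_norm_factor": 4,
    "kl_factor": 0.0625,
    "mom2_adjustment": False,
    "context_template_length_params": [[5, 10], [10, 10]],
    "rewrite_module_tmp": "model.layers.{}.mlp.down_proj",
    "layer_module_tmp": "model.layers.{}",
    "mlp_module_tmp": "model.layers.{}.mlp",
    "attn_module_tmp": "model.layers.{}.self_attn",
    "ln_f_module": "model.norm",
    "lm_head_module": "lm_head",
    "model_parallel": True,
}


def resolve_rewrite_modules(hparams):
    """Fill each edited layer index into the rewrite-module template (hypothetical helper)."""
    template = hparams["rewrite_module_tmp"]
    return [template.format(layer) for layer in hparams["layers"]]


if __name__ == "__main__":
    for name in resolve_rewrite_modules(HPARAMS):
        print(name)
```

With `layers` set to `[5]`, the only module targeted for rewriting is the MLP down-projection of layer 5.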

Appendix B Exact vs partial match

We plot the correlations between the exact-match and partial-match scores in Figures 2 and 3.
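
As a rough illustration of the kind of correlation involved, the sketch below computes a Pearson coefficient between two score lists. It assumes that each `a/b` cell in the appendix tables reports the exact-match score followed by the partial-match score, which the captions do not state explicitly. The five pairs are the Mistral reliability values for the En edit in Table 8, and the function name `pearson` is introduced here for illustration only; this is not the procedure used to produce the figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Assumed exact / partial pairs (Table 8, Mistral, En edit, Rel row).
exact = [0.994, 0.498, 0.469, 0.487, 0.571]
partial = [0.994, 0.560, 0.578, 0.548, 0.617]
r = pearson(exact, partial)

if __name__ == "__main__":
    print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 on such pairs would indicate that the two matching criteria rank the settings in nearly the same way.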

Appendix C Romance and Germanic languages

C.1 Language perspective

C.1.1 CounterFact

For the CounterFact dataset, significant disparities are observed in edited-model performance across languages. Edits made in En and tested on En consistently showed high reliability scores across all models, with Mistral achieving near-perfect reliability at 0.994 and TowerInstruct at 0.996 (for ROME). However, performance when testing in De, It, Fr, and Es was notably lower, particularly in the generalization (approximately 0.21-0.28 for Mistral) and locality (0.20-0.28 for Mistral) metrics, indicating challenges in generalization and nuanced information processing in non-English contexts. The portability scores were modest across the board, underscoring a pronounced need for enhanced multilingual model adaptability.

When the edit is conducted in De and tested on De, the reliability scores for TowerInstruct (0.828) and Mistral (0.834) (for ROME) are reasonably high, indicating strong contextual understanding. However, testing in other languages such as It, Fr, and Es yields lower scores, reflecting challenges in language-specific processing.

After editing with It, the edited model achieved its highest reliability score with TowerInstruct for the test language It (0.871) (for ROME). However, the reliability scores for the other test languages were lower, with En at 0.535, De at 0.398, Fr at 0.490, and Es at 0.488, reflecting the challenge of extending the effect of an edit beyond Italian. The highest portability score for this edit was 0.095 (Mistral, tested on En, for ROME); the scores were significantly lower in the other test languages.

After editing with Fr, the test language Fr achieved the highest reliability scores (0.832 for Mistral and 0.827 for TowerInstruct, for ROME), compared to Mistral’s performance in other languages such as En (0.519), De (0.417), It (0.509), and Es (0.511). The high score in Fr for TowerInstruct suggests that certain models can still effectively align with training data even in non-primary languages. For generality and locality, the scores were universally lower across all models and languages, indicating a struggle in generalizing the Fr edit. Locality scores also pointed to difficulties in identifying language-specific nuances, with TowerInstruct scoring 0.189 in It and 0.214 in Fr, both of which remain low.

After editing with Es, TowerInstruct showed a higher reliability score in En (0.555) than in the other non-Spanish test languages such as De (0.391) and It (0.451). Es itself exhibited notably high reliability scores, with TowerInstruct achieving 0.822 and Mistral 0.812, indicating these models’ effective adaptation to Spanish linguistic features. The generality and locality metrics, which measure a model’s ability to generalize training and identify language-specific information, respectively, showed universally lower scores across all languages, highlighting challenges in cross-lingual applicability.

C.1.2 ZsRE

After editing with En, the reliability score for the Mistral model in En was remarkably high at 0.929. However, this contrasts sharply with its performance in other languages such as De (0.344) and It (0.312), suggesting a significant drop in model effectiveness when moving away from En. Similarly, the TowerInstruct model showed strong performance when the test language was En, with a reliability score of 0.875, yet its scores in other languages such as De (0.236) and Fr (0.221) were markedly lower, highlighting the challenges in maintaining model performance across linguistic boundaries (for ROME). The generalization and locality scores also emphasize the disparity. While Mistral displayed good generality in En (0.812), its scores in languages such as De and It were only around 0.260. This trend of decreased performance is echoed in the locality scores, where Mistral exhibited almost no ability to identify language-specific nuances in It and Fr. Mistral’s portability score for En was 0.097, which, although not very high, still exceeds its De and Fr counterparts, suggesting a somewhat better but still limited ability to adapt training across languages (for ROME).

After editing with De, the TowerInstruct model exhibited significant variations in reliability scores, achieving its highest in De (0.480) but only 0.157 in En, indicating a substantial challenge in transferring De edits to other languages. Similarly, Mistral displayed its best reliability in De at 0.513, while its score in It was only 0.257, again showing that an edit is most effective in the language in which it is made. Further examination of the generalization and locality metrics highlights these disparities even more. For instance, the generalization score for Mistral in De stood at 0.349, yet locality scores were nearly zero across the board, showing a significant deficiency in capturing language-specific details. Portability scores also reflect limited adaptability, with Mistral scoring only 0.079 for De and 0.066 for It, underscoring the need for model training approaches that better address and bridge these linguistic gaps to enhance overall performance and applicability across diverse linguistic datasets (for ROME).

After editing with It, the TowerInstruct model exhibited a disparity in reliability scores, achieving a high value of 0.537 in It but only 0.185 in De, underscoring a significant challenge in transferring the edit to De compared to the Romance languages. Similarly, Mistral demonstrated better reliability in It (0.575), further indicating that models tend to align more effectively with training data in certain languages over others. The generality and locality scores further emphasize these challenges.

After editing with Fr, TowerInstruct demonstrated stronger performance in Fr, with a reliability score of 0.507 and a generality score of 0.281, compared to its performance in Es (Rel: 0.138, Gen: 0.113) and It (Rel: 0.197, Gen: 0.167). This indicates a more robust alignment with Fr linguistic features. Mistral also exhibited its highest reliability in Fr (0.517) but struggled in De (0.298) and It (0.272), further underscoring the varying model efficiencies across languages. These findings highlight significant challenges in model training, where improvements are needed to enhance language-specific understanding and adaptability, ensuring that models perform consistently well across a diverse linguistic spectrum.

After editing with Es, TowerInstruct achieved a reliability score of 0.443 for Es, significantly surpassing its scores in other languages such as En (0.232) and De (0.148). This trend suggests a stronger model alignment with the linguistic properties of Es. For generality, TowerInstruct also performs better in Es, with a score of 0.305, than in It (0.202) and Fr (0.182). The locality scores were generally low across all languages.

Dataset	Edit language	Score	Mistral					TowerInstruct
			En	De	It	Fr	Es	En	De	It	Fr	Es
CounterFact	En	Rel	0.994/0.994	0.498/0.560	0.469/0.578	0.487/0.548	0.571/0.617	0.996/0.996	0.482/0.529	0.455/0.500	0.498/0.527	0.511/0.562
		Gen	0.512/0.529	0.233/0.269	0.246/0.346	0.279/0.305	0.252/0.294	0.522/0.538	0.245/0.273	0.231/0.267	0.280/0.309	0.256/0.291
		Loc	0.327/0.338	0.227/0.250	0.240/0.358	0.200/0.362	0.244/0.265	0.307/0.315	0.196/0.207	0.204/0.209	0.225/0.235	0.224/0.238
		Port	0.133/0.144	0.029/0.033	0.027/0.111	0.013/0.119	0.027/0.035	0.005/0.013	0.000/0.004	0.011/0.018	0.005/0.005	0.002/0.004
	De	Rel	0.558/0.591	0.834/0.961	0.471/0.506	0.423/0.471	0.446/0.500	0.589/0.614	0.828/0.959	0.431/0.489	0.439/0.481	0.429/0.497
		Gen	0.355/0.394	0.284/0.313	0.266/0.303	0.255/0.286	0.245/0.282	0.322/0.345	0.271/0.314	0.211/0.246	0.224/0.246	0.224/0.255
		Loc	0.365/0.376	0.208/0.228	0.251/0.264	0.193/0.207	0.263/0.280	0.287/0.292	0.222/0.232	0.212/0.216	0.214/0.224	0.211/0.224
		Port	0.114/0.133	0.029/0.039	0.025/0.027	0.023/0.023	0.033/0.037	0.004/0.014	0.008/0.008	0.004/0.006	0.006/0.012	0.000/0.002
	It	Rel	0.541/0.578	0.422/0.477	0.860/0.914	0.502/0.542	0.519/0.582	0.535/0.564	0.398/0.450	0.871/0.932	0.490/0.535	0.488/0.556
		Gen	0.319/0.346	0.202/0.218	0.278/0.296	0.235/0.239	0.235/0.267	0.330/0.349	0.226/0.253	0.346/0.376	0.263/0.290	0.268/0.311
		Loc	0.350/0.358	0.230/0.251	0.257/0.270	0.210/0.264	0.253/0.265	0.293/0.301	0.199/0.205	0.185/0.189	0.214/0.222	0.203/0.216
		Port	0.095/0.111	0.031/0.045	0.021/0.031	0.012/0.023	0.019/0.031	0.008/0.010	0.004/0.004	0.019/0.021	0.010/0.012	0.006/0.006
	Fr	Rel	0.519/0.548	0.417/0.485	0.509/0.542	0.832/0.890	0.511/0.566	0.530/0.550	0.383/0.440	0.454/0.501	0.827/0.898	0.458/0.506
		Gen	0.282/0.305	0.190/0.215	0.219/0.239	0.294/0.297	0.252/0.268	0.281/0.297	0.200/0.222	0.208/0.230	0.308/0.330	0.234/0.281
		Loc	0.350/0.362	0.243/0.256	0.249/0.264	0.204/0.217	0.276/0.294	0.303/0.316	0.204/0.214	0.189/0.198	0.214/0.220	0.224/0.208
		Port	0.106/0.119	0.020/0.025	0.022/0.023	0.029/0.033	0.023/0.029	0.006/0.018	0.010/0.016	0.010/0.012	0.004/0.006	0.002/0.008
	Es	Rel	0.528/0.548	0.409/0.458	0.483/0.542	0.489/0.544	0.812/0.908	0.555/0.581	0.391/0.429	0.451/0.516	0.466/0.554	0.822/0.921
		Gen	0.297/0.321	0.194/0.217	0.241/0.272	0.231/0.252	0.280/0.315	0.318/0.340	0.184/0.219	0.233/0.251	0.265/0.263	0.330/0.372
		Loc	0.346/0.358	0.235/0.250	0.249/0.262	0.209/0.223	0.254/0.268	0.294/0.300	0.211/0.217	0.186/0.188	0.200/0.238	0.211/0.223
		Port	0.106/0.123	0.022/0.023	0.035/0.037	0.023/0.025	0.029/0.033	0.008/0.014	0.002/0.002	0.008/0.014	0.010/0.020	0.000/0.002

Table 8: Comparison of reliability (Rel), generalization (Gen), locality (Loc), and portability (Port) scores for multiple language models evaluated using the CounterFact dataset and the ROME editing method. The second column indicates the language in which each model was edited.
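
To make the gap between same-language and cross-language editing concrete, the following sketch averages the diagonal and off-diagonal entries of the Mistral reliability block of Table 8 (rows are edit languages, columns are test languages). It uses the first value of each cell, which is the value quoted in the text; the function name `same_vs_cross` and the aggregation itself are illustrative assumptions, not an analysis reported by the authors.

```python
LANGS = ["En", "De", "It", "Fr", "Es"]

# Mistral reliability (ROME, CounterFact): first value of each Rel cell in Table 8.
# Rows: edit language; columns: test language, both in the order of LANGS.
REL = [
    [0.994, 0.498, 0.469, 0.487, 0.571],
    [0.558, 0.834, 0.471, 0.423, 0.446],
    [0.541, 0.422, 0.860, 0.502, 0.519],
    [0.519, 0.417, 0.509, 0.832, 0.511],
    [0.528, 0.409, 0.483, 0.489, 0.812],
]


def same_vs_cross(matrix):
    """Return (mean of diagonal, mean of off-diagonal) for a square score matrix."""
    n = len(matrix)
    same = [matrix[i][i] for i in range(n)]
    cross = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(same) / len(same), sum(cross) / len(cross)


same, cross = same_vs_cross(REL)

if __name__ == "__main__":
    print(f"same-language mean:  {same:.3f}")
    print(f"cross-language mean: {cross:.3f}")
```

On these values the same-language mean is about 0.87 and the cross-language mean about 0.49, which matches the qualitative picture described in C.1.1.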

Dataset	Edit language	Score	Mistral					TowerInstruct
			En	De	It	Fr	Es	En	De	It	Fr	Es
ZSRE	En	Rel	0.929/0.981	0.344/0.448	0.312/0.344	0.364/0.442	0.390/0.481	0.875/0.928	0.236/0.279	0.240/0.293	0.221/0.260	0.240/0.288
		Gen	0.812/0.851	0.260/0.351	0.260/0.325	0.292/0.331	0.331/0.409	0.620/0.683	0.183/0.226	0.168/0.216	0.149/0.207	0.183/0.255
		Loc	0.000/0.006	0.013/0.019	0.000/0.019	0.013/0.026	0.013/0.019	0.010/0.019	0.000/0.010	0.000/0.005	0.000/0.014	0.005/0.010
		Port	0.097/0.136	0.065/0.071	0.071/0.078	0.058/0.091	0.039/0.058	0.053/0.062	0.019/0.019	0.010/0.024	0.019/0.019	0.019/0.034
	De	Rel	0.382/0.474	0.513/0.625	0.257/0.336	0.289/0.349	0.270/0.355	0.157/0.216	0.480/0.593	0.221/0.260	0.211/0.240	0.176/0.211
		Gen	0.342/0.428	0.349/0.454	0.237/0.309	0.237/0.289	0.217/0.289	0.152/0.196	0.333/0.387	0.162/0.201	0.142/0.172	0.132/0.167
		Loc	0.000/0.007	0.013/0.020	0.000/0.013	0.013/0.020	0.013/0.020	0.010/0.020	0.000/0.010	0.000/0.005	0.000/0.015	0.005/0.010
		Port	0.079/0.092	0.079/0.099	0.066/0.079	0.072/0.099	0.053/0.086	0.010/0.020	0.025/0.025	0.010/0.015	0.020/0.020	0.010/0.015
	It	Rel	0.314/0.386	0.288/0.340	0.575/0.654	0.333/0.399	0.281/0.366	0.176/0.224	0.185/0.215	0.537/0.624	0.210/0.268	0.229/0.340
		Gen	0.340/0.405	0.242/0.281	0.418/0.484	0.294/0.373	0.222/0.327	0.161/0.200	0.137/0.185	0.346/0.429	0.180/0.239	0.122/0.271
		Loc	0.000/0.007	0.013/0.020	0.000/0.020	0.013/0.020	0.013/0.020	0.010/0.015	0.000/0.010	0.000/0.005	0.000/0.015	0.005/0.005
		Port	0.059/0.085	0.072/0.078	0.072/0.085	0.078/0.105	0.039/0.072	0.029/0.029	0.029/0.029	0.015/0.020	0.029/0.034	0.020/0.030
	Fr	Rel	0.424/0.477	0.298/0.344	0.272/0.391	0.517/0.629	0.331/0.444	0.143/0.177	0.153/0.187	0.197/0.256	0.507/0.591	0.138/0.167
		Gen	0.371/0.424	0.285/0.325	0.245/0.325	0.404/0.503	0.245/0.351	0.138/0.177	0.133/0.167	0.167/0.192	0.281/0.350	0.113/0.163
		Loc	0.000/0.007	0.013/0.020	0.000/0.020	0.013/0.020	0.013/0.020	0.010/0.020	0.000/0.010	0.000/0.005	0.005/0.015	0.005/0.010
		Port	0.132/0.159	0.066/0.066	0.073/0.086	0.060/0.093	0.040/0.060	0.015/0.025	0.025/0.025	0.010/0.020	0.034/0.054	0.005/0.020
	Es	Rel	0.367/0.440	0.260/0.320	0.360/0.433	0.307/0.400	0.487/0.607	0.232/0.232	0.148/0.158	0.241/0.340	0.182/0.236	0.443/0.591
		Gen	0.287/0.367	0.227/0.280	0.247/0.313	0.333/0.387	0.353/0.453	0.153/0.177	0.094/0.118	0.202/0.271	0.182/0.241	0.305/0.404
		Loc	0.000/0.007	0.013/0.020	0.000/0.020	0.013/0.020	0.007/0.013	0.010/0.010	0.000/0.005	0.000/0.005	0.000/0.010	0.005/0.010
		Port	0.060/0.080	0.040/0.060	0.033/0.060	0.047/0.080	0.033/0.067	0.000/0.000	0.010/0.010	0.015/0.030	0.010/0.020	0.020/0.020

Table 9: Comparison of reliability (Rel), generalization (Gen), locality (Loc), and portability (Port) scores for multiple language models evaluated using the ZsRE dataset and the ROME editing method. The second column indicates the language in which each model was edited.

Dataset	Edit language	Score	Mistral					TowerInstruct
			En	De	It	Fr	Es	En	De	It	Fr	Es
CounterFact	En	Rel	0.988/0.988	0.537/0.606	0.438/0.494	0.506/0.588	0.562/0.600	0.954/0.963	0.404/0.459	0.349/0.404	0.450/0.486	0.404/0.477
		Gen	0.444/0.456	0.219/0.225	0.212/0.225	0.263/0.269	0.212/0.263	0.431/0.431	0.128/0.174	0.193/0.202	0.193/0.220	0.183/0.220
		Loc	0.381/0.388	0.256/0.281	0.275/0.287	0.250/0.263	0.250/0.269	0.275/0.294	0.193/0.220	0.202/0.202	0.193/0.211	0.165/0.165
		Port	0.156/0.188	0.025/0.037	0.037/0.037	0.031/0.037	0.025/0.037	0.000/0.000	0.000/0.000	0.009/0.018	0.009/0.009	0.000/0.000
	De	Rel	0.439/0.484	0.726/0.866	0.376/0.420	0.350/0.369	0.363/0.414	0.355/0.391	0.727/0.827	0.282/0.380	0.309/0.309	0.255/0.300
		Gen	0.242/0.280	0.191/0.223	0.185/0.191	0.185/0.217	0.178/0.210	0.227/0.236	0.191/0.218	0.136/0.176	0.182/0.209	0.145/0.164
		Loc	0.376/0.389	0.242/0.268	0.280/0.293	0.229/0.242	0.274/0.280	0.264/0.282	0.191/0.218	0.200/0.231	0.209/0.227	0.200/0.200
		Port	0.108/0.134	0.045/0.064	0.025/0.025	0.013/0.025	0.032/0.051	0.000/0.000	0.009/0.009	0.009/0.009	0.009/0.009	0.000/0.000
	It	Rel	0.372/0.404	0.353/0.410	0.801/0.878	0.455/0.526	0.449/0.526	0.407/0.444	0.361/0.380	0.741/0.778	0.389/0.417	0.426/0.454
		Gen	0.256/0.263	0.141/0.167	0.237/0.269	0.192/0.231	0.179/0.212	0.315/0.315	0.139/0.176	0.250/0.259	0.204/0.213	0.185/0.213
		Loc	0.385/0.397	0.263/0.288	0.269/0.282	0.250/0.263	0.276/0.282	0.269/0.287	0.204/0.231	0.204/0.204	0.194/0.213	0.176/0.176
		Port	0.122/0.147	0.013/0.032	0.026/0.026	0.013/0.019	0.019/0.045	0.009/0.009	0.009/0.009	0.019/0.028	0.009/0.009	0.000/0.000
	Fr	Rel	0.439/0.459	0.395/0.471	0.401/0.433	0.790/0.847	0.446/0.478	0.468/0.477	0.330/0.385	0.330/0.376	0.651/0.716	0.330/0.367
		Gen	0.229/0.268	0.153/0.166	0.159/0.172	0.236/0.255	0.153/0.172	0.294/0.312	0.128/0.147	0.183/0.183	0.220/0.239	0.174/0.193
		Loc	0.389/0.401	0.268/0.293	0.280/0.293	0.242/0.255	0.274/0.280	0.248/0.266	0.183/0.211	0.183/0.183	0.174/0.193	0.174/0.174
		Port	0.089/0.115	0.019/0.032	0.019/0.019	0.025/0.032	0.013/0.025	0.000/0.000	0.009/0.009	0.009/0.018	0.000/0.000	0.000/0.000
	Es	Rel	0.433/0.465	0.338/0.382	0.401/0.452	0.471/0.522	0.777/0.860	0.435/0.463	0.306/0.324	0.370/0.398	0.380/0.398	0.704/0.796
		Gen	0.210/0.229	0.127/0.159	0.121/0.134	0.185/0.217	0.223/0.274	0.241/0.250	0.148/0.157	0.194/0.204	0.213/0.213	0.231/0.269
		Loc	0.395/0.408	0.274/0.306	0.268/0.287	0.242/0.255	0.274/0.287	0.259/0.278	0.194/0.222	0.185/0.185	0.176/0.194	0.185/0.185
		Port	0.108/0.134	0.025/0.051	0.006/0.006	0.013/0.013	0.025/0.045	0.009/0.009	0.000/0.009	0.009/0.019	0.019/0.019	0.000/0.000

Table 10: Comparison of reliability (Rel), generalization (Gen), locality (Loc), and portability (Port) scores for multiple language models evaluated using the CounterFact dataset and the MEMIT editing method. The second column indicates the language in which each model was edited.

Dataset	Edit language	Score	Mistral					TowerInstruct
			En	De	It	Fr	Es	En	De	It	Fr	Es
ZSRE	En	Rel	0.786/0.812	0.136/0.182	0.227/0.266	0.227/0.279	0.188/0.260	0.528/0.538	0.104/0.142	0.123/0.142	0.123/0.170	0.123/0.142
		Gen	0.513/0.545	0.136/0.162	0.175/0.208	0.156/0.208	0.136/0.208	0.321/0.330	0.123/0.142	0.113/0.132	0.075/0.104	0.094/0.113
		Loc	0.019/0.026	0.013/0.032	0.013/0.019	0.013/0.019	0.019/0.019	0.019/0.038	0.000/0.019	0.000/0.009	0.000/0.019	0.009/0.019
		Port	0.039/0.065	0.019/0.032	0.006/0.006	0.039/0.052	0.039/0.045	0.019/0.028	0.019/0.019	0.019/0.019	0.019/0.019	0.009/0.009
	De	Rel	0.158/0.204	0.382/0.474	0.138/0.178	0.112/0.132	0.118/0.164	0.029/0.077	0.250/0.298	0.048/0.067	0.038/0.058	0.048/0.048
		Gen	0.125/0.171	0.184/0.243	0.138/0.164	0.105/0.118	0.086/0.125	0.058/0.067	0.106/0.115	0.048/0.067	0.038/0.048	0.038/0.058
		Loc	0.020/0.026	0.007/0.026	0.013/0.020	0.013/0.020	0.020/0.020	0.019/0.029	0.000/0.010	0.000/0.010	0.000/0.019	0.010/0.019
		Port	0.039/0.066	0.020/0.039	0.013/0.013	0.007/0.020	0.020/0.033	0.010/0.019	0.000/0.000	0.000/0.000	0.010/0.010	0.000/0.000
	It	Rel	0.144/0.176	0.157/0.196	0.425/0.503	0.144/0.183	0.163/0.216	0.019/0.038	0.038/0.067	0.248/0.286	0.067/0.086	0.095/0.124
		Gen	0.105/0.150	0.085/0.118	0.255/0.307	0.144/0.183	0.105/0.157	0.029/0.067	0.048/0.076	0.162/0.200	0.038/0.057	0.048/0.067
		Loc	0.020/0.026	0.007/0.026	0.013/0.020	0.013/0.020	0.020/0.020	0.019/0.029	0.000/0.019	0.000/0.010	0.000/0.029	0.010/0.019
		Port	0.046/0.072	0.007/0.033	0.013/0.033	0.020/0.033	0.020/0.033	0.000/0.010	0.010/0.019	0.019/0.029	0.010/0.010	0.000/0.000
	Fr	Rel	0.139/0.172	0.099/0.152	0.166/0.238	0.397/0.497	0.119/0.166	0.048/0.077	0.048/0.067	0.038/0.077	0.269/0.346	0.019/0.058
		Gen	0.152/0.212	0.079/0.139	0.139/0.185	0.185/0.272	0.093/0.139	0.019/0.038	0.029/0.048	0.048/0.077	0.144/0.173	0.010/0.019
		Loc	0.020/0.026	0.013/0.033	0.013/0.020	0.013/0.020	0.020/0.020	0.019/0.029	0.000/0.019	0.000/0.010	0.000/0.019	0.010/0.010
		Port	0.060/0.079	0.020/0.033	0.020/0.020	0.040/0.060	0.040/0.053	0.019/0.019	0.010/0.010	0.000/0.010	0.029/0.029	0.000/0.000
	Es	Rel	0.107/0.153	0.073/0.106	0.166/0.213	0.147/0.186	0.373/0.493	0.058/0.087	0.038/0.058	0.087/0.115	0.058/0.106	0.240/0.337
		Gen	0.087/0.256	0.087/0.106	0.140/0.173	0.093/0.146	0.220/0.286	0.048/0.087	0.058/0.087	0.087/0.115	0.058/0.087	0.163/0.202
		Loc	0.020/0.026	0.007/0.026	0.013/0.020	0.013/0.020	0.020/0.020	0.019/0.029	0.000/0.019	0.000/0.010	0.000/0.019	0.010/0.019
		Port	0.033/0.060	0.007/0.013	0.027/0.033	0.033/0.046	0.027/0.040	0.010/0.010	0.000/0.000	0.010/0.010	0.019/0.019	0.010/0.019

Table 11: Comparison of reliability (Rel), generalization (Gen), locality (Loc), and portability (Port) scores for multiple language models evaluated using the ZsRE dataset and the MEMIT editing method. The second column indicates the language in which each model was edited.
