Repetition Improves Language Model Embeddings

Jacob Mitchell Springer Suhas Kotha
Daniel Fried Graham Neubig Aditi Raghunathan
Carnegie Mellon University
{jspringe, suhask, dfried, gneubig, aditirag}@cs.cmu.edu

Abstract

Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, “echo embeddings,” in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over $9\%$ zero-shot and by around $0.7\%$ when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data.¹

¹ Our code and pre-trained models are released at https://github.com/jakespringer/echo-embeddings.

1 Introduction

Neural text embeddings have a crucial role in modern approaches to information retrieval (IR), semantic similarity estimation, classification, and clustering (Ni et al., 2021b; Muennighoff et al., 2022). For example, document retrieval often leverages low-dimensional embeddings for efficient lookup: when queries and documents are encoded as vectors where semantic relationships are described by similarity in some metric space, a query lookup can be reduced to an approximate nearest-neighbor search in embedding space (Johnson et al., 2019; Vanderkam et al., 2013).

In the recent past, the dominant pretrained language model paradigm for neural embeddings has been masked language models with bidirectional attention (Ni et al., 2021a; Raffel et al., 2020; Izacard et al., 2021; Wang et al., 2022; Jiang et al., 2022; Su et al., 2022; Xiao et al., 2023a; Li et al., 2023). However, more recent literature (Ma et al., 2023; Wang et al., 2023) has begun to scale these algorithms to modern autoregressive language models such as LLaMA-2 and Mistral (Touvron et al., 2023; Jiang et al., 2023a). Developing approaches to construct embeddings from autoregressive language models is promising: for many tasks, these models are the highest quality models available (Srivastava et al., 2022).

Figure 1: Conceptual overview of echo embeddings.

In this paper, we address a striking failure mode of autoregressive language models. This failure arises from the fact that for autoregressive language models, contextualized token embeddings (the vector of last-hidden-layer activations at the position of a particular input token) do not contain information from tokens that appear later in the sentence, due to the causal attention mask. We demonstrate that such embeddings can fail to appropriately determine similarity when the early tokens of two inputs are superficially similar but the inputs become dissimilar in important ways once key information from the end of the input is taken into account.

We propose a strategy to overcome this limitation in autoregressive models through “echo embeddings.” With this approach, we repeat the input so that it appears twice in the context passed to the language model, and extract embeddings from the second occurrence. Repeating the input enables the contextualized token embeddings of the second occurrence of the passage to encode information from tokens that appear later in the passage by attending to their first occurrence. We show that echo embeddings do in fact allow embeddings of the early tokens to capture information about the later tokens.

We then evaluate echo embeddings on the standard Massive Text Embedding Benchmark (MTEB) leaderboard.² In the zero-shot setting, echo embeddings improve on classical embeddings by over $9\%$ and provide consistent gains across the different tasks for a variety of language models and scales. We then perform an apples-to-apples comparison when fine-tuning embeddings from Mistral-7B and continue to see consistent gains of echo over classical embeddings across the various tasks (by $0.7\%$ on average). Strikingly, echo embeddings with the strong Mistral-7B language model allow us to achieve state-of-the-art embedding quality, enabling autoregressive language models to match prior top-performing open-source models that otherwise leveraged MLMs with bidirectional attention.³

² The MTEB leaderboard can be found at https://huggingface.co/spaces/mteb/leaderboard.

³ The contemporaneous work by Wang et al. (2023) achieves state-of-the-art accuracy on MTEB using classical embeddings from autoregressive language models, but through finetuning on high quality synthetic data, which we believe is largely orthogonal to our contribution.

The approach of echo embeddings is conceptually well-motivated, extremely simple, and generally compatible with other innovations in extracting embeddings from autoregressive language models. As language models are likely to continue to improve over the coming years, echo embeddings can be a simple but powerful twist on classical embeddings that allows us to maximally leverage autoregressive language models.

2 Preliminaries

Our goal is to extract text embeddings that map a sentence $x$ to a vector $\phi(x)\in\mathbb{R}^{d}$ such that the semantic similarity between sentences is captured as similarity between their embeddings. In practice, we use the cosine similarity between embeddings to capture semantic similarity (detailed in Appendix B).

Embeddings from language models.

We are primarily interested in the embeddings extracted from autoregressive language models, which typically have causal attention masking and are trained on a next-token objective. For brevity, we drop the term “autoregressive” in the following.

As is standard, we extract embeddings from the activations of the final hidden layer. Each input token $x_{j}$ at position $j$ is associated with a contextualized token embedding, which is the hidden-layer representation $\phi_{j}(x)$.

We can pool the embeddings across all the tokens in different ways. In this work, we focus on two common strategies which have been considered by prior work (Reimers and Gurevych, 2019; Muennighoff, 2022; Zhang et al., 2023a; Wang et al., 2023).

A mean token embedding over a set of indices $A$ refers to the mean of the contextualized token embeddings at the indices in $A$: $\phi_{A}(x)\coloneqq\frac{1}{\left|A\right|}\sum_{t\in A}\phi_{t}(x)$.

A last-token embedding refers to the contextualized token embedding of the last token in the input sequence, written $\phi_{-1}(x)$ .
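The two pooling strategies above can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the authors' implementation: the function names are chosen here for illustration, and the toy matrix `phi` stands in for the final-layer activations $\phi_{t}(x)$ of a real model.

```python
from typing import List, Sequence

Vector = List[float]


def mean_token_embedding(token_embs: Sequence[Vector], indices: Sequence[int]) -> Vector:
    """Mean of the contextualized token embeddings phi_t(x) over t in the index set A."""
    if not indices:
        raise ValueError("the index set A must be non-empty")
    dim = len(token_embs[0])
    return [sum(token_embs[t][d] for t in indices) / len(indices) for d in range(dim)]


def last_token_embedding(token_embs: Sequence[Vector]) -> Vector:
    """Contextualized embedding of the final token, phi_{-1}(x)."""
    return list(token_embs[-1])


# Toy stand-in for per-token activations: 3 tokens, 2 dimensions.
phi = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

mean_all = mean_token_embedding(phi, [0, 1, 2])    # pool over every token
mean_first_two = mean_token_embedding(phi, [0, 1])  # pool over a subset A
last = last_token_embedding(phi)
```

Passing an explicit index set makes it possible to pool over only part of the input, which is what the later sections rely on when pooling over the second occurrence of a sentence or over its $A$-portion.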

Classical embeddings.

Traditionally, embeddings are computed by simply passing the sentence to the model and extracting some pooling (e.g. mean or last-token) of the contextualized embeddings corresponding to the input sentence. We will refer to embeddings created in this way as “classical embeddings”. Additionally, one might first prompt the language model with an explanation of the task of interest followed by the sentence, and then pool the contextualized embeddings of the sentence tokens like before (Su et al., 2022).

3 Echo Embeddings

In this section, we first demonstrate a failure mode of classical embeddings, and motivate a new method that we call echo embeddings that addresses this failure.

3.1 Classical Embeddings Miss Bidirectional Information.

Sentence embeddings should aggregate information across the entire sentence. However, for autoregressive language models, the contextualized embedding at position $k$, $\phi_{k}(x)$, cannot encode information about tokens $x_{k+1},x_{k+2},\ldots$. Hence, the “meaning” encoded by the embeddings of tokens at the beginning of a sentence might inaccurately suggest they are similar (or dissimilar) to other tokens without considering the influence of tokens that come later. As a simple illustration, consider the following.

	$q$: [She loves summer] [but dislikes the heat]
	$s^{-}$: [She loves summer] [for the warm evenings]
	$s^{+}$: [Summer is her favorite] [but not the temp.]

Here, the contextualized embeddings of the first half of $s^{+}$ and $s^{-}$ are both similar to $q$ because they do not attend to the second half of the sentence. As a result, the similarity between $q$ and $s^{-}$ would be overestimated by any pooling strategy that uses information from the first half. We address last-token pooling at the end of this section.

3.2 Echo Embeddings

We propose a simple fix to mitigate the failure above: we present the input sentence twice to the language model and extract contextualized embeddings from the second occurrence of the sentence. In principle, the contextualized embeddings of the second occurrence can attend to the entire sentence presented in the first occurrence. Furthermore, in order to encourage the second occurrence to actually “encode” information about the first, we instruct the language model to perform a generic task that requires using this information, e.g., “rewrite” or “repeat.”

Key to our method is passing the sentence twice to the model and pooling embeddings exclusively from the second occurrence.⁴ Other tricks from classical embeddings, such as prompting the model with the downstream task of interest, can be applied to echo embeddings as well.

⁴ We find that minor variations of the echo embeddings prompt (e.g. change “rewrite” to “repeat”) work equally well and we provide an example list in Appendix B.
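As a rough sketch of the bookkeeping this requires, the snippet below builds an echo input and records which token positions belong to the second occurrence. The template wording (`Rewrite the sentence:` / `Rewritten sentence:`) and the whitespace tokenizer are assumptions made for illustration; a real pipeline would use the model's own tokenizer and the prompts listed in the appendix.

```python
from typing import List, Tuple


def build_echo_input(
    sentence: str,
    prefix: str = "Rewrite the sentence:",   # hypothetical instruction wording
    middle: str = "Rewritten sentence:",     # hypothetical bridge wording
) -> Tuple[List[str], List[int]]:
    """Return the echo token sequence and the positions of the second occurrence."""
    tokenize = str.split  # whitespace split, a stand-in for a real model tokenizer
    first = tokenize(prefix) + tokenize(sentence)
    bridge = tokenize(middle)
    second = tokenize(sentence)
    tokens = first + bridge + second
    start = len(first) + len(bridge)
    pool_indices = list(range(start, start + len(second)))
    return tokens, pool_indices


tokens, pool_indices = build_echo_input("She loves summer but dislikes the heat")
second_occurrence = [tokens[i] for i in pool_indices]
```

Only the positions in `pool_indices` would be pooled. The first occurrence and the instruction text serve purely as context that the second occurrence can attend to.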

3.3 Repetition Captures Bidirectional Info

In the previous section, we argued that classical embeddings suffer from the issue that contextualized embeddings of early tokens can miss out on information from the later tokens. But can simple repetition via echo embeddings solve this issue? We test this by extracting embeddings from Mistral-7B in a simple, controlled synthetic setting.

Given a query $q\colon[A,B]$, we construct sentence pairs $s^{+},s^{-}$ as follows. The first part of each sentence is identical to the first part of the query, and the sentences differ only in their second parts:

$q\colon[A,B]$ ; $s^{+}\colon[A,B^{+}]$ ; $s^{-}\colon[A,B^{-}]$ ,

where $B^{+}$ is $B$ but paraphrased and $B^{-}$ is semantically dissimilar to $B$ . We query GPT-4 to generate examples of this structure. We describe the full procedure and prompts in Appendix B.

With classical embeddings, the contextualized embeddings of the $A$ parts of $s^{+},s^{-},q$ are identical by construction. To test whether echo embeddings can meaningfully distinguish $s^{+}$ and $s^{-}$ despite having identical initial tokens, we take the mean over just the $A$-portion of the echo embeddings and plot the cosine similarities $\operatorname{Sim}(q,s^{+})$ and $\operatorname{Sim}(q,s^{-})$ in Figure 2 (left). We find that $\operatorname{Sim}(q,s^{+})$ is typically larger than $\operatorname{Sim}(q,s^{-})$. Since we are only pooling the echo embeddings of the $A$-portion, any distinction between $s^{+},s^{-}$ must come from the echo embeddings of $A$ capturing information from the later parts of the sentence. This showcases that current autoregressive language models can in fact allow early tokens to capture information from later tokens via echo embeddings.
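The argument can be checked mechanically on a toy model. The sketch below is not Mistral-7B and does not reproduce the experiment: it is a single causal self-attention layer with made-up deterministic embeddings, and it omits the instruction prompt. It only illustrates that, under a causal mask, the $A$-portion embeddings are identical for $[A,B^{+}]$ and $[A,B^{-}]$, whereas the $A$-portion of the second occurrence differs once the input is repeated.

```python
import math
from typing import List


def embed(token: int, pos: int, dim: int = 4) -> List[float]:
    """Made-up deterministic token + position embedding (illustrative only)."""
    return [math.sin((token + 1) * (d + 1)) + 0.1 * math.cos(pos * (d + 1)) for d in range(dim)]


def causal_attention(tokens: List[int]) -> List[List[float]]:
    """One causal self-attention layer: position i attends only to positions 0..i."""
    x = [embed(t, i) for i, t in enumerate(tokens)]
    out = []
    for i in range(len(x)):
        scores = [sum(a * b for a, b in zip(x[i], x[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * x[j][d] for j, w in enumerate(weights)) / norm
                    for d in range(len(x[i]))])
    return out


A, B_plus, B_minus = [1, 2, 3], [4, 5], [6, 7]
s_plus, s_minus = A + B_plus, A + B_minus

# Classical: embeddings of the A-portion cannot see B, so they coincide.
classical_plus = causal_attention(s_plus)[: len(A)]
classical_minus = causal_attention(s_minus)[: len(A)]

# Echo: the A-portion of the second occurrence can attend to B in the first.
start = len(s_plus)
echo_plus = causal_attention(s_plus + s_plus)[start : start + len(A)]
echo_minus = causal_attention(s_minus + s_minus)[start : start + len(A)]

classical_identical = classical_plus == classical_minus
echo_differs = echo_plus != echo_minus
```

Whether the extra information is used in a semantically meaningful way is an empirical question about the trained model, which is what the comparison of $\operatorname{Sim}(q,s^{+})$ and $\operatorname{Sim}(q,s^{-})$ measures.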

3.4 Classical vs. Echo on Synthetic Data

In Section 3.3 we demonstrated that echo embeddings encode bidirectional information. However, is this sufficient to recover from the failure mode of classical embeddings? Further, where will we expect echo embeddings to improve over classical embeddings? Here, we compare echo and classical embeddings on synthetic data to answer both of these questions.

Datasets.

We sample datasets according to two structures depending on whether the discriminating information between $s^{+}$ and $s^{-}$ is in the second half (structure S1) or first half (structure S2) of the sentence. Using the structures below, we generate samples using GPT-4, as in the previous section (full details in the appendix):

(S1) $q\colon[A,B]$ , $s^{+}\colon[A^{+},B^{+}]$ , $s^{-}\colon[A^{+},B^{-}]$

(S2) $q\colon[A,B]$ , $s^{+}\colon[A^{+},B^{+}]$ , $s^{-}\colon[A^{-},B^{+}]$ .

We measure the accuracy of identifying which of the two sentences $s^{+}$ and $s^{-}$ is closer to the query, as measured by the cosine similarity between the embeddings. We compare classical and echo embeddings when using mean pooling to aggregate embeddings extracted from the Mistral-7B model. We use mean token embeddings rather than last-token embeddings because last-token embeddings can be quite fragile in a zero-shot setting (Section 5.1).
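The accuracy metric described above can be written down directly. The sketch below assumes the pooled embeddings are already available as plain lists; the function names and the two toy triplets are illustrative and are not data from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets: Sequence[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of (q, s_plus, s_minus) triplets where s_plus is closer to q."""
    correct = sum(
        1 for q, s_plus, s_minus in triplets
        if cosine_similarity(q, s_plus) > cosine_similarity(q, s_minus)
    )
    return correct / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # s_plus is closer: counted as correct
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # s_minus is closer: counted as an error
]
accuracy = triplet_accuracy(toy_triplets)
```

Chance level for this metric is 0.5, since each triplet is a two-way choice.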

Results.

We present results on the two structures in Figure 2 (right). We see that classical embeddings struggle on Structure 1: when the distinguishing information is at the end, the embeddings of the early tokens exaggerate the similarity between $q$ and $s^{-}$ because they do not encode the information provided by $B^{-}$. In contrast, echo embeddings successfully determine the more similar sentence, presumably because embeddings of the $A$-portion now also encode information about the later parts (demonstrated in Section 3.3). To control for other reasons that echo embeddings might outperform classical embeddings, we also compare these embeddings on Structure 2, where early tokens provide discriminative signal without needing the later context. As expected, both classical and echo embeddings achieve good performance in this setting. All in all, this analysis on synthetic data demonstrates that zero-shot classical embeddings do not encode information about later context in early token embeddings, but echo embeddings can do so.

Does last-token pooling resolve the failure of classical embeddings?

The embedding of the last token $\phi_{-1}(x)$ can, in principle, encode information from the entire input. However, we posit that the last-token pooling strategy is highly brittle and can depend too strongly on the tokens near the end of the input. To verify this, we compare the accuracy of mean token pooling and last-token pooling for classical and echo embeddings in two settings. First, we evaluate on the original synthetic data of Structure 1 (Figure 3, left). Second, we evaluate on the same data, but where we append a uniformly randomly selected token to the end of each example (Figure 3, right). While last-token pooling has high accuracy on the original toy data (though still lower accuracy than echo embeddings), it fails to perform well on the noisy examples. Echo embeddings with mean token pooling, however, are robust to the noise.

While this particular distribution of noise is artificial, it highlights that last-token pooling can be sensitive to noise in the last token. We verify in Section 5.1 that last-token embeddings perform poorly on real data. Thus, even if last-token embeddings address the inability of mean token classical embeddings to encode information from tokens that appear later in the sequence, they are not practical due to their sensitivity to noise.

Does last-token pooling resolve the failure after finetuning?

In practice, it is common to finetune embeddings on a sentence similarity objective. It is hard to delineate the degree to which this failure mode remains after finetuning. Nonetheless, we demonstrate in Section 5.2 that our method improves in the finetuning setting, even when using last-token pooling.

4 Methodology

In Section 3, we explored how echo embeddings can improve over classical embeddings by addressing a fundamental failure mode. In this section, we describe the methodology by which we evaluate echo embeddings on large scale real datasets in both the zero-shot and finetuning settings. While finetuning is currently necessary to achieve state-of-the-art performance, zero-shot embeddings have the advantage that they do not require expensive finetuning on top of a pretrained language model. Zero-shot results can also more clearly show how different embedding strategies work on real datasets.

4.1 Constructing Zero-shot Embeddings.

We extract zero-shot embeddings via different strategies from three language models: Mistral-7B, LLaMA-2-7B, and LLaMA-2-13B. We select the instruction-finetuned model for each of them. Refer to Appendix A.2 for additional information on the base models. Recent literature suggests that the performance of language models on zero-shot tasks can be highly variable depending on the exact wording and template of the prompts (Sclar et al., 2023). Thus, for each of the embedding strategies we consider, we perform prompt randomization where we sample prompts by randomizing the exact wording, punctuation, and capitalization of the prompt. We describe the sampling process and the exact prompts that we use in Appendix C.
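To make the idea of prompt randomization concrete, the following sketch samples echo-style templates by varying wording, punctuation, and capitalization. The word lists and the template shape are hypothetical choices made for this illustration; the actual sampling process and prompts are the ones given in Appendix C.

```python
import random
from typing import List

# Hypothetical wording choices, for illustration only.
VERBS = ["Rewrite", "Repeat"]
NOUNS = ["sentence", "passage", "text"]
SEPARATORS = [":", " -", ""]


def sample_echo_template(rng: random.Random) -> str:
    """Sample one template with two {x} slots (first and second occurrence)."""
    verb = rng.choice(VERBS)
    if rng.random() < 0.5:  # randomize capitalization of the instruction
        verb = verb.lower()
    noun = rng.choice(NOUNS)
    sep = rng.choice(SEPARATORS)  # randomize punctuation
    return f"{verb} the {noun}{sep} {{x}} Again{sep} {{x}}"


def sample_templates(n: int, seed: int = 0) -> List[str]:
    rng = random.Random(seed)  # a fixed seed keeps the sampled prompts reproducible
    return [sample_echo_template(rng) for _ in range(n)]


templates = sample_templates(5, seed=0)
filled = templates[0].format(x="She loves summer")
```

Each sampled template would then be scored on a validation set, with the best-performing one kept for evaluation.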

Baselines.

We compare our proposed echo embeddings (Section 3.2) to classical embeddings and two additional baselines:

• Last-token embeddings: We mentioned in Section 3 that last-token embeddings tend to underperform in comparison to mean token embeddings, and thus we compare on real data.

• Summarization: We also compare zero-shot embeddings obtained via the strategy proposed by Jiang et al. (2023b). Here, they instruct the model to summarize the input in a single word and then take the last token embedding $\phi_{-1}(x)$ as the pooled embedding of the sentence.

4.2 Constructing Finetuned Embeddings.

We adopt the conventional sentence embedding training setup (Reimers and Gurevych, 2019) where we train with a contrastive learning objective to encourage the embeddings of similar text to be close. We extract embeddings in a slightly different fashion compared to the zero-shot setting above in order to keep the finetuning methodology as similar as possible to the existing literature.

Extracting embeddings.

The training and evaluation data is separated into two categories: symmetric data, in which sentences are drawn from a single distribution (such as for sentence similarity), and asymmetric, in which the data consists of both queries and documents (such as for retrieval). We adopt a separate prompt for symmetric inputs and queries, and for documents. We construct classical embeddings by encoding text $S$ using the following prompts:

Queries & Symm.	Documents
Instruct: {instruction} Query: $S$	Document: $S$

For echo embeddings, we use the following prompts, where $S$ represents the input and $S^{\prime}=S$:

Queries & Symm.	Documents
Instruct: {instruction} Query: $S$ Query again: $S^{\prime}$	Document: $S$ Document again: $S^{\prime}$

In this case, {instruction} refers to the task instruction, which specifies a description of the task that the embedding will be used for. We adopt the instructions from Wang et al. (2023), and provide a list of the instructions in Appendix D. We append an end-of-sentence token to the end of each input, and we allow the input embedding of this token to be trainable.
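The four prompt formats above can be expressed as two small formatting helpers. This is a sketch under assumptions: the function names are invented here, fields are joined with single spaces (the exact whitespace or line breaks between fields are not specified in this section), and the trainable end-of-sentence token is left out.

```python
def classical_prompt(text: str, instruction: str = "", is_document: bool = False) -> str:
    """Classical format: the text appears once."""
    if is_document:
        return f"Document: {text}"
    return f"Instruct: {instruction} Query: {text}"


def echo_prompt(text: str, instruction: str = "", is_document: bool = False) -> str:
    """Echo format: the text appears twice; embeddings are pooled from the second copy."""
    if is_document:
        return f"Document: {text} Document again: {text}"
    return f"Instruct: {instruction} Query: {text} Query again: {text}"


# Illustrative instruction text, not one of the instructions from Appendix D.
example_instruction = "Retrieve passages that answer the question"
query_prompt = echo_prompt("how do echo embeddings work", example_instruction)
doc_prompt = echo_prompt("Echo embeddings repeat the input.", is_document=True)
```

Symmetric inputs would use the query format for both sides, whereas asymmetric tasks pair the query format with the document format.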

Datasets.

We train on a collection of publicly available datasets that encompass both symmetric and asymmetric data that are standard training datasets in the embedding literature. We list and describe each of the datasets in Appendix D.

Optimization.

To finetune the model, we optimize the SimCSE loss with in-batch and mined hard negatives. Since this is standard, we defer discussion of this to Appendix D. Each batch is constructed by sampling a dataset from our set of training datasets, and then collecting examples from only this dataset. We use GradCache to train with a large batch size (2048) with limited GPU memory (Gao et al., 2021a). We train with LoRA instead of full finetuning, with $r=16$ and $\alpha=16$. We choose $\tau=1/50$ and a learning rate of $8\times 10^{-4}$. We use the Mistral-7B instruction-tuned model as a backbone (Jiang et al., 2023a). Our choices aim to be consistent with prior literature (Wang et al., 2023; Su et al., 2022; Zhang et al., 2023a).
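For readers unfamiliar with this family of objectives, the sketch below shows a generic contrastive (InfoNCE-style) loss with in-batch and hard negatives. It is not the authors' training code: it assumes $\tau$ plays the role of a softmax temperature on cosine similarities, it shares every hard negative across the whole batch, and it works on plain Python lists rather than tensors. The exact formulation used in the paper is the one in Appendix D.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(
    queries: Sequence[Vector],
    positives: Sequence[Vector],
    hard_negatives: Sequence[Vector],
    tau: float = 1 / 50,  # assumed to be the softmax temperature
) -> float:
    """Mean cross-entropy of picking positives[i] for queries[i] among all candidates."""
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, c) / tau for c in candidates]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]  # -log softmax at the true positive
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[-1.0, 0.0]]
aligned_loss = contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]], hard_negatives)
swapped_loss = contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]], hard_negatives)
```

With a small temperature such as $1/50$, similarity differences are scaled up by a factor of 50 before the softmax, so a mismatched positive is penalized heavily.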

4.3 Massive Text Embedding Benchmark

For evaluation, we use the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022). For this paper, we focus on the English-language subset of the benchmark. MTEB is composed of a collection of 56 datasets that are grouped into different embedding tasks: classification, clustering, pair classification, reranking, retrieval, sentence similarity (STS), and summarization. The goal is to construct general purpose embeddings that are useful for solving each of the tasks. More information about MTEB is specified in Appendix A.1.

For the fine-tuning setting, we evaluate on the entire English-language subset. In the zero-shot setting, for convenience, we only evaluate on a subset of MTEB. We describe this subset in Appendix D.

5 Experiments

Strategy	Model	Pool	Clas.	P. Cls.	Clus.	Retr.	STS	Rera.	Average
Main results:
Echo (ours)	Mistral 7B	Mean	64.06	75.26	27.02	23.61	72.40	60.00	55.07
Classical	Mistral 7B	Mean	58.21	73.87	23.85	20.35	56.97	54.44	45.88
Prior work:
Summarization	Mistral 7B	Last	66.01	81.82	26.48	19.13	70.13	66.24	54.96
Ablations:
Echo	Mistral 7B	Last	63.11	57.93	12.82	2.97	39.14	47.35	36.60
Classical	Mistral 7B	Last	58.23	46.64	13.51	2.60	33.97	46.51	32.52
Echo	LLaMA 7B	Mean	61.64	66.29	25.11	16.12	66.18	56.35	50.26
Classical	LLaMA 7B	Mean	56.61	68.46	23.22	18.63	56.49	53.26	44.65
Echo	LLaMA 13B	Mean	64.65	74.57	25.72	26.58	72.20	62.68	55.60
Classical	LLaMA 13B	Mean	58.50	65.06	24.22	18.92	57.47	56.38	45.15

Table 1: Zero-shot scores on MTEB tasks for Mistral-7B. We use a retrieval validation set (FiQA2018) to select the best prompt. Refer to Appendix C for the scores with alternative validation sets. Top: Comparison of echo embeddings to classical embeddings. Center: Summarization approach to constructing embeddings (Jiang et al., 2023b). Bottom: Ablations, including last-token pooling and LLaMA-2-{7B, 13B}.

5.1 Evaluation of Zero-shot Embeddings

We compare the performance of classical, echo, and summarization embeddings on MTEB tasks (Table 1). We use a retrieval dataset from MTEB (FiQA2018) as a validation set, as described in Section 4.3. We report the scores using alternative validation sets in Appendix C.

Echo embeddings outperform classical embeddings zero-shot.

We see that echo embeddings outperform classical embeddings by a large margin: on average, by more than 9 points for Mistral-7B (55.07 vs. 45.88 in Table 1). Further, this performance increase is consistent across every MTEB category, across models (LLaMA-2 vs Mistral), and across scale (7B vs 13B). This demonstrates that echo embeddings can significantly improve the performance of embeddings on real data, suggesting that the failure mode of classical embeddings that we describe in Section 3 can affect performance on real data.
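The size of the average gain for each model can be read off Table 1 with a short calculation. The numbers below are copied from the "Average" column of the mean-pooling rows; the dictionary layout is just a convenience for this illustration.

```python
# (echo average, classical average) with mean pooling, copied from Table 1.
zero_shot_average = {
    "Mistral 7B": (55.07, 45.88),
    "LLaMA 7B": (50.26, 44.65),
    "LLaMA 13B": (55.60, 45.15),
}

# Absolute gain of echo over classical embeddings, in MTEB points.
gains = {
    model: round(echo - classical, 2)
    for model, (echo, classical) in zero_shot_average.items()
}
```

This gives gains of 9.19, 5.61, and 10.45 points for Mistral 7B, LLaMA 7B, and LLaMA 13B respectively.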

Qualitative comparison of classical and echo embeddings.

In Section 3, we demonstrate that classical embeddings overestimate the similarity between examples which are superficially similar based on tokens that appear early in the sequence. To build intuition that this applies to realistic data, we present the sentence pair from STSBenchmark, a sentence similarity task from MTEB, in which echo embeddings reduce error the most:

	$x_{1}$: The best thing you can do is to know your stuff.
	$x_{2}$: The best thing to do is to overcome the fussiness.

which has a ground-truth similarity of $0$ (out of $5$). The sentence pair for which echo embeddings reduce error the least is:

	$y_{1}$: Sometime if you really want it you might need to pay an agency to get the place for you.
	$y_{2}$: You could probably get a tour agency to do it for you but it would cost you.

which has a ground-truth similarity of $2$ (out of $5$). We provide more examples in Appendix Table 7.

Notice that the sentence pair $(x_{1},x_{2})$, on which echo embeddings reduce error the most, has exactly the property we identify as a failure mode for classical embeddings: the sentences are superficially similar for the first few tokens. On the other hand, $(y_{1},y_{2})$ does not have this property.

Quantitative evaluation of the failure mode.

The above example builds intuition that, even on real data, classical embeddings fail to properly estimate similarity on examples which are superficially similar in the early tokens. We quantitatively measure the degree to which classical and echo embeddings fail on sentences which are similar for early tokens, and for sentences which are not. We find that classical embeddings systematically fail on examples which exhibit this structure, while echo embeddings do not. For convenience, we defer the discussion of these experiments and the results to Appendix C.1.

Last-token vs mean token pooling.

We find that last-token embeddings are substantially worse than mean token embeddings in the zero-shot setting, despite the fact that, in principle, the last token in the sequence can encode information from all other tokens. In practice, the last token does not encode sufficient information to achieve strong performance on MTEB in the zero-shot setting (Table 1).

Echo embeddings vs summarization.

We find that the average performance across the tested MTEB datasets is similar between echo and summarization embeddings. Summarization does encourage the last token in the sequence to encode information about the entire sentence. However, we find that summarization is much more sensitive to the exact prompt, while echo embeddings are robust to such minor variations (see Figure 5 in Appendix C). We suspect that echo embeddings are more robust as a result of more directly trying to encode bidirectional information into the embeddings.

5.2 Evaluation of Finetuned Embeddings

Strategy	Model	Pool	Clas.	Clus.	P. Cls.	Rera.	Retr.	STS	Average
Main results:
Echo (ours)	Mistral 7B	Last	77.43	46.32	87.34	58.14	55.52	82.56	64.68
Classical	Mistral 7B	Last	76.57	45.78	86.37	56.71	54.87	82.03	63.98
Prior work:
UAE-Large-V1 (MLM)			75.58	46.73	87.25	59.88	54.66	84.54	64.64
multilingual-e5-large (MLM)			77.56	47.10	86.19	58.58	52.47	84.78	64.41
bge-large-en-v1.5 (MLM)			75.97	46.08	87.12	60.03	54.29	83.11	64.23
udever-bloom-7b (autoregr.)			72.13	40.81	85.4	55.91	49.34	83.01	60.63
sgpt-5.8b (autoregr.)			68.13	40.34	82.00	56.56	50.25	78.10	58.93
e5-mistral-7b⁵ (autoregr.)			78.47	50.26	88.34	60.21	56.89	84.63	66.63
Ablations:
Echo	Mistral 7B	Mean	77.00	44.94	87.73	58.30	55.11	82.52	64.22
Classical	Mistral 7B	Mean	76.26	42.68	86.31	57.58	53.75	81.53	62.96
Classical	Mistral 7B-bidir.	Last	76.70	45.94	88.15	57.23	54.96	82.42	64.23

Table 2: Finetuning scores on MTEB tasks. Top: Apples-to-apples comparison of echo embeddings and classical embeddings in which we use echo embeddings and classical embeddings with last-token pooling, with the same training setup. Center: Performance of recent open source embedding models, annotated by base model type, masked-language model or autoregressive. Bottom: Ablations for finetuning: using mean token embeddings (first two lines) and using a bidirectional architecture (last line).

⁵ e5-mistral-7b was recently released and achieves strong performance by leveraging high quality synthetic data that is not publicly released. We report its performance, but we do not explicitly compare against it (Wang et al., 2023).

Different embeddings on the MTEB leaderboard are often fine-tuned on different datasets. In order to perform an apples-to-apples comparison between embedding strategies, we fine-tune both echo and classical embeddings on the exact same datasets (described in Section 4.2). We report the results in Table 2. This table also includes a comparison to prior state-of-the-art methods using masked language models (MLM) and autoregressive language models. Further, we evaluate a number of ablations to determine the role of pooling strategy and architecture.

Echo embeddings outperform classical embeddings after finetuning.

We observe that echo embeddings consistently outperform classical embeddings on each category even after finetuning. Hence, the fundamental gap we find between classical and echo embeddings in Section 3 and in our zero-shot experiments persists after fine-tuning.

Comparison to prior state-of-the-art models.

We present comparisons to both prior MLM-based embeddings and prior autoregressive-language-model embeddings, listing the open-source models from the MTEB leaderboard. It is striking that MLMs vastly outperformed autoregressive models until recently. Our classical embeddings outperform the previous-best autoregressive language model. This is a result of using the strongest public 7B parameter language model (Mistral) and more fine-tuning data. However, despite these choices, classical embeddings do not outperform prior MLM-based approaches, perhaps because MLMs encode bidirectional context, unlike classical embeddings from autoregressive models. Interestingly, echo embeddings allow us to close the gap and achieve state-of-the-art performance (on average) with an autoregressive model compared to prior open-source models on the leaderboard that used MLMs. A recent exception is the concurrent work by Wang et al. (2023), which uses synthetic data to improve classical embeddings extracted from Mistral-7B. Their synthetic data is not publicly available, but the apples-to-apples comparison between classical and echo embeddings that we performed suggests that echo embeddings could provide further gains over the numbers reported by Wang et al. (2023) when fine-tuning with synthetic data.

Why doesn’t last-token pooling close the gap?

Since classical last-token embeddings can attend to every other token, they do not necessarily suffer from the failure mode that we highlighted in Section 3. The last token does not reliably capture relevant information in a zero-shot setting, but this gap could have been bridged via fine-tuning. It is thus surprising that, even after finetuning last-token embeddings that could (in principle) encode any embedding function, echo embeddings outperform classical embeddings. We identify two hypotheses that may explain this performance gap: (1) While last-token embeddings can attend to every token, the intermediate representations of earlier tokens cannot. If last-token pooling derives information from the internal representations of earlier tokens by attending to these representations, last-token classical embeddings may still suffer from the failure mode of the earlier tokens. (2) If the post-finetuning performance benefits from the model initialization point, last-token classical embeddings may suffer: in Section 5.1 we show that last-token embeddings achieve poor zero-shot performance. We leave it to future work to explore these hypotheses. We do, however, observe that the gap between last- and mean token echo embeddings is smaller than the gap between last- and mean token classical embeddings, suggesting that echo embeddings can especially improve the quality of mean token embeddings.

Can we relax autoregressive language models to a bidirectional architecture and fine-tune?

To test the role of architecture, we finetune Mistral-7B with the same setup described in Section 4.2 but modify the architecture to remove the causal attention mask. While the initial weights are identical to Mistral-7B, this new model has bidirectional attention. We observe that the performance of bidirectional classical embeddings is better than that of our standard (causal) classical embeddings, but worse than that of echo embeddings. This suggests that the architecture alone is not sufficient to improve performance.
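To make the difference between the three settings concrete, the following is a minimal sketch of the attention patterns involved. It is an illustration only, not the implementation used for the experiments: the function names are hypothetical, and masks are represented as plain boolean lists where `mask[i][j]` is true when position `i` may attend to position `j`.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> List[List[bool]]:
    """Causal mask removed: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


n = 3  # hypothetical input length in tokens

# Classical embeddings, causal model: the first token cannot see later tokens.
classical = causal_mask(n)
print(classical[0])  # [True, False, False]

# Bidirectional variant: the first token sees the whole input.
bidirectional = bidirectional_mask(n)
print(bidirectional[0])  # [True, True, True]

# Echo embeddings, causal model: the input appears twice (length 2n), and
# every position in the second copy can see all of the first copy.
echo = causal_mask(2 * n)
second_copy_sees_first = all(all(echo[i][:n]) for i in range(n, 2 * n))
print(second_copy_sees_first)  # True
```

The sketch only shows which positions are visible to which; it says nothing about how well a model pretrained with a causal mask uses newly visible positions once the mask is removed.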

6 Related Work

Sentence embeddings.

Dense low-dimensional vectors representing textual semantics have been widely studied and applied. Early approaches involved computing embeddings for individual words (Hinton, 1984; Rumelhart et al., 1986; Elman, 1990; Mikolov et al., 2013; Pennington et al., 2014). Later work aims to compute dense vectors representing the semantics of entire sequences by combining or composing word vectors (Le and Mikolov, 2014; Iyyer et al., 2015; Kiros et al., 2015; Socher et al., 2011; Tai et al., 2015; Wang et al., 2016; Wieting et al., 2015). Khattab and Zaharia (2020) propose to use late interaction between document and query vectors to improve retrieval performance. Reimers and Gurevych (2019) propose S-BERT, which takes a pretrained BERT (Devlin et al., 2018) and trains with a triplet loss on anchor sentences, semantically similar positive examples, and semantically dissimilar negative examples. More recent work typically adopts this approach with different pretrained models and a contrastive objective such as InfoNCE (Oord et al., 2018) or SimCSE (Gao et al., 2021b). Ni et al. (2021a) and Ni et al. (2021b) extend this approach to the T5 architecture (Raffel et al., 2020). Multiple papers use an additional unsupervised contrastive objective (Izacard et al., 2021; Wang et al., 2022). Other papers propose including prompts to improve task-specific embedding performance (Jiang et al., 2022; Su et al., 2022). Some work combines several of these training objectives and approaches (Xiao et al., 2023a; Li et al., 2023). Notably, except for the most recent approaches, nearly all embeddings were based upon bidirectional architectures that were often pretrained with a masked-language modeling objective.

Next-token language modeling for embeddings.

A series of papers aim to construct high-quality embeddings from autoregressive large language models. Multiple papers apply the fine-tuning approach of S-BERT to language models but use a trained GPT (Radford et al., 2018) as the backbone architecture (Muennighoff, 2022; Zhang et al., 2023a). Ma et al. (2023) adopt this approach but for LLaMA-2 (Touvron et al., 2023). Jiang et al. (2023b) extract embeddings by asking a language model to summarize the input sentence. Wang et al. (2023), concurrent to our work, improve embeddings by adding synthetic training data and training on Mistral (Jiang et al., 2023a).

Zero-shot embeddings.

Most recent sentence embedding research has focused on improving finetuning. Reimers and Gurevych (2019) demonstrate that, without finetuning, BERT produces low-quality embeddings. To our knowledge, Jiang et al. (2023b) is the only paper that constructs zero-shot embeddings for autoregressive language models.

7 Conclusion

We have compared classical and echo embeddings in a toy example, on real data in the zero-shot setting, and after finetuning. With the toy data, we identified a failure mode of autoregressive classical embeddings, which we have shown can be recovered with echo embeddings. Our result motivates the development of higher quality embeddings which are important in retrieval applications.

In addition, until recently, masked language models largely dominated the MTEB leaderboard, despite often having an order of magnitude fewer parameters, having been trained on substantially less data, and performing worse on other benchmarks of interest to the natural language processing community. While our results do not explicitly explain the surprising success of masked language models, they do suggest that next-token language models suffer from an inherent drawback that may have stifled their performance until they became performant enough to compensate for this shortcoming. We believe that our embedding strategy achieves the best of both worlds: we gain the capability of next-token language models while recovering from the failure mode that next-token language models do not encode information about future tokens in their contextualized token embeddings.

8 Limitations

Despite the success of echo embeddings, the method has limitations. First, while echo embeddings achieve superior performance to classical embeddings, they require double the inference cost to pass two copies of the input sequence to the model. Though this is double the training cost for a fixed number of training steps, we show in Appendix D that echo embeddings achieve improved performance even when matching compute. Second, we do not fully explain why echo embeddings are improved in comparison to classical embeddings after finetuning even though there is no representational limitation. We leave it to future work to understand the exact underlying mechanisms for this improvement.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE2140739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This research was supported by the Center for AI Safety Compute Cluster. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

This work was supported in part by the AI2050 program at Schmidt Sciences (Grant #G2264481).

We gratefully acknowledge the support of Apple.

References

Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. Ms marco: A human generated machine reading comprehension dataset.
DataCanary et al. (2017) DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. 2017. Quora question pairs.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.
Fan et al. (2019) Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. Eli5: Long form question answering.
Gao et al. (2021a) Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021a. Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983.
Gao et al. (2021b) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021b. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
Hinton (1984) Geoffrey E Hinton. 1984. Distributed representations.
Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pages 1681–1691.
Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023a. Mistral 7b. arXiv preprint arXiv:2310.06825.
Jiang et al. (2023b) Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, and Fuzhen Zhuang. 2023b. Scaling sentence embeddings with large language models. arXiv preprint arXiv:2307.16645.
Jiang et al. (2022) Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. 2022. Promptbert: Improving bert sentence embeddings with prompts. arXiv preprint arXiv:2201.04337.
Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547.
Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. 2020. Dense passage retrieval for open-domain question answering.
Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48.
Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. Advances in neural information processing systems, 28.
Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.
Li and Li (2023) Xianming Li and Jing Li. 2023. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871.
Li et al. (2023) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.
Ma et al. (2023) Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023. Fine-tuning llama for multi-stage text retrieval. arXiv preprint arXiv:2310.08319.
Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Muennighoff (2022) Niklas Muennighoff. 2022. Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
Muennighoff et al. (2022) Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. Mteb: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.
Ni et al. (2021a) Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. 2021a. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.
Ni et al. (2021b) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021b. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
Qiu et al. (2022) Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. Dureader-retrieval: A large-scale chinese benchmark for passage retrieval from web search engine.
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. nature, 323(6088):533–536.
Sclar et al. (2023) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.
Socher et al. (2011) Richard Socher, Eric Huang, Jeffrey Pennington, Christopher D Manning, and Andrew Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in neural information processing systems, 24.
Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
Su et al. (2022) Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. 2022. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741.
Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. Fever: a large-scale dataset for fact extraction and verification.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
Vanderkam et al. (2013) Dan Vanderkam, Rob Schonberger, Henry Rowley, and Sanjiv Kumar. 2013. Nearest neighbor search in google correlate. Technical report, Google.
Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
Wang et al. (2023) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368.
Wang et al. (2024) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.
Wang et al. (2016) Yashen Wang, He-Yan Huang, Chong Feng, Qiang Zhou, Jiahui Gu, and Xiong Gao. 2016. Cse: Conceptual sentence embeddings based on attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 505–515.
Wieting et al. (2015) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.
Xiao et al. (2023a) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023a. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597.
Xiao et al. (2023b) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023b. C-pack: Packaged resources to advance general chinese embedding.
Xie et al. (2023) Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, Haitao Li, Yiqun Liu, and Jin Ma. 2023. T2ranking: A large-scale chinese benchmark for passage ranking.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering.
Zhang et al. (2023a) Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min Zhang. 2023a. Language models are universal embedders. arXiv preprint arXiv:2310.08232.
Zhang et al. (2021) Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. tydi: A multi-lingual benchmark for dense retrieval.
Zhang et al. (2023b) Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023b. MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics, 11:1114–1131.

Appendix A General Information for Reproducibility

In this section, we include information that might aid reproducibility and that is not specific to any particular setting in the paper.

A.1 Massive Text Embedding Benchmark

The Massive Text Embedding Benchmark (MTEB) is a collection of datasets from seven categories: classification, clustering, pair classification, reranking, retrieval, sentence similarity (STS), and summarization. The leaderboard is published at https://huggingface.co/spaces/mteb/leaderboard. The list of datasets and their descriptions can be found in Appendix A of Muennighoff et al. (2022).

A.2 Base Model HuggingFace IDs

In this paper, we use the following models:

• Mistral 7B instruction-tuned: mistralai/Mistral-7B-Instruct-v0.1
• LLaMA-2 7B instruction-tuned: meta-llama/Llama-2-7b-chat-hf
• LLaMA-2 13B instruction-tuned: meta-llama/Llama-2-13b-chat-hf

Appendix B Echo Embeddings: Additional Information

In this section, we describe the additional details that were omitted from Sections 2 and 3.

Cosine Similarity.

As discussed in Section 2, we often use the cosine similarity to measure the similarity in embeddings. Recall that given two sentences $x$ and $y$ , we wish to determine the degree to which they are semantically similar. Cosine similarity,

$$\operatorname{Sim}(x,y)\coloneqq\frac{\left\langle\phi(x),\phi(y)\right\rangle}{\|\phi(x)\|\,\|\phi(y)\|}, \tag{1}$$

measures the similarity between the embeddings of $x$ and $y$ for any embedding function $\phi\colon\mathcal{X}\to\mathbb{R}^{d}$ . The cosine similarity is used for our experiments in Section 3 and as the similarity function for training in Section 5. All MTEB datasets use cosine similarity to compute similarity, with the exception of the classification datasets, in which similarity is not explicitly measured, and the clustering datasets, which use the Euclidean distance,

$$\operatorname{Sim}(x,y)\coloneqq\|\phi(x)-\phi(y)\|, \tag{2}$$

as a metric.
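As an illustration, Equations (1) and (2) can be computed as in the following minimal sketch. It uses only the Python standard library, and the example vectors are hypothetical 3-dimensional stand-ins for $\phi(x)$ and $\phi(y)$ ; it is not the evaluation code used for MTEB.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Equation (1): inner product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Equation (2): norm of the difference between the two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical embeddings phi(x) and phi(y), chosen only for illustration.
phi_x = [1.0, 0.0, 1.0]
phi_y = [1.0, 1.0, 0.0]

print(cosine_similarity(phi_x, phi_y))   # approximately 0.5
print(euclidean_distance(phi_x, phi_y))  # approximately 1.4142 (sqrt(2))
```

Note that the two quantities have opposite orientation: a larger cosine similarity indicates more similar embeddings, whereas a smaller Euclidean distance does.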

Prompts for Section 3.

For these experiments, we only evaluate with a single prompting strategy. For classical embeddings, we encode a sentence $S$ using the prompt:

$$x=\text{Write a sentence: }S$$

We take the pooled embedding to be the mean token embedding $\phi_{S}(x)$ . For echo embeddings, we encode a sentence $S$ using the prompt:

$$x=\begin{array}{l}\text{Rewrite the following sentence: }S\\ \text{The rewritten sentence: }S^{\prime}\end{array}$$

where $S^{\prime}=S$ and we let our pooled embedding be the mean token embedding $\phi_{S^{\prime}}(x)$ . We do not evaluate with the last-token pooling strategy in this section.
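The two prompts and the pooling step can be sketched as follows. This is a simplified illustration under loud assumptions: whitespace splitting stands in for the model's tokenizer, the hidden states are made-up vectors rather than model outputs, and all function names are hypothetical. The point is only that pooling is restricted to the positions of $S$ (classical) or of the second occurrence $S^{\prime}$ (echo).

```python
from typing import List, Tuple


def classical_prompt(sentence: str) -> str:
    return f"Write a sentence: {sentence}"


def echo_prompt(sentence: str) -> str:
    # S' = S: the same sentence is written a second time.
    return (
        f"Rewrite the following sentence: {sentence}\n"
        f"The rewritten sentence: {sentence}"
    )


def pooled_span(prompt: str, sentence: str) -> Tuple[int, int]:
    """Token span [start, end) of the last occurrence of the sentence.

    Assumption: whitespace tokens stand in for real tokenizer output, and
    the sentence is the final part of the prompt (true for both templates).
    """
    n_prompt = len(prompt.split())
    n_sentence = len(sentence.split())
    return n_prompt - n_sentence, n_prompt


def mean_pool(hidden: List[List[float]], span: Tuple[int, int]) -> List[float]:
    """Average the per-token vectors inside the span, dimension by dimension."""
    start, end = span
    rows = hidden[start:end]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


sentence = "the cat sat"
prompt = echo_prompt(sentence)
span = pooled_span(prompt, sentence)

# Made-up 2-dimensional hidden states, one per whitespace token.
hidden = [[float(i), 1.0] for i in range(len(prompt.split()))]

print(span)                     # (10, 13)
print(mean_pool(hidden, span))  # [11.0, 1.0]
```

With a real tokenizer, the span would instead be located from token offsets, since prompt text and sentence text do not generally tokenize independently.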

General Prompting Guidelines.

Throughout the paper, we use a variety of different prompts to construct embeddings. In Appendix C, we demonstrate that for zero-shot embeddings, the exact wording or template used as a prompting strategy does not have a strong effect on the performance on MTEB tasks, with the exception of the summarization approach. This suggests that, in general, classical embeddings and echo embeddings are robust to the exact choice of prompt. The important component of echo embeddings is instead the structure: the input text should appear twice when computing embeddings, and the embeddings should be taken over the second occurrence of the input text.

Example classical embedding structures:

$\displaystyle\text{Say the sentence: }S$

$\displaystyle\text{Write the phrase: }S$

$\displaystyle\text{Complete the query: }S$

$\displaystyle\text{Explain the text: }S$

Example echo embedding structures:

$\displaystyle\text{Repeat the sentence: }S$ $\displaystyle\text{The sentence again: }S^{\prime}$

$\displaystyle\text{Rephrase the query: }S$ $\displaystyle\text{The query rephrased: }S^{\prime}$

$\displaystyle\text{Fill in the blank: }S$ $\displaystyle\text{The blanks filled in: }S^{\prime}$

$\displaystyle\text{Rewrite the text: }S$ $\displaystyle\text{The sentence rewritten: }S^{\prime}$

Toy data.

We provide a subset of the toy data from Section 2. For Structure 1, the data is given in Table 4. For Structure 2, the data is given in Table 5. For Structure 3, the data is given in Table 6. In all cases, the data is generated by GPT4. The data from Structure 1 is generated from the following GPT4 prompt, and the other structures are generated from minor variations on this:

Together, we need to generate sentence triplets. Each triplet will have the following form:

- sentence 1 can be anything, be creative here.

- sentence 2 must represent something opposite to sentence 1, however, it is important that the first half of the sentence is exactly the same as the first half of sentence 2. The only difference in wording can be in the second half of the sentence.

- sentence 3 should be extremely similar to sentence 1 and semantically equivalent, but slightly re-worded

Here is an example:

{

"sentence1": "I like to eat apples and bananas but I really hate almost every other fruit.",

"sentence2": "I like to eat apples and bananas and I also enjoy also every other fruit",

"sentence3": "I like two fruits: apples and bananas but I hate nearly all fruits other than these.",

}

The first half of the sentence should be relatively short, less than 10 words, but the second half should be long, at least 10 words. Give more examples, and write them in json format. Be creative!

Appendix C Additional Zero-shot Results

In this section, we describe the omitted methodology and results for the zero-shot section.

Prompt sampling procedure.

Here we describe the prompt sampling procedure and then provide the prompts that we use for the zero-shot experiments:

1. Choose an instruction. For classical embeddings, we choose from {Write, Say, Complete, Explain}. For echo embeddings, we choose from {Repeat, Rewrite, Rephrase, Fill in the blank}. For summarization, we choose from {Summarize, Categorize, Understand, Analyze}.

2. Choose a wording for the instruction. For example, if we chose “Say” as the instruction, then we would choose from {Say a sentence, Say a paragraph, Say something, Say a response, Say a query, Say a prompt}. For summarization, we also choose a second part of the wording, as the summarization strategy requires that the summary be in one word: {in one word, with a single word, succinctly with one word, in a unique one-word way, in a single word, in a word}.

3. Choose a separator; options include colons, commas, and newlines.

4. Choose a prefix, which includes markers to indicate the first and second appearance of the input.

5. Classical prompts have the form: “{instruction} {separator} {prefix} $S$ ”.

6. Echo prompts have the form: “{instruction} {separator} {prefix0} $S$ {separator} {prefix1} $S^{\prime}$ ”.

7. Summarization prompts have the form: “{instruction0} {separator} {prefix} $S$ {instruction1} {separator}”.
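The three prompt forms above can be sketched as simple string templates. This is an illustrative sketch rather than the sampling code used for the experiments: the function names are hypothetical, the pieces are joined with single spaces exactly as the forms are written (the actual whitespace and capitalization handling is an assumption), and the separator and prefix pools below are small illustrative subsets.

```python
import random


def fill_classical(instruction: str, separator: str, prefix: str, s: str) -> str:
    # Form: {instruction} {separator} {prefix} S
    return f"{instruction} {separator} {prefix} {s}"


def fill_echo(instruction: str, separator: str,
              prefix0: str, prefix1: str, s: str) -> str:
    # Form: {instruction} {separator} {prefix0} S {separator} {prefix1} S'
    # with S' = S, so the input text appears twice.
    return f"{instruction} {separator} {prefix0} {s} {separator} {prefix1} {s}"


def fill_summarization(instruction0: str, instruction1: str,
                       separator: str, prefix: str, s: str) -> str:
    # Form: {instruction0} {separator} {prefix} S {instruction1} {separator}
    return f"{instruction0} {separator} {prefix} {s} {instruction1} {separator}"


# Wordings for the "Say" instruction, as listed in step 2.
SAY_WORDINGS = ["Say a sentence", "Say a paragraph", "Say something",
                "Say a response", "Say a query", "Say a prompt"]
SEPARATORS = [":", ",", "\n"]           # illustrative subset
PREFIXES = ["Text", "QUERY", "Prompt"]  # hypothetical markers

rng = random.Random(0)  # fixed seed so the sampled prompt is reproducible
sampled = fill_classical(
    rng.choice(SAY_WORDINGS), rng.choice(SEPARATORS), rng.choice(PREFIXES), "S"
)
print(repr(sampled))

print(fill_echo("Repeat the sentence", ":", "1)", "2)", "Hello world"))
```

Keeping each form as a separate function makes the structural difference explicit: only the echo form contains the input twice.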

For classical, we choose the prompts:

Write a sentence I] S

Write a prompt!

(I) S

Write some text

PROMPT-S

SAY A PARAGRAPH | SENTENCE 0] S

Say a query QUERY: S

Say a sentence!

[A] S

COMPLETE THE PROMPT Text (1) S

Complete the query SENTENCE 0) S

Complete the sentence:-S

Explain a query text 0 S

Explain a prompt | Sentence 1> S

EXPLAIN A SENTENCE Prompt (1) S

For echo, we choose the prompts:

Repeat The Paragraph.

query 1) S.

query 2) S’

Repeat the response.

1) S.

AGAIN 2) S’

REPEAT THE SENTENCE :: PROMPT

S :: RESPONSE

S’

Rewrite the query | QUERY (A) S | (B) S’

Rewrite the text. SENTENCE A) S. B) S’

Rewrite the response | query A] S | query B] S’

Rephrase the sentence:@S:Again@S’

Rephrase The Sentence!

Text <> S!

Answer <> S’

REPHRASE THE QUERY Sentence a) S Answer b) S’

Fill in the blank in the prompt:

Query a) S:

Query b) S’

FILL IN THE BLANK IN THE RESPONSE | Sentence A) S | Sentence B) S’

Fill in the blank in the paragraph.

Text | S.

Response | S’

For summarization, we use the prompts:

SUMMARIZE THE QUERY.

Prompt: S’IN A WORD.

}

Summarize the sentence!

PROMPT <1> S’Succinctly With One Word!

}

SUMMARIZE THE PARAGRAPH. PROMPT (0) S’IN A WORD.

CATEGORIZE THE PROMPT query

S’With a single word

Categorize the query | prompt [1] S’in a word |

CATEGORIZE THE SENTENCE.

Prompt <1> S’IN A WORD.

}

Understand the sentence

@S’In a single word

Understand The Prompt:QUERY [0] S’in a single word:}

UNDERSTAND THE PARAGRAPH:Text I] S’Succinctly with one word:}

Analyze the sentence.

Sentence S’In A Unique One-word Way.

}

Analyze the response! query a> S’IN A UNIQUE ONE-WORD WAY!

Analyze The Prompt

Sentence a> S’In a unique one-word way

Subset of MTEB for zero-shot evaluation.

We evaluate on the following subset of MTEB: FiQA2018, SCIDOCS, SciFact, NFCorpus, TwitterSemEval2015, TwitterURLCorpus, ImdbClassification, AmazonReviewsClassification, TweetSentimentExtractionClassification, MTOPDomainClassification, TwentyNewsgroupsClustering, BiorxivClusteringS2S, MedrxivClusteringS2S, StackOverflowDupQuestions, AskUbuntuDupQuestions, SciDocsRR, BIOSSES, STS12, STS13, STS14, STS15, STS16, STS17, STS22, STSBenchmark, and SICK-R.

Measuring the sensitivity of different embedding strategies to prompting.

We plot the sensitivity of classical, repetition, and summarization to different choices of prompts for different models in Figures 5, 6, and 7. We also extend to plotting on all tested datasets individually in Figures 8, 9, and 10. We observe that summarization is highly sensitive to the exact prompt used. However, neither classical nor echo were particularly sensitive in any case. Consistently, mean token pooling outperformed last token pooling by a large factor.

Evaluation of zero-shot results with different validation sets.

We include the zero-shot results obtained when validating on different MTEB datasets. For validation, we select one dataset from each category, as follows: Classification: ImdbClassification; Pair Classification: TwitterSemEval2015; Clustering: TwentyNewsgroupsClustering; Retrieval: FiQA2018; STS: STSBenchmark; Reranking: StackOverflowDupQuestions. We report these results for different models in Tables 8, 9, and 10. We observe similar results across different validation sets, with minor variations in performance. In addition, we report the performance on each dataset when the prompts have been validated with FiQA2018 in Tables 11, 12, and 13.

C.1 Validating the connection between our synthetic data experiments and real data.

In Section 3, we hypothesized that classical embeddings would overestimate similarity on sentences whose first halves are similar, and that echo embeddings would recover from this failure mode. In order to test this hypothesis, we extract a set of examples from the STS datasets included in the MTEB benchmark in which the first half of the sentence is similar, and measure the degree to which the similarity is overestimated.

As a control, we also select pairs which are similar in the second half of the sentence and measure the degree to which their similarity is overestimated. Comparing the overestimation for pairs that are similar in the first half against the overestimation for pairs that are similar in the second half lets us identify whether classical embeddings overestimate similarity specifically for sentences which are similar in the first half. Under our hypothesis, we expect that, for classical embeddings, sentences which are similar in the first half are overestimated in similarity more than sentences which are similar in the second half. On the other hand, we expect that, for echo embeddings, the degree to which similarity is over- or underestimated is independent of whether the sentences are similar in the first or second half.

Identifying examples based on similarity in the first/second part of the sentence.

We aim to determine which sentences are most similar in the first half of the sentence or in the second half of the sentence. For each sentence pair $x,y$ , we split the sentences in half by number of words, yielding $x=[x_{1},x_{2}]$ , and $y=[y_{1},y_{2}]$ . We select sentences which are most similar in the first half by using the off-the-shelf masked-language-model-based embedding model bge-base-en-v1.5 (Xiao et al., 2023b). To select sentences that are similar in the first half, we measure the cosine similarity $\operatorname{Sim}(x_{1},y_{1})$ and take the top 10% of sentence pairs $x,y$ which have the highest cosine similarity. Similarly, to select sentences which are similar in the second half, we collect the top 10% of examples by $\operatorname{Sim}(x_{2},y_{2})$ . We collect examples from each of the STS datasets in MTEB.
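The selection procedure can be sketched as follows. This is a minimal illustration, not the code used for the experiments. In particular, the similarity function here is a word-overlap (Jaccard) score used purely as a stand-in so that the example runs without a model, whereas the procedure described above uses cosine similarity of bge-base-en-v1.5 embeddings. The function names, the handling of odd-length sentences, and the example pairs are all assumptions.

```python
from typing import Callable, List, Tuple


def split_half(sentence: str) -> Tuple[str, str]:
    """Split by number of words; with an odd count, the extra word goes to
    the second half (an assumption, as the tie rule is not specified)."""
    words = sentence.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: word overlap. NOT the embedding-based similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def top_fraction_by_half(
    pairs: List[Tuple[str, str]],
    half: int,  # 0 selects by first halves, 1 by second halves
    sim: Callable[[str, str], float] = jaccard,
    fraction: float = 0.1,
) -> List[int]:
    """Indices of the top `fraction` of pairs by similarity of the chosen half."""
    scored = [
        (sim(split_half(x)[half], split_half(y)[half]), i)
        for i, (x, y) in enumerate(pairs)
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first
    k = max(1, int(len(pairs) * fraction))
    return [i for _, i in scored[:k]]


# Ten hypothetical pairs; only pair 3 shares its first half.
pairs = [(f"a{i} b{i} c{i} d{i}", f"e{i} f{i} g{i} h{i}") for i in range(10)]
pairs[3] = ("red fox runs fast", "red fox sleeps late")

print(split_half("red fox runs fast"))       # ('red fox', 'runs fast')
print(top_fraction_by_half(pairs, half=0))   # [3]
```

Passing a different `sim` callable (for example, one that embeds both halves and returns their cosine similarity) would bring the sketch closer to the described setup.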

Measuring sentence similarity estimation error.

We must determine the degree to which classical and echo embeddings overestimate similarity. The STS datasets contain sentence pairs which are ranked by similarity: the sentences which are most similar have the highest ground-truth ranking, and the least similar sentences have the lowest. We denote the ranking of sentence pair $i$ as $r_{i}$ . We compute an estimated ranking $\{\hat{r}_{i}\}$ by ranking sentence pairs by the cosine similarity between their embeddings. We quantify the error in our estimated ranking by taking the rank difference $\operatorname{Err}_{i}=\hat{r}_{i}-r_{i}$ . When $\operatorname{Err}_{i}>0$ , we say that the $i$ th sentence pair is overestimated in similarity; when $\operatorname{Err}_{i}<0$ , we say that it is underestimated.
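The rank difference can be computed as in the following sketch. It is an illustration under stated assumptions: ranks start at 1 for the least similar pair, ties are broken by index (the tie rule is not specified above), and the scores are made-up values rather than data from an STS dataset.

```python
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank 1 for the lowest score, rank n for the highest.

    Ties are broken by index, which is an assumption of this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: (scores[i], i))
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def rank_errors(gold: Sequence[float], predicted: Sequence[float]) -> List[int]:
    """Err_i = r_hat_i - r_i, with r from gold scores and r_hat from
    predicted cosine similarities."""
    r = ranks(gold)
    r_hat = ranks(predicted)
    return [rh - rg for rh, rg in zip(r_hat, r)]


# Hypothetical ground-truth similarity scores and embedding similarities.
gold = [5.0, 1.0, 3.0]
predicted = [0.2, 0.9, 0.5]

print(ranks(gold))                   # [3, 1, 2]
print(ranks(predicted))              # [1, 3, 2]
print(rank_errors(gold, predicted))  # [-2, 2, 0]
```

In this example, pair 1 has $\operatorname{Err}_{1}=2>0$ and is therefore overestimated in similarity, pair 0 is underestimated, and pair 2 is ranked correctly.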

Results.

We plot the distribution over rank differences for sentences which are similar in the first half and sentences which are similar in the second half, for echo and classical embeddings, from all STS datasets (Figure 4). We also highlight the means of the distributions. In accordance with our hypothesis, we observe that for classical embeddings, sentences which are similar in the first half are generally overestimated in similarity more than sentences which are similar in the second half, suggesting that classical embeddings fail particularly on sentences that are similar in early tokens. Further, we generally observe no difference between the estimation error distributions for echo embeddings, which indicates that echo embeddings recover from this particular failure mode.

There are some notable counterexamples: BIOSSES does not exhibit this trend, but has few examples and thus the results may arise from noise alone. Further, STS22 exhibits identical distributions in estimation error between sentences which are similar in the first half and sentences which are similar in the second half, for both classical and echo embeddings. It is unclear why this trend fails to hold for STS22. Nonetheless, the trend holds for every other dataset, suggesting that the conceptual failure of classical embeddings that we identified in Section 3 generalizes to real data.

Qualitative examples.

In addition, we provide qualitative examples of sentence pairs from STSBenchmark where echo embeddings reduce error most, and where echo embeddings reduce error least, in comparison to classical embeddings. More precisely, we show the top and bottom 7 examples ranked by $|\operatorname{Err}^{\text{classical}}_{i}|-|\operatorname{Err}^{\text{echo}}_{i}|$, where $\operatorname{Err}^{\text{classical}}_{i}$ is the rank difference of the $i$th example for classical embeddings, and $\operatorname{Err}^{\text{echo}}_{i}$ is the corresponding rank difference for echo embeddings (Table 7).
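The ranking criterion above can be sketched in a few lines. This is an illustrative sketch with hypothetical function names; it assumes the rank differences for both embedding types have already been computed for the same list of sentence pairs.

```python
from typing import List, Sequence, Tuple


def improvement(err_classical: Sequence[int], err_echo: Sequence[int]) -> List[int]:
    """|Err_classical| - |Err_echo| per pair; larger means echo reduces error more."""
    return [abs(c) - abs(e) for c, e in zip(err_classical, err_echo)]


def most_and_least_improved(
    err_classical: Sequence[int], err_echo: Sequence[int], k: int = 7
) -> Tuple[List[int], List[int]]:
    """Return (indices of the k most improved pairs, indices of the k least improved pairs)."""
    delta = improvement(err_classical, err_echo)
    order = sorted(range(len(delta)), key=lambda i: delta[i], reverse=True)
    # Reverse the tail so the least improved pair comes first.
    return order[:k], order[-k:][::-1]


# Toy rank differences for four sentence pairs (made-up values).
best, worst = most_and_least_improved([5, -1, 0, 3], [1, -4, 0, 3], k=1)
```

In this toy example, pair 0 is the most improved (error magnitude drops from 5 to 1) and pair 1 is the least improved (error magnitude grows from 1 to 4).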

Appendix D Additional Finetuning Results

In this section, we provide details omitted from the finetuning results of the main paper.

Training Datasets.

We follow the setup of Wang et al. (2023), and use the following datasets: ELI5 (sample ratio 0.1) (Fan et al., 2019), HotpotQA (Yang et al., 2018), FEVER (Thorne et al., 2018), MIRACL (Zhang et al., 2023b), MS-MARCO passage ranking (sample ratio 0.5) and document ranking (sample ratio 0.2) (Bajaj et al., 2018), NQ (Karpukhin et al., 2020), NLI (Gao et al., 2021b), SQuAD (Karpukhin et al., 2020), TriviaQA (Karpukhin et al., 2020), Quora Duplicate Questions (sample ratio 0.1) (DataCanary et al., 2017), MrTyDi (Zhang et al., 2021), DuReader (Qiu et al., 2022), and T2Ranking (sample ratio 0.5) (Xie et al., 2023). We use approximately 1.5M training examples.

GPUs.

Training a model takes approximately two days on 4 A100 GPUs.

Instructions for finetuning datasets.

We also follow the setup of Wang et al. (2023), and use the instructions in Table 3. For evaluation, we use the instructions found in Table 14.

Models on MTEB leaderboard.

We compare our implementation of classical and echo embeddings to state-of-the-art approaches on MTEB. Namely, we display results for UAE-Large-V1 (Li and Li, 2023), multilingual-e5-large (Wang et al., 2024), bge-large-en-v1.5 (Xiao et al., 2023b), udever-bloom-7b (Zhang et al., 2023a), sgpt-5.8b (Muennighoff, 2022), and e5-mistral-7b (concurrent work) (Wang et al., 2023).

Additional ablations.

We plot additional ablations, including ablating the role of instructions during training and evaluation, as well as providing an evaluation at step 280 (out of 720 total steps), which is approximately $1/3$ of the duration of training (Table 15). We note that echo embeddings still outperform classical embeddings in this setting.

Performance over training time.

We plot the performance over the duration of training for a subset of MTEB tasks in Figure 11. Surprisingly, task performance decreases over training for many tasks.

Computational benefits of echo embeddings.

From Table 15, we observe that even after approximately $1/3$ of the total training duration (less than $1/2$), echo embeddings achieve higher performance than classical embeddings achieve after an entire epoch (Table 2). Echo embeddings require twice the computational cost of classical embeddings. However, this result suggests that despite this additional cost per embedding, training with echo embeddings can save on training costs by requiring less than half an epoch of training to outperform classical embeddings. Further, since each data point is only seen once, it implies that echo embeddings are much more data efficient than classical embeddings, which may be helpful when data is costly or difficult to acquire.
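The cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that an echo training step costs exactly twice a classical step and that the 720 steps reported above correspond to the full classical training run; both are simplifying assumptions, and the function name is hypothetical.

```python
def relative_training_cost(steps_used: int, total_steps: int, cost_multiplier: float = 2.0) -> float:
    """Cost of a partial run relative to a full classical run.

    cost_multiplier is the assumed per-step cost of echo relative to classical.
    """
    return cost_multiplier * steps_used / total_steps


# Echo embeddings evaluated at step 280 of 720, at an assumed 2x cost per step.
echo_cost = relative_training_cost(steps_used=280, total_steps=720)
```

Under these assumptions the echo run at step 280 costs roughly 0.78 of a full classical run, which is below the break-even point of 1.0 that would be reached at half of training.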

All results.

We plot the results for every MTEB dataset for echo embeddings, for classical embeddings, and for bidirectional embeddings in Table 16.

NLI	Given a premise, retrieve a hypothesis that is entailed by the premise
NLI	Retrieve semantically similar text
DuReader	Given a Chinese search query, retrieve web passages that answer the question
ELI5	Provided a user question, retrieve the highest voted answers on Reddit ELI5 forum
FEVER	Given a claim, retrieve documents that support or refute the claim
HotpotQA	Given a multi-hop question, retrieve documents that can help answer the question
MIRACL	Given a question, retrieve Wikipedia passages that answer the question
MrTyDi	Given a question, retrieve Wikipedia passages that answer the question
MSMARCO Passage	Given a web search query, retrieve relevant passages that answer the query
MSMARCO Document	Given a web search query, retrieve relevant documents that answer the query
NQ	Given a question, retrieve Wikipedia passages that answer the question
QuoraDuplicates	Given a question, retrieve questions that are semantically equivalent to the given question
QuoraDuplicates	Find questions that have the same meaning as the input question
Squad	Retrieve Wikipedia passages that answer the question
T2Ranking	Given a Chinese search query, retrieve web passages that answer the question
TriviaQA	Retrieve Wikipedia passages that answer the question

Table 3: Instructions for finetuning datasets.

Training objective.

For the training objective, we use the SimCSE loss (Gao et al., 2021b). It is defined as

$$\ell_{i}=-\log\frac{\exp\left(\operatorname{Sim}\left(h_{i},h_{i}^{+}\right)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\operatorname{Sim}\left(h_{i},h_{j}^{-}\right)/\tau\right)}.\tag{3}$$

In this loss function, $h_{i}$ represents a query (or a reference sentence when the data is symmetric), $h_{i}^{+}$ represents a positive example associated with $h_{i}$, $\{h_{j}^{-}\}_{j=1}^{N}$ represents the set of negatives associated with the example, including mined hard negatives, and $\tau$ is a temperature hyperparameter.
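For concreteness, the loss in Equation (3) can be sketched for a single example as follows. This is a minimal illustration rather than the training code: it assumes $\operatorname{Sim}$ is cosine similarity (as in Appendix C), the default temperature `tau=0.05` is a placeholder not taken from the paper, and the denominator sums over the negatives only, exactly as the equation is printed (some contrastive-loss implementations also include the positive pair there).

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(h: Vector, h_pos: Vector, h_negs: Sequence[Vector], tau: float = 0.05) -> float:
    """Equation (3) as printed: -log(exp(Sim(h, h+)/tau) / sum_j exp(Sim(h, h_j^-)/tau)).

    tau=0.05 is a hypothetical default, not a value reported in the text.
    """
    pos_logit = cosine(h, h_pos) / tau
    neg_logits = [cosine(h, n) / tau for n in h_negs]
    # Log-sum-exp over the negatives, shifted by the max for numerical stability.
    m = max(neg_logits)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in neg_logits))
    return log_denominator - pos_logit
```

The loss decreases as the query moves closer to its positive and further from its negatives. In a batched setting, the negatives for each query would typically combine the mined hard negatives with other examples in the batch, although that detail is an assumption here.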

$q$	$s^{-}$	$s^{+}$
She loves to travel in summer, especially to cold destinations, avoiding hot and crowded places	She loves to travel in summer, but prefers to visit hot and bustling tourist spots	In summer, she adores traveling, specifically to chilly locations, steering clear of warm, populous areas
The cat often sits by the window, dreaming of chasing birds and enjoying the warm sunshine	The cat often sits by the window, but is too lazy to dream of chasing anything	Frequently, the cat lounges near the window, imagining bird pursuits and basking in the sunlight
He reads books every night, finding solace in fiction and escaping from the stresses of daily life	He reads books every night, yet he feels that non-fiction is more engaging and informative	Nightly, he immerses himself in books, seeking comfort in stories and evading everyday tensions
They play music loudly in the evening, filling their home with energetic beats and vibrant melodies	They play music loudly in the evening, but only soothing classical tunes to relax	In the evenings, they blast tunes, their house resonating with lively rhythms and bright harmonies
She paints landscapes on weekends, expressing her creativity through vibrant colors and abstract forms	She paints landscapes on weekends, preferring realistic and detailed depictions of nature	On weekends, she engages in landscape painting, showcasing her artistic flair with lively hues and unconventional shapes
The children eagerly await winter, dreaming of snowball fights and building snowmen	The children eagerly await winter, yet they dislike the cold and prefer staying indoors	During winter, the kids are excited, imagining snow battles and constructing snow figures
He often jokes at parties, becoming the center of attention with his witty humor	He often jokes at parties, but tends to alienate others with his sarcasm	At social gatherings, he frequently makes jokes, captivating the crowd with his clever wit
She collects antique vases, adoring their unique designs and historical significance	She collects antique vases, but is indifferent to their history and focuses on their resale value	Her hobby is gathering old vases, cherishing their distinct patterns and the stories they hold
The band plays rock music loudly, thrilling audiences with energetic performances and powerful lyrics	The band plays rock music loudly, but often receives complaints for being too noisy	Performing rock loudly, the band excites crowds with dynamic shows and impactful words
He prefers working at night, enjoying the quiet and focusing better without distractions	He prefers working at night, despite feeling more tired and less productive	Nighttime is his preferred work period, appreciating the tranquility and concentrated environment
She writes poetry in her free time, pouring her emotions and experiences into each verse	She writes poetry in her free time, but struggles to find inspiration and motivation	During her leisure, she crafts poems, infusing her feelings and life stories into every line

Table 4: Examples of Structure 1 from Section 3

$q$	$s^{-}$	$s^{+}$
On sunny days, I often find myself longing for the cool breeze of the ocean and the sound of waves crashing, as I enjoy outdoor activities	During rainy days, I usually prefer the warmth and quiet of my home, as I enjoy outdoor activities	When the sun is shining, I tend to crave the refreshing sea air and the rhythmic sound of the ocean, since I relish spending time outdoors
As a lover of classical music, I spend hours listening to Beethoven and Bach, reveling in the complexity of their compositions, though I’m fond of playing the guitar	Despite my preference for rock music, I rarely spend time on music other than playing my favorite tunes on the guitar, though I’m fond of playing the guitar	Being an enthusiast of classical melodies, I often indulge in lengthy sessions of Beethoven and Bach, appreciating the intricacies of their work, as I delight in guitar playing
In the world of literature, I have an insatiable appetite for mystery novels and spend countless evenings unraveling their plots, but I adore reading poetry	Contrary to my usual tastes, I rarely delve into mystery novels and prefer lighter reading materials, but I adore reading poetry	As a fervent reader, my passion lies in the twists and turns of mystery stories, which I often explore during long nights, yet I cherish reading poetry
Growing up in a bustling city, I’ve always been surrounded by the constant hum of activity and the bright city lights, which makes me appreciate quiet countryside walks	Having been raised in a tranquil rural area, I’m more accustomed to the sounds of nature and open fields, which makes me appreciate quiet countryside walks	Raised in the lively atmosphere of an urban environment, I’m used to the never-ending city noise and glowing nights, leading me to enjoy the serenity of rural strolls
Ever since I was a child, fascinated by the vastness of the universe, I would spend countless nights gazing at the stars through my telescope, dreaming of exploring distant galaxies, yet I still find solace in simple nature hikes	Though I’ve always been more interested in the immediate world around me, preferring to focus on the tangible and the present, I rarely look up at the night sky, yet I still find solace in simple nature hikes	From my early years, captivated by the infinity of space, I devoted many nights to star-gazing and imagining interstellar journeys, but I also enjoy the peace of nature walks
Growing up with a passion for culinary arts, experimenting with exotic ingredients and complex recipes, and often spending whole days in the kitchen perfecting new dishes, I also have a deep appreciation for classic literature	Despite my lack of interest in cooking and a preference for simple, quick meals that require minimal preparation, I’m not one to spend time in the kitchen, I also have a deep appreciation for classic literature	Since childhood, I’ve been enthusiastic about cooking, often trying out unusual ingredients and intricate recipes, dedicating entire days to refining my culinary creations, and I equally cherish classic literary works

Table 5: Examples of Structure 2 from Section 3

$q$	$s^{-}$	$s^{+}$
She loves to travel in summer, especially to cold destinations, avoiding hot and crowded places	She loves to travel in summer, but prefers to visit hot and bustling tourist spots	She loves to travel in summer, specifically to chilly locations, steering clear of warm, populous areas
The cat often sits by the window, dreaming of chasing birds and enjoying the warm sunshine	The cat often sits by the window, but is too lazy to dream of chasing anything	The cat often sits by the window, imagining bird pursuits and basking in the sunlight
He reads books every night, finding solace in fiction and escaping from the stresses of daily life	He reads books every night, yet he feels that non-fiction is more engaging and informative	He reads books every night, seeking comfort in stories and evading everyday tensions
They play music loudly in the evening, filling their home with energetic beats and vibrant melodies	They play music loudly in the evening, but only soothing classical tunes to relax	They play music loudly in the evening, their house resonating with lively rhythms and bright harmonies
She paints landscapes on weekends, expressing her creativity through vibrant colors and abstract forms	She paints landscapes on weekends, preferring realistic and detailed depictions of nature	She paints landscapes on weekends, showcasing her artistic flair with lively hues and unconventional shapes
The children eagerly await winter, dreaming of snowball fights and building snowmen	The children eagerly await winter, yet they dislike the cold and prefer staying indoors	The children eagerly await winter, imagining snow battles and constructing snow figures
He often jokes at parties, becoming the center of attention with his witty humor	He often jokes at parties, but tends to alienate others with his sarcasm	He often jokes at parties, captivating the crowd with his clever wit
She collects antique vases, adoring their unique designs and historical significance	She collects antique vases, but is indifferent to their history and focuses on their resale value	She collects antique vases, cherishing their distinct patterns and the stories they hold
The band plays rock music loudly, thrilling audiences with energetic performances and powerful lyrics	The band plays rock music loudly, but often receives complaints for being too noisy	The band plays rock music loudly, the band excites crowds with dynamic shows and impactful words
He prefers working at night, enjoying the quiet and focusing better without distractions	He prefers working at night, despite feeling more tired and less productive	He prefers working at night, appreciating the tranquility and concentrated environment
She writes poetry in her free time, pouring her emotions and experiences into each verse	She writes poetry in her free time, but struggles to find inspiration and motivation	She writes poetry in her free time, infusing her feelings and life stories into every line

Table 6: Examples of Structure 3 from Section 3

Most improved			Least improved
Sentence 1	Sentence 2	Score	Sentence 1	Sentence 2	Score
The best thing you can do is to know your stuff.	The best thing to do is to overcome the fussiness.	0.0	Sometime if you really want it you might need to pay an agency to get the place for you.	You could probably get a tour agency to do it for you but it would cost you.	2.0
It really doesn’t matter.	It doesn’t matter unless it is really far off.	3.0	There are three options:	There are only three options:	5.0
I think it’s fine to ask this question.	I think it is okay to ask the question.	5.0	Bremer said one initiative is to launch a US$70 million nationwide program in the next two weeks to clean up neighborhoods and build community projects.	Bremer said he would launch a $70-million program in the next two weeks to clean up neighborhoods across Iraq and build community projects, but gave no details.	3.6
What kind of insulation is it?	What kind of floors are above?	0.0	"Tony’s not feeling well," Spurs coach Gregg Popovich said.	We’re thrilled to be up 3-2,” Coach Gregg Popovich said Wednesday.	1.6
It depends entirely on your company and your contract.	I guess it depends on the nature of your contract.	4.0	Shares of Mandalay closed down eight cents to $29.42, before the earnings were announced.	Shares of Mandalay closed down 8 cents at $29.42 Thursday.	4.0
You need to read a lot to know what you like and what you don’t.	You have to know what you want to do.	0.0	Singapore reported no suspected SARS cases Wednesday, but officials quarantined 70 people who had contact with the Taiwanese patient.	Still, Singapore quarantined 70 people who had been in close contact with the scientist.	3.0
I would say you can do it, but it wouldn’t be advised.	Personally, I would say not unless it suits you.	2.0	The dollar was at 117.85 yen against the Japanese currency, up 0.1 percent.	Against the Swiss franc the dollar was at 1.3289 francs, up 0.5 percent on the day.	1.333

Table 7: Example sentences from STSBenchmark in which zero-shot echo embeddings with Mistral 7B most improve (left) and least improve (right).

Validation Dataset	Classification	Pair Classification	Clustering	Retrieval	STS	Reranking	Average
Classical
Classification	59.20	73.80	24.16	20.57	58.59	54.54	46.79
Pair Classification	58.73	71.40	24.32	20.39	59.00	54.42	46.64
Clustering	58.23	72.62	23.90	18.64	56.68	54.82	45.37
Retrieval	58.21	73.87	23.85	20.35	56.97	54.44	45.88
STS	58.31	44.03	13.07	2.63	38.95	46.77	34.63
Reranking	58.13	71.77	24.20	20.23	58.59	54.89	46.43
Echo
Classification	64.50	74.65	25.93	22.52	73.81	59.41	55.57
Pair Classification	64.15	75.93	22.25	18.35	72.75	58.47	54.15
Clustering	61.54	71.04	26.32	15.88	68.18	60.27	51.81
Retrieval	64.06	75.26	27.02	23.61	72.40	60.00	55.07
STS	64.50	74.65	25.93	22.52	73.81	59.41	55.57
Reranking	64.15	75.93	22.25	18.35	72.75	58.47	54.15
Summarization
Classification	66.62	78.95	21.79	14.68	72.13	64.24	55.22
Pair Classification	66.62	78.95	21.79	14.68	72.13	64.24	55.22
Clustering	66.66	79.59	28.08	11.88	67.30	65.19	53.43
Retrieval	66.01	81.82	26.48	19.13	70.13	66.24	54.96
STS	66.01	81.82	26.48	19.13	70.13	66.24	54.96
Reranking	63.19	75.22	26.09	20.52	65.98	59.05	51.55

Table 8: Scores for additional zero-shot validation datasets on Mistral-7B.

Validation Dataset	Classification	Pair Classification	Clustering	Retrieval	STS	Reranking	Average
Classical
Classification	57.59	68.65	23.72	18.06	57.19	54.59	45.14
Pair Classification	57.56	70.18	23.51	18.54	58.24	54.40	45.79
Clustering	57.14	69.91	23.35	16.98	57.66	55.38	45.25
Retrieval	56.61	68.46	23.22	18.63	56.49	53.26	44.65
STS	57.56	70.18	23.51	18.54	58.24	54.40	45.79
Reranking	56.65	66.54	22.46	10.48	55.97	54.44	42.98
Echo
Classification	62.24	67.96	23.60	14.33	65.79	55.44	49.85
Pair Classification	63.42	72.52	21.11	17.35	68.16	54.98	51.47
Clustering	60.12	66.74	23.45	11.60	64.45	56.31	48.75
Retrieval	61.64	66.29	25.11	16.12	66.18	56.35	50.26
STS	63.15	68.74	23.65	16.38	69.37	57.75	51.96
Reranking	62.30	74.23	24.69	18.17	65.07	56.76	50.51
Summarization
Classification	63.96	77.93	21.89	15.93	67.07	63.39	52.34
Pair Classification	63.96	77.93	21.89	15.93	67.07	63.39	52.34
Clustering	61.60	69.47	24.44	5.28	57.53	57.62	45.85
Retrieval	64.90	78.74	26.63	15.59	70.15	65.43	54.02
STS	64.90	78.74	26.63	15.59	70.15	65.43	54.02
Reranking	60.54	69.73	26.40	15.82	61.60	58.80	47.83

Table 9: Scores for additional zero-shot validation datasets on LLaMa-2-7B.

Validation Dataset	Classification	Pair Classification	Clustering	Retrieval	STS	Reranking	Average
Classical
Classification	58.24	71.65	23.91	21.79	58.74	56.37	46.66
Pair Classification	58.10	73.30	23.01	16.97	57.83	56.17	45.52
Clustering	58.61	67.47	23.30	15.51	57.93	56.86	45.05
Retrieval	58.50	65.06	24.22	18.92	57.47	56.38	45.15
STS	58.24	71.65	23.91	21.79	58.74	56.37	46.66
Reranking	58.61	67.47	23.30	15.51	57.93	56.86	45.05
Echo
Classification	64.15	74.22	25.02	27.58	70.81	61.43	55.02
Pair Classification	64.57	77.63	22.56	24.08	69.85	59.89	53.55
Clustering	63.26	73.50	25.10	27.48	69.04	61.81	54.32
Retrieval	64.65	74.57	25.72	26.58	72.20	62.68	55.60
STS	63.16	75.98	24.08	27.56	71.00	61.84	54.85
Reranking	62.90	70.58	25.53	22.11	68.82	62.38	53.02
Summarization
Classification	66.02	79.06	26.47	22.20	67.91	64.90	54.52
Pair Classification	66.02	79.06	26.47	22.20	67.91	64.90	54.52
Clustering	63.84	71.98	21.99	7.48	56.96	59.41	46.50
Retrieval	66.02	79.06	26.47	22.20	67.91	64.90	54.52
STS	66.02	79.06	26.47	22.20	67.91	64.90	54.52
Reranking	61.19	69.63	26.38	19.62	60.76	62.79	48.36

Table 10: Scores for additional zero-shot validation datasets on LLaMa-2-13B.

Dataset	Classical	Echo	Summarization
FiQA2018 (retrieval)	7.89	12.74	12.43
SCIDOCS (retrieval)	3.60	4.88	9.97
SciFact (retrieval)	45.39	49.36	29.90
NFCorpus (retrieval)	12.07	16.57	17.51
TwitterSemEval20. (pair_classification)	47.81	62.49	59.79
TwitterURLCorpus (pair_classification)	73.87	75.26	81.82
ImdbClassificati. (classification)	72.50	72.02	82.78
AmazonReviewsCla. (classification)	37.09	40.72	45.58
TweetSentimentEx. (classification)	53.70	58.76	61.74
MTOPDomainClassi. (classification)	83.85	92.71	90.72
TwentyNewsgroups. (clustering)	20.84	29.48	30.11
BiorxivClusterin. (clustering)	23.47	27.61	27.21
MedrxivClusterin. (clustering)	24.23	26.42	25.75
StackOverflowDup. (reranking)	35.85	42.71	40.32
AskUbuntuDupQues. (reranking)	49.49	54.09	57.17
SciDocsRR (reranking)	59.38	65.91	75.30
BIOSSES (sts)	59.05	78.19	66.06
STS12 (sts)	42.01	58.43	64.62
STS13 (sts)	59.66	78.53	78.45
STS14 (sts)	50.69	68.42	71.00
STS15 (sts)	61.81	78.82	78.29
STS16 (sts)	57.03	77.52	77.40
STS17 (sts)	68.08	82.14	78.80
STS22 (sts)	61.23	57.60	47.07
STSBenchmark (sts)	47.55	73.85	77.39
SICK-R (sts)	53.19	71.95	69.48
Average	45.88	55.07	54.96

Table 11: Evaluation of all MTEB datasets for zero-shot for Mistral-7B.

Dataset	Classical	Echo	Summarization
FiQA2018 (retrieval)	6.48	12.38	9.00
SCIDOCS (retrieval)	3.72	4.38	8.33
SciFact (retrieval)	42.18	30.61	23.01
NFCorpus (retrieval)	10.01	13.38	15.43
TwitterSemEval20. (pair_classification)	44.11	54.66	54.27
TwitterURLCorpus (pair_classification)	68.46	66.29	78.74
ImdbClassificati. (classification)	71.65	73.11	85.83
AmazonReviewsCla. (classification)	36.16	40.68	44.77
TweetSentimentEx. (classification)	52.04	54.85	59.96
MTOPDomainClassi. (classification)	81.63	89.38	89.97
TwentyNewsgroups. (clustering)	15.88	23.42	32.28
BiorxivClusterin. (clustering)	23.13	25.92	27.79
MedrxivClusterin. (clustering)	23.31	24.30	25.48
StackOverflowDup. (reranking)	35.57	40.82	35.63
AskUbuntuDupQues. (reranking)	48.51	51.42	56.09
SciDocsRR (reranking)	58.01	61.29	74.76
BIOSSES (sts)	65.31	71.96	68.04
STS12 (sts)	41.84	52.40	60.20
STS13 (sts)	58.43	72.40	76.31
STS14 (sts)	49.21	61.24	68.73
STS15 (sts)	60.03	72.67	75.59
STS16 (sts)	56.40	73.51	76.71
STS17 (sts)	62.31	71.87	79.38
STS22 (sts)	59.48	55.21	55.69
STSBenchmark (sts)	49.45	65.73	76.42
SICK-R (sts)	55.35	64.39	70.69
Average	44.65	50.26	54.02

Table 12: Evaluation of all MTEB datasets for zero-shot for LLaMa-2-7B.

Dataset	Classical	Echo	Summarization
FiQA2018 (retrieval)	8.31	18.07	9.43
SCIDOCS (retrieval)	4.87	7.56	10.38
SciFact (retrieval)	41.64	50.55	40.19
NFCorpus (retrieval)	10.26	21.63	16.02
TwitterSemEval20. (pair_classification)	42.43	62.85	59.55
TwitterURLCorpus (pair_classification)	65.06	74.57	79.06
ImdbClassificati. (classification)	71.82	75.44	91.86
AmazonReviewsCla. (classification)	37.88	43.25	50.60
TweetSentimentEx. (classification)	52.95	58.18	59.93
MTOPDomainClassi. (classification)	84.67	92.52	87.51
TwentyNewsgroups. (clustering)	17.21	25.98	32.08
BiorxivClusterin. (clustering)	24.95	26.75	28.30
MedrxivClusterin. (clustering)	23.49	24.70	24.64
StackOverflowDup. (reranking)	37.24	44.86	38.44
AskUbuntuDupQues. (reranking)	50.74	55.21	54.15
SciDocsRR (reranking)	62.03	70.15	75.65
BIOSSES (sts)	63.26	77.60	69.33
STS12 (sts)	51.80	59.36	51.17
STS13 (sts)	61.59	79.01	76.08
STS14 (sts)	49.69	69.75	66.62
STS15 (sts)	58.48	79.86	73.75
STS16 (sts)	53.18	76.75	77.40
STS17 (sts)	65.10	80.41	75.88
STS22 (sts)	59.00	56.84	49.23
STSBenchmark (sts)	44.80	71.31	75.17
SICK-R (sts)	55.13	70.27	71.70
Average	45.15	55.60	54.52

Table 13: Evaluation of all MTEB datasets for zero-shot for LLaMa-2-13B.

AmazonCounterfactualCls.	Classify a given Amazon customer review text as either counterfactual or not counterfactual
AmazonPolarityCls.	Classify Amazon reviews into positive or negative sentiment
AmazonReviewsCls.	Classify the given Amazon review into its appropriate rating category
Banking77Cls.	Given a online banking query, find the corresponding intents
EmotionCls.	Classify the emotion expressed in the given Twitter message into one of the six emotions: anger, fear, joy, love, sadness, and surprise
ImdbCls.	Classify the sentiment expressed in the given movie review text from the IMDB dataset
MassiveIntentCls.	Given a user utterance as query, find the user intents
MassiveScenarioCls.	Given a user utterance as query, find the user scenarios
MTOPDomainCls.	Classify the intent domain of the given utterance in task-oriented conversation
MTOPIntentCls.	Classify the intent of the given utterance in task-oriented conversation
ToxicConversationsCls.	Classify the given comments as either toxic or not toxic
TweetSentimentExtractionCls.	Classify the sentiment of a given tweet as either positive, negative, or neutral
ArxivClusteringP2P	Identify the main and secondary category of Arxiv papers based on the titles and abstracts
ArxivClusteringS2S	Identify the main and secondary category of Arxiv papers based on the titles
BiorxivClusteringP2P	Identify the main category of Biorxiv papers based on the titles and abstracts
BiorxivClusteringS2S	Identify the main category of Biorxiv papers based on the titles
MedrxivClusteringP2P	Identify the main category of Medrxiv papers based on the titles and abstracts
MedrxivClusteringS2S	Identify the main category of Medrxiv papers based on the titles
RedditClustering	Identify the topic or theme of Reddit posts based on the titles
RedditClusteringP2P	Identify the topic or theme of Reddit posts based on the titles and posts
StackExchangeClustering	Identify the topic or theme of StackExchange posts based on the titles
StackExchangeClusteringP2P	Identify the topic or theme of StackExchange posts based on the given paragraphs
TwentyNewsgroupsClustering	Identify the topic or theme of the given news articles
SprintDuplicateQuestions	Retrieve duplicate questions from Sprint forum
TwitterSemEval2015	Retrieve tweets that are semantically similar to the given tweet
TwitterURLCorpus	Retrieve tweets that are semantically similar to the given tweet
AskUbuntuDupQuestions	Retrieve duplicate questions from AskUbuntu forum
MindSmallReranking	Retrieve relevant news articles based on user browsing history
SciDocsRR	Given a title of a scientific paper, retrieve the titles of other relevant papers
StackOverflowDupQuestions	Retrieve duplicate questions from StackOverflow forum
ArguAna	Given a claim, find documents that refute the claim
ClimateFEVER	Given a claim about climate change, retrieve documents that support or refute the claim
CQADupstackAndroidRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackEnglishRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackGamingRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackGisRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackMathematicaRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackPhysicsRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackProgrammersRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackStatsRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackTexRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackUnixRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackWebmastersRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
CQADupstackWordpressRetr.	Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
DBPedia	Given a query, retrieve relevant entity descriptions from DBPedia
FEVER	Given a claim, retrieve documents that support or refute the claim
FiQA2018	Given a financial question, retrieve user replies that best answer the question
HotpotQA	Given a multi-hop question, retrieve documents that can help answer the question
MSMARCO	Given a web search query, retrieve relevant passages that answer the query
NFCorpus	Given a question, retrieve relevant documents that best answer the question
NQ	Given a question, retrieve Wikipedia passages that answer the question
QuoraRetr.	Given a question, retrieve questions that are semantically equivalent to the given question
SCIDOCS	Given a scientific paper title, retrieve paper abstracts that are cited by the given paper
SciFact	Given a scientific claim, retrieve documents that support or refute the claim
Touche2020	Given a question, retrieve detailed and persuasive arguments that answer the question
TRECCOVID	Given a query on COVID-19, retrieve documents that answer the query
BIOSSES	Retrieve semantically similar text
SICK-R	Retrieve semantically similar text
STS12	Retrieve semantically similar text
STS13	Retrieve semantically similar text
STS14	Retrieve semantically similar text
STS15	Retrieve semantically similar text
STS16	Retrieve semantically similar text
STS17	Retrieve semantically similar text
STS22	Retrieve semantically similar text
STSBenchmark	Retrieve semantically similar text
SummEval	Given a news summary, retrieve other semantically similar summaries

Table 14: MTEB instructions for evaluation of finetuned models.

Model	Average	Clas.	Clus.	Pair Clas.	Rera.	Retr.	STS	Summ.
Classical (w/ instruct., mean)	62.96	76.26	42.68	86.31	57.58	53.75	81.53	30.19
Classical (w/ instruct., last)	63.98	76.57	45.78	86.37	56.71	54.87	82.03	31.02
Echo (w/ instruct., mean)	64.22	77.00	44.94	87.73	58.30	55.11	82.52	29.46
Echo (w/ instruct., last)	64.68	77.43	46.32	87.34	58.14	55.52	82.56	30.73
Classical (w/out instruct., mean)	62.19	75.23	41.79	85.24	56.31	53.24	80.97	30.64
Classical (w/out instruct., last)	62.37	75.01	42.70	85.69	56.64	53.29	80.92	30.91
Echo (w/out instruct., mean)	63.28	75.26	42.93	86.95	57.05	55.65	81.40	30.62
Echo (w/out instruct., last)	62.80	75.30	42.94	86.31	57.31	54.18	80.92	31.00
Classical (w/ instruct., mean, step 280)	63.19	76.18	42.99	85.44	57.63	53.96	82.53	29.94
Classical (w/ instruct., last, step 280)	63.87	76.54	46.22	86.70	57.79	53.73	82.22	30.13
Echo (w/ instruct., mean, step 280)	64.04	76.84	45.76	87.72	59.33	53.55	82.64	30.33
Echo (w/ instruct., last, step 280)	64.50	76.41	46.70	87.17	59.10	54.84	82.98	31.09

Table 15: Additional ablations for finetuning.

Dataset	Repetition (last)	Repetition (mean)	Classical (last)	Classical (mean)	Bidirectional (last)
AmazonCounterfactualClassification	82.97	82.91	80.82	82.21	83.07
AmazonPolarityClassification	90.98	88.25	92.55	90.37	90.83
AmazonReviewsClassification	48.71	49.41	48.75	46.76	47.94
Banking77Classification	88.15	88.06	87.95	87.69	88.17
EmotionClassification	52.18	51.51	50.66	49.23	52.09
ImdbClassification	87.42	84.80	83.18	82.53	83.02
MassiveIntentClassification	79.67	79.70	78.60	79.15	78.93
MassiveScenarioClassification	82.82	82.74	81.71	81.46	81.80
MTOPDomainClassification	96.16	96.10	95.92	95.54	96.14
MTOPIntentClassification	85.75	85.87	85.96	85.86	85.98
ToxicConversationsClassification	71.91	72.21	71.19	72.21	71.46
TweetSentimentExtractionClassification	62.40	62.46	61.60	62.07	60.97
ArxivClusteringP2P	47.02	45.52	46.73	45.80	47.03
ArxivClusteringS2S	43.52	42.32	43.99	40.73	42.14
BiorxivClusteringP2P	35.53	35.24	36.50	35.42	36.21
BiorxivClusteringS2S	35.34	33.70	34.87	32.03	34.77
MedrxivClusteringP2P	30.27	29.68	30.67	29.74	31.06
MedrxivClusteringS2S	29.67	27.73	29.75	27.97	30.12
RedditClustering	61.77	59.12	61.17	54.79	62.50
RedditClusteringP2P	66.01	65.44	64.84	63.68	65.45
StackExchangeClustering	72.04	71.21	71.87	66.99	71.58
StackExchangeClusteringP2P	35.29	34.07	33.08	31.47	34.98
TwentyNewsgroupsClustering	53.04	50.29	50.07	40.91	49.53
SprintDuplicateQuestions	94.59	95.05	94.38	95.29	96.26
TwitterSemEval2015	79.93	80.73	77.18	75.98	80.80
TwitterURLCorpus	87.50	87.40	87.56	87.67	87.38
AskUbuntuDupQuestions	64.13	64.44	62.24	63.32	62.65
MindSmallReranking	32.92	32.11	32.68	32.52	32.53
SciDocsRR	83.68	84.15	81.60	83.01	82.36
StackOverflowDupQuestions	51.84	52.51	50.33	51.48	51.35
ArguAna	58.52	56.52	57.22	51.14	57.27
ClimateFEVER	34.56	37.07	31.10	30.31	32.73
CQADupstackRetrieval	46.91	46.48	45.11	43.30	46.52
DBPedia	46.83	48.19	45.18	46.80	46.76
FEVER	91.22	91.14	90.30	90.63	91.66
FiQA2018	54.51	54.11	50.31	48.94	53.06
HotpotQA	76.41	75.75	72.95	68.50	75.30
MSMARCO	43.25	43.11	42.31	41.49	43.38
NFCorpus	39.55	37.18	39.32	38.53	38.61
NQ	62.31	61.51	62.07	60.65	63.69
QuoraRetrieval	89.34	89.33	89.04	88.94	89.57
SCIDOCS	20.17	17.73	19.34	19.88	19.69
SciFact	73.99	73.57	74.22	75.39	75.83
Touche2020	18.52	18.92	24.46	19.44	15.79
TRECCOVID	76.66	76.02	80.17	82.30	74.50
BIOSSES	86.54	86.78	85.73	83.31	85.38
STS12	76.13	75.89	75.84	76.23	75.50
STS13	83.19	82.90	83.41	82.61	83.44
STS14	80.60	80.99	79.80	79.89	81.35
STS15	87.16	87.16	86.99	86.68	87.43
STS16	85.16	84.93	83.93	84.18	85.34
STS17	90.88	90.78	91.12	90.14	90.99
STS22	67.04	67.21	66.27	65.99	66.32
STSBenchmark	85.67	85.87	84.96	85.20	85.45
SICK-R	83.23	82.70	82.22	81.11	82.97
SummEval	30.73	29.46	31.02	30.19	29.32

Table 16: Results from all MTEB datasets for finetuning with Mistral-7B.
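
The per-dataset scores in Table 16 appear to be the inputs to the per-category scores in Table 15. The minimal sketch below assumes that a category score is the unweighted mean of its datasets' scores, rounded to two decimals; that is an assumption checked here against two categories only, not a description of the evaluation code. The dictionary layout and the `category_means` helper are hypothetical names introduced for illustration. The values are copied from the first and third data columns of Table 16 (echo and classical embeddings with last-token pooling).

```python
from statistics import mean

# Scores copied from Table 16 as (echo last-token, classical last-token).
# The grouping of datasets into categories follows the MTEB task categories.
scores = {
    "Pair classification": {
        "SprintDuplicateQuestions": (94.59, 94.38),
        "TwitterSemEval2015": (79.93, 77.18),
        "TwitterURLCorpus": (87.50, 87.56),
    },
    "Reranking": {
        "AskUbuntuDupQuestions": (64.13, 62.24),
        "MindSmallReranking": (32.92, 32.68),
        "SciDocsRR": (83.68, 81.60),
        "StackOverflowDupQuestions": (51.84, 50.33),
    },
}


def category_means(scores):
    """Return {category: (echo_mean, classical_mean)}, rounded to 2 decimals.

    Assumption: each category score is the unweighted mean over its datasets.
    """
    out = {}
    for category, datasets in scores.items():
        echo = mean(pair[0] for pair in datasets.values())
        classical = mean(pair[1] for pair in datasets.values())
        out[category] = (round(echo, 2), round(classical, 2))
    return out


result = category_means(scores)
for category, (echo, classical) in result.items():
    print(f"{category}: echo={echo:.2f} classical={classical:.2f}")
```

Under this assumption, the computed means reproduce the pair-classification and reranking entries of the "Echo (w/ instruct., last)" and "Classical (w/ instruct., last)" rows of Table 15 (87.34 and 58.14 for echo, 86.37 and 56.71 for classical).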