License: arXiv.org perpetual non-exclusive license
arXiv:2402.15449v1 [cs.CL] 23 Feb 2024

Repetition Improves Language Model Embeddings

Jacob Mitchell Springer Suhas Kotha
Daniel Fried Graham Neubig Aditi Raghunathan
Carnegie Mellon University
{jspringe, suhask, dfried, gneubig, aditirag}@cs.cmu.edu
Abstract

Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, “echo embeddings,” in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data. (Footnote 1: Our code and pre-trained models are released at https://github.com/jakespringer/echo-embeddings.)


1 Introduction

Neural text embeddings have a crucial role in modern approaches to information retrieval (IR), semantic similarity estimation, classification, and clustering (Ni et al., 2021b; Muennighoff et al., 2022). For example, document retrieval often leverages low-dimensional embeddings for efficient lookup: when queries and documents are encoded as vectors where semantic relationships are described by similarity in some metric space, a query lookup can be reduced to an approximate nearest-neighbor search in embedding space (Johnson et al., 2019; Vanderkam et al., 2013).

In the recent past, the dominant pretrained language model paradigm for neural embeddings has been masked language models with bidirectional attention (Ni et al., 2021a; Raffel et al., 2020; Izacard et al., 2021; Wang et al., 2022; Jiang et al., 2022; Su et al., 2022; Xiao et al., 2023a; Li et al., 2023). However, more recent literature (Ma et al., 2023; Wang et al., 2023) has begun to scale these algorithms to modern autoregressive language models such as LLaMA-2 and Mistral (Touvron et al., 2023; Jiang et al., 2023a). Developing approaches to construct embeddings from autoregressive language models is promising: for many tasks, these models are the highest quality models available (Srivastava et al., 2022).

Figure 1: Conceptual overview of echo embeddings.

In this paper, we address a striking failure mode of autoregressive language models. This failure arises from the fact that for autoregressive language models, contextualized token embeddings (the vectors of last-hidden-layer activations at the position of a particular input token) do not contain information from tokens that appear later in the sentence due to the causal attention mask. We demonstrate that such embeddings can fail to appropriately determine similarity when the early tokens are superficially similar but the inputs become dissimilar in important ways once key information from the end of the input is taken into account.

We propose a strategy to overcome this limitation in autoregressive models through “echo embeddings.” With this approach, we repeat the input so that it appears twice in the context passed to the language model, and extract embeddings from the second occurrence. Repeating the input enables the contextualized token embeddings of the second occurrence of the passage to encode information from tokens that appear later in the passage by attending to their first occurrence in the context. We show that echo embeddings do in fact allow embeddings of the early tokens to capture information about the later tokens.

We then evaluate echo embeddings on the standard Massive Text Embedding Benchmark (MTEB) leaderboard. (Footnote 2: The MTEB leaderboard can be found at https://huggingface.co/spaces/mteb/leaderboard.) In the zero-shot setting, echo embeddings improve on classical embeddings by over 9% and provide consistent gains across all the different tasks for a variety of language models and scales. We then perform an apples-to-apples comparison when fine-tuning embeddings from Mistral-7B and continue to see consistent gains of echo over classical across the various tasks (by 0.7% on average). Strikingly, echo embeddings with the strong Mistral-7B language model allow us to achieve state-of-the-art embedding quality, enabling autoregressive language models to match prior open-source top-performing models that otherwise leveraged MLMs with bidirectional attention. (Footnote 3: The contemporaneous work by Wang et al. (2023) achieves state-of-the-art accuracy on MTEB using classical embeddings from autoregressive language models, but through finetuning on high quality synthetic data, which we believe is largely orthogonal to our contribution.)

The approach of echo embeddings is conceptually well-motivated, extremely simple, and generally compatible with other innovations in extracting embeddings from autoregressive language models. As language models are likely to continue to improve over the coming years, echo embeddings can be a simple but powerful twist on classical embeddings that allows us to maximally leverage autoregressive language models.

2 Preliminaries

Our goal is to extract text embeddings that map a sentence $x$ to a vector $\phi(x) \in \mathbb{R}^{d}$ such that the semantic similarity between sentences is captured as similarity between their embeddings. In practice, we use the cosine similarity between embeddings to capture semantic similarity (detailed in Appendix B).

Embeddings from language models.

We are primarily interested in the embeddings extracted from autoregressive language models, which typically have causal attention masking and are trained on a next-token objective. For brevity, we drop the term “autoregressive” in the following.

As is standard, we extract embeddings from the activations of the final hidden layer. Each input token $x_j$ at position $j$ is associated with a contextualized token embedding, which is the hidden layer representation $\phi_j(x)$.

We can pool the embeddings across all the tokens in different ways. In this work, we focus on two common strategies which have been considered by prior work (Reimers and Gurevych, 2019; Muennighoff, 2022; Zhang et al., 2023a; Wang et al., 2023).

A mean token embedding over a set of indices $A$ refers to the mean of the contextualized token embeddings at indices in $A$: $\phi_A(x) \coloneqq \frac{1}{|A|} \sum_{t \in A} \phi_t(x)$.

A last-token embedding refers to the contextualized token embedding of the last token in the input sequence, written $\phi_{-1}(x)$.
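
The two pooling strategies can be sketched in a few lines. This is a minimal illustration, assuming the final-layer activations are already available as an array `hidden` of shape (number of tokens, $d$); the function names and the toy values are ours and do not come from the released code.

```python
import numpy as np

def mean_token_embedding(hidden, indices):
    """phi_A(x): mean of the contextualized token embeddings at positions in A."""
    idx = np.asarray(list(indices), dtype=int)
    if idx.size == 0:
        raise ValueError("the index set A must be non-empty")
    return hidden[idx].mean(axis=0)

def last_token_embedding(hidden):
    """phi_{-1}(x): contextualized embedding of the final input token."""
    return hidden[-1]

# Toy activations for a 3-token input with d = 2 (made-up numbers).
hidden = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])
mean_all = mean_token_embedding(hidden, range(3))   # pool over every token
mean_head = mean_token_embedding(hidden, [0, 1])    # pool over a subset A
last = last_token_embedding(hidden)
```

Passing a subset of positions as `indices` is what later allows pooling over only part of the input, such as the second occurrence of a repeated sentence.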

Classical embeddings.

Traditionally, embeddings are computed by simply passing the sentence to the model and extracting some pooling (e.g. mean or last-token) of the contextualized embeddings corresponding to the input sentence. We will refer to embeddings created in this way as “classical embeddings”. Additionally, one might first prompt the language model with an explanation of the task of interest followed by the sentence, and then pool the contextualized embeddings of the sentence tokens like before (Su et al., 2022).

3 Echo Embeddings

In this section, we first demonstrate a failure mode of classical embeddings, and motivate a new method that we call echo embeddings that addresses this failure.

3.1 Classical Embeddings Miss Bidirectional Information.

Sentence embeddings should aggregate information across the entire sentence. However, for autoregressive language models, the contextualized embedding at position $k$, $\phi_k(x)$, cannot encode information about tokens $x_{k+1}, x_{k+2}, \ldots$. Hence, the “meaning” encoded by the embeddings of tokens at the beginning of a sentence might inaccurately suggest they are similar (or dissimilar) to other tokens without considering the influence of tokens that come later. As a simple illustration, consider the following.

$q$: [She loves summer] [but dislikes the heat]
$s^{-}$: [She loves summer] [for the warm evenings]
$s^{+}$: [Summer is her favorite] [but not the temp.]

Here, the contextualized embeddings of the first half of $s^{+}$ and $s^{-}$ are both similar to $q$ because they do not attend to the second half of the sentence. As a result, the similarity between $q$ and $s^{-}$ would be overestimated by any pooling strategy that uses information from the first half. We address last-token pooling at the end of this section.
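
The effect of the causal mask can be checked on a toy example. The sketch below is a single-head self-attention layer with no learned weights, written only to illustrate the masking argument; it is not the architecture of any model used in this paper, and all names are ours. Two inputs that share a prefix but differ in their suffix produce identical outputs at the prefix positions.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask and no learned weights."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True where key is later than query
    scores = np.where(future, -np.inf, scores)          # block attention to later tokens
    scores = scores - scores.max(axis=1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
prefix = rng.normal(size=(3, 4))     # shared first part, 3 tokens
suffix_a = rng.normal(size=(2, 4))   # two different continuations
suffix_b = rng.normal(size=(2, 4))

out_a = causal_self_attention(np.vstack([prefix, suffix_a]))
out_b = causal_self_attention(np.vstack([prefix, suffix_b]))

# Prefix positions never see the suffix, so their outputs agree.
prefix_unchanged = np.allclose(out_a[:3], out_b[:3])
suffix_differs = not np.allclose(out_a[3:], out_b[3:])
```

Stacking more causally masked layers does not change this conclusion, since each layer at position $k$ only reads positions up to $k$.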

3.2 Echo Embeddings

We propose a simple fix to mitigate the failure above: we present the input sentence twice to the language model and extract contextualized embeddings from the second occurrence of the sentence. In principle, the contextualized embeddings of the second occurrence can attend to the entire sentence presented in the first occurrence. Furthermore, in order to encourage the second occurrence to actually “encode” information about the first, we instruct the language model to perform a generic task that requires using this information, e.g., “rewrite” or “repeat.”

Classical embeddings: Feed sentence $x$ to the language model and pool the contextualized embeddings of sentence $x$.
Echo embeddings: Feed a prompt such as “Rewrite the sentence: $x$, rewritten sentence: $x$” to the language model and pool the contextualized embeddings of the second occurrence of $x$.

Key to our method is passing the sentence twice to the model and pooling embeddings exclusively from the second occurrence. (Footnote 4: We find that minor variations of the echo embeddings prompt, e.g. changing “rewrite” to “repeat”, work equally well, and we provide an example list in Appendix B.) Other tricks from classical embeddings, such as prompting the model with the downstream task of interest, can be applied to echo embeddings as well.
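
As a minimal sketch of the bookkeeping involved, the code below builds the example prompt quoted above and locates the second occurrence of the sentence so that pooling can be restricted to it. Whitespace splitting is used here as a stand-in for a real tokenizer's character offsets, which is an assumption made purely for illustration; the helper names are hypothetical.

```python
import re

def build_echo_prompt(sentence):
    """Return the echo prompt and the character span of the second occurrence."""
    prefix = f"Rewrite the sentence: {sentence}, rewritten sentence: "
    prompt = prefix + sentence
    return prompt, (len(prefix), len(prompt))

def token_indices_in_span(prompt, span):
    """Indices of whitespace-delimited tokens lying fully inside `span`.

    Whitespace splitting stands in for a real tokenizer's offset mapping.
    """
    start, end = span
    return [
        i for i, m in enumerate(re.finditer(r"\S+", prompt))
        if m.start() >= start and m.end() <= end
    ]

sentence = "She loves summer but dislikes the heat"
prompt, span = build_echo_prompt(sentence)
echo_positions = token_indices_in_span(prompt, span)  # positions to pool over
```

With a subword tokenizer the same idea applies, except that the positions would be derived from the tokenizer's own character offsets rather than from whitespace.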

Figure 2: (left) We take the echo embeddings of only $A$ for query $q$ and sentences $s^{-}, s^{+}$ and plot the distribution of cosine similarities, showing that echo embeddings encode later information in earlier tokens. (right) We plot the accuracy of classical and echo embeddings when the sentences have similar beginnings (Structure 1) and dissimilar beginnings (Structure 2).

3.3 Repetition Captures Bidirectional Info

In the previous section, we argued that classical embeddings suffer from the issue that contextualized embeddings of early tokens can miss out on information from the later tokens. But can simple repetition via echo embeddings solve this issue? We aim to test this by extracting embeddings from Mistral-7B in a simple, controlled synthetic setting.

Given a query $q\colon [A, B]$ we construct sentence pairs $s^{+}, s^{-}$ as follows. We make the first part of each sentence identical to that of the query, so that the sentences differ only in their second parts,

$q\colon [A, B]$; $s^{+}\colon [A, B^{+}]$; $s^{-}\colon [A, B^{-}]$,

where $B^{+}$ is a paraphrase of $B$ and $B^{-}$ is semantically dissimilar to $B$. We query GPT-4 to generate examples of this structure. We describe the full procedure and prompts in Appendix B.

With classical embeddings, the contextualized embeddings of the $A$-portions of $s^{+}$, $s^{-}$, and $q$ are identical by construction. To test whether echo embeddings can meaningfully distinguish $s^{+}$ and $s^{-}$ despite having identical initial tokens, we take the mean over just the $A$-portion of the echo embeddings and plot the cosine similarities $\operatorname{Sim}(q, s^{+})$ and $\operatorname{Sim}(q, s^{-})$ in Figure 2 (left). We find that $\operatorname{Sim}(q, s^{+})$ is typically larger than $\operatorname{Sim}(q, s^{-})$. Since we are only pooling the echo embeddings of the $A$-portion, any distinction between $s^{+}$ and $s^{-}$ must come from the echo embeddings of $A$ capturing information from the later parts of the sentence. This showcases that current autoregressive language models can in fact allow early tokens to capture information from later tokens via echo embeddings.

3.4 Classical vs. Echo on Synthetic Data

In Section 3.3 we demonstrated that echo embeddings encode bidirectional information. However, is this sufficient to recover from the failure mode of classical embeddings? Further, where will we expect echo embeddings to improve over classical embeddings? Here, we compare echo and classical embeddings on synthetic data to answer both of these questions.

Datasets.

We sample datasets according to two structures depending on whether the discriminating information between $s^{+}$ and $s^{-}$ is in the second half (structure S1) or first half (structure S2) of the sentence. Using the structures below, we generate samples using GPT-4, as in the previous section (full details in the appendix):

(S1)   $q\colon [A, B]$, $s^{+}\colon [A^{+}, B^{+}]$, $s^{-}\colon [A^{+}, B^{-}]$

(S2)   $q\colon [A, B]$, $s^{+}\colon [A^{+}, B^{+}]$, $s^{-}\colon [A^{-}, B^{+}]$.

We measure the accuracy of identifying which of the two sentences $s^{+}$ and $s^{-}$ is closer to the query, as measured by the cosine similarity between embeddings. We compare classical and echo embeddings when using mean pooling to aggregate embeddings extracted from the Mistral-7B model. We use mean token embeddings rather than last-token embeddings because last-token embeddings can be quite fragile in a zero-shot setting (Section 5.1).
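
The accuracy metric described above can be written down directly. The following is a minimal sketch that assumes the pooled embeddings for each triplet $(q, s^{+}, s^{-})$ have already been computed; the function names and the two toy triplets are ours.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_accuracy(queries, positives, negatives):
    """Fraction of triplets in which s+ is closer to q than s- is."""
    correct = sum(
        cosine_similarity(q, p) > cosine_similarity(q, n)
        for q, p, n in zip(queries, positives, negatives)
    )
    return correct / len(queries)

# Two toy triplets with d = 2: the first is ranked correctly, the second is not.
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[0.9, 0.1], [1.0, 0.2]])
negatives = np.array([[0.0, 1.0], [0.1, 1.0]])
accuracy = triplet_accuracy(queries, positives, negatives)
```

Under this metric, chance performance is 0.5, which is the natural reference point for the comparisons in Figure 2 (right).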

Results.

We present results on the two different structures in Figure 2 (right). We see that classical embeddings struggle on Structure 1: when the distinguishing information is at the end, the embeddings corresponding to the early tokens exaggerate similarity between $q$ and $s^{-}$ because they do not encode the information provided by $B^{-}$. In contrast, echo embeddings are able to successfully determine the more similar sentence, presumably because embeddings of the $A$-portion now also encode information about the later parts (demonstrated in Section 3.3). As a control for other reasons why echo embeddings might outperform classical embeddings, we also compare these embeddings on Structure 2, where early tokens provide discriminative signal without needing the later context. As expected, both classical and echo embeddings achieve good performance in this setting. All in all, this analysis on synthetic data demonstrates that zero-shot classical embeddings do not encode information about later context in early token embeddings, but echo embeddings can do so.

Does last-token pooling resolve the failure of classical embeddings?

The embedding of the last token $\phi_{-1}(x)$ can, in principle, encode information from the entire input. However, we posit that the last-token pooling strategy is highly brittle and can depend too strongly on the tokens near the end of the input. To verify this, we compare the accuracy of mean token pooling and last-token pooling for classical and echo embeddings in two settings. First, we evaluate on the original synthetic data of Structure 1 (Figure 3, left). Second, we evaluate on the same data, but where we append a uniformly randomly selected token to the end of each example (Figure 3, right). While last-token pooling has high accuracy on the original toy data (though still lower accuracy than echo embeddings), it fails to perform well on the noisy examples. Echo embeddings with mean token pooling, however, are robust to the noise.

While this particular distribution of noise is artificial, it highlights that last-token pooling can be sensitive to noise in the last token. We verify in Section 5.1 that last-token embeddings perform poorly on real data. Thus, even if last-token embeddings address the inability of mean token classical embeddings to encode information from tokens that appear later in the sequence, they are not practical due to their sensitivity to noise.

Does last-token pooling resolve the failure after finetuning?

In practice, it is common to finetune embeddings on a sentence similarity objective. It is hard to delineate the degree to which this failure mode remains after finetuning. Nonetheless, we demonstrate in Section 5.2 that our method improves in the finetuning setting, even when using last-token pooling.

Figure 3: We compare the accuracy of classical and echo embeddings, and mean and last-token pooling on sentences which have similar beginnings (Structure 1). We plot these accuracies on the original data (left), and the data in which a single uniformly randomly chosen token is appended to the end of each sentence (right).

4 Methodology

In Section 3, we explored how echo embeddings can improve over classical embeddings by addressing a fundamental failure mode. In this section, we describe the methodology by which we evaluate echo embeddings on large scale real datasets in both the zero-shot and finetuning settings. While finetuning is currently necessary to achieve state-of-the-art performance, zero-shot embeddings have the advantage that they do not require expensive finetuning on top of a pretrained language model. Zero-shot results can also more clearly show how different embedding strategies work on real datasets.

4.1 Constructing Zero-shot Embeddings.

We extract zero-shot embeddings via different strategies from three language models: Mistral-7B, LLaMA-2-7B, and LLaMA-2-13B. We select the instruction-finetuned model for each of them. Refer to Appendix A.2 for additional information on the base models. Recent literature suggests that the performance of language models on zero-shot tasks can be highly variable depending on the exact wording and template of the prompts (Sclar et al., 2023). Thus, for each of the embedding strategies we consider, we perform prompt randomization where we sample prompts by randomizing the exact wording, punctuation, and capitalization of the prompt. We describe the sampling process and the exact prompts that we use in Appendix C.
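
Prompt randomization of this kind can be sketched as follows. The variant pools below are hypothetical placeholders chosen for illustration; the actual sampling process and prompts are the ones given in Appendix C, which this sketch does not reproduce.

```python
import random

# Hypothetical variant pools (assumptions, not the prompts from Appendix C).
INSTRUCTIONS = ["Rewrite the sentence", "Repeat the sentence"]
ECHO_LABELS = ["rewritten sentence", "repeated sentence"]
SEPARATORS = [":", " -"]
CASINGS = [str.lower, str.capitalize, str.upper]

def sample_echo_template(rng):
    """Sample one echo prompt template with two `{text}` slots."""
    i = rng.randrange(len(INSTRUCTIONS))  # keep instruction and label paired
    sep = rng.choice(SEPARATORS)          # randomize punctuation
    case = rng.choice(CASINGS)            # randomize capitalization
    first = case(INSTRUCTIONS[i])
    second = case(ECHO_LABELS[i])
    return f"{first}{sep} {{text}}, {second}{sep} {{text}}"

rng = random.Random(0)                    # fixed seed for reproducibility
templates = [sample_echo_template(rng) for _ in range(5)]
prompt = templates[0].format(text="She loves summer")
```

Each sampled template can then be scored on a validation set, and the best-scoring one kept for evaluation.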

Baselines.

We compare our proposed echo embeddings (Section 3.2) to classical embeddings and two additional baselines:

  • Last-token embeddings: We mentioned in Section 3 that last-token embeddings tend to underperform in comparison to mean token embeddings, and thus we compare on real data.

  • Summarization: We also compare zero-shot embeddings obtained via the strategy proposed by Jiang et al. (2023b). Here, they instruct the model to summarize the input in a single word and then take the last token embedding $\phi_{-1}(x)$ as the pooled embedding of the sentence.

4.2 Constructing Finetuned Embeddings.

We adopt the conventional sentence embedding training setup (Reimers and Gurevych, 2019) where we train with a contrastive learning objective to encourage the embeddings of similar text to be close. We extract embeddings in a slightly different fashion compared to the zero-shot setting above in order to keep the finetuning methodology as similar as possible to the existing literature.

Extracting embeddings.

The training and evaluation data is separated into two categories: symmetric data, in which sentences are drawn from a single distribution (such as for sentence similarity), and asymmetric, in which the data consists of both queries and documents (such as for retrieval). We adopt a separate prompt for symmetric inputs and queries, and for documents. We construct classical embeddings by encoding text $S$ using the following prompts:

Queries & Symm.:
    Instruct: {instruction}
    Query: $S$

Documents:
    Document: $S$

For echo embeddings, we use the following prompts, where $S$ represents the input and $S' = S$:

Queries & Symm.:
    Instruct: {instruction}
    Query: $S$ Query again: $S'$

Documents:
    Document: $S$ Document again: $S'$

In this case, {instruction} refers to the task instruction, which specifies a description of the task that the embedding will be used for. We adopt the instructions from Wang et al. (2023), and provide a list of the instructions in Appendix D. We append an end-of-sentence token to the end of each input, and we allow the input embedding of this token to be trainable.
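
The prompt templates above can be assembled as in the following sketch. The line breaks between fields, the function names, and the example instruction are assumptions made for illustration; the instructions actually used are listed in Appendix D.

```python
def classical_prompt(text, instruction=None, is_document=False):
    """Classical finetuning prompt: the text S appears once."""
    if is_document:
        return f"Document: {text}"
    if instruction is None:
        raise ValueError("queries and symmetric inputs need an instruction")
    return f"Instruct: {instruction}\nQuery: {text}"

def echo_prompt(text, instruction=None, is_document=False):
    """Echo finetuning prompt: the text appears twice (S' = S)."""
    if is_document:
        return f"Document: {text}\nDocument again: {text}"
    if instruction is None:
        raise ValueError("queries and symmetric inputs need an instruction")
    return f"Instruct: {instruction}\nQuery: {text}\nQuery again: {text}"

# Placeholder instruction, not taken from the paper.
instruction = "Retrieve passages that answer the question"
query = echo_prompt("why is the sky blue", instruction)
document = echo_prompt("Rayleigh scattering makes the sky look blue.", is_document=True)
```

Only the prompt construction differs between the two settings, which is what makes the comparison between classical and echo embeddings controlled.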

Datasets.

We train on a collection of publicly available datasets that encompass both symmetric and asymmetric data that are standard training datasets in the embedding literature. We list and describe each of the datasets in Appendix D.

Optimization.

To finetune the model, we optimize the SimCSE loss with in-batch and mined hard negatives. Since this is standard, we defer discussion of this to Appendix D. Each batch is constructed by sampling a dataset from our set of training datasets, and then collecting examples from only this dataset. We use GradCache to train with a large batch size (2048) with limited GPU memory (Gao et al., 2021a). We train with LoRA instead of full finetuning, with $r = 16$ and $\alpha = 16$. We choose $\tau = 1/50$ and a learning rate of $8 \times 10^{-4}$. We use the Mistral-7B instruction-tuned model as a backbone (Jiang et al., 2023a). Our choices aim to be consistent with prior literature (Wang et al., 2023; Su et al., 2022; Zhang et al., 2023a).
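
For readers unfamiliar with this family of objectives, the sketch below shows a generic contrastive (InfoNCE-style) loss with in-batch negatives and one hard negative per query, using the temperature $\tau = 1/50$ stated above. It is an assumption-laden illustration: the exact loss is deferred to Appendix D, and details such as sharing every hard negative across the whole batch are our choice here, not a claim about the training code.

```python
import numpy as np

def contrastive_loss(queries, positives, hard_negatives, tau=1 / 50):
    """InfoNCE-style loss over cosine similarities scaled by 1 / tau."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    q, p, n = normalize(queries), normalize(positives), normalize(hard_negatives)
    candidates = np.vstack([p, n])     # (2B, d): positives first, then hard negatives
    logits = q @ candidates.T / tau    # (B, 2B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.arange(len(q))        # query i should select positive i
    return float(-log_probs[targets, targets].mean())

eye = np.eye(3)
loss_aligned = contrastive_loss(eye, eye, -eye)   # positives match their queries
loss_swapped = contrastive_loss(eye, -eye, eye)   # hard negatives match instead
```

A small temperature sharpens the softmax, so a batch whose positives are well aligned yields a loss near zero while a batch that favors the hard negatives is penalized heavily.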

4.3 Massive Text Embedding Benchmark

For evaluation, we use the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022). For this paper, we focus on the English-language subset of the benchmark. MTEB is composed of a collection of 56 datasets that are grouped into different embedding tasks: classification, clustering, pair classification, reranking, retrieval, sentence similarity (STS), and summarization. The goal is to construct general purpose embeddings that are useful for solving each of the tasks. More information about MTEB is specified in Appendix A.1.

For the fine-tuning setting, we evaluate on the entire English-language subset. In the zero-shot setting, for convenience, we only evaluate on a subset of MTEB. We describe this subset in Appendix D.

5 Experiments

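| Group | Strategy | Model | Pool | Classification | Pair Classification | Clustering | Retrieval | STS | Reranking | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Main results | Echo (ours) | Mistral 7B | Mean | 64.06 | 75.26 | 27.02 | 23.61 | 72.40 | 60.00 | 55.07 |
| Main results | Classical | Mistral 7B | Mean | 58.21 | 73.87 | 23.85 | 20.35 | 56.97 | 54.44 | 45.88 |
| Prior work | Summarization | Mistral 7B | Last | 66.01 | 81.82 | 26.48 | 19.13 | 70.13 | 66.24 | 54.96 |
| Ablations | Echo | Mistral 7B | Last | 63.11 | 57.93 | 12.82 | 2.97 | 39.14 | 47.35 | 36.60 |
| Ablations | Classical | Mistral 7B | Last | 58.23 | 46.64 | 13.51 | 2.60 | 33.97 | 46.51 | 32.52 |
| Ablations | Echo | LLaMA 7B | Mean | 61.64 | 66.29 | 25.11 | 16.12 | 66.18 | 56.35 | 50.26 |
| Ablations | Classical | LLaMA 7B | Mean | 56.61 | 68.46 | 23.22 | 18.63 | 56.49 | 53.26 | 44.65 |
| Ablations | Echo | LLaMA 13B | Mean | 64.65 | 74.57 | 25.72 | 26.58 | 72.20 | 62.68 | 55.60 |
| Ablations | Classical | LLaMA 13B | Mean | 58.50 | 65.06 | 24.22 | 18.92 | 57.47 | 56.38 | 45.15 |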
Table 1: Zero-shot scores on MTEB tasks for Mistral-7B. We use a retrieval validation set (FiQA2018) to select the best prompt. Refer to Appendix C for the scores with alternative validation sets. Top: Comparison of echo embeddings to classical embeddings. Center: Summarization approach to constructing embeddings (Jiang et al., 2023b). Bottom: Ablations, including last-token pooling and LLaMA-2-{7B, 13B}.

5.1 Evaluation of Zero-shot Embeddings

We compare the performance of classical, echo, and summarization embeddings on MTEB tasks (Table 1). We select the best prompt using a retrieval dataset from MTEB (FiQA2018) as the validation set, as described in Section 4.3. We report the scores using alternative validation sets in Appendix C.

Echo embeddings outperform classical embeddings zero-shot.

We see that echo embeddings outperform classical embeddings by a large margin: on average, by over 9 points for Mistral-7B (55.07 vs. 45.88). For Mistral-7B, this increase holds in every MTEB category, and the average improvement carries over across models (LLaMA-2 vs. Mistral) and across scale (7B vs. 13B). This demonstrates that echo embeddings can significantly improve the performance of embeddings on real data, suggesting that the failure mode of classical embeddings that we describe in Section 3 can affect performance on real data.

Qualitative comparison of classical and echo embeddings.

In Section 3, we demonstrate that classical embeddings overestimate the similarity between examples which are superficially similar based on tokens that appear early in the sequence. To build intuition that this applies to realistic data, we present the sentence pair from STSBenchmark, a sentence similarity task from MTEB, in which echo embeddings reduce error the most:

$x_1$: The best thing you can do is to know your stuff.
$x_2$: The best thing to do is to overcome the fussiness.

which has a ground-truth similarity score of 0 (out of 5). The sentence pair for which echo embeddings reduce error the least is:

$y_1$: Sometime if you really want it you might need to pay an agency to get the place for you.
$y_2$: You could probably get a tour agency to do it for you but it would cost you.

which has a ground-truth similarity of 2 (out of 5). We provide more examples in the Appendix Table 7.

For this example, notice that the sentence pair $(x_1, x_2)$ on which echo embeddings reduce error the most has exactly the property we identify as a failure mode for classical embeddings: the sentences are superficially similar in the first few tokens. On the other hand, $(y_1, y_2)$ does not have this property.
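The "superficially similar early tokens" property can be checked crudely by counting shared leading words. The sketch below is only an illustration: the helper `shared_prefix_len` is hypothetical, and whitespace-separated words stand in for model tokens, which a real tokenizer would split differently.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading whitespace-separated words that two sentences share."""
    count = 0
    for word_a, word_b in zip(a.lower().split(), b.lower().split()):
        if word_a != word_b:
            break
        count += 1
    return count

# Sentence pairs quoted in the text above.
x1 = "The best thing you can do is to know your stuff."
x2 = "The best thing to do is to overcome the fussiness."
y1 = "Sometime if you really want it you might need to pay an agency to get the place for you."
y2 = "You could probably get a tour agency to do it for you but it would cost you."

prefix_x = shared_prefix_len(x1, x2)  # both start with "The best thing"
prefix_y = shared_prefix_len(y1, y2)  # the opening words already differ
print(prefix_x, prefix_y)
```

Under this simplified measure, the first pair shares its opening words while the second pair shares none, which matches the qualitative observation.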

Quantitative evaluation of the failure mode.

The above example builds intuition that, even on real data, classical embeddings fail to properly estimate similarity on examples which are superficially similar in the early tokens. We quantitatively measure the degree to which classical and echo embeddings fail on sentences which are similar for early tokens, and for sentences which are not. We find that classical embeddings systematically fail on examples which exhibit this structure, while echo embeddings do not. For convenience, we defer the discussion of these experiments and the results to Appendix C.1.

Last-token vs mean token pooling.

We find that last-token embeddings are substantially worse than mean token embeddings in the zero-shot setting, despite the fact that, in principle, the last token in the sequence can encode information from all other tokens. In practice, the last token does not encode sufficient information to achieve strong performance on MTEB in the zero-shot setting.
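To make the two pooling strategies concrete, here is a minimal NumPy sketch. The array `hidden` is a made-up stand-in for final-layer token activations, `mask` marks which positions belong to the sentence, and the function names are illustrative; none of this is taken from the paper's released code.

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the token embeddings at positions where mask == 1."""
    weights = mask.astype(hidden.dtype)
    return (hidden * weights[:, None]).sum(axis=0) / weights.sum()

def last_token_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Return the embedding at the last position where mask == 1."""
    last_index = int(np.nonzero(mask)[0][-1])
    return hidden[last_index]

# Toy activations: 4 positions, 3 dimensions. The final row is padding.
hidden = np.array([[1.0, 0.0, 2.0],
                   [3.0, 1.0, 0.0],
                   [5.0, 2.0, 4.0],
                   [9.0, 9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

mean_emb = mean_pool(hidden, mask)        # average of the first three rows
last_emb = last_token_pool(hidden, mask)  # third row only
```

Mean pooling mixes every unmasked position, whereas last-token pooling relies on a single position to summarize the whole input.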

Echo embeddings vs summarization.

We find that the average performance across the tested MTEB datasets is similar between echo and summarization embeddings. Summarization does encourage the last token in the sequence to encode information about the entire sentence. However, we find that summarization is much more sensitive to the exact prompt, while echo embeddings are robust to such minor variations (see Figure 5 in Appendix C). We suspect that echo embeddings are more robust because they more directly encode bidirectional information into the embeddings.

5.2 Evaluation of Finetuned Embeddings

| Strategy | Model | Pool | Clas. | Clus. | P. Cls. | Rera. | Retr. | STS | Average |
|---|---|---|---|---|---|---|---|---|---|
| Main results: | | | | | | | | | |
| Echo (ours) | Mistral 7B | Last | 77.43 | 46.32 | 87.34 | 58.14 | 55.52 | 82.56 | 64.68 |
| Classical | Mistral 7B | Last | 76.57 | 45.78 | 86.37 | 56.71 | 54.87 | 82.03 | 63.98 |
| Prior work: | | | | | | | | | |
| | UAE-Large-V1 (MLM) | | 75.58 | 46.73 | 87.25 | 59.88 | 54.66 | 84.54 | 64.64 |
| | multilingual-e5-large (MLM) | | 77.56 | 47.10 | 86.19 | 58.58 | 52.47 | 84.78 | 64.41 |
| | bge-large-en-v1.5 (MLM) | | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 64.23 |
| | udever-bloom-7b (autoregr.) | | 72.13 | 40.81 | 85.4 | 55.91 | 49.34 | 83.01 | 60.63 |
| | sgpt-5.8b (autoregr.) | | 68.13 | 40.34 | 82.00 | 56.56 | 50.25 | 78.10 | 58.93 |
| | e5-mistral-7b⁵ (autoregr.) | | 78.47 | 50.26 | 88.34 | 60.21 | 56.89 | 84.63 | 66.63 |
| Ablations: | | | | | | | | | |
| Echo | Mistral 7B | Mean | 77.00 | 44.94 | 87.73 | 58.30 | 55.11 | 82.52 | 64.22 |
| Classical | Mistral 7B | Mean | 76.26 | 42.68 | 86.31 | 57.58 | 53.75 | 81.53 | 62.96 |
| Classical | Mistral 7B-bidir. | Last | 76.70 | 45.94 | 88.15 | 57.23 | 54.96 | 82.42 | 64.23 |

⁵ e5-mistral-7b was recently released and leverages high quality synthetic data to achieve strong performance which is not publicly released. We report their performance, but we do not explicitly compare to them (Wang et al., 2023).
Table 2: Finetuning scores on MTEB tasks. Top: Apples-to-apples comparison of echo and classical embeddings, both with last-token pooling and the same training setup. Center: Performance of recent open source embedding models, annotated by base model type (masked language model or autoregressive). Bottom: Ablations for finetuning: using mean token embeddings (first two lines) and using a bidirectional architecture (last line).

Different embeddings on the MTEB leaderboard are often fine-tuned on different datasets. In order to perform an apples-to-apples comparison between embedding strategies, we fine-tune both echo and classical embeddings on the exact same datasets (described in Section 4.2). We report the results in Table 2. This table also includes a comparison to prior state-of-the-art methods using masked language models (MLM) and autoregressive language models. Further, we evaluate a number of ablations to determine the role of pooling strategy and architecture.

Echo embeddings outperform classical embeddings after finetuning.

We observe that echo embeddings consistently outperform classical embeddings on each category even after finetuning. Hence, the fundamental gap we find between classical and echo embeddings in Section 3 and in our zero-shot experiments persists after fine-tuning.

Comparison to prior state-of-the-art models.

We present comparisons to both prior MLM-based embeddings and prior autoregressive-language-model embeddings, listing the open-source models from the MTEB leaderboard. It is striking that MLMs vastly outperformed autoregressive models until recently. Our classical embeddings outperform the previous-best autoregressive language model. This is a result of using the strongest public 7B parameter language model (Mistral) and more fine-tuning data. However, despite these choices, classical embeddings do not outperform prior MLM-based approaches, perhaps because MLMs encode bidirectional context, unlike classical embeddings from autoregressive models. Interestingly, echo embeddings allow us to close the gap and achieve state-of-the-art (on average) with an autoregressive model compared to prior open-sourced models on the leaderboard that used MLMs. A recent exception is the concurrent work by Wang et al. (2023), who use synthetic data to improve classical embeddings extracted from Mistral-7B. Their synthetic data is not publicly available, but the apples-to-apples comparison between classical and echo embeddings we performed suggests that echo embeddings could provide further gains over the numbers reported in Wang et al. (2023) when fine-tuning with synthetic data.

Why doesn’t last-token pooling close the gap?

Since classical last-token embeddings can attend to every other token, they do not necessarily suffer from the failure mode that we highlighted in Section 3. The last token does not reliably capture relevant information in a zero-shot setting, but this could have been bridged via fine-tuning. It is thus surprising that, even after finetuning last-token embeddings that could (in principle) encode any embedding function, echo embeddings outperform classical embeddings. We identify two hypotheses that may explain this performance gap: (1) While last-token embeddings can attend to every token, the intermediate representations of earlier tokens cannot. If last-token pooling derives information from the internal representations of earlier tokens, by attending to these representations, last-token classical embeddings may still suffer from the failure mode of the earlier tokens. (2) If the post-finetuning performance benefits from the model initialization point, last-token classical embeddings may suffer: in Section 5.1 we show that last-token embeddings achieve poor zero-shot performance. We leave it to future work to explore these hypotheses. We do, however, observe that the gap between last- and mean token echo embeddings is smaller than the gap between last- and mean token classical embeddings, suggesting that echo embeddings can especially improve the quality of mean token embeddings.
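The comparison of pooling gaps can be reproduced with a few lines of arithmetic on the Average column of Table 2. The dictionary layout and the helper `pooling_gap` are illustrative choices; only the four scores come from the table.

```python
# Average MTEB scores copied from Table 2 (fine-tuned Mistral-7B).
average = {
    ("echo", "last"): 64.68,
    ("echo", "mean"): 64.22,
    ("classical", "last"): 63.98,
    ("classical", "mean"): 62.96,
}

def pooling_gap(strategy: str) -> float:
    """Last-token average minus mean-token average for one strategy."""
    return round(average[(strategy, "last")] - average[(strategy, "mean")], 2)

echo_gap = pooling_gap("echo")
classical_gap = pooling_gap("classical")
print(f"echo gap: {echo_gap:.2f}, classical gap: {classical_gap:.2f}")
```

On these numbers the echo gap is roughly half the classical gap, which is the observation the paragraph relies on.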

Can we relax autoregressive language models to a bidirectional architecture and fine-tune?

To test the role of architecture, we finetune Mistral-7B with the same setup described in Section 4.2 but modify the architecture to remove the causal attention mask. While the initial weights are identical to Mistral-7B, this new model has bidirectional attention. We observe that the performance of bidirectional classical embeddings is better than that of our standard (causal) classical embeddings, but worse than that of echo embeddings. This suggests that the architecture alone is not sufficient to improve performance.
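As a rough illustration of what removing the causal mask changes, the sketch below builds both kinds of attention mask as boolean matrices. It is a schematic in NumPy, not the modification applied to Mistral-7B; the function name and the sequence length of 4 are assumptions made for the example.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Boolean matrix M where M[i, j] is True if position i may attend to j."""
    full = np.ones((seq_len, seq_len), dtype=bool)
    # A causal mask keeps only the lower triangle: j <= i.
    return np.tril(full) if causal else full

causal = attention_mask(4, causal=True)
bidirectional = attention_mask(4, causal=False)

# Under the causal mask, the first token can attend only to itself,
visible_to_first_causal = int(causal[0].sum())
# while without the mask it can attend to the whole sequence.
visible_to_first_bidir = int(bidirectional[0].sum())
```

The lower-triangular pattern is what prevents early-token representations from containing information about later tokens.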

6 Related Work

Sentence embeddings.

Dense low-dimensional vectors representing textual semantics have been widely studied and applied. Early approaches involved computing embeddings for individual words (Hinton, 1984; Rumelhart et al., 1986; Elman, 1990; Mikolov et al., 2013; Pennington et al., 2014). Later work aims to compute dense vectors representing the semantics of entire sequences by combining or composing word vectors (Le and Mikolov, 2014; Iyyer et al., 2015; Kiros et al., 2015; Socher et al., 2011; Tai et al., 2015; Wang et al., 2016; Wieting et al., 2015). Khattab and Zaharia (2020) propose to use late interaction between document and query vectors to improve retrieval performance. Reimers and Gurevych (2019) propose S-BERT, which takes a pretrained BERT (Devlin et al., 2018) and trains with a triplet loss on anchor sentences, semantically similar positive examples, and semantically dissimilar negative examples. More recent work typically adopts this approach with different pretrained models and a contrastive objective such as InfoNCE (Oord et al., 2018) or SimCSE (Gao et al., 2021b). Ni et al. (2021a) and Ni et al. (2021b) extend this approach to the T5 architecture (Raffel et al., 2020). Multiple papers use an additional unsupervised contrastive objective (Izacard et al., 2021; Wang et al., 2022). Other papers propose including prompts to improve task-specific embedding performance (Jiang et al., 2022; Su et al., 2022). Some work combines several of these training objectives and approaches (Xiao et al., 2023a; Li et al., 2023). Notably, except for the most recent approaches, nearly all embeddings were based upon bidirectional architectures that were often pretrained with a masked-language modeling objective.

Next-token language modeling for embeddings.

A series of papers aim to construct high-quality embeddings from autoregressive large language models. Multiple papers apply the fine-tuning approach of S-BERT to language models but using a trained GPT (Radford et al., 2018) as the backbone architecture (Muennighoff, 2022; Zhang et al., 2023a). Ma et al. (2023) adopt this approach but for LLaMA-2 (Touvron et al., 2023). Jiang et al. (2023b) extract embeddings by asking a language model to summarize the input sentence. Wang et al. (2023), concurrent to our work, improve embeddings by adding synthetic training data and train on Mistral (Jiang et al., 2023a).

Zero-shot embeddings.

Most recent sentence embedding research has focused on improving finetuning. Reimers and Gurevych (2019) demonstrate that, without finetuning, BERT produces low-quality embeddings. To our knowledge, Jiang et al. (2023b) is the only paper that constructs zero-shot embeddings for autoregressive language models.

7 Conclusion

We have compared classical and echo embeddings in a toy example, on real data in the zero-shot setting, and after finetuning. With the toy data, we identified a failure mode of autoregressive classical embeddings, which we have shown can be overcome with echo embeddings. Our result motivates the development of higher-quality embeddings, which are important in retrieval applications.

In addition, until recently, masked language models largely dominated the MTEB leaderboard, despite often having an order of magnitude fewer parameters, having been trained on substantially less data, and performing worse on other benchmarks of interest to the natural language processing community. While our results do not explicitly explain the surprising success of masked language models, they do suggest that next-token language models suffer from an inherent drawback that may have stifled their performance until they became performant enough to compensate for this shortcoming. We believe that our embedding strategy achieves the best of both worlds: we gain the capability of next-token language models while recovering from the failure mode that next-token language models do not encode information about future tokens in their contextualized token embeddings.

8 Limitations

Despite the success of echo embeddings, the method has limitations. First, while echo embeddings achieve superior performance to classical embeddings, they require double the inference cost, because two copies of the input sequence are passed to the model. Though this also doubles the training cost for a fixed number of training steps, we show in Appendix D that echo embeddings achieve improved performance even when matching compute. Second, we do not fully explain why echo embeddings improve over classical embeddings after finetuning even though there is no representational limitation. We leave it to future work to understand the exact underlying mechanisms for this improvement.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE2140739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This research was supported by the Center for AI Safety Compute Cluster. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

This work was supported in part by the AI2050 program at Schmidt Sciences (Grant #G2264481).

We gratefully acknowledge the support of Apple.

References

  • Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. Ms marco: A human generated machine reading comprehension dataset.
  • DataCanary et al. (2017) DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. 2017. Quora question pairs.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.
  • Fan et al. (2019) Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. Eli5: Long form question answering.
  • Gao et al. (2021a) Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021a. Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983.
  • Gao et al. (2021b) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021b. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  • Hinton (1984) Geoffrey E Hinton. 1984. Distributed representations.
  • Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pages 1681–1691.
  • Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
  • Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023a. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Jiang et al. (2023b) Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, and Fuzhen Zhuang. 2023b. Scaling sentence embeddings with large language models. arXiv preprint arXiv:2307.16645.
  • Jiang et al. (2022) Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. 2022. Promptbert: Improving bert sentence embeddings with prompts. arXiv preprint arXiv:2201.04337.
  • Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. 2020. Dense passage retrieval for open-domain question answering.
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. Advances in neural information processing systems, 28.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.
  • Li and Li (2023) Xianming Li and Jing Li. 2023. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871.
  • Li et al. (2023) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.
  • Ma et al. (2023) Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023. Fine-tuning llama for multi-stage text retrieval. arXiv preprint arXiv:2310.08319.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Muennighoff (2022) Niklas Muennighoff. 2022. Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
  • Muennighoff et al. (2022) Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. Mteb: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.
  • Ni et al. (2021a) Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. 2021a. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.
  • Ni et al. (2021b) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021b. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Qiu et al. (2022) Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. Dureader-retrieval: A large-scale chinese benchmark for passage retrieval from web search engine.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. nature, 323(6088):533–536.
  • Sclar et al. (2023) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.
  • Socher et al. (2011) Richard Socher, Eric Huang, Jeffrey Pennington, Christopher D Manning, and Andrew Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in neural information processing systems, 24.
  • Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  • Su et al. (2022) Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. 2022. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. Fever: a large-scale dataset for fact extraction and verification.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
  • Vanderkam et al. (2013) Dan Vanderkam, Rob Schonberger, Henry Rowley, and Sanjiv Kumar. 2013. Nearest neighbor search in google correlate. Technical report, Google.
  • Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
  • Wang et al. (2023) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368.
  • Wang et al. (2024) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.
  • Wang et al. (2016) Yashen Wang, He-Yan Huang, Chong Feng, Qiang Zhou, Jiahui Gu, and Xiong Gao. 2016. Cse: Conceptual sentence embeddings based on attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 505–515.
  • Wieting et al. (2015) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.
  • Xiao et al. (2023a) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023a. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597.
  • Xiao et al. (2023b) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023b. C-pack: Packaged resources to advance general chinese embedding.
  • Xie et al. (2023) Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, Haitao Li, Yiqun Liu, and Jin Ma. 2023. T2ranking: A large-scale chinese benchmark for passage ranking.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering.
  • Zhang et al. (2023a) Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min Zhang. 2023a. Language models are universal embedders. arXiv preprint arXiv:2310.08232.
  • Zhang et al. (2021) Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. tydi: A multi-lingual benchmark for dense retrieval.
  • Zhang et al. (2023b) Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023b. MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics, 11:1114–1131.

Appendix A General Information for Reproducibility

In this section, we include information that may aid reproducibility and that is not specific to any particular setting in the paper.

A.1 Massive Text Embedding Benchmark

The Massive Text Embedding Benchmark (MTEB) is a collection of datasets from seven categories: classification, clustering, pair classification, reranking, retrieval, sentence similarity (STS), and summarization. The leaderboard is published at https://huggingface.co/spaces/mteb/leaderboard. The list of datasets and their descriptions can be found in Appendix A of Muennighoff et al. (2022).

A.2 Base Model HuggingFace IDs

In this paper, we use the following models:

Appendix B Echo Embeddings: Additional Information

In this section, we describe the additional details that were omitted from Sections 2 and 3.

Cosine Similarity.

As discussed in Section 2, we often use the cosine similarity to measure the similarity between embeddings. Recall that given two sentences $x$ and $y$, we wish to determine the degree to which they are semantically similar. Cosine similarity,

$$\operatorname{Sim}(x, y) \coloneqq \frac{\langle \phi(x), \phi(y) \rangle}{\|\phi(x)\| \, \|\phi(y)\|}, \tag{1}$$

measures the similarity between the embeddings of $x$ and $y$ for any embedding function $\phi\colon \mathcal{X} \to R^{d}$. The cosine similarity is used for our experiments in Section 3, and as the similarity function for training in Section 5. All MTEB datasets use cosine similarity to compute similarity with the exception of the classification datasets, in which similarity is not explicitly measured, and the clustering datasets, which use Euclidean distance,

$$\operatorname{Sim}(x, y) \coloneqq \|\phi(x) - \phi(y)\|, \tag{2}$$

as a metric.
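Equations (1) and (2) translate directly into NumPy. The sketch below is a plain transcription under the assumption that embeddings are one-dimensional float arrays; the two-dimensional vectors `phi_x` and `phi_y` are made-up values chosen so the results are easy to verify by hand.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Equation (1): inner product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Equation (2): norm of the difference between the two embeddings."""
    return float(np.linalg.norm(u - v))

# Hypothetical embeddings of two sentences, each with norm 5.
phi_x = np.array([3.0, 4.0])
phi_y = np.array([4.0, 3.0])

cos = cosine_similarity(phi_x, phi_y)    # (12 + 12) / (5 * 5)
dist = euclidean_distance(phi_x, phi_y)  # sqrt(1 + 1)
```

Note that higher cosine similarity means more similar, while higher Euclidean distance means less similar, so the two quantities are ranked in opposite directions.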

Prompts for Section 3.

For these experiments, we only evaluate with a single prompting strategy. For classical embeddings, we encode a sentence $S$ using the prompt:

$x = \text{Write a sentence: } S$

We take the pooled embedding to be the mean token embedding $\phi_{S}(x)$. For echo embeddings, we encode a sentence $S$ using the prompt:

$x = \begin{array}{l} \text{Rewrite the following sentence: } S \\ \text{The rewritten sentence: } S' \end{array}$

where $S' = S$ and we let our pooled embedding be the mean token embedding $\phi_{S'}(x)$. We do not evaluate with the last-token pooling strategy in this section.

General Prompting Guidelines.

Throughout the paper, we use a variety of different prompts to construct embeddings. In Appendix C, we demonstrate that for zero-shot embeddings, the exact wording or template used as a prompting strategy does not have a strong effect on the performance on MTEB tasks, with the exception of the summarization approach. This implies, in general, that classical embeddings and echo embeddings should be robust to the exact choice of prompts. The important component of echo embeddings is instead the structure: the input text should appear twice when computing embeddings, and the embeddings should be taken over the second occurrence of the input text.

Example classical embedding structures:

Say the sentence: $S$

Write the phrase: $S$

Complete the query: $S$

Explain the text: $S$

Example echo embedding structures:

Repeat the sentence: $S$
The sentence again: $S'$

Rephrase the query: $S$
The query rephrased: $S'$

Fill in the blank: $S$
The blanks filled in: $S'$

Rewrite the text: $S$
The sentence rewritten: $S'$

Toy data.

We provide a subset of the toy data from Section 2. For Structure 1, the data is given in Table 4. For Structure 2, the data is given in Table 5. For Structure 3, the data is given in Table 6. In all cases, the data is generated by GPT4. The data from Structure 1 is generated from the following GPT4 prompt, and the other structures are generated from minor variations on this:

Together, we need to generate sentence triplets. Each triplet will have the following form:
- sentence 1 can be anything, be creative here.
- sentence 2 must represent something opposite to sentence 1,  however, it is important that the first half of the sentence is exactly the same as the first half of sentence 2. The only difference in wording can be in the second half of the sentence.
- sentence 3 should be extremely similar to sentence 1 and semantically equivalent, but slightly re-worded
Here is an example:
{
    "sentence1": "I like to eat apples and bananas but I really hate almost every other fruit.",
    "sentence2": "I like to eat apples and bananas and I also enjoy also every other fruit",
    "sentence3": "I like two fruits: apples and bananas but I hate nearly all fruits other than these.",
}
The first half of the sentence should be relatively short, less than 10 words, but the second half should be long, at least 10 words. Give more examples, and write them in json format. Be creative!

Appendix C Additional Zero-shot Results

In this section, we describe the omitted methodology and results for the zero-shot section.

Prompt sampling procedure.

Here we describe the prompt sampling procedure and then list the prompts that we use for the zero-shot experiments:

  1. Choose an instruction. For classical embeddings, we choose from {Write, Say, Complete, Explain}. For echo embeddings, we choose from {Repeat, Rewrite, Rephrase, Fill in the blank}. For summarization, we choose from {Summarize, Categorize, Understand, Analyze}.

  2. Choose a wording for the instruction. For example, if we chose “Say” as the instruction, then we would choose from {Say a sentence, Say a paragraph, Say something, Say a response, Say a query, Say a prompt}. For summarization, we also choose a second part of the wording, as the summarization strategy requires that the summary be in one word: {in one word, with a single word, succinctly with one word, in a unique one-word way, in a single word, in a word}.

  3. Choose a separator; the options include colons, commas, and newlines.

  4. Choose a prefix, which includes markers to indicate the first and second appearance of the input.

  5. Classical prompts have the form: “{instruction} {separator} {prefix} $S$”.

  6. Echo prompts have the form: “{instruction} {separator} {prefix0} $S$ {separator} {prefix1} $S'$”.

  7. Summarization prompts have the form: “{instruction0} {separator} {prefix} $S$ {instruction1} {separator}”.
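The sampling steps above can be sketched as follows for the classical and echo strategies. This is a minimal illustration under stated assumptions: the instruction sets come from the text, but the object wordings, separators, and prefix pairs are a small hypothetical subset, and the function name `sample_prompt_template` is not from the released code.

```python
import random

# Instruction sets as listed in step 1.
INSTRUCTIONS = {
    "classical": ["Write", "Say", "Complete", "Explain"],
    "echo": ["Repeat", "Rewrite", "Rephrase", "Fill in the blank"],
}
# Hypothetical, reduced option pools for steps 2-4.
OBJECTS = ["the sentence", "the query", "the prompt", "the paragraph"]
SEPARATORS = [":", ",", "\n"]
PREFIXES = [("Text A)", "Text B)"), ("query 1)", "query 2)")]


def sample_prompt_template(strategy: str, rng: random.Random) -> str:
    """Sample a template with {S} placeholders for the input text."""
    verb = rng.choice(INSTRUCTIONS[strategy])
    # "Fill in the blank" needs a preposition before its object.
    joiner = " in " if verb.endswith("blank") else " "
    instruction = verb + joiner + rng.choice(OBJECTS)
    separator = rng.choice(SEPARATORS)
    prefix0, prefix1 = rng.choice(PREFIXES)
    if strategy == "classical":
        return f"{instruction} {separator} {prefix0} {{S}}"
    # Echo: the input appears twice, each time after its own prefix.
    return (
        f"{instruction} {separator} {prefix0} {{S}} "
        f"{separator} {prefix1} {{S}}"
    )
```

A sampled template can then be filled with `template.format(S=sentence)`. Passing a seeded `random.Random` keeps the sampled prompts reproducible.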

For classical, we choose the prompts:

Write a sentence      I] S
Write a prompt!
 (I) S
Write some text
        PROMPT-S
SAY A PARAGRAPH | SENTENCE 0] S
Say a query     QUERY: S
Say a sentence!
 [A] S
COMPLETE THE PROMPT Text (1) S
Complete the query SENTENCE 0) S
Complete the sentence:-S
Explain a query     text 0 S
Explain a prompt | Sentence 1> S
EXPLAIN A SENTENCE     Prompt (1) S

For echo, we choose the prompts:

Repeat The Paragraph.
query 1) S.
query 2) S
Repeat the response.
 1) S.
AGAIN 2) S
REPEAT THE SENTENCE :: PROMPT
S :: RESPONSE
S
Rewrite the query | QUERY (A) S |  (B) S
Rewrite the text. SENTENCE A) S.  B) S
Rewrite the response | query A] S | query B] S
Rephrase the sentence:@S:Again@S
Rephrase The Sentence!
Text <> S!
Answer <> S
REPHRASE THE QUERY     Sentence a) S     Answer b) S
Fill in the blank in the prompt:
Query a) S:
Query b) S
FILL IN THE BLANK IN THE RESPONSE | Sentence A) S | Sentence B) S
Fill in the blank in the paragraph.
Text | S.
Response | S

For summarization, we use the prompts:

SUMMARIZE THE QUERY.
Prompt: SIN A WORD.
}
Summarize the sentence!
PROMPT <1> SSuccinctly With One Word!
}
SUMMARIZE THE PARAGRAPH. PROMPT (0) SIN A WORD.
CATEGORIZE THE PROMPT query
SWith a single word
Categorize the query | prompt [1] Sin a word |
CATEGORIZE THE SENTENCE.
Prompt <1> SIN A WORD.
}
Understand the sentence
        @SIn a single word
Understand The Prompt:QUERY [0] Sin a single word:}
UNDERSTAND THE PARAGRAPH:Text I] SSuccinctly with one word:}
Analyze the sentence.
Sentence SIn A Unique One-word Way.
}
Analyze the response! query a> SIN A UNIQUE ONE-WORD WAY!
Analyze The Prompt
        Sentence a> SIn a unique one-word way

Subset of MTEB for zero-shot evaluation.

We evaluate on the following subset of MTEB: FiQA2018, SCIDOCS, SciFact, NFCorpus, TwitterSemEval2015, TwitterURLCorpus, ImdbClassification, AmazonReviewsClassification, TweetSentimentExtractionClassification, MTOPDomainClassification, TwentyNewsgroupsClustering, BiorxivClusteringS2S, MedrxivClusteringS2S, StackOverflowDupQuestions, AskUbuntuDupQuestions, SciDocsRR, BIOSSES, STS12, STS13, STS14, STS15, STS16, STS17, STS22, STSBenchmark, and SICK-R.

Measuring the sensitivity of different embedding strategies to prompting.

We plot the sensitivity of classical, echo, and summarization embeddings to different choices of prompts for different models in Figures 5, 6, and 7. We also plot the results on each tested dataset individually in Figures 8, 9, and 10. We observe that summarization is highly sensitive to the exact prompt used. However, neither classical nor echo embeddings were particularly sensitive in any case. Consistently, mean token pooling outperformed last-token pooling by a large margin.

Evaluation of zero-shot results with different validation sets.

We include the zero-shot results of validation using different MTEB datasets. For validation, we select one dataset from each category, as follows: Classification: ImdbClassification; Pair Classification: TwitterSemEval2015; Clustering: TwentyNewsgroupsClustering; Retrieval: FiQA2018; STS: STSBenchmark; Reranking: StackOverflowDupQuestions. We report these results for different models in Tables 8, 9, and 10. We observe similar results across different validation sets, with minor variations in performance. In addition, we report the performance on each dataset when the prompts have been validated with FiQA2018 in Tables 11, 12, and 13.

Figure 4: We plot the histogram of the difference between the predicted rank and the ground-truth rank of sentence pairs in STS datasets. When the predicted rank is larger than the ground-truth rank (that is, when the rank difference is positive), the embedding has overestimated the similarity of this pair. Similarly, negative values imply that the rank is underestimated. We plot the distribution of these rank differences for both classical and echo embeddings, where we split the data into two groups: one in which sentences are similar in the first part of the sentence (top 10% by first-half similarity), and another in which sentences are similar in the second part of the sentence (top 10% by second-half similarity).

C.1 Validating the connection between our synthetic data experiments and real data.

In Section 3, we hypothesized that classical embeddings would overestimate similarity on sentences whose first halves are similar, and that echo embeddings would recover from this failure mode. In order to test this hypothesis, we extract a set of examples from the STS datasets included in the MTEB benchmark in which the first halves of the sentences are similar, and measure the degree to which the similarity is overestimated.

As a control, we also select pairs that are similar in the second half of the sentence and measure the degree to which their similarity is overestimated. By comparing the overestimation for pairs that are similar in the first half against the overestimation for pairs that are similar in the second half, we can identify whether classical embeddings overestimate similarity specifically for sentences that are similar in the first half. Thus, under our hypothesis, we expect that, for classical embeddings, the similarity of sentences that are similar in the first half is overestimated more than that of sentences that are similar in the second half. On the other hand, we expect that, for echo embeddings, the degree to which similarity is over- or underestimated is independent of whether the sentences are similar in the first or second half.

Identifying examples based on similarity in the first/second part of the sentence.

We aim to determine which sentences are most similar in the first half of the sentence or in the second half of the sentence. For each sentence pair $x, y$, we split the sentences in half by number of words, yielding $x = [x_1, x_2]$ and $y = [y_1, y_2]$. We select sentences which are most similar in the first half by using the off-the-shelf masked-language-model-based embedding model bge-base-en-v1.5 (Xiao et al., 2023b). To select sentences that are similar in the first half, we measure the cosine similarity $\operatorname{Sim}(x_1, y_1)$ and take the top 10% of sentence pairs $x, y$ which have the highest cosine similarity. Similarly, to select sentences which are similar in the second half, we collect the top 10% of examples by $\operatorname{Sim}(x_2, y_2)$. We collect examples from each of the STS datasets in MTEB.
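The selection procedure can be sketched as below. This is an illustrative outline, not the authors' code: the `embed` argument stands in for any sentence-embedding function (the text uses bge-base-en-v1.5), the function names are hypothetical, and sending the extra word of an odd-length sentence to the second half is an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple


def split_in_half(sentence: str) -> Tuple[str, str]:
    """Split a sentence into two halves by number of words."""
    words = sentence.split()
    mid = len(words) // 2  # assumption: odd word goes to the second half
    return " ".join(words[:mid]), " ".join(words[mid:])


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_fraction_by_half(
    pairs: Sequence[Tuple[str, str]],
    embed: Callable[[str], Sequence[float]],
    half: int,
    fraction: float = 0.1,
) -> List[int]:
    """Indices of the pairs most similar in the chosen half (0 or 1)."""
    scored = []
    for idx, (x, y) in enumerate(pairs):
        x_half = split_in_half(x)[half]
        y_half = split_in_half(y)[half]
        scored.append((cosine(embed(x_half), embed(y_half)), idx))
    keep = max(1, round(fraction * len(pairs)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored[:keep]]
```

Calling the function once with `half=0` and once with `half=1` yields the two groups compared in this section.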

Measuring sentence similarity estimation error.

We must determine the degree to which classical and echo embeddings overestimate similarity. The STS datasets contain sentence pairs which are ranked by similarity: the sentences which are most similar have the highest ground-truth ranking, and the least similar sentences have the lowest. We denote the ranking of sentence pair $i$ as $r_i$. We compute an estimated ranking $\{\hat{r}_i\}$ by ranking sentence pairs by the cosine similarity between their embeddings. We measure the error in our estimated ranking by taking the rank difference $\operatorname{Err}_i = \hat{r}_i - r_i$. When $\operatorname{Err}_i > 0$, we say that the $i$th sentence pair is overestimated in similarity, and similarly underestimated when $\operatorname{Err}_i < 0$.
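A small sketch of the rank-difference computation follows. It assumes ranks run from 0 (least similar) to $n-1$ (most similar) and that ties are broken by position in the input; the text specifies neither convention, and the function names are hypothetical.

```python
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank each item by score: 0 for the lowest, n-1 for the highest.

    Ties are broken by original position (sorted() is stable), which is
    an assumption of this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def rank_errors(
    predicted_similarity: Sequence[float],
    gold_similarity: Sequence[float],
) -> List[int]:
    """Err_i = predicted rank minus ground-truth rank for each pair."""
    predicted = ranks(predicted_similarity)
    gold = ranks(gold_similarity)
    return [p - g for p, g in zip(predicted, gold)]
```

A positive entry means the embedding ranks that pair as more similar than the ground truth does. Without ties, the errors over a dataset always sum to zero, so the interesting signal is how they distribute across subgroups.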

Results.

We plot the distribution over rank differences for sentences which are similar in the first half and sentences which are similar in the second half for echo and classical embeddings, from all STS datasets (Figure 4). We also highlight the means of the distributions. In accordance with our hypothesis, we observe that for classical embeddings, sentences which are similar in the first half are generally overestimated in similarity more than sentences which are similar in the second half of the sentence, suggesting that classical embeddings fail particularly on sentences that are similar in early tokens. Further, we generally observe no difference between the estimation error distributions for echo embeddings, which suggests that echo embeddings recover from this particular failure mode.

There are some notable counterexamples: BIOSSES does not exhibit this trend, but has few examples and thus the results may arise from noise alone. Further, STS22 exhibits identical distributions in estimation error between sentences which are similar in the first half and sentences which are similar in the second half, for both classical and echo embeddings. It is unclear why this trend fails to hold for STS22. Nonetheless, the trend holds for every other dataset, suggesting that the conceptual failure of classical embeddings that we identified in Section 3 generalizes to real data.

Qualitative examples.

In addition, we provide qualitative examples of sentence pairs from STSBenchmark where echo embeddings reduce error most, and where echo embeddings reduce error least, in comparison to classical embeddings. More precisely, we plot the top and bottom 7 examples ranked by $|\operatorname{Err}^{\text{classical}}_i| - |\operatorname{Err}^{\text{echo}}_i|$, where $\operatorname{Err}^{\text{classical}}_i$ represents the rank difference of the $i$th example of classical embeddings, and $\operatorname{Err}^{\text{echo}}_i$ is similar but for echo embeddings (Table 7).
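The ranking used to pick these examples can be sketched as follows, assuming the per-pair rank differences for both embedding types are already available as lists. The function name and the tie-breaking by input position are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def most_and_least_improved(
    err_classical: Sequence[int],
    err_echo: Sequence[int],
    k: int = 7,
) -> Tuple[List[int], List[int]]:
    """Return indices of the k most and k least improved examples.

    Improvement is |Err_classical| - |Err_echo|: large positive values
    mean echo embeddings reduced the absolute rank error the most.
    Assumes k >= 1.
    """
    gain = [abs(c) - abs(e) for c, e in zip(err_classical, err_echo)]
    order = sorted(range(len(gain)), key=lambda i: gain[i], reverse=True)
    most = order[:k]
    least = order[-k:][::-1]  # worst example first
    return most, least
```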

Appendix D Additional Finetuning Results

In this section, we address the omitted details from the finetuning results of the main paper.

Training Datasets.

We follow the setup of Wang et al. (2023), and use the following datasets: ELI5 (sample ratio 0.1) (Fan et al., 2019), HotpotQA (Yang et al., 2018), FEVER (Thorne et al., 2018), MIRACL (Zhang et al., 2023b), MS-MARCO passage ranking (sample ratio 0.5) and document ranking (sample ratio 0.2) (Bajaj et al., 2018), NQ (Karpukhin et al., 2020), NLI (Gao et al., 2021b), SQuAD (Karpukhin et al., 2020), TriviaQA (Karpukhin et al., 2020), Quora Duplicate Questions (sample ratio 0.1) (DataCanary et al., 2017), Mr-TyDi (Zhang et al., 2021), DuReader (Qiu et al., 2022), and T2Ranking (sample ratio 0.5) (Xie et al., 2023). We use approximately 1.5M training examples.

GPUs.

Training a model takes approximately two days on 4 A100 GPUs.

Instructions for finetuning datasets.

We also follow the setup of Wang et al. (2023), and use the instructions in Table 3. For evaluation, we use the instructions found in Table 14.

Models on MTEB leaderboard.

We compare our implementation of classical and echo embeddings to state-of-the-art approaches on MTEB. Namely, we display results for UAE-Large-V1 (Li and Li, 2023), multilingual-e5-large (Wang et al., 2024), bge-large-en-v1.5 (Xiao et al., 2023b), udever-bloom-7b (Zhang et al., 2023a), sgpt-5.8b (Muennighoff, 2022), and e5-mistral-7b (concurrent work) (Wang et al., 2023).

Additional ablations.

We plot additional ablations, including ablating the role of instructions during training and evaluation, as well as providing an evaluation at step 280 (out of 720 total steps), which is approximately $1/3$ of the duration of training (Table 15). We note that echo embeddings still outperform classical embeddings in this setting.

Performance over training time.

We plot the performance over the duration of training for a subset of MTEB tasks in Figure 11. Surprisingly, task performance decreases over training for many tasks.

Computational benefits of echo embeddings.

From Table 15, we observe that even after approximately $1/3$ of the total training duration (less than $1/2$), echo embeddings achieve higher performance than classical embeddings achieve after an entire epoch (Table 2). Echo embeddings require twice the computational cost of classical embeddings. However, this result suggests that despite this additional cost per embedding, training with echo embeddings can save on training costs by requiring less than half an epoch of training to outperform classical embeddings. Further, since each data point is only seen once, it implies that echo embeddings are much more data efficient than classical embeddings, which may be helpful when data is costly or difficult to acquire.
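As a rough back-of-the-envelope check of this argument, assume that training cost scales linearly with the number of steps and that each echo step costs exactly twice a classical step (both are simplifying assumptions, using the step counts reported for Table 15). The relative cost of the echo checkpoint at step 280, measured in units of one full classical epoch of 720 steps, is then

$$\frac{280}{720} \approx 0.39, \qquad 2 \times \frac{280}{720} = \frac{560}{720} \approx 0.78 < 1.$$

Under these assumptions, the echo model at step 280 would have used roughly 78% of the compute of one full classical epoch.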

All results.

We plot the results for every MTEB dataset for echo embeddings, for classical embeddings, and for bidirectional embeddings in Table 16.

| Dataset | Instruction |
| --- | --- |
| NLI | Given a premise, retrieve a hypothesis that is entailed by the premise |
| NLI | Retrieve semantically similar text |
| DuReader | Given a Chinese search query, retrieve web passages that answer the question |
| ELI5 | Provided a user question, retrieve the highest voted answers on Reddit ELI5 forum |
| FEVER | Given a claim, retrieve documents that support or refute the claim |
| HotpotQA | Given a multi-hop question, retrieve documents that can help answer the question |
| MIRACL | Given a question, retrieve Wikipedia passages that answer the question |
| MrTyDi | Given a question, retrieve Wikipedia passages that answer the question |
| MSMARCO Passage | Given a web search query, retrieve relevant passages that answer the query |
| MSMARCO Document | Given a web search query, retrieve relevant documents that answer the query |
| NQ | Given a question, retrieve Wikipedia passages that answer the question |
| QuoraDuplicates | Given a question, retrieve questions that are semantically equivalent to the given question |
| QuoraDuplicates | Find questions that have the same meaning as the input question |
| Squad | Retrieve Wikipedia passages that answer the question |
| T2Ranking | Given a Chinese search query, retrieve web passages that answer the question |
| TriviaQA | Retrieve Wikipedia passages that answer the question |

Table 3: Instructions for finetuning datasets.

Training objective.

For the training objective, we use the SimCSE loss (Gao et al., 2021b), defined as

$$\ell_i = -\log \frac{\exp\left(\operatorname{Sim}\left(h_i, h_i^{+}\right)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\operatorname{Sim}\left(h_i, h_j^{-}\right)/\tau\right)}. \tag{3}$$

In this loss function, $h_i$ represents a query (or a reference sentence when the data is symmetric), $h_i^{+}$ represents a positive example associated with $h_i$, $\{h_j^{-}\}_{j=1}^{N}$ represents the set of negatives associated with the example, including mined hard negatives, and $\tau$ is a temperature hyperparameter.
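For concreteness, the loss for a single example can be sketched in plain Python. The sketch follows Equation 3 as written here, with the denominator summing over the negatives only; some formulations of this loss also include the positive term in the denominator. The function names are hypothetical, and the default temperature is a placeholder rather than the value used in training.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    tau: float = 0.05,  # placeholder temperature, not taken from the paper
) -> float:
    """Per-example loss following Equation 3 as written in the text."""
    positive_logit = cosine(query, positive) / tau
    negative_logits = [cosine(query, n) / tau for n in negatives]
    # -log(exp(pos) / sum(exp(neg))) = -pos + log(sum(exp(neg)))
    return -positive_logit + math.log(sum(math.exp(l) for l in negative_logits))
```

The loss falls as the query moves closer to its positive and rises as any negative moves closer to the query. Because the positive is not in the denominator in this form, the value can be negative.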

| $q$ | $s^{-}$ | $s^{+}$ |
| --- | --- | --- |
| She loves to travel in summer, especially to cold destinations, avoiding hot and crowded places | She loves to travel in summer, but prefers to visit hot and bustling tourist spots | In summer, she adores traveling, specifically to chilly locations, steering clear of warm, populous areas |
| The cat often sits by the window, dreaming of chasing birds and enjoying the warm sunshine | The cat often sits by the window, but is too lazy to dream of chasing anything | Frequently, the cat lounges near the window, imagining bird pursuits and basking in the sunlight |
| He reads books every night, finding solace in fiction and escaping from the stresses of daily life | He reads books every night, yet he feels that non-fiction is more engaging and informative | Nightly, he immerses himself in books, seeking comfort in stories and evading everyday tensions |
| They play music loudly in the evening, filling their home with energetic beats and vibrant melodies | They play music loudly in the evening, but only soothing classical tunes to relax | In the evenings, they blast tunes, their house resonating with lively rhythms and bright harmonies |
| She paints landscapes on weekends, expressing her creativity through vibrant colors and abstract forms | She paints landscapes on weekends, preferring realistic and detailed depictions of nature | On weekends, she engages in landscape painting, showcasing her artistic flair with lively hues and unconventional shapes |
| The children eagerly await winter, dreaming of snowball fights and building snowmen | The children eagerly await winter, yet they dislike the cold and prefer staying indoors | During winter, the kids are excited, imagining snow battles and constructing snow figures |
| He often jokes at parties, becoming the center of attention with his witty humor | He often jokes at parties, but tends to alienate others with his sarcasm | At social gatherings, he frequently makes jokes, captivating the crowd with his clever wit |
| She collects antique vases, adoring their unique designs and historical significance | She collects antique vases, but is indifferent to their history and focuses on their resale value | Her hobby is gathering old vases, cherishing their distinct patterns and the stories they hold |
| The band plays rock music loudly, thrilling audiences with energetic performances and powerful lyrics | The band plays rock music loudly, but often receives complaints for being too noisy | Performing rock loudly, the band excites crowds with dynamic shows and impactful words |
| He prefers working at night, enjoying the quiet and focusing better without distractions | He prefers working at night, despite feeling more tired and less productive | Nighttime is his preferred work period, appreciating the tranquility and concentrated environment |
| She writes poetry in her free time, pouring her emotions and experiences into each verse | She writes poetry in her free time, but struggles to find inspiration and motivation | During her leisure, she crafts poems, infusing her feelings and life stories into every line |

Table 4: Examples of Structure 1 from Section 3

| $q$ | $s^{-}$ | $s^{+}$ |
| --- | --- | --- |
| On sunny days, I often find myself longing for the cool breeze of the ocean and the sound of waves crashing, as I enjoy outdoor activities | During rainy days, I usually prefer the warmth and quiet of my home, as I enjoy outdoor activities | When the sun is shining, I tend to crave the refreshing sea air and the rhythmic sound of the ocean, since I relish spending time outdoors |
| As a lover of classical music, I spend hours listening to Beethoven and Bach, reveling in the complexity of their compositions, though I’m fond of playing the guitar | Despite my preference for rock music, I rarely spend time on music other than playing my favorite tunes on the guitar, though I’m fond of playing the guitar | Being an enthusiast of classical melodies, I often indulge in lengthy sessions of Beethoven and Bach, appreciating the intricacies of their work, as I delight in guitar playing |
| In the world of literature, I have an insatiable appetite for mystery novels and spend countless evenings unraveling their plots, but I adore reading poetry | Contrary to my usual tastes, I rarely delve into mystery novels and prefer lighter reading materials, but I adore reading poetry | As a fervent reader, my passion lies in the twists and turns of mystery stories, which I often explore during long nights, yet I cherish reading poetry |
| Growing up in a bustling city, I’ve always been surrounded by the constant hum of activity and the bright city lights, which makes me appreciate quiet countryside walks | Having been raised in a tranquil rural area, I’m more accustomed to the sounds of nature and open fields, which makes me appreciate quiet countryside walks | Raised in the lively atmosphere of an urban environment, I’m used to the never-ending city noise and glowing nights, leading me to enjoy the serenity of rural strolls |
| Ever since I was a child, fascinated by the vastness of the universe, I would spend countless nights gazing at the stars through my telescope, dreaming of exploring distant galaxies, yet I still find solace in simple nature hikes | Though I’ve always been more interested in the immediate world around me, preferring to focus on the tangible and the present, I rarely look up at the night sky, yet I still find solace in simple nature hikes | From my early years, captivated by the infinity of space, I devoted many nights to star-gazing and imagining interstellar journeys, but I also enjoy the peace of nature walks |
| Growing up with a passion for culinary arts, experimenting with exotic ingredients and complex recipes, and often spending whole days in the kitchen perfecting new dishes, I also have a deep appreciation for classic literature | Despite my lack of interest in cooking and a preference for simple, quick meals that require minimal preparation, I’m not one to spend time in the kitchen, I also have a deep appreciation for classic literature | Since childhood, I’ve been enthusiastic about cooking, often trying out unusual ingredients and intricate recipes, dedicating entire days to refining my culinary creations, and I equally cherish classic literary works |

Table 5: Examples of Structure 2 from Section 3

| $q$ | $s^{-}$ | $s^{+}$ |
| --- | --- | --- |
| She loves to travel in summer, especially to cold destinations, avoiding hot and crowded places | She loves to travel in summer, but prefers to visit hot and bustling tourist spots | She loves to travel in summer, specifically to chilly locations, steering clear of warm, populous areas |
| The cat often sits by the window, dreaming of chasing birds and enjoying the warm sunshine | The cat often sits by the window, but is too lazy to dream of chasing anything | The cat often sits by the window, imagining bird pursuits and basking in the sunlight |
| He reads books every night, finding solace in fiction and escaping from the stresses of daily life | He reads books every night, yet he feels that non-fiction is more engaging and informative | He reads books every night, seeking comfort in stories and evading everyday tensions |
| They play music loudly in the evening, filling their home with energetic beats and vibrant melodies | They play music loudly in the evening, but only soothing classical tunes to relax | They play music loudly in the evening, their house resonating with lively rhythms and bright harmonies |
| She paints landscapes on weekends, expressing her creativity through vibrant colors and abstract forms | She paints landscapes on weekends, preferring realistic and detailed depictions of nature | She paints landscapes on weekends, showcasing her artistic flair with lively hues and unconventional shapes |
| The children eagerly await winter, dreaming of snowball fights and building snowmen | The children eagerly await winter, yet they dislike the cold and prefer staying indoors | The children eagerly await winter, imagining snow battles and constructing snow figures |
| He often jokes at parties, becoming the center of attention with his witty humor | He often jokes at parties, but tends to alienate others with his sarcasm | He often jokes at parties, captivating the crowd with his clever wit |
| She collects antique vases, adoring their unique designs and historical significance | She collects antique vases, but is indifferent to their history and focuses on their resale value | She collects antique vases, cherishing their distinct patterns and the stories they hold |
| The band plays rock music loudly, thrilling audiences with energetic performances and powerful lyrics | The band plays rock music loudly, but often receives complaints for being too noisy | The band plays rock music loudly, the band excites crowds with dynamic shows and impactful words |
| He prefers working at night, enjoying the quiet and focusing better without distractions | He prefers working at night, despite feeling more tired and less productive | He prefers working at night, appreciating the tranquility and concentrated environment |
| She writes poetry in her free time, pouring her emotions and experiences into each verse | She writes poetry in her free time, but struggles to find inspiration and motivation | She writes poetry in her free time, infusing her feelings and life stories into every line |

Table 6: Examples of Structure 3 from Section 3
Most improved (left)

Sentence 1: The best thing you can do is to know your stuff.
Sentence 2: The best thing to do is to overcome the fussiness.
Score: 0.0

Sentence 1: It really doesn’t matter.
Sentence 2: It doesn’t matter unless it is really far off.
Score: 3.0

Sentence 1: I think it’s fine to ask this question.
Sentence 2: I think it is okay to ask the question.
Score: 5.0

Sentence 1: What kind of insulation is it?
Sentence 2: What kind of floors are above?
Score: 0.0

Sentence 1: It depends entirely on your company and your contract.
Sentence 2: I guess it depends on the nature of your contract.
Score: 4.0

Sentence 1: You need to read a lot to know what you like and what you don’t.
Sentence 2: You have to know what you want to do.
Score: 0.0

Sentence 1: I would say you can do it, but it wouldn’t be advised.
Sentence 2: Personally, I would say not unless it suits you.
Score: 2.0

Least improved (right)

Sentence 1: Sometime if you really want it you might need to pay an agency to get the place for you.
Sentence 2: You could probably get a tour agency to do it for you but it would cost you.
Score: 2.0

Sentence 1: There are three options:
Sentence 2: There are only three options:
Score: 5.0

Sentence 1: Bremer said one initiative is to launch a US$70 million nationwide program in the next two weeks to clean up neighborhoods and build community projects.
Sentence 2: Bremer said he would launch a $70-million program in the next two weeks to clean up neighborhoods across Iraq and build community projects, but gave no details.
Score: 3.6

Sentence 1: "Tony’s not feeling well," Spurs coach Gregg Popovich said.
Sentence 2: We’re thrilled to be up 3-2,” Coach Gregg Popovich said Wednesday.
Score: 1.6

Sentence 1: Shares of Mandalay closed down eight cents to $29.42, before the earnings were announced.
Sentence 2: Shares of Mandalay closed down 8 cents at $29.42 Thursday.
Score: 4.0

Sentence 1: Singapore reported no suspected SARS cases Wednesday, but officials quarantined 70 people who had contact with the Taiwanese patient.
Sentence 2: Still, Singapore quarantined 70 people who had been in close contact with the scientist.
Score: 3.0

Sentence 1: The dollar was at 117.85 yen against the Japanese currency, up 0.1 percent.
Sentence 2: Against the Swiss franc the dollar was at 1.3289 francs, up 0.5 percent on the day.
Score: 1.333
Table 7: Example sentence pairs from STSBenchmark on which zero-shot echo embeddings with Mistral-7B improve the most (left) and the least (right).
Figure 5: Variance over different prompting strategies for zero-shot Mistral-7B.
Figure 6: Variance over different prompting strategies for zero-shot LLaMa-2-7B.
Figure 7: Variance over different prompting strategies for zero-shot LLaMa-2-13B.
Figure 8: Variance over different prompting strategies for all evaluated datasets for zero-shot Mistral-7B.
Figure 9: Variance over different prompting strategies for all evaluated datasets for zero-shot LLaMa-2-7B.
Figure 10: Variance over different prompting strategies for all evaluated datasets for zero-shot LLaMa-2-13B.
Validation Dataset Classification Pair Classification Clustering Retrieval STS Reranking Average
Classical
Classification 59.20 73.80 24.16 20.57 58.59 54.54 46.79
Pair Classification 58.73 71.40 24.32 20.39 59.00 54.42 46.64
Clustering 58.23 72.62 23.90 18.64 56.68 54.82 45.37
Retrieval 58.21 73.87 23.85 20.35 56.97 54.44 45.88
STS 58.31 44.03 13.07 2.63 38.95 46.77 34.63
Reranking 58.13 71.77 24.20 20.23 58.59 54.89 46.43
Echo
Classification 64.50 74.65 25.93 22.52 73.81 59.41 55.57
Pair Classification 64.15 75.93 22.25 18.35 72.75 58.47 54.15
Clustering 61.54 71.04 26.32 15.88 68.18 60.27 51.81
Retrieval 64.06 75.26 27.02 23.61 72.40 60.00 55.07
STS 64.50 74.65 25.93 22.52 73.81 59.41 55.57
Reranking 64.15 75.93 22.25 18.35 72.75 58.47 54.15
Summarization
Classification 66.62 78.95 21.79 14.68 72.13 64.24 55.22
Pair Classification 66.62 78.95 21.79 14.68 72.13 64.24 55.22
Clustering 66.66 79.59 28.08 11.88 67.30 65.19 53.43
Retrieval 66.01 81.82 26.48 19.13 70.13 66.24 54.96
STS 66.01 81.82 26.48 19.13 70.13 66.24 54.96
Reranking 63.19 75.22 26.09 20.52 65.98 59.05 51.55
Table 8: Scores for additional zero-shot validation datasets on Mistral-7B.
Validation Dataset Classification Pair Classification Clustering Retrieval STS Reranking Average
Classical
Classification 57.59 68.65 23.72 18.06 57.19 54.59 45.14
Pair Classification 57.56 70.18 23.51 18.54 58.24 54.40 45.79
Clustering 57.14 69.91 23.35 16.98 57.66 55.38 45.25
Retrieval 56.61 68.46 23.22 18.63 56.49 53.26 44.65
STS 57.56 70.18 23.51 18.54 58.24 54.40 45.79
Reranking 56.65 66.54 22.46 10.48 55.97 54.44 42.98
Echo
Classification 62.24 67.96 23.60 14.33 65.79 55.44 49.85
Pair Classification 63.42 72.52 21.11 17.35 68.16 54.98 51.47
Clustering 60.12 66.74 23.45 11.60 64.45 56.31 48.75
Retrieval 61.64 66.29 25.11 16.12 66.18 56.35 50.26
STS 63.15 68.74 23.65 16.38 69.37 57.75 51.96
Reranking 62.30 74.23 24.69 18.17 65.07 56.76 50.51
Summarization
Classification 63.96 77.93 21.89 15.93 67.07 63.39 52.34
Pair Classification 63.96 77.93 21.89 15.93 67.07 63.39 52.34
Clustering 61.60 69.47 24.44 5.28 57.53 57.62 45.85
Retrieval 64.90 78.74 26.63 15.59 70.15 65.43 54.02
STS 64.90 78.74 26.63 15.59 70.15 65.43 54.02
Reranking 60.54 69.73 26.40 15.82 61.60 58.80 47.83
Table 9: Scores for additional zero-shot validation datasets on LLaMa-2-7B.
Validation Dataset Classification Pair Classification Clustering Retrieval STS Reranking Average
Classical
Classification 58.24 71.65 23.91 21.79 58.74 56.37 46.66
Pair Classification 58.10 73.30 23.01 16.97 57.83 56.17 45.52
Clustering 58.61 67.47 23.30 15.51 57.93 56.86 45.05
Retrieval 58.50 65.06 24.22 18.92 57.47 56.38 45.15
STS 58.24 71.65 23.91 21.79 58.74 56.37 46.66
Reranking 58.61 67.47 23.30 15.51 57.93 56.86 45.05
Echo
Classification 64.15 74.22 25.02 27.58 70.81 61.43 55.02
Pair Classification 64.57 77.63 22.56 24.08 69.85 59.89 53.55
Clustering 63.26 73.50 25.10 27.48 69.04 61.81 54.32
Retrieval 64.65 74.57 25.72 26.58 72.20 62.68 55.60
STS 63.16 75.98 24.08 27.56 71.00 61.84 54.85
Reranking 62.90 70.58 25.53 22.11 68.82 62.38 53.02
Summarization
Classification 66.02 79.06 26.47 22.20 67.91 64.90 54.52
Pair Classification 66.02 79.06 26.47 22.20 67.91 64.90 54.52
Clustering 63.84 71.98 21.99 7.48 56.96 59.41 46.50
Retrieval 66.02 79.06 26.47 22.20 67.91 64.90 54.52
STS 66.02 79.06 26.47 22.20 67.91 64.90 54.52
Reranking 61.19 69.63 26.38 19.62 60.76 62.79 48.36
Table 10: Scores for additional zero-shot validation datasets on LLaMa-2-13B.
Dataset Classical Echo Summarization
FiQA2018 (retrieval) 7.89 12.74 12.43
SCIDOCS (retrieval) 3.60 4.88 9.97
SciFact (retrieval) 45.39 49.36 29.90
NFCorpus (retrieval) 12.07 16.57 17.51
TwitterSemEval20. (pair_classification) 47.81 62.49 59.79
TwitterURLCorpus (pair_classification) 73.87 75.26 81.82
ImdbClassificati. (classification) 72.50 72.02 82.78
AmazonReviewsCla. (classification) 37.09 40.72 45.58
TweetSentimentEx. (classification) 53.70 58.76 61.74
MTOPDomainClassi. (classification) 83.85 92.71 90.72
TwentyNewsgroups. (clustering) 20.84 29.48 30.11
BiorxivClusterin. (clustering) 23.47 27.61 27.21
MedrxivClusterin. (clustering) 24.23 26.42 25.75
StackOverflowDup. (reranking) 35.85 42.71 40.32
AskUbuntuDupQues. (reranking) 49.49 54.09 57.17
SciDocsRR (reranking) 59.38 65.91 75.30
BIOSSES (sts) 59.05 78.19 66.06
STS12 (sts) 42.01 58.43 64.62
STS13 (sts) 59.66 78.53 78.45
STS14 (sts) 50.69 68.42 71.00
STS15 (sts) 61.81 78.82 78.29
STS16 (sts) 57.03 77.52 77.40
STS17 (sts) 68.08 82.14 78.80
STS22 (sts) 61.23 57.60 47.07
STSBenchmark (sts) 47.55 73.85 77.39
SICK-R (sts) 53.19 71.95 69.48
Average 45.88 55.07 54.96
Table 11: Zero-shot results on all evaluated MTEB datasets for Mistral-7B.
Dataset Classical Echo Summarization
FiQA2018 (retrieval) 6.48 12.38 9.00
SCIDOCS (retrieval) 3.72 4.38 8.33
SciFact (retrieval) 42.18 30.61 23.01
NFCorpus (retrieval) 10.01 13.38 15.43
TwitterSemEval20. (pair_classification) 44.11 54.66 54.27
TwitterURLCorpus (pair_classification) 68.46 66.29 78.74
ImdbClassificati. (classification) 71.65 73.11 85.83
AmazonReviewsCla. (classification) 36.16 40.68 44.77
TweetSentimentEx. (classification) 52.04 54.85 59.96
MTOPDomainClassi. (classification) 81.63 89.38 89.97
TwentyNewsgroups. (clustering) 15.88 23.42 32.28
BiorxivClusterin. (clustering) 23.13 25.92 27.79
MedrxivClusterin. (clustering) 23.31 24.30 25.48
StackOverflowDup. (reranking) 35.57 40.82 35.63
AskUbuntuDupQues. (reranking) 48.51 51.42 56.09
SciDocsRR (reranking) 58.01 61.29 74.76
BIOSSES (sts) 65.31 71.96 68.04
STS12 (sts) 41.84 52.40 60.20
STS13 (sts) 58.43 72.40 76.31
STS14 (sts) 49.21 61.24 68.73
STS15 (sts) 60.03 72.67 75.59
STS16 (sts) 56.40 73.51 76.71
STS17 (sts) 62.31 71.87 79.38
STS22 (sts) 59.48 55.21 55.69
STSBenchmark (sts) 49.45 65.73 76.42
SICK-R (sts) 55.35 64.39 70.69
Average 44.65 50.26 54.02
Table 12: Zero-shot results on all evaluated MTEB datasets for LLaMa-2-7B.
Dataset Classical Echo Summarization
FiQA2018 (retrieval) 8.31 18.07 9.43
SCIDOCS (retrieval) 4.87 7.56 10.38
SciFact (retrieval) 41.64 50.55 40.19
NFCorpus (retrieval) 10.26 21.63 16.02
TwitterSemEval20. (pair_classification) 42.43 62.85 59.55
TwitterURLCorpus (pair_classification) 65.06 74.57 79.06
ImdbClassificati. (classification) 71.82 75.44 91.86
AmazonReviewsCla. (classification) 37.88 43.25 50.60
TweetSentimentEx. (classification) 52.95 58.18 59.93
MTOPDomainClassi. (classification) 84.67 92.52 87.51
TwentyNewsgroups. (clustering) 17.21 25.98 32.08
BiorxivClusterin. (clustering) 24.95 26.75 28.30
MedrxivClusterin. (clustering) 23.49 24.70 24.64
StackOverflowDup. (reranking) 37.24 44.86 38.44
AskUbuntuDupQues. (reranking) 50.74 55.21 54.15
SciDocsRR (reranking) 62.03 70.15 75.65
BIOSSES (sts) 63.26 77.60 69.33
STS12 (sts) 51.80 59.36 51.17
STS13 (sts) 61.59 79.01 76.08
STS14 (sts) 49.69 69.75 66.62
STS15 (sts) 58.48 79.86 73.75
STS16 (sts) 53.18 76.75 77.40
STS17 (sts) 65.10 80.41 75.88
STS22 (sts) 59.00 56.84 49.23
STSBenchmark (sts) 44.80 71.31 75.17
SICK-R (sts) 55.13 70.27 71.70
Average 45.15 55.60 54.52
Table 13: Zero-shot results on all evaluated MTEB datasets for LLaMa-2-13B.

AmazonCounterfactualCls.

Classify a given Amazon customer review text as either counterfactual or not counterfactual

AmazonPolarityCls.

Classify Amazon reviews into positive or negative sentiment

AmazonReviewsCls.

Classify the given Amazon review into its appropriate rating category

Banking77Cls.

Given a online banking query, find the corresponding intents

EmotionCls.

Classify the emotion expressed in the given Twitter message into one of the six emotions: anger, fear, joy, love, sadness, and surprise

ImdbCls.

Classify the sentiment expressed in the given movie review text from the IMDB dataset

MassiveIntentCls.

Given a user utterance as query, find the user intents

MassiveScenarioCls.

Given a user utterance as query, find the user scenarios

MTOPDomainCls.

Classify the intent domain of the given utterance in task-oriented conversation

MTOPIntentCls.

Classify the intent of the given utterance in task-oriented conversation

ToxicConversationsCls.

Classify the given comments as either toxic or not toxic

TweetSentimentExtractionCls.

Classify the sentiment of a given tweet as either positive, negative, or neutral

ArxivClusteringP2P

Identify the main and secondary category of Arxiv papers based on the titles and abstracts

ArxivClusteringS2S

Identify the main and secondary category of Arxiv papers based on the titles

BiorxivClusteringP2P

Identify the main category of Biorxiv papers based on the titles and abstracts

BiorxivClusteringS2S

Identify the main category of Biorxiv papers based on the titles

MedrxivClusteringP2P

Identify the main category of Medrxiv papers based on the titles and abstracts

MedrxivClusteringS2S

Identify the main category of Medrxiv papers based on the titles

RedditClustering

Identify the topic or theme of Reddit posts based on the titles

RedditClusteringP2P

Identify the topic or theme of Reddit posts based on the titles and posts

StackExchangeClustering

Identify the topic or theme of StackExchange posts based on the titles

StackExchangeClusteringP2P

Identify the topic or theme of StackExchange posts based on the given paragraphs

TwentyNewsgroupsClustering

Identify the topic or theme of the given news articles

SprintDuplicateQuestions

Retrieve duplicate questions from Sprint forum

TwitterSemEval2015

Retrieve tweets that are semantically similar to the given tweet

TwitterURLCorpus

Retrieve tweets that are semantically similar to the given tweet

AskUbuntuDupQuestions

Retrieve duplicate questions from AskUbuntu forum

MindSmallReranking

Retrieve relevant news articles based on user browsing history

SciDocsRR

Given a title of a scientific paper, retrieve the titles of other relevant papers

StackOverflowDupQuestions

Retrieve duplicate questions from StackOverflow forum

ArguAna

Given a claim, find documents that refute the claim

ClimateFEVER

Given a claim about climate change, retrieve documents that support or refute the claim

CQADupstackAndroidRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackEnglishRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackGamingRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackGisRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackMathematicaRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackPhysicsRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackProgrammersRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackStatsRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackTexRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackUnixRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackWebmastersRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

CQADupstackWordpressRetr.

Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question

DBPedia

Given a query, retrieve relevant entity descriptions from DBPedia

FEVER

Given a claim, retrieve documents that support or refute the claim

FiQA2018

Given a financial question, retrieve user replies that best answer the question

HotpotQA

Given a multi-hop question, retrieve documents that can help answer the question

MSMARCO

Given a web search query, retrieve relevant passages that answer the query

NFCorpus

Given a question, retrieve relevant documents that best answer the question

NQ

Given a question, retrieve Wikipedia passages that answer the question

QuoraRetr.

Given a question, retrieve questions that are semantically equivalent to the given question

SCIDOCS

Given a scientific paper title, retrieve paper abstracts that are cited by the given paper

SciFact

Given a scientific claim, retrieve documents that support or refute the claim

Touche2020

Given a question, retrieve detailed and persuasive arguments that answer the question

TRECCOVID

Given a query on COVID-19, retrieve documents that answer the query

BIOSSES

Retrieve semantically similar text

SICK-R

Retrieve semantically similar text

STS12

Retrieve semantically similar text

STS13

Retrieve semantically similar text

STS14

Retrieve semantically similar text

STS15

Retrieve semantically similar text

STS16

Retrieve semantically similar text

STS17

Retrieve semantically similar text

STS22

Retrieve semantically similar text

STSBenchmark

Retrieve semantically similar text

SummEval

Given a news summary, retrieve other semantically similar summaries

Table 14: MTEB instructions for evaluation of finetuned models.
Figure 11: Performance on the evaluated MTEB datasets as a function of the number of finetuning steps.
Model Average Clas. Clus. Pair Clas. Rera. Retr. STS Summ.
Classical (w/ instruct., mean) 62.96 76.26 42.68 86.31 57.58 53.75 81.53 30.19
Classical (w/ instruct., last) 63.98 76.57 45.78 86.37 56.71 54.87 82.03 31.02
Echo (w/ instruct., mean) 64.22 77.00 44.94 87.73 58.30 55.11 82.52 29.46
Echo (w/ instruct., last) 64.68 77.43 46.32 87.34 58.14 55.52 82.56 30.73
Classical (w/out instruct., mean) 62.19 75.23 41.79 85.24 56.31 53.24 80.97 30.64
Classical (w/out instruct., last) 62.37 75.01 42.70 85.69 56.64 53.29 80.92 30.91
Echo (w/out instruct., mean) 63.28 75.26 42.93 86.95 57.05 55.65 81.40 30.62
Echo (w/out instruct., last) 62.80 75.30 42.94 86.31 57.31 54.18 80.92 31.00
Classical (w/ instruct., mean, step 280) 63.19 76.18 42.99 85.44 57.63 53.96 82.53 29.94
Classical (w/ instruct., last, step 280) 63.87 76.54 46.22 86.70 57.79 53.73 82.22 30.13
Echo (w/ instruct., mean, step 280) 64.04 76.84 45.76 87.72 59.33 53.55 82.64 30.33
Echo (w/ instruct., last, step 280) 64.50 76.41 46.70 87.17 59.10 54.84 82.98 31.09
Table 15: Additional ablations for finetuning.
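The "mean" and "last" labels in Table 15 refer to how per-token embeddings are pooled into a single vector. The following is a minimal sketch of the two strategies, not the released implementation: it assumes the token vectors have already been computed and that the span `[start, end)` marks the tokens of the second occurrence of the input, as echo embeddings require. The function names `mean_pool` and `last_pool` and the toy values are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def _check_span(n_tokens: int, start: int, end: int) -> None:
    # The span is half-open: it covers token positions start .. end - 1.
    if not 0 <= start < end <= n_tokens:
        raise ValueError(f"invalid span [{start}, {end}) for {n_tokens} tokens")


def mean_pool(token_vectors: Sequence[Vector], start: int, end: int) -> Vector:
    """Average the token vectors at positions start .. end - 1."""
    _check_span(len(token_vectors), start, end)
    span = token_vectors[start:end]
    dim = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(dim)]


def last_pool(token_vectors: Sequence[Vector], start: int, end: int) -> Vector:
    """Return a copy of the vector of the final token in the span."""
    _check_span(len(token_vectors), start, end)
    return list(token_vectors[end - 1])


# Toy example (hypothetical values): four token vectors of dimension 2.
# Positions 0-1 stand in for the first occurrence of the input and
# positions 2-3 for the second occurrence, which is the one pooled here.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 6.0]]

echo_mean = mean_pool(hidden, 2, 4)
echo_last = last_pool(hidden, 2, 4)
```

Under the same assumptions, a classical embedding would pool over the tokens of a single, unrepeated copy of the input, so the only difference in this sketch is which span is passed in.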
Dataset Repetition (last) Repetition (mean) Classical (last) Classical (mean) Bidirectional (last)
AmazonCounterfactualClassification 82.97 82.91 80.82 82.21 83.07
AmazonPolarityClassification 90.98 88.25 92.55 90.37 90.83
AmazonReviewsClassification 48.71 49.41 48.75 46.76 47.94
Banking77Classification 88.15 88.06 87.95 87.69 88.17
EmotionClassification 52.18 51.51 50.66 49.23 52.09
ImdbClassification 87.42 84.80 83.18 82.53 83.02
MassiveIntentClassification 79.67 79.70 78.60 79.15 78.93
MassiveScenarioClassification 82.82 82.74 81.71 81.46 81.80
MTOPDomainClassification 96.16 96.10 95.92 95.54 96.14
MTOPIntentClassification 85.75 85.87 85.96 85.86 85.98
ToxicConversationsClassification 71.91 72.21 71.19 72.21 71.46
TweetSentimentExtractionClassification 62.40 62.46 61.60 62.07 60.97
ArxivClusteringP2P 47.02 45.52 46.73 45.80 47.03
ArxivClusteringS2S 43.52 42.32 43.99 40.73 42.14
BiorxivClusteringP2P 35.53 35.24 36.50 35.42 36.21
BiorxivClusteringS2S 35.34 33.70 34.87 32.03 34.77
MedrxivClusteringP2P 30.27 29.68 30.67 29.74 31.06
MedrxivClusteringS2S 29.67 27.73 29.75 27.97 30.12
RedditClustering 61.77 59.12 61.17 54.79 62.50
RedditClusteringP2P 66.01 65.44 64.84 63.68 65.45
StackExchangeClustering 72.04 71.21 71.87 66.99 71.58
StackExchangeClusteringP2P 35.29 34.07 33.08 31.47 34.98
TwentyNewsgroupsClustering 53.04 50.29 50.07 40.91 49.53
SprintDuplicateQuestions 94.59 95.05 94.38 95.29 96.26
TwitterSemEval2015 79.93 80.73 77.18 75.98 80.80
TwitterURLCorpus 87.50 87.40 87.56 87.67 87.38
AskUbuntuDupQuestions 64.13 64.44 62.24 63.32 62.65
MindSmallReranking 32.92 32.11 32.68 32.52 32.53
SciDocsRR 83.68 84.15 81.60 83.01 82.36
StackOverflowDupQuestions 51.84 52.51 50.33 51.48 51.35
ArguAna 58.52 56.52 57.22 51.14 57.27
ClimateFEVER 34.56 37.07 31.10 30.31 32.73
CQADupstackRetrieval 46.91 46.48 45.11 43.30 46.52
DBPedia 46.83 48.19 45.18 46.80 46.76
FEVER 91.22 91.14 90.30 90.63 91.66
FiQA2018 54.51 54.11 50.31 48.94 53.06
HotpotQA 76.41 75.75 72.95 68.50 75.30
MSMARCO 43.25 43.11 42.31 41.49 43.38
NFCorpus 39.55 37.18 39.32 38.53 38.61
NQ 62.31 61.51 62.07 60.65 63.69
QuoraRetrieval 89.34 89.33 89.04 88.94 89.57
SCIDOCS 20.17 17.73 19.34 19.88 19.69
SciFact 73.99 73.57 74.22 75.39 75.83
Touche2020 18.52 18.92 24.46 19.44 15.79
TRECCOVID 76.66 76.02 80.17 82.30 74.50
BIOSSES 86.54 86.78 85.73 83.31 85.38
STS12 76.13 75.89 75.84 76.23 75.50
STS13 83.19 82.90 83.41 82.61 83.44
STS14 80.60 80.99 79.80 79.89 81.35
STS15 87.16 87.16 86.99 86.68 87.43
STS16 85.16 84.93 83.93 84.18 85.34
STS17 90.88 90.78 91.12 90.14 90.99
STS22 67.04 67.21 66.27 65.99 66.32
STSBenchmark 85.67 85.87 84.96 85.20 85.45
SICK-R 83.23 82.70 82.22 81.11 82.97
SummEval 30.73 29.46 31.02 30.19 29.32
Table 16: Results from all MTEB datasets for finetuning with Mistral-7B.