
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API #3734

Merged: 43 commits from the embedding branch into vllm-project:main on May 11, 2024

Conversation

@CatherineSue (Contributor) commented on Mar 29, 2024

Key Features of This Pull Request

This PR adds the e5-mistral-7b-instruct model and enables end-to-end (E2E) embedding vector generation.

There are a few open PRs that add support for embedding models. Our PR uniquely addresses the following key issues:

Integration of the OpenAI Server Front End

This PR includes comprehensive end-to-end (E2E) embedding functionality, spanning from the OpenAI Front End to the Back End.

Turn off KV cache with embedding models

This PR introduces the capability to turn off KV cache when operating in embedding mode, which includes the block_tables and cache_engine.

High Level Design

The embedding model can essentially be considered a special type of generative model with max_tokens=1 from an inference perspective. Both embedding and generative models (with max_tokens=1) require a single feedforward calculation without the need to generate any subsequent new tokens. The primary differences are:

The embedding model returns the hidden state, while the generative model takes an extra step to sample and return the first output token. In this sense, the embedding model is a subset of the generative model in terms of the calculations performed during the single feedforward operation on GPUs. They differ in their API specifications concerning request and output formats.
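As a rough conceptual sketch of this difference (illustrative PyTorch only; model and lm_head stand in for the transformer backbone and language-model head, and are not vLLM's actual interfaces):

import torch

def first_generated_token(model, lm_head, input_ids: torch.Tensor) -> torch.Tensor:
    # Generative path: one forward pass, then sample (here: greedy) the next token.
    hidden_states = model(input_ids)            # [batch, seq_len, hidden]
    logits = lm_head(hidden_states[:, -1, :])   # [batch, vocab]
    return torch.argmax(logits, dim=-1)         # [batch]

def embedding(model, input_ids: torch.Tensor) -> torch.Tensor:
    # Embedding path: the same forward pass, but return the (pooled) hidden state
    # instead of sampling, e.g. last-token pooling as e5-mistral-7b-instruct uses.
    hidden_states = model(input_ids)            # [batch, seq_len, hidden]
    return hidden_states[:, -1, :]              # [batch, hidden]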

As a consequence, the serving of embedding models is:

  • independent of any technologies implemented for subsequent token-by-token generation; and
  • able to share the same technologies as generative models for prompt processing.

Therefore, our current implementation in this PR focuses on bypassing vLLM's components for subsequent token generation, such as the KV cache and CUDA Graph, to avoid unnecessary GPU memory usage, which in turn helps optimize performance for embedding serving. If these technologies (e.g., CUDA Graph) are later enhanced to improve prompt processing, they can be applied directly to embedding models. At a high level, our PR is independent of crucial features such as tensor parallelism and quantization, which are addressed in Milestone 1 below. We have tested it with tensor parallelism > 1 and it functions correctly. We are willing to work together to test/improve these crucial features where needed.

Benchmarking Results

On 1 H100 GPU, vLLM's serving of the E5 embedding model reaches a maximum throughput of up to 36.6k tokens/s, and this remains consistent across sequence lengths from 128 to 32k (when gpu_memory_utilization is 0.8).
On 1 A100 GPU, it reaches up to 15.3k tokens/s and likewise remains consistent across sequence lengths. For comparison, in a test of the latest ONNX implementation, throughput peaks at 12.6k tokens/s at a low sequence length (256 tokens), drops to 8k tokens/s at a sequence length of 2k, and generally worsens as the sequence length grows.
See the figures below for more details of our test results on H100 and A100.
Note: The throughput is measured as the "number of sequences in a batch" * "sequence length" / "end-to-end latency"
[Figures: throughput vs. batch size on H100 and A100]
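In code form, the metric in the note above amounts to the following (a trivial sketch; the example numbers are illustrative, not measured results):

def throughput_tokens_per_sec(batch_size: int, seq_len: int, e2e_latency_sec: float) -> float:
    # "number of sequences in a batch" * "sequence length" / "end-to-end latency"
    return batch_size * seq_len / e2e_latency_sec

# Example: a batch of 64 sequences of 2,048 tokens finishing in 4.0 s
# gives 64 * 2048 / 4.0 = 32,768 tokens/s.
print(throughput_tokens_per_sec(64, 2048, 4.0))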

Design and Implementation Details

Add Embedding generation

Add Embedding API to entrypoints/openai

  • Add EmbeddingRequest and EmbeddingResponse to protocol.py
  • Add serving_embedding.py
  • Add OpenAIServingEmbedding to api_server
  • Make llm.py work with embedding
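To illustrate the front end added here, a request against the OpenAI-compatible embeddings endpoint could look like the following (a hedged sketch: the host/port and model name are assumptions, and the response shape follows the standard OpenAI embeddings schema this API targets):

import requests

response = requests.post(
    "http://localhost:8000/v1/embeddings",  # assumes a locally launched vLLM server
    json={
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": ["The quick brown fox jumps over the lazy dog."],
    },
)
result = response.json()
# OpenAI-style response: {"object": "list", "data": [{"index": 0, "embedding": [...]}], ...}
embedding = result["data"][0]["embedding"]
print(len(embedding))  # embedding dimension, 4096 for a Mistral-7B hidden size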

Add Embedding in outputs and sequence

  • Make an abstract base class of RequestOutput and SequenceGroupOutput.
  • Add separate Completion*Output and Embedding*Output.
  • Add EmbeddingOutput, EmbeddingRequestOutput, RequestOutputFactory and
    EmbeddingSequenceGroupOutput to support processing the embedding output
    sequence.
  • Update process output and sequence in *llm_engine to use embedding
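A hedged sketch of the shape of these output classes (field names are illustrative and based on the class names listed above, not a verbatim copy of the implementation):

from dataclasses import dataclass, field
from typing import List

@dataclass
class EmbeddingOutput:
    # The embedding vector, one float per hidden dimension (4096 for e5-mistral-7b-instruct).
    embedding: List[float] = field(default_factory=list)

@dataclass
class EmbeddingRequestOutput:
    request_id: str
    outputs: EmbeddingOutput
    prompt_token_ids: List[int]
    finished: bool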

Add MistralModel and embedding generation in llama.py

  • Add llama.embedding() and load_weights in LlamaModel to support forward
    and embedding vector generation
  • Adapted from code examples in https://huggingface.co/intfloat/e5-mistral-7b-instruct
  • Mistral uses LlamaModel.
  • Use embedding when embedding_mode is True in model_runner
  • Add load_weights in llama.py to support embedding models
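For reference, the last-token pooling described in the linked model card (which the embedding path mirrors) can be sketched as a standalone PyTorch function; this is a hedged reproduction of the idea, not the exact code added to llama.py:

import torch
from torch import Tensor

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the final position of every row is its last real token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, pick each sequence's final non-padded position instead.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths
    ]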

Disable KV Cache for Embedding serving

Skip slot_mapping with embedding mode

  • slot_mapping is only used in model_executor/layers/attention.py when
    kv_cache is not None. In embedding mode, we pass None kv_cache. So no
    need to process slot_mapping

Turn off KV cache for embedding mode

The goal is to disable the block_table and cache_engine completely, so
we don't consider allocating blocks for KV cache for embedding mode

  • Add embedding_mode to ModelConfig and SchedulerConfig
  • Add a BlockSpaceManagerProxy to control the block management behavior
    for embedding mode
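A hedged sketch of what such a proxy looks like (method names here are illustrative and cover only part of the real BlockSpaceManager interface):

class NoOpBlockSpaceManager:
    """Stand-in block manager for embedding mode: no KV-cache blocks exist,
    so every call is a no-op or a trivially permissive answer."""

    def can_allocate(self, seq_group) -> bool:
        return True   # nothing to allocate, so allocation always "succeeds"

    def allocate(self, seq_group) -> None:
        pass          # no blocks to reserve

    def can_append_slots(self, seq_group) -> bool:
        return True

    def free(self, seq) -> None:
        pass          # no blocks to release

    def get_block_table(self, seq) -> list:
        return []     # scheduling sees an empty block table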

Update parameters for max batching

  • Add profile_max_batched_tokens_for_embedding to profile the
    max_num_batched_tokens for embedding server mode
  • Return max_batch_size as each Ray worker runs the profiling once
  • Enable embedding profiling in ray_gpu_executor.py to support tensor parallelism > 1
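A hedged sketch of the kind of profiling this refers to (the function and the run_forward hook are hypothetical; the real logic lives in gpu_executor.py / ray_gpu_executor.py):

import torch

def profile_max_batched_tokens(run_forward, start: int = 2048, limit: int = 131072) -> int:
    """Double the number of batched tokens until a CUDA OOM occurs and return the
    largest size that still succeeded. `run_forward(num_tokens)` is a hypothetical
    callable that executes one embedding forward pass of that total size."""
    best = 0
    num_tokens = start
    while num_tokens <= limit:
        try:
            run_forward(num_tokens)
            best = num_tokens
            num_tokens *= 2
        except torch.cuda.OutOfMemoryError:
            break
        finally:
            torch.cuda.empty_cache()
    return best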

Notes

Overlaps with other PRs

This PR overlaps with the following Milestones in Supporting embedding models #3187

Milestone 1: Basic functionality using Hugging Face models [Partially completed]

Note: Instead of using LLM.encode(), this PR currently adds embedding() to LlamaModel, and then keeps LLM.generate() for embedding.

G) Update parameters for max batching [Completed]

We introduced profile_max_batched_tokens_for_embedding in gpu_executor.py to profile the maximum number of tokens the GPU can take in one batch.

Milestone 2: Implement the model properly [Completed]

This PR focuses on adding the e5-mistral-7b-instruct model, which can utilize llama.py. So it already uses the vLLM layer primitives.

Milestone 4: Add the OpenAI Server Front End [Completed]

This PR has implemented the Embedding API to entrypoints/openai.

Discrepancies with other discussions

This PR didn't implement the following items from the discussion in Supporting embedding models #3187:

Move finish sequence logic to check_stop

Currently, the logic is in llm_engine._process_model_output()
The logic should be in llm_engine._check_stop

Automatically detect that the model is an embedding model

The user should not have to specify that it is an embedding model
Somewhere in the vllm code, create a registry that selects which models are embedding and which models are decoders

F) Pass EmbeddingParams around instead of SamplingParams

Note: this is going to require a lot of work.
Note: This PR passes SamplingParams() to the LLMEngine and disables its use in embedding mode. As separating EmbeddingParams and SamplingParams requires changes to the UX, it would be easier to discuss and review in a follow-up PR.

Discussion Points and Considerations

Simplifying the Workflow

  • Evaluate the possibility of consolidating embedding-related workflows into existing structures with minimal branching logic. This could involve using if-else statements (specifically in model_runner.py) or integrating embedding as a subset of generation needs.

Soundness of Profiling

  • Evaluate if the current profile_max_batched_tokens_for_embedding is sufficient to support the maximum number of tokens in one batch without causing CUDA OOM.


@CatherineSue force-pushed the embedding branch 3 times, most recently from 4e93d2f to 9625d62 on April 2, 2024
@CatherineSue (Contributor, Author) commented:

An overview of tasks derived from the pull request discussions:

Immediate changes

core/

  • Make a BlockSpaceManagerV3
  • Remove self.prompt_limit if branch in scheduler.py

worker/ and executor/

  • Remove profile_max_num_batched_tokens for embedding
  • Evaluate setting a hardcoded batch_size or max_num_batched_tokens (32k, as Woosuk and Zhuohan suggested) instead of profiling

models/

  • Make a new embedding_models_dict
  • Make a llama_embedding.py
  • Check how Hugging Face (HF) generates embeddings

Separate model_runner and embedding_model_runner

  • Implement execute_model separately in model_runner and embedding_model_runner

Further evaluation

  • Evaluate the possibility of using config.json to distinguish whether a model is for embedding or generation, e.g. XForCausalLM as a generation model. Check edge cases for fine-tuning.
  • Separate LLM and LLMEmbedding
  • Separate SamplingParams and EmbeddingParams
  • Evaluate latency on individual request level

HF embeddings

HF's sentence_transformers library provides sentence, text, and image embeddings. It uses encode() to compute sentence embeddings, and its Pooling class supports different types of pooling, including lasttoken, which e5-mistral-7b-instruct uses.

Adhering to the same design would require further discussion and evaluation of the points mentioned above.
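For comparison, minimal sentence_transformers usage looks roughly like this (hedged: the instruction/query formatting follows the model card's convention, and loading a 7B model this way requires a suitable GPU):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
queries = [
    "Instruct: Given a web search query, retrieve relevant passages\nQuery: how much protein should a female eat",
]
embeddings = model.encode(queries, normalize_embeddings=True)
print(embeddings.shape)  # (1, 4096)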

@CatherineSue force-pushed the embedding branch 2 times, most recently from e935ca6 to c0e21a1 on April 4, 2024
@simon-mo mentioned this pull request on Apr 4, 2024
@CatherineSue force-pushed the embedding branch 2 times, most recently from b897f77 to 0b352b8 on April 8, 2024
@ywang96 self-assigned this on Apr 13, 2024
@pzhao1799 left a comment:

Preliminary review, will continue to add comments as I continue reading through this PR. Thanks a lot for all the hard work so far!!!

@@ -253,9 +253,14 @@ def __init__(
self.scheduler_config.max_model_len,
self.scheduler_config.max_num_batched_tokens)

version = "v1"


very nit: Would it make sense to clarify/rename the version naming scheme? v3 implies an improved version over v1 and/or v2, when in reality it is just a placeholder for the case where block management is not needed. This is very nit, so feel free to ignore.

"dimensions is currently not supported")

model_name = request.model
request_id = f"cmpl-{random_uuid()}"


Should we be prefixing the request_id with something other than cmpl? I noticed both chat and completions have this as the request_id, so unsure if we should be keeping this here as well.

@ywang96 (Collaborator) left a comment:

Thank you very much for your contribution @CatherineSue!

I took a pass and left some comments and suggestions. Overall, the design and implementation are pretty straightforward to me, and thank you for creating the no-op block manager to separate the logic.

Some thoughts I would like to throw out there:

  • The current implementation doesn't allow an engine to do generation and embedding at the same time (since embedding_mode is passed when initializing the engine), so I wonder if it's actually worth the effort to create a separate EmbeddingEngine, since a lot of the logic in LLMEngine is not needed at all, and this separation could also make a lot of higher-level APIs cleaner.
  • If the end goal is indeed to have engine to support both at the same time, then we should think about a clean way to support it, as well as supporting other vanilla embedding models.

Happy to discuss and hear your thoughts on this!

Comment on lines 7 to 15
class BlockSpaceManagerV3(BlockSpaceManager):
"""A simple version of BlockSpaceManager for use in environments
where block management is not required.

This class provides the same interface as BlockSpaceManager, but its
methods perform no actions or return simple values like True in specific
actions. It's designed to be used in scenarios where the overhead of
block management is unnecessary, such as in an embedding environment.
"""
Collaborator:

The naming here is a bit misleading since this is essentially a dummy blockmanager rather than v3.

Contributor (Author):

I agree. I had quite a debate over the naming. Is DummyBlockSpaceManager better, or SimpleBlockSpaceManager? A few functions here always return True, such as can_allocate and can_swap_out, so it doesn't feel entirely "dummy".
Additional thought: if there is a plan to make Scheduler more extensible, maybe in the future we can simplify the scheduling logic for embedding and further clean up this class as well.

Collaborator:

We could call it EmbeddingModelBlockSpaceManager to signify that it is used for embedding models. The only negative would be if some other type of model wanted to use this in the future.

@CatherineSue (Contributor, Author) commented on Apr 18, 2024

@ywang96 thanks for the initial review!

I have applied the suggestions and resolved some comments.

The current implementation doesn't allow an engine to do generation and embedding at the same time (since embedding_mode is passed when initializing the engine), so I wonder if it's actually worth the effort to create a separate EmbeddingEngine, since a lot of the logic in LLMEngine is not needed at all, and this separation could also make a lot of higher-level APIs cleaner.

I agree that having a separate EmbeddingEngine has a lot of benefits. Besides making the design cleaner, adding more functionalities in embedding and supporting more embedding models would be easier.
Happy to discuss detailed design. It might be good for a following PR as the current is big.

@CatherineSue (Contributor, Author) commented on Apr 18, 2024

@ywang96 A question related to design:

intfloat/e5-mistral-7b-instruct requires an eos_token_id appended at the end of each input. See:

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
# append eos_token_id to every input_ids
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

I checked a few embedding models, and this requirement seems specific to this model. Since input_ids are processed in ModelRunner._prepare_prompt, is there a good place to inject the eos_token_id at the end of each input?
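For concreteness, the injection being asked about amounts to something like the helper below (hedged: where to call it inside vLLM is exactly the open question, and the function name is hypothetical):

from typing import List

def append_eos_if_missing(prompt_token_ids: List[int], eos_token_id: int) -> List[int]:
    # e5-mistral-7b-instruct pools on the last token, so every input must end with EOS.
    if prompt_token_ids and prompt_token_ids[-1] == eos_token_id:
        return prompt_token_ids
    return prompt_token_ids + [eos_token_id]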

@robertgshaw2-neuralmagic (Collaborator) left a comment:

Thanks for the great work on this.

@ywang96 and I are syncing this morning on this topic.

Will revert with a detailed plan to get this merged.


@robertgshaw2-neuralmagic (Collaborator) commented on Apr 18, 2024

cc @simon-mo for plan here

@CatherineSue This is in good shape. We are very close to being ready to merge. @ywang96 and I discussed. There are a few things we want to do for the final implementation, but we want to take an approach of landing something close to the current version and then following up with incremental work.

Here's what we want to do to get ready for merge:

Needed For Merge

  • Tests
  • Clean up llama_embedding.py
  • Update terminology from LlamaEmbeddingModel.embedding to LlamaEmbeddingModel.pooler
  • Log warning from LLM.generate() that this API will change for embedding models
  • Add checks for incompatible features and fail if so [ we can push this to another PR if difficult ]

Tests

This is a big feature that needs end-to-end testing. I would suggest that we focus on the following areas:

  • Model correctness: compare against the sentence-transformers or Hugging Face implementation of the model, using the L2 norm of the difference between the embeddings. This can use the LLM engine.

  • Server API correctness: show that querying the API gets what is expected from the client side. Check out the existing tests for the OpenAI API server for inspiration as to how to write these tests

Clean up llama_embedding.py

We currently re-implement all the llama layers. I am okay with having a separate file for llama_embedding.py, but we should import LlamaModel and use that rather than re-writing all the layers of the model. For example:

from torch import nn

from vllm.model_executor.models.llama import LlamaModel

class LlamaEmbeddingModel(nn.Module):
    def __init__(self, config, **kwargs):
        super().__init__()
        # Reuse LlamaModel instead of re-implementing all of its layers.
        self.model = LlamaModel(config, **kwargs)

    def forward(self, *args, **kwargs):
        # Delegate the forward pass (hidden-state computation) to LlamaModel.
        return self.model(*args, **kwargs)

    # currently called `embedding`; to be renamed `pooler`
    def pooler(self, hidden_states):
        # same as the current embedding function (e.g. last-token pooling)
        ...

Update Terminology from embedding to pooler

This is to be consistent with HF / sentence-transformers, which use this terminology for translating between the final hidden states and the output embedding. See the official BertModel implementation.

So specifically

  • LlamaEmbeddingModel should use pooler instead of embedding
  • EmbeddingModelRunner should call the pooler method

Update LLM.generate API to log a warning if used with the embedding model

  • Make a note that this interface is experimental, and that LLM.encode will replace it soon

Add Incompatible Feature Guards

The following features are incompatible with embedding models:

  • Spec Decode
  • Chunked Prefill
  • Automatic Prefix Caching
  • Neuron / CPU
  • Fp8 KV cache

If these features are specified for an embedding model, we should either fail or log a warning. I would be okay with doing this in a follow up PR if it is not straightforward.
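A hedged sketch of what such a guard could look like (all attribute names below are illustrative, not vLLM's actual config fields):

def verify_embedding_mode_compat(model_config, scheduler_config, cache_config) -> None:
    """Fail fast when features that assume token-by-token generation are combined
    with an embedding model. Attribute names are hypothetical."""
    if not getattr(model_config, "embedding_mode", False):
        return
    incompatible = {
        "speculative decoding": getattr(model_config, "speculative_model", None) is not None,
        "chunked prefill": getattr(scheduler_config, "chunked_prefill_enabled", False),
        "automatic prefix caching": getattr(cache_config, "enable_prefix_caching", False),
        "fp8 KV cache": getattr(cache_config, "cache_dtype", "auto") == "fp8",
    }
    enabled = [name for name, on in incompatible.items() if on]
    if enabled:
        raise ValueError("Embedding mode is incompatible with: " + ", ".join(enabled))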

I will make a follow-up note below for the post-merge plan.

@robertgshaw2-neuralmagic (Collaborator) commented on Apr 18, 2024

Follow Ups Post Merge

After we merge this initial implementation, we can refactor in the following way:

Replace SamplerXXX with PoolerXXX

We currently implement embedding models using the input (SamplingParams) and output (SamplerOutput) classes in a hacked-up manner.

We will refactor to:

  • Swap SamplingParameters for PoolerParameters, which only has the data needed for embedding models
  • Swap SamplerOutput for PoolerOutput, which only has the data needed for embedding models
  • Pipe all this info around the various layers of the engine

LLM.encode

Deprecate embedding models from LLM.generate(). Instead, expose LLM.encode(), which accepts PoolerParams and returns EmbeddingRequestOutput; a usage sketch follows below.
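A hedged sketch of how the proposed API might be used (argument and field names follow the descriptions in this thread and may differ in the final implementation):

from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct")
outputs = llm.encode(["Hello, world!", "Embedding serving with vLLM"])
for out in outputs:                 # each item is an EmbeddingRequestOutput
    vector = out.outputs.embedding  # one float per hidden dimension
    print(len(vector))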

Generic Pooler

Create a generic Pooler() that corresponds to Sampler(). (Currently, we implement the pooling logic in LlamaEmbeddingModel.embedding() rather than in a shared class.) Pooler could be instantiated with the sentence-transformers config.

This will allow us to support more complex methods like ColBERT, sparse, etc over time.
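A hedged sketch of what such a generic Pooler might look like (an illustration of the idea, not a committed interface):

import enum
import torch
from torch import nn

class PoolingType(enum.Enum):
    LAST = "lasttoken"
    MEAN = "mean"
    CLS = "cls"

class Pooler(nn.Module):
    """Maps the final hidden states of one sequence [seq_len, hidden] to a single embedding."""

    def __init__(self, pooling_type: PoolingType, normalize: bool = True):
        super().__init__()
        self.pooling_type = pooling_type
        self.normalize = normalize

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.pooling_type is PoolingType.LAST:
            pooled = hidden_states[-1]
        elif self.pooling_type is PoolingType.MEAN:
            pooled = hidden_states.mean(dim=0)
        else:  # PoolingType.CLS
            pooled = hidden_states[0]
        return nn.functional.normalize(pooled, dim=-1) if self.normalize else pooled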

Refactor LLMEngine

  • Refactor _process_sequence_group_outputs.
  • For example, we could have an abstract class called SequenceGroupProcessor, with subclasses SequenceGroupProcessorEmbedding and SequenceGroupProcessorCompletion. Each of these would be responsible for implementing _process_sequence_group_outputs using PoolerOutput or SamplerOutput, respectively.

@CatherineSue (Contributor, Author) commented:

@robertgshaw2-neuralmagic thanks for the feedback. Working on resolving the comments and getting the checklist. Will update it soon.

@robertgshaw2-neuralmagic (Collaborator) commented:

@robertgshaw2-neuralmagic thanks for the feedback. Working on resolving the comments and getting the checklist. Will update it soon.

Thanks @CatherineSue

Apologies for the delay on getting this reviewed and thank you so much for your contribution :)

@CatherineSue (Contributor, Author) commented on Apr 23, 2024

@robertgshaw2-neuralmagic I resolved all the comments. Here's an overview of the tasks checked in the new commits:

  • Rename BlockSpaceManagerV3 to EmbeddingModelBlockSpaceManager
  • Use ModelRegistry to check for embedding models
  • Tests
  • Clean up llama_embedding.py
  • Update terminology from LlamaEmbeddingModel.embedding to LlamaEmbeddingModel.pooler
  • Replace SamplerXXX with PoolerXXX
    I added Pooler, PoolingParams, PoolerOutput, and PoolingMetadata. Note that Pooler is not following sentence_transformer's config. I didn't have time to finish it.
  • LLM.encode
    I have separated it from LLM.generate
  • Log warning from LLM.generate() that this API will change for embedding models
    Since I have separated it, I didn't add a warning.

@robertgshaw2-neuralmagic (Collaborator) commented:

@CatherineSue - awesome!

Are you planning to resolve the merge conflicts?

Should I review now?

@CatherineSue (Contributor, Author) commented:

@robertgshaw2-neuralmagic
I can resolve them if that makes it easier for you to review. It might take a while, ETA tonight. Does that work for you?

@robertgshaw2-neuralmagic (Collaborator) commented:

That works, I'll review tomorrow.

@robertgshaw2-neuralmagic (Collaborator) commented:

ping me when ready

@ywang96 (Collaborator) commented on Apr 24, 2024

I can take a first pass too whenever it's ready if @robertgshaw2-neuralmagic doesn't get there before me :)

@robertgshaw2-neuralmagic (Collaborator) commented:

@CatherineSue

Thank you so much for your efforts. This is ready to go. Just letting the CI run and then will merge.

@CatherineSue (Contributor, Author) commented:

@robertgshaw2-neuralmagic thank you so much for all the time and help on reviews and CIs!!

@simon-mo merged commit e254497 into vllm-project:main on May 11, 2024 (53 of 55 checks passed).
@Opdoop commented on May 20, 2024

@CatherineSue Thanks for the incredible work! Is there a comparison between vLLM and HF/sentence-transformers in terms of inference speed?

@K-Mistele commented:

Are there any updates to the project docs on how to actually use this feature?
