LLM router basic template #203
Conversation
Overall I think it's going in the right direction; let's continue. It's half-baked towards the end, so I didn't put feedback there.
A few things I like so far that I think we should make standard for all the templates:
- README.ipynb should not have implementation details in code. All the code implementation should be abstracted into a separate module / function that gets imported and just used.
- We should use diagrams when they cut the need for more words.
templates/llm-router/README.ipynb
Outdated
"source": [ | ||
"# Background\n", | ||
"\n", | ||
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n", |
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n", | |
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very large number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n", |
templates/llm-router/README.ipynb
Outdated
"\n", | ||
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n", | ||
"\n", | ||
"The goal of this tutorial is to show you how you can train a \"smart router\", i.e. a model that can dynamically decide, based on the query text, whether to call a closed model or an OSS model. Here's a schematic view of a smart router:\n", |
Do we want to use the "smart router" name? I was thinking we should use "dynamic router", since this is also what ChatGPT called it and it will sound more familiar. Plus, SEO will get a boost because OpenAI used that term. (It's just a personal gut feeling.)
templates/llm-router/README.ipynb
Outdated
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n", | ||
"\n", | ||
"The goal of this tutorial is to show you how you can train a \"smart router\", i.e. a model that can dynamically decide, based on the query text, whether to call a closed model or an OSS model. Here's a schematic view of a smart router:\n", | ||
"![Smart Router](assets/router_schema.png)\n", |
In the diagram, for the green box, let's say "OSS, e.g. Mixtral". The point is that users can repeat this between any 2 or N models (even between GPT-3.5 and GPT-4 themselves).
templates/llm-router/README.ipynb
Outdated
"We are going to train a classifier to decide, based only on the query text, whether to route the query to an OSS model vs. a closed one. In this tutorial, we will make the following design choices: \n", | ||
"1. We will quantify a response quality on a scale of `[1, 5]` (5-star).\n", | ||
"2. For simplicity, we will assume that the closed always achieves 5-start quality. \n", | ||
"3. We will use GPT-4 as a representative of closed models and Mixtral 8x7B for OSS models.\n", |
"3. We will use GPT-4 as a representative of closed models and Mixtral 8x7B for OSS models.\n", | |
"3. We will use GPT-4 as a representative for closed models and Mixtral 8x7B for OSS models.\n", |
templates/llm-router/README.ipynb
Outdated
"2. For simplicity, we will assume that the closed always achieves 5-start quality. \n", | ||
"3. We will use GPT-4 as a representative of closed models and Mixtral 8x7B for OSS models.\n", | ||
"\n", | ||
"More concurrently, let us assume that closed models have perfect a quality (5/5 score). our goal is to reduce cost significantly (say by 50%) while maintaining a high overal quality (4.8/5 score).\n" |
"More concurrently, let us assume that closed models have perfect a quality (5/5 score). our goal is to reduce cost significantly (say by 50%) while maintaining a high overal quality (4.8/5 score).\n" | |
"More concretely, let us assume that closed models have perfect a quality (5/5 score). Our goal is to reduce cost significantly (say by 50%) while maintaining a high overall quality (score of 4 to 5).\n" |
templates/llm-router/data_utils.py
Outdated
queries = {}
for pidx, row in dataset_df.to_dict(orient="index").items():
    prompt = row["prompt"]
    if type(prompt) == str:
Suggested change:
- if type(prompt) == str:
+ if isinstance(prompt, str):
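For context on this suggestion: the two checks differ on subclasses, which is why `isinstance` is the idiomatic choice. A small self-contained illustration (the `Prompt` subclass is hypothetical, not from this PR):

```python
class Prompt(str):
    """Hypothetical str subclass, e.g. a prompt wrapper with metadata."""

p = Prompt("What is 2 + 2?")
print(type(p) == str)      # False: exact-type comparison rejects subclasses
print(isinstance(p, str))  # True: isinstance accepts str and its subclasses
```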
templates/llm-router/data_utils.py
Outdated
    return train_df, validation_df


def visualize_label_distribution(dataset_df, key):
Let's move all visualization methods to a different module.
    return average_score, routing_percentage, score_auc


def plot_quality_cost_curve(
Move visualization into another module (maybe viz.py)
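For example (assuming the new module is named `viz.py` as suggested, and keeping the function names from this file), callers would then just import the helpers:

```python
# Plotting helpers move to viz.py; data_utils.py and the notebook import them.
from viz import plot_quality_cost_curve, visualize_label_distribution
```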
@ray.remote(num_cpus=0)
def get_llm_response(
Any mechanisms to guard against rate limits?
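For reference, one common guard is exponential backoff with jitter around the API call. A minimal sketch (the `call_with_backoff` helper and the bare `Exception` catch are illustrative assumptions, not code from this PR; in practice you would catch the client's specific rate-limit error, e.g. `openai.RateLimitError`):

```python
import random
import time

def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Wait base * 2^attempt seconds, plus jitter to spread out retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

Capping the number of concurrent Ray tasks (rather than launching one task per query all at once) also helps stay under provider rate limits.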
    return (pidx, "")


def generate_batch_responses(
Definitely make the docstring for this beefy.
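For illustration, a beefy docstring here might cover the launch behavior, arguments, and failure semantics. A sketch (the parameter names below are assumptions, not the PR's actual signature):

```python
def generate_batch_responses(queries, model_name, max_concurrency=16):
    """Generate LLM responses for a batch of queries in parallel.

    Launches one `get_llm_response` Ray task per query and gathers the
    results, bounding the number of in-flight requests so provider rate
    limits are not exceeded.

    Args:
        queries: Mapping from query id to prompt text.
        model_name: Name of the model endpoint to call.
        max_concurrency: Maximum number of concurrent requests.

    Returns:
        Mapping from query id to response text; failed queries map to "".
    """
```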
This looks good so far, but I'm really interested in how you will explain/show the training workload in the notebook. I'm sure you'll explain parts of the config, how the new classifier head works, and applying the template (and again during inference). I'm also excited for the serving part. I think you should have a very small cost analysis too!
Small errors:
- "running them on a very number of queries"
- "the closed always" --> "the closed model"
- "use GPT-4 as a representative of closed models" --> "use GPT to represent closed models"
- "More concurrently" --> "More concretely"
configs/llm-router/aws.yaml
Outdated
@@ -0,0 +1,19 @@
head_node_type:
  name: head
  instance_type: p4de.24xlarge
A100 head nodes are not available in Hosted OA. If this template will be exposed in Hosted OA, can we use the serverless config (just directly request a GPU resource of A100-80G and allow the autoscaler to upscale it)?
If we are not planning on exposing this through OA, then it doesn't matter as much. But it's still better practice to run workloads on workers and use cheap CPU nodes for development.
Would you consider A10 as a cheap GPU? I have enabled training on g5.48xlarge and launched jobs successfully with it, so I can update this config.
configs/llm-router/aws.yaml
Outdated
resources:
  cpu: 8

auto_select_worker_config: true
Can you delete everything from this line down? None of it should be needed for a single node.
configs/llm-router/aws.yaml
Outdated
name: head
instance_type: g5.48xlarge
resources:
  cpu: 8
you can delete the logical resource entry here as well
templates/llm-router/README.md
Outdated
```
!pip install -e .[eval]
```

fatal: destination path '/home/ray/default/RouteLLM' already exists and is not an empty directory.
probably don't need to commit the output cells
This one slipped past me, good catch! Do you suggest I remove all of them? I kept only a summary showing what the user will see, but maybe it's not important.
yeah I suggest removing all of them unless there's some really important output to display
@@ -0,0 +1,7 @@
head_node_type:
you don't need these anymore
@shomilj asked me to keep them but remove worker node configs.
yeah but since you merged into the existing template, you don't need new compute config files at all
ok let me remove those files then
@akshay-anyscale I am not sure what to do about landing this in the product repo. There I need to specify configs, see e.g. https://github.com/anyscale/product/blob/master/backend/workspace-templates.yaml#L84, and I don't think the configs here would work: https://github.com/anyscale/product/blob/master/backend/workspace-templates.yaml#L246C12-L246C43
you shouldn't have to make a product repo change for this since the files are in the existing template. Is the only gap that for GCE it doesn't have the serverless config? @kouroshHakha why is that the case?
Why not make use of basic serverless configs everywhere?

head_node_type:
  name: head
  instance_type: n1-standard-8
worker_node_types: []
auto_select_worker_config: true
I don't know why LLM finetuning is not on serverless for GCE. Maybe that slipped during the transition for some reason? I hadn't noticed it until now.
I think @anmscale mentioned a GPU head node was a hard req for this workspace; if that has changed, yes, please, let's use serverless :)
Need to rename the branch name to avoid
Implementing:
To do:
In a following commit: