
LLM router basic template #203

Closed

wants to merge 1,502 commits into from

Conversation

@anmscale anmscale (Contributor) commented May 7, 2024

Implementing:

  • basic flow for data labeling with GPT-4 as a judge
  • basic flow for evaluation

To do:

  • add 1-2 simple baselines (frequency, BoW classifier)
  • iterate on the explanation and overall story

In a following commit:

  • fine-tune the LLM router

@kouroshHakha kouroshHakha (Contributor) left a comment

Overall I think it's going in the right direction; let's continue. It's half-baked towards the end, so I didn't put feedback there.

A few things I like so far that I think we should make standard for all the templates:

  • README.ipynb should not have implementation details in code. All the code implementation should be abstracted into a separate module / function that gets imported and simply used (see the sketch after this list).

  • We should use diagrams when they cut the need for more words.
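
A minimal sketch of that pattern for a notebook cell (module and function names below, like `router_utils` and `prepare_dataset`, are hypothetical placeholders; only `visualize_label_distribution` appears in this PR):

```python
# The README.ipynb cell only imports and calls helpers; all implementation
# details live in separate modules (names here are illustrative placeholders).
from router_utils import prepare_dataset
from viz import visualize_label_distribution

train_df, validation_df = prepare_dataset("data/queries.csv")
visualize_label_distribution(train_df, key="label")
```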

"source": [
"# Background\n",
"\n",
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n",
Contributor

Suggested change
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n",
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very large number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n",

"\n",
"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n",
"\n",
"The goal of this tutorial is to show you how you can train a \"smart router\", i.e. a model that can dynamically decide, based on the query text, whether to call a closed model or an OSS model. Here's a schematic view of a smart router:\n",
Contributor

Do we want to use the "smart router" name? I was thinking we should use "dynamic router", since this is also what ChatGPT called it and it will sound more familiar. Plus, SEO will get boosted because OpenAI used that term. (It's just a personal gut feeling.)

"Whenever we use an LLM we would like to get the highest response quality but are often restricted to a limited cost budget. Closed models, such as GPT-4, are known to be the highest quality models, but they can get very expensive especially when running them on a very number of queries. On the other hand, OSS models can be much cheaper, but their responses may not be of the same quality, especially for complex or domain-specific queries.\n",
"\n",
"The goal of this tutorial is to show you how you can train a \"smart router\", i.e. a model that can dynamically decide, based on the query text, whether to call a closed model or an OSS model. Here's a schematic view of a smart router:\n",
"![Smart Router](assets/router_schema.png)\n",
Contributor

In the diagram, for the green box, let's say "OSS, e.g. Mixtral". The point is that users can repeat this between any 2 or N models (even between GPT-3.5 and GPT-4 themselves).

"We are going to train a classifier to decide, based only on the query text, whether to route the query to an OSS model vs. a closed one. In this tutorial, we will make the following design choices: \n",
"1. We will quantify a response quality on a scale of `[1, 5]` (5-star).\n",
"2. For simplicity, we will assume that the closed always achieves 5-start quality. \n",
"3. We will use GPT-4 as a representative of closed models and Mixtral 8x7B for OSS models.\n",
Contributor

Suggested change
"3. We will use GPT-4 as a representative of closed models and Mixtral 8x7B for OSS models.\n",
"3. We will use GPT-4 as a representative for closed models and Mixtral 8x7B for OSS models.\n",

"2. For simplicity, we will assume that the closed always achieves 5-start quality. \n",
"3. We will use GPT-4 as a representative of closed models and Mixtral 8x7B for OSS models.\n",
"\n",
"More concurrently, let us assume that closed models have perfect a quality (5/5 score). our goal is to reduce cost significantly (say by 50%) while maintaining a high overal quality (4.8/5 score).\n"
Contributor

Suggested change
"More concurrently, let us assume that closed models have perfect a quality (5/5 score). our goal is to reduce cost significantly (say by 50%) while maintaining a high overal quality (4.8/5 score).\n"
"More concretely, let us assume that closed models have perfect a quality (5/5 score). Our goal is to reduce cost significantly (say by 50%) while maintaining a high overall quality (score of 4 to 5).\n"

queries = {}
for pidx, row in dataset_df.to_dict(orient="index").items():
    prompt = row["prompt"]
    if type(prompt) == str:
Contributor

Suggested change
if type(prompt) == str:
if isinstance(prompt, str):

    return train_df, validation_df


def visualize_label_distribution(dataset_df, key):
Contributor

Let's move all visualization methods to a different module.

    return average_score, routing_percentage, score_auc


def plot_quality_cost_curve(
Contributor

Move visualization into another module (maybe viz.py)



@ray.remote(num_cpus=0)
def get_llm_response(
Contributor

Any mechanisms to guard against rate limits?
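
One possible guard is a retry loop with exponential backoff inside the remote task. A rough sketch, where `llm_call` stands in for whatever client call the template actually makes and the retry numbers are arbitrary defaults:

```python
import random
import time

import ray


@ray.remote(num_cpus=0)
def get_llm_response_with_retries(pidx, query, llm_call, max_retries=5):
    """Call `llm_call(query)`, backing off exponentially between attempts.

    Ideally the except clause would catch only the client's specific
    rate-limit error rather than every exception.
    """
    for attempt in range(max_retries):
        try:
            return (pidx, llm_call(query))
        except Exception:
            if attempt == max_retries - 1:
                return (pidx, "")  # mirrors the existing empty-response fallback
            # Jittered exponential backoff so concurrent tasks don't retry in lockstep.
            time.sleep(2 ** attempt + random.random())
```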

    return (pidx, "")


def generate_batch_responses(
Contributor

definitely make the docstring for this beefy.
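
For example, the docstring could spell out the batching, concurrency, and failure behavior; the argument names below are guessed from context and may not match the real signature:

```python
def generate_batch_responses(queries, model_name, max_concurrency=16):
    """Generate responses for a batch of queries using Ray tasks.

    Args:
        queries: Mapping of query id -> prompt text.
        model_name: Target model, e.g. "gpt-4" or an OSS endpoint.
        max_concurrency: Upper bound on in-flight requests; also acts as a
            crude guard against provider rate limits.

    Returns:
        Mapping of query id -> response text. Queries that fail after all
        retries map to the empty string.
    """
```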

Contributor

This looks good so far, but I'm really interested in how you will explain/show the training workload in the notebook. I'm sure you'll explain parts of the config, how the new classifier head works, and applying the template (and again during inference). I'm also excited for the serving part. I think you should have a very small cost analysis too (a rough sketch follows the list below)!

Small errors:

  • "running them on a very number of queries"
  • "the closed always" --> "the closed model"
  • "use GPT-4 as a representative of closed models" --> "use GPT to represent closed models"
  • "More concurrently" --> "More concretely"

marwan116 and others added 25 commits May 16, 2024 15:24

Stable diffusion pretraining improvements
[LLM Serving Template] Updated command to non-jupyter cell
@@ -0,0 +1,19 @@
head_node_type:
  name: head
  instance_type: p4de.24xlarge
Contributor

A100 head nodes are not available in Hosted OA. If this template will be exposed in Hosted OA, can we use the serverless config (just directly request a GPU resource of A100-80G and allow the autoscaler to upscale it)?

If we are not planning on exposing this through OA, then it doesn't matter as much. But it's still better practice to run workloads on workers and use cheap CPU nodes for development.

@anmscale anmscale (Contributor, Author) commented Jun 27, 2024

Would you consider A10 a cheap GPU? I have enabled training on g5.48xlarge and launched jobs successfully with it, so I can update this config.

resources:
  cpu: 8

auto_select_worker_config: true
Contributor

Can you delete everything from this line down? None of it should be needed for a single node.

name: head
instance_type: g5.48xlarge
resources:
  cpu: 8
Contributor

you can delete the logical resource entry here as well

!pip install -e .[eval]
```

fatal: destination path '/home/ray/default/RouteLLM' already exists and is not an empty directory.
Contributor

probably don't need to commit the output cells

Contributor (Author)

This one slipped past me, good catch! Do you suggest I remove all of them? I kept only a summary showing what the user will see, but maybe it's not important.

Contributor

yeah I suggest removing all of them unless there's some really important output to display

@@ -0,0 +1,7 @@
head_node_type:
Contributor

you don't need these anymore

Contributor (Author)

@shomilj asked me to keep them but remove worker node configs.

@akshay-anyscale akshay-anyscale (Contributor) commented Jul 8, 2024

yeah but since you merged into the existing template, you don't need new compute config files at all

Contributor (Author)

ok let me remove those files then

Contributor (Author)

@akshay-anyscale I am not sure what to do about landing it in the product repo. There I need to specify configs, see e.g. https://github.com/anyscale/product/blob/master/backend/workspace-templates.yaml#L84, and I don't think the configs here would work: https://github.com/anyscale/product/blob/master/backend/workspace-templates.yaml#L246C12-L246C43

Contributor

you shouldn't have to make a product repo change for this since the files are in the existing template. Is the only gap that for GCE it doesn't have the serverless config? @kouroshHakha why is that the case?

Contributor

Why not make use of the basic serverless configs everywhere?

head_node_type:
  name: head
  instance_type: n1-standard-8
worker_node_types: []
auto_select_worker_config: true

Contributor

I don't know why LLM fine-tuning is not on serverless for GCE. Maybe that slipped during the transition for some reason? I hadn't noticed it until now.

Contributor

I think @anmscale mentioned a GPU head node was a hard req for this workspace; if that has changed, yes, please, let's use serverless :)

@anmscale anmscale closed this Jul 11, 2024
@anmscale anmscale deleted the anm/llm-router branch July 11, 2024 17:59
@anmscale anmscale (Contributor, Author)

Need to rename the branch to avoid the "/" for the template to run.

@anmscale anmscale mentioned this pull request Jul 11, 2024