
Investigate DeepSpeed Inference #845

Open
Quentin-Anthony opened this issue Mar 21, 2023 · 7 comments
@Quentin-Anthony
Member

DeepSpeed wins most inference benchmarks I see. We should test their claims on NeoX models. EleutherAI spends a significant amount of compute running inference, so any improvement in inference performance would be high-impact. Specifically, I would like to see:

  1. Take baseline inference latency numbers on small/medium/large (160M, 6.9B, 20B) NeoX models with gpt-neox.
  2. A summary of how DeepSpeed's inference engine can benefit us (https://arxiv.org/abs/2207.00032, https://www.deepspeed.ai/tutorials/inference-tutorial/, https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/inference-tutorial.md).
  3. Take new numbers on the above models with DeepSpeed inference, compare them with the baseline, and write up a short README that includes the numbers and run instructions for DS inference (see the timing sketch after this list).
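
For reference, here is a minimal, untested sketch of timing DeepSpeed's kernel-injection path, following the linked inference tutorial. The checkpoint name, prompt, and generation settings are placeholders; the real benchmark should load the actual NeoX models:

```python
import time

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"  # placeholder stand-in for the 160M NeoX model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

# Swap supported modules for DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=1,  # tensor-parallel degree; 1 = single GPU
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("EleutherAI is", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
output = engine.module.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
print(f"latency: {time.time() - start:.3f}s")
print(tokenizer.decode(output[0]))
```

Running the same loop with and without the `deepspeed.init_inference` wrapper on identical prompts and generation lengths should give the baseline-vs-DS comparison described above.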
Quentin-Anthony added the feature request and good first issue labels on Mar 21, 2023
@cr458

cr458 commented Mar 26, 2023

I'd be more than happy to take this on. Are there any resources available for these tests, or is it up to me to find them?

@StellaAthena
Member

> I'd be more than happy to take this on. Are there any resources available for these tests, or is it up to me to find them?

If by “resources” you mean computing resources, yes, we can easily make GPUs available for testing this.

@cr458

cr458 commented Mar 27, 2023

Hi Stella, thanks for your reply. Yup, that's exactly what I meant; sorry about the poor wording. Great, in that case I'd love to take this on. Let me know how you'd like to arrange access to the GPUs and I'll get going!

@satpalsr
Contributor

+1

@StellaAthena
Member

@cr458 @satpalsr What's the current status of this?

@IshanMi

IshanMi commented Dec 24, 2023

Hey @Quentin-Anthony - I would love the chance to work on this!

yang added a commit to yang/gpt-neox that referenced this issue Jan 25, 2024
@yang
Contributor

yang commented Jan 25, 2024

Some initial numbers are promising. With the current configs/125M.yml and text_generation.yml, I consistently see duration_seconds drop from ~2.4s to ~1.4s on a single-GPU node (A10G). I'll share more numbers once I can get compute.
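
For anyone reproducing this, here is a hedged sketch of driving the same baseline run from a script. The deepy.py/generate.py invocation and config names follow the comment above, but exact paths and flags may differ by checkout, and the wall time below includes launcher and model-load overhead, so it only upper-bounds the duration_seconds figure quoted above:

```python
import subprocess
import time

# Assumed gpt-neox launcher invocation; verify against your checkout.
cmd = [
    "python", "./deepy.py", "generate.py",
    "-d", "configs", "125M.yml", "text_generation.yml",
]
start = time.time()
subprocess.run(cmd, check=True)
print(f"end-to-end wall time: {time.time() - start:.1f}s")
```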
