
Gemma Support #256

Merged: 10 commits merged into main from gemma on Mar 11, 2024
Conversation

hnyls2002 (Collaborator)

No description provided.

@hnyls2002 hnyls2002 marked this pull request as draft March 3, 2024 17:37
@hnyls2002 hnyls2002 marked this pull request as ready for review March 4, 2024 02:23
@hnyls2002 (Collaborator, Author)

Because Gemma uses head_dim=256, launching the Triton kernel on CUDA devices with compute capability >= 8 hits the Triton shared-memory hardware limit:

triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 163840, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
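As the error message suggests, one way around this is to use smaller block sizes or fewer pipeline stages when head_dim is large. A minimal sketch, assuming a Triton-autotuned kernel; the function and config names below are illustrative, not the actual kernel's:

```python
import triton

def _token_attn_configs(head_dim: int):
    # Illustrative only: for large head dims (Gemma uses 256), shrink the
    # block size and pipeline depth so the kernel's shared-memory footprint
    # stays under the ~99 KB per-block limit reported in the error above.
    if head_dim >= 256:
        return [triton.Config({"BLOCK": 32}, num_warps=4, num_stages=1)]
    return [triton.Config({"BLOCK": 64}, num_warps=4, num_stages=2)]
```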

@hnyls2002 (Collaborator, Author)

When launching gemma-7b-it without an instruction prefix (e.g. <start_of_turn>user), the dot product of q and k can be very large. This may be due to improper use of the instruction-tuned weights, or to head_dim=256.

To fix this, we can use flashinfer's attention kernel, which converts the intermediate result to float32. This is not yet supported in flashinfer v0.0.2 but will be available soon; you can check its status at flashinfer-ai/flashinfer#157.

Alternatively, we can convert the intermediate result in our Triton token attention kernel to float32. Be aware that this could cause a global slowdown and has not been fully tested yet.
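The idea behind both fixes is the same: accumulate the q·k scores in float32 rather than float16. A minimal PyTorch sketch of that reduction (illustrative only, not the actual flashinfer or Triton kernel code):

```python
import torch

def attn_scores_fp32(q: torch.Tensor, k: torch.Tensor, scale: float) -> torch.Tensor:
    # Cast q and k to float32 before the dot product so large intermediate
    # values (e.g. with Gemma's head_dim=256) do not overflow float16.
    scores = torch.matmul(q.float(), k.float().transpose(-1, -2)) * scale
    # Softmax in float32 for stability, then cast back to the original dtype.
    return torch.softmax(scores, dim=-1).to(q.dtype)
```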

@merrymercy merrymercy self-assigned this Mar 11, 2024
@hnyls2002 hnyls2002 merged commit 89885b3 into main Mar 11, 2024
@hnyls2002 hnyls2002 deleted the gemma branch March 11, 2024 04:14
@hnyls2002 (Collaborator, Author)

If you hit a NaN problem during multinomial sampling, add --attention-reduce-in-fp32 to fix it.
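For example, a server launch with the flag (the flag comes from this PR; the module path and model name are the usual sglang launch invocation and may differ in your setup):

```bash
python -m sglang.launch_server --model-path google/gemma-7b-it --attention-reduce-in-fp32
```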

Merging this pull request may close these issues: "No module named 'vllm.transformers_utils.configs.qwen'", "Google Gemma".