
Gemma Support #256

Merged: 10 commits merged into main from gemma on Mar 11, 2024
Conversation

hnyls2002 (Collaborator)

No description provided.

@hnyls2002 hnyls2002 marked this pull request as draft March 3, 2024 17:37
@hnyls2002 hnyls2002 marked this pull request as ready for review March 4, 2024 02:23
@hnyls2002 (Collaborator, Author)

Because Gemma uses head_dim=256, launching the Triton kernel on CUDA devices with compute capability >= 8 hits the Triton shared-memory hardware limit:

triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 163840, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
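As the error message suggests, one way around this is to use smaller block sizes or fewer pipeline stages when head_dim is large. A minimal sketch, assuming a Triton-autotuned kernel; the function and config names below are illustrative, not the actual kernel's:

```python
import triton

def _token_attn_configs(head_dim: int):
    # Illustrative only: for large head dims (Gemma uses 256), shrink the
    # block size and pipeline depth so the kernel's shared-memory footprint
    # stays under the ~99 KB per-block limit reported in the error above.
    if head_dim >= 256:
        return [triton.Config({"BLOCK": 32}, num_warps=4, num_stages=1)]
    return [triton.Config({"BLOCK": 64}, num_warps=4, num_stages=2)]
```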

@hnyls2002 (Collaborator, Author)

When launching gemma-7b-it without an instruction prefix (e.g. <start_of_turn>user), the dot product of q and k can be very large. This may be due to improper use of the instruction-tuned weights, or to head_dim=256.

To fix this, we can use flashinfer's attention kernel, which converts the intermediate result to float32. This is not yet supported in flashinfer v0.0.2 but will be available soon; you can check its status at flashinfer-ai/flashinfer#157.

Alternatively, we can convert the intermediate result in our Triton token attention kernel to float32. Be aware that this could cause a global slowdown and has not been fully tested yet.
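The idea behind both fixes is the same: accumulate the q·k scores in float32 rather than float16. A minimal PyTorch sketch of that reduction (illustrative only, not the actual flashinfer or Triton kernel code):

```python
import torch

def attn_scores_fp32(q: torch.Tensor, k: torch.Tensor, scale: float) -> torch.Tensor:
    # Cast q and k to float32 before the dot product so large intermediate
    # values (e.g. with Gemma's head_dim=256) do not overflow float16.
    scores = torch.matmul(q.float(), k.float().transpose(-1, -2)) * scale
    # Softmax in float32 for stability, then cast back to the original dtype.
    return torch.softmax(scores, dim=-1).to(q.dtype)
```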

@merrymercy merrymercy self-assigned this Mar 11, 2024
@hnyls2002 hnyls2002 merged commit 89885b3 into main Mar 11, 2024
@hnyls2002 hnyls2002 deleted the gemma branch March 11, 2024 04:14
@hnyls2002 (Collaborator, Author)

If you hit a NaN problem during multinomial sampling, add --attention-reduce-in-fp32 to fix it.
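For example, a server launch with the flag (the flag comes from this PR; the module path and model name are the usual sglang launch invocation and may differ in your setup):

```bash
python -m sglang.launch_server --model-path google/gemma-7b-it --attention-reduce-in-fp32
```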

Merging this pull request may close these issues: "No module named 'vllm.transformers_utils.configs.qwen'", "Google Gemma".