
[Inference] Tensor model parallelism #778

Merged
merged 32 commits into from
Jun 25, 2023
Conversation

goliaro
Collaborator

@goliaro goliaro commented Jun 19, 2023

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

Before merging:

  • Did you update the flexflow-third-party repo if you modified any of the CMake files, build configs, or submodules?

@goliaro goliaro marked this pull request as draft June 19, 2023 09:10
@lockshaw lockshaw added the inference Features and fixes related to the inference project. label Jun 23, 2023
@goliaro goliaro marked this pull request as ready for review June 24, 2023 16:53
@jiazhihao
Collaborator

@gabrieleoliaro The PR looks great to me overall. Do you think we can merge this into the inference branch?

@@ -127,7 +127,7 @@ Tensor
   }
   if (bias) {
     // q, k, v, o
-    int dims[1] = {embed_dim * 4};
+    int dims[1] = {(qProjSize + kProjSize + vProjSize) * num_heads + oProjSize};
Collaborator

Hi Gabriele, just to make sure: is there any situation in which embed_dim * 4 is not equal to (qProjSize + kProjSize + vProjSize) * num_heads + oProjSize?

Collaborator

In Megatron-LM's tensor model parallelism, we partition the attention heads across GPUs, so the num_heads on each GPU can be smaller than the total number of attention heads.

Collaborator Author

In the current model, the two quantities are indeed always identical; I changed this just to make the data layout easier to remember.

Collaborator

Thanks!

@goliaro
Collaborator Author

goliaro commented Jun 25, 2023

@gabrieleoliaro The PR looks great to me overall. Do you think we can merge this into the inference branch?

yes! I'll merge it as soon as CI passes!

@goliaro goliaro merged commit 0f3be1f into inference Jun 25, 2023
44 of 45 checks passed
@goliaro goliaro deleted the tensor_parallelism branch June 25, 2023 17:02
Labels
inference Features and fixes related to the inference project.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants