merge inference into BertMLM_fix #3

Open
wants to merge 381 commits into base: bert_fix1
Conversation

xinhaoc (Owner) commented May 17, 2024

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

Before merging:

  • Did you update the flexflow-third-party repo if you modified any of the CMake files, build configs, or submodules?

jiazhihao and others added 30 commits May 10, 2023 17:57
* Support multiple FFModels in a single top_level_task

* [TreeVerifyMHA] bug fixes
* init

* fix

* code

* clean up

* fix

* fix, add md

* format

* hip_roc

* add comment
* Support multiple FFModels in a single top_level_task

* [TreeVerifyMHA] bug fixes

* bug fixes

* TreeIncMHA and SpecIncMHA bug fixes

* format.

---------

Co-authored-by: xinhaoc <[email protected]>
* serving opt pipeline

* format
Co-authored-by: Zhihao Jia <[email protected]>
* complex into metadata

* topk

* format

---------

Co-authored-by: Zhihao Jia <[email protected]>
* Support multiple FFModels in a single top_level_task

* [TreeVerifyMHA] bug fixes

* bug fixes

* TreeIncMHA and SpecIncMHA bug fixes

* format.

* .

* add sentence piece tokenizer

* format

* prepare spec_infer demo

* prettier prints

* make the llama model work

* add small model config

* enable speculative inference for spec_infer

* fix

* rename

* fix one of the bugs

* fix

* del

* attempt to fix ci

* integrated gpt/opt tokenizer

* integrate opt tokenizer with pipeline

* .

* format

* move files

* Update README.md

* add an overview figure

* update images

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* add tokenizer in readme

* fix

* fix

* fix

* Update README.md

* Update README.md

* add gif

* add weights to readme, clean some print

* Update README.md

* update demo

* Update README.md

* Update README.md

* remove outdated file

* Update README.md

* Update README.md

* .

---------

Co-authored-by: xinhaoc <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
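The commits above enable speculative inference for spec_infer. As a hedged illustration only (toy deterministic models, not FlexFlow's actual API), the control flow is: a small draft model proposes a run of tokens, the large target model verifies them in order, and on the first mismatch the target's own token is substituted.

```python
# Illustrative sketch of speculative inference with hypothetical toy models.
# Here draft and target happen to use the same rule, so every proposed token
# is accepted; the point is the propose-then-verify control flow.

def draft_propose(prefix, k):
    # Toy "small model": propose k tokens deterministically from the prefix.
    out, state = [], sum(prefix) if prefix else 0
    for _ in range(k):
        tok = (state * 31 + 7) % 100
        out.append(tok)
        state += tok
    return out

def target_next(prefix):
    # Toy "large model": its own deterministic next-token rule.
    state = sum(prefix) if prefix else 0
    return (state * 31 + 7) % 100

def speculative_step(prefix, k=4):
    proposed = draft_propose(prefix, k)
    accepted, cur = [], list(prefix)
    for tok in proposed:
        if target_next(cur) == tok:        # verification pass: keep the token
            accepted.append(tok)
            cur.append(tok)
        else:                              # first mismatch: take target's token
            accepted.append(target_next(cur))
            return accepted
    return accepted
```

Accepting a whole run of draft tokens per target-model pass is what makes speculative decoding cheaper than generating one token per large-model step.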
* Support multiple FFModels in a single top_level_task

* [TreeVerifyMHA] bug fixes

* bug fixes

* TreeIncMHA and SpecIncMHA bug fixes

* format.

* .

* add sentence piece tokenizer

* format

* prepare spec_infer demo

* prettier prints

* make the llama model work

* add small model config

* enable speculative inference for spec_infer

* fix

* rename

* fix one of the bugs

* fix

* del

* attempt to fix ci

* integrated gpt/opt tokenizer

* integrate opt tokenizer with pipeline

* .

* format

* move files

* Update README.md

* add an overview figure

* update images

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* add tokenizer in readme

* fix

* fix

* fix

* Update README.md

* Update README.md

* add gif

* add weights to readme, clean some print

* Update README.md

* update demo

* Update README.md

* Update README.md

* remove outdated file

* Update README.md

* Update README.md

* .

* use data parallel by default

---------

Co-authored-by: xinhaoc <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
* file path adapt

* fix

* fix

* fix
* fix hip_rocm build with sentencepiece

* shellcheck 1

* shellcheck 2

* shellcheck 3

* fix install script

* .github/workflows/helpers/install_dependencies.sh

* fix

* shellcheck

* restore unnecessary changes

* fix build

* removed outdated test from c++ tests

* update link in readme
* implemented file-based configs, remove spec_pipeline folder

* fix

* add inference test, script to download weights

* update readme

* update ci scripts

* newlines

* fix gpu-ci

* fix

* fix

* update test file

* added incr decoding program, moved LLAMA folder from examples

* linting

* add incremental decoding to test

* update readme

* add script to download opt weights

* fix support for opt, move code to root inference folder

* linting

* update test file

* fix

* bug fix

* update test
…exflow#736)

* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta

* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta

* format

* merging kernel functions

* merge more functions

* merge compute_qkv_kernel

* format

* fix config

---------

Co-authored-by: xinhaoc <[email protected]>
* fix alignment bugs (part 1)

* add missing file
…ttention (flexflow#737)

* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta

* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta

---------

Co-authored-by: xinhaoc <[email protected]>
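The refactor described in the commits above (making TreeIncMultiHeadSelfAttentionMeta and BeamSearchIncMultiHeadAttentionMeta subclasses of IncMultiHeadSelfAttentionMeta) can be sketched as follows. This is a minimal Python illustration of the class hierarchy only; the field names are assumptions, and the real classes are C++ kernel-metadata structs.

```python
# Sketch of the Meta-class hierarchy: shared attention metadata lives in one
# base class, and each decoding variant only adds its own extra state instead
# of duplicating every field (field names are illustrative).

class IncMultiHeadSelfAttentionMeta:
    def __init__(self, num_heads, head_dim, kv_cache_size):
        # Fields common to all incremental-attention variants.
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.kv_cache_size = kv_cache_size

class TreeIncMultiHeadSelfAttentionMeta(IncMultiHeadSelfAttentionMeta):
    def __init__(self, num_heads, head_dim, kv_cache_size, num_tree_branches):
        super().__init__(num_heads, head_dim, kv_cache_size)
        self.num_tree_branches = num_tree_branches  # tree-verify-specific state

class BeamSearchIncMultiHeadAttentionMeta(IncMultiHeadSelfAttentionMeta):
    def __init__(self, num_heads, head_dim, kv_cache_size, beam_width):
        super().__init__(num_heads, head_dim, kv_cache_size)
        self.beam_width = beam_width  # beam-search-specific state
```

Sharing the base also lets the merged kernel functions mentioned above (e.g. a common compute_qkv_kernel) operate on any variant through the base-class fields.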
* save output to file

* add alignment tests

* fix

* change conflicting name, add comments

* fix typo

* formatting

* more comments and clean dead code

* formatting

* fixed issue with length mismatch

* fix ci skip

* update inf test

* add precision selection support in incr decoding
* Update README.md

* update readme

* fix
…d tests (flexflow#749)

* add support for downloading mixed precision llama/opt weights

* fix

* update test script to also run half precision tests

* disable workflow for inference PRs

* add verbose option

* linting

* copy opt weights in download weights script

* add alignment tests with huggingface (llama)

* fix, add diff to test script

* fix

* add opt tests

* comment out tests not passing

* add e2e latency to output files

* add speed tests

* shellcheck

* shellcheck

* fix

* fix

* linting

* fix
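The alignment tests with Hugging Face described above boil down to comparing the token streams produced by two backends and reporting where they first diverge. A minimal sketch of that comparison, with the function name and list format as assumptions for illustration:

```python
# Hedged sketch of an alignment check between two backends' outputs
# (e.g. FlexFlow vs. Hugging Face); inputs are plain token-id lists.

def first_divergence(tokens_a, tokens_b):
    """Return the index of the first mismatching token, or -1 if aligned."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One stream is a strict prefix of the other: diverges at the shorter end.
        return min(len(tokens_a), len(tokens_b))
    return -1
```

Reporting the first divergence index, rather than just pass/fail, makes half-precision mismatches much easier to localize in the per-request output files.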
goliaro and others added 28 commits January 9, 2024 06:56
This reverts commit 197e308.
* add a background server for RequestManager

* .

* make incr_decoding work

* make spec_infer work

* format

* update python inference

* fix python issues

* bug fix

* add a Legion future to capture the termination of the background server

* Add thread safety for background server.

* Simplify backend server design.

* resolve conflict.

* Add server task timeout.

* register callbacks to terminate background worker at exit or termination

* [Python] enable decoding multiple requests

* update README.md and default configuration

* [Python] no need to use the llm context environment to start/stop the background server

* require at least four cpu cores

* [Python] add back explicit start_server()/stop_server().

* fix

* fix python chatgpt.json

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: zwang86 <[email protected]>
Co-authored-by: Zeyu Wang <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
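The background-server pattern these commits describe can be sketched in a few lines. This is an illustration only, not FlexFlow code: a threading.Event stands in for the Legion future that captures termination, and an atexit callback stands in for the registered shutdown hooks.

```python
# Minimal sketch of a background request-serving loop with explicit
# start_server()/stop_server() and a shutdown callback registered at exit.

import atexit
import threading

class BackgroundServer:
    def __init__(self):
        self._stop = threading.Event()   # stand-in for the termination future
        self._thread = None
        self.processed = 0

    def start_server(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        # Guarantee the worker is terminated even on abnormal interpreter exit.
        atexit.register(self.stop_server)

    def _run(self):
        while not self._stop.is_set():
            self.processed += 1          # stand-in for serving one request
            self._stop.wait(0.001)

    def stop_server(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=1.0)
```

Usage mirrors the explicit API mentioned above: call start_server() before submitting requests and stop_server() when done; the atexit hook makes stop_server() idempotent cleanup rather than the only shutdown path.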
* only stop server if rm is initialized

* fix

* better logging

* pass layer names to ops

* add debugging functionality to hf script

* fix

* fixes

* fix

* fix

---------

Co-authored-by: Ubuntu <[email protected]>
* bug fixes and update Legion version

* fix

* bug fix

* update legion

* fix arithmetic error due to num_devices uninitialized

* update legion version

* update ci

* fix

* debugging ci

* Revert "debugging ci"

This reverts commit 0b3148e.

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
…w#1246)

* add a background server for RequestManager

* .

* make incr_decoding work

* make spec_infer work

* format

* update python inference

* fix python issues

* bug fix

* add a Legion future to capture the termination of the background server

* gradio finished

* chatbot gradio version 2

* chainlit1

* chainlit2

* fastapi done

* fastapi incr_decoding

* langchain example & wrapper class

* langchain example & wrapper class1

* added documentation

* entrypoint

* del apikey

* delete extra files

* rag search fixed some bugs

* fixed rag search issues

* updates before rebase

* minor changes

* reorganize files

* Add thread safety for background server.

* Simplify backend server design.

* resolve conflict.

* specinfer usecases with issues labeled

* specinfer usecases with issues labeled 2

* fixed issues with prompt template

* fix issues with rag specinfer

* Add server task timeout.

* register callbacks to terminate background worker at exit or termination

* [Python] enable decoding multiple requests

* update README.md and default configuration

* fix issues with gradio and prompt template

* fix issues with rag

* adjusted fastapi entrypoint

* update documentation

* resolve conflicts

* issues fix

* adjustments on usecases and api entrypoints

* remove redundant changes

* testing CI

* Enable backtrace

* restore newlines

* version

* add back misdeleted line

* legion version

---------

Co-authored-by: Zhihao Jia <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: zwang86 <[email protected]>
Co-authored-by: Zeyu Wang <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
* bug fixes and update Legion version

* fix

* bug fix

* update legion

* fix arithmetic error due to num_devices uninitialized

* update legion version

* update ci

* fix

* debugging ci

* Revert "debugging ci"

This reverts commit 0b3148e.

* update mapper interface

* add ncclFinalize

* Only delete nccl communications for training jobs

---------

Co-authored-by: Zhihao Jia <[email protected]>
* modify README

* fix link issues

* update legion version

---------

Co-authored-by: Zhihao Jia <[email protected]>
* .

* remove deadcode

* add benchmarking mode, initializing weights randomly

* better logging when running out of memory

* update

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
@xinhaoc xinhaoc changed the title Xinhao inference merge inference into BertMLM_fix May 17, 2024