Merge inference into BertMLM_fix #2

Open
wants to merge 399 commits into xinhao_candle

Conversation

@xinhaoc (Owner) commented May 10, 2024

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

Before merging:

  • Did you update the flexflow-third-party repo, if modifying any of the CMake files, the build configs, or the submodules?

goliaro and others added 30 commits May 19, 2023 13:41
* fix hip_rocm build with sentencepiece

* shellcheck 1

* shellcheck 2

* shellcheck 3

* fix install script

* .github/workflows/helpers/install_dependencies.sh

* fix

* shellcheck

* restore unnecessary changes

* fix build

* removed outdated test from c++ tests

* update link in readme
* implemented file-based configs, remove spec_pipeline folder

* fix

* add inference test, script to download weights

* update readme

* update ci scripts

* newlines

* fix gpu-ci

* fix

* fix

* update test file

* added incr decoding program, moved LLAMA folder from examples

* linting

* add incremental decoding to test

* update readme

* add script to download opt weights

* fix support for opt, move code to root inference folder

* linting

* update test file

* fix

* bug fix

* update test
…exflow#736)

* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta

* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta

* format

* merging kernel functions

* merge more functions

* merge compute_qkv_kernel

* format

* fix config

---------

Co-authored-by: xinhaoc <[email protected]>
* fix alignment bugs (part 1)

* add missing file
…ttention (flexflow#737)

* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta

* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta

---------

Co-authored-by: xinhaoc <[email protected]>
* save output to file

* add alignment tests

* fix

* change conflicting name, add comments

* fix typo

* formatting

* more comments and clean dead code

* formatting

* fixed issue with length mismatch

* fix ci skip

* update inf test

* add precision selection support in incr decoding
* Update README.md

* update readme

* fix
…d tests (flexflow#749)

* add support for downloading mixed precision llama/opt weights

* fix

* update test script to also run half precision tests

* disable workflow for inference PRs

* add verbose option

* linting

* copy opt weights in download weights script

* add alignment tests with huggingface (llama)

* fix, add diff to test script

* fix

* add opt tests

* comment out tests not passing

* add e2e latency to output files

* add speed tests

* shellcheck

* shellcheck

* fix

* fix

* linting

* fix
* Add support for login information with multiple ssms.

* Update prepare_next_batch_verify.

* Add dedup tree merge.

* Format.

* Fix bugs.

* Runs with multiple models.

* Fix.

* Format

* Fix.

* Fix incremental decoding.

* fix use_full_precision issue.
* Fix bug in elementwise multiplication with broadcasting (flexflow#764)

* Fix multinode test (flexflow#766)

* Fix UCX multinode test (flexflow#768)

* fix

* fix 2

* Prevent format.sh from formatting triton (flexflow#756)

* [CI] - Increase timeout in multinode test (UCX & MPI) (flexflow#773)

* fix

* fix 2

* increase timeout

* Fix docker builds in CI (flexflow#774)

---------

Co-authored-by: Soumya Chatterjee <[email protected]>
Co-authored-by: Colin Unger <[email protected]>
* init

* add mlc tokenizer.

* .

* fix

* fix pipeline, fix name

* .

* format

* ci

* .

* add rust

* fix

* .

* inf test fix

* .

* fix

* .

* fix

* optimize

* move rust to conda env

* .

* .

* fix

* fix

* fix

* update git ignore

* fix rust install

* Update config.linux

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
* fix gpu-ci

* add check for rust in cmake
* decomp

* initial implementation

* add missing file

* checkpoint

* more bug fixes

* update default offload size

* fix non-offload

* undo changes to spec_inc_mha

* fix a parallel tensor reuse bug

* prepare_next_batch for offload (inc_decode)

* format

* int4&int8 offload

* fix merge issue

* fix build

* spec_infer offload&quantize

* fix, update readme.

* remove redundant

* hip build

* hip

* model param

---------

Co-authored-by: xinhaoc <[email protected]>
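
The offload and quantization commits above (int4/int8 offload, spec_infer offload & quantize) revolve around keeping weights in a low-precision format while they are offloaded. As a rough sketch of the int8 half of that idea only, with hypothetical function names and no connection to FlexFlow's actual kernels, symmetric per-tensor quantization looks like this:

```python
# Hypothetical illustration of symmetric int8 weight quantization of the kind
# used when offloading quantized weights; not FlexFlow's implementation.
import numpy as np

def quantize_int8(weight: np.ndarray):
    """Quantize a float tensor to int8 plus a single per-tensor scale."""
    max_abs = float(np.abs(weight).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately recover the float weights when the layer runs."""
    return q.astype(np.float32) * scale

# Offloaded weights can be kept in int8 (a quarter of fp32 memory) off the GPU
# and dequantized, or consumed directly by an int8 kernel, only when needed.
w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_int8(w)
w_approx = dequantize_int8(q, s)
```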
* add parallel operators

* add cmd line param

* setting machine views

* move bias blocks

* comment out print of partitions

* add unimplemented methods

* add impl of inference functions to replicate and reduce ops

* replicate bias in file loader

* fixes, now works

* only add bias once

* load and use weights according to partition

* fix wout weight

* cleanup

* add support for mixed precision in parallel ops

* cleanup

* rocm build fix

* hip rocm fix 2

* fix machine views

* fix rocm build

* adjust number of pipeline stages

* add model parallelism to opt linear layers

* fix

* fix multi gpu test

* fix

* add tensor parallelism tests to inference test script

* enable tensor parallelism for dense layers in llama

* fix

* fix set_tensor-related issues

* fix and linting
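
The tensor-parallelism commits above partition the dense layers' weights across devices, replicate the bias, add the bias only once, and load weights according to the partition. A minimal NumPy sketch of the general row-parallel pattern behind that, under assumed shapes and with the summation standing in for the all-reduce (an illustration, not FlexFlow's parallel operators):

```python
# Generic sketch of a row-parallel dense layer: each "device" holds a slice of
# the weight along the input dimension, computes a partial output, and the
# partial outputs are summed (the all-reduce). Bias is replicated and added once.
import numpy as np

def row_parallel_dense(x, weight, bias, num_parts):
    in_dim = weight.shape[0]
    part = in_dim // num_parts
    partials = []
    for p in range(num_parts):
        x_p = x[:, p * part:(p + 1) * part]       # input slice for this partition
        w_p = weight[p * part:(p + 1) * part, :]  # weight slice for this partition
        partials.append(x_p @ w_p)                # partial matmul on one "device"
    out = np.sum(partials, axis=0)                # stands in for the all-reduce
    return out + bias                             # bias added once, not per partition

x = np.random.randn(2, 8).astype(np.float32)
w = np.random.randn(8, 4).astype(np.float32)
b = np.random.randn(4).astype(np.float32)
assert np.allclose(row_parallel_dense(x, w, b, num_parts=4), x @ w + b, atol=1e-5)
```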
* Docker-build and Publish Modification
**Description of changes:**
Add code to docker-build.yml so that Docker images are automatically built and published when a push happens to the inference branch. Also modify publish.sh so that the published image is named using both the image name and the branch name, to distinguish these images from the ones created from the master branch.

**Related Issues:**

Linked Issues:
- Issue #

Issues closed by this PR:
- Closes #

**Before merging:**

- [ ] Did you update the [flexflow-third-party](https://github.com/flexflow/flexflow-third-party) repo, if modifying any of the CMake files, the build configs, or the submodules?

* update container name

* specinfer env publish

* tag specinfer

* add spaces

* newline

* fix

* fix gpu ci workflow

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
* fix linear region requirement

* fix set tensor issue
goliaro and others added 30 commits January 20, 2024 04:21
* only stop server if rm is initialized

* fix

* better logging

* pass layer names to ops

* add debugging functionality to hf script

* fix

* fixes

* fix

* fix

---------

Co-authored-by: Ubuntu <[email protected]>
* bug fixes and update Legion version

* fix

* bug fix

* update legion

* fix arithmetic error due to num_devices uninitialized

* update legion version

* update ci

* fix

* debugging ci

* Revert "debugging ci"

This reverts commit 0b3148e.

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
…w#1246)

* add a background server for RequestManager

* .

* make incr_decoding work

* make spec_infer work

* format

* update python inference

* fix python issues

* bug fix

* add a Legion future to capture the termination of the background server

* gradio finished

* chatbot gradio version 2

* chainlit1

* chainlit2

* fastapi done

* fastapi incr_decoding

* langchain example & wrapper class

* langchain example & wrapper class1

* added documentation

* entrypoint

* del apikey

* delete extra files

* rag search fixed some bugs

* fixed rag search issues

* updates before rebase

* minor changes

* reorganize files

* Add thread safety for background server.

* Simplify backend server design.

* resolve conflict.

* specinfer usecases with issues labeled

* specinfer usecases with issues labeled 2

* fixed issues with prompt template

* fix issues with rag specinfer

* Add server task timeout.

* register callbacks to terminate background worker at exit or termination

* [Python] enable decoding multiple requests

* update README.md and default configuration

* fix issues with gradio and prompt template

* fix issues with rag

* adjusted fastapi entrypoint

* update documentation

* resolve conflicts

* issues fix

* adjustments on usecases and api entrypoints

* remove redundant changes

* testing CI

* Enable backtrace

* restore newlines

* version

* add back misdeleted line

* legion version

---------

Co-authored-by: Zhihao Jia <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: zwang86 <[email protected]>
Co-authored-by: Zeyu Wang <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
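
The background-server commits above describe the RequestManager running in a background worker that is made thread safe, exposes a termination future, and is shut down by callbacks registered at exit or termination. A hypothetical Python sketch of that shape, using only standard-library pieces and invented names rather than the FlexFlow API:

```python
# Hypothetical sketch of a background request server: a worker thread drains a
# thread-safe queue, and a callback registered with atexit stops it cleanly.
# This illustrates the design only; it is not the FlexFlow RequestManager API.
import atexit
import queue
import threading

class BackgroundServer:
    def __init__(self):
        self._requests = queue.Queue()      # thread-safe handoff of requests
        self._stop = threading.Event()      # plays the role of the termination future
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()
        atexit.register(self.shutdown)      # terminate the worker at process exit

    def _run(self):
        while not self._stop.is_set():
            try:
                prompt = self._requests.get(timeout=0.1)
            except queue.Empty:
                continue
            print(f"decoding: {prompt!r}")  # stand-in for running the model

    def submit(self, prompt: str):
        self._requests.put(prompt)

    def shutdown(self):
        if not self._stop.is_set():
            self._stop.set()
            self._worker.join()

server = BackgroundServer()
server.submit("hello")
server.shutdown()
```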
* bug fixes and update Legion version

* fix

* bug fix

* update legion

* fix arithmetic error due to num_devices uninitialized

* update legion version

* update ci

* fix

* debugging ci

* Revert "debugging ci"

This reverts commit 0b3148e.

* update mapper interface

* add ncclFinalize

* Only delete nccl communications for training jobs

---------

Co-authored-by: Zhihao Jia <[email protected]>
* modify README

* fix link issues

* update legion version

---------

Co-authored-by: Zhihao Jia <[email protected]>
* .

* remove deadcode

* add benchmarking mode, initializing weights randomly

* better logging when running out of memory

* update

---------

Co-authored-by: Gabriele Oliaro <[email protected]>