# End-to-end efficiency evaluation of Atom in Serving Context

## Experiment Setup

Punica is a serving framework dedicated to serving multiple fine-tuned LoRA models. Atom uses its backbone framework, without the LoRA component, to demonstrate its efficiency in a serving context. This codebase is modified from a previous version of Punica.

To emulate a real production scenario, we collect real-world LLM chat logs from ShareGPT, from which we sample prefill prompts and decoding lengths to synthesize user requests, as sketched below. We adopt continuous batching following Orca and manually set the corresponding batch size.
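A minimal sketch of the request-synthesis step, assuming the common ShareGPT JSON layout (a list of records whose `conversations` field holds `from`/`value` turns); `sample_requests` and the whitespace token-count approximation are illustrative, not this repo's actual API:

```python
import json
import random

def sample_requests(sharegpt_path: str, num_requests: int, seed: int = 0):
    """Sample (prompt_len, decode_len) pairs from ShareGPT chat logs.

    Each human turn serves as a prefill prompt, and the length of the
    following assistant turn is used as the decoding length.
    """
    random.seed(seed)
    with open(sharegpt_path) as f:
        conversations = json.load(f)

    pairs = []
    for conv in conversations:
        turns = conv.get("conversations", [])
        for i in range(len(turns) - 1):
            if turns[i]["from"] == "human" and turns[i + 1]["from"] == "gpt":
                # Approximate token counts with whitespace tokens here;
                # a real harness would use the model's tokenizer.
                prompt_len = len(turns[i]["value"].split())
                decode_len = len(turns[i + 1]["value"].split())
                pairs.append((prompt_len, decode_len))

    return random.sample(pairs, min(num_requests, len(pairs)))
```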

The backbone inference workflow of Punica is based on PyTorch and Hugging Face Transformers; we only substitute the corresponding kernels for each tested method. For the W8A8 evaluation, we apply SmoothQuant and also replace the attention kernel with the FP8 FlashInfer implementation. For the W4A16 evaluation, we use kernels from AWQ and replace all linear layers with their AWQ-quantized versions.
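A minimal sketch of how such a layer substitution can be wired up; `swap_linear_layers` and the `make_quant_linear` factory are hypothetical names, and the actual quantized modules come from the AWQ/SmoothQuant kernels mentioned above:

```python
import torch.nn as nn

def swap_linear_layers(model: nn.Module, make_quant_linear):
    """Recursively replace every nn.Linear with a quantized drop-in.

    `make_quant_linear` builds the replacement (e.g. an AWQ W4A16 or
    SmoothQuant W8A8 linear) from the original module, preserving its
    input/output shape so the rest of the model is untouched.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, make_quant_linear(child))
        else:
            swap_linear_layers(child, make_quant_linear)
```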

Note that the current codebase is intended for efficiency evaluation only: we use random weights (plus other shortcuts), so the model produces no meaningful output.
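For illustration, the weight randomization can be as simple as the following sketch (`randomize_weights` is a hypothetical helper; shapes and dtypes, and hence kernel timings, are unaffected):

```python
import torch

@torch.no_grad()
def randomize_weights(model: torch.nn.Module):
    # Fill all parameters with random values: this keeps tensor shapes
    # and dtypes (and therefore kernel performance) intact, but discards
    # learned weights, so any generated text is meaningless.
    for p in model.parameters():
        p.normal_(mean=0.0, std=0.02)
```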

## Usage Instructions

Check the README.md in each folder for detailed instructions.

## Results

*(Figure: e2e — end-to-end efficiency results.)*