Insights: vllm-project/vllm
September 20, 2024 – September 27, 2024
Overview
1 Release published by 1 person
- v0.6.2 published Sep 25, 2024
98 Pull requests merged by 58 people
-
[Bugfix] fix for deepseek w4a16
#8906 merged
Sep 27, 2024 -
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method
#7271 merged
Sep 27, 2024 -
[torch.compile] use empty tensor instead of None for profiling
#8875 merged
Sep 27, 2024 -
[TPU] Update pallas.py to support trillium
#8871 merged
Sep 27, 2024 -
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1
#8892 merged
Sep 27, 2024
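A minimal sketch of the batched multimodal setup this fix targets; the model id, image path, and prompts are illustrative rather than taken from the PR:

```python
# Hedged sketch: batched Fuyu inference with max_num_seqs > 1, the
# configuration this bugfix targets. Model id and image are placeholders.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="adept/fuyu-8b", max_num_seqs=2)  # >1 exercised the bug

image = Image.open("example.jpg")  # placeholder image path
prompts = [
    {"prompt": "What is shown in the image?\n", "multi_modal_data": {"image": image}},
    {"prompt": "Describe the main colors.\n", "multi_modal_data": {"image": image}},
]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```
-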
[MISC] Fix invalid escape sequence '\'
#8830 merged
Sep 27, 2024 -
[misc] fix collect env
#8894 merged
Sep 27, 2024 -
[Core] Rename PromptInputs and inputs with backward compatibility
#8876 merged
Sep 27, 2024 -
[Feature] Add support for Llama 3.1 and 3.2 tool use
#8343 merged
Sep 27, 2024 -
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility
#8764 merged
Sep 26, 2024 -
[BugFix] Fix test breakages from transformers 4.45 upgrade
#8829 merged
Sep 26, 2024 -
[Bugfix] Fixup advance_step.cu warning
#8815 merged
Sep 26, 2024 -
fix validation: Only set tool_choice auto if at least one tool is provided
#8568 merged
Sep 26, 2024 -
[Bugfix] Fix print_warning_once's line info
#8867 merged
Sep 26, 2024 -
[Misc] Change dummy profiling and BOS fallback warns to log once
#8820 merged
Sep 26, 2024 -
[Bugfix] Include encoder prompts len to non-stream api usage response
#8861 merged
Sep 26, 2024 -
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM
#8872 merged
Sep 26, 2024 -
[misc][installation] build from source without compilation
#8818 merged
Sep 26, 2024 -
[CI/Build] Fix missing ci dependencies
#8834 merged
Sep 26, 2024 -
[Docs] Add README to the build docker image
#8825 merged
Sep 26, 2024 -
[Build/CI] Upgrade to gcc 10 in the base build Docker image
#8814 merged
Sep 26, 2024 -
[Misc] Update config loading for Qwen2-VL and remove Granite
#8837 merged
Sep 26, 2024 -
[Misc] Support quantization of MllamaForCausalLM
#8822 merged
Sep 25, 2024 -
[Doc] Update doc for Transformers 4.45
#8817 merged
Sep 25, 2024 -
[Model] Add support for the multi-modal Llama 3.2 model
#8811 merged
Sep 25, 2024 -
Revert "rename PromptInputs and inputs with backward compatibility (#8760)
#8810 merged
Sep 25, 2024 -
[Misc] Support FP8 MoE for compressed-tensors
#8588 merged
Sep 25, 2024 -
[Frontend] MQLLMEngine supports profiling.
#8761 merged
Sep 25, 2024 -
[Core] Rename PromptInputs and inputs, with backwards compatibility
#8760 merged
Sep 25, 2024 -
[VLM][Bugfix] enable internvl running with num_scheduler_steps > 1
#8614 merged
Sep 25, 2024 -
[Misc] Add extra deps for openai server image
#8792 merged
Sep 25, 2024 -
[Kernel] Fullgraph and opcheck tests
#8479 merged
Sep 25, 2024 -
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade
#8777 merged
Sep 25, 2024 -
[Misc] Fix minor typo in scheduler
#8765 merged
Sep 25, 2024 -
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node
#8767 merged
Sep 25, 2024 -
[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer
#8672 merged
Sep 25, 2024 -
[Bugfix] load fc bias from config for eagle
#8790 merged
Sep 25, 2024 -
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend
#8770 merged
Sep 25, 2024 -
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv
#8250 merged
Sep 25, 2024 -
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2
#8752 merged
Sep 25, 2024 -
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal
#8776 merged
Sep 25, 2024 -
Fix test_schedule_swapped_simple in test_scheduler.py
#8780 merged
Sep 25, 2024 -
[Bugfix] Use heartbeats instead of health checks
#8583 merged
Sep 25, 2024 -
[Core] Adding Priority Scheduling
#5958 merged
Sep 25, 2024
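A hedged sketch of how the new priority scheduling might be used; the scheduling_policy engine argument and the per-request priority values are assumptions based on the PR title, so check #5958 for the exact interface:

```python
# Sketch under assumptions: scheduling_policy="priority" enables the new
# policy, and lower priority values are scheduled first. Model id is
# illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", scheduling_policy="priority")

outputs = llm.generate(
    ["a long background summarization job", "a latency-sensitive query"],
    SamplingParams(max_tokens=32),
    priority=[10, 0],  # assumed: one priority per prompt, lower runs first
)
for out in outputs:
    print(out.outputs[0].text)
```
-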
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding
#8047 merged
Sep 25, 2024 -
[Misc] Upgrade bitsandbytes to the latest version 0.44.0
#8768 merged
Sep 25, 2024 -
[misc] soft drop beam search
#8763 merged
Sep 24, 2024 -
[CI/Build] fix setuptools-scm usage
#8771 merged
Sep 24, 2024 -
[Bugfix] Fix torch dynamo fixes caused by replace_parameters
#8748 merged
Sep 24, 2024 -
[Frontend] Batch inference for llm.chat() API
#8648 merged
Sep 24, 2024
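A minimal sketch of the batched llm.chat() usage this PR enables, passing a list of conversations in one call; the model id is illustrative:

```python
# Hedged sketch of batched chat from #8648: llm.chat() accepting a list
# of conversations instead of a single one.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
conversations = [
    [{"role": "user", "content": "Summarize the benefits of KV caching."}],
    [{"role": "user", "content": "What is tensor parallelism?"}],
]
for out in llm.chat(conversations, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```
-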
[Kernel] Split Marlin MoE kernels into multiple files
#8661 merged
Sep 24, 2024 -
[Bugfix] Fix potentially unsafe custom allreduce synchronization
#8558 merged
Sep 24, 2024 -
[Model] Expose Phi3v num_crops as a mm_processor_kwarg
#8658 merged
Sep 24, 2024
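Together with #8657 below, this lets callers override multimodal processor settings at engine construction; a hedged sketch, with the num_crops value chosen purely for illustration:

```python
# Sketch combining #8657 (mm_processor_kwargs plumbing) and this PR
# (Phi3v num_crops): override the HF processor's num_crops at startup.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,
    mm_processor_kwargs={"num_crops": 16},  # illustrative value
)
```
-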
[Core][Model] Support loading weights by ID within models
#7931 merged
Sep 24, 2024 -
[MISC] Skip dumping inputs when unpicklable
#8744 merged
Sep 24, 2024 -
Revert "[Core] Rename
PromptInputs
toPromptType
, andinputs
toprompt
"#8750 merged
Sep 24, 2024 -
re-implement beam search on top of vllm core
#8726 merged
Sep 24, 2024 -
Fix tests in test_scheduler.py that fail with BlockManager V2
#8728 merged
Sep 24, 2024 -
[Hardware][AMD] ROCm6.2 upgrade
#8674 merged
Sep 24, 2024 -
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse
#8335 merged
Sep 23, 2024 -
Fix typical acceptance sampler with correct recovered token ids
#8562 merged
Sep 23, 2024 -
[Core] Allow IPv6 in VLLM_HOST_IP with zmq
#8575 merged
Sep 23, 2024 -
[Kernel][LoRA] Add assertion for punica sgmv kernels
#7585 merged
Sep 23, 2024 -
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin
#7701 merged
Sep 23, 2024 -
[CI/Build] use setuptools-scm to set __version__
#4738 merged
Sep 23, 2024 -
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size
#8707 merged
Sep 23, 2024 -
[Model] Support pp for qwen2-vl
#8696 merged
Sep 23, 2024 -
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner
#8733 merged
Sep 23, 2024 -
[Hardware][CPU] Refactor CPU model runner
#8729 merged
Sep 23, 2024 -
[Core][Frontend] Support Passing Multimodal Processor Kwargs
#8657 merged
Sep 23, 2024 -
[Bugfix] fix docker build for xpu
#8652 merged
Sep 23, 2024 -
[Bugfix] Fix CPU CMake build
#8723 merged
Sep 23, 2024 -
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building
#8702 merged
Sep 23, 2024 -
[misc] upgrade mistral-common
#8715 merged
Sep 22, 2024 -
[build] enable existing pytorch (for GH200, aarch64, nightly)
#8713 merged
Sep 22, 2024 -
[SpecDec][Misc] Cleanup, remove bonus token logic.
#8701 merged
Sep 22, 2024 -
[Model][VLM] Add LLaVA-Onevision model support
#8486 merged
Sep 22, 2024 -
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler
#8703 merged
Sep 22, 2024 -
[Misc] Use NamedTuple in Multi-image example
#8705 merged
Sep 22, 2024 -
[Model] Refactor BLIP/BLIP-2 to support composite model loading
#8407 merged
Sep 22, 2024 -
[ci][build] fix vllm-flash-attn
#8699 merged
Sep 22, 2024 -
[Bugfix] Refactor composite weight loading logic
#8656 merged
Sep 22, 2024 -
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu
#8643 merged
Sep 21, 2024 -
[dbrx] refactor dbrx experts to extend FusedMoe class
#8518 merged
Sep 21, 2024 -
[Doc] Fix typo in AMD installation guide
#8689 merged
Sep 21, 2024 -
[VLM] Use SequenceData.from_token_counts to create dummy data
#8687 merged
Sep 21, 2024 -
[Kernel] Build flash-attn from source
#8245 merged
Sep 21, 2024 -
[beam search] add output for manually checking the correctness
#8684 merged
Sep 21, 2024 -
[Core] Factor out common code in SequenceData and Sequence
#8675 merged
Sep 21, 2024 -
[MISC] add support for custom_op check
#8557 merged
Sep 21, 2024 -
[Core] Rename PromptInputs to PromptType, and inputs to prompt
#8673 merged
Sep 21, 2024 -
[Bugfix] Fix incorrect llava next feature size calculation
#8496 merged
Sep 20, 2024 -
[Hardware][AWS] update neuron to 2.20
#8676 merged
Sep 20, 2024 -
[Doc] neuron documentation update
#8671 merged
Sep 20, 2024 -
[Bugfix][Core] Fix tekken edge case for mistral tokenizer
#8640 merged
Sep 20, 2024 -
[Bugfix] Config.__init__() got an unexpected keyword argument 'engine' api_server args
#8556 merged
Sep 20, 2024 -
[Misc] Show AMD GPU topology in collect_env.py
#8649 merged
Sep 20, 2024
52 Pull requests opened by 41 people
-
[Core] Rename input data types
#8688 opened
Sep 21, 2024 -
[MISC] Support multi node inference with Neuron
#8692 opened
Sep 21, 2024 -
[Core] Enable Memory Tiering for vLLM
#8694 opened
Sep 21, 2024 -
[Core] Deprecate block manager v1 and make block manager v2 the default
#8704 opened
Sep 22, 2024 -
[Bugfix] fix tool_parser error handling when serving a model that does not support it
#8709 opened
Sep 22, 2024 -
[Kernel][Hardware][AMD][ROCm] Fix rocm/attention.cu compilation on ROCm 6.0.3
#8714 opened
Sep 22, 2024 -
[Core][VLM] Support registration for OOT multimodal models
#8717 opened
Sep 22, 2024 -
[Core] Disaggregated prefilling supports valkey
#8724 opened
Sep 23, 2024 -
[Misc] Add conftest plugin for applying forking decorator
#8727 opened
Sep 23, 2024 -
deepseek model use FusedMoE
#8737 opened
Sep 23, 2024 -
Add LlamaForSequenceClassification model
#8740 opened
Sep 23, 2024 -
[Bugfix] Fix Marlin MoE act order when is_k_full == False
#8741 opened
Sep 23, 2024 -
[Hardware][Neuron] Add on-device sampling support for Neuron
#8746 opened
Sep 23, 2024 -
[Kernel][Quantization] Custom Floating-Point Runtime Quantization
#8751 opened
Sep 23, 2024 -
[CI/Build] migrate project metadata from setup.py to pyproject.toml
#8772 opened
Sep 24, 2024 -
[Bugfix] No num_gpus for ROCm and XPU when connecting to a ray cluster
#8781 opened
Sep 24, 2024 -
Add RWKV v5 (Eagle) support
#8787 opened
Sep 25, 2024 -
[ci] Add CODEOWNERS for test directories
#8795 opened
Sep 25, 2024 -
[Bug] Fix bug in convert_fp8
#8797 opened
Sep 25, 2024 -
[do-not-merge] test PR for pipeline generator
#8798 opened
Sep 25, 2024 -
[Kernel] Enable BFloat16 inputs in fused Marlin MoE kernels
#8800 opened
Sep 25, 2024 -
[WIP][Kernel] Dynamic group blocks in Marlin MoE kernels
#8801 opened
Sep 25, 2024 -
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching
#8804 opened
Sep 25, 2024 -
[Core] Improve choice of Python multiprocessing method
#8823 opened
Sep 25, 2024 -
[Bugfix] Block manager v2 with preemption and lookahead slots
#8824 opened
Sep 25, 2024 -
[Frontend] Log the maximum supported concurrency
#8831 opened
Sep 26, 2024 -
[Spec Decode] (1/2) Remove batch expansion
#8839 opened
Sep 26, 2024 -
[WIP] Dev build time improvements
#8845 opened
Sep 26, 2024 -
[Core] Priority-based scheduling in async engine
#8850 opened
Sep 26, 2024 -
support input embeddings for qwen2vl
#8856 opened
Sep 26, 2024 -
[WIP][Core] Refactor GGUF parameters packing and forwarding
#8859 opened
Sep 26, 2024 -
[WIP][Kernel] A100 FP8 Quantization Method and Kernel for PhiMOE
#8860 opened
Sep 26, 2024 -
[Core] Avoid metrics log noise when idle
#8868 opened
Sep 26, 2024 -
[BugFix] Fix seeded random sampling with encoder-decoder models
#8870 opened
Sep 26, 2024 -
[CI/Build] Update models tests & examples
#8874 opened
Sep 26, 2024 -
[Bugfix] fix #8630
#8880 opened
Sep 27, 2024 -
[Bugfix] Fix multi nodes TP+PP for XPU
#8884 opened
Sep 27, 2024 -
[Bugfix] Fix PP for Multi-Step
#8887 opened
Sep 27, 2024 -
[MISC] add a flag --lazy-capture-cuda-graph so that CUDA graph capture happens only on demand
#8888 opened
Sep 27, 2024 -
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1
#8891 opened
Sep 27, 2024 -
[Model] Support Qwen2.5-Math-RM-72B
#8896 opened
Sep 27, 2024 -
[Hardware][intel GPU] add async output process for xpu
#8897 opened
Sep 27, 2024 -
[CI/Build] setuptools-scm fixes
#8900 opened
Sep 27, 2024 -
[Core] LLMEngine removes `sampling_params = sampling_params.clone()`
#8901 opened
Sep 27, 2024 -
[Bugfix][VLM] Add multi-video support for LLaVA-Onevision model
#8905 opened
Sep 27, 2024 -
[Misc] Directly use compressed-tensors for checkpoint definitions
#8909 opened
Sep 27, 2024 -
[Core] Support all head sizes up to 256 with FlashAttention backend
#8910 opened
Sep 27, 2024 -
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag
#8911 opened
Sep 27, 2024
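A hedged sketch of using the new flag; the variable name comes from the PR title, and the bypass semantics are assumed from it:

```python
# Sketch under assumptions: setting VLLM_SKIP_P2P_CHECK before engine
# start skips the GPU peer-to-peer capability probe (per the PR title).
import os

os.environ["VLLM_SKIP_P2P_CHECK"] = "1"  # set before creating the engine

from vllm import LLM

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)  # illustrative
```
-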
[Misc] Separate total and output tokens in benchmark_throughput.py
#8914 opened
Sep 27, 2024 -
Add stream support for Granite 20b Tool Use
#8915 opened
Sep 27, 2024
77 Issues closed by 34 people
-
[Bug]: AssertionError when deploying the API server for Qwen2-VL-72B
#8895 closed
Sep 27, 2024 -
[Bug]: Tool calling on Llama 3.1/3.2 fails with KeyError: '<tool_call>'
#8912 closed
Sep 27, 2024 -
[Bug]: 8xV100 gpus: Failed to infer device type
#8885 closed
Sep 27, 2024 -
RuntimeError on ROCm
#2580 closed
Sep 27, 2024 -
[Bug]: error: triton_flash_attention.py
#5696 closed
Sep 27, 2024 -
[Bug]: HIP error: invalid argument in cudaMemGetInfo
#5994 closed
Sep 27, 2024 -
[Usage]: DOCKER - Getting OOM while running `meta-llama/Llama-3.2-11B-Vision-Instruct`
#8903 closed
Sep 27, 2024 -
[Bug]: KeyError: 'type'. when inferencing Llama 3.2 3B Instruct
#8855 closed
Sep 27, 2024 -
[Usage]: OOM when using Llama-3.2-11B-Vision-Instruct
#8879 closed
Sep 27, 2024 -
[Feature]: Expose Lora lineage information from /v1/models
#6274 closed
Sep 27, 2024 -
0.4.3 error CUDA error: an illegal memory access was encountered
#5376 closed
Sep 27, 2024 -
[Feature]: Support system messages for Multi Modal models
#8854 closed
Sep 27, 2024 -
[CI/Build]: Version v0.6.2 lacks the whl package
#8832 closed
Sep 26, 2024 -
[Model]: Does vllm currently support the Llama-3.1-405B-Instruct multimodal?
#7503 closed
Sep 26, 2024 -
[Feature]: Supporting MultiModal inputs using Llama3.1
#8146 closed
Sep 26, 2024 -
[Bug]: num_scheduler_steps > 1, n > 1 raise error
#8261 closed
Sep 26, 2024 -
[Performance]: Extremely low throughput
#8847 closed
Sep 26, 2024 -
[New Model]: Llama 3.2
#8812 closed
Sep 25, 2024 -
[Bug]: Docker image for 0.5.4 does not include package timm==0.9.10 to run MiniCPMV
#8107 closed
Sep 25, 2024 -
[Bug]: ModuleNotFoundError: No module named 'bitsandbytes'
#5503 closed
Sep 25, 2024 -
[Doc]: ROCm installation instructions do not work
#6762 closed
Sep 25, 2024 -
Fp8 support for mi300x
#6576 closed
Sep 25, 2024 -
[Bug]: vllm cpu installation build from source error
#8095 closed
Sep 25, 2024 -
[Doc]: Is Qwen2-VL-72B supported?
#8682 closed
Sep 25, 2024 -
[Bug]: lm-format-enforcer guided decoding kills MQLLMEngine
#8578 closed
Sep 25, 2024 -
[RFC]: Priority Scheduling
#6077 closed
Sep 25, 2024 -
[Bug]: Requesting Prompt Logprobs with an MLP Speculator Crashes the Server
#7742 closed
Sep 25, 2024 -
[Bug]: vllm async engine can not use adag
#8158 closed
Sep 24, 2024 -
[Bug]: Shutdown problem when we use ADAG
#8208 closed
Sep 24, 2024 -
[Bug]: AssertionError when loading Qwen 2.5 GGUF q3 model in vLLM
#8697 closed
Sep 24, 2024 -
[Feature]: Batch inference for `llm.chat()` API
#8481 closed
Sep 24, 2024 -
[Bug]: OLMoForCausalLM not supported
#8753 closed
Sep 24, 2024 -
[Usage]:
#8569 closed
Sep 24, 2024 -
Support for RLHF (ILQL)-trained Models
#841 closed
Sep 24, 2024 -
[Bug]: output is empty
#8775 closed
Sep 24, 2024 -
[Usage]: output was empty
#8774 closed
Sep 24, 2024 -
[Usage]: Question about dequantization
#8759 closed
Sep 24, 2024 -
[Misc]: Memory Order in Custom Allreduce
#8404 closed
Sep 24, 2024 -
[Bug]: Obvious hang caused by Custom All Reduce OP(Valuable Debug Info Obtained)
#8410 closed
Sep 24, 2024 -
[Usage]: set num_crops in LVLM
#7861 closed
Sep 24, 2024 -
[Bug]: Server crashes when kv cache exhausted
#8738 closed
Sep 24, 2024 -
[Feature] compile triton kernels ahead of time
#8712 closed
Sep 24, 2024 -
[Bug]: tensor parallel processes not working in vllm_cpu
#8756 closed
Sep 24, 2024 -
[Performance]: Add weaker memory fence for custom allreduce
#8457 closed
Sep 24, 2024 -
[Usage]: multimodal large models load local image files
#8730 closed
Sep 23, 2024 -
[Bug]: Error when using --tensor-parallel-size 4 on Qwen2.5-72B-Instruct
#8691 closed
Sep 23, 2024 -
[Bug]: torch.OutOfMemoryError: CUDA out of memory.
#8721 closed
Sep 23, 2024 -
[Doc]: Using LoRA adapters
#8725 closed
Sep 23, 2024 -
[Bug]: InternVl2-8B-AWQ gives error when trying to run with vllm-openai cuda 11.8 docker image
#8736 closed
Sep 23, 2024 -
[Bug]: Installation with XPU fails with Dockerfile and when building from source
#8563 closed
Sep 23, 2024 -
[Installation]: vllm CPU mode build failed
#8710 closed
Sep 23, 2024 -
[Bug]: vllm deploy medusa, draft acceptance rate: 0.000
#8620 closed
Sep 23, 2024 -
[Misc]: Uneven performance
#8719 closed
Sep 23, 2024 -
[Bug]: AttributeError: module 'cv2.dnn' has no attribute 'DictValue'
#8650 closed
Sep 22, 2024 -
ARM aarch-64 server build failed (host OS: Ubuntu22.04.3)
#2021 closed
Sep 22, 2024 -
[Usage]: Weird vram usage and increase in use
#8504 closed
Sep 22, 2024 -
[New Model]: LLaVA-OneVision
#7420 closed
Sep 22, 2024 -
When running pytest tests/, undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
#3228 closed
Sep 22, 2024 -
Support for production grade server for Inference [Gunicorn vs Unicorn]?
#2573 closed
Sep 21, 2024 -
[RFC]: Build `vllm-flash-attn` from source
#8002 closed
Sep 21, 2024 -
qwen2-vl: AttributeError: '_OpNamespace' '_C' object has no attribute 'gelu_quick'
#8624 closed
Sep 21, 2024 -
[Installation]: cannot install vllm on GPU
#8665 closed
Sep 21, 2024 -
[Bug]: Docker build for ROCm fails for latest release and main branch
#7813 closed
Sep 21, 2024 -
Failed to build from source on ROCm (with pytorch and xformers working correctly)
#3067 closed
Sep 21, 2024 -
[Usage]: Number of requests currently in the queue
#8617 closed
Sep 20, 2024 -
[Bug]: aisingapore/sea-lion-7b-instruct fails with assert config.embedding_fraction == 1.0
#3523 closed
Sep 20, 2024 -
Mixtral 4x 4090 OOM
#3285 closed
Sep 20, 2024 -
Low VRAM batch processing mode
#1297 closed
Sep 20, 2024 -
Add worker registry service for hosting multiple vllm model through single api gateway
#1753 closed
Sep 20, 2024 -
Use O3 optimization instead of O2 for CUDA compilation?
#67 closed
Sep 20, 2024 -
Modify the current PyTorch model to C++
#42 closed
Sep 20, 2024 -
[Usage]: VLLM serve Gemma 2 9B it with more than 4096 tokens
#8680 closed
Sep 20, 2024 -
Inquiry Regarding vLLM Support for Mac Metal API
#2081 closed
Sep 20, 2024 -
vLLM ignores my requests when I increase the number of concurrent requests
#2752 closed
Sep 20, 2024 -
Faster model loading
#474 closed
Sep 20, 2024
81 Issues opened by 72 people
-
[RFC]: QuantizationConfig and QuantizeMethodBase Refactor for Simplifying Kernel Integrations
#8913 opened
Sep 27, 2024 -
[Usage]: LLM with tensor_parallel_size larger than the number of GPUs in one node
#8908 opened
Sep 27, 2024 -
[Usage]: guided_regex in offline model
#8907 opened
Sep 27, 2024 -
[Bug]: Tokenization Mismatch Between HuggingFace and vLLM
#8904 opened
Sep 27, 2024 -
[Feature]: Guided Decoding Schema Cache Store
#8902 opened
Sep 27, 2024 -
[Performance]: Talk about the model parallelism
#8898 opened
Sep 27, 2024 -
[Bug]: Variance Between Multiple Prefix Cache Example runs
#8890 opened
Sep 27, 2024 -
[Bug]: assert len(self._async_stopped) == 0
#8881 opened
Sep 27, 2024 -
[Installation]: Cannot compile flash attention when building from source
#8878 opened
Sep 27, 2024 -
[Bug]: The --quantization=awq startup parameter causes a restart
#8877 opened
Sep 27, 2024 -
[Bug]: Server - `aqlm` fails with `--cpu-offload-gb`
#8873 opened
Sep 26, 2024 -
[Feature]: Add model context information to chat template
#8869 opened
Sep 26, 2024 -
[Performance]: Slowdown compared to Gradio
#8866 opened
Sep 26, 2024 -
[Bug]: configurably disable prompt echo
#8864 opened
Sep 26, 2024 -
[Usage]: RuntimeError: Failed to infer device type (Intel Iris Xe Graphics)
#8863 opened
Sep 26, 2024 -
[Installation]: can't install on CPU (AMD Ryzen 7 PRO 8700GE, Ubuntu)
#8862 opened
Sep 26, 2024 -
[Bug]: Assert Error: len(seqs) == 1
#8858 opened
Sep 26, 2024 -
[Feature]: Support image embeddings as input for qwen2vl
#8857 opened
Sep 26, 2024 -
[Bug]: 0.6.2 OpenAI server runs out of memory for a previously stable setup
#8853 opened
Sep 26, 2024 -
[Installation]: Meet bugs when installing from source
#8852 opened
Sep 26, 2024 -
[Usage]:
#8851 opened
Sep 26, 2024 -
[Bug]: VLLM does not support EAGLE Spec Decode when deploying EAGLE-Qwen2-7B-Instruct model
#8849 opened
Sep 26, 2024 -
[Bug]: We tested Qwen2.5_3b on opencompass and vllm, and the results are very different
#8846 opened
Sep 26, 2024 -
[Bug]: Assertion error when inferencing with sample_n>1 and preemption occurs
#8844 opened
Sep 26, 2024 -
[Feature]: Support for compiled model graph for TPUs
#8843 opened
Sep 26, 2024 -
[Bug]: exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
#8840 opened
Sep 26, 2024 -
[Bug]: TimeoutError: MQLLMEngine didn't reply within 10000ms
#8836 opened
Sep 26, 2024 -
[Feature]: Samplers Order support
#8835 opened
Sep 26, 2024 -
Llama3.2 Vision Model: Guides and Issues
#8826 opened
Sep 25, 2024 -
[issue tracker] make vllm compatible with dynamo
#8821 opened
Sep 25, 2024 -
[Bug]: Later versions show degradation in the `vllm:time_to_first_token_seconds_sum` metric
#8819 opened
Sep 25, 2024 -
[Usage]: How does VLLM allocate memory
#8816 opened
Sep 25, 2024 -
[Usage]: How to use BitsAndBytesConfig with vllm serve
#8813 opened
Sep 25, 2024 -
[New Model]: allenai/Molmo-7B-0-0924 VisionLM
#8808 opened
Sep 25, 2024 -
[Bug]: Decreased generation quality with Mixtral
#8807 opened
Sep 25, 2024 -
[Usage]: Train Lora with frozen vllm model
#8806 opened
Sep 25, 2024 -
[Bug]: vllm api server return escaped unicode string in guided backend 'outlines'
#8805 opened
Sep 25, 2024 -
[Misc]: Strange `leaked shared_memory` warnings reported by multiprocessing when using vLLM
#8803 opened
Sep 25, 2024 -
[Feature]: LoRA support for Pixtral
#8802 opened
Sep 25, 2024 -
[Bug]: Loading a model with bitsandbytes 8bit quantization
#8799 opened
Sep 25, 2024 -
[Bug]: Bug in the convert_fp8 function, a function for testing.
#8796 opened
Sep 25, 2024 -
[Doc]: Is Qwen2.5's long context YARN handled?
#8793 opened
Sep 25, 2024 -
[Bug]: Port binding failure when using pp > 1 after commit 7c7714d856eee6fa94aade729b67f00584f72a4c
#8791 opened
Sep 25, 2024 -
[Installation]: Installing vLLM on ROCm - Distro:Gentoo
#8788 opened
Sep 25, 2024 -
[Tracking Issue][Help Wanted]: FlashInfer backend improvements
#8786 opened
Sep 24, 2024 -
[Bug]: Disabling Marlin by setting --quantization gptq doesn't work when using a draft model
#8784 opened
Sep 24, 2024 -
[Bug]: Decode n tokens gives different output for first seq position compared to decode 1 token
#8783 opened
Sep 24, 2024 -
[RFC]: Add Goodput Metric to Benchmark Serving
#8782 opened
Sep 24, 2024 -
vLLM's V2 Engine Architecture
#8779 opened
Sep 24, 2024 -
[Bug]: LLMEngine cannot be pickled error vllm 0.6.1.post2
#8778 opened
Sep 24, 2024 -
[Usage]: output was empty
#8773 opened
Sep 24, 2024 -
[Usage]: Total generated tokens in benchmarking script
#8769 opened
Sep 24, 2024 -
[Usage]: how to acquire logits in vllm
#8762 opened
Sep 24, 2024 -
[Bug]: using cpu_offload_gb with GGUF fails
#8757 opened
Sep 24, 2024 -
[Bug]: Request error
#8755 opened
Sep 24, 2024 -
[Performance]: Analysis of performance dashboard movements
#8749 opened
Sep 23, 2024 -
[Bug]: OLMoE produces incorrect output with TP>1
#8747 opened
Sep 23, 2024 -
Error loading models since versions 0.6.1.x
#8745 opened
Sep 23, 2024 -
Why is the bitsandbytes model significantly slower than the AWQ model?
#8743 opened
Sep 23, 2024 -
[Feature]: Support Inference Overrides for mm_processor_kwargs
#8742 opened
Sep 23, 2024 -
[Misc]: Enable dependabot to help managing known vulnerabilities in dependencies
#8734 opened
Sep 23, 2024 -
[Usage]: speculative OutOfMemoryError:
#8731 opened
Sep 23, 2024 -
[Usage]: Loading a model with bitsandbytes quantization with 8bit
#8720 opened
Sep 23, 2024 -
[Misc]: Unit test failures with BlockManager v2
#8718 opened
Sep 22, 2024 -
[RFC]: quant llm from alpindale
#8716 opened
Sep 22, 2024 -
[Bug]: 4208 CPU, vllm 0.6.0 serving qwen-vl-7b throws the exception shown in the image below; the model outputs normally at first, but after multiple calls no results are returned
#8711 opened
Sep 22, 2024 -
Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher
#8708 opened
Sep 22, 2024 -
[Usage]: Is there any difference between max_tokens and max_model_len?
#8706 opened
Sep 22, 2024
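For reference, a minimal sketch of the distinction being asked about: max_model_len is an engine-level cap on the total context (prompt plus generation), while max_tokens is a per-request cap on generated tokens only:

```python
# max_model_len bounds prompt + output length for every request the
# engine accepts; max_tokens bounds the generated portion of one request.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", max_model_len=2048)  # engine-wide cap
params = SamplingParams(max_tokens=256)                   # per-request output cap
print(llm.generate(["The quick brown fox"], params)[0].outputs[0].text)
```
-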
[Feature]: Support for Seq classification/Reward models
#8700 opened
Sep 21, 2024 -
[Bug]: Low throughput on AMD MI250 using llama 3.1 (6 toks/s)
#8698 opened
Sep 21, 2024 -
[Bug]: Pixtral-12B not supported on CPU
#8693 opened
Sep 21, 2024 -
[Bug]: RuntimeError on A800 using vllm0.6.1.post2
#8686 opened
Sep 21, 2024 -
[New Model][Format]: Support the HF-version of Pixtral
#8685 opened
Sep 21, 2024 -
[Feature]: improve distributed backend selection
#8683 opened
Sep 20, 2024 -
[Bug]: QLoRA inference returns alternating output
#8681 opened
Sep 20, 2024
134 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Core] Implementing disaggregated prefilling, and caching KV cache in CPU/disk/database.
#8498 commented on
Sep 27, 2024 • 55 new comments -
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path
#8378 commented on
Sep 27, 2024 • 28 new comments -
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels
#8533 commented on
Sep 27, 2024 • 27 new comments -
[Bugfix] Fix LongRoPE bug
#8254 commented on
Sep 25, 2024 • 14 new comments -
[OpenVINO] Enable GPU support for OpenVINO vLLM backend
#8192 commented on
Sep 27, 2024 • 14 new comments -
[Model] Support Mamba
#6484 commented on
Sep 25, 2024 • 11 new comments -
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model
#8405 commented on
Sep 27, 2024 • 11 new comments -
[Model][LoRA] LoRA support added for MiniCPMV2.5
#7199 commented on
Sep 27, 2024 • 10 new comments -
Adding Cascade Infer to FlashInfer
#8132 commented on
Sep 27, 2024 • 10 new comments -
[Model] Add GLM-4v support and meet vllm==0.6.1.post2+cu123
#8663 commented on
Sep 25, 2024 • 9 new comments -
[Hardware][CPU] Support AWQ for CPU backend
#7515 commented on
Sep 24, 2024 • 6 new comments -
[Doc]: Add deploying_with_k8s guide
#8451 commented on
Sep 26, 2024 • 6 new comments -
[Bugfix][Intel] Fix XPU Dockerfile Build
#7824 commented on
Sep 27, 2024 • 3 new comments -
[Core][VLM] Add precise multi-modal placeholder tracking
#8346 commented on
Sep 27, 2024 • 2 new comments -
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang
#7412 commented on
Sep 22, 2024 • 2 new comments -
[Model] Adding Granite MoE.
#8206 commented on
Sep 26, 2024 • 2 new comments -
[Model] MLPSpeculator quantization support
#8476 commented on
Sep 23, 2024 • 1 new comment -
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing
#8537 commented on
Sep 23, 2024 • 1 new comment -
[Misc] add non cuda hf benchmark_throughput
#8653 commented on
Sep 20, 2024 • 1 new comment -
[Bug]: Multistep with n>1 Fails
#7968 commented on
Sep 27, 2024 • 0 new comments -
[Installation]: Issues with installing vLLM on ROCM without sudo access
#8042 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: Using FlashInfer with FP8 model with FP8 KV cache produces an error
#8641 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: No module named `jsonschema.protocols`.
#6486 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: Trailing newline as outputs
#8020 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: sending request using response_format json twice breaks vLLM
#4070 commented on
Sep 27, 2024 • 0 new comments -
[docs] add load balancing examples
#1837 commented on
Sep 25, 2024 • 0 new comments -
[Frontend][OpenAI] Add support for OpenAI tools calling
#4656 commented on
Sep 25, 2024 • 0 new comments -
Heterogeneous Speculative Decoding (CPU + GPU)
#5065 commented on
Sep 27, 2024 • 0 new comments -
[Model] Bert Embedding Model
#5447 commented on
Sep 26, 2024 • 0 new comments -
[RFC]: Reimplement and separate beam search on top of vLLM core
#8306 commented on
Sep 27, 2024 • 0 new comments -
[RFC]: Automate Speculative Decoding
#4565 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: No available block found in 60 second in shm
#6614 commented on
Sep 27, 2024 • 0 new comments -
[Feature]: support out tree multimodal models
#8667 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: When enabling LoRA, greedy search got different answers.
#7977 commented on
Sep 27, 2024 • 0 new comments -
[RFC]: Multi-modality Support Refactoring
#4194 commented on
Sep 26, 2024 • 0 new comments -
[RFC]: Encoder/decoder models & feature compatibility
#7366 commented on
Sep 26, 2024 • 0 new comments -
[Feature]: Align the API with OAI's structured output
#7220 commented on
Sep 26, 2024 • 0 new comments -
[RFC]: Support encode only models by Workflow Defined Engine
#8453 commented on
Sep 26, 2024 • 0 new comments -
[Performance]: Why is the avg. generation throughput low?
#4760 commented on
Sep 26, 2024 • 0 new comments -
[Bug]: Running llama2-7b on H20, Floating point exception (core dumped) appears on float16
#4392 commented on
Sep 26, 2024 • 0 new comments -
[Feature]: FP6
#4515 commented on
Sep 25, 2024 • 0 new comments -
[Installation]: vllm on NVIDIA jetson AGX orin
#5640 commented on
Sep 25, 2024 • 0 new comments -
[Installation]: How to install vLLM on Jetson
#8485 commented on
Sep 25, 2024 • 0 new comments -
[Not to be Submitted] [WIP] Force Unit tests to run with BlockManager V2
#8678 commented on
Sep 20, 2024 • 0 new comments -
[Core] CUDA Graphs for Multi-Step Chunked Prefill
#8645 commented on
Sep 24, 2024 • 0 new comments -
[Bugfix] Handle `best_of>1` & `use_beam_search` by disabling multi-step scheduling.
#8637 commented on
Sep 23, 2024 • 0 new comments -
ppc64le: Dockerfile and CI fix
#8529 commented on
Sep 27, 2024 • 0 new comments -
[Doc] Compatibility matrix for mutual exclusive features
#8512 commented on
Sep 25, 2024 • 0 new comments -
[Core]: Support encode only models by Workflow Defined Engine
#8452 commented on
Sep 27, 2024 • 0 new comments -
[Bugfix] Fix code for downloading models from modelscope
#8443 commented on
Sep 25, 2024 • 0 new comments -
[torch.compile] A simple solution to recursively compile loaded model: using phi3-small as an example
#8398 commented on
Sep 25, 2024 • 0 new comments -
[Model] tool calling support for ibm-granite/granite-20b-functioncalling
#8339 commented on
Sep 27, 2024 • 0 new comments -
[Frontend][Core] Move guided decoding params into sampling params
#8252 commented on
Sep 27, 2024 • 0 new comments -
[BugFix] Fix metrics error for --num-scheduler-steps > 1
#8234 commented on
Sep 26, 2024 • 0 new comments -
[Hardware][Ascend] Add Ascend NPU backend
#8054 commented on
Sep 27, 2024 • 0 new comments -
Roberta embedding
#7969 commented on
Sep 26, 2024 • 0 new comments -
`[Core]` Added streaming support to `LLM` Class
#7648 commented on
Sep 24, 2024 • 0 new comments -
[CI/Build] custom build backend and dynamic build dependencies
#7525 commented on
Sep 23, 2024 • 0 new comments -
[Core] Move detokenization to front-end process
#7402 commented on
Sep 23, 2024 • 0 new comments -
[Models] Add remaining model PP support
#7168 commented on
Sep 27, 2024 • 0 new comments -
[Frontend] Add readiness and liveness endpoints to OpenAI API server
#7078 commented on
Sep 24, 2024 • 0 new comments -
[Doc] Proofreading documentation
#6998 commented on
Sep 24, 2024 • 0 new comments -
[Core] generate from input embeds
#6869 commented on
Sep 27, 2024 • 0 new comments -
[BugFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Sep 26, 2024 • 0 new comments -
[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface
#6260 commented on
Sep 25, 2024 • 0 new comments -
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend
#6143 commented on
Sep 24, 2024 • 0 new comments -
Whisper support
#5964 commented on
Sep 21, 2024 • 0 new comments -
[Bug]: FP8 Marlin fallback out of memory regression
#7793 commented on
Sep 22, 2024 • 0 new comments -
[Installation]: error: can't copy 'build/lib.linux-x86_64-3.10/vllm/_core_C.abi3.so': doesn't exist or not a regular file
#8174 commented on
Sep 22, 2024 • 0 new comments -
[Bug]: In v0.6.0 and above, Some of monitoring metrics are not correct.
#8178 commented on
Sep 23, 2024 • 0 new comments -
Inconsistent Text Generation Results in Batch vs Individual Sentence Processing
#2568 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: [Usage]: is_xpu should return true when torch.xpu.is_available() is true even w/o IPEX
#8655 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
#6766 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: topk=1 and temperature=0 cause different output in vllm
#5404 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: CUDA illegal memory access error when `enable_prefix_caching=True`
#5537 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: JSONDecodeError when running vllm serve
#8668 commented on
Sep 23, 2024 • 0 new comments -
[Usage]: What's the minimum VRAM needed to use entire context length for Llama 3.1 70B and 405B
#8188 commented on
Sep 23, 2024 • 0 new comments -
[Feature]: APC introspection interface
#8523 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: RuntimeError in gptq_marlin_24_gemm
#8654 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: AsyncEngineDeadError: Task finished unexpectedly with qwen2 72b
#6208 commented on
Sep 23, 2024 • 0 new comments -
[New Model]: Support for allenai/OLMoE-1B-7B-0924
#8170 commented on
Sep 23, 2024 • 0 new comments -
[Feature]: Offline quantization for Pixtral-12B
#8566 commented on
Sep 23, 2024 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Sep 24, 2024 • 0 new comments -
[Bug]: Neuron + Vllm inference broken with backward incompatible change
#8677 commented on
Sep 20, 2024 • 0 new comments -
ExLlamaV2: exl2 support
#3203 commented on
Sep 20, 2024 • 0 new comments -
Question: Would a PR integrating ExLlamaV2 kernels with AWQ be accepted?
#2645 commented on
Sep 20, 2024 • 0 new comments -
[Bug]: OpenGVLab/InternVL2-Llama3-76B: view size is not compatible with input tensor's size and stride
#8630 commented on
Sep 20, 2024 • 0 new comments -
AWQ: Implement new kernels (64% faster decoding)
#3025 commented on