Insights: vllm-project/vllm
September 20, 2024 – September 27, 2024
Overview
1 Release published by 1 person
- v0.6.2 published Sep 25, 2024
98 Pull requests merged by 58 people
-
[Bugfix] fix for deepseek w4a16
#8906 merged
Sep 27, 2024 -
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method
#7271 merged
Sep 27, 2024 -
[torch.compile] use empty tensor instead of None for profiling
#8875 merged
Sep 27, 2024 -
[TPU] Update pallas.py to support trillium
#8871 merged
Sep 27, 2024 -
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1
#8892 merged
Sep 27, 2024
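A minimal sketch of the batched multimodal setup this fix targets; the model id, image path, and prompts are illustrative rather than taken from the PR:

```python
# Hedged sketch: batched Fuyu inference with max_num_seqs > 1, the
# configuration this bugfix targets. Model id and image are placeholders.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="adept/fuyu-8b", max_num_seqs=2)  # >1 exercised the bug

image = Image.open("example.jpg")  # placeholder image path
prompts = [
    {"prompt": "What is shown in the image?\n", "multi_modal_data": {"image": image}},
    {"prompt": "Describe the main colors.\n", "multi_modal_data": {"image": image}},
]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```
-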
[MISC] Fix invalid escape sequence '\'
#8830 merged
Sep 27, 2024 -
[misc] fix collect env
#8894 merged
Sep 27, 2024 -
[Core] Rename PromptInputs and inputs with backward compatibility
#8876 merged
Sep 27, 2024 -
[Feature] Add support for Llama 3.1 and 3.2 tool use
#8343 merged
Sep 27, 2024 -
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility
#8764 merged
Sep 26, 2024 -
[BugFix] Fix test breakages from transformers 4.45 upgrade
#8829 merged
Sep 26, 2024 -
[Bugfix] Fixup advance_step.cu warning
#8815 merged
Sep 26, 2024 -
fix validation: Only set tool_choice auto if at least one tool is provided
#8568 merged
Sep 26, 2024 -
[Bugfix] Fix print_warning_once's line info
#8867 merged
Sep 26, 2024 -
[Misc] Change dummy profiling and BOS fallback warns to log once
#8820 merged
Sep 26, 2024 -
[Bugfix] Include encoder prompts len to non-stream api usage response
#8861 merged
Sep 26, 2024 -
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM
#8872 merged
Sep 26, 2024 -
[misc][installation] build from source without compilation
#8818 merged
Sep 26, 2024 -
[CI/Build] Fix missing ci dependencies
#8834 merged
Sep 26, 2024 -
[Docs] Add README to the build docker image
#8825 merged
Sep 26, 2024 -
[Build/CI] Upgrade to gcc 10 in the base build Docker image
#8814 merged
Sep 26, 2024 -
[Misc] Update config loading for Qwen2-VL and remove Granite
#8837 merged
Sep 26, 2024 -
[Misc] Support quantization of MllamaForCausalLM
#8822 merged
Sep 25, 2024 -
[Doc] Update doc for Transformers 4.45
#8817 merged
Sep 25, 2024 -
[Model] Add support for the multi-modal Llama 3.2 model
#8811 merged
Sep 25, 2024 -
Revert "rename PromptInputs and inputs with backward compatibility (#8760)
#8810 merged
Sep 25, 2024 -
[Misc] Support FP8 MoE for compressed-tensors
#8588 merged
Sep 25, 2024 -
[Frontend] MQLLMEngine supports profiling.
#8761 merged
Sep 25, 2024 -
[Core] Rename PromptInputs and inputs, with backwards compatibility
#8760 merged
Sep 25, 2024 -
[VLM][Bugfix] enable internvl running with num_scheduler_steps > 1
#8614 merged
Sep 25, 2024 -
[Misc] Add extra deps for openai server image
#8792 merged
Sep 25, 2024 -
[Kernel] Fullgraph and opcheck tests
#8479 merged
Sep 25, 2024 -
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade
#8777 merged
Sep 25, 2024 -
[Misc] Fix minor typo in scheduler
#8765 merged
Sep 25, 2024 -
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node
#8767 merged
Sep 25, 2024 -
[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer
#8672 merged
Sep 25, 2024 -
[Bugfix] load fc bias from config for eagle
#8790 merged
Sep 25, 2024 -
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend
#8770 merged
Sep 25, 2024 -
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv
#8250 merged
Sep 25, 2024 -
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2
#8752 merged
Sep 25, 2024 -
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal
#8776 merged
Sep 25, 2024 -
Fix test_schedule_swapped_simple in test_scheduler.py
#8780 merged
Sep 25, 2024 -
[Bugfix] Use heartbeats instead of health checks
#8583 merged
Sep 25, 2024 -
[Core] Adding Priority Scheduling
#5958 merged
Sep 25, 2024
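A hedged sketch of how the new priority scheduling might be used; the scheduling_policy engine argument and the per-request priority values are assumptions based on the PR title, so check #5958 for the exact interface:

```python
# Sketch under assumptions: scheduling_policy="priority" enables the new
# policy, and lower priority values are scheduled first. Model id is
# illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", scheduling_policy="priority")

outputs = llm.generate(
    ["a long background summarization job", "a latency-sensitive query"],
    SamplingParams(max_tokens=32),
    priority=[10, 0],  # assumed: one priority per prompt, lower runs first
)
for out in outputs:
    print(out.outputs[0].text)
```
-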
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding
#8047 merged
Sep 25, 2024 -
[Misc] Upgrade bitsandbytes to the latest version 0.44.0
#8768 merged
Sep 25, 2024 -
[misc] soft drop beam search
#8763 merged
Sep 24, 2024 -
[CI/Build] fix setuptools-scm usage
#8771 merged
Sep 24, 2024 -
[Bugfix] Fix torch dynamo fixes caused by replace_parameters
#8748 merged
Sep 24, 2024 -
[Frontend] Batch inference for llm.chat() API
#8648 merged
Sep 24, 2024
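A minimal sketch of the batched llm.chat() usage this PR enables, passing a list of conversations in one call; the model id is illustrative:

```python
# Hedged sketch of batched chat from #8648: llm.chat() accepting a list
# of conversations instead of a single one.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
conversations = [
    [{"role": "user", "content": "Summarize the benefits of KV caching."}],
    [{"role": "user", "content": "What is tensor parallelism?"}],
]
for out in llm.chat(conversations, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```
-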
[Kernel] Split Marlin MoE kernels into multiple files
#8661 merged
Sep 24, 2024 -
[Bugfix] Fix potentially unsafe custom allreduce synchronization
#8558 merged
Sep 24, 2024 -
[Model] Expose Phi3v num_crops as a mm_processor_kwarg
#8658 merged
Sep 24, 2024
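Together with #8657 below, this lets callers override multimodal processor settings at engine construction; a hedged sketch, with the num_crops value chosen purely for illustration:

```python
# Sketch combining #8657 (mm_processor_kwargs plumbing) and this PR
# (Phi3v num_crops): override the HF processor's num_crops at startup.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,
    mm_processor_kwargs={"num_crops": 16},  # illustrative value
)
```
-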
[Core][Model] Support loading weights by ID within models
#7931 merged
Sep 24, 2024 -
[MISC] Skip dumping inputs when unpicklable
#8744 merged
Sep 24, 2024 -
Revert "[Core] Rename
PromptInputs
toPromptType
, andinputs
toprompt
"#8750 merged
Sep 24, 2024 -
re-implement beam search on top of vllm core
#8726 merged
Sep 24, 2024 -
Fix tests in test_scheduler.py that fail with BlockManager V2
#8728 merged
Sep 24, 2024 -
[Hardware][AMD] ROCm6.2 upgrade
#8674 merged
Sep 24, 2024 -
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse
#8335 merged
Sep 23, 2024 -
Fix typical acceptance sampler with correct recovered token ids
#8562 merged
Sep 23, 2024 -
[Core] Allow IPv6 in VLLM_HOST_IP with zmq
#8575 merged
Sep 23, 2024 -
[Kernel][LoRA] Add assertion for punica sgmv kernels
#7585 merged
Sep 23, 2024 -
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin
#7701 merged
Sep 23, 2024 -
[CI/Build] use setuptools-scm to set __version__
#4738 merged
Sep 23, 2024 -
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size
#8707 merged
Sep 23, 2024 -
[Model] Support pp for qwen2-vl
#8696 merged
Sep 23, 2024 -
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner
#8733 merged
Sep 23, 2024 -
[Hardware][CPU] Refactor CPU model runner
#8729 merged
Sep 23, 2024 -
[Core][Frontend] Support Passing Multimodal Processor Kwargs
#8657 merged
Sep 23, 2024 -
[Bugfix] fix docker build for xpu
#8652 merged
Sep 23, 2024 -
[Bugfix] Fix CPU CMake build
#8723 merged
Sep 23, 2024 -
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building
#8702 merged
Sep 23, 2024 -
[misc] upgrade mistral-common
#8715 merged
Sep 22, 2024 -
[build] enable existing pytorch (for GH200, aarch64, nightly)
#8713 merged
Sep 22, 2024 -
[SpecDec][Misc] Cleanup, remove bonus token logic.
#8701 merged
Sep 22, 2024 -
[Model][VLM] Add LLaVA-Onevision model support
#8486 merged
Sep 22, 2024 -
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler
#8703 merged
Sep 22, 2024 -
[Misc] Use NamedTuple in Multi-image example
#8705 merged
Sep 22, 2024 -
[Model] Refactor BLIP/BLIP-2 to support composite model loading
#8407 merged
Sep 22, 2024 -
[ci][build] fix vllm-flash-attn
#8699 merged
Sep 22, 2024 -
[Bugfix] Refactor composite weight loading logic
#8656 merged
Sep 22, 2024 -
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu
#8643 merged
Sep 21, 2024 -
[dbrx] refactor dbrx experts to extend FusedMoe class
#8518 merged
Sep 21, 2024 -
[Doc] Fix typo in AMD installation guide
#8689 merged
Sep 21, 2024 -
[VLM] Use SequenceData.from_token_counts to create dummy data
#8687 merged
Sep 21, 2024 -
[Kernel] Build flash-attn from source
#8245 merged
Sep 21, 2024 -
[beam search] add output for manually checking the correctness
#8684 merged
Sep 21, 2024 -
[Core] Factor out common code in SequenceData and Sequence
#8675 merged
Sep 21, 2024 -
[MISC] add support for custom_op check
#8557 merged
Sep 21, 2024 -
[Core] Rename PromptInputs to PromptType, and inputs to prompt
#8673 merged
Sep 21, 2024 -
[Bugfix] Fix incorrect llava next feature size calculation
#8496 merged
Sep 20, 2024 -
[Hardware][AWS] update neuron to 2.20
#8676 merged
Sep 20, 2024 -
[Doc] neuron documentation update
#8671 merged
Sep 20, 2024 -
[Bugfix][Core] Fix tekken edge case for mistral tokenizer
#8640 merged
Sep 20, 2024 -
[Bugfix] Config.__init__() got an unexpected keyword argument 'engine' api_server args
#8556 merged
Sep 20, 2024 -
[Misc] Show AMD GPU topology in collect_env.py
#8649 merged
Sep 20, 2024
52 Pull requests opened by 41 people
-
[Core] Rename input data types
#8688 opened
Sep 21, 2024 -
[MISC] Support multi node inference with Neuron
#8692 opened
Sep 21, 2024 -
[Core] Enable Memory Tiering for vLLM
#8694 opened
Sep 21, 2024 -
[Core] Deprecate block manager v1 and make block manager v2 the default
#8704 opened
Sep 22, 2024 -
[Bugfix] fix tool_parser error handling when serving a model that does not support it
#8709 opened
Sep 22, 2024 -
[Kernel][Hardware][AMD][ROCm] Fix rocm/attention.cu compilation on ROCm 6.0.3
#8714 opened
Sep 22, 2024 -
[Core][VLM] Support registration for OOT multimodal models
#8717 opened
Sep 22, 2024 -
[Core] Disaggregated prefilling supports valkey
#8724 opened
Sep 23, 2024 -
[Misc] Add conftest plugin for applying forking decorator
#8727 opened
Sep 23, 2024 -
deepseek model use FusedMoE
#8737 opened
Sep 23, 2024 -
Add LlamaForSequenceClassification model
#8740 opened
Sep 23, 2024 -
[Bugfix] Fix Marlin MoE act order when is_k_full == False
#8741 opened
Sep 23, 2024 -
[Hardware][Neuron] Add on-device sampling support for Neuron
#8746 opened
Sep 23, 2024 -
[Kernel][Quantization] Custom Floating-Point Runtime Quantization
#8751 opened
Sep 23, 2024 -
[CI/Build] migrate project metadata from setup.py to pyproject.toml
#8772 opened
Sep 24, 2024 -
[Bugfix] No num_gpus for ROCm and XPU when connecting to a ray cluster
#8781 opened
Sep 24, 2024 -
Add RWKV v5 (Eagle) support
#8787 opened
Sep 25, 2024 -
[ci] Add CODEOWNERS for test directories
#8795 opened
Sep 25, 2024 -
[Bug] Fix bug in convert_fp8
#8797 opened
Sep 25, 2024 -
[do-not-merge] test PR for pipeline generator
#8798 opened
Sep 25, 2024 -
[Kernel] Enable BFloat16 inputs in fused Marlin MoE kernels
#8800 opened
Sep 25, 2024 -
[WIP][Kernel] Dynamic group blocks in Marlin MoE kernels
#8801 opened
Sep 25, 2024 -
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching
#8804 opened
Sep 25, 2024 -
[Core] Improve choice of Python multiprocessing method
#8823 opened
Sep 25, 2024 -
[Bugfix] Block manager v2 with preemption and lookahead slots
#8824 opened
Sep 25, 2024 -
[Frontend] Log the maximum supported concurrency
#8831 opened
Sep 26, 2024 -
[Spec Decode] (1/2) Remove batch expansion
#8839 opened
Sep 26, 2024 -
[WIP] Dev build time improvements
#8845 opened
Sep 26, 2024 -
[Core] Priority-based scheduling in async engine
#8850 opened
Sep 26, 2024 -
support input embeddings for qwen2vl
#8856 opened
Sep 26, 2024 -
[WIP][Core] Refactor GGUF parameters packing and forwarding
#8859 opened
Sep 26, 2024 -
[WIP][Kernel] A100 FP8 Quantization Method and Kernel for PhiMOE
#8860 opened
Sep 26, 2024 -
[Core] Avoid metrics log noise when idle
#8868 opened
Sep 26, 2024 -
[BugFix] Fix seeded random sampling with encoder-decoder models
#8870 opened
Sep 26, 2024 -
[CI/Build] Update models tests & examples
#8874 opened
Sep 26, 2024 -
[Bugfix] fix #8630
#8880 opened
Sep 27, 2024 -
[Bugfix] Fix multi nodes TP+PP for XPU
#8884 opened
Sep 27, 2024 -
[Bugfix] Fix PP for Multi-Step
#8887 opened
Sep 27, 2024 -
[MISC] add a flag --lazy-capture-cuda-graph so that CUDA graph capture happens only on demand
#8888 opened
Sep 27, 2024 -
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1
#8891 opened
Sep 27, 2024 -
[Model] Support Qwen2.5-Math-RM-72B
#8896 opened
Sep 27, 2024 -
[Hardware][intel GPU] add async output process for xpu
#8897 opened
Sep 27, 2024 -
[CI/Build] setuptools-scm fixes
#8900 opened
Sep 27, 2024 -
[Core] LLMEngine removes `sampling_params = sampling_params.clone()`
#8901 opened
Sep 27, 2024 -
[Bugfix][VLM] Add multi-video support for LLaVA-Onevision model
#8905 opened
Sep 27, 2024 -
[Misc] Directly use compressed-tensors for checkpoint definitions
#8909 opened
Sep 27, 2024 -
[Core] Support all head sizes up to 256 with FlashAttention backend
#8910 opened
Sep 27, 2024 -
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag
#8911 opened
Sep 27, 2024
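A hedged sketch of using the new flag; the variable name comes from the PR title, and the bypass semantics are assumed from it:

```python
# Sketch under assumptions: setting VLLM_SKIP_P2P_CHECK before engine
# start skips the GPU peer-to-peer capability probe (per the PR title).
import os

os.environ["VLLM_SKIP_P2P_CHECK"] = "1"  # set before creating the engine

from vllm import LLM

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)  # illustrative
```
-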
[Misc] Separate total and output tokens in benchmark_throughput.py
#8914 opened
Sep 27, 2024 -
Add stream support for Granite 20b Tool Use
#8915 opened
Sep 27, 2024
77 Issues closed by 34 people
-
[Bug]: AssertionError when deploying the API server for Qwen2-VL-72B
#8895 closed
Sep 27, 2024 -
[Bug]: Tool calling on Llama 3.1/3.2 fails with KeyError: '<tool_call>'
#8912 closed
Sep 27, 2024 -
[Bug]: 8xV100 gpus: Failed to infer device type
#8885 closed
Sep 27, 2024 -
RuntimeError on ROCm
#2580 closed
Sep 27, 2024 -
[Bug]: error: triton_flash_attention.py
#5696 closed
Sep 27, 2024 -
[Bug]: HIP error: invalid argument in cudaMemGetInfo
#5994 closed
Sep 27, 2024 -
[Usage]: DOCKER - Getting OOM while running `meta-llama/Llama-3.2-11B-Vision-Instruct`
#8903 closed
Sep 27, 2024 -
[Bug]: KeyError: 'type'. when inferencing Llama 3.2 3B Instruct
#8855 closed
Sep 27, 2024 -
[Usage]: OOM when using Llama-3.2-11B-Vision-Instruct
#8879 closed
Sep 27, 2024 -
[Feature]: Expose Lora lineage information from /v1/models
#6274 closed
Sep 27, 2024 -
0.4.3 error CUDA error: an illegal memory access was encountered
#5376 closed
Sep 27, 2024 -
[Feature]: Support system messages for Multi Modal models
#8854 closed
Sep 27, 2024 -
[CI/Build]: Version v0.6.2 lacks the whl package
#8832 closed
Sep 26, 2024 -
[Model]: Does vllm currently support the Llama-3.1-405B-Instruct multimodal?
#7503 closed
Sep 26, 2024 -
[Feature]: Supporting MultiModal inputs using Llama3.1
#8146 closed
Sep 26, 2024 -
[Bug]: num_scheduler_steps > 1, n > 1 raise error
#8261 closed
Sep 26, 2024 -
[Performance]: Extremely low throughput
#8847 closed
Sep 26, 2024 -
[New Model]: Llama 3.2
#8812 closed
Sep 25, 2024 -
[Bug]: Docker image for 0.5.4 does not include package timm==0.9.10 to run MiniCPMV
#8107 closed
Sep 25, 2024 -
[Bug]: ModuleNotFoundError: No module named 'bitsandbytes'
#5503 closed
Sep 25, 2024 -
[Doc]: ROCm installation instructions do not work
#6762 closed
Sep 25, 2024 -
Fp8 support for mi300x
#6576 closed
Sep 25, 2024 -
[Bug]: vllm cpu installation build from source error
#8095 closed
Sep 25, 2024 -
[Doc]: Is Qwen2-VL-72B supported?
#8682 closed
Sep 25, 2024 -
[Bug]: lm-format-enforcer guided decoding kills MQLLMEngine
#8578 closed
Sep 25, 2024 -
[RFC]: Priority Scheduling
#6077 closed
Sep 25, 2024 -
[Bug]: Requesting Prompt Logprobs with an MLP Speculator Crashes the Server
#7742 closed
Sep 25, 2024 -
[Bug]: vllm async engine can not use adag
#8158 closed
Sep 24, 2024 -
[Bug]: Shutdown problem when we use ADAG
#8208 closed
Sep 24, 2024 -
[Bug]: AssertionError when loading Qwen 2.5 GGUF q3 model in vLLM
#8697 closed
Sep 24, 2024 -
[Feature]: Batch inference for `llm.chat()` API
#8481 closed
Sep 24, 2024 -
[Bug]: OLMoForCausalLM not supported
#8753 closed
Sep 24, 2024 -
[Usage]:
#8569 closed
Sep 24, 2024 -
Support for RLHF (ILQL)-trained Models
#841 closed
Sep 24, 2024 -
[Bug]: output is empty
#8775 closed
Sep 24, 2024 -
[Usage]: output was empty
#8774 closed
Sep 24, 2024 -
[Usage]: Question about dequantization
#8759 closed
Sep 24, 2024 -
[Misc]: Memory Order in Custom Allreduce
#8404 closed
Sep 24, 2024 -
[Bug]: Obvious hang caused by Custom All Reduce OP(Valuable Debug Info Obtained)
#8410 closed
Sep 24, 2024 -
[Usage]: set num_crops in LVLM
#7861 closed
Sep 24, 2024 -
[Bug]: Server crashes when kv cache exhausted
#8738 closed
Sep 24, 2024 -
[Feature] compile triton kernels ahead of time
#8712 closed
Sep 24, 2024 -
[Bug]: tensor parallel processes not working in vllm_cpu
#8756 closed
Sep 24, 2024 -
[Performance]: Add weaker memory fence for custom allreduce
#8457 closed
Sep 24, 2024 -
[Usage]: multimodal large models load local image files
#8730 closed
Sep 23, 2024 -
[Bug]: Error when using --tensor-parallel-size 4 on Qwen2.5-72B-Instruct
#8691 closed
Sep 23, 2024 -
[Bug]: torch.OutOfMemoryError: CUDA out of memory.
#8721 closed
Sep 23, 2024 -
[Doc]: Using LoRA adapters
#8725 closed
Sep 23, 2024 -
[Bug]: InternVl2-8B-AWQ gives error when trying to run with vllm-openai cuda 11.8 docker image
#8736 closed
Sep 23, 2024 -
[Bug]: Installation with XPU fails with Dockerfile and when building from source
#8563 closed
Sep 23, 2024 -
[Installation]: vllm CPU mode build failed
#8710 closed
Sep 23, 2024 -
[Bug]: vllm deploy medusa, draft acceptance rate: 0.000
#8620 closed
Sep 23, 2024 -
[Misc]: Uneven performance
#8719 closed
Sep 23, 2024 -
[Bug]: AttributeError: module 'cv2.dnn' has no attribute 'DictValue'
#8650 closed
Sep 22, 2024 -
ARM aarch-64 server build failed (host OS: Ubuntu22.04.3)
#2021 closed
Sep 22, 2024 -
[Usage]: Weird vram usage and increase in use
#8504 closed
Sep 22, 2024 -
[New Model]: LLaVA-OneVision
#7420 closed
Sep 22, 2024 -
When running pytest tests/, undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
#3228 closed
Sep 22, 2024 -
Support for production grade server for Inference [Gunicorn vs Unicorn]?
#2573 closed
Sep 21, 2024 -
[RFC]: Build `vllm-flash-attn` from source
#8002 closed
Sep 21, 2024 -
qwen2-vl: AttributeError: '_OpNamespace' '_C' object has no attribute 'gelu_quick'
#8624 closed
Sep 21, 2024 -
[Installation]: cannot install vllm on GPU
#8665 closed
Sep 21, 2024 -
[Bug]: Docker build for ROCm fails for latest release and main branch
#7813 closed
Sep 21, 2024 -
Failed to build from source on ROCm (with pytorch and xformers working correctly)
#3067 closed
Sep 21, 2024 -
[Usage]: Number of requests currently in the queue
#8617 closed
Sep 20, 2024 -
[Bug]: aisingapore/sea-lion-7b-instruct fails with assert config.embedding_fraction == 1.0
#3523 closed
Sep 20, 2024 -
Mixtral 4x 4090 OOM
#3285 closed
Sep 20, 2024 -
Low VRAM batch processing mode
#1297 closed
Sep 20, 2024 -
Add worker registry service for hosting multiple vllm model through single api gateway
#1753 closed
Sep 20, 2024 -
Use O3 optimization instead of O2 for CUDA compilation?
#67 closed
Sep 20, 2024 -
Modify the current PyTorch model to C++
#42 closed
Sep 20, 2024 -
[Usage]: VLLM serve Gemma 2 9B it with more than 4096 tokens
#8680 closed
Sep 20, 2024 -
Inquiry Regarding vLLM Support for Mac Metal API
#2081 closed
Sep 20, 2024 -
vLLM ignores my requests when I increase the number of concurrent requests
#2752 closed
Sep 20, 2024 -
Faster model loading
#474 closed
Sep 20, 2024
81 Issues opened by 72 people
-
[RFC]: QuantizationConfig and QuantizeMethodBase Refactor for Simplifying Kernel Integrations
#8913 opened
Sep 27, 2024 -
[Usage]: LLM with tensor_parallel_size larger than the number of GPUs in one node
#8908 opened
Sep 27, 2024 -
[Usage]: guided_regex in offline model
#8907 opened
Sep 27, 2024 -
[Bug]: Tokenization Mismatch Between HuggingFace and vLLM
#8904 opened
Sep 27, 2024 -
[Feature]: Guided Decoding Schema Cache Store
#8902 opened
Sep 27, 2024 -
[Performance]: Talk about the model parallelism
#8898 opened
Sep 27, 2024 -
[Bug]: Variance Between Multiple Prefix Cache Example runs
#8890 opened
Sep 27, 2024 -
[Bug]: assert len(self._async_stopped) == 0
#8881 opened
Sep 27, 2024 -
[Installation]: Cannot compile flash attention when building from source
#8878 opened
Sep 27, 2024 -
[Bug]: The --quantization=awq startup parameter causes a restart
#8877 opened
Sep 27, 2024 -
[Bug]: Server - `aqlm` fails with `--cpu-offload-gb`
#8873 opened
Sep 26, 2024 -
[Feature]: Add model context information to chat template
#8869 opened
Sep 26, 2024 -
[Performance]: Slowdown compared to Gradio
#8866 opened
Sep 26, 2024 -
[Bug]: configurably disable prompt echo
#8864 opened
Sep 26, 2024 -
[Usage]: RuntimeError: Failed to infer device type (Intel Iris Xe Graphics)
#8863 opened
Sep 26, 2024 -
[Installation]: can't install on CPU (AMD Ryzen 7 PRO 8700GE, Ubuntu)
#8862 opened
Sep 26, 2024 -
[Bug]: Assert Error: len(seqs) == 1
#8858 opened
Sep 26, 2024 -
[Feature]: Support image embeddings as input for qwen2vl
#8857 opened
Sep 26, 2024 -
[Bug]: 0.6.2 OpenAI server runs out of memory for a previously stable setup
#8853 opened
Sep 26, 2024 -
[Installation]: Meet bugs when installing from source
#8852 opened
Sep 26, 2024 -
[Usage]:
#8851 opened
Sep 26, 2024 -
[Bug]: VLLM does not support EAGLE Spec Decode when deploying EAGLE-Qwen2-7B-Instruct model
#8849 opened
Sep 26, 2024 -
[Bug]: We tested Qwen2.5_3b on opencompass and vllm, and the results are very different
#8846 opened
Sep 26, 2024 -
[Bug]: Assertion error when inferencing with sample_n>1 and preemption occurs
#8844 opened
Sep 26, 2024 -
[Feature]: Support for compiled model graph for TPUs
#8843 opened
Sep 26, 2024 -
[Bug]: exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
#8840 opened
Sep 26, 2024 -
[Bug]: TimeoutError: MQLLMEngine didn't reply within 10000ms
#8836 opened
Sep 26, 2024 -
[Feature]: Samplers Order support
#8835 opened
Sep 26, 2024 -
Llama3.2 Vision Model: Guides and Issues
#8826 opened
Sep 25, 2024 -
[issue tracker] make vllm compatible with dynamo
#8821 opened
Sep 25, 2024 -
[Bug]: Later versions show degradation in the `vllm:time_to_first_token_seconds_sum` metric
#8819 opened
Sep 25, 2024 -
[Usage]: How does VLLM allocate memory
#8816 opened
Sep 25, 2024 -
[Usage]: How to use BitsAndBytesConfig with vllm serve
#8813 opened
Sep 25, 2024 -
[New Model]: allenai/Molmo-7B-0-0924 VisionLM
#8808 opened
Sep 25, 2024 -
[Bug]: Decreased generation quality with Mixtral
#8807 opened
Sep 25, 2024 -
[Usage]: Train Lora with frozen vllm model
#8806 opened
Sep 25, 2024 -
[Bug]: vllm api server return escaped unicode string in guided backend 'outlines'
#8805 opened
Sep 25, 2024 -
[Misc]: Strange `leaked shared_memory` warnings reported by multiprocessing when using vLLM
#8803 opened
Sep 25, 2024 -
[Feature]: LoRA support for Pixtral
#8802 opened
Sep 25, 2024 -
[Bug]: Loading a model with bitsandbytes 8bit quantization
#8799 opened
Sep 25, 2024 -
[Bug]: Bug in the convert_fp8 function, a function for testing.
#8796 opened
Sep 25, 2024 -
[Doc]: Is Qwen2.5's long context YARN handled?
#8793 opened
Sep 25, 2024 -
[Bug]: Port binding failure when using pp > 1 after commit 7c7714d856eee6fa94aade729b67f00584f72a4c
#8791 opened
Sep 25, 2024 -
[Installation]: Installing vLLM on ROCm - Distro:Gentoo
#8788 opened
Sep 25, 2024 -
[Tracking Issue][Help Wanted]: FlashInfer backend improvements
#8786 opened
Sep 24, 2024 -
[Bug]: Disabling Marlin by setting --quantization gptq doesn't work when using a draft model
#8784 opened
Sep 24, 2024 -
[Bug]: Decode n tokens gives different output for first seq position compared to decode 1 token
#8783 opened
Sep 24, 2024 -
[RFC]: Add Goodput Metric to Benchmark Serving
#8782 opened
Sep 24, 2024 -
vLLM's V2 Engine Architecture
#8779 opened
Sep 24, 2024 -
[Bug]: LLMEngine cannot be pickled error vllm 0.6.1.post2
#8778 opened
Sep 24, 2024 -
[Usage]: output was empty
#8773 opened
Sep 24, 2024 -
[Usage]: Total generated tokens in benchmarking script
#8769 opened
Sep 24, 2024 -
[Usage]: how to acquire logits in vllm
#8762 opened
Sep 24, 2024 -
[Bug]: using cpu_offload_gb with GGUF fails
#8757 opened
Sep 24, 2024 -
[Bug]: Request error
#8755 opened
Sep 24, 2024 -
[Performance]: Analysis of performance dashboard movements
#8749 opened
Sep 23, 2024 -
[Bug]: OLMoE produces incorrect output with TP>1
#8747 opened
Sep 23, 2024 -
Error loading models since versions 0.6.1.x
#8745 opened
Sep 23, 2024 -
Why is the bitsandbytes model significantly slower than the AWQ model?
#8743 opened
Sep 23, 2024 -
[Feature]: Support Inference Overrides for mm_processor_kwargs
#8742 opened
Sep 23, 2024 -
[Misc]: Enable dependabot to help managing known vulnerabilities in dependencies
#8734 opened
Sep 23, 2024 -
[Usage]: speculative OutOfMemoryError:
#8731 opened
Sep 23, 2024 -
[Usage]: Loading a model with bitsandbytes quantization with 8bit
#8720 opened
Sep 23, 2024 -
[Misc]: Unit test failures with BlockManager v2
#8718 opened
Sep 22, 2024 -
[RFC]: quant llm from alpindale
#8716 opened
Sep 22, 2024 -
[Bug]: 4208 CPU, vllm 0.6.0 serving qwen-vl-7b throws the exception shown in the image below; the model outputs normally at first, but after multiple calls no results are returned
#8711 opened
Sep 22, 2024 -
Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher
#8708 opened
Sep 22, 2024 -
[Usage]: Is there any difference between max_tokens and max_model_len?
#8706 opened
Sep 22, 2024
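For reference, a minimal sketch of the distinction being asked about: max_model_len is an engine-level cap on the total context (prompt plus generation), while max_tokens is a per-request cap on generated tokens only:

```python
# max_model_len bounds prompt + output length for every request the
# engine accepts; max_tokens bounds the generated portion of one request.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", max_model_len=2048)  # engine-wide cap
params = SamplingParams(max_tokens=256)                   # per-request output cap
print(llm.generate(["The quick brown fox"], params)[0].outputs[0].text)
```
-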
[Feature]: Support for Seq classification/Reward models
#8700 opened
Sep 21, 2024 -
[Bug]: Low throughput on AMD MI250 using llama 3.1 (6 toks/s)
#8698 opened
Sep 21, 2024 -
[Bug]: Pixtral-12B not supported on CPU
#8693 opened
Sep 21, 2024 -
[Bug]: RuntimeError on A800 using vllm0.6.1.post2
#8686 opened
Sep 21, 2024 -
[New Model][Format]: Support the HF-version of Pixtral
#8685 opened
Sep 21, 2024 -
[Feature]: improve distributed backend selection
#8683 opened
Sep 20, 2024 -
[Bug]: QLoRA inference returns alternating output
#8681 opened
Sep 20, 2024
134 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Core] Implementing disaggregated prefilling, and caching KV cache in CPU/disk/database.
#8498 commented on
Sep 27, 2024 • 55 new comments -
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path
#8378 commented on
Sep 27, 2024 • 28 new comments -
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels
#8533 commented on
Sep 27, 2024 • 27 new comments -
[Bugfix] Fix LongRoPE bug
#8254 commented on
Sep 25, 2024 • 14 new comments -
[OpenVINO] Enable GPU support for OpenVINO vLLM backend
#8192 commented on
Sep 27, 2024 • 14 new comments -
[Model] Support Mamba
#6484 commented on
Sep 25, 2024 • 11 new comments -
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model
#8405 commented on
Sep 27, 2024 • 11 new comments -
[Model][LoRA] LoRA support added for MiniCPMV2.5
#7199 commented on
Sep 27, 2024 • 10 new comments -
Adding Cascade Infer to FlashInfer
#8132 commented on
Sep 27, 2024 • 10 new comments -
[Model] Add GLM-4v support and meet vllm==0.6.1.post2+cu123
#8663 commented on
Sep 25, 2024 • 9 new comments -
[Hardware][CPU] Support AWQ for CPU backend
#7515 commented on
Sep 24, 2024 • 6 new comments -
[Doc]: Add deploying_with_k8s guide
#8451 commented on
Sep 26, 2024 • 6 new comments -
[Bugfix][Intel] Fix XPU Dockerfile Build
#7824 commented on
Sep 27, 2024 • 3 new comments -
[Core][VLM] Add precise multi-modal placeholder tracking
#8346 commented on
Sep 27, 2024 • 2 new comments -
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang
#7412 commented on
Sep 22, 2024 • 2 new comments -
[Model] Adding Granite MoE.
#8206 commented on
Sep 26, 2024 • 2 new comments -
[Model] MLPSpeculator quantization support
#8476 commented on
Sep 23, 2024 • 1 new comment -
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing
#8537 commented on
Sep 23, 2024 • 1 new comment -
[Misc] add non cuda hf benchmark_throughput
#8653 commented on
Sep 20, 2024 • 1 new comment -
[Bug]: Multistep with n>1 Fails
#7968 commented on
Sep 27, 2024 • 0 new comments -
[Installation]: Issues with installing vLLM on ROCM without sudo access
#8042 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: Using FlashInfer with FP8 model with FP8 KV cache produces an error
#8641 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: No module named `jsonschema.protocols`.
#6486 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: Trailing newline as outputs
#8020 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: sending request using response_format json twice breaks vLLM
#4070 commented on
Sep 27, 2024 • 0 new comments -
[docs] add load balancing examples
#1837 commented on
Sep 25, 2024 • 0 new comments -
[Frontend][OpenAI] Add support for OpenAI tools calling
#4656 commented on
Sep 25, 2024 • 0 new comments -
Heterogeneous Speculative Decoding (CPU + GPU)
#5065 commented on
Sep 27, 2024 • 0 new comments -
[Model] Bert Embedding Model
#5447 commented on
Sep 26, 2024 • 0 new comments -
[RFC]: Reimplement and separate beam search on top of vLLM core
#8306 commented on
Sep 27, 2024 • 0 new comments -
[RFC]: Automate Speculative Decoding
#4565 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: No available block found in 60 second in shm
#6614 commented on
Sep 27, 2024 • 0 new comments -
[Feature]: support out tree multimodal models
#8667 commented on
Sep 27, 2024 • 0 new comments -
[Bug]: When enabling LoRA, greedy search got different answers.
#7977 commented on
Sep 27, 2024 • 0 new comments -
[RFC]: Multi-modality Support Refactoring
#4194 commented on
Sep 26, 2024 • 0 new comments -
[RFC]: Encoder/decoder models & feature compatibility
#7366 commented on
Sep 26, 2024 • 0 new comments -
[Feature]: Align the API with OAI's structured output
#7220 commented on
Sep 26, 2024 • 0 new comments -
[RFC]: Support encode only models by Workflow Defined Engine
#8453 commented on
Sep 26, 2024 • 0 new comments -
[Performance]: Why is the avg. generation throughput low?
#4760 commented on
Sep 26, 2024 • 0 new comments -
[Bug]: Running llama2-7b on H20, Floating point exception (core dumped) appears on float16
#4392 commented on
Sep 26, 2024 • 0 new comments -
[Feature]: FP6
#4515 commented on
Sep 25, 2024 • 0 new comments -
[Installation]: vllm on NVIDIA jetson AGX orin
#5640 commented on
Sep 25, 2024 • 0 new comments -
[Installation]: How to install vLLM on Jetson
#8485 commented on
Sep 25, 2024 • 0 new comments -
[Not to be Submitted] [WIP] Force Unit tests to run with BlockManager V2
#8678 commented on
Sep 20, 2024 • 0 new comments -
[Core] CUDA Graphs for Multi-Step Chunked Prefill
#8645 commented on
Sep 24, 2024 • 0 new comments -
[Bugfix] Handle `best_of>1` & `use_beam_search` by disabling multi-step scheduling.
#8637 commented on
Sep 23, 2024 • 0 new comments -
ppc64le: Dockerfile and CI fix
#8529 commented on
Sep 27, 2024 • 0 new comments -
[Doc] Compatibility matrix for mutual exclusive features
#8512 commented on
Sep 25, 2024 • 0 new comments -
[Core]: Support encode only models by Workflow Defined Engine
#8452 commented on
Sep 27, 2024 • 0 new comments -
[Bugfix] Fix code for downloading models from modelscope
#8443 commented on
Sep 25, 2024 • 0 new comments -
[torch.compile] A simple solution to recursively compile loaded model: using phi3-small as an example
#8398 commented on
Sep 25, 2024 • 0 new comments -
[Model] tool calling support for ibm-granite/granite-20b-functioncalling
#8339 commented on
Sep 27, 2024 • 0 new comments -
[Frontend][Core] Move guided decoding params into sampling params
#8252 commented on
Sep 27, 2024 • 0 new comments -
[BugFix] Fix metrics error for --num-scheduler-steps > 1
#8234 commented on
Sep 26, 2024 • 0 new comments -
[Hardware][Ascend] Add Ascend NPU backend
#8054 commented on
Sep 27, 2024 • 0 new comments -
Roberta embedding
#7969 commented on
Sep 26, 2024 • 0 new comments -
`[Core]` Added streaming support to `LLM` Class
#7648 commented on
Sep 24, 2024 • 0 new comments -
[CI/Build] custom build backend and dynamic build dependencies
#7525 commented on
Sep 23, 2024 • 0 new comments -
[Core] Move detokenization to front-end process
#7402 commented on
Sep 23, 2024 • 0 new comments -
[Models] Add remaining model PP support
#7168 commented on
Sep 27, 2024 • 0 new comments -
[Frontend] Add readiness and liveness endpoints to OpenAI API server
#7078 commented on
Sep 24, 2024 • 0 new comments -
[Doc] Proofreading documentation
#6998 commented on
Sep 24, 2024 • 0 new comments -
[Core] generate from input embeds
#6869 commented on
Sep 27, 2024 • 0 new comments -
[BugFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Sep 26, 2024 • 0 new comments -
[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface
#6260 commented on
Sep 25, 2024 • 0 new comments -
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend
#6143 commented on
Sep 24, 2024 • 0 new comments -
Whisper support
#5964 commented on
Sep 21, 2024 • 0 new comments -
[Bug]: FP8 Marlin fallback out of memory regression
#7793 commented on
Sep 22, 2024 • 0 new comments -
[Installation]: error: can't copy 'build/lib.linux-x86_64-3.10/vllm/_core_C.abi3.so': doesn't exist or not a regular file
#8174 commented on
Sep 22, 2024 • 0 new comments -
[Bug]: In v0.6.0 and above, Some of monitoring metrics are not correct.
#8178 commented on
Sep 23, 2024 • 0 new comments -
Inconsistent Text Generation Results in Batch vs Individual Sentence Processing
#2568 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: [Usage]: is_xpu should return true when torch.xpu.is_available() is true even w/o IPEX
#8655 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
#6766 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: topk=1 and temperature=0 cause different output in vllm
#5404 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: CUDA illegal memory access error when `enable_prefix_caching=True`
#5537 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: JSONDecodeError when running vllm serve
#8668 commented on
Sep 23, 2024 • 0 new comments -
[Usage]: What's the minimum VRAM needed to use entire context length for Llama 3.1 70B and 405B
#8188 commented on
Sep 23, 2024 • 0 new comments -
[Feature]: APC introspection interface
#8523 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: RuntimeError in gptq_marlin_24_gemm
#8654 commented on
Sep 23, 2024 • 0 new comments -
[Bug]: AsyncEngineDeadError: Task finished unexpectedly with qwen2 72b
#6208 commented on
Sep 23, 2024 • 0 new comments -
[New Model]: Support for allenai/OLMoE-1B-7B-0924
#8170 commented on
Sep 23, 2024 • 0 new comments -
[Feature]: Offline quantization for Pixtral-12B
#8566 commented on
Sep 23, 2024 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Sep 24, 2024 • 0 new comments -
[Bug]: Neuron + Vllm inference broken with backward incompatible change
#8677 commented on
Sep 20, 2024 • 0 new comments -
ExLlamaV2: exl2 support
#3203 commented on
Sep 20, 2024 • 0 new comments -
Question: Would a PR integrating ExLlamaV2 kernels with AWQ be accepted?
#2645 commented on
Sep 20, 2024 • 0 new comments -
[Bug]: OpenGVLab/InternVL2-Llama3-76B: view size is not compatible with input tensor's size and stride
#8630 commented on
Sep 20, 2024 • 0 new comments -
AWQ: Implement new kernels (64% faster decoding)
#3025 commented on