NVIDIA has announced the latest v0.15 release of NVIDIA TensorRT Model Optimizer, a library of state-of-the-art model optimization techniques, including quantization, sparsity, and pruning. These techniques reduce model complexity and enable downstream inference frameworks like NVIDIA TensorRT-LLM and NVIDIA TensorRT to more efficiently optimize the inference speed of generative AI models.
This post outlines some of the key features and upgrades of recent TensorRT Model Optimizer releases, including cache diffusion, the new quantization-aware training workflow using NVIDIA NeMo, and QLoRA support.
Cache diffusion
Previously, TensorRT Model Optimizer (referred to as Model Optimizer) supercharged NVIDIA TensorRT to set the bar for Stable Diffusion XL performance with its 8-bit post-training quantization (PTQ) technique. To further democratize fast inference for diffusion models, Model Optimizer v0.15 adds support for cache diffusion, which can be used with FP8 or INT8 PTQ to further accelerate diffusion models at inference time.
Cache diffusion methods, such as DeepCache and block caching, optimize inference speed without the need for additional training by reusing cached outputs from previous denoising steps. The caching mechanism leverages an intrinsic characteristic of the reverse denoising process of diffusion models: high-level features between consecutive steps exhibit significant temporal consistency, so they can be cached and reused. Cache diffusion is compatible with a variety of backbone models, such as DiT and UNet, enabling considerable inference acceleration without compromising output quality and without any additional training cost.
To enable cache diffusion, developers only need to attach a single cachify instance from Model Optimizer to the diffusion pipeline. For a detailed example, see the cache diffusion tutorial notebook. For an FP16 Stable Diffusion XL (SDXL) model on an NVIDIA H100 Tensor Core GPU, enabling cache diffusion in Model Optimizer delivers a 1.67x speedup in images per second (Figure 1). This speedup increases when FP8 is also enabled. Additionally, Model Optimizer enables users to customize the cache configuration for even faster inference. More diffusion models will be supported in the cache diffusion pipeline using the TensorRT runtime in the near future.
Figure 1 benchmark setup: the FP16 baseline without caching is benchmarked using the Model Optimizer cache diffusion pipeline with caching disabled, rather than the demoDiffusion pipeline in TensorRT, which has a batch size limitation, to provide a fairer comparison. NVIDIA H100 80 GB HBM3 GPU; 30 denoising steps; batch size 16; TensorRT v10.2.0; TensorRT Model Optimizer v0.15
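For illustration, the following is a minimal sketch of that workflow, loosely modeled on the cache diffusion example in the TensorRT-Model-Optimizer repository. The cache_diffusion module path, the cachify.prepare/cachify.infer helpers, and SDXL_DEFAULT_CONFIG are assumptions based on the tutorial notebook and may differ between releases; treat the tutorial itself as the authoritative reference.

```python
# Minimal sketch: enabling cache diffusion on an SDXL pipeline.
# NOTE: the `cache_diffusion` module, `cachify.prepare`, `cachify.infer`, and
# `SDXL_DEFAULT_CONFIG` are assumed from the tutorial layout and may differ
# across releases.
import torch
from diffusers import DiffusionPipeline

from cache_diffusion import cachify                    # assumed example module
from cache_diffusion.utils import SDXL_DEFAULT_CONFIG  # assumed default cache config

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

num_inference_steps = 30

# Attach the caching hooks to the pipeline's backbone (UNet for SDXL).
cachify.prepare(pipe, num_inference_steps, SDXL_DEFAULT_CONFIG)

# Run inference; cached high-level features are reused across denoising steps.
with cachify.infer(pipe) as cached_pipe:
    image = cached_pipe(
        prompt="a photo of an astronaut riding a horse on mars",
        num_inference_steps=num_inference_steps,
    ).images[0]

image.save("sdxl_cached.png")
```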
Quantization-aware training with NVIDIA NeMo
Quantization-aware training (QAT) is a technique to train neural networks while simulating the effects of quantization, aiming to recover model accuracy post-quantization. This process involves computing scaling factors during training and incorporating simulated quantization loss into the fine-tuning process, making the neural network more resilient to quantization. In Model Optimizer, QAT uses custom CUDA kernels for simulated quantization, achieving lower precision model weights and activations for efficient hardware deployment.
A model quantized using the Model Optimizer mtq.quantize() API can be directly fine-tuned with the original training pipeline. During QAT, the scaling factors inside the quantizers are frozen and the model weights are fine-tuned. The QAT process typically requires only a short fine-tuning duration, and small learning rates are recommended.
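As a rough outline, a QAT flow with Model Optimizer might look like the sketch below. The mtq.quantize() API and predefined configs such as mtq.INT8_DEFAULT_CFG come from the Model Optimizer documentation; the calibration dataloader and the train() call are placeholders standing in for your existing training pipeline.

```python
# Sketch of a QAT flow with Model Optimizer: quantize first, then fine-tune
# with the original training pipeline. Dataloaders and train() are placeholders.
import modelopt.torch.quantization as mtq

# Choose a quantization format; INT8_DEFAULT_CFG is one of the predefined
# configs (verify the exact name against your installed modelopt version).
config = mtq.INT8_DEFAULT_CFG

def forward_loop(model):
    # Calibration pass: run a few representative batches through the model so
    # Model Optimizer can compute the initial scaling factors.
    for batch in calib_dataloader:  # placeholder dataloader
        model(batch)

# Insert simulated-quantization (fake-quant) modules and calibrate scales.
model = mtq.quantize(model, config, forward_loop)

# QAT: fine-tune the quantized model with the original training pipeline.
# Scaling factors stay frozen while the weights adapt to quantization noise;
# a small learning rate and a short schedule are usually sufficient.
train(model, train_dataloader, epochs=1, lr=1e-5)  # placeholder training loop
```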
Model Optimizer v0.15 expands QAT integration support from Hugging Face Trainer and Megatron to NVIDIA NeMo, an enterprise-grade platform for developing custom generative AI models. Model Optimizer now has first-class support for NeMo models. To learn how to perform QAT with your existing NeMo training pipeline, see the new QAT example in the NeMo GitHub repo. Learn more about QAT.
QLoRA workflow
Quantized Low-Rank Adaptation (QLoRA) is an efficient fine-tuning technique to reduce memory usage and computational complexity during model training. By combining quantization with Low-Rank Adaptation (LoRA), QLoRA makes LLM fine-tuning more accessible for developers with limited hardware resources.
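For intuition, the sketch below shows the core QLoRA mechanism in plain PyTorch: a frozen base weight (stored in NF4 in the NeMo workflow) combined with small trainable low-rank adapters. This is a conceptual illustration only, not the NeMo or Model Optimizer implementation, and the fp16 buffer here is a stand-in for real NF4 storage and dequantization kernels.

```python
# Conceptual QLoRA layer: a frozen (quantized) base weight plus trainable
# low-rank adapters A and B. Illustration only -- not the NeMo implementation;
# the fp16 buffer stands in for NF4 storage and on-the-fly dequantization.
import torch
import torch.nn as nn

class QLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        # Frozen base weight; a real QLoRA stack would store this in NF4.
        base = torch.randn(out_features, in_features, dtype=torch.float16)
        self.register_buffer("base_weight", base)

        # Trainable low-rank adapters, kept in higher precision.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base projection (stand-in for the dequantized NF4 weight).
        y = x @ self.base_weight.to(x.dtype).t()
        # Low-rank update: only A and B receive gradients during fine-tuning.
        return y + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())

layer = QLoRALinear(4096, 4096)
out = layer(torch.randn(2, 4096))
print(out.shape)  # torch.Size([2, 4096])
```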
Model Optimizer has added support for the QLoRA workflow with NVIDIA NeMo using the NF4 data type. For details about the workflow, refer to the NeMo documentation. For a Llama 13B model on the Alpaca dataset, QLoRA can reduce the peak memory usage by 29-51%, depending on the batch size, while maintaining the same model accuracy (Figure 2). Note that QLoRA comes with the trade-off of a longer training step time compared to LoRA (Table 1).
Figure 2 benchmark setup: NVIDIA H100 GPU; sequence length 512; global batch size 256; NeMo 24.07; TensorRT Model Optimizer v0.13
| Batch size | LoRA time per global batch (s) | QLoRA time per global batch (s) | % increase |
| --- | --- | --- | --- |
| 2 | 2.7 | 6.7 | 148% |
| 4 | 2.3 | 4.4 | 91% |
| 8 | 2.2 | 3.2 | 46% |
Table 1 benchmark setup: NVIDIA H100 GPU; sequence length 512; global batch size 256; NeMo 24.07; TensorRT Model Optimizer v0.13
Expanded support for AI models
TensorRT Model Optimizer has expanded support for a wider suite of popular AI models, including Stability.ai Stable Diffusion 3, Google RecurrentGemma, Microsoft Phi-3, Snowflake Arctic 2, and Databricks DBRX. See the example scripts for tutorials and the support matrix for more details.
Get started
NVIDIA TensorRT Model Optimizer offers seamless integration with NVIDIA TensorRT-LLM and TensorRT for deployment. It is available for installation on PyPI as nvidia-modelopt. Visit NVIDIA/TensorRT-Model-Optimizer on GitHub for example scripts and recipes for inference optimization. For more details, see the Model Optimizer documentation.
We value your feedback on TensorRT Model Optimizer. If you have suggestions, issues, or feature requests, open a new NVIDIA/TensorRT-Model-Optimizer issue on GitHub. Your input helps us iterate on our quantization toolkit to better meet your needs.