TensorRT-LLM
NVIDIA's high-performance LLM inference library. Achieves the highest throughput on NVIDIA GPUs with custom CUDA kernels, quantization, and in-flight batching.
TensorRT-LLM is NVIDIA’s purpose-built inference optimization library for large language models on NVIDIA GPUs. It compiles models into highly optimized CUDA kernels using TensorRT and implements advanced serving features like in-flight batching, KV cache paging, and multi-GPU tensor parallelism. For deployments on NVIDIA hardware where maximum throughput and minimum latency are the top priorities, TensorRT-LLM typically delivers the best performance of any inference engine.
Key Features
Peak NVIDIA GPU performance. TensorRT-LLM generates custom CUDA kernels tailored to specific model architectures and GPU generations. These specialized kernels outperform generic implementations, often achieving 2-4x higher throughput than standard PyTorch inference.
Advanced quantization. TensorRT-LLM supports FP8, INT8, INT4, and mixed-precision quantization with techniques like SmoothQuant, GPTQ, and AWQ. FP8 quantization on Hopper and Ada Lovelace GPUs provides near-FP16 quality at roughly 2x the throughput.
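The core idea behind integer quantization is mapping float weights to a small integer range with a shared scale factor. This is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization for illustration only; it is not the TensorRT-LLM API, and real schemes like SmoothQuant and AWQ add per-channel scales and activation-aware calibration on top of this.

```python
# Symmetric INT8 quantization sketch: one scale per tensor, range [-128, 127].
# Illustrative only -- not TensorRT-LLM's actual implementation.

def quantize_int8(weights):
    """Map float weights to int8 values plus a single dequantization scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

w = [0.02, -1.27, 0.5, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # close to w; the gap is the quantization error
```

Lower-bit formats like INT4 shrink the integer range further, trading more quantization error for smaller weights and faster memory-bound kernels.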
In-flight batching. Continuous batching dynamically interleaves requests at the token level rather than waiting for entire batches to complete. Combined with paged KV cache management, this maximizes GPU utilization under concurrent request loads.
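The scheduling idea can be shown with a toy loop: requests join and leave the batch at every token step rather than at batch boundaries. All names here are illustrative; this simulates the scheduling policy, not TensorRT-LLM's actual scheduler.

```python
# Toy in-flight batching: admit waiting requests and retire finished ones
# at every decode step, so GPU slots never idle waiting for a batch to drain.
from collections import deque

def serve(requests, max_batch=2):
    """requests: list of (request_id, tokens_to_generate). Returns the
    per-step record of which requests occupied the batch."""
    waiting = deque(requests)
    active = {}        # request_id -> tokens still to generate
    timeline = []
    while waiting or active:
        # Admission happens every step, not only when the batch is empty.
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        timeline.append(sorted(active))
        # One decode step: each active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot is reusable on the very next step
    return timeline

steps = serve([("a", 3), ("b", 1), ("c", 2)])
# "c" starts as soon as "b" finishes, while "a" is still mid-generation.
```

With batch-level scheduling, "c" would have waited for "a" to finish; token-level admission is what keeps utilization high under concurrent load.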
Multi-GPU and multi-node. Tensor parallelism and pipeline parallelism distribute models across multiple GPUs or nodes. This enables inference on models too large for a single GPU and scales throughput close to linearly with additional hardware.
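The mechanics of tensor parallelism can be sketched with a column-split matrix-vector product: each "GPU" owns a slice of the weight columns, computes its partial output, and the slices are gathered back together. This pure-Python stand-in only shows the math of the split; in a real deployment the gather is an NCCL collective across devices.

```python
# Column-wise tensor parallelism sketch: shard a linear layer's output
# columns across simulated devices and gather the partial results.

def matvec(weight_cols, x):
    """weight_cols: list of columns; one output element per column."""
    return [sum(wi * xi for wi, xi in zip(col, x)) for col in weight_cols]

def tensor_parallel_matvec(weight_cols, x, num_shards):
    shard_size = len(weight_cols) // num_shards
    shards = [weight_cols[i * shard_size:(i + 1) * shard_size]
              for i in range(num_shards)]
    partials = [matvec(shard, x) for shard in shards]  # each "GPU" computes its slice
    return [y for p in partials for y in p]            # all-gather / concatenate

W = [[1, 0], [0, 1], [2, 2], [3, -1]]   # 4 output columns, input dim 2
x = [1.0, 2.0]
assert tensor_parallel_matvec(W, x, 2) == matvec(W, x)
```

Because each shard holds only a fraction of the weights, a model that exceeds one GPU's memory fits once it is split this way.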
Speculative decoding. Draft-model speculative decoding accelerates generation by using a smaller model to propose tokens that the larger model verifies in parallel, reducing effective latency per token.
NVIDIA Triton integration. TensorRT-LLM engines deploy through NVIDIA Triton Inference Server for production serving with request queuing, model ensembles, metrics, and multi-model management.
When to Use TensorRT-LLM
Choose TensorRT-LLM when you are deploying on NVIDIA GPUs and need absolute maximum throughput. It is the right choice for production API services handling high request volumes, data centers with NVIDIA hardware, and any deployment where inference cost per token is a critical metric.
Ecosystem Role
TensorRT-LLM represents the performance ceiling for NVIDIA GPU inference. It requires more setup than Ollama or vLLM — models must be compiled into TensorRT engines for specific GPU architectures. vLLM offers an easier setup with good performance, while TensorRT-LLM trades convenience for the last margin of throughput. It is not available on AMD or Apple hardware.