TensorRT-LLM
NVIDIA's optimized inference engine for LLMs that achieves maximum performance on NVIDIA GPUs through kernel fusion, quantization, and tensor parallelism.
TensorRT-LLM = Maximum-performance LLM serving on NVIDIA GPUs, often 2-3x faster than alternatives.
Explanation
TensorRT-LLM compiles LLMs into highly optimized TensorRT engines with fused GPU kernels. Key features: FP8/INT8 quantization, in-flight (continuous) batching, a paged KV cache, and multi-GPU serving via tensor parallelism. It achieves some of the highest tokens/s throughput on NVIDIA hardware; a usage sketch follows below.
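The quickest way to see these features in practice is the high-level Python "LLM API" that ships with recent TensorRT-LLM releases. The sketch below is illustrative, not a pinned recipe: the model ID is an example, and parameter names such as tensor_parallel_size may differ between versions.

```python
# Minimal sketch of TensorRT-LLM's high-level LLM API (recent releases).
from tensorrt_llm import LLM, SamplingParams

# Point at a Hugging Face model; TensorRT-LLM compiles an optimized engine
# for the local GPU(s). tensor_parallel_size shards the model across GPUs
# via tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model id
    tensor_parallel_size=1,
)

params = SamplingParams(max_tokens=128, temperature=0.7)

for output in llm.generate(["Write a product tagline for solar panels."], params):
    print(output.outputs[0].text)
```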
Marketing Relevance
TensorRT-LLM is the choice for maximum performance on NVIDIA GPUs. Ideal for enterprise APIs and latency-critical marketing applications.
Example
TensorRT-LLM can serve Llama 3 70B on an H100 at roughly 5,000 tokens/s, about 2-3x faster than vLLM on the same hardware; exact figures depend on batch size, quantization, and sequence lengths.
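Throughput figures like this are best verified on your own hardware. A minimal sketch, assuming the same LLM API as above; the model, batch size, and prompt are placeholders, and the token_ids field follows recent releases.

```python
# Rough tokens/s measurement under a batched load (in-flight batching
# is what makes large batches cheap in TensorRT-LLM).
import time
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # example model id
params = SamplingParams(max_tokens=256)
prompts = ["Summarize the benefits of email marketing."] * 64  # batched load

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tokens/s")
```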
Common Pitfalls
Runs only on NVIDIA GPUs (no AMD or Intel support). The engine build process is more complex than vLLM's setup, though engines can be cached, as sketched below. New model architectures are not always supported immediately. Requires up-to-date NVIDIA drivers and CUDA.
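One way to soften the build-process pitfall is to compile the engine once and reuse it. A hedged sketch, assuming the LLM.save() method and engine-directory loading of recent releases; names may vary by version, and the path is hypothetical.

```python
from tensorrt_llm import LLM

engine_dir = "./llama3_engine"  # hypothetical local path

# First run: compile the model into a TensorRT engine, then persist it.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # example model id
llm.save(engine_dir)

# Later runs: point LLM() at the saved engine directory to skip recompilation.
llm = LLM(model=engine_dir)
```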
Origin & History
NVIDIA has offered TensorRT for general deep learning inference since 2017. TensorRT-LLM, released in 2023, adds LLM-specific optimizations on top of it and is now NVIDIA's official solution for LLM deployment.
Comparisons & Differences
TensorRT-LLM vs. vLLM
vLLM is easier to set up and runs on a broader range of hardware and models; TensorRT-LLM typically achieves higher throughput on NVIDIA GPUs but involves a more complex build and deployment workflow.