LLM Inference Optimization: How to Speed Up, Cut Costs, and Scale AI Models

As businesses race to harness the power of Large Language Models, slow responses, rising costs, and hardware demands are becoming major roadblocks. But with the right optimization strategies, it’s possible to unlock faster, leaner, and more scalable LLM performance.

This guide breaks down the key techniques—distillation, quantization, batching, and KV caching—to help you get more out of your models without compromising quality. Let’s get into it.

Why LLM Inference Optimization Matters

Large Language Models (LLMs) are revolutionizing industries, but optimizing LLM inference remains a challenge: slow response times, high computational costs, and scalability bottlenecks can make real-world applications difficult to run in production.

This guide walks through the top LLM inference optimization strategies – distillation, quantization, batching, and KV caching – to reduce latency, minimize costs, and enhance scalability. We’ll cover:

Model distillation – Using a smaller, distilled version of the original model for efficiency.

Quantization – Reducing model precision to lower memory usage and improve speed.

Continuous Batching – Grouping requests dynamically to maximize throughput.

KV Caching – Reducing redundant computation to accelerate token generation.

By implementing these strategies, you can significantly cut costs, reduce latency, and scale LLM applications more effectively. Let’s dive in!

Model Distillation: Make LLMs Smaller and Faster

One of the most effective ways to optimize LLM inference is model distillation – a technique where a large, high-accuracy model (the “teacher”) is used to train a smaller, more efficient model (the “student”). This approach retains much of the original model’s knowledge while dramatically improving inference speed and reducing memory requirements.
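To make the idea concrete, here is a minimal sketch of classic logit-based knowledge distillation in PyTorch: the student is trained against a mix of the teacher’s temperature-softened output distribution and the ground-truth labels. The function and model names are illustrative, and note that some distilled releases (including DeepSeek’s) instead fine-tune the student directly on teacher-generated outputs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher -> student) with the usual
    cross-entropy on ground-truth next tokens."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard

def train_step(student, teacher, batch, optimizer):
    """One illustrative training step: the teacher is frozen, the student learns."""
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits
    loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```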

A great example, which we tested in practice, is DeepSeek-R1, released in several sizes. The smaller variants come from distillation and give more flexibility during model deployment – we can choose an appropriate version according to our priorities (for example, limited hardware). Distillation compresses the original ~1,543 GB model down to as little as ~4 GB, which means the smallest versions can even run on a laptop!

| Model | Parameters | VRAM Requirement |
|---|---|---|
| DeepSeek-R1 | 671B | ~1,543 GB |
| DeepSeek-R1-Distill-Llama-70B | 70B | ~181 GB |
| DeepSeek-R1-Distill-Qwen-32B | 32B | ~82 GB |
| DeepSeek-R1-Distill-Qwen-14B | 14B | ~36 GB |
| DeepSeek-R1-Distill-Qwen-7B | 7B | ~18 GB |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ~3.9 GB |

Source: https://apxml.com/posts/gpu-requirements-deepseek-r1

Since the original model is very large, distilling it down to 1.5B–70B parameters is rather aggressive, so it comes at the cost of some quality loss.

| Model | Quality (MATH-500 pass@1) |
|---|---|
| DeepSeek-R1 | 97.3 |
| DeepSeek-R1-Distill-Llama-70B | 94.5 |
| DeepSeek-R1-Distill-Qwen-1.5B | 83.9 |

For example, on the MATH-500 (pass@1) benchmark, the original (largest) model scores 97.3, the 70B distilled version scores 94.5, and the smallest 1.5B version scores 83.9.

For a deeper dive into LLM performance benchmarks, visit Hugging Face’s DeepSeek model overview and PromptHub’s analysis. 

Benefits of Distilling Large Language Models

Memory savings – Smaller models require less memory, which is a huge plus in resource-constrained scenarios.

Lower cost – Reduced hardware requirements lower the costs.

Faster inference – Smaller models run faster, which is a must if low latency is a critical requirement.

Trade-offs: Speed vs Accuracy in Distilled LLMs

  • With smaller size comes an accuracy loss – a trade-off that must be carefully weighed.

For many use cases, distilled models offer the best balance of speed, memory and accuracy, making them a powerful tool for optimizing LLM inference.

Quantization: Reduce Model Size and Inference Cost

Why quantize at all? Large Language Models (LLMs) are powerful, but they’re also resource-hungry. Running a full-precision LLM means high GPU memory usage, slow inference, and – most importantly – huge costs. This is where quantization comes to the rescue.

How LLM Quantization Works: Reduce Precision, Retain Performance

Quantization reduces the precision of model weights (e.g., from 32- or 16-bit floating point to 8-bit or even 4-bit integers), significantly lowering memory usage, improving inference speed, and cutting down hardware costs – all while maintaining acceptable accuracy. This technique makes it possible to run advanced LLMs on consumer GPUs, edge devices, and cloud environments more efficiently.

Source: https://www.inferless.com/learn/quantization-techniques-demystified-boosting-efficiency-in-large-language-models-llms
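At its core, the idea can be shown in a few lines. Below is a minimal sketch of symmetric per-tensor int8 quantization in PyTorch – real LLM quantizers (GPTQ, AWQ, GGUF) use per-channel or per-group scales plus calibration data, but the principle is the same; the tensor sizes are just for illustration.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: int8 weights + one fp32 scale."""
    scale = w.abs().max() / 127.0          # map the largest weight to +/-127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                # toy fp32 weight matrix (~64 MB)
q, scale = quantize_int8(w)                # int8 version (~16 MB)
error = (w - dequantize(q, scale)).abs().mean()
print(f"mean absolute rounding error: {error:.6f}")
```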

Choosing the Right Quantization Strategy for LLMs

✔  Post-Training Quantization (PTQ) – Fast and easy; quantized checkpoints are often already available. Recommended as a starting point.

✔  Dynamic Quantization – More accurate than PTQ, applies quantization during inference. In contrast to PTQ, the quantization range is dynamic instead of being fixed.

✔  Quantization-Aware Training (QAT) – Highest quality but requires retraining and significant resources.

In practice, PTQ is the most popular option and the easiest to start with. There are checkpoints available on HuggingFace, ready to download – just look at the model’s quantizations section.
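If a ready-made quantized checkpoint is not available for your model, libraries such as bitsandbytes can apply post-training quantization on the fly while loading the weights. Below is a minimal, hedged sketch using transformers – the model name is only an example, and a CUDA GPU plus the bitsandbytes package are assumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, swap in your own

# 4-bit NF4 post-training quantization applied at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "Explain KV caching in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```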

If you want to dive deeper into quantization methods, check out the resources linked at the end of this post.

Key Benefits of Quantizing Large Language Models

Memory savings – Quantization significantly reduces model size, which allows LLMs to run on smaller GPUs.

Lower cost – Reduced hardware requirements lower the costs.

Faster inference – Less memory means faster computation. Quantized models often achieve 2-4× faster inference.

Quality – Lower memory requirements unlock the potential to fit larger (more powerful) models on the same infrastructure.

Edge applications – Quantization makes LLMs small and fast enough to run on edge devices.

Source: https://www.exxactcorp.com/blog/deep-learning/what-is-quantization-and-llms

Quantization Challenges and Limitations

Too good to be true? Quantization is a game-changer for making LLMs more efficient, but it’s not without trade-offs. Here’s what you need to consider before jumping in:

  • Accuracy vs. Efficiency – The lower the precision, the greater the risk of accuracy loss.
  • Hardware compatibility – Not all GPUs support low-precision formats. If you work with older GPU architectures, your options may be limited; with relatively new hardware, there is usually nothing to worry about.

Yes, quantization comes with trade-offs – but for most real-world applications the benefits far outweigh the drawbacks. 8-bit quantization can reduce memory usage by 50% with minimal accuracy loss (~1%), and 4-bit methods can shrink model size by 75% while still keeping competitive performance.
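A quick back-of-the-envelope check of those numbers – the sketch below estimates the weight-only memory footprint (it ignores the KV cache and activations, so real deployments need some headroom):

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-only footprint; excludes KV cache, activations and overhead."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# 16-bit: ~14 GB, 8-bit: ~7 GB (-50%), 4-bit: ~3.5 GB (-75%)
```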

In short, quantization isn’t just an optimization – it’s the key to making LLMs scalable and affordable in the real world.

Quantization Benchmarks: Memory, Quality & Speed

If you want to dive deeper and build more intuition from benchmarks, here are two insights:

  • Memory (size) vs quality table.
  • Speed – before and after quantization comparison.

Memory vs Accuracy: Trade-offs in Quantized LLMs

Let’s look at the memory vs. quality trade-off reported for GGUF quantizations of DeepSeek-R1:

Observations:

  • INT8: the model is roughly 2x smaller and quality stays extremely high.
  • INT4: the model is roughly 4x smaller with only slightly lower quality.
  • Below INT4, the additional memory gain is small and the quality drop is drastic – we do not recommend such aggressive quantization.

Inference Speed Gains with Quantized Models

At deepsense.ai we benchmarked the generation speed of a few models before and after quantization. Here’s the comparison:

Generation speed:

| Model | Deployment | Base | Quantized – AWQ |
|---|---|---|---|
| deepseek (7B) | NVIDIA RTX 4090 (24GB) | 52 tokens/s | 130 tokens/s |
| deepseek (32B) | AWS EC2 g5.12xlarge (96GB) | 22 tokens/s | 50 tokens/s |
| mistral (7B) | AWS EC2 g5.xlarge (24GB) | 28 tokens/s | 88 tokens/s |
| LLama3.3 (70B) | AWS EC2 g5.48xlarge (192GB) | 23 tokens/s | 46 tokens/s |

The table shows a massive boost in generation speed: around 2x faster generation for the larger models – deepseek (32B) and LLama3.3 (70B) – and an even bigger 2.5-3x speedup for the smaller models – deepseek (7B) and mistral (7B).

This is real-life evidence of how much can be gained with AWQ quantization – all with a minimal (and often not even noticeable) accuracy trade-off.

Continuous Batching: Maximize Throughput in LLM Serving

Most LLM inference workloads involve multiple users sending requests at different times. Instead of handling each request independently, batching groups requests together; processing them collectively increases GPU utilization and therefore efficiency. The downside is that when sequences in a batch have varying lengths, we have to wait for the longest one to finish. This bottleneck can be addressed with continuous batching. Let’s explore how it optimizes LLM inference for real-world performance.

How Continuous Batching Improves Efficiency

Instead of waiting until every sequence in a batch has completed generation, iteration-level scheduling is applied, where the batch size is determined per iteration. The result is that once a sequence in a batch has completed generation, a new sequence can be inserted in its place, yielding higher GPU utilization than static batching.

Source: https://www.anyscale.com/blog/continuous-batching-llm-inference 

In practice there is a caveat that makes things a bit more complicated: prefill computation and new-token generation have different computational patterns. The good news is that LLM serving frameworks such as vLLM already handle this problem.
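You rarely need to implement continuous batching yourself – serving frameworks apply it automatically whenever many requests are in flight. As a hedged illustration, the sketch below fires concurrent requests at a locally running vLLM OpenAI-compatible server (assumed to be already started; the model name and port are examples), and the engine interleaves them at the iteration level:

```python
import asyncio
from openai import AsyncOpenAI

# Assumes a vLLM OpenAI-compatible server is already serving this model locally.
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # example model name
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarize document #{i} in one sentence." for i in range(32)]
    # All 32 requests are in flight at once; the server batches them continuously.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(answers[0])

asyncio.run(main())
```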

Why Use Continuous Batching for LLM Inference?

✔ Allows models to handle hundreds or thousands of concurrent users efficiently.

✔ Increases GPU utilization, leading to higher throughput.

✔ Works well with token streaming, keeping inference fast and responsive.

Batching Trade-offs: Latency vs Throughput

  • Batching makes response times slower for individual requests. If low latency is key, the latency/throughput trade-off has to be chosen carefully.

Even though batching comes with a trade-off, for high-load systems, it’s a must-have optimization to scale efficiently.

Key-Value Caching: Speed Up Long-Sequence LLM Generation

LLMs generate text token by token, and each step attends over the entire context so far. Key-Value (KV) caching eliminates redundant work by storing the key and value tensors computed for previous tokens and reusing them at every subsequent step, speeding up inference. This technique is especially useful in long-sequence generation, where the speedup is the highest.
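This is what happens under the hood when you generate with caching enabled in transformers. The explicit greedy-decoding loop below just makes the reuse visible – gpt2 is used only because it is small, and the loop is a simplified sketch rather than a production decoder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # tiny model, just to illustrate the mechanism
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(20):
        # After the first step, only the newest token is fed; keys/values for
        # all earlier tokens are reused from past_key_values instead of recomputed.
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```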

How KV Caching Boosts Inference Speed

✔ Speeds up inference.

✔ The benefit is highest for long sequences.

Memory Trade-offs with KV Caching

  • KV caching increases memory usage, since the keys and values for past tokens must be stored.

Want to unlock even greater LLM inference performance?

We highly recommend the vLLM serving framework, which supports the optimizations mentioned above out of the box:

  • Continuous batching,
  • Quantization,
  • KV caching.

But it supports even more:

  • PagedAttention,
  • Model execution with CUDA/HIP graphs,
  • Optimized CUDA kernels,
  • Speculative decoding,
  • Chunked prefill.

Based on our research and practical experience with LLM deployments at deepsense.ai, we find working with vLLM easy and practical. The wide range of available optimizations makes this framework highly efficient at serving LLMs.
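Getting started takes only a few lines of Python. Here is a hedged sketch of offline inference with vLLM – continuous batching and PagedAttention-based KV caching are handled by the engine, and a pre-quantized AWQ checkpoint is loaded (the repository name and settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Example setup: a pre-quantized AWQ checkpoint on a single GPU.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # illustrative AWQ repo
    quantization="awq",
    gpu_memory_utilization=0.90,
)

prompts = [
    "Explain continuous batching in two sentences.",
    "List three benefits of quantization.",
]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

# The engine batches the prompts internally and reuses KV cache blocks.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```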

Summary: LLM Inference Optimization Techniques & Results

Distilled Models

  • Faster inference due to reduced parameter count.
  • Lower memory usage, enabling deployment on smaller hardware.
  • Accuracy loss – depends on the size of the student model.

Quantization

  • Inference speed up: ~2x is realistic.
  • GPU memory reduction: 2x for int8, 4x for int4.
  • Minor quality drop for int8 and int4.
  • Aggressive quantization (below int4) affects quality significantly. We do not recommend reducing precision below 4-bit.

Continuous Batching

  • Dynamically groups requests to increase efficiency.
  • Boosts throughput.
  • Lowers cost per request.
  • Works well with high-traffic APIs.
  • Trade-off: latency increase for individual requests.

KV Cache Optimization

  • Faster inference, especially for long-text generation.
  • Increases memory usage.

How to Combine Optimization Techniques for Best Results

Each of these techniques – distillation, quantization, batching, and KV caching – addresses a different inference bottleneck. By combining them, you can achieve faster, cheaper, and more scalable LLM deployments without sacrificing too much accuracy. As AI adoption grows, efficient deployment will be as important as model performance.

More Resources on LLM Inference Optimization

  1. vLLM docs
  2. NVIDIA Inference Optimization post 
  3. Deepsense.ai blogpost on quantization and LoRA for LLM cost reduction 

Katarzyna Rutkowska
