Deploy vLLM Production Stack on CCE

vLLM is an open-source inference and serving engine for large language models. It is designed to improve serving throughput and GPU memory efficiency, mainly through PagedAttention, continuous batching, prefix caching, and an OpenAI-compatible serving interface.

The vLLM Production Stack is a separate upstream project from the vLLM ecosystem. It provides a Kubernetes-native, cluster-wide reference implementation for deploying inference services on top of vLLM. Its purpose is to help users move from a single vLLM instance to a distributed deployment without changing application code. It also includes deployment options such as Helm chart deployment, CRD-based deployment, and Gateway API integration, together with documented capabilities such as metrics dashboards, request routing, KV cache-aware routing, prefix-aware routing, distributed tracing, KEDA-based autoscaling, and other production-oriented patterns.

Prerequisites

The prerequisites for this guide are identical to those described in the blueprint Build a Unified LLM Gateway with LiteLLM on CCE. Before proceeding, ensure that all required services, infrastructure components, access permissions, and configuration prerequisites outlined in the blueprint have been completed.

Defining and Applying Configuration

Before deploying or configuring anything, make sure the required namespace exists by creating it with the following command:

kubectl create namespace vllm

Creating the Secret

Before deploying the vLLM Production Stack, create a Kubernetes Secret manifest, vllm-prodstack-secrets.yaml, to provide the required runtime configuration and credentials:

vllm-prodstack-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: vllm-prodstack-secrets
type: Opaque
stringData:
  VLLM_MASTER_KEY: <VLLM_MASTER_KEY>
  HF_TOKEN: <HF_TOKEN>
note

Each key in this secret serves a specific purpose:

  • VLLM_MASTER_KEY: used in this deployment as a shared secret for authenticating requests to and between vLLM Production Stack components. The value is stored in a Kubernetes Secret and injected into the workloads as an environment variable, and clients present it as a Bearer token when calling the OpenAI-compatible endpoint. All participating services must use the same key value to establish trusted communication. For production environments, generate the key as a strong random secret and manage it securely through Kubernetes Secrets or an external secret management solution.
  • HF_TOKEN: used to authenticate against Hugging Face services. The deployment may require access to Hugging Face-hosted model repositories, tokenizer assets, or embedding models during model initialization and runtime operations. The token is primarily required for gated or private repositories, while publicly accessible models can generally be downloaded without authentication.

In this blueprint, the Secret separates sensitive runtime credentials from the public Helm configuration. This makes the configuration easier to maintain and reduces the risk of exposing API keys or third-party access tokens. Apply the Secret to the vllm namespace:

kubectl apply -f vllm-prodstack-secrets.yaml -n vllm
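The recommendation above to use a strong random key can be sketched with standard tools. This is a minimal sketch, not the blueprint's own procedure: the helper names gen_master_key and create_vllm_secret are hypothetical, and the key length of 32 random bytes is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
gen_master_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: create or update the Secret from this guide without
# writing the key to a file. HF_TOKEN must be set in the calling environment.
create_vllm_secret() {
  : "${HF_TOKEN:?set HF_TOKEN first}"
  kubectl create secret generic vllm-prodstack-secrets \
    --namespace vllm \
    --from-literal=VLLM_MASTER_KEY="$(gen_master_key)" \
    --from-literal=HF_TOKEN="$HF_TOKEN" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Calling create_vllm_secret replaces the manifest-based step above. Each call generates a new key, so rerunning it rotates VLLM_MASTER_KEY and the workloads must be restarted to pick up the new value.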

Creating the Persistent Volume Claim

vLLM persists model artifacts locally once they are pulled. In a single-replica setup, a node-local or ReadWriteOnce volume is sufficient. However, when scaling to multiple replicas, each pod would otherwise maintain its own copy of the models. This leads to redundant storage consumption, increased network usage during model downloads, and longer initialization times.

To address this, a shared storage backend is required. SFS Turbo provides a managed, POSIX-compliant file system with ReadWriteMany semantics, allowing multiple pods across nodes to access the same data concurrently. By mounting SFS Turbo as the persistent volume for vLLM, all replicas can reuse a single set of model files. This approach improves operational efficiency by reducing duplication and ensuring consistency across replicas. It also simplifies scaling, as new pods can immediately access preloaded models without requiring additional initialization steps.

First, create a PersistentVolumeClaim manifest, pvc-vllm-models.yaml, that mounts the SFS Turbo file system created earlier:

pvc-vllm-models.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-vllm-models
  labels:
    model: "vllm-models"
  annotations:
    everest.io/volume-as: absolute-path
    everest.io/sfsturbo-share-id: <SFSTURBO_SHARE_ID>
    everest.io/path: /vllm-models
    everest.io/reclaim-policy: retain-volume-only
    everest.io/csi.enable-sfsturbo-dir-quota: "true"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: csi-sfsturbo

kubectl apply -f pvc-vllm-models.yaml -n vllm
important

Replace the value of everest.io/sfsturbo-share-id with the ID of the SFS Turbo file system in your environment.
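Before any serving pod mounts the claim, it can help to confirm that the claim was actually bound to the SFS Turbo share. The following is a minimal sketch; pvc_is_bound is a hypothetical helper name, and it only inspects the standard status.phase field of the claim.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the claim reports phase "Bound".
# Args: claim name, optional namespace (defaults to vllm).
pvc_is_bound() {
  phase="$(kubectl get pvc "$1" -n "${2:-vllm}" -o jsonpath='{.status.phase}')"
  [ "$phase" = "Bound" ]
}

# Example:
#   pvc_is_bound pvc-vllm-models || kubectl describe pvc pvc-vllm-models -n vllm
```

If the claim stays in Pending, the events printed by kubectl describe usually point to the cause, for example a wrong share ID or storage class.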

Installing KubeRay Operator (Optional)

note

KubeRay is the Kubernetes integration project for Ray. It allows Ray clusters and Ray applications to run natively on Kubernetes by introducing Kubernetes-aware management and automation capabilities for Ray workloads.

Ray itself is a distributed computing framework used primarily for AI and machine learning workloads. It provides mechanisms for distributing Python-based execution across multiple processes, GPUs, and nodes. Ray is commonly used for distributed model training, batch processing, distributed inference, and scalable serving applications. More information about the framework can be found on its official site.

The KubeRay Operator is the Kubernetes Operator component of the KubeRay project. It extends Kubernetes through Custom Resource Definitions (CRDs) and controllers that automate the deployment and lifecycle management of Ray clusters. Instead of manually creating and coordinating Ray head nodes, worker nodes, services, and scaling logic, the operator manages these resources declaratively through Kubernetes manifests.

important

KubeRay is necessary only for distributed vLLM serving patterns that require multi-node orchestration, such as pipeline or tensor parallelism that spans several worker nodes. For single-node deployments, including single-node tensor parallelism, KubeRay is not required and the vLLM Production Stack can operate with standard Kubernetes resources alone.

In this blueprint, the KubeRay Operator is installed to provide the Kubernetes control layer required for distributed Ray-based inference workloads. The vLLM Production Stack uses RayCluster and RayService resources to deploy, scale, and manage distributed vLLM serving instances across the Kubernetes cluster. This step prepares the CCE environment with the necessary Kubernetes extensions and controllers required for distributed inference orchestration.

When to use KubeRay?

| Use Case | KubeRay Required | Advantages | Trade-Offs | Typical Prerequisites |
|---|---|---|---|---|
| Serving very large language models across multiple GPUs | Yes | Enables tensor parallelism and distributed inference execution across nodes and GPUs | Increased operational complexity and additional Kubernetes components | Multi-GPU worker nodes, high-speed node networking, distributed storage considerations |
| High-throughput production inference APIs | Yes | Supports horizontal scaling, workload distribution, and centralized orchestration | More complex deployment lifecycle and monitoring requirements | Kubernetes autoscaling strategy, GPU capacity planning, ingress and observability stack |
| Shared enterprise AI inference platform | Yes | Centralized management, automated recovery, scalable multi-workload orchestration | Higher infrastructure footprint and operational overhead | Multi-node Kubernetes cluster, GPU scheduling strategy, cluster observability |
| Dynamic scaling of inference workloads | Yes | Enables automated scaling and distributed worker management | Additional dependency on Ray and KubeRay control components | KEDA or autoscaling integration, sufficient spare cluster resources |
| Single-node vLLM deployment | 🚫 No | Simpler deployment and reduced operational overhead | Limited scalability and no distributed execution | Single GPU-enabled Kubernetes worker node |
| Development and proof-of-concept environments | 🚫 No | Faster setup and easier troubleshooting | Not suitable for large-scale or distributed inference workloads | Minimal Kubernetes cluster with GPU support |
| Low-throughput internal inference services | 🚫 No | Lightweight operational model with fewer moving components | Limited high-availability and scaling capabilities | One or few GPU worker nodes |
| Standalone model serving without distributed execution | 🚫 No | Easier maintenance and simpler Kubernetes manifests | No multi-node coordination or distributed scheduling | Standard Kubernetes deployment with vLLM container image |
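For the parallelism criterion alone, the decision can be reduced to a rule of thumb: multi-node orchestration is only needed when the number of GPUs a model is split across exceeds the GPUs available on one worker node. The sketch below encodes that rule; needs_kuberay is a hypothetical helper, and it deliberately ignores the platform-level reasons listed in the table.

```shell
#!/bin/sh
# Hypothetical rule of thumb: Ray/KubeRay is needed when the world size
# (tensor-parallel size x pipeline-parallel size) exceeds one node's GPUs.
# Args: tensor-parallel size, pipeline-parallel size, GPUs per worker node.
needs_kuberay() {
  tp="$1"; pp="$2"; gpus_per_node="$3"
  [ $((tp * pp)) -gt "$gpus_per_node" ]
}

# Example: two-way tensor parallelism on a node with two GPUs.
if needs_kuberay 2 1 2; then
  echo "KubeRay required"
else
  echo "KubeRay not required"
fi
```

With the arguments 1 2 1 (two pipeline stages on single-GPU nodes) the helper succeeds, which matches the multi-node scenario later in this guide.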

Deploying the Operator

We are going to deploy the KubeRay Operator using the official Helm chart:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

helm install kuberay-operator kuberay/kuberay-operator \
  -n kuberay-system \
  --create-namespace
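After the chart is installed, a quick check can confirm that the operator is running and that the Ray custom resources are registered. This is a minimal sketch: kuberay_ready is a hypothetical helper, and the Deployment name kuberay-operator is an assumption derived from the Helm release name used above.

```shell
#!/bin/sh
# Hypothetical helper: wait for the operator Deployment to become available,
# then list the CRDs that the KubeRay chart registers.
kuberay_ready() {
  kubectl wait deployment/kuberay-operator -n kuberay-system \
    --for=condition=Available --timeout=120s &&
  kubectl get crd rayclusters.ray.io rayservices.ray.io rayjobs.ray.io
}

# Example:
#   kuberay_ready || kubectl get pods -n kuberay-system
```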

Scenarios

| Scenario | Use Case | Prerequisites | Limitations |
|---|---|---|---|
| 🥉 Single GPU, standalone serving | Baseline inference deployment where the model fits entirely into one GPU | One GPU-enabled CCE worker node, vLLM Production Stack, optional Hugging Face token | Limited by the VRAM and throughput of a single GPU |
| 🥇 Multi-node distributed serving | Serving a model across GPUs located on different Kubernetes worker nodes | Ray/KubeRay, distributed networking between nodes, compatible vLLM distributed configuration | Cross-node GPU communication introduces network overhead |
| 🥈 Single-node tensor parallelism | Serving a large model on a multi-GPU node by splitting tensor operations across GPUs | One Kubernetes worker node with multiple GPUs, fast GPU interconnect preferred | Performance depends heavily on GPU topology and interconnect bandwidth |
| 🥇 Disaggregated prefill/decode serving | Separating prompt ingestion and token generation workloads for higher serving efficiency | Multiple coordinated vLLM instances, KV cache transfer support, routing layer | Operationally more complex than standard distributed inference |

Single-Node, Standalone Serving

prerequisites

We will need 1 node with specs: GPU-accelerated | pi5e.2xlarge.4 | 8 vCPUs | 32 GiB | GPU: nvidia-l4 x 1

In this scenario, we deploy the openai/gpt-oss-20b model as a standalone inference service on CCE using a single NVIDIA L4 GPU. The objective is to establish a minimal production-ready deployment pattern that exposes the model through the OpenAI-compatible vLLM API without introducing distributed execution or multi-node orchestration complexity.

note

openai/gpt-oss-20b is an open-weight large Mixture of Experts (MoE) model from OpenAI designed for general-purpose text generation, reasoning, code generation and agentic workloads. With approximately 20 billion parameters (3.6B active), the model provides substantially higher reasoning and language generation capability than smaller instruction-tuned models while still remaining deployable on modern enterprise GPU hardware with careful memory utilization tuning and quantization-aware serving strategies.

Within this deployment, vLLM acts as the inference runtime responsible for loading the model, optimizing GPU memory usage, managing request batching, and exposing the serving endpoint through an OpenAI-compatible API interface. This allows existing AI applications, SDKs, and automation frameworks to consume the model using familiar OpenAI API semantics without requiring model-specific integrations.

Can GPT-OSS 20B run on NVIDIA L4 24GB?

The gpt-oss-20b model can operate on a single NVIDIA L4 GPU with 24 GB of VRAM, as the model requires approximately 18.6 GB of GPU memory in this configuration. The throughput figure of roughly 35 tokens per second on NVIDIA L4 hardware was measured with Q4_K_M quantization, which is a GGUF quantization format, so results with the vLLM configuration used in this scenario may differ.

For a more detailed report check here.

Persistent model caching is enabled through a mounted Kubernetes persistent volume so that downloaded Hugging Face artifacts survive pod restarts and redeployments. This significantly reduces model initialization times and avoids repeated downloads during operational lifecycle events.

vllm-values-single-node.yaml
servingEngineSpec:
  vllmApiKey:
    secretName: vllm-prodstack-secrets
    secretKey: VLLM_MASTER_KEY

  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

  modelSpec:
    - name: gpt-oss-20b
      repository: vllm/vllm-openai
      tag: latest
      modelURL: openai/gpt-oss-20b

      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN

      replicaCount: 1

      requestCPU: 4
      requestMemory: "16Gi"
      requestGPU: 1

      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models

      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models

      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4

      vllmConfig:
        extraArgs:
          - "--served-model-name"
          - "gpt-oss-20b"
          - "--gpu-memory-utilization"
          - "0.90"
          - "--max-model-len"
          - "8192"
          - "--download-dir"
          - "/models/vllm-downloads"
note

requestCPU & requestMemory define the host resources reserved for the vLLM serving pod. The GPU remains the primary execution resource for model inference, while CPU and system memory support request processing, tokenization, model initialization, cache handling, and the OpenAI-compatible API process. For this single-GPU baseline, the deployment reserves 4 vCPUs and 16 GiB of memory per serving replica.

requestCPU: 4
requestMemory: "16Gi"
requestGPU: 1

We can now deploy the vLLM Production Stack with the Helm chart:

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

helm upgrade --install vllm vllm/vllm-stack \
  -n vllm --create-namespace \
  -f vllm-values-single-node.yaml \
  --reset-values
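Once the pods are running, a simple smoke test is to list the served models through the OpenAI-compatible API. The sketch below makes two assumptions that are not stated in this guide: that the chart exposes the router as a Service named vllm-router-service on port 80 for a release named vllm, and that local port 30080 is free. list_models is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: query the OpenAI-compatible model list.
# Args: base URL of the endpoint, API key (the VLLM_MASTER_KEY value).
list_models() {
  base_url="$1"; api_key="$2"
  curl -sS "$base_url/v1/models" -H "Authorization: Bearer $api_key"
}

# Example (assumed Service name and an arbitrary local port):
#   kubectl get pods -n vllm
#   kubectl port-forward -n vllm svc/vllm-router-service 30080:80 &
#   list_models http://127.0.0.1:30080 "<VLLM_MASTER_KEY>"
```

The response should contain the served model name gpt-oss-20b configured through --served-model-name.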

Single-Node, Tensor Parallelism

prerequisites

We will need 1 node with specs: GPU-accelerated | pi2.4xlarge.4 | 16 vCPUs | 64 GiB | GPU: nvidia-t4 x 2

This scenario demonstrates how vLLM can use tensor parallelism to deploy larger instruction-tuned models on compact GPU configurations commonly used for edge inference, development environments, or cost-sensitive AI platforms. The deployment uses a CCE worker node equipped with two NVIDIA Tesla T4 GPUs and serves Qwen/Qwen2.5-32B-Instruct-AWQ, a 32-billion-parameter model provided in AWQ INT4 format.

Can Qwen2.5-32B-Instruct-AWQ run on NVIDIA T4 16GB?

The AWQ INT4 checkpoint requires approximately 18-22 GB of effective GPU memory during inference, before additional memory is reserved for KV cache and runtime overhead. This exceeds the capacity of a single 16 GB Tesla T4 GPU. By enabling tensor parallelism, vLLM distributes the model weights across both local GPUs and exposes the deployment through a single OpenAI-compatible inference endpoint.

note

Qwen/Qwen2.5-32B-Instruct-AWQ is a 32-billion-parameter instruction-tuned language model from Alibaba Cloud optimized for conversational AI, reasoning, code generation, and multilingual inference workloads. The AWQ INT4 quantized variant significantly reduces GPU memory consumption while maintaining strong inference performance and instruction-following quality. This allows the model to be deployed efficiently on smaller accelerator classes such as dual NVIDIA Tesla T4 GPUs using tensor parallelism through vLLM.

The T4 platform is particularly well suited for this type of deployment because it supports AWQ quantization acceleration directly. This allows the model to run efficiently in its native AWQ INT4 format while maintaining a relatively small memory footprint compared to full-precision deployments. In practice, the quantized model requires approximately 18-22 GB of effective GPU memory during inference, making a dual-T4 configuration a realistic target for this deployment topology.
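The memory figures above can be turned into a rough per-GPU budget. The sketch below assumes an even split of the model footprint across tensor-parallel ranks and uses the 16 GB T4 capacity together with the --gpu-memory-utilization value of 0.80 from the values file in this scenario. tp_headroom_gb is a hypothetical helper, and the result is only an estimate of what remains for KV cache and runtime buffers.

```shell
#!/bin/sh
# Hypothetical helper: memory left per GPU after an even tensor-parallel split.
# Args: GPU memory (GB), --gpu-memory-utilization, model footprint (GB), TP size.
tp_headroom_gb() {
  awk -v mem="$1" -v util="$2" -v model="$3" -v tp="$4" \
    'BEGIN { printf "%.1f\n", mem * util - model / tp }'
}

tp_headroom_gb 16 0.80 18 2   # lower footprint estimate
tp_headroom_gb 16 0.80 22 2   # upper footprint estimate
```

Under these assumptions the two calls print 3.8 and 1.8, which is why the scenario keeps --max-model-len small. With a tensor-parallel size of 1 the result is negative, meaning the model does not fit on one T4.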

note

Since both GPUs are attached to the same Kubernetes worker node, no Ray or KubeRay components are required for this scenario.

The final configuration should look like this:

vllm-values-single-node-tp-t4.yaml
servingEngineSpec:
  vllmApiKey:
    secretName: vllm-prodstack-secrets
    secretKey: VLLM_MASTER_KEY

  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

  modelSpec:
    - name: qwen2-5-32b-instruct-awq
      repository: vllm/vllm-openai
      tag: v0.8.5
      modelURL: Qwen/Qwen2.5-32B-Instruct-AWQ

      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN

      replicaCount: 1

      requestCPU: 8
      requestMemory: "48Gi"
      requestGPU: 2

      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"

      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models

      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models

      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-t4

      vllmConfig:
        tensorParallelSize: 2
        pipelineParallelSize: 1

        extraArgs:
          - "--served-model-name"
          - "qwen2-5-32b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.80"
          - "--max-model-len"
          - "2048"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"

      shmSize: "32Gi"

We can now deploy the vLLM Production Stack with the Helm chart:

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

helm upgrade --install vllm vllm/vllm-stack \
  -n vllm --create-namespace \
  -f vllm-values-single-node-tp-t4.yaml \
  --reset-values

After deployment, the vLLM logs should confirm that tensor parallelism initialized successfully with two tensor-parallel ranks, and nvidia-smi should show model memory allocated across both T4 devices while inference requests are served through the OpenAI-compatible API endpoint. Open a shell on the pod that hosts the model and execute the following command, after replacing <VLLM_MASTER_KEY> with your key:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <VLLM_MASTER_KEY>" \
  -d '{
    "model": "qwen2-5-32b-instruct-awq",
    "messages": [
      {
        "role": "user",
        "content": "Explain tensor parallelism in one paragraph."
      }
    ]
  }'

Multi-Node, Distributed Serving

prerequisites

We will need 2 nodes with specs: GPU-accelerated | pi5e.4xlarge.4 | 16 vCPUs | 64 GiB | GPU: nvidia-l4 x 1

This scenario demonstrates how large language models can be served across multiple Kubernetes worker nodes when a single GPU is no longer sufficient to host the model independently. Instead of relying on one large multi-GPU server, the deployment combines multiple smaller GPU nodes into a distributed inference topology using Ray, KubeRay, and vLLM. This approach is particularly relevant for enterprise environments where GPU resources are fragmented across several worker nodes or where high-memory accelerator platforms are either unavailable or prohibitively expensive.

The Qwen2.5-32B-Instruct-AWQ model is distributed across two single-GPU NVIDIA L4 worker nodes through pipeline parallelism. Each node contributes one GPU to the inference pipeline, while Ray orchestrates the distributed execution and communication between the participating workers. From the client perspective, however, the deployment still behaves as a single OpenAI-compatible inference endpoint. The complexity of model distribution, stage coordination, and GPU communication remains fully abstracted behind the vLLM serving layer.

This architecture addresses one of the most common operational challenges in enterprise AI adoption: deploying models that exceed the practical memory capacity of individual accelerators without requiring specialized high-end GPU servers. By distributing the model across multiple commodity GPU nodes, organizations can increase the usable capacity of their existing Kubernetes GPU infrastructure while maintaining centralized API access, Kubernetes-native orchestration, and scalable distributed inference management through KubeRay.

Can Qwen2.5-32B-Instruct-AWQ run on a single NVIDIA L4?

The selected AWQ INT4 model requires approximately 18-22 GB of effective GPU memory during inference once model weights, runtime overhead, and KV cache allocation are taken into account. Although this is significantly smaller than a full-precision 32B model, it leaves little practical headroom on a single NVIDIA L4 with 24 GB VRAM. For this scenario, the model is therefore distributed across two single-GPU L4 nodes with Ray and pipeline parallelism, allowing vLLM to serve the model through one endpoint while spreading the memory load across both GPUs.

The final configuration should look like this:

vllm-values-multi-node-distributed.yaml
servingEngineSpec:
  vllmApiKey:
    secretName: vllm-prodstack-secrets
    secretKey: VLLM_MASTER_KEY

  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

  modelSpec:
    - name: qwen2-5-32b-instruct-awq
      repository: vllm/vllm-openai
      tag: v0.8.5
      modelURL: Qwen/Qwen2.5-32B-Instruct-AWQ

      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN

      replicaCount: 1

      requestCPU: 4
      requestMemory: "20Gi"
      requestGPU: 1

      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"
        - name: VLLM_ATTENTION_BACKEND
          value: "FLASHINFER"

      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models

      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models

      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4

      vllmConfig:
        tensorParallelSize: 1
        pipelineParallelSize: 2
        dtype: "float16"
        extraArgs:
          - "--served-model-name"
          - "qwen2-5-32b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.90"
          - "--max-model-len"
          - "2048"
          - "--kv-cache-dtype"
          - "fp8"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"

      shmSize: "12Gi"

      raySpec:
        enabled: true
        headNode:
          requestCPU: 4
          requestMemory: "20Gi"
          requestGPU: 1

          nodeSelectorTerms:
            - matchExpressions:
                - key: accelerator
                  operator: In
                  values:
                    - nvidia-l4

We can now deploy the vLLM Production Stack with the Helm chart:

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

helm upgrade --install vllm vllm/vllm-stack \
  -n vllm --create-namespace \
  -f vllm-values-multi-node-distributed.yaml \
  --reset-values

After successful deployment, use the Ray head pod to validate the distributed cluster state. The head pod is the cluster coordinator and exposes the complete Ray runtime view.
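One way to inspect the cluster state from the head pod is to run ray status inside it. The following is a minimal sketch: ray_cluster_status is a hypothetical helper, and the head pod name is passed as an argument because it depends on the release and model names in your deployment.

```shell
#!/bin/sh
# Hypothetical helper: run "ray status" inside the Ray head pod.
# Args: head pod name, optional namespace (defaults to vllm).
ray_cluster_status() {
  head_pod="$1"
  kubectl exec -n "${2:-vllm}" "$head_pod" -- ray status
}

# Example: find the head pod first, then query it.
#   kubectl get pods -n vllm -o wide
#   ray_cluster_status <RAY_HEAD_POD>
```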

note

The ray status command validates the distributed execution layer itself. It confirms that the Ray cluster was formed successfully, that both Ray nodes joined the cluster correctly, and that the distributed resources required by vLLM are available. In this scenario, the output should report two active nodes and two available GPUs, proving that the pipeline-parallel deployment spans both Kubernetes worker nodes rather than running entirely on a single host.

To validate GPU utilization directly, open shells on both the Ray head and Ray worker pods and execute:

nvidia-smi

Both nvidia-smi outputs should show active CUDA processes and allocated GPU memory from the vLLM worker runtime. This confirms that the model weights and inference pipeline stages were successfully distributed across both Kubernetes worker nodes rather than executing entirely on a single GPU.

note

The nvidia-smi command validates GPU participation on each node. While Ray confirms the logical distributed cluster, nvidia-smi confirms that the vLLM worker processes are actively consuming GPU memory on both accelerators. This is particularly important in multi-node inference scenarios because a successful API response alone does not guarantee that the model was actually distributed across both GPUs. Observing active memory allocation and running CUDA processes on both nodes confirms that the model weights and inference workload were successfully placed across the distributed pipeline stages.

Disaggregated Prefill/Decode Serving

prerequisites

We will need 2 nodes with specs: GPU-accelerated | pi5e.4xlarge.4 | 16 vCPUs | 64 GiB | GPU: nvidia-l4 x 1

This scenario addresses a different problem than the previous deployment patterns. The objective is no longer simply making a model fit into GPU memory, but improving inference efficiency and reducing latency under high request concurrency.

Large language model inference consists of two fundamentally different phases. During the prefill phase, the model processes the entire input prompt and builds the KV cache. This stage is highly compute-intensive and benefits from large parallel processing capacity. During the decode phase, the model generates tokens iteratively using the previously generated KV cache. Decode workloads are typically more latency-sensitive and often become bottlenecked by memory bandwidth and cache access patterns rather than raw compute throughput.

In a traditional deployment, both phases execute on the same GPU resources. This means prompt ingestion and token generation compete for the same accelerator capacity. Under heavy concurrent load, long prompts can therefore negatively impact token generation latency for active inference sessions.

Disaggregated prefill/decode serving separates these workloads into dedicated execution paths. One group of vLLM instances is optimized for prompt ingestion and KV cache generation, while another group is optimized for token decoding. KV cache data is transferred between both stages through a coordinated serving layer. This allows the infrastructure to scale and tune both workloads independently depending on traffic patterns and application behavior.

This architecture becomes particularly valuable for enterprise AI platforms serving many simultaneous users, agentic workflows, retrieval-augmented generation pipelines, or applications with large prompt contexts. By separating prompt processing from token generation, organizations can improve GPU utilization efficiency, reduce tail latency, and optimize infrastructure allocation for different inference characteristics rather than treating all inference workloads identically.

For the prefill/decode scenario, the deployment is no longer built around one vLLM engine. It defines two separate serving engines for the same model: one engine handles the prefill phase, and the other handles the decode phase. To keep the resource footprint of this blueprint small, both engines use the same Qwen/Qwen2.5-14B-Instruct-AWQ checkpoint and expose the same served model name, but they are assigned different roles in the inference flow.

The prefill engine is configured as the KV cache producer. It processes the incoming prompt, builds the initial KV cache, and sends that cache to the decode engine. This is why its LMCache configuration uses kvRole: kv_producer, nixlRole: sender, and points to the decode engine service through nixlPeerHost. The decode engine is configured as the KV cache consumer. It receives the transferred cache and continues token generation, so its LMCache configuration uses kvRole: kv_consumer, nixlRole: receiver, and listens on 0.0.0.0:55555.
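The nixlPeerHost value on the prefill side has to match the Service name of the decode engine. Judging from the value used in this guide, the name appears to follow the pattern release name, model name, then the suffix engine-service. That pattern is an inference from the configuration rather than a documented contract, and engine_service_name is a hypothetical helper.

```shell
#!/bin/sh
# Assumption (inferred from the nixlPeerHost value in this guide): engine
# Services are named <release>-<model name>-engine-service.
# Args: Helm release name, modelSpec name.
engine_service_name() {
  printf '%s-%s-engine-service\n' "$1" "$2"
}

engine_service_name vllm qwen2-5-14b-decode
```

For the release name vllm this prints vllm-qwen2-5-14b-decode-engine-service. If the chart is installed under a different release name, nixlPeerHost would need to change accordingly; kubectl get svc -n vllm shows the names that actually exist.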

note

LMCache is the distributed KV cache layer used by vLLM to transfer and manage inference state between the prefill and decode engines. In a traditional deployment, the KV cache remains local to the same GPU that processed the prompt. In a disaggregated architecture, however, the prompt is processed by one engine while token generation continues on another engine running on a different GPU or node. The KV cache therefore needs to be transferred between both stages efficiently and with minimal overhead.

In this scenario, LMCache acts as the coordination and transport layer for that cache exchange. The prefill engine generates the KV cache and publishes it through the LMCache producer configuration, while the decode engine receives the cache through the LMCache consumer configuration. This allows the decode engine to continue token generation without reprocessing the original prompt from scratch.

The deployment uses LMCache together with NIXL-enabled GPU transfer buffers. This enables the KV cache to be exchanged directly through GPU-accessible memory instead of repeatedly serializing and rebuilding inference state between requests. From an architectural perspective, LMCache is the component that makes disaggregated serving practical, because it decouples prompt processing from token generation while preserving the inference context required for continuous generation.

Without LMCache, both phases would need to execute on the same vLLM engine, which would eliminate the operational benefits of separating prompt ingestion and decode workloads across independent GPU resources.

Both engines enable prefix caching and LMCache because cache transfer is the core mechanism that makes disaggregated prefill/decode serving work. NIXL is enabled as the transport layer for moving KV cache data between the producer and consumer, and a CUDA-backed transfer buffer is configured so the cache exchange can use GPU memory.

The router is also configured differently from the previous scenarios. Instead of simple request distribution, it uses disaggregated_prefill routing logic. This allows the router to distinguish between the prefill and decode backends through their model labels and send each request through the correct two-stage inference path.

note

NIXL is the high-performance transport layer used by LMCache to move KV cache data between the prefill and decode engines. In a disaggregated serving architecture, the KV cache generated during prompt ingestion must be transferred efficiently from one GPU-backed vLLM instance to another. NIXL provides the communication mechanism that enables this exchange without forcing the cache data to be repeatedly serialized through slower CPU-bound workflows.

In this deployment, the prefill engine acts as the NIXL sender while the decode engine acts as the NIXL receiver. Both engines allocate dedicated GPU-backed transfer buffers, allowing KV cache blocks to be exchanged directly between the participating inference workers. This significantly reduces the overhead associated with transferring large prompt contexts and makes continuous token generation practical even when the inference stages are separated across different Kubernetes nodes.

From an architectural perspective, NIXL is what enables LMCache to operate as a distributed GPU-aware KV cache system rather than simply a software-level cache manager. It provides the low-latency transport path required for real-time inference serving, especially for long-context prompts and highly concurrent workloads where KV cache movement becomes a critical performance factor.

The final configuration should look like this:

values-multi-node-disaggregated.yaml
servingEngineSpec:
  enableEngine: true
  runtimeClassName: ""
  containerPort: 8000

  modelSpec:
    - name: qwen2-5-14b-prefill
      repository: lmcache/vllm-openai
      tag: 2025-05-27-v1
      modelURL: Qwen/Qwen2.5-14B-Instruct-AWQ

      replicaCount: 1

      requestCPU: 4
      requestMemory: "20Gi"
      requestGPU: 1

      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN

      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4

      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"

      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models

      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models

      vllmConfig:
        enablePrefixCaching: true
        maxModelLen: 4096
        v1: 1

        extraArgs:
          - "--served-model-name"
          - "qwen2-5-14b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"

      lmcacheConfig:
        enabled: true
        cudaVisibleDevices: "0"
        kvRole: "kv_producer"
        enableNixl: true
        nixlRole: "sender"
        nixlPeerHost: "vllm-qwen2-5-14b-decode-engine-service"
        nixlPeerPort: "55555"
        nixlBufferSize: "1073741824"
        nixlBufferDevice: "cuda"
        nixlEnableGc: true
        enablePD: true
        cpuOffloadingBufferSize: 0

      labels:
        model: qwen2-5-14b-prefill

    - name: qwen2-5-14b-decode
      repository: lmcache/vllm-openai
      tag: 2025-05-27-v1
      modelURL: Qwen/Qwen2.5-14B-Instruct-AWQ

      replicaCount: 1

      requestCPU: 4
      requestMemory: "20Gi"
      requestGPU: 1

      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN

      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4

      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"

      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models

      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models

      vllmConfig:
        enablePrefixCaching: true
        maxModelLen: 4096
        v1: 1

        extraArgs:
          - "--served-model-name"
          - "qwen2-5-14b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"

      lmcacheConfig:
        enabled: true
        cudaVisibleDevices: "0"
        kvRole: "kv_consumer"
        enableNixl: true
        nixlRole: "receiver"
        nixlPeerHost: "0.0.0.0"
        nixlPeerPort: "55555"
        nixlBufferSize: "1073741824"
        nixlBufferDevice: "cuda"
        nixlEnableGc: true
        enablePD: true

      labels:
        model: qwen2-5-14b-decode

  containerSecurityContext:
    capabilities:
      add:
        - SYS_PTRACE

routerSpec:
  enableRouter: true
  repository: lmcache/lmstack-router
  tag: latest

  replicaCount: 1

  containerPort: 8000
  servicePort: 80

  routingLogic: disaggregated_prefill
  enablePD: true
  engineScrapeInterval: 15
  requestStatsWindow: 60

  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

  labels:
    environment: router
    release: router

  extraArgs:
    - "--prefill-model-labels"
    - "qwen2-5-14b-prefill"
    - "--decode-model-labels"
    - "qwen2-5-14b-decode"
warning

SYS_PTRACE adds the Linux process-tracing capability to the engine containers. It is typically required only for low-level debugging, profiling, or runtime inspection tools such as CUDA/NCCL diagnostics and advanced GPU performance analysis.

Standard vLLM inference deployments generally do not require this capability. In security-conscious Kubernetes environments, keep container capabilities to a minimum and omit SYS_PTRACE unless the workload or a troubleshooting task explicitly needs it.

We can now deploy the vLLM Production Stack with the Helm chart:

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

helm upgrade --install vllm vllm/vllm-stack \
-n vllm --create-namespace \
-f values-multi-node-disaggregated.yaml \
--reset-values

After deployment, we can confirm from the vLLM logs that both the prefill and decode engines initialized successfully with the correct KV roles and NIXL configurations. The router logs should also confirm that the disaggregated prefill routing logic is active and that both engines are registered correctly with their respective model labels and LMCache roles.

Finally, running nvidia-smi in each engine pod shows active GPU memory allocation, confirming that both the prefill and decode workloads are executing on their assigned GPUs, as expected for disaggregated serving.
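
The log and GPU checks described above can be scripted. The following is a minimal sketch, assuming the chart propagates the labels from the values file (model=qwen2-5-14b-prefill and model=qwen2-5-14b-decode) to the engine pods. The function names and the NAMESPACE variable are illustrative and not part of the chart.

```shell
# Namespace used throughout this guide; override by exporting NAMESPACE.
NAMESPACE="${NAMESPACE:-vllm}"

# Print recent log lines from the engine pods that carry the given model label.
# Assumption: engine pods are labeled model=<label from the values file>.
engine_logs() {
  kubectl logs -n "$NAMESPACE" -l "model=$1" --tail=100
}

# Run nvidia-smi inside the first engine pod that carries the given model label.
engine_gpu_status() {
  engine_pod="$(kubectl get pods -n "$NAMESPACE" -l "model=$1" -o name | head -n 1)"
  kubectl exec -n "$NAMESPACE" "$engine_pod" -- nvidia-smi
}
```

For example, engine_logs qwen2-5-14b-prefill prints the prefill engine logs and engine_gpu_status qwen2-5-14b-decode shows GPU usage in the decode pod. If the label selector returns no pods, inspect the actual labels with kubectl get pods -n vllm --show-labels.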

Adding Model(s) in LiteLLM (Optional)

Navigate to LiteLLM Dashboard -> Models + Endpoints -> Add Model and fill in the following values:

  • Provider: vllm
  • LiteLLM Model Name(s): gpt-oss-20b
  • Model Mappings/Public Model Name: gpt-oss-20b
  • Mode: Chat - /chat/completions
  • Existing Credentials: None
  • API Base: http://vllm-router-service.vllm.svc.cluster.local/v1
  • API Key: fill in your VLLM_MASTER_KEY value

After completing the configuration, click Add Model. Repeat the same process for any additional models exposed by your vLLM deployment.

caution

The most important field is Model Mappings/Public Model Name.

This value must match the model name exposed by the vLLM deployment through --served-model-name. LiteLLM uses this mapping to route incoming OpenAI-compatible requests to the correct backend model endpoint.

If the names do not match, requests will fail with model resolution errors even though the vLLM service itself is healthy and reachable.
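
One way to confirm the exposed name before adding it to LiteLLM is to list the models behind the router. This is a minimal sketch, assuming it runs from a pod inside the cluster (the service DNS name does not resolve from outside), that curl is available there, and that VLLM_MASTER_KEY is exported in the shell. The function name and the API_BASE variable are illustrative.

```shell
# API base taken from the LiteLLM configuration above.
API_BASE="${API_BASE:-http://vllm-router-service.vllm.svc.cluster.local/v1}"

# Query the OpenAI-compatible /models endpoint. The "id" fields in the JSON
# response are the names that the Public Model Name must match.
# Assumption: the key is sent as a Bearer token, as OpenAI-compatible clients do.
list_served_models() {
  curl -sS "$API_BASE/models" \
    -H "Authorization: Bearer ${VLLM_MASTER_KEY:-}"
}
```

Calling list_served_models prints the raw JSON response. Compare each returned id with the value passed to --served-model-name.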

Appendix A: Helm Chart Values

Tensor and Pipeline Parallelism

Under the vllmConfig stanza, tensorParallelSize and pipelineParallelSize control how a model is distributed across multiple GPUs. These settings are important when a model is too large for a single GPU or when inference throughput needs to scale beyond the capacity of one accelerator.

Tensor parallelism splits the mathematical operations of individual neural network layers across multiple GPUs. Instead of storing and processing the full tensor operations on one GPU, the computation is divided between several GPUs that work on the same layer simultaneously. This approach is commonly used for very large models because it reduces the memory pressure on individual GPUs while allowing the model to execute in parallel.

With tensorParallelSize: 2, for example, the tensor computations are distributed across two GPUs. The GPUs cooperate closely during inference and exchange intermediate results continuously. Because of this communication overhead, tensor parallelism performs best on systems where GPUs are connected through high-bandwidth interconnects such as NVLink or similarly optimized PCIe topologies.

Pipeline parallelism works differently. Instead of splitting tensor operations inside a layer, it splits the model itself into sequential layer groups called pipeline stages. Each GPU becomes responsible for a different section of the model. During inference, the request passes through these stages in sequence until the output is produced.

With pipelineParallelSize: 2, one GPU may process the earlier transformer layers while another GPU processes the later layers. This approach is useful when the hardware environment does not provide fast GPU-to-GPU communication for efficient tensor parallelism or when model layers need to be distributed more explicitly across available devices.

In practice, tensor parallelism is generally preferred for high-performance multi-GPU inference because it allows GPUs to work concurrently on the same operations. Pipeline parallelism is often used when GPU interconnect performance is limited or when the deployment architecture requires explicit layer partitioning.

These parameters must align with the GPU resources requested from Kubernetes. The effective GPU requirement is determined by the parallelism configuration. For example:

  • tensorParallelSize: 2 requires at least 2 GPUs.
  • pipelineParallelSize: 2 also requires at least 2 GPUs.
  • When both are set, the total GPU requirement is the product of the two values. For example, tensorParallelSize: 2 and pipelineParallelSize: 2 require 4 GPUs in total.

The Kubernetes resource request must therefore provide sufficient GPUs for the configured parallelism model; otherwise, the vLLM workload cannot initialize correctly.
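
The GPU arithmetic above can be expressed as a small helper. This is only a sketch; required_gpus is an illustrative name, not a vLLM or chart feature.

```shell
# Total GPUs needed = tensorParallelSize * pipelineParallelSize.
required_gpus() {
  tensor_parallel_size="$1"
  pipeline_parallel_size="$2"
  echo $(( tensor_parallel_size * pipeline_parallel_size ))
}

required_gpus 2 1   # tensor parallelism only
required_gpus 1 2   # pipeline parallelism only
required_gpus 2 2   # both combined
```

The three calls print 2, 2, and 4, matching the examples in the list.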

Resource Allocation and Scheduling

The requestCPU, requestMemory, and requestGPU values define infrastructure resource allocation at the Kubernetes level, while tensorParallelSize and pipelineParallelSize define how vLLM internally distributes the model execution across the allocated GPUs.

This distinction is important because Kubernetes itself does not understand how the model is executed. Kubernetes only schedules containers onto nodes based on requested resources.

requestCPU: 4
requestMemory: "16Gi"
requestGPU: 1

These parameters instruct Kubernetes to reserve 4 CPU cores, 16 GiB of system memory, and 1 GPU for the serving pod. They affect pod scheduling, node placement, and infrastructure capacity planning, but they do not control how vLLM uses the GPU internally.

The vLLM parallelism settings operate at the inference engine layer:

tensorParallelSize: 1
pipelineParallelSize: 2

These parameters determine how the model itself is partitioned and executed across GPUs during inference. In practice, the Kubernetes resource requests define the available hardware resources, while the vLLM parallelism settings define how the inference engine consumes and coordinates those resources.

⚠️ Both layers must remain consistent with each other. For example:

  • requestGPU: 1 means Kubernetes allocates one GPU to the pod.
  • tensorParallelSize: 2 tells vLLM to distribute tensor operations across two GPUs.

‼️ This creates a mismatch because vLLM expects access to two GPUs while Kubernetes provides only one. The workload would therefore fail during initialization or scheduling.

The same applies to pipeline parallelism. If pipelineParallelSize: 2 is configured, the deployment must provide at least two GPUs because the model is split into two execution stages.
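
A pre-flight check along these lines can catch such a mismatch before running helm upgrade. This is a minimal sketch that assumes the three values are passed in by hand; it does not parse the values file, and check_gpu_request is a hypothetical helper name.

```shell
# Usage: check_gpu_request <requestGPU> <tensorParallelSize> <pipelineParallelSize>
# Succeeds when requestGPU covers tensorParallelSize * pipelineParallelSize;
# otherwise prints the shortfall to stderr and returns a non-zero status.
check_gpu_request() {
  request_gpu="$1"
  needed=$(( $2 * $3 ))
  if [ "$request_gpu" -lt "$needed" ]; then
    echo "requestGPU=$request_gpu but parallelism needs $needed GPUs" >&2
    return 1
  fi
}

check_gpu_request 2 1 2 && echo "consistent"
check_gpu_request 1 2 1 || echo "mismatch"
```

The first call passes because two GPUs cover a two-stage pipeline. The second reproduces the requestGPU: 1 with tensorParallelSize: 2 case described above and reports a mismatch.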

Appendix B: References