Deploy vLLM Production Stack on CCE
vLLM is an open-source inference and serving engine for large language models. It is designed to improve serving throughput and GPU memory efficiency, mainly through PagedAttention, continuous batching, prefix caching, and an OpenAI-compatible serving interface.
The vLLM Production Stack is a separate upstream project from the vLLM ecosystem. It provides a Kubernetes-native, cluster-wide reference implementation for deploying inference services on top of vLLM. Its purpose is to help users move from a single vLLM instance to a distributed deployment without changing application code. It also includes deployment options such as Helm chart deployment, CRD-based deployment, and Gateway API integration, together with documented capabilities such as metrics dashboards, request routing, KV cache-aware routing, prefix-aware routing, distributed tracing, KEDA-based autoscaling, and other production-oriented patterns.
Prerequisites
The prerequisites for this guide are identical to those described in the blueprint Build a Unified LLM Gateway with LiteLLM on CCE. Before proceeding, ensure that all required services, infrastructure components, access permissions, and configuration prerequisites outlined in the blueprint have been completed.
Defining and Applying Configuration
Before deploying or configuring anything, make sure the required namespace exists. Create it with the following command:
kubectl create namespace vllm
Creating the Secret
Before deploying the vLLM Production Stack, create a Kubernetes Secret manifest, vllm-prodstack-secrets.yaml, that provides the required runtime configuration and credentials:
apiVersion: v1
kind: Secret
metadata:
  name: vllm-prodstack-secrets
type: Opaque
stringData:
  VLLM_MASTER_KEY: <VLLM_MASTER_KEY>
  HF_TOKEN: <HF_TOKEN>
Each key in this secret serves a specific purpose:
- VLLM_MASTER_KEY: used in this deployment as a shared secret for authenticating communication with and between vLLM Production Stack components. The value is stored as a Kubernetes Secret and injected into the workloads as an environment variable. The Helm values reference it through vllmApiKey, and clients present it as a Bearer token when calling the OpenAI-compatible API. All participating services must use the same key value to establish trusted communication. For production environments, the key should be generated as a strong random secret and managed securely through Kubernetes Secrets or external secret management solutions.
- HF_TOKEN: used to authenticate against Hugging Face services. The deployment may require access to Hugging Face-hosted model repositories, tokenizer assets, or embedding models during model initialization and runtime operations. The token is primarily required when using gated or private repositories, while publicly accessible models can generally be downloaded without authentication.
In this blueprint, the Secret keeps sensitive runtime credentials out of the Helm values file. This makes the configuration easier to maintain and reduces the risk of exposing API keys or third-party access tokens. Apply the Secret to the vllm namespace:
kubectl apply -f vllm-prodstack-secrets.yaml -n vllm
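The guidance above asks for a strong random value for VLLM_MASTER_KEY. A minimal sketch of one way to generate such a value, assuming a POSIX shell with coreutils; the 32-byte length is an illustrative choice, not a requirement stated by the vLLM Production Stack:

```shell
# Read 32 random bytes from the kernel CSPRNG and hex-encode them (64 characters).
# `openssl rand -hex 32` is an equivalent alternative where OpenSSL is installed.
VLLM_MASTER_KEY="$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"

# Print only the length, never the key itself.
echo "generated key length: ${#VLLM_MASTER_KEY}"
```

The generated value can then replace the <VLLM_MASTER_KEY> placeholder in vllm-prodstack-secrets.yaml, or be stored in an external secret manager.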
Creating the PersistentVolumeClaim
vLLM persists model artifacts locally once they are pulled. In a single-replica setup, a node-local or ReadWriteOnce volume is sufficient. However, when scaling to multiple replicas, each pod would otherwise maintain its own copy of the models. This leads to redundant storage consumption, increased network usage during model downloads, and longer initialization times.
To address this, a shared storage backend is required. SFS Turbo provides a managed, POSIX-compliant file system with ReadWriteMany semantics, allowing multiple pods across nodes to access the same data concurrently.
By mounting SFS Turbo as the persistent volume for vLLM, all replicas can reuse a single set of model files. This reduces duplication and keeps the model files consistent across replicas. It also simplifies scaling, as new pods can immediately access models that are already downloaded instead of pulling them again.
Create a PersistentVolumeClaim manifest, pvc-vllm-models.yaml, that mounts the SFS Turbo file system created earlier:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-vllm-models
  labels:
    model: "vllm-models"
  annotations:
    everest.io/volume-as: absolute-path
    everest.io/sfsturbo-share-id: <SFSTURBO_SHARE_ID>
    everest.io/path: /vllm-models
    everest.io/reclaim-policy: retain-volume-only
    everest.io/csi.enable-sfsturbo-dir-quota: "true"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: csi-sfsturbo
kubectl apply -f pvc-vllm-models.yaml -n vllm
Replace the value of everest.io/sfsturbo-share-id with the ID of the SFS Turbo file system in your environment before applying the manifest.
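Before moving on, it can help to confirm that the claim was actually bound to the SFS Turbo share. A minimal sketch, assuming kubectl v1.23 or later (needed for JSONPath conditions in kubectl wait); the helper name and the 120-second timeout are illustrative:

```shell
# wait_for_pvc NAME [NAMESPACE]: block until the claim reports phase "Bound".
# Helper name and timeout are hypothetical choices for this sketch.
wait_for_pvc() {
  kubectl wait --for=jsonpath='{.status.phase}'=Bound \
    "pvc/$1" -n "${2:-vllm}" --timeout=120s
}

# Usage (requires access to the cluster):
#   wait_for_pvc pvc-vllm-models
```

If the claim stays in Pending, kubectl describe pvc pvc-vllm-models -n vllm usually shows the reason in its events.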
Installing KubeRay Operator (Optional)
KubeRay is the Kubernetes integration project for Ray. It allows Ray clusters and Ray applications to run natively on Kubernetes by introducing Kubernetes-aware management and automation capabilities for Ray workloads.
Ray itself is a distributed computing framework used primarily for AI and machine learning workloads. It provides mechanisms for distributing Python-based execution across multiple processes, GPUs, and nodes. Ray is commonly used for distributed model training, batch processing, distributed inference, and scalable serving applications. More information about the framework can be found on its official site.
The KubeRay Operator is the Kubernetes Operator component of the KubeRay project. It extends Kubernetes through Custom Resource Definitions (CRDs) and controllers that automate the deployment and lifecycle management of Ray clusters. Instead of manually creating and coordinating Ray head nodes, worker nodes, services, and scaling logic, the operator manages these resources declaratively through Kubernetes manifests.
KubeRay is necessary only for distributed vLLM serving patterns that require multi-node orchestration, such as pipeline parallelism or distributed prefill/decode deployments. For single-node deployments, including single-GPU serving, KubeRay is not required and the vLLM Production Stack can operate with standard Kubernetes resources alone.
In this blueprint, the KubeRay Operator is installed to provide the Kubernetes control layer required for distributed Ray-based inference workloads. The vLLM Production Stack uses RayCluster and RayService resources to deploy, scale, and manage distributed vLLM serving instances across the Kubernetes cluster. This step prepares the CCE environment with the Kubernetes extensions and controllers required for distributed inference orchestration.
When to use KubeRay?
| Use Case | KubeRay Required | Advantages | Trade-Offs | Typical Prerequisites |
|---|---|---|---|---|
| Serving very large language models across GPUs on multiple nodes | ✅ | Enables tensor parallelism and distributed inference execution across nodes and GPUs | Increased operational complexity and additional Kubernetes components | Multi-GPU worker nodes, high-speed node networking, distributed storage considerations |
| High-throughput production inference APIs | ✅ | Supports horizontal scaling, workload distribution, and centralized orchestration | More complex deployment lifecycle and monitoring requirements | Kubernetes autoscaling strategy, GPU capacity planning, ingress and observability stack |
| Shared enterprise AI inference platform | ✅ | Centralized management, automated recovery, scalable multi-workload orchestration | Higher infrastructure footprint and operational overhead | Multi-node Kubernetes cluster, GPU scheduling strategy, cluster observability |
| Dynamic scaling of inference workloads | ✅ | Enables automated scaling and distributed worker management | Additional dependency on Ray and KubeRay control components | KEDA or autoscaling integration, sufficient spare cluster resources |
| Single-node vLLM deployment | 🚫 | Simpler deployment and reduced operational overhead | Limited scalability and no distributed execution | Single GPU-enabled Kubernetes worker node |
| Development and proof-of-concept environments | 🚫 | Faster setup and easier troubleshooting | Not suitable for large-scale or distributed inference workloads | Minimal Kubernetes cluster with GPU support |
| Low-throughput internal inference services | 🚫 | Lightweight operational model with fewer moving components | Limited high-availability and scaling capabilities | One or few GPU worker nodes |
| Standalone model serving without distributed execution | 🚫 | Easier maintenance and simpler Kubernetes manifests | No multi-node coordination or distributed scheduling | Standard Kubernetes deployment with vLLM container image |
Deploying the Operator
We are going to deploy the KubeRay Operator using the official Helm chart:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator \
-n kuberay-system \
--create-namespace
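After the Helm release is installed, the operator pod and the Ray CRDs should be present in the cluster. A minimal sketch of that check, assuming the kuberay-system namespace used above; the helper name is hypothetical:

```shell
# check_kuberay: show the operator pod and the CRDs registered by KubeRay.
check_kuberay() {
  # The operator pod should reach the Running state.
  kubectl get pods -n kuberay-system
  # Custom resources that the operator reconciles.
  kubectl get crd rayclusters.ray.io rayservices.ray.io rayjobs.ray.io
}

# Usage (requires access to the cluster):
#   check_kuberay
```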
Scenarios
| | Scenario | Use Case | Prerequisites | Limitations |
|---|---|---|---|---|
| 🥉 | Single GPU, standalone serving | Baseline inference deployment where the model fits entirely into one GPU | One GPU-enabled CCE worker node, vLLM Production Stack, optional Hugging Face token | Limited by the VRAM and throughput of a single GPU |
| 🥈 | Single-node tensor parallelism | Serving a large model on a multi-GPU node by splitting tensor operations across GPUs | One Kubernetes worker node with multiple GPUs, fast GPU interconnect preferred | Performance depends heavily on GPU topology and interconnect bandwidth |
| 🥇 | Multi-node distributed serving | Serving a model across GPUs located on different Kubernetes worker nodes | Ray/KubeRay, distributed networking between nodes, compatible vLLM distributed configuration | Cross-node GPU communication introduces network overhead |
| 🥇 | Disaggregated prefill/decode serving | Separating prompt ingestion and token generation workloads for higher serving efficiency | Multiple coordinated vLLM instances, KV cache transfer support, routing layer | Operationally more complex than standard distributed inference |
Single-Node, Standalone Serving
We will need 1 node with specs: GPU-accelerated | pi5e.2xlarge.4 | 8 vCPUs | 32 GiB | GPU: nvidia-l4 x 1
In this scenario, we deploy the openai/gpt-oss-20b model as a standalone inference service on CCE using a single NVIDIA L4 GPU. The objective is to establish a minimal production-ready deployment pattern that exposes the model through the OpenAI-compatible vLLM API without introducing distributed execution or multi-node orchestration complexity.
openai/gpt-oss-20b is an open-weight large Mixture of Experts (MoE) model from OpenAI designed for general-purpose text generation, reasoning, code generation, and agentic workloads. With approximately 20 billion parameters (3.6B active), the model provides substantially higher reasoning and language generation capability than smaller instruction-tuned models while still remaining deployable on modern enterprise GPU hardware with careful memory utilization tuning and quantization-aware serving strategies.
Within this deployment, vLLM acts as the inference runtime responsible for loading the model, optimizing GPU memory usage, managing request batching, and exposing the serving endpoint through an OpenAI-compatible API interface. This allows existing AI applications, SDKs, and automation frameworks to consume the model using familiar OpenAI API semantics without requiring model-specific integrations.
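The gpt-oss-20b model can operate on a single NVIDIA L4 GPU with 24 GB VRAM, as the model requires approximately 18.6 GB of GPU memory in this configuration. The figure of roughly 35 tokens per second on NVIDIA L4 hardware applies to a Q4_K_M-quantized build. Q4_K_M is a GGUF quantization format, whereas this deployment loads the checkpoint as published on Hugging Face, so treat the figure as indicative rather than as a benchmark of this configuration.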
For a more detailed report check here.
Persistent model caching is enabled through a mounted Kubernetes persistent volume so that downloaded Hugging Face artifacts survive pod restarts and redeployments. This significantly reduces model initialization times and avoids repeated downloads during operational lifecycle events.
servingEngineSpec:
  vllmApiKey:
    secretName: vllm-prodstack-secrets
    secretKey: VLLM_MASTER_KEY
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  modelSpec:
    - name: gpt-oss-20b
      repository: vllm/vllm-openai
      tag: latest
      modelURL: openai/gpt-oss-20b
      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN
      replicaCount: 1
      requestCPU: 4
      requestMemory: "16Gi"
      requestGPU: 1
      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models
      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4
      vllmConfig:
        extraArgs:
          - "--served-model-name"
          - "gpt-oss-20b"
          - "--gpu-memory-utilization"
          - "0.90"
          - "--max-model-len"
          - "8192"
          - "--download-dir"
          - "/models/vllm-downloads"
requestCPU & requestMemory define the host resources reserved for the vLLM serving pod. The GPU remains the primary execution resource for model inference, while CPU and system memory support request processing, tokenization, model initialization, cache handling, and the OpenAI-compatible API process. For this single-GPU baseline, the deployment reserves 4 vCPUs and 16 GiB of memory per serving replica.
requestCPU: 4
requestMemory: "16Gi"
requestGPU: 1
We can now deploy the vLLM Production Stack with the Helm chart:
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm upgrade --install vllm vllm/vllm-stack \
-n vllm --create-namespace \
-f vllm-values-single-node.yaml \
--reset-values
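Once the release is installed, a quick smoke test confirms that the serving pod is running and that the API answers. A minimal sketch, assuming the chart exposes a Service named vllm-router-service on port 80 for the release name vllm; the Service name, the local port 30080, and the helper name are assumptions, so check kubectl get svc -n vllm for the actual values:

```shell
# smoke_test API_KEY: list the models served through the router.
# Service name and ports are assumptions for this sketch.
smoke_test() {
  kubectl get pods -n vllm
  # Forward the router Service to a local port in the background.
  kubectl port-forward -n vllm svc/vllm-router-service 30080:80 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 3   # give the port-forward a moment to establish
  curl -s http://127.0.0.1:30080/v1/models \
    -H "Authorization: Bearer $1"
  kill "$pf_pid"
}

# Usage (requires access to the cluster):
#   smoke_test "<VLLM_MASTER_KEY>"
```

The response should list gpt-oss-20b, the name set through --served-model-name. Loading the model can take several minutes on first start, so the pod may need time to become ready.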
Single-Node, Tensor Parallelism
This scenario is covered in two variants:
- Running on compact NVIDIA GPUs (T4/L4)
- Running on legacy NVIDIA Volta Series (V100)
Running on compact NVIDIA GPUs (T4/L4)
We will need 1 node with specs: GPU-accelerated | pi2.4xlarge.4 | 16 vCPUs | 64 GiB | GPU: nvidia-t4 x 2
This scenario demonstrates how vLLM can use tensor parallelism to deploy larger instruction-tuned models on compact GPU configurations commonly used for edge inference, development environments, or cost-sensitive AI platforms. The deployment uses a CCE worker node equipped with two NVIDIA Tesla T4 GPUs and serves Qwen/Qwen2.5-32B-Instruct-AWQ, a 32-billion-parameter model provided in AWQ INT4 format.
The AWQ INT4 checkpoint requires approximately 18-22 GB of effective GPU memory during inference, before additional memory is reserved for KV cache and runtime overhead. This exceeds the capacity of a single Tesla T4, which has 16 GB of memory (about 15 GiB usable). By enabling tensor parallelism, vLLM distributes the model weights across both local GPUs and exposes the deployment through a single OpenAI-compatible inference endpoint.
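The sizing argument can be checked with back-of-the-envelope arithmetic. The sketch below is a rough lower bound only: it counts quantized weights at 4 bits each and ignores quantization scales, unquantized layers, KV cache, and runtime buffers. The helper names are hypothetical, and the nominal 16 GB per T4 is an assumption (about 15 GiB is usable in practice):

```shell
# estimate_weights_gb PARAMS_BILLION BITS_PER_WEIGHT -> weight size in GB
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# usable_budget_gb GPU_MEMORY_GB GPU_COUNT UTILIZATION -> memory vLLM may claim
usable_budget_gb() {
  awk -v m="$1" -v n="$2" -v u="$3" 'BEGIN { printf "%.1f\n", m * n * u }'
}

estimate_weights_gb 32 4      # 16.0  (32B parameters at INT4)
usable_budget_gb 16 1 0.80    # 12.8  (one T4 at --gpu-memory-utilization 0.80)
usable_budget_gb 16 2 0.80    # 25.6  (two T4s with tensor parallelism)
```

Even the lower bound of 16 GB does not fit into the 12.8 GB budget of one T4, while two GPUs leave some room for KV cache on top of the weights.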
Qwen/Qwen2.5-32B-Instruct-AWQ is a 32-billion-parameter instruction-tuned language model from Alibaba Cloud optimized for conversational AI, reasoning, code generation, and multilingual inference workloads. The AWQ INT4 quantized variant significantly reduces GPU memory consumption while maintaining strong inference performance and instruction-following quality. This allows the model to be deployed efficiently on smaller accelerator classes such as dual NVIDIA Tesla T4 GPUs using tensor parallelism through vLLM.
The T4 is a workable target for this type of deployment because its Turing architecture (compute capability 7.5) is supported by the AWQ kernels in vLLM. The model can therefore run in its native AWQ INT4 format with a much smaller memory footprint than a full-precision deployment, which makes a dual-T4 configuration a realistic fit for this topology.
Since both GPUs are attached to the same Kubernetes worker node, no Ray or KubeRay components are required for this scenario.
The final configuration should look like this:
servingEngineSpec:
  vllmApiKey:
    secretName: vllm-prodstack-secrets
    secretKey: VLLM_MASTER_KEY
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  modelSpec:
    - name: qwen2-5-32b-instruct-awq
      repository: vllm/vllm-openai
      tag: v0.8.5
      modelURL: Qwen/Qwen2.5-32B-Instruct-AWQ
      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN
      replicaCount: 1
      requestCPU: 8
      requestMemory: "48Gi"
      requestGPU: 2
      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"
      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models
      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-t4
      vllmConfig:
        tensorParallelSize: 2
        pipelineParallelSize: 1
        extraArgs:
          - "--served-model-name"
          - "qwen2-5-32b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.80"
          - "--max-model-len"
          - "2048"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"
      shmSize: "32Gi"
We can now deploy the vLLM Production Stack with the Helm chart:
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm upgrade --install vllm vllm/vllm-stack \
-n vllm --create-namespace \
-f vllm-values-single-node-tp-t4.yaml \
--reset-values
After deployment, we can confirm from the vLLM logs that tensor parallelism initialized successfully with two tensor-parallel ranks, and via nvidia-smi we can see that model memory is allocated across both T4 devices while serving inference requests through the OpenAI-compatible API endpoint. Open a shell on the pod that hosts the model and execute the following command, after replacing the value of VLLM_MASTER_KEY:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <VLLM_MASTER_KEY>" \
-d '{
"model": "qwen2-5-32b-instruct-awq",
"messages": [
{
"role": "user",
"content": "Explain tensor parallelism in one paragraph."
}
]
}'

Running on legacy NVIDIA Volta Series (V100)
We will need 1 node with specs: GPU-accelerated | p2s.4xlarge.8 | 16 vCPUs | 128 GiB | GPU: nvidia-v100-pcie-32gb x 2
This scenario demonstrates how vLLM can serve a large language model that does not fit into a single GPU by using tensor parallelism across multiple GPUs on the same CCE worker node. The deployment uses a node equipped with 2x NVIDIA Tesla V100 32 GB accelerators and serves meta-llama/Llama-3.1-70B-Instruct from a GPTQ INT4 quantized checkpoint.
hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 is a GPTQ INT4 quantized variant of Meta's Llama 3.1 70B Instruct model designed for high-quality conversational AI, reasoning, summarization, and advanced instruction-following workloads. With approximately 70 billion parameters, the original FP16 model exceeds the memory capacity of most standard enterprise GPU deployments. The GPTQ INT4 quantized checkpoint substantially reduces the memory footprint while preserving most of the model's inference quality, making it suitable for tensor-parallel deployments on multi-GPU Kubernetes nodes using vLLM.
The selected GPTQ INT4 model still requires approximately 35-40 GB of effective GPU memory during inference once model weights, runtime overhead, and KV cache allocation are taken into account. While this is significantly smaller than the original FP16 variant of Llama 3.1 70B, it still exceeds the practical capacity of a single V100 32 GB GPU. To make the deployment possible on this hardware class, the model is loaded using GPTQ quantization and distributed across both local GPUs through:
tensorParallelSize: 2
With this configuration, vLLM splits the model tensors between the two local GPUs while exposing a single inference endpoint to the client application.
Several configuration adjustments were required to make the deployment compatible with V100 GPUs:
- The AWQ quantized model variant cannot be used and is replaced with a GPTQ INT4 checkpoint, because AWQ kernels require newer GPU architectures than the Volta-based V100.
- The blueprint uses vllm/vllm-openai:v0.8.5, which proved more stable on Volta-class hardware than newer runtime combinations.
- To reduce GPU memory pressure during model initialization and KV cache allocation, the GPU memory utilization threshold is lowered and the maximum context length is limited: --gpu-memory-utilization 0.75 --max-model-len 2048
- The Hugging Face Xet transfer backend is disabled to avoid interrupted large-model downloads in environments where long-running Xet/CAS connections may be unstable or affected by enterprise network controls: HF_HUB_DISABLE_XET=1
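The choice between AWQ and GPTQ follows from the GPU's compute capability. A minimal sketch of that decision, assuming AWQ kernels need compute capability 7.5 or newer while GPTQ also runs on Volta (7.0); the threshold and the helper name are assumptions of this sketch and should be checked against the vLLM version in use:

```shell
# pick_quantization COMPUTE_CAP: suggest a quantization format for a GPU.
# Assumption: AWQ requires compute capability >= 7.5, GPTQ works below that.
pick_quantization() {
  awk -v cc="$1" 'BEGIN { if (cc + 0 >= 7.5) print "awq"; else print "gptq" }'
}

pick_quantization 7.0   # V100 (Volta)  -> gptq
pick_quantization 7.5   # T4 (Turing)   -> awq
pick_quantization 8.9   # L4 (Ada)      -> awq
```

On recent NVIDIA drivers, the compute capability of the installed GPUs can be read with nvidia-smi --query-gpu=name,compute_cap --format=csv.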
Since both GPUs are attached to the same Kubernetes worker node, no Ray or KubeRay components are required for this scenario.
The final configuration should look like this:
servingEngineSpec:
  vllmApiKey:
    secretName: vllm-prodstack-secrets
    secretKey: VLLM_MASTER_KEY
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  modelSpec:
    - name: llama3-1-70b-instruct-gptq-int4
      repository: vllm/vllm-openai
      tag: v0.8.5
      modelURL: hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4
      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN
      replicaCount: 1
      requestCPU: 8
      requestMemory: "96Gi"
      requestGPU: 2
      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"
      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models
      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-v100-pcie-32gb
      vllmConfig:
        tensorParallelSize: 2
        pipelineParallelSize: 1
        extraArgs:
          - "--served-model-name"
          - "llama3-1-70b-instruct-gptq-int4"
          - "--gpu-memory-utilization"
          - "0.75"
          - "--max-model-len"
          - "2048"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "gptq"
      shmSize: "32Gi"
We can now deploy the vLLM Production Stack with the Helm chart:
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm upgrade --install vllm vllm/vllm-stack \
-n vllm --create-namespace \
-f vllm-values-single-node-tp-v100.yaml \
--reset-values
After deployment, we can confirm from the vLLM logs that tensor parallelism initialized successfully with two tensor-parallel ranks, and via nvidia-smi we can see that model memory is allocated across both V100 devices while serving inference requests through the OpenAI-compatible API endpoint. Open a shell on the pod that hosts the model and execute the following command, after replacing the value of VLLM_MASTER_KEY:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <VLLM_MASTER_KEY>" \
-d '{
"model": "llama3-1-70b-instruct-gptq-int4",
"messages": [
{
"role": "user",
"content": "Explain tensor parallelism in one paragraph."
}
]
}'
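The nvidia-smi check mentioned above can also be run from outside the pod. A minimal sketch, assuming the serving pod runs in the vllm namespace; the helper name is hypothetical and the pod name is a placeholder to look up with kubectl get pods -n vllm:

```shell
# gpu_memory POD: print per-GPU memory usage as CSV from inside a serving pod.
gpu_memory() {
  kubectl exec -n vllm "$1" -- nvidia-smi \
    --query-gpu=index,name,memory.used,memory.total --format=csv
}

# Usage (requires access to the cluster):
#   gpu_memory <serving-pod-name>
```

With tensor parallelism active, both GPU rows should report a similar amount of used memory.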

Multi-Node, Distributed Serving
We will need 2 nodes with specs: GPU-accelerated | pi5e.4xlarge.4 | 16 vCPUs | 64 GiB | GPU: nvidia-l4 x 1
This scenario demonstrates how large language models can be served across multiple Kubernetes worker nodes when a single GPU is no longer sufficient to host the model independently. Instead of relying on one large multi-GPU server, the deployment combines multiple smaller GPU nodes into a distributed inference topology using Ray, KubeRay, and vLLM. This approach is particularly relevant for enterprise environments where GPU resources are fragmented across several worker nodes or where high-memory accelerator platforms are either unavailable or prohibitively expensive.
The Qwen2.5-32B-Instruct-AWQ model is distributed across two single-GPU NVIDIA L4 worker nodes through pipeline parallelism. Each node contributes one GPU to the inference pipeline, while Ray orchestrates the distributed execution and communication between the participating workers. From the client perspective, however, the deployment still behaves as a single OpenAI-compatible inference endpoint. The complexity of model distribution, stage coordination, and GPU communication remains fully abstracted behind the vLLM serving layer.
This architecture addresses one of the most common operational challenges in enterprise AI adoption: deploying models that exceed the practical memory capacity of individual accelerators without requiring specialized high-end GPU servers. By distributing the model across multiple commodity GPU nodes, organizations can increase the usable capacity of their existing Kubernetes GPU infrastructure while maintaining centralized API access, Kubernetes-native orchestration, and scalable distributed inference management through KubeRay.
The selected AWQ INT4 model requires approximately 18-22 GB of effective GPU memory during inference once model weights, runtime overhead, and KV cache allocation are taken into account. Although this is significantly smaller than a full-precision 32B model, it leaves little practical headroom on a single NVIDIA L4 with 24 GB VRAM. For this scenario, the model is therefore distributed across two single-GPU L4 nodes with Ray and pipeline parallelism, allowing vLLM to serve the model through one endpoint while spreading the memory load across both GPUs.
The final configuration should look like this:
servingEngineSpec:
  vllmApiKey:
    secretName: vllm-prodstack-secrets
    secretKey: VLLM_MASTER_KEY
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  modelSpec:
    - name: qwen2-5-32b-instruct-awq
      repository: vllm/vllm-openai
      tag: v0.8.5
      modelURL: Qwen/Qwen2.5-32B-Instruct-AWQ
      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN
      replicaCount: 1
      requestCPU: 4
      requestMemory: "20Gi"
      requestGPU: 1
      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"
        - name: VLLM_ATTENTION_BACKEND
          value: "FLASHINFER"
      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models
      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4
      vllmConfig:
        tensorParallelSize: 1
        pipelineParallelSize: 2
        dtype: "float16"
        extraArgs:
          - "--served-model-name"
          - "qwen2-5-32b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.90"
          - "--max-model-len"
          - "2048"
          - "--kv-cache-dtype"
          - "fp8"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"
      shmSize: "12Gi"
      raySpec:
        enabled: true
        headNode:
          requestCPU: 4
          requestMemory: "20Gi"
          requestGPU: 1
          nodeSelectorTerms:
            - matchExpressions:
                - key: accelerator
                  operator: In
                  values:
                    - nvidia-l4
We can now deploy the vLLM Production Stack with the Helm chart:
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm upgrade --install vllm vllm/vllm-stack \
-n vllm --create-namespace \
-f vllm-values-multi-node-distributed.yaml \
--reset-values
After successful deployment, use the Ray head pod to validate the distributed cluster state. The head pod is the cluster coordinator and exposes the complete Ray runtime view.
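A minimal sketch of running that check from a workstation, assuming the Ray pods run in the vllm namespace; the helper name is hypothetical and the head pod name is a placeholder to look up with kubectl get pods -n vllm -o wide:

```shell
# ray_status HEAD_POD: print the Ray cluster summary from the head pod.
ray_status() {
  kubectl exec -n vllm "$1" -- ray status
}

# Usage (requires access to the cluster):
#   ray_status <ray-head-pod-name>
```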

The ray status command validates the distributed execution layer itself. It confirms that the Ray cluster was formed successfully, that both Ray nodes joined the cluster correctly, and that the distributed resources required by vLLM are available. In this scenario, the output should report two active nodes and two available GPUs, proving that the pipeline-parallel deployment spans both Kubernetes worker nodes rather than running entirely on a single host.
To validate GPU utilization directly, open shells on both the Ray head and Ray worker pods and execute:
nvidia-smi
Both nvidia-smi outputs should show active CUDA processes and allocated GPU memory from the vLLM worker runtime. While ray status confirms the logical distributed cluster, nvidia-smi confirms that the vLLM worker processes are actively consuming GPU memory on both accelerators. This matters in multi-node inference scenarios because a successful API response alone does not guarantee that the model was actually distributed across both GPUs. Active memory allocation and running CUDA processes on both nodes confirm that the model weights and pipeline stages were placed across both Kubernetes worker nodes rather than on a single GPU.
Disaggregated Prefill/Decode Serving
We will need 2 nodes with specs: GPU-accelerated | pi5e.4xlarge.4 | 16 vCPUs | 64 GiB | GPU: nvidia-l4 x 1
This scenario addresses a different problem than the previous deployment patterns. The objective is no longer simply making a model fit into GPU memory, but improving inference efficiency and reducing latency under high request concurrency.
Large language model inference consists of two fundamentally different phases. During the prefill phase, the model processes the entire input prompt and builds the KV cache. This stage is highly compute-intensive and benefits from large parallel processing capacity. During the decode phase, the model generates tokens iteratively using the previously generated KV cache. Decode workloads are typically more latency-sensitive and often become bottlenecked by memory bandwidth and cache access patterns rather than raw compute throughput.
In a traditional deployment, both phases execute on the same GPU resources. This means prompt ingestion and token generation compete for the same accelerator capacity. Under heavy concurrent load, long prompts can therefore negatively impact token generation latency for active inference sessions.
Disaggregated prefill/decode serving separates these workloads into dedicated execution paths. One group of vLLM instances is optimized for prompt ingestion and KV cache generation, while another group is optimized for token decoding. KV cache data is transferred between both stages through a coordinated serving layer. This allows the infrastructure to scale and tune both workloads independently depending on traffic patterns and application behavior.
This architecture becomes particularly valuable for enterprise AI platforms serving many simultaneous users, agentic workflows, retrieval-augmented generation pipelines, or applications with large prompt contexts. By separating prompt processing from token generation, organizations can improve GPU utilization efficiency, reduce tail latency, and optimize infrastructure allocation for different inference characteristics rather than treating all inference workloads identically.
For the prefill/decode scenario, the deployment is no longer built around one vLLM engine. It defines two separate serving engines for the same model: one engine handles the prefill phase, and the other handles the decode phase. To keep the blueprint compact, both engines use the same Qwen/Qwen2.5-14B-Instruct-AWQ checkpoint and expose the same served model name, but they are assigned different roles in the inference flow.
The prefill engine is configured as the KV cache producer. It processes the incoming prompt, builds the initial KV cache, and sends that cache to the decode engine. This is why its LMCache configuration uses kvRole: kv_producer, nixlRole: sender, and points to the decode engine service through nixlPeerHost.
The decode engine is configured as the KV cache consumer. It receives the transferred cache and continues token generation, so its LMCache configuration uses kvRole: kv_consumer, nixlRole: receiver, and listens on 0.0.0.0:55555.
LMCache is the distributed KV cache layer used by vLLM to transfer and manage inference state between the prefill and decode engines. In a traditional deployment, the KV cache remains local to the same GPU that processed the prompt. In a disaggregated architecture, however, the prompt is processed by one engine while token generation continues on another engine running on a different GPU or node. The KV cache therefore needs to be transferred between both stages efficiently and with minimal overhead.
In this scenario, LMCache acts as the coordination and transport layer for that cache exchange. The prefill engine generates the KV cache and publishes it through the LMCache producer configuration, while the decode engine receives the cache through the LMCache consumer configuration. This allows the decode engine to continue token generation without reprocessing the original prompt from scratch.
The deployment uses LMCache together with NIXL-enabled GPU transfer buffers. This enables the KV cache to be exchanged directly through GPU-accessible memory instead of repeatedly serializing and rebuilding inference state between requests. From an architectural perspective, LMCache is the component that makes disaggregated serving practical, because it decouples prompt processing from token generation while preserving the inference context required for continuous generation.
Without a KV cache transfer layer such as LMCache, both phases would have to run on the same vLLM engine, which would forfeit the operational benefits of separating prefill and decode workloads across independent GPU resources.
Both engines enable prefix caching and LMCache because cache transfer is the core mechanism that makes disaggregated prefill/decode serving work. NIXL is enabled as the transport layer for moving KV cache data between the producer and consumer, and a CUDA-backed transfer buffer is configured so the cache exchange can use GPU memory.
The router is also configured differently from the previous scenarios. Instead of simple request distribution, it uses disaggregated_prefill routing logic.
This allows the router to distinguish between the prefill and decode backends through their model labels and send each request through the correct two-stage
inference path.
NIXL is the high-performance transport layer used by LMCache to move KV cache data between the prefill and decode engines. In a disaggregated serving architecture, the KV cache generated during prompt ingestion must be transferred efficiently from one GPU-backed vLLM instance to another. NIXL provides the communication mechanism that enables this exchange without forcing the cache data to be repeatedly serialized through slower CPU-bound workflows.
In this deployment, the prefill engine acts as the NIXL sender while the decode engine acts as the NIXL receiver. Both engines allocate dedicated GPU-backed transfer buffers, allowing KV cache blocks to be exchanged directly between the participating inference workers. This significantly reduces the overhead associated with transferring large prompt contexts and makes continuous token generation practical even when the inference stages are separated across different Kubernetes nodes.
From an architectural perspective, NIXL is what enables LMCache to operate as a distributed GPU-aware KV cache system rather than simply a software-level cache manager. It provides the low-latency transport path required for real-time inference serving, especially for long-context prompts and highly concurrent workloads where KV cache movement becomes a critical performance factor.
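To put the transfer buffer into perspective, the sketch below converts the nixlBufferSize value used in this scenario into GiB and compares it with the GPU memory left outside vLLM's own reservation. Two assumptions are made and should be verified for your environment: the L4 is treated as a nominal 24 GiB card, and the NIXL buffer is assumed to be allocated outside the fraction reserved through --gpu-memory-utilization. The helper name gpu_headroom_gib is hypothetical.

```python
GIB = 1024 ** 3


def gpu_headroom_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """Memory (GiB) left outside the fraction vLLM reserves for itself."""
    return total_gib * (1.0 - gpu_memory_utilization)


# nixlBufferSize from the values file, expressed in bytes.
nixl_buffer_gib = 1073741824 / GIB

# Assumption: nominal 24 GiB L4 and --gpu-memory-utilization 0.85.
headroom = gpu_headroom_gib(total_gib=24, gpu_memory_utilization=0.85)

print(f"NIXL buffer: {nixl_buffer_gib:.1f} GiB, headroom: {headroom:.1f} GiB")
print("buffer fits in headroom:", nixl_buffer_gib <= headroom)
```

If the buffer does not fit under these assumptions, either the buffer size or the GPU memory utilization would need to be lowered.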
The final configuration should look like this:
servingEngineSpec:
  enableEngine: true
  runtimeClassName: ""
  containerPort: 8000
  modelSpec:
    - name: qwen2-5-14b-prefill
      repository: lmcache/vllm-openai
      tag: 2025-05-27-v1
      modelURL: Qwen/Qwen2.5-14B-Instruct-AWQ
      replicaCount: 1
      requestCPU: 4
      requestMemory: "20Gi"
      requestGPU: 1
      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4
      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"
      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models
      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models
      vllmConfig:
        enablePrefixCaching: true
        maxModelLen: 4096
        v1: 1
        extraArgs:
          - "--served-model-name"
          - "qwen2-5-14b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"
      lmcacheConfig:
        enabled: true
        cudaVisibleDevices: "0"
        kvRole: "kv_producer"
        enableNixl: true
        nixlRole: "sender"
        nixlPeerHost: "vllm-qwen2-5-14b-decode-engine-service"
        nixlPeerPort: "55555"
        nixlBufferSize: "1073741824"
        nixlBufferDevice: "cuda"
        nixlEnableGc: true
        enablePD: true
        cpuOffloadingBufferSize: 0
      labels:
        model: qwen2-5-14b-prefill
    - name: qwen2-5-14b-decode
      repository: lmcache/vllm-openai
      tag: 2025-05-27-v1
      modelURL: Qwen/Qwen2.5-14B-Instruct-AWQ
      replicaCount: 1
      requestCPU: 4
      requestMemory: "20Gi"
      requestGPU: 1
      hf_token:
        secretName: vllm-prodstack-secrets
        secretKey: HF_TOKEN
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values:
                - nvidia-l4
      extraEnv:
        - name: HF_HUB_DISABLE_XET
          value: "1"
      extraVolumes:
        - name: vllm-model-cache
          persistentVolumeClaim:
            claimName: pvc-vllm-models
      extraVolumeMounts:
        - name: vllm-model-cache
          mountPath: /models
      vllmConfig:
        enablePrefixCaching: true
        maxModelLen: 4096
        v1: 1
        extraArgs:
          - "--served-model-name"
          - "qwen2-5-14b-instruct-awq"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--download-dir"
          - "/models/vllm-downloads"
          - "--quantization"
          - "awq"
      lmcacheConfig:
        enabled: true
        cudaVisibleDevices: "0"
        kvRole: "kv_consumer"
        enableNixl: true
        nixlRole: "receiver"
        nixlPeerHost: "0.0.0.0"
        nixlPeerPort: "55555"
        nixlBufferSize: "1073741824"
        nixlBufferDevice: "cuda"
        nixlEnableGc: true
        enablePD: true
      labels:
        model: qwen2-5-14b-decode
  containerSecurityContext:
    capabilities:
      add:
        - SYS_PTRACE
routerSpec:
  enableRouter: true
  repository: lmcache/lmstack-router
  tag: latest
  replicaCount: 1
  containerPort: 8000
  servicePort: 80
  routingLogic: disaggregated_prefill
  enablePD: true
  engineScrapeInterval: 15
  requestStatsWindow: 60
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  labels:
    environment: router
    release: router
  extraArgs:
    - "--prefill-model-labels"
    - "qwen2-5-14b-prefill"
    - "--decode-model-labels"
    - "qwen2-5-14b-decode"
SYS_PTRACE grants the container the Linux process tracing capability. This is typically required only for low-level debugging, profiling, or runtime inspection tools, such as CUDA/NCCL diagnostics and advanced GPU performance analysis.
Standard vLLM inference deployments generally do not require this capability. In security-conscious Kubernetes environments, container capabilities should be kept to a minimum, so omit SYS_PTRACE unless the workload or a troubleshooting process explicitly needs it.
We can now deploy the vLLM Production Stack with the Helm chart:
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm upgrade --install vllm vllm/vllm-stack \
  -n vllm --create-namespace \
  -f vllm-values-multi-node-disaggregated.yaml \
  --reset-values
After deployment, we can confirm from the vLLM logs that both the prefill and decode engines initialized successfully with the correct KV roles and NIXL configurations. The router logs should also confirm that the disaggregated prefill routing logic is active and that both engines are registered correctly with their respective model labels and LMCache roles.

Finally, running nvidia-smi in each engine pod shows active GPU memory allocation, confirming that both the prefill and decode workloads are executing on their assigned GPUs as expected for the disaggregated serving setup.

Adding Model(s) in LiteLLM (Optional)
Navigate to LiteLLM Dashboard -> Models + Endpoints -> Add Model and fill in the following values:
- Provider: vllm
- LiteLLM Model Name(s): gpt-oss-20b
- Model Mappings/Public Model Name: gpt-oss-20b
- Mode: Chat - /chat/completions
- Existing Credentials: None
- API Base: http://vllm-router-service.vllm.svc.cluster.local/v1
- API Key: fill in your VLLM_MASTER_KEY value

After completing the configuration, click Add Model. Repeat the same process for any additional models exposed by your vLLM deployment.

The most important field is Model Mappings/Public Model Name.
This value must match the model name exposed by the vLLM deployment through --served-model-name. LiteLLM uses this mapping to route
incoming OpenAI-compatible requests to the correct backend model endpoint.
If the names do not match, requests will fail with model resolution errors even though the vLLM service itself is healthy and reachable.
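One way to catch a naming mismatch before requests fail is to compare the public model name against the model IDs the vLLM endpoint reports. The sketch below assumes the endpoint returns the usual OpenAI-compatible model list (an object with a data array of entries carrying an id field); the sample payload is illustrative rather than captured from a real deployment, and the helper names are hypothetical.

```python
import json


def served_model_ids(models_response: str) -> set[str]:
    """Extract model IDs from an OpenAI-compatible model list response body."""
    payload = json.loads(models_response)
    return {entry["id"] for entry in payload.get("data", [])}


def is_mapped(public_model_name: str, models_response: str) -> bool:
    """True if the LiteLLM public model name matches a served model ID exactly."""
    return public_model_name in served_model_ids(models_response)


# Illustrative response body, shaped like an OpenAI-compatible model list.
SAMPLE = '{"object": "list", "data": [{"id": "gpt-oss-20b", "object": "model"}]}'

print(is_mapped("gpt-oss-20b", SAMPLE))  # True
print(is_mapped("gpt-oss-20B", SAMPLE))  # False: the comparison is case-sensitive
```

The comparison is intentionally exact, since even a difference in letter case counts as a different model name.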
Appendix A: Helm Chart Values
Tensor and Pipeline Parallelism
Under the vllmConfig stanza, tensorParallelSize and pipelineParallelSize control how a model is distributed across multiple GPUs. These settings matter when a model is too large for a single GPU or when inference throughput needs to scale beyond the capacity of one accelerator.
Tensor parallelism splits the mathematical operations of individual neural network layers across multiple GPUs. Instead of storing and processing the full tensor operations on one GPU, the computation is divided between several GPUs that work on the same layer simultaneously. This approach is commonly used for very large models because it reduces the memory pressure on individual GPUs while allowing the model to execute in parallel.
With tensorParallelSize: 2, for example, the tensor computations are distributed across two GPUs. The GPUs cooperate closely during inference and exchange intermediate results continuously. Because of this communication overhead,
tensor parallelism performs best on systems where GPUs are connected through high-bandwidth interconnects such as NVLink or similarly optimized PCIe topologies.
Pipeline parallelism works differently. Instead of splitting tensor operations inside a layer, it splits the model itself into sequential layer groups called pipeline stages. Each GPU becomes responsible for a different section of the model. During inference, the request passes through these stages in sequence until the output is produced.
With pipelineParallelSize: 2, one GPU may process the earlier transformer layers while another GPU processes the later layers. This approach is useful when the hardware environment does not provide fast GPU-to-GPU communication
for efficient tensor parallelism or when model layers need to be distributed more explicitly across available devices.
In practice, tensor parallelism is generally preferred for high-performance multi-GPU inference because it allows GPUs to work concurrently on the same operations. Pipeline parallelism is often used when GPU interconnect performance is limited or when the deployment architecture requires explicit layer partitioning.
These parameters must align with the GPU resources requested from Kubernetes. The effective GPU requirement is determined by the parallelism configuration. For example:
- tensorParallelSize: 2 requires at least 2 GPUs.
- pipelineParallelSize: 2 also requires at least 2 GPUs.
- The total GPU requirement is the product of both values. For example, tensorParallelSize: 2 and pipelineParallelSize: 2 require 4 GPUs in total.
The Kubernetes resource request must therefore provide sufficient GPUs for the configured parallelism model; otherwise, the vLLM workload cannot initialize correctly.
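The arithmetic can be sketched in a few lines. The function name required_gpus is hypothetical and only illustrates the product rule described above.

```python
def required_gpus(tensor_parallel_size: int = 1, pipeline_parallel_size: int = 1) -> int:
    """GPUs needed for a given tensor/pipeline parallelism combination."""
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * pipeline_parallel_size


for tp, pp in [(1, 1), (2, 1), (1, 2), (2, 2)]:
    print(f"tensorParallelSize={tp}, pipelineParallelSize={pp} "
          f"-> {required_gpus(tp, pp)} GPU(s)")
```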
Resource Allocation and Scheduling
The requestCPU, requestMemory, and requestGPU values define infrastructure resource allocation at the Kubernetes level, while tensorParallelSize and pipelineParallelSize define how vLLM internally distributes the
model execution across the allocated GPUs.
This distinction is important because Kubernetes itself does not understand how the model is executed. Kubernetes only schedules containers onto nodes based on requested resources.
requestCPU: 4
requestMemory: "16Gi"
requestGPU: 1
These parameters instruct Kubernetes to reserve 4 CPU cores, 16 GiB of system memory, and 1 GPU for the serving pod. They affect pod scheduling, node placement, and infrastructure capacity planning, but they do not control how vLLM uses the GPU internally.
The vLLM parallelism settings operate at the inference engine layer:
tensorParallelSize: 1
pipelineParallelSize: 2
These parameters determine how the model itself is partitioned and executed across GPUs during inference. In practice, the Kubernetes resource requests define the available hardware resources, while the vLLM parallelism settings define how the inference engine consumes and coordinates those resources.
⚠️ Both layers must remain consistent with each other. For example:
- requestGPU: 1 means Kubernetes allocates one GPU to the pod.
- tensorParallelSize: 2 tells vLLM to distribute tensor operations across two GPUs.
‼️ This creates a mismatch because vLLM expects access to two GPUs while Kubernetes provides only one. The workload would therefore fail during initialization or scheduling.
The same applies to pipeline parallelism. If pipelineParallelSize: 2 is configured, the deployment must provide at least two GPUs because the model is split into two execution stages.
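A consistency check along these lines can be run against a values file before installing the chart. This is a minimal sketch under one stated assumption: a single pod owns all GPUs used by the engine, so requestGPU must cover the full parallelism product. The function name check_gpu_consistency is hypothetical, and the dictionaries stand in for one modelSpec entry.

```python
def check_gpu_consistency(model_spec: dict) -> str:
    """Compare requestGPU with the GPUs implied by the vLLM parallelism settings.

    Assumption: one pod holds every GPU the engine uses.
    """
    vllm_config = model_spec.get("vllmConfig", {})
    needed = (vllm_config.get("tensorParallelSize", 1)
              * vllm_config.get("pipelineParallelSize", 1))
    available = model_spec.get("requestGPU", 0)
    if available < needed:
        return f"mismatch: vLLM needs {needed} GPU(s), Kubernetes provides {available}"
    return f"ok: vLLM needs {needed} GPU(s), Kubernetes provides {available}"


# The mismatch described above: one GPU requested, two expected by vLLM.
print(check_gpu_consistency({"requestGPU": 1, "vllmConfig": {"tensorParallelSize": 2}}))

# A consistent entry: two GPUs requested for tensorParallelSize: 2.
print(check_gpu_consistency({"requestGPU": 2, "vllmConfig": {"tensorParallelSize": 2}}))
```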