We found one article tagged with "distributed-serving"

vLLM is an open-source inference and serving engine for large language models. It is designed to improve serving throughput and GPU memory efficiency, mainly through PagedAttention, continuous batching, and prefix caching, and it exposes an OpenAI-compatible serving interface.
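
The memory-efficiency idea behind PagedAttention can be pictured with a toy allocator: the key-value cache is split into fixed-size blocks, and each request gets a block table that grows only as tokens arrive, instead of reserving space for its maximum length up front. The sketch below is an illustration only, not vLLM's implementation or API; `ToyBlockAllocator`, `BLOCK_SIZE`, and the request IDs are hypothetical names chosen for this example.

```python
# Toy illustration only: not vLLM's implementation or API.
BLOCK_SIZE = 16  # tokens per cache block (assumed value for this sketch)


class ToyBlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of block ids
        self.num_tokens = {}    # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        # A new block is needed only when every allocated block is full.
        if count == len(table) * BLOCK_SIZE:
            if not self.free_blocks:
                raise MemoryError("no free cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=8)
for _ in range(20):
    allocator.append_token("request-a")  # 20 tokens -> 2 blocks
for _ in range(5):
    allocator.append_token("request-b")  # 5 tokens -> 1 block
print(allocator.block_tables)
print(len(allocator.free_blocks))
```

Because blocks need not be contiguous and are returned to the pool as soon as a request finishes, waste in this toy model is limited to the unfilled tail of each request's last block.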