#37
vLLM
Open-source LLM inference and serving engine that originated at UC Berkeley's Sky Computing Lab. Built around PagedAttention for efficient KV cache memory management, with continuous batching, tensor and pipeline parallelism, and quantization support (FP8, GPTQ, AWQ). Supports 200+ Hugging Face model architectures and exposes an OpenAI-compatible API.
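The KV cache idea behind PagedAttention can be illustrated with a toy block-table allocator. This is only a sketch under stated assumptions, not vLLM's implementation: the class `PagedKVCache` and its methods are hypothetical names chosen for illustration, and real engines store tensors in the blocks rather than just tracking indices. It shows how each sequence maps its tokens onto fixed-size physical blocks that are taken from a shared pool on demand and returned when the sequence finishes.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block-table allocator (hypothetical names, not vLLM's API)."""

    num_blocks: int   # total physical blocks in the pool
    block_size: int   # tokens that fit in one block
    free_blocks: list = field(init=False, default_factory=list)
    block_tables: dict = field(init=False, default_factory=dict)
    seq_lens: dict = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        # Every physical block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> tuple:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:
            # New sequence, or its last block is full: take a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], offset

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

With `PagedKVCache(num_blocks=4, block_size=2)`, appending three tokens to one sequence reserves two blocks, so at most one slot is left unused in that sequence's last block instead of a whole pre-allocated maximum-length buffer.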
Categories
Subcategories
INFERENCE OPTIMIZATION, RL, MODEL SERVING