If you need to serve LLMs at scale without melting your infrastructure, vLLM is a strong default. It reports up to 24x higher throughput than Hugging Face Transformers, thanks to PagedAttention for memory-efficient KV caching and continuous batching that schedules prefill and decode requests together. The skill wraps the Python library so you can spin up high-performance inference servers or run offline batch jobs. It's built for production workloads where you actually track tokens per second and GPU utilization. Installation is straightforward, and you get the same engine that powers a lot of commercial LLM APIs. Note that the skill comes from orchestra-research's AI research collection, so expect research-grade tooling rather than hand-holding docs.
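To make the paged KV caching idea concrete, here is a toy sketch in plain Python. It is not vLLM's implementation: the `PagedKVCache` class, its method names, and the block size of 16 tokens are all hypothetical, chosen only to illustrate how fixed-size blocks are handed out on demand from a shared pool instead of reserving memory for each request's maximum length up front.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class OutOfBlocks(RuntimeError):
    """Raised when the shared pool cannot satisfy an allocation."""


@dataclass
class PagedKVCache:
    """Toy block allocator: tracks block IDs only, stores no tensors."""

    num_blocks: int
    free_blocks: list[int] = field(init=False)
    block_tables: dict[str, list[int]] = field(default_factory=dict)
    seq_lens: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.num_blocks))

    def append_tokens(self, seq_id: str, n: int) -> None:
        """Reserve room for n more tokens, allocating blocks only when needed."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n
        needed = -(-new_len // BLOCK_SIZE) - len(table)  # ceiling division
        if needed > len(self.free_blocks):
            raise OutOfBlocks(f"need {needed}, have {len(self.free_blocks)}")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4)
    cache.append_tokens("req-a", 20)  # prefill: 20 tokens need 2 blocks
    cache.append_tokens("req-b", 5)   # a second request shares the same pool
    cache.append_tokens("req-a", 10)  # decode: 30 tokens still fit in 2 blocks
    print(cache.block_tables, "free:", len(cache.free_blocks))
    cache.release("req-b")            # freed block is immediately reusable
```

Because a sequence wastes at most one partially filled block, more requests fit in the same memory, which is what lets a scheduler keep admitting new requests while others are still decoding.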
npx -y skills add orchestra-research/ai-research-skills --skill serving-llms-vllm --agent claude-code

Installs into .claude/skills of the current project.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot