If you're running LLMs on anything other than NVIDIA GPUs, this is your best option. It's pure C++ inference that actually works well on M1/M2/M3 Macs, AMD cards, or just CPUs. The GGUF quantization is the killer feature: you can run a 7B model in 4GB at Q4_K_M with surprisingly little quality loss and get 50 tokens per second on an M3 Max. It ships with an OpenAI-compatible server, so you can drop it into existing code. The honest take is that if you have NVIDIA hardware you should use vLLM or TensorRT-LLM for better throughput, but for edge deployment, local development on Macs, or anything without CUDA, this is the standard.
npx -y skills add orchestra-research/ai-research-skills --skill llama-cpp --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot