llama.cpp
llama.cpp is an open-source LLM inference engine written in pure C/C++ that runs large language models locally on consumer hardware with minimal dependencies. It introduced the GGUF model format and the quantization schemes behind it, which compress model weights to 2–8 bits so they run efficiently on both CPU and GPU. Originally built to run Meta's LLaMA models, it now supports a broad range of architectures, including Mistral, Falcon, Phi, Gemma, and many others.
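The memory savings from quantization are easy to estimate from first principles. A rough back-of-the-envelope sketch (real GGUF files are somewhat larger because of per-block scale metadata, and inference also needs KV-cache memory):

```python
# Approximate weight memory for a 7B-parameter model at different
# bit widths. This ignores GGUF metadata and runtime overhead, so
# treat the numbers as lower bounds, not exact file sizes.
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {approx_size_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB
#  8-bit:  7.0 GB
#  4-bit:  3.5 GB
#  2-bit:  1.8 GB
```

This is why a 7B model that needs ~14 GB at FP16 fits in the RAM of an ordinary laptop once quantized to 4 bits.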
Overall Score: 9
Scores
Capability 9
Ease of Use 6
Documentation 7
Reliability 9
Value 10
Momentum 9
Details
- Status: active
- Pricing: open-source
- Launch Date:
- Last Updated:
Key Features
- Pure C/C++ implementation with no mandatory external dependencies
- GGUF quantization format supporting 2-bit to 8-bit precision
- Multi-backend GPU acceleration (CUDA, Metal, Vulkan, OpenCL)
- OpenAI-compatible HTTP API server (llama-server)
- Supports 50+ model architectures including LLaMA, Mistral, Falcon, Gemma, and Phi
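The bundled llama-server exposes OpenAI-compatible endpoints such as `POST /v1/chat/completions`, so existing OpenAI client code can be pointed at it. A minimal sketch, assuming a server is already running locally (e.g. started with `llama-server -m model.gguf --port 8080`; the host, port, and model name here are placeholders):

```python
import json

def build_chat_request(prompt: str, model: str = "local") -> bytes:
    """Build an OpenAI-style chat-completions request body."""
    payload = {
        "model": model,  # llama-server serves whichever model it loaded
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("Why is the sky blue?")

# To actually send it (requires a running llama-server):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

Because the request and response shapes mirror the OpenAI API, swapping a cloud backend for a local llama.cpp instance is usually just a base-URL change.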