infrastructure

llama.cpp

by Georgi Gerganov

llama.cpp is an open-source LLM inference engine written in pure C/C++ that runs large language models locally on consumer hardware with minimal dependencies. It introduced the GGUF model file format along with a family of quantization schemes that compress weights to roughly 2–8 bits per parameter, allowing models to run efficiently on both CPU and GPU. Originally built to run Meta's LLaMA models, it now supports a broad range of architectures, including Mistral, Falcon, Phi, Gemma, and many others.
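To illustrate the idea behind block quantization, here is a minimal Python sketch in the style of llama.cpp's Q8_0 type (blocks of 32 weights sharing one scale, with int8 quantized values). This is a conceptual sketch, not llama.cpp's actual optimized C/C++ kernels:

```python
BLOCK = 32  # Q8_0 groups weights into blocks of 32

def quantize_q8_0(weights):
    """Quantize floats to int8 blocks, one shared scale per block."""
    assert len(weights) % BLOCK == 0
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        amax = max(abs(w) for w in chunk)
        scale = amax / 127.0 if amax else 1.0
        # Each weight is stored as a small integer q with x ≈ q * scale
        quants = [max(-127, min(127, round(w / scale))) for w in chunk]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, int8 quants) blocks."""
    out = []
    for scale, quants in blocks:
        out.extend(q * scale for q in quants)
    return out
```

The lower-bit types (down to 2-bit) follow the same block-with-scale idea with more aggressive packing, trading accuracy for memory.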

Overall Score
9

Scores

Capability
9
Ease of Use
6
Documentation
7
Reliability
9
Value
10
Momentum
9

Details

Status
active
Pricing
open-source
Launch Date
2023-03
Last Updated
2026-03-14

Key Features

  • Pure C/C++ implementation with no mandatory external dependencies
  • GGUF quantization format supporting 2-bit to 8-bit precision
  • Multi-backend GPU acceleration (CUDA, Metal, Vulkan, OpenCL)
  • OpenAI-compatible HTTP API server (llama-server)
  • Supports 50+ model architectures including LLaMA, Mistral, Falcon, Gemma, and Phi
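Because llama-server exposes an OpenAI-compatible HTTP API, standard OpenAI-style chat requests work against it. The sketch below builds such a request with only the Python standard library; the base URL assumes llama-server's default local port, so adjust it to your setup:

```python
import json
import urllib.request

def build_chat_request(messages, base_url="http://localhost:8080",
                       temperature=0.7):
    """Build an OpenAI-style chat completion request for llama-server.

    base_url assumes a locally running llama-server on its default port.
    """
    payload = {
        "messages": messages,        # standard OpenAI chat message format
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Actually sending the request requires a running server, e.g.:
#   llama-server -m model.gguf
# with urllib.request.urlopen(req) as resp:
#     body = json.loads(resp.read())
#     print(body["choices"][0]["message"]["content"])
```

Since the endpoint follows the OpenAI schema, existing OpenAI client libraries can also be pointed at the server by overriding their base URL.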

Tech Stack

C, C++, CUDA, Metal, OpenCL, Vulkan, Python, GGUF