infrastructure

llama.cpp

by Georgi Gerganov

llama.cpp is an open-source LLM inference engine written in pure C/C++ that runs large language models locally on consumer hardware with minimal dependencies. It introduced the GGUF model file format along with a family of quantization schemes that compress weights to roughly 2–8 bits per parameter, allowing models to run efficiently on both CPU and GPU. Originally built to run Meta's LLaMA models, it now supports a broad range of architectures, including Mistral, Falcon, Phi, Gemma, and many others.
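To illustrate the idea behind block quantization, here is a minimal Python sketch in the style of llama.cpp's Q8_0 type (blocks of 32 weights sharing one scale, with int8 quantized values). This is a conceptual sketch, not llama.cpp's actual optimized C/C++ kernels:

```python
BLOCK = 32  # Q8_0 groups weights into blocks of 32

def quantize_q8_0(weights):
    """Quantize floats to int8 blocks, one shared scale per block."""
    assert len(weights) % BLOCK == 0
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        amax = max(abs(w) for w in chunk)
        scale = amax / 127.0 if amax else 1.0
        # Each weight is stored as a small integer q with x ≈ q * scale
        quants = [max(-127, min(127, round(w / scale))) for w in chunk]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, int8 quants) blocks."""
    out = []
    for scale, quants in blocks:
        out.extend(q * scale for q in quants)
    return out
```

The lower-bit types (down to 2-bit) follow the same block-with-scale idea with more aggressive packing, trading accuracy for memory.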

Overall Score
9

Scores

Capability
9
Ease of Use
6
Documentation
7
Reliability
9
Value
10
Momentum
9

Details

Status
active
Pricing
open-source
Launch Date
2023-03
Last Updated
2026-03-14

Key Features

  • Pure C/C++ implementation with no mandatory external dependencies
  • GGUF quantization format supporting 2-bit to 8-bit precision
  • Multi-backend GPU acceleration (CUDA, Metal, Vulkan, OpenCL)
  • OpenAI-compatible HTTP API server (llama-server)
  • Supports 50+ model architectures including LLaMA, Mistral, Falcon, Gemma, and Phi
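Because llama-server exposes an OpenAI-compatible HTTP API, standard OpenAI-style chat requests work against it. The sketch below builds such a request with only the Python standard library; the base URL assumes llama-server's default local port, so adjust it to your setup:

```python
import json
import urllib.request

def build_chat_request(messages, base_url="http://localhost:8080",
                       temperature=0.7):
    """Build an OpenAI-style chat completion request for llama-server.

    base_url assumes a locally running llama-server on its default port.
    """
    payload = {
        "messages": messages,        # standard OpenAI chat message format
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Actually sending the request requires a running server, e.g.:
#   llama-server -m model.gguf
# with urllib.request.urlopen(req) as resp:
#     body = json.loads(resp.read())
#     print(body["choices"][0]["message"]["content"])
```

Since the endpoint follows the OpenAI schema, existing OpenAI client libraries can also be pointed at the server by overriding their base URL.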

Tech Stack

C, C++, CUDA, Metal, OpenCL, Vulkan, Python, GGUF