Inference Engine · MIT

llama-cpp-python

Python bindings for llama.cpp providing a high-level API and OpenAI-compatible server. The easiest way to use llama.cpp from Python applications.

Platforms: Windows, macOS, Linux

llama-cpp-python is the de facto standard Python binding for llama.cpp, providing both a low-level ctypes interface and a high-level Pythonic API for running GGUF language models. It includes a built-in OpenAI-compatible API server that can stand in for the OpenAI API in Python applications with a single base-URL change. For Python developers who want llama.cpp’s performance with Python’s ecosystem and convenience, llama-cpp-python is the standard bridge between the two worlds.

Key Features

High-level Python API. The Llama class provides a simple, Pythonic interface for loading models, generating completions, and managing chat conversations. A few lines of Python are enough to load a GGUF model and start generating text, with full access to sampling parameters, grammar constraints, and chat templates.
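A minimal sketch of that workflow (the model path, prompt, and sampling parameters below are illustrative placeholders, not values from this project):

```python
from llama_cpp import Llama

# Load a local GGUF model (path is a placeholder).
llm = Llama(model_path="./models/model.gguf", n_ctx=4096)

# Plain text completion with a stop sequence.
out = llm("Q: Name two planets. A:", max_tokens=32, stop=["Q:"], echo=False)
print(out["choices"][0]["text"])

# Chat-style completion using the model's built-in chat template.
chat = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(chat["choices"][0]["message"]["content"])
```

Both calls return OpenAI-shaped dictionaries, which is why the same code patterns carry over to the bundled server.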

OpenAI-compatible server. The bundled server implements the OpenAI API specification for chat completions, completions, and embeddings. Applications built on the official OpenAI Python SDK work unchanged: point the SDK’s base URL at the local server and requests are served locally. This makes llama-cpp-python a drop-in replacement for OpenAI in Python projects.
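For example, assuming the server has been started locally (model path and port are placeholders), the official OpenAI SDK needs only a different base URL:

```python
# Start the server first, e.g.:
#   python -m llama_cpp.server --model ./models/model.gguf --port 8000
from openai import OpenAI

# The local server does not validate the API key; any non-empty string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # the server answers with whichever model it loaded
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
)
print(resp.choices[0].message.content)
```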

Full llama.cpp feature support. llama-cpp-python exposes the complete llama.cpp feature set: GPU offloading via CUDA, Metal, and ROCm, GGUF model loading, context extension, LoRA adapter support, multimodal models, grammar-based sampling, and speculative decoding.
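GPU offloading and context extension are controlled through constructor parameters; a sketch (path and sizes are illustrative, and GPU offload requires a CUDA, Metal, or ROCm build):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; 0 keeps everything on CPU
    n_ctx=8192,       # context window size in tokens
)
```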

Function calling support. The server implements OpenAI-compatible function calling, enabling tool-use workflows with local models. This makes it compatible with frameworks like LangChain and LlamaIndex that rely on function calling for agent implementations.
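From the client side this looks exactly like OpenAI function calling; a hedged sketch against a locally running server (the tool schema and model name are hypothetical):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Tool schema in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
# If the model chose to call a tool, the structured call appears here:
print(resp.choices[0].message.tool_calls)
```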

Hugging Face integration. Models can be loaded directly from Hugging Face Hub by repository ID, with automatic downloading and caching. This eliminates manual model file management for common use cases.
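A sketch of Hub loading (the repository ID and filename pattern are examples, and the `huggingface-hub` package must be installed):

```python
from llama_cpp import Llama

# Downloads the matching GGUF file from Hugging Face Hub and caches it locally.
llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="*Q4_K_M.gguf",  # glob pattern selecting a quantization variant
)
```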

pip installable. Installation is a single pip command with pre-built wheels available for major platforms. GPU support requires specifying the appropriate backend during installation (CUDA, Metal, ROCm).

When to Use llama-cpp-python

Choose llama-cpp-python when you are building Python applications that need local LLM inference. It is the right choice for developers integrating local models into Python projects, teams replacing OpenAI API calls with local inference, and anyone who wants llama.cpp performance without writing C++ code.

Ecosystem Role

llama-cpp-python bridges llama.cpp’s C++ performance and Python’s ecosystem. It competes with Ollama’s Python library for simple use cases but offers more control and lower-level access. For non-Python applications, Ollama’s REST API is more language-agnostic. For maximum Python integration, llama-cpp-python gives direct access to the inference engine.