Mullama vs Ollama
How Mullama compares to Ollama — what's the same, what's added, and when each one is the right call.
Both Mullama and Ollama are local LLM runtimes built on top of
llama.cpp, so they share the same numerical kernels. For equivalent
GGUF models and equivalent sampler settings, the outputs should be
equivalent.
This page exists for the part that isn’t the same.
TL;DR
- Same CLI surface. Same verbs (run, pull, serve, chat, list, ps, create, show, rm, cp), same Modelfile format, same model registry, same port (11434).
- Same OpenAI-compatible HTTP API. Existing clients pointed at localhost:11434/v1 work unchanged.
- Mullama adds an Anthropic-compatible API. Point the official anthropic SDK at localhost:11434.
- Mullama adds six native language bindings. Rust, Python, Node.js, Go, PHP, C/C++. Ollama is HTTP-only.
- Mullama supports 7 GPU backends. CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL, RPC. Ollama covers four.
- Mullama ships an embedded Web UI and TUI. Ollama leaves that to third parties.
Feature matrix
| Feature | Mullama | Ollama |
|---|---|---|
| GGUF models | ✓ | ✓ |
| Ollama CLI verbs | ✓ | ✓ |
| Modelfile format | ✓ (unchanged) | ✓ |
| Model registry / aliases | ✓ (compatible) | ✓ |
| Port 11434 by default | ✓ | ✓ |
| OpenAI-compatible HTTP API | ✓ | ✓ |
| Anthropic-compatible HTTP API | ✓ | ✗ |
| Native language bindings | 6 (Rust, Python, Node.js, Go, PHP, C/C++) | HTTP only |
| Embed in-process (no daemon) | ✓ | ✗ |
| GPU backends | 7 (CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL, RPC) | 4 |
| Built-in Web UI | ✓ | ✗ (third-party) |
| Built-in TUI | ✓ | ✗ |
| Multimodal (vision) | ✓ (LLaVA, llava-phi3, Moondream) | partial |
| Multimodal (real-time audio + VAD) | ✓ | ✗ |
| Embedded model registry resolution | ✓ | ✓ |
| Per-model resource limits | ✓ | partial |
| LRU model eviction | ✓ | ✓ |
| Persistent stats (sled) | ✓ | ✗ |
| Prometheus metrics endpoint | ✓ | ✗ (third-party) |
| Grammar-constrained JSON output | ✓ | ✓ |
| OpenAI tools / function calling | ✓ | ✓ |
| License | MIT | MIT |
What stays the same
If you already use Ollama, the muscle memory transfers directly:
mullama run llama3.2:1b "Hello" # same as ollama run
mullama pull qwen2.5:7b # same as ollama pull
mullama serve --model llama3.2:1b # same as ollama serve
Your existing Modelfiles work unchanged. Your client code pointed
at http://localhost:11434 works unchanged. Your model aliases
resolve to the same Hugging Face GGUFs.
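As a rough illustration of "client code works unchanged", here is a minimal sketch of an OpenAI-style chat request built with only the Python standard library. It assumes the server accepts the standard OpenAI path /v1/chat/completions under the /v1 prefix; the helper names build_chat_request and chat are illustrative, not part of either project.

```python
import json
import urllib.request

# Assumption: the OpenAI-compatible API lives under /v1 on the default port.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring the api_key="local" used elsewhere on this page.
            "Authorization": "Bearer local",
        },
        method="POST",
    )


def chat(model, prompt):
    """Send the request to a running daemon and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With a daemon listening on port 11434, chat("llama3.2:1b", "Hello") would return the model's reply. The same request should not need to change when the daemon behind the port does.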
If you have shell scripts, Makefile targets, CI pipelines, or
Docker Compose files that call ollama, the migration is a
search-and-replace.
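A plain search-and-replace can also hit paths such as ~/.ollama or image names such as ollama/ollama. The sketch below is one cautious way to do it, assuming you only want to rewrite the bare command word; the function name rewrite_commands is hypothetical and the regex is a starting point, not an exhaustive rule.

```python
import re

# Match "ollama" only when it stands alone as a command word: not part of
# "mullama", not inside a path (~/.ollama), an image name (ollama/ollama),
# or a hyphenated name (ollama-box). Upper-case OLLAMA_* variables are untouched.
_OLLAMA_CMD = re.compile(r"(?<![\w./-])ollama(?![\w./-])")


def rewrite_commands(text: str) -> str:
    """Replace bare `ollama` invocations with `mullama`."""
    return _OLLAMA_CMD.sub("mullama", text)
```

Review the resulting diff before committing it, since scripts can reference the command in ways a regex will not anticipate.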
What Mullama adds
Anthropic-compatible HTTP API
This is the biggest single addition. Mullama’s serve mode exposes
both an OpenAI-compatible API (at /v1) and an Anthropic-compatible
one (at /v1/messages). Point the official anthropic SDK at
localhost:11434 and it works:
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:11434", api_key="local")
msg = client.messages.create(
    model="llama3.2:1b",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello"}],
)
If your codebase has already standardized on the Anthropic SDK shape, you no longer need an adapter layer to talk to a local model.
Native bindings for six languages
Ollama is HTTP-only by design. Every call from your application is serialized to JSON, sent over a socket, deserialized, and run through inference, and then the response makes the same trip in reverse. For high-frequency workloads (embedding ingestion for RAG, agent loops, batch classification), that per-call cost adds up.
Mullama ships first-party in-process bindings for Rust, Python,
Node.js, Go, PHP, and C/C++. The model lives in your application
process. There’s no daemon, no HTTP hop, no JSON serialization. A
generate call returns a string in your language’s native type.
The bindings sit on the same Rust core, so behavior is identical across languages — same sampler, same Modelfile parser, same GPU backend selection.
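To get a feel for the serialization cost mentioned above, the following sketch times a JSON encode/decode round trip for a batch of embeddings. The batch shape (64 vectors of 768 floats) is an arbitrary assumption, and the measurement covers only JSON work, not the socket hop or inference, so treat the number as a lower bound on HTTP overhead rather than a benchmark of either project.

```python
import json
import time


def json_round_trip_seconds(payload, repeats: int = 50) -> float:
    """Average time for one json.dumps + json.loads cycle on `payload`."""
    start = time.perf_counter()
    for _ in range(repeats):
        json.loads(json.dumps(payload))
    return (time.perf_counter() - start) / repeats


# Hypothetical batch: 64 embedding vectors with 768 dimensions each.
batch = {"embeddings": [[0.001 * i for i in range(768)] for _ in range(64)]}

per_call = json_round_trip_seconds(batch)
print(f"JSON round trip: {per_call * 1e3:.2f} ms per batch")
```

An in-process call skips this step entirely, which matters most when the payload is large or the call rate is high.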
Three more GPU backends
Ollama covers CUDA, Metal, ROCm, and limited Vulkan. Mullama covers all of those plus:
- OpenCL — useful for legacy Intel and AMD GPUs.
- SYCL — Intel Arc and Intel Data Center GPUs.
- RPC — distributed inference across multiple machines.
The full list: CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL, RPC. CPU inference works out of the box. Apple Silicon uses Metal automatically.
Built-in Web UI and TUI
Ollama leaves chat UIs and model management to third parties (Open
WebUI, ollama-ui, etc.). Mullama ships both an embedded Web UI and
a TUI as part of the daemon binary. Chat, model management, and an
API playground are available at http://localhost:11434/ui with
no extra installation.
Production observability
Mullama’s daemon includes per-model resource limits, LRU model
eviction under memory pressure, persistent statistics via sled,
and a Prometheus metrics endpoint. These are the things you start
caring about once a local LLM is behind a real HTTP service.
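For readers unfamiliar with LRU eviction, here is a toy sketch of the general idea: when loading a new model would exceed a memory budget, the least recently used models are unloaded first. This is not Mullama's implementation; the class name ModelCache, the byte budget, and the sizes are all made up for illustration.

```python
from collections import OrderedDict
from typing import List


class ModelCache:
    """Toy LRU cache keyed by model name, bounded by a memory budget in bytes."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self._models = OrderedDict()  # name -> size, oldest use first

    def used_bytes(self) -> int:
        return sum(self._models.values())

    def touch(self, name: str, size_bytes: int) -> List[str]:
        """Mark a model as just used, loading it if needed; return evicted names."""
        if name in self._models:
            self._models.move_to_end(name)  # now the most recently used
            return []
        evicted = []
        # Evict from the least recently used end until the new model fits.
        # A model larger than the whole budget evicts everything and still loads.
        while self._models and self.used_bytes() + size_bytes > self.budget_bytes:
            victim, _ = self._models.popitem(last=False)
            evicted.append(victim)
        self._models[name] = size_bytes
        return evicted
```

In this sketch, loading "a" and "b", then using "a" again, then loading "c" under a tight budget evicts "b", because "a" was touched more recently.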
Real-time audio with VAD
Mullama’s multimodal pipeline includes real-time audio capture with voice activity detection, speech-to-text, and streaming LLM responses. Useful for voice assistants and transcription tools. Ollama does not have an equivalent.
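Voice activity detection decides which audio frames contain speech so that only those are passed on to speech-to-text. The sketch below shows the simplest generic form, an RMS energy threshold per frame. It is an illustration of the concept only and not how Mullama does it; the frame size and threshold are arbitrary assumptions.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def speech_flags(
    samples: Sequence[float], frame_size: int = 160, threshold: float = 0.02
) -> List[bool]:
    """Flag each full frame as speech (True) when its RMS meets the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        flags.append(rms(samples[start:start + frame_size]) >= threshold)
    return flags
```

Production detectors usually add smoothing and a noise-floor estimate, or use a trained model, because a fixed threshold is sensitive to background noise.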
When Ollama is still the right call
Honest answer: if you’re already using Ollama, your stack is
HTTP-only, and you don’t need an Anthropic-compatible API or
in-process bindings, there’s no urgent reason to switch. Both
projects share the same llama.cpp core. The CLI you already know
works.
Mullama wins specifically when you need:
- The Anthropic SDK to talk to a local model without writing your own adapter.
- In-process inference for desktop apps, CLI tools, edge devices, or high-frequency workloads.
- GPU backends Ollama doesn’t have (OpenCL, SYCL, RPC).
- A single dependency in a polyglot stack instead of one daemon plus N HTTP wrappers in N languages.
Migration
See the migration guide in the docs for step-by-step instructions. The short version: install Mullama, stop Ollama, start the Mullama daemon on port 11434, and continue. Your Modelfiles and client code keep working.
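Because both daemons default to the same port, it can help to confirm that the port is free before starting the new one. Here is a small sketch using only the standard library; the helper name port_in_use is illustrative.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

If port_in_use(11434) is still True after stopping Ollama, another process is holding the port and the new daemon will likely fail to bind.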
Sources: github.com/cognisoc/mullama · docs