mullama
/blog · 2026-04-22 · ollama · compatibility · migration

Drop-in for Ollama: same CLI, same Modelfile, same port — plus things Ollama doesn't have

Mullama is wire-compatible with Ollama at the CLI, the Modelfile format, the model registry, and the HTTP port. Existing client code keeps working. Here's exactly what stays the same and what's new.


If you’ve already invested in Ollama — your team’s muscle memory, your Modelfiles, your client SDK calls pointed at localhost:11434 — the question for Mullama is simple: how much do I have to change?

Short answer: nothing, unless you want to. Long answer follows.

The CLI is the same CLI

Ollama’s CLI verbs are a clean surface, and Mullama re-implements them. Every command you already type works:

mullama run llama3.2:1b "Hello"     # was: ollama run
mullama pull qwen2.5:7b              # was: ollama pull
mullama serve --model llama3.2:1b    # was: ollama serve
mullama chat                         # was: ollama chat
mullama list                         # was: ollama list
mullama ps                           # was: ollama ps
mullama create my-model -f Modelfile # was: ollama create
mullama show llama3.2:1b             # was: ollama show
mullama rm old-model                 # was: ollama rm
mullama cp llama3.2:1b my-copy       # was: ollama cp

If you have shell scripts, Makefile targets, or CI pipelines that call ollama, the migration is a global search-and-replace from ollama to mullama. (Or, if you’d rather not change anything: alias ollama=mullama. We won’t tell.)

The daemon listens on the same port

Mullama’s serve mode binds to port 11434 by default — the Ollama port. Anything in your stack already configured to talk to http://localhost:11434 keeps working without modification. That includes:

The HTTP surface is wire-compatible with the OpenAI API and Ollama’s own API. Streaming (SSE) works. Function calling via the OpenAI tools field works. Embeddings work.

Modelfiles are the same Modelfiles

Ollama’s Modelfile format — FROM, PARAMETER, TEMPLATE, SYSTEM, MESSAGE, LICENSE, ADAPTER — is parsed by Mullama without modification. The Modelfile that builds your custom “code-review-bot” in Ollama builds the same model in Mullama. The parameter names are the same. The template syntax is the same.

FROM llama3.2:1b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM """You are a senior code reviewer..."""

mullama create code-review-bot -f Modelfile produces the same runtime behavior as ollama create would have.

The model registry resolves the same aliases

Type llama3.2:1b in Ollama, you get the standard Llama 3.2 1B GGUF. Type the same thing in Mullama, you get the same GGUF — same quantization, same Hugging Face source. The alias table is pre-configured with the families Ollama users expect:

For anything not in the alias table, two escape hatches:

mullama run hf:bartowski/Llama-3.2-1B-Instruct-GGUF "Hello"
mullama run /path/to/local/model.gguf "Hello"

Both work. The first pulls from Hugging Face. The second loads a local file directly.

What Mullama adds that Ollama doesn’t have

This is the part that motivates switching.

Anthropic-compatible API. Mullama’s serve mode also speaks the Anthropic message format. Point an anthropic SDK at http://localhost:11434 and it works — same as the OpenAI side. For teams that have standardized on the Anthropic SDK shape (and there are a lot of you), this means no second adapter layer.

Native in-process bindings for six languages. Ollama is HTTP-only. Mullama ships first-party bindings for Rust, Python, Node.js, Go, PHP, and C/C++. The model lives in your process; no HTTP round-trip; no second daemon. For high-frequency inference — agent loops, batch embedding, RAG pipelines doing thousands of small completions — this changes the cost structure.

More GPU backends. Ollama ships CUDA, Metal, ROCm, and limited Vulkan. Mullama ships seven: CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL (Intel Arc and Data Center GPUs), and RPC for distributed inference across machines. Apple Silicon uses Metal automatically. CPU works out of the box.

Built-in Web UI and TUI. Ollama leaves the UI to third parties. Mullama ships an embedded Web UI for chat and model management, and a TUI for the terminal-only crowd. Both are part of the daemon binary; no extra installation.

Production knobs. Per-model resource limits. LRU model eviction when memory pressure hits. Persistent statistics via sled. Prometheus metrics endpoint. These are the kinds of things you only notice you needed once you’ve put a local LLM behind a real HTTP service.

First-class multimodal. Image (LLaVA, Moondream) and real-time audio with voice activity detection are first-class in Mullama. The audio pipeline does streaming capture, VAD, speech-to-text, and streaming LLM responses — useful for voice assistants and transcription tools.

A migration in three steps

If you’re moving an existing Ollama deployment to Mullama, here’s the actual process:

1. Install Mullama side-by-side.

curl -fsSL https://mullama.cognisoc.com/install.sh | sh

Both ollama and mullama can coexist; they just can’t both bind port 11434 at the same time. Stop Ollama. Start Mullama.

2. Copy your Modelfiles. They work unchanged.

mullama create my-custom-bot -f /path/to/existing/Modelfile

The pulled GGUF cache from Ollama lives in ~/.ollama/models. Mullama uses its own cache; pulls are not free the first time. If that’s a concern, the alias resolution is fast and the GGUFs are identical bytes, so disk usage is the only cost.

3. Point your client code at Mullama (or don’t).

If your code talks to localhost:11434, you’re done. If you want to also use the Anthropic-compatible endpoint, point an anthropic SDK at the same host. If you want to embed the runtime in-process to remove the HTTP hop, swap your HTTP call for the language binding.

When to stay on Ollama

Honest answer: if Ollama’s HTTP-only model fits your stack and you don’t need an Anthropic-compatible API, native bindings, or the extra GPU backends, there is no urgent reason to switch. Both projects sit on top of llama.cpp. The numerical behavior is equivalent for equivalent models. The CLI you already know works.

Mullama wins when you need more than HTTP — when “embed it in this process” or “talk to it with the Anthropic SDK” stops being a nice-to-have and starts being a requirement.

A note on coexistence

A surprising number of teams end up running both. The Ollama daemon stays where it is, serving an existing toolchain that was written against it years ago and works fine. Mullama goes onto a new service — a Go batch processor, a Tauri desktop app, a Python ingestion job that does ten thousand embeddings per minute — specifically because that service wanted in-process inference or the Anthropic SDK surface. There’s nothing wrong with this shape. Both daemons can’t bind 11434 on the same host at the same time, but on different hosts (or in different containers with different port mappings) they live side by side without issue. The GGUF cache directory is configurable per-process if disk is at a premium and you’d rather not double the model footprint.

What about the model cache?

A common migration question: do I have to re-download every model? The honest answer is yes by default, no with effort. The on-disk GGUF bytes are identical between Ollama and Mullama — they’re the same files from the same Hugging Face repositories — but the cache layout is different, because Mullama tracks additional metadata for things like LRU eviction and per-model stats. If you want to avoid the re-download, you can point Mullama at the existing files with mullama create <alias> -f Modelfile where the Modelfile’s FROM line points at the Ollama cache path. The pull step is then a no-op, and the alias resolves to the file you already have on disk.

For most users, just re-pulling is simpler. Disk is cheap; typing migration scripts is not.

Read more


Posted by Mullama team. · more posts · RSS