Drop-in for Ollama: same CLI, same Modelfile, same port — plus things Ollama doesn't have
Mullama is wire-compatible with Ollama at the CLI, the Modelfile format, the model registry, and the HTTP port. Existing client code keeps working. Here's exactly what stays the same and what's new.
If you’ve already invested in Ollama — your team’s muscle memory,
your Modelfiles, your client SDK calls pointed at
localhost:11434 — the question for Mullama is simple: how much do
I have to change?
Short answer: nothing, unless you want to. Long answer follows.
The CLI is the same CLI
Ollama’s CLI verbs are a clean surface, and Mullama re-implements them. Every command you already type works:
mullama run llama3.2:1b "Hello" # was: ollama run
mullama pull qwen2.5:7b # was: ollama pull
mullama serve --model llama3.2:1b # was: ollama serve
mullama chat # was: ollama chat
mullama list # was: ollama list
mullama ps # was: ollama ps
mullama create my-model -f Modelfile # was: ollama create
mullama show llama3.2:1b # was: ollama show
mullama rm old-model # was: ollama rm
mullama cp llama3.2:1b my-copy # was: ollama cp
If you have shell scripts, Makefile targets, or CI pipelines that
call ollama, the migration is a global search-and-replace from
ollama to mullama. (Or, if you’d rather not change anything:
alias ollama=mullama. We won’t tell.)
The daemon listens on the same port
Mullama’s serve mode binds to port 11434 by default — the Ollama
port. Anything in your stack already configured to talk to
http://localhost:11434 keeps working without modification. That
includes:
- The official
openaiPython SDK pointed athttp://localhost:11434/v1. - The
openaiNode.js SDK, same base URL. - LangChain’s
ChatOpenAI. - LlamaIndex’s
OpenAILike. - Any custom HTTP client speaking OpenAI’s chat completions or embeddings format.
The HTTP surface is wire-compatible with the OpenAI API and
Ollama’s own API. Streaming (SSE) works. Function calling via the
OpenAI tools field works. Embeddings work.
Modelfiles are the same Modelfiles
Ollama’s Modelfile format — FROM, PARAMETER, TEMPLATE,
SYSTEM, MESSAGE, LICENSE, ADAPTER — is parsed by Mullama
without modification. The Modelfile that builds your custom
“code-review-bot” in Ollama builds the same model in Mullama. The
parameter names are the same. The template syntax is the same.
FROM llama3.2:1b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM """You are a senior code reviewer..."""
mullama create code-review-bot -f Modelfile produces the same
runtime behavior as ollama create would have.
The model registry resolves the same aliases
Type llama3.2:1b in Ollama, you get the standard Llama 3.2 1B
GGUF. Type the same thing in Mullama, you get the same GGUF — same
quantization, same Hugging Face source. The alias table is
pre-configured with the families Ollama users expect:
- Llama, Qwen 2.5 (including coder variants), DeepSeek R1, Mistral, Mixtral, Codestral, Phi 3 / 3.5, Gemma 2.
- Vision: LLaVA, llava-phi3, Moondream.
- Embeddings: BGE, Nomic.
- Code: StarCoder 2.
For anything not in the alias table, two escape hatches:
mullama run hf:bartowski/Llama-3.2-1B-Instruct-GGUF "Hello"
mullama run /path/to/local/model.gguf "Hello"
Both work. The first pulls from Hugging Face. The second loads a local file directly.
What Mullama adds that Ollama doesn’t have
This is the part that motivates switching.
Anthropic-compatible API. Mullama’s serve mode also speaks
the Anthropic message format. Point an anthropic SDK at
http://localhost:11434 and it works — same as the OpenAI side.
For teams that have standardized on the Anthropic SDK shape (and
there are a lot of you), this means no second adapter layer.
Native in-process bindings for six languages. Ollama is HTTP-only. Mullama ships first-party bindings for Rust, Python, Node.js, Go, PHP, and C/C++. The model lives in your process; no HTTP round-trip; no second daemon. For high-frequency inference — agent loops, batch embedding, RAG pipelines doing thousands of small completions — this changes the cost structure.
More GPU backends. Ollama ships CUDA, Metal, ROCm, and limited Vulkan. Mullama ships seven: CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL (Intel Arc and Data Center GPUs), and RPC for distributed inference across machines. Apple Silicon uses Metal automatically. CPU works out of the box.
Built-in Web UI and TUI. Ollama leaves the UI to third parties. Mullama ships an embedded Web UI for chat and model management, and a TUI for the terminal-only crowd. Both are part of the daemon binary; no extra installation.
Production knobs. Per-model resource limits. LRU model
eviction when memory pressure hits. Persistent statistics via
sled. Prometheus metrics endpoint. These are the kinds of things
you only notice you needed once you’ve put a local LLM behind a
real HTTP service.
First-class multimodal. Image (LLaVA, Moondream) and real-time audio with voice activity detection are first-class in Mullama. The audio pipeline does streaming capture, VAD, speech-to-text, and streaming LLM responses — useful for voice assistants and transcription tools.
A migration in three steps
If you’re moving an existing Ollama deployment to Mullama, here’s the actual process:
1. Install Mullama side-by-side.
curl -fsSL https://mullama.cognisoc.com/install.sh | sh
Both ollama and mullama can coexist; they just can’t both bind
port 11434 at the same time. Stop Ollama. Start Mullama.
2. Copy your Modelfiles. They work unchanged.
mullama create my-custom-bot -f /path/to/existing/Modelfile
The pulled GGUF cache from Ollama lives in ~/.ollama/models.
Mullama uses its own cache; pulls are not free the first time. If
that’s a concern, the alias resolution is fast and the GGUFs are
identical bytes, so disk usage is the only cost.
3. Point your client code at Mullama (or don’t).
If your code talks to localhost:11434, you’re done. If you want
to also use the Anthropic-compatible endpoint, point an anthropic
SDK at the same host. If you want to embed the runtime in-process
to remove the HTTP hop, swap your HTTP call for the language
binding.
When to stay on Ollama
Honest answer: if Ollama’s HTTP-only model fits your stack and you
don’t need an Anthropic-compatible API, native bindings, or the
extra GPU backends, there is no urgent reason to switch. Both
projects sit on top of llama.cpp. The numerical behavior is
equivalent for equivalent models. The CLI you already know works.
Mullama wins when you need more than HTTP — when “embed it in this process” or “talk to it with the Anthropic SDK” stops being a nice-to-have and starts being a requirement.
A note on coexistence
A surprising number of teams end up running both. The Ollama daemon stays where it is, serving an existing toolchain that was written against it years ago and works fine. Mullama goes onto a new service — a Go batch processor, a Tauri desktop app, a Python ingestion job that does ten thousand embeddings per minute — specifically because that service wanted in-process inference or the Anthropic SDK surface. There’s nothing wrong with this shape. Both daemons can’t bind 11434 on the same host at the same time, but on different hosts (or in different containers with different port mappings) they live side by side without issue. The GGUF cache directory is configurable per-process if disk is at a premium and you’d rather not double the model footprint.
What about the model cache?
A common migration question: do I have to re-download every
model? The honest answer is yes by default, no with effort. The
on-disk GGUF bytes are identical between Ollama and Mullama —
they’re the same files from the same Hugging Face repositories —
but the cache layout is different, because Mullama tracks
additional metadata for things like LRU eviction and per-model
stats. If you want to avoid the re-download, you can point
Mullama at the existing files with mullama create <alias> -f Modelfile where the Modelfile’s FROM line points at
the Ollama cache path. The pull step is then a no-op, and the
alias resolves to the file you already have on disk.
For most users, just re-pulling is simpler. Disk is cheap; typing migration scripts is not.
Read more
Posted by Mullama team. · more posts · RSS