About Mullama

Mullama is a local LLM runtime built on llama.cpp that runs GGUF models — Llama 3.2, Qwen 2.5, DeepSeek R1, Mistral, Phi 3, Gemma 2, LLaVA, and any other GGUF file from Hugging Face — directly inside your application.

The problem we built it to solve

Local LLM tooling today tends to force you into one of two cul-de-sacs:

Daemon-only. You install a separate process, speak HTTP to it, and pay the round-trip cost — even when "across the network" is just your own machine. Your Node app spawns a Python subprocess just so it can do RAG.
Single-language SDK. The bindings exist only in the language the maintainer happens to write. Everyone else either calls the C library through hand-rolled FFI or gives up and uses the HTTP daemon anyway.

Mullama refuses the choice. It ships one Rust core with idiomatic, first-party bindings for Rust, Python, Node.js, Go, PHP, and C/C++, and it speaks the same HTTP surface as Ollama on port 11434 — plus an Anthropic-compatible API that Ollama doesn't have. Embed it, serve it, or both.
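Because the daemon is described as speaking Ollama's HTTP surface on port 11434, an existing Ollama-style client call should carry over. The sketch below uses only the Python standard library. It assumes a Mullama daemon is listening on localhost:11434, that it accepts Ollama's /api/generate request shape, and that the llama3.2:1b alias is available locally. The helper names build_generate_request and generate are hypothetical and not part of any Mullama SDK.

```python
import json
import urllib.request

# Assumption: a local daemon on the Ollama default port mentioned above.
BASE_URL = "http://localhost:11434"


def build_generate_request(
    model: str, prompt: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) an Ollama-style /api/generate request."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the generated text.

    Requires a running daemon; nothing here starts one.
    """
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        payload = json.load(response)
    # Ollama's non-streaming reply carries the text in the "response" field.
    return payload["response"]
```

With a daemon running, generate("llama3.2:1b", "Why is the sky blue?") would return the completion text. Building the request separately from sending it keeps the payload easy to inspect without a live server.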

What's in the box

A safe Rust API over llama.cpp.
Native bindings for six languages — not HTTP wrappers, real in-process bindings.
An Ollama-compatible daemon: same CLI verbs (run, pull, serve, list, ps, create, show, rm, cp), same Modelfile format, same model registry, same port.
OpenAI- and Anthropic-compatible HTTP APIs.
A model registry that resolves short aliases (llama3.2:1b) to GGUF files on Hugging Face — plus hf:owner/repo escape hatches.
Modelfile parser, embedded Web UI, TUI, and a multimodal pipeline (vision, plus audio with voice activity detection).
Production knobs: per-model resource limits, LRU model eviction, persistent stats via sled, Prometheus metrics.
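LRU model eviction, listed among the production knobs, means that once the number of loaded models reaches a cap, the model used least recently is unloaded first. The following is a generic sketch of that policy, not Mullama's implementation. The ModelCache class, its max_loaded parameter, and the load_fn callback are hypothetical names introduced for illustration.

```python
from collections import OrderedDict
from typing import Callable, Generic, List, TypeVar

M = TypeVar("M")


class ModelCache(Generic[M]):
    """Keep at most `max_loaded` models resident, evicting the least recently used."""

    def __init__(self, max_loaded: int, load_fn: Callable[[str], M]) -> None:
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self.max_loaded = max_loaded
        self.load_fn = load_fn  # hypothetical loader: model name -> loaded model
        self._models: "OrderedDict[str, M]" = OrderedDict()

    def get(self, name: str) -> M:
        if name in self._models:
            # Cache hit: mark this model as most recently used.
            self._models.move_to_end(name)
            return self._models[name]
        # Cache miss: make room first, then load.
        while len(self._models) >= self.max_loaded:
            self._models.popitem(last=False)  # drop the least recently used entry
        model = self.load_fn(name)
        self._models[name] = model
        return model

    def loaded(self) -> List[str]:
        """Model names ordered from least to most recently used."""
        return list(self._models)
```

A real runtime would more likely budget by memory than by model count, and would release GPU or mmap resources on eviction. The count-based cap here only keeps the policy easy to follow.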

What it isn't

Not a hosted service. Mullama runs on your machine, your server, or wherever you put it. There is no Mullama Cloud. There is no Mullama account. There is no telemetry leaving the box unless you explicitly point the OpenAI/Anthropic endpoints at a remote backend.
Not a model trainer. Mullama runs GGUF models. Training and fine-tuning live elsewhere in the ecosystem.
Not a safetensors runtime. If you need vLLM's batching for hosted serving at scale, use vLLM. Mullama is for the local / embedded / single-box case.

Who it's for

Polyglot teams that want one inference dependency across a Node frontend, a Python data pipeline, and a Go control plane — without duct-taping subprocess calls.
Desktop & mobile builders using Electron, Tauri, or native iOS/Android who want to link a static C ABI library and ship inference inside the app binary.
Edge / on-prem deployers who can't or won't send prompts to a hosted API.
Ollama users who like the UX but want in-process bindings, an Anthropic-compatible API, or both.

Design principles

Same CLI as Ollama, by default. Existing muscle memory, Modelfiles, and client code work unchanged.
One core, six idiomatic APIs. Bindings feel native in each language — not transliterated Rust.
No daemon unless you ask for one. Library mode is the same engine, just linked in.
Hardware-honest. Seven acceleration backends (CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL, RPC). CPU works out of the box. Apple Silicon uses Metal automatically.
MIT licensed. Use it in commercial products; the only condition is keeping the copyright and license notice with your copies.

Project & links

github.com/cognisoc/mullama — source, issues, discussions.
docs.cognisoc.com/mullama — full documentation, API reference, deployment recipes.
crates.io/crates/mullama · pypi.org/project/mullama · npmjs.com/package/mullama
Container images: ghcr.io/cognisoc/mullama (CPU) and ghcr.io/cognisoc/mullama:<version>-cuda (NVIDIA).