/blog · 2026-05-12 · architecture · bindings

Six languages, one runtime: why Mullama ships native bindings instead of HTTP

Name: mullama
Author: Cognisoc

Most local LLM runtimes give you exactly one of two things: a hand-rolled binding in the maintainer's favorite language, or an HTTP daemon. Mullama refuses the choice. Here's why.

If you’ve tried to embed a local LLM in a real polyglot stack, you’ve probably hit this wall already. Your data team is in Python. Your API gateway is Node. Your control plane is Go. Somewhere there’s a PHP service nobody admits to. And inference, today, gives you exactly two options — neither of which is good.

The two cul-de-sacs

Option A: an HTTP daemon. You run a separate process — usually Ollama, sometimes vLLM, sometimes a homegrown wrapper around llama.cpp — and every service in your stack talks to it over localhost:11434. This works. It is also the most expensive way to move a sequence of integers from one process to another. Every prompt serializes to JSON, copies across a socket, deserializes, runs through inference, serializes the response, copies back, deserializes. For a RAG pipeline doing tens of thousands of small completions per minute, that overhead is real, measurable, and entirely avoidable when the model lives on the same box.

Option B: a single-language SDK. You pick whatever binding the maintainer of your runtime happens to ship — usually Python, because ML — and now your Node service either spawns a Python subprocess (yes, people do this in production, and yes, it is as bad as it sounds) or gives up and uses Option A.

Mullama exists because that’s a false dichotomy.

What “native bindings” actually means

The Mullama core is written in Rust on top of llama.cpp. From that core we build idiomatic, first-party bindings for Rust, Python, Node.js, Go, PHP, and C/C++ — six languages, one engine. The bindings are not HTTP wrappers. They are not thin shells that shell out to a daemon. They are real in-process FFI: the model weights live in your process’s address space, the inference loop runs in your process, and a token comes back as a value in your language’s native string type.

That has a few consequences worth being explicit about.

Zero IPC overhead. When your Node service calls ctx.generate("Hello, AI!"), the call goes directly into the Rust core through a thin N-API binding. No socket. No JSON. No daemon. For high-frequency calls — embeddings during RAG ingestion, batch classification, agent loops that fire many small completions — this materially changes the cost structure.

One model, one copy. A daemon has to load the model weights into its own memory and then serve them to every client. If you embed the runtime, you load the weights once, in the process that needs them. For a 70B model quantized to 4 bits, that’s the difference between “can I run this on a laptop” and “no.”

No orchestration. There’s no second process to start, no healthcheck to write, no systemd unit to babysit, no port to keep free. Your binary is your runtime.

Same engine across languages. This is the part that catches people. Because all six bindings sit on top of the same Rust core, the behavior is identical across languages. The sampler doesn’t drift between Python and Go. A Modelfile that works in Rust works in PHP. Quantization settings, GPU backend selection, context windows, and multimodal pipelines all behave the same way regardless of which language is at the top of the stack.

When the daemon is still the right answer

To be clear: Mullama still ships a daemon. The daemon is what runs when you type mullama serve, and it exposes both an OpenAI-compatible HTTP API and an Anthropic-compatible one on port 11434 — the same port Ollama uses, on purpose, so your existing client code keeps working without modification.

You want the daemon when:

You have many short-lived processes (a web request handler that spins up and tears down dozens of times per second) and you don’t want each one paying the cost of loading model weights.
You’re serving multiple models on one box and want LRU eviction to manage memory across them.
You want the OpenAI / Anthropic SDK to work as-is — pointed at http://localhost:11434/v1 or http://localhost:11434 — without changing client code.
You want a Web UI or TUI for chat, model management, and a playground.

You want the library when:

You’re building a desktop app (Tauri, Electron, native) and the model needs to ship inside the binary.
You’re building an edge / on-device deployment with no network dependency.
Your service does enough inference per second that HTTP round-trips meaningfully bite.
You don’t want a second process in your container image.

The point is: with Mullama you don’t have to choose at architecture time. The same engine runs both modes. The same .gguf file. The same sampler. You can start with the daemon for prototyping, embed the library when you need the throughput, and switch back — or run both side by side — without rewriting your inference layer.

What this looks like in practice

Here’s a concrete example. You’re building a CLI tool in Go that does batch document summarization. With a daemon-only runtime, your deployment story is: “install Ollama, start it, make sure port 11434 is free, then run my tool.” With a Python-binding-only runtime, your deployment story is: “install Python, install a wheel that may or may not match the user’s CPU architecture, hope the FFI works.”

With Mullama, your deployment story is: go install github.com/cognisoc/mullama-cli. The binary ships with the inference engine linked in. The user does not need to install anything else. There is no second process. There is no Python.

Same example, but you’re writing a desktop app in Tauri. The Mullama C ABI library is a static library you link directly into your Rust backend. The model file ships in the app bundle. The app works offline by construction — there’s nothing for it to talk to, because the model is in-process.

Same example, but you’re writing a Laravel app. composer require mullama/mullama and you have an LLM accessible from a controller without spawning anything. (And yes, we know how unusual it is for a PHP project to get first-party bindings to anything in AI. That’s deliberate.)

What we gave up

A few things, in the name of being honest.

Mullama runs GGUF only. If you want to run safetensors at hosted- serving scale with continuous batching, use vLLM — that’s what it’s for. Mullama is for the local / embedded / single-box case.

The bindings cover inference and embeddings, not training. If you want to fine-tune, that lives elsewhere in the ecosystem. Mullama runs the result.

And the six bindings have to stay in lockstep. Adding a new feature to the Rust core means exposing it across six APIs. That’s a real cost, paid by us, so that you don’t pay the “which language do I support today” cost. It’s the deal.

Try it

cargo add mullama        # Rust
pip install mullama      # Python
npm install mullama      # Node.js
go get github.com/cognisoc/mullama   # Go
composer require mullama/mullama     # PHP

Or, if you just want a CLI: curl -fsSL https://mullama.cognisoc.com/install.sh | sh. The docs cover the full library API per language, daemon configuration, and the OpenAI / Anthropic HTTP surface.

Posted by Mullama team. · more posts · RSS