Embed the runtime: why we put a static C ABI in the box
Daemons are great until you're shipping a desktop app, a CLI tool, or an on-device deployment. Mullama links into your binary as a static library. Here's what that unlocks — and what it costs.
There is a class of problem where the right answer is “a daemon on port 11434,” and there is a class of problem where it is very much not. This post is about the second class: desktop apps, CLI tools, on-device deployments, mobile, anything where shipping a second process alongside your binary is awkward, impossible, or just wrong.
For that class, Mullama exposes a static C ABI library you can link directly into your application. The model runs in your process. There is no daemon. There is no HTTP. There is no second thing for the user to install. Your binary is your runtime.
What “static C ABI” actually means
Mullama’s Rust core builds to a static library (.a on Unix,
.lib on Windows) with a stable C interface declared in
mullama.h. You link the static library into your project, you
#include "mullama.h", and you get inference as a function call:
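#include <stdio.h>   /* puts */
#include "mullama.h"

int main(void) {
    /* Load the model file, then create an inference context on top of it. */
    mullama_model* model = mullama_load_model("llama3.2-1b.gguf", NULL);
    mullama_ctx* ctx = mullama_new_context(model, NULL);

    /* Generate into a caller-owned buffer and print the result. */
    char buf[2048];
    mullama_generate(ctx, "Hello, AI!", buf, sizeof(buf));
    puts(buf);
    return 0;
}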
That’s it. No dynamic linker games. No version-mismatch surprises
between your app and a system-installed libmullama.so. The
inference engine ships with your binary, frozen at the version you
built against.
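For readers wondering what sits on the other side of that header, here is a minimal sketch of how a Rust core can expose a C-callable function that writes into a caller-owned buffer. The name `demo_generate`, its return convention, and the echo behavior are all hypothetical stand-ins for illustration; this is not Mullama's actual export.

```rust
use std::ffi::{c_char, c_int, CStr};

/// Hypothetical stand-in for an exported inference call (not Mullama's API).
/// Writes a NUL-terminated reply into `out` and returns the number of bytes
/// written, excluding the terminator, or -1 on bad arguments.
/// A real export would also be marked no_mangle so the symbol keeps this
/// exact name in the static library.
///
/// # Safety
/// `prompt` must point to a NUL-terminated string and `out` must point to
/// at least `cap` writable bytes.
pub unsafe extern "C" fn demo_generate(
    prompt: *const c_char,
    out: *mut c_char,
    cap: usize,
) -> c_int {
    if prompt.is_null() || out.is_null() || cap == 0 {
        return -1;
    }
    // Borrow the caller's C string; invalid UTF-8 is replaced, not rejected.
    let prompt = unsafe { CStr::from_ptr(prompt) }.to_string_lossy();
    // Placeholder for real inference: echo the prompt back.
    let reply = format!("echo: {prompt}");
    // Truncate by bytes to fit the buffer, leaving room for the terminator.
    let n = reply.len().min(cap - 1);
    unsafe {
        std::ptr::copy_nonoverlapping(reply.as_ptr(), out.cast::<u8>(), n);
        *out.add(n) = 0;
    }
    n as c_int
}
```

The caller owns the buffer and passes its capacity, so no allocation crosses the language boundary. That is the same shape as the `buf` and `sizeof(buf)` arguments in the C example above.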
The other five bindings — Rust, Python, Node.js, Go, PHP — sit on the same core, but the C ABI is the one that opens up the embedded-everything use case, because every other systems language on earth can call into C.
Use case 1: desktop apps (Tauri, Electron, native)
You’re building a Tauri app — Rust backend, web frontend, ships as a single executable. You want the user to be able to use an LLM without installing Ollama, without configuring a daemon, without opening up a port on their machine.
With Mullama linked in, the model is part of the app:
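use mullama::{Model, Context, ContextParams};

#[tauri::command]
fn chat(prompt: String) -> String {
    // Loads the model on every call to keep the example short; a real app
    // would load it once at startup and keep it in shared state.
    let model = Model::load("llama3.2-1b.gguf").unwrap();
    let mut ctx = Context::new(&model, ContextParams::default()).unwrap();
    // Generate up to 256 tokens for the given prompt.
    ctx.generate(&prompt, 256).unwrap()
}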
You bundle the .gguf file in your app resources, you ship the
binary, the user double-clicks. Done. No “please install
Ollama” page in your onboarding. No port conflicts. No firewall
prompts. No background daemon that keeps running after the user
quits your app.
Electron is the same shape: a native Node module that calls the JavaScript binding, which calls into the same Rust core, which is linked statically into the module. From the user’s perspective, there’s just one app.
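In a long-lived desktop process you would normally pay the model load once rather than per request. A minimal sketch of that pattern with the standard library's `OnceLock` follows. `LoadedModel`, `shared_model`, and the `LOADS` counter are hypothetical names for illustration; a real app would store the binding's own model type instead of the placeholder struct.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical placeholder for a loaded model handle.
pub struct LoadedModel {
    pub path: String,
}

/// Counts how many times the expensive load actually ran.
pub static LOADS: AtomicUsize = AtomicUsize::new(0);

/// Process-wide slot that is filled at most once.
static MODEL: OnceLock<LoadedModel> = OnceLock::new();

/// Returns the shared model, loading it on first use only.
/// Later calls ignore `path` and return the already-loaded instance.
pub fn shared_model(path: &str) -> &'static LoadedModel {
    MODEL.get_or_init(|| {
        LOADS.fetch_add(1, Ordering::SeqCst);
        // The expensive disk read would happen here.
        LoadedModel { path: path.to_string() }
    })
}
```

Every command handler can then call `shared_model(...)` and only the first caller pays the load; concurrent first calls are serialized by `OnceLock`.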
Use case 2: CLI tools
You’re shipping a CLI that does LLM-flavored things — code review, commit message generation, log analysis, whatever. Your install instructions today probably include “and also install Ollama and make sure it’s running.” That’s a non-trivial ask for a CLI.
With Mullama, your install instructions are:
go install github.com/yourorg/yourtool@latest
The binary ships with the inference engine linked in. First run downloads the model file (or it’s bundled, your choice). Every run after that is offline.
For batch processing (the kind of workload where you fire a few thousand small completions in a tight loop) the in-process binding also removes the HTTP round trip from every call. The latency shape changes from “P50 = HTTP overhead + inference” to “P50 = inference.” For a small model on a fast machine, where each completion is itself short, that overhead can be a meaningful share of the wall-clock time.
Use case 3: on-device / edge
The strongest case is the one where there is no network. The device on the assembly-line floor that classifies parts using a vision model. The kiosk that does voice transcription in a place with terrible connectivity. The medical device that legally cannot send data off-box. The drone.
For these deployments, the daemon model is structurally wrong. There’s no second process to manage; there’s barely a first one. You need the inference engine to be part of your application binary, and you need it to behave the same way on every device.
The Mullama build supports cross-compilation to the target
architectures you’d expect (aarch64-linux for embedded boards,
Apple Silicon for Mac fleet devices, Windows for kiosks), and the
GPU backend selection happens at build time via environment
variables (LLAMA_CUDA=1, LLAMA_METAL=1, and so on through
seven backends). The binary you ship is shaped for the hardware
you ship it on.
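To make the build-time selection concrete, here is a rough sketch of the kind of logic a build script could use to turn those variables into a backend choice. Only `LLAMA_CUDA` and `LLAMA_METAL` come from this post; the `Backend` enum, the `pick_backend` function, the precedence order, and the CPU fallback are assumptions for illustration, and the real build handles more backends than shown here.

```rust
/// Backends named in this post; the real build supports more.
#[derive(Debug, PartialEq)]
pub enum Backend {
    Cuda,
    Metal,
    Cpu,
}

/// Picks a backend from build-time variables.
/// `lookup` stands in for `std::env::var` so the logic can be exercised
/// without touching the real environment.
pub fn pick_backend(lookup: impl Fn(&str) -> Option<String>) -> Backend {
    // A variable counts as enabled only when it is set to exactly "1".
    let enabled = |name: &str| lookup(name).as_deref() == Some("1");
    if enabled("LLAMA_CUDA") {
        Backend::Cuda
    } else if enabled("LLAMA_METAL") {
        Backend::Metal
    } else {
        // Assumed fallback when no GPU variable is set.
        Backend::Cpu
    }
}
```

In a build script this would be called as `pick_backend(|k| std::env::var(k).ok())`, with the result deciding which native sources and link flags to enable.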
Use case 4: mobile
iOS and Android are the hardest case, and we’re being honest about that. The C ABI library cross-compiles to ARM64 for both platforms, and the static library form factor is what makes that viable — you can vendor it into your Xcode project or Android NDK build and call into it from Swift or Kotlin via the usual FFI patterns.
There are real constraints here: model size on mobile is non-trivial (you want the smallest quantization that gives acceptable quality), GPU access through Metal on iOS and through Vulkan on Android is fiddly, and memory pressure on mobile is a different beast than on a desktop. Mullama gives you the engine; the deployment work is yours. We’re not going to oversell this.
What you give up
A few things, in the name of being honest.
Multi-tenant model serving isn’t what the embedded path is for. If you want to serve five different models with LRU eviction and per-model resource limits, run the daemon. That’s what the daemon is for. The library is for “one app, one (or a few) models.”
The Web UI and TUI are daemon features. If you want a chat interface, ship one in your app or run the daemon side-by-side.
Model registry alias resolution in library mode requires
either bundling the model file or doing the fetch yourself.
hf:owner/repo resolution is implemented in the library, but
you’re responsible for caching the result. The daemon handles
this for you because it has a persistent state directory; the
library leaves the policy to you, because in an embedded context
“where do the model files live” is an architectural decision, not
a default.
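One possible caching policy, sketched with the standard library only: map the model reference to a file under a directory you choose, and fetch only when that file is missing. The names `cache_path` and `ensure_cached`, the on-disk layout, and the `fetch` callback are all hypothetical; `fetch` stands in for whatever resolves and downloads the model, and real code would stream to disk rather than hold a multi-gigabyte file in memory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Maps a reference such as "hf:owner/repo" to a file under `cache_dir`.
/// The layout is an illustrative choice, not Mullama's.
pub fn cache_path(cache_dir: &Path, model_ref: &str) -> PathBuf {
    // Replace anything that is unsafe in a file name with an underscore.
    let safe: String = model_ref
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' || c == '.' { c } else { '_' })
        .collect();
    cache_dir.join(format!("{safe}.gguf"))
}

/// Returns the cached file, calling `fetch` only when it is missing.
pub fn ensure_cached(
    cache_dir: &Path,
    model_ref: &str,
    fetch: impl FnOnce(&str) -> io::Result<Vec<u8>>,
) -> io::Result<PathBuf> {
    let path = cache_path(cache_dir, model_ref);
    if !path.exists() {
        fs::create_dir_all(cache_dir)?;
        // Write under a temporary name first so an interrupted download
        // never leaves a partial file at the final path.
        let tmp = path.with_extension("part");
        fs::write(&tmp, fetch(model_ref)?)?;
        fs::rename(&tmp, &path)?;
    }
    Ok(path)
}
```

Where `cache_dir` points (app data directory, a read-only bundle, a shared volume) is exactly the architectural decision the library leaves to you.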
Build-time GPU backend selection. The daemon can be installed with the right backend for the host. An embedded library has to be built with the right backend for the target. That’s a build-system cost you pay so the runtime can be a single linked-in unit.
A note on cold-start cost
One thing worth being upfront about: embedding the runtime trades the daemon’s cold-start cost for your process’s cold-start cost. When you load a 4 GB quantized model from disk into memory, that takes time — usually somewhere between a few hundred milliseconds and a couple of seconds, depending on disk speed and quantization. If your application is short-lived (a CLI tool that runs for two seconds), you’ll feel that load every time you invoke it.
There are three ways out, in order of effort:
- Use a smaller model. A 1B-parameter model loads noticeably faster than a 7B one. For many CLI use cases (commit message generation, log summarization, code review suggestions) a 1B model is enough. Mullama’s alias table includes small models for exactly this reason.
- Pre-warm via the daemon. If you have many short-lived processes that all want inference, run mullama serve in the background and have them speak HTTP to it. You give up the in-process advantage, but you amortize the model-load cost across every invocation. This is the right shape for, e.g., a git hook that fires dozens of times a day.
- Memory-map the GGUF. Mullama uses memory-mapped GGUF files by default, which means the first load pays disk I/O but the second load (within the same OS file cache window) is near-instant. Long-running services don’t pay this twice.
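Before picking one of these, it helps to measure the load cost your users actually pay. A small timing helper like the one below is enough; `timed` is a hypothetical name, and the closure you pass would wrap your own model-load call.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}
```

Run it twice in a row around the load: the first figure approximates a cold load, and the second shows how much the OS file cache is helping on your target hardware.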
Try it
The C ABI library and headers are in bindings/ffi/ in the
GitHub repo. The build
produces a static library you can vendor. The other five language
bindings — for the cases where C ABI is overkill — are documented
per language in the docs.
The daemon is one mullama serve away when you want it. The
library is one cargo add mullama (or pip install mullama, or
npm install mullama, or composer require mullama/mullama) when
you don’t. Same engine. Same behavior. Different shape for
different deployments.
Posted by Mullama team.