---
title: "7. Reading the llama.cpp Server Source Code"
order: 7
---
Over the course of six chapters, we built a translation desktop app from scratch. We have a working product, but it's ultimately a "learning-oriented" implementation. So how does "production-quality" code differ? Let's read the source code of llama-server, the official server bundled with llama.cpp, and compare.
llama-server is located at llama.cpp/tools/server/. It uses the same cpp-httplib, so you can read the code the same way as in the previous chapters.
```
llama.cpp/tools/server/
├── server.cpp   # Main server implementation
├── httplib.h    # cpp-httplib (bundled version)
└── ...
```
The code is contained in a single server.cpp. It runs to several thousand lines, but once you understand the structure, you can narrow down the parts worth reading.
The biggest difference between the server we built and llama-server is the API design.
Our API:

```
POST /translate        → {"translation": "..."}
POST /translate/stream → SSE: data: "token"
```

llama-server's API:

```
POST /v1/chat/completions → OpenAI-compatible chat completions
POST /v1/completions      → OpenAI-compatible text completions
POST /v1/embeddings       → Text embedding vectors
```
llama-server conforms to OpenAI's API specification. This means OpenAI's official client libraries (such as the Python openai package) work out of the box.
```python
# Example of connecting to llama-server with the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
Compatibility with existing tools and libraries is a big design decision. We designed a simple translation-specific API, but if you're building a general-purpose server, OpenAI compatibility has become the de facto standard.
Our server processes requests one at a time. If another request arrives while a translation is in progress, it waits until the previous inference finishes. This is fine for a desktop app used by one person, but it becomes a problem for a server shared by multiple users.
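Conceptually, that sequential behavior amounts to a single lock around the model. Here is a minimal sketch of the idea; the names `g_model_mutex`, `run_inference`, and `handle_translate` are illustrative placeholders, not our app's actual code:

```cpp
#include <mutex>
#include <string>

// Hypothetical sketch: one global mutex serializes all inference.
// run_inference stands in for the llm.chat() call in our server.
std::mutex g_model_mutex;

std::string run_inference(const std::string& prompt) {
    return "translated: " + prompt;  // placeholder for the actual LLM work
}

std::string handle_translate(const std::string& prompt) {
    // A second request blocks here until the first releases the lock,
    // which is exactly the "waits until the previous inference finishes"
    // behavior described above.
    std::lock_guard<std::mutex> lock(g_model_mutex);
    return run_inference(prompt);
}
```

With this design, adding HTTP threads does not add throughput: every request still funnels through the one lock.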
llama-server handles concurrent requests through a mechanism called slots.
The key point is that tokens from the slots are not generated one slot at a time, but all together in a single batch per step. GPUs excel at parallel processing, so serving two users simultaneously takes almost the same time as serving one. And because a new request can join the shared batch as soon as a slot frees up, without waiting for the other requests to finish, this technique is called "continuous batching."
In our server, cpp-httplib's thread pool assigns one thread per request, but the inference itself runs single-threaded inside llm.chat(). llama-server consolidates this inference step into a shared batch processing loop.
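The shape of that shared loop can be sketched as follows. The `Slot` struct, `decode_batch`, and `step` below are illustrative placeholders, not llama-server's actual identifiers; the point is that each active slot contributes its latest token to one batch, and a single decode call advances every request at once:

```cpp
#include <string>
#include <vector>

// Simplified sketch of a slot-based batching loop.
struct Slot {
    int id;
    bool active;
    std::vector<int> tokens;  // tokens generated so far for this request
};

// Stand-in for one batched forward pass over all slots' pending tokens.
// A real server would run the model here; we just "sample" t + 1.
std::vector<int> decode_batch(const std::vector<int>& batch) {
    std::vector<int> next;
    for (int t : batch) next.push_back(t + 1);
    return next;
}

// One iteration of the shared loop: gather, decode once, scatter back.
void step(std::vector<Slot>& slots) {
    std::vector<int> batch;
    std::vector<Slot*> owners;
    for (auto& s : slots) {
        if (!s.active) continue;
        batch.push_back(s.tokens.back());  // each slot's latest token
        owners.push_back(&s);
    }
    if (batch.empty()) return;
    // One forward pass serves every active request simultaneously.
    std::vector<int> sampled = decode_batch(batch);
    for (size_t i = 0; i < sampled.size(); ++i)
        owners[i]->tokens.push_back(sampled[i]);
}
```

Inactive slots simply skip the round, and a request that arrives mid-generation just claims a free slot and appears in the next batch.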
The streaming mechanism itself is the same (set_chunked_content_provider + SSE), but the data format differs.
Our format:

```
data: "去年の"
data: "春に"
data: [DONE]
```

llama-server (OpenAI-compatible):

```
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"}}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"}}]}
data: [DONE]
```
Our format simply sends the tokens. Because llama-server follows the OpenAI specification, even a single token comes wrapped in JSON. It may look verbose, but it includes useful information for clients, like an id to identify the request and a finish_reason to indicate why generation stopped.
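To make the chunk format concrete, here is a rough sketch of wrapping one token into such a line. This is not how llama-server actually builds it (it uses a JSON library and includes more fields, such as model, created, and finish_reason), and a real implementation would also need to JSON-escape the token:

```cpp
#include <string>

// Illustrative sketch: wrap a token delta in an OpenAI-style SSE chunk
// line. Assumes `token` contains no characters needing JSON escaping.
std::string sse_chunk(const std::string& id, const std::string& token) {
    return "data: {\"id\":\"" + id +
           "\",\"object\":\"chat.completion.chunk\","
           "\"choices\":[{\"delta\":{\"content\":\"" + token +
           "\"}}]}\n\n";
}
```

On the client side, the cost of this verbosity is that every chunk must be parsed as JSON before the token can be extracted, rather than used directly as in our format.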
In our server, we process the entire prompt from scratch on every request. Our translation app's prompt is short ("Translate the following text to ja..." + input text), so this isn't a problem.
llama-server reuses the KV cache for the prefix portion when a request shares a common prompt prefix with a previous request.
For chatbots that send a long system prompt and few-shot examples with every request, this alone dramatically reduces response time. The difference is night and day: processing several thousand tokens of system prompt every time versus reading them from cache in an instant.
For our translation app, where the system prompt is just a single sentence, the benefit is limited. However, it's an optimization worth keeping in mind when applying this to your own applications.
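The core of prefix reuse is simply measuring how many leading tokens the new prompt shares with what the cache already holds; only the tail after that point needs a fresh forward pass. A minimal sketch of that comparison (not llama-server's actual code):

```cpp
#include <cstddef>
#include <vector>

// Count how many leading tokens of the new prompt are already covered
// by the KV cache. Everything before this index can be skipped; only
// prompt[n..] needs to be processed by the model.
size_t common_prefix_len(const std::vector<int>& cached,
                         const std::vector<int>& prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n])
        ++n;
    return n;
}
```

For a chatbot whose requests share a 2,000-token system prompt, this comparison lets the server process only the handful of new user tokens instead of the whole prompt.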
Since our translation API returns plain text, there was no need to constrain the output format. But what if you want the LLM to respond in JSON?
```
Prompt: Analyze the sentiment of the following text and return it as JSON.

LLM output (expected): {"sentiment": "positive", "score": 0.8}
LLM output (reality):  Here are the results of the sentiment analysis. {"sentiment": ...
```
LLMs sometimes ignore instructions and add extraneous text. llama-server solves this problem with grammar constraints.
```sh
curl http://localhost:8080/v1/chat/completions \
  -d '{
    "messages": [{"role": "user", "content": "Analyze sentiment..."}],
    "json_schema": {
      "type": "object",
      "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number"}
      },
      "required": ["sentiment", "score"]
    }
  }'
```
When you specify json_schema, tokens that don't conform to the grammar are excluded during token generation. This guarantees that the output is always valid JSON, so there's no need to worry about json::parse failing.
When embedding LLMs into applications, whether you can reliably parse the output directly impacts reliability. Grammar constraints are unnecessary for free-text output like translation, but they're essential for use cases where you need to return structured data as an API response.
Let's organize the differences we've covered.
| Aspect | Our Server | llama-server |
|---|---|---|
| API design | Translation-specific | OpenAI-compatible |
| Concurrent requests | Sequential processing | Slots + continuous batching |
| SSE format | Tokens only | OpenAI-compatible JSON |
| KV cache | Cleared each time | Prefix reuse |
| Structured output | None | JSON Schema / grammar constraints |
| Code size | ~200 lines | Several thousand lines |
Our code is simple because of the assumption that "one person uses it as a desktop app." If you're building a server for multiple users or one that integrates with the existing ecosystem, llama-server's design serves as a valuable reference.
Conversely, even 200 lines of code is enough to make a fully functional translation app. I hope this code reading exercise has also conveyed the value of "building only what you need."
In the next chapter, we'll cover the key points for swapping in your own library and customizing the app to make it truly yours.
Next: Making It Your Own