---
title: "2. Integrating llama.cpp to Build a REST API"
order: 2
---
In the skeleton from Chapter 1, `/translate` simply returned `"TODO"`. In this chapter we integrate llama.cpp inference and turn it into an API that actually returns translation results.
Calling the llama.cpp API directly makes the code quite long, so we use a thin wrapper library called cpp-llamalib. It lets you load a model and run inference in just a few lines, keeping the focus on cpp-httplib.
Simply pass the path to a model file to llamalib::Llama, and model loading, context creation, and sampler configuration are all taken care of. If you downloaded a different model in Chapter 1, adjust the path accordingly.
```cpp
#include <cpp-llamalib.h>
#include <httplib.h>

int main() {
  auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};

  httplib::Server svr;
  // LLM inference takes time, so set a longer timeout (default is 5 seconds)
  svr.set_read_timeout(300);
  svr.set_write_timeout(300);

  // ... Build and start the HTTP server ...
}
```
If you want to change the number of GPU layers, context length, or other settings, you can specify them via llamalib::Options.
```cpp
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf", {
  .n_gpu_layers = 0,  // CPU only
  .n_ctx = 4096,
}};
```
## The /translate Handler

We replace the handler that returned dummy JSON in Chapter 1 with actual inference.
```cpp
svr.Post("/translate",
    [&](const httplib::Request &req, httplib::Response &res) {
      // Parse JSON (3rd arg `false`: don't throw on failure,
      // check with `is_discarded()`)
      auto input = json::parse(req.body, nullptr, false);
      if (input.is_discarded()) {
        res.status = 400;
        res.set_content(json{{"error", "Invalid JSON"}}.dump(),
                        "application/json");
        return;
      }

      // Validate required fields
      if (!input.contains("text") || !input["text"].is_string() ||
          input["text"].get<std::string>().empty()) {
        res.status = 400;
        res.set_content(json{{"error", "'text' is required"}}.dump(),
                        "application/json");
        return;
      }

      auto text = input["text"].get<std::string>();
      auto target_lang = input.value("target_lang", "ja");  // Default is Japanese

      // Build the prompt and run inference
      auto prompt = "Translate the following text to " + target_lang +
                    ". Output only the translation, nothing else.\n\n" + text;
      try {
        auto translation = llm.chat(prompt);
        res.set_content(json{{"translation", translation}}.dump(),
                        "application/json");
      } catch (const std::exception &e) {
        res.status = 500;
        res.set_content(json{{"error", e.what()}}.dump(), "application/json");
      }
    });
```
llm.chat() can throw exceptions during inference (for example, when the context length is exceeded). By catching them with try/catch and returning the error as JSON, we prevent the server from crashing.
Here is the finished code with all the changes so far.
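The listing below assembles the snippets from this chapter into one file. It is a sketch: it assumes the Chapter 1 skeleton used nlohmann/json with a `using json = nlohmann::json;` alias and listened on port 8080 (the port the `curl` example below targets).

```cpp
#include <cpp-llamalib.h>
#include <httplib.h>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
  // Load the model (adjust the path if you downloaded a different one)
  auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};

  httplib::Server svr;
  // LLM inference takes time, so set a longer timeout (default is 5 seconds)
  svr.set_read_timeout(300);
  svr.set_write_timeout(300);

  svr.Post("/translate",
      [&](const httplib::Request &req, httplib::Response &res) {
        auto input = json::parse(req.body, nullptr, false);
        if (input.is_discarded()) {
          res.status = 400;
          res.set_content(json{{"error", "Invalid JSON"}}.dump(),
                          "application/json");
          return;
        }
        if (!input.contains("text") || !input["text"].is_string() ||
            input["text"].get<std::string>().empty()) {
          res.status = 400;
          res.set_content(json{{"error", "'text' is required"}}.dump(),
                          "application/json");
          return;
        }
        auto text = input["text"].get<std::string>();
        auto target_lang = input.value("target_lang", "ja");
        auto prompt = "Translate the following text to " + target_lang +
                      ". Output only the translation, nothing else.\n\n" + text;
        try {
          auto translation = llm.chat(prompt);
          res.set_content(json{{"translation", translation}}.dump(),
                          "application/json");
        } catch (const std::exception &e) {
          res.status = 500;
          res.set_content(json{{"error", e.what()}}.dump(),
                          "application/json");
        }
      });

  svr.listen("0.0.0.0", 8080);
}
```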
Rebuild and start the server, then verify that it now returns actual translation results.
```sh
cmake --build build -j
./build/translate-server
```

Then, in another terminal:

```sh
curl -X POST http://localhost:8080/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
# => {"translation":"去年の春に東京を訪れた。桜が綺麗だった。"}
```
In Chapter 1 the response was `"TODO"`, but now you get an actual translation back.
The REST API we built in this chapter waits for the entire translation to complete before sending the response, so for long texts the user has to wait with no indication of progress.
In the next chapter, we use SSE (Server-Sent Events) to stream tokens back in real time as they are generated.