---
title: "4. Adding Model Download and Management"
order: 4
---
By the end of Chapter 3, the server's translation functionality was fully in place. However, the only model file available is the one we manually downloaded in Chapter 1. In this chapter, we'll use cpp-httplib's client functionality to enable downloading and switching Hugging Face models from within the app.
Once complete, you'll be able to manage models with requests like these:
# Get the list of available models
curl http://localhost:8080/models
{
  "models": [
    {"name": "gemma-2-2b-it", "params": "2B", "size": "1.6 GB", "downloaded": true, "selected": true},
    {"name": "gemma-2-9b-it", "params": "9B", "size": "5.8 GB", "downloaded": false, "selected": false},
    {"name": "Llama-3.1-8B-Instruct", "params": "8B", "size": "4.9 GB", "downloaded": false, "selected": false}
  ]
}
# Select a different model (automatically downloads if not yet available)
curl -N -X POST http://localhost:8080/models/select \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-9b-it"}'
data: {"status":"downloading","progress":0}
data: {"status":"downloading","progress":12}
...
data: {"status":"downloading","progress":100}
data: {"status":"loading"}
data: {"status":"ready"}
So far we've only used httplib::Server, but cpp-httplib also provides client functionality. Since Hugging Face uses HTTPS, we need a TLS-capable client.
#include <httplib.h>
// Including the URL scheme automatically uses SSLClient
httplib::Client cli("https://huggingface.co");
// Automatically follow redirects (Hugging Face redirects to a CDN)
cli.set_follow_location(true);
auto res = cli.Get("/api/models");
if (res && res->status == 200) {
    std::cout << res->body << std::endl;
}
To use HTTPS, you need to enable OpenSSL at build time. Add the following to your CMakeLists.txt:
find_package(OpenSSL REQUIRED)
target_link_libraries(translate-server PRIVATE OpenSSL::SSL OpenSSL::Crypto)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)
# macOS: required for loading system certificates
if(APPLE)
    target_link_libraries(translate-server PRIVATE "-framework CoreFoundation" "-framework Security")
endif()
Defining CPPHTTPLIB_OPENSSL_SUPPORT enables httplib::Client("https://...") to make TLS connections. On macOS, you also need to link the CoreFoundation and Security frameworks to access the system certificate store. See Section 4.8 for the complete CMakeLists.txt.
Let's define the list of models that the app can handle. Here are three models we've verified for translation tasks.
struct ModelInfo {
    std::string name;     // Display name
    std::string params;   // Parameter count
    std::string size;     // GGUF Q4 size
    std::string repo;     // Hugging Face repository
    std::string filename; // GGUF filename
};
const std::vector<ModelInfo> MODELS = {
    {
        .name = "gemma-2-2b-it",
        .params = "2B",
        .size = "1.6 GB",
        .repo = "bartowski/gemma-2-2b-it-GGUF",
        .filename = "gemma-2-2b-it-Q4_K_M.gguf",
    },
    {
        .name = "gemma-2-9b-it",
        .params = "9B",
        .size = "5.8 GB",
        .repo = "bartowski/gemma-2-9b-it-GGUF",
        .filename = "gemma-2-9b-it-Q4_K_M.gguf",
    },
    {
        .name = "Llama-3.1-8B-Instruct",
        .params = "8B",
        .size = "4.9 GB",
        .repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
        .filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    },
};
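Later, the selection handler will need to map a display name back to its entry in this table. Stripped of the server context, that lookup is just a `std::find_if` over the vector. The sketch below shows the pattern in isolation; `find_model` is a hypothetical helper, and the table here is a trimmed-down two-field copy for illustration only:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct ModelInfo {
    std::string name; // Display name
    std::string repo; // Hugging Face repository
};

// Trimmed-down copy of the MODELS table, for illustration only
const std::vector<ModelInfo> MODELS = {
    {"gemma-2-2b-it", "bartowski/gemma-2-2b-it-GGUF"},
    {"gemma-2-9b-it", "bartowski/gemma-2-9b-it-GGUF"},
};

// Returns nullptr when no model with that display name exists
const ModelInfo *find_model(const std::string &name) {
    auto it = std::find_if(MODELS.begin(), MODELS.end(),
                           [&](const ModelInfo &m) { return m.name == name; });
    return it == MODELS.end() ? nullptr : &*it;
}
```

Returning a pointer (null on miss) keeps the "unknown model" case explicit, which maps directly onto the 404 response we'll produce in the handler.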
Up through Chapter 3, we stored models in the models/ directory within the project. However, when managing multiple models, a dedicated app directory makes more sense. On macOS/Linux we use ~/.translate-app/models/, and on Windows we use %APPDATA%\translate-app\models\.
std::filesystem::path get_models_dir() {
#ifdef _WIN32
    auto env = std::getenv("APPDATA");
    auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
    return base / "translate-app" / "models";
#else
    auto env = std::getenv("HOME");
    auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
    return base / ".translate-app" / "models";
#endif
}
If the environment variable isn't set, it falls back to the current directory. The app creates this directory at startup (create_directories won't error even if it already exists).
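As a quick sanity check of that fallback, the function can be exercised standalone. The snippet below reproduces `get_models_dir` verbatim from above; on a Unix-like system it resolves against `HOME` and drops to the current directory when `HOME` is unset:

```cpp
#include <cstdlib>
#include <filesystem>

// Same function as above, reproduced so this sketch is self-contained
std::filesystem::path get_models_dir() {
#ifdef _WIN32
    auto env = std::getenv("APPDATA");
    auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
    return base / "translate-app" / "models";
#else
    auto env = std::getenv("HOME");
    auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
    return base / ".translate-app" / "models";
#endif
}
```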
We rewrite the model initialization at the beginning of main(). In Chapter 1 we hardcoded the path, but from here on we support model switching. We track the currently loaded filename in selected_model and load the first entry in MODELS at startup. The GET /models and POST /models/select handlers reference and update this variable.
Since cpp-httplib runs handlers concurrently on a thread pool, reassigning llm while another thread is calling llm.chat() would crash. We add a std::mutex to protect against this.
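To see the locking pattern on its own, here is a deliberately simplified stand-in (not code from the app): a `std::string` takes the place of the `Llama` object, `swap_model` plays the role of the `llm = llamalib::Llama{path}` reassignment, and `use_model` the role of `llm.chat()`. Every access takes the mutex first, so a reader can never observe a half-swapped object:

```cpp
#include <cstddef>
#include <mutex>
#include <string>

// Simplified stand-in for the server's shared state: a string guarded
// by a mutex instead of a Llama object.
class SharedModel {
public:
    // Plays the role of `llm = llamalib::Llama{path}` in the select handler
    void swap_model(const std::string &name) {
        std::lock_guard<std::mutex> lock(mutex_);
        name_ = name;
    }

    // Plays the role of `llm.chat(...)` in the /translate handlers
    std::size_t use_model() {
        std::lock_guard<std::mutex> lock(mutex_);
        return name_.size(); // always sees a fully constructed value
    }

private:
    std::mutex mutex_;
    std::string name_ = "gemma-2-2b-it";
};
```

One consequence of a single mutex is that a translation in progress blocks a model switch (and vice versa); that's exactly the behavior we want here, since reassigning the model mid-inference would be fatal.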
int main() {
    auto models_dir = get_models_dir();
    std::filesystem::create_directories(models_dir);

    std::string selected_model = MODELS[0].filename;
    auto path = models_dir / selected_model;

    // Automatically download the default model if not yet present
    if (!std::filesystem::exists(path)) {
        std::cout << "Downloading " << selected_model << "..." << std::endl;
        if (!download_model(MODELS[0], [](int pct) {
                std::cout << "\r" << pct << "%" << std::flush;
                return true;
            })) {
            std::cerr << "\nFailed to download model." << std::endl;
            return 1;
        }
        std::cout << std::endl;
    }

    auto llm = llamalib::Llama{path};
    std::mutex llm_mutex; // Protect access during model switching

    // ...
}
This ensures that users don't need to manually download models with curl on first launch. It uses the download_model function from Section 4.6 and displays progress on the console.
GET /models Handler

This returns the model list, with information about whether each model has been downloaded and whether it's currently selected.
svr.Get("/models",
    [&](const httplib::Request &, httplib::Response &res) {
        auto arr = json::array();
        for (const auto &m : MODELS) {
            auto path = get_models_dir() / m.filename;
            arr.push_back({
                {"name", m.name},
                {"params", m.params},
                {"size", m.size},
                {"downloaded", std::filesystem::exists(path)},
                {"selected", m.filename == selected_model},
            });
        }
        res.set_content(json{{"models", arr}}.dump(), "application/json");
    });
GGUF models are several gigabytes, so we can't load the entire file into memory. By passing callbacks to httplib::Client::Get, we can receive data chunk by chunk.
// content_receiver: callback that receives data chunks
// progress: download progress callback
cli.Get(url,
[&](const char *data, size_t len) { // content_receiver
ofs.write(data, len);
return true; // returning false aborts the download
},
[&](size_t current, size_t total) { // progress
int pct = total ? (int)(current * 100 / total) : 0;
std::cout << pct << "%" << std::endl;
return true; // returning false aborts the download
});
Let's use this to create a function that downloads models from Hugging Face.
#include <filesystem>
#include <fstream>

// Download a model and report progress via progress_cb.
// If progress_cb returns false, the download is aborted.
bool download_model(const ModelInfo &model,
                    std::function<bool(int)> progress_cb) {
    httplib::Client cli("https://huggingface.co");
    cli.set_follow_location(true);
    cli.set_read_timeout(std::chrono::hours(1));

    auto url = "/" + model.repo + "/resolve/main/" + model.filename;
    auto path = get_models_dir() / model.filename;
    auto tmp_path = std::filesystem::path(path).concat(".tmp");

    std::ofstream ofs(tmp_path, std::ios::binary);
    if (!ofs) { return false; }

    auto res = cli.Get(url,
        [&](const char *data, size_t len) {
            ofs.write(data, len);
            return ofs.good();
        },
        [&](size_t current, size_t total) {
            return progress_cb(total ? (int)(current * 100 / total) : 0);
        });
    ofs.close();

    if (!res || res->status != 200) {
        std::filesystem::remove(tmp_path);
        return false;
    }

    // Write to .tmp first, then rename, so that an incomplete file
    // is never mistaken for a usable model if the download is interrupted
    std::filesystem::rename(tmp_path, path);
    return true;
}
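The .tmp-then-rename trick is worth isolating. This sketch applies the same pattern to an arbitrary small file; `write_atomically` is a hypothetical helper, not part of the app. The key property is that the final path only ever comes into existence via `rename`, which is atomic when source and destination live on the same filesystem:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Write `contents` to `path` via a temporary sibling file, so a crash or
// interruption mid-write never leaves a half-written file at the final path.
bool write_atomically(const std::filesystem::path &path,
                      const std::string &contents) {
    auto tmp_path = std::filesystem::path(path).concat(".tmp");

    std::ofstream ofs(tmp_path, std::ios::binary);
    if (!ofs) return false;
    ofs.write(contents.data(), (std::streamsize)contents.size());
    ofs.close();

    if (!ofs.good()) { // write or close failed: clean up the partial file
        std::filesystem::remove(tmp_path);
        return false;
    }

    // Only a fully written file ever reaches the final path
    std::filesystem::rename(tmp_path, path);
    return true;
}
```

This is why `download_model` above can simply check `std::filesystem::exists(path)`: if the file is there at all, it is complete.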
POST /models/select Handler

This handles model selection requests. We always respond with SSE, reporting status in sequence: download progress, loading, and ready.
svr.Post("/models/select",
    [&](const httplib::Request &req, httplib::Response &res) {
        auto input = json::parse(req.body, nullptr, false);
        if (input.is_discarded() || !input.contains("model")) {
            res.status = 400;
            res.set_content(json{{"error", "'model' is required"}}.dump(),
                            "application/json");
            return;
        }
        auto name = input["model"].get<std::string>();

        // Find the model in the list
        auto it = std::find_if(MODELS.begin(), MODELS.end(),
                               [&](const ModelInfo &m) { return m.name == name; });
        if (it == MODELS.end()) {
            res.status = 404;
            res.set_content(json{{"error", "Unknown model"}}.dump(),
                            "application/json");
            return;
        }
        const auto &model = *it;

        // Always respond with SSE (same format whether already downloaded or not)
        res.set_chunked_content_provider(
            "text/event-stream",
            [&, model](size_t, httplib::DataSink &sink) {
                // SSE event sending helper
                auto send = [&](const json &event) {
                    sink.os << "data: " << event.dump() << "\n\n";
                };

                // Download if not yet present (report progress via SSE)
                auto path = get_models_dir() / model.filename;
                if (!std::filesystem::exists(path)) {
                    bool ok = download_model(model, [&](int pct) {
                        send({{"status", "downloading"}, {"progress", pct}});
                        return sink.os.good(); // Abort download on client disconnect
                    });
                    if (!ok) {
                        send({{"status", "error"}, {"message", "Download failed"}});
                        sink.done();
                        return true;
                    }
                }

                // Load and switch to the model
                send({{"status", "loading"}});
                {
                    std::lock_guard<std::mutex> lock(llm_mutex);
                    llm = llamalib::Llama{path};
                    selected_model = model.filename;
                }
                send({{"status", "ready"}});
                sink.done();
                return true;
            });
    });
A few notes:
- SSE events are emitted from the download_model progress callback. This is an application of set_chunked_content_provider + sink.os from Chapter 3.
- Because the content receiver returns sink.os.good(), the download stops if the client disconnects. The cancel button we add in Chapter 5 relies on this.
- After a successful switch, the handler updates selected_model, so the change is reflected in the selected flag of GET /models.
- The llm reassignment is protected by llm_mutex. The /translate and /translate/stream handlers lock the same mutex, so inference can't run during a model switch (see the complete code).

Here is the complete code with model management added to the Chapter 3 code.
Since we added OpenSSL configuration to CMakeLists.txt, we need to re-run CMake before building.
cmake -B build
cmake --build build -j
./build/translate-server
curl http://localhost:8080/models
The gemma-2-2b-it model downloaded in Chapter 1 should show downloaded: true and selected: true.
curl -N -X POST http://localhost:8080/models/select \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-9b-it"}'
Download progress streams via SSE, and "ready" appears when it's done.
Let's translate the same sentence with different models.
# Translate with gemma-2-9b-it (the model we just switched to)
curl -X POST http://localhost:8080/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
# Switch back to gemma-2-2b-it
curl -N -X POST http://localhost:8080/models/select \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-2b-it"}'
# Translate the same sentence
curl -X POST http://localhost:8080/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
Translation results vary depending on the model, even with the same code and the same prompt. Since cpp-llamalib automatically applies the appropriate chat template for each model, no code changes are needed.
The server's main features are now complete: REST API, SSE streaming, and model download and switching. In the next chapter, we'll add static file serving and build a Web UI you can use from a browser.
Next: Adding a Web UI