

title: "4. Adding Model Download and Management"
order: 4


By the end of Chapter 3, the server's translation functionality was fully in place. However, the only model file available is the one we manually downloaded in Chapter 1. In this chapter, we'll use cpp-httplib's client functionality to enable downloading and switching Hugging Face models from within the app.

Once complete, you'll be able to manage models with requests like these:

# Get the list of available models
curl http://localhost:8080/models
{
  "models": [
    {"name": "gemma-2-2b-it", "params": "2B", "size": "1.6 GB", "downloaded": true, "selected": true},
    {"name": "gemma-2-9b-it", "params": "9B", "size": "5.8 GB", "downloaded": false, "selected": false},
    {"name": "Llama-3.1-8B-Instruct", "params": "8B", "size": "4.9 GB", "downloaded": false, "selected": false}
  ]
}
# Select a different model (automatically downloads if not yet available)
curl -N -X POST http://localhost:8080/models/select \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-9b-it"}'
data: {"status":"downloading","progress":0}
data: {"status":"downloading","progress":12}
...
data: {"status":"downloading","progress":100}
data: {"status":"loading"}
data: {"status":"ready"}
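On the client side, each status update arrives as a single SSE `data:` line followed by a blank line. As a rough sketch of how a consumer might extract the JSON payload from such a line (the helper name `sse_payload` is hypothetical and not part of cpp-httplib), assuming only the single-line `data:` form shown above:

```cpp
#include <optional>
#include <string_view>

// Hypothetical helper: returns the payload of an SSE "data:" line, or
// std::nullopt for any other line (blank separators, comments, ...).
// Only the single-line "data: <payload>" form used in this chapter is handled.
std::optional<std::string_view> sse_payload(std::string_view line) {
  constexpr std::string_view prefix = "data:";
  if (line.substr(0, prefix.size()) != prefix) { return std::nullopt; }
  line.remove_prefix(prefix.size());
  // The SSE format allows one optional space after the colon.
  if (!line.empty() && line.front() == ' ') { line.remove_prefix(1); }
  return line;
}
```

The returned view can then be handed to a JSON parser to read the `status` and `progress` fields.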

4.1 httplib::Client Basics

So far we've only used httplib::Server, but cpp-httplib also provides client functionality. Since Hugging Face uses HTTPS, we need a TLS-capable client.

#include <httplib.h>
#include <iostream>

// Including the URL scheme automatically uses SSLClient
httplib::Client cli("https://huggingface.co");

// Automatically follow redirects (Hugging Face redirects to a CDN)
cli.set_follow_location(true);

auto res = cli.Get("/api/models");
if (res && res->status == 200) {
  std::cout << res->body << std::endl;
}

To use HTTPS, you need to enable OpenSSL at build time. Add the following to your CMakeLists.txt:

find_package(OpenSSL REQUIRED)

target_link_libraries(translate-server PRIVATE OpenSSL::SSL OpenSSL::Crypto)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)

# macOS: required for loading system certificates
if(APPLE)
  target_link_libraries(translate-server PRIVATE "-framework CoreFoundation" "-framework Security")
endif()

Defining CPPHTTPLIB_OPENSSL_SUPPORT enables httplib::Client("https://...") to make TLS connections. On macOS, you also need to link the CoreFoundation and Security frameworks to access the system certificate store. See Section 4.8 for the complete CMakeLists.txt.

4.2 Defining the Model List

Let's define the list of models that the app can handle. Here are three models we've verified for translation tasks.

struct ModelInfo {
  std::string name;       // Display name
  std::string params;     // Parameter count
  std::string size;       // GGUF Q4 size
  std::string repo;       // Hugging Face repository
  std::string filename;   // GGUF filename
};

const std::vector<ModelInfo> MODELS = {
  {
    .name     = "gemma-2-2b-it",
    .params   = "2B",
    .size     = "1.6 GB",
    .repo     = "bartowski/gemma-2-2b-it-GGUF",
    .filename = "gemma-2-2b-it-Q4_K_M.gguf",
  },
  {
    .name     = "gemma-2-9b-it",
    .params   = "9B",
    .size     = "5.8 GB",
    .repo     = "bartowski/gemma-2-9b-it-GGUF",
    .filename = "gemma-2-9b-it-Q4_K_M.gguf",
  },
  {
    .name     = "Llama-3.1-8B-Instruct",
    .params   = "8B",
    .size     = "4.9 GB",
    .repo     = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    .filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  },
};
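To illustrate how this table is used later in the chapter, here is a minimal sketch of two lookups: finding an entry by its display name and building the download path on huggingface.co. The helper names `find_model` and `resolve_path` are hypothetical, and the list is a trimmed copy so that the block compiles on its own.

```cpp
#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

struct ModelInfo {
  std::string name, params, size, repo, filename;
};

// Trimmed copy of the list above (two entries are enough for the sketch).
const std::vector<ModelInfo> MODELS = {
  {"gemma-2-2b-it", "2B", "1.6 GB",
   "bartowski/gemma-2-2b-it-GGUF", "gemma-2-2b-it-Q4_K_M.gguf"},
  {"gemma-2-9b-it", "9B", "5.8 GB",
   "bartowski/gemma-2-9b-it-GGUF", "gemma-2-9b-it-Q4_K_M.gguf"},
};

// Hypothetical helper: look a model up by display name; nullptr if unknown.
const ModelInfo *find_model(std::string_view name) {
  auto it = std::find_if(MODELS.begin(), MODELS.end(),
                         [&](const ModelInfo &m) { return m.name == name; });
  return it == MODELS.end() ? nullptr : &*it;
}

// Hypothetical helper: request path for the GGUF file, in the same
// "/<repo>/resolve/main/<filename>" form that Section 4.6 uses.
std::string resolve_path(const ModelInfo &m) {
  return "/" + m.repo + "/resolve/main/" + m.filename;
}
```

Keeping `name` (what the API exposes) separate from `filename` (what is stored on disk) lets the request body stay short while the file keeps its quantization suffix.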

4.3 Model Storage Location

Up through Chapter 3, we stored models in the models/ directory within the project. However, when managing multiple models, a dedicated app directory makes more sense. On macOS/Linux we use ~/.translate-app/models/, and on Windows we use %APPDATA%\translate-app\models\.

std::filesystem::path get_models_dir() {
#ifdef _WIN32
  auto env = std::getenv("APPDATA");
  auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
  return base / "translate-app" / "models";
#else
  auto env = std::getenv("HOME");
  auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
  return base / ".translate-app" / "models";
#endif
}

If the environment variable isn't set, the function falls back to the current directory. The app creates the models directory at startup (create_directories doesn't report an error if the directory already exists).
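As a small illustration of the directory creation step, the sketch below wraps `std::filesystem::create_directories` in a helper that reports failure through its return value rather than an exception. The name `ensure_dir` is hypothetical; the chapter's code calls `create_directories` directly.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: make sure `dir` (and any missing parents) exists.
// Uses the error_code overloads so that failures do not throw.
bool ensure_dir(const fs::path &dir) {
  std::error_code ec;
  // Returns false without setting an error if the directory already exists.
  fs::create_directories(dir, ec);
  if (ec) { return false; }
  return fs::is_directory(dir, ec);
}
```

Calling it twice on the same path succeeds both times, which is why the server can run this step unconditionally at every startup.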

4.4 Rewriting Model Initialization

We rewrite the model initialization at the beginning of main(). In Chapter 1 we hardcoded the path, but from here on we support model switching. We track the currently loaded filename in selected_model and load the first entry in MODELS at startup. The GET /models and POST /models/select handlers reference and update this variable.

Since cpp-httplib runs handlers concurrently on a thread pool, reassigning llm while another thread is calling llm.chat() is a data race that could crash the server. We add a std::mutex to protect against this.

int main() {
  auto models_dir = get_models_dir();
  std::filesystem::create_directories(models_dir);

  std::string selected_model = MODELS[0].filename;
  auto path = models_dir / selected_model;

  // Automatically download the default model if not yet present
  if (!std::filesystem::exists(path)) {
    std::cout << "Downloading " << selected_model << "..." << std::endl;
    if (!download_model(MODELS[0], [](int pct) {
          std::cout << "\r" << pct << "%" << std::flush;
          return true;
        })) {
      std::cerr << "\nFailed to download model." << std::endl;
      return 1;
    }
    std::cout << std::endl;
  }
  auto llm = llamalib::Llama{path};
  std::mutex llm_mutex; // Protect access during model switching
  // ...
}

This ensures that users don't need to manually download models with curl on first launch. It uses the download_model function from Section 4.6 and displays progress on the console.
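The role of llm_mutex can be sketched in isolation. The block below is an illustration only: `FakeModel` stands in for `llamalib::Llama`, and the `GuardedModel` wrapper is a hypothetical way to bundle the model with its mutex (the chapter's code keeps them as two separate variables in main()).

```cpp
#include <mutex>
#include <string>
#include <utility>

// Stand-in for llamalib::Llama so the sketch compiles on its own.
struct FakeModel {
  std::string name;
  std::string chat(const std::string &prompt) const {
    return name + ": " + prompt;
  }
};

// Hypothetical wrapper that couples the model with the mutex guarding it.
class GuardedModel {
public:
  explicit GuardedModel(FakeModel m) : model_(std::move(m)) {}

  // Inference holds the lock for the whole call, so a swap cannot interleave.
  std::string chat(const std::string &prompt) {
    std::lock_guard<std::mutex> lock(mutex_);
    return model_.chat(prompt);
  }

  // Switching models takes the same lock before reassigning.
  void replace(FakeModel m) {
    std::lock_guard<std::mutex> lock(mutex_);
    model_ = std::move(m);
  }

private:
  std::mutex mutex_;
  FakeModel model_;
};
```

One consequence of this design is that requests are serialized: a translation waits until a model switch finishes, and vice versa.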

4.5 The GET /models Handler

This returns the model list with information about whether each model has been downloaded and whether it's currently selected.

svr.Get("/models",
        [&](const httplib::Request &, httplib::Response &res) {
  auto arr = json::array();
  for (const auto &m : MODELS) {
    auto path = get_models_dir() / m.filename;
    arr.push_back({
      {"name",       m.name},
      {"params",     m.params},
      {"size",       m.size},
      {"downloaded", std::filesystem::exists(path)},
      {"selected",   m.filename == selected_model},
    });
  }
  res.set_content(json{{"models", arr}}.dump(), "application/json");
});

4.6 Downloading Large Files

GGUF models run to several gigabytes, so buffering the entire response body in memory is impractical. By passing callbacks to httplib::Client::Get, we can receive the data chunk by chunk and write it straight to disk.

// content_receiver: callback that receives data chunks
// progress: download progress callback
cli.Get(url,
  [&](const char *data, size_t len) {       // content_receiver
    ofs.write(data, len);
    return true;  // returning false aborts the download
  },
  [&](size_t current, size_t total) {        // progress
    int pct = total ? (int)(current * 100 / total) : 0;
    std::cout << pct << "%" << std::endl;
    return true;  // returning false aborts the download
  });

Let's use this to create a function that downloads models from Hugging Face.

#include <filesystem>
#include <fstream>

// Download a model and report progress via progress_cb.
// If progress_cb returns false, the download is aborted.
bool download_model(const ModelInfo &model,
                    std::function<bool(int)> progress_cb) {
  httplib::Client cli("https://huggingface.co");
  cli.set_follow_location(true);
  cli.set_read_timeout(std::chrono::hours(1));

  auto url = "/" + model.repo + "/resolve/main/" + model.filename;
  auto path = get_models_dir() / model.filename;
  auto tmp_path = std::filesystem::path(path).concat(".tmp");

  std::ofstream ofs(tmp_path, std::ios::binary);
  if (!ofs) { return false; }

  auto res = cli.Get(url,
    [&](const char *data, size_t len) {
      ofs.write(data, len);
      return ofs.good();
    },
    [&](size_t current, size_t total) {
      return progress_cb(total ? (int)(current * 100 / total) : 0);
    });

  ofs.close();

  if (!res || res->status != 200) {
    std::filesystem::remove(tmp_path);
    return false;
  }

  // Write to .tmp first, then rename, so that an incomplete file
  // is never mistaken for a usable model if the download is interrupted
  std::filesystem::rename(tmp_path, path);
  return true;
}
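The write-to-.tmp-then-rename step can be tried without any network access. The following is a minimal sketch of the same pattern using only the standard library; `write_via_tmp` is a hypothetical name, and unlike download_model it uses the `std::error_code` overloads so that a failed rename returns false instead of throwing.

```cpp
#include <filesystem>
#include <fstream>
#include <ios>
#include <string_view>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: write `data` to a sibling "<path>.tmp" file first and
// rename it to `path` only after the write succeeded.
bool write_via_tmp(const fs::path &path, std::string_view data) {
  auto tmp = fs::path(path).concat(".tmp");
  std::error_code ignore;
  {
    std::ofstream ofs(tmp, std::ios::binary);
    if (!ofs) { return false; }
    ofs.write(data.data(), static_cast<std::streamsize>(data.size()));
    ofs.close();  // flush before checking the stream state
    if (!ofs) {
      fs::remove(tmp, ignore);  // do not leave a partial file behind
      return false;
    }
  }
  std::error_code ec;
  fs::rename(tmp, path, ec);  // error_code overload: no exception on failure
  if (ec) {
    fs::remove(tmp, ignore);
    return false;
  }
  return true;
}
```

Because the final filename only appears after the rename, a check such as `std::filesystem::exists(path)` never sees a half-written file.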

4.7 The /models/select Handler

This handles model selection requests. We always respond with SSE, reporting status in sequence: download progress, loading, and ready.

svr.Post("/models/select",
         [&](const httplib::Request &req, httplib::Response &res) {
  auto input = json::parse(req.body, nullptr, false);
  if (input.is_discarded() || !input.contains("model")) {
    res.status = 400;
    res.set_content(json{{"error", "'model' is required"}}.dump(),
                    "application/json");
    return;
  }

  auto name = input["model"].get<std::string>();

  // Find the model in the list
  auto it = std::find_if(MODELS.begin(), MODELS.end(),
    [&](const ModelInfo &m) { return m.name == name; });

  if (it == MODELS.end()) {
    res.status = 404;
    res.set_content(json{{"error", "Unknown model"}}.dump(),
                    "application/json");
    return;
  }

  const auto &model = *it;

  // Always respond with SSE (same format whether already downloaded or not)
  res.set_chunked_content_provider(
      "text/event-stream",
      [&, model](size_t, httplib::DataSink &sink) {
        // SSE event sending helper
        auto send = [&](const json &event) {
          sink.os << "data: " << event.dump() << "\n\n";
        };

        // Download if not yet present (report progress via SSE)
        auto path = get_models_dir() / model.filename;
        if (!std::filesystem::exists(path)) {
          bool ok = download_model(model, [&](int pct) {
            send({{"status", "downloading"}, {"progress", pct}});
            return sink.os.good(); // Abort download on client disconnect
          });
          if (!ok) {
            send({{"status", "error"}, {"message", "Download failed"}});
            sink.done();
            return true;
          }
        }

        // Load and switch to the model
        send({{"status", "loading"}});
        {
          std::lock_guard<std::mutex> lock(llm_mutex);
          llm = llamalib::Llama{path};
          selected_model = model.filename;
        }

        send({{"status", "ready"}});
        sink.done();
        return true;
      });
});

A few notes:

  • We send SSE events directly from the download_model progress callback. This reuses the set_chunked_content_provider and sink.os pattern from Chapter 3.
  • Since the callback returns sink.os.good(), the download stops if the client disconnects. The cancel button we add in Chapter 5 relies on this.
  • When we update selected_model, the change is reflected in the selected flag of GET /models.
  • The llm reassignment is protected by llm_mutex. The /translate and /translate/stream handlers also lock the same mutex, so inference can't run during a model switch (see the complete code).
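The progress callback fires once per received chunk, so the same percentage can be reported many times in a row. The complete code in Section 4.8 avoids this by remembering the last value; the sketch below isolates that idea as a standalone adapter. The name `dedup_progress` is hypothetical, and the `(current, total)` signature simply mirrors the progress lambda used in this chapter.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical adapter: wraps a percentage callback so that it is invoked
// only when the integer percentage changes. A false return from the wrapped
// callback is passed through, which still aborts the download.
std::function<bool(std::size_t, std::size_t)>
dedup_progress(std::function<bool(int)> cb) {
  return [cb = std::move(cb), last = -1](std::size_t current,
                                          std::size_t total) mutable {
    int pct = total ? static_cast<int>(current * 100 / total) : 0;
    if (pct == last) { return true; }  // same value as last time: skip
    last = pct;
    return cb(pct);
  };
}
```

With this in place, a client sees at most one `downloading` event per percentage point instead of one per chunk.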

4.8 Complete Code

Here is the complete code with model management added to the Chapter 3 code.

Complete code (CMakeLists.txt)

```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-server CXX)
set(CMAKE_CXX_STANDARD 20)

include(FetchContent)

# llama.cpp
FetchContent_Declare(llama
    GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
    GIT_TAG        master
    GIT_SHALLOW    TRUE
)
FetchContent_MakeAvailable(llama)

# cpp-httplib
FetchContent_Declare(httplib
    GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
    GIT_TAG        master
)
FetchContent_MakeAvailable(httplib)

# nlohmann/json
FetchContent_Declare(json
    URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)

# cpp-llamalib
FetchContent_Declare(cpp_llamalib
    GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
    GIT_TAG        main
)
FetchContent_MakeAvailable(cpp_llamalib)

find_package(OpenSSL REQUIRED)

add_executable(translate-server src/main.cpp)

target_link_libraries(translate-server PRIVATE
    httplib::httplib
    nlohmann_json::nlohmann_json
    cpp-llamalib
    OpenSSL::SSL OpenSSL::Crypto
)

target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)

if(APPLE)
    target_link_libraries(translate-server PRIVATE
        "-framework CoreFoundation"
        "-framework Security"
    )
endif()
```
Complete code (main.cpp)

```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>

#include <algorithm>
#include <csignal>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iostream>

using json = nlohmann::json;

// -------------------------------------------------------------------------
// Model definitions
// -------------------------------------------------------------------------

struct ModelInfo {
  std::string name;
  std::string params;
  std::string size;
  std::string repo;
  std::string filename;
};

const std::vector<ModelInfo> MODELS = {
  {
    .name     = "gemma-2-2b-it",
    .params   = "2B",
    .size     = "1.6 GB",
    .repo     = "bartowski/gemma-2-2b-it-GGUF",
    .filename = "gemma-2-2b-it-Q4_K_M.gguf",
  },
  {
    .name     = "gemma-2-9b-it",
    .params   = "9B",
    .size     = "5.8 GB",
    .repo     = "bartowski/gemma-2-9b-it-GGUF",
    .filename = "gemma-2-9b-it-Q4_K_M.gguf",
  },
  {
    .name     = "Llama-3.1-8B-Instruct",
    .params   = "8B",
    .size     = "4.9 GB",
    .repo     = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    .filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  },
};

// -------------------------------------------------------------------------
// Model storage directory
// -------------------------------------------------------------------------

std::filesystem::path get_models_dir() {
#ifdef _WIN32
  auto env = std::getenv("APPDATA");
  auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
  return base / "translate-app" / "models";
#else
  auto env = std::getenv("HOME");
  auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
  return base / ".translate-app" / "models";
#endif
}

// -------------------------------------------------------------------------
// Model download
// -------------------------------------------------------------------------

// If progress_cb returns false, the download is aborted
bool download_model(const ModelInfo &model,
                    std::function<bool(int)> progress_cb) {
  httplib::Client cli("https://huggingface.co");
  cli.set_follow_location(true);  // Hugging Face redirects to a CDN
  cli.set_read_timeout(std::chrono::hours(1)); // Set a long timeout for large models

  auto url = "/" + model.repo + "/resolve/main/" + model.filename;
  auto path = get_models_dir() / model.filename;
  auto tmp_path = std::filesystem::path(path).concat(".tmp");

  std::ofstream ofs(tmp_path, std::ios::binary);
  if (!ofs) { return false; }

  auto res = cli.Get(url,
    // content_receiver: receive data chunk by chunk and write to file
    [&](const char *data, size_t len) {
      ofs.write(data, len);
      return ofs.good();
    },
    // progress: report download progress (returning false aborts)
    [&, last_pct = -1](size_t current, size_t total) mutable {
      int pct = total ? (int)(current * 100 / total) : 0;
      if (pct == last_pct) return true; // Skip if same value
      last_pct = pct;
      return progress_cb(pct);
    });

  ofs.close();

  if (!res || res->status != 200) {
    std::filesystem::remove(tmp_path);
    return false;
  }

  // Rename after download completes
  std::filesystem::rename(tmp_path, path);
  return true;
}

// -------------------------------------------------------------------------
// Server
// -------------------------------------------------------------------------

httplib::Server svr;

void signal_handler(int sig) {
  if (sig == SIGINT || sig == SIGTERM) {
    std::cout << "\nReceived signal, shutting down gracefully...\n";
    svr.stop();
  }
}

int main() {
  // Create the model storage directory
  auto models_dir = get_models_dir();
  std::filesystem::create_directories(models_dir);

  // Automatically download the default model if not yet present
  std::string selected_model = MODELS[0].filename;
  auto path = models_dir / selected_model;
  if (!std::filesystem::exists(path)) {
    std::cout << "Downloading " << selected_model << "..." << std::endl;
    if (!download_model(MODELS[0], [](int pct) {
          std::cout << "\r" << pct << "%" << std::flush;
          return true;
        })) {
      std::cerr << "\nFailed to download model." << std::endl;
      return 1;
    }
    std::cout << std::endl;
  }

  auto llm = llamalib::Llama{path};
  std::mutex llm_mutex; // Protect access during model switching

  // Set a long timeout since LLM inference takes time (default is 5 seconds)
  svr.set_read_timeout(300);
  svr.set_write_timeout(300);

  svr.set_logger([](const auto &req, const auto &res) {
    std::cout << req.method << " " << req.path << " -> " << res.status
              << std::endl;
  });

  svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
    res.set_content(json{{"status", "ok"}}.dump(), "application/json");
  });

  // --- Translation endpoint (Chapter 2) ------------------------------------

  svr.Post("/translate",
           [&](const httplib::Request &req, httplib::Response &res) {
    // JSON parsing and validation (see Chapter 2 for details)
    auto input = json::parse(req.body, nullptr, false);
    if (input.is_discarded()) {
      res.status = 400;
      res.set_content(json{{"error", "Invalid JSON"}}.dump(),
                      "application/json");
      return;
    }

    if (!input.contains("text") || !input["text"].is_string() ||
        input["text"].get<std::string>().empty()) {
      res.status = 400;
      res.set_content(json{{"error", "'text' is required"}}.dump(),
                      "application/json");
      return;
    }

    auto text = input["text"].get<std::string>();
    auto target_lang = input.value("target_lang", "ja");

    auto prompt = "Translate the following text to " + target_lang +
                  ". Output only the translation, nothing else.\n\n" + text;

    try {
      std::lock_guard<std::mutex> lock(llm_mutex);
      auto translation = llm.chat(prompt);
      res.set_content(json{{"translation", translation}}.dump(),
                      "application/json");
    } catch (const std::exception &e) {
      res.status = 500;
      res.set_content(json{{"error", e.what()}}.dump(), "application/json");
    }
  });

  // --- SSE streaming translation (Chapter 3) -------------------------------

  svr.Post("/translate/stream",
           [&](const httplib::Request &req, httplib::Response &res) {
    auto input = json::parse(req.body, nullptr, false);
    if (input.is_discarded()) {
      res.status = 400;
      res.set_content(json{{"error", "Invalid JSON"}}.dump(),
                      "application/json");
      return;
    }

    if (!input.contains("text") || !input["text"].is_string() ||
        input["text"].get<std::string>().empty()) {
      res.status = 400;
      res.set_content(json{{"error", "'text' is required"}}.dump(),
                      "application/json");
      return;
    }

    auto text = input["text"].get<std::string>();
    auto target_lang = input.value("target_lang", "ja");

    auto prompt = "Translate the following text to " + target_lang +
                  ". Output only the translation, nothing else.\n\n" + text;

    res.set_chunked_content_provider(
        "text/event-stream",
        [&, prompt](size_t, httplib::DataSink &sink) {
          std::lock_guard<std::mutex> lock(llm_mutex);
          try {
            llm.chat(prompt, [&](std::string_view token) {
              sink.os << "data: "
                      << json(std::string(token)).dump(
                             -1, ' ', false, json::error_handler_t::replace)
                      << "\n\n";
              return sink.os.good(); // Abort inference on disconnect
            });
            sink.os << "data: [DONE]\n\n";
          } catch (const std::exception &e) {
            sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
          }
          sink.done();
          return true;
        });
  });

  // --- Model list (Chapter 4) ----------------------------------------------

  svr.Get("/models",
          [&](const httplib::Request &, httplib::Response &res) {
    auto models_dir = get_models_dir();
    auto arr = json::array();
    for (const auto &m : MODELS) {
      auto path = models_dir / m.filename;
      arr.push_back({
        {"name",       m.name},
        {"params",     m.params},
        {"size",       m.size},
        {"downloaded", std::filesystem::exists(path)},
        {"selected",   m.filename == selected_model},
      });
    }
    res.set_content(json{{"models", arr}}.dump(), "application/json");
  });

  // --- Model selection (Chapter 4) -----------------------------------------

  svr.Post("/models/select",
           [&](const httplib::Request &req, httplib::Response &res) {
    auto input = json::parse(req.body, nullptr, false);
    if (input.is_discarded() || !input.contains("model")) {
      res.status = 400;
      res.set_content(json{{"error", "'model' is required"}}.dump(),
                      "application/json");
      return;
    }

    auto name = input["model"].get<std::string>();

    auto it = std::find_if(MODELS.begin(), MODELS.end(),
      [&](const ModelInfo &m) { return m.name == name; });

    if (it == MODELS.end()) {
      res.status = 404;
      res.set_content(json{{"error", "Unknown model"}}.dump(),
                      "application/json");
      return;
    }

    const auto &model = *it;

    // Always respond with SSE (same format whether already downloaded or not)
    res.set_chunked_content_provider(
        "text/event-stream",
        [&, model](size_t, httplib::DataSink &sink) {
          // SSE event sending helper
          auto send = [&](const json &event) {
            sink.os << "data: " << event.dump() << "\n\n";
          };

          // Download if not yet present (report progress via SSE)
          auto path = get_models_dir() / model.filename;
          if (!std::filesystem::exists(path)) {
            bool ok = download_model(model, [&](int pct) {
              send({{"status", "downloading"}, {"progress", pct}});
              return sink.os.good(); // Abort download on client disconnect
            });
            if (!ok) {
              send({{"status", "error"}, {"message", "Download failed"}});
              sink.done();
              return true;
            }
          }

          // Load and switch to the model
          send({{"status", "loading"}});
          {
            std::lock_guard<std::mutex> lock(llm_mutex);
            llm = llamalib::Llama{path};
            selected_model = model.filename;
          }

          send({{"status", "ready"}});
          sink.done();
          return true;
        });
  });

  // Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
  signal(SIGINT, signal_handler);
  signal(SIGTERM, signal_handler);

  std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
  svr.listen("127.0.0.1", 8080);
}
```

4.9 Testing

Since we added OpenSSL configuration to CMakeLists.txt, we need to re-run CMake before building.

cmake -B build
cmake --build build -j
./build/translate-server

Checking the Model List

curl http://localhost:8080/models

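The default gemma-2-2b-it model should show downloaded: true and selected: true. Because the server now looks in the models directory from Section 4.3 rather than the project's models/ directory, it downloads this model again on first launch unless you copy the file from Chapter 1 into the new location.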

Switching to a Different Model

curl -N -X POST http://localhost:8080/models/select \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-9b-it"}'

Download progress streams via SSE, and "ready" appears when it's done.

Comparing Translations Across Models

Let's translate the same sentence with different models.

# Translate with gemma-2-9b-it (the model we just switched to)
curl -X POST http://localhost:8080/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'

# Switch back to gemma-2-2b-it
curl -N -X POST http://localhost:8080/models/select \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-2b-it"}'

# Translate the same sentence
curl -X POST http://localhost:8080/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'

Translation results vary depending on the model, even with the same code and the same prompt. Since cpp-llamalib automatically applies the appropriate chat template for each model, no code changes are needed.

Next Chapter

The server's main features are now complete: REST API, SSE streaming, and model download and switching. In the next chapter, we'll add static file serving and build a Web UI you can use from a browser.

Next: Adding a Web UI