---
title: "3. Adding Token Streaming with SSE"
order: 3
---
The /translate endpoint from Chapter 2 returned the entire translation at once after completion. This is fine for short sentences, but for longer text the user has to wait several seconds with nothing displayed.
In this chapter, we add a /translate/stream endpoint that uses SSE (Server-Sent Events) to return tokens in real time as they are generated. This is the same approach used by the ChatGPT and Claude APIs.
SSE is a way to send HTTP responses as a stream. When a client sends a request, the server keeps the connection open and gradually returns events. The format is simple text.
data: "去年の"
data: "春に"
data: "東京を"
data: [DONE]
Each line starts with data: and events are separated by blank lines. The Content-Type is text/event-stream. Tokens are sent as escaped JSON strings, so they appear enclosed in double quotes (we implement this in Section 3.3).
In cpp-httplib, you can use set_chunked_content_provider to send responses incrementally. Each time you write to sink.os inside the callback, data is sent to the client.
res.set_chunked_content_provider(
"text/event-stream",
[](size_t offset, httplib::DataSink &sink) {
sink.os << "data: hello\n\n";
sink.done();
return true;
});
Calling sink.done() ends the stream. If the client disconnects mid-stream, writing to sink.os will fail and sink.os.fail() will return true. You can use this to detect disconnection and abort unnecessary inference.
/translate/stream Handler

JSON parsing and validation are the same as for the /translate endpoint from Chapter 2. The only difference is how the response is returned: we combine the streaming callback of llm.chat() with set_chunked_content_provider.
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
// ... JSON parsing and validation same as /translate ...
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
A few key points:
- The callback passed to llm.chat() is called each time a token is generated. If the callback returns false, generation is aborted.
- After writing to sink.os, you can check whether the client is still connected with sink.os.good(). If the client has disconnected, the callback returns false to stop inference.
- Each token is escaped with json(token).dump() before sending. This is safe even for tokens containing newlines or quotes.
- The first three arguments of dump(-1, ' ', false, ...) are the defaults. What matters is the fourth argument, json::error_handler_t::replace. Since the LLM returns tokens at the subword level, multi-byte characters (such as Japanese) can be split mid-character across tokens. Passing an incomplete UTF-8 byte sequence directly to dump() would throw an exception, so replace safely substitutes them. The browser reassembles the bytes on its end, so everything displays correctly.
- The whole body is wrapped in try/catch. llm.chat() can throw exceptions, for example when the context window is exceeded. If an exception goes uncaught inside the lambda, the server will crash, so we return the error as an SSE event instead.
- data: [DONE] follows the OpenAI API convention to signal the end of the stream to the client.

Here is the complete code with the /translate/stream endpoint added to the code from Chapter 2.
Build and start the server.
cmake --build build -j
./build/translate-server
With curl's -N option, which disables output buffering, you can see tokens displayed in real time as they arrive.
curl -N -X POST http://localhost:8080/translate/stream \
-H "Content-Type: application/json" \
-d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
data: "去年の"
data: "春に"
data: "東京を"
data: "訪れた"
data: "。"
data: "桜が"
data: "綺麗だった"
data: "。"
data: [DONE]
You should see tokens streaming in one by one. The /translate endpoint from Chapter 2 continues to work as well.
The server's translation functionality is now complete. In the next chapter, we use cpp-httplib's client functionality to add the ability to fetch and manage models from Hugging Face.