PDS: Logging Rate Limiter


Logging is an important and integral part of software development. However, just like everything else, logging has its downsides as well. One of the typical issues with logging is having too many log entries. In today’s post, we are going to talk about common ways to limit logging to a sensible rate.

Logging in a nutshell

As a developer, if you have been working on fairly complex software, the chances are you already know what logging is in programming, and how a carefully constructed log can help you pinpoint an issue much faster, especially in production. In short, logging is the process of recording an application’s key actions and their context. Later on, the logged information can help engineers mentally reconstruct the state of the application at certain points. A log can be an important source of information, and sometimes the only source, when diagnosing a problem. As is often said, the clues most useful in figuring out what went wrong often lie in the events immediately prior.

There is no doubt that logging is a valuable tool for debugging when used judiciously, but when used unwisely it can also cause serious issues. Here are some considerations to keep in mind when working with logging. First, just like any other code in a program, logging statements need to be maintained. Furthermore, logging statements tend to become outdated more frequently than functional code, due to the lack of testing. A wrong log message is worse than no logging, as it may waste a developer’s time hunting the wrong target. Second, logging may inadvertently expose sensitive information, such as passwords or personal data, if not handled carefully. Third, another common issue is simply having too much logging. For one, logs cluttered with irrelevant information can be difficult to browse. For another, logging can consume too many resources, such as CPU, disk I/O or space, and network bandwidth, in a short period of time, thus significantly affecting the overall performance of the system. This kind of issue is especially noticeable in sophisticated long-running programs and performance-critical applications.

In the rest of this post, we will focus our discussion on programmatic mechanisms for dealing with too many log entries.

Log severity levels

When adding logging to a program, it is more common to use a logging framework than to build the logging system from scratch. Most logging frameworks provide a mechanism to assign a severity level to each log entry, along with the ability to filter out log entries below certain levels. Thus, setting a minimum severity level, or log level threshold, is usually the easiest way to reduce the number of items written to the log.
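
As an illustration, here is a minimal sketch of how such a threshold might be applied; the Level enum and LevelFilter class are hypothetical, not taken from any real framework:

enum class Level { debug, info, warning, error, critical };

class LevelFilter {
public:
    explicit LevelFilter(const Level min_level) : m_min_level(min_level) {
    }

    // Only entries at or above the threshold pass through.
    bool ShouldLog(const Level a_level) const {
        return a_level >= m_min_level;
    }

private:
    Level m_min_level;
};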

Log category

Besides log severity levels, some logging frameworks also support classifying log entries into different categories. A log category can usually be used to turn off logging for an entire subsystem. If you imagine log levels as a way to classify log entries horizontally, then log categories can be seen as a way to divide log entries vertically. One common situation with log levels is that, when you lower the minimum severity level to allow more items to get logged, the log is suddenly flooded with so much irrelevant information that it becomes hard to browse. With log categories, however, you can enable logging for only the subsystems of interest, thus limiting the amount of information that gets recorded. For this reason, log levels are often combined with log categories to make the best use of both, as the sketch below shows.
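
Here is a small, equally hypothetical sketch of combining the two, reusing the Level enum from the sketch above:

#include <set>
#include <string>
#include <utility>

class CategoryFilter {
public:
    void EnableCategory(std::string category) {
        m_enabled.insert(std::move(category));
    }

    // An entry passes only if its category is enabled and its level
    // meets the threshold.
    bool ShouldLog(const std::string &category, const Level a_level) const {
        return m_enabled.count(category) > 0 and a_level >= m_min_level;
    }

private:
    std::set<std::string> m_enabled;
    Level m_min_level = Level::info;
};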

Rate limiters

Limiting the log rate means controlling the number of log entries generated over a given period of time. It acts more like a passive defensive mechanism, when compared to log levels and log categories. Also, unlike log levels or log categories, it is relatively rare for a logging framework to have built-in support for rate limiting. So we have to build a logging rate limiter ourselves.

The logging framework

To keep things simple, I will not use any of the real logging frameworks available on the market; instead, I will use an imaginary logging library called logpp, which has the following interface:

// logpp.hpp

#pragma once

#include <chrono>
#include <iostream>
#include <string_view>

// chrono-utils.hpp is assumed to provide stream output for
// std::chrono::system_clock::time_point on pre-C++20 compilers.
#include "chrono-utils.hpp"

class Logpp {
public:
    enum class Level {
        debug = 'D',
        info = 'I',
        warning = 'W',
        error = 'E',
        critical = 'C',
    };

    // Writes one log entry and reports whether the write succeeded.
    auto Log(const Level a_level, const std::string_view message) const {
        const auto now = std::chrono::system_clock::now();
        std::cout << '[' << now << ']' << '(' << static_cast<char>(a_level) << ") " << message
                  << std::endl;
        return true;
    }
};
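
Emitting a message through it looks like this (the output format is illustrative):

Logpp logger;
logger.Log(Logpp::Level::info, "service started");
// Possible output: [2023-06-18 10:15:30](I) service started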

Token bucket

The token bucket algorithm is based on an analogy of a fixed capacity bucket, with tokens being added into the bucket at a constant rate. Once the bucket is full, no more tokens can be added. When a log message is requested to be logged, the bucket is inspected to see if it contains any tokens at that moment. If it does, one token is removed from the bucket, and the log message passes through. Otherwise, the log message is dropped.[1]

The typical implementation of the token bucket algorithm takes two parameters:

- the refill rate: the number of tokens added to the bucket per unit of time;
- the bucket size: the maximum number of tokens the bucket can hold.

Thus, a logger that applies the token bucket algorithm can produce messages at an average rate up to the refill rate, with a burstiness determined by the bucket size.[1] For example, with a refill rate of 100 tokens per second and a bucket size of 200, a burst of up to 200 messages can pass at once, after which the sustained rate is at most 100 messages per second.

The token bucket algorithm can be implemented in C++ like this:

// rate.hpp

#pragma once

#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

class Rate {
public:
    // Normalizes the count to a per-second value; for example,
    // Rate {120, 60s} means 2 per second.
    explicit Rate(const long count, const std::chrono::seconds unit = 1s) :
        m_count_per_second(count * 1s / unit) {
    }

    auto CountPerSecond() const {
        return m_count_per_second;
    }

    // The time between two consecutive items at this rate.
    auto Interval() const {
        assert(m_count_per_second);
        return std::chrono::microseconds(1s) / m_count_per_second;
    }

private:
    long m_count_per_second = 0;
};

// token.hpp

#pragma once

#include <memory>

class Token {
    friend class TokenBucketLimiter;
    Token() = default;
    explicit Token(const std::shared_ptr<long> count) : m_count(count), m_valid(true) {
    }
    Token(const Token &) = delete;
    Token &operator=(const Token &) = delete;
    Token(Token &&) = default;
    Token &operator=(Token &&) = default;

public:
    // A token marked as returned gives its count back to the bucket on
    // destruction, unless the bucket itself is already gone.
    ~Token() {
        if (not m_valid) {
            const auto count = m_count.lock();
            if (count) {
                ++(*count);
            }
        }
    }

    // Marks the token as unused, so that it is returned to the bucket on
    // destruction; useful when the log call fails.
    void Return() {
        m_valid = false;
    }

    explicit operator bool() const {
        return m_valid;
    }

private:
    std::weak_ptr<long> m_count;
    bool m_valid = false;
};

// token-bucket.hpp

#pragma once

#include <algorithm>
#include <chrono>
#include <memory>

#include "rate.hpp"
#include "token.hpp"

class TokenBucketLimiter {
public:
    explicit TokenBucketLimiter(Rate rate) :
        m_rate(std::move(rate)), m_token_count(std::make_shared<long>(0)) {
        *m_token_count = m_capacity = m_rate.CountPerSecond();
    }

    TokenBucketLimiter(Rate rate, const long capacity) :
        m_rate(std::move(rate)), m_capacity(capacity),
        m_token_count(std::make_shared<long>(capacity)) {
    }

    // Returns a valid token if one is available; otherwise an empty token.
    auto FetchToken() {
        fill();

        if (*m_token_count > 0) {
            --(*m_token_count);
            return Token {m_token_count};
        }

        return Token {};
    }

private:
    // Adds tokens for each full second elapsed since the last refill,
    // without exceeding the bucket capacity.
    void fill() {
        const auto now = std::chrono::steady_clock::now();
        const auto duration = now - m_last_time;
        const auto seconds_count =
            std::chrono::duration_cast<std::chrono::seconds>(duration).count();
        if (seconds_count) {
            m_last_time = now;

            *m_token_count =
                std::min(m_capacity, *m_token_count + m_rate.CountPerSecond() * seconds_count);
        }
    }

    std::chrono::time_point<std::chrono::steady_clock> m_last_time =
        std::chrono::steady_clock::now();
    Rate m_rate {0};
    long m_capacity = 0;
    std::shared_ptr<long> m_token_count;
};

Note that the Token class only has private constructors, so only its friend TokenBucketLimiter can create new tokens. FetchToken() can still return a Token by value, because guaranteed copy elision (C++17) constructs the result in place without calling the private move constructor. A fetched token is consumed when it goes out of scope; calling Return() first hands it back to the bucket, which is useful when the log call itself fails.

A logger that uses the token bucket algorithm may look like this:

// token-bucket-logger.hpp

#pragma once

#include "logpp.hpp"
#include "token-bucket.hpp"

class TokenBucketLogger {
public:
    void Info(const std::string_view message) {
        log(Logpp::Level::info, message);
    }

    void Error(const std::string_view message) {
        log(Logpp::Level::error, message);
    }

    explicit TokenBucketLogger(const long log_per_second = 100) :
        m_limiter(Rate {log_per_second}) {
    }

private:
    // A message passes only if a token can be fetched; if the underlying
    // log call fails, the token is handed back to the bucket.
    void log(const Logpp::Level a_level, const std::string_view message) {
        auto a_token = m_limiter.FetchToken();
        if (a_token) {
            if (not m_logger.Log(a_level, message)) {
                a_token.Return();
            }
        }
    }

    TokenBucketLimiter m_limiter;
    Logpp m_logger;
};
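
Here is a quick, hypothetical usage example (the rate value is arbitrary):

#include <string>

#include "token-bucket-logger.hpp"

int main() {
    TokenBucketLogger logger {2};  // allow roughly 2 messages per second
    for (int i = 0; i < 10; ++i) {
        // Within the same second, only the first two messages pass;
        // the limiter drops the rest.
        logger.Info("message " + std::to_string(i));
    }
}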

Leaky bucket

The leaky bucket is an algorithm based on an analogy of a fixed-capacity bucket with a hole that leaks water out at a constant rate. Log messages are added to the bucket just as water is poured into it. When the bucket is full, messages overflow and are lost. Meanwhile, the bucket is inspected at a constant rate to see if it contains any log messages at that moment. If it does, one message is leaked from the bucket and passed through.[2]

There are two versions of the leaky bucket algorithm: the leaky bucket as a meter and the leaky bucket as a queue.

Leaky bucket as a meter

In the leaky bucket as a meter version, the bucket is implemented as a counter or variable that counts the number of log entries currently in the bucket:[2]

- the counter is incremented by one for each log message that passes through;
- the counter is decremented at a constant rate, the leak rate;
- when incrementing the counter would exceed the bucket capacity, the message is dropped.

As you can see, the leaky bucket as a meter is exactly equivalent to a mirror image of the token bucket algorithm; thus, its implementation is omitted in this article.[2]

Leaky bucket as a queue

The leaky bucket as a queue can be seen as a special case of the version as a meter.[2] In this version, the bucket is implemented as a finite FIFO buffer, or queue, of log entries that directly controls the log rate.[2]

The leaky bucket as a queue algorithm takes the following two parameters:

- the leak rate: the number of log entries leaked from the queue per unit of time;
- the bucket size: the maximum number of log entries the queue can hold.

Thus, a logger that uses the leaky bucket as a queue algorithm can produce messages at an average rate up to the leak rate.

The implementation of the leaky bucket as a queue is more elaborate compared to the other rate-limiting algorithms, as it usually involves more than one thread. Below is one possible implementation of a logger that applies the leaky bucket as a queue algorithm:

// leaky-bucket-logger.hpp

#pragma once

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

#include "logpp.hpp"
#include "rate.hpp"

class LeakyBucketLogger {
public:
    void Info(std::string message) {
        log(Logpp::Level::info, std::move(message));
    }

    void Error(std::string message) {
        log(Logpp::Level::error, std::move(message));
    }

    explicit LeakyBucketLogger(const long log_per_second = 100) :
        m_rate(log_per_second), m_capacity(log_per_second) {
    }

    ~LeakyBucketLogger() {
        {
            // Hold the lock while setting the flag, so the worker cannot
            // miss the wake-up between checking the predicate and sleeping.
            std::lock_guard<std::mutex> guard {m_queue_mutex};
            m_abort = true;
        }
        m_queue_cv.notify_all();
        if (m_thread.joinable()) {
            m_thread.join();
        }
    }

    void Start() {
        if (not m_thread.joinable()) {
            m_thread = std::thread(&LeakyBucketLogger::run, this);
        }
    }

private:
    // Enqueues the message unless the bucket is full, in which case the
    // message overflows and is dropped.
    void log(const Logpp::Level a_level, std::string message) {
        {
            std::lock_guard<std::mutex> guard {m_queue_mutex};
            if (m_queue.size() < m_capacity) {
                m_queue.emplace(a_level, std::move(message));
            }
        }

        m_queue_cv.notify_one();
    }

    void run() {
        while (true) {
            std::unique_lock<std::mutex> lock {m_queue_mutex};

            // Sleep until a message arrives or shutdown is requested.
            m_queue_cv.wait(lock, [this] { return m_abort.load() or not m_queue.empty(); });
            if (m_abort.load()) {
                break;
            }

            // Wait out the remainder of the leak interval; without this
            // timed wait, a message could sit in the queue indefinitely,
            // waiting for the next log() call to wake this thread up.
            m_queue_cv.wait_until(lock, m_last_time + m_rate.Interval(),
                                  [this] { return m_abort.load(); });
            if (m_abort.load()) {
                break;
            }

            const auto &[a_level, message] = m_queue.front();

            if (m_logger.Log(a_level, message)) {
                m_last_time = std::chrono::steady_clock::now();
            }

            m_queue.pop();
        }
    }

    std::thread m_thread;

    std::mutex m_queue_mutex;
    std::condition_variable m_queue_cv;
    std::queue<std::pair<Logpp::Level, std::string>> m_queue;

    Logpp m_logger;

    Rate m_rate {0};
    std::chrono::time_point<std::chrono::steady_clock> m_last_time =
        std::chrono::steady_clock::now();
    std::size_t m_capacity = 0;

    std::atomic<bool> m_abort {false};
};

Note that Start() needs to be called manually after a LeakyBucketLogger object is instantiated to actually launch the worker thread. This is a safe idiom for keeping a std::thread as a class member, although it is not strictly necessary in our case.[3]
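
A brief, hypothetical usage sketch:

#include <chrono>
#include <string>
#include <thread>

#include "leaky-bucket-logger.hpp"

int main() {
    LeakyBucketLogger logger {10};  // leak at most 10 messages per second
    logger.Start();                 // launch the worker thread

    // Only the first 10 messages fit into the bucket; the rest overflow.
    for (int i = 0; i < 50; ++i) {
        logger.Info("message " + std::to_string(i));
    }

    // Give the worker time to drain the queue; anything still queued
    // when the logger is destroyed is discarded.
    std::this_thread::sleep_for(std::chrono::seconds(2));
}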

Fixed window counter

The fixed window counter algorithm works as follows:

- time is divided into fixed, consecutive windows, each with a counter starting at zero;
- each logging request increments the counter of the current window;
- once the counter reaches a given threshold, further requests are dropped until the next window starts.

I see the fixed window counter algorithm as a special case of the leaky bucket as a meter algorithm, where the bucket size equals the number of log entries leaked per time window. For this reason, a full implementation of the fixed window counter is omitted.
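
Still, the core idea fits in a few lines; the FixedWindowCounter class below is my own illustration, not taken from the references:

#include <chrono>

class FixedWindowCounter {
public:
    explicit FixedWindowCounter(const long threshold) : m_threshold(threshold) {
    }

    // Returns true if the request is allowed in the current one-second window.
    bool Allow() {
        const auto window =
            std::chrono::floor<std::chrono::seconds>(std::chrono::steady_clock::now());
        if (window != m_current_window) {
            // A new window has started; reset the counter.
            m_current_window = window;
            m_count = 0;
        }
        if (m_count < m_threshold) {
            ++m_count;
            return true;
        }
        return false;
    }

private:
    std::chrono::time_point<std::chrono::steady_clock, std::chrono::seconds> m_current_window {};
    long m_threshold = 0;
    long m_count = 0;
};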

Sliding window log

The sliding window log algorithm tracks the times of logging requests in a log of timestamps. When a new request comes in, timestamps that are older than the start of the current time window are removed from the timestamp log. After that, if the log size is less than the given threshold, the new request passes through and its timestamp is added to the log. Otherwise, the request is dropped.[4]

The sliding window log algorithm takes the following two parameters:

- the window size: the length of the sliding time window;
- the threshold: the maximum number of log entries allowed within one window.

Thus, a logger that uses the sliding window log algorithm can produce messages at an average rate up to the threshold per window.

One possible implementation of the sliding window log algorithm is:

// sliding-log-logger.hpp

#pragma once

#include <chrono>
#include <queue>

#include "logpp.hpp"
#include "rate.hpp"

class SlidingLogLogger {
public:
    void Info(const std::string_view message) {
        log(Logpp::Level::info, message);
    }

    void Error(const std::string_view message) {
        log(Logpp::Level::error, message);
    }

    explicit SlidingLogLogger(const long log_per_second = 100) : m_rate(log_per_second) {
    }

private:
    // Drops timestamps that have fallen out of the current window. The
    // window size is fixed at one second to match Rate, which normalizes
    // counts to a per-second value.
    void evict(const std::chrono::steady_clock::time_point &tp) {
        const auto window_size = 1s;
        while (not m_queue.empty() and (tp - m_queue.front()) > window_size) {
            m_queue.pop();
        }
    }

    void insert(std::chrono::steady_clock::time_point tp) {
        m_queue.push(std::move(tp));
    }

    auto isWithinLimit() const {
        const auto limit = m_rate.CountPerSecond();
        const decltype(limit) size = m_queue.size();
        return size < limit;
    }

    void log(const Logpp::Level a_level, const std::string_view message) {
        auto now = std::chrono::steady_clock::now();

        evict(now);

        if (isWithinLimit() and m_logger.Log(a_level, message)) {
            insert(std::move(now));
        }
    }

    std::queue<std::chrono::steady_clock::time_point> m_queue;
    Rate m_rate;
    Logpp m_logger;
};
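
A quick, hypothetical usage example:

#include "sliding-log-logger.hpp"

int main() {
    SlidingLogLogger logger {3};  // at most 3 messages per one-second window
    for (int i = 0; i < 5; ++i) {
        logger.Info("event");     // the 4th and 5th calls are dropped
    }
}

Note that the algorithm stores one timestamp per accepted message, so its memory cost grows with the threshold.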

Sliding window counter

The sliding window counter algorithm is a variant of the fixed window counter algorithm. It can produce a smoother flow of log messages than the fixed window counter by taking the counter of the previous window into account. The sliding window counter algorithm works as follows:[5:§4]

- time is divided into fixed windows, and counters are kept for the current and the previous window;
- when a request arrives, the number of requests in the sliding window is estimated as the current window’s counter, plus the previous window’s counter weighted by the fraction of the sliding window that still overlaps the previous window;
- if the estimate is below the threshold, the request passes through and the current counter is incremented; otherwise, the request is dropped.

For example, with a threshold of 100 per window, a previous-window count of 80, a current-window count of 20, and the current window 30% elapsed, the estimate is 80 × 0.7 + 20 = 76, which is below the threshold, so the request passes.

The sliding window counter algorithm takes the following two parameters:

- the window size: the length of the time window;
- the threshold: the maximum number of log entries allowed within one window.

Thus, a logger that uses the sliding window counter algorithm can produce messages at an average rate of approximately the threshold per window.

A logger that uses the sliding window counter algorithm can be implemented like this:

// sliding-window-counter-logger.hpp

#pragma once

#include <cassert>
#include <chrono>
#include <utility>

#include "logpp.hpp"
#include "rate.hpp"

class SlidingWindowCounterLogger {
public:
    void Info(const std::string_view message) {
        log(Logpp::Level::info, message);
    }

    void Error(const std::string_view message) {
        log(Logpp::Level::error, message);
    }

    explicit SlidingWindowCounterLogger(const long log_per_second = 100) :
        m_rate(log_per_second) {
    }

private:
    // Rolls the window forward: if exactly one window has passed, the
    // current count becomes the previous count; if more than one window
    // has passed, both counters are reset.
    void evict(const std::chrono::steady_clock::time_point &floor_now) {
        const auto window_size = 1s;
        const auto diff = floor_now - m_current_window;
        if (diff > 0s) {
            m_current_window = floor_now;
            if (diff == window_size) {
                m_last_count = std::exchange(m_current_count, 0);
            } else {
                assert(diff > window_size);
                m_last_count = m_current_count = 0;
            }
        }
    }

    // Estimates the number of entries in the sliding window, weighting the
    // previous window's count by the portion of it still inside the window.
    bool isWithinLimit(const double current_percentage) const {
        const long count = m_last_count * (1 - current_percentage) + m_current_count;
        return count < m_rate.CountPerSecond();
    }

    void insert() {
        ++m_current_count;
    }

    void log(const Logpp::Level a_level, const std::string_view message) {
        const auto now = std::chrono::steady_clock::now();
        const auto floor_now = std::chrono::floor<std::chrono::seconds>(now);

        evict(floor_now);

        const std::chrono::duration<double> diff = now - floor_now;
        const auto window_size = 1s;
        if (isWithinLimit(diff / window_size) and m_logger.Log(a_level, message)) {
            insert();
        }
    }

    Rate m_rate;
    Logpp m_logger;
    std::chrono::steady_clock::time_point m_current_window;
    long m_last_count = 0;
    long m_current_count = 0;
};
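
Note that, unlike the sliding window log, the sliding window counter keeps only two counters regardless of the threshold, trading an exact count for constant memory.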

A few more words

In a perfect world, where software is tested completely and is bug-free, protective mechanisms like logging rate limiters would have no use. However, our reality is far from ideal, and a defect or a malicious usage pattern can suddenly turn a benign logging statement into the source of a disastrous message flood. Thus, it is usually a good idea to adopt logging rate limiters, especially for sophisticated long-running programs.

On the other hand, a log rate limiter that is too restrictive can hinder debugging as well. It is generally unpredictable which messages a limiter will drop, and the dropped messages can easily break the logging context that is critical for debugging. Thus, it is recommended to have at least one way to turn the limiter off.

Conclusion

Since the log is usually the first thing to check when something goes wrong with a program, logging is a powerful tool for debugging. However, logging itself can sometimes become a source of problems.

In today’s post, we have talked about common ways of reducing the number of messages that get logged, with an emphasis on how to integrate protective rate-limiter algorithms with a logging framework. A logging rate limiter can help prevent the logging system from causing resource starvation under a sudden burst of logging requests. For this reason, a logging rate limiter can be seen as a critical part of a robust program.

I hope you have found this post useful. If so, you may want to check out the other articles in the PDS series.

References

  1. Token bucket
  2. Leaky bucket
  3. C++11: std::threads managed by a designated class by Rafał Cieślak
  4. Rate Limiter — Sliding Window Log
  5. System Design Interview - An insider’s guide, Volume 1 (#ad) by Alex Xu