User-Defined Input Stream Buffers


Continuing from my previous post, today’s article explores user-defined input stream buffers. For reasons you will see shortly, input stream buffers are slightly more involved than output ones. However, what we already know about output stream buffers will definitely help us understand the input ones. So, if you haven’t already, I strongly recommend checking out my output stream buffer post first.

This post is largely inspired by a section with a similar title in The C++ Standard Library by Nicolai Josuttis. Great book. Highly recommended.

Unbuffered input stream buffer

As usual, let’s start with an unbuffered one.

// hex-in-stream-nobuf.hpp

#pragma once

#include <unistd.h>

#include <streambuf>
#include <string>

class HexInBuf : public std::streambuf {
public:
    using char_type = std::streambuf::char_type;
    using int_type = std::streambuf::int_type;
    using traits_type = std::streambuf::traits_type;

    HexInBuf(const int fd = STDIN_FILENO) : m_fd(fd) {
    }

protected:
    static constexpr int WIDTH = sizeof(char_type) * 2;

    virtual int_type underflow() override {
        const auto c = uflow();
        if (not traits_type::eq_int_type(c, traits_type::eof())) {
            lseek(m_fd, -WIDTH, SEEK_CUR);
        }

        return c;
    }

    virtual int_type uflow() override {
        std::string hex(WIDTH, 0);
        if (read(m_fd, hex.data(), WIDTH) != WIDTH) {
            return traits_type::eof();
        }

        const char_type c = std::stoi(hex, nullptr, 16);

        return traits_type::to_int_type(c);
    }

    virtual int_type pbackfail(int_type c = traits_type::eof()) override {
        if (lseek(m_fd, -WIDTH, SEEK_CUR) == -1) {
            return traits_type::eof();
        }

        if (traits_type::eq_int_type(c, traits_type::eof())) {
            return traits_type::not_eof(c);
        } else if (traits_type::eq_int_type(c, underflow())) {
            return c;
        }
        return traits_type::eof();
    }

private:
    int m_fd = STDIN_FILENO;
};

Basically, the HexInBuf class is the opposite of the HexOutBuf class. It implements an input stream buffer that reads a hex-encoded string and converts it into a normal string. As you can see, compared to the unbuffered version of HexOutBuf, the unbuffered input stream buffer has to override more virtual functions.
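To make the decoding step concrete, here is a minimal sketch of turning one pair of hex digits into a character. DecodeHexPair is a hypothetical helper of mine, not part of HexInBuf. It uses std::from_chars rather than std::stoi, mainly because std::stoi throws std::invalid_argument when no digits can be parsed, whereas std::from_chars reports the failure through its return value.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not part of the original HexInBuf): decodes exactly
// two hex digits into one character, or returns std::nullopt on bad input.
inline std::optional<char> DecodeHexPair(std::string_view hex) {
    if (hex.size() != 2) {
        return std::nullopt;
    }

    // Parse into an unsigned type so that values of 0x80 and above are in
    // range even on platforms where plain char is signed.
    unsigned char value = 0;
    const char *const end = hex.data() + hex.size();
    const auto [ptr, ec] = std::from_chars(hex.data(), end, value, 16);
    if (ec != std::errc{} || ptr != end) {
        return std::nullopt;  // not hex at all, or only one valid digit
    }

    return static_cast<char>(value);
}
```

For example, DecodeHexPair("3a") yields ':' while DecodeHexPair("zz") yields an empty optional.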

An istream can peek() one character from its associated input stream buffer by calling its member function sgetc(). If no character is available in the get area, sgetc() returns underflow(). The virtual function underflow() is responsible for reading more data in from the underlying input channel. It returns the value of the first character read on success or traits_type::eof() on failure. The base class version of this function does nothing, and returns traits_type::eof().[2] In our example, we use lseek() to move the file offset back to where it was before the read, which lets us avoid setting up a buffer. Note that this trick only works when the file descriptor is seekable: lseek() fails on pipes, FIFOs, and sockets.

An istream can read one character from its associated input stream buffer by using its member function sbumpc(). If no character is available in the get area, sbumpc() returns uflow(). The virtual function uflow() behaves similarly to underflow(), except that it also increments the read pointer. The base class version of the function calls underflow() and, on success, increments the read pointer.[2]

The function snextc() can also be used to read one character. This function first calls sbumpc() to advance the read pointer, then calls sgetc() in order to read the character.
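The difference between these three functions is easiest to see on a buffer that already works. The sketch below uses a std::stringbuf as a stand-in (it is not related to HexInBuf), and TraceSingleCharReads is just a name I made up for the demonstration.

```cpp
#include <sstream>
#include <string>

// Illustrative helper (not from the original code): records what sgetc(),
// sbumpc() and snextc() return for a std::stringbuf holding "abc".
inline std::string TraceSingleCharReads() {
    using traits = std::stringbuf::traits_type;
    std::stringbuf buf("abc", std::ios_base::in);

    std::string trace;
    trace += traits::to_char_type(buf.sgetc());   // 'a': peek only
    trace += traits::to_char_type(buf.sgetc());   // 'a': still there
    trace += traits::to_char_type(buf.sbumpc());  // 'a': read and advance to 'b'
    trace += traits::to_char_type(buf.snextc());  // 'c': skip 'b', then peek
    trace += traits::to_char_type(buf.sbumpc());  // 'c': read and advance past the end
    // Nothing is left, so sgetc() now reports end-of-file.
    trace += traits::eq_int_type(buf.sgetc(), traits::eof()) ? '$' : '?';

    return trace;  // "aaacc$"
}
```

Note that 'b' never shows up in the trace: snextc() consumes it on the way to 'c'.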

The function sgetn() can be used for reading multiple characters at once. This function simply calls the virtual function xsgetn(s, count) of the most derived class. The default implementation of xsgetn() reads characters as if by repeated calls to sbumpc().[2] Like the function xsputn() for output stream buffers, overriding xsgetn() is only necessary if reading multiple characters can be implemented more efficiently than reading characters one at a time.
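As a rough illustration of what that default costs, here is a sketch of an unbuffered buffer that counts its uflow() calls. CountingInBuf is a hypothetical class that serves characters from a string instead of a file descriptor. The exact call pattern of the default xsgetn() is up to the standard library, so treat the count as an observation rather than a guarantee.

```cpp
#include <cstddef>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical unbuffered buffer that serves characters from a string and
// counts how often uflow() is called. It never sets up a get area.
class CountingInBuf : public std::streambuf {
public:
    explicit CountingInBuf(std::string data) : m_data(std::move(data)) {}

    int uflow_calls() const { return m_uflow_calls; }

protected:
    // Peek: report the current character without consuming it.
    int_type underflow() override {
        return m_pos < m_data.size() ? traits_type::to_int_type(m_data[m_pos])
                                     : traits_type::eof();
    }

    // Read: report the current character and move on to the next one.
    int_type uflow() override {
        ++m_uflow_calls;
        return m_pos < m_data.size() ? traits_type::to_int_type(m_data[m_pos++])
                                     : traits_type::eof();
    }

private:
    std::string m_data;
    std::size_t m_pos = 0;
    int m_uflow_calls = 0;
};
```

With the standard libraries I am familiar with, calling sgetn(out, 4) on such a buffer results in four separate uflow() calls, which is exactly the per-character overhead an xsgetn() override could avoid.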

With an input stream buffer, characters can also be put back into the read buffer by using the functions sputbackc(c) and sungetc(). Both functions decrement the read pointer, if possible. The difference is that sputbackc() takes the character to be put back as its argument. If no putback position is available, they return whatever the virtual function pbackfail() returns. By overriding this function, you can implement a mechanism to restore the old read position even in this case. The default base class version of this function does nothing and returns traits_type::eof() in all situations.[1:§15.13.3][2] Our version of pbackfail() also ensures that the given character is indeed the character that was read.
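The putback rules can again be tried out on a std::stringbuf, which I use here purely as a stand-in with a working get area. PutbackDemo is a made-up name, and the buffer is opened read-only on purpose so that a mismatching putback is rejected.

```cpp
#include <sstream>

// Illustrative check (not from the original code). Returns true if every
// step behaves as the comments describe.
inline bool PutbackDemo() {
    using traits = std::stringbuf::traits_type;
    const auto eof = traits::eof();
    std::stringbuf buf("ab", std::ios_base::in);

    // Nothing has been read yet, so there is no putback position.
    if (!traits::eq_int_type(buf.sungetc(), eof)) return false;

    buf.sbumpc();  // consume 'a'
    // sungetc() steps back onto 'a' without being told which character it was.
    if (buf.sungetc() != traits::to_int_type('a')) return false;

    buf.sbumpc();  // consume 'a' again
    // sputbackc() succeeds when its argument matches the character just read.
    if (buf.sputbackc('a') != traits::to_int_type('a')) return false;

    buf.sbumpc();  // consume 'a' once more
    // A mismatching character ends up in pbackfail(), which a read-only
    // std::stringbuf rejects.
    if (!traits::eq_int_type(buf.sputbackc('x'), eof)) return false;

    // The failed putback left the read position on 'b'.
    return buf.sgetc() == traits::to_int_type('b');
}
```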

Here is an example of using this stream buffer:

// test-utils.hpp

#pragma once

#include <iostream>

inline auto &TestHelper(std::istream &in) {
    for (int i = 0; true; ++i) {
        const auto peek_c = in.peek();
        const char get_c = in.get();

        if (not in) {
            break;
        }
        std::cout << get_c << '(' << peek_c << ')';

        if (i % 4 == 0) {
            in.unget();
        }
    }
    std::cout << std::endl;

    return in;
}

// hex-in-stream-nobuf.cpp

#include "hex-in-stream-nobuf.hpp"
#include "test-utils.hpp"

int main() {
    HexInBuf buffer;
    std::istream in(&buffer);

    TestHelper(in);
}

You may run it like this:

$ ./hex-in-stream-nobuf <<< '303a09455e69'
0(48)0(48):(58) (9)E(69)E(69)^(94)i(105)

Single character buffered input stream buffer

Because of how an input stream buffer works, an unbuffered version may not be the simplest way to implement a user-defined input stream buffer. The simplest way is rather an input stream buffer that maintains only a single-character buffer. I will show you how. However, before we dive into the implementation details, we need to understand how the get area interacts with these operations.

The get area is defined by three pointers that can be accessed by the following three member functions:[1:§15.13.3][2]

  1. eback(): (“end putback”) points at the beginning of the get area, or, as the name suggests, the end of the putback area.
  2. gptr(): (“get pointer”) points to the current character in the get area.
  3. egptr(): (“end get pointer”) points to one past the end of the get area.

The function setg(eback, gptr, egptr) sets the values of those three pointers. Characters in the range [eback(), gptr()) are those that can be put back. Characters in the range [gptr(), egptr()) have been transported from the underlying input device, but are still waiting to be processed.[1:§15.13.3]

gbump(offset) repositions gptr() by offset characters relative to its current position. Although it is not clearly stated in the documentation, a negative offset may be given to decrement the read pointer gptr().

As already mentioned, sgetc() returns the value of *gptr() if gptr() < egptr(); otherwise, it returns underflow(). sbumpc() returns uflow() if gptr() == egptr(); otherwise, it returns the value of *gptr() and advances gptr().
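To watch the three pointers move, here is a small sketch of a read-only buffer whose get area spans a whole string. InspectableBuf and its member functions are hypothetical names that only exist for this demonstration, and the code assumes C++17 for the non-const std::string::data().

```cpp
#include <cstddef>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical read-only buffer over a string that exposes the sizes of the
// putback range [eback(), gptr()) and the pending range [gptr(), egptr()).
class InspectableBuf : public std::streambuf {
public:
    explicit InspectableBuf(std::string data) : m_data(std::move(data)) {
        char *const begin = m_data.data();
        setg(begin, begin, begin + m_data.size());
    }

    std::ptrdiff_t putback_size() const { return gptr() - eback(); }
    std::ptrdiff_t pending_size() const { return egptr() - gptr(); }

    // gbump() is protected; this wrapper only exists for the demonstration.
    void step(int offset) { gbump(offset); }

private:
    std::string m_data;
};
```

For a buffer constructed from "abcd", putback_size() starts at 0 and pending_size() at 4. After two sbumpc() calls both are 2, and step(-1) moves one character back from the pending side to the putback side, so that sgetc() returns 'b' again.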

With all this information, now we can implement our single character buffered input stream buffer.

// hex-in-stream-single-buf.hpp

#pragma once

#include <unistd.h>

#include <streambuf>
#include <string>

class HexInBuf : public std::streambuf {
public:
    using char_type = std::streambuf::char_type;
    using int_type = std::streambuf::int_type;
    using traits_type = std::streambuf::traits_type;

    HexInBuf(const int fd = STDIN_FILENO) : m_fd(fd) {
        setg(&m_buffer, &m_buffer + 1, &m_buffer + 1);
    }

protected:
    static constexpr int WIDTH = sizeof(char_type) * 2;

    virtual int_type underflow() override {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(m_buffer);
        }

        std::string hex(WIDTH, 0);
        if (read(m_fd, hex.data(), WIDTH) != WIDTH) {
            return traits_type::eof();
        }

        m_buffer = std::stoi(hex, nullptr, 16);
        gbump(-1);

        return traits_type::to_int_type(m_buffer);
    }

private:
    int m_fd = STDIN_FILENO;
    char_type m_buffer {};
};

Well, that’s basically it. We only need to override underflow(); the default implementation of uflow() just works for this one-character buffer.

Also note that the putback functionality of this HexInBuf is incomplete, as it may fail in certain situations. Can you spot the issue?

Buffered input stream buffer

Of course, a single-character buffer may be easy to implement, but it is not very efficient. Here is how we can extend the same pattern and implement a fully buffered user-defined input stream buffer.

// hex-in-stream-buffer.hpp

#pragma once

#include <unistd.h>

#include <algorithm>
#include <array>
#include <charconv>
#include <streambuf>

class HexInBuf : public std::streambuf {
public:
    using char_type = std::streambuf::char_type;
    using int_type = std::streambuf::int_type;
    using traits_type = std::streambuf::traits_type;

    HexInBuf(const int fd = STDIN_FILENO) : m_fd(fd) {
        // Use data(): std::array iterators are not guaranteed to be raw pointers.
        setg(m_buffer.data(), m_buffer.data(), m_buffer.data());
    }

    virtual ~HexInBuf() {
        sync();
    }

protected:
    static constexpr int WIDTH = sizeof(char_type) * 2;
    static constexpr int SIZE = 512;
    static constexpr int MAX_PUTBACK = 8;

    virtual int_type underflow() override {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }

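        // Keep up to MAX_PUTBACK already-read characters at the front of the
        // buffer. traits_type::move() is safe even if the ranges overlap.
        const auto num_putback = std::min(MAX_PUTBACK, static_cast<int>(gptr() - eback()));
        traits_type::move(m_buffer.data(), gptr() - num_putback, num_putback);

        // Read the raw hex digits right behind the putback area.
        auto *const new_gptr = m_buffer.data() + num_putback;
        const auto n = read(m_fd, new_gptr, (SIZE - num_putback) * WIDTH) / WIDTH * WIDTH;
        if (n <= 0) {
            return traits_type::eof();
        }
        // Decode in place. Parse into an unsigned char first: with a signed
        // char_type, std::from_chars would reject values above 0x7f.
        for (int i = 0; i < n; i += WIDTH) {
            unsigned char byte = 0;
            std::from_chars(new_gptr + i, new_gptr + i + WIDTH, byte, 16);
            new_gptr[i / WIDTH] = static_cast<char_type>(byte);
        }

        setg(m_buffer.data(), new_gptr, new_gptr + (n / WIDTH));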

        return traits_type::to_int_type(*gptr());
    }

    virtual int sync() override {
        if (gptr() < egptr()) {
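            // Move the file offset back (negative offset) over the hex digits
            // of the characters that were buffered but not yet consumed.
            if (lseek(m_fd, -(egptr() - gptr()) * WIDTH, SEEK_CUR) == -1) {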
                return -1;
            }
            setg(eback(), gptr(), gptr());
        }

        return 0;
    }

private:
    std::array<char_type, SIZE * WIDTH> m_buffer;
    int m_fd = STDIN_FILENO;
};

For this version of HexInBuf, one extra thing we need to take care of is saving the old data for putback when refreshing the get area with new characters. Often, we need to move the last few characters of the current buffer to the beginning of the buffer and append the newly read characters thereafter.[1:§15.13.3]
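Stripped of the hex decoding and the file descriptor, the refill pattern can be sketched as follows. ChunkedInBuf is a hypothetical class that pulls its characters from a std::string in small chunks, and the tiny SIZE and MAX_PUTBACK values are chosen only to make the behaviour easy to follow. It assumes C++17.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical buffer that pulls characters from a string in small chunks,
// keeping up to MAX_PUTBACK already-read characters available after a refill.
class ChunkedInBuf : public std::streambuf {
public:
    explicit ChunkedInBuf(std::string source) : m_source(std::move(source)) {
        setg(m_buffer.data(), m_buffer.data(), m_buffer.data());
    }

protected:
    static constexpr std::size_t SIZE = 8;
    static constexpr std::size_t MAX_PUTBACK = 2;

    int_type underflow() override {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        // Check for end of input first, so the old get area stays intact.
        if (m_pos >= m_source.size()) {
            return traits_type::eof();
        }

        // Move the last few consumed characters to the front of the buffer.
        const auto num_putback =
            std::min(MAX_PUTBACK, static_cast<std::size_t>(gptr() - eback()));
        traits_type::move(m_buffer.data(), gptr() - num_putback, num_putback);

        // Append fresh characters right behind the saved ones.
        char *const new_gptr = m_buffer.data() + num_putback;
        const auto n = m_source.copy(new_gptr, SIZE - num_putback, m_pos);
        m_pos += n;

        setg(m_buffer.data(), new_gptr, new_gptr + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    std::string m_source;
    std::array<char, SIZE> m_buffer {};
    std::size_t m_pos = 0;
};
```

With the source "0123456789abcdef", reading nine characters forces one refill. After that, sungetc() can step back over '8', '7' and '6': the character just read plus the two that were saved. A fourth sungetc() fails, because nothing older survived the refill.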

Although it is not strictly required, I also override the virtual function sync(). For input streams, its behavior is implementation-defined. Typically, an implementation may empty the get area and move the current file position back by the corresponding number of bytes.[2]
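The same idea can be sketched without a file descriptor. In the hypothetical RewindingInBuf below, m_pos plays the role of the file offset, and sync() hands the unread characters back to the source, which is my in-memory analogue of seeking backwards.

```cpp
#include <array>
#include <cstddef>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for a file: m_pos plays the role of the
// file offset, and the buffer is refilled four characters at a time.
class RewindingInBuf : public std::streambuf {
public:
    explicit RewindingInBuf(std::string source) : m_source(std::move(source)) {
        setg(m_buffer.data(), m_buffer.data(), m_buffer.data());
    }

    std::size_t source_pos() const { return m_pos; }

protected:
    int_type underflow() override {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        if (m_pos >= m_source.size()) {
            return traits_type::eof();
        }
        const auto n = m_source.copy(m_buffer.data(), m_buffer.size(), m_pos);
        m_pos += n;
        setg(m_buffer.data(), m_buffer.data(), m_buffer.data() + n);
        return traits_type::to_int_type(*gptr());
    }

    int sync() override {
        // Hand the unread characters back to the source and empty the get area.
        m_pos -= static_cast<std::size_t>(egptr() - gptr());
        setg(eback(), gptr(), gptr());
        return 0;
    }

private:
    std::string m_source;
    std::array<char, 4> m_buffer {};
    std::size_t m_pos = 0;
};
```

After reading 'a' from "abcdefgh", source_pos() is 4 because a whole chunk was fetched. Calling pubsync() brings it back to 1, and the next sbumpc() still returns 'b', so the reader does not notice the synchronization.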

Conclusion

As I have shown in today’s post, implementing user-defined input stream buffers usually requires us to override more virtual functions compared to output stream buffers. Still, the key to implementing an input stream buffer lies in knowing when and how to override the corresponding virtual functions to manage the get area appropriately. The full code of this article can be found on my GitHub.

References

  1. The C++ Standard Library, Second Edition by Nicolai Josuttis
  2. std::basic_streambuf