Unicode: A Bigger World of Characters

When I first started working with C++ seriously, I kept seeing types like char8_t, char16_t, and char32_t and wondered why we needed so many different “char” types. The key thing to remember is that each one is simply a distinct type with a fixed size, meant to hold one unit of text in a particular Unicode encoding (UTF-8, UTF-16, or UTF-32).

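In the early days of computing, a character fit in a single byte. ASCII defined 128 characters, and 8-bit extensions stretched that to 256 possible values. That was fine for English, but as computers spread globally, it became obvious that 256 characters weren’t enough. French and German needed accented letters, Russian needed a different alphabet, and Chinese needed thousands of characters, and there was no way to fit them all into one table at the same time.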

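Unicode was created to solve this problem. It’s not magic: it’s a much larger character set that gives every character a number called a code point, with room for 1,114,112 of them (U+0000 through U+10FFFF) instead of just 256. How many bytes a code point takes in memory depends on the encoding: UTF-8 uses 1 to 4 bytes, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes.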

If you’ve ever used the Wingdings font in Windows, imagine trying to squeeze those symbols into the same tiny table as the English alphabet. Unicode avoids that squeeze by making the table big enough for all languages and symbol sets to coexist, on a global scale.

Understanding `char8_t`, `char16_t`, and `char32_t`

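Each of these is a distinct character type with its own size, and each is meant to hold code units of one Unicode encoding. On typical platforms:

char8_t → 1 byte (8 bits), for UTF-8
char16_t → 2 bytes (16 bits), for UTF-16
char32_t → 4 bytes (32 bits), for UTF-32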

There is nothing special about them beyond that. These are just types, and you declare variables of them the same way you would for any other type.

Coding it up:

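#include <cstdint>
#include <print>

int main()
{
    // The literal prefix selects the type: u8 -> char8_t, u -> char16_t, U -> char32_t.
    char8_t  my_one_byte_char  { u8'a' };
    char16_t my_two_byte_char  {  u'a' };
    char32_t my_four_byte_char {  U'a' };

    // std::println has no formatter for these character types, so convert
    // each value to a plain integer before printing it.
    std::println(
        "The character code points are: {} {} {}",
        static_cast<std::uint32_t>(my_one_byte_char),
        static_cast<std::uint32_t>(my_two_byte_char),
        static_cast<std::uint32_t>(my_four_byte_char)
    );
    // Prints: The character code points are: 97 97 97
}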

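The exact same character, with the exact same code point (97 for 'a'), can be stored in memory in multiple ways: here it takes 1, 2, or 4 bytes depending on which type holds it.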
