uselessgoddess/utf8

Strongly UTF-8 in C++

This library provides analogues of std::string_view and std::string, plus a safe mutable string_view, str_mut.
Strings are always valid UTF-8; anything else is undefined behaviour.

Basic Usage

string_view is implicitly constructible from a string literal:

utf8::string_view hello = "Hello, World!";

Invalid UTF-8 in a literal is a compile error:

utf8::string_view hello = "Hello, \xff!"; // compile error
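As a rough illustration of how a literal can be rejected at compile time, the sketch below validates UTF-8 inside a constexpr function and checks the result with static_assert. The name is_valid_utf8 is hypothetical and is not taken from this library; the library's actual mechanism may differ.

```cpp
#include <cstddef>

// Hypothetical helper (not the library's API): true when the first `n`
// bytes of `s` are well-formed UTF-8.
constexpr bool is_valid_utf8(const char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        unsigned long min = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (len > n - i) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3FU);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}

// Convenience overload for string literals (drops the trailing '\0').
template <std::size_t N>
constexpr bool is_valid_utf8(const char (&lit)[N]) {
    return is_valid_utf8(lit, N - 1);
}

static_assert(is_valid_utf8("Hello, World!"), "valid literal");
static_assert(!is_valid_utf8("Hello, \xff!"), "0xFF never appears in UTF-8");
```

A constructor that runs such a check in a constant expression and throws on failure turns an invalid literal into a compile error.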

This library also provides literals for string_view:

auto _1 = ""sv;
auto _2 = ""u;
auto _3 = ""U;

Representation

Strings are always valid UTF-8 byte sequences. Because UTF-8 is a variable-width encoding, strings are typically smaller than an array of the same chars.
The cost is that a character cannot be accessed by index in constant time; reaching the n-th character means iterating from the start:

utf8::string_view s = "hello";
ASSERT_EQ(5, s.size());

auto s = std::vector<utf8::char_t>{ 'h', 'e', 'l', 'l', 'o' };
ASSERT_EQ(20, s.size() * sizeof(utf8::char_t));

utf8::string_view s = "💖💖💖💖💖";
ASSERT_EQ(20, s.size());

utf8::string_view s = "hello";
ASSERT_EQ('l', *std::next(s.chars(), 2));

utf8::string_view s = "💖💖💖💖💖";
ASSERT_EQ(char_t("💖"), *std::next(s.chars(), 2));

Note: 💖 is wider than a single char, so such a character must be created from a string literal whose chars() yields exactly one char_t, for example char_t("💖").
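To make the linear cost concrete, here is a minimal sketch of reaching the n-th code point in a raw UTF-8 buffer. It is independent of this library: utf8_width and nth_code_point are hypothetical names, and the code assumes valid UTF-8 and an in-range index.

```cpp
#include <cstddef>
#include <cstdint>

// Width in bytes of the sequence that starts with `lead` (valid UTF-8 assumed).
constexpr std::size_t utf8_width(unsigned char lead) {
    return lead < 0x80 ? 1
         : (lead & 0xE0) == 0xC0 ? 2
         : (lead & 0xF0) == 0xE0 ? 3
         : 4;
}

// Decodes the code point at position `index`, counted in code points.
// Every earlier character has to be skipped first, hence O(index).
constexpr std::uint32_t nth_code_point(const char* s, std::size_t index) {
    std::size_t i = 0;
    for (; index > 0; --index) {
        i += utf8_width(static_cast<unsigned char>(s[i]));
    }
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    const std::size_t width = utf8_width(lead);
    // The lead byte keeps 7, 5, 4 or 3 payload bits depending on the width.
    std::uint32_t cp = width == 1 ? lead : (lead & (0xFF >> (width + 1)));
    for (std::size_t k = 1; k < width; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3FU);
    }
    return cp;
}

// "\xF0\x9F\x92\x96" is the UTF-8 encoding of U+1F496 (the heart used above).
static_assert(nth_code_point("hello", 2) == 'l', "third char of hello");
static_assert(nth_code_point("\xF0\x9F\x92\x96\xF0\x9F\x92\x96\xF0\x9F\x92\x96", 2)
                  == 0x1F496, "third heart");
```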

You can use .substr, which takes byte offsets and ensures that each one falls on a char boundary:

utf8::string_view s = "hello";
ASSERT_EQ("hell", s.substr(0, 4));

utf8::string_view s = "💖💖💖💖💖";
ASSERT_EQ("💖", s.substr(0, 4));
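A char boundary test on raw bytes can be sketched as follows. This shows the general technique, not the library's code: is_char_boundary is a hypothetical name, and what .substr does when the check fails is not specified here.

```cpp
#include <cstddef>

// Hypothetical helper: true when byte offset `pos` does not split a UTF-8
// sequence in the buffer `s` of `size` bytes (valid UTF-8 assumed).
constexpr bool is_char_boundary(const char* s, std::size_t size, std::size_t pos) {
    if (pos == 0 || pos == size) return true;  // both ends are boundaries
    if (pos > size) return false;              // out of range
    // Continuation bytes look like 10xxxxxx; any other byte starts a character.
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Two hearts, four bytes each: offsets 0, 4 and 8 are the only boundaries.
constexpr char hearts[] = "\xF0\x9F\x92\x96\xF0\x9F\x92\x96";
static_assert(is_char_boundary(hearts, 8, 4), "start of the second heart");
static_assert(!is_char_boundary(hearts, 8, 2), "middle of the first heart");
```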

About types

char_t - a new character type.

A char_t is a Unicode code point. Code points have a fixed numerical definition: they lie in the range 0 to 0x10FFFF, inclusive.

Any value of a plain char is a valid Unicode code point:

char_t array[] = { '1', 'b', '!', char(255) };

No char_t may be constructed, whether as a literal or at runtime, that is not a Unicode scalar value:

char_t(""); // compile error
char_t("hello"); // compile error
char_t("\uDFFF"); // compile error
char_t("💖"); // ok
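For reference, the Unicode scalar value rule itself is small: every code point up to 0x10FFFF except the surrogate range 0xD800 to 0xDFFF. The predicate below is a standalone sketch under a hypothetical name, not a function of this library.

```cpp
#include <cstdint>

// Hypothetical predicate: true when `v` is a Unicode scalar value,
// i.e. a code point that is not a UTF-16 surrogate.
constexpr bool is_scalar_value(std::uint32_t v) {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

static_assert(is_scalar_value(0x1F496), "the heart emoji");
static_assert(!is_scalar_value(0xDFFF), "low surrogate, as in \\uDFFF above");
static_assert(!is_scalar_value(0x110000), "past the end of Unicode");
```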

char_t is compatible with char:

bool is_ascii(char_t ch) {
    return ch <= 0x7F;
}

Representation

char_t is a newtype over std::uint32_t:

static_assert(sizeof(char_t) == sizeof(std::uint32_t));
static_assert(alignof(char_t) == alignof(std::uint32_t));
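The newtype idea can be sketched as a class with a single std::uint32_t member, which on common implementations has the same size and alignment as the underlying integer. The class below is deliberately named code_point to make clear that it is a hypothetical stand-in, not the library's char_t.

```cpp
#include <cstdint>

// Hypothetical newtype: a distinct type that only wraps a std::uint32_t.
class code_point {
public:
    constexpr explicit code_point(std::uint32_t value) : value_(value) {}
    constexpr std::uint32_t value() const { return value_; }

    friend constexpr bool operator==(code_point a, code_point b) {
        return a.value_ == b.value_;
    }
    friend constexpr bool operator<(code_point a, code_point b) {
        return a.value_ < b.value_;
    }

private:
    std::uint32_t value_;
};

// Same layout as the wrapped integer, but not implicitly convertible from it.
static_assert(sizeof(code_point) == sizeof(std::uint32_t), "same size");
static_assert(alignof(code_point) == alignof(std::uint32_t), "same alignment");
```

Making the constructor explicit is what separates the wrapper from a plain typedef: an arbitrary integer cannot silently become a character.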

As always, remember that human intuition about what a "character" is may not match the Unicode definitions.
For example, despite appearances, 'é' (U+00E9) is a single Unicode code point, while the visually identical 'é' (U+0065 followed by U+0301) is two code points:

auto chars = utf8::string_view("é").chars();
auto iter = chars.begin();
// U+00e9: 'latin small letter e with acute'
ASSERT_EQ(char_t("\u00e9"), *iter++);

auto chars = utf8::string_view("e\u0301").chars();
auto iter = chars.begin();
// U+0065: 'latin small letter e'
ASSERT_EQ(char_t("\u0065"), *iter++);
// U+0301: 'combining acute accent'
ASSERT_EQ(char_t("\u0301"), *iter++);

//   in 'constexpr' expansion of 'char_t("e\37777777714\37777777601")'
// error: expression '<throw-expression>' is not a constant expression
//       |     throw "`char_t` from character literal may only contain one codepoint";
auto c = char_t("e\u0301");
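The difference between the two spellings of 'é' shows up as soon as code points are counted. The sketch below counts them in raw UTF-8 bytes by skipping continuation bytes; count_code_points is a hypothetical helper written for this illustration and assumes valid input.

```cpp
#include <cstddef>

// Hypothetical helper: number of code points in the first `n` bytes of `s`.
// Every byte that is not a continuation byte (10xxxxxx) starts a code point.
constexpr std::size_t count_code_points(const char* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Convenience overload for string literals (drops the trailing '\0').
template <std::size_t N>
constexpr std::size_t count_code_points(const char (&lit)[N]) {
    return count_code_points(lit, N - 1);
}

// Precomposed U+00E9 is encoded as C3 A9: two bytes, one code point.
static_assert(count_code_points("\xC3\xA9") == 1, "precomposed form");
// 'e' followed by U+0301 (CC 81): three bytes, two code points.
static_assert(count_code_points("e\xCC\x81") == 2, "decomposed form");
```

A check like this one is what allows a single-character constructor to reject the decomposed form.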

string_view + string == char_t("❤")

utf8::string is a strongly Unicode std::string.
utf8::string_view is any borrowed strongly Unicode char sequence (like std::string_view).

Use string_view to pass strings to functions for read-only access (do not use const string&):

bool is_ascii(utf8::string_view str) {
    auto chars = str.chars();
    return std::distance(chars.begin(), chars.end()) == str.size();
}

...

is_ascii("héllo"); // ok
is_ascii("h\xffllo"); // compile error
is_ascii(utf8::string()); // explicit compile error: `string_view` of temporary value

auto s = utf8::string("hello");
is_ascii(s); // ok

Access

<