This library provides analogues of std::string_view and std::string (plus a safe mutable string_view, str_mut).
Strings are always valid UTF-8 (anything else is undefined behaviour).
string_view is implicitly constructible from a string literal:
utf8::string_view hello = "Hello, World!";
A literal that contains invalid UTF-8 does not compile:
utf8::string_view hello = "Hello, \xff!"; // compile error
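To illustrate how a literal can be rejected at compile time, here is a minimal sketch that works on a plain std::string_view. The names is_valid_utf8 and checked are hypothetical and are not part of this library. The sketch only shows the general technique of throwing inside a constexpr function, which turns invalid input into a compile error whenever the call is evaluated as a constant expression.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not the library's API): true when `s` is well-formed
// UTF-8, i.e. no stray bytes, no overlong forms, no surrogates, max U+10FFFF.
constexpr bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80) { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                      // stray continuation or invalid lead byte
        if (i + len > s.size()) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;      // overlong 2-byte form
        if (len == 3 && cp < 0x800) return false;     // overlong 3-byte form
        if (len == 4 && cp < 0x10000) return false;   // overlong 4-byte form
        if (cp > 0x10FFFF) return false;              // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}

// Hypothetical wrapper: the throw is never reached for valid input, and it is
// not a constant expression, so invalid input fails in a constexpr context.
constexpr std::string_view checked(std::string_view s) {
    if (!is_valid_utf8(s)) throw "invalid UTF-8 in string literal";
    return s;
}

constexpr std::string_view hello = checked("Hello, World!");
// constexpr std::string_view bad = checked("Hello, \xff!");  // would not compile

static_assert(is_valid_utf8("Hello, World!"));
static_assert(!is_valid_utf8("Hello, \xff!"));
```

Called outside a constant expression, checked reports the same problem at runtime by throwing.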
This library also provides literals for string_view:
auto _1 = ""sv;
auto _2 = ""u;
auto _3 = ""U;
Strings are always valid UTF-8 byte sequences. Because UTF-8 is a variable-width encoding, strings are typically
smaller than an array of the same chars.
However, for the same reason, a character cannot be accessed by index in constant time:
utf8::string_view s = "hello";
ASSERT_EQ(5, s.size());
auto s = std::vector<utf8::char_t>{ 'h', 'e', 'l', 'l', 'o' };
ASSERT_EQ(20, s.size() * sizeof(utf8::char_t));
utf8::string_view s = "💖💖💖💖💖";
ASSERT_EQ(20, s.size());
utf8::string_view s = "hello";
ASSERT_EQ('l', *std::next(s.chars(), 2));
utf8::string_view s = "💖💖💖💖💖";
ASSERT_EQ(char_t("💖"), *std::next(s.chars(), 2));
Note: 💖 is wider than a single char, so such a character must be created from a string literal whose lit.chars() yields exactly one character, for example char_t("💖").
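The linear cost of indexing can be seen in a small standalone sketch over std::string_view. The helpers sequence_length and offset_of_nth are hypothetical names used only for illustration, and the sketch assumes the input is already well-formed UTF-8. Finding the n-th character means walking every sequence before it.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of bytes in the sequence that starts with `lead`.
// Assumes `lead` is the first byte of a well-formed UTF-8 sequence.
constexpr std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    return 4;                             // 11110xxx
}

// Hypothetical helper: byte offset of the n-th code point (0-based),
// or s.size() when the string holds fewer than n + 1 code points.
constexpr std::size_t offset_of_nth(std::string_view s, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < s.size()) {
        i += sequence_length(static_cast<unsigned char>(s[i]));
        --n;
    }
    return i < s.size() ? i : s.size();
}

// Five copies of U+1F496, written as raw bytes to avoid source-encoding issues.
constexpr std::string_view hearts =
    "\xF0\x9F\x92\x96" "\xF0\x9F\x92\x96" "\xF0\x9F\x92\x96"
    "\xF0\x9F\x92\x96" "\xF0\x9F\x92\x96";

static_assert(hearts.size() == 20);
static_assert(offset_of_nth("hello", 2) == 2);  // one byte per character
static_assert(offset_of_nth(hearts, 2) == 8);   // four bytes per character
```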
You can use .substr, which ensures that the passed offsets fall on char boundaries:
utf8::string_view s = "hello";
ASSERT_EQ("hell", s.substr(0, 4));
utf8::string_view s = "💖💖💖💖💖";
ASSERT_EQ("💖", s.substr(0, 4));
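As a rough sketch of what a boundary check involves: a byte offset is a char boundary when it is the end of the string or when the byte at that offset is not a continuation byte (10xxxxxx). The functions is_char_boundary and checked_substr below are hypothetical and operate on std::string_view. How this library itself reports a violation is not specified here, so throwing an exception is purely an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical helper: true when `pos` does not split a UTF-8 sequence.
constexpr bool is_char_boundary(std::string_view s, std::size_t pos) {
    if (pos >= s.size()) return pos == s.size();
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Hypothetical checked substr: same arguments as std::string_view::substr,
// but refuses ranges that start or end in the middle of a character.
inline std::string_view checked_substr(std::string_view s,
                                       std::size_t pos,
                                       std::size_t count) {
    if (pos > s.size()) throw std::out_of_range("substr: pos out of range");
    const std::size_t end = pos + std::min(count, s.size() - pos);
    if (!is_char_boundary(s, pos) || !is_char_boundary(s, end)) {
        throw std::invalid_argument("substr: offset is not on a char boundary");
    }
    return s.substr(pos, end - pos);
}

static_assert(is_char_boundary("hello", 4));
static_assert(is_char_boundary("\xF0\x9F\x92\x96", 4));   // end of the string
static_assert(!is_char_boundary("\xF0\x9F\x92\x96", 2));  // inside U+1F496
```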
A char is a Unicode code point.
This has a fixed numerical definition: code points are in the range 0 to 0x10FFFF, inclusive.
Any value of char is a valid Unicode code point:
char_t array[] = { '1', 'b', '!', char(255) };
No char_t may be constructed, whether as a literal or at runtime, that is not a Unicode scalar value:
char_t(""); // compile error
char_t("hello"); // compile error
char_t("\uDFFF"); // compile error
char_t("💖"); // ok
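The scalar-value rule itself is small enough to sketch: a Unicode scalar value is any code point up to 0x10FFFF except the surrogate range 0xD800 to 0xDFFF, which is why "\uDFFF" is rejected above. The names is_scalar_value and scalar are hypothetical and only illustrate the check behind a thin wrapper over std::uint32_t. They are not this library's char_t.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true for code points that are Unicode scalar values.
constexpr bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical wrapper type: holds a std::uint32_t that always passed the check.
// The throw makes invalid constants a compile error and invalid runtime values
// an exception.
struct scalar {
    std::uint32_t value;
    constexpr explicit scalar(std::uint32_t cp)
        : value(is_scalar_value(cp) ? cp : throw "not a Unicode scalar value") {}
};

static_assert(sizeof(scalar) == sizeof(std::uint32_t));
static_assert(is_scalar_value(0x1F496));    // U+1F496 is a scalar value
static_assert(!is_scalar_value(0xDFFF));    // surrogate
static_assert(!is_scalar_value(0x110000));  // beyond the Unicode range

constexpr scalar heart{0x1F496};
// constexpr scalar bad{0xDFFF};  // would not compile
```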
char_t is compatible with char:
bool is_ascii(char_t ch) {
return ch <= 0x7F;
}
char_t is a newtype over std::uint32_t:
static_assert(sizeof(char_t) == sizeof(std::uint32_t));
static_assert(alignof(char_t) == alignof(std::uint32_t));
As always, remember that human intuition about a "character" may not match the Unicode definitions.
For example, although they look the same, the symbol 'é' (precomposed, U+00E9) is a single Unicode code point, while 'é'
('e' followed by the combining accent U+0301) is two code points:
auto chars = utf8::string_view("é").chars();
auto iter = chars.begin();
// U+00e9: 'latin small letter e with acute'
ASSERT_EQ(char_t("\u00e9"), *iter++);
auto chars = utf8::string_view("é").chars();
auto iter = chars.begin();
// U+0065: 'latin small letter e'
ASSERT_EQ(char_t("\u0065"), *iter++);
// U+0301: 'combining acute accent'
ASSERT_EQ(char_t("\u0301"), *iter++);
// in 'constexpr' expansion of 'char_t("e\37777777714\37777777601")'
// error: expression '<throw-expression>' is not a constant expression
// | throw "`char_t` from character literal may only contain one codepoint";
auto c = char_t("é");
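The same difference can be reproduced without the library by decoding the raw bytes. The decode function below is a hypothetical helper that assumes well-formed UTF-8 input and performs no validation. The two spellings are written as byte escapes so the result does not depend on how the source file is normalized.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: decodes well-formed UTF-8 into code points.
// It performs no validation, so malformed input produces meaningless values.
inline std::vector<char32_t> decode(std::string_view s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = b0;  // single-byte (ASCII) case
        if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        // Fold in the six payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Precomposed form: U+00E9 encoded as C3 A9.
inline const std::string_view precomposed = "\xC3\xA9";
// Decomposed form: U+0065 followed by U+0301 (encoded as CC 81).
inline const std::string_view decomposed = "e\xCC\x81";
```

Decoding precomposed gives one code point and decoding decomposed gives two, even though both render as the same glyph.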
utf8::string is a strictly Unicode std::string.
utf8::string_view is any borrowed, strictly Unicode char sequence (like std::string_view).
Use string_view to pass read-only strings to functions (do not use const string&):
bool is_ascii(utf8::string_view str) {
auto chars = str.chars();
return std::distance(chars.begin(), chars.end()) == str.size();
}
...
is_ascii("héllo"); // ok
is_ascii("h\xffllo"); // compile error
is_ascii(utf8::string()); // explicit compile error: `string_view` of temporary value
auto s = utf8::string("hello");
is_ascii(s); // ok