Skip to content

MyRocks record format

luqun edited this page Jan 22, 2019 · 4 revisions

Since RocksDB is a key-value storage, the MySQL structured information has to be encoded in plain key-value form. All data in a single instance is stored in one RocksDB database in flat key-value space. Tables can be distributed between multiple column families. Each table consists of a primary key and, potentially, a set of secondary keys. Each index has its own unique ID. Let's take a look at an example table and how it is stored in RocksDB:

create table t1 (
  a varchar(50) primary key,
  b integer,
  c varchar(75),
  d varchar(100),
  key(c),
  unique(d),
  unique(c,d),
)

Primary key

key: index number, M(a)
value: timestamp*, NULL-bitmap*, b, c, d
* means only store when it is needed

The primary key stores all the data in the table. The key, of the lower level key/value, consists of a 4 byte index number followed by the memcomparable format of the primary key (denoted as 'M()'). The memcomparable format allows comparison of two values by using the simple memcmp "C" function. This is much more efficient than parsing out individual fields from the key and doing type-specific comparisons. To support memcomparable format, MyRocks restricts collations on indexed columns -- only binary, latin1_bin and utf8_bin collations are supported.

The value for the primary key stores extra information(if needed) plus the remainder of non-key columns. In most cases, we do not store the primary key columns in there since we already store them as a part of a key. However, if any information was lost during conversion to the memcomparable form, we store the 'restore' information (sufficient information to restore the original value from the memcomparable value) or the original column value in there. timestamp is mainly used for TTL scenario. Its value is the time when the record is inserted or updated and Its size is 8 bytes.

Secondary index (c)

key: index number, NULL-flag, M(c), M(a)
value: timestamp*, restore data*
* means only exists when it is needed

For non-unique index, we store the memcomparable forms of both the secondary key column and primary key columns as a part of a key. Similar to the primary index, the key starts with a 4 byte index number, followed by the memcomparable form of secondary and primary key columns. If the secondary key column is nullable, it is prepended by a one byte flag. The flag value of 0 means that the field is NULL. In this case, M(c) is empty. The primary key can't be NULL, therefore there is no flag.

The value for secondary index stores timestamp during insert/update record if needed, restore data or the original key column values if needed. if neither of these are needed, value will be empty.

Multi column, secondary index (c,d)

key: index number, NULL-flag, M(c), NULL-flag, M(d), M(a)
value: timestamp*, restore data*
* means only exists when it is needed

Multi-column indexes are encoded by concatenating the memcomparable representations of their constituents one after another. All nullable columns are preceded by NULL bytes. Multi column primary keys also concatenate memcomparable forms one after another. No NULL bytes are needed for primary keys.

The value for secondary index stores timestamp during insert/update record if needed, restore data or the original key column values if needed. if neither of these are needed, value will be empty.

Memcomparable format

It is important to be able to compare keys without de-serializing and doing type-specific comparisons. It is possible to represent MySQL data in a memcomparable way. There is a number of subtle details to get things right, including: multi-column keys, case-insensitive collations, even more complex collations, multi-column rows with variable length columns, etc.

For simple, fixed width data types, we use MySQL-provided facilities. For details, see Field_xxx::make_sort_key method in MySQL code.

  • For integral types the bytes are switched to big-endian order and negative values have all bits negated while positive values have the high bit turned on.
  • For floating point types the bytes are switched to big-endian order and negative values have all bits negated while positive values have the high bit turned on and the exponent gets incremented.
  • For variable-sized data types it depends on whether the data is considered binary data (varbinary or varchar with a binary collation) or not. For binary data we break the data in groups of 8 bytes to which we append an extra byte that signals the number of significant bytes in the previous section. The extra byte can range from 1 to 9. Values of 1 to 8 indicate that many bytes were significant and this group is the last group - a value of 9 indicates that all 8 bytes were significant and there is more data to come.
  • For variable-sized data types that are not binary data we need to support space-padding. Space-padding is a characteristic of some collation types that indicates that the data should be sorted as if it were space-padded out to its maximum length. Currently all supported collations (except binary which is handled above) have the space-padding characteristic. The memcomparable format for space-padding is to break the data into chunks (the size of the chunk depends on the collation), padding the last check with the space character for the collation. Each chunk is followed by a flag byte that has one of three values: 1 means that the data following this chunk sorts as less than spaces, 2 indicates that this is the last chunk - conceptually meaning any data after this would be spaces, and 3 means that the data following this chunk sorts as greater than spaces.

To implement different encodings, we have to convert strings to some basic format. For instance for latin encoding, all we have to do is to convert all characters to the uppercase. This way, memcmp will effectively ignore case. To restore the original data, all we have to do is to store a bitmask which says which characters were converted to uppercase. This is restore data. For more complex collations, things are more trickier. We might choose to use just the original string as restore data. Any restore data will be stored in the value as (1 byte restore tag + 2 byte restore length + restore data). The restore tag is 0x02.

There are two types of memcmp which we do in RocksDB storage engine: regular and reverse. By default, we use direct comparator which is effectively the same as memcmp. However, if the column family name starts with "rev:", we'll be using reverse comparator (return -memcmp(a,b);). It makes it more efficient to iterate ranges in reverse. RocksDB uses prefix compression inside blocks of SST files. This somewhat slows reverse iterator. By reversing the order, we make reverse iteration faster (for instance, on queries such as select * from t1 desc limit 100;)

Row checksums

MyRocks has a feature to optionally store checksums at the end of the value field. Checksum has 9 bytes in total per row (1 byte checksum tag + CRC32 key and CRC32 value). It is also configurable how many percentages of rows to be checksummed. The checksum tag is 0x01.

Clone this wiki locally