Compressed and succint data structures are able to save big amounts of memory while mantaining the ability to access and query efficiently. In this repo you will find some examples using SDSL.
To install the library please checkout this. This library should probably not be used in production because is not mantained anymore. Checkout this fork for a more updated version.
Integer usually occupy some fixed bytes of memory. I will assume 8 bytes from now on. Usual arrays store in memory the bytes one after the other. To read them you just have to skip 8 bytes per iteration, because each number is 8 bytes. What if we wanted to use less bits per integer? We'd have to implement a data structure that is able to store continuously some bits and that can read them one after the other by skipping the appropriate pre-defined bit size of an integer. SDSL does this for us. But first, why would we ever do that? Is this even useful?
In many fields, like search engines, the web pages have an ID assigned to them and are stored in the so called inverted lists. Without going too much in detail, these are ordered lists of integers and are usually stored using gap coding, i.e. storing the difference between two consecutive integers instead of the integers themselves.
// example of an inverted list
// it is in the form word_i -> n, n1, n2, ..., nk
// where word_i is a string contained in a document or a web page
// n, n1... are the documents in which that string is contained
// For example if we have 2 documents
// doc 1: "The cat is on the table"
// doc 2: "The pen is inside the bag"
// The inverted lists would be
// The -> 1, 2; cat -> 1; is -> 1, 2; on -> 1; table -> 1;
// pen -> 2; inside -> 2; bag -> 2;
The first example uses int_vector<5>
in vector_example.cpp
to use only 5 bits per integer. This means that the maximum number we can represent is 31. Depending on the task, this simple but effective technique can reduce the memory occupation of a significant factor.
Using a fixed amount of bits per integer may be a good idea in some particular cases but it has a big flaw that arises when the maximum number in the array is very big and it'is an outlier compared to the other numbers. To solve this issue we can use variable length integer coders. In particular we use elias-$\gamma$ and elias-$\delta$ codes, developed by Peter Elias. Why can't we just store the binary representation of our numbers one ofter the other? Because otherwise we would not be able to recognize one integer from the next one (remember that we are just storing a sequence of bits in memory). If I gave you the sequence S = 011101101001
, telling you that I have encoded some integers in it, would you be able to decode it? It may be 01 110110 1001
or 01 110 110 1001
and many other combinations.
n | bn(n) | len(bn(n)) | u(len(bn(n))-1) | |
---|---|---|---|---|
17 | 10001 | 5 = 101 | 00 | 00 101 10001 |
32 | 100000 | 6 = 110 | 00 | 00 110 10000 |
1024 | 10000000000 | 11 = 1011 | 000 | 000 1011 10000000000 |
And that's it. Now we can compress any number using very few bits.
SDSL implements sdsl::coder::elias_gamma
and sdsl::coder::elias_delta
. Please check SDSL documentation for implementation details, given that it may be slightly different from what I described above. Check integer_codes_examples.cpp
for some examples. To build the actual array you have to do the following
sdsl::int_vector<> v(10 * (1 << 20)); // a big array
v[100] = 1ULL << 63; // set the max
sdsl::util::bit_compress(v); // dynamically find the max and use the fixed bit number representation for v
sdsl::vlc_vector<sdsl::coder::elias_delta, 128> vlc(v); // compress using delta code. What is 128? Keep reading...
Why do we need that 128 in the code? That's the density. You basically store how many bits there are every 128 numbers. Let's make an example.
Suppose you choose to have density {0, 0, 0, 0, 2, 3, 8, 0, 0}
you will obtain the bit vector 1 1 1 1 011 001000 0001001 1 1
(don't worry too much about the bit representation, know that SDSL uses a slightly different implementation compared to the one I described, for example it represents the number {0, 4, 22}
assuming density i
in less time. Actually, it will probably do binary search on it, given that it's sorted, but again you should check implementation details for that.
In the file integer_codes_examples.cpp
you'll also find the function example2()
that shows how much a random array gets compressed. It' roughly a factor of 20 for
The biggest issue of representing integers with sdsl::bit_vector
which is basically a specialization of sdsl::int_vector<1>
. I will now explain Elias Fano assuming that we have the following two functions already implemented.
select1(B, i){
// B is a binary vector that contains 0s and 1s
// i is the position
// returns the position of the i-th bit set to 1 (counting from 1)
// example: B = 1 0 0 1 0 1
// calling select1(B, 2) returns 4, because the second 1 is in position 4
}
How can we implement the select1
function? And how can we do it efficiently? I'm not going to explain this here, but there are ways to do this in constant time select1
because it counts the ones, we also have select0
, the same exact function that works with 0
. There's also another function we need.
rank(B, i){
// returns the number of 1s from position 1 to position i
// remember that the first position in the array is 1 and not 0
}
Again, this can be implemented to work in constant time. Now we can describe how the Elias Fano code works. This will require some math, nothing fancy. Elias Fano works on increasing integers, for example S = {1, 4, 7, 18, 24, 26, 30, 31}
. We define u = S[1] + S[-1] = 1 + 31 = 32
, b = log(u) = 32
(this is $log_2(32)$), l = log(u/n) = log 4 = 2
, where n
is the length of S
, and finally h = b - l = 3
. We need all these variables to calculate the two arrays that actually encode the numbers. Let's first convert S
to binary using b
bits: S = {00001, 00100, 00111, 10010, 11000, 11010, 11110, 11111}
. Now we define the two arrays L
and H
. We'll use L
for the least significant bits of the numbers. We therefore take l = 2
bit for each number, so we get L = 01 00 11 10 00 10 10 11
. The first 01
because the least significant bits of 00001
are 01
. The third and fourth bits are 00
because the 00100
are 00
, and so on. Now we create the H
array, which contains information on the high part. We first notice that the h = 3
more significant bits span from 000
to 111
, so they go from H
by counting how many of each configurations we have, starting from 0
. To make this clearer, let's crate a temporay array T = 000 001 001 100 110 110 111 111
that contains all the first h = 3
bits of each number. Now we just count the configurations and encode them in H
in unary. We'll therefore have H = 10 110 0 0 10 0 110 110
. Let's check this. How many configurations 000
do we have? Just one, therefore we write 1
followed by a 0
, which acts as a separator (otherwise we would not be able to understand where to stop counting for the configuration 000
). Now we check the following configuration, which is 001
and we see that we have 110
, the unary code of 010
, which never happens so we just put a 0
, and so on.
Now we just need a way to access
(that's the name of the function too) a number at a specific index. We assume that we have the select1
implemented. The following is the implementation of access
.
access(L, H, i){
low = L[(i-1)*l + 1 : i*l] // python-like notation
high = to_binary(select1(H, i) - i)
return low + high // + operator to appen high to low
}
Remember that i
starts from 0
). To understand the above code let's make an example. Let's say we want to execute access(5)
. To get the low part we have to get the low part of the fifth number which is 00
from L = 01 00 11 10 00 10 10 11
, in position 9
to 10
, ie L[(i-1)*l + 1 : i*l]
. For the high part, you need to count the number of ones before position i
, using the select function. Now we subtract i
from it and obtain the number of zeroes. So we have 11 - 5 = 6 = 110
. What does the number of zeroes represent? It's the number of binary configurations! So when we get 6
we know that it actually is the most significant part of the number we're searching for. Our result it therefore 11000, which is
There's also another very interesting function in Elias Fano called nextGEQ(n)
. This returns the number in the array greater or equal than n
, and it does so very fast (constant time).
You will not find any complete SDSL code examples in this repo, but I'll give you some guidelines. First of all, as I already said, you will use sdsl::bit_vector
which is just a specalization of sdsl::int_vector<1>
. For example you can write:
sdsl::bit_vector b = {0,1,0,1,1};
// or
sdsl::bit_vector b1(100, 0); // creates a vector with 100 bits all set to 0
Now you need the data structures to support rank
and select
. We don't know how these work internally, but someone already implemented them for us. Use the following code.
sdsl::bit_vector b = {0,1,0,1,1};
sdsl::rank_support_v<1> b_rank(&b); // to obtain a data structure that supports the rank() operation through b_rank(i)
sdsl::select_support_mlc<1> b_select(&b); // to obtain a data structure that supports select(i) through b_select(i)
Now we compress through Elias Fano in the following way.
sdsl::sd_vector<> ef(b); // elias fano of b
// L is contained in ef.low
// H is contained in ef.high
Know that there's a known issue (bug) that will give you different results compared to the ones i described. This bug also makes the library slower (by a small factor). If you use other libraries this is fixed.
Our last example is pointerless programming, to work with strings. As you probably understood by now, when working with big data we want to save every bit. Very simple problems are unfeasible when you have giga bytes, tera bytes or even peta bytes of data. We'll solve a very simple problem with string, trying to be as efficient as possible. We want to store strings in alphabetical order and support searches. What we could do is a char** v;
an array of pointers to char*
, but this approach uses 64 bits
for each pointer (ie for each string) and requires you to jump in memory to get the string, possibly causing a cache miss. Instead of doing this, we'll use the sdsl::bit_vector
to speed things up. Take a look at the following example, remembering that the special character \0
is the string terminator character.
s = {"dog", "cat", "parrot"}
// store the strings 1 after the other using just a char*v
// dog\0cat\0parrot\0
// 1000 1000 1000000 // use this bit array to indicate where each string starts
For implementation details checkout pointerless_programming.cpp
. The idea is to store the strings and the \0
in just a char*v
and to keep another bit vector where we store 1
where a string starts and then as many zeroes as the length of the string -1
plus another 0
for the string terminator. By doing this we don't store any pointers. You can then compress the bit array with Elias Fano.
To search for a number we do a binary search in the bit array. This of course gives you cache misses because you are jumping in the array, but not as many as you would get by using the typical array of pointers approach.