Before proceeding any further, read Compression, Speed, and Search Engines. This explains what variable byte encoding is and why it’s useful.
Here’s one implementation of variable byte encoding by Hugh Williams. When I first looked at this a while ago I didn’t really understand what was going on. Now I’ve been writing some code again I’ve finally figured it out.
This version of variable byte encoding assumes that an integer is 4 bytes long. Since we are using 1 bit from each of the 4 bytes to represent whether this is the last byte for the integer, we lose 4 bits for actually storing the integer. To work out the largest number this code can store turns out to be pretty easy. Since we are using 28 bits (4 bytes * 7 bits) to store the integer, the largest number is (2^28)-1 = 268,435,455.
The process of writing out the encoded integer is broken up into 2 stages.
Stage 1 – Get the binary representation for each of the 4 bytes
tmp = (number%128) << 1;
bytearray[x] = tmp;
number /= 128;
The first stage effectively gets the binary representation for each of the 4 bytes, one by one. As we are using 7 bits for storing the integer, the maximum number we can store in a single byte is 127. Hence we take the integer to be encoded and get the modulus 127. We also need to left shift the bits by 1 place to make room for the indicator bit. Note that in this first stage we don’t bother setting this indicator. Once we’ve stored this byte, we use integer division and divide by 128. This is because we want to store the binary representation of the number which when concatenated with the other bytes will effectively multiply the second byte by 128, the third byte by 16384 (128*128), and the fourth byte by 2097152(128*128*128). During this process if the integer division returns 0, then the remaining bytes are all set to 0.
Stage 2 – Write the integer to a file
if (bytearray[x] != 0 || started == 1)
started = 1;
bytearray[x] |= 0x1;
fwrite(&bytearray[x], sizeof(char), 1, fp);
bytearray |= 0x0;
fwrite(&bytearray, sizeof(char), 1, fp);
The second stage actually writes out the 4 bytes. Well in fact, and this is the point of the compression, if the number is small enough it won’t require 4 bytes. The first byte is the only one which is always required. As the number gets larger, more bytes are needed. With this scheme one byte can store numbers up to 127, 2 bytes up to 16,383, 3 bytes up to 2,097,151, and 4 bytes up to 268,435,455.
We first loop over the last 3 (most significant) bytes. The loop checks to see if each byte is 0 and we haven’t started writing the actual number yet. This ensures we don’t write any bytes that aren’t required but we also write a 0 if it is part of a larger number. Before we actually write the number we perform a bitwise or with 0×1 to set the indicator bit to 1, this is required so that the read function knows this is not the last byte of the number. Once this loop has completed we write out the mandatory first byte. This time we perform a bitwise or with the value of 0×0 to set the indicator bit to 0. This will be the last byte. So that’s it for the write part of the code.
Reading a Variable Byte Encoded Integer
Now that we understand how to write an encoded integer, the process of reading it is fairly straight forward.
int vbyteread(FILE *fp)
char tmp = 0x1;
int val = 0;
while((tmp & 0x1) == 0x1)
if (fread(&tmp, sizeof(char), 1, fp) == 0)
val = (val << 7) + ((tmp >> 1) & 127);
While we have another byte to read (the indicator bit is 1), which we can tell by peforming a bitwise and with 0×1 and checking it equals 0×1, we multiply the current value by 128 (left shift 7 bits), read the byte and right shift it by 1 to get rid of the indicator bit and then bitwise and it with 127 to ensure the eighth bit is set to 0, and then add this to value. This all gets accomplished with the single line:
val = (val << 7) + ((tmp >> 1) & 127);
Maximum Number Using This Code
Here’s a program using the variable byte integer encoding code that shows that the maximum number we can encode is 268,435,455. When we use a number bigger than this the values for the first bytes are still calculated, but any parts bigger are ignored. Hence 268,435,456 will be read back as 0.
When I first tested the speed of using the variable byte integer encoding it seemed to be slower than simply reading uncompressed integers. However if you don’t use enough data, the numbers can be unreliable. Once I tried using a much larger sample of data, the variable byte code proved to be significantly faster.
Here’s a program to write 10 files written with compression, and 10 without. You can compare the file sizes generated by this program and see that the compressed files are around 50% smaller than the uncompressed files. Out of interest, 1,638,400,000 integers are written over the 10 files. To me that’s a lot of integers!
Here’s a program that reads the 20 files and compares the times for reading and uncompressing the compressed files versus reading the uncompressed files. The results from my test were:
Uncompressed Files: 13 minutes 35 seconds
Compressed Files: 09 minutes 13 seconds
This shows that it’s much quicker to read the smaller encoded file and perform the decoding, than to simply read a larger uncompressed file.