
ZSTD performance is not good on LEB128 encoded integers #1325

Closed
wgtmac opened this issue Sep 18, 2018 · 2 comments
wgtmac commented Sep 18, 2018

Hi All,

We use zstd in production to compress various kinds of data, and overall it performs very well. However, it does not handle LEB128-encoded integers well. In case you are not familiar with LEB128, here is a brief introduction: https://en.wikipedia.org/wiki/LEB128.

To reproduce the issue, suppose I have a file called '/tmp/number' which contains 512 unsigned 64-bit integers.

  1. Read the file '/tmp/number' and write two output files:
    1. /tmp/non_encoded: each uint64_t written one by one in little-endian byte order
    2. /tmp/encoded: each unsigned integer written in LEB128 form
#include <cstddef>
#include <cstdint>
#include <fstream>

int main() {
  std::ifstream infile("/tmp/number");
  // open the outputs in binary mode so bytes are written verbatim
  std::ofstream encoded("/tmp/encoded", std::ios::binary);
  std::ofstream non_encoded("/tmp/non_encoded", std::ios::binary);

  uint64_t num;
  while (infile >> num) {
    // write num as a little-endian uint64_t
    uint64_t val = num;
    for (size_t i = 0; i != sizeof(uint64_t); ++i) {
      non_encoded.put(static_cast<char>(val & 0xFF));
      val >>= 8;
    }

    // write num in LEB128: 7 payload bits per byte, high bit set on
    // every byte except the last
    uint64_t encVal = num;
    while (true) {
      // the mask must be 64-bit wide: ~0x7FU is only 32 bits and would
      // zero out the high half of encVal after promotion
      if ((encVal & ~UINT64_C(0x7F)) == 0) {
        encoded.put(static_cast<char>(encVal));
        break;
      } else {
        encoded.put(static_cast<char>(0x80 | (encVal & 0x7F)));
        encVal >>= 7;
      }
    }
  }

  encoded.close();
  non_encoded.close();
}

With the data provided below, /tmp/encoded is 1601 bytes and /tmp/non_encoded is 4096 bytes.

  2. Use the zstd v1.3.5 command line to compress both files:

zstd -1 /tmp/encoded -o /tmp/encoded.zstd
zstd -1 /tmp/non_encoded -o /tmp/non_encoded.zstd

We can see that /tmp/encoded.zstd is 1615 bytes, which is larger than its input (1601 bytes), while /tmp/non_encoded.zstd is only 1167 bytes, far smaller than the compressed LEB128 file. We cannot simply drop LEB128: the file format we use is Apache ORC, which encodes all integers with LEB128 before compression.

Below are the 512 unsigned integers used in the above example. Other data can also reproduce similar results.

16430 40604 119216 135870 138416 154014 156006 157446 166070 169048 169872 171326 171672 175078 181260 185060 190384 195254 198468 200504 205416 208112 210106 215284 229418 236372 241734 246458 251618 254014 259456 262008 262644 262854 263514 273242 274578 294882 300610 300782 310950 323342 341560 356642 361604 372536 373828 374286 378842 381542 381622 383162 393538 395932 401404 403860 408106 413462 417376 418066 419848 426872 431748 433820 434064 434084 446326 447336 449950 452404 455754 456722 464436 465244 466866 470546 471210 472256 488502 489286 490094 496286 497430 499884 512328 519760 520084 521148 521974 523016 527920 533498 537202 538066 550164 557078 560000 564352 568800 569848 573524 576336 586750 591528 599206 604204 605608 608920 610866 615146 615756 616024 637724 638826 650744 652648 653944 656850 665996 668910 678996 683276 683642 688634 690498 703296 706594 722782 726476 726792 728170 730320 730802 734260 735606 743166 752058 752876 756096 756832 760224 767980 776024 778824 779460 784680 785476 787890 788956 795568 801720 807736 812546 816126 826156 829134 830152 832066 833978 839302 843260 846586 848152 849998 854968 857430 860342 863070 868378 873666 878208 880100 880348 882030 888196 890626 894542 901148 912452 913818 922708 927800 930338 937430 937934 939290 942680 953170 956496 957202 962168 965040 967626 971416 973050 974460 977298 995524 1000160 1001884 1004234 1004352 1007738 1016674 1027984 1031116 1033096 1036126 1041218 1043712 1049830 1058250 1076184 1076548 1081950 1081980 1085572 1086716 1092042 1092456 1096662 1105034 1112528 1120002 1120844 1122726 1123156 1124094 1129158 1131988 1135640 1137764 1142528 1147030 1149686 1150846 1153146 1158538 1161122 1162624 1168514 1168590 1186644 1187680 1200682 1204974 1206226 1206816 1208080 1213938 1220692 1221250 1222776 1223706 1224302 1225400 1226322 1230094 1231428 1233224 1245786 1248276 1255350 1256570 1259808 1269716 1271976 1272136 1272810 1275604 1279058 1287772 1288128 1289298 1291064 1293748 1294766 1296516 1297022 1298686 1306372 1334100 1337432 1337902 1340004 1340478 1340492 1346748 1347056 1357246 1362418 1364954 1368568 1371354 1371438 1376988 1378508 1389924 1393882 1419088 1431128 1444444 1445134 1445350 1454244 1455388 1455496 1455502 1456000 1466068 1474194 1479100 1491510 1491706 1493144 1498684 1501148 1503364 1509142 1515350 1515426 1515964 1517772 1519384 1526624 1530906 1532942 1543194 1546688 1551094 1553626 1557050 1565164 1566724 1567754 1569494 1576070 1583992 1586496 1595430 1620604 1622502 1631188 1643550 1645902 1650344 1657934 1659256 1673820 1678200 1685912 1686708 1688824 1694292 1694652 1695386 1696032 1703862 1707266 1713496 1719762 1728546 1735206 1736094 1737282 1744014 1746906 1749764 1752372 1754400 1755488 1760360 1761598 1761736 1775372 1776376 1786458 1790778 1791534 1793526 1796884 1804030 1804568 1804732 1807386 1809080 1813122 1817746 1818716 1821702 1821936 1829522 1831584 1839890 1841044 1842928 1843780 1846144 1852470 1860306 1872664 1879706 1880138 1880716 1889296 1890440 1891476 1892330 1892682 1894678 1895448 1905664 1906354 1907176 1913892 1917432 1927470 1927934 1931006 1934984 1944676 1945594 1946896 1956986 1976642 1984704 1988942 2004434 2007622 2015794 2018842 2023266 2023446 2037252 2037928 2039034 2048308 2049278 2054404 2065782 2069114 2070128 2073736 2075016 2075158 2080916 2096928 2099582 2107644 2110400 2111290 2112072 2114340 2120686 2124722 2131416 2134458 2137282 2143364 2147010 2161646 2163582 2172534 2173832 2175794 2180648 2181226 2181730 2184922 
2187384 2195578 2210600 2228600 2229326 2240022 2240174 2240612 2240886 2242412 2244746 2248850 2254166 2256872 2262338 2263830 2273334 2288762 2292078 2294618 2305160 2308410 2308470 2314266 2321282 2324930 2326762 2334014 2336474 2336802 2337542 2343450 2349658 2357118 2367202 2367740 2370430 2374960 2381288 2381384 2385116 2390962 2403786

I have also tried zlib on these two kinds of data. Regardless of how it compares to zstd overall, zlib compresses the LEB128 data to a smaller size than the little-endian uint64_t data. Therefore I think there may be some room for zstd to improve on LEB128 data. Any comment is welcome. Thanks!

@Cyan4973 (Contributor) commented

The LEB128 format breaks the regularity of the input compared to the non-encoded source, where every number occupies a fixed 64-bit slot. While the format "seems" to reduce the amount of data to compress, it actually makes the compressor's work more difficult. This is unfortunately a common scenario.

Also, note that level 1 of zstd bets on "large repetitions"; that's part of its speed trade-off.
By contrast, zlib's bet is fairly uniform across all compression levels.
Unfortunately, on such data, with LEB128, "large repetitions" do not happen.
To achieve compression, one needs to look for "small repetitions". There are many (complex) ways to do that, the simplest one being to increase the compression level. I would expect level 12+ to perform noticeably better, though still worse than without LEB128.
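
To make that experiment concrete, here is a minimal sketch (not from this thread) that compresses the same file at a few levels through the zstd C API. It assumes libzstd is installed and is compiled with -lzstd; the file path and level choices are only illustrative.

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>
#include <zstd.h>

int main() {
  // read the LEB128 file produced above
  std::ifstream in("/tmp/encoded", std::ios::binary);
  std::vector<char> input((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

  // one output buffer, large enough for any level
  std::vector<char> out(ZSTD_compressBound(input.size()));
  for (int level : {1, 12, 19}) {
    size_t n = ZSTD_compress(out.data(), out.size(),
                             input.data(), input.size(), level);
    if (ZSTD_isError(n)) return 1;
    std::printf("level %2d: %zu -> %zu bytes\n", level, input.size(), n);
  }
  return 0;
}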

A recommended mitigation is to avoid LEB128 encoding for tables of integers.

A better transform for this kind of data would be delta encoding, followed by a transpose, followed by zstd. The input is apparently a sequence of increasing numbers, so the deltas will be roughly proportional to the average distance between consecutive numbers; that should reduce the encoding to fewer than 16 bits per number.
That's more work, obviously.
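
For illustration, here is a minimal sketch of that pipeline: a delta pass followed by a byte-plane transpose, whose output would then be fed to zstd. This is not code from this thread; the helper names delta_encode and transpose are made up for the sketch.

#include <cstddef>
#include <cstdint>
#include <vector>

// replace each value with its distance from the previous one; for an
// increasing sequence the deltas are small, so the high bytes become zero
std::vector<uint64_t> delta_encode(const std::vector<uint64_t>& nums) {
  std::vector<uint64_t> deltas(nums.size());
  uint64_t prev = 0;
  for (size_t i = 0; i < nums.size(); ++i) {
    deltas[i] = nums[i] - prev;
    prev = nums[i];
  }
  return deltas;
}

// byte-plane transpose: all byte 0s first, then all byte 1s, and so on;
// after the delta pass the upper planes are long runs of zeros, which is
// exactly the kind of large repetition zstd's fast levels look for
std::vector<uint8_t> transpose(const std::vector<uint64_t>& vals) {
  std::vector<uint8_t> planes(vals.size() * sizeof(uint64_t));
  for (size_t b = 0; b < sizeof(uint64_t); ++b)
    for (size_t i = 0; i < vals.size(); ++i)
      planes[b * vals.size() + i] = static_cast<uint8_t>(vals[i] >> (8 * b));
  return planes;
}

Compressing transpose(delta_encode(nums)) with zstd should then cost roughly the width of the average delta per number, in line with the < 16-bit estimate above.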

@wgtmac (Author) commented Sep 18, 2018

Thanks @Cyan4973 for replying. I tried different compression levels on both the LEB128-encoded and non-encoded data. It seems that the compression ratio for the LEB128-encoded integers remains unchanged from level 1 to 19. Those suggestions make sense. Thanks!
