This repository contains the source code to reproduce the results and figures presented in the paper "ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale".
This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.
- Create an Amazon EC2 c5.4xlarge instance with Ubuntu Server 22.04 LTS and 20GiB of storage.
- Clone the repository:
git clone https://github.com/dynatrace-research/exaloglog-paper.git && cd exaloglog-paper
- Install all required packages:
sudo apt update && sudo NEEDRESTART_MODE=a apt --yes install openjdk-21-jdk python-is-python3 python3-pip texlive texlive-latex-extra texlive-fonts-extra texlive-science cm-super && pip install -r python/requirements.txt
- To reproduce the estimation error results
results/error/*.csv
run thesimulateEstimationErrors
task (takes ~2h):./gradlew simulateEstimationErrors
- To reproduce the empirically determined memory-variance product (MVP) values based on the actual allocated memory and the serialization size of different data structure implementations for approximate distinct counting run the
runEmpiricalMVPComputation
task (takes ~3h, not needed for the figures):The results can be found in the./gradlew runEmpiricalMVPComputation
results\comparison-empirical-mvp
folder. - To calculate theoretical MVP constants as well as constants used in the Java implementation run the
calculateConstants
task (takes ~10min, not needed for the figures):The ouput can then be found in the./gradlew calculateConstants
results/constants
folder. - To (re-)generate all figures in the
paper
directory execute thepdfFigures
task (takes ~1min):./gradlew pdfFigures