The PGM‑index C++ library provides a handy program to benchmark the query time and the space usage of some of the classes implemented in the library with various configuration parameters. The benchmark works both with synthetic and user-given datasets and query workloads, and it includes a Python notebook to plot the results.
To compile the benchmark executable, follow the instructions for building the code.
The simplest example benchmarks the index structures on four datasets of 10 million elements drawn from various distributions (uniform, binomial, negative binomial, and geometric, respectively):
./benchmark -s10000000
The results will appear on the screen in CSV format. The columns are:
To write the output both to the screen and a file, run:
./benchmark -s10000000 | tee results.csv
The results.csv file can then be used to create plots, as explained in the next section.
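Before moving on to the plots, it can be useful to check that results.csv was written correctly. The following is a minimal sketch using only the Python standard library. It assumes that the first line of the file is a header row with the column names. The helper name summarize_results is made up for this illustration and is not part of the library.

```python
import csv


def summarize_results(path):
    """Return (column names, number of data rows) of a CSV file.

    Assumption: the first line of the file is a header row.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])                  # empty list if the file is empty
        n_rows = sum(1 for row in reader if row)   # skip blank lines
    return header, n_rows


# Example usage (assumes results.csv exists in the current directory):
# columns, n_rows = summarize_results("results.csv")
# print(columns, n_rows)
```

An empty header or a row count of zero suggests that the benchmark did not complete or that its output was not redirected as expected.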
To run the benchmark on your own input data instead, pass the path to a binary file containing a sorted sequence of signed or unsigned 64-bit integers, and specify the -I option for signed integers (or -U for unsigned), as in:
./benchmark -U file.bin | tee results.csv
The input file can be prepared, for instance, in Python by calling np.sort(a).astype('uint64').tofile('file.bin') on a NumPy array a containing your data. For example, the following Python code converts a text file containing newline-separated positive integers to the binary format required by the benchmark:
import numpy as np
a = np.loadtxt('my_data.txt', dtype=np.uint64)
np.sort(a).tofile('file.bin')
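Since the benchmark expects a sorted sequence, it may be worth reading the file back before running it. The sketch below assumes the file is a raw, headerless sequence of 64-bit integers in the machine's native byte order, which is what tofile writes. The helper name check_input_file is hypothetical.

```python
import numpy as np


def check_input_file(path, dtype=np.uint64):
    """Load a raw binary file of 64-bit integers.

    Returns (number of elements, whether they are in non-decreasing order).
    Use dtype=np.int64 for files meant to be read with the -I option.
    """
    a = np.fromfile(path, dtype=dtype)
    is_sorted = bool(np.all(a[:-1] <= a[1:]))  # True for empty or 1-element files
    return int(a.size), is_sorted


# Example usage (assumes file.bin was created as shown above):
# n, ok = check_input_file("file.bin")
# print(n, "elements, sorted:", ok)
```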
Finally, the benchmark program provides options to customise the query workload or to interleave the input keys with values of a given size in bytes (to simulate database records).
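As an illustration of a custom query workload, the sketch below samples query keys from an existing dataset and writes them in the same raw 64-bit binary format as the input files, which is the format the -w option is documented to accept. This is a guess at a reasonable workflow rather than the library's own tooling: the helper name write_workload, the sampling strategy, and the file name workload.bin are all assumptions. Whether the queries must be sorted is not stated here, so the sketch keeps them in random order; apply np.sort to the result if the benchmark requires it.

```python
import numpy as np


def write_workload(keys, n_queries, path, seed=0):
    """Sample n_queries keys (with replacement) and write them as raw integers.

    keys is a NumPy array of 64-bit integers; the output uses the same dtype.
    The queries are left in random order (an assumption, see the text above).
    """
    rng = np.random.default_rng(seed)
    queries = rng.choice(keys, size=n_queries, replace=True)
    queries.astype(keys.dtype).tofile(path)
    return queries


# Example usage (assumes file.bin holds unsigned 64-bit integers):
# keys = np.fromfile("file.bin", dtype=np.uint64)
# write_workload(keys, 1_000_000, "workload.bin")
```

The resulting file would then presumably be passed together with the dataset, for example as ./benchmark -U file.bin --workload=workload.bin.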
To plot the results you need to install the following Python 3 packages:
pip3 install jupyter "numpy>=1.18" "pandas>=0.25" "matplotlib>=3.1" "cpufeature>=0.2"
Then, inside the directory containing the benchmark program and plots.ipynb, run:
jupyter notebook plots.ipynb
Now, execute all the cells via the Run button in the toolbar, changing the input_file variable if necessary so that it points to the CSV file containing the benchmark results. The notebook will show the plots on the screen and save a plots.pdf file.
  benchmark [file...] {OPTIONS}

    Benchmark for the PGM-index library.

  OPTIONS:

      -h, --help                        Display this help menu
      -v, --verbose                     Verbose output
      -V[bytes], --values=[bytes]       Size of the values associated to keys
      QUERY WORKLOAD OPTIONS (mutually exclusive):
        -r[ratio], --ratio=[ratio]        Random workload with the given lookup ratio
        -w[file], --workload=[file]       Custom workload file. Obeys the format of input files
      INPUT DATA OPTIONS (mutually exclusive):
        -s[size], --synthetic=[size]      Generate synthetic data of the given size
        -U, --u64                         Input files contain unsigned 64-bit ints
        -I, --i64                         Input files contain signed 64-bit ints
      file...                           The input files
      "--" can be used to terminate flag options and force all following
      arguments to be treated as positional options