The PGM-index

The Piecewise Geometric Model index (PGM-index) is a data structure that enables fast point and range searches in arrays of billions of items using orders of magnitude less space than traditional indexes.

Read the docs View on GitHub Grab the paper

Unlike traditional tree-based indexes that are blind to the possible regularity present in the input data, the PGM-index exploits a learned mapping between the indexed keys and their location in memory. The succinctness of this mapping, coupled with a peculiar recursive construction algorithm, makes the PGM-index a data structure that dominates traditional indexes by orders of magnitude in space while still offering the best query time performance.

By means of further novel techniques, we are able to turn the PGM-index into a data structure that offers compression, distribution-awareness, and multi-criteria adaptability, thus resulting suitable for addressing the increasing demand for big data systems that adapt to the rapidly changing constraints imposed by the wide range of modern devices and applications.

Features

Learned

Learned

It is one of the first results on learned indexes which achieves astonishing performance by capturing the distribution of the input data.
Optimal

Optimal

Unlike some early results, the PGM‑index is a learned index with provably optimal time-space complexity guarantees.
Memory efficient

Memory efficient

It always consumes less space than traditional tree-based indexes, often orders of magnitude less. If this is not enough, there is even a compressed version.
Fast construction

Fast construction

Even on gigabytes of data, its construction speed matches the one of traditional indexes.
Tunable page size

Tunable

You can tune the index to adapt to the memory hierarchy, space occupancy and time efficiency by properly setting one of its parameters.
Multicriteria

Automatic tuner

The automatic tuner lets you specify a maximum lookup time (or space footprint) and get the corresponding most space-efficient (or fastest) PGM‑index configuration.

Running example

#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include "pgm_index.hpp"

int main(int argc, char **argv) {
    // Generate some random data
    std::vector<int> dataset(1000000);
    std::generate(dataset.begin(), dataset.end(), std::rand);
    dataset.push_back(42);
    std::sort(dataset.begin(), dataset.end());

    // Construct the PGM-index
    const int error = 128;
    PGMIndex<int, error> index(dataset);

    // Query the PGM-index
    auto q = 42;
    auto approx_range = index.find_approximate_position(q);
    auto lo = dataset.cbegin() + approx_range.lo;
    auto hi = dataset.cbegin() + approx_range.hi;
    std::cout << *std::lower_bound(lo, hi, q);

    return 0;
}

Read more about the C++ API here.

Cite us

If you use the library please put a link to this website and cite the following paper:

@misc{FerraginaVinciguerra:2019,
  Archiveprefix = {arXiv},
  Author = {Paolo Ferragina and Giorgio Vinciguerra},
  Day = {14},
  Eprint = {1910.06169},
  Month = {10},
  Primaryclass = {cs.DS},
  Title = {The PGM-index: a multicriteria, compressed and learned approach to data indexing},
  Url = {https://arxiv.org/abs/1910.06169},
  Year = {2019}}

Some interesting uses of the PGM-index

We would love to be informed whether you used our code in your projects. We will list the most interesting applications of the PGM-index here!

Contribute

There are a lot of ways to contribute on this project, just to mention a few:

  1. Engineering the support for insertions and deletions. A starting point could be to use insertion buffers which are periodically merged to the main index.
  2. Making the index SIMD aware. For example, you could set the error to the SIMD register width and use vector instructions to traverse the levels of the index with no branches.
  3. Adding support for concurrent and batch queries.

Feel free to submit issues and pull requests in the GitHub repository.