This page summarises the results on the space and time complexity of the PGM-index described in detail in our publications.
As our model of computation, we do not use the random-access machine model taught in introductory algorithms textbooks, in which one counts the number of computation steps. What matters in modern machines working on large amounts of data is how well algorithms and data structures exploit the memory hierarchy and minimise data-access latency.^{1} Therefore, we use the I/O model of computation (also known as the external-memory model or the two-level memory model), defined by the Encyclopedia of Algorithms as follows:
The input/output model (I/O model) views the computer as consisting of a processor, internal memory (RAM), and external memory (disk). The internal memory is of limited size, large enough to hold $M$ data items. The external memory is of conceptually unlimited size and is divided into blocks of $B$ consecutive data items. All computation has to happen on data in internal memory. Data is brought into internal memory and written back to external memory using I/O operations (I/Os), which are performed explicitly by the algorithm. Each such operation reads or writes one block of data from or to external memory. The complexity of an algorithm in this model is the number of I/Os it performs.
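As a concrete illustration of this cost model, consider scanning $n$ consecutive items: the CPU performs $n$ comparisons, but the I/O model charges only for the blocks transferred. A minimal sketch (the block size 4096 is just an example value):

```python
import math

def scan_io_cost(n, B):
    """I/Os charged for scanning n consecutive items stored in blocks of B items."""
    return math.ceil(n / B)

# Scanning one million items with 4096-item blocks costs 245 I/Os,
# no matter how many CPU steps the scan itself takes.
print(scan_io_cost(1_000_000, 4096))  # → 245
```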
When constructing a PGM-index, one fixes a positive integer parameter $\varepsilon$, which is the maximum absolute error of the piecewise linear approximation (PLA) at the core of the index. The smaller $\varepsilon$ is, the more precisely the PLA locates the approximate position of a query key, which in turn makes queries faster. This precision comes at a cost, however: as $\varepsilon$ decreases, the PLA needs more segments (that is, linear models) to guarantee the higher precision, and thus takes more space.
Let us denote by $n$ the number of input keys and by $m$ the number of segments in the PLA.^{2} It turns out that $m \leq \frac{n}{2\varepsilon}$, and thus the space of the PGM-index is $\Oh(m)=\Oh(\frac{n}{\varepsilon})$ memory words.^{3} This space bound is very pessimistic: in fact, we have proved that it converges to $\Oh(\frac{n}{\varepsilon^2})$ memory words w.h.p. (see below).
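To make the $\varepsilon$–segments tradeoff concrete, here is a simple greedy $\varepsilon$-PLA in Python. It anchors each segment at its first point and intersects the intervals of slopes that keep every covered key within $\pm\varepsilon$ of its rank; this is only a sketch, not the optimal linear-time segmentation used by the real PGM-index, so it may produce more than the minimum number of segments:

```python
import bisect
import random

def pla_segments(keys, eps):
    """Greedy epsilon-PLA over sorted, distinct keys.

    Each segment (first_key, slope, intercept) predicts the rank of any key
    it covers with absolute error at most eps. Anchoring every segment at
    its first point is a simplification of the real PGM-index algorithm."""
    segments, i, n = [], 0, len(keys)
    while i < n:
        lo, hi, j = float("-inf"), float("inf"), i + 1
        while j < n:
            dx = keys[j] - keys[i]
            # slopes that keep keys[j] within +/- eps of its true rank j
            new_lo = max(lo, (j - i - eps) / dx)
            new_hi = min(hi, (j - i + eps) / dx)
            if new_lo > new_hi:  # no slope fits keys[i..j]: close the segment
                break
            lo, hi, j = new_lo, new_hi, j + 1
        slope = (lo + hi) / 2 if j > i + 1 else 0.0
        segments.append((keys[i], slope, i - slope * keys[i]))
        i = j
    return segments

def approx_rank(segments, key):
    """Approximate rank of `key` (assumed >= the smallest indexed key)."""
    idx = bisect.bisect_right([s[0] for s in segments], key) - 1
    _, slope, intercept = segments[idx]
    return slope * key + intercept

random.seed(1)
keys = sorted(random.sample(range(1_000_000), 5000))
for eps in (8, 32, 128):  # larger eps => fewer segments, coarser predictions
    print(eps, len(pla_segments(keys, eps)))
```

Running the loop at the bottom shows the segment count dropping as $\varepsilon$ grows, which is exactly the space/precision tradeoff described above.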
As for the PGM-index operations, let us assume that both the $n$ input keys and the index are kept on disk (we relax the latter assumption below). Setting $\varepsilon=\Theta(B)$ minimises the number of I/Os per operation, giving the following worst-case bounds:
| Predecessor query (static case) | $\Oh(\log_B n)$ I/Os |
|---|---|
| Predecessor query (dynamic case) | $\Oh(\log^2_B n)$ I/Os |
| Insert/delete | $\Oh(\log_B n)$ amortised I/Os |
| Index space | $\Oh(\frac{n}{B})$ memory words $= \Oh(\frac{n}{B^2})$ disk pages |
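The $\Oh(\log_B n)$ query bound comes from the recursive structure of the PGM-index: the first keys of the $m$ segments are themselves indexed by another PLA, and so on, until a single segment remains. The pessimistic guarantee $m \leq \frac{n}{2\varepsilon}$ shrinks each level by a factor of at least $2\varepsilon = \Theta(B)$, and each level is searched with $\Oh(1)$ I/Os, so the height bounds the query cost. A back-of-the-envelope sketch:

```python
import math

def pgm_height(n, eps):
    """Height of a recursive PGM-index if every PLA level shrinks the number
    of keys by a factor of at least 2*eps (the worst-case guarantee)."""
    levels = 0
    while n > 1:
        n = math.ceil(n / (2 * eps))
        levels += 1
    return levels

# With eps = Theta(B), e.g. eps = 2048, a billion keys need only a few levels.
print(pgm_height(10**9, 2048))  # → 3
```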
A range search takes a number of I/Os equal to that of a predecessor query plus an optimal $\Oh(\frac{K}{B})$ I/Os, where $K$ is the number of reported keys.
We can relax the assumption that the PGM-index is kept on disk alongside the $n$ input keys. Indeed, it is reasonable to assume (and, in practice, it is always the case) that the PGM-index fits in internal memory, whereas the input keys are stored on disk.
In this scenario, the PGM-index achieves the following worst-case bounds:
| Predecessor query (static case) | $\Oh(1)$ I/Os |
|---|---|
| Predecessor query (dynamic case) | $\Oh(\log_B n)$ I/Os |
| Insert/delete | $\Oh(\log_B n)$ amortised I/Os |
| Index space | $\Oh(\frac{n}{B})$ memory words |
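The $\Oh(1)$ bound for static queries follows because the in-memory index pinpoints a range of $2\varepsilon+1$ positions, and only that range must be fetched from disk. An interval of $s$ consecutive items overlaps at most $1+\lceil (s-1)/B \rceil$ blocks under the worst alignment, so with $\varepsilon=\Theta(B)$ the query reads a constant number of blocks:

```python
import math

def worst_case_blocks(span, B):
    """Max number of B-item disk blocks that `span` consecutive items can
    overlap, over all alignments of the interval within the blocks."""
    return 1 + math.ceil((span - 1) / B)

B = 4096
eps = B  # eps = Theta(B)
print(worst_case_blocks(2 * eps + 1, B))  # → 3, i.e. O(1) I/Os per query
```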
In our experiments, the PGM-index performed better than these bounds suggest, particularly regarding the space of the index, which we bounded above by $\Oh(\frac{n}{B})$ memory words. This motivated us to study the space occupancy of the PGM-index under some general assumptions on the input data.
In our journal paper On the performance of learned data structures (which extended the ICML 2020 paper Why are learned indexes so effective?), we proved that, if the gaps between the sorted input keys are drawn from a distribution with finite mean and variance, then the space of the PGM-index converges to $\Oh(\frac{n}{B^2})$ memory words, i.e. $\Oh(\frac{n}{B^3})$ disk pages (versus the $\Theta(\frac{n}{B^2})$ disk pages of B-trees). This result holds for any distribution (such as the Uniform, Lognormal, Pareto, Exponential, and Gamma distributions), as long as the mean and variance of the random variables modelling the gaps are finite.
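This behaviour is easy to observe empirically with a simple greedy $\varepsilon$-PLA (anchor-based slope intervals; a sketch, not the PGM-index's optimal segmentation): on keys whose gaps are i.i.d. Exponential, the segment count lands far below the pessimistic $\frac{n}{2\varepsilon}$ bound, consistent with the $\Theta(\frac{n}{\varepsilon^2})$ behaviour.

```python
import random

def pla_segment_count(keys, eps):
    """Segments produced by a simple greedy epsilon-PLA: each segment is
    anchored at its first point, and we keep the interval of slopes that
    holds every covered key within +/- eps of its rank. A sketch, not the
    optimal segmentation used by the real PGM-index."""
    count, i, n = 0, 0, len(keys)
    while i < n:
        lo, hi, j = float("-inf"), float("inf"), i + 1
        while j < n:
            dx = keys[j] - keys[i]
            new_lo = max(lo, (j - i - eps) / dx)
            new_hi = min(hi, (j - i + eps) / dx)
            if new_lo > new_hi:
                break
            lo, hi, j = new_lo, new_hi, j + 1
        count, i = count + 1, j
    return count

random.seed(42)
n, eps = 100_000, 64
# Sorted keys whose gaps are i.i.d. Exponential(1): finite mean and variance.
keys, acc = [], 0.0
for _ in range(n):
    acc += random.expovariate(1.0)
    keys.append(acc)

m = pla_segment_count(keys, eps)
print(m, n // (2 * eps))  # m is far smaller than the pessimistic bound n/(2*eps)
```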
| Index space | $\Oh(\frac{n}{B^2})$ memory words $=\Oh(\frac{n}{B^3})$ disk pages |
|---|---|
A classic survey on this topic is “External Memory Algorithms and Data Structures: Dealing with Massive Data” by Jeffrey Scott Vitter. ↩︎
One can inspect $m$ by calling the `segments_count()` method. ↩︎
See Lemma 2 and Section 2.2 of the VLDB paper. ↩︎