Indexing Temporal Relations for Range-Duration Queries

Supplemental material

Authors
Affiliations
Matteo Ceccarello

University of Padova

Anton Dignös

Free University of Bozen-Bolzano

Johann Gamper

Free University of Bozen-Bolzano

Christina Khnaisser

Université de Sherbrooke

Published

February 8, 2023

This document complements the experimental results of the paper. In particular, in this document we provide interactive plots to explore the behavior of different index structures on mixed query workloads.

Mixed-workload queries

The paper reports on the performance of various index structures for uniform workload queries, i.e. range-only queries, duration-only queries, or range-duration queries.

Table 1, as reported in the paper
dataset query RD-Index-td RD-Index-dt GridFile Period-Index* RTree HINT ITree BTree
Random qro 415 602 468 242 12 164 867 272 8
Random qdo 842 734 77 69 106 6 12 43 103
Random qrd 11 737 14 085 1 581 1 403 2 591 72 215 502
Flight qro 54 945 57 471 49 505 15 015 4 921 1 342 098 51 813 2 798
Flight qdo 4 182 4 218 3 791 538 378 169 683 714 286
Flight qrd 232 558 208 333 188 679 29 940 21 786 11 439 43 860 13 514
Webkit qro 416 467 445 83 30 525 897 384 47
Webkit qdo 2 544 2 520 1 938 57 234 64 157 3 393
Webkit qrd 3 159 3 133 2 322 80 258 95 256 758
MimicIII qro 372 391 391 381 38 298 566 347 127
MimicIII qdo 3 560 2 876 2 075 520 452 25 76 2 500 000
MimicIII qrd 10 246 10 730 5 227 1 779 1 384 84 232 3 347

A natural question is: how does the performance change when these three types of queries are mixed together? Using data from the above table we can estimate performance figures for mixed workloads. In particular, the inverse of the throughputs reported in the above table are the average running time of each specific type of query. The harmonic mean of the throughputs is the throughput of the mixed workload: \[ QPS_{mixed} = \frac{1}{ f_{ro} / QPS_{ro} + f_{do} / QPS_{do} + f_{rd} / QPS_{rd} } \] where \(f_{ro}\) is the fraction of range only queries in the workload, \(f_{do}\) is the fraction of duration only queries in the workload, and \(f_{rd}\) is the fraction of range-duration queries.

Using the following radio buttons, we can select on which dataset to focus on.

Then, the ternary plot below allows to select different workloads by hovering the mouse on top of it. Towards the top corner we have workloads composed for the most part by range only queries. The bottom left corner is for workloads with a majority of duration only queries. In the bottom right corner we have workloads with mainly range-duration queries. The table above the plot reports the precise composition of the select workload. At the right of the ternary plot, a bar chart displays the performance attained by the different index structures on the selected workload, in queries per second. The ternary plot area’s itself is colored with the color of the best performing index in each particular point.

First and foremost, observe that for the majority of workloads on all datasets we have that one of the two variants of the RD-Index is the best performing data structure. This is true also when a considerable part of the workload is comprised of range-only or duration-only queries.

On all datasets, when the vast majority of queries are duration-only queries, the best performing data structure is the B-Tree. Similarly, when the majority of queries is range-only, the best performing data structure is HINT.

The threshold of duration-only and range-only queries at which these indices are the best varies from dataset to dataset, and depends also on how many range-duration queries are involved1. In particular, for the Random the B-Tree has the best performance only when there are at most \(\approx 37\%\) range-duration queries and no range-only queries. On Flight, with either \(\ge 60\%\) duration-only and no range-only queries, or \(76\%\) range-duration and \(23\%\) duration-only queries, the B-Tree is the best data structure. On the other two datasets, the B-Tree outperforms the RD-Index only when the vast majority of queries are range-only.

Finally, observe that the performance of the RD-Index is rather robust to the choice of indexing order: moving the mouse on the ternary plot while observing the bar chart shows that the throughput of RD-Index-dt is similar to the throughput of RD-Index-td.

Footnotes

  1. The B-Tree indexes only the duration dimension of intervals. Therefore, queries constraining only the range do not benefit from this index, but range-duration queries do, although not as much as duration-only queries.↩︎