Indexing Temporal Relations for Range-Duration Queries
Supplemental material
This document complements the experimental results of the paper. In particular, in this document we provide interactive plots to explore the behavior of different index structures on mixed query workloads.
Mixed-workload queries
The paper reports on the performance of various index structures for uniform workload queries, i.e. range-only queries, duration-only queries, or range-duration queries.
dataset | query | RD-Index-td | RD-Index-dt | GridFile | Period-Index* | RTree | HINT | ITree | BTree |
---|---|---|---|---|---|---|---|---|---|
Random | qro | 415 | 602 | 468 | 242 | 12 | 164 867 | 272 | 8 |
Random | qdo | 842 | 734 | 77 | 69 | 106 | 6 | 12 | 43 103 |
Random | qrd | 11 737 | 14 085 | 1 581 | 1 403 | 2 591 | 72 | 215 | 502 |
Flight | qro | 54 945 | 57 471 | 49 505 | 15 015 | 4 921 | 1 342 098 | 51 813 | 2 798 |
Flight | qdo | 4 182 | 4 218 | 3 791 | 538 | 378 | 169 | 683 | 714 286 |
Flight | qrd | 232 558 | 208 333 | 188 679 | 29 940 | 21 786 | 11 439 | 43 860 | 13 514 |
Webkit | qro | 416 | 467 | 445 | 83 | 30 | 525 897 | 384 | 47 |
Webkit | qdo | 2 544 | 2 520 | 1 938 | 57 | 234 | 64 | 157 | 3 393 |
Webkit | qrd | 3 159 | 3 133 | 2 322 | 80 | 258 | 95 | 256 | 758 |
MimicIII | qro | 372 | 391 | 391 | 381 | 38 | 298 566 | 347 | 127 |
MimicIII | qdo | 3 560 | 2 876 | 2 075 | 520 | 452 | 25 | 76 | 2 500 000 |
MimicIII | qrd | 10 246 | 10 730 | 5 227 | 1 779 | 1 384 | 84 | 232 | 3 347 |
A natural question is: how does the performance change when these three types of queries are mixed together? Using data from the above table we can estimate performance figures for mixed workloads. In particular, the inverse of the throughputs reported in the above table are the average running time of each specific type of query. The harmonic mean of the throughputs is the throughput of the mixed workload: \[ QPS_{mixed} = \frac{1}{ f_{ro} / QPS_{ro} + f_{do} / QPS_{do} + f_{rd} / QPS_{rd} } \] where \(f_{ro}\) is the fraction of range only queries in the workload, \(f_{do}\) is the fraction of duration only queries in the workload, and \(f_{rd}\) is the fraction of range-duration queries.
Using the following radio buttons, we can select on which dataset to focus on.
Then, the ternary plot below allows to select different workloads by hovering the mouse on top of it. Towards the top corner we have workloads composed for the most part by range only queries. The bottom left corner is for workloads with a majority of duration only queries. In the bottom right corner we have workloads with mainly range-duration queries. The table above the plot reports the precise composition of the select workload. At the right of the ternary plot, a bar chart displays the performance attained by the different index structures on the selected workload, in queries per second. The ternary plot area’s itself is colored with the color of the best performing index in each particular point.
First and foremost, observe that for the majority of workloads on all datasets we have that one of the two variants of the RD-Index is the best performing data structure. This is true also when a considerable part of the workload is comprised of range-only or duration-only queries.
On all datasets, when the vast majority of queries are duration-only queries, the best performing data structure is the B-Tree
. Similarly, when the majority of queries is range-only, the best performing data structure is HINT
.
The threshold of duration-only and range-only queries at which these indices are the best varies from dataset to dataset, and depends also on how many range-duration queries are involved1. In particular, for the Random
the B-Tree has the best performance only when there are at most \(\approx 37\%\) range-duration queries and no range-only queries. On Flight
, with either \(\ge 60\%\) duration-only and no range-only queries, or \(76\%\) range-duration and \(23\%\) duration-only queries, the B-Tree is the best data structure. On the other two datasets, the B-Tree outperforms the RD-Index only when the vast majority of queries are range-only.
Finally, observe that the performance of the RD-Index is rather robust to the choice of indexing order: moving the mouse on the ternary plot while observing the bar chart shows that the throughput of RD-Index-dt is similar to the throughput of RD-Index-td.
Footnotes
The B-Tree indexes only the duration dimension of intervals. Therefore, queries constraining only the range do not benefit from this index, but range-duration queries do, although not as much as duration-only queries.↩︎