Indexing Temporal Relations for Range-Duration Queries

Supplemental material

Authors

Affiliations

Matteo Ceccarello

University of Padova

Anton Dignös

Free University of Bozen-Bolzano

Johann Gamper

Free University of Bozen-Bolzano

Christina Khnaisser

Université de Sherbrooke

Published

February 8, 2023

This document complements the experimental results of the paper. In particular, in this document we provide interactive plots to explore the behavior of different index structures on mixed query workloads.

Mixed-workload queries

The paper reports on the performance of various index structures for uniform workload queries, i.e. range-only queries, duration-only queries, or range-duration queries.

Table 1, as reported in the paper
dataset	query	RD-Index-td	RD-Index-dt	GridFile	Period-Index*	RTree	HINT	ITree	BTree
Random	qro	415	602	468	242	12	164 867	272	8
Random	qdo	842	734	77	69	106	6	12	43 103
Random	qrd	11 737	14 085	1 581	1 403	2 591	72	215	502
Flight	qro	54 945	57 471	49 505	15 015	4 921	1 342 098	51 813	2 798
Flight	qdo	4 182	4 218	3 791	538	378	169	683	714 286
Flight	qrd	232 558	208 333	188 679	29 940	21 786	11 439	43 860	13 514
Webkit	qro	416	467	445	83	30	525 897	384	47
Webkit	qdo	2 544	2 520	1 938	57	234	64	157	3 393
Webkit	qrd	3 159	3 133	2 322	80	258	95	256	758
MimicIII	qro	372	391	391	381	38	298 566	347	127
MimicIII	qdo	3 560	2 876	2 075	520	452	25	76	2 500 000
MimicIII	qrd	10 246	10 730	5 227	1 779	1 384	84	232	3 347

A natural question is: how does the performance change when these three types of queries are mixed together? Using data from the above table we can estimate performance figures for mixed workloads. In particular, the inverse of the throughputs reported in the above table are the average running time of each specific type of query. The harmonic mean of the throughputs is the throughput of the mixed workload: \[ QPS_{mixed} = \frac{1}{ f_{ro} / QPS_{ro} + f_{do} / QPS_{do} + f_{rd} / QPS_{rd} } \] where \(f_{ro}\) is the fraction of range only queries in the workload, \(f_{do}\) is the fraction of duration only queries in the workload, and \(f_{rd}\) is the fraction of range-duration queries.

aq = require('arquero');
d3 = require("d3@7")
d3ternary = require("d3-ternary@2")

Using the following radio buttons, we can select on which dataset to focus on.

viewof datasetName = Inputs.radio(["Random", "Flight", "Webkit", "MimicIII"], {value: "Random", label: "Dataset"})

Then, the ternary plot below allows to select different workloads by hovering the mouse on top of it. Towards the top corner we have workloads composed for the most part by range only queries. The bottom left corner is for workloads with a majority of duration only queries. In the bottom right corner we have workloads with mainly range-duration queries. The table above the plot reports the precise composition of the select workload. At the right of the ternary plot, a bar chart displays the performance attained by the different index structures on the selected workload, in queries per second. The ternary plot area’s itself is colored with the color of the best performing index in each particular point.

// Function to transform a triplet of number adding up to 1 into a triplet of integers adding up to 100
function roundTriplet(triplet) {
  const [x, y, z] = d3.map(triplet, x => Math.round(x * 100))
  const excess = x + y + z - 100;
  return [x, y, z - excess];
}

// This is the table of all the performance and its tradeoffs
// table = aq.table(tradeoffs)
table = aq.loadCSV("tradeoffs.csv")

// This is the map holding the best index for each workload
bestMapData = table.params({dataName: datasetName}).filter((d, $) => d.dataset == $.dataName)
  .derive({
    triplet: aq.escape(d => roundTriplet([d.frac_ro, d.frac_do, d.frac_rd]))
  })
  .groupby("dataset", "triplet")
  .orderby(d => -d.qps)
  .slice(0, 1)

algoDomain = [
  "RD-index-td"   ,
  "RD-index-dt"   ,
  "Grid-File"     ,
  "Period-Index*" ,
  "R-Tree"        ,
  "HINT"          ,
  "Interval-Tree" ,
  "B-Tree"        
];

triplet = (bestMap != 0)? roundTriplet(bestMap.coords) : [0,0,0]; // Return the impossible triplet if a selection has yet to be made
selectedSubset = d3.filter(table, 
  d => d.dataset == datasetName &&
    Math.round(d.frac_ro * 100) == triplet[0] && 
    Math.round(d.frac_do * 100) == triplet[1] && 
    Math.round(d.frac_rd * 100) == triplet[2])

axisLabels = (g, labels) =>
  g
    .selectAll("text")
    .data(labels, d => d.label)
    .join(
      enter =>
        enter
          .append("text")
          .attr("dy", (d, i) => (i === 0 ? "2em" : ".5em"))
          .attr(
            "transform",
            (label, i) => `translate(${label.position})`
          )
          .attr("text-anchor", "middle")
          .text(label => label.label),
      update =>
        update.attr(
          "transform",
          (label, i) => `translate(${label.position})`
        )
    )

valuesTable = (labels, plot) => html`
<table style="height:50px; width:100%">
  <thead>
    <tr style="width:33%">${labels.map((label) => `<th>${label}</th>`)}</tr>
  </thead>
  <tbody>
    <tr>${
      plot === 0
        ? `<td colspan="3">
            Hover your mouse over the triangle to select a workload
          </td>`
        : plot.coords.map(c => `<td>${d3.format(".0%")(c)}</td>`)
    } </tr>
  </tbody>
</table>
`
valuesTable(['Range only', 'Duration only', 'Range-duration'], bestMap)

viewof bestMap = {
  const height = 350;
  const width = 500;
  const radius = Math.min(width, height) / 2;
  // to keep the triangle centered
  const yOffset = radius/4;
  // Set up the chart space
  const node = DOM.svg(width, height);
  const svg = d3.select(node);
  const chart = svg
    .append("g")
    .attr("transform", `translate(${width / 2} ${height / 2 + yOffset})`)
    .on("mousemove", handleMouseMove);

  const circle = chart.append("g");
  
  // Handler of mouse hover events. It stores the state
  // underlying the mouse in the SVG node itself.
  function handleMouseMove(d) {
    const xy = d3.pointer(d);
    const inverse = ternary.invert(xy);

    node.dispatchEvent(new CustomEvent("input"), { bubbles: true });
    node.value = {
      coords: inverse,
      data: d.target.__data__
    };
  }

  // Define transformer in ternary coordinates
  const normalBarycentric = d3ternary.barycentric()
    .a(d => d.frac_ro)
    .b(d => d.frac_do)
    .c(d => d.frac_rd);
  const ternary = d3ternary.ternaryPlot(normalBarycentric)
    .radius(radius)
    .labels(["Range only", "Duration only", "Range-Duration"]);

  const axisLabelsGroups = chart
    .append("g")
    .attr("class", "axis-labels")
    .append("g")
    .attr("class", "labels")
    .attr("font-size", 16);

  axisLabelsGroups.call(axisLabels, ternary.axisLabels());
  
  // Add the ternary components to the data
  const ternaryData = d3.map(bestMapData, (d) => {
    const [x, y] = ternary(d);
    return { x, y, ...d };
  });

  // border triangle
  const trianglePath = chart
    .append("path")
    .attr("d", ternary.triangle())
    .attr("fill", "transparent")
    .attr("stroke", "black")
    .attr("stroke-width", 6);

  // Set up color scale
  const color = d3.scaleOrdinal(d3.schemeTableau10)
    .domain(algoDomain);

  // Add the points
  chart.append("g")
    .selectAll("circle")
    .data(ternaryData)
    .join("circle")
    .attr("r", 2)
    .attr("cx", (d) => d.x)
    .attr("cy", (d) => d.y)
    .attr("fill", d => color(d.algorithm))
    .attr("stroke", "none");

  node.value = 0;

  return svg.node();
}

Plot.plot({
  width: 800,
  height: 600,
  x: {
    domain: d3.sort(selectedSubset, d => -d.qps).map(d => d.algorithm)
  },
  color: {
    type: "categorical",
    scheme: "tableau10",
    domain: algoDomain
  },
  marginLeft: 90,
  marginTop: 90,
  marginBottom: 50,
  marks: [
    Plot.barY(
      selectedSubset,
      {
        x: "algorithm",
        y: "qps",
        fill: "algorithm"
      }
    ),
    Plot.text(
      selectedSubset,
      {
        x: "algorithm",
        y: "qps",
        text: d => d3.format("d")(d.qps),
        dy: -10
      }
    ),
    Plot.ruleY([0])
  ]
});

First and foremost, observe that for the majority of workloads on all datasets we have that one of the two variants of the RD-Index is the best performing data structure. This is true also when a considerable part of the workload is comprised of range-only or duration-only queries.

On all datasets, when the vast majority of queries are duration-only queries, the best performing data structure is the B-Tree. Similarly, when the majority of queries is range-only, the best performing data structure is HINT.

The threshold of duration-only and range-only queries at which these indices are the best varies from dataset to dataset, and depends also on how many range-duration queries are involved¹. In particular, for the Random the B-Tree has the best performance only when there are at most \(\approx 37\%\) range-duration queries and no range-only queries. On Flight, with either \(\ge 60\%\) duration-only and no range-only queries, or \(76\%\) range-duration and \(23\%\) duration-only queries, the B-Tree is the best data structure. On the other two datasets, the B-Tree outperforms the RD-Index only when the vast majority of queries are range-only.

Finally, observe that the performance of the RD-Index is rather robust to the choice of indexing order: moving the mouse on the ternary plot while observing the bar chart shows that the throughput of RD-Index-dt is similar to the throughput of RD-Index-td.

Footnotes

The B-Tree indexes only the duration dimension of intervals. Therefore, queries constraining only the range do not benefit from this index, but range-duration queries do, although not as much as duration-only queries.↩︎