SPARQL – the de facto standard query language for RDF data – plays a central role within the MICO system. As a base technology “under the hood” for metadata creation and retrieval, it is, together with RabbitMQ messaging, the link between annotators, metadata storage and web services. Efficient evaluation of SPARQL queries is therefore mandatory to keep the performance of the overall system at a high level. With the introduction of SPARQL-MM’s media fragment filter functions it becomes clear that common pattern-based optimization algorithms fail when faced with highly selective filters. In this blog post we show where common optimization algorithms produce non-optimal query plans and propose an extension towards filter-aware cost calculation to overcome this problem.

Why complex filters are a mess

To illustrate why complex SPARQL queries may lead to non-optimal query execution times we start with a simple example:

SELECT ?person WHERE {
   ?animal a ex:Animal .
   ?person a ex:Person .
   ?fragment1 dc:subject ?animal .
   ?fragment2 dc:subject ?person .
   FILTER mm:rightBeside(?fragment1, ?fragment2)
}

We can translate this query into SPARQL algebra (prefixes etc. are skipped for readability):

(project (?person)
  (filter (mm:rightBeside ?fragment1 ?fragment2)
    (bgp
      (triple ?animal rdf:type ex:Animal)
      (triple ?person rdf:type ex:Person)
      (triple ?fragment1 dc:subject ?animal)
      (triple ?fragment2 dc:subject ?person)
)))

A naive executor would evaluate this query as-is from the inside out. The triple patterns of the basic graph pattern (bgp) would be joined one after the other before the filter is executed on the complete join result. This may lead to non-optimal (larger than necessary) intermediate results, e.g. when the triple patterns have different levels of selectivity, which can increase the execution time significantly. Therefore we need to optimize the query before execution with respect to the query characteristics.
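
To make this concrete, here is a minimal sketch in plain Java with invented cardinalities; the class name and all numbers are purely illustrative and not part of MICO or Apache Marmotta. It only shows how much smaller the materialized intermediate result can be when a highly selective filter is evaluated while solutions are combined instead of afterwards:

public class FilterPlacementSketch {

    public static void main(String[] args) {
        // Hypothetical sizes of the two partial results that still have to be combined:
        long animalFragments = 40_000;  // solutions for { ?animal a ex:Animal . ?fragment1 dc:subject ?animal }
        long personFragments = 10_000;  // solutions for { ?person a ex:Person . ?fragment2 dc:subject ?person }

        // The two partial results share no variable, so combining them is a cross product.
        long crossProduct = animalFragments * personFragments;  // 400,000,000 rows

        // Assumed filter selectivity, roughly the leftBeside/rightBeside ratio from the table below.
        double filterSelectivity = 1.0 / 2.92;
        long survivors = (long) (crossProduct * filterSelectivity);

        // Naive plan: the full cross product is materialized before the filter runs.
        System.out.println("naive plan materializes " + crossProduct + " rows before filtering");

        // Filter-aware plan: the filter is evaluated while the cross product is produced
        // (ideally backed by a spatial index), so only the surviving rows are materialized.
        System.out.println("filter-aware plan keeps about " + survivors + " rows");
    }
}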

Basic optimization approach

After an exhaustive examination of several SPARQL query optimization approaches we selected the one described by Song et al. in “Extended Query Pattern Graph and Heuristic-based SPARQL Query Planning” [1]. This approach is mainly heuristic-based but produces very good query execution plans in comparison to others, with minimal planning effort. Additionally, it is suitable for our extensions regarding filter optimization. It translates the SPARQL query into an Extended SPARQL query triple pattern Graph (ESG), which is represented by vertices V (query expressions like bgp, filter, etc.) and edges E. An edge links two vertices if they share at least one variable. The cost model is based on a set of heuristics; a rough sketch of how they might translate into a concrete cost function is given after the list. The heuristics are:

  • H1: The cost for executing query triple patterns is ordered as: c(s,p,o) <= c(s,?,o) <= … <= c(?,?,?)
  • H1*: The cost of executing query triple patterns is influenced by the distinct counts of subject, predicate and object.
  • H2: A triple pattern that is related to more filters has higher selectivity and costs less.
  • H3: A triple pattern that has more variables appearing in filters has higher selectivity and costs less.
  • H4: Basic query triple patterns have higher selectivity and cost less than named graphs.
  • H5: A query executed with a specific graph pattern has higher selectivity and costs less.
  • H6: The position of the join variable of two vertices influences the join selectivity.
  • H7: Edges whose vertices share more variables are more selective.
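
The following is a rough sketch, in plain Java, of how heuristics like H1–H3 could be turned into a numeric cost per triple pattern. The weights, class names and filter counts are assumptions for illustration; the actual cost model and ESG construction are defined in [1]:

import java.util.List;

public class HeuristicCostSketch {

    // A triple pattern; null stands for a variable position.
    record TriplePattern(String s, String p, String o, int filterCount, int filterVarCount) {}

    // H1: each bound position lowers the cost; H2/H3: related filters and filter
    // variables lower it further. All weights are invented for illustration.
    static double cost(TriplePattern tp) {
        double cost = 1.0;                            // (?,?,?) is the most expensive case
        if (tp.s() != null) cost *= 0.5;              // bound subject
        if (tp.p() != null) cost *= 0.7;              // bound predicate
        if (tp.o() != null) cost *= 0.4;              // bound object
        cost *= Math.pow(0.9, tp.filterCount());      // H2: more related filters -> cheaper
        cost *= Math.pow(0.95, tp.filterVarCount());  // H3: more variables used in filters -> cheaper
        return cost;
    }

    public static void main(String[] args) {
        List<TriplePattern> bgp = List.of(
                new TriplePattern(null, "rdf:type",   "ex:Animal", 0, 0),
                new TriplePattern(null, "rdf:type",   "ex:Person", 0, 0),
                new TriplePattern(null, "dc:subject", null,        1, 1),   // ?fragment1 dc:subject ?animal
                new TriplePattern(null, "dc:subject", null,        1, 1));  // ?fragment2 dc:subject ?person

        // Cheaper (more selective) patterns would be scheduled earlier in the plan.
        bgp.stream()
           .sorted((a, b) -> Double.compare(cost(a), cost(b)))
           .forEach(tp -> System.out.printf("cost %.3f  (%s %s %s)%n",
                   cost(tp), fmt(tp.s()), fmt(tp.p()), fmt(tp.o())));
    }

    static String fmt(String term) { return term == null ? "?" : term; }
}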

As you can see, filters are considered to decrease triple pattern costs without taking the selectivity of the specific filter function into account. Our assumption is that a cost calculation for filters can lead to better query plans, as highly selective filters at the proper place can decrease join sizes at an early stage of query evaluation, and vice versa.

Extension for efficient filter evaluation

In order to consider the selectivity of (SPARQL-MM) filters, we extend the heuristics as follows (a small sketch of the adjusted cost calculation is given after the list):

  • H2*: The selectivity of a triple pattern is influenced by the selectivity of a related filter.
  • H2**: If a triple pattern is related to several filters, its selectivity is influenced by the highest filter selectivity.
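
A minimal sketch of H2*/H2** in plain Java: instead of a fixed discount per related filter, the triple pattern cost is scaled by the estimated selectivity of the most selective related filter. The selectivity values are taken from the table below; class and method names are illustrative and not the Marmotta API:

import java.util.List;
import java.util.Map;

public class FilterAwareCostSketch {

    // Estimated selectivity per SPARQL-MM function (filtered size / crossjoin size),
    // i.e. the inverse of the ratios listed in the table below. Smaller = more selective.
    static final Map<String, Double> FILTER_SELECTIVITY = Map.of(
            "mm:rightBeside", 1.0 / 2.92,
            "mm:touches",     1.0 / 88.19,
            "mm:disjoint",    1.0 / 1.28);

    // H2**: if several filters relate to the pattern, the most selective one wins.
    static double filterFactor(List<String> relatedFilters) {
        return relatedFilters.stream()
                .map(f -> FILTER_SELECTIVITY.getOrDefault(f, 1.0))
                .min(Double::compare)
                .orElse(1.0);
    }

    // H2*: the base (H1-style) cost of the triple pattern is scaled by that factor.
    static double cost(double baseCost, List<String> relatedFilters) {
        return baseCost * filterFactor(relatedFilters);
    }

    public static void main(String[] args) {
        double base = 0.6;  // hypothetical H1-based cost of (?fragment1 dc:subject ?animal)
        System.out.println("no filter:          " + cost(base, List.of()));
        System.out.println("rightBeside:        " + cost(base, List.of("mm:rightBeside")));
        System.out.println("touches + disjoint: " + cost(base, List.of("mm:touches", "mm:disjoint")));
    }
}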

Calculating the selectivity of a filter can be expensive in terms of computation and memory. But as we need an efficient index for spatio-temporal fragments for query evaluation in the case of SPARQL-MM anyway, the selectivity calculation is pretty lightweight. In the table below we list some selectivities for SPARQL-MM functions based on annotations from the MSCOCO image corpus, which includes about 300,000 images with 5 annotations each (on average) from about 80 categories. This is just a small evaluation on a subset of 1,000 images and a subset of functions, but it immediately shows how selective the functions are. As you can see, some functions are extremely selective (e.g. touches), which is due to the nature of the annotations (few annotations touch without overlapping). A sketch of how such a ratio can be computed is given after the table.

Function                   Crossjoin / Filtered
leftBeside / rightBeside   2.92
above / below              4.86
intersects                 4.55
covers                     11.07
disjoint                   1.28
touches                    88.19
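
For illustration, here is a minimal sketch in plain Java of how such a “Crossjoin / Filtered” ratio can be derived for one relation: count all fragment pairs, count the pairs accepted by the relation, and take the quotient. The Region type and the simplified rightBeside test are assumptions, not the SPARQL-MM semantics; the real implementation would use the spatio-temporal index mentioned above:

import java.util.List;

public class SelectivitySketch {

    // A rectangular spatial fragment; purely a stand-in for real media fragment regions.
    record Region(double x, double y, double w, double h) {}

    // Simplified "right beside" test on bounding boxes (illustrative semantics only).
    static boolean rightBeside(Region a, Region b) {
        return a.x() >= b.x() + b.w();
    }

    // Crossjoin / Filtered: size of the full pair cross product divided by the number
    // of pairs accepted by the relation.
    static double crossjoinToFiltered(List<Region> fragments) {
        long crossjoin = (long) fragments.size() * fragments.size();
        long filtered = 0;
        for (Region a : fragments)
            for (Region b : fragments)
                if (rightBeside(a, b)) filtered++;
        return filtered == 0 ? Double.POSITIVE_INFINITY : (double) crossjoin / filtered;
    }

    public static void main(String[] args) {
        List<Region> sample = List.of(
                new Region(0, 0, 10, 10),
                new Region(20, 0, 10, 10),
                new Region(5, 5, 10, 10));
        System.out.println("crossjoin / filtered = " + crossjoinToFiltered(sample));
    }
}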

Next Steps

As first tests showed, efficient query planning together with suitable indexes improves SPARQL-MM performance a lot, especially for bigger datasets. Currently we are finishing the implementation for Apache Marmotta with the KiWi backend.
The next steps are the setup of a proper testing environment including test data. As a basis we are using annotated images from the Snapshot Serengeti project and the MSCOCO image annotation set.
We are looking forward to presenting the results in a future blog post!

References

[1] F. Song and O. Corby, “Extended Query Pattern Graph and Heuristic-based SPARQL Query Planning,” in Proceedings of the 19th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES), 2015.