Answering Provenance-Aware Queries on RDF Data Cubes under Memory Budgets
Author(s): Luis Galárraga, Kim Ahlstrøm Meyn Mathiassen, Katja Hose, Torben Bach Pedersen
Full text: submitted version
Abstract: The steadily-growing popularity of semantic data on the Web and the support for aggregation queries in SPARQL 1.1 have propelled the interest in Online Analytical Processing (OLAP) and multidimensional data (cubes) in RDF. Query processing in such settings is challenging because SPARQL OLAP queries tend to be complex: they usually contain many triple patterns with grouping and aggregation. Moreover, one important factor of query answering on web data is its provenance, i.e., metadata that tells us about the origin and quality of the data. Some applications, e.g., in data analytics, access control, etc., require to augment the data with provenance metadata and run queries that impose constraints on this provenance. This task is called provenance-aware query answering. In this paper, we investigate the benefit of caching some parts of an RDF cube augmented with provenance information when answering provenance-aware SPARQL queries. We propose provenance-aware caching (PAC), a caching approach based on a provenance-aware partitioning for RDF graphs, and a benefit model optimized for RDF cubes and SPARQL queries with aggregation. Our results on real and synthetic data show that PAC outperforms the LRU (least recently used) caching strategy and the Jena TDB native caching in terms of hit-rate and performance.
Keywords: OLAP; SPARQL; Caching; Provenance
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper deals with provenance-aware queries on RDF data, and is, thus, highly relevant to ESWC. (NOVELTY OF THE PROPOSED SOLUTION) To the best of my knowledge, the approach is novel. However, the general idea of using optimised caching to achieve better performance is not new, but its application to the specific setting probably is. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is complete. (EVALUATION OF THE STATE-OF-THE-ART) The state of the art is appropriately considered. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There is no theoretical demonstration of the properties of the approach, but the experimental evaluation is well-structured and appropriate. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental evaluation is well-structured and appropriate. (OVERALL SCORE) The paper describes PAC, an approach for speeding up provenance-aware OLAP queries, by using a set of heuristics that determine "fragments" of an RDF data cube to be cached in memory. The paper is clear and well-written. Some comments follow. It is not clear why Algorithm 1 considers the particular fragments to put in the tree and not others. I have certain remarks on the formalisation of the ben(\phi) quantity. First, the formula can be written more simply by dropping Q_\phi. The restriction of q_a \in Q_\phi is irrelevant, because if q_a \notin Q_\phi then the quantity being summed is 0. Second, is it not the case that ben(\Phi) = \sum ben(\phi)? This would be a simpler formula. Finally, I am wondering why the benefit is not associated with the number of triples relevant to a query. For example, a query that "touches" 1000 triples (in total), 10 of which are in the cache, will experience a different speed-up compared to the case when the query "touches" 10 triples, all of which are in the cache. I would like to see a discussion explaining whether this alternative benefit computation would make sense and why the authors chose the more "absolute" one. Strong points - Clear and well-written - Interesting application of the fragmentation approach to maximize the use of caching Weak points - Formalisation is unclear and can be improved Questions to the authors Please comment on the alternative formalisations proposed above. ====== EDIT AFTER REBUTTAL I acknowledge authors' response. I will keep my original scores and comments. Also, the response with regards to the number of triples is not clear: my suggestion was referring to "normalizing" the result *per query* and not over all queries (i.e., per workload, as suggested by the answer); looking at the query is (obviously) already necessary in the current definition of ben(\Phi). In any case, this does not affect my score. ======
Review 2 (by Javier D. Fernández)
(RELEVANCE TO ESWC) Authors focus on efficient caching mechanisms to speed up provenance-aware query answering on RDF data cubes. The paper can be of great relevance to ESWC as efficient resolution OLAP queries on multidimensional RDF data cubes is still an open challenge that is getting increasingly attention. (NOVELTY OF THE PROPOSED SOLUTION) The paper proposes a cache based on fragments specifically designed for quads (where provenance is represented as named graphs), a cost-benefit model to select the best fragments to be cached based on the space cost and benefit for provenance queries (with some heuristics to approximate the optimal benefit) and a way to select the best fragments with a given cache space budget and rewrite provenance-aware queries to hit the cache. To the best of my knowledge, this is the first work providing a cost-based cache model for provenance queries. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The solution is technically sound and the conclusions follow the premises. (EVALUATION OF THE STATE-OF-THE-ART) The evaluation of the state-of-the-art is extensive and reflects the most important works in the area. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Authors are clear in the scope of the approach, but I have a doubt that could be solved during the rebuttal phase. If I understood correctly, the fragmentation strategy always splits by the named graph in the first level (shown in Algorithm 1). Then, when rewriting the query to hit the cached fragments, the named graph is considered to be already given by the resolution of the provenance query (shown in Algorithm 2). Authors are then assuming that the named graphs are selected with the same probability, and the benefit of the fragment is solely related to the relevance to match the analytical query. If this is so, a) the algorithm seems to be not so specific of provenance-aware queries and could be easily used with “normal” SPARQL querying (a discussion on this would be helpful), and b) can the fragmentation be adapted to take into account also the distribution of the provenance data and the probability to “hit” such named graph? Update: I thank authors for the minor clarification in the rebuttal. The paper would benefit from such discussion. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Authors do not provide the augmented data and provenance queries they used, nor the code for reproducibility. I would encourage authors to clarify in the rebuttal if they are providing such information. The evaluation could also reflect statistics on the number of fragments created and the distribution of the sizes. I would assume the size of the selected datasets (almost) fits into memory. A selection of bigger datasets would make the scenario more solid. The subsection “Impact of graph filtering” is a bit unclear. Do you use the fragment tree that is cached as a filter for the graph, or the full fragment tree? In the second case, an indication on the type of indexing and the size overhead would be needed. Update: I thank authors for the pointer to the website including the evaluation details. (OVERALL SCORE) The paper addressed the efficient resolution of OLAP queries on RDF data cubes. In particular, authors focus on provenance-aware queries on RDF datasets, decomposed as a “provenance query” to filter the graphs and the “analytical query” to compute results on the desired graphs. The paper proposes then a novel cost-based cache mechanisms (provenance-aware caching, PAC) to increase the hit-rate on a given memory budget. To do so, authors first study a fragmentation strategy for the quads and build a cost-benefit scheme on top of it, mainly based on a) the cost as the number of quads in the fragment, and b) the less number of hops to the cube observation and the diversity of the fragments as the relevance for the analytical queries. Authors implement the proposed query-agnostic PAC approach on Jena TDB, evaluated on a synthetic and real-world corpus. The empirical evaluation shows that PAC is able to increase the hit-rate and increase performance time in OLAP queries. * Strong points: - Relevant and timely topic - The approach is technically sound, query-load agnostic (hence further improvements can be achieved as future work) and shows good performance. - The paper is very well written and the formalisms are easy to follow. * Week points (further elaborated in other sections) : - The PAC approach seems to be agnostic of the provenance information - Lack of code/corpus for reproducibility (provided in the rebuttal) - Evaluation has room for improvement * Questions for the authors (further elaborated in other sections): - Algorithm 1 seems to be not so specific of provenance-aware queries, could it be used with “normal” SPARQL querying? - Can the fragmentation be adapted to take into account also the distribution of the provenance data and the probability to “hit” such named graph? - Are the queries and code available for reproducibility? - In the evaluation (“Impact of graph filtering”), do you use the cached fragments as a filter for the graph, or the full fragment tree? In the second case, an indication on the type of indexing and the size overhead would be needed. Update: I thank again authors for their comments in the rebuttal, I increased some of the scores accordingly. I keep the overall weak accept given that the proposal can be more tailored to the provenance information.
Review 3 (by Khalid Belhajjame)
(RELEVANCE TO ESWC) The topic addressed by the paper is relevant to the semantic web community given that it tackles the problem of answering OLAP queries over RDF datasets considering provenance. (NOVELTY OF THE PROPOSED SOLUTION) The topic address by the paper has been addressed by previous papers, which are reviewed in the paper. The paper, however, focuses on aggregation queries, which are particularly relevant for OLAP queries. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The solution proposed by the authors is formally specified, the algorithms used for fragmentation (lattice construction) and the selection of fragments given a budget are also presented. The solution is also empirically validtaed. (EVALUATION OF THE STATE-OF-THE-ART) The authors dedicated a section to the study of the state of the art, they compared their solution to existing ones. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The solution proposed by the authors was empirically validated using the Star Schema Benchmark and the QBOAirbase dataset. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The material used for the evaluation is publicly available at http://qweb.cs.aau.dk/qboairbase/ (OVERALL SCORE) SP1. The topic addressed by the paper is timely and relevant to ESWC SP2. The solution proposed is sound and complete SP3. It is also empirically validated. WP1. The kind of provenance considered by the author, at least in the examples, are confined to URIs. This does not necessarily provide room for much details on the triple in question. In addition, the provenance of the triples in an RDF graph may be dependent, these dependencies are not considered in the fragmentation operation. WP2. The approach proposed by the authors does not seem to be specific to provenance, it can be applicable to any other kind of metadata on triples. This raises the question as to why the authors did not choose to present their solution with a broader scope of application. --------------- Edits made after rebuttal I will not change my review. That said, the answer that the authors provided to my second query on the fact that the approach does not seem to be specific to provenance is not convincing.
Review 4 (by Shawn Bowers)
(RELEVANCE TO ESWC) The paper deals with efficient execution of SPARQL queries (although limited in scope). (NOVELTY OF THE PROPOSED SOLUTION) Deals with caching for a specific variant of SPARQL queries. The overall approach and assumptions seem solid. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) I think a more thorough experimental evaluation could have been performed. For instance, it would be good to see how these approaches compare to just a pure in-memory solution as well as an approach that doesn't do any caching (two extremes). The experimental evaluation also needs some work in terms of explaining the datasets used (e.g., the terms 'airbase-dk', 'airbase-bg', 'u-ssb-80k', and so on). Also, while the number of triples are reported in the experimental datasets, the size of the datasets should also be reported. I'm guessing none of these were really that large in size (but I have no way to really know if this is the case or not ...). (EVALUATION OF THE STATE-OF-THE-ART) Good. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) This is also good. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) See previous comments. (OVERALL SCORE) My main concern with the paper is whether it makes sense from a practical perspective. In particular, as the trend in database technology has moved to in-memory architectures the last few years, it seems reasonable that the same would be true of SPARQL/RDF based query approaches. It seems like the dataset would have to be very large to warrant the approach proposed in the paper. Even in the experimental evaluation, it is likely the case that the datasets used could have been processed purely in memory (the machine used to do the experiments had 128GB of RAM!). Since given the amount of memory even commodity machines have today, it seems rare that a given RDF dataset couldn't be processed purely in memory. This means that in only rare cases would one need this type of caching approach. And likely, a simple sharding-based, in-memory approach using a few machines would outperform the caching-based approach anyway. This is also one reason why it would be useful to actually see how the approach performs against in-memory approaches ...
Metareview by Maribel Acosta
This work addresses the problem of provenance-aware OLAP query processing. The authors propose provenance-aware caching (PAC) to cache relevant fragments of RDF graphs assuming memory budget. Empirical results indicate that PAC outperforms baseline approaches in scenarios with real and synthetic datasets. The reviewers agreed on the relevance and on the novelty of this work, hence the positive scores. Nonetheless, after the rebuttal, the majority of the reviewers still have major concerns about the role of provenance in certain components of the solution. From the paper, it seems that PAC is agnostic to provenance. As 'provenance' is crucial in the problem tackled in this work, we recommend the authors explicitly mention which components of the approach exploit provenance. In addition, the proposed solution is not properly positioned with respect to the authors' previous work  which hinders the contributions of this paper. As it is presented, the distinction between the fragments defined in PAC and materialized views remains unclear. Due to the reasons stated above, the paper cannot be accepted in the conference. We highly encourage the authors to take into account the reviewers’ comments to improve the overall quality of this work.  Kim Ahlstrøm Jakobsen, Katja Hose, Torben Bach Pedersen: Towards Answering Provenance-Enabled SPARQL Queries Over RDF Data Cubes. JIST 2016: 186-203