Efficient Duplicate Elimination in SPARQL to SQL Translation
Author(s): Dimitris Bilidas, Manolis Koubarakis
Full text: submitted version
Abstract: Redundant processing is a key problem in SPARQL to SQL query translation in Ontology Based Data Access (OBDA) systems. Many optimizations that aim to minimize this problem have been proposed and implemented. However, a large number of redundant duplicate answers are still generated in many practical settings, and this is a factor that impacts query execution. In this work we identify specific query traits that lead to duplicate introduction and we track down the mappings that are responsible for this behavior. Through experimental evaluation using the OBDA system Ontop, we exhibit the benefits of early duplicate elimination and show how to incorporate this technique into query optimization decisions.
Keywords: query optimization; OBDA; SPARQL
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper is relevant to ESWC, as there are a growing amount of systems, which work with RDF but having a relational database backend running. However the paper itself was very SQL focused. (NOVELTY OF THE PROPOSED SOLUTION) There are many papers (more than introduced in the related work section) covering the transition from SPARQL to RDF. There are also some interesting paper from the open source system the authors used as foundation. However I did not find equivalent proposed solutions. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The improved speed-up while running the benchmarks suggest a correct solution. However the authors did not mentioned if the results of the queries were still correct. (EVALUATION OF THE STATE-OF-THE-ART) The related work section is very poor, there was no explanation of how those introduced papers would perform on a similar benchmark. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper is very theoretical, introducing more examples would help to understand and underline the proposed solutions. Sometimes the overall goal of the paper, but also between the chapters is not clear. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) While it is important to mention, that the results are public available, based on the current explanation of the implementation I would not be able to reproduce the experiments, (OVERALL SCORE) This paper is about creating a more efficient SPARQL to SQL translation with means of detection and removal of duplicated results in an early stage. As the authors describe in the introduction, this is a real world problem in ontology based data access systems. However it is not enough to filter out duplicates from the final results, but from a point of performance it make sense to remove duplicates in an early stage. This paper introduces different solution, which they have implemented in the open source system Ontop. The proposed solutions are later evaluated on the benchmark sets NPD and LUBM. The authors propose three different solutions, how to eliminate such duplicates. I especially likes chapter 3, which was very precise chapter and the combination of technical explanation together with a pseudo algorithm was very nice. Also the introduction was partially well written. Through discharging one of the more intuitive solutions (including an explanation why to do so) the complexity of the introduced challenge got more understandable. However the introduction would benefit from a clear statement (enumerated list) explaining the scientific contributions of this paper. The authors mentioned several times that because of the limited space they would refere to the technical report, which they uploaded on GitHub. However I do not agree with the authors on this point. There is a lot of possible space to save, if the first two chapters are comprehended. I do not think that such a deep and technical introduction in the preliminaries is necessary to unterstand this paper. More examples however, especially for formulas 2 and 4, which represent certain forms of queries, would be helpful to understand this paper. Overall I think the paper, like presented in the current form, needs some revision, but I think they are easy to introduce. I had a closer look at the technical report uploaded on GitHub, and there were many things explained more clearly. For example in section 7 in the technical report is a nice concluding table (Table 5), which I would like to see in this paper. If some of the sections are shorten, more examples (such as SQL queries), which are already in the technical report, could also be added to this chapter. Also it would be nice to have a more clear overview of the datasets (maybe one example each) and how they were applied in Ontop, because this would increase the understanding how to reproduce the results. I would also like to see a conclusion section as well as a more prominent related work section.
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) The topic of ontology-based data access is certainly of relevance to ESWC. (NOVELTY OF THE PROPOSED SOLUTION) Most parts of the solution are novel. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Appears to be correct. (EVALUATION OF THE STATE-OF-THE-ART) Relevant works are cited and compared to. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) A prototype implementation of the proposed optimization techniques is available online. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) A prototype implementation of the proposed optimization techniques is available online. (OVERALL SCORE) The submission studies the issue of duplicate solutions in ontology-based data access and proposes a number of optimizations. They all have shown to improve the performance of the OBDA system Ontop. *** Strong Points (SPs) *** 1. The issue of duplicate elimination is quite important for OBDA systems. 2. A prototype is implemented by extending the OBDA system Ontop. *** Weak Points (WPs) *** 1. One of the optimizations is well-known. 2. The submission could have been written with a bit more care. "Redundant processing" (abstract) is a strange term. OWL ontology (p 1) -> OWL 2 QL (?) The meaning of "an RDF term represented by an IRI" (p 1) is unclear. "in a query that involves several joins" (p 2) is a bit unexpected at that point (the exponential is in the number of joins, is it not?) "Given that relational schemas are normalized" (p 2) is a strong assumption, which is not necessarily true in practice. "deduplicated" (p 3) is not a word R2RML appears out of the blue on p 4 "denote as" -> "denote by" (p 4 and elsewhere, multiple occurrences) "ontology axioms" (p 5) is undefined To simplify presentation, consider only GAV mappings from the very beginning. What is "the ABox of the initial ontology" (p 5)? Typesetting in the second paragraph on p 6 needs improving (and there is no need for the Mapping with a capital M) head(m)(D) on p 6 is a bit too complex Algorithm 1: is it DTR_R or DTR(R)? Also, line 8 would be redundant if the authors used properties and their inverses (which are also properties) in line 7 (the mapping for P^- is the mapping for P with the arguments swappped). "have been resulted" -> result (p 8) The sentence below (4) is broken. it relatively cheap -> is relatively cheap (p 8) "decides about early DE" (p 9) is a strange way of putting it The reader should be reminded what M_CQ is (it was defined on p 5 and has not been used) Check subscripts in (5) Self-join (with -) on p 11 "distinct answer tuples are only 9" (p 11) -> there are only 9 distinct tuples Did the authors confirm with the Ontop developers that the CQ minimization is not implemented because "the procedure is expensive"? (p 11) The optimization at the top of p 12 is nothing new (and the authors should not really take credit for it). Never perform -> The never-perform strategy Cost-based (with -) on p 14 Replacing just the last two authors in  by "et al." is very strange I acknowledge the rebuttal (no score changes).
Review 3 (by Carlos Buil Aranda)
(RELEVANCE TO ESWC) This paper is quite relevant since OBDA is one of the most useful technologies in the Semantic Web. The authors tackle an important optimization problem. (NOVELTY OF THE PROPOSED SOLUTION) Removing duplicates it is important and from what I know from the state of the art in OBDA this is novel, however not that novel in a distributed scenario. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) As I write in my review, the paper is confusing and I cannot guarantee completeness nor correctness. (EVALUATION OF THE STATE-OF-THE-ART) The state of the art is just a small section at the end of the paper. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As I commented in the review, the evaluation (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments were hard to compare since there are no tables in the paper. The code for the experiments is available at https://github.com/dbilid/ontop/tree/version3/results but I'm not sure about the implementation. (OVERALL SCORE) In this paper the authors propose a method to eliminate duplicate mappings in an OBDA scenario. The authors present the scenario in which after a relational project operation is executed the amount of duplicate mappings that may be quite large, thus it is important to eliminate them early on so the query processing can be optimized. To do that the authors present first a method to identify the operators in which duplicates may arise, next an heuristic to decide when the optimization and an optimization to eliminate duplicates during the execution of self joins. Finally the authors present an evaluation of the techniques described and a brief state of the art. In the Introduction section the authors introduce the problem using a running example and why duplicates may have an impact in the total query execution time. They also explain how RDBMS handle duplicates and why it is costly to eliminate them. Comments: in general I agree with the introduction and the importance of the problem. However it was a bit hard to read it since the problem is introduce on page 1, referring to an example on page 2 and to a figure on page 4. This needs to be rearranged so the reader do not get lost. In the Preliminaries section the authors introduce the necessary terminology to understand the rest of the paper, including the notions of database, ontology, mappings, TBox, ABox, etc. The authors also include in this preliminaries section the Duplicate tuple ratio (DTR) which the authors will use later on to estimate when to execute the optimization. Comments: I found this section a bit hard to read since the authors mix the usual terminology with new concepts such the DTR formula. Besides, they use similar letters and symbols for different concepts, such as C_i (constant or class?) or the m function and later Mappings. For me this section was hard to read and I think this is a very important section because when the reader gets lost with some symbol or definition she will come straight to this section to clarify. However that was not possible to me. In Section 3 the authors present the technique with which they identify when the duplicate generation happens (what assertions generate such duplicates). The authors provide an algorithm that identifies the duplicates (using the previously defined DTR formula). Comments: in this section I got lost again, mainly because the mixed usage of some terms (like C or m as pointed before). Also, the initial example would be of use here to help to understand the algorithm, however it is not really explained. In Section 4, Pushing DE Before IRI construction the authors present how the optimization for duplicate elimination is applied. Again, the authors refer to formula (1), but the authors presented before a formula for DTR, next a proposition, next something labelled with (1). As reader I am very confused at this point, since I do not know to what formula they refer exactly since there is no label/title with formula in the paper. In Section 5 the authors present an heuristic to decide when to apply the optimization (when the duplicates are greater than the amount of results) and in Section 6 the authors present a method to remove duplicates due to redundant self joins. Comments: both methods seem sound, just a comment have the authors tried with other heuristics? In Section 7 the authors present the evaluation of their approach. In the evaluation the authors claim that their duplicate technique reduced the evaluation time of 5 out of 6 queries. Could the authors explain a bit more the details about these queries? it is important to understand why that happened. Also, a table summarizing the results would be good to compare them. Overall comments: This seems an interesting paper, however it is very badly organized as I pointed out during the review. I would suggest to rewrite it in a more friendly format for the readers. ========== after rebuttal ======= I acknowledge the authors response. I think some of my concerns will be fixed, however after reading all reviews and the author's response I will maintain my score.
Review 4 (by anonymous reviewer)
(RELEVANCE TO ESWC) In the scope of ESWC (NOVELTY OF THE PROPOSED SOLUTION) Interesting optimizations, but it seems to me that it is for an artificial problem (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solutions seem to be correct (EVALUATION OF THE STATE-OF-THE-ART) There is an evaluation but the mappings are missing. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There is an evaluation but the mappings are missing. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I don't have information about mappings (OVERALL SCORE) Summary of the Paper This paper presents an approach to eliminate duplicates when it comes to OBDA queries that do SPARQL to SQL translation. Duplicates are introduced, if I understand correctly, when the result of the query rewriting is unfolded and uses mappings that may return duplicates because they are mappings on denormalized schemas or represent a many-to-many relationship. The authors present multiple optimizations ranging from offline to online, and also evaluate their approach where they provide evidence to support their claims Strong Points (SPs) - Well defined problem - Variety of optimizations - well structured evaluation Weak Points (WPs) - The problem seems to be a bit artificial (see below) - Mappings are not available (see below) - The evaluation does not present the impact of each optimization Questions/Comments to the Authors (QAs) 1) My main comment/question is regarding the motivation of the problem. It seems to me that it is a bit artificial. I understand that the authors state "Specifically we show that duplicates are present in manymappings from both benchmarks and that specific queries are heavily impactedby this fact…” which motivates the problem, but in reality, there is no analysis presented on the types of mappings that generate the duplicates. This may be in the extended version, but as a reviewer, I’m not obligated to take a look at that. This paper should be self contained. Additionally, there are no links to the mappings used (more in the next question). My specific question is the following: the authors state "However, a denormalized schema or mappings that do not use unique columns can lead to duplicate introduction even from a single triple pattern.”. If this is the case, then wouldn’t the mapping be incorrect? Or, isn’t the mapping being used in an incorrect manner? If I understand correctly, per the example, the main motivation for the problem is when 1) you have a mapping for an object property, 2) there is no mapping for a class, 3) in the ontology you have an range/domain for the object property such that you can infer what should be the class given just the object property triple, 4) your query is asking for the class. Therefore, you can only use the mapping for the object property to answer your query. However, given that the mapping for the object property can be many-to-many, that is why you generate duplicates. If I understood correctly, Algorithm 1 just checks if the mapping for a Class has duplicates and for object properties if the mappings for the subject and object have duplicates because these three cases can generate instances of a class. For the first case (mappings for a class have duplicates), then this is an actual error in the mapping. I believe this is an artificial setting because a) in reality you will always have mappings for classes and b) if you didn’t, then in reality what you want is to find the issue so you can fix it, in this case, add the missing mapping, then this problem would go away. With this approach, it seems that you are masking a problem instead of fixing the root of the problem. 2) Questions about mappings a) What are the exact mappings used in the benchmark? b) Did you use all of them or did you take some way? c) How much coverage of the ontology did the mappings have? In my opinion, understanding what were the mappings is what would clarify the true motivation of this problem. 3) What is the impact of each optimization? The evaluation just presents overall numbers. A thorough analysis of each optimization should be presented. The global overview is not enough to understand the value of each optimization 4) In section 3: why does this have to be done with materialized view? The output of Algorithm 1 is a new set of mappings. You could use them as-is without having to materialize. I understand that if you materialized the views, the performance would be better. It would be good to see the tradeoff of not materializing vs materialization 5) In section 4: pushing duplication elimination into the inner part of the query so that the detection is done on data values instead of the IRIs. The most inner disjunctions are UNION (which does the distinct), these are the UCQ resulting of the query rewriting. The rest of the disjunction ares UNION ALL which just concats the results. Did I understand correctly? If so, this section is very confusing and up front it should be explained in a simple and clear manner what is going on before diving into the details. ======= Comments about rebuttal "then duplicates still would be introduced because a non-unique column is used" My argument is that if this mapping is created, it is incorrect. The mapping consists of creating a subject (which is part of a rdf:type :ClassName triple) based on a non-unique column when the goal of the subject is to be unique. It's like stating that a column is suppose to be a primary key when in reality the column has duplicates. So in order to add that primary key, you need to eliminate the duplicates. Well of course... but the point is that you initially selected an attribute that did not satisfy the definition of what is suppose to be a primary key. This is an example of an incorrect mapping: rdf:type(IRI(wlbNpdidWellbore), :DiscoveryWellbore)) <- select wlbNpdidWellbore from Discovery wlbNpdidWellbore is an attribute that is a FK which points to another table. The mapping should be based on the PK of that other table. Bottomline, I still feel that this is tackling an artificial problem, based on the examples and statements you have given me. I may be wrong but I have not seen a clear motivating example to convince me otherwise. FYI, My score is borderline (not reject). I do not have this option in easychair.
Metareview by Maribel Acosta
This work tackles the problem of duplicate elimination (DE) generated by mappings in SPARQL to SQL translations in the context of ODBA. The authors propose three DE strategies. Empirical results indicate that the approach is able to reduce execution time over baselines that do not eliminate duplicates. The reviewers agreed that research problem addressed in this paper is definitely relevant to the Semantic Web. Nonetheless, the reviewers raised several major issues this work. These issues include: the motivation is unclear, the correctness and completeness of the proposed approach is not formally demonstrated, the proposed approach is not positioned with respect to state-of-the-art solutions, no sufficient descriptions about the experimental settings (in particular, about the mappings used for the translation), no analysis about the impact of the different DE strategies on query performance. In addition, the reviewers indicated that presentation of the paper requires further improvement, in particular in the presentation of the experimental results, related work, and conclusions (which are currently missing). In summary, the direction of this work is promising. Yet, in its current state the paper not ready for publication. We encourage the authors to address the reviewers' comments to improve the overall quality of this work.