**Answers Partitioning and Lazy Joins for Efficient Query Relaxation and Application to Similarity Search**

**Author(s):** Sébastien Ferré

**Full text:** submitted version

**Decision:** accept

**Abstract:** Query relaxation has been studied as a way to find approximate answers when user queries are too specific or do not align well with the data schema. We are here interested in the application of query relaxation to similarity search of RDF nodes based on their description. However, this is challenging because existing approaches have a complexity that grows in a combinatorial way with the size of the query and the number of relaxation steps. We introduce two algorithms, answers partitioning and lazy join, that together significantly improve the efficiency of query relaxation. Our experiments show that our approach scales much better with the size of queries and the number of relaxation steps, to the point where it becomes possible to relax large node descriptions in order to find similar nodes. Moreover, the relaxed descriptions provide explanations for their semantic similarity.

**Keywords:** Query Relaxation; Approximate Answer; Similarity Search; RDF Graph; Graph Pattern; Partition; Join

**Review 1 (by Alessandro Margara)**

(RELEVANCE TO ESWC) The paper addresses the problem of query relaxation. It proposes a new approach based on answers partitioning and an optimization that exploits lazy joins to improve the efficiency of computing query relaxations. The objective of query relaxation is to compute relaxed queries, their proper (i.e., novel) answers, and their ordering, thus enabling answers to be returned starting from the more specific relaxed queries. The topic is interesting and certainly relevant for the conference.

(NOVELTY OF THE PROPOSED SOLUTION) I am not an expert in the field and I am not entirely familiar with the related work. According to the authors, the existing approaches to query relaxation all generate relaxed queries up to some relaxation distance and evaluate them from the most to the least specific. This leads to performance problems, since the number of relaxed queries can grow exponentially with the size of the original query, and each relaxed query is evaluated separately. The authors propose a different approach based on clusters, where a cluster informally represents a relaxed query and the set of answers to that query. The approach refines an initial, very general cluster (e.g., the cluster representing the entire dataset) by splitting it into two clusters based on a query element e (a triple pattern or a selection filter). The first derived cluster includes the subset of answers where e holds, whereas the second cluster is the complement of the first. This approach builds a binary tree with p leaves (queries), where p is bounded by the number of possible answers (in the worst case, the entire dataset) and by the size of the power set of the query elements. In this iterative process, every time a cluster is partitioned according to an element e, if e is a triple pattern then computing the new set of answers requires a possibly exponential number of joins. The authors propose an optimization with lazy joins that represents the set of matchings using a tree-based structure and reduces the cardinality of the sets participating in the joins.

(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Some aspects of the solution are difficult to understand. I had to go through Section 5 several times to understand the algorithm, and I'm still not sure I got all the details (see below). I suggest that the authors provide a simple running example in which they show a couple of iterations of their algorithm to guide the reader. Conversely, in the case of Section 6, the example is very useful for grasping the high-level mechanism, but then Algorithm 2 should be discussed in greater detail (what do delta+ and delta- represent? ...). Concerning Algorithm 1 again, I am not sure I understood what the complement of Ce contains. Indeed, the algorithm (line 6) removes e from the set of elements but adds relax(e). Does this mean that the clusters do not partition the set of answers? Can Ce and its complement share answers? What are the consequences? Again, I think these aspects need to be discussed in greater detail.

(EVALUATION OF THE STATE-OF-THE-ART) I am not an expert in the field. The discussion of the state of the art is short, but effective.

(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As discussed in my detailed comments, some aspects of the solution are difficult to understand.

(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Concerning the evaluation, I have two doubts. The first relates to the comparison with existing algorithms. Existing algorithms compute more specific queries first and then compute the answers to more and more relaxed queries. This means that they can start presenting more specific answers as soon as they are computed and return more relaxed answers later: the latency to receive the first answers can be fairly low. Conversely, if I understand correctly, the proposed algorithm starts from the more generic answers and refines them through partitioning: this means that the entire process needs to terminate before the answers can be provided. I wonder whether this aspect can be relevant when the results need to be presented to a user or to a system that further processes them (and that could start earlier with traditional algorithms). Second, the authors do not provide any qualitative comparison of the relaxed queries (and related answers) generated by existing approaches and by the proposed one. I believe that this aspect is also important, maybe even more than performance.

(OVERALL SCORE) In general, the paper is well written and precise. The topic is interesting and certainly relevant for the conference. However, I believe that the presentation of the paper can be improved by providing more examples and by explaining the algorithms in greater detail. To conclude, I think that the paper addresses an interesting topic and introduces a novel approach. However, the current version of the paper makes it difficult to understand some important aspects of the approach and its evaluation.

Minor comments:
- page 3: the above approaches [] puts -> put
- page 6: Huang et al [] uses -> use
- page 7: partionned -> partitioned
- page 8: relaxtion -> relaxation

Strong points:
1) Interesting and relevant topic.
2) Novelty of the solution.
3) In general, the paper is well written and precise.

Weak points:
1) The presentation of the paper can be improved by providing more examples and by explaining the algorithms in greater detail.
2) It is not entirely clear to me how the solution compares to existing algorithms in terms of the latency of presenting results.
3) It would be useful to have a qualitative comparison of the relaxed queries (and related answers) generated by existing approaches and by the proposed one.

Questions:
1) Algorithm 1: what does the complement of Ce contain? The algorithm (line 6) removes e from the set of elements but adds relax(e). Does this mean that the clusters do not partition the set of answers? Can Ce and its complement share answers? What are the consequences?
2) Algorithm 2: can the authors explain what delta+ and delta- represent?
3) How does the solution compare with existing algorithms in terms of the latency to present results? Existing algorithms can start presenting the more specific answers as soon as they are computed, whereas the proposed algorithm seems to require the whole partitioning process to terminate before answers can be provided.
4) Would it be possible to provide some qualitative comparison of the relaxed queries (and answers) generated by existing approaches and by the proposed one?

---- I would like to thank the authors for answering all my questions in the rebuttal, especially regarding the latency of existing approaches.
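The cluster-splitting step summarized in this review can be sketched in a few lines. This is only a toy illustration, not the paper's Algorithm 1: all names (`Cluster`, `split_on`, `holds`) are hypothetical, answers are reduced to indices into a small dataset, and in this simplified version the two derived answer sets partition the parent's answers by construction, which is precisely the property the reviewer questions for the real algorithm.

```python
# Toy sketch of splitting a cluster on a query element e (hypothetical
# names; not the paper's Algorithm 1). Answers are indices into `data`,
# and a query element is a (key, value) constraint.
from dataclasses import dataclass

@dataclass
class Cluster:
    elements: frozenset  # query elements assumed to hold for these answers
    answers: set         # indices of the matching answers

def split_on(cluster, e, holds):
    """Split `cluster` on element `e` using the predicate `holds(a, e)`.

    Returns (c_e, complement): c_e keeps the answers where e holds; the
    complement gets the remaining answers, with e marked as relaxed.
    """
    in_e = {a for a in cluster.answers if holds(a, e)}
    c_e = Cluster(cluster.elements | {e}, in_e)
    complement = Cluster(cluster.elements | {("relaxed", e)},
                         cluster.answers - in_e)
    return c_e, complement

data = [
    {"type": "City", "country": "FR"},
    {"type": "City", "country": "DE"},
    {"type": "River", "country": "FR"},
]
holds = lambda a, e: data[a].get(e[0]) == e[1]

root = Cluster(frozenset(), {0, 1, 2})
c_e, comp = split_on(root, ("type", "City"), holds)
# c_e.answers == {0, 1}; comp.answers == {2}: together they partition root.answers
```

Iterating this split on each remaining element is what builds the binary tree of clusters described above; each leaf then corresponds to a relaxed query together with its proper answers.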

**Review 2 (by anonymous reviewer)**

(RELEVANCE TO ESWC) The proposed approach handles relaxation of SPARQL queries over Linked Data so that they can be processed efficiently, as required for similarity search. Therefore, the paper is of appropriate relevance for ESWC.

(NOVELTY OF THE PROPOSED SOLUTION) The solution seems to be novel, although it may not be applicable to many kinds of datasets, because it depends on extensive definitions within the schema/ontology.

(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is described in detail, including formal definitions. Correctness and completeness of the description are very good.

(EVALUATION OF THE STATE-OF-THE-ART) The author introduces several competing approaches and distinguishes the proposed approach from them. Nevertheless, the majority of the related work is several years old, and especially in the field of efficient similarity search there might be more relevant approaches.

(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach consists of two different steps, which facilitates the independent implementation of both algorithms. However, only a few of the parameters that might influence the results of the overall approach are described in detail (e.g., number of relevant triples, similarity score, different types of ontological restrictions for the lazy joins).

(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation describes the efficiency of the proposed approach in detail. It lacks a discussion of the quality of the resulting queries, which is an essential part of query relaxation in similarity search. The most efficient approach might result in completely irrelevant queries in terms of similarity.

(OVERALL SCORE) The author presents an approach for query relaxation as it could be used for similarity search or question answering as well. The approach consists of two algorithms: answers partitioning and lazy joins. The former is used to retrieve more general answers for a given query. This applies to incomplete datasets where expected definitions (for example, type definitions) are not fulfilled for all entities. The latter algorithm restricts the potentially large number of triples according to the underlying schema of the relevant dataset.

The proposed approach and its presentation include the following strong points:
- S#1: Formal definitions and symbols are introduced in a conclusive way.
- S#2: The relaxation and efficiency steps are strictly isolated, which facilitates parametrization for optimization.
- S#3: Experiments and evaluation are carried out on different datasets and are well documented.

Weak points include:
- W#1: The author claims that for similarity search potentially many triples are retrieved when a query is relaxed for generalization. Unfortunately, there is no word on the level of similarity, especially in terms of semantic relevance. If a query is relaxed in a way that too many triples apply, are the potential answers still relevant for the original query?
- W#2: The lazy joins step is based on the restrictions provided by the schema/ontology of the underlying dataset. Many datasets (such as DBpedia) only provide weak ontologies with weak definitions regarding subproperties or property restrictions. Therefore, I doubt that the approach performs well on such datasets.
- W#3: As mentioned in W#1, the semantic aspect of the approach is not discussed for the answers partitioning step. Consequently, the evaluation part lacks a quality aspect. There is no information about the answers proposed by this very efficient approach. The approach is fast, but are the answers relevant for the respective application?

After rebuttal: I would like to thank the author for the clarification. It supports my recommendation for a (weak) acceptance.
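The lazy-join idea that the reviews refer to can be illustrated generically with generators. This is only an assumed, simplified analogue of the paper's tree-based structure: the pattern syntax and the `match`/`lazy_join` helpers below are hypothetical, and the point is merely that partial matchings are extended on demand instead of materializing full intermediate join results.

```python
# Illustrative sketch (not the paper's Algorithm 2): joining triple-pattern
# matchings lazily, so bindings are only enumerated when consumed.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "type", "Person"),
]

def match(pattern, binding):
    """Yield extensions of `binding` for one triple pattern.
    Terms starting with '?' are variables; constants must match exactly."""
    for triple in triples:
        ext = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if ext.get(term, value) != value:  # conflicting binding
                    ok = False
                    break
                ext[term] = value
            elif term != value:                    # constant mismatch
                ok = False
                break
        if ok:
            yield ext

def lazy_join(bindings, pattern):
    """Extend each partial binding with matches of the next pattern."""
    for b in bindings:
        yield from match(pattern, b)

# Join two patterns lazily; nothing is materialized until consumed.
query = lazy_join(lazy_join([{}], ("?x", "knows", "?y")),
                  ("?y", "knows", "?z"))
first = next(query)  # {'?x': 'alice', '?y': 'bob', '?z': 'carol'}
```

Because each stage yields one compatible extension at a time, the sets participating in later joins stay small, which is the cardinality-reduction effect the reviews describe at a high level.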

**Review 3 (by Ioanna Lytra)**

(RELEVANCE TO ESWC) The current paper deals with new algorithms for query relaxation and similarity search of RDF nodes, a topic that is very relevant to the ESWC community.

(NOVELTY OF THE PROPOSED SOLUTION) The two proposed algorithms are novel and offer an improvement in terms of execution time over the state of the art.

(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) An element I am missing here is the effectiveness of the proposed algorithms. This point is not discussed in the paper.

(EVALUATION OF THE STATE-OF-THE-ART) In my view, the related work is discussed adequately; the advantages and disadvantages of existing approaches, also with respect to the author's proposal, are presented.

(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The author discusses the complexity of the algorithms (though not in detail for Algorithm 2). The evaluation suggests an improvement in terms of execution time for the introduced algorithms; the conditions under which the algorithms outperform the state of the art are also discussed in the Evaluation section.

(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The algorithms are clearly described, and the datasets and queries used are available. It is therefore possible to reproduce the results, also given that the source code is publicly available.

(OVERALL SCORE) Summary of the Paper: The paper entitled "Answers Partitioning and Lazy Joins for Efficient Query Relaxation and Application to Similarity Search" proposes two algorithms, one based on answers partitioning for query relaxation and one based on lazy joins for similarity search. The author compares the two algorithms with baseline approaches on two different datasets (MONDIAL and LUBM) in terms of execution time (query relaxation) and number of proper relaxed queries (similarity search).

Strong Points (SPs):
- Clear presentation of the introduced algorithms.
- Comparison to baselines and discussion of the results.
- Explanatory examples accompanying the algorithms.

Weak Points (WPs):
- The setting for the similarity search is not explained in detail.
- The experiments focus on efficiency; what about effectiveness?
- Not all algorithms discussed in the RW section are afterwards used for comparison in the Evaluation section.

Questions to the Authors (QAs):
- What is the complexity of the second algorithm (Algorithm 2)?
- How do you judge the effectiveness of your approach (especially for similarity search)?

After rebuttal: I would like to thank the authors for addressing my comments and questions.

**Metareview by Maria-Esther Vidal**

This work tackles the problem of query relaxation and proposes a solution that combines answers partitioning and lazy joins to compute approximate answers efficiently. The empirical results presented in this work indicate that the proposed solution outperforms state-of-the-art approaches in terms of efficiency. The reviewers agreed that this line of research is definitely interesting and relevant to the Semantic Web. Nonetheless, the reviewers identified several weak points in this work: the evaluation and discussion of the effectiveness of the proposed solution are missing; the lack of detail in portions of the approach (especially Algorithms 1 and 2) hinders the readability of the paper; and the limited experimental evaluation hinders the generality of the results.