Paper 14 (Research track)

Answers Partitioning and Lazy Joins for Efficient Query Relaxation and Application to Similarity Search

Author(s): Sebastien Ferre

Full text: submitted version

camera ready version

Decision: accept

Abstract: Query relaxation has been studied as a way to find approximate
answers when user queries are too specific or do not align well with
the data schema. We are here interested in the application of query
relaxation to similarity search of RDF nodes based on their
description. However, this is challenging because existing
approaches have a complexity that grows in a combinatorial way with
the size of the query and the number of relaxation steps. We
introduce two algorithms, answers partitioning and lazy join, that
together significantly improve the efficiency of query
relaxation. Our experiments show that our approach scales much
better with the size of queries and the number of relaxation steps,
to the point where it becomes possible to relax large node
descriptions in order to find similar nodes. Moreover, the relaxed
descriptions provide explanations for their semantic similarity.

Keywords: Query Relaxation; Approximate Answer; Similarity Search; RDF Graph; Graph Pattern; Partition; Join

Review 1 (by Alessandro Margara)

 

(RELEVANCE TO ESWC) The paper addresses the problem of query relaxation. It proposes a new
approach based on answers partitioning and an optimization that exploits lazy
join to improve the efficiency in computing query relaxation.
The objective of query relaxation is to compute relaxed queries, their proper
(i.e., novel) answers, and their ordering, thus enabling to return the answers
starting from more specific relaxed queries.
The topic is interesting and certainly relevant for the conference.
(NOVELTY OF THE PROPOSED SOLUTION) I am not an expert in the field and I am not entirely familiar with the
related work. According to the authors, the existing approaches to query
relaxation all generate relaxed queries up to some relaxation distance, and
evaluate such queries from the most to the least specific. This yields to
performance problems since the number of relaxed queries can grow
exponentially with the size of the original query, and each relaxed query is
separately evaluated.
The authors propose a different approach based on clusters, where a cluster
informally represents a relaxed query and the set of answers to that
query. The approach refines an initial very general cluster (e.g., the cluster
representing the entire dataset) by splitting it in two clusters based on a
query element e (a triple pattern or a selection filter). The first derived
cluster includes the subset of answers where e holds, whereas the second
cluster is the complement of the first.
This approach builds a binary tree with p leaves (queries), where p is bounded
by the number of possible answers (in the worst case, the entire dataset) and
the power set of the number of query elements.
In this iterative process, every time a cluster is partitioned according to an
element e, if e is a triple pattern then the computation of the new set of
answers requires computing a possibly exponential number of joins. The authors
propose an optimization with lazy joins that represents the set of matching
using a tree-based structure and reduces the cardinality of the sets
participating in the joins.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Some aspects of the solution are difficult to understand.
I had to pass through Section 5 several times to understand the
algorithm, and I'm still not sure I got all the details (see below). I suggest
that the authors provide a simple running example in which they show a couple
of iterations of their algorithm to guide the reader.
Conversely, in the case of Section 6, the example is very useful to grasp the
high level mechanism, but then Algorithm 2 should be discussed in greater
details (what do delta+ and delta- represent? ...).
Concerning again Algorithm 1, I am not sure I understood what the complement
of Ce contains. Indeed, the algorithm (line 6) removes e from the set of
elements, but adds relax(e). This means that the clusters do not partition the
set of answers? Ce and its complement can share answers? Which are the
consequences? Again, I think that these aspects need to be discussed in
greater details.
(EVALUATION OF THE STATE-OF-THE-ART) I am not an expert in the field. The discussion of the state-of-the-art is short, but effective.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As discussed in my detailed comments, some aspects of the solution are difficult to understand.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Concerning the evaluation, I have two doubts. The first one relates to the
comparison with existing algorithms. Existing algorithms compute more specific
queries first, and then compute the answers to more and more relaxed
queries. This means that they can start presenting more specific answers as
soon as they are computed and return more relaxed answers later: the latency
to receive the first answers can be fairly low. Conversely, if I understand
correctly, the proposed algorithm starts from the more generic answers and
refines them through partitioning: this means that the entire process needs to
terminate before the answers can be provided. I am wondering if this aspect
can be relevant in the case the results need to be presented to some user or
to some system that further processes them (and can start earlier in the case
of traditional algorithms).
Second, the authors do not provide any qualitative comparison of the relaxed
queries (and related answers) generated by existing approaches and the
proposed one. I believe that this aspect is also important, maybe even more
than performance.
(OVERALL SCORE) The paper addresses the problem of query relaxation. It proposes a new
approach based on answers partitioning and an optimization that exploits lazy
join to improve the efficiency in computing query relaxation.
The objective of query relaxation is to compute relaxed queries, their proper
(i.e., novel) answers, and their ordering, thus enabling to return the answers
starting from more specific relaxed queries.
I am not an expert in the field and I am not entirely familiar with the
related work. According to the authors, the existing approaches to query
relaxation all generate relaxed queries up to some relaxation distance, and
evaluate such queries from the most to the least specific. This yields to
performance problems since the number of relaxed queries can grow
exponentially with the size of the original query, and each relaxed query is
separately evaluated.
The authors propose a different approach based on clusters, where a cluster
informally represents a relaxed query and the set of answers to that
query. The approach refines an initial very general cluster (e.g., the cluster
representing the entire dataset) by splitting it in two clusters based on a
query element e (a triple pattern or a selection filter). The first derived
cluster includes the subset of answers where e holds, whereas the second
cluster is the complement of the first.
This approach builds a binary tree with p leaves (queries), where p is bounded
by the number of possible answers (in the worst case, the entire dataset) and
the power set of the number of query elements.
In this iterative process, every time a cluster is partitioned according to an
element e, if e is a triple pattern then the computation of the new set of
answers requires computing a possibly exponential number of joins. The authors
propose an optimization with lazy joins that represents the set of matching
using a tree-based structure and reduces the cardinality of the sets
participating in the joins.
In general, the paper is well written and precise. The topic is interesting
and certainly relevant for the conference.
However, I believe that the presentation of the paper can be improved by
providing more examples and by explaining the algorithms in greater
details.
As a reader, I had to pass through Section 5 several times to understand the
algorithm, and I'm still not sure I got all the details (see below). I suggest
that the authors provide a simple running example in which they show a couple
of iterations of their algorithm to guide the reader.
Conversely, in the case of Section 6, the example is very useful to grasp the
high level mechanism, but then Algorithm 2 should be discussed in greater
details (what do delta+ and delta- represent? ...).
Concerning again Algorithm 1, I am not sure I understood what the complement
of Ce contains. Indeed, the algorithm (line 6) removes e from the set of
elements, but adds relax(e). This means that the clusters do not partition the
set of answers? Ce and its complement can share answers? Which are the
consequences? Again, I think that these aspects need to be discussed in
greater details.
Concerning the evaluation, I have two doubts. The first one relates to the
comparison with existing algorithms. Existing algorithms compute more specific
queries first, and then compute the answers to more and more relaxed
queries. This means that they can start presenting more specific answers as
soon as they are computed and return more relaxed answers later: the latency
to receive the first answers can be fairly low. Conversely, if I understand
correctly, the proposed algorithm starts from the more generic answers and
refines them through partitioning: this means that the entire process needs to
terminate before the answers can be provided. I am wondering if this aspect
can be relevant in the case the results need to be presented to some user or
to some system that further processes them (and can start earlier in the case
of traditional algorithms).
Second, the authors do not provide any qualitative comparison of the relaxed
queries (and related answers) generated by existing approaches and the
proposed one. I believe that this aspect is also important, maybe even more
than performance.
To conclude, I think that the paper addresses an interesting topic and
introduces a novel approach. However, the current version of the paper makes
it difficult to understand some important aspects of the approach and its
evaluation.
---
Minor comments
page 3: the above approaches [] puts -> put
page 6: Huang et al [] uses -> use
page 7: partionned -> partitioned
page 8: relaxtion -> relaxation
---
Strong points
1) Interesting and relevant topic.
2) Novelty of the solution.
3) In general, the paper is well written and precise.
Weak points
1) The presentation of the paper can be improved by providing more examples
and by explaining the algorithms in greater details.
2) It is not entirely clear to me how the solutions compares to existing
algorithms in terms of latency in presenting the results.
3) It would be useful to have a qualitative comparison of the relaxed queries
(and related answers) generated by existing approaches and the proposed
one.
Questions
1) Algorithm 1. What does the complement of Ce contain? The algorithm (line 6)
removes e from the set of elements, but adds relax(e). This means that the
clusters do not partition the set of answers? Ce and its complement can
share answers? Which are the consequences?
2) Algorithm 2. Can the authors expalin what do delta+ and delta- represent.
3) How does the solution compare with existing algorithms in terms of latency
to present results to queries? Existing algorithms compute more specific
queries first, and then compute the answers to more and more relaxed
queries. This means that they can start presenting more specific answers as
soon as they are computed and return more relaxed answers later: the
latency to receive the first answers can be fairly low. Conversely, if I
understand correctly, the proposed algorithm starts from the more generic
answers and refines them through partitioning: this means that the entire
process needs to terminate before the answers can be provided. I am
wondering if this aspect can be relevant in the case the results need to be
presented to some user or to some system that further processes them (and
can start earlier in the case of traditional algorithms).
4) Would it be possible to provide some qualitative comparison of the relaxed
queries (and answers) generated by the existing approaches and by the
proposed one?
----
I would like to thank the authors for answering all my questions in the rebuttal,
especially regarding the latency of existing approaches.

 

Review 2 (by anonymous reviewer)

 

(RELEVANCE TO ESWC) The proposed approach handles relaxation of SPARQL queries over Linked Data to be able to process them efficiently as required for similarity search. Therefore, the paper is of appropriate relevance for the ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The solution seems to be novel although maybe not applicable to many kinds of datasets, because it depends on extensive definitions within the schema/ontology.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is described in detail including formal definitions. Correctness and completeness of the description is very good.
(EVALUATION OF THE STATE-OF-THE-ART) The author introduces several competitive approaches and distinguishes the proposed approach against them. Nevertheless the majority of the related work is several years old and especially in the field of efficient similarity search might be more relevant approaches.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach consists of two different steps which facilitates the independent implementation of both algorithms. However, only few parameters are described in detail which might influence the results of the overall approach (as e.g. number of relevant triples, similarity score, different types of ontological restrictions for the lazy joins).
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation describes in detail the efficiency of the proposed approach. It lacks the discussion of the quality of the resulting queries which is an essential part of query relaxation in similarity search. The most efficient approach might result in completely irrelevant queries in terms of similarity.
(OVERALL SCORE) The author presents an approach for query relaxation as it could be used for similarity search or question answering as well. The approach consists of two algorithms: answer partitioning and lazy joins. The former is used to retrieve more general answers for a given query. This applies for incomplete datasets where expected definitions (for example type definitions) are not fulfilled for all entities. The latter algorithm is restricting the potentially large amount of triples according to the underlying schema of the relevant dataset.
The proposed approach and its presentation includes the following strong points:
S#1: Formal definitions and symbols are introduced in a conclusive way
S#2: relaxation and efficiency steps are strictly isolated which enhances parametrization for optimization
S#3: Experiments and evaluation are processed on different datasets and well documented
Weak points include:
W#1: The author claims that for similarity search potentially many triples are retrieved when a query is relaxed for generalization. Unfortunately, there is no word on the level of similarity especially in terms of semantic relevance. If a query is relaxed in a way that too many triples are applying, are the potential answers still relevant for the original query?
W#2: The lazy joins step is based on the restrictions that are provided by the schema/ontology of the underlying dataset. Many datasets (as also the DBpedia) only provide weak ontologies with weak definitions regarding sub properties or property restrictions. Therefore, I doubt that the approach performs well on such datasets.
W#3: As mentioned in 1. the semantic aspect of the approach is not discussed for the answer partitioning step. Consequentially, the evaluation part lacks a quality aspect. There is no information about the proposed answers of the very efficient approach. The approach is fast, but are the answers relevant for the respective application?
After rebuttal:
I would like to thank you the author for the clarification. It supports my recommendation for a (weak) acceptance.

 

Review 3 (by Ioanna Lytra)

 

(RELEVANCE TO ESWC) The current paper deals with new algorithms for query relaxation and similarity search of RDF nodes, a topic that is very relevant to the ESWC community.
(NOVELTY OF THE PROPOSED SOLUTION) The two proposed algorithms are novel and propose an improvement in terms of execution time to the state of the art.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) An element I am missing here is the effectiveness of the proposed algorithms. This point is not discussed in the paper.
(EVALUATION OF THE STATE-OF-THE-ART) In my point of view, the RW is discussed adequately, the advantages and disadvantages of existing approaches and with respect to the author's proposal are presented.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The author discusses the complexity of the algorithms (not in detail though for Algorithm 2). The evaluation suggests an improvement in terms of execution time for the introduced algorithms; also the conditions under which the algorithms are better than the state of the art are discussed in the Evaluation section.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The algorithms are clearly described and the datasets and used queries are available. It is, therefore, possible to reproduce the results given also the fact that the source code is available publicly.
(OVERALL SCORE) Summary of the Paper
The paper entitled "Answers Partitioning and Lazy Joins for Efficient Query Relaxation and Application to Similarity Search" proposes two algorithms, one based on answer partitioning for query relaxation and one based on lazy joins for similarity search. The author compares the two algorithms with baseline approaches using two different datasets (MONDIAL and LUBM) in terms of execution time (query relaxation) and number of proper relaxed queries (for similarity search).
Strong Points (SPs)
- Clear presentation of introduced algorithms.
- Comparison to baseline and discussion of results.
- Explanatory examples accompanying the algorithms.
Weak Points (WPs)
- The setting for the similarity search is not explained in detail.
- The experiments focus rather on the efficiency, what about the effectiveness?
- Not all algorithms discussed in the RW section are afterwards used for comparison in the Evaluation section.
Questions to the Authors (QAs)
- What is the complexity of the second algorithm (Algorithm 2)?
- How do you judge the effectiveness of your approach (especially for similarity search)?
After rebuttal
--------------
I would like to thank the authors for addressing my comments and questions.

 

Metareview by Maria-Esther Vidal

 

This work tackles the problem of query relaxation and proposes a solution that combines answer partitioning and lazy joins to compute approximate answers efficiently. The empirical results presented in this work indicate that the proposed solution is able to overcome state-of-the-art approaches in terms of efficiency.
The reviewers agreed that this line of research is definitely interesting and relevant to the Semantic Web. Nonetheless, the reviewers have identified several weak points of this work: missing evaluation and discussions on the effectiveness of the proposed solution, lack of details in portions of the approach (especially, in Algorithms 1 and 2) which hinders the readability of the paper, limited experimental evaluation which hinders the generality of the results.

 

Share on

Leave a Reply

Your email address will not be published. Required fields are marked *