Paper 213 (Research track)

Feature-based Reformulation of Entities in Triple pattern Queries

Author(s): Amar Viswanathan Kannan, Geeth De Mel, James Hendler

Full text: submitted version

Abstract: Knowledge graphs link uniquely identifiable entities to other entities or literal values by means of relationships, thus enabling semantically rich querying over the stored data. Typically, the semantics of such queries are crisp, thereby resulting in crisp answers. Query log statistics show that a majority of the queries issued to knowledge graphs are entity-centric queries. When a user needs additional answers, the state of the art in assisting users is to rewrite the original query, resulting in a set of approximations. Several strategies have been proposed in the past to address this. They typically move up the taxonomy to relax a specific element to a more generic element. Entities don’t have a taxonomy, and they end up being generalized. To address this issue, in this paper, we propose an entity-centric reformulation strategy that utilizes schema information and entity features present in the graph to suggest rewrites. Once the features are identified, the entity concerned is reformulated as a set of features. Since entities can have a large number of features, we introduce strategies that select the top-k most “relevant” and “informative” ranked features and augment the original query with them to create a valid reformulation. We then evaluate our approach by showing that our reformulation strategy produces results that are more informative when compared with the state of the art.

Keywords: Query Reformulation; Query Relaxation; SPARQL Query Reformulation; Entity Feature Ranking; Flexible querying

Decision: reject

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) The paper presents a new method for SPARQL query relaxation.
(NOVELTY OF THE PROPOSED SOLUTION) The novel contribution is to use facts from the KG to have a relaxed query more similar to the original query.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors motivate the work and define relaxation rules based on the content of the KG. The method then selects only a few candidate reformulations for efficiency purposes. This is done by means of a ranking function defined ad hoc.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation looks at efficiency over KGs of increasing size. The result set appears to be very large with the reformulation, which would make the approach unusable by end users, contrary to what is claimed.
The main issue with the evaluation is the lack of comparison with baselines for effectiveness. Only examples are provided; there is no properly set up effectiveness evaluation with metrics that measure the quality of the retrieved results.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The discussion compares the proposed approach with existing literature. The key properties of the approach are interesting.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments use standard benchmarks, making them reproducible to a certain extent. Not all details needed to replicate the experiments are provided (e.g., which 20 queries were chosen).
(OVERALL SCORE) **Summary of the Paper
The paper presents a new method for SPARQL query relaxation.
**Short description of the problem tackled in the paper, main contributions, and results
The authors motivate the work and define relaxation rules based on the content of the KG. The method then selects only a few candidate reformulations for efficiency purposes. This is done by means of a ranking function defined ad hoc.
The novel contribution is to use facts from the KG to have a relaxed query more similar to the original query. 
**Strong Points (SPs)
1 Interesting approach
2 Important problem addressed
3 Problem well motivated
** Weak Points (WPs)
1 The results are not convincing and do not compare against other methods.
2 The result set of the relaxed query appears to be large
3 No effectiveness evaluation other than examples
** Questions to the Authors (QAs) 
n/a
** Rebuttal
After reading the other reviews, the rebuttal, and the discussion, I am still not convinced by the baseline design choices, as described above and by other reviewers.


Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) The problem of query relaxation on RDF graphs/KGs is highly relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The paper argues that its solution of utilising the instance data statements about a query's entities to reformulate more relaxed queries is new and has not been done before. However, the reviewer is not convinced of the innovative aspects claimed by the authors when comparing with work like [9].
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution seems reasonable to implement, but the query reformulation procedure is not sufficiently described.
(EVALUATION OF THE STATE-OF-THE-ART) The state of the art is poorly analysed. The dedicated section 5 (Discussion) is superficial. The evaluation does not compare against any current solutions.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The proposed approach is adequately illustrated via examples, but the formalisation is quite shallow. No algorithms are given.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The paper mentions that the code is available on GitHub (https://github.com/N00bsie/QueryReformulation), but when the reviewer accessed it, it was an empty repository.
(OVERALL SCORE) The paper tries to address the problem of query relaxation by focusing on rewriting the query based on relaxation rules for entities in the triple patterns. The paper also introduces a ranking model using TF-IDF measures. The paper seems to have been rewritten in a rush; the implementation and evaluation are unfinished. The analysis of the state of the art is misleading.
***Weak Points***
1) The paper makes an inaccurate claim that [7,8,14,15,9] are “known as simple relaxation”, which only relaxes entities to a variable. I would argue that this is not the case. For example, in [9], they consider relaxations on the class of instances and provide a solid solution in terms of formalisations and algorithms.
2) The evaluation is quite superficial as it does not compare against any current solutions. Also, the evaluation on real-world data in section 4.2 is a demonstrative walk-through rather than an evaluation step. In my opinion, Section 4.1 on evaluating scalability on the LUBM dataset does not bring much value to the paper, especially since LUBM is a simulated dataset generated by a controlled and uniform distribution.
3) Put side by side with related work like [7] and [9], the reformulation model presented in section 3 is not superior to its counterparts, specifically in the formalisation and ranking model. Furthermore, the ranking model is not supported by any working evidence for the targeted application domains, e.g. an evaluation in the paper or somewhere else (for KGs or RDF graphs rather than general entity summarisation [19]).
4) The analysis of related work in section 5 does not help the reader to clearly see the contributions of the paper relative to the state of the art. Hence, the claimed contributions raise doubts for the reviewer, especially after checking some of the related work.
5) I think the important point of the proposed solution would be section 3 on “Query reformulation procedure”, but unfortunately it is oddly short (1 paragraph).
6) The paper cites a link in section 3.3 which leads to a three-line document that could easily be included in the paper. Then, in section 4.1, the paper states “Our implementation can be found at our GitHub” with a link to an empty repository.
***Strong Points***
1) The paper gives an interesting observation on an important problem.
2) The paper is easy to read and follow.
***Questions to the Authors***
1) Please answer the questions raised in the Weak Points.
2) What entailment regime do the authors assume in the paper when talking about type, subclass, and sub-properties? And is reasoning considered?
3) The paper mentions blank nodes in section 2, but section 3 does not mention them. So, how are blank nodes handled?
=======
After rebuttal:
Many thanks to the authors for clarifying some of my concerns, but the majority of my concerns are not sufficiently addressed. So, I keep my score.


Review 3 (by Martin Giese)

(RELEVANCE TO ESWC) The paper concerns reformulations and relaxations of SPARQL queries over RDF data – clearly within scope.
(NOVELTY OF THE PROPOSED SOLUTION) The approach proposes to replace a triple pattern in a query referring to a concrete entity e, not by a variable as is otherwise done, but by a set of triples filtering on certain properties asserted for e in the dataset.
This approach seems to be new.  The selection of which properties to filter on, using selectivity and popularity, is not very surprising.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) There is quite a bit of formal notation in Section 2 and 3.  Unfortunately, some of this seems to be wrong.
E.g., the notation | sth | some condition | occurs in several places.  What does that mean?  In some places it seems to mean
|{ sth | condition(sth) }|, i.e. the number of objects satisfying a condition, but not always, e.g. in the definition of "Variable Typing" in Sect. 3.
At the end of Sect. 3.1, after "Entity Summary Pattern", I read e ∈{<…> or <…> or <…>}… what is that "or"?
Specificity should have an argument (or several?).  Same for popularity.  Both definitions have the | x | y | notation again.
And what is popularity(o | o∈f)?
(EVALUATION OF THE STATE-OF-THE-ART) As far as I can tell, the authors are right in saying that the problem is not addressed satisfactorily in existing work.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Some evaluation on a few LUBM queries was done. And some on QALD-2 queries.  The information given in the paper is rather anecdotal, giving results for very few queries.
There is not much theoretical discussion of the approach.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) There is sufficient information about the experiments reported on, so they should be reproducible.
(OVERALL SCORE) This paper tackles the problem of queries to RDF databases where the "user needs additional answers" to the ones usual query execution delivers.  There is little discussion of why this might be the case, but the paper builds on existing work on query reformulation and query relaxation.  The idea is to weaken (relax) the query so that answers are computed that are not correct for the original query, but which are closely related in some sense.
This paper specifically addresses queries that contain concrete entity references (URIs) in some of the triple patterns.  Previous relaxation techniques treat entities by replacing them with a variable.  So asking for "movies directed by Scorsese" is relaxed to "movies directed by someone," which will be too general for most users.  This paper proposes to look into the dataset for information about the entity (Scorsese in this case), such as "persons who studied at New York University" or "persons with American nationality" etc., gather a _subset_ of these properties, and then look for "movies directed by an alumnus of NYU with US nationality."  The question remains which properties to use for this, and the paper suggests taking the top-k candidates (for a suitable k) according to a ranking based on measures of "specificity" and "popularity".
This approach is evaluated on a small number of examples.
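The reformulation summarised above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy triples, the placeholder IRIs such as `<directedBy>`, the function names, and the IDF-style `specificity` score are all hypothetical, and the paper's "popularity" measure is left out because its definition is not given in this review.

```python
from math import log

# Toy graph as (subject, predicate, object) triples. All names are hypothetical.
TRIPLES = [
    ("Scorsese", "almaMater", "NYU"),
    ("Scorsese", "nationality", "USA"),
    ("Scorsese", "type", "Person"),
    ("Lee", "almaMater", "NYU"),
    ("Lee", "nationality", "USA"),
    ("Lee", "type", "Person"),
    ("Nolan", "nationality", "UK"),
    ("Nolan", "type", "Person"),
    ("Kurosawa", "nationality", "Japan"),
    ("Kurosawa", "type", "Person"),
]


def features(entity, triples):
    """Return the (predicate, object) pairs asserted for an entity."""
    return [(p, o) for s, p, o in triples if s == entity]


def specificity(feature, triples):
    """Assumed IDF-style score: features shared by fewer subjects score higher."""
    subjects = {s for s, _, _ in triples}
    having = {s for s, p, o in triples if (p, o) == feature}
    return log(len(subjects) / len(having))


def reformulate(entity, var, triples, k):
    """Replace an entity by a variable constrained by its top-k features."""
    ranked = sorted(
        features(entity, triples),
        key=lambda f: specificity(f, triples),
        reverse=True,
    )
    return [f"{var} <{p}> <{o}> ." for p, o in ranked[:k]]


def relaxed_query(entity, triples, k=2):
    """Build a SPARQL string for 'movies directed by someone like <entity>'."""
    lines = ["SELECT ?movie WHERE {", "  ?movie <directedBy> ?d ."]
    lines += ["  " + pattern for pattern in reformulate(entity, "?d", triples, k)]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(relaxed_query("Scorsese", TRIPLES))
```

With k = 2 the generic `type Person` feature is dropped because every subject in the toy graph shares it, so the relaxed query keeps only the alma mater and nationality constraints on `?d`.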
Strong points:
* The treatment of entities in query relaxation by including parts of the data seems new and makes sense
* The Specificity and popularity measures seem to be a sensible approach.
* The approach is implemented and tested
Weak points:
* an explanation of _why_ users would want more answers to a query would help in evaluating the approach. I need to know what problem they had with the query to say whether that problem is solved.
* the technical definitions are lacking  (missing parameters, strange notation, etc., see "Correctness" above)
* only rather few examples are shown
* the ranking formulae are stated and then used, without any discussion or comparison of alternative ranking methods
* there are quite a few grammatical mistakes
Edit after rebuttal:
I am still confused about the notation used.
Concerning arguments: Specificity is defined in (1) as Specificity = …, but in (3) it is used as Specificity(f).  You need to define Specificity(f), or Specificity(<p,o>).
The same applies to popularity in (2).  It’s used with an argument in (3), though I don’t understand what that argument means.
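As an illustration of the parameterised form requested here, a TF-IDF-style specificity over a feature f = <p,o> could be written as below. This is a hypothetical reading, not the definition from the paper: the symbols E (the set of entities) and G (the set of triples in the graph) are assumptions introduced only for this sketch.

```latex
% Illustrative only: E and G are assumed symbols, not the paper's notation.
% E = set of entities in the graph G; <p, o> is one feature of an entity.
\[
\mathrm{Specificity}(\langle p, o \rangle)
  = \log \frac{|E|}{\left|\{\, e \in E \mid (e, p, o) \in G \,\}\right|}
\]
```

Written this way, the measure takes the feature as an explicit argument, and the denominator uses the set-cardinality reading |{ sth | condition(sth) }| discussed under "Correctness" above.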


Metareview by Hala Skaf

This submission presents a new method for SPARQL query relaxation. The topic is generally relevant and interesting. Unfortunately, the contribution of this work is questionable because its positioning with respect to the state of the art is not clear. The proposed approach is adequately illustrated via examples but without any formal foundations. The important point of the proposed solution, the “Query reformulation procedure” in section 3, is described only in a short text.
The evaluation is limited to a small number of queries. As a result, the submission does not fulfill the requirements for an ESWC publication in terms of maturity, technical depth, presentation and quality.

