# Paper 42 (Research track)

Graph Similarity Based Personalized Keyword Search on Large RDF Graphs

Author(s): Souvik Brata Sinha, Dimitri Theodoratos

Full text: submitted version

Abstract: Keyword search is a popular technique to retrieve information from linked data on the web. With the increase in RDF graph data repositories, it becomes essential to work on a keyword search method that identifies user intent in order to rank the relevant results from a pool of huge number of candidate results for a keyword query. Our work is focused on personalized keyword search; where we intend to take into account some information about a user who is searching for a query, so that we can provide the results most relevant to the user. We extract the structural summary of the RDF graph and compute the pattern
graphs for a keyword query on the RDF graph using its structural summary. The user information is in the form of a graph similar to the pattern graphs called profile graphs. The ranking of the relevant results is achieved by measuring the graph similarity between the user profile graph and the generated pattern graphs. We propose similarity metrics which produce a similarity score to individual pattern graphs based on which the pattern graphs are ranked such that a pattern graph with a higher similarity score is ranked to be more relevant to the user. Our experimental results show that our approach intuitively matches the the user interests.

Keywords: Keyword Search; RDF Graph; Structural Summary; Personalization

Decision: reject

Review 1 (by Peter Haase)

(RELEVANCE TO ESWC) The paper proposes an approach to personalised keyword search over RDF graphs based on graph similarity measures between pattern graphs (generated from keyword queries) and user profile graphs. The topic is relevant for ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The work builds on previous related work on structural summaries and pattern graphs (e.g. [13]) - while related work is referenced, it is not really explicit which aspects are novel / extended. The work on profile graphs and similarity measures between pattern and profile graphs appear novel.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The technical presentation of the patter and profile graphs and similarity measures appears correct and complete.
What is missing is an explanation of how user profiles are to be generated in targeted application scenarios.
(EVALUATION OF THE STATE-OF-THE-ART) Related work is only marginally discussed, an explicit statement or comparison how the state-of-the-art has been advanced is missing. In particular, the evaluation does not include a comparison with state-of-the-art approaches.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation is very shallow and not convincing. Important details are missing or not well explained. Starting with the queries in table 1, it is not clear what information needs are reflected or how results would be different for varying user profiles.
It is unclear where the user profiles come from and how they are generated. The discussion of the results is insufficient, the only statement is “ As can be seen, these values portray that our ranking obtained from the similarity metrics is very close to the user interests.” How do we see this?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments have been done with a single data set. It is not clear how the results would generalise.
Further, there is no comparison with alternative approaches. Based on the information provided in the paper, it would not be possible to reproduce the experiments.
(OVERALL SCORE) The paper proposes an approach to personalised keyword search over RDF graphs based on graph similarity measures between pattern graphs (generated from keyword queries) and user profile graphs. The topic is relevant for ESWC.
The work builds on previous related work on structural summaries and pattern graphs (e.g. [13]) - while related work is referenced, it is not really explicit which aspects are novel / extended. The work on profile graphs and similarity measures between pattern and profile graphs appear novel.
The technical presentation of the patter and profile graphs and similarity measures appears correct and complete.
What is missing is an explanation of how user profiles are to be generated in targeted application scenarios.
Related work is only marginally discussed, an explicit statement or comparison how the state-of-the-art has been advanced is missing. In particular, the evaluation does not include a comparison with state-of-the-art approaches.
The evaluation is very shallow and not convincing. Important details are missing or not well explained. Starting with the queries in table 1, it is not clear what information needs are reflected or how results would be different for varying user profiles.
It is unclear where the user profiles come from and how they are generated. The discussion of the results is insufficient, the only statement is “ As can be seen, these values portray that our ranking obtained from the similarity metrics is very close to the user interests.” How do we see this?
The experiments have been done with a single data set. It is not clear how the results would generalise.
Further, there is no comparison with alternative approaches. Based on the information provided in the paper, it would not be possible to reproduce the experiments.
typos:
in the abstract: “the the user” -> “the user”

Review 2 (by Alasdair Gray)

(RELEVANCE TO ESWC) The topic of the paper is very appropriate for ESWC
(NOVELTY OF THE PROPOSED SOLUTION) The formalism is well thought through. The approach is based on constructing structural summaries of the content data and graph patterns for the keyword search based on these. These can then be converted into SPARQL queries and executed over a triplestore.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The formalism seems correct. There are a few issues, e.g. Q a keyword query should be defined and signatures are not properly defined.
It is not made clear how \alpha, \beta, and \gamma should be determined.
The approach is limited to RDF graphs where all nodes have a type declaration.
The approach does not consider the class and property hierarchy.
(EVALUATION OF THE STATE-OF-THE-ART) An evaluation is provided, but many details are missing that mean that it falls short of the standard I would expect. The evaluation also only uses their own approach, other keyword matching approaches are not compared.
Related, there is no related work section. There is a very brief overview as is appropriate for an introduction, and then statements dropped in throughout the text. Also the first paragraph of the introduction is missing references to ground the statements.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper is well written with a running example used to demonstrate the formalism introduced.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Reference should be provided for Jamendo, in particular what version/date of the data was used. How could someone else get the same data and repeat your experiment? What were the queries and the user profiles used – these should be made avaialble on FigShare/Zenodo/GitHub? How were the queries selected? Who were your experts that determined the ground truth? How was this done? How many experts?
The result appear to be all over the place. I don't see how you conclude that your approach is effective.
What is the computational cost of generating all these summary graphs, particularly those constructed at query time? Does the volatility of the data have an adverse effect?
How does the approach compare with simply indexing all terms in the RDF?
(OVERALL SCORE) The paper presents an approach for performing keyword search over RDF graphs with the results being ranked based on user preferences. The paper and the formalism are generally well written, derived and explained. However, there is a lack of motivation for the real-world scenarios where this would be applicable and the evaluation is very limited.
As a whole, the paper is missing a motivation as to why personalised keyword search is useful. In what real-world situations would this approach be used. The definition of and construction of user profiles is unwhelming in this regard.
Minor issues:
- Abstract doesn't conform to the style guidelines
- Figure legends for 1 & 2 should explain what is happening in the figure rather than the surrounding text, later figure captions do this.
- Caption for Fig 5 is missing a '}' for the first query


Review 3 (by Anett Hoppe)

(RELEVANCE TO ESWC) The presented work is of high relevance to ESWC. With the increasing size of available RDF data, personalized access will be a more and more relevant topic.
(NOVELTY OF THE PROPOSED SOLUTION) Based on the information given in the paper, the proposed approach seems novel.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors specify the proposed measurement and its components in a thorough way.
(EVALUATION OF THE STATE-OF-THE-ART) There is hardly any discussion of the state of the art beyond a few references to works on personalized information access and knowledge graph similarity. Given the size of the respective research areas this seems inadequate. It would be much easier for the reader to judge the novelty of the proposed approach based on a clearly structured related work section with statements about how the discussed solution differs from the existing ones.
Furthermore, linking to survey papers on related areas which are not covered in the approach might be helpful – profile construction is not the main topic, but absolutely necessary to make actual use of the proposed solution. So why not reference one-two of the high-quality survey papers/solutions which exist?
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Composition and components of the novel measurement are presented in a formal way, supported by examples and visual representations. Some of the basic definitions (“RDF graph”) could be shortened to make more room for the actual novelty of the paper. Consider placing graphics closer to the definitions/examples which use them.
Some of your statements need references (e.g. in the introduction “SPARQL is too complex for simple users”, “efforts towards building large knowledge repositories where data is represented in RDF form”.
There is no discussion of future work – which are the “large knowledge repositories” that you plan to tackle?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation gives a first idea of the usefulness of the approach, but is in no way conclusive (examining only 10 queries, and being judged by three “experts”).  There is no explanation of the term “expert” in this context – expert users of Jamendo? Graph search? There is no discussion how the responses of the respective experts differed and if there were important differences in judgement observed on some of the queries.
No comparison to existing personalization methods is performed. Thus the authors state that the obtained ranking is “close to the user interests”, but do not provide evidence if the results are better or worse currently available personalization methods.
(OVERALL SCORE) The presented paper proposes a novel graph-based similarity measure with the objective to facilitate personalized access to graph-based knowledge bases.
Strengths:
The topic is certainly highly relevant to the community. The authors give a detailed, very formal description of their rational and present first experimental results. Their explanations are supported by visual representations which are used through the paper to make the proposed calculations more palpable.
Weaknesses:
The presentation seems to me overly formal and could be lightened up. The writing is not always straight-forward and seems cluttered (e.g. beginnings of 6.1. and 6.2 are redundant and even use the same wording). Some statements should be supported by references. In general, a discussion of the state of the art is missing. The presented evaluation is a first step, but should absolutely be extended!

Review 4 (by Andrea Giovanni Nuzzolese)

(RELEVANCE TO ESWC) The topic (i.e. personalized Keyword Search on Large RDF Graphs) discussed by the paper is certainly of interest to the conference
(NOVELTY OF THE PROPOSED SOLUTION) The idea of gathering prototypical graphs (i.e. patterns or motifs) from linked dataset for query purposes is not novel.
Examples of state of the art pattern extraction and usage solutions are [1,2,3].
Nevertheless, the metrics used for computing similarity and definition of profile graphs introduce fair enough degree of novelty. This justifies my score.
[1] Presutti, V., Aroyo, L., Adamou, A., Schopman, B., Gangemi, A., & Schreiber, G. (2011, October). Extracting core knowledge from linked data. In Proceedings of the Second International Conference on Consuming Linked Data-Volume 782 (pp. 37-48). CEUR-WS. org.
[2] Nuzzolese, A. G., Gangemi, A., Presutti, V., & Ciancarini, P. (2011, October). Encyclopedic knowledge patterns from wikipedia links. In International Semantic Web Conference (pp. 520-536). Springer, Berlin, Heidelberg.
[3] Nuzzolese, A. G., Presutti, V., Gangemi, A., Peroni, S., & Ciancarini, P. (2017). Aemoo: Linked data exploration based on knowledge patterns. Semantic Web, 8(1), 87-112.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors describe the solution and the metrics by both introducing formal definitions and providing supporting examples.
(EVALUATION OF THE STATE-OF-THE-ART) Some relevant state of the art are mentioned and briefly discussed in the paper.
Nevertheless, there in not a proper comparison between the proposed solution and related work.
This avoid a clear positioning of the proposed solution with respect to the state of the art.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Many details are not provided.
For example: how was generated the ground truth?
I find significant limits and flaws in the evaluation approach followed.
Typically, this kind of solutions, which deeply depend on user interaction, are evaluated in terms of usefulness and unexpectedness of results (cf. serendipity). Accordingly, it seems to me that a ground truth can hardly capture the effectiveness of the solution in generating useful pattern graphs for addressing personalised querying of large linked dataset.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) There is no reference to the specific Jamend version used.
The ground truth is not public released. Alternatively, the authors have to justify the reason why they cannot release the ground truth (e.g. specific non-disclosure agreements, etc.).
Has the solution proposed been implemented? If yes, please provide a reference and/or details.
(OVERALL SCORE) The paper presents a solution for the generation of personalised keyword search on RDF graphs using the concept of graph similarity.
The paper is in general well written and structured
*** PROS ***
- The paper introduces a fair degree of novelty by defining new metrics for computing similarities for graph matching.
- The formal description of the definitions and the metrics introduced is sound.
- The problem tackled by the authors is relevant to the conference.
*** CONS ***
- There is no clear positioning of the solution proposed with respect to related work.
- The evaluation is limited and show a number of flaws that, in my opinion, are significant.
- Many details are hidden that prevent the reproducibility of the experiments.

Metareview by Hala Skaf

Reviewers agree that the topic of the paper is relevant to ESWC. Defining new metrics for computing similarities for graph matching appear novel.
Reviewers agree that the state of the art is missing, only some relevant references are mentioned and briefly discussed, therefore, the positioning with the respect of the state of the art is unclear.
The evaluation is limited and not convincing. Many details are hidden that prevent the reproducibility of the experiments.
As a result, the submission does not fulfill the requirements for a ESCW publication in terms of maturity, technical depth, and quality.

Share on