Paper 1 (Research track)

Empirical Evaluation of Similarity and Path Ranking Measures to Address the Entity Connectedness Problem in Knowledge Graphs

Author(s): José Eduardo Talavera Herrera, Javier Guillot Jimenez, Marco Antonio Casanova, Bernardo Pereira Nunes, Giseli Rabello Lopes, Luiz André P. Paes Leme

Full text: submitted version

Abstract: This paper presents an empirical evaluation of a family of path search strategies that help understand the connectivity of a pair of entities in a knowledge graph. Each path search strategy in the family combines an entity similarity measure and a path ranking measure to find and order relevant paths between a given pair of entities. The paper adopts a benchmark covering two entertainment domains and a baseline from the literature. It applies the pairwise comparison method to compare the path search strategies with the benchmark and with the baseline. Finally, the paper shows that the path search strategy that adopts the Jaccard index and the Exclusivity-based Relatedness ranking measure was the best strategy and obtained better results than the baseline.

Keywords: Entity relatedness; Entity similarity measure; Path ranking measure

Decision: reject

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) The paper fits the ESWC topics.
(NOVELTY OF THE PROPOSED SOLUTION) The paper focuses on the evaluation of a set of measures, already available in the literature, that are used to assess entity relatedness, which ultimately amounts to determining relevant paths and ranking them. For this purpose, a benchmark available in the state of the art is used.
The experimental evaluation appears to be sound and well done. Overall, the paper seems to deliver a moderate take-home message, also because of the limited number of datasets used (just two, both essentially coming from the very same source, DBpedia).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper is weakly motivated, and the notion of relevance for a path needs to be formalized, or at least a motivating example should be reported. For instance, given the intuitive notion reported in the paper, it seems that a kind of similarity threshold should be introduced. The very same comment applies to Section 2, beginning of page 3 (entity similarity).
The example reported at the end of Section 3 is not fully clear, since it uses notions such as the definition of the Jaccard measure, which is formalized only afterward. Additionally, the example lacks clarity: nodes are highlighted without actually showing the reason for it.
The frequency of two items co-occurring in G should be explained in more detail, and preferably this notion should be formalized, particularly when talking about the co-occurrence of properties.
In Algorithm 1 there are items that are used without being introduced, e.g., V_side (which is not even initialized), the local database, and the functions "extract", "selection", and "join". Some of them should be formally defined; for others, some comments in the pseudo-code could be enough. As for "selection" and "join", they are briefly described in the text, but additional details should be reported, at least for the join function. Furthermore, it should be clarified what the authors mean by "the description of an entity in G".
As for the adopted knowledge graph, which version of DBpedia has been used? It is not clear whether additional subsets of DBpedia have been generated or whether only the datasets provided in [10] have been used.
(EVALUATION OF THE STATE-OF-THE-ART) Section 6 mainly lists some related works, but it lacks a clear discussion concerning the advances of the proposed study with respect to the state of the art. Overall, the authors should argue more extensively about the advances and added value of the proposed research with respect to the state of the art, considering that the main contribution of the work is the experimental study, based on theoretical solutions already proposed in the literature. Additionally, some more discussion concerning the scalability of the considered solution should be provided.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Same comments as under (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) above.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Same comments as above: it is not clear which version of DBpedia has been used, and, although the experimental evaluation appears sound and well done, only two datasets from the very same source (DBpedia) were used.
(OVERALL SCORE) The paper focuses on the evaluation of a set of measures, already available in the literature, for assessing entity relatedness by determining relevant paths and ranking them, using a benchmark available in the state of the art.
Some aspects of the paper need to be improved and/or clarified; see the sections above for details.
As regards the experimental evaluation, it is overall clear and well written, even if some details should be added (see comments above). However, overall, the paper seems to deliver a moderate take-home message, also because of the limited number of datasets used (just two, both essentially coming from the very same source, DBpedia).
The discussion concerning related works should also be improved. Additionally, some more discussion concerning the scalability of the considered solution should be provided.
--- AFTER REBUTTAL PHASE ---
I acknowledge that I read the authors' response, and I am aware that a revised version of the paper is available online.


Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) The scope of the paper is the evaluation of a methodology to search for paths between entities in an RDF graph.
(NOVELTY OF THE PROPOSED SOLUTION) This score is due to the track to which this paper has been submitted, which is indeed focused on benchmarking and evaluating existing methodologies. The novelty here is not in the methods but in the fact that this evaluation seems not to have been made before.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper suffers from some lack of correctness and completeness. For example, why was the strategy U&P not used? Some of the mathematics in the similarity and ranking measures seems off; I give examples below. Algorithm 1 mentions three unspecified functions (selection, extract, and join).
(EVALUATION OF THE STATE-OF-THE-ART) The state of the art with respect to the measures being evaluated is scattered throughout the paper and not systematized in a dedicated section. There is a related work section that mentions applications of path ranking. Overall, however, I do not think this is a problem in this track.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The results are somewhat difficult to follow and interpret. The only statistical analysis is whether the performances of the strategies significantly differ from one another.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The paper provides the algorithm that was implemented, but some parts are left underspecified. The methodology was applied to two domains only.
(OVERALL SCORE) Description:
This paper evaluates 9 path strategies to find and rank paths in an RDF graph connecting two input entities. The evaluation was made by contrasting how these strategies compare to a baseline, RECAP. The evaluation consists in calculating a score for each produced ranking of paths. The paper concludes that one of the 9 strategies is better than the others.
Strong points:
1. The focus of the paper is important in the Semantic Web field
2. The paper provides mathematical formulas and algorithms to ensure reproducibility
3. While the results are restricted to two domains, the work is completely generalizable to other domains
Weak points:
1. The paper is missing a more detailed description of the dataset used to evaluate the strategies. In particular, the dataset seems to have been extracted automatically from existing resources (see reference [10]), and yet it is used as a ground truth. Was there manual curation of that dataset? These issues should either be better explained in the text, with the use of that benchmark better motivated, or else the paper should make clear that the benchmark is not curated.
2. Some details of the algorithm are missing, hindering its reproducibility
3. The provided example of how the path search algorithm works is vague and not very informative
4. The English should be corrected by a native speaker
Questions to the Authors:
1. What is the “transitive similarity” used in the introduction? It does not seem to be in line with the mathematics of the “Entity similarity” criterion presented in Section 2. In fact, I suspect that you meant, in that description, “… such that, for each i in [0, q], w_0 and w_i are similar, and for each j in [q, k], w_j and w_k are similar”.
2. Why was U&P not tested (Table 1)?
3. Should it be “jaccard(a, c) = 0 if A_d union C_d = {}”? (You have it equal to 1, but if the neighbourhoods of A and C are empty, should they really be considered completely similar? See the sketch after this list.)
4. The method to compute SimRank should be better explained
5. Where are the definitions of the functions “extract” (line 21), “selection” (line 28), and “join” (line 42) in Algorithm 1?
6. What is “in-core” in line 2 of Algorithm 1?
7. Sometimes the misspelled abbreviation “DGC” is used instead of DCG.
8. The formula for the DCG is incorrect: it should read “log_2 (i+1)” instead of “log_2 i” (see the corrected formula after this list).
9. Explain why the values in Table 3 seem much more “normalized” than those in Table 2 (they seem to be evenly distributed around a mean value of about 100, while in Table 2 they seem to follow a more skewed distribution), and why the values are decimals in Table 3 but integers in Table 2.
10. Include more detailed captions in the tables, explaining the meaning of the grey background and the bold numbers.
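
To illustrate question 3, here is a minimal sketch of the Jaccard index under the convention suggested there (returning 0 rather than 1 when the union of the neighbourhoods is empty); the neighbourhood-set arguments are an assumption standing in for whatever neighbourhood definition the paper actually uses:

    def jaccard(a_nbrs: set, c_nbrs: set) -> float:
        # Jaccard index over the neighbourhood sets of two entities.
        # Convention from question 3: entities with empty neighbourhoods
        # carry no evidence of similarity, so the empty-union case
        # returns 0.0 rather than 1.0.
        union = a_nbrs | c_nbrs
        if not union:
            return 0.0
        return len(a_nbrs & c_nbrs) / len(union)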
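For reference, the corrected DCG formula from question 8, with the standard log_2(i+1) discount so that the first rank position is well defined (rel_i is the graded relevance of the path at position i, and IDCG_k is the DCG of the ideal ordering of the same k grades):

    \mathrm{DCG}_k = \sum_{i=1}^{k} \frac{\mathit{rel}_i}{\log_2(i+1)},
    \qquad
    \mathrm{nDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}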


Review 3 (by Vito Walter Anelli)

(RELEVANCE TO ESWC) I consider this work relevant to ESWC topics, potentially useful for a plethora of researchers.
(NOVELTY OF THE PROPOSED SOLUTION) I consider the novelty of the work quite limited because it is a comparison between well-known approaches.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) I don't think this applies to this work.
(EVALUATION OF THE STATE-OF-THE-ART) The related work section is really short, both w.r.t. recommender systems (a lot of works related to this topic have been proposed in recent years) and w.r.t. the relatedness of entities in a graph.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Many points in the work are unclear to me.
Why did the authors decide to exploit those measures? Because they are the most popular ones? Because they seem to be the best performing ones?
Why did they decide to use those specific baselines?
A more detailed justification for those choices would be helpful.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) It is fine that they used an RDF graph extracted from DBpedia, but what are the details?
Which version of DBpedia did they use? Did they use the whole graph? Did they extract two different subsets (music and movies)? Was the extraction based on specific classes? Before the pruning they described, did they consider all possible properties in DBpedia?
I found the description of the graph and the dataset really short, considering the number of decisions they had to take in order to prepare their experiments.
This description would also help to reproduce their results.
The experiments and tables are clear (typo: DGC in the tables), and this is an important point in this work, but the discussion is not convincing. They claim that J&E was the winner of the competition, but it seems to me that J&E is not the absolute champion of the comparison; rather, Exclusivity-based Relatedness seems to be. A more detailed discussion of the results would help us better understand their claims.
(OVERALL SCORE) The authors perform a comparison between nine different combinations of strategies to extract relevant paths between items and metrics to rank the relevance of paths.
In detail, they exploited the Jaccard Index, the Wikipedia Link-based Measure, and SimRank to measure similarity, and they decided to use three classical ways to measure path relevance: Exclusivity-based Relatedness, Predicate Frequency Inverse Triple Frequency, and Pointwise Mutual Information.
They found that the pair Jaccard Index and Exclusivity-based Relatedness is the best performing approach.
I found the topic of the paper really interesting and coherent with the literature. Their findings could easily be put to use by those who often work on graph relatedness.
I consider this work relevant to ESWC topics even though I consider the novelty of the work quite limited.
It was hard for me to understand the main motivations of the work because I found nothing in the introduction. I agree that, too often, the introduction becomes a bucket full of hollow words, but I think that if the authors had explained the reasons why they decided to work on this, it would have been easier for the reader to understand their choices, in terms of competing approaches and evaluation.
As noted above, many points in the work are unclear to me, in particular why the authors decided to exploit those measures and why they chose those specific baselines.
Beyond these considerations, my main concerns are related to the experimental evaluation and the related work.
My questions about the DBpedia graph and the dataset description, my concerns about the discussion of the results (J&E does not seem to be the absolute champion of the comparison; Exclusivity-based Relatedness does), and my remarks on the short related work section are detailed in the sections above.
Finally, I summarize the strong and weak points I see in this work:
Strong points: 
- Interesting topic (potentially useful for a plethora of researchers)
- Clear results
- The choice of the DBpedia graph (if explained well) could ensure the reproducibility of the experiments.
Weak points:
- Lack of motivation in the introduction
- Unclear description of the graph and dataset used
- Unconvincing discussion of the results
- Very short related work section


Review 4 (by Tommaso Di Noia)

(RELEVANCE TO ESWC) Since this paper presents an empirical evaluation of 9 path search strategies, it is coherent with the track.
(NOVELTY OF THE PROPOSED SOLUTION) There are novel ideas as the authors combine in every possible way entity similarity measures with path ranking measures.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution is well explained, but some points are not completely motivated (see the weak points in the overall evaluation).
(EVALUATION OF THE STATE-OF-THE-ART) Even if the related work shows the authors' knowledge of the field, more baselines and similarity measures should have been used in the experimental evaluation.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Some details and considerations are missing.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Public datasets and a simple evaluation setting make the work reproducible.
(OVERALL SCORE) ---------------------------Summary of the paper---------------------------
In this work the authors present an empirical evaluation of 9 path search strategies, which they call "similarity-and-ranking strategies", obtained by combining 3 entity similarity measures (Jaccard Index (J), Wikipedia Link-based Measure (W), and SimRank (S)) and 3 path ranking measures (Predicate Frequency Inverse Triple Frequency (I), Exclusivity-based Relatedness (E), and Pointwise Mutual Information (P)). Each path search strategy is composed of two steps: 1) find a set of paths such that each path satisfies one or more selection criteria (Unrestricted, Limited degree, and Entity similarity); 2) recommend the top-j paths based on their relevance. They used only two selection criteria (Limited degree, set to 200, and an entity similarity measure). The Entity Relatedness Test Dataset in [10] was adopted as the benchmark, while RECAP and EXPLASS served as baselines. In order to measure the quality of the ranking, they used nDCG and considered the first 50 paths. The results show that the combination of the Jaccard similarity and the Exclusivity-based Relatedness ranking measure is, in most cases, the best-behaving path search strategy in the two selected domains (Music and Movies). The DBpedia graph was selected as the knowledge graph.
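
For concreteness, here is a minimal sketch of the evaluation score described in this summary, i.e., nDCG over the first 50 ranked paths; the function names and the graded-relevance input are illustrative assumptions, not taken from the paper:

    import math

    def dcg(relevances):
        # Discounted cumulative gain with the log2(i + 1) discount.
        return sum(rel / math.log2(i + 1)
                   for i, rel in enumerate(relevances, start=1))

    def ndcg_at_k(ranked_relevances, k=50):
        # ranked_relevances: ground-truth relevance grades in the order
        # produced by a path search strategy; the ideal ranking simply
        # sorts the same grades in descending order.
        top = ranked_relevances[:k]
        ideal = sorted(ranked_relevances, reverse=True)[:k]
        ideal_dcg = dcg(ideal)
        return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0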
---------------------------Strong Points---------------------------
1) The paper is well written, clear and easy to follow.
2) The idea behind this work (evaluate path search strategies) is good.
3) A good description of the background topics has been provided.
---------------------------Weak Points---------------------------
1) It would have been interesting to test other similarity measures, such as the Cosine similarity or PathSim [PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks].
2) In the reported experiments, the authors concluded that the J&E strategy should be preferred in the two domains, but it is not the absolute champion, since in the comparison with the benchmark W&E has the best average (although the differences between the two are not statistically significant, I cannot understand why I should prefer J&E over the W&E strategy). Could it possibly be due to the fact that the choice of path ranking measure has a greater impact than the choice of similarity measure? For the sake of completeness, it would be helpful to include the W&E strategy in the baseline comparison too.
3) Why are REX [5] and [12] not used as baselines?
--- AFTER REBUTTAL PHASE ---
I carefully read the authors' response and, actually, I am not satisfied at all. The aim of a rebuttal is to convince the reviewers that they misunderstood something and/or that something can be clarified or improved. This is definitely not the case here.


Metareview by Oscar Corcho

There has been extensive discussion on this paper, both on the initial set of reviews and on the final set of scores after the rebuttal. An initial comment should be made on this aspect, since the purpose of a rebuttal is not to provide answers in a new version of the paper, but to make the reviewers aware of any misunderstandings.
The paper has interesting merits, as pointed out by the reviewers, but it has relevant drawbacks: it focuses on a single type of data source, which is not fully specified and described in the paper, with no clear motivation, and it lacks a discussion of the measures that have been considered.

