Empirical Analysis of Ranking Models for an Adaptable Dataset Search
Author(s): Angelo Batista Neves, Rodrigo Guerra, Luiz André P. Paes Leme, Giseli Rabello Lopes, Bernardo Pereira Nunes, Marco Antonio Casanova
Full text: submitted version
Abstract: Currently available datasets still have a large unexplored potential for interlinking. Ranking techniques contribute to this task by scoring datasets according to the likelihood of finding entities related to those of a target dataset. Ranked datasets can be either manually selected for standalone linking discovery tasks or automatically inspected by programs that would go through the ranking looking for entity links. In the first case, users typically choose datasets that seem more appropriate among those at the top of the ranking, having little tendency for an exhaustive selection over the entire ranking. On the other hand, automated processes would scan all datasets along a whole slice of the top of the ranking. Metrics such as nDCG better capture the degree of adherence of rankings to users' expectations of finding the most relevant datasets at the very top of the ranking. Automatic processes, on the contrary, would benefit most from rankings that would have greater recall of datasets with related entities throughout the entire slice traversed. In this case, the Recall at Position k would better discriminate ranking models. This work presents empirical comparisons between different ranking models and argues that different algorithms could be used depending on whether the ranking is manually or automatically handled and, also, depending on the available metadata of the datasets. Experiments indicate that ranking algorithms that performed best with nDCG do not always have the best Recall at Position k, for high recall levels. Under the automatic perspective, the best algorithms may find the same number of datasets with related entities by inspecting a slice of the rank at least 40% smaller. Under the manual perspective, the best algorithms may increase nDCG by 5-20%, depending on the set of features.
Keywords: Linked Data; entity linking; recommendation; dataset; ranking; empirical evaluation
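The two metrics contrasted in the abstract can be illustrated with a minimal Python sketch. It assumes binary relevance labels and the common log2 position discount; the paper's exact nDCG formulation may differ, and the two example rankings are hypothetical, not taken from the experiments.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: the gain at position p is divided by log2(p + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def recall_at_k(relevances, k):
    """Fraction of all relevant items that appear in the top k positions."""
    total = sum(1 for rel in relevances if rel > 0)
    if total == 0:
        return 0.0
    return sum(1 for rel in relevances[:k] if rel > 0) / total


# Hypothetical relevance labels (1 = dataset has related entities), listed in
# ranked order. Both rankings contain the same three relevant datasets.
ranking_a = [1, 0, 0, 1, 1, 0]  # one relevant dataset at the very top
ranking_b = [0, 1, 1, 1, 0, 0]  # all relevant datasets inside the top 4

print(ndcg(ranking_a), recall_at_k(ranking_a, 4))
print(ndcg(ranking_b), recall_at_k(ranking_b, 4))
```

In this toy case ranking_a has the higher nDCG, while ranking_b has the higher recall at position 4, which is the kind of disagreement between the two metrics that the abstract describes.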
Review 1 (by Alessandro Margara)
(RELEVANCE TO ESWC) The paper presents an empirical comparison of different ranking models to score datasets according to the likelihood of finding entities related to those of a target dataset. The topic is certainly relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) Although the scope is narrow, I am not aware of similar solutions.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The methodology adopted in the evaluation is not discussed in detail. This is of primary importance for an empirical analysis.
(EVALUATION OF THE STATE-OF-THE-ART) Some concepts are not presented in detail, and this hampers the readability of the paper (see my detailed comments).
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The methodology adopted in the evaluation is not discussed in detail. This is of primary importance for an empirical analysis.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The methodology adopted in the evaluation is not discussed in detail. This is of primary importance for an empirical analysis.
(OVERALL SCORE) The paper presents an empirical comparison of different ranking models to score datasets according to the likelihood of finding entities related to those of a target dataset. The authors claim that some ranking approaches can be more appropriate if the dataset selection is performed manually while others are more appropriate if the selection is automatic. In the first case, users might choose datasets among those at the top of the ranking. In the second case, automated processes might scan several high-ranking datasets. The topic addressed by the paper is interesting and relevant for the community. Discovering and selecting relevant datasets that can enrich a target dataset can certainly be beneficial for many applications. However, I believe that the best model and algorithm to rank datasets strictly depends on the application scenario.
In some cases, the user might want to enrich a dataset as much as possible, selecting many datasets and thus making a very accurate ranking less relevant. In other cases, the user might want to select a very limited set of highly relevant datasets. In other cases again, the user might want to select subsets of various datasets. Because of this, I am a bit skeptical about the benefits of an empirical evaluation of ranking models, also considering that the authors measure differences between approaches that are not very large, and that these differences might change by taking into account different datasets.
The paper is well written, but sometimes hard to follow. First, the authors sometimes use some terminology before properly introducing it. For instance, in the abstract and introduction they use nDCG and Recall@k without introducing them. Similarly, the introduction mentions topic categories and rule-based classifiers without giving an intuition on how these concepts relate to the problem at hand. I suggest that the authors reduce the introduction and present the concepts and findings at a higher level, deferring the precise definition of the techniques to some later point in the paper. In the rest of the paper, the authors sometimes introduce concepts and formulas without giving an intuition of their meaning and use. This is the case, for example, of tf-idf(ls) and tf-idf(c). The same is true for the definition of score, which involves the cosine of the angle between two vectors, but this computation is not properly discussed and motivated. I would also suggest moving the related work earlier in the text, maybe as part of the background. Indeed, since the paper is an empirical evaluation of different approaches, it might be useful to know upfront which models have been proposed in previous research.
Concerning the evaluation, I am not sure I understood the methodology adopted. In particular, what is the ground truth to decide the recall of a ranking?
I suggest that the authors present their methodology in greater detail as part of Section 4. To conclude, I believe that dataset ranking is interesting and relevant for the community, but I am a bit skeptical about the benefits of an empirical evaluation of some approaches. The quality of the presentation is good but the paper is sometimes hard to follow. The experiment setup for the evaluation can be presented in greater detail, to help better assess the results obtained.
Strong points
1) Interesting and relevant topic.
2) Novelty of the study.
3) The paper is well written.
Weak points
1) Sometimes the paper is hard to follow, since it uses some terminology without properly introducing it.
2) The best model to use might depend on the application scenario, and this hampers the usefulness of the empirical study.
3) The experiment setup for the evaluation can be presented in greater detail, to help better assess the results obtained.
Questions
1) Concerning the evaluation, can the authors better explain the methodology? In particular, what is the ground truth to decide the recall of the ranking?
2) Can the authors better define tf-idf(ls) and tf-idf(c) and give an intuitive explanation of their meaning?
3) Can the authors better explain the benefits of their study, considering that the most suitable model might depend on the characteristics of the application at hand?
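For readers unfamiliar with the kind of score this review refers to, the following Python sketch shows a generic tf-idf weighting followed by cosine similarity. It is an illustration only: the paper's actual definitions of tf-idf(ls) and tf-idf(c) are not reproduced in this review, and the dataset names, feature names, and the helper functions tfidf_vectors and cosine are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(features_by_dataset):
    """Map each dataset to a sparse vector of tf * log(N / df) weights."""
    n = len(features_by_dataset)
    # Document frequency: in how many datasets each feature occurs.
    df = Counter(f for feats in features_by_dataset.values() for f in set(feats))
    vectors = {}
    for name, feats in features_by_dataset.items():
        tf = Counter(feats)
        vectors[name] = {f: tf[f] * math.log(n / df[f]) for f in tf}
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical feature lists, e.g. the targets of each dataset's linksets.
corpus = {
    "target": ["dbpedia", "geonames"],
    "d1": ["dbpedia", "geonames", "lexvo"],
    "d2": ["lexvo", "bio2rdf"],
}
vectors = tfidf_vectors(corpus)

# Rank the candidate datasets by similarity to the target dataset.
scores = {name: cosine(vectors["target"], vec)
          for name, vec in vectors.items() if name != "target"}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

The intuition is that a candidate sharing many distinctive features with the target gets a vector pointing in a similar direction, hence a cosine close to 1, while a candidate with no shared features scores 0.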
Review 2 (by Ernesto Jimenez-Ruiz)
(RELEVANCE TO ESWC) The paper presents an evaluation of five dataset ranking models and it is relevant to the ESWC track: Benchmarking and Empirical Evaluation.
(NOVELTY OF THE PROPOSED SOLUTION) It is unclear how to position the novelty of the contribution. The evaluation of ranking models is not novel per se. The authors should emphasize which are the novel characteristics of the presented evaluation (e.g. modification of ranking models, selection of features).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The ranking methods are evaluated according to a selection of features (up to 5 linksets and up to 5 categories) and with respect to 2 dimensions depending on the use case (nDCG and Recall@k). As for any ranking method, the selection of features is key. A different selection of features may have an important impact on the results, e.g. more than 5 linksets and 5 categories, or other features like top-X meaningful words in a dataset. The current features have a dependency on available linksets and categories (or DBpedia categories extracted from VoID descriptions). The dataset selection is limited to those datasets providing dumps, but one could also use the SPARQL endpoints to extract all triples (unless there is a restriction on the type of queries that can be executed).
(EVALUATION OF THE STATE-OF-THE-ART) As for the novelty point, a comparison with similar evaluation papers is missing.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper misses some additional discussion about the obtained results to better understand the behaviour of the evaluated methods. The authors use the word "conclude", which I would substitute with the less strong statement "the conducted empirical evaluation suggests". The results may vary with different datasets and feature selection.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I checked reference  but it does not seem to include the datasets used nor concrete instructions to execute the ranking models. It seems to only include the obtained results and source code to create the plots. Thus reproducibility does not seem to be possible given the current material. There are also some parameters that the authors claim were empirically chosen. But no more details are given, which also damages reproducibility. Reference  should include an extended technical report to guide the reproduction of the results.
(OVERALL SCORE) I would like to thank the authors for the rebuttal. This has served to clarify a couple of points and I have updated my score accordingly.
------------------------------------------------------
The paper presents an evaluation of five dataset ranking models. The main contribution seems to be the use of the ranking models with 3 different types of features.
Strong points:
- The ranking outcome is relevant to entity discovery algorithms
- Five models are evaluated
- Two dimensions are used for the evaluation according to two use cases.
Weak points:
- Reproducibility of the results
- Lack of fine-grained discussion about the results
- Position of the conducted evaluation within the state of the art
Questions for rebuttal:
- Which is the main contribution/novelty of the paper? Apart from the empirical study, did the authors perform adaptations to state-of-the-art ranking models?
- Why is the content of the datasets itself not taken into account as a feature? I understand only metadata is used.
- How could the results be reproduced? Are the selected datasets and algorithms available?
- Why did the models with a mixed set of features not improve the results? This is a bit counterintuitive and requires more discussion.
Other comments:
- Abstract and introduction are a bit too long. Some parts of the introduction could be moved to related work and background sections.
- The paper could benefit from some additional examples.
- Section 3 could be split into subsections for each model.
- On page 10, why is Fd removed when calculating R?
- On page 11, there is a typo. T1 should be < T1 + delta
Review 3 (by Payam Barnaghi)
(RELEVANCE TO ESWC) This work presents an empirical analysis of entity selection and ranking methods for linking different datasets. The authors have implemented and evaluated the performance of several existing solutions with a set of open datasets.
(NOVELTY OF THE PROPOSED SOLUTION) The work provides a good analysis of the existing methods for ranking; the evaluation based on nDCG, which also considers recall and precision at n, is beneficial and the presented results will be useful for the community.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) While the work does not present any novel solution, the comparison and evaluations presented in the paper are valuable and can be used by other researchers.
(EVALUATION OF THE STATE-OF-THE-ART) The work provides a good coverage and evaluation of the state-of-the-art in this domain.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper is well written and the results are described and discussed in the paper.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors are encouraged to make their selected datasets and code available and make their work reproducible.
(OVERALL SCORE) Overall this is a good paper that presents a detailed evaluation of existing methods for ranking and linking datasets. The results and the provided discussions are well presented. However, the work does not provide any novel technique or any best practices to detail how and when each method could be used.
Review 4 (by Shawn Bowers)
(RELEVANCE TO ESWC) This paper explores different metrics/models for a rank-based approach to finding dataset links. Instead of finding only the most relevant/similar datasets as candidates for linking, the approach ranks a set of datasets for links. The result can then be used either to manually find the best matches (according to the ranking) or automatically based on a "slice" of the ranking. Five different similarity metrics are considered and evaluated to determine which metrics provide the best results.
(NOVELTY OF THE PROPOSED SOLUTION) There isn't a solution being proposed per se. Instead, this paper provides an evaluation of different approaches. The results suggest that ranking provides more flexibility while also suggesting cutoffs for automatic approaches.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) While not all of the detailed results are given (a separate/extended evaluation is given as a reference), the evaluation seems appropriate and the results are fairly clear.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation of the ranking models is appropriate. The overall evaluation of related work is weak, not only in providing a fairly narrow set of references, but also in not comparing the given approaches to those considered as related work.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation section seems appropriate.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) It is unclear to me how the actual recall values (for Recall@k) would translate to different datasets. The results are very much focused on one corpus of datasets.
(OVERALL SCORE) My main concerns are that the paper needs a bit of fine tuning in terms of wording, clarity, typos, grammar, etc. throughout. There are a few places where terms aren't properly defined (e.g., the idea of a "slice" is never given). The idea of ranking seems somewhat minor compared to defining good/realistic metrics.
The related work in the paper doesn't provide any detail on similarities and differences of the approach in question. And the generality of the results with respect to the five models isn't clear.
Metareview by Emanuele Dellavalle
There has been extensive discussion on this paper, both on the initial set of reviews as well as on the final set of scores after the rebuttal. An initial comment should be made on this aspect, since the purpose of a rebuttal is not to provide answers in a new version of the paper, but to make the reviewers aware of whether there were any misunderstandings. All reviewers agree that the paper is relevant to ESWC, that it is novel, and that the research presented in it is carried out correctly. However, at least one of the reviewers expresses skepticism about the usefulness of the evaluation and the possibility of generalising the results.