# Paper 62 (Research track)

Time and Space Efficient Large-Scale Link Discovery using String Similarities

Author(s): Andreas Karampelas, George A. Vouros

Full text: submitted version

Abstract: This paper proposes and evaluates time and space efficient methods for discovering links between matching entities in large data sets, using state of the art methods for measuring edit distance as a string similarity metric. The paper proposes and compares three filtering methods that build on a basic blocking technique to organize the target dataset, facilitating efficient pruning of dissimilar pairs. The proposed filtering methods are compared in terms of runtime and memory usage: The first method exploits the blocking structure using the triangle inequality in conjunction to the substring matching criterion. The second method uses only the substring matching criterion, while the third method uses the substring matching criterion in conjunction to the frequency matching criterion. Evaluation results show comparative results of the pruning power of the different criteria used, also in comparison to the string matching functionality provided in LIMES and SILK, which are state-of-the-art tools for large-scale link discovery.

Keywords: link discovery; edit distance; similarity metrics

Decision: reject

Review 1 (by Martin Kaltenböck)

(RELEVANCE TO ESWC) Link discovery is a very important task in the SW community. Furthermore, the methods shown here can be of use in other tasks, such as entity extraction.
(NOVELTY OF THE PROPOSED SOLUTION) The novely of the methods is good. They provide improvements in performance, but their mathematical novelty is limited, in that they are a small evolution from existing methods.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) There is one important error in the paper that MUST be corrected: Lemma 1 is not correct. Here is a counter example for \theta = 3.  t = {"ab","cd","e","fg"} and s= "xyzdefg". Substring "d" of s matches a segment of t, yet s is not similar to t within threshold theta.
That being said, that error is not significant to the argument.
While the good mathematical notation is appreciated, there are two further shortcommings of the method:
1) The computation of the blocks, as described, does not necesarily need to uniform distrubions of strings per cluster, so that it is possible that blocks end up with one or two strings, thus negating their prunning potential.
2) The cost of building segment-based indices is not mentioned, and it can be significantly high both in terms of time and space, as the length of strings increases.
Because this two issues are not addressed at all, the proof of performance is 100% empirical. With the mathematical aparatus that has been built around the string matching problem, including the statistics of distribution in blocks, this can be improved.
(EVALUATION OF THE STATE-OF-THE-ART) Good evaluation against other methods based on strings. No mention of more heuristical methods.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) While performance improvements are shown, the correctness of the methods are not discussed nor evaluated at all.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Code and data made available under CC license.
(OVERALL SCORE) The paper presents three methods for reducing the number of comparisons necessary to find matching pairs of strings between to large sets of strings, where matching is defined as having a Levenshtein less than a threshold. The three methods are more efficient that those used in popular software providing this solution. The methods are more memory and time efficient in the provided data set.
SPs: 1) the problem is very well described 2) the literature review is satisfactory and it reuses theoretical results 3) the performance results could be significant in some scenarios 4) the code is made available in CC license
WPs: 1) Lemma 1 is wront 2) There is no evaluation of the accuracy of the method 3) The figures are not explanatory at all 4) There is no evaluation of the impact of the relevance, e.g. how significant is this improvement in the real world? 5) there is no mention of how the segment-based indices are created and what is their associated costas

Review 2 (by Guillermo Palma)

(RELEVANCE TO ESWC) Is a important idea behind the Linked Data paradigm interlink very large  data sources with time-efficient link discovery tools.
(NOVELTY OF THE PROPOSED SOLUTION) Using string similarities properties for improving the filtering step in methods of link discovery is not a novel idea.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The filtering rules used by the proposed methods are described, and two lemmas are presented and demonstrated.
(EVALUATION OF THE STATE-OF-THE-ART) The proposed approach is only compared with the blocking techniques used in a early version of LIMES and SILK. In the experimental evaluation state-of-the-art methods with other approaches based on string similarity techniques were not included.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) In the section 6 "Verification of candidate matching pairs", the following is presented: "we can improve the complexity to O((\tetha+1)*min(|s|,|t|) and we can reduce it even further by using prefix pruning." This sentence must be proven.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The Experimental Study only based on ARIADNE programme dataset, and focuses only on small/medium size strings. A broader evaluation that includes different types of datasets is needed.
(OVERALL SCORE) This paper present a lossless and time-efficient approach for the matching entities in large datasets using methods for measuring edit distance as a string similarity metric.
Strong Points (SPs)
1) This paper proposes three  time and space efficient approaches for the discovery of links between knowledge bases
2) The experimental study show that the proposed methods are more efficient than the string similarity methods used in a early version of LIMES and SILK
3) Adding a characters’ frequency matching filtering rule is a efficient method to filtering-out candidate pairs.
Weak Points (WPs)
1) A comparison with state-of-the-art approaches is needed.
2) Parameters that impact on the performance of the proposed methods are not discussed
3) The proposed approach is only evaluated with the ARIADNE dataset.
----- AFTER REBUTTAL PHASE -----
I would like to thank the authors for their responses. Many of my observations were answered.
I updated the scores of the sections:
* Evaluation of the State-of-the-Art:
* Demonstration and Discussion of the Properties of the Proposed Approach
* Reproducibility and Generality of the Experimental Study
* Overall score
* Overall evaluation

Review 3 (by Ignacio Traverso-Rebon)

(RELEVANCE TO ESWC) The authors propose approaches based on string similarity measures for the link discovery task, which a relevant and well-known task in our community. However, there is no intensive use of semantics nor graph techniques.
(NOVELTY OF THE PROPOSED SOLUTION) The approaches are used to filter link discovery candidates based on string similarity values. On one hand, no novel approach for discovering links is proposed, just to discard them based on a similarity measure and a threshold value. On the other hand, the similarity measures used in this paper do not consider any kind of semantics, i.e. only the lexical information is taken into account. Word2Vec may be an example of semantic similarity measure for comparing strings.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) While the computational complexity of the different methods is well detailed, the explanation of the methods is not easy to follow and should be supported by better quality figures. I would consider the SM method as a baseline, given that there is no novelty there. Only a similarity value is computed and based on a threshold the link discovery candidate is discarded or not.
(EVALUATION OF THE STATE-OF-THE-ART) The state-of-the-art includes both string similarity measures and partitioning algorithms. However, the definition of string similarity join problem is vague. When are two string similar based on a similarity measure? Which threshold should be applied?
Additionally, the authors include a section including preliminary knowledge like the metric properties and the formulation of the problem to be solved.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation is performed only in terms of time and memory. A comparison in terms of precision and recall is missing. In this way it is not possible to determine if the reduction of memory usage and execution time also implies a degradation of the quality of the predictions. Therefore, it is not possible make any claim in this direction.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation is well detailed and the used datasets together with the code of the approaches are available on a git repository. The proposed approaches are compared with the two main state-of-the-art approaches in terms of execution time and memory usage.
(OVERALL SCORE) 1) Summary of the paper
The paper proposes three different methods for improving the efficiency in terms of time of link discovery approaches. Link discovery approaches are divided into three processes: 1) preprocessing of the edges in the graph, 2) application of a partitioning algorithm in order to detect link candidates and 3) filtering and verification of candidates according to a user-defined set of criteria. The proposed approaches are applied during the filtering step, reducing the amount of links to be verified. All the approaches are based somehow on string similarity or edit distances. The first approach considers the partition structure together with string similarity values in order to prune the amount of candidates to be verified. The second method only considers string similarity values for filtering and the third one improves the latter by also considering character frequency matching.
Empirical results show that the proposed method outperform state-of-the-art approaches as SILK and LIMES reducing the execution time and the memory usage.
2) Positive Points
* The problem to be solved is well formulated and preliminary knowledge is included to facilitate the reading of the paper.
* The authors provide informative figures that facilitate the understanding of their challenges.
* The proposed approaches outperforms SILK and LIME in terms of memory usage and execution time independently on the length of the compared strings.
3) Negative Points
* The abstract and the introduction do not present clearly the contributions of the authors. The differences between the three presented method is not clear.
* The definition of string similarity join problem is vague. When are two string similar based on a similarity measure? Which threshold should be applied? (Section 2.3)
* Using “cluster”, “partition” and “blocks” in the same paragraph is confusing. Which are the differences among them?
* The evaluation is performed only in terms of time and memory. A comparison in terms of precision and recall is missing. In this way it is not possible to determine if the reduction of memory usage and execution time also implies a degradation of the quality of the predictions.
4) Questions to the authors
Q1) Why do you use string based similarity measures and no semantic measures as Word2Vec or Doc2Vec?
Q2) Why is not precision and recall also shown in the experimental results?

Metareview by Jorge Gracia

Link discovery is crucial in the Semantic Web area, where many string-based methods are extensively used. This method is an advancement in that direction. However, as pointed by some reviews, the evaluation does not report the performance in terms of precision and recall, which makes it difficult to assess the validity of the approach and its comparison to other methods. The authors claim that the method is lossless, but this is not demonstrated in the evaluation. In their reply letter, the authors adequately address some of the issues raised by the authors, but there is no clear justification of the lack of P/R evaluation.
Nevertheless, the approach is an interesting one and we would like to encourage the authors to submit their work as a poster or demo.

Share on