Author(s): Andreas Karampelas, George A. Vouros
Abstract: This paper proposes and evaluates time- and space-efficient methods for discovering links between matching entities in large data sets, using state-of-the-art methods for measuring edit distance as a string similarity metric. It proposes and compares three filtering methods that build on a basic blocking technique for organizing the target dataset, which facilitates efficient pruning of dissimilar pairs. The methods are compared in terms of runtime and memory usage. The first method exploits the blocking structure using the triangle inequality in conjunction with the substring matching criterion. The second method uses only the substring matching criterion, while the third method uses the substring matching criterion in conjunction with the frequency matching criterion. The evaluation compares the pruning power of the different criteria, and also compares the proposed methods with the string matching functionality provided by LIMES and SILK, which are state-of-the-art tools for large-scale link discovery.
Keywords: link discovery; edit distance; similarity metrics
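
The three pruning criteria named in the abstract can be illustrated with generic, textbook lower bounds on edit distance. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the pivot-based form of the triangle inequality test, the pigeonhole form of the substring test, and the character-count form of the frequency test are all illustrative choices, and the paper's exact definitions may differ.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def triangle_lower_bound(d_source_pivot: int, d_pivot_target: int) -> int:
    """Triangle inequality: |d(s, p) - d(p, t)| <= d(s, t) for any pivot p."""
    return abs(d_source_pivot - d_pivot_target)


def frequency_lower_bound(a: str, b: str) -> int:
    """Character-count bound: one edit changes each surplus by at most one."""
    ca, cb = Counter(a), Counter(b)
    surplus_a = sum((ca - cb).values())  # characters a has in excess of b
    surplus_b = sum((cb - ca).values())  # characters b has in excess of a
    return max(surplus_a, surplus_b)


def shares_piece(a: str, b: str, k: int) -> bool:
    """Pigeonhole test (assumed form of a substring criterion).

    If d(a, b) <= k, then at least one of k + 1 disjoint pieces of a
    must occur unchanged as a substring of b.
    """
    n = k + 1
    bounds = [len(a) * i // n for i in range(n + 1)]
    return any(a[bounds[i]:bounds[i + 1]] in b for i in range(n))


def may_match(source: str, target: str, k: int,
              d_source_pivot: int, d_pivot_target: int) -> bool:
    """Return False only when the pair provably has edit distance > k."""
    if triangle_lower_bound(d_source_pivot, d_pivot_target) > k:
        return False
    if frequency_lower_bound(source, target) > k:
        return False
    return shares_piece(source, target, k)
```

Each filter is conservative: it may keep a pair that later fails the exact edit distance check, but it never discards a pair within the threshold `k`. Pairs that pass `may_match` would still be verified with `edit_distance`; the distances to the pivot are passed in because, in a blocking scheme, they would typically be computed once per block rather than once per pair.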