Using Link Features for Entity Clustering in Knowledge Graphs
Author(s): Alieh Saeedi, Eric Peukert, Erhard Rahm
Full text: submitted version
Abstract: Knowledge graphs holistically integrate information about entities from
multiple sources. A key step in the construction and maintenance of knowledge
graphs is the clustering of equivalent entities from different sources. Previous
approaches for such an entity clustering suffer from several problems, e.g., the
creation of overlapping clusters or the inclusion of several entities from the same
source within clusters. We therefore propose a new entity clustering algorithm
CLIP that can be applied both to create entity clusters and to repair entity clusters
determined with another clustering scheme. In contrast to previous approaches,
CLIP not only uses the similarity between entities for clustering but also fur-
ther features of entity links such as the so-called link strength and link degree. To
achieve a good scalability we provide a parallel implementation of CLIP based on
Apache Flink. Our evaluation for different datasets shows that the new approach
can achieve substantially higher cluster quality than previous approaches.
Keywords: Link; Link strength; Knowledge graph; Clustering; Overlap resolve; CLIP
Review 1 (by Dagmar Gromann)
(RELEVANCE TO ESWC) Both clustering and repair algorithm provide novel contributions to the field as does the thorough comparison and evaluation. (NOVELTY OF THE PROPOSED SOLUTION) Even though a similar approach had been proposed by authors of this paper ( in the references) the repair algorithm and the comparison as well as evaluation provide a novel contribution. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The presentation of the approach is very clear, technically sound, and excellently visually supported. (EVALUATION OF THE STATE-OF-THE-ART) The clustering and repair algorithm are thoroughly evaluated with a music, a geographical, and a voter registry dataset. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Both the clustering and the repair step achieve excellent results in comparison to six other approaches tested. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The description of the approach allows for a good reproducibility even if it would be highly appreciated if the authors realize their future intention of publishing the code used. (OVERALL SCORE) SUMMARY: To overcome the problem of overlapping clusters or the inclusion of several entities of the same source within a cluster, this paper proposes a new multi-source entity clustering algorithm with the ability to create as well as repair clusters. In contrast to previous approaches, this clustering method introduces the idea of link strength discrimination, that is, only considering mutually strong links between entities. A thorough evaluation of the clustering algorithm over three datasets and in comparison to six other clustering schemes provides convincing results. The same is true for the evaluation of the repair algorithm that seems to successfully repair overlapping/conflicting clusters resulting from the other six schemes previously tested by the same authors. The repair process mostly consists of ensuring that each entity pertains to exactly one cluster and each cluster may not contain more than one entity from each source. STRONG POINTS: 1) Interesting and novel approach that is well explained, evaluated and visualized 2) Very thorough evaluation across highly distinct data sets 3) Structure and presentation of this paper are very clear and easy to follow, which made it a pleasure to read. WEAK POINTS: 1) Some design decisions could be explained and discussed in more detail (see questions) 2) The template is not fully followed (see formatting) QUESTIONS: 1) How were the blocking keys determined? Were other blocking methods, such as multi-pass blocking, tested? If yes, what were the results thereof? Was the amount of wrong cluster assignments based on erroneous blocking determined? The overall evaluation seem to suggest it to be rather low. 2) Since the similarity value seems to be crucial to this approach of link prioritization, where different metrics tested? While geographical data comparisons might not benefit from a semantic similarity metric, names of people and music albums might. Formatting: The spacing after headings 5.2 and 5.3 seems to deviate from the template. There should not be any full stop or abbreviation marker after "Table", that is "Table. 2" should be "Table 2" at par. 2 under 5.3.
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) The topic is certainly relevant to the conference. It will be nevertheless of interest for researchers in the field. However, an effort should be made to convince about the appropriateness for the Semantic Web context. Clustering methods for Semantic Web representation are not taken into account. Especially those that may resort to features that are specific for the rich representation of KBs. (NOVELTY OF THE PROPOSED SOLUTION) It seems to me that the paper presents an interesting extension of a previous work  that is very much in the line of the standard practice of the sub-field. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The content seems technically sound. The initial claims are supported by the experiments. The experiments, however, only target clustering based methods. It would be more interesting to compare the methods to state of the art methods also those based on supervised learning approaches. Also pairwise ontology mapping algorithms may be taken into account. (EVALUATION OF THE STATE-OF-THE-ART) The state of the art section is short yet generally complete. This would be the place were related methods / technical details related to the related to the Semantic Web field, especially those involving/consuming specific knowledge features (e.g. similarity measures). (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) While the authors clearly explain the advantages of the proposed methods a discussion of the weaker points would be worth being added. The advantage observed with the usage of repair discussed at p. 11 is quite expected. The choice of thresholds could be presented proving that a principle method has been adopted. As a follow-up of previous work the paper makes a moderate advance to the state of the art. It addresses an important problem in a well-established direction. The implications of the specific context with respect to similar ones, the case of distributed databases, for example, are not delineated, if any. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) It would be possible to replicate the experiments provided that the synthesized datasets with the correct cluster assignments were made available. Code not available. (OVERALL SCORE) The paper focuses on the extension of entity resolution methods based on clustering. It features a parallel implementation of a graph clustering algorithm that is coupled with a repair method to tackle the cases of inconsistency of the links. An experiment proves the performance of the proposed extensions on 3 datasets over other cluster-based algorithms and the effect of the usage of the repair method. SPs 1. Interesting advances on unsupervised approach to the problem 2. Applicability of the repair method to clustering models obtained through other clustering approaches 3. A clearly written and well-organized paper: excellent usage of the figures WPs 1. Limited novelty | originality of the contribution: very much in the line of the previous work 2. Not convincing on the significance of the experiment that involves only clustering methods and its extent. 3. Not convincingly related to the Semantic Web context. Questions Q1 Have you thought of the inconsistency caused by matching instances within the same source? Q2 Clarify whether both crisp and fuzzy clustering algorithms are taken into account for the comparison Q3 the Cluster association degree seems to be based on an average link criterion: would it make sense to investigate defs. based on the single or complete link criteria? Q4 how is the "variable minimal similarity threshold θ" determined? [p. 11] Q5 how were the sim functions selected? Q6 have you tried to compare the approach to (un)supervised link discovery methods, like those in the references, which seem to lead to better performances (yet involving pairs of sources) ? Further Comments Good quality of writing all organization. Excellent usage of figures as examples Introduction should state precisely the new contribution wrt the previous work. Short yet generally complete related work section (should include works on clustering for SW KBs) Tricks to save space appear to be used that should be avoided in case of a revised version References to be improved (too succinct). Minor Comments The LNCS style should be enforced. In various sections: "Data Web" --> "Web of Data" p.4-5 Strong link: use \noindent (also for the following Normal and Weak link) --- AFTER REBUTTAL I'd like to thank the authors for their answers.
Review 3 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper is in the relevance of Semantic web community. There is a substantial amount of work done in entity clustering schemes. The authors propose CLIP algorithm for clustering the entities from different data sources. They also propose an algorithm RLIP to repair overlapping clusters generated from other clustering algorithms. (NOVELTY OF THE PROPOSED SOLUTION) The proposed approach is novel. The authors use link strength measure for generating clusters. The authors also propose RLIP algorithm which repair overlapping of the clusters that are generated from other clustering algorithm. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is well explained in the paper. Author clusters the similar entities based on the link strengths and also propose an algorithm that can repair source-inconsistent clusters. Section 4 explains all the concepts used in approach. Section 5 discusses experiment carried out showing recall, precision and F-measure. Evaluation for three datasets shows that the new approaches achieve very good cluster quality and outperform previous clustering schemes to a large degree. The RLIP repair approach could improve the quality for all considered clustering schemes and achieve comparable quality than applying CLIP alone, in some cases with even lower execution times. The parallel implementations for CLIP and RLIP achieved good speedup values thereby supporting scalability to larger datasets. (EVALUATION OF THE STATE-OF-THE-ART) The study of the relevant algorithms is presented very well and compares other approaches with precision, recall and F-Measure. They present fair comparison taking three different multi-sourced data set to evaluate the proposed approach with state-of-the-art algorithms. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is well supported by an extensive evaluation. The evaluation results are promising. All the concepts used in the approach are explained clearly in section 4 as well both the algorithm presented very well. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Pseudo code for the clustering and repairing the clustering are presented so it can be reproduced. Dataset to evaluate the approach is taken from real datasets like DBpedia, Geonames, Freebase, NYTimes, MusicBrainz, North-Carolina voter registry etc. Authors used 3 different sets of data produced from above mentioned sources. Experimental environment is clearly mentioned with all the parameters. (OVERALL SCORE) The Authors proposed a new method called CLIP to cluster matching entities from multiple sources as well as a repair method called RLIP to improve entity clusters determined by other clustering schemes. The experiment evaluation is very well carried out. Strong Points : - Problem statement is very clear. - Strong evaluation with six other algorithm - Clearly presented approach Weak Points: - Cluster quality depends on quality of link strength - Method is used for measuring cluster quality is not discussed - Expensive computation cost Questions to Authors : What method is used for measuring cluster quality?
Metareview by Hala Skaf
This paper describes novel, consolidated, and extensively tested work, according to reviewers. The approach is described in the paper with clarity. Some (mostly minor) improvements can be addressed in the camera-ready version. The appropriateness and relation to the Semantic Web should be further emphasised in the paper. The reviewers have not substantially changed their judgement based on the authors' reply letter.