# Paper 103 (Research track)

Detecting Erroneous Identity Links on the Web using Network Metrics

Author(s): Joe Raad, Wouter Beek, Frank Van Harmelen, Nathalie Pernelle, Fatiha Saïs

Full text: submitted version

Abstract: Although best practices for publishing Linked Data encourage the re-use of existing IRIs, we often find ourselves with several resources denoting the same thing, requiring the use of owl:sameAs to align different identifiers on the Web. However, several studies dating as far back as 2010 have observed multiple misuses of this equality link, resulting in many erroneous and even inconsistent statements. In this paper, we show how network metrics such as the community structure of the owl:sameAs graph can be used to detect these possibly erroneous statements. We have evaluated our methods on a subset of the LOD cloud, containing over 550M owl:sameAs statements.

Keywords: linked open data; identity; owl:sameAs; communities

Decision: reject

Review 1 (by Claudia d’Amato)

(RELEVANCE TO ESWC) The paper focuses on the problem of detecting erroneous sameAs links currently existing, e.g., in the LOD Cloud. The proposed solution is grounded in the exploitation of network metrics, specifically metrics for assessing/finding community structures. An experimental evaluation is provided to show the validity of the proposed solution.
Overall the paper is clear, focuses on a very important problem and certainly fits ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The paper mainly presents an application of state-of-the-art metrics and approaches to the problem of assessing which sameAs statements are true, given a collection of sameAs statements. As such, the paper shows a limited advance with respect to the state of the art but, at the same time, it presents an interesting research perspective in terms of the application/exploitation of existing solutions for solving a different problem.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution, mostly formalized in Algorithm 1, is simple but meaningful. The evaluation is clear and well presented. The main problem is that the evaluation is not very strong, because of the limited number of subjects and the limited number of datasets (with modest cardinality). It can be considered a kind of proof of concept showing the validity of the presented research direction, which is very interesting, rather than a strong experimental evaluation. As an added value, it must be said that almost all material used for the experiments is publicly available.
(EVALUATION OF THE STATE-OF-THE-ART) All important related works have been taken into account.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation is clear and well presented. The main problem is that the evaluation is not very strong, because of the limited number of subjects and the limited number of datasets (with modest cardinality). It can be considered a kind of proof of concept showing the validity of the presented research direction, which is very interesting, rather than a strong experimental evaluation. As an added value, it must be said that almost all material used for the experiments is publicly available.
Additional details are reported in the following.
In Definition 2, G should be replaced by N. If not, it would be preferable to explain the rationale for the definition, and particularly why G is considered rather than N.
In Algorithm 1, V_k is an input parameter. However, no discussion of how to compute the equality sets (which seem to be a crucial aspect of the presented approach) is reported. Additionally, k is a parameter referring to different subsets of V, yet in the algorithm only one subset is considered. All V_k, for varying values of k, should be reported.
As regards the evaluation, for the first case it would be useful to have at least one additional, larger sample for each considered dataset, with a definitely larger percentage of highly ranked sameAs statements. This would give an idea of the possible variation of the results when a larger dataset is considered. The discussion of the experimental results for the "Barack Obama equality set", and particularly items (II)-(IV), is less clear than the first case. As for item (III), it seems that a (non-trivial) mistake has been made in the automatic assessment. Additionally, when the judgment of the experts is taken into account, it would be appropriate to show the corresponding numbers. Considering the limited number of datasets taken into account, it would have been appropriate to report the analysis and discussion of the two cases with respect to the same metrics and/or baselines.
The experiments should also try to establish, and report a lesson learnt on, the threshold to be used as a cut-off for considering an owl:sameAs statement correct (or erroneous). This important aspect has been ignored.
The authors conclude by asserting that 558 million owl:sameAs statements have been considered, but this does not seem fully correct, since only a small subset of these statements has actually been taken into account in the experimental evaluation.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) As regards the evaluation, for the first case it would be useful to have at least one additional, larger sample for each considered dataset, with a definitely larger percentage of highly ranked sameAs statements. This would give an idea of the possible variation of the results when a larger dataset is considered. The discussion of the experimental results for the "Barack Obama equality set", and particularly items (II)-(IV), is less clear than the first case. As for item (III), it seems that a (non-trivial) mistake has been made in the automatic assessment. Additionally, when the judgment of the experts is taken into account, it would be appropriate to show the corresponding numbers. Considering the limited number of datasets taken into account, it would have been appropriate to report the analysis and discussion of the two cases with respect to the same metrics and/or baselines.
It would have been an "accept" rather than a "weak accept" if the two cases had been compared on the basis of the same metrics and baselines.
(OVERALL SCORE) The paper focuses on the problem of detecting erroneous sameAs links currently existing, e.g., in the LOD Cloud. The proposed solution is grounded in the exploitation of network metrics, specifically metrics for assessing/finding community structures. An experimental evaluation is provided to show the validity of the proposed solution.
Overall the paper is clear and focuses on a very important problem. The paper mainly presents an application of state-of-the-art metrics and approaches to the problem of assessing which sameAs statements are true, given a collection of sameAs statements. As such, the paper shows a limited advance with respect to the state of the art but, at the same time, it presents an interesting research perspective in terms of applying existing solutions to a new problem. The proposed solution, mostly formalized in Algorithm 1, is simple but meaningful. The evaluation is clear and well presented. The main problem is that the evaluation is not very strong, because of the limited number of subjects and the limited number of datasets (with modest cardinality). It can be considered a kind of proof of concept showing the validity of the presented research direction, which is very interesting, rather than a strong experimental evaluation. As an added value, it must be said that almost all material used for the experiments is publicly available.
Additional details are reported in the following.
In Definition 2, G should be replaced by N. If not, it would be preferable to explain the rationale for the definition, and particularly why G is considered rather than N.
In Algorithm 1, V_k is an input parameter. However, no discussion of how to compute the equality sets (which seem to be a crucial aspect of the presented approach) is reported. Additionally, k is a parameter referring to different subsets of V, yet in the algorithm only one subset is considered. All V_k, for varying values of k, should be reported.
As regards the evaluation, for the first case it would be useful to have at least one additional, larger sample for each considered dataset, with a definitely larger percentage of highly ranked sameAs statements. This would give an idea of the possible variation of the results when a larger dataset is considered. The discussion of the experimental results for the "Barack Obama equality set", and particularly items (II)-(IV), is less clear than the first case. As for item (III), it seems that a (non-trivial) mistake has been made in the automatic assessment. Additionally, when the judgment of the experts is taken into account, it would be appropriate to show the corresponding numbers. Considering the limited number of datasets taken into account, it would have been appropriate to report the analysis and discussion of the two cases with respect to the same metrics and/or baselines.
The experiments should also try to establish, and report a lesson learnt on, the threshold to be used as a cut-off for considering an owl:sameAs statement correct (or erroneous). This important aspect has been ignored.
The authors conclude by asserting that 558 million owl:sameAs statements have been considered, but this does not seem fully correct, since only a small subset of these statements has actually been taken into account in the experimental evaluation.
----- AFTER REBUTTAL PHASE -----
I would like to thank the authors for their response, and I really encourage them to improve their work and to pursue this interesting research direction.


Review 2 (by Axel-Cyrille Ngonga Ngomo)

(RELEVANCE TO ESWC) The paper targets the detection of erroneous owl:sameAs links and is hence clearly relevant to the conference.
(NOVELTY OF THE PROPOSED SOLUTION) The solution relies on the Louvain algorithm combined with a scoring function. Not very innovative but OK.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The formal model proposed appears to be flawed (see Summary). The definitions being partly incorrect leads me to believe that the paper does not really reflect the work done by the authors.
(EVALUATION OF THE STATE-OF-THE-ART) There is no comparison with the state of the art. A similar paper published last year is missing.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors do provide some insights into the properties of the model they develop.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code and datasets used are made available, which I commend the authors for. However, the content of the paper most probably doesn't match the code.
(OVERALL SCORE) Summary of the paper
The paper presents an approach based on network metrics to fix the broken owl:sameAs structure of the Linked Data Web. The authors promise an evaluation on 558 million links. The goal of the paper is stated as "aim[ing] to detect erroneous owl:sameAs statements by using network metrics". The authors employ existing community detection metrics to find clusters of links. Thereafter, they derive a score which enables them to rank every link in their dataset.
The related work section covers works on alternatives for owl:sameAs and erroneous owl:sameAs statements. A very related work (Valdestilhas et al., 2017) is not mentioned in the related work section. The authors of that work also rely on graph analysis to generate suggestions for erroneous owl:sameAs links from 19 million links.
In Section 3, the authors formalize the problem at hand. They aim to find a node partition of an input graph G. The formalization is partly incorrect w.r.t. the notation used. The second equation in Def. 1 should read C_i \cap C_j = \emptyset for i != j. As written, the definition states that C_i \cap C_i = \emptyset, i.e., every cluster would have to be empty, which defies the purpose. The authors motivate their use of the Louvain algorithm by relying on previous work.
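The corrected disjointness condition is easy to check mechanically. As an illustrative sketch (not the authors' code), a validity test for a partition could look like this in Python:

```python
# Check that a collection of clusters is a valid partition of a node set:
# all clusters non-empty, pairwise disjoint (C_i & C_j empty for i != j),
# and jointly covering all nodes.
from itertools import combinations

def is_partition(nodes, clusters):
    if any(len(c) == 0 for c in clusters):
        return False  # empty clusters are not allowed
    if any(c1 & c2 for c1, c2 in combinations(clusters, 2)):
        return False  # some pair of distinct clusters overlaps
    return set().union(*clusters) == set(nodes)

# Under the definition as written (C_i intersected with itself empty),
# every non-empty cluster would be rejected; the corrected condition
# behaves as expected:
print(is_partition({1, 2, 3}, [{1, 2}, {3}]))  # True
print(is_partition({1, 2, 3}, [{1}, {1, 3}]))  # False (clusters overlap)
```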
Section 4 presents the authors' approach. They suggest that the Louvain algorithm must be extended to deal with the data at hand. They first define identity networks, which basically represent the owl:sameAs structure of datasets. Then, they present equality sets, which are the transitive closure of identity networks. Again, the formal definition here (Definition 3) is incorrect. "Partitionings" (partitions?) P as modeled by the authors are not unique and are exponential in number. The text should thus read "a" (not "the") collection of non-empty and mutually disjoint subsets of V which abide by the authors' specifications.
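For context, equality sets as the transitive (and symmetric) closure of owl:sameAs are typically computed with a union-find (disjoint-set) structure rather than by materialising closure triples. A minimal sketch with hypothetical IRIs, not the authors' implementation:

```python
# Union-find over IRIs: each equality set is one disjoint-set component,
# so the closure of owl:sameAs never needs to be stored as explicit triples.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Hypothetical sameAs statements (illustrative IRIs only):
same_as = [("ex:obama", "dbr:Barack_Obama"),
           ("dbr:Barack_Obama", "wd:Q76"),
           ("ex:paris", "dbr:Paris")]
for s, o in same_as:
    union(s, o)

# Group terms by their root to obtain the equality sets.
equality_sets = defaultdict(set)
for term in parent:
    equality_sets[find(term)].add(term)
print(sorted(len(s) for s in equality_sets.values()))  # [2, 3]
```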
Definition 4 is incorrect and should read (a.o.) n_1 \in V_k, n_2 \in V_k. The incorrect notation makes it difficult to guess what the authors really mean. *If I understand them correctly* they mean to write
- \forall n_1, n_2 \in V_k: (n_1, n_2) \in E \wedge (n_2, n_1) \notin E \rightarrow w(n_1, n_2) = 1
- \forall n_1, n_2 \in V_k: (n_1, n_2) \in E \wedge (n_2, n_1) \in E \rightarrow w(n_1, n_2) = 2
The authors then detect communities in the previously defined partitions, and score the edges within and across communities for ranking.
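To make the described pipeline concrete (community detection first, then edge scoring), the Louvain step can be run on a toy weighted identity network with networkx. The graph, the weights, and the cross-community scoring rule below are hypothetical simplifications, not the authors' actual scoring function:

```python
import networkx as nx

# Toy weighted identity network for a single equality set (hypothetical IRIs).
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 2), ("b", "c", 2), ("a", "c", 1),  # first dense cluster
    ("d", "e", 2), ("e", "f", 2), ("d", "f", 1),  # second dense cluster
    ("c", "d", 1),                                # weak bridge between them
])

# Louvain community detection (available in networkx >= 2.8).
communities = nx.community.louvain_communities(G, weight="weight", seed=1)

# Simplistic stand-in score: links crossing communities are more suspect.
def crosses(u, v):
    return not any(u in c and v in c for c in communities)

suspect = [(u, v) for u, v in G.edges if crosses(u, v)]
print(suspect)  # the weak bridge edge is flagged
```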
The evaluation is carried out on a portion of LOD-a-lot. While the amount of data used is impressive and I commend the authors on this effort, the evaluation setup and results leave substantial room for improvement (see questions).
Overall, I cannot recommend this paper. The formal model seems flawed (some definitions seem incorrect), and I think (though I am not sure) the authors meant to state that they use unconnected subgraphs as a partition of their original graph and run Louvain on these subgraphs independently. This is, however, not what the paper says, and what is stated in the paper (Definitions 3 and 4 + Algorithm 1) cannot be implemented as such.
Strong points
+ Relevant problem
+ Good reuse of the state of the art in community detection
+ Open-source and open data
Weak points
- Formal model
- Evaluation
Questions
1- How does your approach relate to CEDAL?
2- The computation of transitive closures can be expensive, and previous works suggest that it is not time-efficient to do so (see, e.g., (Valdestilhas et al., 2017)). How expensive was this computation in time and space? Can your approach do without it?
3- It is unclear why the authors can apply the Louvain algorithm (a global community detection algorithm) locally and achieve the same results. I strongly suggest that the authors justify clearly why they can do that.
4- What is a closed Explicit Identity Network (EIN)? An EIN after the application of the transitive closure?
5- There is no standard definition of a regular laptop that I am aware of. Could you please state exactly which hardware you used?
6- When are terms "more or less identical"? Can you please quantify?
7- Selection of the examples:
7a) It is unclear how the authors sampled the 67 negative and 33 positive links. They state that this was done by score, but did they simply select the top-ranking links, or did they select a top portion (say, the top-n links) and sample randomly therein?
7b) Why the class imbalance?
7c) It is unclear how representative this selection of links is. What was the score distribution and why did you not sample across the whole distribution?
7d) Each link was annotated by one person, a scheme which is prone to errors. Normally, links should be annotated by at least two persons to ensure that there is no annotator bias in the data.
8- Interpretation of the results
8a) The authors claim a high precision. I would have expected precision@k scores across the whole distribution. In general, taking the ends of the distribution always leads to very good results in any classification task, even for poor classifiers (which is basically what the authors are building), as those are the items with the smallest entropy. More interesting and valuable are results such as ROC curves, AUC, or similar measures.
8b) The authors did not compare with other approaches. How well do they perform in comparison to (Valdestilhas et al., 2017) or other works of this kind?
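The ROC/AUC evaluation suggested in point 8a is straightforward to produce with scikit-learn. The sketch below uses made-up gold labels and link scores, purely to illustrate the measure:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical gold labels (1 = correct sameAs link) and ranking scores.
y_true  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.95, 0.90, 0.85, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.10]

# AUC summarises ranking quality across the whole score distribution,
# unlike precision measured only at the extremes of the ranking.
auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(round(auc, 3))  # → 0.84
```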
CEDAL: Time-efficient Detection of Erroneous Links in Large-scale Link Repositories by Andre Valdestilhas, Tommaso Soru, and Axel-Cyrille Ngonga Ngomo in Proceedings of the International Conference on Web Intelligence, 2017
--
I do thank the authors for their rebuttal. I suggest that the authors continue with their work and fix the formal glitches to make sure that the paper is formally correct. Moreover, a comparison with other approaches should help strengthen the contribution of the paper.

Review 3 (by Caterina Caracciolo)

(RELEVANCE TO ESWC) The authors address the problem of identifying wrong sameAs links in datasets.
The approach taken is based on the idea that identical entities will be found in the same "community". Therefore, they compute a "local explicit identity network" in the sameAs graph and identify communities within it. The presence of intra- and inter-community links provides evidence to assess whether the sameAs links are wrongly or rightly stated. The adopted approach does not use specific information about the nodes in the graph, only structural information of the graph itself. According to the authors, this approach is unique.
The problem of identities and the statement of equivalence of identities is very relevant to the semantic web.
(NOVELTY OF THE PROPOSED SOLUTION) The solution proposed is based on known techniques, but combined in a novel way.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors propose a reasonable heuristic to assess the degree of suspected erroneousness of sameAs statements. The proposed approach is clear and well explained in its general lines. Glitches in the definitions and the description of the algorithms will be fixed as per the authors' response.
The experiment conducted represents an integral part of the proposed solution, but it is reported and analyzed in a rather episodic way, by means of a generic inspection of the largest equality set and a more detailed description of a much smaller, tractable set. Moreover, the human evaluation, although done on the basis of intuitively acceptable assumptions (e.g., the selection of the top few "best" and "worst" links), calls for a more formal definition of the goal of the evaluation itself: apart from confirming that the proposed approach does not produce false positives/negatives when dealing with extreme cases, what is its behavior in the mid-range ranks?
(EVALUATION OF THE STATE-OF-THE-ART) It is appreciated that the SoA review also includes the alternative approach of using properties other than sameAs to state the identity of entities.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The choice of selecting only one equality set for detailed analysis and evaluation makes the evaluation of the proposed approach episodic. Also, some of the details provided do not seem to contribute strictly to the understanding of the benefits/criticalities of the proposed solution. (For example, it is unclear what contribution Fig. 2 brings to the discussion.) No attempt is made to propose or discuss other approaches to the presentation of the experiment's results.
The evaluation of the experiment is done on a very small set of cases, identified according to a rather ad-hoc methodology. The results of the experiments are described more than discussed, and the final section of the paper recaps a few aspects of the proposed approach and experiment, rather than putting them in the context of a discussion and drawing research conclusions.
A few points are made in a rather cursory manner that would have deserved more attention: for example, identities established within the same namespace, or across languages. In this respect, a more detailed description of the composition of the dataset of origin might have helped in understanding the added value of the approach proposed by the authors.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Although I have not tried to reproduce the experiments, this seems to be possible.
(OVERALL SCORE) The work described addresses an important issue for the semantic web, namely the attribution of identities / equivalence of entities. The main contribution of the paper is to explore an approach to identifying wrong sameAs links based uniquely on graph topology. The results presented seem encouraging, although the sample used for assessment seems rather limited. Strong points include the use of known, already-evaluated techniques combined in a way that is also computationally feasible, and the use of only structural features of graphs. Weak points: the apparently massive experiment is described and then evaluated in a rather episodic way; the limited discussion of those aspects; and the fact that most sameAs links hold between Wikipedia and DBpedia, which raises questions on the composition of the starting dataset and on how this approach would work with other datasets, where other types of identity statements are used.

Metareview by Jorge Gracia

The paper describes a graph-based approach to detecting erroneous sameAs links. Although the paper focuses on a very important problem and proposes interesting research perspectives, according to the reviewers the reported evaluation is limited: it is based on only two equality sets, and is thus illustrative (the analysis of the Barack Obama case is very interesting) but not conclusive. The threshold selection (to distinguish between the best/worst-ranked links) is not discussed. Also, some flaws in the formalisation have been detected. The reviewers did not substantially change their judgement based on the authors' reply letter.
Nevertheless, the approach is an interesting one and we would like to encourage the authors to submit their work as a poster or demo.


## One thought to “Paper 103 (Research track)”

1. I think this paper is somewhat related to your work:
André Valdestilhas, Tommaso Soru, and Axel-Cyrille Ngonga Ngomo. 2017. CEDAL: time-efficient detection of erroneous links in large-scale link repositories. In Proceedings of the International Conference on Web Intelligence (WI ’17). ACM, New York, NY, USA, 106-113. DOI: https://doi.org/10.1145/3106426.3106497

Cheers,