Finding Missing Types in Linked Data using Structural Features of Entities
Author(s): Xiang Zhang, Siyao Pi
Full text: submitted version
Abstract: Many Linked Data are incomplete on the type information of entities, the lack of which is a barrier to the success of many Semantic Web tasks. Type inference can retrieves missing types by means of reasoning, but this approach may become invalid in noisy data. Data-driven type prediction became prevalent these years, which utilized features of massive typed entities to revise untyped entities. In this paper, we propose our approach of type prediction based on collective classification on untyped entities. We investigate three structural features of entities on their type indicativeness, including attributive features, neighboring features and latent features. We also study the effectiveness of prediction by a mash-up of various types of features in collective classification. Experiments on real-world Linked Data demonstrate that our approach is considerably effective in finding missing types.
Keywords: Linked Data; type prediction; collective classification
Review 1 (by Petar Ristoski)
(RELEVANCE TO ESWC) The paper is highly relevant for the conference, as it addresses an important task in the Semantic Web area, i.e., entity type prediction.
(NOVELTY OF THE PROPOSED SOLUTION) It is unclear what exactly the novelty and the contribution of the paper are. The authors use an existing algorithm, described in , to identify new entity types using entity feature vectors that consist of two parts. The novelty might lie in how the existing algorithm is altered to use the feature vectors, but the authors don't give a clear explanation of this.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is based on an existing algorithm; however, the authors fail to explain how exactly they modify or enhance the algorithm in order to use it for the given task.
(EVALUATION OF THE STATE-OF-THE-ART) Important related work is missing, e.g.:  (SDtype) , ,  etc. The related work section focuses on collective classification instead of type prediction. This makes it difficult to position the paper relative to the existing related work, and to identify the contributions and the novelty of the proposed approach. Furthermore, the first paragraph of the related work is a verbatim copy from .
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors don't provide a clear pipeline/architecture of the approach, and they barely explain Algorithm 1.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors don't use the datasets used in the related work, and there is no detailed information on how to reproduce the datasets and the features used in the evaluation. The evaluation of the approach is executed very poorly. The authors compare the approach to only one related approach, completely ignoring the rest. The evaluation setup is described very vaguely. Instead of performing cross-validation, the authors choose to use a 70/30 split, thus the results cannot be considered valid. Furthermore, the set of evaluation datasets could easily be extended to other LOD datasets, as done in the related work approaches.
(OVERALL SCORE) The paper describes a collective-classification-based approach for entity typing. While this might be an interesting idea, the authors didn't present it well in the paper.
SP: 1.
WP:
1. Missing important related work.
2. Poor evaluation of the approach.
3. Missing comparison to related approaches.
4. Poor presentation of the approach, i.e., it is not clear what the novelty of the approach is.
Detailed Review: The paper describes a collective-classification-based approach for entity typing. While this might be an interesting idea, the authors didn't present it well in the paper. To me it is unclear what exactly the novelty and the contribution of the paper are. The authors use an existing algorithm, described in , to identify new entity types using entity feature vectors that consist of two parts. The novelty might lie in how the existing algorithm is altered to use the feature vectors, but the authors don't give a clear explanation of this. The authors don't provide a clear pipeline/architecture of the approach, and they barely explain Algorithm 1.
The evaluation of the approach is executed very poorly. The authors compare the approach to only one related approach, completely ignoring the rest. The evaluation setup is described very vaguely. Instead of performing cross-validation, the authors choose to use a 70/30 split, thus the results cannot be considered valid. Furthermore, the set of evaluation datasets could easily be extended to other LOD datasets, as done in the related work approaches.
Important related work is missing, e.g.:  (SDtype) , ,  etc. The related work section focuses on collective classification instead of type prediction. This makes it difficult to position the paper relative to the existing related work, and to identify the contributions and the novelty of the proposed approach. Furthermore, the first paragraph of the related work is a verbatim copy from . Section 2 is a verbatim copy of . The authors should focus more on describing how they apply and extend the collective classification approach presented in .
The paper should be proofread by a native or otherwise proficient English speaker, as it contains many grammatical mistakes, e.g., frequently missing definite/indefinite articles, grammatical mismatches, mixed tenses and various typos. This makes the paper very difficult to read and follow. Just to mention a few:
- "Type inference can retrieves"
- "classification motives us"
The paper is not formatted in the LNCS style.
References:
Paulheim, Heiko, and Christian Bizer. "Type inference on noisy RDF data." International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013.
Oren, Eyal, Sebastian Gerke, and Stefan Decker. "Simple algorithms for predicate suggestions using similarity and co-occurrence." The Semantic Web: Research and Applications (2007): 160-174.
Ma, Yongtao, Thanh Tran, and Veli Bicer. "Typifier: Inferring the type semantics of structured data." Data Engineering (ICDE), 2013 IEEE 29th International Conference on. IEEE, 2013.
Melo, André, Heiko Paulheim, and Johanna Völker. "Type prediction in RDF knowledge bases using hierarchical multilabel classification." Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics. ACM, 2016.
Sen, P., G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. "Collective classification in network data." AI Magazine 29.3 (2008): 93.
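To make the cross-validation point in Review 1 concrete, the following is a minimal sketch of k-fold evaluation, which reports one accuracy per fold rather than the single number a 70/30 split yields. The toy labels, the majority-class stand-in classifier, and all function names are assumptions made for illustration; none of them come from the reviewed paper.

```python
import random
from collections import Counter
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n) into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def majority_baseline(train_labels, test_labels):
    """Hypothetical stand-in classifier: always predict the most common training type."""
    guess = Counter(train_labels).most_common(1)[0][0]
    return sum(label == guess for label in test_labels) / len(test_labels)


def cross_validate(labels, score_fn, k=10, seed=0):
    """Return one accuracy value per fold."""
    scores = []
    for train, test in k_fold_indices(len(labels), k, seed):
        scores.append(score_fn([labels[i] for i in train],
                               [labels[i] for i in test]))
    return scores


# Toy type labels, invented for illustration only.
labels = ["Person"] * 60 + ["Place"] * 30 + ["Work"] * 10
scores = cross_validate(labels, majority_baseline, k=10)
print(f"accuracy: {mean(scores):.3f} +/- {stdev(scores):.3f} over {len(scores)} folds")
```

Reporting the mean together with the spread across folds shows how much a result depends on the particular split, which a single held-out partition cannot reveal.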
Review 2 (by Paul Groth)
(RELEVANCE TO ESWC) Automatic typing of instances is an important task in building semantic web datasets.
(NOVELTY OF THE PROPOSED SOLUTION) The proposed solution does not appear to me to be novel. It seems to just apply the existing RESCAL approach to semantic web data, which has been done before. See, for example: D. Krompaß, M. Nickel and V. Tresp, "Large-scale factorization of type-constrained multi-relational data," 2014 International Conference on Data Science and Advanced Analytics (DSAA), Shanghai, 2014, pp. 18-24. doi: 10.1109/DSAA.2014.7058046
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution, as discussed previously, seems to be correct in terms of applying existing approaches. The solution could have been better contextualized.
(EVALUATION OF THE STATE-OF-THE-ART) The state of the art could have been discussed much more extensively. Collective classification has already been used in the field for both type and link prediction. See, for example: Jay Pujara, Hui Miao, Lise Getoor, William Cohen. "Knowledge Graph Identification." International Semantic Web Conference (ISWC), 2013. Getoor was cited several times in the paper, but this semantic web paper was not. The paper mentions SDType, an approach described by Paulheim and Bizer in: Paulheim, Heiko, and Christian Bizer. "Type inference on noisy rdf data." International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013. This paper was not cited. These are only some examples of the rich literature on type inference using ML for semantic web data that would have been good to summarize.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper should better describe the characteristics of the algorithm.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I found the sourcing of data unclear. Which version of DBpedia was used? The description of the algorithm seems to be just the description of the standard RESCAL setup? This could have used much more context and explanation. It's unclear whether SDType is a strong or state-of-the-art baseline competitor.
(OVERALL SCORE)
* Summary of the Paper
The paper describes several experiments with different classification algorithms for type inference using collective classification.
* Strong Points (SPs)
1) The problem is interesting to the community.
2) Collective classification as a method is relevant.
3) The experiments are a good first start.
* Weak Points (WPs)
1) Key related work is missing.
2) The evaluation does not compare to a strong baseline.
3) The paper did not use all the space available, even though further description is needed.
Minor Comments:
* "The carry out of most Semantic Web tasks rely on type information." -> "Most Semantic Web tasks rely on type information."
* "type information could be lost or unable to extract." -> "type information could be lost or be unable to be recovered."
* "Linked Data is usually huge in volume." -> reference?
* "The research of type inference usually relies on logical inference to get missing types." -> reference?
Review 3 (by Petya Osenova)
(RELEVANCE TO ESWC) The paper proposes a method for detecting missing types in linked data through the structural properties of entities. The authors rely on the so-called 'collective-classification-based approach'.
(NOVELTY OF THE PROPOSED SOLUTION) It seems that the authors use a recently emerged but known method (collective classification) for a new task (predicting the nature of missing types).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The three features in the collective classification are each presented in turn: attributive, relational and latent. The same holds for the iterative classification, entity embedding and the experiment section. However, more discussion might be added.
(EVALUATION OF THE STATE-OF-THE-ART) This section gives a good overview of how collective classification is used for other tasks and a brief mention of inference problems. At the same time, it might also add related work on semantic type detection.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The proposed approach seems suitable for the task of semantic type detection. However, no error analysis is provided.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I think that the work is reproducible, since it combines available algorithms/resources.
(OVERALL SCORE) The paper aims to prove that detecting the missing types in linked data is more successful when using entity features than when using statistical methods.
Strong Points:
==============
- the usage of collective classification in a linked data task
- rich experimental set-up
- reported results outperform statistical approaches
Weak Points:
============
- English phrasing needs polishing
- Introduction can be made more focused with respect to the paper topic
- analytical substance on the method and results can be added
Questions:
===========
- Is the proposed method language independent?
- Do the authors intend to try their method on other linked data sets, such as GeoNames and Freebase?
- Any thoughts on why the tensor is the worst feature indicator?
Review 4 (by Ana Roxin)
(RELEVANCE TO ESWC) Given its title, the article may appear to belong to the topic of "Data quality, validation and data trustworthiness", but it does not. The approach presented by the authors focuses on identifying levels of accuracy of different classifiers (e.g. naive Bayes, decision trees and kNN) as applied on top of some dataset extracted from DBpedia (how this was done is not explained by the authors). Thus, the article at hand has very little to do with Linked Data, as the data structure itself has not been taken into account when addressing the accuracy of the different classifiers considered (as the authors mention in their conclusion).
(NOVELTY OF THE PROPOSED SOLUTION) The issue of missing type information in Linked Data may relate to the general issue of Linked Data quality, validation and trustworthiness. This issue has been addressed by several publications in recent years, so it is not a new problem. The authors mention that data quality is an issue in the Linked Data context. They mention that several approaches addressing this issue were proposed, but these approaches are not cited. Moreover, the authors state that this is still a problem with Linked Data today, and they cite as a reference a report from 2005! A recent reference would be needed to justify the issue regarding the extraction of Linked Data from semi-structured or unstructured sources. For example, FRED is an online tool that produces RDF/OWL ontologies and Linked Data from natural language sentences - http://wit.istc.cnr.it/stlab-tools/fred. Given this, it is difficult to understand which issues the authors are pointing to when mentioning the integration of NLP texts as Linked Data.
Additional remarks:
- "Linked Data is usually huge in volume" - a reference is needed to support this claim
- "A manual gleaning…" - not only is this not feasible, it is also not logical in the context of Linked Data! On a specific linked dataset, one can use query languages such as SPARQL to identify the data present, and thus deduce data that could be missing. What about making use of the domains and ranges defined for properties?
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) As mentioned before, the approach outputs Tables 2 to 4 with some accuracy results of classification algorithms. There is no clear specification of how the Linked Data structure is exploited by those algorithms. There is no clear illustration of the result output by such an approach.
Additional remarks:
- no clear correlation or justification of why collective classification is pertinent to the Linked Data context
- "Collective classification […] has been studied in recent decades" - a reference is needed to support this statement. The only reference provided in the corresponding paragraph dates from 2001.
- No link analysis - no exploitation of properties such as owl:sameAs or owl:equivalentClass or owl:equivalentProperty
(EVALUATION OF THE STATE-OF-THE-ART) The "Related Work" section focuses on the collective classification of documents, with no specific concern for Linked Data. The authors mention that "collective classification has gained attention only in the past five to seven years", while the references cited go back 17 years. There is no reference to studies and approaches in this domain from recent years, such as:
- "Methods for Intrinsic Evaluation of Links in the Web of Data" from Cristina Sarasua, Steffen Staab, and Matthias Thimm,
- "Quality Assessment for Linked Data: A Survey" by Pascal Hitzler
- "Exploiting Source-Object Network to Resolve Object Conflicts in Linked Data" by Wenqiang Liu, Jun Liu, Haimeng Duan, Wei Hu, Bifan Wei,
- etc.
Additional comments:
- The references should all be presented in a uniform manner ( includes the year of publication after the name of the authors, whereas the year is listed at the end for other references [1-4])
- There's a mistake in reference  : "Jensen, D., and Neville, J. 2202."
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The results presented in Tables 2 to 4 give measures of accuracy of the different classifiers used for testing. Still, they can hardly be interpreted by the reader, who knows neither how these accuracy levels have been computed nor how the experiment has been conducted (what the exact inputs and outputs are).
Additional comments:
- Table 1 is presented on page 3, but commented on page 6
- What version of DBpedia has been used? What was the size of the dataset?
- "we removed part of the object property that occurs less frequently than n in the linked data structure diagram, due to the size of the data set." - what is "n" here? Also, don't you induce errors by doing so? There are lots of DBpedia properties that are widely used, but that don't carry any semantics.
- You don't take into account the semantic heterogeneity of DBpedia properties - birthPlace and BirthPlace mean the same thing but are two different properties
- Finally, it appears that your experiment was limited by the size of your server's RAM. Would this imply that your approach requires extensive computing resources? Have you tested it in other environments, e.g. the cloud?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments performed need to be more clearly specified, notably in terms of:
- dataset used
- methodology used - what do the authors mean by "we first verify the indications of the three types of features for type prediction and then verify the indicativeness of the different types of feature combinations"? How is this done?
- accuracy calculation method
- output of the system - does the system produce new links between instances? How is the accuracy of such links assessed?
Taking a real data example from the dataset used could help in better understanding the exact outcome of the approach described.
(OVERALL SCORE) Summary of the paper: The authors want to infer missing type information (e.g. an instance belonging to a class) based on 3 types of features an instance might have, e.g. its attributes, its neighbours and its latent features. To do so, they suggest using an automatic approach for predicting missing types of existing Linked Data. Still, in practice, they mainly apply three different classifiers to some DBpedia-extracted data (not characterized except in terms of number of instances) by varying the features implemented by each classifier. The nature of the data used for evaluation (meaning the structure of the data, the number of links, etc.) is not taken into account by this approach.
SPs: The approach has been tested against datasets, and the measures are included in the paper.
WPs:
- The reader has difficulty understanding to what extent this approach really exploits the underlying nature and structure of Linked Data.
- Most recent reference is from 2011 (only one in 23 references)
QAs: Most questions to the authors have already been listed in the comments above. Below, there's only a short summary.
- How does the approach at hand make use of the nature and structure of Linked Data?
- Is it possible to make use of the domains and ranges that could be defined for some properties?
- Have you considered using cloud-based environments for testing your approach?
- What is the exact methodology used for your experiments and what is their exact output?
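The question in Review 4 about property domains and ranges can be illustrated with a small sketch: entities without an rdf:type statement are listed, and candidate types are proposed from the rdfs:domain and rdfs:range of the properties they participate in. The schema, the triples, the `ex:` names and both helper functions are hypothetical and are not drawn from DBpedia or from the reviewed paper; all objects in the toy data are assumed to be entities rather than literals.

```python
RDF_TYPE = "rdf:type"

# Hypothetical schema: property -> (rdfs:domain, rdfs:range).
SCHEMA = {
    "ex:birthPlace": ("ex:Person", "ex:Place"),
    "ex:author": ("ex:Work", "ex:Person"),
}

# Hypothetical instance triples (subject, predicate, object).
TRIPLES = [
    ("ex:Alice", "ex:birthPlace", "ex:Paris"),
    ("ex:Paris", RDF_TYPE, "ex:Place"),
    ("ex:Novel1", "ex:author", "ex:Alice"),
]


def untyped_entities(triples):
    """Return entities that occur in non-type triples but have no rdf:type."""
    typed = {s for s, p, _ in triples if p == RDF_TYPE}
    mentioned = set()
    for s, p, o in triples:
        if p != RDF_TYPE:
            mentioned.update((s, o))
    return mentioned - typed


def infer_types(triples, schema):
    """Propose types: subjects get the property's domain, objects its range."""
    inferred = {}
    for s, p, o in triples:
        if p in schema:
            domain, rng = schema[p]
            inferred.setdefault(s, set()).add(domain)
            inferred.setdefault(o, set()).add(rng)
    return inferred


missing = untyped_entities(TRIPLES)
proposals = infer_types(TRIPLES, SCHEMA)
candidates = {entity: proposals.get(entity, set()) for entity in sorted(missing)}
print(candidates)
```

As the paper's abstract itself notes, reasoning of this kind may become invalid on noisy data, since a single wrongly used property propagates a wrong type; this is one reason such a rule would serve as a baseline to compare against rather than a replacement for learned prediction.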
Metareview by Hala Skaf
The paper describes a collective-classification-based approach for entity typing. Despite the interest of the idea, the reviewers agree that the authors did not present it well in the paper. In particular, there is little comparison with the state of the art (SoA), and the reported evaluation is limited. The novel aspects of the approach have not been sufficiently stressed. The authors did not provide a rebuttal.