Inferring new types on large datasets applying ontology class hierarchy classifiers - the DBpedia case
Author(s): Mariano Rico, Idafen Santana-Pérez, Pedro Pozo-Jiménez, Asunción Gómez-Pérez
Full text: submitted version
Abstract: Adding resource type information to resources belonging to large open knowledge graphs is a challenging task, especially for those generated collaboratively, such as DBpedia, which usually contain errors and noise produced during the transformation process from different data sources. This problem has gained attention in recent years, due to the importance of properly classifying resources in order to efficiently exploit the information provided by the dataset. In this work we explore how new classification models can be applied to this problem, relying on the information defined by the ontology class hierarchy. We have evaluated our approaches on DBpedia and compared them to the most relevant contributions available today. Our system, using a cascade of predictive models, is able to assign more than 1 million new types with higher precision and recall.
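The "cascade of predictive models" mentioned in the abstract refers to top-down classification over an ontology class hierarchy. The following is only a generic sketch of that idea, not the authors' implementation: the hierarchy, the keyword rules standing in for trained classifiers, and all class/property names are invented for illustration.

```python
# Generic sketch of top-down cascade classification over a class
# hierarchy. The hierarchy, "models", and features are hypothetical;
# in the paper each node's model would be a trained classifier
# (e.g. Random Forest) over incoming-property features.

CHILDREN = {
    "owl:Thing": ["Agent", "Place"],
    "Agent": ["Person", "Organisation"],
}

# One local "model" per non-leaf node; here simple keyword rules
# stand in for trained classifiers.
def agent_vs_place(features):
    return "Agent" if {"birthPlace", "employer"} & features else "Place"

def person_vs_org(features):
    return "Person" if "birthPlace" in features else "Organisation"

MODELS = {"owl:Thing": agent_vs_place, "Agent": person_vs_org}

def cascade_predict(features):
    """Walk the hierarchy from the root, applying the local model at
    each level, until a node without a model (a leaf) is reached."""
    node = "owl:Thing"
    path = [node]
    while node in MODELS:
        node = MODELS[node](features)
        path.append(node)
    return path  # full path from root to the predicted type

print(cascade_predict({"birthPlace", "spouse"}))
# → ['owl:Thing', 'Agent', 'Person']
```

A cascade of this kind commits to one branch per level, which is why the reviews below focus on how errors at upper levels and underpopulated classes affect the final leaf prediction.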
Keywords: DBpedia; machine learning; dataset; semantic web; linked data
Review 1 (by Stefano Faralli)
(RELEVANCE TO ESWC) The topic addressed in this work matches the list of relevant topics of the conference.
(NOVELTY OF THE PROPOSED SOLUTION) The novelty of this approach is limited to the application of advanced machine learning techniques to class hierarchy classifiers. Advanced machine learning techniques have been applied to this task by many other works.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The methodologies involved in this work are correctly applied. There are no flaws in the experimental setting.
(EVALUATION OF THE STATE-OF-THE-ART) This is the main issue of the paper. The topic has been widely studied in recent years. The comparative evaluation carried out by the authors should at least involve the benchmarks provided by the editions of the Open Knowledge Extraction challenge. In particular, the task of Class Induction and Entity Typing for Vocabulary and Knowledge Base Enrichment perfectly matches the experimental setup already proposed. References to, and comparison with, other taxonomy induction techniques are not included in this paper.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There are no flaws in the proposed machine learning approach.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The dataset used by the authors, as well as the source code implemented and used to produce the results of the experiments, are available for download from a link provided in the paper.
(OVERALL SCORE) In my opinion this work is not mature. As already remarked, there is a plethora of recent similar works not mentioned in this paper. The main issue is that existing benchmarks that perfectly match the proposed approach are not involved. In particular, I would like to direct the authors towards the task of "Class Induction and entity typing for Vocabulary and Knowledge Base" from the released benchmarks of the past editions of the Open Knowledge Extraction Challenge.
Additionally, I would suggest adding some references to other approaches (e.g. taxonomy induction, information extraction, ...) and a brief discussion of the differences (if any) with the presented work. In summary: even though the topic is highly relevant to the ESWC conference, I believe that this work requires some additional effort. Both the related work and the experimental setup could be improved by mentioning and comparing against more recent works.
Review 2 (by Vojtěch Svátek)
(RELEVANCE TO ESWC) Type inference for RDF knowledge bases is of utmost relevance and importance.
(NOVELTY OF THE PROPOSED SOLUTION) While the problem is known, it seems that the authors have experimented with novel variations of type prediction techniques.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The method appears sound. However, its description sometimes has to be read twice or thrice before getting the point, since the writing style is suboptimal.
(EVALUATION OF THE STATE-OF-THE-ART) The related research section contains a couple of relevant references but is far from complete for the KB type inference problem. Note especially the recent survey paper: H. Paulheim, "Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods", Semantic Web Journal, Volume 8, Number 3, 2017, http://www.semantic-web-journal.net/content/knowledge-graph-refinement-survey-approaches-and-evaluation-methods, which lists in its Sections 5.1.1 and 5.2.1 at least 15 approaches to resource type completion. On the other hand, some references, such as , do not seem to relate to types of resources in knowledge bases, and are thus of comparably lower relevance. It also seems to me that the final part of the related research section does not actually provide a comparison but rather elaborates on the features of the authors' own approach.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties are explained to some degree, although the text is not always clear. One thing I miss, for example: the different approaches use different ML methods. This is fine if justified by experimentally proven differences between the methods for the different approaches. The authors are however silent about that, so the choice looks haphazard. Also, while the paper is relatively well equipped with examples, I critically miss one in Section 2.4, to illustrate the difference between approaches 2 and 3.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The strong point of the experimental study is the number of dimensions considered. Weaker points are:
- There is only one competitor approach (in two variations) considered. Surprisingly, while the related research section refers to , it does not attempt to make an empirical comparison with it. Namely, the SLCN algorithm  is ML-based like the authors' approach, and it claims to outperform SDTypes on DBpedia by approx. 8%.
- There is only one dataset, DBpedia (again, in comparison to, e.g., ). As regards this second point, the evaluation could actually be extended (without leaving the realm of DBpedia) by reusing a large crowdsourced gold-standard dataset called LHD 2.0, see http://ner.vse.cz/datasets/linkedhypernyms/evaluation/. See also http://www.sciencedirect.com/science/article/pii/S1570826816300166.
(OVERALL SCORE)
Summary of the paper: A family of ML-based techniques for type inference in KBs, based on resource type occurrence in triple objects, and with special treatment of hierarchical levels. Evaluated on DBpedia with respect to a predecessor approach.
Strong points:
- Important problem
- High-dimensional experimental space
- Comparison with a competitor based on experiment replication
Weak points:
- Incomplete coverage of related research
- Suboptimal level of writing
- Experiment limited in coverage of competitors and datasets
Questions: Do the authors already address direct comparison with ?
Note: English should be improved in places. There are many typos and grammar errors, or at least minor infelicities. For example:
- "ingoing" should better be "incoming"
- "leafs" -> "leaves" (many times)
- "as much columns" -> "as many columns"
- "Nave Bayes" (several times)
- "dbo:SubMunicipalty"
- "Cervances" (oops, an offence to the Spanish national writer ;-))
- "being level 0 owl:Thing" -> "level 0 being owl:Thing"
- "In this example, this result in" -> "... results ..."
- "which in turns has been proved" -> "turn"
- "outpeform"
- "provdiing"
- "Taxonomy-based classification approaches..., has been also proposed" -> "have"
- "less triples" -> "fewer triples"
- "Figure" etc. should be capitalized
- "As shown in 2" -- you probably mean Table 2?
Some bibrefs are incomplete, e.g.,  or . Some bibrefs are bibtex-decapitalized, see, e.g., "rdf".
Sometimes the flow of arguments looks odd, e.g.:
- "These types are in different levels of the ontology class hierarchy but, for each level, there is only one type." Why "but"?
- "The prediction for approach 3 is done using the 11 models" ... "Instead of the theoretical 11 models, indeed we have used 9 models."
- "Therefore, here we show the results achieved in our reproduction (see table 1)." Actually, Table 1 is not very clear without some explanation. It is only explained, to some degree, some 4 pages later!
- "Notice that, as resources are randomly selected, a given resource can be in test1, but also in test10, as it has at least 10 'ingoing' properties. Therefore, for each approach, we have 3 training sets, each of them with 851k resources (i.e. 861 - 10k)." Why "Therefore"?
- "We can see in these tables that both, previous approaches and our approaches 2 and 3, achieve higher values than the naive approach 1." Actually, of the 24 F-measure comparisons in Table 3, in 11 cases one of the Approach 1 variants beats the SDTypes variants, so this is not really obvious there. (While for Approaches 2 and 3 vs. 1 it is.)
- "we have focused only in matching the existing list of types (the so named leaf measure criteria)." I would say that the use of the leaf-based measure (compared to measures over the whole path) is orthogonal to the use of a priori ground truth (compared to posterior ground truth, such as using a human oracle to confirm that Picasso was a Painter in the real world). The bracketed content thus looks odd.
Further comments:
- "In this way, we obtain 860,000 resources" -- exactly?
- "For instance, for resource Spain, we have two triples with Spain as object and property location..." This is confusing; it only holds in the shown snippet and not in the whole data.
- Fig. 1 should ideally also display the ontological hierarchy.
- "These binary models are aimed at solving the 'partial depth problem' explained in the related work section." The problem should be explained here, not many pages later!
- The thresholds >0, >9 and >24 in Table 2 are mathematically fine, but to be coherent with the text, >=1, >=10 and >=25 would be neater.
- Is it necessary to separate Tables 3 and 4? There is then redundancy in the SDTypes values.
- "precision and recall values that improve the previous approaches, with F-measure values between 12% and 39%." You probably mean "F-measure *improvement* between 12% and 39%"?
After the rebuttal
------------------
The authors made a reasonable effort to explain their choices, especially as regards minor issues. However, their response does not really make the paper stronger. Some points of the response are wrong or rather arguable.
"to our best knowledge, the LHD 2.0 has not been updated since DBpedia 2014" -- This is wrong. The most recent published LHD version is for DBpedia 2016-04 (http://downloads.dbpedia.org/2016-04/core-i18n/en/, files instance_types_lhd_dbo_en.ttl.bz2 and instance_types_lhd_ext_en.ttl.bz2); the newest DBpedia release is 2016-10. The latest version of LHD is clearly stated on the dataset webpage: http://ner.vse.cz/datasets/linkedhypernyms/. Maybe the authors refer to the version of the *ontology* used in the experiments reported in the LHD 2.0 paper? The dockerized LHD 2.0 framework on GitHub (https://github.com/KIZI/LinkedHypernymsDataset) allows an arbitrary version of the ontology to be passed.
"Our study concludes that minority classes are not properly represented when the number of resources used for training is too low in comparison with the whole dataset.
The 5-fold method used by Paulheim 2017 suffers from this effect, and his results are consistent, achieving a lower precision and recall. We are interested in a model that predicts accurately for the highest number of classes, not only the most populated, and that is why we selected the Paulheim 2014 work as our reference."
Such a justification of the choice of Paulheim 2014 as a base reference sounds too speculative. There should have been at least a small-scale evaluation or comparison that would justify the use of Paulheim 2014 over newer approaches such as Paulheim 2017 or LHD 2.0. This is especially the case when the authors' approach, as well as Paulheim 2017 and LHD 2.0, use hierarchical machine learning, while Paulheim 2014 does not. Actually, if the position of the class in the hierarchy is of interest, the authors should use hierarchical precision and recall measures, which are adopted in Paulheim 2017 and LHD 2.0. An open implementation of these measures for type evaluation is at https://github.com/kliegr/hierarchical_evaluation_measures/. Also, LHD 2.0 explicitly covers underpopulated classes in the ontology when the classification ontology is created (see Section 6.2 of the LHD 2.0 paper). That approach also addresses the reliability of the final type selection (assigning the one type with the best tradeoff between reliability and specificity), which is something that the submitted paper does not seem to discuss.
My evaluation thus remains the same (moderately positive, with some reservations).
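The hierarchical precision and recall measures the reviewer recommends compare the predicted and gold types after expanding each to its set of ancestors in the class hierarchy, so that predicting a sibling or ancestor of the correct class earns partial credit. A minimal sketch follows; the toy hierarchy is invented for illustration, and this is not the implementation linked above.

```python
# Minimal sketch of hierarchical precision/recall for type prediction.
# Each type is expanded to the set containing itself plus all its
# ancestors (excluding owl:Thing, which every resource trivially has),
# and set overlap is measured. The toy hierarchy is hypothetical.

PARENT = {"Person": "Agent", "Organisation": "Agent",
          "Agent": "owl:Thing", "Place": "owl:Thing"}

def ancestors(cls):
    """The class itself plus all its ancestors below owl:Thing."""
    out = set()
    while cls != "owl:Thing":
        out.add(cls)
        cls = PARENT[cls]
    return out

def hier_prf(predicted, gold):
    P, T = ancestors(predicted), ancestors(gold)
    hp = len(P & T) / len(P)   # hierarchical precision
    hr = len(P & T) / len(T)   # hierarchical recall
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

# Predicting the sibling class still shares the Agent ancestor:
print(hier_prf("Person", "Organisation"))  # → (0.5, 0.5, 0.5)
```

Under a leaf-only measure the sibling prediction above would count as a plain miss; the hierarchical measure credits the shared ancestor, which is why the choice of measure matters when comparing against Paulheim 2017 or LHD 2.0.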
Review 3 (by Andrea Giovanni Nuzzolese)
(RELEVANCE TO ESWC) The topic addressed by the paper is relevant to the conference.
(NOVELTY OF THE PROPOSED SOLUTION) The authors propose three solutions based on machine learning that leverage limited variations of well-known approaches for multi-class hierarchical classification. Accordingly, the novelty introduced is limited.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The solution proposed is well introduced in the paper. Nevertheless, many details are hidden in figures (e.g. Figure 1), which, in turn, are not properly discussed in the text.
(EVALUATION OF THE STATE-OF-THE-ART) The main reference cited by the authors is the work of Paulheim and Bizer. Nevertheless, the authors do not provide references to many related works. Those works are worth citing because they (i) are relevant and (ii) tackle the typing problem with different approaches that can be used for comparison. An example is  or the solutions presented at past editions of the Open Knowledge Extraction challenge [2,3]. More generally, the authors do not provide a fair comparison between related work and their solution. This does not allow a clear positioning of the proposed solution with respect to the state of the art.
Gangemi, A., Nuzzolese, A. G., Presutti, V., Draicchio, F., Musetti, A., & Ciancarini, P. (2012, November). Automatic typing of DBpedia entities. In International Semantic Web Conference (pp. 65-81). Springer, Berlin, Heidelberg.
 https://github.com/anuzzolese/oke-challenge
 https://github.com/anuzzolese/oke-challenge-2016
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Some details could be better discussed and explained. For example, the section that describes the second and third approaches omits lots of details that would, instead, be useful to better understand the solution.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The dataset and the source code are published. All the steps needed to reproduce the experiment are provided.
(OVERALL SCORE) The paper presents a solution for type prediction in large semantic datasets. The paper is well structured in all its parts.
*** PROS ***
- type prediction is a relevant topic
- the solution is well described
- the experiment is reproducible
- the evaluation is sound
*** CONS ***
- limited novelty of the solution proposed
- the state of the art omits relevant work and fails to clearly position the paper with respect to related work
- the evaluation does not use k-fold testing, which is typically used for assessing the effectiveness of solutions based on machine learning; additionally, the evaluation is limited to DBpedia and a single competitor
*** AFTER THE REBUTTAL ***
I thank the authors for their detailed response. Just a clarification concerning the OKE challenge: two editions of the OKE challenge (i.e. 2015 and 2016) had a task dedicated to "Class Induction and entity typing for Vocabulary and Knowledge Base enrichment". Accordingly, the authors cannot argue that the OKE challenge is out of scope. My scores and evaluation do not change after the rebuttal, as I do not find the paper mature enough to be published at ESWC.
Review 4 (by anonymous reviewer)
(RELEVANCE TO ESWC) Has relevance.
(NOVELTY OF THE PROPOSED SOLUTION) Minor novelty; it uses well-established machine learning approaches in a reasonable comparison to existing techniques.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Generally fine, though the evaluation is weak and could do with explicit examples and critique.
(EVALUATION OF THE STATE-OF-THE-ART) Would benefit from deeper discussion of other areas, and the ordering of the related work seems odd.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See review.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Good; data sets are online.
(OVERALL SCORE) Overall review: The paper describes the results of running several forms of classification approaches to identify types of entities in DBpedia. Three approaches are tested, with some variation within these approaches, and a quantitative comparison is performed with a prior evaluation published in 2014. The metrics based on leaves (measuring the most specific node only) seemed prudent, as the confusion of a true positive being reported as a false positive would undermine the evaluation. I would like to have seen Tables 3 and 4 grouped by measurement criteria rather than by complete leaf set. The current format makes it difficult to compare across all of the classifiers. For instance, for the General criterion the first table shows the Random Forest approach as highest in Test 1, but in reality the *overall* highest is Approach 2. I understand why the authors have grouped in this way (as it is a large table), but in essence these are different evaluations, so I would group by method for each test. I am not sure why Section 5, Related Work, comes so late. If it is a literature review it should be earlier, but the content is largely discussion-like material. I would reframe it as a discussion section, or, if the intention is really to provide background, move it earlier. (If this is an ESWC-mandated thing then I take the comment back and would ask ESWC why.)
My main criticism is that the evaluation is light on insights; it is really a black box of numbers. I accept the higher precision scores, but the reader really needs to see a few examples: some of the more interesting types that were good matches, and those that were not, would have been enlightening in the results section. I see it is performing better, but *why*? It is a very common issue when people publish ML work that they forget the why and only address the bare classification numbers. Was it actually just getting a lot of very simple type assertions en masse, hence the higher scores? I cannot tell from the paper. I took a look at the results, but realistically the paper should summarise this with a few interesting examples; I should not have to download lots of CSV files and look at a binary matrix of 670 columns to get an overview. All of that said, I thought it was an interesting paper and it was generally well written, though it was weakened by the evaluation's lack of insight into the results.
Metareview by Andreas Hotho
The authors present several combinations of standard classifiers applied to entity-type prediction. The reviewers' concerns mainly relate to:
1. The novelty of the approach, since it proposes a task-specific heuristic combination of standard classification algorithms instead of a novel model.
2. The evaluation, which does not seem clear concerning the comparison to the state of the art, the benchmark data used, and the presentation of results.
Statements in the authors' feedback appeared incorrect, and the concerns of the reviewers were not sufficiently addressed. The overall average assessment of the reviewers is weak reject (a category not available in EasyChair). We follow this judgement and recommend rejecting the work.