# Paper 30 (Research track)

Addressing Big Data Variety via Type Information Identification

Author(s): Ruth Frimpong, Matt Selway, Wolfgang Mayer, Markus Stumptner

Full text: submitted version

Abstract: There are lots of Linked Data (Big Data) out there but they use lots of different ontologies to describe the data. The ontologies sometimes are relatively coarse grained taxonomies. That is, either the type to which an entity belongs to is not known or the type is too general to be used for ontology/schema matching. For effective matching, you may need to recover more fine grained structure. In this work, we propose to deal with coarse type information and the granularity variety challenge of Big Data. We present it as a clustering problem and discuss the features required to obtain useful solutions. Since Big Data has more instance features with relatively less schemas, we present a novel clustering algorithm (ExTypifier) which is an extended version of TYPifier that addresses the above problems by inferring fine grain type information from data instances to bring the entity type to the same level of granularity. Furthermore, we present the experimental results which show the effectiveness of ExTypifier in addressing the granularity problem and demonstrates improved results over the original TYPifier algorithm.

Keywords: Big Data; Granularity Variety; Entity Type; Hierarchical Clustering; Ontology Matching; Type Information

Decision: reject

Review 1 (by Petar Ristoski)

(RELEVANCE TO ESWC) The paper is highly relevant for the conference, as it addresses an important task in the Semantic Web area, i.e., entity type prediction.
(NOVELTY OF THE PROPOSED SOLUTION) The evaluation is in favor of the proposed approach; however, I cannot identify the theoretical grounds for such improvements. The authors list the extensions over TYPifier in Section 3, but there is no detailed explanation of why these modifications lead to such improvements. The improvements seem very trivial as described, and one would not expect them to have such a big impact. The authors should explicitly point out which change makes their approach superior to the original approach, and explain why and how. Algorithm 2 seems to be the same as the original algorithm.
Furthermore, the authors should make a clear separation between their work and the work presented in TYPifier. Some of the claims should be toned down, as most of the work has already been presented in the TYPifier paper.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors propose a list of changes to an existing algorithm, but fail to describe clearly how or why these changes lead to a significant improvement.
(EVALUATION OF THE STATE-OF-THE-ART) Important related work is missing, e.g.: [1, 2, 3, 4, 5] etc. The related work section focuses mostly on clustering instead of type prediction. This makes it difficult to position the paper compared to the existing related work, and identify the contributions and the novelty of the proposed approach. The authors should compare the result of their approach to these, and other approaches, or at least discuss the differences.
[1] Paulheim, H., & Bizer, C. (2013). Type inference on noisy RDF data. In International Semantic Web Conference. Springer, Berlin, Heidelberg.
[2] Oren, E., Gerke, S., & Decker, S. (2007). Simple algorithms for predicate suggestions using similarity and co-occurrence. In The Semantic Web: Research and Applications (pp. 160-174).
[3] Melo, A., Paulheim, H., & Völker, J. (2016). Type prediction in RDF knowledge bases using hierarchical multilabel classification. In Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics. ACM.
[4] Ristoski, P., Faralli, S., Ponzetto, S. P., & Paulheim, H. (2017). Large-scale taxonomy induction using entity and word embeddings. In Proceedings of the International Conference on Web Intelligence (pp. 81-87). ACM.
[5] Kejriwal, M., & Szekely, P. (2017). Scalable Generation of Type Embeddings Using the ABox. Open Journal of Semantic Web (OJSW), 4(1), 20-34.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is very well motivated and the goal is clear. However, the authors do not clarify several properties of their approach: (i) the theoretical grounds for the improvements over the initial algorithm are not given; (ii) there is no discussion of memory efficiency, although it is mentioned as one of the contributions of the paper; (iii) there is nothing particular about the proposed approach that can tackle the issues of processing Big Data.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors use standard datasets that are used in the related work. However, there is no discussion about the availability of the tool/code.
(OVERALL SCORE) The paper presents an approach for fine-grained entity typing, named ExTypifier. The approach is an extension of an existing approach, named TYPifier, which uses top-down hierarchical clustering to identify entity types. The authors propose several modifications, which lead to improved results.
SP:
1. Defined a list of possible improvements over Typifier.
WP:
1. Difficult to identify the contributions of the paper.
2. Missing important details to explain the superiority of the approach.
3. Claims that the approach is memory efficient and scales to Big Data, but never proven.
4. Missing important related work.
Detailed Review:
The paper presents an approach for fine-grained entity typing, named ExTypifier. The approach is an extension of an existing approach, named TYPifier, which uses top-down hierarchical clustering to identify entity types. The authors propose several modifications, which lead to improved results.
The evaluation is in favor of the proposed approach; however, I cannot identify the theoretical grounds for such improvements. The authors list the extensions over TYPifier in Section 3, but there is no detailed explanation of why these modifications lead to such improvements. The improvements seem very trivial as described, and one would not expect them to have such a big impact. The authors should explicitly point out which change makes their approach superior to the original approach, and explain why and how. Algorithm 2 seems to be the same as the original algorithm.
Furthermore, the authors should make a clear separation between their work and the work presented in TYPifier. Some of the claims should be toned down, as most of the work has already been presented in the TYPifier paper.
The authors claim that one of the main contributions is the lower memory complexity; however, this is neither described further nor assessed in the evaluation section.
Throughout the paper, the authors refer to Linked Data as Big Data. Linked Data (the whole Linked Data Cloud) is not Big Data. Therefore, I advise the authors to remove the buzzword "Big Data" from the paper and the title, not only because Linked Data is not Big Data, but also because there is nothing particular about the proposed approach that can tackle the issues of processing Big Data. Also, the authors claim that one of the main contributions of the paper is that the approach can address the volume challenge of Big Data. However, that is not shown in the paper: the biggest dataset used for evaluation has 4.5M triples, which cannot be considered Big Data, and the evaluation shows that the TYPifier approach is more efficient.
Important related work is missing, e.g.: [1, 2, 3, 4, 5] etc. The related work section focuses mostly on clustering instead of type prediction. This makes it difficult to position the paper compared to the existing related work, and identify the contributions and the novelty of the proposed approach. The authors should compare the result of their approach to these, and other approaches, or at least discuss the differences.
- The authors can omit some of the details about RDF in the introduction, and shorten or remove the ontology matching example. Instead, the space can be used to extend the related work and to describe the differences from TYPifier in more detail.
- Definition 1 is not a formal definition, thus it should be altered or removed.


Review 2 (by Steffen Staab)

(RELEVANCE TO ESWC) Generally, the paper does pick up on existing Semantic Web challenges, such as missing schema information. However, it fails to deliver on them, as its contributions lack substance.
(NOVELTY OF THE PROPOSED SOLUTION) Although the paper claims to present a novel solution to the problem of type hierarchy induction, it does not. The paper adds only minor adaptations to work previously presented by Ma, Tran and Bicer (2013). Further, the reasoning behind these adaptations is not substantial (and/or not laid out well enough).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution only comprises minor changes (optimizations?) of an existing solution that indeed is correct and complete.
(EVALUATION OF THE STATE-OF-THE-ART) The paper fails to present the state of the art in related areas. It references work on hierarchical clustering techniques only indirectly, by pointing to survey papers. Thus, a critical discussion is missing.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper generally lacks a clear discussion and justification of its goals and contribution(s). For example, the reasoning for the stricter cluster definition remains vague.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation may be sound, but I cannot be sure, as the underlying measures are not defined clearly enough. E.g. in definition 4: how does T_i relate to the clusters?
Further, to my understanding, with the stricter definition of clusters, we should expect to see more complex hierarchies as output. This may principally result in better Type Entropy, Recall, Precision, F measurements. Also, is this all we care for? It depends on the actual goal, which is not communicated clearly.
(OVERALL SCORE) The paper addresses the problem of missing or too vague schema information present in Semantic Web data. It refines an existing algorithm, called Typifier, that automatically induces type hierarchies and assigns previously untyped entities to appropriate type clusters. Results seem to show slight improvements in clustering quality.
Main contributions are:
- redefinition of the type clusters discussed in Ma et al., 2013.
- other claimed contributions cannot be verified as such.
SP:
1. Addresses open problem in Semantic Web.
2. The redefined notion of what constitutes a (type) cluster is plausible.
WP:
1. It is unclear what the actual goal is. A formal problem specification is missing.
2. Consequently, the validity of the evaluation suffers.
3. Examples are missing that would help to understand the ideas.
4. The paper does not communicate well what its contributions are and what it merely borrows from the referenced paper.
5. The only real contribution is a plausible redefinition of what constitutes "clusters"; the other claimed contributions seem to be no contributions at all.
6. Fails to motivate and justify the contribution it brings.
7. Related Work is not presented clearly. Individual works are not presented or discussed critically.
8. The authors point out that they are expanding on existing work. Still, the way the paper borrows from Ma et al., both in structure and content, is irritating.

Review 3 (by anonymous reviewer)

(RELEVANCE TO ESWC) The typification task and the consideration of levels of granularity in types assigned to instances is relevant for several application scenarios (e.g., ontology matching).
(NOVELTY OF THE PROPOSED SOLUTION) The authors change a previous algorithm. They modify the parameters of some features, add some extended conditions for merging the clusters, and apply TF-IDF to filter the data in order to achieve better results. The contribution of the paper is therefore somewhat limited, although I recognize the improvements in the experimental results.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach achieves good performance in terms of quality of typification, but its scalability has not been evaluated on Big Data.
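For readers unfamiliar with the filtering step mentioned above, the following is a minimal sketch of TF-IDF-based feature filtering over entity features. It is not the authors' implementation: the function name `tfidf_filter`, the threshold value, the toy data, and the choice to keep features at or above the threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_filter(entity_features, threshold):
    """Keep, per entity, only the features whose TF-IDF weight reaches `threshold`.

    `entity_features` maps an entity id to the list of features observed for it.
    The direction of the filter (keep high-weight features) is an assumption.
    """
    n_entities = len(entity_features)
    # Document frequency: number of entities in which each feature occurs.
    df = Counter()
    for feats in entity_features.values():
        df.update(set(feats))

    filtered = {}
    for entity, feats in entity_features.items():
        counts = Counter(feats)
        total = len(feats)
        kept = set()
        for feat, count in counts.items():
            tf = count / total                      # term frequency within the entity
            idf = math.log(n_entities / df[feat])   # 0 when every entity has the feature
            if tf * idf >= threshold:
                kept.add(feat)
        filtered[entity] = kept
    return filtered


# Hypothetical toy data, not taken from the paper.
entities = {
    "e1": ["name", "birthDate", "team"],
    "e2": ["name", "birthDate", "instrument"],
    "e3": ["name", "population"],
}
result = tfidf_filter(entities, threshold=0.1)
```

In this toy example the feature `name` occurs in every entity, so its IDF is zero and it is filtered out everywhere, while rarer features such as `team` survive.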
(EVALUATION OF THE STATE-OF-THE-ART) Authors should cite other contributions in the field and compare their work to these contributions. See for example these works and references therein:
Type inference on noisy RDF data (Heiko Paulheim and Christian Bizer), etc.
Structure Inference for Linked Data Sources Using Clustering (Klitos Christodoulou, Norman W. Paton, and Alvaro A.A. Fernandes)
Leveraging Data and Structure in Ontology Integration (Octavian Udrea, Lise Getoor, Renée J. Miller)
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The overall idea implemented in the extension of Typifier is quite clear. However, details are not always well explained, as argued further in the detailed comments.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Authors do not release code of their algorithm for future comparisons.
(OVERALL SCORE) SUMMARY
SPs
The proposed approach improves the results of a previous approach significantly; in particular, the ontology built by ExTypifier is more similar to the ground truth ontologies.
These results are obtained without significantly increasing the time needed for typification.
WPs
The novel contribution compared to a previous system is rather limited; in particular it consists of
- pre-filtering features using TF-IDF
- defining clusters in a different way from Typifier (although a clear comparison between the two definitions is not discussed: in what sense are clusters defined more strictly?)
- conditions for merging/adding clusters as children
The paper makes a strong claim about addressing Big Data, but only samples are considered in the experiments (even if it may be difficult to evaluate P/R/F on the entire BTC, it would have been nice to know how the algorithm performs on such a big data set).
Several concepts appear in the paper without being previously defined or without an adequate discussion (examples of undefined concepts: V in Def 2; \epsilon in Section 3; pop() and S_root [vs. S*_root] in Algorithm 2. Examples of required clarifications: math in Def. 1, f \in F(n \in N_E); similarity between clusters is not symmetric - is it a similarity, or would it be better to use a different term?).
Related work does not consider important work about type inference for RDF data.
Code of the algorithm seems not to be available, thus making experiments not reproducible.
QAs
How do you split clusters into those that are needed for branching from the root and those needed for its siblings?
Why is root defined as a triple (F_root, N_root, S_root)?
What is the difference between S*_root and S_root? If the S*_root is the child of the current root and the S_root is the root, how is it possible that S*_root is not empty (while block line 12) and S_root can be empty (line 17 inside the while block?).
How is S_root assigned in line 24 of Algorithm 2? What does pop mean in line 15 of Algorithm 2? I suggest that the authors explain the algorithm with a representative example (see, e.g., the Typifier paper).
What is the ground truth for the hierarchical clusters? The ontology? It is not explicitly mentioned in the paper.
Moreover, it would be interesting to see an analysis of the errors that the algorithm makes. Are they all about instances that have two or more types? Is there any misclassification for any type? (I guess yes, as the F-measure is not 1.) Why does this happen?
I would suggest that the authors modify the title of the paper, as the phrase "type information identification" suggests that the paper is about identifying the type of instances that lack one. Moreover, the paper does not address Big Data variety, as none of the experiments is run on Big Data; they use only samples of the selected datasets.
The authors define concepts that are well known (P/R/F, TF-IDF), while they could have used this space to better describe the algorithm, the differences between Typifier and ExTypifier, and the ground truth (of clusters) used in the experiments.
I found the footnotes misused. The point of using footnotes is when you want to make a comment that is not directly related to the argument of the sentences / section / or paragraphs, while I found all the footnotes relevant to your argument.
Pay attention to the order in which concepts are introduced, and do not use undefined symbols (e.g., F in Definition 1 is defined later; V in Definition 2 is not defined. V is also used in the hierarchical clustering algorithm but is not explained).
Add a reference for the recursiveFormTree procedure.
I suggest rewriting some parts of the paper to give it a more natural flow. For example, I found it difficult to read Table 1 before Definition 3 is introduced. While reading the motivating example, I was expecting to see the algorithm evaluated on an ontology matching task rather than on hierarchical clustering.


Review 4 (by Wouter Beek)

(RELEVANCE TO ESWC) The automatic derivation of semantic knowledge from heterogeneous knowledge sources is an important topic in Semantic Web research.  In addition, the focus on Big Data is important in order to make the move from research over one or two datasets to research over the Semantic Web.
(NOVELTY OF THE PROPOSED SOLUTION) The paper presents an improvement over a specific existing approach (TYPifier).  The authors are very explicit about the improvements they have made (at the start of Section 3).  In the end, the improvement of the conceptual validity of the generated type hierarchy is IMO the strongest contribution.  Other contributions, like performance improvements and/or applicability to Big Data variety, are not sufficiently proven in this paper.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Since one of the contributions of the paper is a redefinition of what a cluster is, it was a little bit of a disappointment to me that the definitions are not very clear.
For instance, in Definition 2, is it really a requirement that all features are defined for all entities?  Since heterogeneity/variety of Big Data was explicitly mentioned early on in the paper, I was assuming that entities would (most of the time) only have some of the schema features.
I like the added requirement that “characteristics of a type hierarchy” be preserved, however, I do not see how this requirement follows from Definition 2.  I was expecting something like: if C << D, then \forall_{c \in C, d \in D}(F(c) \subseteq F(d)).
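To make the expected condition concrete, here is a minimal sketch that checks it on toy data. It implements only the formula stated above; the function name `preserves_hierarchy`, the representation of F as a dictionary from entity to feature set, and the example entities are assumptions made for illustration and are not taken from the paper.

```python
def preserves_hierarchy(cluster_c, cluster_d, features):
    """Return True if F(c) is a subset of F(d) for every c in C and every d in D."""
    return all(
        features[c] <= features[d]   # set inclusion: F(c) ⊆ F(d)
        for c in cluster_c
        for d in cluster_d
    )


# Hypothetical feature sets per entity.
features = {
    "person1": {"name"},
    "person2": {"name"},
    "athlete1": {"name", "team"},
    "city1": {"population"},
}

ok = preserves_hierarchy(["person1", "person2"], ["athlete1"], features)
bad = preserves_hierarchy(["person1"], ["city1"], features)
```

Here `ok` holds because every feature of the first cluster's entities also appears in the second cluster's entity, whereas `bad` fails because the feature sets are unrelated.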
Table 1 covers several cases, but not the case in which clusters C and D adhere to the following constraints: (i) C \setminus D \neq \emptyset, (ii) D \setminus C \neq \emptyset, and (iii) C \cap D \neq \emptyset.  C and D could still be very similar, but neither C << D, nor C >> D, nor C = D would apply.
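The missing case can be illustrated with a minimal sketch that classifies the set relation between two clusters. Treating C and D as plain Python sets, the function name `cluster_relation`, and the relation labels are assumptions for illustration; the mapping of these labels onto the paper's <<, >> and = relations is not confirmed by the paper.

```python
def cluster_relation(c, d):
    """Classify how two sets relate; constraints (i)-(iii) give 'partial overlap'."""
    if c == d:
        return "equal"
    if c < d:
        return "proper subset"
    if c > d:
        return "proper superset"
    if c & d:
        # Both differences and the intersection are non-empty.
        return "partial overlap"
    return "disjoint"


# Hypothetical clusters, represented by their feature sets.
c = {"name", "birthDate", "team"}
d = {"name", "birthDate", "instrument"}
relation = cluster_relation(c, d)
```

The two example sets share two of their three elements, so they are arguably quite similar, yet the result is `partial overlap` rather than any subset, superset, or equality relation.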
Finally, in Section 1.1 the challenge the paper addresses is formulated as: “[a]nalysing the entity type to identify fine[-]grain[ed] type information to bring it [(i.e., the type)] to the same level of granularity”.  It is not entirely clear what “the same level” refers to.  The same inexact phrase is repeated elsewhere -- e.g., in the abstract -- but never clarified.  But mainly, I was expecting a definition of *granularity level*, since the success criteria of the proposed algorithm are formulated in terms of it.  Later I understood that granularity level was defined in terms of similarity, but it is not entirely clear to me why this similarity metric corresponds to valid granularity in a type hierarchy.
(I still give this part a weak accept, because the authors may be able to clarify these issues by improving the definitions.)
(EVALUATION OF THE STATE-OF-THE-ART) The complexity of the previous approach is mentioned several times in the paper, but is never made explicit.  Since improving the complexity is an essential contribution of the current paper, either the complexity class or a more detailed empirical performance evaluation of the previous approach must be presented.  Also, the paper very heavily relies on a comparison to 1 approach, which may be a risk.  E.g., the Related Work section is quite light ATM.
Since the paper is about Big Data, it was a little bit of a disappointment to see the authors select a limited number of datasets (3), and from those datasets relatively small samples (~300K instances, ~4.5M triples).  As the authors mention, there are billions of instances/triples on the Semantic Web.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) This is where the paper underdelivers a bit IMO.  The whole paper is focusing on the heterogeneous nature / variety of Big Data.  While I like the conceptual improvements that are presented in the evaluation section (i.e., the generated type hierarchies seem to be better than with the previous approach), the other improvements are not sufficiently proven.
From the performance results in Table 2 and the rather brief Section 4.4, it is not evident that the performance of the presented algorithm really does scale to Big Data / the Semantic Web.  With the three data points provided in this paper, performance may still be supralinear.
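One simple way to probe this is to estimate the growth exponent from the log-log slope between consecutive measurements, as in the following minimal sketch. The function name `scaling_exponents` and the measurements are hypothetical; the numbers are not taken from Table 2.

```python
import math


def scaling_exponents(sizes, times):
    """Log-log slope between consecutive measurements; about 1.0 means linear growth."""
    points = sorted(zip(sizes, times))
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


# Hypothetical measurements: dataset size (instances) and runtime (seconds).
sizes = [100_000, 200_000, 400_000]
times = [10.0, 25.0, 62.5]
exponents = scaling_exponents(sizes, times)
```

With three measurements this yields only two slope estimates, which is why a handful of data points cannot rule out supralinear behaviour; slopes consistently above 1.0, as in this made-up example, would suggest it.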
Overall, the paper was very clear (except for parts in the approach section, see above), and nice to read.  It contains quite a few small grammar errors that do not affect readability, but that should still be fixed.  I'm enumerating some of these errors from the start of the paper, but they do appear throughout:
- abstract:
- “relatively less” → “relatively little”
- “fine grain” → “fine-grained”
- It is unclear what “the same level” refers to.
- ‘demonstrates’ → ‘demonstrate’
- Introduction:
- “[in order] to deal with”
- ‘require[s]’
- ‘less’ → ‘few’
- You do not “redefine the redefinition of clusters,” rather, you redefine what a cluster is.
- Related Work:
- ungrammatical phrase: “to this type information identification”
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I did not see a link to the implementation, but I'm assuming this is a minor oversight the authors are able to add.
Provided that the code is made available, the main reason that replication of the research presented here would be difficult is that the criteria for taking the samples from DBpedia and the BTC are not documented.
(OVERALL SCORE) This paper addresses two important topics: type information on the Semantic Web is often absent or too coarse-grained, and existing algorithms/implementations are known not to scale to Big Data / the Semantic Web.
The strong points are (i) the focus on Big Data variety, (ii) a very good presentation of the delta WRT the existing approach that the here presented one tries to improve upon, and (iii) interesting results in the conceptual improvements WRT the generated type hierarchies.
The weak points are (i) the performance comparison with the approach being improved upon is not sufficiently conclusive, and (ii) despite the focus on Big Data variety, the evaluation is performed over relatively small samples from 3 datasets.

Metareview by H. Sofia Pinto

The paper presents an approach for fine-grained entity typing, named ExTypifier.
While the paper is well written, clear, and provides an evaluation, its description lacks details, and the approach falls short in novelty and in rationale. Since the authors provided no rebuttal comments, the reviewers' overall initial evaluation that the paper does not meet the acceptance criteria remains unchanged.
