Long-Tail Entity Extraction With Low-Cost Supervision
Author(s): Sepideh Mesbah, Christoph Lofi, Alessandro Bozzon, Geert-Jan Houben
Full text: submitted version
Abstract: Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. "datasets for evaluating recommender systems") are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. This paper presents an approach for training NER and NET classifiers for long-tail entity types that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We propose and discuss different strategies for training data extraction and named entity filtering. The approach is showcased in the context of scientific publication annotation, focusing on the long-tail entity types Datasets and Methods. The approach consistently outperforms state-of-the-art methods, can provide good quality results (up to .91 precision and .41 recall) with a seed set of 100 entities, and achieves comparable performance with a seed set as small as 5 entities and 2 iterations.
Keywords: Named Entity Extraction; Long Tail Entity Types; Natural Language Processing
Review 1 (by Haofen Wang)
(RELEVANCE TO ESWC) Long-tail entity extraction is a relevant topic to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) TSE-NER provides an iterative pipeline starting from a small selected initial seed set, and can achieve good performance even with a very small seed set. Nevertheless, the methods used in each step of the approach are not new, and the whole approach looks like a combination of existing methods. Thus, it is not novel.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) This paper proposes TSE-NER, which can extract long-tail entities from a small seed set, and different strategies can be used in the pipeline for different scenarios. The proposed solution is correct and complete.
(EVALUATION OF THE STATE-OF-THE-ART) In this paper, evaluation has only been done on two entity types, "Dataset" and "Methods", in the domain of scientific publications, which is not so convincing; more experiments and results are needed.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) TSE-NER starts from a small seed set and uses term/sentence expansion to increase the size and variety of the training set. After training a NER model, several filtering strategies are used to select high-quality terms from the candidates, and those high-quality terms can be used as new seeds in the next iteration, making the whole approach iterative.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code is not open source, but the methods used in TSE-NER are common, so it may not be too hard to reproduce the experimental study.
(OVERALL SCORE) This paper proposes an approach for the extraction of domain-specific long-tail entities called TSE-NER. TSE-NER provides a pipeline that can be iterated; the only manual input required is a small initial set of seed terms. The approach outperforms the state-of-the-art methods.
It can provide good quality results (0.91 precision and 0.41 recall) in the domain of scientific publications with a focus on data science. However, the evaluation is done on two specific entity types, "Dataset" and "Methods", which may not be so convincing.
Strong Points:
1. TSE-NER provides an iterative pipeline which can minimize the training cost when the targeted entity types are rare.
2. The approach outperforms the state-of-the-art methods in the domain of scientific publications and achieves comparable performance with an initial seed set as small as 5 entities and 2 iterations.
3. Several strategies/methods are involved in this approach, which lead to different precision and recall values and can be adapted to different scenarios.
Weak Points:
1. Evaluation has only been done on two entity types, "Dataset" and "Methods"; more experiments are needed.
2. The setting of the experimental comparison with the state-of-the-art is not clear; the authors only report precision/recall/F-score. The advantages of the approach compared with the state-of-the-art are not well analyzed and quantified.
3. The methods used in each step of the pipeline are not novel; the whole approach looks like a combination of existing methods. E.g., in term expansion and sentence expansion, word embeddings and sentence embeddings trained with word2vec and the k-means method are used.
4. Lack of analysis compared with related work.
Questions to the Authors (QAs):
1. You claim that the best performance of your approach is (0.91 precision and 0.41 recall); what is the setting for this performance? In Table 1, the best precision is achieved by using No Expansion + Point-wise Mutual Information and the best recall is achieved by using Term Expansion + No Filtering. So the best precision and the best recall are achieved in different settings.
2. For Point-wise Mutual Information (PMI) Filtering, you need to manually compile a list of context patterns; how many patterns are needed?
What is the cost of designing those patterns?
3. For iterative NER training, why report only the first 3 iterations? How about the next 7 iterations?
After Rebuttal: Most of the questions have been answered in the authors' rebuttal, and the authors stated that they will release the source code. Overall, this approach is simple but effective for two specific long-tail named entity types in the publications domain. Since it is hard to justify the effectiveness of this approach in other domains, more experimentation is needed to make it more convincing. We keep our review score unchanged.
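As background for the PMI question above, PMI-based filtering scores each candidate term by its sentence-level co-occurrence with hand-written context patterns, keeping terms whose pointwise mutual information with some pattern is high. A minimal sketch under assumed details (toy corpus, illustrative patterns, and a hypothetical `pmi_filter` helper; not the authors' code):

```python
import math
from collections import Counter

def pmi_filter(candidates, patterns, sentences, threshold=0.0):
    """Keep candidate terms whose max PMI with any context pattern exceeds
    a threshold. PMI(t, p) = log(P(t, p) / (P(t) * P(p))), estimated from
    sentence-level co-occurrence counts."""
    n = len(sentences)
    term_count, pat_count, joint = Counter(), Counter(), Counter()
    for s in sentences:
        for t in candidates:
            if t in s:
                term_count[t] += 1
                for p in patterns:
                    if p in s:
                        joint[(t, p)] += 1
        for p in patterns:
            if p in s:
                pat_count[p] += 1
    kept = []
    for t in candidates:
        scores = [
            math.log((joint[(t, p)] * n) / (term_count[t] * pat_count[p]))
            for p in patterns
            if joint[(t, p)] > 0
        ]
        if scores and max(scores) > threshold:
            kept.append(t)
    return kept

# Toy example: hypothetical sentences and dataset-style context patterns.
sentences = [
    "we evaluate on the movielens dataset",
    "experiments are performed on movielens",
    "the weather was nice on tuesday",
    "tuesday we ran no experiments",
]
patterns = ["evaluate on", "performed on"]
# keeps ['movielens']; 'tuesday' never co-occurs with a pattern
print(pmi_filter(["movielens", "tuesday"], patterns, sentences))
```

The manual cost the review asks about lies entirely in writing the `patterns` list; the scoring itself is unsupervised.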
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) Recognition of named entities and entity types has been a core topic at semantic web conferences.
(NOVELTY OF THE PROPOSED SOLUTION) The approach is simple but effective, and produces good results for the problem domain.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The work is thoroughly evaluated.
(EVALUATION OF THE STATE-OF-THE-ART) Good assessment against the state-of-the-art, and hence a good positioning of the research hypothesis.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper is well written and known shortcomings of the proposed solutions are well discussed.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Extended source code and source datasets were made available as companion material, although the reviewer did not attempt the reproduction themselves. The only thing I might comment on is the need for better documentation on how to reproduce the results described in the paper.
(OVERALL SCORE) The paper addresses a specific named entity recognition challenge in the scientific publication domain. The authors articulate their goal of assessing a specific research hypothesis, and nicely designed a simple approach to validate that hypothesis. The paper is very well-written and nicely evaluated.
Strong points:
- good presentation and structure
- nice presentation
- fair acknowledgement of limitations
Weak points:
- it would probably be good if the authors could say a bit more about how they anticipate the approach could be generalised to different domains
- the authors could also add how others may reproduce their method for different application domains
- the authors could add more details about how the semantically similar terms were selected and constructed
=== response to rebuttal ===
I thank the authors for the rebuttal. However, I feel the questions I raised have not been addressed fully satisfactorily. I think the work does well for what it is, i.e. addressing a specific problem in a specific application domain, unlike what the authors seem to claim, namely a generic high-performance IR technique. Although the authors said in several places that the approach can be adapted to other domains, I'd suggest the authors remove such strong statements, given that evidence is yet to be collected to support such claims. Furthermore, I am also not very happy with how the authors made the source code and training data available on their website. As someone whose application domain is scientific publications, the resources are simply listed like a shopping list and are hardly self-explanatory. It took a lot of effort to understand how they correspond to the resources described in the paper. If the paper is accepted, I'd suggest the authors make some effort to 'describe' the resources used to produce the results in the paper, in a way that other researchers might take the method to a larger corpus and generate some real impact in the application domain.
Review 3 (by Kuldeep Singh)
(RELEVANCE TO ESWC) The paper is relevant to the Semantic Web community. Long-tail entities have always been a challenging task, not just in scientific text, but also in generic named entity recognition. A lot of work has been done on generic NER and NED tools, but specific tools for long-tail entities are still limited. Hence, the work comes in a timely manner and will be interesting for a wider audience.
(NOVELTY OF THE PROPOSED SOLUTION) The idea is novel and well presented in the paper. The authors propose a low-cost approach for training NER/NET classifiers for long-tail entity types by exploiting Term and Sentence Expansion. An extensive evaluation has been performed w.r.t. state-of-the-art methods. Figure 1 explains the overall approach precisely. However, I doubt a few steps in the proposed approach (please see the overall summary section).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is well explained in the paper. The authors have very well identified a problem in the NER domain and approached it with a novel idea. Section 3 is self-contained, with little room for improvement. What I really appreciate in the paper is a clearly identified problem in the introduction section and a seamless approach explained in section 3 to address the identified problem.
(EVALUATION OF THE STATE-OF-THE-ART) Since the introduction, the term state-of-the-art has been used on many occasions. For example, in the sentence "properly perform, state-of-the-art NER/NET methods [1,5] either require comprehensive domain knowledge (e.g. to specify matching rules)...", (1) and (5) are termed state-of-the-art methods. Then TextRazor is termed state-of-the-art for generic NER. In their approach, the authors have used Stanford NER as the generic NER tool for annotation (why not TextRazor? Is there any specific reason to select Stanford NER when TextRazor is termed state-of-the-art?).
Now, in the evaluation section, (18) and (15) are addressed as the state-of-the-art tools against which the approach is evaluated. The fundamental point is: how is the definition of state-of-the-art selected in the paper? If (1) or (5) is the baseline (or state-of-the-art) in the introduction section, then why is a new state-of-the-art introduced in the evaluation section? Further, regarding the claim in the introduction that a generic NER tool like TextRazor mistypes a long-tail entity (with an example): is this just for the specific example given in the paper, or can the claim be generalized? The performance of TextRazor has not been evaluated in the evaluation section to support the claim that generic state-of-the-art NER tools don't perform well on long-tail entities. Of course, the problem of long-tail entities is well supported by references in the related work, but my argument is against the word state-of-the-art on multiple occasions. For TextRazor, I would suggest simply using "generic NER tools like TextRazor...". In the evaluation section, to the best of my understanding, the work presented in (18) is not evaluated in the same experimental settings as this paper. Also, no open source code for (18) is provided by the authors of the published research article. In that case, have the authors re-implemented the approach given in (18)? The same holds for (15), which is a resource paper at LREC: has the same been analysed for the dataset used by the authors in this paper?
15) J. Seitner, C. Bizer, K. Eckert, S. Faralli, R. Meusel, H. Paulheim, and S. P. Ponzetto. A large database of hypernymy relations extracted from the web. In LREC, 2016.
18) C.-T. Tsai, G. Kundu, and D. Roth. Concept-based analysis of scientific literature. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1733–1738. ACM, 2013.
1) M. Brambilla, S. Ceri, E. Della Valle, R. Volonterio, and F. X. Acero Salazar. Extracting emerging knowledge from social media. In Proceedings of the 26th International Conference on World Wide Web, pages 795–804. International World Wide Web Conferences Steering Committee, 2017.
5) M. Kejriwal and P. Szekely. Information extraction in illicit web domains. In Proceedings of the 26th International Conference on World Wide Web, pages 997–1006. International World Wide Web Conferences Steering Committee, 2017.
After rebuttal: Evaluation is the main concern of the paper and is still weak.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is well supported by an evaluation. The evaluation results are promising, though extensibility and generalisability of the results are still a problem. Table 1 and Figure 3 illustrate the empirical studies effectively. Results show a promising improvement over the baseline (as discussed in the comparison with the state-of-the-art). A detailed explanation is provided in discussion section 4.3. Overall, the approach is appreciated.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Stanford NER has been used for the initial NER training in section 3.4. The precision of generic Stanford NER is usually about 40-60 percent for generic small entities, and is expected to be lower for long-tail entities. I am intrigued how the iterative approach has achieved an overall precision touching 91 percent. In section 3.5 the authors admit the noise and false positives created by Stanford NER in the training phase. Is the filtering approach described in section 3.5 really so effective when coupled with iterations? For all these questions, I would like to see the original source code, the experimentally generated data, and the results of the baselines' performance in the evaluation section. If possible, please provide them in the rebuttal period.
After rebuttal: The authors included a public link to their code.
Please improve your code, make it more self-explanatory, and include all the results in a single folder based on Tables 1, 2, etc.
(OVERALL SCORE) The paper presents a novel approach for long-tail NER/typing for scientific content. The technique is able to limit the reliance on human supervision, resulting in an iterative approach that requires only a small set of seed terms of the targeted type. The overall contributions of the paper are a set of expansion strategies exploiting semantic similarity and relatedness between terms to increase the size and labeling quality of the training dataset generated from the seed terms, as well as several filtering techniques to control the noise introduced by the expansion.
+ves: A clear problem statement. Clear evaluation. Clearly presented approach.
-ves:
No clear statement of choosing the state-of-the-art as baseline.
Low cost: the term cost is frequently used in web service composition and query optimization. What does cost signify in this paper, and how? What are the parameters affecting cost, and how has cost been reduced so as to term the approach low-cost? Can cost be quantified here just as it can be in web service composition? If not, I would suggest excluding the term cost from the paper.
Missing source code and training/experiment data.
Questions to Authors: Is this work openly accessible for the future? If yes, please provide all the source code and training and experiment data if possible. Please explain the state-of-the-art point raised in the Evaluation of the State-of-the-Art section of this review.
Other minor typo issues: DBPedia should be DBpedia; Textrazor should be TextRazor.
After Rebuttal: The authors have reasonably addressed nearly all the points raised during the rebuttal period by me and all other reviewers.
Although the main concern regarding evaluation and extensibility/generalisability of this work to other long-tail entity extraction domains still stands, I welcome the authors' attempt to target this particular domain. I do agree that long-tail entity extraction in various domains cannot be addressed in the same paper. I expect the authors to improve the paper on the following points in the final camera-ready copy, and I am happy to increase my score:
1) Must improve: The introduction (the motivation of the work) is built around citing a few papers ((1) and (5)) and mentioning their shortcomings, while in the evaluation the work is compared to another baseline. This is a (serious) fundamental flaw in the paper. I suggest that in the introduction you build the motivation of your work on the shortcomings of the baseline approaches with which you compare in the evaluation section. This will provide a seamless read. The papers cited in the introduction should be moved to the related work, as they are similar but not the baselines in your case. Use the same state-of-the-art for the whole paper. Remove TextRazor as an example and keep Stanford NER from the beginning, or, if you can find any literature that points out shortcomings of existing NER tools for long-tail entities, cite it and include that example.
2) Must improve: Parts of the explanation provided in the authors' rebuttal response should be added to the introduction and the discussion section. A few statements in the rebuttal are very crisp, and I appreciate that.
3) Expected: Please make your code more reusable and self-explanatory. Your work is, of course, impactful in a particular domain (and a first step too), but I anticipate it will be a foundation stone for similar work in other domains.
4) Strong suggestion: One of the initial strong points of the paper is the clear problem statement.
I suggest the following, if the authors agree, for the camera-ready: please re-arrange your existing problem statement using points like P1, P2 in the introduction. Corresponding to these points, define research questions R1 (addressing problem P1) and R2 (addressing problem P2) at the beginning of the evaluation section, and adjust the text of the evaluation section around these research questions. This will propagate the clarity of your work, which is clearly visible in the introduction, to the evaluation section; currently the clarity fades as the paper progresses (the evaluation has the right text, it just needs to be re-arranged).
Suggestions for wider impact:
1. If, in one or two small paragraphs (or by making crisp statements), the authors can also address the shortcomings of their approach in the discussion section, it will be helpful for researchers in the same domain for future extensions.
2. A generic problem I have observed in baseline approaches is reusability. In the future, if your work can be made publicly available to researchers as a web service (similar to DBpedia Spotlight etc.) where they can identify long-tail entities, the impact of your work will be very high. This is because, to the best of my understanding, no current tool provides such a facility for long-tail entities, even in specific domains.
-- Kuldeep Singh
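To make concrete the iterative expansion step discussed in the reviews (a small seed set grown via embedding similarity, then reseeded each round), here is a minimal sketch. The embedding table, the `expand`/`iterate` names, and the no-op filtering stand-in are illustrative assumptions, not the authors' implementation; the real pipeline trains word2vec on the corpus and trains/filters a Stanford NER model per round.

```python
import math

# Toy embedding table standing in for word2vec vectors trained on the corpus
# (illustrative values only).
EMB = {
    "movielens": (0.9, 0.1), "netflix": (0.85, 0.2), "imdb": (0.8, 0.15),
    "lastfm": (0.75, 0.3), "tuesday": (0.1, 0.9), "weather": (0.05, 0.95),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand(seeds, k=1):
    """Term expansion: add each seed's k nearest neighbours in embedding space."""
    expanded = set(seeds)
    for s in seeds:
        neighbours = sorted(
            (t for t in EMB if t not in expanded),
            key=lambda t: cosine(EMB[s], EMB[t]),
            reverse=True,
        )
        expanded.update(neighbours[:k])
    return expanded

def iterate(seeds, rounds=2):
    """TSE-NER-style loop: expand, (train NER + filter, stubbed here), reseed."""
    for _ in range(rounds):
        candidates = expand(seeds)
        # In the real pipeline a NER model is trained on the expanded data and
        # its extractions are filtered; here filtering is a no-op stand-in.
        seeds = candidates
    return seeds
```

With `{"movielens"}` as the seed, two rounds pull in the dataset-like neighbours while the unrelated terms stay out, which is the mechanism the reviews probe when asking how iteration plus filtering can reach high precision.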
Review 4 (by Nitish Aggarwal)
(RELEVANCE TO ESWC) Entity linking is a very relevant problem to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The overall contributions of this paper are weak and do not meet the level of ESWC.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The problem is fairly explained but could be improved with some formalism.
(EVALUATION OF THE STATE-OF-THE-ART) Evaluation is performed on a small dataset; however, there is no comparison with any other state-of-the-art methods related to the problem.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The experiment section fairly touches on some aspects of the proposed approach; however, it is hard to draw any conclusion without comparing it with other methods.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) In general the experimental study is good, but it needs a bigger dataset to be convincing.
(OVERALL SCORE) The paper presents an approach to identify long-tail domain-specific entities using minimal supervision. The proposed approach only uses a few seed terms/entities to bootstrap the system and identify long-tail entities that match the recognized patterns. Evaluation is performed on a dataset of 50 entities of two types in the publication domain.
- Strong Points (SPs)
1. Addresses an important problem of identifying long-tail entities.
2. The problem is well motivated.
- Weak Points (WPs)
1. The main contribution of this paper is extracting data (terms and sentences) to train a standard NER system, followed by applying some heuristics-based filtering, which needs to be made clear in the abstract and introduction.
2. Evaluation is performed on a small dataset of 50 entities appearing in publications. Therefore, it is hard to draw any conclusion about the effectiveness of the proposed approach in other domains. It needs evaluation in different domains on a bigger dataset.
3. The proposed approach to long-tail entity recognition is highly related to domain-specific term extraction work in the literature, which is not addressed in the paper. It would be good to compare the proposed approach with existing term extraction methods on any existing or previously used dataset in the literature.
4. In identifying entity types, the evaluation considers only two types, i.e. method and dataset. It would be better to evaluate the proposed approach in other domains like finance, where more types can be found.
Metareview by Valentina Presutti
The paper proposes TSE-NER: a low-cost approach to extract domain-specific long-tail entities. Although the reviewers appreciate the clarity of the problem definition and the seamless explanation of the approach, major criticisms are raised about the evaluation and about the generalisability and extensibility of the approach. The reviewers recommend performing more experiments (currently limited to only two entity types) to address these issues.