Automatic Generation of Benchmarks for Entity Recognition and Linking
Author(s): Axel-Cyrille Ngonga Ngomo, Michael Röder, Diego Moussallem, Ricardo Usbeck, René Speck
Full text: submitted version
Abstract: Benchmarks are central to the improvement of named entity recognition and entity linking solutions. However, recent works have shown that manually created benchmarks often contain mistakes. We hence investigate the automatic generation of benchmarks for named entity recognition and linking from Linked Data as a complement to manually created benchmarks. The main advantage of automatically constructed benchmarks is that they can be readily generated at any time, and are cost-effective while being guaranteed to be free of annotation errors. Moreover, generators for resource-poor languages can foster the development of tools for such languages. We compare the performance of 11 tools on benchmarks generated using our approach with their performance on 16 benchmarks that were created manually. In addition, we perform a large-scale runtime evaluation of entity recognition and linking solutions for the first time in literature. Moreover, we present results achieved on the Portuguese version of our approach on four different tools. Overall, our results suggest that our automatic benchmark generation approach can create varied benchmarks that have characteristics similar to those of existing benchmarks. Our experimental results are available at http://faturl.com/bengalexp
Keywords: Benchmarking; Named Entity Recognition and Linking; Scalable Benchmarking
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper is about using RDF graphs to generate NER and EL benchmarks. (NOVELTY OF THE PROPOSED SOLUTION) The authors claim that their work is the first automatic solution to generating NER and EL benchmarks from any KB. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution was quantitatively evaluated using gold standards for English found in GERBIL, and also using Brazilian Portuguese examples. (EVALUATION OF THE STATE-OF-THE-ART) The authors provide a generally good overview of recent approaches to generating NER and EL benchmarks, but fail to compare with alternative approaches, such as silver-standard training annotations. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) It is not clear which RDF graph was given to BENGAL to generate the 13 datasets in English and the 4 datasets in Brazilian Portuguese. Many important details are missing about the generation of these datasets, especially regarding annotator performance on Brazilian Portuguese, to ensure that the results will be consistent when using another corpus. The correlation between BENGAL datasets and manual datasets in terms of performance of NER and EL systems is not clear, and seems to vary. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The datasets are available as a compressed file, but the software needed to process them and obtain the same results is not. The results are available as web pages on the GERBIL web site. (OVERALL SCORE) ** Short description of the problem tackled in the paper, main contributions, and results ** The article presents a system for automatically generating benchmarks for NER and EL by inverting the pipeline and going from facts to text (using RDF), and compares the features of the generated benchmarks with those of other frameworks.
** Enumerate and explain at least three Strong Points of this work ** - The article is globally clear, thorough and well written, even if it contains one or two typos and some parts in which the text could be made more explicit. - In general, the results are very interesting, as is the multilingual experiment. - Using GERBIL, a public framework, to evaluate the proposed solution. ** Enumerate and explain at least three Weak Points of this work ** - Some experimental details are missing: the RDF graph used, the seed selection, and the synonyms used for Brazilian Portuguese. - The correlation between BENGAL datasets and manual datasets in terms of performance of NER and EL systems is not clear, and is not fully detailed and discussed with examples. - It is not clear how this approach compares to distant supervision techniques, which help NER and EL systems by automatically generating training data. ** Questions to the Authors (QAs) ** - Could you please explain which RDF graph was used, and whether there is any bias between the manual datasets and that RDF graph? - Why is the correlation between BENGAL datasets and manual ones not shown as a table? - Why did you choose Brazilian Portuguese instead of any other language? Minor issues: Page 2: you write 'especially when these solutions using caching', which should be 'use' or 'used'; you have 'it can enhance both the measurement of both the'; it seems that this could all be one sentence: 'Moreover, BENGAL can be updated easily to reflect the newest terminology and reference KBs. Hence, it can generate corpora that reflect the newest KBs.'. Page 5: Table 1 could be made clearer by separating the examples from the definitions. Page 10: the description of Fig. 1 could be more exhaustive. Page 12: in Table 3, 'N/A means that the annotator stopped with an error.' could be explained further. Some tables would be more readable if the relevant results were put in bold.
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) The topic is quite relevant to ESWC since entity recognition and linking are very important tasks for the Semantic Web community. (NOVELTY OF THE PROPOSED SOLUTION) This paper introduces a novel problem of automatically generating benchmarks for entity recognition and linking. However, the proposed solution is not that novel, as it mainly adopts an existing approach to SPARQL verbalization, i.e., SPARQL2NL. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The automatically generated benchmarks based on the proposed solution might not reflect the dynamic nature of human languages, which is one of the main challenges for entity recognition and linking. (EVALUATION OF THE STATE-OF-THE-ART) This paper combines several existing techniques for seed selection, subgraph generation, verbalization and paraphrasing to generate documents for the evaluation of entity recognition and linking tasks. In particular, the most important steps, i.e., verbalization and paraphrasing, are closely related to the area of natural language generation (NLG). However, the state of the art in NLG has not been discussed and evaluated in this work. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties of the proposed approach are well demonstrated and discussed. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments have been conducted by comparing the datasets generated with the proposed method against 16 manually created gold standards in Gerbil. I would suggest that more insights be drawn from these comparisons. (OVERALL SCORE) Summary of the Paper: This paper aims to automatically generate benchmarks for entity recognition and linking tasks in order to avoid the disadvantages of existing manually created datasets, including annotation mistakes, small volume, lack of updates, popularity bias and lack of availability.
Strong Points (SPs) - Entity recognition and linking have been a research goal for many years in different research communities, such as NLP and Semantic Web, and it seems that they will continue to be explored in the future. Therefore, given the drawbacks of existing benchmarks, it is quite important to propose methods for automatically generating benchmarks for these tasks, which will be really helpful for future research in this area. - An experimental comparison has been made between the automatically generated datasets in this work and the existing manually created datasets in Gerbil, a general entity annotation benchmark framework. Since the authors of this work are also the authors of Gerbil, I would expect the automatically generated datasets to be well integrated into Gerbil for future research. Weak Points (WPs) - The problem of entity recognition and linking is hard mainly because of the dynamic nature of human languages, especially the variety of the structures used. That is why there are many manually created datasets for the evaluation of the problem, which reflect the diversity of real-world data. However, the datasets automatically generated with the method proposed in this paper cannot reflect the characteristics of human languages. - More specifically, this paper relies heavily on SPARQL2NL, an existing approach to SPARQL verbalization. However, SPARQL verbalization aims to generate short questions, which usually involve only a few triples, and the goal is to provide users with simple natural language text that helps them easily understand the intent of formal SPARQL queries. Therefore, the technology might not be suitable for generating documents, especially for the datasets used for the evaluation of entity recognition and linking tasks. Questions to the Authors (QAs) - In entity recognition and linking, there are two types of entities, namely named entities, e.g., Donald Trump, and nominal entities, e.g., President of the United States.
How do you deal with these two types of entities in your solution? - How can you prevent entity recognition and linking solutions from being optimized specifically for the benchmark generation strategy? E.g., paraphrasing in your solution is closely related to mention detection and candidate entity finding for entity linking. If the same dictionary is used for both phases, we could expect a better result than when different dictionaries are used. In addition, an entity linking solution that focuses on exploiting relations in knowledge bases for each sentence in a test document might achieve better results on datasets generated by your solution. In general, I find the automatic generation of benchmarks for entity recognition and linking a really good idea. But the proposed method, especially the adoption of the technology for SPARQL verbalization, is not that convincing to me.
Review 3 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper fits perfectly into the ESWC Benchmarking and Empirical Evaluation track. It addresses the topic "Benchmarking, focusing on datasets and algorithms for comprehensible and systematic evaluation of existing and future systems." by providing a novel method to automatically generate benchmarks for entity recognition and linking. (NOVELTY OF THE PROPOSED SOLUTION) The approach is based on the idea of verbalizing RDF and uses an existing approach, SPARQL2NL, to do so. What is novel is the use of this idea to automatically generate large-scale benchmarks in different languages based on different knowledge bases. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach successfully verbalizes RDF statements into sentences and paraphrases such sentences by means of synonyms taken from dictionaries to achieve a certain variation in the use of natural language. Typical metrics have been applied for the comparison of automatically generated benchmark datasets against different well-known existing benchmarking datasets. GERBIL is used for the evaluation. The approach can be applied to different languages, exemplified by English and Brazilian Portuguese. (EVALUATION OF THE STATE-OF-THE-ART) The paper contains an overview and (brief) discussions of related work, which covers different areas such as benchmarking corpora, RDF-based approaches as well as semi-automatic crowd-based approaches for benchmark creation. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper presents a detailed discussion of the experiments and their results. In the experiments, two languages have been addressed, English and Brazilian Portuguese. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The automatically generated datasets as well as the tables created for the computation of the metrics are available, which allows for reproducibility.
Data and tables are provided on the institute's website and on proprietary cloud services (spreadsheets), thus they are (at the moment) not "guaranteed" to be permanently available. A more thorough discussion of which difficulties are to be expected when applying the solution to other languages would have been beneficial. (OVERALL SCORE) The approach automatically creates benchmarking datasets by verbalizing RDF statements into natural language. The feasibility in principle is shown by conducting experiments in which the synthetic datasets are compared against existing datasets. The main contributions are a methodology for the automatic creation, the experimental validation of the approach, and the provision of the created benchmarks used for the experiments. The results indicate the feasibility for different languages and also show the current limitation for languages for which only sparse structured data exist. The paper is written very well and has a clear structure. Strong points: - Generic methodology which can be refined to cover more complex language structures - Scalability of the approach for the creation of large-scale benchmark datasets - Applicability in principle to different languages Weak points: - Fairly simple approach for verbalization - Only partial coverage of the diversity in which languages are being used by people (... arguably almost the same point as the first weakness) - Result tables and datasets are not (yet?) available persistently on an open platform Questions to the authors: - Will you provide all data related to the paper on an open platform (e.g. github or similar) for the final version? - Which difficulties are to be expected when applying the approach to other languages (e.g. Chinese)?
Review 4 (by anonymous reviewer)
(RELEVANCE TO ESWC) As the paper suggests, the topic is about a system for automatic generation of benchmarks for entity recognition and linking. Thus, it is relevant to the Semantic Web. (NOVELTY OF THE PROPOSED SOLUTION) The idea looked new but I was not fully convinced. The paper lists several shortcomings of existing approaches, especially manual and semi-automatic approaches. I had hoped the authors would provide some convincing arguments or experimental results to support their statements. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) As explained, I was unable to judge the work due to its weak presentation. (EVALUATION OF THE STATE-OF-THE-ART) A Related Work section with more depth is needed. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) I had hoped to see more experiments that could demonstrate the novelty and significance of the proposed approach. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) It might be reproducible. Links to more datasets for the work are provided, but I was unable to check the proposed system (I did not see a link for the system). (OVERALL SCORE) Positive: An interesting topic; a prototype system is implemented and experimental results are presented. Negative: I had the feeling that the authors used some formal definitions to make their work more mathematical instead of trying to make their work more understandable to people.
Metareview by Oscar Corcho
Some of the explicit questions raised by the reviewers in their initial reviews have been left unanswered, since no rebuttal was provided. Reviewers acknowledge the merits of the paper, but there are many aspects that should be further clarified in this paper to make it acceptable.