Author(s): Sepideh Mesbah, Christoph Lofi, Alessandro Bozzon, Geert-Jan Houben
Abstract: Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities – e.g. “datasets for evaluating recommender systems” – are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. This paper presents an approach for training NER and NET classifiers for long-tail entity types that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We propose and discuss different strategies for training data extraction and named entity filtering. The approach is showcased in the context of scientific publication annotation, focusing on the long-tail entity types Datasets and Methods. The approach consistently outperforms state-of-the-art methods, can provide good quality results (up to 0.91 precision and 0.41 recall) with a seed set of 100 entities, and achieves comparable performance with a seed set as small as 5 entities and 2 iterations.
Keywords: Named Entity Extraction; Long Tail Entity Types; Natural Language Processing