Author(s): Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Steve Welch
Abstract: In many Information Extraction (IE) tasks, dictionaries and lexica are powerful building blocks for any sophisticated extraction. The success of the Semantic Web in the last 10 years has produced an unprecedented quantity of available structured data that can be leveraged to produce dictionaries on countless concepts in many domains. While being an invaluable resource, these dictionaries contain a certain level of noise, which causes errors in downstream processes.
In this paper, we propose a simple method that, i) given a concept dictionary of unverified quality and a corpus of interest, ii) identifies potentially ambiguous and spurious items in it, which iii) are then adjudicated through a human-in-the-loop evaluation. Our aim is to recognize items in the dictionary that are very likely to generate errors in subsequent tasks, either because they are “ambiguous”, i.e. they appear with multiple different meanings in the target corpus, or “spurious”, i.e. they do not actually belong to the dictionary and were included by mistake.
By focusing on the terms of highest concern, we minimize the amount of human effort required as a tradeoff. We demonstrate the effectiveness of the method with a systematic experiment on all DBpedia concepts, using a very large Web corpus as the target, achieving an average precision above 95% in identifying problem terms.
Keywords: dictionaries; Linked Data; human-in-the-loop; disambiguation; noise detection