Paper 9 (Research track)

Identifying Ambiguity in Semantic Resources

Author(s): Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Steve Welch

Full text: submitted version

Abstract: In many Information Extraction (IE) tasks, dictionaries and lexica are powerful building blocks for any sophisticated extraction. The success of the Semantic Web in the last 10 years has produced an unprecedented quantity of available structured data that can be leveraged to produce dictionaries on countless concepts in many domains. While being an invaluable resource, these dictionaries have a certain level of noise which causes errors in downstream processes.
In this paper, we propose a simple method that i) given a concept dictionary of non-verified quality and a corpus of interest ii) identifies potentially ambiguous and spurious items in it that iii) will be adjudicated by a human-in-the-loop evaluation. Our aim is to recognize items in the dictionary that are very likely to generate errors in subsequent tasks either because they are “ambiguous” i.e. they appear with multiple different meanings in the target corpus, or “spurious”, i.e. do not actually belong to the dictionary and are included by mistake.
By focusing on identifying the terms of highest concern we minimize the amount of human effort as a tradeoff. We prove the effectiveness of the method with a systematic experiment on all DBpedia concepts, using a very large Web corpus as target, with an average precision in identifying a problem term above 95%.
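To make the "unaligned pattern" intuition from the abstract concrete, the following is a toy sketch, not the authors' algorithm: the function name, windowing scheme, and thresholds are illustrative assumptions. The idea is that a "pattern" is the token window around a dictionary term in the corpus, and a term that frequently occurs in patterns matched by no other dictionary member is flagged for human review as potentially ambiguous or spurious.

```python
from collections import defaultdict

def flag_suspect_terms(dictionary, corpus_sentences, window=2, min_support=2):
    """Toy illustration: a 'pattern' is the window of tokens around a
    dictionary term; patterns matched by only one dictionary term
    ('unaligned') suggest that term may be ambiguous or spurious.
    Only single-token dictionary entries are handled in this sketch."""
    pattern_terms = defaultdict(set)   # pattern -> dictionary terms seen in it
    term_patterns = defaultdict(list)  # term -> patterns it generated
    terms = set(t.lower() for t in dictionary)
    for sent in corpus_sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok in terms:
                left = tuple(toks[max(0, i - window):i])
                right = tuple(toks[i + 1:i + 1 + window])
                pattern = (left, right)
                pattern_terms[pattern].add(tok)
                term_patterns[tok].append(pattern)
    suspects = []
    for term, pats in term_patterns.items():
        # A pattern is 'unaligned' if no other dictionary term occurs in it.
        unaligned = sum(1 for p in pats if len(pattern_terms[p]) == 1)
        if unaligned >= min_support:
            suspects.append((term, unaligned / len(pats)))
    return sorted(suspects, key=lambda x: -x[1])

# Example: "apple" shares fruit contexts with the other dictionary
# members, but also appears in contexts no other fruit matches.
fruits = ["apple", "banana", "pear"]
sentences = [
    "we ate the apple for breakfast",
    "we ate the banana for breakfast",
    "we ate the pear for breakfast",
    "the new apple phone is out",
    "the new apple watch is out",
]
print(flag_suspect_terms(fruits, sentences))  # "apple" is flagged: 2 of its 3 patterns are unaligned
```

A flagged term is not removed automatically; in the paper's setting it would be passed to the human-in-the-loop step for adjudication.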

Keywords: dictionaries; Linked Data; human-in-the-loop; disambiguation; noise detection

Decision: reject

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) This paper is highly relevant to ESWC since high-quality dictionaries for purposes of information extraction are still scarce, and an approach to analyze and improve a dictionary not in general but in relation to the respective use case under consideration is even more valuable.
(NOVELTY OF THE PROPOSED SOLUTION) See above -- the method addresses a highly relevant issue while being particularly simple and needing very few resources (apart from a human reviewer). Moreover, the method is novel since instead of focussing on the ambiguity of terms in a certain text it focusses on a potential ambiguity of a term from a dictionary when occurring in a certain corpus and thus takes into account the respective use case of the end user, without additional resources or preprocessing steps except for tokenizing.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution seems correct and complete, although on page 9 the authors list potential obstacles for the method, and I am not quite sure whether there could not be more of those obstacles in a linguistic corpus. This could be studied and evaluated on a greater scale across a large set of various corpora.
(EVALUATION OF THE STATE-OF-THE-ART) The representation of the State-of-the-Art is sufficiently informative although a bit short and some of the methods mentioned cannot be fully understood without studying the references. Maybe the State-of-the-Art could be structured a bit better into the different methods. The authors clearly point out the disadvantages of existing methods as a preparation for the presentation of their own method. Page 3: "While _these_ are ... " - to which of the methods does "these" refer -- only the last ones or all of them? (Also, in my opinion the example sentence is still ambiguous in conjunction with the second sentence -- the elephant could still be inside the shooter's own pyjamas with the shooter wearing it.)
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The algorithm per se is only discussed in half a page and not in detail but since it is fairly simple this is probably appropriate. Parts of the description of the properties of the method appear in the section on the experimental study, maybe that could be restructured.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental study is general since it uses two large corpora and DBpedia. The experiment is described in detail and is thus reproducible. The authors admit that their gold standard may be considered incomplete but argue that a more complete gold standard could only improve the results.
(OVERALL SCORE) In this paper, the authors propose a simple method that i) given a concept dictionary of non-verified quality and a corpus of interest ii) identifies potentially ambiguous and spurious items in it that are likely to generate errors in subsequent tasks and therefore iii) will be adjudicated by a human-in-the-loop evaluation. The method is evaluated via a systematic experiment on all DBpedia concepts, using a very large Web corpus as target, and achieves an average precision in identifying a problem term above 95%.
Strong points:
- The relevance of the paper, see above: The development of high-quality dictionaries is still a very important open issue in the realm of information retrieval.
- The proposed method needs very few resources, preprocessing steps, or additional assumptions.
- The method takes into account a specific use case (a corpus) and thus delivers a solution that is tailored specifically to an end user's needs.
Weak points:
- The explanation of the central principle used in this paper, i.e., "unaligned terms" could be even clearer, especially for people who are not highly familiar with pattern matching and the associated terminology: (p.5) "by identifying terms which frequently generate unaligned patterns. We define an unaligned pattern as a context that does not match multiple members of the source dictionary and thus have a high probability for identifying ambiguous terms." Could this explanation be paraphrased as "an unaligned pattern is a context that matches one term/label belonging to a concept but not the others, i.e., it isolates homonyms"?
- See above: On page 9 the authors list potential obstacles for the method and I am not quite sure if there could not be more of those obstacles in a linguistic corpus. This could be studied and evaluated on a greater scale across a large set of various corpora.
- Minor weak point: The State-of-the-Art could be structured more clearly, see above.
Questions to the authors:
- p.1: Maybe clarify the definition of a "semantic type"?
- p.5: See first "weak point" above
- p.5: Maybe clarify in one sentence why unaligned patterns are problematic when used for dictionary generation? Because they are not aware of the aligned patterns and thus generate ambiguous terms inadvertently?
- Your example "apple" could be disambiguated by other heuristics, e.g., the fruit is countable and will appear with "the/a" or in the plural; the technology is spelled with a capital "A". However, it is probably hard to find an even clearer example.
- p.9: It sounds strange to me to state that there is an upper limit -- aren't these rather obstacles which could hamper the method also in ways that are not quantifiable (esp. point 3 -- how much is "enough")? Unforeseen interactions (such as point 4 but there could be others)?
- Maybe you can discuss your choice of window-size 6: I think that on Twitter the information content is much higher than in texts on websites in general (and thus probably your other corpus) -- how does that compare?
- In your opinion, do the results of your paper basically mean that the development of high-quality dictionaries is impossible without human intellectual effort?


Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) It is a relevant paper to the topics of the conference.
(NOVELTY OF THE PROPOSED SOLUTION) There is no research contribution and novelty in the proposed solution.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Since the problem definition is not well defined it leaves no place to judge whether the solution is correct or incorrect.
(EVALUATION OF THE STATE-OF-THE-ART) There is WSD and NED in the SOTA; how can you compare, and what is your task, explicitly? In WSD and NED the tasks are clear; what is your problem definition? From your comparisons it is still not clear to me which exact problem you want to tackle. Having a dictionary and identifying ambiguous terms is not a realistic problem setting.
The comparison with WSD does not hold, as it is a different task. Here the authors simply check if a word might have multiple uses and remove those entries from their dictionary. How do you draw the relation with this line of work?
Similarly, the comparison with the word embedding line of work is confusing to me. How does your task relate to any of the works mentioned in the SOTA section?
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is tightly connected to the problem definition and since the task is ill-defined therefore any solution may be correct or incorrect.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The approach seems easy to recreate and could be adapted well to different corpora. However, given my concerns w.r.t. the task definition, I am not sure if this would make for a positive point here.
(OVERALL SCORE) **** Short description ****
This paper presents THAT-dictionary, a method to identify ambiguous entries in a dictionary based only on a given corpus. The approach uses a pattern scoring method, which suggests which patterns in a corpus might lead to ambiguous dictionary entries.
**** Strong Points ****
1) Minimal computational efforts to run this approach.
2) The paper is nicely written and well structured, with some limitations in the evaluation section.
**** Weak Points and Questions to the Authors ****
1) The motivation behind this work is not clear, nor is the problem definition. A lexical dictionary is bound to specific tasks; dictionaries usually have a specific purpose for which they are created, and hence their generalization is bound to fail. Take for example a list of verbs in which you can identify language bias (epistemological bias, framing bias [1,2], etc.). They need to be used on specific tasks; therefore, measuring the ambiguity of a lexical dictionary on use cases like the one mentioned in the paper, where you look only for the occurrence of specific words or tokens, is not a real task and has no clear goal or definition. Thus this is an ill-defined problem.
2) Moreover, every word can be ambiguous. Even in the same document the same word can be correctly used with different meanings or word senses. It is a harsh solution to remove from a dictionary words that appeared to be ambiguous in a specific text.
3) What you call ambiguous terms based on your motivation, seems more like entity disambiguation rather than word sense disambiguation or lexical entry disambiguation. What is the task here? Language ambiguity, specifically, the use of words in different contexts? This would be then the task of WSD. If yes, then comparison with SOTA would be highly crucial.
4) How do you define an unaligned pattern? What does it mean in your context? Is it a simple count? What if there are many “aligned” patterns?
5) It seems that you do some form of vanilla IE, extracting co-occurring words in text based on your seed list (dictionary). The scoring function has no justification or reasonable explanation, since it seems that even cases where words fit the "aligned" patterns, e.g. "eat sushi", "milk for breakfast" or "to eat cereals for breakfast", would negatively affect the ambiguity of the dictionary entry "apple".
6) It is not clear how to interpret the evaluation results. The deviation is very high. We do not know exactly what topics and domains are covered in your corpora. Can you provide some collection statistics; is it a general web crawl? How about Twitter, what kind of crawl is it?
7) It seems that you get the high scores simply by having matches on the unambiguous DBpedia concepts. The provided explanation and discussion unfortunately leave much of the experiments open to misinterpretation, as they are poorly described, and I do not get the intuition of your evaluation, simply because there is not enough information to judge the quality of the results.
[1] Marta Recasens, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky: Linguistic Models for Analyzing and Detecting Biased Language. ACL (1) 2013: 1650-1659
[2] Hooper, Joan B. On assertive predicates. Indiana University Linguistics Club, 1974.
I would like to thank the authors for the rebuttal. However, my questions and concerns regarding the paper remain unanswered by the rebuttal. Therefore my overall score will not change. I would strongly urge the authors to improve and resolve on the concerns that I have raised in my review in their next submission.


Review 3 (by Zhisheng Huang)

(RELEVANCE TO ESWC) Identifying ambiguity in semantic resources is one of the important issues in the Semantic Web.
(NOVELTY OF THE PROPOSED SOLUTION) The paper proposes an algorithm which can recognize ambiguous terms and spurious terms.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The presentation of the paper needs to be improved. Algorithm 1 is the main contribution of the paper. However, it is quite unclear how the "center match" can be achieved. The N in the function topN in line 16 of Algorithm 1 should be a parameter, so that it can be changed.
(EVALUATION OF THE STATE-OF-THE-ART) Related work has been discussed.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper reports the experiments with test data for the evaluation of the proposed approach.
However, the evaluation on detecting spurious terms (Section 4.3) is quite weak.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The proposed approach is general to cover various scenarios.
(OVERALL SCORE) See the comments above.


Review 4 (by anonymous reviewer)

(RELEVANCE TO ESWC) The paper addresses issues in the effective use of Linked Data resources, primarily DBpedia.
(NOVELTY OF THE PROPOSED SOLUTION) The approach addresses the issue of identifying ambiguous and spurious terms in Linked Data semantic resources but does not take into account a large number of very relevant earlier approaches to this issue in the context of lexical semantic resources such as WordNet (see work on sense ranking, sense clustering, systematic/regular polysemy).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is worked out in detail and appropriate experiments have been implemented.
(EVALUATION OF THE STATE-OF-THE-ART) See comments above. The state of the art has been addressed well in regard of WSD and WSI but not in regard of earlier work on lexical semantic analysis in sense ranking, sense clustering, systematic/regular polysemy, which is actually most relevant here.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach has been clearly explained, however its implementation as it currently stands is not really appropriate as it deals with automatically derived dictionaries from an automatically derived knowledge base (DBpedia). The approach should be applied (also) to authoritative human-controlled semantic resources (thesauri, terminologies, taxonomies, dictionaries/lexicons).
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiment can be reproduced.
(OVERALL SCORE) The paper presents an approach to the issue of effective use of semantic resources, such as DBpedia by focusing on the lexical/linguistic aspects of such resources, and in particular on the aspect of ambiguity in matching knowledge concepts in text. The approach is worked out well and appropriate methods and experiments have been put in place. However, its implementation as it currently stands is not really appropriate as it deals with automatically derived dictionaries from an automatically derived knowledge base (DBpedia). The approach should be applied (also) to authoritative human-controlled semantic resources (thesauri, terminologies, taxonomies, dictionaries/lexicons).
I read the author rebuttal, thanks. However, the authors are missing the point here in that I do not suggest to apply the approach to additional resources but to appropriate resources (authoritative human-controlled semantic resources) - which was not done so far.


Metareview by Valentina Presutti

The paper proposes an approach to identify ambiguous terms in resources with respect to their use within specific use cases. It is much appreciated that the method requires very few resources and is worked out well, with appropriate experiments put in place. There is a major concern about how conclusive the result can be with respect to the claim: the approach should be validated also against authoritative human-controlled lexical/semantic resources.

