Paper 152 (Research track)

Leveraging Semantic Web Resources to Detect Temporal Ambiguities in Text

Author(s): Farshad Bakhshandegan Moghaddam, Maria Koutraki, Harald Sack

Full text: submitted version

Abstract: Named Entity Recognition and Disambiguation (NERD) is a fundamental analysis task in natural language processing for the correct interpretation of meaning. However, existing NERD applications are often unable to detect and resolve ambiguities that require the determination of a temporal context for correct interpretation, such as in so-called temporal roles like CEO of a company, Soccer World Champion, or head of a country.
In this paper, we propose a novel learning-based approach to automatically detect and recognize temporal roles as a first step towards the subsequent disambiguation.
The approach is driven by Conditional Random Fields (CRF) leveraging information learned from Wikipedia and the Wikidata knowledge graph.
Experiments on a manually annotated dataset as well as on a large dataset automatically collected from Wikipedia show that the CRF-based approach outperforms vanilla baselines such as dictionary matching.

Keywords: NLP; Entity Recognition; Temporal Ambiguity; Conditional Random Fields; Wikidata

Decision: reject

Review 1 (by Ziqi Zhang)

(RELEVANCE TO ESWC) The paper introduces a new problem of 'temporal role' detection. I can see it is relevant to the named entity linking and disambiguation tasks and therefore is potentially of interest to a wide range of audience of this conference.
(NOVELTY OF THE PROPOSED SOLUTION) This is a typical 'apply tried-and-tested techniques to a new problem' work, so technical novelty is limited. While the problem addressed is certainly new, I think the authors should give a more convincing justification of why the problem itself needs addressing separately (as opposed to directly tackling other problems that temporal role detection is supposed to help address).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The method is well described and easy to understand. It is also sound and fit for purpose.
(EVALUATION OF THE STATE-OF-THE-ART) The work compares against ANNIE, a well-established state-of-the-art tagger. The comparison is reasonable but can be improved by, e.g., re-training existing NER taggers using your temporal role annotations.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Result discussion is reasonable. Some feature analysis is provided. But I think what is really lacking is how temporal role detection will feed into/support/improve other downstream applications/tasks.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The method is well described so easy to re-implement. The authors also shared their data and implementation, which is good.
(OVERALL SCORE) **short description**
The paper introduces the new problem of 'temporal role' detection. It uses 'tried and tested' supervised machine learning techniques to address it as a tagging task. Results show that the supervised method performs better than a dictionary-based lookup method. But the difference is, arguably, not very significant.
**Strong points**
1. paper is well written, easy to follow
2. better results of the proposed method than state of the art
3. dataset and implementation are made available
**Weak points**
1. the motivation for treating 'temporal role' detection as an independent task is not convincing
2. lack of experiments to prove that this can positively contribute to more interesting tasks such as NEL, NED
3. the scope of experiments is a bit limited - only four roles are tested while there can be many more.
**Questions to the authors**
1. Why did you choose the four roles for evaluation, why not adding more, and why not testing others?
2. Can this task be addressed by simply re-training existing NER taggers using your labelled dataset (still using the NER standard features)? If not, why? If yes, what is your unique contribution?
The paper introduces a new problem that is 'temporal role' detection. The paper is generally well written and easy to follow. Two methods are proposed to detect temporal roles, one dictionary based and one supervised machine learning based. Results obtained by the two methods are, arguably, quite close. The authors also created a large dataset following the distant supervision approach as well as a small manually annotated dataset for future research.
The main issue I see is the 'so what' question, which I would expect the authors to take the rebuttal as an opportunity to address. Detecting temporal roles apparently has limited value as a task in itself and most likely has to be used with other downstream applications, such as NER or NED/NEL. Your experiments show reasonably high accuracy for both the unsupervised and the supervised methods, and in fact the difference is non-significant, which also raises the question of whether the task (by itself) is difficult or worth solving at all. But this could also be due to the lack of diversity in your datasets (see below). I suppose that the big question is: does detecting the presence of such a role in the text help any of these downstream tasks (e.g., NER, NED/L)?
Suppose we are dealing with an NED/L task and using your example 'The acting U.S President visited Pope Francis'. If a classic NER system already detects 'U.S President' and 'Pope Francis' as NEs, how does knowing that 'President' and 'Pope' are temporal roles affect the next step of NED/L? You pretty much still need the full lexical context from the sentence (and even beyond) for that task, and I cannot see how knowing that such 'temporal roles' exist as part of the NEs changes that or makes it easier. One potential use of such temporal roles is to feed into the NER process: assuming an NER system makes errors by only recognising 'Francis' as an NE, knowing that 'Pope' is a temporal role could possibly alter its prediction to select 'Pope Francis' as a whole as an NE. But how prevalent are such errors in your data, and how could this temporal role information feed into NER? What is the gain versus the effort, and would it be overkill considering that you have to build a dictionary/supervised system for temporal role detection?
For the above reasons, in my opinion you need to justify the value of detecting temporal roles in the NER/NED/NEL tasks to make the effort worthwhile. How well can a state-of-the-art NED system perform without detecting temporal roles, and how much better can it get if it knows about the existence of temporal roles? The task in itself, however, is not a very appealing or convincing problem to solve, and arguably not a very challenging one either.
Another problem is that I suspect your dataset is not diverse enough. You should give more details about the data you created for the experiments. What do you mean by four temporal role 'groups'? Why did you choose only four? What about 'champion', as you previously illustrated, and roles such as 'university chancellor', 'manager', 'team coach', etc.? There can be a lot of temporal roles. What are examples from each group? Figure 1, which you refer to on page 7, does not make sense.

Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) The authors use Wikipedia and the Wikidata knowledge graph to extract information about certain categories, but they don't seem to use any kind of hierarchical information or do anything semantically related with it - everything else in the work is pure core NLP, so the relevance is borderline in this respect.
(NOVELTY OF THE PROPOSED SOLUTION) There really isn't much novel about this work as far as it goes. The interesting part will be what's planned for later in terms of the temporal disambiguation, which is currently rarely tackled in the literature. But other than that, they just apply some ML and a very simple dictionary lookup approach to the problem of recognising some (very few) categories of NE.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution is rather simple, and while not technically incorrect, does not really go far enough. In particular, it does not actually tackle the issue of temporal ambiguity.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation proposed is a bit simplistic. First, the authors only take a very small number of categories. Second, the baselines they compare against could have been stronger - why not compare against an actual NER system, or indeed against the Jobtitle recognition in ANNIE/GATE rather than just comparing against the list (these are two different things). It also seems odd that the difficult negative set is so small in the automatically generated dataset.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Again, I don't feel the approach goes far enough in its current stage.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) On the plus side, the authors provide a large, automatically generated annotated dataset on which to evaluate, but the manually produced dataset seems very unbalanced and the number of different classes recognised is really far too small for a proper evaluation.
(OVERALL SCORE) The paper describes an approach to recognising certain categories of NE for the purposes of NERD. They call it temporal ambiguity detection, but actually what they do in the paper is not this at all, just a first step towards it. I find this quite confusing. They compare a CRF method with a dictionary-based method to find these NEs, based on collecting relevant terms from Wikipedia.
Strong Points: 
- the authors produce a large automatically generated dataset for evaluation
- the work is an interesting first step towards the problem of temporal ambiguity detection
- the paper is generally well-written
Weak Points:
- the method proposed simply doesn't go far enough, it's simply a first step.
- the evaluation is a bit flaky (see comments).
- the semantic web aspect is rather limited (and there are some unclear points)
Questions for Authors:
- Why did you not compare the results against the results from e.g. ANNIE (or another NER system) that involve the detection of these particular NE types (job titles)? This would have been fairer than just comparing against the list of job titles in ANNIE.
- Why did you not include more NE types in your set? The set picked seems rather limited (and one could be suspicious that it was hand-picked in this respect).
- Why is the difficult negative set so small in the automatically generated benchmark?
Having read the rebuttal, I do not feel that the authors sufficiently addressed my concerns. For example, they did not adequately answer why they used a flat gazetteer list of job titles from ANNIE rather than the actual JobTitle annotation that is provided, which would have provided a more accurate comparison. I also maintain that the list of types is very limited and there are many others that could have been chosen. Similarly, the response about the small negative set does not satisfy me, as I believe this issue to be significant. In summary, the paper has many flaws in the methodology and evaluation and simply does not go far enough yet - it is a minor first step to solving a problem, but this is all.

Review 3 (by anonymous reviewer)

(RELEVANCE TO ESWC) The submitted work claims to use semantic technologies (specifically information from Wikidata and Wikipedia) in order to improve the recognition and disambiguation of "temporal" roles. As such, it fits the "Natural Language Processing" track of the ESWC.
However, the actual use of the ontology information is not as extensive as suggested by the paper title and the abstract. In particular, the features derived from Wikidata are limited to extracting dictionary information for temporal entities. Also, temporal entities are limited to the relation "replaces|replacedBy", which seems rather narrow. These two aspects are not discussed well and seem to diminish the relation to semantic technologies.
(NOVELTY OF THE PROPOSED SOLUTION) The methodology itself is not new: dictionary-based and CRF-based NERD. On the feature and dataset level, Wikidata and Wikipedia are used to derive a dictionary of "temporal" role surface forms (for a dictionary-based approach and as a feature for the CRF-based method) as well as corresponding ground-truth datasets. In this context the methods are interesting but not particularly complex. The claimed main innovation lies in the focus on temporal roles.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed method for temporal role detection is (besides a dictionary approach) based on adding context (features of surrounding words) as well as surface forms to the features for classification. As such, building the dictionary is the main methodological contribution of the proposed solution. The dictionary is built by extracting concepts from Wikidata where these concepts are limited to "temporal" roles by using the relation "replaces|replacedBy". It is augmented by adding anchor texts from links linking to the extracted roles. While the method generally may result in an extended dictionary for "temporal" roles, there are several questions and details which need to be addressed:
* Is the dictionary hierarchical? Are ambiguities actually resolved by keeping the links between anchor texts/surface forms and concepts extracted from Wikidata?
* The evaluation implies that the method focuses on classifying role/notRole and does not seem to check if the correct role is detected (as seems to be implied by the abstract and the introduction).
* It is stated that other dictionaries do not fit the given task (and are thus not evaluated), but no concrete examples or proof is given (e.g., as part of the evaluation).
* To extract "temporal" roles the relation "replaces|replacedBy" is used. This seems rather limited. What about the mentioned examples about popes where there may be several popes at the same time? Is each pope replaced by the single pope following this generation? A discussion in this direction is required since the rationale about only using this relation is not obvious.
Minor notes:
* Why are role labels removed from the dictionary as final clean up step? Duplicates have been removed before.
* There is a paragraph about role phrase annotation which does not seem to be relevant.
(EVALUATION OF THE STATE-OF-THE-ART) The related work section discusses several NERD frameworks. Most of them are labelled as not fitting the task, yet one of them is even used as a feature for the CRF, and none of them are actually evaluated. It would have been appropriate to see how these systems would perform on the collected dataset. Also, there is a wide variety of other NER approaches which could have been _trained_ and tested on the collected data for the given task, which has not been done. See for example:
* The used CoreNLP toolkit allows retraining its NED component:
* A survey of named entity recognition and classification, David Nadeau, Satoshi Sekine
* Named Entity Recognition: A Literature Survey, Rahul Sharnagat
* Named Entity Recognition: Applications, Approaches and Challenges, Archana Goyal, Manish Kumar, Vishal Gupta
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There are several issues with the demonstration and the discussion of the properties of the proposed approach:
* The paper claims to detect temporal ambiguities in text. The motivational text furthermore leads one to believe that the system will be able to distinguish different "popes" of different times (also since NERD is introduced and not only NER). However, what is actually evaluated is role detection, i.e., a corpus with "temporal" roles is built and only the classification of role/notRole is evaluated. Here, either the experiments or the introduction and the line of argumentation need to be strongly adjusted. In particular, will the proposed approach be able to tell whether "pope" refers to Pope X or Pope Y depending on context?
* The dictionary used in generating the test corpus is also used as a feature in the classification process. This will always bias the results towards the proposed method, especially because no alternative method that can draw information from training data was evaluated, except the implicitly evaluated linear CRFs. However, for linear CRFs, results are not reported on both datasets and the feature studies are very limited (excluding a combination of local and contextual features).
* Along the same line: How would the overall method perform without the external features? The feature ablation study does not answer this question because local+contextual performance is missing. As such, it is not clear whether the conclusion that information from Wikipedia/Wikidata actually helps is justified.
* Also, how would a standard NER tool perform (since corresponding output is also used as a feature)? Similarly, other, more advanced, methods which can be trained are not evaluated (see related work).
* How would the current method be incorporated into the overall NERD task? Currently the evaluation focuses only on temporal roles and not classification.
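The missing local+contextual setting mentioned above can be made concrete with a small sketch of an ablation grid. This is only an illustration under stated assumptions: the group names follow the review's wording (local, contextual, external), but the individual features, the `token_features` and `ablation_settings` helpers, and the toy dictionary are hypothetical and not taken from the paper.

```python
from itertools import combinations

# Feature groups named as in the review; their contents below are assumed.
FEATURE_GROUPS = ("local", "contextual", "external")


def token_features(tokens, i, groups, dictionary):
    """Build a feature dict for tokens[i] using only the enabled groups."""
    feats = {}
    word = tokens[i]
    if "local" in groups:
        # Properties of the token itself.
        feats["word.lower"] = word.lower()
        feats["word.istitle"] = word.istitle()
    if "contextual" in groups:
        # Neighbouring tokens, with sentence-boundary markers.
        feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
        feats["next.lower"] = (
            tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
        )
    if "external" in groups:
        # Lookup in a role dictionary (here a plain set of lowercase words).
        feats["in_dictionary"] = word.lower() in dictionary
    return feats


def ablation_settings():
    """Enumerate every non-empty combination of feature groups."""
    return [
        combo
        for size in range(1, len(FEATURE_GROUPS) + 1)
        for combo in combinations(FEATURE_GROUPS, size)
    ]


if __name__ == "__main__":
    toy_dictionary = {"pope", "president"}
    for groups in ablation_settings():
        print(groups, token_features(["Pope", "Francis"], 0, groups, toy_dictionary))
```

Training one model per setting, including `("local", "contextual")`, would show how much the dictionary-derived features add on top of features that need no external resource.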
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) There are several aspects which will make the reproducibility challenging:
* the curated gold standard dataset (as well as the code) is supposed to be available (github repository), but the link seems broken
* no link to the ground-truth dataset is given
* It is not clear which Wikipedia articles are used to extract the data (the complete English Wikipedia?). How come there are so few simple negative examples? The number indicates some preselection process.
* How was the curated dataset built? As described, it probably overlaps strongly with the automatically generated one (since it is a mixture of NYT and Wikipedia)?
* How was the Wikipedia Anchor-Text Dictionary built?
* Is the feature ablation based on the hand curated dataset or the large dataset?
Also, there are several points and questions which impact the quality of the evaluation.  
* It cannot be judged how good the performance of the method is on the hand-curated dataset, because no class statistics are given.
* The hand-curated dataset seems to overlap the automatically generated one (since it is a mixture of NYT and Wikipedia). Also, text pieces from NYT are again selected by using the dictionary. This strongly reduces the value of the evaluation, since (as for the large dataset) the methods are explicitly given features that were used to build the dataset. Thus any other method will probably perform worse than the proposed method. On other datasets, however, this may change depending on the temporal roles present.
* How much will leaving out anchor texts from the dictionary reduce performance? What about leaving out the original class labels? Evaluating this would justify the choices made.
* The hand-curated dataset is very small. Are the reported differences actually significant?
* Why are no baselines reported for the large dataset?
* It may be more appropriate to build a dataset using a general NER tool, and labeling each detected entity as temporal or not. 
* It seems pretty obvious that only difficult examples fail, since the dictionary which is used to derive the dataset is also used as a feature.
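One way to approach the significance question raised above is an exact McNemar test on paired predictions. The following is a minimal sketch, not something reported in the paper: the function name and the toy labels are hypothetical, and it assumes the two systems are compared item by item on the same gold labels.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Only items where exactly one of the two systems is correct
    (the discordant pairs) carry information about the difference.
    """
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    # Binomial tail under the null hypothesis that each discordant
    # pair favours either system with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    gold = ["role"] * 10
    system_a = ["role"] * 10                      # e.g. a CRF-style tagger
    system_b = ["role"] * 4 + ["notRole"] * 6     # e.g. a dictionary baseline
    print(mcnemar_exact(gold, system_a, system_b))
```

Tokens within one sentence are not independent, so applying such a test per sentence or per annotated mention is arguably safer than applying it per token.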
(OVERALL SCORE) === Summary of the Paper
The paper states that it focuses on detecting and disambiguating temporal roles, such as "pope", or "world champion". To this end it builds a dictionary of such roles and corresponding surface forms based on Wikidata and Wikipedia. This dictionary is used for several further steps in the paper: a) building a large (distantly supervised) test dataset based on Wikipedia articles and a smaller hand-curated one, b) as the basis for a dictionary-based approach to detect temporal roles, as well as c) to provide features for a CRF-based approach. The evaluation is based on detecting the classes "role/notRole" on the collected large scale dataset as well as a hand-curated dataset. The results indicate a slight performance increase by using CRFs compared to the dictionary. Baselines are limited to two dictionary-based approaches.  
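For illustration, the dictionary-based detection of "role/notRole" summarized above could look roughly like the sketch below. It assumes a greedy longest-match lookup over lowercased tokens; the dictionary entries, `build_dictionary`, and `tag_roles` are hypothetical, and the paper's actual matching strategy may differ.

```python
from typing import List, Set, Tuple


def build_dictionary(surface_forms: List[str]) -> Set[Tuple[str, ...]]:
    """Store each surface form as a tuple of lowercased tokens."""
    return {tuple(form.lower().split()) for form in surface_forms}


def tag_roles(tokens: List[str], dictionary: Set[Tuple[str, ...]]) -> List[str]:
    """Label each token "role" or "notRole" by greedy longest match."""
    max_len = max((len(entry) for entry in dictionary), default=0)
    lowered = [t.lower() for t in tokens]
    labels = ["notRole"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = "role"
            i += matched
        else:
            i += 1
    return labels


# Toy entries chosen for illustration only.
ROLE_DICTIONARY = build_dictionary(
    ["pope", "president", "chief executive officer", "CEO"]
)

if __name__ == "__main__":
    sentence = "The chief executive officer met the Pope".split()
    print(list(zip(sentence, tag_roles(sentence, ROLE_DICTIONARY))))
```

Because the same dictionary can also generate the distantly supervised labels, a matcher like this will score well on such data almost by construction, which is the evaluation concern raised in this review.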
=== Strong Points (SPs)
* creating and using a large (distantly-supervised) gold-standard dataset of "temporal" roles
* claim to make data and code available (link currently does not work though)
=== Weak Points (WPs)
* the implied claim of disambiguation seems too strong
* hard to read and follow, especially in the technical parts
* lacking evaluation and discussion (especially with regard to baselines and features)
=== Questions to the Authors (QAs)
* Please elaborate on the validity of your results. Using features of the classification to build evaluation datasets is very questionable and needs to be justified very well.
* Please elaborate on the choice of baselines. Why are no other trainable NER methods used (similar to the linear CRF)? The feature ablation seems to point at a large potential, even without using Wikidata/Wikipedia.
* Please report the class statistics and data source ratio as well as further details about the hand-curated dataset.
* Please elaborate on the building process of the Wikipedia Anchor-Text dictionary.
* Please add a note on the significance of the reported numbers, especially on the small dataset.
* Please report results of the baseline on the large dataset.  
=== Further notes and suggestions
* While nice to read, the initial example is a little bloated, especially because the puzzle still cannot really be solved by the proposed approach, which makes the line of argumentation kind of awkward.
* Examples of entries from the generated dictionary would be very helpful.
* How is the computer architecture relevant to the evaluation if no runtimes are given?
=== Overall
Leveraging Wikidata and Wikipedia to enhance NERD tasks is an interesting application (especially due to their rich structural properties), provided a thorough evaluation is given. Also, the collected datasets, especially if made openly available, are a valuable contribution. Unfortunately, the main contribution of the article is hard to grasp and leaves too many open questions to properly judge the impact of the proposed solution. The major issues are the somewhat misleading motivation, the use of features to devise the datasets used for evaluation, and the omission of appropriate baselines. As such, the paper currently needs to be rejected.
=== After the rebuttal
I appreciate the comments of the authors. They definitely clarified a few things.
However there are still major issues which are hard to evaluate without another iteration of reviews. 
* For one, while I very much appreciate the willingness of the authors to rework the motivation and introduction according to the notes of the reviewers, I think that the changes will be rather major, needing another round of reviews (major revision in journal terms).
* Second, retraining existing NER approaches definitely is a fair comparison, as they are often based on linear CRFs and the results of the authors show that this approach works well even without specialized features. Thus, the evaluation still seems rather limited.
* The authors did not completely address the raised issue of using the features to construct the dataset in the evaluation. While they state that there is "no overlap between the distant supervision and the manually curated dataset" the results from the distant supervision dataset still seem to be invalid. For the curated dataset, this seems less likely but the roles still seem to largely match the dictionary ... unfortunately the authors have not adequately addressed this in their rebuttal.

Review 4 (by Víctor Rodríguez Doncel)

(RELEVANCE TO ESWC) The paper is of interest for ESWC for the topic it handles, matching several of the bullet points in the ESWC call for papers. It is worth noting, however, that the use of the semantic information is tangential, only related to the creation of dictionaries and little more.
(NOVELTY OF THE PROPOSED SOLUTION) I was not aware of any technique for this purpose and after a brief search I could not find any equivalent work --besides the RoleNER that is mentioned in the state of the art (and for which there is code online but no accompanying documentation).
Also, this work seems to be the result of a young researcher (Farshad Bakhshandegan-Moghaddam) with no similar paper published before: the work does not seem reused from another context.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The number of role groups in the gold standard (President, Pope, Monarch and CEO) is small, as I can imagine many more temporal roles not related to a specific post (for example "in love with"), etc.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation of the state of the art is fair, considering there is no similar method besides RoleNER. 
However, I would have been happy to see the application of CRF in similar problems.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The demonstration is fair enough, although the comments are too optimistic in relation to the precision/recall measures obtained: the advantage over the dictionary-based approach is somewhat limited.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Source code is available, gold standard dataset is online and everything is clean and transparent.
(OVERALL SCORE) This paper is about detecting temporal roles (properties whose validity depends on time) in a text. Two methods are proposed; one method is dictionary-based, the other machine learning based. The latter, which is based on Conditional Random Fields, performs somewhat better than the former.
The dictionary is also a result of this work, as well as a dataset for this task, which includes a gold standard subdataset.
The paper is well structured, written in accessible English and free of typos.
The contribution is relevant, as it may lead to increasing the quality of triples on the net, by improving the correctness of information which changes over time. This usefulness is not evidenced in this paper, however. Most notably, the dataset is online, the source code is on github, and it compiled and partially ran correctly on my computer (the setup is non-trivial). Some files were also missing (folders data, category, etc.).
However, the conducted test was limited to four roles and was not connected to other NER tests --for example, it could have been applied to a triple extraction challenge numerically evidencing how many of the triples might be affected.
Second, the temporal roles extracted with the SPARQL query use the property replaces/replacedBy. Is there any other similar SPARQL query? I could not find it in the repo, and the text is unclear to me on this point.
Third, much emphasis is given to CRF, but I don't see much more effort beyond using the CRFSuite software.
Finally, I do not see much comment about disambiguation --I would rather have read NER instead of NERD in the abstract.
The link in footnote 7 does not work.
I acknowledge receipt of the authors' comments, which do not essentially modify my view. Thanks for providing the correct link, I had already downloaded the code anyway.

Metareview by John McCrae

This work concerns the recognition and disambiguation of temporal roles. It seems that this work is quite premature and in particular that the evaluation is not sufficient to prove the main points of this paper. We would recommend that the authors expand the dataset and add sufficient baselines to the evaluation. Secondly, the methodology is not clear in its novelty and its impact, as such the opinion of the reviewers seems to be that this work should be further continued.
