An Evolutionary Methodology for Domain Specific Entity Linking in Real-world Settings
Author(s): Tayfun Gökmen Halaç, Oğuz Dikenelli
Full text: submitted version
Abstract: Recently, there has been increasing interest in entity linking, which transforms text data into higher-level semantics. State-of-the-art entity linking systems rely on comprehensive knowledge bases (KBs) constructed over a long period of time. However, most businesses do not have a well-defined KB, and therefore entity linking requires developing domain knowledge. Since enterprises have limited time to build entity linking knowledge, we claim that an interpretation of the entity linking architecture that allows rapid prototyping and incremental improvement is required. The main aim is to secure the enterprise’s engagement with the system in a short period of time and then evolve it. In this paper, the development of a domain-specific entity linking system is investigated from an engineering perspective. An architecture and an evolutionary methodology for building entity linking applications are proposed, and a case study that explores their applicability is presented.
Keywords: entity linking; domain knowledge development; text corpus development; text classification; methodology
Review 1 (by Ali Khalili)
This paper proposes an architecture and an evolutionary methodology that help reduce the effort of building an annotated corpus and a rich knowledge base, which are used as inputs to an entity linking task. The approach is then applied in the fashion industry domain. My impression is that the authors fail to show by how much the burden (of building an annotated corpus or a rich knowledge base) is reduced. Is it really less of a burden if, in the end, the same corpus or KB must be built, whether in a single iteration or incrementally? I guess it depends on what is available at the start. This makes me wonder how much data, and what type of data, is needed at the start. While the paper aims at diminishing the corpus creation burden, it only briefly mentions data curation in the corpus development step, as if data curation were a trivial and quick task. This leads to the observation that the paper is missing a clear chronology of who does what and when, and approximately how much time is required. Finally, with respect to the evolutionary aspect of the system, I wonder at what iteration one can start relying on the linking results. What do you do with entities that are found but not present in the system? How does the system’s evolution affect the linking or the disambiguation quality?
Impression
Overall, the article has some interesting ideas worth publishing, but it needs a major revision. Below are a few reasons why.
(1) This sentence, for example, like many others, requires too much effort from the reader to figure out what the authors want to say: “Entity linking systems use annotated text data or relations in the KB to construct an entity model which assists to decide the mentioned entity.”
(2) The paper is full of missing articles, and this makes some of the sentences very difficult to understand. For example, in Section 3.1 (with the missing articles added in capitals): Shen et al. introduce A general process of entity linking systems. We depicted an abstract architecture and process in Figure 2.
An entity linking system consists of three main components: A knowledge base, AN annotated corpus, and A learner. First, A knowledge base is A database of entities and their relationships. This component may be a large-scale database such as DBpedia, an enterprise relational database, a domain ontology, or only a list of entities. Second, A corpus is a collection of texts which are (IS) used to learn language models, and AN annotated corpus consists of texts which include marked words and phrases denoting entities in knowledge base. AN Annotated corpus enables recognizing context of an entity in terms of surrounding words and entites (entities). Third, A learner is the key component that includes the learning algorithm. It may employ relations in the knowledge base and words in the corpus to create a model which decides about linking mentions to entities.
(3) The paper is too vague and very confusing, at least for me. It needs a better storyline and a clearer contribution.
(4) The paper needs to clearly show how simple or how complex it is to build what is being proposed.
Introduction
The definition at the start of the introduction mixes two tasks, namely entity detection and entity linking. This is not a good summary or reference of .
Approach
The description of the approach is too vague to understand what all the preconditions are for it to work. For example, does the classifier need a ground truth? Is a class-based context better than an entity-based context? What does this passage mean: “Using text classification is similar to incorporation of entity types and entity categories into entity linking process. Classes of taxonomy provide context for similar entities. This context also supports addition of new similar entities into the system.” Also, the methodology does not make explicit what is actually evolving: the linking methodology, just the number of links generated at each linking iteration, or the taxonomy. The text is too confusing.
In the paragraph about taxonomy specification, it is not clear who does what. For example, who does the analysis of suitability? How are the relations constructed? “If the knowledge is required to be developed from scratch, it is grown step by step in iterations.” How? It is not clear whether the disambiguation step is always distinct from the pruning step. Overall, this section is too vague. It gives a lot of possibilities but never picks one for an in-depth explanation. The case study does not help in better understanding the approach and the contribution.
After reading the authors' response, I am afraid the rebuttal has not changed my opinion.
Review 2 (by anonymous reviewer)
The paper is about an approach for the "interpretation of general entity linking architecture and an evolutionary methodology to develop according to this interpretation" (quote from the beginning of Section 3). The methodology is described as an "entity linking application development methodology". The paper is full of such very abstract statements. I found it fine to read in the beginning and was eagerly waiting for a more concrete description of the problem considered in the paper, or any technical details, or even just numbers, but they never came. Figure 1 provides an example sentence classified as "Sport" as opposed to "Economy", which helps to disambiguate the word "embargo" in the example sentence to "Transfer Embargo" and not "Economic Embargo". Using such information for disambiguation is standard in entity linking, but the caption says "Example of a proposed entity linking architecture", implying that there is something new about this approach. Figure 5 provides an example of the iterative development process, but the example is itself abstract, which somewhat defeats the purpose of an example. Section 4 is entitled "Case Study" and I was finally expecting at least the description of a concrete dataset, but the only thing made concrete here was the domain ("logistics" and "fashion"). There is no evaluation of any kind. A "text search" is mentioned, but it is not clarified which text is searched or for which purpose. Figure 6 provides a screenshot of the "designed application" with the caption "Fashion entity linking application offers search and analytics capability". It does not become more concrete than that. The article "the" is missing quite often throughout the text (see). In various places, wrong words are used, like "constructured" or "gaved". This by itself does not pose a major problem for readability, however. I have read the response letter from the authors and it did not change my assessment in any way.
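For readers unfamiliar with the kind of class-based disambiguation described for Figure 1, the following is a minimal Python sketch of the general idea only: classify the text first, then prune candidate entities whose taxonomy class disagrees with the prediction. It is not the authors' implementation. The keyword-overlap classifier, the keyword lists, and all names (`Entity`, `ENTITIES`, `CLASS_KEYWORDS`, `classify`, `link`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str            # entity label in the (toy) knowledge base
    surface_form: str    # word that may mention the entity in text
    taxonomy_class: str  # class of the domain taxonomy the entity belongs to


# Hypothetical toy knowledge base built from the Figure 1 example.
ENTITIES = [
    Entity("Transfer Embargo", "embargo", "Sport"),
    Entity("Economic Embargo", "embargo", "Economy"),
]

# Hypothetical keyword lists standing in for a trained text classifier.
CLASS_KEYWORDS = {
    "Sport": {"club", "transfer", "player", "league"},
    "Economy": {"trade", "export", "sanctions", "tariff"},
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each token."""
    return [token.strip(".,;:!?\"'").lower() for token in text.split()]


def classify(text):
    """Return the class whose keywords overlap most with the text.

    Ties are resolved in favour of the class listed first.
    """
    tokens = set(tokenize(text))
    scores = {cls: len(tokens & words) for cls, words in CLASS_KEYWORDS.items()}
    return max(scores, key=scores.get)


def link(text):
    """Generate candidates by surface form, then prune by predicted class."""
    tokens = set(tokenize(text))
    predicted = classify(text)
    candidates = [e for e in ENTITIES if e.surface_form in tokens]
    return [e for e in candidates if e.taxonomy_class == predicted]
```

Under these assumptions, `link("The club cannot sign a new player because of the embargo.")` keeps only "Transfer Embargo", because the sentence is classified as "Sport" and the "Economy" candidate is pruned.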
Review 3 (by anonymous reviewer)
The paper discusses an architecture and a methodology for building entity linking applications. The authors also discuss a case study to demonstrate their approach. The approach is based on the idea of integrating the entity linking problem with a classification solution. Both the proposed entity linking architecture and methodology also involve domain experts. The paper is well written and the authors propose a set of points that need to be made clear in order to have a better architecture and methodology for developing the entity linking architecture of interest. The contribution of the paper seems to be marginal for the ESWC conference and hence I mark it as a borderline paper. I have read the rebuttal of the authors but unfortunately my opinion on the paper did not change.
Review 4 (by anonymous reviewer)
I have read the rebuttal but it has not changed my opinion. The authors present a methodology for performing entity linking in a domain-specific setting where no resources are available. They use a simple case study to validate their proposal. While they claim that "The case study shows that useful domain specific entity linking applications may be employed in many industrial domains.", I have serious reservations about the claim that it is useful in "many industrial domains", as this cannot be shown with a single case study.
Main strengths:
- Interesting idea.
Main weaknesses:
- I am not convinced by either the method or the case study.
- Even though the proposed methodology needs neither a KG nor an annotated corpus, it does apparently depend "on hierarchical domain taxonomy since hierarchy enables evolutionary development" (not sure what that means though).
More detailed comments:
- "Classification method can be revised or changed through iterations." Based on what?
- "After classifying the text, entities incompatible with the class prediction are pruned." How is this determined?
- The criteria in Section 4 are rather "soft". I would love to have seen actual comparative experiments.
Review 5 (by Anna Tordai)
This is a metareview for the paper that summarizes the opinions of the individual reviewers. The paper presents a method for entity linking and includes a case study. The reviewers question whether the motivation for the approach - to reduce the burden of building a knowledge base and an annotated corpus - is really addressed. Also, the conclusions about general applicability of the approach based on the use case are doubtful. The reviewers note that the paper is hard to read. Laura Hollink & Anna Tordai