SIEge: Generating Domain-specific Knowledge Resources for Semantic Information Extraction
Author(s): Emrah İnan, Vahab Mostafapour, Burak Yonyul, Oguz Dikenelli
Full text: submitted version
Abstract: Knowledge bases are used in a variety of semantic data mining tasks such as semantic search, question answering, and information extraction.
There is a vast number of open-domain knowledge bases available for these tasks. However, it is difficult to automatically generate annotated datasets for domain-specific tasks. This study presents a tool called SIEge, a multilingual, domain-specific semantic embeddings generator for semantic information extraction tasks such as entity linking and relation extraction. SIEge also generates evaluation datasets for specific domains using Wikipedia and DBpedia: Wikipedia category pages and the DBpedia taxonomy are used to steer domain-specific annotated text generation. In this study, we make publicly available a semantic embeddings model and evaluation datasets for the entity linking and relation extraction tasks in the movie domain.
Keywords: Evaluation Dataset; Semantic Embeddings; DBpedia; Wikipedia
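The abstract describes restricting dataset generation to one domain via Wikipedia category pages and the DBpedia taxonomy. As an illustration only (this is not SIEge's actual implementation, and the toy category hierarchy below is an invented assumption), that kind of domain selection can be sketched as a breadth-first walk over a category hierarchy, collecting every entity that appears under a chosen root category:

```python
from collections import deque

def collect_domain_entities(subcategories, members, root):
    """Breadth-first walk over a category hierarchy, gathering all
    entities filed under `root` or any of its (transitive) subcategories."""
    seen, entities = {root}, set()
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        entities.update(members.get(cat, ()))          # entities directly in this category
        for sub in subcategories.get(cat, ()):         # descend into subcategories once each
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return entities

# Toy hierarchy loosely modeled on Wikipedia's movie categories (illustrative only).
subcategories = {
    "Films": ["Films_by_genre"],
    "Films_by_genre": ["Science_fiction_films"],
}
members = {
    "Films_by_genre": ["Casablanca"],
    "Science_fiction_films": ["Blade_Runner", "Alien"],
}

movie_entities = collect_domain_entities(subcategories, members, "Films")
# All three films are reachable from the "Films" root category.
```

In practice the hierarchy would come from the Wikipedia/DBpedia category and taxonomy data rather than hand-written dictionaries, and cycle handling (the `seen` set) matters because Wikipedia's category graph is not a strict tree.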
Review 1 (by anonymous reviewer)
The paper aims to present an approach for automatically generating domain-specific knowledge bases. Even if the idea is interesting in principle, from my point of view the paper is not yet ready for publication. The presentation is quite confusing and, considering that it has been submitted to the Resource track, it is partially out of scope for this track. No resources are actually presented, only an approach for extracting information from existing resources. Thus, the novelty is very limited. I am sorry for the short review, but I do not have much more to add. I thank the authors for their effort in preparing the rebuttal. After reading their reply, I confirm the score given earlier.
Review 2 (by Simon Walk)
The authors present SIEge, a multilingual domain-specific semantic embeddings generator for semantic information extraction tasks. The tool uses Wikipedia and DBpedia to generate evaluation datasets. I really liked the general idea of the paper, but the presentation is lacking. Nearly every other sentence is riddled with grammar mistakes. Tenses are frequently mixed up, and I often had trouble understanding what the authors were trying to express. Aside from the difficulty of understanding the paper, this also (at least for me) makes the experiments and stated contributions non-reproducible! In its current state, the paper needs a major revision to fix all of the grammar and tense errors, and it is simply not ready to be accepted at a venue like ESWC. I would like to encourage the authors to spend more time polishing the paper and to continue with their line of research, as I think the idea behind the presented tool is very nice. Nonetheless, given the early stage the paper appears to be in, I have to recommend rejecting the paper. I thank the authors for their reply, but I will keep my original evaluation.
Review 3 (by anonymous reviewer)
The paper presents a system for generating semantic metadata in a multilingual, domain-specific setting. The tool also generates evaluation datasets for given domains by exploiting DBpedia and Wikipedia. It is well written and organized, the proposal is presented accurately, and the most relevant elements of this resource are treated in the right level of detail. To be honest, the proposal is not completely novel: similar approaches already exist, and solid tools have already been released. However, the authors state that their system differs from existing ones in being suitable for domain-specific knowledge resources. To the best of my knowledge this is true, and it could make the proposal interesting. On the other hand, the related work survey is complete and exhaustive. Among the main weaknesses of the resource are the dependence on DBpedia and the considerable computational impact this is likely to have on common computing platforms. Unfortunately, this last point is not addressed in the paper. The evaluation section is the weakest, as it does not report any performance assessment (turnaround time, memory peaks, and so on). Finally, some citations are outdated and should be refreshed. Thanks for the rebuttal; I confirm my score. I basically liked the paper, and a more careful preparation of the manuscript could probably lead to an acceptance at the next edition of the conference. For my part, the authors are encouraged to pursue their work.
Review 4 (by Elena Cabrio)
The paper "SIEge: Generating Domain-specific Knowledge Resources for Semantic Information Extraction" describes a tool suite that includes algorithms to generate embeddings and datasets for specific domains in Wikipedia, mainly for the information extraction task. The use case scenario chosen to show the applicability of the proposed tool suite is the movie domain (with DBpedia as the KB). The topic of the paper is potentially interesting for the ESWC conference; however, my main concern is the amount of novelty of the proposed approach with respect to state-of-the-art work on the topic. Quite simplistic algorithms are applied for the KB extraction (and they are very targeted to the selected use case), and the experimental setting is narrow and quite poor (and lacks important details). The paper is written in a quite confusing way, which makes it hard to follow and to properly evaluate the proposed contributions (the authors should work on restructuring the paper and clarifying some passages, to help the reader not lose the thread of the paper). Questions and remarks:
- Define what a domain-specific KB is from the beginning (is it related to one DBpedia property only, as in the Movie example?)
- Algorithm 1: add some examples of the obtained results
- "Embeddings models or representing data in a low-dimensional vector spaces, outperformed information extraction algorithms.": avoid this kind of claim if it is not supported by references
- Whenever you cite an approach, an algorithm, or a peculiar data structure, you should add a reference (too many things are taken for granted)
- HDT (Header, Dictionary, Triples) is defined too late in the paper
- Why was the 2015 DBpedia version chosen (and not a more recent one)?
- "The movie dataset involves 945 annotated documents and 3648 entities": who annotated this dataset?
- The last paragraph of page 9 is not clear to me. Which triples have been extracted?
- Fig. 2: where does this annotation come from?
- No multilingualism is mentioned in the rest of the paper.
The paper requires proofreading; there are plenty of typos and weird phrasings that make it quite hard to follow. E.g.:
- [abstract] "it is difficult automatically to generate" -> "it is difficult to automatically generate"
- [abstract] "we publish semantic..." -> weird phrasing ("we make available")
- [Intro] "facilitates to produce" -> "facilitates the production/creation"
- [Intro] "They consider" -> who is "they"? (there are no references to papers)
- [Intro] "Why we depend on DBpedia is to contribute" -> ???
- [Intro] "we address these drawbacks" -> which drawbacks?
- ...
I thank the authors for addressing part of my concerns in the rebuttal; however, I confirm that I will keep my score.