Paper 136 (Resources track)

DBpedia NIF: Open, Large-Scale and Multilingual Knowledge Extraction Corpus

Author(s): Milan Dojchinovski, Julio Hernandez, Markus Ackermann, Amit Kirschenbaum, Sebastian Hellmann

Full text: submitted version

Abstract: In the past decade, the DBpedia community has put significant amount of effort on developing technical infrastructure and methods for efficient extraction of structured information from Wikipedia. These efforts have been primarily focused on harvesting, refinement and publishing semi-structured information found in Wikipedia articles, such as information from infoboxes, categorization information, images, wikilinks and citations. Nevertheless, still vast amount of valuable information is contained in the unstructured Wikipedia article texts. In this paper, we present DBpedia NIF – a large-scale and multilingual knowledge extraction corpus. The aim of the dataset is two-fold: to dramatically broaden and deepen the amount of structured information in DBpedia, and to provide large-scale and multilingual language resource for development of various NLP and IR task. The dataset provides the content of all articles for 128 Wikipedia languages. We describe the dataset creation process and the NLP Interchange Format (NIF) used to model the content, links and the structure the information of the Wikipedia articles.
The dataset has been further enriched with about 25% more links and selected partitions published as Linked Data. Finally, we describe the maintenance and sustainability plans, and selected use cases of the dataset from the TextExt knowledge extraction challenge.

Keywords: DBpedia; NLP; IE; Linked Data; training; corpus

Decision: reject

Review 1 (by Heiko Paulheim)

The paper describes the release of a new dataset of textual information from Wikipedia in a structured form, i.e., NIF. The dataset is used for a continuous challenge, two entries of which are also briefly mentioned in the paper.
On the positive side, this paper describes a dataset which could have the potential of a wide adoption, thus, is a nice fit for the ESWC resource track. The dataset is created along with DBpedia in its ecosystem, and since DBpedia is one of the longest running and most continuously developed datasets in the LOD cloud, continuity in the provision of the dataset seems to be granted.
As far as the organization of the paper is concerned, there is some room for improvement. The paper mixes the input data (i.e., texts from Wikipedia in different languages and their characteristics), and the creation of a novel resource. For example, the input data is described in sections 3.1 (structure of an article), 3.3 (differences between languages), 4 (linkage policies for Wikipedia, and differences between languages, with the bot argumentation for Cebuano being repeated redundantly). Likewise, the description of the dataset and its generation is scattered across sections 3 and 4.
For the approach, the design decisions could be made clearer. The authors rely on using a MediaWiki instance for parsing the abstracts, which is a disruptive change w.r.t. earlier renditions of the DBpedia Extraction Framework and should be properly discussed. Moreover, according to our own experience in the area with the DBkWik dataset, MediaWiki can be configured in quite a few ways, and it would be interesting to know how that affects the approach -- e.g., are there different configurations used for different language editions of Wikipedia?
When it comes to the creation of additional links, the description is very short and could use some more details. Moreover, it seems like the only additional links that are added are subsequent links to pages that have already been linked to. This is not very intuitive when it comes to the guideline principles sketched in the beginning of section 4, as it only addresses shortcomings caused by the first principle, but not the other three. Furthermore, according to my understanding, surface forms are only used from the article; it is unclear why surface forms collected globally from Wikipedia are not used for the linked resources instead. In addition, when inspecting the dataset, it is not clear why external pages (not even necessarily Linked Data sources!) have been chosen as link targets for the resource at hand (e.g., the link target for USA in the US article is
As far as the evaluation of the additional links created is concerned, the protocol is a bit questionable. Picking pages with a high PageRank introduces a strong bias towards head entities and also skews the topical coverage of the sample (e.g., for English, 7 out of 10 pages are related to countries). Additionally, many additional links refer to the same resources: e.g., for the US example, 112 out of 695 newly introduced links refer to the entity US; i.e., a sample of 30 links will contain this link about five times. Furthermore, the discussion of the results (especially: the strong difference between English and German, with the somewhat surprising finding that the approach works better for German than for English) is a bit narrow. Dataset-wise, I would like to see a distinction between existing and additional links, because the latter are obviously of lower quality than the former, and a data consumer should have a chance to include or exclude the links.
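The reviewer's "about five times" follows from the expected count when sampling without replacement (a hypergeometric expectation), using the numbers quoted in the review; a quick sanity check:

```python
# Numbers from the review: 695 newly introduced links in the US article,
# 112 of which point to the same entity (US); the sample size is 30.
sample_size = 30
total_links = 695
dominant_links = 112

# Expected occurrences of the dominant link in the sample
# (hypergeometric expectation: n * K / N).
expected = sample_size * dominant_links / total_links
print(round(expected, 1))  # ~4.8, i.e. "about five times"
```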
For the motivation, there are quite a few approaches that could make use of such a dataset (e.g., [1-4]), so the selection of approaches (par. 2 in section 1) is somewhat arbitrary. Furthermore, including DBpedia Spotlight in a list of works claimed not to "have achieved significant impact and recognition within the Semantic Web and NLP community" is a bit strange, given that this work has almost 800 citations according to Google Scholar and is widely well known in the DBpedia community. Moreover, given the overall narrative, the paper targets relation extraction, while many of the works cited here are NER/NED and not relation extraction works.
The description of the dataset also has some room for improvement. Simply counting triples leads to an impressive number of billions of triples, but one could argue that this is partly due to the verbosity of the underlying NIF schema. Furthermore, table 1 should report the average length of the articles, as this seems to correlate pretty well with the mean links per article and provides an additional explanation for the variety (at least a very brief calculation I did myself by dividing paragraphs by articles as a proxy for article length yielded a number with a high correlation with the mean links per article). Moreover, the last line in table 1 should report the sum for *all* languages, not just the 10 largest, plus: summing up the last two columns is not a very meaningful thing to do. In table 3, the meaning of "unique annotations" remains unclear.
The paper requires careful proofreading, as it contains quite a few language mistakes (particularly: missing articles). 
In summary, I have mixed feelings towards this paper. It describes a dataset which can be a very useful resource, but the paper has some flaws which should be addressed.
[1] Hofmann, A., Perchani, S., Portisch, J., Hertling, S., & Paulheim, H. (2017). DBkWik: towards knowledge graph creation from thousands of wikis. In International Semantic Web Conference (Posters and Demos).
[2] Nguyen, D.P.; Matsuo, Y.; Ishizuka, M. Relation extraction from wikipedia using subtree mining. National Conference on Artificial Intelligence, 2007
[3] Aprosio, A.P.; Giuliano, C.; Lavelli, A. Extending the Coverage of DBpedia Properties using Distant Supervision over Wikipedia. In: NLP&DBpedia, 2013
[4] Heist, N.; Paulheim, H.: Language-agnostic Relation Extraction from Wikipedia Abstracts. In: ISWC, 2017
I have read and appreciate the authors' remarks. It would be helpful if you tried to incorporate as many of those as possible into a revised version to make the paper self-contained.

Review 2 (by Tommaso Pasini)

The main contribution of the paper is a version of Wikipedia structured in the NIF format, which could effectively make life easier for those who want to parse Wikipedia and use it for any kind of task. Moreover, the authors also propose a method for enriching Wikipedia pages by adding links within each page, propagating those already existing.
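The propagation idea the reviewer describes can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: it assumes Wikipedia's convention of linking only the first mention, and re-links later occurrences of an already-linked surface form by exact string matching.

```python
import re

def propagate_links(text, links):
    """Propagate existing links to all exact mentions of a surface form.

    text  -- plain article text
    links -- dict mapping an already-linked surface form to its target page
    Returns sorted (start, end, surface_form, target) annotations,
    including new ones found beyond the first (originally linked) mention.
    """
    annotations = []
    for surface, target in links.items():
        # word boundaries avoid matching inside longer words
        for m in re.finditer(r'\b' + re.escape(surface) + r'\b', text):
            annotations.append((m.start(), m.end(), surface, target))
    return sorted(annotations)

article = "Berlin is the capital of Germany. Berlin is also its largest city."
existing = {"Berlin": "dbr:Berlin", "Germany": "dbr:Germany"}
for start, end, surface, target in propagate_links(article, existing):
    print(start, end, surface, target)
```

As the reviews point out, exact matching like this cannot handle ambiguity: a surface form that is linked once in a page is simply assumed to keep the same sense everywhere in that page.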
There are a few missing citations:
Flati et al. 2014 - Two is bigger (and better) than one: the Wikipedia bitaxonomy project
Flati et al. 2016 - MultiWiBi: the multilingual Wikipedia bitaxonomy project
Raganato et al 2016 - Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia
These works have already addressed, at least partially, the extraction of knowledge from Wikipedia free text and the enrichment of Wikipedia pages by link propagation.
The enriched dataset should be further evaluated with extrinsic experiments (e.g., what happens if you re-train some NER system on your enhanced dataset?).
While the resource would be valuable, the experimental part is not strong enough to state how good the propagated links are.
I have read and appreciate the authors' remarks. I still think that the experimental part is too weak to get the paper accepted.

Review 3 (by Hamed Shariat Yazdi)

The paper presents DBpedia NIF, a large corpus of multilingual knowledge extracted from Wikipedia. The difference between this and DBpedia is that the former focuses on the extraction of valuable information available in Wikipedia free text, while the latter has so far focused on the extraction of structured information in Wikipedia, e.g., infoboxes, links, etc. The dataset presented in the paper could be quite useful in natural language processing as well as information retrieval tasks. The dataset covers 128 Wikipedia languages. Moreover, the authors have enriched the dataset with about 25% more links through further processing of the text. The authors have also announced a sustainability and update plan in order to address the lack of proper support in existing comparable datasets.
-The positive aspects:
The idea of the paper is very interesting and helpful for different applications in the Semantic Web and NLP communities. DBpedia NIF, proposed in the paper, exploits the free text and unstructured information in Wikipedia articles and semantically describes them using the recently proposed NLP Interchange Format. The dataset provides the content of 128 languages, which makes it generic and helpful to be referred to and used in the future. Moreover, the dataset provides links in articles and enriches them from this viewpoint.
-The aspects which should be improved:
In Section 3.3 the authors mention that they have already published the English subset of their dataset according to the Linked Open Data principles. They highlight that they will publish the rest of it if there is a need within the community. I think one of the prominent aspects of this work is that it is a multilingual dataset which covers 128 languages. Therefore, I strongly believe that the whole dataset should be published by the authors to maximize the effect. Moreover, all language versions should be on the sustainability and update plan.
Sec. 3.2, paragraph 1:
"... to describe the position of the sting ..." ("sting" should be "string").
<<Potential impact>>
* Does the resource break new ground?
Yes, it provides the free text in Wikipedia in a semantically structured form. Also, the dataset is provided for a great many languages, which increases its usability by other experts in the scope of NLP, IR and the Semantic Web.
* Does the resource plug an important gap?
Yes, DBpedia focuses on structured information in Wikipedia, including infoboxes. However, there is much more information in the text of articles that could usefully be extracted for future uses. Also, it is important to present the extracted information semantically, as the paper does by employing NIF. The overall idea of adding links to the articles is helpful for different future NLP tasks. However, there are a few concerns about it which will be mentioned later in this review.
* How does the resource advance the state of the art?
Providing the unstructured information collected in Wikipedia articles in a semantically structured form gives the dataset great potential to be used over and over by others working on different NLP tasks. Being multilingual makes the dataset more unique compared to other existing work. However, at the moment, only the English subset of the dataset is published according to the Linked Data principles with dereferenceable URIs, and other languages will be added upon request. As I believe there is a considerable demand for this, the whole dataset should be published.
* Has the resource been compared to other existing resources (if any) of similar scope?
The paper contrasts DBpedia NIF with DBpedia to show the advantage of the proposed dataset. Also, some comparisons were mentioned in the literature review to provide a deeper background about the existing work considering free text for information extraction. It would be helpful to add a comparison with possibly reported numerical results regarding the qualitative aspects of competitive datasets. This could be more useful than reporting on the winners of the TextExt challenge.
* Is the resource of interest to the Semantic Web community?
Yes, the semantic representation of free text, the coverage of many languages, the addition of links to the text, etc. are interesting approaches used in the dataset, which will be of interest to the Semantic Web and NLP communities.
* Is the resource of interest to society in general?
The resource is very generic. It provides great potential and opportunities for others who want to take advantage of a large-scale, rich multilingual resource.
* Will the resource have an impact, especially in supporting the adoption of Semantic Web technologies?
Others can be inspired by the framework used in this paper for creating the resource and can produce counterparts in different domains. Also, the dataset is one example of the use of NIF to present information. Overall, in my opinion, the techniques used in the paper will have an impact on the community.
* Is the resource relevant and sufficiently general, does it measure some significant aspect?
The resource is relevant and general enough to be used in different tasks. It has many advantages, including the great potential of abundant free multilingual text and links in Wikipedia articles.
<< Reusability >>
* Is there evidence of usage by a wider community beyond the resource creators or their project? Alternatively, what is the resource’s potential for being (re)used; for example, based on the activity volume on discussion forums, mailing list, issue tracker, support portal, etc?
Yes, for the same reason that DBpedia was widely used by others, this resource will be used by others. The dataset is very rich and wide, as it extracts much more information from Wikipedia and represents it in a proper form. Being available online like DBpedia, being updated like DBpedia, and covering a wide range of languages will make it popular and useful for many future tasks. The authors provide a site to regularly include more information related to the dataset and to receive feedback from others. The DBpedia-discussion mailing list, the TextExt challenge and the DBpedia Framework issue tracker are some channels for maintenance and receiving feedback. Also, for all resources, the authors mint the URIs in the DBpedia namespace, which is flexible for publishing different versions of the dataset.
* Is the resource easy to (re)use? For example, does it have good quality documentation? Are there tutorials available? etc.
The authors provide a site that includes information about the dataset and download links for it. However, the different parts of the project could be documented better, in a unified way.
* Is the resource general enough to be applied in a wider set of scenarios, not just for the originally designed use?
The dataset is extracted from Wikipedia, which is very rich and general. Therefore, the dataset will also be widely used in diverse ways.
* Is there potential for extensibility to meet future requirements?
Yes, the authors provide flexible URIs, and the dataset will be updated regularly (the authors claim). Also, the dataset is published according to the Linked Open Data principles.
* Does the resource clearly explain how others use the data and software?
In my opinion, some more information could be included about how to use the data.
* Does the resource description clearly state what the resource can and cannot do, and the rationale for the exclusion of some functionality?
The authors provide useful information about the resource, its coverage and its limitations.
<< Design & Technical quality >>
* Does the design of the resource follow resource specific best practices?
The overall effort put into creating the dataset was appropriate. However, in the part of the paper where links are added, some conditions, such as ambiguity in terms, abbreviations in the articles, etc., could be investigated more deeply in future releases. Using NIF to provide semantically structured information was one of the most interesting parts of the paper.
* Did the authors perform an appropriate re-use or extension of suitable high-quality resources?  For example, in the case of ontologies, authors might  extend upper ontologies and/or reuse ontology design patterns.
* Is the resource suitable to solve the task at hand?
The authors extend the existing work, i.e., DBpedia, in another direction by considering free text. The number of languages is increased and some other improvements are added to enrich the dataset.
* If the resource proposes performance metrics, are such metrics sufficiently broad and relevant?
The authors evaluated the dataset in terms of syntactic validity and semantic accuracy, and the results are conclusive.
* If the resource is a comparative analysis or replication study, was the coverage of systems reasonable, or were any obvious choices missing?
<< Availability >>
* Is the resource (and related results) published at a persistent URI (PURL, DOI, w3id)?
Yes, the link is provided in the paper.
* Does the resource provide a license specification?
The license is not mentioned in the paper; however, as it is published on the DBpedia website, one can assume the same license applies. It would be better if the authors explicitly mentioned it in the paper.
* Is there a sustainability plan specified for the resource? Is there a plan for the maintenance of the resource?
The paper has a suitable sustainability and update plan which helps to maximize its broad acceptance and usage within the community.
===== POST REBUTTAL =====
First, I would like to thank the authors for the clarification of some issues raised during the discussion. However, following the discussions and the rebuttal, I think the points Heiko (review 1) raised need to be better addressed by the authors in order to increase the impact and improve the results and outcomes. In the current state of the paper, those are not addressed. Therefore, despite the fact that the goals of the work are valuable and it has the potential of being used widely by the community, I decrease my score to weak accept.

Review 4 (by Francesco Corcoglioniti)

The paper describes DBpedia NIF, an RDF dataset using the NIF vocabulary that provides page texts (sections, paragraphs, titles) and inter-page links for multiple Wikipedia chapters, the EN one also available as Linked Open Data. DBpedia NIF is automatically produced from Wikipedia dumps by properly handling Wiki markup, and it is enriched with additional links for entities that are linked only once in a Wikipedia page (as per Wikipedia rules). The dataset is being used in the TextExt challenge series for knowledge extraction systems.
POTENTIAL IMPACT. DBpedia NIF provides a Semantic Web compliant, large-scale multilingual resource that nicely complements DBpedia, Wikidata and any other knowledge resource grounded on Wikipedia. It is thus of interest to the Semantic Web community, and more in general to any application using Wikipedia texts, as it provides a ready-to-use, already cleaned up (via parsing of MediaWiki-rendered text) and enriched (via link addition) version of Wikipedia dumps. This is particularly relevant for NLP applications, and DBpedia NIF may have the effect of fostering the adoption of NIF beyond the Semantic Web community. Concerning novelty, I'm not aware of other resources providing NIF coverage of Wikipedia text. While related to DBpedia NIF, the other resources mentioned in the paper are mainly specific to the tasks of Named Entity Recognition and Entity Linking, and thus serve different goals than DBpedia NIF. 
DESIGN & TECHNICAL QUALITY. The RDF modeling, including the choice and the reuse of existing vocabularies (e.g., NIF, ITSRDF), appears sound and appropriate. The approach for enriching the datasets with new links is straightforward, and its results in terms of amount of added links and precision (62% anchors+links for EN Wikipedia) should be improved to make these links usable in practice, although this is not essential to the current usefulness of the dataset. A different choice of URIs, e.g., denoting DBpedia/Wikipedia version and page in the base URI and using a hash fragment to denote context and begin/end indexes, may ease Linked Open Data publishing by enabling the use of static file serving (one RDF file per Wikipedia page) and better leveraging of web caches. No machine readable description of the dataset seems available.
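The reviewer's alternative URI scheme can be illustrated with a small sketch. Both URI patterns below are hypothetical (neither is the actual DBpedia NIF scheme); they only contrast offsets in the query string against offsets in the hash fragment:

```python
# Hypothetical illustration of the two URI styles discussed above.

def query_style_uri(page, version, begin, end):
    # Offsets in the query string: every annotation is a distinct
    # resource to the server, which is hard to serve from static files.
    return (f"http://dbpedia.org/resource/{page}"
            f"?dbpv={version}&nif=phrase&begin={begin}&end={end}")

def hash_style_uri(page, version, begin, end):
    # Offsets in the fragment: all annotations of a page share one base
    # URI, so a single static RDF file per page can serve them all, and
    # web caches see one cacheable resource.
    return (f"http://dbpedia.org/data/{version}/{page}"
            f"#char={begin},{end}")

print(query_style_uri("United_States", "2016-04", 0, 32))
print(hash_style_uri("United_States", "2016-04", 0, 32))
```

The `#char=begin,end` fragment form mirrors the RFC 5147 text fragment identifiers that NIF builds on, but the base-URI layout here is purely illustrative.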
AVAILABILITY. DBpedia NIF is publicly available under a Creative Commons license (CC BY-SA) on a website affiliated to DBpedia, with extraction code hosted on GitHub. Although the paper claims availability of data for 128 Wikipedia chapters and a total of 9B triples, it seems that currently only data for 9 Wikipedia chapters can be downloaded as RDF dumps (Wikipedia dump used for DBpedia 2016-04), covering only page abstracts for a much more limited number of triples (335M triples for the EN chapter). Data of the EN chapter should be available as Linked Open Data, but the EN URIs I tried to dereference were not accessible (tried both URIs from Listing 1 and from one of the EN dump files). Sustainability is provided via support from the DBpedia Association (hosting, computation) and via community involvement also through the TextExt challenge series, whose outcomes are expected to be integrated in DBpedia NIF.
REUSABILITY. Using DBpedia NIF should be easy as the dataset has a rather simple structure, is well packaged, and uses the NIF vocabulary that is well documented, so no additional information should be needed (and indeed the dataset web page is rather essential to that respect). Third party reuse is reported within the TextExt challenge, and the dataset has the potential to be extended - within and outside TextExt - with additional annotations layers that may further enrich it, possibly giving rise, in the long term, to a comprehensive knowledge extraction corpus for Wikipedia texts.  
Summing up, while conceptually simple and still a work-in-progress (URI dereferencing not working, only abstracts for 9 chapters available online, limited enrichment precision), the presented DBpedia NIF dataset represents a useful resource within and outside the Semantic Web community, which may result in fostering the adoption of NIF and Semantic Web standards among a wider audience. As users might be interested in producing a NIF dataset out of Wikipedia dumps different from the ones backing DBpedia releases, I suggest that the authors also document and facilitate third-party reuse of the NIF extraction framework they implemented.
Concerning the quality of writing, the paper is clear and well organized but contains many typos and grammatical errors.
In general, there are many missing articles, such as:
* "still vast amount ... is contained" -> "a vast amount" (abstract, check also other occurrences of "amount");
* "In past decade" -> "In the past decade" (§1)
* "develop robust extraction process for extraction of the information" -> "a robust", "the extraction of the information" (§1)
* ...
Then, I suggest the following corrections:
* "refinement" -> "refining" (abstract)
* "an overall growth of 296% datasets" -> "an overall 296% growth in datasets" (§1)
* "and citation." -> "and citations." (§1)
* "aims at published structured knowledge" -> "publishing" (§2)
* "the authors propose use" -> "to use" (§2)
* "manipulate with the content and prepares it" -> remove "with", "prepare" (§3.1)
* "position of the sting using offsets and denote its length" -> "string", "denoting" (§3.2)
* "In a same manner" -> "In the same manner" (§3.2)
* "using the ... and nif:hasParagraph property" -> "properties" (§3.2)
* "Following listing provides" -> "Listing 1 provides" (§3.2)
* "withing" -> "within" (§3.2)
* "is considerable large" -> "considerably" (§3.3)
* "for creation of dataset" -> "the dataset" (§3.4)
* "put advance knowledge extraction technologies in action" -> "advanced" (§3.4)
* "In case of an overlapping matches" -> remove "an" (§4)
* "In overall" -> just "Overall" (§4)
* "extraction knowledge from HTML ... an significant effort. Following three ..." -> "extracting", "a significant", "The following three" (§5.1)
I would also suggest the authors to:
* check rows 13 and 26 (wrong objects) in Listing 1
* for consistency, add thousand separators in totals of Table 1
* clarify the meaning of "Unique annotations" in Table 3
* check their use of "knowledge extraction" to describe the way their resource is built, as rather than eliciting the semantic content of text, their process is about a more straightforward format conversion and cleanup (with a bit of enrichment) of existing Wikipedia XML dumps
POST REBUTTAL UPDATE. I acknowledge the response of the authors and the availability of NIF data for the other DBpedia chapters (published 26 February), which addresses one of the points I raised. I'm still not convinced of the string matching approach proposed for link enrichment and the justifications given in the rebuttal for its design and evaluation. I agree with optimizing precision, but I think the proposed approach falls short of achieving that goal given the numbers reported in Table 4, as I pointed out in my review. Concerning the authors' justification, their implicit assumption that "one linked surface form = one sense within a page" does not hold in general (e.g., see "Greek" in the "Athens" EN page, but other examples can be easily found via querying / grep). In that respect, considering other surface forms as suggested may even help identify non-ambiguous surface forms in the whole Wikipedia, which can then be used to generate more precise links. While I invite the authors to improve this aspect of their work (along with addressing other comments), I don't consider link enrichment the main contribution of the submission and I thus confirm my score.
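The reviewer's suggestion of restricting propagation to surface forms that are unambiguous corpus-wide can be sketched as follows. This is a hypothetical helper, not part of the authors' framework; the "Greek" example echoes the ambiguity case mentioned above:

```python
from collections import defaultdict

def unambiguous_surface_forms(link_corpus):
    """Keep only surface forms that link to a single target corpus-wide.

    link_corpus -- iterable of (surface_form, target) pairs harvested
                   from existing links across all Wikipedia pages
    """
    targets = defaultdict(set)
    for surface, target in link_corpus:
        targets[surface].add(target)
    # A form observed with exactly one distinct target is kept as safe.
    return {s: t.pop() for s, t in targets.items() if len(t) == 1}

corpus = [
    ("Athens", "dbr:Athens"),
    ("Athens", "dbr:Athens"),
    ("Greek", "dbr:Greek_language"),   # ambiguous corpus-wide ...
    ("Greek", "dbr:Greeks"),           # ... so it must be dropped
]
print(unambiguous_surface_forms(corpus))  # keeps only "Athens"
```

Only forms surviving this global filter would then be candidates for link propagation, trading recall for the higher precision the rebuttal discussion calls for.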

Review 5 (by Raphael Troncy)

This paper has been discussed at length among the reviewers and with the resources track chairs. While there is a consensus that the proposed work is valuable and has the potential of being widely used by the community, a number of significant weaknesses have also been raised (e.g., the link evaluation is biased and the link enrichment approach described is rather naive and under-performing). Consequently, we recommend that the authors continue this work and envision a future submission to either the ISWC resources track or as a SWJ dataset paper.
