A Dataset for Web-scale Knowledge Base Population
Author(s): Michael Glass, Alfio Gliozzo
Full text: submitted version
Abstract: For many domains, structured knowledge is in short supply, while unstructured text is plentiful. Knowledge Base Population (KBP) is the task of building or extending a knowledge base from text, and systems for KBP have grown in capability and scope. However, existing datasets for KBP are all limited by multiple issues: small in size, not open or accessible, only capable of benchmarking a fraction of the KBP process, or only suitable for extracting knowledge from title-oriented documents (documents that describe a particular entity, such as Wikipedia pages). We introduce and release CC-DBP, a web-scale dataset for training and benchmarking KBP systems. The dataset is based on Common Crawl as the corpus and DBpedia as the target knowledge base. Critically, by releasing the tools to build the dataset, we enable the dataset to remain current as new crawls and DBpedia dumps are released. Also, the modularity of the released tool set resolves a crucial tension between the ease that a dataset can be used for a particular subtask in KBP and the number of different subtasks it can be used to train or benchmark.
Keywords: Knowledge Base Population; Knowledge Induction; Slot filling; Benchmark; DBpedia; Common Crawl
Review 1 (by Raghava Mutharaju)
Post rebuttal: Thank you for the response. Please emphasize very early in the paper (Abstract, Introduction) that you are sharing the tool that generates the CC-DBP dataset rather than the dataset itself. Also, list the output generated by the tool clearly (similar to the one mentioned in your rebuttal response).
-------------------------------
The paper describes a dataset called CC-DBP which can be used for training and benchmarking Knowledge Base Population (KBP) systems. This dataset was created using the data from Common Crawl and DBpedia. CC-DBP is also useful for evaluating subtasks of KBP such as entity detection and linking, and context set construction. The paper is well written describing the subtasks of KBP, the related datasets and how CC-DBP improves on them.
Questions/comments:
1) Since CC-DBP is claimed to be modular, is it possible to easily switch the following with other alternatives: DBpedia, Common Crawl, and the libraries used for components such as EDL and CSC? The authors mention that they would like to try using DBpedia Spotlight and Tagme. Would it be an easy switch? In general, can this setup be used to generate benchmarking datasets for KBP systems in other domains such as healthcare, geosciences, aviation etc.?
2) Does CC-DBP include or are there plans to have a gold standard that is manually curated for the subtasks?
3) Figure 12, is there a correlation between entity pairs, shared contexts, and the length of the sentence (number of words)?
4) Page 2, Section 2 last paragraph, "not only" can be removed from the sentence "... to a triple is not only motivated ...".
Review criteria:
1) Potential impact: CC-DBP has been compared to existing datasets that are similar. I think this resource would be of interest to the Semantic Web community and is useful for KBP systems.
2) Reusability: The authors do not provide any evidence of usage of this resource by a wider community beyond the authors and their team. There is some documentation on the github page and combined with the description in the paper, there is sufficient information about the resource. The authors briefly discuss the shortcomings (n-ary relations) in the Conclusion section.
3) Availability: Permanent URL is not provided, but the source code is available on github with an Apache 2.0 license.
Raghava Mutharaju
Review 2 (by Hamed Shariat Yazdi)
The paper targets the knowledge base population (KBP) problem (i.e. extending/building knowledge bases from text) by proposing a dataset (called CC-DBP) which can be used for training and benchmarking KBP approaches. The extracted information is stored as triples with confidence values. KBP consists of subtasks which are logically distinct steps: Entity Detection and Linking (EDL), Context Set Construction (CSC), relation prediction and reasoning. EDL detects entities and co-references and links entities together. CSC is responsible for gathering textual evidence for two entities, i.e. the contexts in which they occur together, in order to predict their relationship. Reasoning is the final step, which predicts and adjusts the confidence of new triples.
The authors introduce relevant datasets (e.g. NYT-FB, TAC-KBP etc.) and discuss how and to what extent different steps of a KBP process could be evaluated on each dataset. In the end they describe their own dataset (CC-DBP) and how they created it. In this regard, they have provided baseline systems for EDL and CSC to enable immediate use of the components for relation prediction and reasoning.
The dataset proposed in this paper has some advantages. It is based on two rich sources, i.e. DBpedia and Common Crawl. The dataset is web scale, and also open and accessible. According to the statistics mentioned in the paper, the dataset has a diversity of relations which have a considerable fraction of instances. Overall, the above-mentioned features make the dataset of interest to the community.
However, the following issues should be clarified or resolved:
The reference  posed a challenge when the KB is not derived from the training text (i.e. an external KB that is not primarily derived from the training text), indicating that the distant supervision assumption is violated in such a condition. Therefore,  proposes a new approach for creation of the related dataset, NYT-FB, to tackle this problem. 
Given that the authors used an external KB which is not derived from the text they used for KBP, it is important that this problem is discussed in the paper, as it affects the quality of the proposed dataset and the final results.
Another concern is that such an approach creates a considerable number of false positives , reducing the accuracy. A pair of nodes may occur in a sentence without the sentence expressing the relation. This should be explained more. How does the proposed approach avoid noisy patterns to achieve better results?
The number of node-pair contexts doesn’t look high considering the large number of sentences in the corpus (173M sent.). How are such results explained (based on the EDL and CSC modules)?
What about languages other than English for the dataset (for the future)? How could the approach be adapted?
The authors wrote the following sentence in the paper: “For CC-DBP, 63% of node-pairs have only a single shared context, while in NYT-FB, 81% of entity-pairs have only a single context”. More explanation of the difference would be useful.
It would be helpful to add some statistics related to the new entities and provide a comparison for that. The currently reported statistics could be improved.
The paper includes some approaches to exclude some relations, and the authors are going to include the last one (mediator) in future. How will these inclusions and exclusions affect the statistics mentioned in the figures, such as the diversity of the relation distribution etc.? Including some information about this will help others to judge about the times that others use the resource.
It would be good to add proper references for the EDL component and its variants.
On page 8, the authors wrote Table 7, but the label of the table is Fig 7.
<< Potential impact >>
* Does the resource break new ground? The dataset is web scale and takes advantage of both the well-known KG DBpedia and Common Crawl as corpus. 
The dataset has considerable diversity in the distribution of relations over instances. The dataset can be used for training and benchmarking KBP systems. The modularity of the tool set addresses both the ease of use for a subtask in KBP and the number of different subtasks for training and benchmarking. The dataset complements the existing ones and is not a pioneer.
* Does the resource plug an important gap? The existing work for KBP suffers from one or more of the following issues: small in size, not open/accessible, capable of benchmarking only a fraction of the KBP process, only suitable for extracting knowledge from title-oriented documents. The proposed approach and dataset remove the mentioned limitations.
* How does the resource advance the state of the art? By providing a tool for KBP and creating a dataset which is richer as it takes advantage of DBpedia and Common Crawl. Such a dataset is for training and benchmarking KBP systems.
* Has the resource been compared to other existing resources (if any) of similar scope? Some statistics were reported to compare the number of sentences, relation types, node-pair contexts and availability. More comparison could be provided in the paper with detailed analysis. However, the existing ones are satisfactory when properly put in contrast. At the bottom of page 12, some information is mentioned about node-pairs with a single shared context for NYT-FB and CC-DBP, but it would be better to use a table and add figures for comparison. Also, using better and different measures for comparison could be helpful.
* Is the resource of interest to the Semantic Web community? The framework and the dataset will be interesting for the community, especially regarding KBP tasks from web text data.
* Is the resource of interest to society in general? Yes.
* Will the resource have an impact, especially in supporting the adoption of Semantic Web technologies? KBP is quite important to elevate KGs and their applications. 
The framework and the dataset are both helpful for developing future KBP systems.
* Is the resource relevant and sufficiently general, does it measure some significant aspect? The resource is relevant and general as it provides some ordered subtasks to be used for KBP and creates a dataset which takes advantage of two general and important resources. The statistics mentioned in the last part of the proposed method show that the relations have a reasonable level of diversity and distribution in comparison to previous work.
<< Reusability >>
* Is there evidence of usage by a wider community beyond the resource creators or their project? Alternatively, what is the resource’s potential for being (re)used; for example, based on the activity volume on discussion forums, mailing list, issue tracker, support portal, etc? The dataset and the framework have the potential to be used by others, especially in KBP. However, it would be better to add some supplementary material, comparison, and more information about the work in order to increase the usability.
* Is the resource easy to (re)use? For example, does it have good quality documentation? Are there tutorials available? etc. In my opinion, this work needs better documentation. Some of the dataset creation steps are mentioned without enough detail. Having a tutorial could increase the impact.
* Is the resource general enough to be applied in a wider set of scenarios, not just for the originally designed use? It can be used for training and benchmarking for KBP and its related activities.
* Is there potential for extensibility to meet future requirements? Yes, the work and the tool for creating it were developed in a way that can be extended appropriately in future, e.g. some subtasks can be changed and revised so that the quality of results is improved, as mentioned in the future work section.
* Does the resource clearly explain how others use the data and software? 
This part wasn’t done in the best way by the authors. There are ambiguities in this regard.
* Does the resource description clearly state what the resource can and cannot do, and the rationale for the exclusion of some functionality? This part also could be described better by the authors. Some interesting advantages were introduced, but the work could be improved by adding more information and contrasts.
<< Design & Technical quality >>
* Does the design of the resource follow resource specific best practices? Although the overall framework and the way the dataset was created are interesting, it still does not follow best practices, and other efforts could be followed to provide better results. For example, some relation filtering could be removed by employing new strategies. Also, there are some issues regarding noise. Handling such conditions could improve the work.
* Did the authors perform an appropriate re-use or extension of suitable high-quality resources? For example, in the case of ontologies, authors might extend upper ontologies and/or reuse ontology design patterns. The authors reorganized some previously proposed subtasks to create the dataset in more flexible and better ways. Further improving each subtask by considering existing problems such as noise etc. and using a better way for relation prediction could improve the work.
* Is the resource suitable to solve the task at hand? The dataset can be used for training and benchmarking for KBP systems.
* Does the resource provide an appropriate description (both human and machine readable), thus encouraging the adoption of FAIR principles? Is there a schema diagram? For datasets, is the description available in terms of VoID/DCAT/DublinCore? The schema diagram was provided. However, it wasn’t enough for readability.
* If the resource proposes performance metrics, are such metrics sufficiently broad and relevant? Investing efforts in this part could improve the work. 
Comparison could be done better and wider and with more detailed information.
<< Availability >>
* Is the resource (and related results) published at a persistent URI (PURL, DOI, w3id)? Yes
* Does the resource provide a licence specification? (See creativecommons.org, opensource.org for more information) A licence was provided and mentioned in the link.
* How is the resource publicly available? For example as API, Linked Open Data, Download, Open Code Repository. Openly available on github.
* Is the resource publicly findable? Is it registered in (community) registries (e.g. Linked Open Vocabularies, BioPortal, or DataHub)? Is it registered in generic repositories such as FigShare, Zenodo or GitHub? Yes, it is available in github.
* Is there a sustainability plan specified for the resource? Is there a plan for the maintenance of the resource? More information could be included related to maintenance and sustainability.
===== POST REBUTTAL ======
I would like to thank the authors for their rebuttal and other PC members for their constructive discussions. One part of our concern is addressed (regarding the percentage) in the rebuttal. However, the point we raised about ref. is not covered satisfactorily. Therefore, we would prefer to stay with our previous score.
Review 3 (by Vinh Nguyen)
Since most of the major concerns I had with the submission have been addressed by the authors, I changed my score.
========
This paper describes the creation of the CC-DBP dataset and a modular tool for benchmarking the knowledge base construction task, which is a very important and timely topic. The new dataset was created by adding triples extracted from the Common Crawl dataset to the existing DBpedia 2016 by using the KBP tool with a set of modular components. The context from which the triple was extracted is also presented. The statistics such as entity/relation distributions are provided.
I think the work is useful but premature to be reused by the community because (1) although the resource described in the paper is the dataset, it is not provided with a downloadable link in the submission, (2) the triples cannot be found in the resulting dataset after running the code, (3) the probability of the triple is not provided as claimed, (4) the tutorial for the tool is not provided, and (5) the dataset is not mentioned to be used or evaluated by any project. While I appreciate the release of the tool generating the dataset, I think more work needs to be done to improve the resulting dataset before it could be reused by the community. If done well, this could be very useful and have a high impact not only in the Semantic Web community.
Particularly,
- Although this resource paper’s main contribution is the CC-DBP dataset, I cannot find the link to download it, even with a sample. Since the inputs (CC and DBP) are available and fixed to a specific version, the output dataset should be fixed too and it should be made available for download. At least one version should be provided.
- For testing, I had to build the code and run the createSmall.sh to generate a small dataset. The one dataset that I found in the tsv file is contexts-part0.tsv, which does not have any metadata for interpreting the meaning of each column. 
Examples from contexts-part0.tsv
(1) br:Court dbr:Bars_(band) unk unk [140,145) [78,82) "Yes, we have arrested them and as I speak to you now they are already behind bars and will answer charges before the Resident Magistrate's Court here in Bariadi," said the RPC. http://24tzonline.blogspot.com/2016/08/chadema-members-arrested-for-incitement.html
(2) dbr:Qatar dbr:Kuwait dbo:PopulatedPlace dbo:PopulatedPlace [80,85) [53,59) Notary Mantralya Attestation UAE Embassy Attestation Kuwait Embassy Attestation Qatar Embassy Attestation Saudi Embassy Attestation Oman Embassy Attestation Bahrain Embassy Attestation Other Embassy Attestation http://adnanenterprise.com/documents-attestation-from-hrd-mea-gad-sdm-notary-home-ministry-all-embassies/
(3) dbr:Radio_station dbr:Night_Fever unk dbo:MusicalWork [53,66) [117,128) 1. FM-FM was heavily a heavily promoted film about a radio station that was supposed to be 1978's answer to Saturday Night Fever. http://1001afilmodyssey.blogspot.com/2016/12/
- “In KBP, the new knowledge added is in the form of triples with confidence”: the paper claims that each triple is associated with a probability or confidence score but I could not find this information in the dataset file contexts-part0.tsv (see above examples). I checked the other file typePairs.tsv but I believe it contains the frequencies of all entity pairs, not triples.
- Critically, KB population would add triples to the existing KB like DBpedia but I could not find any triple. So I assume that the first two columns are for the pair of entities, the next two columns are their entity types, the rest of the columns are for the location of the entities and their sentences. Where can I find the triples? Where are the relations of the supposed-to-be triples? I assume this information is generated in the code because Figure 10 shows the distribution of “Number of Relations for an Entity Pair”, which means the relation for each entity pair is available. 
- The source code is available online via GitHub. However, a tutorial on how to use the tool or how to interpret the results is not available. How can this tool be adapted? Is there an API for plugging in another EDL tool, for example? How can it be reused for other datasets?
- Which method has been evaluated on this dataset? The paper did discuss methods of evaluation, but does it actually include an evaluation?
- Has this dataset been used as a benchmark for comparing any tools or systems at IBM or elsewhere?
Another point that confused me is the main contribution of the paper, which seems to be a dataset, but I don’t know what the CC-DBP dataset looks like since there is no description or sample of it. Since this track is the resource track, the dataset should follow the FAIR (Findable, Accessible, Interoperable and Reusable) principles as well.
I would appreciate it if the authors
- Correct the dataset output with the missing information: relation, probability.
- Upload the dataset and make it available for public download.
- Describe the dataset, if in TSV form, with metadata so that it can be interpreted precisely.
- Or preferably, since this is a Semantic Web conference, publish it as an RDF dataset. For publishing various kinds of metadata associated with each triple, approaches including RDF singleton property, reification, n-ary, or named graph can be used for the representation.
- For the tool, a tutorial or API would be appreciated by anyone who would like to use it.
If the tool or dataset is being used by any paper, team, or product, please describe it. Especially, if the tool is the result of any paper, a reference to it would definitely be appreciated.
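The column layout that Review 3 assumes for contexts-part0.tsv can be sketched as a small parser. This is only an illustration of the reviewer's reading of the file, not part of the released tool: the tab separator, the eight-column layout, and the names `ContextRow`, `parse_span`, `parse_row`, and `mention` are all assumptions. The sample row is example (3) quoted in the review.

```python
import re
from typing import NamedTuple, Tuple

# Half-open character span as printed in the file, e.g. "[53,66)".
SPAN = re.compile(r"\[(\d+),(\d+)\)")


class ContextRow(NamedTuple):
    """One row, under the ASSUMED layout: two entities, their types,
    their spans in the sentence, the sentence, and the source URL."""
    entity1: str
    entity2: str
    type1: str
    type2: str
    span1: Tuple[int, int]
    span2: Tuple[int, int]
    sentence: str
    url: str


def parse_span(text: str) -> Tuple[int, int]:
    """Turn "[start,end)" into a (start, end) pair of integers."""
    match = SPAN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a span: {text!r}")
    return int(match.group(1)), int(match.group(2))


def parse_row(line: str) -> ContextRow:
    """Split one tab-separated line into the eight assumed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    e1, e2, t1, t2, s1, s2, sentence, url = fields
    return ContextRow(e1, e2, t1, t2, parse_span(s1), parse_span(s2), sentence, url)


def mention(row: ContextRow, span: Tuple[int, int]) -> str:
    """Return the text of the sentence covered by a half-open span."""
    start, end = span
    return row.sentence[start:end]


# Example (3) from the review, re-joined with tabs (the separator is assumed).
SAMPLE = "\t".join([
    "dbr:Radio_station",
    "dbr:Night_Fever",
    "unk",
    "dbo:MusicalWork",
    "[53,66)",
    "[117,128)",
    "1. FM-FM was heavily a heavily promoted film about a radio station "
    "that was supposed to be 1978's answer to Saturday Night Fever.",
    "http://1001afilmodyssey.blogspot.com/2016/12/",
])

row = parse_row(SAMPLE)
print(row.entity1, "->", mention(row, row.span1))
print(row.entity2, "->", mention(row, row.span2))
```

Under this reading, the first span belongs to the first entity and the second span to the second, and no column carries a relation or a confidence score, which is the gap the review points out.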
Review 4 (by anonymous reviewer)
The paper "A Dataset for Web-scale Knowledge Base Population" first presents an overview of the Knowledge Base Population (KBP) task and of related subtasks, and then introduces CC-DBP, a dataset extracted from a publicly available crawl of web pages and from DBpedia. The topic of the paper is interesting, and relevant to the ESWC conference. I have appreciated the first part of the paper proposing a survey on the KBP tasks, and the authors' positioning with respect to them (and to the proposed definitions and evaluation methods of each subtask). The second part of the paper would deserve a more detailed description and additional comments on the proposed resource, and an in-depth qualitative analysis (not only quantitative).
The paper requires proofreading. Some typos, e.g.:
- [sec. 2] the determining the number -> the number?
- [sec. 4.1] it’s raw form -> its raw form
- [sec. 4.2] wrong split of the url
I acknowledge that I have read the rebuttal, and I stick to my score.