Paper 173 (Resources track)

A Dataset for Web-scale Knowledge Base Population

Author(s): Michael Glass, Alfio Gliozzo

Abstract: For many domains, structured knowledge is in short supply, while unstructured text is plentiful. Knowledge Base Population (KBP) is the task of building or extending a knowledge base from text, and systems for KBP have grown in capability and scope. However, existing datasets for KBP are all limited by multiple issues: small in size, not open or accessible, only capable of benchmarking a fraction of the KBP process, or only suitable for extracting knowledge from title-oriented documents (documents that describe a particular entity, such as Wikipedia pages). We introduce and release CC-DBP, a web-scale dataset for training and benchmarking KBP systems. The dataset is based on Common Crawl as the corpus and DBpedia as the target knowledge base. Critically, by releasing the tools to build the dataset, we enable the dataset to remain current as new crawls and DBpedia dumps are released. Also, the modularity of the released tool set resolves a crucial tension between the ease that a dataset can be used for a particular subtask in KBP and the number of different subtasks it can be used to train or benchmark.

Keywords: Knowledge Base Population; Knowledge Induction; Slot filling; Benchmark; DBpedia; Common Crawl

Share on

Leave a Reply

Your email address will not be published. Required fields are marked *