NELL2RDF: Reading the Web, Tracking the Provenance, and Publishing it as Linked Data
Author(s): José M. Giménez-García, Maísa Duarte, Antoine Zimmermann, Christophe Gravier, Pierre Maret, Estevam R. Hruschka Jr.
Full text: submitted version
Abstract: NELL is a system that continuously reads the Web to extract knowledge in the form of entities and relations between them. It has been running since January 2010 and has extracted over 120 million candidate statements. NELL’s generated data comprises all the candidate statements, together with detailed information about how they were generated. This information includes how each component of the system contributed to the extraction of the statement, as well as when that happened and how confident the system is in the veracity of the statement. However, the data is only available in an ad hoc CSV format that makes it difficult to exploit outside the context of NELL. In order to make it more usable for other communities, we adopt Linked Data principles to publish a more standardized, self-describing dataset with rich provenance metadata.
Keywords: NELL; RDF; Semantic Web; Linked Data; Metadata; Reification; Provenance
Review 1 (by Daniel Garijo)
This paper describes an update of the existing NELL2RDF dataset that additionally includes provenance information. The paper is relevant for ESWC, well written, and easy to follow. The resource also looks useful, and I think it has potential as an interesting knowledge base. However, beyond the update itself, the contribution over the previous version of the resource is very incremental, and I am hesitant about whether to accept it for the conference. On the one hand, it is great to see old resources being maintained. On the other hand, accepting it sets a precedent for new versions of datasets being submitted as resources every year.
- The paper also doesn't explain why the captured provenance is useful for users other than NELL developers. True, a benchmark could be useful for triple stores, but that use has nothing to do with the value of the captured data or the knowledge being represented. What would be the difference compared with just using synthetic testbeds?
- There are no examples of how a user is supposed to use the resource. How will the Semantic Web community benefit from this resource? A couple of example queries showing its value would really help.
- There aren't any metrics on community adoption of the resource. Has anyone used the resource since the first paper was published? If so, reporting it would strengthen the publication.
Small issues:
- The canonical URI resolves to the same URL used in the previous paper: http://nell-ld.telecom-st-etienne.fr/ This makes sense, but it is not a significant contribution either. As a side note, the URL should use https in order to avoid a double redirect from w3id.org.
- If the code is public on GitHub, please state the version and a citation (e.g., Zenodo).
- The URI of the ontology seems to be unavailable (or at least I haven't found it in the paper).
- The license is stated in the paper, but not on the page where the resource is shared.
AFTER REBUTTAL: Since the authors have decided not to address the comments from my review, I decided to lower the score to a weak reject.
Review 2 (by Silvio Peroni)
In this paper, the authors introduce RDF datasets constructed by converting NELL data into RDF by means of a conversion application. They present the technology and vocabularies used for describing these data, and they provide some figures about the number of statements and the size of each of the datasets provided. Honestly, there are several aspects that should be appropriately extended and clarified to make this paper acceptable for ESWC, which are introduced below. In addition, I have some minor issues that should be considered as well:
- What is the "veracity" of the statement?
- I found only a link to the Web interface for querying the SPARQL endpoint. What if I want to call it via a common HTTP request in an application? What is the URL I have to use?
- There are no figures about community adoption of the resource, even though, according to the authors' earlier paper, its first instance was released in 2013. Why?
- Are the various entities described by the resource linked from external datasets? Do the entities in the resource link to external datasets?
Other comments:
# Potential impact
- Does the resource break new ground? In principle yes, since it contains information extracted from the Web. However, it is not clear which kinds of information are really stored.
- Does the resource plug an important gap? I'm not entirely sure. For instance, what is the difference between the data contained in encyclopedic datasets (e.g. Wikidata and DBpedia) and the resource presented? How do the data in the resource compare with the aforementioned ones? Are they complementary or overlapping?
- How does the resource advance the state of the art? It is not clear and not stated in the paper. It might be good to remove or resize some images so as to better introduce the output produced by NELL. This would help in clarifying the scope of the resource.
- Has the resource been compared to other existing resources (if any) of similar scope?
No, it has not, although there seem to be relevant resources (e.g. Wikidata and DBpedia) with which the presented resource should be compared.
- Is the resource of interest to the Semantic Web community? Yes, it is, as also clarified by the authors.
- Is the resource of interest to society in general? I have no evidence to support this.
- Will the resource have an impact, especially in supporting the adoption of Semantic Web technologies? It is not entirely clear, since there is no evidence of current usage.
- Is the resource relevant and sufficiently general, does it measure some significant aspect? As mentioned before, it is not clear (from reading the paper) which kinds of data are actually contained in the resource. Thus, it is difficult to assess its generality.
# Reusability
- Is there evidence of usage by a wider community beyond the resource creators or their project? Alternatively, what is the resource’s potential for being (re)used; for example, based on the activity volume on discussion forums, mailing list, issue tracker, support portal, etc? At first sight, it could have the potential to be reused. However, this is not the first effort related to the resource (one of the last ones is described in previous work), and since more than 4 years have passed since its first release, I expected that a relevant community would have started to use it. There is no evidence of such reuse in the paper.
- Is the resource easy to (re)use? For example, does it have good quality documentation? Are there tutorials available? etc. It has been provided in two formats and has also been made available through SPARQL endpoints. However, it lacks documentation of the data it actually contains, as noted above. It would be useful, for instance, to have some examples available and some example queries for getting the data from the SPARQL endpoints, according to the particular reification format adopted.
- Is the resource general enough to be applied in a wider set of scenarios, not just for the originally designed use? It is not clear; see the concerns above on the kinds of data it contains.
- Is there potential for extensibility to meet future requirements? I'm pretty sure there is, considering that it is a dataset and that it should be easy to extend its coverage in terms of data and vocabularies.
- Does the resource clearly explain how others use the data and software? No, it doesn't.
- Does the resource description clearly state what the resource can and cannot do, and the rationale for the exclusion of some functionality? No, it doesn't.
- Does the design of the resource follow resource specific best practices? In general, yes, even if it is not clear to me how the N-Triples format used (as stated in the paper) is able to store information using named graphs. However, it seems that the resources contained in the triplestore are not accessible via HTTP (e.g. https://w3id.org/nellrdf/metadata/Execution_AliasMatcher_haswikipediaurl_-1000004637_621). Thus, they are not really exposed according to the LOD principles, since they cannot be looked up as expected. This is quite a big drawback for the resource.
- Did the authors perform an appropriate re-use or extension of suitable high-quality resources? For example, in the case of ontologies, authors might extend upper ontologies and/or reuse ontology design patterns. The authors have reused existing models (e.g. PROV-O, VoID and DCAT) for describing specific aspects of the resource.
- Is the resource suitable to solve the task at hand? Not applicable.
- Does the resource provide an appropriate description (both human and machine readable), thus encouraging the adoption of FAIR principles? Is there a schema diagram? For datasets, is the description available in terms of VoID/DCAT/DublinCore? The schema is included, and it seems to follow part of the FAIR principles (even if the lack of HTTP access is a big issue).
It reuses existing models for describing the data.
- If the resource proposes performance metrics, are such metrics sufficiently broad and relevant? No performance metrics have been used for measuring the resource adoption (and, thus, it is not clear if and how the community has reused it), while there is a table of descriptive statistics about the number of statements and the GBs needed for storing them according to the particular model adopted for reifying the statements.
- If the resource is a comparative analysis or replication study, was the coverage of systems reasonable, or were any obvious choices missing? Not applicable.
# Availability
- Is the resource (and related results) published at a persistent URI (PURL, DOI, w3id)? Yes, it is (via w3id).
- Does the resource provide a licence specification? (See creativecommons.org, opensource.org for more information) Yes, it does.
- How is the resource publicly available? For example as API, Linked Open Data, Download, Open Code Repository. It is available as a download and through a SPARQL endpoint.
- Is the resource publicly findable? Is it registered in (community) registries (e.g. Linked Open Vocabularies, BioPortal, or DataHub)? Is it registered in generic repositories such as FigShare, Zenodo or GitHub? The software used for the conversion is available on GitHub (even if *no* documentation has been provided, which makes it difficult to reuse). The datasets, instead, have been published only on the resource website, while they should also be available at least in well-known registries, such as Figshare.
- Is there a sustainability plan specified for the resource? Is there a plan for the maintenance of the resource? There is no mention of the sustainability of the resource.
- Does it use open standards, when applicable, or have good reason not to? It reuses existing and open standards.
After rebuttal phase: The authors decided not to address my comments. Thus I confirm my score.
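On the question above about calling the endpoint from an application: under the SPARQL 1.1 Protocol, a query can be sent as an HTTP GET with a `query` parameter. The following is only a minimal sketch of what such a call might look like. The endpoint URL is a placeholder, since the actual one is not stated in the paper or in this review, and the query is a generic triple pattern rather than one tailored to any of the reification models.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint URL: the real one is not given in the paper or the reviews.
ENDPOINT = "https://example.org/sparql"

# Generic query, not specific to any NELL2RDF reification model.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
# Passing `request` to urllib.request.urlopen would actually send it.
```

A small LIMIT is used deliberately, given that large queries against a dataset of this size may be slow.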
Review 3 (by Francesco Corcoglioniti)
The paper presents the new version of NELL2RDF, an RDF dataset containing the knowledge graph (3M promoted beliefs, 120M candidate beliefs) and the provenance metadata (~2B triples) extracted from the Web by the Never-Ending Language Learning (NELL) system in its latest iteration(s). The dataset is automatically generated from NELL tabular data using publicly available conversion code. Compared to the previous version published in 2013, this version includes provenance metadata, represented using PROV-O and different reification models (e.g., named graphs).
POTENTIAL IMPACT. NELL2RDF aims at establishing a bridge between the NELL and the Semantic Web communities (§1). In my opinion, it falls short of this goal due to the lack of alignments between this resource and other established Linked Open Data resources, both in terms of ABox (e.g., alignments of instances to DBpedia or any other Wikipedia-based knowledge resource) and TBox (e.g., alignment of classes to YAGO). Without these alignments, it is difficult for NELL users to import Linked Data, and for Semantic Web practitioners to include NELL2RDF data in their Linked Data applications. Considering that NELL beliefs (from a previous iteration) were already provided in RDF and described in the authors' 2013 paper, the new contribution lies in the representation of provenance data. In that respect, I agree with the authors' remark that NELL2RDF may represent a useful use case for research on provenance and metadata management, but the contribution appears limited because the adopted provenance model is very specific to the NELL extraction process and does not cover the Web source(s) a belief was derived from, which makes reuse of the model in other knowledge extraction scenarios unlikely.
Moreover, the support of different reification models appears of little value to me (unless an informative comparative evaluation is carried out), as it is straightforward to map one reification model to another (e.g., starting from the simplest and most compact named graphs model). While different in goal, both DBpedia and Wikidata (briefly cited) have provenance mechanisms to represent the source of a statement, and I suggest that the authors include some background and a comparison with these and related state-of-the-art approaches in the paper.
DESIGN & TECHNICAL QUALITY. The dataset features a proper reuse of the PROV Ontology and all the considered reification models are appropriate, although it's not clear to me why in Table 2 the n-Ary relations model, which requires 2 triples per (candidate) belief, has the same number of triples as the named graphs model, which requires just one triple (quad) per belief. As reported above, the main limitation is the lack of ABox/TBox alignments to other Linked Open Data resources, which makes the current version of the resource appear as a straightforward conversion of NELL tabular data. An evaluation is not reported in the paper; one should be added, and it should focus on fitness for use (e.g., in selected relevant use cases) rather than on the quality of extracted beliefs (which regards NELL) or of the conversion process (unless ABox/TBox alignments are introduced). Availability of VoID and DCAT metadata is claimed in the paper (§1), but I found no link on the website and no relevant triple through the SPARQL endpoint (queried for void:Dataset, void:triples, dcat:Dataset); however, this metadata can be easily generated. The released RDF files contain incorrect xsd:dateTime literals (e.g., '2012/08/03 10:35:59' rather than '2012-08-03T10:35:59') that produce (ignorable) parse errors.
AVAILABILITY.
The NELL2RDF dataset is assigned a persistent w3id URL and is publicly available under a Creative Commons license (CC BY 4.0), while the conversion code (Java) is available on GitHub under the LGPL v3 license. For each reification model, NELL2RDF data is available both as dump files (HDT / NTriples / NQuads) and through a SPARQL endpoint (beware: data upload / SPARQL update options seem to be offered too!). URI dereferencing is not available but should be easy to implement, given the availability of the SPARQL endpoint. Sustainability is not discussed in the paper: even if regenerating the dataset for new NELL iterations seems just a matter of running the publicly available conversion code, I suggest that the authors briefly comment on whether/how they plan to keep the resource and the conversion machinery up-to-date and hosted online.
REUSABILITY. NELL2RDF documentation consists of the reviewed paper, which is available online together with a technical report that also includes contents from the 2013 NELL2RDF paper. While this material provides adequate documentation for NELL2RDF metadata, reuse of belief data is hindered by the lack of an ontology or some documentation about its TBox. No evidence of third-party reuse is produced and no concrete NELL2RDF use case is discussed. Reuse of the provenance model outside NELL2RDF is not one of the authors' goals and appears difficult, due to the model being tied to the NELL extraction process. On the other hand, the representation chosen seems rather flexible in accommodating new NELL extraction components, thus ensuring its suitability for future NELL releases.
Summing up, the resource and the idea behind it appear promising, but in my opinion further work is needed on the resource (esp. ABox/TBox alignments) and on the paper describing it (evaluation/discussion of fitness for use, third-party adoption, sustainability, related work) for this contribution to be acceptable at a major conference.
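On the malformed xsd:dateTime literals noted under Design & Technical Quality, a minimal sketch of a possible fix follows. It assumes that every affected literal uses the same 'YYYY/MM/DD HH:MM:SS' layout as the example quoted above, which I have not verified across the whole dump; the function name is illustrative and not part of the conversion code.

```python
from datetime import datetime


def to_xsd_datetime(raw: str) -> str:
    """Convert a 'YYYY/MM/DD HH:MM:SS' string to the xsd:dateTime lexical form.

    Raises ValueError if the input does not match the assumed layout.
    """
    return datetime.strptime(raw, "%Y/%m/%d %H:%M:%S").isoformat()


print(to_xsd_datetime("2012/08/03 10:35:59"))  # prints 2012-08-03T10:35:59
```

No time zone is added, since the quoted example carries none.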
Concerning the presentation, the paper is overall well written. Things to be fixed:
* "a detailed description the original dataset" -> "of the" (§1)
* "rdf:label" -> "rdfs:label" (§3.1)
* "LatLong execution is denoted by ... CPLExecution" -> "LatLongExecution" (§3.2)
* missing property names in Fig 2j
* "were used to infer the belief. Each one containing" -> "belief, each one" (§3.2)
* "We decided to follow each of them with to provide" -> remove "with" (§3.2)
* "entity along with with a lot" -> remove second "with" (§5)
* "and to the latter" -> "and from the latter" (§5)
* "This makes of it an ideal" -> remove "of" (§5)
POST REBUTTAL UPDATE. I acknowledge the response of the authors and the reported infeasibility of addressing the reviewers' comments in the current submission. I thus maintain my overall score.
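On the earlier remark that one reification model maps straightforwardly onto another, the following sketch shows the idea for a single belief, going from the named graphs model to standard RDF reification. It works on plain tuples of IRIs rather than on the released files, and the example IRIs are made up; reusing the graph IRI as the statement identifier is one possible design choice, not necessarily the one adopted in NELL2RDF.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def quad_to_reification(s: str, p: str, o: str, g: str) -> list:
    """Map one named-graph quad (s, p, o, g) to standard RDF reification.

    The graph IRI g is reused as the statement identifier, so metadata
    triples that had g as subject keep pointing at the same resource.
    """
    return [
        (g, RDF + "type", RDF + "Statement"),
        (g, RDF + "subject", s),
        (g, RDF + "predicate", p),
        (g, RDF + "object", o),
    ]


# Made-up IRIs, for illustration only.
triples = quad_to_reification(
    "http://example.org/paris",
    "http://example.org/cityIn",
    "http://example.org/france",
    "http://example.org/belief/1",
)
```

One quad becomes four triples in this mapping, so the models differ in how many triples they need per belief.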
Review 4 (by Vinh Nguyen)
This paper describes the creation of the NELL2RDF dataset with metadata associated with each statement in RDF. The original NELL dataset is transformed into RDF using five models to represent metadata for each statement. What concerns me the most is the potential impact of this resource. A previous version of this dataset (without metadata) was published five years ago and has still not attracted any application usage or a growing community. This version, which adds metadata to the representation, does not demonstrate any adoption of the work in the community; simply providing the RDF representation of NELL would hardly make it an impactful resource. In terms of availability, this dataset adheres to the Linked Data principles with an ontology created for the transformation; the files are available for download, and a SPARQL endpoint is also provided with a persistent URI.
- The data is available as RDF files and through a SPARQL endpoint, although the few queries I tried in order to learn the patterns all timed out; the data is too big in terms of the number of triples and storage. There is no tutorial or self-explanatory information on the website on how to query the data. It is hard to form a SPARQL query without knowing the triple pattern for each representation; a typical reader would not understand how to query the five representations, or which one should be queried. Sample data files and SPARQL query examples for each representation would be appreciated, as they would make it easier for the datasets to be understood and queried, even by experts, without downloading and setting up all the big data files.
- In terms of design, the code of NELL2RDF is tied to the NELL data and cannot easily be reused for other datasets.
- We have five big datasets, then so what? No demonstration of reusability or applications of these RDF datasets!
- Transforming one specific dataset like NELL to RDF seems to be straightforward and trivial, since many datasets like this have been created before, e.g. WikiData, DBpedia, PubChem.
- LatLong: the resource does not use the existing Geo ontology, e.g. https://www.w3.org/2003/01/geo/
- It seems the resource has not been used by any community other than the authors. The benefits that the RDF representation adds to the NELL dataset are not demonstrated. I noticed that the workshop paper [1] was published in 2013, and as of now, in 2018, it does not show any sign of being used by the community. In this version, what has changed is the metadata added to the RDF representation, and it does not demonstrate any community adoption yet. I doubt that just providing the RDF version of NELL would make or show any impact.
[1] A. Zimmermann, C. Gravier, J. Subercaze, and Q. Cruzille, "Nell2RDF: Read the Web, and turn it into RDF", 2nd Int Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, 10th ESWC 2013, Montpellier, France
- What could be more interesting is a generalization of the code, so that the tool could also be used to publish any TSV dataset with metadata for each statement as RDF, like the existing work D2RQ.
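To make the last suggestion concrete, here is a rough sketch of what a dataset-independent converter could look like. Everything in it is an assumption for illustration: the four-column layout (subject, predicate, object, confidence), the `http://example.org/` namespace, and the choice of one named graph per statement do not reflect the actual NELL file format or the NELL2RDF code. Values are also assumed to be safe to embed in IRIs without escaping.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def tsv_to_nquads(tsv_text: str) -> list:
    """Turn TSV rows of (subject, predicate, object, confidence) into N-Quads.

    Each statement is placed in its own named graph, and the confidence
    value is attached to that graph IRI in the default graph.
    """
    lines = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for i, (s, p, o, confidence) in enumerate(reader):
        graph = f"<{BASE}graph/{i}>"
        # The statement itself, as a quad in its own graph.
        lines.append(f"<{BASE}{s}> <{BASE}{p}> <{BASE}{o}> {graph} .")
        # Metadata about the statement, as a triple in the default graph.
        lines.append(f'{graph} <{BASE}confidence> "{confidence}"^^<{XSD_DOUBLE}> .')
    return lines


sample = "paris\tcityIn\tfrance\t0.93\n"
quads = tsv_to_nquads(sample)
```

A real tool would take the column-to-property mapping as configuration rather than hard-coding it, which is where the comparison with D2RQ applies.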