Generation of Web pages for Public Scientific Databases Using Schema.org
Author(s): Josef Hardi, Kody Moodley, John Graybeal, Michel Dumontier, Mark Musen
Full text: submitted version
Abstract: While much effort in applying schema.org concentrates on popular text, such as news articles, blogs, or restaurant reviews, the scientific data on the Web have received less attention. For example, pages from public database websites (e.g., DrugBank, PubMed, NASA, NOAA) are almost never presented using schema.org. This absence prevents Web search engines from applying more advanced search features, such as filtering or ambiguity resolution, to information generated by these websites. In this paper, we describe software for generating schema.org-compliant Web pages out of raw metadata used to describe scientific registries or datasets using an extract-transform-load (ETL) pipeline. With this software, data elements that can be mapped to schema.org are automatically extracted and transformed into JSON-LD and then loaded into HTML source. We also present a declarative mapping language to facilitate the data mapping in the extraction process. The result is a framework that public databases can use to publish Web pages that are semantically indexable by search engines. We show that annotating scientific data using schema.org can be done effectively using a well-defined data mapping and ETL processes.
Keywords: Semantic Content Authoring; Schema.org; Linked Data; Web Technology
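As a rough illustration of the extract-transform-load flow described in the abstract, the following Python sketch pulls a few fields out of an XML record, wraps them as schema.org JSON-LD, and injects the result into an HTML page. It is not CAML and does not use the authors' library: the sample record, its element names, the `MAPPING` dictionary, and the function names `extract`, `transform`, and `load` are all hypothetical, chosen only to make the three steps visible.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical source record; the element names are invented for illustration.
SOURCE_XML = """
<record>
  <title>Example dataset</title>
  <summary>An illustrative record.</summary>
  <link>https://example.org/datasets/1</link>
</record>
"""

# Hypothetical mapping: schema.org property -> path in the source XML.
# This is a plain dictionary, not the CAML syntax from the paper.
MAPPING = {"name": "title", "description": "summary", "url": "link"}

PAGE = "<html><head><title>Example</title></head><body></body></html>"


def extract(xml_text, mapping):
    """Extract: read each mapped path from the XML, skipping missing elements."""
    root = ET.fromstring(xml_text.strip())
    values = {}
    for prop, path in mapping.items():
        text = root.findtext(path)
        if text is not None:
            values[prop] = text.strip()
    return values


def transform(values, schema_type):
    """Transform: wrap the extracted values as a schema.org JSON-LD object."""
    return {"@context": "https://schema.org", "@type": schema_type, **values}


def load(html, json_ld):
    """Load: insert the JSON-LD as a script block just before </head>."""
    # Escape "</" so the payload cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(json_ld, indent=2).replace("</", "<\\/")
    script = '<script type="application/ld+json">\n' + payload + "\n</script>"
    return html.replace("</head>", script + "\n</head>", 1)


if __name__ == "__main__":
    doc = transform(extract(SOURCE_XML, MAPPING), "Dataset")
    print(load(PAGE, doc))
```

A single JSON-LD block in the page head keeps the markup separate from the visible HTML, which is the design the reviews refer to when they contrast it with fine-grained RDFa embedding.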
Review 1 (by anonymous reviewer)
This paper presents an approach to generate schema.org markup from metadata retrieved from a number of scientific resources. A simple language for mapping between the source representation and schema.org is provided, and injection of the derived schema.org JSON into HTML pages is also demonstrated. A Python library is provided for others to re-use this approach. This is a nice lightweight approach to get structured metadata into web pages, but I think this paper falls short of meeting the brief for the resources track. There are already a number of efforts underway to explore the use of schema.org in scientific resources (see http://bioschemas.org), so this resource does little to break new ground. There is also no indication of whether the approach has been deployed in any actual scientific resources, nor a discussion of what value this would bring to the data providers.
Review 2 (by Silvio Peroni)
In this paper, the authors introduce a tool that allows one to create schema.org descriptions of scientific databases (to be injected into HTML pages) by interpreting specific mapping rules defined according to a particular mapping language: CAML. Please find below my comments on the various review criteria.

# Potential impact
- Does the resource break new ground? Honestly, I cannot see a huge improvement to the current status of the Semantic Web domain. It can be a useful tool for addressing the specific task of converting databases / XML documents into schema.org annotations, though.
- Does the resource plug an important gap? Other tools, e.g. D2RQ, have already been developed in the past for addressing similar issues, at least from a db-to-RDF conversion perspective. Plenty of approaches already exist for XML-to-RDF conversions as well. In addition, there are also other Web-oriented tools that have been built for creating HTML (possibly semantically enriched) presentations of data coming from databases and RDF triplestores – e.g. RSLT or Fresnel.
- How does the resource advance the state of the art? The proposed language for describing mappings is indeed easier to learn than others, but I suspect it is a bit less powerful.
- Has the resource been compared to other existing resources (if any) of similar scope? No comparison has been provided.
- Is the resource of interest to the Semantic Web community? In principle, yes it is, even if it addresses quite a specific task.
- Is the resource of interest to society in general? This is not clear from the paper.
- Will the resource have an impact, especially in supporting the adoption of Semantic Web technologies? Yes, it has been developed so as to push semantic descriptions of scientific datasets into HTML pages.
- Is the resource relevant and sufficiently general, does it measure some significant aspect? The mapping language presented seems to be quite general. However, the flexibility of the resource in different contexts and for addressing different mapping tasks is not clear.

# Reusability
- Is there evidence of usage by a wider community beyond the resource creators or their project? Alternatively, what is the resource’s potential for being (re)used; for example, based on the activity volume on discussion forums, mailing list, issue tracker, support portal, etc? There is no evidence of its reuse, and the authors only briefly stated the intended usage, without clarifying whether it will be used on a daily basis by some database provider.
- Is the resource easy to (re)use? For example, does it have good quality documentation? Are there tutorials available? etc. The documentation is rather limited, and no examples of usage (e.g. a running example) have been provided. There is also a playground, but honestly there are no hints to help the user understand the basic concepts and features of the tool.
- Is the resource general enough to be applied in a wider set of scenarios, not just for the originally designed use? I'm not sure about this.
- Is there potential for extensibility to meet future requirements? It is open source, and it is developed in Java. Thus I think it is reasonable to extend it for some future requirements.
- Does the resource clearly explain how others use the data and software? Not at all.
- Does the resource description clearly state what the resource can and cannot do, and the rationale for the exclusion of some functionality? The intent of the resource is pretty clear from the text. Honestly, more examples would have helped to clarify the way it works.
- Does the design of the resource follow resource specific best practices? Nothing has been clarified by the authors from this perspective.
- Did the authors perform an appropriate re-use or extension of suitable high-quality resources? For example, in the case of ontologies, authors might extend upper ontologies and/or reuse ontology design patterns. They have reused existing software, such as an XSLT processor and a JSON library available on GitHub.
- Is the resource suitable to solve the task at hand? Not sure, since there is no executable file for trying it on the fly.
- Does the resource provide an appropriate description (both human and machine readable), thus encouraging the adoption of FAIR principles? Is there a schema diagram? For datasets, is the description available in terms of VoID/DCAT/DublinCore? The resource by itself does not comply with this. However, the outcomes it produces should be automatically enriched with appropriate provenance information, so as to clearly state that the data have been provided by the tool. There is no mention of this aspect in the paper.
- If the resource proposes performance metrics, are such metrics sufficiently broad and relevant? The authors have evaluated it by looking at the time needed to translate XML files into JSON-LD, although they did not perform a similar evaluation with non-XML sources.
- If the resource is a comparative analysis or replication study, was the coverage of systems reasonable, or were any obvious choices missing? There is no comparative analysis, which is unfortunate since it would have strengthened the discussion about the tool itself.

# Availability
- Is the resource (and related results) published at a persistent URI (PURL, DOI, w3id)? It doesn't seem so, according to the paper.
- Does the resource provide a licence specification? (See creativecommons.org, opensource.org for more information) Yes, it does.
- How is the resource publicly available? For example as API, Linked Open Data, Download, Open Code Repository. It is possible to download its sources.
- Is the resource publicly findable? Is it registered in (community) registries (e.g. Linked Open Vocabularies, BioPortal, or DataHub)? Is it registered in generic repositories such as FigShare, Zenodo or GitHub? On GitHub.
- Is there a sustainability plan specified for the resource? Is there a plan for the maintenance of the resource? There is no plan specified for the resource maintenance.
- Does it use open standards, when applicable, or have good reason not to? Yes, it does.

--- after rebuttal phase

First of all, I would like to thank the authors for their answers. In summary, I have not changed my mind after reading the authors' rebuttal. Some more detailed information follows.

> our resource is unique because it is not tied to one specific data model or data repository choice, specifically to the RDF data model and a triplestore

and

> present can only handle XML and RDF data (see ‘3.4 Implementation Details’).

Well, actually, according to the current implementation and what the authors said again in the rebuttal, the resource is able to address two data models only, XML and RDF, while leaving out some of the most important data models, such as traditional databases and CSV files, which I believe are still among the most adopted options worldwide. In addition, it is not clear to me why the existing solutions for XML and RDF (e.g. just "plain" XSLT and SPARQL CONSTRUCT) were not enough for handling the conversion. I don't see a clear added value introduced by CAML and, if there is one, it has not been demonstrated in any way. The question (sorry, another one, but maybe I wasn't clear enough in my review) is: why do I have to use CAML instead of other existing tools for performing XML-to-schema.org and RDF-to-schema.org conversions?

> new ‘Help’ icon in the playground to bring out the Wiki page, and a user guide on how to use the playground.

Honestly, I would have expected to find those things in the Playground already. Showing that the authors had already worked on addressing these issues during the rebuttal (since they should not be a big deal to provide) would have been a wonderful sign of their positive and proactive attitude in fixing the issues raised, and could have been more persuasive to the reviewers.

> this work is a supplement project for another bigger project called CEDAR (https://metadatacenter.org/), which we anticipate will sustain the resource

Great to hear, thanks!

> we are happy to hear your suggestions for alternative metrics in our evaluation.

Well, as I already said in my review, performance is not an issue here, since the process implemented by the resource does not need to deliver results live but can be run from time to time to update the information related to the database. However, I see at least two additional kinds of evaluation that can be performed: 1. understanding whether CAML is easy for a database owner to use in order to set up the conversion, and 2. whether it allows one to manage the conversion of all the important information (according to a database owner). Point 1 can possibly be addressed by involving users (BTW, which kinds of users? Semantic Web experts? Database owners? Domain experts?) in addressing specific mapping tasks by using CAML. Point 2, on the other hand, can be addressed by asking the intended end users about their requirements for exposing such information in schema.org, in order to see whether CAML can be used to produce the rules for the conversions required. Note that, in this case, even a comparison with the other tools can be performed, since it would be based on the conversion features that are handled by the languages/tools.
Review 3 (by Christoph Lange)
UPDATE AFTER AUTHORS' RESPONSE: Thank you for your detailed response. My score and review remain unchanged because you basically said (which is good) that you are planning to address the reviewers' concerns.

This resource is an implementation of an ETL pipeline and a mapping language (CAML) that processes metadata of scientific databases in order to publish them as web pages semantically annotated using schema.org. The paper clearly states the problem that scientific data is so far rarely exposed to web search engines using schema.org annotations. Besides presenting the approach and its implementation, this paper also presents partial results of a performance evaluation. The resource satisfies the review criteria to a large extent (details below). A major shortcoming is that the experimental data are not considered to be part of the resource. I'd recommend publishing the mappings for DrugBank, PubMed, ClinicalTrials, etc., and publishing the resulting schema.org-annotated pages. Also, it is mandatory that you re-try the RDF evaluation.

Minor issues with the paper (see PDF at https://www.dropbox.com/s/3dmgxefbzgrdftx/ESWC2018_paper_92.pdf?dl=0 for details):
* There is some redundancy between sections 3.1 "ETL Process Scenario" and 3.2 "Pipeline Setup"; the latter section just seems to paraphrase the former in different words.

Review criteria:

> Potential impact
>
> Does the resource break new ground?
The approach is no rocket science but a solid application of state-of-the-art technology, but …
> Does the resource plug an important gap?
> How does the resource advance the state of the art?
… it clearly solves a problem in the application domain (as stated above).
> Has the resource been compared to other existing resources (if any) of similar scope?
The "related work" section compares the resource to a few related tools, but mainly to tools for manual/semi-automatic semantic annotation of plain text or HTML. I could imagine that there also exist tools for bulk generation of schema.org or similar annotated content from structured (meta)data input; such tools are not considered here.
> Is the resource of interest to the Semantic Web community?
> Is the resource of interest to society in general?
> Will the resource have an impact, especially in supporting the adoption of Semantic Web technologies?
"Yes" to all, because it makes large amounts of sources of relevant data more easily accessible.
> Is the resource relevant and sufficiently general, does it measure some significant aspect?
Yes, CAML is a generic mapping language.

> Reusability
>
> Is there evidence of usage by a wider community beyond the resource creators or their project? Alternatively, what is the resource’s potential for being (re)used; for example, based on the activity volume on discussion forums, mailing list, issue tracker, support portal, etc?
Not yet. There's great potential thanks to the GitHub infrastructure, but so far only the first author has contributed.
> Is the resource easy to (re)use? For example, does it have good quality documentation? Are there tutorials available? etc.
There is good documentation, especially for CAML, on GitHub.
> Is the resource general enough to be applied in a wider set of scenarios, not just for the originally designed use?
The resource already supports transformation from two independent data models: XML and RDF.
> Is there potential for extensibility to meet future requirements?
It will therefore also be extensible to further data models.
> Does the resource clearly explain how others use the data and software?
Not yet.
> Does the resource description clearly state what the resource can and cannot do, and the rationale for the exclusion of some functionality?
Limitations are not acknowledged.

> Design & Technical quality:
> Does the design of the resource follow resource specific best practices?
Yes, introducing a domain-specific language that is compiled into fully-featured standard languages (here: SPARQL or XSLT) is a common pattern for mapping tasks. Regarding the output, it is not clear why you embed blocks of JSON-LD into HTML rather than fine-grained, local embeddings of RDFa. Regarding not reusing RML and R2RML, please argue more precisely which important features they do not support.
> Did the authors perform an appropriate re-use or extension of suitable high-quality resources? For example, in the case of ontologies, authors might extend upper ontologies and/or reuse ontology design patterns.
Yes, via schema.org, JSON-LD, SPARQL and XSLT.
> Is the resource suitable to solve the task at hand?
Clearly yes.
> Does the resource provide an appropriate description (both human and machine readable), thus encouraging the adoption of FAIR principles? Is there a schema diagram? For datasets, is the description available in terms of VoID/DCAT/DublinCore?
The descriptions are limited to very few README files and wiki pages.
> If the resource proposes performance metrics, are such metrics sufficiently broad and relevant?
Doesn't apply – the resource is merely _evaluated_ w.r.t. the "time" performance metric.
> If the resource is a comparative analysis or replication study, was the coverage of systems reasonable, or were any obvious choices missing?
Doesn't apply.

> Availability
>
> Is the resource (and related results) published at a persistent URI (PURL, DOI, w3id)?
Nothing beyond a GitHub repository.
> Does the resource provide a licence specification? (See creativecommons.org, opensource.org for more information)
Yes, a BSD-style software license.
> How is the resource publicly available? For example as API, Linked Open Data, Download, Open Code Repository.
Source code repository and public demo ("playground").
> Is the resource publicly findable? Is it registered in (community) registries (e.g. Linked Open Vocabularies, BioPortal, or DataHub)? Is it registered in generic repositories such as FigShare, Zenodo or GitHub?
Yes, GitHub.
> Is there a sustainability plan specified for the resource? Is there a plan for the maintenance of the resource?
Not specified.
> Does it use open standards, when applicable, or have good reason not to?
Yes, see above.
Review 4 (by anonymous reviewer)
The paper describes the development of a declarative mapping language, CAML, together with a data processing tool to add schema.org terms to public scientific data repositories. CAML is used to define mappings between schema.org terms and the schemas of data sources; these mappings are subsequently used by the data processing pipeline to transform existing HTML webpages into webpages annotated with schema.org terms. There are several issues with the current version of the paper:

1. The motivation for developing yet another schema mapping language, CAML, is not clear. The paper mentions both RML and R2RML; however, it does not state what features are not supported by these languages and how CAML addresses the limitations of these existing mapping languages. More importantly, there is no discussion in the paper of the large number of existing schema mapping languages, including XPath, Orchid, and D2R, which already support features such as "data path", "data object", etc. In addition, it is not clear what the comparative advantages of CAML are over these existing mapping languages.
2. Although the paper describes the generation of schema.org-annotated webpages, it is not clear how these annotated webpages lead to better ranking during Web search compared to their existing rank. In particular, the PubMed repository mentioned in the paper already includes specialized metadata terms from the Medical Subject Headings (MeSH) vocabulary, which significantly supports semantic interpretation of article content during PubMed search. The paper does not describe how use of the CAML-based webpage transformation tool improves the search functionality of the existing PubMed repository.
3. It is not clear how the evaluation described in the paper, in terms of the time complexity of the XML pipeline, is relevant to the development of a new mapping language and the embedding of schema.org terms in webpages.
4. The practical utility of this tool is not clear, as the paper does not discuss how the annotated webpages will be adopted by the owners (government agencies) of these data repositories. As discussed above, adoption of the tool and the resulting annotated webpages by data repositories could be supported by a comparative evaluation of the annotated webpages against the original webpages using a specific set of metrics, such as ranking in search query results.

Overall, it is not clear how the proposed CAML and the tool to generate schema.org-annotated webpages break new ground or address an important resource gap. The paper needs to clearly demonstrate the potential impact of the proposed tool as described in the Resource Track solicitation.