Author(s): Josef Hardi, Kody Moodley, John Graybeal, Michel Dumontier, Mark Musen
Abstract: While much effort in applying schema.org concentrates on popular text, such as news articles, blogs, or restaurant reviews, scientific data on the Web have received less attention. For example, pages from public database websites (e.g., DrugBank, PubMed, NASA, NOAA) are almost never marked up with schema.org. This absence prevents Web search engines from applying more advanced search features, such as filtering or ambiguity resolution, to the information these websites publish. In this paper, we describe software that uses an extract-transform-load (ETL) pipeline to generate schema.org-compliant Web pages from the raw metadata that describe scientific registries or datasets. With this software, data elements that can be mapped to schema.org are automatically extracted, transformed into JSON-LD, and then loaded into the HTML source. We also present a declarative mapping language that facilitates the data mapping in the extraction process. The result is a framework that public databases can use to publish Web pages that are semantically indexable by search engines. We show that scientific data can be annotated effectively with schema.org by using a well-defined data mapping and ETL process.
Keywords: Semantic Content Authoring; Schema.org; Linked Data; Web Technology
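
As a rough illustration of the extract-transform-load flow summarized in the abstract, the following minimal Python sketch maps a raw metadata record to schema.org properties, builds a JSON-LD object, and embeds it in an HTML page. It is not the software or the mapping language described in this paper: the `MAPPING` dictionary, the dotted-path notation, the record field names, and the helper functions are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical declarative mapping (an assumption for this sketch):
# schema.org property path -> dotted key path in the raw metadata record.
MAPPING = {
    "name": "title",
    "description": "summary",
    "url": "landing_page",
    "creator.name": "owner.full_name",
}


def get_path(record, path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def set_path(target, path, value):
    """Write a value at a dotted path, creating nested dicts as needed."""
    keys = path.split(".")
    current = target
    for key in keys[:-1]:
        current = current.setdefault(key, {})
    current[keys[-1]] = value


def extract(record, mapping):
    """Extract: keep only the elements that the mapping ties to schema.org."""
    extracted = {}
    for prop, source in mapping.items():
        value = get_path(record, source)
        if value is not None:
            extracted[prop] = value
    return extracted


def transform(extracted, schema_type="Dataset"):
    """Transform: assemble the extracted values into a JSON-LD object."""
    doc = {"@context": "https://schema.org", "@type": schema_type}
    for prop, value in extracted.items():
        set_path(doc, prop, value)
    return doc


def load(jsonld, html):
    """Load: embed the JSON-LD in a script element placed before </head>."""
    # Escape "</" so that string values cannot close the script element early.
    payload = json.dumps(jsonld, indent=2).replace("</", "<\\/")
    script = '<script type="application/ld+json">\n' + payload + "\n</script>\n"
    return html.replace("</head>", script + "</head>", 1)


if __name__ == "__main__":
    # Made-up record; fields without a mapping (internal_id) are dropped.
    record = {
        "title": "Example registry entry",
        "summary": "A made-up record used only to exercise the sketch.",
        "owner": {"full_name": "Example Lab"},
        "internal_id": 7,
    }
    page = "<html><head><title>Example</title></head><body></body></html>"
    print(load(transform(extract(record, MAPPING)), page))
```

Keeping the mapping as plain data, separate from the three functions, mirrors the idea of a declarative mapping: a new registry would, under these assumptions, need only a new mapping table rather than new code.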