Author(s): Kody Moodley, Josef Hardi, John Graybeal, Michel Dumontier, Mark Musen
Abstract: Schema.org is an initiative by the major search engine providers to define a common vocabulary for structuring Web content across a variety of domains. It promotes data interoperability, potentially increases discoverability in search results, and enables Web content to benefit from sophisticated search services. Schema.org’s health-lifesci extension provides specialized attributes for describing healthcare and medical data. Before applying this extension to increase the interoperability of medical data, it is valuable to know how well the vocabulary currently captures key biomedical attributes. We are not aware of any quantitative evaluation addressing this question, and we fill this gap by providing one. We propose a mapping from the attributes of a selection of prominent community specifications for drug and clinical trial metadata to schema.org terms. We also define a mechanism for measuring how well schema.org covers the attributes in these specifications. For our selected specifications, schema.org showed coverage ratios of roughly 60%, 66% and 10% for drug, medical dataset and clinical trial metadata, respectively. Our study shows that: 1) a substantial portion of drug and medical dataset metadata can immediately leverage schema.org and its potential benefits, and 2) schema.org does not support precise descriptions of clinical trial data. Our proposed mapping provides guidance for: 1) extending schema.org to support detailed descriptions of clinical trial data, and 2) further improving coverage of drug and medical dataset attributes, should this be required.
Keywords: Schema.org; Linked data; Scientific metadata; Semantic markup