Author(s): Mohamed Nadjib Mami, Hajira Jabeen, Sören Auer
Abstract: Recent years have brought a huge leap in data formats, modalities, and storage capabilities, and dozens of storage systems have been created as a result. Today, we can store data across a cluster and choose a storage system that suits our application's needs, rather than the opposite. If connected together, this data can generate valuable insights and knowledge. Several works have therefore sought to bring heterogeneous data together, either by physically transforming it into a single format or by querying it virtually on the fly. Both approaches pose challenges at some stage of data preparation. However, modern technology has enabled us to achieve the latter more efficiently than ever. In this article, we suggest a general framework that takes advantage of Semantic Web standards to query heterogeneous big data. We devise an implementation, named Sparkall, that uses Spark as the underlying query engine. Our evaluation demonstrated the feasibility and efficiency of Sparkall in querying five data sources of y …bytes of size.
Keywords: big data; databases; NoSQL; data heterogeneity; data management; ontology; OBDA