Author(s): Jasim Waheed Ansari, Naila Karim, Stefan Decker, Michael Cochez, Oya Beyan
Abstract: In the Big Data community, data lakes have become the de facto standard for storing data. Often, these data lakes are contained within the Hadoop ecosystem, where the actual storage happens on HDFS. The data is stored in its raw (structured or unstructured) form, and whenever an application needs the data, it interprets the raw data. This is a schema-on-read approach, in which the interpretation of the data and potential consistency checks happen when the data is read by an application. The biggest challenge for data lake governance is to prevent the lake from turning into a so-called data swamp. First, there is the data quality aspect, such as noisy or incorrect data. Second, as the amount of ingested data grows exponentially and no schema is enforced, the number of schemas in use also tends to grow.
In this work, we present a new metadata extension to data lake systems based on semantic profiling, which attempts to recognize the meaning of the data ingested into the data lake. The developed tool detects meaning not only at the schema level but also at the data instance level, by employing domain vocabularies and ontologies. With our tool, ingested data sets can easily be mapped to common domain concepts with unique identifiers, and the meaning of the data can be discovered by the system. The profiling tool helps to produce meaningful summaries of the ingested content and provides opportunities to link relevant data sets ingested using different data schemas. We evaluate our tool on two cancer genome data sets: we apply the semantic profiling tool during data ingestion and observe how the data sets are tagged and profiled. Our experiments show that semantic ingestion is a promising approach for enriching the data sets in a data lake.
Keywords: Big data integration; data lakes; scientific data management