Paper 127 (Research track)

Extending Data Lake Metadata Management by Semantic Profiling

Author(s): Jasim Waheed Ansari, Naila Karim, Stefan Decker, Michael Cochez, Oya Beyan

Full text: submitted version

Abstract: In the Big Data community, Data Lakes have become the de facto standard for storing data. Often, these data lakes are contained within the Hadoop ecosystem, where the actual storage happens on HDFS. The data is then stored in its raw (structured or unstructured) form and whenever an application needs the data, it interprets the raw data. This approach is a schema-on-read approach in which the interpretation of the data and potential consistency checks happen when the data is read by an application. The biggest challenge for data lake governance is to prevent the lake from turning into a so-called data swamp. First, there is the data quality aspect, such as noisy or incorrect data. Second, as the amount of data ingested grows exponentially and no schema is enforced, the number of schemas in use also tends to grow.
In this work we present a new metadata extension to data lake systems by semantic profiling, which attempts to recognize the meaning of the data which is ingested into the Data Lake. The developed tool not only detects meaning at the schema level, but also at the data instance level by employing domain vocabularies and ontologies. With our tool, ingested data sets can easily be mapped to common domain concepts with unique identifiers and the meaning of the data can be discovered by the system. The developed profiling tool will help to produce meaningful summaries of the ingested content and provides opportunities to link relevant data sets ingested using different data schemas. We evaluate our tool by using two cancer genome datasets. We use the semantic profiling tool during data ingestion and observe how data sets are tagged and profiled. Our experiments show that Semantic Ingestion is a promising approach for enriching the data sets in a data lake.

Keywords: Big data integration; data lakes; scientific data management

Decision: reject

Review 1 (by Ioanna Lytra)

(RELEVANCE TO ESWC) The current paper proposes a framework and tool for semantic profiling of datasets that are ingested in data lakes. While this work is at first sight related to topics in the Semantic Web community, the proposed approach does not necessarily contribute to the state of the art in Semantic Web; it rather provides an application of how semantics can be used to annotate data and extract metadata.
(NOVELTY OF THE PROPOSED SOLUTION) According to the authors' claims, the main contributions of this work are the algorithm for semantic profiling of datasets in the data lake and the integration of this algorithm (and corresponding architecture) in the Kylo data lake system. The Semantic Profiling Algorithm seems very trivial to me: it just maps the data schema to existing vocabularies and ontologies based on string matching against existing labels, properties, etc. It is not very clear either which problem the authors are trying to solve. Do they deal with heterogeneous datasets? Do they address data quality? Do they achieve data integration of ingested datasets? In order to judge the novelty of the approach, the problem has to be stated more clearly.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The correctness and completeness are difficult to judge given the fact that:
- The evaluation is done only with one dataset
- No assumptions or preconditions are discussed under which the semantic profiling is possible
- The generated profiles in the evaluation section are not checked for correctness and completeness
(EVALUATION OF THE STATE-OF-THE-ART) The discussion of the state of the art is very limited. The authors are missing important work and related tools for data profiling in general (have a look at this tutorial for more information: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/2017/SIGMOD_2017_Tutorial_Data_Profiling.pdf). In addition, it is not explained why this problem is relevant specifically for data lakes and how existing data lake systems deal with these problems.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) This point is very weak in the paper. The authors provide two figures (Fig. 2) showing the pipeline and the steps of their approach, which are, however, very technical. The many technical details (including, for example, the unreadable Figures 3 and 4) do not help the reader get a clear understanding of the properties of the proposed approach.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental study is very limited from my point of view; only a single dataset is ingested into the data lake and the goals of the experiments are not clear. What is being evaluated exactly? Tables 2-4 are not informative at all since they just show a few mappings. Are the results correct? What is being measured here? Using only one dataset and one data format does not provide enough evidence about the generality and effectiveness of the proposed approach.
(OVERALL SCORE) Summary of the Paper
The paper entitled "Extending Data Lake Metadata Management by Semantic Profiling" proposes an algorithm and corresponding tool for semantic profiling of data that are being ingested into a data lake. For the data profiling, ontologies and vocabularies from BioPortal are used. The tool is integrated into an existing data lake system named Kylo. The semantic profiling of the TCGA dataset is used for evaluation purposes.
Strong Points (SPs)
- Code available for testing
- Well known problem of data profiling
- Integration of approach in existing data lake system
Weak Points (WPs)
- Limited comparison to related work
- Poor evaluation results
- Technical presentation - missing discussion of research results
- Not clearly written and lots of grammatical/syntactical mistakes
Questions to the Authors (QAs)
- What is the main problem being solved here which has not been addressed (or addressed inadequately) in previous works?
- What is the main problem being solved (the motivating example does not seem to be relevant to the problem)?
- How generalizable is your approach: does the data format matter? Do the data structure and data quality have an influence?
- What is the goal of your evaluation? What is being measured?


Review 2 (by Khalid Belhajjame)

(RELEVANCE TO ESWC) The topic addressed by the paper is of relevance to the Semantic Data Management and Big Data research track. The authors propose a method for automatically (semantically) annotating datasets in the data lake.
(NOVELTY OF THE PROPOSED SOLUTION) While the topic is relevant, the proposed solution does not deliver on the promises made in the introduction when motivating the problem. The authors use a BioPortal service to annotate the columns and the column values of tables (files in the Hadoop file system).
They overlook important aspects like:
- The overhead incurred by the generated metadata. I remind the authors that they are dealing with large quantities of data. Annotating the values of the columns is likely to generate a volume of annotations that may exceed the size of the input data.
- The automation of annotation is likely to generate false positive annotations. This aspect is not tackled by the authors either.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper, as it stands, is immature, and there are key problems that need to be rethought. In addition to the issues mentioned in "Novelty of the proposed solution", it is not clear what the authors are trying to show in the validation section. The section should start by stating the questions that the authors are trying to answer empirically.
The paper focuses on the annotation of the columns and their values. How about annotating the table (file) and the records in the table? It seems to me that annotating the records (lines of the file) is also important, if not more important than column value annotations, given that each record represents an entity.
(EVALUATION OF THE STATE-OF-THE-ART) The authors reviewed the main research works in the area. That said, I was expecting an analysis of, and comparison with, work on the kind of annotations that the authors focus on in this paper.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As mentioned before, it is not clear what the authors are trying to show in the validation section. The section should start by stating the questions that the authors are trying to answer empirically in the validation section, a description of the datasets used and any experimental set-up, and a description of the method followed, before presenting and analyzing the results.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors provided a git link with the template used for Kylo. There aren't enough details, however, to allow a user to understand or reproduce the experiment conducted by the authors.
(OVERALL SCORE) SP. The motivation of the work is well formulated and the problem is interesting and timely for the Semantic Web community.
WP1. The solution presented by the authors allows annotating columns and column values using the API provided by BioPortal. There is no solid research solution that the authors propose in this paper.
WP2. The accuracy of the automatic annotations obtained is not justified.
WP3. The validation of the paper needs to be rethought.
WP4. The paper suffers from several typos.


Review 3 (by anonymous reviewer)

(RELEVANCE TO ESWC) Integrating Big Data and semantic technologies, as this approach does, is very relevant for ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) Unfortunately, the approach is neither very innovative nor novel. It appears to me that the approach is rather a wrapper for the BioPortal API, where the algorithm iterates over all columns and rows of a file to ingest and calls the API for each cell.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is not described clearly enough (see detailed comments below) and the technical depth and breadth could have been increased in many ways.
(EVALUATION OF THE STATE-OF-THE-ART) Some relevant work has been identified, but some works are missing (e.g. RDFstats, LODStats) and a more systematic comparison would have been nice.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties of the approach are not very clearly presented.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The relatively straightforward source code is made available, but could have been better documented for reproducibility. Some quantitative figures and a performance evaluation are missing.
(OVERALL SCORE) Summary of the Paper
This submission presents an approach to profile datasets being ingested in a data lake using the BioPortal API on the columns and rows of the dataset and performing some simple heuristics on the results.
Strong Points (SPs)
* relevant and interesting problem
* some code is available
Weak Points (WPs)
* the approach and the description lack technical depth and breadth
* a more systematic and precise description of the approach is needed
* cursory evaluation
* very sloppy writing and presentation
More detailed comments:
In Fig. 1 there is a box for Profiling and one for Semantic Profiling Algorithm - isn't the latter part of the former?
The notion of "semantically valid" is not clearly defined; maybe a small table and some examples would be better here:
* what do you mean by "full text", e.g. in " full text is found from the BioPortal recommender service" - probably the full class/property label
* why are alternate names only partially valid?
Many things are not very clear to me, e.g.:
* "parsed label which tells the user how much the word has been parsed successfully by the API"
The definitions in section 5 are unclear and imprecise, e.g. it is unclear what "DD_{instance}(PreferredLabel)" means when DD_{instance} is a table. What do you mean by PreferredType(Preferred)?
The discussion of the BioPortal API key is a technicality, which is not interesting or relevant, especially not for the algorithm.
The algorithm is relatively straightforward - basically two loops iterating over columns and rows in the table and retrieving recommendations from BioPortal. Regarding formatting, the inner for loop is not properly indented.
Regarding evaluation, I'm missing some quantitative results. Also, my impression is that the algorithm, which analyses each individual column by performing BioPortal API calls, will not scale well, which would contradict the Big Data Lake claim of the title.
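To make the scaling concern concrete, here is a minimal sketch of the two-loop structure described above. It is not the authors' code: the `annotate` callable, the `CountingAnnotator` class, the toy table and the `ex:` identifiers are all hypothetical stand-ins, and no real BioPortal call is made.

```python
from typing import Callable, Dict, List

Table = Dict[str, List[str]]  # column name -> cell values


def profile_table(table: Table, annotate: Callable[[str], List[str]]) -> Dict[str, dict]:
    """Annotate every column header and every cell, one lookup per string."""
    profile = {}
    for column, values in table.items():      # outer loop: columns
        entry = {"header": annotate(column), "cells": []}
        for value in values:                  # inner loop: rows
            entry["cells"].append(annotate(value))
        profile[column] = entry
    return profile


class CountingAnnotator:
    """Hypothetical stand-in for a remote annotation service; counts lookups."""

    def __init__(self, vocabulary: Dict[str, str]):
        self.vocabulary = vocabulary
        self.calls = 0

    def __call__(self, text: str) -> List[str]:
        self.calls += 1
        concept = self.vocabulary.get(text.strip().lower())
        return [concept] if concept else []


# Toy data (assumed): two columns, three rows.
table = {
    "gender": ["male", "female", "male"],
    "tumor_stage": ["stage i", "stage ii", "stage i"],
}
vocabulary = {"gender": "ex:Gender", "male": "ex:Male", "female": "ex:Female"}

annotator = CountingAnnotator(vocabulary)
profile = profile_table(table, annotator)

# Distinct strings that would need a lookup if results were cached.
distinct = {c for c in table} | {v for vs in table.values() for v in vs}
```

In this sketch the number of lookups is columns x (rows + 1), so it grows linearly with the number of cells, whereas the number of distinct strings (what a cache would need to look up) can be much smaller for categorical columns. Whether the submitted tool caches or batches its requests is not stated in the paper.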
Minor comments:
* sutomated=>automated
* "we used THE BioPortal API vocabulary service"
* smeantic=> semantic
* "With this approach , we take full opportunity"=>"With this approach, we take full advantage"
* "Following the the reusable approach"
* consistently capitalize Kylo: kylo=>Kylo
* articles (the, a) are often missing
* "This service can be accessED via"
* weighed=>weighted
* "That includeS data extraction"
* spacing and punctuation, e.g. full stop after 6th sentence in section 4, space before brackets
* clarity should be improved, e.g. "First, even though the meaning behind the terms are annotated, it is not a guarantee that there is a full text found by the recommended engine."
* Uri => URI
* Figures should be better prepared: text is too small (Fig 3 and 4 are completely unreadable), the lines and arrows in Figure 2 are not aligned, and especially the Reusable template part looks quite messy


Metareview by Olaf Hartig

This paper proposes an approach to profiling data that is to be ingested into a data lake. While the reviewers consider the topic relevant, they point out a number of significant weaknesses of the presented work (e.g., triviality of the approach, insufficient evaluation, insufficient discussion of related work, lack of a clear scientific contribution, paper not clearly written). Due to these weaknesses, the paper cannot be accepted for publication in the conference.

