Teach me to fish, on querying Semantic Data Lakes
Author(s): Mohamed Nadjib Mami, Hajira Jabeen, Sören Auer
Full text: submitted version
Abstract: We have recently made a huge leap in terms of data formats, modalities, and storage capabilities. Dozens of storage facilities have been created as a result. Today, we are able to store cluster-wide data, and to choose a storage that suits our application needs, rather than the opposite. If connected together, this data can generate valuable insights and knowledge. Therefore, several works have been conducted to bring heterogeneous data together, by either physically transforming it into a unique format, or virtually querying it on-the-fly. Both approaches pose a challenge in a certain stage of data preparation. However, modern technology enabled us to achieve the latter more efficiently than ever. In this article, we suggest a general framework that takes advantage of Semantic Web standards to query heterogeneous big data. We devise an implementation, named Sparkall, that uses Spark as the underlying query engine. Our evaluation demonstrated the feasibility and efficiency of Sparkall in querying five data sources of y …bytes of size.
Keywords: big data; databases; nosql; data heterogeneity; data management; ontology; obda
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper proposes an approach for querying semantic data lakes, where the data lake is a collection of heterogeneous Big Data sources, which are integrated using OBDA-like query rewriting techniques. As such, the paper addresses a problem relevant for this conference. (NOVELTY OF THE PROPOSED SOLUTION) In previous OBDA approaches, non-relational data sources have been considered only to a limited extent. The usage of Spark packages as data connectors and Apache Spark as query processor is novel in this aspect. However, there is no formal description of the components of the system, so it is difficult to assess what is novel on a conceptual level. The comparison with the state of the art is very limited. The advantage of existing OBDA solutions is that they are formally well described and their properties are well understood. This is not the case here. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) With the lack of a formal description of the integration system, there is also no clear description of the semantics. The expressiveness, or any other properties of the system, remain open. (EVALUATION OF THE STATE-OF-THE-ART) While related work is briefly discussed, with the current description, the advances over the state of the art are not really clear. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Similarly, the algorithms, in particular for the query catalyst (query rewriting), are not explained. Assumptions are not made explicit, so properties such as completeness, correctness, or complexity cannot be assessed. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation shows results from some example queries, but they are not based on a proper benchmark and do not consider any alternative state-of-the-art approaches.
(OVERALL SCORE) The paper proposes an approach for querying semantic data lakes, where the data lake is a collection of heterogeneous Big Data sources, which are integrated using OBDA-like query rewriting techniques. As such, the paper addresses a problem relevant for this conference. In previous OBDA approaches, non-relational data sources have been considered only to a limited extent. The usage of Spark packages as data connectors and Apache Spark as query processor is novel in this aspect. However, there is no formal description of the components of the system, so it is difficult to assess what is novel on a conceptual level. The comparison with the state of the art is very limited. The advantage of existing OBDA solutions is that they are formally well described and their properties are well understood. This is not the case here. The evaluation shows results from some example queries, but they are not based on a proper benchmark and do not consider any alternative state-of-the-art approaches. With the lack of a formal description of the integration system, there is also no clear description of the semantics. The expressiveness, or any other properties of the system, remain open. Similarly, the algorithms, in particular for the query catalyst (query rewriting), are not explained. Assumptions are not made explicit, so properties such as completeness, correctness, or complexity cannot be assessed. While related work is briefly discussed, with the current description, the advances over the state of the art are not really clear.
Review 2 (by Valerio Basile)
(RELEVANCE TO ESWC) This work deals with data lakes, a more general topic than linked data, but it also has a clear SW angle, investigating the interaction between semantic queries and heterogeneous data stores. (NOVELTY OF THE PROPOSED SOLUTION) The proposed approach is fairly novel. Related approaches are discussed briefly, but unfortunately no comparable approach is tested (see comments below). (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed architecture is described in sufficient detail, both its design and its implementation Sparkall. I could not find flaws in the method, and the experimental evaluation confirms that no result is left behind when performing a query in Sparkall. (EVALUATION OF THE STATE-OF-THE-ART) The evaluation and comparison to the state of the art is a weak spot of this paper in my opinion. A few approaches to map heterogeneous data to RDF are discussed, but they take no part in the evaluation. I understand that it is difficult to directly compare complex architectures, but I think an effort should be made in this direction, perhaps devising a different experiment or creating a common benchmark. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The method and its implementation seem mature enough to be proposed as a packaged solution. However, I find the experimental evaluation somewhat weak and limited in the insight gained towards understanding the strengths and issues of the approach. Only 4 queries are tested against one type of data (e-commerce). The comparison is only made against MySQL, which clearly (and admittedly) does not scale up. In fact, the table of results is incomplete, and thus not very informative in parts. Also, once it is proven that Sparkall returns all the records it has to return, there is no point in showing a figure of accuracy if it is always 100% (which is a positive result, of course).
In summary, what I learn from the experimental study of the novel system is that it works, and nothing else: not what data it works best on, what the potential pitfalls are, etc. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The benchmark data used for the evaluation are available, and, most importantly, the source code of the Sparkall implementation is made available. The 4(+1) queries are not disclosed, so the experiment is not entirely reproducible. I advise the authors to publish the exact queries used in the experiment, also for clarity. (OVERALL SCORE) This paper presents an approach to map semantic queries against heterogeneous data stores (data lakes). A working implementation is also presented and tested on a standard benchmark in comparison to a classic relational database. This is not exactly my field of expertise, so I focused my review on the scientific aspects more than the engineering ones. The paper is generally well written, although I would have liked a little more background, e.g., on data lakes in general. Strong points: the architecture is complete and well fleshed out. The implementation is solid and works well on large data sets. Weak points: the paper could benefit from a little more background. The evaluation is quite limited, especially lacking any result from comparable systems. Also an evaluation on different data sets (e.g., from different domains) would be welcome. Minor detail: there is a reference missing in section 3.1.
Review 3 (by Catia Pesquita)
(RELEVANCE TO ESWC) The paper addresses querying heterogeneous Big Data resources using mappings and rewriting queries into SPARQL, and is thus suitable for ESWC. (NOVELTY OF THE PROPOSED SOLUTION) In general, the solution appears to have some novelty, but the limited detail in the description of the semantic aspects of the approach and the lack of a proper state-of-the-art discussion make it difficult to provide a stronger comment on novelty. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Although there is some attention given to architecture and implementation, there are not enough details on the semantic properties of the system. It is not exactly straightforward to evaluate the system's soundness, and although I commend the authors for providing access to source code, the paper should be clearer in this regard. (EVALUATION OF THE STATE-OF-THE-ART) State of the art is very briefly discussed and not enough attention is given to the differences between the approaches. For instance, the Optique framework, which is likely a very close system, is only presented in terms of goals, not techniques and approaches, which would have been very helpful in identifying the degree of novelty and relevance of the proposed system. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation of the proposed system is one of the weakest points in the paper. Authors compare to MySQL, which, while a suitable benchmark for accuracy, is not at all interesting in terms of scalability. It would have been very interesting to see the results run on the same data before the transformations that made it un-joinable and using the same queries. This could serve as a theoretical maximum and provide some valid discussion points as well. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Benchmark data and source code available.
(OVERALL SCORE) The paper discusses a system for querying heterogeneous Big Data resources, exploring user-defined mappings and resorting to SPARQL transformations. The proposed system is available and experimental results support its effectiveness and ability to handle large datasets. Strong points: 1. The system is aimed at the integration of NoSQL databases, which is very interesting. 2. The NoSQL database, although described briefly, seems very useful. 3. The results, although insufficient, are positive. Weak points: 1. Description of system is not formal enough, leaving doubts in what concerns the semantic expressiveness it can handle. 2. State of the art is not discussed in sufficient detail or insight. 3. Experiments do not target related systems, not even "non semantic" integration of NoSQL databases. Minor: missing reference in section 3.1; expand BGP acronym; Cassandra is typically referred to as a NoSQL database, not a relational one. QA: 1. What do you mean by "exactly as we need it" in "With minimal settings, RML enables us to annotate entities and attributes, exactly the way we need it."?
Review 4 (by anonymous reviewer)
(RELEVANCE TO ESWC) The topic of the paper is clearly relevant to the conference. (NOVELTY OF THE PROPOSED SOLUTION) The novelty of the proposed solution is somewhat unclear as there is a similarity to classic data integration systems, which is not sufficiently discussed. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution seems to be correct. (EVALUATION OF THE STATE-OF-THE-ART) The paper covers related work in the area of OBDA systems but does not relate to similar systems developed in the database community. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Not all of the properties are sufficiently discussed and supported by experimental evidence. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The implementation is available online but the exact experimental setup is unclear and not reproducible because of omitted details of the setup. (OVERALL SCORE) This paper outlines a framework and an approach for querying Semantic Data Lakes and positions itself as a virtual data access method (Ontology-Based Data Access) on top of Big Data and NoSQL databases. The approach supports a number of systems serving as data sources. The system seems to be fully implemented and running, providing a basic solution to the motivated problem. The experimental evaluation, however, is insufficient. While the paper is making use of Semantic Web technology and terminology, what it describes comes very close to a classic mediated virtual data integration system (as developed by the database community), which virtually integrates multiple heterogeneous datasets via mappings and uses a mediator for query rewriting and performing joins between partial results from different sources. These systems are well studied in the database community.
The paper briefly mentions data warehouses and their ETL processes but the database community has developed many other approaches in the area of distributed database systems and data integration systems. Hence, one of my questions (Q1) is what exactly are the similarities and differences to existing work in the area of databases and data integration systems. The evaluation is rather short and does not cover all relevant aspects, starting with experimental setup, query characteristics, etc. -- see questions below. Strong Points: S1) The problem that the paper aims to solve is very important and timely. S2) The paper is well structured and organized. S3) There seems to be a complete and running system including a graphical interface. The code is publicly available. Weak Points: W1) The experiments are relatively weak and do not cover all relevant aspects that are necessary to draw reliable general conclusions. W2) The paper does not position itself well enough to related work in the context of data integration systems. W3) The sections discussing the main contributions are hard to follow. They appear to be written in non-standard terminology and in a more complicated way than necessary. Questions to the authors: Q1) What are the similarities and differences to existing work in the area of databases and data integration systems? Q2) What exactly is the subset of SPARQL that is currently supported? Q3) Abstracting from the textual description, one might think that the system simply supports left deep join trees and processes join results based on join attributes (ParSets join links). Does the system go beyond these basic strategies? Q4) How would a triple store perform as a source in the framework? Does the system support this? Q5) How exactly is the data partitioned (horizontally or vertically)? The text seems to indicate that each source is assigned one relation (vertical) and this assignment is not changed. What is the consequence of this assignment? 
What happens if the relations are assigned to different stores? And more interestingly, what happens if the data is horizontally partitioned, i.e., each source contains parts of all relations? Q6) What is the influence of selectivity on the obtained results? The text mentions that the queries contain filter expressions but there is no exact information on filter selectivity and join selectivity for each of the queries. This is important information that influences performance and should be discussed in more detail. Q7) Table 1 shows that for Q3 runtime goes up from 500k to 1.5m but goes down from 1.5m to 5m. Why is that? Q8) Wouldn't it be possible to run Q4' on MySQL as well? Q9) The text in the experimental section states "The threshold is set to 1800s (30min)." Do you mean timeout? Q10) Why are there no measurements in Table 1 for MySQL in the 1.5m setup for queries Q3, Q4, Q4' (and none for 5m)? Is it because of the timeout? Wouldn't it be useful to let the queries run to the end to be able to check accuracy? Q11) Why not compare the performance to a native setup in Spark and a native triple store, instead of only MySQL? Q12) Is there actually a chance that the system will produce results with less than 100% accuracy? Q13) The text mentions that Q1 involves 2 sources, Q2 involves 3, and Q4 involves 5 without actually saying which sources. This is however important to understand the obtained results. How is it possible to draw general conclusions based on these experiments? Q14) Can the system support the standard queries of the BSBM benchmark? If yes, what is their performance?
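To make Q6 concrete, the two quantities asked for could be reported per query as simple ratios. The following is only a minimal sketch under stated assumptions, not something taken from the paper or from Sparkall: the function names, the toy product and offer rows, and the price filter are all hypothetical, and join selectivity is taken in its common textbook sense, i.e., the number of joined pairs divided by the size of the cross product.

```python
from collections import Counter


def filter_selectivity(rows, predicate):
    """Fraction of rows that satisfy the filter (0.0 for empty input)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if predicate(row)) / len(rows)


def join_selectivity(left, right, left_key, right_key):
    """Matching pairs of an equi-join divided by |left| * |right|."""
    if not left or not right:
        return 0.0
    # Count how often each join value occurs on the right side.
    right_counts = Counter(row[right_key] for row in right)
    # Each left row contributes one pair per matching right row.
    matches = sum(right_counts[row[left_key]] for row in left)
    return matches / (len(left) * len(right))


# Hypothetical toy data, loosely modelled on an e-commerce setting.
products = [
    {"id": 1, "price": 10},
    {"id": 2, "price": 50},
    {"id": 3, "price": 90},
    {"id": 4, "price": 120},
]
offers = [{"product": 1}, {"product": 1}, {"product": 3}, {"product": 9}]

price_filter_sel = filter_selectivity(products, lambda p: p["price"] > 40)
product_offer_sel = join_selectivity(products, offers, "id", "product")
print(price_filter_sel, product_offer_sel)
```

Reporting such figures next to each runtime in Table 1 would let readers separate the cost of the join itself from the effect of how much data the filters discard.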
Metareview by Olaf Hartig
This paper proposes an OBDA-like approach to querying data lakes. While the reviewers agree that the presented work addresses a timely and interesting problem, they also point out a number of significant weaknesses of the paper. In particular, the presented experimental evaluation is insufficient, the paper is not well written (e.g., algorithms are not explained and assumptions not made explicit), and it is not made clear how the presented work advances the state of the art. Due to these weaknesses, the paper cannot be accepted for publication in the conference.