Efficient Ontology-Based Data Integration with canonical IRIs
Author(s): Guohui Xiao, Dag Hovland, Dimitris Bilidas, Martin Rezk, Martin Giese, Diego Calvanese
Full text: submitted version
Abstract: In this paper, we study how to efficiently integrate multiple relational databases using an ontology-based approach. In ontology-based data integration (OBDI) an ontology provides a coherent view of multiple databases, and SPARQL queries over the ontology are rewritten into (federated) SQL queries over the underlying databases. Specifically, we address the scenario where records with different identifiers in different databases can represent the same entity. The standard approach in this case is to use sameAs to model the equivalence between entities. However, the standard semantics of sameAs may cause an exponential blow up of query results since all possible combinations of equivalent identifiers have to be included in the answers. The large number of answers is not only detrimental to the performance of query evaluation, but also makes the answers difficult to understand due to the redundancy they introduce. This motivates us to propose an alternative approach, which is based on assigning canonical IRIs to entities in order to avoid redundancy. Formally, we present our approach as a new SPARQL entailment regime and compare it with the sameAs approach. We provide a prototype implementation and evaluate it in two experiments: in a real-world data integration scenario in Statoil and in an experiment extending the Wisconsin benchmark. The experimental results show that the canonical IRI approach is significantly more scalable.
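The blow-up described in the abstract can be made concrete with a small sketch. The identifiers, the equivalence classes, and the rule of taking the first identifier as the canonical one are all hypothetical and are not taken from the paper. Under sameAs semantics, one answer tuple is expanded into every combination of equivalent identifiers; with a canonical IRI per entity it stays a single tuple.

```python
from itertools import product

# Hypothetical equivalence classes: each entity is known under several IRIs.
same_as = {
    "well_1": ["db1:w/1", "db2:w/A", "db3:w/x"],
    "field_7": ["db1:f/7", "db2:f/G"],
}

# Assumption for illustration: the first identifier acts as the canonical IRI.
canonical = {entity: iris[0] for entity, iris in same_as.items()}


def answers_same_as(row):
    """Expand one answer tuple into all combinations of equivalent IRIs."""
    return list(product(*(same_as[entity] for entity in row)))


def answers_canonical(row):
    """Return the single answer tuple built from canonical IRIs."""
    return [tuple(canonical[entity] for entity in row)]


row = ("well_1", "field_7")
expanded = answers_same_as(row)
compact = answers_canonical(row)
print(len(expanded), len(compact))
```

With k equivalent identifiers per entity and n answer variables, the expanded result has k^n tuples per underlying answer, which is the redundancy the abstract refers to.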
Keywords: ontology-based data integration; entity identifier management; query rewriting
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper deals with ontology-based data integration, and is, thus, highly relevant to ESWC. (NOVELTY OF THE PROPOSED SOLUTION) To the best of my knowledge, the approach is novel. Although the use of "canonical elements" to represent an equivalence class is not a new idea, it is, to my knowledge, its first application in OBDI. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is complete. (EVALUATION OF THE STATE-OF-THE-ART) There is no proper state of the art analysis, but the work is, to the best of my knowledge, the first of its kind, so this brief state-of-the-art analysis can be justified. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties of the proposed approach are well-studied, both from the theoretical perspective, and with an appropriate experimental evaluation. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental evaluation is appropriate. (OVERALL SCORE) This work studies the use of "canonical IRIs" as representatives of different IRIs connected through the owl:sameAs relationship, as a means to improve performance and readability of query answers in the context of OBDA/OBDI. Although the use of "canonical elements" to represent an equivalence class is not a new idea, it is, to my knowledge, its first application in OBDI. Overall, the paper is solid, well-written and well-motivated. I have no specific comments or suggestions for improvement. Strong points - Clear and well-written - Interesting application of the "canonical elements" approach for query optimisation Weak points None Questions to the authors None Edit after the rebuttal: I acknowledge the rebuttal. It forces no changes in my scores or comments.
Review 2 (by Loris Bozzato)
(RELEVANCE TO ESWC) The work presents theoretical and practical results for the management of sameAs relations in ontology based data integration: thus it is clearly relevant to the topics of the conference. (NOVELTY OF THE PROPOSED SOLUTION) The paper proposes a novel approach for considering sameAs relations in OBDI. The work, however, can be considered as a continuation of the work started in  and, as noted in the introduction, the use of canonical representation of IRIs has been already applied in reasoning engines (e.g. ). (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) I could not detect any evident technical flaw in the proposed solution from the presentation of the paper (and proofs in the referred appendix). (EVALUATION OF THE STATE-OF-THE-ART) The paper does not seem to directly compare with related approaches and a summary of the state of the art in the field (in OBDI and sameAs reasoning) is missing. (Only some remarks are presented in Sections 1 and 7). (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Both the theoretical results (Sections 3-5) and the experimental results (Section 6) are provided with sufficient justification for their claims. On the other hand, I would suggest providing in the paper (in addition to the external proofs) some intuition about the verification of the formal results. Similarly, the discussion about the results of the two experiments in Section 6 should be expanded: in both cases it is recognized that the canonical IRI solution is more convenient, but a quantitative assessment of the improvement is not provided. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Proofs of all formal results are provided (even if not directly contained in the paper). With respect to the experimental evaluation, where possible (i.e. local experiments in Section 6.2), the system and materials to reproduce the experiments are provided.
(OVERALL SCORE) SUMMARY: The paper presents a solution for efficient integration of data in an OBDI setting, based on an alternative semantics for sameAs relations, defined by canonical IRIs. The authors present the canonical IRI semantics as a SPARQL entailment regime, showing that it is formally equivalent to the usual sameAs interpretation. It is then shown that this semantics can be encoded in query answering by query rewriting. However, since this solution is not practical, an alternative encoding is provided based on mapping rewriting. A prototype implementation of the approach is then presented and experimentally evaluated on two integration scenarios. STRONG POINTS: - The paper provides a novel approach for reasoning on sameAs relations in OBDI query answering, both from the formal and the implementation point of view. - The paper is well written, the contributions are clearly stated and the running example helps in understanding the application of the approach. - The experimental evaluation provides promising results on the applicability of the proposal. WEAK POINTS: - The work does not clearly compare itself (also in the experiments) with other solutions, e.g. for reasoning with sameAs relations and integration of different DB schemas. - The analysis of the experimental results is limited: a quantitative assessment of the advantages in using canonical IRIs should be provided. - (minor) The notation causes the formal sections to be a bit heavy: moreover, some intuitive explanation of the formal results should be provided. QUESTION TO AUTHORS: - How much are the results dependent on the OWL-QL entailment regime? - (minor) In Section 4, can you provide a formal measure of the "growth" in the query execution under the proposed rewriting (thus motivating its impracticability)? - Similarly, is it possible to formally show that the solution proposed in Section 5 is more efficient w.r.t. the use of the original sameAs interpretation?
(This is one of the claims of the paper, but it is not directly assessed in this section). ----- Added after rebuttal: I acknowledge the authors' comments to reviews provided in their response letter and I confirm my positive scores. I suggest including the clarifications provided in the response as comments in the final version of the paper. Moreover, I suggest expanding the discussion on the outcomes of the experimental results where possible.
Review 3 (by anonymous reviewer)
(RELEVANCE TO ESWC) Paper on Ontology Based Data Integration, which is core to a Semantic Web conference (NOVELTY OF THE PROPOSED SOLUTION) Presents an approach that, to the best of my knowledge, has not been presented before (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) To the best of my knowledge, the proposed approach is correct and complete. (EVALUATION OF THE STATE-OF-THE-ART) A real world and controlled setting evaluation is presented. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Experiments demonstrate the approach (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The controlled setting can be reproduced. The real world setting can't (OVERALL SCORE) Summary of the Paper This paper presents an approach on how to deal with Ontology Based Data Integration when it comes to having different identifiers coming from different databases but that actually identify the same thing. This is done through the notion of canonical IRIs. The approach assumes 1) that there exists a “master table” that contains the sameAs relationships between different identifiers and 2) there is SQL Federated access to multiple databases (a single SQL query sent to a federated database). Based on these assumptions, the goal is to rewrite the existing mappings by joining each query of the current mapping to the master table and projecting the canonical IRI, which is produced by a mapping over the master table. This is in the spirit of saturated mappings where the work is pushed into the mappings instead of the query rewriting. This work is a natural extension of previous work from the authors where they achieve the same goal but through means of query rewriting. To the best of my knowledge the work presented here is sound. I believe it advances the state of the art of Ontology based data integration and my recommendation is to accept this paper.
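The mapping rewriting summarized above (join each mapping's source query with the master table and project the canonical identifier) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tables `db1_well`, `db2_well`, and `master`, their columns, and the `ex:` IRI template are hypothetical, a single in-memory SQLite database stands in for federated SQL access, and every source identifier is assumed to appear in the master table.

```python
import sqlite3

# One in-memory database stands in for federated access to two sources.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE db1_well (id INTEGER, name TEXT);
CREATE TABLE db2_well (code TEXT, depth REAL);
-- Hypothetical master table linking source identifiers to a canonical id.
CREATE TABLE master (canon_id INTEGER, db1_id INTEGER, db2_code TEXT);
INSERT INTO db1_well VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO db2_well VALUES ('A', 1200.0), ('B', 950.0);
INSERT INTO master VALUES (100, 1, 'A'), (200, 2, 'B');
""")

# Rewritten source queries: the original query joined with the master table,
# projecting the canonical id instead of the source-specific identifier.
NAME_SQL = """
SELECT m.canon_id, w.name
FROM db1_well AS w JOIN master AS m ON m.db1_id = w.id
"""
DEPTH_SQL = """
SELECT m.canon_id, d.depth
FROM db2_well AS d JOIN master AS m ON m.db2_code = d.code
"""


def triples(sql, predicate):
    """Apply the (hypothetical) IRI template ex:well/{canon_id} to each row."""
    return [(f"ex:well/{cid}", predicate, value) for cid, value in con.execute(sql)]


graph = triples(NAME_SQL, "ex:name") + triples(DEPTH_SQL, "ex:depth")

# Both sources now use the same subject IRI, so a plain join on the subject
# combines them without any sameAs reasoning at query time.
names = {s: o for s, p, o in graph if p == "ex:name"}
depths = {s: o for s, p, o in graph if p == "ex:depth"}
joined = sorted((s, names[s], depths[s]) for s in names if s in depths)
print(joined)
```

The design point is that the identifier reconciliation happens once, inside the mappings, so the queries posed over the ontology do not need to change.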
Strong Points - Pushing the sameAs work into the mappings instead of the query rewriting is a great approach - Advances the state of the art of ontology based data integration. - Well-defined experiments Weak Points - The assumption of having a federated database limits the applicability of the results - The introduction does not describe the actual approach. The reader has to wait till the end to get the full picture. - The SPARQL query in Section 4 is not explained, although it is core to understanding the theory Questions/Comments to the Authors 1) It would be nice if a short high level explanation of the approach was presented in the introduction, instead of having to wait till page 11 to figure out what the approach is. 2) At the end of Section 2, the authors state "Forcing the same IRIs also makes it hard to scale mapping construction, as the equal IRIs must be enforced everywhere." I find this claim interesting. Why is it hard to scale the mapping construction? At the end of the day, the result of the proposed approach is doing the same: enforcing equal IRIs everywhere by changing the mappings everywhere. 3) The query in Section 4 was not explained. My understanding is the following. We have the following BGP that needs to be matched: (?s) --?p2--> (?x) --?p1--> (?o), filtering out all the matches where ?x has a canonical IRI, meaning that you only get nodes that have an incoming and an outgoing edge and that do not have a canonical IRI associated with them. This part just returns a pair of the same ?x (in the inner part of the query ?x is renamed to ?xc), and this is then unioned with the pairs of all canonical IRIs. Thus, if I understand correctly, this query returns all the associations of canonical IRIs defined through canIriOf and also the IRIs that don’t have a canonical IRI associated with them; in the latter case, it’s a pair of the same IRI. Did I get this right? As a reader, I should not be trying to figure this out on my own.
4) I believe that  should be Sequeda J.F., Arenas M., Miranker D.P. (2014) OBDA: Query Rewriting or Materialization? In Practice, Both! ISWC2014. https://link.springer.com/chapter/10.1007/978-3-319-11964-9_34 5) I appreciate the comparison of the real world experiments to the in vivo and in vitro scenario! I agree! ---- Comments after rebuttal Thanks for addressing my points. One final comment, regarding your answer to point 2): I understand. I would recommend "softening" the existing claim in the paper by adding the explanation that you gave me.
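The reading of the Section 4 query given in point 3) can be written down as a small set computation. This sketch encodes only the reviewer's stated conclusion, namely declared canIriOf pairs plus identity pairs for IRIs without a canonical IRI; it is not the paper's query. The triples, the `ex:` prefix, the direction of `ex:canIriOf` (canonical IRI as subject), and the choice to consider every IRI occurring as subject or object are assumptions made for illustration.

```python
CAN = "ex:canIriOf"  # assumed direction: (canonical IRI, CAN, represented IRI)

# Hypothetical data: two source IRIs share a canonical IRI, two have none.
triples = [
    ("ex:c1", CAN, "db1:w/1"),
    ("ex:c1", CAN, "db2:w/A"),
    ("db1:w/1", "ex:locatedIn", "db1:f/7"),
    ("db2:w/A", "ex:operatedBy", "db2:org/9"),
]


def canonical_pairs(triples):
    """Return (iri, canonical) pairs under the reading described in point 3)."""
    # Pairs declared explicitly through canIriOf.
    declared = {(o, s) for s, p, o in triples if p == CAN}
    covered = {iri for iri, _ in declared}
    # IRIs occurring in ordinary triples, as subject or object.
    nodes = {t for s, p, o in triples if p != CAN for t in (s, o)}
    # IRIs without a canonical IRI are paired with themselves.
    identity = {(n, n) for n in nodes - covered}
    return declared | identity


pairs = canonical_pairs(triples)
print(sorted(pairs))
```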
Metareview by Olaf Hartig
This paper presents an approach to ontology-based data integration that takes into account the existence of multiple URIs for the same thing. The reviewers agree on the novelty and on the soundness of the approach. Another strong point of the paper is that the properties of the approach are well studied (both theoretically and experimentally). As a consequence, this paper can be recommended for acceptance. We expect that for preparing the final version, the authors take into account the reviewers’ comments and implement the changes mentioned during the rebuttal.