**The Problem of Correlation and Substitution in SPARQL**

**Author(s):** Daniel Hernandez, Claudio Gutierrez, Renzo Angles

**Full text:** submitted version

**Abstract:** Implementations of a standard language are expected to give same outputs to identical queries. In this paper we study why different implementations of SPARQL (Fuseki, Virtuoso, Blazegraph and rdf4j) behave differently when evaluating queries with correlated variables. We show that at the core of this problem lies the historically troubling notion of logical substitution. We present a formal framework to study this issue based on Datalog that besides clarifying the problem, gives a solid base to define and implement nesting.

**Keywords:** Correlation; Substitution; SPARQL

**Decision:** reject

**Review 1 (by anonymous reviewer)**

(RELEVANCE TO ESWC) Very relevant, as it deals with SPARQL semantics. (NOVELTY OF THE PROPOSED SOLUTION) As far as I know, no other similar study exists (except from an unofficial publication by the same authors). Edited: based on the suggestion of another reviewer of a relevant paper, I am forced to reduce my score on this aspect. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) No problems with the approach were identified. (EVALUATION OF THE STATE-OF-THE-ART) Insufficient. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The theoretical evaluation of the approach is sufficient. It is not entirely clear why the proposed semantics is the "intented" or "correct" way to handle nested queries and "EXISTS" quantifiers, but this was not the intention of the paper. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Irrelevant; no experimental study. The theoretical evaluation of the approach is sufficient for this kind of work. (OVERALL SCORE) This paper identifies a problem with the SPARQL specification, which leads to different systems reporting different results for the same database and query. The authors propose extensions of Datalog that can be used to properly capture the semantics of SPARQL queries with nesting and existential quantifiers. I believe the paper will prove a very interesting addition to the ESWC program and will raise interesting discussions. Typos: - Footnote 3: "Finally, ... finally" - "Substituting in a non trivial even" - Schneider "an" Martin - "are better understand" - "as improper substitution where applied" - "we where chosen" Strong points - Important problem related to the semantics of SPARQL - Solid technical solution - Rigorous formal treatment Weak points - The paper is a bit on the "heavy" side with regards to formalisms, but this is inherently associated with the addressed problem and is not related to the paper itself. Questions for the rebuttal: In Section 5.1, join and MINUS are defined with respect to a modality, but it is not clear why this particular modality was chosen (rather than the alternative) and why the alternative is not mentioned (is it irrelevant?). EDIT AFTER THE REBUTTAL: I acknowledge the authors' rebuttal. I raised my score with regards to novelty.

**Review 2 (by Daria Stepanova)**

(RELEVANCE TO ESWC) The paper discusses the problem that arises in SPARQL using FILTER (NOT) EXISTS expressions, which is relevant to ESWC. (NOVELTY OF THE PROPOSED SOLUTION) A formal framework to analyze the problem has been established by the authors; however, no concrete solution has been actually proposed. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The missing proofs of important and non-trivial results (e.g., correctness of Lemma 4) do not allow one to verify the technical correctness of the work rigorously. Since Modal Datalog is a non-standard extension, such proofs are crucial for understanding the technical results. (EVALUATION OF THE STATE-OF-THE-ART) The related work section describes the history on the given issue and points to two proposed solutions, which have not been studied in a formal framework. I wasn't aware of the problem before reading this paper, therefore have not noticed any major issues with the related work. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) In principle, the approach to analyze the problem has been carried out appropriately. The reader is guided through the steps and the presented formalisms. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) While no proper experimental evaluation has been provided, the running example has been tested in different SPARQL engines, and the respective results have been discussed. (OVERALL SCORE) The paper discusses problems with FILTER (NOT) EXISTS queries in SPARQL. The authors point out, that different SPARQL engines give different answers on the same query using such expressions. The goal of the paper is to formalize the problem, provide a logical framework to study it as well as a consistent alternative for solving the problem. In order to achieve this goal the authors introduce Nested Datalog and Modal Datalog. Nested Datalog is Datalog extended with nested queries/programs, which is able to handle nested FILTER EXISTS expressions. Modal Datalog extends Nested Datalog with modalities to support the open world semantics of RDF/SPARQL. For both Datalog variants the authors discuss syntactic and logical substitutions, which formally describe the problem of evaluating a SPARQL query with subexpressions. For Nested Datalog the compatibility check of a substitution can be done at an arbitrary place in the dependency graph of the query, while this is not anymore the case in Modal Datalog. Finally, the authors provide a translation of SPARQL to Modal Datalog, analyze FILTER EXISTS queries using this translation and give three sources of discrepancy in the interpretation of such expression. In conclusion the authors leave the decision on the correct interpretation of such problematic queries open to the W3C working group. *** Strong Points *** * The importance and relevance of the studied problem * The presence of intuitive examples * Self-contained preliminary material *** Weak Points *** * Absence of the proof of central results (e.g., Lemma 4) * The promised in the introduction proposal of authors' "view of how to handle this problem" has not been presented. * The paper contains numerous typos (some are listed below), and overall seems to be put together in a rush. *** Questions to the Authors (QAs) *** * Is it possible to analyze/discuss the two semantics mentioned in the Related Work section in the view of the formal framework given in this paper? How do these semantics fit into this framework? *** Incomplete list of typos *** page 3: Nested Datalog (a->an) extension Schneider (an->and) Martin page 4: A variable X (is) occurs page 5: occurs in (a->an) equality formula page 7: as a query (p(Z),P) where P contains the rule q(z) --> q(Z) should be p(Z)? page 9: We write S,\theta \models L ... sign before L should be \circle? page 10: one source X (us->is) null Proof of Lemma 3: s is unary or binary? Post-rebuttal comments: Thanks for addressing the raised question concerning the semantics mentioned in the Related Work and their fit into the framework. In the archive version of the paper the proof of Lemma 4 is still missing, thus the scores concerning the correctness/completeness of the proposed approach have not been increased.

**Review 3 (by anonymous reviewer)**

(RELEVANCE TO ESWC) The submission studies the subtleties of the semantics of SPARQL, and this topic is clearly relevant to the conference. (NOVELTY OF THE PROPOSED SOLUTION) The following article addresses the same issue M. Kaminski, E. V. Kostylev, B. Cuenca Grau: Query Nesting, Assignment, and Aggregation in SPARQL 1.1. ACM Trans. Database Syst. 42(3): 17:1-17:46 (2017) Yet, it's not even mentioned by the authors. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors attempt the clarify the semantics of correlated variables in SPARQL subqueries. The authors do not propose a concrete fix for the problem but instead suggest to consider SPARQL queries as programs in what they call Modal Datalog (which is based on another invention, the Nested Datalog). (EVALUATION OF THE STATE-OF-THE-ART) The starting point of the investigation is an observation that the 4 standard SPARQL engines give different answers on a reasonably simple query with correlated variables in the subqueries. Other authors have addressed the issue (see above), yet the authors appear to be unaware of it. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There are numerous flaws in the presentation and it's difficult to judge. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) N/A (OVERALL SCORE) The authors attempt to clarify the semantics of the subqueries in SPARQL. The starting point of the investigation is discrepancies among 4 standard implementation of SPARQL engines. The authors do not propose a concrete fix for the problem but instead suggest to consider SPARQL queries as programs in what they call Modal Datalog (which is based on another invention, the Nested Datalog). ** Strong Points (SPs) ** 1. It is important to iron out any inconsistencies in the semantics of SPARQL. 2. The examples are interesting (although it's a bit difficult to appreciate all the nuances due to WPs below). ** Weak Points (WPs) ** 1. The submission contains many broken sentences, e.g., "It is well known the complexities that this notion involves." (p 1), "Carnap in ... and Quite in ... definitions still contain problems" (p 1), "We - the Semantic Web advocates - are in a problem" (p 2) - the list can easily be continued. The pattern "Given ..., then ... " is ungrammatical. More specific comments can be found below, but the submission is very difficult to get through because of the language. 2. The definitions of Nested Datalog and, more importantly, Modal Datalog and how it is related to SPARQL are unclear and contain some hidden assumptions and a number of inconsistencies (which is a bit ironic in view of the submission's mission). What is the exact meaning of "[] are not repeated inside the same predicate"? Why can't you have P(x,x) in the head of a rule? I fail to understand "equality formulas require that all variables be assigned before being evaluated". What is "the order of Stratified Datalog"? (The meaning of the sentence is unclear!) Placing (non-standard) definitions in footnotes (e.g., Footnote 7) is a very bad taste. The explanation of let(Y = a) is very unclear: P has a single rule; then a literal l(Y) is added to the rule - why? Does l(Y) replace the equality? Is l(a) part of the extended P? What is a? Any constant? If l(a) is added to P, then Definition 2 does not mention it. Does let(Y=a) stand for l(Y)? How is "the variable Y in the second rule logically connected with the variable X in the first rule" (p 6)? What is the meaning of the "logically connected"? "move the literals let(X = \theta(X)) down" - is it always X = \theta(X)? The example appears to suggest it could be Y = \theta(X) - or is it just a typo? Footnote 9 appears to serve as a proof of the base case. Nested Datalog queries cannot contain cycles in "nested atoms", can they? In general, is it assumed that Datalog is non-recursive? Section 2 begins by suggesting so, but then non-recursiveness is never mentioned again. Definition 4: "B is a set of literals allowing the value \bot" is very poorly phrased - do the authors mean that \bot is a term that can occur as an argument in atoms? filter(t_1 \ne t_2) was never defined (but used on p 9) Lemma 3 is just an observation, not a lemma (and it's statement is not very precise either - could change? but can also remain the same, right?). Definition 6 appears to suggest (in the join part) that \bot has magical semantics, i.e., \bot = a, for any a. Also, what is R \setminus S? And how can it be equal to a set of variables? In general, I find Definition 6 very confusing - the notion of solution mapping compatibility is gone; the fact that solution mappings are partial functions (perhaps, with different domains even for the same pattern) is ignored... What are "outer and inner patterns" (p 12)? I can imagine that an inner pattern is in fact the argument of EXISTS. Does the outer pattern include the filter or not? Footnote 13 only brings confusion - why would inner and outer patterns correspond to P FILTER EXISTS Q and P FILTER NOT EXISTS Q? Typos (a very incomplete list) p 1: Logische (not Logishe) p 2: non-natural -> unnatural p 7: are better *understood* p 8: the main technique ... *has* been (not have) - but the sentence is awkward anyway p 12: has at least a solution -> has a solution (the article a serves the same purpose) Post-rebuttal Comments It is the case that by "the equality X=Y "pass" the value from X to Y", the authors mean some sort of an assignment operation Y := X? That is quite unexpected in Datalog (which is meant to be declarative, with a simple logic-based semantics). What exactly is a correlated variable in Datalog? As concerns SQL/relational algebra, then the queries indeed can have correlated variables, but is is a purely syntactic issue to determine which attribute name is taken from which relation. Thus, I do not really see a problem here (taking into account that all tuples are *total* functions on the attributes, unlike the partial functions in SPARQL).

**Review 4 (by Antoine Zimmermann)**

(RELEVANCE TO ESWC) Any study about SPARQL is relevant to ESWC (NOVELTY OF THE PROPOSED SOLUTION) as far as I know, this specific problem has not been presented at a conference or in a journal before (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) There are too many things that are missing or problematic (EVALUATION OF THE STATE-OF-THE-ART) to the best of my knowledge, SotA seems rather complete (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) a lot of things need to be revised to demonstrate the approach convincingly (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) there is no experimental study, but even the deductions from the formal parts can't be verified (OVERALL SCORE) Answer to the authors' rebuttal: You did your best to clarify some imprecisions and acknowledge some inaccuracies or missing explanations, but there are important remaining issues that would have to be fixed for the paper to have a chance to be accepted (at least by me). There is one point that I did not see at the time I reviewed the paper: you say that the SPARQL specification is ambiguous, or even contradictory, highlighting 2 pieces of text. You are mixing two distinct notions, one about subqueries (which have to be evaluated bottom-up in all cases) and one about FILTER (NON) EXISTS, which is not a subquery and must be evaluated after substituting the solution found from matching the outer query. This leads to a non ambiguous definition that both Patel-Schneider & Martin 2016 and Kaminskin et al. 2017 formalise equivalently, but it leads to problems as substitution may lead to a non-syntactically valid query. Erratum: existstential is in fact the correct spelling from the paper's title, but it should be type set as "EXISTStensial" instead. Summary of the Paper: The paper discusses the problem of substitution and correlation in SPARQL queries, particularly in the presence of subqueries, where variables may have to be substituted with the result of the enclosing query. It is shown that the problem is not sufficiently well documented in the SPARQL standard, leading to misinterpretations of the semantics of SPARQL in current implementations. The paper formalises the semantics of subqueries using Modal Nested Datalog with null values and proposes ways of intepreting SPARQL FILTER EXISTS queries. Strong Points: The root of the problem is well exposed. The approach using datalog is interesting and promising. Weak Points: There are far too many little things that are left implicit and many little mistakes in the formal parts of the paper. Besides, important properties and lemmas that are not trivial have no proofs. With all this considered, it is impossible to ensure that the contribution is even valid. Detailed comments: While I find the paper very promising, its main problems are that it has a lot of missing information to make it self contained (e.g., the semantics of datalog is not properly and fully presented) and it has a lot of little mistakes that, in accumulation, makes the results simply impossible to check. I comment now precisely the parts that are unclear or incorrect: Section 2: - "are not repeated inside the same predicate" -> why this restriction is made? - "if either \theta(L) is an equality formula a = b" -> the syntax does not have equality formulas like this. This should be filter(a = b) - "where a and b are the same constant" -> in this case, it should be "a = a" (in fact, "filter(a = a)") - this semantics does not explain what "S,\theta\models P" is for a program P - "If a rule R ... built-in." -> this is not clear - "as can be inferred from the rules of P" -> this is not precise. What does it mean to be inferred? What is the "inference processed"? Section 3: - Def.1 seems to allow queries to be negated, as they are atoms. Nothing is said about negated queries anywhere. How positively occurs is defined if there can be negated queries? - "The assignment [...] can be done by adding a literal l(Y) to R [...] and the fact l(a) to P. [...] let(Y = a) as a syntactic sugar to denote the result of assigning a to Y in R." -> assignment requires adding something to a rule, *and adding a fact to a program*. But let(Y = a) is used everywhere as if it was just a literal in a rule. - in the example with top-down and bottom-up, there is "let(Y = \theta(X))" -> it should be "X = \theta(X)", right? - "evaluate P" -> what does it mean? - "compatibility" -> what is this? - Lemma 1: what does "moving down literals" mean? How is the dependency graph built for Nested Datalog? - "we have defined logical substitution for a rule but not for a whole query" -> this suggests that the definition of logical substitution for a query somehow utilise the definition of substitution for a rule? - on page 7, "filter(Z \neq Y)" is not part of the language. Is it equivalent to \neg(filter(Z = Y))? - "Improper substitution ... of queries" -> this paragraph is not clear - Lemma 2: * "extension database" is not defined * "p(X1, ..., Yn)" -> why is there Yn? should it be Xn? * "{p(X1, ..., Yn) | E,\phi \models p(X1, ..., Yn)}" -> this is obviously wrong. This should be a set of facts, not literals. Morever, ans_E(Q) has not been defined for Nested Datalog * the proof has Xj, ..., Xk. I think it should be Xi1, ..., Xik with 1\leg i1 < ... < ik \leg n. Moreover, the proof is gibberish. I can't even see how it is a proof of the lemma. - "A corollary of this lemma is that nesting does not add expressive power to Datalog" -> How does this follow from the lemma? Section 4: - "The main technique to codify incompleteness have been the null values" -> did you survey the techniques for expressing incompleteness and found that using null values is more statistically occuring? - Def.4: can \bot appear in filter(t1 = t2)? can it be used in facts? can the goal be modal? - what is an "instance of a literal"? - can \bot be substituted for a variable? - what does \theta\models [something] mean? - "L is let(X = t)" -> as said, "let(X = t)" is syntactic sugar. It is not a literal. - what is \theta(P) for a program P? - "a fact F\in S such that F \leq \theta(L)" -> isn't it the opposite (F is more informative?) - Def.5 is said to be the "Semantics of Modal Datalog" but wasn't what precedes already part of the semantics? - "\theta is the less informative substitution" -> what does it mean? - why is the notion of \models so detailed for Modal Datalog when it was only sketched for Dalatog and not explained for Nested Datalog? - Example 1: "ans_E(Q)" is used here but not defined for Modal Datalog - "As this literal [let(X = \theta(X))] is syntactic sugar for introducing a predicate formula u(X)" -> as said before, this should also introduce u(\theta(X)) to the program - "see Def.5" -> I completely fail to see how what is said here is supported by Def.5 - "As we saw" -> I did not see it, where was it? - "the value of a variable X coes from more than one source" -> I do not understand - "will be provided by two sources, namely \theta(X) and the literal q(X)" -> I do not understand what this means - "is considered compatible with a constant" -> what does "compatible" means here? - Proof a Lemma 3: I do not understand how Program B is instantiating anything presented before? Section 5: - "set semantics (as is done in [6] and [2])" -> [2] is entitled "The Multiset Semantics of SPARQL patterns"! - in the definition of the algebra operator: * what is \phi? * what does "\mu \models \square\phi" mean? * the last two operators are not properly defined. These are not set-theoretic definitions of sets (use of a modal sentence as a condition) - "most informative value" -> which means? - "we have selected one mode for each modal operator" -> what does it mean? - footnote 11: "it is well known that the standard SPARQL operators are definable in the algebra presented here" -> it is not well known, I'm afraid. Especially since SPARQL uses a list semantics (elements are ordered and can repeat, and the algebra presented here does not) - Def.6 uses predicate "p" all the time. I the same predicate is used each time, it is going to be of multiple arities - \phi must be of the form "t1 = t2" - Lemma 4 is not proved - With Def.7, there is a translation of FILTER EXISTS to Modal Datalog. So if Modal Datalog has a well defined semantics, then there is no ambiguity on how to interpret the filter. Yet the rest of the paper says that FILTER EXISTS can be interpreted in different ways. I do not understand this. - footnote 14: "It is not difficult to" -> really? - "No system agrees with this evaluation" -> the paper only show the answers of 4 systems. Has the study been made on all existing SPARQL engines? - The end of Section 5 is confusing. Typos, grammar, and other minor issues: Introduction: - "It is well known the complexities ..." -> well known for whom? I was not aware of these complexities before reading this paper. Avoid this kind of statements in a scientific paper - "We --the Semantic Web advocates-- are in a problem" -> please do not speak in the name of "the Semantic Web advocates". I consider myself part of them and I disagree with your statement - p3: "a extension" -> an extension - "and what do they mean ..." -> and what they mean Section 2: - "and an negative" -> a negative - "A variable X is occurs" -> "A variable X occurs" or "A variable X is said to occur" - "same predicate than L" -> same predicate as L Section 3: - "in a equality formula" -> in an equality formula - "in Relation Algebra because optimization concerns" -> because of - Def.3: "resulting of replacing" -> resulting from replacing - "are better understand" -> understood - "because is not free" -> because it is not free - "substitution where applied" -> were - "the logical chain start" -> starts - Proof of lemma 2: * "Is well known" -> It is * "to a such formula" -> to such a formula * "queries suffices giving" -> it suffices Section 4: - "in one source X us null" -> is null - "We showed that in Nested Datalog logical substitutions" -> ... in Nested Datalog, logical substitutions Section 5.2: - "a relation $r$ and a algebraic" -> an algebraic - after Def.7, item 1: "Free variables A free" -> add dot after "variables" - Case Q = Q2 "or do not allowing" -> or not allowing References: - 17. "Existstential" -> Existential

**Metareview by Olaf Hartig**

As the title suggests, the focus of this paper problems of correlation and substitution in SPARQL queries with subqueries. To address these problems this paper introduces a formalization based on Modal Nested Datalog with NULL values, and the authors propose approaches to interpret queries that use the FILTER EXISTS feature. While the reviewers agree that the problem addressed in the paper is interesting, some of them have also highlighted numerous small issues. Some of these points remain unconvincing to the reviewers after the rebuttal. Additionally, there are concerns about the authors’ interpretation of the parts of the SPARQL specification that are related to the topic of the paper. Due to these points, the paper cannot be accepted for publication in the conference.