Paper 150 (Research track)

Querying APIs with SPARQL: language and worst-case optimal algorithms

Author(s): Adrián Soto, Juan Reutter, Domagoj Vrgoc, Fernando Pieressa, Matthieu Mosser

Full text: submitted version

camera ready version

Decision: accept

Abstract: Although the amount of RDF data has been steadily increasing over the years, the majority of information on the Web is still residing in other formats, and is often not accessible to Semantic Web services. A lot of this data is available through APIs serving JSON documents. In this work we propose a way of extending SPARQL with the option to consume JSON APIs and integrate the obtained information into SPARQL query answers, thus obtaining a query language allowing to bring data from the “traditional” Web to the Semantic Web. Looking to evaluate these queries as efficiently as possible, we show that the main bottleneck is the amount of API requests, and present an algorithm that produces “worst-case optimal” query plans that reduce the number of requests as much as possible. We also do a set of experiments that empirically confirm the optimality of our approach.

Keywords: SPARQL; Query Languages; JSON APIs

 

Review 1 (by anonymous reviewer)

 

(RELEVANCE TO ESWC) -
(NOVELTY OF THE PROPOSED SOLUTION) -
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) -
(EVALUATION OF THE STATE-OF-THE-ART) -
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) -
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) -
(OVERALL SCORE) In their paper, the authors propose an extension of the SPARQL querying language to incorporate information stored in the JSON format by means of calls to a corresponding API. They describe syntax and semantics of this extension, present an implementation of their approach and investigate the problem of query optimization in terms of minimizing API calls, which are assumed to be the bottleneck at query execution. They conduct experiments to practically demonstrate optimality of their algorithm in that respect.
The paper's topic is clearly relevant for ESWC. The paper is very well structured and nicely written and covers all the bits and pieces from theoretical to implementation and optimization aspects. It describes a nice and clean way to open up the world of JSON data for semantic technologies.
Small detailed remark: it is not specified what the unit of the access times displayed in Table 1 is. I assume it is seconds? Please specify!

 

Review 2 (by anonymous reviewer)

 

(RELEVANCE TO ESWC) An algorithm that allows conjunctive queries to be evaluated with the worst-case optimal number
of API calls is relevant to the Semantic Web community.
(NOVELTY OF THE PROPOSED SOLUTION) The main contribution seems to be the algorithm, which is unclear and seems to be mainly derived from work in the domain of relational databases.
I am unsure about the practical usefulness of SPARQL querying over JSON APIs, so I don't see this as a significant contribution (see comments below)
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The worst-case optimal algorithm (section 4.2) that is proposed by the authors is quite vague,
and I'm not able to fully understand it.
(EVALUATION OF THE STATE-OF-THE-ART) Related work is mentioned briefly, but many details that are required for understanding some of the formalisms are omitted.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) An online demonstration is provided, together with a link to the source code.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments are sufficiently reproducible, as the source code and used versions are provided.
(OVERALL SCORE) This article introduces a method for querying JSON APIs as a SERVICE-based extension of SPARQL.
This extension is described formally, and a naive algorithm is provided for evaluating this extension
on top of any existing SPARQL engine.
Furthermore, an algorithm is provided that allows conjunctive queries to be evaluated with the worst-case optimal number
of API calls using concepts from the relational domain.
I am unsure about the practical usefulness of SPARQL querying over JSON APIs.
Many JSON-based Web APIs use some form of API token,
which seems to be an issue as well in the online demo.
For complex queries with many different forms of authentication,
this can require a lot of manual setup work.
Would it make sense to include this in the SERVICE clause?
Or are there other ways to improve this?
Furthermore, this work does not allow entity linking,
which allows only basic interactions (literals) with JSON APIs.
The worst-case optimal algorithm (section 4.2) that is proposed by the authors is quite vague,
and I'm not able to fully understand it.
Presenting the algorithm more in line with the naive algorithm presented in section 3.2
would make it easier to understand.
As I see this as the main scientific contribution of this work,
I do not recommend an accept.
Page 2
"We picked JSON because it is currently the most popular data format in Web APIs"
A strong statement like this requires a citation.
Page 2
The SERVICE clause is specific to querying SPARQL endpoints.
Therefore, extending SERVICE for JSON API access can be confusing.
I would therefore recommend introducing a new clause for this,
so that syntax conflicts would not occur.
Table 1: time unit is missing
Page 4
"We assume the reader is familiar with ... the abstraction proposed in [28]"
This is not a reasonable assumption.
At least the basic idea of this abstraction should be included.
Page 4
The basic notions regarding graph patterns are confusing.
A basic graph pattern is typically defined as a set of triple patterns,
so calling the basic graph pattern a triple pattern can be confusing.
Furthermore, a triple pattern is defined as (I∪V)×(I∪V)×(I∪V),
however, this is mostly defined as (I∪B∪V)×(I∪V)×(I∪B∪L∪V).
Page 4
"we denote the variables appearing in P by var(P)."
This is not concrete enough.
Is the result of this operation a set? Or can it contain duplicates?
Are variables from OPTIONAL included?
Page 4
"we define SPARQL queries as expressions of the form SELECT W..."
This can be confusing, as this does not conform to the SPARQL 1.1 definition of a SPARQL query,
as clauses such as UPDATE, DESCRIBE, ASK, CONSTRUCT, but also things such as path expressions are not included.
Page 4
"Given a graph G and a pattern P, we denote the evaluation of a graph pattern P over G as..."
It should be mentioned that this evaluation results in a set of mappings.
Page 5
"A URI Template [21] is an URI" -> A URI Template [21] is a URI
Page 5
The explanation of the URI template should be improved,
as it could be interpreted as not being correct according to the RFC.
Furthermore, it is not clear if the question mark should be part of the variable label or not.
For instance, the example template "http://weather.api/request?q={?city},{?country}"
would result in URIs of the form "http://weather.api/request?q=?city=somecity,?country=somecountry",
which does not seem correct.
I assume "http://weather.api/request?q={city},{country}" or "http://weather.api/request{?city,country}" was intended.
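The difference between the three template forms can be checked with a minimal sketch. It implements only a small subset of RFC 6570 (simple expansion `{var}` and form-style query expansion `{?var,...}`); the function name `expand` and the values "Zagreb" and "Croatia" are illustrative assumptions, not part of the paper or of any library.

```python
import re
from urllib.parse import quote


def expand(template: str, values: dict) -> str:
    """Expand a small subset of RFC 6570: {var,...} and {?var,...} only."""

    def substitute(match):
        expression = match.group(1)
        if expression.startswith("?"):
            # Form-style query expansion: emits "?name=value&name=value".
            names = expression[1:].split(",")
            pairs = [
                f"{name}={quote(str(values[name]), safe='')}"
                for name in names
                if name in values
            ]
            return "?" + "&".join(pairs) if pairs else ""
        # Simple expansion: emits only the encoded values, comma-separated.
        names = expression.split(",")
        return ",".join(
            quote(str(values[name]), safe="") for name in names if name in values
        )

    return re.sub(r"\{([^{}]*)\}", substitute, template)


values = {"city": "Zagreb", "country": "Croatia"}  # hypothetical bindings

# Template as printed in the paper: the "?" inside the braces is an operator.
print(expand("http://weather.api/request?q={?city},{?country}", values))
# The two alternatives suggested above.
print(expand("http://weather.api/request?q={city},{country}", values))
print(expand("http://weather.api/request{?city,country}", values))
```

Under this reading, the first template yields a doubled "?" and embedded "name=" pairs, while the two alternatives yield a comma-separated value list and a conventional query string respectively.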
Page 6
"If there is some ?x ∈ var(U) such that μ(?x) is not defined, we define μ(U) as an invalid IRI that will result in an error when invoked."
What happens when a variable is bound in the scope of OPTIONAL?
Page 6
What if the graph pattern has variables depending on the response of the mapping procedure?
Is the system capable of continuing with resolving the graph pattern?
Page 10
"with with relations" -> "with relations"
Page 14
"It would be also interesting to test how issuing API calls in parallel affects the running times of diferent algorithms."
So this means that for every binding, the evaluation blocks until the API request resolves?
This seems like a major bottleneck.
Next to reducing the number of requests, which the authors do,
I think making this non-blocking would have a major impact on the overall performance as well.
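To illustrate the point, here is a minimal sketch that contrasts issuing calls one at a time with issuing them concurrently. It does not reflect the authors' implementation, whose blocking behaviour is only inferred above; `fake_api_call` is a hypothetical stand-in that simulates network latency with `asyncio.sleep` instead of a real HTTP request.

```python
import asyncio
import time


async def fake_api_call(binding: str, latency: float = 0.05) -> str:
    # Hypothetical stand-in for one HTTP request; sleep simulates latency.
    await asyncio.sleep(latency)
    return f"response-for-{binding}"


async def blocking_style(bindings):
    # One request at a time: total time grows with len(bindings) * latency.
    return [await fake_api_call(b) for b in bindings]


async def non_blocking_style(bindings):
    # All requests in flight at once: total time is roughly one latency.
    return await asyncio.gather(*(fake_api_call(b) for b in bindings))


def timed(coro):
    # Run a coroutine to completion and report its wall-clock duration.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


bindings = ["a", "b", "c", "d"]  # hypothetical variable bindings
sequential, t_seq = timed(blocking_style(bindings))
concurrent, t_con = timed(non_blocking_style(bindings))
print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

Both styles issue the same number of requests, so this is complementary to the request-count optimization rather than a replacement for it.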
=== Post-rebuttal comment ===
I thank the authors for their reply.
My main concerns regarding the practical usefulness of SPARQL querying over JSON APIs and the vagueness of their algorithm were not resolved after the rebuttal.
I do appreciate the effort of the authors to include a simple example of the algorithm,
but unfortunately the rebuttal space is too limited for clarifying this sufficiently,
and it does not address the more complex workings of the algorithm as noted by the authors.
My initial scores therefore remain the same.

 

Review 3 (by Catherine Faron Zucker)

 

(RELEVANCE TO ESWC) This paper addresses the problem of integrating non-RDF data with the Semantic Web by querying Web APIs within SPARQL queries. The authors propose an extension to the SERVICE clause to query JSON APIs; they describe its syntax and semantics, a basic implementation of it within the Jena framework, and an optimization that limits the number of API calls. This optimization is based on state-of-the-art optimization techniques for RDB.
(NOVELTY OF THE PROPOSED SOLUTION) The proposed extension was already presented in a demo paper at ISWC 2017. In this paper the syntax and semantics are detailed, together with two algorithms. There is therefore a real novel contribution in the submitted paper.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) To query Web APIs other than JSON APIs, another similar extension should be proposed, one for each kind of API.
The proposed optimization of the baseline implementation requires limiting SPARQL queries to conjunctive patterns with filters. (Even if most SPARQL queries do not use the rest of the expressivity of the language, this is a limitation.)
The approach has been implemented on top of Jena. To what extent is the implementation independent of Jena, and can it be used on top of another SPARQL engine?
(EVALUATION OF THE STATE-OF-THE-ART) Related work is summarized in a single paragraph, but this is sufficient to position the proposed approach.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) I see 4 aspects of the proposed approach which should be further discussed in the paper:
- It is an extension to SPARQL and therefore not standard.
- Each time an end user (a Semantic Web programmer) wants to write a SPARQL query integrating a call to a JSON API, they must be aware of the API documentation and the structure of the JSON response.
- To query Web APIs other than JSON APIs, another similar extension should be proposed, one for each kind of API.
- The proposed optimization of its implementation requires limiting SPARQL queries to conjunctive patterns with filters. (Even if most SPARQL queries do not use the rest of the expressivity of the language, this is a limitation.)
The approach has been implemented on top of Jena. To what extent is the implementation independent of Jena, and can it be used on top of another SPARQL engine?
I find the presentation of the optimization of the basic implementation of the approach presented in section 4 difficult for non RDB specialists. 
The reader is expecting a “Proof” section after each theorem or proposition or lemma. I do not understand the difference in status between these three kinds of assertions in the section.
Could the contribution be generalized to a special class of (standard) SPARQL queries instead of being specific to the proposed extension? 
Could it be a contribution in the domain of RDB (“we show that [tight bounds for the size of the outputs of join queries] can be extended for queries that use access methods”)?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Code and material are available online.
(OVERALL SCORE) This paper addresses the problem of integrating non-RDF data with the Semantic Web by querying Web APIs within SPARQL queries. The authors propose an extension to the SERVICE clause to query JSON APIs; they describe its syntax and semantics, a basic implementation of it within the Jena framework, and an optimization that limits the number of API calls. This optimization is based on state-of-the-art optimization techniques for RDB.
The paper is well written, well structured, and it addresses a research question very relevant for the Semantic Web community.
The proposed extension was already presented in a demo paper at ISWC 2017. In this paper the syntax and semantics are detailed, together with two algorithms. There is therefore a real novel contribution in the submitted paper.
Generally speaking, I see 4 aspects of the proposed approach which should be further discussed in the paper:
-	It is an extension to SPARQL and therefore not standard. 
-	Each time an end user (a Semantic Web programmer) wants to write a SPARQL query integrating a call to a JSON API, they must be aware of the API documentation and the structure of the JSON response.
-	To query Web APIs other than JSON APIs, another similar extension should be proposed, one for each kind of API.
-	The proposed optimization of its implementation requires limiting SPARQL queries to conjunctive patterns with filters. (Even if most SPARQL queries do not use the rest of the expressivity of the language, this is a limitation.)
The approach has been implemented on top of Jena. To what extent is the implementation independent of Jena, and can it be adapted to another SPARQL engine?
I find the presentation of the optimization of the basic implementation of the approach presented in section 4 difficult for non RDB specialists. 
The reader is expecting a “Proof” section after each theorem or proposition or lemma. I do not understand the difference in status between these three kinds of assertions in the section.
Could the contribution be generalized to a special class of (standard) SPARQL queries instead of being specific to the proposed extension? 
Could it be a contribution in the domain of RDB (“we show that [tight bounds for the size of the outputs of join queries] can be extended for queries that use access methods”)?

 

Review 4 (by anonymous reviewer)

 

(RELEVANCE TO ESWC) Extending SPARQL with API invocation is a timely and interesting topic.
(NOVELTY OF THE PROPOSED SOLUTION) As far as I know, the solution is quite novel, even if - as the authors mention - there are different approaches to do this in a different way (by wrapping APIs).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors made quite an effort in detailing and formalizing their approach. Still, the proposed extension seems to me a bit limited in practice.
(EVALUATION OF THE STATE-OF-THE-ART) The authors seem to be well aware of the state of the art, which is however very briefly discussed.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The formalization of the proposed approach and its characteristics is quite complete. The initial part of section 2 could have been skipped, since it is basic Semantic Web knowledge that any reader should know. I just found it a bit annoying that the authors referred to other documents for details: if some detail is needed to understand the content of the paper, it must be reported; if it is unnecessary, the reference to another document is pleonastic.
By the way, I was unable to retrieve [1,2,3].
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation is sound and demonstrates the reduction in API calls of the proposed WCO algorithm. The only limitations I see in the setup are that (1) the dataset and "external" APIs were synthetic and (2) there is no comparison with completely alternative approaches, like an RDF wrapping of the external API. Also, I would have loved to see a real-world evaluation with an actual existing API.
(OVERALL SCORE) The paper is interesting and addresses a timely topic. The formalization of the problem and the proposed approach is quite extensive. I am less convinced of the more pragmatic aspects, since the evaluation is done on a simulated scenario and no comparison is made with quite different approaches. Also, the proposed extension makes quite strong assumptions about the "family" of JSON APIs that are considered. What if JSON results need to be further processed? What if the URI template variable instances are not immediately available? In the given initial example there is a difference between the place label "Ben Nevis" and the required parameter "Ben_Nevis"... Moreover, it seems that a deep understanding of the API is needed to include it in the proposed extension. So, in short, while I think it is an interesting first step towards a richer solution, I am a bit doubtful about the actual/pragmatic applicability of the proposed approach.
*** after authors' rebuttal ***
I thank the authors for their clarification, I agree that the paper touches upon a very important issue; I still have the feeling that the reported work is somewhat limited, thus I keep my "weak accept" score. Good luck!

 

Metareview by Hala Skaf

 

This submission addresses the problem of integrating non-RDF data with the Semantic Web by querying Web APIs within SPARQL queries.
As pointed out by the reviewers, the topic is relevant to ESWC. The submission describes a syntax and semantics that allow SPARQL queries to connect to HTTP APIs returning JSON. A worst-case optimal algorithm for processing the extended SPARQL queries is proposed, with an implementation and evaluation.
Although the reviewers are a bit doubtful about the actual/pragmatic applicability of the proposed approach, this submission is worth presenting at ESWC.
