On the Semantics of TPF-QS towards Publishing and Querying RDF Streams at Web-scale
Author(s): Ruben Taelman, Riccardo Tommasini, Joachim Van Herwegen, Ruben Verborgh, Emanuele Della Valle, Erik Mannens
Full text: submitted version
Abstract: RDF Stream Processing (RSP) is a rapidly evolving area of research that focuses on extensions of the Semantic Web in order to model and process Web data streams. While state-of-the-art approaches concentrate on server-side processing of RDF streams, we investigate the TPF-QS method for server-side publishing of RDF streams, which moves the workload of continuous querying to clients. We formalize TPF-QS in terms of the RSP-QL reference model in order to formally compare it with existing RSP query languages. We experimentally validate that, compared to the state of the art, the server load of TPF-QS scales better with increasing numbers of concurrent clients in case of simple queries, at the cost of increased bandwidth consumption. This shows that TPF-QS is an important first step towards a viable solution for Web-scale publication and continuous processing of RDF streams.
Keywords: Linked Data; RDF stream processing; continuous querying; TPF-QS; RSP-QL; SPARQL
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper deals with the problem of publishing and querying RDF streams. (NOVELTY OF THE PROPOSED SOLUTION) The authors propose a novel solution, but do not make clear exactly in what aspects and how it differs from existing approaches. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution appears to be correct and complete. (EVALUATION OF THE STATE-OF-THE-ART) The state of the art is nicely summarized and discussed. It could be improved by including a brief discussion of the differences between the proposed approach and existing approaches, which are now discussed in more detail in Table 1. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper has a nice discussion of the proposed approach, which is easy to follow. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental evaluation covers several aspects of the problem and the proposed approach. However, it does not shed much light on when to use the proposed approach versus existing systems, e.g., what the sweet spot of query/data rate is that would make TPF-QS better or worse than an existing streaming or SPARQL engine. The authors should add titles to the y-axes of the graphs in Figures 3-5. (OVERALL SCORE) This paper proposes a solution for publishing and querying slow RDF streams, and demonstrates the tradeoffs of the proposed solution. Strong points: 1. Deals with a real problem 2. Proposes an efficient solution 3. Experiments demonstrate the superiority of the proposed solution Weak points: 1. Experiments do not reveal the operating sweet spot of the proposed solution 2. Discussion of related work could be improved 3. Explain in one sentence, and add references for, the Kruskal-Wallis and Nemenyi tests
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper is highly relevant to ESWC, especially as it tries to address one of the topics of interest highlighted in Mobile Web, Sensors and Semantic Streams. (NOVELTY OF THE PROPOSED SOLUTION) The solution is just an incremental extension of the authors’ previous work; the idea is not innovative in terms of the state of the art of query federation in general. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper tries to map its execution semantics to the so-called RSP-QL semantics, which is partially discussed in reference . Personally, I think a unified semantics for RSP processing is not yet established; the field is too young to have an agreed, sound semantics (denotational and operational). The paper claims a formalisation of operational semantics, which I don’t think is clearly addressed in the paper yet. (EVALUATION OF THE STATE-OF-THE-ART) The paper evaluates against two current RSP systems, C-SPARQL and CQELS, but the comparison results are controversial. The paper does not touch on similar approaches such as query federation and publish/subscribe systems. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The formal operational semantics is touched upon but not fully addressed. The technical details of how the rewriting works are left to their workshop paper. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The implementation is open source and the evaluations are provided for reproduction. The generality of the experimental study is not convincing. (OVERALL SCORE) The paper investigates the TPF-QS method for server-side publishing of RDF streams, which moves the workload of continuous querying to clients using the TPF client previously developed by some of the authors. Via its experiments, the paper claims that its approach helps the server load scale better with increasing numbers of concurrent clients in the case of simple queries, at the cost of increased bandwidth consumption.
In general, the paper does not contribute much in the way of new scientific or technical breakthroughs, being an insignificant extension of its workshop paper. Some experimental results are obvious and misleading. Note that a previous version of this paper was submitted to ESWC 2017; I rejected that version in the hope that the authors would make a good paper out of it. Disappointingly, after one year, the authors submitted nearly the same content, so I reuse most of my comments from last year. *** Weak Points (WPs) *** 1) Controversial motivations: I’m not convinced by the authors’ argument that their solution is more scalable than server-side processing. By moving the processing to the clients, this approach loses opportunities for optimising shared query loads among concurrent queries. I think the workload and memory savings from such sharing might surpass the savings from avoiding always-on concurrent network connections, not to mention that pushing data out to clients also consumes computing and memory resources in data exchange operations. On top of that, the paper puts itself in competition with well-established publish/subscribe systems. Hence, the authors need to work hard to find a sharp angle to motivate their approach. 2) Weak formalisation ground: The paper tries to map its execution semantics to the so-called RSP-QL semantics, which is partially discussed in reference . Personally, I think a unified semantics for RSP processing is not yet established; the field is too young to have an agreed, sound semantics (denotational and operational). The paper claims a formalisation of operational semantics, which I don’t think is clearly addressed in the paper yet. For a future manuscript, I would recommend that the authors consider the challenges of execution semantics in distributed settings, such as correctness, latency and time domain skew, as discussed in the paper cited below.
3) Weak evaluation: The authors claimed that they “created a set of SPARQL queries to semantically equivalent results using on the RSP-QL-based formalization from Section 4 and operational engine comparisons from Section 5, so that they result in an evaluation frequency of 10 seconds. This frequency was chosen because earlier experiments have shown that TPF-QS works best with this order of frequencies or slower. Additional details on how these queries were semantically transformed can be found …”. I’m not convinced that this can guarantee semantic equivalence with C-SPARQL and CQELS, because C-SPARQL and CQELS have different mechanisms for triggering execution and the semantics of their outputs differ. Secondly, it’s impossible to compare the CPU load of a periodic execution mechanism (C-SPARQL, TPF-QS) with that of an input- and timer-triggered one (CQELS), because the latter triggers execution much more frequently: its processing load comes not only from the stream input but also from the timers that regularly check for expired data items in the 10-second windows. To a certain extent, pushing some parts of the continuous processing load over stream data to the consumer side might be a good idea. But the claim on scalability is controversial. Looking closely at the conclusion derived from the experiments, “server load scales better with increase of number clients….”, I think it’s not fair to look only at Fig 4 and 5 and conclude that TPF-QS scales better or that the paper's approach helps to scale the server load. The behaviour is quite obvious: TPF-QS has 8 machines to run its clients, so the accumulated processing load is divided into 8 computing nodes (1 for server, 8 for clients), whilst the C-SPARQL and CQELS engines have to process 20-60 concurrent queries on 1 processing node. 4) Poor presentation: The flow of the paper is sometimes confusing to me, e.g., the wording, main messages and contributions of the paper.
Sections 2 and 3 wander around several things that are obvious. The paper mentions “formalising…”, which makes it sound as if its formalisation is one of the innovative aspects of the paper, but on reading Sections 4 and 5, no significant contributions are given there. The evaluation lacks several technical details, which might lead to confusion in interpreting the evaluation results. For instance, there is no explanation or definition of “query completeness”. Also, there are no details on how the metrics are measured, e.g., latencies in C-SPARQL and CQELS. Overall, the storyline from the title “On Semantics…” to the section titles and content flow is misleading for me. ***Strong Points*** 1) The paper introduces an interesting angle on processing RDF stream data based on their previous work on Linked Data Fragments. 2) The authors spent good implementation effort to realise their solution and make it accessible. 3) The evaluation is reproducible.
Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 12 (August 2015), 1792-1803.
====After rebuttal==== I would like to thank the authors for responding to my concerns. However, I disagree with the authors' main arguments, as follows: ----Motivations---- It's not true that C-SPARQL, CQELS or other current state-of-the-art systems are server-side-only solutions. Moreover, the paper over-claims that TPF-QS makes a "paradigm shift" by moving the processing load to the client side for the sake of scalability. In fact, these engines can be decoupled from stream sources and then process the rest of the processing pipeline on another computing node.
In fact, there are some recent implementations on top of C-SPARQL and CQELS that can consume stream sources via WebSocket, MQTT or HTTP. So, moving the processing load is just a deployment choice, in which the TPF-QS processing engine does a job similar to that of C-SPARQL and CQELS. Server-side coupling of current systems is therefore not a limitation, and moving processing to multiple clients is a capability that other existing systems can support as well. Now to the debate on whether executing processing on the client side is actually more efficient and more scalable than a centralised solution that serves multiple clients. First, for server-side caching, some good performance indications from "Triple Pattern Fragments, Verborgh et al." do not mean it is universal, as materialised view techniques have been used for millions of connections for the last 20-30 years. The authors might argue that they target publishing servers with limited resources or capabilities, such as sensor gateways like a Raspberry Pi, which do not have such complicated processing capabilities. I personally think that is a good direction for the paper to motivate its contribution. However, I don't think playing the "scalability" card is the right move here. Bear in mind that the common practice in collecting and processing stream sources from resource-constrained devices is to relay stream data to bigger gateways to avoid multiple connections, for the sake of robustness and security, so more effort is needed to argue convincingly against this. ----Formalization Ground----- Let's dive into the RSP-QL paper to see whether it provides a good ground for this paper's operational semantics. RSP-QL makes a very strict assumption that the "time required to evaluate the query current input and to produce the portion of answer is lower than the time unit". However, the paper does not discuss or align with this assumption in its definitions, especially Definitions 5, 6 and 7.
Oddly, in Section 4.3, paragraph 2, the authors mention "we assume that all query evaluations take one time unit, except for the evaluation at time 6, which takes two time units". One important aspect of RSP-QL's continuous execution semantics is that it does not consider the time domain skew problems that are highlighted in the paper I proposed above. This problem is really common in the current practice of stream processing pipelines in which data arrives from remote sources. I think this problem applies to TPF-QS, and I don't think it is covered in the RSP-QL paper. Therefore, I doubt that the operational semantics presented in this paper guarantee determinism of the output. ----Evaluation------- I checked the paper's appendix (2 pages), and I'm still not convinced that the query semantics are equivalent in both denotational and operational aspects. I noticed that some recent papers have shown that the slide parameter of CQELS does not work, yet the paper claims that it used SLIDE 1s to align the execution steps to obtain comparable results. Given my concerns above, I think the evaluation results are very controversial.
Review 3 (by Alasdair Gray)
(RELEVANCE TO ESWC) This manuscript discusses the formal operation of TPF-QS, a query streamer in the domain of RDF Stream Processing. It is relevant to ESWC. (NOVELTY OF THE PROPOSED SOLUTION) The strategies proposed in this manuscript are interesting and novel. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The technologies proposed in this manuscript are elaborated thoroughly. (EVALUATION OF THE STATE-OF-THE-ART) The state-of-the-art technologies are investigated and evaluated in detail. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The proposed operational semantics are compared with some other solutions available on CityBench, through experiments on the same queries. The results are evaluated using criteria such as server CPU usage, query result latency, result completeness, bandwidth and client CPU usage. The discussion is quite clear and reasonable. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code and setups to repeat the evaluation experiment are open and available online. (OVERALL SCORE) Thank you for the thorough rebuttal. It would be good to see these issues addressed in the final version of the paper. ---- This manuscript proposes TPF-QS -- an extension of a previously proposed query streamer, TPF. It discusses the formal operational semantics of TPF-QS, including the mapping strategy "TpfqsToRspql" to map TPF streams into RSP streams. Experiments are conducted to compare it with some other RSP query languages. Performance and overhead are discussed in detail with the metrics of server CPU usage, query result latency, result completeness, bandwidth and client CPU usage. The work proposed in the manuscript is novel. The logic of the presentation is clear and reasonable. The experiments are well designed, with the code and setups publicly available for reproduction. Some minor weak points do exist, however, as listed below.
- It would be better to give the full name of TPF-QS at the beginning of the manuscript, rather than in the middle, in Section 2.1.
- Page 10, second-to-last paragraph of Section 6.1, last sentence: “… cores machine cores.” Could you check this phrase?
- The font of the text does not seem to be consistent, especially for acronyms in capitals.
The abstract states the expected result: increased scalability, but at the cost of bandwidth. Where is the cross-over point in terms of rate of stream vs speed of processing? It might be worth adding the details of the target use case, slow streams with temporal querying, to the abstract. How does the query processor know which statements in the query are streaming and which are stored? What metadata is required to do this inferencing? Make clear that ?sensor is instantiated to ex:sensor1 based on the results of query 1.3 to materialise 1.5. The stream definition is that of previous RSP engines, i.e. timestamped triples. The RSP-QL model is one of timestamped graphs of triples. http://streamreasoning.github.io/RSP-QL/Abstract%20Syntax%20and%20Semantics%20Document/ Window start is not unique to TPF-QS; it is supported in SPARQLStream. SPARQLStream supports both historic windows and refreshing of stored data sources. The entry in Table 1 for MorphStream should be updated. The evaluation frequency is set to 10 seconds. What was the original evaluation frequency of CityBench? What are the consequences of this decision? Many of the original queries have a range of 3s. Are there other windowing factors that are affected by this choice? CityBench defines 13 queries, but you only mention 12. It would be good to summarise the features of these queries and what makes them simple or complex. Fig 2: Why do queries 2 & 9 have high client CPU usage? Fig 3: Why do the server-side versions plateau at fewer than 20 clients? Are all clients served correctly?
§6.6: Discussion
What is the case where TPF-QS has higher accuracy than one of the RSP engines? TPF-QS is not the first system to support historic queries.
Conclusions
Low-powered sensors generally have energy as a major concern. In that case, TPF is not the appropriate solution, as transmission is far more energy-consuming than processing. I do agree that there is a place for TPF-QS. TPF-QS requires a stream publishing server and a client, just like any of the RSP engines.
Minor issues:
- Listing 1.4: keep the ? with sensor in the SELECT clause.
- p10: number of cores machine cores.
- Footnote 2 does not appear on the same page as the reference.
- p10: These results of these tests
Review 4 (by Pankesh Patel)
(RELEVANCE TO ESWC) RDF Stream Processing (RSP) is a rapidly evolving area of research that focuses on extensions of the Semantic Web in order to model and process Web data streams. The topic area is quite relevant to the conference. (NOVELTY OF THE PROPOSED SOLUTION) With the current state of the art, we have to either keep the server always active with an RSP engine for this slow stream, or we have to use a traditional SPARQL solution that lacks the temporal dimension. Clearly, these solutions are both sub-optimal and they suffer from scalability problems for a large number of clients that need to query the stream. This is because RSP engines have to maintain open connections with each client to push new results, whereas the high complexity of SPARQL queries can lead to low availability in SPARQL endpoints. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors support the contribution by formalizing the operational semantics of TPF-QS using RSP-QL. This enables the authors to formally compare its evaluation with other RSP engines. After that, the authors extensively evaluate TPF-QS using CityBench, an RSP benchmark with real-world data. (EVALUATION OF THE STATE-OF-THE-ART) To the best of my knowledge, the state of the art is complete and comprehensive. In Section 2, the authors first present the related work on querying and publishing Linked Data, after which they present their respective extensions for RDF streams. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) This paper formalizes TPF-QS in terms of the RSP-QL reference model in order to formally compare it with existing RSP query languages. The authors experimentally validate that, compared to the state of the art, the server load of TPF-QS scales better with increasing numbers of concurrent clients in case of simple queries, at the cost of increased bandwidth consumption.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors extensively evaluate TPF-QS using CityBench, an RSP benchmark with real-world data. In the CityBench case, the average number of cars that enter and exit a parking area in a given time interval is monitored. Data are produced by a sensor network deployed over the parking area, where new observations are streamed out every 5 minutes. (OVERALL SCORE) The authors formalize TPF-QS in terms of the RSP-QL reference model in order to formally compare it with existing RSP query languages. They experimentally validate that, compared to the state of the art, the server load of TPF-QS scales better with increasing numbers of concurrent clients in case of simple queries, at the cost of increased bandwidth consumption. I find this paper well-structured and well-written. However, the authors should consider fixing some minor typos.
Metareview by Intizar Ali
This paper presents the query semantics of a triple pattern fragment-based approach for querying RDF streams on the Web. The authors' proposed approach is designed for less frequently updating streams on the Web, where processing is conducted at the client level rather than on the server side, as is the case for RDF stream processing engines. The evaluation results are presented to showcase that the proposed approach outperforms RDF stream processing engines in a few cases. The paper attempts to tackle an interesting issue; however, the reviewers raise a few major concerns regarding the feasibility of the approach, the correctness of the semantics and the reproducibility of the evaluation. We strongly encourage the authors to improve their work following the suggestions and feedback from the reviewers.