The Cost of Querying Knowledge Graphs over Triple Pattern Fragments: An Empirical Study
Author(s): Lars Heling, Maribel Acosta, Maria Maleshkova, York Sure-Vetter
Full text: submitted version
Abstract: Triple Pattern Fragments (TPFs) are a novel interface for accessing data in knowledge graphs on the web. To date, work on performance evaluation and optimization has focused mainly on SPARQL query execution on top of TPF servers. However, in order to devise querying techniques that efficiently access knowledge graphs via TPFs, we need to identify and understand the variables that influence the performance of TPF servers at a fine-grained level. In this work, we measure the performance of TPFs by measuring the response time for different requests, and analyze how the requests’ properties may impact the performance. To this end, we conduct an empirical study over four knowledge graphs in different server environments. As part of our analysis, we provide an extensive evaluation of the results and focus on the impact of the following variables on the response time: triple pattern type, answer cardinality, caching, and environment type. The observed results suggest that all of these variables have an impact on the measured response time.
Keywords: Linked Data; Triple Pattern Fragment; Empirical Study; Querying; SPARQL
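For readers unfamiliar with the interface under study: a TPF request is a plain HTTP GET against a fragment URL, with the subject, predicate, and object positions of the triple pattern passed as query parameters and unbound positions simply omitted. A minimal sketch of how such a request URL is assembled (the endpoint URL is hypothetical, and the parameter names follow the convention of the reference Linked Data Fragments server):

```python
from urllib.parse import urlencode

def tpf_request_url(fragment_base, subject=None, predicate=None, obj=None, page=1):
    """Build a Triple Pattern Fragment request URL.

    Unbound positions of the triple pattern are omitted and treated as
    variables by the server; `page` selects one page of the paginated
    result set.
    """
    params = {"page": page}
    if subject:
        params["subject"] = subject
    if predicate:
        params["predicate"] = predicate
    if obj:
        params["object"] = obj
    return fragment_base + "?" + urlencode(params)

# e.g. the pattern ?s rdf:type ?o against a (hypothetical) DBpedia fragment:
url = tpf_request_url(
    "http://fragments.example.org/dbpedia",
    predicate="http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
)
```

The triple pattern types the study distinguishes (which positions are bound) thus map directly onto which of these parameters are present in the request.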
Review 1 (by Alessandro Margara)
(RELEVANCE TO ESWC) The paper presents an empirical study to assess the performance (response time) of TPF servers and to study the impact of different variables. The paper addresses an interesting problem, relevant for the conference. (NOVELTY OF THE PROPOSED SOLUTION) I am not aware of similar studies. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) I am missing some details regarding the evaluation, which is of primary importance for an empirical study (see my detailed comments). (EVALUATION OF THE STATE-OF-THE-ART) Some background information is missing, and this hampers the readability of the paper (see my detailed comments). (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) I am missing some details regarding the evaluation, which is of primary importance for an empirical study (see my detailed comments). (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I am missing some details regarding the evaluation, which is of primary importance for an empirical study (see my detailed comments). (OVERALL SCORE) The paper presents an empirical study to assess the performance (response time) of TPF servers and to study the impact of different variables. The paper addresses an interesting problem, relevant for the conference. In general, the paper is well written, although it lacks a detailed description of the TPF interface and servers, and this hampers its readability (see the detailed comments below). The evaluation appears to be rigorous, although its discussion would certainly improve if it was supported by a precise modeling of TPF. My main concern with the paper is that it provides an empirical evaluation of TPF without properly introducing the subject and the terminology it adopts. This makes some parts of the paper rather difficult to follow for a reader that is not familiar with the details of the TPF interface and of the TPF server implementation. 
For instance, the concept of pagination is discussed in Section 5, but is not properly introduced in the paper. Similarly, the impact of cache is evaluated without a discussion of caching mechanisms in the context of TPF. I suggest that the authors introduce a background section where they introduce TPF and the related terminology, and refer back to this section while discussing the experiment setup and results. To save space for this section, I would suggest better integrating Section 3 and Section 4, thus avoiding repetition of concepts. Concerning the evaluation, I am missing some details to fully understand the experiment setup. In particular, how are queries submitted? Sequentially, one by one, or concurrently with (partial) overlap of their execution on the server? How exactly is the response time measured? An additional metric that might be interesting to investigate is the throughput, or availability of the server when the load of queries grows. How does the performance change in such a scenario? In addition, I wonder which aspects are representative of the TPF interface and which ones depend on the specific implementation and settings. In conclusion, the paper is interesting and relevant for the conference. The experiments presented appear to be rigorous and the conclusions are supported by the measures. However, I suggest that the authors include a description of TPF to make the paper easier to read and follow. I also suggest that the authors add some details to the experiment setup and evaluation. It would be also interesting (possibly as future work) to include more experiments, for instance on throughput and availability. --- Minor comments page 2: reprodicibility -> reproducibility page 5: sever -> server page 12: the only pattern type, which -> the only pattern type that page 12: results suggests -> results suggest page 13: correlation ca be observed -> can be observed --- Strong points 1) Relevant topic. 2) In general the paper is well written.
3) Novelty of the proposed solution. Weak points: 1) The scope is a bit narrow. 2) A description of the key features of TPF would make the paper easier to read and follow. 3) The experiment setup can be discussed in more detail. 4) The evaluation does not consider throughput. Questions 1) Concerning the evaluation. Can the authors better explain the experiment setup? How are queries submitted? Sequentially, one by one, or concurrently with (partial) overlap of their execution on the server? How exactly is the response time measured? 2) The authors did not consider the throughput in their evaluation. Can they elaborate on this choice? Would it be possible to obtain some measurements of the throughput? --- I'd like to thank the authors for answering my questions. I hope they can integrate the answers in the final version of the paper.
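The measurement question the reviewer raises (sequential vs. concurrent submission, and where the clock starts and stops) can be made concrete with a small sketch: each request is issued strictly sequentially and timed end-to-end with a monotonic clock. This is only an illustration of one plausible setup, not the paper's actual harness; `fetch` is a hypothetical placeholder for the blocking HTTP call:

```python
import time

def measure_response_times(requests, fetch):
    """Issue requests one by one (no overlap on the server) and record,
    for each, the elapsed time from sending the request to receiving
    the complete response. time.perf_counter() is monotonic, so the
    measurements are immune to wall-clock adjustments."""
    timings = []
    for req in requests:
        start = time.perf_counter()
        fetch(req)  # blocks until the full response has arrived
        timings.append(time.perf_counter() - start)
    return timings
```

Under concurrent submission the same per-request timing would additionally include queueing delay on the server, which is exactly why the distinction matters for interpreting the reported response times.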
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) This paper tackles the problem of evaluating the response time of query answering in the Triple Pattern Fragment (TPF) setting. TPFs represent a (partially) distributed way of querying RDF data; this is achieved by serving data in the form of TPFs and moving part of the computation to the client side during query evaluation. Authors extend current research in the field by including more features in the evaluation framework. In general, the paper is relevant to ESWC, although one may ask what the real-world audience for TPFs is; who is actually using them? While traditional endpoint-based (centralized) query processing solutions have a clear outreach, the same cannot be said for TPFs, as far as I know. Authors are encouraged to further motivate why studying this problem is important (in the real world). (NOVELTY OF THE PROPOSED SOLUTION) Authors include into the evaluation framework features that were not previously considered. To be more precise, their evaluation framework includes features that were considered separately in previous work (apart from Pagination and Relation KGs/ TPF instance). What I really like about this paper is the detailed explanation of the evaluation setting and the availability of source code (also for reproducibility). Nevertheless, the novelty is somehow limited. Aspects analyzed in this paper, although not specifically considered in the context of TPFs, have been considered in other benchmark settings. Moreover, I also think that including query complexity in the loop would have made the paper stronger. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) As previously said, the paper has some merit. It is methodologically sound, although I think there are a couple of major issues: (i) authors mention (in Section 3.1) that they do not consider queries having empty answers. I have to strongly disagree with this design choice. An empty answer is still an answer.
There are cases in which one is actually interested in checking whether the query gives no results; (ii) instead of the sampling procedure, why not start from real query logs? Or, if not available, why not *cast* query logs from traditional SPARQL endpoints into a TPF setting? (EVALUATION OF THE STATE-OF-THE-ART) The paper misses some important references when talking about: (i) "growing number of research and governmental organizations, as well as companies are looking into adapting Linked Data as part of the new data economy"; there is no single reference that backs up such a statement (ii) "A variety of interfaces to access and query Linked Data". It would have been nice to present a characterization of previous work based on the degree of centralization (i.e., from completely distributed query evaluation approaches like SQUIN or NautiLOD to completely centralized approaches like SPARQL endpoints, passing by partially-centralized infrastructures like TPFs). (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As previously mentioned, there are some design choices that need to be clarified; for instance, why not include queries with no answers. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Links to the code and the complete evaluation data are available. As for the generality, I don't think that the approach is useful beyond the context of TPFs. (OVERALL SCORE) The paper presents an empirical study to measure the response time of TPF querying infrastructures. Authors combine features from previous work and propose some new ones. The major contribution is a quite detailed experimental evaluation.
Strong points: (i) detailed methodological description of the framework (ii) availability of both the code and the complete evaluation data Weak points: (i) there is no clear justification of why queries that give empty results are not included (ii) the conclusions are not exciting; everything is almost exactly as one would expect (the almost being RQ2) Questions: - Can you please clarify the intuition behind not considering queries that give empty answers? - Please provide further references to back up the statement made in the Introduction (e.g., about the new data economy) -- After rebuttal: I thank the authors for their response and I believe they are studying queries with empty answers.
Review 3 (by Ioanna Lytra)
(RELEVANCE TO ESWC) The current work is very relevant to ESWC since it studies the cost of querying Knowledge Graphs using Triple Pattern Fragments, one of the most popular ways to access and query Linked Data. Therefore, the results discussed in this paper are very interesting for the Semantic Web community. (NOVELTY OF THE PROPOSED SOLUTION) It is not the first time that the performance of TPFs is studied under different conditions and taking into consideration different parameters; the authors cite several works that have performed similar studies like [2,8,6,13], however, this study provides a thorough evaluation and discussion of parameters like triple pattern types, type of execution environment, number of answers, caching, etc. The main novelty of the approach is the proposed architecture for evaluating TPFs consisting of the Triple Pattern Sampling Component and the Analytical Component. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution is described in detail. A point that I am missing is how many times the experiments were replicated in order to produce the results. In addition to this, it is not very clear to me how the random sampling leads to "RDF triples that capture different characteristics of the Knowledge Graph in terms of data distributions". Also, the authors claim with regard to the Analytical Component that "the observed variable response time is not normally distributed": how is this proved? Did the authors perform a normality test for this? (EVALUATION OF THE STATE-OF-THE-ART) The authors are aware of the state of the art and have included all related studies to their work. Table 1 helps understand the different parameters affecting query execution time for TPFs and SPARQL endpoints that have been studied in existing works. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) This is very detailed in the paper since the experimental study is the focus of it. 
I am missing a discussion section that would try to interpret the results and, more importantly, provide some lessons learned for the users of TPFs. For instance, how can I use these results to improve execution times or make decisions about the query interface I will use, based on my queries, knowledge graphs, etc.? (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental study is reproducible since the data and code are available online. However, I am wondering if the results are reproducible in a real-world environment, because the query time depends on many parameters other than those studied in isolation. (OVERALL SCORE) Summary of the Paper The paper entitled "The Cost of Querying Knowledge Graphs over Triple Pattern Fragments: An Empirical Study" proposes a general methodology for assessing the query execution cost for TPFs, consisting of a Sampling Component and an Analytical Component, and discusses the experimental results of a study focusing on the impact of different independent variables on the performance of TPFs. They use 4 different KGs for their evaluation and assess the statistical significance of the results. Strong Points (SPs) - Detailed and clear presentation of results using self-explanatory plots - Discussion of experimental results - General methodology for assessing the performance of TPFs supported by accompanying data and code for reproducibility Weak Points (WPs) - Missing comparison of the results to the results already presented in previous works (did the authors falsify or validate existing results in the literature?) - Missing discussion of how the users of TPFs can benefit from these results (e.g. make the right decisions depending on the type of queries and knowledge graphs) - Some things are not adequately explained, e.g. what do negative correlations mean in Table 5? Questions to the Authors (QAs) - Did you test the reproducibility of your results by repeating the experiments?
- Did you manage to validate/falsify results in existing studies? After rebuttal -------------- I would like to thank the authors for addressing my comments and questions.
Review 4 (by Pieter Colpaert)
(RELEVANCE TO ESWC) ESWC benefits from work done on measuring the cost of querying different types of HTTP interfaces. Certainly when these HTTP interfaces expose RDF graphs, and provide RDF-specific functionality such as triple pattern filtering. (NOVELTY OF THE PROPOSED SOLUTION) It only measures HTTP response times as if the NodeJS+HDT back-end were a black box. The sampling method to come to a query mix is a simple uniform distribution, which seems a correct approach. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The work at hand evaluates a couple of knowledge graphs in two settings: a real-world setting, as published at data.linkeddatafragments.org, and a controlled environment, using the same NodeJS code-base with the HDT backends. My biggest question in this research is: (Q1) if an answer could be found that is unrelated to answer cardinality, isn’t this then just a bug in the NodeJS + HDT back-end code? Next, (Q2) why doesn’t this research study other implementations as well? E.g., what would be the effect of changing the page size at the server side? (EVALUATION OF THE STATE-OF-THE-ART) Ok (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The proposed approach is well explained and comprehensible. The box plots are a good visualization of the different query response times. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments are reproducible for any TPF interface at https://github.com/Lars-H/tpf_profiler (OVERALL SCORE) This paper measures the request-response time of TPF servers in both a controlled environment and on the production servers at data.linkeddatafragments.org. The triple patterns to be executed are categorized by triple pattern type (where and how many variables) and are randomly selected. All triple patterns return at least 1 result, up to the maximum of, in this case, 100 triples in an HTTP response.
The number of 100 triples is configurable, yet not changed by the authors. Some effects are noticed, mainly due to answer cardinality. In the controlled environment, even different effects are noticed due to a smaller cache size. The authors did not try to change the cache size either and experiment with that. The authors conclude the answer to RQ1 as a "significant impact", yet I would dare to say that this has got nothing to do with the type of triple pattern, but only with a bigger response document that needs to be transferred over HTTP. I assume that the correlation that does exist is merely due to the effect shown in figure 5. However, concluding that merely the type of triple pattern causes a different response time is highly questionable. The paper however could still be accepted after rebuttal if this answer to RQ1 is altered and the questions below are addressed. Strong points: * The work elaborates nicely on how a sample size is selected * The evaluation is reproducible * The motivation to make the choice for Linked Data more fact-based rather than based on educated guesses is refreshing. Weak points: * The paper does not elaborate on implementation details. Yet, for all tested environments, the same NodeJS+HDT backends were chosen. * The acknowledgement section says there were talks with the creators of HDT, yet the paper does not elaborate on these... What could the readers of this paper learn if your discussion was interesting? * The paper does not try to alter the out-of-the-box behavior of the chosen software * The impact of pagination on response times → do you really need an experiment to know this? The code shows that retrieving a page is O(1). * I don’t think I learned anything from this paper that I couldn’t have known from the implementation. Questions to the authors: Q1: If an answer could be found that is unrelated to answer cardinality, isn’t this then just a bug in the NodeJS + HDT back-end code?
Q2: Why doesn’t this research study other implementations as well? E.g., what would be the effect of changing the page size at the server side? Q3: Could this paper be extended with researching what a "better cache" size would be? Is bigger always better? Minor comments: * RQ4: graph → graphs * the sever provides the set * a weak possitive correlation ca → can * traffic, influence and blur the response times → review sentence --- After rebuttal: My objection still stands that what is measured is merely the HDT implementation of TPF. The framework could indeed be used for benchmarking TPF implementations to check for bugs, yet then this paper should have been written and submitted as a resource track paper.
Metareview by Roberto Navigli
This is a good prototypical paper for the evaluation task. However, the reviewers have pointed out some drawbacks that reduce the value of this work and led to rejection of the current version: - Parameters like the cache size in the controlled experiment should be configurable. That is, experiments should be run with different cache sizes, in order to determine the impact that this variability may have on the results and the conclusions that can be obtained from them. - Parameters like the number of triples returned per HTTP response should be configurable as well in all experiments, with the same objective as above. - Other implementations of TPF should be considered. If they are not considered in a future version of the paper, then the title of the paper should be changed to reflect that the analysis has been done, in the controlled experiment, on a very specific implementation of TPF. - An analysis of throughput should also be considered, as pointed out by reviewer 3.