Paper 149 (Research track)

A Multi-Criteria Experimental Ranking of Distributed SPARQL Evaluators

Author(s): Damien Graux, Louis Jachiet, Pierre Geneves, Nabil Layaida

Full text: submitted version

Abstract: SPARQL is the standard language for querying RDF data. There exists a variety of SPARQL query evaluation systems implementing different architectures for the distribution of data and computations. Differences in architecture, coupled with system-specific optimizations such as preprocessing and indexing, make these systems incomparable from a purely theoretical perspective. As a result, many implementations solve the SPARQL query evaluation problem while exhibiting very different behaviors, and not all of them are suited to every context.

We provide a new perspective on distributed SPARQL evaluators, based on multi-criteria experimental rankings. Our suggested set of 5 features (namely velocity, immediacy, dynamicity, parsimony, and resiliency) provides a more comprehensive description of the behaviors of distributed evaluators than traditional runtime performance metrics alone. We show how these features help in evaluating more accurately to what extent a given system is appropriate for a given use case. For this purpose, we systematically benchmarked a panel of 10 state-of-the-art implementations. We ranked them using a reading grid that helps in pinpointing the advantages and limitations of current technologies for the distributed evaluation of SPARQL queries.

Keywords: SPARQL Evaluators; RDF data; Benchmarking; Experimental Analysis

Decision: reject

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) An experimental evaluation of SPARQL engines is indeed relevant work for ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The paper presents yet another evaluation of SPARQL systems with existing benchmarks. Although the authors consider some further criteria in addition to performance, the results are not really surprising. Furthermore, similar experiments have already been conducted in other works, e.g. [5].
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The work does not propose a new approach but presents results of an experimental study. Basically,
the conducted experiments make sense. However, there are some concerns regarding the setup:
* Both benchmarks are based on relatively simple query templates (BGPs) using only some very basic features of SPARQL (see the illustration after these points). Switching to a more advanced benchmark would provide new insights.
* For a distributed, cluster-based setup the datasets are quite small - two of them fit into the main memory of a single server!
* Using 2 physical machines with 4 and 5 VMs, respectively, seems to be a rather small setup for the goal of the evaluation.
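For illustration, the templates in both benchmarks are essentially conjunctive (BGP-only) queries of roughly the shape of the first query below, whereas the second query shows the kind of SPARQL 1.1 features (OPTIONAL, FILTER NOT EXISTS, aggregation) that would stress the engines differently. Both are hand-written sketches reusing LUBM-style vocabulary, not queries taken from either benchmark:

  PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
  # Basic graph pattern only: the shape most of the benchmark templates take
  SELECT ?x ?y WHERE {
    ?x a ub:GraduateStudent .
    ?x ub:takesCourse ?y .
    ?y a ub:Course .
  }

  PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
  # Hypothetical query using OPTIONAL, FILTER NOT EXISTS and aggregation
  SELECT ?dept (COUNT(?student) AS ?enrolled) WHERE {
    ?student a ub:GraduateStudent ;
             ub:memberOf ?dept .
    OPTIONAL { ?student ub:emailAddress ?mail }
    FILTER NOT EXISTS { ?student ub:advisor ?adv }
  }
  GROUP BY ?dept
  HAVING (COUNT(?student) > 10)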
(EVALUATION OF THE STATE-OF-THE-ART) Related work is only briefly discussed, mainly by mentioning some other benchmarks and experimental studies. Compared to existing work (e.g. [5]), only the list of evaluated systems was updated. Unfortunately, the paper ignores some recent work in benchmarking, e.g. the work of the LDBC council such as the SNB benchmark (http://www.ldbcouncil.org/benchmarks/snb).
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See comments below (overall score).
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The setup of the evaluation (hardware, benchmarks, systems) is explained in detail and is publicly available. Therefore, the experiments are reproducible.
(OVERALL SCORE) The paper presents yet another experimental evaluation of distributed SPARQL engines using two existing benchmarks. These engines are classified into three groups (standalone, HDFS-based, Hadoop/Spark-based) and compared according to several criteria: performance (velocity), loading time (immediacy), update performance (dynamicity), resource utilization (parsimony), and fault tolerance (resiliency).
Strong points:
S1: The authors have invested a lot of work to evaluate 10 different systems.
S2: The comparative evaluation does not consider only performance, but other criteria as well.
Weak points:
W1: Only two rather simple benchmarks are used, which do not lead to new insights. Why not use benchmarks like SNB? This would allow comparing the results (at least roughly) with commercial systems. Furthermore, the queries of the used benchmarks do not really stress the systems (in terms of SPARQL features).
W2: Although 10 systems are evaluated, these are only open-source systems (or research prototypes), which do not really represent the state of the art. Look for example at http://www.ldbcouncil.org/benchmarks/snb for some results.
W3: One of the main contributions of this paper is an extended set of criteria. However, the evaluation methodology does not really address all of these criteria: e.g. dynamicity would require a defined update set as part of the workload, and reliability/fault tolerance would require letting nodes fail in a more systematic manner. This kind of test is mentioned only briefly in Sect. 5.5.
Minor comments:
- Sect. 5: for testing dynamicity an update part should be defined (see the sketch after these comments).
- Sect. 5, parsimony: Pricing models for cloud services are not really a convincing argument for accepting slower answers. However, minimizing resource usage is a relevant criterion.
- Sect. 5, resiliency: This should be addressed by a controlled test setup, i.e. a failure model. Looking only at disk footprints isn't sufficient.
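To make the dynamicity point concrete, a defined update part could interleave measured SPARQL 1.1 Update requests with the read workload, along the following lines (a minimal sketch with made-up resource names under an ex: prefix, not taken from the paper or from LUBM/WatDiv):

  PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
  PREFIX ex: <http://example.org/>

  # Insert a fresh batch of triples (ingestion while queries are running) ...
  INSERT DATA {
    ex:student42 a ub:GraduateStudent ;
                 ub:takesCourse ex:course7 .
  } ;

  # ... then retract a pattern, forcing the store to maintain its indexes.
  DELETE WHERE {
    ex:student42 ub:takesCourse ?course .
  }

Timing such an update mix at several data scales would turn dynamicity from a qualitative feature into a measured quantity.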
In summary, the paper provides only a limited amount of new insights - the selected benchmarks, the small datasets, the restriction to open-source systems, as well as the evaluation methodology for the non-performance criteria are weak points.
After Rebuttal:
Thanks for the response, but my concerns still exist. Thus, I don't want to change my score.


Review 2 (by Pascal Molli)

(RELEVANCE TO ESWC) Comparing the performance of the different implementations of SPARQL on distributed platforms is clearly relevant to ESWC and of interest for the whole semantic web community.
(NOVELTY OF THE PROPOSED SOLUTION) Well, it is a benchmark quite similar to [5], but it includes new metrics and new interpretations.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The benchmark is well executed.
(EVALUATION OF THE STATE-OF-THE-ART) Related work is covered correctly but is integrated into the conclusion. This has to be corrected.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Section 5 presents an interesting "overall vision" of different SPARQL evaluators. The different features (velocity, immediacy, dynamicity, parsimony, and resiliency) look pertinent to me and well-connected to different use cases.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Technically, the comparison seems well done, reproducible, and trustworthy.
(OVERALL SCORE) The paper presents a benchmark of 10 state-of-the-art implementations of distributed SPARQL evaluators, i.e. SPARQL implementations on distributed platforms such as Hadoop or Spark. It then proposes to 'rank' them according to 5 features: velocity, immediacy, dynamicity, parsimony, and resiliency.
Strong points:
- The benchmark is interesting and represents valuable work by itself. New systems are compared, new metrics are measured.
- The proposed features are pertinent. Different implementations do not aim at the same "performance" criteria and thus correspond to different use cases.
Weak points:
- The benchmark is strongly inspired by [5], but makes different choices about the benchmark setup: not the same datasets, queries, and systems. IMHO, such choices are not sufficiently motivated.
- It is quite strange to me to have a combined "Related work and conclusion" section. I would expect the introduction to explain why we need a new benchmark of distributed SPARQL evaluators. The structure of the paper can be confusing.
- Many figures are small and difficult to read.
- While the proposed features (velocity, immediacy, dynamicity, parsimony, and resiliency) look pertinent to me, they remain features and not metrics. Consequently, a "ranking" along these features cannot really be quantified.


Review 3 (by Axel Polleres)

(RELEVANCE TO ESWC) I agree with the authors that benchmarking performance alone is not enough and that especially distributed approaches need more principled evaluation metrics.
This paper is an excellent contribution in this regard!
(NOVELTY OF THE PROPOSED SOLUTION) I haven't seen this aspect addressed rigorously before, nor have I seen non-native distributed SPARQL engines compared in this manner.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) As this is a benchmark paper, I am not really sure how to evaluate "correctness" and "completeness", but I think the authors did consider most existing solutions in their focus and correctly compare them. What is slightly unclear to me is to what extent different data-distribution strategies play a role and to what extent they are configurable in the tested systems.
(EVALUATION OF THE STATE-OF-THE-ART) I think the literature references and related works are well summarized, in fact I just printed one of the mentioned works I hadn't known before 🙂
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) I think the benchmark and the evaluation features are well-justified, and it is well explained how they can be evaluated. One minor point that remains open: many standalone triple stores also run on a cluster, e.g. Virtuoso: to what extent would these clustered setups be (un)comparable with your benchmark, which only considers non-standalone distributed engines?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Reproducibility is flawless, all data, experiments are documented well on the accompanying webpage along with tutorials online. The quick tutorials themselves have some value to run the tested systems, nicely done!
(OVERALL SCORE) SP:
Well written, clear rationale, rigorously executed and summarized, with a lot of things for readers of the paper to take away.
WP:
Not really; maybe the two open questions 2) and 3) below, which are not entirely clear to me.
QAs:
1) To what extent do different data-distribution strategies play a role, and to what extent are they configurable in the tested systems?
2) Many standalone dedicated triple stores also run on a cluster, e.g. Virtuoso: to what extent would these clustered setups be (un)comparable with your benchmark, which only considers non-standalone distributed SPARQL engines?
3) I would be interested to know whether the authors considered DREAM in their comparison or whether they still could?
--- post rebuttal comment ----
the response doesn't really change/affect my (positive) verdict. Trusting the authors that the metrics and evaluation they present are not part of the earlier evaluations mentioned, and as I like the way they present them (the detailed related-work discussion itself is per se useful), I think this could be accepted.


Review 4 (by Carlos Buil Aranda)

(RELEVANCE TO ESWC) The paper is relevant to the track since it evaluates several SPARQL query processing systems.
(NOVELTY OF THE PROPOSED SOLUTION) Not that novel, no new criteria, systems were evaluated previously or in other scenarios.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The evaluation criteria and benchmarks are correct.
(EVALUATION OF THE STATE-OF-THE-ART) It seems a complete state of the art.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper presented a good discussion about the data generated by the executions.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) It is easy to download the benchmarks and systems and run the experiments. The authors also have a web page for the paper.
(OVERALL SCORE) In this paper the authors present an evaluation of 10 systems implementing the SPARQL query language. The evaluation is done according to 5 features (velocity, immediacy, dynamicity, parsimony, and resiliency) which are present in real scenarios. The authors provide a description of the 10 systems, detailed data about data loading/preprocessing time, query execution, timeouts, etc. using 2 different benchmarks (LUBM 1 and 10k and WatDiv). The authors compare the results against these 5 dimensions.
The introduction section presents the need for evaluating the 10 systems according to 5 criteria using 2 different benchmarks. The stated need is that nobody has so far looked into these 5 criteria for these types of systems (distributed in-memory, HDFS, native RDF storage systems).
Comment: the main comment I have is that this paper seems like another typical evaluation paper. Maybe rewrite the introduction to better highlight the benefits of reading it?
The next two sections present the systems and the methodology for the evaluation. Section 2 introduces the systems to be evaluated, while Section 3 presents what will be measured in the evaluation. Section 3 also describes the benchmarks used (LUBM and WatDiv) with the different data configurations, so that the data does not fit in memory. The authors also report on which queries are hard and why.
Comments: the authors presented nicely the reasons why they chose these systems/benchmarks. However, the authors did not explain clearly how they chose the indicators to be monitored. Besides, these criteria are not new; I would recommend that the authors check [1], since it contains a set of guidelines for selecting such indicators, or other system comparisons in the literature.
In Section 4 the authors describe how well the systems behaved during the evaluation with the benchmarks. This section provides details for each combination of system and benchmark about loading times, execution times for queries, what queries were harder, etc. 
Comments: I think this section is nicely written, the authors clearly described what happened with each system in each query. Not much to say.
In Section 5 the authors compare the systems (using the data from the previous section) against the criteria described in Section 3. Comments: I think this is the weakest part of the paper. The section is nicely written and easy to read; however, the comparison criteria are not really new.
In Section 6 the authors conclude the paper.  
Overall comments: I found the paper easy to read and well structured. The main drawback is the lack of novelty, since the native RDF/SPARQL systems were evaluated elsewhere and the HDFS systems have been evaluated in other scenarios (how different is the RDF use case?).
============== after rebuttal ===========
I acknowledge the authors' response. However, I still see the paper as another evaluation benchmark, nothing really new, though it is interesting to read. After the authors' response I will maintain my score.


Metareview by Emanuele Dellavalle

The paper generated a deep and long discussion both among the reviewers and the chairs.
The work is clearly relevant for ESWC and, to the best of the chairs' knowledge, this is the first attempt to perform such a broad and deep evaluation of Distributed SPARQL Evaluators.
However, the paper has weak points that prevent it from being publishable at this stage.
In addition to the reviewer comments, as meta-reviewers, we recommend that the authors try to scale out their tests by at least two orders of magnitude. As one can read in Figure 3, the tested systems were not under load, neither in terms of CPU nor in terms of memory. We also recommend adding a centralised solution as a term of comparison. This would show explicitly at which scale Distributed SPARQL Evaluators start to outperform centralised ones.

