Paper 71 (Research track)

Benchmarking Commercial RDF Stores with Publication Office Dataset

Author(s): Ghislain Auguste Atemezing

Full text: submitted version

Abstract: This paper presents a benchmark of RDF stores with real-world datasets and queries from the EU Publications Office (PO). The study compares the performance of four commercial triple stores: Stardog 4.3 EE, GraphDB 8.0.3 EE, Oracle 12.2c and Virtuoso 7.2.4.2 with respect to the following requirements: bulk loading, scalability, stability and query execution. The datasets and the selected queries (44) are used in the Linked Data publication workflow at PO. The first results of this study provide some insights into the quantitative performance assessment of RDF stores used in production environments in general, especially when dealing with large amounts of triples. Virtuoso is faster in the querying and loading scenarios, while GraphDB shows better results regarding stability.
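For orientation, below is a minimal sketch (Python; the endpoint URL, query directory, timeout and run counts are illustrative assumptions, not values from the paper) of the kind of per-query execution-time measurement the query-execution scenario describes, with a warm-up run followed by timed runs:

import glob
import time

import requests

ENDPOINT = "http://localhost:8890/sparql"   # hypothetical endpoint; one per store under test
TIMEOUT_S = 600                             # hypothetical per-query timeout

def run_query(query):
    """Return the wall-clock time of one SPARQL query, in seconds."""
    start = time.perf_counter()
    requests.post(ENDPOINT,
                  data={"query": query},
                  headers={"Accept": "application/sparql-results+json"},
                  timeout=TIMEOUT_S)
    return time.perf_counter() - start

for path in sorted(glob.glob("queries/*.rq")):    # illustrative query directory
    with open(path) as f:
        q = f.read()
    run_query(q)                                  # warm-up run, discarded
    times = [run_query(q) for _ in range(3)]      # timed runs
    print(path, "min:", min(times), "avg:", sum(times) / len(times))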

Keywords: Benchmarking; Triple stores; RDF; SPARQL; Enterprise RDF stores

Decision: reject

Review 1 (by Alasdair Gray)

(RELEVANCE TO ESWC) The paper falls in the interest area of ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The paper presents an evaluation of 4 commercial triplestores using the queries and dataset used in a production environment, i.e. a real-world scenario. However, the results are not related back to other benchmark figures.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Overall the evaluation appears to be well thought through and executed. Appropriate warm-ups are made and multiple runs are performed.
However, it seems that the same values are used in the queries, i.e. the search term (literal/resource) is not changed to random permitted values. 
Also, several queries time out, which skews the results when they are aggregated across the query mix. I would expect that you also report the results for the combinations that do not time out and see how they do on that subset of the queries.
In the multi-threading scenario you claim that Virtuoso is less stable, but there is not a huge amount of variance in the values. Given that the timeouts will be skewing the other figures, they are likely a major factor in the more constant performance of the other systems.
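To make the requested re-analysis concrete, here is a minimal sketch (Python; the CSV file name, column layout and timeout penalty are assumptions for illustration, not taken from the paper) that aggregates a per-query results table both with the timeout penalty applied and on the subset of queries that did not time out:

import csv

TIMEOUT_S = 600   # assumed penalty applied to queries that did not finish

def load_results(path):
    """Rows are expected as: store, query, seconds, timed_out (0/1)."""
    with open(path) as f:
        return [{"store": r["store"],
                 "query": r["query"],
                 "seconds": float(r["seconds"]),
                 "timed_out": r["timed_out"] == "1"}
                for r in csv.DictReader(f)]

def mean_time_per_store(rows, include_timeouts=True):
    """Mean execution time per store, optionally dropping timed-out queries."""
    by_store = {}
    for r in rows:
        if r["timed_out"] and not include_timeouts:
            continue
        value = TIMEOUT_S if r["timed_out"] else r["seconds"]
        by_store.setdefault(r["store"], []).append(value)
    return {store: sum(ts) / len(ts) for store, ts in by_store.items()}

rows = load_results("query_mix_results.csv")   # hypothetical results export
print("with timeout penalty:", mean_time_per_store(rows, include_timeouts=True))
print("non-timed-out subset:", mean_time_per_store(rows, include_timeouts=False))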
§5.3 how do Virtuoso and GraphDB reach 256 threads if the maximum is set to 128? What is meant by an error here?
You claim that there is no clear winner, yet in every category Virtuoso seems to outperform the other systems, as highlighted in grey in each of your tables.
(EVALUATION OF THE STATE-OF-THE-ART) The related work section correctly reports on various available SPARQL benchmarks. However, the stated purpose of the work is not to provide another benchmark, but to consider the performance of different triplestores in a real-world setting. I would expect that the results of various benchmarks are compared to the results of the real-world evaluation.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The experiment and results are discussed well. It was not clear to me what it meant to normalise RDF and how this process reduces the 2,195 files to 64.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code and queries for the experiment have been made available on GitHub, although the specific commit used is not stated. Configurations of all the systems are provided as well as the version number. The dataset itself is not openly available. This has to be requested.
I found that there were discrepancies between the report of the queries in the paper and what I could see on GitHub, e.g. Q2 and Q11 of category 1.
(OVERALL SCORE) The paper presents an evaluation of 4 triplestores using the queries and datasets used in the European Publications Office.
The paper is generally well written, although there are numerous grammatical errors (too many to list).
The experiment design is well documented and the code has been made available.
There are issues with the analysis of the results as detailed above. I would be keen to see the results, particularly of table 5, when the queries that timeout are removed.
Minor issues:
- References should be added to back up claims in the introduction
- First paragraph on p2 needs breaking up and regrouping
- numerous grammatical issues throughout the paper, too many to list
- Acronyms are used before they are introduced, e.g. CDM, FRBR, and PROD.
- Paragraph at the top of p9 is incomplete
- VOS is used in §6 to refer to Virtuoso.
Thank you to the authors for providing a rebuttal. I believe that it is important that experience papers from the real world get published, but they must meet the expected standards.
This work is interesting in that it looks at the experience of using some leading triplestores with a real-world query load. However, I believe the penalty they have applied to systems for timing out is distorting the results. They also have not attempted to compare the results to existing benchmark results.
I would encourage a future resubmission with an improved analysis of the results.


Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) Very relevant
(NOVELTY OF THE PROPOSED SOLUTION) Does nothing extraordinary -- it is a benchmarking study
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The conducted benchmark analysis is incomplete and does not cover all aspects of the experiments.
(EVALUATION OF THE STATE-OF-THE-ART) The related work reported in the paper is poor. There are missing works which need to be considered/cited. These are mentioned later in the detailed review.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) It is a benchmarking study of commercial RDF stores, therefore this question is not suitable for this paper. However, the author has put forward a good first effort in conducting the study; it still lacks the necessary foundation, which can be improved based on the comments in this review.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The study might be reproducible if one is able to (i) obtain licenses from the respective RDF store vendors, as they are commercial products, and (ii) generate the same type and amount of data using the details and scripts provided by the author.
(OVERALL SCORE) *Summary*
This paper presents a benchmark study of four commercial RDF stores (Stardog 4.3 EE, Virtuoso 7.2.4.2, GraphDB 8.0 EE, and Oracle 12.2c) using the Publications Office dataset and a selected set of forty-four queries. The benchmark task at hand is a traditional quantitative performance assessment using factors such as bulk loading of data, query execution time, scalability and stability. The results demonstrate the superiority of Virtuoso over the other commercial RDF stores for all factors other than stability, where GraphDB outperforms the rest.
** Strong Points (SPs)  ** 
SP1 - Interesting study concerning commercial RDF stores and the PO dataset.
SP2 - Publicly available data generator scripts and overall setup used in the study. Good use and display of an open-source approach.
SP3 - A very good first effort in benchmarking commercial RDF stores. There are very few studies that attempt to benchmark commercial RDF stores in such detail, and not many succeed, as benchmarking requires very careful curation of the setup and a very controlled environment. This is one of the few studies that has the potential to improve by building on its current efforts.
**Weak Points (WPs) **
WP1 - Unclear reasons for the selection of RDF stores, followed by unclear data and query statistics.
The article covers only a small fraction of commercial RDF stores, most of which are actually multi-model stores and not native RDF stores. Native commercial RDF stores should also be included in the study and analysis.  Furthermore, Section 2 is very confusing regarding the sizes and number of datasets used or generated. It is extremely hard to follow in the rest of the paper what is being benchmarked against which query and why precisely. 
WP2 - Not enough technical contribution. 
Given that this study performs only a quantitative performance analysis of commercial RDF stores, it becomes extremely difficult to differentiate and accept studies such as this for venues such as ESWC. A qualitative study of the underlying factors influencing the performance of these systems should have been reported/supplemented. The discussion section is also very short and not very insightful.
Some such studies, which present a detailed benchmark analysis, are cited below; kindly also add them to your related work:
@inproceedings{saleem2015feasible,
title={Feasible: A feature-based sparql benchmark generation framework},
author={Saleem, Muhammad and Mehmood, Qaiser and Ngomo, Axel-Cyrille Ngonga},
booktitle={International Semantic Web Conference},
pages={52--69},
year={2015},
organization={Springer}
}
@inproceedings{thakkar2017trying,
title={Trying Not to Die Benchmarking: Orchestrating RDF and Graph Data Management Solution Benchmarks Using LITMUS},
author={Thakkar, Harsh and Keswani, Yashwant and Dubey, Mohnish and Lehmann, Jens and Auer, S{\"o}ren},
booktitle={Proceedings of the 13th International Conference on Semantic Systems},
pages={120--127},
year={2017},
organization={ACM}
}
@inproceedings{saleem2015lsq,
title={LSQ: the linked SPARQL queries dataset},
author={Saleem, Muhammad and Ali, Muhammad Intizar and Hogan, Aidan and Mehmood, Qaiser and Ngomo, Axel-Cyrille Ngonga},
booktitle={International Semantic Web Conference},
pages={261--269},
year={2015},
organization={Springer}
}
WP3 - Extremely weak evaluation of the systems.
The evaluation presented in this paper is weak, confusing and erroneous. There are many errors in section 5 which lead to a lot of confusion regarding the RDF stores being benchmarked and the datasets being referred to within. Furthermore, as mentioned earlier, the paper requires a major overhaul and cannot be considered for publication in its given state.
** Questions to the Authors (QAs) **
Q1 - 
Why were only the selected RDF stores chosen? There are also other commercial RDF stores on the market (such as MarkLogic, AllegroGraph, Algebraix, Dydra, SparkleDB, etc.); what was the reason for choosing the ones selected in this study? Why not others? And lastly, why only four?
It would also be interesting to see how Amazon Neptune performs against OpenLink Virtuoso, since it claims to be a fast and reliable graph and RDF store for the cloud.
Q2 - 
OpenLink Virtuoso 7.2.4.2, which is used in the study, seems to be an open-source product available for download at https://github.com/openlink/virtuoso-opensource/releases
Did the author obtain a special commercial edition of it?
Q3 - There are several questions in the following sections; kindly address them. I list some of the most important ones below:
- The article talks about two categories of queries in section 3, whereas in the provided online resource at https://github.com/gatemezing/posb/tree/master/bench/queries there are three. It is not clear why.
- How many datasets have been used? Each of what size? This is very unclear in the paper.
- Why is there inconsistency in section 4 (experimental setup)? It appears that the systems did not all receive the same amount of RAM (i.e. 64 GB). Why is that? Refer to the other remarks/comments on section 4 for details.
- It is surprising to see that Virtuoso can execute more than 3,200 queries per minute whereas other commercial RDF stores struggle to process even 20-30 queries per minute. Why is this? Does the author have any answer? Also, why is the average response time for Virtuoso drastically lower than that of the other systems?
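For reference, a minimal sketch (Python; the endpoint URL, query files, thread count and measurement window are placeholders, not values from the paper) of how a queries-per-minute figure of this kind is typically measured with concurrent clients:

import glob
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8890/sparql"                               # placeholder endpoint
QUERIES = [open(p).read() for p in sorted(glob.glob("queries/*.rq"))]   # placeholder query mix
THREADS = 16                                                            # placeholder client count
DURATION_S = 60                                                         # measurement window

def worker(deadline):
    """Replay the query mix until the deadline; return the number of queries run."""
    done = 0
    while time.perf_counter() < deadline:
        for q in QUERIES:
            requests.post(ENDPOINT,
                          data={"query": q},
                          headers={"Accept": "application/sparql-results+json"},
                          timeout=300)
            done += 1
    return done

deadline = time.perf_counter() + DURATION_S
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    totals = list(pool.map(worker, [deadline] * THREADS))
print("queries per minute:", sum(totals) * 60 / DURATION_S)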
=============Post rebuttal===========
The comments from the author are appreciated. However, they do not satisfy all my queries. A quick remark: benchmarking Neptune would bring the author into cloud territory, not to mention that Amazon brands Neptune as a graph database. This would again start the "fairness" debate. If the author wishes to benchmark only RDF stores, it is not a good idea to include databases with a multi-model backend, and certainly not graph databases. The idea of benchmarking is to have a common platform with a restricted (or equal) environment in order to assess the performance of participating systems in a relative manner. I would stick to my previous decision for this paper. Nonetheless, I certainly appreciate the effort, and the community does indeed benefit from such studies from time to time.


Review 3 (by Axel Polleres)

(RELEVANCE TO ESWC) Benchmarking for SPARQL is certainly a topic relevant to ESWC
(NOVELTY OF THE PROPOSED SOLUTION) I think the paper is in general novel, as it presents a new benchmark and dataset of considerable size. However, I do not see any real *scientific* novelty in it. The paper rather presents a practical use case (RDF querying for a specific use case from the Publications Office) and tests which triple store performs best in one particular setup (machine) for this task, on the specific queries important for this use case. Scientifically, I would have expected more in terms of:
- a principled comparison with other benchmarks (e.g. LUBM and BSBM are mentioned) in terms of features present/important in the queries that are NOT present in the other benchmarks, and a comparison of why one store performs better on this specific benchmark than on the other benchmarks. Instead the comparison is only at the level of "BSBM is artificial, this benchmark is not", "LUBM uses a smaller number of classes" and "SP2Bench uses synthetic queries and data", and apart from that the related works section just demonstrates awareness of the other benchmarks rather than saying why these weren't good enough for you or where your data and results differed.
That is not enough, IMHO; you should compare the characteristics and performance of these benchmarks with yours.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The reported evaluation results appear to be correct. The queries are published on GitHub (I only find the queries there, but I do not know how to acquire the datasets).
(EVALUATION OF THE STATE-OF-THE-ART) See above: my criticism is mainly about the comparison to other benchmarks.
Some important work to relate to and to consider is also missing, IMHO:
Kjetil Kjernsmo, John Tyssedal: Introducing Statistical Design of Experiments to SPARQL Endpoint Evaluation. International Semantic Web Conference (2) 2013: 360-375
Kleanthi Georgala, Mirko Spasic, Milos Jovanovik, Henning Petzka, Michael Röder, Axel-Cyrille Ngonga Ngomo: MOCHA2017: The Mighty Storage Challenge at ESWC 2017. 3-15
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Not sure what is meant by that score. In case it is about "presentation and structure", it is OK but has some flaws, e.g. grammar/language-wise or in the consistency of figures/tables; see details below.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The queries are there, but I don't know how to reproduce/obtain the data.
(OVERALL SCORE) I really would have preferred to have a *borderline* option for this paper.
Strong points: 
realistic data and queries from a realistic use case, with a significantly sized dataset.
I would happily accept this in a workshop, but am not sure this is contribution enough for a full paper in the conference.
Weak points:
not a really scientific approach; it does not compare to other works in a principled manner, rather listing related works as a demonstration of awareness than making clear how exactly they relate to this work.
QA:
Where do I find the data if I wanted to re-do the experiments?
DESCRIBE has no normative semantics... did you ensure all triple stores handle it equivalently?
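To illustrate the concern, a minimal sketch (Python with rdflib; the resource URI and endpoint URLs are hypothetical) that sends the same DESCRIBE query to two stores and compares the size of the returned description, which each store is free to construct differently (e.g. a concise bounded description versus something broader):

import requests
from rdflib import Graph

DESCRIBE_QUERY = "DESCRIBE <http://example.org/resource/doc1>"   # hypothetical resource
ENDPOINTS = {                                                    # hypothetical endpoints
    "store-a": "http://localhost:8890/sparql",
    "store-b": "http://localhost:7200/repositories/po",
}

for name, url in ENDPOINTS.items():
    resp = requests.post(url,
                         data={"query": DESCRIBE_QUERY},
                         headers={"Accept": "text/turtle"})
    graph = Graph().parse(data=resp.text, format="turtle")
    # The number (and choice) of triples returned may legitimately differ per store.
    print(name, len(graph), "triples")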
------- post rebuttal comment --------
Just as a quick remark:
My question "Where do I find the data if I wanted to re-do the experiments?" doesn't seem to be answered. Also, a minor remark: For the links the authors provide in their response demand that you have to ask for access on gdrive, not great people who want to see it decide to stay anonymous. you may want to make this available "read all"


Metareview by Oscar Corcho

The reviewers agree with the authors that evaluating on real use cases is very relevant and useful for the community. However, there were several drawbacks/limitations in the methodology used and in the results as reported that make this paper not yet acceptable for the conference.

