Paper 145 (Research track)

HDTQ- Managing RDF Datasets in Compressed Space

Author(s): Javier D. Fernández, Miguel A. Martinez-Prieto, Axel Polleres, Julian Reindorf

Full text: submitted version

camera ready version

Decision: accept

Abstract: HDT (Header-Dictionary-Triples) is a well-known compressed representation of RDF data that supports retrieval features without prior decompression. Yet, RDF datasets often contain additional graph information, such as the origin, version or validity time of a triple. Traditional HDT is not capable of handling this additional parameter(s). This work introduces HDTQ (HDT Quads), an extension of HDT, which is able to represent quadruples (or quads) while still being highly compact and \queryable{}. Two approaches of this extension, Annotated Triples and Annotated Graphs, are introduced and their performance is compared to the leading open-source RDF stores on the market, Results show that HDTQ achieves the best compression rates and is a competitive alternative to well-established systems.

Keywords: RDF compression; RDF indexing; Linked Data management

 

Review 1 (by Ruben Taelman)

 

(RELEVANCE TO ESWC) Compressed storage of RDF quads is very relevant to a Semantic Web conference.
(NOVELTY OF THE PROPOSED SOLUTION) Storage of RDF quads is not a new thing, but the approaches taken by the authors to reach these levels of compression are new.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The algorithms are sound are explained in detail.
(EVALUATION OF THE STATE-OF-THE-ART) The SOTA is sufficiently discussed and is part of the evaluation.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is discussed in detail and demonstrated in the evaluation. Source code is provided.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Details and scripts regarding the exact configuration of the experiments are included on the website, including raw results.
(OVERALL SCORE) This article introduces an extension of the well-known HDT RDF compression format.
This extension enables the storage of quads (triple + graph) instead of regular triples using bitmaps,
similar as to how HDT stores triples.
The authors propose two techniques for structuring these bitmaps (HDT-AT and HDT-AG).
Furthermore, algorithms are introduced to resolve quad patterns based on these bitmaps.
Additionally, an open-source implementation is provided for this extension.
Finally, an evaluation is done comparing HDT-AT, HDT-AG, Jena and Virtuoso for different graph-based datasets.
The results are very promising in terms of the storage space requirements and the required indexing time.
For query evaluation efficiency, not an overall winner can be determined.
For some kinds of quad patterns, the HDT-based approaches outperform the state-of-the-art engines,
while the opposite is true for other patterns.
The paper is well written and very understandable.
The presented approach for adding quad support to HDT is interesting and very valuable to the community.
The evaluations and its conclusions provide a good overview of in what situations this approach performs well, and when not.
My only concern with this work is that the experiments are not sufficiently reproducible.
While the exact versions of the used software was mentioned,
the authors did not provide any details or scripts regarding the exact configuration of the experiments.
Furthermore, the paper only presents aggregated results for indexing space+time and relative results for query times,
while the raw results are not available online.
The authors do however provide the used datasets and queries.
Ideally, the raw results and experimentation scripts and configurations should be available here as well.
SP:
* The authors propose novel approaches for compressed quad storage on top of HDT
* The algorithms are explained clearly
* The evaluations are complete and properly discussed
WP:
* The experiments are not sufficiently reproducible.
* The paper only presents aggregated results for indexing space+time and relative results for query times.
* HDTQ query resolution is in some cases significantly slower than other approaches.

 

Review 2 (by Steffen Staab)

 

(RELEVANCE TO ESWC) The paper presents a queryable compressed representation of RDF including named graphs. Therefore, it addresses "Storage Models for Semantic Data" and "Query processing of Semantic Data" of the research subtrack "Semantic Data Management and Big Data".
(NOVELTY OF THE PROPOSED SOLUTION) The proposed HDTQ is only a rather small extension of the already existing HDT format. Nevertheless, the extensions are not straightforward so that the proposed solution still provides novelty.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) When thinking about how HDT can be extended to support named graphs, the first straightforward implementation that comes to my mind is adding an additional B_G-S_G layer to HDL. This additional layer would store for each triple the IDs of the graphs that contain this triple. This approach would work similar like HDTQ-AT but in case of datasets with many named graphs and with triples that only occur in one or a few graphs might show a better performance than HDTQ-AT. The authors could have given some arguments why they have not considered this straightforward implementation.
Beside this aspect the technical approach is sound and complete. The core design decisions are well-motivated.
(EVALUATION OF THE STATE-OF-THE-ART) The paper gives a good overview of the related work. No important related work is missing.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The goal that their approach saves storage capacity is shown by several experiments comparing the compressed dataset size of HDTQ with the size of the gzip-compressed dataset and the size in two RDF stores. Even if it is not required to justify the presented approach, it would be interesting to know which overhead the storage of the named graphs has in comparison to the plain HDT format.
When analyzing the loading time and the quad pattern resolution speed the authors should state whether Jena and Virtuoso use transactions or not. Since the HDT format cannot be changed during runtime, the transaction support should be disabled which is the default setting for Jena.
The measured loading times show that HDTQ shows a performance comparable to the RDF stores. Also, the quad pattern resolution speed of HDTQ is comparable to the one of Jena and Virtuoso. In the latter case the authors should provide some additional information regarding the experimental setting. (i) For each type of quad pattern 100 different quad patterns were generated (if possible). With how many triples did these quad patterns match? (ii) Are the shown resolution speed the arithmetic mean of all 100 runs, the median, the highest speed, ...?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The used datasets and queries are made available online. The queries for BEAR-A are missing. Also information about the configuration of the evaluated RDF stores are missing. If the default configuration was used, this could have been stated. It is nice that the source code of HDTQ is made publicly available on GitHub. For the presented quad pattern resolution speeds it is unclear, how the presented numbers were produced out of the 100 executions of each type of quad pattern. Therefore, beside these minor issues the experiment seems to be reproducible.
The authors made best effort to produce a general experimental study covering different types of datasets consisting of several named graphs as well as several types of queries.
(OVERALL SCORE) In this paper the authors address the problem how datasets consisting of several named graphs can be stored in a compressed way such that the data is still queryable without decompressing the data. To achieve this the authors provide two extensions of HDT, a queryable compressed representation of RDF. In the first extension called annotated triples (HDTQ-AT) for each triple an additional bit vector is stored indicating in which graphs the triple occurs. In the second extension called annotated graphs (HDTQ-AG) for each graph a bit vector is stored indicating which triples belong to this graph. For both extension the authors implement query techniques that allow for retrieving all matches of a given quad pattern (i.e., a triple pattern extended by a fourth element that reflects the named graph). In the performed evaluation the authors could show that both extension require a bit more storage space than the gzip encoding but much less space than in the state of the art RDF stores Jena or Virtuoso. When evaluating the quad resolution speed, both extensions required a similar amount of time as the RDF stores.
== Strong points ==
S1) To the best of my knowledge, it is the first proposed queryable compressed representation for named graphs.
S2) Two alternative representations are proposed and evaluated showing a better storage compression while having a similar quad resolution speed like state of the art RDF stores.
S3) The software is open source so that it can be used by everyone.
== Weak points ==
There are only minor issues with the paper:
W1) The authors could have argued why their approach works better than a straightforward extension of HDT by named graphs (see comment at the correctness and completeness section).
W2) Since HDTQ does not support updates the transaction support in Jena and Virtuoso should have been disabled. The information of Jena and Virtuoso are executed with or without transaction support is missing.
W3) In the description of the experimental setting of the quad pattern resolution speed experiments it is not stated, how the presented numbers were produced out of the 100 executions of each type of quad pattern.
== Questions to the Authors ==
Q1) How is the compression rate and the quad resolution speed of HDTQ in comparison with the straightforward approach (see comment at the correctness and completeness section)?
Q2) Were the transaction in Jena and Virtuoso disabled during the experiments? 
Q3) How were the quad resolution speed numbers computed out of the 100 executions?
Q4) For each type of quad pattern 100 queries were produced. With how many quads did the patterns match? Something like a box plot would be nice.
Q5) How does HDTQ-AT perform in comparison with HDTQ-AG?
Q6) Which overhead do the additional bitsets for storing the occurrence of triples in the different named graphs cause?

 

Review 3 (by anonymous reviewer)

 

(RELEVANCE TO ESWC) The authors present a generalization of the HDT format for RDF data to RDF datasets where triples contain an additional bit of information (e.g., its origin, version, timestamp). This is formalized by considering *quads*, which are ordinary RDF triples extended with an additional component (the "graph of the pattern"). The paper is clearly relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The proposed generalization of the HDT format is natural, but in my opinion not trivial.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) I believe that the approach is technically correct. However, one must say that the presentation of the technical details is not extensive due to space limitations.
(EVALUATION OF THE STATE-OF-THE-ART) The authors have performed extensive experiments to demonstrate the applicability of the approach by comparing to the leading open-source RDF stores. The results are impressive.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors have performed extensive experiments to demonstrate the applicability of the approach by comparing to the leading open-source RDF stores. The results are impressive.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I believe that the results are reproducible.
(OVERALL SCORE) I think this is a strong paper that deserves to be accepted.

 

Review 4 (by anonymous reviewer)

 

(RELEVANCE TO ESWC) Enabling the efficient evaluation over compressed datasets is a very relevant contribution, but I have some concerns about some aspects (see text about Correctness and questions Q6-Q8, Q10).
(NOVELTY OF THE PROPOSED SOLUTION) It is an interesting solution but it is quite straightforward from previous works.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Conceptually the approach looks good; however, I have some concerns about the correctness of Algorithm 4 (see questions Q6, Q7, Q8, and Q10). Good compression and performance without correctness is not interesting.
(EVALUATION OF THE STATE-OF-THE-ART) It seems to discuss and compared with relevant related works.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Properties of the proposed approach are discussed across the paper and evidenced in the experimental results presented in Section 5.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The implementation is available and the experimental study seems quite extensive.
(OVERALL SCORE) This paper presents an extension to the HDT compressed representation of RDF. The extension considers that triples in a dataset are associated with graphs or contexts and is able to record this additional information, storing quads instead of triples.  
This extension is implemented in two ways, one that considers a bitmap per triple and another that considers a bitmap per graph. These implementations have been empirically compared to each other and to existing data stores such as Virtuoso and Jena. In the experimental setup four datasets were considered in different configurations, for instance varying overlap of triples among graphs and varying number of graphs. The compression achieved by the extension is similar to the one achieved by HDT, making it considerably better than other alternatives in most of the cases.  
Strong Points:
SP1) Presentation of the motivation of the paper
SP2) Use of examples to illustrate and explain the proposed extension
SP3) Experimental setup that considers diverse configurations
Weak Points:
WP1) Missing computation of the overhead incurred by the extension in comparison to HDT, see question Q19
WP2) Correctness of Algorithm 4 is not evident; see questions Q6, Q7, Q8, and Q10
WP3) Some texts and explanations can be improved, for details see questions Q1-Q20
Questions to the Authors:
Q1) What are RDF terms? (Section 3.1)
Q2) What does "disregarding the name graphs" mean? (Section 3.2)
Q3) Why do the terms that only appear as objects range from |SO|+1 up to |SO|+|S|+|O|?
Q4) Do the search operations introduced at the beginning of section 4.3 consider the graph / context?
Q5) In the example about the getNextSolution operation, where is quad = 7??G, what does the 'G' represent here? Is it the GraphTU graph or any graph?
Q6) Does the Algorithm 4 concern search with bounded or unbounded graphs? In text it is referred as bounded but in the title of the algorithm as unbounded
Q7) Given that the statements in lines 3 and 6 do not consider the graph, how can be it ensured in Algorithm 4 that the retrieved triples are part of the graph g?
Q8) Given that the statement in line 3 provides solutions independently of the graph, what if the first of such solutions do not belong to the graph 'graph' and therefore the statement in line 4 returns 'null'? How would the other solutions be considered?
Q9) Could you provide an execution of Algorithm 3 where it will return 'null'? It seems to always return integer values
Q10) How are statements in lines 6 and 7 of Algorithm 4 connected such that it is correct to include the triple obtained in line 6 with graph 'graph' in the results?
Q11) Could you provide any insights about the triple distribution and entities overlaps among the 9,998 files generated for LUBM500?
Q12) Could you please describe the three independent executions that were averaged for the results? What was independent the conditions? The concrete quads evaluated? The warming up of the systems?
Q13) What are the values for HDT-AG in the first four points of Figure 6? Are they under the points for HDT-AT?
Q14) Does Figure 6 include all the datasets mentioned in Table 1?
Q15) In Figure 7, is there any explanation for the slight decrease after 10 graphs?
Q16) What does it mean "expect for the pattern ????" at the beginning of Section 5.2?
Q17) Did you consider other warming up loads, such as evaluating the same quad several times before the actual measurement?
Q18) In the case of the quad S??G, is there any explanation for the huge different between Figures 8a and 8b?
Q19) How much is the overhead in time and space incurred by HDTQ in comparison to HDT?
Q20) Did you check if the obtained results from the different systems were the same and correct when evaluating quads for Figures 8 and 9?
--- after rebuttal ---
I thank the authors for addressing my concerns in their rebuttal, I increase my score from "weak accept" to "accept".

 

Metareview by Olaf Hartig

 

This paper presents an extension of the RDF compression format HDT that supports named graphs. The reviewers agree on the relevance and on the novelty of this work. Other strong points of the paper are the thorough experimental evaluation and the clear presentation in the paper. Moreover, during the rebuttal, the authors managed to satisfactorily address the remaining concerns of the reviewers. As a consequence, this paper can be recommended for acceptance. We expect that for preparing the final version, the authors take into account the reviewers’ comments and implement the changes mentioned during the rebuttal.

 

Share on

Leave a Reply

Your email address will not be published. Required fields are marked *