Paper 126 (Research track)

Hierarchical Characteristic Set Merging

Author(s): Marios Meimaris, George Papastefanatos

Full text: submitted version

Abstract: Characteristic sets (CS) organize RDF triples based on the set of properties characterizing their subject nodes. This concept has recently been used in indexing techniques, as it can capture the implicit schema of RDF data. While most CS-based approaches yield significant improvements in space and query performance, they fail to perform well in the presence of schema heterogeneity, i.e., when the number of CSs becomes very large, resulting in a highly partitioned data organization. In this paper, we address this problem by introducing a novel technique for merging CSs based on their hierarchical structure. Our technique employs a lattice to capture the hierarchical relationships between CSs, identifies dense CSs and merges them with their ancestors, thus reducing the number of CSs as well as the links between them. We implemented our algorithm on top of a relational backbone, where each merged CS is stored in a relational table, and we performed an extensive experimental study to evaluate the performance and impact of merging on the storage and querying of RDF datasets, indicating significant improvements.
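For readers unfamiliar with the underlying notion, the following is a minimal, hypothetical sketch (not the authors' raxonDB implementation) of how characteristic sets can be derived from triples and how a support threshold m might flag the dense ones; the triples and the threshold value are illustrative assumptions only.

    from collections import defaultdict, Counter

    # Toy triples (subject, predicate, object) -- illustrative data only.
    triples = [
        ("s1", "name", "A"), ("s1", "country", "GR"),
        ("s2", "name", "B"), ("s2", "country", "DE"), ("s2", "mayor", "m1"),
        ("s3", "name", "C"), ("s3", "country", "FR"),
    ]

    # A subject's characteristic set (CS) is the set of properties it uses.
    props_per_subject = defaultdict(set)
    for s, p, _ in triples:
        props_per_subject[s].add(p)

    # Support (cardinality) of each CS = number of subjects sharing that property set.
    cs_support = Counter(frozenset(ps) for ps in props_per_subject.values())

    # A CS is "dense" if its support exceeds a hypothetical threshold m.
    m = 1
    dense = {cs for cs, sup in cs_support.items() if sup > m}
    print(dense)  # only {'name', 'country'} is dense in this toy example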

Keywords: RDF; SPARQL; Characteristic Set; Query Optimization; Indexing; Database; Data Management

Decision: reject

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) In my opinion, the subject tackled in the paper is strongly relevant to ESWC. Indexing RDF graphs to improve the performance of the existing repositories is crucial for the successful adoption of these technologies.
(NOVELTY OF THE PROPOSED SOLUTION) The main problem I see regarding the novelty is that it is focused on how to merge the Characteristic Sets. This problem, in the end (while not referenced), boils down to extracting the most interesting frequent itemsets, which is a well-known and thoroughly studied problem in the Data Mining community. Besides, the hierarchy and information content arising from a lattice such as the one built here have also been the subject of thorough study within the Formal Concept Analysis community.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposal seems to be correct. However, the authors redefine several concepts that are well known in Data Mining. For example, the notion of density is usually applied to what they call the NULL-value effect, whereas they define dense nodes according to a threshold on their support; and the definition of the NULL-value effect might not be exact (see questions to the authors).
(EVALUATION OF THE STATE-OF-THE-ART) As previously stated, I completely miss an analysis of the reduction of the concept lattice from the point of view of frequent itemset mining or FCA, fields that are devoted precisely to the problem the authors are tackling. It is true that the application of the emergent schema to this particular subject is novel, but the main goal of the approach is to obtain a good schema, i.e., the most important CSs/frequent itemsets, and this crucial aspect of the paper is not properly positioned.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper is easy to follow, and indeed, the use of the emergent schema is shown to be useful for indexing. The implementation section is a little more obscure and not self-contained (I had to go to their ICDE'17 paper for some missing details).
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments show that the approach works for the selected datasets and queries. However, I miss some comment on the differences in the m value across the different datasets and the heterogeneity of the data (see comments and questions).
(OVERALL SCORE) The authors propose a method to exploit the implicit hierarchy that exists between the Characteristic Sets of the KB to reduce their number, and therefore to have fewer tables in their underlying indexing schema. Those CSs and their extensions (ECSs) have already been shown to be useful for improving the performance of SPARQL queries. The main idea is to select the sets of CSs which maximize the usefulness of each of the created index tables. 
Strong points: 
* The proposal of using the emergent schema to index the data is indeed useful and has already been shown to be effective. 
* The detection of the most interesting CSs/structures to reduce their number is indeed the way to go for improving the indexing of RDF graphs. 
* The authors provide an implementation on top of a well-known RDBMS, and experiments show the effectiveness of the indexing. 
Main weak points: 
* The selection of the most relevant frequent itemsets/CSs is a well-known problem in different domains such as Data Mining and/or FCA. 
* Following the previous point, the authors redefine concepts that have already been used to solve this kind of problem. 
* The authors do not properly address the problem of schema heterogeneity, which leads to bigger alphabets/different items in the CSs. 
Questions/Comments to the Authors: 
* Going back to the 2011 paper, and to the 2017 paper, the authors mine all the existing patterns, building the complete lattice of the CSs. This is known to be a problem, and that is why the authors need to reduce the number of nodes in such a lattice. However, many previous efforts have been omitted. For example, there are many works aiming at mining closed frequent itemsets or at selecting the most representative ones; to name a few:  
- Pasquier et al. Discovering frequent closed itemsets for association rules. Proceedings of the ICDT’99, pp 398–416, 1999
- Xin et al. Mining compressed frequent-pattern sets. Proceedings of VLDB’05, pp 709–720, 2005
- Wang et al. Summarizing itemset patterns using probabilistic models. Proceedings of KDD’06, pp 730–735, 2006
or more recently: 
- Webb et al. Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM TKDD, 4(1):1–20, 2010
- Webb et al. Discovering significant patterns. Mach. Learn., 68(1):1–33, 2007
- Vreeken et al. KRIMP: mining itemsets that compress. Data Min. Knowl. Disc., 23(1):169–214, 2011
On the other hand, from the FCA point of view (regarding the concept lattice that is built using CSs), I would suggest to take a look on: 
- Priss et al. Formal Concept Analysis in Information Science. Annual Rev. Info. Sci & Technol., 40(1):521-543, 2006 
as a starting point. 
* The authors state that "the notion of cost depends on possibly subjective factors, such as the query workload, the storage technology, the input dataset and so on". Such a claim of subjectivity is quite bold, especially when any of the given examples can precisely be an objective measure (in the case of the input data, we could calculate its homogeneity/density). Note that this can also be stated without doubt in the context of RDBMSs, and they have managed to index their data (of course, exploiting structure and regularities, as suggested in the ICDE'2017 paper for RDF graphs). The patterns carrying the most information can be selected, and these can even be the merged ones. 
* Please revise the definition of r_null, taking into account that the support/extension of the subsumed CS also contributes to the amount of information that the other table is going to store. Focusing on the matrix built by r_j x P_j (with r_i being a strict subset of r_j and P_i of P_j), to evaluate the null-part ratio I miss |r_j - r_i| in the denominator; otherwise, if I am not wrong, the ratio is not properly estimated (see the worked formulation after this list of comments). 
* In raxonDB, how do the authors determine the foreign keys to be addressed? At first sight, a particular element in a CS could be joined to many different tables. Besides, the explanation of the query processing is quite difficult to follow as it currently stands (it might be too condensed from ICDE'17). 
* I would like to know the rationale behind only including the outgoing links in CSs/ECSs. To what extent could the permutations in the query processing step be avoided by also including the resources' incoming links? 
* Regarding the evaluation of the approach, while the results are interesting, I have some concerns about the heterogeneity of the knowledge bases used. I miss an analysis of the sparsity of the data in the transaction/Characteristic Set space. For example, while Geonames seems at first (m=0) to be quite heterogeneous (851 CSs and 12136 ECSs), getting rid of the potentially noisy CSs by establishing even a really low support threshold reduces the number of CSs and ECSs drastically, denoting potentially strong homogeneity. This also seems to apply to WatDiv Simple. I cannot but wonder how this frequent-itemset merging strategy would behave with a general-domain knowledge base, such as DBpedia, for example (getting all the CSs/closed frequent itemsets would explode). 
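Regarding the r_null comment above, here is one possible reading of the suggestion, stated as an assumption rather than as the paper's actual definition: suppose the ancestor CS c_i (properties P_i, records r_i) is merged into a descendant c_j (properties P_j with P_i a subset of P_j), and the merged table stores all records r_j (with r_i a subset of r_j) over the columns P_j. Only the r_i rows carry NULLs, in the columns P_j \ P_i, so a null ratio computed over the whole merged table would be

    r_{null} = \frac{|r_i| \cdot (|P_j| - |P_i|)}{(|r_i| + |r_j \setminus r_i|) \cdot |P_j|}
             = \frac{|r_i| \cdot (|P_j| - |P_i|)}{|r_j| \cdot |P_j|}

where the |r_j \setminus r_i| term in the denominator is, under this reading, what the reviewer finds missing from the paper's formulation.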
Minor Comments: 
* In Definition 4, c_base is not established through the definition of P_1 (it is implicitly c_1). 
Typos: 
In Figure 1c, the arrow going out from C6 points the wrong way.
Remarks after the rebuttal:
I'm afraid that I cannot agree with the authors' answer to the main concern I raised. Regarding FCA, I acknowledge that the goal of FCA is a little different, but there have been many previous efforts precisely on summarizing the lattice by selecting the best granularity while losing as little detail as possible (which would be equivalent to minimizing the NULL parts of the tables, in the authors' nomenclature). However, what I find to be the most important problem is the argument about frequent itemset mining. The support of a frequent itemset is precisely what they define as cardinality (the number of triples, ~transactions in Data Mining, that a particular frequent itemset covers), and the selection of such frequent itemsets (finding the most descriptive patterns) is based on that support and, among other factors, the density of the clusters that they build. In fact, it is not difficult to show that the characteristic sets are themselves itemsets (each property in the CS would be an item, and their cardinality is their support). 
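As a concrete illustration of this argument, the subjects' property sets can be fed directly into an off-the-shelf frequent-itemset miner. The sketch below is a hypothetical example, assuming the third-party mlxtend library is available; the property sets are made up for illustration, and the reported support of an itemset is exactly the combined cardinality of the CSs containing it, divided by the number of subjects.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    # Each subject's characteristic set is a transaction; each property is an item.
    transactions = [
        ["name", "country"],           # CS of subject s1
        ["name", "country", "mayor"],  # CS of subject s2
        ["name", "country"],           # CS of subject s3
    ]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    # Frequent property combinations with relative support >= 0.5.
    print(apriori(onehot, min_support=0.5, use_colnames=True))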
Apart from this particular point, the authors have not addressed other important concerns (for example, the technical problem with the definition of density, or the usage of this technique on heterogeneous knowledge bases, where the number of characteristic sets is going to explode). So, I'm afraid I'm not modifying my evaluation.


Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) Query optimization techniques for RDF datasets with schema heterogeneity are clearly a relevant contribution.
(NOVELTY OF THE PROPOSED SOLUTION) The main idea proposed in this paper was already proposed in [10]; however, the way it has been formalized here is novel.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Algorithms and analysis seem correct.
(EVALUATION OF THE STATE-OF-THE-ART) The paper includes a quite good discussion of related works.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See questions Q2, Q3, Q12-Q14, and Q16-Q17.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The implementation is available but the experimental study is limited (see questions Q12-Q14)
(OVERALL SCORE) In this paper, an approach to improve the data storage and query performance of Characteristic Set-based query optimization is proposed. In order to infer the schema of RDF data, Characteristic Sets, i.e., the sets of properties used to describe each entity in the dataset, have been used. A limitation of this approach concerns very heterogeneous datasets, where the number of Characteristic Sets, and consequently of combinations of Characteristic Sets, is very high. In this case, merging the Characteristic Sets and their connections can lead to better query execution times. The approach proposed in the paper is based on a hierarchy of Characteristic Sets derived from the generalization/specialization of types. More specific types include all the properties of their supertypes (more general types). The paper considers the general problem, which is shown to be NP-hard, and proposes a heuristic solution to efficiently reduce the storage resources and query execution time. The paper considers a relational implementation of an RDF store, where each merged CS is stored as a table that includes the union of the properties of the merged CSs; for the more general CSs, the missing properties are represented with NULL values in the merged table. The proposed approach relies on reducing the number of NULL values included in the merged tables. Experiments with real-world and synthetic datasets were performed and show that the proposed approach mostly achieves faster execution times than existing engines.
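To visualise the NULL-value effect mentioned above, here is a small hypothetical sketch (my own illustration, not the paper's storage layer) of merging two CS tables, where the records of the more general CS are padded with NULL (None) in the columns they lack.

    # Hypothetical merge of two CS "tables" into one relational table.
    # The general CS has properties {name, country}; the specific CS additionally has {mayor}.
    general_rows = [{"name": "A", "country": "GR"},
                    {"name": "C", "country": "FR"}]
    specific_rows = [{"name": "B", "country": "DE", "mayor": "m1"}]

    merged_columns = ["name", "country", "mayor"]  # union of the two property sets
    merged_table = [{col: row.get(col) for col in merged_columns}  # missing -> None (NULL)
                    for row in general_rows + specific_rows]

    for row in merged_table:
        print(row)  # rows from the general CS carry None in the 'mayor' column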
Strong Points:
SP1) Presentation of the motivation of the paper
SP2) Formalization of the studied problem
SP3) Positioning with respect to existing works
Weak Points:
WP1) Missing detailed comparison with existing approaches with respect to the merging techniques proposed in the paper, see questions Q3 and Q19.
WP2) Unclear challenges presented by the experimental setup, see questions Q12-Q14.
WP3) Missing details across the paper that make it a bit hard to follow, see questions Q1-Q19.
Questions to the Authors:
Q1) Which version of the real world datasets were used in the experimental setup?
Q2) How would you describe the trade-off achieved by the proposed approach? Does merging the CSs increase in any way the cost of executing queries?
Q3) Can the approach proposed in [10] to merge the Characteristic Sets and suggested for datasets with more than 10,000 CSs be compared with the one proposed in this paper?
Q4) What is C in definition 2?
Q5) Is the subsumption between c_i and c_{base} in E or E'? Should (c_i, c_{base}) be in E? How are V and V' related?
Q6) What does "smaller CSs" mean at the end of page 4?
Q7) Why is the arrow in Figure 1c) between c_2 and c_6 in that direction?
Q8) Point (iv) in Definition 5 concerns E or E'?
Q9) What does "sub-graph roots" mean on page 6? What does "dense root" mean on page 7?
Q10) What is the cost of computing e(k_i) used in Algorithm 2?
Q11) How come the process described in Algorithm 2 "Obviously" does not cover all CSs of the input dataset? When does this happen?
Q12) What kind of data does the WatDiv dataset include? Why was it considered in the experimental setup/Table 1? Why wasn't it considered in Figures 5, 6 and 7?
Q13) What is the advantage of using synthetic datasets if only one instance of each of them is considered?
Q14) How were the queries used in the evaluation chosen? Why those particular queries from [8]? Were there any other criteria besides that they seem to be the simplest and "easiest" queries to evaluate? For instance, it seems that only the highly selective queries for LUBM were considered, and only queries with 3-6 query ECSs were considered for the real-world datasets.
Q15) Can you explicitly provide the "best merging of CSs" mentioned towards the end of page 13?
Q16) Why does raxonDB perform considerably worse than other approaches for Q6 (Geonames)?
Q17) How do the values of Size and Time provided in Table 1 compare to the values for the other engines considered in Figure 7?
Q18) What timeout value was used for the experiments?
Q19) Which version of RDF3X was used for the experimental results shown in Figure 7? Are you sure it includes the Characteristic Sets and their merging for cases of more than 10,000 Characteristic Sets described in [10]?
--- after rebuttal
After having read the other reviews and the authors' response, I have to conclude that the rebuttal does not (sufficiently) answer the questions raised in my review. In addition, as also pointed out by the other reviewers, many of the critical points raised by them were not sufficiently/convincingly addressed either. Although the authors write that they would improve the paper to address several points raised by the reviewers, the decision has to be made based on the paper that has been submitted.
In summary, given that the rebuttal does not satisfactorily address the concerns of the reviewers, I believe the paper requires more work before it can be accepted and consequently lowered my score from "weak accept" to "reject".


Review 3 (by anonymous reviewer)

(RELEVANCE TO ESWC) The paper is about a novel approach to store and query RDF data - with improved performance and scalability - by merging characteristic sets based on their hierarchical structure. Thereby it fully matches the topics of ESWC, as working efficiently with RDF data and continuously improving RDF data management is a core topic of the Semantic Web.
(NOVELTY OF THE PROPOSED SOLUTION) Taking a look at the referenced literature gave the impression that the approach is quite innovative. Nevertheless, not being an expert in this field, I cannot justify this impression in depth! Please take this into consideration - many thanks.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach seems to be correct and complete - nevertheless, some information can only be understood by following the provided references and taking these into account.
(EVALUATION OF THE STATE-OF-THE-ART) Not being an expert in this topic, it is a bit hard for me to give a proven statement here - nevertheless, it is written and explained in a way that helps even a non-expert to understand the topic and approach, and the given references are good and listed wherever needed.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper gives a good overview of the topic and is well structured and written. It explains all mentioned fields and approaches very well, so it can also be understood by a non-expert. Nevertheless, the conclusion and future work section is very short, and it seems that here the authors have lost the motivation in writing that they showed throughout the other parts of the paper. This could be expanded and strengthened by, e.g., giving a more comprehensive conclusion and an outlook on future work, e.g., which benchmarks are planned, how to implement this in a production environment, and/or how this could be used by others in the future.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments carried out are described well and the results can be easily followed. Reproducibility should be possible. Potentially a broader benchmark could have been carried out, or could be considered as part of future work.
(OVERALL SCORE) The paper is about a novel approach to store and query RDF data - with improved performance and scalability - by merging characteristic sets based on their hierarchical structure. Thereby it matches the topics of ESWC well, as working efficiently with RDF data and continuously improving RDF data management is a core topic of the Semantic Web.
The paper is well written and gives good insight into the topic and the novel approach followed. Even a non-expert understands the idea and approach in general, which is useful. References are provided throughout the whole work and are useful to take into account in order to understand and evaluate the work done. Comprehensive experiments have been carried out on top of different datasets, and comparisons with other stores in place are provided. The conclusion and outlook could have been more comprehensive, as they do not provide really useful information for the reader. 
Furthermore, following existing benchmarks could be useful (e.g., as future work) - e.g., the LDBC approaches - and/or a comparison to other RDF/triple stores besides Virtuoso could be helpful.
If someone is working in the field of RDF data management, the topic and the submission are useful & helpful.


Review 4 (by Jens Lehmann)

(RELEVANCE TO ESWC) The submission presents an approach for efficient querying of RDF datasets, which is certainly relevant for ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The method essentially marks all characteristic sets above a cardinality threshold as dense. A non-dense node is then merged with exactly one dense node by iterating over all possible candidates and selecting the best based on a cost function. This greedy process is repeated for the remaining non-dense nodes. 
The rest of the proposed solution is based on existing work on Extended Characteristic Sets. Overall, the solution contains novelty even though it employs relatively simple heuristics.
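As I read it, the heuristic can be paraphrased by the following hypothetical sketch (my own reconstruction with an invented cost function, not the authors' code): each non-dense CS is greedily assigned to the dense candidate whose property set subsumes it and that minimises a cost, here taken to be the number of NULL cells the merge would introduce.

    # Greedy assignment of non-dense CSs to dense ones (illustrative reconstruction).
    # A CS is a pair (property_set, support).
    def null_cost(cs, dense_cs):
        props, support = cs
        dense_props, _ = dense_cs
        return support * len(dense_props - props)  # NULL cells added for cs's records

    def greedy_merge(non_dense, dense):
        assignment = {}
        for cs in non_dense:
            props, _ = cs
            # Candidates are dense CSs whose property set subsumes the non-dense one.
            candidates = [d for d in dense if props <= d[0]]
            if candidates:
                assignment[frozenset(props)] = min(candidates, key=lambda d: null_cost(cs, d))
        return assignment

    dense = [({"name", "country", "mayor"}, 50), ({"name", "country", "area"}, 40)]
    non_dense = [({"name", "country"}, 3)]
    print(greedy_merge(non_dense, dense))  # merges {name, country} into the first dense CS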
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) In my opinion, the preliminaries are not well written:
“In the context of this paper, with the term Characteristic Set we will refer collectively to the properties and records of a CS, i.e., its relational table, rather than just the set of properties proposed in the original definition, for the sake of simplicity.” => Why not just define it this way rather than giving, a few lines before this paragraph, a definition which is invalid for the rest of the paper? Moreover, you have already introduced the notation c_i for the relational table of a CS. If the characteristic set itself is now the relational table, you would not need this.
CS Ancestral Sub-graphs seem to be awkwardly defined: the definition says that a = (V’,E’) is an ancestral sub-graph, but V’ is not explicitly defined, and the notions of “lowermost” and “sink node” are not defined. “This means that any sub-graph with c_base as a sink node will be an ancestral sub-graph of c_base.” is also confusing, as c_base is used both as a node and as a graph here.
“We denote the set of all records of cs_i as r_i, while cs_i is represented by a relational table c_i”, later followed by “Logically, we map each CS to a relational table, so that for a CS c_i we create a relational table t_i”. Hence, c_i is already a relational table, and t_i seems to be exactly the same.
“Hierarchical CS Merge” => I do not understand what is hierarchical about this operation. You could apply it to any CSs and just take the union of properties. The definition also makes the assumption that the first property set (P_1) is the most specific one, an assumption you cannot make in set notation (or you just use P_1 twice, in which case something like P_min would be better for the second occurrence). You also say that “This means that r_i will contain NULL values”. However, the placement of NULL values should actually be part of the definition.
The problem formulation is incorrect (Equation 1). It needs to take L_C as input and then find the merge graph L’ derived from L_C that minimises the cost function. At the moment you minimise over all sub-graphs (and requiring a strict sub-graph relationship there is probably incorrect as well).
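One way to read this suggestion (my formulation, not the paper's) is the following restatement, where \mathcal{M}(L_C) would denote the set of merge graphs derivable from the CS lattice L_C:

    L' = \operatorname*{arg\,min}_{L \in \mathcal{M}(L_C)} \mathrm{cost}(L)

i.e., the minimisation ranges over merge graphs derived from L_C rather than over arbitrary (strict) sub-graphs.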
The notions are not very complex, but please clean them up and present them clearly. 
Complexity Analysis: To me it looks like it should be d^k rather than k^d combinations.
“Note that it lies beyond the scope of this work to compute the degree of approximation to the optimal solution.” => Why?
(EVALUATION OF THE STATE-OF-THE-ART) The authors are experts in the field and well aware of the state of the art. The evaluation compares against other approaches on 3 datasets and shows promising results. A comparison of the loading time of the approach with that of regular triple stores would be useful. Naturally, the approach has the advantage that it optimises the hyperparameter m on the same query load.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The method discusses the influence of the hyperparameter m on query execution and loading times on different datasets. What could be improved is that the heuristic of selecting a threshold for dense sets is not well justified or compared against more sophisticated approaches.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The source code of the approach is provided. The datasets are publicly available. The query workloads are taken from an existing paper and also provided in the repository.
(OVERALL SCORE) ** Short description of the problem tackled in the paper, main contributions, and results ** 
One of the major drawbacks of triple stores compared to relational databases in terms of query performance stems from their schema flexibility. Characteristic sets are an approach to overcome this problem by identifying typical patterns within RDF datasets, specifically sets of properties which jointly occur for entities in the RDF dataset. This can then be materialised to improve querying performance. Each characteristic set is at its core a set of properties. The paper proposes an approach to compute and merge the characteristic sets of an RDF dataset. 
** Strong Points (SPs) **
- Addresses relevant problem.
- Improves state of the art in triple store querying.
- Good awareness of related work.
** Weak Points (WPs) ** 
- Sloppy formalisation. 
- The authors state “but we can rely on the observation that each non-dense node must be merged to exactly one dense node” without any explanation. Given that this is a core part of the paper, this should be justified.
- Method appears to use quite simple heuristics.
** Questions to the Authors (QAs) ** 
- Why must each non-dense node be merged with exactly one dense node?
- What is the loading time of the approach compared to regular triple stores without characteristic sets?
** Minor **
-  “subjective factors, such as the query workload” - Consider rephrasing “subjective”.
- “all no ancestral sub-graph”
- Fig 1 indices are hard to read.
- “which is a significant performance when”
Remarks after the rebuttal:
The rebuttal acknowledges the problems raised about the formalization of the approach, but I do not change my score since no new factual evidence was added and the raised questions were not answered.
The paper indeed has value and can have a very nice potential impact. I can certainly encourage the authors to work on this further. If this were a journal, I would recommend a major revision with a chance to resolve the problems. However, for a conference I think there are too many non-trivial changes required to accept it in my opinion.


Metareview by Olaf Hartig

This paper presents an approach to determine characteristic sets of an RDF dataset that may be leveraged to high effect during query processing. The reviewers agree that the presented work is very relevant and may have a high impact. On the other hand, there is no consensus among the reviewers regarding the novelty of the approach. There have been concerns regarding the relationship to frequent itemset detection in pattern mining, and these concerns have not been sufficiently addressed during the rebuttal. Further issues that the reviewers agree have not been addressed satisfactorily are that the approach has unclear limitations (the problem of schema heterogeneity) and that the paper is not very well written (in particular, there are a number of missing details and the formalization is sloppy). These are weak points due to which the paper in its current state cannot be accepted for publication in the conference. However, given the agreement on relevance and potential impact, we highly encourage the authors to address these issues to turn this manuscript into a high-quality paper.

