A Parameterized Formal Model for Flexibly Defining Schema-level Indices for the Web of Data
Author(s): Till Blume, Matthias Schulte-Althoff, Thomas Gottron, Ansgar Scherp
Full text: submitted version
Abstract: Schema-level indices address many challenges that come with the growing size of the Web of Data. We argue that there is no one-size-fits-all index for the Web of Data. Rather, a parameterized, formal model is needed that allows one to quickly design, tailor, and compare different schema-level indices. We provide the first formal model for schema-level indices, called FLuID. Our model is abstracted from existing indices. In addition, it provides novel features such as aggregation over owl:sameAs as well as RDFS inferencing. Indices defined with FLuID can be efficiently computed in O(n). We implemented the FLuID model following an existing stream-based schema computation approach for the Web of Data. We empirically show that different index models are indeed needed for different information needs, datasets, and space requirements.
Keywords: Linked data; Schema-level indices; Parameterized meta model
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) I did have a slight problem with the notion of "schema-level index". Googling for this term (in quotes) reveals 39 search results, which is not an awful lot. I would have liked a simple explanation at the beginning of the paper so that it can be placed into the reader's mental map of the Semantic Web world more easily. However, after reading the paper, I think it does fit.
(NOVELTY OF THE PROPOSED SOLUTION) The approach, judging from the background work of the authors, appears to be novel.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Unfortunately, I could not follow the technical details of the approach. It looks mostly alright.
(EVALUATION OF THE STATE-OF-THE-ART) There is a detailed chapter on related approaches.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Details of the complexity analysis are typically deferred to the technical report; as such, the paper is not self-contained. There are, however, some mentions of computational complexity and other properties. Still, I found the discussion very hard to follow.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) There is a detailed empirical study comparing related approaches from the literature. However, a slightly better discussion of the relevance of the handpicked datasets would have been good. Generalisability is, as always with studies that do not test with a set of input problems covering a wide range of the space of possible problems, impossible to judge.
(OVERALL SCORE) Revised review after the authors' rebuttal: no changes to this review were necessary. The authors present a new model for schema-level indices and empirically show that there is no one-size-fits-all index for the WoD, and that their model helps to flexibly define new indices. The topic fits well into ESWC, and the write-up is up to community standards.
Unfortunately, the paper is very hard to follow for the average Semantic Web person with no background in schema-level indices; therefore I must defer final technical judgment to other reviewers. Some very minor remarks/typos:
- Introduction: As a typical ESWC participant, I am still not familiar with schema-level indices. Please provide a definition and an example in the introduction when you first use the term.
- Chapter 2: pnly -> only (typo)
- Chapter 3, Def 2: A RDF instance -> An RDF instance
- Chapter 3, Def 4: An Property-Object -> A Property-Object
- Chapter 4: Include a sentence on how the choice of dataset might influence the conclusions of the evaluation.
- Chapter 4 (optional): A quick breakdown of the two datasets would have been helpful (beyond number of triples and instances), such as a breakdown of the RDF(S) language constructs used.
Review 2 (by Vito Walter Anelli)
(RELEVANCE TO ESWC) The proposed approach is interesting and relevant to ESWC topics.
(NOVELTY OF THE PROPOSED SOLUTION) The novelty of the method is limited, but the considerable potential utility of this small but novel idea should be taken into account.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The bottom-up creation approach of the model is quite limiting with respect to the authors' effort to abstract a parameterized model. That aside, the proposed approach is correct.
(EVALUATION OF THE STATE-OF-THE-ART) The related works are relevant, but a more detailed comparison between the formalizations of the existing approaches and the proposed approach would have been interesting in order to appreciate the modelling differences.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors often (eight times) refer to an external "technical report" that contains formal definitions, details, and even some discussion of results. The work therefore does not seem self-contained; I think that at least the formal definitions should be in the main paper. What happens if the link is broken for any reason? Will readers appreciate a half-defined model?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) This technical report is essentially an extended version of the proposed work, stored on Dropbox. Once I read of the existence of an external repository, for the sake of reproducibility, I expected to find pseudocode, an implementation of the approach, and links to the redefined indices for the competing approaches. My curiosity was frustrated. Moreover, the authors refer to simple queries and complex queries, but I think these should be described in more detail. The technical report could be a good place for example queries.
(OVERALL SCORE) The authors propose a formal model for schema-level indices named FLuID, an abstraction of the existing models presented in related works. The abstracted components are subsequently parameterized to constitute the final model.
After the explanation of the formalization, an evaluation is provided that takes into account the compression ratio of the schema elements over the number of original triples and the ratio of instances in the summarization over the instances in the original dataset. To ensure the scalability of the approach, the authors exploit the stream-based schema extraction approach and evaluate the approximation error.
Strong points:
- very interesting idea of defining a single formal model for a plethora of indices
- good idea to use a stream-based approach to solve the scalability issue
- good general setting for the evaluation
Weak points:
- formalization in an external "technical report" (bad for a paper on a formal model)
- lack of reproducibility
- unclear points in the evaluation of the method
Review 3 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper fits the conference topics very well, since it presents a formal parametric model to build schema-level indexes that summarize large RDF datasets.
(NOVELTY OF THE PROPOSED SOLUTION) The main idea of this work is to provide a parametric model capable of generating different schema-level indexes (SLIs) based on the requirements of the application. Thus the approach aims at generalizing previous work in such a way that the type of index can be defined by specifying parameters. In addition, the paper discusses an approach to generate SLIs using stream processing. Finally, the use of inference to compute SLIs is a novel contribution of the paper. It is not clear whether the proposed generalization can generate SLIs that were not covered in previous work (and by the four approaches compared in the experiments), or whether it is only aimed at making the generation of SLIs parametric. My feeling is that this work only formalizes what is already done in previous approaches that are quite similar to each other. Thus novelty is limited, although I recognize some usefulness in the parameterization proposed in the paper.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper has frequent references to a technical report. I found that the model in the paper is not well explained (even though I know most of the referenced papers, like TermPicker and SchemEx), and that some definitions somehow lead to confusion. Some definitions also seem more complex than needed (e.g., Definition 2 of RDF instances, which are basically triples sharing the same subject). Other things are not sufficiently explained, such as the payload (page 6), or tilde^s, tilde^p, and tilde^o in Complex Schema, which are not defined. Another important point is the effectiveness of the streaming approach proposed to compute summaries and the decision to avoid any sort of pre-processing.
To the best of my knowledge, your experiments show that the processing window significantly decreases the representativeness of the summary, at least for complex queries (see, e.g., the results for CQ with the TimBL-11M dataset, and the results for SQ and CQ on the DyLDO-127M dataset). However, I wonder whether the quality of the approximation depends on the order in which triples are read: if triples are grouped by subject (as frequently happens), then no preprocessing is needed; but what if triples are grouped differently (e.g., by predicate)? In conclusion, would such an approximate index be useful in practice in the application scenarios? And what would be the advantage, if you can easily compute the full index using 1 TB of memory? As a minor comment, the linearity of the computation could have been discussed better, at least for corner cases. For example, it would be interesting to have an idea of how cardinality estimation is computed linearly; I have the impression that you would need to scan and group triples matching a schema element (i.e., in an equivalence class defined by a P-O schema element) both by subject and by object to estimate cardinality (by object, to compute distinct subjects associated with distinct objects).
(EVALUATION OF THE STATE-OF-THE-ART) Most of the relevant work is cited. I suggest adding a reference to Loupe (Mihindukulasooriya et al. 2015), which also computes SLIs based on schema patterns. There are some inaccurate statements about some of the referenced approaches; I suggest double-checking these statements. Also, it may be useful to give a feel for the objectives of the different approaches (e.g., supporting query answering vs. helping users to understand the data).
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The results are convincing about the effect of different parameters on the size and completeness of the index.
However, I believe that the paper is not well written and is hard to read even for someone familiar with the addressed topic. For example, when evaluating the compression and summarization ratios, you compare the number of triples. This means that summaries output triples. I found it confusing not to find a clear definition of the triple-wise output of the different models. I believe that this would significantly help the reader to get a better idea of how the different summaries look (thus convincing them of the usefulness of the parameterization). It would also have been useful to connect these results back to the requirements. For example, from the experiment, we might see that approach X with parameters Y is more useful when you need to perform task T. Do you believe that similar conclusions can be drawn from your experiments? Some comments provided in the section "Correctness and Completeness of the Proposed Solution" are also relevant here.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code of the proposed approach and the experimental data have not been shared, making it hard to reproduce the experiments. There are also some clarity issues in the experimental evaluation; see further comments in the review. Here I mention three points:
- It is not clear in the experimental part what are considered simple queries and what are complex queries. An example of each could clarify this.
- "The queries are generated from the schema elements taken from the gold standard" needs to be explained further.
- You never say that SLIs return triples until the experimental section (I can understand that this could somehow be derived from the figures, but it is still confusing for the reader).
(OVERALL SCORE) The paper presents FLuID, a parameterized formal model for generating schema-level indices (SLIs). Schema-level indices are used to abstract the original data (grouped into instances) and create summaries based on schema elements (e.g.
use of types and properties). There exist different approaches that generate SLIs, which use proprietary data structures to solve a particular application scenario such as searching, browsing, or recommendation of terminology. Having such a formal model for generating SLIs for different scenarios (requirements) is useful not only to facilitate their creation but also to compare different SLIs. FLuID generalizes over some relevant related approaches used to generate SLIs. All the schema elements (simple and complex) considered by the model use the concept of equivalence relations over RDF instances. These schema elements can be further specified using four parameters, and the novelty lies in the aggregation over owl:sameAs and the RDFS inferencing, which were not used in state-of-the-art approaches. The approach is evaluated in terms of quality and scalability of the results on two datasets, TimBL (11M triples) and DyLDO (127M triples), two very large datasets valuable for evaluating the approach. I am a bit ambivalent about this paper. In general, I quite like the idea of parameterizing the generation of SLIs, and I agree this requires a bit of formalization. I also like the use of RDFS inferences to build SLIs. Otherwise, the formalization is not very clear in the paper, even for a reader familiar with the topic. Finally, the authors should focus more on the impact of the novel contributions (RDFS-based inference and approximation with stream-based processing), which are not well discussed.
Strong Points (SPs):
* The use of inference in the creation of SLIs is an interesting novel contribution of the approach.
* The paper tackles an important problem, the summarization of datasets, which is very useful to serve a variety of application scenarios.
* The SLIs generated by FLuID are computed linearly in O(n) and cover SLIs that were generated by different models.
* The comparison of different SLI approaches is interesting and evaluated on big datasets.
Weak Points (WPs):
* The paper is not well written, which makes it difficult to interpret the main concepts and definitions correctly. It refers to a technical report, which is fine, but I suggest explaining the main concepts better in this paper and referring to the technical report for details. Defining the output of the differently parameterized SLIs in terms of triples (as used in the experiments) may help the reader. The use of inference is a novel contribution of the paper, which may deserve more attention.
* The generalized model is the main contribution of the paper, in the argumentation. However, I find it difficult to generate a new SLI for a new scenario that is not already covered by the schema-level indices provided by the state-of-the-art approaches: it is unclear whether this generalization covers only the approaches compared in the experiments. Thus, my feeling is that this work only formalizes what is already done, which may be more suitable for an extended version of a previous paper (e.g., TermPicker) than as a novel contribution to ESWC.
* I am not sure that the stream-based approach works in this context, as the loss in representativeness of the SLIs (quantified somehow in the experiments reported in Table 2) seems quite significant, and it is not clear whether it is acceptable in practice. Also, I have the impression that the loss in representativeness is sensitive to the order in which input triples appear. A more in-depth discussion of these problems is needed.
* Experiments and the evidence reported therein are interesting, but some important aspects are not adequately (or clearly) discussed. Although there exists a detailed report, I would expect to see a clear and comprehensive discussion of the main points in the paper. For example, in the related work you mention more than four SLI approaches, while in Figure 2 and in the experiments you consider only four. The choice of the approaches compared in the experiments should be better motivated.
Also, the impact of the approximation on practical scenarios should be better discussed (is an SLI that achieves less than 0.50 F1 useful?).
Questions to the Authors (QAs):
* Which are the exact requirements coming from the application scenarios that are addressed by the parameterization?
* Does the quality of the approximation depend on the order in which input RDF triples are processed? (What if triples are not grouped by subject but, e.g., by predicate?)
* Based on the results in Table 2, would approximate indexes be useful in practice in the application scenarios mentioned in the introduction? (Otherwise, what would be the advantage of sacrificing representativeness, if you can easily compute the full index using 1 TB of memory?)
* In the experiments, why do you choose k = 0, 1, and 2?
* In the experiments, why are windows defined on schema elements rather than on the input RDF triples? Is it a plain mistake in the text?
* Which settings (k, metric, and dataset) do you refer to in the comment "increases the index size by about 3% more triples"?
* In your comment "75 times larger on the DyLDO-127M dataset", I cannot find this value for k=1 and for SemSets. Can you clarify this?
Minor comments:
* Why don't you define the compression and summarization ratios already in Section 5? You should not repeat and "re-define" them in Section 6. This comment about text that is repeated and rephrased also applies to other parts of the paper.
* FLuID -> you should first explain what the acronym stands for and then use it throughout the whole paper.
* LODatio+ (Section 1, first paragraph) -> It is not clear how this is connected to what you said before. Provide a better connection with the previous sentences.
* ABSTAT selects one type from the set of types such that all remaining types are sub-classes of the selected type -> Do you mean super-classes? Also, are you sure only one type is selected?
* An Property-Object Cluster -> a Property-Object Cluster
* Definition 4: ...
iff for each triple in both instances there exists a matching triple in the other instance, such that the property and the object are the same -> should be rephrased
* Figure 2a -> you should state that p1 = p2 and r1 = r2. This is only mentioned implicitly in the text "since it requires the objects to be connected over the same property".
* Example 1 is not clear. It does not seem that i2 and i3 are equal. Maybe you meant i3 and i4?
* "than simple queries" -> that simple queries
* "Despite being a larger index in terms" -> Despite being a large index in terms
* "there exist fewer partitions on the data gaph" -> fewer partitions compared to what? Or did you mean just few partitions?
** After rebuttal **
I thank the authors for their answers. In particular, I found the argument about the size of LOD Laundromat convincing regarding the need for stream processing to solve memory problems. However, the accuracy guaranteed by SLIs built with the proposed approach is very low for complex queries. I understand that improving the quality of the computed SLIs may be a subject for future work, but I would like to see at least a more in-depth analysis of the results discussed in the paper. What is the practical impact of the currently proposed solution? (E.g., can no CQ be supported using these indices?) It is quite surprising that the major impact of the order of triples on the quality of the produced SLIs is not discussed in the paper. It is not clear how this further discussion can fit into the paper, which is already not well presented. In summary, I see a lot of potential in this work, but in its current shape (analysis of results and presentation), it is hard to judge whether it can be made acceptable for publication in a limited amount of time.
Review 4 (by Roberta Cuel)
(RELEVANCE TO ESWC) The paper is well structured and provides an innovative formal model for schema-level indices called FLuID. The authors also empirically validated the model.
(NOVELTY OF THE PROPOSED SOLUTION) I am not an expert in the domain, but the contribution seems innovative.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper is complete, considering that the authors provided an empirical evaluation.
(EVALUATION OF THE STATE-OF-THE-ART) I am not an expert in that.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) I am not an expert in that.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) More details should be provided, and the dataset should be online.
(OVERALL SCORE) The strongest element of this paper is the empirical evaluation. The weakest parts are Sections 3 and 5. In particular, I suggest the authors focus more on the discussion of results (Sec. 5), recalling the elements described in Sec. 3.
Metareview by Olaf Hartig
This paper introduces a formal model to capture schema-level indices of RDF datasets and provides an empirical evaluation of such indices. The reviewers agree that the presented work is very relevant and interesting. However, there have been major concerns about the readability of the paper; several reviewers (including expert reviewers) found the discussion hard to follow and the formalization not very clear. Perhaps related to this issue, several important details are not in the paper but in an external technical report, which makes this paper hardly self-contained. Further concerns that have been raised are the generality of the proposed model, which is unclear, and the limited discussion of the results of the evaluation. Taking all these points together, I cannot recommend accepting this paper because I do not think the issues can easily be fixed in a short amount of time (in particular, the issue regarding the readability of the paper). However, given the agreement on relevance and the interest in the presented work, we highly encourage the authors to address these issues to turn this manuscript into a high-quality paper in the future.