RDF-Gen: Generating RDF from streaming and archival data
Author(s): Georgios Santipantakis, Christos Doulkeridis, Konstantinos Kotis, George Vouros
Full text: submitted version
Abstract: Recent state-of-the-art approaches and technologies for generating RDF graphs from non-RDF data use languages that have been designed for specifying transformations or mappings to data in various kinds of formats. This paper presents a new approach for the generation of ontology-annotated RDF graphs, linking data from multiple heterogeneous streaming and archival data sources, with high throughput and low latency. To support this, and in contrast to existing approaches, we propose embedding in the RDF generation process a close-to-sources data processing and linkage stage, supporting the fast template-driven generation of triples in a subsequent stage. This approach, called RDF-Gen, has been implemented as a SPARQL-based RDF generation approach. RDF-Gen is evaluated against the latest related work of RML and SPARQL-Generate, using real-world datasets.
Keywords: RDF generation; RDF knowledge graph; data-to-RDF mapping
Review 1 (by anonymous reviewer)
This paper outlines a system, RDF-Gen, for generating RDF data. It considers archival data as well as stream sources. As the paper was submitted to the In-Use Track, I am missing some details on how the tool is used in practice to address real-life problems. One important question is: what exactly is the novelty of this paper over the authors' previous work?
The authors present a good discussion of related work and a nice overview in Table 1. It seems like KR2RML comes closest to RDF-Gen. Hence, it would be interesting to also include KR2RML in the evaluation.
The functionality of the link discovery component is a bit unclear. Is its task to provide links within the data that is generated by RDF-Gen or also links to external sources? The term link discovery suggests the latter but the text describes the former. The details about XML and XPath and the examples are nice to have. Yet, some more explanation would be necessary as some of the examples are hard to understand. It is very positive that the authors publish their code and additional information on GitHub.
The text mentions "Overall, the average time per triple generated is approximately 0.04 seconds, given that the frequency of position reporting per aircraft/vessel is at least 2 seconds." Doesn't this mean that the system is limited to supporting only a certain number of updates per second (25, assuming one update corresponds to one triple) and hence a limited number of aircraft and their updates?
The paper also states that "...RDF-Gen allows the automatic validation of the generated triples using Jena API." Unfortunately, no details are provided on this issue. How does it work?
The conclusion states that the proposed system supports distribution of processing and the exploitation of streaming data sources. Could you please point out which parts in the paper support this conclusion?
Detailed comments:
page 2: " και "
page 6: "Further processing options can be supported such as conversion of values..." --> "can" be supported as in not yet implemented, or "are" supported?
-- After having read the authors' response --
The authors do not comment on my concern regarding missing details on how the tool is used in practice to address real-life problems (applying it to real-world data is not the same as addressing a real-life problem). More importantly, the authors did not comment at all on the question that I explicitly highlighted as being important: what exactly is the novelty of this paper over the authors' previous work? Hence, I have not increased my scores.
Review 2 (by Alistair Duke)
The paper is well suited to the track as it addresses a real-life problem, and the approach is applied to this with a good evaluation section and comparison with related work (which is diligently described in section two). The developed system is well described, although there is a lack of detail on the Link Discovery element, which would also have benefited from clearer examples of its use. One criticism would be the lack of use in practice, e.g. assisting in discovering trends or firing alerts based on real streaming data.
Review 3 (by Anastasia Dimou)
This paper presents RDF-Gen, a tool for generating Linked Data from streaming and archival data. The contribution seems to be significant, as Linked Data generation from streaming data has not been systematically studied so far. However, the description in the paper is not enlightening. I assume that the paper aims to present the tool which was developed for the datAcron project to generate RDF from streaming and archival data. Unfortunately, this is not explicitly mentioned in the paper. For instance, there is a reference to requirements related to datAcron, but the requirements per se are not mentioned. If the paper gets accepted, I would strongly suggest that this is clearly mentioned. I mainly mention this because an In-Use paper typically does not merely present a tool, but explicitly shows its adoption by the community, which is not the case with the paper in its current format.
However, my main concerns are related to the content of the paper. Overall, the paper has the following issues:
1) it treats a language and an implementation as one and the same, while this is not the case;
2) it arbitrarily brings up certain objectives without supporting their choice;
3) it introduces a way of referring to data extracts from a data source relying on examples, but neither is a grammar provided nor can all cases (ever) be covered, while the filtering and processing are not thoroughly addressed;
4) it requires a lot of improvement with respect to the English language.
Introduction:
I agree with what hinders the use of different technologies in the introduction, but I would suggest that the argument is supported either by a reference or by proof that the situation is indeed like this. Similarly, a reference or proof would be nice for the following statement: "an RDF generation approach from multiple data sources ideally should imply a unique workflow."
"SotA for generating RDF graphs from non-RDF data use languages. However, these approaches do not satisfy the computational efficiency and scalability requirements ..." --> The fact that languages are used has nothing to do with the corresponding implementations. Languages may still be used (as occurs in this case) and still satisfy efficiency and scalability. The latter two are requirements for an implementation and not for debating whether a language is required or not.
“a close-to-sources data processing” --> While this concept is often brought up and is considered one of the primary contributions of RDF-Gen, a definition is not given, so it is not clear what is meant by this concept. If the paper gets accepted, I would suggest that this concept is defined in the paper.
“Driven by the datAcron domains’ requirements and the limitations of existing RDF generation approaches” --> It would better support the paper to know what these requirements and limitations are.
"RDF-Gen satisfies the following individual objectives (subsequently denoted by Ox)" --> Those objectives seem arbitrarily chosen and are not explained; they should be supported by a reference or a proof that these are the requirements. Moreover, not all objectives are meaningful, and the authors contradict themselves as far as their necessity is concerned. For example:
"O2. Provide facilities for ... generation of URIs" --> Isn't this by definition the case? Namely, RDF generation relies on URI generation.
"O4. Demonstrates computational efficiency" --> Computational efficiency is not always necessary. For example, if DBpedia releases every 3 months, generating its RDF in 2 months' time is still on time for the release, but we may agree that it is not computationally efficient! In other cases, the quality matters more than the efficiency. We cannot argue that all implementations should necessarily be computationally efficient.
Related work:
"RML does not require storing files..." --> But RML is a language; I guess the RMLProcessor or the RMLMapper is meant here and in the following comments.
"RML supports the integration of custom data processing functions using a script language, namely FunUL." --> Firstly, the reference is missing. Then, there were actually two alternatives proposed for RML. An alternative with FnO is missing from SotA.
"SPARQL-generate supports easy validation of generated output (objective O9)." --> Based on the introduction, though, there are only 7 objectives; which one is O9?
"RMLProcessor is an RML-based approach" --> The RMLProcessor is wrongly cited! The cited work proposed an extension of the RMLProcessor with functions but not the RMLProcessor itself! The best reference for the RMLProcessor at the moment is .
"DataLift can parse several types of data e.g. CSV, RDF, XML and ESRI shapefiles but not streams." --> Well, streams can be in CSV format, which DataLift supports. So either DataLift does support streams (which is not the case), or the problem is not in the data formats that DataLift supports but in the type of data sources that DataLift supports.
"GeoTriples , reuse existing mapping languages such as R2RML and RML, thus they inherit their limitations and advantages" --> Is the focus on the languages or the implementations? I think GeoTriples inherits the limitations of the R2RML and RML engines that it uses and not of the languages.
The paper misses the most relevant state of the art, namely carml https://github.com/carml/carml, an RML engine for streaming data.
RDF-Gen:
"Data connectors provides data from sources in a uniform fixed size vector of value" --> Is it the same fixed size? If so, how was it determined? If not, how is it configured?
The “grammar” which is defined to refer to the data of the different data sources is arbitrarily selected and not formally defined. How reproducible is the approach? I.e., could other generators replicate the generation? In the same context, how transferable is the solution to other data formats? How extensible is the approach to contexts?
"the data connector can also connect to SPARQL endpoints" --> How are the references to the data defined then? More details are required for a complete description of the tool's function.
"Such a mapping specification may include a filtering mechanism to exclude entries w.r.t. values on specific attributes. Further processing options can be supported such as conversion of values or data extraction" --> I am wondering how this happens, as it does not seem to be as straightforward as attribute filtering. Moreover, I am wondering how this happens with respect to coverage: is the syntax for writing such statements defined, and is every possible value transformation fully supported?
"Although link discovery while generating RDF data provides an advantage in many cases, it is not always applicable or appropriate." --> This statement is in conflict with "O3 Supports close-to-source link discovery functionality". Is it necessary after all or not?
The formalization is not properly defined. For instance, “F” appears in different formats, leaving readers to doubt whether it refers to the same concept. If the paper gets accepted, the formalization needs to be reworked in my opinion.
Evaluation:
“typical or large volumes of data varying between 100 and 100,000 entries” --> I do not think that <100,000 entities may be considered large volumes of data.
Indeed, the throughput is a very valid aspect that needs to be evaluated with respect to streaming data. However, I do not see how meaningful its comparison is with implementations which are meant only for static data and whose throughput is by definition lower. Ideally, I would expect a comparison against carml, which is the only tool that supports Linked Data generation from streaming data nowadays. In the authors' defense, carml is a relatively new tool, so it may not have come to their attention.
Ademar Crotti Junior, Christophe Debruyne, Rob Brennan, and Declan O'Sullivan. 2016. FunUL: a method to incorporate functions into uplift mapping languages. In Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services (iiWAS '16).
De Meester B., Maroy W., Dimou A., Verborgh R., Mannens E. (2017) Declarative Data Transformations for Linked Data Generation: The Case of DBpedia. In: Blomqvist E., Maynard D., Gangemi A., Hoekstra R., Hitzler P., Hartig O. (eds) The Semantic Web. ESWC 2017. Lecture Notes in Computer Science, vol 10250. Springer, Cham.
Minors:
I mention a few indicative examples of the grammar/syntax mistakes I encountered, but the paper contains many more. Overall, it needs serious proofreading.
A wide range of tools have been implemented --> A wide range of tools has been implemented
transforming and linking data from all sources --> not literally all right?!
views on data that analysis tasks require --> views on data that require analysis tasks
in a variety of domains, varying from data --> variety and varying in a sentence
for the generation of RDF from necessary data sources in a domain --> rephrase
This may imply hindering --> this may hinder
verifying RDF data generated --> verifying RDF data generation
As one important type of data sources include --> As one important type of data sources includes
w.r.t. --> with respect to
satisfying strict latency requirements --> (to) satisfy strict latency requirements
KR2RML --> KR2RML
Data connectors provides data --> Data connectors provide data
implementing by the triple generator --> implemented by the triple generator
according to the configuration provided --> according to the provided configuration
The server instance is listening on a port --> The server instance is listening to a port
Review 4 (by anonymous reviewer)
The paper discusses a novel approach to extracting and interlinking extracted data from RDF streaming and archiving heterogeneous and distributed sources. The paper is very well written, and the extensive experimental results show that the proposed system outperforms the state-of-the-art systems RML and SPARQL-Generate for all the KPIs (scalability, throughput and usability) considered by the authors.
Review 5 (by Anna Tordai)
This is a metareview for the paper that summarizes the opinions of the individual reviewers. The paper presents a tool for generating RDF from streaming and archival data and addresses a “hot” topic for the community. The authors also share their code and additional information on GitHub. The reviewers praise the extensive evaluation and mention that the system is quite well described, although the paper needs some clarification in elements such as the link discovery component. Reviewer 3 lists a number of concerns, including the lack of explanation of the choice of objectives listed in the Introduction. This particular concern is not addressed in the rebuttal by the authors. Reviewer 1 also points out that the question of what real-world problem is being addressed is insufficiently addressed in the rebuttal.
Laura Hollink & Anna Tordai