## Paper 72 (Research track)

Processing incoherent open government data- A case-study about Romanian public contracts funded by European Union

Author(s): Bogdan Ghita, Octavian Rinciog, Vlad Posea

Full text: submitted version

Decision: reject

Abstract: Lately, many governments have adopted policies and mechanisms for making open data available to citizens, in order to increase the transparency of state administration and institutions. The usage of these data is hampered by the incorrect, incomplete and incoherent nature of the information.

The purpose of this paper is to summarize the general steps that are needed in order to transform raw open data that contain errors to consistent data. These steps are used to correct the open data published by the Romanian government regarding public contracts funded by European Union, supporting entities interested in using these data.

Keywords: Open Data; semantic web; error correction

Review 1 (by Pavel Shvaiko)

(RELEVANCE TO ESWC) The submission provides a case-study on processing incoherent open government data related to the Romanian contracts funded by the EU. The topic addressed is relevant and is worth further investigation. However, the major problem of the submission is that there is no relationship at all to the usage of semantics or semantic technologies in this context, which makes it only marginally relevant to the ESWC scope.

(NOVELTY OF THE PROPOSED SOLUTION) The method proposed is rather straightforward, being mostly an engineering exercise, hence the originality of work is low.

(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The method applies ad hoc rules to handle incoherences.

(EVALUATION OF THE STATE-OF-THE-ART) Related work on open government data and data quality in relation to the semantic technologies is weak.

(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Section 6 discusses a usage scenario, though with no explicit links to the semantic technologies.

(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Experiments could be reproduced. There are no specific statements on "generality" of the study.

(OVERALL SCORE) The submission does not justify the ESWC publication, based on the reasons mentioned above.

Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) See detail below.
(NOVELTY OF THE PROPOSED SOLUTION) See detail below.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) See detail below.
(EVALUATION OF THE STATE-OF-THE-ART) See detail below.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See detail below.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) See detail below.
(OVERALL SCORE) The paper addresses a practical topic - use of open data beyond the SW and wider academic communities, and challenges in data correctness and validity when coming from multiple sources.
Based on a use case reporting EU funding in Romania, the aim is to provide support for analysis, mainly for prevention of corruption.
The authors propose data validation for reports submitted on requests for refunds for money spent delivering (previously) approved projects. Validation is based on a set of files describing data expected in recording such cases.
The proposal addresses the underlying issue at a relatively superficial level - the solution is VERY specific, and as described is not easily reused. As is, even for the same use case, any changes in data properties require a manual fix. The methodology described is actually fine wrt identifying where issues exist and where solutions may lie - the implementation is where it breaks down.
Also, data considered invalid is simply discarded; no attempt is made to correct it, and, importantly, to set up a structure to prevent errors in future data capture. Considering the validation process could easily be reused to prevent input errors in the first place, it's surprising the authors don't consider this aspect of the problem at all. Especially as it would also simplify input.
Looking at Berners-Lee's design issues (ref1) and proposal for 5* data, this solution is problematic. Note therefore that wrt novelty and proposed solution the actual score is borderline. But this option does not exist, so I had to go down, as accept is definitely not the case. For reproducibility/generality I moved up from borderline to weak accept.
Two graphs of the validated data are shown but no discussion carried out to illustrate how they are used to prevent fraud. And yet this is the main purpose of the exercise.
The paper is classified as Linked Data. Three references address linked data, but the paper doesn't at all. And there is no effort to link the output to any other relevant datasets either, or even encode using relevant ontologies, which would be an alternative way to achieve this. So open data, yes, but not necessarily linked. This in itself doesn't invalidate the proposal but the paper should probably be reclassified.
Overall, I would class the paper as borderline; while the issues wrt implementation and discussion are not too difficult to fix, I doubt there is enough time to do so within the conference review process. Without this option I have to move down to reject.
******** OTHER DETAIL
"In [4], Futia et all discussed about using errors occurred in Italian procurement documents …" - is "using" an error? - otherwise please expand on how they made use of the errors in the documents.
"by correcting the various errors that may occur and to describe them in a consistent manner.." - what needs description - the DATA - this is what I would assume, but the text says the ERRORS.
"In order to achieve the goal of obtaining clear and consistent data, all these cases should be investigated. …This involves error detection and correction, which if not possible, the affected data must be invalidated."
Isn't this a bit extreme - why invalidate - this would simply reduce the amount of data made open, potentially by a significant amount. Why not simply flag as not validated?
I'm not convinced the example of "Praha" vs "Prague" in S4.5 is a good one - neither is a mistake, it's simply the same city in different languages. So this is inconsistent, yes, but not an error. Which brings me back to clearly missed potential to reuse the solution at the input stage.
Wrt fig1 - how can you tell unique customers - you state earlier they do not have a unique code. And the probability the name/label of each is always correctly entered is near 0.
In fact, further on in the paper you give an example of constantly changing names. As the discussion there also answers this question, please bring this to the first place where you mention the issue rather than making it appear to be unresolved.
Also, what is the difference between a customer and a beneficiary?
In step 1 - "A document is considered valid if a tuple of <one single document type, one month, and one 4-digit number, between 2009 and 2016> …"
Isn't this a bit short-sighted - I'm reading this in 2018, so already you have more than a year's worth of data that will be automatically invalidated. Isn't the whole point of an exercise like this to develop patterns that check for correctness, rather than using fixed values - going back to the example of "Praha" vs "Prague" - that cause the problem to start with?
"Because values belonging to the same type of logical information can be represented in different formats (eg: numbers as string or as number) we normalized the format across each column." - normalised to what? This matters: the choice will either help to resolve the problem or create a bigger headache.
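The kind of pattern-based validation and format normalisation argued for here could be sketched as follows. This is a hypothetical illustration, not code from the paper; the regex, function names, and number formats are my own assumptions:

```python
import re

# Accept any plausible 4-digit year instead of a hard-coded 2009-2016 range,
# so future data is not automatically invalidated.
DOC_NAME = re.compile(r"^(?P<doc_type>[a-z_]+)_(?P<month>0[1-9]|1[0-2])_(?P<year>(19|20)\d{2})$")

def is_valid_document_name(name):
    return DOC_NAME.match(name) is not None

def normalise_number(value):
    """Coerce numbers stored as strings ('1.234,56' or '1234.56') to float."""
    if isinstance(value, (int, float)):
        return float(value)
    s = value.strip().replace(" ", "")
    if "," in s and "." in s:      # '1.234,56': thousands dot, decimal comma
        s = s.replace(".", "").replace(",", ".")
    elif "," in s:                 # '1234,56': decimal comma only
        s = s.replace(",", ".")
    return float(s)
```

The point of the sketch is that both checks are reusable patterns rather than fixed values, so they could equally run at input time to prevent the errors rather than merely detect them afterwards.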
Table 1 - the numbers in the text that follow do not completely match those in the table. If reporting also joint totals put these in the table. And please also use consistent number formatting, including a comma separator for 1000s for readability.
The table and text following have "POC (project_unique_code)" - previously this has been abbreviated PUC.
In step 2 - "from a total of 1,487,508 entries 6183 were discarded, representing a negligible percentage of 0.41%." - if those 6183 contained 25% of the funding or reimbursement, or were the most current, say, they are definitely not negligible. Using percentages this way is valid only if each data point is equivalent to all others. Unless this is the case AND you state so the conclusion is not valid.
"… identify the customers recipients …" who are these recipients? Why are they important?
fig 2 - what is "outsources"? Who is authorising the request? Looking at the graph, reimbursement follows authorisation. Is the former counting from after authorisation is granted or do both start at the same time? What is the graph saying - it is presented without any discussion.
Ditto fig 3 - what conclusions can be drawn based on it?
****
Please run an auto-grammar check and proof-read - there are a large number of errors that would be identified quite easily.
Place "the" in front of cases such as "European Union" - several cases missing, e.g., "projects are financed by [THE] European Union", "first register a project plan to [THE] European Commission"
On the other hand, "The Figure 1 shows an overview …" - delete "the" at the start of the sentence.


Review 3 (by Axel Polleres)

(RELEVANCE TO ESWC) In general, the description of the datasets and the proposed pipeline seems to be more a technical report on how to solve/engineer the specific task than a research paper, and therefore, in my opinion, is not suitable for the research track of ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) My main point of criticism is that the contribution of the paper, the presented data processing pipeline, is heavily tailored to the specific use case, i.e., the processing of the project funding datasets.
For instance, you extract the date and type of a single document from the file name, which works fine for your use case but isn’t a generalizable approach. The same holds for any step of your process.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Neither is there an evaluation of the data processing pipeline, nor a discussion of how to evaluate/generalize/apply the solution.
An online version, source code, demo, etc. would be helpful to test and verify the submission.
(EVALUATION OF THE STATE-OF-THE-ART) Your initial data quality discussion should be based on existing reports and should provide some quantitative insight into the discussed errors. Here are some pointers to existing works on Open Data profiling and processing:
[1] Ermilov, Ivan, Sören Auer, and Claus Stadler. "User-driven semantic mapping of tabular data." Proceedings of the 9th International Conference on Semantic Systems. ACM, 2013.
[2] Ritze, Dominique, et al. "Profiling the potential of web tables for augmenting cross-domain knowledge bases." Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.
[3] Mitlöhner, Johann, et al. "Characteristics of open data csv files." Open and Big Data (OBD), International Conference on. IEEE, 2016.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As mentioned above, no online version, source code, demo, etc. is provided.
There are statistics on the number of detected errors, the invalid values, ... in the processed dataset, however, these are only of descriptive nature and no discussion of potential false positives, misclassifications, etc. is provided.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) See novelty, demonstration & correctness above.
(OVERALL SCORE) The submitted paper describes an ETL pipeline to process Romanian OGD datasets about public contracts funded by the European Union. A total of 59 documents are analysed wrt. representation inconsistencies and errors, and the steps of the data processing pipeline are detailed. Eventually, the paper discusses how the curated dataset is used in the context of fraud detection. (However, I must say that I did not understand how fraud detection is possible based on the presented plots. The author should explain or exemplify this.)
Strong points:
- bring awareness to (at first sight simple) open data quality issues, such as column ordering of datasets, format inconsistencies, ..
- motivated by real-world use case (the fraud detection application)
Weak points:
- tailored, not generalizable, solution
- missing discussion of novelty and of related/state-of-the-art literature (regarding the data processing but also the data quality)
- missing relevancy to scope of conference

Review 4 (by Judie Attard)

(RELEVANCE TO ESWC) The authors target issues in open data within an open government context with the aim of enhancing use. This topic is possibly relevant to the conference (if it focused on Linked Open Data); however, the authors fail to apply any Linked Data or semantic web technologies, and the paper does not provide any contribution.
(NOVELTY OF THE PROPOSED SOLUTION) The authors simply provide documentation on the identification of errors in open government data. They simply describe a number of steps, without providing any context, motivation, approach details, or improvement upon the state of the art.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors do not propose any solution.
(EVALUATION OF THE STATE-OF-THE-ART) The provided section on state of the art barely grazes the surface. The authors mention only one (!) publication that has a similar objective to theirs, but fail to provide any discussion on how they compare. Moreover, the authors claim the following:
"Their authors studied how the usage of this type of information helps both economic growth and confidence in public administration and minimizes corruption in public institutions, increasing transparency [2][7]."
There are also many authors who indicate that benefits of open data cannot yet be proven. The authors cite a Draft paper (2003) and a workshop paper (2012) which are certainly not the most recent or authoritative publications on the topic.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is simply described in a very abstract manner, with no details about the scientific approach or the research contributions.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) No scientific contributions are provided.
(OVERALL SCORE) This paper seems to only provide an abstract overview of notes, and therefore is not worthy of being published. There is no concrete contribution, no comprehensive description of the approach (including motivation, context, etc.), and not even a proper state of the art discussion. Moreover, there is no use of linked data or semantic web technologies.

Metareview by Hala Skaf

This submission presents a case-study on processing incoherent open government data related to the Romanian contracts funded by the EU.
The submission simply provides documentation of errors in open government data.
Reviewers agree that the submission is not relevant to the ESWC conference. The authors did not provide a rebuttal.

## Paper 5 (Research track)

Price Sharing for Streaming Data- A Novel Approach for Funding RDF Stream Processing

Author(s): Tobias Grubenmann, Daniele Dell’Aglio, Abraham Bernstein, Dmitry Moor, Sven Seuken

Full text: submitted version

Decision: reject

Abstract: RDF Stream Processing (RSP) has proposed solutions to continuously query streams of RDF data. As a result, it is today possible to create complex networks of RSP engines to process streaming data in a distributed and continuous fashion. Indeed, some approaches even allow to distribute the computation across the web. But both producing high-quality data and providing compute power to process it costs money.

The usual approach to financing data on the Web of Data today is that either some sponsor subsidizes it or the consumers are charged. In the stream setting consumers could exploit synergies and, theoretically, share the access and processing fees, should their needs overlap. But what should be the monetary contribution of each consumer when they have varying valuations of the differing outcomes?

In this article, we propose a model for price sharing in the RDF Stream Processing setting. Based on the consumers’ outcome valuations and the pricing of the raw data streams, our algorithm computes utility-maximizing prices different consumers should contribute whilst ensuring that all the participants have no incentive of manipulating the system by providing misinformation about their value, budget, or requested data stream. We show that our algorithm is able to calculate such prices in a reasonable amount of time for up to one thousand simultaneous queries.

Keywords: RDF Streaming Processing; Price Sharing; Equal-Need Sharing

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) The paper discusses a problem related to pricing in the context of rdf stream processing systems.
(NOVELTY OF THE PROPOSED SOLUTION) Even though the proposed solution is novel, it is rather straightforward, with no significant technical challenges.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The solution is correct, but could be extended to cover situations where condition 1 (section 4.4) is not satisfied.
(EVALUATION OF THE STATE-OF-THE-ART) The discussion is brief, but adequate. There is no evaluation of approaches other than the proposed one.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The discussion is easy to follow. Though, some aspects of the approach (eg, when condition 1 is not satisfied) are left for future work.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation of the proposed approach is not complete, and does not help to deeply understand the behavior of the proposed algorithm. It is not clear how the time limit parameter for each one of the queries is set. It is also not clear how varying the value, budget and time limit parameters will affect the performance of the proposed approach. Even though experiments with synthetic datasets are useful, it would be nice to report results with real datasets (at least one), as well.
(OVERALL SCORE) This paper deals with the problem of price sharing for streaming rdf systems.
This is an interesting problem, but the paper does not make significant technical contributions. Presenting a solution that would also work for cases where the consumer gets value for partial streams (condition 1) would make the paper much stronger, since these cases appear often in practice.
SP1. Interesting problem.
SP2. Simple and elegant solution.
WP1. Solution does not cover the important case of partial streams.
WP2. Non-conclusive experimental evaluation.
Update: My most important point on not covering the case of partial streams has not been addressed, and this means that the contribution of the paper is limited.

Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) The work presents a price sharing model for streaming data. Though RDF stream processing is mentioned in the title of the paper, there is nothing specific to RDF or semantic data. Thus, the paper is only weakly relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) Though price sharing is not a novel idea in economics, the application of this approach to stream processing in a cloud environment is an interesting idea. Though some assumptions of this paper are somewhat questionable, the approach is basically promising.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) An algorithm for cost sharing (or better, allocation) is presented. However, this problem seems rather to be an optimization problem with constraints (total budget). Furthermore, the algorithm isn't well presented: e.g., S should rather be the set of queries, not of indexes, and what happens to queries if the budget is exhausted?
(EVALUATION OF THE STATE-OF-THE-ART) The authors briefly discuss some cost sharing approaches. This should be extended to also consider cloud pricing models. In contrast, the discussion of RDF stream processing is actually not really necessary for this paper.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See comments below.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation described in the paper is just a run of the price sharing algorithm without any
connection to a real system or deployment. However, details of the synthetic data (apart from
"randomly generated") are not given which limits the reproducibility.
(OVERALL SCORE) The paper presents a model and algorithm for price sharing in (cloud-based) stream processing. The basic assumptions are that users have to pay for running operators and accessing stream sources for a certain time. The main idea is that users specify their expected value (utility) and the budget they are willing to spend. Based on this information, a cost distribution is calculated. In principle, this represents an interesting and promising idea. However, there are some weaknesses.
strong points:
S1: Cost/price sharing is an interesting idea for cloud-based stream processing and the paper is
among the first to try to address this problem with an economic approach.
S2: The price model and the requirements are well described and motivated.
weak points:
W1: The assumptions underlying this approach are questionable. This holds both for the
requirements (partially R2, R3) and the model (see details below and in the correctness section).
W2: Actually, the approach is based on a method described in [15]. Thus, the main contribution of
this work is to apply this model to a stream processing scenario. Furthermore, there is nothing
special related to stream processing (apart from longer running queries) or even RDF stream processing.
W3: The evaluation is limited to a runtime analysis of the algorithm based on synthetic data.
Neither the hypothesis nor the goal of these experiments is really clear. I would argue that calculation
time is only a minor issue in this approach.
* The price model doesn't reflect typical price models for cloud infrastructures: wouldn't it be more realistic to pay for resource usage and not for operators? Second, in stream processing the arrival rate of tuples is important, which is not considered here. There is a big difference between running a query where only 1 tuple per minute arrives and a query with thousands of tuples per millisecond.
* There are some concerns with the requirements (R3, R2) and the corresponding properties (sect. 4.4): though the motivation for R2 and R3 is clear, the explanations (what "misinformation" means, getting higher utility by outside computation) are unclear. Why is the algorithm "ignorant" to budget and value? These parameters are used in Alg. 1?! For R3: what is wrong with a query where the platform is used to prepare some data and perform the compute-intensive learning step outside? How could this be forbidden?
In summary, the work seems to be in a rather premature state. The authors should revise their assumptions.
Furthermore, the paper is probably better suited for a cloud conference than a semantic web venue.
After rebuttal:
I thank the authors for their response, but my concerns still hold. Therefore, I do not wish to change the scores in my review.

Review 3 (by Pieter Colpaert)

(RELEVANCE TO ESWC) The Semantic Web puts forward tools for decentralizing data tasks, yet this decentralization comes at a cost for which the business model is unclear. This paper puts forward a price sharing algorithm for funding RDF stream processing. It might be one of the highlights of ESWC2018.
(NOVELTY OF THE PROPOSED SOLUTION) Exciting work putting forward the first step into research for new data-driven business models. Exactly what the community needed.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Q1: Is it correct that I seem to miss information about how this algorithm was implemented, and on what machines the evaluation was executed?
This evaluation, which I deem less important than the intriguing ideas behind the paper, seems to lack implementation and design details.
(EVALUATION OF THE STATE-OF-THE-ART) The Linked Data Fragments framework, explained in [22], was also created from the observation that publishers would be paying too much for hosting a public Web API with too many functionalities, while a much simpler and more cost-efficient interface could be thought of, still offering user agents on the Web good access to the data. Furthermore, it would allow other intermediary agents to perform some actions for other parties. This paper could in fact automate this negotiation of costs.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper is well written and has nice examples to illustrate the proposed approach.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Q2: It’s clear that stream processing is a special resident in the Web of Data, as explained in your Introduction, yet can your approach also be applied to classic webby datasets?
The research datasets are not available. The algorithm is well described in a code snippet, yet the code itself is not available either.
Q3: Could this data become available?
(OVERALL SCORE) Based on three economic requirements, an algorithm to allocate runtimes and calculate the total prices is put forward. The requirements are proven to be implemented. In chapter 5, Evaluation, the processing time needed to run the algorithm is evaluated.
Strong points:
* Introducing economics in Semantic Web
* First step into overlooked issue of business model for Web of Data
* Algorithm is built on the basis of clear requirements
Weak points:
* No research data available
* Evaluation is unclear
* Not sure why there is a strong focus on stream processing
3 Questions see earlier.
Typo: Ontoloty in reference 6
After rebuttal:
I do not wish to change the scores on my review. I believe the contribution is indeed limited, as mentioned by other reviewers, but unique and novel in its kind and may lead to interesting discussions.

Review 4 (by anonymous reviewer)

(RELEVANCE TO ESWC) Very relevant
(NOVELTY OF THE PROPOSED SOLUTION) Interesting concept which has been tackled before
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The presentation is clear. The evaluation provides a level of validation.
(EVALUATION OF THE STATE-OF-THE-ART) Quality of Service approaches from stream processing would have relevance here, at the very least to show a GAP in their efforts.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) An evaluation of the approach is provided with a discussion on the limitations of the approach.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The description of the experiment is OK.  No code or datasets released.
(OVERALL SCORE) This work tackles the problem of sharing access and processing fees for a set of RDF stream processing graphs with overlapping operators and sources among multiple consumers. The paper proposes a model that uses a joint execution plan, which is based on the collected queries of all data consumers. It combines this query execution plan with the outcome valuations, runtime limits and willingness to pay values from all data consumers, the pricing of the raw data streams, and the pricing of the computations to determine a utility-maximizing payment distribution. The approach is evaluated to determine the run-time performance.
Strong points:
- Equal need cost sharing: Their price sharing algorithm follows an equal need cost sharing method for operators/sources that overlap between multiple queries, where the assigned price share for the whole query can never be higher than the price of running the query in isolation.
- Maximized Utility: They allow the user to provide a utility value for running the query for a specific period of time, a runtime limit, and a total willingness to pay value. The provided parameters are used by the price sharing algorithm to allocate runtime and charge payments as long as the assigned price is smaller than or equal to the consumer's value, thus maximizing the consumer’s utility.
- Illegitimate Utility Gain: Assigning price shares to queries is ignorant of any value, runtime limits, or budget of consumers; hence the consumers cannot benefit from manipulation by misinformation.
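The equal-need sharing summarised in the strong points above could be sketched roughly as follows. This is my own illustrative reconstruction under stated assumptions (equal split of each shared operator's cost, iterative dropping of queries whose share exceeds their valuation), not the authors' algorithm:

```python
# Hypothetical sketch: each operator/source cost is split equally among the
# active queries using it; a query whose share exceeds its consumer's value
# is dropped, and the shares of the remaining queries are recomputed.
def equal_need_shares(queries, op_cost, value):
    """queries: {qid: set of operator ids}; op_cost: {op: price};
    value: {qid: consumer valuation}. Returns accepted {qid: price share}."""
    active = set(queries)
    while active:
        shares = {}
        for q in active:
            shares[q] = sum(
                op_cost[op] / sum(1 for r in active if op in queries[r])
                for op in queries[q]
            )
        dropped = {q for q in active if shares[q] > value[q]}
        if not dropped:
            return shares
        # dropping a query raises the shares of the remaining ones, so iterate
        active -= dropped
    return {}
```

In this toy version a query's share can never exceed the cost of running it in isolation, since it is a sum of at-most-full operator costs; the cascade of drops illustrates why, with multiple users per query, simply dropping a query is problematic.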
Weak points
- Multiple users per query: Their algorithm for price distribution calculates the price shares based on the different queries in the model, assuming a single user per query, which may not be the case in a real-world scenario. However, they mention including a special kind of license attached to the streaming result of the query which prohibits the redistribution of the results outside the legal entity (a person or a company); this defeats the concept of contention ratio, so the number of users should be part of the price sharing model.
- User Negotiation: The price sharing algorithm charges payments as long as the assigned price is smaller than or equal to the consumer's value, which is beneficial for maximizing the consumer’s utility. However, if the calculated price share is greater than the consumer's value, the algorithm drops the whole query and recalculates the price shares for the remaining queries. (1) Perhaps they should consider some sort of negotiation with the user before not providing the service at all (future work), and (2) in a model with multiple users per query, the query cannot be dropped.
- Scalability: Their evaluation shows that the solution is limited in scalability when tens of thousands of queries are simultaneously involved which they acknowledge in their conclusion section.
- Related work:  QoS for Stream Processing should be covered.
Questions:
Have you considered the effect of any common runtime stream processing optimizations e.g. operator replication, operator reordering, etc., operator failures or not meeting consumer QoS constraints on price sharing?
After rebuttal:
My score remains the same. If accepted, I would encourage the authors to acknowledge the limitations highlighted by the reviewers as future research opportunities.

Metareview by Intizar Ali

This paper presents a price sharing model for streaming data which considers cost and resource sharing among multiple data consumers. Pricing models are well addressed for cloud-based and services domain solutions. However, the proposed study is one of the initial efforts to introduce a pricing model for the Web of data. A well-established pricing model will be key to creating a self-sustaining economy for the Web of data, and there is no doubt that this study can bring very interesting discussion at ESWC. However, there are strong reservations regarding the novelty of the approach, with limited contribution beyond the state of the art. Also, one of the reviewers rightly raised a few concerns related to the assumptions and the model itself, which need careful revision. The paper in its current state has too limited a contribution to be accepted as a full research paper, but a careful revision of the assumptions, the pricing model, and its evaluation can certainly have a good impact for the sustainability of the Web of data. We strongly encourage the authors to submit a revised version of their paper in the upcoming editions of semantic web related conferences.

## Paper 227 (Research track)

Dynamic Tailoring of RETE Networks in Incremental Scenarios

Author(s): William Van Woensel, Syed Sibte Raza Abidi

Full text: submitted version

Abstract: Decision support systems, with production rule systems at their core, have an opportunity to leverage the embedded semantics of semantic, ontology-based data to improve decision support accuracy. Advances in mobile hardware are enabling these rule-based systems to be deployed on mobile, ubiquitous platforms. By deploying reasoning processes locally, time-sensitive tasks are no longer influenced by network conditions, less bandwidth is wasted, and less remote (costly) resources are needed. Despite hardware advances however, recent benchmarks found that, when directly re-using existing (PC- or server-based) technologies, the scalability of reasoning on mobile platforms is greatly limited. To realize efficient semantic reasoning on resource-constrained platforms, utilizing rule-based axiomatizations of ontology semantics (e.g., OWL 2 RL), which are known to trade expressivity for scalability, is a useful first step. Furthermore, the highly dynamic nature of mobile and ubiquitous settings, where data is typically encountered on-the-fly, requires special consideration. We propose a tailored version of the RETE algorithm, the mainstay algorithm for production rule systems. This algorithm dynamically adapts RETE networks based on the evolving relevance of rules, with the goal of reducing their resource consumption. We perform an evaluation of semantic reasoning using our custom algorithm and an OWL2 RL ruleset, both on the PC and mobile platform.

Keywords: RETE; OWL2 RL; rule-based reasoning; OWL reasoning; reasoning optimization

Decision: reject

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) The article depicts optimizations of the Rete algorithm for forward inference in rule based systems, and the approach is empirically evaluated over the OWL 2 RL Benchmark Corpus.
Both seem relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) According to the authors themselves, the optimizations are similar to [Doorenbos 95] (and to some extent to [Miranker 87]), and a detailed comparison is missing to appreciate the novelty of the approach.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Completeness is claimed to be preserved, but no justification is given for the case of fact retraction (see the full review below).
(EVALUATION OF THE STATE-OF-THE-ART) Pointers to the literature are relevant, but the analysis of the state-of-the-art needs to be more detailed.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The optimizations are very clearly described and illustrated.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Thanks to a precise description of the algorithm, and to the publication of the ruleset and data, the evaluation should be reproducible.
The main weakness of the evaluation is the chosen baselines (see the full review below).
(OVERALL SCORE) The article proposes a series of optimizations of the Rete algorithm, one of the most popular algorithms for forward inference in rule based systems.
The optimizations are meant to target resource-constrained platforms (such as mobile phones), and incremental reasoning, i.e. scenarios where additional facts (or fact retractions) may be dynamically fed to the inference engine.
The optimizations prevent attempting joins which will necessarily fail, because the memory of one of the join arguments is empty.
This is performed by "unlinking" alpha nodes from (the Rete graph of) certain rules, or further "pausing" them if they have been unlinked from all rules in which they appear.
A triggering mechanism allows relinking/reactivating alpha nodes once the join operations have a chance to succeed again.
As a side effect, if a paused alpha node never resumes execution, unnecessary matches will be avoided.
A detailed description of the whole unlinking/pausing/resuming mechanism is provided, including an algorithm, called Rete_tailor.
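The unlinking/pausing mechanism summarized above can be illustrated with a toy sketch (the names, data, and class structure are my own illustrative assumptions, not the paper's actual Rete_tailor implementation): a join node returns immediately while either of its input memories is empty, and a paused alpha node skips matching altogether.

```python
# Toy illustration of unlinking and pausing in a RETE-style network.
# Hypothetical structure, not the Rete_tailor algorithm from the paper.

class AlphaNode:
    def __init__(self, pattern):
        self.pattern = pattern   # (attribute, value) this node matches
        self.memory = []         # facts matched so far
        self.paused = False      # paused nodes skip matching entirely

    def activate(self, fact):
        if self.paused:
            return False
        attr, value = self.pattern
        if fact.get(attr) == value:
            self.memory.append(fact)
            return True
        return False

class JoinNode:
    def __init__(self, left, right, key):
        self.left, self.right, self.key = left, right, key
        self.attempts = 0        # count join attempts to show savings

    def matches(self):
        # "Unlinking": if either memory is empty the join cannot succeed,
        # so return without iterating (no attempts are counted).
        if not self.left.memory or not self.right.memory:
            return []
        out = []
        for l in self.left.memory:
            for r in self.right.memory:
                self.attempts += 1
                if l[self.key] == r[self.key]:
                    out.append((l, r))
        return out

persons = AlphaNode(("type", "person"))
orders = AlphaNode(("type", "order"))
join = JoinNode(persons, orders, "id")

persons.activate({"type": "person", "id": 1})
m1 = join.matches()          # orders memory empty: no attempt made

orders.activate({"type": "order", "id": 1})
m2 = join.matches()          # one attempt, one match

persons.paused = True
added = persons.activate({"type": "person", "id": 2})  # False: node is paused
```

The actual algorithm additionally relinks and resumes nodes via a triggering mechanism, which the toy above omits.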
The strategy is then empirically evaluated over the OWL 2 RL Benchmark Corpus, together with a partially handcrafted ruleset, available online, in addition to helpful links and samples.
The two baselines are a naive implementation of the Rete algorithm, and a simple a priori tailoring strategy, which consists in running the naive Rete over a fragment of the dataset, discarding unused rules, and then performing (incomplete) inference over the remainder of the dataset.
The article is particularly well written, clear, and largely self-contained.
The more technical aspects are introduced progressively, and illustrated with examples.
A precise description of the algorithm, together with the publication of the dataset, make the experiment reproducible.
Possible limitations (such as the need for storing tokens in memory for paused alpha nodes) are explicitly pointed out, as well as similarities to optimizations proposed in the literature.
My main reservations are:
1) An insufficient positioning/analysis of the contribution.
Rete-based forward inference is a relatively mature technology.
A number of improvements of the naive algorithm have already been proposed/evaluated, and cost models have even been designed (see for instance Wright et al. 2003 below).
In this context, a finer-grained (theoretical and/or empirical) comparison of Rete_tailor with existing optimizations is expected.
In particular, the parallels drawn in Section 5 are too high-level to appreciate the novelty of the approach.
It is argued to be well-suited for resource-bounded scenarios and incremental reasoning, but the article does not explain why previous optimizations of the Rete algorithm would not be.
2) The baselines chosen for the empirical evaluation are arguably not appropriate.
Comparing Rete_tailor to the naive Rete algorithm is not very informative, as the latter has long been replaced in practice by optimized versions.
As for the other baseline, it does not guarantee completeness.
So the performance comparison (number of join/match actions) is not only unfair to Rete_tailor, but also somewhat irrelevant, as both systems may not yield the same output.
Instead, one would expect a comparison with some of the optimizations proposed for instance in [Doorenbos 95], with which Rete_tailor has a lot in common, according to the authors.
Other concerns are the following:
3) Rete-based fact retraction is mentioned several times, but never (even briefly) described.
Retracting a fact is arguably a key aspect of incremental reasoning, which is the application scenario targeted by the authors.
To my (limited) knowledge, Rete-based retraction consists in applying the inference mechanism in a reverted fashion.
But it is unclear how the proposed optimizations behave in scenarios where facts are both added and retracted.
In particular, can completeness be affected by the interaction between asserting/retracting a fact and pausing/resuming alpha nodes?
The answer may be obvious, but a short justification would help.
4) Although OWL 2 RL can express disjointness, the assumption seems to be made that the input dataset (data + ruleset) is consistent.
Maybe this could be explicitly said.
### Questions ###
- Page 5, Section 3.2, definition of h.1: "and \alpha memories in case i <= 2; Fig. 1".
I could not find an "i" in Figure 1.
Is it the index of the alpha node?
- Section 5: How does Rete_tailor compare to [Özacar et al. 2007]?
### Suggestions ###
- Page 7, Section 4.1.4 7, T.iii: "to avoid redundant join and match operation": "redundant" is possibly misleading.
Intuitively, redundancy means matching twice the same token (at different times), or joining twice over the same pair of tokens.
I suggest something more explicit, like "avoid necessarily failing join/match attempts".
- Page 7, Section 3.4: "Code 1" -> "Algorithm 1"
- Page 8, second paragraph: "previously deemed redundant". Same remark as above for "redundant".
- Section 5 helps understanding the intended application context, as well as the specificity of the approach, in particular to what extent the proposed work differs from [Doorenbos 95].
Therefore it may appear earlier in the article.
- Bibliography: at least one of the articles of Forgy (author of the original Rete algorithm) should be mentioned (e.g. the one referenced below).
### Typos ###
- Figure 1: t3 is used instead of t2
- Page 4, 2nd paragraph: "Afterwards, the tailored". Part of the sentence is missing.
- Page 7, 3rd paragraph: "is longer empty" -> "is no longer empty"
- Page 7, Section 4.1.4 7: "took place the original dataset". Missing preposition.
- Page 7, case "t_x > t_y": "to beta node \beta_1(network1 > network2)".
Seems like what is meant here is "(network1 > network3)", as network2 is never reached in the case t_x > t_y.
- Page 12: avoid splitting Table 2 over 2 pages.
### References ###
Forgy, Charles L. "Rete: A fast algorithm for the many pattern/many object pattern match problem." Readings in Artificial Intelligence and Databases. 1988. 547-559.
Wright, Ian, and James AR Marshall. "The execution kernel of RC++: RETE*, a faster RETE with TREAT as a special case." Int. J. Intell. Games & Simulation 2.1 (2003): 36-48.
---
After rebuttal
Thanks to the authors for their response letter.
1) "one requires a comparison to the baseline system": fair enough. A comparison to the standard Rete seems indeed relevant. My point was that it is not sufficient.
2) "An explicit comparison with the optimizations from Doorenbos [18] seems unsuitable, since most of these are included in our work": I have to disagree with this statement. I was precisely expecting a comparison between optimizations from Doorenbos [18] (at least the ones implemented by the authors) and the Rete_tailor algorithm. Currently, the experiments do not specifically evaluate the new optimizations proposed in the article (namely pausing, and the lower-failed alpha node heuristic).
3) I also still disagree with the relevance of the other baseline (a priori tailoring).
The evaluation compares performances (token matches, join attempts, reasoning time) of the different algorithms.
Therefore they should produce the same output.
To take an analogy, imagine comparing query evaluation times of two SQL engines.
This is arguably pointless if one of them returns incorrect or incomplete answers.
Put another way, one can only expect a qualitative comparison of Rete_tailor and a priori tailoring (number of inferred triples, ...).
4) "other approaches with the same goal" and "the most relevant work in the field":
I may be missing something obvious here (apologies if I am).
Which "field" exactly are we talking about?
If I understand correctly, the field is "Rete-based forward chaining in incremental scenarios", as opposed to "Rete-based forward chaining in non-incremental scenarios".
And according to the authors, the most relevant work in the former field is Tai [5].
But what exactly makes the incremental case different?
Section 3.3 paragraph 2 is not sufficient for me to understand it.
More exactly:
- Let us first assume that no fact is retracted. Which optimizations are better-suited to the incremental case, and why (is this due to additional parallelization opportunities in the non-incremental case)?
- Now let us assume that facts can be retracted. Are there additional differences between incremental and non-incremental?
And if so, why is the literature about deletion (in the non-incremental case) not relevant?

Review 2 (by Wouter Beek)

(RELEVANCE TO ESWC) While reasoning is an important use case on the Semantic Web, I am not sure whether the adaptation of reasoners for resource-constrained environments is specifically relevant.  The desktop and mobile systems that are used in the evaluation are already almost comparable in hardware specs.  There may be only a very brief window (two to five years) in which mobile and desktop devices have access to different computational resources.  It certainly would have helped to have at least some discussion of the mid- to long-term utility of this work in the paper.  I can imagine some additional long-term applications in an IoT context, but the paper does not mention these at all.
(NOVELTY OF THE PROPOSED SOLUTION) The proposed approach does include non-trivial extensions of an existing approach and of existing algorithms.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach explicitly preserves the completeness of entailment results (as opposed to a specific existing approach), and completeness is also evaluated with respect to multiple ontologies.
(EVALUATION OF THE STATE-OF-THE-ART) Does Table 1 show the total number of unlinking/relinking operations over all ontologies?  Is this not a very small number / does this even matter when compared to the total number of operations?  In the text the general applicability is explained to be larger than these numbers suggest: “it is likely that [real-world scenarios] will require many more reverting operations.”  But is this really the case?  Since real-world data seems readily available on the Semantic Web, it should be possible to determine the validity of this supposition.
I am not 100% sure whether I understand the outcomes reported in Table 2.  The main outcome seems to be that the here presented approach has performance comparable to the existing approach of Tai et al., while preserving completeness.  However, the improvement in performance of both approaches compared to the default case seems to be small/negligible.  It is a bit unfortunate that Table 2, which is already somewhat difficult to read/interpret, is split across two pages.
Since the resume heuristic requires an external store, I was expecting the size of this store to be quantified as part of the evaluation.  After all, is the performance benefit in the mobile scenario not (partially) achieved by offloading the mobile process' storage onto another process / the triple store?  If the latter is the case (that is at least how I understood the setup), then it would be useful to also know the memory and CPU consumption of the triple store.
Since the difference between mobile and desktop environments is crucial to the main research question of this paper, it is unfortunate that "average performance times for PC and mobile are not directly comparable".
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The comparison to other approaches and implementation in the Related Work section is very good.  However, I am missing a more high-level discussion/reflecting on the utility of the here presented research.  I am prepared to accept use cases for mobile reasoners, but to what extent do they really have restricted access to computing resources (when compared to desktop systems)?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I did not find a link to the published Open Source code of the here presented implementation, but it is probably not difficult for the authors to add such a link.  Since the data and rulesets are shared online, through a link that is given as a literature reference (which is a bit unusual), availability of the code would enable reproducing the here presented evaluation results.
*After the rebuttal*: Since no link to an Open Source version of the code was shared, the reproducibility of the here presented work is more difficult than originally assessed.
(OVERALL SCORE) This paper extends the existing RETE algorithm with a heuristic-driven approach that results in less memory and/or CPU consumption, while preserving the completeness of reasoning over an important OWL fragment.  The reduction in hardware resources specifically applies to resource-constrained environments like mobile phones, where a process has limited access to RAM.
Strong points:
- The invention of two intelligent extensions that could positively impact performance of the RETE algorithm without sacrificing its completeness in monotonically increasing databases.
- A good Related Work section, where alternative approaches are explained.
- The paper is written very well and is nice to read.  I could find almost no mistakes (not even minor ones).
Weak points:
- The utility of the paper currently relies on a property of the hardware market (desktop versus mobile) that may no longer exist within a couple of years.  (The utility is at least not substantiated, e.g., by referring to discussions in the hardware and/or mobile research communities.)
- The evaluation does not deliver the numbers that seem to be most interesting: a direct comparison between desktop and mobile.
- The evaluation results -- assuming I am interpreting them correctly -- seem to indicate that the proposed approach only marginally affects performance.  Yet the text does not draw this conclusion, but sometimes refers to different (real-world) datasets for which the proposed approach would have / could have had different results.
----
Small grammar errors:
- Ungrammatical sentence: “Afterwards, the tailored ruleset for future assertions.”
- p6. “This [is] realized”
- p10. Ungrammatical phrase: “A reasoning cycle took place the initial dataset”
- p11. “seem[s] negligible”
---
*After rebuttal*
Thanks to the authors for their rebuttal.  Unfortunately, my worries about the main purpose of this paper have not been cleared.  The authors say that “Although desktop and mobile systems are comparable in hardware specs, there are huge differences in performance.”  The paper gives detailed information about the former (in Section 4.1.3), but never quantifies the -- apparently more important -- differences in performance.  Based on the details in Section 4.1.3, it is not clear where the huge differences in performance originate.  They must be due to properties not documented in the paper (e.g., more aggressive power conservation strategies on mobile).  If these properties are the most important differentiator and motivator for this research, then they should be documented (in addition to, or instead of, the less important hardware specs).
Also: “Regardless of improvements in technology, the processing and memory capability of the latter will be smaller than those of the former.”  But this is already not the case according to the information the paper gives us: a dual-core 2.9GHz processor (desktop) does not have a larger processing capability than a quad-core 2.2GHz processor (mobile), unless properties not documented in this paper are in place (see above).

Review 3 (by Brian Davis)

(RELEVANCE TO ESWC) The paper is extremely relevant to the Reasoning track under:
optimisation, approximation, or evaluations of web reasoning algorithms, i.e., procedures that take ontologies (usually in RDF, RDFS, OWL, RIF) as input and test entailments or answer queries.
(NOVELTY OF THE PROPOSED SOLUTION) The paper presents an advancement of the RETE pattern-matching algorithm for implementing production rule systems, which dynamically adapts RETE networks at run time and aims to reduce consumption of computational resources. The dynamic tailoring involves unlinking alpha memories from the network, as well as pausing alpha nodes so that they will no longer be matched by incoming tokens. The RETE tailor algorithm seeks to guarantee completeness in incremental scenarios by relinking and resuming at runtime when required. The delta appears to be that the authors extend null left activation with a second heuristic - lower failed alpha nodes - as well as adding a second operation - pausing - which avoids redundant match operations.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The algorithm is very well described and is clearly differentiated against the state of the art.  The experiment setup is well described, including ontologies, rulesets used and benchmark configurations and metrics.
(EVALUATION OF THE STATE-OF-THE-ART) The authors describe the state of the art quite thoroughly and assert their  contribution relative to the related work.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The RETE Tailor algorithm is very well described and is clearly differentiated against the state of the art.  The experiment setup is well described, including ontologies, rulesets used and benchmark configurations and metrics.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments appear to be well documented, referencing the algorithms used in the benchmark configuration and the datasets used, in addition to a link to online documentation: see https://niche.cs.dal.ca/materials/rete-tailor/
(OVERALL SCORE) The paper presents an advancement of the RETE pattern-matching algorithm for implementing production rule systems, which dynamically adapts RETE networks at run time and aims to reduce consumption of computational resources. The dynamic tailoring involves unlinking alpha memories from the network, as well as pausing alpha nodes so that they will no longer be matched by incoming tokens. The RETE tailor algorithm seeks to guarantee completeness in incremental scenarios by relinking and resuming at runtime when required. The delta appears to be that the authors extend null left activation with a second heuristic - lower failed alpha nodes - as well as adding a second operation - pausing - which avoids redundant match operations.
Strong Points
Mostly very well written
Algorithm and experiments well described.
Good knowledge of related work
Weak Points
Section 3 is somewhat heavy to read, and it could have been structured and broken down more clearly and succinctly, as opposed to what appears to be a lengthy discussion in parts.
A minor issue is the frequent use of "Further" in Section 4.2; I don't think it is an adverb there but rather an adjective.


Metareview by Jeff Z. Pan

The paper addresses the topic of resource aware incremental reasoning by dynamically adapting the RETE networks at runtime to reduce resource consumption. The reviewers agree that this line of research is interesting and is related to the Semantic Web. However, they also identified important weaknesses which led to a reduced value of the current version of this work:
(W1) An insufficient positioning/analysis of the contribution against related work, including the Delete and Rederive approaches and incremental reasoning approaches for OWL RL and OWL EL;
(W2) Section 3 is somewhat heavy to read;
(W3) The utility of the paper currently relies on a property of the hardware market (desktop versus mobile) that may no longer exist within a couple of years: more thorough motivation (supported by evidence) should be provided.
The paper was suggested as a candidate for conditional acceptance. Unfortunately, it didn't manage to stay above the threshold in the final round of decision making. We hope the comments here are useful for a future submission of this work.

## Paper 226 (Resources track)

Build a corpus of scientific articles with semantic representation

Author(s): Jean-Claude Moissinac

Full text: submitted version

Abstract: As part of the SemBib project, we undertook a semantic representation of the scientific production of Telecom Paristech. Beyond the internal objectives, this enriched corpus is a source of experimentation and a teaching resource. This work is based on the use of text mining methods to build graphs of knowledge, and then on the production of analyses from these graphs. The main proposal is the disjoint graph production methodology, with clearly identified roles, to allow for differentiated uses, and in particular the comparison between graph production and exploitation methods. This article is above all a methodological proposition for the organization of semantic representation of publications, relying on methods of text mining. The proposed method facilitates progressive enrichment approaches to representations with evaluation possibilities at each step.

Keywords: semantic publishing; publication; Linked Data; SPARQL

Decision: reject

Review 1 (by Giuseppe Rizzo)

This paper addresses the need to turn the ParisTech library into a richer archive of data and relations among authors and papers, and into a better indexed and thus more searchable archive. This pain point extends to the library of any university or organization, so this objective is certainly of help to the community and society as a whole.
The approach presented aims to apply basic text mining notions (identified in the paper as TF-IDF -- only) to the generation of a knowledge base of scientific concepts and publications (scientific papers), where concepts and publications have explicit relations. Thus, the approach presented in this paper is of interest for the Resources track as defined in the requirements of the call for papers.
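For reference, TF-IDF, the only text-mining notion the paper names, can be sketched in a few lines (a generic textbook formulation with toy data, not the authors' actual SemBib pipeline):

```python
# Minimal TF-IDF sketch: weight each term by its in-document frequency
# times the log-inverse of its document frequency across the corpus.
import math
from collections import Counter

def tf_idf(docs):
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

corpus = [["semantic", "graph", "graph"],
          ["semantic", "publication"],
          ["graph", "mining"]]
scores = tf_idf(corpus)
# "semantic" appears in 2 of 3 docs; "mining" in only 1, so it scores higher.
```

Terms that appear in many documents (like "semantic" here) are discounted, which is why TF-IDF alone is a rather shallow basis for the concept graph the paper builds.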
However, the paper lacks the necessary formalism and depth in describing core concepts (see below) and in presenting the knowledge base.
The paper would also benefit from proofreading, as numerous paragraphs are hard to understand (I omit the long list of typos, but I recommend checking plurals such as "different strategy", and the numerous sentences ending with three dots, which left me doubting what the writers intended and what readers will take away).
Strengths
- idea and impact to scientists and society
- availability of the resource (with a big question mark see below)
Weaknesses
- the portal http://givingsense.eu/sembib/ is in French, which introduces an access barrier to the reuse of the data instances indexed by the portal. Furthermore, from the portal I cannot find instructions on how to access the instances. There is a "technical section" at http://givingsense.eu/sembib/state.php , but it remains unclear how to traverse the graph. Despite the reported technicalities, the most basic means of allowing access to and reuse of the data instances is lacking
- the approach is halfway between a research paper and a resource paper, because it describes with insufficient depth both the text mining approaches used to create the KB (a few mentions here and there, such as TF-IDF) and the KB as a data model and instances
- the value chain of the approach is based on a traditional process of data acquisition -> corpus creation -> data publishing. However, Sec 2.5 "Gather docs and perform basic treatments" (I reckon "basic processing" was meant), which should describe the first two stages of the value chain, does not answer the question "how has this been done?". Then, the authors mention that the ontology used for modeling data points is based on a set of SPAR ontologies. Which ones? In principle I can indeed agree with the choice; however, the paper should give the rationale, and possibly a comparison, for why those ontologies in this context (a citation could already be enough)
- the authors mention that the approach is grounded in the decoupling of concepts vs. publications, and this part is underlined as important for the creation of a "ground truth". Which ground truth, by the way? The choice of decoupling, assuming it is crucial, needs to be described comprehensively if it is to be given such importance
- the other stage of the value chain is publishing, which should ease exploration. This part cannot be assessed (see above). The authors mention that they made a technology choice for deploying a SPARQL endpoint. This choice is merely introduced, not elaborated, failing to give scientists the necessary input to understand the motivations. Imagine: if I wanted to replicate or extend this approach, how could I know which technology is more efficient, better performing, ... (and we could list many more evaluation dimensions here)?
- a wrap-up section on the achievements based on SemBib is presented. This section should be expanded to show the actual benefit of the approach, ideally using a proper scientific validation (hypothesis formulation, experimental setup definition, KPI derivation and quantification)
======= AFTER REBUTTAL ========
Thanks for answering. However, my points concerning the how (3rd, 4th, 5th) haven't been answered. I have to confirm my previous mark, as I don't have the necessary information to judge the relevance of this approach and thus of the resource that is generated.

Review 2 (by Cristina Sarasua)

This paper presents a data set containing RDF metadata about scientific articles by Telecom ParisTech. The author reused state-of-the-art ontologies from the bibliographic domain (e.g. Fabio and dcterms), used state-of-the-art tools to mine the text of publications to obtain their keywords, and enabled a SPARQL endpoint. The author claims in the abstract that the contribution of the paper is the organization of publication metadata in different graphs.
My main concern with this submission is that it provides a very limited contribution and it ignores the related work done in the field of scientific RDF data publishing. Even if the submission is a Resources submission, a data set submission in this track should somehow contribute to the state of the art. One way to do that is to provide data that can be used for benchmarking methods (and hence, includes an exhaustive set of test cases manually reviewed by experts). From the description provided in the text and the content visible from the SPARQL endpoint, it does not look like this data set provides a solid benchmarking data set, and it is not a data set that is supposed to help validating new methods either. While having a new data set about publications is in general appreciated, the submission does not provide a novel research contribution, and it looks like an engineering effort. Moreover, the submission provides a data set that ignores some Semantic Web publishing practices, and it lacks proper documentation (see more details below).
I recommend the author to work in the direction of new methods to improve the annotation and the interlinking of scientific publications and submit a novel contribution to a Semantic Web venue in the future.
**Positive aspects**
- It adds RDF data into the Web of Data.
- The author reused some existing ontologies, such as Fabio and dcterms.
- The data set contains a concept graph with the papers’ keywords.
** Negative aspects**
- The contribution of this paper does not have novelty in terms of research. Many data sets with scientific articles have been published in the past (see the publications colour in the LOD diagram http://lod-cloud.net/versions/2017-08-22/lod.svg), and the submission does not refer to any of the existing data sets.
- The submission lacks specificity. For example, when the author describes the representation of documents, the author says that “a semantic representation of the metadata has been realised by relying on SPAR ontologies (bibo, cito) and ontologies commonly used for documents (Dublin Core, schema, foaf …);”, without providing a comprehensive description of the exact parts of the ontologies used, nor providing examples of representative resources. A graphical representation or a Turtle/N3 description of the RDF resources that the data set contains would help the reader have a clear understanding of the exact data published. Moreover, the author mentioned that data is exposed via a SPARQL endpoint, but the text does not provide the URL of the endpoint. Guessing that it would end in “*/sparql” I found it on the Web. However, the author should be aware of the fact that a Resources submission should indicate such details (see the call for papers at ESWC 2018 https://2018.eswc-conferences.org/resources-track/ and last year’s analogous call and program in ISWC https://iswc2017.semanticweb.org/calls/call-for-resources-track-papers/ ).
- The work presented seems to be incomplete, as the text mentions that they started to evaluate methods for learning word vectors, and they are working on enrichment.
- The submission does not fulfil the requirements of a high quality submission in the Resources track.
** As mentioned before, the data set does not break new ground, it does not advance the state of the art, and it does not look like it will have impact in improving the adoption of Semantic Web technologies — the publications domain is quite well researched, and (despite the challenge of convincing specific organisations to do so), more and more librarians and archivists are involved in standard Semantic Web metadata publishing.
** The Web site of the project is exclusively in French. Considering that it will have an international scientific audience, English would be needed.
** The author says in the conclusions and outlook section that “An important next step is the RDF documentation of this dates with DCAT, and then the publication of this data”. I assume the data is published (since it is queryable via the SPARQL endpoint), and again, the DCAT description is something the author should have provided as for the submission (see also https://2018.eswc-conferences.org/resources-track/ ). The author should make the data FAIR (see also https://www.nature.com/articles/sdata201618).
** It is very difficult to assess whether the resource will be useful for a wider audience, since the author does not indicate the extent to which the data set intersects with other publication data sets.
** The technical quality of the resource should be revised, applying best practices in Semantic Web publishing. See suggestions below to (i) interlink the instances to other data sets and (ii) revise the classification information for e.g. ResearchPapers
** The author does not provide enough descriptive statistics about the data set in the submission. For example what is the AVG and STDEV of the number of keywords per publication in the data set? The RDF description of the paper I looked up, for example, does not have any keyword in the RDF response obtained in the SPARQL endpoint, while the publication does contain keywords (https://eprint.iacr.org/2013/303). It is unclear if that is due to multiple named graphs or because the data is not there. It would also be advisable to showcase queries that use the various graphs mentioned in the text.
** There is no information about the license of the data: it is mentioned neither in the submission nor on the main Web site (http://givingsense.eu/sembib/sparql/ or http://givingsense.eu/sembib/).
** The author does not provide any entry of the data set in Zenodo, GitHub, DataHub, Figshare etc.
** There is no sustainability plan specified. The author only lists technical tasks that they are currently working on (section 3.2) but does not mention how the maintenance of the data set will be carried out.
- The semantic forms developed, to display the resource descriptions in a more human-friendly way (e.g. http://givingsense.eu/sembib/onto/persons/David_Bertrand), are not a novel contribution either. There have been plenty of projects developing user interfaces for RDF data (e.g. all the work around semantic wikis and semantic portals).
- The conclusions indicate that the paper describes “a general methodological approach for testing different approaches to the description of bibliographic entities by association with concepts”, while the content of the paper mainly refers to the engineering process of preparing metadata about bibliographic entries, and the text indeed lacks details about those processes.
- The quality of the writing should be improved:
(i) The author should revise their text with an English native speaker.
(ii) In some parts, the text does not look rigorous enough for a scientific publication (e.g. when the author lists the achievements based on SemBib, when the author mentions “the idea is to use external graphs”, in sentences like “we can try different strategy to associate..” and “Important progress remains to be made to improve the exploitation of semantic graphs. their adoption, especially on the Web, is still very limited compared to the potential of these representations, especially driven by thematic operations- music, events ... - carried by major search engines (cf schema.org).”)
(iii) The text contains explanations about Semantic Web standards or principles that are unnecessary for a Semantic Web audience and consume valuable space.
**Suggestions**
** I recommend the author to translate the explanations offered in their Web site (http://givingsense.eu/) from French to English, to provide the information in both languages.
** I recommend the author to link the resources in this data set to other external data sets. For example, the author could have also already linked the instances via owl:sameAs links to resources in other existing bibliographic data sets such as DBLP. If this (SemBib) data set contains keywords that other data sets do not have, then this new data set would be enriching the description of the article.
** I went to the SPARQL endpoint http://givingsense.eu/sembib/sparql/
** I executed the following query to look up the description of a ResearchPaper: describe <http://givingsense.eu/sembib/onto/tpt/biblio/14420>
<rdf:Description rdf:about="http://givingsense.eu/sembib/onto/tpt/biblio/14420">
<rdf:type rdf:resource="http://purl.org/spar/fabio/ResearchPaper"/>
<ns0:publicationDate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2013</ns0:publicationDate>
<ns1:firstAuthor rdf:resource="http://givingsense.eu/sembib/onto/persons/Bhasin_S_"/>
<ns1:publicationMonth>jan</ns1:publicationMonth>
<ns1:state>published</ns1:state>
<ns2:audience>2</ns2:audience>
<ns2:category>grandpublic</ns2:category>
<ns2:entrytype>article</ns2:entrytype>
<ns2:fromDpt rdf:resource="http://givingsense.eu/sembib/onto/tpt/COMELEC"/>
<ns2:fromGroup rdf:resource="http://givingsense.eu/sembib/onto/tpt/SEN"/>
<ns2:hasTptId rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">14420</ns2:hasTptId>
<ns3:creator rdf:resource="http://givingsense.eu/sembib/onto/persons/Carlet_C_"/>
<ns3:creator rdf:resource="http://givingsense.eu/sembib/onto/persons/Bhasin_S_"/>
<ns3:creator rdf:resource="http://givingsense.eu/sembib/onto/persons/Guilley_S_"/>
<ns3:language>en</ns3:language>
<ns3:title>Theory of masking with codewords in hardware: low-weight $d$th-order correlation-immune Boolean functions</ns3:title>
<ns4:url>http://eprint.iacr.org/2013/303</ns4:url>
</rdf:Description>
** And I saw that the description of the paper does not link to any other external representation of the paper.
** While DBLP (http://dblp.l3s.de/d2r/snorql/) contains a representation of that paper (http://dblp.l3s.de/d2r/snorql/?describe=http%3A%2F%2Fdblp.l3s.de%2Fd2r%2Fresource%2Fpublications%2Fjournals%2Fiacr%2FBhasinCG13) as well as a representation of the main author (http://dblp.l3s.de/d2r/resource/author/Shivam_Bhasin).
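Such a link could be established with a single owl:sameAs statement using the two URIs above. The following is a hypothetical update, assuming the endpoint accepts SPARQL Update:

```sparql
# Hypothetical update: link the SemBib paper to its DBLP counterpart.
# Both URIs are taken from the example discussed in this review.
PREFIX owl: <http://www.w3.org/2002/07/owl#>

INSERT DATA {
  <http://givingsense.eu/sembib/onto/tpt/biblio/14420>
    owl:sameAs
      <http://dblp.l3s.de/d2r/resource/publications/journals/iacr/BhasinCG13> .
}
```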
- I encourage the author to identify the intersection and the differences that this data set has with other bibliographic data sets, in terms of concepts, resources (people, publications, topics), and facts (i.e. statements). I would also recommend to try to enrich the data set with other information, not necessarily from the bibliographic domain.
**Other things**
- Why is the property “entrytype” used as a datatype property in the description of a ResearchPaper, given that it helps little to classify the resource? Moreover, the description already contains an rdf:type statement. What value does “entrytype” add to it?
In the resource description shown above (http://givingsense.eu/sembib/onto/tpt/biblio/14420) there are two statements as follows:
<rdf:type rdf:resource="http://purl.org/spar/fabio/ResearchPaper"/>
<ns2:entrytype>article</ns2:entrytype>
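One alternative would be to express the classification with a more specific FaBiO type rather than a string value. The following sketch assumes the record really is a journal article, which is my guess from the entrytype value, not something stated in the submission:

```sparql
# Hypothetical: type the resource with a more specific FaBiO class
# instead of (or in addition to) the string-valued "entrytype".
# Whether fabio:JournalArticle matches this record's intent is an assumption.
PREFIX fabio: <http://purl.org/spar/fabio/>

INSERT DATA {
  <http://givingsense.eu/sembib/onto/tpt/biblio/14420> a fabio:JournalArticle .
}
```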
- Section 2.3 mentions “Our hypothesis is that advances in the semantic web are able to give us new ways to efficiently exploit the data we collect, in order to provide research and analysis functions. “ That very much depends on the specific definition of the objective, which is not clearly specified. The hypothesis is not tested in any way in the paper.
- On page 3 the submission says “Large warehouses of bibliographic data exist elsewhere. Unfortunately, they give a very truncated view of our production, in particular because they are not able to resolve the changes in the name of our institution and their usual variants. Moreover these bases do not have information on internal structures of research: projects, departments and groups of research …”. Has the author considered that information about projects, institutions, etc. is usually present in separate data sets? That is why it is so important to link the data to other data sets (for the sake of information coverage and data maintenance).
- The author says that publications were crawled but the text does not discuss matters such as the copyright of publications. How was this issue handled?
- "Publication venue" sounds better than the term used ("publication channel") when one refers to conferences and journals.
- When one tries to execute the query in pages 8-9: with the PREFIX statements included, the endpoint gives an ARC2 error; without the PREFIX statements (as the author provides the query), it gives a “this site cannot be reached” error. See also https://goo.gl/VT5iHt
- The plot shown in Figure 1 does not provide a very useful insight to the reader. The author should think of communicating the information differently. Perhaps the author could show a table including information about the shared keywords, clustering the information by conference or topic, and giving statistics based on the number of shared keywords.
** After rebuttal **
I would like to add that the comment about "ignoring other publication data sets" didn't refer to the fact of citing them in the text, but rather to the process of integrating the data.
Finding different ways to reuse existing technology is a valid approach to explore a field and come up with new solutions to new problems. However, I still think that this work requires more novel components and further elaboration to be accepted.

Review 3 (by Jodi Schneider)

Thanks for your many comments and especially for trying to make your materials more accessible to non-French speakers!
----
This paper describes the process of producing a semantic representation of publications in the Telecom Paristech full-text repository. While semantic metadata for publications has been extensively curated, work on the full text of papers is novel. Even as an expert in this area I find the paper useful in going further than the published literature on the details of corpus enrichment.
The fit with the resource track is not completely clear; I would think of it as an industry/in-use application. It would be even better to reframe the text as a tutorial with select data released (e.g. presumably the non-copyrighted data).
Some suggestions:
- avoid dates in the form 1/12/2017 in international publications (whether this is Jan 12 or Dec 1 will depend on the location of the reader); YYYY-MM-DD (ISO 8601) is always safe. It appears from the end (“In not differently specified, all links were last followed on January 12, 2018.” -- which should be “If not...”) that this is actually 2018-01-12.
- remove the space before footnote markers so that they directly follow the letters (not the space)
- carefully proofread (e.g. "citet Larsen: 2010: Scientometrics: 20700371"), ideally have a native English speaker proofread (e.g. in English there is no space between a number and the % mark).
- check references for consistency and usefulness. For instance, I don't know what ToTh is. Technical reports should ideally have URLs given to make them easier to find.
- Figure 1 needs improvement. It's not clear what the colors mean. Do you have any comment to make on the disconnectedness of keywords?

Review 4 (by anonymous reviewer)

The paper presents a collection of scientific articles with a semantic representation. It follows a graph-based approach for capturing the metadata. The overall goal is to enable connection with further distributed repositories of scientific publications.
At the current state of the repository, I am not sure that it is ready to be released to the community. As the author points out, there are already quite a few large and established repositories for academic articles; a new one would have to have very definite and clear advantages. The main concerns are: 1) Why not use a standardised format for the metadata? This would also include linking to existing graphs of authors, for example (or reusing existing ontologies). 2) The first paragraph of 2.5 does not clearly state what the state of the repository is. Is it an initial pilot with only a few articles, are there further repositories linked, is it in a final and stable version?
Overall, the work is interesting but the potential impact for the community is not absolutely clear.