Processing incoherent open government data - a case study about Romanian public contracts funded by the European Union
Author(s): Bogdan Ghita, Octavian Rinciog, Vlad Posea
Full text: submitted version
Abstract: Lately, many governments have adopted policies and mechanisms for making open data available to citizens, in order to increase the transparency of state administration and institutions. The usage of these data is, however, hampered by the incorrect, incomplete and incoherent nature of the information.
The purpose of this paper is to summarize the general steps needed to transform raw open data that contain errors into consistent data. These steps are used to correct the open data published by the Romanian government regarding public contracts funded by the European Union, supporting entities interested in using these data.
Keywords: Open Data; semantic web; error correction
Review 1 (by Pavel Shvaiko)
(RELEVANCE TO ESWC) The submission provides a case study on processing incoherent open government data related to Romanian contracts funded by the EU. The topic addressed is relevant and is worth further investigation. However, the major problem of the submission is that there is no relationship at all to the usage of semantics or semantic technologies in this context, which makes it only marginally relevant to the ESWC scope.
(NOVELTY OF THE PROPOSED SOLUTION) The method proposed is rather straightforward, being mostly an engineering exercise, hence the originality of the work is low.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The method applies ad hoc rules to handle incoherences.
(EVALUATION OF THE STATE-OF-THE-ART) Related work on open government data and data quality in relation to semantic technologies is weak.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Section 6 discusses a usage scenario, though with no explicit links to semantic technologies.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments could be reproduced. There are no specific statements on the "generality" of the study.
(OVERALL SCORE) The submission does not justify ESWC publication, for the reasons mentioned above.
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) See detail below.
(NOVELTY OF THE PROPOSED SOLUTION) See detail below.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) See detail below.
(EVALUATION OF THE STATE-OF-THE-ART) See detail below.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See detail below.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) See detail below.
(OVERALL SCORE) The paper addresses a practical topic - the use of open data beyond the SW and wider academic communities, and the challenges in data correctness and validity when data come from multiple sources. Based on a use case reporting EU funding in Romania, the aim is to provide support for analysis, mainly for the prevention of corruption. The authors propose data validation for reports submitted on requests for refunds of money spent delivering (previously) approved projects. Validation is based on a set of files describing the data expected when recording such cases.
The proposal addresses the underlying issue at a relatively superficial level - the solution is VERY specific and, as described, is not easily reused. As is, even for the same use case, any change in data properties requires a manual fix. The methodology described is actually fine wrt identifying where issues exist and where solutions may lie - the implementation is where it breaks down. Also, data considered invalid is simply discarded; no attempt is made to correct it or, importantly, to set up a structure that prevents errors in future data capture. Considering that the validation process could easily be reused to prevent input errors in the first place, it is surprising the authors do not consider this aspect of the problem at all, especially as it would also simplify input. Looking at Berners-Lee's design issues (ref1) and the proposal for 5-star data, this solution is problematic. Note therefore that wrt novelty and the proposed solution the actual score is borderline, but this option does not exist, so I had to go down, as accept is definitely not the case. For reproducibility/generality I moved up from borderline to weak accept.
Two graphs of the validated data are shown, but no discussion is carried out to illustrate how they are used to prevent fraud - and yet this is the main purpose of the exercise. The paper is classified as Linked Data. Three references address linked data, but the paper does not at all, and there is no effort to link the output to any other relevant datasets, or even to encode it using relevant ontologies, which would be an alternative way to achieve this. So open data, yes, but not necessarily linked. This in itself does not invalidate the proposal, but the paper should probably be reclassified. Overall, I would class the paper as borderline; while the issues wrt implementation and discussion are not too difficult to fix, I doubt there is enough time to do so within the conference review process. Without this option I have to move down to reject.
******** OTHER DETAIL
"In , Futia et all discussed about using errors occurred in Italian procurement documents ..." - is "using" an error? Otherwise, please expand on how they made use of the errors in the documents.
"by correcting the various errors that may occur and to describe them in a consistent manner.." - what needs description? The DATA - this is what I would assume, but the text says the ERRORS.
"In order to achieve the goal of obtaining clear and consistent data, all these cases should be investigated. ...This involves error detection and correction, which if not possible, the affected data must be invalidated." Isn't this a bit extreme - why invalidate? This would simply reduce the amount of data made open, potentially by a significant amount. Why not simply flag it as not validated?
I am not convinced the example of "Praha" vs "Prague" in S4.5 is a good one - neither is a mistake; it is simply the same city in different languages. So this is inconsistent, yes, but not an error. Which brings me back to the clearly missed potential to reuse the solution at the input stage.
Wrt fig 1 - how can you tell unique customers? You state earlier that they do not have a unique code, and the probability that the name/label of each is always correctly entered is near 0. In fact, further on in the paper you give an example of constantly changing names. As the discussion there also answers this question, please bring it to the first place where you mention the issue, rather than making it appear unresolved. Also, what is the difference between a customer and a beneficiary?
In step 1 - "A document is considered valid if a tuple of <one single document type, one month, and one 4-digit number, between 2009 and 2016> ..." Isn't this a bit short-sighted? I am reading this in 2018, so you already have more than a year's worth of data that will be automatically invalidated. Isn't the whole point of an exercise like this to develop patterns that check for correctness, rather than using fixed values - going back to the example of "Praha" vs "Prague" - that cause the problem to start with?
"Because values belonging to the same type of logical information can be represented in different formats (eg: numbers as string or as number) we normalized the format across each column." - normalised as what? This matters: the choice will either help to resolve the problem or create a bigger headache.
Table 1 - the numbers in the text that follows do not completely match those in the table. If reporting joint totals as well, put these in the table. And please also use consistent number formatting, including a comma separator for 1000s for readability. The table and the text following have "POC (project_unique_code)" - previously this was abbreviated PUC.
In step 2 - "from a total of 1,487,508 entries 6183 were discarded, representing a negligible percentage of 0.41%." - if those 6,183 contained 25% of the funding or reimbursement, or were the most current, say, they are definitely not negligible. Using percentages this way is valid only if each data point is equivalent to all others. Unless this is the case AND you state so, the conclusion is not valid.
"... identify the customers recipients ..." - who are these recipients? Why are they important?
Fig 2 - what is "outsources"? Who is authorising the request? Looking at the graph, reimbursement follows authorisation. Is the former counted from after authorisation is granted, or do both start at the same time? What is the graph saying? It is presented without any discussion. Ditto fig 3 - what conclusions can be drawn based on it?
**** Please run an auto-grammar check and proof-read - there are a large number of errors that would be identified quite easily. Place "the" in front of cases such as "European Union" - it is missing in several places, e.g., "projects are financed by [THE] European Union", "first register a project plan to [THE] European Commission". On the other hand, in "The Figure 1 shows an overview ...", delete "The" at the start of the sentence.
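The reviewer's point about hard-coded ranges can be illustrated with a small sketch: validate the <document type, month, year> tuple from a file name with a pattern and an open-ended year check instead of the fixed 2009-2016 range that invalidates newer data. The file-name format, function name, and `.csv` extension below are assumptions for illustration, not the paper's actual convention.

```python
import re
from datetime import date

# Hypothetical file-name pattern: <doctype>_<month>_<year>.csv
# (an assumed format, not taken from the paper under review)
FILENAME_RE = re.compile(
    r"^(?P<doctype>[a-z_]+)_(?P<month>0[1-9]|1[0-2])_(?P<year>\d{4})\.csv$"
)

def is_valid_filename(name: str, first_year: int = 2009) -> bool:
    """Accept any year from first_year up to the current year,
    so fresh data is not automatically invalidated."""
    m = FILENAME_RE.match(name)
    if not m:
        return False
    year = int(m.group("year"))
    return first_year <= year <= date.today().year

print(is_valid_filename("payments_03_2017.csv"))  # True: not cut off at 2016
print(is_valid_filename("payments_13_2015.csv"))  # False: month out of range
```

The upper bound is computed at run time rather than frozen into the rule, which is exactly the pattern-over-fixed-values approach the review asks for.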
Review 3 (by Axel Polleres)
(RELEVANCE TO ESWC) In general, the description of the datasets and the proposed pipeline reads more like a technical report on how to solve/engineer the specific task than a research paper, and therefore, in my opinion, it is not suitable for the research track of ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) My main point of criticism is that the contribution of the paper, the presented data processing pipeline, is heavily tailored to the specific use case, i.e., the processing of the project funding datasets. For instance, you extract the date and type of a single document from the file name, which works fine for your use case but is not a generalizable approach. The same holds for every step of your process.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) There is neither an evaluation of the data processing pipeline nor a discussion of how to evaluate/generalize/apply the solution. An online version, source code, demo, etc. would be helpful to test and verify the submission.
(EVALUATION OF THE STATE-OF-THE-ART) Your initial data quality discussion should be based on existing reports and should provide some quantitative insight into the discussed errors. Here are some pointers to existing work on Open Data profiling and processing:
- Ermilov, Ivan, Sören Auer, and Claus Stadler. "User-driven semantic mapping of tabular data." Proceedings of the 9th International Conference on Semantic Systems. ACM, 2013.
- Ritze, Dominique, et al. "Profiling the potential of web tables for augmenting cross-domain knowledge bases." Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.
- Mitlöhner, Johann, et al. "Characteristics of open data CSV files." Open and Big Data (OBD), International Conference on. IEEE, 2016.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As mentioned above, no online version, source code, demo, etc. is provided. There are statistics on the number of detected errors, the invalid values, etc. in the processed dataset; however, these are only of a descriptive nature, and no discussion of potential false positives, misclassifications, etc. is provided.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) See novelty, demonstration & correctness above.
(OVERALL SCORE) The submitted paper describes an ETL pipeline to process Romanian OGD datasets about public contracts funded by the European Union. A total of 59 documents are analysed wrt. representation inconsistencies and errors, and the steps of the data processing pipeline are detailed. Eventually, the paper discusses how the curated dataset is used in the context of fraud detection. (However, I must say that I did not understand how fraud detection is possible based on the presented plots. The authors should explain or exemplify this.)
Strong points:
- brings awareness to (at first sight simple) open data quality issues, such as column ordering of datasets, format inconsistencies, ...
- motivated by a real-world use case (the fraud detection application)
Weak points:
- tailored, not generalizable, solution
- missing discussion of novelty and of related/state-of-the-art literature (regarding the data processing but also the data quality)
- missing relevance to the scope of the conference
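The format-inconsistency issue both this review and Review 2 raise (the same logical value appearing sometimes as a number, sometimes as a string) can be sketched as a per-column normalization step. The function name, the choice of `float` as the target type, and the accepted input formats below are illustrative assumptions, not the paper's implementation; note that unparseable cells are returned as `None` for flagging rather than discarded, in line with Review 2's suggestion.

```python
def normalize_amount(value):
    """Coerce a mixed string/number cell to a float, or None if unparseable.

    Handles assumed format inconsistencies: numbers stored as strings,
    thousands separators, stray whitespace. Unparseable values are
    flagged (None) instead of silently dropped.
    """
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        cleaned = value.strip().replace(",", "")
        try:
            return float(cleaned)
        except ValueError:
            return None  # keep for manual review rather than discarding
    return None

column = [1200, "1,200.50", " 300 ", "n/a"]
print([normalize_amount(v) for v in column])  # [1200.0, 1200.5, 300.0, None]
```

Choosing one canonical type per column up front is what makes later aggregation and comparison across files possible, which is the reviewer's "normalised as what?" concern.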
Review 4 (by Judie Attard)
(RELEVANCE TO ESWC) The authors target issues in open data within an open government context with the aim of enhancing use. This topic is possibly relevant to the conference (if it focused on Linked Open Data); however, the authors fail to apply any Linked Data or semantic web technologies, and the paper does not provide any contribution.
(NOVELTY OF THE PROPOSED SOLUTION) The authors simply provide documentation on the identification of errors in open government data. They simply describe a number of steps, without providing any context, motivation, approach details, or improvement upon the state of the art.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors do not propose any solution.
(EVALUATION OF THE STATE-OF-THE-ART) The provided section on the state of the art barely grazes the surface. The authors mention only one (!) publication that has a similar objective to theirs, but fail to provide any discussion on how the two compare. Moreover, the authors claim the following: "Their authors studied how the usage of this type of information helps both economic growth and confidence in public administration and minimizes corruption in public institutions, increasing transparency." There are also many authors who indicate that the benefits of open data cannot yet be proven. The authors cite a draft paper (2003) and a workshop paper (2012), which are certainly not the most recent or authoritative publications on the topic.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is simply described in a very abstract manner, with no details about the scientific approach or the research contributions.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) No scientific contributions are provided.
(OVERALL SCORE) This paper seems to provide only an abstract overview of notes, and therefore is not worthy of being published. There is no concrete contribution, no comprehensive description of the approach (including motivation, context, etc.), and not even a proper state-of-the-art discussion. Moreover, there is no use of Linked Data or semantic web technologies.
Metareview by Hala Skaf
This submission presents a case study on processing incoherent open government data related to Romanian contracts funded by the EU. The submission simply provides documentation of errors in open government data. The reviewers agree that the submission is not relevant to the ESWC conference. The authors did not provide a rebuttal.