Paper 12 (Research track)

A Declarative Approach to Anonymize Linked Data Graphs

Author(s): Rémy Delanaux, Angela Bonifati, Marie-Christine Rousset, Romuald Thion

Full text: submitted version

Abstract: The growth of the LOD (Linked Open Data) graph due to the injection of new information from individuals, enterprises and governments leads to rethink the available data protection methods mainly conceived for relational data.
In this paper, we introduce a declarative framework for privacy-preserving Linked Data publishing in which privacy and utility policies are seamlessly encoded as SPARQL queries.

We focus on the problem of query-driven anonymization of a graph dataset prior to its publication in the LOD graph.
We formalize the problem of finding sequences of modifications to a graph database guaranteeing that both privacy and utility criteria as encoded in their respective policies are satisfied.

We propose a data-independent method that inspects only the privacy and utility policies and does not need to navigate the graph instance in order to determine the sequence of operations needed for policy application.
We prove the soundness of our method and gauge its effectiveness and runtime by means of an experimental assessment.

Finally, we conclude the paper by discussing the generality of our
declarative framework and its possible extensions.

Keywords: Linked Data Anonymization; Data Privacy; Linked Open Data; Privacy Policy; Utility Policy; Privacy/Utility Tradeoff; Declarative Privacy; RDF Graph Anonymization

Decision: reject

Review 1 (by Javier D. Fernández)

(RELEVANCE TO ESWC) The paper addresses the relevant and hot topic of privacy-aware mechanisms for Linked Data, in particular Linked Data anonymization. Given the latest trends on privacy-preserving technologies and methods, in special with the upcoming application of the new General Data Protection Regulation in Europe, the topic is highly relevant for our community.
(NOVELTY OF THE PROPOSED SOLUTION) Authors propose a declarative approach in which privacy policies are represented by a SPARQL query, which is a well-known technique e.g. in access control for Linked Data. In contrast, authors also define utility policies via SPARQL queries that represent all the functionality that the data providers want to keep after anonymization. Thus, the paper focuses on keeping this functionality while satisfying the privacy policies (so far this enforcement is based on data deletion, but other methods such as value replacements are envisaged). To the best of my knowledge, this combination is underrepresented in the literature.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The solution is technically sound and formally presented. All algorithms, lemmas and theorems are relatively easy to follow. Proofs of the theorems are sketched in the paper and a full version is available online.
(EVALUATION OF THE STATE-OF-THE-ART) Although authors review the most important references, I missed a reference to related work on access control for Linked Data (e.g. [1], or a full survey in [2]) where a declarative approach can be also used to specify fine-grained access control policies. In a way, both approaches define SPARQL queries that should be enforced (acknowledging that this is performed differently in each case). A discussion on this aspect would be useful for the reader.
[1] Villata, S., Delaforge, N., Gandon, F., & Gyrard, A. (2011, October). An access control model for linked data. In OTM Confederated International Conferences" On the Move to Meaningful Internet Systems" (pp. 454-463). Springer, Berlin, Heidelberg.
[2] Kirrane, S., Mileo, A., & Decker, S. (2017). Access control and the resource description framework: A survey. Semantic Web, 8(2), 311-352.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Authors formally demonstrate the properties of the approach, and they discuss some limitations. In particular, the solution deals with anonymization via deletion, while other operations (such as IRI and value replacement) are devoted to future work. Likewise, the solution is solely based on the policy and utility queries, while further optimizations bases on the real data and order of delete operations is also seen as future work.
However, I would expect some discussion on i) the implications of extending the expressivity of the queries (as now it is limited to BGPs) and ii) the use of some kind of reasoning (e.g RDFS or OWL2 EL).
Update: Authors provided a minor explanation for i) and ii) in the rebuttal. I might expect that in the final version of the paper, if accepted.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I have some concerns regarding the evaluation. Firs of all, authors do not provide neither the data and queries used in the experiment nor the implementation of the algorithms. I would encourage authors to clarify in the rebuttal if they are providing those.
Then, if I understood correctly, authors decided to disregard star queries. However, this is a well-known pattern which is a core aspect in most benchmarks (see e.g. see Watdiv [3], and their query templates: I would expect a justification or clarification in the rebuttal. 
Finally, experiments are based on single synthetic dataset, while I would expect to test different scenarios. In particular, my hypothesis could be that the number of properties and data types may play an important role. 
Given that one of the aspects of the declarative approach is the expressivity, I would maybe expect an evaluation with an expert defining policy and utility queries. 
[3] G. Aluç, O. Hartig, M. T. Özsu and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Proc. The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, 2014, pages 197-212.
Update: Authors provided a github repository with the code in the rebuttal, as well as a minor evaluation with star-shaped queries, but I understood this latter won't be part of the paper, and the justification of this lack is still unclear.
(OVERALL SCORE) The paper addresses the timely topic on Linked Data anonymization. In particular, authors propose a declarative approach to specify a) privacy policies (i.e. data that shouldn't be exposed) and b) utility policies (i.e., data that should be visible after anonymization) via SPARQL queries. 
Besides providing a formal framework, authors then focus on computing the set of modification operations (restricted to delete so far) that can be applied to the input graph to satisfy both policies. The proposed algorithm, limited to BGP SPARQL queries, can compute such operations independently of the data (albeit authors don't focus on further optimizations to provide an optimal combination of operations). 
Finally, authors make use of synthetic data and randomly generated queries for the policies, varying on the number of triple pattern and the overlap between the privacy and utility queries. Results show some expected results, e.g that having less restrictions on the utility makes anonymization simpler, while specifying more privacy concerns increases the number of candidate triple deletions (which is reduced if more privacy and utility topics overlap). 
* Strong points:
- Relevant and timely topic
- All the paper, including the formalization, is well-written and easy to follow
- The proposal is simple and limited (in query expressivity and modification operations), but it is technically sound and can address a subset of problems in Linked Data anonymization. 
* Week points (further elaborated in other sections) :
- Limited evaluation 
- Missing deeper discussion on the practical utility on the approach and the implications of extending the expressivity and  reasoning capabilities of the policies.
- Incomplete state of the art (although it might be completed in the final version of the paper)
* Questions for the authors:
- Are the data and code used in the empirical evaluation available for download?
- Can authors discuss the implication of extending the expressivity of the policy queries? How can authors justify that the provided simple expressivity is enough to deal with Linked Data anonymization?
- Can authors discuss how reasoning (e.g RDFS or OWL2 EL) can be supported?
- Why did authors fully disregard star-shape queries? 
- Can the results be generalizable under different scenarios (e.g different number of properties and data types)?
Update: I thank authors for their rebuttal and I increased some of my scores accordingly. I think the paper would still miss a more detailed discussion on (or extension of) the (limited) expressivity, hence I keep my overall score.

Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) Unfortunately, the paper must be rejected, since it does not comply with the formatting requirements and would significantly exceed 15 pages if properly formatted. It even looks to me that the authors simply reduced margins in order to make a prior draft fit to ESWC. Due to this fact I'm only very briefly reviewing the submission.
(NOVELTY OF THE PROPOSED SOLUTION) The approach seems to be novel, but the base assumptions are not reasonable.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Generally the definitions and conclusions appear resonable, but the advantage compared to a trivially naive approach of extracting all data using the utility queries is not clear to me.
(EVALUATION OF THE STATE-OF-THE-ART) The authors have considered relevant related work.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The experimental study seems to be reasonable, the comparison with a trivially naive approach is missing.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I can not find resources facilitating the reproducibility, such as code or evaluation data.
(OVERALL SCORE) Generally the submission is relevant and interesting and looks well prepared. I'm not convinced the base assumptions and definitions make sense. For example, it looks to me that I could still retrieve all the data, by using individual TPs of the privacy and/or utility queries and then reconstruct a large part of the original data. Also, by deleting (which seems to be the main anonymization operator), the informativeness of the data and thus the analytical possibilities are significantly reduced. Usually anonymization could also mean altering the data in such a way, that the informativeness is similar to the original data, but sensitive information is modified in such a way, that the original data can not be reconstructed. Probably the use case here is slightly different: sensitive data should be removed, but couldn't this simply be achieved by extracting the data returned by the utility queries (and removing everything else) - I do not see what the added value of the approach is to such a simplistic procedure.

Review 3 (by Mathieu D’Aquin)

(RELEVANCE TO ESWC) This paper presents an approach to defining anonimity and utility policies for RDF graphs, as well as to establish anonimisation operations to apply in order to satisfy those policy. The relevance to ESWC is straighforward and the topic timely.
(NOVELTY OF THE PROPOSED SOLUTION) There has been a few attempts at looking at anonymisation of RDF graphs, but as far as I'm aware the approach to define anonimity and utility policies as sets of queries is new. I would have liked to see more discussion on approaches to policy definition in other areas and to the more general anonymisation of graphs (which might be applicable to RDF). 
The authors might want to cite the following paper as a more recent attempts at applying anonimisation approaches to RDF graphs:
k − RDF-Neighbourhood Anonymity: Combining Structural and Attribute-Based Anonymisation for Linked Data by Benjamin Heitmann, Felix Hermsen, and Stefan Decker at PrivOn2017
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The definitions, algorithms and proofs presented appear correct. However, the approach to defining policies is very limited. Indeed, only defining binary (i.e. compliant or not) utility policies might be unrealistic in practice. It is also seem very restrictive to only allow conjunctive queries in the definition of privacy policies. This for example means that policies based on criteria such as the one at the basis of k-anonimity (quasi-identifiers should cover at least k records/individuals) are not expressable.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation is applied to synthetic data and randomly generated policies. While this shows some characteristics of the approach, it misses key aspects, and certainly does not assess the ability ot the approach to maitain anonymity or utility. A comparison, at least in qualitative terms, with other anonymisation/policy definition approaches would have been welcome.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Again, here, the binary nature of the policies and the limitations of conjunctive queries as a policy definition language are not discussed. What is also not discussed is the fact that, with this approach, large numbers of policies might have to be definined manually for both anonymity and utility policies in practice, with much manual effort required to ensure consistency and completeness (and no guaranty).
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Generally, the evaluation is sufficiently described and the approach sufficiently formalised that it could be reproduced, and potentially applied to other (although similar) approaches.
(OVERALL SCORE) The topic is important, and the technical aspects of the approach seem robust. There are however limitations to the method compared to usual anonymity works, and many of those limtations and aspects are not sufficiently discussed.

Review 4 (by Michel Dumontier)

(RELEVANCE TO ESWC) De-identification of RDF data is a relevant topic, especially in the context of public data. The proposed method will result in a set of SPARQL DELETE queries, which can be executed on a dataset following a full known graph structure with given ontologies.
The manuscript gives an in-depth explanation of the logic, including fictive examples. With these examples, and knowledge about RDF, this manuscript is good to follow. However, when missing both backgrounds (RDF or description logic), it becomes harder to understand.
(NOVELTY OF THE PROPOSED SOLUTION) The authors do not explain well what the advantage of this method is. I assume it's by developing real-life query patterns for allowed or denied queries and developing a rule set which can accommodate both scenario's and develop the final delete queries. Hence, the method has in my opinion only added value when there is a need to determine which axioms should be maintained (utility policy), while attempting to comply as best as possible to deletion criteria (privacy policy).  In this case, it also means that utility policies always overrule privacy policies. I would suggest adding such a description (when to use this method?) for clarity to the broader public.
Furthermore, authors state that work on "k-anonimity" has been performed before (reference 1), however do not explain in detail where the proposed method differentiates from this referenced work.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The description logic definitions do look correct, however is not the background of this reviewer.
As the testing dataset is synthetic, and the policies are automatically generated, it only proves the mathematical and theoretical behavior of policies. It would be interesting to see what the consequences are with real data and policies. Does the number of statements in privacy policies still influence the deletion rate this much? Or does it also depend on the graph structure? If the latter is true, how does it apply to a dataset with a large number of blank nodes?
(EVALUATION OF THE STATE-OF-THE-ART) The approach of data removal for anonymization is not new. However the applied method to develop a final set of deletion queries is state-of-the-art in my opinion.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Demonstration is only in-text. No sample code or datasets are given. Limitations of this method are addressed in the discussion, and show that it only applies on a given graph structure without taking into account the actual values.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental study is only described in the manuscript. No code has been given, making it hard to reproduce the experimental study. Especially because the characteristics of the synthetic dataset are unknown.
(OVERALL SCORE) This manuscript describes a logic-based, declarative approach, to data anonymization using existing Semantic Web technologies (RDF and SPARQL). This approach is based on 3 components. The first component starts with developing RDF patterns which need to be anonymized (privacy policies), and RDF patterns of data preservation (utility policies). Afterwards, a statement-based approach checks whether single pattern statements in the privacy policies interfere with utility policies. If not, the privacy statements are maintained; if interference occurs, the privacy axioms are removed from the list. Finally, the third step (2nd algorithm) constructs (multiple) delete operations, based on the remaining privacy policies axioms. This method is tested in a synthetic dataset, using different parameters for the privacy and utility policies size. Limitations of the work are addressed, however some issues are still existing due to the synthetic dataset:
Does the number of statements in privacy policies still influence the deletion rate this much? Or does it also depend on the graph structure? If the latter is true, how does it apply to a dataset with a large number of blank nodes?
Furthermore, who would the end-user of such a tool be? If it is someone with experience in Semantic Web, does he/she need such a tool to develop delete scripts, or would someone develop these queries him/herself, and why?

Metareview by Maribel Acosta

This work presents a query-driven anonymization approach for RDF graphs. The reviewers agreed that this line of research is definitely interesting and relevant to the Semantic Web. 
During the reviewing process, one critical drawback has been identified regarding this submission: this paper does not follow the LNCS formatting. This, in turn, causes the submission to exceed the page limit established in the rules listed on the ESWC 2018 website. Due to this reason, the paper cannot be accepted in the conference.
In addition, the reviewers have also identified further weak points of this submission, including: missing relevant state-of-the-art solutions in the Related Work, limited empirical evaluation, and missing discussions on the limitations/extensions regarding the expressivity of the query policies handled by the approach. We highly encourage the authors to address the previous issues to improve the overall quality of this work.

Share on

Leave a Reply

Your email address will not be published. Required fields are marked *