Paper 151 (Research track)

Learning SHACL Constraints for Validation of Relation Assertions in Knowledge Graphs

Author(s): Andre de Oliveira Melo, Heiko Paulheim

Full text: submitted version

Abstract: Constraints are an important part of ontologies and are responsible for the detection of wrong statements. The automatic induction of constraints from data can assist the creation and maintenance of knowledge graphs. Current state-of-the-art knowledge graph constraint learning approaches are part of ontology learning methods and are restricted to the generation of simple RDFS or OWL axioms. In this paper we propose a method for automatically learning complex SHACL relation constraints from data in order to extend existing ontologies. Our approach translates decision trees trained for relation assertion error detection into SPARQL validation queries. We show that our approach benefits from the higher expressiveness of SHACL and can detect errors which could not be found by current automatic ontology learning methods.

Keywords: ontology learning; error detection; machine learning

Decision: reject

Review 1 (by Agnieszka Lawrynowicz)

(RELEVANCE TO ESWC) The paper covers the topic of learning SHACL constraints for validation of knowledge graphs. This is a core topic for the Semantic Web community.
(NOVELTY OF THE PROPOSED SOLUTION) The main contribution of the paper is described in Section 4.2.
It is a rather simple algorithm for translating a decision tree that uses a subset of SPARQL BGPs (or DL expressions) as features into a set of SHACL constraints.
PaTyBRED, the algorithm for generating the trees, is not in the scope of the paper.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors provided a list of features used by the trees and demonstrated their counterparts in SHACL.
(EVALUATION OF THE STATE-OF-THE-ART) The Related Work section is quite good, covering both older and newer approaches, but it seems to me that some relevant works are missing, e.g. Muñoz E., Nickles M. (2017) Mining Cardinalities from Knowledge Bases. In: Benslimane D., Damiani E., Grosky W., Hameurlain A., Sheth A., Wagner R. (eds) Database and Expert Systems Applications. DEXA 2017. Lecture Notes in Computer Science, vol 10438. Springer, Cham
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The limitations of the approach are clearly stated, although they are mostly the limitations of PaTyBRED.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The description of the experiment is hard to follow, and its purpose is unclear.
Moreover, the study includes manual evaluation by one of the authors, but its results are not available for verification.
*****
After rebuttal: The authors provided the link to the source code and the evaluation results, and explained the most confusing parts, thus I changed the score to accept.
(OVERALL SCORE) Summary of the paper: 
The problem tackled in the paper is learning SHACL constraints for RDF properties and property paths.
The proposed solution is based on PaTyBRED, an algorithm for generating decision trees for deciding whether a given assertion to a relation in a knowledge graph is correct or not.
PaTyBRED uses SPARQL BGP-like features.
The contribution of the paper is an algorithm for translating a decision tree into a set of SHACL expressions.
Strong Points:
* New, important problem.
* Good discussion of the related work.
* Complete coverage of the features used by PaTyBRED.
Weak Points:
* Some parts of the text are badly written: there are repetitions, multiple terms referring to the same thing, examples and pieces of code that lack captions, and positions 11 and 12 in the bibliography describe the same paper.
* The algorithm is rather straightforward.
* Experimental study is unconvincing, due to both design and quality of the description. Moreover, it seems to me that it evaluates PaTyBRED rather than the algorithm itself.
Questions to the Authors:
1. Why introduce the disjunctive normal form (DNF) as an intermediary step? Transforming a decision tree into a set of conjunctive rules is simple enough.
2. Why is there a negation in the formula in the upper part of page 8?
3. What are "triangular path features"?
4. What does "top-100k" mean? Is this the best/worst 100,000 errors according to some measure? If so, what measure?
5. Why perform double selection: first select a large subset of errors, categorize them and then select a sample for manual evaluation?
6. What does it mean that an error falls into the "wrong triple with correct type" category (or, for that matter, any other of the four categories of errors)?
*****
After rebuttal: I thank the authors for the extensive response to my questions. I am satisfied with the answers and changed the score to accept.


Review 2 (by Irlan Grangel)

(RELEVANCE TO ESWC) Knowledge graphs are today of paramount importance for the Semantic Web community. 
The capability for knowledge graphs to be correct is very important to obtain all their benefits.
Therefore, this paper is of relevance for the conference.
(NOVELTY OF THE PROPOSED SOLUTION) The novelty of the proposed solution is based on the use of decision trees in the PaTyBRED approach that the authors developed in previous work.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The solution is presented with an adequate level of correctness and completeness and authors also acknowledge their limitations.
(EVALUATION OF THE STATE-OF-THE-ART) State-of-the-art approaches are evaluated to some extent. It would be good to have a clear separation from the previous version of the PaTyBRED method, e.g., whether the inclusion of decision trees is the only thing that is new in this work.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Properties of the proposed approach are properly demonstrated.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiment is correctly explained with sufficient information. However, links to the dataset, code, etc, are missing.
(OVERALL SCORE) The paper entitled "Learning SHACL Constraints for Validation of Relation Assertions in Knowledge Graphs" presents an innovative approach to validate Knowledge graphs by means of the automatic generation of constraints. The proposed method learns SHACL-SPARQL constraints for relations.
They rely on the PaTyBRED approach and use decision trees to perform the generation. 
The paper is well written and the ideas are well structured and easy to follow. 
Strong points:
- The topic presented in the paper is highly relevant for the Semantic Web community
- The utilization of machine learning techniques in knowledge graph is definitely a very interesting approach which is presented in a solid manner
Weak points:
- Although the authors mention some of the advantages of decision trees, the approach seems too costly (there is one training set for each constraint), taking into account that validation will still be needed.
- It is not completely clear how the generated restrictions are validated
- The authors recognize that "PaTyBRED has been shown to work best with random forests or support vector machines as relation classifiers" and that the most important reason to opt for decision trees is that they can be easily converted to SHACL constraints. It would be interesting to know the trade-off between this reason and the huge cost of the decision trees.
Questions:
- 
Minor issues
- References 11 and 12 seem to be the same (duplicated)
- 6 assertions -> six assertions
- 6 positive examples -> six positive examples
- that defines the defines the examples ->  that defines the examples
- to be highly to be erroneous -> ??
- induced my modern knowledge graph -> induced by modern knowledge graph 
- extend previous works [14, 5] -> extend previous works [5, 14]


Review 3 (by anonymous reviewer)

(RELEVANCE TO ESWC) With their approach for learning SHACL constraints the authors extend the existing work of inducing constraints on the schema of structured data expressed e.g. in RDF or OWL. This is an important endeavor to ensure certain quality aspects for (usually automatically generated) knowledge sources of ever increasing size.
(NOVELTY OF THE PROPOSED SOLUTION) The main achievement of their work in comparison with existing approaches is the consideration of different kinds of property paths in a knowledge base that can be used as features for describing a given property through a decision tree. Further, the choice of using decision trees gives a structured and human readable description of a learned classifier that lends itself to be translated into SHACL in a systematic way. Using SHACL to express such constraints is favorable since it is a recent W3C recommendation serving exactly this purpose.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Several details concerning the overall approach were left unclear. Certain terms and notations were not explained. Moreover, quite often techniques were not introduced but just named (sometimes not even referenced).
Overall a running example would have been helpful. Some of the given examples were not comprehensible to me, due to incomplete data or because certain aspects were not explained.
(EVALUATION OF THE STATE-OF-THE-ART) In the evaluation PaTyBRED is compared to only one other approach which is not introduced properly. Constraints were only learned on one single dataset in one restrictive (in terms of property path length) setting. Conclusions drawn are entirely based on ratios of certain types of ‘detected errors’ and ignore other possible measures that might have covered quality and performance aspects of learning schema constraints. Moreover, the discussion of the results was quite brief and vague to me.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Several details concerning the overall approach were left unclear. Certain terms and notations were not explained. Moreover, quite often techniques were not introduced but just named (sometimes not even referenced).
Overall a running example would have been helpful. Some of the given examples were not comprehensible to me, due to incomplete data or because certain aspects were not explained.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental study is not reproducible since the implementations for PaTyBRED and ‘Statistical Schema Induction’ are not available. Besides this, the dataset used and the evaluation settings were given. The actual results or numbers were neither presented nor made available.
(OVERALL SCORE) In their submission "Learning SHACL Constraints for Validation of Relation Assertions in Knowledge Graphs" the authors present an approach to learn SHACL constraints for a given knowledge graph. The overall method named PaTyBRED, comprising steps like
- the generation of negative examples (i.e. object property assertions that are not part of the knowledge base),
- feature generation,
- feature selection,
- derivation of a decision tree based on the selected features,
- decision tree pruning,
- translation of the decision tree into a logical formula in DNF, and
- translation of the DNF formula into SHACL constraints
was used to compare their approach with a statistical schema induction attempt.
*** Strong Points
- Relevance
With their approach for learning SHACL expressions the authors extend the existing work of inducing constraints on the schema of structured data expressed e.g. in RDF or OWL. This is an important endeavor to ensure certain quality aspects for (usually automatically generated) knowledge sources of ever increasing size.
- Originality
The main achievement of their work in comparison with existing approaches is the consideration of different kinds of property paths in a knowledge base that can be used as features for describing a given property through a decision tree. Further, the choice of using decision trees gives a structured and human readable description of a learned classifier that lends itself to be translated into SHACL in a systematic way. Using SHACL to express such constraints is favorable since it is a recent W3C recommendation serving exactly this purpose.
- Correctness
Besides the originality of the approach, its correctness can also be considered as a strong point, since all the steps involved and mentioned above require a formal foundation. Since an evaluation was performed and presented in the paper, one can assume that this more technical and formal basis was developed, however only very few details were given in the paper.
*** Weak Points (WPs)
- Presentation
Several details concerning the overall approach were left unclear or phrased in a way that was hard to understand. One source of confusion is the use of a certain vocabulary that (at least to me) seems unusual in the context of Semantic Web technologies. Using standard terms like '(object) property' instead of 'relation', 'object property assertion' instead of 'individual/relation/triple assertion', for example, might have avoided misunderstandings that only resolved after reading and thinking over the paper a second time.
Certain terms (e.g. 'level', 'constraint component', or 'error') and notations (esp. the DL axioms in Table 1, but also e.g. the SPARQL property path operator '/', OWL property path operator 'o') were not explained which might hamper the understanding. Moreover, quite often techniques (like SSI, PRA, SDValidate, LCWA, 'KG embedding models') were not properly introduced but sometimes just named.
Many sentences from Section 2 (Preliminaries) seem to be copied (some of them even literally) from different parts of the respective W3C specifications which makes it hard to understand since not all of the SHACL terms were introduced.
Overall a running example would have been helpful. Some of the given examples were not comprehensible to me, due to incomplete data (president example/Table 2) or because certain aspects were not explained (pruning in Fig. 1 seems arbitrary since no threshold etc. given).
- Evaluation
In the evaluation section PaTyBRED is compared to only one other approach ('statistical schema induction') which is not properly introduced. Moreover no implementations for PaTyBRED or statistical schema induction are given which makes the evaluation irreproducible.
In the evaluation the authors concentrate on a comparison of the ratio of certain types of 'detected errors' which were not discussed in detail. The conclusions drawn from this IMHO were not properly justified. Certain other aspects, like ‘the number of correct triples with correct types’ reported, which are 'false positives' if I understood this correctly, were not discussed at all. This could have been reflected by means of standard measures like accuracy, F-score etc.  Besides this, the evaluation does not give a good impression of the performance of the PaTyBRED approach. Since there are many steps (as listed above) involved I would have been interested in an overview regarding the individual runtimes. Another point would be to show that the main benefit one gets by applying PaTyBRED, i.e. being able to exploit property paths as features for a classifier, is feasible in practice. In the evaluation provided in the paper the maximum path length was set to 2, and the authors pointed out that path lengths greater than 2 could cause performance issues. It would have been helpful to see how big they are.
Comments after the rebuttal:
We acknowledge the rebuttal of the authors. Given that more information supporting reproducibility has now been made available, we adjusted the respective score accordingly.


Review 4 (by Catherine Faron Zucker)

(RELEVANCE TO ESWC) This paper presents an approach and a tool to learn SHACL constraints from RDF graphs with the aim of validating the RDF graphs with them. This is a major issue for the Semantic Web.
(NOVELTY OF THE PROPOSED SOLUTION) The proposed approach relies on the PaTyBRED method previously proposed and published by the authors to detect errors among relation assertions in an RDF graph. It relies on a set of binary classifiers (one for each relation).
The contribution of the paper is the choice of decision trees for classifiers and the translation of decision trees into SHACL-SPARQL constraints.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) For the generation of SHACL-SPARQL constraints from decision trees, the reader is expecting an algorithm in addition to the textual presentation of the approach.
Also, the pruning of the decision tree should be better explained.
(EVALUATION OF THE STATE-OF-THE-ART) The authors are well aware of the state-of-the-art
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) I am not comfortable with evaluating this. A manual evaluation has been conducted, comparing the proposed approach to statistical schema induction. To which extent is it an evaluation of PaTyBRED, and to which extent is it an evaluation of the contribution of the paper?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The source code of SSI and PaTyBRED is available online; the URIs should be indicated in the paper so that the reader does not have to look for them in other papers.
(OVERALL SCORE) This paper presents an approach and a tool to learn SHACL constraints from RDF graphs with the aim of validating the RDF graphs with them. This is a major issue for the Semantic Web.
It relies on the PaTyBRED method previously proposed and published by the authors to detect errors among relation assertions in an RDF graph. It relies on a set of binary classifiers (one for each relation).
The contribution of the paper is the choice of decision trees for classifiers and the translation of decision trees into SHACL constraints.
The paper is well written but 
- there are many typos which should be corrected
- the announcement of the contribution is unclear at the end of the introduction
- it is also unclear in the text that PaTyBRED is a previous contribution of the authors (and there are 2 references for the same paper and a third reference for the proceedings, this should be corrected)
- there are some redundancies in the introduction of PaTyBRED and the approach. I would advise moving the text in the introduction of section 4 to section 1, and having 2 separate sections for the presentation of PaTyBRED and the actual contribution of the paper: the generation of SHACL constraints.
For the generation of SHACL-SPARQL constraints from decision trees, the reader is expecting an algorithm.
local remarks
Why are constraints for data properties out of the scope of the paper? (end of introduction of section 4)
"it has been shown" -> where (ref to the paper)
a discussion/motivation of the features used in PaTyBRED would be welcome
in section 4.2, after the 2 examples showing that SHACL core is not convenient, the reader is expecting the shapes in SHACL-SPARQL


Metareview by Christoph Lange

The reviewers agree that this paper addresses a relevant topic; however, there are the following concerns:
* The increment over previous publications on the PaTyBRED method is not clear.
* Some more context needs to be provided (e.g., techniques such as SSI need a bit more explanation).
* The cost of adapting the approach to new settings (with one classifier and one training set per relation) should be commented on.
* The evaluation, on one dataset, with a limited path length, is limited.

