Paper 46 (Research track)

Tell Me Why Is It So? Explaining Knowledge Graph Relationships by Finding Descriptive Support Passages

Author(s): Sumit Bhatia, Purusharth Dwivedi, Avneet Kaur

Full text: submitted version

Abstract: We address the problem of finding descriptive explanations of facts stored in a knowledge graph. This is important in high-risk domains such as healthcare and intelligence, where users need additional information for decision making, and is especially crucial for applications that rely on automatically constructed knowledge bases, where machine-learned systems extract facts from an input corpus and the working of the extractors is opaque to the end user. We follow an approach inspired by information retrieval and propose a simple, efficient, yet effective solution that takes into account passage-level as well as document-level properties to produce a ranked list of passages describing a given input relation. We test our approach using Wikidata as the knowledge base and Wikipedia as the source corpus, and report the results of user studies conducted to study the effectiveness of our proposed model.

Keywords: Explainability; Textual descriptions; Knowledge graphs; Information retrieval

Decision: reject

Review 1 (by Chenyan Xiong)

(RELEVANCE TO ESWC) Finding support sentences/passages is a standard task for the semantic web. It is important and useful for knowledge graphs to present relations not only in schema format but also with natural language text that explains and supports the fact.
(NOVELTY OF THE PROPOSED SOLUTION) The proposed approach is a standard language modeling method reasonably adapted for this task.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The adapted language model is intuitive and the derivation of the retrieval method follows the standard.
(EVALUATION OF THE STATE-OF-THE-ART) Only a very basic baseline, the default paragraph retrieval model, is compared. The discussions and comparisons with previous research in finding support sentences are incomplete.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) No analyses that study the influence of the proposed method's new components (relation synonym expansion and smoothing) are provided.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The proposed method is straightforward. I expect it easy to be implemented by others.
(OVERALL SCORE) This paper presents a language modeling approach to retrieve support sentences (paragraphs) for the relations in knowledge graphs. The task is to find natural language texts that explain the relations between two entities. The proposed approach uses the head entity name, relation name, and tail entity name as sub-queries, ranks candidate paragraphs for these sub-queries using language models, and combines these language model scores into the ranking score. To address the sparsity or vocabulary mismatch problem, the paragraph language model is smoothed by document-level and corpus-level language models, similar to standard smoothing approaches in the language modeling literature. In addition, the synonyms of the relation names from the knowledge graph are used to enrich the relation name query.
Strong Points:
The problem is well motivated and the introduction does a good job in describing the meaning of this task.
The proposed language modeling approach, albeit simple, is intuitive and follows nicely the standard approaches in language modeling literature.
The writing clearly describes the work with sufficient details---while also being interesting.
Weak Points: It is a little sad that the experiments conducted are not sufficient to back up the contributions of this paper. This also raises several concerns about the effectiveness of the proposed approach.
First, the baseline performance is really low and the advantage of the proposed approach is not well studied. The proposed language modeling approach outperforms the baseline, Indri's paragraph retrieval, by several times. From the description in this paper, there are only three major differences between the two: the proposed method uses synonyms of edges, smooths the paragraph language model using document and corpus language models, and combines the head, edge, and tail name queries differently. It is surprising that these differences make such a huge difference (0.82 vs 0.32 on [email protected]). There is only one example provided to explain the effectiveness of the proposed method. That is not sufficient to provide a full understanding of the proposed approach. In order to do so, a thorough study of the contribution of each of the three components is required. The description of the Indri baseline needs more details as well.
Second, finding the support sentence in the form of natural language text is not a completely new problem. [36] uses learning to rank for this task and should be discussed in more detail, perhaps also compared as a baseline. It is understandable that the proposed language modeling approach, as an unsupervised method, may not be able to outperform the LeToR method. However, a comparison with, or integration of, the language modeling approach into the LeToR system would provide a much better picture of the effectiveness of the proposed approach.
Third, though the annotation of labels is carried out in a reasonable way, the scale of the resulting labels is a little limited. Only 50 queries are labeled, and only 10 queries are labeled by two annotators. In addition, the pooling size for candidate documents is only 5, and only two methods are used to generate the candidate pool. This limited scale restricts the robustness of the experiments.
I consider this paper as weak reject. I like this task, the writing, and the simple language modeling approach. I wish the authors can improve the quality of the experiments with more baselines, analyses of individual components, and more labels. I would like to see this paper in future semantic web or IR conferences.
After rebuttal:
I thank the authors for providing the explanations. I still consider this paper as a weak reject (There is no weak reject or borderline score choice given in the system, so I have to put it as 2:reject). My concerns about incomplete surveys for related work and the unclear source of effectiveness w.r.t. a very similar baseline remain after reading the rebuttal.

Review 2 (by Christian Dirschl)

(RELEVANCE TO ESWC) The task at hand clearly falls within the scope of ESWC and the NLP track. Semantic Web technology and standards are used, as well as common datasets like Wikidata and Wikipedia as the base corpus.
(NOVELTY OF THE PROPOSED SOLUTION) The approach taken is quite straightforward, and the section on related work is quite concise and limited to adjacent aspects, so I am not sure about the novelty aspect here.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors claim that they are presenting an approach to "finding descriptive explanations of facts stored in a knowledge graph." Actually, I am not convinced that they have achieved this. For me, explanations explain something. Their basis for explanation, however, is randomly chosen relations from Wikidata. I assume that these relations help to give some context, but I do not see why this aspect should be specifically expressed here (and I have found no evidence in the paper apart from the evaluation results, which are also not very focused on "explanation").
(EVALUATION OF THE STATE-OF-THE-ART) There is no specific section on that, but based on the tools and process used, I assume that this is fine.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is well described and can be understood. It also includes sensible simplifications, so that re-use is possible.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The setting is fine and the description is clear. The approach taken for the experiment is sensible and state-of-the-art.
(OVERALL SCORE) I am struggling with the claim made and the results presented. For me, there is a gap which cannot be bridged by the evaluation, so I do not really know on what basis the good results are grounded. Context is not the same as explanation.
Strong Points:
- Good structure and language
- Relevant topic
- Good experimental setting
Weak Points:
- No definition of "explanation" and therefore no evidence of what was achieved in this area

Review 3 (by Francesco Ronzano)

(RELEVANCE TO ESWC) Knowledge graph textual grounding is a relevant topic for ESWC
(NOVELTY OF THE PROPOSED SOLUTION) Motivated by comparison in SOA (more info on the Overall score section)
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The evaluation of the proposed approach is extensive (more info on the Overall score section)
(EVALUATION OF THE STATE-OF-THE-ART) Extensive review of SOA (more info on the Overall score section)
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation of the proposed approach is extensive (more info on the Overall score section)
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Complementary material (code on GitHub) difficult to explore / reuse (more info on the Overall score section)
(OVERALL SCORE) *** Summary of the Paper ***
The paper proposes a method to identify, given a corpus, textual explanations of knowledge graph relationships: in particular, starting from a relationship, the proposed approach aims at pointing out a ranked list of excerpts that best contextualize that relationship by motivating why it holds. The authors extensively discuss the importance of textual descriptions to explain and increase trust in search results, particularly focusing on the case where such results include entities and relationships from a knowledge graph.
Starting from the words describing a relationship (which include the name of the relationship as well as the names of its source and tail nodes), the authors propose an approach to compute the probability that such a relationship is mentioned in a textual passage (extracted from a collection of documents). This probability is equal to the product of the probabilities of occurrence of each single word describing the relationship in the textual passage and, in turn, the probability of occurrence of a word in a passage is computed by means of a mixture of three language models. These language models give the probability of a word given a passage, a word given a document, and a word given the whole collection of documents.
As a result, if we have a collection of textual passages and a relationship, we can compute for each passage the probability to mention that relationship and show the user the passage characterized by the highest probability to explain and contextualize the relationship.
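The mixture-model scoring summarized above can be sketched in a few lines (a minimal illustration based on this review's description, not the authors' code; the token-list inputs, the `passage_score` helper, and the use of the paper's 0.4 / 0.4 / 0.2 weights as defaults are assumptions):

```python
from collections import Counter

def lm(tokens):
    """Maximum-likelihood unigram language model over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return lambda w: counts[w] / total if total else 0.0

def passage_score(query_words, passage, document, collection,
                  lam=(0.4, 0.4, 0.2)):
    """Score a passage for a relationship query as the product, over the
    query words, of a three-way mixture of passage-, document-, and
    collection-level language models (the smoothing described above)."""
    p_pass, p_doc, p_coll = lm(passage), lm(document), lm(collection)
    score = 1.0
    for w in query_words:
        score *= (lam[0] * p_pass(w)
                  + lam[1] * p_doc(w)
                  + lam[2] * p_coll(w))
    return score
```

Note how the document- and collection-level terms keep the score nonzero even when a query word is absent from the passage, which is exactly the vocabulary-mismatch problem the smoothing addresses.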
The proposed passage ranking approach is compared with a baseline system implementing a network-based generative passage retrieval algorithm. The English Wikipedia is used as the collection of documents from which passages are retrieved, and 150 relationships from Wikidata are used to compare the approaches. Given 50 relationships, two human evaluators are involved to rate the relevance of the text passages retrieved by each approach. Moreover, considering the whole set of 150 relationships, human evaluators are asked to choose, between the best passage identified by the proposed approach and the best one identified by the baseline, which better explains the relationship. In both experimental settings the proposed approach outperforms the baseline. The authors discuss some error analysis.
*** Strong Points (SPs) ***
The paper is well written and easy to read since it is consistently organized. It gives deserved emphasis to the evaluation section.
The paper deals with a relevant topic: the retrieval of textual information that contextualizes and motivates relationships found in knowledge graphs.
*** Weak Points (WPs) ***
Some details are missing in the explanation of the approach in Section 3.
Complementary material (code on GitHub) difficult to explore / reuse
*** Questions to the Authors (QAs) ***
- You state that you set the length of the passage returned by the baseline system to be at most 600 words. Could this make the baseline system return incomplete sentences at the beginning and end of each passage, and as a consequence bias evaluators' ratings towards the passages returned by the proposed approach, which consist of three complete sentences?
- In Section 4, both evaluations presented show that your approach significantly outperforms the chosen baseline. It would be great to discuss, for future work, whether you have some other / stronger baselines in mind.
- It would be great to quantify the overlap of the textual passages retrieved by the baseline approach and your method.
- Section 3: in formula (3), could you specify whether you count the probability of a single word multiple times if it is repeated in more than one name of the relationship or the entities it connects?
- Section 3: it would be great if you could motivate why you set the values of the three λ coefficients in your mixture model to 0.4 / 0.4 / 0.2.
- Section 3: could you better explain, and provide an example of, how you deal with multiple synonyms of entity / relationship names?
- Section 4.1: could you explain how you got 80 unique entities from the set of 25 most-viewed Wikipedia pages?
- The proposed approach could also identify textual excerpts that contradict the relationship under analysis (because the relationship no longer holds, one of its entities changed, or the contradicting passage comes from a newer trusted corpus). Did this happen in your use case? How would you deal with, or exploit, such contradictions?
- It would be great, in case of acceptance, to organize and comment the code repository on GitHub a bit more for reproducibility.
- Typo Section 4.2: ...we assigned that passage lower of the two ratings. --> ...assigned THE lower...
** After rebuttal **
I thank the authors for providing answers to the issues raised by the review.
After reading their comments, I would leave my final score unchanged.

Review 4 (by Steffen Remus)

(RELEVANCE TO ESWC) The paper is about providing text passages as evidence for semantic triples, which is clearly relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The approach is rather simple but effective, as the authors claim. As far as I understand it, the approach is comparable to an inverted index with weighted query expressions.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The scores presented by the authors seem valid and the approach is in general reproducible due to the accompanied source code and experimental setup.
(EVALUATION OF THE STATE-OF-THE-ART) I see a strong divergence between the underlying data of the indices for the proposed and baseline methods.
For the proposed approach, documents were fragmented into passages of 3 sentences, where passages contain sentences which overlap with other passages.
The baseline system on the other hand was fed with standard documents as far as I understand.
This makes the evaluation unfair: imagine a single sentence being relevant for the query; the proposed approach will return the 3 passages which contain that sentence, so 3 results will be considered relevant, while the baseline system will return only one relevant passage.
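The fragmentation at issue here can be made concrete with a short sketch (a stride-1 sliding window is assumed, which is what "sentences overlap with other passages" implies; this is an illustration, not the authors' code):

```python
def overlapping_passages(sentences, size=3):
    """Split a document (a list of sentences) into overlapping passages
    of `size` consecutive sentences, advancing one sentence at a time."""
    if len(sentences) <= size:
        return [sentences]
    return [sentences[i:i + size]
            for i in range(len(sentences) - size + 1)]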
It would have been interesting to see in the empirical analysis how many of the returned passages from the proposed approach actually overlap.
Also, I'm not able to figure out how the queries were presented to the baseline system; e.g., quoting, as well as the order of the terms, will make a difference. Have synonymous terms of the relationship label been used too?
Further, the evaluation splits of the queries and the choice of only ten queries for measuring annotator agreement seem strange. Why only 10 for measuring annotator agreement? Why was a disjoint set evaluated by the two evaluators?
Table 4 shows the results of a pairwise evaluation test. As far as I understand, each of the evaluators was presented with results from different queries, but still, the evaluators share a similar ratio of decisions (on average 5 vs 25); that's a factor of 5. The results presented in Table 2 behave similarly: the number of irrelevant passages is (roughly) 5 times the number of highly relevant passages (resp. 1/5 for the proposed method). This suggests that for some queries no results may have been relevant, whereas for others all results were relevant. I wonder if that can really be coincidence?
For the pairwise comparison, there should have been an option for equal relevance.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper discusses the results adequately, it shows comparative examples of the baseline system and the proposed approach and provides an error analysis with examples.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments are reproducible due to the accompanied source code and experimental setup, the experimental study is general enough, although I already expressed my doubts about the experimental setup.
(OVERALL SCORE) The paper presents an approach for retrieving text passages which provide evidence for a semantic triple (subject, predicate, object) in a knowledge graph.
- The paper is well written.
- The presented approach is simple yet effective.
- The paper is accompanied with the source code and the experimental setup including results.
- The evaluation should have been done differently (see my remarks above).
- Since the goal of the approach is to find evidence passages for relational triples, another evaluation setup could have been to provide users with random (false) triples and true triples, and then let users judge the truthfulness based on the returned hits.
- The parameters (lambda_1 to lambda_3) should have been tested.
Minor remarks:
- The first footnote starts with 3.

Metareview by Valentina Presutti

The paper proposes a method to identify, given a corpus, textual explanations of knowledge graph relationships. The problem is well defined and motivated, and the paper is well written. The reviewers appreciate that the approach is simple yet effective, and that the paper is accompanied by the source code and the experimental setup, including results. There is a major shared concern, though: they all agree that the evaluation is inappropriate. The reviewers have provided constructive and detailed insights to help the authors improve the evaluation, and they encourage the authors to do so, as the work looks very promising.
