Paper 161 (Research track)

Knowledge Guided Attention and Inference for Describing Images Containing Unseen Objects

Author(s): Aditya Mogadala, Umanga Bista, Lexing Xie, Achim Rettinger

Full text: submitted version | camera ready version

Decision: accept

Abstract: Images on the Web encapsulate diverse knowledge about varied abstract concepts. They cannot be sufficiently described with models learned from image-caption pairs that mention only a small number of visual object categories. In contrast, large-scale knowledge graphs contain many more concepts that can be detected by image recognition models. Hence, to assist description generation for those images which contain visual objects unseen in image-caption pairs, we propose a two-step process that leverages large-scale knowledge graphs. In the first step, a multi-entity recognition model is built to annotate images with concepts not mentioned in any caption. In the second step, those annotations are leveraged as external semantic attention and constrained inference in the image description generation model. Evaluations show that our models outperform most of the prior work on out-of-domain MSCOCO image description generation and also scale better to broad domains with more unseen objects.

Keywords: Knowledge Base Semantic Attention; Caption Generation for Novel Visual Objects; Visual Entity Linking


Review 1 (by Michael Granitzer)


(RELEVANCE TO ESWC) See overall evaluation
(EVALUATION OF THE STATE-OF-THE-ART) See overall evaluation
(OVERALL SCORE) See overall evaluation


Review 2 (by Valerio Basile)


(RELEVANCE TO ESWC) This paper is clearly relevant to the Semantic Web community, both for the resource used for the task of producing descriptions of images, and for the output, which will effectively enrich the existing image-based and general knowledge graphs on the Web of Data.
(NOVELTY OF THE PROPOSED SOLUTION) The paper contains a well written related work section. The work is well situated among the state of the art, and it is clear that there are several novel points.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The model presented in this paper is quite sophisticated, involving deep learning and language models. A multi-entity classifier is used to extract entities from the images, which in turn are used to query large KGs and extract more informative features. The architecture also includes an LSTM-based language model in order to produce full captions from the word and entity features.
(EVALUATION OF THE STATE-OF-THE-ART) The paper is well situated among the recent literature, and the experimental results are compared to some of the existing models. However, the comparison is restricted to models that use VGG-16 for the extraction of features from images, which I find a bit odd. Most importantly, the proposed model, while performing roughly at the same level as the other models, is not clearly superior.
More detail would be welcome on the evaluation metrics, which are not described at all apart from references to the relevant literature. As far as I understand, the quantitative evaluation concerns the labels predicted for the images rather than the complete captions. If that is the case, some more evaluation is needed of the output of the full pipeline, beyond the (very nice) examples in Figure 3.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The strong claim about this method is that it will be able to scale to large amounts of unseen objects in image recognition. However, this claim needs to be backed up by equally strong empirical evidence. In summary, I am convinced that this approach works, but I am not convinced that it works better than the state of the art on a large number of object categories.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The resources and evaluation benchmarks used for the experiments are available and linked in the paper. It is unclear whether the authors will release the software implementing the model.
(OVERALL SCORE) This paper presents a novel deep learning model to generate descriptions of images based on the objects recognized in them. The novelty of the model lies (among other steps) in the inclusion of a language model trained together with the entity labels extracted from the images.
The model reaches state of the art level performance on standard datasets, and shows promising results in terms of generating fluent and informative image descriptions.
The claim that this model scales well with respect to the number of candidate object categories is not convincingly backed up by the experiments.
Authors' response:
The authors replied positively to two of my comments, while one question is not clearly answered (the comparison limited to VGG-16-based systems only). However, this is not a major point.


Review 3 (by anonymous reviewer)


(RELEVANCE TO ESWC) This paper describes a (machine-learning) approach to generate image captions with the peculiarity of using Knowledge Graphs to handle the cases of previously unseen objects. The paper is mostly a machine-learning paper; notions like LSTMs, embeddings and the like are not core to this conference. Hence, the potential audience needs an advanced machine learning background.
As a side note, by looking at the references one may find several papers that appeared at the CVPR conference and almost no reference to (Semantic) Web conferences or journals.
(NOVELTY OF THE PROPOSED SOLUTION) As far as I can judge, the proposed solution is very interesting. One main concern that many people have about the usage of deep-learning approaches is the difficulty of really understanding what is going on (although their working principle is quite simple: chain rules, backprop, etc.). I guess that incorporating external structured knowledge into the loop is an interesting direction. Nevertheless, I can only partially assess the novelty of the approach because of the lack of an overview of the related work in the field (I trust the authors; that is why I gave accept).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Here too, the approach sounds convincing. I am familiar with LSTMs and related notions, but I cannot completely judge the overall correctness of the approach.
(EVALUATION OF THE STATE-OF-THE-ART) By looking at the related approaches the authors pick to compare to, I can see that these are very relevant (and popular).
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Experiments seem to confirm the main claim of the paper, that is, adding background knowledge helps in generating more accurate captions.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I could not find any pointer in the paper about the system. I was very interested in running it.
(OVERALL SCORE) Strong points:
(i) Interesting connection between Knowledge Graph and machine learning in the specific task of image caption generation;
(ii) The paper seems reasonable in its main design: usage of embeddings (to reduce the representation space), LSTMs to handle network memory, etc.
(iii) The authors report performance better than other related approaches.
Weak points:
(i) The paper is largely not accessible to people with no (or little) machine learning background; hence, it is out of the scope of ESWC, in my opinion.
(ii) I could not find a pointer to the system.
After rebuttal:
I thank the authors for their response. Nevertheless, it does not help in clarifying my concerns.


Review 4 (by Ralph Ewerth)


(RELEVANCE TO ESWC) The proposed approach utilizes a knowledge graph representation in order to improve an image captioning system. This paper applies semantic web techniques and is relevant to this conference.
(NOVELTY OF THE PROPOSED SOLUTION) The combination of an image caption generator and knowledge graph is the main contribution of this paper. The integration of the knowledge graph is realized with the help of a new attention mechanism. Overall, to the best of our knowledge, this approach is novel.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper contains a detailed description of each particular system component. In addition, the overview of the entire pipeline in Figure 2 is very helpful for understanding the system. However, some equations should be checked again. For example, in equations 3 and 4, calculating the tanh before the softmax function seems unnecessary. It is not clear whether the authors want to calculate the tanh for all previous hidden layers.
(EVALUATION OF THE STATE-OF-THE-ART) The authors compare their methods with several recent approaches, and the proposed system achieves state-of-the-art results. The selected reference systems cover current approaches that are published in well-established conferences.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties of the proposed approach are discussed appropriately.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) In our opinion, the paper lacks some parameter settings needed to reproduce the results. For example, the hyperparameters used to train the neural networks (learning rate, number of epochs, optimizer) are not explained.
(OVERALL SCORE) The authors describe an image caption system that utilizes a knowledge graph to identify unseen objects and to refine the generated caption. The experimental evaluation is based on a self-generated test dataset and shows that the proposed system achieves performance competitive with several state-of-the-art approaches.
Strong Points
+ The proposed knowledge guided attention module sounds interesting
+ The experimental results demonstrate that the system achieves state of the art results
+ The paper is very well written and structured and therefore easy to understand
+ Related work section covers recent work published in well-established conferences
Weak Points
- The training hyperparameters are not mentioned in the work (learning rate, number of epochs), which affects the reproducibility
- The text in the graphics is too small and difficult to read
*** Authors' response:
We thank the authors for their response and additional explanations. Our questions have been addressed; the paper should be revised according to the reviewers' comments, if accepted.
There are no changes in our scores.


Metareview by Andreas Hotho


The paper deals with a deep learning method for predicting the description of images utilizing a knowledge graph. All reviewers like the work. The weakest points stated by the reviewers are the strong claim on scalability and the focus on machine learning without relation to ESWC. Also, the writing could be improved and the notation needs to be more precise. I think the scalability issue was clarified with the author response, and the work is clearly in the scope of the conference. I support the arguments of the authors in the rebuttal, as this is the machine learning track of the conference and I think one can assume some prior knowledge in machine learning.
There are problems with the rating system, as two overall scores exist. All reviews propose accept or weak accept for the overall score, which is also reflected in the reviews. This is not consistent with the overall evaluation (once weak reject), which is only used for the in-use track.
The authors promised in the rebuttal to add the missing information about the parameters and to fix the issue with the notation. I recommend a conditional accept of the paper with respect to the promised changes.

