Paper 125 (In-Use track)

Exploring the Synergy Between Knowledge Graph and Computer Vision for Personalisation Systems

Author(s): Chun Lu, Philippe Laublet, Milan Stankovic, Filip Radulovic

Full text: submitted version

Abstract: In this paper, we explore the synergy between knowledge graphs and computer vision tools for personalisation systems. We propose two image user profiling approaches which map an image to knowledge graph entities representing the interests of a user who appreciates the image. We show the superiority of one of our approaches over the baseline Google Cloud Vision API in terms of accuracy and argue for the importance of the capacity to create semantically useful profiles. This approach is then applied in a novel personalisation use case where we seek to select the most appropriate image to display in recommendation banners. Our proposed knowledge-based approach tries to select the images which are the most in line with the user profiles. We conduct a user study with a real commercial travel catalogue and show its promising performance in terms of persuasion, attention, efficiency and affinity. A demo of the presented approaches is made available online.

Keywords: Knowledge Graph; Computer Vision; Image; Personalisation; User Profiling; Travel; Recommender System

Decision: reject

Review 1 (by Paul Groth)

Comments after rebuttal: I thank the authors for their response and in particular the clarifications on the test scenario. The work still strikes me as slightly early to report on.
This paper describes the use of image recognition in combination with a knowledge graph (i.e. DBpedia) to perform the task of image selection for recommendations in the travel domain. In other words, the task is to display a good image for the recommendation of a trip. The paper describes two methods to perform image selection and compares them with just using the Google Cloud Vision API. The proposed methods are not particularly novel with respect to image recognition (as acknowledged by the authors) but are an interesting combination of using existing image recognizers with an entity selection algorithm in a KG. The systems were evaluated stand-alone on a generated ground-truth dataset (which was made available) as well as on the actual task.
Overall, I think this would at least make for a great demo paper at the conference. I'm a bit more on the fence with respect to whether it is acceptable as an in-use paper. There are two major issues I have:
1) Lack of background around the in-use elements. 
The authors claim that the system is using data from a large French travel site. This is great, but a bit more information would be useful. Is the given problem seen as important to the business? What do they think about the implementation? Is it needed? Has it been tested on the site itself?
2) Unclarity with respect to the evaluation setup
In the first evaluation, the hierarchical clustering approach used to select the images was not detailed. I think this is critical, as only 50 images were evaluated, so the selection really matters; this is glossed over. In the second evaluation, it was unclear how the annotators were sourced. Was it through Mechanical Turk? Were they from the travel site staff? Were they students? I also wondered why in the second evaluation the procedure was not compared to the Google Cloud Vision API. Yes, the KG-based methods performed better on the ground-truth dataset, but it would still be interesting to see the performance on the full task.
I also note that the notion of "catalogue" plays an important part in the methods described but it's not really defined anywhere. 
In summary, I like the overlap of image recognition and knowledge graphs as a topic but would have appreciated much more context around the specific implementation in practice. 
Minor comments:
* 'We found few work in this area." -> "We found a small amount of work in this area."
* "We found several approaches creating user profiles from images" -> "We found several approaches to…"
* "by considering the semantic sim- ilarity calculation to be adopted further" -> unclear what adopted further means
* A minor point: if you want to use abbreviations, you can just put them in parentheses behind the name of the algorithm instead of writing a sentence to that effect.
* It would be nice to put your data somewhere a little bit more permanent and accessible than a google drive. But thanks for making it available!


Review 2 (by anonymous reviewer)

The authors present a method for employing KGs for computer vision tasks, in the specific context of a travel recommender system. 
Main strengths:
- Combining KGs with computer vision is a highly relevant and unsolved task
- Evaluation dataset released
Main weaknesses:
- Methods and dataset biased towards a specific domain 
- Missing experimental details
Detailed comments, including more minor ones, below.
There are some experimental details missing. For instance (Section 4.1): "240 of these synsets have known mappings to Wikidata and thus to DBpedia by “owl:sameAs”." -- Obtained how? And (same section): "For the resting 760 synsets, we made the mappings in a semi-automatic way by referring to the glosses of the synset." -- How? Why is this "semi-automatic"? 
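For concreteness, the owl:sameAs hop described in the quoted passage can be reproduced against the public DBpedia endpoint; the sketch below is my own illustration (not the authors' code), using an arbitrary Wikidata entity:

    # Illustrative only: resolve a Wikidata entity to its DBpedia
    # counterpart via owl:sameAs. Q144 ("dog") is an arbitrary example,
    # not an entity taken from the paper.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?dbpedia WHERE {
            ?dbpedia owl:sameAs <http://www.wikidata.org/entity/Q144> .
        }
    """)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["dbpedia"]["value"])  # e.g. http://dbpedia.org/resource/Dog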
Approach 2 (Section 4.2) relies "on the penultimate layer outputted by Inception-V3 which is a 2048-dimensional vector. The similarity between two images is determined by the Euclidean distance between their vectors." It remains totally unclear to me, however, why this is a good idea. Why this layer? Why this distance? Please clarify and support your choices, preferably with experimental evidence. 
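For reference, extracting that 2048-dimensional vector and computing the Euclidean distance between two images is straightforward in Keras; the following is my own minimal sketch of the described setup (file names are placeholders), not the authors' implementation:

    # Minimal sketch: 2048-d penultimate-layer features of Inception-V3,
    # compared by Euclidean distance, as in the quoted passage.
    import numpy as np
    from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
    from tensorflow.keras.preprocessing import image

    # include_top=False with global average pooling yields the 2048-d
    # vector that feeds the final classification layer.
    model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

    def embed(path):
        img = image.load_img(path, target_size=(299, 299))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return model.predict(x)[0]  # shape: (2048,)

    dist = np.linalg.norm(embed("a.jpg") - embed("b.jpg"))  # Euclidean distance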
For the evaluation, the authors "selected 50 diverse and representative images by using a hierarchical clustering algorithm." Representative how? And which algorithm is used? 
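One plausible reading, offered purely as my assumption since the paper leaves the step unspecified, is agglomerative clustering of the image feature vectors into 50 clusters, keeping the image nearest each cluster mean:

    # Assumption, not the paper's documented procedure: cluster image
    # features into n groups and keep the image closest to each cluster
    # mean as that cluster's representative.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def pick_representatives(features, n=50):
        labels = AgglomerativeClustering(n_clusters=n).fit_predict(features)
        reps = []
        for c in range(n):
            idx = np.where(labels == c)[0]
            centroid = features[idx].mean(axis=0)
            reps.append(idx[np.argmin(np.linalg.norm(features[idx] - centroid, axis=1))])
        return reps  # indices of the n representative images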
Using which statistical test are the p-values obtained? 
I would have loved to see a per-image result analysis to complement Table 1, denoting the differences in performance between the best- and worst-performing systems. Is the performance better across the board, or are there some individual images that account for the majority of the improvements?
Even though it features in numerous places, it remains unclear to me what a "recommendation banner" is. It is also unclear (Section 2) what the goal of "enhancing the perception of the recommended items" entails. How can a perception be enhanced? 
Why is a variant of [32] not used as a baseline? From the description in Section 2 it seems the most relevant related paper. Moreover, the authors should make it clearer what the differences with the current work are.
There are some places in the paper where the verb "try" is used (e.g., "Our proposed knowledge-based approach tries to select [...]"). I would suggest replacing these with another, stronger verb such as "aim" (or just leave it out altogether). 
What do the authors mean by "In case of need, e.g. the number of appearing entities is too small, we may enrich with the entities which are closely related to the appearing ones. The main idea is to map to entities which can contribute to the semantic similarity calculation, in other words, which are useful in further personalisation tasks. Thus, ideally, the conceptual scope should be defined by considering the semantic similarity calculation to be adopted further. For example, we may enrich with the entities by a set of selected object properties [20], or the ones by 1-hop category enrichment [22] or the ones used as dimensions in embeddings [5, 27]." -- Which "semantic similarity"? Which enrichments are used? This also ties in with "We constitute the conceptual scope of the catalogue by retrieving directly appearing entities (and closed related ones)." -- How is "closely related" defined? What are "closely related entities by 1-hop category enrichment" (Section 4.3)?
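As a hedged illustration of what "1-hop category enrichment" could mean in DBpedia terms (my reading, not the authors' documented implementation), one would fetch an entity's dct:subject categories and the sibling entities sharing them:

    # Illustration only: "closely related" read as entities sharing a
    # Wikipedia category with the seed entity (1 hop via dct:subject).
    # The seed entity is an arbitrary example, not taken from the paper.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dct: <http://purl.org/dc/terms/>
        SELECT DISTINCT ?related WHERE {
            <http://dbpedia.org/resource/Eiffel_Tower> dct:subject ?cat .
            ?related dct:subject ?cat .
        } LIMIT 100
    """)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["related"]["value"])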
What is "web entity detection" (Section 4.3)?
I am not sure I agree with the experimental setup in Section 4.3, given that the authors ask three annotators to assess "suppose that you like the image, would it be reasonable to determine that you are interested in the entity?" How is one "supposed" to like an image? Isn't that inherently personal? 
I find the baselines in Section 5.2 rather weak, with one being random and the other one assuming that "the travel agent privileges images which are the most attractive in general." Isn't this, again, rather personal? 
The images in Figure 4 are hard to discern on a black-and-white printout.
There are quite a few typos and grammatical errors in the paper.


Review 3 (by Victor de Boer)

Comments after rebuttal: I thank the authors for their rebuttal. After reading the two short points addressing my concerns, I do not wish to update the review.
This paper presents a proof-of-concept of a method that combines techniques from computer vision and semantic technologies for image recommendation tasks. The paper presents interesting work, combining methods from the two domains in a straightforward way to address relevant image-based recommendation tasks. Two image user profiling algorithms are presented which map an image to knowledge graph entities representing the interests of a user who appreciates the image.
Section 4 presents quite a comprehensive and extensive evaluation comparing two variants to a baseline, which shows that the second variant of the proposed method outperforms both the baseline and the other variant in image recommendation. A second, more task-based evaluation is set in the context of "recommendation banners" shown to users selecting a package tour. Both of these evaluations are done on a real commercial travel catalogue and show promising performance in terms of persuasion, attention, efficiency and affinity.
The paper presents a simple but interesting and novel approach rather than showing the impact of proven techniques in an industry setting. However, the task and datasets chosen are interesting real-world problems for which the authors show the benefit of having a semantics-based approach, and as such, I would say it is relevant for the in-use track.
Other remarks: 
For further related work on using semantic graphs and serendipity in recommendation, authors could look at Valentina Maccatrozzo, Davide Ceolin, Lora Aroyo and Paul Groth, Semantic Pattern-based Recommender. In: Presutti, V., et al. (eds.) SemWebEval 2014, CCIS, vol. 475, pp. 182-187. Springer, Heidelberg (2014).
p5: "resting" -> "remaining"
p5: the authors state that "In this paper, we use the DBpedia, knowing that other similar large-scale knowledge graphs like Wikidata can also be used" -> it would be good if the authors could comment on this: to what extent are the structural features of DBpedia relevant for the results? Would domain-specific knowledge bases (e.g. LinkedMDB) also work?
footnote 7: for persistency, it would be preferable to publish the dataset using an archived system, such as figshare or github
p12: There is a weird tab before the mention of Table 2 in the text.


Review 4 (by anonymous reviewer)

Comments after the rebuttal: The authors answer some of the issues raised by the other reviewers, but I am still doubtful about the fit of the paper for the In-Use & Industrial Track. I suggest more in-depth research into the state of the art.
The paper proposes an approach to using knowledge graphs in computer vision for personalisation systems.
The authors discuss two case studies, comparing the proposed approach to state-of-the-art approaches and showing the potential of their approach. The paper is well written, although it would benefit from a spelling check.
However, I have some doubts about the novelty of the approach. The related work section, indeed, misses some important references, for example [1,2]. Also, the authors mention a problem with linking DBpedia and WordNet, while a mapping is already available on the DBpedia download website.
In the first experiment, the authors create a ground-truth dataset which, IMHO, is not done properly. The question asked to the annotators is subjective, and thus three annotators are not enough to generalize.
On page 12 "[...] they see a recommendation banner without and rates on a 5-level Likert [...]": I guess something is missing after "without".
Moreover, I think this paper is not a fit for the In-Use & Industrial Track, as all the experiments are performed in a lab setting.
[1] L. Hollink, A. Th. Schreiber, J. Wielemaker, B. Wielinga: Semantic Annotation of Image Collections. In: Proceedings of the K-CAP'03 Workshop on Knowledge Capture and Semantic Annotation, Florida, October 2003.
[2] L. Aroyo, N. Stash, Y. Wang, P. Gorgels, L. Rutledge: CHIP Demonstrator: Semantics-driven Recommendations and Museum Tour Generation. In: The Semantic Web, pp. 879-886.


Review 5 (by Anna Tordai)

This is a metareview for the paper that summarizes the opinions of the individual reviewers.
Overall, there is some disagreement amongst reviewers regarding the novelty of the work although most reviewers are enthusiastic about the combination of image recognisers with entity selection in Knowledge Graphs. The reviewers are also positive about the sharing of the dataset. Except for reviewer 3, all reviewers raise issues with the evaluation setup. 
The reviewers express doubts about whether this paper fits within the In-Use track. A commercial dataset is used, as well as usage scenarios, but it has not been deployed in real life, nor is there an evaluation with real users.
Reviewers 3 and 4 recommend additional literature as references and related work.
Laura Hollink & Anna Tordai

