Paper 155 (Research track)

A Tri-Partite Neural Document Language Model for Semantic Information Retrieval

Author(s): Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf

Full text: submitted version | camera ready version

Decision: accept

Abstract: Previous work in information retrieval has shown that using evidence, such as concepts and relations, from external knowledge resources could enhance retrieval performance. Recently, deep neural approaches have emerged as state-of-the-art models for capturing word semantics that can also be efficiently injected into IR models. This paper presents a new tri-partite neural document language framework that leverages explicit knowledge to jointly constrain word, concept, and document representation learning in order to tackle a number of issues, including polysemy and granularity mismatch. We show the effectiveness of the framework in various IR tasks including document similarity, document re-ranking, and query expansion.

Keywords: Ad-hoc IR; knowledge resource; semantic document representation; deep neural architecture

 

Review 1 (by Chenyan Xiong)

 

(RELEVANCE TO ESWC) The utilization of knowledge graphs and semantic web resources is an important research topic in the Semantic Web. Their application to document representation is one of the recently studied tasks for the utilization of semantic resources.
(NOVELTY OF THE PROPOSED SOLUTION) The joint training of word, entity, and document embeddings has some novelty. The retrofitting with signals from knowledge graphs to learn document representations is a somewhat novel application of the retrofitting technique.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution follows the standard approaches in training joint embeddings and the popular retrofitting technique.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation in the document representation part is conducted with enough detail and the comparison with paragraph vector is sound.
The utilization of embeddings in the document ranking and query expansion tasks could be updated with recent developments in Neural IR.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Convincing studies on the impacts of different individual model components are provided.
The case studies also share some good insights into the model's behavior.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments are conducted on popular public benchmarks. The implementation of the proposed approaches should not be difficult for those familiar with embedding.
(OVERALL SCORE) This paper presents a document embedding model with joint learning of word, entity, and document vectors. It is an extension of the paragraph vector, with the triple embeddings jointly learned through a combined loss function. The knowledge graph information is also used to "retrofit" the word embeddings via a regularizer in the loss function. The learned embeddings are evaluated in the document similarity, ad hoc ranking, and query expansion tasks. The proposed method performs well on the document similarity task but not quite so in the ranking and expansion tasks. Nevertheless, some analyses and insights are provided for the limited performance of word and document embeddings in the IR tasks.
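For orientation, the kind of combined objective described here can be sketched roughly as follows (a paraphrase with an assumed trade-off weight, not the paper's exact notation): the paragraph-vector loss over words and concepts given the document and context, plus a retrofitting-style regularizer that pulls related concept vectors together,

$\mathcal{L} \;=\; \mathcal{L}_{PV}(\text{words}, \text{concepts} \mid \text{document}, \text{context}) \;+\; \lambda \sum_{(c_i, c_j) \in \mathcal{R}} \lVert \mathbf{v}_{c_i} - \mathbf{v}_{c_j} \rVert^{2}$

where $\mathcal{R}$ denotes the set of related concept pairs drawn from the knowledge resource.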
The proposed method is an easy upgrade of a popular method, the paragraph vector. It is easy to use and intuitive. One would expect the additional information from the knowledge graph to help in both joint learning and retrofitting.
The experiments on the document similarity task did a good job of demonstrating the effectiveness of the proposed method and its individual components. The t-SNE study showing the separation of relevant and irrelevant documents, as well as the different locations of query embeddings, is interesting and provides useful intuition about learning embeddings for IR tasks.
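A minimal sketch of the kind of t-SNE inspection described, assuming the learned document vectors and per-query relevance judgements are available as arrays (the data below is a random placeholder, not the paper's):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder inputs: in the paper these would be the learned document embeddings
# and the relevance judgements for a single query.
doc_vectors = np.random.rand(200, 100)        # (n_docs, embedding_dim)
relevant = np.random.randint(0, 2, size=200)  # 1 = judged relevant, 0 = irrelevant

# Project the embeddings to 2D and colour points by relevance.
points = TSNE(n_components=2, random_state=0).fit_transform(doc_vectors)
plt.scatter(points[:, 0], points[:, 1], c=relevant, cmap="coolwarm", s=10)
plt.title("Relevant vs. irrelevant documents in the learned space")
plt.show()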
The experiments on the document ranking and expansion tasks, though the proposed methods do not outperform the baselines there, provide some additional insights about why the embeddings do not help retrieval performance. In fact, the concerns raised in this paper align well with recent progress in Neural IR. It is now well understood in the IR community that word2vec-style embeddings are not as effective for relevance modeling. Also, representation-based models, i.e. embedding the whole document into a single vector, are not as effective as match-based models that directly learn the matching between query and document.
There are also some weaknesses in this paper:
The first weakness is that the writing quality requires improvement. The overall story is clear, but the clarity is not there yet. The description of the proposed method needs some major revision to reduce the reader's effort. For example, the training loss is unclear until very late in Section 3; the definition of the concept sets in a document only becomes clear in the experiment section.
The second concern is that the related work on both utilizing knowledge graphs for IR and neural methods for IR is not up to date. Both topics are rapidly developing; much progress has been made in the past one or two years, for example better methods to incorporate entities in document representations, and better neural methods to train and use embeddings for document ranking. I suggest the authors conduct a more thorough survey of recent IR conferences and update the related work section. The recent workshops and tutorials on KG4IR and Neural IR are good starting points. This will help better position this paper as well as improve the experiments in the IR tasks.
Overall, I recommend this paper as a weak accept. I like the utilization of entities and knowledge graphs in document embedding. The experiments on the document similarity task are also interesting. I suggest the authors improve the experiments with regard to the two IR tasks, as well as the writing quality, in the next iteration.

 

Review 2 (by anonymous reviewer)

 

(RELEVANCE TO ESWC) The subject tackled in the paper is relevant to the ESWC community. The authors aim at expanding current language models based on word embeddings with semantics by combining several other approaches in their proposal.
(NOVELTY OF THE PROPOSED SOLUTION) The novelty is not the strongest point of the paper: adding the documents to the model is adapted from Mikolov's PV model, Navigli et al. have already embedded words and concepts together, and the inclusion of the influence of existing semantic relationships is adapted from Yamada et al. 2016. However, a tri-partite model seems to be novel in this context.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposal seems to be correct, although there are some important questions (see the comments and questions below).
(EVALUATION OF THE STATE-OF-THE-ART) The state of the art is accurate and the proposal is correctly positioned.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The model is correctly presented, with the discussion mainly guided by the results of the experiments.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments are well described, the datasets are available, and the results are correctly discussed.
(OVERALL SCORE) The authors propose a new vector space model following the trend of embeddings. In this case, they propose to encode documents, word windows, and the concepts appearing in those windows together in the same space (similar to the previous work by Mikolov regarding paragraphs, by Navigli regarding encoding concepts and words together, and by Yamada regarding a regularization term that captures the influence of existing semantic relationships between words). 
Strong points: 
* The proposed model brings together different approaches, achieving a potentially solid new model. 
* They evaluate the model in two new IR scenarios where this kind of model hasn't been tested. 
Main weak points: 
* The proposed model seems to be very sensitive to noise in the concept detection. 
* The generalizability of the model is not analyzed, which might be a real problem when dealing with documents and/or concepts not previously seen by the model. 
* The improvements in terms of MAP and Recall are not impressive, and the potential loss of generalizability might not be worth it. 
Questions/Comments to the Authors: 
My main concern about the proposed model is how it is actually used. This drives almost all the comments I have about the paper and the model in its current state: 
* In Mikolov's proposal [ref 14 in the paper], where they include the paragraphs to expand the distributional semantics from a local frame (the window) to a broader one, they explicitly state that, for new paragraphs (those unseen by the model so far), they freeze the model weights, extend the paragraph identifiers by one, and perform a mini training batch with the words of the paragraph to obtain a vector representation of the new paragraph (see the short Doc2Vec sketch after these comments). This makes it possible to generalize their model to fresh new data. In the proposed model, the input is expanded further, and in the inner matrices we will have the weights for D (the document collection), V (the seen vocabulary), and C (the collection of seen concepts). I miss in the paper how the presence of new documents or new concepts is handled in the model. 
* Following on from the previous point, how long does it take to train this model? The input and processing times were already large for Mikolov's seminal model, larger for PV, and now the model is expanded even further. In an IR scenario, where the appearance rate of new documents might be quite high, is it expected/feasible to make the model evolve together with the corpus? 
* This is especially important when it comes to the queries, which can be even more free-form than the documents themselves. How are the queries' vectors calculated in the experiments? The titles of the documents are taken as queries, but they have to be transformed into vectors at some point. Are they included in the model during training so that they already have a vector, or are they derived/trained somehow? If so, how long does it take for each query, and how much are they affected by the concept tagging? I mean, depending on the length of the queries posed, the concept tagger might not have detected any particular concept, leading to queries that might be under-specified for the model. Have you considered this in the experiments? I assume that there might be a vector to denote the absence of concepts in the different windows; if so, how often is this vector used when querying the model? This might be important to explain the results regarding the sensitivity to correct tagging, and the fact that when there are more concepts the system disambiguates worse (it might happen that the best-trained subspace would be the one corresponding just to this no-concept vector). 
* Regarding the experiment results: 
* Quality of Document Embeddings: while the percentages are quite interesting, depending again on the generalizability of the model, TF-IDF provides a good baseline, with the benefits that it does scale when adding new documents and that its query model is well established and flexible enough. I miss a comment on this particular aspect of the model. 
* Quality of Document Embeddings: the percentage achieved by SD2V is worse than that of PV; it is not until the relationships are included that the model behaves better. The authors state that there exists a synergy. However, this also suggests that training PV with the semantic regularization term (LR) might lead to a semantics-aware model which could also improve the results. Did the authors run any experiment along this line? Any thoughts on this particular issue? 
* Document Re-Ranking: I have already included the comment about the queries in the previous point; please see above. 
* Query Expansion: the results in terms of the F-measure (calculated from the given MAP and Recall) show that in the best scenario (concept-expanded queries) SD2V performs 1.6% better than PV, and SD2VR 2.3% (always with respect to the baseline). This is interesting because, as the authors state, adding non-ambiguous concepts should improve the effectiveness of the queries, as they would be better specified. The average number of concepts in the queries is 1, while the average number of concepts in documents is 31, so it makes sense that expanding with concepts will make a difference. However, in the cross-analysis the authors have focused on the number of concepts in the documents, not in the queries themselves. I miss some comment on the influence of the number of concepts in the query on the final results. 
* Minor comment: what is the rationale behind formula (4)? Why average the vectors and use 4k in the denominator?
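As background for the comments above on unseen documents and training cost, here is a minimal sketch of how the original paragraph-vector model handles fresh documents, using gensim's Doc2Vec; the corpus and the unseen text are placeholders, and this does not cover the concept side of the proposed model:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny placeholder corpus standing in for the training collection.
corpus = ["query expansion with concepts",
          "document ranking with embeddings",
          "semantic representation of documents"]
tagged = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]
model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)

# For an unseen document, the trained weights stay frozen and a fresh vector is
# fitted with a few extra gradient steps -- the inference step referred to above.
new_vec = model.infer_vector("an unseen document about query expansion".split(), epochs=50)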

 

Review 3 (by Serena Villata)

 

(RELEVANCE TO ESWC) The paper does not target the Semantic Web community and I cannot say it advances the state-of-the-art in this context: the only connection with the Semantic Web is the use of DBpedia. The paper provides a contribution to the NLP community and to the NN community so conferences like ACL, EMNLP, and ICANN seem to be more appropriate, both for the paper to get the deserved visibility and for the evaluation process to ensure the right novelty and relevance in the community.
(NOVELTY OF THE PROPOSED SOLUTION) I'm not enough of an expert in the neural-models community to know whether this tri-partite neural model is novel.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) On the positive side, the paper tackles an interesting problem using what is nowadays a very popular approach. The paper is well written. The contribution of the paper is clearly stated in the introduction, and the comparison with the related literature seems to be appropriate (I'm not an expert in this domain). Another positive point about the paper is the fact that it proposes both a technical contribution and its experimental evaluation.
(EVALUATION OF THE STATE-OF-THE-ART) On the positive side, the paper tackles an interesting problem using what is nowadays a very popular approach. The paper is well written. The contribution of the paper is clearly stated in the introduction, and the comparison with the related literature seems to be appropriate (I'm not an expert in this domain). Another positive point about the paper is the fact that it proposes both a technical contribution and its experimental evaluation.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The description of the dataset should be improved: who annotated it? Is it available? What does it mean that the Robust04 collection results in queries with 1 concept on average, etc.? Providing a clear description of the dataset is fundamental to ensure the assessment of the experimental results. The same holds for the evaluation methodology; the three scenarios should be described in more detail. 
I strongly suggest the authors include some examples; they would be really helpful to let the reader understand the three scenarios and also the error analysis of the evaluation results, which is rather obscure at the moment.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The description of the dataset should be improved: who annotated it? Is it available? What does it mean that the Robust04 collection results in queries with 1 concept on average, etc.? Providing a clear description of the dataset is fundamental to ensure the assessment of the experimental results. The same holds for the evaluation methodology; the three scenarios should be described in more detail.
(OVERALL SCORE) *** Thanks for your rebuttal, where you clarified some of the issues I raised. I still believe that the contribution to the Semantic Web community is limited. ***
The paper presents a new tri-partite neural model to address issues like vocabulary mismatch and polysemy in information retrieval. The paper first describes the proposed neural model and then evaluates it over different tasks to show the effectiveness of the proposed method. 
On the positive side, (1) the paper tackles an interesting problem using what is nowadays a very popular approach, and the paper is well written. (2) The contribution of the paper is clearly stated in the introduction, and the comparison with the related literature seems to be appropriate (I'm not an expert in this domain). (3) Another positive point about the paper is the fact that it proposes both a technical contribution and its experimental evaluation.
On the negative side, however, there are some drawbacks that should be addressed before publication:
(1) the paper does not target the Semantic Web community and I cannot say it advances the state-of-the-art in this context: the only connection with the Semantic Web is the use of DBpedia. The paper provides a contribution to the NLP community and to the NN community so conferences like ACL, EMNLP, and ICANN seem to be more appropriate, both for the paper to get the deserved visibility and for the evaluation process to ensure the right novelty and relevance in the community. 
(2) the description of the dataset should be improved: who annotated it? Is it available? What does it mean that the Robust04 collection results in queries with 1 concept on average, etc.? Providing a clear description of the dataset is fundamental to ensure the assessment of the experimental results. The same holds for the evaluation methodology; the three scenarios should be described in more detail. 
(3) I strongly suggest the authors include some examples; they would be really helpful to let the reader understand the three scenarios and also the error analysis of the evaluation results, which is rather obscure at the moment.

 

Metareview by Roberto Navigli

 

This paper presents a method for information retrieval using deep neural networks. The paper is only partially within the scope of the conference and does not cover major issues within the Semantic Web such as RDF or Linked Data. It does use DBpedia as a data source, although this is not the focus of the paper; as such, it seems that the reviewers had many difficulties understanding this paper, and it may not be well received by a SW audience. Publishing at a conference such as SIGIR or ACL would likely be more appropriate.
The paper has generally been given positive reviews; however, there are many technical points that seem to be unclear to the reviewers. This may in part also be due to the very technical nature of this paper and its relative distance from the reviewers' core topics.

 
