Paper 90 (In-Use track)

Answering Multiple-choice Questions in Geographical Gaokao with a Concept Graph

Author(s): Jiwei Ding, Yuan Wang, Linfeng Shi, Wei Hu, Yuzhong Qu

Full text: submitted version / camera-ready version

Decision: accept

Abstract: Answering questions in Gaokao (the national college entrance examination in China) poses a great challenge for recent AI systems, where the difficulty of the questions and the lack of formal knowledge are two main obstacles, among others. In this paper, we focus on answering multiple-choice questions in geographical Gaokao. Specifically, a concept graph for geographical Gaokao is automatically constructed from textbook tables and a Chinese wiki encyclopedia, to capture the core concepts and relations in geography. Based on this concept graph, a graph search based question answering approach is designed to find explainable inference paths between questions and answer choices. We developed an online system called CGQA and conducted experiments on two real datasets created from the last ten years of geographical Gaokao. Our experiments show that CGQA generates accurate judgments and provides explainable solving procedures. Additionally, CGQA shows promising improvement when combined with existing approaches.

Keywords: concept graph; geographical Gaokao; question answering; CGQA


Review 1 (by Vanessa Lopez)


This paper presents a system to answer the kind of multiple-choice questions that students need to answer to pass the national college entrance examination in China. It focuses on a specific domain: geography.
The paper is interesting and well written. It is nicely motivated. The problem is very challenging because a QA system needs to match not only the question, but also the sentences given in the multiple-choice answers. It also needs to provide an explanation for its choice.
Another challenge is that there is a lack of formal knowledge encoded to answer these questions. To address this, the information is extracted from geographical textbooks and geographical databases with a large number of entities, such as GeoNames. The proposed approach automatically constructs a graph from the concept descriptions extracted from textbook tables and the Chinese wiki, and it looks for an inference path between the questions and the different answer options (the path is the explanation).
The main weakness of the approach presented here is that it is very brittle and evaluated only in one very specific domain. The proposed system makes strong assumptions on the type of tables from which it extracts the information. It expects that the tables in the textbook comply with a given format, where the geographical concepts appear in the first column (which may be the case for the textbook used, but that’s hardly portable to other textbooks or domains). For each concept a unique ID is assigned. The paper does not tackle the issue of the same concept appearing in more than one table (e.g., no integration is performed). The approach only considers SKOS-based relations among concepts, such as: related (if the concepts appear in the same row or in each other’s abstract), disjoint (the assumption made here wasn’t very convincing), broader, narrower, and description. No other ad-hoc relations or attributes are extracted for each concept.
It is not clear how descriptions are detected (i.e., if the cells in the same row of a concept do not contain other concepts, how does the system know what they are?), or how the system knows to pick only the tables that are in the expected format.
Issues such as portability, performance and scalability are not discussed. There is, however, an interesting discussion on the evaluation and comparison with other approaches.
Minor point: the state of the art on QA over semantic web data is missing.


Review 2 (by Andriy Nikolov)


The paper presents an approach for handling multiple-answer questions typical of Chinese geography examinations (Gaokao). The approach involves constructing a knowledge graph using information from textbooks and online sources (Baidu). This graph is then used to match the clauses from the alternative answer options and infer the correctness of the statements. The judgement over a statement (correct vs. incorrect) is inferred based on the information from the knowledge graph: a statement including incompatible clauses (a matching graph containing disjoint vertices) is assumed to be incorrect. The evaluation involves comparing the authors’ method with alternative state-of-the-art ones, with promising results reported.
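The judgement mechanism described above can be sketched in a few lines. This is a minimal illustration based solely on the review's description, not the authors' implementation; the pairwise relation sets and the abstention behaviour are assumptions.

```python
# Minimal sketch of disjointness-based judgement over a concept graph.
# `related` and `disjoint` are assumed sets of concept pairs; the real
# system searches for inference paths rather than direct edges.

def judge_statement(clauses, related, disjoint):
    """Return 'incorrect' if any two clauses match disjoint concepts,
    'correct' if every pair is connected by a 'related' edge,
    and 'unknown' (abstain) when the graph gives no evidence."""
    pairs = [(a, b) for i, a in enumerate(clauses) for b in clauses[i + 1:]]
    for a, b in pairs:
        if (a, b) in disjoint or (b, a) in disjoint:
            return "incorrect"   # a "protesting" path was found
    if all((a, b) in related or (b, a) in related for a, b in pairs):
        return "correct"         # a "supporting" path was found
    return "unknown"

related = {("tropical rainforest zone", "hot and rainy")}
disjoint = {("tropical rainforest zone", "cold and dry")}
print(judge_statement(["tropical rainforest zone", "cold and dry"],
                      related, disjoint))  # incorrect
```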
In my view, the paper and the method are definitely relevant for the conference. In particular, I found the inference approach with supporting and protesting paths an interesting one.
On the other hand, I am not sure if the paper suits well the in-use track goals. There is no description of a practical/industry use case and the project looks like a purely research-oriented one. In my opinion, the paper would be more appropriate for the research track.
Some questions, which would be nice to clarify:
- The evaluation section discusses both the correctness of the automatically constructed knowledge graph and the performance of the judgement/explanation procedure. Is the knowledge graph used for the judgement of question answers the original one or the one which was manually refined based on the results of the quality assessment?
- For the NN-based approach, what was the amount of training data required to achieve the reported performance?
I would like to thank the authors for the provided rebuttal. I still have my concerns regarding the fitness of the paper to the in-use track, but considering both the use case scenario involving real-world data and the technical approach I am in favour of accepting the paper.


Review 3 (by Hannah Bast)


The paper proposes a system that automatically answers multiple-choice questions in Gaokao (a Chinese college admission exam), in the area of geography.
The system works with a complex knowledge base that, in a pre-processing step, has been automatically constructed from geography textbooks and study guides using co-occurrence information (in tables and abstracts). The knowledge base knows about concepts (e.g. volcanic landform) and which concepts are related, and about textual descriptions matching concepts (e.g. tropical rainforest zone -> hot and rainy). A given question is analyzed for matching entities and concepts (which is easy, since they are often mentioned literally in the questions). Both the question and the answer option are semantically matched to descriptions from the pre-constructed knowledge base. The similarity between the question and an option is computed using a variety of signals, including edit distance, cosine similarity, and WordNet synonymy. The system not only aims at finding the best match, but also at providing evidence (positive and negative) in the form of inference paths in the concept graph (of length as small as possible).
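The combination of lexical and embedding-based signals mentioned above can be sketched as follows. This is an illustration only, assuming a max-combination of the two scores and toy two-dimensional embeddings; the paper's actual formula (Equation 1) and signal weights may differ.

```python
# Sketch: combining a lexical (edit-distance-style) similarity with an
# embedding-based cosine similarity. The max-combination and the toy
# `emb` vectors are assumptions for illustration.
from difflib import SequenceMatcher

def lexical_sim(a, b):
    # Normalized string similarity in [0, 1] via difflib's ratio.
    return SequenceMatcher(None, a, b).ratio()

def cosine_sim(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def combined_sim(a, b, emb):
    # Take the stronger signal: near-identical strings match even
    # without embeddings; paraphrases match via embeddings.
    return max(lexical_sim(a, b), cosine_sim(emb[a], emb[b]))

emb = {"hot and rainy": [0.9, 0.1], "warm and wet": [0.85, 0.2]}
print(round(combined_sim("hot and rainy", "warm and wet", emb), 2))  # 0.99
```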
The evaluation is extensive. Two benchmarks are constructed (which is a contribution of the paper by itself). The system is compared against three standard approaches: one based on standard IR techniques, one based on word embeddings (word2vec), and one using end-to-end learning with a neural network. All four approaches, as well as combinations of them, are evaluated and compared. The results are discussed and explained in depth. The main result is that a small percentage of the questions (around one tenth) can be answered with high precision. The other questions are too hard and the accuracy quickly goes down for all approaches when answering these.
The benchmark is made publicly available. The paper contains a link to the demo. The page of the demo loaded, but the page did not work beyond that: the JavaScript console showed several "Failed to load resource: the server responded with a status of 404 ()" errors.
This is an interesting and well written paper with an extensive evaluation and an in-depth analysis of the results. The used techniques are standard, but orchestrated with care and in a meaningful way. The results are not great, but that is convincingly attributed to the hardness of the task.
I read the response letter of the authors. The demo now worked for me but confused me. I was expecting multiple-choice questions with options, as described in the paper. Instead, the demo lets the user enter single-sentence statements and judges them. I could not understand the detailed analysis in the demo, because the Chinese text was only partially translated.


Review 4 (by anonymous reviewer)


The paper presents a system for multiple-choice question answering, designed to answer questions related to Geography from the Chinese college entrance exam named Gaokao. The solution is based on various semantic technologies, in particular a "concept graph" extracted from text corpora, semantic embeddings built over text documents, and the use of structured knowledge sources.
The paper is a strong contribution to this track as it shows a significant impact of semantic technologies in an important application area. The overall framework is novel, and the results of the evaluation are very promising. The system also has an online working demo and the data sets are made publicly available.
The authors have done an excellent job translating the Chinese text into English to make the problem and the solution very clear. The solution sounds very general and applicable to other languages, but it would be great if the authors could comment on the particular requirements for Chinese and what parts may break or become easier for another language like English.
In Equation 1 in Section 4.1, if the lexical similarity is high, then what is the point of using a similarity based on embeddings? And if the semantic similarity is very high, is it important to take the lexical similarity into account? Perhaps you could give an example to help explain the reasoning behind this way of combining the scores, similar to your example regarding antonym adjectives in the following paragraph.
In your experiments, couldn't you modify the other approaches (IR-, WE-, and NN-based) so that they also avoid returning an answer when the confidence score is low? For example, using a threshold on the similarity of the documents/statements returned by Lucene in the IR-based approach, or a threshold on the cosine similarity in the WE-based approach? Then you could have a more direct comparison and include all the approaches in a table like Table 2.
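The modification suggested above amounts to a thin abstention wrapper around any baseline scorer. The following is a hypothetical sketch: `score_fn`, the toy `overlap` scorer, and the threshold value of 0.6 are all illustrative assumptions, not from the paper.

```python
# Sketch: abstain when the best option's confidence falls below a threshold,
# so precision-oriented baselines become comparable with CGQA.
def answer_with_abstention(question, options, score_fn, threshold=0.6):
    """Pick the best-scoring option; return None (abstain) if even the
    best score is below the (hypothetical) threshold."""
    best_score, best_opt = max((score_fn(question, o), o) for o in options)
    return best_opt if best_score >= threshold else None

# Toy scorer: Jaccard overlap of words between question and option.
def overlap(q, o):
    qs, os = set(q.split()), set(o.split())
    return len(qs & os) / max(len(qs | os), 1)

print(answer_with_abstention("monsoon brings summer rain",
                             ["monsoon brings winter snow",
                              "monsoon brings summer rain"], overlap))
```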
Update after rebuttal:
Thank you for your response to my questions. Regarding setting a threshold for the other approaches, if time and space permit, I still recommend adding results showing the effect of doing so. It seems you argue that such results would show the difficulty of setting a threshold, which would further motivate the need for your approach.


Review 5 (by Anna Tordai)


This is a metareview for the paper that summarizes the opinions of the individual reviewers.
The reviewers agree that this paper demonstrates the impact of semantic technologies on a real-life problem. It includes an extensive evaluation, and all data is available online. There are questions regarding the generalizability of the approach beyond the application domain, especially given the strong assumptions about the format of the data.
Laura Hollink & Anna Tordai

