Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders
Author(s): Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frederique Laforest, Jonathon Hare, Elena Simperl
Full text: submitted version
Abstract: While Wikipedia exists in 287 languages, its content is unevenly distributed among them. It is therefore of utmost social and cultural importance to focus efforts on languages whose speakers only have access to limited Wikipedia content. In this work, we investigate supporting communities by generating summaries for Wikipedia articles in underserved languages, given structured data as an input.
We focus on an important support for such summaries: ArticlePlaceholders, which are dynamically generated content pages in underserved Wikipedia versions. They enable native speakers to access existing information in Wikidata, a structured Knowledge Base (KB). To extend those ArticlePlaceholders, we provide a system which processes the triples of the KB as they are provided by the ArticlePlaceholder and generates a comprehensible textual summary. This data-driven approach is employed with the goal of understanding how well it matches the communities’ needs on two underserved languages on the Web: Arabic, a language with a big community with disproportionate access to knowledge online, and Esperanto, an easily-acquainted, artificial language whose Wikipedia content is maintained by a small but devoted community. With the help of the Arabic and Esperanto Wikipedians, we conduct a study which evaluates not only the quality of the generated text, but also the usefulness of our end-system to any underserved Wikipedia version.
Keywords: Multilinguality; Wikipedia; Natural Language Generation; Wikidata; Esperanto; Arabic; Neural Networks
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) This paper proposes methods to generate natural language summaries from structured data.
(NOVELTY OF THE PROPOSED SOLUTION) The proposed approach applies an existing natural language generation model trained on Wikipedia data. The main novelty is its application to different underserved languages.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is correct and the created training dataset is very interesting. The technical novelty is limited, but it is still an interesting contribution.
(EVALUATION OF THE STATE-OF-THE-ART) The selected baselines are good. Unfortunately, the MT approach is not reproducible, and it is not clear how it works because it is served through proprietary APIs.
Minor comments:
- Broken reference in Sect. 2.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is compared with good baselines and the evaluation measures have been selected correctly. The paper presents comprehensive results both by means of experimental evaluation and by involving end users. Results show the high quality of the generated textual summaries.
Minor comment:
- The training data for underrepresented languages is presumably small. How robust is your method to a limited amount of training data?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code and the data are available.
(OVERALL SCORE)
** Summary of the Paper
This paper proposes methods to generate natural language summaries from structured data.
** Short description of the problem tackled in the paper, main contributions, and results
The proposed approach applies an existing natural language generation model trained on Wikipedia data. The main novelty is its application to different underserved languages. The paper presents comprehensive results both by means of experimental evaluation and by involving end users. Results show the high quality of the generated textual summaries.
** Strong Points (SPs)
1 Important problem
2 Comprehensive evaluation
3 Strong baselines
** Weak Points (WPs)
1 Limited technical contribution
2 No results on robustness over varying training dataset size
3 Baseline not reproducible
** Questions to the Authors (QAs)
1 How robust is your method to a limited amount of training data?
Review 2 (by Michael Granitzer)
(RELEVANCE TO ESWC) The paper presents an approach to generate (natural language) summary sentences from semantic data (i.e. sets of knowledge base triples) and is therefore highly relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The paper fails to explain what is new. The approach largely seems to be the same as in the authors' previous work, "Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples" (https://arxiv.org/abs/1711.00155), with only minor modifications.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The description of the proposed solution is incomplete, e.g. "The encoder is a feed-forward architecture which encodes an input set of triples into a vector of fixed dimensionality" is more or less the only information the reader gets about the encoder. How are sets of triples with a varying number of set items handled? The incomplete description makes it impossible to judge whether the solution is correct and complete.
(EVALUATION OF THE STATE-OF-THE-ART) While the relevant work seems to be cited, the paper fails to draw connections (or highlight differences, respectively) and to put the work at hand into context. "In contrast with these works, in our paper we extend those research work to include open-domain, multilingual summaries" - it seems as if the extension consists of applying the approach to different datasets.
Minor: one citation seems broken: "...for NLG [2,5,18,?]"
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties of the approach are not demonstrated and discussed appropriately. While the result section contains some demonstration based on the evaluation results (such as the appropriateness of the generated summaries), the discussion is missing.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The paper describes only the very basic architecture of the model used, from which it is not possible to reproduce the experimental study. I appreciate the provision of evaluation code, but unfortunately, this seems to be limited to the quantitative evaluation, using pre-trained models.
(OVERALL SCORE) The paper presents an approach to generate summaries for Wikipedia articles automatically from Wikidata triples in ArticlePlaceholders. The approach is evaluated on two underserved languages (Arabic and Esperanto), both quantitatively and qualitatively with targeted user groups (readers and editors).
Strong Points:
1) The evaluation is extensive and sound. In particular, the qualitative part is highly appreciated.
2) The encoder-decoder model seems to be a proper approach for summary generation in underserved languages.
3) The paper is well written and easy to follow.
Weak Points:
1) The approach is not adequately described, which manifests in the following points.
2) On the approach side, the paper fails to present its novelty, and therefore the contribution remains unclear. It seems as if there is little to no difference between the approach and the authors' previous work (see the Neural Wikipedian paper cited in section "novelty").
3) Due to missing information, it is impossible for the reader to implement the model and reproduce the experimental results.
As already mentioned, my main concern is that I cannot spot the difference between the approach and the authors' previous work. On the other hand, I highly appreciate the evaluation part. From this point of view, it could be reasonable to focus only on the evaluation, that is, to leave the description of the approach almost completely to the previous paper and provide only the parts strictly necessary for the paper to be self-contained. This would give room to add even more depth to the evaluation and to discuss the properties of the approach, which could provide solid groundwork for future work and other researchers. Or, in case I mistakenly overlooked the differences in the approach, I highly encourage the authors to point them out explicitly and describe the model in more depth.
Minor: The screenshots of the example survey questions might be placed in the GitHub repository (saving footnote space in the paper), and a translation to English would be nice as well.
Questions to the Authors:
1) Already mentioned in the "correctness" section: How are sets of triples with a varying number of set items handled in the encoder?
2) In the last row of Table 4 the test scores are consistently better than the validation scores. How can this behaviour be explained?
I thank the authors for their responses, which addressed parts of my concerns. I have adjusted the review accordingly.
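To make question 1 above concrete: a minimal sketch of one common way to map a set of triples of any size to a vector of fixed dimensionality is to embed each triple and average the results. This is an assumption for illustration only, not the encoder described in the paper. The hash-based embed_token function, the EMBED_DIM value, and the choice of mean pooling are all hypothetical, and the Wikidata identifiers are just sample inputs.

```python
import hashlib

EMBED_DIM = 4  # hypothetical embedding size, chosen only for illustration


def embed_token(token: str, dim: int = EMBED_DIM) -> list[float]:
    """Toy deterministic embedding: hash the token into `dim` floats in [0, 1]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def embed_triple(triple: tuple[str, str, str]) -> list[float]:
    """Concatenate subject, predicate and object embeddings into one vector."""
    vector: list[float] = []
    for item in triple:
        vector.extend(embed_token(item))
    return vector


def encode_triple_set(triples: list[tuple[str, str, str]]) -> list[float]:
    """Mean-pool the per-triple vectors so the output size ignores the set size."""
    if not triples:
        return [0.0] * (3 * EMBED_DIM)
    vectors = [embed_triple(t) for t in triples]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Sample inputs only: one set with a single triple, one with three.
small = [("Q42", "P31", "Q5")]
large = small + [("Q42", "P106", "Q36180"), ("Q42", "P27", "Q145")]

# Both sets yield vectors of the same length (3 * EMBED_DIM).
print(len(encode_triple_set(small)), len(encode_triple_set(large)))
```

Padding every input to a fixed maximum number of triples would be another option. Either way, stating which mechanism the model actually uses would answer the question.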
Review 3 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper is relevant because it aims to fill the language gap for under-resourced languages on the web.
(NOVELTY OF THE PROPOSED SOLUTION) Extension of the previous work to multiple languages.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The described approaches are scientifically correct and their outcomes are close to the quality standards of Wikipedia.
(EVALUATION OF THE STATE-OF-THE-ART) The state of the art is explained sufficiently.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Both are acceptable.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The overall evaluation is good. Nevertheless, regarding the community study, information on how the annotators were instructed to evaluate the predicted sequences is missing.
(OVERALL SCORE) The authors extend Wikipedia's ArticlePlaceholder with multilingual summaries automatically generated from Wikidata triples for underserved (under-resourced?) languages on Wikipedia. They show that their outcomes are close to the quality standards of Wikipedia and that a large portion of the generated summaries can be reused for the two languages under study, namely Arabic and Esperanto. They claim that their approach can enrich ArticlePlaceholders with textual summaries that can serve as a starting point for Wikipedia editors to write their articles. The authors have evaluated their results with automatic measures such as BLEU, METEOR, ROUGE, and template retrieval, and also with a community-based measure. The results show that generating language directly from the knowledge base triples is a much more suitable approach than machine translation.
A few suggestions:
- "underserved" and "under-resourced" are not used consistently.
- The reference to Sauper et al. is inconsistent: it appears both with and without a citation marker.
- A reference is missing ("[?]") near NLG in Related Work, Text generation.
Metareview by Maria-Esther Vidal
This paper presents a method for summarizing articles in new languages that seems to be founded on machine translation. There were significant concerns about the reproducibility of the method. The release of the code goes some way to address these concerns, but the authors should take care to make the methodology clearer in the final version of the paper. There are also some concerns about the novelty of this work; however, the extensive evaluation means that the paper will almost certainly be of interest to the audience at ESWC.