GSP (Geo-Semantic-Parsing): Geoparsing and Geotagging with Machine Learning on top of Linked Data
Author(s): Marco Avvenuti, Stefano Cresci, Leonardo Nizzoli, Maurizio Tesconi
Full text: submitted version
Abstract: Recently, user-generated content in social media has opened up alluring new possibilities for understanding the geospatial aspects of many real-world phenomena. Yet, the vast majority of such content lacks explicit, structured geographic information. Here, we describe the design and implementation of a novel approach for associating geographic information with text documents. GSP exploits powerful machine learning algorithms on top of the rich, interconnected Linked Data in order to overcome limitations of previous state-of-the-art approaches. In detail, our technique performs semantic annotation to identify relevant tokens in the input document, traverses a sub-graph of Linked Data to extract possible geographic information related to the identified tokens, and optimizes its results by means of a Support Vector Machine classifier. We compare our results with those of four state-of-the-art techniques and baselines, on ground-truth data from two evaluation datasets. Our GSP technique achieves excellent performance, with a best F1 = 0.91, markedly outperforming benchmarked techniques that achieve F1 < 0.78.
Keywords: Geoparsing; machine learning; linked data; Twitter
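As a rough illustration of the three-stage pipeline the abstract describes (semantic annotation, Linked Data traversal, SVM-based candidate filtering), consider the sketch below. All names, data structures, and features are hypothetical assumptions, and the trained SVM of the final stage is stood in for by a simple confidence threshold; this is not the authors' implementation.

```python
# Hypothetical sketch of a GSP-style pipeline; names and features are
# illustrative, not taken from the paper's actual implementation.
from dataclasses import dataclass

@dataclass
class Candidate:
    token: str        # annotated token from the input document
    lat: float
    lon: float
    link_conf: float  # confidence of the semantic annotation / link

def annotate(text, gazetteer):
    """Stage 1 (sketch): spot tokens that match known resources."""
    return [tok for tok in text.split() if tok in gazetteer]

def expand(tokens, geo_graph):
    """Stage 2 (sketch): follow links from each annotated token to
    collect candidate coordinates reachable in the graph."""
    candidates = []
    for tok in tokens:
        for lat, lon, conf in geo_graph.get(tok, []):
            candidates.append(Candidate(tok, lat, lon, conf))
    return candidates

def filter_candidates(candidates, threshold=0.5):
    """Stage 3 (sketch): in place of the paper's trained SVM, keep
    candidates whose link confidence exceeds a fixed threshold."""
    return [c for c in candidates if c.link_conf >= threshold]
```

In the paper, the filtering stage is learned from labeled features of the texts, links, and resources rather than fixed by hand; the threshold here merely marks where that classifier would sit.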
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) By combining semantic web technologies, annotation of text documents, and machine learning, the work is clearly of interest to ESWC. (NOVELTY OF THE PROPOSED SOLUTION) I like that the authors tackle a well-constrained problem and propose an interesting solution. To the best of my knowledge, the proposed solution is indeed novel. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution is presented and evaluated with a good level of detail. The presentation seems to be correct and sufficiently complete. (EVALUATION OF THE STATE-OF-THE-ART) Related work is concise and sufficient, considering the page limitation. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation is appropriate. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors could improve the reproducibility of the work and results by publishing their data and software. The authors may want to consider publishing these resources on, e.g., figshare, and citing them by means of the DOI that figshare provides. I expect that a different experiment would come to similar conclusions. (OVERALL SCORE) The authors describe an approach for automatically associating geographic coordinates with text documents. This is a well-constrained problem and the proposed approach is interesting. The results speak strongly for the approach. As strong points, the paper is well written and approachable by a wide audience. The problem and solution are interesting. The description is clear, including the more formal parts of the paper. As weak points, the authors should consider publishing the implementation. Some of the statements appear overly confident, e.g. "However, to date no working solution has ever been proposed to perform geoparsing and geotagging of text documents by exploiting Linked Data." While I do not have a counter-example, I doubt this is factually correct. The paper could be improved with another round of proofreading (e.g.
"number of geographic infromation retrieved"; "knowledge-base" (no dash); "the 2 evaluation datasets" (s/2/two/g, multiple) as well as occasional rephrasing for clarity.
Review 2 (by Ralph Ewerth)
(RELEVANCE TO ESWC) The task of geoparsing addressed by the paper is relevant to the conference. (NOVELTY OF THE PROPOSED SOLUTION) The authors use available annotation tools to identify place tokens. The contribution of the paper is the exploration of Linked Data to enrich retrieved geographic information, as well as the rejection of false GPS estimations via an SVM classifier. The idea of exploring Linked Data for geospatial resources has also been investigated by prior works (as indicated in this paper). However, those approaches concentrate on supporting use cases in journalism, finding contextual geospatial information related to geoparsed content. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The description and the implementation of the method are technically sound. (EVALUATION OF THE STATE-OF-THE-ART) The related work is sufficiently covered and discussed. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties of the proposed approach are discussed in sufficient detail. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) State-of-the-art approaches are included in the experiments on selected benchmark sets. The description of the evaluation datasets and metrics is sufficient. Information on the tools used and other technical details is provided. However, the authors use a distance threshold within which a geographic match is considered correct, but do not specify the exact threshold value. Nevertheless, the threshold is consistent among all evaluated approaches. The suggested solution outperforms the reference systems. (OVERALL SCORE) This paper introduces a novel approach for assigning geographic information to text documents by exploiting Linked Data. The approach identifies relevant text tokens via available semantic annotators and traverses Linked Data to extract geographic candidates. Then, irrelevant candidates are pruned by means of a Support Vector Machine classifier.
The suggested framework significantly outperforms state-of-the-art approaches and the authors' own baseline method on two selected evaluation sets.
Strong points:
+ Linked Data provides more potential geographic candidates and thereby more precise geographic estimations
+ Significant improvement over the state of the art
+ The paper is very well written and structured
Weak points:
- The geospatial granularity of the predictions could have been elaborated in more detail
- Information regarding the distance threshold is missing
- Graphics (especially in the evaluation) could be enriched with more information (e.g. the number/fraction of places retrieved at each geospatial level)
Question: Was one distance threshold setting used, or were different threshold settings used for coarse to fine toponyms, as in the mentioned MediaEval 2016 Placing Task?
--------------------------------
*** After the authors' response: We thank the authors for their response and additional explanations; they also replied to our question above (#R2). There are no changes to my scores.
Review 3 (by Krzysztof Janowicz)
(RELEVANCE TO ESWC) This paper proposes an innovative approach to geoparse and geotag unstructured information (e.g., tweets and documents). In the approach, the authors creatively use machine learning techniques (e.g., SVM and feature selection) together with Linked Data (e.g., semantic annotation) to enrich and improve the geoparsing and geotagging results. Both the topic and the proposed methods are state-of-the-art and highly relevant to the conference's interests. (NOVELTY OF THE PROPOSED SOLUTION) The proposed solution is novel in at least two senses: (1) it takes advantage of Linked Data to enrich the annotated nodes from one resource to many related resources, thus providing a way to increase the recall of the task; (2) by using features that characterize the quality/confidence of the texts, links and resources where uncertainties might be introduced, the authors apply an SVM to train a model that filters out candidates that might not be correct. This step increases the precision of the approach. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed approach is organized and discussed in a very detailed manner, and most of the contents make sense. In addition, the notations, pseudocode and tables in the paper are self-explanatory. The authors improve their approach not only from a recall perspective, by expanding the nodes using various knowledge bases, but also from a precision perspective, by filtering the parsed candidates considering a set of features. (EVALUATION OF THE STATE-OF-THE-ART) Geoparsing and geotagging from unstructured data is currently a hot topic; it helps geographically enrich information retrieval and discovery. This paper proposes a new approach to this topic, evaluates its performance by comparing precision, recall, accuracy and F1 with state-of-the-art geoparsing and geotagging techniques, and shows that it outperforms all the others.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors did a great job of explaining the properties of the proposed approach in the paper. For example, in Section 4.2, the authors explicitly bring up the point that the capability of the approach depends on its ability to parse as many as possible of the RDF predicates that store geographic information, and discuss how they solve this issue in their work. In addition, the authors discuss the limitation of the proposed approach in terms of its relatively low recall in the Conclusion section, and discuss potential solutions. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) As mentioned before, the paper discusses each step of the approach in great detail and the pseudocode is self-explanatory. The data and APIs used in the paper are well documented and also open source. One thing that might affect the reproducibility of this work is that, when using the second evaluation dataset (ITA-DSTR), the annotated places/locations might be hard to find. (OVERALL SCORE) Summary of the paper: This paper combines techniques from both Linked Data and machine learning to improve the performance of geoparsing and geotagging on unstructured data like tweets and documents. The experiments conducted in this paper demonstrate promising performance compared to other approaches. Strong points: (1) The proposed methods are innovative. (2) The paper is well written and the approach is comprehensively designed and discussed. (3) The experiments are well designed and the results are promising. Weak points: (1) Some notations are not clear. (2) The related work could be extended. (3) More machine learning techniques (potentially simpler ones than SVM) could be explored. Questions to the authors: (1) In the Introduction, the authors mention several challenges of geoparsing and geotagging, like toponymic polysemy and the limited amount of context.
But in the approach and experiment parts, I did not explicitly see how the proposed approach solves these issues. Could you explain? Some specific examples might be helpful. (2) What do the subscripts i and j represent in the notation of the RDF sources (epsilon)? Although I could figure it out from the context of the paper, it would be better to define them explicitly the first time they are used. (3) In Section 3.1, when deciding the coordinates that represent a set of resources, the authors propose to do the clustering/binning first; then, when two or more clusters contain the same number of elements, a score calculated from the reliability of the resources is used to determine which cluster to choose. I wonder why this reliability information is not considered simultaneously, together with the number of elements in the clusters, when deciding which cluster to choose?
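The reviewer's last question contrasts two selection rules for coordinate clusters: the paper's (majority size first, reliability only to break ties) and a joint score combining size and reliability. The difference is easy to see in a few lines; the cluster representation, the summing of reliabilities, and the weighting parameter `alpha` below are all illustrative assumptions, not the paper's actual formulas.

```python
# Each cluster is a list of (coordinate, reliability) pairs; the
# representations and scoring details are hypothetical illustrations.

def pick_cluster_tiebreak(clusters):
    """Majority rule: take the largest cluster; use the summed
    reliability of its resources only to break ties in size."""
    best_size = max(len(c) for c in clusters)
    tied = [c for c in clusters if len(c) == best_size]
    return max(tied, key=lambda c: sum(rel for _, rel in c))

def pick_cluster_joint(clusters, alpha=1.0):
    """Reviewer's alternative: score size and reliability together,
    with a hypothetical weighting parameter alpha."""
    return max(clusters, key=lambda c: len(c) + alpha * sum(rel for _, rel in c))
```

A small cluster of highly reliable resources can lose under the tie-break rule but win under the joint score, which is exactly the trade-off the reviewer is asking about.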
Metareview by Andreas Hotho
This paper proposes a context-agnostic Geo-Semantic-Parsing approach using machine learning to associate geographic locations with places mentioned in social media text. While all reviewers like the work, weak points are mentioned in the reviews, which are partially addressed by the authors in the response. Given the clear contribution of the work and the agreed positive recommendation of the reviewers, we accept the work.