Geo-semantic labelling of Open Data
Author(s): Sebastian Neumaier, Axel Polleres
Full text: submitted version
Abstract: In the past years Open Data has become a trend among governments to increase transparency and public engagement by opening up national, regional, and local datasets. However, while many of these datasets come in semi-structured file formats, they use different schemata and lack geo-references or semantically meaningful links and descriptions of the corresponding geo-entities.
We aim to address this by detecting geo-entities in the datasets found in Open Data catalogs and in their respective metadata descriptions, and by linking them to a knowledge graph of geo-entities. This knowledge graph does not yet readily exist, though, or at least not as a single one: so, we integrate and interlink several datasets to construct our (extensible) base geo-entities knowledge graph: (i) the openly available geospatial data repository GeoNames, (ii) the map service OpenStreetMap, (iii) country-specific sets of postal codes, and (iv) the European Union’s classification system NUTS.
As a second step, this base knowledge graph is used to add semantic labels to the open datasets, i.e., we heuristically disambiguate the geo-entities in CSV columns using the context of the labels and the hierarchical graph structure of our base knowledge graph. Finally, in order to interact with and retrieve the content, we index the datasets and provide a demo user interface. Currently, we have indexed resources from four Open Data portals and allow search queries for geo-entities as well as full-text matches at http://data.wu.ac.at/odgraph/
Keywords: geo-entity extraction; geospatial labelling; geo-entity disambiguation; open data; linked data; geonames; openstreetmap
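The column-level disambiguation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy knowledge graph `ENTITIES`, its identifiers, and the voting rule in `label_column` are assumptions made purely for illustration. The idea shown is that ambiguous labels in a CSV column are resolved towards the candidate whose parent regions are shared by most other cells of the same column.

```python
from collections import Counter

# Hypothetical toy knowledge graph: entity id -> (label, parent id).
# The ids and the hierarchy are made up for illustration only.
ENTITIES = {
    "at": ("Austria", None),
    "us": ("United States", None),
    "vienna_at": ("Vienna", "at"),
    "vienna_us": ("Vienna", "us"),
    "linz_at": ("Linz", "at"),
    "graz_at": ("Graz", "at"),
}


def candidates(label):
    """Return all entity ids whose label equals the cell value (case-insensitive)."""
    wanted = label.strip().lower()
    return [eid for eid, (name, _) in ENTITIES.items() if name.lower() == wanted]


def ancestors(eid):
    """Return the chain of parent ids of an entity, nearest parent first."""
    chain = []
    parent = ENTITIES[eid][1]
    while parent is not None:
        chain.append(parent)
        parent = ENTITIES[parent][1]
    return chain


def label_column(values):
    """Label each cell with the candidate whose ancestors are most common in the column."""
    votes = Counter()
    for value in values:
        # Each cell votes once for every distinct ancestor of any of its candidates.
        seen = set()
        for eid in candidates(value):
            seen.update(ancestors(eid))
        votes.update(seen)

    labels = []
    for value in values:
        cands = candidates(value)
        if not cands:
            labels.append(None)  # no match in the knowledge graph
            continue
        # Prefer the candidate whose parent regions received the most votes.
        labels.append(max(cands, key=lambda eid: sum(votes[a] for a in ancestors(eid))))
    return labels


print(label_column(["Vienna", "Linz", "Graz"]))
```

In this toy column, "Vienna" is ambiguous between two entries, and the Austrian parent shared with "Linz" and "Graz" decides the label. A real system would additionally need fuzzy matching, thresholds, and a far larger hierarchy.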
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper is concerned with linking open data sets and thus is relevant to ESWC with regard to "the generation, maintenance and curation of links within and across datasets", as mentioned in the description of the Linked Data track. (NOVELTY OF THE PROPOSED SOLUTION) The generation of the geo-spatial knowledge graph is mainly based on augmenting the already existing GeoNames ontology by adding postal codes (by name matching) as well as NUTS identifiers (relations extracted from WikiData) to entities and extending entities to contain OpenStreetMap features. While this yields an augmented geo-spatial knowledge graph, the methodological value of the approach is limited. Similarly, linking column entries in CSV files is straightforward: it uses simple string matching to link individual column entries to geo-spatial entities. The disambiguation is done by assuming homogeneous geo-spatial parent entities of the entries in each column. Thus, even though an augmented knowledge graph is attained, I do not see a significant methodological contribution. Also, since it is hardly discussed and not evaluated how this augmented knowledge graph impacts labeling performance (as opposed to only using GeoNames or some graph derived from OpenStreetMap), I currently judge the novelty to be rather limited. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Generally, the proposed method adds additional information to an existing geo-spatial knowledge graph (GeoNames), and as such is implicitly correct. However, completeness is lacking in several aspects: only one way of constructing the graph and matching column entries is proposed. For example, parameter influence (thresholds) or alternatives are not discussed. This includes, for example, an argument on why the four used information sources were used (because they fit the data well [needs proof of course]; because there are no other valuable data sources, ...).
Finally, some design choices may influence performance, e.g., the assumption that columns need to be geo-spatially homogeneous. Does this reduce performance in some cases? A similar effect may be due to restricting the labeling procedure based on the meta-data (e.g., 3.3.3). (EVALUATION OF THE STATE-OF-THE-ART) I am missing an evaluation of alternative approaches and their comparison, for example only using GeoNames, only using some graph derived from OSM, or something similar. Also, some design choices are not evaluated (see Completeness). Additionally, the given related work section needs to relate more closely to the presented article, _explicitly_ stating how the proposed method actually improves on or differs from the covered related work. Also, a general overview of what is covered in the related work section is missing (at the beginning of the paragraph), leaving the reader with a list of work she has to connect to the relevant parts of the article herself. Two side notes: the "ontology list" at the beginning reads as if W3C Geospatial Ontologies, Darwin Core, or GeoSPARQL Schemas are themselves ontologies, but as far as I can tell, they represent schemata or vocabularies. This should be clarified. Also, the authors claim that the six "ontologies" are maintained but not accessible ... which seems to be a contradiction? (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation of the approach is rather limited. The CSV labeling is evaluated by manual inspection. The quality of the geo-spatial knowledge graph is only evaluated indirectly via this inspection. While performance numbers are given in the text, a concise overview (e.g., in a table) is missing. Results for alternatives with regard to parameters or simpler approaches (e.g., using only GeoNames) are not given.
I suggest that instead of manually inspecting the results, the authors should build a gold standard (e.g., by using their approach to pre-label CSVs and then fixing the errors) and then report similar numbers on all sub-ontologies (NUTS, GeoNames, something derived from OSM, etc.). (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code is not available and there is no direct download link to the used data, which reduces reproducibility by a large margin. Also, the augmented geo-spatial KB is not available. Similarly, the used datasets are pre-processed (e.g., extracting meta-data) using methods from another work. Thus, it is hard to judge and reproduce what kind of data is extracted. In addition, some of the algorithmic details are rather vague. Some examples: * "Then we use the code’s parent regions to select the GeoNames entry (in case there are several candidates).": selecting the parent? or something else? * Do you completely skip the NUTS identifiers not available in WikiData? * 3.1.2: why are you only looking up geo-entities with NUTS labels in OSM and not all entities from GeoNames? * 3.1.2: how do you map OSM administrative levels to NUTS levels in order to disambiguate Nominatim results? * 3.1: how does Wikidata link NUTS levels to GeoNames if, e.g., AT1 is not part of GeoNames? * Fig. 1: should arrows from GeoNames to PLZ always be bidirectional? * 3.3.1.b: a threshold is mentioned. Is it the same as above? (OVERALL SCORE) === Summary of the Paper The paper proposes a method to link content of openly available data in CSV format to geo-spatial entities (i.e., each entry in a column containing city names is linked to the geo-spatial entity representing each city). In particular, they exhaustively parse every entry in each CSV and try to match it to a geo-spatial entity from a custom-built geo-spatial knowledge graph.
The latter is constructed from several data sources, i.e., OpenStreetMap, NUTS, GeoNames and datasets of postal codes, and contains a hierarchy of places. They provide an interface to explore the constructed dataset. === Strong Points (SPs) * straightforward approach to add geo-spatial entities to any CSV data * general approach understandable * example interface provided === Weak Points (WPs) * evaluation too shallow (no controlled setting, no parameter studies, no evaluation of the knowledge graph, no performance study [hierarchy search may be expensive?]...) * methodological contribution is limited * discussion of design choices is lacking * data storage format not described * augmented data and code not directly available * not an automated process (at least not in its current implementation) * it is not clear how much manual labor is required (can I apply your approach to my own datasets easily? will you provide a labeling service?) * some concrete application scenarios should be given by the text (exploration of the data, applications in research, etc.) === Questions to the Authors (QAs) * have you experimented with different parameter settings (thresholds, knowledge graph restrictions, ...)? what are the results? * how much does each component of the geo-spatial knowledge graph contribute to the labeling procedure (e.g., how far will only using OSM get us with regard to the number of labelled entries)? * who has inspected (/evaluated) the results? researchers? practitioners? several people? a single person? * will the code and data be made available? * what is the data storage format? is it standardized? * I believe the process still needs many manual steps, including collecting various data from various data sources, and is thus not an easily automatable process. Is this correct? If not, what is the state of the system and how much effort would be needed to fully automate it?
=== Suggestions * assigning labels based on spanning columns is probably a challenging problem computationally * when zooming, the selected entity is deselected: http://www.geonames.org/2762314/politischer-bezirk-voelkermarkt.html * "hausnummer" not taken into consideration: http://data.wu.ac.at/odgraph/eswc/http://offenedaten.kdvz-frechen.de/sites/default/files/Gewerbe_Titz_final.csv === Overall Generally, the authors build an interesting resource which may be useful to augment data sets for research or exploration. However, the methodology is rather straightforward and the design choices are not well discussed. Similarly, the explanation of the approach lacks detail and the evaluation is very limited. Also, no alternative methods are evaluated. Thus, while I can see the usefulness of the presented work, it is necessary to improve on the previously mentioned aspects in order to obtain the required depth of a conference paper. Currently I see the presented paper as a workshop contribution (or for the resource track if the results and the implementation are made openly available). === After the rebuttal ==== Relevance, significance, novelty: First and foremost, I want to emphasize that I find the general concept interesting and the potential application of the paper's approach very useful. That is, on a conceptual level I very much acknowledge the novelty of the paper. That is why for me the novelty justifies an accept (however somewhat weighed against the "limited" methodological advancements), as reflected by the score.
About the notes from the authors: * "linking column entries in CSV files *at scale* on 1000s of CSV files with tens of columns is NOT straightforward": unfortunately this is neither discussed nor is an efficient solution presented; it would have been an interesting insight contributing to the evaluation of the proposed approach * "OSM data did not exist in the hierarchical form we needed: mapping the sources into one coherent hierarchical knowledge graph (which we do by geospatial mapping, not by simple string matching)": the paper states "OSM provides different administrative levels for their relations", which seems to imply a hierarchy; also, looking up labels to link OSM to GeoNames is done via an external service (Nominatim) exploiting existing links between NUTS and OSM labels (https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative). Linking streets (etc.) is done via a regional lookup of the resulting polygon (similar to a database query). If I understand this correctly, while this is a nice way of using existing links and services and I acknowledge the insight to make this connection, it cannot be considered a methodological challenge or breakthrough. ==== Evaluation/gold-standard: The gold standard I was suggesting could essentially have been built by the authors during manual inspection (by fixing missing labels instead of just counting them), i.e., I was not suggesting to label more data (even so, e.g., a crowd sourcing task could have provided a larger scale dataset if appropriate tooling can be provided; which might however be worth a paper in itself by the way). What I was suggesting is that such a gold standard (even if small) could have been used to better study the properties of the approach with regard to the factors I have mentioned (this includes, e.g., the influence of parameter choices, the value of adding postal codes, preprocessing steps [on Strings], or answering questions like: are errors mainly due to OSM, or OpenGeo?
are missing entities mainly on certain administrative areas, etc.). Currently, while I strongly believe that the approach has potential and I can also imagine that each step taken does provide additional value, unfortunately the various properties are not discussed or evaluated. In other words, the usefulness and influence of the design choices have not been illustrated in detail. This is one of the main issues I have with this paper in its current form. If some more work is put into this part, this can be a very strong paper! Additional notes on scalability and effectiveness of the implementation would have further improved the quality of the paper. I am well aware that all of the above-mentioned points will be hard to squeeze into limited space. However, some more details on the properties of the approach should have been evaluated and could have fit the paper (e.g., by reducing the length of the conclusion). ==== Reproducibility/Generality: Here, making the source code available and adding READMEs will greatly improve the situation. Due to this and the authors' clarification with regard to some raised points, I am revising my scores somewhat (without a README and some basic instructions I cannot completely judge the situation though). Nevertheless, this should also be stated as one of the main contributions of the paper as it is a major selling point. ==== On notes from the authors * Thus we look-up OSM data for certain GeoNames regions. ** then why not for all GeoNames entries? Or at least for all the leaves? This would yield a more complete knowledge graph, would it not? * "on the alignment of OSM admin levels and NUTS": I did not read this from the text ... probably a wording issue ... maybe try to clarify, especially because I stumbled over this fact several times.
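The gold-standard evaluation suggested in this review could be sketched roughly as follows. This is an illustration only, not something the paper or the reviewer provides: the function name `evaluate`, the cell identifiers, and the `(source, entity id)` label format are assumptions. It reports precision and recall per source, so that errors can be attributed to GeoNames, NUTS, or OSM labels separately.

```python
from collections import defaultdict


def evaluate(gold, predicted):
    """Compute per-source (precision, recall).

    gold, predicted: dict mapping a cell id to a (source, entity id) tuple,
    or to None when the cell has (or was given) no geo-label.
    """
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for cell, gold_label in gold.items():
        pred_label = predicted.get(cell)
        if pred_label is not None and pred_label == gold_label:
            stats[pred_label[0]]["tp"] += 1
            continue
        if pred_label is not None:
            # A label was assigned, but it is wrong or should not exist.
            stats[pred_label[0]]["fp"] += 1
        if gold_label is not None:
            # The correct label was missed.
            stats[gold_label[0]]["fn"] += 1

    report = {}
    for source, s in stats.items():
        assigned = s["tp"] + s["fp"]
        expected = s["tp"] + s["fn"]
        precision = s["tp"] / assigned if assigned else 0.0
        recall = s["tp"] / expected if expected else 0.0
        report[source] = (precision, recall)
    return report


# Made-up placeholder data: four cells of a hand-corrected CSV sample.
gold = {
    "c1": ("geonames", "g1"),
    "c2": ("nuts", "n1"),
    "c3": ("osm", "o1"),
    "c4": None,
}
predicted = {
    "c1": ("geonames", "g1"),
    "c2": ("geonames", "g2"),
    "c3": None,
    "c4": ("osm", "o2"),
}
report = evaluate(gold, predicted)
print(report)
```

Running the same function on labelings produced with different thresholds, or with individual sources removed from the knowledge graph, would give the kind of parameter and ablation comparison asked for above.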
Review 2 (by Pieter Colpaert)
(RELEVANCE TO ESWC) This paper describes a framework to extract geo-semantic labels from unstructured data on the Web. While the framework is a contribution as a resource, it is difficult to see this as a reasonable contribution to the research track of ESWC2018. (NOVELTY OF THE PROPOSED SOLUTION) Different open data enrichment approaches exist (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Framework and implementation are well done! (EVALUATION OF THE STATE-OF-THE-ART) State of the art with regards to geospatial data is complete. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Well described (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Did not try to reproduce the results, as the paper did not draw heavy conclusions from them. (OVERALL SCORE) # Introduction → Linked Data to the rescue? Probably one of the most difficult tasks we have as semantic web researchers is motivating our choice for RDF as a technology. Too many times has our research been criticized for the fact that we seem to have a hammer that makes all data problems look like nails. Instead, I would like to see a motivation where Linked Data and RDF technologies are indeed the better solution, by reasoning or by evaluation and comparison. Even worse: if you could indeed reliably categorize all geospatial entities in unstructured open datasets, wouldn’t this be a case for open data maintainers to keep publishing simple CSV files? Linked Data is key to decentralization; I would love to read the authors’ view on what needs to be decentralized, and how, in the introduction. # Background Typos: * etc.. * OSM was *funded* in 2004 # Approach Mapping postal codes to GeoNames: what’s wrong with http://download.geonames.org/export/zip/ ? It is unclear how you handle the legal constraints for the harvested open datasets such as attribution, share alike or even closed data for e.g., the postcodes in the UK.
# Indexed data & Search interface Good idea to do an evaluation by sampling the dataset. An overview of the results would be nice. # Conclusion Are your test results conclusive? What did you learn from your results? The conclusion mentions sufficiently correct, particularly useful and adequate, but no definitions are provided of what this means. Linked Open Data research is also researching the organizational effects and business models. It is unclear what will happen to the framework: how is/will it be funded? Should this be funded in this way, or should data publishers do some more work? This is a centralization effort: will it keep existing forever? --- After rebuttal I (still) find the research contribution too low for the research track. I will accept the paper however for resources track if some effort could be put in the README.md of the repository.
Review 3 (by anonymous reviewer)
(RELEVANCE TO ESWC) The authors tackle an important issue, namely the lack of formal semantics in Open Data. While there is much data out there, in general it is hardly reusable because it is too often difficult to understand what the data express. The authors tackle this issue by attempting to establish links between the strings found in datasets (specifically for places) and more expressive and less ambiguous resources. The authors use semantic technologies to tackle the issue. The paper is thus relevant to ESWC, at least from an application point of view. (NOVELTY OF THE PROPOSED SOLUTION) In my opinion, the approach is not very inspiring and probably not novel. It is actually surprising how much effort is required to build a geo-entities database, even though the semantic web community has been arguing for years that this should all become easy. It may be easier but still not entirely easy. My main problem with the approach is foundational. I don't quite understand why to try to link strings in datasets with resources representing geo-entities after the datasets have been published. Why not do this work before the datasets are published? Why not work with the data publishers to get it done in data curation before publication? It seems extremely wasteful to attempt this after publication. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) As far as I can tell, the solution seems correct and complete. (EVALUATION OF THE STATE-OF-THE-ART) The authors provide a reasonable albeit not thorough review and evaluation of the state of the art. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is demonstrated well but not discussed in much depth. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The work is not reproducible. Indeed, I would encourage authors to publish source code and data used, rather than provide pseudo-code descriptions (see, e.g., Section 3.3) of algorithms. 
I would rather like to see this space used for a critical discussion of the approach and how it compares to other approaches found in the literature. (OVERALL SCORE) I have provided answers to some of these points above. The work surely highlights an important issue: Much of published Open Data is hardly interoperable and reusable. The authors present how this could be improved but in my opinion this should be done bottom-up, before data are published, not top-down, using approaches that will always rely on heuristics that will never perform perfectly. Overall, the paper is easy to read and can be approached by a large audience. It raises an important issue. The proposed approach is defensible, although in my opinion more should be done to publish interoperable and reusable data in the first place. The language of the paper is OK but a revision could still benefit from a careful read.
Review 4 (by Steven Moran)
(RELEVANCE TO ESWC) See comments below. (NOVELTY OF THE PROPOSED SOLUTION) See comments below. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) See comments below. (EVALUATION OF THE STATE-OF-THE-ART) See comments below. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See comments below. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) See comments below. (OVERALL SCORE) See comments below.
Metareview by Hala Skaf
The submission proposes an approach to add geo-spatial entities to any CSV open data. While the topic of the submission is related to ESWC, the submission lacks scientific contribution, methodology and discussion of design choices. As pointed out by the reviewers, the process of adding entities is not automated, and it is not clear how much manual labor is required. The evaluation is narrow and imprecise: the quality measure is vague and there are no baselines. Reviewers agree that the proposed framework cannot be considered as a contribution for the research track of ESWC2018. The proposed framework could be considered a contribution as a resource if more effort is put into reproducibility and a README is added to the website.