GNIS-LD: Serving and Visualizing the Geographic Names Information System As Linked Data
Author(s): Blake Regalia, Krzysztof Janowicz, Gengchen Mai
Full text: submitted version
Abstract: In this dataset description paper we introduce GNIS-LD, an authoritative Linked Dataset derived from the Geographic Names Information System (GNIS) which was developed by the U.S. Geological Survey (USGS) and the U.S. Board on Geographic Names. GNIS provides data about current, as well as historical, physical, and cultural geographic features in the United States. We describe the dataset, introduce an ontology for geographic feature types, and demonstrate the utility of recent Linked Geographic Data contributions made in conjunction with the development of this resource. Co-reference resolution links to GeoNames and DBpedia are provided as owl:SameAs relations. Finally, we point out how the adapted workflow will be used to publish complex Digital Line Graph (DLG) data from the USGS National Map in the future.
Keywords: Gazetteer; Geographic Dataset; Dereferencing Interface
Review 1 (by María Poveda-Villalón)
------- Thank you very much for the answers to our concerns about the paper. Having read the rebuttal, I'll keep my score; that is, I'm not against the paper's acceptance if the rest of the reviewers agree on that. Indeed, it would be nice to have the work presented. I'm still concerned about the URI naming. In addition, the authors mentioned that they are building on top of an ontology that has not been validated, but they are correcting it. -------

This resource paper presents a dataset for GNIS information together with visualization and publication methods to publish it as linked data. The main value of the presented resource is that it represents an authoritative version and complements existing datasets, some of them reusing the same original data. The resource at hand also provides a SPARQL endpoint. The review first goes through the set of questions proposed by the call; some comments and questions for the authors are included at the end.

*Potential impact*
Does the resource break new ground? Not completely; there are already datasets offering this data. The presented resource complements them.
Does the resource plug an important gap? Not completely; it provides alternatives to existing solutions.
How does the resource advance the state of the art? It provides new ways of accessing the data, following different models (e.g. in contrast to GeoNames), and it is based on authoritative data.
Has the resource been compared to other existing resources (if any) of similar scope? Yes.
Is the resource of interest to the Semantic Web community? Yes.
Is the resource of interest to society in general? Yes.
Will the resource have an impact, especially in supporting the adoption of Semantic Web technologies? It would be used to some extent by projects needing this kind of data.
Is the resource relevant and sufficiently general, does it measure some significant aspect? Yes.

*Reusability*
Is there evidence of usage by a wider community beyond the resource creators or their project? Alternatively, what is the resource’s potential for being (re)used; for example, based on the activity volume on discussion forums, mailing list, issue tracker, support portal, etc? I have been seen evidence of use further than the creators' applications. It will have the same potential as any other, existing or not, geographical dataset.
Is the resource easy to (re)use? For example, does it have good quality documentation? Are there tutorials available? etc. Yes.
Is the resource general enough to be applied in a wider set of scenarios, not just for the originally designed use? Yes.
Is there potential for extensibility to meet future requirements? Yes.
Does the resource clearly explain how others use the data and software? Yes.
Does the resource description clearly state what the resource can and cannot do, and the rationale for the exclusion of some functionality? Yes.

*Design & Technical quality*
Does the design of the resource follow resource specific best practices? Mostly yes. See the comments about URI design and ontology publication below.
Did the authors perform an appropriate re-use or extension of suitable high-quality resources? For example, in the case of ontologies, authors might extend upper ontologies and/or reuse ontology design patterns. Yes.
Is the resource suitable to solve the task at hand? Yes.
Does the resource provide an appropriate description (both human and machine readable), thus encouraging the adoption of FAIR principles? Is there a schema diagram? For datasets, is the description available in terms of VoID/DCAT/DublinCore? Yes.
If the resource proposes performance metrics, are such metrics sufficiently broad and relevant? Yes.
If the resource is a comparative analysis or replication study, was the coverage of systems reasonable, or were any obvious choices missing? NA.

*Availability*
Is the resource (and related results) published at a persistent URI (PURL, DOI, w3id)? Yes.
Does the resource provide a licence specification? (See creativecommons.org, opensource.org for more information) Yes, but the licence information points to a page riddled with policy links, and it is difficult to say what one can or cannot do with this resource. A simpler licence similar to the CC ones would be very helpful.
How is the resource publicly available? For example, as API, Linked Open Data, Download, Open Code Repository. Is the resource publicly findable? Is it registered in (community) registries (e.g. Linked Open Vocabularies, BioPortal, or DataHub)? Is it registered in generic repositories such as FigShare, Zenodo or GitHub? Registered in DataHub.
Is there a sustainability plan specified for the resource? Is there a plan for the maintenance of the resource? Not sure. US government agencies create and maintain the data, but the process for maintaining the resource is not clear to me.
Does it use open standards, when applicable, or have good reason not to? Yes.

--------------
Additional comments:

*Important ones*
The following two issues are the main motivations for my *current* score for this paper.
My main concern about the paper is why the authors chose a schema that does not generate unique IDs for 100% of the instances. How many instances does the remaining 4% actually represent?
The abstract claims to introduce an ontology; however, it seems that the ontology is only mentioned and referenced in footnote 5. I do think the underlying model used to transform the data, i.e. the ontology, is an important piece of the process, and it would be nice to have it described, to some extent, in the paper. In addition, the ontology could be better published and referenced. The paper provides the link https://old.datahub.io/dataset/geographic-names-information-system-gnis/resource/c5ce6131-1190-4099-998d-b8369584aec0 that points to the DataHub entry for the ontology instead of providing the ontology URI. Going to the ttl published as the ontology (http://usgs-stko.geog.ucsb.edu/resource/cegis.ttl), one gets ontology URIs such as http://data.usgs.gov/lod/gnis/ontology/ that actually raise a "Page not found" error.

*Less important*
Do the authors make consistent use of the word "Open" throughout the article? Is there a reason for having "Linked Data" in the title and, for example, in section 3, and "... to be published on the Linked *Open* Data cloud" in the first paragraph of that section? This is just a suggestion to review and make sure that the concept "Open" is only included when referring to licensing information, in combination with "Linked Data" for the technology behind it, in case this is not done yet.
Paper organization: as an external reader, the sequence of sections "3 Converting GNIS To Linked Data" -> "5 The Dataset" -> "4 User Interface" seems more suitable for understanding the resource, reading from conversion to visualization rather than the current flow.
In section 3, more details about property transformations and links between entities would be advisable.
Page 8 line 3: how is that hierarchical structure generated? Where is it available?
Page 2, reference  "under review": I cannot check this reference. I'm not complaining about this particular case but about this practice overall in the community. It might be necessary to establish some rules about this kind of reference, for example only using them when the paper under review is available, as in journals such as the "Semantic Web Journal", or when a link to a pre-print or online version is provided.

*Minor*
.- The link http://sparqles.ai.wu.ac.at/ on page 2 does not load; it might be http://sparqles.ai.wu.ac.at/api
.- Broken sentence on page 8, second paragraph: "Overall, the data"
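
The dereferencing concern raised under *Important ones* can be reproduced with a short content-negotiation check. The following is a minimal sketch, not the authors' tooling: it requests an RDF serialization of an IRI and reports whether the response is RDF. The helper names (`build_rdf_request`, `is_rdf_content_type`, `check_iri`) and the list of accepted media types are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Media types treated as RDF serializations (illustrative, non-exhaustive list).
RDF_MEDIA_TYPES = (
    "text/turtle",
    "application/rdf+xml",
    "application/ld+json",
    "application/n-triples",
)


def build_rdf_request(iri: str) -> urllib.request.Request:
    """Build a GET request whose Accept header asks for RDF."""
    return urllib.request.Request(iri, headers={"Accept": ", ".join(RDF_MEDIA_TYPES)})


def is_rdf_content_type(content_type: str) -> bool:
    """Return True if a Content-Type header value names an RDF media type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in RDF_MEDIA_TYPES


def check_iri(iri: str, timeout: float = 10.0) -> str:
    """Dereference an IRI and summarize the outcome (performs a network call)."""
    try:
        with urllib.request.urlopen(build_rdf_request(iri), timeout=timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            kind = "RDF" if is_rdf_content_type(content_type) else "not RDF"
            return f"{resp.status} {content_type} ({kind})"
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses such as "Page not found".
        return f"HTTP error {err.code}"
    except urllib.error.URLError as err:
        return f"unreachable: {err.reason}"
```

Calling `check_iri` on an ontology IRI cited in this review, such as http://data.usgs.gov/lod/gnis/ontology/, would show whether it resolves to RDF; the outcome depends on the state of the server at the time of the call.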
Review 2 (by Daniel Garijo)
----------AFTER REBUTTAL--------
After seeing the other reviews and rebuttal, I am happy to maintain my original score. In the final revision of the paper I would like to encourage the authors to:
* Clearly specify the license of the resource, as the link they provide goes to a page with multiple links instead of a clear text with the license.
* Add the link to the source code used for the transformations. They mention in the response that this is referenced in the text, but I wasn't able to see it when reviewing the manuscript.

---------ORIGINAL REVIEW--------
This paper describes a resource for serving the Geographic Names Information System, from the US Geological Survey, as Linked Data. The paper also presents an application showcasing the usage of the dataset with interactive maps in both human and machine readable representations.

The paper is well written and easy to follow. I find it very relevant to the ESWC conference, and I would like to see more authoritative Linked Data, like the one described in this resource. The resource is available, useful (as shown with the application) and modeled using common standards and conventions. I think that having a sustainability plan is an excellent idea, although the paper does not specify the license of the resource (apart from being open). I also loved having content negotiation for geometries using geo standards.

There are a few details that could be improved in the camera-ready version of the paper:
- The scripts used for the conversion are not available. I think they could be a nice addition if someone aims to convert similar types of data.
- There is no description of how different updates and versions of resources are handled. For example, if the location of a URI is changed, will there be a record of the change?
- What is the community adoption of the resource?
- Prefixes in Listing 1 are not declared.
- The SPARQL endpoint has a long URL, which makes it less accessible.
- There are a few typos in the paper. In particular, on page 8 there is an unfinished sentence "Overall, the data ..."
Review 3 (by Antoine Isaac)
This paper presents the publication by the US Geological Survey of their gazetteer, GNIS. Arguably the publication is not utterly innovative in terms of technology used and application scenarios enabled. However, it makes a few good points about authority, both for database content and model, and sustainability. Datasets like this one are needed for enabling applications that cannot trust (any more) less controlled sources like GeoNames.

There are some negative points, but I believe they can be resolved in the final version - I am counting on the rebuttal phase to receive some answers.
- the web service is very slow, which diminishes the argument that this is an institution-backed data publication
- I am not sure where the promises of the abstract and the conclusion (“we point out how the adapted workflow will be used to publish complex Digital Line Graph (DLG)”, “We presented preliminary work for how this resource aligns with upcoming datasets such as the DLGs and National Map data more broadly as well as with other authoritative data sources such as USGS WaterWatch sensor data.”) are described. I guess this happens in section 2, but it is not clear.
- I am not sure how the strategy for creating URIs (which is good to report on, by the way) can work with features that exist across states and counties.
- the reasons for creating their own interface are a bit unclear. Is it for enabling features like conversion of data values? Couldn’t it have been done by extending an existing system like Pubby?
- the IRIs for the ontology classes (like http://usgs-stko.geog.ucsb.edu/lod/cegis/ontology/SurfaceWater) seem not to be valid from the perspective of LD publication recipes (they lead only to a web page, with no RDF representation). It fails with https://www.w3.org/RDF/Validator at least. I am afraid the claim at the beginning of section 6 is wrong.
- https://www2.usgs.gov/publishing/policies.html is not clear about the type of open license under which the dataset is released. There are some rather good definitions of what counts as ‘open’ and some examples at http://opendefinition.org/licenses/. Interestingly, https://old.datahub.io/dataset/geographic-names-information-system-gnis mentions that the data is open according to the standard definition, but “Other (Public Domain)” can hide some tricky detail (this reviewer is familiar with cases of public domain that’s actually not so open...)
- I am surprised by the low number of mappings to GeoNames and DBpedia. Do the authors have an explanation for it? How could it be improved?

Finally, there is a point that’s both positive and negative: the publication uses w3id persistent IRIs, which is great. But why isn’t this reported in the paper?

Minor points:
- publication of Linked Data resources by institutions is not as novel as the wording of the intro (p2) suggests. This reviewer is familiar with the library sector, where efforts were already well started 5-6 years ago (http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset/). Other domains have done rather well, too.
- the root URI of the ontology http://usgs-stko.geog.ucsb.edu/lod/cegis/ontology/ does not redirect.
- sparqles.ai.wu.ac.at/ does not work at the time of review, which is quite ironic...
- the prefix ago: is not defined.
- p6 typo “gazzetteer”
- p8 unfinished sentence “Overall, the data”
- p10 owl:SameAs should be owl:sameAs.
- the bibliography has a lot of capitalization issues. All acronyms were turned into lower-case

--- AFTER REBUTTAL
I have read the authors' response. I understand that space was scarce and that almost none of my comments were answered, as my overall assessment was high (if I'm not mistaken, I could only find an answer about the rationale for the choice of developing a new UI). Still, I do not understand why so much was copied from the reviewers' original comments. And I am frustrated to find no answer for a point as important as the licensing. I am also quite unimpressed by the fact that the answer on matching GeoNames (which could have answered one of my questions at the same time as another reviewer's) begins with "The code for this process is referenced in the paper" while there is no such reference apparent in the paper. My assessment was borderline between weak and confident; after reading the rebuttal, I have to admit I am less confident.
Review 4 (by Cassia Trojahn)
-------------------------
After the rebuttal: I acknowledge the rebuttal and thank the authors for this effort. I was expecting more convincing answers (with respect to the extensions to the existing vocabulary, what matching strategies were applied to match the proposed resource to external sources, how the alignments have been validated, and the other issues raised by the other reviewers, in particular with respect to the license). However, I can accept the arguments of the other reviewers highlighting the strong points of this paper, in particular with regard to authority. Hence, I have revised my initial scores. I hope the authors will be able to address the questions above in case the paper is accepted (for instance, saving some space by reducing the description of the user interface).
-------------------------
This paper presents the GNIS-LD dataset, a LOD dataset generated from the Geographic Names Information System (GNIS). GNIS describes historical, physical and cultural geographic features in the US. GNIS-LD is linked to GeoNames and DBpedia. Although one can see the effort put into producing such a dataset and a user interface for consuming the data, the resource is presented with insufficient detail. In particular, the sources of data are not clearly described (what are the cultural aspects and historical contexts, for instance, taken into account?), the extended ontology proposed to describe the dataset is not discussed, the matching to existing LOD datasets is not presented, and the GeoSPARQL 'extensions' are limited to a specific mechanism for content negotiation of geometries. For these reasons, I recommend rejecting this paper.

Major remarks:
- It is not clear how GeoSPARQL has been complemented with dereferenceable URIs (i.e., making specific features of geometries accessible via a URI). Is this GeoSPARQL extension available? It seems to be a straightforward way of doing content negotiation for geometries;
- What are the extensions to the USTopographic ontology? (the links to CEGIS, XSD datatypes and QUDT units?). It would be interesting to say a little more about this extended model.
- This kind of statement is quite vague: "functionalities of semantic web browsers to the user with the intend of maximising usability". I cannot see how the cited works, which seem to propose relatively simple interfaces, answer this question (for instance, "Phuzzy describes each resource in a tabular format", which is a very simple kind of visualisation). Furthermore, it is hard to see what the novelties of the proposed user interface are.
- The authors state that GeoNames does include GNIS as one of its sources. Are these previous alignments used for matching GNIS-LD to GeoNames?
- The ontologies used to describe the dataset could be described in more detail in order to see how the different geo features are linked together (from cultural to elevation, man-made features, etc.). In the conclusions, the authors mention that they provide "an ontology", which is not presented.
- As the authors propose a LOD dataset, one central question concerns the alignments to published datasets. GNIS-LD is linked to GeoNames and DBpedia. It would be interesting to know how the matching process has been done.

Minor remarks:
- it is unusual to mention future work in the paper abstract. The authors should instead focus on the paper's contributions.
- Section 1: "However, Like"
- Section 1: "To overcome this challenge" => what is the challenge here?
- Section 1: Digital Line Graph => reference?
- Section 2: geomtry
- Section 2: Vafu => reference
- Section 2: implmentation
- Section 2: what are "historical records" (on which features?)
- Section 4: the first paragraph of this section is quite long (same as the first paragraph of page 8)
- Section 4: counties?
- Section 4: Fig. x Figure
- Section 5: for a permanently and openly available dataset, its URL is not very intuitive (is it a matter of blinding the real URL for submission purposes?)
- Many typos to be corrected.