A LOD backend infrastructure for scientific search portals
Author(s): Benjamin Zapilko, Katarina Boland, Dagmar Kern
Full text: submitted version
Abstract: In recent years, Linked Data has become a key technology for libraries, archives and museums in order to publish their data collections on the web and to connect them with other data sources on the web.
Especially for unconnected data collections, it is an easy method for connecting them without giving up historically grown, unconnected infrastructures.
With the ongoing change in the research infrastructure landscape where an integrated search for research information gains importance, organizations are challenged with connecting their historically unconnected databases with each other.
In this article, we present a Linked Open Data based backend infrastructure for a scientific search portal which is set in between unconnected data collections and makes the links between data sets visible and usable for retrieval.
In addition, Linked Data technologies are used in order to organize different versions and aggregations of research data sets.
We present the in-use application of this backend infrastructure for a scientific search portal for the social sciences and evaluate the benefit of links between different data sources in a user study.
Keywords: Linked Data; search portal; research information; infrastructure
Review 1 (by anonymous reviewer)
The paper describes an infrastructure for scientific search portals based on LOD principles. It is well motivated by a concrete use case, implemented in a real application and evaluated in a user study. The paper is well written and clearly structured. The main focus and contribution is around linking research data across heterogeneous sources, addressing the aspects of representing, extracting, managing and presenting the links to users. The applied techniques are sound and state-of-the-art. The work is clearly relevant in the context of LOD. However, beyond the aspect of links across data sets, Linked Data techniques are not really considered or described. Data is stored in MongoDB with an Elasticsearch index; RDF databases for managing data, SPARQL as an interface for querying, the use of standard vocabularies etc. are not considered. It is mentioned that data is stored in MongoDB following an RDF model, but what that means is not clear. Further, while the contribution suggests a generic infrastructure, the work seems very much focused on the implementation of a specific application, with concrete data sources in mind. It is not really clear which parts of the infrastructure can be considered generic and reusable beyond this specific instance. The user study is very laudable; however, the tasks performed by the users do not seem to be based on actual information needs of the users, but are more like (directed) experimenting and playing with the system. Despite the above negative points, I believe overall the work constitutes a contribution that is worth presenting in the in-use track.
Review 2 (by Takahiro Kawamura)
This paper proposes a scientific search portal system, which performs link detection (mainly for citations in papers), link merging, and entity disambiguation. After describing the rationale for those mechanisms, user experience has been evaluated by 17 people, and the proposed system has been shown to be useful. However, the accuracies of the proposed mechanisms are missing in this paper. The accuracy of the link detection and merging and that of the entity disambiguation should be measured separately, at least on a sampling basis. Since, as described in the Conclusion, disambiguation of researchers’ names is not yet incorporated, this may affect those accuracies. Also, for practical use, system performance, i.e., speed, is a major concern, but there is no description of it. Moreover, the distinction between online and offline batch processes in the system, which affects the performance, is not clear. As for the paper presentation, more examples would improve the readability. Some wordings are a bit vague, for example, "a lower/higher level of granularity" of an entity and "more coarse/fine-grained version of a known link."
Comment:
- The 3.5 Data format section should be placed earlier in the paper.
- Characters in figures are too small to read.
-- I acknowledge the authors' comments and slightly changed my score.
Review 3 (by Hannah Bast)
The paper describes a system for automatically providing links between entities of various scientific information portals of GESIS (the largest German Research Institute for the Social Sciences). The various portals have different architectures. The entities are of a variety of types, for example: publications, projects, research datasets, institutions. The typical problems (when dealing with heterogeneous data sources) are encountered and dealt with: identifying potential links between entities, different names for the same entity and the corresponding merging of links. A small user study is conducted (17 participants), where users were asked to find information about the research data cited in a given paper. The above-mentioned links could be used. Users were asked to think aloud. Most users found the links useful. I found the paper cumbersome to read for a number of reasons. Most of the paper is written in WoT (Wall of Text) style, with little structure and long descriptions in prose. The introduction is very vague about what exactly has been done. Section 2 basically repeats this and is similarly vague. A clear problem definition and a running example (or just more examples here and there) would have been tremendously helpful. The font of several of the figures is too small to be readable in a printout. I also did not find the figures particularly helpful. The techniques are standard, but carefully put together in a meaningful way to achieve the intended goal. The user study is also described in WoT style and is missing a clear description of the task (a simple piece of advice to add structure: just have a paragraph with a concise task description, separated from the surrounding text; or wherever there are steps or a list, make a list that is separated from the surrounding text, and clearly say before the list what the list is about).
Given that the user study asks for information about cited entities of a certain kind, it is not too surprising that the provided links (between entities, in particular those asked for in the task) were found useful by most users. I have read the response letter of the authors. I appreciate the willingness of the authors to make all the necessary amendments to the presentation. Given the present form of the paper and the extent of the necessary improvements, I consider it a bit of a gamble though whether the authors will succeed with this. If there are other submissions of similar quality and with a better presentation, these should be preferred. If there is still room for this paper, I would not mind accepting it and trusting the authors to make the necessary changes.
Review 4 (by anonymous reviewer)
This work is valuable in that it highlights a very common scenario that many research centers are encountering today: making various datasets and other research collateral discoverable. It also highlights the value that linked open data infrastructure can add to making such collateral interoperable. It would be great to include this paper in the In Use track, but unfortunately, in its current state, it is missing several key details that would make it more tangible and easy to follow. It is also never clear about the size and scale of this implementation. The Introduction states that the landscapes and roles for libraries, archives, and research centers are changing, which is reflected in the research agendas of funding agencies -- while this may feel like common knowledge to some, a citation or specific example would be helpful. The same paragraph also says that a study of users revealed an interest ("requirement") in being able to search for related research information about a single topic using integrated search functionality. The Use Case is described briefly and vaguely -- other than to say that there are several disparate portals of data and they want to provide a comprehensive search. Likewise, the paper does not expressly explain who the target stakeholders of the use case are and what their pain points are. Naturally, this can be inferred, but not including specifics here makes it difficult to assess the relevance of the evaluation scenario described in Section 4 (i.e. is the use of a literature entry as a starting point, leading to users needing to find data and cited literature, a common scenario?). The Introduction mentions an internal user study where "users are interested in...the use of research data" and "related research information". Some specific data, questions asked, results, etc. from this study (was it a survey?) would make for a more compelling Use Case description as well as a better understanding of success criteria and the evaluation scenario.
Section 3 describes the importing and harvesting of metadata from multiple data collections. At this time, multiple processes for link detection occur; these include extraction/lookup of DOIs, pattern-based reference extraction, and term-based reference extraction. In 3.4, the authors do a sound job of explaining the challenges of dealing with metadata from a variety of sources -- with varying degrees of completeness and often with duplication. The method described for disambiguation and link merging is sound, but could use a little more detail. It is difficult to understand what the balance between automation and human intervention is in this scenario. The remainder of section 3 describes the architecture and integration points. It makes mention of the RDF database in MongoDB and how that data is pushed into an Elasticsearch index. All of this would have been more tangible with specific examples -- especially ones that highlight the challenges and anomalies that are pointed out in the paper. It would be great to see more examples of queries and topics, or examples of datasets with good metadata (presumably the ones that had metadata imported) and ones where the metadata was harvested. The Evaluation section describes the results of a user study with 17 participants. While it does mention the participants' genders and average age, it does not explain how they were selected, their familiarity with the material and their familiarity with typical research scenarios. (Are these GESIS researchers? The same ones from the study in the introduction?) Nonetheless, we see positive feedback on the end result.
------
I appreciate the authors taking the time to respond to our comments. I will change my overall evaluation based on the changes and additions they have outlined in the response -- particularly about clarification and adding more specific details.
I do think that making datasets and other research collateral more discoverable is a common and important use case, which is why I hope there is room for a quality presentation on the topic at ESWC.
Review 5 (by Anna Tordai)
This is a metareview for the paper that summarizes the opinions of the individual reviewers. Reviewers agree that this is a relevant paper for the in-use track as it describes a concrete use case, a real application and an evaluation. The paper would benefit from revision and clarification. The authors successfully clarified some of the points raised by the reviewers in their original review. Laura Hollink & Anna Tordai