Paper 57 (Resources track)

Where is my URI?

Author(s): Andre Valdestilhas, Tommaso Soru, Markus Nentwig, Edgard Marx, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo

Full text: submitted version | camera ready version

Decision: accept

Abstract: One of the foundations of the Semantic Web is the ability to dereference URIs so that applications can negotiate their semantic content.
However, dereferencing is often infeasible, as the availability of such information depends on the reliability of networks, services, and human factors.
Moreover, it has been shown that around 90% of the information published as Linked Open Data is available as data dumps and 84% of endpoints are offline.
To this end, we propose a Web service called Where is my URI? (WIMU).
Our service aims at indexing URIs and their use, in order to let Linked Data consumers find the respective RDF data source in case such information cannot be retrieved from the URI alone.
We rank the corresponding datasets following the rationale that a dataset contributes to the definition of a URI in proportion to the number of literals it attaches to that URI.
We finally describe potential use-cases of applications that can immediately benefit from our simple yet useful service.
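
A minimal sketch of this literal-count ranking rationale follows, with hypothetical in-memory data; the actual service builds a Lucene-based index over LODStats and LOD Laundromat dumps rather than lists like these:

    # Sketch of the literal-count ranking rationale. Datasets and triples are
    # hypothetical; each triple is (subject, predicate, object, object_is_literal).
    datasets = {
        "http://example.org/dump-a.nt": [
            ("http://dbpedia.org/resource/Leipzig", "rdfs:label", "Leipzig", True),
            ("http://dbpedia.org/resource/Leipzig", "dbo:populationTotal", "571088", True),
        ],
        "http://example.org/dump-b.nt": [
            ("http://dbpedia.org/resource/Leipzig", "owl:sameAs",
             "http://sws.geonames.org/2879139/", False),
        ],
    }

    def rank_datasets(uri):
        """Rank datasets by the number of literals they attach to `uri`, highest first."""
        scores = {
            ds: sum(1 for s, _p, _o, is_lit in triples if s == uri and is_lit)
            for ds, triples in datasets.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(rank_datasets("http://dbpedia.org/resource/Leipzig"))
    # [('http://example.org/dump-a.nt', 2), ('http://example.org/dump-b.nt', 0)]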

Keywords: Link Discovery; Linked Data; Dumps; URI; Dereferenceable


Review 1 (by Amelie Gyrard)


I have read the rebuttal from the authors.
The demo was running today: http://139.18.8.58:8080/LinkLion2_WServ/
The https://dice-group.github.io/wimu/ link should be added to the website.
---
Summary: The authors provide a Web service called Where is my URI? (WIMU), motivated by the finding that around 90% of the information published as Linked Open Data is available as data dumps and 84% of endpoints are offline. The authors explain the index creation, the web interface, and the data processing. The datasets can be ranked if a URI is referenced by multiple data sources.
The authors claim that they process more than 58 billion unique triples from more than 660,000 datasets obtained from LODStats and LOD Laundromat.
Three use cases are explained: (1) data quality and data interlinking, (2) finding class axioms, and (3) statistics about the dataset.
Advantages:
•	We need such tools
Drawbacks:
•	The use case section is not clear enough
•	The resource link was dead when tested
Resources:
•	The web service https://w3id.org/where-is-my-uri/ (when tested on 22 January it did not work: “This site can’t be reached”)
•	The source code is available online at https://github.com/dice-group/wimu under the GNU Affero General Public License 3.0
Suggestions for improvements:
•	“around 21% of the information published as Linked Open Data is available as data dumps” -> prove that
•	“more than 58% of endpoints are offline” -> prove that. Did you learn that from those projects (SPORTAL [2], SPARQLES [3])?
•	“We also rank the data sources in case a single URI is provided by multiple data sources” -> check the Linked Open Vocabularies (LOV) project and its journal publication [1], since it can count the number of times an ontology is used by other ontologies.
•	Page 3: “:hasURI, :hasDataset, :hasScore” -> which ontology has been used? Did you design your own ontology? More explanation is expected.
•	Page 5: “we present two use-cases” -> but there are three subsections
•	Page 5: concise bounded descriptions (CBDs) -> Concise Bounded Descriptions (CBDs)
•	Page 5: “Linked Data Lifecycle” -> add a reference to the LOD2 project?
•	Page 6: the figure is not well explained; not clear enough
•	Check the Semantic Web Best Practices project [4]. Similar studies are done on the errors encountered when loading ontologies.
[1] Linked Open Vocabularies (LOV): a gateway to reusable semantic vocabularies on the Web [Vandenbussche et al. 2017]
[2] http://www.sportalproject.org/
[3] http://sparqles.ai.wu.ac.at/
[4] http://perfectsemanticweb.appspot.com/


Review 2 (by Alasdair Gray)


I thank the authors for their rebuttal. I have now been able to access the resource. In general there is a lot of room for improvement for this as a service in terms of user experience, but as a resource it has the potential to be very useful to the community, particularly if it is expanded to cover a larger proportion of the linked data web. 
I thank the authors for the work they have put in to get a manual together, but this should really be available from the web page of the resource rather than requiring a dig through GitHub, and even then the presentation is not great.
----
This is potentially a very interesting and useful resource. However, the authors have chosen to fill the paper with figures and trivial details rather than discuss the resource's merits. The title could be more explicit that this is a registry of identifiers.
The web interface did not work at the time of reviewing: all calls returned nothing found. The web service calls were working at the time.
Very limited documentation on how to use the service; no API description, etc. The Git repository could be more obviously linked on the web page. README instructions are provided together with a Docker image. There is no obvious testing of the code. Suitable tools have been used in the development, e.g. Lucene.
Unclear how extensively the service is being used. There is no discussion of a community evolving around and using the resource.
The paper is poorly structured: important information about the content of the service is stated in the use cases (Section 4.3). This should be restructured and made more prominent. Why are the authors not using the LOD-a-lot dump rather than LODStats?
What is the sustainability of the service? A lot of data processing is required each month to keep the service up to date. 
The final paragraph of Section 4.3 mentions that the heuristic is verified manually. More details of this should be given.
Claims in the introduction are not backed by references, e.g. that 21% of LOD is only available in dumps, or that it is cost that prevents providing querying services.
Stats used in the introduction are not consistent with those in the abstract, e.g. the percentage of endpoints that are offline.
Are the authors aware of the work of the FORCE11 identifiers group to provide uniform resolution of compact identifiers (https://www.biorxiv.org/content/early/2017/08/18/101279)? The implementations of this work focus on manually curated lists of where an identifier is located.
Does counting literals to identify the defining dataset run into problems when ontology design patterns such as the measurement pattern are used, i.e. where a measurement uses an extra resource to capture the value and unit?
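A hypothetical illustration of that concern: with a measurement pattern, the value and unit literals hang off an intermediate node rather than the entity itself, so a per-URI literal count scores the defining dataset as zero.

    # Hypothetical measurement-pattern data: the literals are attached to the
    # intermediate node :m1, not to :sensor1 itself.
    triples = [
        (":sensor1", ":hasMeasurement", ":m1", False),  # extra resource
        (":m1", ":hasValue", "21.5", True),
        (":m1", ":hasUnit", "Celsius", True),
    ]

    # Count literals whose subject is :sensor1: the heuristic sees zero here,
    # even though this dataset arguably defines :sensor1.
    print(sum(1 for s, _p, _o, is_lit in triples if s == ":sensor1" and is_lit))  # 0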
Section 4 takes up a lot of the paper without really telling us anything about the service presented in the paper. Giving more succinct examples of the usage of the service to satisfy these needs would be more beneficial. 
The service is missing major biomedical datasets:
- http://bio2rdf.org/chembl:210313: http://139.18.8.58:8080/LinkLion2_WServ/Find?top=10&urihdt=http%3A%2F%2Fbio2rdf.org%2Fchembl%3A210313
- http://rdf.ebi.ac.uk/resource/chembl/protclass/CHEMBL_PC_1020: http://139.18.8.58:8080/LinkLion2_WServ/Find?top=10&urihdt=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fchembl%2Fprotclass%2FCHEMBL_PC_1020
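For reference, such a lookup can be reproduced programmatically. The endpoint and the top/urihdt parameters below are taken from the URLs above; the response handling is a guess, since the payload format (presumably listing datasets with their scores, matching the paper's :hasURI/:hasDataset/:hasScore properties) is not documented:

    # Sketch of a WIMU lookup against the endpoint quoted above. The parameter
    # names (top, urihdt) come from the URLs in this review; the response format
    # is undocumented, so the body is printed as-is.
    from urllib.parse import quote, urlencode
    from urllib.request import urlopen

    BASE = "http://139.18.8.58:8080/LinkLion2_WServ/Find"

    def wimu_lookup(uri, top=10):
        query = urlencode({"top": top, "urihdt": uri}, quote_via=quote)
        with urlopen(BASE + "?" + query) as resp:
            return resp.read().decode("utf-8")

    print(wimu_lookup("http://rdf.ebi.ac.uk/resource/chembl/protclass/CHEMBL_PC_1020"))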
The paper is only 8 pages, so there is plenty of space to specifically address all these issues.


Review 3 (by Vinh Nguyen)


This paper describes a service that can find the datasets for a given URI. The service is conceptually useful, given the unavailability of endpoints and the fact that not everyone can afford to host many LOD datasets locally. It is fully available online and could be reusable in scenarios that require some checks before downloading entire dataset files, which are sometimes large.
Reasons to accept:
- The web interface and REST service basically work; I tested them all.
- The code is available on GitHub and instructions are provided for reproducing it.
- This service indexes the datasets from LODStats and LOD Laundromat and uses the indices to look for the datasets that contain the given URI.
- The returned datasets are heuristically ranked by the number of literals they contain for the given URI, which makes sense.
Reasons to reject:
- Although the authors describe some use cases where the service could be utilized, it has yet to be used by any application or community. It may have only a few applications.
- The input URI must exactly match the URI in the datasets; otherwise, nothing is found. For example, the URI http://sws.geonames.org/4896861/ gives some results, while http://sws.geonames.org/4896861 gives NOTHING! The two URIs are not much different from a human point of view. It took me quite a long time to realize that I got different results because I had given different input strings. This could confuse users too (a workaround sketch follows this list).
- In addition to URI lookups, what if the input is an entity without a full URI? Can this service search for datasets given such an entity, e.g. BarackObama, if someone is interested in the entity without knowing its full URI or the datasets containing information about it? I think the impact would increase if this were supported. The exact URI matching mentioned above will eliminate many applications of entity lookups.
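A simple client-side workaround for the exact-match behaviour described above; the helper is hypothetical and not part of the service:

    # Try both the URI as given and its trailing-slash twin, since the index
    # treats the two forms as distinct strings.
    def uri_variants(uri):
        return [uri, uri[:-1] if uri.endswith("/") else uri + "/"]

    for candidate in uri_variants("http://sws.geonames.org/4896861"):
        print(candidate)  # a client could query WIMU for each until one matches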


Review 4 (by anonymous reviewer)


The authors present a repository/service of indexed URIs and corresponding RDF data sources. The service addresses a problem we still have and is quite a nice addition to the set of URI/RDF resolution tools. The paper is nicely written and easy to follow. Links to the respective implementations are provided. Some detailed comments:
1) Do not use fancy adjectives to describe your work unless you have already shown that they apply. E.g., in the introduction: “scalable and time-efficient deployment of SW applications”. Where does this come from? Why time-efficient and why scalable? Introduction, page 2: “efficient, low cost and scalable service”. Are you developing a paid service? What costs are you talking about? How is the service scalable if you are publishing updates only once monthly? I am not against making big claims, but avoid sounding too marketing-like without actual support for the statements.
2) Section 3.1, Steps 3 and 4, especially 4: I am quite sure that some people from the community would immediately ask why 3 and then 4, and how exactly you do that. Consider adding a paragraph on that, just to be on the safe side.
3) One obvious question that comes to mind is: what is the innovation, if this is actually only an indexed merge of LODStats and LOD Laundromat? How easy is it to include further sources? Devote some space to clearly explaining that the benefits are much greater and that this is not just a merge of two repositories.
4) Section 4.3: formatting problem in the first paragraph. You can use \sloppy to fix that.

