Paper 100 (Resources track) The Closure of 500M owl-sameAs Statements

Author(s): Wouter Beek, Joe Raad, Jan Wielemaker, Frank van Harmelen

Full text: submitted version

camera ready version

Decision: accept

Abstract: The owl:sameAs predicate is an essential ingredient of the Semantic Web architecture. It allows parties to independently mint names, while at the same time ensuring that these parties are able to understand each other’s data. An online resource that collects all owl:sameAs statements on the Linked Open Data Cloud has therefore both practical impact (it helps data users and providers to find different names for the same entity) as well as analytical value (it reveals important aspects of the connectivity of the LOD Cloud).

This paper presents the largest dataset of identity statements that has been gathered from the LOD Cloud to date. We describe an efficient representation and algorithm to calculate and store the full equivalence closure over this dataset.

Finally, we present analytics over these datasets, gaining insights in the use of owl:sameAs in the LOD cloud.

All datasets are published online, as well as a web service from which the data and its equivalence closure can be queried.

Keywords: linked open data; identity; owl:sameAs; reasoning


Review 1 (by Michelle Cheatham)


I thank the authors for the information provided in their rebuttal. I have slightly modified the part of my review they addressed.
The paper describes how the closure of the sameAs links in the 2015 LOD Laundromat dataset was computed and analyzes the resulting set of links. The paper is very similar to related work published in 2014, but the resulting set of links is significantly larger in this case. The motivation of the work is clearly described in Section 2.2. Even though the novelty is somewhat low, the results are likely to be of considerable interest to ESWC attendees. In addition, useful APIs for the dataset have been made available, as well as the ability to download the data itself.
In Section 3.2, should the input to the closure algorithm involve x < y instead of x <= y since reflexive pairs are being removed?
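To make the question about x < y concrete, here is a minimal sketch of an equivalence closure over identity pairs. It is not the authors' algorithm or storage format: it swaps in a plain in-memory union-find, and the function name `equivalence_closure` is hypothetical. Reflexive pairs are skipped, so every pair that reaches the merge step satisfies x < y after normalization.

```python
from collections import defaultdict


def equivalence_closure(pairs):
    """Group terms into identity sets from explicit sameAs pairs (illustrative only)."""
    parent = {}

    def find(x):
        # Register unseen terms as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for x, y in pairs:
        if x == y:
            # Reflexive pairs add no information and are dropped.
            continue
        # Normalize symmetric pairs so that a < b holds strictly.
        a, b = sorted((x, y))
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    sets = defaultdict(set)
    for term in list(parent):
        sets[find(term)].add(term)
    # Only non-singleton identity sets exist, because reflexive pairs were skipped.
    return sorted(sorted(s) for s in sets.values())
```

With this normalization, a term that only ever appears in a reflexive pair never enters the structure at all, which is why a strict inequality would describe the input more precisely.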
I was surprised by this finding: “While the majority of namespaces have incoming links, far fewer namespaces have outgoing links. This means that a relatively small number of namespaces is linking to a relatively large number of them.” My intuition is that there are many small data publishers that provide sameAs links between their datasets and the few large datasets that make up the core of the linked data cloud, such as dbpedia and Freebase, i.e. I would have expected the reverse of your finding to be the case. The authors explained this in their rebuttal, and I would encourage them to clarify it in the camera-ready version of the paper as well, if possible.
Regarding the discussion of sub-properties of owl:sameAs in Section 4.3, I agree with the authors that most of the time the datasets probably intended to weaken rather than strengthen the owl:sameAs property, but I wonder if in some cases the intent is to introduce a sub-property so that domain and range restrictions can be placed on the sub-property.
The paper is overall very well written and easy to follow. Below is a list of minor errors:
The last paragraph of the introduction omits a description of the contents of Section 5. 
“studies of the use these links” => “studies of the use of these links”
“even the number of elements of an equality set” => “even the elements of an equality set”
“” => “for”
In Figure 3, it would be helpful to use more distinct colors for incoming and outgoing counts, so that the graph is more easily readable in black and white.
“alomst” => “almost”
“For instance, from the fact that two things are the same link, it does not following that…” => “For instance, from the fact that two things are the same link, it does not follow that…”
“as downloadable snapshot” => “as a downloadable snapshot”
Figure 6 is ok but probably not strictly necessary.


Review 2 (by Anisa Rula)


The paper presents a dataset of all statements connected through the owl:sameAs predicate which is beneficial from the point of view of both data providers and consumers who take advantage of an entity represented with different URIs in different datasets. Another benefit as described by the authors is the level of connectivity of the LOD cloud. An additional contribution of this work is the transitive closure calculated by an algorithm provided by the authors that is tested to be efficient.
Positive aspects
*Since there are already similar services, I don't think this work plugs an important gap. However, the positive aspect is that there is no other such approach to efficiently calculate identity statements and to store them on a USB stick, although they are too big.
*This work advances the state of the art by providing an approach that is able to efficiently extract, store and calculate identity statements and their closure.
*The state of the art is comprehensive and includes all the related approaches regarding the identity links dataset.
*This dataset is of interest to the Semantic Web community since it can be used in different scenarios as mentioned by the authors such as question answering, ontology alignment or navigating through backlinks.
*The dataset can be easily queried and browsed. There is a simple interface where it is possible to enter the subject, the predicate and the object. It also provides simple autocompletion.
*The dataset can be also used to measure the connectivity between datasets on identity links. However, the connectivity of the LOD cloud is not made only of identity links.
*The dataset is available on the Web and it provides a licence specification that is
*The dataset can be retrieved either using the APIs or by downloading. It can also be downloaded as N-Triples.
*There is a discussion about the maintenance of the dataset and that the links will be updated incrementally.
*The repository is available on github
*The resource can be easily reused since there is good documentation about it
Negative points
*As the authors also state, there exist different works that try to extract explicit owl:sameAs statements and also provide the transitive closure. With respect to these previous works, the authors state that they are providing the largest dataset. The service that provides similar information is the It seems that there is no sufficient comparison with such service: "A crucial difference with our work is that also includes other predicates besides owl:sameAs that do not express identity, such as umbel:isLike, skos:exactMatch and owl:inverseOf". Is this sufficient argumentation to make this work better?
*There is not enough explanation of its limitations and of the rationale for excluding some functionality
Minor comments
*References to the works must be cited as, the authors in \cite{}, Beek et al. \cite{},
*The difference between dataset level (related works) and resource level is not clear.
*Bad to see formulas in the footnotes.
*Bad to see both footnote and citation together
*558M or 559M?
*P or \italic{P}
The rebuttal is good and has answered all my concerns.


Review 3 (by Marco Luca Sbodio)


Interesting article that breaks new ground in the area of dataset interlinking. The authors provide valuable resources for the whole Semantic Web community. The paper is well written, and I have only a few minor comments:
Some typos:
- section 1.1, first paragraph: "... has motivated earlier studies of the use _of_ these links ..."
- section 4.2, subsection "Edges in ~_i": "This _calculation_ requires ..."
- section 1, semantics of owl:sameAs: I am not entirely sure that the proposed formalization "I(<x, owl:sameAs, y>) is true iff I(x) = I(y)" is correct. This is not a major point, but I think some of the authors have given a better formalization in section 3, definition 1 of Wouter Beek, Stefan Schlobach, and Frank van Harmelen. 2016. A Contextualised Semantics for owl:sameAs. In Proceedings of the 13th International Conference on The Semantic Web. Latest Advances and New Domains - Volume 9678, Harald Sack, Eva Blomqvist, Mathieu D'Aquin, Chiara Ghidini, Simone Paolo Ponzetto, and Christoph Lange (Eds.), Vol. 9678. Springer-Verlag New York, Inc., New York, NY, USA, 405-419. DOI:
- I suggest that the authors add information about the hardware platform used to make the performance tests: this would help in reproducing the results.


Review 4 (by Adila A. Krisnadhi)


This submission presents, a dataset of owl:sameAs triples gathered from the LOD cloud. The paper explains details of the dataset, the method (algorithms and implementation) for calculating and storing the equivalent closure over the dataset, and analytics over the datasets. The authors claim that is the largest set of such owl:sameAs triples, while at the same time, assert that the dataset and its closure can be stored on a USB stick.
My comments below are based on both the submission and the actual resource as available in the portal. Overall, although there is a clear potential use of this resource, some technical problems exist that influence availability and reusability of the resource. Some of the problems occurred at the time of review, which is unfortunate. As a result, I can rate the submission no more than a weak accept.
- The dataset and its closure are available at the URI:, which seems to be a PURL. As the authors have remarked, availability of datasets of identity relations is extremely important, so it is unfortunate that no explanation regarding the sustainability of this resource is given: what kind of guarantee, if any, is made to ensure that the resource remains available forever?
- The resource is published with a CC-BY-SA license, which is acceptable for linked data.
- The resource is publicly available through the LOD portal. Furthermore, the authors claimed that the resource is available through an API (which I didn't thoroughly test) and bulk download. At the time of my review, bulk download was not available. The links were missing for explicit relation and schema (the only provided links in the portal seem to give only a browsable, follow-your-nose presentation), and the only link for the closure gave me an internal server error as a response. (I did try later and the bulk download became available again.)
- The submission does not mention whether the resource is findable through community registries. Nevertheless, if the URI is guaranteed to be persistent, then the public can still find this resource through search engines (provided they know the appropriate search term; currently Googling "sameAs" doesn't give this resource as a result on the first page, but maybe it will in the future?)
There is no clear evidence of usage beyond the resource creators, probably because the resource itself is quite new. The potential is there, however, if only because of the importance of the identity relation for linked data, which makes it quite easy to garner interest from the community. Unfortunately, there is no support portal, issue tracker, etc., beyond a minimal portal with links to the dataset and documentation.
The documentation itself is very minimalistic and requires some basic knowledge of HTTP REST APIs to use it programmatically. Examples of use would greatly help reuse and complement the documentation.
The resource is clearly general enough and applicable in a wider set of scenarios, and extensibility should not be a problem. The submission also explains how the dataset can be used and what the resource can do. The authors, however, did not clearly explain what the dataset cannot do.
There is some dependence on RocksDB, which is unfortunately not mentioned in the portal (unless I missed something obvious). It is unclear to me why this particular technological choice was made (e.g., why not other xxxDB?).
Design & Technical quality
There is no problem regarding best practices and re-use of other high quality resources. The resource is obviously suitable to solve the task at hand. Unfortunately, there is no schema diagram provided in either the portal or the submission, despite parts of the dataset containing schema assertions. More critical is that there is no discussion of the semantic accuracy of the sameAs relation stored in the dataset. The analytics presented in Section 4, to me, are more focused on the coverage of the sameAs relation. When one spot-checks the relation as presented via browser, there exist sameAs triples such as the one found in<http%3A//> as given below:  owl:sameAs "dbpedia:Company_(military_unit)"^^<>
which equates a URI resource to a string literal. 
Such triples could be a consequence of misusing the sameAs relation in the original dataset from which the above triple was obtained (or simply some bugs in the implementation?). Nevertheless, it is conceivable that a user may expect that the sameAs relations exposed by this dataset are of sufficiently high accuracy that they could be relied upon when needed for other purposes such as reasoning (e.g., would the above triple lead to an undesired effect if used for reasoning purposes?). In this situation, it is thus important to explain in advance that some tasks may not be possible without performing further curation of the dataset (or to give a kind of disclaimer statement that users of the dataset need to understand prior to using the dataset).
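As an illustration of the kind of curation step meant here, the following is a minimal sketch that separates sameAs triples with a literal term from the rest before they are used for reasoning. It assumes triples are given as tuples of N-Triples-style term strings (IRIs in angle brackets, literals starting with a double quote); the function names and the example.org IRIs are hypothetical and are not part of the resource or its API.

```python
def is_literal(term):
    """Return True if an N-Triples-style term string denotes a literal."""
    return term.startswith('"')


def partition_sameas(triples):
    """Split (subject, predicate, object) triples into clean and suspect lists.

    A triple is suspect when its subject or object is a literal, because
    an identity statement is expected to relate two resources.
    """
    clean, suspect = [], []
    for s, p, o in triples:
        if is_literal(s) or is_literal(o):
            suspect.append((s, p, o))
        else:
            clean.append((s, p, o))
    return clean, suspect


# Hypothetical input: the second triple mimics the literal-valued case above.
triples = [
    ("<http://example.org/a>", "owl:sameAs", "<http://example.org/b>"),
    ("<http://example.org/c>", "owl:sameAs", '"dbpedia:Company_(military_unit)"'),
]
clean, suspect = partition_sameas(triples)
```

A check of this sort only catches syntactically odd statements; it says nothing about whether two IRIs that are linked really denote the same thing.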
Potential Impact
Potential impact is the aspect in which I find the resource not problematic. The resource clearly plugs an important gap of identity relation dataset. The resource is also critical to the community in general, and thus would garner high interest from other researchers in the Semantic Web community and beyond.
======================== comments after rebuttal ====================
After reading the authors' rebuttal, I am happy that my concern will be addressed in the revised version. I am thus in favor of accepting this submission to the conference.

