Analyzing the Evolution of Vocabulary Terms and their Impact on the LOD Cloud

Author(s): Mohammad Abdel-Qader, Ansgar Scherp, Iacopo Vagliano

Abstract: Vocabularies are used for modeling data in Knowledge Graphs (KGs) like the Linked Open Data Cloud and Wikidata. During their lifetime, vocabularies are subject to changes. New terms are coined, while existing terms are modified or deprecated. We first quantify the amount and frequency of changes in vocabularies. Subsequently, we investigate to what extent and when the changes are adopted in the evolution of KGs. We conduct our experiments on three large-scale KGs for which time-stamped information is available, namely the Billion Triples Challenge datasets, Dynamic Linked Data Observatory dataset, and Wikidata. Our results show that the change frequency of terms is rather low, but can have high impact due to the large amount of distributed graph data on the web. Furthermore, not all coined terms are used and most of the deprecated terms are still used by data publishers. The adoption time of terms coming from different vocabularies ranges from very fast (few days) to very slow (few years). Surprisingly, we could observe some adoptions before the vocabulary changes were published. Understanding the evolution of vocabulary terms is important to avoid wrong assumptions about the modeling status of data published on the web, which may result in difficulties when querying the data from distributed sources.

Keywords: Vocabulary changes; Terms adoption; Deprecated terms


Review 1 (by anonymous reviewer)


(RELEVANCE TO ESWC) It is interesting to analyse the effects of obsoletion of ontology-classes.
(NOVELTY OF THE PROPOSED SOLUTION) This subject has been investigated multiple times.
(EVALUATION OF THE STATE-OF-THE-ART) Not really a new method or comparison in the manuscript.
(OVERALL SCORE) Abdel-Qader et al. investigate the effect of changes of vocabulary terms on the resources that use these vocabularies/terms in the Linked Open Data cloud. The paper is well written.
Major points:
The authors seem to assume that the vocabulary owners/creators are aware of who uses their vocabularies. I do not think this is usually the case and these two sides are often disconnected. The authors claim in their discussion that there should be a tool to basically connect the two sides (ontology creators and data publishers) but it remains unclear how such a tool can solve this problem. As such, I feel that this manuscript is merely a description of relatively obvious problems, without real solutions to the problems.
The authors also seem to assume that the maintainers of the vocabularies are only adding terms for the LOD. There are statements such as "... some newly created terms are never adopted. ... we suggest that ontology engineers investigate these issues and possible revise them." (in Section "Adoption of LOD Vocabulary Changes"). I don't see any problem or need for revision when an ontology contains classes that are not used in LOD. First, there is no reason that they must be used at all (now); second, just because they aren't used in LOD does not mean they aren't used in other forms. I think those sections of the manuscript should be revised, either to make clear what the authors mean or to correct these wrong assumptions.
All figures in the manuscript are hard or impossible to read, and the font size must be increased.
In Table 2, it is hard to understand where the triples have gone. It seems strange that in one year geonames has deleted 74 million of 81 million triples. Maybe there is a misunderstanding on my side, but it would be good to clarify this or explain it better.
The manuscript should have a section on "consider" and/or "replaced_by" tags, usually provided by ontology-developers when they deprecate classes. Is this not the case with any of the chosen vocabularies? 
Minor points:
The introduction-section contains a long results-section. This should be revised.
I think the reader would like to know what the connection between the triples and the Guava library is. (Also, when referencing a library, the authors should not state the date of access, but the semantic version of Guava used, e.g. 21.0 or 23.0.)
Revise: "...schema information in a three different leveles..."
Typo in "Adoption of LOD vocabualry Changes"


Review 2 (by anonymous reviewer)


(RELEVANCE TO ESWC) The paper is relevant due to the experiments on terms and their impact on the LOD cloud.
(NOVELTY OF THE PROPOSED SOLUTION) The paper shows interesting insight of the term/vocabulary change over time, but a clear novelty of approaches is missing.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed experiments sound correct and support the evaluation of the work.
(EVALUATION OF THE STATE-OF-THE-ART) The paper covers most of the related work on this topic.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The demonstration and discussion part are explained clearly and sufficiently.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The methods and datasets are well established and datasets are publicly available, therefore reproducibility is ensured.
(OVERALL SCORE) The authors focus on the data stored in the Knowledge Graphs and the change of the data over time. The paper presents to what extent and when the changes are adopted in the Knowledge Graphs. For their experiments they use the Billion Triples Challenge datasets, the Dynamic Linked Data Observatory dataset, and Wikidata. Their experiments demonstrate that terms change infrequently and that terms can be added to the graphs within anything from a few days to a few years.
Strong points:
- interesting research on vocabulary change over time
Weak points:
- research focuses on exact term overlap; it would be interesting to see if term similarity measures would change the outcome


Review 3 (by Steven Moran)


(RELEVANCE TO ESWC) Interesting topic; paper needs some work. See extensive comments below.
(NOVELTY OF THE PROPOSED SOLUTION) Missing pertinent literature for evaluation techniques:
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The discussion is good, but see comments below about evaluation.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I might have missed it, but it's not clear to me whether the data and results that the authors use are available for reproducibility.
(OVERALL SCORE) See comments below.


Metareview by Hsofia Pinto


The paper analyses the effect of changes of vocabulary terms on the resources that use these vocabularies/terms in the Linked Open Data cloud. While the topic is relevant to the community, the approach is not particularly novel. On a positive note, the experiments are fairly well described and the data set is publicly available.
After rebuttal clarifications from the authors, the reviewers maintain their slightly favorable evaluation of the submission but require that their suggestions are introduced in the final version so that the paper can be accepted.


