What does an Ontology Engineering Community look like? A Systematic Analysis of the Schema.org Community
Author(s): Samantha Kanza, Alex Stolz, Martin Hepp, Elena Simperl
Full text: submitted version
Abstract: We present a systematic analysis of participation and interactions within the community behind schema.org, one of the largest and most relevant ontology engineering projects in recent times. Previous work conducted in this space has focused on ontology collaboration tools, and the roles that different contributors play within these projects. This paper takes a broader view and looks at the entire life cycle of the collaborative process to gain insights into how new functionality is proposed and accepted, and how contributors engage with one another. The analysis resulted in several findings. First, the collaborative ontology engineering roles identified in previous studies, which had a much stronger link to ontology editors, apply to community interaction contexts as well. At the same time, the participation inequality is less pronounced than the 90-9-1 rule for Internet communities. In addition, schema.org seems to facilitate a form of collaboration that is friendly towards newcomers, whose concerns receive as much attention from the community as those of their longer-serving peers.
Keywords: collaborative ontology engineering; github; schema.org; community analysis; mixed methods
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) The study of communities and participation is not specific to ESWC, although this paper addresses one closely related community.
(NOVELTY OF THE PROPOSED SOLUTION) As there is neither a novel approach nor a solution, this is not applicable.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) It is not possible to assume a correct or complete interpretation of the community users' behaviour.
(EVALUATION OF THE STATE-OF-THE-ART) Related work is comprehensive.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors foresee possible validity issues.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Since no details are given on how the recoding or formalisation of issues was done, the study and its findings are difficult to reproduce. Also, the actual subject of Ontology Engineering was only vaguely addressed, and only the collaborative workflow of schema.org was considered, so no generalisable conclusions can be drawn regarding other collaboration setups or communities.
(OVERALL SCORE)
*Summary: The paper describes, in a well-structured manner, the results of an investigation of the liveliness of the community behind schema.org regarding user participation and topics addressed.
*Problem, contribution and results: Based on an analysis of contributions from users of schema.org's community channels according to the four aspects of topic prevalence, topic popularity, participation and user profiles, the goal was to provide insights into the "social" structure of the schema.org community. From this analysis, distributions of topics and degrees of participation by user, together with several user roles, were reported.
Details: First, no time frame is given for the data considered from the community group, only for contributions from the steering group. It would be good to state the overall period that was investigated by the study for both groups.
Further, the motivation for choosing the four aspects does not become clear, as they do not properly reflect ontology engineering as the actual subject matter. The significance of the investigation would be increased by looking into criteria such as the extent of changes to the vocabulary or the number of vocabulary elements affected by user participation, as this could indicate the degree to which users contribute to the evolution of schema.org through the topics they raise, or their willingness to deal with more complex aspects of the ontology. The topics of extension, clarification and modification are distinguished in the Methods section; however, again, the extent or degree of such topics is not examined further. Such aspects are only hinted at in the Discussions and Future Work sections.
In the Methods section it is stated that the topic descriptions were formalised - how was this done?
In the Results & Discussion chapter it would also be very helpful to show the distributions of the four aspects over time, as this would make it possible to gain insights into aspects such as the maturity of the vocabulary, the manner in which e.g. extension topics are raised over time, or how community group discussions lead to changes in GitHub. Also, it would be interesting to see if users change roles over time.
*SPs
- The paper is well-structured, well-written and easy to follow.
- There is substantial related work considered.
- The authors consider the problem of meaningful information being encoded in heterogeneous user contributions.
*WPs
- The authors consider the problem of meaningful information being encoded in heterogeneous user contributions, but do not elaborate on how the necessary recodings or formalisations of content were done. This, however, would be extremely helpful in properly understanding the findings.
- It is unclear whether the findings of the study are applicable to other ontology engineering communities as well.
- The actual task of ontology engineering within schema.org has not been considered, so the motivations and scopes of contributions remain unclear.
*QA
1) What were the reasons for choosing only the four general aspects, which are in principle applicable to any online community, while not considering issues relevant to ontology engineering, such as the number of elements that were affected by user contributions, and thus limiting the significance of the study to some extent?
2) Have "social dynamics" been investigated, i.e. can some insights be provided on how the schema.org community has evolved or evolves over time?
3) Which methods were applied for recoding and formalising the user contributions from the community channels and how were the results processed?
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) Studying communities in such a way is certainly interesting and worthwhile, but I am not certain how well it fits into ESWC; while the community chosen is central to the ESWC community, the paper merely addresses very generic questions of participation, with few discussions on how the observed differs from any other GitHub community.
(NOVELTY OF THE PROPOSED SOLUTION) Not really applicable: Pure empirical study, with no "solution".
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Not applicable. Completeness in this context is impossible to judge objectively.
(EVALUATION OF THE STATE-OF-THE-ART) References and background work are reasonable.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There are some discussions of threats to validity. But again, this point is not really applicable to an empirical investigation.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I would judge reproducibility to be moderately strong: the descriptions of what was done are reasonably clear. However, there are some elements of human judgement, such as the coding of the messages and issues, which were "recoded"; but without a formal inter-rater reliability analysis it is hard to say whether the coding is repeatable. Generalisability, in the sense of "Ontology Engineering communities", is weak: there is no evidence that the study generalises beyond schema.org. However, the authors do not make such a claim. I am torn about the question whether studying a single community yields sufficient insight to be judged valuable, but I tend towards yes.
(OVERALL SCORE) Revised version after the rebuttal. I acknowledge the authors' assertion that the topic of the study is of relevance to ESWC. However, my main concern was not that the scenario of the investigation itself was not relevant, but how the insight generated was relevant to the OEC in particular.
I felt there was a lack of discussion that relates some of the observations made back to the practices of the OEC community, rather than any arbitrary GitHub community. The claim in the rebuttal that you "aimed to understand how the OEC worked as a whole" was not central enough to the discussion of the paper.
A general comment about your rebuttal: For the reviewing process it would be helpful to clearly relate what you are talking about not only to the comments made by the reviewers (mention them; we write so many reviews, we may not remember), but also to the sections in the paper where you feel those are sufficiently addressed. Also, clearly say whether you aim at clarifying the points we raised, or whether you just disagree with our assessment and won't clarify. For example, with regard to my comments (R1), you explain to me again what you did regarding the recodings, but do not mention whether you aim to make that clearer in the paper itself.
**end revised review**
The paper conducts a cross-sectional analysis of community participation for a use case relevant to our community, evaluating GitHub issues and mailing list participation in particular. Studying communities in such a way is certainly interesting and worthwhile, but I am not certain how well it fits into ESWC; while the community chosen is central to the ESWC community, the paper merely addresses very generic questions of participation, with few discussions on how the observed differs from any other GitHub community.
Secondly, I am not sure what generalisable conclusions we can draw from the study. While some observations appear very interesting, there is no convincing statistical argument made that the results generalise to anything beyond exactly schema.org (in particular not Ontology Engineering Communities as advertised by the title) - and if that is so, the question is raised of why we should care about the results.
Nevertheless, I feel that the paper is well written and I believe that the study of OECs is worthwhile, so I could see the work presented as a poster.
3.2
- "Automatic scripts" is too general to be a meaningful method title. If you say mixed methods including X and Y, maybe X and Y should be something like concrete methods?
- You say "Finally, random samples of 10% of both corpora were re-coded for consistency reasons." Did you assess inter-rater reliability?
- We "re-coded 10%", or "inspected 10% of … emails" -> It might be a good idea to justify the choice of 10%, or explicitly state that your coding reliability and email inspection do not generalise.
4.2
- "typically GitHub issues receive more comments than Community Group emails receive replies" -> Substantiate this claim. Is this your personal experience? Then maybe say so.
- As above, perhaps justify the use of "10% of the emails" again.
- "three other/off-topic messages that were spam, an unsubscription request" -> These could, in my opinion, have been filtered out from the start, and not even considered for counting. Just noise!
- It is a bit surprising that you did not analyse the open/closed ratio of issues, nor look at the development of participation over time, which you could have easily done given the data you obtained!
4.4
- I would have expected "issues closed" as another metric here.
Conclusions
- "Topic prevalence shows that GitHub is used more to propose creating or editing functionality, whereas the mailing list is used more for clarifications." -> This is phrased as a generalised conclusion. As outlined above, I would question generalising to anything beyond schema.org, and so, in order not to be misleading or quoted out of context, I think schema.org must be part of this sentence.
Minor
- Regarding the GitHub issues: Maybe clarify that these involved both open and closed issues.
- Inconsistent spelling of GitHub and Github.
- I know it is a matter of personal taste, but it is not very reader-friendly to use citations as subjects or objects in sentences: “list of categories from ”. It forces us to always jump down to the literature, even if a citation is used multiple times. Consider using Smith, John et al. (2017) or similar in these cases.
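Review 2 asks whether inter-rater reliability was assessed for the re-coded 10% samples. One common measure for two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch only: the function name, the topic labels and the two example label lists are invented for illustration, are not data from the paper, and are not claimed to reflect the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items.

    labels_a and labels_b are equal-length sequences of category labels,
    one entry per coded item (hypothetical input format).
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both coders must label the same non-empty set of items")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two coders'
    # marginal proportions, summed over all categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical category for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented example: two coders assigning topic categories to ten messages.
coder_1 = ["extension", "clarification", "modification", "extension", "other",
           "clarification", "extension", "modification", "clarification", "extension"]
coder_2 = ["extension", "clarification", "extension", "extension", "other",
           "clarification", "modification", "modification", "clarification", "extension"]
print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance; reporting such a figure alongside the sample size would address the repeatability concern.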
Review 3 (by Irlan Grangel)
(RELEVANCE TO ESWC) The paper is of relevance since it addresses an important topic, i.e., collaborative ontology engineering. The study is based on the well-known schema.org vocabulary and the contributions that have been made to this vocabulary in GitHub.
(NOVELTY OF THE PROPOSED SOLUTION) The study seems to be novel in this area.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper follows a clear workflow to realize the study.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation of the state-of-the-art is performed and the contribution of the paper in relation to it is defined.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper presents four research questions, which are answered in separate subsections. It would have been beneficial to explicitly define the research questions before the sections in which they are answered.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The code and the data for the experiments are available on GitHub.
(OVERALL SCORE) The paper entitled "What Does an Ontology Engineering Community Look Like? A Systematic Analysis of the schema.org Community" presents a study regarding the use of a platform like GitHub to support the collaborative ontology development process. The study is based on the development of the well-known schema.org vocabulary. The paper is well written and the ideas are well structured and easy to follow.
Strong Points (SPs)
- The topic of the paper is of high relevance for the Semantic Web Community, since ontologies are developed in a collaborative manner and GitHub seems to be the center of most of the development of public ontologies. The authors have chosen by far the most successful vocabulary, i.e., schema.org, to realize the study.
- The paper reports interesting findings regarding collaborative ontology development and maps some of these findings to existing studies.
- The study is systematically performed, with a correct process going from data to conclusions.
Weak Points (WPs)
- By looking into the chosen repository for the study, i.e., schemaorg, one can see that branches are core to the collaborative development process (738 branches with a lot of work). The study does not discuss how this influences the overall process or the four questions it answers.
- The importance of the findings should have been better mapped to existing methodologies for ontology engineering in the discussion part.
- An analysis of basic requirements for ontologies in collaborative development, such as modularity and multilinguality, and how they are managed in this setting should have been at least mentioned.
Questions to the Authors (QAs)
- How can the findings of this paper be used to improve collaborative ontology engineering?
This study should have included works such as Git4Voc and VoCol, which are clearly focused on using GitHub for collaborative ontology building:
- Lavdim Halilaj, Irlán Grangel-González, Gökhan Coskun, Sören Auer: Git4Voc: Git-Based Versioning for Collaborative Vocabulary Development. ICSC 2016: 285-292
- Lavdim Halilaj, Niklas Petersen, Irlán Grangel-González, Christoph Lange, Sören Auer, Gökhan Coskun, Steffen Lohmann: VoCol: An Integrated Environment to Support Version-Controlled Vocabulary Development. EKAW 2016: 303-319
Minor issues:
- were published [24, 18] -> were published [18, 24]
- ontology editors [21, 31, 6] -> ontology editors [6, 21, 31]
- 6 new topics -> six new topics
- were formalised -> were formalized
- top 10 participants -> top ten participants
- as well organisational -> as well organizational
- Walk et al -> Walk et al.
Review 4 (by Laura M. Daniele)
(RELEVANCE TO ESWC) The paper is relevant to ESWC, as it addresses the topic of collaborative ontology engineering and provides a systematic analysis of the large community around schema.org, a representative and successful example for the ESWC community.
(NOVELTY OF THE PROPOSED SOLUTION) The topic of collaborative ontology engineering is neither particularly novel nor innovative. However, the paper takes a perspective for which there is still a limited understanding, analyzing a successful collaborative project such as schema.org and prompting future discussions on what makes an ontology engineering community so successful.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper provides a complete and sound analysis based on the four aspects of topic prevalence, popular topics, participation distribution and typical user profiles. It exhaustively describes the approach and clearly presents the results.
(EVALUATION OF THE STATE-OF-THE-ART) The Introduction positions the paper with respect to previous studies on ontology engineering communities. The Related Work section provides a comprehensive overview of the (more traditional) research on collaborative ontology engineering and further addresses the more recent trend for ontology engineers to collaborate via version control systems and GitHub in particular.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The results are well presented and discussed, including the limitations of the approach followed by the authors.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiment is clearly described and reproducible. The paper provides links to the sources on the schema.org website used to collect data (i.e., the steering group, the community group and the GitHub repository). The resulting datasets used as input to the analysis are made available to the reader in a repository that also contains the Python code used to extract issues from the schema.org GitHub repository.
(OVERALL SCORE) The paper presents a systematic analysis of the schema.org community, which provides a representative and successful example of a collaborative ontology engineering project. The analysis results in a number of interesting insights on how the schema.org community works behind the scenes, providing a valuable contribution to Semantic Web practitioners and ontology engineers eager to learn from success stories how to build, maintain and evolve ontologies and vocabularies that are widely adopted and create impact in our society.
Strong Points:
• Very well-written paper, properly structured and easy to read.
• Solid piece of work. The study is properly designed and conducted based on existing literature in this area.
• It brings an important topic to the attention of the Semantic Web and ontology engineering community. We are all aware that making good ontologies according to best practices is no longer sufficient if this is not supported by a community that actually uses the ontology and contributes to its maintenance and evolution. This paper provides interesting insights on how a successful community, such as the one around schema.org, works.
Weak Points:
• Although very relevant to ESWC, collaborative ontology engineering is not exactly a novel and innovative topic.
• The paper analyses the participation of the community, indicating on which topics the participation occurs, but does not address the quality of the ontology nor of the extensions/modifications to the ontology contributed by the engaged participants. This is something that could be taken into account for future work.
Questions to the Authors
Figure 2, which shows the responses to emails and issues per topic category, is not clear to me. How do the numbers of responses at the top relate to the percentages at the bottom for the different topics?
Metareview by Hsofia Pinto
The paper presents an interesting analysis of the community behind schema.org regarding user participation and topics addressed, from an Ontology Engineering point of view. While well written, the paper was perceived by the reviewers as an initial starting point that lacks enough detail for the same results to be reproduced. After the rebuttal and vivid discussions, and although some reviewers thought the paper was better suited for a poster presentation (since this is still an initial study), the paper was accepted since it provides interesting insights to the community.