Semi-Automatic Ontology-Driven Development Documentation
Author(s): Yevgen Pikus, Bernhard Holtkamp, Norbert Weißenberg
Full text: submitted version
Abstract: Documenting a product development, e.g., creating requirement specifications, is an indispensable, time-consuming and resource-intensive activity in large organizations. A vast amount of related information often emerges across several siloed lifecycle tools, and only a portion of it is available in the post-hoc documentation. To tackle these issues in an industrial research project, we developed a semi-automatic end-to-end documentation system. Concretely, we extended a lifecycle tool integration ontology by adding publishing information, and we leveraged a standard for digital publishing for presenting lifecycle data. A pilot implementation demonstrates that the approach is able to extract distributed lifecycle data and to generate several types of documents in multiple formats. Since the system is designed to support various data sources and numerous document types, the results can be easily generalized to other domains beyond software development. We believe that this approach could trigger the change from a document-driven to an ontology-driven documentation paradigm in large organizations.
Keywords: Development documentation; Domain ontology; Linked Data; Tool integration; OSLC; Digital publishing; DITA
Review 1 (by Carlos R. Rivero)
This paper presents an approach to compute documentation by extracting data from various products of the software lifecycle. My main concern is that the paper is very difficult to read and assess. What I got from it is that the authors are integrating different existing standards (OSLC and DITA) using semantic-web technologies. The paper is not self-contained and really needs a brief introduction to both standards; otherwise, it is very difficult to understand. The authors claim that this paper is about documenting products of the software lifecycle, but the specific products are not listed (or at least I was not able to find them). What are the authors targeting here?

The PIOME design seems a bit straightforward to me, and I am not so sure why semantic-web technologies are helping here; a relational database could work perfectly well too. At the end of the pipeline, the authors generate documents that are not RDF (do not have semantics?), such as Word documents, PDF files or HTML pages. Why not generate such documents in RDF format following Linked Data principles?

Section 2.2 presents an ontology (devised by the authors? Is this a contribution?) along with an example: "It describes a software development project aiming to evolve a value-added mobile application to a traditional product and represents the relations between a contract, a requirement, a product and a project. Specifically, the extension of OSLC Resource oslc_rm:Requirement with PIO isst:EfficiencyRequirement specifies the publishing information." This example is not clear at all: how is this information provided and computed? How does this help with the documentation? What is the final output here? The use of DITA is very unclear to me: why is this needed if all the info is already in RDF format? Is DITA being used for generating or consuming RDF, or both? I do not understand Figure 3.

Section 2.4 presents a template pre-processing algorithm. I do not understand what the purpose of this algorithm is, and it seems very straightforward. Also, it seems to me that there is an indentation problem in lines 7 and 8, and the foreach loop in line 9 is actually included in the while loop in line 6. Also, there are seven calls and only some of them are explained. The way the template is created using addNode and addLeaf is not explained, and it is not clear what the output of this algorithm is.

If this paper is about documents, the experiments should be related to the quality of the documents over real-world projects. All running times seem to be under 3 seconds, which implies that the performed computations are trivial (the algorithm has three nested loops with several calls that presumably also take time). How are the documents generated? The related work is missing previous approaches to generate documents, whether or not they use semantic-web technologies. In fact, I was expecting a discussion about why the use of semantic-web technologies is interesting here.

---- After rebuttal ---- I acknowledge the comments from the authors but do not change my opinion of the paper. After the author clarifications, I am still worried about the novelty of this work and surprised that this is apparently one of the first approaches dealing with this problem. Related work mostly discusses OSLC and DITA, which to me is more context than related work. The authors cite only a single work as really related work; I checked it, and it actually has a compelling related work section. Also, I do not see any specific points addressing my second main concern of lack of focus: there is a huge number of products derived from the software lifecycle, and it is not clear which of them the authors are addressing.
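To make the reviewer's indentation point concrete, the following is a minimal, stdlib-only Python sketch of a template pre-processing loop in which the foreach is nested inside the while, as the reviewer argues the paper's listing intends. All names and the toy template shape are hypothetical; this does not reproduce the paper's actual algorithm, only the control structure under discussion:

```python
# Hypothetical reconstruction of the nesting the reviewer describes:
# the foreach ("line 9" of the paper's listing) sits inside the while
# loop ("line 6"), so every pending template node is expanded before
# the work queue is drained.
from collections import deque

def preprocess(template, query_results):
    """Expand placeholder nodes of a toy DITA-like template tree.

    template: dict with 'tag', optional 'query', and 'children'.
    query_results: maps a query name to a list of result strings.
    """
    root = {"tag": template["tag"], "children": []}
    queue = deque([(template, root)])
    while queue:                                  # "line 6": while loop
        src, dst = queue.popleft()                # "lines 7-8": next node
        for child in src.get("children", []):     # "line 9": foreach, inside the while
            if "query" in child:
                # addLeaf analogue: one leaf per query answer
                for value in query_results.get(child["query"], []):
                    dst["children"].append({"tag": child["tag"], "text": value})
            else:
                # addNode analogue: copy the structural node, expand later
                node = {"tag": child["tag"], "children": []}
                dst["children"].append(node)
                queue.append((child, node))
    return root
```

For example, a template `topic > section > p[query=reqs]` with two query answers would expand to a topic whose section holds two `p` leaves, one per answer.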
Review 2 (by anonymous reviewer)
The paper presents a framework for publishing documents related to various stages of the software development life cycle. The framework uses semantic technologies and is based on extensions of the OSLC and DITA frameworks. The work presented is of interest to a community of researchers and practitioners working on tools that support the development life cycle using semantic technologies. It is a reasonable fit for this track, as it shows the impact of semantic technologies in this area and is a nice and novel extension of existing frameworks. The framework is designed based on an industrial partner's requirements, a prototype is implemented, and various aspects of the implementation have been evaluated.

I am not an expert in this area, but I believe the presentation can be improved to clarify the contributions of this work and its core novelty. Several statements in the abstract and introduction motivating the need for such a framework are written as if OSLC does not exist and as if this work is all about connecting the "silos" of lifecycle tools, whereas that is a reason behind the existence of OSLC. I suggest rewriting some of the arguments in the first section, but also the following discussions, to make it clear what is missing from existing (OSLC-based) systems, and to provide a more concrete example of the kinds of documents your partner needed and what the existing/manual process is without your solution.
Review 3 (by Evgeny Kharlamov)
* Paper summary

The paper takes on the problem of producing (software) development documentation, an -- although important -- often neglected or poorly treated task. The authors present the PIOME (Publishing Information Objects (PIO) Management Environment) system, which can automatically generate documentation by integrating and presenting data collected from multiple software development lifecycle tools. The pipeline for doing so is:
1. extracting data from multiple sources in the form of PIOs (represented in an ontology) via OSLC (Open Services for Lifecycle Collaboration) plugins,
2. populating and integrating the PIOs in an RDF triple store,
3. generating documentation from SPARQL queries over the triple store data, using DITA (Darwin Information Typing Architecture) templates.

To this end, the authors have extended the OSLC ontologies with the concept of a PIO and developed an algorithm for formatting PIOs to documents that leverages DITA. The algorithm is evaluated for effectiveness and efficiency. A prototype of the PIOME system that supports JIRA is implemented.

* General feedback

The problem is clearly motivated, the paper gives a nice overview of the system and is reasonably well-written. However, its narrative is too generic for my taste and offers too few details and insights into what the challenges faced by the authors were and what the real difficulties of the problem are. There are few examples that show what the system is really doing. This leaves me not fully convinced that the PIOME system adequately solves the described problem.

The paper should evaluate the quality of the generated documentation. The claim of the paper (as I understand it) is that the PIOME system can (semi-)replace "traditionally written documents". However, little or no evidence that supports this claim is given in the paper. An end-to-end example of the documentation production pipeline (which could be partly/completely available online) could serve as such evidence. I could not find the PIOME system prototype online, the PIO vocabulary (by visiting its namespace address) or the extended JIRA-OSLC plugin online (from the Eriksson GitHub repo).

If the PIOME system is to replace manually written documentation, I would guess that this forces some requirements on the data entered into the lifecycle tools (to avoid garbage in, garbage out?). What would these requirements be? Examples of the input to and output of the system would make this easier to understand.

I would also suggest placing a greater emphasis on how semantic technologies are used to solve the problem, e.g., (how) does it help in integrating data from different lifecycle systems, (how) is entity resolution performed, (how) is reasoning performed on the integrated data -- and to what result, and what do the SPARQL queries look like? Section 2.1 briefly mentions some of this, but the explanation of and provided reference to OSLC resource shapes (one of the inspirations for the W3C SHACL recommendation) does not give further insight. Although of course important for the functioning of the system, I think the description of the DITA processing and algorithm (definitions, pseudocode and evaluation), Sections 2.3, 2.4 and 3.2, is given too much weight in the paper.

** Details

- p. 1: "Documenting a product development" -> "Documenting product development"?
- p. 4: "OLSC" -> "OSLC"
- "etc.": suggest using "etc." only when the remainder of the list is clear to the reader.
- Ref. 6: Linked data on the We*b*
- Ref. 11: Authors missing.

** After rebuttal

I thank the authors for the detailed response. Based on the response and the discussion among reviewers, I decided to change my score to -2 "reject". My main concern is that I do not understand how the system can automatically produce useful documentation, and if/how semantic technologies are used beyond ontologies as schema and SPARQL as queries. Is the system "only" formatting SPARQL query answers, or is there any "logic inside"? What is the input and what is the output? From the authors' response I am not confident that I will learn this in the final version.
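To illustrate the extract-integrate-publish pipeline summarized in this review, here is a minimal, stdlib-only Python sketch. The sample data, the toy query function, and the DITA topic builder are all hypothetical stand-ins: the real system uses OSLC adapters for extraction, SPARQL over an RDF triple store for querying, and DITA template processing for publishing.

```python
# Stdlib-only sketch of the three pipeline steps the review summarizes.
# All names and data are hypothetical illustrations, not the paper's code.
import xml.etree.ElementTree as ET

# Step 1: "extracted" lifecycle data, standing in for OSLC resources
# harvested from tools such as JIRA (subject, predicate, object).
triples = [
    ("req:1", "rdf:type", "oslc_rm:Requirement"),
    ("req:1", "dcterms:title", "Response time under 2 s"),
    ("req:2", "rdf:type", "oslc_rm:Requirement"),
    ("req:2", "dcterms:title", "Support 10k concurrent users"),
]

# Step 2: a toy query over the "triple store", standing in for SPARQL.
def titles_of(rdf_type):
    typed = {s for s, p, o in triples if p == "rdf:type" and o == rdf_type}
    return [o for s, p, o in triples if s in typed and p == "dcterms:title"]

# Step 3: fill a minimal DITA topic, standing in for template processing.
def to_dita_topic(topic_id, title, items):
    topic = ET.Element("topic", id=topic_id)
    ET.SubElement(topic, "title").text = title
    body = ET.SubElement(topic, "body")
    ul = ET.SubElement(body, "ul")
    for item in items:
        ET.SubElement(ul, "li").text = item
    return ET.tostring(topic, encoding="unicode")

xml_out = to_dita_topic("reqs", "Requirements",
                        titles_of("oslc_rm:Requirement"))
```

An end-to-end example of roughly this shape (real tool data in, a rendered requirements document out) is what the review asks the authors to provide as evidence.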
Review 4 (by Michele Pasin)
Accept.

* paper provides interesting insights into a key problem for large organizations: managing documentation for product development
* the tool the authors propose is clearly a prototype, but the discussion around the technical and conceptual aspects of the system would be relevant for moving this into production
* exposition is clear and well structured

Open issues:
* how much does the success of such an approach rely on human factors related to knowledge-acquisition difficulties? E.g. people marking up requirements in JIRA differently, or simply a lack of good metadata. From the paper it appears as if the main challenges one needs to solve are at the data integration level (syntactic and semantic), but I think the main challenge (within a large-organization scenario) would really be how to gather consistent semantic descriptions across a variety of systems and teams.
Review 5 (by Anna Tordai)
This is a metareview for the paper that summarizes the opinions of the individual reviewers. This paper describes a system for generating software development documentation, an interesting topic that is relevant for this track. The reviewers agree that the paper needs clarifications regarding its core contribution and the novelty of the work. Reviewers 1 and 3 in particular remark that more detail and discussion is needed on the input and output of the system, and on how and why semantic technologies help to solve the problem. In addition, the reviewers question what the quality of the resulting documentation is and how it would be measured. Overall, these negative points tend to outweigh the positive points. Laura Hollink & Anna Tordai