Accuracy and Efficiency of Performance Metrics in Reasoning EL Ontologies
Author(s): Isa Guclu, Martin Kollingbaum, Jeff Pan
Full text: submitted version
Abstract: This paper analyses existing performance prediction approaches for EL ontologies from the accuracy and the efficiency perspectives and proposes the core structural metrics for EL ontologies that provide efficiency of measuring metrics with a comparable prediction accuracy. The proposed approach is designed for simplicity and feasibility. The generalizability of the proposed metrics is validated through comparing their prediction accuracy with that of existing approaches by taking metric generation (time) cost into account.
The experiment results indicate that the core structural metrics provide a comparable prediction accuracy to existing metrics for (1) time prediction of ABox-intensive EL ontologies, (2) energy prediction of EL ontologies, and (3) time prediction of EL ontologies. In addition, it is shown that the proposed metrics are efficient and feasible by consuming less than the 0.1% of the time consumed by the existing metrics for measuring metrics of an ontology, which enables the adoption of the proposed approach for any processing environment, especially the resource-bounded mobile devices.
Keywords: Semantic Web; Ontology Reasoning; Performance Prediction; Random Forests
Review 1 (by anonymous reviewer)
(RELEVANCE TO ESWC) text (NOVELTY OF THE PROPOSED SOLUTION) text (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) text (EVALUATION OF THE STATE-OF-THE-ART) text (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) text (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) text (OVERALL SCORE) text
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) Reasoning and reasoner performance prediction is a relevant issue to the Semantic Web community. For a strong accept, a bit more detail and more examples should be given on how such metrics could be used in a realistic SW scenario.

(NOVELTY OF THE PROPOSED SOLUTION) Glancing over the cited literature (most of which involves at least one of the authors), there is little new about the approach: a set of metrics is proposed (unsystematically: there is no rationale behind them other than "obvious" or "OWLAPI implementable"), the time efficiency and prediction accuracy are studied, and the conclusion is drawn: you can use these X metrics to predict performance and you don't need the more expensive Y ones. No insight is gained into why this set of metrics is better, or whether, indeed, a tiny subset of the core metrics would have worked even better (say: ABox size only).

(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Not really applicable.

(EVALUATION OF THE STATE-OF-THE-ART) State of the art here means mainly referencing the 5 or 6 papers involving at least one of the authors. Other approaches are mentioned, like Sazonau, but some are omitted; I can think of Alaya 2015 off the top of my head. The state-of-the-art review could have been improved by actually describing some of the proposed metrics in more detail, and by motivating how your contribution of metrics might be better for your case.

(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There is no detailed account of the computational complexity of computing the proposed metrics. It is obvious that the complexity overhead is low, as most of the metrics are computed and indexed by the OWLOntology class in the OWL API anyway, but a formal discussion, even if short, would have been helpful (in particular since the core metrics are constantly contrasted against the more expensive graph-based ones of the competing approaches).
The insight from an assumption like "Most reasoners use the OWL API. The OWL API indexes all metrics we consider core structural. Therefore, they are efficient" is small.

(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiment uses an old dataset (ORE 2014), but it is well documented and motivated. The study is described reasonably well, including the machine setup, datasets and reasoners used. Datasets are being made available, albeit not in the ideal form (GitHub repos change over time, while a dataset reference should be static to ensure reproducibility. Zenodo allows cloning repos to that end, and provides the dataset and scripts with a DOI).

(OVERALL SCORE) Response to rebuttal (Suggestion: in a rebuttal, for every item, please refer explicitly to the reviewer and the comment made, and to how you aim to remedy that item. It is hard and unnecessary work for reviewers to have to decide which comment your rebuttals refer to.)

Practical Use: We are planning to apply our approach (core structural metrics) to well-known resource-bounded environments, e.g., mobile devices and IoT network nodes.
--> Again, a set of metrics is not an approach. It remains unclear how your predictive approach really makes a difference to reasoning, other than: Should I reason or not? Applying it to mobile devices really means: run the exact experiment again with less RAM and less CPU and check how the numbers change, perhaps including battery consumption.

Resource-aware reasoning on resource-bounded environments is planned to be implemented by making use of ontology modularisation techniques with the guidance of feasible resource prediction mechanisms, as in this study.
--> I find this sentence very confusing. What is resource-aware reasoning? How does modularisation relate to your work here?

The computational complexity of the proposed metrics is O(n).
Identified set of metrics can be derived during parsing an ontology at nearly no extra cost, e.g., ObjectPropertyCount (count of axiom type), ABoxAxiomCount (aggregation), etc. We, therefore, regard measuring these particular metrics as “nearly costless”. Any ontology parser implementation can provide these metrics. For our study, we used OWL-API.
--> You may be right, or you may not be. It is easily conceivable that the OWL-API implements a structural metric that is beyond O(n). You are right in that this is probably a less important thing to be corrected, but a sentence that convinces us that all your metrics are in O(n), especially since you are contrasting them against those expensive graph shape ones, would be good.

Your rebuttal does not refer at all to two of the key issues: (1) Why your set, and not ANY other (for example just ABox size/TBox size). For example, you cannot just mention Annotation related metrics and not discuss how they could *possibly* have any impact on reasoning; (2) Why core structural metrics work. Your paper gives us no insight into why any particular metric should work at all; as scientists we want to know at least a little bit why stuff works, not just that it does. I stand by my final assessment. The work points in an interesting direction, but is simply too preliminary for publication at ESWC.

The authors propose a set of structural "core" metrics that are efficiently computable and allow accurate reasoning time prediction for a relevant subset of OWL ontologies (EL). The main contributions are the design and execution of a study that compares the proposed metrics with existing sets of metrics, and shows that for EL, the proposed core metrics are superior. The topic of the paper might be relevant to our community, and better sets for predicting reasoner performance are welcome. However, the write-up and the size of the contribution are insufficient for a publication at ESWC.
In particular, the choice of the metrics is not motivated (it is just an unsystematic list, and how just that list, as opposed to any other, would be better remains unclear), the insight gained is small ("these metrics are sufficient for EL ontologies, we don't need graph based ones", without giving any explanation or discussion), and the write-up is very unclear and appears hasty. Just running the paper through Grammar.ly reveals a large number of grammatical and typographical errors that could have been easily avoided. Therefore, I cannot recommend the paper for publication at this point.

A non-exhaustive list of remarks:
* Abstract
  * The first sentence in the abstract is very unclear. Consider rephrasing.
  * from the accuracy and the efficiency perspectives -> with respect to accuracy and efficiency
  * that provide efficiency of measurement metrics -> ??? metrics that provide metrics?
  * What are core structural metrics as opposed to structural metrics?
  * with comparable prediction accuracy -> ? please clarify with readers in mind who are interested in reasoning performance but do not know what kind of predictions you are talking about
  * “The generalizability of … is validated through [comparison].” -> Generalisable to what? All EL ontologies in general?
  * “Time prediction of ABox intensive…” -> Time prediction? You mean classification time? Realisation? Query answering? All of them?
  * “Energy prediction” -> You mean how much electricity is used running the reasoner?
  * Case (3) -> That is the more general case of (1). If you want to list it here, you must justify why you list (1) in particular!
  * A metric cannot be efficient. It can only be efficiently computable, in my opinion.
  * Efficient and feasible: How can something be efficient and INfeasible?
    It seems redundant to mention feasibility here.
  * Metrics consuming time --> See above: Computing a metric consumes time, not the metric itself
* Introduction
  * You should give a few examples of what you mean by “OWL 2 metrics with high computational cost" here. Many metrics used in prediction tasks, such as simply the number of logical axioms, are independent of the OWL 2 profile when it comes to efficiency
  * Again: Ontologies do not consume resources. Algorithms do, such as reasoning algorithms. This must be clarified (in the main contribution list, for example)
  * You should list the most important, or interesting, metrics in the contribution section
* 2
  * 2.1
    * The complexity of SROIQ … has the complexity -> complexity has complexity?
    * Description Logics spelt with inconsistent capitalisation in the same paragraph
  * 2.2
    * Inconsistent bracketing after the “hardness categories”
  * 2.3.
    * " we notice that there are experiments” -> Isn't this your very own work?
    * “we haven’t come across any study that analysed the resource cost of the existing performance metrics.”
      * What resource cost do you refer to? Time/energy? And if so: It is, of course, likely that you find no study that exactly analysed the resource costs of the existing metrics… As far as I understand, they were proposed by your lab!
* 3
  * “measuring these metrics” -> All of them? How many of them are really polynomial?
  * “most of the tasks” -> What tasks?
  * “extending them for new dimensions” -> What do you mean? Like adding new metrics?
  * Where were these core structural metrics proposed originally? If you propose them, they should be listed here, as your claims cannot be sanity checked otherwise.
  * ABox axiom count for “ABox intensity”?
  * "to enable its easy adoption by reasoners” -> I can see how Protege might use metrics to decide which reasoner to use. It would be good to give an example here of what exactly you expect a reasoner to do with these metrics.
  * Having only the OWL API as a dependency is really not that much of a relevant advantage.
  * 3.1
    * AxiomType counts: What do annotation axioms have to do with energy consumption or reasoning time? Or declarations? These should never affect reasoning!
    * How do your core metrics differ from the 92? Which ones are the same, which ones are new?
* 4
  * "What is the proportional cost of measuring a metric of an ontology to performing the reasoning task on it?” -> How is this relevant? Please explain.
* Literature:
  * Be careful with inconsistent capitalisation, see reference 22, and others (el vs EL etc.).
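The O(n) question debated in Review 2 can be made concrete with a minimal sketch. It assumes an ontology is available as a flat sequence of axiom-type labels; the type names, the `ABOX_TYPES`/`TBOX_TYPES` groupings, and the `count_metrics` helper are hypothetical illustrations, not the OWL API and not the authors' implementation. Under that assumption, per-type counts and aggregates such as an ABox axiom count need only one pass over the axioms.

```python
from collections import Counter

# Hypothetical axiom-type labels; a real parser defines its own vocabulary.
ABOX_TYPES = {"ClassAssertion", "ObjectPropertyAssertion", "DataPropertyAssertion"}
TBOX_TYPES = {"SubClassOf", "EquivalentClasses", "DisjointClasses"}


def count_metrics(axiom_types):
    """Return per-type counts plus ABox/TBox aggregates.

    Each axiom label is visited exactly once, so the cost is O(n) in the
    number of axioms; the aggregates only sum a fixed set of counters.
    """
    counts = Counter()
    for axiom_type in axiom_types:
        counts[axiom_type] += 1
    return {
        "per_type": dict(counts),
        "ABoxAxiomCount": sum(counts[t] for t in ABOX_TYPES),
        "TBoxAxiomCount": sum(counts[t] for t in TBOX_TYPES),
    }


# Toy input: five axiom labels, as a parser might emit them while loading.
sample = [
    "SubClassOf",
    "ClassAssertion",
    "ClassAssertion",
    "ObjectPropertyAssertion",
    "AnnotationAssertion",
]
metrics = count_metrics(sample)
print(metrics)
```

This shows only that counting-style metrics are linear. It says nothing about whether every metric in the proposed set has this shape, which is the point the reviewer asks the authors to establish.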
Review 3 (by Antoine Zimmermann)
(RELEVANCE TO ESWC) Semantic Web techniques, including reasoning, can be used in systems that have constrained use of resources (memory, time, battery, etc). Predicting efficiency and consumption of reasoning is therefore a relevant task.

(NOVELTY OF THE PROPOSED SOLUTION) The approach consists in reducing the number of metrics used to predict reasoning efficiency and cost in a way that makes the computation of metrics nearly costless, while preserving most of the accuracy of the larger sets of metrics.

(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposal makes sense, and the experimental settings are clear. However, some details of the experiments are not well justified (see detailed comments below).

(EVALUATION OF THE STATE-OF-THE-ART) As far as I know, the related work is well covered (but I'm not knowledgeable in terms of prediction metrics for reasoning).

(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The results show a big improvement in the computational cost of the metrics, compared to the state of the art, while preserving much of the accuracy of other approaches.

(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The source code is made available online. Reproducing the results should be possible but requires some effort in putting everything together. Real-life and synthetic datasets with varying characteristics are used to show that the proposed solution is effective on a wide range of cases.

(OVERALL SCORE)
Comments after rebuttal:
=======================
While I was initially rather enthusiastic about the topic of the paper (yet seeing some shortcomings), the points raised by other reviewers, as well as the discussions among the committee and the non-convincing rebuttal, make me feel that this paper is not ready for publication. It has, however, in my opinion, the potential to become a solid contribution with little extra effort addressing our concerns.
Initial review:
===============
Summary of the Paper: The paper studies the problem of predicting the cost of reasoning with an ontology or a knowledge base in order to make an informed decision whether the reasoning task should be performed, and who (what process, device, or agent) should perform it. In this paper, only the case of OWL 2 EL ontologies is considered, which helps define fine-tuned ontology metrics for cost prediction. Multiple experiments are made to show how well the approach estimates time and energy consumption compared to other sets of metrics, and how efficiently the metrics are computed.

Strong Points: The paper is quite clear in stating the problem, describing the state of the art, and showing what is tested. Moreover, the results seem to indicate that the proposed solution is noticeably more efficient than previous work.

Weak Points: Some elements of the experiments are not so clear. Also, the paper does not provide much in terms of practical use of the performance metrics.

Detailed comments: Overall, I find the paper pretty well written, well organised, with an interesting set of experiments, with rather convincing results supporting the main claims of the authors. Nonetheless, I am a little puzzled by some things in the experiments: each experiment set (set-1 to set-4) changes the reasoners used (set-1: HermiT, TrOWL; set-2: ELK, JFaCT, TrOWL; set-3: ELK, jcel, TrOWL), the set of performance metrics used (set-1, 3, 4: 92 metrics vs. Core metrics; set-2: 92 metrics, 143 metrics, Core metrics), the evaluation metrics (set-1 uses MAPE, RMSE, R², while the others do not use RMSE), and other parameters (set-1 has 2 machines, unlike the other ones). Why is this the case? Can all reasoners be tested on all experiment sets? Also, what are machine 1 and machine 2? I see no description of these. Finally, the results depend on what classification technique is used.
Another, less crucial criticism is that the paper does not explain how the performance metrics can be used. If I know with a pretty good accuracy that reasoning with a specific input ontology is going to take more than 100 ms, how do I use this information concretely? Should I just not reason? Can I delegate the reasoning to something else? Then how? These are interesting related problems that could be mentioned to make the value of the contribution more tangible.

Minor comments:
Abstract:
- "less than the 0.1 % of the time" -> "less than 0.1 % of the time"
Intro:
- "(e.g. , , etc.)" -> no need for "etc." if there is "e.g."
- "The OWL 2 EL profile is roposed for efficiency and feasability of semantic technologies" -> as much as OWL 2 RL and OWL 2 QL. There must be a reason for discriminating the other ones
Sec.2:
- first paragraph: "such as ... etc." -> no need for "etc." here
- Sec.2.1 "provides the logical formalisms for ontologies" -> provide a logical formalism
- 2.1: compare "2NEXPTIME-complete" and "PTIME-$Complete$"
Sec.3, last paragraph before sec.4:
- "metrics that are can" -> metrics that can
- "by the Protégé" -> by Protégé
Sec.4:
- Sec.4.3 "to compare how long does it take in total" -> "to compare how long it takes in total"
- 4.3 "An analysis will be made" -> is made (or is provided)
- "loading an ontology , " -> remove space before comma
Sec.5:
- Table 2, explain what machine 1 and 2 are
- 5.1: "predication performance on on"
- How is the actual energy consumption measured?
- Caption of Table 4: "[...] how much energy will an EL ontology take on HermiT (Her.) and TrOWL (Tr.)?" -> "[...] how much energy an EL ontology will take [...] (Tr.)."
- 5.2: "accuracy of the 3 metrics (i.e., the 92 metrics [...])" -> accuracy of the 3 sets of metrics (viz., the 92 metrics [...])
- "the ELK", "the TrOWL" -> no article before a name (this occurs many times)
- Table 6: "how long will classification of an EL ontology take."
  -> how long classification will take
- in the enumeration on page 12: "to a more than the 1,000 times of the execution time cost" -> "to more than 1,000 times the execution time cost"
- "the 20.26 % of the" -> "20.26 % of the" (same with 0.1 %, this happens several times)
Sec.6:
- "the 0.1 %" -> 0.1 %
- "researches" -> research
References:
- several missing capital letters (el -> EL; dl-lite -> DL-LITE; Owl 2, owl, abox, owl api, elk, w3c, snomed, ontoqa, sparql ...)
- ref 19 has "et al." and has too many authors. The editors are sufficient for W3C recommendations (in this case Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue, Carsten Lutz)
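For readers unfamiliar with the three evaluation measures named in Review 3 (MAPE, RMSE, R²), the sketch below gives their textbook definitions in plain Python. The sample values are invented for illustration only and are not results from the paper; R² is computed here as the coefficient of determination, which is an assumption, since the paper may define it differently (for example as a squared correlation).

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the inputs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Invented reasoning times in milliseconds, purely illustrative.
actual_ms = [100.0, 200.0, 400.0]
predicted_ms = [110.0, 190.0, 400.0]

print(mape(actual_ms, predicted_ms))
print(rmse(actual_ms, predicted_ms))
print(r_squared(actual_ms, predicted_ms))
```

MAPE is scale-free while RMSE is expressed in the unit of the measured quantity, which is one reason a reviewer might expect both to be reported consistently across experiment sets.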
Review 4 (by Laura M. Daniele)
(RELEVANCE TO ESWC) The paper is relevant to ESWC and the reasoning track, as it addresses the topic of performance prediction of ontology reasoning.

(NOVELTY OF THE PROPOSED SOLUTION) The paper presents a novel approach useful to predict resource consumption of EL ontologies with comparable accuracy to existing metrics, but lower computational costs, providing a valuable contribution especially when using ontologies in resource-bounded environments such as mobile devices.

(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper provides a complete study that compares the proposed approach to existing metrics in the literature. The approach is clearly described together with the experiment conducted by the authors, and an evaluation of the results is presented.

(EVALUATION OF THE STATE-OF-THE-ART) Good evaluation of the state-of-the-art. The Introduction positions the paper with respect to previous studies on performance prediction of OWL 2 ontologies. The Related Work section provides an overview of the existing metrics in the literature, which are further compared with the core structural metrics proposed by the authors.

(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The experiment is properly built upon previous research. The setup and the results are very well presented.

(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The source code, experimental data and results related to the study are available online. An info.txt file clearly explains how to use the material.

(OVERALL SCORE) The paper proposes a novel approach for predicting resource consumption that can save time and computational power when reasoning with EL ontologies on, e.g., mobile devices. The paper proposes a more efficient approach compared to existing work, introducing core structural metrics that provide a comparable prediction accuracy, but better efficiency and feasibility.

Strong Points:
• Well-structured paper, solid piece of work.
• Potential impact if the proposed core structural metrics are really feasible to implement on resource-bounded mobile environments
• The experiment is very well designed and built upon existing literature in this area.
• Availability of the source code, experimental data and results

Weak Points:
• Lots of notions, very technical, requires lots of focus, not easy to read.
• The experiment and its results are thoroughly presented, but drawing final conclusions is left to the reader. More elaborate general conclusions would help.
• Not yet implemented in real-world scenarios to show the actual impact on mobile devices
Metareview by Diego Calvanese
The paper presents an empirical comparison between different performance metrics for reasoning in EL ontologies. The metrics are validated by comparing their prediction accuracy and the cost of computing the metrics themselves, relative to the total execution time. The main motivation is to be able to predict the cost of a reasoning task in an ontology (e.g., for resource allocation in resource-bounded devices), before actually starting the reasoner.

There is disagreement in the scores of the reviews, which is in part justified by the difference in the confidence scores of the reviewers. After a lengthy discussion phase, the reviewers with the more positive evaluations agree that the weaknesses identified by the more critical reviewers are well-justified, and that they are also not addressed by the rebuttal given by the authors.

Two aspects that make the paper unsuitable for ESWC are:
1) lack of a complete description of the experimental setup, of full results, and of an analysis of the data obtained (these are crucial aspects for an empirical paper like this one);
2) lack of an important comparison (cited in the paper, but not properly taken into account), which, if properly taken into account, would lead to questioning the value of the metrics proposed in the paper.