Utilizing the Semantic Web and Constrained Optimization for Cyber Threat Intelligence
Author(s): Noseong Park, Ghaith Husari, Bei-Tseng Chu, Ehab Al-Shaer
Full text: submitted version
Abstract: Cybersecurity is currently a critical problem. Many enterprise networks are improperly protected because of a lack of human experts. One possible solution is to provide more effective tools to secure such networks.
To this end, we propose three contributions: 1) a new language called OSPARQL that seamlessly integrates SPARQL and constrained optimization; 2) cybersecurity knowledge graphs created after parsing a plethora of documents and system logs; and 3) real-world use cases based on the proposed OSPARQL and knowledge graph.
We conducted experiments on a real-world large enterprise network dataset. Our platform has a rapid response time (typically in a few seconds) on all tasks and achieves high recall and precision scores (approximately 90%) for the presented use cases.
Keywords: Cybersecurity; SPARQL; Constrained Optimization
Review 1 (by Valentina Janev)
(RELEVANCE TO ESWC) This paper proposes a new language called OSPARQL that seamlessly integrates SPARQL and constrained optimization (Section 4), and showcases the applicability of the language for running 'Cyber Threat'-detection queries (Section 5 and Section 6).
(NOVELTY OF THE PROPOSED SOLUTION) The paper proposes a new language. The OSPARQL language (see attachment at http://dropcanvas.com/67rql) is an attempt to implement a generic solution for common characteristics needed in tasks in domains such as cybersecurity and the Internet of Things. According to the authors, 'OSPARQL is the first effort to seamlessly integrate the Semantic Web and constrained optimization'.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The paper is well structured and the presented results are convincing. The Cybersecurity Ontology is not public; the knowledge graph is only partly based on Open Data (ATT&CK, https://attack.mitre.org, CAPEC schema, http://capec.mitre.org/documents/schema/index.html and others) and hence is not open. The knowledge graph (Section 3) for the targeted domain was created by parsing various relevant cybersecurity knowledge sources, by collecting detailed enterprise network information, and by interlinking with DBpedia and YAGO. A use case is presented (Section 5), and the approach has been tested (Section 6).
(EVALUATION OF THE STATE-OF-THE-ART) To my knowledge, the state of the art is comprehensive.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Using experiments based on real enterprise network datasets, the authors demonstrated OSPARQL's good performance on all tasks.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors have elaborated different use cases, while cybersecurity is the one discussed in this paper.
(OVERALL SCORE)
Strong Points (SPs)
- new language called OSPARQL that seamlessly integrates SPARQL and constrained optimization (Section 4)
- excellent organization of the paper that covers all aspects, from problem description to solution development and evaluation of results
- cybersecurity domain that is relevant and, I suppose, interesting for all participants
Weak Points (WPs)
Maybe the authors can provide more information about the time needed to:
- develop the language
- develop and instantiate the knowledge graph
- complete the evaluation
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) The paper provides a very relevant contribution to the Semantic Web community in terms of technical content (OSPARQL) and application to an underexplored and relevant domain.
(NOVELTY OF THE PROPOSED SOLUTION) To the best of my knowledge, the idea of supporting optimization-based queries on top of SPARQL results and RDF data is very novel. In addition, although a few contributions to the field of Cyber Threat Intelligence (CTI) (and related fields) exist, this is among the first contributions proposing an end-to-end solution to apply semantic technologies on top of real-world data in this domain.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The application of the proposed solution to the CTI domain is convincing and well explained. On the other hand, the definition of OSPARQL is quite intuitive and not all the details have been adequately explained. For example, it is not clear whether OSPARQL can be considered a new full-fledged query language (in which case a BNF would have been useful) or rather an approach to use constrained optimization on top of SPARQL. In addition, a more systematic experimental evaluation would have been useful to study the properties of the new language (e.g., using a larger number of synthetic queries and evaluating query processing time as a function of the input size). However, in my opinion, the novelty of the contribution somewhat balances the lack of details, and I consider the presentation in the paper convincing enough about the correctness of the approach.
(EVALUATION OF THE STATE-OF-THE-ART) I am not an expert in the field, but I am not aware of approaches that provide functionalities similar to the ones introduced in OSPARQL. Also, I am not aware of remarkable uncited work that applied semantic technologies to the CTI domain.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The negative score here points to the lack of a systematic evaluation of OSPARQL in terms of efficiency when considered as a function of a given input. For example, what is the efficiency as a function of the size of the data returned by the SPARQL queries? Are there any syntactic constraints to respect when writing the PARAMS, SUBJECT TO and MAX sections, in relation to variables used in the FROM clause? (For example, you use "software" in PARAMS, while the variable in the FROM clauses is "?software".) In general, I believe that some examples of plain SPARQL queries could be sacrificed, as they are quite intuitive for this community, while a better explanation of the relationships between the different sections in OSPARQL queries could be given.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) It seems hard to replicate these experiments (which consist of demonstration scenarios more than a full-fledged experimental evaluation). Could the authors release the code to compile OSPARQL as well as the data used in their experiments? Could synthetic data be produced in such a way that they could be shared with the community?
(OVERALL SCORE) The paper presents an approach to combine SPARQL query answering and constrained optimization into a language to solve constrained optimization problems on top of SPARQL results. This language, OSPARQL, provides functionalities that are not supported in SPARQL, which can be useful in several domains. In particular, the paper discusses the application of OSPARQL in the field of CTI, which is a very important application domain.
Strong Points (SPs)
- OSPARQL implements a very important idea, i.e., supporting optimization-based queries on top of knowledge graphs, which I believe is very interesting and promising for semantic technologies.
- The paper describes the main functionalities of OSPARQL in a rather clear manner and is convincing about its usefulness.
- The application to CTI is very interesting and tackles very important practical use cases for semantic technology.
- The knowledge graph built for this approach is built from real-world data sources, which are references in this domain.
Weak Points (WPs)
- The paper is somewhat in a grey area between an in-use track paper (the application of semantic technologies to a problem) and a research paper (the proposal of a new query language).
- More details about OSPARQL should be discussed, in particular in relation to constraints over its syntax.
- A more systematic evaluation of the performance of OSPARQL, maybe using synthetic data and queries, would provide stronger evidence of the class of problems tractable with this language.
Questions to the Authors (QAs)
- Could the authors release the code to compile OSPARQL as well as the data used in their experiments?
- Could synthetic data be produced in such a way that they could be shared with the community?
- What is the efficiency of OSPARQL as a function of the size of the data returned by the SPARQL queries?
- Are there any syntactic constraints to respect when writing the PARAMS, SUBJECT TO and MAX sections, in relation to variables used in the FROM clause?
- How are variables formulated in the PARAMS section (e.g., you use "software", while the variable in the FROM clauses is "?software")?
- Is there a BNF form that defines the complete syntax of OSPARQL?
- Why did you not try to reuse one of the ontologies proposed in related work, and populate it with your extraction approach? If there is any reason for this choice, it would be worth discussing.
*Detailed comments*
I suggest moving the OSPARQL queries from the introduction to Section 4. I found the intuitive introduction to the problem addressed in the paper very clear even without reading the query, which is quite difficult to read without some explanation that can only be given in Section 4.
I also suggest reducing the space dedicated to standard SPARQL queries, which are quite intuitive for this community. I suggest using this space to provide further details about OSPARQL. In Section 2, I suggest introducing an ontology as a set of axioms. You also use that concept later on in the paper, although an ontology cannot always be straightforwardly mapped to a graph.
Page 11: Apache Jean --> Apache Jena
Page 12: even though there many --> even though many
Review 3 (by Leo Obrst)
(RELEVANCE TO ESWC) It is certainly relevant to ESWC, focusing on an extension to SPARQL for constrained optimization and an approach to cyber threat intelligence by generating a knowledge graph.
(NOVELTY OF THE PROPOSED SOLUTION) Aspects of the approach are novel, e.g., the extension to SPARQL for constrained optimization (called OSPARQL). However, the system to generate the knowledge graph is not so innovative, and the discussion is limited: most detail is relegated to Appendix A, which is an online resource only, at a site often blocked by enterprise security systems (i.e., a file-sharing site). So the relevant detail is not actually in the paper (due to space limitations).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Seems an acceptable approach. The extension of SPARQL for constrained optimization is OK, if there is buy-in from the standards community. I am not sure about the linkage to the IBM LP format: what is the need for it, when space is at a minimum and more essential detail is relegated to an appendix that is not necessarily accessible?
(EVALUATION OF THE STATE-OF-THE-ART) Limited. Important references are left out.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Again, the approach seems sound, if not especially innovative. The details are missing.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Unclear what can be reproduced, since the details are missing.
(OVERALL SCORE) Really this is borderline. I rate it as weak accept, but that's primarily because there is no longer a borderline or weak reject option. There is limited new substance here. The approach proposes an extension to SPARQL for constrained optimization (called OSPARQL), the generation of a knowledge graph using their proposed system for cyber intelligence, and the development of specific use cases for this.
However, the system to generate the knowledge graph is not so innovative, and the discussion is limited: most detail is relegated to Appendix A, which is an online resource only, at a site often blocked by enterprise security systems (i.e., a file-sharing site). So the relevant detail is not actually in the paper (due to space limitations). The use cases are acceptable, but fairly commonsensical.
Strong points:
- the discussion of the constrained optimization extension for SPARQL
- high-level overview of a system that attempts to capture cyber intelligence
- the paper is well-written.
Weak points:
- Not enough detail. Real detail is relegated to an appendix, which is not necessarily easily accessed. Really the detail needs to be in the paper body.
- Important references are left out. E.g., some:
  Obrst, Leo; Penny Chase; Richard Markeloff. 2012. Developing an Ontology of the Cyber Security Domain. In: Costa, Paulo C. G., Kathryn B. Laskey, eds. 2012. Proceedings of the Seventh International Conference on Semantic Technologies for Intelligence, Defense, and Security, Fairfax, VA, USA, October 23-26, 2012. http://ceur-ws.org/Vol-966/, pp. 49-56.
  Oltramari, A., Cranor, L.F., Walls, R.J. and McDaniel, P.D., 2014. Building an Ontology of Cyber Security. In STIDS (pp. 54-61).
  Ulicny, B.E., Moskal, J.J., Kokar, M.M., Abe, K. and Smith, J.K., 2014. Inference and ontologies. In Cyber Defense and Situational Awareness (pp. 167-199). Springer, Cham.
  Grégio, A., Bonacin, R., de Marchi, A.C., Nabuco, O.F. and de Geus, P.L., 2016. An ontology of suspicious software behavior. Applied Ontology, 11(1), pp.29-49.
- Common cyber resources are mentioned and discussed as being included. But many of these are XML-based, not RDF/OWL-based, so transformation would need to be made (no detail). And these are common vocabularies referenced in many cyber papers.
Review 4 (by anonymous reviewer)
(RELEVANCE TO ESWC) Part of the problems of cybersecurity touch Semantic Web topics, e.g. modelling concepts and working with data that are heterogeneous in source and nature.
(NOVELTY OF THE PROPOSED SOLUTION) Several groups have worked on defining ontologies for cybersecurity. The idea behind OSPARQL may be interesting, but it is not deeply discussed in the article.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) It is hard to judge the correctness of the solution (see below).
(EVALUATION OF THE STATE-OF-THE-ART) The authors survey several relevant works; however, they miss relevant resources:
- STIX (https://oasis-open.github.io/cti-documentation/), which actually seems to cover most of the aspects addressed by the ontology in Section 3.1;
- platforms, e.g. MISP (http://www.misp-project.org/);
- studies in the area of context awareness.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The introduction highlights the importance of combining Semantic Web research with cybersecurity. Section 5 makes two attempts at showcasing the solutions proposed by the authors in two use cases, but they are incomplete.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Neither the datasets nor the code are available. Moreover, the article does not contain enough information to reproduce the experiments.
(OVERALL SCORE) The authors propose an ontology and a query language. The ontology is developed in the context of cybersecurity, while the query language, OSPARQL, combines SPARQL with constraint optimization languages. Both the ontology and the query language are discussed in several use cases. An experimental evaluation studies the time performance of an implementation of OSPARQL in different scenarios.
** Strong points **
SP1: The introduction is well written, and it is effective in explaining the motivation.
SP2: Combining SPARQL with other types of processing is potentially interesting.
OSPARQL may find application in other contexts.
SP3: The article shows potential use cases and areas where Semantic Web technologies may have an impact.
** Weak points and questions **
WP1: The paper has two contributions: an ontology and a query language. However, these contributions are "stealing" space from each other, and as a result, neither of them is fully convincing. It would have been easier to write two articles, one about the ontology and one about OSPARQL, with dedicated analyses and evaluation.
WP2: As a consequence of WP1, this paper is not self-contained. In several places, the authors refer to a technical report where additional information is provided. I am aware that this is a common practice in several communities, but in this specific case, the paper itself seems to be a summary of the longer one.
WP3: OSPARQL, which should be one of the main contributions of the article, is described in less than one page. This makes it very hard to judge. Apparently, the proposed language OSPARQL can be easily transformed into two parts: the LP format, which describes the optimization problem, and a SPARQL query. Indeed, this seems to be necessary to answer an OSPARQL query in the first place: the SPARQL query has to be executed, and then a solver like CPLEX can be used to solve the optimization problem, taking the answer of the SPARQL query as input. Given this workflow, wouldn’t it be easier just to provide an API where a user can submit both a SPARQL query and the optimization problem in the LP format (or some other well-known format), instead of combining both into one new language, just to split the two parts again in the next step? It seems unnecessarily complicated for a user to learn a new language instead of using two well-established languages. It would have been interesting to study more complex interactions between the two languages.
WP4: The evaluations lack a clear direction as to what should be evaluated.
Part of the experiments shows how well CPLEX performs on a certain problem. For other tasks, the runtime is not even mentioned. It’s also not clear how long the SPARQL query takes to execute.
WP5: In some cases, it is hard to judge the results without a term of comparison. For example:
- In one task, the authors mention that Internet Explorer 6.0 is the most vulnerable web browser in their dataset. How is this fact relevant to the evaluation? Is this a conclusion which could not have been drawn if another approach had been taken?
- In another task, the authors report precision and recall. While the raw numbers are probably good results, it’s not clear whether other methods could have achieved similar or better results.
WP6: What led the design of the ontology in Section 3.1? Why aren't the ontologies mentioned in the article (or STIX) enough?
Summarising, I think that this research line is promising and may produce several research results. However, the current submission suffers from many issues (not self-contained, confused in defining the contributions, evaluation) which require time and effort to be addressed.
I read the rebuttal, and I confirm my score, with the following arguments:
- There are already several cybersecurity ontologies, and I am not sure why we need a new one. I do not see new requirements or problems in the current ontologies that justify such a resource. "we wanted to define our own ontology representing our current and future interests" does not sound like a very strong motivation.
- OSPARQL is a query language with a very simple semantics: first SPARQL, then a linear programming language. This significantly limits the scientific value of the language. An option would have been to study how to allow more complex interactions, with the related optimisation and execution issues that arise in that situation.
- The evaluation is not well developed: it is a list of use cases with some insights.
A more structured evaluation is needed, to test both the ontology and the query language.
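The two-stage workflow described in WP3 (first execute the SPARQL query, then pass its answers to an optimizer) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy triples, the names `query_software` and `maximize`, the predicate names, and the numbers are all hypothetical. A plain list filter stands in for a SPARQL engine, and an exhaustive search stands in for a solver such as CPLEX.

```python
from itertools import combinations

# Hypothetical toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("browser_a", "type", "Software"), ("browser_a", "vulnCount", 9), ("browser_a", "patchCost", 5),
    ("plugin_b", "type", "Software"), ("plugin_b", "vulnCount", 6), ("plugin_b", "patchCost", 3),
    ("runtime_c", "type", "Software"), ("runtime_c", "vulnCount", 4), ("runtime_c", "patchCost", 2),
    ("router_1", "type", "Device"),
]


def query_software(triples):
    """Stage 1: stand-in for a SPARQL SELECT returning (?software, ?vulns, ?cost)."""
    names = [s for s, p, o in triples if p == "type" and o == "Software"]
    lookup = {(s, p): o for s, p, o in triples}
    return [(n, lookup[(n, "vulnCount")], lookup[(n, "patchCost")]) for n in names]


def maximize(rows, budget):
    """Stage 2: stand-in for an ILP solver; exhaustive 0/1 selection.

    Maximizes the total vulnerability count of the selected rows
    subject to the total patch cost not exceeding the budget.
    """
    best_value, best_pick = 0, ()
    for k in range(len(rows) + 1):
        for pick in combinations(rows, k):
            cost = sum(r[2] for r in pick)
            value = sum(r[1] for r in pick)
            if cost <= budget and value > best_value:
                best_value, best_pick = value, pick
    return best_value, sorted(r[0] for r in best_pick)


rows = query_software(TRIPLES)            # query answers feed the optimizer
best_value, chosen = maximize(rows, budget=7)
print(best_value, chosen)
```

In this sketch the optimizer only sees the bindings returned by the query, which is the "first SPARQL, then linear programming" semantics the reviewer describes; the more complex interactions between the two stages that the reviewer asks about are not modelled here.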
Metareview by Jorge Gracia
According to the reviewers, this is an interesting and promising contribution. It shows novelty and has potentially useful applications, such as the one treated in the paper (cyber-threat detection). However, the main contribution of the paper (OSPARQL) is not described in sufficient detail and is therefore difficult to judge. Also, there are some issues in the evaluation, where the hypotheses to validate are not clear. The paper would need a substantial extension to cover the missing details. The authors propose to do so by purchasing additional pages for the published version, but it is difficult for some reviewers to assess the added value of the promised extension without seeing it. Therefore, the reviewers have not substantially changed their judgement based on the authors' reply letter. Nevertheless, the approach is an interesting one and we would like to encourage the authors to submit their work as a poster or demo.