Enhanced Semantic Similarity Measure for Cloud Service Discovery
Author(s): Hajer Nabli, Raoudha Bendjemaa, Ikram Amous
Full text: submitted version
Abstract: The enormous amount of Web-based information has a great effect for focused crawlers in order to provide effective Cloud services. It is a challenge for focused crawlers to search only for URLs that are relevant to Cloud services from this explosion of information. To solve this problem, this paper contributes to the semantic focused crawler for Cloud services. In particular, we introduce a topic model based semantic similarity measure that integrates both semantic and syntactic methods for computing similarity measures between texts. First, URLs are ranked in descending order based on their semantic priority scores. Then, an LDA topic model is applied to compute the topical similarity between the URL document and the concepts document that includes a set of keywords related to the given Cloud service category. Moreover, in order to automatically discover and categorize Cloud services, we present a Cloud Service Ontology (CSOnt) that contains a set of concepts defining Cloud service categories. Experimental results show that the proposed approach enhances the performance of the focused crawlers and outperforms the focused crawler based on Best-First approach. In conclusion, the proposed focused crawler presents an efficient way to parse the Web and collect Web pages relevant to Cloud services.
Keywords: Cloud Service Discovery; Cloud Service Ontology; Focused Crawler; LDA Model; TF-IDF; Semantic Similarity
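Editorial illustration: the abstract describes comparing a URL document against a concepts document of category keywords. A minimal sketch of the syntactic side of such a comparison, assuming TF-IDF weighting with cosine similarity, is given below. The smoothed IDF formula, the function names, and the toy documents are assumptions made for illustration; this is not the authors' measure and it omits the LDA component entirely.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical concepts document and two hypothetical page texts.
concepts = "cloud storage compute virtual machine hosting".split()
page_a = "cloud hosting with virtual machine and storage plans".split()
page_b = "recipes for chocolate cake and bread".split()

vec_concepts, vec_a, vec_b = tf_idf_vectors([concepts, page_a, page_b])
score_a = cosine(vec_concepts, vec_a)  # shares several terms with the concepts
score_b = cosine(vec_concepts, vec_b)  # shares no terms with the concepts
```

Because all weights here are non-negative, the cosine score stays within [0, 1], which is one common way a similarity measure of this kind ends up normalised.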
Review 1 (by Mario Cannataro)
(RELEVANCE TO ESWC) The topic is interesting; it tackles a question addressed in the literature using an innovative approach. (NOVELTY OF THE PROPOSED SOLUTION) The novelty of the work consists of the implementation of a new approach that introduces a focused crawler based on a semantic similarity measure in order to efficiently collect and categorize Cloud services. Furthermore, the authors introduce a Cloud Service Ontology (CSOnt), based on concepts from the Cloud services vocabulary, that can be used to improve the search for Cloud services. Finally, the authors introduce an enhanced semantic similarity measure to provide relevant Cloud services. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors present in detail the proposed semantic focused crawler based on Latent Dirichlet Allocation, describing each main component in depth. (EVALUATION OF THE STATE-OF-THE-ART) The state of the art is sufficient; the authors review some works on collecting relevant URLs. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors evaluate the enhanced semantic similarity measure by conducting a number of experiments on a dataset. The effectiveness of the proposed approach is evaluated by considering Precision, Recall, F-Score, Fallout Rate and Retrieval Time. Furthermore, the proposed approach is compared against the Best-First crawling approach. Finally, they discuss the results. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The proposed crawler design is general. Thus, the user may customize its use to analyse different datasets in addition to the one presented in the work. (OVERALL SCORE) The literature contains different semantic focused crawlers that use semantic knowledge related to the search topic to compute the relevance of Web pages. However, they perform poorly when the number of documents is high. Furthermore, in some cases, they fail to detect semantic correspondences between terms. 
To solve this problem, the authors introduce a topic model based semantic similarity measure that integrates both semantic and syntactic methods for computing similarity measures between texts. The proposed approach is validated, and the results show that it enhances the performance of the focused crawlers and outperforms the focused crawler based on the Best-First approach. Strong Points 1) The definition of a semantic-based focused crawler that collects the data of Cloud services to automatically update the Cloud service repository. 2) The proposed semantic focused crawler is coupled with a Cloud Service Ontology (CSOnt) to improve the search for Cloud services. The CSOnt represents the second strong point. 3) Finally, the definition of an enhanced semantic similarity measure, based on the combination of Term Frequency Inverse Document Frequency with the LDA topic model, to provide relevant Cloud services (to efficiently collect and categorize Cloud services) is the third strong point. Weak Points 1) The authors present the results of the calculation of the semantic similarity score, but they do not clarify whether the semantic similarity measure is normalised. They should clarify this point. 2) The proposed semantic-based focused crawler is compared only against the focused crawler based on the Best-First approach. The authors should perform the comparison with other semantic focused crawlers. Why do the authors compare their crawler only with the Best-First crawling approach? What is the performance of their semantic focused crawler based on Latent Dirichlet Allocation (LDA) compared to the other state-of-the-art approaches that are also mentioned in related work?
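Editorial illustration: the review above lists Precision, Recall, F-Score and Fallout Rate as the evaluation measures. A minimal sketch of their standard information-retrieval definitions follows, assuming a binary relevant/irrelevant labelling of crawled pages. The function name and the counts are hypothetical, and the paper's exact definitions may differ.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Standard IR metrics from a binary confusion matrix.

    tp: relevant pages retrieved      fp: irrelevant pages retrieved
    fn: relevant pages missed         tn: irrelevant pages correctly skipped
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score here is the balanced harmonic mean (F1) of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    # Fallout: share of all irrelevant pages that were wrongly retrieved.
    fallout = fp / (fp + tn) if fp + tn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "fallout": fallout,
    }


# Hypothetical counts, not taken from the paper's experiments.
metrics = retrieval_metrics(tp=40, fp=10, fn=20, tn=30)
```

Higher precision, recall and F-score are better, while a lower fallout is better, so the four values should be read together rather than in isolation.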
Review 2 (by Pascal Molli)
(RELEVANCE TO ESWC) - Lack of references from semantic web in the related work, e.g.: M. Ehrig, A. Maedche, S. Handschuh, L. Stojanovic, and R. Volz, "Ontology-focused crawling of web documents and RDF-based metadata," in International Semantic Web Conference 2002 (ISWC 2002), Sardinia, 2002. Ardö, A. (2005). Focused crawling in the ALVIS semantic search engine. In Proceedings, 2nd Annual European Semantic Web Conference (ESWC 2005), Heraklion, Crete, Greece, 29th May-1st June. (NOVELTY OF THE PROPOSED SOLUTION) The proposed metrics can be original. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The correctness of the proposal looks ok. (EVALUATION OF THE STATE-OF-THE-ART) In related works, the authors mainly reference work concerning cloud service discovery and not the Best-First crawling approach. However, in the experimentation part, the authors compare the proposal to the Best-First crawling approach and not to the work described in related works. This makes the paper inconsistent. IMHO, "Despite the SPARQL language needs experienced users" is not adequate in the context. The limitations of related works are not clearly established. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Section 3 details the proposal. The "how" is quite well described, but the "why" is not well established. Why can this approach go beyond the limitations of the state of the art? Is the similarity SSIM specific to cloud services, or (as it seems to be) can it be applied to any semantic crawler? (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimentation part does not provide the materials that would make the experiment reproducible: code is not provided. The paper compares the proposal to the best-first crawler, but there is no clear reference for this baseline; is it really? Did you recode a best-first? (OVERALL SCORE) The paper aims to produce a semantic focused crawler that collects and categorizes URLs of cloud service providers. 
A first contribution ranks URLs given a cloud service category. The paper also proposes an enhanced semantic similarity measure combining TF-IDF and topic modeling. An experiment compares the proposal to best-first crawling as the baseline. Strong points: - Semantic focused crawlers are an interesting topic. - Crawling cloud service providers is an interesting use-case. Weak points: - The generality of the result is not established. I expect a semantic focused crawler to be able to find any topic given as a parameter, not just cloud service providers. How is the proposal positioned with respect to an ontology-based focused crawler? Is it a contribution for cloud service retrieval systems or for the semantic focused crawler? - Lack of references from semantic web in the related work, e.g.: M. Ehrig, A. Maedche, S. Handschuh, L. Stojanovic, and R. Volz, "Ontology-focused crawling of web documents and RDF-based metadata," in International Semantic Web Conference 2002 (ISWC 2002), Sardinia, 2002. Ardö, A. (2005). Focused crawling in the ALVIS semantic search engine. In Proceedings, 2nd Annual European Semantic Web Conference (ESWC 2005), Heraklion, Crete, Greece, 29th May-1st June. - The presentation is not very precise and is therefore confusing.
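Editorial illustration: the review above asks what the best-first baseline actually is. A minimal sketch of a generic best-first frontier is given below, assuming a priority queue keyed on a caller-supplied relevance score. The function name, the toy link graph, and the scores are all hypothetical; this is not the baseline the authors implemented, which the paper does not reference.

```python
import heapq


def best_first_crawl(seeds, get_links, score, max_pages):
    """Visit URLs in descending score order; return the visit order."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy in-memory link graph and scores (hypothetical data, no network access).
LINKS = {
    "seed": ["cloud-iaas", "cooking", "news"],
    "cloud-iaas": ["cloud-paas"],
    "cooking": [],
    "news": [],
    "cloud-paas": [],
}
SCORES = {
    "seed": 1.0,
    "cloud-iaas": 0.9,
    "cloud-paas": 0.8,
    "news": 0.3,
    "cooking": 0.1,
}

order = best_first_crawl(
    ["seed"], lambda url: LINKS.get(url, []), SCORES.get, max_pages=4
)
```

The scoring function is the only part that changes between a plain best-first crawler and a semantic one, which is why an explicit description of the baseline's scoring matters for the comparison.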
Review 3 (by Judie Attard)
(RELEVANCE TO ESWC) The authors used semantic web technologies to enhance crawlers for cloud services. Their approach consists in calculating the semantic similarity of URLs and cloud service categories and topics. (NOVELTY OF THE PROPOSED SOLUTION) The authors propose a novel approach that combines TF-IDF with an ontology that models Cloud service categories, in order to enhance the calculation of similarity. They also propose a two-level retrieval model in order to introduce this enhanced semantic similarity measure. The authors hence improve upon the existing state of the art. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors demonstrate that the proposed solution provides results that surpass another existing solution; however, they do not explain why they compared their solution to this particular related work, and why only to this one. Moreover, no justification is provided (including relevant citations) with regard to the selected methods (i.e. LDA). (EVALUATION OF THE STATE-OF-THE-ART) The authors provide a substantial list of related work; however, they do not provide a discussion of the strong points and/or drawbacks of the listed approaches, and how they compare to the proposed contribution. After Rebuttal: The authors do not provide satisfactory reasons as to why the state-of-the-art section was not appropriate. The authors do not provide substantial semantics background to back up their work. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors provide a good description of the undertaken approach; however, no justification is provided (including relevant citations) with regard to the selected methods (i.e. LDA). After Rebuttal: Whilst the proposed approach might be relevant, in the rebuttal the authors further confirmed that they did not have or provide sufficient justification. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The authors provide a good description of the undertaken approach. 
Whilst the authors focus on cloud services, their approach can certainly be adapted to other topics. (OVERALL SCORE) The authors aim to improve the efficiency of focused crawlers in the context of providing effective Cloud services. To this end, they propose a topic model based semantic similarity measure that exploits both semantic and syntactic methods for calculating the similarity between URLs and a set of concepts relating to Cloud services. The authors identify as a challenge the need for focused crawlers to search only for URLs that are relevant to cloud services. They contribute towards solving this challenge by proposing a two-level retrieval model in order to introduce an enhanced semantic similarity measure. Here they prioritise URLs based on semantic similarity and TF-IDF term weighting, and then they also apply a topical similarity measure based on the LDA model. The authors propose a Cloud Service Ontology that models Cloud service categories in order to further enhance Cloud service discovery. The authors compare their solution to the Best-First crawling approach and identify that it fares considerably better. Strong points: - Well written, includes nearly all relevant details with regard to approach, methods, results - Novel contributions - Good results in evaluation Weak points: - Solution is only compared to one other existing solution (also no justification is given as to why this solution was chosen to be compared with) - The authors provide a substantial list of related work; however, they do not provide a discussion of the strong points and/or drawbacks of the listed approaches. - No justification is provided for using LDA (except that "it has recently emerged as the method of choice for working with large collections of text documents"). Questions/Comments: 1. A short overview description of the proposed crawler architecture would have been quite helpful before delving into the main components. 2. 
Describe acronyms upon first use (IaaS, PaaS, SaaS). 3. Reference for Fig 2 is missing. (The authors state that the shown classification already existed in the literature.) 4. How exactly is the threshold calculated? The average score of which results? A set number of URIs? Per iteration? Please clarify. 5. I suggest giving the name and acronym together only once (Latent Dirichlet Allocation (LDA)), then consistently using only the name OR the acronym. 6. Please explain why you compared your contribution to the Best-First crawler. Is it the best with regard to the state of the art? Did you consider comparing your contribution to other approaches that perform well? ____________________ After rebuttal: Score was reduced due to the authors not providing satisfactory justifications for the lack of semantic references in the state of the art, as well as not providing justification for the undertaken approach.
Metareview by Amrapali Zaveri
The reviewers raise concerns with regard to the extent of the contribution. While the reviewers agree that the proposed approach is novel, it should be evaluated against existing state-of-the-art approaches. Additionally, there is a lack of reproducibility and generality that needs to be addressed. Furthermore, they did not find the answers they hoped for in the rebuttal, as reflected by the updated reviews. As such, we propose to reject this paper.