PRISMO- Priority Based Spam Detection Using Multi Optimization
Author(s): Mohit Agrawal, R Leela Velusamy
Full text: submitted version
Abstract: The rapid growth of social networking sites such as Twitter, Facebook, Google+, MySpace, Snapchat, Instagram, etc., along with its local invariants such as Weibo, Hyves, etc., has made them infiltrated with large amount of spamming activities. Based on the features, an account or content can be classified as spam or benign. The presence of some irrelevant features decreases the performance of classifier, understandability of dataset, and the time requirement for training and classification increases. Therefore, Feature subset selection is an essential phase in the process of machine learning mechanism. The objective of feature subset selection is to choose a subset of size‘s’ (s
Keywords: Dimensionality reduction; Genetic Algorithm; Spamming; Particle Swarm Optimization; Web 2.0.
Review 1 (by Ricardo Usbeck)
(RELEVANCE TO ESWC) The conference focuses on advances in Semantic Web research and technologies, and the paper misses this main scope. None of the words “semantic”, “linked”, “ontology”, “triple” or “RDF” is used, so a focus on Semantic Web research is not given. There is a weak contextual relation to the Semantic Web, as the underlying data originates from the social networks Twitter and Apontador, and some linked data characteristics are examined through features such as the number of followers. Nevertheless, this is only a number; the structure of linked data is not considered. The algorithmic focus of the paper is on Feature Subset Selection using optimization algorithms, and on classification. There is therefore only a weak connection to some of the conference topics.
(NOVELTY OF THE PROPOSED SOLUTION) Overall, the described approach is a combination of existing algorithms. After a basic data collection, a Feature Subset Selection (FSS) is performed. For this purpose, three approaches are applied in parallel: Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Simulated Annealing (SA). There are no heuristics to choose the best alternative; the algorithms are simply executed in parallel. The resulting feature subsets are then combined. In the next step, the combined feature subsets are applied to four classifiers: Random Forest, Naive Bayes, JRIP, and J48. The results are compared with each other in tables. As the algorithms used already exist and only their results are combined and compared, a strong novelty is not given.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is mostly well described. In particular, the applied FSS and classification algorithms are well known. The chosen feature subsets and the classified spam messages (marketing, relevance, violence) are presented in tables. The solution seems to be correct and complete.
(EVALUATION OF THE STATE-OF-THE-ART) The presented related work on spam detection covers the years 2006 to 2015 and exceeds one page. This could be improved, as e.g. a Google Scholar search for "spam detection" and optimization since 2017 returns 800 results.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As described in “Novelty of the Proposed Solution”, the proposed approach is basically a parallel execution of existing algorithms and a combination of the results. This is described well. The discussion of the approach is part of the section “Conclusion and Future Work”, even though this section is mainly a summary. The authors state that their approach performs better than existing solutions. This seems obvious, as the best result of the three algorithms is chosen in each case.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The data used consists of a Twitter dataset and an Apontador dataset. As neither these datasets nor the developed Matlab algorithms are provided via a URL, the results cannot be reproduced exactly. The algorithm itself is described well, so there is a chance of producing similar results.
(OVERALL SCORE) The main topic of the paper is spam detection. This problem is addressed by executing three Feature Subset Selection (FSS) algorithms and combining the respective results. The results are used as input for four classification algorithms. The final results are presented and compared with each other. Strong points of the paper are the description of the overall approach, of the FSS algorithms used, and of the classifiers used. Weak points in terms of the conference topics are the weak relation to the Semantic Web context, the parallel execution of existing algorithms instead of an independent algorithm, and the missing data for reproducibility. Questions: Are there more current state-of-the-art approaches to spam detection? Isn’t it obvious that the approach produces better results, as the best results of the competing algorithms are chosen? Is it possible to provide a public repository containing the algorithm and data used? What is the main relation to the Semantic Web, or rather to linked data?
Review 2 (by anonymous reviewer)
(RELEVANCE TO ESWC) The topic is not relevant to ESWC; the paper is more suitable for a machine learning conference.
(NOVELTY OF THE PROPOSED SOLUTION) It is not clear to me what the novelty of the proposed solution is.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) This paper proposes a hybrid model by combining several existing approaches for feature subset selection.
(EVALUATION OF THE STATE-OF-THE-ART) The evaluation of the state-of-the-art is missing.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The properties of the proposed approach are badly demonstrated and discussed.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiments are not general enough if the paper aims to solve the general feature subset selection problem.
(OVERALL SCORE) Summary of the Paper: This paper deals with spam detection by using machine learning with feature subset selection. It proposes a hybrid model by combining several existing approaches for feature subset selection.
Strong Points (SPs)
- The paper provides a good introduction to feature subset selection.
Weak Points (WPs)
- The paper is badly structured. It lists many general concepts of machine learning, classification and feature selection, but it is hard to follow what main challenges the authors would like to address and what main contributions they achieved.
- If the paper aims to propose a general solution to feature subset selection, the experiments should not be conducted only on spam detection in social networks, which is too specific to evaluate a general solution. If the paper aims at spam detection in social networks, the proposed solution should be compared with the state-of-the-art solutions to this task instead of the general solutions to feature subset selection.
- The paper is written in poor English and there are many typos.
Review 3 (by anonymous reviewer)
(RELEVANCE TO ESWC) The authors propose a new feature selection method for classification. They do not show how feature selection on a spam classification task in the domain of social media can be used for, or can improve, Semantic Web related tasks.
(NOVELTY OF THE PROPOSED SOLUTION) For the feature selection task, the authors combine the feature selections of three different, already published algorithms by selecting the features that have been selected by at least two algorithms. There are already published papers that combine feature selection methods, e.g. Rokach L., Chizi B., Maimon O. (2006) Feature Selection by Combining Multiple Methods. In: Last M., Szczepaniak P.S., Volkovich Z., Kandel A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors combine the results of three methods. They do not justify their choice of the proposed combination; why not select the features of all three feature selection sets? It is also not clear how the best n_1 features (n_1 being the lowest cardinality of all three sets) are selected for each selection algorithm. Also, please justify why you reduced all sets to n_1.
(EVALUATION OF THE STATE-OF-THE-ART) The authors only compare their method with the methods they used as input for their feature selection algorithm. They do not compare their results with the state-of-the-art, e.g. Rokach L., Chizi B., Maimon O. (2006) Feature Selection by Combining Multiple Methods. In: Last M., Szczepaniak P.S., Volkovich Z., Kandel A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors point out that their method only improves one classifier, but they do not discuss why only the Random Forest algorithm benefits from the proposed feature selection method.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The proposed method can be implemented very easily if all necessary details are provided by the authors. The datasets used are publicly available for research purposes. Nevertheless, the proposed method only improves the accuracy of one classifier (Random Forest); other classifiers do not benefit from the reduced number of features. The results for one dataset are missing (there are two identical result tables). The authors report only the accuracy, the true positive rate and the true negative rate, but it is also common in this setting to report recall, precision and/or F1-score. The runtime comparison between the different feature selection methods is redundant, because the runtime of Random Forest depends on the number of features.
(OVERALL SCORE) The authors propose a feature combination method called PRISMO that combines three already published feature selection algorithms (Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing) based on a simple voting system. A feature is selected if it occurs in at least two of the three selected feature sets, each calculated by one of the three methods (each set is trimmed to the length of the set with the lowest cardinality of all three sets). They show that, on a spam classification task, the feature selection method improves the accuracy of one classifier (Random Forest) compared to the three selection methods used as input. The related work section should be restructured (currently "A did that", "B did that"). The authors do not point out how their proposed method differs from other methods. The English would benefit from revision by a native speaker.
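The voting scheme summarized in the overall score above can be sketched as follows. This is a minimal illustration based only on the description in this review, not the authors' Matlab implementation. The function name `combine_by_vote`, the example feature names, and the assumption that each input list is already ordered best-first (the review notes that the paper leaves this ranking unclear) are all hypothetical.

```python
from collections import Counter


def combine_by_vote(ranked_sets, min_votes=2):
    """Keep features chosen by at least `min_votes` of the trimmed sets.

    ranked_sets: one feature list per selection method, assumed (hypothetically)
    to be ordered best-first so that trimming keeps the top features.
    """
    # n_1 in the review: the lowest cardinality among the input sets.
    n1 = min(len(s) for s in ranked_sets)
    # Trim every set to its first n_1 features.
    trimmed = [list(s)[:n1] for s in ranked_sets]
    # Count in how many trimmed sets each feature occurs (one vote per set).
    votes = Counter(f for s in trimmed for f in set(s))
    return sorted(f for f, v in votes.items() if v >= min_votes)


# Hypothetical outputs of GA, PSO and SA; the feature names are invented.
ga = ["followers", "followees", "urls_per_tweet", "account_age"]
pso = ["followers", "hashtags_per_tweet", "urls_per_tweet"]
sa = ["urls_per_tweet", "account_age", "followers", "mentions", "replies"]

selected = combine_by_vote([ga, pso, sa])
print(selected)
```

Setting `min_votes=1` would instead return the union of the trimmed sets, which is the alternative the review asks the authors to justify not using.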
# Strong Points
- interesting research area
- nice illustrations
# Weak Points
- no comparison with the state-of-the-art, and commonly used evaluation measures for the spam classification task are missing
- the proposed method only improves the acc and bm (maybe TNR?) of one classifier (Random Forest) out of 4
- some points are not clear / writing style
# Questions
- In Table 3 there are 62 features, while in the text and in Table 4 there are 60 features for Twitter. Which number of features is correct?
- Why is Table 4 redundant? What are Nil, MAR, POL and BM in Table 4?
Overall, the paper only presents a simple method for combining feature selection methods. The authors do not compare their method with state-of-the-art methods that combine feature selection algorithms. Also, it is unclear how the proposed method can be used for semantic tasks or can improve methods used in semantic tasks.
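Review 3 asks for recall, precision and F1-score in addition to the accuracy, true positive rate and true negative rate that the paper reportedly gives. All of these follow from the four confusion-matrix counts, as the sketch below shows. The function name `spam_metrics` and the example counts are hypothetical and are not taken from the paper; spam is assumed to be the positive class.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute common binary-classification measures from confusion counts.

    tp/fn: spam items classified as spam / as benign.
    tn/fp: benign items classified as benign / as spam.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # identical to the true positive rate
    tnr = ratio(tn, tn + fp)           # true negative rate (specificity)
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "tnr": tnr,
        "precision": precision,
        "f1": f1,
    }


# Invented counts for illustration only.
metrics = spam_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because recall equals the true positive rate, adding precision is enough to derive F1 from what is already reported.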
Metareview by Andreas Hotho
This work deals with the problem of detecting spam in social media by adopting an ensemble learning approach. As all reviewers point out, this paper is out of scope for the conference. The connection to semantics is never shown. Therefore, we can only recommend rejection.