Uludağ University Journal of The Faculty of Engineering, Vol. 25, No. 1, 2020 RESEARCH DOI: 10.17482/uumfd.649003 A MODIFIED FIREFLY ALGORITHM-BASED FEATURE SELECTION METHOD AND ARTIFICIAL IMMUNE SYSTEM FOR INTRUSION DETECTION Melike GÜNAY * Zeynep ORMAN * Received: 20.11.2019; revised: 05.02.2020; accepted: 27.03.2020 Abstract: Intrusion detection systems generally produce high dimensional data in network-based computer systems. It is required to analyze this data effectively and create a successful model by selecting the important features to save only the meaningful data and protect the system against suspicious behaviors and attacks that can occur in a system. Firefly Algorithm (FFA) is one of the most promising meta-heuristic methods which can be used to select important features from big data. In this paper, a modified Firefly Algorithm-based feature selection method is proposed. The traditional Firefly Algorithm is improved by using the K-Nearest Neighborhood (K-NN) classifier and an additional feature selection step. The proposed method is tested on 4 different datasets of various types of attacks. Three different sub-feature sets are obtained for each dataset and the classification performances are compared. Artificial Immune System (AIS) method is also implemented to generate artificial data for the datasets that have an insufficient number of data. This study shows that the proposed Firefly Algorithm performs successfully to decrease the dimension of data by selecting the features according to the obtained accuracy rates of the K-NN method. Memory usage is dramatically decreased over 50% by reducing the dimension with the proposed FFA. The obtained results indicate that this method both saves time and memory usage. Keywords: Firefly Algorithm, Artificial Immune System, K-NN, Feature Selection Saldırı Tespiti için Ateş Böceği Algoritması Tabanlı Özellik Seçim Yöntemi ve Yapay Bağışıklık Sistemi Öz: Saldırı tespit sistemleri, genel olarak, ağ-tabanlı bilgisayar sistemlerinde yüksek boyutlu veri üretmektedir. Sistemi meydana gelebilecek ataklardan ve ağdaki şüpheli hareketlerden korumak ve sadece anlamlı veriyi saklamak için bu yüksek boyutlu verinin etkili bir şekilde analiz edilmesi ve başarılı bir model oluşturulması gerekmektedir. Ateş Böceği Algoritması, büyük veriden önemli özelliklerin seçilmesi için kullanılan en önemli üst-sezgisel algoritmalardan biridir. Bu çalışmada, Ateş Böceği Algoritmasına dayalı yeni bir özellik seçme yöntemi önerilmiştir. Önerdiğimiz bu yöntemde Ateş Böceği Algoritması, K- en yakın komşuluk algoritması ve ek bir özellik seçimi adımı ile iyileştirilmiştir. Önerilen yöntem, çeşitli saldırı türlerini içeren dört farklı veri kümesi ile test edilmiştir. Her veri kümesi için 3 farklı alt özellik kümesi elde edilmiştir ve her birinin sınıflandırmadaki başarısı ölçülerek karşılaştırılmıştır. Ayrıca, Yapay Bağışıklık Sistemi yöntemi ile veri sayısı yetersiz veri kümeleri için yapay veri üretildikten sonra Ateş Böceği Algoritması uygulanmıştır. Bu çalışma, önerilen Ateş Böceği Algoritması’nın, K-en yakın komşuluk yöntemi ile elde edilen sınıflandırma sonuçlarına göre özellikleri seçerek verilerin boyutunu azaltmak için başarılı bir şekilde çalıştığını göstermektedir. Veri boyutunun azaltılması ile hafıza kullanımı da %50’den fazla bir oranda azalmıştır. Elde edilen sonuçlar, önerilen yöntem sayesinde hem zamandan ve hem de hafıza kullanımından tasarruf edildiğini göstermektedir. Anahtar Kelimeler: Ateş Böceği Algoritması, Yapay Bağışıklık Sistemi, K-NN, Özellik Seçimi * İstanbul University-Cerrahpasa, Department of Computer Engineering, 34320, Avcilar/İstanbul Corresponding Author: Zeynep Orman (ormanz@istanbul.edu.tr) 269 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. 1. INTRODUCTION Network-based computer systems are commonly used in different technological areas. The data used in such systems must be protected against the illegal attacks on the network. Intrusion detection systems (IDS) are developed to prevent computer systems from these attacks and provide high-level security. IDS are mostly developed using data mining techniques and rule-based classifiers to determine the patterns that can be used to analyze user behaviors (Lee & Stolfo, 1998). Traditional machine learning techniques can also be applied to big amounts of data that are produced by network systems. The recursive support vector machine (R-SVM) method produces a high accuracy rate to detect abnormal activities and reduces the processing time by extracting the main features (Shang-fu & Zhao, 2012). In recent studies, heuristic algorithms and well-known classification methods are used together to increase accuracy and develop more reliable systems. One of the popular heuristic algorithms that are commonly used in this area is the Genetic Algorithm. It is mentioned that the combination of the Genetic Algorithm and the K-means Algorithm performs better when it is compared with the K-means++ Algorithm (Sukumar, Pranav, Neetish, & Narayanan, 2018). The Firefly Algorithm has been used for feature selection in several studies in the literature. Return-based Binary Firefly Algorithm (Rc-BBFA) was one of the methods that were implemented for feature selection by using FFA (Zhang, Song, & Gong, 2017). In (Li, Kamlesh, Lim, & Neoh, 2017), the Firefly optimization was implemented for feature selection with the combination of classification and regression models. Moreover, FFA obtained some successful results in fingerprint feature extraction (Tariq, Al-Ta'i, & Abdulhameed, 2013). FFA was also used to detect intrusions on the networks in many studies in the literature. An FFA based feature selection method was developed to protect the network from the attacks by using KDD CUP 99 dataset (Selvakumar B, 2018). The Firefly Algorithm is generally used in optimization problems. There are several striking studies that use FFA in recent years for solving optimization problems. One of them was about EEG signals that were needed to recover true brain signals from noises (Majdouli, Bougrine, Rbouh, & Imrani, 2017). A hybrid approach was implemented by using FFA and PSO to solve optimization problems (Aydilek, 2018). Another firefly-based hybrid method was developed for churn prediction (Ahmed & D., 2017). A Firefly- based Algorithm was developed in three phases which were the feature selection phase, the model construction phase and the prediction phase in the study (Mashhour, Houby, & Khaled Tawfik Wassif, 2018). The model was tested with 7 datasets and compared with Ant-miner Algorithm. The most successful approach was found to be the FFA based on distance method for all datasets where accuracy rates vary between 83% and 90%. The Basic Firefly Algorithm is simple but needed to be improved because its accuracy is not enough and has the local optimum problem. To solve these problems, the Firefly Algorithm based on the gender difference algorithm (GDFA) was implemented (Wang & Song, 2019). In this study, different equations were used to determine the movement of fireflies for two subgroups of their genders. As a result, the proposed method presented that the GDFA’s performance was higher than other Firefly-based Algorithms for 30-dimensional problems. In another study, FFA was modified with neighborhood attraction to reduce computational time complexity (Hui, et al., 2017). The fireflies were attracted by only neighbors, not the entire population. The proposed method is tested with well-known benchmark functions and was compared with traditional FFA, variable step size FFA(VSFFA), wise step strategy FFA (WSSFA), memetic FFA (MFA), and FFA with chaos (CFA). The algorithms were ranked by fitness values, and neighborhood attraction FFA (NaFA) is found to be the most successful one. Artificial ants and fireflies were used in a color quantization study (Pérez-Delgado & María- Luisa, 2018). The Artificial Ant Algorithm was supported by the Firefly Algorithm to find the best parameters for image quantization. Ant-Tree for color quantization (ATCQ) algorithm was 270 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 combined with the Firefly Algorithm in this study. The result of the experiments showed that the proposed algorithm is better than ATCQ and FA independently. Another experiment was achieved to solve the flexible job-shop scheduling problem (FJSP) with the Discrete Firefly Algorithm and multi-objective Genetic Algorithm (Lunardi & Voos, 2018). It was discussed that the proposed FFA was effective only for small instances of the problem. According to the experiments, the proposed FFA was faster than the proposed GA. Another comparative study was conducted with the algorithms PSO, Artificial Bee Colonies (ABC), Cuckoo Search (CS) and FFA on graph coloring problem (Aranha, Junior, & Kanoh, 2018). Firefly Algorithm was modified and used to find opinion leaders in social networks (Jain & Katarya, 2019). The fireflies in the algorithm presented people in the social network. The attractiveness of the firefly was represented as user prominence and the distance between fireflies was calculated as centrality in the algorithm. The proposed algorithm showed the best result with 94% accuracy, 96% f1-score measurements in the real data set. In our proposed approach, we modify the traditional FFA (TFFA) with the K-nearest neighborhood (KNN) classification algorithm, which is called the proposed Firefly Algorithm as PFFA. PFFA is developed by using the collaboration of classification (KNN) and probability theory (eq 4) to obtain the best sub-feature set from original features. Features that are common in the feature sets were created by the PFFA (Sknn) and the other features that are obtained by the TFFA (Sr) are used to obtain a new feature set (Scommon). The three feature sets and original features are compared by using the K-NN classification method. The success of feature sets in classifications is measured by using four different datasets that are related to different types of attacks in computer systems. In addition to this, the Artificial Immune System Algorithm is applied to the dataset of the user to root attacks due to the insufficient number of data before FFA is implemented. The sufficient number of data is one of the most crucial topics for the systems that are developed by using machine learning methods. The remainder of this paper is organized as follows. In Section 2, the original FFA and the proposed FFA methods are explained. Detailed information about the datasets and analysis of the four experiments are given in Section 3. Finally, a summarization of the study and the obtained results are discussed in Section 4. 2. METHODS 2.1. Firefly Algorithm Firefly Algorithm is one of the new optimization techniques that is developed by analyzing the flashing behavior of fireflies. It is a metaheuristic and nature-inspired algorithm, which was proposed by Xing-She Yang in 2008 (Eren, B.Küçükdemiral, & Üstoğlu, 2017). The algorithm relies on three important characteristics of fireflies. One of these characteristics is the fireflies’ gender which is known to be unisex. Thus, each firefly can be attracted by any other fireflies. Secondly, the attractiveness and distance between the fireflies have an inverse ratio. If the distance between two fireflies is small, the attractiveness of the fireflies will be high. When a firefly is close to other fireflies, it looks brighter than normal because of this attractiveness. Moreover, less bright fireflies move towards more brighter ones. The third point is that the brightness of a firefly is determined by the objective function. The objective function can be different according to each problem. 2 𝐵(𝑟) = 𝐵 𝑒−𝛾𝑟0 (1) 271 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. The attractiveness of fireflies can be calculated by using equation (1). 𝐵(𝑟) is the attractiveness of a firefly at distance 𝑟 (Marie-Sainte & Alalyani, 2018). 𝐵0 is the attractiveness when the distance (𝑟) is zero. γ is the fixed light absorption coefficient and generally taken as 1. 1 𝑥𝑖 = 𝑥𝑖 + 𝐵(𝑟) ∗ (𝑥𝑘 − 𝑥𝑖) + 𝛼(𝑟𝑎𝑛𝑑 − ) (2) 2 By using the attractiveness formula, the new position of less bright firefly to move to the brighter one is calculated as in equation (2). 𝛼 and 𝑟𝑎𝑛𝑑 are the random numbers in the equation that are uniformly generated between [0,1]. 𝑥 𝑡ℎ 𝑡ℎ𝑖 and 𝑥𝑘 are 𝑖 and 𝑘 fireflies in the population. Distance between two fireflies can be calculated with the Euclidean distance formula as given in equation (3). 𝑑 𝑟(𝑖, 𝑘) = |𝑥⃑⃑⃑⃑𝑖 − 𝑥⃑⃑⃑⃑𝑘 | = √∑(𝑥𝑖𝑗 − 𝑥𝑘𝑗) (3) 𝑗=1 In equation (3), 𝑑 is the dimension of fireflies, 𝑥𝑖𝑗 is the jth dimension of 𝑖 𝑡ℎ firefly and 𝑥𝑘𝑗 is the 𝑗𝑡ℎ dimension of 𝑘𝑡ℎ firefly. 2.2. Proposed Firefly Algorithm Based Feature Selection In this study, our objective function is the accuracy of a classifier by using the dataset with selected feature sets. The population is generated by selecting different feature subsets whose size is smaller than the original dataset. Each dataset with different features is a firefly which means a candidate solution. Our aim is to find the best feature subset for the dataset to perform classification successfully. Distances between fireflies and new positions of subsets are calculated by using equations (2) and (3). The traditional Firefly Algorithm (TFFA) is given in Figure 1. N is the number of features to be selected in the algorithm. We choose 20 as N for this study which means the dataset length will be reduced to 20 from 41. Figure 1: Traditional Firefly Algorithm (TFFA) In our proposed FFA (PFFA), we use the K-nearest neighborhood classifier as the fitness function. The fitness function was used the same as the attractiveness function in the literature. In addition to this, we apply the rule given in equation (4) (Marie-Sainte & Alalyani, 2018). 272 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 If 𝑃(𝑥𝑖)> rand Select the feature 1 ( ) Else where, 𝑃 𝑥𝑖 = 𝑥 (4) 1+𝑒 𝑖 Do not select the feature According to the results of the classifier, fitness values are ranked, and the best feature set (Sknn) for K-NN is obtained. In addition to this, the second feature set is taken from the rule in equation (4) as Sr. In equation 4, 𝑥𝑖 represents the current value of each feature and 𝑃(𝑥𝑖) is the probability of 𝑥𝑖 taking 1. Moreover, we assume that if a firefly is not affected by any other firefly, it should continue its random fly as stated in (Saim, 2017) using equation (5) where 𝑟𝑎𝑛𝑑 is random number generator distributed in [0 1] and 𝛼 is another randomization parameter between [0,1]. The pseudocode of our proposed FFA algorithm is given in Figure 2 in detail. 1 (5) 𝑥𝑖 = 𝑥𝑖 + 𝛼(𝑟𝑎𝑛𝑑 − ) 2 In the PFFA, the accuracy result of the K-NN algorithm is used as a fitness value. Accuracies of fireflies are compared, and the firefly, %, however, which has lower accuracy moves towards the firefly that has higher accuracy. Thus, lower accuracy firefly is updated according to equation (2). After each update, the fitness value is recalculated. The feature set that is selected by K-NN (Sknn) and selected by equation (4) (Sr) is analyzed. Sr is generated by choosing the feature that is selected more than the threshold value which is determined as 3 for this study. The threshold is decided after several trials. As a result, the features that occur in both feature sets are selected as the most effective features for classification or dimension reduction. In addition to two subsets Sknn and Sr, Scommon is created by choosing the common features in both Sknn and Sr. Figure 2: Proposed Firefly Algorithm (PFFA) 273 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. 3. EXPERIMENTS AND RESULTS 3.1 Dataset The dataset is taken from KDD CUP 99 (JR, 1993) that consists of normal flow and attacks to the network. There are 22 different types of attacks in the dataset. Each data is recorded with 41 features. 22 different types of attacks are divided into four categories as the denial of service, remote to local, the user to root and probe (Selvakumar B, 2018). In this study, four attack types in different subsections are analyzed because each attack has different characteristics and different sizes of data. In the experiments, MATLAB R2013b software on Intel Core i7-6700HQ CPU @2.60 GHZ with a 16GB RAM computer is used for the implementation of the FFA algorithm. The first three experiments have the same steps as shown in Figure 3. Proposed Firefly and traditional Firefly Algorithms are implemented to obtain feature subsets Sr and Sknn. By using Sknn and Sr, feature set Scommon is generated as mentioned in the previous section. K-NN classifier is used as the fitness function and accuracy result is the fitness value for Firefly Algorithms. To be able to make a comparison, K-NN is implemented on the original data with 41 features. In addition to the feature set Scommon, other feature sets Sr and Sknn are separately used to classify data. As a result, we get four accuracy and time measurement results for original data, and three selected feature sets that are Sknn, Sr and Scommon. In the fourth experiment, we apply an additional algorithm called Artificial Immune System (AIS) to generate artificial data due to the unsatisfying number of data. Figure 3: General Process Diagram for Experiments 274 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 3.1.1. Experiment-1: Remote to Local (R2L) Attack The first experiment is performed for the R2L attack type. We create our training and testing datasets from KDD CUP 99 dataset by taking 6684 normal data and 1114 data that are tagged as R2L attack. We choose randomly 2339 data for the testing set and 5459 data for the training set. %70 of data is divided as the training set and 30% of data is divided for the testing set. 768 of 5459 training data is tagged as an attack. 346 of 2339 testing data is tagged as an attack. We complete our experiment in three steps. First of all, the dataset is classified with all 41 features by using K-NN. Then, we implement traditional FFA (TFFA). TFFA finds the most successful feature set that consists of 20 features. We classify our dataset with selected 20 features by TFFA. In the last step, our proposed FFA (PFFA) is implemented. As a result of PFFA, we obtain 18 features that are common in two approaches in PFFA. The feature set Sknn that is obtained by TFFA and Sr that is obtained by the second technique in PFFA are given in Table 1. The final feature set Scommon is also given in Table 1. Table 1. Feature Sets selected by TFFA(Sknn) and PFFA (Sr and Scommon) for R2L Attack Type Feature Indexes Sknn 6-17-32-41-22-39-23-31-30-3-16-2-7-14-29-5-12-27-33-25 Sr 2-9-11-12-15-17-19-22-23-24-25-27-32-34-37-38-39-40 Scommon 2-12-17-22-23-25-27-32-39 We measure the accuracy results of TFFA and PFFA. K-NN Algorithm classifies the testing data with 99.79% accuracy using 41 features. TFFA with 20 features that are given as Sknn in Table 1 classifies the testing data with 99.83% accuracy. Our proposed method PFFA reduces the dimension of data from 41x7798 to 9x7798 in the total training set and testing set. In addition to this, the correct prediction rate is 90.51% with 9 features in the feature set Scommon. We also classify the data by using the feature set Sr with 18 features whose accuracy rate 93.67% as seen in Figure 4. Accuracy Measurements with Different Feature Sets 99,83% 99,79% 93,67% 90,51% Sknn Sr Scommon Original Figure 4: Accuracy measurements of different feature sets for R2L attack type 275 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. Changes in the number of features in different feature sets that are obtained by the algorithm are given in Figure 5 to compare the size of reduced datasets with the original. The usage of memory is dramatically increased with the FFA by comparing the accuracy results. Number of Features in Different Feature Sets Original Scommon Sr Sknn 0 5 10 15 20 25 30 35 40 45 Figure 5: Number of Features in Different Feature Sets Processing time that is required for the classification is as important as the memory requirements. The time is measured in seconds as stated in Figure 6. Processing Time(Sec) Original 0,542483 Scommon 0,022989 Sr 0,231711 Sknn 0,237573 0 0,1 0,2 0,3 0,4 0,5 0,6 Figure 6: Processing Time to classify data using feature sets 276 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 3.1.2. Experiment-2: Probe Attack We choose random 8624 data for the testing data set and 20125 data for the training set. Generally, 30% of data is taken as the testing set and %70 of data is taken as the training set. 1224 data in the testing set and 2883 data in the training set are tagged as a probing attack. 7400 data from the testing set and 17242 data from the training set are tagged as normal in total. Similar to experiment-1, we classify our data first originally with K-NN, second with TFFA and lastly PFFA using K-NN. TFFA reduces the number of features from 41 to 20 called feature set Sknn. PFFA reduces the number of features to 5 called Scommon. The second method in PFFA has also produced a feature set called Sr that consists of 11 features. The index of selected features is given in Table 2. Table 2. Feature Sets selected by TFFA(Sknn) and PFFA (Sr and Scommon) for Probe Attack Type Feature Indexes Sknn 27-25-17-4-31-5-15-35-19-10-8-38-29-23-30-36-2-16-13- 14 Sr 1-2-7-10-20-24-26-27-29-36-37 Scommon 2-10-27-29-36 K-NN algorithm classifies the original testing data with 41 features with 99.83% accuracy. TFFA generates a feature set Sknn with 20 features and its correct classification rate is 99.95%. PFFA reaches very few numbers of features that is 5 in the feature set Scommon. Using PFFA, the size of data is reduced by 88%. However, the correct classification rate is decreased by only 2.03%. The accuracy of PFFA using Scommon is 97.92%. In addition to this, the feature set that is obtained in PFFA; Sr is used for classification, and it classifies 98.33% of data correctly as shown in Figure 7. Accuracy Measurements with Different Feature Sets for Probe Attack 100,50% 99,95% 100,00% 99,83% 99,50% 99,00% 98,50% 98,33% 97,92% 98,00% 97,50% 97,00% 96,50% Sknn Sr Scommon Original Figure 7: Accuracy Measurements with Different Feature Sets for Probe Attack 277 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. In addition to accuracy measurements, changes in the number of features are shown in Figure 8. Number of Features in Different Feature Sets for Probe Attack Original Scommon Sr Sknn 0 10 20 30 40 50 Figure 8: Number of Features in Different Feature Sets for Probe Attack When we measure the time that is necessary for classification with the produced feature sets by FFA, the following chart is obtained as given in Figure 9. In addition to memory constraints, time is another critical constraint for processing big data. Processing Time(Sec) Original 3,098764 Scommon 1,590823 Sr 2,597328 Sknn 3,247198 0 0,5 1 1,5 2 2,5 3 3,5 Figure 9: Processing Time of Feature Sets for Probe Attacks 278 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 3.1.3. Experiment-3: DOS Attack For the DOS attack experiment, we choose 9000 data for the testing set and 21000 data for the training set like the previous experiments. 50% of data is tagged as normal and 50% of data is tagged as DOS attack. In addition to this, 10476 of training data tagged as DOS attack and 4524 of the testing set is tagged as a DOS attack. K-NN algorithm classifies the original data with a 99.70 % accuracy rate. The feature sets that are obtained by TFFA and PFFA are given in Table 3. Table 3. Feature Sets selected by TFFA(Sknn) and PFFA (Sr and Scommon) for DOS Attack Type Feature Indexes Sknn 4-26-5-39-3-18-12-38-13-9-21-35-10-6-14-17-33-41-34-25 Sr 1-3-4-5-8-11-12-13-15-16-17-19-20-21-22-23-24-26-29-30-32-36 Scommon 3-4-5-12-13-17-21-26 TFFA with 20 features and PFFA with 8 features classify the testing data with 99.88% and 99.93% accuracy rates as shown in Figure 10. The other feature set Sr that is generated by PFFA is used and give 98.22% accuracy. Accuracy Measurements with Different Feature Sets for DOS Attack 99,88% 99,93% 100,00% 99,70% 99,50% 99,00% 98,50% 98,22% 98,00% 97,50% 97,00% Sknn Sr Scommon Original Figure 10: Processing Time of Feature Sets for DOS Attacks The changes in the number of features after TFFA and PFFA and the processing time of the algorithms are plotted in Figure 11 and Figure 12. 279 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. Number of Features in Different Feature Sets for DOS Attack Original Scommon Sr Sknn 0 5 10 15 20 25 30 35 40 45 Figure 11: Number of Features in Different Feature Sets for DOS Attack Processing Time (Sec) of K-NN Classifier with Different Feature Sets Original 8,438751 Scommon 0,824 Sr 4,65846 Sknn 4,511875 0 1 2 3 4 5 6 7 8 9 Figure 12: Processing Time of K-NN Classifier with different feature sets 3.1.4. Experiment-4: User to Root(U2R) Attack and Artificial Immune System (AIS) There are only 52 data that are tagged as U2R in the dataset. The number of U2R attacks is not enough to complete our experiments. Thus, we need to generate artificial data for U2R attack types. We use the Artificial Immune System Algorithm to generate artificial data. It is a human- inspired algorithm that uses the general properties of the natural immune system. The immune system has some important properties that are used to implement the algorithm. These properties are uniqueness, distributed detection, and self-regulation, approximate detection and pattern matching, diversification, anomaly detection, self-protection and learning, and memorization. By 280 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 using these characteristics properties, the immune system algorithms can be improved. There are several types of immune-based algorithms in the literature as negative selection algorithm, clonal selection algorithm, artificial immune network algorithm, danger theory algorithm (Fernandes, Freire, Fazendeiro, & Inácio, 2017). We prefer to use the clonal selection algorithm in this study. Our goal is to produce a population using 52 data tagged as U2R attack. The immune system recognized the antigens that enter the body and produce new cells to protect the body. The system can remember the same antigens or similar antigens even after many years later and protect itself by producing and generating clones faster than the first time. Using this behavior of the system, the clonal selection algorithm will produce new data and check whether the new data is suitable for the system with affinity measurements in the algorithm. A new clone can be generated by using two existing data (Er, Yumusak, & Temurtas, 2012). We generate new clones by taking two data’s average from U2R. we need a threshold to calculate the affinity of new clones. Average of distances between each couple from U2R data for each feature identify the thresholds E for the features. The affinity of a clone is increased one by one for each feature if the feature value greater than the threshold. We have 41 features, and if 20 of them pass the selection, the clone is added to the population. In the last step, the population is classified using K-NN and the final population is created if the clone’s class is from U2R. In general, we implement the basic steps of AIS that are given in Figure 13. Step 1: Generate clones Step 2: Determine the threshold E for each feature Step 3: Calculate affinity for clones using E Step 4: Classify the data Step 5: If affinity and class is sufficient, add clone to population Figure 13: Pseudocode of AIS With the AIS algorithm, we produce 922 artificial data addition to original data in the dataset that consist of 52 data tagged as U2R attack and 974 data tagged as normal. We choose random 584 data for the testing set and 1364 data for the training set. TFFA selects 20 features (Sknn) where PFFA selects 8 features (Scommon) as shown in Table 4. Table 4. Feature Sets selected by TFFA(Sknn) and PFFA (Sr and Scommon) for U2R Attack Type Feature Indexes Sknn 31-41-35-1-12-13-40-10-19-38-18-30-17-2-7-21-5-8-29-15 Sr 2-5-7-13-15-16-17-21-30-37-39 Scommon 2-5-7-13-15-17-21-30 Accuracy results show that TFFA and PFFA increase the correct prediction rate. TFFA classifies the data with a 99.83% accuracy rate by selecting features to feature set Sknn. Other feature sets that are produced by PFFA, Sr and Scommon also give better results than the original feature set. The accuracy rate of feature sets is given in Figure 14. 281 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. Accuracy Measurements with Different Feature Sets for U2R Attack 100,00% 99,83% 99,49% 99,49% 99,50% 99,00% 98,50% 98,29% 98,00% 97,50% Sknn Sr Scommon Original Figure 14: Accuracy Measurements with Different Feature Sets for U2R Attack The processing time of K-NN differs in different feature sets as similar to previous experiments. There is a correlation between the number of features in the dataset and the processing time of K-NN. The increase in time and number of features can be seen in Figure 15 and Figure 16. Number of Features in Different Feature Sets for U2R Attack Original Scommon Sr Sknn 0 5 10 15 20 25 30 35 40 45 Figure 15: Number of Features in Different Feature Sets for U2R Attack 282 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 Processing Time (Sec) of K-NN Classifier with Different Feature Sets Original 0,048592 Scommon 0,043942 Sr 0,016883 Sknn 0,017006 0 0,01 0,02 0,03 0,04 0,05 0,06 Figure 16: Processing Time of K-NN Classifier with different feature sets 3.2. Results Experiments show that dimension reduction with feature selection using TFFA increases the accuracy rate. In experiment 2, probe attack type classification shows the highest accuracy rate as 99.95% from 99.83% comparing to the original dataset and Sknn. In other experiments, TFFA gives better results than K-NN classification with the original dataset. Accuracy results of experiment-1 for R2L attack and experiment-3 for DOS attack are also increased from 99.79% to 99.83% and from 99.70% to 99.88% after TFFA is implemented. In addition to the accuracy results, the dimension of the data is decreased to a 50% ratio. Our proposed method PFFA generates two different feature sets that are Sr and Scommon. The accuracy results with the feature sets are not greater than the original dataset but a little different in the first two experiments. The results are decreased by 9% for experiment-1, 2% for experiment-2 when Scommon is used. In experiments 3 and 4, we analyze that the accuracy rate is increased by 2% and 5% with Scommon. The results with the feature set Sr are close to Scommon. In Table 5, we compared our proposed data with the most similar studies. The proposed method shows higher result than (B & K, 2019) that same dataset used for DOS, probe and U2R attack types. For other studies in the table except (Tariq, Al-Ta'i, & Abdulhameed, 2013), accuracy results are not better than proposed method. Sr and Scommon feature sets have a slight difference when they are compared with the original dataset but the difference in the memory that the dataset is required, and the processing time have very big changes. Processing times decreased from 0.048 sec to 0.043 sec for U2R attack, from 8.43 sec to 0.824 sec for DOS attack, from 3.098 sec to 1.59 sec for Probe attack and from 0.52sec to 0.0229 sec for R2L attack types with Scommon feature set. In addition to this, the dimension of the data set is decreased by 80% for U2R and DOS, %88 for Probe and 78% for R2L attack type with common feature sets. 283 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. Table 5. Comparison of PFFA with Other Methods in the Literature Method Dataset Accuracy Our Proposed TFFA KDD CUP 99-Original Method Data %99.83 Probe Attack %99.79 R2L %99.70 DOS %98.29 U2R PFFA KDD CUP 99-Scommon Probe Attack %97.92 R2L %90.51 DOS %99.93 U2R %99.49 (Selvakumar B, FFA with Bayesian KDD CUP 99 2018) Network Algorithm Probe Attack %93.42 R2L %97.83 DOS %99.95 U2R %68.97 (Anbu & Mala, SVM with FFA PROMISE %91 2 017) KNN with FFA Software Dataset %88 NB with FFA %87 (Tariq, Al-Ta'i, & Features Extraction of Abdulhameed, Fingerprints using Fingerprint Dataset %100 2013) Firefly Algorithm (Mashhour, A Novel Classifier Lung Dataset %80 Houby, & Khaled based on Firefly Hepatitis Dataset %82 Tawfik Wassif, Algorithm Dermatology Dataset %90 2018) Prostate Dataset %90 Leukemia1 Dataset %83 DLBCL Dataset %90 SRBCT Dataset %90 In this study, we also implement the AIS algorithm to generate artificial data with a clonal selection mechanism due to the insufficient number of data in experiment 4. As we obtained high accuracy results with the dataset, including artificial data, the proposed method is found to be successful for the unbalanced datasets. As a result, PFFA can be preferable when we analyze the accuracy rates of the feature sets that are closed to original feature sets and memory utilization. 4. CONCLUSION In this paper, the traditional Firefly Algorithm was modified to obtain a subset from the features that give the best classification accuracy. We obtained new feature sets from the traditional Firefly Algorithm and the modified Firefly Algorithm. We also created a feature set that consisted of features from the modified and original Firefly Algorithms in common. Classification accuracies for the feature sets and the original feature set were calculated and compared in 4 datasets of intrusion detection. One of the datasets that were about user to 284 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 root (U2R) attacks had only 52 data tagged as attacked and 974 data tagged as normal. When we compared it with other datasets, the number of attacked and normal data in this dataset was recognized to be unbalanced so as to apply the FFA. Thus, before the feature selection step, the Artificial Immune System (AIS) Algorithm was applied to generate artificial data to support this dataset. By using AIS, 922 additional data was generated. When the results were compared, it was determined that the feature set obtained from TFFA (Sknn) gave better accuracy rates than the original feature set. TFFA provided to decrease the number of features from 40 to 20. Although the selected feature sets obtained from PFFA (Scommon) gave lower classification accuracies than Sknn for all datasets, there was a slight difference between them. On the other hand, we could decrease the dimension of data with PFFA more than 65% on average. For that reason, PFFA was found to be successful in both accuracy results and memory saving. In addition to this, according to time measurements to classify data after PFFA was implemented, the method could also be used to save time. Although PFFA was found to be successful, it should be enhanced to reach the success of TFFA for classification accuracy. For this purpose, equations (4) and (5) could be improved, and some other different classification methods could be used to calculate the attractiveness of fireflies and to test the overall system in the future. Moreover, equation (3) is used to calculate the distance but a big difference between one feature may decrease the effect of another feature difference. Therefore, there is a gap in the distance formula for large dimensional data. Equation (3) may be modified to make more strength calculation. In addition to improved equations, the firefly algorithm can be modified and used with other methods, including standard machine learning algorithms and heuristic approaches. This study and the studies we mention applied firefly algorithm with continues variables due to algorithm’s computing architecture, but the algorithm can be modified for the binary datasets in the future to be able to expand the implementation area of the algorithm. ACKNOWLEDGMENTS This work was supported by Council of Higher Education of Turkey. Project number: MEV- 2018-863 REFERENCES 1. Aranha C., Junior J. P., & Kanoh, H. (2018). Comparative study on discrete SI approaches to the graph coloring problem, Genetic and Evolutionary Computation Conference, Kyoto, Japan, 15-19. doi:10.1145/3205651.3205664 2. Anbu M., & Mala G. S. (2019). Feature selection using firefly algorithm in software defect prediction. Cluster Computing, 22, 10925–10934. doi:10.1007/s10586-017-1235-3 3. Aydilek İ. B. (2018). A hybrid firefly and particle swarm optimization algorithm for computationally expensive numerical problems, Applied Soft Computing, 66, 232-249. doi:10.1016/j.asoc.2018.02.025 4. B Selvakumar., & K Muneeswaran. (2018). Firefly algorithm based feature selection for network intrusion detection. Computers & Security, 81, 148-155. doi:10.1016/j.cose.2018.11.005 5. Er O., Yumusak N., & Temurtas, F. (2012). Diagnosis of chest diseases using artificial immune system, Expert Systems with Applications, 39(2), 1862-1868. doi:10.1016/j.eswa.2011.08.064 285 Günay M., Orman Z.: A Modified Firefly Alg.-Based Featr. Select. Met. And Artf. Immune Systm. 6. Eren Y., Küçükdemiral İ., & Üstoğlu İ. (2017). Introduction to Optimization, In Optimization in Renewable Energy Systems, 27-74, Elsevier Butterworth-Heinemann. ISBN:9780081010419, 0081010419 7. Fernandes, D. A., Freire, M. M., Fazendeiro, P., & Inácio, P. R. (2017). Applications of artificial immune systems to computer security: A survey, Journal of Information Security and Applications, 35, 138-159. doi:10.1016/j.jisa.2017.06.007 8. Hui W., Wenjun W., Xinyu Z., Hui S., Jia Z., Xiang Y., & Zhihua C. (2017). Firefly algorithm with neighborhood attraction. Information Sciences, 382-383, 374-387. doi:10.1016/j.ins.2016.12.024 9. Jain L., & Katarya R. (2019). Discover opinion leader in online social network using firefly algorithm, Expert Systems With Applications, 122, 1-15. doi: 10.1016/j.eswa.2018.12.043 10. JR, Q. (1993). C4.5: Programs for Machine Learning. Erişim Adresi: https://github.com/defcom17/NSL _ KDD (Erişim Tarihi: 12.11.2018) 11. Lee W., & Stolfo S. J. (1998). Data Mining Approaches for Intrusion Detection, Proceedings of the 7th USENIX Security Symposium, San Antonio, Texas: Usenix, 1-15. doi:10.5555/1267549.1267555 12. Li Z., Kamlesh M., Lim C. P., & Neoh S. C. (2017). Feature selection using firefly optimization for classification and regression models, Decision Support Systems, 106, 64-85. doi: 10.1016/j.dss.2017.12.001 13. Lunardi W. T., & Voos H. (2018). Comparative study of genetic and discrete firefly algorithm for combinatorial optimization, Proceedings of the 33rd Annual ACM Symposium on Applied Computing, Pau, France, 300-308. doi:10.1145/3167132.3167160 14. Majdouli M. A., Bougrine, S., Rbouh, I., & Imrani, A. A. (2017). A Comparative Study of the EEG Signals Big Optimization problem using evolutionary, swarm and memetic computation algorithms, The Genetic and Evolutionary Computation Conference, Berlin, Germany, 1357-1364. doi:10.1145/3067695.3082489 15. Marie-Sainte, S. L., & Alalyani, N. (2020). Firefly Algorithm based Feature Selection for Arabic Text Classification, Journal of King Saud University- Computer and Information Sciences, 32(3), 320-328. doi:10.1016/j.jksuci.2018.06.004 16. Mashhour E. M., Houby E. M., & Khaled Tawfik Wassif, A. I. (2018). A Novel Classifier based on Firefly Algorithm, Journal of King Saud University – Computer and Information Sciences, In Press, Corrected Proof. doi:10.1016/j.jksuci.2018.11.009 17. Pérez-Delgado, & María-Luisa. (2018). Artificial ants and fireflies can perform colour quantisation, Applied Soft Computing Journal, 73, 153-177. doi:10.1016/j.asoc.2018.08.018 18. Saim B. (2017). Retrieved from Bilal Saim Website: https://bilalsaim.com/ates-bocegi- algoritmasi-fafirefly-algorithm-h1635 (Erişim Tarihi: 06.11.2019 ) 19. Shang-fu, G., & Zhao, C.-I. (2012). Intrusion Detection System Based on Classification, 2012 IEEE International Conference on Intelligent Control, Automatic Detection and High-End Equipment, Beijing, China, 78-83. doi:10.1109/ICADE.2012.6330103 20. Sukumar J. V., Pranav I., Neetish, M., & Narayanan, J. (2018). Network Intrusion Detection Using Improved Genetic k-means Algorithm, International Conference on Advances in Computing, Communications and Informatics, Bangalore, India, 2441-2446. doi:10.1109/ICACCI.2018.8554710 286 Uludağ University Journal of The Faculty of Engineering, Vol. 25, No.1, 2020 21. Tariq, Z., Al-Ta'i, M., & Abdulhameed, O. Y. (2013). Features extraction of fingerprints using firefly algorithm, Proceedings of the 6th International Conference on Security of Information and Networks, Aksaray, Turkey, 392-395. doi:10.1145/2523514.2527014 22. Wang, C.-F., & Song, W.-X. (2019). A novel firefly algorithm based on gender difference and its convergence, Applied Soft Computing Journal, 80, 107-124. doi:10.1016/j.asoc.2019.03.010 23. Zhang, Y., Song, X.-f., & Gong, D.-w. (2017). A return-cost-based binary firefly algorithm for feature selection, Information Sciences, 418, 561-574. doi:10.1016/j.ins.2017.08.047 287 288