Uludağ University Journal of The Faculty of Engineering, Vol. 23, No. 3, 2018

RESEARCH    DOI: 10.17482/uumfd.435723

A COMPARATIVE STUDY FOR HYPERSPECTRAL DATA CLASSIFICATION WITH DEEP LEARNING AND DIMENSIONALITY REDUCTION TECHNIQUES

Gizem ORTAÇ*
Gıyasettin ÖZCAN**

Received: 25.06.2018; revised: 27.09.2018; accepted: 16.10.2018

* Bursa Technical University, Faculty of Engineering and Natural Sciences, Department of Computer Engineering, 16330 Bursa, Turkey
** Uludağ University, Faculty of Engineering, Department of Computer Engineering, 16059 Bursa, Turkey
Correspondence Author: Gıyasettin ÖZCAN (gozcan@uludag.edu.tr)

Abstract: In recent years, hyperspectral imaging has become a popular subject in the remote sensing community, since it provides a rich amount of information for each pixel of the imaged fields. In statistical pattern classification, dimensionality reduction techniques are generally applied before classification to handle the high-dimensional and highly correlated feature space. However, traditional classifiers and dimensionality reduction methods struggle in the spectral domain and cannot extract sufficiently discriminative features. Recently, deep convolutional neural networks have been proposed to classify hyperspectral images directly in the spectral domain. In this paper, we present a comparative study of traditional dimensionality reduction techniques and a convolutional neural network. The results obtained on hyperspectral image data sets show that the proposed CNN architecture improves classification performance over traditional methods, increasing the classification accuracy rate by 3% and 6%.

Keywords: Hyperspectral Imaging, Deep Learning, Dimensionality Reduction, Classification, Convolutional Neural Networks

Hiperspektral Verilerin Sınıflandırmasında Derin Öğrenme ve Boyut İndirgeme Tekniklerinin Karşılaştırılması

Öz: Son yıllarda, hiperspektral görüntüleme yüzey pikselleri ile ilgili zengin miktarda bilgi sağlamasıyla uzaktan algılama alanında popüler bir konu olmuştur. Genel olarak, elde edilen yüksek boyutlu ve ilişkisel veriyi işlemek için, sınıflandırmadan önce boyut indirgeme teknikleri uygulanmaktadır. Bununla birlikte geleneksel sınıflandırıcılar ve boyut azaltma yöntemleri, spektral alanda hala zorlu bir işlemdir ve ayırt edici öznitelikler çıkarmaz. Son zamanlarda ise derin konvolüsyonel sinir ağları, hiperspektral görüntüleri doğrudan spektral alanda sınıflandırmak için geliştirilmiştir. Önerilen çalışmada, geleneksel sınıflandırma ve konvolüsyonel sinir ağları arasında karşılaştırmalı bir çalışma ve analiz yapılmıştır. Çeşitli hiperspektral görüntü verilerine dayanarak elde edilen sonuçlar, önerilen konvolüsyonel sinir ağının, geleneksel yöntemlerden %3 ve %6 oranında daha iyi bir sınıflandırma oranı sağladığını göstermiştir.

Anahtar Kelimeler: Hiperspektral Görüntüleme, Derin Öğrenme, Boyut Azaltma, Sınıflandırma, Konvolüsyonel Sinir Ağları

1. INTRODUCTION

Hyperspectral remote sensing imaging technology (HSI) is widely used for monitoring the Earth's surface (Chang, 2003; Hsieh, 1998). In contrast to traditional multispectral sensors with low spectral resolution, hyperspectral remote sensing imaging has advanced with the development of hyperspectral sensors and provides better discrimination among ground-cover classes (Scott, 2015).
The sensors provide a vast amount of spectral and spatial information, comprising highly correlated and very narrow spectral bands at specific spectral frequencies. This information is exploited in HSI classification applications such as agriculture, environmental management, urban planning, mineral detection and urban mapping (Liang et al., 2016). A hyperspectral image comprises two-dimensional images acquired at a series of wavelengths, and spectral information is provided by the grey level of the same pixel at each wavelength (Wang et al., 2018). Traditional HSI classification is based on a pixel-wise approach (Landgrebe, 2005) that classifies each pixel by its digital numbers and reflectance values from different spectral bands. In general, this classification performs well thanks to the high spatial and spectral resolution, although some pitfalls can affect the results negatively. For instance, collecting training samples and spectral information (i.e., hundreds of correlated spectral bands) is complex and gives rise to the Hughes phenomenon (Hughes, 1968). As a consequence, classification accuracy may be insufficient. The Hughes phenomenon, also known as the curse of dimensionality, emerges when the number of features and the number of available training samples are unbalanced, and it can cause complete failure of traditional classifiers (Bazi et al., 2006). On the other hand, the classification process can also suffer from high-resolution images, since they can increase the intra-class variation or decrease the inter-class variation in both the spectral and spatial domains (Chen et al., 2011). In the literature, various studies have been carried out to overcome these issues. The studies are based on the following approaches (Bazi et al., 2006): 1) the use of the sample covariance matrix (Hoffbeck et al., 1996a; Tadjudin et al., 1999); 2) the exploitation of classified samples (Shahshahani, 1994; Jackson, 2001); 3) reducing/transforming the original feature space to a lower dimensionality with feature selection/extraction techniques (Lee et al., 1993; Jimenez et al., 1999); 4) modeling the class spectral signatures with shape description techniques (Hoffbeck et al., 1996b; Tsai et al., 2002); and 5) support vector machine (SVM) classifiers (Gualtieri et al., 2000; Huang et al., 2002; Melgani et al., 2004; Camps-Valls et al., 2004; Foody et al., 2004; Camps-Valls et al., 2006; Pal et al., 2005). Regarding classification, the transformation of a hyperspectral image into a meaningful domain without losing the relevant object information has recently become an important research topic. Ideally, the reduced image should correspond to a minimum number of variables for efficient image modeling. Instead of using the full set of spectral bands, dimensionality reduction techniques are effective methods for data processing and for finding the class-specific subspace. However, determining the most effective dimensionality reduction technique is difficult in practice. In the early stage, spectral-based methods, including principal component analysis (PCA) (Licciardi et al., 2012), independent component analysis (ICA) (Villa et al., 2011) and linear discriminant analysis (Villa et al., 2011), can be thought of as linear transformations that extract better features of the input image in lower dimensions (Bruce et al., 2002; Jimenez et al., 1999).
Nonetheless, linear transformation-based methods are suitable neither for analyzing inherently nonlinear hyperspectral data (Chen et al., 2016) nor in the presence of interference sources such as striping (Chang et al., 1999). In recent years, deep learning-based methods have also provided promising results by exploring higher-level and more effective spatial features (Fang et al., 2014). In the computer vision field, deep learning methods are designed as automatic multi-layer feature learning and exploration tools that use non-linear activation functions and provide more robust features compared to lower-level ones (Fang et al., 2014). In deep learning, convolutional neural networks (CNNs) play a dominant role, benefit from efficient GPU implementations, and have recently outperformed conventional methods (Hinton et al., 2006). However, CNNs have mostly been used for visual recognition problems and are a relatively new approach for hyperspectral image classification. A convolutional neural network extracts spectral and spatial feature maps with linear convolution filters followed by nonlinear activation functions. The classical CNN was proposed by LeCun, and CNNs have recently become popular in image processing applications including object detection (Bruna et al., 2015), face recognition (Sun et al., 2014) and image denoising (Li, 2014). In recent works, convolutional neural networks have been used to learn discriminating features for classifying hyperspectral images adaptively. For instance, Hu et al. (2015) developed a deep convolutional neural network and compared the experimental results with some traditional methods. The experimental results on different hyperspectral datasets showed that the proposed neural network architecture, which contained five layers with weights, achieved better classification performance. Also, Chen et al. (2016) presented a CNN-based deep feature extraction method for HSI classification. The proposed method was evaluated on three public hyperspectral datasets against state-of-the-art methods and provided competitive results. Yu et al. (2017) introduced an efficient CNN architecture that overcomes some limitations such as over-fitting. The designed architecture incorporated different principles such as data augmentation, larger drop rates and discarding max-pooling layers. The experimental results for different hyperspectral datasets showed that a well-designed deep CNN model can achieve better classification performance. In summary, reduction of the spectral information is a necessary pre-processing step for hyperspectral analysis. However, these methods are affected by the small number of training samples, since they usually require many samples, and they suffer from the imbalance between the dimensionality of the data and the limited availability of training samples. In this work, we develop a 2-D deep CNN model for classifying hyperspectral data after building an appropriate architecture. The model presents a powerful tool to extract the spatial feature representation. We also provide a comparative study with traditional classifiers. This paper is organized as follows: In Section 2, a brief introduction to CNNs and dimensionality reduction is presented. In Section 3, the CNN architecture and training process are presented.
In Section 4, we experimentally compare the performance of the CNN with the classification of lower-dimensional hyperspectral datasets generated by different dimensionality reduction techniques. Finally, we summarize our experimental results in Section 5.

2. DEFINITIONS AND RELATED WORK

In this section, some general aspects of CNNs and dimensionality reduction in hyperspectral image classification are presented.

2.1. Convolutional Neural Networks

A CNN is a special type of feed-forward neural network that is composed of one or more pairs of convolution layers and pooling layers. A CNN architecture can be designed for different tasks such as image classification (Agarwal et al., 2007), speech recognition (Xu et al., 2015) and text recognition (Tuia et al., 2014). However, relatively few CNN techniques for HSI classification exist in the literature. In general, a CNN is composed of convolutional layers, pooling layers and fully connected layers. A convolutional layer extracts feature maps from the previous layer by using linear convolution filters. At least one layer of nonlinear activation functions (e.g., rectifier, sigmoid, tanh, etc.) is applied to obtain the output feature map. Let $X \in \mathbb{R}^{N \times M}$ be a training input image or layer, let $n \times n$ be a square region extracted from the image, and let $w$ be a weighted filter (kernel) of size $m \times m$. The output of the layer is computed as:

$$ h_{ij}^{l} = f\Big( \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} w_{ab}\, x_{(i+a)(j+b)}^{l-1} + b_{ij}^{l} \Big) \qquad (1) $$

(Fotiadou et al., 2017), where $b$ is the bias term and $f(\cdot)$ is the activation function of the neuron. Each neuron in the convolutional layer is associated with a spatial location $(i, j)$ of the input image. The pooling layer aggregates local features from adjacent pixels to tolerate small deformations of objects. The input is partitioned into a set of patches, and the maximum or mean value is returned for each partition. By pooling, down-sampled input maps are created to reduce the computational complexity of the upper layers. The pooling operation is formulated as:

$$ h_{ij}^{l} = f\Big( \beta_{j}^{l}\, \mathrm{down}\big( h_{ij}^{l-1} \big) + b_{ij}^{l} \Big) \qquad (2) $$

(Fotiadou et al., 2017), where $\mathrm{down}(\cdot)$ is the sub-sampling function that operates over each distinct patch of the input feature map and $\beta$ is the multiplicative bias of the output feature map. The last layer is generally a fully-connected layer with a softmax function that generates the probability of class membership for each unit; the number of neurons in the softmax layer is equal to the number of classes to be categorized. In general notation, the value at position $(x, y)$ of the $j$th feature map in layer $l$ can be defined as (Liang et al., 2016):

$$ v_{lj}^{xy} = f\Big( \sum_{m} \sum_{h=0}^{H_j - 1} \sum_{w=0}^{W_j - 1} k_{ljm}^{hw}\, v_{(l-1)m}^{(x+h)(y+w)} + b_{lj} \Big) \qquad (3) $$

where $l$ is the layer being processed and $j$ indexes the feature maps in layer $l$. $v_{lj}^{xy}$ is the output at position $(x, y)$ in that feature map and layer, $m$ indexes the feature maps in the $(l-1)$th layer connected to the current ($j$th) feature map, and $k_{ljm}^{hw}$ is the value at position $(h, w)$ of the kernel connected to the $j$th feature map. $H_j$ and $W_j$ refer to the height and width of the spatial convolution kernel, respectively (Chen et al., 2016).
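To make Eqs. (1)–(3) concrete, the following minimal NumPy sketch implements the convolution of Eq. (1) with a ReLU activation and the down-sampling of Eq. (2) realized as max pooling, as used later in the proposed network. The multiplicative bias $\beta$ and the per-position biases are simplified to a single shared bias (or omitted), and the array sizes and variable names are purely illustrative.

```python
import numpy as np

def relu(z):
    # f(.) in Eqs. (1)-(2): rectified linear activation
    return np.maximum(z, 0.0)

def conv2d_valid(x, w, b):
    # Eq. (1): h_ij = f( sum_a sum_b w_ab * x_(i+a)(j+b) + b ), single channel,
    # with one shared bias b for simplicity
    m = w.shape[0]
    H, W = x.shape[0] - m + 1, x.shape[1] - m + 1
    h = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            h[i, j] = np.sum(w * x[i:i + m, j:j + m]) + b
    return relu(h)

def max_pool(h, p=2):
    # Eq. (2) with down(.) realized as a max over non-overlapping p x p patches
    # (multiplicative bias beta and additive bias omitted, i.e. beta = 1, b = 0)
    H, W = h.shape[0] // p, h.shape[1] // p
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = h[i * p:(i + 1) * p, j * p:(j + 1) * p].max()
    return out

x = np.random.rand(10, 10)           # toy single-band input map
w = np.random.randn(3, 3) * 0.1      # 3x3 convolution kernel
feature_map = conv2d_valid(x, w, b=0.05)
pooled = max_pool(feature_map)       # down-sampled map passed to the next layer
print(feature_map.shape, pooled.shape)   # (8, 8) (4, 4)
```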
In the proposed network, a hyperspectral image is considered as a 3D tensor of dimensions $h \times w \times c$, where $h$ and $w$ refer to the height and width of the image and $c$ is the number of spectral bands (channels). The images are decomposed into square patches to align with the specific nature of CNNs. Each square patch contains the spectral and spatial information used to classify a specific pixel $p_{xy}$. Here $l_{xy}$ is the class label of the pixel at location $(x, y)$ and $w_{xy}$ is the patch centered at pixel $p_{xy}$. Finally, the dataset is formed as $D = \{(w_{xy}, l_{xy})\}$ for $x = 1, 2, \dots, w$ and $y = 1, 2, \dots, h$. Patch $w_{xy}$ is itself a 3D tensor of dimension $s \times s \times c$; it contains the spectral and spatial information of the pixel located at $(x, y)$, and $c$ corresponds to the number of spectral bands (Makantasis et al., 2015).

2.2. Dimensionality Reduction Techniques

Hyperspectral images are composed of several hundred images obtained at different frequencies. In general, classification ability increases with more detailed information about the land cover. However, several factors make pixel classification challenging, such as high spectral resolution, insufficient training samples and a large number of bands; these factors also significantly increase computational time. Dimensionality reduction transforms the data into a lower-dimensional space. It is an effective way to eliminate irrelevant variance in the data and to extract low-dimensional features that retain the desired information. Instead of using all the spectral bands, a lower-dimensional representation in a better class-specific subspace can effectively improve classification performance. In this study, principal component analysis (PCA), linear discriminant analysis (LDA), independent component analysis (ICA), factor analysis (FA) and truncated singular value decomposition (truncated SVD) are applied as classical dimensionality reduction methods. PCA (Fukunaga, 2013) is the most widely used unsupervised dimensionality reduction method; it removes the dependencies among the spectral bands by eigenvector decomposition and is therefore often used in hyperspectral image processing (Rodarmel et al., 2002). It generates a lower-dimensional representation of the data that describes as much of the variance as possible, keeping the most significant singular vectors for the projection of the data (Lee et al., 1993). In this study, PCA utilizes singular value decomposition (SVD): SVD performs PCA through diagonalization of the covariance matrix, and the principal components of the data are calculated in a more efficient and robust way (Wall et al., 2003). LDA seeks the best projection that maximizes the between-class scatter while minimizing the within-class scatter. It optimizes the Fisher score and does not require the tuning of free parameters. For these reasons, LDA is extensively used in remote sensing and hyperspectral imaging for feature reduction (Bandos et al., 2009). Another linear dimensionality reduction method, truncated SVD, is also applied in this study; unlike PCA, this method does not center the data before computing the singular value decomposition (Halko et al., 2011). FA is a linear statistical method that derives latent factors from the observed variables to replace the original data (Bartholomew et al., 2008). It is a useful generative model for high-dimensional data since it allows different regions of the input space to be modeled by local factors (Wang et al., 2015). In this study, the effectiveness of the CNN-based model is tested by comparing different dimensionality reduction and classification methods on the low-dimensional data.
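As an illustration of how the five reduction methods described above can be applied in practice, the following scikit-learn sketch projects a generic band matrix X (pixels × bands) onto a lower-dimensional space. The placeholder data, the choice of n_components and the use of FastICA for ICA are assumptions for the sketch, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data: rows are pixels, columns are spectral bands
X = np.random.rand(1000, 200)            # e.g. 200 bands, as for Indian Pines
y = np.random.randint(0, 16, size=1000)  # 16 land-cover classes

n_components = 10  # iterated from 2 to 50 in the experiments

reducers = {
    "PCA": PCA(n_components=n_components),                    # SVD-based, centers the data
    "ICA": FastICA(n_components=n_components, max_iter=500),
    "FA": FactorAnalysis(n_components=n_components),
    "TruncatedSVD": TruncatedSVD(n_components=n_components),  # no centering, unlike PCA
    # LDA is supervised and limited to (n_classes - 1) components
    "LDA": LinearDiscriminantAnalysis(
        n_components=min(n_components, len(np.unique(y)) - 1)),
}

reduced = {}
for name, reducer in reducers.items():
    if name == "LDA":
        reduced[name] = reducer.fit_transform(X, y)   # uses the class labels
    else:
        reduced[name] = reducer.fit_transform(X)
    print(name, reduced[name].shape)
```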
3. MATERIAL AND METHODS

3.1. Hyperspectral Datasets

For the experiments, we use the Indian Pines and Pavia University hyperspectral datasets, which are prominent and publicly available. The Indian Pines dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over a test site in north-western Indiana, USA, in 1992. The dataset contains 145 × 145 pixels with 20 m spatial resolution and 224 spectral bands in the wavelength range 0.4–2.5 µm. Twenty water-absorption bands ([104–108], [150–163], 220) are removed, and 200 bands are used in the experiments. The dataset contains 10,249 labeled samples and a ground-truth map with 16 classes (Gamba, 2004). The Pavia University dataset (Engineering School at the University of Pavia, Pavia, Italy) was obtained by the Reflective Optics System Imaging Spectrometer (ROSIS-03) airborne optical sensor. The dataset has 610 × 340 pixels with a spatial resolution of 1.3 m and 103 spectral bands in the wavelength range 0.43–0.86 µm. The Pavia University dataset has a ground-truth map of 9 classes and 42,776 labeled samples (pixels) (Huang et al., 2009).

3.2. Experiment Setup

Different experiments are performed to evaluate the performance of the classification and convolutional neural network approaches in a Python (version 3, 64-bit) environment with the TensorFlow library (Abadi et al., 2016). The results are generated on a PC equipped with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.8 GHz and 16.00 GB of RAM.

Figure 1: The Indian Pines hyperspectral data; a. a sample band and b. ground-truth map of the Indian Pines dataset (sixteen land-cover classes)

Figure 2: The Pavia University hyperspectral data; a. a sample band and b. ground-truth map of the Pavia University dataset (nine land-cover classes)
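Before detailing the architecture, the following sketch shows one way the labeled pixels of such a scene can be converted into the s × s × c patches defined in Section 2.1. It assumes the scene and its ground truth have been downloaded as MATLAB files from the public repository cited in the references; the file and variable names are illustrative and may differ from the actual files.

```python
import numpy as np
from scipy.io import loadmat

# Assumed file/variable names; the .mat files from the public repository may differ.
cube = loadmat("Indian_pines_corrected.mat")["indian_pines_corrected"]  # (145, 145, 200)
gt = loadmat("Indian_pines_gt.mat")["indian_pines_gt"]                  # (145, 145)

def extract_patches(cube, gt, s=5):
    # Pad the cube so that border pixels also get full s x s neighborhoods
    r = s // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    patches, labels = [], []
    for x in range(cube.shape[0]):
        for y in range(cube.shape[1]):
            if gt[x, y] == 0:          # 0 = unlabeled background, skipped
                continue
            patches.append(padded[x:x + s, y:y + s, :])  # s x s x c patch w_xy
            labels.append(gt[x, y] - 1)                  # class label l_xy (0-based)
    return np.asarray(patches, dtype="float32"), np.asarray(labels)

patches, labels = extract_patches(cube, gt, s=5)
print(patches.shape, labels.shape)   # approx. (10249, 5, 5, 200) (10249,)
```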
3.3. The Architecture of the Proposed CNN

We present the architecture of our CNN in Figure 3. There are two convolutional layers in the network. The convolutional kernel size of the first convolutional layer is 5×5 and the number of feature maps in this layer is 200. The number of feature maps of the second layer is 100 and its kernel size is 3×3. After each convolution step, 2×2 max-pooling is applied on each channel. After these processes, we "flatten" the data, i.e., stretch it into a 1-D vector, and feed it into two fully connected layers with 150 and 50 nodes. The output-layer size is set to the total number of classes. The ReLU non-linearity is selected as the activation function for the output of every convolutional layer.

Figure 3: The architecture of the proposed CNN for HSI classification

In Table 1, we present the scheme of the proposed architecture in more detail. First, the hyperspectral images are split into 3-D patches. The size of the neighboring regions (patch size) in pixels is 5×5×200 for Indian Pines and 5×5×103 for Pavia University. The created patch data are divided into batches, where the batch size is the number of instances used in one iteration. Then, the batches are reshaped into two-dimensional images and sent as the input volume to the first convolutional layer, Conv1. After applying the ReLU function, the feature maps generated by the first convolutional layer are sent to the first max-pooling layer (Pool1) with a 2×2 kernel. The resulting output volume is sent to the last convolutional layer (Conv2) with a 3×3 filter size. Again, after applying the ReLU function, the feature maps generated by Conv2 are sent to the second max-pooling layer (Pool2) with a 2×2 kernel. Since there is no third convolutional block, the output volume of Pool2 is reshaped (flattened) and sent to the fully-connected layers. Three fully-connected layers are implemented in the network. The first two fully-connected layers (F1 and F2) compute their outputs from their weights, their biases, the output of the previous layer and the ReLU activation function. Finally, the last fully-connected layer (F3) computes the outputs of the network with a softmax function.

To minimize the loss function of the network, the back-propagation algorithm is generally used; mostly, variations of the stochastic gradient descent (SGD) algorithm are applied to optimize the parameters (Liang et al., 2016). These optimizers require careful initialization and adjustment of the model hyper-parameters, such as the learning rate, which controls how strongly the weights of the network are updated with respect to the loss gradient. In this work, the Xavier initializer (Glorot et al., 2010) is used to initialize all weights and biases of the network, and the Adam optimizer is implemented to optimize the trainable parameters k and b (Kingma et al., 2014). The Adam optimizer has various advantages, such as working with sparse gradients, naturally performing a form of step-size annealing, and parameter updates that are invariant to rescaling of the gradient (Kingma et al., 2014). In this study, cross-entropy is used as the loss of the CNN, measuring the deviation between the target and predicted labels. The network is trained by minimizing the cross-entropy loss with the Adam optimizer (Kingma et al., 2014).

Table 1. The configuration of the 2-D convolutional neural network

Dataset            Patch Size   Conv1   ReLU   Pool1   Conv2   ReLU   Pool2   F1                F2                F3
Indian Pines       5x5x200      5x5     Yes    2x2     3x3     Yes    2x2     Fully connected   Fully connected   1x16
Pavia University   5x5x103      5x5     Yes    2x2     3x3     Yes    2x2     Fully connected   Fully connected   1x9
Feature maps       -            200     -      -       100     -      -       150               50                -

The parameters are updated according to the error derivatives: k and b are first determined by applying back-propagation; then new error derivatives are generated with a feed-forward step, and these derivatives are used to update the parameters in the next round. The feed-forward and back-propagation steps are repeated until optimal k and b are obtained or a predefined number of iterations is reached (Liang et al., 2016). In our study, the number of training iterations is set to 2000.
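A minimal tf.keras sketch of the configuration in Table 1 (Indian Pines case) is given below. It is one possible realization rather than the authors' exact TensorFlow code: 'same' padding is assumed so that the small 5×5 patches survive the two pooling steps, and Xavier (Glorot) initialization is the framework default for the dense and convolutional layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(patch_size=5, bands=200, n_classes=16):
    # Layer sizes follow Table 1; 'same' padding is an assumption so that the
    # 5x5 input patches remain poolable after each convolution.
    model = models.Sequential([
        layers.Input(shape=(patch_size, patch_size, bands)),
        layers.Conv2D(200, (5, 5), padding="same", activation="relu"),   # Conv1 + ReLU
        layers.MaxPooling2D((2, 2)),                                     # Pool1
        layers.Conv2D(100, (3, 3), padding="same", activation="relu"),   # Conv2 + ReLU
        layers.MaxPooling2D((2, 2)),                                     # Pool2
        layers.Flatten(),
        layers.Dense(150, activation="relu"),            # F1
        layers.Dense(50, activation="relu"),             # F2
        layers.Dense(n_classes, activation="softmax"),   # F3 (softmax output)
    ])
    # Glorot (Xavier) initialization is the tf.keras default for Conv2D/Dense;
    # Adam minimizes the cross-entropy loss, as described in the text.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
# Illustrative training call, assuming patches/labels as in the earlier sketch:
# model.fit(train_patches, train_labels, batch_size=len(train_patches) // 100, epochs=20)
model.summary()
```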
3.4. Application of Different FE Methods and Classifiers

Hyperspectral images are high-dimensional data with a limited number of training samples. Since training supervised classifiers is time-consuming and costly, a small part of the data is used for training the classifiers. In this set of experiments, the CNN was compared with different dimensionality reduction techniques through their classification results. In the dimensionality reduction step, we utilized Python's scikit-learn machine learning package (Pedregosa et al., 2011). For a detailed comparison, we tested the various unsupervised and supervised dimensionality reduction techniques described in Section 2. The number of reduced dimensions is iteratively increased to find an appropriate dimension for each technique. After dimensionality reduction is applied and the new data are obtained, the data are divided into 10 groups called folds: the reduced data are split into k mutually exclusive subsets of equal size, and each subset is used in turn for testing while the remaining subsets are used for training. After k rounds of classification, the average accuracy is calculated. Various classifiers in scikit-learn are used to evaluate the different dimensionality reduction techniques through their classification results.

4. EXPERIMENTAL RESULTS AND VALIDATIONS

In the CNN training process, the training samples are randomly divided into 100 batches with an equal number of samples. Approximately 60% of the available samples were used as the training dataset, while the remaining samples served as the test dataset. The number of training and test samples of each class is presented in Table 2 and Table 3. The total numbers of training and test samples are 6153 and 4096 for Indian Pines, and 25670 and 17106 for the Pavia University dataset. One batch is sent into the network at each iteration. The training process continues until the maximal number of iterations is reached. In the test process, the test samples are sent into the trained network.

Table 2. The Indian Pines dataset and per-class training sets and corresponding test sets

Class   Train   Test
1       28      18
2       857     571
3       498     332
4       143     94
5       290     193
6       438     292
7       17      11
8       287     191
9       12      8
10      584     388
11      1473    982
12      356     237
13      123     82
14      759     506
15      232     154
16      56      37

Table 3. The Pavia University dataset and per-class training sets and corresponding test sets

Class   Train   Test
1       3979    2652
2       11190   7459
3       1260    839
4       1839    1225
5       807     538
6       3018    2011
7       798     532
8       2210    1472
9       569     378

To verify that the proposed CNN is suitable for classifying hyperspectral data sets with limited training samples, we compare the CNN with different traditional classification techniques. The dimensionality reduction methods are also performed before classification to improve classification performance. The number of dimensions was varied from 2 to 50 for the two hyperspectral data sets, iteratively. Then, k-fold cross-validation is applied to the reduced data in the current dimension for classification. The average classification results over all tested dimensionalities, for each classifier with each dimensionality reduction technique, are reported in Table 4 and Table 5. As seen from the tables, the maximum average accuracies of 87.23% and 92.47% are obtained with FA and the Random Forest classifier for the Indian Pines and Pavia University data sets, respectively. The experimental results also show that the FA algorithm outperforms the other dimensionality reduction methods. FA assumes that variables within a particular group are highly correlated among themselves, but have relatively small correlations with variables in a different group. While PCA is widely used in hyperspectral data analysis, it is not a useful dimensionality reduction method when the components of maximum variation do not coincide with a large intra-class variation.
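The evaluation protocol summarized above (reduce the bands to d components for d = 2, …, 50, run 10-fold cross-validation with each classifier, and average the accuracies) can be sketched with scikit-learn as follows; the placeholder data, the shortened dimension range in the example call and the use of FA as the example reducer are illustrative only.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

classifiers = {
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

def evaluate(X, y, dims=range(2, 51), n_folds=10):
    # Average the n_folds cross-validation accuracy over every tested number of
    # components, separately for each classifier
    scores = {name: [] for name in classifiers}
    for d in dims:
        X_red = FactorAnalysis(n_components=d).fit_transform(X)  # FA as example reducer
        for name, clf in classifiers.items():
            cv = cross_val_score(clf, X_red, y, cv=n_folds, scoring="accuracy")
            scores[name].append(cv.mean())
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

# Illustrative call with placeholder data (replace with the real band matrix and labels)
X = np.random.rand(500, 103)
y = np.random.randint(0, 9, size=500)
print(evaluate(X, y, dims=range(2, 11)))
```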
Table 4. Average classification accuracies of dimensions from 2 to 50 for the Indian Pines dataset

DR Technique                            Random Forest   Decision Tree   Logistic Regression   Gaussian Naive Bayes   Quadratic Discriminant Analysis
Factor Analysis (FA)                    87.23           81.32           87.23                 67.78                  76.64
Independent Component Analysis (ICA)    74.03           65.00           74.03                 53.56                  63.292
Linear Discriminant Analysis (LDA)      81.62           75.81           81.65                 79.51                  82.44
Truncated SVD                           77.17           70.77           77.17                 59.29                  65.13
Principal Component Analysis (PCA)      77.47           71.29           77.542                59.44                  63.33

The classification results of the CNN are presented in Figure 4 and Figure 5 for the two datasets. Compared with the conventional classification methods, the proposed CNN achieves higher accuracy using all spectral bands, even with a small number of training samples. As seen in Figure 4 and Figure 5, the best accuracy of 95.24% is obtained with 2000 iterations for Pavia University using the CNN. Moreover, the best accuracy (93.87%) is obtained for Indian Pines with the CNN. In Figure 6, we can observe the evolution of the error with respect to the training iteration. The value of the loss function decreases as the number of iterations increases. The results demonstrate that the test accuracy increases while the cost value decreases for both datasets. Early stopping can be considered for the training process to reduce computational cost, since the proposed CNN converges in about 900 iterations. Compared with the conventional classification methods, the suggested CNN architecture provides average classification improvements of 6% and 3% for Indian Pines and Pavia University, respectively. Clearly, the proposed CNN increases the classification accuracy significantly under insufficient training data.

Table 5. Average classification accuracies of dimensions from 2 to 50 for the Pavia University dataset

DR Technique                            Random Forest   Decision Tree   Logistic Regression   Gaussian Naive Bayes   Quadratic Discriminant Analysis
Factor Analysis (FA)                    92.47           89.97           92.45                 83.12                  92.39
Independent Component Analysis (ICA)    89.67           85.93           89.68                 84.44                  92.08
Linear Discriminant Analysis (LDA)      90.74           87.32           90.74                 86.84                  89.20
Truncated SVD                           89.94           86.61           89.92                 80.89                  92.10
Principal Component Analysis (PCA)      89.65           86.59           89.65                 81.46                  92.08

Figure 4: Classification accuracies of the CNN for the Indian Pines dataset

Figure 5: Classification accuracies of the CNN for the Pavia University dataset

Figure 6: Cost value versus the training iteration for the hyperspectral data sets

5. CONCLUSION

This study considered the data classification problem on hyperspectral imagery, where the size of the data set is very large. To reduce the computational burden and improve classification accuracy, we utilized dimensionality reduction and deep learning techniques. We evaluated the most efficient dimensionality reduction techniques and the proposed convolutional neural network in terms of classification accuracy. In hyperspectral imagery, dimensionality reduction without loss of critical information is one of the fundamental goals for efficient classification. However, finding a suitable dimensionality reduction technique relies heavily on domain knowledge.
Unlike conventional hyperspectral classification approaches, we propose a 2-D CNN architecture for efficient classification. In this study, we compared our design with traditional dimensionality reduction and classification techniques on two publicly available hyperspectral datasets. Experimental results demonstrate that our CNN can yield superior accuracy while using all spectral bands. In the proposed CNN architecture, two convolutional and two fully-connected layers are used because of the limited number of training samples. In future work, we intend to investigate CNN frameworks with more layers to further improve our classification results.

KAYNAKLAR

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A System for Large-Scale Machine Learning. In OSDI (Vol. 16, pp. 265-283).
2. Agarwal, A., El-Ghazawi, T., El-Askary, H., & Le-Moigne, J. (2007, December). Efficient hierarchical-PCA dimension reduction for hyperspectral imagery. In Signal Processing and Information Technology, 2007 IEEE International Symposium on (pp. 353-356). IEEE. DOI: 10.1109/ISSPIT.2007.4458191
3. Bandos, T. V., Bruzzone, L., & Camps-Valls, G. (2009). Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 47(3), 862-873. DOI: 10.1109/TGRS.2008.2005729
4. Bartholomew, D. J., Steele, F., Galbraith, J., & Moustaki, I. (2008). Analysis of multivariate social science data. Chapman and Hall/CRC.
5. Bazi, Y., & Melgani, F. (2006). Toward an optimal SVM classification system for hyperspectral remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 44(11), 3374-3385. DOI: 10.1109/TGRS.2006.880628.
6. Bruce, L. M., Koger, C. H., & Li, J. (2002). Dimensionality reduction of hyperspectral data using discrete wavelet transform feature extraction. IEEE Transactions on Geoscience and Remote Sensing, 40(10), 2331-2338. DOI: 10.1109/TGRS.2002.804721.
7. Bruna, J., Sprechmann, P., & LeCun, Y. (2015). Super-resolution with deep convolutional sufficient statistics. arXiv preprint arXiv:1511.05666.
8. Camps-Valls, G., Gómez-Chova, L., Calpe-Maravilla, J., Martín-Guerrero, J. D., Soria-Olivas, E., Alonso-Chordá, L., & Moreno, J. (2004). Robust support vector method for hyperspectral data classification and knowledge discovery. IEEE Transactions on Geoscience and Remote Sensing, 42(7), 1530-1542. DOI: 10.1109/TGRS.2004.827262.
9. Camps-Valls, G., Gomez-Chova, L., Muñoz-Marí, J., Vila-Francés, J., & Calpe-Maravilla, J. (2006). Composite kernels for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 3(1), 93-97. DOI: 10.1109/LGRS.2005.857031.
10. Chang, C. I. (2003). Hyperspectral imaging: techniques for spectral detection and classification (Vol. 1). Springer Science & Business Media.
11. Chang, C. I., & Du, Q. (1999). Interference and noise-adjusted principal components analysis. IEEE Transactions on Geoscience and Remote Sensing, 37(5), 2387-2396. DOI: 10.1109/36.789637.
12. Chen, S., & Zhang, D. (2011). Semisupervised dimensionality reduction with pairwise constraints for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 8(2), 369-373. DOI: 10.1109/LGRS.2010.2076407.
13. Chen, Y., Jiang, H., Li, C., Jia, X., & Ghamisi, P. (2016).
Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10), 6232-6251. DOI: 10.1109/TGRS.2016.2584107.
14. Jimenez, L. O., & Landgrebe, D. A. (1999). Hyperspectral data analysis and supervised feature reduction via projection pursuit. IEEE Transactions on Geoscience and Remote Sensing, 37(6), 2653-2667. DOI: 10.1109/36.803413.
15. Fang, L., Li, S., Kang, X., & Benediktsson, J. A. (2014). Spectral–spatial hyperspectral image classification via multiscale adaptive sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 52(12), 7738-7749. DOI: 10.1109/TGRS.2014.2318058.
16. Foody, G. M., & Mathur, A. (2004). A relative evaluation of multiclass image classification by support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(6), 1335-1343. DOI: 10.1109/TGRS.2004.827257.
17. Fotiadou, K., Tsagkatakis, G., & Tsakalides, P. (2017). Deep Convolutional Neural Networks for the Classification of Snapshot Mosaic Hyperspectral Imagery. Electronic Imaging, 2017(17), 185-190. DOI: https://doi.org/10.2352/ISSN.2470-1173.2017.17.COIMG-445.
18. Fukunaga, K. (2013). Introduction to statistical pattern recognition. Academic Press.
19. Gamba, P. (2004, September). A collection of data for urban area characterization. In Geoscience and Remote Sensing Symposium, 2004. IGARSS'04. Proceedings. 2004 IEEE International (Vol. 1). IEEE. DOI: 10.1109/IGARSS.2004.1368947.
20. Girshick, R. (2015). Fast R-CNN. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, pp. 1440–1448.
21. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580-587).
22. Glorot, X., & Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249-256).
23. Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). Cambridge: MIT Press.
24. Gualtieri, J. A., & Chettri, S. (2000). Support vector machines for classification of hyperspectral data. In Geoscience and Remote Sensing Symposium, 2000. Proceedings. IGARSS 2000. IEEE 2000 International (Vol. 2, pp. 813-815). IEEE. DOI: 10.1109/IGARSS.2000.861712
25. Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288. DOI: https://doi.org/10.1137/090771806.
26. He, K., Zhang, X., Ren, S., & Sun, J. (2014, September). Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision (pp. 346-361). Springer, Cham. DOI: 10.1109/TPAMI.2015.2389824.
27. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507. DOI: 10.1126/science.1127647.
28. Hoffbeck, J. P., & Landgrebe, D. A. (1996). Classification of remote sensing images having high spectral resolution. Remote Sensing of Environment, 57(3), 119-126.
29.
Hoffbeck, J. P., & Landgrebe, D. A. (1996). Covariance matrix estimation and classification with limited training data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7), 763-767. DOI: 10.1109/34.506799. 30. http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, Date of Access: 01.06.2018, Topic: Hyperspectral Remote Sensing Scenes 31. Hu, W., Huang, Y., Wei, L., Zhang, F., & Li, H. (2015). Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors, 2015. DOI: http://dx.doi.org/10.1155/2015/258619 32. Huang, C., Davis, L. S., & Townshend, J. R. G. (2002). An assessment of support vector machines for land cover classification. International Journal of remote sensing, 23(4), 725- 749. DOI: https://doi.org/10.1080/01431160110040323 87 Ortaç G.,Özcan G.: A Comparative Study for Hyperspectral Data Classification with Deep Learning and Dimensionality Reduction Techniques 33. Huang, X., & Zhang, L. (2009). A comparative study of spatial approaches for urban mapping using hyperspectral ROSIS images over Pavia City, northern Italy. International Journal of Remote Sensing, 30(12), 3205-3221. DOI: https://doi.org/10.1080/01431160802559046 34. Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE transactions on information theory, 14(1), 55-63. DOI: 10.1109/TIT.1968.1054102. 35. Jackson, Q., & Landgrebe, D. A. (2001). An adaptive classifier design for high-dimensional data analysis with a limited training data set. IEEE Transactions on Geoscience and Remote Sensing, 39(12), 2664-2679. DOI: 10.1109/36.975001. 36. Jimenez, L. O., & Landgrebe, D. A. (1999). Hyperspectral data analysis and supervised feature reduction via projection pursuit. IEEE Transactions on Geoscience and Remote Sensing, 37(6), 2653-2667. DOI: 10.1109/36.803413. 37. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 38. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105). 39. Landgrebe, D. A. (2005). Signal theory methods in multispectral remote sensing (Vol. 29). John Wiley & Sons. 40. Lee, C., & Landgrebe, D. A. (1993). Feature extraction based on decision boundaries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4), 388-400. DOI: 10.1109/34.206958. 41. Li, H. (2014). Deep learning for image denoising. International Journal of Signal Processing, Image Processing and Pattern Recognition, 7(3), 171-180. DOI: http://dx.doi.org/10.14257/ijsip.2014.7.3.14 42. Liang, H., & Li, Q. (2016). Hyperspectral imagery classification using sparse representations of convolutional neural network features. Remote Sensing, 8(2), 99. DOI:10.3390/rs8020099. 43. Licciardi, G., Marpu, P. R., Chanussot, J., & Benediktsson, J. A. (2012). Linear versus nonlinear PCA for the classification of hyperspectral data based on the extended morphological profiles. IEEE Geoscience and Remote Sensing Letters, 9(3), 447-451. DOI: 10.1109/LGRS.2011.2172185. 44. Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5162-5170). 45. Makantasis, K., Karantzalos, K., Doulamis, A., & Doulamis, N. (2015, July). Deep supervised learning for hyperspectral data classification through convolutional neural networks. 
In Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International (pp. 4959-4962). IEEE. DOI: 10.1109/IGARSS.2015.7326945. 46. Melgani, F., & Bruzzone, L. (2004). Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on geoscience and remote sensing, 42(8), 1778-1790. DOI: 10.1109/TGRS.2004.831865. 47. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML- 10) (pp. 807-814). 88 Uludağ University Journal of The Faculty of Engineering, Vol. 23, No. 3, 2018 48. P. F. Hsieh (1998) D. Landgrebe, Classification of high dimensional data. 49. Pal, M., & Mather, P. M. (2005). Support vector machines for classification in remote sensing. International Journal of Remote Sensing, 26(5), 1007-1011. DOI: https://doi.org/10.1080/01431160512331314083 50. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830. 51. Rodarmel, C., & Shan, J. (2002). Principal component analysis for hyperspectral image classification. Surveying and Land Information Science, 62(2), 115. 52. Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons. 53. Shahshahani, B. M., & Landgrebe, D. A. (1994). The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and remote sensing, 32(5), 1087-1095. DOI: 10.1109/36.312897. 54. Sun, Y., Chen, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In Advances in neural information processing systems (pp. 1988- 1996). 55. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015, June). Going deeper with convolutions. Cvpr. 56. Tadjudin, S., & Landgrebe, D. A. (1999). Covariance estimation with limited training samples. IEEE Transactions on Geoscience and Remote Sensing, 37(4), 2113-2118. 57. Tsai, F., & Philpot, W. D. (2002). A derivative-aided hyperspectral image analysis system for land-cover classification. IEEE Transactions on Geoscience and Remote Sensing, 40(2), 416-425. DOI: 10.1109/36.774728. 58. Tuia, D., Volpi, M., Dalla Mura, M., Rakotomamonjy, A., & Flamary, R. (2014). Automatic feature learning for spatio-spectral image classification with sparse SVM. IEEE Transactions on Geoscience and Remote Sensing, 52(10), 6062-6074. DOI: 10.1109/TGRS.2013.2294724. 59. Villa, A., Benediktsson, J. A., Chanussot, J., & Jutten, C. (2011). Hyperspectral image classification with independent component discriminant analysis. IEEE transactions on Geoscience and remote sensing, 49(12), 4865-4876. DOI: 10.1109/TGRS.2011.2153861. 60. Wall, M. E., Rechtsteiner, A., & Rocha, L. M. (2003). Singular value decomposition and principal component analysis. In A practical approach to microarray data analysis (pp. 91- 109). Springer, Boston, MA. 61. Wang, S., & Wang, C. (2015). Research on dimension reduction method for hyperspectral remote sensing image based on global mixture coordination factor analysis. The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 40(7), 159. DOI:10.5194/isprsarchives-XL-7-W4-159-2015. 62. Wang, Y., Lv, Y., Liu, H., Wei, Y., Zhang, J., An, D., & Wu, J. (2018). 
Identification of maize haploid kernels based on hyperspectral imaging technology. Computers and Electronics in Agriculture, 153, 188-195. DOI: https://doi.org/10.1016/j.compag.2018.08.012.
63. Xu, C., Lu, C., Gao, J., Zheng, W., Wang, T., & Yan, S. (2015). Discriminative analysis for symmetric positive definite matrices on Lie groups. IEEE Transactions on Circuits and Systems for Video Technology, 25(10), 1576-1585. DOI: 10.1109/TCSVT.2015.2392472.
64. Yu, S., Jia, S., & Xu, C. (2017). Convolutional neural networks for hyperspectral image classification. Neurocomputing, 219, 88-98. DOI: https://doi.org/10.1016/j.neucom.2016.09.010