Implementation Of Convolutional Neural Network using MATLAB
Authors- U.V. Kulkarni, Shivani Degloorkar, Prachi Haldekar, Manisha Yedke
A step-by-step guide using MATLAB
Image classification is the task of assigning an image to one of a given set of categories based on its visual content. Neural networks make such predictions by learning the relationship between the features of an image and some observed responses. In recent years, convolutional neural networks (CNNs) have achieved unprecedented performance in image classification.
If you are new to CNNs, it is advisable to go through the section on understanding CNNs first and then continue on to how to implement a CNN using MATLAB. Otherwise, you can skip ahead to: Training CNN from scratch.
Understanding Convolutional neural network
To start with CNNs, let us first understand how a computer sees an image. When an image is provided as input to a computer, it sees the image as an array of pixel values of size m x n x r. Here, m and n represent the height and width of the image respectively, and r represents the number of color channels. For instance, r is 3 for an RGB image (Figure 1) and 1 for a grayscale image.
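As a quick illustration of these shapes (a Python/NumPy sketch; the actual implementation later in this guide uses MATLAB), an RGB image and a grayscale image of the same height and width differ only in the number of channels r:

```python
import numpy as np

# A 28 x 28 RGB image: height m = 28, width n = 28, r = 3 color channels
rgb_image = np.zeros((28, 28, 3), dtype=np.uint8)

# A grayscale image of the same size has a single channel (r = 1)
gray_image = np.zeros((28, 28, 1), dtype=np.uint8)

print(rgb_image.shape)   # (28, 28, 3)
print(gray_image.shape)  # (28, 28, 1)
```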
Figure 1: RGB image as seen by computer
Coming back, to build a CNN we use four main types of layers: the convolutional layer, activation layer, pooling layer and fully connected layer. The architecture of a CNN may vary depending on the types and number of layers included, which in turn depend on the application or data. For example, a smaller network with only one or two convolutional layers might be sufficient to learn from a small number of grayscale images, whereas a more complicated network with multiple convolutional and fully connected layers might be needed for a large number of color images.
We will now discuss each of these layers, with their connectivity and parameters, individually.
The convolutional layer is the core building block of a CNN. Its input is the m x n x r dimensional array of pixel values.
In a typical neural network, every neuron in one layer is connected to every neuron in the next layer (Figure 2). When dealing with high-dimensional inputs such as images, it is impractical to connect each hidden-layer neuron to all neurons in the input layer. In a CNN, therefore, each hidden-layer neuron connects to only a small region of neurons in the input layer. These regions are referred to as local receptive fields (Figure 3).
Figure 2: Typical neural network
Figure 3: Convolutional neural network
These local receptive fields, also known as kernels or filters, are the parameters of this layer. Every kernel is small along its width and height compared to the input image, but matches the input in depth. For example, for an RGB input image of dimension 28 x 28 x 3, a kernel might be of size 5 x 5 x 3, while for a grayscale image of the same dimensions it might be of size 5 x 5 x 1.
So, what happens when an image is passed through a convolutional layer?
While passing an image through a convolutional layer, we slide each kernel across the width and height of the input image. At each position we compute the elementwise product between the entries of the kernel and the overlapping region of the input, sum the result, and add a bias term. This same computation is repeated across the entire image, i.e. we convolve the input. The step size with which the kernel moves through the image is called the stride. After sliding the kernel over the width and height of the input image, we obtain a 2-dimensional feature map. A convolutional layer has a set of these kernels and bias terms; each feature map has its own kernel and bias. Therefore, the number of kernels determines the number of feature maps in the output of a convolutional layer. For example, 6 different kernels convolved over an input image produce 6 different feature maps.
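The sliding computation described above can be sketched as follows (a Python/NumPy stand-in for the MATLAB code used later in this guide; `convolve2d` is a hypothetical helper name, and for simplicity it handles a single 2-D channel with no padding):

```python
import numpy as np

def convolve2d(image, kernel, bias=0.0, stride=1):
    """Slide a kernel over a 2-D image; at each position, take the
    elementwise product with the overlapping region, sum it, and add
    a bias (valid convolution, no zero padding)."""
    n = image.shape[0]
    k = kernel.shape[0]
    out = (n - k) // stride + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            fmap[i, j] = np.sum(patch * kernel) + bias
    return fmap

image = np.arange(49, dtype=float).reshape(7, 7)  # the 7 x 7 example input
kernel = np.ones((3, 3)) / 9.0                    # a simple averaging kernel
fmap = convolve2d(image, kernel)
print(fmap.shape)  # (5, 5): a 3 x 3 kernel over a 7 x 7 image with stride 1
```

Each distinct kernel (with its own bias) applied this way would yield one feature map, so 6 kernels yield 6 such maps.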
Figure 4: Sliding kernel 1 over input image to obtain feature map 1
Figure 5: Sliding kernel 2 over input image to obtain feature map 2
The kernels consist of sets of learnable weights, which are initially randomized with small values. When slid over the input image, these weight matrices extract features from it. With multiple convolutional layers, the features at the initial layers may be edge orientations or patches of color, while at higher layers they become more complex patterns, or even entire objects.
Feature maps are the output of the convolutional layer. The size and number of feature maps produced depend on the size of the kernels, the stride and the number of kernels.
For instance, consider a simple example where the input is a 2-dimensional 7 x 7 image. Let us see how the above-mentioned parameters affect the size of the output feature maps.
Size of kernels : the larger the kernel, the smaller the feature map. Sliding a 3 x 3 kernel over the 7 x 7 image (stride 1) gives a 5 x 5 feature map, while a 5 x 5 kernel gives a 3 x 3 map.
Stride rate : the larger the stride, the smaller the feature map. Sliding a 3 x 3 kernel over the 7 x 7 image with stride 1 gives a 5 x 5 map, whereas stride 2 gives a 3 x 3 map.
Number of kernels : the number of kernels decides the number of feature maps produced. For example, 6 kernels produce 6 feature maps.
The problem seen in Figure 9 can be solved by zero padding. Zero padding simply adds rows and columns of zeros to the borders of the input image, and helps us control the output size of the feature map.
Figure 10: 9 x 9 image obtained after padding 7 x 7 image with zeros along the borders
Now, to sum up how these parameters affect the output of the convolutional layer (the feature maps), consider an N x N image, a K x K kernel, stride S and zero padding P. The size of the output feature map is given by:
Output size = ( (N - K + 2 * P) / S ) + 1
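This formula can be checked numerically (a small Python sketch; `conv_output_size` is a hypothetical helper name):

```python
def conv_output_size(N, K, S=1, P=0):
    """Feature-map size for an N x N input, K x K kernel, stride S, padding P."""
    return (N - K + 2 * P) // S + 1

print(conv_output_size(7, 3))        # 5: 3 x 3 kernel, stride 1, no padding
print(conv_output_size(7, 3, S=2))   # 3: increasing the stride shrinks the output
print(conv_output_size(7, 3, P=1))   # 7: padding of 1 preserves the input size
print(conv_output_size(28, 5))       # 24: the 28 x 28 input with a 5 x 5 kernel
```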
In CNNs it is conventional to apply an activation layer (nonlinear layer) after every convolutional layer. This introduces nonlinearity into the architecture after the linear operations performed in the convolutional layer. There are many types of nonlinear activation functions, such as the rectified linear unit (ReLU), tanh and sigmoid.
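The three activation functions mentioned can be written in a few lines of NumPy (an illustrative sketch; each is applied elementwise, so the output shape always matches the input shape):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = np.maximum(0.0, x)             # ReLU: zeroes out negative values
sigmoid = 1.0 / (1.0 + np.exp(-x))    # sigmoid: squashes values into (0, 1)
tanh = np.tanh(x)                     # tanh: squashes values into (-1, 1)

print(relu)     # negative entries become 0; positive entries pass through
print(sigmoid)  # all entries strictly between 0 and 1
```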
Pooling layers are typically introduced between successive convolutional layers. These layers do not perform any learning. Pooling is a form of down-sampling, i.e. reducing the dimensions of the input to cut down the amount of computation and the number of parameters needed. The input to a pooling layer is the series of feature maps generated by the convolutional layer. Essentially, a pooling layer groups a fixed number of units of a region and produces a single value for that group. The region is selected using a window, generally of size 2 x 2, which slides with a fixed stride, most often 2. It is worth noting that only two variations of the pooling layer are common in practice: window size 2 with stride 2 (the more common), and window size 3 with stride 2. The pooling layer operates independently on every feature map and resizes it spatially; therefore, the number of pooled maps equals the number of feature maps from the previous convolutional layer.
The output of a pooling layer that takes n feature maps of dimension F x F as input, with window size W and stride S, is n pooled maps of dimension P x P, where
P = ((F - W) / S) + 1
Note that it is uncommon to use zero padding in a pooling layer.
Max- and average-pooling are the two common types of pooling. Max-pooling returns the maximum value, whereas average-pooling outputs the average value, of each fixed region of its input.
Figure 11: Pooling with window size 2 x 2 and stride 2
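Both pooling variants can be sketched with one small function (a Python/NumPy illustration; `pool2d` is a hypothetical helper name, operating on a single feature map):

```python
import numpy as np

def pool2d(fmap, window=2, stride=2, mode="max"):
    """Down-sample a 2-D feature map by sliding a window and reducing
    each region to a single value (max or average)."""
    f = fmap.shape[0]
    p = (f - window) // stride + 1          # P = ((F - W) / S) + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            region = fmap[i * stride:i * stride + window,
                          j * stride:j * stride + window]
            out[i, j] = reduce_fn(region)
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
print(pool2d(fmap, mode="max"))  # [[6. 8.] [3. 4.]]
print(pool2d(fmap, mode="avg"))  # [[3.75 5.25] [2. 2.]]
```

Note that the 4 x 4 map is halved to 2 x 2, consistent with the formula above.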
The main use of pooling is to make feature detection independent of location. For example, assume we have two images of a letter on a very large white background: in the first image the letter is written in the middle, and in the second it appears at the bottom-right corner. After passing these two images through a pooling layer, we get reduced images that are nearly similar, with the letter somewhere in the middle. This also helps control overfitting: an overfitted network performs well on the training set but poorly on the testing set, i.e. it generalizes badly.
Fully Connected Layer
The convolutional and pooling layers are followed by one or more fully connected layers. Every neuron in a fully connected layer connects to all the neurons in the previous layer. This layer combines all of the features learned by the previous layers across the network to identify the images: it looks at the output of the previous layer (the activation maps of high-level features), determines which features correlate most strongly with each particular class, and produces a score for each class. The output size of the final fully connected layer equals the number of classes in the data set.
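A fully connected layer is just a matrix multiplication plus a bias over the flattened feature maps. The sketch below uses a softmax to turn scores into class probabilities (one common choice for illustration; the architecture trained later in this guide uses a unipolar sigmoid output instead, and the sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: six 12 x 12 pooled maps, flattened into one vector
pooled = rng.random((12, 12, 6))
flat = pooled.reshape(-1)                 # 12 * 12 * 6 = 864 features

num_classes = 2                           # e.g. disguised vs. undisguised
W = rng.normal(0, 0.01, (num_classes, flat.size))  # one weight row per class
b = np.zeros(num_classes)

scores = W @ flat + b                     # one score per class
probs = np.exp(scores) / np.sum(np.exp(scores))  # softmax: probabilities
print(probs.shape)  # (2,)
print(probs.sum())  # 1.0
```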
Figure 12: Complete CNN architecture
Now let us sum up how our network transforms the original image, layer by layer, from the raw pixel values to the final class scores.
The input holds the pixel values of the image, for example a 28 x 28 x 3 image.
The convolutional layer computes its output by taking dot products between the kernels and the small regions they are connected to in the input volume. This may result in an output such as 24 x 24 x 6 if we decide to use 6 kernels of size 5 x 5 x 3.
The activation layer applies an elementwise activation function, leaving the size of the output unchanged at 24 x 24 x 6.
The pooling layer performs a downsampling operation along the width and height, resulting in an output such as 12 x 12 x 6.
The fully-connected layer computes the class scores, resulting in an output of size 10 x 1, where each of the 10 numbers corresponds to a class score.
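These shape transitions can be traced with the size formulas given earlier (a plain Python sketch; all names are illustrative):

```python
# Trace tensor shapes through the example network described above.
N, C = 28, 3                    # input: 28 x 28 x 3 image
K, num_kernels = 5, 6           # six 5 x 5 x 3 kernels, stride 1, no padding
conv = (N - K) // 1 + 1         # convolution output width/height
pool = (conv - 2) // 2 + 1      # 2 x 2 pooling window, stride 2
num_classes = 10                # fully connected output size

print((conv, conv, num_kernels))  # (24, 24, 6)
print((pool, pool, num_kernels))  # (12, 12, 6)
print((num_classes, 1))           # (10, 1)
```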
Backpropagation (Training CNN)
Our goal with backpropagation is to update each of the weights in the network so that the actual output moves closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole. When training the network, there is an additional layer called the loss layer. This layer provides feedback to the neural network on whether it identified inputs correctly and, if not, how far off its guesses were. Here we define a loss function, which quantifies our unhappiness with the scores across the training data: it takes the desired output from the user and the output produced by the network, and computes how bad the prediction is. The loss over the dataset is the sum of the losses over all inputs. This helps guide the neural network to reinforce the right concepts during training.
To learn more about how backpropagation in a CNN updates weights throughout the network, you can refer to ''Derivation of Backpropagation in Convolutional Neural Network (CNN)''.
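As one concrete example of such a loss function (the text does not fix a particular choice here; mean squared error is a common one, shown below as a Python sketch):

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean squared error: quantifies how far the network's output
    is from the desired output. Lower is better."""
    return 0.5 * np.sum((predicted - target) ** 2)

# Desired output for a two-class problem vs. what the network produced
target = np.array([0.0, 1.0])
predicted = np.array([0.3, 0.8])
print(mse_loss(predicted, target))  # 0.5 * (0.3**2 + 0.2**2) = 0.065
```

The loss over the whole dataset is then the sum of this quantity over all training inputs.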
Training CNN from scratch
The first step in creating and training a new convolutional neural network is to define the network architecture. For this purpose we use the architecture depicted in Figure 13, taken from the paper ''Derivation of Backpropagation in Convolutional Neural Network (CNN)''. It consists of two convolutional layers, two pooling layers, and activation layers with a unipolar sigmoid function. Also refer to this paper for the backpropagation algorithm used later in this guide for training the network.
Figure 13: CNN Architecture
In this guide we will train our CNN model to identify disguised faces as a demonstration. However, the implementation below can be used to train the network on any dataset.
Step 1: Data and Preprocessing
The dataset used in this guide is a cropped version of the IIIT-Delhi Disguise Version 1 face database (ID V1).
Note: this database can be cited as follows:
T. I. Dhamecha, R. Singh, M. Vatsa, and A. Kumar, "Recognizing Disguised Faces: Human and Machine Evaluation," PLoS ONE, 9(7): e99212, 2014.
T. I. Dhamecha, A. Nigam, R. Singh, and M. Vatsa, "Disguise Detection and Face Recognition in Visible and Thermal Spectrums," in Proceedings of the International Conference on Biometrics, 2013 (poster).
We manually split the entire dataset into two parts: disguised and undisguised. Moreover, the dataset does not come with an official train/test split, so we simply use 10% of both the disguised and undisguised data as a training set. We then have four data folders: Train_disguised, Train_Undisguised, Test_disguised, Test_Undisguised.
Below are examples of some of the images in the dataset.
Data preprocessing for this dataset involves loading the training data, resizing all images to the same size, labeling each image with its desired output ((1, 0) for undisguised and (0, 1) for disguised, since the output layer has two classes), and then storing everything in an array.
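The preprocessing steps can be sketched as follows (a Python/NumPy stand-in for the MATLAB code; the target size, folder contents and `resize_nearest` helper are all illustrative, and dummy arrays stand in for images loaded from the Train_disguised / Train_Undisguised folders):

```python
import numpy as np

TARGET = 28  # hypothetical common size all images are resized to

def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2-D image (a simple stand-in
    for a library resize function)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

rng = np.random.default_rng(0)
# Dummy stand-ins for grayscale images of varying sizes
undisguised = [rng.random((40, 35)) for _ in range(3)]
disguised = [rng.random((50, 45)) for _ in range(2)]

data, labels = [], []
for img in undisguised:
    data.append(resize_nearest(img, TARGET))
    labels.append([1, 0])          # undisguised -> (1, 0)
for img in disguised:
    data.append(resize_nearest(img, TARGET))
    labels.append([0, 1])          # disguised -> (0, 1)

X = np.stack(data)      # (5, 28, 28): all images, now the same size
Y = np.array(labels)    # (5, 2): one-hot desired outputs
print(X.shape, Y.shape)
```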