=Paper=
{{Paper
|id=Vol-3150/short4
|storemode=property
|title=Convolutional Neural Network and Its Application in Handwritten Digit and Traffic Sign Recognition
|pdfUrl=https://ceur-ws.org/Vol-3150/short4.pdf
|volume=Vol-3150
|authors=Pengxiang Jia 
}}
==Convolutional Neural Network and Its Application in Handwritten Digit and Traffic Sign Recognition==
<pdf width="1500px">https://ceur-ws.org/Vol-3150/short4.pdf</pdf>
<pre>
Convolutional Neural Network and Its Application in
Handwritten Digit and Traffic Sign Recognition
Pengxiang Jia

University of Victoria, Victoria BC Canada
Percy6995@gmail.com

                Abstract
                The convolutional neural network (CNN) is an application of deep learning in computer
                vision. It has a strong feature extraction ability and has become an important component
                of deep learning. It performs well in image classification tasks. This article summarizes
                the background and concepts of CNN and discusses its application to the classification of
                MNIST handwritten digits and traffic signs.

                Keywords
                Convolutional Neural Network (CNN), Deep Learning, MNIST Handwritten Digit
                Recognition, Traffic Sign Recognition

1. Introduction
    Machine learning uses computers to perform tasks at a human level of capability by searching for a
suitable algorithm during the training process [1]. In traditional machine learning, feature extraction is
done by engineers with domain knowledge (for example, with techniques such as PCA), and making
precise and correct decisions about features is difficult [2]. Deep learning offers an alternative: feature
extraction is learned as part of the training process. The layered structure of the network enables this
ability to learn features [3].
    In a convolutional neural network (CNN), the feature extraction of the multilayer network is learned
during training: every layer constructs its features so as to achieve the best classification success rate.
The features of each hidden layer are built on the output of the previous layer, and the output of that
layer is in turn used by the next layer. In this way, a CNN has a strong capability for defining
complicated features of objects for recognition [3, 5, 6].
    Multilayer networks have very wide applications. Thanks to their strong expressive power, they are
used not only in image classification, but also in object detection, computer vision, language
recognition, prediction, etc. [4].
    CNN is a kind of neural network. It has three advantages that make it particularly outstanding. First,
it has fewer parameters to learn, because weight sharing reduces the number of weights. Second, its
training time is much reduced. Third, it has strong expressive power: it retains strong object recognition
and learning capability even though fewer parameters are used [5].
    This survey is organized as follows. Section 2 introduces the usual composition and the training
process of CNN. Section 3 describes the application of CNN to MNIST handwritten digits, a well-known
benchmark for image recognition algorithms [6, 7]. Section 4 covers the application of CNN to traffic
signs. Across different applications of CNN, the training and application processes are similar, but
because the datasets and purposes differ, the CNN architectures and training processes differ to various
degrees [8, 9, 10]. LeNet-5, VGG-13, MTCNN, RCNN, Fast RCNN, and Faster RCNN are well-known
reported architectures [11].
    This survey emphasizes the base architecture of CNN and how CNNs differ across application
settings, comparing their operations, processes, and architectures.

2. Convolutional Neural Network
   The input to a convolutional neural network is a 1 × 𝑘 vector 𝑋 = (𝑥1, 𝑥2, 𝑥3, …, 𝑥𝑘). If we are
going to train the network with images, the input to the network will be matrices of pixels of the images.


Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
We need to transform the square matrix of pixels into a vector of pixels of size 1 × 𝑘, so that it can be
input into the network. We do this by concatenating the rows: the head of the second row is attached to
the tail of the first row, the head of the third row to the tail of the second, and so on.
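
The row-by-row flattening just described can be sketched as follows. This is a minimal illustration; the function name flatten_row_major and the 3x3 sample matrix are assumptions made for the example, not part of the paper.

```python
def flatten_row_major(image):
    """Concatenate the rows of a 2-D pixel matrix into one 1 x k vector."""
    vector = []
    for row in image:
        # The head of each row is attached to the tail of the previous one.
        vector.extend(row)
    return vector


# Hypothetical 3x3 "image"; a real input would hold pixel intensities.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
vector = flatten_row_major(image)
print(vector)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```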
   CNN is known to be effective for image classification, involving few parameters and fast training.
Nowadays, CNN has been applied to many recognition and classification tasks in many distinct
domains. A classic CNN has three parts: the convolutional layer, the max-pooling layer, and the fully
connected layer. An ordinary fully connected network can reach the performance of a CNN, but by
comparison a CNN needs fewer parameters and less training time.
   Convolutional layers are the layers specifically used for pattern recognition in neural networks.
Each neuron recognizes a particular pattern, defined by a filter; the region of the input that the filter
covers is the neuron's receptive field. The filter is a matrix of weights, one for each connection from
that region to the neuron. In Figure 1 it is 3x3 in size.


Figure 1: Convolutional Layer and Filter
    The weights enclosed in the square boundary correspond to the 1st, 2nd, 3rd, 7th, 8th, 9th, 13th,
14th, and 15th connections from the previous layer to the neuron in the next layer. The convolution
process overlays the filter on the image. For example, in Figure 1 the filter is at the top left corner, and
the convolution calculation is executed there. The 9 filter values 𝑓𝑖 and the 9 overlaid pixel values 𝑝𝑖
(the 9 values in the square boundary) are multiplied pairwise and summed together, as in expression (1).
                                        \sum_{i=1}^{9} p_i f_i                                           (1)
    If the convolution result is a large value, a pattern has been found; otherwise, the result is a small
value. A pattern, for instance the grain of a kind of material, can appear anywhere in the image. To
search for the pattern, we slide the filter in steps of one pixel through all positions in the two
dimensions of the image. Each time we slide the filter horizontally or vertically, we calculate the
convolution of the filter values with the image pixels under it. A major advantage of sliding a single
filter is weight sharing. Weight sharing can be understood by looking at how the filter is represented in
the CNN architecture. Each time the filter performs the convolution operation with 9 pixels of the
image, we connect one neuron of the convolutional layer to 9 neurons of the previous layer, applying
the weights in the filter. Every neuron in the convolutional layer is connected to a different set of 9
neurons of the previous layer, and each such neuron corresponds to one step of the sliding filter across
the image. The filter weights stay fixed during the sliding and are not updated until all the convolution
operations are done, so in the CNN representation the same weights are shared between the different
convolution neurons. Compared to a fully connected layer, a layer with weight sharing has fewer distinct
weights and fewer connections, which reduces the computation in training.
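
To make the saving from weight sharing concrete, the sketch below counts weights and connections for one filter and for a fully connected layer that produces an output of the same size. The 6x6 input is an assumption: the connection indices listed above (1, 2, 3, 7, 8, 9, ...) suggest rows of 6 pixels, but the paper does not state the image height. Biases are ignored, and the function names are hypothetical.

```python
def conv_layer_stats(in_h, in_w, k):
    """Counts for one k x k filter slid with stride 1 and no padding."""
    out_h = in_h - k + 1
    out_w = in_w - k + 1
    distinct_weights = k * k                  # shared by every output neuron
    connections = out_h * out_w * k * k       # each output neuron sees k*k inputs
    return out_h, out_w, distinct_weights, connections


def fully_connected_weights(in_h, in_w, out_h, out_w):
    """Every input pixel is connected to every output neuron, no sharing."""
    return (in_h * in_w) * (out_h * out_w)


out_h, out_w, shared, conns = conv_layer_stats(6, 6, 3)
dense = fully_connected_weights(6, 6, out_h, out_w)
print(out_h, out_w, shared, conns, dense)  # 4 4 9 144 576
```

Under these assumptions the convolutional layer learns 9 distinct weights over 144 connections, where the fully connected layer would learn 576 separate weights.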
    In Figure 2, we are searching for a pattern in which the pixel values on the diagonal are 1 and all
other values are -1. Convolving the filter with the first 3x3 window yields 3; convolving it with the
second 3x3 window yields -1. The 3 versus -1 indicates that the first window is more similar to the
pattern than the second one, which is also described as a higher activation of the neuron.
  Figure 2: Convolution Operation
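
The sliding computation can be sketched in a few lines. The filter below is the diagonal pattern described above (1 on the diagonal, -1 elsewhere); the 6x6 binary image is an assumed example chosen so that the first two windows give 3 and -1, and it is not necessarily the image shown in Figure 2. As is conventional for CNNs, the filter is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Expression (1): multiply overlaid values pairwise and sum.
            total = 0
            for a in range(k):
                for b in range(k):
                    total += kernel[a][b] * image[i + a][j + b]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Assumed 6x6 binary image (illustration only).
IMAGE = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]

# Diagonal pattern: 1 on the diagonal, -1 elsewhere.
DIAGONAL_FILTER = [[1, -1, -1],
                   [-1, 1, -1],
                   [-1, -1, 1]]

fm = conv2d_valid(IMAGE, DIAGONAL_FILTER)
print(fm[0])  # [3, -1, -3, -1]
```

The first row of the result starts with 3 and -1, matching the two activations discussed in the text.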
    After testing every 3x3 area in the image against the filter, we get a matrix of the convolution
results, which is called the feature map. For one image, we usually set up not just one filter but multiple
filters, so that we can detect different patterns in the image. The size of the feature map depends on the
filter size, and because all filters in a layer have the same size, they all produce feature maps of the same
size. The feature maps are stacked and become the output of the convolutional layer, which is also the
input for the next layer, as illustrated in Figure 3.


  Figure 3: Feature Maps
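
A rough sketch of how several filters produce a stack of same-size feature maps is given below. The two filters (a diagonal pattern and a vertical-line pattern) and the 6x6 image are assumptions for illustration; conv_layer is a hypothetical helper name.

```python
def conv2d_valid(image, kernel):
    """Stride-1, no-padding convolution of one 2-D filter over one 2-D input."""
    k = len(kernel)
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(len(image[0]) - k + 1)]
            for i in range(len(image) - k + 1)]


def conv_layer(image, filters):
    """Apply every filter to the same input and stack the feature maps."""
    return [conv2d_valid(image, f) for f in filters]


# Assumed 6x6 binary image (illustration only).
image = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]

diagonal = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
vertical = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]

maps = conv_layer(image, [diagonal, vertical])
# Two filters of equal size give two feature maps of equal size (4x4 here).
print(len(maps), len(maps[0]), len(maps[0][0]))  # 2 4 4
```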
    An architecture with multiple convolutional layers allows the detection of complicated features.
When feature maps are input to another convolutional layer, its filters perform convolution operations
on the feature maps in the same way as on the original image. The features detected on a feature map
are combinations of the simple features present in the original image, and are therefore complicated
features. For example, a 3x3 filter applied to the image can only detect a pattern as simple as a line on
the diagonal of the filter. A feature map, however, is a matrix indicating how strongly the simpler
features of the previous convolutional layer are activated, so a pattern in the feature map is a
combination of those earlier features, with their positions and their presence or absence defined by the
filter. For example, a simple wave-shaped feature can be combined into the texture of a rubber wheel or
into actual sea waves.
    When an image or a feature map is input, the filter yields another feature map, as explained above.
The number of filters determines the number of output feature maps. A filter need not be 2D; it can also
be 3D. A 3D filter processes multiple layers of feature maps together. For example, a 3x3x3 filter
processes 3 layers of feature maps, and at each step it covers a square of 9 points in each of the 3 layers.
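
A 3D filter can be sketched as one 2D filter slice per input feature map, with all products summed into a single output value per position. The function name and the all-ones test data below are assumptions for illustration.

```python
def conv_multichannel(feature_maps, kernel3d):
    """Apply one C x k x k filter to C stacked feature maps (stride 1, no padding).

    All channels are processed together and produce ONE output feature map.
    """
    channels = len(kernel3d)
    k = len(kernel3d[0])
    out_h = len(feature_maps[0]) - k + 1
    out_w = len(feature_maps[0][0]) - k + 1
    return [[sum(kernel3d[c][a][b] * feature_maps[c][i + a][j + b]
                 for c in range(channels)
                 for a in range(k)
                 for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


# Three 4x4 feature maps of ones and a 3x3x3 filter of ones (illustration only).
maps = [[[1] * 4 for _ in range(4)] for _ in range(3)]
kernel = [[[1] * 3 for _ in range(3)] for _ in range(3)]

out = conv_multichannel(maps, kernel)
print(out)  # [[27, 27], [27, 27]]
```

Each output value sums 3 x 9 = 27 products, one 3x3 window from each of the three input maps.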
    The max-pooling layer is another important component of CNN. Human perception of a picture is
only slightly affected by the resolution of the picture, so if we drop every other row and column of the
image, we are still able to identify the object(s) in the image. This operation is called subsampling, and
it is the idea behind pooling.
   In a CNN, the pooling layer works a little differently. The picture is first divided into squares of the
same size, and then the largest value (max pooling) or the average value (average pooling) of each
square is found. The selected values are combined to form the output while preserving their relative
positions.
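
The pooling step can be sketched as below, covering both the largest-value and the average-value variants mentioned above. The function name pool2d, the 2x2 square size, and the sample values are assumptions for illustration.

```python
def pool2d(image, size=2, mode="max"):
    """Divide `image` into size x size squares and keep one value per square."""
    pooled = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            block = [image[i + a][j + b]
                     for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(block))
            else:  # "average"
                row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


feature_map = [[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 10, 13, 14],
               [11, 12, 15, 16]]

print(pool2d(feature_map, 2, "max"))      # [[4, 8], [12, 16]]
print(pool2d(feature_map, 2, "average"))  # [[2.5, 6.5], [10.5, 14.5]]
```

The output keeps the relative positions of the squares: the top-left output value comes from the top-left square of the input.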
   A complete CNN comprises convolutional layers and max-pooling layers that appear alternately,
followed at the end by an ordinary fully connected network.


 Figure 4: Full CNN Structure

3. Application of CNN on Handwritten Digits
    MNIST is a collection of handwritten digits. It contains 60,000 training images and 10,000 test
images, each a 28x28 picture. Handwritten digit recognition is a good example for learning the
characteristics and methodology of CNN. LeNet-5 is a CNN architecture that works well for
handwritten digits; its activation function is tanh(). It has 3 convolutional layers and 2 pooling layers,
followed by a fully connected layer and an output layer. The network outputs 10 signals, one for each
of the digits 0 to 9.
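
As a rough shape check, the sketch below traces the feature-map sizes through the convolution and pooling stages of LeNet-5. It assumes the classic configuration (three 5x5 convolution stages C1, C3, C5 with 6, 16, and 120 filters, two 2x2 pooling stages S2 and S4, and a 32x32 input obtained by padding the 28x28 digits); these sizes are not given in this paper, and the 3-convolution, 2-pooling layout follows the description in Section 4.

```python
def conv_out(size, kernel):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_out(size, window):
    """Output width of non-overlapping pooling."""
    return size // window


def lenet5_shapes(input_size=32):
    """Return (layer name, number of maps, height, width) for each stage."""
    shapes = []
    s = conv_out(input_size, 5)
    shapes.append(("C1", 6, s, s))
    s = pool_out(s, 2)
    shapes.append(("S2", 6, s, s))
    s = conv_out(s, 5)
    shapes.append(("C3", 16, s, s))
    s = pool_out(s, 2)
    shapes.append(("S4", 16, s, s))
    s = conv_out(s, 5)
    shapes.append(("C5", 120, s, s))
    return shapes


for name, maps, h, w in lenet5_shapes():
    print(name, maps, h, w)
# C1 6 28 28
# S2 6 14 14
# C3 16 10 10
# S4 16 5 5
# C5 120 1 1
```

After C5 the 120 values feed the fully connected layer and then the 10-way output layer.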
    Take LeNet-5 as an example for analysing performance. The number of convolutional layers, the
number of filters, and the filter size directly affect the training results. Other relevant factors are
which activation function is used, whether the network ends in fully connected layers, and the size of
the training set; the size of the training set in particular has a large influence.
    If the activation function of LeNet-5 is changed from tanh() to sigmoid, the performance of the
CNN is not significantly reduced, even when the fully connected layers are cut off. Moreover, the error
curve during training becomes more stable and smooth. This suggests that sigmoid may be a better
activation function for handwritten digit tasks.
    If we decrease the number of convolutional filters by a small amount, the error plateau is a little
higher, but the CNN reaches the plateau much faster and the training time is much reduced. If we
increase the number of filters, the CNN does not converge to a satisfactory error rate, reflecting an
underfitting scenario.
    If the training set is small, the CNN will not converge; at this point, if we increase the training set
size, it is likely to converge.
    If the filter is smaller than the sizes that normally work for a CNN, we can get an overfitting result
even when the training set is small: because the CNN model is trained on the training set too many
times, it starts to fit noise in the training set. When an overfitted model is set to classify handwritten
digits, it achieves high accuracy on the training set but only low accuracy on the testing set. To solve
this problem, we need to increase the size of the filter to a value that is neither too big nor too small. A
filter that is too big gives the CNN too many parameters, and the error rate then does not reach the
plateau, because larger filters generally need more training data. Within the suitable range, larger filters
generally give better results, but the benefit decreases as the filter size increases.
   The number of convolutional layers also influences the capability of the model. Its effect is similar
to the effect of the filter size. Within an appropriate range, more convolutional layers give better
classification/prediction capability, but too many or too few convolutional layers diminish the
capability of the CNN.
   In general, we can view the classification capability as a function of three variables: the filter size,
the number of filters, and the number of convolutional layers. This function is concave, so there is a
best combination of filter size, filter quantity, and number of convolutional layers that maximizes the
CNN performance.
   There is also a sweet spot in the amount of training. Training on the training set too many times
causes overfitting, after which the test error increases drastically; before that point, more training
steadily decreases the test error.
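
One common way to stop before the test error starts rising is early stopping, which Section 4 also mentions. The sketch below is a minimal version that assumes a list of per-epoch errors measured on held-out data; the function name, the patience parameter, and the error values are hypothetical and are not results from the paper.

```python
def early_stop(held_out_errors, patience=2):
    """Return (best_epoch, best_error), stopping once the error has not
    improved for `patience` consecutive epochs."""
    best_epoch = 0
    best_error = float("inf")
    for epoch, err in enumerate(held_out_errors):
        if err < best_error:
            best_error = err
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: assume overfitting began.
            break
    return best_epoch, best_error


# Made-up error curve: improves, then rises as the model overfits.
errors = [0.50, 0.40, 0.30, 0.35, 0.36, 0.20]
print(early_stop(errors, patience=2))  # (2, 0.3)
```

With a patience of 2 the loop stops at epoch 4 and never sees the later value 0.20, which illustrates the trade-off in choosing the patience.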

4. Application of CNN on Traffic Sign Recognition
    With the increase in GPU performance, training a deep neural network to recognize traffic signs
has become practical. Traffic sign recognition is an important part of autonomous driving.
    Current traffic sign recognition architectures include LeNet-5, VGG-13, MTCNN, RCNN, Fast
RCNN, and Faster RCNN [11]. LeNet-5 was the first of these to be invented. It has 3 convolutional
layers, 2 max-pooling layers, a fully connected layer, and an output layer. The training of LeNet-5 is
similar to that of other CNNs.
    When applying LeNet-5 to the task of traffic sign recognition, we need to prepare the input images.
Because traffic sign images are photos taken in real life, they vary with distance, lighting, and weather,
so no two pictures look the same. The pictures are not high resolution, and their sizes and scales differ.
Preprocessing makes the pictures suitable for CNN training. It consists of the following steps: convert
the colored image to grayscale; remove noise; crop the image so that the traffic sign is centered and is
the only object in the image; and lastly enhance features such as edges and resize all pictures to the
same image size, for example 512x512. On the other hand, if preprocessing makes the images too
uniform, training easily falls into overfitting. To avoid overfitting, we can introduce a little noise, for
example Gaussian noise, into the images, and use techniques like early stopping.
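
Some of these preprocessing steps can be sketched as follows. This is a simplified illustration, not the pipeline used in [11]: the grayscale conversion uses the common luminance weights (0.299, 0.587, 0.114), resizing uses nearest-neighbour sampling, and denoising, cropping, and edge enhancement are omitted. All function names and the sigma value are assumptions.

```python
import random


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def resize_nearest(image, out_h, out_w):
    """Resize to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [[image[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)]
            for i in range(out_h)]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clamp to the range [0, 255]."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]


# Hypothetical 2x2 colour image.
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(255, 0, 0), (0, 0, 255)]]

gray = to_grayscale(rgb)
resized = resize_nearest(gray, 4, 4)
noisy = add_gaussian_noise(resized, sigma=5.0, rng=random.Random(0))
print(len(noisy), len(noisy[0]))  # 4 4
```

Passing a seeded random.Random instance keeps the noise reproducible between runs.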


 Figure 5: Fast RCNN Network Structure [11]
    MTCNN consists of P-Net, R-Net, and O-Net. Their typical input image sizes are 12x12, 24x24, and
48x48, respectively. P-Net and R-Net are relatively shallow, which makes the classification process
faster. Unlike R-Net, P-Net has no fully connected layer; it has only convolutional layers. The image
input to P-Net is first delivered to an ROI pooling layer, which normalizes the input images so that
MTCNN can process images of any size. The output of P-Net is a set of detected bounding boxes. The
boxes are input into R-Net, which selects among them with regression. The selected boxes are then
passed to the fully connected layers of R-Net for further classification. O-Net is deeper than P-Net and
R-Net, so it has stronger recognition power; in other respects it is similar to them. The final recognition
results need to be matched with the output bounding boxes.
    LeNet-5 and VGG-13 (similar to LeNet-5, but more complex) are capable only of classification.
Fast RCNN differs from them in that it is also capable of detecting traffic signs, that is, locating them
in the image. Fast RCNN is suitable for the traffic sign detection task with noisy backgrounds and
unstable image data sets, e.g., real-time data.
    The selective search algorithm of RCNN is a process of similarity estimation, division, and merging.
Compared to searching pixel by pixel, this approach finds the positions of traffic signs more quickly. In
the end, it generates regions of the image that contain only the traffic signs. Fast RCNN, by comparison,
does not run the CNN separately on each candidate region; instead it takes the features of each traffic
sign directly from the feature map. This is faster because there are fewer pixels to process.
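
Reading a region's features directly from the feature map can be sketched as projecting the region's image coordinates onto the smaller feature map and cropping there. This is only an illustration of the idea: the total stride of 16 between image and feature map, the box coordinates, and the function names are assumptions, and the pooling of the crop to a fixed size is omitted.

```python
import math


def project_box(box, stride):
    """Map an image-space box (x0, y0, x1, y1) to feature-map coordinates."""
    x0, y0, x1, y1 = box
    return (x0 // stride, y0 // stride,
            math.ceil(x1 / stride), math.ceil(y1 / stride))


def crop(feature_map, fbox):
    """Cut the projected region out of a 2-D feature map."""
    x0, y0, x1, y1 = fbox
    return [row[x0:x1] for row in feature_map[y0:y1]]


# Hypothetical 8x8 feature map computed once for a 128x128 image (stride 16).
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]

fbox = project_box((32, 16, 96, 80), stride=16)
region = crop(feature_map, fbox)
print(fbox)                         # (2, 1, 6, 5)
print(len(region), len(region[0]))  # 4 4
```

A 64x64 pixel region becomes a 4x4 patch of the feature map, so far fewer values are processed per candidate region.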
    RCNN and Fast RCNN both generate candidate regions with the selective search algorithm, but this
algorithm suffers from heavy computation, long running time, and an excessive number of candidate
regions. The region proposal network (RPN), in contrast, uses the CNN of the detector itself to generate
candidate regions directly. Because the RPN brings the role of selective search into the deep neural
network and the convolution computations are shared, the overall computation is more efficient.
Additionally, because the RPN replaces the selective search algorithm, the computation that generates
candidate bounding boxes can be moved to the GPU, which accelerates training. A network of this type
that includes an RPN is called Faster RCNN.

5. Conclusion
    This article describes the development background of the convolutional neural network, together
with its basics, theory, composition, and principles, and the effect of each component. Lastly, it
introduces the application of CNN to handwritten digit recognition and to traffic sign recognition and
detection.

6. References
[1] Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and
     prospects. Science, 349(6245), 255-260.
[2] Khalid, S., Khalil, T., & Nasreen, S. (2014, August). A survey of feature selection and feature
     extraction techniques in machine learning. In 2014 Science and Information Conference (pp. 372-
     378). IEEE.
[3] Khan, A., Sohail, A., Zahoora, U., & Qureshi, A. S. (2020). A survey of the recent architectures of
     deep convolutional neural networks. Artificial Intelligence Review, 53(8), 5455-5516.
[4] Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Mohamed, N. A., & Arshad, H. (2018).
     State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11), e00938.
[5] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017, August). Understanding of a convolutional
     neural network. In 2017 International Conference on Engineering and Technology (ICET) (pp. 1-
     6). IEEE.
[6] O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv preprint
     arXiv:1511.08458.
[7] Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research
     [best of the web]. IEEE Signal Processing Magazine, 29(6), 141-142.
[8] Shamsuddin, M. R., Abdul-Rahman, S., & Mohamed, A. (2018, August). Exploratory analysis of
     MNIST handwritten digit for machine learning modelling. In International Conference on Soft
     Computing in Data Science (pp. 134-145). Springer, Singapore.
[9] Shustanov, A., & Yakimov, P. (2017). CNN design for real-time traffic sign recognition. Procedia
     Engineering, 201, 718-725.
[10] Luo, H., Yang, Y., Tong, B., Wu, F., & Fan, B. (2017). Traffic sign recognition using a multi-task
     convolutional neural network. IEEE Transactions on Intelligent Transportation Systems, 19(4),
     1100-1111.
[11] Zhu, S. (2019). Research on Traffic Sign Recognition Based on Deep Learning, Master’s thesis
     (pp. 46-57), Anhui University of Science and Technology.

</pre>