=Paper=
{{Paper
|id=Vol-2823/Paper3
|storemode=property
|title=Implementation of Handwritten Digit Recognizer using CNN
|pdfUrl=https://ceur-ws.org/Vol-2823/Paper3.pdf
|volume=Vol-2823
|authors=B M Vinjit, Mohit Kumar Bhojak, Sujit Kumar, Gitanjali Nikam
}}
==Implementation of Handwritten Digit Recognizer using CNN==
Implementation of Handwritten Digit Recognizer using CNN
B M Vinjit, Mohit Kumar Bhojak, Sujit Kumar and Gitanjali Nikam
National Institute of Technology, Kurukshetra Haryana, India
Abstract
In this paper, we present an implementation of Handwritten Digit Recognition. Handwritten Character Recognition (HCR) is one of the most challenging tasks in machine learning because every individual's handwriting is unique, which makes it difficult to train a model that can predict handwritten text with high accuracy. We have developed a Convolutional Neural Network (CNN) model that recognizes digits from images with 99.15% accuracy. This model is useful for converting handwritten numbers to digital form, and our purpose in building it is to develop a system that automatically inserts the marks awarded to students on answer sheets into a database. As more records are digitized, the proposed application can reduce both the effort and the mistakes involved in manually entering numbers into a database.
Keywords
Machine Learning, Artificial Intelligence, Convolutional Neural Network, Optical Character Recognition, Offline Recognition, Handwritten Character Recognition, Image Processing
ACI’21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25–27, 2021, Delhi, India
EMAIL: bm_51810062@nitkkr.ac.in (B M Vinjit);
ORCID: 0000-0003-4242-1885 (B M Vinjit);
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
    Handwritten Digit Recognition is a classic problem of image classification: handwritten digits must be classified into the labels 0-9. Neural network models are powerful and efficient classifiers for this task. Human beings can read and recognize characters and digits written by other people, and we want to give a machine the same ability using Artificial Intelligence and Machine Learning.
    The Covid-19 global pandemic made us realize the significance of digitalization in educational and business organizations. It makes processes faster, more efficient, and more accessible. Features like Auto Classification, Text Reader, and Automatic Data Extraction help to identify documents and index them accordingly. In remote teaching and work-from-home scenarios, these features can be pivotal. Optical Character Recognition (OCR) is the conversion of an image of text (printed or handwritten) to a machine-encoded format. This technology is widely used in form processing and data entry applications. The stages of OCR are shown in Fig. 1, and we go through this four-stage process while implementing an OCR model. Training a model on every handwriting in the world is impossible, as handwriting is unique to every individual. Instead, we can train a model on a large dataset of handwritten digits, such as the MNIST dataset, and test it on other handwriting. OCR is challenging but very useful for quick processing of data records like bank statements, emails, passport documents, invoices, and mark sheets. OCR digitizes handwritten and printed text, which makes functions like searching, sorting, and editing easy to apply.
    Accuracy is the most important parameter in our proposed system, which automates the manual entry of numeric data (marks, roll number, subject code, etc.), as even one wrongly recognized digit can have serious consequences. The accuracy and efficiency of the system depend on the methodology and dataset used. In this implementation, we have used a Convolutional Neural Network (CNN). CNNs have two key characteristics. First, the patterns that they learn are translation invariant [1]: after learning a certain pattern, a convolutional neural network can recognize it anywhere in the image. Second, they can learn spatial hierarchies of patterns. In the first convolution layer, the network learns small local patterns such as edges. In the second convolutional layer, it learns larger patterns made from the features of the first layer, and so on. This allows it to efficiently learn increasingly complex and abstract visual concepts [2].
    The paper is structured as follows: Section 2 describes the steps involved and other details of the implementation of the model. Section 3 gives the pseudocode of our implemented system. Section 4 contains a comparison with the traditional model for character recognition from images, and Section 5 contains the results and some directions for further improvement.

Figure 1: Stages of Optical Character Recognition [3]

2. Implementation
    In this section, we discuss the segmentation process and the implementation of the CNN model that we created using TensorFlow for recognizing digits in an image.
    Our first task is to split a numeric string into single digits. We used the OpenCV library to segment the string. We converted the colored image to grayscale using the BGR2GRAY conversion in OpenCV. For segmentation, we detect black pixels on the white background. We draw rectangular boxes around the digits in the string and extract them, and we use the matplotlib library to display the result. After extracting the digits from the string, we pass them to the CNN model.
    To develop the CNN model, we download and pre-process the MNIST handwritten digit dataset (see Fig. 2). We reshape the images into a 4D tensor to satisfy the input requirements of the CNN; this is done with a reshape command. Both the training-image and test-image tensors are reshaped into 4D tensors. The data is now prepared for training. To build the model, we first create the convolutional base. We use the sequential model and add a convolution layer to it using the layers.Conv2D command.
    The convolution operation involves a filter that captures a local pattern and is applied to the image as shown in Fig. 3. The filter is a 3D tensor of a specific width, height, and depth. Each entry in the filter is a real number, and the entries are learned during CNN training. The filter slides across the input, stopping at every possible position to extract a local 3D patch of the surrounding features. Each slide gives a new position of the filter, and we continue across the length and breadth of the image until the final position is reached. Since the input is a grayscale image, the depth is 1.
    Apart from convolution, there is a second important operation: pooling. Pooling aggressively downsamples the output of the convolution. The stride used while sliding across the image determines the next position of the filter along each axis, starting from the current position. We have taken stride = 1, which is the most common choice.
    A strided convolution tends to downsample the input by a factor proportional to the stride. With a stride of 1, the filter is shifted by one column to the right at each step. Once it exhausts all the columns, it is shifted down by one row. This is how we slide the filter across the image and try to match the pattern in the image. For example, with a 28 x 28 x 1 image and a filter of size 3 x 3 x 1, the filter can be placed at 26 possible positions along the width as well as along the height, so the final position of the filter is position 26. This is how we get a 26 x 26 x 1 output from the convolution. In a convolution layer, we
typically use k different filters. We define all these filters with the Conv2D layer of the tf.keras API. We specify the number of filters, the size of the patch, the activation, and the input shape. Here, we have 32 filters, each of size 3 x 3. The activation is applied after the linear combination of the filter weights with the pixel values of the image, and finally we specify the shape of the input. The filter size and the stride (which is 1 by default) apply to all k filters. After applying the convolution with k filters, we get a 3D tensor with the same number of rows and columns for each filter.
    All the outputs are combined, so we get 32 channels, each with a 26 x 26 output. Concretely, for our MNIST example, we get a 3D tensor with 26 rows, 26 columns, and 32 channels, i.e. one for each filter. The total number of parameters for this layer is 320, because each filter has 10 parameters (9 weights and 1 bias), as shown in Fig. 4.
    Pooling is usually done with a window of size 2 x 2 and a stride of 2. We apply the pooling policy to each 2 x 2 square in turn and select a single number based on that policy. We use either max pooling or average pooling as the pooling policy. The second important point is that the pooling operation is applied to each channel separately.
    If the output is 26 x 26 x 1 and we apply 2 x 2 max pooling, we get an output of 13 x 13 x 1. Max pooling therefore downsamples the output of the convolution, as depicted in Fig. 5. Note that max pooling does not have any parameters. In practice, we set up a series of convolution and pooling layers in CNNs. The number of convolution and pooling layers is a configurable parameter set by the designer of the network.
    In the current example, we use two convolution-pooling layers and one additional convolution layer at the end. We use 32 filters in the first convolution layer and 64 filters each in the second and third layers. Each filter is 3 x 3 in size and we use a stride of 1. We have not used any padding in any of the convolution layers. We use max pooling for downsampling, with a window of 2 x 2 and a stride of 2.

Figure 2: MNIST Dataset [4]

    The number of parameters in a convolution layer depends on the filter size, the number of input channels, and the number of filters. It does not depend on the height and width of the input. The width and height dimensions shrink as we go deeper into the network. We started with a height and width of 28 each, and after two convolution-pooling stages followed by a single convolution operation, we obtained a height and width of 3.
    In summary, we take an image and apply a series of convolution and pooling operations, which gives a representation that is fed into a feed-forward neural network. This network outputs the label corresponding to the digit written in the image.
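The shrinking of the spatial dimensions can be traced with a few lines of arithmetic. The following is a minimal sketch, assuming the usual output-size rule for an unpadded window, floor((n - k) / s) + 1, and the layer layout described in the text. The helper names `window_output_size` and `trace_shapes` are illustrative and are not part of the original implementation.

```python
def window_output_size(n, k, stride):
    """Number of positions a k-wide window can take along an axis of length n (no padding)."""
    return (n - k) // stride + 1


def trace_shapes(size=28):
    """Follow (height, width, channels) through the layers described in the text.

    Assumed layout: conv(32) -> pool -> conv(64) -> pool -> conv(64),
    with 3x3 filters at stride 1 and 2x2 max pooling at stride 2.
    """
    # (layer name, window size, stride, number of filters or None for pooling)
    layers = [
        ("conv 3x3, 32 filters", 3, 1, 32),
        ("max pool 2x2", 2, 2, None),
        ("conv 3x3, 64 filters", 3, 1, 64),
        ("max pool 2x2", 2, 2, None),
        ("conv 3x3, 64 filters", 3, 1, 64),
    ]
    channels = 1  # grayscale input
    shapes = []
    for name, k, stride, filters in layers:
        size = window_output_size(size, k, stride)
        if filters is not None:
            channels = filters  # pooling keeps the channel count unchanged
        shapes.append((name, (size, size, channels)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:22s} -> {shape}")
```

Running the trace reproduces the sequence 26, 13, 11, 5, 3 for the height and width, ending at the (3, 3, 64) tensor that is later flattened.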
Figure 3: 28x28 pixel image (left) and 3x3 filter (right) with activation function (Z)
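To make the sliding-filter picture concrete, here is a small pure-Python sketch of a single-channel convolution with stride 1 and no padding, followed by ReLU and 2 x 2 max pooling. It is for illustration only: the model in this paper relies on the convolution and pooling layers of tf.keras, and the function names, the synthetic image, and the filter values below are assumptions chosen for the demonstration.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image (stride 1, no padding).

    As in deep learning frameworks, the kernel is not flipped,
    so this is technically a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = bias
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Element-wise max(0, x)."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value of each non-overlapping 2x2 block (stride 2)."""
    out_h = len(feature_map) // 2
    out_w = len(feature_map[0]) // 2
    return [
        [
            max(
                feature_map[2 * r][2 * c],
                feature_map[2 * r][2 * c + 1],
                feature_map[2 * r + 1][2 * c],
                feature_map[2 * r + 1][2 * c + 1],
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Synthetic 28 x 28 image: left half dark (0.0), right half bright (1.0).
image = [[0.0] * 14 + [1.0] * 14 for _ in range(28)]

# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
]

features = relu(conv2d_valid(image, kernel))
pooled = max_pool_2x2(features)

print(len(features), len(features[0]))  # 26 26
print(len(pooled), len(pooled[0]))      # 13 13
```

The feature map is non-zero only in the two columns where the filter straddles the edge, which is the sense in which a filter "matches" a local pattern. In a trained CNN the filter values are learned rather than written by hand.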
    Here the input is (3, 3, 64). We pass it to the Flatten layer, which outputs 576 numbers (3 x 3 x 64); Flatten has no parameters. These 576 numbers are the input to each of the 64 units in the first dense layer. Each unit has 576 weights + 1 bias, i.e. 577 parameters, and with 64 such units the layer has 36,928 parameters. It produces 64 values, one per unit. The final dense layer has 10 units; each unit receives 64 values from the previous layer plus 1 bias, i.e. 65 parameters per unit, for a total of 650 parameters. Summing across the convolutional base and the fully connected top layers gives a total of 93,322 parameters. For training the model, we use the sparse_categorical_crossentropy loss with the Adam optimizer. We train the model for 5 epochs on the training images and training labels.
    In the case of a CNN, we define patches and perform a convolution operation between each patch and the filter: a linear combination of the pixel values at each position with the corresponding filter parameters, followed by a non-linear activation. In the case of a feed-forward neural network, we instead take the entire image and flatten it into a single array. Since we have a 28 x 28 image, we get an array of 784 numbers, which we pass to a hidden layer with 128 units followed by a dense layer of 10 units to get the output. In the equivalent flattened view of a convolution, the 9 values of a 3 x 3 patch are connected to a single node (a neuron or unit of the neural network), which performs a linear combination followed by an activation. In this way, a CNN captures local patterns.
    So, a CNN proceeds by capturing local patterns, whereas a feed-forward neural network captures a global pattern involving all the pixels.

Figure 4: 32 filters of size 3x3 applied to the image, giving a 3D tensor of 26x26x32

   Input image: (None, 28, 28, 1)
   32 conv. filters (3x3), stride=1: output (None, 26, 26, 32); ((3x3)+1)x32 = 320 parameters (the +1 is the bias)
   Max pool (2x2), stride=2: output (None, 13, 13, 32)
   64 conv. filters (3x3), stride=1: output (None, 11, 11, 64); ((3x3x32)+1)x64 = 18496 parameters
   Max pool (2x2), stride=2: output (None, 5, 5, 64)
   64 conv. filters (3x3), stride=1: output (None, 3, 3, 64); ((3x3x64)+1)x64 = 36928 parameters
Figure 5: Convolutional Model Building process and parameter calculation

3. Pseudocode
    In this section, we give the pseudocode for segmenting an image that contains a string of numeric digits, and for the CNN model that recognizes each digit in the string and stores it.

segmentation()
   img = image of string
   image_size = height_of_image * width_of_image

   {Preprocess image using MSER in opencv library}

   {Convert the image to grayscale}
   gray_image = BGR2GRAY(img, image_size)

   {Using inbuilt opencv function to detect regions}
   x, y = detectRegions(binary_values_of_gray_image)

   make_rectangles(x, y, x + height_of_image, y + width_of_image)

   {Inbuilt function in matplotlib library}
   plot(gray_image)
   return gray_image

character_recognition()
   load(MNIST_dataset)

   {Convert 3D tensor to 4D tensor}
   reshape(train_images, 4)
   reshape(test_images, 4)

   {Normalize pixel values to be between 0 and 1}
   train_images = train_images / 255.0
   test_images = test_images / 255.0

   {Make a sequential model by adding convolutional layers with relu activation}
   add_conv_layer(activation='relu')
   add_MaxPooling()
   add_conv_layer(activation='relu')
   add_MaxPooling()
   add_conv_layer(activation='relu')
   print(model.summary())
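The reshape and normalization steps of character_recognition() can be illustrated without any library. The sketch below is an assumption about what those steps amount to: a batch of grayscale images stored as nested lists of shape (N, 28, 28) gains a trailing channel axis to become (N, 28, 28, 1), and 8-bit pixel values are scaled to the range 0 to 1. With array libraries the same effect is normally obtained with a reshape call and a division by 255.0. The helper names and the checkerboard test images are hypothetical.

```python
def to_4d_normalized(images):
    """Add a trailing channel axis and scale 0-255 pixel values to 0.0-1.0."""
    return [
        [[[pixel / 255.0] for pixel in row] for row in image]
        for image in images
    ]


def shape_of(nested):
    """Return the shape of a regular nested list, e.g. (2, 28, 28)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


# Two synthetic 28 x 28 checkerboard images standing in for MNIST samples.
batch = [
    [[255 if (r + c) % 2 == 0 else 0 for c in range(28)] for r in range(28)]
    for _ in range(2)
]

prepared = to_4d_normalized(batch)

print(shape_of(batch))     # (2, 28, 28)
print(shape_of(prepared))  # (2, 28, 28, 1)
```

The extra axis of length 1 is the depth (channel) dimension that a convolution layer expects for grayscale input.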
Figure 6: Summary of Model and total parameters
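The totals in the model summary can be re-derived by hand. The sketch below assumes the layer sizes stated in Section 2 and uses two illustrative helpers, `conv_params` and `dense_params`, which are not part of any library. It also counts the parameters of the feed-forward baseline, under the assumption that it flattens a 28 x 28 MNIST image into 784 inputs before a 128-unit hidden layer and a 10-unit output layer.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights of one filter (kernel_h * kernel_w * in_channels) plus one bias, times the filter count."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units


cnn_layers = {
    "conv1 (32 filters)": conv_params(3, 3, 1, 32),
    "conv2 (64 filters)": conv_params(3, 3, 32, 64),
    "conv3 (64 filters)": conv_params(3, 3, 64, 64),
    "dense (64 units)": dense_params(3 * 3 * 64, 64),
    "dense (10 units)": dense_params(64, 10),
}
cnn_total = sum(cnn_layers.values())

# Assumed baseline: 784 inputs -> 128 hidden units -> 10 outputs.
ffnn_total = dense_params(28 * 28, 128) + dense_params(128, 10)

for name, count in cnn_layers.items():
    print(f"{name:20s} {count:>7d}")
print(f"{'CNN total':20s} {cnn_total:>7d}")    # 93322
print(f"{'FFNN total':20s} {ffnn_total:>7d}")  # 101770
```

Pooling and Flatten layers are omitted from the count because they have no trainable parameters.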
    To complete our model, we feed the last output tensor from the convolutional base (of shape (3, 3, 64)) into one or more dense layers to perform classification. Dense layers take vectors (which are 1D) as input, while the current output is a 3D tensor. First, we flatten (or unroll) the 3D output to 1D, then add one or more Dense layers on top, as shown in the model summary in Fig. 6. MNIST has 10 output classes, so we use a final dense layer with 10 outputs and a softmax activation [5].

   flatten_layers(model)
   add_dense_layer(64, activation='relu')
   add_dense_layer(10, activation='softmax')
   print(model.summary())

   {Compile the model with the adam optimizer, sparse categorical crossentropy loss, and the metric we are interested in, i.e. accuracy}
   compile(model, optimizer='adam', loss='sparse_categorical_crossentropy', metrics='accuracy')

   {Train model for 5 epochs/iterations}
   train(model, train_images, train_labels, epochs=5)
   test_accuracy = evaluate(model, test_images, test_labels)
   print(test_accuracy)
   gray_image = segmentation()

   {Recognise the segmented images one by one and print each digit}
   for (all the images in rectangles in gray_image)
        test_predict = model.predict(image)
        max = max_value(test_predict)
        number = index_of(max)
        print(number)

4. Comparison with the Traditional Method
    We carried out a detailed review and comparative study of significant handwritten character recognition methods and techniques in our paper "A Review on Handwritten Character Recognition Methods and Techniques", published in the International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 2020 [6].
    In the traditional machine learning flow, given an image, we first perform feature engineering using computer vision libraries. The resulting features are fed into a machine learning classifier, which gives the output after training. This feature engineering step is now being replaced by CNNs, so we can think of a CNN as a way of generating features automatically for a given image. The beauty of this approach is that the right representation is learned during model training, freeing us from expensive and tedious feature engineering tasks.
    The Feed Forward Neural Network (FFNN) has more than 100k parameters, compared with the 93k parameters of our CNN. Despite that, classification with the FFNN is less accurate on both training and test data, as we can see in Fig. 7.

5. Conclusion
    We can see from Fig. 8 that, in the output, the digits in the string are enclosed in rectangular boxes and can be extracted using the crop function. As shown in Fig. 10, our CNN model has a training accuracy of 99.36% and a testing accuracy of 99.15%. When we segment the string and pass the digits into our model, they are correctly recognized, as we see in Fig. 9, where the grayscale image of the digit '8' is correctly predicted by our model. Similarly, we can pass all the digits of the segmented string one by one to obtain the string/number in digital format.
    This model can therefore be used to build the proposed system, which automates the process of storing marks and other details like roll number and subject code in a database just by taking a photograph. It would nearly eliminate the manual process, which is hectic, tedious, and error-prone.
    Further scope involves extending this model to recognize alphabetic characters and character strings by training it on a suitable dataset, so that the system can be used in other domains of HCR as well.
Figure 7: Training and Testing accuracy using Feed Forward Neural Network
Figure 8: Output of segmentation
   Figure 9: Grayscale input and predicted output by our model
Figure 10: Training and Testing accuracy of CNN model
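The final loop of the pseudocode in Section 3, which turns each prediction into a digit and the digits into a number, can be sketched as follows. This is a minimal pure-Python illustration: `softmax` and `decode_digits` are hypothetical helpers, and the score vectors are made-up stand-ins for the 10 values the model outputs per segmented digit, not results from the trained model.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_digits(predictions):
    """Pick the most probable class (index 0-9) for each digit image and join them."""
    digits = []
    for probabilities in predictions:
        best = max(range(len(probabilities)), key=lambda i: probabilities[i])
        digits.append(str(best))
    return "".join(digits)


# Made-up raw scores for three segmented digit images, in left-to-right order.
raw_scores = [
    [0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0, 4.0, 0.2],  # strongest at index 8
    [0.0, 0.1, 0.0, 0.2, 0.1, 3.5, 0.3, 0.0, 0.4, 0.1],  # strongest at index 5
    [5.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.0, 0.4, 0.1],  # strongest at index 0
]

predictions = [softmax(scores) for scores in raw_scores]
number = decode_digits(predictions)
print(number)  # 850
```

Because the class index equals the digit itself for MNIST labels, the index of the largest softmax output is the recognized digit. The joined string is what would be written to the database in the proposed system.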
6. References
[1] Donald J. Norris. "Chapter 6 CNN demonstrations", Springer Science and Business Media LLC, 2020.
[2] José-Sergio Ruiz-Castilla, Juan-José RangelCortes, Farid García-Lamont, Adrián TruebaEspinosa. "Chapter 54 CNN and Metadata for Classification of Benign and Malignant Melanomas", Springer Science and Business Media LLC, 2019.
[3] RS, S.N., and Afseena, S., 2015. Handwritten Character Recognition–A Review. International Journal of Scientific and Research Publications.
[4] MNIST image-Lim, Seung-Hwan & Young, Steven & Patton, Robert. (2016). An analysis of image storage systems for scalable training of deep neural networks.
[5] Internet Source - www.tensorflow.org
[6] B. M. Vinjit, M. K. Bhojak, S. Kumar and G. Chalak, "A Review on Handwritten Character Recognition Methods and Techniques," 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 2020, pp. 1224-1228, DOI: 10.1109/ICCSP48568.2020.9182129.