<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1109/TMI.2016.2535302</article-id>
      <title-group>
        <article-title>Architectural Heritage Images Classification Using Deep Learning With CNN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammed Hamzah Abed</string-name>
          <email>mohammed.abed@qu.edu.iq</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muntasir Al-Asfoor</string-name>
          <email>muntasir.al-asfoor@qu.edu.iq</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zahir M Hussain</string-name>
          <email>zmhussain@ieee.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science and Information Technology, University of Al-Qadisiyah</institution>
          ,
          <country country="IQ">Iraq</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Computer Science and Mathematics, Kufa University</institution>
          ,
          <country country="IQ">Iraq</country>
          ;
          <institution>School of Engineering, Edith Cowan University</institution>
          ,
          <addr-line>Joondalup</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>85</volume>
      <issue>16</issue>
      <fpage>1299</fpage>
      <lpage>1312</lpage>
      <abstract>
        <p>Digital documentation of cultural heritage images has emerged as an important topic in data analysis. As the size and number of images to be processed increase, categorizing them becomes a challenging task that may take an inordinate amount of time. This paper addresses these challenges by classifying the subject of each image using a Convolutional Neural Network (CNN). Classifying the available images improves the management of the image dataset and the search for a specific item, which helps in studying and analysing the heritage object of interest. Deep learning is employed throughout this study for the classification of architectural heritage images. The pre-trained convolutional neural networks GoogLeNet, ResNet-18 and ResNet-101 are applied to a public dataset of cultural heritage images. Experimental results show promising outcomes, with accuracies of 87.91%, 95.57% and 95.47% respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>One of the most important aspects of studying architectural cultural heritage documents is the diagnosis,
analysis and classification of the state of monuments and buildings, which contributes effectively to their
conservation and restoration. These documents must therefore accurately reflect the information that can be extracted from
heritage images [Llamas 17]. The volume of digital documentation of cultural heritage grows daily,
because many sources are available and because technology supports those who work on analysing and
understanding digital documentation. In addition, some of these images are taken by non-professionals with
phone cameras. Such sources are easy to download and use but, unlike the work of expert photographers, they
often lack clarity and come without source captions or categorization. In this paper, a deep learning technique is
proposed for classifying digital documentation (usually colored images). A convolutional neural
network (CNN) is used to extract useful information and features from the images to support the classification
process. Academia and industry are currently paying increasing attention to applications of deep learning and
convolutional neural networks for image classification, such as medical images [N 16], license plate and
vehicle recognition based on local tiled convolutional neural networks, as proposed by Yongbin Gao et al. [Gao 16],
satellite and aerial image classification based on CNNs by M. A. Kadhim et al. [Kadhim 20], and many more. Furthermore,
the literature has addressed cultural heritage image classification using pre-trained CNNs such as AlexNet, ResNet and
Inception v3 [Llamas 17], as well as Gabor filters for feature extraction combined with a support vector
machine for automatic architectural style recognition [Mathias 11]. In this work, the multi-label images of [Llamas 17] are
used in the experimental analysis to train our model to classify new test images, and pre-trained
CNNs are suggested for high-level feature extraction and classification.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Deep Learning</title>
      <p>Deep structured learning is a discipline within the field of machine learning [Abed 19], based on artificial
neural networks [Mustafa 19], that uses multiple layers to extract high-level features from raw input data or images.
The features are selected depending on the problem domain under study. For instance, in image processing,
low-level features can help to identify edges, while high-level features can help to identify the semantic concept of
images [Llamas 17]. Most modern deep learning strategies are based on the Convolutional Neural Network
[Abed 19].</p>
      <sec id="sec-2-1">
        <title>2.1. Convolutional Neural Network (CNN)</title>
        <p>The convolutional neural network is a class of deep learning network [Abdelaziz 19] inspired by the
artificial neural network [Abdelaziz 19]. A CNN is typically structured as a series of stages formed by layers.
The first stages consist of two kinds of layers, convolutional layers and pooling layers [Harangi 18]; at the
end of the network, fully connected layers classify the extracted features
[Harangi 18]. In this work, several convolutional neural networks are applied and tested, namely
GoogLeNet [Szegedy 15], ResNet-18 [He 16] and ResNet-101 [He 16]. A CNN is a multi-layer network
structure that basically consists of five kinds of layers: the input layer, convolutional layer, pooling layer, fully
connected layer and output layer, as shown in figure 1.
1- Input layer: the first layer of the convolutional neural network. In general, it is the input of the
whole CNN and is represented as a matrix of the image's pixels [Zhang 19].
2- Convolutional layer: this layer is responsible for extracting features from the input matrix. The
earlier convolutional layers extract low-level features such as edges, lines and corners, while the deeper
convolutional layers extract semantic, high-level features. Deep feature
extraction depends on learning the preceding low-level features of the entire matrix [Belhi 18]. The
convolutional layer holds multiple feature maps, each obtained by convolving the feature maps of the previous
layer with a convolution kernel, as shown in equation 1.</p>
        <sec id="sec-2-1-1">
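          <title>Where</title>
          <p><disp-formula><label>(1)</label><tex-math>x_j^{l} = f\left( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l} \right)</tex-math></disp-formula>
<inline-formula><tex-math>M_j</tex-math></inline-formula>: represents the input image.
<inline-formula><tex-math>x_j^{l}</tex-math></inline-formula>: represents the jth feature map of the lth layer.
<inline-formula><tex-math>*</tex-math></inline-formula>: represents the convolution operation.
<inline-formula><tex-math>x_i^{l-1}</tex-math></inline-formula>: is the ith feature map of layer l-1.
<inline-formula><tex-math>k_{ij}^{l}</tex-math></inline-formula>: represents the filter connecting the jth feature map of the lth layer and the ith feature map of layer l-1.
<inline-formula><tex-math>b_j^{l}</tex-math></inline-formula>: is the bias.
<inline-formula><tex-math>f</tex-math></inline-formula>: is the activation function. The most common activation functions are sigmoid, ReLU and tanh [Zhang 19], with the equations shown below.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Sigmoid</title>
          <p><disp-formula><label>(2)</label><tex-math>f(x) = \frac{1}{1 + e^{-x}}</tex-math></disp-formula></p>
        </sec>
        <sec id="sec-2-1-3">
          <title>ReLU</title>
          <p><disp-formula><label>(3)</label><tex-math>f(x) = \max(0, x)</tex-math></disp-formula></p>
        </sec>
        <sec id="sec-2-1-4">
          <title>Tanh</title>
          <p><disp-formula><label>(4)</label><tex-math>f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}</tex-math></disp-formula></p>
          <p>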
3- Pooling layer: reduces the dimensions of the feature maps, based on the limitations of the human
visual system [Zhang 19] [Belhi 18], by downsampling the convolutional maps. This reduction of
features helps the CNN to speed up the computation process. In general,
there are two kinds of pooling operations: maximum pooling and average pooling.
4- Fully connected layer: usually used in the last layers of the CNN structure, as shown in figure 1,
to combine the features of the former layers [Zhang 19].
5- Output layer: the final layer of the CNN architecture [Belhi 18]. Its input is finally passed through a
classifier; for binary and multi-class classification problems, the most commonly used
classifier is softmax, as shown in equation 5.</p>
          <p>Softmax: <disp-formula><label>(5)</label><tex-math>\sigma(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}</tex-math></disp-formula></p>
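          <p>The layer descriptions and equations 1 to 5 above can be illustrated with a minimal sketch in plain Python. It assumes a single-channel input, a hand-picked 2×2 kernel and non-overlapping 2×2 max pooling; the function names and toy values are illustrative assumptions and are not the networks used in this study. As in most CNN libraries, the "convolution" is computed as a cross-correlation (the kernel is not flipped).</p>
          <preformat><![CDATA[
```python
import math


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding ("valid" mode).

    Computes a cross-correlation: the kernel is applied without flipping.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def sigmoid(x):
    """Equation 2."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Equation 3."""
    return max(0.0, x)


def tanh(x):
    """Equation 4."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def feature_map(inputs, kernels, bias, activation=relu):
    """Equation 1 for one output map j: f(sum_i x_i * k_ij + b_j)."""
    partial = [conv2d_valid(x, k) for x, k in zip(inputs, kernels)]
    rows, cols = len(partial[0]), len(partial[0][0])
    return [
        [activation(sum(p[r][c] for p in partial) + bias) for c in range(cols)]
        for r in range(rows)
    ]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def softmax(scores):
    """Equation 5, with the maximum subtracted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 5x5 single-channel "image" and a 2x2 kernel (both made up).
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [1, 3, 2, 0, 1],
]
kernel = [[1, -1], [1, -1]]

fmap = feature_map([image], [kernel], bias=0.0)       # 4x4 feature map
pooled = max_pool(fmap)                               # [[1.0, 2.0], [2.0, 1.0]]
probs = softmax([v for row in pooled for v in row])   # four values summing to 1
```
]]></preformat>
          <p>In a real network the flattened pooled values would first pass through one or more fully connected layers before the softmax; that step is omitted here to keep the sketch short.</p>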
          <p>Three different pre-trained convolutional neural networks have been used in this study, namely:
2.1.1. GoogLeNet is a pre-trained convolutional neural network that is 22 layers deep
[Szegedy 15]. It has been trained on two different online datasets, ImageNet [Img net] and Places365 [Zhou 16], [Places].
The GoogLeNet version trained on ImageNet can classify images into 1000 categories, and the
version trained on Places365 can classify images into 365 different place categories. Both networks have an
input image size of 224 by 224. The network version used in our experimental study has 144 layers:
starting from the input layer, a convolution layer, ReLU and max pooling are repeated until the fully
connected layer is reached, after which the classifier layer gives the output classification result.
2.1.2. ResNet-18 [He 16] is a pre-trained convolutional neural network that has been trained on more than a
million images from the ImageNet dataset [Img net]. The network is 18 layers deep and can classify
images into 1000 categories. It has learned rich features from a wide range of images. The
input image size is 224 by 224. The network design used in this research has 72 layers.
2.1.3. ResNet-101 [He 16] is another version of the ResNet architecture. Like
ResNet-18, it is a pre-trained convolutional network trained on the ImageNet dataset. The network
has 347 layers and can classify images into 1000 categories.</p>
          <p>Table 1 shows the structure settings of the pre-trained convolutional networks used in the experiments.</p>
        </sec>
      </sec>
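      <sec id="sec-2-2">
        <title>Pre-trained CNN algorithm</title>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Structure settings of the pre-trained convolutional networks used in the experiments.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Pre-trained CNN algorithm</th><th>No. of layers</th></tr>
            </thead>
            <tbody>
              <tr><td>GoogLeNet</td><td>144</td></tr>
              <tr><td>ResNet-18</td><td>72</td></tr>
              <tr><td>ResNet-101</td><td>347</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>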
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Model: Reusing Pre-trained CNNs</title>
      <p>Transfer learning is one of the most commonly used techniques in deep learning applications. A deep convolutional neural
network that was trained on a large-scale image dataset is applied to architectural heritage images. The approach
retrains the pre-trained convolutional neural network on a new task and on new images. Transfer learning with
fine-tuning is much faster and easier than training a network from scratch with randomly initialized weights. Figure 2 shows
the architecture of the model proposed in this work.
1- Load the Architectural Heritage Images dataset to start the training phase. In this work, 70% of the images
were used for training and 30% for testing, randomly distributed from the dataset used in our
experiment.
2- Adjust and fit the layers' filters to the image
size used in the experimental case, starting from the input layer, by selecting the size and number
of color channels, then through the convolutional layers up to the fully connected layers. This information determines
how the features extracted from all layers are combined into trainable features.
3- Replace the classification layer to suit the number of final categories. The adaptation is applied
to the loss2-classifier layer and the output layer. In our case there are ten classes in all three experimental studies.
4- Run the training phase by loading the edited pre-trained convolutional neural network (GoogLeNet, ResNet-18
or ResNet-101). The features are extracted from the fully connected layer, which combines the low-level and
high-level features, and the network is trained on these features. After the CNN is trained, the test phase
predicts the class category of each test image as the output and checks the accuracy of the
model.</p>
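      <p>Steps 1, 3 and 4 above can be sketched in plain Python under strong simplifying assumptions. Here the pre-trained network is treated as a fixed feature extractor and only a new ten-class softmax layer is trained, whereas the study retrains the whole network. The randomly generated feature vectors merely stand in for the outputs of a pre-trained backbone, and the names split_train_test and ClassifierHead, the learning rate and the number of epochs are hypothetical choices made for this illustration.</p>
      <preformat><![CDATA[
```python
import math
import random


def split_train_test(items, train_ratio=0.7, seed=0):
    """Step 1: shuffle the samples and split them 70% / 30%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ClassifierHead:
    """Step 3: a new fully connected + softmax layer sized to the target classes."""

    def __init__(self, num_features, num_classes, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.01, 0.01) for _ in range(num_features)]
                        for _ in range(num_classes)]
        self.biases = [0.0] * num_classes

    def predict_proba(self, x):
        scores = [sum(w * v for w, v in zip(row, x)) + b
                  for row, b in zip(self.weights, self.biases)]
        return softmax(scores)

    def predict(self, x):
        probs = self.predict_proba(x)
        return probs.index(max(probs))

    def fit(self, samples, lr=0.5, epochs=100):
        """Step 4: stochastic gradient descent on the cross-entropy loss."""
        for _ in range(epochs):
            for x, label in samples:
                probs = self.predict_proba(x)
                for c, row in enumerate(self.weights):
                    grad = probs[c] - (1.0 if c == label else 0.0)
                    for i, v in enumerate(x):
                        row[i] -= lr * grad * v
                    self.biases[c] -= lr * grad


# Stand-in "backbone features": one noisy cluster per class (assumption).
NUM_CLASSES = 10
rng = random.Random(1)
data = [
    ([rng.gauss(1.0 if i == label else 0.0, 0.1) for i in range(NUM_CLASSES)], label)
    for label in range(NUM_CLASSES)
    for _ in range(10)
]

train_set, test_set = split_train_test(data)
head = ClassifierHead(num_features=NUM_CLASSES, num_classes=NUM_CLASSES)
head.fit(train_set)
accuracy = sum(head.predict(x) == y for x, y in test_set) / len(test_set)
```
]]></preformat>
      <p>Freezing the backbone keeps the sketch small; full fine-tuning, as described in this section, would also update the weights of the earlier layers.</p>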
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Result</title>
      <p>In this section the experimental setting is introduced first, to establish the basic idea of the work.</p>
      <sec id="sec-4-1">
        <title>4.1 The experimental setting</title>
        <sec id="sec-4-1-1">
          <title>In this paper, the experimental setting was as follows:</title>
          <p>The Architectural Heritage Elements Dataset (AHE_Dataset) is used in three versions. The dataset was originally
published in two versions: the first contains images of different sizes, and the second is
scaled to 128×128 pixels. In addition, a small dataset was created with image sizes of 64×64 and 32×32.
The third version is a subset of 500 images from each class, scaled to 224 ×
224 pixels to be compatible with the pre-trained CNNs. Table 2 illustrates the details of the dataset.</p>
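          <p>The selection of the third dataset version can be sketched as follows, assuming the images are listed as (path, label) pairs and that the 500 images per class are drawn at random; the text does not state how they were chosen. The class names, file names and the helper sample_per_class are hypothetical, and resizing to 224 × 224 pixels is left out because it needs an imaging library.</p>
          <preformat><![CDATA[
```python
import random
from collections import defaultdict


def sample_per_class(items, per_class=500, seed=0):
    """Pick up to `per_class` (path, label) pairs from every class at random."""
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        paths = by_label[label]
        # Classes with fewer than `per_class` images are kept whole.
        chosen = rng.sample(paths, min(per_class, len(paths)))
        subset.extend((path, label) for path in chosen)
    return subset


# Placeholder catalogue with three made-up classes of different sizes.
catalogue = [
    (f"{label}/{i:04d}.jpg", label)
    for label, count in [("class_a", 700), ("class_b", 500), ("class_c", 520)]
    for i in range(count)
]
subset = sample_per_class(catalogue)
counts = {label: sum(1 for _, l in subset if l == label)
          for label in ("class_a", "class_b", "class_c")}
```
]]></preformat>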
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Full training of Convolutional Neural Network</title>
        <p>Our experimental study is divided into three major parts based on the dataset version used. The first part trains on
the small version of AHE_Dataset, the second on the subset of AHE_Dataset with 500 images selected from each class, and the third on the
original version of AHE_Dataset. All versions of the dataset are fully trained with the three pre-trained
convolutional neural networks. Figure 3 shows the accuracy and loss for the small version of AHE_Dataset based on
GoogLeNet.
Figure 5: Accuracy and loss of sub-AHE_Dataset (2nd version) based on ResNet-101.
For the third version, which is the original version of the dataset, the highest accuracy
was obtained with ResNet-18. Figure 7 shows its accuracy and loss function, and figure 8 shows its confusion
matrix.</p>
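        <p>As a reminder of how accuracy relates to a confusion matrix such as the one in figure 8, the following minimal sketch builds both from lists of true and predicted class indices. The labels are made up for illustration and are not results from this study; the function names are likewise illustrative.</p>
        <preformat><![CDATA[
```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

cm = confusion_matrix(y_true, y_pred, num_classes=3)  # [[1, 1, 0], [0, 3, 0], [1, 0, 2]]
acc = accuracy(cm)                                    # 0.75
recall = per_class_recall(cm)
```
]]></preformat>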
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Result Analysis</title>
        <p>Table 3 summarizes the results of the different tests performed. For the dataset with 64×64
image size, GoogLeNet achieved the highest accuracy, 87.91%, at the same learning rate and number of iterations as
the other techniques. For images with a size of 128×128, the highest accuracy, 95.57%, was achieved by ResNet-18
under the same conditions. Finally, the accuracy on the sub-dataset with 224×224
image size is 95.47%, based on ResNet-101.
The best results obtained with the proposed system are compared with the different algorithms used in
[Llamas 17]. Table 4 summarizes the results of the different tests performed on the same
dataset and on a dataset of architectural styles with 25 categories [Xu 14].
The best result on the 64×64 dataset was achieved by ResNet (full training) [Llamas 17], while
on the 128×128 dataset the highest accuracy, 95.57%, was achieved by the proposed model based on ResNet-18.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work presents a model for architectural heritage image classification using deep learning with a
convolutional neural network. The model uses pre-trained CNN structures (GoogLeNet, ResNet-18 and
ResNet-101), all of which were tested on three versions of the dataset. For the small version of the dataset,
the classification results achieved by GoogLeNet outperformed those of the other CNNs. For the original version of the dataset, however,
ResNet-18 produced better classification results than the other pre-trained CNNs. Finally, for the subset of the
dataset, ResNet-101 achieved the highest accuracy, 95.47%.</p>
      <p>Author Contributions: conceptualization, Z.M.H., M.Al-Asfoor and M.H.A.; methodology, Z.M.H., M.Al-Asfoor and
M.H.A.; software, M.H.A.; validation, Z.M.H. and M.Al-Asfoor; writing (original draft preparation), M.H.A.;
writing (review and editing), Z.M.H. and M.Al-Asfoor.</p>
      <sec id="sec-5-1">
        <title>Funding: This research received no external funding</title>
      </sec>
      <sec id="sec-5-2">
        <title>Conflicts of Interest: The authors declare no conflict of interest</title>
        <p>[Gao 16] Gao Y, Lee HJ. Local Tiled Deep Networks for Recognition of Vehicle Make and Model. Sensors (Basel).
2016;16(2):226. Published 2016 Feb 11. doi:10.3390/s16020226
[Kadhim 20] Kadhim M.A., Abed M.H. Convolutional Neural Network for Satellite Image Classification. In: Huk
M., Maleszka M., Szczerbicki E. (eds) Intelligent Information and Database Systems: Recent Developments.
ACIIDS 2019. Studies in Computational Intelligence, vol 830. Springer, Cham. 2020.
https://doi.org/10.1007/978-3-030-14132-5_13
[Mathias 11] Mathias, M.; Martinovic, A.; Weissenberg, J.; Haegler, S.; Van Gool, L. Automatic Architectural
Style Recognition. In Proceedings of the 4th ISPRS International Workshop 3D-ARCH 2011, Trento, Italy, 2–4
March 2011; Volume XXXVIII-5/W16, pp. 171–176.
[Abed 19] Mohammed Hamzah Abed, Atheer Hadi Issa Al-Rammahi and Mustafa Jawad Radif, Real-Time Color
Image Classification Based on Deep Learning Network, Journal of Southwest Jiaotong University, vol. 54, no.
5, 2019. http://www.jsju.org/index.php/journal/article/view/384
[Abdelaziz 19] Abdelhak Belhi, Abdelaziz Bouras, Taha Alfaqheri, Akuha Solomon Aondoakaa and Abdul
Hamid Sadka, Investigating 3D holoscopic visual content upsampling using super-resolution for cultural heritage
digitization, Signal Processing: Image Communication, Volume 75, July 2019, Pages 188-198.
https://doi.org/10.1016/j.image.2019.04.005
[Harangi 18] Balazs Harangi, Agnes Baran and Andras Hajdu, Classification of skin lesions using an ensemble of
deep neural networks, 2018 40th Annual International Conference of the IEEE Engineering in Medicine and
Biology Society (EMBC), IEEE, 2018. DOI: 10.1109/EMBC.2018.8512800
[Szegedy 15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich , Going Deeper with Convolutions, IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) 2015. DOI: 10.1109/CVPR.2015.7298594
[He 16] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
doi: 10.1109/CVPR.2016.90
[Xu 14] Xu, Z.; Tao, D.; Zhang, Y.; Wu, J.; Tsoi, A.C. Architectural Style Classification Using Multinomial Latent
Logistic Regression. In Computer Vision—ECCV 2014; Springer: Cham, Switzerland, 2014; Volume 8689, pp.
600–615.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Llamas 17]
          <string-name>
            <surname>Llamas</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>M. Lerones</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medina</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zalama</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-García-Bermejo</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Classification of Architectural Heritage Images Using Deep Learning Techniques</article-title>
          .
          <source>Applied Sciences</source>
          .
          <year>2017</year>
          ;
          <volume>7</volume>
          (
          <issue>10</issue>
          ):
          <fpage>992</fpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.3390/app7100992</pub-id>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>