Algorithm Comparison for Cultural Heritage Image
                      Classification

                                                   Radmila Janković
                        Mathematical Institute of the Serbian Academy of Sciences and Arts
                                                  Belgrade, Serbia
                                             rjankovic@mi.sanu.ac.rs


                                                        Abstract
                       Digitization is an important part of the development of online
                       systems. It includes, among other things, the deployment, categorization,
                       and preservation of audio, video, and textual content online.
                       Such a process is especially interesting from the perspective of cultural
                       heritage, as it allows the long-term preservation and sharing of culture
                       worldwide. This study observes four classification algorithms: (i) the
                       multilayer perceptron, (ii) averaged one dependence estimators, (iii)
                       forest by penalizing attributes, and (iv) the k-nearest neighbor rough
                       sets and analogy based reasoning, before and after attribute selection,
                       and compares these with the results obtained from a convolutional
                       neural network. The obtained results show that the best
                       classification performance was achieved by the multilayer perceptron,
                       followed by the convolutional neural network.


1    Introduction
With the increased use of digitization in the domain of cultural heritage, it is possible to preserve and promote
the cultural heritage present in every part of the world. Every country has its own practices, places,
values, objects, and arts that were created throughout history and that represent its cultural heritage. Through
cultural heritage, knowledge is shared and passed on from generation to generation. Some examples of
cultural heritage include photographs, historical monuments, various types of documents, and archaeological
sites.
   There are three pillars of digital cultural heritage: (i) digitization focusing on conversion of objects into digital
form, (ii) access to digital heritage, and (iii) long-term preservation of digital objects [IDST12]. Classification
is an important part of digitization as it includes building a classification model that groups new inputs into
categories, based on the previously available set of data. In terms of cultural heritage, classification is particularly
important because it allows the preservation of heritage for future generations. Furthermore, digitization enables
promotion of cultural heritage by using innovative technologies to increase accessibility. Through digitization, a
country’s cultural heritage is being promoted globally, thus contributing to cultural diversity.
   Different methods for cultural heritage image classification have been investigated by various authors. Deep
learning algorithms were used for image classification in [LlMMZG17]. In particular, AlexNet and Inception V3

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC
BY 4.0).
In: A. Amelio, G. Borgefors, A. Hast (eds.): Proceedings of the 2nd International Workshop on Visual Pattern Extraction and
Recognition for Cultural Heritage Understanding, Bari, Italy, 29-Jan-2020, published at http://ceur-ws.org


convolutional neural networks (CNNs) were used, as well as ResNet and Inception-ResNet-v2 residual networks.
The dataset included architectural cultural heritage divided into 10 categories. The results showed that deep
learning methods perform better than other state-of-the-art methods, particularly when dealing with complex
problems [LlMMZG17]. The performance of deep learning methods has also been investigated in [KHP18], where
CNNs were used to classify images, audio and video data, while Recurrent Neural Networks (RNNs) were used to
classify the text belonging to the cultural heritage of Indonesia [KHP18]. It was observed that the RNN achieved
the highest accuracy, while the CNN obtained good accuracy for image and video classification (76%) [KHP18].
Considering other methods, k-nearest neighbor (kNN) classification was used to classify cultural heritage images
of 12 monuments and landmarks in Pisa [AFG15], but also to classify and detect alterations on historical buildings
with a high accuracy (92%) [MPAL15]. Various decision tree algorithms including J48, random tree, random
forest and fast random forest were investigated in [GDPR18]. Classification was performed in WEKA on a set
consisting of 3D cultural heritage models, and the results showed that the fast random forest achieves the highest
accuracy of 69% [GDPR18]. Different types of image classification techniques were investigated in [AJ19]. In
particular, naive Bayes and Support Vector Machine (SVM) algorithms are widely used for tangible and movable
cultural heritage, while the intangible cultural heritage is mostly classified using SVM, kNN, CNN, decision
trees and Conditional Random Fields - Gaussian Mixture Model (CRF-GMM) algorithms [AJ19]. Tangible and
immovable cultural heritage is most commonly classified using CNNs [AJ19].
   The aim of this paper is to compare the performance of several classification algorithms for cultural heritage
image classification, in particular: (i) the multilayer perceptron (MLP), (ii) averaged one dependence estimators
(AODE), (iii) forest by penalizing attributes (Forest PA), and (iv) the k-nearest neighbor rough sets and analogy
based reasoning (RSeslibKnn). The performance was observed on a full set of attributes as well as on the
reduced set of attributes, and the results were compared. Furthermore, a CNN was also developed for comparison
purposes, as deep learning represents a state-of-the-art technique [LlMMZG17, KHP18] and it is interesting to
observe and compare its performance with the performance of the other algorithms used in this study.
   This paper is organized as follows. Section 2 explains the data and methodology used in this research, while
Section 3 presents the results and discussion. Finally, Section 4 contains concluding remarks.

2     Data and methodology
2.1   Data
The cultural heritage image classification was performed on a public dataset created by [LlMMZG17] and obtained
from Datahub (https://old.datahub.io/dataset/architectural-heritage-elements-image-dataset).
The dataset consists of 10,235 images of size 128 × 128 pixels, but for the purpose of this study, 4,000 images
from 5 out of 10 categories were randomly chosen. These images include altars, gargoyles, domes, columns,
and vaults (Figure 1).

2.2   Methodology
The experiments were performed in WEKA (Waikato Environment for Knowledge Analysis) [WFHP16], a free
data mining software based on Java, while the CNN model was developed in Python v.3.7 with the use of the
Keras library. The experiments were performed on a Windows machine with a 2.3 GHz processor and 8 GB of
RAM.
   In WEKA, feature extraction was performed using the algorithms integrated in the imageFilters
package. Three types of filters were applied to the dataset: (i) the edge histogram, (ii) the color
layout, and (iii) the JPEG coefficients. Feature extraction for the CNN was not performed manually, as the
network learns its features directly from the images.
   The edge histogram extracts the MPEG-7 edge features from the images. In particular, it detects the directions
of edges in images based on the changes in frequency and brightness [WPP02]. There are five types of edges:
vertical, horizontal, 45-degree diagonal, 135-degree diagonal, and non-directional [WPP02]. The color layout
feature extraction is performed by dividing the image into 64 blocks and calculating the average color for each
block, using the color layout filter in WEKA. Finally, the JPEG coefficients were extracted by splitting the image
based on different frequencies, hence keeping only the most important frequencies [MD13].
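
The block-averaging step of the color layout feature can be sketched as follows. This is a minimal illustration only, assuming an 8 × 8 grid (64 blocks) and an image given as nested lists of (R, G, B) tuples; the function name block_average_colors is hypothetical, and the color layout filter in WEKA may apply further processing to the block averages that is not reproduced here.

```python
def block_average_colors(image, grid=8):
    """Average the (R, G, B) color of each cell in a grid x grid partition.

    `image` is a list of rows; each row is a list of (R, G, B) tuples.
    Assumes that the height and width are divisible by `grid`.
    """
    height, width = len(image), len(image[0])
    bh, bw = height // grid, width // grid  # block height and width in pixels
    averages = []
    for by in range(grid):
        for bx in range(grid):
            sums = [0, 0, 0]
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    for c in range(3):
                        sums[c] += image[y][x][c]
            n = bh * bw
            averages.append(tuple(s / n for s in sums))
    return averages


# Synthetic 128 x 128 image (not from the dataset): left half red, right half blue.
image = [[(255, 0, 0) if x < 64 else (0, 0, 255) for x in range(128)]
         for _ in range(128)]
features = block_average_colors(image)
print(len(features))              # 64 blocks
print(features[0], features[7])   # first and last block of the top row
```

With a 128 × 128 image, each of the 64 blocks covers 16 × 16 pixels, so the image is summarized by 64 average colors.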
   After feature extraction, the dataset consisted of 307 attributes. The first results were generated on the full
set of attributes, while the second results were generated on the attribute-reduced set of data in order to evaluate
the change in performance before and after attribute selection. Feature selection was performed in WEKA using



Figure 1: Example of images used in this study for each of the five classes, in particular: (a) altar, (b) gargoyle,
(c) column, (d) vault, and (e) dome.


the AttributeSelection filter. The attribute search was performed using the best-first method, and attribute
evaluation using the CFS subset evaluator. After feature selection, the number of attributes was reduced to 89.
The dataset was divided into 70% of images for training and 30% for testing the algorithms. Four algorithms
were tested and compared for the purpose of image classification: (i) MLP, (ii) Forest PA, (iii) AODE, and (iv)
RSeslibKnn. Additionally, in order to compare the performance of these algorithms with the state-of-the-art
techniques, a deep learning CNN model was developed and tested.
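
The idea behind correlation-based feature subset (CFS) evaluation can be illustrated with a short sketch. It assumes the commonly cited CFS merit, k * r_cf / sqrt(k + k * (k - 1) * r_ff), and uses the Pearson correlation as a stand-in correlation measure; the evaluator in WEKA may measure correlation differently, and the best-first search over subsets is not shown. The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def cfs_merit(columns, target, subset):
    """CFS merit of the attribute indices listed in `subset`.

    r_cf is the mean absolute attribute-class correlation and r_ff is the
    mean absolute attribute-attribute correlation within the subset.
    """
    k = len(subset)
    r_cf = sum(abs(pearson(columns[i], target)) for i in subset) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for n, i in enumerate(subset) for j in subset[n + 1:]]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data: attribute 0 is informative, attribute 1 is a scaled copy of it
# (fully redundant), attribute 2 is only weakly related to the class.
target = [0, 0, 0, 1, 1, 1]
columns = [
    [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],
    [0.2, 0.4, 0.2, 1.8, 1.6, 1.8],
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
]
print(cfs_merit(columns, target, [0]))
print(cfs_merit(columns, target, [0, 1]))  # redundant attribute adds no merit
print(cfs_merit(columns, target, [0, 2]))  # weak attribute lowers the merit
```

A subset scores well when its attributes correlate with the class but not with each other, which is why such an evaluator can discard a large share of redundant attributes.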
   The MLP network consists of three layers: an input layer, a hidden layer, and an output layer. The layers
consist of neurons, where the neurons in one layer are connected to the neurons in the next layer [PM19].
The MLP uses a nonlinear activation function, making it suitable for different types of problems, without any
assumptions regarding data distribution [GD98]. The AODE is a classification technique that calculates the
probability of each class and creates a set of one-dependence probability distribution estimators [WBW05]. It
holds less rigorous independence assumptions than naive Bayes, hence it is suitable for a wide range of problems.
Forest PA is a relatively new decision forest algorithm developed in 2017 by [AI17] in order to overcome the
limitations of the random forest algorithm. The algorithm works in such a way that it uses the full set of
attributes for forest creation, but it also assigns weights to those attributes that already participated in the
previous decision tree. It generates the bootstrap sample from the training set and creates a decision tree using
the attribute weights [AI17]. RSeslibKnn is a k-nearest neighbor classifier that uses a fast neighbor search,
thus making it appropriate for use on large datasets [WL19]. A distance measure is calculated based on the
weighted sum of distances, which results in the creation of an indexing tree [WL19]. The classification is performed
by finding the k nearest neighbors in the set. The CNN is a neural network consisting of standard types of layers,
as well as convolution and pooling layers. It is widely used for image processing, as it is able to automatically
scan and extract features from images. In particular, each CNN is composed of convolutional layers followed
by the pooling layers that reduce the dimensionality of the data. The features extracted through these layers
are then transformed into a vector using the flattening layer, and the obtained vector is further distributed to a
dense layer, thus forming a fully-connected network.
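
The flow from convolution through pooling to flattening can be illustrated on a toy input. The sketch below is not the network used in this study: it assumes a single-channel input, a single 3 × 3 filter without padding, and 2 × 2 non-overlapping max pooling, and all function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation without padding (the operation CNN layers apply)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[y * size + i][x * size + j]
                 for i in range(size) for j in range(size))
             for x in range(out_w)]
            for y in range(out_h)]


def flatten(feature_map):
    """Turn a 2D feature map into a 1D vector for the dense layer."""
    return [v for row in feature_map for v in row]


image = [[float(x + y) for x in range(6)] for y in range(6)]  # toy 6 x 6 input
kernel = [[1.0, 0.0, -1.0]] * 3        # 3 x 3 filter responding to horizontal change
fmap = conv2d_valid(image, kernel)     # 4 x 4 feature map
pooled = max_pool(fmap)                # 2 x 2 after pooling
vector = flatten(pooled)               # length-4 vector
print(len(fmap), len(pooled), len(vector))  # 4 2 4
```

Each stage shrinks the representation (36 values, then 16, then 4), which is the dimensionality reduction that the pooling layers provide before the dense layer.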

Parameter configuration
The MLP consisted of one hidden layer with 50 neurons, a learning rate of 0.5, a momentum of 0.6, and a
batch size of 32. The sigmoid activation function was utilized in the hidden and output layers. All attributes
were standardized. RSeslibKnn used the city and simple value difference as the distance measure,
the inverse square distance as the voting method, and the distance based weighting method. The AODE was


configured with the default WEKA parameters. Lastly, the Forest PA included 30 trees in the forest.
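
The inverse square distance voting used by the nearest neighbor classifier can be sketched as follows. The sketch assumes that all attributes are numeric, so that only the city-block part of the distance measure applies, and it omits attribute weighting and the indexing tree; the value k = 3, the function names, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def city_block(a, b):
    """City-block (Manhattan) distance between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k=3, eps=1e-12):
    """Classify `query` from the k nearest (vector, label) training pairs.

    Each neighbor votes with weight 1 / d^2; `eps` avoids division by zero
    when a neighbor coincides with the query.
    """
    neighbors = sorted(train, key=lambda item: city_block(item[0], query))[:k]
    votes = defaultdict(float)
    for vector, label in neighbors:
        d = city_block(vector, query)
        votes[label] += 1.0 / (d * d + eps)
    return max(votes, key=votes.get)


# Hypothetical two-attribute training set.
train = [([0.0, 0.0], "altar"), ([0.0, 1.0], "altar"),
         ([4.0, 4.0], "dome"), ([5.0, 4.0], "dome"), ([1.0, 1.0], "dome")]
print(knn_predict(train, [0.2, 0.2]))  # altar
print(knn_predict(train, [0.9, 0.9]))  # dome
```

In the second query, two of the three nearest neighbors are altars, yet the single very close dome neighbor outweighs them, which is the effect of weighting votes by inverse squared distance rather than counting them equally.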
   The CNN configuration involved four convolution layers with 32 neurons in the first two layers, 64 neurons
in the last two convolution layers, one hidden layer with 128 neurons, and an output layer with 5 neurons. The
convolutional and hidden layers used the hyperbolic tangent activation function, while the output layer used
the softmax activation. The kernel size of the convolutional layers was set to 3 × 3, while the pooling size in
the pooling layer was set to 2 × 2. The dropout was set to 0.2. Lastly, in order to avoid overfitting, the early
stopping regularization parameter was used with patience set to 3. The number of epochs was set to 50, but the
training stopped after 18 epochs because there was no further improvement in the accuracy. The model used
80% of data for training and the rest of the data for validation.
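
The early stopping rule with a patience of 3 can be sketched as a short function. This is an illustration under assumptions: the monitored quantity is taken to be a per-epoch validation score for which higher is better, the function name is hypothetical, and the scores are made-up values rather than the training history of this study.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without the
    monitored score exceeding the best value seen so far. If that never
    happens, all epochs are run.
    """
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_scores)


scores = [0.50, 0.60, 0.70, 0.69, 0.68, 0.70, 0.66]  # hypothetical values
print(early_stopping_epoch(scores, patience=3))  # 6
```

Here the best score is reached at epoch 3, and epochs 4 to 6 bring no improvement, so training ends at epoch 6 even though more epochs were allowed.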

Evaluation Metrics
The classification performance can be evaluated using several metrics. In particular, this study observed and
analyzed the percentage of correctly classified instances, precision, recall, F-measure, kappa statistic, and ROC
area. The percentage of correctly classified instances is the share of instances that the algorithm assigns to
their true class, while precision is the fraction of instances that truly belong to the observed class among all
instances that the algorithm classifies into that class. Recall is the true positive rate, i.e., the fraction of
instances of the observed class that the algorithm identifies, while the F-measure combines precision and
recall as their harmonic mean. F-measure values closer to 1 indicate better classification performance. Kappa is a
measure of agreement that typically takes values between 0 and 1, with values in the range 0.81–1 representing
almost perfect agreement, while values close to 0 represent poor agreement [SW05]. Lastly, the ROC area
is the area under the curve that plots the true positive rate against the false positive rate, and values close
to 1 indicate near-perfect prediction [FUW06].
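
As a small worked example of the per-class metrics, the sketch below computes precision, recall, and the F-measure from raw counts, taking the F-measure as the harmonic mean of precision and recall (the usual definition). The function name and the counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one class: 80 true positives,
# 20 false positives, 10 false negatives.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.889 0.842
```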

3   Results
All algorithms used in this study were first tested on the full dataset consisting of 307 attributes. The
best-performing algorithm in this case was the MLP with an accuracy of 85%, followed by RSeslibKnn,
AODE, and Forest PA with 82%, 79%, and 78%, respectively (Table 1).

                            Table 1: Algorithm performance before attribute selection

                  Algorithm                        MLP        Forest PA    AODE       RSeslibKnn
                  Correctly classified instances   84.83%     77.92%       79.25%     82.17%
                  Kappa statistics                 0.810      0.724        0.741      0.777
                  Precision                        0.849      0.778        0.793      0.824
                  Recall                           0.848      0.779        0.793      0.822
                  F-measure                        0.848      0.778        0.791      0.820
                  ROC Area                         0.974      0.954        0.959      0.888
                  Running time (in seconds)        1038.09    54.02        122.72     93.14


   The MLP also obtained the best values in terms of the kappa statistic, precision, recall, and the F-measure
compared with the other three algorithms (Table 1). In terms of running time, the fastest algorithm was Forest
PA with 54.02 seconds.
   After observing the results obtained from the full set of data, the next step involved observing the
performance of the algorithms on the reduced set of attributes. The MLP algorithm again performed the best, correctly
classifying 98.9% of instances (Table 2). The other algorithms obtained lower classification accuracy, in particular
80.83%, 80.67%, and 78.67% for AODE, RSeslibKnn, and Forest PA, respectively. Observing the other
performance measures, the MLP also obtained the highest value of the kappa statistic (0.986), followed by AODE (0.760),
RSeslibKnn (0.758), and Forest PA (0.733). According to the interpretation given in Section 2, these values indicate
substantial to almost perfect agreement [SW05]. Moreover, the MLP obtained the highest value of the F-measure (0.986),
followed by RSeslibKnn (0.811), AODE (0.807), and Forest PA (0.797). Lastly, observing the value of the ROC
area, the results indicate that the MLP algorithm performed the best with a value of 0.996, followed by
AODE, Forest PA, and RSeslibKnn with ROC area values of 0.965, 0.959, and 0.879, respectively. These results
show good classification power of the observed algorithms, with MLP and AODE performing the best (Table
2), but it should be noted that the MLP algorithm requires a much longer running time than the other three
algorithms.

                           Table 2: Algorithm performance after attribute selection

                  Algorithm                        MLP      Forest PA   AODE      RSeslibKnn
                  Correctly classified instances   98.9%    78.67%      80.83%    80.67%
                  Kappa statistics                 0.986    0.733       0.760     0.758
                  Precision                        0.989    0.787       0.808     0.811
                  Recall                           0.989    0.787       0.808     0.807
                  F-measure                        0.986    0.797       0.807     0.805
                  ROC Area                         0.996    0.959       0.965     0.879
                  Running time (in seconds)        892.02   62.01       115.17    109.27

   In order to gain more insight into the power of the observed classification algorithms, confusion matrices
were generated (Table 3). A confusion matrix shows the number of correctly (and incorrectly) classified
instances by class, where the numbers on the diagonal represent accurate classifications. Before attribute
selection, the algorithms mostly misclassified images of gargoyles, columns, and vaults. In particular, the MLP most
accurately classified the dome images, while the highest number of misclassifications for the MLP algorithm
was observed for the vault and column images. AODE most accurately classified altar images, while Forest
PA most accurately classified the dome images. RSeslibKnn classified altar images most accurately, while its
highest number of wrongly classified instances was observed for the column images (Table 3).

                 Table 3: The confusion matrices for each algorithm, before attribute selection

                  Algorithm      Altar    Column     Dome    Gargoyle    Vault   Classified as
                                 216      6          2       0           19      altar
                                 4        193        9       18          14      column
                  MLP            3        16         220     13          1       dome
                                 0        8          15      196         6       gargoyle
                                 20       12         3       13          193     vault
                                 217      9          2       1           14      altar
                                 13       163        27      19          16      column
                  AODE           2        13         213     21          4       dome
                                 0        7          25      174         19      gargoyle
                                 32       7          1       17          184     vault
                                 206      14         1       2           20      altar
                                 7        167        27      21          16      column
                  Forest PA      2        17         215     14          5       dome
                                 1        12         26      170         16      gargoyle
                                 29       13         0       22          177     vault
                                 231      4          2       0           6       altar
                                 19       169        17      15          18      column
                  RSeslibKnn     8        13         224     8           0       dome
                                 3        12         29      174         7       gargoyle
                                 33       6          2       12          188     vault
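
As a cross-check, the overall accuracy and the kappa statistic can be recomputed from a confusion matrix. The sketch below uses the MLP block of Table 3 and assumes the standard definition of Cohen's kappa (observed agreement corrected by the agreement expected from the row and column totals); the function name is hypothetical.

```python
def accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# MLP block of Table 3 (before attribute selection).
mlp = [
    [216, 6, 2, 0, 19],
    [4, 193, 9, 18, 14],
    [3, 16, 220, 13, 1],
    [0, 8, 15, 196, 6],
    [20, 12, 3, 13, 193],
]
acc, kappa = accuracy_and_kappa(mlp)
print(f"{acc:.2%} {kappa:.3f}")  # 84.83% 0.810
```

Both values agree with the MLP column of Table 1.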

    After attribute selection was applied, new confusion matrices were obtained (Table 4). In terms
of misclassifications, the MLP and AODE algorithms mostly misclassified column images, RSeslibKnn
mostly misclassified vault images, while Forest PA mostly misclassified the images of vaults and columns. In
terms of accurate classifications, the MLP accurately classified almost all the images, while AODE, Forest
PA, and RSeslibKnn most accurately classified the images of altars and domes (Table 4).
    The previously described results were compared to the results obtained by using a deep learning algorithm, as
it represents the state-of-the-art methodology. For this purpose, the CNN model was developed in Python and the


                 Table 4: The confusion matrices for each algorithm, after attribute selection

                  Algorithm      Altar   Column    Dome     Gargoyle    Vault     Classified as
                                 239     0         0        0           1         altar
                                 1       237       0        0           2         column
                  MLP            0       1         238      1           0         dome
                                 0       0         0        238         2         gargoyle
                                 0       2         2        1           235       vault
                                 217     6         3        0           17        altar
                                 10      168       27       18          15        column
                  AODE           3       13        217      18          2         dome
                                 0       8         23       179         15        gargoyle
                                 28      11        0        13          189       vault
                                 201     18        1        3           20        altar
                                 6       177       22       22          11        column
                  Forest PA      1       18        210      22          2         dome
                                 2       13        26       176         8         gargoyle
                                 30      12        0        19          180       vault
                                 226     1         5        2           9         altar
                                 19      170       22       15          12        column
                  RSeslibKnn     6       13        223      10          1         dome
                                 4       10        29       175         7         gargoyle
                                 40      7         0        20          174       vault


loss and accuracy results over the epochs are plotted and presented in Figure 2. The obtained results demonstrate
a good accuracy of 93%, with a training loss of 0.21 and a validation loss of 0.42. Such results are promising, as they
demonstrate the potential of deep learning techniques. Furthermore, deep learning models allow for detailed
modifications, thus enhancing the possibility of their application to different types of problems.



                     Figure 2: The (a) accuracy and (b) loss results for the CNN model.


4   Conclusions
The aim of this paper was to compare the performance of several classification algorithms before and after
attribute selection. In particular, MLP, AODE, Forest PA, and RSeslibKnn algorithms were applied on the
dataset consisting of cultural heritage images, and their performance was compared to the performance obtained
by the deep learning algorithm (CNN). Several conclusions can be drawn from this study: (i) the MLP algorithm
obtained the best performance both before and after attribute selection, (ii) attribute selection increased the
classification accuracy for all algorithms except RSeslibKnn, and (iii) deep learning techniques such as
the CNN obtain high accuracy (93%) without the need to reduce the number of attributes. Hence, as deep learning
techniques do not require manual feature extraction from images, they represent an appropriate method of
image classification.

Acknowledgements
This work was supported by the Serbian Ministry of Education, Science and Technological Development through
the Mathematical Institute of the Serbian Academy of Sciences and Arts.

References
[IDST12]      Ivanova, K., Dobreva, M., Stanchev, P. and Totkov, G. Access to digital cultural heritage:
              Innovative applications of automated metadata generation. Plovdiv University Publishing House
              ”Paisii Hilendarski”, 2012.

[SW05]        Sim, J. and Wright, C.C. ”The kappa statistic in reliability studies: use, interpretation, and sample
              size requirements.” Physical therapy 85, no. 3 (2005): 257-268.

[LlMMZG17]    Llamas, J., Lerones, M. P., Medina, R., Zalama, E. and Gómez-García-Bermejo, J. ”Classification
              of architectural heritage images using deep learning techniques.” Applied Sciences 7, no. 10 (2017):
              992.

[KHP18]       Kambau, R.A., Hasibuan, Z.A. and Pratama, M.O. ”Classification for Multiformat Object of
              Cultural Heritage using Deep Learning.” In 2018 Third International Conference on Informatics
              and Computing (ICIC), pp. 1-7. IEEE, 2018.

[AFG15]       Amato, G., Falchi, F. and Gennaro, C. ”Fast image classification for monument recognition.”
              Journal on Computing and Cultural Heritage (JOCCH) 8, no. 4 (2015): 1-25.

[GDPR18]      Grilli, E., Dininno, D., Petrucci, G. and Remondino, F. ”From 2D to 3D supervised segmentation
              and classification for cultural heritage applications.” In ISPRS TC II Mid-term Symposium
              Towards Photogrammetry 2020, vol. 42, no. 42, pp. 399-406. 2018.

[MPAL15]      Meroño, J.E., Perea, A.J., Aguilera, M.J. and Laguna, A.M. ”Recognition of materials and damage
              on historical buildings using digital image classification.” South African Journal of Science. 111,
              no. 1-2 (2015): 01-09.

[AJ19]        Ćosović, M., Amelio, A. and Junuz, E. Classification Methods in Cultural heritage. In Proceedings of
              the 1st International Workshop on Visual Pattern Extraction and Recognition for Cultural Heritage
              Understanding co-located with 15th Italian Research Conference on Digital Libraries (IRCDL), pp.
              13-24, 2019.

[WPP02]       Won, C.S., Park, D.K. and Park, S.J. Efficient Use of MPEG-7 Edge Histogram Descriptor. ETRI
              journal 24, no. 1 (2002): 23-30.

[MD13]        More, N.K. and Dubey, S. JPEG Picture Compression Using Discrete Cosine Transform.
              International Journal of Science and Research (IJSR) 2, no. 1 (2013): 134-138.

[WFHP16]      Witten, I.H., Frank, E., Hall, M.A. and Pal, C.J. Data Mining: Practical machine learning tools
              and techniques. Morgan Kaufmann, 2016.

[PM19]        Pal, S.K. and Mitra, S. ”Multilayer perceptron, fuzzy sets, and classification.” IEEE Transactions
              on neural networks 3 (1992): 683–697.

[GD98]        Gardner, M.W. and Dorling, S. ”Artificial neural networks (the multilayer perceptron) - a review of
              applications in the atmospheric sciences.” Atmospheric environment 32, no. 14-15 (1998): 2627-2636.

[WBW05]       Webb, G.I., Boughton, J.R. and Wang, Z. ”Not so naive Bayes: aggregating one-dependence
              estimators.” Machine learning 58, no. 1 (2005): 5-24.


[AI17]    Adnan, M.N. and Islam, M.Z. ”Forest PA: Constructing a decision forest by penalizing attributes
          used in previous trees.” Expert Systems with Applications 89 (2017): 389-403.

[WL19]    Wojna, A. and Latkowski, R. Rseslib 3: Library of rough set and machine learning methods with
          extensible architecture. In Transactions on Rough Sets XXI, pp. 301–323. Springer, 2019.

[FUW06]   Fan, J., Upadhye, S. and Worster, A. ”Understanding receiver operating characteristic (ROC)
          curves.” Canadian Journal of Emergency Medicine 8, no. 1 (2006): 19-20.

