Algorithm Comparison for Cultural Heritage Image Classification Radmila Janković Mathematical Institute of the Serbian Academy of Sciences and Arts Belgrade, Serbia rjankovic@mi.sanu.ac.rs Abstract Digitization represents an important part of the development of online systems. As such it includes, among other, the deployment, catego- rization and preservation of audio, video and textual contents online. Such process is especially interesting from the perspective of cultural heritage, as it allows the long-term preservation and sharing of culture worldwide. This study observes four classification algorithms: (i) the multilayer perceptron, (ii) averaged one dependence estimators, (iii) forest by penalizing attributes, and (iv) the k-nearest neighbor rough sets and analogy based reasoning, before and after attribute classifi- cation, and compares these with the results obtained from the con- volutional neural network. The obtained results show that the best classification performance was achieved by the multilayer perceptron, followed by the convolutional neural network. 1 Introduction With an increased use of digitization in the domain of cultural heritage, it is possible to preserve and promote the cultural heritage present in every part of the world. Every country worldwide has its own practices, places, values, objects, and arts that are created throughout history, and that represent its cultural heritage. Through cultural heritage knowledge is being shared and passed on from generation to generation. Some examples of cultural heritage include photographs, historical monuments, various types of documents, archaeological sites, and other. There are three pillars of digital cultural heritage: (i) digitization focusing on conversion of objects into digital form, (ii) access to digital heritage, and (iii) long term preservation of digital objects [IDST12]. Classification is an important part of digitization as it includes building a classification model that groups new inputs into categories, based on the previously available set of data. In terms of cultural heritage, classification is particularly important because it allows the preservation of heritage for future generations. Furthermore, digitization enables promotion of cultural heritage by using innovative technologies to increase accessibility. Through digitization, a country’s cultural heritage is being promoted globally, thus contributing to cultural diversity. Different methods for cultural heritage image classification have been investigated by various authors. Deep learning algorithms were used for image classification in [LlMMZG17]. In particular, AlexNet and Inception V3 Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: A. Amelio, G. Borgefors, A. Hast (eds.): Proceedings of the 2nd International Workshop on Visual Pattern Extraction and Recognition for Cultural Heritage Understanding, Bari, Italy, 29-Jan-2020, published at http://ceur-ws.org 26 convolutional neural networks (CNNs) were used, as well as ResNet and Inception-ResNet-v2 residual networks. The dataset included architectural cultural heritage divided into 10 categories. The results showed that deep learning methods perform better than other state-of-the-art methods, particularly when dealing with complex problems [LlMMZG17]. The performance of deep learning methods has also been investigated in [KHP18], where CNNs were used to classify images, audio and video data, while Recurrent Neural Networks (RNNs) were used to classify the text belonging to the cultural heritage of Indonesia [KHP18]. It was observed that the RNN achieved the highest accuracy, while the CNN obtained good accuracy for image and video classification (76%)[KHP18]. Considering other methods, k-nearest neighbor (kNN) classification was used to classify cultural heritage images of 12 monuments and landmarks in Pisa [AFG15], but also to classify and detect alterations on historical buildings with a high accuracy (92%) [MPAL15]. Various decision tree algorithms including J48, random tree, random forest and fast random forest were investigated in [GDPR18]. Classification was performed in WEKA on a set consisting of 3D cultural heritage models, and the results showed that the fast random forest achieves the highest accuracy of 69% [GDPR18]. Different types of image classification techniques were investigated in [AJ19]. In particular, naive Bayes and Support Vector Machine (SVM) algorithms are widely used for tangible and movable cultural heritage, while the intangible cultural heritage is mostly classified using SVM, kNN, CNN, decision trees and Conditional Random Fields - Gaussian Mixture Model (CRF-GMM) algorithms [AJ19]. Tangible and immovable cultural heritage is most commonly classified using CNNs [AJ19]. The aim of this paper is to compare the performance of several classification algorithms for cultural heritage image classification, in particular: (i) the multilayer perceptron (MLP), (ii) averaged one dependence estimators (AODE), (iii) forest by penalizing attributes (Forest PA), and (iv) the k-nearest neighbor rough sets and analogy based reasoning (RSeslibKnn). The performance was observed on a full set of attributes as well as on the reduced set of attributes, and the results were compared. Furthermore, a CNN was also developed for comparison purposes, as deep learning represents a state-of-the-art technique [LlMMZG17, KHP18] and it is interesting to observe and compare its performance with the performance of the other algorithms used in this study. This paper is organized as follows. Section 2 explains the data and methodology used in this research, while Section 3 presents the results and discussion. Finally, Section 4 contains concluding remarks. 2 Data and methodology 2.1 Data The cultural heritage image classification was performed on a public dataset created by [LlMMZG17] and obtained from Datahub (https://old.datahub.io/dataset/architectural-heritage-elements-image-dataset). The dataset consists of 10,235 images of size 128 × 128 pixels, but for the purpose of this study, 4,000 im- ages from 5 out of 10 categories were randomly chosen. These images include altars, gargoyles, domes, columns, and vaults (Figure 1). 2.2 Methodology The experiments were performed in WEKA (Waikato Environment for Knowledge Analysis) [WFHP16], a free data mining software based on Java, while the CNN model was developed in Python v.3.7 with the use of the Keras library. The experiments were performed on a Windows machine with a 2.3 GHz processor and 8 GB of RAM. In Weka, the feature extraction is performed using feature extraction algorithms integrated inside the im- ageFilters package. Three types of filters were applied on the dataset: (i) the edge histogram, (ii) the color layout, and (iii) the JPEG coefficients. Feature extraction for the CNN was not performed manually, as Python automatically scans and extracts features from the dataset. The edge histogram extracts the MPEG7 edge features from the images. In particular, it detects the directions of edges in images based on the changes in frequency and brightness [WPP02]. There are five types of edges: vertical, horizontal, 45-degree diagonal, 135-degree diagonal, and non-directional [WPP02]. The color layout feature extraction is performed by dividing the image into 64 blocks and calculating the average color for each block, using the color layout filter in WEKA. Finally, the JPEG coefficients were extracted by splitting the image based on different frequencies, hence keeping only the most important frequencies [MD13]. After feature extraction, the dataset consisted of 307 attributes. The first results were generated on the full set of attributes, while the second results were generated on the attribute-reduced set of data in order to evaluate the change in performance before and after attribute selection. Feature selection was performed in WEKA using 27 (a) (b) (c) (d) (e) Figure 1: Example of images used in this study for each of the five classes, in particular: (a) altar, (b) gargoyle, (c) column, (d) vault, and (e) dome. the AttributeSelection filter. The attribute search was performed using the best-first method, and attribute evaluation using the CFS subset evaluator. After feature selection, the number of attributes was reduced to 89. The dataset was divided into 70% of images for training and 30% for testing the algorithms. Four algorithms were tested and compared for the purpose of image classification: (i) MLP, (ii) Forest PA, (iii) AODE, and (iv) RSeslibKnn. Additionally, in order to compare the performance of these algorithms with the state-of-the-art techniques, a deep learning CNN model was developed and tested. The MLP network consists of three layers: an input layer, a hidden layer, and an output layer. The layers consist of neurons, where the neurons in one layer are connected to the neurons in the next layer [PM19]. The MLP uses a nonlinear activation function, making it suitable for different types of problems, without any assumptions regarding data distribution [GD98]. The AODE is a classification technique that calculates the probability of each class and creates a set of one dependence probability distribution estimators [WBW05]. Is holds less rigorous independence assumptions than naive Bayes, hence it is suitable for various range of problems. Forest PA is a relatively new decision forest algorithm developed in 2017 by [AI17] in order to overcome the limitations of the random forest algorithm. The algorithm works in such a way that it uses the full set of attributes for forest creation, but it also assigns weights to those attributes that already participated in the previous decision tree. It generates the bootstrap sample from the training set and creates a decision tree using the attribute weights [AI17]. RSeslibKnn is the k-nearest neighbor classifier that uses a fast neighbor search, thus making it appropriate for using on large datasets [WL19]. A distance measure is calculated based on the weighted sum of distances, which results in creation of the indexing tree [WL19]. The classification is performed by finding the k nearest neighbors in the set. The CNN is a neural network consisting of standard type of layers, as well as of convolution and pooling layers. It is widely used for image processing, as it is able to automatically scan and extract features from images. In particular, each CNN is composed of convolutional layers followed by the pooling layers that reduce the dimensionality of the data. The features extracted through these layers are then transformed into a vector using the flattening layer, and the obtained vector is further distributed to a dense layer, thus forming a fully-connected network. Parameter configuration The MLP consisted of one hidden layer with 50 neurons, a learning rate of 0.5, a momentum of 0.6, and a batch size of 32. The sigmoid activation function was utilized in the hidden and output layers. All attributes were standardized. The RseslibKnn included the the city and simple value difference as a distance measure, the inverse square distance as the voting method, and the distance based weighting method. The AODE was 28 configured with the default WEKA parameters. Lastly, the Forest PA included 30 trees in the forest. The CNN configuration involved four convolution layers with 32 neurons in the first two layers, 64 neurons in the last two convolution layers, one hidden layer with 128 neurons, and an output layer with 5 neurons. The convolutional and hidden layers used the hyperbolic tangent activation function, while the output layer used the softmax activation. The kernel size of the convolutional layers was set to 3 × 3, while the pooling size in the pooling layer was set to 2 × 2. The dropout was set to 0.2. Lastly, in order to avoid overfitting, the early stopping regularization parameter was used with patience set to 3. The number of epochs was set to 50, but the training stopped after 18 epochs because there was no further improvement in the accuracy. The model used 80% of data for training and the rest of the data for validation. Evaluation Metrics The classification performance can be evaluated using several metrics. In particular, this study observed and analyzed the values of correctly classified instances, precision, recall, F-score, kappa statistics, and ROC area. The percentage of correctly classified instances represents the instances that are correctly classified by the algorithm, while precision shows the fraction of instances that belong to the observed class among the total number of instances that are classified into the observed class by the algorithm. Recall represents the true positive rate of prediction, while the F-measure shows the classification accuracy based on the average value of precision and recall. The F-measure values should ideally be closer to 1 indicating a better classification accuracy. Kappa is the measure of agreement and can have values in an interval of 0–1, with the values in the range 0.81–1 representing an almost perfect agreement, while values close to 0 represent poor agreement [SW05]. Lastly, the ROC area represents the ratio of the true positives and the false positives, and its value should be close to 1, indicating a perfect prediction [FUW06]. 3 Results All algorithms used in this study were first tested on the full dataset consisting of 307 attributes. The best performing algorithm in this case is the MLP with 85% of accuracy obtained, followed by the RSeslibKnn, AODE, and Forest PA with 82%, 79% and 78%, respectively (Table 1). Table 1: Algorithm performance before attribute selection Algorithm MLP Forest PA AODE RSeslibKnn Correctly classified instances 84.83% 77.92% 79.25% 82.17% Kappa statistics 0.810 0.724 0.741 0.777 Precision 0.849 0.778 0.793 0.824 Recall 0.848 0.779 0.793 0.822 F-measure 0.848 0.778 0.791 0.820 ROC Area 0.974 0.954 0.959 0.888 Running time (in seconds) 1038.09 54.02 122.72 93.14 The MLP also obtained the best values in terms of the kappa statistics, precision, recall, and the F-measure, comparing to other three algorithms (Table 1). In terms of the running time, the fastest algorithm is the Forest PA with 54.02 seconds. After observing the results obtained from the full set of data, the next step involved observing the perfor- mance of the algorithms on the reduced set of attributes. The MLP algorithm again performed the best, correctly classifying 98.9% of instances (Table 2). Other algorithms obtained lower classification accuracy, in particular 80.83%, 80.67% and 78.67% for the AODE, RSeslibKnn and Forest PA, respectively. Observing other perfor- mance measures, the MLP also obtained the highest value of kappa statistics (0.986), followed by AODE (0.760), RSeslibKnn (0.758), and Forest PA (0.733). As described before in this paper, these values indicate a substantial to almost perfect agreement [SW05]. Moreover, the MLP obtained the highest value of the F-measure (0.986), followed by RSeslibKnn (0.811), AODE (0.807), and Forest PA (0.797). Lastly, observing the value of the ROC area, the results indicate the MLP algorithm performed the best with an obtained value of 0.996, followed by AODE, Forest PA and RSeslibKnn with ROC area values of 0.965, 0.959 and 0.879, respectively. These results show a good classification power of the observed algorithms, with MLP and AODE performing the best (Table 29 2), but it should be noted that the MLP algorithm requires much longer running time than the other three algorithms. Table 2: Algorithm performance after attribute selection Algorithm MLP Forest PA AODE RSeslibKnn Correctly classified instances 98.9% 78.67% 80.83% 80.67% Kappa statistics 0.986 0.733 0.760 0.758 Precision 0.989 0.787 0.808 0.811 Recall 0.989 0.787 0.808 0.807 F-measure 0.986 0.797 0.807 0.805 ROC Area 0.996 0.959 0.965 0.879 Running time (in seconds) 892.02 62.01 115.17 109.27 In order to gain more insights about the power of the observed classification algorithms, classification matri- ces are generated (Table 3). The classification matrix shows the number of correctly (and incorrectly) classified instances by class, where the numbers in the diagonal represent accurate classifications. Before attribute selec- tion, the algorithms mostly miss-classified images of gargoyle, column and vault. In particular, the MLP most accurately classified the dome images, while the highest number of miss-classifications for the MLP algorithm is observed for the vault and column images. The AODE most accurately classified altar images, while Forest PA most correctly classified the dome images. The RSeslibKnn classified altar images most accurately, while the highest number of wrongly classified instances is observed mainly for the column images (Table 3). Table 3: The confusion matrices for each algorithm, before attribute selection Algorithm Altar Column Dome Gargoyle Vault Classified as 216 6 2 0 19 altar 4 193 9 18 14 column MLP 3 16 220 13 1 dome 0 8 15 196 6 gargoyle 20 12 3 13 193 vault 217 9 2 1 14 altar 13 163 27 19 16 column AODE 2 13 213 21 4 dome 0 7 25 174 19 gargoyle 32 7 1 17 184 vault 206 14 1 2 20 altar 7 167 27 21 16 column Forest PA 2 17 215 14 5 dome 1 12 26 170 16 gargoyle 29 13 0 22 177 vault 231 4 2 0 6 altar 19 169 17 15 18 column RSeslibKnn 8 13 224 8 0 dome 3 12 29 174 7 gargoyle 33 6 2 12 188 vault After attribute selection has been applied, the new confusion matrix has been obtained (Table 4). In terms of miss-classifications, the MLP and AODE algorithms mostly miss-classified column images, the RSeslibKnn mostly miss-classified vault images, while Forest PA mostly miss-classified the images of vaults and columns. In terms of accurate classifications, the MLP accurately classified almost all the images, while the AODE, Forest PA and RSeslibKnn most accurately classified the images of altar and dome (Table 4). The previously described results were compared to the results obtained by using a deep learning algorithm, as it represents the state-of-the-art methodology. For this purpose, the CNN model was developed in Python and the 30 Table 4: The confusion matrices for each algorithm, after attribute selection Algorithm Altar Column Dome Gargoyle Vault Classified as 239 0 0 0 1 altar 1 237 0 0 2 column MLP 0 1 238 1 0 dome 0 0 0 238 2 gargoyle 0 2 2 1 235 vault 217 6 3 0 17 altar 10 168 27 18 15 column AODE 3 13 217 18 2 dome 0 8 23 179 15 gargoyle 28 11 0 13 189 vault 201 18 1 3 20 altar 6 177 22 22 11 column Forest PA 1 18 210 22 2 dome 2 13 26 176 8 gargoyle 30 12 0 19 180 vault 226 1 5 2 9 altar 19 170 22 15 12 column RSeslibKnn 6 13 223 10 1 dome 4 10 29 175 7 gargoyle 40 7 0 20 174 vault loss and accuracy results through epochs are plotted and presented in Figure 2. The obtained results demonstrate good accuracy of 93%, with training loss of 0.21 and validation loss of 0.42. Such results are promising as they clearly demonstrate the potential of deep learning techniques. Furthermore, deep learning allows for detailed modifications, thus enhancing the possibility of their application to different types of problems. (a) (b) Figure 2: The (a) accuracy and (b) loss results for the CNN model. 4 Conclusions The aim of this paper was to compare the performance of several classification algorithms before and after attribute selection. In particular, MLP, AODE, Forest PA, and RSeslibKnn algorithms were applied on the dataset consisting of cultural heritage images, and their performance was compared to the performance obtained by the deep learning algorithm (CNN). Several conclusions can be drawn from this study: (i) the MLP algorithm obtained the best performance both before and after attribute selection, (ii) attribute selection increases the classification accuracy for all algorithms except the RSeslibKnn, and (iii) the deep learning techniques such as 31 the CNN, obtain higher accuracy without needing to reduce the number of attributes. Hence, as deep learning techniques do not require to manually extract features from images, they represent an appropriate method of image classification. Acknowledgements This work was supported by the Serbian Ministry of Education, Science and Technological Development through Mathematical Institute of the Serbian Academy of Sciences and Arts. References [IDST12] Ivanova, K., Dobreva, M., Stanchev, P. and Totkov, G. Access to digital cultural heritage: Innova- tive applications of automated metadata generation. Plovdiv University Publishing House ”Paisii Hilendarski”, 2012. [SW05] Sim, J. and Wright, C.C. ”The kappa statistic in reliability studies: use, interpretation, and sample size requirements.” Physical therapy 85, no. 3 (2005): 257-268. [LlMMZG17] Llamas, J., Lerones, M. P., Medina, R., Zalama, E. and Gómez-Garcı́a-Bermejo, J. ”Classification of architectural heritage images using deep learning techniques.” Applied Sciences 7, no. 10 (2017): 992. [KHP18] Kambau, R.A., Hasibuan, Z.A. and Pratama, M.O. ”Classification for Multiformat Object of Cultural Heritage using Deep Learning.” In 2018 Third International Conference on Informatics and Computing (ICIC), pp. 1-7. IEEE, 2018. [AFG15] Amato, G., Falchi, F. and Gennaro, C. ”Fast image classification for monument recognition.” Journal on Computing and Cultural Heritage (JOCCH) 8, no. 4 (2015): 1-25. [GDPR18] Grilli, E., Dininno, D., Petrucci, G. and Remondino, F. ”From 2D to 3D supervised segmenta- tion and classification for cultural heritage applications.” In ISPRS TC II Mid-term Symposium Towards Photogrammetry 2020, vol. 42, no. 42, pp. 399-406. 2018. [MPAL15] Meroño, J.E., Perea, A.J., Aguilera, M.J. and Laguna, A.M. ”Recognition of materials and damage on historical buildings using digital image classification.” South African Journal of Science. 111, no. 1-2 (2015): 01-09. [AJ19] osovi, M., Amelio, A. and Junuz, E. Classification Methods in Cultural heritage. In Proceedings of the 1st International Workshop on Visual Pattern Extraction and Recognition for Cultural Heritage Understanding co-located with 15th Italian Research Conference on Digital Libraries (IRCDL, pp. 13-24, 2019. [WPP02] Won, C.S., Park, D.K. and Park, S.J. Efficient Use of MPEG-7 Edge Histogram Descriptor. ETRI journal 24, no. 1 (2002): 23-30. [MD13] More, N.K. and Dubey, S. JPEG Picture Compression Using Discrete Cosine Transform. Inter- national Journal of Science and Research (IJSR) 2, no. 1 (2013): 134-138. [WFHP16] Witten, I.H., Frank, E., Hall, M.A. and Pal, C.J. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016. [PM19] Pal, S.K. and Mitra, S. ”Multilayer perceptron, fuzzy sets, and classification.” IEEE Transactions on neural networks 3 (1992): 683–697. [GD98] Gardner, M.W. and Dorling, S. ”Artificial neural networks (the multilayer perceptron)a review of applications in the atmospheric sciences.” Atmospheric environment 32, no. 14-15 (1998): 2627- 2636. [WBW05] Webb, G.I., Boughton, J.R. and Wang, Z. ”Not so naive Bayes: aggregating one-dependence estimators.” Machine learning 58, no. 1 (2005): 5-24. 32 [AI17] Adnan, M.N. and Islam, M.Z. ”Forest PA: Constructing a decision forest by penalizing attributes used in previous trees.” Expert Systems with Applications 89 (2017): 389-403. [WL19] Wojna, A. and Latkowski, R. Rseslib 3: Library of rough set and machine learning methods with extensible architecture. In Transactions on Rough Sets XXI, pp. 301–323. Springer, 2019. [FUW06] Fan, J., Upadhye, S. and Worster, A. ”Understanding receiver operating characteristic (ROC) curves.” Canadian Journal of Emergency Medicine 8, no. 1 (2006): 19-20. 33