Deep Learning for Archaeological Object Detection in Airborne Laser Scanning Data

Bashir Kazimi, Leibniz University of Hannover, Institute of Cartography and Geoinformatics, kazimi@ikg.uni-hannover.de
Frank Thiemann, Leibniz University of Hannover, Institute of Cartography and Geoinformatics, thiemann@ikg.uni-hannover.de
Katharina Malek, Lower Saxony State Service for Cultural Heritage, Mining Archaeology, katharina.malek@nld.niedersachsen.de
Monika Sester, Leibniz University of Hannover, Institute of Cartography and Geoinformatics, sester@ikg.uni-hannover.de
Kourosh Khoshelham, University of Melbourne, Department of Infrastructure Engineering, k.khoshelham@unimelb.edu.au

Abstract

It is important to preserve archaeological monuments, as they play a key role in helping us understand human history and accomplishments in times with little or no written sources. The first step toward this goal is an efficient method for collecting and documenting information about the objects of interest to archaeologists. Airborne laser scanning (ALS) is of great use for collecting and documenting detailed measurements of an area of interest. However, it is time consuming for scientists to manually analyze the collected ALS data. One way to automate this process is to use deep neural networks. In this work, we propose a hierarchical Convolutional Neural Network (CNN) model to classify archaeological objects in ALS data. The data is acquired from the Harz mining region in Lower Saxony, which has a high density of archaeological monuments, including the UNESCO World Heritage Site comprising the Historic Town of Goslar, the Mines of Rammelsberg, and the Upper Harz Water Management System. To compare and validate our method, we run experiments on the same data set using two existing deep learning models. The first model is VGG-16, an image classification network pretrained on ImageNet data. The second model is a stacked autoencoders model.
The classification results analyzed in this paper show that our model is well suited to this task: it achieves the best classification accuracy of around 91 percent, compared to 88 percent and 82 percent for the pretrained and stacked autoencoders models, respectively.

2012 ACM Subject Classification: F.1.1 Models of Computation
Keywords and phrases: Archaeology, Deep Learning, Object Detection
Digital Object Identifier: 10.4230/LIPIcs.COARCH.2018.

Footnote 1: Some parts of this work have been carried out during a research visit at the University of Melbourne, Department of Infrastructure Engineering.
Footnote 2 (ImageNet): http://image-net.org/about-overview

© Bashir Kazimi, Frank Thiemann, Katharina Malek, Monika Sester and Kourosh Khoshelham; 2nd Workshop On Computing Techniques For Spatio-Temporal Data in Archaeology And Cultural Heritage. Editors: A. Belussi, R. Billen, P. Hallot, S. Migliorini.

1 Introduction

Archaeology studies the physical remains of objects in order to understand human history and culture in times with little or no written sources. It offers a glance at the lives of people in the past and at how things have changed through time. While history uses written records to analyze events and explain human life, archaeology investigates what humans made and left behind, and makes sense of how, where, and when they lived. Places with physical remains of past human activity are called archaeological sites. Archaeological sites can be of different types, such as settlements, cemeteries, or depositions, and they may or may not be visible above ground. Studying these materials informs us of unrecorded human history, and thus it is important to protect and conserve archaeological sites. To do so, it is necessary to first identify and document such sites.
Airborne laser scanning (ALS) is an efficient remote sensing technique for collecting 3D data over large areas by measuring the range and reflectance of objects on the terrain surface. It can be used to capture data from archaeological sites, but the data must then be manually analyzed by archaeologists for correct identification and localization of archaeological objects. This is very time-consuming and calls for an automated process. Deep learning has shown great potential in automating processes in many applications and has outperformed classical machine learning methods. It has performed well in image classification, machine translation, speech recognition, image captioning, healthcare applications, prediction of events, and many more [1]. It can work with 2D image data [2], 3D point cloud data [3], and hyperspectral image data [4], and can interpret sequence-to-sequence data [5]. Among deep learning models, convolutional neural networks (CNNs) have proven to work well on image data [2, 6], while stacked autoencoders are more suitable for learning features from the input data and encoding them into a compressed vector, from which a decoder learns to regenerate the original input [7]. Deep learning models usually require a lot of training data to generalize well. In applications where the amount of training data is not sufficient, transfer learning can be applied. Transfer learning refers to training a model on a large set of training data first, and then using the pretrained model as a feature extractor for a new application with a small dataset [8]. There are many models pretrained on ImageNet data, e.g. Xception [9], VGG Net [10], ResNet [11], and Inception [12], which have been shown to perform well as feature extractors for other image-based domains. In this work, we build a hierarchical CNN model to detect objects in archaeological sites using digital terrain models (DTMs) generated from ALS data.
The data is collected from the Harz mining region in Lower Saxony, Germany. The objects to be detected are archaeological objects such as hollow ways, streams, pathways, lakes, streets, ditches, heaps, mining shafts, and more, but for this study, our model is fit to detect 4 classes: natural streams, lakes, tracks, and an 'others' class representing the remaining objects, for which enough labeled data is not yet available. We also implement and use a stacked autoencoders model and the pretrained VGG16 model to compare the performance of our model. The rest of the paper is arranged as follows. Section 2 reviews previous work on deep learning for object detection. Section 3 gives details of our model, explains the stacked autoencoder model, and describes the pretrained VGG16 model used for comparison. Section 4 presents the experiments performed and the results obtained, comparing our model with the other models used in this study. Finally, Section 5 summarizes this work and points out open future directions in this line of research.

2 Related Work

Deep learning has been successfully used in many applications. Among the deep learning methods, recurrent neural networks (RNNs) are good at dealing with sequential data, as they take temporal information into account. RNNs have been applied in speech recognition to map acoustic sequences to phonetic sequences [13]. RNNs have also been used in natural language processing to translate text from one language to another [14, 15, 16]. Another prominent method in deep learning is convolutional neural networks (CNNs). CNNs take into account the spatial correlation among data points and hence perform well on image-based data. CNNs have been used in image classification [2], video classification [17], face recognition [18], scene labelling [19], action recognition [20], pose estimation [21], and more.
Autoencoders are neural networks that take an input, map it to a latent space using a hidden layer, and reconstruct the original input through the final output or decoding layer. Stacked autoencoders are deep networks made of multiple autoencoders stacked together. Stacked autoencoders have been applied in many applications: feature extraction and classification of hyperspectral data [22, 4], internet traffic prediction [23], classification and diagnosis of Parkinson's disease [24], and more. Even though deep learning models perform well in most applications, they are highly dependent on large amounts of (usually labeled) data. In domains where sufficient data is not available, transfer learning can be applied. First, a deep neural model is trained on a domain with a large amount of labeled data, and the learned parameters are saved. These parameters can then be transferred and used in a domain with a smaller amount of data for feature extraction. The extracted features are used to train a classifier directly, or in some cases, the transferred weights are further tuned for the specific application [8]. There are many other deep learning methods with various fields of application, but they are beyond the scope of this research. As in other domains, deep learning methods have gained popularity in the field of remote sensing, for tasks such as scene classification in satellite imagery [25], object detection in optical remote sensing images [26], 3D shape reconstruction [27], and more. Example applications of deep learning methods specifically on laser scanning data include tree classification [28], road manhole cover detection [29], and road marking detection [30]. Hu and Yuan [31] used CNNs to extract DTMs and filter out non-ground points from ALS data, and claim that their method performs better than previous filtering methods.
Generally, deep learning models are designed to work with image data or 3D point clouds in which the example objects have a complete 3D shape. In some cases, deep learning models need to work with height information or DTMs extracted from ALS data, as in the work of Politz et al. [32]. In this study, we also work with ALS data. To leverage deep learning, specifically convolutional neural networks, we build a hierarchical CNN model inspired by the work of Palafox et al. [33], and preprocess our ALS point cloud data to generate a DTM and image-like inputs, 2.5D height maps, to train our model and detect objects in archaeological sites.

3 Approach

To detect objects in archaeological sites, we propose a two-step approach. The first step is to train classifier models that take height map data of size n x n as input and assign a label to it. The second step is the detection process, in which the input is height map data of size h1 x h2, where h1, h2 >> n, and the output is a heat map of size h1 x h2 for each class, showing the probability of the objects at each location.

Algorithm 1 Selecting optimal values of n for the input size n x n
1: nList: list of possible values of n, in this research {8, 10, ..., 98, 100}
2: num_optimal: number of optimal values to select (= 5 in this research)
3: trainData: DTM showing height maps from the training region.
4: testData: DTM showing height maps from the test region.
5: procedure TopNselection(nList, num_optimal, trainData, testData)
6:   models ← ∅
7:   accuracies ← ∅
8:   losses ← ∅
9:   topN ← ∅
10:  for each item n in nList do
11:    models[n] ← model trained with input size n x n clipped from trainData
12:    accuracies[n] ← accuracy of models[n] on testData
13:    losses[n] ← loss of models[n] on testData
14:  sortedList1 ← nList sorted based on accuracies in decreasing order
15:  sortedList2 ← nList sorted based on losses in increasing order
16:  i ← 1
17:  while true do
18:    first_i_acc ← first i elements in sortedList1
19:    first_i_loss ← first i elements in sortedList2
20:    common_Ns ← list of common numbers in first_i_acc and first_i_loss
21:    for each item c in common_Ns do
22:      if c is not already in topN then
23:        append c to topN
24:    if i > length of nList then
25:      break
26:    i ← i + 1
27:  return topN[:num_optimal]  # first num_optimal elements

Algorithm 2 The detection process
1: classifier_n: classifier model trained with input size n x n
2: H: input matrix of size h1 x h2, showing height maps from the region of interest.
3: M: number of classes the classifier model is trained to detect.
4: procedure DetectObjects(classifier_n, H, M)
5:   O ← 0  # a matrix of zeros of shape (M, h1, h2)
6:   for each item i in range(0, h1 - n + 1) do
7:     for each item j in range(0, h2 - n + 1) do
8:       currentPatch ← H[i : i + n, j : j + n]
9:       m ← class predicted by classifier_n for currentPatch
10:      O[m, i : i + n, j : j + n] += ones(n, n)  # increment by 1
11:  return O  # a heat map of size h1 x h2 for each class

Figure 1 Detection process. Heat map matrices O are initialized to 0. Sliding one pixel at a time, patches of size n x n are extracted from the DTM H and fed to the classifier. Based on the class label predicted by the classifier, the pixels at the same location in the matrix O of the predicted class are incremented by 1.
At the end of the process, each object class 1 to M has its corresponding heat map O1 to OM. Since the objects to be detected are of different sizes and granularities (sometimes even within a single object class), we first investigate and select 5 optimal sizes for the input to the classifier networks. The 5 optimal values of n for the input size are selected using Algorithm 1. After selecting these 5 values, the 5 corresponding trained classifiers are used in the second step, the detection process. The detection process takes a big matrix of size h1 x h2, height map data from the region of interest, as input and returns one heat map of size h1 x h2 for each class, indicating the probability of objects at each location of the heat map. The detection is performed using Algorithm 2 and illustrated in Figure 1. To this end, we build a hierarchical CNN model and compare it with two existing deep learning methods: the pretrained VGG16 model available in the Keras API [34], and a stacked autoencoders model implemented by ourselves. The following subsections describe the three models.

3.1 Our CNN Model

Our CNN models are multi-class classifiers consisting of 4 convolutional layers, each of which is followed by a max pooling function and a ReLU operation, defined as

ReLU(x) = max(x, 0) (1)

At the end, there is one fully connected layer with a softmax function outputting the probabilities for each class label. The structure of the CNN models is depicted in Figure 2. For this study, we train multiple CNN models that take as input matrices of size n x n, with n taking even values between 8 and 100, inclusive. Once the models are trained, we check their performance based on the accuracy and loss obtained on the test set and choose the best 5 for the second step, using Algorithm 1.
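The classifier of Section 3.1 might be sketched in Keras as follows. The filter counts and kernel sizes are our own assumptions, as the paper does not list them; pooling uses "same" padding so that even the smallest input sizes survive four pooling stages:

```python
# Sketch of the Section 3.1 classifier: 4 convolutional layers, each
# followed by max pooling and ReLU, then one fully connected softmax
# layer over the 4 classes. Filter counts/kernel sizes are illustrative.
import numpy as np
from tensorflow.keras import layers, models

def build_classifier(n, num_classes=4):
    model = models.Sequential([
        layers.Input(shape=(n, n, 1)),            # single-channel height map
        layers.Conv2D(16, 3, padding="same"),
        layers.MaxPooling2D(2, padding="same"), layers.ReLU(),
        layers.Conv2D(32, 3, padding="same"),
        layers.MaxPooling2D(2, padding="same"), layers.ReLU(),
        layers.Conv2D(64, 3, padding="same"),
        layers.MaxPooling2D(2, padding="same"), layers.ReLU(),
        layers.Conv2D(64, 3, padding="same"),
        layers.MaxPooling2D(2, padding="same"), layers.ReLU(),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),  # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such model would be trained per candidate input size n (even values from 8 to 100) before the top-5 selection of Algorithm 1.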
Figure 2 Structure of our CNN model. The model takes an input matrix of shape n x n and outputs a class label for it.

3.2 Pretrained VGG16 Model

VGG16 is a deep convolutional neural network for object recognition developed by the Visual Geometry Group at Oxford [10]. The model has been trained on ImageNet data, a database of images in which each image is labeled as one of a thousand classes consisting of concepts included in the WordNet hierarchy. The learned weights have been made available online, making it possible for various deep learning frameworks, such as Keras, TensorFlow, and Caffe, to include an implementation of the model along with the pretrained weights. We use the Keras implementation of this model. The Keras implementation allows input images with height and width between 48 and 224, and 3 input channels. Since the input data in our study is height map data with a single channel, we triple the number of channels by copying the same values of our single-channel input data to create 3-channeled inputs. The final classification layer of the model is replaced with our own layer, a fully connected layer followed by a softmax function classifying 4 object classes. The structure of the modified VGG16 model is depicted in Figure 3. The model is fine-tuned for our task, keeping the weights of the original VGG16 model fixed and only training the weights of the newly added fully connected layer. Otherwise, the same two-step approach is taken for this model.

Figure 3 Structure of the modified pretrained VGG16 model. Instead of the 1000 classes of the original model, we classify 4 object classes, hence the modification in the final layer.

We train the model with input sizes n x n, with n taking even values between 48 and 100, inclusive.
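The fine-tuning setup described above (frozen VGG16 base, new 4-class softmax head) might be sketched in Keras as follows; the function name and the `weights` parameter are our own additions for illustration:

```python
# Sketch of the modified VGG16 of Section 3.2: the pretrained
# convolutional base is frozen and only a new 4-class head is trained.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg16_classifier(n, num_classes=4, weights="imagenet"):
    # include_top=False drops the original 1000-class classifier.
    base = VGG16(weights=weights, include_top=False, input_shape=(n, n, 3))
    base.trainable = False            # keep the pretrained weights fixed
    return models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),  # new final layer
    ])
```

With `weights="imagenet"`, Keras downloads the pretrained ImageNet weights; only the final dense layer receives gradient updates during training.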
The range of values for n starts from 48, rather than 8, since 48 is the smallest possible input size for the pretrained model. After selecting the top 5 models based on accuracy and loss on the test data using Algorithm 1, similar to our own model, we scan the area of interest and generate heat maps of each object class for each model using Algorithm 2.

3.3 Stacked Autoencoders (SAE) Model

The stacked autoencoders model takes as input height map data of size n x n, learns to encode it into a compressed, smaller matrix, and is trained to decode and regenerate the original input, a height map of size n x n. The encoder consists of 3 convolutional layers, each followed by a ReLU operation and a max pooling operation. Then follows a fully connected layer whose output is the code, used by the subsequent decoder. The decoder consists of 4 convolutional layers, each with a ReLU operation and an upsampling function. The output of the decoder has the same size as the original height map data given as input to the encoder. The structure of the SAE model is illustrated in Figure 4. After training, only the encoder is used to extract features from the data. The training data is given to the encoder as input; the resulting compressed vectors, referred to as the code, are then used as features to train a simple neural network to classify the 4 classes. The simple classifier network that we have implemented consists of a fully connected layer with a softmax function. Similar to the previous two methods, we try multiple values of n between 48 and 100, inclusive. The final top 5 models are then used for object detection in the same manner as explained before.

Figure 4 Structure of the SAE model, trained to take an input of size n x n and regenerate it.

4 Experiments and Results

Our region of study is the western part of the Harz Mountains in the south of Lower Saxony, Germany.
The Upper Harz was once one of the most important mining regions in Germany. The major products of its mines were silver, copper, lead, iron, and zinc. The area is home to the Upper Harz Water Management System, which together with the Mines of Rammelsberg and the Historic Town of Goslar was declared a UNESCO World Heritage Site. The Upper Harz Water Management System is one of the largest and most important historic mining water management systems in the world, where dammed ponds, ditches, tunnels, and drainage adits were built for collecting, diverting, and storing surface runoff water during the 16th-19th centuries. The area is also covered with plenty of mining relics such as mine shafts, hollow ways, medieval heaps, and more. ALS data is acquired from the region of interest, and known objects are digitized using the ArcGIS software. Since our model and the two other methods use 2D data as input, we first create a DTM with a 1-meter resolution for the area of interest using the ALS point cloud data. Afterwards, the digitized labels are used to buffer known objects and generate matrices of size 100 x 100 with the corresponding object label, in order to create training and test datasets for our models. In this study, our labeled data contains four object classes: streams, tracks, lakes, and an 'others' label representing everything else in the dataset. Streams and tracks are labeled using continuous, linear-shaped buffers, while lakes are labeled using closed areal buffers. Areas excluding the three known object classes are all labeled as 'others'. The generated matrices contain height values and are used as input to our model and the SAE model. For the pretrained VGG16 model, the matrices are tripled to create 3-channeled input data of the same width and height from the 1-channel height map data.
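The channel-tripling step described above might be sketched with NumPy as follows (the function name is ours):

```python
import numpy as np

def triple_channels(height_map):
    """Replicate a single-channel (H, W) height map into an (H, W, 3)
    array, so it matches the 3-channel input expected by VGG16."""
    return np.repeat(height_map[..., np.newaxis], 3, axis=-1)
```

Each of the three channels carries identical height values, so no information is added or lost; the copy only satisfies the pretrained model's input shape.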
The resulting 3-channeled matrices are then used for training and testing the pretrained models. To create training and test data for models with input size n smaller than 100, matrices of size n x n are cropped from the center of the original matrices generated earlier and used by the corresponding model. A small part of the region of study with an overlay of the classes used in this study is shown in Figure 5, and the statistics for the generated training and test data are shown in Table 1.

Figure 5 Test area from the region of study, a 2 x 2 km area in the Harz Mountains region in Lower Saxony: (a) original DTM, (b) DTM with hillshading, (c) digitization of the classes in ArcGIS.

Table 1 Dataset statistics
                      Train           Test
Number of examples    20,000 (80 %)   5,000 (20 %)

To allow a fair comparison among the three models, care has been taken to use configurations that are as similar as possible. Some of the hyperparameters used while training the models are listed in Table 2. As explained in Section 3, the first method was explored using input sizes n ranging from 8 to 100, but since 48 is the smallest possible value of n for the pretrained VGG16 model, the range of values used for this model and the SAE model is 48 to 100.

Table 2 Settings used for the three models during training. lr stands for learning rate.
          Epochs   Batch size   Range of values for n   Optimizer
Our CNN   10       32           8-100                   Adam (lr = 0.001)
SAE       10       32           48-100                  Adam (lr = 0.001)
VGG16     10       32           48-100                  Adam (lr = 0.001)

The 5 optimal filter sizes for each model are different. Therefore, to compare the models, we extend our previous method of choosing optimal filter sizes, Algorithm 1, to select 5 common filter sizes for all three models. The new method for selecting the optimal filter sizes n common to the three models is shown in Algorithm 3.
Algorithm 3 Selecting optimal values of n for the three models in common
1: nList: list of possible values of n, in this section {48, 50, ..., 98, 100}
2: num_common: number of common optimal values to select (= 5 in this research)
3: trainData: DTM showing height maps from the training region.
4: testData: DTM showing height maps from the test region.
5: procedure CommonNselection(nList, num_common, trainData, testData)
6:   len_nList ← length of nList  # passed to Algorithm 1 as num_optimal instead of 5
7:   optimalsCNN ← TopNselection(nList, len_nList, trainData, testData)  # with our CNN
8:   optimalsVGG16 ← TopNselection(nList, len_nList, trainData, testData)  # with VGG16
9:   optimalsSAE ← TopNselection(nList, len_nList, trainData, testData)  # with the SAE
10:  topN ← ∅
11:  i ← 1
12:  while true do
13:    i_cnn ← first i elements in optimalsCNN
14:    i_vgg ← first i elements in optimalsVGG16
15:    i_sae ← first i elements in optimalsSAE
16:    common_Ns ← common numbers in i_cnn, i_vgg, and i_sae
17:    for each item c in common_Ns do
18:      if c is not already in topN then
19:        append c to topN
20:    if i > length of nList then
21:      break
22:    i ← i + 1
23:  return topN[:num_common]  # first num_common elements

In this experiment, the 5 common optimal filter sizes and the results obtained by the corresponding models are listed in Table 3. As the filter size grows, VGG16 generalizes better, while our model and the SAE lose generalization ability. This is visible in the loss values obtained by the models on the test set. Confusion matrices for the smallest and biggest optimal filters of each model are shown in Table 4. The evaluation metrics in Table 3 and the confusion matrices in Table 4 are based on the test data set, which, like our training data set, was located, labeled, and verified by archaeology experts.
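The intersection-growing selection at the heart of Algorithms 1 and 3 can be sketched as follows, assuming the per-size test metrics have already been computed (function and variable names are ours):

```python
def top_n_selection(n_list, accuracies, losses, num_optimal=5):
    """Select input sizes that rank well on BOTH test accuracy and test
    loss: grow a prefix of each ranking until enough sizes appear in both
    (the intersection step of Algorithm 1)."""
    by_acc = sorted(n_list, key=lambda n: accuracies[n], reverse=True)
    by_loss = sorted(n_list, key=lambda n: losses[n])
    top = []
    for i in range(1, len(n_list) + 1):
        common = set(by_acc[:i]) & set(by_loss[:i])
        for n in by_acc[:i]:          # keep a deterministic order
            if n in common and n not in top:
                top.append(n)
        if len(top) >= num_optimal:
            break
    return top[:num_optimal]
```

Algorithm 3 then simply runs this ranking once per model (with `num_optimal` set to the full list length) and applies the same prefix-intersection idea across the three resulting rankings.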
(a) Accuracy in % obtained from the test set:
Filter   Our CNN   VGG16   SAE
48       90.6      85.5    82.1
64       90.2      85.9    80.7
76       89.8      87.1    76.2
82       89.7      86.7    78.3
98       90.7      88.0    80.9

(b) Loss obtained from the test set:
Filter   Our CNN   VGG16   SAE
48       0.31      0.40    0.49
64       0.31      0.37    0.54
76       0.31      0.36    0.63
82       0.32      0.38    0.58
98       0.33      0.35    0.57

Table 3 Quantitative evaluation results for the three models using 5 common optimal filter sizes.

In general, as the filter size grows, there is a decrease in the prediction accuracy of the three models for linear structures like tracks and streams, but an increase in prediction accuracy for areal structures like lakes. It is also observed from the confusion matrices that the models confuse tracks with streams, and vice versa, more than they confuse them with lakes. This is expected, since tracks and streams are both linear and thus somewhat similar in structure. Additionally, a higher number of lakes are confused by the three models with streams rather than tracks,
which is also intuitive, since some lakes with a smaller width could be confused with larger streams.

(a) Our model; filter size n = 48
Actual \ Predicted   Tracks   Streams   Others   Lakes
Tracks               91.9     2.1       4.8      1.3
Streams              4.1      89.3      4.9      1.7
Others               8.1      6.6       83.0     2.2
Lakes                0.8      2.0       0.8      96.4

(b) Our model; filter size n = 98
Actual \ Predicted   Tracks   Streams   Others   Lakes
Tracks               89.7     3.9       5.8      0.5
Streams              3.3      89.8      5.5      1.4
Others               8.1      8.1       82.6     1.1
Lakes                0.7      1.0       0.6      97.6

(c) VGG16; filter size n = 48
Actual \ Predicted   Tracks   Streams   Others   Lakes
Tracks               73.5     12.1      12.4     1.9
Streams              4.2      86.7      6.5      2.6
Others               10.4     9.5       77.8     2.3
Lakes                0.7      1.4       0.9      97.0

(d) VGG16; filter size n = 98
Actual \ Predicted   Tracks   Streams   Others   Lakes
Tracks               72.1     14.7      13.1     0.1
Streams              2.1      91.5      5.8      0.5
Others               5.4      12.4      82.0     0.2
Lakes                0.4      1.1       0.4      98.1

(e) SAE model; filter size n = 48
Actual \ Predicted   Tracks   Streams   Others   Lakes
Tracks               70.0     19.9      8.2      1.8
Streams              1.8      91.9      3.4      2.8
Others               8.5      24.3      66.0     1.1
Lakes                1.8      7.2       1.4      89.7

(f) SAE model; filter size n = 98
Actual \ Predicted   Tracks   Streams   Others   Lakes
Tracks               77.9     12.0      8.4      1.6
Streams              6.0      87.2      4.2      2.6
Others               22.1     12.0      64.6     1.3
Lakes                2.9      8.2       2.3      86.5

Table 4 Confusion matrices for the three models and the two filter sizes. Values are shown in percent.

In a subsequent step, the detection results are analyzed visually by inspecting the generated heat maps. The final heat maps generated by the models for filter sizes n = 48 and n = 98 are illustrated in Tables 5 and 6. The heat maps are generated for the four object classes from the area shown in Figure 5. The colors from low to high show the certainty of the model in detecting an object at that location. In the following visual analysis, we consider whether potential features are represented with high confidence and as connected, continuous components of similar confidence.
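The heat maps discussed here are produced by the sliding-window voting of Algorithm 2, which might be sketched as follows (the classifier is passed in as a plain function; names are ours):

```python
import numpy as np

def detect_objects(classifier, H, n, num_classes):
    """Algorithm 2: slide an n x n window over the height map H one pixel
    at a time, classify each patch, and accumulate per-class votes on the
    pixels covered by that patch. `classifier(patch)` is assumed to
    return a class index."""
    h1, h2 = H.shape
    O = np.zeros((num_classes, h1, h2))
    for i in range(h1 - n + 1):
        for j in range(h2 - n + 1):
            m = classifier(H[i:i + n, j:j + n])
            O[m, i:i + n, j:j + n] += 1   # vote for class m on this patch
    return O
```

Because each pixel is covered by up to n x n overlapping windows, interior pixels of a large uniform object collect many votes for its class, which is what produces the smooth confidence gradients seen in the heat maps.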
Given the 1-meter resolution of the DTM, when smaller values of n are used to create the training data, the n x n patches include a good representation of tracks, streams, and other objects of smaller width. However, they yield poor results for lakes and other objects of larger width. This is due to the fact that such a patch represents only a small part of the whole object; for example, it might be a flat height map cropped from the middle of a lake. On the other hand, using bigger values of n results in a better representation of objects of larger width, such as lakes, since the training patches contain the full structure of a lake and parts of its surroundings. However, it results in a poor representation of objects of smaller width, since only a small section in the middle of the patch represents the object, and the majority of the patch includes other objects. The effect of this choice of n is clearly visible in the heat maps shown in Tables 5 and 6. For models with smaller values of n, the generated heat maps of narrow tracks and streams are more accurate, but those of lakes are less accurate. Streams with larger widths appear to be labeled as lakes. For models with larger values of n, the heat maps are more accurate for lakes, which are of a larger width, but those of streams and tracks are less accurate. A larger portion of the surroundings of tracks and streams is also labeled as streams and tracks. While varying the filter size has the same effect for all three models, there are significant differences among the models at the same filter sizes. As illustrated in the heat maps for filter size 48 in Table 5, the VGG16 model creates the best heat map for tracks, while the one generated by the stacked AE model is less accurate than that of our model. In generating heat maps for the streams, our model performs better than the other two, and the stacked AE model is again the worst among them.
Table 5 Heat maps using filter size 48 x 48 (rows: Tracks, Streams, Lakes; columns: Our CNN, VGG16, SAE, Ground Truth). Colors show the confidence of the models in detecting objects at that location.

Table 6 Heat maps using filter size 98 x 98 (rows: Tracks, Streams, Lakes; columns: Our CNN, VGG16, SAE, Ground Truth). Colors show the confidence of the models in detecting objects at that location.

While there are some similarities between our model and VGG16, a closer look shows that the streams heat map of our model is more continuous and flowing, and thus more accurate, compared to the discontinuities in the streams heat map of the VGG16 model. For example, the continuous stream in the lower left corner is more apparent in our model than in the VGG16 model. Finally, VGG16 performs better in generating heat maps for lakes, and our model generates a heat map rather similar to that of the stacked AE model. Comparing the results to the ground truth, it seems that the flat areas in the upper part and the middle of the test area are detected as lakes by our model and the stacked AE model. Using a filter size of 98 (see Table 6), the generated heat maps of our model and the stacked AE model indicate a clear pattern of tracks in the upper section of the area, while the VGG16 model is not as confident. In the lower part of the area, however, there is a better pattern of tracks in the VGG16 heat map, even though the confidence values are still not high. In the heat maps for the streams, our model is better than the other two. In the heat map generated by our model, the streams are continuous and more similar to the actual streams, while the VGG16 model produces discontinuous streams in the middle and upper sections of the area, and shows multiple tiny streams in the lower section of the area as one big stream, rather than the detailed information in the heat map of our model. The stacked AE model
produces a heat map with even more discontinuity and lower confidence values. The heat maps for lakes are produced in a similar manner by our model and the VGG16 model. The stacked AE model falsely labels some flat areas in the upper right and middle sections of the area as lakes instead of labeling them as 'others'.

5 Conclusion and Outlook

In this study, we present a two-step approach to detect objects in archaeological sites, leveraging deep learning, specifically CNNs, and using ALS data. In the first step, we develop a multiclass classifier CNN model and train multiple instances of the model with various input matrix sizes. Based on their accuracy, the best models are then, in the second step, applied to the DTM of the area of interest to classify objects and generate a heat map for each class. From the experiments performed, we conclude that smaller matrices as training data result in better detection of objects of smaller width, while bigger matrices are good for objects of larger width. Our model is compared with two existing methods, the pretrained VGG16 model and a stacked AE model, which are also trained with various input matrix sizes. The evaluation metrics in Table 3 indicate that our model performs better than the two others. The ultimate goal of this research is to provide a tool for archaeology experts with which they can load raw point cloud data acquired from an area of their choice, automatically see the results of the detection algorithm, and find the locations of each object. The result of the current approach is given in terms of a heat map. Ultimately, however, we are interested in geometric delineations and descriptions of the objects. To obtain them, additional analysis steps are needed, e.g. region growing or line following methods. We also plan to use parametric models in order to geometrically reconstruct objects such as charcoal stack locations.
Such approaches have successfully been used in the domain of building reconstruction (e.g. [35]). Currently, we detect only three classes: tracks, streams, and lakes; everything else is labeled as 'others'. We are incrementally labeling objects in the 'others' class and plan to include them in training once a sufficient number of instances has been labeled for a class. The current approach uses a classification scheme in which each pixel is classified with a filter mask and the resulting label is assigned to the pixels under that mask. This guarantees that an appropriate neighborhood of each pixel is taken into account for the classification. Future research will exploit other networks, especially semantic segmentation approaches such as U-Net (e.g. [36]). Such approaches have the advantage that each pixel is labeled individually in the classification process (e.g. [37]).

Acknowledgements The project is funded by the Ministry of Science in Lower Saxony. The joint work with Kourosh Khoshelham has been supported by the DAAD. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References
1 Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
2 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
3 Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
4 Yushi Chen, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu. Deep learning-based classification of hyperspectral data.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6):2094–2107, 2014.
5 Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
6 Pim Moeskops, Max A Viergever, Adriënne M Mendrik, Linda S de Vries, Manon JNL Benders, and Ivana Išgum. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1252–1261, 2016.
7 Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pages 3–10, 1994.
8 Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
9 François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
10 Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
11 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
12 Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
13 Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.
14 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
15 Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio.
A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147, 2016.
16 Bashir Kazimi and Marta Ruiz Costa-Jussà. Coverage for character based neural machine translation. Procesamiento del Lenguaje Natural (SEPLN), 59:99–106, 2017.
17 Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
18 Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.
19 Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
20 Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
21 Sijin Li and Antoni B Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In Asian Conference on Computer Vision, pages 332–347. Springer, 2014.
22 Jaime Zabalza, Jinchang Ren, Jiangbin Zheng, Huimin Zhao, Chunmei Qing, Zhijing Yang, Peijun Du, and Stephen Marshall. Novel segmented stacked autoencoder for effective dimensionality reduction and feature extraction in hyperspectral imaging. Neurocomputing, 185:1–10, 2016.
23 Tiago Prado Oliveira, Jamil Salem Barbar, and Alexsandro Santos Soares. Multilayer perceptron and stacked autoencoder for internet traffic prediction. In Ching-Hsien Hsu, Xuanhua Shi, and Valentina Salapura, editors, Network and Parallel Computing, pages 61–71, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
24 Hasan Badem, Abdullah Caliskan, Alper Basturk, and Mehmet Emin Yuksel. Classification and diagnosis of the parkinson disease by stacked autoencoder. In Electrical, Electronics and Biomedical Engineering (ELECO), 2016 National Conference on, pages 499–502. IEEE, 2016.
25 Qin Zou, Lihao Ni, Tong Zhang, and Qian Wang. Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters, 12(11):2321–2325, 2015.
26 Gong Cheng, Peicheng Zhou, and Junwei Han. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(12):7405–7415, 2016.
27 Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.
28 Haiyan Guan, Yongtao Yu, Zheng Ji, Jonathan Li, and Qi Zhang. Deep learning-based tree classification using mobile lidar data. Remote Sensing Letters, 6(11):864–873, 2015.
29 Yongtao Yu, Haiyan Guan, and Zheng Ji. Automated detection of urban road manhole covers using mobile laser scanning data. IEEE Transactions on Intelligent Transportation Systems, 16(6):3258–3269, 2015.
30 Yongtao Yu, Jonathan Li, Haiyan Guan, Fukai Jia, and Cheng Wang. Learning hierarchical features for automated extraction of road markings from 3-d mobile lidar point clouds. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(2):709–726, 2015.
31 Xiangyun Hu and Yi Yuan. Deep-learning-based classification for DTM extraction from ALS point cloud. Remote Sensing, 8(9):730, 2016.
32 Florian Politz, Bashir Kazimi, and Monika Sester. Classification of laser scanning data using deep learning. 38. Wissenschaftlich-Technische Jahrestagung der DGPF und PFGK18 Tagung, 2018.
33 Leon F Palafox, Christopher W Hamilton, Stephen P Scheidt, and Alexander M Alvarez. Automated detection of geological landforms on mars using convolutional neural networks. Computers & Geosciences, 101:48–56, 2017.
34 François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
35 Claus Brenner. Building reconstruction from images and laser scanning. International Journal of Applied Earth Observation and Geoinformation, 6(3-4):187–198, 2005.
36 Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
37 Zhishuang Yang, Wanshou Jiang, Bo Xu, Quansheng Zhu, San Jiang, and Wei Huang. A convolutional neural network-based 3d semantic labeling method for ALS point clouds. Remote Sensing, 9(9):936, 2017.