Deep Learning for Terrain Surface Classification: Vibration-based Approach

Marcos Concon (a), W. K. Wong (a), Filbert H. Juwono (a) and Catur Apriono (b)
(a) Curtin University Malaysia, CDT 250, Miri 98009, Sarawak, Malaysia
(b) University of Indonesia, Kampus Baru UI, Depok, West Java 16424, Indonesia

Abstract
As robots become more pervasive in the service sector, control in dynamic environments has become an important element in optimising the deployment of mobile robots. A mobile robot should be aware not only of obstacles, but also of the surface on which it navigates, in order to estimate slippage and enable adaptive control. We note that various terrains/surfaces have different characteristics, which can directly influence the handling, driving, efficiency, and stability of the robot vehicle. Knowledge of the terrain can therefore provide valuable information for establishing effective and secure navigation strategies. We built a mobile robot prototype equipped with an Inertial Measurement Unit (IMU) to obtain terrain data and applied deep learning models to classify the terrain using these data. Three deep learning configurations are proposed in this paper: long short-term memory (LSTM), 1D convolutional neural network (1D CNN), and a combined convolutional neural network-long short-term memory network (CNN-LSTM). The deep learning architectures were trained and evaluated on data collected from five different surfaces. It is shown that the CNN-LSTM performs the best, with an F1 score of 98.49%. The other two networks also generalize relatively well to unseen vibration sequences, with F1 scores of 97.47% and 95.98% for the 1D CNN and LSTM, respectively. Finally, we investigate the effect of varying the input sequence length to find the optimal length, so that we obtain the highest accuracy and generalization of the deep learning networks.

Keywords
LSTM, CNN, terrain classification
1. Introduction

Intelligent robotics have seen rapid advancement in their scope of operations, such as military reconnaissance in hostile environments [1], unmanned surveillance for disaster management [2], telemedicine robots for examining remote patients [3], and factory automation. It is necessary for a robot to acquire a clear understanding of its current environment in order to successfully manoeuvre and accomplish its planned operation, while preventing damage to itself and avoiding hazards to others. As service robots have achieved broad adoption in the above-mentioned industries, precise navigation and surrounding awareness have become crucial issues in improving the deployability of these devices. A significant consideration for the robot's efficient navigation is the motion control algorithm, which depends on the type of terrain being travelled. Thus, a detailed classification of the type of terrain is required for the robot to adapt its navigation speed and route-planning parameters, which depend on the characteristics of the terrain.

In this paper, we focus on terrain mapping using Inertial Measurement Unit (IMU) sensors. In order to map the readings to the respective terrain labels, we present and evaluate three types of deep learning frameworks: long short-term memory (LSTM), one-dimensional convolutional neural network (1D CNN), and the CNN-LSTM architecture. Both LSTM and CNN have been extensively used in the literature; however, applications combining both frameworks in a unified structure have been lacking. This paper aims to leverage their temporal and spatial advantages for vibration-based terrain classification.

It is worth noting that deep learning can be used for tasks where it is almost impossible to engineer features from raw data manually. Despite being a highly 'black-box' approach, end-to-end deep learning is suitable for automatically extracting useful features in complex non-linear classification tasks. Therefore, deep learning methods can be implemented to obtain more reliable recognition of the robot's surrounding environment, thereby enhancing the robot's adaptive control and mobility.

ISIC 2021: International Semantic Intelligence Conference, February 25-27, 2021, New Delhi, India
marcos@respiree.com (M. Concon); weikitt.w@curtin.edu.my (W.K. Wong); filbert@ieee.org (F.H. Juwono); catur@eng.ui.ac.id (C. Apriono)
ORCID: 0000-0001-6212-6096 (W.K. Wong); 0000-0002-2596-8101 (F.H. Juwono); 0000-0002-7843-6352 (C. Apriono)
(c) 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Related Works

The problems of adaptive control in mobile robots have been constantly researched. The challenges present various opportunities for researchers to develop methods for predicting dynamic changes in the environment. In [4], the authors investigated the use of kinematics-based analysis for wheel slippage calculation. The results were validated using data collected on a mobile platform. Similar work was found in [5], where the authors applied rolling resistance torque without using any additional sensors. Rolling resistance torque on multiple terrains can be acquired by a reaction torque observer; the proposed concept was verified using a differential drive mobile robot. In [6], wheel slips were estimated based on odometric data. The collected data were analyzed using two different approaches: an instantaneous estimator and a temporal window approach. Results showed that the temporal window approach yielded better results. In [7], researchers presented a solution using laser-based point cloud generation to detect the robot's traversal surface. The researchers explored several terrains including carpet, coated asphalt, and asphalt. The solution was highly precise but computationally expensive, as it generated point clouds which needed to be further processed digitally. The authors stated that there were opportunities to further investigate how a mobile robotic platform could provide reliable and accurate surface prediction of the terrain, improving navigation with prior knowledge of the surface. These research works show that there is strong motivation for investigating methods that enable service robots to perceive the terrain and traversal surface.

Several sensing methodologies have been developed to tackle the problem of terrain classification. They are typically categorized into two main groups: vision-based and reaction-based techniques. Traditional visual feature engineering approaches include the scale-invariant feature transform (SIFT) [8], speeded-up robust features (SURF) [9], and the bag of visual words (BOVW) [10], among many others. These algorithms pass useful features of images obtained from light detection and ranging (lidar) or a stereo camera to a classifier to be trained and classified. In [11], raw grayscale terrain images were used to train a deep convolutional network, and the accuracy was 6% lower than that of a support vector machine (SVM) classifier used jointly with the histogram of oriented gradients (HOG) feature extractor.

While vision-based approaches are useful because of their high accuracy, they are vulnerable to distortion caused by lighting changes and other factors related to the surface's physical properties (e.g. material type and degree of hardness) [12]. Reaction-based techniques, on the other hand, utilize sensor measurements to obtain acoustic, haptic, or vibration profiles for classification. Acoustic-based classification relies on a microphone to record the sound generated between the robot and the terrain during traversal. Noise removal and smoothing techniques are necessary in traditional acoustic-based classification to achieve satisfactory results, due to factors such as environmental noise and the robot's internal motor noise, as described in [13]. A deep learning approach was applied in [14], where a CNN was developed and trained using short-time Fourier transform (STFT) spectrograms extracted from raw terrain audio signals. It was demonstrated that the network was robust even when the terrain audio signal was corrupted with white Gaussian noise.

Haptic-based classification uses ground contact forces between a legged robot and the terrain to describe different terrain properties. Typically, features such as the robot's stride frequency and the peak and average motor torque in a single stride are used to train an SVM classifier [15]. In [16], a 1-dimensional CNN and an RNN architecture were implemented and evaluated on raw force/torque signals from a hexapod robot. There was a significant improvement of about 15% in classification accuracy compared to an SVM with a Gaussian kernel.

The last reaction-based technique is based on the vibration characteristics of the terrain. It was first suggested in [17], where the vibration signal was measured using an accelerometer during the robot's traversal. In terms of performance, SVM has proven to be the best when trained on hand-crafted time-domain features such as skewness, impulse factor, and root mean square (RMS), along with frequency-domain features from the discrete Fourier transform (DFT). Experiments using a CNN for vibrational wheel slip estimation in ground robotics were carried out in [11]. The wheel torque, vertical acceleration, and degree of pitch were used to train the classifier. A difference of 10% in classification accuracy was observed before and after filtering the input data for the CNN, which reinforces the generality of deep learning frameworks in extracting meaningful information directly from raw input vibration data.

3. Methodology

The mobile robot used in this research is a two-wheel differential drive with an attached 6-axis IMU that measures six vibrational terrain signatures. The setup is shown in Fig. 2. The form is very similar to conventional indoor service robots such as robotic vacuum cleaners. The vibration characteristics depend on the terrain's texture/material and the robot's movement. This study primarily aims to address terrain classification by utilising raw time-series vibration data as input to three implemented deep learning frameworks: the LSTM, 1D CNN, and CNN-LSTM architectures. An overview of the experiment workflow is given in Fig. 1.

Figure 1: Experiment workflow.

The data set used in this study contains a total of 24000 samples distributed evenly over five different terrain sources. The six features include the lateral, longitudinal, and vertical accelerations and angular velocities (a_x, a_y, a_z, g_x, g_y, g_z) of the traversing robot. Fig. 3 illustrates the five different vibration signals corresponding to the surface types. The vibration samples were collected via I2C from an MPU-6050 IMU, which integrates an accelerometer and a gyroscope in a single chip. The controlled conditions for the wheeled robot are: 50 Hz sampling rate, 1.6 minutes traversal time per surface, and circular motion of the robot.

Figure 2: Experimental setup.

Vibration samples must be converted into an appropriate format before entering the neural networks. Also, as the measurements have multiple units, the vibration samples must be normalised to a mean of zero and a variance of one. The equation for normalization is given by

    s_i = (x_i - mu) / sigma,    (1)

where i is the index of the element in the vibration sequence, mu is the mean, and sigma is the standard deviation. The vibration samples were then segmented into fixed windows of 1.5 seconds (75 samples). An overlap rate of 20% was applied between two consecutive 1.5-second segments to preserve the temporal dependencies between the time steps in the vibration sequence. One-hot encoding was then performed to map the different surface labels to numerical values. Lastly, the vibration data set was split into training, validation, and testing sets to allow the neural networks to generalize to unseen vibration characteristics. The partitions were set to 70%, 15%, and 15% for training, validation, and testing, respectively.

3.1. Implementation

LSTM is a type of recurrent neural network (RNN) that is typically used for sequence prediction. In particular, LSTM mitigates the vanishing gradient problem present in plain RNNs while allowing the long-term temporal dynamics of the series to be exploited. In contrast, CNNs have been commonly used for 2D problems (e.g. image classification); however, they can be modified to handle the 1D vibration problem. The dimensionality of the convolutional layers is reduced to match the model's 1D input.

The CNN-LSTM model leverages the robustness of the CNN in extracting spatial features and of the LSTM in exploiting the temporal dependencies of the vibration sequence. In this paper, the time-series vibration is downsampled by the 1D CNN to extract higher-level features. This can be considered a pre-processing step which allows the LSTM to interpret the features extracted at each block of the sequence. The concept is illustrated in Fig. 4.

Figure 3: The five different terrain vibration signals.
Figure 4: Time-slice processing for CNN-LSTM.

The three models were built and trained using a TensorFlow backend with the Keras API. A detailed overview of the three models is summarized in Table 1. The Hyperband algorithm was used to select the hyperparameters, allowing for the best balance between training time and accuracy. The learning rate, batch size, and number of epochs were set at 0.001, 64, and 30, respectively. Additionally, early stopping regularization was implemented to avoid overfitting during model training. Further, the Adam optimization algorithm, based on stochastic gradient descent, was used as the optimizer.

Table 1
Overview of the architectures used in this study

Model      Layer               Output shape
LSTM       LSTM (20 units)     (75, 20)
           Dropout (25%)       (75, 20)
           LSTM (70 units)     70
           Dropout (40%)       70
           Dense               112
           Dense               5
1D CNN     Conv1D (80@6x1)     (70, 80)
           Dropout (50%)       (70, 80)
           Conv1D (128@6x1)    (65, 128)
           Dropout (50%)       (65, 128)
           Max pooling         (32, 128)
           Flatten             4096
           Dense               96
           Dense               5
CNN-LSTM   Conv1D (96@6x1)     (3, 20, 96)
           Conv1D (48@6x1)     (3, 15, 48)
           Dropout (30%)       (3, 15, 48)
           Max pooling         (3, 7, 48)
           Flatten             (3, 336)
           LSTM (20 units)     60
           Dropout (20%)       60
           Dense               96
           Dense               5

The implemented CNN-LSTM architecture is shown in Fig. 5. For both the LSTM and 1D CNN networks, each vibration training sample was a sequence of 75 time steps. In a stacked LSTM network, the first LSTM layer returns a shape of (timestep, unit) to be passed on to the next layer, whereas the last LSTM layer returns only the unit dimension. For the 1D CNN, the input shape to the network is represented as (timestep, features). In the case of the CNN-LSTM network, a time-distributed wrapper is first applied before the LSTM layers to allow the input vibration signal to retain its temporal representation during the convolution process. The time-distributed layer expects a 3D input, so the input sequence was reshaped from 75 time steps into 3 subsequences of 25 time steps. The convolutional layers used the ReLU activation and a 6 x 1 kernel that moves across one dimension during the convolution operation.

Figure 5: Implemented CNN-LSTM model for vibration-based terrain classification.

Dropout layers were then added to tackle overfitting by randomly setting a fraction of the input units to zero. A pooling layer was added to reduce the spatial size of the output representation by half. Note that both the dropout and pooling layers allow for faster training due to the reduced parameter size. A flatten layer was used to transform the output of the previous layers into input for the LSTM layer, where the temporal characteristics of the vibration sequence were extracted. Lastly, fully connected layers with the softmax activation function were used to structure the outputs of the previous layer for the final classification task. In this experiment, the categorical cross-entropy loss function was used to address the 5-class terrain classification problem.

4. Results

The confusion matrices of the three models are depicted in Fig. 6. From the confusion matrix, we can calculate the precision P_r, the recall R_c, and the F1 score. The F1 score, which is calculated from P_r and R_c, is commonly used to analyze the performance of models. We used the macro-averaging technique to extend these metrics to multi-class terrain classification. The precision, recall, and F1 score are given by

    P_r = TP / (TP + FP),    (2)

    R_c = TP / (TP + FN),    (3)

    F1 = 2 * P_r * R_c / (P_r + R_c),    (4)

where TP is the outcome where the model correctly classifies the positive class, FP is the outcome where the model incorrectly classifies the positive class, TN is the outcome where the model correctly classifies the negative class, and FN is the outcome where the model incorrectly classifies the negative class.

Fig. 7 illustrates the performance of the models on the 5-class vibration test data set. In the worst case, about 7% (on average) of the wood class was misclassified as tiles across the three models. Overall, the three models exhibited good performance and generalized well to the unseen data. The CNN-LSTM architecture has the best performance, with an average F1 score of 98.49%. The 1D CNN follows in second place with an average F1 score of 97.47%. We note that the slight improvement of the CNN-LSTM model over the 1D CNN may suggest that the temporal modelling of the LSTM is less important than the feature generation capability of the CNN. Table 2 summarizes the overall performance of the three models.

Table 2
Average precision, recall, and F1 scores (based on testing data)

Model      Precision   Recall    F1 Score
LSTM       96.23%      96.00%    95.98%
1D CNN     97.53%      97.50%    97.47%
CNN-LSTM   98.60%      98.50%    98.49%

One factor influencing the performance of the models is the sequence length of the input vibration. To further validate the performance of the three models, we analyzed the F1 score with varying segment lengths, as shown in Fig. 8. It can be seen that a longer sequence length results in better accuracy, with performance saturating at a certain length. The average F1 score rises as the length of the vibration series increases from 30 to 60 samples but decreases afterwards. This may be caused by the lack of training data after segmentation at the given length. Therefore, we consider a sequence length of 75 (1.5 seconds) optimal for obtaining high accuracy and generalization of the models. Furthermore, the proposed CNN-LSTM architecture slightly outperformed the CNN and LSTM models across the varying segment lengths.
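The pre-processing pipeline described in Section 3 — z-score normalization per Eq. (1), followed by segmentation into 75-sample windows with a 20% overlap — can be sketched in plain Python. The helper names and the synthetic signal below are illustrative assumptions, not the authors' code:

```python
def zscore(seq):
    """Normalize a 1-D sequence to zero mean and unit variance (Eq. 1)."""
    n = len(seq)
    mu = sum(seq) / n
    sigma = (sum((x - mu) ** 2 for x in seq) / n) ** 0.5
    return [(x - mu) / sigma for x in seq]

def segment(seq, window=75, overlap=0.2):
    """Split a sequence into fixed windows with the given overlap rate."""
    step = int(window * (1 - overlap))  # 75 * 0.8 = 60 samples per hop
    return [seq[i:i + window]
            for i in range(0, len(seq) - window + 1, step)]

# Illustrative signal: 1.6 min at 50 Hz = 4800 samples per surface
signal = zscore([float(i % 7) for i in range(4800)])
windows = segment(signal)
```

With a 20% overlap, consecutive windows start 60 samples apart, so each pair of adjacent windows shares 15 samples of context.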
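The time-slice processing of Fig. 4 — reshaping a 75-step window into 3 subsequences of 25 steps so a time-distributed 1D CNN can process each slice before the LSTM — reduces to a simple reshape. A minimal sketch with a hypothetical helper name (the dummy window stands in for real 6-channel IMU data):

```python
def to_subsequences(window, n_sub=3, sub_len=25):
    """Reshape a (75, features) window into (3, 25, features) so a
    time-distributed 1D CNN can convolve each 25-step slice in turn."""
    assert len(window) == n_sub * sub_len
    return [window[k * sub_len:(k + 1) * sub_len] for k in range(n_sub)]

# One window: 75 time steps, 6 IMU channels (ax, ay, az, gx, gy, gz)
window = [[float(t)] * 6 for t in range(75)]
subseqs = to_subsequences(window)
```

In Keras this reshaping would feed a `TimeDistributed`-wrapped `Conv1D` stack, whose per-slice outputs then form the LSTM's input sequence.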
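The macro-averaged metrics of Eqs. (2)-(4) can be computed directly from a confusion matrix. The sketch below uses a made-up 3-class matrix for illustration; it is not the paper's data:

```python
def macro_scores(cm):
    """Macro-averaged precision, recall, and F1 from a square
    confusion matrix cm[true][pred], following Eqs. (2)-(4)."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, wrong
        fn = sum(cm[c]) - tp                       # true c, missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    # Macro-averaging: unweighted mean over classes
    return (sum(precisions) / n, sum(recalls) / n, sum(f1s) / n)

# Hypothetical 3-class confusion matrix (rows: true, cols: predicted)
cm = [[8, 1, 1],
      [0, 9, 1],
      [1, 0, 9]]
p, r, f1 = macro_scores(cm)
```

Macro-averaging gives each terrain class equal weight regardless of its sample count, which matches the balanced 5-class data set used here.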
Figure 6: Confusion matrices of (a) LSTM, (b) CNN, and (c) CNN-LSTM on the vibration test dataset.
Figure 7: F1 scores for the three architectures.
Figure 8: Average F1 score at varying segment lengths.

5. Conclusion and Future Work

In this paper, we have demonstrated an IMU-based surface classification task. We compared three candidates for classifying the IMU data: LSTM, 1D CNN, and a combination of CNN and LSTM. Comparing the results, the CNN-LSTM provided the best performance (F1 score of 98.49%). However, the 1D CNN also presented favorable results, only slightly lower than the CNN-LSTM. The results suggest that the 1D CNN maps the classification better than the LSTM on a standalone basis. CNN and LSTM work on different principles: the latter models the temporal dynamics of the data, while the 1D CNN is based on static convolution, similar to its 2D counterpart. This implies that there is a clear static pattern in the IMU data, enabling a well-defined mapping to the respective classes.

The results, though counterintuitive, may prompt further research in this direction. With the growth of edge computing and the capacity of embedded systems, enabling robots to recognize surfaces would open up further indoor and industrial applications. Reducing the complexity of the machine learning models to gain further computational savings is also required. This appears feasible given the clear static pattern demonstrated by the 1D CNN results.

Acknowledgments

This research was supported by the Ministry of Research and Technology/National Agency for Research and Innovation, Republic of Indonesia, through the Penelitian Dasar Unggulan Perguruan Tinggi (PDUPT) Grant, contract number NKB-2838/UN2.RST/HKP.05.00/2020, year 2020.

References

[1] J. G. Bellingham, K. Rajan, Robotics in remote and hostile environments, Science 318 (2007) 1098-1102.
[2] V. Jorge, R. Granada, R. Maidana, D. Jurak, G. Heck, A. Negreiros, D. dos Santos, L. Gonçalves, A. Amory, A survey on unmanned surface vehicles for disaster robotics: Main challenges and directions, Sensors 19 (2019) 702. doi:10.3390/s19030702.
[3] K. K. Chung, K. W. Grathwohl, R. K. Poropatich, S. E. Wolf, J. B. Holcomb, Robotic telepresence: past, present, and future, Journal of Cardiothoracic and Vascular Anesthesia 21 (2007) 593-596.
[4] R. Chaichaowarat, W. Wannasuphoprasit, Wheel slip angle estimation of a planar mobile platform, in: 2019 First International Symposium on Instrumentation, Control, Artificial Intelligence, and Robotics (ICA-SYMP), 2019, pp. 163-166. doi:10.1109/ICA-SYMP.2019.8646198.
[5] S. D. A. P. Senadheera, A. M. H. S. Abeykoon, Sensorless terrain estimation for a wheeled mobile robot, in: 2017 IEEE International Conference on Industrial and Information Systems (ICIIS), 2017, pp. 1-6. doi:10.1109/ICIINFS.2017.8300422.
[6] D. Masha, M. Burke, B. Twala, Slip estimation methods for proprioceptive terrain classification using tracked mobile robots, in: International Conference (PRASA-RobMech), 2017, pp. 150-152.
[7] S. Wilson, J. Potgieter, K. Arif, Floor surface mapping using mobile robot and 2D laser scanner, in: 2017 24th International Conference on Mechatronics and Machine Vision in Practice (M2VIP), 2017, pp. 1-6. doi:10.1109/M2VIP.2017.8211508.
[8] S. Zenker, E. E. Aksoy, D. Goldschmidt, F. Wörgötter, P. Manoonpong, Visual terrain classification for selecting energy efficient gaits of a hexapod robot, in: 2013 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, 2013, pp. 577-584. doi:10.1109/AIM.2013.6584154.
[9] Seung-Youn Lee, D. Kwak, A terrain classification method for UGV autonomous navigation based on SURF, in: 2011 8th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), 2011, pp. 303-306. doi:10.1109/URAI.2011.6145981.
[10] H. Wu, B. Liu, W. Su, Z. Chen, W. Zhang, X. Ren, J. Sun, Optimum pipeline for visual terrain classification using improved bag of visual words and fusion methods, Journal of Sensors 2017 (2017).
[11] R. González, K. Iagnemma, DeepTerramechanics: Terrain classification and slip estimation for ground robots via deep learning, CoRR abs/1806.07379 (2018). URL: http://arxiv.org/abs/1806.07379.
[12] P. Roy, S. Ghosh, S. Bhattacharya, U. Pal, Effects of degradations on deep neural network architectures, CoRR abs/1807.10108 (2018). URL: http://arxiv.org/abs/1807.10108.
[13] J. Libby, A. J. Stentz, Using sound to classify vehicle-terrain interactions in outdoor environments, in: 2012 IEEE International Conference on Robotics and Automation, 2012, pp. 3559-3566. doi:10.1109/ICRA.2012.6225357.
[14] A. Valada, L. Spinello, W. Burgard, Deep feature learning for acoustics-based terrain classification, in: International Symposium on Robotics Research (ISRR), 2015.
[15] X. A. Wu, T. M. Huh, R. Mukherjee, M. Cutkosky, Integrated ground reaction force sensing and terrain classification for small legged robots, IEEE Robotics and Automation Letters 1 (2016) 1125-1132. doi:10.1109/LRA.2016.2524073.
[16] J. Bednarek, M. Bednarek, L. Wellhausen, M. Hutter, K. Walas, What am I touching? Learning to classify terrain via haptic sensing, in: 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 7187-7193. doi:10.1109/ICRA.2019.8794478.
[17] C. A. Brooks, K. Iagnemma, Vibration-based terrain classification for planetary exploration rovers, IEEE Transactions on Robotics 21 (2005) 1185-1191. doi:10.1109/TRO.2005.855994.