Multi-Stage Deep Learning for Context-Free Handwriting Recognition

Daniel Lückehe¹, Nicole Mühlpforte², Alfred O. Effenberg², and Gabriele von Voigt¹

¹ Computational Health Informatics, Leibniz University Hanover, Germany
lueckehe@chi.uni-hannover.de
² Institute for Sports Science, Leibniz University Hanover, Germany

Abstract. Handwriting recognition approaches usually use the semantic context of the individual letters. This helps to achieve highly accurate classification results. As it is less likely to detect semantically invalid letters, these approaches imply an error correction. In most cases, this is a beneficial side-effect, but it makes the approaches unusable if semantically invalid letters should be detected, e.g., in the case of educational software where spelling mistakes should be recognized. For this purpose, we developed an advanced context-free handwriting recognition which is based on multiple techniques from the field of deep learning. We motivate the individual components of our approach and show how the components can benefit from each other. In the experimental section, the behavior of our new approach is analyzed in detail and the classification accuracy is compared between the different approaches.

Keywords: Handwriting Recognition, Deep Learning, Human-Machine Interface, Learning Support, Educational Software

1 Introduction

Writing is a key skill for learning results at school and for professional success [8]. Thereby, correct orthography is an important part. In school, it is often trained by handwritten exercises, as this can improve orthography skills. Thus, handwriting is a crucial skill.

To realize educational software which supports the development of orthography, especially for children with spelling problems [22], a well-working handwriting recognition is required. Usually, the objective of a handwriting recognition is to be as accurate as possible. Therefore, the semantic context of the individual letters is used, as this can improve the accuracy. For example, it might be challenging to distinguish between the written letters u and v. But if the letters are considered in their semantic context, the task may be easy: in the middle of the word house, there is a u. So, if a handwriting recognition algorithm is not able to determine whether there is a u or a v in the middle of the word, the algorithm can easily determine the letter by its context. Even a wrongly written hovse would be recognized as house. In practice, this increases the accuracy of handwriting recognition algorithms significantly, and the implicit error correction is mostly a beneficial side-effect. But in the case of educational software for children, all errors must be detected to provide valuable assistance. Thus, in this case, a usual handwriting recognition algorithm is not usable.

There are multiple ways of taking spelling mistakes into account. For example, the software could display the error. As too many displayed mistakes might be demotivating, the software could also collect all errors and give a report to a teacher who can individually support the child. Also, it is possible that the educational software adapts to the child. All spelling mistakes can be classified, e.g., there are errors with double consonants after a short vowel. If a child makes a disproportionately high number of errors of a certain class, this class could be practiced more.
For all these ways of taking spelling mistakes into account, a handwriting recognition is required which does not use the semantic context.

Our paper is structured as follows. First, the foundation of the paper is laid and the data set is introduced. Then, we present two different perspectives on the data: a visual one and a movement-oriented one. In Section 3, we introduce our new approach of a context-free handwriting recognition which combines both perspectives. In the experimental results, shown in Section 4, the capabilities of our approach are presented and analyzed. Thereby, the capabilities of the single perspectives are compared to our approach. In the last section, conclusions are drawn.

1.1 Foundation

This work is based on a project in which the handwriting of elementary school children was recorded to develop a new educational software for children to train spelling skills. There are already software products to train spelling skills like Gut1 [7] or Tintenklex [4], but to the best of our knowledge, none of them includes a handwriting recognition, i.e., they are all using keyboard input.

To develop a handwriting recognition for an educational software, 189 children from the first to the fourth grade performed a handwriting task in four elementary schools in the region of Hanover, Germany. In this task, each child wrote the whole alphabet of small and capital letters and the German special characters. Additionally, each child wrote complete words which contain all possible letters. The words were selected from a list. There was no dictation and the processing sequence was freely chosen. The handwriting was recorded on an iPad Air using an Adonit Jot Pro stylus. Afterward, the individual letters were separated and classified twice by two research assistants.

These data have been used in [19] to develop a handwriting recognition using a high-level representation. It is an interesting approach, but due to its poor classification accuracy, it is not usable. In the experimental section of their paper, only five capital letters are employed – the letters A, M, O, T, U. These letters are correctly predicted in 74.7% of the cases. The limitation to five letters and a classification error of more than 25% are not acceptable. The objective must be to use all letters and to achieve a classification accuracy of significantly above 90%. Otherwise, the errors of the recognition will demotivate the children while using it. Especially correctly written words which are classified as an error are frustrating. So, the recognition must work in the context of an educational software, and thereby, the focus must be put on the task of the educational software and not on errors of the handwriting recognition.

1.2 Data Set

In this section, we introduce the employed data set. It contains 34 627 samples which are not equally distributed. The frequency results from the occurrence while writing. Thus, there are more small than capital letters in the data set. The most frequent letter is the small e which occurs 4205 times. It is followed by the small n, 2596 times, and the small r, 1518 times. The median occurrence of a letter is 336 times and even the least frequent one occurs 179 times; it is the capital special character Ü. Overall, there are 59 different letters: 26 capital and 26 small letters, capital and small special characters (the umlauts ä, ö, ü), and another special character (eszett: ß).

As shown in Figure 1, the letters in the data set are written in different writing styles. The figure presents six different ways to write a small t. Additionally, there is a large variation in the letters as they are written by children.

Fig. 1. Example of six written small letters t visualizing different writing styles

The data set includes a chronological sequence for each written letter. For each time step, the x- and y-position of the pen which is used for the writing are given. Additionally, a flag is set if the pen is lifted. Based on these data, the letter should be classified.
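To make the raw recording format concrete, the following minimal sketch represents one recorded letter as an array of time steps. It is our own illustration, not the project's actual tooling; the names make_sample and pen_lifted are hypothetical.

```python
import numpy as np

# Hypothetical representation of one recorded letter: each row is one
# time step with the pen's x- and y-position and a binary flag which
# is set while the pen is lifted, i.e., an array of shape (T, 3).
def make_sample(xs, ys, pen_lifted):
    """Stack coordinates and pen-lift flags into a (T, 3) array."""
    return np.stack([xs, ys, pen_lifted], axis=1).astype(np.float32)

# Toy example: a short stroke of four time steps, lifted at the end.
sample = make_sample(xs=[12.0, 14.5, 17.0, 17.2],
                     ys=[40.0, 35.0, 31.0, 30.8],
                     pen_lifted=[0, 0, 0, 1])
print(sample.shape)  # (4, 3)
```

Both perspectives discussed below can be derived from this representation: rasterizing the positions yields an image, and differencing consecutive positions yields movement vectors.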
It is interesting that these data can be interpreted as visual images. As the children grasp the written letters visually, this approach is obvious. Additionally, the data can be interpreted as a writing movement. We present both interpretations in the next section and show their advantages and disadvantages. Then, in our new approach, we employ both interpretations at the same time to combine their advantages.

2 Applying One Perspective to the Data

In this section, we show two interpretations of the data. In both cases, we employ methods from the field of deep learning to classify the written letters. In the first perspective, the data are interpreted as visual images. These images can be classified by a Convolutional Neural Network (CNN) [13, 14]. The second perspective takes the writing movement into account. As the movement is a time series, a Long Short-Term Memory network (LSTM) [9, 11] is well suited to classify letters based on this perspective.

Deep Visual Approach

In our deep visual approach, the data are interpreted as visual images. This is the most obvious perspective, as the children grasp their written letters visually. Based on the data, we create images with a resolution of 296 × 220 pixels. As fine structures are not relevant for the classification, the image sizes are reduced by a factor of 4. We choose a CNN for the classification task. CNNs have been very successful in recent years [18]. They are deep neural networks applying a structure of neurons which is inspired by the human visual cortex [1]. Therefore, they are well suited to classify images. The structure of our approach applying a CNN is very straightforward. It is shown in Figure 2. The data are interpreted from the visual perspective, resulting in x_visual. One pattern x_visual contains brightness information for each entry of a matrix of size 296 × 220 representing the pixels. The pattern is handed to the CNN which classifies it. The classification result is y.

Fig. 2. Structure of our deep visual approach

The used CNN employs a classical structure [12]. The first layer is a convolutional layer employing 32 filters with a size of 3 × 3 pixels. It is followed by a max pooling layer which downsamples the data. We employ a 2 × 2 downsampling, i.e., the data height and width are reduced by a factor of 2. Then, the combination of a convolutional layer and a max pooling layer is repeated, but now the convolutional layer employs 64 filters. The extracted features from the convolutional and max pooling layers are handed to a dense layer. As activation function, we employ the ReLU function [16] for all the described layers. To avoid overfitting, a dropout layer [21] is used. Finally, an output layer with 59 neurons determines the classification results for the 59 different letters. The output layer uses the softmax function [2].
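A minimal Keras sketch of a CNN with this layout is shown below. It is an illustration under stated assumptions, not the authors' implementation: we assume the factor-4 reduction yields 74 × 55 inputs, the width of the dense layer (128) is our own choice as the paper does not state it, and the dropout keep probability of 0.5 is taken from Section 4. The softmax is kept as a separate layer so that the pre-activation outputs can later be reused by the multi-stage approach (Section 3).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_visual_cnn(num_classes=59):
    """CNN following the described layout: conv(32, 3x3), maxpool(2x2),
    conv(64, 3x3), maxpool(2x2), dense, dropout, softmax output."""
    return models.Sequential([
        layers.Input(shape=(74, 55, 1)),   # 296x220 reduced by factor 4
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # width is an assumption
        layers.Dropout(0.5),                   # keep probability 0.5
        layers.Dense(num_classes),             # pre-softmax outputs
        layers.Softmax(),
    ])
```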
Deep Movement Approach

Our second approach is to interpret the data as a movement. This means movement vectors are determined based on the movement of the pen. Multiple vectors yield a complete letter. Each child has its own writing speed, which is not relevant for the classification. Thus, all letters are normalized to a size of 32 movement vectors. As the vectors can have different lengths, the relative speed while writing a letter is taken into account. Besides the ∆x and ∆y values which describe the movement, there is binary information for each vector. This information indicates whether the pen is lifted or not.

Fig. 3. Example of three capital letters A visualizing the movement while writing them

Figure 3 shows three letters interpreted as movement. The dots in the figure represent the start and end points of the movement vectors. The colors of the lines visualize the direction of the movement. The sequence starts with dark purple and ends with light yellow. Dashed red lines represent movements in which the pen is lifted. Considering the left letter A in the figure, the movement is well recognizable. It starts from the bottom left, goes up, and then to the bottom right. The pen is lifted and finally, a horizontal line from the left to the right is drawn. This description includes much more information than just an image of the letter A. This information might be helpful for the classification task. However, the motion vectors are also challenging since the children grasp their written letters visually and do not focus on a clear movement. Especially if the pen is lifted to redraw parts of a letter, this might be challenging to classify. Considering the middle letter in Figure 3, it can be observed that after the letter was finished, the child extended the bottom left part of the capital A. Visually, this makes sense and makes it easier to recognize the letter, but the motion vectors become more complex due to this extension. The same applies to the right letter of the figure where the top center part of the letter is reworked.

As the movement perspective delivers a chronological sequence, a Recurrent Neural Network (RNN) [10] is well suited to classify the data. In the field of RNNs, LSTMs are very powerful [6]. In contrast to classical RNNs, LSTMs include an additional short-term memory, and due to their special structure, they can keep values in their short-term memory for a long time [9]. For this reason, they are called Long Short-Term Memory networks. Our approach applying an LSTM is very straightforward. It is shown in Figure 4.

Fig. 4. Structure of our deep movement approach

The data are interpreted from the movement perspective, resulting in x_movement. One pattern x_movement consists of the 32 movement vectors and the binary information per vector which indicates whether the pen is lifted. The pattern is handed over to the LSTM which classifies it. The classification result is y. The employed LSTM includes 1024 cells, followed by a dropout and an output layer. The output layer uses the softmax function. The other layers employ the ReLU function as activation function.
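The following sketch illustrates both steps: resampling a recorded letter to 32 movement vectors, and the LSTM itself. The resampling is our own reading of the described normalization, as the paper does not give the exact algorithm, and as in the CNN sketch above, the softmax is kept as a separate layer.

```python
import numpy as np
from tensorflow.keras import layers, models

def to_movement_vectors(points, pen_lifted, n_vectors=32):
    """Resample one recorded letter to a fixed number of movement
    vectors. `points` is a (T, 2) array of pen positions, `pen_lifted`
    a (T,) binary flag. Sampling uniformly over the time axis keeps
    the relative writing speed: fast segments yield long vectors."""
    t_old = np.linspace(0.0, 1.0, num=len(points))
    t_new = np.linspace(0.0, 1.0, num=n_vectors + 1)
    x = np.interp(t_new, t_old, points[:, 0])
    y = np.interp(t_new, t_old, points[:, 1])
    lifted = np.interp(t_new, t_old, pen_lifted)[1:] > 0.5
    return np.stack([np.diff(x), np.diff(y),
                     lifted.astype(np.float32)], axis=1)   # (32, 3)

def build_movement_lstm(num_classes=59):
    """LSTM with 1024 cells, dropout, and a softmax output; the paper
    states ReLU activations for the non-output layers."""
    return models.Sequential([
        layers.Input(shape=(32, 3)),   # 32 vectors: dx, dy, pen lift
        layers.LSTM(1024, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes),     # pre-softmax outputs
        layers.Softmax(),
    ])
```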
3 Multi-Stage Deep Context-Free Handwriting Recognition

In this section, we introduce our new approach. It works context-free; thus, it can be used in educational software as described in Section 1. The idea behind our approach is to combine the advantages of the deep visual and the deep movement approach. For this purpose, we employ both approaches and connect their outputs to a Deep Neural Network (DNN) [5] using dense layers. This makes it a multi-stage approach. We denote our multi-stage deep context-free handwriting recognition as C−HR.

Structure of C−HR

The structure of our approach is shown in Figure 5. Like the structures in the previous section, it is very straightforward. The figure shows how the deep visual and deep movement approach are combined. In contrast to the approaches employing only one perspective, the outputs of the CNN and LSTM are not used directly. The outputs are handed over to a DNN, and the DNN determines the classification result y. Thereby, the CNN and LSTM are configured as described in Section 2. As DNN, we employ two dense layers with 1024 neurons per layer. Both layers employ the ReLU function. These layers are followed by a dropout layer and an output layer which uses the softmax activation function.

Fig. 5. Structure of C−HR

The task of the DNN is to classify a letter based on the information from the CNN and LSTM. To provide the DNN with detailed information, the CNN and LSTM are applied without the activation function in their output layer. This is due to the fact that the activation function is the softmax function, which scales the output values to a range from 0 to 1. The original range is lost, which can offer valuable information. By removing the activation function, this information is kept.
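A sketch of this structure using the Keras functional API might look as follows. The exact wiring is our illustration, not the authors' code; cnn_body and lstm_body stand for the CNN and LSTM from Section 2 with their final softmax layers removed, so the DNN head sees the raw pre-softmax outputs.

```python
from tensorflow.keras import layers, models

def build_chr(cnn_body, lstm_body, num_classes=59):
    """DNN head of C−HR: the pre-softmax outputs of the CNN and LSTM
    are concatenated and passed through two dense layers with 1024
    ReLU neurons, a dropout layer, and a softmax output layer."""
    x_visual = layers.Input(shape=cnn_body.input_shape[1:])
    x_movement = layers.Input(shape=lstm_body.input_shape[1:])
    merged = layers.Concatenate()(
        [cnn_body(x_visual), lstm_body(x_movement)])
    h = layers.Dense(1024, activation="relu")(merged)
    h = layers.Dense(1024, activation="relu")(h)
    h = layers.Dropout(0.5)(h)
    y = layers.Dense(num_classes, activation="softmax")(h)
    return models.Model(inputs=[x_visual, x_movement], outputs=y)
```

How the softmax layers are stripped and the weights of the two single-perspective networks are frozen is shown in the training sketch in Section 4.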
Benefits of Combining

As described in Section 2, interpreting letters as movement vectors has advantages and disadvantages compared to the visual interpretation of the data. By using both perspectives combined by a DNN, our approach is able to use the strengths of both approaches. For example, in preliminary experiments, it turned out that the small letters h and n can be better distinguished by employing the visual perspective. This makes sense as the motion for both letters is very similar. In both cases, it is a line down followed by a curve. To distinguish the letters, the length of the line is crucial, which can be recognized best by visually examining it. A different example are the small letters c and l. They can look very similar if they are written unclearly. Also, the movement is similar, but as the upper part of the letter c is a curve and that of the letter l is a line, the motion vectors are slightly different. Additionally, the curve is usually written a bit slower than the line. In preliminary experiments, it turned out that the differences between the movement vectors and the writing speed are easier to detect than the visual differences.

4 Experimental Results

In this section, the experimental results are presented. For the experiments, we employ three different variants of the data set. The first variant only consists of the capital letters A, M, O, T, U. We denote this variant as amotu. It includes 1473 patterns. In the second variant, all small letters are contained. It consists of 24 810 patterns and we denote it as small. The last variant, which we call full, includes all 59 different letters, i.e., all 26 capital and small letters, capital and small special characters (the umlauts ä, ö, ü), and another special character (eszett: ß). This variant includes 34 627 patterns. For all experimental results, we employ a 10-fold cross-validation using a stratified k-fold cross-validator [17]. This provides folds which preserve the percentage of samples for each class.

The neural networks are trained using a keep probability of 0.5 for the dropout layers. They are trained until no more improvements are achieved. Thereby, each component gets the same amount of time. Thus, the deep visual approach, which is relatively cheap to compute, gets more learning epochs. For amotu, we train about 270 epochs for the deep visual approach, 80 epochs for the deep movement approach, and 65 epochs for the multi-stage approach. For small and full, the training time is quintupled. During the training, a batch size of 100 patterns is used. To train C−HR, first the CNN and the LSTM are trained. Then, the DNN which connects the CNN and LSTM is trained. Thereby, the weights of the CNN and LSTM are fixed.
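Putting the pieces together, the following sketch outlines the two-stage training inside the stratified cross-validation. It reuses the hypothetical builder functions from the previous sketches; images, movements, and labels are placeholder arrays of prepared patterns, and the optimizer and loss are assumptions as the paper does not name them.

```python
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras import models

# `images`, `movements`, and `labels` are placeholder arrays holding
# the prepared patterns of one data set variant.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(images, labels):
    # Stage 1: train the two single-perspective networks (epoch counts
    # as reported for amotu).
    cnn, lstm = build_visual_cnn(), build_movement_lstm()
    for net, x, epochs in ((cnn, images, 270), (lstm, movements, 80)):
        net.compile(optimizer="adam",                 # assumption
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
        net.fit(x[train_idx], labels[train_idx],
                batch_size=100, epochs=epochs)
    # Stage 2: strip the softmax layers, freeze all weights, and train
    # only the DNN head on the raw pre-softmax outputs.
    cnn_body = models.Model(cnn.inputs, cnn.layers[-2].output)
    lstm_body = models.Model(lstm.inputs, lstm.layers[-2].output)
    cnn_body.trainable = False
    lstm_body.trainable = False
    chr_model = build_chr(cnn_body, lstm_body)
    chr_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
    chr_model.fit([images[train_idx], movements[train_idx]],
                  labels[train_idx], batch_size=100, epochs=65)
    # test_idx would be used to evaluate the fold's accuracy.
```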
Comparison of Classification Accuracy

In the first experiment, we compare the classification accuracy of C−HR to the result from [19]. As introduced in Section 1.1, [19] used a high-level representation to classify the data. Therefore, we denote this approach as HLR. Additionally, we show results of decision trees (DTs) [20] and the more advanced approach of random forests (RF) [3], which employs multiple DTs. Both approaches use the visual perspective. We select DTs as their performance can be adjusted by their maximal depth. This makes it possible to configure a DT with a similar performance to HLR for comparison. The two employed DTs are configured with (1) a maximal depth of 5, leading to similar results as HLR for amotu, and (2) no maximal depth, showing the best possible results for a DT. RF employs 100 DTs with no maximal depth.

Table 1. Comparison of the classification accuracy

Data   | HLR  | DT5         | DT∞         | RF∞100      | C−HR
amotu  | .747 | .734 ± .034 | .806 ± .039 | .950 ± .019 | .989 ± .008
small  | –    | .438 ± .019 | .724 ± .027 | .889 ± .021 | .978 ± .008
full   | –    | .328 ± .019 | .589 ± .028 | .811 ± .022 | .964 ± .009

Table 1 shows the results. In the table, the mean accuracies of the 10 folds of the cross-validation and the matching standard deviations are presented. It can be observed that C−HR clearly outperforms all other approaches. Thereby, it is remarkable that the classification accuracy decreases only a little while increasing the number of letters – from 0.989 for five letters to 0.964 for all 59 letters. The results of the DTs and RF show how the classification accuracy can significantly decrease if the problem gets more complex due to more letters. The essential result is the one employing all letters. For this task, C−HR only makes 3.6% prediction errors. Compared to the previous approach, this is a huge improvement.

HLR has only been applied to amotu and performs similarly to a DT with a maximal depth of 5. Obviously, each approach scales differently. However, the results indicate that an approach which is able to predict amotu with a classification accuracy of less than 0.75 will likely predict all letters with an accuracy of significantly less than 0.5. Even an approach which is able to achieve a score of 0.95 for amotu is just slightly above 0.8 for all letters. Thus, a word with five letters is recognized correctly in less than one-third of the cases (0.8⁵ < 1/3) by such an approach.

Individual Components of C−HR

Now, we show the results of the components of C−HR, i.e., the performance of the deep visual approach, of the deep movement approach, and of C−HR. This indicates the contributions of the individual components of C−HR. We emphasize this point for two examples at the end of this section. Table 2 presents the classification accuracies.

Table 2. Comparison of the accuracy of the individual components of C−HR

Data   | Visual      | Movement    | C−HR
amotu  | .991 ± .007 | .986 ± .011 | .989 ± .008
small  | .962 ± .011 | .975 ± .007 | .978 ± .008
full   | .920 ± .013 | .948 ± .011 | .964 ± .009

For amotu, the visual approach works very well. This makes sense as the letters A, M, O, T, U are visually significantly different. For both other data sets, the movement approach performs better than the visual one. This indicates that the additional information due to the movement perspective is beneficial. In all cases, C−HR is very close to the best approach or it is the best. It seems that the more complex the data get, the higher are the benefits from the two perspectives.

Two examples of letters which benefit from the two perspectives are the small b and the small l. The small letter b is correctly classified by the visual approach with an accuracy of 98.0%. The movement approach only achieves 86.2%. For the small letter l, the accuracies are reversed. The visual approach achieves 87.9% while the movement approach classifies the letter with an accuracy of 95.5%. To make this clearer, Figure 6 presents four letters which are correctly classified by C−HR. The left two letters are a small b. They are correctly labeled using the visual approach; the movement approach classifies both letters as a small t. The right two letters are a small l. They are also correctly labeled by C−HR; the visual approach classifies both letters as a capital I.

Fig. 6. Example of four letters which are correctly labeled by only one approach

Sensitivity and Specificity of Letters

To further analyze the performance of our approach, we present the sensitivity and specificity of each letter. For a specific letter, the sensitivity indicates the probability of correctly classifying this letter. The specificity indicates the probability that a different letter is correctly not classified as this letter. For example, assume a small letter a is written. A high sensitivity for the letter a shows that there is a high chance to correctly predict this letter. A high specificity for the letter b indicates that there is a high chance that the written letter is correctly not labeled as a b.
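Both measures can be computed per letter from a confusion matrix. The following small sketch (our illustration, not the authors' evaluation code) shows one way to do this:

```python
import numpy as np

def per_class_sensitivity_specificity(conf):
    """Compute per-letter sensitivity and specificity from a confusion
    matrix `conf`, where conf[i, j] counts letters of true class i
    predicted as class j. Sensitivity of class i is TP / (TP + FN);
    specificity is TN / (TN + FP), i.e., how often other letters are
    correctly not labeled as class i."""
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp
    fp = conf.sum(axis=0) - tp
    tn = conf.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)
```

With 59 classes, each letter's specificity is computed against the 58 other letters, which is why the specificity values are generally very close to 1.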
Table 3 presents the results; we employ the first fold of the cross-validation.

Table 3. Sensitivity and specificity analysis (color-coded grid of the sensitivity and specificity of all 59 letters, with a legend of the intervals i0, ..., i5)

As shown in the legend of the table, which is located in its lower right part, there are five different colors from light vanilla to red. Red stands for relatively many errors. The lighter the color gets, the fewer errors are made, and if there is no color, there are no errors. For the sensitivity, the thresholds for the intervals i0, ..., i5 are 80%, 85%, 90%, 95%, and 100%; for the specificity, they are 99.80%, 99.85%, 99.90%, 99.95%, and 100%. The thresholds for the specificity are much higher as an unknown letter is classified as one letter and is not classified as the 58 other letters. All results are close to the thresholds, except the sensitivity of the capital P, which is often wrongly classified as a small p: only in 16.1% of the cases is the letter correctly classified. This makes sense as, without using the context, it is very hard to distinguish between the letters p and P.

This points out that there is potential for improvement, e.g., it might be an approach to add relative height information to our approach. Thereby, we need to keep in mind that our approach has to stay completely free of semantic context. To visualize the problems with p, Figure 7 shows four letters: on the left side, two capital letters P, and on the right side, two small letters p. All letters are labeled as p. The figure indicates that the distinction between the small and capital p is very challenging without additional information.

Fig. 7. Examples of four small and capital letters p

Besides the problems with the capital P, which is often classified as the small p, the table visualizes other noticeable findings. For example, the sensitivity of the capital I is relatively low, leading to a relatively low specificity of the small l. Also, the capital B and the special character ß can be written very similarly. This can be seen in the sensitivity of the special character ß and the specificity of the capital B. Overall, the results of the sensitivity and specificity are satisfactory. Multiple letters are recognized perfectly and a lot of letters are within the first interval. The letters within the last interval point to opportunities to improve our approach.

5 Conclusions

Handwriting is a crucial skill and is important for success at school. This includes correct orthography. To realize educational software which supports the development of orthography, a context-free handwriting recognition is required. In contrast to usually used recognitions, a context-free handwriting recognition does not imply a correction of misspelled words, which makes it possible to detect all writing errors. This is essential for educational software.

In our work, we developed a context-free handwriting recognition which is based on multiple techniques from the field of deep learning. It interprets the data from two perspectives: a visual one and a movement-oriented one. The visual perspective is realized as a CNN and the movement perspective as an LSTM. Both perspectives are combined by a DNN. In the experimental results, our new approach shows its superior classification ability compared to the previous approach. It turned out that the visual and movement approach can benefit from each other, especially if the data set contains many letters. Additionally, our results show that it is very challenging to distinguish between capital and small letters in some cases, e.g., in the case of the letter p.

In our future work, we are going to tackle the potential for further improvements, e.g., the implementation of a semantics-free support to better distinguish between small and capital letters. Also, an analysis of the data like in [15] could lead to further findings.

References

1. Aghdam, H., Heravi, E.: Guide to Convolutional Neural Networks: A Practical Application to Traffic-Sign Detection and Classification. Springer International Publishing (2017), https://books.google.de/books?id=72UkDwAAQBAJ
2. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer (2007)
3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (Oct 2001)
4. Frerichs, J.: Tintenklex Legasthenie Software, Digita-Preisträger 2006, Legasthenie und Dyskalkulie (2018), accessed March 2018: http://www.legasthenie-software.de/
5. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. The MIT Press (2016)
6. Graves, A., Mohamed, A.r., Hinton, G.: Speech recognition with deep recurrent neural networks.
In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6645–6649 (May 2013)
7. Grund, M.: Rechtschreibtraining mit Computer und Lernkartei (2018), accessed March 2018: https://www.gut1.de/
8. Gut, J., Reimann, G., Grob, A.: Kognitive, sprachliche, mathematische und sozial-emotionale Kompetenzen als Prädiktoren späterer schulischer Leistungen. Zeitschrift für Pädagogische Psychologie 26(3), 213–220 (2012)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (Nov 1997)
10. Jain, L.C., Medsker, L.R.: Recurrent Neural Networks: Design and Applications. CRC Press, Inc., Boca Raton, FL, USA (1999)
11. Jozefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. pp. 2342–2350. ICML'15, JMLR.org (2015), http://dl.acm.org/citation.cfm?id=3045118.3045367
12. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks 8(1), 98–113 (Jan 1997)
13. LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object Recognition with Gradient-Based Learning, pp. 319–345. Springer Berlin Heidelberg, Berlin, Heidelberg (1999)
14. Lu, L., Zheng, Y., Carneiro, G., Yang, L.: Deep Learning and Convolutional Neural Networks for Medical Image Computing: Precision Medicine, High Performance and Large-Scale Datasets. Advances in Computer Vision and Pattern Recognition, Springer International Publishing (2017)
15. Lückehe, D., Schmidt, K., Plotz, K., von Voigt, G.: Analysis of sound localization data generated by the extended Mainzer Kindertisch. In: KI 2017: Advances in Artificial Intelligence - 40th Annual German Conference on AI, Dortmund, Germany, September 25-29, 2017, Proceedings. pp. 330–336 (2017)
16. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning. pp. 807–814. ICML'10, Omnipress, USA (2010)
17. Olson, D., Delen, D.: Advanced Data Mining Techniques. Springer Berlin Heidelberg (2008)
18. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation 29(9), 2352–2449 (2017)
19. Reinders, C., Baumann, F., Scheuermann, B., Ehlers, A., Mühlpforte, N., Effenberg, A., Rosenhahn, B.: On-the-fly handwriting recognition using a high-level representation. In: Computer Analysis of Images and Patterns - 16th International Conference, CAIP 2015, Valletta, Malta, September 2-4, 2015, Proceedings, Part I. pp. 1–13 (2015)
20. Rokach, L., Maimon, O.: Data Mining with Decision Trees: Theory and Applications. World Scientific Publishing Company (2014)
21. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014)
22. Steinbrink, C., Lachmann, T.: Lese-Rechtschreibstörung: Grundlagen, Diagnostik, Intervention. Springer Berlin Heidelberg (2014)