Three-Dimensional Convolutions and Temporal Data for Sign Language Recognition

Serhii Kondratiuk a,b, Iurii Krak a,b, Vladislav Kuznetsov a and Anatoliy Kulias a
a Glushkov Cybernetics Institute, Kyiv, 40, Glushkov ave., 03187, Ukraine
b Taras Shevchenko National University of Kyiv, Kyiv, 64/13, Volodymyrska str., 01601, Ukraine

Information Technology and Implementation (IT&I-2021), December 01-03, 2021, Kyiv, Ukraine
EMAIL: sergey.kondrat1990@gmail.com (S. Kondratiuk); krak@univ.kiev.ua (I. Krak); kuznetsowwlad@gmail.com (V. Kuznetsov); anatoly016@gmail.com (A. Kulias)
ORCID: 0000-0002-5048-2576 (S. Kondratiuk); 0000-0002-8043-0785 (I. Krak); 0000-0002-1068-769X (V. Kuznetsov); 0000-0003-3715-1454 (A. Kulias)

Abstract
A technology is proposed for recognition of gesture units (the fingerspelling alphabet) of sign language. The implemented technology recognizes dactyl items from camera input using a convolutional neural network trained on a collected dataset, based on the MobileNetV2 architecture with a spatio-temporal overlapping approach. Multiple configurations were evaluated experimentally, and the configuration optimal in terms of complexity and quality was selected. An accuracy of over 96% is achieved on the test dataset.

Keywords
Sign language, recognition, convolutional neural network, MobileNetV2.

1. Introduction

Sign language is a widespread means of communication among people with special communication requirements. In order to connect with society and within their own group, people with hearing impairments can make use of supplementary software. Such information technology can include gesture recognition for learning the dactyl alphabet.

In recent years smartphones, along with personal computers and laptops, have risen in popularity as devices with their own operating systems. Cross-platform development is therefore critical: it enables the technology to be developed and operated without modifying the code, and it gives users a consistent experience across a variety of platforms, mobile and stationary, low-resource and powerful. Gesture recognition is increasingly used in fields such as communication and human-computer interaction. When it comes to platform variety, one solution is to use distributed computing and cross-platform programming [1, 2]. Instead of using virtual machines [3] or doing a lot of mono-platform programming, cross-platform development can be used [4, 5].

Using machine learning methods and neural networks, this study attempts to recognize sign language gestures and construct cross-platform modules that can operate on a range of current devices. Sign (gesture) recognition is part of single-gesture communication technology, and this article builds on past work by the authors [6, 7, 8].

2. Existing approaches

Hand gesture detection may be seen as a form of object recognition challenge, for which a number of mature algorithms exist in conventional computer vision as well as in deep learning, convolutional neural networks in particular. Convolutional neural networks with 3-dimensional convolutions became effective when larger datasets with recorded activities became available (ImageNet [9], Sports-1M [10], Kinetics [11], Jester [12]). The size of such datasets makes it possible to train models without worrying about overfitting [13].
Various algorithms based on conventional computer vision with hand-crafted features, such as orientation histograms [4], histograms of oriented gradients (HOG) [14], bag-of-features [15], and hyperplane separation [16], have been used to recognize sign language gestures. As with other computer vision applications, current state-of-the-art hand gesture recognition architectures [17, 18, 19] are based on CNNs. Research on current CNN methodologies [20, 21] has focused on lightweight architectures, such as SqueezeNet [22], MobileNet [23], MobileNetV2 [24], ShuffleNet [25] and ShuffleNetV2 [26], and MobileNetV3 [27], which seek to minimize computational cost while maintaining high accuracy. Both the 2D and the 3D versions of MobileNetV2 are utilized in this work.

3. Problem statement

The suggested system should include a module for recognizing sign language gestures. Modules should be cross-platform compatible and execute on different systems without changes to the codebase. The gesture recognition module should contain a model that detects and recognizes the gesture shown by the user in the camera input. The gesture set is constrained to the Ukrainian dactyl language, although it may be expanded. A suitable dataset of the Ukrainian dactyl language must be gathered to evaluate the model's performance [28]. To reach high accuracy and frame rate on many platforms while employing cross-platform technologies, the gesture recognition module should use a model that shows robust, state-of-the-art performance as well as high efficiency in terms of processing resources.

4. Proposed approach

The proposed approach uses cross-platform technologies to create Ukrainian dactyl language recognition software that can work on several operating systems without modifying the code base. Tensorflow [29, 30] is suggested as a cross-platform framework for developing the gesture recognition module. By using a cross-platform machine learning framework, a gesture recognition model may be constructed and trained just once and then deployed across different platforms (mobile, desktop, and web) with no need to change the model or the training code. The suggested innovation is a unified cross-platform technology for Ukrainian dactyl language recognition built around an upgraded MobileNet architecture for better recognition of the Ukrainian dactyl alphabet.

5. Gesture recognition

Gesture recognition for the Ukrainian dactyl language should be built with cross-platform tools as part of the cross-platform technology. Ong et al. [31] offer a technique based on sequential pattern mining over tree structures for the detection of signs. In the field of image and video analysis, convolutional neural networks (CNNs) are most typically used as regularized versions of multilayer perceptrons. CNNs excel in image analysis because they take into account the spatial structure of the data they process: in an image, nearby samples are strongly related, which is generally not true of arbitrary input data. As a consequence, CNNs deliver state-of-the-art image classification and recognition results.

In addition to recognizing spatial characteristics, dynamic gesture recognition requires modeling the temporal component of motions. Spatio-temporal classifiers can recognize sequences of spatial descriptors or pictures by creating descriptors that incorporate both spatial and temporal information. This is the technique adopted in this case study. When a video stream has to be analyzed, either a single gesture picture or a succession of images may be used as input data for the neural network. The network can be made more resistant to changes in a dynamic object such as the hand by looking at several surrounding frames concurrently. This enables the network to be trained to take the temporal element into account, i.e. the dynamics of motion across several frames. If a picture has artifacts, poor lighting, blur, or partial occlusion, recognition can be smoothed out using neighboring frames in the sequence.
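To make this temporal element concrete, the following minimal sketch (our own illustration, not the authors' code; frame resolution, clip length, and filter count are assumed values) stacks neighboring frames into a single spatio-temporal tensor and passes it through one TensorFlow Conv3D layer, which convolves over time as well as over image space:

import numpy as np
import tensorflow as tf

CLIP_LEN = 5          # number of neighboring frames fed to the network
H, W, C = 96, 96, 3   # assumed frame resolution after preprocessing

# Stand-in video: 30 frames of H x W RGB, values normalized to [0, 1]
video = np.random.rand(30, H, W, C).astype("float32")

# Take CLIP_LEN consecutive frames around frame i -> one spatio-temporal sample
i = 10
clip = video[i - CLIP_LEN // 2 : i + CLIP_LEN // 2 + 1]   # (CLIP_LEN, H, W, C)
clip = clip[np.newaxis]                                   # add batch dimension

# A single 3D convolution slides over the time axis as well as image space
conv3d = tf.keras.layers.Conv3D(filters=16, kernel_size=(3, 3, 3),
                                padding="same", activation="relu")
features = conv3d(clip)
print(features.shape)  # (1, CLIP_LEN, H, W, 16)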
6. Spatio-temporal approach

A floating temporal window was suggested and implemented to improve the efficiency of this strategy. To achieve this, the input sequence is divided into n subsequences, each having a minimum length of m, such that adjacent subsequences overlap by a certain portion (from 10 to 50 percent of the subsequence length), as shown in Fig. 1.

The same dactyleme may have different exterior characteristics on different hands (the challenge of identifying these differences falls on the recognition model), and video sequences and individual frames may also differ in their data parameters (size, quality, focal length, lighting, background, artifacts, blur, etc.). A uniform data processing approach was designed to convert them to a generic form for further computations inside the specified recognition model, both at the training and at the recognition stage.

Figure 1: Sequence of frames, divided into two subsequences of five frames, which intersect in three frames

Thus, from one input video stream with a gesture one can obtain n video streams of this gesture of a smaller size. Therefore, the data can be presented as:

D = \{ d_{i-k}, \ldots, d_i, \ldots, d_{i+k} \}, \quad i = \overline{1, n-k},   (1)

where k is the number of previous and subsequent frames relative to the current one, from which a sequence of images is formed (Fig. 2).

Figure 2: Two subsequences created from a single video stream
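A minimal sketch of this floating-window splitting (our own illustration; the function name, the default window length m, and the overlap fraction are assumptions chosen within the 10-50 percent range stated above):

def split_into_subsequences(frames, m=5, overlap=0.4):
    # frames  : ordered frames of one video stream
    # m       : minimum subsequence length
    # overlap : fraction of m shared by adjacent windows (0.1-0.5 per the text)
    step = max(1, int(round(m * (1.0 - overlap))))  # stride between windows
    return [frames[s:s + m] for s in range(0, len(frames) - m + 1, step)]

frames = list(range(12))   # stand-in for 12 video frames
for window in split_into_subsequences(frames, m=5, overlap=0.4):
    print(window)          # [0..4], [3..7], [6..10] -> adjacent windows share 2 frames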
Adapting the neural network to the space-time format of the input data is the goal of this work. Three-dimensional convolutions were introduced to enhance the architecture of the convolutional neural network in order to better exploit the spatio-temporal properties of the input data. Convolution may be performed in time as well as in picture space, which is what makes three-dimensional convolutional neural networks suitable for this purpose. Data processing consists of three stages: normalization; noise reduction; resizing to a single size.

The MobileNetV2 architecture (Fig. 3) is a development of the MobileNet concept. MobileNetV2 adds two new features over its predecessor: linear bottlenecks and residual blocks that use a skip connection to link the beginning and the end of a convolution block. These connections allow the network to retrieve activations from previous blocks that were not altered throughout the convolutional process.

Figure 3: Architecture of MobileNetV2

Two strategies were applied to improve the network: swish non-linearity and layer removal. The latter was accomplished by using a projection layer on top of the final layer of the preceding block, so that the projection and filtering layers of the preceding bottleneck block may be removed.

Using the developed approach of accumulating probabilities from prior subsequences, the research suggests a methodology for smoothing anomalous recognition results. With a floating window with intersections, subsequences succeed one another on the premise that over time the maximum probability of the current gesture decreases while the maximum probability of the following gesture increases. Predictions from earlier subsequences are accumulated, and the current recognition result is updated only when the accumulated score surpasses a certain threshold:

\sum_{\tau = t-n}^{t} \; \sum_{i = \tau-k}^{\tau+k} p_i \ge threshold,   (2)

where p_i is the probability of a gesture in the frame; i is the frame number within the current subsequence; t is the number of the current subsequence; k is the size of the subsequence in both directions; n is the number of accumulated subsequences.
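A minimal sketch of this accumulation rule, under our reading of the reconstructed formula (2); the class count, window size n, and threshold value below are illustrative assumptions, not the authors' settings:

from collections import deque
import numpy as np

class GestureSmoother:
    """Accumulates per-subsequence probabilities over the last n windows."""

    def __init__(self, n=4, threshold=2.5):
        self.history = deque(maxlen=n)  # keeps only the last n predictions
        self.threshold = threshold
        self.current = None             # currently reported gesture index

    def update(self, probs):
        # probs: per-class probability vector predicted for one subsequence
        self.history.append(np.asarray(probs, dtype=float))
        accumulated = np.sum(np.stack(list(self.history)), axis=0)
        best = int(np.argmax(accumulated))
        # switch the reported gesture only once accumulated evidence is strong
        if accumulated[best] >= self.threshold:
            self.current = best
        return self.current

smoother = GestureSmoother(n=3, threshold=1.5)
for p in ([0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3]):
    print(smoother.update(p))   # None until class 1 accumulates past 1.5, then 1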
7. Ukrainian dactyl dataset collection for gesture recognition using MobileNet

A training dataset for the Ukrainian dactyl alphabet was gathered for the first time in such volume and in such variety of signers and environments (Fig. 4). Various lighting settings were used (with distribution: 20 percent of images in poor lighting, 30 percent in mediocre lighting, and 50 percent in good lighting). Noisy and blurry pictures accounted for around 10 percent of the total pictures in the collection. A training data set of around 50,000 original photos was produced.

Figure 4: Dataset example

Additional data augmentation methods (such as rotation, random cropping, mirroring, and so on) resulted in a final data collection of around 150,000 pictures. After selecting a tenth of the whole data set for testing, 135,000 photos remained for training and 15,000 images for final testing.

Data augmentation is the enlargement of an existing dataset without having to manually create additional photos. There are various strategies for augmenting data; all of them increase the quantity and diversity of the pictures while making it less likely for the neural network to overfit to characteristics seen in the original data set. The original data set may be further distorted by combining image alteration techniques, which alters the conditions under which the trained model is evaluated. As part of the data augmentation, the following procedures were performed (Fig. 5): Gaussian noise; affine transformation; cropping with shift; reflection; perspective distortion; blurring.

During dataset collection it is important to maintain statistically significant diversity in the data while keeping a similar distribution between the train and test data, so as not to introduce unwanted bias that is present in the train data but absent in the test data. Such bias would lead the model to worse performance in real-life cases while showing artificially better performance during testing. For instance, Fig. 6 shows the distribution of lighting conditions in the train and test splits of the dataset, represented in three types: poor, mediocre, and good lighting.

Figure 5: Original image (topmost) and augmented images

Figure 6: Distribution of light quality in the train and test datasets

8. Experiments

The software implementation of recognition of the Ukrainian dactyl alphabet was put to the test using a variety of techniques. Several adjustments to the design were made throughout training of the convolutional neural network based on the MobileNetV2 architecture, which shows good quality and performance on mobile devices and devices with low computational capacity. Training may be adjusted through hyperparameters (learning rate, batch size, number of epochs) and through the architecture itself (the number and configuration of repeating layers of the same kind), which are picked for each training run individually. This set of configurations and possible hyperparameter values forms a grid within which a set of neural networks is trained and compared on a single test set.

Five distinct neural network architecture configurations were constructed using the developed technology, each with varying numbers of layers and parameters, allowing for a balanced neural network design that was both small and effective on the test data set. The trained models' accuracy plateaued with time, as seen in Fig. 7, and architecture No. 4 (Table 1) was selected as the best option since it was the smallest and had the highest accuracy (average macro F1 score). Analysis of the confusion matrix of the model predictions (Fig. 8) also helped to select the best approach and the best configuration. Standard strategies for combating neural network overfitting were applied to each training run. The model's prediction time is sufficient for real-time (24 fps) performance on an Nvidia K80 GPU.

Figure 7: Model quality related to architecture and number of iterations

Table 1: Architectures considered

An example grid:

{ learning_rate: [0.001, 0.0001],
  batch_size: [8, 16, 32],
  layers_config: [config1, config2, config3] }   (3)
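A minimal sketch of training over such a grid (our own illustration; train_and_evaluate is a hypothetical stand-in for the actual training routine, and the returned score is a placeholder):

from itertools import product

def train_and_evaluate(learning_rate, batch_size, layers_config):
    # Hypothetical stand-in: would train a MobileNetV2 variant with these
    # settings and return its macro F1 score on the shared test set.
    return 0.0

grid = {
    "learning_rate": [0.001, 0.0001],
    "batch_size": [8, 16, 32],
    "layers_config": ["config1", "config2", "config3"],
}

results = []
for lr, bs, cfg in product(*grid.values()):  # all 18 combinations
    score = train_and_evaluate(lr, bs, cfg)
    results.append(((lr, bs, cfg), score))

# pick the configuration with the highest macro F1 on the common test set
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)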
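For the comparison itself, a sketch of the evaluation metrics named above (macro-averaged F1 and the confusion matrix), assuming scikit-learn is available; the labels here are toy values standing in for test-set annotations:

from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # ground-truth dactyl class labels (toy data)
y_pred = [0, 1, 2, 1, 1, 0]   # model predictions (toy data)

macro_f1 = f1_score(y_true, y_pred, average="macro")  # selection criterion
cm = confusion_matrix(y_true, y_pred)                 # per-class error analysis
print(f"macro F1: {macro_f1:.3f}")
print(cm)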
9. Conclusions

The technology includes a core gesture recognition module that uses a database of gesture specifications provided in YAML format. Proven data processing techniques show that a dedicated model architecture and data preparation procedures are required to enhance outcomes for gesture detection in video. It was shown that adopting the MobileNetV2 architecture with three-dimensional convolutions and spatio-temporal overlapping subsequences improved the quality of recognition when compared to studies using other model architectures and data sets. As a result of the selection procedure, the model's complexity and recognition efficiency are optimally aligned. On the dedicated test set, the model reached a quality of 0.96 macro F1 score.

A picture data collection containing all 50 Ukrainian dactyls shown by 50 distinct persons was gathered for the first time as part of the suggested implementation and augmented to 150,000 photos. Other gestures and languages, as well as further cross-platform modules, may be added to the proposed gesture communication system.

Figure 8: Confusion matrix of architecture No. 4

10. References

[1] P. Mell, T. Grance. The NIST Definition of Cloud Computing (Technical report). National Institute of Standards and Technology, U.S. Department of Commerce, September 2011. Special publication 800-145. doi:10.6028/NIST.SP.800-145
[2] The Linux Information Project. Cross-platform Definition. www.linfo.org
[3] Yu.V. Krak, A.V. Barmak, E.M. Baraban. Usage of NURBS-approximation for construction of spatial model of human face. Journal of Automation and Information Sciences, 43(2) (2011): 71-81. doi:10.1615/JAutomatInfScien.v43.i2.70
[4] W.T. Freeman, M. Roth. Orientation histograms for hand gesture recognition. In International Workshop on Automatic Face and Gesture Recognition, volume 12, pages 296-301, 1995.
[5] J. Smith, R. Nair. The Architecture of Virtual Machines. Computer, IEEE Computer Society, 38(5) (2005): 32-38.
[6] S. Kondratiuk, I. Krak. Dactyl Alphabet Modeling and Recognition Using Cross Platform Software. In Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining and Processing, DSMP 2018, pages 420-423. doi:10.1109/DSMP.2018.8478417
[7] Yu.V. Krak, Yu.V. Barchukova, B.A. Trotsenko. Human hand motion parametrization for dactylemes modeling. Journal of Automation and Information Sciences, 43(12) (2011): 1-11. doi:10.1615/JAutomatInfScien.v43.i12.10
[8] I.G. Kryvonos, I.V. Krak. Modeling human hand movements, facial expressions, and articulation to synthesize and visualize gesture information. Cybernetics and Systems Analysis, 47(4) (2011): 501-505. doi:10.1007/s10559-011-9332-4
[9] A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725-1732, 2014.
[11] J. Carreira, A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference, pages 4724-4733. IEEE, 2017.
[12] T. B. N. GmbH. The 20BN-Jester dataset v1. https://20bn.com/datasets/jester, 2019.
[13] K. Hara, H. Kataoka, Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pages 18-22, 2018.
[14] L. Prasuhn, Y. Oyamada, Y. Mochizuki, H. Ishikawa. A HOG-based hand gesture recognition system on a mobile device. In 2014 IEEE International Conference on Image Processing (ICIP), pages 3973-3977. IEEE, 2014.
[15] N.H. Dardas, N.D. Georganas. Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques. IEEE Transactions on Instrumentation and Measurement, 60(11) (2011): 3592-3607.
[16] I.V. Krak, G.I. Kudin, A.I. Kulias. Multidimensional Scaling by Means of Pseudoinverse Operations. Cybernetics and Systems Analysis, 55(1) (2019): 22-29. doi:10.1007/s10559-019-00108-9
[17] O. Kopuklu, N. Kose, G. Rigoll. Motion fused frames: Data level fusion strategy for hand gesture recognition. arXiv preprint arXiv:1804.07187, 2018.
[18] P. Molchanov, S. Gupta, K. Kim, K. Pulli.
Multi-sensor system for driver's hand-gesture recognition. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops, volume 1, pages 1-8. IEEE, 2015.
[19] P. Molchanov, S. Gupta, K. Kim, J. Kautz. Hand gesture recognition with 3D convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1-7, June 2015.
[20] J. Hu, L. Shen, G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132-7141, 2018.
[21] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[22] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[23] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510-4520. IEEE, 2018.
[25] X. Zhang, X. Zhou, M. Lin, J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6848-6856. IEEE, 2018.
[26] N. Ma, X. Zhang, H.-T. Zheng, J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164, 2018.
[27] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244, 2019.
[28] ASL Sign language dictionary. http://www.signasl.org/sign/model
[29] Unity3D framework. https://unity3d.com/
[30] Tensorflow framework documentation. https://www.tensorflow.org/api/
[31] E.-J. Ong et al. Sign language recognition using sequential pattern trees. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference, pages 2200-2207. IEEE, 2012.