A Deep Learning Framework for Human Action Recognition on YouTube Videos

Rahul Darelli a, Mahathi L P V S b, Koushik M V c and Dr. Praveen Kumar Kollu d
a V R Siddhartha Engineering College, Kanuru, Vijayawada, 520007, India

WINS-2022: Workshop on Intelligent Systems, April 22 – 24, 2022, Chennai, India.
EMAIL: 188w1a0573@vrsiddhartha.ac.in (Rahul Darelli)
ORCID: 0000-0001-9697-1220 (Rahul Darelli)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Human Action Recognition (HAR) remains a challenging problem that has attracted numerous studies and experiments over the last decade. Deep learning techniques such as convolutional neural networks (CNNs) have made it possible to improve the performance of HAR systems over traditional methods. CNNs are widely used for image analysis, while LSTM networks are better suited to predicting and analysing sequences of data; combining the two gives us the strengths of both CNNs [7] and LSTMs [8], so that difficult computer vision problems such as video classification can be solved. In this work we train and test two deep learning models, ConvLSTM and LRCN, on the UCF50 Action Recognition Dataset, which consists of 50 action categories with the videos of each category arranged into 25 groups. ConvLSTM follows an approach similar to the LSTM [9], carrying out convolution and recurrent computation together, whereas LRCN combines convolutional and LSTM layers in a single model. We measure the accuracy of both the ConvLSTM and LRCN models, and the model with the higher accuracy is then applied to YouTube videos to predict the human action being performed.

Keywords
HAR, CNNs, LSTM, ConvLSTM, UCF50 dataset, 50 actions, 25 groups per action, LRCN, accuracy, YouTube videos.

1. Introduction
Recognition of human activities from sources such as recorded videos or real-time cameras builds on various image processing techniques. To understand what an image contains, we can simply feed it to an image classifier or a pre-trained deep neural network. A video, in turn, is just a collection of frames, so recognizing an activity in a video or a real-time camera stream amounts to analysing the collected frames, running an image classifier on each frame, labelling each frame with the classifier's output, and finally selecting the label that occurs most frequently across the frames. This is one of the traditional approaches for recognizing human activity in the sequential data of a video. However, it is not always correct: it is not very effective and it does not consider all the aspects involved in the video. For example, if the sequence of collected frames contains jumping but mostly shows standing, the traditional approach outputs "standing" even though the video actually contains a jumping activity. In such cases this approach fails to give accurate results.
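To make this traditional baseline concrete, the sketch below classifies every frame independently with a stand-in pre-trained ImageNet classifier and takes a majority vote over the per-frame labels. It is purely illustrative: the classifier (MobileNetV2), the function name and the use of OpenCV are assumptions, not the models used later in this paper.

```python
from collections import Counter

import cv2
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (MobileNetV2,
                                                         decode_predictions,
                                                         preprocess_input)

classifier = MobileNetV2(weights="imagenet")  # any pre-trained image classifier will do

def classify_video_by_voting(video_path):
    """Label every frame with the image classifier and return the most frequent label."""
    cap = cv2.VideoCapture(video_path)
    labels = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224)).astype("float32")
        preds = classifier.predict(preprocess_input(np.expand_dims(frame, axis=0)))
        labels.append(decode_predictions(preds, top=1)[0][0][1])  # per-frame label
    cap.release()
    return Counter(labels).most_common(1)[0][0]  # majority vote across all frames
```

Because the final decision is simply the most frequent per-frame label, the temporal order of the frames is ignored entirely, which is exactly why the standing/jumping example above is misclassified.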
Another approach to recognizing human activity is to use a CNN, a type of deep neural network. Generally, a CNN works by taking an image and generating feature maps, which are representations of particular features or locations in the image. As the network gets deeper, the number of feature maps increases. The disadvantage of this approach is that it is expensive and its computation is very slow. Since working with videos involves sequenced data, LSTMs are the best method for this type of data; their limitation is that all the data other than the landmarks is ignored. Combining a CNN with an LSTM therefore results in better accuracy for recognizing human activities: with this approach we can extract spatial features from a video sequence and then identify the temporal relations between frames. This way of combining a CNN and an LSTM is called the LRCN approach. The proposed LRCN model achieved an accuracy of 92.62%, whereas the ConvLSTM model achieved 80.33%. Since LRCN achieved the higher accuracy, we use the LRCN model on YouTube videos to recognize the activities involved. We use the pafy [15] library to download videos in Colab, which requires only the URL of the YouTube video.

2. Literature Survey
T. Liu, Y. Song, Y. Gu and A. Li [1] proposed a methodology for human activity recognition using Microsoft Kinect sensors, Hidden Markov Models (HMMs), and k-means clustering. The method contains two main modules: training the actions with the above model and identifying actions using the sensors. In the action-training module, a depth image is first taken as input, skeleton information is derived and features are extracted; because the frames frequently have similar coordinates, they are divided into 50 frames and converted into clusters using the k-means clustering algorithm, and an HMM with three hidden layers is then created. In the recognition module, the same action-learning process is repeated up to frame extraction, clusters are assigned, an observation sequence is produced, and the recognized action is given as output. The study covers seven actions, such as moving the hands upwards and downwards, pushing forward, and drawing circles in clockwise and counter-clockwise directions. The approach was inspired by the high accuracy of the Kinect in posture and gesture recognition, as it innovatively separates an action into several clusters for recognition, and it achieved an accuracy of 91.4%.

Kamel, B. Sheng, P. Yang, P. Li, R. Shen, and D. D. Feng [2] used two types of data sequence as inputs: a joint posture sequence and a depth map sequence. These are transformed into descriptors, MJD for the body posture and DMI for the depth map, and the inputs are then pre-processed. Three CNN models are trained on three different channels (Ch1, Ch2, and Ch3) and tested with different inputs: one CNN channel is trained on depth map images, another on joint postures, and the third on both joint postures and depth map images. All the outputs are fused with a score-fusion operation and the final action is classified.

N. Jaouedi, N. Boujnah, O. Htiwich, and M. S. Bouhlel [3] concentrated mainly on analysing human behaviour in recordings from a camera or another electronic source, and also focused on background actions such as fast walking and sudden movement.
This model was mainly designed to predict human behaviour through movement analysis. The study performed human action recognition using the K-Nearest Neighbours (KNN) approach together with a Gaussian Mixture Model (GMM), which is generally used for data analysis. The GMM focuses on the areas in which the pixels of the current state change from the previous state across the sequence of collected frames. For better performance, the proposed algorithm converts each frame into a binary image, assigning 0 (black) to the background and 1 (white) to the moving foreground. A Kalman filter is used to track the moving human; these filters repeatedly operate in two phases, prediction and correction, where the prediction phase estimates the current state using the information of the previous state. The main agenda of the study is to obtain an efficient output. Finally, classification is performed with the KNN method, achieving a recognition rate of 71.1%.

Q. Xiao and Y. Si [4] used a deep neural network model built from an autoencoder and a pattern recognition neural network (PRNN) to predict the actions performed by human beings. They used two stages: a learning stage and an action recognition stage. In the learning stage, they build a binary frame for each frame by drawing the outline of the human body and then join all the frames; these frames are used to train the model. In the other stage, they use an autoencoder to train the model to predict the characteristics of actions. After these two stages, the PRNN model is trained using an unsupervised learning technique, and in the end the autoencoder followed by the PRNN is merged into a model named APRNN. To evaluate APRNN's performance they used the Weizmann motion dataset, which consists of 93 recorded action clips with 10 motion semantics, and they applied fine-tuning to achieve better performance.

Y. Ji, Y. Yang, F. Shen, H. T. Shen and X. Li [5] mainly studied the analysis of human actions on robotic platforms, considering the various steps involved in the recognition and prediction of human actions. The paper divides the human action recognition field into three main categories: hand-gesture-based HRI, body-action-based HRI, and multi-modal fusion. It discusses the platforms and datasets commonly used in the field of HRI, as well as the challenges and opportunities in action analysis for human recognition, and concludes that in the future datasets should be built to address the storage problems related to data.

For 3D action recognition, L. Wang, D. Q. Huynh and P. Koniusz [6] evaluated a total of ten Kinect-based algorithms on six datasets for cross-view and cross-subject detection: HON4D, HDG, LARP-SO, HOPC, SCK+DCK, P-LSTM, HPM+TM, clips+CNN+MTLN, indRNN, and ST-GCN. A 3D action analysis was also carried out to compare the results of cross-subject and cross-view action recognition, and it was concluded that depth-based action recognition techniques are better at recognizing objects with greater detail. They performed an extensive evaluation of the HDG representation with several variants of the descriptor types and also introduced four variants of the P-LSTM framework.

3. Planned Procedure
The main objective is to develop ConvLSTM and LRCN models and to measure the accuracy of both.
A further objective is to perform activity prediction on YouTube videos using the LRCN model. The proposed ConvLSTM [11] model achieved an accuracy of 80.33% and LRCN achieved 92.62%; since we consider the highest accuracy, LRCN is the better of the two, and we test the LRCN model on YouTube videos.

Figure 1: Proposed Architecture

The diagram above shows our proposed architecture. Our proposed work involves five modules: pre-processing the dataset, dividing the structured dataset into training and testing sets, implementing ConvLSTM, implementing LRCN, and testing the more accurate model on YouTube videos.

3.1. Module - 1: Pre-Processing the Dataset
Data pre-processing is the process of extracting useful information from collected raw data. To perform data pre-processing, we must first choose relevant raw data. The raw data we chose is UCF50, which contains videos of different activities. The dataset consists of 50 activity categories, with the videos of each category arranged into 25 groups, and on average there are 133 videos per action category. On average each video has 199 frames, the average frame width is 320 pixels and the height 240 pixels, and there are roughly 26 frames per second in each video. For our experiments we selected only 20 random categories. The first frame of a selected video is displayed with its associated label, which lets us inspect the 20 randomly chosen categories in the dataset.

We then perform pre-processing on the dataset in two steps: the first step is to create an extract_frames() function and the second is to create a dataset_creation() function. In the first step, the extract_frames() function extracts the frames, resizes them, and normalizes the pixel values by dividing them by 255; frames that do not contain any information about the activity are discarded, and the function finally returns the useful frames. The frame width and height are set to 64 x 64 pixels, which is a common format for pre-processing, and the sequence length is set to 20, which we use as the default throughout the project. In the second step, the dataset_creation() function maps the features, labels and video path for each category of video selected above. For skipping frames, we use the following formula:

skip_f = max(int(v_f_c / S_L), 1)

where v_f_c is the count of video frames and S_L is the sequence length, i.e., S_L = 20. Since our dataset UCF50 [10] contains on average 199 frames per video, we chose a sequence length of 20. Frame skipping works as follows: if there are 199 frames in a video, then 199/20 = 9.95 ≈ 10, so we sample every 10th frame from the video's sequence of frames. Finally, we convert the encoded class indexes to one-hot vectors using the Keras library in Python.

Figure 2: Dataset Information

3.2. Module - 2: Divide the Structured Dataset into Training and Testing Sets
Before training and testing [12] our proposed models, we first need to divide the constructed structured dataset into a training dataset and a testing dataset. The most important requirements for splitting the data are the features and the one_hot_encoded_labels of the categorical data. With these, we divide the dataset into 75% for training and 25% for testing. To avoid any kind of bias we shuffle the dataset by setting shuffle = True, and we also set random_state to 27. We split the dataset using the sklearn library in Python, and the resulting training and testing sets are used for both the ConvLSTM and LRCN models.
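The paper shows its pre-processing code only as figures, so the following is a minimal sketch of Modules 1 and 2 under stated assumptions: the dataset directory name, the use of OpenCV to read frames, the simple category selection (the paper picks 20 categories at random), and Keras's to_categorical for one-hot encoding are assumptions, while the 64 x 64 frame size, the sequence length of 20, the skip formula, and the 75/25 split with shuffle = True and random_state = 27 follow the text.

```python
import os

import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

IMG_HEIGHT, IMG_WIDTH = 64, 64   # frame size used throughout the project
SEQUENCE_LENGTH = 20             # S_L in the skip formula
DATASET_DIR = "UCF50"            # assumed location of the extracted dataset
CLASSES = sorted(os.listdir(DATASET_DIR))[:20]  # stand-in for the 20 randomly selected categories

def extract_frames(video_path):
    """Return SEQUENCE_LENGTH evenly spaced, resized and normalized frames."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    v_f_c = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # count of video frames
    skip_f = max(int(v_f_c / SEQUENCE_LENGTH), 1)    # skip_f = max(int(v_f_c / S_L), 1)
    for i in range(SEQUENCE_LENGTH):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * skip_f)
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (IMG_WIDTH, IMG_HEIGHT))
        frames.append(frame / 255.0)                 # normalize pixel values
    cap.release()
    return frames

def dataset_creation():
    """Map features, labels and video paths for every selected category."""
    features, labels, paths = [], [], []
    for label, cls in enumerate(CLASSES):
        cls_dir = os.path.join(DATASET_DIR, cls)
        for name in os.listdir(cls_dir):
            path = os.path.join(cls_dir, name)
            frames = extract_frames(path)
            if len(frames) == SEQUENCE_LENGTH:       # keep only complete sequences
                features.append(frames)
                labels.append(label)
                paths.append(path)
    return np.asarray(features), np.asarray(labels), paths

features, labels, video_paths = dataset_creation()
one_hot_encoded_labels = to_categorical(labels)      # class indexes -> one-hot vectors

# Module 2: 75% / 25% split with shuffling, as described in the text
features_train, features_test, labels_train, labels_test = train_test_split(
    features, one_hot_encoded_labels,
    test_size=0.25, shuffle=True, random_state=27)
```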
3.3. Module - 3: Implementing the ConvLSTM Approach
In this module we introduce the concept of ConvLSTM cells. A ConvLSTM cell is essentially an LSTM cell with convolution operations built in, which allows the network to identify spatial features of the data.

Figure 3: ConvLSTM Architecture

This method is useful for video classification because it captures the spatial relations between the various frames. It can also take 3D input, which differs from the usual approach to modelling spatio-temporal data.

Step 1: We construct the model using Keras ConvLSTM2D recurrent layers. The layers feed into a Dense layer, which predicts and matches the output to its associated activity. MaxPooling3D [14] layers are used to reduce the number of frames and to prevent overfitting the model; since we focus on limited data, we do not require a larger-capacity model. While constructing this ConvLSTM [11] model we use 4 filters for the initial layer and then increase the number of filters layer by layer: 8 filters in the second layer, 14 in the third, and 18 in the fourth. In each layer kernel_size is set to (3, 3), the activation is tanh, the data format is channels-last, a 2% dropout is applied for each state, and the sequence is returned at each state. After adding each layer, we apply MaxPooling3D [14] to reduce the number of frames.
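Since the model definition appears only as a figure, the following is a minimal Keras sketch of the ConvLSTM stack described in Step 1. The pooling size, the Flatten-plus-Dense head and the compile settings are assumptions, and the sketch does not claim to reproduce the 44,524 trainable parameters reported below.

```python
from tensorflow.keras.layers import ConvLSTM2D, Dense, Flatten, MaxPooling3D
from tensorflow.keras.models import Sequential

SEQUENCE_LENGTH, IMG_HEIGHT, IMG_WIDTH = 20, 64, 64
NUM_CLASSES = 20  # number of selected action categories

model = Sequential()
# Four ConvLSTM2D blocks with 4, 8, 14 and 18 filters, (3, 3) kernels, tanh activation,
# channels-last data format, 2% per-state dropout and return_sequences=True, each
# followed by MaxPooling3D to reduce the number of frames and limit overfitting.
for i, filters in enumerate([4, 8, 14, 18]):
    kwargs = {"input_shape": (SEQUENCE_LENGTH, IMG_HEIGHT, IMG_WIDTH, 3)} if i == 0 else {}
    model.add(ConvLSTM2D(filters=filters, kernel_size=(3, 3), activation="tanh",
                         data_format="channels_last", recurrent_dropout=0.02,
                         return_sequences=True, **kwargs))
    model.add(MaxPooling3D(pool_size=(1, 2, 2), padding="same",
                           data_format="channels_last"))
model.add(Flatten())
model.add(Dense(NUM_CLASSES, activation="softmax"))  # maps the features to the predicted activity

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```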
The figure below shows the summary of our ConvLSTM model.

Figure 3: Summary of ConvLSTM model

The created ConvLSTM model has a total of 44,524 parameters, all of which are trainable; there are no non-trainable parameters. The figure below shows detailed layer information for the ConvLSTM model.

Figure 4: Layers of ConvLSTM2D

Step 2: After constructing the model, we train it. We first create an instance of the EarlyStopping callback, then compile the model, assessing it through the loss and an accuracy metric, and start training. During training the batch size is set to 4, and the dataset is shuffled before training. Below is the training summary of the ConvLSTM model.

Figure 5: Summary of training

Step 3: After training, we test the model on the 25% test dataset. The evaluation of the trained model is as follows.

Figure 6: Evaluation on trained model

Using ConvLSTM we obtained an accuracy of 80.33% on the test dataset.

3.4. Module - 4: LRCN Approach
Step 1: In this module we merge the CNN and LSTM layers into a single unit. A CNN model is used to generate feature maps from the extracted frames, and the LRCN combination uses the CNN and LSTM together so that they perform in the most effective manner, learning spatio-temporal features through end-to-end training.

Figure 7: LRCN Architecture

Figure 7 above is the architecture diagram of the LRCN model. We implement the LRCN architecture using time-distributed Conv2D [13] layers, and the Conv2D layers are combined with an LSTM layer; a Flatten layer then flattens the Conv2D features. We use a Sequential model for the construction. We first add a Time-Distributed Conv2D layer with 16 filters, and on the consecutive layers we increase the filters to 32, 64 and 64. The activation is set to ReLU. At each layer we perform MaxPooling2D to discard unwanted information, and we also apply dropout, which makes the model ignore randomly selected neurons. After adding these layers, we add a Flatten layer whose output feeds the next layer, then an LSTM layer, and finally a Dense layer that collects all the inputs from the previous layers. The figure below is the summary of our proposed LRCN model.

Figure 8: Summary of LRCN Model

Our proposed LRCN model has a total of 73,060 trainable parameters and no non-trainable parameters; of these, 60,512 are CNN parameters and 9,548 are LSTM parameters. Plotting the model gives the results below.

Figure 9: Layers of LRCN Model

Step 2: After constructing the LRCN model, we train it using the 75% training split of the UCF50 dataset. We first create an instance of the EarlyStopping callback, then compile the model with a loss function and an accuracy metric. We train the model with the number of epochs set to 70, batch_size set to 4, and shuffling enabled. Below is the accuracy summary of the LRCN model on the test dataset.

Figure 10: Accuracy summary on Test data-set

The highest accuracy reached was 97.26%, which is better than ConvLSTM. Below is the evaluation of the LRCN model on the 25% test dataset.

Figure 11: Evaluation on LRCN Model

Evaluating the LRCN model on the test dataset gives an accuracy of 92.62%. Since the accuracy of ConvLSTM is 80.33% and that of LRCN is 92.62%, based on the highest accuracy, i.e., 92.62%, we chose the LRCN model to test on YouTube videos.

3.5. Module - 5: Testing the LRCN Model on YouTube Videos
Step 1: The first step in this module is to apply the LRCN model to TaiChi and horse-racing activities taken from YouTube. We create a function for downloading selected videos from YouTube, and another function that runs prediction on the frames of a video and saves the results. We download YouTube videos using the pafy library; the function below downloads any YouTube video given only its URL.

Figure 12: Function for Downloading YouTube Video

Step 2: In the second step we create a function that performs recognition on the selected videos. This predict_single_action() function takes the URL of a video and the sequence length (set to 20 earlier) as inputs. It divides the specified video, downloaded with the function above, into frames in order to predict the action, applies the pre-processing techniques to those frames, and finally feeds the pre-processed frames to our LRCN model to predict the action in the given video.
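Two hedged sketches follow. The first mirrors the LRCN layer stack described in Step 1 of Module 4; the pooling and dropout values and the LSTM width are assumptions, and it does not claim to reproduce the 73,060 trainable parameters reported above. The second reconstructs the Module 5 flow around Figure 12 and predict_single_action(); the helper names, the output path and the EarlyStopping settings are illustrative, and pafy needs a working youtube-dl backend.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import (LSTM, Conv2D, Dense, Dropout, Flatten,
                                     MaxPooling2D, TimeDistributed)
from tensorflow.keras.models import Sequential

SEQUENCE_LENGTH, IMG_HEIGHT, IMG_WIDTH = 20, 64, 64
NUM_CLASSES = 20

model = Sequential()
# Time-distributed Conv2D blocks with 16, 32, 64 and 64 filters and ReLU activation,
# each followed by MaxPooling2D and Dropout, as described in Step 1 of Module 4.
for i, filters in enumerate([16, 32, 64, 64]):
    kwargs = {"input_shape": (SEQUENCE_LENGTH, IMG_HEIGHT, IMG_WIDTH, 3)} if i == 0 else {}
    model.add(TimeDistributed(Conv2D(filters, (3, 3), padding="same", activation="relu"),
                              **kwargs))
    model.add(TimeDistributed(MaxPooling2D((2, 2))))
    model.add(TimeDistributed(Dropout(0.25)))
model.add(TimeDistributed(Flatten()))   # flatten the Conv2D features of every frame
model.add(LSTM(32))                     # temporal modelling across the 20 frames
model.add(Dense(NUM_CLASSES, activation="softmax"))

# Step 2 of Module 4: EarlyStopping, 70 epochs, batch size 4, shuffling enabled
early_stopping = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(features_train, labels_train, epochs=70, batch_size=4, shuffle=True,
#           validation_split=0.2, callbacks=[early_stopping])
```

```python
import cv2
import numpy as np
import pafy  # needs a youtube-dl backend installed alongside it

SEQUENCE_LENGTH, IMG_HEIGHT, IMG_WIDTH = 20, 64, 64

def download_youtube_video(url, output_dir):
    """Download the best available stream of the given YouTube URL and return its path."""
    video = pafy.new(url)
    best = video.getbest()
    path = f"{output_dir}/{video.title}.mp4"
    best.download(filepath=path)
    return path

def predict_single_action(video_path, model, class_names):
    """Extract SEQUENCE_LENGTH evenly spaced frames and predict a single action label."""
    cap = cv2.VideoCapture(video_path)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    skip = max(int(frame_count / SEQUENCE_LENGTH), 1)
    frames = []
    for i in range(SEQUENCE_LENGTH):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * skip)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (IMG_WIDTH, IMG_HEIGHT)) / 255.0)
    cap.release()
    probabilities = model.predict(np.expand_dims(frames, axis=0))[0]
    predicted = int(np.argmax(probabilities))
    print(f"Predicted action: {class_names[predicted]} "
          f"(confidence: {probabilities[predicted]:.4f})")
```

For example, predict_single_action(download_youtube_video(url, "test_videos"), model, CLASSES) reproduces the flow used for the horse-race and TaiChi clips, assuming the trained LRCN model and the CLASSES list from the earlier sketches.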
4. Results and Analysis
Our proposed LRCN model gave an accuracy of 97.0059% on the selected YouTube videos. The human actions involved in the selected videos are horse racing in one video and TaiChi, which involves a series of physical exercises and stretches, in the other. Our LRCN model recognized the actions in these YouTube videos successfully with an accuracy of 97.0059%.

Figure 13: Accuracy of LRCN Model

The figure above shows the accuracy achieved by our LRCN model, i.e., 97.0059%.

4.1. Results of Action Prediction on YouTube Videos
a. Horse Race

Figure 14.1: Horse-race Prediction
Figure 14.2: Horse-race Prediction

Our model successfully predicted the HorseRace action.

b. TaiChi Human Action Prediction

Figure 15.1: TaiChi Human Action Prediction in the 26th second
Figure 15.2: TaiChi Human Action Prediction in the 31st second

Our model successfully predicted the TaiChi human action using the LRCN model.

5. Conclusion and Future Work
Human Action Recognition is often used in the development of surveillance systems and of systems designed for the elderly, and it can help individuals with learning difficulties and memory loss. The aim of our study is to contribute to assistive technology that allows elderly individuals to live a more connected life in an efficient way. Every human being desires to live forever, but the human lifespan is limited to roughly 100 years, and as people grow old they may lose their memory and forget how to perform everyday activities. For such people our concept of action recognition can be applied: if the system knows which activity the person intends to perform, such as walking or running, it can be used to predict and support those activities. The concept is not limited to the elderly; it can also be implemented for anyone who loses their memory or is unable to perform their daily activities.

6. References
[1] T. Liu, Y. Song, Y. Gu and A. Li, "Human Action Recognition Based on Depth Images from Microsoft Kinect," 2013 Fourth Global Congress on Intelligent Systems, 2013, pp. 200-204, doi: 10.1109/GCIS.2013.38.
[2] Kamel, B. Sheng, P. Yang, P. Li, R. Shen, and D. D. Feng, "Deep Convolutional Neural Networks for Human Action Recognition Using Depth Maps and Postures," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 9, pp. 1806-1819, Sept. 2019, doi: 10.1109/TSMC.2018.2850149.
[3] N. Jaouedi, N. Boujnah, O. Htiwich, and M. S. Bouhlel, "Human action recognition to human behavior analysis," 2016 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), 2016, pp. 263-266, doi: 10.1109/SETIT.2016.7939877.
[4] Q. Xiao and Y. Si, "Human action recognition using autoencoder," 2017 3rd IEEE International Conference on Computer and Communications (ICCC), 2017, pp. 1672-1675, doi: 10.1109/CompComm.2017.8322824.
[5] Y. Ji, Y. Yang, F. Shen, H. T. Shen and X. Li, "A Survey of Human Action Analysis in HRI Applications," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 2114-2128, July 2020, doi: 10.1109/TCSVT.2019.2912988.
[6] L. Wang, D. Q. Huynh and P. Koniusz, "A Comparative Review of Recent Kinect-Based Action Recognition Algorithms," in IEEE Transactions on Image Processing, vol. 29, pp. 15-28, 2020, doi: 10.1109/TIP.2019.2925285.
[7] "A Comprehensive Guide to Convolutional Neural Networks". [Online] https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
[8] "Long short-term memory (LSTM)". [Online] https://en.wikipedia.org/wiki/Long_short-term_memory
[9] "Long-term Recurrent Convolutional Network (LRCN)". [Online] https://kobiso.github.io/research/research-lrcn/
[10] "UCF50 - Action Recognition Data Set". [Online] https://www.crcv.ucf.edu/data/UCF50.php
[11] "An introduction to ConvLSTM". [Online] https://medium.com/neuronio/an-introduction-to-convlstm-55c9025563a7
[12] "How to Split a Dataset Into Training and Testing Sets with Python". [Online] https://towardsdatascience.com/how-to-split-a-dataset-into-training-and-testing-sets-b146b1649830
[13] "Keras.Conv2D Class". [Online] https://www.geeksforgeeks.org/keras-conv2d-class/
[14] "MaxPooling3D layer". [Online] https://keras.io/api/layers/pooling_layers/max_pooling3d/
[15] "Introduction to Pafy Module in Python". [Online] https://www.geeksforgeeks.org/introduction-to-pafy-module-in-python/