A Proposal on Stampede Detection in Real Environments

Antonio Carlos Cob-Parro, Cristina Losada-Gutiérrez, Marta Marrón-Romera, Alfredo Gardel-Vicente, Ignacio Bravo-Muñoz and Mohammad Ibrahim Sarker

Abstract

The world population has grown in recent decades, as has the number of social and tourism events. This generates agglomerations in which different problems may lead to bottlenecks, stampedes or falls that put people at risk. Thus, the study of crowd behaviour is a relevant research topic. In this context, this paper presents an approach for real-time stampede detection from images in low and medium density crowd scenarios. The proposal is based on a feature vector extracted from the optical flow entropy and does not require the use of thresholds. Instead, it includes a stacking classifier, based on the union of a random forest with ten estimators and a support vector classifier, that works properly in the different analysed scenarios. The proposal has been evaluated on the UMN and PETS 2009 datasets and compared to other state-of-the-art proposals in terms of accuracy and computational cost. Since the provided ground-truth was not accurate, a new manually labelled ground-truth has been generated and made publicly available to the scientific community. The obtained results validate the proposal, which outperforms the state-of-the-art methods in terms of both accuracy and computational cost in all the evaluated scenarios.

1. Introduction

The world population has grown in recent decades. In the 1950s, there was a population of 2.5 billion people, while in 2020 there are approximately 7.7 billion people. This fact is more striking considering that the increase in the last ten years is approximately 1 billion people, suggesting that the population is increasing in a non-linear way year by year.
Some population experts suggest that in the next century the population could exceed 11 billion people. This creates a scenario in which it is becoming relevant to deploy surveillance systems capable of detecting both individual and group behaviour. The number of social and tourism events will grow, generating agglomerations in which different problems may lead to bottlenecks, stampedes, falls and many other risk scenarios. The study of crowd behaviour is therefore a relevant research topic, in order to be able to control and protect people in the face of uncontrolled events.

IPIN 2021 WiP Proceedings, November 29 – December 2, 2021, Lloret de Mar, Spain
Email: antonio.cob@edu.uah.es (A. C. Cob-Parro); cristina.losada@uah.es (C. Losada-Gutiérrez); marta.marron@uah.es (M. Marrón-Romera); alfredo.gardel@uah.es (A. Gardel-Vicente); ignacio.bravo@uah.es (I. Bravo-Muñoz); ibrahim.sarker@uah.es (M. I. Sarker)
ORCID: 0000-0001-8608-7351 (A. C. Cob-Parro); 0000-0001-9545-327X (C. Losada-Gutiérrez); 0000-0002-9421-8566 (M. Marrón-Romera); 0000-0001-7887-4689 (A. Gardel-Vicente); 0000-0002-6964-0036 (I. Bravo-Muñoz); 0000-0002-9589-294X (M. I. Sarker)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Crowd analysis, in general, can be approached holistically or by object-based methods. Object-based methods analyse crowds as sets of objects, studying people individually [1]: each person is detected and tracked separately. In this type of system, the amount and kind of tracking performed are then analysed. The problem of these object-based methods is their accuracy because, in dense crowds, the identification of people is complicated. On the other hand, holistic approaches are based on identifying the crowd as a single unit [2, 3].
These methods are based on extracting the characteristics of the crowd to deduce its behaviour. Holistic methods have a reasonable accuracy rate in detecting anomalous behaviour.

In this work, we have focused on the detection of anomalous behaviour in crowds, in particular the behaviour related to stampedes. Detection systems usually have two differentiated phases. The first phase is based on the representation of the event by a set of characteristics. There are many methods for the extraction of features, such as the study of social force [2], which is based on measuring the internal motivations of individuals to perform specific actions, the use of histograms of optical flow [4] to describe movement patterns, or the use of histograms of movement direction [5] to describe direction patterns. The second phase consists of detecting the event from the previously extracted features. These classification models are usually characterised by having only two classes: stampede or no stampede. There are different methods, such as support vector machines [6], replicator neural networks [7], convolutional networks [8], etc. The latest research in the detection of abnormal behaviour in crowds uses techniques such as context location and motion-rich spatio-temporal volumes [9], temporal convolutional neural network patterns [10], generative adversarial networks [11] and the global event influence model [12].

In this work, we have implemented a system that draws from both the most modern and the most classical approaches. It is based on the magnitude of the optical flow, from which the entropy is extracted to generate a series of descriptors that are used in a machine learning model for the detection of the anomaly. Considering the previous research work, the main contribution of this paper is a robust and reliable system capable of detecting stampedes in real time.
In addition, the algorithm to detect the peak of an event does not require a threshold. Additionally, we have manually labelled the UMN [2] and PETS 2009 [13] datasets to quantitatively evaluate stampede detection, and the resulting annotation has been made available to the scientific community [14].

The rest of the paper is organised as follows: section 2 describes the proposal for stampede detection, then section 3 presents the annotation procedure for the UMN and PETS 2009 datasets. Next, the main experimental results are shown in section 4. Finally, section 5 describes the main conclusions and future work.

2. Proposal for stampede detection

We have focused our efforts on identifying stampedes in low and intermediate density crowds. Like other previous approaches, such as the one described in [15], the proposal presented in this work is based on analysing the entropy obtained from the optical flow of the scene. A dense optical flow is obtained using the method of Farneback [16], instead of a sparse one as with the method of Lucas-Kanade [17]. Besides, the system works in real time. Moreover, it does not require a threshold, which would have to be adjusted for each dataset. Instead, we have extracted a feature vector from the entropy signal obtained from the optical flow and designed a machine learning model based on a stacking classifier for stampede detection. Figure 1 shows a general block diagram with each of the stages of the system. Each of these stages is described in detail below.

Figure 1: General block diagram of the system.

2.1. Optical flow computation

The system extracts the entropy using the optical flow of the image. For this purpose, we have used the dense optical flow of Farneback instead of the sparse (point-based) optical flow of Lucas-Kanade.
The reason is that we wanted a system that analyses the whole image: the dense method obtains the movement variations of all pixels between frames, unlike Lucas-Kanade, which is based on the variations of a specific set of pixels. Farneback's method estimates the movement of the pixels between the current and previous frames. From that movement, displacement vectors are generated, which are then used to study the movement variations in the image. This type of optical flow analysis presents a higher accuracy than algorithms based on sparse optical flow.

Farneback's dense optical flow extraction is based on expanding the position of the pixel coordinates by polynomial expansion, using the neighbourhood information of each pixel in the image. The original coordinates $(u_0, v_0)$ are independent variables, and the new coordinates $(u, v)$ are polynomials of dependent variables. The amount of motion $(du, dv)$ of the pixel in the $u$ and $v$ directions is determined by substituting the coordinates into them. A displacement vector is thus obtained for each pixel between two frames.

2.2. Feature extraction

The feature vector used for detecting stampedes is based on the entropy. To obtain the entropy, the image is first pre-processed by blurring and conversion to greyscale. Then the magnitude of the optical flow is extracted using the current and previous frames. These magnitude values form a matrix that corresponds to the movement variation of each pixel. These matrices are grouped in batches of 20 frames, and the mean of all the magnitudes is computed for each pixel. Then the median is calculated, and the result is called the activity map. From the activity map, the entropy for that frame can be extracted. The entropy is computed using equation 1, where $x$ is the number of separate symbols, $p_i$ is the frequency of the $i$-th symbol (pixel value) in the image and $n$ is the current frame.
\[ \mathrm{Entropy}(n) = -\sum_{i=1}^{x} p_i \log_2(p_i) \tag{1} \]

In previous works, such as [15], the authors use two thresholds to determine whether there is a stampede. The first one is based on the entropy value, and the second one on the temporal occupancy variation (TOV) between frames. In this work, the TOV is not employed; only the entropy value is used, from which a feature vector for the stacking classifier is obtained.

After obtaining the entropy, to determine whether there is a stampede, we extract a set of features that are then classified using a stacking classifier. These features include the mean and standard deviation of the entropy. The mean is computed using a sliding window of 20 frames, as shown in equation 2, where $k$ is the size of the sliding window. Thus, the proposal requires a minimum of 20 frames to detect a stampede, but the window smooths the signal and removes high-frequency noise.

\[ \mu(n) = \frac{1}{k} \sum_{i=n-k+1}^{n} \mathrm{Entropy}(i) \tag{2} \]

The standard deviation of the entropy is used to detect fast changes in the signal: a considerable variation generates a significant change in this descriptor. Equation 3 shows its mathematical definition, where $\mathrm{Entropy}(i)$ are the entropy values in the window and $\mu$ is the mean entropy value.

\[ \sigma(n) = \sqrt{\frac{\sum_{i=n-k+1}^{n} \left(\mathrm{Entropy}(i) - \mu(i)\right)^2}{k}} \tag{3} \]

The third feature is the distance, computed as the difference between the mean plus the standard deviation and the mean minus the standard deviation. The standard deviation is multiplied by a factor $L$ because its values are smaller than those of the mean signal. The distance is defined in equation 4.

\[ \mathrm{Distance}(n) = (\mu(n) + L \cdot \sigma(n)) - (\mu(n) - L \cdot \sigma(n)) \tag{4} \]

An example of the signals explained above is shown in figure 2. The peak of the entropy (right plot) reflects the moment when the stampede happens. The features explained above are drawn in the left plot.
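The definitions in equations 1 to 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the histogram quantisation (`bins`, `value_range`), the value of `L` and the function names are hypothetical choices made only for illustration, and the standard deviation is computed around the mean of the same window, which is the usual rolling form and may differ in detail from equation 3.

```python
import numpy as np


def shannon_entropy(activity_map, bins=256, value_range=(0.0, 1.0)):
    """Entropy (equation 1) of the histogram of an activity map.

    Assumption: the activity map is quantised into `bins` symbols over
    `value_range`; the paper does not specify the quantisation.
    """
    values = np.clip(np.asarray(activity_map, dtype=float), *value_range)
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    p = counts[counts > 0] / counts.sum()  # relative frequency of each symbol
    return float(-np.sum(p * np.log2(p)))


def entropy_features(entropy, k=20, L=2.0):
    """Sliding-window mean, standard deviation and distance (equations 2 to 4).

    Returns one row per frame n >= k - 1 with columns (mu, sigma, distance).
    L = 2.0 is a hypothetical value; the paper does not report it.
    """
    entropy = np.asarray(entropy, dtype=float)
    if entropy.size < k:
        raise ValueError("at least k entropy values are required")
    # One row per window of k consecutive frames ending at frame n.
    windows = np.stack([entropy[i:i + k] for i in range(entropy.size - k + 1)])
    mu = windows.mean(axis=1)
    sigma = windows.std(axis=1)  # population form: divides by k
    # Equation 4; algebraically this equals 2 * L * sigma.
    distance = (mu + L * sigma) - (mu - L * sigma)
    return np.column_stack([mu, sigma, distance])


# Synthetic entropy signal: calm frames followed by a sudden rise.
signal = np.concatenate([np.full(70, 1.0), np.linspace(1.0, 3.0, 30)])
features = entropy_features(signal, k=20, L=2.0)
print(features.shape)  # (81, 3)
```

Because the mean terms cancel in equation 4, the distance feature is proportional to the standard deviation, so the scaling factor only changes its magnitude relative to the other signals.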
The entropy mean is shown in blue, the standard deviation in yellow and the distance in green with shading.

Figure 2: The graph on the right shows the entropy value, with the peak at the time of the stampede. The left plot shows the mean value in blue, the standard deviation in yellow and the distance in green with shading.

2.3. Classification

A stacking classifier (figure 3) has been used, combining a support vector classifier (SVC) and a random forest (RF). Stacking is an ensemble learning technique for combining multiple classification models through a meta-classifier. The individual classification models are trained on the complete training set; then, the meta-classifier is fitted on the outputs (meta-features) of the individual classification models in the ensemble. The meta-classifier can be trained on the predicted class labels or on the probabilities produced by the ensemble. Figure 3 shows the basic structure of a stacking classifier.

Figure 3: Example of stacking classifier structure

The stacking classifier used in this work is based on the union of a random forest with ten estimators and an SVC, and is used to classify whether there is a stampede. Training was performed with half of the UMN videos and half of the PETS videos, the other half being used for testing. This 50:50 split is used for two reasons: first, a stacking classifier does not need a large amount of data to train; second, the number of cases in which there is a stampede in the videos is much smaller than the number in which there is not, so more videos have been used for testing to achieve a more reliable result. In order to obtain a reliable measure of the predictive quality of the models, k-fold cross-validation of the training process has been performed with k equal to 10.
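A classifier of this kind can be sketched with scikit-learn as follows. The paper does not name the library, the meta-classifier or any hyperparameter other than the ten estimators of the random forest, so the logistic-regression meta-classifier (scikit-learn's default), the feature scaling before the SVC, the stratified folds and the synthetic feature vectors below are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_stacking_classifier(seed=0):
    """Random forest (ten estimators) and SVC combined by a meta-classifier."""
    base_models = [
        ("rf", RandomForestClassifier(n_estimators=10, random_state=seed)),
        # Assumption: features are standardised before the SVC.
        ("svc", make_pipeline(StandardScaler(), SVC())),
    ]
    # Assumption: logistic regression as meta-classifier (library default).
    return StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(),
    )


# Synthetic stand-in for per-frame features (entropy, mean, std, distance).
# Calm frames outnumber stampede frames, as in the annotated videos.
rng = np.random.default_rng(0)
calm = rng.normal([1.0, 1.0, 0.1, 0.4], 0.05, size=(300, 4))
stampede = rng.normal([2.0, 1.8, 0.5, 2.0], 0.05, size=(60, 4))
X = np.vstack([calm, stampede])
y = np.concatenate([np.zeros(300, dtype=int), np.ones(60, dtype=int)])

# 10-fold cross-validation on the training data, scored by AUC.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(build_stacking_classifier(), X, y,
                         cv=folds, scoring="roc_auc")
print(scores.shape)  # (10,)
```

Stratified folds are used in this sketch so that every fold contains some stampede frames despite the class imbalance.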
The value k = 10 was chosen because the datasets have a small number of frames and a small number of labels, so the division into 10 folds provides enough information in each fold for the training to succeed.

3. Stampede Annotation

To evaluate the system, we have used two datasets, UMN and PETS 2009, that have been widely used in other works on stampede detection. These datasets include several videos with stampedes. The UMN dataset includes a ground-truth; however, when analysing the videos, it can be seen that there is a delay between the beginning of the stampede and the frame in which it is labelled. Furthermore, the PETS dataset does not include ground-truth information for stampede events.

The characteristics of the datasets are shown in table 1. Both datasets contain similar types of stampedes, in which people either run in the same direction or spread out. Both datasets are recorded in the same way, with a high-angle shot. The lighting is constant in all videos except in one environment of the UMN dataset. The main difference between the two datasets is the number of people: in UMN there are no more than 20 people in any video, while in PETS there are more than 30 people per video.

Table 1: Main characteristics of the analysed datasets

Dataset     Scenario   Resolution   Illumination   #people   #videos/frames
UMN         Lawn       240 × 320    Constant       15        2/1433
UMN         Plaza      240 × 320    Constant       12        3/4038
UMN         Indoor     240 × 320    Variable       10        6/2031
PETS 2009   Street-1   576 × 768    Constant       41        4/1812
PETS 2009   Street-2   576 × 768    Constant       42        4/1060

Due to the lack of an accurate ground-truth, we have analysed the two datasets and hand-labelled the information. To label the videos, it is necessary to define two moments in a stampede: its beginning and its end. The stampede is considered to begin when more than four people in the image are already moving or preparing to run. The end of the stampede is defined as the moment when the people in the image stop running and start walking, or when fewer than three people are running on the screen.
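For a frame-level evaluation, annotated start and end frames can be expanded into one binary label per frame. The sketch below assumes zero-based, inclusive frame indices; the function name and the frame numbers are hypothetical and are not taken from the published annotation [14].

```python
from typing import List, Sequence, Tuple


def frame_labels(total_frames: int,
                 stampedes: Sequence[Tuple[int, int]]) -> List[int]:
    """Return one label per frame: 1 inside an annotated stampede, else 0.

    Each (start, end) pair is zero-based and inclusive (an assumption).
    """
    labels = [0] * total_frames
    for start, end in stampedes:
        if not 0 <= start <= end < total_frames:
            raise ValueError(f"invalid interval ({start}, {end})")
        for frame in range(start, end + 1):
            labels[frame] = 1
    return labels


# Hypothetical 10-frame clip with a stampede from frame 6 to frame 8.
print(frame_labels(total_frames=10, stampedes=[(6, 8)]))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
```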
By means of these guidelines, we have defined a ground-truth by hand, analysing the videos frame by frame and indicating the moment when each stampede starts and ends. Figure 4 shows a scheme of the ground-truth, indicating in green the moments of calm and in red the times at which a stampede is considered to occur. Figure 4 also shows the frames where each stampede begins and ends and the total number of frames of each video.

Figure 4: Ground-truth UMN

Figure 5 shows a scheme of the PETS videos that contain stampedes. Note that the PETS videos show the same scene seen from different camera viewpoints. For this reason, the beginning and end of the stampede are the same for all these recordings. As mentioned in previous sections, this more accurate annotation has been made publicly available [14].

Figure 5: Ground-truth PETS

4. Experimental results

4.1. Experimental Set-up

To evaluate the system performance, we have analysed two key parameters: the execution speed of the system and its accuracy in detecting stampedes.

The computational cost has been evaluated for the original system executed on a CPU (Intel Core i7-9700K CPU @ 3.60GHz x 8). The OpenCV version is 4.5.1, and the programming language is Python 3.7.

Regarding accuracy, as stated before, two datasets (UMN and PETS) have been used, and the results have been compared with those obtained in [15]. The UMN dataset includes a total of 7502 frames with a resolution of 240 × 320 at 30 fps. It is divided into three scenarios, two outdoor and one indoor. The PETS dataset includes a single outdoor scenario with a total of 2872 frames with a resolution of 576 × 768 at 30 fps. To evaluate the system we have decided to use the receiver operating characteristic (ROC) and area under the curve (AUC) metrics.
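As a rough illustration of how these two metrics are obtained from per-frame labels and classifier scores, the following NumPy sketch sweeps a decision threshold over the scores and integrates the resulting curve with the trapezoidal rule. The function names and the toy scores are hypothetical; library routines such as those in scikit-learn could be used instead.

```python
import numpy as np


def roc_points(labels, scores):
    """True and false positive rates for every distinct score threshold."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    if labels.all() or not labels.any():
        raise ValueError("both classes must be present")
    # Start above every score so that the curve begins at (0, 0).
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    tpr = np.array([np.mean(scores[labels] >= t) for t in thresholds])
    fpr = np.array([np.mean(scores[~labels] >= t) for t in thresholds])
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy example: two calm frames (label 0) and two stampede frames (label 1).
fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc(fpr, tpr))  # 0.75
```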
A ROC curve is a graph showing the performance of a classification model at all classification thresholds. It represents the true positive rate (TPR) versus the false positive rate (FPR) at different classification thresholds, whereas the AUC measures the entire two-dimensional area below the ROC curve. The AUC provides an aggregate measure of performance across all possible classification thresholds.

4.2. Stampede detection

This section shows the results of stampede detection in different environments, both indoors and outdoors, corresponding to the previously described UMN and PETS 2009 datasets. As explained previously, the entropy, its mean, its standard deviation and the distance (a combination of the mean and the standard deviation) are extracted for each frame. These values are analysed by a stacking classifier that determines the class of the activity. Figure 6 shows an example from UMN in which the start of the stampede, marked with an arrow, can be clearly observed.

Figure 6: Green shading indicates the frames in which there is no stampede while red shading indicates the frames in which there is a stampede. The arrow indicates the moment when the stampede starts.

Due to the frame characteristics of each dataset, it is necessary to normalise the magnitude values of the optical flow. This normalisation keeps the entropy values within a narrow range, which reduces the dispersion between the different samples and helps the classifier. The normalisation is adjusted to the number of pixels in the image by dividing the magnitude given by the optical flow by a constant that is larger for larger image sizes. Figure 7 shows an example in which the maximum value of the entropy is similar to the maximum value in UMN (figure 6).

Figure 7: Green shading indicates the frames in which there is no stampede while red shading indicates the frames in which there is a stampede.
The arrow indicates the moment when the stampede starts.

To observe the improvement in system performance, we compare the ROC curves of our system with those of the baseline [15]. It is worth noting that the ground-truth used in [15] places the beginning and the end of the stampede too late in relation to the actual times in the videos. For this reason, we have manually labelled the videos, using the above-mentioned definition of stampede. Thus, our ground-truth is more accurate and realistic than the one used by [15].

The first scenario is the UMN lawn (figure 8), in which no new actors enter the scene. Figure 8a shows an example frame of this scenario, whereas figure 8b compares the ROC curves and the AUC of our system (in red) and of the baseline [15] (in blue). Comparing the ROC curves, it can be seen that the results are similar, although our system obtains some improvement.

Figure 8: Results for Lawn UMN dataset. (a) Lawn scenario. (b) ROC for the lawn scenario: our system in red and the system of [15] in blue.

The indoor scenario (figure 9) is a recording of a hall, where unusual activity is observed, such as people entering and leaving, as well as much more abrupt light changes than in the other two scenarios. The indoor scenario is where the most significant improvement is observed, with an important increase in the AUC of our system (in red) over the system of [15] (in blue), as shown in figure 9b.

Figure 9: Results for Indoor UMN. (a) Indoor scenario. (b) ROC for the indoor scenario: our system in red and the system of [15] in blue.

Finally, figure 10 shows the last scenario of the UMN dataset. It is recorded in a plaza, where people are walking. In this scenario (figure 10a), the lighting, the size of the people and the recording environment are similar to those of the lawn scenario. For this reason, the values obtained in the ROC curve (figure 10b) are practically equal to those of the lawn scenario.
As shown in the figure, the AUC values obtained by our system (in red) and the system of [15] (in blue) are very similar, with minor improvements for our system.

Figure 10: Results for Plaza UMN. (a) Plaza scenario. (b) ROC for the plaza scenario: our system in red and the system of [15] in blue.

Regarding the PETS 2009 dataset (figure 11), we find a view similar to the lawn and plaza scenarios of the UMN dataset. The frames show a street crossing where a group of people run in a stampede (figure 11a). The AUC obtained with our method is higher than that of [15].

Figure 11: Results for PETS 2009. (a) PETS scenario. (b) ROC for PETS 2009: our system in red and the system of [15] in blue.

As explained before, our proposal is able to perform in real time, being faster than [15]. Table 2 shows a comparison between our model and the model of [15], in terms of both the frames per second (FPS) processed by each proposal and the AUC value. It can be seen that our model obtains better results in every field, both in computational cost (at least a 2x speed-up) and in AUC (with an absolute improvement from +0.02 to +0.10, depending on the scenario).

Table 2: Comparison of our system with [15]

                        Our proposal     Pennisi et al. [15]
Dataset     Scenario    FPS    AUC       FPS    AUC
UMN         Lawn        54     0.99      20     0.96
UMN         Plaza       54     0.99      20     0.97
UMN         Indoor      53     0.98      20     0.88
PETS 2009   Street-1    22     0.99      11     0.95
PETS 2009   Street-2    23     0.99      11     0.96

5. Conclusions

In this work, we have proposed an approach for real-time stampede detection in low and medium density crowd scenarios. The proposal is based on a feature vector extracted from the optical flow entropy and does not require the use of thresholds, since these have been replaced by a stacking classifier, based on the union of a random forest with ten estimators and an SVC, that works properly in the different analysed scenarios. We have selected and extracted features for stampede detection. These features are extracted from the entropy information given by the optical flow.
These features are suitable descriptors for detecting the beginning of a stampede.

The proposal has been evaluated exhaustively on the UMN and PETS 2009 datasets, which have been widely used in stampede detection works, and compared to other state-of-the-art proposals in terms of accuracy and computational cost. For this evaluation, the stampedes in the UMN and PETS datasets have been hand-labelled, adjusting the start and end frames of the stampedes more precisely. For this reason, our ground-truth is tighter and more precise than the existing ones.

In conclusion, the system improves on the results of previous state-of-the-art works in both accuracy and computational cost. In addition, the system does not depend on a threshold to run correctly and is able to adapt to different environments. This has been possible by replacing thresholds with machine learning models that are able to generalise a solution independently of the environment.

As future lines of work, we plan to run this system on dedicated vision processing hardware such as a GPU or VPU. This will allow a faster response in the processing and add another layer of parallelisation to the system. Regarding the classification method, we propose to use a dense neural network instead of a classical machine learning model and to compare how the system behaves with deep learning models. In addition, we intend to test the system on more datasets in order to evaluate more environments and assess the generality of the designed system more extensively.

Acknowledgments

The authors would like to thank the GEINTRA research group (geintra-uah.org) for their support and background work. This research has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under PALAEMON project (Grant Agreement nº 814962), and by the Spanish Ministry of Economy and Competitiveness under project HEIMDAL-UAH (TIN2016-75982-C2-1-R).

References

[1] P.
Tu, T. Sebastian, G. Doretto, N. Krahnstoever, J. Rittscher, T. Yu, Unified crowd segmentation, in: European conference on computer vision, Springer, 2008, p. 691–704.
[2] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, p. 935–942.
[3] S. Ali, M. Shah, A lagrangian particle dynamics approach for crowd flow segmentation and stability analysis, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, p. 1–6.
[4] A. Adam, E. Rivlin, I. Shimshoni, D. Reinitz, Robust real-time unusual event detection using multiple fixed-location monitors, IEEE transactions on pattern analysis and machine intelligence 30 (2008) 555–560.
[5] H. M. Dee, A. Caplier, Crowd behaviour analysis using histograms of motion direction, in: 2010 IEEE International Conference on Image Processing, IEEE, 2010, p. 1545–1548.
[6] C. Direkoglu, M. Sah, N. E. O’Connor, Abnormal crowd behavior detection using novel optical flow-based features, in: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), IEEE, 2017, p. 1–6.
[7] S. Hawkins, H. He, G. Williams, R. Baxter, Outlier detection using replicator neural networks, in: International Conference on Data Warehousing and Knowledge Discovery, Springer, 2002, p. 170–180.
[8] S. Zhou, W. Shen, D. Zeng, M. Fang, Y. Wei, Z. Zhang, Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes, Signal Processing: Image Communication 47 (2016) 358–368.
[9] N. Patil, P. K. Biswas, Global abnormal events detection in crowded scenes using context location and motion-rich spatio-temporal volumes, IET Image Processing 12 (2018) 596–604.
[10] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, N.
Sebe, Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, p. 1689–1698.
[11] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, N. Sebe, Abnormal event detection in videos using generative adversarial nets, in: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017, p. 1577–1581.
[12] L. Pan, H. Zhou, Y. Liu, M. Wang, Global event influence model: integrating crowd motion and social psychology for global anomaly detection in dense crowds, Journal of Electronic Imaging 28 (2019) 023033.
[13] J. Ferryman, A. Shahrokni, Pets2009: Dataset and challenge, in: 2009 Twelfth IEEE international workshop on performance evaluation of tracking and surveillance, IEEE, 2009, p. 1–6.
[14] A. C. Cob-Parro, Umn repository with the updated grountruth, 2021. https://github.com/CarlosCobParro/UMN-groundtruth-update.
[15] A. Pennisi, D. D. Bloisi, L. Iocchi, Online real-time crowd behavior detection in video sequences, Computer Vision and Image Understanding 144 (2016) 166–176.
[16] G. Farnebäck, Two-frame motion estimation based on polynomial expansion, in: Scandinavian conference on Image analysis, Springer, 2003, p. 363–370.
[17] B. D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1981, p. 674–679.