Identifying anomalous places and routes by GPS feature: a system for child monitoring

Giacomo Abbattista 1, Donato Impedovo 1, Giuseppe Pirlo 1, Lucia Sarcinella 1, and Nicola Stigliano 1

1 University of Bari, Dep. of Computer Science, Via Orabona 4, Bari, Italy

Abstract
Bullying and cyberbullying are a persistent problem for today's children. Some of these episodes take place on the way from home to school (and back), so they can show up as anomalies: deviations from the standard route, or pauses and interruptions along it. These anomalies can be detected with the GPS sensor already available on all smartphones. This work presents a system that acquires the GPS parameters of the mobile phone, recognizes paths that are abnormal compared to the standard ones, and reports the event to parents so that they can take appropriate precautions. In addition, parents can visualize paths and events through a simple web platform. The system is a preliminary version and has been tested on a sample of 9 users, demonstrating high accuracy and wide acceptance by the selected users.

Keywords
Bullying, Cyberbullying, GPS, DBSCAN, FoliumMap, Android

1. Introduction
With more than four billion Internet users across the globe [1], the online world is now part of everyday life and plays a vital role in society. This rapid growth in technology has not brought only advantages: it has also surfaced many problems, of which cyberbullying is one of the primary concerns. The internet has turned out to be a double-edged sword. On the one hand, it has brought unmatched ease to our daily life. On the other hand, it has created grounds for numerous unwanted behaviors, such as cyberbullying, a type of bullying carried out via electronic means [2]. Bullying actions include physical assault, verbal assault, and the spreading of fake news, harsh words or comments, rumors, gossip, threats, exclusion from social circles, etc.
Technological advancement has transformed traditional bullying into cyberbullying [3], which is “the use of information and communication technologies to support deliberate, repeated, and hostile behavior by an individual or group that is intended to harm or defame others” [4]; in simple words, cyberbullying is “an electronic form of peer harassment” [5]. Cyberbullying is considered more dangerous than traditional bullying because anonymity can protect the bully. This is the biggest difference: technology and the internet give the perpetrator an extra layer of protection. A cyberbully can act from any part of the world; all he or she needs is a suitable technology or medium, which is readily available almost everywhere. Unlike physical bullying, cyberbullying can be carried out 24 hours a day, 365 days a year. Cyberbullying can occur at any time of life, irrespective of age group [6], and it increases as a person grows [7], [8].

The work reported in this paper is part of an Italian project aimed at creating an app able to record a wide range of events that can be traced back to a bullying or cyberbullying action, thereby exploiting the same technologies that created the problem [9]. In particular, in this work we discuss a functionality implemented through a system that acquires, with the user's consent, the GPS parameters (latitude and longitude) of the smartphone on which the application is installed and, by analyzing these parameters, is able to recognize the places most frequented by the user and the routes usually taken, but also, and above all, the unusual places and routes taken by the user. The parent can then view all this information graphically through a dedicated web platform.

(ITASEC) Italian Conference on Cybersecurity, April 7-9, 2021, Italy
EMAIL: giacomo.abbattista@uniba.it (A. 1); donato.impedovo@uniba.it (A. 2); giuseppe.pirlo@uniba.it (A. 3); lucia.sarcinella@uniba.it (A. 4); n.stigliano@studenti.uniba.it (A. 5)
ORCID: 0000-0003-0850-728X (A. 1); 0000-0002-9285-2555 (A. 2); 0000-0002-7305-2210 (A. 3); 0000-0002-8550-8588 (A. 4)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

The paper is organized as follows: Section 2 presents the methods and technologies used; Section 3 presents the experiments; Section 4 presents the results. Section 5 concludes the work.

2. Methods
In this section we examine the techniques and technologies used in the work. During project development many solutions were investigated; however, only the main and final ones are described below, excluding those that were discarded.

2.1. Data Acquisition
The fundamental parameters of this work are the GPS parameters of the users' smartphones. These are among the parameters acquired by the app we are developing, named ShieldApp. To date, the application is only available for devices running the Android operating system with SDK version (API level) 24 or later. Android is the most widespread operating system in the world: it is reported that 62.94% of mobile devices, including car radios, smartwatches, televisions and IoT products, use Android or an Android-based operating system, each with a dedicated graphical interface designed to optimize the user experience.
As soon as installation is completed, ShieldApp shows the user a security policy that asks for consent to acquire personal data relating to GPS movements and to use them for scientific research purposes, ensuring that they are not disclosed externally, all in accordance with the European data protection regulation (GDPR). In particular, the policy states that, in accordance with the GDPR, the acquired data are used for research purposes and solely for the detection of bullying and cyberbullying. Furthermore, all results obtained from the processing are visible only to parents.

Once the security policy is accepted, whenever the user has the GPS turned on, the application acquires the position and stores it in a MySQL database. In addition to the GPS parameters (latitude and longitude), each record also stores the corresponding device ID, the type of data (in this case "Sensor"), the timestamp and the acquisition time (in the format YYYY-MM-DD HH:MM:SS). Fig. 1 shows an example of the acquired data.

Figure 1: Example of stored data

2.2. Data Clustering
Data are periodically transferred from the mobile device to a server, where a series of processing steps is performed. Here, the user's routing behavior is inspected with an unsupervised clustering algorithm: the elements of a cluster are the usual places and paths, while the outliers are the anomalous ones. An unsupervised approach is mandatory because during the test phase the user was not asked to report anything or to interact explicitly with the app, so labels are not available. The same holds in a real scenario, in which the user will not normally tag his/her movements voluntarily [10], [11], [12], [13].

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships.
The goal is that objects within one group are similar (or related) to each other and different from (or unrelated to) objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the grouping. In this preliminary experiment DBSCAN was chosen [14], since it has already been adopted in several works dealing with geospatial data for position prediction [15], [16].

The DBSCAN algorithm uses two parameters:
• minPts: the minimum number of points (a threshold) that must be grouped together for a region to be considered dense.
• eps (ε): a distance measure used to locate the points in the neighborhood of any point.

These parameters are best understood in terms of density reachability and density connectivity. Density reachability considers a point reachable from another if it lies within a given distance (eps) of it. Density connectivity, on the other hand, uses a transitivity-based chaining approach to determine whether points belong to the same cluster.

Unlike many other widely used algorithms such as k-means, DBSCAN does not require the number of clusters to be specified a priori. This is of vital importance here, since each cluster corresponds to a place heavily frequented by the user (home, school, workplace, etc.), and there is no fixed number of such places that applies to all users, nor a fixed number for the same user over time. DBSCAN can also find clusters of arbitrary shape, even a cluster surrounded by (but not connected to) a different cluster. Thanks to the minPts parameter, the so-called single-link effect (several clusters connected by a thin line of points) is reduced. In addition, it requires only two parameters and is mostly insensitive to the ordering of the points in the database. However, DBSCAN also has disadvantages.
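Before turning to those disadvantages, the neighborhood test and the chaining idea described above can be illustrated with a minimal, standard-library sketch of a DBSCAN-style procedure. It is not the implementation used in the system; the names region_query, dbscan and NOISE, the toy coordinates and the parameter values are assumptions chosen only for illustration.

```python
from math import dist

NOISE = -1  # label given to outliers (hypothetical convention for this sketch)

def region_query(points, i, eps):
    """Return the indices of all points within eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label every point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already assigned or marked as noise
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE             # not dense; may become a border point later
            continue
        cluster += 1                      # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep chaining
    return labels

# Toy data: two dense groups (for instance two frequented places) and one isolated point.
home = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
school = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
detour = [(2.5, 9.0)]
labels = dbscan(home + school + detour, eps=0.2, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this toy run the two dense groups become clusters 0 and 1 without the number of clusters being specified, and the isolated point is left as an outlier.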
In this case the main disadvantage is that the quality of DBSCAN depends on the distance measure used in the regionQuery(P, ε) function. The most common distance metric is the Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless by the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on the Euclidean distance.

As for the implementation of DBSCAN, the greatest difficulty lies in deciding the values of eps and minPts. An additional difficulty is that configurations and data distributions change from device to device and from time to time. Since it is not possible to predict what kind of data will be available, it was decided to test a range of values each time. For minPts the range goes from 5 to 400. For eps, which is tested on the basis of the best minPts, a value between 0.01 and 0.09 is chosen if minPts is less than 51, otherwise a value between 0.1 and 2.

These choices are supported by the silhouette coefficient [17], an important metric calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is (b - a) / max(a, b), where b is the distance between a sample and the nearest cluster of which the sample is not part. The best value is 1 and the worst value is -1. Values close to 0 indicate overlapping clusters. Negative values generally indicate that a sample was assigned to the wrong cluster, as a different cluster is more similar.
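As an illustration of the formula above, the following standard-library sketch computes (b - a) / max(a, b) for every sample and averages the result, then keeps the candidate labelling with the highest score. It is only a sketch under stated assumptions: the function names, the toy points and the two candidate labellings are hypothetical, and outliers are assumed to have been removed before scoring.

```python
from math import dist
from statistics import mean

def silhouette_samples(points, labels):
    """Return the silhouette value (b - a) / max(a, b) of every sample."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a single-member cluster
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest cluster the sample is not in
        b = min(mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

def silhouette_score(points, labels):
    """Average silhouette value over all samples."""
    return mean(silhouette_samples(points, labels))

# Toy comparison of two candidate labellings of the same four points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
candidates = {"by place": [0, 0, 1, 1], "mixed": [0, 1, 0, 1]}
best = max(candidates, key=lambda name: silhouette_score(points, candidates[name]))
print(best, round(silhouette_score(points, candidates[best]), 2))
```

A parameter sweep of the kind described above would score the labelling produced by each (eps, minPts) pair in the same way and keep the pair with the highest value.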
In the end, therefore, DBSCAN is configured with the best values according to the silhouette coefficient, using the Euclidean metric, which is the most widely used one: for example in [18], where it is used as a fitness function to control the determination of parameters by optimization, in [19], where it is used to classify the type of text contained in the Al-Quran, and as confirmed by [20] and [21]. All these operations were performed on normalized data.

Based on the tests carried out, it was decided to process the data week by week. The tests showed that fewer than 5 days of data produce inaccurate results, probably because too little data is available. Results improve as more data become available, but the execution time then increases significantly. For this reason, the best compromise between result quality and time was reached with 7 days (Silhouette Coefficient > 0.90).

2.3. Data Visualization
Once the available GPS data have been classified, the results need to be visualized. Powerful visualization tools and libraries are available nowadays. In this work the Folium library has been adopted. It is a Python data visualization library created primarily to help people visualize geospatial data. Its maps are interactive, so zooming in and out is available. Folium is used to create an interactive map that shows clusters and outliers in the way that is most understandable for the user (in this case the parent). More precisely, four different maps are created:

1. Interactive map containing all the clusters of the week: this map shows all the clusters as a heatmap, with a marker added on the positions where the data are most concentrated. The marker is clickable and lets the user see the city, the postcode, the street and, where available, the name of the place where it is located.
We also tried to give continuity to the user's movement: the points on the map were connected based on the time elapsed between them. Points were joined when they were recorded less than 10 minutes apart (other thresholds were also tested before settling on this choice). The reason is that a person may stop at a traffic light, encounter a traffic accident or simply make a stop without changing the final destination.

2. Interactive map containing all the clusters of the week, sorted by time: the points are again displayed as a heatmap. The map also contains a slider that the user (the parent) can move along the time axis to view the exact instant and the exact point at which a position was recorded. The parent can also choose a specific day for which to view the data.

3. Interactive map containing all the outliers of the week: identical to the first map, but with outliers instead of clusters.

4. Interactive map containing all the outliers of the week, sorted by time: identical to the second map, but with outliers instead of clusters.

For the pop-up that displays the data relating to a point, a reverse geocoding technique has been implemented. The 10 most recurring points were taken (by counting their occurrences in the DataFrame) and passed to Nominatim, a geocoder available in the geopy.geocoders library that returns all the details of a position. This process was repeated for all four maps. In addition, a dictionary sorted by key (date and time) was created and passed to the Folium "HeatMapWithTime" plugin to create maps 2 and 4 (heatmap with time, by cluster and by outlier). As for this last feature, the user is given the opportunity to view a certain day.
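The data preparation steps described above (joining points recorded less than 10 minutes apart, counting the most recurring points, building a time-sorted dictionary and selecting one day) can be sketched with the Python standard library alone. This is an illustrative sketch, not the code of the platform: the record layout (timestamp, latitude, longitude), the function names and the sample coordinates are assumptions, and the Folium and geopy calls are deliberately left out.

```python
from collections import Counter
from datetime import datetime, timedelta

def split_by_gap(records, max_gap=timedelta(minutes=10)):
    """Group (timestamp, lat, lon) records into paths.

    Consecutive records stay in the same path when less than max_gap
    elapses between them; a longer pause starts a new path.
    """
    paths = []
    for rec in sorted(records):
        if paths and rec[0] - paths[-1][-1][0] < max_gap:
            paths[-1].append(rec)
        else:
            paths.append([rec])
    return paths

def most_recurring(records, n=10):
    """Return the n most frequent (lat, lon) pairs, e.g. as input to reverse geocoding."""
    counts = Counter((lat, lon) for _, lat, lon in records)
    return [point for point, _ in counts.most_common(n)]

def frames_by_time(records):
    """Build a dict, ordered by its date-time key, of [lat, lon] lists."""
    frames = {}
    for ts, lat, lon in sorted(records):
        frames.setdefault(ts.strftime("%Y-%m-%d %H:%M:%S"), []).append([lat, lon])
    return frames

def on_day(records, day):
    """Keep only the records acquired on the given date."""
    return [rec for rec in records if rec[0].date() == day]

# Hypothetical sample records (coordinates are made up for the example).
t0 = datetime(2021, 3, 1, 8, 0, 0)
records = [
    (t0, 41.10, 16.88),
    (t0 + timedelta(minutes=4), 41.11, 16.88),
    (t0 + timedelta(minutes=9), 41.12, 16.87),
    (t0 + timedelta(minutes=45), 41.12, 16.87),  # long pause: a new path starts here
    (t0 + timedelta(days=1), 41.10, 16.88),
]
print([len(path) for path in split_by_gap(records)])  # [3, 1, 1]
print(most_recurring(records, n=2))
print(list(frames_by_time(records)))
print(len(on_day(records, t0.date())))  # 4
```

The last helper keeps only the records of one day, which corresponds to the single-day view mentioned above.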
The single-day view was obtained simply by creating an ad hoc DataFrame that selects only the values containing the date entered by the user. It is important to underline that the user can also disable the heatmaps, in case they prevent the map from being displayed correctly.

An example of what the user sees is shown in Figs. 2 and 3. Figure 2 shows the 2 main clusters identified for a user and the usual paths taken between them, while Figure 3 shows a path taken by the user that was identified as anomalous; in particular, it is a deviation from the usual commute from home to work.

Figure 2: Graphic representation of clusters

Figure 3: Graphic representation of an anomalous path identified

3. Experiments
Two methodologies were considered to assess the accuracy of the system: the first relies on clustering evaluation metrics, while the second is based on questionnaires administered to users. Two methodologies were used because clustering evaluation metrics alone cannot establish whether the locations and paths identified as frequent or unusual for a user have been classified correctly; this is an answer that only the user can provide.

Regarding clustering evaluation metrics, three metrics were chosen [17]:
• Davies-Bouldin score: the score is defined as the average similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters that are farther apart and less dispersed result in a better score. The minimum score is zero, with lower values indicating better clustering.
• Calinski-Harabasz score: also known as the Variance Ratio Criterion. The score is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion, so higher values indicate denser and better separated clusters.
• Silhouette Coefficient: it is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.
The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

Nine users were involved in the testing phase. The monitoring period ranged between 10 and 14 days. At the end of this period, a 9-question questionnaire was administered to the users. A subset of the questions concerned the veracity of the displayed paths and places identified as normal or anomalous. A 5-value Likert scale (from Strongly Disagree to Strongly Agree) was adopted, and each answer was associated with a value (0, 25, 50, 75, 100). The average was used to estimate the degree of accuracy [22].

4. Results
The silhouette score was calculated for each user, obtaining an average value of 91%, with a minimum of 73% and a maximum of 98% (Table 1).

Table 1: Silhouette results
User      Silhouette
1         98%
2         88%
3         93%
4         88%
5         88%
6         95%
7         96%
8         73%
9         98%
Average   91%

The accuracy of the anomalies reported by the system, as evaluated by the users, reached a value of 87.5%. This result is close to the silhouette score and is therefore consistent with the previous classification data.

5. Conclusion and future development
In order to monitor the movements of children from infancy to adolescence, an Android app and a web platform have been developed. The Android app acquires the GPS parameters of the children's smartphones, while the web platform visually shows the user the data acquired by the app and the results of the analyses identifying any anomalous places or routes.
The difference between this solution and those already on the market is that current solutions usually offer only real-time monitoring or a movement history, while our solution automatically identifies anomalous routes, supporting parental control. For this purpose, DBSCAN was used as the clustering algorithm, and FoliumMap and Flask were used to build the web platform. The overall system was tested on 9 users, demonstrating an accuracy of 87.5% and suggesting that it can be used in real contexts. In future studies it is of primary importance to considerably extend the test sample in order to validate the results obtained so far, and to extend the web platform by integrating works that use reference data other than GPS parameters, for example from other sensors such as the accelerometer. On their own, the results obtained from GPS parameters are not always sufficient to establish that bullying or cyberbullying is taking place. For this reason, this work is only one part of a larger ongoing project in which the combined use of multiple sensors, apps and other technologies will make it possible to identify these phenomena.

6. Acknowledgements
This work is supported by the Italian Ministry of Education, University and Research within the PRIN2017 - BullyBuster project - A framework for bullying and cyberbullying action detection by computer vision and artificial intelligence methods and algorithms. CUP: H94I19000230006.

7. References
[1] M. Group, «Internet Top 20 Countries-Internet Users 2020,» 30 June 2019. [Online]. Available: https://www.internetworldstats.com/top20.htm.
[2] Q. Li, «Cyberbullying in schools: A research of gender differences,» School Psychol, vol. 27, no. 2, pp. 157-170, 2006.
[3] M. Dadvar and F. De Jong, «Cyberbullying detection: A step toward a,» 21st Int. Conf. Companion World Wide Web ( WWB ), pp. 121-125, 2012.
[4] L. Robinson, «Bullying and Cyberbullying,» 5 March 2020. [Online].
Available: https://www.helpguide.org/articles/abuse/bullying-and-cyberbullying.htm.
[5] P. S. Storm and R. D. Storm, «Cyberbullying by adolescents: A preliminary assessment,» Educ. Forum, vol. 70, no. 1, pp. 21-36, 2006.
[6] L. Betts, T. Baguley and S. Gardner, «Examining adults' participant roles in cyberbullying,» J. Social Pers. Relationships, vol. 36, no. 11-12, pp. 3362-3370, 2019.
[7] R. Ortega, P. Elipe, J. Mora-Merchán, J. Calmaestra and E. Vega, «The emotional impact on victims of traditional bullying and cyberbullying: A study of Spanish adolescents,» Zeitschrift Psychologie/J. Psychol., vol. 217, no. 4, pp. 197-204, 2009.
[8] F. Shaikh, M. Rehman and A. Amin, «Cyberbullying: A Systematic Literature Review to Identify the Factors Impelling University Students Towards Cyberbullying,» 21 Aug. 2020. [Online].
[9] N. Covertini, N. Logrillo, F. Manca and T. Palmisano, «Recommendation System using Hybrid Fuzzy Association Rules for Human Smart Cities,» 2018 AEIT International Annual Conference, Bari, Italy, 2018.
[10] D. Impedovo, F. Balducci, V. Dentamaro and G. Pirlo, «Vehicular Traffic Congestion Classification by Visual Features and Deep Learning Approaches: A Comparison,» Sensors, vol. 19, no. 5213, 2019.
[11] D. Impedovo, V. Dentamaro, G. Pirlo and L. Sarcinella, «TrafficWave: Generative Deep Learning Architecture for Vehicular Traffic Flow Prediction,» Sensors, vol. 9, no. 5504, 2019.
[12] N. Convertini, N. Dentamaro, D. Impedovo, G. Pirlo and L. Sarcinella, «A Controlled Benchmark of Video Violence Detection Techniques,» MDPI information, vol. 11, no. 321, 2020.
[13] V. Dentamaro, D. Impedovo and G. Pirlo, «Gait Analysis for Early Neurodegenerative Diseases Classification Through the Kinematic Theory of Rapid Human Movements,» IEEE Access, vol. 8, pp. 193966-193980, 2020.
[14] K. S. do Prado, «How DBSCAN works and why should we use it?,» 2 Apr 2017. [Online]. Available: https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80.
[15] M. S. Suchithra and M. L. Pai, «Data Mining based Geospatial Clustering for Suitable Recommendation system,» 2020 International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, pp. 132-139, 2020.
[16] M. Perumal and B. Velumani, «Design and development of a Spatial DBSCAN Clustering framework for location prediction- An optimization approach,» 2018 3rd International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, pp. 942-947, 2018.
[17] Mohantysandip, «A Step by Step approach to Solve DBSCAN Algorithms by tuning its hyper parameters,» 12 May 2020. [Online]. Available: https://medium.com/@mohantysandip/a-step-by-step-approach-to-solve-dbscan-algorithms-by-tuning-its-hyper-parameters-93e693a91289.
[18] M. Li, X. Bi, L. Wang and X. Han, «A method of two-stage clustering learning based on improved DBSCAN and density peak algorithm,» Computer Communications, vol. 167, pp. 75-84, 2021.
[19] M. A. Ahmed, H. Baharin and P. N. E. Nohuddin, «Analysis of K-means, DBSCAN and OPTICS Cluster algorithms on Al-Quran verses,» International Journal of Advanced Computer Science and Applications, vol. 11, no. 8, pp. 248-254, 2020.
[20] G. Huang, W. B. Qu and H. Y. Xu, «Traffic Accident Location Clustering Based on Improved DBSCAN Algorithm,» Jiaotong Yunshu Xitong Gongcheng Yu Xinxi/Journal of Transportation Systems Engineering and Information Technology, vol. 20, no. 5, pp. 169-176, 2020.
[21] U. Pandya, V. Mistry, A. Rathwa, H. Kachroo and A. Jivani, «2DBSCAN with Local Outlier Detection,» International Conference on Recent Advancement in Computer, Communication and Computational Sciences, RACCCS 2019, Ajmer, India, vol. 1097, pp. 255-263, 17 August 2019.
[22] E. R. S. H. Saputra, E. Utami and A. Nasiri, «Implementation of Location Based Service on Monitoring System of Visually Impaired Position with A-GPS Method,» 2018 3rd International Conference on Information Technology, Information System and Electrical Engineering (ICITISEE), pp. 271-275, 14 November 2018.