Learning Behavior Rate Models on Social Network Data

Aleksandra V. Toropova (a), Tatiana V. Tulupyeva (a, b, c)

(a) Saint-Petersburg State University, St. Petersburg, Russia
(b) St. Petersburg Institute for Informatics and Automation of RAS, St. Petersburg, Russia
(c) The North-West Institute of management RANEPA, St. Petersburg, Russia

Abstract
Intensity is one of the main characteristics of human behavior; data on behavior intensity allow reasonably accurate predictions of future behavior. However, it is often impossible to measure a behavior rate directly because of high cost, time consumption or restrictions on monitoring private lives, so tools are needed to estimate it indirectly. We offer two models for behavior rate evaluation, one with an expert-defined structure and one with a learned structure. Both models are Bayesian belief networks. They include information about the intervals in days between the last three behavior episodes of the study period, the minimum and maximum intervals between episodes, and the interval between the last episode of the study period and the first episode after the end of the study period. To validate the models we need a behavior whose rate can be observed directly, so we use users' posting behavior in a social network. Data from the social network Vkontakte for December 2019 were collected to learn the parameters of the models, to learn the structure of one of them, and to test both. The data describe posting on users' own "walls" during this month: the number of posts, the times of the last three posts, the maximum and minimum time intervals between posts in December 2019, and the time of the first post starting from January 2020.

1. Introduction
Many sciences that study behavior examine its characteristics. Intensity is one of the main characteristics of human behavior. In [Abr18, Aza16], the intensity of risky user behavior is used to draw conclusions about the success of socio-engineering attacks.
In [Gar17], the number of future training sessions is estimated from respondents' data on the intensity and effort of training. In [Jil20], efforts to improve farmers' markets are evaluated by comparing farmers' market shopping frequency and fruit and vegetable consumption among customers before and after the improvements. In [Kim20], it is shown how the intensity of parents' alcohol consumption affects stress in their children. The behavior rate evaluates the behavior intensity as the mean number of behavior episodes during a particular period [Tor19].

Russian Advances in Artificial Intelligence: selected contributions to the Russian Conference on Artificial intelligence (RCAI 2020), October 10-16, 2020, Moscow, Russia
Email: alexandra.toropova@gmail.com (A.V. Toropova); tvt@dscs.pro (T.V. Tulupyeva)
ORCID: 0000-0001-7311-6192 (A.V. Toropova); 0000-0003-3630-7971 (T.V. Tulupyeva)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Direct observation is the most reliable way to obtain a behavior rate value [May19, Reh19], but in many situations it is not possible because of high cost, time consumption or restrictions on monitoring private lives, so tools are needed to estimate the rate indirectly. One of the most popular approaches to obtaining information about human behavior is the self-report survey [New19]. In self-reports, respondents fill in questionnaires, polls or surveys whose questions may be of different types: open, closed (with a limited choice of answers) or rating scales (e.g. the Likert scale). This method has certain limitations: respondents can give unreliable answers because of memory issues, or because they adjust their answers to present themselves in the best light, etc.
Another approach to obtaining information about behavior intensity is the "diary" method, in which respondents record information about the behavior in a diary for a certain period of time and researchers then analyze these diaries.

In [Suv13a, Suv13b, Suv17, Suv14, Tor19], models based on Bayesian belief networks are proposed that use information about the last three episodes of behavior. However, all of these models take into account the moment of the interview, which is not an episode of the studied behavior and may therefore distort the behavior rate assessment. It thus makes sense to consider models that also use information about recent episodes of behavior but do not include information about the interview itself.

This paper presents two models for evaluating the behavior rate, one with an expert-defined structure and one with a learned structure. The models are Bayesian belief networks [Tul19]. They include information about the intervals between the last three episodes of the study period, the minimum and maximum intervals between episodes, and the interval between the last episode of the study period and the first episode after the end of the study period. Data on posting in the social network Vkontakte [Vko20a] during December 2019 were collected to build the structure of one of the models and to learn and test both models.

2. Data
Collecting a fairly large amount of data with reliably known behavior intensity is not always possible, for a number of reasons. Constant monitoring of a large group of subjects is practically impossible both technically and financially. A survey or an interview gives no guarantee that the information received is correct. Social networks are a good opportunity to obtain accurate data about users' behavior, because every action is timestamped and the behavior rate can be obtained directly.
Social networks create an ecosystem for different types of behavior: making connections, liking, disliking, following, expressing emotions, making statements, etc. Thus social networks provide a great opportunity for behavior research. However, this approach is not perfect either, since using a social media API also implies a number of limitations.

In our study, the data were collected from the social network Vkontakte [Vko20a], the largest in Russia [Sim20]. This social network also offers its users different types of activities. Each user has a so-called "wall", on which they can publish posts and repost other users' records; they can also leave comments on the "walls" of other users. Users can also follow each other, connect as friends, indicate relationships, like different types of content, and so on. Users may have a "private" profile, in which case access to their "wall" is restricted.

To validate the models we chose publishing posts on the user's own wall, i.e. posting. This is one of the simplest types of behavior that can be assessed properly: it is enough to consider the records on the user's wall, excluding records made by others.

To get data from Vkontakte, a dedicated program was written in C#. It uses the "wall.get" method provided by the VK API [Vko20b]. This method returns information about the last hundred records from a user's wall, and the request can be restricted to records that belong to the owner of the "wall" rather than to other users. This number of records is sufficient if one month is taken as the study period. The use of this method is limited to 5000 calls per day.

December 2019 was taken as the study period. Accounts of users who provided access to their "wall" were randomly selected and processed.
For each user, the following information was collected for December 2019: the times of the last three entries, the maximum and minimum time between entries, and the number of entries; in addition, the time of the first entry in 2020 was saved. Users with insufficient information were removed from the data-set. In this way, information about 6556 users was collected. 4556 records were used for learning the models, and the remaining 2000 records were used for testing them.

3. Models
We propose models based on Bayesian belief networks [Tul19]. This is a capable and popular tool that has found application in many branches of science [Tor15]; it is well suited to combining data from different sources and to probabilistic reasoning [Zha19]. In addition, there are many powerful software packages [Bay20, Net20, Scu10] that make it easier to work with this tool.

All calculations in this and the following sections were performed in R [R20] using the bnlearn package [Scu10], which supports work with Bayesian belief networks.

To work with Bayesian belief networks, all continuous data need to be discretized. Therefore, the values of the time-related variables (we use a day as the unit of measurement), i.e. 𝑡_𝑛𝑒𝑥𝑡, 𝑡_12, 𝑡_23, 𝑡_𝑚𝑖𝑛 and 𝑡_𝑚𝑎𝑥, were divided into the intervals 𝑡1 = (0; 0.1), 𝑡2 = [0.1; 0.5), 𝑡3 = [0.5; 1), 𝑡4 = [1; 7), 𝑡5 = [7; 10), 𝑡6 = [10; 20) and 𝑡7 = [20; ∞); and the values of the variable 𝜆 (we measure the behavior rate as the number of posts divided by the number of days in the month) were divided into the intervals 𝜆1 = (0; 0.1), 𝜆2 = [0.1; 0.2), 𝜆3 = [0.2; 0.3), 𝜆4 = [0.3; 0.5), 𝜆5 = [0.5; 0.7), 𝜆6 = [0.7; 1) and 𝜆7 = [1; ∞).

3.1. Behavior Rate Model with an Expert-Defined Structure
Figure 1 shows a behavior rate model with an expert-defined structure, which is a Bayesian belief network.
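The per-user quantities from Section 2 and the discretization from Section 3 can be illustrated with a short sketch. It is written in Python purely for illustration (the authors' calculations were done in R with bnlearn), and the function names, the dictionary keys, the default of 31 days and the treatment of a zero-length interval as belonging to the first bin are all assumptions of this sketch, not details given in the paper.

```python
from bisect import bisect_right
from datetime import datetime

SECONDS_PER_DAY = 86400.0

# Interior boundaries of the intervals t1..t7 (days) and lambda1..lambda7
# (posts per day) listed in Section 3; each interval is closed on the left.
T_EDGES = [0.1, 0.5, 1, 7, 10, 20]
LAMBDA_EDGES = [0.1, 0.2, 0.3, 0.5, 0.7, 1]


def bin_index(value, edges):
    """Return the 1-based number of the interval that contains value.

    Assumption of this sketch: a value of exactly 0 falls into interval 1.
    """
    return bisect_right(edges, value) + 1


def user_features(posts, first_next, days_in_period=31):
    """Compute raw model variables for one user (hypothetical helper).

    posts: datetimes of the user's own posts within the study period.
    first_next: datetime of the first post after the study period.
    """
    posts = sorted(posts)
    if len(posts) < 3:
        raise ValueError("at least three posts are needed in the study period")
    # Intervals in days between consecutive posts, oldest first.
    gaps = [(b - a).total_seconds() / SECONDS_PER_DAY
            for a, b in zip(posts, posts[1:])]
    return {
        "n": len(posts),
        "lambda": len(posts) / days_in_period,
        "t_12": gaps[-1],   # last vs. penultimate episode
        "t_23": gaps[-2],   # penultimate vs. third from the end
        "t_min": min(gaps),
        "t_max": max(gaps),
        "t_next": (first_next - posts[-1]).total_seconds() / SECONDS_PER_DAY,
    }


def discretize(features):
    """Map raw values to the interval labels used as node states."""
    out = {"n": features["n"]}  # n is left as a raw count in this sketch
    out["lambda"] = "lambda%d" % bin_index(features["lambda"], LAMBDA_EDGES)
    for key in ("t_12", "t_23", "t_min", "t_max", "t_next"):
        out[key] = "t%d" % bin_index(features[key], T_EDGES)
    return out


if __name__ == "__main__":
    december_posts = [
        datetime(2019, 12, 1, 12), datetime(2019, 12, 10, 12),
        datetime(2019, 12, 30, 12), datetime(2019, 12, 31, 0),
    ]
    raw = user_features(december_posts, datetime(2020, 1, 3, 0))
    print(discretize(raw))
```

For the made-up user in the example, four posts in 31 days give a rate of about 0.129 posts per day, which falls into 𝜆2; the last interval of half a day falls into 𝑡3, and the three days until the first post of 2020 fall into 𝑡4.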
The conditional probability tables attached to the network nodes are estimated from the data, as described in Section 3.3.

The node 𝜆 characterizes the behavior rate, 𝑡_𝑛𝑒𝑥𝑡 is the interval between the last episode of the study period and the first episode after the end of the study period, 𝑡_12 is the interval between the last and penultimate episodes of behavior, 𝑡_23 is the interval between the penultimate episode and the third episode from the end of the study period, 𝑡_𝑚𝑖𝑛 and 𝑡_𝑚𝑎𝑥 are the minimum and maximum intervals between episodes during the study period, and 𝑛 is the number of episodes during the study period.

The proposed model is based on the model from [Suv13a]. The difference is that, instead of the interval between the last episode and the moment of the interview, the model includes the interval between the last episode of the study period and the next episode, which occurred after the study period. This makes sense, since the moment of the interview is not an episode of the studied behavior and does not provide any useful information about the latter.

[Figure 1: Behavior Rate Model with an Expert-Defined Structure. Nodes: n, lambda, t_max, t_next, t_min, t_12, t_23.]

3.2. Behavior Rate Model with a Learned Structure
To construct the network structure from data, we used Hill-Climbing greedy search on the space of directed graphs [Scu10]. It is a score-based algorithm: such algorithms assign a score to each candidate Bayesian network with respect to the training data-set and try to maximize it. We used the Bayesian information criterion (BIC) as the quality score. BIC is equivalent to the Minimum Description Length (MDL) and is also known as the Schwarz Information Criterion [Sch78, Scu10, Gam10].

Figure 2 shows the resulting structure. As we can see, the connections between 𝜆, 𝑡_𝑛𝑒𝑥𝑡 and 𝑛 are preserved. 𝑡_12 and 𝑡_23 always lie between 𝑡_𝑚𝑖𝑛 and 𝑡_𝑚𝑎𝑥, which explains the arcs between these nodes.
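For reference, the score maximized by the search can be written out explicitly. The expression below is the usual BIC for a discrete Bayesian network in the rescaled convention used by bnlearn, in which larger (less negative) values are better; the symbols N_{ijk}, N_{ij}, r_i, q_i and d are notation introduced here for illustration and do not appear in the paper.

```latex
\mathrm{BIC}(G \mid D)
  = \sum_{i} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i}
      N_{ijk} \, \log \frac{N_{ijk}}{N_{ij}}
    \; - \; \frac{d}{2} \log N ,
\qquad
d = \sum_{i} (r_i - 1) \, q_i ,
\qquad
N_{ij} = \sum_{k=1}^{r_i} N_{ijk}
```

Here N is the number of training records, r_i is the number of states of node i, q_i is the number of joint configurations of its parents in the graph G, and N_{ijk} counts the records in which node i is in its k-th state while its parents are in their j-th configuration (terms with N_{ijk} = 0 contribute nothing). The first term is the log-likelihood at the maximum likelihood estimates N_{ijk} / N_{ij}, and the second term penalizes the number of free parameters d. Hill climbing starts from an initial graph and repeatedly applies the single arc addition, removal or reversal that most increases this score, stopping when no such change improves it.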
𝑡_𝑛𝑒𝑥𝑡 does not depend on 𝑡_𝑚𝑖𝑛 and 𝑡_𝑚𝑎𝑥, because these are extrema only within the study period, while 𝑡_𝑛𝑒𝑥𝑡 occurs after it. The arc between 𝜆 and 𝑡_𝑛𝑒𝑥𝑡 is reversed compared with the expert-defined structure, which may be due to specific characteristics of the training data-set.

[Figure 2: Behavior Rate Model with a Learned Structure. Nodes: n, lambda, t_max, t_next, t_min, t_12, t_23.]

3.3. Learning Parameters
At this step we have the structures of the two models. To define the models completely, their parameters were then learned from the training data-set. In other words, tables of conditional probabilities were constructed for all pairs of network vertices connected by an arc. For example, Table 1 is the conditional probability table for the pair 𝜆 – 𝑡_𝑛𝑒𝑥𝑡 in the model with the expert-defined structure. After that, the models can be used to predict behavior rates.

Table 1
Conditional Probability Table (rows: 𝑡_𝑛𝑒𝑥𝑡; columns: 𝜆)

        𝜆1      𝜆2      𝜆3      𝜆4      𝜆5      𝜆6      𝜆7
𝑡1     0.001   0       0.001   0.002   0       0       0.009
𝑡2     0.001   0.006   0.007   0.005   0.032   0.071   0.054
𝑡3     0.006   0.021   0.443   0.069   0.07    0.111   0.143
𝑡4     0.176   0.257   0.411   0.494   0.601   0.622   0.673
𝑡5     0.098   0.12    0.115   0.156   0.12    0.062   0.045
𝑡6     0.272   0.298   0.265   0.186   0.136   0.11    0.067
𝑡7     0.446   0.298   0.157   0.088   0.041   0.022   0.009

4. Comparison of the Models
Let us compare the resulting models. As expected from the construction algorithm, on the training set the structure shown in Fig. 2 scores higher than the initial one set by experts (Fig. 1), both in BIC (-38789 and -54736, respectively) and in log-likelihood (-36552 and -39762). On the test data-set, the quality measures of the learned structure are also higher (BIC: -18379 and -30624; log-likelihood: -16142 and -17113). However, since the main task of the models is to evaluate the behavior rate, let us move on to the next stage, namely, comparing their prediction quality.
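As a sketch of how such a prediction-quality comparison can be scored, the snippet below computes accuracy, average per-class accuracy, precision and recall from a confusion matrix, using the counts reported in Table 2 for the expert-defined model. The paper does not state which averaging scheme it used; the choices here (one-vs-rest accuracy averaged over classes, micro-averaged precision, macro-averaged recall) are assumptions, and the function name is hypothetical.

```python
# Counts reported in Table 2 (model with the expert-defined structure).
# Rows: real classes lambda1..lambda7; columns: predicted classes.
TABLE_2 = [
    [115, 125, 22, 9, 8, 19, 7],
    [61, 352, 51, 95, 12, 29, 16],
    [6, 119, 66, 141, 12, 4, 20],
    [2, 43, 54, 177, 23, 21, 27],
    [0, 8, 7, 67, 31, 18, 37],
    [0, 2, 3, 25, 12, 14, 40],
    [0, 1, 0, 16, 14, 16, 53],
]


def confusion_metrics(cm):
    """Summary metrics for a square confusion matrix (rows = real classes)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[i][i] for i in range(k)]
    real = [sum(cm[i]) for i in range(k)]                       # row sums
    pred = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums

    accuracy = sum(tp) / total
    # One-vs-rest accuracy of class i is (TP + TN) / total, where
    # TP + TN = total - FN - FP = total - real[i] - pred[i] + 2 * tp[i].
    one_vs_rest = [(total - real[i] - pred[i] + 2 * tp[i]) / total
                   for i in range(k)]
    avg_accuracy = sum(one_vs_rest) / k
    # Micro-averaged precision pools all predictions before dividing.
    micro_precision = sum(tp) / sum(pred)
    # Macro-averaged recall averages per-class recall over observed classes.
    recalls = [tp[i] / real[i] for i in range(k) if real[i] > 0]
    macro_recall = sum(recalls) / len(recalls)

    return {
        "accuracy": accuracy,
        "avg_accuracy": avg_accuracy,
        "precision": micro_precision,
        "recall": macro_recall,
    }


if __name__ == "__main__":
    for name, value in confusion_metrics(TABLE_2).items():
        print(f"{name}: {value:.3f}")
```

With these averaging choices, the Table 2 counts give 0.404, 0.830, 0.404 and 0.357, which agrees with the row reported for the expert-defined model in Table 4.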
After the models predict the behavior rates, these predictions can be compared with the known user posting intensities. Table 2 is the confusion matrix for the behavior rate model with the expert-defined structure, and Table 3 is the confusion matrix for the behavior rate model with the learned structure. The rows represent the real intensities, and the columns represent the intensities predicted by the model.

Table 2
Confusion Matrix for the Behavior Rate Model with the Expert-defined Structure (rows: behavior rates; columns: predicted behavior rates)

        𝜆1    𝜆2    𝜆3    𝜆4    𝜆5    𝜆6    𝜆7
𝜆1     115   125   22    9     8     19    7
𝜆2     61    352   51    95    12    29    16
𝜆3     6     119   66    141   12    4     20
𝜆4     2     43    54    177   23    21    27
𝜆5     0     8     7     67    31    18    37
𝜆6     0     2     3     25    12    14    40
𝜆7     0     1     0     16    14    16    53

Table 3
Confusion Matrix for the Behavior Rate Model with the Learned Structure (rows: behavior rates; columns: predicted behavior rates)

        𝜆1    𝜆2    𝜆3    𝜆4    𝜆5    𝜆6    𝜆7
𝜆1     40    197   2     65    0     0     0
𝜆2     6     348   7     249   0     0     0
𝜆3     0     118   9     239   0     0     0
𝜆4     0     64    20    263   0     0     0
𝜆5     0     17    10    140   0     0     0
𝜆6     0     9     4     82    0     0     0
𝜆7     0     8     4     88    0     0     0

Table 4 shows the main quality metrics for both models: accuracy, average accuracy, precision, and recall.

Table 4
Comparison of Quality Metrics

                                                         Accuracy   Avg. Accuracy   Precision   Recall
Behavior Rate Model with the Expert-defined Structure    0.404      0.83            0.404       0.357
Behavior Rate Model with the Learned Structure           0.332      0.809           0.332       0.212

As Table 4 shows, the difference in quality metrics is not very large, but the behavior rate model with the expert-defined structure achieved higher values on every metric. In addition, the problem here reduces to classification into seven classes, and a comparison of Table 2 and Table 3 shows that the model with the expert-defined structure, even when it makes an error, is likely to place the evaluated value in an adjacent class, while the model with the learned structure never predicts the classes with a high behavior rate (starting from 𝜆5).

5. Conclusion
Two models for evaluating the behavior rate, with an expert-defined and a learned structure, were presented. The models are Bayesian belief networks. They include information about the intervals in days between the last three episodes of the study period, the minimum and maximum intervals between episodes, and the interval between the last episode of the study period and the first episode after the end of the study period. Data on posting in the social network Vkontakte during December 2019 were collected to build the structure of one of the models, to learn the models' parameters and to test the models.

Although the model with the learned structure showed higher structure quality scores, the model with the expert-defined structure gave better behavior rate predictions. We therefore recommend using the behavior rate model with the expert-defined structure.

In future work we plan to consider other discretizations of the continuous data, because the discretization can affect the results. It would also be interesting to test the models on another data-set obtained from a different source.

These models can be applied in many areas concerned with behavior intensity, for example, in sociology, epidemiology, etc.

The main advantage of the suggested models is that sufficiently accurate results can be obtained from a very small amount of data: information about the last episodes and the extreme intervals is much easier to remember than all episodes of the behavior. Researchers can therefore use questionnaires that ask for this information to study the intensity of any behavior. If there is not enough real data for a training set, researchers can use data synthesized according to assumptions about the studied behavior.

These models can also be useful for research on behavior in social networks.
As we have seen, social network APIs do not always allow researchers to collect all the data they need, so these models make it possible to obtain more information about some kinds of user behavior. Moreover, since the suggested models include data about the next episode, which may happen in the future, information about the other nodes can be used to predict the approximate time of this next episode.

Acknowledgments
The research was carried out in the framework of the project on state assignment SPIIRAS No. 0073-2019-0003, with financial support from the Russian Foundation for Basic Research, projects No. 19-37-90120, No. 18-01-00626 and No. 20-07-00839.

References
[Abr18] M. V. Abramov, T. V. Tulupyeva, A. L. Tulupyev. Sotcioinzhenernye ataki: sotcialnye seti i ocenki zashchishchennosti polzovatelei. SPb.: GUAP, 266 p. ISBN 978-5-8088-1377-5, 2018. (In Russian).
[Aza16] A. A. Azarov, T. V. Tulupyeva, A. V. Suvorova, A. L. Tulupyev, M. V. Abramov, R. M. Iusupov. Sotcioinzhenernye ataki: problemy analiza. SPb.: Nauka. ISBN 978-5-0203-9592-3, 2016. (In Russian).
[Bay20] BayesFusion. https://www.norsys.com/, last accessed 2020/04/20.
[Gam10] J. Gamez, J. Mateo, J. Puerta. Learning Bayesian networks by hill climbing: Efficient methods based on progressive restriction of the neighborhood. Data Mining and Knowledge Discovery, 22, 106–148, doi: 10.1007/s10618-010-0178-6, 2010.
[Gar17] D. Garcia, T. Danielem, T. Archer. A brief measure to predict exercise behavior: the Archer-Garcia ratio. Heliyon, doi: 10.1016/j.heliyon.2017.e00314, 2017.
[Jil20] S. B. Jilcott Pitts, Q. Wu, W. Gray, M. J. Lyonnais. Examining changes in farmers' markets and in customers' farmers' market shopping frequency and fruit and vegetable purchase and consumption: evaluation data from the Partnerships to Improve Community Health Project, 2014–2017. Journal of Hunger and Environmental Nutrition, 15(1), 107–117, doi: 10.1080/19320248.2018.1512924, 2020.
[Kim20] S. Kim, W. Chae, S. H. Min, Y. Kim, S.-I. Jang. Alcohol consumption frequency of parents and stress status of their children: Korea national health and nutrition examination survey (2007–2016). International Journal of Environmental Research and Public Health, 17(1), doi: 10.3390/ijerph17010257, 2020.
[May19] G. R. Mayer, B. Sulzer-Azaroff, M. Wallace. Behavior analysis for lasting change. Cornwall-on-Hudson, NY: Sloan Publishing, 2019.
[Net20] Netica Bayesian network software package. https://www.norsys.com/, last accessed 2020/04/20.
[New19] D. Newsome, K. Newsome, T. C. Fuller, S. Meyer. How contextual behavioral scientists measure and report about behavior: A review of JCBS. Journal of Contextual Behavioral Science, 12, 347–354, doi: 10.1016/j.jcbs.2018.11.005, 2019.
[R20] R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/, last accessed 2020/04/20.
[Reh19] R. A. Rehfeldt. Clarifying the nature and purpose of behavioral assessment: A response to Newsome et al. Journal of Contextual Behavioral Science, 14, 37–39, doi: 10.1016/j.jcbs.2019.09.001, 2019.
[Sch78] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464, 1978.
[Scu10] M. Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35, 2010.
[Sim20] SimilarWeb. https://www.similarweb.com/fr/top-websites/russian-federation, last accessed 2020/04/20.
[Suv13a] A. V. Suvorova. Modeli i algoritmy analiza sverkhkorotkikh granulyarnykh vremennykh ryadov na osnove bayesovskikh setey doveriya. PhD Diss. [Models and Algorithms for analysis of super-short granular time series on the base of Bayesian belief networks.] St. Petersburg, 2013. (In Russian).
[Suv13b] A. V. Suvorova. Socially significant behavior modeling on the base of super-short incomplete set of observations. Information-measuring and Control Systems, 9(11), 34–38, 2013. (In Russian).
[Suv17] A. V. Suvorova. Models for respondents' behavior rate estimate: bayesian network structure synthesis. Proceedings of 2017 XX IEEE International Conference on Soft Computing And Measurements (SCM), 87–89, 2017.
[Suv14] A. V. Suvorova, A. L. Tulupyev, A. V. Sirotkin. Bayesian belief networks in problems of estimating the intensity of risk behavior. Journal of Russian Association for fuzzy systems and soft computing, 9(2), 115–129, 2014.
[Tor15] A. V. Toropova. Approaches to the data coherence diagnosis in bayesian belief network models. SPIIRAS Proceedings, 6(43), 156–178, 2015.
[Tor19] A. V. Toropova, T. V. Tulupyeva. Synthesis and learning of socially significant behavior model with hidden variables. Advances in Intelligent Systems and Computing, 875, 76–84, 2019.
[Tul19] A. L. Tulupyev, S. I. Nikolenko, A. V. Sirotkin. Osnovy teorii bayesovskikh setey: uchebnik. [Fundamentals of Bayesian Network Theory: A Textbook]. St. Petersburg, SPbSU Publ, 399 p., 2019. (In Russian).
[Vko20a] Vkontakte. https://vk.com/, last accessed 2020/04/20.
[Vko20b] Vkontakte for Developers. https://vk.com/dev/methods, last accessed 2020/04/20.
[Zha19] J. Zhang, H. Yue, X. Wu, W. Chen. A brief review of Bayesian belief network. Proceedings of the 31st Chinese Control and Decision Conference, 3910–3914, doi: 10.1109/CCDC.2019.8832649, 2019.