On statistical analysis and prediction of sap flow density for smart urban tree monitoring Anastasia Safargalieva1 , Irina Kochetkova1,2 , Elena Makeeva1 and Sergey Shorgin2 1 Peoples’ Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya St, Moscow, 117198, Russian Federation 2 Institute of Informatics Problems, Federal Research Center “Computer Sciences and Control” of the Russian Academy of Sciences, 44-2 Vavilova St, Moscow, 119333, Russian Federation Abstract The use of IoT technologies in various areas of our life, including environmental monitoring of green spaces, is increasing every year. One such solution is the TreeTalker sensor-based monitoring system, which collects data on various parameters of trees. One of the most important parameters is the rate of tree sap flow. Predicting the density of sap flow and studying the relationship between the parameters of trees and the environment is an urgent task. In this work, a statistical analysis of the data collected using the TreeTalker monitoring system was carried out. The data was pre-processed: outliers in the data were removed using mean value replacement, z-score replacement and cumulative moving average replacement. Groups of trees that were homogeneous in time were identified, and regression models were built to predict the sap flow parameter using auto-regressive moving average and linear modeling. The results obtained can be used for further studies of the dependence of the state of the tree on external factors. Keywords Smart Urban Nature, Smart Urban Tree, TreeTalker, time series, sap flow density, statistical analysis, prediction, 1. Introduction Monitoring of the health of the trees helps to achieve a comprehensive view of ecosystems. Nowadays environment is stressed by human activities. Providing a monitoring of trees health can answer a lot of questions about the effectiveness of the measures to maintain ecosystem’s health. TreeTalker(TT) is an IoT device that collects information about the health state of the trees based on various internal and external factors. The main factor of the tree which is considered the most important is sap flow [1], [2], [3], [4], [5]. This work has the following structure: the section 2 is devoted to a primary statistical analysis, work with the outliers in data with three methods: Mean Replacement, Z-score replacement and Cumulative Mean Average. In section 3 we perform prediction of the sap flow using linear models: auto-regressive moving average and linear regression. Workshop on information technology and scientific computing in the framework of the XI International Conference Information and Telecommunication Technologies and Mathematical Modeling of High-Tech Systems (ITTMM-2021), Moscow, Russian, April 19–23, 2021 Envelope-Open ansafargalieva@mail.ru (A. Safargalieva); gudkova-ia@rudn.ru (I. Kochetkova); elena-makeeva-96@mail.ru (E. Makeeva); sshorgin@ipiran.ru (S. Shorgin) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 64 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 2. Initial Data Analysis 2.1. Time Series Description TreeTalker sensors were installed on 195 trees in seven territories located in the center of Moscow, on the RUDN University campus and in parks in the Moscow region. During 2019, measurements were made of the parameters of 22 different tree species, different in age, as well as in different states of ”health”. Every hour, data was collected on eight parameters – sap flow, air temperature and humidity, negative pressure of water vapor in the leaves of a tree, angles of tree deviations from the axis, wood moisture, temperature inside the trunk, and the normalized relative vector of vegetation (NDVI) (Tab.1 and 2). Age group and VTA score are constants. For the following work the data from the Troitsk Territory was selected. Table 1 Parameters changing in time Designatiton Parameter Range of values Units of measurement F Flux 0.01 – 5.9 l*m− 2 h−1 t Air temperature 10-27 ℃ rh Pressure 27-28 Pa v Vapour-pressure deficit 0.5 - 2.5 g*m−3 Th, psi, phi Tree trunk axis movement -60 – -56, 3 – 9, -31 – -30 ° w Stem Humidity 25-40 ton*m−3 nt Temperature inside trunk 10-20 ℃ nd Normalized vegetation vector 0-100 % Table 2 Parameters not changing in time Designatiton Parameter Range of values AG Age group I – VI, where I – the youngest tree, VI – the oldest tree. VTA VTA score 1 – 7, where 1 – good condition, 7 – bad condition. The primary statistical analysis shows heterogeneity – there is no data for some periods of time due to damage to the electronics after heavy rain (Fig.1) and abnormally high-values (Fig.2). The presence of gaps in measurements leads to false statistical analysis, as well as incorrect modeling of dependencies. Therefore, the next task was to identify time-homogeneous groups of data [6]. 2.2. Working with Unevenly Spaced Data The presence of time gaps, that Fig.1 showed, makes it impossible to build models. The set of time values 𝑇 = {𝜏1 , 𝜏2 , ..., 𝜏𝑛 } consists of time-homogeneous subgroups 𝑇𝑗 = {𝜏𝑖𝑗 , 𝜏𝑖𝑗+1 , ..., 𝜏𝑖𝑗+𝑘 }, selected according to the algorithm [6]. For our model, we will consider as homogeneous data those values, the difference in arrival between which is 1 hour (Alg.1). 65 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 Figure 1: Air temperature: unevenly spaced data Figure 2: Sap flow density: outliers Data: 𝑇 = {𝜏1 , 𝜏2 , ..., 𝜏𝑛 } - set of time values Result: 𝑇𝑗 = {𝜏𝑖𝑗 , 𝜏𝑖𝑗+1 , ..., 𝜏𝑖𝑗+𝑘 } - time-homogeneous subgroups for 𝑡 = 1, 2, ... do if 𝑡[𝑖 + 1] − 𝑡[𝑖] > 1 then put 𝑡[𝑖 + 1] in the new group; else Leave 𝑡[𝑖 + 1] in the same group end end Algorithm 1: Algorithm to reveal time-homogeneous data After data selection we get homogeneous data regarding flux parameter (Fig.3 ). The graph of the dependence of sap flow on time within a homogeneous group showed the presence of abnormally high values of sap flow (Fig.4). 66 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 Figure 3: Air temperature: equally spaced data Figure 4: Sap flow density: equally spaced data with outliers 2.3. Working with Outliers Mean Replacement Method. The first way to work with outliers in your data is to replace outliers with mean values. 𝑥𝑖 – source row of one of eight parameters. 𝑦𝑖 – row after processing from outliers after applying the following algorithm: 𝑛 1 𝑋 ̄ = ∑ 𝑥𝑖 − mean value (1) 𝑛 𝑖=1 𝑥 , if 𝑥𝑖 ≤ 𝑋̄ 𝑦𝑖 = { 𝑖 (2) 𝑋 ̄ , if 𝑥𝑖 > 𝑋̄ The results of the replacement of outliers with mean values can be seen on Fig.5. From 67 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 the graph it is clear that the values are no more than 5 points of the sup tree flux units of measurement. Figure 5: Sap flow density: applying mean replacement method Z-score Substitution Method. The second way to process data from outliers is preliminary analysis of values using 𝑧 – estimation and subsequent processing of abnormally high values of the parameter [7]. For the 𝑧 – estimate, calculate the mean 𝑋 ̄ and standard deviation 𝑠𝑥 calculated for the set of processed data 𝑛 1 𝑠𝑥 = ∑(𝑥𝑖 − 𝑋 ̄ )2 (3) √ 𝑛 − 1 𝑖=1 𝑥𝑖 − 𝑋 ̄ 𝑧= , (4) 𝑠𝑥 𝑥 , if 𝑥𝑖 ≤ 𝑧 𝑦𝑖 = { 𝑖 (5) 𝑧, if 𝑥𝑖 > 𝑧 The results of the replacement of outliers with mean values can be seen on Fig.6. From the graph it is clear that the values are no more than 5 points of the sup tree flux units of measurement as it was with the mean replacement method. However, the structure of the curves is different. Cumulative Moving Average Method. The third method used to deal with outliers in this work is the cumulative moving average method. It is used for smoothing time series [6]. This method smooths outliers using the arithmetic mean of the original function 𝑥𝑖 over the entire period: 𝑛 1 𝑥 + 𝑥𝑛−1 + ... + 𝑥2 + 𝑥1 𝑦𝑖 = ∑ 𝑥𝑖 = 𝑛 , (6) 𝑛 𝑖=1 𝑛 68 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 Figure 6: Sap flow density: applying Z-score replacement method where 𝑦𝑖 is a new series smoothed using the cumulative moving average at the moment 𝑛 (Fig.7), 𝑛 is the number of intervals available for calculation, 𝑥𝑖 - the value of the original function at points. After using three methods mentioned above, we selected the data processed with z-score (Fig.6) as this method provides the better structure of the data - smooths it, the box-plots showed no outliers in the data. The methods of working with outliers are presented in the form of the algorithm (Alg.2). 3. Sap Flow Density Prediction 3.1. Preliminary Considerations The construction of a mathematical model of sap flow and prediction of sap flow will help to find out whether there is really a direct relationship between the sap flow and the air temperature, whether other factors affect the sap flow parameter. To analyze the obtained dependencies, 4 parameters for assessing the quality of the models were investigated: 𝑅2 , 𝐹 – statistics, root-mean-square error (RMSE) and mean absolute error (MAE) [8]. 3.2. ARMA Model The ARMA model is an auto-regressive moving average model. The formula is as follows: 𝑦𝑖 = 1.4220 + 0.7150𝑥𝑖 + 1.39𝜃 + 0.134, (7) where 𝑎 = 0.7150 is the parameter of the model, 𝑥 is the parameter of the regression model, 𝑏 = 1.39 is the coefficient of the moving average, 𝜃 is the parameter of the moving average, 𝑐 = 1.4240 is a constant. The graph of the sup flow prognostication shows deceleration of the flow (Fig.8). 69 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 Data: 𝑥𝑖 - original series, 𝑋̄ - mean value of original series, 𝑧 - z-score of original series Result: 𝑦𝑖 - processed series Case 1: Mean-value Replacement for 𝑖 = 1, 2, ... do if 𝑥[𝑖] > 𝑋 ̄ then 𝑦[𝑖] = 𝑋;̄ else 𝑦[𝑖] = 𝑥[𝑖] end end Case 2: Z-score Replacement for 𝑖 = 1, 2, ... do if 𝑥[𝑖] > 𝑧 then 𝑦[𝑖] = 𝑧; else 𝑦[𝑖] = 𝑥[𝑖] end end Case 3: Cumulative Moving Average for 𝑖 = 1, 2, ... do for 𝑛 = 1, 2, ... do 𝑛 ∑𝑗=1 𝑥[𝑗] 𝑦[𝑖] = 𝑛 end end Algorithm 2: Algorithm of Replacement of Outliers 3.3. Linear Regression We will forecast 25 observations ahead. We will draw the plot of the result, which turned out as a result of applying the linear regression model (Fig.9) [9]. Simulation of sap flow with different combinations of factors made it possible to identify the most effective models for describing the dependence of the tree sap flow. In the equation of the dependence of aspen sap flow on the territory of the Trotsk green spaces the parameter of negative pressure of water vapor in the leaves of the tree has the greatest influence: 𝑦𝐹 = 0.78 + 1.0538𝑥𝑡 + 0.3458𝑥𝑟ℎ − 4.7587𝑥𝑣 − (8) − 0.1761𝑥𝑡 ℎ − 0.8449𝑥𝑛𝑡1 + 1.4893𝑥𝑛𝑑 − 0.1945𝑥𝑊 4. Conclusion It was found in the work that the negative pressure of water vapor in the leaves is significantly correlated with the parameter of tree sap flow. After analyzing the data, it was found that the 70 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 Figure 7: Sap flow density: applying cumulative moving average method Figure 8: Sap flow density: applying ARMA model sap flow of trees depends on 7 factors. The results obtained during the work showed which parameters should be taken into account when analyzing the state of the tree and predicting time-dependent factors. This study will provide a starting point for more sophisticated modeling approaches. For example, predicting a model using Fourier series can provide more accurate 71 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 Figure 9: Sap flow density: applying linear regression Table 3 Model quality metrics Model 𝑅2 Prob (F-statistic) RMSE MAE Flux: t, rh, v, th, nt, W, nd 0.86 5.50E-16 0.79 0.6844 Flux: t, rh, v, th, psi, phi, nt, W, nd 0.68 6.02E-10 0.4152 0.3585 Flux: t, rh, v, nt, W 0.39 1.50E-23 0.3457 0.2903 parameter estimates. In addition, assessing the flow density of sap flow is the main goal of researching the health of green spaces to predict changes in their health status. The authors grateful to Dr. Alexey Yaroslavtsev (RUDN University) for providing the dataset from TreeTalker system. Acknowledgments The work was supported by the Russian Science Foundation, project 19-77-30012 (recipient Irina Kochetkova). This paper has been supported by the RUDN University Strategic Academic Leadership Program (recipient Elena Makeeva). References [1] V. Matasov, L. B. Marchesini, A. Yaroslavtsev, G. Sala, O. Fareeva, I. Seregin, S. Castaldi, V. Vasenev, R. Valentini, Iot monitoring of urban tree ecosystem services: Possibilities and challenges, Forests 11 (2020). doi:10.3390/f11070775 . 72 Anastasia Safargalieva et al. CEUR Workshop Proceedings 64–73 [2] V. Riccardo, B. L. Marchesini, S. Giovanna, A. Yaroslavtsev, V. Vasenev, S. Castaldi, New tree monitoring systems: from industry 4.0 to nature 4.0 (2019). doi:10.12899/asr- 1847 . [3] M. Fidino, S. Magle, Using fourier series to estimate periodic patterns in dynamic occupancy models, Ecosphere 8 (2017). doi:10.1002/ecs2.1944 . [4] D. Efrosinin, I. Kochetkova, N. Stepanova, A. Yarovslavtsev, K. Samouylov, R. Valentini, The fourier series model for predicting sapflow density flux based on treetalker monitoring system, Lecture Notes in Computer Science 12526 LNCS (2020) 198–209. doi:10.1007/ 978- 3- 030- 65729- 1_18 . [5] D. Efrosinin, I. Kochetkova, N. Stepanova, A. Yarovslavtsev, K. Samouylov, R. Valentini, Trees classification based on fourier coefficients of the sapflow density flux, Annales Mathematicae et Informaticae 53 (2021) 109–123. doi:10.33039/ami.2021.03.002 . [6] G. Box, G. Reinsel, G. Ljung, Time Series Analysis: Forecasting and Control, volume 68, 2016. doi:10.2307/2284112 . [7] W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, O’Reilly Media, Inc, Massachusetts, 2017. [8] A. Granier, A new method of sap flow measurement in tree stems, Annales Des Sciences Forestieres 42 (1985) 193–200. [9] J. A. Rice, Mathematical Statistics and Data Analysis, Duxbury Original Series, Mas- sachusetts, 2010. 73