=Paper=
{{Paper
|id=Vol-2667/paper17
|storemode=property
|title=Comparative analysis of football statistics data clustering algorithms based on deep learning and Gaussian mixture model
|pdfUrl=https://ceur-ws.org/Vol-2667/paper17.pdf
|volume=Vol-2667
|authors=Nikita Andriyanov
}}
==Comparative analysis of football statistics data clustering algorithms based on deep learning and Gaussian mixture model ==
Nikita Andriyanov
JSC "RPC "Istok" named after Shokin, Fryazino, Moscow Region, Russia
Telecommunication Department, Ulyanovsk State Technical University, Ulyanovsk, Russia
nikita-and-nov@mail.ru

Abstract—The paper considers the Gaussian mixture model and the possibilities of its application to clustering tasks. First, the case is considered in which the Gaussian mixture model is formed so that all parameters of the model are known. Next, the case is considered in which normally distributed data are approximated using the Gaussian mixture model. Finally, the article presents a study of the accuracy of clustering two-dimensional football statistics of medal-position teams, middle-table teams and worst teams of the top 5 European football championships: the English Premier League, Spanish La Liga, German Bundesliga, Italian Serie A and French Ligue 1. The results of the algorithm based on Gaussian mixture models are compared with the results of clustering performed using neural networks.

Keywords—Gaussian mixture models; machine learning; data clustering; data analysis; football statistics

I. INTRODUCTION

Today data mining, as intelligent analysis, allows specialists in various fields to greatly simplify their work. For example, on the basis of such an analysis, deliberately insolvent customers who apply to a bank for a loan can be screened out, and the number of taxi service orders can be predicted [1, 2]. Indeed, the ongoing digitalization of various areas of the economy and of state activity provides significant amounts of information. In this regard, the range of tasks solved using data mining is very wide.

One of the most interesting tasks in this area is the problem of data clustering [3, 4], which is closely related to recognition, classification and segmentation tasks [5-9]. In these tasks it is usually possible to distinguish several groups of objects. The simplest example is separating the male and female students in a group. Every person here can be described by height and weight, so each object in such a sample can be displayed as a point on a plane; in this case the plane is two-dimensional. The dimensionality can be expanded if a new parameter, for example hair length, is introduced; then the solution of the clustering task is simplified. Each group of objects can be represented by some ellipsoid on the plane, and the clustering decision for a particular new object will depend on which ellipsoid is closest to the point characterizing this object.

So the further research considers a clustering algorithm based on Gaussian mixture models (GMM) [10, 11], because quite often real data can be well approximated by Gaussian distributions. The comparison algorithm is clustering by a trained neural network. It should be noted that a comparison of the GMM and trained neural networks is performed here for the first time as part of the task of analyzing football statistics. In addition, a combination of the proposed clustering methods can lead to a new type of clustering based simultaneously on supervised and unsupervised learning.
II. BRIEF CLASSIFICATION OF CLUSTERING ALGORITHMS

Known clustering algorithms [3] can be divided according to two basic principles. Let us consider their main features.

First, clustering can be crisp or fuzzy. In the first case each object is assigned to exactly one group as a result of clustering. With fuzzy clustering a set of values is usually determined that characterizes the probability of each object belonging to each group, i.e. such clustering gives some probability distribution.

Secondly, cluster analysis can be flat (single-level) or hierarchical (multi-level). In the first case the initial set of objects is divided according to some criterion into several classes in the form of a single partition, for example, clustering university students only by gender. If further clustering, keeping the first level, separates the male students and the female students once more, then a deeper clustering is obtained: the original object in the sample can be characterized not just as a male or female student, but as an excellent ("A") male student, an excellent ("A") female student, a bad ("F") male student or a bad ("F") female student. Such separation provides hierarchical clustering. It should be noted that the deep Gaussian mixture model (DGMM) considered in [11] copes well with the goals of hierarchical clustering; moreover, the assignment of an object to a particular group is carried out there according to the principle of crisp clustering.

Finally, neural networks are gaining more and more popularity in clustering problems [12]. Depending on the training parameters and the type of network, various models for clustering can be obtained, and deep learning is now a very promising tool for such tasks.

Thus, before choosing a clustering algorithm, it is necessary first to formulate the clustering problem itself, and only then perform the data splitting.

III. GAUSSIAN MIXTURE MODEL

The application of flat, crisp clustering is considered on the example of the analysis of football statistics from the top 5 European championships (England, Spain, Germany, Italy, France). Since the problem of multilevel clustering is not posed, it is possible to use a GMM [10]. This is a model whose probability density function (PDF) is described by a sum of the PDFs of Gaussian distributions, and the number of terms in the sum is the number of clusters. Thus, the total distribution has several peaks; for each object during clustering the proximity to each peak is considered, and the peak at the smallest distance is selected. Moreover, each object can be characterized not by one but by several parameters, for which multidimensional PDFs are found.
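For clarity, the mixture density described above can be written in the standard form (this formulation is supplied here for reference and is not reproduced verbatim from the paper):

$$ f(\mathbf{x}) = \sum_{i=1}^{k} w_i \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad w_i > 0, \quad \sum_{i=1}^{k} w_i = 1, $$

where $k$ is the number of clusters, $w_i$ are the mixture weights, and $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ denotes the multivariate Gaussian PDF with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$; each peak of the total PDF corresponds to one component mean.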
Fig. 1 presents an example of the PDF of a GMM of three distributions with two parameters.

Fig. 1. PDF of a GMM of 3 distributions.

An analysis of Fig. 1 allows one to conclude that there are two groups of objects characterized by a large variance along one of the axes (ordinate or abscissa), and one group with approximately the same variance along both axes. In addition, three characteristic peaks, or mathematical expectations, can be seen in Fig. 1.

The advantage of using the GMM is that, for a given number of objects, the model itself estimates the component distributions. This allows real data to be approximated by such a model. Moreover, even if the number of clusters is not known in advance, it is possible to build several mixture models and choose the optimal one according to some criterion. Most often, the Akaike information criterion (AIC) [13] and the Bayesian information criterion (BIC) [14] are used. Application of these criteria allows one to cope with the problem of a priori uncertainty regarding the number of classes.

IV. CLUSTERING WITH A GAUSSIAN MIXTURE MODEL

Consider an example of applying the GMM to the clustering of teams playing in the European football championships of England, Spain, Germany, Italy and France. Only two parameters are included in the initial sample: goals scored and points. However, in order to make it more convenient to check the accuracy of clustering, it is a good idea to exclude some teams from the selection. Thus, the thinned sample includes 3 teams from the upper part of the tournament table (places 1-3), 3 teams from the middle of the table (places 9/8-11/10) and 3 teams from the lower part of the table (places 18/16-20/18). Such thinning is done for each championship. In addition, statistics on these teams are taken not only for the last season but also for the previous two seasons. On the one hand, this increases the information content of the sample; on the other hand, it can also lead to an increase in anomalous points (a "too successful", "too unsuccessful" or otherwise "strange" season for one team or in general). Fig. 2 shows the collected statistics: points are plotted along the abscissa axis, and goals scored along the ordinate axis.

Fig. 2. Statistics of the top 5 football championships for the seasons 2016/2017, 2017/2018 and 2018/2019.

From Fig. 2 it can be seen that the selected parameters have an almost linear relationship, and visually the most preferable division seems to be simply dividing by lines along the abscissa (points); in this case the values of 40 and 60 points can be chosen as visual thresholds. In fact, such a division yields only one erroneously clustered point. Fig. 3 shows 3 clusters according to the real championship tables.

Fig. 3. Clustering of teams into classes according to the championship tables.

An analysis of Fig. 3 shows that there is a point in the 3rd cluster which is closer to the center and the other points of the 1st cluster than to the cluster to which it really belongs.

Next it is necessary to approximate the statistics of Fig. 2 by GMMs with various parameters. Let us use the following parameters (a fitting sketch of this sweep is given after the list):

1) The number of clusters k = 1…5.

2) The covariance matrix (CM), which can be described by the following properties: diagonal/full and shared/unshared. The diagonal or full structure of the CM characterizes the relationships between the parameters of one cluster, and the shared or unshared structure of the CM characterizes the relationships between different classes. For the diagonal structure of the CM, the axes of the ellipse are parallel or perpendicular to the abscissa and ordinate axes, and for the shared structure, the dimensions and orientation of all ellipses are the same.

3) The regularization parameter R = 0.01 or R = 0.1, introduced to ensure a positive determinant of the CM.
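A minimal sketch of this parameter sweep, assuming scikit-learn's GaussianMixture as the fitting tool (the paper does not name its software, and scikit-learn's covariance types only approximately map to the diagonal/full and shared/unshared grid above, with "tied" playing the role of a shared CM):

```python
# Sketch of GMM model selection over k, covariance structure and
# regularization, scored by AIC/BIC as in Section III. All names and the
# placeholder data are illustrative; only the parameter grid follows the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(135, 2))  # placeholder for (points, goals) per team-season

candidates = []
for k in range(1, 6):                     # number of clusters k = 1..5
    for cov in ("diag", "full", "tied"):  # rough analogue of the CM grid
        for reg in (0.01, 0.1):           # regularization parameter R
            gmm = GaussianMixture(n_components=k, covariance_type=cov,
                                  reg_covar=reg, random_state=0).fit(X)
            candidates.append((gmm.bic(X), gmm.aic(X), k, cov, reg, gmm))

bic, aic, k, cov, reg, best = min(candidates, key=lambda t: t[0])  # smallest BIC
print(f"best model: k={k}, covariance={cov}, R={reg}, BIC={bic:.1f}")
labels = best.predict(X)  # crisp cluster index for each object
```

With real data in place of the placeholder array, the model minimizing BIC (or AIC) is kept, which resolves the a priori uncertainty about the number of classes mentioned above.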
By varying the above parameters, one can obtain several Gaussian mixture distributions, for which the AIC and BIC coefficients can then be calculated. Fig. 4a and Fig. 4b show the AIC and BIC coefficients, respectively, for the investigated football statistics with different parameters.

Fig. 4. AIC and BIC for various models: (a) AIC, (b) BIC.

According to Fig. 4, the minimum values of AIC and BIC are provided by the model with k = 3 clusters, which has a full and unshared CM structure with a regularization parameter R = 0.01. Fig. 5 shows the PDF of this model, and Fig. 6 shows the result of clustering using this model. Comparison with the clustering presented in Fig. 3 shows that the clustering error is 1.48%, i.e. two teams assigned to the wrong group. Thus, high accuracy was obtained during clustering using the GMM.

Fig. 5. PDF of the best approximating GMM.

Fig. 6. Data clustering using GMM.

V. CLUSTERING USING NEURAL NETWORKS

In this section clustering based on neural networks is performed. Since the sample size is small, a feed-forward network with back propagation of error, consisting of 1 layer of 15 neurons, is used. The network is trained on the data for the seasons 2016/2017 (training dataset) and 2017/2018 (validation dataset); the statistics of the season 2015/2016 are used as the test dataset. A pair of parameters, goals scored and points, is fed to the input of the network, and the cluster number is obtained at the output. Fig. 7 shows the structure of the neural network, and Fig. 8 shows the learning process.

Fig. 7. Neural network structure.

Fig. 8. Neural network training.

The analysis of Fig. 8 shows that the network converges quite quickly, by the 12th epoch, achieving minimal error on the validation data. Fig. 9 shows the correct clustering (a), clustering using the GMM (b) and clustering by the neural network (c).

Fig. 9. Comparison of clustering results: (a) clustering according to the tables, (b) GMM, (c) neural network.

Fig. 9 shows that the neural network also provides satisfactory clustering, with an error of 1.48%, i.e. 2 objects (teams). Moreover, whereas the Gaussian mixture model mistakenly assigned one team from the group of outsiders (worst teams) and one team from the group of leaders (medal-position teams) to the middle-table teams, the neural network incorrectly assigned two teams from the middle of the table to the teams of the upper part (medal-position teams). It should also be noted that the use of deep learning (increasing the number of layers to 5 and the number of neurons to 128) does not lead to improved results.
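A minimal sketch of this comparison classifier, assuming scikit-learn's MLPClassifier (the paper does not name its toolchain; the dataset arrays, their shapes and the early-stopping setup are placeholders, while the single hidden layer of 15 neurons follows the text):

```python
# Sketch of the feed-forward comparison network from Section V:
# one hidden layer of 15 neurons, trained by back propagation of error,
# with a held-out validation set used to stop training once the
# validation error stops improving (cf. convergence by epoch 12, Fig. 8).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(90, 2))  # placeholder: (goals, points), 2016/17 + 2017/18
y_train = rng.integers(0, 3, 90)    # placeholder: cluster labels from the tables
X_test = rng.normal(size=(45, 2))   # placeholder: season 2015/16 statistics

net = MLPClassifier(hidden_layer_sizes=(15,),  # 1 layer of 15 neurons
                    early_stopping=True,       # hold out validation data
                    validation_fraction=0.5,   # roughly one season of two
                    max_iter=1000,
                    random_state=0)
net.fit(X_train, y_train)
clusters = net.predict(X_test)  # cluster number for each test team-season
```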
VI. CONCLUSION

The paper studies data clustering algorithms using the example of clustering football statistics. Clustering algorithms based on the GMM and on a neural network are considered. A comparative analysis of clustering accuracy showed that, for the presented example, both algorithms provide the same result, with a clustering error of only 1.48%. However, the Gaussian mixture model looks preferable for several reasons. Firstly, it can determine the number of clusters by means of an information criterion. Secondly, training the neural network required data drawn from the very dataset for which clustering was performed. Thirdly, the neural network algorithm incurs additional computational costs for training. The results obtained indicate that, with the use of intelligent clustering algorithms, it is possible to build a more adequate team rating, since, for example, the FIFA rating existing today does not reflect the actual strength of teams. Thus, the use of the GMM for data mining is currently advisable. Moreover, in the future it is also planned to investigate the operation of the DGMM.

ACKNOWLEDGMENT

This work was supported by the RFBR and the Government of the Ulyanovsk Region Grant, Project No. 19-47-730011, and partly by RFBR Grant, Project No. 19-29-09048.

REFERENCES

[1] A.N. Danilov, N.A. Andriyanov and P.T. Azanov, "Ensuring the effectiveness of the taxi order service by mathematical modeling and machine learning," Journal of Physics: Conference Series, vol. 1096, pp. 1-8, 2018. DOI: 10.1088/1742-6596/1096/1/012188.
[2] N.A. Andriyanov and V.A. Sonin, "Using mathematical modeling of time series for forecasting taxi service orders amount," CEUR Workshop Proceedings, vol. 2258, pp. 462-472, 2018.
[3] K.V. Vorontsov, "Clustering and multidimensional scaling algorithms," Lecture course, Moscow State University, 2007. [Online]. URL: http://www.ccas.ru/voron/download/Clustering.pdf.
[4] I.A. Rytsarev, D.V. Kirsh and A.V. Kupriyanov, "Clustering media content from social networks using BigData technology," Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[5] V.B. Nemirovsky and A.K. Stoyanov, "Clustering face images," Computer Optics, vol. 41, no. 1, pp. 59-66, 2017. DOI: 10.18287/2412-6179-2017-41-1-59-66.
[6] Y. Tarabalka, J.A. Benediktsson and J. Chanussot, "Spectral-spatial classification of hyperspectral imagery based on partitional clustering techniques," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 8, pp. 2973-2987, 2009.
[7] N.A. Andriyanov and V.E. Dementiev, "Developing and studying the algorithm for segmentation of simple images using detectors based on doubly stochastic random fields," Pattern Recognition and Image Analysis, vol. 29, no. 1, pp. 1-9, 2019. DOI: 10.1134/S105466181901005X.
[8] N.A. Andriyanov and V.E. Dement'ev, "Application of mixed models of random fields for the segmentation of satellite images," CEUR Workshop Proceedings, vol. 2210, pp. 219-226, 2018.
[9] K.K. Vasiliev, V.E. Dementyiev and N.A. Andriyanov, "Using probabilistic statistics to determine the parameters of doubly stochastic models based on autoregression with multiple roots," Journal of Physics: Conference Series, vol. 1368, pp. 1-7, 2019. DOI: 10.1088/1742-6596/1368/3/032019.
[10] Y.A. Philin and A.A. Lependin, "Application of the Gaussian mixture model for speaker verification by arbitrary speech and counteracting spoofing attacks," Multicore processors, parallel programming, FPGAs, signal processing systems, vol. 1, no. 6, pp. 64-66, 2016.
[11] C. Viroli and G.J. McLachlan, "Deep Gaussian mixture models," Statistics and Computing, vol. 29, pp. 43-51, 2019. DOI: 10.1007/s11222-017-9793-z.
[12] J. Guérin and B. Boots, "Improving image clustering with multiple pretrained CNN feature extractors," arXiv preprint arXiv:1807.07760, 2018.
[13] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, pp. 716-723, 1974.
[14] H.S. Bhat and N. Kumar, "On the derivation of the Bayesian Information Criterion." [Online]. URL: https://faculty.ucmerced.edu/hbhat/BICderivation.pdf.