=Paper=
{{Paper
|id=Vol-2667/paper34
|storemode=property
|title=Text data mining using conversation analysis
|pdfUrl=https://ceur-ws.org/Vol-2667/paper34.pdf
|volume=Vol-2667
|authors=Igor Rytsarev
}}
==Text data mining using conversation analysis ==
Igor Rytsarev
Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS; Samara National Research University, Samara, Russia
rycarev@gmail.com

Abstract—This paper suggests an algorithm of text data mining based on conversation analysis. Natural languages develop dynamically: new semantic units are constantly introduced into spoken language, and the chains of dependency graphs of semantic units are therefore constantly being rebuilt. Under these conditions, the paper proposes a method for identifying synonyms based on conversation analysis. The proposed method has been tested on data collected from social networks.

Keywords—Social networks, Data Mining, Algorithms

I. INTRODUCTION

Social networks are currently undergoing turbulent growth: every day, users send billions of messages and submit billions of comments. The analysis of this material has a great impact on many areas of business; for example, it is hard to overestimate the influence of internet marketing on the promotion of goods and services. However, to use these mechanisms effectively, it is necessary to understand the demands of users. The source of such information can be the materials published by users of social networks, as well as the shares and reposts made by users and entire communities [1-7]. Thus, the problem of determining the closeness of text units in the social network Vkontakte using BigData technology, considered in this paper, is a relevant objective and a task of considerable scientific importance in the field of data analysis.

II. DATA COLLECTION FROM SOCIAL NETWORKS

The social network Vkontakte was selected as the data source for this research. The reasons for this choice are as follows:
• the network provides open access to its data (there are no restrictions on accessing the server data);
• Vkontakte is the most popular social network in Russia and the fifth most popular social network in the world;
• Vkontakte is a full-fledged social network (unlike Twitter and Instagram, which are microblogs) and allows users to create thematic communities, which are particularly interesting for this study.

As part of this study, a Python software package was developed, containing an authorization module, a data collection module, and a filtration module. The package collects data and filters them so that only the relevant information is kept.

Within this study, the developed software package was used to collect more than 5,000 posts and over 170,000 comments from the two most popular communities of the city of Samara (“Podslushano Samara” and “Uslyshano Samara”).
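The collection package itself is not published alongside the paper, so the snippet below is only a minimal sketch of how such a data collection module could look, assuming the public Vkontakte REST API (the wall.get method, API version 5.131) and a valid access token read from an environment variable; the community short name in the example is illustrative.

<pre>
# Minimal sketch of a Vkontakte collection step (illustrative, not the paper's package).
# Assumes a valid access token in the VK_TOKEN environment variable and API version 5.131.
import os
import time
import requests

API_URL = "https://api.vk.com/method/wall.get"
API_VERSION = "5.131"
TOKEN = os.environ["VK_TOKEN"]

def collect_posts(domain, total=1000):
    """Download up to `total` wall posts of a public community given by its short name."""
    posts, offset = [], 0
    while offset < total:
        response = requests.get(API_URL, params={
            "domain": domain,                   # community short name, e.g. "podslushanosamara" (illustrative)
            "offset": offset,
            "count": min(100, total - offset),  # wall.get returns at most 100 posts per call
            "access_token": TOKEN,
            "v": API_VERSION,
        }).json()
        items = response.get("response", {}).get("items", [])
        if not items:
            break
        posts.extend(items)
        offset += len(items)
        time.sleep(0.4)                         # stay well under the API rate limit
    return posts

if __name__ == "__main__":
    posts = collect_posts("podslushanosamara", total=500)
    # A crude stand-in for the paper's filtration module: keep only non-empty texts.
    texts = [p["text"] for p in posts if p.get("text")]
    print(f"collected {len(texts)} non-empty posts")
</pre>

Comments could be gathered in the same way with the wall.getComments method; the authorization module of the paper's package presumably obtains the token that is simply read from the environment here.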
III. DETERMINATION OF THE CLOSENESS OF TEXT UNITS BASED ON CONVERSATION ANALYSIS

Conversation analysis, i.e. the study of the structures and formal properties of a language in its social and economic application, is related to all major areas of ethnomethodological research.

Initially, conversation analysis was intended only for the study of verbal, everyday speech and, more than that, only for conversations between several interlocutors. H. Sacks, the creator of the method, drew scientists' attention to the fact that conversations are central to the social world.

A conversation is necessarily organized: it implies the existence of an order that does not need to be explained again and again during the exchange of phrases. This order is also needed for the spoken words to be clear to all the conversation participants. Conversation demonstrates the social, interactive competence of people, who are willing to explain their own behavior and to interpret the behavior of their interlocutors. Inside the local sequences of conversation, and only there, social institutions are finally “spoken into existence”. As a result, the smallest and seemingly insignificant details of a conversation actually become a means of actualizing the most important social institutions.

The goal of conversation analysts is to describe the social practices and expectations that interlocutors rely on when constructing their own behavior and interpreting the behavior of others.

From the point of view of Garfinkel and Sacks, conversation analysis focuses on particular cases as opposed to the idealization that is inevitably connected with any theoretical generalization. In their opinion, idealization impedes scientific development, since any typology is only loosely connected with the content of the real cases on which it is supposed to be based. Sacks sought to develop a method of analysis that would remain at the level of primary data, raw material, and specific, isolated events of human behavior. In contrast to classical sociology, he argued that the details of any spontaneous human interaction are strictly organized, to an extent that allows their formal description.

On the basis of the above prerequisites, the peculiarities of conversation analysis can be formulated as follows. First, the method follows the data, i.e. the analysis is based on empiricism without relying on (possibly) predetermined hypotheses. Second, the smallest details of the text are considered an analytical resource rather than an obstacle to be discarded. Third, the authors of the method are convinced that the order in the organization of the details of everyday speech exists not only for researchers but, first and foremost, for the people who construct this speech [8,9].

This idea formed the basis for the study. Initially, it was suggested that, on a large data set, closely related text units have similar use-distance vectors V. Such a vector describes how two text units relate to each other within the data: the index i indicates the distance between the units, Vi is the number of combinations of the two units at that distance, and V0 is the total number of joint uses of the two text units within one sentence; these values serve as the metric.

The data collected from the Vkontakte social network were pre-processed: each text unit was brought to its normal form (the pymorphy2 package was used for this), and the necessary statistics (word count, maximum sentence length) were extracted. The next step was to create a distance matrix.
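The construction of the use-distance vectors is described only verbally above, so the following is a minimal sketch of one possible reading of it, assuming pymorphy2 for normalization (as named in the paper); the helper names and the maximum distance of 20 are illustrative.

<pre>
# One possible reading of the use-distance vector V described above (an assumption,
# not the paper's exact code): V[0] is the total number of joint uses of the two units
# within a sentence, V[i] (i >= 1) the number of their co-occurrences at word distance i.
import re
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def normalize(sentence):
    """Split a sentence into words and bring each word to its normal form."""
    words = re.findall(r"\w+", sentence.lower())
    return [morph.parse(w)[0].normal_form for w in words]

def use_distance_vector(sentences, unit_a, unit_b, max_dist=20):
    """Build the vector V for a pair of text units over a list of sentences."""
    v = [0] * (max_dist + 1)
    for sentence in sentences:
        tokens = normalize(sentence)
        positions_a = [i for i, t in enumerate(tokens) if t == unit_a]
        positions_b = [i for i, t in enumerate(tokens) if t == unit_b]
        for i in positions_a:
            for j in positions_b:
                d = abs(i - j)
                if 1 <= d <= max_dist:
                    v[d] += 1   # combinations of the two units at distance d
                    v[0] += 1   # total number of joint uses within one sentence
    return v

# Example: how two (Russian) units are used across a couple of collected sentences.
sentences = ["Сегодня в Самаре хорошая погода", "Погода в городе сегодня отличная"]
print(use_distance_vector(sentences, "погода", "сегодня"))
</pre>

On this reading, the distance matrix mentioned above is assembled by comparing such vectors pairwise.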
A cosine distance was used to calculate the distance between two vectors:

\( \mathrm{distance} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} \)   (1)

The results are shown in Fig. 1.

Fig. 1. The result of distance calculation between distance vectors.

The calculated distances shown in Fig. 1 are filtered by the distance value (0 = close, 1 = far). The resulting pairs of words can be (conditionally) divided into three categories (the proposed interpretation and division are not exact and reflect only the point of view of the author of this article):
• dark grey – the most accurate matches (40%);
• white – pairs of words that can be (conditionally) considered synonyms (33%);
• grey – antonyms (27%).

The results of the proposed approach suggest that, when analyzing text data, a graph of word interchangeability can easily be constructed and used to extract contextual meaning from the data set.

IV. APPLYING CONVERSATION ANALYSIS TO THE STATISTICAL DEFINITION OF THE AUTHOR OF A LITERARY TEXT

Conversation analysis showed good results in the problem of determining the closeness of text units, so it was proposed that the same approach could be used for the statistical definition of the author of a literary text.

The main idea of this part of the study is to build multidimensional vectors that store the distances between each pair of words in a text. It has been suggested that by comparing the distances between pairs of words it is possible to estimate the degree of closeness of text fragments.

The study can be roughly divided into two tasks:
1. Data preparation.
2. Determining the optimal text fragment size for comparison between texts.

For the first task, it was suggested to prepare the sets of text data by the sliding window method. The sliding window method is a transformation algorithm that forms a research data set from the source text. Here, the window is a group of consecutive text fragments combined into one text of the resulting set. During the operation of the algorithm, the window is shifted along the subchapters of the text by one unit at a time, and each position of the window forms one text. An example of the method's operation is shown in Fig. 2.

Fig. 2. An example of the sliding window method.

The second task is to compare the data obtained with different window sizes. To do this, we took the windows that include the first element of the data set and reduced them to the same size (by excluding the word sets that were not included in the smaller window). The Pearson correlation coefficient between the resulting matrices was then calculated; the results for different window sizes are shown in Table I.

The results of the second task suggest that it is possible to analyze only a part of the text rather than the whole text: as can be seen from Table I, the largest gain in correlation is achieved by increasing the window size up to 5 units, after which the gain decreases.
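As a rough illustration of the comparison just described, here is a minimal sketch under stated assumptions: the source text is represented by placeholder fragments, each set of windows is scored with simple word-count vectors and the cosine measure of Eq. (1) (a stand-in for the word-pair distance vectors used in the paper), and two window sizes are compared with the Pearson correlation coefficient. None of this is the authors' published code.

<pre>
# A minimal sketch (an assumed pipeline, not the paper's code) of the two steps above:
# sliding-window grouping of text fragments, a cosine matrix per window size as in Eq. (1),
# and comparison of two window sizes via the Pearson correlation coefficient.
import numpy as np
from scipy.stats import pearsonr

def sliding_window(fragments, size):
    """Shift a window of `size` fragments along the text one fragment at a time;
    each window position is joined into one text."""
    return [" ".join(fragments[i:i + size]) for i in range(len(fragments) - size + 1)]

def count_vectors(texts):
    """Word-count vectors over a shared vocabulary (a stand-in for the paper's
    word-pair distance vectors)."""
    vocab = sorted({w for t in texts for w in t.split()})
    index = {w: j for j, w in enumerate(vocab)}
    m = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.split():
            m[i, index[w]] += 1
    return m

def cosine_matrix(vectors):
    """Pairwise matrix of Eq. (1): cos(theta) = (A . B) / (||A|| ||B||)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return (vectors @ vectors.T) / (norms * norms.T)

fragments = [f"subchapter {i} of the analysed text" for i in range(30)]  # placeholder fragments
small = cosine_matrix(count_vectors(sliding_window(fragments, size=3)))
large = cosine_matrix(count_vectors(sliding_window(fragments, size=5)))

# Crop both matrices to a common size (the windows covering the first fragments)
# and compare them with the Pearson correlation coefficient, as in Table I.
n = min(small.shape[0], large.shape[0])
r, _ = pearsonr(small[:n, :n].ravel(), large[:n, :n].ravel())
print(f"Pearson correlation between window sizes 3 and 5: {r:.2f}")
</pre>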
TABLE I. PEARSON CORRELATION COEFFICIENT VALUES BETWEEN DIFFERENT WINDOW SIZES

Window size    1     2     3     4     5     6     7     8     9     10
1              -    0.26  0.32  0.57  0.78  0.79  0.80  0.81  0.83  0.85
2              -     -    0.34  0.58  0.79  0.80  0.80  0.83  0.84  0.87
3              -     -     -    0.59  0.81  0.80  0.81  0.84  0.84  0.87
4              -     -     -     -    0.81  0.82  0.84  0.85  0.86  0.87
5              -     -     -     -     -    0.83  0.84  0.86  0.88  0.90
6              -     -     -     -     -     -    0.85  0.86  0.90  0.92
7              -     -     -     -     -     -     -    0.89  0.90  0.92
8              -     -     -     -     -     -     -     -    0.90  0.93
9              -     -     -     -     -     -     -     -     -    0.94
10             -     -     -     -     -     -     -     -     -     -

V. CONCLUSION

This paper has investigated the possibility of applying conversation analysis to the analysis of text data from social networks and has shown that the approach is applicable to context analysis for establishing logical chains between texts. The main remaining problem is the interpretation of the results, since the patterns can be implicit and can vary depending on the context in which the text units are used. The application of conversation analysis to the statistical definition of the author of a literary text has also been studied; this approach has shown its effectiveness, and the optimal size of the data set was determined. In the future, the author plans to continue the research in this area using approaches based on machine learning and other NLP methods.

ACKNOWLEDGMENT

The research was supported by the Ministry of Science and Higher Education of the Russian Federation (Grant # 0777-2020-0017) and partially funded by RFBR, project numbers # 19-29-01135 and # 19-31-90160.

REFERENCES

[1] I.A. Rytsarev, A.V. Kupriyanov, D.V. Kirsh and R.A. Paringer, “Research and analysis of messages of users of social networks using BigData technology,” CEUR Workshop Proceedings, vol. 2416, pp. 504-509, 2019.
[2] A.F.R. Araújo, V.O. Antonino and K.L. Ponce-Guevara, “Self-organizing subspace clustering for high-dimensional and multi-view data,” Neural Networks, vol. 130, pp. 253-268, 2020.
[3] D.L. Golovashkin and N.L. Kasanskiy, “Solving diffractive optics problem using graphics processing units,” Optical Memory and Neural Networks (Information Optics), vol. 20, no. 2, pp. 85-89, 2011. DOI: 10.3103/S1060992X11020019.
[4] R. Deng, “Research on the Model Construction and Development of Computer Information Acquisition System,” IOP Conference Series: Materials Science and Engineering, vol. 740, no. 1, 012143, 2020. DOI: 10.1088/1757-899X/740/1/012143.
[5] V. Sanz, A. Pousa, M. Naiouf and A. De Giusti, “Efficient Pattern Matching on CPU-GPU Heterogeneous Systems,” Lecture Notes in Computer Science, vol. 11944 LNCS, pp. 391-403, 2020.
[6] A.S. Mukhin, I.A. Rytsarev, R.A. Paringer, A.V. Kupriyanov and D.V. Kirsh, “Determining the proximity of groups in social networks based on text analysis using big data,” CEUR Workshop Proceedings, vol. 2416, pp. 521-526, 2019.
[7] I.A. Rytsarev, D.V. Kirsh and A.V. Kupriyanov, “Clustering of media content from social networks using BigData technology,” Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[8] O.G. Isupova, “Conversion Analysis: Representation of Value,” Sociology: methodology, methods, mathematical modeling (4M), vol. 15, pp. 33-52, 2002.
[9] J. Meredith, “Conversation analysis and online interaction,” Research on Language and Social Interaction, vol. 52, no. 3, pp. 241-256, 2019.