=Paper= {{Paper |id=Vol-2667/paper34 |storemode=property |title=Text data mining using conversation analysis |pdfUrl=https://ceur-ws.org/Vol-2667/paper34.pdf |volume=Vol-2667 |authors=Igor Rytsarev }} ==Text data mining using conversation analysis == https://ceur-ws.org/Vol-2667/paper34.pdf
          Text data mining using conversation analysis
                                                                   Igor Rytsarev
                       Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS;
                                                      Samara National Research University
                                                                   Samara, Russia
                                                               rycarev@gmail.com


   Abstract—This paper proposes a text data mining algorithm based on conversation analysis. Natural languages are developing dynamically, and new semantic units are constantly being introduced into the spoken language. Under these conditions, chains of dependency graphs of semantic units are constantly being rebuilt. This paper proposes a method for identifying synonyms based on conversation analysis. The proposed method has been tested on data collected from social networks.

    Keywords—Social networks, Data Mining, Algorithms

                       I.    INTRODUCTION

    Social networks are currently growing rapidly: every day, users send billions of messages and submit billions of comments. Analysis of these data has a great impact on many areas of business. For example, it is hard to overestimate the influence of internet marketing on the promotion of goods and services. However, to use these mechanisms effectively, it is necessary to understand the demands of users. A source of such information can be the materials published by users of social networks, as well as the shares and reposts made by users and entire communities [1-7]. Thus, determining the closeness of text units in the social network Vkontakte using BigData technology, the problem considered in this paper, is a relevant task of scientific importance in the field of data analysis.

          II.   DATA COLLECTION FROM SOCIAL NETWORKS

    The social network Vkontakte was selected as the data source for this research. The reasons for this choice are as follows:

          • the network provides open access to its data (no restrictions on accessing the server data);
          • Vkontakte is the most popular social network in Russia and the fifth most popular social network in the world;
          • Vkontakte is a full-fledged social network (unlike Twitter and Instagram, which are microblogs) that allows thematic communities to be created, which are of particular interest for this study.

    As part of this study, a Python software package was developed, containing an authorization module, a data collection module, and a filtering module. This package makes it possible to collect data and filter them so that only the relevant information is retained.

    Within this study, the developed software package was used to collect more than 5,000 posts and over 170,000 comments from the two most popular communities of the city of Samara (“Podslushano Samara” and “Uslyshano Samara”).

          III. DETERMINATION OF THE CLOSENESS OF TEXT UNITS BASED ON CONVERSATION ANALYSIS

    Conversation analysis, i.e., the study of structures and formal properties of a language in its social and economic application, is related to all major areas of ethnomethodological research.

    Initially, conversation analysis was intended only for the study of oral, everyday speech, and moreover only of conversations between several interlocutors. H. Sacks, the creator of the method, drew the attention of scientists to the fact that conversations are central to the social world.

    A conversation is necessarily organized: it implies the existence of an order that does not need to be explained again and again during the exchange of phrases. This order is also needed for the spoken words to be clear to all the conversation participants. A conversation shows the social, interactive competence of people willing to explain their behavior and to interpret the behavior of their interlocutors. Inside the local sequences of conversation, and only there, social institutions are finally “spoken into existence”. As a result, the smallest and seemingly insignificant details of the conversation actually become a means of actualizing the most important social institutions.

    The goal of conversation analysts is to describe the social practices and expectations that interlocutors rely on when constructing their own behavior and interpreting the behavior of others.

    From the point of view of Garfinkel and Sacks, conversation analysis focuses on particular cases, as opposed to the idealization that is inevitably connected with any theoretical generalization. In their opinion, idealization impedes scientific development, since any typology has little connection with the content of the real cases on which it is supposed to be based. Sacks sought to develop a method of analysis that would remain at the level of primary data, raw material, and specific, isolated events of human behavior. In contrast to classical sociology, he argued that the details of any spontaneous human interaction are strictly organized, to an extent that allows their formal description.

    On the basis of the above premises, the distinctive features of conversation analysis can be formulated as follows. First, the method follows the data, i.e., the analysis is empirical and does not rely on predetermined hypotheses. Second, the smallest details of the text are considered an analytical resource and not an obstacle to be discarded. Third, the authors of the method are convinced that the order in the organization of the details of everyday speech exists not only for researchers but, first and foremost, for the people who construct this speech [8,9].



Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

    This idea formed the basis for the study. Initially, it was suggested that, on a large data set, closely related text units have similar use distance vectors V. The vector V shows how two text units relate to each other within the data: the index i indicates the distance between the units, the component V_i is the number of co-occurrences of the two units at that distance, and V_0 is the total number of uses of the two text units within one sentence. These values serve as metrics.

    The data collected from the Vkontakte social network were pre-processed: each text unit was brought to its normal form (the pymorphy2 package was used for this). The data were then processed to extract the necessary statistics (WordCount, maximum sentence length). The next step was to create a distance matrix.

    A cosine distance was used to calculate distances between two vectors:

$$\mathit{distance} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}} \tag{1}$$

    The results are shown in Figure 1.

Fig. 1. The result of distance calculation between distance vectors.

    The calculated distances shown in Figure 1 are filtered by the distance value (0 is close, 1 is far). The proposed pairs of words can be (conditionally) divided into three categories (this interpretation of the results and the division are not exact and reflect only the point of view of the author of the article):

          • Dark grey: the most accurate matches (40%);
          • White: the pair of words can be (conditionally) considered synonyms (33%);
          • Grey: antonyms (27%).

    The results of the proposed approach suggest that it is easy to construct a graph of interchangeability of words when analyzing text data and to use it to extract contextual meaning from the data set.

          IV. APPLYING CONVERSATION ANALYSIS TO THE STATISTICAL DEFINITION OF THE AUTHOR OF A LITERARY TEXT

    Conversation analysis showed good results in the problem of determining the closeness of text units; therefore, it was hypothesized that this approach can also be used for the statistical definition of the author of a literary text.

    The main idea of the study is to build multidimensional vectors that store the distances between each pair of words in the text. It has been suggested that, by comparing the distances between pairs of words, it is possible to estimate the degree of closeness of text fragments.

    This study can be roughly divided into two tasks:

          1.   Data preparation.
          2.   Determining the optimal text fragment size for comparison between texts.

    To check the first stage of this hypothesis, it was suggested to prepare sets of text data by the sliding window method. The sliding window method is a transformation algorithm that forms, from the source text, a data set suitable for research.

    Here, the window is the span of consecutive subchapters whose text is used together in the research, and the window size is the number of subchapters it contains. During the operation of the algorithm, the window is shifted along the subchapters of the text by one measurement unit, and each position of the window forms one text. An example of the method's operation is shown in Figure 2.

    The next step, in the second stage of the study, is to compare the data obtained with different window sizes. To do this, we took the windows that include the first element of the data set. These windows were reduced to the same size (by excluding the word sets that were not included in the smaller window). Next, the Pearson correlation coefficient between the matrices was calculated. The resulting coefficients for different window sizes are shown in Table 1.

    The results of the second stage suggest that it is possible to use only a part of the text, rather than the whole text, when analyzing text data. It can be seen from Table 1 that the correlation grows fastest as the window size increases to 5 units, after which the growth slows down.

                                                   V. CONCLUSION

    This paper has investigated the possibility of applying conversation analysis to the analysis of text data from social networks and has shown that this approach is applicable to context analysis for establishing logical chains between texts. The main problem is the interpretation of the results, since the patterns can be implicit and can vary depending on the context in which text units are used. The application of conversation analysis to the statistical definition of the author of a literary text has also been studied. This approach has shown its effectiveness. The optimal size of the data set was determined. In the future, the author plans to continue research in this area and to use approaches based on machine learning and other NLP methods.
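
    As an illustration of the measure described in Section III, the following sketch builds a use distance vector V for a pair of words and compares two such vectors with the cosine formula of Eq. (1). It is a minimal sketch under stated assumptions and not the author's implementation: the names `distance_vector` and `cosine` are hypothetical, sentences are assumed to be already tokenized and normalized, the distance i is assumed to be the difference of word positions within a sentence, and the choice of which vectors to compare is also an assumption.

```python
import math
from collections import Counter


def distance_vector(sentences, a, b, max_len):
    """Build V for the pair (a, b).

    V[i] for i >= 1 counts co-occurrences of a and b at positional
    distance i inside one sentence; V[0] is the total of those counts.
    max_len is assumed to be the maximum sentence length, so distances
    lie in the range 1..max_len-1.
    """
    counts = Counter()
    for words in sentences:
        pos_a = [k for k, w in enumerate(words) if w == a]
        pos_b = [k for k, w in enumerate(words) if w == b]
        for i in pos_a:
            for j in pos_b:
                if i != j:
                    counts[abs(i - j)] += 1
    v = [0] * max_len
    for dist, n in counts.items():
        if dist < max_len:
            v[dist] = n
    v[0] = sum(v[1:])
    return v


def cosine(a, b):
    """Cosine of the angle between two vectors, as written in Eq. (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; returning 0.0 here is an assumption.
    return dot / norm if norm else 0.0


# Toy data, invented purely for illustration.
sentences = [
    ["cat", "eats", "fish"],
    ["dog", "eats", "meat"],
    ["cat", "likes", "fish"],
]
v_cat_eats = distance_vector(sentences, "cat", "eats", max_len=3)
v_cat_fish = distance_vector(sentences, "cat", "fish", max_len=3)
similarity = cosine(v_cat_eats, v_cat_fish)
```

    On this toy data, `v_cat_eats` is `[1, 1, 0]` and `v_cat_fish` is `[2, 0, 2]`, so the cosine is approximately 0.5. Eq. (1) as printed gives the cosine itself, for which identical directions yield 1; the scale of 0 (close) to 1 (far) mentioned for Figure 1 would correspond to `1 - cosine(a, b)`, which is an interpretation rather than something the text states.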


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                    160




Fig. 2. An example of the sliding window method.
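
    The sliding window preparation described in Section IV can be sketched as follows. This is only an illustrative sketch: the function name `sliding_windows` and the `step` parameter are hypothetical, and the measurement unit is assumed to be one subchapter represented by a string.

```python
def sliding_windows(units, size, step=1):
    """Return one text per window position.

    Each window holds `size` consecutive units and is shifted by `step`
    units (one measurement unit by default) along the source text.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    return [units[start:start + size]
            for start in range(0, len(units) - size + 1, step)]


# Hypothetical subchapters of a source text.
subchapters = ["ch1", "ch2", "ch3", "ch4"]
windows = sliding_windows(subchapters, size=2)
```

    With four subchapters and a window size of 2, the sketch yields three texts: `["ch1", "ch2"]`, `["ch2", "ch3"]`, and `["ch3", "ch4"]`. A window larger than the source yields an empty list.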

                              TABLE I.      VALUES OF THE PEARSON CORRELATION COEFFICIENT BETWEEN DIFFERENT WINDOW SIZES

                      Window size    1       2       3       4       5       6       7       8       9       10

                           1         -       0.26    0.32    0.57    0.78    0.79    0.8     0.81    0.83    0.85
                           2         -       -       0.34    0.58    0.79    0.8     0.8     0.83    0.84    0.87
                           3         -       -       -       0.59    0.81    0.8     0.81    0.84    0.84    0.87
                           4         -       -       -       -       0.81    0.82    0.84    0.85    0.86    0.87
                           5         -       -       -       -       -       0.83    0.84    0.86    0.88    0.9
                           6         -       -       -       -       -       -       0.85    0.86    0.9     0.92
                           7         -       -       -       -       -       -       -       0.89    0.9     0.92
                           8         -       -       -       -       -       -       -       -       0.9     0.93
                           9         -       -       -       -       -       -       -       -       -       0.94
                           10        -       -       -       -       -       -       -       -       -       -
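
    The comparison behind Table I can be illustrated with the sketch below, which reduces two distance matrices to their shared word pairs and computes the Pearson correlation coefficient over the remaining entries. It is a sketch under assumptions, not the procedure actually used: the names `pearson` and `matrix_correlation` are hypothetical, each matrix is assumed to be stored as a dictionary keyed by word pairs, and the numeric values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (dev_x * dev_y)


def matrix_correlation(small, large):
    """Correlate two distance matrices stored as {(word_a, word_b): value}.

    Pairs missing from either matrix are excluded first, which mirrors
    reducing both windows to the same size.
    """
    shared = sorted(set(small) & set(large))
    return pearson([small[k] for k in shared], [large[k] for k in shared])


# Invented matrices for a smaller and a larger window.
small_window = {("a", "b"): 0.1, ("a", "c"): 0.5, ("b", "c"): 0.9}
large_window = {("a", "b"): 0.2, ("a", "c"): 0.6, ("b", "c"): 1.0,
                ("c", "d"): 0.3}
r = matrix_correlation(small_window, large_window)
```

    In this toy case the pair `("c", "d")` is dropped because it does not occur in the smaller window, and the remaining values are linearly related, so `r` is approximately 1.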

                ACKNOWLEDGMENT
   The research was supported by the Ministry of Science and Higher Education of the Russian Federation (Grant # 0777-2020-0017) and partially funded by RFBR, project numbers # 19-29-01135, # 19-31-90160.

                               REFERENCES
[1]   I.A. Rytsarev, A.V. Kupriyanov, D.V. Kirsh and R.A. Paringer, “Research and analysis of messages of users of social networks using BigData technology,” CEUR Workshop Proceedings, vol. 2416, pp. 504-509, 2019.
[2]   A.F.R. Araújo, V.O. Antonino and K.L. Ponce-Guevara, “Self-organizing subspace clustering for high-dimensional and multi-view data,” Neural Networks, vol. 130, pp. 253-268, 2020.
[3]   D.L. Golovashkin and N.L. Kasanskiy, “Solving diffractive optics problem using graphics processing units,” Optical Memory and Neural Networks (Information Optics), vol. 20, no. 2, pp. 85-89, 2011. DOI: 10.3103/S1060992X11020019.
[4]   R. Deng, “Research on the Model Construction and Development of Computer Information Acquisition System,” IOP Conference Series: Materials Science and Engineering, vol. 740, no. 1, 012143, 2020. DOI: 10.1088/1757-899X/740/1/012143.
[5]   V. Sanz, A. Pousa, M. Naiouf and A. De Giusti, “Efficient Pattern Matching on CPU-GPU Heterogeneous Systems,” Lecture Notes in Computer Science, vol. 11944 LNCS, pp. 391-403, 2020.
[6]   A.S. Mukhin, I.A. Rytsarev, R.A. Paringer, A.V. Kupriyanov and D.V. Kirsh, “Determining the proximity of groups in social networks based on text analysis using big data,” CEUR Workshop Proceedings, vol. 2416, pp. 521-526, 2019.
[7]   I.A. Rytsarev, D.V. Kirsh and A.V. Kupriyanov, “Clustering of media content from social networks using BigData technology,” Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[8]   O.G. Isupova, “Conversion Analysis: Representation of Value,” Sociology: methodology, methods, mathematical modeling (4M), vol. 15, pp. 33-52, 2002.
[9]   J. Meredith, “Conversation analysis and online interaction,” Research on Language and Social Interaction, vol. 52, no. 3, pp. 241-256, 2019.



