=Paper=
{{Paper
|id=Vol-3041/548-552-paper-101
|storemode=property
|title=Multi-Instance Learning for Rhetoric Structure Parsing
|pdfUrl=https://ceur-ws.org/Vol-3041/548-552-paper-101.pdf
|volume=Vol-3041
|authors=Sergey Volkov,Dmitry Devyatkin,Alexandr Shvets
}}
==Multi-Instance Learning for Rhetoric Structure Parsing==
<pdf width="1500px">https://ceur-ws.org/Vol-3041/548-552-paper-101.pdf</pdf>
<pre>
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


             MULTI-INSTANCE LEARNING FOR RHETORIC
                       STRUCTURE PARSING
              S.S. Volkov (1, 2, a), D.A. Devyatkin (1), A.V. Shvets (3)
   (1) Federal Research Center “Computer Science and Control” RAS, Moscow, Russia
   (2) Peoples’ Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya St,
       Moscow, 117198, Russian Federation
   (3) Universitat Pompeu Fabra (UPF), Barcelona, Spain

   (a) E-mail: volkserg1@gmail.com


To accurately detect texts containing elements of hatred or enmity, it is helpful to consider various
topic-independent features: syntax, semantics, and discourse relations between text fragments.
Unfortunately, methods for identifying discourse relations in social network texts are poorly
developed. The paper considers the task of classifying discourse relations between two parts of a
text. The RST Discourse Treebank dataset (LDC2002T07) is used to assess the performance of the
methods. Since this dataset is too small for training large language models, the work uses a model
pre-fitting approach. Pre-fitting is performed on a dataset of Reddit user comments, whose texts are
labeled automatically. Since automatic labeling is less accurate than manual annotation, we use the
multiple-instance learning (MIL) method to train the models. A distinctive feature of modern language
models is their large number of parameters. Using several such models at different levels of a text
analyzer requires a lot of resources. Therefore, the analyzer needs high-performance or distributed
computing. Desktop grid systems can attract and combine computing resources to solve this type of
problem.

Keywords: discourse analysis, multiple-instance learning, natural language processing,
desktop grid


                                                    Sergey Volkov, Dmitry Devyatkin, Alexandr Shvets


                                                             Copyright © 2021 for this paper by its authors.
                    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1. Introduction
        Social media are a major information source and an essential communication tool on the
Internet, and messages posted there can be aimed at discriminating against people. However, the topics
of those messages vary widely depending on the region and political factors. Therefore, purely
topic-related text features such as lexis are of little help in building practical tools to filter
those messages. At the same time, such messages often use manipulation techniques to incite hatred. We
believe those techniques can be recognized with discourse features obtained via discourse analysis.
The basis of that analysis is Rhetorical Structure Theory (RST), a theory of text organization that
describes the relations that hold between parts of a text. Unfortunately, RST parsers are poorly
developed, especially for social media text. In this paper, we tackle discourse relation
classification in social media, which is a crucial part of discourse analysis. Namely, we address the
following research questions.
        1. Can pre-training on automatically labeled social media corpora help to improve the
           accuracy of discourse analysis?
        2. Does the multiple-instance approach on the pre-training step improve the accuracy of
           discourse analyzers?


2. Related work
         Most studies on discourse analysis use small labeled corpora such as the Penn Discourse
Treebank (PDTB) 2.0 and 3.0 and the TED Multilingual Discourse Bank (a parallel corpus) [1]. These
corpora are small because labeling is complex and the number of discourse relation classes is large.
Their small size limits the applicability of complex models with a large number of parameters.
Therefore, approaches that use pre-training on related problems are being actively developed. For
example, in [2], the authors used a multilayer neural network trained on the PDTB corpus and tested on
the TED corpus, relying on cross-lingual LASER embeddings [3]. In [4], generating the discourse scheme
of a text is proposed as a pre-training task for a language model. The model is a multilayer network
composed of bidirectional LSTM (Long Short-Term Memory) layers. The novelty of the work lies in the
approach to building the training corpus: first, a simple rule-based parser detects explicit
relations, and then those relations are used to pre-train the models. Experiments have shown that the
models, after training, can also reveal implicit discourse relations. We believe the further
development of that approach lies in applying multi-instance learning (MIL) [5], a type of supervised
learning in which, instead of a set of individually labeled instances, the model receives a set of
labeled bags, each containing many instances. Many studies use MIL to handle noisy labels, for
example in distant learning [6]. Therefore, in this study we applied MIL pre-training on an
automatically labeled social media corpus and tested the resulting models on the gold RST Discourse
Treebank corpus.


3. Datasets and methods
3.1 Datasets
         The primary dataset for evaluating the quality of the algorithms is the RST Discourse
Treebank (LDC2002T07) [7], referred to below as the gold dataset. It consists of 347 documents for
training and 38 documents for testing, all annotated manually. The resulting data is a set of
annotated text parts and the relations between them (18537 samples for training and 2255 samples for
testing). The main disadvantage of this dataset is the small number of training samples. For this
reason, it is used only for model fine-tuning.
        For basic training of the models, we decided to use a large amount of automatically labeled
data from user comments on the Reddit news portal (2003-2018). These texts were divided into pairs of
connected discourse units using a fast Rhetorical Structure Theory discourse parser [8]. This way, we
obtained about 16 million pairs of connected discourse units. Before training, this dataset was
filtered to correct the class balance, so a large part of the samples from the most common classes was
discarded. After preprocessing, the dataset contains 176677 records balanced across 31 classes. The
parser can retrieve more classes, but we decided to use only those represented in the gold dataset.
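
The class-balancing step can be sketched as follows. This is a minimal illustration, assuming that the
filtering is a random downsampling that caps every class at a fixed size; the paper does not specify
the exact procedure, and the names balance_by_downsampling, max_per_class and allowed_classes are
hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, max_per_class, allowed_classes=None, seed=0):
    """Cap every class at max_per_class randomly chosen samples.

    samples: iterable of (left_unit, right_unit, label) tuples.
    allowed_classes: optional set of labels to keep, e.g. those present in the gold dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        label = sample[2]
        if allowed_classes is None or label in allowed_classes:
            by_class[label].append(sample)

    balanced = []
    for label in sorted(by_class):
        group = by_class[label]
        if len(group) > max_per_class:
            # Discard the surplus of an over-represented class at random.
            group = rng.sample(group, max_per_class)
        balanced.extend(group)
    rng.shuffle(balanced)
    return balanced
```

Classes with fewer samples than the cap are kept whole, so the cap would have to be chosen close to
the size of the rarest retained class for the result to be balanced.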
3.2 Methods
        The task of discourse relation classification is as follows: the model receives two clauses
as input and should predict the type of relation between them. Several models were analyzed during the
research. The first model is based on Gated Recurrent Unit (GRU) [9] layers. For this model, the
StanfordNLP tokenizer split the text into tokens, and each word was represented by its lemma (the
dictionary form of the word). A word2vec model then vectorized each lemma, so each clause is
represented as an array of vectors. The neural network receives two such arrays as input and must
predict the type of discourse relation between them. The model has the following architecture:
    ● Two input layers corresponding to the two input clauses (250 vectors of dimension 300 each).
    ● Add layer. It takes two tensors as input and returns a single tensor of the same shape.
    ● GRU layer. 256 neurons with dropout = 0.1.
    ● Self-attention layer. Attention width = 256, attention_activation = sigmoid.
    ● Second GRU layer with return_sequences = False, dropout = 0.1.
    ● Dense layer. 64 neurons, sigmoid activation function.
    ● Output layer whose units correspond to the considered discourse relation classes. Softmax
      activation function.
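
The input side of this architecture can be illustrated in plain Python. The sketch below only shows
how two clauses could be brought to the stated shape (250 vectors of dimension 300) and merged by an
element-wise Add layer; zero-padding of short clauses and truncation of long ones are assumptions, and
the function names are hypothetical.

```python
MAX_LEN = 250   # vectors per clause, as stated in the architecture
EMB_DIM = 300   # word2vec dimension, as stated in the architecture


def pad_or_truncate(vectors, max_len=MAX_LEN, dim=EMB_DIM):
    """Return exactly max_len vectors: truncate long clauses, zero-pad short ones."""
    clipped = [list(v) for v in vectors[:max_len]]
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding


def add_layer(left, right):
    """Element-wise sum of two equally shaped matrices, which is what an Add layer computes."""
    return [[a + b for a, b in zip(row_l, row_r)] for row_l, row_r in zip(left, right)]
```

Because the Add layer sums the two inputs position by position, its output has the same 250 x 300
shape as each input, and the order of the two clauses is not preserved by this merge.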
      The second model is based on Bidirectional Encoder Representations from Transformers
(BERT) [10]. For the experiments we used the pre-trained Google model “bert_uncased_L-12_H-768_A-12”.
This model has 110 million parameters and was pre-trained by Google on a large corpus (Wikipedia +
BookCorpus). To preprocess the dataset for this model, the dedicated BERT tokenizer was used. The
model served as the core of a new neural network with the following architecture:
    ● Input layer, which receives a list of token ids and a list of token type ids. The model takes
      the token lists of the two clauses one after the other, separated by a special token. The
      maximum sequence length is 512. The token type ids mark every token of the first clause with 0
      and every token of the second clause with 1.
    ● BERT layer. The BERT model represented as a layer.
    ● Dense layer. 256 neurons, ReLU activation function.
    ● Output layer whose units correspond to the considered discourse relation classes. Softmax
      activation function.
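
The input layout described above can be sketched as follows. The example uses the standard BERT
sentence-pair layout ([CLS], first clause, [SEP], second clause, [SEP]); the toy vocabulary, the
pre-split tokens and the simple truncation are assumptions made for illustration and do not reproduce
the real WordPiece tokenizer.

```python
def encode_clause_pair(first_tokens, second_tokens, vocab, max_len=512):
    """Build token ids and token type ids for a clause pair.

    Layout: [CLS] first clause [SEP] second clause [SEP]
    Token type ids: 0 for the first segment (including [CLS] and the first [SEP]),
    1 for the second segment (including the final [SEP]).
    """
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"]
    type_ids = [0] * len(tokens)
    second = second_tokens + ["[SEP]"]
    tokens += second
    type_ids += [1] * len(second)

    # Naive truncation to the maximum sequence length (an assumption of this sketch).
    tokens, type_ids = tokens[:max_len], type_ids[:max_len]
    token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    return token_ids, type_ids


# Toy vocabulary; real BERT vocabularies use different ids.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "he": 3, "said": 4, "it": 5, "rained": 6}
ids, types = encode_clause_pair(["he", "said"], ["it", "rained"], toy_vocab)
```

Both lists always have the same length, so each token id is paired with the segment it belongs to.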
        The third model is a modification of the second model. In this case, we used a different
approach to training the model: multi-instance learning (MIL). In MIL, instead of receiving a set of
individually labeled instances, the model receives a set of labeled bags, each containing many
instances. A bag has the same label as most of the instances in it. The percentage of positive
instances in a bag is adjusted using a coefficient. To train the model in this way, we define the
loss function (1):
        $loss = \gamma \, CCE(Y_{pred}, Y_{true}) + (1 - \gamma) \, CCE(Y_{pred}, Y_{bag})$          (1)
where $CCE$ is the categorical cross-entropy, $\gamma$ is the balance coefficient, and $Y_{pred}$,
$Y_{true}$, $Y_{bag}$ are the model predictions, instance labels, and bag labels, respectively.
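
Equation (1) can be evaluated for a single instance as in the sketch below. It assumes one-hot label
vectors and a probability vector such as a softmax output; the function names and the eps clipping
constant are illustrative and not taken from the paper.

```python
import math


def categorical_cross_entropy(pred, target, eps=1e-12):
    """CCE for one instance: -sum_k target[k] * log(pred[k])."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pred, target))


def mil_loss(pred, y_true, y_bag, gamma):
    """Equation (1): gamma * CCE(pred, y_true) + (1 - gamma) * CCE(pred, y_bag)."""
    return (gamma * categorical_cross_entropy(pred, y_true)
            + (1.0 - gamma) * categorical_cross_entropy(pred, y_bag))


# One instance whose own label (class 1) differs from the label of its bag (class 0).
example = mil_loss([0.7, 0.2, 0.1], y_true=[0, 1, 0], y_bag=[1, 0, 0], gamma=0.5)
```

With gamma = 1 the loss reduces to the ordinary instance-level cross-entropy, and with gamma = 0 only
the bag label is used; when the instance label and the bag label agree, gamma has no effect.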
         The main purpose of using MIL is to improve the quality of classification on poorly balanced
data. Another important aspect is that we use automatically labeled data to train the model;
therefore, some labels may not correspond to the correct class. MIL can train the model on such noisy
data because it tolerates bags in which some of the instances are negative.
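
A bag with a majority label and a controlled share of positive instances can be sketched as follows.
The paper does not describe how bags are assembled, so the fixed bag size, the sampling scheme and the
names make_bag, bag_label and positive_share are assumptions made for illustration only.

```python
import random
from collections import Counter


def bag_label(instance_labels):
    """The bag takes the label shared by most of its instances."""
    return Counter(instance_labels).most_common(1)[0][0]


def make_bag(pool, target, bag_size, positive_share, rng):
    """Sample a bag in which positive_share of the instances carry the target label.

    pool: list of (instance, label) pairs. positive_share should exceed 0.5
    so that the target label remains the majority label of the bag.
    """
    positives = [x for x in pool if x[1] == target]
    negatives = [x for x in pool if x[1] != target]
    n_pos = round(bag_size * positive_share)
    bag = rng.sample(positives, n_pos) + rng.sample(negatives, bag_size - n_pos)
    rng.shuffle(bag)
    return bag


pool = [(i, "NS-Attribution") for i in range(20)] + [(i, "NN-Joint") for i in range(20, 40)]
bag = make_bag(pool, "NS-Attribution", bag_size=10, positive_share=0.7, rng=random.Random(0))
```

In this toy bag, three of the ten instances carry a different label, yet the bag label is still
NS-Attribution, which is how a bag can absorb a limited number of wrongly labeled instances.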




4. Results
         The first task is to compare the two main models, BERT and GRU. Both models were trained on
the automatically labeled dataset and then fine-tuned on the gold dataset. Table 1 shows the F1-score
of the classification results on the gold dataset (LDC2002T07) for the top-5 classes.
                       Table 1. F1-score of classification results for the top-5 classes of each model
                 GRU-based model                          BERT-based model
                 F1        Class                          F1        Class
                 0.71      NS-Attribution                 0.87      NS-Attribution
                 0.65      SN-Attribution                 0.82      SN-Attribution
                 0.60      NS-Elaboration                 0.73      SN-Condition
                 0.60      SN-Condition                   0.63      NN-Same-Unit
                 0.48      NS-Enablement                  0.54      NN-Joint
        From the initial comparison of the GRU-based and BERT-based models, we can conclude that
the BERT model performs better at this task. For this reason, the MIL approach is applied only to the
BERT model. Figure 1 shows the F1-score of the classification results on the gold dataset.

        [Figure: bar chart of the per-class F1-score (vertical axis, 0 to 1) for each discourse
         relation class (horizontal axis). Series: BERT baseline, BERT Pretrained, BERT+MIL Pretrained]

                           Figure 1. F1-score of classification result on gold dataset

         Figure 1 compares the three models. “BERT baseline” is the standard BERT model that was only
fine-tuned on the gold dataset. “BERT Pretrained” is the model that was pre-trained on automatically
labeled data from Reddit and then fine-tuned on the gold dataset. The third model, “BERT+MIL
Pretrained”, was pre-trained on the Reddit data using the MIL algorithm and then also fine-tuned on
the gold data without MIL. As we can see, for some classes MIL significantly increases the quality of
the classification. Some classes that previously could not be detected at all are now detected, albeit
with low accuracy.
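
The per-class F1-scores reported in this section follow the usual definition, F1 = 2PR / (P + R),
which equals 2TP / (2TP + FP + FN). The sketch below shows that computation on toy labels; it is an
illustration of the metric, not the evaluation script used in the paper, and the function name is
hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, with F1 = 2 * TP / (2 * TP + FP + FN) computed per class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Guard against division by zero; kept for safety, since every label
        # taken from the data has at least one true or predicted occurrence.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


toy_scores = per_class_f1(
    ["NS-Attribution", "NS-Attribution", "NN-Joint", "NN-Joint", "SN-Condition"],
    ["NS-Attribution", "NN-Joint", "NN-Joint", "NN-Joint", "NS-Attribution"],
)
```

A class that is never predicted correctly, such as SN-Condition in the toy data, receives an F1 of 0.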

5. Conclusion
         In this paper, we compared several models for classifying discourse relations. Experiments
showed that BERT-based models are the most suitable for this problem. The experiments also showed that
the MIL algorithm can increase the quality of classification when noisy data is used for pre-training.
The improvement is especially noticeable for rare classes. The presented discourse relation
classification models can be used as part of a system for analyzing the emotional charge of a text. We
assume that a text with an aggressive or negative context can be structured in such a way that
identifying the discourse relations between its parts improves the detection of such texts. Such a
text analysis system requires high-performance or distributed computing. Desktop grid systems can
attract and combine computing resources to solve this problem.




6. Acknowledgement
        This work was funded by RFBR according to the research project No. 21-011-44242.


References
[1] Zeyrek, D., Mendes, A., Grishina, Y., Kurfalı, M., Gibbon, S. and Ogrodniczuk, M., 2019. TED
Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style. Language
Resources and Evaluation, pp. 1-27.
[2] Kurfalı, M. and Östling, R., 2019. Zero-shot transfer for implicit discourse relation
classification. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue,
pp. 226-231.
[3] Artetxe, M. and Schwenk, H., 2019. Massively multilingual sentence embeddings for zero-shot
cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7,
pp. 597-610.
[4] Nie, A., Bennett, E. and Goodman, N., 2019. DisSent: Learning sentence representations from
explicit discourse relations. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pp. 4497-4510.
[5] Dietterich, T.G., Lathrop, R.H. and Lozano-Pérez, T., 1997. Solving the multiple instance problem
with axis-parallel rectangles. Artificial Intelligence, 89(1-2), pp. 31-71.
[6] Le, P. and Titov, I., 2019. Distant learning for entity linking with automatic noise detection.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pp. 4081-4090.
[7] Carlson, L., Okurowski, M.E. and Marcu, D., 2002. RST Discourse Treebank. Linguistic Data
Consortium, University of Pennsylvania.
[8] Heilman, M. and Sagae, K., 2015. Fast rhetorical structure theory discourse parsing. arXiv
preprint arXiv:1505.02425.
[9] Cho, K. et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078.
[10] Devlin, J. et al., 2018. BERT: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805.



</pre>