Style Change Detection using Discourse Markers

Faisal Alvi¹,²,*, Hasan Algafri¹ and Naif Alqahtani¹

¹ Information & Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia.
² Computer Science Program, Dhanani School of Science and Engineering, Habib University, Karachi, Pakistan.

Abstract

This article presents our approach for the Style Change Detection task at PAN 2022 using discourse markers. Discourse markers (such as 'what', 'I have', etc.) are words or expressions used to connect, organise and manage conversations. We present two different approaches for Style Change Detection at PAN 2022. For Task 1 (Style Change Basic), our approach is based on identifying conversational patterns between a user and a possible respondent within each document; using classification algorithms, we then predict the point of style change within the document. For Task 2 (Style Change Advanced) and Task 3 (Style Change Real World), we use an extensive list of frequently occurring discourse markers to predict the number of speakers, and hence the number of authors, of each document. This prediction serves as the number of clusters for the text segments of the document. Subsequently, using unsupervised clustering, we detect clusters of similar text segments such that each cluster comprises the text segments corresponding to one author. The resulting F1 scores for our approaches on the test set are 0.70518 for Task 1, 0.32128 for Task 2 and 0.56360 for Task 3.

Keywords

Style Change Detection, Discourse Markers, Conversational Patterns, Classification, Clustering

1. Introduction

The Style Change Detection [1] task has been part of the PAN evaluation labs since 2017. For PAN 2022 [2], Style Change Detection consists of three tasks: (a) Style Change Basic, (b) Style Change Advanced, and (c) Style Change Real World. The datasets for all three tasks have been derived from user posts and replies on technology-related issues. These posts simulate style changes by providing a discourse between users and respondents on technology issues.

Discourse markers [3] have been used for intrinsic plagiarism detection in the literature. Rao et al. [4] used discourse markers as features for intrinsic plagiarism detection on the PAN-2011 corpus. Likewise, Elamine et al. [5] used discourse markers as features for intrinsic plagiarism detection on the PAN-16 and PAN-17 corpora.

In this article, we present our approach for style change detection using conversational patterns and discourse markers. Our approach utilises conversational patterns together with frequently occurring unigram and bigram discourse markers in the first, second and third persons.

CLEF 2022: Conference and Labs of the Evaluation Forum, 5-8 September 2022, Bologna, Italy
* Corresponding author.
Email: alvif@kfupm.edu.sa, alvi.faysal@gmail.com (F. Alvi); s201817820@kfupm.edu.sa (H. Algafri); s201930770@kfupm.edu.sa (N. Alqahtani)
ORCID: 0000-0003-3827-7710 (F. Alvi); 0000-0002-8356-8683 (H. Algafri); 0000-0002-6343-4864 (N. Alqahtani)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Corpus Viewer - a software tool to identify conversational patterns. The interface provides drop-down menus for dataset selection, a window showing the original document, and a colour-formatted view with the text of Author 1 highlighted in yellow and the text of Author 2 in pink.
2. Identifying Conversational Patterns using Corpus Viewer

Since the style change detection datasets are based on conversations around technology, we developed Corpus Viewer, a software tool to identify conversational patterns in the Style Change Detection datasets.

Corpus Viewer: A snapshot of Corpus Viewer is shown in Figure 1. The tool consists of (a) drop-down menus for selecting documents from the datasets (train and validation), (b) a window displaying the original document, and (c) another window with colour-formatted text identifying the various authors, using the solutions provided for the training set.

Conversational Patterns: Using Corpus Viewer, we identified a number of patterns present within the documents. For example, the file displayed in Figure 1 shows highlighted text by Author 1 with phrases such as "I spilled coke... so I immediately... I disassembled...". This text has a high frequency of the first-person pronoun 'I' in the first paragraph. In contrast, the text by Author 2, "Check if you...", has only a single instance of the second-person pronoun 'you' in the second paragraph. From this analysis we can conclude that this document represents an issue followed by an answer.

3. Proposed Approaches

In this section we describe the approaches employed for Task 1 and for Tasks 2 & 3.

3.1. Task 1

We start by searching for the patterns in dataset 1 observed using Corpus Viewer. More specifically, we search for words indicating conversational patterns (such as 'Thank you', '?', or verbs in their base form) at the boundary between Author 1 and Author 2. Based on a visual analysis of the documents in dataset 1, we identified three types of documents that occur frequently throughout the dataset:

1. A question (Author 1) followed by an answer (Author 2), e.g. the document in Figure 1;
2. A statement (Author 1) followed by a question (Author 2);
3. Two replies (Author 1 and Author 2) to an issue that is itself not present in the document.

Given a document, we attempt to identify whether it belongs to one of these types by searching for discourse markers. The rationale for identifying the document type is that each type exhibits a particular pattern, indicated by discourse markers at the point of change. For example, the presence of a verb in its base form at the beginning of a paragraph (such as 'Check', 'Try', 'Do') signifies an instruction given by a respondent to a user, and hence probably identifies the point of author change in a document of type 1. Likewise, the appearance of a polite expression (such as 'Thank you' or 'Thanks') at the end of a paragraph identifies a point of author change. Furthermore, the appearance of a question mark at the end of a paragraph typically identifies a question.

Based on these patterns, we construct feature vectors covering all possible positions of author change for each document, arranged in a dataframe as shown in Table 1 (an illustrative construction sketch follows the table). In each row, the presence or absence of the identified patterns serves as the feature set for predicting author change. For example, the first row in Table 1 represents a change of authors between Paragraph 1 and Paragraphs 2, 3 and 4.

Table 1: Dataframe representing the point of change between authors for a given document

Author 1           | Author 2           | Pattern 1 | Pattern 2 | (...) | Author Change
Paragraph 1        | Paragraphs 2, 3, 4 | Yes       | ...       | ...   | Yes
Paragraphs 1, 2    | Paragraphs 3, 4    | No        | ...       | ...   | No
Paragraphs 1, 2, 3 | Paragraph 4        | No        | ...       | ...   | No
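To make the dataframe construction concrete, the following is a minimal Python sketch of how Table 1 style feature rows could be built for one document. The pattern lists (BASE_FORM_VERBS, POLITE_CLOSINGS) and the function name are our own illustrative assumptions, not the exact implementation used in our system.

```python
import pandas as pd

# Illustrative pattern lists (assumptions, not the system's exact lists).
BASE_FORM_VERBS = ("check", "try", "do", "use", "make")
POLITE_CLOSINGS = ("thank you", "thanks")

def split_features(paragraphs):
    """Build one feature row per candidate author-change point.

    Candidate i splits the document into paragraphs[:i] (Author 1)
    and paragraphs[i:] (Author 2), mirroring the rows of Table 1.
    """
    rows = []
    for i in range(1, len(paragraphs)):
        first, second = paragraphs[:i], paragraphs[i:]
        last_of_first = first[-1].strip().lower().rstrip(".! ")
        rows.append({
            "author_1": f"Paragraphs 1-{i}",
            "author_2": f"Paragraphs {i + 1}-{len(paragraphs)}",
            # Pattern 1: Author 2's text opens with a base-form verb (an instruction).
            "starts_with_base_verb": second[0].strip().lower().startswith(BASE_FORM_VERBS),
            # Pattern 2: Author 1's text closes with a polite expression.
            "ends_with_polite_word": last_of_first.endswith(POLITE_CLOSINGS),
            # Pattern 3: Author 1's text closes with a question.
            "ends_with_question": first[-1].strip().endswith("?"),
        })
    return pd.DataFrame(rows)

# Example: a two-paragraph document similar to the one in Figure 1.
doc = [
    "I spilled coke on my laptop, so I immediately turned it off and disassembled it.",
    "Check if you can still boot after drying it out.",
]
print(split_features(doc))
```

Each candidate row would then be labelled with the true change point from the training solutions before fitting the classifiers compared in Table 2 below.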
After constructing the dataframe, we apply four machine learning algorithms [6] to the training set (Decision Tree, Logistic Regression, Naive Bayes and Random Forest). Table 2 reports the accuracy and F1 score of each algorithm. These classifiers are used to predict the point of author change for each document in the validation set. From these accuracy and F1 scores, we find that the Random Forest classifier performs best; we therefore adopt the Random Forest prediction for a document, if available.

Table 2: Results of four machine learning classification algorithms (Random Forest achieves the highest accuracy and F1 score)

         | Decision Tree | Naive Bayes | Random Forest | Logistic Regression
Accuracy | 0.80850       | 0.84306     | 0.89117       | 0.88089
F1 Score | 0.62910       | 0.68144     | 0.70784       | 0.67106

3.2. Tasks 2 and 3

Tasks 2 and 3 require a more fine-grained identification of multiple author changes. The number of authors for Tasks 2 and 3 ranges from 1 to 5, with a change possible at every paragraph boundary. We use a two-stage approach for identifying author changes in Tasks 2 and 3 (illustrative sketches of both stages are given at the end of this subsection):

1. In the first stage, we aim to identify the number of authors of each document. We consider as discourse markers the most frequent unigrams and bigrams across all three datasets containing the words 'I', 'we', 'you', 'he', 'she', 'they', together with interrogative words. For each document, we construct a feature vector that includes counts of 212 first-person, 185 second-person, 41 third-person and 10 interrogative unigrams and bigrams (such as "I'd" and "you have"). In addition, document-level statistics such as the number of paragraphs, the numbers of words and characters, the punctuation-to-text ratio, and the frequencies of commas and question marks are included as features. Subsequently, we use the Random Forest algorithm [6] (the best performer, as in Task 1) to predict the number of authors of each document in the test set.

2. In the second stage, using the predicted number of authors, we apply K-Means clustering [6] to partition the paragraphs of a given document into n clusters, where n is the number of authors. K-Means is used because it is one of the most common algorithms for unsupervised clustering. The features used for clustering paragraphs are the same as those in the first stage, but computed at the paragraph level instead of the document level.

Figure 2 illustrates this two-stage process.

Figure 2: A two-stage process for Style Change Detection for Tasks 2 and 3 - machine learning classification (Random Forest) maps documents with document-level features to a predicted number of authors, and K-Means clustering maps paragraphs with paragraph-level features to clusters of paragraphs indicating the text of a particular author.

The confusion matrices for the prediction of the number of authors on validation datasets 2 and 3 in the first stage are shown in Figure 3. It can be observed that the predictions are good for single-author documents and reasonable for documents with 5 authors. However, for both datasets the model does not adequately discriminate between documents with 2, 3 and 4 authors.

Figure 3: Confusion matrices for predicting the number of authors for Tasks 2 and 3

The number of authors predicted in this stage for each document is then passed as input to the clustering phase, where K-Means clustering groups similar paragraphs into clusters corresponding to the text of each author. The outcome of this stage is a set of paragraphs with their corresponding authors, which constitutes the solution.
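As a sketch of the first stage, the following Python fragment shows how document-level discourse-marker counts and statistics could be assembled and fed to a Random Forest. The marker tuples here are tiny illustrative stand-ins for the actual lists of 212/185/41/10 n-grams, and the helper names are our own.

```python
import re
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative stand-ins for the actual lists of 212 first-person,
# 185 second-person, 41 third-person and 10 interrogative n-grams.
FIRST_PERSON = ("i", "i'd", "i have", "we")
SECOND_PERSON = ("you", "you have", "your")
THIRD_PERSON = ("he", "she", "they")
INTERROGATIVE = ("what", "why", "how", "when")

def count_markers(text, markers):
    """Count whole-word (or whole-phrase) occurrences of each marker."""
    lower = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(m) + r"\b", lower)) for m in markers)

def document_features(text):
    """Discourse-marker counts plus document-level statistics."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    punct = sum(ch in ",.;:!?" for ch in text)
    return [
        count_markers(text, FIRST_PERSON),
        count_markers(text, SECOND_PERSON),
        count_markers(text, THIRD_PERSON),
        count_markers(text, INTERROGATIVE),
        len(paragraphs),            # number of paragraphs
        len(text.split()),          # number of words
        len(text),                  # number of characters
        punct / max(len(text), 1),  # punctuation-to-text ratio
        text.count(","),            # frequency of commas
        text.count("?"),            # frequency of question marks
    ]

# train_docs and train_author_counts are assumed to come from the task's
# training set and its ground-truth solution files.
def fit_author_count_model(train_docs, train_author_counts):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit([document_features(d) for d in train_docs], train_author_counts)
    return model
```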
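For the second stage, a minimal K-Means sketch follows, reusing the document_features helper from the previous fragment at the paragraph level; the guard on the number of clusters is our own defensive assumption rather than part of the described system.

```python
from sklearn.cluster import KMeans

def cluster_paragraphs(paragraphs, n_authors):
    """Group the paragraphs of one document into n_authors clusters.

    Reuses document_features() from the previous sketch, computed per
    paragraph instead of per document; each cluster label stands for
    the text of one author.
    """
    X = [document_features(p) for p in paragraphs]
    # Defensive guard (our assumption): never request more clusters
    # than there are paragraphs.
    k = min(n_authors, len(paragraphs))
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    return km.fit_predict(X)
```

An author change between consecutive paragraphs can then be read off wherever their cluster labels differ.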
4. Results and Discussion

Final runs on the test set were made on the TIRA platform [7]. The final results (F1 scores, together with DER and JER for Task 2) of the stated approaches on the validation and test sets are shown in Table 3.

Table 3: F1 scores on the validation and test sets for each task (dataset), with DER and JER for Task 2

F1 Score       | Task 1  | Task 2  | Task 3  | Task 2 (DER) | Task 2 (JER)
Validation Set | 0.70538 | 0.33016 | 0.57788 | 0.37989      | 0.52972
Test Set       | 0.70518 | 0.32128 | 0.56360 | 0.39240      | 0.52180

From these results we observe only a very minor change between the F1 scores on the validation set and those on the test set, which suggests that the models did not overfit. Furthermore, while the approaches are effective in detecting style changes, they can be enhanced by further refinement and improvement.

5. Conclusion

In this work, we presented two related approaches for detecting style changes for the Style Change Detection task. Our approaches were based on finding conversational patterns as well as on using discourse markers, and they were effective in detecting style changes in the documents. Future work involves developing these approaches further through improved classification and clustering algorithms, as well as the addition of more features.

References

[1] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Style Change Detection Task at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR Workshop Proceedings, 2022.
[2] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl, R. Ortega-Bueno, P. Pezik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection, in: A. B. Cedeno, G. D. S. Martino, M. D. Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science, Springer, 2022.
[3] B. Heine, G. Kaltenböck, T. Kuteva, H. Long, On the Rise of Discourse Markers, volume 219, John Benjamins Publishing Company, 2021.
[4] S. Rao, P. Gupta, K. Singhal, P. Majumder, External & Intrinsic Plagiarism Detection: VSM & Discourse Markers based Approach: Notebook for PAN at CLEF 2011, in: V. Petras, P. Forner, P. Clough (Eds.), Notebook Papers of CLEF 2011 Labs and Workshops, 19-22 September, Amsterdam, The Netherlands, CEUR-WS.org, 2011. URL: http://ceur-ws.org/Vol-1177.
[5] M. Elamine, S. Mechti, L. H. Belguith, Intrinsic Detection of Plagiarism based on Writing Style Grouping, in: Language Processing and Knowledge Management, 2017.
[6] G. Bonaccorso, Machine Learning Algorithms, Packt Publishing Ltd, 2017.
[7] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.