Detecting a Change of Style Using Text Statistics*
Notebook for PAN at CLEF 2018

Kamil Safin and Aleksandr Ogaltsov
Antiplagiat Company
Moscow Institute of Physics and Technology
Higher School of Economics
kamil.safin@phystech.edu, avogaltsov@edu.hse.ru

Abstract. In this paper we address the style change detection problem of the PAN'18 author identification task: given a text, one should determine whether it is written by a single author or not. We consider a supervised problem statement with the whole text as a training object. The proposed approach is based on three types of features: text statistics, hashing, and high-dimensional text vectors. The final algorithm is an ensemble of classifiers trained independently on each feature group.

1 Introduction

Authorship detection is a class of open problems in natural language processing. This class contains a number of tasks featured in previous PAN competitions, namely:

1. Author clustering [6,15] – given a collection of text documents, one should label each document, where each label corresponds to one of n predefined authors.
2. Author diarization [17,5] – given a document written by n authors, one should link each text fragment to its author.
3. Intrinsic plagiarism detection [11,18,8,13] – given a document, one should determine reused passages without a reference collection [19].
4. Style breach detection [1] – a segmentation problem in which a text should be divided into stylistically consistent passages.

PAN'18 consists of the following tasks: the author identification task [3], the author profiling task [12], and the author obfuscation task [10]. This year's author identification task is a relaxation of style breach detection, i.e. a binary classification task in which the positive label corresponds to a document with at least one style change. Therefore, we can rely on solutions developed for these tasks [4,14]. The general framework frequently applied to the previous tasks is the following:

1.
To obtain text segments using some segmentation scheme, for example sentence segmentation or n-grams with or without overlap.
2. To construct a mapping from a text segment into a feature space [2,16,14].
3. Given the segment features, to train an algorithm that classifies, clusters, or detects outliers.

* This research is supported by RFBR project 18-07-01441.

However, in this paper we develop a framework that treats the whole text as a training object, without any segmentation. On the one hand, this problem statement was inspired by the fact that we deal with binary classification; on the other hand, we try to contribute a slightly different point of view on the problem.

First, we perform a preprocessing procedure that is different for each specific classifier. Next, we extract three types of features: text statistics, a hash code of the text, and a high-dimensional sparse representation of the text, obtained by simply counting occurrences of word n-grams for n in the range 1–6. Such n-gram counting has shown success in tasks ranging from intrinsic plagiarism detection [16] to author profiling [7]. We train three independent classifiers, one on each type of features, form a linear combination of the probabilities given by the classifiers, and learn a threshold for this linear combination. All experiments were carried out on TIRA [9].

2 Problem Statement

In this section we state the problem formally. Consider a collection D of m text documents and denote the i-th document by D_i, where i ∈ {1, ..., m}. Let f be a mapping that takes each document of the collection to a fixed-size vector:

    f : D → R^d.

Consider a labeling function h such that

    h : R^d → y ∈ {0, 1},

where class label 1 is for documents written by more than one author and 0 is for single-author documents. Let L_D be the empirical risk defined by

    L_D(h) = |{i : h(D_i) ≠ y_i}| / m,

where y_i is the class label of the i-th document.
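For concreteness, the empirical risk above is simply the fraction of misclassified documents. A minimal sketch, with hypothetical predictions h(D_i) and labels y_i:

```python
def empirical_risk(predictions, labels):
    """L_D(h) = |{i : h(D_i) != y_i}| / m, the misclassification rate."""
    m = len(labels)
    mistakes = sum(1 for h_d, y in zip(predictions, labels) if h_d != y)
    return mistakes / m

# Hypothetical outputs of h and true labels for m = 4 documents;
# exactly one of the four predictions disagrees with its label.
print(empirical_risk([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.25
```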
We want to find ĥ that minimizes L_D on a given collection D:

    ĥ = arg min_{h ∈ H} L_D(h),

where H is a parametric family of functions.

3 Experiment

3.1 Data

The data corpus consists of user posts from various sites of the StackExchange network. The data is split into training and validation sets that contain 2980 and 1492 texts, respectively.

3.2 Quality Criteria

To evaluate the quality of the proposed algorithm, the accuracy score was used. Accuracy is the fraction of correct predictions. More formally, for binary classification accuracy is defined as

    Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP, TN, FP, and FN stand for true positives, true negatives, false positives, and false negatives, respectively.

3.3 Model

Our model consists of three independent classifiers: the Statistical, Hashing, and Counting Classifiers. Each classifier returns the probability that the text contains a style change, and the final probability is a weighted sum of the three probabilities p_s, p_h, and p_c, respectively.

Statistical Classifier. The Statistical Classifier uses 19 statistical features for text analysis. The most important of them are:

– number of sentences;
– fraction of unique words;
– text length;
– fraction of punctuation symbols;
– fraction of letter symbols.

A Random Forest classifier was used to produce the final probability.

Hashing Classifier. This model uses a hashing function to build term frequency counts for a text. The hash function employed is the signed 32-bit version of MurmurHash3¹. As a result, a text is mapped into a 3000-dimensional vector space. These vectors contain information about occurrences of character n-grams in the text, and this representation is used to classify whether the text contains style changes or not. A Random Forest classifier was again used to produce the probability.

Counting Classifier. The Counting Classifier uses a high-dimensional (3 million) representation of a text. Other dimensionalities were tried but showed lower quality. It counts word n-grams for n from 1 to 6 and turns them into a vector.
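The hashing and counting representations described above can be sketched with scikit-learn, whose HashingVectorizer (based on signed 32-bit MurmurHash3) the paper points to. This is a sketch under assumptions: the character n-gram range and the toy documents are illustrative, and the paper does not specify its exact preprocessing.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Two toy documents standing in for StackExchange posts.
docs = [
    "The first answer explains the algorithm in detail.",
    "Totally different style here, right? Yes, completely.",
]

# Hashing features: character n-gram counts mapped into a fixed
# 3000-dimensional space via signed 32-bit MurmurHash3.
# The n-gram range (2, 4) is an illustrative assumption.
hasher = HashingVectorizer(n_features=3000, analyzer="char",
                           ngram_range=(2, 4), norm=None)
X_hash = hasher.transform(docs)

# Counting features: raw word n-gram counts for n = 1..6; on the
# full corpus this representation has about 3 million dimensions.
counter = CountVectorizer(analyzer="word", ngram_range=(1, 6))
X_count = counter.fit_transform(docs)

print(X_hash.shape)   # (2, 3000)
print(X_count.shape)  # (2, vocabulary size)
```

A classifier can then be fit on either sparse matrix to produce the per-text probability of a style change.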
Logistic Regression is then used to obtain the probability.

The Statistical, Hashing, and Counting Classifiers were trained on the training set independently of each other, in order to maximize the performance measure, accuracy. The resulting performances are shown in the table below.

                              Accuracy
    Statistical Classifier      0.67
    Hashing Classifier          0.65
    Counting Classifier         0.74

¹ http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

Model. The final score for a text d is the weighted sum of the probabilities:

    score(d) = α_s p_s + α_h p_h + α_c p_c,

where the coefficients α_s, α_h, α_c are selected from (0, 1). If the score for a text exceeds the threshold δ, the text is marked as containing a change of style:

    score(d) > δ ⇒ d has a change of style.

3.4 Parameter Tuning

The coefficients α_s, α_h, α_c and the threshold δ were tuned on the validation set by grid search in order to maximize accuracy. Each of the coefficients α_s, α_h, α_c reflects the importance of the corresponding classifier. The optimal parameters for the final model are α_s = 0.4, α_h = 0.2, α_c = 0.4; we can see that the Statistical and Counting Classifiers are the most informative. The value of the threshold is δ = 0.55. The relation between accuracy and the threshold value is shown in the figure below.

3.5 Results

The proposed model was tested on the PAN'18 data set. The results of its performance are shown below.

                Validation    Test
    Accuracy      0.805      0.803

4 Conclusion

We proposed an algorithm for the style change detection task. The algorithm uses three independent classifiers: Statistical, Hashing, and Counting. Each classifier gives its own probability that a text contains a change of style. The final score is computed as a weighted sum of the three probabilities, and if the score exceeds the threshold, the text is marked as containing a change of style. The method was implemented for the PAN'18 style change detection task and achieved an accuracy score of 0.803 on the test dataset.

References

1.
Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering (2017)
2. Bensalem, I., Rosso, P., Chikhi, S.: Intrinsic plagiarism detection using n-gram classes. EMNLP (2014)
3. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2018)
4. Khan, J.A.: Style breach detection: An unsupervised detection model. In: CLEF (2017)
5. Kuznetsov, M., Motrenko, A., Kuznetsova, R., Strijov, V.: Methods for intrinsic plagiarism detection and author diarization. Notebook for PAN at CLEF 2016 (2016)
6. Layton, R., Watters, P., Dazeley, R.: Automated unsupervised authorship analysis using evidence accumulation clustering. Natural Language Engineering 19(1), 95–120 (2013)
7. Ogaltsov, A., Romanov, A.: Language variety and gender classification for author profiling in PAN 2017. In: CLEF (2017)
8. Potthast, M., Gollub, T., Hagen, M., Kiesel, J., Michel, M., Oberländer, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. CLEF (Online Working Notes/Labs/Workshop) (2012)
9. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)
10.
Potthast, M., Hagen, M., Schremmer, F., Stein, B.: Overview of the Author Obfuscation Task at PAN 2018: A New Approach to Measuring Safety. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2018)
11. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. Proceedings of the 23rd International Conference on Computational Linguistics (2010)
12. Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2018)
13. Safin, K., Kuznetsov, M., Kuznetsova, M.: Methods for intrinsic plagiarism detection. Informatics and Applications (2017)
14. Safin, K., Kuznetsova, R.: Style breach detection with neural sentence embeddings. In: CLEF (2017)
15. Samdani, R., Chang, K.W., Roth, D.: A discriminative latent variable model for online clustering. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 1–9. PMLR, Beijing, China (Jun 2014)
16. Stamatatos, E.: Intrinsic plagiarism detection using character n-gram profiles (2009)
17. Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by authorship within and across documents. CEUR Workshop Proceedings (2016)
18. Stein, B., Barrón-Cedeño, A., Eiselt, A., Potthast, M., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. CEUR Workshop Proceedings (2011)
19. Zechner, M., Muhr, M., Kern, R., Granitzer, M.: External and intrinsic plagiarism detection using vector space models. Proc. SEPLN, vol. 32 (2009)