<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Authorship Verification with Prediction by Partial Matching and Context-free Grammar</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Łukasz Gągała</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Göttingen</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>In our approach to authorship verification (AV) we propose a data compression method based on the widespread Prediction by Partial Matching (PPM) algorithm, extended with context-free grammar character preprocessing. The most frequent in-word bigrams in each text are replaced by special symbols and processed by a modified version of the PPM algorithm. As a similarity measure between text samples we chose the Compression-Based Cosine (CBC), which previous research has shown to perform slightly better than alternative measures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Authorship verification is an active research area of computational linguistics that can
be expressed as a fundamental question of stylometry, namely whether or not two texts
are written by one and the same author [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ]. It has a wide range of applications in
forensic linguistics and fraud and plagiarism detection. Among notable examples, we
can name blackmail messages, false insurance claims or online reviews and opinion
statements, where authorship analysis may turn out to be instrumental in answering
forensic questions.
      </p>
      <p>
        For the 2020 edition of the PAN challenge, a large dataset in two versions was prepared
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The dataset consists of pairs of English-language fanfiction text fragments
written either by one author or by two different authors. For PAN 2020, as in the 2018 and 2019
editions, fanfiction literature was chosen as the source of the task corpus. Fanfiction
is written by members of a particular fandom, i.e. a subculture of fans concentrated
around specific films, TV series or books, like Harry Potter and Star Wars, who re-enact
a so-called universe of their interest by organising fan events, drawing comic books,
creating multimedia content or writing new novels. The genre of
literary fanfiction aims to emulate the original world of a respective fiction, but its works very
often enhance the plot and add new figures, places and events [3] [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ]. In this regard,
fanfiction of different fandoms written by one person may display a lot of stylistic
divergence coming from attempts to stick to a particular style that is regarded as typical
for a given fandom. Authorship verification for this type of text can, therefore, be seen
as cross-domain verification, similarly to cross-domain attribution [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        For the PAN 2020 Authorship Verification challenge we tried to enrich
data-compression-based methods for authorship attribution/verification (AA/AV) with text pre-processing.
Our starting points were the publications on the subject by Halvani et al. [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ]
exploring different types of compression algorithms and metrics. Furthermore, we applied
character-based context-free grammar rules as a feature engineering step as described
by Aljehane [
        <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The final verification decision was made with a threshold determined as proposed in [
        <xref ref-type="bibr" rid="ref7">8</xref>
          ]. We
describe our approach in detail in the subsequent Method section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <sec id="sec-3-1">
        <title>Data Compression Methods</title>
        <p>
          Data compression algorithms are used in a variety of domains (bioinformatics, social
sciences, image processing) for classification and clustering tasks. The theoretical
foundation of those methods lies in information theory, more precisely, in Kolmogorov
complexity [
          <xref ref-type="bibr" rid="ref13">14</xref>
          ]. The length of the shortest code (description) that can produce an object is
called the Kolmogorov complexity of that object.
        </p>
        <p>K(s) = |d(s)|  (1)</p>
        <p>Kolmogorov complexity can be used to define an information distance between two
objects with the following formula:</p>
        <p>NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}  (2)</p>
        <p>
          where K(x|y) is the conditional complexity of x given y. Normalised Information
Distance (NID) has been proven to satisfy the requirements of a metric distance
measure; however, it remains incomputable [14, pp. 660]. For practical applications we can
approximate NID by compression algorithms, such as Huffman coding, Lempel-Ziv
compression (LZ77 and LZ78) or Prediction by Partial Matching (PPM), which are
available in popular data compression libraries and programs, for example WINZIP or 7-ZIP
[
          <xref ref-type="bibr" rid="ref14">15</xref>
          ]. Such an approximation of NID is called Normalised Compression Distance (NCD).
In this case the value C is the length of the compressed object expressed in the number of
bits.
        </p>
        <p>NCD(x, y) = max{C(x|y), C(y|x)} / max{C(x), C(y)}  (3)</p>
        <p>
A couple of alternative variations of that metric measure were proposed for the task of
authorship verification:</p>
        <p>
          Compression-based Cosine (CBC) introduced by Sculley and Brodley [
          <xref ref-type="bibr" rid="ref15">16</xref>
          ].
        </p>
        <p>CBC(x, y) = 1 - (C(x) + C(y) - C(xy)) / √(C(x)C(y))  (4)</p>
        <p>
Chen-Li Metric (CLM) credited by Sculley and Brodley [
          <xref ref-type="bibr" rid="ref15">16</xref>
          ] to Li et al. [
          <xref ref-type="bibr" rid="ref3">4</xref>
          ].
Compression Method (CM) algorithms are widely used across many fields of
scientific research (not only in computer science but also in genetics or the social sciences).
They have likewise gained momentum in the domain of computational
linguistics. Particularly in text classification tasks, the Prediction by Partial Matching (PPM)
algorithm seems to be the preferred CM, even though it is not the most compelling approach
for data compression itself [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ]. PPM is used in different variants (so that we can
rather speak of a family of algorithms) in many compression programs, where it is very
often combined with other computational techniques [15, pp. 292]. The fundamental
principle of PPM is the context-dependent prediction of each subsequent character in a
text string. The context is given by a window of preceding characters with a predefined
length (there also exists a PPM version with a so-called unbounded context [
          <xref ref-type="bibr" rid="ref4">5</xref>
          ]). The PPM
algorithm with context size n can also be seen as a Markov model of order
n. The context size can be freely defined, yet most applications choose a range between
3 and 12, since longer contexts do not appear to be of practical use. Figure 1 shows how
the PPM algorithm proceeds through a text, taking the context of a given
length and predicting each subsequent character.
        </p>
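As an illustration, the compression lengths C and the CBC dissimilarity of Eq. 4 can be sketched in a few lines of Python. Here the off-the-shelf bz2 compressor stands in for PPM purely for illustration; it is not the compressor used in our experiments, and the helper names `C` and `cbc` are our own.

```python
import bz2

def C(s):
    """Compressed length of s in bytes; bz2 stands in here for the
    PPM compressor (an assumption made only for illustration)."""
    return len(bz2.compress(s.encode("utf-8")))

def cbc(x, y):
    """Compression-based Cosine dissimilarity, Eq. (4):
    CBC(x, y) = 1 - (C(x) + C(y) - C(xy)) / sqrt(C(x) * C(y))."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return 1.0 - (cx + cy - cxy) / (cx * cy) ** 0.5
```

Similar texts compress well together, so C(xy) stays close to C(x) and the dissimilarity approaches 0, while unrelated texts concatenate without savings and push the value towards 1.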
        <p>Figure 1: The sliding context window of PPM (context size 3) over the sequence
“Every word has a meaning.”: context “eve” predicts “r”, “ver” predicts “y”,
“ery” predicts “_”, “ry_” predicts “w”, “y_w” predicts “o” (with “_” marking a space).</p>
        <p>
The bulk of development research into PPM has been done in the domain of data
compression, for obvious reasons. Much less seems to have been done with regard to CM for
text classification tasks: the main objective of PPM is data compression as such, and
classification works merely because similar objects (texts) tend to compress
together more efficiently than dissimilar objects do. A superior compression rate of a
particular approach does not necessarily lead to better classification results, so CM
may be an attractive starting point for AA/AV, but it requires dedicated
amendments. One of the recently proposed improvements is a context-free grammar
preprocessing for PPM (Grammar-based preprocessing for PPM, GB-PPM), which aims at
a denser representation of character strings [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In subsequent runs, the GB-PPM
preprocessing replaces the n most frequent bigrams or trigrams of characters with
special symbols, reducing the overall length of a text and simplifying the distribution of
characters across different contexts. First, GB-PPM identifies frequent n-grams (in our
case bigrams; in this toy example the status of being frequent is assigned arbitrarily):
        </p>
        <sec id="sec-3-1-1">
          <title>Grammar-based Preprocessing Example</title>
          <p>Every word has a meaning. (7)</p>
          <p>Then the frequent n-grams are replaced with special symbols (e.g. A, B, C, ...), which are
not part of the set of characters in the text from Eq. 7:</p>
          <p>Aery Brd has a mCnDg. (8)</p>
          <p>The algorithm can be run once again, identifying the next set of frequent n-grams, this
time together with the special symbols standing for frequent n-grams from the last run in
Eq. 8:</p>
          <p>Aery Brd has a mCnDg. (9)</p>
          <p>Also, those frequent n-grams can be replaced further with special symbols covering
both original characters and other special symbols. After n runs (in our example n = 2)
the algorithm can yield a new version of the text string in Eq. 10 and a corresponding
character context-free grammar in Figure 2:</p>
          <p>AEy Fd hG a HnDg. (10)</p>
          <p>A → Ev
B → wo
C → ea
D → in
E → er
F → Br
G → as
H → mC</p>
          <p>The generic code for GB-PPM is given in Algorithm 1. The procedure is
applied directly before the compression of each text sample, but one can easily imagine
an alternative implementation, where a single run is performed over the whole text
corpus D, so that the most frequent n-grams B are derived from D, B ⊂ D, rather than from a
particular text t, t ∈ D.</p>
          <p>Algorithm 1: Non-terminal symbol preprocessing with the GB-PPM algorithm</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Input:</title>
        <p>T (sequence),
i (number of passes),
n (number of most frequent bigrams in the current sequence),
Output: sequence T with non-terminal symbols
C PREDICTION BY PARTIAL MATCHING
for k 0 to i do</p>
        <p>B Find the n most frequent in-word bigraphs in the current text T ;
foreach frequent bigram b 2 B do
foreach frequent bigram b 2 T do</p>
        <p>non-terminal symbol s frequent bigram b;
return C(T )
3.4</p>
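A minimal Python sketch of this preprocessing step may make it concrete. This is a re-implementation under our own assumptions, not the submitted code; the helper names `gb_preprocess` and `expand` are hypothetical, and Unicode private-use code points serve as the special symbols guaranteed to be absent from ordinary text.

```python
import collections
import re

def gb_preprocess(text, n=4, passes=1):
    """Sketch of GB-PPM-style preprocessing: in each pass, replace the n
    most frequent in-word bigrams with fresh one-character non-terminals."""
    grammar = {}
    next_code = 0xE000  # Unicode private-use area, absent from normal text
    for _ in range(passes):
        counts = collections.Counter()
        for word in re.findall(r"\S+", text):       # in-word bigrams only
            for i in range(len(word) - 1):
                counts[word[i:i + 2]] += 1
        for bigram, _count in counts.most_common(n):
            symbol = chr(next_code)
            next_code += 1
            grammar[symbol] = bigram                # production: symbol → bigram
            text = text.replace(bigram, symbol)
    return text, grammar

def expand(text, grammar):
    """Undo the preprocessing by applying the productions in reverse order."""
    for symbol in reversed(list(grammar)):
        text = text.replace(symbol, grammar[symbol])
    return text
```

Expanding the productions in reverse order restores the original string exactly, which is why the transformation is safe to apply before compression.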
      </sec>
      <sec id="sec-3-3">
        <title>Verification</title>
        <p>To verify the authorship of a pair of texts p = (Tx, Ty), we need to find a decision
threshold θ, for which our data compression dissimilarity measure M gives us the score
s = M(Tx, Ty), so that:</p>
        <p>decision(p) = Y (Yes) if s &lt; θ; N (No) otherwise</p>
        <p>
          As in [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ], we find θ by using the Equal Error Rate (EER) algorithm on
n text pairs with an equal number of positive (Y) and negative (N) authorship cases. In
this way we can determine θ such that the rates of false positives and false negatives
are equal.
        </p>
        <p>Algorithm 2: Determine threshold θ by EER</p>
        <p>Input: Y (the dissimilarity scores of the Y problems), N (the dissimilarity
scores of the N problems)
Output: threshold θ</p>
        <p>if length(Y) ≠ length(N) then
    throw Exception "Number of Y and N problems mismatch!";
Sort Y and N in ascending order;
l ← length(Y); i ← 0; j ← l − 1;
for k ← 0 to l − 1 do
    if Yi &lt; Nj then
        i ← i + 1; j ← j − 1; continue;
    if Yi = Nj then
        θ ← Yi; break;
    if Yi &gt; Nj then
        if i = 0 then
            θ ← ½ (Yi + Nj);
        else
            θ ← ½ (min(Yi, Nj) + min(Yi−1, Nj));
        break;
if i = l then
    θ ← ½ (Yi−1 + Nj+1);
return θ;</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Implementation Details and Submission</title>
      <p>
        The development dataset provided by the PAN 2020 organisers consists of two
versions: a smaller one that is supposed for data-sparse machine learning methods (52,601
text snippet pairs) and a bigger one that should be more applicable for data-hungry
approaches (275,565 text snippet pairs) [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ]. Both subcorpora were drawn from even a
bigger text collection consisting of 5.8 million stories written by 1.4 million authors
in 44 different languages and in 10,328 different topical domains. The data is a set of
problems being pairs of English language text fragments by the same or two different
authors. Moreover, the texts in the pair come from different fanfics, however there are
no fanfiction crossovers, i.e. literary texts that encompass multiple fiction universes, e.g.
Batman and Star Wars, Harry Potter and Spider-Man. The overall distribution of authors
across all the problems resembles the “long-tail” distribution of the original text corpus,
from which the development set was composed.
      </p>
      <p>To establish the verification decision threshold θ we used 2,000 text pairs. Larger
numbers of text pairs did not improve our calibration parameter. For the GB-PPM
algorithm we decided to process only bigrams in a single run, since additional executions
of that preprocessing step gave no improvement. As a dissimilarity measure we chose
CBC as described in the Method section.</p>
      <sec id="sec-4-1">
        <title>Results</title>
        <p>As we can see in Table 1, our approach yielded better results than the baseline methods
suggested by the organisers, but could not break the “glass ceiling” of around 0.8 overall
score; only the two approaches by Boenninghoff and by Weerasinghe achieved an overall
score of around 0.9.</p>
        <p>Table 1: Evaluation results (Rank, Team, F1-score, Overall score).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
        <p>The research for the author’s work has been funded with a PhD bursary for the years
2016–2019 by the Cusanuswerk Academic Foundation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aljehane</surname>
            ,
            <given-names>N.O.M.</given-names>
          </string-name>
          :
          <article-title>Grammar-based preprocessing for PPM compression and classification</article-title>
          .
          <source>Ph.D. thesis</source>
          , Bangor University (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bischoff</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deckers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schliebs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thies</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The Importance of Suppressing Domain Style in Authorship Analysis</article-title>
          . CoRR abs/2005.14714 (May
          <year>2020</year>
          ), https://arxiv.org/abs/2005.14714
        </mixed-citation>
      </ref>
      <ref id="ref2a">
        <mixed-citation>
          3.
          <string-name>
            <surname>Booth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A Companion to Media Fandom and Fan Studies</article-title>
          . Wiley Blackwell companions in cultural studies, Wiley (
          <year>2018</year>
          ), https://books.google.de/books?id=yDJKDwAAQBAJ
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A compression algorithm for DNA sequences and its applications in genome comparison</article-title>
          . pp.
          <fpage>52</fpage>
          -
          <lpage>61</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cleary</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teahan</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>Unbounded length contexts for PPM</article-title>
          .
          <source>In: Proceedings DCC '95 Data Compression Conference</source>
          . pp.
          <fpage>52</fpage>
          -
          <lpage>61</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gagala</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Code for PAN 2020 Authorship Verification task</article-title>
          . https://github.com/Lukasz-G/PAN2020 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          7.
          <string-name>
            <surname>Halvani</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graner</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vogel</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Authorship verification in the absence of explicit features and thresholds</article-title>
          . In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) Advances in Information Retrieval. pp.
          <fpage>454</fpage>
          -
          <lpage>465</lpage>
          . Springer International Publishing, Cham (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          8.
          <string-name>
            <surname>Halvani</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graner</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>On the usefulness of compression models for authorship verification</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Availability, Reliability and Security. ARES '17</source>
          ,
          Association for Computing Machinery, New York, NY, USA (
          <year>2017</year>
          ), https://doi.org/10.1145/3098954.3104050
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hellekson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The Fan Fiction Studies Reader</article-title>
          . University of Iowa Press (
          <year>2014</year>
          ), http://www.jstor.org/stable/j.ctt20p58d6
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10.
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lonardi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratanamahatana</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          :
          <article-title>Towards parameter-free data mining</article-title>
          .
          <source>In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . p.
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          . KDD '
          <volume>04</volume>
          ,
          Association for Computing Machinery, New York, NY, USA (
          <year>2004</year>
          ), https://doi.org/10.1145/1014052.1014077
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bevendorff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the Cross-Domain Authorship Verification Task at PAN 2020</article-title>
          . In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.)
          <article-title>CLEF 2020 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR-WS.org (Sep</source>
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes of CLEF 2018 -
          <article-title>Conference and Labs of the Evaluation Forum (CLEF</article-title>
          <year>2018</year>
          ). Springer (Sep
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          13.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Determining if two documents are written by the same author</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>65</volume>
          (
          <issue>1</issue>
          ),
          <fpage>178</fpage>
          -
          <lpage>187</lpage>
          (
          <year>2014</year>
          ), https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.22954
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          14.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitányi</surname>
            ,
            <given-names>P.M.:</given-names>
          </string-name>
          <article-title>An Introduction to Kolmogorov Complexity and Its Applications</article-title>
          . Springer Publishing Company, Incorporated,
          3rd edn
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          15.
          <string-name>
            <surname>Salomon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Handbook of Data Compression</article-title>
          . Springer Publishing Company, Incorporated, 5th edn. (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sculley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Compression and machine learning: A new perspective on feature space vectors</article-title>
          . pp.
          <fpage>332</fpage>
          -
          <lpage>341</lpage>
          (04
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>