<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UniNE at PAN-CLEF 2020: Author Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Catherine Ikae</string-name>
          <email>Catherine.Ikae@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Neuchatel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In our participation in the authorship verification task (small corpus), our main objective is to discriminate between pairs of texts written by the same author (denoted “same-author”) and pairs of snippets written by different ones (“different-authors”). This paper describes a simple model that performs this task based on the Labbé similarity. As features, we employ the most frequent tokens (words and punctuation symbols) of each author after excluding the most frequent ones of a given language. Such a representation strategy is based on words used frequently by a given author but not belonging to the most frequent in the English language. Evaluation on the authorship verification task with a rather small set of features shows an overall performance on the small dataset of F1 = 0.705 and AUC = 0.840.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Authorship verification at CLEF PAN 2020 is the task of determining whether two
texts (or excerpts) have been written by the same author [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this kind of task, a sample of texts
written by the proposed author can also be provided, from which the system
could generate a better author profile. This additional sample was not provided in the
current experiment.
      </p>
      <p>
        For the CLEF PAN 2020 task (small corpus), the pairs of texts have been extracted from
the website www.fanfiction.net, which stores texts about numerous well-known
novels, movies or TV series (e.g., Harry Potter, Twilight, Bible) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These writings,
called fanfics, have been written not by the true author(s) but by fans who want to
continue or propose a new episode of their preferred saga. Of course, such a fan could
have written for different series or proposed several variants for continuing a specific
one. One can however assume that a writer is more interested in writing within a given topic
or domain (called a fandom, a subculture of fans sharing a common interest in a given
subject).
      </p>
      <p>Thus, for the proposed task, the question is to identify whether or not a pair of text
excerpts has been authored by the same person. We view this author verification as a
similarity detection problem: detecting when the similarity between two texts is too
high to reflect two distinct authors.</p>
      <p>
        Just as in authorship attribution, the author of a given text has to be revealed by
identifying some of his/her stylistic idiosyncrasies and by measuring the similarity
between two author profiles. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] suggests using a quantification of the writing style in texts
to represent the identity of their authors. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] makes use of emojis in the feature selection
for the verification of Twitter users. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] applies a large amount of linguistic information,
such as vocabulary, lexical patterns, syntax, semantics, information content, and item
distribution through a text, for author recognition and verification.
      </p>
      <p>
        As possible applications of author verification, one can mention the analysis of
anonymous emails in forensic investigations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the verification of historical literature [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
continuous authentication in cybersecurity [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and the detection of changes in the writing
style of Alzheimer's patients [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The rest of this paper is organized as follows. Section 2 describes the text datasets,
while Section 3 describes the features used for the classification. Section 4 explains the
similarity measure, and Section 5 depicts some of our evaluation results. A conclusion
summarizes the main findings of our experiments.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Corpus</title>
      <p>The corpus contains a set of pairs composed of two short texts (or snippets), each
describing a proposed variant or continuation of a popularly successful series. In
this context, given two snippets, the task is to determine whether the text pair has been
written by a single author or by two distinct writers. This question is a stylistic
similarity detection problem, assuming that two snippets could cover distinct topics with
very distinctive characters and temporal differences while still being written by the same
person.</p>
      <p>These pairs of texts have been extracted from different domains (fandoms) and
Table 1 reports some examples of such fandoms with the number of texts extracted from
them.</p>
      <p>Table 1. Examples of fandoms with the number of texts extracted from them: G-Gundam (56), Vampire Knight (175), Free! - Iwatobi Swim Club (121), DC Superheroes (98), Friends (111), CSI: Miami (133), Grimm (108), Danny Phantom (200), Primeval (117), Kingdom Hearts (219), Jurassic Park (107), Tarzan (40), Dungeons and Dragons (73), Final Fantasy X (152), Fast and the Furious (112), OZ (33), Sons of Anarchy (115), Avatar: Last Airbender (223), Attack on Titan (194), Madam Secretary (47).</p>
      <p>This corpus contains 52,590 text pairs (denoted problems), of which 27,823 pairs
were written by the same author and 24,767 by two distinct persons.
Each text excerpt contains, on average, 2,200 word-tokens. An example of a pair is
provided in Table 2 with the respective lengths and vocabulary sizes.</p>
      <p>Table 2. Example of a text pair.
Guardians of Ga'Hoole (Tokens = 2,235; |Voc| = 1,353):
I shift a bit, warily letting my eyes dart from one owl to the other -- but my eyes are
trained on the Barn Owl the most. Like Hoole...so like Hoole... He turns a bit, and our
eyes meet directly. I can't describe it...in this next moment, I don't look away, how
awkward it seems. I stare into his eyes. They're like Hoole's... They are Barn Owl
eyes, but Hoole's eyes. They're his eyes...Hoole's eyes... They hold that light of valor,
…
Hetalia - Axis Powers (Tokens = 2,032; |Voc| = 1,422):
"All will become one with Russia," he said, almost simply, his cheer eerie. Fists were
already clenched; now they groped about, for a pan, a rifle, a sword-there was nothing.
In some way, this brought her but a sigh of relief-Gilbert and Roderich, she was
reminded, were not here to suffer as well. If Ivan put his giant hands on Roderich...
Click, went an object, and Elizaveta was snapped into the world when her own instincts
…</p>
    </sec>
    <sec id="sec-3">
      <title>3 Feature Selection</title>
      <p>To determine the authorship of the two text chunks, we need to specify a text
representation that can characterize the stylistic idiosyncrasies of each possible author.
As a simple first solution, and knowing that only a small amount of text is available, we
focus on the most frequent word-types.</p>
      <p>
        To generate a text representation, a tokenization must be defined. In this study, a
token is a sequence of letters delimited by spaces or punctuation symbols. We also
consider punctuation marks (or sequences of them), such as the comma, full
stop, or question mark, as tokens. Words appearing in the nltk stopword list are excluded from this
representation (179 entries composed of pronouns, determiners, auxiliary verb forms,
and some conjunctions and prepositions). Thus our strategy is based on word-types used
recurrently by one author but not frequent in the underlying language (English in this
case). One can compare this solution to the Zeta model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
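      <p>The tokenization and filtering step described above can be sketched as follows (a minimal sketch in Python; the small STOPWORDS set is a stand-in for the 179-entry nltk list, and the function names are illustrative, not the actual implementation):</p>

```python
import re

# Stand-in for the 179-entry nltk English stopword list
# (normally obtained via nltk.corpus.stopwords.words("english")).
STOPWORDS = {"the", "a", "and", "of", "to", "in", "i", "it", "is"}

def tokenize(text):
    """Split a text into lowercase word tokens and punctuation tokens."""
    # A token is a sequence of letters; runs of punctuation marks
    # (comma, full stop, question mark, ...) are kept as tokens too.
    return re.findall(r"[a-zA-Z]+|[.,;:!?]+", text.lower())

def filter_tokens(tokens):
    """Drop tokens belonging to the most frequent words of the language."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("Wait... wait, he said. The owls, however, stayed.")
print(filter_tokens(tokens))
```

Note that punctuation sequences such as "..." survive as single tokens, which lets recurrent punctuation habits contribute to an author's profile.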
      <p>Then, to determine the most frequent word-types, the occurrence frequency (or term
frequency, denoted tf) of each word-type inside a chunk of text is computed. However,
as each text chunk has a different size, we opt for a relative term frequency (rtf),
computed as the term frequency divided by the text size. One can interpret this rtf value
as an estimate of the probability of occurrence of the term in the underlying snippet.
Finally, each pair is represented by the rtf of the k most frequent word-types, with
k varying from 100 to 500.</p>
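      <p>This representation step can be sketched along the following lines (an illustrative sketch: rtf_vector and its per-text selection of the k most frequent types are assumptions of this example, not the authors' code):</p>

```python
from collections import Counter

def rtf_vector(tokens, k=100):
    """Relative term frequency (rtf) of the k most frequent word-types.

    rtf = tf / text size, an estimate of the probability of occurrence
    of each word-type in the underlying snippet.
    """
    counts = Counter(tokens)
    n = len(tokens)  # text size in tokens
    return {t: c / n for t, c in counts.most_common(k)}

tokens = "to be or not to be , that is the question .".split()
print(rtf_vector(tokens, k=3))
```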
    </sec>
    <sec id="sec-4">
      <title>4 Similarity Measure</title>
      <p>
        Based on a vector of k elements reflecting the rtf of each selected word-type, the
similarity (or distance) between the two text excerpts can be computed. In this study,
we opt for the Labbé similarity [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This measure is normalized and returns a value
between zero (nothing in common) and one (vectors are identical). All pairs of snippets
with similarity above a given threshold (denoted δ) are considered to be authored by the
same person. On the other hand, a similarity value lower than the specified threshold
indicates different authors.
      </p>
      <p>Denoting by d<sub>1</sub> and d<sub>2</sub> two document vectors, the Labbé distance corresponds to the
ratio of the absolute differences over all n terms to the maximal distance between the
two text representations, as shown in Equation 1.</p>
      <p>Dist<sub>Labbé</sub>(d<sub>1</sub>, d<sub>2</sub>) = ∑<sub>i=1..n</sub> |tf′<sub>i,1</sub> − tf<sub>i,2</sub>| / (2 · n<sub>2</sub>), where tf′<sub>i,1</sub> = tf<sub>i,1</sub> · n<sub>2</sub>/n<sub>1</sub> (1)</p>
      <p>The decision rule is based on the value of the Labbé similarity, defined as Sim<sub>Labbé</sub> = 1 − Dist<sub>Labbé</sub> (with δ = 0.5):</p>
      <sec id="sec-4-1">
        <title>Decision</title>
        <sec id="sec-4-1-1">
          <title>Same author</title>
          <p>if Sim<sub>Labbé</sub>(d<sub>1</sub>, d<sub>2</sub>) &gt; 0.5 (2)</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Different authors</title>
          <p>if Sim<sub>Labbé</sub>(d<sub>1</sub>, d<sub>2</sub>) &lt; 0.5</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>Non-decision otherwise</title>
          <p>The implementation considers the absolute difference over all n terms in the two text
representations. For each term, the difference between the absolute
frequencies in Text d<sub>1</sub> and Text d<sub>2</sub> is computed. This requires both documents to have equal
length. To ensure that both texts have comparable lengths, assuming Text d<sub>1</sub> (of length n<sub>1</sub>) is larger
than Text d<sub>2</sub> (of length n<sub>2</sub>), we multiply the term frequency of Text d<sub>1</sub> (tf<sub>i,1</sub>) by the
ratio of the two lengths, n<sub>2</sub>/n<sub>1</sub>, as shown in Equation 1.</p>
          <p>During the PAN CLEF 2020 author verification task, the system must return a value
between 0.0 and 1.0 for each problem. In our case, the Labbé similarity score provides
this value. In addition, we must specify "same-author", "different-authors" or provide
a blank answer (meaning "I don't know") that will be considered as an unanswered
question during the evaluation. Specifying δ = 0.5 (see Equation 2), we ignore this last
possible answer and provide an answer to all problems.</p>
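          <p>A sketch of how Equations 1 and 2 could be implemented (assuming tf dictionaries of absolute frequencies and text lengths with n<sub>1</sub> ≥ n<sub>2</sub>; the names are illustrative, not the actual implementation):</p>

```python
def labbe_similarity(tf1, n1, tf2, n2):
    """Labbé similarity between two texts (1 minus the Labbé distance).

    tf1, tf2: dicts of absolute term frequencies; n1, n2: text lengths,
    with text 1 assumed to be the larger one (n1 >= n2).
    """
    terms = set(tf1) | set(tf2)
    # Rescale the larger text's frequencies by n2/n1 (Equation 1) and
    # sum the absolute differences, normalized by twice the smaller length.
    dist = sum(abs(tf1.get(t, 0) * n2 / n1 - tf2.get(t, 0)) for t in terms)
    return 1 - dist / (2 * n2)  # similarity in [0, 1]

def decide(sim, delta=0.5):
    """Decision rule with threshold delta (Equation 2)."""
    if sim > delta:
        return "same-author"
    if sim == delta:
        return ""  # blank answer: non-decision
    return "different-authors"

# Identical vectors reach similarity 1.0; disjoint vocabularies give 0.0.
print(decide(labbe_similarity({"owl": 2, "eyes": 1}, 3, {"owl": 2, "eyes": 1}, 3)))
```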
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Evaluation</title>
      <p>As performance measures, four evaluation indicators have been used. First, the
AUC (area under the curve) is computed. This value corresponds to the area under the
curve generated by plotting the percentage of false positives (or false alarms) on the
x-axis and the percentage of true positives on the y-axis over the entire test set. A
model whose predictions are 100% wrong obtains an AUC of 0.0; one whose
predictions are 100% correct has an AUC of 1.0.</p>
      <sec id="sec-5-1">
        <title>Results</title>
        <p>
          Second, the F1 score combines the
precision and the recall into a unique value. In this computation, the non-answers are
ignored. Third, c@1 is a variant of the conventional accuracy which rewards systems
leaving difficult problems unanswered. It takes into account both the number of correct
answers and the number of problems left unsolved [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Fourth, F_0.5_u is a measure
according more emphasis to whether the system is able to solve the same-author cases
correctly [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
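        <p>For illustration, c@1 can be computed as follows (a sketch of the measure from [13]; each unanswered problem is credited with the system's accuracy over all problems, so with δ = 0.5 and no blank answers c@1 reduces to plain accuracy):</p>

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 rewards leaving difficult problems unanswered:
    c@1 = (n_correct + n_unanswered * n_correct / n_total) / n_total."""
    acc = n_correct / n_total
    return (n_correct + n_unanswered * acc) / n_total

# Answering 8 of 10 problems correctly with none left blank: plain accuracy.
print(c_at_1(8, 0, 10))
# Leaving 2 hard problems blank instead of answering them wrongly helps.
print(c_at_1(8, 2, 10))
```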
        <p>The entire system was based only on the training set, so the training and the
evaluation were done directly on the same corpus. With 52,590 problems in the ground
truth, the results of the similarity verification are shown in Table 3.</p>
        <p>Table 3. AUC as a function of the number of features k: 0.847 (k = 100), 0.851 (150), 0.854 (200), 0.855 (250), 0.857 (300), 0.858 (350), 0.859 (400); the remaining values reported in the table are 0.530, 0.535, 0.530, 0.530, and 0.530.</p>
        <p>Increasing the number of features from 100 to 500 does not have a significant impact
on the overall results as shown in Table 3. On the other hand, the run time is clearly
increasing.</p>
        <p>Figure 1. Distribution of the similarity values for the two classes, same author or distinct authors (k = 100).</p>
        <p>To have a better view of the results, Figure 1 shows the distribution of the Labbé
similarity values for the two classes, namely "same-author" and "different-authors"
(k = 100). As one can see, the "same-author" distribution (in blue) presents a higher
mean (0.723, sd: 0.041) and lies more to the right (higher values)
than the "different-authors" distribution (mean: 0.660, sd: 0.048), shown in
red. However, the intersection between the two distributions is relatively large.</p>
        <p>In a last experiment, instead of building the text representation with all possible
word-types, we remove the 179 most frequent words appearing in the nltk stopword list.
Table 4 reports the overall performance of both approaches with k = 500. Depending on
the evaluation measure, one or the other representation strategy tends to propose the best
effectiveness. The results are thus inconclusive.</p>
      </sec>
      <sec id="sec-5-2">
        <title>With vs. Without Stopword Removal</title>
        <p>Table 4. Overall performance with and without stopword removal (k = 500). The reported values are: AUC 0.860, 0.531; 0.591; F_0.5_u 0.585, 0.599; F1 0.705; overall 0.672.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>Due to time constraints, this report proposes a simple text similarity technique to solve
the authorship verification problem when facing pairs of snippets. We proposed
to select features by ranking them according to their frequency of occurrence in each
text and taking only the most frequent ones (from 100 to 500), while excluding the most
frequent ones of the underlying language. With this proposed strategy, we want to
identify terms occurring frequently for a given author but not frequent in the current language
(English in this study).</p>
      <p>
        The similarity computation is based on the Labbé similarity between two vectors. The next
step for us is to explore the reverse text representation, taking into account only the most
frequent terms of a given language [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Of course, one could then combine the two
results to hopefully improve the overall effectiveness.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bevendorff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Overview of the Cross-Domain Authorship Verification Task at PAN 2020</article-title>
          .
          <article-title>CLEF 2020 Labs and Workshops</article-title>
          , Notebook Papers
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Bischoff</surname>
          </string-name>
          , Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, Martin Potthast:
          <article-title>The Importance of Suppressing Domain Style in Authorship Analysis</article-title>
          . CoRR abs/2005.14714 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Stover</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Computational authorship verification method attributes a new work to a major 2nd century African author</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          ,
          <volume>67</volume>
          . https://doi.org/10.1002/asi.23460
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Suman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharyya</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chaudhari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Emoji Helps! A Multi-modal Siamese Architecture for Tweet User Verification</article-title>
          .
          <source>Cognitive Computation</source>
          . https://doi.org/10.1007/s12559-020-09715-7
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Halteren</surname>
            ,
            <given-names>H. V.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Author verification by linguistic profiling: An exploration of the parameter space</article-title>
          .
          <source>ACM Transactions on Speech and Language Processing</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ), 1:
          <fpage>1</fpage>
          -1:
          <lpage>17</lpage>
          . https://doi.org/10.1145/1187415.1187416
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Iqbal</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>B. C. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Debbabi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>E-mail authorship verification for forensic investigation</article-title>
          .
          <source>Proceedings of the 2010 ACM Symposium on Applied Computing</source>
          ,
          <volume>1591</volume>
          -
          <fpage>1598</fpage>
          . https://doi.org/10.1145/1774088.1774428
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stover</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karsdorp</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Authenticating the writings of Julius Caesar</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>63</volume>
          ,
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          . https://doi.org/10.1016/j.eswa.2016.06.029
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Neal</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundararajan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Woodard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Exploiting Linguistic Style as a Cognitive Biometric for Continuous Verification</article-title>
          .
          <source>2018 International Conference on Biometrics (ICB)</source>
          ,
          <fpage>270</fpage>
          -
          <lpage>276</lpage>
          . https://doi.org/10.1109/ICB2018.2018.00048
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>V. W.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Changes in Style in Authors with Alzheimer's Disease</article-title>
          .
          <source>English Studies</source>
          ,
          <volume>93</volume>
          (
          <issue>3</issue>
          ),
          <fpage>357</fpage>
          -
          <lpage>370</lpage>
          . https://doi.org/10.1080/0013838X.2012.668789
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Craig</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kinney</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Shakespeare, Computers, and the Mystery of Authorship</article-title>
          . Cambridge University Press, Cambridge.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>All the Way Through: Testing for Authorship in Different Frequency Strata</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>27</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Labbé</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Labbé</surname>
          </string-name>
          .
          <article-title>A tool for literary studies</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>21</volume>
          (
          <issue>3</issue>
          ):
          <fpage>311</fpage>
          -
          <lpage>326</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rodrigo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>A Simple Measure to Assess Non-response</article-title>
          .
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          . https://www.aclweb.org/anthology/P11-1142
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Bevendorff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Generalizing Unmasking for Short Texts</article-title>
          .
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <fpage>654</fpage>
          -
          <lpage>659</lpage>
          . https://doi.org/10.18653/v1/N19-1068
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2019</year>
          )
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In N. Ferro and C. Peters (eds),
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer, Berlin.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Entropy-Based Authorship Search in Large Document Collection</article-title>
          .
          <source>In Proceedings ECIR2007</source>
          , Springer LNCS #
          <volume>4425</volume>
          ,
          <fpage>381</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>