<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UniNE at CLEF 2017: Author Clustering</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept., University of Neuchâtel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>This paper describes and evaluates an effective unsupervised author clustering and authorship linking model called SPATIUM. The suggested strategy can be adapted without any difficulty to different languages (such as Dutch, English, and Greek) and to different text genres (e.g., newspaper articles and reviews). As features, we suggest using the m most frequent terms (isolated words and punctuation symbols) or the m most frequent character n-grams of each text. Applying a simple distance measure, we determine whether there is enough indication that two texts were written by the same author. The evaluations are based on 60 training and 120 test problems (PAN AUTHOR CLUSTERING task at CLEF 2017). Using the most frequent terms results in a higher clustering precision, while using the most frequent character n-grams of letters gives a higher clustering recall. An analysis of the variability of the performance measures indicates that our system performs stably, independently of the underlying text collection, and that our parameter choices did not over-fit the training data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The authorship attribution problem is an interesting problem in computational
linguistics but also in applied areas such as criminal investigation and historical studies
where knowing the author of a document (such as a ransom note) may be able to save
lives [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. With the Web 2.0 technologies, the number of anonymous or pseudonymous
texts is increasing and in many cases one person writes in different places about
different topics (e.g., multiple blog posts written by the same author). Therefore,
proposing an effective algorithm for the authorship problem is of real interest. In
this case, the system must regroup all texts by the same author (possibly written about
different topics) into the same group or cluster. A justification supporting the
proposed answer and a probability that the given answer is correct can be given to
improve the confidence attached to the response [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>This author clustering task is more demanding than the classical authorship
attribution problem. Given a document collection, the task is to group documents
written by the same author such that each cluster corresponds to a different author. The
number of distinct authors whose documents are included is not given. For example,
based on a set of passages extracted from larger documents, we should first determine
the number of authors k and then regroup the texts into k clusters according to their real
author. This task can also be viewed as establishing authorship links between texts and
is related to the PAN 2015 task of authorship verification.</p>
      <p>This paper is organized as follows. The next section presents the test collections and
the evaluation methodology used in the experiments. The third section explains our
proposed algorithm called SPATIUM. Then, we evaluate the proposed scheme on 60
training problems and compare it to the best performing schemes using 120 different
test problems. The last section explains our parameter choices and provides a
sensitivity assessment. A conclusion summarizes the main findings of this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Test Collections and Evaluation Methodology</title>
      <p>
        The evaluation was performed using the TIRA platform, an automated tool for the
deployment and evaluation of software [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Data access is restricted such that, during a software run, the system is
encapsulated, ensuring that no data leaks back to the task participants [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This evaluation procedure also offers a fair
evaluation of the time needed to produce an answer.
      </p>
      <p>
        During the PAN CLEF 2017 evaluation campaign, six corpora (or test collections)
were built, each containing 30 problems (10 for training and 20 for testing). In each
problem, all texts are in the same language and the same text genre, and are
single-authored, but they may differ in length and can be cross-topic [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The
number of distinct authors is not given. In this context, the task is defined as:
Given a problem of up to 50 short documents, identify
authorship links and groups of documents by the same author.
      </p>
      <p>The six corpora are a combination of one of three languages (English, Dutch, or Greek)
and one of two genres (newspaper articles or reviews). An overview of these corpora
is depicted in Table 1. Considering the six benchmarks we have 120 problems to test
and 60 problems to train (pre-evaluate) our system. The training set was used to
evaluate our approach and the test set was used to compare our results with other
participants of the PAN CLEF 2017 campaign. This year, every participant was allowed
two runs on the test data: one can train and test a basic approach, improve it or submit
a different approach, and then test it again in the second and final run.
For each corpus, we have 10 problems in the training dataset containing the average
number of texts as given under the label “Texts”. The number of distinct authors on
average together with the range for each corpus is indicated in the column “Authors”,
and the average with the minimum and maximum number of authors with only a single
document is presented under the label “Single”. Finally, the average number of terms
(isolated words and punctuation symbols) is given in the column “Terms”. For
example, in the English newspaper collection (training set), 20 texts are written, on
average, by 5.6 authors, and on average 1.8 authors wrote only a single article.
These metrics are not available for the test corpora because the datasets remain
undisclosed thanks to the TIRA system. We only know that the same combinations of
language and genre are present.</p>
      <p>In Table 1 we see that the number of words is rather small. In Figure 1 we show
three texts extracted from a problem containing articles written in the English language.
The represented texts are the full unmodified documents as available in problem001.
Notice that document0014 and document0017 each consist of a single sentence, and the
latter is so short that it would fit in a single tweet (it contains fewer than 140 characters).
When analyzing the texts, we should detect a shared authorship between document0017
and document0018, but not with document0014 as this was written by someone else.
The limited length of those documents is the main difficulty of this year’s author
clustering task.</p>
      <p>document0014:
But the more fastidious are also sensitive
about their reputations – and the risk that
others with shadier professional pasts, alleged
or real, may damage their fundraising.
document0017:
“Can we keep this brief?” enquired
Vaz, now beginning to get a bit
twitchy that the Gambaccini show
was overrunning.
document0018:
“It is absolutely vital that every decision we take, every policy we
pursue, every programme we start, is about giving everyone in our
country the best chance of living a fulfilling and good life,” Dave said.
“And now it’s time for the cameras to leave.” And for the cuts to begin.</p>
      <p>When inspecting the training corpora, the number of words available is rather small
(on average 82 terms per text). Since some authors wrote only a single text, two texts
should only be clustered together if there is enough evidence of shared
authorship.</p>
      <p>During the PAN CLEF 2017 campaign, a system must return two outputs in a JSON
structure for each problem. First, the detected groups should be written to a file
indicating the author clustering. Each text must belong to exactly one cluster; thus, the
clusters must be non-overlapping. Second, a list of text pairs with a probability of
having the same author should be written to another file representing the authorship
links.</p>
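      <p>As an illustration, the two required JSON files might be produced as in the following sketch. The field names (a list of clusters of single-key entries, and a ranked list of scored pairs) are our assumption based on the PAN campaign format and should be checked against the official task description; the document names are hypothetical.</p>
      <p>
```python
import json

# Hypothetical clustering result: each inner list is one author cluster.
clusters = [["document0017.txt", "document0018.txt"], ["document0014.txt"]]

# Hypothetical authorship links with an estimated probability of shared authorship.
links = [("document0017.txt", "document0018.txt", 0.83)]

# Clustering output: a list of clusters, each a list of single-document entries.
clustering = [[{"document": d} for d in cluster] for cluster in clusters]
with open("clustering.json", "w") as f:
    json.dump(clustering, f, indent=2)

# Ranked list of authorship links with their probabilities.
ranking = [{"document1": a, "document2": b, "score": s} for a, b, s in links]
with open("ranking.json", "w") as f:
    json.dump(ranking, f, indent=2)
```
      </p>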
      <p>
        As performance measures, two metrics were used during the PAN CLEF
campaign. The first is the BCubed F1, which evaluates the clustering
output [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This value is the harmonic mean of the precision and recall associated with
each document. The document precision represents how many documents in the same
cluster were written by the same author and therefore measures the purity of the cluster.
Symmetrically, the recall associated with one document represents how many documents
by that author appear in its cluster and therefore measures the completeness of the
cluster.
      </p>
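      <p>A minimal sketch of the BCubed computation, assuming gold authorship labels are available (this is an illustration, not the official evaluation script):</p>
      <p>
```python
def bcubed_f1(clusters, author_of):
    """BCubed F1: clusters is a list of lists of document ids,
    author_of maps each document id to its true author."""
    cluster_of = {d: i for i, c in enumerate(clusters) for d in c}
    docs = list(cluster_of)
    prec_sum = rec_sum = 0.0
    for d in docs:
        same_cluster = clusters[cluster_of[d]]
        same_author = [e for e in docs if author_of[e] == author_of[d]]
        correct = sum(1 for e in same_cluster if author_of[e] == author_of[d])
        prec_sum += correct / len(same_cluster)  # purity of d's cluster
        rec_sum += correct / len(same_author)    # completeness for d's author
    p, r = prec_sum / len(docs), rec_sum / len(docs)
    return 2 * p * r / (p + r)
```
      </p>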
      <p>
        As another measure, the PAN CLEF campaign adopts the mean average precision
(MAP) measure for the authorship links between document pairs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This evaluation
measure provides a single-figure measure of quality across recall levels. The MAP is
roughly the average area under the precision-recall curve over a set of problems.
It therefore places more emphasis on the first positions, and a
misclassification with a low probability is penalized less. MAP does not punish
verbosity, i.e., every true link counts even when it appears near the end of the ranked
list. Therefore, by providing all possible authorship links, one can attempt to maximize
MAP [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
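      <p>For one problem, the average precision over a ranked link list can be sketched as follows (MAP is then the mean of this value over all problems; the function and argument names are ours):</p>
      <p>
```python
def average_precision(ranked_links, true_links):
    """ranked_links: document pairs sorted by decreasing probability;
    true_links: set of frozensets holding the truly co-authored pairs."""
    hits, precision_sum = 0, 0.0
    for rank, pair in enumerate(ranked_links, start=1):
        if frozenset(pair) in true_links:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / max(len(true_links), 1)
```
      </p>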
    </sec>
    <sec id="sec-3">
      <title>3 Simple Clustering Algorithm</title>
      <p>
        We suggest an unsupervised approach based on a simple feature extraction and distance
measure called SPATIUM (Latin word meaning distance). The selected stylistic features
correspond to the top m most frequent terms (isolated words, without stemming but with
punctuation symbols) in the first run, as in last year's system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and additionally the m
most frequent character n-grams in the second run. The features are selected solely
based on the frequency in the query text. For determining the value of m, previous
studies have shown that a value between 200 and 300 tends to provide the best
performance [
        <xref ref-type="bibr" rid="ref10 ref2">2, 10</xref>
        ]. Since the texts are only paragraphs, the effective number of features
m was capped at 200 but in most cases stayed well below this limit. The length of the n-grams
was set to n=6 characters to ease the analysis of the most pertinent features. Unlike in
the previous year [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we did not remove the words appearing only once (hapax
legomena) in the text, due to the limited size of each document (see Table 1). For
instance, in document0017 depicted in Figure 1, every term would be deleted if
hapax legomena were ignored.
      </p>
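      <p>The feature extraction above can be sketched as follows. The exact tokenizer is our assumption; the paper only specifies that isolated words and punctuation symbols count as separate terms:</p>
      <p>
```python
import re
from collections import Counter

def top_terms(text, m=200):
    # Words and punctuation symbols count as separate terms (no stemming).
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(tokens).most_common(m)

def top_char_ngrams(text, n=6, m=200):
    # Overlapping character n-grams of the raw text.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return Counter(grams).most_common(m)
```
      </p>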
      <p>To measure the distance between a Text A and another Text B, SPATIUM uses a
variant of the L1-norm called the Canberra distance. In this measure, the absolute
difference of each feature is normalized by the sum of the two corresponding values, as
indicated in Equation 1.</p>
      <p>∆(A, B) = ∑_{i=1}^{m} |PA[fi] − PB[fi]| / (PA[fi] + PB[fi]) (1)
where m indicates the number of features (words and punctuation symbols, or character
n-grams), and PA[fi] and PB[fi] represent the estimated occurrence probability of the
feature fi in the first Text A and in the other Text B respectively. To estimate these
probabilities, we divide the feature occurrence frequency (ffi) by the sum of all features
of the corresponding text (n), Prob[fi] = ffi / n, without smoothing and therefore
accepting a probability of 0.0 in Text B. This distance measure is not symmetric due
to the choice of the features to be included in the computation.</p>
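      <p>Equation 1 can be implemented directly. A minimal sketch, in which the tokenized texts and the m=200 cap follow the description above while the function name is ours:</p>
      <p>
```python
from collections import Counter

def spatium_distance(text_a_tokens, text_b_tokens, m=200):
    """Canberra-style distance of Eq. 1; the features are the m most
    frequent terms of Text A, probabilities are relative frequencies."""
    freq_a, freq_b = Counter(text_a_tokens), Counter(text_b_tokens)
    n_a, n_b = len(text_a_tokens), len(text_b_tokens)
    features = [t for t, _ in freq_a.most_common(m)]
    dist = 0.0
    for f in features:
        pa = freq_a[f] / n_a
        pb = freq_b[f] / n_b  # may be 0.0: no smoothing
        dist += abs(pa - pb) / (pa + pb)  # pa is always positive here
    return dist
```
      </p>
      <p>Because the features are drawn from Text A only, swapping the two arguments can change the result, which is the asymmetry mentioned above.</p>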
      <p>
        Observing a small value for ∆ provides evidence that both documents were written
by the same author. On the other hand, a large value suggests the opposite. The real
problem consists in defining precisely what a “small distance value” is. To verify
whether the resulting ∆ value is small or rather large, a comparison basis must be
determined.
      </p>
      <p>To achieve this within a specific problem, the distance from Text A to all other texts
is computed (denoted ∆(A, ·)). From this distribution, the mean (denoted µ(A, ·)) and the
standard deviation (σ(A, ·)) are estimated. Moreover, the distribution of distance
values to Text B (∆(·, B)) can be computed to provide the mean µ(·, B) and the
standard deviation σ(·, B) of the intertextual distances to Text B.</p>
      <p>
        As a first definition of a “small” distance, we can assume that a small distance value
from Text A must respect Eq. 2, in which γ is a parameter to be fixed:
Rule 1: ∆(A, B) ≤ δ(A, ·) = µ(A, ·) − γ · σ(A, ·) (2)
Similarly, a small distance to Text B can be defined as:
Rule 2: ∆(A, B) ≤ δ(·, B) = µ(·, B) − γ · σ(·, B) (3)
With these two decision rules, one can verify whether a distance ∆ is small in comparison
with all distances from Text A (Eq. 2) or all distances to Text B (Eq. 3). In the same
way, we propose to create two additional decision rules with Eq. 4 (based on the distribution
of distance values from Text B) and Eq. 5 (for distances to Text A) as follows:
Rule 3: ∆(A, B) ≤ δ(B, ·) = µ(B, ·) − γ · σ(B, ·) (4)
Rule 4: ∆(A, B) ≤ δ(·, A) = µ(·, A) − γ · σ(·, A) (5)
An authorship link between Text A and Text B is expected if at least two of the four hints
are satisfied. For the clustering output, we use the single linkage strategy. For the list
of links, we must rank each pair of texts by the certainty that they share the same
authorship. To determine the probability of a correct author linking, we include both
the number of satisfied hints h and the absolute distance between the two texts in the
computation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A link with h hints fulfilled receives a probability between h/5 and
(h + 1)/5, where the final score depends on the other text pairs that also satisfy h hints.
      </p>
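      <p>As we read the description of the four decision rules (Eqs. 2 to 5), hint counting might be sketched as follows; the pairwise distance matrix D and the helper names are hypothetical:</p>
      <p>
```python
import statistics

def small_enough(delta, values, gamma):
    # delta counts as "small" when it lies gamma standard
    # deviations below the mean of the comparison distances.
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    return mu - gamma * sigma >= delta

def hints(D, i, j, gamma=1.64):
    """Count how many of the four decision rules support a shared
    authorship for texts i and j; D[a][b] holds Delta(a, b)."""
    texts = range(len(D))
    count = 0
    # Distances from Text i (Eq. 2) and from Text j (Eq. 4).
    for row in (i, j):
        values = [D[row][k] for k in texts if k != row]
        if small_enough(D[i][j], values, gamma):
            count += 1
    # Distances to Text j (Eq. 3) and to Text i (Eq. 5).
    for col in (j, i):
        values = [D[k][col] for k in texts if k != col]
        if small_enough(D[i][j], values, gamma):
            count += 1
    return count

# An authorship link between i and j is proposed when at least
# two of the four hints are satisfied.
```
      </p>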
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation</title>
      <p>Since our system is based on an unsupervised approach we could directly evaluate it
using the training set. Table 2a reports the performance measures applied during the
PAN CLEF campaign, namely the BCubed F1 (with the clustering precision and recall)
and the AP, using the most frequent terms from our first run; Table 2b reports the same
measures with the most frequent character 6-grams used in the second run. Each
corpus consists of 10 problems, and the last row reports their average. The
final score is the arithmetic mean between the BCubed F1 and the MAP.</p>
      <p>The algorithm returns similar results over all corpora and seems to work stably,
independently of the text genre and language. But we can see, from the first to the
second approach (from Table 2a to Table 2b), that the precision drops significantly while
the recall increases notably. Overall, the approach with 6-grams results in a slightly
higher performance for the clustering output (BCubed F1), the authorship linking
(MAP), and the Final score (+2.4% difference, +5.3% change).</p>
      <p>The test set is then used to rank the performance of all 6 participants in this task. Based
on the same evaluation methodology, we achieve the results depicted in Table 3a and
Table 3b corresponding to the six test corpora. These results are in line with those
obtained on the training set. Therefore, the system seems to perform stably,
independently of the underlying text collection, and is not over-fitted to the data.</p>
      <p>To put these values in perspective, Table 4 compares our result with those of the other
participants, using macro-averaging for the effectiveness measures and showing the
total runtime, sorted by the final score. Overall, we are ranked 2nd out of 6
approaches.
Generally, there are only small differences in the BCubed F1 between the participants.
Conversely, the MAP shows substantial variations and impacts the final score the most.
The runtime only covers the actual time spent classifying the test set. On TIRA
(http://www.tira.io/task/author-clustering/), it was possible to first train the system
using the training set, which had no influence on the final runtime. Since our system is
unsupervised, it did not need to train any parameters, but this possibility might have
been used by other participants.</p>
      <p>Overall, we achieve excellent results using a rather simple and fast approach in
comparison with the other solutions.</p>
      <p>We are convinced that, in text categorization studies, a deeper analysis of the
evaluation results is important to obtain a better understanding of the advantages and
drawbacks of a suggested scheme. By focusing only on overall performance measures,
we observe a general behavior or trend without being able to explain the proposed
assignments. To achieve this deeper understanding, we
analyzed some problems extracted from the English corpus. The relative
frequency (or probability) differences with very frequent tokens such as the, (comma),
to, or and can explain the decision. The confirmation of an authorship link is in many
cases based on topical words and names that two texts share, like labour, party, people,
Cameron, or work.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Parameter Choices</title>
      <p>Our approach uses a few parameters to solve the clustering task. The main influences
on the performance are the choice of the distance measure, the threshold value γ, and
the feature selection scheme. Basing a decision solely on the outcome on the training
data could lead to over-fitting. A leave-one-out or k-fold cross-validation is not
possible in this task; instead, the bootstrap approach can be used. In this perspective,
for each problem, the system generates S new random bootstrap samples. More
precisely, for each text, we create S = 200 new copies having the same text length.
For each copy the probability of choosing one given feature (word and punctuation
symbol, or n-gram) depends on its relative frequency in the original text. This drawing
is done with replacement; thus, the underlying probabilities are stable. Each resulting
text must be viewed as a bag-of-words. As the syntax is not respected, each bootstrap
text is not readable but still reflects the stylistic aspects as analyzed by the SPATIUM
approach.</p>
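      <p>The bootstrap generation can be sketched as follows. Drawing with replacement reproduces each token's relative frequency in expectation; the function name and seed are ours:</p>
      <p>
```python
import random

def bootstrap_samples(tokens, S=200, seed=42):
    """Create S bootstrap copies of a text: each copy draws the same
    number of tokens with replacement, so a token's chance of being
    drawn equals its relative frequency in the original text."""
    rng = random.Random(seed)
    return [rng.choices(tokens, k=len(tokens)) for _ in range(S)]
```
      </p>
      <p>Each resulting copy is a bag of tokens with scrambled syntax, unreadable as text but preserving the frequency profile that SPATIUM analyzes.</p>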
      <p>
        For each of the original 60 training problems (Table 1) we now have 200 generated
problems of bootstrap samples and can compare different parameter choices. In Table 5
we analyze several distance measures and report the mean of the Final score achieved
with the 200*60 new problems together with the limit of ±2 standard deviations σ
corresponding to a confidence interval of 95.4%. Furthermore, the last two columns
show the mean of the BCubed F1 and MAP over the 200 bootstrap samples and 60
problems.
In a previous study [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] we found that Canberra and Clark work better on average in
author profiling tasks than Cosine and Euclidean. We can again see the same
distinction for this clustering task. In Table 5 we can also see that there is no significant
difference between Canberra, Clark, Matusita, and KLD. For instance, between
Canberra and Clark we only observe a relative change of +1.4% in the mean final score
with the bootstrap approach, which is not a substantial enough improvement to justify
changing our model.
      </p>
      <p>
        The next parameter to optimize is the threshold value γ, which controls how
strict the assignments are. A smaller value for γ generates more
potential links between texts and thus increases the risk of observing incorrect
assignments. For a Gaussian distribution, common choices are γ = 1.96 to take
account of 95%, γ = 1.64 which covers 90%, γ = 1.28 to include 80%, and γ = 1.0
to take in 68.3%. If a corpus is composed of many authors, with each cluster containing
only a few items, the parameter γ should be fixed at a relatively high level. In our
system from 2016, we set γ = 2.0 because of the small average cluster size [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. With
the 2017 datasets, the number of authors with only a single text is lower and there
are more grouped documents. Therefore, we decreased the threshold parameter in
the current system to γ = 1.64.
      </p>
      <p>Figure 2 shows the mean of the BCubed F1, the MAP, and the Final score for
different γ values when using the bootstrap approach. We can see that there was a lot
of potential to improve the clustering outcome (highest line on top, BCubed F1). This
analysis was performed after the completion of the testing stage, where we had fixed γ =
1.64 (shown with red squares in Figure 2). Setting γ = 1.28 would have enhanced the
clustering output by 10% and therefore increased the final performance by 5%. The
benefit of a higher threshold is greater certainty that a given authorship link is
correct, leading to a higher clustering precision. On the other hand, a less
restrictive threshold gives a higher clustering recall. We propose to be more cautious,
mainly because proposing an incorrect assignment must be viewed as more problematic
in many systems (especially legal and law-related ones) than missing a link
between two documents written by the same author.</p>
      <p>Interestingly, the authorship linking seems to produce a constant result (dashed line
on the bottom, MAP) independent of the threshold value γ used.</p>
      <p>Finally, we can evaluate the performance variation on the training data to determine the
optimal length of the character n-grams for our second run. Figure 3 shows the mean
clustering precision, recall, and BCubed F1 for n-gram lengths from n=1
(unigrams) to n=12, based on the bootstrap approach. We can see a convergence from
n=2 to n=9 between the recall (increasing) and the precision (decreasing) before they
diverge again. In our second run, we used 6-grams (shown with red squares in
Figure 3). The highest harmonic mean between precision and recall is achieved using
7-grams, which is only slightly better than the neighboring 6-grams and 8-grams (less
than a 0.5% change).</p>
      <p>Overall, the analysis has shown that the chosen parameters are reasonable but could
have been optimized. On the one hand, choosing Clark instead of Canberra as the
distance measure, or taking n-grams of length n=7 instead of n=6 characters, would
likely not have improved the result noticeably. On the other hand, using a lower
threshold value such as γ = 1.28 instead of γ = 1.64 would have significantly enhanced
the overall clustering performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>
        This paper proposes a simple unsupervised technique to solve the author clustering
problem. As features to discriminate between candidate authors, we propose using the
top m most frequent terms (words and punctuation symbols)
or the most frequent character n-grams. This choice was found effective for related tasks such as
authorship attribution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, compared to various feature selection strategies
used in text categorization [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the most frequent terms tend to select the most
discriminative features when applied to stylistic studies [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To take the author linking
decision, we propose using a simple distance measure called SPATIUM based on a
variant of the L1 norm called Canberra.
      </p>
      <p>
        The proposed approach tends to perform very well in three different languages
(Dutch, English, and Greek) and in two text genres (newspaper articles and reviews).
Such a classifier strategy can be described as having a high bias but a low variance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Changing the training data does not drastically change the decision. However, the
suggested approach ignores other significant information such as mean sentence length,
POS (part of speech) distribution, or topical terms. Even if the proposed system cannot
capture all possible stylistic features (bias), changing the available data does not modify
significantly the overall performance (variance).
      </p>
      <p>It is common to fix some parameters (such as time period, size, genre, or length of
the data) to minimize the possible sources of variation in the corpus. However, our
main goal was to present a simple and unsupervised approach without too many
predefined parameters.</p>
      <p>With SPATIUM the proposed clustering decision could be clearly explained because
it is based on a reduced set of features on the one hand and, on the other, those features
are words, punctuation symbols, or long n-grams. Thus, the interpretation for the final
user is clearer than when working with a huge number of features, when dealing with
short n-grams of letters, or when combining several similarity measures. The SPATIUM
decision can be explained by large differences in relative frequencies of frequent words,
corresponding to either functional terms or overused topical words.</p>
      <p>To improve the current classifier, we will investigate the consequences of other
cluster linking strategies. Changing the single linkage strategy to a complete, average,
or centroid linkage strategy could improve the outcome, because a single link could then
no longer merge two bigger clusters and thereby drastically lower the precision.</p>
      <p>Acknowledgments. The author wants to thank the task coordinators for their
valuable efforts to promote test collections in author clustering. This research was
supported, in part, by the NSF under Grant #200021_149665/1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Delta: A Measure of Stylistic Difference and a Guide to Likely Authorship</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <fpage>267</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2012</year>
          . Ousting Ivory Tower Research:
          <article-title>Towards a Web Framework for Providing Experiments as a Service</article-title>
          . In: Hersh,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            , &amp;
            <surname>Sanderson</surname>
          </string-name>
          , M. (eds.)
          <source>SIGIR. The 35th International ACM</source>
          ,
          <volume>1125</volume>
          -
          <fpage>1126</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>The Elements of Statistical Learning</article-title>
          .
          <source>Data Mining, Inference, and Prediction</source>
          . Springer-Verlag: New York (NY).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>UniNE at CLEF 2016 Author Clustering: Notebook for PAN at CLEF 2016</article-title>
          . In
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (Eds.),
          <source>CLEF 2016 Labs Working Notes, Évora, Portugal, September 5-8</source>
          ,
          <year>2016</year>
          , Aachen: CEUR.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Author Clustering with an Adaptive Threshold</article-title>
          . In
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G. J. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, 8th International Conference of the CLEF Association, CLEF</source>
          <year>2017</year>
          , Dublin, Ireland, September 11-14, 2017, Proceedings. (to appear)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Distance Measures in Author Profiling</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>53</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1103</fpage>
          -
          <lpage>1119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2008</year>
          .
          <source>Introduction to Information Retrieval</source>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (Eds.),
          <source>CLEF. Lecture Notes in Computer Science</source>
          , vol.
          <volume>8685</volume>
          ,
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Estimating the Probability of an Authorship Attribution</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          ,
          <volume>67</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1462</fpage>
          -
          <lpage>1472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Comparative Evaluation of Term Selection Functions for Authorship Attribution</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>246</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Machine Learning in Automatic Text Categorization</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Clustering by Authorship Within and Across Documents</article-title>
          . In
          <source>Working Notes of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings</source>
          , CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering</article-title>
          . In
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs, CEUR Workshop Proceedings</source>
          , CEUR-WS.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>