<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Clustering using Hierarchical Clustering Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
          <email>helena.adorno@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuridiana Aleman</string-name>
          <email>yuridiana.aleman@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Darnes Vilariño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel A. Sanchez-Perez</string-name>
          <email>miguel.sanchez.nan@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Pinto</string-name>
          <email>dpinto@cs.buap.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Benemérita Universidad Autónoma de Puebla (BUAP), Faculty of Computer Science</institution>
          ,
          <addr-line>Puebla</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC)</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two feature representation methods, the log-entropy model and tf-idf, while tuning the minimum frequency threshold to reduce dimensionality. Our system was ranked 1st in both subtasks, author clustering and authorship-link ranking.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Authorship attribution consists of identifying the author of a given document in a
collection. There are several subtasks within the authorship attribution field, such as author
verification [18], author clustering [14], and plagiarism detection [15]. This paper
describes our approach to the author clustering task at PAN 2017 [19,13]. Formally, the
task is defined as follows: given a document collection, group the documents
written by the same author so that each cluster corresponds to a different author. This
task can also be viewed as establishing authorship links between documents.
Applications of this problem include automatic text processing in repositories (Web) and retrieval
of documents written by the same author, among others.</p>
      <p>The number of distinct authors whose documents are included in the collection
is not given. The corpus contains documents in three languages (English, Dutch, and
Greek) and two genres (newspaper articles and reviews). Two application scenarios
were analyzed:
1. Complete author clustering: We do a detailed analysis, where we need to identify
the number k of different authors (clusters) in a collection and assign each
document to exactly one of the k clusters.
2. Authorship-link ranking: In this scenario we explore the collection of documents
as a retrieval task. We aim to establish “authorship” links between documents and
provide a list of document pairs ranked by a confidence score.</p>
      <p>We approached the first scenario using clustering techniques and extracting
character n-grams and stylometric features in a bag of words representation for each
document. The selected features are language- and genre-independent. For the second
scenario we calculated the pairwise similarity between each pair of documents in each
problem using the cosine similarity metric.</p>
      <p>The structure of this paper is as follows: In Section 2, we give an overview of the
literature in this research field. In Section 3, we describe our methodology for the
Author Clustering. In Section 4, we present the results obtained in the two phases of the
evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>Author clustering began in PAN 2012 as part of the Author attribution task focusing on
the paragraph-level instead of document-level. In PAN 2016, the task was extended by
the addition of the authorship link ranking problem [14].</p>
      <p>
        Bagnall [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used a multi-headed recurrent neural network to train a character
n-gram model with a softmax output for each text in all problems. He then applied a
method to turn the multiple softmax outputs into clustering decisions. As preprocessing, he
removed special tokens and decomposed capital letters into an uppercase marker
followed by the corresponding lowercase letter. Afterward, he deleted the words with low document
frequency (words that appear in only one document). He built a model for each
language using all documents available in all problems along with randomly sampled
texts from previous corpora (2014, 2015, 2016). The goal of the training phase was to
optimize the F-Bcubed score, for which the author applied four different strategies.
The first prioritizes the case where each document belongs to one cluster, where the
F-Bcubed score is guaranteed to be larger than 0.5. The other strategies are based on
constraining a single-linkage approach to avoid merging large clusters, a heuristic
aiming to find anchor points in the F-Bcubed score landscape, and a cluster-aware approach
with a programming error that punished any link that joined more than two documents.
Bagnall’s approach ranked first with an F-score of 0.8223.
      </p>
      <p>
        Kocher’s system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was ranked second. The author proposed an unsupervised
approach using simple features and a distance measure called SPATIUM-L1. The features
extracted when computing the distance between a pair of documents correspond to the
top m most frequent terms in the first document of the pair; hence, the distance is
asymmetric (the distance from A to B differs in general from the distance from B to A).
He considered two documents to be linked when the distance
for that particular pair and the distance from the first document to the rest of the
collection is larger than the average minus twice the standard deviation. To compute the
links between documents, he used single-linkage clustering. This approach obtained an
F-score of 0.8218.
      </p>
      <p>Sari &amp; Stevenson [17] extracted two different features: word embeddings and
character n-grams. Then, they applied clustering based on K-Means. The hyperparameter k
was optimized using the Silhouette Coefficient for each of the samples, and the word
embeddings were trained using the Gensim word2vec implementation. The authors used
the 5,000 most frequent character n-grams, with n ranging from 3 to 8.
Their system ranked third with an F-score of 0.7952.</p>
      <p>Zmiycharov et al. [20] combined classification with
agglomerative clustering. The authors used a wide set of features such as average sentence length,
function word ratio, type-token ratio, and part-of-speech tags. In the classification
phase, they trained six different classifiers using an iterative SVM algorithm: one for
each language/genre pair. This approach outperformed the competition baseline, but obtained
lower results than the rest of the participants.</p>
      <p>The different systems presented in the Author Clustering task at PAN 2016
combined classification with clustering techniques, where the main differences are in
preprocessing, feature extraction, and classification method.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <sec id="sec-3-1">
        <title>3.1 Complete author clustering</title>
        <p>
          For the Author Clustering task at PAN 2017, we applied a Hierarchical Cluster Analysis
(HCA) using an agglomerative [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (bottom-up) approach. In this approach, each text
starts in its own cluster, and in each iteration a pair of clusters is merged.
        </p>
        <p>
          To join clusters, we used an average linkage algorithm, where the average cosine
distance between all the documents in the two considered clusters was used to decide
whether they would be merged. We used the Caliński-Harabasz score [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to evaluate the
clustering model, where a higher Caliński-Harabasz score relates to a model with better
defined clusters. To determine the number of clusters in each problem, we
performed the clustering process using a range of k values (with k varying from 1 to the
number of samples in each problem) and chose the value of k with the highest
Caliński-Harabasz score. For k clusters, the Caliński-Harabasz score is given as the ratio of the
between-clusters dispersion mean and the within-cluster dispersion:
          <disp-formula><tex-math>\[ hc(k) = \frac{SSB}{SSW} \times \frac{N - k}{k - 1} \]</tex-math></disp-formula>
        </p>
        <p>where k is the number of clusters, N is the number of observations, SSW is the
overall within-cluster variance (equivalent to the total within sum of squares), and SSB
is the overall between-cluster variance. The total within sum of squares (SSW) is
calculated as follows:
          <disp-formula><tex-math>\[ SSW = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2 \]</tex-math></disp-formula>
        </p>
        <p>where k denotes the number of clusters, x is a data point (document sample), C_i is the
i-th cluster, m_i is the centroid of cluster i, and ||x - m_i|| is the L2 norm (Euclidean
distance) between the two vectors. The overall between-cluster variance is calculated
as the total sum of squares (TSS) minus SSW. The TSS is the sum of the squared distances of all
the data points from the dataset’s centroid; this measure is independent of the number
of clusters.</p>
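        <p>As an illustration of the definitions above, the following minimal Python sketch computes the score from SSW, TSS, and SSB for a given partition. It is not the code of the submitted system: the function names (squared_distance, centroid, calinski_harabasz) are hypothetical, documents are assumed to be dense lists of numbers, and the score is only evaluated when the number of clusters lies strictly between 1 and N, where the formula is defined.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def squared_distance(a, b):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def calinski_harabasz(points, labels):
    """Score (SSB / SSW) * (N - k) / (k - 1) for the partition given by labels."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the score is only defined for 1 < k < N")

    # SSW: squared distances of every point to the centroid of its own cluster.
    ssw = 0.0
    for members in clusters.values():
        m = centroid(members)
        ssw += sum(squared_distance(x, m) for x in members)

    # TSS: squared distances of every point to the grand centroid.
    grand = centroid(points)
    tss = sum(squared_distance(x, grand) for x in points)

    ssb = tss - ssw  # between-cluster variance, as defined in the text
    if ssw == 0:
        return float("inf")  # assumption: perfectly compact clusters score highest
    return (ssb / ssw) * (n - k) / (k - 1)


if __name__ == "__main__":
    toy = [[0, 0], [0, 2], [10, 0], [10, 2]]
    print(calinski_harabasz(toy, [0, 0, 1, 1]))  # compact, well separated clusters
    print(calinski_harabasz(toy, [0, 1, 0, 1]))  # clusters that mix the two groups
```
]]></preformat>
        <p>In the toy data, the first labeling groups nearby points together and therefore obtains a much higher score than the second labeling.</p>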
        <p>SSB measures the variance of all the cluster centroids from the dataset’s grand
centroid (when the centroids of the clusters are spread out and not too close to
each other, the value of SSB is larger). SSW keeps decreasing as the number of clusters
goes up. Therefore, for the Caliński-Harabasz score, the greatest ratio SSB/SSW indicates
the optimal number of clusters. In summary, this score is higher when clusters are dense
and well separated, which means that different authors are probably well grouped in
separate clusters.</p>
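        <p>The merging procedure described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the submission: the names cosine_distance and average_linkage_partitions are hypothetical, ties are broken by taking the first closest pair, and the naive search over all cluster pairs is only suitable for small problems. The function returns one partition for every possible number of clusters, so that each partition can afterwards be scored and the best k kept.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; vectors with zero norm are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage_partitions(vectors):
    """Bottom-up clustering; returns {k: list of clusters} for k = N down to 1.

    Each cluster is a sorted list of document indices.
    """
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # every document starts in its own cluster
    partitions = {n: [list(c) for c in clusters]}

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        partitions[len(clusters)] = [sorted(c) for c in clusters]
    return partitions


if __name__ == "__main__":
    docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
    for k, partition in sorted(average_linkage_partitions(docs).items()):
        print(k, partition)
```
]]></preformat>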
        <p>
          Previous work on Authorship Attribution found that character n-grams are highly
effective features, regardless of the language the texts are written in [
          <xref ref-type="bibr" rid="ref9">9,11</xref>
          ]. In our
approach, we used a combination of typed character 3-grams, untyped character n-grams
(with n varying between 2 and 8), and word n-grams (with n varying from 1 to 3).
Typed character n-grams are character n-grams classified into ten categories based on
affixes, words, and punctuation, and were introduced by Sapkota et al. [16].
        </p>
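        <p>A possible way to extract the untyped character n-grams and the word n-grams is sketched below. The sketch makes several assumptions that the text does not specify: whitespace tokenization, no lowercasing or other preprocessing, and feature keys tagged with their type so that character and word n-grams do not collide. Typed character n-grams are omitted because they require the category rules of Sapkota et al. [16]. All function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=8):
    """Count untyped character n-grams for n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[("char", text[i:i + n])] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=3):
    """Count word n-grams for n in [n_min, n_max] (whitespace tokens, an assumption)."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[("word", " ".join(tokens[i:i + n]))] += 1
    return counts


def extract_features(text):
    """Bag of character and word n-grams for one document."""
    return char_ngrams(text) + word_ngrams(text)


if __name__ == "__main__":
    features = extract_features("to be or not to be")
    print(features[("word", "to be")], features[("char", "to")])
```
]]></preformat>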
        <p>The performance of each of the feature sets was evaluated separately and in
combinations. The N most frequent terms in the vocabulary of each problem were selected
based on a grid search and optimized based on the F-Bcubed score on the entire training
set. We evaluated the N terms from 1 to 60,000 with a step of 50. We found that when
selecting the most frequent 20,000 features we achieved the highest F-Bcubed score on
the entire training set. Hence, we fixed this threshold for all the languages but selected
the features separately for each problem.</p>
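        <p>The per-problem selection of the most frequent terms could look like the sketch below. It assumes that frequency means the total count of a term over all documents of a problem (the text does not say whether total or document frequency was used) and breaks ties alphabetically to keep the result deterministic. The names select_vocabulary and vectorize are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def select_vocabulary(doc_counts, max_terms=20000):
    """Return the max_terms most frequent terms over all documents of one problem."""
    total = Counter()
    for counts in doc_counts:
        total.update(counts)
    # Sort by decreasing frequency, then by term, so ties are resolved consistently.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_terms]]


def vectorize(doc_counts, vocabulary):
    """Map each document to a raw term-frequency vector over the vocabulary."""
    return [[counts.get(term, 0) for term in vocabulary] for counts in doc_counts]


if __name__ == "__main__":
    problem = [Counter({"a": 3, "b": 1}), Counter({"a": 1, "c": 2}), Counter({"b": 1})]
    vocab = select_vocabulary(problem, max_terms=2)
    print(vocab, vectorize(problem, vocab))
```
]]></preformat>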
        <p>
          Finally, we examined two feature representations based on a global weighting scheme,
log-entropy and tf-idf, with different clustering algorithms (k-means and hierarchical
clustering). Global weighting functions measure the importance of a word across the entire
collection of documents. Previous research on document similarity judgments [
          <xref ref-type="bibr" rid="ref10 ref6">6,10</xref>
          ]
has shown that entropy-based global weighting is generally better than the tf-idf model.
The log-entropy (le) weight is calculated as follows:
          <disp-formula><tex-math>\[ e_i = 1 + \sum_{j} \frac{p_{ij} \log p_{ij}}{\log n}, \quad \text{where } p_{ij} = \frac{tf_{ij}}{gf_i} \]</tex-math></disp-formula>
          <disp-formula><tex-math>\[ le_{ij} = e_i \times \log(tf_{ij} + 1) \]</tex-math></disp-formula>
        </p>
        <p>where n is the number of documents, tf_ij is the frequency of term i in document
j, and gf_i is the frequency of term i in the whole collection. A term that appears once in
every document will have a global weight e_i of zero. A term that appears once in one document
will have a global weight of one. Any other combination of frequencies will assign a given
term a global weight between zero and one.</p>
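        <p>The weighting can be checked numerically with a short sketch such as the one below. It assumes a dense term-frequency matrix indexed as tf[j][i] (document j, term i) and natural logarithms in the local factor log(tf + 1); the base does not affect the global weight because it cancels in the ratio. The function name log_entropy_weights is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def log_entropy_weights(tf):
    """Return (global weights e_i, weighted matrix le[j][i]) for counts tf[j][i]."""
    n_docs = len(tf)
    n_terms = len(tf[0])

    entropy = []
    for i in range(n_terms):
        gf = sum(tf[j][i] for j in range(n_docs))  # frequency in the whole collection
        e = 1.0
        if gf > 0 and n_docs > 1:
            for j in range(n_docs):
                if tf[j][i] > 0:  # terms with p = 0 contribute nothing to the sum
                    p = tf[j][i] / gf
                    e += p * math.log(p) / math.log(n_docs)
        entropy.append(e)

    weighted = [[entropy[i] * math.log(tf[j][i] + 1) for i in range(n_terms)]
                for j in range(n_docs)]
    return entropy, weighted


if __name__ == "__main__":
    # Term 0 occurs once in every document, term 1 once in a single document,
    # term 2 is unevenly spread over two documents.
    counts = [[1, 1, 2],
              [1, 0, 0],
              [1, 0, 1]]
    e, le = log_entropy_weights(counts)
    print(e)
```
]]></preformat>
        <p>For the toy matrix, the first term receives a global weight of (numerically) zero, the second a weight of one, and the third a value in between, in agreement with the properties stated above.</p>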
        <p>For the early bird submission, we used the k-means algorithm with the tf-idf weighting
scheme and the Silhouette Coefficient for choosing the number of clusters. In the final
submission, we used hierarchical clustering with the log-entropy weighting scheme and
the Caliński-Harabasz score for choosing the number of clusters.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Authorship-link ranking</title>
        <p>In order to establish the authorship links, we simply calculated the pairwise similarity
between each pair of documents in each problem using the cosine similarity metric. The
vector space model was built in the same manner as for the complete author clustering
subtask, i.e., the same features and the same weighting scheme (log-entropy).</p>
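        <p>The link ranking can be sketched in a few lines. The example below assumes that each document is already represented as a weighted vector and uses the cosine similarity itself as the confidence score of a link; the names cosine_similarity and rank_links are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_links(vectors):
    """Return all document pairs (i, j, score), most similar pair first."""
    links = [(i, j, cosine_similarity(vectors[i], vectors[j]))
             for i, j in combinations(range(len(vectors)), 2)]
    links.sort(key=lambda link: -link[2])
    return links


if __name__ == "__main__":
    docs = [[1, 0], [1, 0.1], [0, 1]]
    for i, j, score in rank_links(docs):
        print(i, j, round(score, 4))
```
]]></preformat>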
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results and Evaluation Measures</title>
      <p>
        Two measures were used to estimate the performance of the systems submitted
to the PAN CLEF 2017 campaign. The F-Bcubed score [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was used to evaluate the
clustering output. This measure corresponds to the harmonic mean of the Bcubed precision
and the Bcubed recall. The Bcubed precision (P-Bcubed) represents the ratio of documents written
by the same author in the same cluster, while the Bcubed recall (R-Bcubed) represents
the ratio of documents written by an author that appear in its cluster. The Mean Average
Precision (MAP) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was used to evaluate the authorship-link ranking. The MAP measures
the average area under the precision-recall curve for a set of problems.
      </p>
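        <p>Both measures can be illustrated with the following sketch. It follows the standard definitions of Bcubed precision and recall [1] and of average precision [7]; the official PAN evaluator may differ in details, and the names bcubed and average_precision are hypothetical. MAP would be the mean of average_precision over all problems.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def bcubed(predicted, truth):
    """Bcubed precision, recall and F for dicts mapping document id to a label."""
    docs = list(truth)
    precision = 0.0
    recall = 0.0
    for d in docs:
        same_cluster = [e for e in docs if predicted[e] == predicted[d]]
        same_author = [e for e in docs if truth[e] == truth[d]]
        # Documents that share both the cluster and the author with d (d included).
        correct = sum(1 for e in same_cluster if truth[e] == truth[d])
        precision += correct / len(same_cluster)
        recall += correct / len(same_author)
    precision /= len(docs)
    recall /= len(docs)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def average_precision(ranked_links, true_links):
    """Average precision of a ranked list of links against the set of true links."""
    hits = 0
    total = 0.0
    for rank, link in enumerate(ranked_links, start=1):
        if link in true_links:
            hits += 1
            total += hits / rank  # precision at the rank of each true link
    return total / len(true_links) if true_links else 0.0


if __name__ == "__main__":
    truth = {"a": "X", "b": "X", "c": "Y"}
    one_cluster = {"a": 1, "b": 1, "c": 1}
    print(bcubed(one_cluster, truth))
    print(average_precision([("a", "b"), ("a", "c"), ("b", "c")],
                            {("a", "b"), ("b", "c")}))
```
]]></preformat>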
      <p>Table 1 presents the results of our early bird submission obtained on the PAN Author
Clustering 2017 test dataset, evaluated on the TIRA platform [12]. In this submission,
we had a problem with our authorship-link ranking module; for this reason, the MAP
evaluation measure is not available.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>We presented our system submitted to the Author Clustering task at PAN 2017. We
carried out experiments using different features: typed and untyped character n-grams,
and word n-grams. Our final submission implemented the log-entropy weighting scheme
on the combination of the 20,000 most frequent terms with hierarchical clustering. We
optimized the number of clusters in each problem using the Caliński-Harabasz score.</p>
      <p>
        In future research, we would like to adapt the feature set for each language
(subcorpus), as described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], in order to improve system performance for each of the
languages individually.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Mexican Government (CONACYT project
240844, SNI, COFAA-IPN, and SIP-IPN projects 20171813, 20171344, and 20172008).</p>
      <p>11. Posadas-Durán, J., Gómez-Adorno, H., Sidorov, G., Batyrshin, I., Pinto, D., Chanona-Hernández, L.: Application of the distributed document representation in the authorship attribution task for small corpora. Soft Computing 21, 627–639 (2016)</p>
      <p>12. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative. pp. 268–299. CLEF’14 (2014)</p>
      <p>13. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of PAN’17: Author Identification, Author Profiling, and Author Obfuscation. In: Jones, G., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative. CLEF’17, Berlin Heidelberg New York (2017)</p>
      <p>14. Rosso, P., Rangel, F., Potthast, M., Stamatatos, E., Tschuggnall, M., Stein, B.: Overview of PAN’16 - New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation. In: Fuhr, N., Quaresma, P., Larsen, B., Gonçalves, T., Balog, K., Macdonald, C., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 7th International Conference of the CLEF Initiative (CLEF 16). Berlin Heidelberg New York (2016)</p>
      <p>15. Sanchez-Perez, M.A., Gelbukh, A., Sidorov, G.: Adaptive algorithm for plagiarism detection: The best-performing approach at PAN 2014 text alignment competition. In: Proceedings of the 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8–11, 2015. vol. 9283, pp. 402–413. Springer (2015)</p>
      <p>16. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies. pp. 93–102. NAACL-HLT ’15, Association for Computational Linguistics (2015)</p>
      <p>17. Sari, Y., Stevenson, M.: Exploring Word Embeddings and Character N-Grams for Author Clustering - Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers, 5-8 September, Évora, Portugal (2016)</p>
      <p>18. Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the Author Identification Task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers (2015)</p>
      <p>19. Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings (2017)</p>
      <p>20. Zmiycharov, V., Alexandrov, D., Georgiev, H., Kiprov, Y., Georgiev, G., Koychev, I., Nakov, P.: Experiments in Authorship-Link Ranking and Complete Author Clustering - Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers, 5-8 September, Évora, Portugal (2016)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information retrieval 12</source>
          (
          <issue>4</issue>
          ),
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bagnall</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Authorship Clustering Using Multi-headed Recurrent Neural Networks - Notebook for PAN at CLEF 2016</article-title>
          . In: Balog,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Macdonald</surname>
          </string-name>
          , C. (eds.)
          <article-title>CLEF 2016 Evaluation Labs</article-title>
          and Workshop - Working Notes Papers,
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          September, Évora, Portugal (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Caliński, T.,
          <string-name>
            <surname>Harabasz</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A dendrite method for cluster analysis</article-title>
          .
          <source>Communications in Statistics 3(1)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          (
          <year>1974</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>UniNE at CLEF 2016: Author Clustering - Notebook for PAN at CLEF 2016</article-title>
          . In: Balog,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Macdonald</surname>
          </string-name>
          , C. (eds.)
          <article-title>CLEF 2016 Evaluation Labs</article-title>
          and Workshop - Working Notes Papers,
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          September, Évora, Portugal (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watters</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dazeley</surname>
          </string-name>
          , R.:
          <article-title>Automated unsupervised authorship analysis using evidence accumulation clustering</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>19</volume>
          ,
          <fpage>95</fpage>
          -
          <lpage>101</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navarro</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikkerud</surname>
            ,
            <given-names>H.:</given-names>
          </string-name>
          <article-title>An empirical evaluation of models of text document similarity</article-title>
          .
          <source>In: Proceedings of the Cognitive Science Society</source>
          . vol.
          <volume>27</volume>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , et al.:
          <article-title>Introduction to information retrieval</article-title>
          , vol.
          <volume>1</volume>
          . Cambridge University Press, Cambridge (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Language- and subtask-dependent feature selection and classifier parameter tuning for author profiling</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Mandl</surname>
          </string-name>
          , T. (eds.)
          <article-title>Working Notes Papers of the CLEF 2017 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
          <article-title>Improving cross-topic authorship attribution: The role of pre-processing</article-title>
          .
          <source>In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing. CICLing 2017</source>
          , Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pincombe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Comparison of human and latent semantic analysis (lsa) judgements of pairwise document similarities for a news corpus</article-title>
          .
          <source>Tech. rep., DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION SALISBURY (AUSTRALIA) INFO SCIENCES LAB</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>