<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experiments in Authorship-Link Ranking and Complete Author Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valentin Zmiycharov</string-name>
          <email>valentin.zmiycharov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitar Alexandrov</string-name>
          <email>dimityr.alexandrov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hristo Georgiev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yasen Kiprov</string-name>
          <email>yasen.kiprov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgi Georgiev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Koychev</string-name>
          <email>koychev@fmi.uni-so</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Preslav Nakov</string-name>
          <email>pnakov@qf.org.qa</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FMI, Sofia University "St. Kliment Ohridski"</institution>
          ,
          <addr-line>Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Qatar Computing Research Institute</institution>
          ,
          <addr-line>HBKU, Doha</addr-line>
          ,
          <country country="QA">Qatar</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>The paper presents the approach we developed for the Authorship-Link Ranking and Complete Author Clustering task at the PAN 2016 competition. Given a document collection, the task is to group documents written by the same author, so that each cluster corresponds to a different author. This task can also be viewed as one of establishing authorship links between documents. We use a combination of classification and agglomerative clustering with a rich set of features such as average sentence length, function words ratio, type-token ratio, and part-of-speech tags.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For this task, we are given a collection of up to 100 documents. All of them are
single-authored, written in the same language, and belong to the same genre; both the language and the
genre are given. The topic and the length of the documents vary, and the number of
distinct authors whose documents are included in the collection is unknown.</p>
      <p>
        The participating systems have to provide two outputs for each instance:
– Complete author clustering result: Each cluster should contain all documents
found in the collection by a specific author. The clusters should be non-overlapping,
i.e., each document should belong to exactly one cluster (Figure 1).
– Authorship-link ranking result: A list of document pairs ranked according to a
real-valued score in [0,1], where higher values denote higher confidence that the
two documents in the pair were written by the same author (Figure 2).
      </p>
      <p>
After analyzing the training documents, we concluded that typical document similarity
measures, e.g., based on TF-IDF, are not suitable as features: they work well for
classification into topics, but here different authors may write on the same topic. Instead, we focused on
features that model author style and are orthogonal to topic-related features [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Using
these features, we first classify each pair of documents as written by the same author
or not. Then, we apply agglomerative clustering based on the
classifier’s confidence scores for each pair of documents.
      </p>
      <p>
The training set contains 18 folders covering three languages and two genres, i.e., three
folders per language/genre pair:
– English / Articles
– English / Reviews
– Dutch / Articles
– Dutch / Reviews
– Greek / Articles
– Greek / Reviews
Each folder contains between 50 and 100 documents. The clusters are represented by
thresholded authorship-link values of 0 and 1, where 1 means that the two documents
are by the same author and thus should be in the same cluster; only the links with a value of
1 are listed. Therefore, the task can be seen as a binary classification task: are
two documents by the same author or not? To distinguish one author from
another, we need features that capture author style, as content-based similarity
is not very helpful in this case.
      </p>
      <sec id="sec-1-1">
        <title>Features</title>
        <p>We used the following features:
– Average sentence length: Some authors tend, often unconsciously, to continue a
sentence with conjunctions or commas instead of ending it with a
period. Thus, this feature can be very indicative of authorship.
– Function words ratio: Different authors have different biases and preferences with
respect to the use of function words, which makes these words some of the most
popular features for stylometry and authorship attribution. We use three separate
lists of function words<sup>3</sup> for English (173 words), Greek (250 words), and Dutch
(104 words).
– Type-token ratio: The richness of the vocabulary used by an author is another
indicator of style. We compute the number of unique word types in a document divided
by the total number of tokens in the document. We take similar type-token ratios as
evidence that two documents were written by the same author. This
feature also reflects the author’s tendency to repeat words.
– Part-of-speech features: described in detail below.</p>
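        <p>The three surface features above can be sketched as follows. This is a minimal illustration and not the implementation used for the submission: the sentence splitter, the tokenizer, and the tiny FUNCTION_WORDS set are hypothetical stand-ins, and we assume that the function words ratio is the share of tokens that appear in the function word list.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical mini-list for illustration only; the paper uses
# per-language lists of 104 to 250 function words.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but",
                  "of", "in", "on", "to", "is", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text):
    """Lowercased word tokens; \\w also matches non-Latin letters (e.g., Greek)."""
    return re.findall(r"\w+", text.lower())


def style_features(text, function_words=FUNCTION_WORDS):
    """Return the three surface style features for one document."""
    sentences = split_sentences(text)
    tokens = tokenize(text)
    if not tokens:
        return {"avg_sentence_length": 0.0,
                "function_word_ratio": 0.0,
                "type_token_ratio": 0.0}
    n_function = sum(1 for t in tokens if t in function_words)
    return {
        # Average number of tokens per sentence.
        "avg_sentence_length": len(tokens) / len(sentences),
        # Assumed definition: function-word tokens divided by all tokens.
        "function_word_ratio": n_function / len(tokens),
        # Unique word types divided by the total number of tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


features = style_features("The cat sat on the mat. It is a cat.")
print(features)
```
]]></preformat>
        <p>In a real setting, the naive splitter and tokenizer would be replaced by language-specific tools, since sentence boundaries directly affect the average sentence length.</p>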
        <p>
          For the part-of-speech (POS) features, we used the following taggers:
– English - Stanford Log-linear Part-Of-Speech tagger [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
– Dutch - Stanford Log-linear Part-Of-Speech tagger [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
– Greek - AUEB’s Natural Language Processing Group Part-of-Speech tagger<sup>4</sup>
        </p>
        <p>
          After tagging the documents, we extracted the following part-of-speech features:
nouns ratio, adjectives ratio, verbs ratio, and conjunctions ratio. These features
are all based on the same principle, which we explain below. Before comparing two
documents, we perform the following steps:
1. For each sentence in the document, we count the occurrences of the part-of-speech
tags we are interested in (nouns, adjectives, verbs, and conjunctions) and we store the
counts (in memory or persisted in the file system).
2. Order statistics: we sort all per-sentence counts for the document in ascending
order.
3. From the sorted counts, we compute summary statistics of the distribution of the
part-of-speech tag in the document: the minimum and the maximum
values (the first and the last values in the sorted order), the first and the third quartiles, and
the median (i.e., the second quartile) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
4. For each of the statistics in step 3, we calculate the information gain, which shows
how strongly the statistic distinguishes between authors. These statistics capture information that is not visible
when reading a document and is immune to conscious control.
5. We create a vector with the following attributes:
        </p>
        <p>– Minimum value of the distribution for the POS tag (e.g., verb) in the document
– First quartile
– Median (second quartile)
– Third quartile
– Maximum value</p>
        <p><sup>3</sup> http://www.ranks.nl/stopwords</p>
        <p><sup>4</sup> http://nlp.cs.aueb.gr/software.html</p>
        <p>
          Using the constructed vectors, we can measure the distance between
any two documents in the same language, regardless of their length. The idea for this
feature was inspired by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], but we extended it with more attributes, which we believe
can give better results. The distance between two vectors is calculated
using the Euclidean distance. The measured distance is then added as a feature for
the classifier (see the Classification section).
        </p>
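        <p>The part-of-speech feature for a single tag can be sketched as follows. This is an illustration under stated assumptions and not the submitted code: documents are assumed to be already tagged and given as lists of sentences with (word, tag) pairs, the tag names VERB and NOUN are hypothetical, and the quartiles use the "inclusive" method of Python's statistics module, which may differ slightly from the quartile definition used in our experiments.</p>
        <preformat><![CDATA[
```python
import math
import statistics


def pos_counts_per_sentence(tagged_sentences, tag):
    """Count how often `tag` occurs in each sentence of (word, tag) pairs."""
    return [sum(1 for _, t in sentence if t == tag)
            for sentence in tagged_sentences]


def five_number_summary(counts):
    """Return [min, Q1, median, Q3, max] of the per-sentence counts."""
    ordered = sorted(counts)
    if not ordered:
        return [0.0] * 5
    if len(ordered) == 1:
        # statistics.quantiles needs at least two data points.
        return [float(ordered[0])] * 5
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return [float(ordered[0]), q1, median, q3, float(ordered[-1])]


def pos_distance(doc_a, doc_b, tag):
    """Euclidean distance between the summary vectors of two documents."""
    vec_a = five_number_summary(pos_counts_per_sentence(doc_a, tag))
    vec_b = five_number_summary(pos_counts_per_sentence(doc_b, tag))
    return math.dist(vec_a, vec_b)


def fake_document(verb_counts):
    """Build a toy tagged document with the given number of verbs per sentence."""
    return [[("w", "VERB")] * k + [("x", "NOUN")] for k in verb_counts]


doc_a = fake_document([5, 3, 1, 4, 2])
doc_b = fake_document([1, 1, 1, 1, 1])
summary_a = five_number_summary(pos_counts_per_sentence(doc_a, "VERB"))
distance = pos_distance(doc_a, doc_b, "VERB")
print(summary_a, distance)
```
]]></preformat>
        <p>Because the vector always has five entries, two documents with very different numbers of sentences can still be compared directly.</p>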
      </sec>
      <sec id="sec-1-2">
        <title>Classification</title>
        <p>A classification example represents a pair of documents and whether they are by the
same or by different authors. This treats the task as a binary classification problem. For
each of the described features, the absolute difference of its values between the two
documents in the pair is calculated. We obtained better results without normalization, and
thus we pass the vector containing the differences to the classifier without modification.
Note that we trained six different classifiers: one for each language/genre pair.</p>
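        <p>The construction of the classification examples can be sketched as follows. The function name pair_examples and the toy feature values are hypothetical; the sketch only shows how the absolute feature differences and the same-author label are derived for each pair, and it leaves out the classifier itself.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def pair_examples(doc_features, doc_authors):
    """Build one example per unordered pair of documents.

    doc_features maps a document id to its list of feature values;
    doc_authors maps a document id to its known author (training data).
    Each example is (doc_a, doc_b, absolute_differences, label), where
    label is 1 for a same-author pair and 0 otherwise.
    """
    examples = []
    for a, b in combinations(sorted(doc_features), 2):
        diffs = [abs(x - y) for x, y in zip(doc_features[a], doc_features[b])]
        label = 1 if doc_authors[a] == doc_authors[b] else 0
        examples.append((a, b, diffs, label))
    return examples


# Toy values: [average sentence length, function words ratio, type-token ratio]
doc_features = {
    "d1": [20.0, 0.5, 0.75],
    "d2": [18.0, 0.25, 0.5],
    "d3": [10.0, 0.25, 0.25],
}
doc_authors = {"d1": "A", "d2": "A", "d3": "B"}

examples = pair_examples(doc_features, doc_authors)
for example in examples:
    print(example)
```
]]></preformat>
        <p>A collection of n documents yields n(n-1)/2 examples, so a folder with 100 documents produces 4,950 pairs.</p>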
        <p>An approach that did not work well was linear regression, which computed
a number between 0 and 1 for each document pair (values closer to 1 indicate pairs
that are more likely to be by the same author). After some experiments with unsatisfactory
results, we decided to model the task as a classification problem: whether a document
pair is by the same author or not.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Clustering</title>
        <p>We iteratively create clusters based on the authorship-link results calculated
by an SVM classifier using the above features. Our clustering algorithm consists of the
following steps:
1. Start with a single document from the test set, then construct and classify pairs with
all other documents. The documents whose pairs are classified as positive are added to the cluster,
and the rest are ignored.
2. Select one of the remaining documents and loop through all existing clusters.
Add the document to every cluster in which more than 50% of the documents are
close to it (i.e., the pairs are classified as positive). If there is no such cluster, the document
forms a new cluster.
3. Repeat the previous step for all the remaining documents.
4. Find the documents that are contained in more than one cluster and remove each of them from all
but one cluster: the one with the highest total similarity to the documents in the
cluster. As a result, every document is in exactly one cluster.</p>
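        <p>The four steps above can be sketched as follows. This is a simplified illustration and not the submitted code: the score function stands in for the classifier's confidence, and we assume that a pair counts as positive when its score is at least a threshold (0.8 here, as an assumption) and that total similarity is the sum of the scores to the other members of a cluster.</p>
        <preformat><![CDATA[
```python
def cluster_documents(docs, score, threshold=0.8):
    """Greedy clustering driven by pairwise same-author scores.

    score(a, b) returns a confidence in [0, 1]; a pair is treated as
    positive when the score is at least `threshold` (an assumption).
    """
    def positive(a, b):
        return score(a, b) >= threshold

    if not docs:
        return []

    # Step 1: seed the first cluster with the first document and its links.
    first, rest = docs[0], docs[1:]
    seed = [first] + [d for d in rest if positive(first, d)]
    clusters = [seed]
    remaining = [d for d in rest if d not in seed]

    # Steps 2 and 3: place every remaining document.
    for doc in remaining:
        homes = [c for c in clusters
                 if sum(positive(doc, m) for m in c) > 0.5 * len(c)]
        if homes:
            for c in homes:
                c.append(doc)
        else:
            clusters.append([doc])

    # Step 4: keep each document only in its most similar cluster.
    for doc in docs:
        containing = [c for c in clusters if doc in c]
        if len(containing) > 1:
            best = max(containing,
                       key=lambda c: sum(score(doc, m) for m in c if m != doc))
            for c in containing:
                if c is not best:
                    c.remove(doc)

    return [c for c in clusters if c]


# Toy scores: a-b and c-d are strong links, every other pair is weak.
toy_scores = {frozenset(("a", "b")): 0.9, frozenset(("c", "d")): 0.95}


def toy_score(x, y):
    return toy_scores.get(frozenset((x, y)), 0.1)


clusters = cluster_documents(["a", "b", "c", "d"], toy_score)
print(clusters)
```
]]></preformat>
        <p>The result depends on the order in which the documents are processed, which is a known property of greedy procedures of this kind.</p>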
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results and Analysis</title>
      <p>We introduced a threshold on the authorship-link score: we consider a pair of
documents to be written by the same author only if its score is at least this threshold. After many
experiments, we set the threshold to 0.8.</p>
      <p>
        The clustering output is evaluated according to BCubed F-score [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The ranking of
authorship links is evaluated according to Mean Average Precision [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The results of
our submission are shown in Table 2.
      </p>
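      <p>For readers unfamiliar with the two measures, the following sketch shows how they can be computed for a single collection. It follows the standard textbook definitions and is not the official evaluation script, whose details may differ; the function names and the toy data are hypothetical. Mean Average Precision is the mean of the average precision values over all collections.</p>
      <preformat><![CDATA[
```python
def bcubed_f(predicted, gold):
    """BCubed F-score for one non-empty collection.

    predicted and gold map each document id to a cluster id and to the
    true author, respectively.
    """
    docs = list(gold)
    precision = recall = 0.0
    for d in docs:
        same_cluster = {e for e in docs if predicted[e] == predicted[d]}
        same_author = {e for e in docs if gold[e] == gold[d]}
        correct = len(same_cluster & same_author)
        precision += correct / len(same_cluster)
        recall += correct / len(same_author)
    precision /= len(docs)
    recall /= len(docs)
    return 2 * precision * recall / (precision + recall)


def average_precision(ranked_pairs, true_links):
    """Average precision of a ranked list of document pairs.

    ranked_pairs is sorted by decreasing score; true_links is the set
    of pairs that really share an author.
    """
    hits = 0
    total = 0.0
    for rank, pair in enumerate(ranked_pairs, start=1):
        if pair in true_links:
            hits += 1
            total += hits / rank
    return total / len(true_links) if true_links else 0.0


gold = {"a": "A", "b": "A", "c": "B", "d": "B"}
predicted = {"a": 0, "b": 0, "c": 0, "d": 1}
f_score = bcubed_f(predicted, gold)

ab, ac, cd = frozenset("ab"), frozenset("ac"), frozenset("cd")
ap = average_precision([ab, ac, cd], {ab, cd})
print(f_score, ap)
```
]]></preformat>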
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>We have described our approach to author clustering and authorship-link ranking, which
distinguishes authors based on their writing style. All baseline features that we use in our work
(average word and sentence metrics, function word usage, vocabulary richness) are
useful, but not precise enough to separate authors correctly. That is the main motivation
for experimenting with POS features and statistical distributions. We see great potential
in such features, and we believe the current implementation can be greatly improved.</p>
      <p>In future work, we plan to experiment with different machine learning algorithms
(Neural Networks, Bagging, Active Learning), and with more stylometry features.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This research was performed by a team of students from MSc programs in Computer
Science at Sofia University “St. Kliment Ohridski”. We also thank Sofia
University “St. Kliment Ohridski” for the support and guidance for our team's participation in the
CLEF 2016 Conference.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Inf. Retr</source>
          .
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          (
          <year>Aug 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rasheed</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Using machine learning techniques for stylometry</article-title>
          .
          <source>In: Proceedings of the International Conference on Artificial Intelligence, IC-AI '04, Volume 2 &amp; Proceedings of the International Conference on Machine Learning; Models, Technologies &amp; Applications, MLMTA '04</source>
          . pp.
          <fpage>897</fpage>
          -
          <lpage>903</lpage>
          . Las Vegas, Nevada, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution</article-title>
          .
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          -
          <lpage>334</lpage>
          (
          <year>Dec 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <source>Introduction to Information Retrieval</source>
          . Cambridge University Press, New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mendenhall</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beaver</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beaver</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <source>Introduction to Probability and Statistics</source>
          . Cengage Learning
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Clustering by Authorship Within and Across Documents</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings</source>
          , CLEF and CEUR-WS.org (
          <year>Sep 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Enriching the knowledge sources used in a maximum entropy part-of-speech tagger</article-title>
          .
          <source>In: Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora</source>
          . pp.
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          . EMNLP '00, Hong Kong (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>