<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovering Author Groups using a β-compact graph-based clustering.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yasmany García-Mondeja</string-name>
          <email>yasmany@cerpamid.co.cu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <email>daniel.castro@cerpamid.co.cu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <email>vania.lavielle@datys.cu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante</institution>
          ,
          <country>España</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Desarrollo de Aplicaciones</institution>
          ,
          <addr-line>Tecnología y Sistemas DATYS</addr-line>
          ,
          <country country="CU">Cuba</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yasmany García-Mondeja</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>Identifying the authorship of an anonymous or disputed document is a cornerstone of automatic forensic applications, and it is a challenging task for both humans and computers. Clustering documents according to the linguistic style of the authors who wrote them has received little attention from the research community. To address this problem, the PAN evaluation framework has become the first effort to promote the development of author clustering. This article proposes a graph-based method, specifically β-compact clustering, for discovering the groups of documents written by the same author. The β-compact algorithm is based on the analysis of the similarity between documents: two documents belong to the same group as long as the similarity between them exceeds the threshold β and is the maximum similarity with respect to the other documents. In our proposal we evaluated different linguistic features and similarity measures presented in previous work on authorship analysis. The training dataset was used to determine the best value of the β parameter for each language. The results of the experiments were encouraging.</p>
      </abstract>
      <kwd-group>
        <kwd>Author clustering</kwd>
        <kwd>β-compact clustering algorithm</kwd>
        <kwd>linguistic features</kwd>
        <kwd>similarity measures</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the evaluation framework the task is described as follows: "Given a collection of
(up to 50) short documents (paragraphs extracted from larger documents), identify
authorship links and groups of documents written by the same author. All documents are
single-authored, in the same language, and belong to the same genre. However, the
topic or text-length of documents may vary. The number of distinct authors whose
documents are included in the collection is not given."<sup>1</sup></p>
      <p>
        One of the most widely used strategies for document representation in Text Mining (TM)
applications is the classic Bag of Words [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and this is the
representation used in our work. In Authorship Analysis applications, complex
methods involving several algorithms have been used to obtain the best results, and
ensembles of algorithms have also been employed in document clustering and other
Artificial Intelligence (AI) tasks. Nevertheless, the work presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is
relevant because it uses a simple clustering algorithm and still achieves encouraging results.
      </p>
      <p>
        In summary, six papers were presented in the last edition of the author clustering task
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In general, the document collections of that edition contained
a high percentage of clusters composed of a single document. In contrast, in the
training collection released this year we observed several clusters with more than
one document, although there are still few documents per group.
      </p>
      <p>In this work we propose and evaluate a clustering algorithm that we have
used for topic-based document clustering in our research center. It groups objects
under the condition that, for each object in a group, there is at least one other
object in that group whose similarity to it is greater than a similarity threshold
and is also the maximum similarity that object has with any object in the collection.</p>
      <p>It is important to emphasize some aspects of the author clustering problem:
the texts are short, no longer than a paragraph, and the texts of the same author
belong to the same genre but do not necessarily share the same topic or length.
In addition, as part of the task, we need to produce a ranking of similarities between
objects in the same cluster. Taking these details into consideration, in the next section
we describe our method, which uses binary linguistic features and a
clustering algorithm based on compact clusters.</p>
    </sec>
    <sec id="sec-2">
      <title>Implemented method</title>
      <p>
        We propose the use of the β-compact algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for the author clustering task because
it groups objects whose similarity is greater than a similarity threshold, previously
adjusted on a training document collection, while keeping only
the greatest similarities.
      </p>
      <p>Figure 1 shows the architecture of the implemented method; each of the steps
involved is described below.</p>
      <p><sup>1</sup> <ext-link ext-link-type="uri" xlink:href="http://www.webis.de">http://www.webis.de</ext-link></p>
      <p>In both the training and test stages, the method receives collections of documents, and the
final purpose is to obtain groups of documents such that all the documents in a group
were written by the same author.</p>
      <p>To obtain the groups, the proposed algorithm requires a representation of each
document, a comparison function that evaluates the similarity between a
pair of documents, and a threshold β to decide when two documents must belong to the
same cluster.</p>
      <p>
        For document representation we used the classic Bag of Words, and with the
training dataset we tried different types of features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We experimented with three similarity
functions to compare documents: Dice, Jaccard and
Cosine [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All of them used only binary features; that is, we did not compute the
frequency of each feature, only its presence in the document. We
consider only binary features because the documents are short, at most
one paragraph.
      </p>
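      <p>As an illustration, the three similarity functions over binary features can be sketched as follows. This is a minimal sketch and not the implementation used in the experiments: representing a document as a Python set of features, the function names, and the convention of returning 0 for empty documents are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import math


def dice(a: set, b: set) -> float:
    """Dice coefficient on binary features: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # assumed convention for two empty documents
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient on binary features: |A n B| / |A u B|."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty documents
    return len(a & b) / len(union)


def cosine(a: set, b: set) -> float:
    """Cosine on binary vectors: |A n B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0  # assumed convention when either document is empty
    return len(a & b) / math.sqrt(len(a) * len(b))


# Two toy documents represented only by the presence of word features.
doc_a = {"the", "cat", "sat"}
doc_b = {"the", "cat", "ran", "far"}

print(dice(doc_a, doc_b))     # 4/7, about 0.571
print(jaccard(doc_a, doc_b))  # 2/5 = 0.4
print(cosine(doc_a, doc_b))   # 2/sqrt(12), about 0.577
```
]]></preformat>
      <p>Because the features are binary, all three functions depend only on the sizes of the two feature sets and of their intersection.</p>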
      <p>
        The β-compact clustering algorithm is described in the pseudocode below.
First we need to define the concept of the graph of maximum β-similarity: it is a directed
graph in which the vertices are the objects and there is an edge from vertex Oi
to vertex Oj if Oj is β-similar to Oi and Oj is the most similar to Oi among all the other objects [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <preformat>In:  U - universe of documents
Out: Cluster - several groups of documents

Cluster = Ø
G = BuildGraphMaximum_β_Similarity(U)
Cluster = SearchConnectedComponentsIgnoringOrientation(G)</preformat>
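      <p>The two steps of the algorithm (building the graph of maximum β-similarity and extracting its connected components while ignoring edge orientation) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the names beta_compact and jaccard are hypothetical, "β-similar" is taken to mean a similarity strictly greater than β, and when several objects tie for the maximum similarity an edge is added to each of them.</p>
      <preformat><![CDATA[
```python
from typing import Callable, List, Sequence


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient on binary feature sets (used only for the example)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def beta_compact(items: Sequence, sim: Callable, beta: float) -> List[List[int]]:
    """Return clusters as lists of indices into `items`."""
    n = len(items)
    # Adjacency lists of the maximum beta-similarity graph. Orientation is
    # ignored in the next step, so edges are stored in both directions.
    adj = [set() for _ in range(n)]
    for i in range(n):
        sims = [(sim(items[i], items[j]), j) for j in range(n) if j != i]
        if not sims:
            continue
        best = max(s for s, _ in sims)
        if best > beta:  # assumption: strictly greater than beta
            for s, j in sims:
                if s == best:  # assumption: ties all receive an edge
                    adj[i].add(j)
                    adj[j].add(i)

    # Connected components by depth-first search.
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(component))
    return clusters


docs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "d", "e"}, {"x", "y"}]
print(beta_compact(docs, jaccard, 0.3))  # [[0, 1, 2], [3]]
print(beta_compact(docs, jaccard, 0.6))  # [[0], [1], [2], [3]]
```
]]></preformat>
      <p>In the first call, documents 0 and 2 end up in the same cluster even though their similarity (0.2) is below β = 0.3, because each of them is linked to document 1 with similarity 0.5.</p>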
      <p>
        We performed different runs in which the comparison function and the
representation were varied, as well as the value of β, from 0 to 0.5. When β was greater than
0.5, the clusters obtained did not change. For each language, the best configuration
was determined by comparing the clusters obtained with the training
ground truth using the FBcubed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] measure.
      </p>
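      <p>The selection of β can be illustrated with the following sketch, which computes the BCubed F-score of a clustering against the ground truth and keeps the best-scoring candidate threshold. It is only a sketch under stated assumptions: the function names bcubed_f and pick_beta, the dictionary-based representation of clusterings, and the step of 0.05 between candidate thresholds are hypothetical, and the official evaluation script may differ in its details.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence


def bcubed_f(pred: Dict[Hashable, Hashable], gold: Dict[Hashable, Hashable]) -> float:
    """BCubed F-score; `pred` and `gold` map each document to a cluster id."""
    docs = list(gold)
    if not docs:
        return 0.0
    pred_size = Counter(pred[d] for d in docs)
    gold_size = Counter(gold[d] for d in docs)
    # Number of documents sharing both the predicted and the true cluster.
    both = Counter((pred[d], gold[d]) for d in docs)
    precision = sum(both[(pred[d], gold[d])] / pred_size[pred[d]] for d in docs) / len(docs)
    recall = sum(both[(pred[d], gold[d])] / gold_size[gold[d]] for d in docs) / len(docs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pick_beta(betas: Sequence[float], cluster_with: Callable, gold: Dict) -> float:
    """Return the first candidate beta with the highest BCubed F-score.

    `cluster_with(beta)` is a hypothetical callback that runs the clustering
    with the given threshold and returns a document-to-cluster mapping.
    """
    return max(betas, key=lambda b: bcubed_f(cluster_with(b), gold))


gold = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
pred = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
print(bcubed_f(pred, gold))  # 12/17, about 0.706

# Candidate thresholds from 0 to 0.5 in steps of 0.05 (assumed step size).
betas = [k / 20 for k in range(11)]
```
]]></preformat>
      <p>In the toy example the BCubed precision is 2/3 and the recall is 3/4, which gives an F-score of 12/17.</p>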
      <p>Finally, for each language we obtain a configuration of the method: a binary
feature representation, a similarity function and a threshold β.</p>
      <p>It is important to note that, due to the nature of the β-compact algorithm, two
documents can belong to the same group even if the similarity between them does not
exceed the defined β, because the only condition is that each of them has a
similarity greater than β with some document of the group. As a result, the
clusters are not necessarily spherical. This can be observed in the ranking file
produced by the method, where some similarities between documents of the same cluster
are below the threshold β and therefore appear low in the
list.</p>
      <p>To build the similarity ranking, we took all the similarities among
objects of a group and ordered them. The similarities above the threshold β were
distributed on a scale from 0.5 to 1, where 0.5 corresponds to similarities equal to
β and values close to 1 to the greatest ones. Similarly, the similarities
that did not exceed β were distributed from 0 to 0.5.</p>
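      <p>One possible reading of this rescaling is a piecewise linear map, sketched below. The text does not state the exact mapping, so the linear form, the function names rescale and rank_links, and the assumption that raw similarities lie in [0, 1] are ours and serve only as an illustration.</p>
      <preformat><![CDATA[
```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def rescale(sim: float, beta: float) -> float:
    """Map a raw similarity in [0, 1] to a ranking score (assumed linear)."""
    if sim >= beta:
        if beta >= 1.0:
            return 1.0  # only sim == 1 can reach this branch
        # [beta, 1] is mapped linearly onto [0.5, 1].
        return 0.5 + 0.5 * (sim - beta) / (1.0 - beta)
    # [0, beta) is mapped linearly onto [0, 0.5); beta is positive here.
    return 0.5 * sim / beta


def rank_links(clusters: Sequence[Sequence], sim: Callable, beta: float) -> List[Tuple]:
    """Score every pair inside each cluster and sort by decreasing score."""
    links = []
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            links.append((a, b, rescale(sim(a, b), beta)))
    return sorted(links, key=lambda link: link[2], reverse=True)


# Toy similarities for one cluster of three documents and one singleton.
raw = {
    frozenset(("d1", "d2")): 0.6,
    frozenset(("d2", "d3")): 0.5,
    frozenset(("d1", "d3")): 0.1,
}
ranked = rank_links([["d1", "d2", "d3"], ["d4"]], lambda a, b: raw[frozenset((a, b))], 0.2)
for a, b, score in ranked:
    print(a, b, round(score, 4))
```
]]></preformat>
      <p>With β = 0.2, the pair (d1, d3) has a raw similarity below the threshold and therefore receives a score below 0.5 and the last position in the ranking, although both documents belong to the same cluster.</p>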
    </sec>
    <sec id="sec-3">
      <title>Experimental results</title>
      <p>
        With the training dataset released and described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we performed
different runs of the method, in all cases evaluating the results against the groups proposed
by specialists using the FBcubed measure suggested in the competition. Table 1
summarizes the parameter configurations and feature representations used in the
training phase. Table 2 presents the final configurations used by the method for the
final evaluation.
      </p>
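      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Parameters and feature representations used in the training phase.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Features</th>
              <th>Character</th>
              <th>Words</th>
              <th>Lemma</th>
              <th>POS-Tagging</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>N-gram</td>
              <td>1-5</td>
              <td>1-5</td>
              <td>1-5</td>
              <td>1-5</td>
            </tr>
            <tr>
              <td>Language</td>
              <td>en, gr, du</td>
              <td>en, gr, du</td>
              <td>en</td>
              <td>en</td>
            </tr>
            <tr>
              <td>β</td>
              <td>0.05-0.5</td>
              <td>0.05-0.5</td>
              <td>0.05-0.5</td>
              <td>0.05-0.5</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>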
      <p>The features used for document representation are those listed in the Features
row. We tested character n-gram representations (n = 1, ..., 5) for English (en),
Greek (gr) and Dutch (du), and word n-gram representations (n = 1, ..., 5) for the
same three languages. Lemma and Part-of-Speech tagging
(POS-Tagging) representations were computed only for the English texts. For each configuration,
evaluations were performed varying β from 0.05 to 0.5.</p>
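      <p>The character and word n-gram representations can be illustrated with the sketch below, which builds the binary (presence-only) feature set of a document for a given n. It is a sketch under stated assumptions: the function names, the lowercasing and the whitespace tokenization are hypothetical choices, and the text does not state whether the values of n were used separately or combined, so one feature set per n is built here. Lemma and POS-tag n-grams would additionally require a language-specific tagger and are not shown.</p>
      <preformat><![CDATA[
```python
from typing import Set, Tuple


def char_ngrams(text: str, n: int) -> Set[str]:
    """Set of character n-grams occurring in the text (binary features)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams, using lowercased whitespace tokens (assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


print(sorted(char_ngrams("abcab", 2)))
# ['ab', 'bc', 'ca']
print(sorted(word_ngrams("The cat sat the cat", 2)))
# [('cat', 'sat'), ('sat', 'the'), ('the', 'cat')]

# One binary representation per value of n, for n = 1, ..., 5.
representations = {n: char_ngrams("a short paragraph", n) for n in range(1, 6)}
```
]]></preformat>
      <p>Repeated n-grams are stored only once, which matches the presence-only representation: "ab" occurs twice in "abcab" but contributes a single feature.</p>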
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>In conclusion, we emphasize that one of the essential aspects of our work is the
identification of features for document representation, and here we could try other
ideas presented in the literature. The algorithm usually obtains small groups, and
in future work we could evaluate the differences obtained with other
variants such as the β-connected and β-strongly compact algorithms. The β-connected
algorithm would obtain larger groups, while the β-strongly compact algorithm would obtain smaller and more
compact groups than those obtained in our work.</p>
      <p>We also plan to evaluate different feature weighting strategies and other comparison
functions proposed in the literature, and to compare the results achieved in this work
with ensemble clustering algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Anna</given-names>
            <surname>Vartapetiance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lee</given-names>
            <surname>Gillam</surname>
          </string-name>
          :
          <article-title>A Big Increase in Known Unknowns: from Author Verification to Author Clustering</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>1008</fpage>
          -
          <lpage>1013</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Castro</surname>
          </string-name>
          , Yaritza Adame, María Peláez Brioso, Rafael Muñoz:
          <article-title>Authorship Verification, combining Linguistic Features and Different Similarity Functions</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Bagnall</surname>
          </string-name>
          :
          <article-title>Authorship Clustering using Multi-headed Recurrent Neural Networks</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>791</fpage>
          -
          <lpage>804</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          , Volume
          <volume>60</volume>
          , Issue
          <issue>3</issue>
          , pages
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          , March
          <year>2009</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , Michael Tschuggnall, Ben Verhoeven, Walter Daelemans, Günther Specht, Benno Stein, Martin Potthast:
          <article-title>Clustering by Authorship Within and Across Documents</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>691</fpage>
          -
          <lpage>715</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gomaa</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Fahmy</surname>
          </string-name>
          .
          <article-title>A Survey of Text Similarity Approaches</article-title>
          .
          <source>International Journal of Computer Applications</source>
          (
          <issn>0975-8887</issn>
          ), Volume
          <volume>68</volume>
          , No.
          <issue>13</issue>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Mirco</given-names>
            <surname>Kocher</surname>
          </string-name>
          :
          <article-title>UniNE at CLEF 2016: Author Clustering</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>895</fpage>
          -
          <lpage>902</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Muharram</given-names>
            <surname>Mansoorizadeh</surname>
          </string-name>
          , Mohammad Aminian, Taher Rahgooy, Mehdy Eskandari:
          <article-title>Multi Feature Space Combination for Authorship Clustering</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>932</fpage>
          -
          <lpage>938</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Juola</surname>
          </string-name>
          .
          <article-title>Authorship Attribution</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          , Volume
          <volume>1</volume>
          , Issue
          <issue>3</issue>
          , March
          <year>2008</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Reynaldo</given-names>
            <surname>Gil-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>José Manuel</given-names>
            <surname>Badía-Contelles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aurora</given-names>
            <surname>Pons-Porrata</surname>
          </string-name>
          :
          <article-title>A Parallel Algorithm for Incremental Compact Clustering</article-title>
          .
          <source>Euro-Par</source>
          <year>2003</year>
          :
          <fpage>310</fpage>
          -
          <lpage>317</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering</article-title>
          . In:
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Valentin</given-names>
            <surname>Zmiycharov</surname>
          </string-name>
          , Dimitar Alexandrov, Hristo Georgiev, Yasen Kiprov, Georgi Georgiev, Ivan Koychev, Preslav Nakov:
          <article-title>Experiments in Authorship-Link Ranking and Complete Author Clustering</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>1018</fpage>
          -
          <lpage>1023</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Yunita</given-names>
            <surname>Sari</surname>
          </string-name>
          , Mark Stevenson:
          <article-title>Exploring Word Embeddings and Character N-Grams for Author Clustering</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>984</fpage>
          -
          <lpage>991</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>