<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi feature space combination for authorship clustering: Notebook for PAN at CLEF 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muharram Mansoorizadeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Aminiyan</string-name>
          <email>M.Aminiyan@Gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taher Rahgooy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahdy Eskandari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Engineering Department, Bu-Ali Sina University</institution>
          ,
          <addr-line>Hamedan</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Author Identification task at PAN 2016 consisted of three subtasks: authorship clustering, authorship links and author diarization. We developed machine learning approaches for two of these three subtasks. For the two authorship-related subtasks we created several sets of feature spaces. The challenge was to combine these feature spaces so that the machine learning algorithms can distinguish different authors across multiple feature spaces. For authorship clustering we combine these feature spaces and use a two-step approach. We then use the clustering results together with a new feature space to determine links between documents in the given problems.</p>
      </abstract>
      <kwd-group>
        <kwd>authorship clustering</kwd>
        <kwd>authorship link</kwd>
        <kwd>tf-idf</kwd>
        <kwd>feature space combination</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the following we provide a detailed description of our approaches to
the two subtasks of the Author Identification track of PAN 2016. A
problem instance is a tuple &lt;K, U, L&gt;, where K is a set of documents &lt;k1, k2,
k3, …, kn&gt; written by different authors, U is the genre of the documents
and L is an enumerated value specifying the language of the documents:
English, Dutch or Greek. All documents in a problem instance share the same
language and genre. This lab report is structured as follows. In section 2
we propose a number of features that characterize documents from
widely different points of view: character, word, part-of-speech, sentence
length and punctuation. We construct non-overlapping groups of homogeneous
features. In section 3 we present the two-step unsupervised method for the
authorship clustering task, which employs a graph-based approach and the standard
k-means++ algorithm. We then employ a new feature space to determine links
between documents. Finally, in section 4 we describe our results on the
training corpus and the final evaluation corpus of PAN 2016.</p>
    </sec>
    <sec id="sec-2">
      <title>Preprocessing</title>
      <p>We extract a number of different features from each document. For ease of
presentation, we group homogeneous features together, as described below.</p>
    </sec>
    <sec id="sec-3">
      <title>Features</title>
      <p>
        Word ngrams (WG): We convert all characters to lowercase and then
transform the document into a sequence of words. We treat white space,
punctuation characters and digits as word separators. We count all word
ngrams with n ≤ 3 and obtain a feature for each distinct word ngram
that occurs in the training set documents of a given language [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Note
that word unigrams and 2-grams are used in the clustering task and its
preprocessing, whereas word 3-grams are used only in the link computation
phase.
      </p>
      <p>
        To normalize this set of features we apply term frequency-inverse
document frequency (tf-idf) weighting within each set of documents (each problem) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>POS (part-of-speech) tag ngrams (PG)</title>
      <p>
        We apply a part-of-speech (POS) tagger to each document, which assigns words with similar syntactic
properties to the same POS tag. We count all POS ngrams with n ≤ 2 and obtain
a feature for each distinct POS ngram that occurs in the training set
documents of a given language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Sentence lengths (SL): We transform the document into a sequence of tokens,
a token being a sequence of characters separated by one or more blank spaces.
Next, we transform the sequence of tokens into a sequence of sentences, a
sentence being a sequence of tokens separated by any of the following characters:
., ;, :, !, ?. We count the number of sentences whose length in tokens is n, with
n ∈ {1, …, 15}, and obtain a feature for each value of n [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
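      <p>A minimal sketch of the sentence length features follows. It assumes that splitting on the delimiter characters first and then counting blank-separated tokens approximates the tokenize-then-split procedure described above; the function name sentence_length_features and the sample text are hypothetical.</p>

```python
import re


def sentence_length_features(text, max_len=15):
    """Count sentences whose length in tokens is n, for n = 1..max_len.

    Element n - 1 of the returned list holds the count for length n.
    """
    counts = [0] * max_len
    # Sentences are delimited by any of the characters . ; : ! ?
    for sentence in re.split(r"[.;:!?]", text):
        n_tokens = len(sentence.split())  # tokens are separated by blank spaces
        if n_tokens in range(1, max_len + 1):
            counts[n_tokens - 1] += 1
    return counts


features = sentence_length_features("I came. I saw; I conquered everything!")
print(features)
```

      <p>Empty fragments and sentences longer than max_len tokens are ignored, so each document maps to exactly 15 features.</p>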
      <p>
        Punctuation ngrams (MG): We transform the document by removing all
characters not included in the following set: {,, ., ;, :, !, ?, "}. The resulting
document thus consists of a (possibly empty) sequence of characters from that
set. We then count all character ngrams of the resulting document with n ≥ 2
and obtain a feature for each distinct punctuation ngram that occurs in
the training set documents of a given language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
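      <p>The punctuation ngram features can be sketched as follows. The text gives no upper bound on n, so the max_n parameter is an assumption introduced only to keep the example finite; the function name punctuation_ngrams and the sample text are likewise hypothetical.</p>

```python
from collections import Counter

# The punctuation characters that are kept; every other character is removed.
PUNCTUATION = set(',.;:!?"')


def punctuation_ngrams(text, min_n=2, max_n=3):
    """Count character ngrams over the punctuation-only version of text."""
    kept = "".join(ch for ch in text if ch in PUNCTUATION)
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(kept) - n + 1):
            counts[kept[start:start + n]] += 1
    return counts


counts = punctuation_ngrams('Well, yes; no... "maybe"?')
print(counts.most_common(3))
```

      <p>A document without punctuation reduces to the empty string and therefore produces no features, which matches the "possibly empty" case mentioned above.</p>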
      <p>
        To preprocess the documents we use the Python NLTK 3.0 package [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
After creating the feature space, we set aside the word 2-grams for the
authorship link task and use the remaining features for clustering. We assume that word
2-grams capture very specific relations, which makes them better suited for
determining the level of similarity between documents inside each cluster.
In the min-max normalization of Eq. (1), X<sub>old</sub> is the original value of feature X,
Max is the maximum value of feature X and Min is its minimum value.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Two-step unsupervised method</title>
      <p>To solve the task, we use a two-step method.</p>
    </sec>
    <sec id="sec-6">
      <title>Step 1: Determining the number of authors</title>
      <p>
        Because the number of authors is unknown, we first have to
determine the number of authors for each problem, that is, the
number of clusters for the clustering algorithm. The number of clusters
must be set by the developer based on the specifics of the problem, and choosing
a proper number is a challenging task. For this purpose we use a document similarity graph (DSG).
A DSG is an undirected graph that captures similarity
relations between documents based on their contents [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The nodes of this graph
are documents, and the edges between documents are defined by the
similarities and dissimilarities between them according to Eq. (2).
      </p>
    </sec>
    <sec id="sec-7">
      <title>Data normalization</title>
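      <p>After feature extraction, we normalize the value of each feature using min-max
normalization in order to remove the effect of differing feature scales:
        <disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[X_{new} = \frac{X_{old} - Min}{Max - Min}]]></tex-math></disp-formula>
      </p>
      <p>The edges of the document similarity graph are given by
        <disp-formula id="eq2"><label>(2)</label><tex-math><![CDATA[VS_{mat}(i,j) = \begin{cases} 1, & Z(i,j) \le \delta \\ 0, & Z(i,j) > \delta \end{cases} \qquad Z(i,j) = 1 - \frac{X_i \cdot Y_j}{\lVert X_i \rVert \, \lVert Y_j \rVert} = 1 - \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2} \, \sqrt{\sum_{k=1}^{n} y_k^2}}]]></tex-math></disp-formula>
      </p>
      <p>
        where x<sub>k</sub> and y<sub>k</sub> are the features of documents X<sub>i</sub> and Y<sub>j</sub>, respectively, and δ is
the threshold that determines whether a similarity relation exists between two
documents. In this paper, δ is set to 0.5. Z is one minus the cosine
similarity between the two documents [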
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
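      <p>The normalization and graph construction can be sketched with NumPy as follows. The sketch assumes the usual min-max form (value minus minimum, divided by the range) and draws an edge when the cosine similarity of two documents reaches the threshold; with δ = 0.5 this is the same as thresholding one minus the cosine similarity at δ. The function names and the toy feature matrix are hypothetical.</p>

```python
import numpy as np


def min_max_normalize(features):
    """Rescale every column of a (documents x features) array to [0, 1]."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant features map to 0, avoiding division by zero
    return (features - col_min) / col_range


def similarity_graph(features, delta=0.5):
    """Adjacency matrix with an edge wherever cosine similarity is at least delta."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero documents get similarity 0 to everything
    unit = features / norms
    cosine = unit @ unit.T
    adjacency = (cosine >= delta).astype(int)
    np.fill_diagonal(adjacency, 0)  # ignore self-similarity
    return adjacency


raw = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
normalized = min_max_normalize(raw)
adjacency = similarity_graph(normalized, delta=0.5)
print(normalized)
print(adjacency)
```

      <p>In the toy example the first document becomes the zero vector after normalization and stays isolated, while the other two point in the same direction and are linked.</p>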
      <p>The number of clusters is determined from the number of subgraphs
produced by the DSG. To find this number, we simply count the nodes whose
number of edges exceeds 65 percent of the number of documents. For example, if a
problem folder contains 100 documents, we count the nodes that have more than 65
edges.</p>
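      <p>The counting rule can be sketched as below, taking a 0/1 adjacency matrix of the DSG as input. The function name estimate_cluster_count is hypothetical, and falling back to one cluster when no node passes the threshold is an assumption added so that the result is always usable by a clustering algorithm.</p>

```python
def estimate_cluster_count(adjacency, ratio=0.65):
    """Count nodes whose number of edges exceeds ratio * number of documents."""
    n_docs = len(adjacency)
    threshold = ratio * n_docs
    hubs = sum(1 for row in adjacency if sum(row) > threshold)
    # Assumption: use at least one cluster when no node qualifies.
    return max(hubs, 1)


# Hypothetical graph of four documents: nodes 0 and 1 are linked to every
# other node (3 edges each), nodes 2 and 3 only to nodes 0 and 1 (2 edges each).
graph = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
k = estimate_cluster_count(graph)
print(k)
```

      <p>With four documents the threshold is 2.6 edges, so only the two fully connected nodes are counted.</p>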
    </sec>
    <sec id="sec-8">
      <title>Step 2: clustering and computing links</title>
      <p>
        After calculating the number of clusters, we use the k-means++ algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], as implemented in the scikit-learn
Python package, to perform the clustering task.
      </p>
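      <p>A minimal sketch of this clustering step with scikit-learn is shown below. The feature matrix, the fixed random seed and the n_init value are illustrative assumptions, not settings reported in this paper.</p>

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_documents(features, n_clusters, seed=0):
    """Cluster documents with k-means using k-means++ seeding."""
    model = KMeans(
        n_clusters=n_clusters,
        init="k-means++",   # careful seeding of the initial centroids
        n_init=10,          # assumption: keep the best of 10 restarts
        random_state=seed,  # assumption: fixed seed for repeatable output
    )
    return model.fit_predict(features)


# Hypothetical normalized features of four documents forming two groups.
features = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
labels = cluster_documents(features, n_clusters=2)
print(labels)
```

      <p>The cluster ids themselves are arbitrary; only the grouping of documents matters for the evaluation.</p>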
      <p>When clustering is complete, we collect the result and compute a simple
similarity measure within each cluster. The similarity is based on word 3-gram
features and the cosine similarity of Eq. (2).</p>
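      <p>The link computation can be sketched as follows: every pair of documents that shares a cluster is scored by the cosine similarity of its link features, and the pairs are returned best first. The function name rank_links and the toy vectors are hypothetical; in the approach described above the link features would be the word 3-gram features.</p>

```python
from itertools import combinations

import numpy as np


def rank_links(link_features, labels):
    """Return (score, i, j) for document pairs in the same cluster, best first."""
    links = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] != labels[j]:
            continue  # links are only computed inside a cluster
        x, y = link_features[i], link_features[j]
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        score = float(x @ y / denom) if denom > 0 else 0.0
        links.append((score, i, j))
    return sorted(links, reverse=True)


# Hypothetical link features for four documents; the last one is alone in its cluster.
link_features = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
ranked = rank_links(link_features, labels=[0, 0, 0, 1])
for score, i, j in ranked:
    print(i, j, round(score, 3))
```

      <p>Documents 0 and 2 have identical feature vectors and therefore head the ranking; document 3 produces no link because it is alone in its cluster.</p>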
    </sec>
    <sec id="sec-9">
      <title>Results</title>
      <p>
        To evaluate our work, we use the training corpus and the final
evaluation corpus of PAN 2016. Each dataset consists of a set of problems; each
problem contains a different number of documents in one language
(English, Dutch or Greek) and one of two genres (newspaper articles or
reviews). The clustering output is evaluated according to the BCubed
F-score [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the ranking of authorship links is evaluated according
to Mean Average Precision (MAP) [8]. The
software was executed on the TIRA platform [9].
      </p>
      <p>Table 1. Results on the training dataset, by problem and language (English, Dutch, Greek).</p>
      <p>Like Table 1, Table 2, which reports the results on the test dataset, shows a high
BCubed recall for most problem sets, in contrast with BCubed precision,
which is not high. The results on the test dataset are nevertheless better than those on the
training data, which indicates that the system generalizes to new problems. The
major defect of the system, a BCubed precision lower than its recall, still remains.</p>
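      <p>To make the gap between BCubed recall and precision concrete, the following sketch computes both measures from their standard per-document definitions. The function name bcubed and the toy labelings are hypothetical and the numbers are not results from this paper; the example merely shows that merging two authors into one cluster keeps recall at its maximum while halving precision.</p>

```python
def bcubed(predicted, truth):
    """Return (precision, recall, f_score) for two labelings of the same documents."""
    n = len(predicted)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if predicted[j] == predicted[i]]
        same_author = [j for j in range(n) if truth[j] == truth[i]]
        # Documents that share both the cluster and the author of document i.
        correct = sum(1 for j in same_cluster if truth[j] == truth[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_author)
    precision /= n
    recall /= n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical problem: two authors with two documents each, all put in one cluster.
p, r, f = bcubed(predicted=[0, 0, 0, 0], truth=["a", "a", "b", "b"])
print(p, r, round(f, 3))
```

      <p>Underestimating the number of clusters therefore produces exactly the pattern of high recall and low precision discussed above.</p>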
      <p>The complete evaluation can be found in the task overview [10].</p>
      <p>Table 2. Results on the test dataset, by problem and language.</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusion and future works</title>
      <p>In this work we propose a two-step unsupervised method for
author clustering. Our approach combines different feature spaces
and uses them to cluster documents by author. We then rank
document links by cosine similarity, using a new set of features that
differs from the set used for clustering.</p>
      <p>The results show that our method achieves good BCubed recall, but its major
weakness is BCubed precision. The problem may stem
from the selection of the number of clusters or from the feature space. Hence, as future work,
we suggest investigating better ways of selecting the clustering parameters.
We also suggest testing the task with more complex clustering
methods that do not require such parameter selection, for example self-organizing maps
(SOM).
      </p>
      <ref-list>
        <ref id="ref8">
          <mixed-citation>8. Manning, C. D., Raghavan, P., &amp; Schütze, H. (2008). Introduction to Information Retrieval (Vol. 1, No. 1, p. 496). Cambridge: Cambridge University Press.</mixed-citation>
        </ref>
        <ref id="ref9">
          <mixed-citation>9. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014).</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>10. Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by Authorship Within and Across Documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016).</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tuarob</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouchard</surname>
            ,
            <given-names>L. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>A generalized topic modeling approach for automatic document annotation</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ),
          <fpage>111</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Bartoli</surname>
          </string-name>
          , Alex Dagri, Andrea De Lorenzo, Eric Medvet, and
          <string-name>
            <given-names>Fabiano</given-names>
            <surname>Tarlao</surname>
          </string-name>
          .
          <article-title>An Author Verification Approach Based on Differential Features-Notebook for PAN at CLEF 2015</article-title>
          . In Linda Cappellato, Nicola Ferro, Gareth Jones, and Eric San Juan, editors,
          <source>CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 8-11 September, Toulouse, France,
          <year>2015</year>
          .
          <publisher-name>CEUR-WS.org</publisher-name>
          . ISSN 1613-0073.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <source>Natural Language Processing with Python</source>
          .
          <publisher-name>O'Reilly Media, Inc.</publisher-name>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>B.</given-names>
            <surname>Seah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Bhowmick</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>PRISM: Concept-preserving Social Image Search Results Summarization</article-title>
          ,
          <source>in Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval - SIGIR '14</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>746</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Deza</surname>
            ,
            <given-names>Michel Marie</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Elena</given-names>
            <surname>Deza</surname>
          </string-name>
          .
          <source>Encyclopedia of Distances</source>
          . Springer Berlin Heidelberg,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Arthur</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vassilvitskii</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>k-means ++ : The Advantages of Careful Seeding</article-title>
          .
          <source>Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms</source>
          ,
          <fpage>1027</fpage>
          -
          <lpage>1035</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information retrieval</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>