<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Diarization Using Cluster-Distance Approach Notebook for PAN at CLEF 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdul Sittar</string-name>
          <email>abdulsittar72@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Hafiz Rizwan Iqbal</string-name>
          <email>rizwan.iqbal@ciitlahore.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rao Muhammad Adeel Nawab</string-name>
          <email>adeelnawab@ciitlahore.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>COMSATS Institute of Information Technology, M.A. Jinnah Campus</institution>
          ,
<addr-line>Off-Raiwind Road, Lahore</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Author Diarization is a new task introduced at PAN'16 to identify the portion(s) of text within a document that were written by different authors. This paper presents our proposed approach for the author diarization task. Various stylistic features, including lexical features, were used to uniquely identify an author. Furthermore, the ClustDist method was used to find anomalous text within a single document. Finally, clusters were generated using the simple k-means clustering algorithm. Experiments were performed on both the training and testing data sets. We observed that promising results can be achieved by varying the length of the text fragments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Plagiarism detection is the task of determining text reuse in a document or a collection of documents. It comes in two types: 1) Extrinsic Plagiarism Detection, which focuses on finding the source documents that were used to generate a suspicious document. It is a complex task because reuse may occur at various levels, ranging from words to sentences, paraphrasing of text and plagiarism of ideas; and 2) Intrinsic Plagiarism Detection, or Author Diarization, which focuses on identifying whether a document was written by a single author or by multiple authors.</p>
      <p>
        The term Author Diarization comes from the domain of speaker diarization, which focuses on clustering and identifying the various speakers in a single audio speech signal by analyzing the frequency range of the speakers' voices, e.g. a class discussion on a particular topic or a conference call on mobile phones. Similarly, author diarization deals with a written document instead of an audio conversation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such documents may result from collaborative work or from plagiarism. The objective is to identify and cluster the different authors within a single document on the basis of the written text.
      </p>
      <p>
        In the literature, stylometric features such as function words, part-of-speech tags, spelling mistakes, average sentence length and average word length have been used for the Author Diarization task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. David Guthrie [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced the term authorship anomalies to denote portions of text that deviate from the original context, as if written by another author.
      </p>
      <p>
        PAN1 is a competition held as part of the CLEF conference. The PAN'16 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] competition comprises three different tasks, namely Author Identification, Author Profiling and Author Obfuscation. Author Identification is further divided into Author Clustering and Author Diarization. The Author Diarization task is to identify and cluster the portions of text written by each individual author in a given document. Moreover, to explore different variants of the parent task, it is divided into three sub-problems: 1) traditional intrinsic plagiarism detection, 2) diarization with a given number (n) of authors and 3) diarization with an unknown number of authors. Participants had to develop a software for the given task and deploy it on TIRA (an engine for the evaluation of software) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. All training and testing documents are provided in English only.
      </p>
      <p>
        For each subtask, all text files were read from the respective training data set. Each file's text was split into sentences, and 15 lexical features (see Table 1) were computed for each sentence. Using the ClustDist [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] anomaly detection technique, a feature vector was generated which contains the average distance of each sentence from all the others. On the basis of these distances, training models were created using WEKA and saved for use at prediction time. Our software was deployed on TIRA and evaluated on the testing corpora provided by PAN; it produced promising results on both the training and testing corpora.
      </p>
      <p>The rest of this paper is organized as follows: the methodology is explained in Section 2, the experimental setup and the results of the training and testing phases are discussed in Sections 3 and 4 respectively, and Section 5 provides the conclusion and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <p>
        A wide variety of algorithms has been reported in the literature for authorship identification [
        <xref ref-type="bibr" rid="ref2 ref7">7, 2</xref>
        ], including cluster distance, counting words, stylometric features, syntactic features, lexical features, and content-specific and content-free features.
      </p>
      <sec id="sec-2-1">
        <title>Lexical Features</title>
        <p>
          A language-independent approach to author identification considers any text (i.e. sentences, paragraphs, documents) as a sequence of tokens2. On the basis of these tokens, various statistics (e.g. average word length, average sentence length, character count) can be drawn from any text in any language [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Table 1 shows the lexical features that our software uses to capture the unique writing style of an author.
1 http://pan.webis.de/ Last visited: 25-05-2016
2 A token could be a word, character, punctuation mark or number.
ClustDist [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a straightforward technique to compute the average distance from one piece of text (i.e. a sentence) to all other pieces of text. Consider a document D with n sentences. First, each sentence i is characterized by computing the p lexical features (see Table 1), generating a feature vector Vi for that sentence. For our experiments, a matrix V of order n x p was created, in which each row is the feature vector of one sentence. ClustDist is computed using Equation 1, where d is the distance between any pair of vectors. The resulting score for each sentence generates a ranking describing how different the sentence is from all other sentences in the given document. To generate clusters, we used the simple k-means algorithm [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For a detailed insight into the proposed approach, see Section 2.2.
        </p>
        <p>ClustDist(x, V) = (1/n) ∑k d(x, vk), where the sum runs over the n feature vectors vk of V (1)</p>
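        <p>The ClustDist computation in Equation 1 can be sketched with NumPy as follows; Euclidean distance is assumed for d, since the description leaves the choice of distance open:

```python
import numpy as np

def clust_dist(V):
    """For each row (sentence feature vector) of V, return the average
    Euclidean distance to all rows, as in Equation 1."""
    n = V.shape[0]
    # Pairwise differences via broadcasting, then Euclidean distances.
    diffs = V[:, None, :] - V[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=2))
    # Average each sentence's distance to the others (self-distance is 0).
    return d.sum(axis=1) / n

# Toy matrix: 3 sentences, 2 features each.
V = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
scores = clust_dist(V)
# The middle sentence is farthest from the rest on average.
assert scores[1] == max(scores)
```

The resulting scores give the ranking described above: the higher a sentence's score, the more it deviates from the rest of the document.</p>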
      </sec>
      <sec id="sec-2-2">
        <title>Example: Step-by-Step Author Diarization by ClustDist Approach</title>
        <p>This example demonstrates author diarization process from an input text to
output clusters.</p>
        <p>- Step 1: Read Raw Input Text
"In what follows, we give a detailed overview of Barack Obama's Family. We shed light on himself, his immediate and extended family, including maternal and paternal relations. Moreover, we give insights into the relations of Michelle Obama, Barack Obama's wife, as well as some distant relations of both. Barack Obama Barack Hussein Obama II is the 44th and current President of the United States. He is the first African American to hold the office. Obama was the junior United States Senator from Illinois from 2005 until he resigned following his election to the presidency. Obama is a graduate of Columbia University and Harvard Law School."
- Step 2: Break Down Text into Sentences
1. In what follows, we give a detailed overview of Barack Obama's Family.
2. We shed light on himself, his immediate and extended family, including maternal and paternal relations.
3. Moreover, we give insights into the relations of Michelle Obama Barack</p>
        <p>Obama's wife, as well as some distant relations of both.
4. Barack Obama Barack Hussein Obama II is the 44th and current President of the United States.
5. He is the first African American to hold the office.
6. Obama was the junior United States Senator from Illinois from 2005 until he resigned following his election to the presidency.</p>
        <p>7. Obama is a graduate of Columbia University and Harvard Law School.
- Step 3: Lexical Features Computation
- Step 4: Distance Calculation
For each of the three subtasks (see Section 1), different training data sets are provided in PAN'16. Each training data set contains three files: a ".txt" file containing the actual text written by multiple authors, a ".meta" file containing a JSON object which gives the text language, the problem type (plagiarism or diarization) and the number of authors, and a ".truth" file which provides the required output for the given text file.</p>
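        <p>Steps 2 and 3 above can be sketched as follows. The naive punctuation-based sentence splitter and the four feature names below are illustrative stand-ins, not the paper's actual splitter or its 15 features from Table 1:

```python
import re

def lexical_features(sentence):
    """Compute a few illustrative lexical features for one sentence
    (stand-ins for the paper's 15 features from Table 1)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [
        len(sentence),                        # characters per sentence
        len(words),                           # words per sentence
        avg_word_len,                         # average word length
        sum(c.isdigit() for c in sentence),   # digit count
    ]

text = ("In what follows, we give a detailed overview of Barack Obama's "
        "Family. We shed light on himself, his immediate and extended family.")
# Step 2: naive split on sentence-terminal punctuation.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
# Step 3: one feature vector per sentence.
vectors = [lexical_features(s) for s in sentences]
assert len(vectors) == 2 and len(vectors[0]) == 4
```

Each vector then becomes one row of the matrix V fed to the distance calculation in Step 4.</p>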
        <p>To generate the trained models, all text files were read from the corpus in sequence. Each file was broken down into sentences, all of the lexical features (see Section 2.1) were computed for each sentence, and a feature vector containing the distances of each sentence from all other sentences was created using the ClustDist technique (see Section 2.2). WEKA3 (a machine learning toolkit) was used to generate and save the training models by producing an ".arff" file from the resultant distance vector. For cluster generation, we used the simple k-means algorithm.
3 http://www.cs.waikato.ac.nz/ml/weka/ Last visited: 25-05-2016</p>
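        <p>The hand-off to WEKA happens through an ARFF file. A minimal sketch of writing the ClustDist scores in that format follows; the relation and attribute names are illustrative, not taken from the paper:

```python
def write_arff(path, scores, relation="clustdist"):
    """Write the ClustDist scores as a one-attribute ARFF file so that
    WEKA can consume them (relation/attribute names are illustrative)."""
    lines = ["@RELATION " + relation,
             "",
             "@ATTRIBUTE avg_distance NUMERIC",
             "",
             "@DATA"]
    # One instance per sentence: its average distance to all others.
    lines += ["%.6f" % s for s in scores]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

write_arff("train.arff", [1.667, 3.333, 1.667])
```

WEKA's SimpleKMeans can then be run directly on the resulting file.</p>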
      </sec>
      <sec id="sec-2-3">
        <title>Software Evaluation</title>
        <p>PAN'16 also provided a data set to evaluate the developed software, but this data set is not publicly available because it will be released after the competition. After training our software on the training data set, we deployed it on TIRA for the evaluation phase and executed the evaluation software on it. This time the software takes its input files from the testing corpus. For each input file, a "test.arff" file was generated using the ClustDist feature vector. Finally, we obtained clusters with respect to each subtask as per the requirements of PAN'16: for subtask 1, only two clusters were generated, one for the main author and a second for the rest of the authors; for subtask 2, clusters were created as per the number required in the ".meta" file; and for subtask 3, a random number of clusters was created because the number of authors was not given. As a final step, the clusters generated from the evaluation corpus were compared with the existing trained models to predict new instances.</p>
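        <p>The per-subtask choice of cluster count described above can be sketched as follows; the function name and the random range for subtask 3 are assumptions, as the paper does not state the range it used:

```python
import random

def clusters_for_subtask(problem_type, num_authors=None, seed=0):
    """Pick the number of k-means clusters per PAN'16 subtask, following
    the rules described above (range for subtask 3 is an assumption)."""
    if problem_type == "plagiarism":      # subtask 1: main author vs. rest
        return 2
    if num_authors is not None:           # subtask 2: n given in ".meta"
        return num_authors
    # Subtask 3: number of authors unknown, so pick a random cluster count.
    return random.Random(seed).randint(2, 10)

assert clusters_for_subtask("plagiarism") == 2
assert clusters_for_subtask("diarization", num_authors=4) == 4
```

The chosen k is then passed to the simple k-means clustering step.</p>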
      </sec>
      <sec id="sec-2-4">
        <title>Evaluation Measures</title>
        <p>
          PAN'16 recommended different evaluation measures for subtasks 1, 2 and 3. For subtask 1, the micro-recall, micro-precision, macro-recall, macro-precision, micro-F and macro-F [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] measures were used to evaluate the performance of the system. For subtasks 2 and 3, the BCubed recall, BCubed precision and BCubed F measures were used [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
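        <p>A minimal sketch of the BCubed measures used for subtasks 2 and 3, for flat clusterings given as item-to-cluster mappings; note that PAN's actual evaluation also handles overlapping author segments, which this sketch does not:

```python
def bcubed(pred, truth):
    """BCubed precision, recall and F for flat clusterings given as
    dicts mapping each item to a cluster id."""
    items = list(pred)
    def avg(score_of):
        return sum(score_of(e) for e in items) / len(items)
    def prec(e):
        # Of the items clustered with e, how many truly share e's author?
        same_pred = [o for o in items if pred[o] == pred[e]]
        return sum(truth[o] == truth[e] for o in same_pred) / len(same_pred)
    def rec(e):
        # Of the items truly sharing e's author, how many are clustered with e?
        same_true = [o for o in items if truth[o] == truth[e]]
        return sum(pred[o] == pred[e] for o in same_true) / len(same_true)
    p, r = avg(prec), avg(rec)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# A perfect clustering scores 1.0 on all three measures.
p, r, f = bcubed({"a": 0, "b": 0, "c": 1}, {"a": 0, "b": 0, "c": 1})
assert (p, r, f) == (1.0, 1.0, 1.0)
```

Merging all items into a single cluster keeps BCubed recall at 1.0 but lowers BCubed precision, which is why both measures are reported together.</p>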
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Analysis</title>
      <p>This section presents the results generated using the training and testing data. For each of the three subtasks, the size of the text fragments was increased to obtain better results; the experiments showed an improvement in results up to a point. Section 4.1 discusses the results on the training data and Section 4.2 elaborates the results on the testing data for each of the three subtasks.</p>
      <sec id="sec-3-1">
        <title>Results: Training Phase</title>
        <p>For subtask 1, the results are shown in Table 4. The highest results for all evaluation measures are obtained for sentences of length 7. Table 5 and Table 6 show the results for subtask 2 and subtask 3 respectively. Sentences of length 5 gave the best results for subtask 2, and in subtask 3 sentences of the same length showed the highest values of the required evaluation measures.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results: Testing Phase</title>
        <p>Based upon the results on the training data set, we used only those sentence lengths which gave the highest results for each subtask, because the training and testing data sets contain very similar types of text. Therefore, for subtask 1 we used sentence length 7, and for subtasks 2 and 3 sentence length 5 was used. Table 7 shows the results for subtask 1, while the results for subtasks 2 and 3 are shown in Table 8.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper we have discussed our participation in the PAN'16 Author Diarization task. We developed a software for all three subtasks. Our proposed approach is based on a language-independent technique, lexical features, to uniquely identify an author from his/her written text. Fifteen lexical features were used in combination with the ClustDist approach, which we used to detect anomalous text within a document. Experiments were performed on the training and testing data sets of PAN'16. Different text fragment sizes were used to improve the results in the training phase, while in the testing phase only those fragment sizes were used which gave the best results in the training phase. We observed that an improvement in results can be obtained by changing the fragment sizes. As future work, content-based, topic-based and stylistic features in combination with the ClustDist method will be explored.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information retrieval 12</source>
          (
          <issue>4</issue>
          ),
          <volume>461</volume>
          –
          <fpage>486</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Guthrie</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Unsupervised Detection of Anomalous Text</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Sheffield (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>MacQueen</surname>
          </string-name>
          , J., et al.:
          <article-title>Some methods for classification and analysis of multivariate observations</article-title>
          .
          <source>In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability</source>
          . vol.
          <volume>1</volume>
          , pp.
          <volume>281</volume>
          –
          <fpage>297</fpage>
          . Oakland, CA, USA. (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Miro</surname>
            ,
            <given-names>X.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bozonnet</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fredouille</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedland</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Speaker diarization: A review of recent research</article-title>
          . Audio, Speech, and Language Processing,
          <source>IEEE Transactions on 20(2)</source>
          ,
          <volume>356</volume>
          –
          <fpage>370</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In: Kanoulas,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Toms</surname>
          </string-name>
          , E. (eds.)
          <article-title>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</article-title>
          .
          <source>5th International Conference of the CLEF Initiative (CLEF 14)</source>
          . pp.
          <volume>268</volume>
          –
          <fpage>299</fpage>
          . Springer, Berlin Heidelberg New York (
          <year>Sep 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Barrón-Cedeño,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.:</surname>
          </string-name>
          <article-title>An evaluation framework for plagiarism detection</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on computational linguistics: Posters</source>
          . pp.
          <volume>997</volume>
          –
          <fpage>1005</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <volume>538</volume>
          –
          <fpage>556</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Clustering by Authorship Within and Across Documents</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>