<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Style Breach Detection: An Unsupervised Detection Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jamal Ahmad Khan</string-name>
          <email>J_Ahmadkhan@Yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Software Engineering, International Islamic University</institution>
          ,
          <addr-line>Islamabad</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>This paper deals with the sub-task of PAN 2017 Author Identification that is to detect style breaches for an unknown number of authors within a single document in English. The presented model is an unsupervised approach that detects style breaches and marks text boundaries on the basis of different stylistic features. The model uses some classical stylistic features such as POS analysis and sentence lexical analysis. In addition, some new features are proposed, namely common English word frequencies within sentence text, sentence expression and sentence attitude. The new features may not be directly linked to an author's style of writing but rather to the subject or topic of the sentence under analysis. Moreover, the model uses a sentence window for style detection. The sentence window may be extended to neighboring sentences during its unsupervised analysis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Stylometry is an important tool in the field of digital text forensics, especially in
cases where we have unidentified or dubious text documents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] written by one or
more authors. These documents have no external link, tool or repository to
prove which text passage relates to which author. In other words, we use
stylometric approaches when we have to ascertain whether the claimed authorship of
a text document actually holds, in circumstances where we do not have any external
verification resources.
      </p>
      <p>
        Stylometric approaches generally achieve higher accuracy for long documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
because longer documents contain more text to reveal the stylistic features of their authors, as in
intrinsic plagiarism detection [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. For short documents or texts, e.g. social media such as Twitter, where there may be
only a few sentences by each author, stylometric approaches may give less accurate
results, although much work using stylometric models has been done on scam emails [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], cyber-crimes
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and fake service provision reviews [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        One way of using a stylometric approach for author attribution and author
profiling is to train a computer application on the specific writing style of a
specific author over a number of documents. However, detecting style breaches
within a document without knowing the exact number of authors in advance
is a difficult task and an objective of ongoing research.
Detection of style breaches is related to text segmentation, where text boundaries are
marked by detecting changes of topic [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The presented model uses an unsupervised classification approach to detect and mark
passage boundaries in given documents on the basis of style breaches. A combination
of well-known stylometric features, such as syntactic, lexical and content-specific
features [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], is used together with features such as ordinary word frequency, sentence expression
and sentence attitude, which may relate to the topic of the text rather than
directly to the author’s style. This approach may nevertheless be useful in cases where
we want to relate one sentence to its neighboring sentences and thus detect exact
passage boundaries within a given document.
      </p>
      <p>This model is also an example of how a text as small as a sentence within a
document may help in finding its related sentences on the basis of stylometric
and other parameters, and thus in figuring out the passage boundaries of an unknown
number of authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Dataset</title>
      <p>
        The training dataset of PAN at CLEF 2017 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for the task of style breach
detection, under the main task of author identification, was used. The dataset contained about 187
English text documents of different lengths on different topics such as
biography, politics, travel and hotels. Along with each text document, a truth file was
provided which contained the exact character positions of style breach
occurrences within that document; the topic of a document, however, remains unchanged.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 System Methodology</title>
      <p>The presented model uses different types of classical stylometric methods along
with some new methods in order to find text borders where a style breach is identified.
The system uses sentences as the text segmentation unit. The sentence window keeps
extending over its neighboring sentences until a style breach is detected. The following
methodology steps are used by the system to find style breaches.</p>
      <list list-type="bullet">
        <list-item><p>Words lists preparation</p></list-item>
        <list-item><p>Text segmentation into sentences</p></list-item>
        <list-item><p>Sentence window based syntactic analysis</p></list-item>
        <list-item><p>Sentence window based lexical analysis</p></list-item>
        <list-item><p>Content based analysis of sentence window</p>
          <list list-type="bullet">
            <list-item><p>Sentence window expression labeling</p></list-item>
            <list-item><p>Sentence window attitude labeling</p></list-item>
          </list>
        </list-item>
        <list-item><p>Style breach calculation</p></list-item>
      </list>
    </sec>
    <sec id="sec-4">
      <title>3.1 Words Lists Preparation</title>
      <p>
        Different lists of words expressing specific moods or human feelings were prepared from different internet sources [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10,
11, 12, 13</xref>
        ]. Seven expression word lists
were used: anger, confusion, curiosity, urgency, satisfaction,
inspiration and happiness; each list comprised about 200 words. One
reason for choosing only these seven expressions was the availability of suitable
expressive words for them in internet sources. The second reason was
to use a limited set of expressions that may express human feelings while writing a
text. More expressions may be included in future research. Two additional
word lists of about 500 words each, reflecting positive and negative attitudes
[
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ], were included. Examples of these expression and attitude lists are shown in
table 1 and table 2.
      </p>
      <p>These lists became part of the model and are used for labeling sentences in
the next methodology steps.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Text Segmentation into Sentences</title>
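      <p>Each individual document D in the repository was segmented into sentences <inline-formula><tex-math>s_i</tex-math></inline-formula>,
<inline-formula><tex-math>s_{i+1}</tex-math></inline-formula>, <inline-formula><tex-math>s_{i+2}</tex-math></inline-formula>, <inline-formula><tex-math>s_{i+3}</tex-math></inline-formula>, …, <inline-formula><tex-math>s_n</tex-math></inline-formula>. A simple algorithm was used to break a document into
an array of sentences. It traverses each character of document D from the start
until either of the two characters ‘.’ or ‘?’ is encountered, which indicates a sentence
ending. The sentence is extracted and the algorithm continues from the next character as
the start of the next sentence.</p>
      <p><disp-formula><label>(1)</label><tex-math>D = s_i + s_{i+1} + s_{i+2} + s_{i+3} + \ldots + s_n</tex-math></disp-formula></p>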
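      <p>The splitting rule described above can be sketched in Python as follows. This is a minimal illustration and not the original implementation: the function name split_sentences is hypothetical, and keeping trailing text without a terminator as a final sentence is an assumption. Because only ‘.’ and ‘?’ are treated as sentence endings, abbreviations such as “e.g.” are split as well.</p>
      <preformat><![CDATA[
```python
def split_sentences(document):
    """Split a document at '.' or '?', the two terminators named in the text."""
    sentences = []
    start = 0
    for pos, ch in enumerate(document):
        if ch in ".?":
            # Extract the sentence including its terminator.
            sentence = document[start:pos + 1].strip()
            if sentence:
                sentences.append(sentence)
            # Continue from the next character as the start of the next sentence.
            start = pos + 1
    # Assumption: text after the last terminator is kept as a final sentence.
    tail = document[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Is it late? Yes. Go"))
```
]]></preformat>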
      <p>Here i is the starting index of the sentences and n is the total number of
sentences in D. The first three sentences of any document D form the starting
window <inline-formula><tex-math>W_j</tex-math></inline-formula> (j = 1). This window may or may not extend and merge with
the next adjacent sentence windows (two at a time), depending on further analysis.
Adjacent sentence windows also share a boundary sentence, as shown in
equations 2 and 3.</p>
      <p><disp-formula><label>(2)</label><tex-math>W_j = s_i + s_{i+1} + s_{i+2}</tex-math></disp-formula></p>
      <p><disp-formula><label>(3)</label><tex-math>W_{j+1} = s_{i+2} + s_{i+3} + s_{i+4}</tex-math></disp-formula></p>
      <p>The sentence <inline-formula><tex-math>s_{i+2}</tex-math></inline-formula> is the common boundary sentence of the first and second windows
<inline-formula><tex-math>W_j</tex-math></inline-formula> and <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula>. This common sentence between two adjacent windows increases the
similarity index when both windows are compared for a possible merger or extension.</p>
      <p>As discussed above, n is the total number of sentences in any document and each
sentence window W can initially hold only three sentences (as shown in equations 2
and 3); hence the maximum number of text windows in any document is as
shown in equation 4.</p>
      <p>Max. Windows (m) = (4)</p>
      <p>Consider, for example, j = 1, so the first two sentence windows <inline-formula><tex-math>W_1</tex-math></inline-formula> and <inline-formula><tex-math>W_2</tex-math></inline-formula> are
chosen for further analysis. The next steps performed by the model are as follows.</p>
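      <p>The initial window layout can be sketched as below, assuming every window holds three sentences and shares its last sentence with the next window. The function name build_windows and the use of sentence indices instead of sentence text are illustrative choices. Under this assumption m windows cover 2m + 1 sentences, so at most (n − 1)/2 windows (rounded down) fit into n sentences; this count follows from the sketch and is not taken from equation 4.</p>
      <preformat><![CDATA[
```python
def build_windows(sentences, size=3):
    """Return index lists of overlapping windows that share one boundary sentence."""
    step = size - 1  # the last sentence of a window is the first of the next
    windows = []
    for start in range(0, len(sentences) - size + 1, step):
        windows.append(list(range(start, start + size)))
    return windows


if __name__ == "__main__":
    for window in build_windows(["s1", "s2", "s3", "s4", "s5", "s6", "s7"]):
        print(window)
```
]]></preformat>
      <p>Only the starting layout is covered here; the merging and restructuring of windows happens in the later steps.</p>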
      <sec id="sec-5-1">
        <title>1. Sentence Window Based Syntactic Analysis</title>
        <p>Text in both adjacent windows is converted to its respective part-of-speech (POS) tags, one for each
word present in the texts, as shown in table 4.</p>
      </sec>
      <sec id="sec-5-2">
        <title>2. Sentence Window Based Lexical Analysis</title>
        <p>At this step, the model performs a lexical analysis for both text windows. In this analysis the following features
are extracted:</p>
        <sec id="sec-5-2-1">
          <title>Most frequent alphanumeric and non-space character</title>
          <p>The most frequent alphanumeric, non-space character in the text window is extracted,
e.g. ‘e’ in both text windows shown in table 4.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Most frequent non-alphanumeric and non-space character</title>
      <p>The most frequent non-alphanumeric, non-space character in the
text window is extracted, e.g. ‘,’ in both text windows <inline-formula><tex-math>W_j</tex-math></inline-formula>
and <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula>.</p>
      <p>The most frequent word in the text window is extracted, where i in the
equation below is the index of word w, e.g. “in” and “of” in
the two text windows respectively, as mentioned in table 4. The frequency of
each word is calculated as shown in equation 5.</p>
      <sec id="sec-6-1">
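        <title>Word Frequency</title>
        <p><disp-formula><label>(5)</label><tex-math>\text{Word Frequency}(w_i) = \sum \text{occurrences of } w_i \text{ in the text window}</tex-math></disp-formula></p>
        <sec id="sec-6-1-1">
          <title>Character to Space Ratio</title>
          <p>The character to space ratio is calculated for each text window as
shown in equation 6.</p>
        </sec>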
      </sec>
      <sec id="sec-6-2">
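        <title>Character to Space Ratio (cr)</title>
        <p><disp-formula><label>(6)</label><tex-math>cr = \frac{\text{number of characters in the text window}}{\text{number of spaces in the text window}}</tex-math></disp-formula></p>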
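        <p>The lexical features listed above can be sketched as follows. The function name lexical_features, the lower-casing of characters and words, the punctuation stripped from words, and counting only non-space characters in the ratio are all assumptions made for this illustration; the original implementation may differ.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def lexical_features(text):
    """Extract the four lexical features of one text window."""
    chars = [c for c in text if not c.isspace()]
    alnum = Counter(c.lower() for c in chars if c.isalnum())
    other = Counter(c for c in chars if not c.isalnum())
    # Assumption: words are whitespace-separated tokens without edge punctuation.
    words = Counter(w.strip(".,;:!?\"'()").lower() for w in text.split())
    words.pop("", None)
    spaces = text.count(" ")
    return {
        "top_alnum": alnum.most_common(1)[0][0] if alnum else None,
        "top_other": other.most_common(1)[0][0] if other else None,
        "top_word": words.most_common(1)[0][0] if words else None,
        # Assumption: ratio of non-space characters to spaces.
        "char_space_ratio": len(chars) / spaces if spaces else 0.0,
    }


if __name__ == "__main__":
    print(lexical_features("a cat, a hat, a bat"))
```
]]></preformat>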
      </sec>
      <sec id="sec-6-3">
        <title>3. Content Based Analysis of Sentence Window</title>
        <p>At this step the commonality index
of each window is calculated using the list L of 5000 common words. Let <inline-formula><tex-math>w_i</tex-math></inline-formula>
be a common word existing in both L and any text window, where i specifies
the index (i = 1… 5000) in L in eq. 7.</p>
        <p>Here k is the total number of words coexisting in both L and the text window. Equation 7 also uses
the frequency of <inline-formula><tex-math>w_i</tex-math></inline-formula> in the text window, the frequency of <inline-formula><tex-math>w_i</tex-math></inline-formula> in list L (as shown in table 3),
and l, the total number of words in the text window.</p>
        <p>The next two steps can be considered sub-steps of content based analysis.</p>
        <p>4. Sentence Window Expression Labeling: The model labels each window
with a specific feeling or human mood expression e. Let i be the index (i = 1… 7)
of the expression list, as shown in table 1, and let a coexisting word be one found in both
the i-th expression list and the text window, where m specifies its index in that list. The expression score is
measured on the basis of the following equation.</p>
        <p>Here k is the total number of coexisting words in both the expression list and the text window, and the equation uses
the frequency of each such word in the text window. After calculating all seven expression scores, the model
calculates e through the following equation.
√ ∑</p>
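        <p>The labeling idea can be illustrated with the sketch below. Since the score equation is not reproduced here, the score is assumed to be the share of window words found in a list, and the label is taken from the single highest-scoring list; ties at the top and all-zero scores give “neutral”, following the rule stated in the text. The function name label_window and the tiny word lists in the example are hypothetical and are not the lists used by the model.</p>
        <preformat><![CDATA[
```python
def label_window(text, word_lists):
    """Label a text window with the best-matching list name, or 'neutral'."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return "neutral"
    scores = {}
    for label, words in word_lists.items():
        hits = sum(1 for t in tokens if t in words)
        # Assumption: score = matching words / total words in the window.
        scores[label] = hits / len(tokens)
    best = max(scores.values(), default=0.0)
    winners = [label for label, score in scores.items() if score == best]
    # Neutral when nothing matches or when the top score is shared.
    if best == 0.0 or len(winners) != 1:
        return "neutral"
    return winners[0]


if __name__ == "__main__":
    # Made-up miniature lists, for illustration only.
    lists = {"anger": {"furious", "rage"}, "happiness": {"joy", "glad"}}
    print(label_window("I am glad, full of joy.", lists))
```
]]></preformat>
        <p>The same function can be called with two lists (positive and negative) to sketch the attitude labeling step.</p>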
        <p>In cases where two or more expression scores are equal, or all expression
scores are zero, the model assigns a “neutral” expression to the window.</p>
        <p>5. Sentence Window Attitude Labeling: The model labels each window with a
specific attitude or human behavior a. Let i be the index (i = 1… 2) of the attitude list,
as shown in table 2, and let a coexisting word be one found in both the i-th attitude list and the text window,
where m specifies its index in that list. The attitude score is measured on the basis of
the following equation.</p>
        <p>Here k is the total number of coexisting words in both the attitude list and the text window, and the equation uses
the frequency of each such word in the text window. After calculating both the positive and negative attitude
scores, the model calculates a through the following equation.
for</p>
        <p>In case both scores are equal or zero, the model assigns a neutral attitude,
e.g. both example text windows have a neutral attitude.</p>
        <p>6. Style Breach Calculation: After computing the above-mentioned stylistic and
other attributes, we get two result sets and two matrices
for the two adjacent text windows <inline-formula><tex-math>W_j</tex-math></inline-formula> and <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula> respectively (equations 11 to 14).</p>
        <p>The system now measures the stylistic similarity score as shown in
equations 15 and 16.</p>
        <p>For each x in equation 15, the similarity score is incremented
accordingly. The two matrices are treated separately because they
contain decimal values. A matrix subtraction is applied to them (equation 17).</p>
        <p>If cr and ci lie within a threshold range described in the next section, then the
similarity score is incremented accordingly. Finally, the model decides whether
or not to merge <inline-formula><tex-math>W_j</tex-math></inline-formula> and <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula> on the basis of whether the value of the similarity score lies within a threshold
range described in the next section. At this point two cases emerge:</p>
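        <p>A possible reading of this comparison is sketched below. It assumes that each matching categorical feature adds one point, that each decimal-valued feature adds one point when the difference between the two windows is within a tolerance, and that the windows are merged when the score reaches a lower bound. The function names, the feature keys, the tolerance and the threshold are hypothetical; the tuned threshold values are not given in this text.</p>
        <preformat><![CDATA[
```python
def similarity_score(categorical_a, categorical_b, numeric_a, numeric_b, tolerance):
    """Count agreeing features between two adjacent windows."""
    # One point for every categorical feature on which both windows agree.
    score = sum(
        1 for key, value in categorical_a.items() if categorical_b.get(key) == value
    )
    # Element-wise subtraction of the decimal-valued features (e.g. cr and ci).
    differences = [x - y for x, y in zip(numeric_a, numeric_b)]
    score += sum(1 for d in differences if abs(d) <= tolerance)
    return score


def should_merge(score, threshold):
    """Assumption: the 'threshold range' is modelled as a simple lower bound."""
    return score >= threshold


if __name__ == "__main__":
    a = {"top_char": "e", "top_word": "in", "expression": "neutral"}
    b = {"top_char": "e", "top_word": "of", "expression": "neutral"}
    s = similarity_score(a, b, [4.8, 0.30], [5.0, 0.90], tolerance=0.25)
    print(s, should_merge(s, threshold=3))
```
]]></preformat>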
        <sec id="sec-6-3-1">
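          <title>Case-1: the similarity score lies within the threshold range</title>
          <p>In this case <inline-formula><tex-math>W_j</tex-math></inline-formula> and <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula> are considered merged, and a new resultant window
<inline-formula><tex-math>W_r</tex-math></inline-formula> is created, where r is the index of the resultant window. The model
continues from step 1 of the methodology for the resultant window and the next adjacent sentence window.</p>
          <p><disp-formula><label>(18)</label><tex-math>W_r = s_i + s_{i+1} + s_{i+2} + s_{i+3} + s_{i+4}</tex-math></disp-formula></p>
          <p><inline-formula><tex-math>W_r</tex-math></inline-formula> keeps expanding as long as case-1 keeps occurring, and this resultant window
reflects a single style for all sentences contained within it.</p>
          <p>Case-2: the similarity score does not lie within the threshold range</p>
          <p>In this case the sentence shared by both adjacent windows stays either in
window <inline-formula><tex-math>W_j</tex-math></inline-formula> or in <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula>; e.g. assume <inline-formula><tex-math>s_{i+2}</tex-math></inline-formula> in equations 2 and 3.</p>
          <list list-type="order">
            <list-item><p><inline-formula><tex-math>s_{i+2}</tex-math></inline-formula> becomes a separate single-sentence window.</p></list-item>
            <list-item><p>The stylistic score is calculated for this single-sentence window following the same methodology steps,
and its distance from both <inline-formula><tex-math>W_j</tex-math></inline-formula> and <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula> is calculated.</p></list-item>
            <list-item><p><inline-formula><tex-math>s_{i+2}</tex-math></inline-formula> may remain in either of the two sentence windows, depending on
the distance value calculated.</p></list-item>
            <list-item><p>If <inline-formula><tex-math>s_{i+2}</tex-math></inline-formula> remains in <inline-formula><tex-math>W_j</tex-math></inline-formula>, then <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula> is restructured for the next
consecutive sentences as shown in equation 19.</p></list-item>
            <list-item><p>If <inline-formula><tex-math>s_{i+2}</tex-math></inline-formula> remains in <inline-formula><tex-math>W_{j+1}</tex-math></inline-formula>, then <inline-formula><tex-math>W_j</tex-math></inline-formula> is restructured as shown in equation 20.</p></list-item>
          </list>
          <p><disp-formula><label>(19)</label><tex-math>W_{j+1} = s_{i+3} + s_{i+4}</tex-math></disp-formula></p>
          <p><disp-formula><label>(20)</label><tex-math>W_j = s_i + s_{i+1}</tex-math></disp-formula></p>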
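          <p>The handling of the shared boundary sentence in case-2 can be sketched as follows. The sketch assumes that the shared sentence is compared against each window without that sentence and that ties keep it in the left window. The name resolve_boundary is hypothetical, and toy_distance (difference in average word length) is only a stand-in for the stylistic distance of the model.</p>
          <preformat><![CDATA[
```python
def resolve_boundary(left, right, distance):
    """Assign the sentence shared by two adjacent windows to the closer window."""
    shared = left[-1]  # equals right[0] by construction
    left_rest, right_rest = left[:-1], right[1:]
    if distance([shared], left_rest) <= distance([shared], right_rest):
        return left, right_rest  # shared sentence stays in the left window
    return left_rest, right      # shared sentence stays in the right window


def average_word_length(sentences):
    words = " ".join(sentences).split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def toy_distance(a, b):
    # Stand-in for the model's stylistic distance; illustration only.
    return abs(average_word_length(a) - average_word_length(b))


if __name__ == "__main__":
    left = ["aa bb", "cc dd", "ee ff"]
    right = ["ee ff", "gggg hhhh", "iiii jjjj"]
    print(resolve_boundary(left, right, toy_distance))
```
]]></preformat>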
          <p>After style breach detection between the first two consecutive sentence
windows, the resulting new windows are compared, starting again from step 1 of the
methodology.</p>
          <p>In the end we have a set R of resultant windows, where
m is the maximum number of sentence windows and each resultant window in R is considered a
breach detection.</p>
          <p>
            A number of experiments were carried out to adjust the threshold values
so that the final F-Measure score was highest. Once the values were
adjusted over the training dataset, the system was run on the test dataset
provided at TIRA [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] in order to detect style breaches.
          </p>
          <p>The evaluator results are shown in table 5.</p>
          <p>The results improved for the final test dataset; however, the model's precision
remained lower than its recall, which affected the final F-Measure score. This suggests
that more experiments over different data sources may be required to adjust the threshold
values.</p>
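          <p>The effect of low precision on the F-Measure can be illustrated with the sketch below, assuming the F-Measure is the usual harmonic mean of precision and recall; the exact windowed variant computed by the evaluator is not specified in this text. The function name f_measure is hypothetical and the example values are made up, not results of the model.</p>
          <preformat><![CDATA[
```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up values: high recall cannot compensate for low precision.
    print(f_measure(0.2, 0.9))
    print(f_measure(0.55, 0.55))
```
]]></preformat>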
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4 Conclusion</title>
      <p>In this paper an unsupervised model for the detection of style breaches is presented.
This research field is rather new, and the task is difficult because no
external resources are available for reference and we can rely only on the
stylistic attributes of an unknown number of authors who may or may not have contributed
to the text document under inquiry. The model therefore presents new
directions, i.e. expression and attitude labeling of textual windows, to
find style breaches between sentences without prior assumptions about the authors' style of
writing, relying more on text content. In future, the results may be improved by
discovering more text labels or by adding more expression lists and
reducing conventional stylistic approaches. The model may hence be applied to
other languages as well.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. [Online] https://en.wikipedia.org/wiki/Stylometry, (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Marcelo Luiz</given-names>
            <surname>Brocardo</surname>
          </string-name>
          , Issa Traore,
          <string-name>
            <given-names>Sherif</given-names>
            <surname>Saad</surname>
          </string-name>
          .
          <article-title>Authorship verification for short messages Using stylometry</article-title>
          .
          <source>Computer, Information and Telecommunication Systems (CITS)</source>
          , International Conference (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          , Barrón Cedeño, Eiselt, Martin Potthast,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Overview of the 3rd international competition on plagiarism detection</article-title>
          .
          <source>In: CEUR Workshop Proceedings. CEUR Workshop Proceedings</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Mikhail</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          , Anastasia Motrenko, Rita Kuznetsova, and
          <string-name>
            <given-names>Vadim</given-names>
            <surname>Strijov</surname>
          </string-name>
          .
          <article-title>Methods for Intrinsic Plagiarism Detection and Author Diarization Notebook for PAN at CLEF 2016</article-title>
          . In Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors,
          <source>CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers</source>
          , Évora, Portugal,.
          <source>CEUR-WS.org. ISSN 1613-0073</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Edoardo</given-names>
            <surname>Airoldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bradley</given-names>
            <surname>Malin</surname>
          </string-name>
          .
          <article-title>Data mining challenges for electronic safety. The case of fraudulent intent detection in e-mails</article-title>
          .
          <source>In Proceedings of the Workshop on Privacy and Security Aspects of Data Mining</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>B.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          .
          <article-title>Seduced into scams: Online lovers often duped</article-title>
          .
          <source>MSNBC</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Audun</given-names>
            <surname>Josanga</surname>
          </string-name>
          , Roslan Ismailb and
          <string-name>
            <given-names>Colin</given-names>
            <surname>Boyda</surname>
          </string-name>
          .
          <article-title>A survey of trust and reputation systems for online service provision</article-title>
          .
          <source>Decis. Support Syst</source>
          .
          <volume>43</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>618</fpage>
          -
          <lpage>644</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Michael</given-names>
            <surname>Tschuggnall</surname>
          </string-name>
          , Efstathios Stamatatos, Ben Verhoeven, Walter Daelemans, Gunther Specht, Benno Stein and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          .
          <source>Identification Task at PAN</source>
          <year>2017</year>
          :
          <article-title>Style Breach Detection and Author Clustering</article-title>
          . In: (Eds.)
          <article-title>CLEF Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          , vol.
          <volume>10456</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Ahmed</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hsinchun</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          , Volume
          <volume>26</volume>
          Issue <issue>2</issue>, Article No. 7
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. [Online] http://www.manythings.org/vocabulary/lists/l (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. [Online] https://www.vocabulary.com/lists/202236 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. [Online] http://descriptivewords.org/descriptive-words-for-attitudepersonality (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. [Online] http://www.english-at-home.com/vocabulary/words-that-describebehaviour (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. [Online] http://positivewordsresearch.com/list-of-positive-words (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. [Online] http://www.enchantedlearning.com/wordlist/negativewords.shtml (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. [Online] http://www.wordfrequency.info/free.asp?s=y (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. [Online] http://www.tira.io/tasks/pan/ (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>