=Paper=
{{Paper
|id=Vol-1179/CLEF2013wn-PAN-VartapetianceEt2013
|storemode=property
|title=A Textual Modus Operandi: Surrey's Simple System for Author Identification Notebook for PAN at CLEF 2013
|pdfUrl=https://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-VartapetianceEt2013.pdf
|volume=Vol-1179
|dblpUrl=https://dblp.org/rec/conf/clef/VartapetianceG13
}}
==A Textual Modus Operandi: Surrey's Simple System for Author Identification Notebook for PAN at CLEF 2013==
A Textual Modus Operandi: Surrey's Simple System for Author Identification
Notebook for PAN at CLEF 2013

Anna Vartapetiance, Lee Gillam
University of Surrey
{A.Vartapetiance, L.Gillam}@surrey.ac.uk

Abstract. It may be possible to detect deceptions of various kinds, but such detection has little value if the deceiver cannot be identified. In this paper, we discuss our approach to Authorship Attribution, which uses vector similarity within a frequency-mean-variance framework over patterns of stopwords (no more than ten). High-frequency individual occurrences, and patterns of co-occurrence, can be used as an identifier of an author's style, and the approach operates similarly across certain languages without prior linguistic knowledge. This simple system achieved F1 values of 0.66, 0.74 and 0.78 for the Early Bird, Final, and post-submission assessments of the Train Corpus. We cannot yet offer further explanation, as the Test Corpus is not available at the time of writing.

1 Introduction

Research into Deception Detection has benefited from the large (documented) sets of human communication mediated through the web, and in particular through social media. Asynchronous distributed communication is common in such media, and with the non-verbal and vocal cues to deception removed, and the deceiver having time to plan their deception, verbal cues are the main area of exploration. Such detection has been attempted on text messages [7], in fraud investigations [6], and on court testimonies [4]. Deceptions range from "Pareto white lies" to "Spite black lies" [2], and have been studied by forensic linguists and natural language processing researchers alike. Detecting the deception differs, however, from detecting the deceiver: this is analogous to the difference between analysing the scene of a crime and being able to use specific evidence from that scene to suggest the perpetrator of the crime. Extending the analogy, we are interested in a detectable Modus Operandi (MO) for a particular perpetrator. In the PAN problem space of Authorship Attribution, however, we are trying to determine whether a given 'scene' or 'design' reflects the MO of (a) prior scene(s) or design(s).

In the 6th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN2012), we gave a first test to our idea that the signatures of such scenes could be found in co-occurrence patterns of stopwords. The PAN2012 tasks covered [10]:

1) Traditional Authorship Attribution: given unknown documents and sets of known documents from different authors,
   a) denote an author for each document (closed class problem)
   b) an extension to a) where the author may have been somebody else (open class problem)
2) Authorship clustering/intrinsic plagiarism: given a document,
   a) cluster the paragraphs written by each author, where the number of authors is known (closed class clustering)
   b) cluster the paragraphs written by each author, where the number of authors is unknown (open class clustering)
3) Sexual Predator Identification: given a dataset of chat lines,
   a) identify whether the chat indicates a predator
   b) identify the predatory elements of the chat

We submitted simple systems for all three subtasks to create baselines for our own work. These achieved 42.8% overall correct detection for Traditional Authorship Attribution and 91.1% for Intrinsic Plagiarism Detection. For the Sexual Predator Identification task, our system achieved 0.61, 0.38 and 0.48 for Precision, Recall and F1 respectively.
This paper presents our approach for PAN2013, focusing only on the open class Traditional Authorship Attribution problem for three different languages (English, Greek and Spanish). The approach, the dataset, and the addition of two languages are significant changes, making it inherently difficult to infer performance from prior results, and likewise difficult to determine whether a given approach adapted to this task offers better or worse performance without incurring a cost of back-fitting. In this paper, we outline the approach taken at the University of Surrey to this task. In Section 2, we discuss the Train Corpus and highlight the changes compared to last year. In Sections 3 and 4, we describe the system and the evaluation of its results. Section 5 concludes the paper with considerations for future work.

2 Corpus

PAN2013 focuses on open class Traditional Authorship Attribution for three different languages: English, Greek and Spanish. PAN2012 had a related task, but any prior approach could not be used directly, and the addition of languages likely requires further adaptation. For PAN2012, given a set of documents from different known authors and a set of documents with unknown authors, the task was to allocate each document to one author (or none). The PAN2013 task requires a Boolean response as to whether an unknown document was likely written by the same author as a set of (1 to 10) "Known Documents" from that (single) author. Table 1 shows details of the corpus: the number of cases for each language and the number of "Known Documents" available per case to help predict the correct answer. As is apparent from the table, neither the number of cases per language nor the number of "Known Documents" per case is consistent, and it is not clear that these numbers are representative of those used for the Test Corpus.

Table 1: PAN2013 Corpus Details

Language  Cases  # of Known Documents per case  Case Names
English   10     2                              EN11, EN21, EN23, EN30
                 3                              EN13, EN18
                 4                              EN07
                 5                              EN24
Greek     20     1                              GR01, GR02
                 2                              GR03, GR04
                 3                              GR05, GR06
                 4                              GR07, GR08
                 5                              GR09, GR10
                 6                              GR11, GR12
                 7                              GR13, GR14
                 8                              GR15, GR16
                 9                              GR17, GR18
                 10                             GR19, GR20
Spanish   5      1                              SP03, SP09
                 3                              SP02, SP05
                 4                              SP10

3 Method

For previous Authorship Attribution tasks, many approaches have been documented that use NLP techniques over bags of words, N-grams, and parts of speech (POS), with varying degrees of success. Often, stopwords are either not an integral part of the analysis, or are dropped from processing altogether. For PAN2012, we approached attribution using a mean-variance framework on patterns of stopwords [1]. We used a specified maximum window size for pairs of the 10 most common English stopwords to identify positional frequencies, and allocated an author based on the nearest mean-variance match. We achieved an F1 of 0.42, and saw post-submission that it might have been possible to achieve an F1 of 0.48 using paired sets of 5 stopwords (patterns combining the first 5 with the second 5, and hence a smaller feature space) [10]. For PAN2013, this core idea was not changed. The authors have no real knowledge of either Greek or Spanish, so we attempted to find lists of the 10 most frequent stopwords for each (Table 2). Given that lack of linguistic knowledge, we do not yet know whether the lists we obtained meet this requirement.
Table 2: List of stopwords for all three languages

Language  Stopwords                                    Based on
English   The, Be, To, Of, And, A, In, That, Have, I   [9]
Greek     Και, Το, Να, Τον, Η, Της, Με, Που, Την, Από  [3]
Spanish   De, La, Que, El, En, Y, A, Los, Del, Se      [8]

For the PAN2013 early bird submission, we applied the following steps with parameters from our PAN2012 post-submission experiments. Patterns were generated from the first 5 frequent stopwords against the second 5 frequent stopwords, with a window size of 5 words and a confidence measure of 0.95. We replaced our closest-match option from PAN2012 with the average of the maximum cosine similarity values per pattern. The approach was as follows:

Table 3: Approach taken for the PAN2013 Early Bird submission

Step 1  Select the 10 most frequent words for each language
Step 2  Generate regular expressions of the first 5 most frequent stopwords against the second 5 (S1*S2), using a specific window size N (here, 5) for each document
Step 3  Extract concordances containing the regular expressions for all texts
Step 4  Calculate frequency, mean and variance information for the pairs
Step 5  Calculate cosine similarities of the unknown document against each of the known documents, per pair
Step 6  Calculate the average of all maximum cosine similarities for the pairs to get a single value per case
Step 7  Report "Y" if the value is above the confidence measure (here, 0.95), "N" otherwise

For the main submission, we introduced a filter (after Step 4) so that we only compare patterns that occur more than a specified number of times in a document; a single occurrence of a pattern may not be a strong indicator of an author's writing style. An algorithm for the system, using the notation and functions from Table 4, is offered in Table 5.

Table 4: Table of Notations

Symbol     Meaning
Q          Set of queries
q          A single query, where q ∈ Q
D          Set of documents
d          A document, where d ∈ D
D_q        Set of documents related to query q
L          Set of languages
s          A stopword
S_l        Set of stopwords for a language l ∈ L
S_i        Subset of S_l, where S_i ⊆ S_l
WS         Window Size: maximum distance from one stopword to another
P          Pattern of a stopword s_i from S_i followed by s_j from S_j within a maximum distance of the Window Size
FT         Filter: threshold for the frequency of each pattern
CM         Confidence Measure: threshold for identifying confidence in the similarity of Q with D
FMV        Function that takes the incidences of a given pattern and returns three values: frequency, mean and variance
CosineSim  Cosine similarity function [5]

Table 5: Algorithm of our system for PAN2013

Our process of Authorship Attribution can be explained as:
1. For each query q, calculate the FMV for each pair of a stopword s_i from subset S_1 followed by s_j from subset S_2 within window size WS, but only if that pattern occurs more than FT times.
2. Only for patterns that occur more than FT times for q, calculate the FMV for the related documents D_q with the same pair (s_i from S_1 followed by s_j from S_2) within window size WS, again only if that pattern occurs more than FT times.
3. Find the maximum of the cosine similarities (CosineSim) between each of the patterns for q and the related D_q.
4. Calculate the average of the non-zero values.
5. Answer "Y" if that value is greater than the Confidence Measure CM, else answer "N".
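To make the pipeline of Tables 3-5 concrete, the following is a minimal Python sketch of one possible reading of it for the English S1*S2 configuration. The function names (classify_case, fmv_features), the tokenisation, and the choice of taking the mean and variance over the token distance between the two stopwords in each match are our own assumptions; the notebook does not fix these details.

```python
import math
import re
from itertools import product
from statistics import mean, pvariance

# The ten most frequent English stopwords (Table 2), split into the first and
# second five; S1*S2 is the pairing used for the early bird submission.
STOPWORDS_EN = ["the", "be", "to", "of", "and", "a", "in", "that", "have", "i"]
S1, S2 = STOPWORDS_EN[:5], STOPWORDS_EN[5:]


def tokenize(text):
    """Lower-cased word tokens; a rough stand-in for the notebook's preprocessing."""
    return re.findall(r"\w+", text.lower())


def fmv_features(tokens, s_a, s_b, window, filter_threshold):
    """Frequency, mean and variance for the pattern 's_a ... s_b' within `window` tokens.

    Mean and variance are taken here over the token distance between the two
    stopwords in each match (an assumption). Returns None if the pattern occurs
    no more than `filter_threshold` times.
    """
    distances = [j - i
                 for i, tok in enumerate(tokens) if tok == s_a
                 for j in range(i + 1, min(i + 1 + window, len(tokens)))
                 if tokens[j] == s_b]
    if len(distances) <= filter_threshold:
        return None
    return (len(distances), mean(distances), pvariance(distances))


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_case(unknown_text, known_texts, subset_a=S1, subset_b=S2,
                  window=5, filter_threshold=0, confidence=0.95):
    """Return 'Y' if the unknown document is attributed to the known author, else 'N'."""
    unknown = tokenize(unknown_text)
    knowns = [tokenize(t) for t in known_texts]
    max_sims = []
    for s_a, s_b in product(subset_a, subset_b):   # all pattern pairs (S1*S2)
        u = fmv_features(unknown, s_a, s_b, window, filter_threshold)
        if u is None:
            continue
        sims = []
        for k in knowns:
            v = fmv_features(k, s_a, s_b, window, filter_threshold)
            if v is not None:
                sims.append(cosine(u, v))
        if sims:
            max_sims.append(max(sims))             # best match per pattern
    nonzero = [s for s in max_sims if s > 0]
    score = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return "Y" if score > confidence else "N"
```

Under this reading, the early bird configuration corresponds to window=5, filter_threshold=0 and confidence=0.95, and the main-submission filter corresponds to a filter_threshold greater than zero.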
4 Submissions, Results and Evaluations

For the early bird evaluation, we used the same parameters for all three languages, following the steps presented in Table 3 (using the S1*S2 pattern with a Window Size of 5 and a Confidence Measure of 95). The system achieved an F1 of 0.66 for the Train Corpus, detecting 60%, 60% and 100% of documents correctly for English, Greek and Spanish respectively (Table 8). The results of the first evaluation on the Test Corpus showed an F1 of 0.56, detecting 45%, 50% and 90% of documents correctly for English, Greek and Spanish respectively. To try to improve results, we conducted a parameter sweep covering 6750 tests based on the values outlined in Table 6 (a sketch of such a sweep is given at the end of this section).

Table 6: Parameters and the options used for each

Parameter           # of Options  Options
Language            3             English, Greek, Spanish
Pattern Pairs       9             S1*S1, S1*S2, S1*S3, S2*S1, S2*S2, S2*S3, S3*S1, S3*S2, S3*S3
Window Size         5             5, 10, 15, 20
Filter              5             No filter, 2, 3, 4, 5
Confidence Measure  10            90, 91, 92, 93, 94, 95, 96, 97, 98, 99

The results from these tests suggested that each language should be treated slightly differently. Although we do not have linguistic knowledge of Greek or Spanish, Greek seemed to evidence more structured use of stopwords than Spanish (high cosine similarities for Greek suggested that stopwords occupy relatively fixed positions, which makes them less author-specific than would be the case for Spanish). For the full submission, parameters were selected for each language to account for these findings, as follows:

Table 7: Parameter values used for the PAN2013 Final submission

Language  Pattern Pair  Window Size  Filter  Confidence Measure
English   S1*S2         20           4       92
Greek     S3*S3         10           5       98
Spanish   S1*S2         10           4       92

These parameters improved the performance of our Early Bird system from an F1 of 0.66 to an F1 of 0.74 (presented in Table 8). However, results on the Test Corpus for the Final submission showed an F1 of only 0.54 across the three languages, a significant difference (Spanish dropped by 0.30 in F1, while both English and Greek improved). Unfortunately, the Test Corpus has not been released at the time of writing, and so we are unable to offer an explanation of this variation. Post-competition, we could indicatively achieve an F1 of 0.78 on the Train Corpus by also considering the number of test samples (Known Documents) being compared against. The value of this finding would need to be explored once all test data and suitable annotations become available.

Table 8: Results from the various submissions for both Train and Test Corpus
(E, G, S: correctly detected documents per language; E%, G%, S%: percentage correct; Overall: mean of the three percentages; Corr doc: total correctly detected documents)

Version           E   G   S   E%  G%  S%   Overall  Corr doc  F1
Train 1           6   12  5   60  60  100  73.3     23        0.657
Test, Early Bird  --  --  --  45  50  90   61.6     --        0.56
Train 2           8   13  5   80  65  100  81.6     26        0.742
Test, Final Sub   --  --  --  50  53  60   53.3     --        0.541
Train, Post sub   8   15  5   80  75  100  85       28        0.777
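As a rough illustration of the sweep over the Table 6 grid, the following sketch performs a grid search over window size, filter and confidence measure for one language, reusing the hypothetical classify_case function from the earlier sketch. The caller supplies the stopword subset pairs to test, since the construction of the S3 subset is not detailed here, and scoring configurations by the number of correct answers on the Train Corpus is our own assumption.

```python
from itertools import product

def sweep(cases, pattern_pairs,
          window_sizes=(5, 10, 15, 20),
          filters=(0, 2, 3, 4, 5),                        # 0 stands for "no filter"
          confidences=tuple(c / 100 for c in range(90, 100))):
    """Grid search over the Table 6 parameters for one language.

    cases: list of (unknown_text, known_texts, gold_label) tuples.
    pattern_pairs: list of (subset_a, subset_b) stopword subsets, e.g. [(S1, S2)].
    Returns the best configuration as (correct, pair, window, filter, confidence).
    """
    best = None
    for (sub_a, sub_b), window, flt, conf in product(
            pattern_pairs, window_sizes, filters, confidences):
        correct = sum(
            classify_case(unknown, knowns, sub_a, sub_b, window, flt, conf) == gold
            for unknown, knowns, gold in cases)
        if best is None or correct > best[0]:
            best = (correct, (sub_a, sub_b), window, flt, conf)
    return best
```

Under this reading, the per-language parameters of Table 7 would correspond to the best-scoring tuples returned for each language.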
5 Conclusion

In this paper, we attempted to reuse a fairly simple approach from PAN2012 for Authorship Attribution. Our frequency-mean-variance framework over pairs of stopwords (no more than ten) can demonstrate reasonable performance, with an F1 of 0.74 on the Train Corpus, but seems only to achieve an F1 of 0.54 on the Test Corpus. This suggests either that our approach is over-tuned to the training data, that we suffer from generalizability problems (not having more similar data with which to tune parameters), or that there is a large gap in representativeness between the Train and Test Corpus. Only once these data are released could we ascertain which.

Acknowledgments

The authors gratefully acknowledge prior support from EPSRC/JISC (EP/I034408/1) and the UK's Technology Strategy Board (TSB, 169201), and also the efforts of the PAN13 organizers in crafting and managing the tasks.

References

1. Church, K., Hanks, P.: Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, vol. 16(1), pp. 22-29 (1991)
2. Erat, S., Gneezy, U.: White Lies. Management Science, vol. 58(4), pp. 723-733 (2012)
3. Lazarinis, F.: Engineering and Utilizing a Stopword List in Greek Web Retrieval. Journal of the American Society for Information Science and Technology (JASIST), vol. 58(11), pp. 1645-1652 (2007)
4. Little, A., Skillicorn, B.: Detecting Deception in Testimony. In: Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI 2008), pp. 13-18. Taipei, Taiwan (2008)
5. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, USA (2008)
6. McCallion, J.: Ernst & Young Debuts FBI Co-designed Anti-Fraud Software. IT PRO. [available at] http://www.itpro.co.uk/644899/ernst--young-debuts-fbi-co-designed-anti-fraud-software
7. Reynolds, L., Smith, M.E., Birnholtz, J., Hancock, J.: Butler Lies From Both Sides: Actions and Perceptions of Unavailability Management in Texting. In: Proceedings of the 2013 Conference on Computer Supported Cooperative Work (CSCW '13), pp. 769-778. ACM, New York, NY, USA (2013)
8. Snowball: A Spanish Stop Word List. [available at] http://snowball.tartarus.org/algorithms/spanish/stop.txt
9. The Oxford English Corpus: Facts About the Language. [available at] http://oxforddictionaries.com/words/the-oec-facts-about-the-language
10. Vartapetiance, A., Gillam, L.: Quite Simple Approaches for Authorship Attribution, Intrinsic Plagiarism Detection and Sexual Predator Identification - Notebook for PAN at CLEF 2012. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.): CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers. Rome, Italy (2012)