The Short Stories Corpus Notebook for PAN at CLEF 2015

Faisal Alvi 1,2, Mark Stevenson 1, Paul Clough 1
1 University of Sheffield, United Kingdom
2 King Fahd University of Petroleum & Minerals, Saudi Arabia
{falvi1, mark.stevenson, p.d.clough}@sheffield.ac.uk

Abstract. In this work we describe the construction of a plagiarism detection / text reuse corpus submitted to the PAN 2015 Evaluation Lab. Our corpus covers four text reuse scenarios: (1) no-plagiarism, (2) story retelling, (3) synonym replacement and (4) character substitution. The most interesting of these is story retelling, through which we find patterns of textual similarity between retellings of the same story. We use the Grimm brothers' fairy tales, as available on Project Gutenberg, as the source of our documents. The corpus consists of 200 pairs of documents, with 50 document pairs for each type of text reuse. Empirical observation shows interesting patterns of textual similarity within the corpus. Furthermore, plagiarism detection using various approaches shows how difficult the different groups are to detect.

1 Introduction

The PAN Lab Evaluation Series [10] has conducted experimental evaluation of plagiarism detection approaches for several years. This year, PAN 2015 [9] introduced a corpus construction task for the text alignment task: participants submit a corpus of annotated passages containing real and/or artificially generated samples of text reuse or plagiarism. Text reuse detection has been well researched in the news domain [4]; in this work we explore textual similarity in short stories. For this task, we submitted a collection of document pairs with annotated passages classified into four groups. The source passages have been taken from various translations of Grimms' fairy tales available on the Project Gutenberg [1] website.
The corpus consists of 200 document pairs, with 50 pairs in each of the following four groups: (1) no-plagiarism, (2) (human) story retelling, (3) synonym replacement and (4) character substitution. The no-plagiarism group contains completely different short stories that may share some genre-specific terms, leading to minor textual overlap. The story retelling group contains pairs of story fragments taken from two different retellings by human writers. The third group, synonym replacement, contains story fragment pairs in which words and phrases have been replaced with their synonyms. Finally, character substitution refers to technical disguise, where letters in words are replaced with look-alike Unicode characters. Empirical observation reveals interesting similarity patterns within the corpus.

2 Corpus Construction

The corpus consists of documents from the Grimms' fairy tales as available on the Project Gutenberg website. The corpus is small in comparison to other PAN corpora, because the number of tales in the Grimms' collection ranges from a maximum of 200 in some editions to fewer than 50 in others. In order to have a balanced collection of documents within each group, our corpus consists of 200 document pairs, with 50 pairs for each group. Statistics on passage length within the corpus documents are shown in Table 1.

Table 1. Statistics of passage sizes in the corpus (number of characters)

                   No Plagiarism  Story Retelling  Synonym Replace  Character Subst
Number of Docs.         50              50               50               50
Maximum Length         none           1160              765              729
Minimum Length         none            285              259              220
Average Length         none            590              497              455

Here, a passage is defined as a maximal-length contiguous sequence of characters (or text) that is similar between two versions.
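Statistics such as those in Table 1 can be reproduced with a few lines of code. The sketch below uses illustrative placeholder passages, not the actual corpus texts:

```python
# Compute passage-length statistics (in characters) for a group of
# annotated passages, in the style of Table 1. The passages here are
# illustrative placeholders, not the actual corpus texts.

def passage_stats(passages):
    lengths = [len(p) for p in passages]
    return {
        "count": len(lengths),
        "max": max(lengths),
        "min": min(lengths),
        "avg": sum(lengths) // len(lengths),  # truncated mean length
    }

if __name__ == "__main__":
    group = ["She put in her jars to take home.",
             "The wedding of the King's eldest son was to be celebrated."]
    print(passage_stats(group))
```

For the real corpus, the passage strings would be extracted from the annotated (source, suspicious) offsets in the PAN XML files.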
For corpus construction, we selected passages from two versions of a tale that correspond to the same events, since different versions of the same story may differ in the details of events.

2.1 No Plagiarism

The no-plagiarism group consists of stories that are completely different but may have minor textual overlap due to the occurrence of some genre-specific words. For this group, we computed Ferret [6] trigram similarity over the entire Grimms' collection and chose 50 document pairs with no other similarity between them.

2.2 Story Retelling

(Story) retelling is defined as "a new, and often updated or retranslated, version of a story."1 In this context, a question that appears on some internet forums or websites is: "Is retelling an old fairy tale (or a short story) considered plagiarism?"2 In this work we do not address this question, i.e., "Does story retelling involve text reuse and/or plagiarism?", definitively; we therefore use the term 'textual similarity' when referring to similar passages of text in two story retellings. However, Clough et al. [4] remark that, "Of course, reusing language is as old as the retelling of stories...". This suggests a link between the retelling of stories and the reuse of language; consequently, a story retelling may involve text reuse from the original story, although we cannot claim this for any particular retelling.

1 http://dictionary.reference.com/browse/retelling [Last Accessed: 07-June-2015]
2 http://www.answers.com/Q/Is_rewriting_an_old_fairy_tale_considered_plagiarism [Last Accessed: 15-July-2015]

In Figure 1 we show fragments from two retellings of the story 'King Thrushbeard' (also called 'King Grisly Beard'). Here we give a correspondence between sentence fragments found in the two retellings:

1. (Fragment 1) The wedding of the King's eldest son was to be celebrated.
   (Fragment 2) The king's eldest son was passing by, going to be married.
2. 
(Fragment 1) She thought of her lot with a sad heart.
   (Fragment 2) She bitterly grieved.
3. (Fragment 1) She put in her jars to take home.
   (Fragment 2) She put into her basket to take home.

We see that in pair 1, fragment 2 corresponds to a change of voice from fragment 1 with some modification; in pair 2, fragment 2 is a summarization of fragment 1; and the two fragments in pair 3 are nearly identical apart from minor word replacement. It is relevant to mention here that the original Grimms' tales are in German; these two retellings are two different English translations. Earlier, Barzilay and McKeown [3] also extracted paraphrases from multiple English translations of the same source texts.

It happened that the wedding of the King's eldest son was to be celebrated, so the poor woman went up and placed herself by the door of the hall to look on. When all the candles were lit, and people, each more beautiful than the other, entered, and all was full of pomp and splendour, she thought of her lot with a sad heart, and cursed the pride and haughtiness which had humbled her and brought her to so great poverty. The smell of the delicious dishes which were being taken in and out reached her, and now and then the servants threw her a few morsels of them: these she put in her jars to take home.

She had not been there long before she heard that the king's eldest son was passing by, going to be married; and she went to one of the windows and looked out. Everything was ready, and all the pomp and brightness of the court was there. Then she bitterly grieved for the pride and folly which had brought her so low. And the servants gave her some of the rich meats, which she put into her basket to take home.

Figure 1. Example of textual similarity in story retellings

2.3 Synonym Replacement

The third group in our corpus is synonym replacement.
This refers to the replacement of words (and some phrases) with synonymous words and equivalents. For this purpose, we initially used WordNet [8], searching for the synset corresponding to a given word in the text and returning the first available synonym. However, in some cases the resulting text was too far from the original. This technique could be improved by incorporating the context in which a particular word appears, as well as word sense disambiguation; we plan to incorporate these changes into the corpus at a later stage. Since the corpus is not large and its words belong to a particular domain, we instead created a customized list of synonyms for commonly occurring words and phrases in the documents. While this approach produced meaningful texts, it may not be scalable to a large number of documents or to corpora covering diverse topics. In addition, we removed some articles (a, an, the) and replaced alternate occurrences of some pronouns with proper nouns. Below we give an example of synonym replacement:

1. (Fragment 1) The King, who had a bad heart, and was angry...
2. (Fragment 2) the monarch, who had a worse heart, and was enraged...

2.4 Character Substitution or Technical Disguise

Substitution of characters with their Unicode equivalents in order to exploit a weakness of a plagiarism detection approach is known as technical disguise [7]. In this work we used a simple replacement of two of the most frequently occurring letters, 'a' and 'e', with their Cyrillic equivalents [5]. Table 2 shows the correspondence between an ASCII letter and its Cyrillic Unicode equivalent.

Table 2. The letters 'a' and 'e' with their Cyrillic equivalents

ASCII Character  Unicode Value  Cyrillic Equivalent  Unicode Value
       a            U+0061         а (Cyrillic)        U+0430
       e            U+0065         е (Cyrillic)        U+0435

Most word n-gram based approaches might fail to detect this type of obfuscation, since the unit of similarity in these approaches is the word.
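The substitution in Table 2, and its effect on word-level matching, can be sketched in a few lines (a minimal illustration, not the tool used to build the corpus):

```python
# Replace Latin 'a' and 'e' with their Cyrillic look-alikes (U+0430, U+0435),
# as in Table 2, then check word overlap before and after.
HOMOGLYPHS = str.maketrans({"a": "\u0430", "e": "\u0435"})

def disguise(text: str) -> str:
    return text.translate(HOMOGLYPHS)

sentence = "the hazardous enterprise"
obfuscated = disguise(sentence)

# The two strings render almost identically, but every word containing an
# 'a' or an 'e' now differs at the code-point level, so exact word matching
# (and hence word n-gram similarity) fails on those words.
shared = set(sentence.split()) & set(obfuscated.split())
print(sentence == obfuscated)  # False
print(shared)                  # set() - no word survives exact matching
```

Words containing neither 'a' nor 'e' are left untouched, which is why only partial degradation is seen on longer passages.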
In the example shown below, each 'a' and 'e' in the second sentence has been replaced with its Cyrillic equivalent; the two sentences look identical but differ at the character level.

1. (Sentence 1) Now there lived in the country two brothers, sons of a poor man, who declared themselves willing to undertake the hazardous enterprise.
2. (Sentence 2) Now there lived in the country two brothers, sons of a poor man, who declared themselves willing to undertake the hazardous enterprise.

3 Results and Discussion

We tested the corpus using the simple PAN Baseline approach as well as our hashing- and merging-based approach [2] submitted to PAN 2014. Numerical results for precision, recall, granularity and overall plagdet score are given in Table 3, and a comparison of the two approaches' performance on the various groups is given in Figure 2.

Table 3. Results of the PAN Baseline and our hashing/merging approach on the corpus

Measure      Approach      Overall   No Plag.  Story Ret.  Synonym   Char. Subst.
Plagdet      PAN Baseline  0.02593   1.00000   0.00492     0.07125   0.00000
Plagdet      Hash/Merge    0.26221   1.00000   0.10150     0.56686   0.00886
Precision    PAN Baseline  0.99632   1.00000   1.00000     0.99598   0.00000
Precision    Hash/Merge    0.98663   1.00000   0.99788     0.99923   0.00446
Recall       PAN Baseline  0.01313   1.00000   0.00246     0.00369   0.00000
Recall       Hash/Merge    0.15119   1.00000   0.05347     0.39565   0.61392
Granularity  PAN Baseline  1.00000   1.00000   1.00000     1.00000   1.00000
Granularity  Hash/Merge    1.00000   1.00000   1.00000     1.00000   1.00000

From Figure 2, we see that both approaches handled the no-plagiarism group perfectly, and our hashing/merging approach performed reasonably well on synonym replacement. However, the scores were low for story retelling and close to zero for character substitution.

Figure 2. Visual performance of the two approaches based on plagdet scores

4 Conclusion and Future Work

In this work we constructed a corpus in PAN format based on story retelling. We used various translations of Grimms' fairy tales in the construction of this corpus.
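As a cross-check of the plagdet scores reported in Table 3: plagdet combines precision, recall and granularity as plagdet = F1 / log2(1 + granularity), the standard PAN measure defined by Potthast et al. [10]. A minimal sketch, checked against the overall Hash/Merge row:

```python
import math

def plagdet(precision: float, recall: float, granularity: float) -> float:
    # PAN plagdet score: harmonic mean (F1) of precision and recall,
    # penalized by granularity (Potthast et al. [10]).
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

# Overall Hash/Merge figures from Table 3.
score = plagdet(0.98663, 0.15119, 1.00000)
print(score)  # approximately 0.2622, matching Table 3
```

With granularity 1, log2(2) = 1, so plagdet reduces to plain F1; this is why the low recall values dominate the overall scores.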
The question of whether a story retelling involves text reuse has not been addressed comprehensively in this work; however, given the nature of story retelling, text reuse can be expected to occur. Apart from story retelling, our other strategies were synonym replacement and character substitution in the form of technical disguise.

In future, this corpus could be improved by enhancing synonym replacement with a comprehensive automatic paraphrasing strategy. Another possible area of exploration is extending the domain of story retelling to modern short stories. Some of the stories in our corpus contain archaic language or words; however, this is to be expected, since we used versions of fairy tales that, unlike modern short stories, are in the public domain.

References

1. Books: Grimm (sorted by popularity) - Project Gutenberg. http://www.gutenberg.org/ebooks/search/?query=grimm, Last Accessed: 2015-07-15
2. Alvi, F., Stevenson, M., Clough, P.D.: Hashing and Merging Heuristics for Text Reuse Detection. In: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014. pp. 939–946 (2014)
3. Barzilay, R., McKeown, K.R.: Extracting Paraphrases from a Parallel Corpus. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. pp. 50–57. Association for Computational Linguistics (2001)
4. Clough, P.D., Gaizauskas, R.J., Piao, S.S., Wilks, Y.: Measuring Text Reuse. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. pp. 152–159 (2002)
5. Gillam, L., Marinuzzi, J., Ioannou, P.: Turnitoff – Defeating Plagiarism Detection Systems. In: Proceedings of the 11th Higher Education Academy-ICS Annual Conference. Higher Education Academy (2010)
6. Lane, P., et al.: UH Ferret: Implementation of a Copy-Detection Tool. Software (2011), http://uhra.herts.ac.uk/handle/2299/12041
7. 
Meuschke, N., Gipp, B.: State-of-the-Art in Detecting Academic Plagiarism. International Journal for Educational Integrity 9(1) (2013)
8. Miller, G.A.: WordNet: A Lexical Database for English. Communications of the ACM 38(11), 39–41 (1995), http://doi.acm.org/10.1145/219717.219748
9. Potthast, M., Hagen, M., Göring, S., Rosso, P., Stein, B.: Towards Data Submissions for Shared Tasks: First Experiences for the Task of Text Alignment. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2015), http://www.clef-initiative.eu/publication/working-notes
10. Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st International Competition on Plagiarism Detection. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E. (eds.) SEPLN 09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09). pp. 1–9. CEUR-WS.org (Sep 2009)

A Peer Review of Submitted Corpora

Detailed reviews of the various corpora were submitted earlier. Due to space limitations, here we give a summary of the corpora review, shown in Table 4. Regarding the errors:

– Synonym errors are errors in the replacement of a word by its synonym; such a replacement may make the resulting text somewhat incomprehensible.
– Demarcation errors are errors in the alignment of the source and the suspicious text; for example, the source text may be marked a few characters earlier or later than it should have been. This is a minor error, and may also have arisen from programming errors in the creation of the XML files, or from the viewer program used for reviewing the corpora.

The average-errors-per-document figure is given as a range, since the definition of a synonym error is not precise; likewise, demarcation errors might not be present at all, but may have been observed due to the character set used.
Due to lack of expertise in the languages and/or non-availability of language resources, the review of non-English corpora could not be carried out.

Table 4. Tabular Summary of Peer Review for Submitted Corpora

Corpus Name    Mono/Bilingual    Method of              Documents  Avg. Errors   Observations and Errors
                                 Construction           Reviewed   per Document
alvi-15        Mono (English)    Refer to notebook      -          -             -
cheema-15      Mono (English)    Possibly automatic     ≈ 75       ≥ 0           Meaningful names for obfuscation
                                 + manual                                        strategies needed; synonym errors
                                                                                 may be present.
mohtaj-15      Mono (English)    Possibly automatic     ≈ 75       ≥ 1           Synonym errors and demarcation
                                                                                 errors may be present.
palkovskii-15  Mono (English)    Possibly automatic     ≈ 75       ≥ 1           Some grammatically incorrect
                                                                                 sentences.
asghari-15     Bi (Eng-Persian)  No basis for judgment  < 10       N/A           N/A
hanif-15       Bi (Eng-Urdu)     No basis for judgment  < 10       N/A           N/A
khoshnava-15   Mono (Persian)    No basis for judgment  < 10       N/A           N/A
kong-15        Mono (Chinese)    No basis for judgment  < 10       N/A           N/A