<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Derivative Approach for Plagiarism Source Retrieval</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Peoples' Friendship University of Russia</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our approach to the PAN shared task of plagiarism source retrieval, based on the strategy suggested by Williams et al. [1]. We also incorporate named-entity queries similar to those of Elizalde [2].</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Algorithm</title>
      <p>
        We attempt to improve on the current approaches to the source retrieval task.
Based on the results of the 2015 competition [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] we chose to implement an
algorithm that combines the approaches of Williams et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Elizalde [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Williams' software was the best-performing detector in the source retrieval task
in 2013 and 2014. Even though they did not submit a new version in 2015, their
approach was still unmatched by the 2015 participants. We chose to work from
their 2013 approach because the added complexity of their supervised result-ranking
strategy in 2014 achieved virtually the same results as its predecessor. Elizalde
makes use of a novel idea of extracting named entities across each document in
an attempt to detect highly obfuscated plagiarism.
      </p>
      <p>Our approach consists of several stages, namely: chunking, keyphrase
extraction, query formulation, and download filtering.</p>
      <p>Chunking: We begin by segmenting the text of the suspicious document into
paragraphs of 5 sentences each. We preprocess each paragraph, removing all
non-alphabetic characters.</p>
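      <p>A rough sketch of this chunking step follows; a naive regex sentence splitter stands in for the NLTK tokenizer used in our pipeline, and the exact whitespace handling is an assumption.</p>
      <preformat>
```python
import re

def chunk_text(text, sentences_per_chunk=5):
    """Split a suspicious document into chunks of five sentences each and
    strip non-alphabetic characters. A simple regex splitter stands in for
    the sentence tokenizer (the actual pipeline uses NLTK)."""
    sentences = [s for s in re.split(r'[.!?]+\s+', text.strip()) if s]
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunk = re.sub(r'[^A-Za-z\s]', ' ', chunk)  # drop non-alphabetic chars
        chunks.append(' '.join(chunk.split()))      # normalize whitespace
    return chunks
```
      </preformat>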
      <p>Keyphrase Extraction and Query Formulation: In forming keyphrases, two
distinct methods were employed. The first attempts to find the most important
features of the entire document, while the other forms queries based on
individual chunks.</p>
      <p>
        Named entity queries: The Natural Language Toolkit (NLTK)1 is used to
identify named entities over the whole text. These are then ranked in descending
order of length. The 10 longest named entities are submitted as-is as queries
to the search engine. As noted by Elizalde [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the rationale is that
named entities are unlikely to change even if paraphrasing has been used to
obfuscate plagiarism. Additionally, longer named entities are likely to contain
more specific information, and are thus more likely to yield true-positive results.
      </p>
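      <p>The ranking and selection step can be sketched as follows; entity extraction itself (done with NLTK's chunker in our pipeline) is assumed to have already produced a list of entity strings.</p>
      <preformat>
```python
def top_entity_queries(named_entities, k=10):
    """Rank extracted named entities by character length, longest first,
    and keep the k longest to submit verbatim as search queries.
    Deduplication before ranking is an assumption."""
    ranked = sorted(set(named_entities), key=len, reverse=True)
    return ranked[:k]
```
      </preformat>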
    </sec>
    <sec id="sec-2">
      <title>1 nltk 3.2.1 http://www.nltk.org/</title>
      <p>
        Chunk-based queries: Each sentence in each paragraph is tokenized using NLTK's
Punkt sentence tokenizer, and the NLTK pre-trained part-of-speech tagger is
used to tag all the tokens. All stopwords are removed, and only verbs, nouns, and
adjectives are retained. The WordNet lemmatizer is used to stem the words.
Stemming is done last, as it may affect the identification of named entities. Queries
are formed by concatenating sequences of tokens to form disjoint sequential
10-grams. The first three 10-grams from each paragraph are submitted to the
ChatNoir search engine. The top three results are returned for each query.
Download Filtering: The ChatNoir search engine [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] allows one to request a
snippet, of up to five hundred characters, of a specific document. The snippet
is based on a query that is submitted along with the request. The original
query that returned the result is used for requesting snippets, and we request
snippets of the maximum length. Documents are only downloaded if they are
deemed similar to the suspicious document, based on their snippet. The
similarity between the snippet of each document and the suspicious document
is calculated using a method suggested by Broder et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] based on word
n-grams, also known as shingles. The 5-shingles, overlapping sequences of 5 tokens,
are extracted from the document and each downloaded snippet. The similarity
between a snippet s and a suspicious document d is then calculated as:
      </p>
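      <p>The 10-gram query formation described above can be sketched as follows; the input is assumed to be the already-filtered, lemmatized token list of one paragraph, and dropping a trailing partial 10-gram is our assumption rather than something stated in the text.</p>
      <preformat>
```python
def make_queries(tokens, n=10, max_queries=3):
    """Concatenate filtered tokens into disjoint sequential 10-grams and
    keep the first three as queries for the paragraph."""
    queries = []
    for i in range(0, len(tokens), n):
        gram = tokens[i:i + n]
        if len(gram) == n:              # assumption: drop a trailing partial n-gram
            queries.append(' '.join(gram))
        if len(queries) == max_queries:
            break
    return queries
```
      </preformat>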
      <p>Sim(s, d) = |S(s) ∩ S(d)|,
where S(x) denotes the set of shingles of x. The results are then ranked by the similarity measure
of their snippet.</p>
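      <p>A minimal sketch of this shingle-based filter, scoring a snippet by the number of 5-shingles it shares with the suspicious document:</p>
      <preformat>
```python
def shingles(tokens, n=5):
    """Overlapping word n-grams (shingles) of a token sequence."""
    return set(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def snippet_similarity(snippet_tokens, doc_tokens):
    """Count of 5-shingles shared between a snippet and the suspicious
    document; candidate results are ranked by this score before download."""
    return len(shingles(snippet_tokens).intersection(shingles(doc_tokens)))
```
      </preformat>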
      <p>The figure (see Algorithm 1) shows the algorithm for our source retrieval
approach. The implementation of our algorithm is publicly available through
PAN's online code repository.2</p>
    </sec>
    <sec id="sec-3">
      <title>2 https://github.com/pan-webis-de/maluleka16</title>
      <p>Algorithm 1 Source Retrieval Approach
1: procedure SourceRet(text)
2:   NEs ← getNamedEntities(text)
3:   for all result in NEs do
4:     snippet ← getSnippet(result)
5:     if similarity(snippet) ≥ minSim then
6:       if result ∉ sources then
7:         sources ← sources ∪ Download(result)
8:       end if
9:     end if
10:  end for
11:  paragraphs ← splitText(text)
12:  for all p in paragraphs do
13:    p ← preprocess(p)
14:    queries ← extractTopQueries(p)
15:    for all q ∈ queries do
16:      results ← results ∪ submitQueries(q)
17:    end for
18:    results ← rankResults(results)
19:    for all result in results do
20:      snippet ← getSnippet(result)
21:      if similarity(snippet) ≥ minSim then
22:        if result ∉ sources then
23:          sources ← sources ∪ Download(result)
24:        end if
25:      end if
26:    end for
27:  end for
28: end procedure</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , H.-H.,
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <article-title>Unsupervised Ranking for Plagiarism Source Retrieval</article-title>
          .
          <source>Notebook for PAN at CLEF</source>
          <year>2013</year>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Elizalde</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Using statistic and semantic analysis to detect plagiarism in CLEF</article-title>
          (Online Working Notes/Labs/Workshop) (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          ,
          <volume>1613</volume>
          <fpage>0073</fpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graßegger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Welsch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>ChatNoir: A Search Engine for the ClueWeb09 Corpus</article-title>
          in
          <source>35th International ACM Conference on Research and Development in Information Retrieval (SIGIR '12)</source>
          (eds Hersh,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            &amp;
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          )
          (ACM
          , Aug.
          <year>2012</year>
          ),
          <fpage>1004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Broder</surname>
            ,
            <given-names>A. Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glassman</surname>
            ,
            <given-names>S. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manasse</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Zweig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Syntactic clustering of the web</article-title>
          .
          <source>Computer Networks and ISDN Systems</source>
          <volume>29</volume>
          ,
          <fpage>1157</fpage>
          –
          <lpage>1166</lpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>