<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Source Retrieval and Text Alignment Corpus Construction for Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kong Leilei</string-name>
          <email>kongleilei1979@hotmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lu Zhimao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Yong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Haoliang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Zhongyuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wang Qibo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Zhenyuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhang Jing</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Heilongjiang Institute of Technology; Harbin Engineering University; Harbin Institute of Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Source Retrieval in Plagiarism Detection</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>For the task of source retrieval, we focus on the process of download filtering. In the stages from chunking to search control we aim at high recall, and in the download filtering stage we strive to improve precision. A vote-based approach and a classification-based approach are combined to filter the search results and obtain the plagiarism sources. For the task of text alignment corpus construction, we describe the methods we used to construct the Chinese plagiarism cases. Finally, we report the statistics of the text alignment dataset submissions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Source Retrieval</title>
<p>Our strategy for the stages from chunking to search control aims at high recall: we submit as many queries as possible to the search engine and retain as many retrieval results as possible.</p>
<p>Chunking. Firstly, the suspicious texts are partitioned into segments of a single sentence each. In particular, we found that the suspicious documents often contain headings. A line is treated as a heading if it is preceded and followed by empty lines and contains fewer than 10 words. For suspicious documents where no sources were retrieved, we tried using only the headings as queries, but the sources were still not discovered. The headings are therefore merged into their adjacent sentences.</p>
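<p>The heading heuristic above can be sketched as follows (a minimal sketch; the exact handling of line boundaries is an assumption):</p>

```python
def is_heading(lines, i, max_words=10):
    """Heuristic from the chunking step: a line is treated as a heading
    when it is surrounded by empty lines and has fewer than max_words words."""
    line = lines[i].strip()
    if not line:
        return False
    prev_empty = i == 0 or not lines[i - 1].strip()
    next_empty = i == len(lines) - 1 or not lines[i + 1].strip()
    return prev_empty and next_empty and max_words > len(line.split())
```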
<p>
        Keyphrase Extracting. After sentence splitting, each word in each paragraph is
tagged using the Stanford POS Tagger [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and only nouns and verbs are considered as
query keyphrases.
      </p>
<p>
        Query Formulation. A query is constructed from each sentence using k keywords, where k = 10. If a sentence contains more than 10 nouns and verbs, we retain only the 10 with the highest term frequencies; if it contains fewer, all of its nouns and verbs form the query. These queries are
submitted to the ChatNoir search engine [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to retrieve plagiarism sources.
      </p>
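<p>A sketch of the query formulation step, assuming the POS-tagged tokens (Penn Treebank tags) and the term frequencies are already available; the document-level scope of the term frequencies is our assumption:</p>

```python
def build_query(tagged_tokens, doc_tf, k=10):
    """Keep only nouns and verbs (NN*/VB* tags) and, if more than k remain,
    retain the k terms with the highest term frequency."""
    terms = [w.lower() for w, tag in tagged_tokens
             if tag.startswith(("NN", "VB"))]
    # deduplicate while preserving sentence order
    seen, unique = set(), []
    for t in terms:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    if len(unique) > k:
        unique = sorted(unique, key=lambda t: -doc_tf.get(t, 0))[:k]
    return " ".join(unique)
```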
<p>Search Control. Since each query is generated from a single sentence, it represents the topic of that sentence, which may stray from the subject of the plagiarism segment the sentence comes from. As a result, many true plagiarism sources are ranked low. Therefore, for each query, we keep the top 100 results. This tactic gives us higher recall before download filtering.</p>
<p>Download Filtering. There can be no argument that the number of retrieved results has a large effect on performance: increasing it raises recall but lowers precision. During keyword extraction we have very little information beyond the content of the suspicious document and its text chunks, so submitting more queries may be the best choice if retrieval cost is ignored. After retrieval, however, we obtain abundant information: various similarity scores between query and document; the document's length in words, sentences, and characters; the snippet (we requested snippets of 500 characters); and so on. By exploiting the retrieval results and the metadata returned by the ChatNoir API, we design a two-step download filtering algorithm.</p>
<p>As is known, the evaluation of source retrieval computes recall, precision, and fMeasure over the downloaded documents, so before applying our download filtering algorithm we first filter some of the retrieval results. We assume that queries originating from the same plagiarism segment of a suspicious document can retrieve the same plagiarism sources; for one suspicious document, the same results will therefore occur many times. The underlying assumption is that the more plausible a plagiarism source is, the more votes it is likely to receive from the different queries of the suspicious document. We thus use a simple vote algorithm to assign a weight to each document in the retrieval result set: whenever a document is retrieved by a query, its weight is increased by 1. We also tried a weighted vote that gives higher-ranked documents greater weight, but it did not outperform the simple vote.</p>
<p>After the vote algorithm is applied, the voted results are regarded as candidate plagiarism sources. If the result list contains fewer than 20 entries, we instead choose the top 50 results by vote count as the candidates.</p>
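<p>The vote step and the candidate fallback rule can be sketched as follows (the threshold of 8 is the value we submitted; the exact form of the fallback rule is our reading of the text):</p>

```python
from collections import Counter

def vote_filter(query_results, threshold=8, min_candidates=20, fallback_top=50):
    """Each query that retrieves a document adds 1 to its weight.
    Documents whose vote reaches `threshold` become candidates; if fewer
    than `min_candidates` survive, the `fallback_top` highest-voted
    documents are used instead."""
    votes = Counter()
    for results in query_results:   # one result list per query
        for doc_id in set(results):  # count each document once per query
            votes[doc_id] += 1
    candidates = [d for d, v in votes.most_common() if v >= threshold]
    if len(candidates) >= min_candidates:
        return candidates
    return [d for d, _ in votes.most_common()[:fallback_top]]
```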
<p>
        Table 1 shows the performance of source retrieval when only the vote approach is used to filter the retrieval results; PAN calls this run Han15 in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Experiments were
performed on the source retrieval training dataset pan14-source-retrieval-training-corpus-2014-12-01, which contains 98 suspicious documents. The numbers in the column headers indicate the vote threshold, and the row headers are the evaluation measures of source retrieval. We chose a vote threshold of 8 when submitting our source retrieval software to PAN.
      </p>
<p>Table 1. Source retrieval performance for different vote thresholds (row headers: fMeasure, Recall, Precision, Queries, Downloads).</p>
<p>
        The data in Table 1 were produced by our own evaluation detector, which was
designed according to Ref. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, we implemented only the first of the two ways of determining true positive detections, because we did not know which algorithm was used to extract the sets of plagiarism passages on which the containment relationship is computed.
      </p>
<p>
        In last year's evaluation, Williams et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a filtering approach that views the filtering of candidate plagiarism sources as a classification problem: a supervised method based on LDA (Linear Discriminant Analysis) learns a model that decides, before downloading, which candidate plagiarism sources are positive detections. This year we followed their idea and added four new features: document-snippet word 2-gram, 3-gram, 4-gram, and 8-gram intersection. The sets of word 2-, 3-, 4-, and 8-grams of the suspicious document and of the snippet are extracted separately, and the common n-grams are counted. We chose an SVM as our classification model, using the open-source tool SVMlight (http://www.cs.cornell.edu/People/tj/svm_light/) as the classifier. We tuned only the parameter c, on a training set constructed according to Ref. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. After voting, all results judged positive by the classifier are downloaded; the vote strategy follows Han15. This approach based on vote and classification is called Kong15 by PAN in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
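<p>The four added features can be sketched as follows (a minimal sketch; whitespace tokenization, lowercasing, and the function names are assumptions):</p>

```python
def ngram_set(text, n):
    """Set of word n-grams of a whitespace-tokenized, lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def snippet_ngram_features(suspicious_text, snippet, sizes=(2, 3, 4, 8)):
    """Document-snippet word n-gram intersection features: the number of
    word n-grams shared by the suspicious document and the result snippet,
    for n = 2, 3, 4, and 8."""
    feats = {}
    for n in sizes:
        common = ngram_set(suspicious_text, n).intersection(ngram_set(snippet, n))
        feats[f"{n}gram_intersection"] = len(common)
    return feats
```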
<p>Using the Source Oracle, we filtered our results; the final log file reports the filtered source retrieval results. Table 2 shows the results obtained with the classification tactics.</p>
      <p>Text Alignment Corpus Construction</p>
<p>For the task of text alignment corpus construction, we submitted a corpus containing 7 plagiarism cases, constructed from real plagiarism.</p>
<p>Firstly, we recruited 10 volunteers to each write a paper on a topic we proposed, and chose 7 of the 10 essays for our submission. Table 4 lists the topics.</p>
<p>For each essay, we required at least ten thousand Chinese characters. The volunteers retrieved content related to the subject using a specified search engine, namely Baidu, and wrote their papers. The number of sources was not limited.</p>
<p>The papers were then submitted to a well-known Chinese plagiarism detection system used in many Chinese colleges and universities, which detects plagiarism with fingerprint technology. Next, the volunteers modified the content flagged by the system. The modification tactics include adjusting word order, replacing words, and paraphrasing. Whatever tactic they adopted, they had to ensure that the revised paper remained readable and consistent with the original paper's meaning. The papers were repeatedly revised and resubmitted until the software could no longer detect any plagiarism, and the final modified papers were submitted to PAN as the text alignment corpus.</p>
<p>Table 4. Topics of the suspicious documents:
suspicious-document00000: Campus Second-hand Book Trade
suspicious-document00001: Online Examination
suspicious-document00002: Online Examination
suspicious-document00003: Second-hand Car Trade
suspicious-document00004: Automobile 4S Shop
suspicious-document00005: Multimedia Material Management Library
suspicious-document00006: Driving license exam
suspicious-document00007: Supermarket Management System</p>
      <p>The statistics of the corpus are shown in Table 5.</p>
<p>
        We peer-reviewed the PAN 2015 text alignment dataset submissions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; the statistics of the reviewed corpora are shown in Table 6.
      </p>
<p>Tables 5 and 6 report the following corpus characteristics: number of suspicious documents, number of source documents, average length of suspicious documents, average length of source documents, average length of plagiarism cases, number of plagiarism cases, and Jaccard coefficient of the plagiarism cases.</p>
      <p>Acknowledgments. This work is supported by the Youth National Social Science Fund of China (No. 14CTQ032), the National Natural Science Foundation of China (No. 61272384), and the Heilongjiang Province Educational Committee Science Foundation (No. 12541649, No. 12541677).</p>
      <p>Remark This work was done in Heilongjiang Institute of Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Matthias Hagen, Anna Beyer, Matthias Busse, Martin Tippmann, Paolo Rosso, Benno Stein:
          <article-title>Overview of the 6th International Competition on Plagiarism Detection</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2014</year>
          :
          <fpage>845</fpage>
          -
          <lpage>876</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL '03</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>180</lpage>
          (May
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Matthias Hagen, Benno Stein, Jan Graßegger, Maximilian Michel,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Tippmann</surname>
          </string-name>
          , and Clement Welsch.
          <article-title>ChatNoir: A Search Engine for the ClueWeb09 Corpus</article-title>
          . In Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors,
          <source>35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12)</source>
          , pages
          <fpage>1004</fpage>
          ,
          <year>August 2012</year>
          .
          <source>ACM. ISBN 978-1-4503-1472-5.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2015 Evaluation Labs, CEUR Workshop Proceedings</source>
          ,
          <year>September 2015</year>
          .
          <article-title>CLEF and CEUR-WS.org</article-title>
          .
          <source>ISSN 1613-0073.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Supervised Ranking for Plagiarism Source Retrieval-Notebook for PAN at CLEF</article-title>
          <year>2014</year>
          .
          15-18 September, Sheffield, UK. CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2014</year>
          ), http://www.clefinitiative.eu/publication/working-notes.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Williams</surname>
            <given-names>K</given-names>
          </string-name>
          ,
<string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
<article-title>Classifying and ranking search engine results as potential sources of plagiarism</article-title>
          .
          <source>In: Proceedings of the 2014 ACM Symposium on Document Engineering</source>
          . ACM,
          <year>2014</year>
          :
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Matthias Hagen, Steve Göring, Paolo Rosso, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Towards Data Submissions for Shared Tasks: First Experiences for the Task of Text Alignment</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2015 Evaluation Labs, CEUR Workshop Proceedings</source>
          ,
          <year>September 2015</year>
          .
          <article-title>CLEF and CEUR-WS.org</article-title>
          .
          <source>ISSN 1613-0073.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>