<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Source Retrieval Based on Learning to Rank and Text Alignment Based on Plagiarism Type Recognition for Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kong Leilei</string-name>
          <email>kongleilei1979@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Yong</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Zhongyuan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Haihao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wang Qibo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhang Tinglei</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Haoliang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Harbin Engineering University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Harbin Institute of Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Heilongjiang Institute of Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>973</fpage>
      <lpage>976</lpage>
      <abstract>
        <p>This paper treats query keyword selection in source retrieval as the problem of learning a ranking model that chooses among keyword extraction methods for each segment of a suspicious document. Four basic extraction methods feed the ranking function: BM25, TFIDF, TF and EW. A ranking model based on Ranking SVM then ranks the candidate query keyword groups so that the group contributing most to a higher evaluation measure F is selected; achieving the best source retrieval measure F is the learning-to-rank target. For text alignment, a novel method based on a plagiarism type recognition model is proposed. This approach applies distinct strategies to detect plagiarized text according to the plagiarism type. The plagiarism type recognition model is based on a logistic regression model. Experimental results on the PAN 2014 plagiarism detection corpus indicate the effectiveness of the proposed methods.</p>
      </abstract>
      <kwd-group>
        <kwd>plagiarism detection</kwd>
        <kwd>source retrieval</kwd>
        <kwd>text alignment</kwd>
        <kwd>ranking model</kwd>
        <kwd>plagiarism type recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Source Retrieval Based on Learning to Rank</title>
      <p>There have been many efforts toward keyword extraction from text in general. In
contrast, there is less work on query keyword extraction for source retrieval in
plagiarism detection, and methods based on machine learning are still rarely used.</p>
      <p>
        This year, we aim at improving the evaluation measure F, which is defined in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our method treats keyword extraction as a ranking problem: a ranking model is
learned to rank the keyword groups selected by various keyword extraction methods, and
it decides which group contributes most to improving measure F. At the same
time, the ranking model can incorporate more features of the query keywords, so that
the keywords are described from more aspects.
      </p>
      <p>
        The training cases for learning the ranking model are constructed from the corpus
of the PAN@CLEF2012 detailed comparison task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A ranking model based on
Ranking SVM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is trained to select the better query keyword group among those extracted
by several keyword extraction methods. A better query keyword group is one that
is more conducive to a higher performance measure F. The basic
candidate keyword extraction methods are BM25, TFIDF, TF and EW [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Ranking SVM is used as the learning to rank algorithm. Statistical features, such
as TF, TFIDF and BM25, are used to describe the keywords.
      </p>
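      <p>The following is a minimal sketch of the pairwise idea behind Ranking SVM, not the implementation used in our system: each pair of keyword groups from the same segment with different F values becomes a difference vector, and a linear model is fitted with a hinge loss so that the better group scores higher. Plain subgradient descent stands in for the SVM solver of [3], and the three-dimensional feature vectors, the F values and all function names are hypothetical illustrations.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def pairwise_differences(groups):
    """Turn the (features, f_score) items of one segment into preference pairs.

    Each returned vector is x_better - x_worse, so a good weight vector w
    should give it a positive score.
    """
    diffs = []
    for xi, fi in groups:
        for xj, fj in groups:
            if fi > fj:
                diffs.append([a - b for a, b in zip(xi, xj)])
    return diffs


def train_pairwise_ranker(segments, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Minimise a regularised pairwise hinge loss by subgradient descent."""
    rng = random.Random(seed)
    diffs = [d for groups in segments for d in pairwise_differences(groups)]
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        rng.shuffle(diffs)
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            # L2 shrinkage on every step, plus a push along d
            # whenever the pair violates the unit margin.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w


def score(w, x):
    """Linear ranking score of one keyword group."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical training data: for each segment, one (features, F) item per
# extraction method. The features are illustrative only,
# e.g. [mean TF, mean TFIDF, mean BM25] of the keywords in the group.
TRAIN_SEGMENTS = [
    [([0.2, 0.9, 0.8], 0.60), ([0.7, 0.3, 0.2], 0.20), ([0.5, 0.5, 0.4], 0.35)],
    [([0.1, 0.8, 0.9], 0.55), ([0.8, 0.2, 0.1], 0.10)],
]

weights = train_pairwise_ranker(TRAIN_SEGMENTS)
```
]]></preformat>
      <p>At test time, a model of this kind is applied by scoring every candidate keyword group of a segment with score(weights, features) and keeping the highest-scoring group.</p>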
      <p>During the test period, the suspicious document is first partitioned into text
segments of 5 sentences each. Then, we use the basic keyword extraction
methods to select candidate query keywords. Next, the features of each query keyword
selected by each basic keyword extraction method are computed. Lastly,
the keyword ranking model chooses the better query keyword group for each
text segment of the suspicious document.</p>
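      <p>The test-time steps above can be sketched as follows. This is an illustration under stated assumptions rather than our implementation: the regular-expression sentence splitter, the tiny stopword list, the function names and the linear scoring of feature vectors are all hypothetical, and only the TF extraction method is shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on"}


def split_segments(text, sentences_per_segment=5):
    """Partition a document into consecutive segments of 5 sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_segment])
        for i in range(0, len(sentences), sentences_per_segment)
    ]


def tf_keywords(segment, top_n=10):
    """Return the top_n most frequent non-stopword terms (the TF method)."""
    tokens = [t for t in re.findall(r"[a-z]+", segment.lower()) if t not in STOPWORDS]
    return [term for term, _ in Counter(tokens).most_common(top_n)]


def choose_group(candidate_groups, weights):
    """Pick the keyword group whose feature vector scores highest.

    candidate_groups maps an extraction method name to a pair
    (keywords, feature_vector); weights is a linear ranking model.
    """
    def linear_score(item):
        _, (_, features) = item
        return sum(w * x for w, x in zip(weights, features))

    method, (keywords, _) = max(candidate_groups.items(), key=linear_score)
    return method, keywords
```
]]></preformat>
      <p>For example, split_segments applied to a document of seven sentences yields one segment of five sentences and one of two; how a short final segment should be treated is not specified here and is left as is in the sketch.</p>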
      <p>
        For source retrieval, we used the ChatNoir [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] search engine API.
Queries are constructed by combining each non-overlapping group of k keywords
selected by the ranking model, where k = 10, which creates a set of queries for each
segment. Only the top 3 results are downloaded. Then, each segment is used as a
query against an index built from all the downloaded documents, and
the top 1 result is reported as a final result. In addition, our method uses a voting step that
requires no document downloads: if k queries retrieve the
same result, that result is also reported as a final result. For this voting threshold, we set k = 6.
      </p>
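      <p>A minimal sketch of the query construction and the voting step is given below. It makes no calls to the ChatNoir API; the result lists are passed in as plain lists of URLs. The function names are hypothetical, and two details are assumptions of the sketch: a final chunk with fewer than 10 keywords is still issued as a query, and the voting threshold is read as "at least 6 queries".</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def build_queries(ranked_keywords, k=10):
    """Split the selected keywords into non-overlapping queries of k keywords.

    Assumption: a final chunk shorter than k is kept as a query.
    """
    return [
        " ".join(ranked_keywords[i:i + k])
        for i in range(0, len(ranked_keywords), k)
    ]


def vote(results_per_query, threshold=6):
    """Return the URLs retrieved by at least `threshold` different queries.

    results_per_query holds one list of result URLs per issued query.
    """
    counts = Counter()
    for urls in results_per_query:
        counts.update(set(urls))  # count each URL at most once per query
    return sorted(url for url, n in counts.items() if n >= threshold)
```
]]></preformat>
      <p>With 25 selected keywords, build_queries produces three queries of 10, 10 and 5 keywords; vote then reports every URL shared by six or more of the result lists without any download.</p>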
      <p>The test result on the source retrieval test corpus2 is shown in Table 1.</p>
    </sec>
    <sec id="sec-2">
      <title>Text Alignment Based on Plagiarism Type Recognition</title>
      <p>The objective of text alignment in plagiarism detection is to find the suspected
plagiarized fragments in a suspicious document together with their sources.</p>
      <p>
        Plagiarism can be divided into many categories according to the
means of plagiarism [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, existing methods do not distinguish between plagiarism
types: they detect plagiarism cases belonging to different plagiarism
types with the same method, which makes it difficult to find a balance
among the types. If we can identify the plagiarism type before we
align the plagiarized text, we can handle different plagiarism types with different
methods to improve performance.
      </p>
      <p>From this perspective, we propose a novel method based on plagiarism type
recognition for this year’s text alignment task. During the training period, the gold
standard of the PAN@CLEF 2012 detailed comparison training corpus is used
to train the Plagiarism Type Recognition Model, which is based
on a Logistic Regression model. The plagiarism types are grouped into two categories:
obfuscation and no-obfuscation. The main lexical features of the plagiarized text are
Dice Coefficient, Jaro Distance, Jaccard Coefficient, Levenshtein Distance,
Manhattan Distance and Ngram Distance.</p>
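      <p>As an illustration of such a type recognizer, the sketch below computes three of the six lexical features (Dice Coefficient, Jaccard Coefficient and a normalised Levenshtein similarity) and fits a logistic regression by batch gradient descent. It is not our trained model: the word-level tokenisation, the normalisation of the edit distance, the training vectors, the hyperparameters and all names are assumptions made for the example, with label 1 standing for obfuscation and 0 for no-obfuscation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def jaccard(a, b):
    """Jaccard coefficient over the word sets of two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def dice(a, b):
    """Dice coefficient over the word sets of two strings."""
    sa, sb = set(a.split()), set(b.split())
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 1.0


def levenshtein(a, b):
    """Character-level edit distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def features(suspicious, source):
    """Feature vector for one pair of suspicious and source fragments."""
    longest = max(len(suspicious), len(source)) or 1
    return [
        dice(suspicious, source),
        jaccard(suspicious, source),
        1.0 - levenshtein(suspicious, source) / longest,  # normalised similarity
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, epochs=2000, lr=1.0):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b


def predict_obfuscated(model, x):
    """True if the model assigns the fragment pair to the obfuscation class."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


# Hypothetical feature vectors of the kind features() produces:
# near-verbatim pairs (label 0) and heavily rewritten pairs (label 1).
TRAIN_X = [[0.95, 0.90, 0.97], [1.0, 1.0, 1.0], [0.40, 0.25, 0.50], [0.30, 0.20, 0.40]]
TRAIN_Y = [0, 0, 1, 1]
model = train_logistic(TRAIN_X, TRAIN_Y)
```
]]></preformat>
      <p>A new fragment pair would be classified with predict_obfuscated(model, features(suspicious_text, source_text)). Jaro Distance, Manhattan Distance and Ngram Distance are omitted only to keep the sketch short.</p>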
      <p>
        During the test period, the suspicious document and the source document are
first compared to extract initial plagiarism fragments using the method we developed for PAN
2013 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and then the Plagiarism Type Recognition Model is used to recognize their
plagiarism types. Finally, the pair of suspicious and source documents is compared
again using the text alignment method we proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The only
difference is that the parameters are adjusted according to the recognized plagiarism type.
      </p>
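      <p>The two-pass procedure can be outlined as below. This is only a structural sketch: the alignment routine and the type recognizer are passed in as functions, the parameter names and values are placeholders rather than the settings we used, and returning an empty result when the first pass finds nothing is an assumption of the sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical parameter sets; the names and values are placeholders,
# not the settings used in the paper.
PARAMS_BY_TYPE = {
    "no-obfuscation": {"min_similarity": 0.9, "max_gap": 1},
    "obfuscation": {"min_similarity": 0.5, "max_gap": 4},
}
DEFAULT_PARAMS = {"min_similarity": 0.7, "max_gap": 2}


def two_pass_alignment(suspicious, source, align, recognize_type):
    """Align once, recognise the plagiarism type, then align again.

    align(suspicious, source, params) returns a list of fragment pairs and
    recognize_type(fragments) returns a key of PARAMS_BY_TYPE; both are
    supplied by the caller.
    """
    initial = align(suspicious, source, DEFAULT_PARAMS)
    if not initial:
        return []  # nothing detected, so there is no type to recognise
    plagiarism_type = recognize_type(initial)
    return align(suspicious, source, PARAMS_BY_TYPE[plagiarism_type])
```
]]></preformat>
      <p>Keeping the alignment routine unchanged between the two passes mirrors the description above, in which only the parameters depend on the recognized type.</p>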
      <p>Table 2 shows the results of the PAN@CLEF2014 Text Alignment subtask on test
corpus 2 and test corpus 3.</p>
      <p>In this paper, we describe the approaches we used in the Source Retrieval
and Text Alignment subtasks of PAN@CLEF 2014.</p>
      <p>In the Source Retrieval subtask, we applied a method based on learning to rank.
We designed a model based on Ranking SVM to select, among the keyword groups
extracted by different keyword extraction methods, the group that achieves better
performance on the evaluation measure F.</p>
      <p>In the Text Alignment subtask, we designed a method based on a Logistic
Regression model to identify the plagiarism type. Text alignment
algorithms with different parameters are then applied in the detailed comparison to detect
plagiarism of the various types.</p>
      <p>We feel this is more of a beginning than an end for the development of our two methods. More
features and keyword extraction approaches will be used in our query keyword
extraction ranking model, and the Plagiarism Type Recognition Model will be trained to
identify more kinds of plagiarism.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (61370170),
the National Social Science Fund Project - Youth Project of China (14CTQ032) and
the Heilongjiang Province Educational Committee Science Foundation (12541649,
12541677).</p>
    </sec>
    <sec id="sec-4">
      <title>Remark</title>
      <p>This work was done at Heilongjiang Institute of Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Potthast</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            <given-names>T</given-names>
          </string-name>
          , et al.
          <article-title>Overview of the 5th International Competition on Plagiarism Detection</article-title>
          .
          <source>CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 23-26 September, Valencia, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. http://pan.webis.de/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joachims</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM</source>
          ,
          <year>2002</year>
          :
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Lee</given-names>
            <surname>Gillam</surname>
          </string-name>
          .
          <article-title>Guess Again and See if They Line Up: Surrey's Runs at Plagiarism Detection-Notebook for PAN at CLEF 2013</article-title>
          .
          <source>Working Notes Papers of the CLEF 2013 Evaluation Labs</source>
          , September
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            <given-names>B</given-names>
          </string-name>
          , et al.
          <article-title>ChatNoir: a search engine for the ClueWeb09 corpus</article-title>
          .
          <source>Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM</source>
          ,
          <year>2012</year>
          :
          <fpage>1004</fpage>
          -
          <lpage>1004</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Alzahrani</surname>
            <given-names>S M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salim</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abraham</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>Understanding plagiarism linguistic patterns, textual features, and detection methods</article-title>
          .
          <source>Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on</source>
          ,
          <year>2012</year>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ):
          <fpage>133</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
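          <string-name>
            <given-names>Leilei</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haoliang</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cuixia</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhongyuan</given-names>
            <surname>Han</surname>
          </string-name>
          .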
          <article-title>Approaches for Source Retrieval and Text Alignment of Plagiarism Detection-Notebook for PAN at CLEF 2013</article-title>
          .
          <source>Working Notes Papers of the CLEF 2013 Evaluation Labs</source>
          , September
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>