<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approaches for Source Retrieval and Text Alignment of Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kong Leilei</string-name>
          <email>kongleilei1979@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Haoliang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Du Cuixia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wang Mingxing</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Zhongyuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>1Heilongjiang Institute of Technology, China 2Harbin Engineering University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe our approach at the PAN@CLEF2013 plagiarism detection competition. In sub-task of Source Retrieval, a method combined TF-IDF, PatTree and Weighted TF-IDF to extract the keywords of suspicious documents as queries to retrieve the plagiarism source document is proposed. In sub-task of Text Alignment, a method based on sentence similarity is presented. Our text alignment algorism and similar sentences merging algorism, called Bilateral Alternating Merging Algorithm, are described in detail.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The great development of Internet makes it easier for people to search, copy, save,
and reuse online sources. Copying another author’s text and claiming its authorship is
called plagiarism[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. During the last decade, automated plagiarism detection in natural
languages have attracted considerable attention from research and industry, which
takes the advantage of recent developments in related fields like information retrieval,
cross-language information retrieval, natural language processing, machine learning,
and artificial intelligence. PAN@CLEF is dedicated to providing an environment
which consists of a large scale corpus of artificial plagiarism and detection quality
measures to evaluate the algorithms of plagiarism detection. There are two sub-tasks
in PAN@CLEF2013: source retrieval and text alignment. The remaining sections of
this paper introduce the methods we have taken in this year’s competition.
      </p>
      <p>
        The task of source retrieval is to retrieve all plagiarized sources while minimizing
retrieval costs[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. One document plagiarizes another document by simple cut–paste
manipulations, minor or wholesale alternations and more ambiguity rewriting. One of
the difficulties of efficiently detecting plagiarism source is to search the source in
millions of documents from the Internet, where each document usually involves
thousands of words. Another core problem of source retrieval is the keywords of
suspicious document which would be used for retrieval are not specified. How to
extract the keywords from the suspicious document as the queries
is the significant challenge when applying the ChatNoir search engine API to retrieve
the plagiarized sources. This year, we use the keywords extraction methods based on
statistics, and combine with the three keyword extraction methods, which are based
on TF-IDF, Pat Tree and Weighted TF-IDF. The following section describes the
construction of these queries.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Keywords Based on TF-IDF</title>
      <p>
        TF–IDF[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], term frequency–inverse document frequency, is a numerical statistic
which reflects how important a word is to a document in a collection or corpus. The
term frequency is the number of occurrences of each term in a document. The inverse
document frequency is a function of the number of document where a term took place.
The method of keywords extraction based on TF-IDF chooses Wall Street Journal
data as Corpus. After pre-treatment (removal of stop words), we calculate the TF-IDF
value of each term for every suspicious document. Then, the terms which have the
higher TF-IDF values are composed simply as the queries. The experiments show that
we can obtain a better result on the measures of the number of queries until the first
actual source is found and the number of downloads until the first actual source is
downloaded when we make the only 10 terms with the highest TF-IDF value as
queries. The experimental results are shown in section 2.5.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.2 Keywords Based on PatTree</title>
      <p>
        Pat tree is an efficient data structure successfully used in the area of keywords
extraction in the field of information retrieval. It was developed by Gonnet[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] from
Morrison’s PATRICIA algorithm (Practical Algorithm to Retrieve Information Coded
in Alphanumeric)[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for indexing a continuous data stream and locating every possible
position of a prefix in the stream. The PAT tree is conceptually equivalent to
compressed digital search tree but smaller. Using this data structure, all possible
character strings, including their frequency counts in the documents, can be got in a
very efficient way. The different lengths of the character strings are used as queries in
our method of source retrieval.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.3 Keywords Based on Weighted TF-IDF</title>
      <p>We believe that the words exist in the paragraph title and chapter headings contain
more information. We give each term a weight according to its position. The final
weighted TF-IDF value of each term is calculated by its TF-IDF value timing its
weight. The weight is set to 3.6 if the term appears in the title of the article, while the
weight is 2.5 when the term locates the first sentence or the last sentence of each
paragraph.</p>
    </sec>
    <sec id="sec-5">
      <title>2.4 Combination of Queries and Execution of Retrieval</title>
      <p>All queries from a given suspicious document are extracted on the basis of the
above methods and they are executed sequentially according to their priority.</p>
      <p>Firstly, we extracte the top 20 words in the suspicious document according to their
TF-IDF weights. The top 5 ranked keywords combine into the first query, and the
next top 5 ranked keywords form the second query, and so on. In this manner, we get
four groups of queries, and each query contains 5 terms.</p>
      <p>Secondly, we get many groups of keywords based on PatTree which consist
of different numbers of terms. We choose the keywords which composed by 2, 3, 4
and 5 terms and rank them by their sum of TF-IDF value and the frequency counting
in the suspicious documents. We call them 2-gram, 3-gram, 4-gram and 5-gram
PatTree keywords. One 2-gram PatTree keyword, one 3-gram PatTree keyword and
two 4-gram PatTree keywords are regards as following four queries. Especially,
5gram PatTree keywords are chosen to all the other queries. We discard the keywords
whose the number of nouns and verbs under threshold 3. The queries which have no
any term existing in the list of top 10 weighted TF-IDF also are filtered out by us.</p>
      <p>Table 1 shows the combination of queries.</p>
      <p>Rank
1 to 2
3
4
5 to 6
7 to 8
other</p>
    </sec>
    <sec id="sec-6">
      <title>Query type</title>
      <p>Top 10 TF-IDF
2-Gram PatTree Keyword
3-Gram PatTree Keyword
4-Gram PatTree Keyword</p>
      <p>Top 10 to 20 TF-IDF
5-Gram PatTree Keyword</p>
      <p>Table 1. Combination of queries.</p>
      <p>After committing all queries we extract from a suspicious document, we decide
whether to actually download the web pages or not according to the cosine similarity
between the snippet and the suspicious document. We extract about 40 queries for
each suspicious document.</p>
    </sec>
    <sec id="sec-7">
      <title>2.5 Result</title>
      <p>According to the five performance measures that PAN@CLEF2013 employs, the
more web pages are download, the better recall could be got. For example, if the
downloading number of web page for each document returned is set to 1000, the
recall can reach 0.5719. By setting this value 600 we could lower the total number of
web pages downloaded by 400 with only 2.05% of recall lost. To ensure the
evaluation measures of Number of queries submitted, Number of queries until the first
actual source is found and Number of downloads until the first actual source is
downloaded better in this evaluation, we try to improve the measure of precision and
recall of web pages downloaded regarding actual sources of a suspicious document,
and reduce the measure of Number of web pages downloaded.</p>
      <p>In the experiment, we have found that the queries constructed by top 10 TF-IDF
words got a better result. The experiment result without using snippet filter on the
training data of Source Retrieval is shown in Table 2.</p>
      <sec id="sec-7-1">
        <title>Time to 1st Detection</title>
      </sec>
      <sec id="sec-7-2">
        <title>Retrieved Sources</title>
      </sec>
      <sec id="sec-7-3">
        <title>Total workload Queries 80 Downloads 1289 Queries 1.225</title>
        <p>
          Given a pair of documents, the task of text alignment is to identify all contiguous
maximal-length passages of reused text between them[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].The subtask of Text
Alignment has been called Detailed Comparison or Detailed Analysis. The Detailed
Analysis method of plagiarism detection is summarized in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It can be divided into 3
steps: (1) seeding, (2) match merging and (3) extraction filtering. Given a suspicious
document and a source document, seeding refers to use seeds for matches. These
matches are called "seed". The purpose of match merging is to combine the seeds
obtained in step (1) which having a maximum length of text fragments. The extraction
filtering combined fragments to get the final plagiarism fragments according to
certain standards. Our method also uses the above-described process. The following
sections will describe our approach.
        </p>
        <p>Algorithm 1 describes the Text Alignment algorithm of plagiarism detection.
Algorithm 1. Text Alignment algorithm of the plagiarism detection
Input: Suspicious document dplg，the plagiarism alternative documentation set Dsrc
Output: The collection of the pairs(splg , ssrc) from the plagiarism fragment of the
suspicious document splg and the plagiarism source document fragment ssrc
Pre-Processing dplg and Dsrc
Feature represent dplg and Dsrc
for each dsrc ∈ Dsrc {</p>
        <p>Measure sentences similarity of dplg and dsrc
}
Merge similar sentences to similar passages
(splg , ssrc)← Post-Processing the pairs of similar passages
return the set of passage pairs (splg , ssrc)</p>
      </sec>
      <sec id="sec-7-4">
        <title>We described the process of seeding in [7].</title>
        <p>The result of seeding is that we get a number of scattered one-to-many sentence
pairs. We designed an algorithm called Bilateral Alternating Merging Algorithm to
combine these fragments into a larger fragment as plagiarism fragment. The
description of the algorithm is shown as follows.</p>
      </sec>
      <sec id="sec-7-5">
        <title>Algorithm2. Bilateral Alternating Merging Algorithm</title>
        <p>Input: Similar sentence list casesList
Output: The collection of the pairs(splg , ssrc) from the plagiarism fragment of the
suspicious document splg and the plagiarism source document fragment
ssrc
1 public void merger(List&lt;String&gt; casesList, int sign) {
2 if (sign = -1) {
3 leftSort(casesList);
4 if (end &gt;= leftEndge &amp;&amp; end &lt;= rightEndge) {
5 resultList.add(dto);
6 } else {
7 for (i = 1..casesList.size()) {
8 if (|now – last| &gt; dist) {
9 checkAdjacent();
10 merger();
11 }
12 }
13 } else {
14 rightSort(casesList);
15 if (endgeFirstInfs=endgeEndInfs &amp;&amp; end &gt;= leftEndge&amp;&amp; end &lt;=
16 rightEndge) {
17 resultList.add(dto);
18 } else {
19 for (i = 1..casesList.size()) {
20 if (|now – last| &gt; dist) {
21 checkAdjacent();
22 merger();
23 }</p>
        <p>
          In paper [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we discard passages whose word overlap under a threshold. This year,
we improve the method of post-processing. We apply a sliding window surrounding
the original answer to refine the final plagiarism fragment which has a larger Jaccard
coefficient value than the initial result we got. Due to time constraints, the parameters
are not fully adjusted, which makes our PlagDet scores increased only by 0.5%.
        </p>
        <p>The experiments are carried out in the PAN@CLEF2012 training data set. Table 4
shows the results of our Text Alignment sub-task which applied different parameters.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Sub-Corpus</title>
      <p>02_no_obfuscation
03_artificial_low
04_artificial_high
06_simulated_paraphrase
Overall</p>
      <p>In the sub-task of Source Retrieval, we apply a method based on TF-IDF, Weighted
TF-IDF and PatTree to extract the keywords to compose the queries of suspicious
documents. The number of queries submitted is reduced by two filters. The
performance measures of Number of queries until the first actual source is found and
Number of downloads until the first actual source is downloaded get the promotion by
combining and ranking the queries. To achieve better performance measure recall and
precision, we try to keep a balance between Number of web pages downloaded and
Precision and recall of web pages downloaded regarding actual sources of a
suspicious document.</p>
      <p>In the sub-task of Text Alignment, our method is more adaptable to obfuscation
plagiarism. We achieve the better performance in sub-corpus 03-random-obfuscation,
04-translation-obfuscation and 05-summary-obfuscation. But the performance for
sub-corpus 02-no-obfuscation still remains to rise.</p>
      <p>There are many high-performance methods in dealing with no-obfuscation
plagiarism, therefore the current results show that it is possible to build up a classifier
based on different plagiarism types. Furthermore, we can use different methods to
deal with different plagiarism problems to obtain a better performance. Moreover,
more efforts are required to figure out more plagiarism features to be used in the
design of the plagiarism type classifier to improve the performance.</p>
      <p>Acknowledgements This work is supported by NSF of China(61272384).
Remark This work was done in Heilongjiang Institute of Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Potthast</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeno</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <article-title>Overview of the 3rd international competition on plagiarism detection[C].Notebook Papers of CLEF 2011 LABs and Workshops</article-title>
          .
          <year>2011</year>
          :
          <fpage>19</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] http://pan.webis.de/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>SALTON</surname>
            <given-names>G</given-names>
          </string-name>
          , CLEMENTT Y.
          <article-title>On the construction of effective vocabularies for information retrieval[C]</article-title>
          .
          <source>Proceedings of the 1973 Meeting on Programming Languages and Information Retrieval</source>
          New York: ACM,
          <year>1973</year>
          :
          <fpage>11</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Gaston</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Gonnet</surname>
          </string-name>
          , Ricardo A.
          <article-title>Baeza-yates, Tim Snider</article-title>
          .
          <article-title>New Indices for Text: Pat Trees</article-title>
          and
          <string-name>
            <given-names>Pat</given-names>
            <surname>Arrays</surname>
          </string-name>
          .
          <source>Information Retrieval Data Structures &amp; Algorithms</source>
          , Prentice Hall, pp.
          <fpage>66</fpage>
          -
          <lpage>82</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Morrison</surname>
            ,
            <given-names>D..</given-names>
          </string-name>
          <article-title>PATRICIA: Practical Algorithm to Retrieve Information Coded in Alphanumeric</article-title>
          . JACM, pp.
          <fpage>514</fpage>
          -
          <lpage>534</lpage>
          ,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel,Maximilian Michel,
          <article-title>Arnd Oberländer1 Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, Benno Stein. Overview of the 4th International Competition on Plagiarism Detection</article-title>
          .
          <article-title>CLEF 2012 Evaluation Labs</article-title>
          and Workshop-Working Notes Papers.
          <volume>17</volume>
          -20 September, Rome, Italy.
          <source>ISBN 978-88-904810-3-1. ISSN</source>
          <year>2038</year>
          -
          <volume>4963</volume>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Leilei</given-names>
            <surname>Kong</surname>
          </string-name>
          , Haoliang Qi, Shuai Wang, Cuixia Du,
          <string-name>
            <given-names>Suhong</given-names>
            <surname>Wang</surname>
          </string-name>
          , and Yong Han.
          <article-title>Approaches for Candidate Document Retrieval and Detailed Comparison of Plagiarism Detection-Notebook for PAN at CLEF 2012</article-title>
          . In Forner et al. [
          <volume>6</volume>
          ].
          <source>ISBN 978-88-904810-3-1</source>
          . URL.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>