<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UFRGS@PAN2010: Detecting External Plagiarism. Lab Report for PAN at CLEF 2010</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafael Corezola Pereira</string-name>
          <email>rcpereira@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviane P. Moreira</string-name>
          <email>viviane@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renata Galante</string-name>
          <email>galante@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS)</institution>
          <addr-line>Caixa Postal 15.064 - 91.501-970 - Porto Alegre - RS</addr-line>
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our approach to detecting plagiarism in the PAN'10 competition. To accomplish this task, we applied a method that aims at detecting external plagiarism cases. The method is specially designed to detect cross-language plagiarism and is composed of five phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. Our group ranked seventh in the competition with an overall score of 0.5175. The final score was affected by our low recall (0.4036), which resulted from not detecting the intrinsic plagiarism cases that were also present in the competition corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This paper describes our participation in the plagiarism detection task of the PAN
competition at CLEF 2010. To detect the plagiarism cases present in the
competition corpus, we used the method described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which focuses on detecting
plagiarism based on a reference collection. In particular, the method is specially designed
to detect cross-language plagiarism, which is also present in the competition corpus.
Thus, our task is to detect the plagiarized passages in the suspicious documents and
their corresponding text passages in the source documents.
      </p>
      <p>
        The method is composed of five phases: language normalization, retrieval of
candidate documents, classifier training, plagiarism analysis, and post-processing. Since
the method is also designed to detect cross-language plagiarism, an automatic
translation tool is used to translate the documents into a common language. A classification
algorithm is used to build a model that is able to differentiate a plagiarized text
passage from a non-plagiarized one. Note that the use of classification algorithms is
common in the area of intrinsic plagiarism analysis [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ], but not in the area of
external plagiarism analysis.
      </p>
      <p>Based on the text passages extracted from the suspicious documents, an
information retrieval (IR) system is used to retrieve the passages that are most likely to be the
source of plagiarism cases. This is an important phase, since a complete analysis of
each suspicious document against all the documents in
the reference collection would take a prohibitive amount of time. Only after the candidate passages of the
source documents are retrieved is the plagiarism analysis performed. Finally, a
post-processing technique is applied to the results in order to join the contiguous
plagiarized passages.</p>
      <p>The remainder of this paper is organized as follows: Section 2 presents the
employed method. Section 3 describes how training was done and shows the results
achieved in the competition. Finally, Section 4 presents our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2 The Method</title>
      <p>
        We briefly describe here the method used in the experiments.
A detailed description can be found in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The method is divided into
five main phases, each briefly described below:
        <list list-type="bullet">
          <list-item>
            <p>Language Normalization: at this phase, the documents in the collection are
translated into a default language so they can be analyzed in a uniform way.
English was chosen as the default language. A language guesser is used to
identify the documents that must be translated and an automatic translation tool is
used to translate them.</p>
          </list-item>
          <list-item>
            <p>Retrieval of Candidate Documents: at this phase, an information retrieval system
is used to retrieve, based on each suspicious document, the documents in the
source collection that are candidates for being the source of plagiarism. Before
the source documents are indexed, they are divided into several subdocuments, each
one containing a single paragraph of the original document. Thus, when a query is
submitted, the system returns only the relevant subdocuments, not the
entire source document. For each passage in the suspicious document, the index is
queried and the most relevant subdocuments are returned. These candidate
subdocuments are the ones selected to be analyzed in the next phases of the method.
Note that both terms, passage and subdocument, denote a
paragraph of the suspicious or source document.</p>
          </list-item>
          <list-item>
            <p>Feature Selection and Classifier Training: at this phase, a classification model is
built to enable the method to differentiate between a plagiarized and a
non-plagiarized text passage. For each pair [suspicious passage, candidate
subdocument], the following features are considered during the classifier training: (i)
the cosine similarity between the suspicious passage and the candidate
subdocument; (ii) the score assigned by the IR system to the candidate subdocument; (iii)
the position of the candidate subdocument in the ranking returned by the IR system;
(iv) the length (in characters) of both the suspicious passage and the candidate
subdocument. Note that a training collection (with the plagiarism cases annotated)
must be supplied in order to create the training instances used to train the classifier.</p>
          </list-item>
          <list-item>
            <p>Plagiarism Analysis: at this phase, for each pair [suspicious passage, candidate
subdocument] we extract the necessary information to create the test instance and
pass it to the classifier. The classifier then decides whether the
suspicious passage is plagiarized from the candidate subdocument.</p>
          </list-item>
          <list-item>
            <p>Post-Processing: at this phase, the detection results of each suspicious document
are post-processed to join the contiguous plagiarized passages. The goal is to
report a plagiarism case as a whole instead of as several small plagiarism cases. The
following heuristic is applied: (i) separate the detections into groups, each group
containing the detections referring to a single source document; (ii) for each
group, sort the detections in order of appearance in the suspicious document; (iii)
join adjacent detections that are close to each other (less than a pre-defined
number of characters apart); (iv) for each plagiarized passage, keep only the detection with
the largest length in the source document, i.e., do not report more than one
possible source of plagiarism for the same passage in the suspicious document.</p>
          </list-item>
        </list>
      </p>
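      <p>The post-processing heuristic described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the Detection record, the merge_detections function and its max_gap parameter are hypothetical names, the distance between detections is assumed to be measured only in the suspicious document, and step (iv) is assumed to mean that, among detections whose suspicious spans overlap, only the one with the longest source span is kept.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Detection:
    source_doc: str   # identifier of the source document (hypothetical field)
    susp_offset: int  # start of the passage in the suspicious document
    susp_length: int  # length of the passage in the suspicious document
    src_offset: int   # start of the passage in the source document
    src_length: int   # length of the passage in the source document


def overlaps(a, b):
    """True when the suspicious spans of a and b share at least one character."""
    a_end = a.susp_offset + a.susp_length
    b_end = b.susp_offset + b.susp_length
    return b_end > a.susp_offset and a_end > b.susp_offset


def merge_detections(detections, max_gap):
    # (i) one group per source document
    groups = defaultdict(list)
    for d in detections:
        groups[d.source_doc].append(d)

    merged = []
    for source_doc, group in groups.items():
        # (ii) order of appearance in the suspicious document
        group.sort(key=lambda d: d.susp_offset)
        current = group[0]
        for nxt in group[1:]:
            current_end = current.susp_offset + current.susp_length
            gap = nxt.susp_offset - current_end
            if max_gap > gap:
                # (iii) join detections that are close to each other
                susp_end = max(current_end, nxt.susp_offset + nxt.susp_length)
                src_start = min(current.src_offset, nxt.src_offset)
                src_end = max(current.src_offset + current.src_length,
                              nxt.src_offset + nxt.src_length)
                current = Detection(source_doc, current.susp_offset,
                                    susp_end - current.susp_offset,
                                    src_start, src_end - src_start)
            else:
                merged.append(current)
                current = nxt
        merged.append(current)

    # (iv) assumed reading: among overlapping suspicious spans, keep only
    # the detection with the largest length in the source document
    merged.sort(key=lambda d: d.src_length, reverse=True)
    kept = []
    for d in merged:
        if not any(overlaps(d, k) for k in kept):
            kept.append(d)
    return sorted(kept, key=lambda d: d.susp_offset)
```
      </preformat>
      <p>In this sketch, two detections from the same source that lie 50 characters apart in the suspicious document are reported as a single case when max_gap is 200, while a detection several thousand characters away remains a separate case.</p>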
    </sec>
    <sec id="sec-3">
      <title>3 Experiments</title>
      <sec id="sec-3-1">
        <title>3.1 Setting up the detector</title>
        <p>
          In order to tune our plagiarism detector to the PAN’10 competition, we used the
PAN-PC-09 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] training corpus, which is a large-scale corpus containing artificial
plagiarism offenses. All the steps presented here are the same
as those described in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]; the only difference is that we analyzed a different corpus.
        </p>
        <p>
          As in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we used the Terrier Information Retrieval Platform [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] as our IR system.
We also employed the same IR techniques: the TF-IDF weighting scheme, stop-word
removal (a list of 733 words included in the Terrier Platform), and stemming (Porter
Stemmer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]). To train our classifier, we used the Weka Data Mining Software [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In
particular, we applied the J48 classification algorithm to build the classifier.
        </p>
        <p>
          We divided the source documents into several subdocuments before translation in
order to keep the original offset and length of each passage in the original document.
As mentioned before, during the language normalization phase, we translate all the
non-English documents in the corpus to English. We used LEC Power Translator 12
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as our translation tool and the Google Translator [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] as our language guesser.
        </p>
        <p>After all documents in the reference collection are divided into subdocuments and
translated into English, the collection is indexed. To reduce index size and speed up
retrieval, only the subdocuments longer than 250 characters were indexed.</p>
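        <p>As an illustration of how subdocuments can keep their original offset and length while short ones are left out of the index, consider the following minimal sketch. It assumes that paragraphs are separated by a blank line, which the text does not state; the function name split_into_subdocuments and the min_length parameter are hypothetical.</p>
        <preformat preformat-type="code">
```python
def split_into_subdocuments(text, min_length=250):
    """Return (offset, length, paragraph) for each paragraph to be indexed.

    Assumption: paragraphs are separated by a blank line ("\n\n").
    Offsets and lengths refer to the original, untranslated document.
    """
    subdocs = []
    offset = 0
    for block in text.split("\n\n"):
        stripped = block.strip()
        # Only subdocuments longer than min_length characters are kept
        if len(stripped) > min_length:
            leading = len(block) - len(block.lstrip())
            subdocs.append((offset + leading, len(stripped), stripped))
        # Advance past this block and the two-character separator
        offset += len(block) + 2
    return subdocs
```
        </preformat>
        <p>Recording the offset before translation means that a detection can later be reported in terms of positions in the original source document, even though the indexed text is the translated one.</p>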
        <p>Before analyzing each document, we first had to train the classifier. To
accomplish this, we randomly selected 50 suspicious documents. For each suspicious
passage, the top ten candidate subdocuments were retrieved. From each pair
[suspicious passage, candidate subdocument], we extracted the information necessary to
create the 500 training instances. The annotations provided with the corpus allowed us
to check whether the suspicious passage was actually plagiarized from the candidate
subdocument. After the training instances were created, we generated the ARFF
(Attribute-Relation File Format) file containing the training instances according to the Weka
file format. Once we had the ARFF file with examples of plagiarized and
non-plagiarized passages, we applied the J48 classification algorithm to build the
classification model. After the classifier was trained, we proceeded to the analysis of the
suspicious documents of the training corpus.</p>
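        <p>A training file of this kind could be produced roughly as in the sketch below. The attribute names, the relation name and the yes/no class labels are assumptions made for illustration; the text only specifies which features are used, not how they were named in the file.</p>
        <preformat preformat-type="code">
```python
# Hypothetical attribute names for the features listed in Section 2
FEATURES = [
    "cosine_similarity",
    "ir_score",
    "rank",
    "suspicious_length",
    "candidate_length",
]


def to_arff(instances, relation="plagiarism"):
    """Serialize (features, is_plagiarized) pairs as ARFF text.

    features is a dict holding one numeric value per name in FEATURES;
    is_plagiarized is the boolean label taken from the corpus annotations.
    """
    lines = ["@relation " + relation, ""]
    for name in FEATURES:
        lines.append("@attribute " + name + " numeric")
    # Nominal class attribute, placed last
    lines.append("@attribute plagiarized {yes,no}")
    lines.append("")
    lines.append("@data")
    for features, is_plagiarized in instances:
        values = [str(features[name]) for name in FEATURES]
        values.append("yes" if is_plagiarized else "no")
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"
```
        </preformat>
        <p>Each data row corresponds to one pair [suspicious passage, candidate subdocument], so a suspicious passage with ten retrieved candidates contributes ten rows.</p>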
        <p>To analyze the suspicious documents, we divided them into passages. For each
passage, we queried the index to get the top ten candidate subdocuments. Thus, for each
pair [suspicious passage, candidate subdocument] we extracted the information
needed by the classifier, and let it decide whether the suspicious passage was, in fact,
plagiarized from the candidate subdocument. After we analyzed all the suspicious
documents, we post-processed the results to join the contiguous plagiarized passages
according to the heuristic described previously.</p>
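        <p>The information extracted for each pair can be sketched as follows. This is only an illustration: the text does not state how the cosine similarity was computed, so the sketch assumes raw term-frequency vectors over lowercased, whitespace-separated tokens, without the stop-word removal or stemming applied by the IR system. The names cosine_similarity and build_instance are hypothetical.</p>
        <preformat preformat-type="code">
```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors (assumed weighting)."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Counter returns 0 for terms that are absent from the other text
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_instance(suspicious, candidate, ir_score, rank):
    """Collect the features of one [suspicious passage, candidate subdocument] pair."""
    return {
        "cosine_similarity": cosine_similarity(suspicious, candidate),
        "ir_score": ir_score,                  # score assigned by the IR system
        "rank": rank,                          # position in the returned ranking
        "suspicious_length": len(suspicious),  # length in characters
        "candidate_length": len(candidate),    # length in characters
    }
```
        </preformat>
        <p>The same feature record serves both for training, where the label comes from the corpus annotations, and for analysis, where the classifier supplies the decision.</p>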
        <p>The parameters shown in Table 1 were defined based on tests with the training
corpus. The same parameters were used for analyzing the competition corpus.</p>
        <p>As shown in Table 1, to reduce the index size and speed up retrieval, we only
indexed the subdocuments with length greater than 250 characters. The IR system
returned at most 10 candidate subdocuments for each suspicious passage. Also, to speed up
retrieval, instead of using all the terms of the suspicious passage to query the index,
we discarded the terms that had an IDF (inverse document frequency) value lower
than 8. We also discarded the subdocuments to which the IR system assigned a score
lower than 11. Finally, in the post-processing phase, we joined the contiguous
plagiarized passages that were at most 3000 characters apart.</p>
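        <p>The query-term and candidate filters can be illustrated with the following sketch. The exact IDF formula is not given in the text, so the sketch assumes idf = log2(N / df), where N is the number of indexed subdocuments and df is the number of subdocuments containing the term; under that assumption, a threshold of 8 keeps only terms occurring in at most one out of every 256 subdocuments. The function names and the doc_freq dictionary are hypothetical.</p>
        <preformat preformat-type="code">
```python
import math


def select_query_terms(terms, doc_freq, num_docs, min_idf=8.0):
    """Keep only the terms whose IDF is not lower than min_idf.

    Assumption: idf = log2(num_docs / df); the formula used in the
    experiments is not stated in the text.
    """
    selected = []
    for term in terms:
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # absent from the index, so it cannot match anything
        if math.log2(num_docs / df) >= min_idf:
            selected.append(term)
    return selected


def filter_candidates(ranked, min_score=11.0, max_candidates=10):
    """Keep at most max_candidates results whose score is not lower than min_score.

    ranked is a list of (subdocument_id, score) sorted by decreasing score.
    """
    return [(doc, score) for doc, score in ranked[:max_candidates]
            if score >= min_score]
```
        </preformat>
        <p>Dropping low-IDF terms shortens the query and removes very common words, which is consistent with the stated goal of speeding up retrieval.</p>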
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Evaluation</title>
        <p>To analyze the competition corpus, we proceeded in the same way as described in
the previous section. Note that we used the same classifier built during the analysis of
the training corpus. Table 2 shows our overall result in the competition as well as the
result of the analysis when considering only the external plagiarism cases. Since
the competition corpus mixed external and intrinsic plagiarism cases,
our recall was reduced, because the applied method was designed
to detect only external plagiarism cases.</p>
        <p>With a final score of 0.5175, our group ranked seventh in the competition.
Table 3 shows an in-depth analysis of the results. We provide an overall analysis
considering the results of the competition and we also analyze our results in detecting
only the external plagiarism cases (which is the focus of the applied method). To
determine in which situations the method performs better, we investigated how well it
handles text obfuscation and to what extent the length of the plagiarized passage affects its
overall performance. We divided the plagiarized passages according to their textual
lengths: short (less than 1500 characters), medium (from 1501 to 5000 characters),
and large (greater than 5000 characters).</p>
        <p>According to Table 3, during the competition the method detected 29,486 out of
68,558 plagiarized passages (i.e., 43%). When ignoring the intrinsic plagiarism cases,
the method detected 29,486 out of 55,723 plagiarized passages (i.e., 53%). The
method performed poorly when detecting short plagiarized
passages. This is partially explained by our decision to index only the subdocuments
with length greater than 250 characters (to speed up retrieval). Table 3 also shows
that, other than translation, the intrinsic plagiarism cases did not suffer any kind of
obfuscation. When detecting medium plagiarized passages, the performance of the
method decreased as the level of obfuscation increased (none to high). It is worth
noting that the translated and the simulated plagiarized passages did not seem to
have a negative impact on the performance of the method.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions</title>
      <p>
        This paper described our approach to the plagiarism detection task during the PAN
competition at CLEF 2010. We applied the method presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which focuses on
detecting external plagiarism. In particular, the method is specially designed to detect
cross-language plagiarism, which is also present in the competition corpus.
      </p>
      <p>We used the PAN-PC-09 training corpus to set up the detector. The training corpus
was also used to build the classification model used during the analysis of the
competition corpus. With an overall score of 0.5175, we ranked seventh in the
competition. Our overall score was affected by our low recall (0.4036), since the
applied method was designed to detect only the external plagiarism cases, leading the
detector to ignore the intrinsic plagiarism cases present in the competition corpus.</p>
      <p>An in-depth analysis was conducted to determine in which situations the method
performs better. Regarding the textual length of the plagiarized passage, the larger the
passage, the easier the detection. In fact, when analyzing large plagiarized passages,
the method detected almost all of them, regardless of the type of obfuscation.
However, the method performed poorly when detecting short passages. We attribute this
low performance to the fact that we only indexed subdocuments with length greater
than 250 characters. Finally, the translated and the simulated plagiarized passages did
not seem to have a negative impact on the performance of the method, since the
percentage of these passages detected is not lower than for the other types of obfuscation.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This project was partially supported by CNPq-Brazil and INCT-Web.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Webis at Bauhaus-Universität Weimar &amp; NLEL at Universidad Politécnica de Valencia. PAN Plagiarism Corpus 2009 (PAN-PC-09). http://www.webis.de/research/corpora. M. Potthast, A. Eiselt, B. Stein, A. Barrón-Cedeño, and P. Rosso (editors).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Argamon</surname>, <given-names>S.</given-names></string-name>
          and
          <string-name><given-names>S.</given-names> <surname>Levitan</surname></string-name>
          ,
          <article-title>Measuring the Usefulness of Function Words for Authorship Attribution</article-title>
          , in Association for Literary and Linguistic Computing/ Association Computer Humanities. 2005: University Of Victoria, Canada.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Google Translator http://www.google.com/translate_t.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          .
          <article-title>Authorship Verification as a One-Class Classification Problem</article-title>
          .
          <source>in Proceedings of the 21st International Conference on Machine Learning</source>
          .
          <year>2004</year>
          . Banff, Canada: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. LEC Power Translator http://www.lec.com/power-translator-software.asp.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Ounis</surname>, <given-names>I.</given-names></string-name>,
          <string-name><given-names>G.</given-names> <surname>Amati</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Plachouras</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Macdonald</surname></string-name>, and
          <string-name><given-names>D.</given-names> <surname>Johnson</surname></string-name>
          , Terrier Information Retrieval Platform,
          <source>in Proceedings of the 27th European Conference on Information Retrieval (ECIR 05)</source>
          .
          <year>2005</year>
          , Springer. p.
          <fpage>517</fpage>
          -
          <lpage>519</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>V.P.</given-names>
            <surname>Moreira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Galante</surname>
          </string-name>
          ,
          <article-title>A New Approach for Cross-Language Plagiarism Analysis</article-title>
          ,
          <source>in Proceedings of the CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation</source>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          , et al., Editors.
          <year>2010</year>
          , Springer: Padua, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <article-title>An algorithm for suffix stripping</article-title>
          ,
          <source>in Readings in information retrieval</source>
          .
          <year>1997</year>
          , Morgan Kaufmann. p.
          <fpage>313</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Weka http://www.cs.waikato.ac.nz/ml/weka/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>