<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Putting Ourselves in SME's Shoes: Automatic Detection of Plagiarism by the WCopyFind tool</article-title>
      </title-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>34</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>Thanks in part to the large amount of information circulating on the Internet today, plagiarism has unfortunately become a very common practice and one of the biggest problems of today's society. Small and medium enterprises (SMEs) are among the most affected sectors, as they are daily victims of their competitors. Finding a system able to detect plagiarism in texts has become a major goal for SMEs, which are forced to address the problem with the tools available on the web. In this paper we analyze the results obtained in the PAN'09 competition with the WCopyFind tool.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Internet is one of the greatest advances
in the history of communication.</p>
      <p>Thanks to the Internet, information is
immediately accessible regardless of distance.
However, this easy access to information
has increased the number of
plagiarism cases.</p>
      <p>
        In the business area, automatic plagiarism
detection is particularly important for SMEs.
For SMEs it is vital to know whether their
proposals, products, ideas, etc. have been
plagiarized by competitors. To solve this problem,
companies mainly have to rely on the software
available on the web. In this paper, we experiment
with the WCopyFind software
        <xref ref-type="bibr" rid="ref2">(Dreher, 2007)</xref>
        , developed at the University of Virginia.
      </p>
      <p><bold>Plagiarism detection for SMEs</bold></p>
      <p>SMEs build web pages to present information
about themselves, advertise their products,
etc., and so reach the consumer. However,
the information on the web is also visible to
competitors. When a company launches
a new tool, competitors discover it
within a few hours or days, and some
companies use this information to copy it.</p>
      <p>Automatic plagiarism detection aims to
find an automated approach that is
able to locate text fragments suspected of
plagiarism.</p>
      <p>Currently, automatic plagiarism
detection is divided into two
branches. On the one hand, there is external
plagiarism analysis, which requires a set of
original sources in which to look for possibly
plagiarized fragments of suspicious texts.</p>
      <p>Within this branch, methods have been
developed to locate
fragments suspected of plagiarism through
search strategies.</p>
      <p>
        Given the large amount of information
available at present, comparing a suspicious
document with all the available ones is a
virtually unmanageable task. Therefore,
intrinsic plagiarism analysis emerged, which
relies only on the suspicious document itself. Its
aim is to capture the style and
complexity of a document in order to
find unusual fragments that are
candidates to be instances of plagiarism
        <xref ref-type="bibr" rid="ref1">(Barrón-Cedeño and Rosso, 2009)</xref>
        .
      </p>
      <p>WCopyFind is a software tool developed in 2004
by Bloomfield at the University of Virginia.</p>
      <p>To detect fragments suspected of plagiarism,
WCopyFind conducts a search through the
comparison of n-grams.</p>
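      <p>The idea of comparing n-grams can be illustrated with a minimal sketch. It is not WCopyFind's actual implementation: the function names <monospace>word_ngrams</monospace> and <monospace>shared_ngrams</monospace>, the tokenization, and the two example sentences are assumptions made only for illustration.</p>
      <preformat>
```python
import re


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(suspicious, source, n=6):
    """Return the n-grams that occur in both documents."""
    return word_ngrams(suspicious, n).intersection(word_ngrams(source, n))


# Hypothetical example texts, not taken from the PAN'09 corpus.
source = "the quick brown fox jumps over the lazy dog near the river bank"
suspicious = "yesterday the quick brown fox jumps over the lazy dog again"

matches = shared_ngrams(suspicious, source, n=6)
print(len(matches), "shared 6-grams")
```
      </preformat>
      <p>Under these assumptions, a suspicious document is flagged when it shares at least one n-gram with a source document; a real tool would also have to report where the matching fragments are located.</p>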
      <p>
        Since WCopyFind works with n-grams,
language is not important and matches are
readily identified from the candidate
documents submitted for analysis
        <xref ref-type="bibr" rid="ref2">(Dreher, 2007)</xref>
        . The tool is available at http://plagiarism.phys.virginia.edu/.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Corpus</title>
      <p>The PAN’09 corpus for the
External Plagiarism Analysis task consists mainly
of documents in English, in which any type
of plagiarism can be found.</p>
      <p>There are a total of 7,214 suspicious
documents, which may contain fragments
plagiarized from one or more original
documents or no plagiarized
fragment at all. The corpus also contains
7,215 original documents.</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Because the WCopyFind tool
allows the user to select the size of the
n-grams, before carrying out the analysis on
the competition corpus we ran
several experiments on the training corpus to find
the appropriate n-gram size. Table
1 shows the results of each
experiment. Several points are worth
highlighting. First, contrary to
other language engineering tasks,
the precision obtained is
lower than the recall.</p>
      <p>
        Another interesting fact observed in Table
1 is that the smaller the n-grams,
the lower the precision. Recall behaves
in the opposite way: the smaller the n-grams, the
greater the recall. This is because the
smaller the n-grams, the greater the
possibility of finding similar fragments in
plagiarized documents. In
        <xref ref-type="bibr" rid="ref1">(Barrón-Cedeño and Rosso, 2009)</xref>
        , the authors analyzed this fact
and showed that the probability of
finding common n-grams in different documents
decreases as n increases.
      </p>
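      <p>The trade-off between precision and recall can be made concrete with a small sketch. It assumes a simplified, set-based definition of both measures over fragment identifiers; the function name <monospace>precision_recall</monospace> and the toy data are hypothetical and do not reproduce the values in Table 1 or the exact measures of the competition.</p>
      <preformat>
```python
def precision_recall(detected, actual):
    """Compute set-based precision and recall.

    detected: set of fragments reported by the tool
    actual:   set of fragments that are truly plagiarized
    """
    true_positives = len(detected.intersection(actual))
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical toy data: fragment identifiers invented for illustration only.
actual = {"f1", "f2", "f3", "f4"}
# Small n-grams match easily: every true case is found, plus many false alarms.
detected_small_n = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"}
# Large n-grams match rarely: fewer false alarms, but one true case is missed.
detected_large_n = {"f1", "f2", "f3", "f9"}

print(precision_recall(detected_small_n, actual))
print(precision_recall(detected_large_n, actual))
```
      </preformat>
      <p>In this toy setting the small n-gram size reaches full recall at the cost of precision, while the larger size trades some recall for higher precision, which is the behaviour described above.</p>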
      <p>Finally, we decided that
the best size for the n-grams was hexagrams (n = 6),
because there is no great loss
in recall and they give the best result in precision.</p>
      <p>
        Unlike most areas of language
engineering, in the automatic detection
of plagiarism the precision is lower than the
recall. This is because it is very likely to find
similar fragments in two documents
even when they are not plagiarized.
As future work, it would be interesting to
look for an automated approach that reduces
the search space before conducting the
search based on the comparison of
n-grams. In
        <xref ref-type="bibr" rid="ref1">(Barrón-Cedeño, Rosso, and Benedí, 2009)</xref>
        , the authors proposed
reducing the search space on the basis
of the Kullback-Leibler distance.
      </p>
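      <p>A search space reduction of this kind could look roughly as follows. This is only a sketch under stated assumptions, not the method of the cited authors: it compares smoothed word distributions with a symmetric Kullback-Leibler distance and keeps the k closest source documents for the later n-gram comparison. The names <monospace>word_distribution</monospace>, <monospace>symmetric_kl</monospace> and <monospace>closest_sources</monospace>, the add-one smoothing, and the example texts are all hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def word_distribution(text, vocabulary, smoothing=1.0):
    """Return a smoothed probability for every word in vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler distance: KL(p||q) + KL(q||p)."""
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def closest_sources(suspicious, sources, k=1):
    """Return the names of the k source documents closest to suspicious."""
    vocabulary = set(suspicious.lower().split())
    for text in sources.values():
        vocabulary.update(text.lower().split())
    p = word_distribution(suspicious, vocabulary)
    distances = {
        name: symmetric_kl(p, word_distribution(text, vocabulary))
        for name, text in sources.items()
    }
    return sorted(distances, key=distances.get)[:k]


# Hypothetical example documents, invented for illustration only.
sources = {
    "a": "cats chase mice in the old barn",
    "b": "stock markets fell sharply on monday",
}
suspicious = "the cats chase mice in the barn"
print(closest_sources(suspicious, sources, k=1))
```
      </preformat>
      <p>Only the documents returned by such a filter would then be passed to the more expensive n-gram comparison.</p>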
      <p>In this paper we tried to put ourselves in
an SME’s shoes, with its need to detect
cases of plagiarism of its marketing campaigns
on the web. The idea was to investigate to
what extent this could be done using
plagiarism detection software available
on the web. The poor results we obtained
with the WCopyFind tool highlight the need to
develop ad hoc plagiarism detection methods
for SMEs.</p>
      <p>Barrón-Cedeño, A. and P. Rosso. 2009.
On automatic plagiarism detection based
on n-grams comparisons. Proc.
European Conference on Information
Retrieval, ECIR-2009, pages 696–700.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Benedí</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Reducing the plagiarism detection search space on the basis of the Kullback-Leibler Distance</article-title>
          .
          <source>Proc. 10th Int. Conf. on Comput. Ling. and Intelligent Text Processing</source>
          , CICLing-2009, Springer-Verlag,
          <source>LNCS (5449)</source>
          , pages
          <fpage>523</fpage>
          -
          <lpage>534</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Dreher</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Automatic conceptual analysis for plagiarism detection</article-title>
          .
          <source>Journal of Issues in Informing Science and Information Technology 4</source>
          , pages
          <fpage>601</fpage>
          -
          <lpage>614</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>