=Paper= {{Paper |id=Vol-1737/T4-6 |storemode=property |title=Persian Plagiarism Detection Using Sentence Correlations |pdfUrl=https://ceur-ws.org/Vol-1737/T4-6.pdf |volume=Vol-1737 |authors=Muharram Mansoorizadeh,Taher Rahgooy |dblpUrl=https://dblp.org/rec/conf/fire/MansoorizadehR16 }}
      Persian Plagiarism Detection Using Sentence Correlations


Muharram Mansoorizadeh
Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran
mansoorm@basu.ac.ir

Taher Rahgooy
Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran
taher.rahgooy@gmail.com



ABSTRACT
This report describes our Persian plagiarism detection system, with which we submitted our run to the Persian PlagDet competition at FIRE 2016. The system consists of four main stages. First, the documents are pre-processed and tokenized. Second, a corpus of sentences is constructed from each source-suspicious document pair; each sentence is treated as a document and represented as a tf-idf vector. Third, a similarity matrix between the source and suspicious sentences is computed. Finally, the sentence pairs whose similarity exceeds a specific threshold are marked as plagiarized segments. Our performance measures on the training corpus were promising (precision = 0.914, recall = 0.848, granularity = 3.859).

CCS Concepts
• Information systems → Information retrieval → Retrieval tasks and goals → Near-duplicate and plagiarism detection.

Keywords
Persian PlagDet, plagiarism detection, document retrieval, tf-idf

1. INTRODUCTION
Plagiarism in academia is rising, and multiple authors have worked to describe the phenomenon [11]. As Hunt comments in [7], "Internet plagiarism" is sometimes described as a consequence of the "Information Technology revolution", as it has proved to be a serious problem in academia. According to Park [11], plagiarism has been analyzed from various perspectives and is considered a problem that grows over time. The most common approach to tackling it so far is to detect plagiarism with automated algorithms based on text processing and string matching.

Researchers have considered two main strategies for plagiarism detection, namely intrinsic and external plagiarism detection [17] [5]. Intrinsic plagiarism detection aims to discover plagiarism by examining only the input document, deciding whether parts of it are not from the same author. External plagiarism detection compares suspicious documents against a set of possible reference documents. According to Meyer zu Eissen [17], different levels of plagiarism techniques, from exact copying to paraphrasing, can be used in several contexts.

For external plagiarism detection, Stein, Meyer zu Eissen, and Potthast [15] introduce a generic three-step retrieval process. The authors consider that the source of a plagiarism case may be hidden in a large reference collection, and that the detection results may not be perfectly accurate. Figure 1 illustrates this retrieval process. In fact, all detection approaches submitted by the competition participants can be explained in terms of these building blocks [5].

The process starts with a suspicious document dq and a collection D of documents from which dq's author may have plagiarized. Within a so-called heuristic retrieval step, a small number of candidate documents Dx, which are likely to be sources for plagiarism, are retrieved from D. Note that D can be as large as the entire web; hence, it is impractical to compare dq with all of its members. Usually an initial inspection is made to select a rough subset of D as prospective candidates. In a so-called detailed analysis step, dq is compared section-wise with the retrieved candidates. All pairs of sections (sq, sx) with sq ∈ dq and sx ∈ dx, dx ∈ Dx, are to be retrieved such that sq and sx have a high similarity under some retrieval model. In a knowledge-based post-processing step, the sections for which certain exclusion criteria hold, such as the use of proper citation or literal speech, are filtered out. The remaining suspicious sections are presented to a human, who may decide whether or not a plagiarism offense is given.

Figure 1: Generic retrieval process for external plagiarism detection [5].

Given a set of suspicious and source documents written in Persian, the plagiarism detection task of the Persian PlagDet competition [2] at FIRE 2016, an external plagiarism detection setup, is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections. The input consists of a set of source and suspicious documents and a list of source-suspicious pairs to evaluate. The challenge is to find and locate the plagiarized fragments in each suspicious and source document of each pair. We tackle this challenge with a four-step method, described in the following.

2. RELATED WORK
Text plagiarism is a long-lasting game between authors and publishers [8]. Automatic plagiarism detection has a history as long as modern machine learning and text mining [15]. Plagiarism in
its simplest case can be the exact copying of others' published or unpublished works; more advanced cases try to paraphrase the stolen ideas in different words [1]. Studies confirm that plagiarists usually copy and paste the original text, or paraphrase it while keeping most of the text's keywords [16]. Text mining methods can reveal these simpler but widely used cases of plagiarism [10]. Deep semantic analysis of texts has been studied for detecting the more advanced paraphrasing cases [3].

3. PROPOSED ALGORITHM
In the following we describe the steps we took in order to find the plagiarized segments of the document pairs.

3.1 Motivation
In her MSc thesis [16], Zahra Taheri investigated many scientific plagiarism cases. Most of the cases were near-copies or slightly modified versions of the original sentences. Every scientific field has a common and widely accepted terminology that the related community adopts when writing papers, books and other types of publications. For example, in the machine learning literature the term feature is used to describe the attributes of objects. As a dictionary term, property is synonymous with feature, but no one uses property and property extraction instead of feature and feature extraction in a scientific text that discusses classification and pattern recognition. Hence, a plagiarized text inherits many of the main terms of the original text.

Original sentence
نقطه جوش آب صد درجه سانتیگراد است
The boiling point of water is one hundred degrees centigrade.

Plagiarized sentences
نقطه جوش آب صد درجه سلسیوس است
The boiling point of water is one hundred degrees Celsius.

نقطه جوش آب 100 درجه سانتیگراد است
The boiling point of water is 100 degrees centigrade.

آب در صد درجه سانتیگراد به جوش می آید
Water boils at one hundred degrees centigrade.

اگر دمای آب به صد درجه سانتیگراد برسد، می جوشد
If the temperature of water reaches one hundred degrees centigrade, it will boil.

اگر آب را تا صد درجه حرارت دهید می جوشد
If you heat water up to one hundred degrees, it boils.

Figure 2: A syntactic example of a Persian sentence and its possible plagiarized versions.

A syntactic example is demonstrated in Figure 2. The key terms in the original sentence are water, boiling, and one hundred. To communicate the same concept, the plagiarized versions of the sentence had to use these terms with re-ordering and/or changed verb tenses. This fact narrows the task of plagiarism detection down to classic document matching and retrieval.

3.2 Preprocessing and Feature Extraction
The document retrieval task is defined as follows [9]: given a document collection D and a query document q, find its most similar documents in D.

The task is usually solved by representing documents in some vector space and exploiting similarities between the query document and the collection members in this space. The bag-of-words representation of documents is a classic, yet powerful, method for document indexing and retrieval. In this method a document is treated as the set of its terms, ignoring their relative positions in paragraphs and sentences. The dictionary is the set of all distinct terms in the collection. A document d is represented by a one-dimensional vector v, in which vi denotes the relative importance of the i-th dictionary term in d. Using this representation, similarities between documents can be estimated by well-known techniques such as the Euclidean distance or the cosine of their respective vectors. Usually vi is defined as a function of the appearances of the word in the document and in the whole collection. The term frequency (TF) of a term in a document is the relative frequency of the term in the document:

tf_i = F_i / |d|                                  (Eq 1)

In this equation F_i is the frequency of the i-th term in the document and |d| denotes the document length, i.e. the total number of terms in the document. Assuming that there are in total N documents in the collection and N_i of them contain the i-th term, the inverse document frequency (IDF) is defined by Eq 2:

idf_i = log(N / N_i)                              (Eq 2)

Finally, as an importance measure of a word in a document, tf-idf is defined as:

tf-idf_i = tf_i × idf_i                           (Eq 3)

tf-idf favors terms with a high frequency in the document but a low frequency in the collection. Common terms such as it, is, and what that appear in most of the documents will have a low tf-idf value. These so-called stop words cannot distinguish the documents efficiently; hence they are usually removed in the pre-processing steps of NLP tasks.

We adopt tf-idf for the representation and retrieval of matching documents. Each source and suspicious document is first tokenized and then split into sentences. This step is performed using the NLTK 3.0 toolkit [4]. Then each sentence is treated as an individual document, and all the sentences of the source and suspicious documents constitute the document collection.

3.3 Similarity matrix construction
In this step, the similarity matrix between the source sentences and the suspicious sentences is calculated. If there are N source sentences and M suspicious sentences, the similarity matrix S has N×M elements, in which S_i,j denotes the similarity between sentence i from the source document and sentence j from the suspicious document. There are many different similarity and distance measures. One option is the Euclidean distance, which measures the distance between two vectors u and v in the encompassing Euclidean space; to convert this distance into a similarity measure we can use Eq 4.
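To make Sections 3.2 and 3.3 concrete, the whole sentence-matching pipeline, tf-idf weighting over the pooled sentences (Eqs. 1-3), a pairwise similarity matrix, and thresholding, can be sketched in plain Python. This is an illustrative reconstruction, not the submitted system: the function names are ours, and it uses the cosine measure (Eq 5) and the 0.4 threshold that the paper adopts below.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build a sparse tf-idf vector (Eqs. 1-3) for each tokenized
    sentence, treating every sentence as a separate document."""
    docs = [Counter(tokens) for tokens in sentences]
    n = len(docs)
    # document frequency: number of sentences containing each term
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log(n / df[t]) for t in df}            # Eq 2
    vectors = []
    for tokens, counts in zip(sentences, docs):
        length = len(tokens)
        vectors.append({t: (c / length) * idf[t]          # Eq 1 times Eq 2 = Eq 3
                        for t, c in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity (Eq 5) of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def plagiarized_pairs(src_sents, susp_sents, threshold=0.4):
    """Return (i, j) index pairs whose similarity exceeds the threshold."""
    vecs = tfidf_vectors(src_sents + susp_sents)
    src, susp = vecs[:len(src_sents)], vecs[len(src_sents):]
    return [(i, j)
            for i, u in enumerate(src)
            for j, v in enumerate(susp)
            if cosine(u, v) > threshold]
```

For a source and a suspicious document split into N and M sentences, `plagiarized_pairs` plays the role of thresholding the N×M similarity matrix S described in this section.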
sim(u, v) = 1 / (1 + ‖u − v‖)                     (Eq 4)

Another choice, which is the one we used, is the cosine similarity:

sim(u, v) = (u · v) / (‖u‖ ‖v‖)                   (Eq 5)

This measure ranges from −1 to 1. A value of 1 means that u and v point in the same direction, and −1 means that they point in opposite directions. We use cosine similarity in the subsequent steps; initial inspection and evaluation confirmed that it gives slightly better results than the Euclidean distance.

3.4 Finding plagiarized fragments
We mark pairs of sentences whose similarity is greater than a pre-specified threshold as plagiarized fragments in both the source and the suspicious documents. The threshold value we used is 0.4, which was obtained through cross validation.

4. EVALUATION
The training results were obtained by running our application on the TIRA platform [6] [13]. The evaluation measures used for this task are precision, recall, and granularity. Another measure, called plagdet, combines the aforementioned measures in order to sort the results of all algorithms and compare them in a more objective way [12] [14].

The detection granularity of a set of detections R under a set of plagiarism cases S is defined as:

gran(S, R) = (1 / |S_R|) Σ_{s ∈ S_R} |R_s|        (Eq 6)

where S_R ⊆ S are the cases detected by detections in R, and R_s ⊆ R are the detections of a given case s:

S_R = {s | s ∈ S ∧ ∃r ∈ R : r detects s}
R_s = {r | r ∈ R ∧ r detects s}

The domain of gran(S, R) is [1, |R|], with 1 indicating the desired one-to-one correspondence and |R| indicating the extreme case, where a single s ∈ S is detected multiple times.

Precision, recall, and granularity allow for a partial ordering among plagiarism detection algorithms. To obtain an absolute order they must be combined into an overall score:

plagdet(S, R) = F / log2(1 + gran(S, R))          (Eq 7)

where F denotes the F-measure, i.e. the weighted harmonic mean of precision and recall. Taking the logarithm of the granularity decreases its impact on the overall score [14]. The results obtained by our algorithm are shown in Table 1.

Table 1. Evaluation results for the proposed algorithm

Threshold   Precision   Recall   Granularity   Plagdet
0.4         0.914       0.848    3.859         0.385
0.5         0.821       0.926    4.481         0.354

5. CONCLUSION
In this paper we proposed a sentence-level algorithm based on tf-idf features for the plagiarism detection task, which shows competitive results on the training data.

The algorithm is designed for near-copy and paraphrasing types of plagiarism. It relies on the fact that a plagiarist who wants to publish an article in a scientific field must use the popular terminology of that field. Obviously, sophisticated cases such as cross-language plagiarism, or grabbing another one's idea and discussing it in one's own words, would be hard to detect with this algorithm. In future work we will consider improving the feature vectors of the sentences by incorporating more features, and will also use a method to combine overlapping fragments. Furthermore, we will study language modeling and semantic text normalization for detecting harder cases of plagiarism.

6. REFERENCES
[1] Alzahrani, S.M., Salim, N. and Abraham, A., 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2), pp.133-149.

[2] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P. and Potthast, M., 2016. Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[3] Barrón-Cedeño, A., Vila, M., Martí, M.A. and Rosso, P., 2013. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4), pp.917-947.

[4] Bird, S., 2006, July. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions (pp. 69-72). Association for Computational Linguistics.

[5] Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A. and Rosso, P., 2009. Overview of the 1st international competition on plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse (p. 1).

[6] Gollub, T., Stein, B. and Burrows, S., 2012, August. Ousting ivory tower research: towards a web framework for providing experiments as a service. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1125-1126). ACM.

[7] Hunt, R., 2003. Let's hear it for internet plagiarism. Teaching Learning Bridges, 2(3), pp.2-5.

[8] Mali, Y., 2011. Crib from the novelist, or thirteen best practices for how to do a clean plagiarism (in Persian).

[9] Manning, C.D., Raghavan, P. and Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.
[10] Oberreuter, G. and Velásquez, J.D., 2013. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), pp.3756-3763.
[11] Park, C., 2003. In other (people's) words: Plagiarism by
     university students--literature and lessons. Assessment &
     evaluation in higher education, 28(5), pp.471-488.
[12] Potthast, M., Eiselt, A., Barrón-Cedeño, L.A., Stein, B. and Rosso, P., 2011. Overview of the 3rd international competition on plagiarism detection. In CEUR Workshop Proceedings, CEUR-WS.org.
[13] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E.
     and Stein, B., 2014, September. Improving the
     Reproducibility of PAN’s Shared Tasks. In International
     Conference of the Cross-Language Evaluation Forum for
     European Languages (pp. 268-299). Springer International
     Publishing.
[14] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P.,
     2010, August. An evaluation framework for plagiarism
     detection. In Proceedings of the 23rd international
     conference on computational linguistics: Posters (pp. 997-
     1005). Association for Computational Linguistics.
[15] Stein, B., zu Eissen, S.M. and Potthast, M., 2007, July. Strategies for retrieving plagiarized documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 825-826). ACM.
[16] Taheri, Z., 2012. Plagiarism Detection in Scientific Texts using Structural and Semantic Relations. MSc thesis.
[17] Zu Eissen, S.M., Stein, B. and Kulig, M., 2007. Plagiarism
     detection without reference collections. In Advances in data
     analysis (pp. 359-366). Springer Berlin Heidelberg.
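For completeness, the evaluation measures of Section 4 can be computed directly from Eqs. 6 and 7. The sketch below is illustrative, not the official PAN/TIRA evaluation code; the `overlaps` predicate is a hypothetical stand-in for the character-level test of whether a detection covers a plagiarism case, and the function names are ours.

```python
import math

def granularity(cases, detections, overlaps):
    """Granularity (Eq 6): the average number of detections per
    detected case; 1.0 is the ideal one-to-one correspondence."""
    detected = [s for s in cases
                if any(overlaps(s, r) for r in detections)]
    if not detected:
        return 1.0
    return sum(sum(1 for r in detections if overlaps(s, r))
               for s in detected) / len(detected)

def plagdet(precision, recall, gran):
    """Plagdet (Eq 7): the F-measure discounted by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f = 2 * precision * recall / (precision + recall)
    return f / math.log2(1 + gran)
```

With the Table 1 figures for threshold 0.4 (precision 0.914, recall 0.848, granularity 3.859), `plagdet` reproduces the reported score of roughly 0.385.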