<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Source Retrieval and Text Alignment Corpus Construction for Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kong Leilei</string-name>
          <email>kongleilei1979@hotmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lu Zhimao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Yong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Haoliang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Zhongyuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wang Qibo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Zhenyuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhang Jing</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Heilongjiang Institute of Technology; Harbin Engineering University; Harbin Institute of Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Source Retrieval in Plagiarism Detection</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>For the task of source retrieval, we focus on the process of download filtering. In the stages from chunking to search control we aim at high recall, and in the download filtering stage we strive to improve precision. A vote-based approach and a classification-based approach are combined to filter the search results and obtain the plagiarism sources. For the task of text alignment corpus construction, we describe the methods we used to construct the Chinese plagiarism cases. Finally, we report the statistics of the text alignment dataset submissions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Source Retrieval</title>
<p>Our strategy for the stages from chunking to search control aims at high recall: we submit as many queries as possible to the search engine and retain as many retrieval results as possible.</p>
<p>Chunking. Firstly, the suspicious texts are partitioned into segments of a single sentence each. In particular, we found that the suspicious documents often contain headings. A line is treated as a heading if it is preceded and followed by empty lines and contains fewer than 10 words. For suspicious documents where no sources were retrieved, we tried using only the headings as queries, but the sources were still not discovered. The headings are therefore merged into their adjacent sentences.</p>
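<p>The heading heuristic above can be sketched as follows (a minimal sketch; the exact handling of line boundaries is an assumption):</p>

```python
def is_heading(lines, i, max_words=10):
    """Heuristic from the chunking step: a line is treated as a heading
    when it is surrounded by empty lines and has fewer than max_words words."""
    line = lines[i].strip()
    if not line:
        return False
    prev_empty = i == 0 or not lines[i - 1].strip()
    next_empty = i == len(lines) - 1 or not lines[i + 1].strip()
    return prev_empty and next_empty and max_words > len(line.split())
```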
<p>
        Keyphrase Extracting. After sentence splitting, each word in each paragraph is
tagged using the Stanford POS Tagger [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and only nouns and verbs are considered as
query keyphrases.
      </p>
<p>
        Query Formulation. A query is constructed from each sentence using k keywords, where k = 10. If a sentence contains more than 10 nouns and verbs, we retain only the 10 with the highest term frequencies; if it contains fewer, all of its nouns and verbs form the query. These queries are
submitted to the ChatNoir search engine [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to retrieve plagiarism sources.
      </p>
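<p>A sketch of the query formulation step, assuming the POS-tagged tokens (Penn Treebank tags) and the term frequencies are already available; the document-level scope of the term frequencies is our assumption:</p>

```python
def build_query(tagged_tokens, doc_tf, k=10):
    """Keep only nouns and verbs (NN*/VB* tags) and, if more than k remain,
    retain the k terms with the highest term frequency."""
    terms = [w.lower() for w, tag in tagged_tokens
             if tag.startswith(("NN", "VB"))]
    # deduplicate while preserving sentence order
    seen, unique = set(), []
    for t in terms:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    if len(unique) > k:
        unique = sorted(unique, key=lambda t: -doc_tf.get(t, 0))[:k]
    return " ".join(unique)
```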
<p>Search Control. Since each query is generated from a single sentence, it represents the topic of that sentence, which may stray from the subject of the plagiarism segment the sentence comes from. As a result, many true plagiarism sources are ranked low. Therefore, for each query, we keep the top 100 results. This tactic gives us higher recall before download filtering.</p>
<p>Download Filtering. There can be no argument that the number of retrieved results has a large effect on performance: increasing it raises recall but lowers precision. During keyword extraction we have very little information beyond the content of the suspicious document and its text chunks, so submitting more queries may be the best choice if retrieval cost is ignored. After retrieval, however, we obtain abundant information: various similarity scores between query and document; the document's length in words, sentences, and characters; the snippet (we requested snippets of 500 characters); and so on. By exploiting the retrieval results and the metadata returned by the ChatNoir API, we design a two-step download filtering algorithm.</p>
<p>As is known, the evaluation of source retrieval computes recall, precision, and fMeasure over the downloaded documents, so before applying our download filtering algorithm we first filter some of the retrieval results. We assume that queries originating from the same plagiarism segment of a suspicious document can retrieve the same plagiarism sources; for one suspicious document, the same results will therefore occur many times. The underlying assumption is that the more plausible a plagiarism source is, the more votes it is likely to receive from the different queries of the suspicious document. We thus use a simple vote algorithm to assign a weight to each document in the retrieval result set: whenever a document is retrieved by a query, its weight is increased by 1. We also tried a weighted vote that gives higher-ranked documents greater weight, but it did not outperform the simple vote.</p>
<p>After the vote algorithm is applied, the voted results are regarded as candidate plagiarism sources. If the result list contains fewer than 20 entries, we instead choose the top 50 results by vote count as the candidates.</p>
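<p>The vote step and the candidate fallback rule can be sketched as follows (the threshold of 8 is the value we submitted; the exact form of the fallback rule is our reading of the text):</p>

```python
from collections import Counter

def vote_filter(query_results, threshold=8, min_candidates=20, fallback_top=50):
    """Each query that retrieves a document adds 1 to its weight.
    Documents whose vote reaches `threshold` become candidates; if fewer
    than `min_candidates` survive, the `fallback_top` highest-voted
    documents are used instead."""
    votes = Counter()
    for results in query_results:   # one result list per query
        for doc_id in set(results):  # count each document once per query
            votes[doc_id] += 1
    candidates = [d for d, v in votes.most_common() if v >= threshold]
    if len(candidates) >= min_candidates:
        return candidates
    return [d for d, _ in votes.most_common()[:fallback_top]]
```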
<p>
        Table 1 shows the performance of source retrieval when only the vote approach is used to filter the retrieval results; PAN calls this run Han15 in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Experiments were
performed on the source retrieval training dataset pan14-source-retrieval-training-corpus-2014-12-01, which contains 98 suspicious documents. The numbers in the column headers indicate the vote threshold, and the row headers are the evaluation measures of source retrieval. We chose a vote threshold of 8 when submitting our source retrieval software to PAN.
      </p>
<p>Table 1. Source retrieval performance for different vote thresholds (row headers: fMeasure, Recall, Precision, Queries, Downloads).</p>
<p>
        The data in Table 1 were produced by our own evaluation detector, which was
designed according to Ref. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, we implemented only the first of the two ways of determining true positive detections, because we did not know which algorithm was used to extract the sets of plagiarism passages on which the containment relationship is computed.
      </p>
<p>
        In last year's evaluation, Williams et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a filtering approach that views the filtering of candidate plagiarism sources as a classification problem: a supervised method based on LDA (Linear Discriminant Analysis) learns a model that decides, before downloading, which candidate plagiarism sources are positive detections. This year we followed their idea and added four new features: document-snippet word 2-gram, 3-gram, 4-gram, and 8-gram intersection. The sets of word 2-, 3-, 4-, and 8-grams of the suspicious document and of the snippet are extracted separately, and the common n-grams are counted. We chose an SVM as our classification model, using the open-source tool SVMlight (http://www.cs.cornell.edu/People/tj/svm_light/) as the classifier. We tuned only the parameter c, on a training set constructed according to Ref. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. After voting, all results judged positive by the classifier are downloaded; the vote strategy follows Han15. This approach based on vote and classification is called Kong15 by PAN in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
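<p>The four added features can be sketched as follows (a minimal sketch; whitespace tokenization, lowercasing, and the function names are assumptions):</p>

```python
def ngram_set(text, n):
    """Set of word n-grams of a whitespace-tokenized, lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def snippet_ngram_features(suspicious_text, snippet, sizes=(2, 3, 4, 8)):
    """Document-snippet word n-gram intersection features: the number of
    word n-grams shared by the suspicious document and the result snippet,
    for n = 2, 3, 4, and 8."""
    feats = {}
    for n in sizes:
        common = ngram_set(suspicious_text, n).intersection(ngram_set(snippet, n))
        feats[f"{n}gram_intersection"] = len(common)
    return feats
```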
<p>Using the Source Oracle, we filtered our results; the final log file reports the filtered source retrieval results. Table 2 shows the results obtained with the classification tactics.</p>
      <p>Text Alignment Corpus Construction</p>
<p>For the task of text alignment corpus construction, we submitted a corpus containing 7 plagiarism cases, constructed from real plagiarism.</p>
<p>Firstly, we recruited 10 volunteers to each write a paper on a topic we proposed, and chose 7 of the 10 essays for our submission. Table 4 lists the topics.</p>
<p>For each essay, we required at least ten thousand Chinese characters. The volunteers retrieved content related to the subject using a specified search engine, namely Baidu, and wrote their papers. The number of sources was not limited.</p>
<p>The papers were then submitted to a well-known Chinese plagiarism detection system used in many Chinese colleges and universities, which detects plagiarism with fingerprint technology. Next, the volunteers modified the content flagged by the system. The modification tactics include adjusting word order, replacing words, and paraphrasing. Whatever tactic they adopted, they had to ensure that the revised paper remained readable and consistent with the original paper's meaning. The papers were repeatedly revised and resubmitted until the software could no longer detect any plagiarism, and the final modified papers were submitted to PAN as the text alignment corpus.</p>
<p>Table 4. Topics of the suspicious documents:
suspicious-document00000: Campus Second-hand Book Trade
suspicious-document00001: Online Examination
suspicious-document00002: Online Examination
suspicious-document00003: Second-hand Car Trade
suspicious-document00004: Automobile 4S Shop
suspicious-document00005: Multimedia Material Management Library
suspicious-document00006: Driving license exam
suspicious-document00007: Supermarket Management System</p>
      <p>The statistics of the corpus are shown in Table 5.</p>
<p>
        We peer-reviewed the PAN 2015 text alignment dataset submissions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; the statistics of the reviewed corpora are shown in Table 6.
      </p>
<p>Tables 5 and 6 report the following corpus characteristics: number of suspicious documents, number of source documents, average length of suspicious documents, average length of source documents, average length of plagiarism cases, number of plagiarism cases, and Jaccard coefficient of the plagiarism cases.</p>
      <p>Acknowledgments. This work is supported by the Youth National Social Science Fund of China (No. 14CTQ032), the National Natural Science Foundation of China (No. 61272384), and the Heilongjiang Province Educational Committee Science Foundation (No. 12541649, No. 12541677).</p>
      <p>Remark This work was done in Heilongjiang Institute of Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Matthias Hagen, Anna Beyer, Matthias Busse, Martin Tippmann, Paolo Rosso, Benno Stein:
          <article-title>Overview of the 6th International Competition on Plagiarism Detection</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2014</year>
          :
          <fpage>845</fpage>
          -
          <lpage>876</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL '03</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>180</lpage>
          (May
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Matthias Hagen, Benno Stein, Jan Graßegger, Maximilian Michel,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Tippmann</surname>
          </string-name>
          , and Clement Welsch.
          <article-title>ChatNoir: A Search Engine for the ClueWeb09 Corpus</article-title>
          . In Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors,
          <source>35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12)</source>
          , pages
          <fpage>1004</fpage>
          ,
          <year>August 2012</year>
          .
          <source>ACM. ISBN 978-1-4503-1472-5.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2015 Evaluation Labs, CEUR Workshop Proceedings</source>
          ,
          <year>September 2015</year>
          .
          <article-title>CLEF and CEUR-WS.org</article-title>
          .
          <source>ISSN 1613-0073.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Supervised Ranking for Plagiarism Source Retrieval-Notebook for PAN at CLEF</article-title>
          <year>2014</year>
          .
          15-18 September, Sheffield, UK. CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2014</year>
          ), http://www.clefinitiative.eu/publication/working-notes.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Williams</surname>
            <given-names>K</given-names>
          </string-name>
          ,
<string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
<article-title>Classifying and ranking search engine results as potential sources of plagiarism</article-title>
          .
          <source>In: Proceedings of the 2014 ACM Symposium on Document Engineering</source>
          . ACM,
          <year>2014</year>
          :
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Matthias Hagen, Steve Göring, Paolo Rosso, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Towards Data Submissions for Shared Tasks: First Experiences for the Task of Text Alignment</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2015 Evaluation Labs, CEUR Workshop Proceedings</source>
          ,
          <year>September 2015</year>
          .
          <article-title>CLEF and CEUR-WS.org</article-title>
          .
          <source>ISSN 1613-0073.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>