Source Retrieval Plagiarism Detection based on Noun Phrase and Keyword Phrase Extraction
Notebook for PAN at CLEF 2015

Javad Rafiei, Salar Mohtaj, Vahid Zarrabi, Habibollah Asghari
ICT Research Institute, Academic Center for Education, Culture and Research (ACECR), Iran
{javad.rafiei, salar.mohtaj, vahid.zarrabi, habib.asghari}@ictrc.ir

Abstract. This paper describes an approach for the source retrieval task of the PAN 2015 competition. We apply two methods to extract important terms, namely weighted noun phrases and keyword phrases, which are extracted from sentences that are long in terms of word count. Queries are constructed from top-ranked sentences. The system gathers a complete dataset of downloaded sources and employs it in query filtering. The ChatNoir search API is used for submitting queries. Each query is split into two sub-queries; the system extracts one snippet for each sub-query and exploits them in the downloading operation. The evaluation results show high scores for three measures: recall, total number of queries, and no detection.

Keywords: Plagiarism Detection, Source Retrieval, Keyword Phrase Extraction, Noun Phrase Extraction

1 Introduction

The advent and rapid development of the World Wide Web have facilitated public access to digital information, so that everybody can easily read and alter the content of web pages for personal use. Two aspects of information generation can be considered: first, using existing information, with references, as input and extending it with new innovation as output; and second, taking information and altering the content while maintaining the concept, without any reference to the source, which is called plagiarism. In simple terms, plagiarism is the unreferenced use of others' ideas or text. Plagiarism detection in PAN [1] is divided into source retrieval and text alignment subtasks.
The former task uses two search engines [2, 3] to retrieve all plagiarized sources for input suspicious documents. In this paper we focus on the source retrieval task at PAN 2015 and present a simple solution for keyword extraction and query building, aiming at high quality of query terms and minimum use of the web search API. We have used the ChatNoir API to retrieve candidate source documents.

The rest of this paper is structured as follows. Section 2 describes our approach, which includes document segmentation, keyword extraction methods, query formulation, and query and document filtering. Section 3 presents the results of our method and a discussion of those results. Conclusions and directions for future work are given in Section 4.

2 Our Methodology

Our approach is divided into five steps as follows:
- Suspicious Document Chunking
- Noun Phrase and Keyword Phrase Extraction
- Query Formulation
- Search Control
- Document Filtering and Downloading

These steps are equal to those described in [1]. Before these main steps, raw suspicious documents are passed through a preprocessing block that includes stop word removal and punctuation deletion.

2.1 Suspicious Document Chunking

After the preprocessing step, the documents are ready for suspicious document chunking. Each document is segmented into parts called chunks. These chunks are used separately for keyword phrase and noun phrase extraction and for query construction; therefore, they should be long enough to extract meaningful queries. On the other hand, these chunks may contain unknown numbers of plagiarism cases from source documents. Suspicious documents are divided into chunks of 500 words, and each chunk is then tokenized into individual sentences. As a result, we obtain a set of sentences that are used to extract appropriate keywords.

2.2 Noun Phrase and Keyword Phrase Extraction

This step has the main role in the source retrieval task.
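The chunking step of Section 2.1 can be sketched as follows. The 500-word chunk length is taken from the paper; the sentence splitter below is a naive regex-based stand-in for the NLTK tokenizer the authors use, so the function names and the splitting heuristic are illustrative assumptions, not the authors' exact code.

```python
import re

CHUNK_WORDS = 500  # chunk length used in the paper


def chunk_document(text, chunk_words=CHUNK_WORDS):
    """Split a preprocessed suspicious document into fixed-length word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def split_sentences(chunk):
    """Naive sentence splitter; the paper uses NLTK's sentence tokenizer instead."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
```

A 1,200-word document, for example, would yield three chunks of 500, 500, and 200 words, each then broken into sentences for the extraction step.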
Extracting appropriate keywords helps us to perform the next steps efficiently. Many previous studies have tried to extract keywords by investigating document contents [4, 5]. We have used two types of keyword extraction in our approach: 1) keyword phrase extraction, and 2) noun phrase extraction.

The input to this step is the series of sentences extracted from a specific chunk in the chunking step. Before the extraction process starts, sentences with low information content are discarded: we rank the input sentences based on their length and their number of nouns, and then discard the lower 20% of the sentences in the ranking. The resulting sentences are long enough and have rich enough content for keyword extraction.

In this stage, two different types of tf.idf weighting scheme are used for rating important words in the sentences:
- Type #1 (tf.idf1): tf = term frequency in the chunk; idf = inverse term frequency in the whole suspicious document
- Type #2 (tf.idf2): tf = term frequency in the chunk; idf = inverse document frequency in the PAN 2011 corpus

Keyword Phrase Extraction. Keyword phrases are considered as a collection of keywords with high tf.idf weights in a sentence. The number of keywords that can be used as input to the ChatNoir search engine is limited, so the number of query words should be restricted in such a way as to maximize the information content. The keyword phrase extraction process is done in four steps. In the first step, for each type of tf.idf weighting scheme mentioned above, the ten words with the highest tf.idf values are selected from the whole chunk. In the second step, the sentences that contain words with high values of both tf.idf1 and tf.idf2 are selected. In the third step, among the selected sentences, those that contain the words with the highest tf.idf1 and tf.idf2 values are selected for keyword phrase extraction; as a result, three sentences are selected in this step.
Finally, in the fourth step, the keywords are extracted from the resulting sentences as follows:
- nouns with high tf.idf values
- remaining nouns in the sentence
- adjectives and verbs with high tf.idf1 values

It should be noted that the selection process follows the above priority order; once the maximum number of keywords is reached at any stage, the process stops. The four steps mentioned above are repeated for each chunk, and the extracted keyword phrases are passed to the next step for query building.

Noun Phrase Extraction. Noun phrase extraction is accomplished by processing the remaining sentences. The formulation is based on the formal English noun phrase structure [6]. For each noun phrase, a score is calculated as the average of its tf.idf1 values. Next, the noun phrases are ranked by score. From the three top-ranked noun phrases, the words with the top tf.idf2 weights are passed to the query formulation step.

As a result, the implemented system applies two different scenarios to sentences for keyword extraction. After dividing a suspicious document into chunks, the following scenarios are used, based on the operations listed in Table 1:
- Scenario 1: Operation 1, Operation 2, Operation 3, for noun phrase extraction
- Scenario 2: Operation 1, Operation 2, Operation 4, for keyword phrase extraction

Table 1. Operations on sentences in keyword extraction

Operation number | Operation description
1 | Selection of the top 80% long sentences (based on length in characters)
2 | Selection of the top 80% sentences (based on number of nouns)
3 | Selection of the top three sentences (based on average tf.idf1 values)
4 | Selection of the top three sentences (based on number of words with the highest tf.idf1 and tf.idf2 values)

The outputs of these scenarios are also passed through a filter that removes some terms with low weight, to satisfy the word count limitation of the ChatNoir API.
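The two tf.idf weighting schemes described above can be sketched as below. The paper does not give the exact idf formula, so a standard log-scaled variant is assumed here; the function names and the background-corpus data structure (`corpus_doc_freq`) are illustrative assumptions.

```python
import math
from collections import Counter


def tf(term, chunk_tokens):
    """Term frequency of `term` inside one chunk."""
    return Counter(chunk_tokens)[term]


def tfidf1(term, chunk_tokens, doc_tokens):
    """Type #1: tf in the chunk, idf over the whole suspicious document.
    A log-scaled idf is assumed, penalizing terms frequent in the document."""
    term_freq_in_doc = Counter(doc_tokens)[term]
    return tf(term, chunk_tokens) * math.log(len(doc_tokens) / (1 + term_freq_in_doc))


def tfidf2(term, chunk_tokens, corpus_doc_freq, n_corpus_docs):
    """Type #2: tf in the chunk, idf from a background corpus (PAN 2011 in
    the paper); corpus_doc_freq maps term -> number of corpus docs containing it."""
    df = corpus_doc_freq.get(term, 0)
    return tf(term, chunk_tokens) * math.log(n_corpus_docs / (1 + df))
```

Under either scheme, a term that is frequent in the chunk but rare elsewhere scores highest, which is exactly the property the keyword and noun phrase selection steps rely on.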
The remaining terms formulate a query, and in the next step the top-weighted sentences are selected for query formulation.

2.3 Query Formulation

For the top sentences selected in the previous step, the extracted keywords are simply placed next to each other, in their order of appearance in the sentence, and passed to the next step as a query. In accordance with ChatNoir limitations, the number of words in each query is limited to 10.

2.4 Search Control

In this step, the constructed queries are filtered based on previously downloaded documents. The input query is compared against the documents downloaded in previous rounds of the source retrieval steps. If there is at least one downloaded document that contains at least 60% of the query's words, the query is dropped and not passed to the next step. This threshold was chosen based on experiments.

2.5 Document Filtering and Downloading

We have used the ChatNoir API to submit the input queries to the search engine; the 14 top-ranked results are returned for each query. The input query is divided into two sub-queries, and for each of them, one snippet with a length of 500 characters is extracted per returned document. These snippets are combined into a passage. If the resulting passage contains at least 50% of the query's words, the related document is downloaded and kept for the search control operation. This threshold was chosen based on experiments.

3 Evaluations and Discussion

We have implemented our approach using the Python programming language and the NLTK package for text processing operations [7].
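The two filtering thresholds of Sections 2.4 and 2.5 can be sketched as follows. The 60% and 50% thresholds are taken from the paper; the word-set overlap measure and the function names are illustrative assumptions about how the comparison is carried out.

```python
def word_overlap(query, text):
    """Fraction of the query's distinct words that also occur in `text`."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def should_drop_query(query, downloaded_docs, threshold=0.6):
    """Search control (Section 2.4): drop the query if any previously
    downloaded document already contains at least 60% of its words."""
    return any(word_overlap(query, doc) >= threshold for doc in downloaded_docs)


def should_download(query, snippet_passage, threshold=0.5):
    """Document filtering (Section 2.5): download the result if the
    combined snippet passage contains at least 50% of the query's words."""
    return word_overlap(query, snippet_passage) >= threshold
```

In this design, the set of downloaded documents grows over the run, so later queries whose words are already well covered are suppressed, which is what drives down the total number of submitted queries.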
At first, the prepared software for source retrieval was run on the training dataset [8], and after getting feedback from the results, the following parameters were optimized:
- chunk length
- number of queries in each chunk
- number of returned results for each query
- similarity threshold between a query and the resulting snippets
- similarity threshold between a query and downloaded documents

Then the software was placed on a dedicated virtual machine and run on the test dataset through TIRA [9]. Table 2 shows the results achieved by our software on the PAN 2015 test dataset; our software achieved the highest rank among all participants in the "No Detection", "Queries", and "Runtime" measures. According to the "No Detection" score, for only one plagiarized document was no true positive detection made. However, the number of downloads is relatively high. One of our main objectives was to deploy a query building method with special keywords to achieve high recall, and our algorithm reached the second-highest recall score among all participants. In the download filtering step, the software gathers a complete set of suitable source documents and uses them in a simple query filtering method; as a result, the number of queries our software submits to the search API is the lowest among all participants. Since we used a simple approach in download filtering, we also achieved the best rank in the software runtime measure among the participants.

Table 2. Performance of our approach on source retrieval 2015

Downloads | F1 | No Detection | Precision | Queries | Recall | Runtime
183.3 | 0.1154 | 1 | 0.07539 | 43.5 | 0.41381 | 8:32:37

4 Conclusion and Future Works

In this paper, we have described an approach for the source retrieval task of the PAN competition. Our system achieved the second-highest rank in recall and the first in the "No Detection" score. Because of the high detection power of the system, the collected documents cover most of the relevant sources.
Extra queries are filtered with a simple method that compares query terms against the collected documents. As a result, we achieved the highest rank in terms of "number of queries"; moreover, we also achieved the first rank in "runtime". For future work, we will try to decrease the number of downloaded source documents while keeping a complete set of related documents for query filtering.

Acknowledgement

This work has been accomplished at the ICT Research Institute, ACECR, with the support of the Vice Presidency for Science and Technology of Iran, grant No. 1164331. The authors gratefully acknowledge the support of the aforementioned organizations. Special thanks go to the members of the ITBM research group for their valuable collaboration.

References

1. Potthast, Martin, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, and Benno Stein. Overview of the 4th International Competition on Plagiarism Detection. In Working Notes Papers of the CLEF 2012 Evaluation Labs, September 2012. ISBN 978-88-904810-3-1. ISSN 2038-4963.
2. Potthast, Martin, Matthias Hagen, Benno Stein, Jan Graßegger, Maximilian Michel, Martin Tippmann, and Clement Welsch. "ChatNoir: a search engine for the ClueWeb09 corpus." In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1004-1004. ACM, 2012.
3. Strohman, Trevor, Donald Metzler, Howard Turtle, and W. Bruce Croft. "Indri: A language model-based search engine for complex queries." In Proceedings of the International Conference on Intelligent Analysis, vol. 2, no. 6, pp. 2-6. 2005.
4. Matsuo, Yutaka, and Mitsuru Ishizuka. "Keyword extraction from a single document using word co-occurrence statistical information." International Journal on Artificial Intelligence Tools 13, no. 01 (2004): 157-169.
5. Witten, Ian H., Gordon W.
Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. "KEA: Practical automatic keyphrase extraction." In Proceedings of the Fourth ACM Conference on Digital Libraries, pp. 254-255. ACM, 1999.
6. Abney, Steven Paul. "The English noun phrase in its sentential aspect." PhD diss., Massachusetts Institute of Technology, 1987.
7. Bird, Steven. "NLTK: the natural language toolkit." In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69-72. Association for Computational Linguistics, 2006.
8. Potthast, Martin, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. "An evaluation framework for plagiarism detection." In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 997-1005. Association for Computational Linguistics, 2010.
9. Gollub, Tim, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. "Recent trends in digital text forensics and its evaluation." In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, pp. 282-302. Springer Berlin Heidelberg, 2013.