Dynamically Adjustable Approach through Obfuscation Type Recognition
Notebook for PAN at CLEF 2015

Miguel A. Sanchez-Perez, Alexander Gelbukh, and Grigori Sidorov

Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico
masp1988@hotmail.com, gelbukh@gelbukh.com, sidorov@cic.ipn.mx

Abstract. The task of (monolingual) text alignment consists in finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask of the plagiarism detection competition at PAN 2015. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme and on the cosine and Dice similarity measures. We extended our previous clustering algorithm, introduced a new verbatim detection method, and extended the decision making about which approach or output to use. We significantly improve on our previous PAN 2014 approach and hence outperform the best-performing system of PAN 2014. Our system is available open source.

1 Introduction

Plagiarism detection, and more generally text reuse detection, has become a hot research topic given the increasing amount of information being produced as a result of easy access to the Web, large databases, and telecommunication in general, which poses a serious problem for publishers, researchers, and educational institutions [5]. Plagiarism detection techniques are also useful in applications such as content authoring systems, which offer fast and simple means for adding and editing content and in which avoiding content duplication is desired [1]. Hence, detecting text reuse has become imperative in such contexts.

Text reuse detection is divided into two main tasks: first, searching for documents related to a suspicious document, and then finding concrete evidence that passages of text were reused. This paper focuses on the second task, called text alignment, as part of the PAN 2015 lab [10]. Text alignment may seem a simple task, but it implies a heavy workload. Reused passages are usually not verbatim copies; instead, some degree of obfuscation is introduced to make the new text look like original work. Given the number of obfuscation techniques that might be present in a reused passage, proposing a single method that addresses all of them is nearly impossible. Another issue is how strict a method should be: a stricter approach could find cases of text reuse more precisely but discard the cases with a higher level of obfuscation, whereas a more lenient approach could produce many false positives. Finding this balance between precision and recall, and creating a method capable of recognizing several obfuscation types, are the biggest challenges for a text alignment model. In our approach, we used two different methods: a stricter one based on string matching and another based on text similarity at the sentence level. We combined both methods, resulting in three possible outputs, and decide which one is most suitable according to the type of obfuscation detected. Our system is available open source.¹

¹ http://www.gelbukh.com/plagiarism-detection/PAN-2014

2 Our Approach

We describe our approach in terms of the common building blocks used in text alignment algorithms in previous years [9]: seeding, extension, and filtering. Our approach is not built solely on these blocks: we also implemented a method based on string matching and a decision maker that determines which output to use, as explained in Section 2.4 on adaptive behavior. The seeding, extension, and filtering stages were described in our previous work [12]; to keep this paper self-contained, we include a brief description of them. However, the clustering algorithm was modified, since we made it recursive.

2.1 Seeding

Given a suspicious document and a source document, the task of the seeding stage is to construct a large set S of short similar passages called seeds. We extract the seeds by segmenting the documents into sentences and computing the cosine and Dice similarities between them. We used a tf-idf-like weighting scheme in which sentences play the role of documents: for a term t, a sentence s, and the set D of sentences of a document,

$$\mathrm{tf}(t,s) = f(t,s), \tag{1}$$

$$\mathrm{isf}(t,D) = \log \frac{|D|}{|\{s \in D : t \in s\}|}, \tag{2}$$

$$w(t,s) = \mathrm{tf}(t,s) \times \mathrm{isf}(t,D), \tag{3}$$

where f(t, s) is the frequency of the term t in the sentence s.

In Fig. 1, each dot represents a seed in the comparison of two documents that contain plagiarism cases with random obfuscation; a darker color indicates a greater degree of similarity between the two sentences.

Fig. 1. Seeds representation: similarity between sentences of the suspicious document and the source document
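To make the weighting scheme concrete, the following is a minimal Python sketch of the seeding stage. It assumes already-preprocessed, tokenized sentences, computes the isf statistics separately for each document, and takes the Dice coefficient over the term sets; the thresholds th_cos = 0.30 and th_dice = 0.33 are taken from Table 1, while all function names and these representation details are ours and may differ from the actual implementation.

```python
# A minimal sketch of the seeding stage (not the authors' code).
import math
from collections import Counter

def isf_weights(sentences):
    """Inverse sentence frequency, eq. (2): sentences play the role of documents."""
    n = len(sentences)
    df = Counter(t for sent in sentences for t in set(sent))
    return {t: math.log(n / df[t]) for t in df}

def weight_vector(sentence, isf):
    """Sparse tf * isf vector for one sentence, eqs. (1) and (3)."""
    tf = Counter(sentence)
    return {t: tf[t] * isf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def dice(u, v):
    """Dice coefficient over the term sets of two sentences."""
    return 2.0 * len(set(u) & set(v)) / (len(u) + len(v)) if u or v else 0.0

def seeding(susp_sents, src_sents, th_cos=0.30, th_dice=0.33):
    """Collect the seed pairs (i, j) of sentences similar under both measures."""
    isf_susp, isf_src = isf_weights(susp_sents), isf_weights(src_sents)
    vs = [weight_vector(s, isf_susp) for s in susp_sents]
    vr = [weight_vector(s, isf_src) for s in src_sents]
    return [(i, j) for i, u in enumerate(vs) for j, v in enumerate(vr)
            if cosine(u, v) >= th_cos and dice(u, v) >= th_dice]
```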
2.2 Extension

Given the set of seeds S, defined as the pairs (i, j) of similar sentences, the task of the extension stage is to form larger text fragments that are similar between the two documents. For this, the sentences i are joined into maximal contiguous fragments of the suspicious document and the sentences j into maximal contiguous fragments of the source document, in such a way that the resulting large fragments are still similar. The extension stage is divided into two recursive procedures: (1) clustering and (2) validation. In the clustering step, we first cluster the seeds according to a distance maxgap looking only at the suspicious side, and then divide these clusters alternating between the two sides. The validation step consists in assessing the quality of the clusters produced by the clustering step. The quality is given by the cosine similarity between the text fragments that a cluster represents in each document, together with the number of seeds in the cluster. If the similarity in a cluster is less than a given threshold th_validation, the cluster goes back to the clustering stage with a distance of maxgap − 1, and so on down to maxgap_least. If a cluster has fewer than minsize seeds, it is discarded. We measured the similarity between the text fragments F_susp and F_src by computing the cosine between their vectors; the vector representation of a fragment is obtained by adding together the vectors of all the sentences of F_susp and F_src, respectively:

$$\mathrm{similarity}(F_{\mathrm{susp}}, F_{\mathrm{src}}) = \cos\Biggl(\sum_{v \in F_{\mathrm{susp}}} v,\; \sum_{v \in F_{\mathrm{src}}} v\Biggr). \tag{4}$$

Following the example from Fig. 1, the expected result and output of our extension method is shown in Fig. 2. The output of the extension stage is a set of pairs of similar text fragments {(F_susp, F_src), ...}. The diagram of the extension algorithm is shown in Fig. 3.

Fig. 2. Clusters and corresponding fragments representing plagiarism cases
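Below is a rough Python sketch of this recursive extension stage under our reading of the description above. The single clustering pass per side is a simplification of the alternating procedure, and frag_similarity is a caller-supplied stand-in for eq. (4), the cosine between the summed sentence vectors of the fragments a cluster spans; parameter defaults follow Table 1, and all names are ours.

```python
# A rough sketch of the recursive extension stage (clustering + validation).

def split_by_gap(seeds, gap, side):
    """Cluster seeds so that consecutive sentence indices on one side
    (side=0: suspicious, side=1: source) differ by at most `gap`."""
    clusters, current = [], []
    for seed in sorted(seeds, key=lambda p: p[side]):
        if current and seed[side] - current[-1][side] > gap:
            clusters.append(current)
            current = []
        current.append(seed)
    if current:
        clusters.append(current)
    return clusters

def extend(seeds, frag_similarity, gap=4, gap_least=0,
           th_validation=0.34, minsize=1):
    """Return validated seed clusters; each cluster spans a pair of
    fragments (F_susp, F_src) via its minimal and maximal indices."""
    cases = []
    for by_susp in split_by_gap(seeds, gap, side=0):        # suspicious side first
        for cluster in split_by_gap(by_susp, gap, side=1):  # then the source side
            if len(cluster) < minsize:
                continue                        # too few seeds: discard
            if frag_similarity(cluster) >= th_validation:
                cases.append(cluster)           # validated cluster
            elif gap > gap_least:
                # Quality too low: re-cluster with a smaller allowed gap.
                cases.extend(extend(cluster, frag_similarity, gap - 1,
                                    gap_least, th_validation, minsize))
    return cases
```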
2.3 Filtering

Given the set {(F_susp, F_src), ...} of plagiarism cases, the task of the filtering stage is to improve precision (at the expense of recall) by removing some “bad” plagiarism cases. We did the filtering in two stages: first, we resolved overlapping fragments; then, we removed fragments that are too short (in what follows, we refer only to fragments that represent plagiarism cases, not to arbitrary fragments of the documents).

Resolving overlapping cases. We call two plagiarism cases (F′_susp, F′_src) and (F″_susp, F″_src) overlapping if the fragments F′_susp and F″_susp share (in the suspicious document) at least one sentence. We assume that the same source fragment can be used several times in a suspicious document, but not vice versa: each sentence can be plagiarized from only one source and thus can belong to only one plagiarism case. To simplify things, instead of re-assigning only the overlapping parts, we simply discarded whole cases that overlapped with other cases. Specifically, we used the following algorithm:

1. While there exists a case P (“pivot”) that overlaps with some other case:
   (a) Let Ψ(P) be the set of cases Q ≠ P overlapping with P;
   (b) For each Q ∈ Ψ(P), compute the qualities q_Q(P) and q_P(Q); see (5);
   (c) Find the maximum value among all the obtained q_y(x);
   (d) Discard all cases in Ψ(P) ∪ {P} except the found x.

In our implementation, at the first step we always used the first case from the beginning of the suspicious document.

Fig. 3. Diagram of the extension algorithm

We compute the quality function q_y(x) of a case x with respect to an overlapping case y as follows. The overlapping cases x = (F^x_susp, F^x_src) and y = (F^y_susp, F^y_src) are pairs of corresponding fragments. Let O = F^x_susp ∩ F^y_susp be the overlap and N = F^x_susp \ O be the non-overlapping part. Then the quality is

$$q_y(x) = \mathrm{sim}_{F^x_{\mathrm{src}}}(O) + \bigl(1 - \mathrm{sim}_{F^x_{\mathrm{src}}}(O)\bigr) \times \mathrm{sim}_{F^x_{\mathrm{src}}}(N), \tag{5}$$

where sim is a non-symmetric similarity of a fragment F_susp (in the suspicious document) to a reference fragment F_src (in the source document):

$$\mathrm{sim}_{F_{\mathrm{src}}}(F_{\mathrm{susp}}) = \frac{1}{|F_{\mathrm{susp}}|} \sum_{s \in F_{\mathrm{susp}}} \max_{r \in F_{\mathrm{src}}} \cos(s, r). \tag{6}$$

Formula (5) combines the similarity of the overlapping part and of the non-overlapping part of the suspicious fragment to the source counterpart.

Removing small cases. We also discard plagiarism cases that relate fragments that are too small: if either the suspicious or the source fragment of a case is shorter than minplaglen characters, the case is discarded.
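The following Python sketch illustrates the overlap-resolution step and the quality function of eqs. (5) and (6). It represents each case as a dict {'susp': set_of_sentence_ids, 'src': set_of_sentence_ids}, reuses cosine() from the seeding sketch, and picks the pivot with a simple scan rather than strictly from the beginning of the suspicious document; this representation and all names are our assumptions, not the actual implementation.

```python
# A sketch of overlap resolution between plagiarism cases (names are ours).

def fragment_sim(susp_ids, src_ids, susp_vecs, src_vecs):
    """Non-symmetric similarity, eq. (6): for each suspicious sentence,
    take its best cosine match in the source fragment, then average."""
    if not susp_ids or not src_ids:
        return 0.0
    return sum(max(cosine(susp_vecs[i], src_vecs[j]) for j in src_ids)
               for i in susp_ids) / len(susp_ids)

def quality(x, y, susp_vecs, src_vecs):
    """q_y(x), eq. (5): combine the overlap O and the remainder N of x."""
    overlap = x['susp'] & y['susp']                      # O
    rest = x['susp'] - overlap                           # N
    sim_o = fragment_sim(overlap, x['src'], susp_vecs, src_vecs)
    sim_n = fragment_sim(rest, x['src'], susp_vecs, src_vecs)
    return sim_o + (1.0 - sim_o) * sim_n

def resolve_overlaps(cases, susp_vecs, src_vecs):
    """Discard whole overlapping cases until no suspicious sentence
    belongs to more than one case."""
    cases = list(cases)
    while True:
        pivot = next((p for p in cases if any(
            q is not p and p['susp'] & q['susp'] for q in cases)), None)
        if pivot is None:
            return cases
        group = [q for q in cases
                 if q is not pivot and q['susp'] & pivot['susp']]
        # Compute q_Q(P) and q_P(Q) for every overlapping Q...
        scored = [(quality(pivot, q, susp_vecs, src_vecs), pivot) for q in group]
        scored += [(quality(q, pivot, susp_vecs, src_vecs), q) for q in group]
        best = max(scored, key=lambda t: t[0])[1]   # ...and keep only the winner.
        drop = {id(c) for c in group + [pivot] if c is not best}
        cases = [c for c in cases if id(c) not in drop]
```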
2.4 Adaptive behavior

At the PAN competition, the methods are evaluated on a corpus that contains plagiarism cases created using four different types of obfuscation: none, random, translation, and summary. In the training dataset, we have the option of testing our approaches on sub-corpora divided according to the type of obfuscation used to create the plagiarism cases. We observed that the optimal parameters of our method differ for these diverse types of obfuscated plagiarism cases. Therefore, we introduced three alternative paths and decide which output to use according to the type of obfuscation we are likely dealing with in each specific document pair. The final setup of our approach is shown in Fig. 4. After the initial preprocessing and seeding steps, we follow two separate paths with different maxgap values: one value (maxgap_summary) that we found to be best for the summary obfuscation sub-corpus and one that was best for the other three sub-corpora (maxgap). After obtaining the plagiarism cases using these two different settings, we apply a verbatim detection method on the non-summary path, which results in three possible outputs: verbatim plagiarism cases (cases V), summary plagiarism cases (cases S), and other plagiarism cases (cases O).

Verbatim detector. The verbatim detection method is based on the Longest Common Substring (LCS) algorithm. We modified the LCS algorithm to work on words instead of characters and to find every common sequence of words longer than a certain threshold measured in characters (th_verbatim).
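A simplified sketch of such a detector is shown below: a word-level variant of the classical quadratic common-substring dynamic program that reports every maximal common run of words whose character length reaches th_verbatim (256 characters, Table 1). This only illustrates the idea; the names and the exact maximality test are ours.

```python
# A word-level common-substring detector (a sketch, not the authors' code).

def verbatim_cases(susp_words, src_words, th_verbatim=256):
    """Return triples (i, j, k): a maximal run of k common words starting
    at susp_words[i] and src_words[j] that is long enough in characters."""
    n, m = len(susp_words), len(src_words)
    # table[i][j] = length in words of the common run ending at i-1, j-1.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if susp_words[i - 1] == src_words[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
    runs = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = table[i][j]
            # A run is maximal if it cannot be extended by the next word pair.
            if k and (i == n or j == m or table[i + 1][j + 1] == 0):
                if len(' '.join(susp_words[i - k:i])) >= th_verbatim:
                    runs.append((i - k, j - k, k))
    return runs
```

For instance, verbatim_cases('one two three'.split(), 'zero one two three four'.split(), th_verbatim=10) returns [(0, 1, 3)]: the run "one two three" of 13 characters starting at word 0 of the suspicious text and word 1 of the source.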
Output selector. The decision of which of the three outputs (cases V, cases S, and cases O) to report as the final result of our approach follows a decision cascade in which the verbatim plagiarism cases have priority, then the summary cases, and finally the other cases; the three possible outputs are mutually exclusive. If there is at least one verbatim case, the document pair is considered a no-obfuscation pair and the verbatim output is reported. If no verbatim cases were found, we decide whether cases S are likely to represent summary obfuscation, judging by the relative length of the suggested suspicious fragments with respect to the source fragments. Specifically, the decision is based on the variables srclen and susplen, which correspond to the total length of all passages, in characters, in the source document and the suspicious document, respectively: when susplen is much smaller than srclen, we are likely dealing with summary obfuscation. If both the verbatim and the summary cases were discarded, then the reported output is the other plagiarism cases (cases O).

Fig. 4. Diagram of the final approach
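A Python sketch of this decision cascade is given below, assuming each plagiarism case is a (suspicious_text, source_text) pair of character strings. The paper states only that susplen must be "much smaller" than srclen; the summary_ratio threshold below is an illustrative placeholder, not the value used in our system.

```python
# A sketch of the output selector (the ratio threshold is illustrative only).

def select_output(cases_v, cases_s, cases_o, summary_ratio=0.3):
    """Decision cascade: verbatim cases first, then summary, then other."""
    if cases_v:
        # At least one verbatim case: treat the pair as non-obfuscated.
        return cases_v
    susplen = sum(len(susp) for susp, _ in cases_s)  # total chars, suspicious side
    srclen = sum(len(src) for _, src in cases_s)     # total chars, source side
    if cases_s and susplen < summary_ratio * srclen:
        return cases_s  # suspicious passages much shorter: likely summary
    return cases_o
```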
Table 1 shows the final settings of the parameters used in our system.

Table 1. Parameter settings

Parameter       Value   Parameter        Value
minsentlen      3       maxgap_summary   24
th_cos          0.30    maxgap_least     0
th_dice         0.33    minsize          1
th_validation   0.34    minplaglen       150
maxgap          4       th_verbatim      256

3 Experimental Results

We trained our system on the corpus provided for the PAN 2014 competition, using the performance measures introduced in [11], and compared this approach with our previous one on all the sub-corpora and measures. We tested our system on the PAN 2014 training corpus using the TIRA platform [4]. As Table 2 shows, the performance on the no-obfuscation sub-corpus increased dramatically thanks to the new verbatim detector. We also observed an increase on the summary obfuscation sub-corpus thanks to the new parameter settings. These improvements came without significantly affecting the other sub-corpora, and hence the overall result increased as well.

Table 2. Comparison of our PAN 2015 and PAN 2014 approaches on the PAN 2014 training corpus

                    PAN 2015 approach                   PAN 2014 approach
Obfuscation   Plagdet Recall  Precision Granul.   Plagdet Recall  Precision Granul.
None          0.9812  0.9761  0.9933    1.0048    0.8938  0.9782  0.8228    1.0000
Random        0.8847  0.8699  0.8999    1.0000    0.8886  0.8581  0.9213    1.0000
Translation   0.8792  0.9128  0.8481    1.0000    0.8839  0.8902  0.8777    1.0000
Summary       0.6304  0.4862  0.9739    1.0404    0.5772  0.4247  0.9941    1.0434
Entire        0.9025  0.8937  0.9164    1.0036    0.8773  0.8799  0.8774    1.0021

The results show that recall is the measure in which we excel, but that we need to improve the precision of the model by identifying and adjusting to obfuscation types other than just summary and verbatim. Regarding the system runtime, even though our goal was not efficiency, our software performed at an average level.

In Table 3 we present a comparison between our approach and the PAN 2014 participants, showing a remarkable improvement.

Table 3. Our approach compared with the official PAN 2014 results reported in [9]

Team             PlagDet  Recall   Precision  Granularity  Runtime
Sanchez-Perez15  0.9010   0.8957   0.9125     1.0046       –
Sanchez-Perez14  0.8781   0.8790   0.8816     1.0034       00:25:35
Oberreuter       0.8693   0.8577   0.8859     1.0036       00:05:31
Palkovskii       0.8680   0.8263   0.9222     1.0058       01:10:04
Glinos           0.8593   0.7933   0.9625     1.0169       00:23:13
Shrestha         0.8440   0.8378   0.8590     1.0070       69:51:15
R. Torrejón      0.8295   0.7690   0.9042     1.0027       00:00:42
Gross            0.8264   0.7662   0.9327     1.0251       00:03:00
Kong             0.8216   0.8074   0.8400     1.0030       00:05:26
Abnar            0.6722   0.6116   0.7733     1.0224       01:27:00
Alvi             0.6595   0.5506   0.9337     1.0711       00:04:57
Baseline         0.4219   0.3422   0.9293     1.2747       00:30:30
Gillam           0.2830   0.1684   0.8863     1.0000       00:00:55

4 Conclusions and Future Work

We have described our approach to the task of text alignment in the context of the PAN 2015 competition, showing a great improvement. The additions over our previous work are the verbatim detector, the recursive application of the clustering, and the optimization of the parameters. We also tested other methods to detect paraphrasing, but they need further improvement and testing, which is not possible with the resources available at PAN so far.

In our future work, we plan to use linguistically motivated methods to address possible paraphrase obfuscation [2] and to test them on the P4P corpus.² We also plan to build a meta-classifier that would guess which obfuscation type of plagiarism case we are dealing with at each moment and dynamically adjust the parameters. Finally, we plan to apply concept-based models for similarity and paraphrase detection [6–8].

² http://clic.ub.edu/corpus/en/paraphrases-en

Acknowledgements. Work done under partial support of FP7-PEOPLE-2010-IRSES: Web Information Quality – Evaluation Initiative (WIQ-EI) European Commission project 269180, Government of Mexico (SNI, CONACYT), and Instituto Politécnico Nacional, Mexico (SIP 20144274, 20150028, BEIFI, COFAA).

References

1. Bär, D., Zesch, T., Gurevych, I.: Text reuse detection using a composition of text similarity measures. In: Kay, M., Boitet, C. (eds.) COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8–15 December 2012, Mumbai, India. pp. 167–184. Indian Institute of Technology Bombay (2012)
2. Barrón-Cedeño, A., Vila, M., Martí, M.A., Rosso, P.: Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics 39(4), 917–947 (2013)
3. Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.): Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014, CEUR Workshop Proceedings, vol. 1180. CEUR-WS.org (2014)
4. Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower research: Towards a web framework for providing experiments as a service. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M. (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12). pp. 1125–1126. ACM (Aug 2012)
5. Maurer, H., Kappe, F., Zaka, B.: Plagiarism – A survey. Journal of Universal Computer Science 12(8), 1050–1084 (Aug 2006)
6. Poria, S., Agarwal, B., Gelbukh, A., Hussain, A., Howard, N.: Dependency-based semantic parsing for concept-level text analysis. In: Gelbukh, A.F. (ed.) Computational Linguistics and Intelligent Text Processing, 15th International Conference, CICLing 2014, Kathmandu, Nepal, April 6–12, 2014, Proceedings, Part I. Lecture Notes in Computer Science, vol. 8403, pp. 113–127. Springer (2014)
7. Poria, S., Cambria, E., Ku, L.W., Gui, C., Gelbukh, A.: A rule-based approach to aspect extraction from product reviews. In: Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP). pp. 28–37. Association for Computational Linguistics and Dublin City University, Dublin, Ireland (August 2014)
8. Poria, S., Cambria, E., Winterstein, G., Huang, G.: Sentic patterns: Dependency-based rules for concept-level sentiment analysis. Knowledge-Based Systems 69, 45–63 (2014)
9. Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th International Competition on Plagiarism Detection. In: Cappellato et al. [3], pp. 845–876
10. Potthast, M., Hagen, M., Göring, S., Rosso, P., Stein, B.: Towards data submissions for shared tasks: First experiences for the task of text alignment. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2015)
11. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Huang, C., Jurafsky, D. (eds.) COLING 2010, 23rd International Conference on Computational Linguistics, Posters Volume, 23–27 August 2010, Beijing, China. pp. 997–1005. Chinese Information Processing Society of China (2010)
12. Sanchez-Perez, M.A., Sidorov, G., Gelbukh, A.: The winning approach to text alignment for text reuse detection at PAN 2014. In: Cappellato et al. [3], pp. 1004–1011