=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-Pan-PalkovskiiEt2014
|storemode=property
|title=Developing High-resolution Universal Multi-type n-gram Text Similarity Detector
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-PalkovskiiEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/PalkovskiiB14
}}
==Developing High-resolution Universal Multi-type n-gram Text Similarity Detector==
Developing High-Resolution Universal Multi-Type N-Gram Plagiarism Detector

Notebook for PAN at CLEF 2014

Yurii Palkovskii, Alexei Belov

Zhytomyr Ivan Franko State University, Institute of Foreign Philology
SkyLine LLC, Plagiarism Detector Project
vb-expert@yandex.ru, cargo-base@yandex.ru

Abstract. This paper describes the approaches used for the Plagiarism Detection task of the PAN 2014 International Competition on Uncovering Plagiarism, Authorship, and Social Software Misuse, which scored 1st place (plagdet 0.907) on test corpus no. 3 and 3rd place (plagdet 0.868) on test corpus no. 2. In this work we aggregated all of our previous experience from the PAN 12 and PAN 13 research works [2] and further improved our previously developed plagiarism detection methods [8] with the help of contextual n-grams, surrounding-context n-grams, named-entity-based n-grams, odd-even skip n-grams, functional-word-frame-based n-grams, a TF-IDF sentence-level similarity index, a noise-sensitive clusterization algorithm, and summary-type detection heuristics, combined into a single model that marks similar sections and thus effectively detects different types of obfuscation techniques.

1 Introduction

Each year the PAN competition pushes the baseline of plagiarism detection effectiveness forward, making it ever harder to compete: the top plagdet scores are approaching their absolute limits even as more demanding plagiarism types are included in the new corpora. This year we tried to incorporate all the knowledge we could gather from previously published PAN works [3,2,4,5] and to make the most of the algorithms that have proved to perform best over the last years. The introduction of the recently developed TIRA platform [1] greatly simplified software deployment, making it easier both to participate in the competition and to keep local results synchronized with the results in the test environment.

2 Methods

The main goal of this year's research was to aggregate the most efficient methods [6,7] that have already been applied to the task of plagiarism detection. The dominating approach, which showed the best results and greatly improved recall, was the one demonstrated by Efstathios Stamatatos [7], namely the generation of stop-word n-grams. We combined this idea with another one introduced by Jan Kasprzak [3,4]: generating context-based n-grams from the exact sentence context, extracting n-grams from the surrounding sentences, and generating n-grams by cyclical word deletion. We also introduced a "Named Entity injection as n-gram" method. Strictly speaking it is not an n-gram at all, but we wanted to "inject" Named Entity uniqueness into the general model, so we harvest all Named Entities from the text and treat each one as a regular n-gram, even though it lacks the multiple words a typical n-gram consists of. Our analysis produced the following n-gram generation methods:

1. Regular n-grams
2. Variable-length stop-word n-grams
3. Named-Entity-based n-grams
4. Most-frequently-used-words n-grams

The n-gram set is then expanded by:

1. Odd-even generation on the sentence level
2. Stemming of the resulting n-gram set
3. Single-word deletion on the fingerprint level
4. Alphabetical sorting of the words inside each n-gram ("alphasorting")

Text preprocessing included special-symbol removal and space trimming. The input text was parsed on two levels: the sentence level and the word level. The stop-word n-gram and alphasorting ideas are illustrated in the sketch below.
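The following is a minimal sketch (ours, not the authors' code) of two of the ideas listed above: stop-word n-gram generation in the spirit of Stamatatos [7], followed by the alphasorting expansion and hashing into fingerprints, as done later in step 5 of the algorithm. The stop-word list, the n-gram length of 3, and all identifiers are illustrative assumptions.

```csharp
// Illustrative sketch only; the stop-word list, n-gram length, and all
// names here are our assumptions, not the authors' implementation.
using System;
using System.Collections.Generic;
using System.Linq;

class NGramSketch
{
    // A tiny stand-in stop-word list; the real detector would use a full one.
    static readonly HashSet<string> StopWords = new HashSet<string>
        { "the", "a", "of", "and", "to", "in", "that", "is" };

    // Keep only the stop-words, then emit consecutive n-grams over them.
    static IEnumerable<string[]> StopWordNGrams(string[] words, int n)
    {
        var kept = words.Where(w => StopWords.Contains(w)).ToArray();
        for (int i = 0; i + n <= kept.Length; i++)
            yield return kept.Skip(i).Take(n).ToArray();
    }

    // "Alphasorting": sort the words inside the n-gram so that local
    // word-order obfuscation maps to the same fingerprint.
    static string AlphaSort(string[] gram) =>
        string.Join(" ", gram.OrderBy(w => w, StringComparer.Ordinal));

    static void Main()
    {
        string text = "the cat sat in the corner of the old house and slept";
        var words = text.Split(' ');
        foreach (var gram in StopWordNGrams(words, 3))
        {
            string fingerprint = AlphaSort(gram);
            // In the full pipeline the hash would be stored in a hash table
            // keyed by hash value, with the n-gram offset as the value.
            Console.WriteLine($"{fingerprint} -> {fingerprint.GetHashCode()}");
        }
    }
}
```

Alphasorting makes the fingerprint invariant to local word reordering, one of the cheaper obfuscation techniques found in the PAN corpora.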
We applied an angled-ellipse-based graphical clustering algorithm [8] to define clusters of shared fingerprints. The main approach is to detect which kind of plagiarism dominates within a document pair by applying a special analysis that includes global noise-level detection, a single-stage check for the presence of verbatim plagiarism clusters, shared-fingerprint sequence analysis, diagonal density analysis, and summary-type detection, and then selecting the analysis preset to be used for the final pass. Our software selects one of four possible parameter presets for the final detection strategy:

1. Verbatim plagiarism
2. Random plagiarism
3. Summary-type plagiarism
4. Undefined type with high noise

The algorithm in detail:

1. Both d_1, the first document, and d_2, the second document, are preloaded with the same preloading sequence.

2. The text is preprocessed by removing all special characters and text control symbols while keeping the initial offsets intact. Diacritics are neutralized with the .NET String.Normalize method. At the word delimitation stage we also apply the StringComparison.InvariantCultureIgnoreCase option to improve split quality. Sentences are delimited by our own sentence-split heuristics, based on a set of predefined rules. If the global parameter for punctuation removal is on, the application also removes punctuation marks, again with the offsets left intact.

3. Each sentence is split into words. We use a text representation with an object structure that gives access to each sentence and word by both value and offset, backed by hash tables and inverted indices with performance in mind.

4. If the corresponding global flag is on, we run quality preprocessing for each word, correcting the resulting word length and offset so that the global offsets remain correct.

5. At this stage we generate all required n-grams using several strategies: structural n-grams, regular n-grams, stop-word-based n-grams, Named-Entity-based n-grams, and near-context-based n-grams. The inputs are word sequences of different lengths and Named Entities, processed in several stages with a separate stage for each n-gram generation strategy [4,6,7,8]. All resulting n-grams are translated to hashes using the built-in .NET GetHashCode method and stored in a hash table, with the hash as the key and the n-gram offset as the value.

6. We run a complete search of the d_1 hash table against the d_2 hash table, aggregating the results into a list of "shared" n-grams.

7. At this stage we perform a sentence-against-sentence vector space model (VSM) statistical analysis. We pre-calculate a word vector for each sentence and then run an exhaustive sentence-against-sentence comparison, storing the cosine similarity as the result (a minimal sketch of this comparison follows the listing). If the similarity exceeds a predefined global threshold, we add a result cluster that points to the analyzed sentence pair. Interestingly, the results of this stage affect the final results only at the late stage 9.

8. For each detected "shared" n-gram we run a clustering algorithm to define the exact clusters of n-grams, which are the final product of the developed application. Our clusterization algorithm has already been described in our previous publications [8].

9. The clustering algorithm is the most complicated part of the software and its most computationally intensive procedure. The exact logic of this stage depends on a global strategy that is automatically selected in an attempt to better tackle a specific type of obfuscation.
10. We form the initial clusters from the detected "shared" n-grams via skewed-ellipse-based (last year we used a circle-based shape), offset-based, two-dimensional graphical analysis (an ellipse membership sketch also follows the listing).

11. We run the vector space model analysis routines at this point.

12. We merge the n-gram-based clusters with the VSM-detected clusters.

13. We compute the n-gram ordering in all detected clusters.

14. We run first-stage preprocessing for all detected clusters, which includes: building the n-gram distribution histogram [8], removing a cluster from the current result list if its absolute length is less than a globally predefined value, and removing a cluster if it lies inside another, larger cluster.

15. We run the final cluster re-merge procedure: if clusters are close enough (within a predefined global distance) they are merged into a single cluster, thus improving the granularity score.

16. We run the final cluster-list post-processing: removing all clusters that are too small in character length, contain very few n-grams, or lack a sufficient percentage of n-grams located on the main cluster diagonal (so-called "insufficient n-gram diagonal distribution"), plus an "is vertical" filter that mitigates malformed clusters and duplicate false positives caused by the PAN corpus.

17. Final results are formed from the remaining clusters. We use a specially developed visualization tool to graphically assess the resulting clusters, following our "visualize-analyze-correct" methodology. Below is a screenshot of the produced final analysis cluster graph, with a sample skewed ellipse that shows the exact "is-in-ellipse" n-gram clustering logic output. Structural n-grams are shown in green and are printed; regular n-grams are shown as red circles; the master cluster definition from the gold-standard XML descriptor is marked by a red rectangle, while the vector-space-model-detected cluster (grey) fully coincides with the finally detected cluster (black), which in turn coincides with the master cluster (red). Shared Named Entities are marked in blue and act as regular n-grams. The graph is formed using the "Zero Obfuscation Strategy" (see step 18 for more details).

[Figure: final analysis cluster graph, "Zero Obfuscation Strategy"]

18. Steps 1 to 17 above are strategy dependent; that is, they use a preset of parameters that best fits a specific obfuscation type. At the initial analysis stage the software analyses several critical characteristics of the input data: the global noise level, the cluster ordering, and the number of verbatim plagiarism clusters (via the percentage of clusters with an absolutely diagonal distribution). It then selects one of four possible parameter sets ("Zero Obfuscation Strategy"; "Low Obfuscation Strategy, Low Noise"; "High Noise Strategy"; "High Obfuscation Strategy, High Noise Strategy") and recursively runs another analysis. Each of these strategies uses its own globally predefined set of values and switches that define the exact changes to the routines in steps 1-17.

19. At this stage the software compares the results of the preliminary analysis used for strategy selection with the results obtained with the strategy applied; the final result is chosen here.
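As promised in step 7, here is a minimal sketch of the sentence-against-sentence vector space model comparison, assuming plain term-frequency vectors and an illustrative global threshold of 0.6; the authors' actual vector weighting (the abstract mentions TF-IDF) and threshold are not published in this notebook.

```csharp
// Illustrative sketch of step 7: sentences become term-frequency vectors,
// and pairs whose cosine similarity exceeds a global threshold are kept.
// The threshold and weighting scheme are assumptions, not the paper's values.
using System;
using System.Collections.Generic;
using System.Linq;

class CosineSketch
{
    static Dictionary<string, int> TermVector(string sentence) =>
        sentence.ToLowerInvariant().Split(' ')
                .GroupBy(w => w)
                .ToDictionary(g => g.Key, g => g.Count());

    static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = a.Where(kv => b.ContainsKey(kv.Key))
                      .Sum(kv => (double)kv.Value * b[kv.Key]);
        double na = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return dot == 0 ? 0 : dot / (na * nb);
    }

    static void Main()
    {
        const double Threshold = 0.6; // assumed global threshold
        double sim = Cosine(TermVector("the suspect copied this sentence"),
                            TermVector("this sentence the suspect copied"));
        if (sim >= Threshold)
            Console.WriteLine($"candidate sentence pair, cosine = {sim:F3}");
    }
}
```

Because the vectors are bags of words, a fully reordered sentence still scores 1.0, which is exactly why this stage complements the order-sensitive n-gram clusters rather than replacing them.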
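And here is a sketch of the "is-in-ellipse" test as we read step 10: each shared n-gram becomes a point (offset in d_1, offset in d_2), and cluster membership is decided by a rotated ellipse whose major axis follows the plagiarism diagonal. The 45-degree skew and the axis lengths are our illustrative assumptions, not the authors' tuned values.

```csharp
// Illustrative sketch of the skewed-ellipse membership test from step 10.
// Axis lengths and rotation angle are assumptions for demonstration.
using System;

class EllipseSketch
{
    // True if point (x, y) lies inside an ellipse centered at (cx, cy),
    // rotated by angle theta, with semi-axes a (major) and b (minor).
    static bool IsInEllipse(double x, double y, double cx, double cy,
                            double theta, double a, double b)
    {
        double dx = x - cx, dy = y - cy;
        double u = dx * Math.Cos(theta) + dy * Math.Sin(theta);  // along major axis
        double v = -dx * Math.Sin(theta) + dy * Math.Cos(theta); // along minor axis
        return (u * u) / (a * a) + (v * v) / (b * b) <= 1.0;
    }

    static void Main()
    {
        // An ellipse skewed 45 degrees to follow the d_1/d_2 diagonal.
        double theta = Math.PI / 4, a = 500, b = 100;
        Console.WriteLine(IsInEllipse(1300, 1350, 1000, 1000, theta, a, b)); // True: near diagonal
        Console.WriteLine(IsInEllipse(1300, 1000, 1000, 1000, theta, a, b)); // False: off diagonal
    }
}
```

A long, thin ellipse tolerates gaps along a genuine match while rejecting off-diagonal noise, which is consistent with the diagonal-distribution filters applied in step 16.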
3 Evaluation

The developed software scored 1st place with a plagdet score of 0.907 on test corpus no. 3 and 3rd place with a score of 0.868 on test corpus no. 2. At the moment we are not able to assess and explain the difference in the results: neither corpus is publicly downloadable yet and we do not know how the test corpus versions differ, so we are waiting for the test data release to further investigate the difference in the achieved recall and granularity.

Corpus   Plagdet   Recall    Precision   Granularity   Runtime
Test 3   0.90779   0.88916   0.92757     1.00027       00:57:15
Test 2   0.86806   0.82637   0.92227     1.00580       01:10:04

4 Conclusions and Future Work

The result achieved at this year's PAN competition shows really tight competition for every percent. Comparing our previous results with this year's, we conclude that we have made good progress, landing among the best competing teams at PAN. Unfortunately, due to many factors beyond the scope of this research, we were not able to fully optimize the developed system: the majority of the parameters were set heuristically to the most optimal values we initially came up with. This leaves considerable room for future improvement; applying a genetic algorithm to global parameter tuning should give noticeably better results and allow much finer tuning for each specific corpus.

Acknowledgments. We thank the PAN R&D team for their fantastic job of making it possible to deploy and run software on the TIRA platform, and for their prompt responses and discreet guidance both on the competition website and in the Google groups. We also thank the competing teams, as our research was heavily based on, and dramatically boosted by, the knowledge and experience they shared in their previously published works and in live workshop discussions at previous PANs. Finally, we thank the organizers for their support and for the official letter of invitation to attend the PAN workshop, which aided us with visa acquisition for the conference.

References

1. Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. An Evaluation Framework for Plagiarism Detection. In 23rd International Conference on Computational Linguistics (COLING 10), August 2010. Association for Computational Linguistics.

2. Martin Potthast, Matthias Hagen, Tim Gollub, Martin Tippmann, Johannes Kiesel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Overview of the 5th International Competition on Plagiarism Detection. In Pamela Forner, Roberto Navigli, and Dan Tufis, editors, Working Notes Papers of the CLEF 2013 Evaluation Labs, September 2013. ISBN 978-88-904810-3-1.

3. Rodríguez-Torrejón, D.A., Martín-Ramos, J.M.: "Detección de plagio en documentos: sistema externo monolingüe de altas prestaciones basado en n-gramas contextuales" (Plagiarism Detection in Documents: High-Performance Monolingual External Plagiarism Detector System Based on Contextual N-grams). Procesamiento del Lenguaje Natural, No. 45 (2010).

4. Rodríguez-Torrejón, D.A., Martín-Ramos, J.M.: CoReMo System (Contextual Reference Monotony): A Fast, Low-Cost and High-Performance Plagiarism Analyzer System. Lab Report for PAN at CLEF 2010. In Braschler, M., Harman, D., Pianta, E., editors, Notebook Papers of CLEF 2010 Labs and Workshops, 22-23 September, Padua, Italy, 2010.

5. Tim Gollub, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Recent Trends in Digital Text Forensics and its Evaluation.
In Pamela Forner, Henning Müller, Roberto Paredes, Paolo Rosso, and Benno Stein, editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization: 4th International Conference of the CLEF Initiative (CLEF 13), September 2013. Springer. ISBN 978-3-642-40801-4.

6. Šimon Suchomel, Jan Kasprzak, and Michal Brandejs. Three Way Search Engine Queries with Multi-feature Document Comparison for Plagiarism Detection. Notebook for PAN at CLEF 2012. In CLEF 2012 Evaluation Labs and Workshop, Working Notes Papers. ISBN 978-88-904810-3-1. URL http://www.clef-initiative.eu/publication/working-notes

7. Efstathios Stamatatos. Plagiarism Detection Using Stopword n-Grams. JASIST, 62(12):2512-2527, 2011. doi: http://dx.doi.org/10.1002/asi.21630.

8. Yurii Palkovskii and Alexei Belov. Applying Specific Clusterization and Fingerprint Density Distribution with Genetic Algorithm Overall Tuning in External Plagiarism Detection. Notebook for PAN at CLEF 2012. In CLEF 2012 Evaluation Labs and Workshop, Working Notes Papers. ISBN 978-88-904810-3-1. URL http://www.clef-initiative.eu/publication/working-notes.