Overview of the Author Obfuscation Task at PAN 2017: Safety Evaluation Revisited

Matthias Hagen, Martin Potthast, and Benno Stein

Bauhaus-Universität Weimar

We report on the second large-scale evaluation of style obfuscation approaches in a shared task on author obfuscation, organized at the PAN 2017 lab on digital text forensics. Author obfuscation means automatically paraphrasing a given text such that state-of-the-art authorship verification approaches misjudge a given pair of documents as having been written by "different authors" when they would have decided otherwise without obfuscation. This year, two new obfuscators are compared to the participants from last year's task against a total of 44 authorship verification approaches. The best-performing obfuscator significantly impacts the decisions of the authorship verifiers. However, as last year, the paraphrased texts are often no longer really human-readable and exhibit some changed context, indicating that there is still a way to go toward "perfect" automatic obfuscation that (1) tricks verification approaches, (2) keeps the meaning of the original, and (3) is, regarding its obfuscation, unsuspicious to the human eye.

Introduction

At PAN 2017 we organized the second shared task on author obfuscation in order to foster the exploration of potential vulnerabilities of author identification technology. As in the first edition, the specific task is that of author masking against authorship verification, which in turn has been a shared task at PAN 2013-2015 [11,17,18]. The following synopses point out the differences:

Authorship Verification

Given two documents, decide whether both have been written by the same author.

vs.

Author Masking

Given two documents from the same author, paraphrase the designated one such that an authorship verification will fail.

Figure 1 illustrates the setting and shows that the two tasks are diametrically opposed to each other: Success of a certain approach for one of these tasks depends on its "immunity" against the most effective approaches for the other. In our overview of last year's first author masking edition [16], we already included a survey of related work on author obfuscation. In particular, we introduced and discussed the "obfuscation impact measures" used in the evaluation, which we will quickly recap in Section 2. Section 3 reviews the obfuscation approaches that have been submitted to this year's edition of the shared task, and Section 4 reports on their evaluation against the state of the art in authorship verification.

Evaluating Author Obfuscation

As of last year, we consider three performance dimensions according to which an author obfuscation approach must excel to be considered fit for practical use. Obviously, the obfuscation performance should depend on the capability of fooling forensic experts, be it a piece of software or a human. However, fulfilling this requirement in isolation will disregard writers and their target audience, whose primary goal is to communicate, albeit safe from deanonymization: the quality of an obfuscated text along with the fact that its semantics is preserved are equally important. We hence call an obfuscation software

1. safe, if its obfuscated texts cannot be attributed to their original authors anymore,

2. sound, if its obfuscated texts are textually entailed by their originals, and

3. sensible, if its obfuscated texts are well-formed and inconspicuous.

These dimensions are orthogonal; an obfuscation software may meet each of them to a certain degree of perfection. Related work on operationalizing measures for these dimensions has been included in our overview from last year [16]. In order to analyze the safety dimension, we run the obfuscated texts against 44 authorship verification approaches and measure the impact of the obfuscation on the verifiers in the form of changed verification decisions (cf. last year's overview for details on the measures used [16]). As for sensibleness and soundness, we stick to manual inspection and grading of examples.

Survey of Submitted Obfuscation Approaches

The two approaches submitted to this year's edition of our shared task follow different strategies: sequence-to-sequence models and rule-based replacements. While a more conservative rule-based strategy often changes the to-be-obfuscated text only slightly, the sequence-to-sequence modeling can lead to substantial differences.

Bakhteev and Khazov

The approach of Bakhteev and Khazov [1] is mainly based on different sequence-to-sequence models and a small set of rules. The rules replace contractions (e.g., 'll → will), split or concatenate sentences using conjunctive words (e.g., and), and add or remove introductory phrases (e.g., anyway) to and from sentences, respectively. The main idea of sequence-to-sequence modeling comes in two flavors: (1) replacing synonyms based on nearest neighbors in word embeddings from a Wikipedia dump, and (2) an encoder-decoder approach that generates a "reproduced" version of the original text, which is also based on embeddings trained on a Wikipedia dump. In both cases, the authors choose, from the different possible variants of an obfuscated sentence, the one that best matches a language model trained on Shakespeare texts.
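The final selection step, picking the candidate sentence that best matches a reference language model, can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one-smoothed unigram model, the length normalization, and the names `train_unigram_lm` and `pick_best_variant` are assumptions made purely for illustration, and the actual type of language model used in [1] is not described here.

```python
import math
from collections import Counter


def train_unigram_lm(corpus_tokens):
    """Return a scoring function for an add-one-smoothed unigram model.

    Assumption: a unigram model stands in for whatever language model
    the original approach trained on its reference texts.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def log_prob(tokens):
        # Counter returns 0 for unseen tokens, so smoothing handles them.
        return sum(
            math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
        )

    return log_prob


def pick_best_variant(variants, log_prob):
    """Choose the variant with the highest average per-token log-probability."""

    def score(sentence):
        tokens = sentence.lower().split()
        # Length normalization (an assumption) avoids favoring short variants.
        return log_prob(tokens) / max(len(tokens), 1)

    return max(variants, key=score)


# Toy reference corpus; a real setup would use a large text collection.
reference = "thou art fair and thou art wise".split()
lm = train_unigram_lm(reference)
best = pick_best_variant(["you are fair", "thou art fair"], lm)
```

With this toy reference corpus, the variant whose words occur in the reference text is preferred over the one with unseen words.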

As for the resulting texts, the strategy for combining and splitting sentences should pay more attention to the local situation, since otherwise it will quickly lead to incomplete or overlong constructions. A more detailed analysis of the text quality follows in the evaluation (cf. Section 4).

Castro et al.

The approach of Castro et al. [6] focuses on simple rule- or pattern-based replacements. Using the FreeLing NLP tool for pre-processing texts (POS tagging, word sense disambiguation, etc.), several ideas are combined. Contractions are replaced based on a dictionary or the long version if it is used more often, synonyms are substituted using FreeLing functionality, and sentences are shortened by leaving out parts in parentheses, by leaving out discourse markers, or by eliminating appositions based on two simple patterns that identify explanations if named entities are introduced in the text.
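To make the flavor of such rule-based shortening concrete, the following sketch applies three of the described operations with plain regular expressions. It is only an illustration under stated assumptions: the contraction dictionary and the discourse-marker list are hypothetical examples, FreeLing is not used at all, and the synonym substitution and apposition patterns of [6] are left out.

```python
import re

# Hypothetical examples; the actual dictionary and marker list of the
# submitted approach are not given in this overview.
CONTRACTIONS = {"'ll": " will", "'re": " are", "'ve": " have"}
DISCOURSE_MARKERS = ("however", "anyway", "moreover")


def expand_contractions(text):
    """Replace each contraction suffix with its long form."""
    for short, long_form in CONTRACTIONS.items():
        text = text.replace(short, long_form)
    return text


def drop_parentheticals(text):
    """Remove non-nested parenthesized parts and the whitespace before them."""
    return re.sub(r"\s*\([^()]*\)", "", text)


def drop_discourse_markers(text):
    """Remove sentence-initial markers followed by a comma.

    The first letter after the removed marker is uppercased so that the
    sentence still starts with a capital letter.
    """
    pattern = re.compile(
        r"(^|[.!?]\s+)(?:" + "|".join(DISCOURSE_MARKERS) + r"),\s+(\w)",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: m.group(1) + m.group(2).upper(), text)


def shorten(text):
    """Apply the three rules in a fixed order (the order is an assumption)."""
    text = expand_contractions(text)
    text = drop_parentheticals(text)
    return drop_discourse_markers(text)


result = shorten("However, the cat (a tabby) slept. Anyway, we'll leave.")
# result == "The cat slept. We will leave."
```

Even this toy version shows why the output tends to be shorter than the input: two of the three rules only delete material.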

The resulting text will usually be shorter than the original text, which, however, is intended by the authors. Most of the removals do not dramatically change the meaning of the text; a similar observation applies to the treatment of contractions. Still, leaving out information from the original may render parts of the resulting text hard to understand. Depending on FreeLing's synonym functionality, synonyms are often not appropriately chosen since the context seems not to be considered when selecting a replacement candidate. A more detailed analysis of the text quality follows in the evaluation (cf. Section 4).

Evaluation

As in last year's edition, we automatically evaluate the safety of the submitted obfuscation approaches against 44 authorship verifiers that were submitted to the previous three shared tasks on authorship identification at PAN 2013-2015. Sensibleness and soundness of the obfuscated texts are assessed manually.

The evaluation setup is the cloud-based evaluation platform TIRA [9,15],¹ which is being developed as part of our long-term evaluation-as-a-service initiative [10]. We want to point out that, by using TIRA, it was possible to run 44 of the 49 authorship verification approaches (which have been submitted to the shared tasks at PAN 2013-2015) on the outputs of the submitted obfuscation approaches. The outputs, in turn, were generated from the authorship verification corpora PAN13, PAN14 EE, PAN14 EN, and PAN15.

Table 1. Safety evaluation of five obfuscators, including those submitted to PAN 2016, against sets of 26-36 authorship verification approaches submitted to PAN 2013 through PAN 2015. The column group "PAN measures" shows the average performance delta on the evaluation measures ROC AUC, C@1, and the final score AUC · C@1 applied at PAN. The four row groups belong to the four English PAN test datasets; the rows within the row groups are ordered by average impact (avg imp, see the last column).

Table 1 columns: Obfuscator (Team [Reference]); Verifier (|Y|); Dataset (D_test, |D+_test|); PAN Measures (∆ AUC, ∆ C@1, ∆ final); Obfuscation Measures (∆ acc, ∆ rec, avg imp).
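The PAN measures named in the caption of Table 1 can be sketched as follows. This overview only names the measures, so the definitions below are assumptions based on how C@1 and ROC AUC are commonly defined for verification tasks (a score of exactly 0.5 counts as an unanswered problem); they should be checked against the PAN overviews [11,17,18] before reuse. The function names and the toy data are illustrative.

```python
def c_at_1(scores, truth):
    """C@1: accuracy that rewards leaving hard problems unanswered.

    Assumption: scores lie in [0, 1], values above 0.5 mean "same author",
    and a score of exactly 0.5 marks an unanswered problem.
    """
    n = len(scores)
    n_correct = sum(
        1 for s, t in zip(scores, truth) if s != 0.5 and (s > 0.5) == t
    )
    n_unanswered = sum(1 for s in scores if s == 0.5)
    return (n_correct + n_unanswered * n_correct / n) / n


def roc_auc(scores, truth):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


def final_score(scores, truth):
    """Product of ROC AUC and C@1, as named in the table caption."""
    return roc_auc(scores, truth) * c_at_1(scores, truth)


# Toy verifier output: two same-author and two different-author problems.
scores = [0.9, 0.5, 0.2, 0.6]
truth = [True, True, False, False]
```

A delta as reported in the table would then be the difference between a measure computed on a verifier's output for the obfuscated texts and the same measure computed for the original texts.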

Safety

Table 1 shows the results of our safety evaluation of the two approaches from this year compared to the three approaches from last year against 44 authorship verification approaches on the aforementioned four PAN evaluation datasets. We combine the two rankings into an overall ranking of the obfuscation approaches suggested so far in order to interpret the results of this year's participants in context. The best-performing approach this year was submitted by Castro et al. [6], who achieve second rank overall across both years as per average impact; the average impact quantifies the averaged ratio of true positive decisions turned false negative. However, this result must be taken with a grain of salt since this approach basically removed large parts of the original text. The approach of Bakhteev and Khazov [1] performs second-best this year, and ranks fourth out of five overall. The ranking induced by average impact is the same as that induced by all other measures, rendering the measures perfectly consistent. This consistency, however, forecloses further insights that can usually be derived from differing performance characteristics. In this regard, the qualitative assessment of sensibleness and soundness presented in the following subsection is important. Altogether, the approach of Mihaylova et al. [14] still performs best among all approaches.
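The one-sentence description of average impact given above can be turned into a small sketch. The exact definition is given in last year's overview [16]; the code below follows only the description in this section and assumes that every problem is a same-author case, so that a "same author" decision on the original text is a true positive. The names `impact` and `average_impact` and the toy decisions are illustrative.

```python
def impact(before, after):
    """Ratio of true positives turned false negative for one verifier.

    `before` and `after` are lists of booleans, where True means the
    verifier decided "same author" on the original or obfuscated text.
    Assumption: all problems are same-author cases.
    """
    true_positives = [i for i, decision in enumerate(before) if decision]
    if not true_positives:
        return 0.0  # nothing to flip for this verifier
    flipped = sum(1 for i in true_positives if not after[i])
    return flipped / len(true_positives)


def average_impact(runs):
    """Average the per-verifier impact over (before, after) pairs."""
    return sum(impact(before, after) for before, after in runs) / len(runs)


# Toy decisions of two verifiers on four same-author problems.
runs = [
    ([True, True, False, True], [False, True, False, False]),
    ([True, True, True, True], [True, True, True, False]),
]
avg = average_impact(runs)
```

In the toy data, the first verifier loses two of its three true positives and the second loses one of four, so the average lies between the two ratios.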

Sensibleness and Soundness

As in last year's edition, a human assessor conducted an in-depth manual assessment on problem instances 135, and 430. Spot checks on other instances again indicated that the overall characteristics of the output texts are similar there. The human assessor started by reading the obfuscated texts without knowing which approach had produced which output. During this reading phase, the assessor marked up errors (typos, grammar) and assigned school grades on a scale from 1 (excellent) to 5 (fail) for the sensibleness of each of the sample problem instances. The sensibleness scores obtained last year were a grade 2 for Mansoorizadeh et al.'s approach [13], which does not change much on a per-sentence basis, a grade 4 for Mihaylova et al.'s obfuscator [14], and a grade 5 for Keswani et al.'s obfuscator [12]. This year, Bakhteev and Khazov's approach [1] gets a grade 4, since there are many issues with uppercasing at sentence starts as well as many grammatical problems due to problematic sentence splits and merges and due to inappropriate use of synonyms. Castro et al.'s approach [6] was assigned a grade 2 for documents in which only some problematically short sentences were grammatically incorrect or in which spacing around punctuation marks was incorrect, while other documents got a grade 3 for overly short sentences that were grammatically wrong or for synonyms that did not make sense in some contexts.

After grading the sensibleness of the obfuscated texts, the assessor read the original texts and judged the textual differences in various ways to evaluate the soundness of the obfuscated texts on a three-point scale as either "correct", "passable", or "incorrect". The obfuscated texts of Mihaylova et al.'s and Keswani et al.'s approaches were all judged "incorrect", while Mansoorizadeh et al.'s very conservative approach achieved "correct" and "passable" scores. This year's approaches (Bakhteev and Khazov's, and Castro et al.'s) both got "incorrect" as judgments, but for different reasons. With regard to Bakhteev and Khazov's approach, many parts of the resulting texts were not understandable anymore because of overly rigid changes in sentences, which completely removed the original meaning. With regard to Castro et al.'s approach, the judgment results from the fact that the obfuscated text covers only a small portion of the original text (about the first third of the original), which may be an undesired side effect of some pre-processing problems. The parts that are still contained in the obfuscated version often achieve at least a "passable" judgment, and they could even be judged as "correct". However, the fact that about two thirds of the original was omitted precluded a better outcome.

Conclusion

In the second year of evaluating author obfuscation approaches in terms of their safety against the state of the art in authorship verification, two new approaches were added to the three approaches from last year. The best-performing obfuscator flips on average about 42% of an authorship verifier's decisions towards choosing "different author" when the opposite decision would have been correct, indicating some level of safety against verification approaches. As for soundness and sensibleness, though, the approaches often produce rather unreadable text or text whose meaning is significantly changed. Still, such insights are mainly obtained from manual inspection.

The challenge of evaluating author obfuscation approaches properly and at scale would definitely benefit from new technologies that are capable of recognizing paraphrases, textual entailment, grammaticality, and style deception. However, a very important direction for future research in the authorship obfuscation domain is that on producing safe and still sound and sensible texts. So far, there are only two groups of obfuscation approaches: (1) approaches that are somewhat safe but that often produce unreadable text or text that is neither sound nor sensible, and (2) approaches that produce sound and sensible texts but that are not safe against authorship verification.

A significant improvement of current obfuscation technology requires a much better consideration and integration of the surrounding context when replacing, adding, or removing words. Note that such sensible text operations can also be operationalized by applying paraphrasing rules from the PPDB [8], as is done, for instance, in an approach to constrained paraphrasing [19].

¹ www.tira.io

Acknowledgments

We thank the participating teams of the two editions of this shared task.

Bibliography

[1] Bakhteev, O., Khazov, A.: Author Masking using Sequence-to-Sequence Models - Notebook for PAN at CLEF 2017.

[2] Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.): CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers, Évora, Portugal. CEUR Workshop Proceedings, September 2016.

[3] Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.): CLEF 2017 Evaluation Labs and Workshop - Working Notes Papers, Dublin, Ireland. CEUR Workshop Proceedings, September 2017.

[4] Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.): CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers, Sheffield, UK. CEUR Workshop Proceedings, September 2014.

[5] Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.): CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers, Toulouse, France. CEUR Workshop Proceedings, September 2015.

[6] Castro, D., Ortega, R., Muñoz, R.: Author Masking by Sentence Transformation - Notebook for PAN at CLEF 2017.

[7] Forner, P., Navigli, R., Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers, Valencia, Spain. September 2013.

[8] Ganitkevitch, J., Van Durme, B., Callison-Burch, C.: PPDB: The paraphrase database. In: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA. 2013.

[9] Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M. (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12). ACM, Aug 2012.

[10] Hanbury, A., Müller, H., Balog, K., Brodt, T., Cormack, G., Eggel, I., Gollub, T., Hopfgartner, F., Kalpathy-Cramer, J., Kando, N., Krithara, A., Lin, J., Mercer, S., Potthast, M.: Evaluation-as-a-Service: Overview and Outlook. ArXiv e-prints, Dec 2015.

[11] Juola, P., Stamatatos, E.: Overview of the Author Identification Task at PAN 2013.

[12] Keswani, Y., Trivedi, H., Mehta, P., Majumder, P.: Author Masking through Translation - Notebook for PAN at CLEF 2016.

[13] Mansoorizadeh, M., Rahgooy, T., Aminiyan, M., Eskandari, M.: Author Obfuscation using WordNet and Language Models - Notebook for PAN at CLEF 2016.

[14] Mihaylova, T., Karadjov, G., Nakov, P., Kiprov, Y., Georgiev, G., Koychev, I.: SU@PAN'2016: Author Obfuscation - Notebook for PAN at CLEF 2016.

[15] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). Springer, Berlin Heidelberg New York, Sep 2014.

[16] Potthast, M., Hagen, M., Stein, B.: Author Obfuscation: Attacking the State of the Art in Authorship Verification. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS, Sep 2016.

[17] Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the Author Identification Task at PAN 2015.

[18] Stamatatos, E., Daelemans, W., Verhoeven, B., Potthast, M., Stein, B., Juola, P., Sanchez-Perez, M., Barrón-Cedeño, A.: Overview of the Author Identification Task at PAN 2014.

[19] Stein, B., Hagen, M., Bräutigam, C.: Generating Acrostics via Paraphrasing and Heuristic Search. In: Tsujii, J., Hajic, J. (eds.) 25th International Conference on Computational Linguistics (COLING 14). Association for Computational Linguistics, Aug 2014.