<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Author Obfuscation Task at PAN 2017: Safety Evaluation Revisited</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthias Hagen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We report on the second large-scale evaluation of style obfuscation approaches in a shared task on author obfuscation, organized at the PAN 2017 lab on digital text forensics. Author obfuscation means to automatically paraphrase a given text such that state-of-the-art authorship verification approaches misjudge a given pair of documents as having been written by “different authors” when, without the obfuscation, they would have decided otherwise. This year, two new obfuscators are compared to the participants of last year’s task against a total of 44 authorship verification approaches. The best-performing obfuscator significantly impacts the decision-making process of the authorship verifiers. However, as last year, the paraphrased texts are often not really human-readable anymore and their content is partly changed, indicating that there is still a way to go towards “perfect” automatic obfuscation that (1) tricks verification approaches, (2) keeps the meaning of the original, and (3) is, regarding its obfuscation, unsuspicious to a human eye.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>At PAN 2017 we organized the second shared task on author obfuscation in order to
foster exploring the potential vulnerabilities of author identification technology. Like
in the first edition, the specific task is that of author masking against authorship
verification, which in turn has been a shared task at PAN 2013–2015 [11, 17, 18]. The
following synopses point out the differences:</p>
      <p>Given two documents,
decide whether both have been
written by the same author.</p>
    </sec>
    <sec id="sec-2">
      <title>Author Masking</title>
      <p>vs.</p>
      <sec id="sec-2-1">
        <title>Given two documents from the same author,</title>
        <p>paraphrase the designated one
such that an authorship verification will fail.</p>
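      <p>To make the contrast concrete, the following minimal sketch shows plausible type signatures for the two tasks; the names and types are illustrative assumptions, not part of the shared task definitions.</p>
      <preformat>
# Hypothetical Python type signatures contrasting the two tasks.
from typing import Callable

# Authorship verification: two documents in, one decision out.
Verifier = Callable[[str, str], bool]          # True means "same author"

# Author masking: given the author's other writings and the designated
# document, produce a paraphrased version on which verifiers should fail.
Obfuscator = Callable[[list[str], str], str]   # returns the obfuscated text
      </preformat>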
      <p>Figure 1 illustrates the setting and shows that the two tasks are diametrically
opposed to each other: success of a certain approach for one of these tasks depends on its
“immunity” against the most effective approaches for the other. In our overview of last
year’s first author masking edition [16], we already included a survey of related work on
author obfuscation. In particular, we introduced and discussed the “obfuscation impact
measures” used in the evaluation, which we will quickly recap in Section 2. Section 3
reviews the obfuscation approaches that have been submitted to this year’s edition of
the shared task, and Section 4 reports on their evaluation against the state of the art in
authorship verification.</p>
      <p>[Figure 1: Alice is known to have written a text; an obfuscation approach
automatically obfuscates it, circumventing or obstructing Eve, who automatically
verifies the authorship of the text under analysis against Alice’s known writing,
which is used as reference.]</p>
        <sec id="sec-2-1-1">
          <title>Evaluating Author Obfuscation</title>
      <p>As of last year, we consider three performance dimensions according to which an author
obfuscation approach must excel to be considered fit for practical use. Obviously, the
obfuscation performance should depend on the capability of fooling forensic experts,
be it a piece of software or a human. However, fulfilling this requirement in isolation
disregards writers and their target audience, whose primary goal is to communicate,
albeit safe from deanonymization: the quality of an obfuscated text and the preservation
of its semantics are equally important. We hence call an obfuscation
software
1. safe, if its obfuscated texts cannot be attributed to their original authors anymore,
2. sound, if its obfuscated texts are textually entailed by their originals, and
3. sensible, if its obfuscated texts are well-formed and inconspicuous.</p>
          <p>These dimensions are orthogonal; an obfuscation software may meet each of them to
a certain degree of perfection. Related work on operationalizing measures for these
dimensions has been included in our overview from last year [16]. In order to analyze
the safety dimension, we run the obfuscated texts against 44 authorship verification
approaches and measure the impact of the obfuscation on the verifiers in the form of
changed verification decisions (cf. last year’s overview for details on the measures used [16]). As
for sensibleness and soundness, we stick to manual inspection and grading of examples.</p>
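      <p>As an illustration of the safety measurement just described, the following minimal Python sketch computes the impact of an obfuscator on a single verifier and the average over all verifiers; the data layout and names are hypothetical, not the actual PAN evaluation code.</p>
      <preformat>
# Sketch under assumed data layout: decisions map a problem id to the
# verifier's output (True = "same author"); all pairs are same-author pairs.

def impact(original_decisions, obfuscated_decisions):
    """Fraction of correct "same author" decisions flipped by obfuscation."""
    true_positives = [p for p, same in original_decisions.items() if same]
    if not true_positives:
        return 0.0
    flipped = sum(1 for p in true_positives if not obfuscated_decisions[p])
    return flipped / len(true_positives)

def average_impact(runs):
    """runs: one (original_decisions, obfuscated_decisions) pair per verifier."""
    return sum(impact(o, b) for o, b in runs) / len(runs)
      </preformat>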
        </sec>
        <sec id="sec-2-1-2">
          <title>Survey of Submitted Obfuscation Approaches</title>
          <p>
            The two approaches submitted to this year’s edition of our shared task follow different
strategies: sequence-to-sequence models and rule-based replacements. While a more
conservative rule-based strategy often changes the to-be-obfuscated text only slightly,
the sequence-to-sequence modeling can lead to substantial differences.
Bakhteev and Khazov. The approach of Bakhteev and Khazov [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] is mainly based on
different sequence-to-sequence models and a small set of rules. The rules replace
contractions (e.g., ’ll → will), split or concatenate sentences using conjunctive words
(e.g., and), and add or remove introductory phrases (e.g., anyway) to and from
sentences, respectively. The sequence-to-sequence modeling comes in two
flavors: (1) replacing words with synonyms based on nearest neighbors in word embeddings from
a Wikipedia dump, and (2) an encoder-decoder approach that generates a
“reproduced” version of the original text, which is also based on embeddings trained on a
Wikipedia dump. In both cases, the authors choose, from the different possible variants of
an obfuscated sentence, the one that best matches a language model trained on
Shakespeare texts.
          </p>
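          <p>The following Python sketch illustrates the select-best-variant idea described above; the helper functions embedding_neighbors and lm_score are assumptions standing in for embeddings trained on a Wikipedia dump and a language model trained on Shakespeare texts, and the exhaustive enumeration is for brevity only (the actual system is not reproduced here).</p>
          <preformat>
from itertools import product

def candidate_variants(tokens, embedding_neighbors, k=3):
    """Yield sentence variants: each word may stay or be replaced by one of
    its k nearest neighbors in the embedding space."""
    options = [[tok] + embedding_neighbors(tok, k) for tok in tokens]
    for combo in product(*options):  # exhaustive for brevity; beware blow-up
        yield list(combo)

def obfuscate_sentence(tokens, embedding_neighbors, lm_score, k=3):
    """Pick the variant that best matches the (Shakespeare) language model."""
    return max(candidate_variants(tokens, embedding_neighbors, k), key=lm_score)
          </preformat>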
          <p>As for the resulting texts, the strategy for combining and splitting sentences should
pay more attention to the local situation, since otherwise it will quickly lead to
incomplete or overlong constructions. A more detailed analysis of the text quality follows in
the evaluation (cf. Section 4).</p>
          <p>Castro et al. The approach of Castro et al. [6] focuses on simple rule- or
pattern-based replacements. Several ideas are combined, using the FreeLing NLP tool
for pre-processing the texts (POS tagging, word sense disambiguation, etc.). Contractions
are replaced by their long versions based on a dictionary (or long versions by
contractions, if those are used more often), synonyms
are substituted using FreeLing functionality, and sentences are shortened by leaving out
parts in parentheses, by leaving out discourse markers, or by eliminating appositions
based on two simple patterns that identify explanations of named entities introduced
in the text.</p>
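          <p>A minimal sketch of this kind of rule-based shortening is given below; the regular expressions and the discourse-marker list are invented examples, not taken from the actual submission or from FreeLing.</p>
          <preformat>
import re

DISCOURSE_MARKERS = ("however", "moreover", "in fact", "of course")  # example list

def shorten(sentence: str) -> str:
    # Drop parenthesized asides, e.g. "(POS tagging, ...)".
    sentence = re.sub(r"\s*\([^)]*\)", "", sentence)
    # Drop a sentence-initial discourse marker followed by a comma.
    pattern = r"^(?:%s),\s+" % "|".join(DISCOURSE_MARKERS)
    sentence = re.sub(pattern, "", sentence, flags=re.IGNORECASE)
    return sentence.strip()  # re-capitalization is left out for brevity

print(shorten("However, the tool (an NLP pipeline) tags every word."))
# prints: the tool tags every word.
          </preformat>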
          <p>The resulting text will usually be shorter than the original text, which, however, is
intended by the authors. Most of the removals do not dramatically change the meaning
of the text; a similar observation applies to the treatment of contractions. Still,
leaving out information from the original may render parts of the resulting text hard to
understand. As for FreeLing’s synonym functionality, synonyms are often not
appropriately chosen, since the context seems not to be considered when selecting a
replacement candidate. A more detailed analysis of the text quality follows in the
evaluation (cf. Section 4).</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Evaluation</title>
          <p>As in the last year, we automatically evaluate the safety of the submitted obfuscation
approaches against 44 authorship verifiers which have been submitted to the previous
three shared tasks on authorship identification at PAN 2013–2015. Sensibleness and
soundness of the obfuscated texts are assessed manually by human inspection.</p>
      <p>The evaluation setup is the cloud-based evaluation platform TIRA [9, 15] (www.tira.io),
which is being developed as part of our long-term evaluation-as-a-service initiative [10]. We
want to point out that, by using TIRA, it was possible to run 44 of the 49
authorship verification approaches (which have been submitted to the shared tasks at
PAN 2013–2015) on the outputs of the submitted obfuscation approaches. The outputs,
in turn, were generated from the authorship verification corpora PAN13, PAN14 EE,
PAN14 EN, and PAN15.</p>
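      <p>Schematically, and with hypothetical interfaces rather than TIRA’s actual API, the safety evaluation amounts to the following loop in Python:</p>
      <preformat>
def evaluate_safety(verifiers, problems, obfuscate):
    """For each verifier, measure how many correct "same author" decisions
    on same-author problems flip once the disputed text is obfuscated."""
    results = {}
    for verifier in verifiers:
        flipped, total = 0, 0
        for known_doc, disputed_doc in problems:  # same-author pairs only
            if not verifier.same_author(known_doc, disputed_doc):
                continue  # only true positive decisions can be flipped
            total += 1
            if not verifier.same_author(known_doc, obfuscate(disputed_doc)):
                flipped += 1
        results[verifier.name] = flipped / total if total else 0.0
    return results
      </preformat>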
    <sec id="sec-3">
      <title>Safety</title>
      <p>Table 1 shows the results of our safety evaluation of the two approaches from this
year compared to the three approaches from last year against 44 authorship verification
approaches on the aforementioned four PAN evaluation datasets. We combine the two
rankings into an overall ranking of obfuscation approaches suggested so far in order to
interpret the results of this year’s participants in context.</p>
      <p>
The best-performing approach this year was submitted by Castro et al. [6], which
achieves second rank overall across both years as per average impact; the average impact
quantifies the averaged ratio of true positive decisions turned false negative. However,
this result must be taken with a grain of salt since this approach basically removed
large parts of the original text. The approach of Bakhteev and Khazov [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] performs
second-best this year, and ranks fourth out of five overall. The ranking induced by
average impact is the same as that induced by all other measures, rendering the measures
perfectly consistent. This consistency, however, precludes further insights that could
usually be derived from differing performance characteristics. In this regard, the qualitative
assessment of sensibleness and soundness presented in the following subsection is
important. Altogether, the approach of Mihaylova et al. [14] still performs best among all
approaches.
      </p>
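        <p>As a hedged formalization of the measure described above (the precise definitions are given in last year’s overview [16]), the average impact over the set V of verifiers can be written as follows, where TP_v denotes verifier v’s true positive decisions before obfuscation and TP_v→FN_v those among them turned false negative by the obfuscation:</p>
        <disp-formula><tex-math>\mathrm{avg\,imp} = \frac{1}{|V|} \sum_{v \in V} \frac{|TP_v \rightarrow FN_v|}{|TP_v|}</tex-math></disp-formula>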
    </sec>
    <sec id="sec-4">
      <title>Sensibleness and Soundness</title>
      <p>
        As in last year’s edition, a human assessor conducted an in-depth manual assessment on
problem instances 6, 135, and 430. Spot checks on other instances again indicated that
the overall characteristics of the output texts are similar on other instances. The human
assessor started by reading the obfuscated texts without knowing which was the
output of what approach. During this reading phase, the assessor marked up errors (typos,
grammar) and assigned school grades (on a scale from 1 (excellent) to 5 (fail)) for the
sensibleness of each of the sample problem instances. The sensibleness scores obtained
in the last year were a grade 2 for Mansoorizadeh et al.’s approach [13], which does not
really change a lot on a per-sentence basis, a grade 4 for Mihaylova et al.’s
obfuscator [14], and a grade 5 for Keswani et al.’s obfuscator [12]. This year, Bakhteev
and Khazov’s approach [
        <xref ref-type="bibr" rid="ref1">1</xref>
] gets a grade 4, since there are a lot of issues with
respect to uppercasing at sentence starts, as well as many grammatical problems due to
problematic sentence splits and merges and due to inappropriate use of synonyms. As
for Castro et al.’s approach [6], a grade 2 was assigned to documents in which only some
problematically short sentences were grammatically incorrect or the spacing around
punctuation marks was wrong, while other documents got a grade 3 for overly short,
grammatically incorrect sentences or for synonyms not making sense in their context.
      </p>
      <p>After grading the sensibleness of the obfuscated texts, the assessor read the
original texts and judged the textual differences in various ways to evaluate the soundness
of the obfuscated texts on a three-point scale as either “correct”, “passable”, or
“incorrect”. The obfuscated texts of Mihaylova et al.’s and Keswani et al.’s approaches
were all judged “incorrect”, while Mansoorizadeh et al.’s very conservative approach
achieved “correct” and “passable” scores. This year’s approaches (Bakhteev and
Khazov’s, and Castro et al.’s) both got “incorrect” judgments, but for different reasons:
with regard to Bakhteev and Khazov’s approach, many parts of the resulting texts
were not understandable anymore because of overly drastic changes to sentences, which
completely removed the original meaning. With regard to Castro et al.’s approach, the
judgment results from the fact that the obfuscated text covers only a small portion of the
original text (about the first third of the original), possibly an undesired side effect of
some pre-processing problems. The parts that are still contained in the obfuscated
version often achieve at least a “passable” judgment, and some could even be judged
“correct”. However, the fact that about two thirds of the original was omitted precluded
a better outcome.</p>
      <sec id="sec-4-1">
        <title>Conclusion and Outlook</title>
        <p>In the second year of evaluating author obfuscation approaches in terms of their safety
against the state of the art in authorship verification, two new approaches were added
to the three approaches from last year. The best-performing obfuscator flips on average
about 42% of an authorship verifier’s decisions towards choosing “different author”
when the opposite decision would have been correct, indicating some level of safety
against verification approaches. As for soundness and sensibleness, though, the
approaches often produce rather unreadable text or text whose meaning is significantly
changed. Still, such insights are mainly obtained from manual inspection.</p>
        <p>The challenge of evaluating author obfuscation approaches properly and at scale
would definitely benefit from new technologies that are capable of recognizing
paraphrases, textual entailment, grammaticality, and style deception. However, a very
important direction for future research in the authorship obfuscation domain is the
production of safe and still sound and sensible texts. So far, there are only two groups of
obfuscation approaches: (1) approaches that are somewhat safe but that often produce
unreadable text or text that is neither sound nor sensible, and (2) approaches that
produce sound and sensible texts but that are not safe against authorship verification.</p>
        <p>A significant improvement of current obfuscation technology requires a much
better consideration and integration of the surrounding context when replacing, adding,
or removing words. Note that such kinds of sensible text operations can also be
operationalized by applying paraphrasing rules from the PPDB [8], as is done, for instance, in
an approach to constrained paraphrasing [19].</p>
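        <p>A toy Python sketch of such context-aware rule application is given below; the replacement rules and the language-model scorer lm_score are invented illustrations in the spirit of PPDB-style paraphrasing, not actual PPDB entries.</p>
        <preformat>
RULES = {"assist": "help", "purchase": "buy"}  # invented example rules

def paraphrase(tokens, lm_score):
    """Apply a rule only where the language model prefers the rewrite,
    i.e., the surrounding context is taken into account."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in RULES:
            candidate = out[:i] + [RULES[tok]] + out[i + 1:]
            if lm_score(candidate) >= lm_score(out):  # keep improving rewrites
                out = candidate
    return out
        </preformat>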
      </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank the participating teams of the two editions of this shared task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bakhteev</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khazov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Author Masking using Sequence-to-Sequence Models—Notebook for PAN at CLEF 2017</article-title>
          . In: [3], http://ceur-ws.org/Vol-/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.):
          <article-title>CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers</article-title>
          , 5–8 September, Évora, Portugal. CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2016</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (eds.):
          <article-title>CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers</article-title>
          , 11–14 September, Dublin, Ireland. CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2017</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halvey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kraaij</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (eds.):
          <article-title>CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers</article-title>
          , 15–18 September, Sheffield, UK. CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2014</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>San Juan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.):
          <article-title>CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers</article-title>
          , 8–11 September, Toulouse, France. CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2015</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Castro, D., Ortega, R., Muñoz, R.: Author Masking by Sentence Transformation—Notebook for PAN at CLEF 2017. In: [3], http://ceur-ws.org/Vol-/</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Forner, P., Navigli, R., Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, 23–26 September, Valencia, Spain (2013), http://www.clef-initiative.eu/publication/working-notes</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Ganitkevitch, J., Van Durme, B., Callison-Burch, C.: PPDB: The Paraphrase Database. In: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9–14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA. pp. 758–764 (2013)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M. (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12). pp. 1125–1126. ACM (Aug 2012)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Hanbury, A., Müller, H., Balog, K., Brodt, T., Cormack, G., Eggel, I., Gollub, T., Hopfgartner, F., Kalpathy-Cramer, J., Kando, N., Krithara, A., Lin, J., Mercer, S., Potthast, M.: Evaluation-as-a-Service: Overview and Outlook. ArXiv e-prints (Dec 2015), http://arxiv.org/abs/1512.07454</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Juola, P., Stamatatos, E.: Overview of the Author Identification Task at PAN 2013. In: [7]</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Keswani, Y., Trivedi, H., Mehta, P., Majumder, P.: Author Masking through Translation—Notebook for PAN at CLEF 2016. In: [2], http://ceur-ws.org/Vol-1609/</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Mansoorizadeh, M., Rahgooy, T., Aminiyan, M., Eskandari, M.: Author Obfuscation using WordNet and Language Models—Notebook for PAN at CLEF 2016. In: [2], http://ceur-ws.org/Vol-1609/</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Mihaylova, T., Karadjov, G., Nakov, P., Kiprov, Y., Georgiev, G., Koychev, I.: SU@PAN’2016: Author Obfuscation—Notebook for PAN at CLEF 2016. In: [2], http://ceur-ws.org/Vol-1609/</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Potthast, M., Hagen, M., Stein, B.: Author Obfuscation: Attacking the State of the Art in Authorship Verification. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016), http://ceur-ws.org/Vol-1609/</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the Author Identification Task at PAN 2015. In: [5]</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Stamatatos, E., Daelemans, W., Verhoeven, B., Potthast, M., Stein, B., Juola, P., Sanchez-Perez, M., Barrón-Cedeño, A.: Overview of the Author Identification Task at PAN 2014. In: [4]</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Stein, B., Hagen, M., Bräutigam, C.: Generating Acrostics via Paraphrasing and Heuristic Search. In: Tsujii, J., Hajic, J. (eds.) 25th International Conference on Computational Linguistics (COLING 14). pp. 2018–2029. Association for Computational Linguistics (Aug 2014)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>