<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Masking through Translation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Dhirubhai Ambani Institute of Information and Communication Technology</institution>
        </aff>
        <contrib contrib-type="author">
          <string-name>Yashwant Keswani</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harsh Trivedi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parth Mehta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasenjit Majumder</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>This notebook paper documents the approach adopted by our team for the Author Masking task at PAN 2016. To mask the identity of the author, we use a simple translation-based approach. The text is translated from the source language (English) to an intermediate language before it is finally translated back to English. In this process, depending on the translation model and the various penalties used during translation, the structure of the language changes. In addition, the translation process can change the vocabulary used in the text as well as the average sentence length. We attempt to use this approach to obfuscate the identity of the author of the text.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction &amp; Related Work</title>
      <p>
        author can preserve the anonymity for a particular document [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They discuss a
method of identifying the most salient features for identification and show how this
information can be fed back to create an obfuscated document so that the attribution moves
away from the original. There has also been a previous attempt to perform this task by
to-fro language translation: English → French → English. As mentioned in their paper
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], given the low quality of the state-of-the-art translation methods at the time, they
were not able to achieve good performance. In this work, we test the idea of
to-fro translation using an additional intermediate language and check its performance
with current state-of-the-art translation tools.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        Our approach exploits the corruption introduced by a translation system when
translating a piece of text from one language to another, and leverages it to perform
obfuscation. The idea is to sequentially translate the to-be-obfuscated
document of each author through a few intermediate languages and then translate it back to
English: English → IL<sub>1</sub> → IL<sub>2</sub> → ... → IL<sub>n</sub> → English, where IL<sub>j</sub> is the j-th
intermediate language. Our initial approach was to use the translation API provided by Google
Translate<sup>1</sup>. However, Google Translate uses English as a pivot language for translation.
This means that when translating a document from English to French to German, the
English document is translated to French, which is translated back to
English, and the new English document is then translated to German. This approach
did not turn out to be very useful. Most machine translation systems do not drift to a new
sentence when translating back and forth between a pair of languages: translating an
English sentence to French and then back to English will, in most cases, return the
original English sentence itself. To counter this, we tried using other translation systems such as
Yandex<sup>2</sup> and Microsoft Bing Translate<sup>3</sup> for a part of the intermediate
translations. For example, we would translate an English sentence to French using Google
Translate, then use Bing Translate to get the German sentence, which would then be
translated to English using Yandex. This approach seemed promising in terms of
language quality. Most of the generated sentences were human-readable, with a few shifts in
phrase positioning and some words replaced by synonyms. However, there were
certain unexpected errors in deploying and running the software on the TIRA platform [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
due to the high number of API calls required by such a system. Finally, we opted to train
our own translation models using the Moses SMT toolkit [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We used German and French
as the intermediate languages: English → German → French → English.
      </p>
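      <p>The translation chain described above can be sketched as a loop over language hops. This is only a minimal illustration under stated assumptions, not the submitted system: the names round_trip, Translator and tagging_translator are hypothetical, and the toy translator merely tags the text with each hop where a trained Moses model or a web API would be called.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface: a translator maps (text, source, target) to text.
Translator = Callable[[str, str, str], str]


def round_trip(
    text: str,
    pivots: Sequence[str],
    translators: Dict[Tuple[str, str], Translator],
    home: str = "en",
) -> str:
    """Translate text along home -> pivots[0] -> ... -> pivots[-1] -> home."""
    route = [home, *pivots, home]
    for src, tgt in zip(route, route[1:]):
        translate = translators[(src, tgt)]  # KeyError if a hop has no model
        text = translate(text, src, tgt)
    return text


def tagging_translator(text: str, src: str, tgt: str) -> str:
    """Toy stand-in for a real MT system: records the hop instead of translating."""
    return f"{text} [{src}>{tgt}]"


# English -> German -> French -> English, the route used in this paper.
hops = [("en", "de"), ("de", "fr"), ("fr", "en")]
translators = {hop: tagging_translator for hop in hops}
print(round_trip("hello", ["de", "fr"], translators))
# prints: hello [en>de] [de>fr] [fr>en]
```
]]></preformat>
      <p>Keying the translators by (source, target) pair makes it explicit that a direct German-French model is needed, so that no hop silently pivots through English.</p>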
      <p>
        We used the Europarl corpus [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for training. The Europarl corpus provides
parallel translations in which one of the languages is English. We created the German-French
corpus from the English-German and English-French corpora. We used a randomly
selected subset of 100K sentences per language pair for training the translation models and
tuned each model with another 5K randomly sampled sentences.
      </p>
      <p>
<sup>1</sup> https://translate.google.com
<sup>2</sup> https://translate.yandex.com/
<sup>3</sup> https://www.bing.com/translator
      </p>
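      <p>One plausible way to derive a German-French corpus from two English-anchored corpora is to join them on identical English sentences and then draw disjoint random subsets for training and tuning. The sketch below assumes exactly that; the text does not specify the alignment procedure, and the function names pivot_corpus and split_train_tune as well as the toy sentences are hypothetical.</p>
      <preformat><![CDATA[
```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pivot_corpus(en_de: List[Pair], en_fr: List[Pair]) -> List[Pair]:
    """Join (English, German) and (English, French) pairs on the English side."""
    fr_by_en: Dict[str, str] = {}
    for en, fr in en_fr:
        fr_by_en.setdefault(en, fr)  # keep the first French rendering seen
    return [(de, fr_by_en[en]) for en, de in en_de if en in fr_by_en]


def split_train_tune(
    pairs: List[Pair], n_train: int, n_tune: int, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Draw disjoint random training and tuning subsets (assumed disjoint)."""
    if n_train + n_tune > len(pairs):
        raise ValueError("not enough sentence pairs for the requested split")
    sample = random.Random(seed).sample(pairs, n_train + n_tune)
    return sample[:n_train], sample[n_train:]


# Toy data; the paper uses 100K training and 5K tuning pairs per language pair.
en_de = [("Hello.", "Hallo."), ("Thank you.", "Danke."), ("Goodbye.", "Auf Wiedersehen.")]
en_fr = [("Thank you.", "Merci."), ("Hello.", "Bonjour."), ("Yes.", "Oui.")]

de_fr = pivot_corpus(en_de, en_fr)
# de_fr == [("Hallo.", "Bonjour."), ("Danke.", "Merci.")]
train, tune = split_train_tune(de_fr, n_train=1, n_tune=1)
```
]]></preformat>
      <p>Sentences whose English side appears in only one of the two corpora are dropped, so the joined corpus can be smaller than either input.</p>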
    </sec>
    <sec id="sec-3">
      <title>Evaluation Details</title>
      <p>The organizers have proposed the following parameters for evaluation of the Author
Masking task. The quality of submitted systems is judged based on:</p>
      <list list-type="bullet">
        <list-item><p>Safeness: does forensic analysis reveal the original author of its obfuscated texts?</p></list-item>
        <list-item><p>Soundness: are the obfuscated texts textually entailed with their originals?</p></list-item>
        <list-item><p>Sensibleness: are the obfuscated texts inconspicuous to a human reader?</p></list-item>
      </list>
      <p>
        Safeness was evaluated for all three participating teams by the task organizers
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] through a large-scale safety evaluation of the three submitted obfuscation
approaches. The evaluation was based on the authorship verification approaches submitted
to the PAN 2013-2015 authorship verification tasks [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9,10,11</xref>
        ]. These authorship
verification approaches were run on obfuscated versions of the test datasets of their
respective years. Each submitted obfuscation approach was run, yielding three additional test
datasets per original dataset. Only the English portions of the test datasets were used.
      </p>
      <p>The following table documents the average drop in the final score
of the verification approaches relative to their performance on the unobfuscated
(original) test datasets. The higher the performance drop, the better the obfuscation. Our team
ranked second in terms of the safeness measure.</p>
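      <p>The average performance drop can be illustrated with a short calculation. The sketch below assumes the drop is the mean, over verification approaches, of the difference between the final score on the original data and on the obfuscated data; the exact aggregation used by the organizers is described in their overview, and the function name, verifier names and scores here are made-up placeholders, not results from the evaluation.</p>
      <preformat><![CDATA[
```python
from statistics import mean
from typing import Dict


def average_drop(original: Dict[str, float], obfuscated: Dict[str, float]) -> float:
    """Mean of (original score - obfuscated score) over verifiers scored on both."""
    shared = sorted(set(original) & set(obfuscated))
    if not shared:
        raise ValueError("no verifier was scored on both datasets")
    return mean(original[name] - obfuscated[name] for name in shared)


# Placeholder scores for two hypothetical verifiers (not real results).
original = {"verifier_a": 0.50, "verifier_b": 0.25}
obfuscated = {"verifier_a": 0.25, "verifier_b": 0.125}

print(average_drop(original, obfuscated))
# prints: 0.1875
```
]]></preformat>
      <p>A larger value means the verifiers lost more accuracy on the obfuscated texts, which is the sense in which a higher drop indicates better obfuscation.</p>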
      <p>The organizers have acknowledged the gap in automatic evaluation measures for this
task and have invited proposals for such a measure through a separate task,
"Obfuscation Evaluation". Results of all three "Author
Masking" teams in terms of soundness and sensibleness will be available in the task
notebooks of the "Obfuscation Evaluation" teams.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion &amp; Future Work</title>
      <p>Overall, the use of machine translation systems seems a worthwhile approach to author
masking. We would like to try several further directions in the future. For instance, due
to the limitations of the virtual machines, we had to reduce the size of the training
corpus. We would like to see the effect of using the entire Europarl corpus (1.5 million
sentences). We would also like to try a different corpus with a broader
vocabulary. Another direction we would like to explore further is tuning the language
model and sentence length penalties in the Moses translation system. These penalties
control the linguistic quality and length of the translated sentences. Yet another possibility
is to use word usage trends to manipulate the translations. Replacing a few words
that are used in recent times with those that were popular in the 18th century would be an
interesting approach.
</p>
      <p>13. Muharram Mansoorizadeh, Taher Rahgooy, Mohammad Aminiyan, and Mahdy Eskandari.
Author Obfuscation using WordNet and Language Models-Notebook for PAN at CLEF
2016. In Balog et al. [14].</p>
      <p>14. Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors. CLEF 2016
Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal,
CEUR Workshop Proceedings. CEUR-WS.org, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Matthias Hagen, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Author Obfuscation: Attacking State-of-the-Art Authorship Verification Approaches</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org</source>
          ,
          <year>September 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Gary</given-names>
            <surname>Kacmarcik</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gamon</surname>
          </string-name>
          .
          <article-title>Obfuscating document stylometry to preserve author anonymity</article-title>
          .
          <source>In Proceedings of the COLING/ACL on Main conference poster sessions</source>
          , pages
          <fpage>444</fpage>
          -
          <lpage>451</lpage>
          . Association for Computational Linguistics,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Juola</surname>
          </string-name>
          and
          <string-name>
            <given-names>Darren</given-names>
            <surname>Vescovi</surname>
          </string-name>
          .
          <article-title>Empirical evaluation of authorship obfuscation using JGAAP</article-title>
          .
          <source>In Proceedings of the 3rd ACM workshop on Artificial Intelligence and Security</source>
          , pages
          <fpage>14</fpage>
          -
          <lpage>18</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Josyula R</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pankaj</given-names>
            <surname>Rohatgi</surname>
          </string-name>
          , et al.
          <article-title>Can pseudonymity really guarantee privacy?</article-title>
          <source>In USENIX Security Symposium</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Tim</given-names>
            <surname>Gollub</surname>
          </string-name>
          , Benno Stein, Steven Burrows, and
          <string-name>
            <given-names>Dennis</given-names>
            <surname>Hoppe</surname>
          </string-name>
          .
          <article-title>TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments</article-title>
          . In A Min Tjoa, Stephen Liddle, Klaus-Dieter Schewe, and Xiaofang Zhou, editors,
          <source>9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA</source>
          , pages
          <fpage>151</fpage>
          -
          <lpage>155</lpage>
          , Los Alamitos, California,
          <year>September 2012</year>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms, editors,
          <source>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</source>
          , pages
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          , Berlin Heidelberg New York,
          <year>September 2014</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          , Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al.
          <article-title>Moses: Open source toolkit for statistical machine translation</article-title>
          .
          <source>In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions</source>
          , pages
          <fpage>177</fpage>
          -
          <lpage>180</lpage>
          . Association for Computational Linguistics,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <article-title>Europarl: A parallel corpus for statistical machine translation</article-title>
          .
          <source>In MT summit</source>
          , volume
          <volume>5</volume>
          , pages
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Juola</surname>
          </string-name>
          and
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <article-title>Overview of the author identification task at PAN 2013</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , Walter Daelemans, Ben Verhoeven, Benno Stein, Martin Potthast, Patrick Juola, Miguel A Sanchez-Perez, and Alberto Barrón-Cedeño.
          <article-title>Overview of the author identification task at PAN 2014</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , pages
          <fpage>877</fpage>
          -
          <lpage>897</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , Martin Potthast, Francisco Rangel, Paolo Rosso, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Overview of the PAN/CLEF 2015 evaluation lab</article-title>
          .
          <source>In International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , pages
          <fpage>518</fpage>
          -
          <lpage>538</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Tsvetomila</given-names>
            <surname>Mihaylova</surname>
          </string-name>
          , Georgi Karadjov, Preslav Nakov, Yasen Kiprov, Georgi Georgiev, and
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Koychev</surname>
          </string-name>
          .
          <article-title>SU@PAN'2016: Author Obfuscation-Notebook for PAN at CLEF 2016</article-title>
          . In Balog et al. [14].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>