=Paper= {{Paper |id=Vol-1584/paper10 |storemode=property |title=Towards the Development of a Cyber Analysis & Advisement Tool (CAAT) for Mitigating De-Anonymization Attacks |pdfUrl=https://ceur-ws.org/Vol-1584/paper10.pdf |volume=Vol-1584 |authors=Siobahn C. Day,Henry Williams,Joseph Shelton,Gerry Dozier |dblpUrl=https://dblp.org/rec/conf/maics/DayWSD16 }} ==Towards the Development of a Cyber Analysis & Advisement Tool (CAAT) for Mitigating De-Anonymization Attacks== https://ceur-ws.org/Vol-1584/paper10.pdf
 Siobahn C. Day et al.                                       MAICS 2016                                                 pp. 41–45




Towards the Development of a Cyber Analysis & Advisement Tool (CAAT) for Mitigating De-Anonymization Attacks

Siobahn C. Day, Henry Williams, Joseph Shelton, Gerry Dozier
Department of Computer Science, North Carolina A&T State University, Greensboro, U.S.A
Center for Advanced Studies in Identity Science
{scday, hcwillia, jashelt1}@aggies.ncat.edu, {gvdozier}@ncat.edu

Copyright held by the author(s).

Abstract

We are seeing a rise in the number of Anonymous Social Networks (ASN) that claim to provide a sense of user anonymity. However, what many users of ASNs do not know is that a person can be identified by their writing style. In this paper, we provide an overview of a number of author concealment techniques, their impact on the semantic meaning of an author's original text, and introduce AuthorCAAT, an application for mitigating de-anonymization attacks. Our results show that iterative paraphrasing performs the best in terms of author concealment and performs well with respect to Latent Semantic Analysis.

Introduction

Anonymous Social Networks (ASN) can provide users with a false sense of anonymity; however, research in the area of Author Identification (Attribution) has shown that users can be identified simply by their writing style (Stamatatos 2009). Narayanan et al. (2012) introduce the concept of a de-anonymization attack, where hackers apply sophisticated Author Identification techniques (AITs) in an effort to uncover the identity of an author of a text. Once this occurs, the hackers can track a victim across the web and even through other ASNs.

Recently, researchers (M. Brennan, Afroz, and Greenstadt 2012; Kacmarcik and Gamon 2006; Rao and Rohatgi 2000) have developed a number of techniques for author concealment. These techniques, along with their ability to conceal one's writing style, are described below: adversarial stylometry, iterative language translation, and iterative paraphrasing.

Presently there exist two forms of adversarial stylometry (Afroz, Brennan, and Greenstadt 2012; M. Brennan et al. 2012; M. R. Brennan and Greenstadt 2009). The first form, obfuscation, is when an author tries not to write like themselves, while the second form, imitation, is when an author tries to 'mimic' the writing style of another author. Research shows that both of these techniques are effective in concealing one's writing style. In the case of disguising one's writing style, M. Brennan et al. (2012) demonstrate that obfuscation and imitation are easy in the short term but more difficult to maintain in the long term. In Section IV, it will be shown how AuthorCAAT can be used to provide authors with the ability to perform long-term adversarial stylometry.

Another form of author concealment is Iterative Language Translation (ILT) (Mack, Bowers, Williams, Dozier, and Shelton 2015). ILT is where an original text is translated to another language and then back to its original language. This technique was first presented in Rao and Rohatgi (2000), where the authors describe this approach as being "somewhat facetious" and "drastic." They believed that this approach would change the meaning of a message, thus making it an impractical approach. It was also mentioned by Kacmarcik and Gamon (2006) that this approach could be a good starting point for someone looking to "scramble" their words. ILT is effective in concealing the writing style of an author; however, it is vulnerable to fingerprinting (Caliskan and Greenstadt 2012). If one knows the language used in translating the text, one can then recover the original writing style of the author.

The last form of author concealment is Iterative Paraphrasing (IP). The use of IP was originally mentioned in Kacmarcik and Gamon (2006). In IP, one will take the original text and use a paraphrasing tool to convert it into a paraphrased text. Concerning IP, to the authors' knowledge, no one has as of yet analyzed its effectiveness in author concealment, its effect on semantics, and its vulnerability to fingerprinting (this will be discussed in Section III).

The remainder of the paper is organized as follows. In Section II, we discuss our experiments. In Section III, we discuss our results. In Section IV, we provide a brief discussion of AuthorCAAT. In Section V, we provide our conclusion and future work.

Author Concealment & Fingerprinting Experiments

Our Dataset

The datasets we used for our experiments were gathered from blogs written by 100 different authors. For every author in our dataset, there are 4 instances: the first instance served as the probe and the remaining 3 instances served as the gallery. This results in 100 instances in the probe set and 300 instances in the gallery set.

Our Translators & Paraphrasers

Our ILT dataset used Google translation tools for English to Spanish, Spanish to English, English to Chinese, and Chinese to English. The ILT text was prepared in iterations. We consider an iteration to be a full round-trip cycle of translation (e.g., English-Spanish-English and English-Chinese-English). Therefore, Iteration 1 would be E-X-E, Iteration 2 would be E-X-E-X-E, and Iteration 3 would be E-X-E-X-E-X-E, where E stands for English and X ∈ {Spanish, Chinese}. Therefore, a total of six ILT datasets were developed, each consisting of 300 gallery instances of the 100 authors.

Our IP dataset was created using an online tool known as Plagiarisma. The Iterations for IP are similar to ILT. Combining ILT with IP we have X ∈ {Spanish, Chinese, Paraphraser}. Therefore, three IP datasets were developed, each consisting of 300 gallery instances of the 100 authors. For ILT/IP, there were a total of nine datasets.

Experiment I: Author Concealment via ILT/IP

For Experiment I, the feature extractor used in Mack, Bowers, Williams, Dozier, and Shelton (2015), referred to as the Hybrid-II Author Identification System (AIS), was applied to the instances of the nine datasets (and the probe set) to create feature vectors, where each feature vector consisted of 1282 features. The Hybrid-II AIS is composed of 95 features from the Unigram feature extractor (Forsyth 1997), 170 stylometric features from the De Vel, Anderson, Corney, and Mohay (2001) feature extractor, 256 features in the form of function words, and 761 features that come from the Stanford Parser in the form of Parts-of-Speech parent-child pairs, for a total of 1282 features.

In Experiment I, the baseline performance was the author recognition rate of the 100 authors (English only) using no ILT/IP iterations, while the ILT/IP experiments were used to determine how well ILT/IP reduces the author recognition rate with respect to the baseline.

Experiment II: Fingerprinting the Translators and the Paraphrasers

For Experiment II, a tool known as JGAAP, the Java Graphical Authorship Attribution Program (Juola, Sofko, and Brennan 2006), was used to fingerprint the translators and the paraphraser. This tool allows for text analysis using various stylometry and textometry techniques. We used the first 100 authors from each ILT/IP Iteration, using the first gallery instance as the 'unknown' author and the remaining two instances from the gallery as the 'known' authors. The 'known' authors were labeled by language and/or paraphraser. This was used for all three Iterations of ILT/IP. The analysis was processed using WEKA SMO, with the results ordered with event culling from most to least. Character N-grams, where n=2, were used as the event driver.

Experiment III: Fingerprinting the Number of Iterations Used to Conceal an Author's Writing Style

In Experiment III, the 'unknown' authors were chosen from the first gallery instances of all Iterations of ILT/IP. The 'known' authors were chosen from the remaining two instances of the gallery and were labeled by the number of ILT/IP Iterations that were applied. The same settings as Experiment II were used with respect to the event driver, analysis, and event culling.

Results

Results of Experiment I

The results of Experiment I, Author Concealment via ILT/IP, are shown in Figure 1. Figure 1 shows the effect that ILT/IP has on the accuracy of the AIS. In Figure 1, the x-axis represents the iteration number (Iteration 1, Iteration 2, Iteration 3) and the y-axis represents the accuracy of the AIS.

In Figure 1, the baseline accuracy of the AIS is 54%. In the first iteration of ILT/IP, the author identification rates drop. At Iteration 1, ILT-Spanish has the best performance, reducing the AIS rate to 6%, followed by IP at 7% and ILT-Chinese at 10%. In the second iteration, IP has the best performance, reducing the AIS rate to 1%, followed by ILT-Spanish at 6% and ILT-Chinese at 11%. At Iteration 3, IP continues to outperform ILT, reducing the AIS rate to 6%, followed by ILT-Spanish at 7% and ILT-Chinese at 11%. These results show the effectiveness of ILT/IP in concealing an author's identity.
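The iteration scheme described under "Our Translators & Paraphrasers" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `iteration_chain`, `apply_round_trips`, `forward`, and `backward` are hypothetical, and no real translation or paraphrasing service is called. The `forward` and `backward` arguments stand in for whatever tool performs the E-to-X and X-to-E steps.

```python
from typing import Callable, List, Tuple

# Methods and iteration counts as stated in the paper.
METHODS = ["Spanish", "Chinese", "Paraphraser"]
ITERATIONS = [1, 2, 3]


def iteration_chain(method: str, iterations: int) -> List[str]:
    """Stages visited after the given number of round trips, e.g. E-X-E-X-E."""
    chain = ["English"]
    for _ in range(iterations):
        chain += [method, "English"]
    return chain


def apply_round_trips(
    text: str,
    forward: Callable[[str], str],   # hypothetical E -> X step
    backward: Callable[[str], str],  # hypothetical X -> E step
    iterations: int,
) -> str:
    """Apply `iterations` full round trips (E -> X -> E) to the text."""
    for _ in range(iterations):
        text = backward(forward(text))
    return text


# One dataset per (method, iteration) pair: 3 methods x 3 iterations.
datasets: List[Tuple[str, int]] = [(m, i) for m in METHODS for i in ITERATIONS]

if __name__ == "__main__":
    print("-".join(iteration_chain("Spanish", 2)))
    print(len(datasets), "datasets")
```

Keeping the forward and backward steps as plain callables means the same loop covers both ILT and IP; for a paraphraser, `backward` could simply be the identity function.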




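Experiments II and III use character n-grams with n=2 as the event driver. The sketch below is not JGAAP's implementation; it only illustrates, in plain Python, what extracting character bigrams from a text and keeping the most frequent ones might look like. The function names and the cutoff `k` are assumptions made for illustration, since the paper does not state how many events were retained.

```python
from collections import Counter
from typing import Dict, List, Tuple


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def most_common_events(text: str, n: int = 2, k: int = 50) -> List[Tuple[str, int]]:
    """Keep the k most frequent n-grams, ordered from most to least frequent.

    The cutoff k is a hypothetical parameter, not a value from the paper.
    """
    return Counter(char_ngrams(text, n)).most_common(k)


def relative_frequencies(text: str, n: int = 2) -> Dict[str, float]:
    """Map each n-gram to its share of all n-grams in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}


if __name__ == "__main__":
    sample = "the translated text"
    print(most_common_events(sample, k=5))
```

Frequency profiles of this kind could then be passed to any classifier; the paper reports using WEKA SMO for that step.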
Figure 1: A Comparison of the Effectiveness of ILT/IP on Reducing Author Recognition Rates

Prior research (Caliskan and Greenstadt 2012; Kacmarcik and Gamon 2006; Rao and Rohatgi 2000) suggests that ILT/IP is naïve as well as problematic due to the resulting text being unable to retain its original meaning. In order to address this issue, we applied Latent Semantic Analysis (LSA) on all iterations of the dataset.

Latent Semantic Analysis (LSA) "…is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text" (Landauer, Foltz, and Laham 1998). Using an LSA tool developed by the University of Colorado Boulder, we compared our original text with the resulting text of ILT/IP.

In Table 1, the results of using the LSA tool on our dataset are shown. Given two samples of text, the LSA tool will provide an output of 1 if the semantics of the two text samples are exact and -1 if the semantics of the two text samples do not match at all. Given the output of the LSA tool on our dataset, we ran an ANOVA test as well as a t-test to break the performances of ILT/IP into equivalence classes as shown in Table 1.

In Table 1, the first column represents the ILT/IP method used, the second column represents the average output of the LSA tool with the standard deviation in parentheses, and the third column, labeled EC, represents the equivalence class. The equivalence classes are ordered from best to worst in terms of performance. The equivalence classes were determined by applying ANOVA and a t-test to check for statistical significance. The p-value used for the ANOVA test was 0.05.

The results displayed in Table 1 show that the resulting text from ILT-Spanish is closest to the semantics of the original text with an output of 0.862, followed by IP at 0.802 and ILT-Chinese at 0.773. This indicates that ILT/IP is not only non-problematic but effective at preserving the semantics of the original text.

Table 1: LSA Results from Comparing the Original Text with Resulting Text from ILT/IP

ILT/IP Method    LSA Results     EC
Spanish          0.862 (0.11)    1
Paraphraser      0.802 (0.09)    2
Chinese          0.773 (0.16)    3

Results of Experiment II

The results of Experiment II, Fingerprinting the Translators and the Paraphrasers, are shown in Figure 2. In Figure 2, the x-axis shows the iterations (Iteration 1, Iteration 2, Iteration 3) and the y-axis shows the accuracy in determining the ILT/IP method used. In Figure 2, one can see that as the number of iterations increases, so does the accuracy for each ILT/IP method that is being used.

In Figure 2, at Iteration 1, ILT-Spanish has the best fingerprinting accuracy at 93%, followed by ILT-Chinese at 90% and IP at 86%. In Iteration 2, ILT-Spanish leads at 98%, followed by ILT-Chinese at 97% and IP at 91%. In Iteration 3, ILT-Chinese comes in at 99%, followed by ILT-Spanish at 98% and IP at 95%. The results not only show that the translators can be accurately fingerprinted, but they also show that, of the three, IP is the hardest to fingerprint, but only at the first iteration. On the other hand, these results show that the translators and paraphraser can be identified, which can potentially allow for reversibility or the uncovering of the original text, thus revealing an author's writing style.

Figure 2: A Fingerprinting Analysis of ILT/IP over 3 Iterations

Results of Experiment III

The results of Experiment III, Fingerprinting the Number of Iterations Used to Conceal an Author's Writing Style, are shown in Figure 3. In Figure 3, the x-axis shows the iterations (Iteration 1, Iteration 2, Iteration 3) and the y-axis shows the accuracy of an iteration of ILT/IP in being fingerprinted. Figure 3 shows that determining which Iteration of ILT/IP was applied to a given text proves to be more difficult; however, the accuracy rises over iterations.

In Figure 3, at Iteration 1, ILT-Spanish leads at 70%, followed by ILT-Chinese at 61% and IP at 47%. At Iteration 2, IP performs best at 31%, followed by ILT-Spanish at 18% and ILT-Chinese at 15%. At Iteration 3, ILT-Chinese is the best performer at 60%, followed by ILT-Spanish at 53% and IP at 49%, making it the worst performer. The results show that fingerprinting ILT/IP by iteration is harder but not impossible, thus allowing an original text and author to be revealed.

Figure 3: A Fingerprinting Analysis of the Number of Iterations of ILT/IP over 3 Iterations

DISCUSSION: THE DEVELOPMENT OF AUTHORCAAT

The results presented earlier show that translators and paraphrasers can be fingerprinted. Even the iterations can be fingerprinted. In order to conceal one's identity in an efficient and effective way, the authors believe that a system must be developed that will allow a user to use all of the author concealment methods mentioned in this paper simultaneously while authoring a text. The Center for Advanced Studies in Identity Sciences (CASIS) has developed such a system for author concealment known as AuthorCAAT (Author Cyber Analysis & Advisement Tool).

Figure 5 provides a screenshot of AuthorCAAT. AuthorCAAT has a window that allows an author to type in text. As the author types, their writing style is analyzed. The feature vector associated with their writing style is shown just below the window. To the right of the window is a pane that displays the author samples that match the sample written within the window, based on a threshold the user specifies with the slide bar. For example, if the slide bar is at '10', this means that the pane will display the authors whose writing samples are within the closest 10% to the author sample that was typed in the window.

Below the "Matches to" pane is a drop-down box that will allow an author to send what is currently in the window through Spanish, Chinese, or the paraphraser and back to English. Once a language or paraphraser has been selected, the user (author) presses the 'Translate' button to execute one cycle of ILT on the text currently within the author window.

Figure 5: AuthorCAAT

In Figure 5, one can see that AuthorCAAT allows a user to perform both forms of Adversarial Stylometry. If the user sees that their writing style is detected and shown in the pane, then they can choose to re-write their text in such a way that it is not shown in the pane. A user can also monitor the pane in an effort to perform imitation authorship. As long as a particular author ID is shown in the pane (while their own author ID is not in the pane), they are writing like that particular author.

Finally, AuthorCAAT allows for ILT/IP at the sentence level. For example, an author can type in the first sentence and apply ILT/IP to that sentence. After this, the author can add a second sentence and then apply ILT/IP to both sentences in the window and/or edit the resulting sentences further (Adversarial Stylometry).

Conclusions and Future Work

In this paper, we show, first, that ILT/IP dramatically reduces the author recognition rate. Secondly, the translators and the paraphraser are good enough to preserve the semantics; this is based on our LSA results in Table 1. Thirdly, not only can language translators be fingerprinted, but paraphrasers can be fingerprinted too. Lastly, we show that the iteration of a particular ILT/IP can be fingerprinted as well. This all leads to the development of a tool, AuthorCAAT, that can do all of these things at the sentence level. This will make fingerprinting more difficult. Our future work will include increasing our dataset from 100 to 1000 authors to see if the fingerprinting becomes more accurate with more authors in terms of ILT/IP. We suspect the accuracy of fingerprinting iterations at Iteration 1 and 2 will increase with the number of authors analyzed. This is in contrast to what was stated in Caliskan and Greenstadt (2012).

Acknowledgments

This research is based upon work supported by the United States Government including the National Science Foundation. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References

Afroz, S., Brennan, M., & Greenstadt, R. (2012, May). Detecting hoaxes, frauds, and deception in writing style online. In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 461-475). IEEE.
Brennan, M. R., & Greenstadt, R. (2009, July). Practical Attacks Against Authorship Recognition Techniques. In IAAI.
Brennan, M., Afroz, S., & Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3), 12.
Caliskan, A., & Greenstadt, R. (2012, September). Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on (pp. 121-125). IEEE.
De Vel, O., Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM Sigmod Record, 30(4), 55-64.
Forsyth, R. S. (1997). Short substrings as document discriminators: An empirical study. In ACH-ALLC (Vol. 97).
Free Online Plagiarism Checker for Students, Teachers, Scholars, Educators, Scientists, Essayists, Writers. Free TurnItIn and Copyscape Alternative. (n.d.). Retrieved February 02, 2016, from http://plagiarisma.net/
Google Translate. (n.d.). Retrieved February 04, 2016, from https://translate.google.com/
Juola, P., Sofko, J., & Brennan, P. (2006). A prototype for authorship attribution studies. Literary and Linguistic Computing, 21(2), 169-178.
Kacmarcik, G., & Gamon, M. (2006). Obfuscating document stylometry to preserve author anonymity. Paper presented at the Proceedings of the COLING/ACL on Main conference poster sessions.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
LSA @ CU Boulder. (n.d.). Retrieved February 02, 2016, from http://lsa.colorado.edu/
Mack, N., Bowers, J., Williams, H., Dozier, G., & Shelton, J. (2015). The Best Way to a Strong Defense is a Strong Offense: Mitigating Deanonymization Attacks via Iterative Language Translation. International Journal of Machine Learning and Computing, 5(5), 409-413.
Narayanan, A., Paskov, H., Gong, N. Z., Bethencourt, J., Stefanov, E., Shin, E. C. R., & Song, D. (2012, May). On the feasibility of internet-scale author identification. In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 300-314). IEEE.
Rao, J. R., & Rohatgi, P. (2000). Can pseudonymity really guarantee privacy? Paper presented at the USENIX Security Symposium.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.