=Paper=
{{Paper
|id=Vol-1584/paper10
|storemode=property
|title=Towards the Development of a Cyber Analysis & Advisement Tool (CAAT) for Mitigating De-Anonymization Attacks
|pdfUrl=https://ceur-ws.org/Vol-1584/paper10.pdf
|volume=Vol-1584
|authors=Siobahn C. Day,Henry Williams,Joseph Shelton,Gerry Dozier
|dblpUrl=https://dblp.org/rec/conf/maics/DayWSD16
}}
==Towards the Development of a Cyber Analysis & Advisement Tool (CAAT) for Mitigating De-Anonymization Attacks==
Siobahn C. Day et al. MAICS 2016 pp. 41–45
Siobahn C. Day, Henry Williams, Joseph Shelton, Gerry Dozier
Department of Computer Science, North Carolina A&T State University, Greensboro, U.S.A
Center for Advanced Studies in Identity Science
{scday, hcwillia, jashelt1}@aggies.ncat.edu, {gvdozier}@ncat.edu
Copyright held by the author(s).

Abstract

We are seeing a rise in the number of Anonymous Social Networks (ASN) that claim to provide a sense of user anonymity. However, what many users of ASNs do not know is that a person can be identified by their writing style. In this paper, we provide an overview of a number of author concealment techniques, their impact on the semantic meaning of an author’s original text, and introduce AuthorCAAT, an application for mitigating de-anonymization attacks. Our results show that iterative paraphrasing performs the best in terms of author concealment and performs well with respect to Latent Semantic Analysis.

Introduction

Anonymous Social Networks (ASN) can provide users with a false sense of anonymity; research in the area of Author Identification (Attribution) has shown that users can be identified simply by their writing style (Stamatatos 2009). Narayanan et al. (2012) introduce the concept of a de-anonymization attack, where hackers apply sophisticated Author Identification techniques (AITs) in an effort to uncover the identity of the author of a text. Once this occurs, the hackers can track a victim across the web and even through other ASNs.

Recently, researchers (M. Brennan, Afroz, and Greenstadt 2012; Kacmarcik and Gamon 2006; Rao and Rohatgi 2000) have developed a number of techniques for author concealment. These techniques, whose ability to conceal one’s writing style is discussed below, are adversarial stylometry, iterative language translation, and iterative paraphrasing.

Presently there exist two forms of adversarial stylometry (Afroz, Brennan, and Greenstadt 2012; M. Brennan et al. 2012; M. R. Brennan and Greenstadt 2009). The first form, obfuscation, is when an author tries not to write like themselves, while the second form, imitation, is when an author tries to ‘mimic’ the writing style of another author. Research shows that both of these techniques are effective in concealing one’s writing style. In the case of disguising one’s writing style, M. Brennan et al. (2012) demonstrate that obfuscation and imitation are easy in the short term but more difficult to maintain in the long term. In Section IV, it will be shown how AuthorCAAT can be used to provide authors with the ability to perform long-term adversarial stylometry.

Another form of author concealment is Iterative Language Translation (ILT) (Mack, Bowers, Williams, Dozier, and Shelton 2015). In ILT, an original text is translated to another language and then back to its original language. This technique was first presented in Rao and Rohatgi (2000), where the authors describe this approach as being “somewhat facetious” and “drastic.” They believed that this approach would change the meaning of a message, thus making it impractical. Kacmarcik and Gamon (2006) also mentioned that this approach could be a good starting point for someone looking to “scramble” their words. ILT is effective in concealing the writing style of an author; however, it is vulnerable to fingerprinting (Caliskan and Greenstadt 2012). If one knows the language used in translating the text, one can then recover the original writing style of the author.

The last form of author concealment is Iterative Paraphrasing (IP). The use of IP was originally mentioned in Kacmarcik and Gamon (2006). In IP, one takes the original text and uses a paraphrasing tool to convert it into a paraphrased text. To the authors’ knowledge, no one has yet analyzed the effectiveness of IP in author concealment, its effect on semantics, or its vulnerability to fingerprinting (this will be discussed in Section III).

The remainder of the paper is organized as follows. In Section II, we discuss our experiments. In Section III, we discuss our results. In Section IV, we provide a brief discussion of AuthorCAAT. In Section V, we provide our conclusion and future work.

Author Concealment & Fingerprinting Experiments

Our Dataset

The datasets we used for our experiments were gathered from blogs written by 100 different authors. For every author in our dataset, there are 4 instances: the first instance served as the probe and the remaining 3 instances served as the gallery. This results in 100 instances in the probe set and 300 instances in the gallery set.

Our Translators & Paraphrasers

Our ILT dataset used Google translation tools for English to Spanish, Spanish to English, English to Chinese, and Chinese to English. The ILT text was prepared in iterations. We consider an iteration to be a full round-trip cycle of translation (e.g. English-Spanish-English and English-Chinese-English). Therefore, Iteration 1 would be E-X-E, Iteration 2 would be E-X-E-X-E, and Iteration 3 would be E-X-E-X-E-X-E, where E stands for English and X ∈ {Spanish, Chinese}. With two languages and three iterations, a total of six ILT datasets were developed, each consisting of 300 gallery instances of the 100 authors.

Our IP dataset was created using an online tool known as Plagiarisma. The Iterations for IP are similar to ILT. Combining ILT with IP we have X ∈ {Spanish, Chinese, Paraphraser}. Therefore, three IP datasets were developed, each consisting of 300 gallery instances of the 100 authors. For ILT/IP, there were a total of nine datasets.

Experiment I: Author Concealment via ILT/IP

For Experiment I, the feature extractor used in Mack, Bowers, Williams, Dozier, and Shelton (2015), referred to as the Hybrid-II Author Identification System (AIS), was applied to the instances of the nine datasets (and the probe set) to create feature vectors where each feature vector consisted of 1282 features. The Hybrid-II AIS is composed of 95 features from the Unigram feature extractor (Forsyth 1997), 170 stylometric features from the De Vel, Anderson, Corney, and Mohay (2001) feature extractor, 256 features in the form of function words, and 761 features that come from the Stanford Parser in the form of Parts-of-Speech parent-child pairs, for a total of 1282 features.

In Experiment I, the baseline performance was the author recognition rate of the 100 authors (English only) using no ILT/IP iterations, while the ILT/IP experiments were used to determine how well ILT/IP reduces the author recognition rate with respect to the baseline.

Experiment II: Fingerprinting the Translators and the Paraphrasers

For Experiment II, a tool known as JGAAP, the Java Graphical Authorship Attribution Program (Juola, Sofko, and Brennan 2006), was used to fingerprint the translators and the paraphraser. This tool allows for text analysis using various stylometry and textometry techniques. We used the first 100 authors from each ILT/IP Iteration, using the first gallery instance as the ‘unknown’ author and the remaining two instances from the gallery as the ‘known’ authors. The ‘known’ authors were labeled by language and/or paraphraser. This was used for all three Iterations of ILT/IP. The analysis was processed by using WEKA SMO, with the results ordered with event culling from most to least. Character N-Grams, where n=2, were used as the event driver.

Experiment III: Fingerprinting the Number of Iterations Used to Conceal an Author’s Writing Style

In Experiment III, the ‘unknown’ authors were chosen from the first gallery instances of all Iterations of ILT/IP. The ‘known’ authors were chosen from the remaining two instances of the gallery and were labeled by the number of ILT/IP Iterations that were applied. The same settings as Experiment II were used with respect to the event driver, analysis, and event culling.

Results

Results of Experiment I

The results of Experiment I, Author Concealment via ILT/IP, are shown in Figure 1. Figure 1 shows the effect that ILT/IP has on the accuracy of the AIS. In Figure 1, the x-axis represents the iteration number (Iteration 1, Iteration 2, Iteration 3) and the y-axis represents the accuracy of the AIS.

In Figure 1, the baseline accuracy of the AIS (no ILT/IP) is 54%. In the first iteration of ILT/IP, the author identification rates drop. At Iteration 1, ILT-Spanish has the best performance, reducing the AIS rate to 6%, followed by IP at 7% and ILT-Chinese at 10%. In the second iteration, IP has the best performance, reducing the AIS rate to 1%, followed by ILT-Spanish at 6% and ILT-Chinese at 11%. At Iteration 3, IP continues to outperform ILT: IP reduces the AIS rate to 6%, followed by ILT-Spanish at 7% and ILT-Chinese at 11%. These results show the effectiveness of ILT/IP in concealing an author’s identity.
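The round-trip iterations and the recognition rate used in Experiment I can be sketched as follows. This is a minimal illustration and not the Hybrid-II AIS: the translator callables, the function names, and the nearest-neighbour matcher with Euclidean distance are all assumptions made for the example, since the paper does not state how probe vectors are matched against gallery vectors.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: any function mapping text to text (e.g. a translation call).
Translator = Callable[[str], str]
Sample = Tuple[str, Sequence[float]]  # (author id, feature vector)


def iterate_round_trips(text: str, to_pivot: Translator,
                        to_english: Translator, iterations: int) -> List[str]:
    """Return the text after 1..iterations round trips.

    Iteration 1 is E-X-E, Iteration 2 is E-X-E-X-E, and so on, where X is
    the pivot language (or a paraphraser, if both callables paraphrase).
    """
    outputs = []
    current = text
    for _ in range(iterations):
        current = to_english(to_pivot(current))
        outputs.append(current)
    return outputs


def recognition_rate(probes: List[Sample], gallery: List[Sample]) -> float:
    """Fraction of probes whose nearest gallery vector has the same author.

    Assumption: rank-1 nearest neighbour under Euclidean distance.
    """
    hits = 0
    for author, vector in probes:
        nearest_author, _ = min(gallery, key=lambda g: math.dist(vector, g[1]))
        if nearest_author == author:
            hits += 1
    return hits / len(probes)
```

Under these assumptions, concealment corresponds to a falling `recognition_rate` when the gallery vectors are extracted from round-tripped text rather than from the original English text.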
Figure 1: A Comparison of the Effectiveness of ILT/IP on Reducing Author Recognition Rates

Prior research (Caliskan and Greenstadt 2012; Kacmarcik and Gamon 2006; Rao and Rohatgi 2000) suggests that ILT/IP is naïve as well as problematic because the resulting text is unable to retain its original meaning. In order to address this issue, we applied Latent Semantic Analysis (LSA) on all iterations of the dataset.

Latent Semantic Analysis (LSA) “…is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text” (Landauer, Foltz, and Laham 1998). Using an LSA tool developed by the University of Colorado Boulder, we compared our original text with the resulting text of ILT/IP.

In Table 1, the results of using the LSA tool on our dataset are shown. Given two samples of text, the LSA tool will provide an output of 1 if the semantics of the two text samples match exactly and -1 if the semantics of the two text samples do not match at all. Given the output of the LSA tool on our dataset, we ran an ANOVA test as well as a t-test to break the performances of ILT/IP into equivalence classes, as shown in Table 1.

In Table 1, the first column represents the ILT/IP method used, the second column represents the average output of the LSA tool with the standard deviation in parentheses, and the third column, labeled EC, represents the equivalence class. The equivalence classes are ordered from best to worst in terms of performance. The equivalence classes were determined by applying ANOVA and a t-test to check for statistical significance. The significance threshold (p-value) used for the ANOVA test was 0.05.

Table 1: LSA Results from Comparing the Original Text with Resulting Text from ILT/IP
ILT/IP Method    LSA Results     EC
Spanish          0.862 (0.11)    1
Paraphraser      0.802 (0.09)    2
Chinese          0.773 (0.16)    3

The results displayed in Table 1 show that the resulting text from ILT-Spanish is closest to the semantics of the original text with an output of 0.862, followed by IP at 0.802 and ILT-Chinese at 0.773. This indicates that ILT/IP is not only non-problematic but effective at preserving the semantics of the original text.

Results of Experiment II

The results of Experiment II, Fingerprinting the Translators and the Paraphrasers, are shown in Figure 2. In Figure 2, the x-axis shows the iterations (Iteration 1, Iteration 2, Iteration 3) and the y-axis shows the accuracy in determining the ILT/IP method used. In Figure 2, one can see that as the number of iterations increases, so does the fingerprinting accuracy for each ILT/IP method.

In Figure 2, at Iteration 1, ILT-Spanish has the best fingerprinting accuracy at 93%, followed by ILT-Chinese at 90% and IP at 86%. In Iteration 2, ILT-Spanish leads at 98%, followed by ILT-Chinese at 97% and IP at 91%. In Iteration 3, ILT-Chinese comes in at 99%, followed by ILT-Spanish at 98% and IP at 95%. The results not only show that the translators can be accurately fingerprinted, but they also show that, of the three, IP is the hardest to fingerprint, but only at the first iteration. On the other hand, these results show that the translators and paraphrasers can be identified, which can potentially allow for reversibility or the uncovering of the original text, thus revealing an author’s writing style.

Figure 2: A Fingerprinting Analysis of ILT/IP over 3 Iterations

Results of Experiment III

The results of Experiment III, Fingerprinting the Number of Iterations Used to Conceal an Author’s Writing Style, are shown in Figure 3. In Figure 3, the x-axis shows the iterations (Iteration 1, Iteration 2, Iteration 3) and the y-axis shows the accuracy with which an iteration of ILT/IP is fingerprinted. Figure 3 shows that determining which Iteration of ILT/IP was applied to a given text proves to be more difficult; however, the accuracy rises over iterations.

In Figure 3, at Iteration 1, ILT-Spanish leads at 70%, followed by ILT-Chinese at 61% and IP at 47%. At Iteration 2, IP performs best at 31%, followed by ILT-Spanish at 18% and ILT-Chinese at 15%. At Iteration 3, ILT-Chinese is the best performer at 60%, followed by ILT-Spanish at 53% and IP at 49%, making IP the worst performer. The results show that fingerprinting ILT/IP by iteration is harder but not impossible, thus allowing an original text and author to be revealed.

Figure 3: A Fingerprinting Analysis of the Number of Iterations of ILT/IP over 3 Iterations

Discussion: The Development of AuthorCAAT

The results presented earlier show that translators and paraphrasers can be fingerprinted. Even the iterations can be fingerprinted. In order to conceal one’s identity in an efficient and effective way, the authors believe that a system must be developed that will allow a user to use all of the author concealment methods mentioned in this paper simultaneously while authoring a text. The Center for Advanced Studies in Identity Sciences (CASIS) has developed such a system for author concealment known as AuthorCAAT (Author Cyber Analysis & Advisement Tool).

Figure 5 provides a screenshot of AuthorCAAT. AuthorCAAT has a window that allows an author to type in text. As the author types, their writing style is analyzed. The feature vector associated with their writing style is shown just below the window. To the right of the window is a pane that displays the author samples that match the sample written within the window, based on a user-specified value set by the slide bar. For example, if the slide bar is at ‘10’, the pane will display the authors whose writing samples are within the closest 10% to the author sample that was typed in the window.

Below the ‘Matches to’ pane is a drop-down box that allows an author to translate what is currently in the window into Spanish or Chinese, or to paraphrase it, and then back to English. Once a language or paraphraser has been selected, the user (author) presses the ‘Translate’ button to execute one cycle of ILT on the text currently within the author window.

Figure 5: AuthorCAAT

In Figure 5, one can see that AuthorCAAT allows a user to perform both forms of Adversarial Stylometry. If the user sees that their writing style is detected and shown in the pane, then they can choose to re-write their text in such a way that it is not shown in the pane. A user can also monitor the pane in an effort to perform imitation authorship. As long as a particular author ID is shown in the pane (while their own author ID is not in the pane), they are writing like that particular author.

Finally, AuthorCAAT allows for ILT/IP at the sentence level. For example, an author can type in the first sentence and apply ILT/IP to that sentence. After this, the author can add a second sentence and then apply ILT/IP to both sentences in the window and/or edit the resulting sentences further (Adversarial Stylometry).

Conclusions and Future Work

In this paper, we have shown that ILT/IP dramatically reduces the author recognition rate. Secondly, translators and paraphrasers are good enough to preserve the semantics; this is based on the LSA results in Table 1. Thirdly, not only can language translators be fingerprinted, but paraphrasers can be fingerprinted too. Lastly, we show that the iteration of a particular ILT/IP can be fingerprinted as well. This all leads to the development of a tool, AuthorCAAT, that can do all of these things at the sentence level. This will make fingerprinting more difficult. Our future work will include increasing our dataset from 100 to 1000 authors to see if the fingerprinting becomes more accurate with more authors in terms of ILT/IP. We suspect the accuracy of fingerprinting iterations at Iteration 1 and 2 will increase with the number of authors analyzed. This is in contrast to what was stated in Caliskan and Greenstadt (2012).

Acknowledgments

This research is based upon work supported by the United States Government including the National Science Foundation. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References

Afroz, S., Brennan, M., & Greenstadt, R. (2012, May). Detecting hoaxes, frauds, and deception in writing style online. In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 461-475). IEEE.
Brennan, M. R., & Greenstadt, R. (2009, July). Practical Attacks Against Authorship Recognition Techniques. In IAAI.
Brennan, M., Afroz, S., & Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3), 12.
Caliskan, A., & Greenstadt, R. (2012, September). Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on (pp. 121-125). IEEE.
De Vel, O., Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM Sigmod Record, 30(4), 55-64.
Forsyth, R. S. (1997). Short substrings as document discriminators: An empirical study. In ACH-ALLC (Vol. 97).
Free Online Plagiarism Checker for Students, Teachers, Scholars, Educators, Scientists, Essayists, Writers. Free TurnItIn and Copyscape Alternative. (n.d.). Retrieved February 02, 2016, from http://plagiarisma.net/
Google Translate. (n.d.). Retrieved February 04, 2016, from https://translate.google.com/
Juola, P., Sofko, J., & Brennan, P. (2006). A prototype for authorship attribution studies. Literary and Linguistic Computing, 21(2), 169-178.
Kacmarcik, G., & Gamon, M. (2006). Obfuscating document stylometry to preserve author anonymity. Paper presented at the Proceedings of the COLING/ACL on Main conference poster sessions.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
LSA @ CU Boulder. (n.d.). Retrieved February 02, 2016, from http://lsa.colorado.edu/
Mack, N., Bowers, J., Williams, H., Dozier, G., & Shelton, J. (2015). The Best Way to a Strong Defense is a Strong Offense: Mitigating Deanonymization Attacks via Iterative Language Translation. International Journal of Machine Learning and Computing, 5(5), 409-413.
Narayanan, A., Paskov, H., Gong, N. Z., Bethencourt, J., Stefanov, E., Shin, E. C. R., & Song, D. (2012, May). On the feasibility of internet-scale author identification. In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 300-314). IEEE.
Rao, J. R., & Rohatgi, P. (2000). Can pseudonymity really guarantee privacy? Paper presented at the USENIX Security Symposium.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.
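The match pane described in the discussion of AuthorCAAT, where a slide bar value of ‘10’ lists the authors whose samples are within the closest 10% to the typed sample, might be sketched as below. This is one possible reading and not AuthorCAAT's actual code: the function names, the Euclidean distance, and the rule of keeping the nearest ceil(p% of N) stored samples are assumptions introduced for illustration.

```python
import math
from typing import Dict, List, Sequence


def matches_pane(typed_vector: Sequence[float],
                 stored_samples: Dict[str, Sequence[float]],
                 slider_percent: float) -> List[str]:
    """Return author ids of the stored samples closest to the typed text.

    Assumption: "closest p%" means the nearest ceil(p% of N) samples by
    Euclidean distance, with at least one author always shown.
    """
    if not stored_samples:
        return []
    ranked = sorted(stored_samples,
                    key=lambda a: math.dist(typed_vector, stored_samples[a]))
    keep = max(1, math.ceil(len(ranked) * slider_percent / 100))
    return ranked[:keep]


def is_concealed(own_id: str, pane: List[str]) -> bool:
    """Obfuscation succeeds, in this sketch, when the writer's own id is absent."""
    return own_id not in pane


def is_imitating(own_id: str, target_id: str, pane: List[str]) -> bool:
    """Imitation: the target id is shown while the writer's own id is not."""
    return target_id in pane and own_id not in pane
```

A writer would re-run `matches_pane` after each edit or ILT/IP cycle and keep revising until `is_concealed` (or `is_imitating`) holds for the chosen slider value.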