<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Feb</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A User-Centric and Sentiment Aware Privacy-Disclosure Detection Framework based on Multi-input Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A K M Nuhil Mehdy</string-name>
          <email>akmnuhilmehdy@u.boisestate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoda Mehrpouyan</string-name>
          <email>hodamehrpouyan@boisestate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Boise State University</institution>
          ,
          <addr-line>Boise, Idaho</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>7</volume>
      <issue>2020</issue>
      <abstract>
        <p>Data and information privacy is a major concern in today's world. More specifically, users' digital privacy has become one of the most important issues to address as information-sharing technology advances. An increasing number of users are sharing information through text messages, emails, and social media without proper awareness of privacy threats and their consequences. One approach to preventing the disclosure of private information is to identify it in a conversation and warn the sender before the message is conveyed to the receiver. Another way of preventing the loss of sensitive information is to analyze and sanitize a batch of offline documents once the data has already been accumulated somewhere. However, automating the identification of user-centric privacy disclosure in textual data is challenging, because natural language has an extremely rich form and structure with different levels of ambiguity. Therefore, we investigate a framework that could bring this challenge within reach by precisely recognizing users' privacy disclosures in a piece of text, taking into account the authorship and sentiment (tone) of the content alongside linguistic features and techniques. The proposed framework is intended as a supporting plugin that helps text classification systems more accurately identify text that might disclose the author's personal or private information.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Security and privacy → Privacy protections.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Privacy is an ancient concept concerning human values that could
be "intruded upon", "invaded", "violated", "breached", "lost", and
"diminished"[
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Each of these analogies reflects a conception
of privacy that can be found in one or more standard models or
theories of privacy. Users’ privacy has been defined as "the right to
be left alone" or being free from intrusion by the seclusion and
nonintrusion theory[
        <xref ref-type="bibr" rid="ref32 ref8">8, 32</xref>
        ]. Even though privacy varies from individual
to individual and each user may have different views of privacy,
there is an imperfect societal consensus that certain information
(e.g., personal information, situation, condition, circumstance)
is more private than other information (e.g., public statements, opinions,
comments)[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Recent advances in communication technologies such as
messaging applications and social media [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] have resulted in privacy
concerns [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] about such information among users. In
this era of digital communication, an increasing number of users
are sharing information through text messages, emails, and social
media without proper awareness of privacy threats and their
consequences. Moreover, in the context of the information society,
historical documents about entities (e.g., people, organizations) need
to be made public and shared among authorities every day [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
In such cases, improper disclosure<sup>1</sup> of a user’s information could
increase his or her security and privacy vulnerabilities, and the negative
consequences of disclosing such information could be immense [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        A recent data scandal involving Facebook and Cambridge
Analytica revealed how the personally identifiable information of up to 87
million Facebook users was used to influence voters’ opinions [
        <xref ref-type="bibr" rid="ref10 ref25">10, 25</xref>
        ]. Likewise,
millions of data breach incidents are reported all over the world
and unfortunately most of them expose users’ personal data [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
Therefore, user-centric targeted attacks that exploit the victim’s
Personally Identifiable Information (PII) have become a new kind
of privacy threat in the present day [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. It is worth mentioning
that the United States is the number one destination for such
user-centric targeted attacks according to recent statistics [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. That being
the case, users’ data privacy has become one of the major concerns
of today’s world and the requirements for privacy measures to
protect sensitive information about individuals have been researched
extensively [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref19 ref24 ref3">3, 12–14, 19, 24</xref>
        ].
      </p>
      <p>As part of these efforts, researchers in the area of Natural
Language Processing (NLP) have focused on developing techniques
and methodologies to detect, classify, and sanitize private
information in textual data. However, most of these works approach
the task by merely detecting a set of keywords, leveraging
dictionaries of terms, or applying regular expression patterns. Such
detection does not consider the context and the relationships of
the keywords in the text, and therefore produces a large number of
false positives. For example, a doctor’s article about a disease is
public, not private; the same information is sensitive and private
only when it is associated with other entities (e.g., the patient)
in ways that yield a different meaning and actually reveal
someone’s private information. It is therefore equally necessary to look into
the keywords, data subject (i.e., users), authorship, tone, and overall
meaning of the content before classifying it as a privacy disclosure
(see Figure 1). While a few recent works address
disclosure detection by considering user-centric factors,
most of them still omit other important decision-making factors
such as the sentiment and authorship of the content. Therefore, this
paper reviews the existing methodologies and techniques
from the area of NLP and proposes a novel disclosure identification
framework with the following factors in mind:</p>
      <list list-type="bullet">
        <list-item>
          <p>Considering user-centric circumstances, tone, and
authorship of content: content that has sensitive
information but no data subject, sensitive keywords but a public
ambience, or an analytical tone should not be classified as disclosure.</p>
        </list-item>
        <list-item>
          <p>Checking sentence coherence and grammatical
structure: random keywords, ambiguous or
meaningless information, and invalid utterances should not be
classified as disclosure.</p>
        </list-item>
      </list>
      <p><sup>1</sup>In this work, disclosure is defined as revealing personally identifiable information
(e.g., name, address, age) or sensitive data (e.g., health, finance, and mental status) to
others.</p>
      <p>The rest of the paper is organized as follows. Section 2 reviews
related research and discusses some of the
limitations we have observed. Section 3 describes the dataset
used in this paper. Section 4 describes the methodology in detail.
Section 5 presents the details of the deep neural network architecture,
data cleaning, pre-processing, featurization, and the experiment.
Lastly, Section 6 presents the experimental
results, followed by the conclusion.</p>
    </sec>
    <sec id="sec-3">
      <title>BACKGROUND AND RELATED WORK</title>
      <p>This section reviews state-of-the-art research on
identifying information disclosure by individuals or
organizations, in particular the literature that spans
different privacy domains such as finance, health, and location.
We briefly describe the works related to the
detection, classification, and sanitization of private information in
natural language text. We group the related works into three
categories based on their methodologies: i) leveraging
dictionaries, ii) information theory and global search, and iii) machine
learning and statistical models.</p>
      <p>
        The works in the dictionary category leverage
linguistic resources such as privacy dictionaries to automate the
content analysis of privacy-related information. Privacy dictionaries
are used with existing automated content-analysis software such
as LIWC [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Vasalou et al. propose a technique that uses such
a dictionary of individual words or phrases, each assigned to
one or more privacy domains [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. They showed that the
dictionary categories could distinguish documents
containing privacy discussions from general language by measuring unique
linguistic patterns within privacy discussions (e.g., medical records,
confidential business documents).
      </p>
      <p>
        Researchers in the area of information theory use
information-theoretic measures along with large word corpora to automatically detect
sensitive information in textual documents. The authors of [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] define sensitive
information as the pieces of text that either reveal the identity of
a private entity or refer to some confidential information about that
entity. In their approach, sensitive terms are those that provide more
information than common terms due to their specificity. Therefore,
the task is to quantify how much information each textual term
provides before identifying terms as sensitive. Similar document
sanitization tasks have been addressed by Chakaravarthy et
al., who present a scheme that detects sensitive elements
using a database of entities instead of patterns [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Each entity in
this database (e.g., persons, products, diseases) is associated
with a set of terms related to that entity. Each set is considered
the context of an entity. For example, the context of a person
entity could include his or her birth date and name. Another work
by Abril et al., which focuses on domain-independent unstructured
documents, has also been reviewed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], in which they propose using
named entity recognition techniques to identify sensitive or private
entities.
      </p>
      <p>
        Detection of privacy leaks has also been well-addressed by
machine learning and statistical techniques such as association rule
mining [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In one such approach, Chow et al. employ a model
of inference detection that uses a customized web-based corpus as a
reference, where inferences are based on word co-occurrences. The
model is given a topic (e.g., HIV, the human immunodeficiency
virus) and asked to identify all the associated keywords. Hart et al.
(2011) use machine learning techniques to classify full documents
as either sensitive or non-sensitive with automatic text classification
algorithms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Their goal is to develop an efficient and automated
tool for enterprise data loss prevention (DLP) that keeps
sensitive documents secret. They introduce a novel training strategy
called supplement and adjust to create an enterprise-level classifier
based on a support vector machine (SVM) with a linear kernel, stop
word elimination, and unigram features.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Our Contribution</title>
      <p>
        The limitations of the current studies stem from the fact that they
rely solely on the existence of keywords and neglect sentence
coherence, ignore grammatical validation, and disregard meaning
inference in a piece of content. It has been noted that these
limitations, in some cases, result in misclassification and could be
resolved by integrating part-of-speech tags, dependency parse tree
information, and word embeddings[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. However, a novel approach
is required to take into account the emotional tone or sentiment
of the users that is hidden in the textual content. For example, in
Figure 1, the text in the red box reveals someone’s private
(health) information (the patient has cancer), and the text in the
green box is about the state of Idaho and has a public
ambience. It is quite easy to distinguish these two pieces of text
with keyword spotting techniques [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, in another
example, the text from the yellow box (a comment from a doctor about
cancer) has keywords similar to those of the patient’s post in the red box,
and it likewise contains valid word sequences, grammatical
subjects (i.e., first person), and references. This piece of text
definitely does not reveal a private health situation (i.e., the doctor
himself does not have cancer). Hence, it is quite challenging to
distinguish between these types of content without taking the
sentiment of the statements into consideration. To this end, this
paper focuses on distinguishing highly similar content based on
the users’ involvement, sentiment, authorship, and grammatical
structure in order to classify texts containing someone’s privacy disclosure.
One assumption of this work is that the proposed
model does not address all of the users’ privacy and security requirements
by providing an entire threat model; rather, it provides a
better NLP tool that can be integrated into any comprehensive privacy
framework.
      </p>
    </sec>
    <sec id="sec-5">
      <title>DATASET</title>
      <p>We collected 10,000 posts by users (patients and doctors) from a public
online health forum, based on the observation (inspired by the
example in Figure 1) that patients’ posts in that forum tend to disclose
their health status, whereas doctors’ comments on
patients’ posts have highly similar content (similar keywords
and syntactic representation) but usually do not disclose the doctors’
health status (the doctors do not have those diseases). Therefore, we
labeled patients’ posts as disclosure (private) and doctors’ comments
as non-disclosure (public). For this paper, we crawled 5,000 posts
and 5,000 comments and narrowed our privacy domain to health
only. The length of the posts and comments varies from 10 words
to more than 100 words spread over several sentences.</p>
    </sec>
    <sec id="sec-6">
      <title>METHODOLOGY</title>
      <p>The core of our methodology is a combination of linguistic operations
and an artificial neural network. An overview of the
framework is depicted in Figure 2. In this section, the data pre-processing,
representation, and featurization steps are briefly explained,
followed by the details of the neural network architecture.</p>
    </sec>
    <sec id="sec-7">
      <title>Featurization and Data Representation</title>
      <p>As can be seen from the examples in Figure 1, many domain-specific
keywords can be used in both private and public posts. This makes
the problem particularly challenging because we cannot simply rely
on the lexical items in the text; we have to consider the intent of the
author of the text and somehow determine whether the intent was for the
text to be public or private. To this end, we apply custom tokenization
and enrich our data with additional linguistic
details such as syntactic dependency relations.</p>
      <p>Tokenization. In many text-based natural language processing
tasks, the text is pre-processed by removing punctuation and stop
words, leaving only the lexical items. However, we found that the
way people punctuate their texts gives clues as to whether
or not a text is valid private or public information. Therefore, we
use an NLP toolkit to tokenize the sentences in a customized way
that ignores redundant tokens such as "„", ";–", "!!!", and ":-)" but keeps
the important ones such as ",", ";", ":", ".", "he", "the", and "in".
Considering all the valid sequential tokens helps our model
learn important arrangements of tokens for validating relationships
between entities. This is somewhat in contrast to other text analysis
literature, where removing all punctuation tends to improve
task performance.</p>
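      <p>The customized tokenization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expression, the <monospace>KEEP_PUNCT</monospace> set, and the function name <monospace>tokenize</monospace> are assumptions made for illustration, and it uses only the Python standard library rather than the NLP toolkit used in the paper. Stop words and single punctuation marks are kept, while runs of punctuation such as "!!!" or ":-)" are discarded.</p>
      <preformat><![CDATA[
```python
import re

# Words (with an optional apostrophe suffix) or runs of punctuation.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]+")

# Assumption: the single marks the text lists as worth keeping.
KEEP_PUNCT = {",", ";", ":", "."}


def tokenize(text):
    """Return lowercased words plus single, meaningful punctuation marks."""
    tokens = []
    for tok in TOKEN.findall(text):
        if re.match(r"\w", tok):
            tokens.append(tok.lower())  # stop words are kept on purpose
        elif tok in KEEP_PUNCT:
            tokens.append(tok)          # a single clause-level mark
        # any other punctuation run ("!!!", ":-)", ";-") is treated as noise
    return tokens


print(tokenize("I was diagnosed with cancer, and he said: rest!!! :-)"))
```
]]></preformat>
      <p>Because every multi-character punctuation run is treated as noise, this sketch would also drop sequences such as a period followed by a closing quotation mark; a production tokenizer would need finer rules.</p>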
      <p>Syntactic Structure. In the experiments,
dependency parse tree information is also used as an additional underlying feature,
which improved the performance of the neural network model. This
helps the model observe common sequences of tokens as well as
the co-occurrence of dependency tags. We use a Dependency Parser
(DP) toolkit to extract the syntactic relation information (which is
different from, but in some ways similar to, entity relation
information) and enrich our data with it.</p>
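      <p>One plausible way to feed dependency tags to a neural network is to map each tag to an integer id and pad the sequences to a common length. The sketch below assumes the parser has already produced one dependency label per token; the example labels, the reserved ids, and the helper names <monospace>build_vocab</monospace> and <monospace>encode</monospace> are hypothetical and are not taken from the paper.</p>
      <preformat><![CDATA[
```python
# Hypothetical parser output: one dependency label per token of each post.
PARSED_POSTS = [
    ["nsubj", "aux", "ROOT", "prep", "pobj", "punct"],
    ["nsubj", "ROOT", "dobj", "punct"],
]

PAD, UNK = 0, 1  # reserved ids for padding and unseen labels (assumption)


def build_vocab(sequences):
    """Map each distinct label to an integer id, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for label in seq:
            if label not in vocab:
                vocab[label] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(seq, vocab, max_len):
    """Convert labels to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(label, UNK) for label in seq][:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(PARSED_POSTS)
encoded = [encode(seq, vocab, max_len=6) for seq in PARSED_POSTS]
print(encoded)
```
]]></preformat>
      <p>Integer sequences of this kind are the usual input to an embedding layer that learns its own vector space, as opposed to one initialized from pre-trained word vectors.</p>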
      <p>Supplemental Features. In addition to the features mentioned
above, more user-specific features, or metadata, are prepared and
provided to the extended variant of our models as supplemental
input. These auxiliary data include i) the number of pronouns, ii)
the emotional tone, and iii) the number of negations found in the post. This
additional information is intended to give the neural network
model distinguishing features for highly similar content
from different classes.</p>
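      <p>As a rough illustration of such supplemental input, the sketch below counts pronouns and negations with the Python standard library. The word lists and the function name <monospace>supplemental_features</monospace> are assumptions made for illustration, and the emotional tone is represented only by a placeholder argument, since in the paper it comes from an external tone analysis service.</p>
      <preformat><![CDATA[
```python
import re

# Illustrative word lists (assumption): the paper does not publish its own.
PRONOUNS = {"i", "me", "my", "mine", "myself", "we", "us", "our", "you", "your",
            "he", "him", "his", "she", "her", "they", "them", "their", "it", "its"}
NEGATIONS = {"no", "not", "never", "none", "nothing", "neither", "nor", "cannot"}


def supplemental_features(post, tone_score=0.0):
    """Return [pronoun count, negation count, tone score] for one post.

    tone_score is a placeholder for a value from an external tone analyzer.
    """
    text = post.lower().replace("\u2019", "'")  # normalize curly apostrophes
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    pronouns = sum(1 for w in words if w in PRONOUNS)
    negations = sum(1 for w in words if w in NEGATIONS or w.endswith("n't"))
    return [pronouns, negations, float(tone_score)]


print(supplemental_features("I don't have cancer, but my doctor never told me.", 0.4))
```
]]></preformat>
      <p>A vector of this shape can be passed to the network as a separate numeric input alongside the token and dependency-tag sequences.</p>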
    </sec>
    <sec id="sec-8">
      <title>Deep Neural Network Model</title>
      <p>After all the necessary pre-processing steps, the data is
fed into a multi-input deep neural network that learns the hidden
patterns and features needed to distinguish between texts with
disclosure and non-disclosure occurrences. The network takes lexical features (word tokens)
through one input and syntactic features (dependency parse
tree information) through another input, and then merges
those feature vectors. Later, these vectors are additionally merged
with supplemental (auxiliary) inputs before going through a further
multi-layer perceptron stage. At the end of the deep neural network,
a single neuron outputs the probability that determines which of
the above-mentioned classes is assigned. More detail about the architecture is
given in Appendix C.</p>
    </sec>
    <sec id="sec-9">
      <title>EXPERIMENT</title>
      <p>
        In the data pre-processing step, we apply spaCy [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] to perform the
linguistic operations on the text. The Keras functional API is utilized
to create the multi-input architecture [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. For implementing word
embeddings, we use its Embedding [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] layer, where pre-trained
word embeddings (GloVe) are used with the trainable flag set to true. In
another input of the multi-input model, the same type of embedding
layer, but without pre-trained vectors, is used to learn the
embedding space from the dependency parse tree information. For the
convolution over the first input channel, we use
the Conv1D layer [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], followed by a pooling layer.
      </p>
      <p>
        In the other input of the model, a long short-term memory (LSTM)
layer is used over the dependency parse tree information. The
concatenate method of Keras then takes the output vectors from
the convolution layer and the LSTM layer and merges them into a
single vector, which then acts as the input to the fully connected
layers. At this step, supplemental input, prepared using the IBM
Watson Tone Analyzer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], is merged with the concatenated vector,
followed by another stage of dense layers. Finally, a single neuron
with a sigmoid activation function outputs the class probability,
with 0.5 as the cutoff value. As false negatives of the classifier
may have dangerous consequences, it would be wise to lower this
probability cutoff value toward the negative class, depending on
how the model is used. Details of the hyperparameters are listed in
Appendix B.
      </p>
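      <p>The effect of the cutoff value can be shown with a small sketch. It assumes that the disclosure class is encoded as 1 and that the sigmoid output is the probability of that class, which the text does not state explicitly; the scores and labels below are made up for illustration.</p>
      <preformat><![CDATA[
```python
def classify(probabilities, cutoff=0.5):
    """Label a post as disclosure (1) when its score reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def false_negatives(labels, predictions):
    """Count disclosures (label 1) that the classifier missed."""
    return sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)


def false_positives(labels, predictions):
    """Count non-disclosures (label 0) that were flagged as disclosure."""
    return sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)


# Hypothetical sigmoid outputs and ground-truth labels for six posts.
scores = [0.92, 0.41, 0.45, 0.77, 0.48, 0.10]
labels = [1, 1, 0, 1, 1, 0]

for cutoff in (0.5, 0.4):
    preds = classify(scores, cutoff)
    print(cutoff, false_negatives(labels, preds), false_positives(labels, preds))
```
]]></preformat>
      <p>With these made-up scores, lowering the cutoff from 0.5 to 0.4 removes both false negatives at the cost of one false positive, which is the trade-off to weigh when choosing a cutoff for a given deployment.</p>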
    </sec>
    <sec id="sec-10">
      <title>RESULTS</title>
      <p>Before experimenting with the multi-input model, we examined the classification
task using baseline models such as a Naive Bayes
classifier and a simple convolutional neural network. Appendix A compares
in detail the accuracy of all the models, including
the model that uses user-specific supplemental input. The
results show that, despite the lack of large amounts of labeled data,
a neural-network-based classifier can be trained that goes beyond
simple keyword spotting and uses linguistic features to determine,
with a useful degree of accuracy, whether a text contains a disclosure.
Moreover, integrating user-specific metadata
into the models increases the classification accuracy significantly (up
to 97%). However, the generalizability of the model has not been
well evaluated because of the lack of data sets with similar
characteristics (i.e., nearly indistinguishable utterances that nevertheless carry different
meanings).</p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSION</title>
      <p>Users are in dire need of a practical model of privacy disclosure detection
in this era of social networks, with activities such as
online forum posting, emailing, and text messaging. Accordingly, the
development of algorithms and tools that help identify privacy
disclosure in textual data is important. While many works
in this area focus mainly on classifying textual data as public or
private at the document level by merely spotting keywords, only a few
are concerned with privacy detection that takes the
users’ context into account.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors would like to thank the National Science Foundation for
its support through the Computer and Information Science and
Engineering (CISE) program and Research Initiation Initiative (CRII)
grant number 1657774 of the Secure and Trustworthy Cyberspace
(SaTC) program: A System for Privacy Management in Ubiquitous
Environments.</p>
    </sec>
    <sec id="sec-13">
      <title>RESULTS IN DETAIL</title>
    </sec>
    <sec id="sec-14">
      <title>MODEL HYPERPARAMETERS</title>
      <p>Some hyperparameters worth mentioning are as follows. The pre-trained
embedding uses a GloVe 100-dimensional embedding matrix whose
weights can be adjusted during training. The
convolution stage uses 32 filters with a kernel size of 4, has the
rectified linear unit (ReLU) as its activation function, and is followed by global
max pooling. The LSTM layer contains 32 neurons with
all the default settings as per the Keras documentation. The first
stage of dense layers, after the first concatenation, contains 128 and
64 neurons with ReLU as the activation function. The
second stage of dense layers contains 64, 32, and 16 neurons with the same
activation function, followed by a single output neuron with
sigmoid as its activation function. We train the model for 20 epochs
with a batch size of 32. The model uses binary cross-entropy
as the loss function and RMSprop as the optimizer.</p>
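      <p>To give a sense of scale, the layer sizes listed above can be turned into approximate trainable-parameter counts using the standard formulas for convolution, LSTM, and dense layers. This is a back-of-the-envelope sketch, not a reproduction of the authors' model: the embedding size of the dependency-tag input and the number of supplemental features are not stated in the text and are assumed here, as is the reading that the supplemental input is concatenated after the first dense stage. Embedding matrices are excluded because the vocabulary sizes are not given.</p>
      <preformat><![CDATA[
```python
def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


DEP_EMBED_DIM = 16   # assumption: not stated in the text
N_SUPPLEMENTAL = 3   # assumption: pronoun count, negation count, tone

conv = conv1d_params(in_channels=100, filters=32, kernel_size=4)
lstm = lstm_params(input_dim=DEP_EMBED_DIM, units=32)
merged = 32 + 32     # global max pooling output + final LSTM output

stage1 = dense_params(merged, 128) + dense_params(128, 64)
stage2 = (dense_params(64 + N_SUPPLEMENTAL, 64) + dense_params(64, 32)
          + dense_params(32, 16) + dense_params(16, 1))

print(conv, lstm, stage1, stage2)
```
]]></preformat>
      <p>Under these assumptions the convolution contributes 12,832 parameters, the LSTM 6,272, the first dense stage 16,576, and the second dense stage with the output neuron 6,977, so the non-embedding part of the network is small compared with a 100-dimensional word embedding matrix over a realistic vocabulary.</p>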
    </sec>
    <sec id="sec-15">
      <title>NEURAL NETWORK ARCHITECTURE</title>
      <p>Architecture of the Neural Network (automatically rendered by the
Keras plotter) is given below.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Abril</surname>
          </string-name>
          , Guillermo Navarro-Arribas, and
          <string-name>
            <given-names>Vicenç</given-names>
            <surname>Torra</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>On the declassification of confidential documents</article-title>
          .
          <source>In International Conference on Modeling Decisions for Artificial Intelligence</source>
          . Springer,
          <fpage>235</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Roy F</given-names>
            <surname>Baumeister</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kenneth J</given-names>
            <surname>Cairns</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Repression and self-presentation: When audiences interfere with self-deceptive strategies</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>62</volume>
          ,
          <issue>5</issue>
          (
          <year>1992</year>
          ),
          <fpage>851</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
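          <string-name>
            <given-names>Tom</given-names>
            <surname>Buchanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Carina</given-names>
            <surname>Paine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Adam N</given-names>
            <surname>Joinson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ulf-Dietrich</given-names>
            <surname>Reips</surname>
          </string-name>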
          .
          <year>2007</year>
          .
          <article-title>Development of measures of online privacy concern and protection for use on the Internet</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>58</volume>
          ,
          <issue>2</issue>
          (
          <year>2007</year>
          ),
          <fpage>157</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Aylin</given-names>
            <surname>Caliskan Islam</surname>
          </string-name>
          , Jonathan Walsh, and
          <string-name>
            <given-names>Rachel</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Privacy detective: Detecting private information and collective privacy behavior in a large social network</article-title>
          .
          <source>In Proceedings of the 13th Workshop on Privacy in the Electronic Society</source>
          . ACM,
          <fpage>35</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Venkatesan T</given-names>
            <surname>Chakaravarthy</surname>
          </string-name>
          , Himanshu Gupta, Prasan Roy, and Mukesh K Mohania.
          <year>2008</year>
          .
          <article-title>Efficient techniques for document sanitization</article-title>
          .
          <source>In Proceedings of the 17th ACM conference on Information and knowledge management</source>
          . ACM,
          <fpage>843</fpage>
          -
          <lpage>852</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Chow</surname>
          </string-name>
          , Philippe Golle, and
          <string-name>
            <given-names>Jessica</given-names>
            <surname>Staddon</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Detecting privacy leaks using corpus-based association rules</article-title>
          .
          <source>In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . ACM,
          <fpage>893</fpage>
          -
          <lpage>901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Emily</given-names>
            <surname>Christofides</surname>
          </string-name>
          , Amy Muise, and
          <string-name>
            <given-names>Serge</given-names>
            <surname>Desmarais</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Information disclosure and control on Facebook: Are they two sides of the same coin or two different processes?</article-title>
          <source>Cyberpsychology &amp; behavior</source>
          <volume>12</volume>
          ,
          <issue>3</issue>
          (
          <year>2009</year>
          ),
          <fpage>341</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jamal</given-names>
            <surname>Greene</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>The so-called right to privacy</article-title>
          .
          <source>UC Davis L. Rev.</source>
          <volume>43</volume>
          (
          <year>2009</year>
          ),
          <fpage>715</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hart</surname>
          </string-name>
          , Pratyusa Manadhata, and Rob Johnson.
          <year>2011</year>
          .
          <article-title>Text classification for data loss prevention</article-title>
          .
          <source>In International Symposium on Privacy Enhancing Technologies Symposium</source>
          . Springer,
          <fpage>18</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Hern</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Far more than 87m Facebook users had data compromised, MPs told</article-title>
          . https://www.theguardian.com/uk-news/2018/apr/17/facebook-users-data-compromised-far-more-than-87m-mps-told-cambridge-analytica. [Online; accessed 01-April-2019].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>IBM Watson - Tone Analyzer</article-title>
          . https://www.ibm.com/watson/services/tone-analyzer/. [Online; accessed 01-December-2019].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Adam N</given-names>
            <surname>Joinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ulf-Dietrich</given-names>
            <surname>Reips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tom</given-names>
            <surname>Buchanan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carina B</given-names>
            <surname>Paine Schofield</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Privacy, trust, and self-disclosure online</article-title>
          .
          <source>Human-Computer Interaction</source>
          <volume>25</volume>
          ,
          <issue>1</issue>
          (
          <year>2010</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Rezvan</given-names>
            <surname>Joshaghani</surname>
          </string-name>
          , Stacy Black, Elena Sherman, and
          <string-name>
            <given-names>Hoda</given-names>
            <surname>Mehrpouyan</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Formal specification and verification of user-centric privacy policies for ubiquitous systems</article-title>
          .
          <source>In Proceedings of the 23rd International Database Applications &amp; Engineering Symposium</source>
          .
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Rezvan</given-names>
            <surname>Joshaghani</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hoda</given-names>
            <surname>Mehrpouyan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A model-checking approach for enforcing purpose-based privacy policies</article-title>
          .
          <source>In 2017 IEEE Symposium on Privacy-Aware Computing (PAC)</source>
          . IEEE,
          <fpage>178</fpage>
          -
          <lpage>179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Keras</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Convolutional Layers - Keras Documentation</article-title>
          . https://keras.io/layers/convolutional/. [Online; accessed 01-February-2019].
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Keras</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Embedding Layers - Keras Documentation</article-title>
          . https://keras.io/layers/embeddings/. [Online; accessed 01-February-2019].
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Keras</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Guide to the Functional API - Keras Documentation</article-title>
          . https://keras.io/getting-started/functional-api-guide/. [Online; accessed 01-February-2019].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>LIWC</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Linguistic Inquiry and Word Count</article-title>
          . https://liwc.wpengine.com/. [Online; accessed 01-February-2019].
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Naresh K</given-names>
            <surname>Malhotra</surname>
          </string-name>
          , Sung S Kim, and
          <string-name>
            <given-names>James</given-names>
            <surname>Agarwal</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Internet users' information privacy concerns (IUIPC): The construct, the scale, and a causal model</article-title>
          .
          <source>Information systems research</source>
          <volume>15</volume>
          ,
          <issue>4</issue>
          (
          <year>2004</year>
          ),
          <fpage>336</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Nuhil</given-names>
            <surname>Mehdy</surname>
          </string-name>
          , Casey Kennington, and
          <string-name>
            <given-names>Hoda</given-names>
            <surname>Mehrpouyan</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Privacy Disclosures Detection in Natural-Language Text Through Linguistically-motivated Artificial Neural Network</article-title>
          .
          <source>In 2nd EAI International Conference on Security and Privacy in New Computing Environments</source>
          . EAI.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Phelps</surname>
          </string-name>
          , Glen Nowak, and
          <string-name>
            <given-names>Elizabeth</given-names>
            <surname>Ferrell</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Privacy concerns and consumer willingness to provide personal information</article-title>
          .
          <source>Journal of Public Policy &amp; Marketing</source>
          <volume>19</volume>
          ,
          <issue>1</issue>
          (
          <year>2000</year>
          ),
          <fpage>27</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>David</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Montserrat</given-names>
            <surname>Batet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Viejo</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Detecting sensitive information from textual documents: an information-theoretic approach</article-title>
          .
          <source>In International Conference on Modeling Decisions for Artificial Intelligence</source>
          . Springer,
          <fpage>173</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Ashley</given-names>
            <surname>Savage</surname>
          </string-name>
          and Richard Hyde.
          <year>2014</year>
          .
          <article-title>Using freedom of information requests to facilitate research</article-title>
          .
          <source>International Journal of Social Research Methodology</source>
          <volume>17</volume>
          ,
          <issue>3</issue>
          (
          <year>2014</year>
          ),
          <fpage>303</fpage>
          -
          <lpage>317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Arnon</given-names>
            <surname>Siegel</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>In Pursuit of Privacy: Laws, Ethics, and the Rise of Technology</article-title>
          .
          <source>The Wilson Quarterly</source>
          <volume>21</volume>
          ,
          <issue>4</issue>
          (
          <year>1997</year>
          ),
          <fpage>100</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Olivia</given-names>
            <surname>Solon</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Facebook says Cambridge Analytica may have gained 37m more users' data</article-title>
          . https://www.theguardian.com/technology/2018/apr/04/facebook-cambridge-analytica-user-data-latest-more-than-thought. [Online; accessed 01-April-2019].
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Spacy</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Linguistic Features - Named Entities</article-title>
          . https://spacy.io/usage/linguistic-features#section-named-entities. [Online; accessed 01-February-2019].
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Statista</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Number of U.S. data breaches 2014-2018, by industry</article-title>
          . https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/. [Online; accessed 01-April-2019].
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Symantec</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>10 cyber security facts and statistics for 2018</article-title>
          . https://us.norton.com/internetsecurity-emerging-threats-10-facts-about-todays-cybersecurity-landscape-that-you-should-know.html. [Online; accessed 01-April-2019].
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Herman T</given-names>
            <surname>Tavani</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Philosophical theories of privacy: Implications for an adequate online privacy policy</article-title>
          .
          <source>Metaphilosophy</source>
          <volume>38</volume>
          ,
          <issue>1</issue>
          (
          <year>2007</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Asimina</given-names>
            <surname>Vasalou</surname>
          </string-name>
          , Alastair J Gill, Fadhila Mazanderani, Chrysanthi Papoutsi, and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Joinson</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Privacy dictionary: A new resource for the automated content analysis of privacy</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>62</volume>
          ,
          <issue>11</issue>
          (
          <year>2011</year>
          ),
          <fpage>2095</fpage>
          -
          <lpage>2105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Ding</given-names>
            <surname>Wang</surname>
          </string-name>
          , Zijian Zhang, Ping Wang, Jeff Yan, and
          <string-name>
            <given-names>Xinyi</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Targeted online password guessing: An underestimated threat</article-title>
          .
          <source>In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security</source>
          . ACM,
          <fpage>1242</fpage>
          -
          <lpage>1254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Warren</surname>
          </string-name>
          and
          <string-name>
            <given-names>Louis</given-names>
            <surname>Brandeis</surname>
          </string-name>
          .
          <year>1890</year>
          .
          <article-title>The Right to Privacy</article-title>
          .
          <source>Harvard Law Review</source>
          <volume>4</volume>
          ,
          <issue>5</issue>
          (
          <year>1890</year>
          ),
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>