             Challenges in Automated Question Answering for Privacy Policies
                                Abhilasha Ravichander§ , Alan Black§ , Eduard Hovy§ ,
                               Joel Reidenberg† , N. Cameron Russell† , Norman Sadeh§
                § Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, USA
                † Fordham University, School of Law, New York, NY, USA
                  {aravicha, awb, ehovy, sadeh}@cs.cmu.edu                     jreidenberg@law.fordham.edu

                            Abstract

Privacy policies are legal documents used to inform users about the collection and handling of their data by the services or technologies with which they interact. Research has shown that few users take the time to read these policies, as they are often long and difficult to understand. In addition, users often only care about a small subset of issues discussed in privacy policies, and some of the issues they actually care about may not even be addressed in the text of the policies. Rather than requiring users to read the policies, a better approach might be to allow them to simply ask questions about those issues they care about, possibly through iterative dialog. In this work, we take a step towards this goal by exploring the idea of an automated privacy question-answering assistant, and look at the kinds of questions users are likely to pose to such a system. This analysis is informed by an initial study that elicits privacy questions from crowdworkers about the data practices of mobile apps. We analyze 1350 questions posed by crowdworkers about the privacy practices of a diverse cross section of mobile applications. This analysis sheds some light on privacy issues mobile app users are likely to inquire about, as well as their ability to articulate questions in this domain. Our findings in turn should help inform the design of future privacy question answering systems.

Figure 1: Examples of privacy-related questions users ask for Fiverr. Policy evidence represents sentences in the privacy policy that are relevant for determining the answer to the user's question.

                        Introduction

Privacy policies are the legal documents which disclose the ways in which a company gathers, uses, shares and manages user data. They are now nearly ubiquitous on websites and mobile applications. Privacy policies work under the "notice and choice" regime, where users read privacy policies and can then choose whether or not to accept the terms of the policy, occasionally subject to some opt-in or opt-out provisions.

However, due to the length and verbosity of these documents (Cate, 2010; Cranor, 2012; Schaub et al., 2015; Gluck et al., 2016), the average user does not read the privacy policies they consent to (Jain, Gyanchandani, and Khare, 2016; Commission and others, 2012). McDonald and Cranor (2008) find that if users spent time reading the privacy policies for all the websites they interact with, it would account for a significant portion of the time they currently spend on the web. This disconnect between the requirements of real Internet users and their theoretical behavior under the notice and choice paradigm renders this model largely ineffective (Reidenberg et al., 2015b). This is an opportunity for language technologies to help better serve the needs of users, by processing privacy policies automatically and allowing users to engage with them through interactive dialog. The legal domain has long served as a useful application domain for Natural Language Processing techniques (Mahler, 2015); however, the sheer pervasiveness of websites and mobile applications in today's world necessitates the creation of automatic techniques to help users better understand the content of privacy policies.

In this work, we explore the idea of an automated "privacy assistant", which allows users to explore the content of a privacy policy by answering their questions. This kind of question-answering approach would allow for a more personalized approach to privacy, enabling users to review the sections of policies that they are most interested in. The successful development of effective question-answering functionality for privacy requires a careful understanding of the types of questions users are likely to ask, how users are likely to formulate these questions, as well as estimating the difficulty of answering these questions. In this work, it is our goal to explore these issues by providing a preliminary qualitative analysis of privacy-related questions posed by crowdworkers.
                        Related Work

Policy Analysis

There has been considerable interest in making the content of privacy policies easy to understand. These include approaches that prescribe guidelines for drafting privacy policies (Kelley et al., 2009; Micheti, Burkell, and Steeves, 2010) or require service providers to encode privacy policies in a machine-readable format (Cranor, 2003). These methods have not seen widespread adoption from industry and were abandoned. More recently, the community has been looking at automatically understanding the content of privacy policies (Sadeh et al., 2013; Liu et al., 2016; Oltramari et al., 2017; Mysore Sathyendra et al., 2017; Wilson et al., 2017). Perhaps most closely related to our contribution is the work of Harkous et al. (2018), which investigates answering questions from privacy policies by looking at privacy-related questions users ask companies on Twitter and annotating "segments" in the privacy policy as being relevant answers. Our study differs from their approach in several ways. First, our study is an order of magnitude larger. This is in part due to the scalability of our crowdsourcing methodology (see the Crowdsourced Study section), at the expense of having 'natural' questions. However, as we show later in this work, finding such questions in the wild can also be challenging. Secondly, we account for the fact that an answer to a question might not always be in the privacy policy, and if it is, it is possible there are multiple correct answers. This more accurately reflects a real-world scenario where users can ask any question of a privacy assistant. Third, our answers are provided by domain experts with legal training. Moreover, the annotations are provided at a sentence-level granularity. This is a considerable advantage over segment-level annotations for two reasons: first, the concept of what constitutes a segment is poorly defined and has different meanings to different audiences, whereas the notion of a sentence is much more objective. Second, a finer level of granularity allows us to eliminate redundant information within segments, and presenting irrelevant information to a user detracts from how helpful an answer is. A system can always default to presenting segment-level information if required, by selecting all the sentences within the segment. Sathyendra et al. (2017) present some initial approaches to question answering for privacy policies. They outline several avenues for future work, including the need to elicit more representative datasets, determine if questions are unanswerable, and decrease reliance on segments. Our work takes a first step in this direction through a crowdsourced study that elicits a wide range of questions as well as legally-sound answers at the sentence level of granularity.

Reading Comprehension

Several large-scale reading comprehension/answer selection datasets exist for Wikipedia passages (Rajpurkar et al., 2016; Rajpurkar, Jia, and Liang, 2018; Joshi et al., 2017; Choi et al., 2018) and news articles (Trischler et al., 2016; Hermann et al., 2015; Onishi et al., 2016). Our work considers question-answering within the specialized privacy domain, where documents are typically long and complex, and their accurate interpretation requires legal expertise. Thus, our work can also be considered to be related to similar efforts in the legal domain (e.g., Monroy, Calvo, and Gelbukh (2009); Quaresma and Rodrigues (2005)). These approaches are based on information retrieval for legal documents and have primarily been applied to juridical documents. Do et al. (2017) describe retrieving relevant Japanese Civil Code documents for question answering. Kim, Xu, and Goebel (2015) investigate answering true/false questions from Japanese bar exams. Liu, Chen, and Ho (2015) explore finding relevant Taiwanese legal statutes for a natural language query. A number of authors have also described domain-specific knowledge engineering approaches combining ontologies and knowledge bases to answer questions (e.g., Mollá and Vicedo (2007); Frank et al. (2007)). Feng et al. (2015); Tan et al. (2016) look at non-factoid question answering in the insurance domain. Each of these specialized domains presents its own unique challenges, and progress in them requires a careful understanding of the domain as well as best practices in presenting information to the end user.

                        Crowdsourced Study

We would like to gain a better understanding of the kinds of questions users are likely to ask, and what legally-sound answers to them would be. For this purpose, we collect our data in two stages: first, we crowdsource questions on the contents of privacy policies from crowdworkers, and then we rely on domain experts with legal training to provide answers to the questions. We would like to note that our methodology only exposes crowdworkers to public information about each of the companies, rather than requiring them to read the privacy policy to formulate questions. This includes the name of the mobile application, the description of the mobile application as presented on the Google Playstore, as well as screenshots from the mobile application. This approach attempts to circumvent potential bias from lexical entrainment, and more generally the risk of biasing crowdworkers to ask questions only about the practices disclosed in the privacy policy.

In this study we intentionally select mobile applications from a number of different categories, specifically focusing on apps from categories that occupy ≥ 2% of mobile applications on the Google Playstore (Story, Zimmeck, and Sadeh, 2018) [1, 2, 3].
 Statistic                   Train      Test        All
 # Questions                 1000       350         1350
 # Passages                  20         7           27
 # Sentences                 2879       909         3788
 Avg Question Length         8.44       8.56        8.47
 Avg Passage Length          3372.1     2990.29     3273.11
 Avg Answer Length           93.94      111.9       104.52

Table 1: Statistics of PrivacyQA Dataset, where # denotes
number of questions, passages and sentences, and average
length of questions, passages and answers in words, for
training and test partitions.


We would like to collect a representative set of questions such that we range from mobile applications which are well-known and likely to have carefully constructed privacy policies, all the way to applications which may have smaller install bases and less sophisticated privacy policies. We sample applications from each category using the Google Playstore recommendation engine, such that only half of the applications in our corpus have more than 5 million installs [4]. We collect data for 27 privacy policies across 10 categories of mobile applications [5].

[1] As of April 1, 2018.
[2] Games are by far the largest category of apps on the Google Playstore. We collapse the different game subcategories into one category for our purposes.
[3] We choose to focus on the privacy policies of mobile applications given the ubiquitousness of smartphones. However, our study design is limited to Android mobile applications. In practice, however, these mobile applications often share privacy policies across platforms.
[4] We choose 5 million installs as a threshold on popularity of the mobile application, but this choice is debatable. Mobile applications with fewer than 5 million installs could also represent applications of large corporations and vice versa.
[5] The Playstore categories we sample applications from include: Books and Reference, Business, Education, Entertainment, Lifestyle, Health and Fitness, News and Magazines, Tools, Travel and Local, and Games.

Figure 2: User interface for question elicitation.

Crowdsourced Question Elicitation

An important objective of this study is to elicit and understand the types of questions users are likely to have when looking to install a mobile application. As discussed earlier, we present information similar to the information found when looking at the application in the Google Playstore (Figure 2). We use Amazon Mechanical Turk to elicit questions about these privacy policies. Crowdworkers were asked to imagine they had installed a mobile application and could talk to a trusted privacy assistant, whom they could ask any privacy-related question pertaining to the app. They were paid $12 per hour to ask five questions for a given policy. We solicited questions from Turkers who had been conferred "master" status and whose location was within the United States; our task received favorable reviews on TurkerHub. For each mobile application, crowdworkers were also asked to rate their understanding of what the app does on a Likert scale of 1-5 (ranging from not being familiar to understanding it extremely well), as well as to indicate whether they had installed or used the app before. We also collected demographic information regarding the age of the crowdworkers.

Answer Selection

We are not just interested in collecting data on what questions users ask, but also a corpus of what good answers to these questions would be. For this purpose, given the questions for a particular application, we recruit four experts with legal training to formulate answers to these questions based on the text of that application's privacy policy. The experts annotate questions for their relevance and subjectivity, and also identify the relevant OPP-115 (Wilson et al., 2016) category(ies) corresponding to each question, if any. We then formulate the problem of answering the question as a sentence selection task, and ask our annotators to find supporting evidence in the document which can help in answering the question. In this way, every question is shown to at least one annotator, and 350 questions are annotated by multiple annotators [6].

[6] These form our held-out test set.
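To make this annotation setup concrete, the record below sketches one way a single annotated example could be represented. All field names and values are illustrative assumptions for exposition, not the released data format.

# Illustrative sketch (not the released format): one annotated example, where an
# expert's answer is the set of policy sentences they marked as evidence.
# An empty evidence list means that expert judged the question unanswerable
# from the policy text alone.
example = {
    "app": "Fiverr",                                   # hypothetical app from the corpus
    "question": "does this app track my location?",
    "policy_sentences": [
        "We collect information about your use of the Site.",
        "We may collect your precise location with your consent.",
        # ... remaining sentences of the privacy policy, one string per sentence
    ],
    "expert_annotations": [
        {"relevant": True, "subjective": False,
         "opp115_categories": ["First Party Collection/Use"],
         "evidence_sentence_ids": [1]},                # indices into policy_sentences
        {"relevant": True, "subjective": False,
         "opp115_categories": ["First Party Collection/Use"],
         "evidence_sentence_ids": []},                 # this expert abstained
    ],
}

Under this representation, a sentence-selection model scores each entry of policy_sentences against the question and returns a (possibly empty) set of sentence indices as its answer.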
                        Analysis

Table 1 describes the results of our data collection effort. We receive 1350 questions addressed to our imaginary privacy assistant, about the privacy practices of 27 mobile applications. The questions are on average 8.4 words long, and the privacy policies are typically very long pieces of text, roughly 3,000 words long. The answers to the questions typically have roughly 100 words of evidence in the privacy policy document.

What types of questions do users ask the privacy assistant?

We would like to explore the kinds of questions that users ask our conversational assistant. We analyze questions based on their question words, as well as by having our expert annotators indicate whether they believe the questions are related to privacy, whether they are subjective in nature, and what categories they belong to in the OPP-115 ontology (Wilson et al., 2016). The results of this analysis are as follows [7]:

[7] All analyses in this section are presented on the 'All' data split unless mentioned otherwise.

                Question Word      Percentage
                is/does            27.9%
                what               13.5%
                will               11.9%
                how                10.1%
                can                8.6%
                are                4.5%
                who                4.4%
                where              1.3%
                if                 1.8%

Table 2: Analysis of questions by question words, for categories that account for >1% of questions.

Question Words  We qualitatively analyze questions by their expected types, based on the first word of the question. Note that while the question word can give us some information about the information-seeking intent of the user, the different question words can often be used interchangeably. For example, the question 'will it collect my location?' can also be phrased as 'does it collect my location?'. Keeping these limitations in mind, we perform a qualitative analysis of the elicited questions to identify common user intents. The distributions of questions across types can be found in Table 2. By far, the largest proportion of questions can be grouped into the 'is/does' category where, similar to the 'are' category, users are often questioning the assistant about a particular privacy attribute (for example, 'does this app track my location?' or 'is this app tracking my location?'). The next largest category includes 'what' questions, which cover a broad spectrum of questions (for example, 'what sort of analytics are integrated in the app?' or 'what do you do with my information?'). The 'will' and 'can' questions are usually asking about a potential privacy harm (for example, 'will i be explicitly told when my info is being shared with a third party?' or 'will any academic institutions or employers be able to access my performance/score information?' or 'can the app see what i type and what i search for?'). 'How' questions generally either ask about specific company processes, or abstract attributes such as security or longevity of data retention (for example, 'how safe is my password' and 'how is my data protected'). Relevant 'where' questions are generally related to data storage (for example, 'where is my data stored?'). Questions that begin with 'who' are usually asking about first party or third party access to data (for example, 'who can see my account information?' or 'who all has access to my medical information?'). Finally, questions in the 'if' category typically establish a premise before asking a question. Such a question needs to be answered based on both the contents of the policy as well as assuming the information in the premise is true (for example, 'if i link it to my Facebook will it have access to view my private information?' or 'if i choose to opt out of the app gathering my personal data, can i still use the app?').
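As a concrete illustration of this first-word analysis, the short sketch below shows how questions could be bucketed in the style of Table 2. The merging of 'is' and 'does' into a single 'is/does' bucket is an assumption based on the table rather than released analysis code.

from collections import Counter

def question_word(question: str) -> str:
    """Return the Table-2 style bucket for a question, based on its first token."""
    first = question.strip().lower().split()[0]
    return "is/does" if first in ("is", "does") else first

# Toy usage with a few questions of the kind crowdworkers asked.
questions = [
    "does this app track my location?",
    "is my data safe?",
    "what information are they collecting?",
    "will my data be sold to advertisers?",
]
counts = Counter(question_word(q) for q in questions)
for word, count in counts.most_common():
    print(f"{word}\t{100 * count / len(questions):.1f}%")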
                Property            Privacy-Related      Not Privacy-Related
                Subjective          4.86%                1.43%
                Not Subjective      74%                  19.71%

Table 3: Relevance and subjectivity judgments for 350 questions posed by crowdworkers.

Relevance and Subjectivity  We analyze how many of the questions asked to our privacy assistant are 'relevant', i.e., are related to privacy, and how many are subjective in nature. In the real world it isn't necessary that users will only ask our privacy assistant questions related to privacy. Thus, it is important for us to be able to identify which questions we are capable of attempting to answer. We analyze the test set, where each example features multiple annotations from our expert annotators. We consider the majority vote to be the judgement of whether a question is relevant or subjective. We find that 78.85% of questions received by our privacy assistant are relevant, with 6.28% being subjective. Table 3 gives us more insight into this phenomenon. We observe that the majority of questions (74%) are relevant but not subjective (for example, 'what information are they collecting?'). 4.86% of questions are both relevant and subjective (for example, 'is my data safe?'), 1.4% are subjective but not relevant (for example, 'are there any in game purchases in the wordscapes app that i should be concerned about?'), and finally 19.71% are neither relevant nor subjective ('does the app require an account to play?').

Question Ontology Categories  Next we ask our annotators to indicate the OPP-115 data practice category (Wilson et al., 2016) that best describes the question. Broadly, the ontology describes 10 data practice categories. The interested reader is invited to refer to (Wilson et al., 2016) for a detailed description of these data practices. Annotators are allowed to annotate a question as belonging to multiple categories. For example, the question 'What information of mine is collected by this app and who is it shared with?' might belong to both the 'First Party Collection and Use' and the 'Third Party Sharing and Collection' OPP-115 data practice categories. We consider a category to be correct if at least 2 annotators identify it to be relevant. In cases where none of the categories are identified as relevant, we default to 'Other' if it is identified as a relevant category by at least one annotator. If not, we mark the category as 'No Agreement' (this aggregation rule is sketched in code below). The results from this analysis are presented in Table 4. We observe that questions about first party and third party practices account for nearly 58.7% of all the questions asked of our assistant.

Comparative Analysis  We analyze 100 samples drawn from the Twitter privacy dataset (Harkous et al., 2018), annotating them for OPP-115 category, relevance, and whether or not they are a question. We find that in the Twitter dataset, 23% of the samples are complaints rather than questions. By OPP-115 category classification, 26% are First Party, 37% are Third Party, 14% are Data Security, 5% are User Access, 3% are User Choice and 9% could be grouped in the 'Other' category. Only 6% of the questions collected are not privacy related.
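The category-aggregation rule described above under Question Ontology Categories can be written compactly. The sketch below is a minimal illustration under the assumption that each annotator supplies a set of category labels per question; the function name and types are ours.

from collections import Counter
from typing import List, Set

def aggregate_categories(annotations: List[Set[str]]) -> Set[str]:
    """Aggregate per-annotator OPP-115 category sets for one question.

    A category is kept if at least two annotators selected it. If no category
    reaches that threshold, fall back to 'Other' when at least one annotator
    chose it; otherwise record 'No Agreement'.
    """
    votes = Counter(cat for annotation in annotations for cat in annotation)
    agreed = {cat for cat, count in votes.items() if count >= 2}
    if agreed:
        return agreed
    if votes.get("Other", 0) >= 1:
        return {"Other"}
    return {"No Agreement"}

# Example: three experts label the same question.
print(aggregate_categories([
    {"First Party Collection/Use", "Third Party Sharing/Collection"},
    {"First Party Collection/Use"},
    {"Data Security"},
]))  # -> {'First Party Collection/Use'}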
 Privacy Practice                              Percentage      Example
 First Party Collection/Use                    36.4%           what data does this game collect?
 Third Party Sharing/Collection                22.3%           will my data be sold to advertisers?
 Data Security                                 10.9%           how is my info protected from hackers?
 Data Retention                                4.2%            how long do you save my information?
 User Access, Edit and Deletion                2.6%            can i delete my information permanently?
 User Choice/Control                           7.2%            is there a way to opt out of data sharing
 Other                                         9.4%            does the app connect to the internet at any point during its use?
 International and Specific Audiences          0.6%            what are your GDPR policies?
 No Agreement                                  6.6%            how are features personalized?

                        Table 4: OPP-115 categories most relevant to the questions collected from users.


                        Experiments

We would like to characterize and study the difficulty of the question-answering task for humans. We formulate the problem of identifying relevant evidence in the document to answer the question as a sentence-selection task, where it is possible to choose not to answer a question by not identifying any relevant sentences. We evaluate using sentence-level F1 rather than IR-style metrics so as to allow models to abstain from answering [8]. Similar to Choi et al. (2018); Rajpurkar et al. (2016), we compute the maximum F1 amongst all the reference answers. As abstaining from giving an answer is always legally sound but seldom helpful, we do not consider a question to be unanswerable if only a minority of experts abstain from giving an answer. Similar to Choi et al. (2018), given n reference answers, we report the average maximum F1 obtained when each held-out reference is compared against the remaining n − 1 references.

[8] Similar to (Rajpurkar, Jia, and Liang, 2018) and (Yang, Yih, and Meek, 2015), for negative examples models are awarded 1 F1 if they abstain from answering and 0 F1 for any answer at all.
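The following sketch illustrates this evaluation protocol as we describe it above: sentence-level F1 over selected sentence indices, the abstention convention from the footnote, the maximum over reference answers, and the leave-one-out estimate of human performance. The function names are ours and this is an illustration of the protocol as stated, not the official evaluation script.

from typing import List, Set

def sentence_f1(predicted: Set[int], reference: Set[int]) -> float:
    """F1 between predicted and reference evidence-sentence indices.

    Abstention convention: if either side is empty, the score is 1.0 only
    when both abstain (an unanswerable reference met by an abstaining model),
    and 0.0 otherwise.
    """
    if not reference or not predicted:
        return float(predicted == reference)
    overlap = len(predicted & reference)
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def max_f1(predicted: Set[int], references: List[Set[int]]) -> float:
    """Score a prediction against the best-matching reference answer."""
    return max(sentence_f1(predicted, ref) for ref in references)

def human_f1(references: List[Set[int]]) -> float:
    """Leave-one-out human performance: each reference (at least two needed)
    is scored against the remaining n-1 references, then averaged."""
    scores = []
    for i, held_out in enumerate(references):
        rest = references[:i] + references[i + 1:]
        scores.append(max_f1(held_out, rest))
    return sum(scores) / len(scores)

# Toy example with three expert annotations for one question.
refs = [{2, 5}, {2}, set()]        # the third expert abstained
print(max_f1({2}, refs))           # a model selecting only sentence 2 scores 1.0
print(round(human_f1(refs), 2))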
                Model                Precision     Recall    F1
                No Answer (NA)       36.2          36.2      36.2
                Human                70.3          71.1      70.7

Table 5: Human performance and performance of a No-Answer baseline. Human performance demonstrates considerable agreement on the right answer for the privacy domain, where experts often disagree (Reidenberg et al., 2015a).

As discussed previously, since most questions are difficult to answer in a legally-sound way based on the contents of the privacy policy alone, abstaining from answering is often going to be a safe action. We would like to emphasize that this is not a criticism of the annotators or the people asking the questions, but rather a characteristic of this domain, where privacy policies are often silent or ambiguous on issues users are likely to inquire about. To quantify the magnitude of this effect, we demonstrate that a model which always abstains from answering the question can achieve reasonable performance (Table 5), yet still leaves a large gap for improvement. We would further like to understand what makes the majority of our annotators decide a question should not be answered. We randomly sample 100 questions that were deemed unanswerable, and annotate them post-hoc with reasons informed by expert annotations. We find that for 56% of unanswerable questions, the answer to the question would typically not be present in most privacy policies. These include questions such as 'how does the currency within the game work?', which suggests that users would benefit from being informed about the scope of typical privacy policies. However, they also include questions such as 'has Viber had data breaches in the past?', which ideally a privacy assistant would be able to answer, but whose answer is not present within a typical privacy policy. In the future, a privacy assistant could draw upon various sources of information such as metadata from the Google Playstore, background legal knowledge, news articles, social media etc. in order to broaden its coverage across questions. For an additional 24% of unanswerable questions, the answers were expected to be found in the privacy policy, but the privacy policy was silent on a possible answer (such as 'is my app data encrypted?'). Generally, when a policy is silent it is not safe to make any assumptions. 6% of questions asked by a user are too vague to understand correctly, such as 'who can contact me through the app?'; such questions would benefit from the assistant engaging in a clarification dialogue. Another 4% are ambiguously phrased, such as 'any difficulties to occupy the privacy assistant?'. These kinds of questions are very hard to interpret correctly. 3% of unanswerable questions are too specific in nature, and it is unlikely the creators of the privacy policy would anticipate that particular question ('does it have access to financial apps i use?'). Finally, 7% of unanswerable questions are too subjective, and our annotators tend to abstain from answering them (for example, 'how do i know this app is legit?').

We would also like to be able to characterize the disagreement on this task. It is important to note here that all of our annotators are experts with legal training rather than crowdworkers, and their provided answers can generally be assumed to be valid legal opinions about the question. We tease apart the difference between where they abstained from answering and where they disagreed by comparing against the No Answer (henceforth NA) baseline (Table 5). In Table 5 we observe that the human F1 is 70.7%, demonstrating considerable agreement on the right answer.
           Question Word        NA Model       Human
           is/does              37.22          73.19
           what                 39.77          73.35
           will                 13.04          66.56
           how                  27.84          80.16
           can                  27.17          63.04
           are                  35.85          68.68
           who                  17.02          58.44
           where                54.55          54.55
           if                   0              62.19

Table 6: Classifier performance in F1 stratified by first word in the question.

           Privacy Practice                        NA Model       Human
           First Party Collection/Use              24.6           67.1
           Third Party Sharing/Collection          6.9            60.6
           Data Security                           35.3           87.2
           Data Retention                          0              79.8
           User Access, Edit and Deletion          0              53.1
           User Choice/Control                     46.3           64.7
           Other                                   89.1           84.1
           International & Specific Audiences      0              100
           No Agreement                            76.2           78.3

Table 7: Classifier performance in F1 stratified by OPP-115 category of the question.


We would still like to investigate whether any disagreements are valid, or if they are due to poor definitions or lack of adequate specification in the annotation instructions. We randomly sample 50 examples and annotate them for likely reasons for disagreement [9]. We find that the experts agree on 64% of instances and disagree on 36%. We further determine that 92.8% of disagreements were legitimate, valid differences in interpretation. For 43.75% the question was interpreted differently, in 25% the contents of the privacy policy were interpreted differently, and the remaining were due to other sources of error (for example, in the question 'who is allowed to use the app', most annotators abstain from answering, but one annotator points out that the policy states that children under the age of 13 are not allowed to use the app).

[9] We do not use F1 to measure disagreement, and instead manually filter samples so we can capture both when the legal experts interpreted the question differently, as well as when they interpreted the contents of the privacy policy differently.

We next analyze disagreements based on the type of question that was asked (Table 6). As observed, the wh-type of the question may give us some information about the intent of the question. We observe that our expert annotators rarely abstain from answering when a user asks a 'will' question about a potential privacy harm, taking care to identify relevant sections of the privacy policy. Similarly, 'if' type questions generally are quite specific and require careful reasoning. On the other hand, 'where' questions are generally about data storage. They are vague: for example, 'where is my data stored?' is probably not asking for the exact location of the company's datacenters, but it is unclear what granularity is meant in the question (e.g., a particular country, versus knowing whether the data is stored on a mobile phone or in the cloud).

We also analyze disagreements based on the OPP-115 category of the question (Table 7). As expected, questions where annotators disagree on the category of the question have more disagreements than simply abstaining to answer. Similarly, for user choice, the policy typically does not answer questions like 'how do I limit its access to data' fully, so the annotators tend to abstain from answering. In contrast, questions about first party and third party practices are usually anticipated and often have answers in the privacy policy.

                        Conclusion

What kinds of questions should an automated privacy assistant expect to receive? We explore this question by designing a study that elicits questions from crowdworkers who are asked to think about the data practices of mobile apps they might consider downloading on their smartphones. We qualitatively analyze the types of questions asked by users, and identify a number of challenges associated with generating answers to these questions. While in principle privacy policies should be written to answer the questions users are likely to have, in practice our study shows that questions asked by users often go beyond what is disclosed in the text of privacy policies. Challenges arise in automated question answering, both because policies are often silent or ambiguous on issues that users are likely to inquire about, and also because users are not very good at articulating their privacy questions, and occasionally even ask questions that have nothing to do with privacy. Determining a user's intent may be a process of discovery for both the user and the assistant, and thus in the future it would be helpful if the assistant were capable of engaging in clarification dialogue. Such a privacy assistant would have to reconcile the need to provide answers that are legally accurate with the need to be helpful to the user. It would have to be capable of disambiguating questions by engaging in dialogues with users; it would have to be able to supplement information found (or lacking) in the privacy policy with additional sources of information such as background legal knowledge. Ideally, it would also be able to interpret ambiguity in the policy, and likewise be able to interpret silence about different issues. We hope that the identification of these requirements will help inform the design of effective automatic privacy assistants.

                        Acknowledgements

This work has been supported by the National Science Foundation under Grant No. CNS 13-30596 and No. CNS 13-30214. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF or the US Government. The authors would like to thank Lorrie Cranor, Florian Schaub and Shomir Wilson for insightful feedback and discussion related to this work.
                        References

Cate, F. H. 2010. The limits of notice and choice. IEEE Security & Privacy 8(2):59–62.
Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.-t.; Choi, Y.; Liang, P.; and Zettlemoyer, L. 2018. QuAC: Question answering in context. arXiv preprint arXiv:1808.07036.
Commission, U. F. T., et al. 2012. Protecting consumer privacy in an era of rapid change: Recommendations for businesses and policymakers. FTC Report.
Cranor, L. F. 2003. P3P: Making privacy policies more useful. IEEE Security & Privacy 99(6):50–55.
Cranor, L. F. 2012. Necessary but not sufficient: Standardized mechanisms for privacy notice and choice. J. on Telecomm. & High Tech. L. 10:273.
Do, P.-K.; Nguyen, H.-T.; Tran, C.-X.; Nguyen, M.-T.; and Nguyen, M.-L. 2017. Legal question answering using ranking SVM and deep convolutional neural network. arXiv preprint arXiv:1703.05320.
Feng, M.; Xiang, B.; Glass, M. R.; Wang, L.; and Zhou, B. 2015. Applying deep learning to answer selection: A study and an open task. arXiv preprint arXiv:1508.01585.
Frank, A.; Krieger, H.-U.; Xu, F.; Uszkoreit, H.; Crysmann, B.; Jörg, B.; and Schäfer, U. 2007. Question answering from structured knowledge sources. Journal of Applied Logic 5(1):20–48.
Gluck, J.; Schaub, F.; Friedman, A.; Habib, H.; Sadeh, N.; Cranor, L. F.; and Agarwal, Y. 2016. How short is too short? Implications of length and framing on the effectiveness of privacy notices. In 12th Symposium on Usable Privacy and Security (SOUPS), 321–340.
Harkous, H.; Fawaz, K.; Lebret, R.; Schaub, F.; Shin, K. G.; and Aberer, K. 2018. Polisis: Automated analysis and presentation of privacy policies using deep learning. arXiv preprint arXiv:1802.02561.
Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693–1701.
Jain, P.; Gyanchandani, M.; and Khare, N. 2016. Big data privacy: A technological perspective and review. Journal of Big Data 3(1):25.
Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
Kelley, P. G.; Bresee, J.; Cranor, L. F.; and Reeder, R. W. 2009. A nutrition label for privacy. In Proceedings of the 5th Symposium on Usable Privacy and Security, 4. ACM.
Kim, M.-Y.; Xu, Y.; and Goebel, R. 2015. Applying a convolutional neural network to legal question answering. In JSAI International Symposium on Artificial Intelligence, 282–294. Springer.
Liu, F.; Wilson, S.; Schaub, F.; and Sadeh, N. 2016. Analyzing vocabulary intersections of expert annotations and topic models for data practices in privacy policies. In 2016 AAAI Fall Symposium Series.
Liu, Y.-H.; Chen, Y.-L.; and Ho, W.-L. 2015. Predicting associated statutes for legal problems. Information Processing & Management 51(1):194–211.
Mahler, L. 2015. What is NLP and why should lawyers care. Retrieved March 12:2018.
McDonald, A. M., and Cranor, L. F. 2008. The cost of reading privacy policies. ISJLP 4:543.
Micheti, A.; Burkell, J.; and Steeves, V. 2010. Fixing broken doors: Strategies for drafting privacy policies young people can understand. Bulletin of Science, Technology & Society 30(2):130–143.
Mollá, D., and Vicedo, J. L. 2007. Question answering in restricted domains: An overview. Computational Linguistics 33(1):41–61.
Monroy, A.; Calvo, H.; and Gelbukh, A. 2009. NLP for shallow question answering of legal documents using graphs. Computational Linguistics and Intelligent Text Processing 498–508.
Mysore Sathyendra, K.; Wilson, S.; Schaub, F.; Zimmeck, S.; and Sadeh, N. 2017. Identifying the provision of choices in privacy policy text. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2774–2779. Copenhagen, Denmark: Association for Computational Linguistics.
Oltramari, A.; Piraviperumal, D.; Schaub, F.; Wilson, S.; Cherivirala, S.; Norton, T. B.; Russell, N. C.; Story, P.; Reidenberg, J.; and Sadeh, N. 2017. PrivOnto: A semantic framework for the analysis of privacy policies. Semantic Web (Preprint):1–19.
Onishi, T.; Wang, H.; Bansal, M.; Gimpel, K.; and McAllester, D. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2230–2235. Austin, Texas: Association for Computational Linguistics.
Quaresma, P., and Rodrigues, I. P. 2005. A question answer system for legal information retrieval. In JURIX, 91–100.
Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
Reidenberg, J. R.; Breaux, T.; Cranor, L. F.; French, B.; Grannis, A.; Graves, J. T.; Liu, F.; McDonald, A.; Norton, T. B.; and Ramanath, R. 2015a. Disagreeable privacy policies: Mismatches between meaning and users' understanding. Berkeley Tech. LJ 30:39.
Reidenberg, J. R.; Russell, N. C.; Callen, A. J.; Qasir, S.; and Norton, T. B. 2015b. Privacy harms and the effectiveness of the notice and choice framework. ISJLP 11:485.
Sadeh, N.; Acquisti, A.; Breaux, T. D.; Cranor, L. F.; Mc-
  Donald, A. M.; Reidenberg, J. R.; Smith, N. A.; Liu,
  F.; Russell, N. C.; Schaub, F.; et al. 2013. The usable
  privacy policy project: Combining crowdsourcing, ma-
  chine learning and natural language processing to semi-
  automatically answer those privacy questions users care
  about. Technical Report CMU-ISR-13-119, Carnegie Mellon
  University.
Sathyendra, K. M.; Ravichander, A.; Story, P. G.; Black,
  A. W.; and Sadeh, N. 2017. Helping users understand
  privacy notices with automated query answering function-
  ality: An exploratory study. Technical Report.
Schaub, F.; Balebako, R.; Durity, A. L.; and Cranor, L. F.
  2015. A design space for effective privacy notices. In
  Eleventh Symposium On Usable Privacy and Security
  (SOUPS 2015), 1–17.
Story, P.; Zimmeck, S.; and Sadeh, N. 2018. Which apps
  have privacy policies?
Tan, M.; dos Santos, C.; Xiang, B.; and Zhou, B. 2016.
  Improved representation learning for question answer
  matching. In Proceedings of the 54th Annual Meeting
  of the Association for Computational Linguistics (Volume
  1: Long Papers), 464–473. Berlin, Germany: Association
  for Computational Linguistics.
Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni,
  A.; Bachman, P.; and Suleman, K. 2016. NewsQA:
  A machine comprehension dataset. arXiv preprint
  arXiv:1611.09830.
Wilson, S.; Schaub, F.; Dara, A. A.; Liu, F.; Cherivirala, S.;
 Leon, P. G.; Andersen, M. S.; Zimmeck, S.; Sathyendra,
 K. M.; Russell, N. C.; et al. 2016. The creation and anal-
 ysis of a website privacy policy corpus. In Proceedings of
 the 54th Annual Meeting of the Association for Compu-
 tational Linguistics (Volume 1: Long Papers), volume 1,
 1330–1340.
Wilson, S.; Schaub, F.; Liu, F.; Sathyendra, K.; Zimmeck,
 S.; Ramanath, R.; Liu, F.; Sadeh, N.; and Smith, N. 2017.
 Analyzing privacy policies at scale: From crowdsourcing
 to automated annotations. ACM Transactions on the Web.
Yang, Y.; Yih, W.-t.; and Meek, C. 2015. WikiQA: A chal-
  lenge dataset for open-domain question answering. In
  Proceedings of the 2015 Conference on Empirical Meth-
  ods in Natural Language Processing, 2013–2018.