=Paper=
{{Paper
|id=Vol-2335/paper9
|storemode=property
|title=Challenges in Automated Question Answering for Privacy Policies
|pdfUrl=https://ceur-ws.org/Vol-2335/1st_PAL_paper_13.pdf
|volume=Vol-2335
|authors=Abhilasha Ravichander,Alan Black,Eduard Hovy,Joel Reidenberg,N. Cameron Russell,Norman Sadeh
}}
==Challenges in Automated Question Answering for Privacy Policies==
Abhilasha Ravichander§, Alan Black§, Eduard Hovy§, Joel Reidenberg†, N. Cameron Russell†, Norman Sadeh§

§ Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, USA ({aravicha, awb, ehovy, sadeh}@cs.cmu.edu)

† Fordham University, Law School, New York, NY, USA (jreidenberg@law.fordham.edu)

===Abstract===
Privacy policies are legal documents used to inform users about the collection and handling of their data by the services or technologies with which they interact. Research has shown that few users take the time to read these policies, as they are often long and difficult to understand. In addition, users often only care about a small subset of the issues discussed in privacy policies, and some of the issues they actually care about may not even be addressed in the text of the policies. Rather than requiring users to read the policies, a better approach might be to allow them to simply ask questions about those issues they care about, possibly through iterative dialog. In this work, we take a step towards this goal by exploring the idea of an automated privacy question-answering assistant, and look at the kinds of questions users are likely to pose to such a system. This analysis is informed by an initial study that elicits privacy questions from crowdworkers about the data practices of mobile apps. We analyze 1350 questions posed by crowdworkers about the privacy practices of a diverse cross section of mobile applications. This analysis sheds some light on the privacy issues mobile app users are likely to inquire about, as well as their ability to articulate questions in this domain. Our findings in turn should help inform the design of future privacy question answering systems.

''Figure 1: Examples of privacy-related questions users ask for Fiverr. Policy evidence represents sentences in the privacy policy that are relevant for determining the answer to the user's question.''

===Introduction===
Privacy policies are the legal documents which disclose the ways in which a company gathers, uses, shares and manages user data. They are now nearly ubiquitous on websites and mobile applications. Privacy policies operate under the "notice and choice" regime, where users read privacy policies and can then choose whether or not to accept the terms of the policy, occasionally subject to some opt-in or opt-out provisions.

However, due to the length and verbosity of these documents (Cate, 2010; Cranor, 2012; Schaub et al., 2015; Gluck et al., 2016), the average user does not read the privacy policies they consent to (Jain, Gyanchandani, and Khare, 2016; Commission and others, 2012).
McDonald and Cranor (2008) find that if users spent time reading the privacy policies for all the websites they interact with, it would account for a significant portion of the time they currently spend on the web. This disconnect between the requirements placed on real Internet users and their theoretical behavior under the notice and choice paradigm renders this model largely ineffective (Reidenberg et al., 2015b). This is an opportunity for language technologies to help better serve the needs of users, by processing privacy policies automatically and allowing users to engage with them through interactive dialog. The legal domain has long served as a useful application domain for Natural Language Processing techniques (Mahler, 2015); however, the sheer pervasiveness of websites and mobile applications in today's world necessitates the creation of automatic techniques to help users better understand the content of privacy policies.

In this work, we explore the idea of an automated "privacy assistant", which allows users to explore the content of a privacy policy by answering their questions. This kind of question-answering approach would allow for a more personalized approach to privacy, enabling users to review the sections of policies that they are most interested in. The successful development of effective question-answering functionality for privacy requires a careful understanding of the types of questions users are likely to ask, of how users are likely to formulate these questions, as well as an estimate of the difficulty of answering these questions. In this work, our goal is to explore these issues by providing a preliminary qualitative analysis of privacy-related questions posed by crowdworkers.

===Related Work===

====Policy Analysis====
There has been considerable interest in making the content of privacy policies easy to understand. This includes approaches that prescribe guidelines for drafting privacy policies (Kelley et al., 2009; Micheti, Burkell, and Steeves, 2010) or require service providers to encode privacy policies in a machine-readable format (Cranor, 2003). These methods have not seen widespread adoption by industry and were abandoned. More recently, the community has been looking at automatically understanding the content of privacy policies (Sadeh et al., 2013; Liu et al., 2016; Oltramari et al., 2017; Mysore Sathyendra et al., 2017; Wilson et al., 2017).

Perhaps most closely related to our contribution is the work of Harkous et al. (2018), which investigates answering questions from privacy policies by looking at privacy-related questions users ask companies on Twitter and annotating "segments" in the privacy policy as relevant answers. Our study differs from their approach in several ways. First, our study is an order of magnitude larger. This is in part due to the scalability of our crowdsourcing methodology (see the Crowdsourced Study section), at the expense of having 'natural' questions; however, as we show later in this work, finding such questions in the wild can also be challenging. Second, we take into account the fact that an answer to a question might not always be in the privacy policy, and if it is, it is possible that there are multiple correct answers. This more accurately reflects a real-world scenario where users can ask any question of a privacy assistant. Third, our answers are provided by domain experts with legal training. Moreover, the annotations are provided at a sentence-level granularity. This is a considerable advantage over segment-level annotations for two reasons: first, the concept of what constitutes a segment is poorly defined and has different meanings to different audiences, whereas the notion of a sentence is much more objective. Second, a finer level of granularity allows us to eliminate redundant information within segments, and presenting irrelevant information to a user detracts from how helpful an answer is. A system can always default to presenting segment-level information if required, by selecting all the sentences within the segment.

Sathyendra et al. (2017) present some initial approaches to question answering for privacy policies. They outline several avenues for future work, including the need to elicit more representative datasets, determine if questions are unanswerable, and decrease reliance on segments. Our work takes a first step in this direction through a crowdsourced study that elicits a wide range of questions as well as legally-sound answers at the sentence level of granularity.

====Reading Comprehension====
Several large-scale reading comprehension and answer selection datasets exist for Wikipedia passages (Rajpurkar et al., 2016; Rajpurkar, Jia, and Liang, 2018; Joshi et al., 2017; Choi et al., 2018) and news articles (Trischler et al., 2016; Hermann et al., 2015; Onishi et al., 2016). Our work considers question answering within the specialized privacy domain, where documents are typically long and complex, and their accurate interpretation requires legal expertise. Thus, our work can also be considered related to similar efforts in the legal domain (e.g., Monroy, Calvo, and Gelbukh (2009); Quaresma and Rodrigues (2005)). These approaches are based on information retrieval for legal documents and have primarily been applied to juridical documents. Do et al. (2017) describe retrieving relevant Japanese Civil Code documents for question answering. Kim, Xu, and Goebel (2015) investigate answering true/false questions from Japanese bar exams. Liu, Chen, and Ho (2015) explore finding relevant Taiwanese legal statutes for a natural language query. A number of authors have also described domain-specific knowledge engineering approaches combining ontologies and knowledge bases to answer questions (e.g., Mollá and Vicedo (2007); Frank et al. (2007)). Feng et al. (2015) and Tan et al. (2016) look at non-factoid question answering in the insurance domain. Each of these specialized domains presents its own unique challenges, and progress in them requires a careful understanding of the domain as well as best practices in presenting information to the end user.
===Crowdsourced Study===
We would like to gain a better understanding of the kinds of questions users are likely to ask, and of what legally-sound answers to them would be. For this purpose, we collect our data in two stages: first, we crowdsource questions on the contents of privacy policies from crowdworkers, and then we rely on domain experts with legal training to provide answers to the questions. We would like to note that our methodology only exposes crowdworkers to public information about each of the companies, rather than requiring them to read the privacy policy to formulate questions. This information includes the name of the mobile application, the description of the mobile application as presented on the Google Playstore, as well as screenshots from the mobile application. This approach attempts to circumvent potential bias from lexical entrainment, and more generally the risk of biasing crowdworkers to ask questions only about the practices disclosed in the privacy policy.

In this study we intentionally select mobile applications from a number of different categories, specifically focusing on apps from categories that, as of April 1, 2018, occupy at least 2% of mobile applications on the Google Playstore (Story, Zimmeck, and Sadeh, 2018). Games are by far the largest category of apps on the Google Playstore; we collapse the different game subcategories into one category for our purposes. We choose to focus on the privacy policies of mobile applications given the ubiquitousness of smartphones. Our study design is limited to Android mobile applications; in practice, however, these mobile applications often share privacy policies across platforms. We would like to collect a representative set of questions, ranging from mobile applications which are well-known and likely to have carefully constructed privacy policies, all the way to applications which may have smaller install bases and less sophisticated privacy policies. We sample applications from each category using the Google Playstore recommendation engine, such that only half of the applications in our corpus have more than 5 million installs. (We choose 5 million installs as a threshold on the popularity of the mobile application, but this choice is debatable: mobile applications with fewer than 5 million installs could also represent applications of large corporations, and vice versa.) We collect data for 27 privacy policies across 10 categories of mobile applications: Books and Reference, Business, Education, Entertainment, Lifestyle, Health and Fitness, News and Magazines, Tools, Travel and Local, and Games.

====Crowdsourced Question Elicitation====
An important objective of this study is to elicit and understand the types of questions users are likely to have when looking to install a mobile application. As discussed earlier, we present information similar to the information found when looking at the application in the Google Playstore (Figure 2). We use Amazon Mechanical Turk to elicit questions about these privacy policies. Crowdworkers were asked to imagine they had installed a mobile application and could talk to a trusted privacy assistant, whom they could ask any privacy-related question pertaining to the app. They were paid $12 per hour to ask five questions for a given policy. We solicited questions from Turkers who had been conferred "master" status and whose location was within the United States; our task received favorable reviews on TurkerHub. For each mobile application, crowdworkers were also asked to rate their understanding of what the app does on a Likert scale of 1-5 (ranging from not being familiar with it to understanding it extremely well), as well as to indicate whether they had installed or used the app before. We also collected demographic information regarding the age of the crowdworkers.

''Figure 2: User interface for question elicitation.''
====Answer Selection====
We are not just interested in collecting data on what questions users ask, but also in a corpus of what good answers to these questions would be. For this purpose, given the questions for a particular application, we recruit four experts with legal training to formulate answers to these questions based on the text of that application's privacy policy. The experts annotate questions for their relevance and subjectivity, and also identify the relevant OPP-115 (Wilson et al., 2016) category or categories corresponding to each question, if any. We then formulate the problem of answering the question as a sentence selection task, and ask our annotators to find supporting evidence in the document which can help in answering the question. In this way, every question is shown to at least one annotator, and 350 questions are annotated by multiple annotators; these 350 questions form our held-out test set.
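To make this annotation scheme concrete, the sketch below shows one plausible way of representing the resulting records. The dataclass and its field names are our own illustrative assumptions, not the authors' released format; answering a question then amounts to predicting a set of sentence indices, with the empty set standing for abstention.

<pre>
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class PrivacyQAExample:
    """One crowdsourced question paired with expert annotations
    against a single privacy policy (illustrative schema)."""
    question: str                 # e.g. "does this app track my location?"
    policy_sentences: List[str]   # the policy, pre-split into sentences
    # Each legal expert marks a (possibly empty) set of sentence indices as
    # evidence; an empty set means the expert found no answer and abstained.
    evidence_by_annotator: List[Set[int]] = field(default_factory=list)
    is_relevant: List[bool] = field(default_factory=list)    # per-annotator judgments
    is_subjective: List[bool] = field(default_factory=list)  # per-annotator judgments
    opp115_categories: List[Set[str]] = field(default_factory=list)  # per-annotator labels
</pre>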
===Analysis===
Table 1 describes the results of our data collection effort. We receive 1350 questions for our imaginary privacy assistant, about the privacy practices of 27 mobile applications. The questions are on average 8.4 words long, while the privacy policies are typically very long pieces of text, roughly 3,000 words long. The answers to the questions typically have roughly 100 words of evidence in the privacy policy document.

{| class="wikitable"
! Statistic !! Train !! Test !! All
|-
| # Questions || 1000 || 350 || 1350
|-
| # Passages || 20 || 7 || 27
|-
| # Sentences || 2879 || 909 || 3788
|-
| Avg. Question Length || 8.44 || 8.56 || 8.47
|-
| Avg. Passage Length || 3372.1 || 2990.29 || 3273.11
|-
| Avg. Answer Length || 93.94 || 111.9 || 104.52
|}
''Table 1: Statistics of the PrivacyQA dataset: the number of questions, passages and sentences, and the average length of questions, passages and answers in words, for the training and test partitions.''

'''What types of questions do users ask the privacy assistant?''' We would like to explore the kinds of questions that users ask our conversational assistant. We analyze questions based on their question words, as well as by having our expert annotators indicate whether they believe the questions are related to privacy, whether they are subjective in nature, and which categories of the OPP-115 ontology (Wilson et al., 2016) they belong to. Unless mentioned otherwise, all analyses in this section are presented on the 'All' data split. The results of this analysis are as follows.

'''Question Words.''' We qualitatively analyze questions by their expected types, based on the first word of the question. Note that while the question word can give us some information about the information-seeking intent of the user, different question words can often be used interchangeably. For example, the question 'will it collect my location?' can also be phrased as 'does it collect my location?'. Keeping these limitations in mind, we perform a qualitative analysis of the elicited questions to identify common user intents. The distribution of questions across types can be found in Table 2.

{| class="wikitable"
! Question Word !! Percentage
|-
| is/does || 27.9%
|-
| what || 13.5%
|-
| will || 11.9%
|-
| how || 10.1%
|-
| can || 8.6%
|-
| are || 4.5%
|-
| who || 4.4%
|-
| if || 1.8%
|-
| where || 1.3%
|}
''Table 2: Analysis of questions by question word, for categories that account for more than 1% of questions.''

By far, the largest proportion of questions falls into the 'is/does' category where, similar to the 'are' category, users are often questioning the assistant about a particular privacy attribute (for example, 'does this app track my location?' or 'is this app tracking my location?'). The next largest category includes 'what' questions, which cover a broad spectrum (for example, 'what sort of analytics are integrated in the app?' or 'what do you do with my information'). The 'will' and 'can' questions usually ask about a potential privacy harm (for example, 'will i be explicitly told when my info is being shared with a third party?', 'will any academic institutions or employers be able to access my performance/score information?' or 'can the app see what i type and what i search for?'). 'How' questions generally ask either about specific company processes or about abstract attributes such as security or longevity of data retention (for example, 'how safe is my password' and 'how is my data protected'). Relevant 'where' questions are generally related to data storage (for example, 'Where is my data stored?'). Questions that begin with 'who' usually ask about first party or third party access to data (for example, 'who can see my account information?' or 'who all has access to my medical information?'). Finally, questions in the 'if' category typically establish a premise before asking a question. Such a question needs to be answered based on the contents of the policy as well as by assuming the information in the premise is true (for example, 'if i link it to my Facebook will it have access to view my private information?' or 'if i choose to opt out of the app gathering my personal data, can i still use the app?').
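As a rough illustration of how a Table 2 style breakdown can be produced, the snippet below buckets questions by their first token, merging 'is' and 'does' into a single bucket; the grouping and thresholding choices are our own assumptions.

<pre>
from collections import Counter

def question_word_distribution(questions, min_percent=1.0):
    """Bucket questions by their first token and report the percentage of
    questions in each bucket, keeping only buckets above min_percent."""
    counts = Counter()
    for q in questions:
        tokens = q.strip().lower().split()
        first = tokens[0] if tokens else ""
        if first in ("is", "does"):
            first = "is/does"   # merged into one category, as in Table 2
        counts[first] += 1
    total = sum(counts.values()) or 1
    dist = {word: 100.0 * count / total for word, count in counts.most_common()}
    return {word: pct for word, pct in dist.items() if pct > min_percent}

# Example:
# question_word_distribution(["Does it collect my location?",
#                             "What data is stored?"], min_percent=0)
# -> {'is/does': 50.0, 'what': 50.0}
</pre>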
'''Relevance and Subjectivity.''' We analyze how many of the questions asked of our privacy assistant are 'relevant', i.e., related to privacy, and how many are subjective in nature. In the real world, users will not necessarily ask our privacy assistant only questions related to privacy. Thus, it is important for us to be able to identify which questions we are capable of attempting to answer. We analyze the test set, where each example features multiple annotations from our expert annotators, and consider the majority vote to be the judgment of whether a question is relevant or subjective. We find that 78.85% of the questions received by our privacy assistant are relevant, with 6.28% being subjective. Table 3 gives us more insight into this phenomenon. We observe that the majority of questions (74%) are relevant but not subjective (for example, 'what information are they collecting?'). 4.86% of questions are both relevant and subjective (for example, 'is my data safe?'), 1.43% are subjective but not relevant (for example, 'are there any in game purchases in the wordscapes app that i should be concerned about?') and, finally, 19.71% are neither relevant nor subjective ('does the app require an account to play?').

{| class="wikitable"
! Property !! Privacy-Related !! Not Privacy-Related
|-
| Subjective || 4.86% || 1.43%
|-
| Not Subjective || 74% || 19.71%
|}
''Table 3: Relevance and subjectivity judgments for 350 questions posed by crowdworkers.''
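A minimal sketch of this majority-vote aggregation, reusing the illustrative record from the earlier sketch; the tie-breaking rule is our assumption, since the paper does not specify one.

<pre>
def majority(flags):
    """Majority vote over boolean annotator judgments.
    Ties resolve to False here; this tie-breaking rule is an assumption."""
    return sum(flags) > len(flags) / 2

def relevance_subjectivity_table(examples):
    """Cross-tabulate majority-voted relevance against subjectivity,
    mirroring the structure of Table 3 (percentages of questions)."""
    cells = {(rel, subj): 0 for rel in (True, False) for subj in (True, False)}
    for ex in examples:
        cells[(majority(ex.is_relevant), majority(ex.is_subjective))] += 1
    total = max(len(examples), 1)
    return {cell: 100.0 * count / total for cell, count in cells.items()}
</pre>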
'''Question Ontology Categories.''' Next, we ask our annotators to indicate the OPP-115 data practice category (Wilson et al., 2016) that best describes the question. Broadly, the ontology describes 10 data practice categories; the interested reader is invited to refer to (Wilson et al., 2016) for a detailed description of these data practices. Annotators are allowed to annotate a question as belonging to multiple categories. For example, the question 'What information of mine is collected by this app and who is it shared with?' might belong to both the 'First Party Collection and Use' and the 'Third Party Sharing and Collection' OPP-115 data practice categories. We consider a category to be correct if at least two annotators identify it as relevant. In cases where no category is identified as relevant by at least two annotators, we default to 'Other' if it is identified as a relevant category by at least one annotator; if not, we mark the category as 'No Agreement'. The results from this analysis are presented in Table 4. We observe that questions about first party and third party practices account for nearly 58.7% of all the questions asked of our assistant.

{| class="wikitable"
! Privacy Practice !! Percentage !! Example
|-
| First Party Collection/Use || 36.4% || what data does this game collect?
|-
| Third Party Sharing/Collection || 22.3% || will my data be sold to advertisers?
|-
| Data Security || 10.9% || how is my info protected from hackers?
|-
| Data Retention || 4.2% || how long do you save my information?
|-
| User Access, Edit and Deletion || 2.6% || can i delete my information permanently?
|-
| User Choice/Control || 7.2% || is there a way to opt out of data sharing
|-
| Other || 9.4% || does the app connect to the internet at any point during its use?
|-
| International and Specific Audiences || 0.6% || what are your GDPR policies?
|-
| No Agreement || 6.6% || how are features personalized?
|}
''Table 4: OPP-115 categories most relevant to the questions collected from users.''
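The category-assignment rule above can be written down directly. The sketch below is our own reading of that rule, assuming each annotator supplies a set of category labels as in the earlier illustrative record.

<pre>
from collections import Counter

def aggregate_opp115(category_sets, min_votes=2):
    """Aggregate per-annotator OPP-115 labels for one question:
    keep categories chosen by at least `min_votes` annotators; otherwise
    fall back to 'Other' if any annotator chose it; otherwise 'No Agreement'."""
    votes = Counter(category for cats in category_sets for category in cats)
    agreed = {category for category, count in votes.items() if count >= min_votes}
    if agreed:
        return agreed
    if votes.get("Other", 0) >= 1:
        return {"Other"}
    return {"No Agreement"}

# Example with three annotators:
# aggregate_opp115([{"First Party Collection/Use"},
#                   {"First Party Collection/Use", "Data Security"},
#                   {"Other"}])
# -> {'First Party Collection/Use'}
</pre>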
'''Comparative Analysis.''' We analyze 100 samples drawn from the Twitter privacy dataset (Harkous et al., 2018), annotating them for OPP-115 category, for relevance, and for whether they are a question or not. We find that in the Twitter dataset, 23% of the samples are complaints rather than questions. By OPP-115 category, 26% are First Party, 37% are Third Party, 14% are Data Security, 5% are User Access, 3% are User Choice, and 9% could be grouped in the 'Other' category. Only 6% of the questions collected are not privacy related.

===Experiments===
We would like to characterize and study the difficulty of the question-answering task for humans. We formulate the problem of identifying relevant evidence in the document to answer the question as a sentence-selection task, where it is possible to choose not to answer a question by not identifying any relevant sentences. We evaluate using sentence-level F1 rather than IR-style metrics so as to accommodate models that abstain from answering: similar to (Rajpurkar, Jia, and Liang, 2018) and (Yang, Yih, and Meek, 2015), for negative examples models are awarded an F1 of 1 if they abstain from answering and an F1 of 0 for giving any answer at all. Similar to Choi et al. (2018) and Rajpurkar et al. (2016), we compute the maximum F1 amongst all the reference answers. As abstaining from giving an answer is always legally sound but seldom helpful, we do not consider a question to be unanswerable if only a minority of experts abstain from giving an answer. To measure human performance, similar to Choi et al. (2018), given n reference answers we report the average maximum F1 of each held-out reference compared against the remaining (n − 1) references.

{| class="wikitable"
! Model !! Precision !! Recall !! F1
|-
| No Answer (NA) || 36.2 || 36.2 || 36.2
|-
| Human || 70.3 || 71.1 || 70.7
|}
''Table 5: Human performance and the performance of a No-Answer baseline. Human performance demonstrates considerable agreement on the right answer for the privacy domain, where experts often disagree (Reidenberg et al., 2015a).''
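A minimal sketch of this evaluation, under our reading of the description above: answers are sets of sentence indices, the empty set denotes abstention, and the handling of questions where only a minority of experts abstain is simplified away. Names and structure are ours.

<pre>
def sentence_f1(pred, ref):
    """Sentence-level F1 between predicted and reference sets of sentence
    indices. For an unanswerable reference (empty set), abstention scores
    1.0 and any non-empty prediction scores 0.0, as in the paper's footnote."""
    if not ref:
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)

def max_f1(pred, references):
    """Score a prediction against the best-matching reference answer."""
    return max(sentence_f1(pred, ref) for ref in references)

def human_f1(references):
    """Leave-one-out estimate of human agreement: each annotator's answer is
    scored against the remaining references, then averaged."""
    scores = []
    for i, held_out in enumerate(references):
        rest = references[:i] + references[i + 1:]
        scores.append(max_f1(held_out, rest))
    return sum(scores) / len(scores)

# The No-Answer baseline corresponds to always predicting the empty set:
# na_f1 = sum(max_f1(set(), refs) for refs in all_refs) / len(all_refs)
</pre>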
As discussed previously, since most questions are difficult to answer in a legally-sound way based on the contents of the privacy policy alone, abstaining from answering is often going to be a safe action. We would like to emphasize that this is not a criticism of the annotators or of the people asking the questions, but rather a characteristic of this domain, where privacy policies are often silent or ambiguous on issues users are likely to inquire about. To quantify the magnitude of this effect, we demonstrate that a model which always abstains from answering the question can achieve reasonable performance (Table 5), yet still leaves a large gap for improvement.

We would further like to understand what makes the majority of our annotators decide that a question should not be answered. We randomly sample 100 questions that were deemed unanswerable, and annotate them post-hoc with reasons informed by the expert annotations. We find that for 56% of unanswerable questions, the answer to the question would typically not be present in most privacy policies. These include questions such as 'how does the currency within the game work?', which suggests that users would benefit from being informed about the scope of typical privacy policies. However, they also include questions such as 'has Viber had data breaches in the past?', which ideally a privacy assistant would be able to answer, but whose answer is not present within a typical privacy policy. In the future, a privacy assistant could draw upon various sources of information such as metadata from the Google Playstore, background legal knowledge, news articles, or social media in order to broaden its coverage across questions. For an additional 24% of unanswerable questions, the answers were expected to be found in the privacy policy, but the privacy policy was silent on a possible answer (such as 'is my app data encrypted?'); generally, when a policy is silent, it is not safe to make any assumptions. 6% of the questions asked are too vague to understand correctly, such as 'who can contact me through the app?'; such questions would benefit from the assistant engaging in a clarification dialogue. Another 4% are ambiguously phrased, such as 'any difficulties to occupy the privacy assistant?'; these kinds of questions are very hard to interpret correctly. 3% of unanswerable questions are too specific in nature, and it is unlikely the creators of the privacy policy would anticipate that particular question ('does it have access to financial apps i use?'). Finally, 7% of unanswerable questions are too subjective, and our annotators tend to abstain from answering them (for example, 'how do i know this app is legit?').

We would also like to characterize the disagreement on this task. It is important to note here that all of our annotators are experts with legal training rather than crowdworkers, and their provided answers can generally be assumed to be valid legal opinions about the question. We tease apart the cases where they abstained from answering from the cases where they disagreed by comparing against the No Answer (henceforth NA) baseline (Table 5). In Table 5 we observe that the human F1 is 70.7%, demonstrating considerable agreement on the right answer. We would still like to investigate whether any disagreements are valid, or if they are due to poor definitions or a lack of adequate specification in the annotation instructions. We randomly sample 50 examples and annotate them for likely reasons for disagreement. (We do not use F1 to measure disagreement here; instead we manually filter samples so that we can capture both when the legal experts interpreted the question differently and when they interpreted the contents of the privacy policy differently.) We find that the annotators agree on 64% of instances and disagree on 36%. We further determine that 92.8% of disagreements were legitimate, valid differences of interpretation. For 43.75% of disagreements the question was interpreted differently, in 25% the contents of the privacy policy were interpreted differently, and the remaining disagreements were due to other sources of error (for example, in the question 'who is allowed to use the app', most annotators abstain from answering, but one annotator points out that the policy states that children under the age of 13 are not allowed to use the app).

We next analyze disagreements based on the type of question that was asked (Table 6). As observed earlier, the wh-type of the question may give us some information about the intent of the question. We observe that our expert annotators rarely abstain from answering when a user asks a 'will' question about a potential privacy harm, taking care to identify relevant sections of the privacy policy. Similarly, 'if'-type questions generally are quite specific and require careful reasoning. On the other hand, 'where' questions are generally about data storage and tend to be vague: for example, 'where is my data stored?' is probably not asking for the exact location of the company's datacenters, but it is unclear what granularity is meant in the question (e.g., a particular country, versus knowing whether the data is stored on a mobile phone or in the cloud).

{| class="wikitable"
! Question Word !! NA Model !! Human
|-
| is/does || 37.22 || 73.19
|-
| what || 39.77 || 73.35
|-
| will || 13.04 || 66.56
|-
| how || 27.84 || 80.16
|-
| can || 27.17 || 63.04
|-
| are || 35.85 || 68.68
|-
| who || 17.02 || 58.44
|-
| where || 54.55 || 54.55
|-
| if || 0 || 62.19
|}
''Table 6: Performance (F1) of the NA baseline and human annotators, stratified by the first word of the question.''

We also analyze disagreements based on the OPP-115 category of the question (Table 7). As expected, questions where annotators disagree on the category of the question show more disagreement over the answer than simple abstention. Similarly, for user choice, the policy typically does not answer questions like 'how do I limit its access to data' fully, so the annotators tend to abstain from answering. In contrast, questions about first party and third party practices are usually anticipated and often have answers in the privacy policy.

{| class="wikitable"
! Privacy Practice !! NA Model !! Human
|-
| First Party Collection/Use || 24.6 || 67.1
|-
| Third Party Sharing/Collection || 6.9 || 60.6
|-
| Data Security || 35.3 || 87.2
|-
| Data Retention || 0 || 79.8
|-
| User Access, Edit and Deletion || 0 || 53.1
|-
| User Choice/Control || 46.3 || 64.7
|-
| Other || 89.1 || 84.1
|-
| International & Specific Audiences || 0 || 100
|-
| No Agreement || 76.2 || 78.3
|}
''Table 7: Performance (F1) of the NA baseline and human annotators, stratified by the OPP-115 category of the question.''

===Conclusion===
What kinds of questions should an automated privacy assistant expect to receive? We explore this question by designing a study that elicits questions from crowdworkers who are asked to think about the data practices of mobile apps they might consider downloading on their smartphones. We qualitatively analyze the types of questions asked by users, and identify a number of challenges associated with generating answers to these questions. While in principle privacy policies should be written to answer the questions users are likely to have, in practice our study shows that the questions asked by users often go beyond what is disclosed in the text of privacy policies. Challenges arise in automated question answering both because policies are often silent or ambiguous on issues that users are likely to inquire about, and because users are not very good at articulating their privacy questions, occasionally even asking questions that have nothing to do with privacy. Determining a user's intent may be a process of discovery for both the user and the assistant, and thus in the future it would be helpful if the assistant were capable of engaging in clarification dialogue. Such a privacy assistant would have to reconcile the need to provide answers that are legally accurate with the need to be helpful to the user. It would have to be capable of disambiguating questions by engaging in dialogues with users; it would have to be able to supplement information found (or lacking) in the privacy policy with additional sources of information such as background legal knowledge; and ideally, it would also be able to interpret ambiguity in the policy, as well as silence about different issues. We hope that the identification of these requirements will help inform the design of effective automatic privacy assistants.
===Acknowledgements===
This work has been supported by the National Science Foundation under Grant No. CNS 13-30596 and Grant No. CNS 13-30214. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF or the US Government. The authors would like to thank Lorrie Cranor, Florian Schaub and Shomir Wilson for insightful feedback and discussion related to this work.

===References===
* Cate, F. H. 2010. The limits of notice and choice. IEEE Security & Privacy 8(2):59–62.
* Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.-t.; Choi, Y.; Liang, P.; and Zettlemoyer, L. 2018. QuAC: Question answering in context. arXiv preprint arXiv:1808.07036.
* Commission, U. F. T., et al. 2012. Protecting consumer privacy in an era of rapid change: Recommendations for businesses and policymakers. FTC Report.
* Cranor, L. F. 2003. P3P: Making privacy policies more useful. IEEE Security & Privacy 99(6):50–55.
* Cranor, L. F. 2012. Necessary but not sufficient: Standardized mechanisms for privacy notice and choice. J. on Telecomm. & High Tech. L. 10:273.
* Do, P.-K.; Nguyen, H.-T.; Tran, C.-X.; Nguyen, M.-T.; and Nguyen, M.-L. 2017. Legal question answering using ranking SVM and deep convolutional neural network. arXiv preprint arXiv:1703.05320.
* Feng, M.; Xiang, B.; Glass, M. R.; Wang, L.; and Zhou, B. 2015. Applying deep learning to answer selection: A study and an open task. arXiv preprint arXiv:1508.01585.
* Frank, A.; Krieger, H.-U.; Xu, F.; Uszkoreit, H.; Crysmann, B.; Jörg, B.; and Schäfer, U. 2007. Question answering from structured knowledge sources. Journal of Applied Logic 5(1):20–48.
* Gluck, J.; Schaub, F.; Friedman, A.; Habib, H.; Sadeh, N.; Cranor, L. F.; and Agarwal, Y. 2016. How short is too short? Implications of length and framing on the effectiveness of privacy notices. In 12th Symposium on Usable Privacy and Security (SOUPS), 321–340.
* Harkous, H.; Fawaz, K.; Lebret, R.; Schaub, F.; Shin, K. G.; and Aberer, K. 2018. Polisis: Automated analysis and presentation of privacy policies using deep learning. arXiv preprint arXiv:1802.02561.
* Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693–1701.
* Jain, P.; Gyanchandani, M.; and Khare, N. 2016. Big data privacy: A technological perspective and review. Journal of Big Data 3(1):25.
* Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
* Kelley, P. G.; Bresee, J.; Cranor, L. F.; and Reeder, R. W. 2009. A nutrition label for privacy. In Proceedings of the 5th Symposium on Usable Privacy and Security, 4. ACM.
* Kim, M.-Y.; Xu, Y.; and Goebel, R. 2015. Applying a convolutional neural network to legal question answering. In JSAI International Symposium on Artificial Intelligence, 282–294. Springer.
* Liu, F.; Wilson, S.; Schaub, F.; and Sadeh, N. 2016. Analyzing vocabulary intersections of expert annotations and topic models for data practices in privacy policies. In 2016 AAAI Fall Symposium Series.
* Liu, Y.-H.; Chen, Y.-L.; and Ho, W.-L. 2015. Predicting associated statutes for legal problems. Information Processing & Management 51(1):194–211.
* Mahler, L. 2015. What is NLP and why should lawyers care. Retrieved March 12, 2018.
* McDonald, A. M., and Cranor, L. F. 2008. The cost of reading privacy policies. ISJLP 4:543.
* Micheti, A.; Burkell, J.; and Steeves, V. 2010. Fixing broken doors: Strategies for drafting privacy policies young people can understand. Bulletin of Science, Technology & Society 30(2):130–143.
* Mollá, D., and Vicedo, J. L. 2007. Question answering in restricted domains: An overview. Computational Linguistics 33(1):41–61.
* Monroy, A.; Calvo, H.; and Gelbukh, A. 2009. NLP for shallow question answering of legal documents using graphs. Computational Linguistics and Intelligent Text Processing, 498–508.
* Mysore Sathyendra, K.; Wilson, S.; Schaub, F.; Zimmeck, S.; and Sadeh, N. 2017. Identifying the provision of choices in privacy policy text. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2774–2779. Copenhagen, Denmark: Association for Computational Linguistics.
* Oltramari, A.; Piraviperumal, D.; Schaub, F.; Wilson, S.; Cherivirala, S.; Norton, T. B.; Russell, N. C.; Story, P.; Reidenberg, J.; and Sadeh, N. 2017. PrivOnto: A semantic framework for the analysis of privacy policies. Semantic Web (Preprint):1–19.
* Onishi, T.; Wang, H.; Bansal, M.; Gimpel, K.; and McAllester, D. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2230–2235. Austin, Texas: Association for Computational Linguistics.
* Quaresma, P., and Rodrigues, I. P. 2005. A question answer system for legal information retrieval. In JURIX, 91–100.
* Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
* Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
* Reidenberg, J. R.; Breaux, T.; Cranor, L. F.; French, B.; Grannis, A.; Graves, J. T.; Liu, F.; McDonald, A.; Norton, T. B.; and Ramanath, R. 2015a. Disagreeable privacy policies: Mismatches between meaning and users' understanding. Berkeley Tech. LJ 30:39.
* Reidenberg, J. R.; Russell, N. C.; Callen, A. J.; Qasir, S.; and Norton, T. B. 2015b. Privacy harms and the effectiveness of the notice and choice framework. ISJLP 11:485.
* Sadeh, N.; Acquisti, A.; Breaux, T. D.; Cranor, L. F.; McDonald, A. M.; Reidenberg, J. R.; Smith, N. A.; Liu, F.; Russell, N. C.; Schaub, F.; et al. 2013. The Usable Privacy Policy Project: Combining crowdsourcing, machine learning and natural language processing to semi-automatically answer those privacy questions users care about. Technical Report CMU-ISR-13-119, Carnegie Mellon University.
* Sathyendra, K. M.; Ravichander, A.; Story, P. G.; Black, A. W.; and Sadeh, N. 2017. Helping users understand privacy notices with automated query answering functionality: An exploratory study. Technical Report.
* Schaub, F.; Balebako, R.; Durity, A. L.; and Cranor, L. F. 2015. A design space for effective privacy notices. In Eleventh Symposium On Usable Privacy and Security (SOUPS 2015), 1–17.
* Story, P.; Zimmeck, S.; and Sadeh, N. 2018. Which apps have privacy policies?
* Tan, M.; dos Santos, C.; Xiang, B.; and Zhou, B. 2016. Improved representation learning for question answer matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 464–473. Berlin, Germany: Association for Computational Linguistics.
* Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; and Suleman, K. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
* Wilson, S.; Schaub, F.; Dara, A. A.; Liu, F.; Cherivirala, S.; Leon, P. G.; Andersen, M. S.; Zimmeck, S.; Sathyendra, K. M.; Russell, N. C.; et al. 2016. The creation and analysis of a website privacy policy corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1330–1340.
* Wilson, S.; Schaub, F.; Liu, F.; Sathyendra, K.; Zimmeck, S.; Ramanath, R.; Liu, F.; Sadeh, N.; and Smith, N. 2017. Analyzing privacy policies at scale: From crowdsourcing to automated annotations. ACM Transactions on the Web.
* Yang, Y.; Yih, W.-t.; and Meek, C. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2013–2018.