Towards Measuring Risk Factors in Privacy Policies

Najmeh Mousavi Nejad (Fraunhofer IAIS & University of Bonn, Sankt Augustin, Germany, nejad@cs.uni-bonn.de), Damien Graux (Fraunhofer IAIS, Sankt Augustin, Germany, damien.graux@iais.fraunhofer.de), Diego Collarana (Fraunhofer IAIS, Sankt Augustin, Germany, diego.collarana.vargas@iais.fraunhofer.de)

In: Proceedings of the Workshop on Artificial Intelligence and the Administrative State (AIAS 2019), June 17, 2019, Montreal, QC, Canada. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org

ABSTRACT
The ubiquitous availability of online services and mobile apps results in a rapid proliferation of contractual agreements in the form of privacy policies. Despite the importance of such consent forms, the majority of users tend to ignore them due to their length and complexity. Thus, users might be consenting to policies that are not aligned with regulations such as the EU's GDPR. In this study, we propose a hybrid approach that measures a privacy policy's risk factor by applying both supervised deep learning and rule-based information extraction. Benefiting from an annotated dataset of 115 privacy policies, a deep learning component first predicts high-level categories for each paragraph. A rule-based module then extracts pre-defined attributes and their values, based on the high-level classes. Finally, the privacy policy's risk factor is computed from these attribute values.

KEYWORDS
Privacy policy, Deep learning, Rule-based information extraction, Risk factor

1 INTRODUCTION
In the current digital era, almost everyone is exposed to accepting contractual agreements in the form of privacy policies. However, the majority of people skip privacy policies due to their length and complex terminology. According to a recent survey of 543 university students, only 26% did not choose the 'quick join' route while joining a fictitious social network, and unsurprisingly, their average reading time was only 73 seconds [2]. Moreover, it is important for the administrative state to validate the compliance of privacy policies with the corresponding law. For example, the EU's General Data Protection Regulation (GDPR) states that the retention period must be specified and limited.

To assist end-users with consciously agreeing to the conditions, we can apply Natural Language Processing (NLP) and Information Extraction (IE) to present a privacy policy in a structured view. Our approach applies supervised deep learning, using an annotated dataset named OPP-115, to assign high-level classes to a privacy policy's paragraphs. Then, according to the predicted classes, we apply hand-coded rules, derived from expert annotations, to extract attribute values from each paragraph. Finally, having this detailed information for each paragraph, a risk measurement function computes a risk factor based on the extracted information. Consequently, a user could choose to stop using a website if the predicted risk score is high. Additionally, this structured view can also be used by the administrative state to perform a shallow compliance check.

OPP-115 is a widely-used dataset in the context of privacy policy analysis [5]. It contains in-depth annotations for 115 privacy policies at paragraph level, and each paragraph was annotated by 3 experts. There are two types of annotations: high-level classes, which define 10 data practice categories; and low-level attributes, which include mandatory and optional attributes. For instance, the high-level class First Party Collection/Use has 3 attributes: Collection Mode (explicit or implicit), Information Type (financial, health, contact, location, etc.) and Purpose (advertising, marketing, analytics, legal requirement, etc.).

The approach proposed in this paper is built upon our previous effort, which exploits OPP-115 and deep learning to solve a multi-label classification problem. We feed a privacy policy's paragraphs along with the predicted classes into a rule-based IE component and retrieve attribute values. The rules are defined based on the OPP-115 low-level annotations. Finally, all predicted categories and extracted information are passed into a risk measurement module, and a risk factor is computed based on hand-coded rules.

The paper is divided into the following sections: Section 2 provides an overview of existing efforts on measuring risks in privacy policies; Section 3 presents our proposed approach and our evaluation scheme; and finally Section 4 concludes the paper.

2 RELATED WORK
In light of the now enforced EU-wide General Data Protection Regulation (GDPR) [4], there has been an increasing interest in privacy policy analysis, as this new set of regulations increases the constraints for companies holding customer data. Here, we provide a brief overview of studies that specifically addressed risk levels in privacy policies.

Polisis is an online service for the automatic analysis of privacy policies [1]. Along with classification and structured presentations of privacy policies, it assigns privacy icons which are based on the Disconnect icons (https://disconnect.me/). These icons include Expected Use, Expected Collection, Precise Location, Data Retention and Children Privacy. For instance, the Data Retention color assignments are: Green for retention periods of less than a year; Yellow, when the retention period is longer than one year; and Red, when there is no data retention policy provided. Polisis benefits from OPP-115 and employs supervised machine learning to extract high-level categories (in the above example, Data Retention) and the attribute values of each category (e.g., Retention Period in this case). Finally, based on the retrieved attribute values and heuristic rules, privacy icons along with their colors are produced. Currently, Polisis's interface generates only a limited set of privacy icons. In future work, we intend to further analyze privacy icons and extend them with the help of legal experts.
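The three-level Data Retention color heuristic described above is simple enough to sketch directly. The following is our own minimal illustration, with an assumed numeric encoding of the retention period (months, or None when no retention statement exists); it is not Polisis's actual implementation:

```python
# Minimal sketch of the Disconnect-style Data Retention color heuristic
# described above (our own illustration, not Polisis's implementation).

def data_retention_color(retention_months):
    """Map a retention period (in months, or None when the policy
    provides no data retention statement) to an icon color."""
    if retention_months is None:
        return "Red"     # no data retention policy provided
    if retention_months <= 12:
        return "Green"   # retention period of up to one year
    return "Yellow"      # retention period longer than one year
```

For example, a policy that deletes logs after 6 months would receive a Green icon, while a policy with no retention statement at all falls through to Red.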
PrivacyCheck is an approach for the automatic summarization of privacy policies using data mining [6]. It answers 10 pre-defined questions concerning the privacy and security of users' data and is also available as a Chrome browser extension. In order to train the model, a corpus containing 400 privacy policies was compiled, and 7 privacy experts manually assigned risk levels (Green, Yellow, Red) to the 10 factors. First, a pre-processing step finds those paragraphs that have at least one keyword related to one of the 10 factors; the methodology for selecting keywords was largely manual. The selected paragraphs are then sent to a data mining server where 11 data mining models were trained: one for checking whether the corresponding page is a privacy policy, and one for each of the 10 questions. The authors claim that, on average, PrivacyCheck finds the correct risk level 60% of the time. The limitation of PrivacyCheck is its lack of Inter-Annotator Agreement (IAA) for the annotators. According to the paper, quality control was performed by assigning each policy to two team members. However, only 15% of the privacy policies were compared and their discrepancies resolved, which makes the training dataset less reliable.

PrivacyGuide is another summarization tool, inspired by the GDPR, that classifies a privacy policy into 11 categories using NLP and machine learning, and further measures the associated risk level of each class [3]. Similar to the previous studies, PrivacyGuide uses a three-level risk scale based on classification (i.e., Green, Yellow, Red). The 11 criteria and their associated risk levels were defined by GDPR experts. Based on these criteria, a privacy corpus was compiled with the help of 35 university students. Each participant assigned a privacy category to text snippets and classified them with a risk level. The authors reported a weighted average accuracy of 74% for classifying a privacy policy into one of the 11 classes, and an accuracy of 90% for risk level detection. Although the results were encouraging, the dataset was not annotated by experts, which is a fundamental criterion in legal text processing and analysis.

3 PROPOSED APPROACH
In this section, we provide details of our approach for measuring a privacy policy's risk factor. Our proposed method leverages the OPP-115 annotated dataset for training and evaluation [5]. As discussed earlier, the OPP-115 high-level annotations are divided into 10 classes:
(1) First Party Collection/Use: how and why the information is collected.
(2) Third Party Sharing/Collection: how the information may be used or collected by third parties.
(3) User Choice/Control: choices and controls available to users.
(4) User Access/Edit/Deletion: if users can modify their information and how.
(5) Data Retention: how long the information is stored.
(6) Data Security: how users' data is secured.
(7) Policy Change: if the service provider will change their policy and how the users are informed.
(8) Do Not Track: if and how Do Not Track signals (https://en.wikipedia.org/wiki/Do_Not_Track) are honored.
(9) International/Specific Audiences: practices that target a specific group of users (e.g., children, Europeans, etc.).
(10) Other: additional practices not covered by the other categories.

In addition, each high-level category includes low-level attribute annotations. For instance, the Data Retention category is further annotated with its attributes: Retention Period, Retention Purpose and Information Type. The annotators provided one or several values for each attribute, along with the span of text based on which they chose the specific value(s). In the above example, Retention Period may have one of the following values: stated period, limited, indefinitely or unspecified.

[Figure 1: General Architecture. A privacy policy is fed to the trained deep learning component (trained on the OPP-115 dataset), which predicts high-level classes; the rule executor of the rule-based pipeline extracts low-level attribute values; and the risk measurement calculator outputs the privacy policy together with color-coded risk factor values.]

Figure 1 shows the architecture of our proposed approach, which consists of three main components: 1) a deep learning module trained to predict the high-level classes of a policy's paragraphs; 2) a rule-based pipeline in which the rules are defined based on the low-level attribute annotations of OPP-115; and 3) a risk measurement function that assigns risk icons along with their corresponding colors (green, yellow, red), according to the extracted information.

Following conventional ML practices, in the deep learning component the dataset is randomly partitioned into a ratio of 3:1:1 for training, validation and testing respectively, while maintaining a stratified set of labels. We further decomposed the Other category into its attributes: Introductory/Generic, Privacy Contact Information and Practice Not Covered. Therefore, considering that a paragraph in the dataset may be labeled with more than one category, we face a multi-label classification problem with 12 classes. The implementation of the ML component is completed and we achieve a 79% micro-average F1 score.

Table 1: Sample rules for extracting values of Retention Period from the Data Retention category.

Rule: [delete/remove][Token]*[after][number][day/month/year] — Value: Stated Period
  Samples: 1. We remove the entirety of the IP address after 6 months. 2. All stored IP addresses, except the account creation IP address, are deleted after 90 days.
Rule: [not][Token]*[delete/remove] — Value: Indefinitely
  Sample: The posts and content you made will not be automatically deleted as part of the account removal process.
Rule: [store/keep/retain/maintain][Token]*[indefinitely] — Value: Indefinitely
  Samples: 1. This data is generally retained indefinitely. 2. The information we collect for statistical analysis and technical improvements is maintained indefinitely.
Rule: [store/keep/retain/maintain][Token]*[as long as][Token]+ — Value: Limited
  Samples: 1. We will retain your information for as long as your account is active or as needed to provide you services. 2. We will retain your personal information while you have an account and thereafter for as long as we need it for purposes not prohibited by applicable laws.
Rule: If none of the above conditions — Value: Unspecified
  Samples: 1. We receive and store certain types of information whenever you interact with us. 2. The personal information collected about you through our online applications and in our communications with you is stored in our internal database.
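Token patterns of the Table 1 kind translate naturally into regular expressions, where [Token]* becomes a non-greedy wildcard and word stems (e.g., "delet") cover inflected forms. The following is our own illustrative sketch, not the paper's actual rule engine; all names are assumptions:

```python
import re

# Illustrative regex renderings of the Table 1 token patterns.
# Rule order encodes specificity: more specific patterns fire first.
RETENTION_RULES = [
    # [delete/remove][Token]*[after][number][day/month/year] -> Stated Period
    (re.compile(r"\b(?:delet|remov)\w*\b.*?\bafter\s+\d+\s+(?:day|month|year)s?\b",
                re.I | re.S), "Stated Period"),
    # [not][Token]*[delete/remove] -> Indefinitely (data is never removed)
    (re.compile(r"\bnot\b.*?\b(?:delet|remov)\w*\b", re.I | re.S), "Indefinitely"),
    # [store/keep/retain/maintain][Token]*[indefinitely] -> Indefinitely
    (re.compile(r"\b(?:stor|keep|kept|retain|maintain)\w*\b.*?\bindefinitely\b",
                re.I | re.S), "Indefinitely"),
    # [store/keep/retain/maintain][Token]*[as long as][Token]+ -> Limited
    (re.compile(r"\b(?:stor|keep|kept|retain|maintain)\w*\b.*?\bas long as\b",
                re.I | re.S), "Limited"),
]

def retention_period(paragraph):
    """Extract the Retention Period value from a paragraph that the deep
    learning component has already labeled as Data Retention."""
    for pattern, value in RETENTION_RULES:
        if pattern.search(paragraph):
            return value
    return "Unspecified"  # fallback: none of the rules fired
```

Running the extractor on the Table 1 samples reproduces the expected values; for instance, "are deleted after 90 days" triggers the Stated Period rule before the broader Indefinitely patterns are even consulted.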
The high-level predicted classes are passed to the rule-based component, where the low-level attribute values are extracted. The definition of the rules is based on the expert annotations in the OPP-115 dataset. We intend to use 60% of the low-level annotations for defining the rules, 20% for validating the defined rules, and the remaining 20% for the final test. Table 1 shows some sample rules for finding the values of the Retention Period attribute in the Data Retention category. We ground our rule definitions in the expert annotations. As shown in the table, the rule definitions use the knowledge about the high-level categories predicted by the deep learning component.

Having information about the attribute values, the risk measurement module is able to assign appropriate risk icons along with their corresponding colors. As a proof of concept, we base our risk measurement rules on the Disconnect icons. As mentioned in the literature review, the Disconnect Data Retention color assignments are as follows: Green for a retention period <= 12 months; Yellow for a retention period > 12 months; and Red when there is no data retention policy provided. Algorithm 1 shows our interpretation of the Data Retention icon. It is worth mentioning that our interpretation is based on the available annotations from the OPP-115 dataset. Hence, it is not the only representation that can be built from the Disconnect icons, and others may adopt their own understanding.

Algorithm 1: Sketch of the risk measurement algorithm
Require: predicted high-level category, extracted attribute values
 1: for all paragraphs in the privacy policy do
 2:   category ← predicted high-level category
 3:   if category ∈ Data Retention then
 4:     RetentionPeriod ← extracted retention period
 5:     if RetentionPeriod ∈ (Stated Period, Limited) then
 6:       DataRetentionIcon ← Green
 7:     else if RetentionPeriod ∈ Indefinitely then
 8:       DataRetentionIcon ← Yellow
 9:     else
10:       DataRetentionIcon ← Red
11:     end if
12:   end if
13:   if category ∈ First Party Collection/Use then ...
14:   end if
15: end for
Ensure: risk icons and their corresponding colors

For the evaluation of our approach, we intend to generate risk factors according to the OPP-115 expert annotations and use them as a gold standard. We believe the final error will be close to the sum of the error rate of the deep learning module (predicting high-level classes) and the error caused by the incomplete set of rules in the rule executor component. Considering that we are now able to predict the correct high-level classes with 79% F1, with a careful definition of the rules for extracting attribute values, we expect to reach a reasonable accuracy at the end of our pipeline.

4 CONCLUSION
In this study, we proposed the application of deep learning models and rule-based information extraction to automatically present a structured view of the risk factors in privacy policies. In particular, we presented a hybrid approach that takes advantage of the OPP-115 dataset. This approach is of paramount importance to support users in consciously agreeing to the terms and conditions of online services, and to perform shallow compliance checking, where a high-risk score can be assigned to "indefinitely" and "unspecified" values. As next steps, we plan to implement the proposed architecture and run empirical evaluations to validate the presented hypothesis, i.e., that users will be more motivated to read privacy policies when a color-coded structured view is presented to them.

REFERENCES
[1] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. G. Shin, and K. Aberer. Polisis: Automated analysis and presentation of privacy policies using deep learning. CoRR, abs/1802.02561, 2018.
[2] J. A. Obar and A. Oeldorf-Hirsch. The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services. Information, Communication & Society, pages 1–20, 2018.
[3] W. B. Tesfay, P. Hofmann, T. Nakamura, S. Kiyomoto, and J. Serna. PrivacyGuide: Towards an implementation of the EU GDPR on internet privacy policy evaluation. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, IWSPA '18, pages 15–21, New York, NY, USA, 2018. ACM.
[4] P. Voigt and A. Von dem Bussche. The EU General Data Protection Regulation (GDPR): A Practical Guide. 1st ed., Cham: Springer International Publishing, 2017.
[5] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, et al. The creation and analysis of a website privacy policy corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1330–1340, 2016.
[6] R. N. Zaeem, R. L. German, and K. S. Barber. PrivacyCheck: Automatic summarization of privacy policies using data mining. ACM Trans. Internet Technol., 18(4):53:1–53:18, Aug. 2018.