Towards Measuring Risk Factors in Privacy Policies

Najmeh Mousavi Nejad (Fraunhofer IAIS & University of Bonn, Sankt Augustin, Germany, nejad@cs.uni-bonn.de), Damien Graux (Fraunhofer IAIS, Sankt Augustin, Germany, damien.graux@iais.fraunhofer.de), Diego Collarana (Fraunhofer IAIS, Sankt Augustin, Germany, diego.collarana.vargas@iais.fraunhofer.de)

In: Proceedings of the Workshop on Artificial Intelligence and the Administrative State (AIAS 2019), June 17, 2019, Montreal, QC, Canada. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org

ABSTRACT
The ubiquitous availability of online services and mobile apps results in a rapid proliferation of contractual agreements in the form of privacy policies. Despite the importance of such consent forms, the majority of users tend to ignore them due to their length and complexity. Thus, users might be consenting to policies that are not aligned with regulations such as the EU's GDPR. In this study, we propose a hybrid approach that measures a privacy policy's risk factor by applying both supervised deep learning and rule-based information extraction. Benefiting from an annotated dataset of 115 privacy policies, a deep learning component first predicts high-level categories for each paragraph. A rule-based module then extracts pre-defined attributes and their values, based on the high-level classes. Finally, the privacy policy's risk factor is computed from these attribute values.

KEYWORDS
Privacy policy, Deep learning, Rule-based information extraction, Risk factor

1 INTRODUCTION
In the current digital era, almost everyone is exposed to accepting contractual agreements in the form of privacy policies. However, the majority of people skip privacy policies due to their length and complex terminology. According to a recent survey of 543 university students, only 26% did not choose the 'quick join' route while joining a fictitious social network, and unsurprisingly, their average reading time was only 73 seconds [2]. Moreover, it is important for the administrative state to validate the compliance of privacy policies with the corresponding law. For example, the EU's General Data Protection Regulation (GDPR) states that the retention period must be specified and limited.

To assist end-users with consciously agreeing to the conditions, we can apply Natural Language Processing (NLP) and Information Extraction (IE) to present a privacy policy in a structured view. Our approach applies supervised deep learning, using an annotated dataset named OPP-115, to assign high-level classes to a privacy policy's paragraphs. Then, according to the predicted classes, we apply hand-coded rules, derived from expert annotations, to extract attribute values from each paragraph. Finally, having this detailed information for each paragraph, a risk measurement function computes a risk factor based on the extracted information. Consequently, a user could choose to stop using a website if the predicted risk score is high. Additionally, this structured view can also be used by the administrative state to perform a shallow compliance check.

OPP-115 is a widely-used dataset in the context of privacy policy analysis [5]. It contains in-depth annotations for 115 privacy policies at paragraph level, and each paragraph was annotated by 3 experts. There are two types of annotations: high-level classes, which define 10 data practice categories; and low-level attributes, which include mandatory and optional attributes. For instance, the high-level class First Party Collection/Use has 3 attributes: Collection Mode (explicit or implicit), Information Type (financial, health, contact, location, etc.) and Purpose (advertising, marketing, analytics, legal requirement, etc.).

The approach proposed in this paper is built upon our previous effort, which exploits OPP-115 and deep learning to solve a multi-label classification problem. We feed a privacy policy's paragraphs along with the predicted classes into a rule-based IE component and retrieve attribute values. The rules are defined based on the OPP-115 low-level annotations. Finally, all predicted categories and extracted information are passed into a risk measurement module, and a risk factor is computed based on hand-coded rules.

The paper is divided into the following sections: Section 2 provides an overview of existing efforts on measuring risks in privacy policies; Section 3 presents our proposed approach and our evaluation scheme; and finally Section 4 concludes the paper.

2 RELATED WORK
In light of the now enforced EU-wide General Data Protection Regulation (GDPR) [4], there has been an increasing interest in privacy policy analysis, as this new set of regulations increases the constraints for companies holding customer data. Here, we provide a brief overview of studies that specifically addressed risk levels in privacy policies.

Polisis is an online service for the automatic analysis of privacy policies [1]. Along with classification and structured presentations of privacy policies, it assigns privacy icons which are based on the Disconnect icons (https://disconnect.me/). These icons include Expected Use, Expected Collection, Precise Location, Data Retention and Children Privacy. For instance, the Data Retention color assignments are: Green for retention periods of less than a year; Yellow, when the retention period is longer than one year; and Red, when there is no data retention policy provided. Polisis benefits from OPP-115 and employs supervised machine learning to extract high-level categories (in the above example, Data Retention) and the attribute values of each category (e.g., Retention Period in this case). Finally, based on the retrieved attribute values and heuristic rules, privacy icons along with their colors are produced. Currently, Polisis's interface generates only a limited set of privacy icons. In future work, we intend to further analyze privacy icons and extend them with the help of legal experts.
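The three-level Data Retention color heuristic described above is simple enough to sketch directly. The following is our own minimal illustration, with an assumed numeric encoding of the retention period (months, or None when no retention statement exists); it is not Polisis's actual implementation:

```python
# Minimal sketch of the Disconnect-style Data Retention color heuristic
# described above (our own illustration, not Polisis's implementation).

def data_retention_color(retention_months):
    """Map a retention period (in months, or None when the policy
    provides no data retention statement) to an icon color."""
    if retention_months is None:
        return "Red"     # no data retention policy provided
    if retention_months <= 12:
        return "Green"   # retention period of up to one year
    return "Yellow"      # retention period longer than one year
```

For example, a policy that deletes logs after 6 months would receive a Green icon, while a policy with no retention statement at all falls through to Red.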
PrivacyCheck is an approach for the automatic summarization of privacy policies using data mining [6]. It answers 10 pre-defined questions concerning the privacy and security of users' data and is also available as a Chrome browser extension. In order to train the model, a corpus containing 400 privacy policies was compiled, and 7 privacy experts manually assigned risk levels (Green, Yellow, Red) to the 10 factors. First, a pre-processing step finds those paragraphs that have at least one keyword related to one of the 10 factors; the methodology for selecting keywords was largely manual. The selected paragraphs are then sent to a data mining server where 11 data mining models were trained: one for checking whether the corresponding page is a privacy policy, and one for each of the 10 questions. The authors claim that, on average, PrivacyCheck finds the correct risk level 60% of the time. The limitation of PrivacyCheck is its lack of Inter-Annotator Agreement (IAA) for the annotators. According to the paper, quality control was performed by assigning each policy to two team members. However, only 15% of the privacy policies were compared and their discrepancies resolved, which makes the training dataset less reliable.

PrivacyGuide is another summarization tool, inspired by the GDPR, that classifies a privacy policy into 11 categories using NLP and machine learning, and further measures the associated risk level of each class [3]. Similar to the previous studies, PrivacyGuide uses a three-level risk scale based on classification (i.e., Green, Yellow, Red). The 11 criteria and their associated risk levels were defined by GDPR experts. Based on these criteria, a privacy corpus was compiled with the help of 35 university students. Each participant assigned a privacy category to text snippets and classified them with a risk level. The authors reported a weighted average accuracy of 74% for classifying a privacy policy into one of the 11 classes, and an accuracy of 90% for risk level detection. Although the results were encouraging, the dataset was not annotated by experts, which is a fundamental criterion in legal text processing and analysis.

3 PROPOSED APPROACH
In this section, we provide details of our approach for measuring a privacy policy's risk factor. Our proposed method leverages the OPP-115 annotated dataset for training and evaluation [5]. As discussed earlier, the OPP-115 high-level annotations are divided into 10 classes:
(1) First Party Collection/Use: how and why the information is collected.
(2) Third Party Sharing/Collection: how the information may be used or collected by third parties.
(3) User Choice/Control: choices and controls available to users.
(4) User Access/Edit/Deletion: if users can modify their information and how.
(5) Data Retention: how long the information is stored.
(6) Data Security: how users' data is secured.
(7) Policy Change: if the service provider will change their policy and how the users are informed.
(8) Do Not Track: if and how Do Not Track signals (https://en.wikipedia.org/wiki/Do_Not_Track) are honored.
(9) International/Specific Audiences: practices that target a specific group of users (e.g., children, Europeans, etc.).
(10) Other: additional practices not covered by the other categories.

In addition, each high-level category includes low-level attribute annotations. For instance, the Data Retention category is further annotated with its attributes: Retention Period, Retention Purpose and Information Type. The annotators provided one or several values for each attribute, along with the span of text based on which they chose the specific value(s). In the above example, Retention Period may have one of the following values: stated period, limited, indefinitely or unspecified.

[Figure 1: General Architecture. A privacy policy is fed to the trained deep learning component (trained on the OPP-115 dataset), which predicts high-level classes; the rule executor of the rule-based pipeline extracts low-level attribute values; and the risk measurement calculator outputs the privacy policy together with color-coded risk factor values.]

Figure 1 shows the architecture of our proposed approach, which consists of three main components: 1) a deep learning module trained to predict the high-level classes of a policy's paragraphs; 2) a rule-based pipeline in which the rules are defined based on the low-level attribute annotations of OPP-115; and 3) a risk measurement function that assigns risk icons along with their corresponding colors (green, yellow, red), according to the extracted information.

Following conventional ML practices, in the deep learning component the dataset is randomly partitioned into a ratio of 3:1:1 for training, validation and testing respectively, while maintaining a stratified set of labels. We further decomposed the Other category into its attributes: Introductory/Generic, Privacy Contact Information and Practice Not Covered. Therefore, considering that a paragraph in the dataset may be labeled with more than one category, we face a multi-label classification problem with 12 classes. The implementation of the ML component is completed and we achieve a 79% micro-average F1 score.

Table 1: Sample rules for extracting values of Retention Period from the Data Retention category.

Rule: [delete/remove][Token]*[after][number][day/month/year] — Value: Stated Period
  Samples: 1. We remove the entirety of the IP address after 6 months. 2. All stored IP addresses, except the account creation IP address, are deleted after 90 days.
Rule: [not][Token]*[delete/remove] — Value: Indefinitely
  Sample: The posts and content you made will not be automatically deleted as part of the account removal process.
Rule: [store/keep/retain/maintain][Token]*[indefinitely] — Value: Indefinitely
  Samples: 1. This data is generally retained indefinitely. 2. The information we collect for statistical analysis and technical improvements is maintained indefinitely.
Rule: [store/keep/retain/maintain][Token]*[as long as][Token]+ — Value: Limited
  Samples: 1. We will retain your information for as long as your account is active or as needed to provide you services. 2. We will retain your personal information while you have an account and thereafter for as long as we need it for purposes not prohibited by applicable laws.
Rule: If none of the above conditions — Value: Unspecified
  Samples: 1. We receive and store certain types of information whenever you interact with us. 2. The personal information collected about you through our online applications and in our communications with you is stored in our internal database.
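Token patterns of the Table 1 kind translate naturally into regular expressions, where [Token]* becomes a non-greedy wildcard and word stems (e.g., "delet") cover inflected forms. The following is our own illustrative sketch, not the paper's actual rule engine; all names are assumptions:

```python
import re

# Illustrative regex renderings of the Table 1 token patterns.
# Rule order encodes specificity: more specific patterns fire first.
RETENTION_RULES = [
    # [delete/remove][Token]*[after][number][day/month/year] -> Stated Period
    (re.compile(r"\b(?:delet|remov)\w*\b.*?\bafter\s+\d+\s+(?:day|month|year)s?\b",
                re.I | re.S), "Stated Period"),
    # [not][Token]*[delete/remove] -> Indefinitely (data is never removed)
    (re.compile(r"\bnot\b.*?\b(?:delet|remov)\w*\b", re.I | re.S), "Indefinitely"),
    # [store/keep/retain/maintain][Token]*[indefinitely] -> Indefinitely
    (re.compile(r"\b(?:stor|keep|kept|retain|maintain)\w*\b.*?\bindefinitely\b",
                re.I | re.S), "Indefinitely"),
    # [store/keep/retain/maintain][Token]*[as long as][Token]+ -> Limited
    (re.compile(r"\b(?:stor|keep|kept|retain|maintain)\w*\b.*?\bas long as\b",
                re.I | re.S), "Limited"),
]

def retention_period(paragraph):
    """Extract the Retention Period value from a paragraph that the deep
    learning component has already labeled as Data Retention."""
    for pattern, value in RETENTION_RULES:
        if pattern.search(paragraph):
            return value
    return "Unspecified"  # fallback: none of the rules fired
```

Running the extractor on the Table 1 samples reproduces the expected values; for instance, "are deleted after 90 days" triggers the Stated Period rule before the broader Indefinitely patterns are even consulted.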
The high-level predicted classes are passed to the rule-based component, where the low-level attribute values are extracted. The definition of the rules is based on the expert annotations in the OPP-115 dataset. We intend to use 60% of the low-level annotations for defining the rules, 20% for validating the defined rules, and the remaining 20% for the final test. Table 1 shows some sample rules for finding the values of the Retention Period attribute in the Data Retention category. We ground our rule definitions in the expert annotations. As shown in the table, the rule definitions use the knowledge about the high-level categories predicted by the deep learning component.

Having information about the attribute values, the risk measurement module is able to assign appropriate risk icons along with their corresponding colors. As a proof of concept, we base our risk measurement rules on the Disconnect icons. As mentioned in the literature review, the Disconnect Data Retention color assignments are as follows: Green for a retention period <= 12 months; Yellow for a retention period > 12 months; and Red when there is no data retention policy provided. Algorithm 1 shows our interpretation of the Data Retention icon. It is worth mentioning that our interpretation is based on the available annotations from the OPP-115 dataset. Hence, it is not the only representation that can be built from the Disconnect icons, and others may adopt their own understanding.

Algorithm 1: Sketch of the risk measurement algorithm
Require: predicted high-level category, extracted attribute values
 1: for all paragraphs in the privacy policy do
 2:   category ← predicted high-level category
 3:   if category ∈ Data Retention then
 4:     RetentionPeriod ← extracted retention period
 5:     if RetentionPeriod ∈ (Stated Period, Limited) then
 6:       DataRetentionIcon ← Green
 7:     else if RetentionPeriod ∈ Indefinitely then
 8:       DataRetentionIcon ← Yellow
 9:     else
10:       DataRetentionIcon ← Red
11:     end if
12:   end if
13:   if category ∈ First Party Collection/Use then ...
14:   end if
15: end for
Ensure: risk icons and their corresponding colors

For the evaluation of our approach, we intend to generate risk factors according to the OPP-115 expert annotations and use them as a gold standard. We believe the final error will be close to the sum of the error rate of the deep learning module (predicting high-level classes) and the error caused by the incomplete set of rules in the rule executor component. Considering that we are now able to predict the correct high-level classes with 79% F1, with a careful definition of the rules for extracting attribute values, we expect to reach a reasonable accuracy at the end of our pipeline.

4 CONCLUSION
In this study, we proposed the application of deep learning models and rule-based information extraction to automatically present a structured view of the risk factors in privacy policies. In particular, we presented a hybrid approach that takes advantage of the OPP-115 dataset. This approach is of paramount importance to support users in consciously agreeing to the terms and conditions of online services, and to perform shallow compliance checking, where a high-risk score can be assigned to "indefinitely" and "unspecified" values. As next steps, we plan to implement the proposed architecture and run empirical evaluations to validate the presented hypothesis, i.e., that users will be more motivated to read privacy policies when a color-coded structured view is presented to them.

REFERENCES
[1] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. G. Shin, and K. Aberer. Polisis: Automated analysis and presentation of privacy policies using deep learning. CoRR, abs/1802.02561, 2018.
[2] J. A. Obar and A. Oeldorf-Hirsch. The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services. Information, Communication & Society, pages 1–20, 2018.
[3] W. B. Tesfay, P. Hofmann, T. Nakamura, S. Kiyomoto, and J. Serna. PrivacyGuide: Towards an implementation of the EU GDPR on internet privacy policy evaluation. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, IWSPA '18, pages 15–21, New York, NY, USA, 2018. ACM.
[4] P. Voigt and A. Von dem Bussche. The EU General Data Protection Regulation (GDPR): A Practical Guide. 1st ed., Cham: Springer International Publishing, 2017.
[5] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, et al. The creation and analysis of a website privacy policy corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1330–1340, 2016.
[6] R. N. Zaeem, R. L. German, and K. S. Barber. PrivacyCheck: Automatic summarization of privacy policies using data mining. ACM Trans. Internet Technol., 18(4):53:1–53:18, Aug. 2018.