<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Measuring Risk Factors in Privacy Policies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Najmeh Mousavi Nejad</string-name>
          <email>nejad@cs.uni-bonn.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damien Graux</string-name>
          <email>damien.graux@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Collarana</string-name>
          <email>diego.collarana.vargas@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IAIS &amp; University of Bonn</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>The ubiquitous availability of online services and mobile apps results in a rapid proliferation of contractual agreements in the form of privacy policies. Despite the importance of such consent forms, the majority of users tend to ignore them due to their length and complexity. Thus, users might be consenting to policies that are not aligned with regulations such as the EU's General Data Protection Regulation (GDPR). In this study, we propose a hybrid approach that measures a privacy policy's risk factor by applying both supervised deep learning and rule-based information extraction. Benefiting from an annotated dataset of 115 privacy policies, a deep learning component is first able to predict high-level categories for each paragraph. A rule-based module then extracts pre-defined attributes and their values based on these high-level classes. Finally, the privacy policy's risk factor is computed from the extracted attribute values.</p>
      </abstract>
      <kwd-group>
        <kwd>Privacy policy</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Rule-based information extraction</kwd>
        <kwd>Risk factor</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In the current digital era, almost everyone is exposed to accepting
contractual agreements in the form of privacy policies. However,
the majority of people skip privacy policies due to their length
and complex terminology. According to a recent survey of 543
university students, only 26% did not choose the ‘quick join’ option
when joining a fictitious social network, and unsurprisingly, their
average reading time was only 73 seconds [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, it is important for the administrative state to validate
the compliance of privacy policies with the corresponding law. For
example, the EU General Data Protection Regulation (GDPR) states
that the retention period must be specified and limited.
      </p>
      <p>To assist end-users in consciously agreeing to the conditions,
we can apply Natural Language Processing (NLP) and Information
Extraction (IE) to present a privacy policy in a structured view.
Our approach applies supervised deep learning, using an annotated
dataset (named OPP-115), to assign high-level classes to a privacy
policy’s paragraphs. Then, according to the predicted classes, we define
hand-coded rules based on the experts’ annotations, to extract attribute
values from each paragraph. Finally, having detailed information for
each paragraph, a risk measurement function computes a risk factor
based on the extracted information. Consequently, a user could choose
to stop using a website if the predicted risk score is high.
Additionally, this structured view can also be used by the administrative
state to perform shallow compliance checking.</p>
      <p>
        OPP-115 is a widely-used dataset in the context of privacy
policy analysis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It contains in-depth annotations for 115 privacy
policies at the paragraph level, and each paragraph was annotated by
three experts. There are two types of annotations: high-level classes,
which define 10 data practice categories; and low-level attributes,
which include mandatory and optional attributes. For instance, the
high-level class First Party Collection/Use has three attributes:
Collection Mode (explicit or implicit), Information Type (financial, health,
contact, location, etc.) and Purpose (advertising, marketing, analytics,
legal requirement, etc.).
      </p>
      <p>The approach proposed in this paper builds upon our
previous effort, which exploits OPP-115 and deep learning to solve a
multi-label classification problem. We feed a privacy policy’s
paragraphs, along with the predicted classes, into a rule-based IE
component and retrieve attribute values. The rules are defined based on
the OPP-115 low-level annotations. Finally, all predicted categories and
extracted information are passed into a risk measurement module,
and a risk factor is computed based on hand-coded rules.</p>
      <p>The paper is divided into the following sections: in Section 2,
we provide an overview of existing efforts on measuring risks in
privacy policies; Section 3 presents our proposed approach and our
evaluation scheme; and finally, Section 4 concludes this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        In light of the now-enforced, EU-wide General Data Protection
Regulation (GDPR) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], there has been an increasing interest in
privacy policy analysis, as this new set of regulations increases the
constraints for companies holding customers’ data. Here, we provide
a brief overview of studies that specifically addressed risk levels in
privacy policies.
      </p>
      <p>
        Polisis is an online service for automatic analysis of privacy
policies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Along with classification and structured presentation
of privacy policies, it assigns privacy icons which are based on
the Disconnect icons. These icons include Expected Use, Expected
Collection, Precise Location, Data Retention and Children Privacy. For
instance, the Data Retention color assignments are: Green for retention
periods of less than a year; Yellow, when the retention period is
longer than one year; and Red, when there is no data retention
policy provided. Polisis benefits from OPP-115 and employs
supervised machine learning to extract high-level categories (in the
above example, Data Retention) and attribute values of each
category (e.g., Retention Period in this case). Finally, based on the retrieved
attribute values and heuristic rules, privacy icons along with their
colors are produced. Currently, Polisis’s interface generates only
a limited set of privacy icons. In the future, we intend to further analyze
privacy icons and extend them with the help of legal experts.
      </p>
      <!-- Residue of the architecture diagram (Deep Learning Model, Training Component, Trained Model, Rules Executor, Risk Measurement Calculator, Rule-Based Pipeline; OPP-115 Dataset providing high-level classes and low-level attributes) removed; the figure is described in Section 3. -->
      <p>
        PrivacyCheck is an approach for automatic summarization of
privacy policies using data mining [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It answers 10 pre-defined
questions concerning the privacy and security of users’ data and is
also available as a Chrome browser extension. In order to train
the model, a corpus containing 400 privacy policies was compiled,
and 7 privacy experts manually assigned risk levels (Green,
Yellow, Red) to the 10 factors. First, a pre-processing step finds those
paragraphs that have at least one keyword related to one of the 10
factors; the methodology for selecting keywords was largely
manual. The selected paragraphs are then sent to a data mining
server where 11 data mining models were trained: one for
checking whether the corresponding page is a privacy policy, and one
for each of the 10 questions. The authors claim that on average, 60% of the
time, PrivacyCheck finds the correct risk level. The limitation of
PrivacyCheck is its lack of Inter-Annotator Agreement (IAA) among
the annotators. According to the paper, quality control was
performed by assigning each policy to two team members. However,
only 15% of the privacy policies were compared and their discrepancies
resolved, which makes the training dataset less reliable.
      </p>
      <p>
        PrivacyGuide is another summarization tool, inspired by the GDPR,
that classifies a privacy policy into 11 categories using NLP and
machine learning and further measures the associated risk level
of each class [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Similar to previous studies, PrivacyGuide uses
a three-level risk scale based on classification (i.e., Green, Yellow,
Red). The 11 criteria and their associated risk levels were defined
by GDPR experts. Based on these criteria, a privacy corpus was
compiled with the help of 35 university students. Each participant
assigned a privacy category to text snippets and classified them
with a risk level. The authors reported that the weighted average
accuracy is 74% for classifying a privacy policy into one of the 11
classes, and the accuracy of risk level detection is 90%. Although
the results are encouraging, the dataset was not annotated by
experts, which is a fundamental criterion in legal text processing
and analysis.
      </p>
    </sec>
    <sec id="sec-3">
      <title>PROPOSED APPROACH</title>
      <p>
        In this section, we provide details of our approach for measuring a
privacy policy’s risk factor. Our proposed method leverages the
OPP-115 annotated dataset for training and evaluation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As discussed
earlier, the OPP-115 high-level annotations are divided into 10 classes:
(1) First Party Collection/Use: how and why the information is
collected.
(2) Third Party Sharing/Collection: how the information may be
used or collected by third parties.
(3) User Choice/Control: choices and controls available to users.
(4) User Access/Edit/Deletion: whether users can modify their information
and how.
(5) Data Retention: how long the information is stored.
(6) Data Security: how users’ data is secured.
(7) Policy Change: whether the service provider will change their policy
and how the users are informed.
(8) Do Not Track: if and how Do Not Track signals are honored.
(9) International/Specific Audiences: practices that target a specific
group of users (e.g., children, Europeans, etc.).
(10) Other: additional practices not covered by the other categories.
In addition, each high-level category includes low-level attribute
annotations. For instance, the Data Retention category is further
annotated with its attributes, which are: Retention Period, Retention
Purpose and Information Type. The annotators provided one
or several values for each attribute, along with the span of text based
on which they chose that specific value(s). In the above
example, Retention Period may have one of the following values: stated
period, limited, indefinitely or unspecified.
      </p>
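      <p>For illustration only, the annotation scheme described above can be sketched as a small in-memory structure; the nesting below is our own illustration, not the OPP-115 file format, and the value lists are abridged.</p>
      <preformat>
```python
# A hedged sketch of the OPP-115 annotation scheme described above.
# Category, attribute and value names come from the dataset description;
# the dict layout itself is illustrative, not the dataset's file format.
OPP115_SCHEMA = {
    "First Party Collection/Use": {
        "Collection Mode": ["explicit", "implicit"],
        "Information Type": ["financial", "health", "contact", "location"],
        "Purpose": ["advertising", "marketing", "analytics", "legal requirement"],
    },
    "Data Retention": {
        "Retention Period": ["stated period", "limited", "indefinitely", "unspecified"],
        "Retention Purpose": [],  # values omitted in this sketch
        "Information Type": [],   # values omitted in this sketch
    },
}

print(OPP115_SCHEMA["Data Retention"]["Retention Period"][2])
```
      </preformat>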
      <p>Figure 1 shows the architecture of our proposed approach which
consists of three main components: 1) a deep learning module is
trained to predict high-level classes of a policy’s paragraphs; 2) a
rule-based pipeline in which the rules are defined based on
low-level attribute annotations of OPP-115; and 3) a risk measurement
function that assigns risk icons along with their corresponding
colors (green, yellow, red), according to extracted information.</p>
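      <p>As a minimal sketch of this three-stage architecture, the composition of the components can be expressed as follows; the two helper functions are toy stand-ins of our own for the trained deep learning model and the rules executor, not their actual implementations.</p>
      <preformat>
```python
# Hedged sketch of the three-stage pipeline: classify paragraphs,
# extract attribute values, then map them to risk icons.
def classify_paragraph(text):
    # toy stand-in for the trained multi-label classifier
    return ["Data Retention"] if "retain" in text.lower() else ["Other"]

def extract_attributes(text, category):
    # toy stand-in for the rule-based attribute extractor
    if category == "Data Retention" and "indefinitely" in text.lower():
        return {"Retention Period": "indefinitely"}
    return {}

def risk_icons(paragraphs):
    # risk measurement: map extracted attribute values to icon colors
    icons = {}
    for paragraph in paragraphs:
        for category in classify_paragraph(paragraph):
            attributes = extract_attributes(paragraph, category)
            if attributes.get("Retention Period") == "indefinitely":
                icons["Data Retention"] = "Yellow"
    return icons

print(risk_icons(["This data is generally retained indefinitely."]))
```
      </preformat>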
      <p>Following conventional ML practices, in the deep learning
component, dataset splits are randomly partitioned into a ratio of 3:1:1
for training, validation and testing respectively, while maintaining
a stratified set of labels. We further decomposed the Other category
into its attributes: Introductory/Generic, Privacy Contact Information
and Practice Not Covered. Therefore, considering that a paragraph
in the dataset may be labeled with more than one category, we
face a multi-label classification problem with 12 classes. The
implementation of the ML component is complete, and we achieve a 79%
micro-average F1 score.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Sample rules for finding values of the Retention Period attribute in the Data Retention category</p>
        </caption>
        <table>
          <thead>
            <tr><th>Rule</th><th>Value</th><th>Sample</th></tr>
          </thead>
          <tbody>
            <tr>
              <td/>
              <td>Stated Period</td>
              <td>1. We remove the entirety of the IP address after 6 months. 2. All stored IP addresses, except the account creation IP address, are deleted after 90 days. The posts and content you made will not be automatically deleted as part of the account removal process.</td>
            </tr>
            <tr>
              <td/>
              <td>Indefinitely</td>
              <td>1. This data is generally retained indefinitely. 2. The information we collect for statistical analysis and technical improvements is maintained indefinitely.</td>
            </tr>
            <tr>
              <td>[store/keep/retain/maintain][Token]*[as long as][Token]+</td>
              <td>Limited</td>
              <td>1. We will retain your information for as long as your account is active or as needed to provide you services. 2. We will retain your personal information while you have an account and thereafter for as long as we need it for purposes not prohibited by applicable laws.</td>
            </tr>
            <tr>
              <td>If not one of the above conditions</td>
              <td>Unspecified</td>
              <td>1. We receive and store certain types of information whenever you interact with us. 2. The personal information collected about you through our online applications and in our communications with you is stored in our internal database.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
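      <p>The micro-averaged F1 used here pools true and false positives over all 12 labels before computing the score; a minimal sketch (the two example label sets below are made up for illustration):</p>
      <preformat>
```python
# Hedged sketch of micro-averaged F1 for a multi-label setting:
# pool true positives, false positives and false negatives over all
# labels, then compute a single F1 from the pooled counts.
def micro_f1(gold, pred):
    tp = sum(len(g.intersection(p)) for g, p in zip(gold, pred))
    fp = sum(len(p.difference(g)) for g, p in zip(gold, pred))
    fn = sum(len(g.difference(p)) for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn)

# Two illustrative paragraphs with gold and predicted label sets.
gold = [{"Data Retention"}, {"First Party Collection/Use", "Data Security"}]
pred = [{"Data Retention"}, {"First Party Collection/Use"}]
print(micro_f1(gold, pred))
```
      </preformat>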
      <p>The high-level predicted classes are passed to the rule-based
component, where low-level attribute values are extracted. The
rules are defined based on the experts’ annotations in the OPP-115
dataset. We intend to use 60% of the low-level annotations for defining
the rules, 20% for validating the defined rules and the remaining
20% for the final test. Table 1 shows some sample rules for finding
values of the Retention Period attribute in the Data Retention category.
As shown in the table, the rule definitions use the knowledge about
high-level categories predicted by the deep learning component.</p>
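      <p>For illustration, the Limited rule from Table 1 could be realized as the following regular-expression sketch; this is a simplification of our rules executor (the helper name is ours), and the rules for the other Retention Period values are omitted.</p>
      <preformat>
```python
# Hedged sketch of the [store/keep/retain/maintain][Token]*[as long as][Token]+
# rule from Table 1, which maps a Data Retention sentence to "limited".
import re

LIMITED = re.compile(
    r"\b(store|keep|retain|maintain)\w*\b.*\bas long as\b.+",
    re.IGNORECASE,
)

def retention_period(sentence):
    # simplified: only the "limited" rule is sketched here
    if LIMITED.search(sentence):
        return "limited"
    return "unspecified"  # fallback: no hand-coded rule matched

print(retention_period(
    "We will retain your information for as long as your account is active."
))
```
      </preformat>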
      <p>Algorithm 1 Sketch of the risk measurement algorithm
Require: predicted high-level category, extracted attribute values
1: for all paragraphs in the privacy policy do
2:   category ← predicted high-level category
3:   if category ∈ Data Retention then
4:     RetentionPeriod ← extracted retention period
5:     if RetentionPeriod ∈ {Stated Period, Limited} then
6:       DataRetentionIcon ← Green
7:     else if RetentionPeriod ∈ {Indefinitely} then
8:       DataRetentionIcon ← Yellow
9:     else
10:      DataRetentionIcon ← Red
11:    end if
12:  end if
13:  if category ∈ First Party Collection/Use then ...
14:  end if
15: end for
Ensure: risk icons and their corresponding colors</p>
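      <p>The Data Retention branch of Algorithm 1 can be rendered directly in code; the function below is a hedged sketch of that branch only, with the value and color names taken from the algorithm.</p>
      <preformat>
```python
# Hedged rendering of Algorithm 1's Data Retention branch: map the
# extracted Retention Period value to a Disconnect-style icon color.
def data_retention_icon(retention_period):
    if retention_period in ("stated period", "limited"):
        return "Green"
    if retention_period == "indefinitely":
        return "Yellow"
    return "Red"  # no retention policy provided / unspecified

print(data_retention_icon("indefinitely"))
```
      </preformat>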
      <p>Having information about the attribute values, the risk measurement
module is able to assign appropriate risk icons along with their
corresponding colors. As a proof-of-concept, we will base our
risk measurement rules on the Disconnect icons. As mentioned in the
literature review, the Disconnect Data Retention color assignments
are as follows: Green for a retention period &lt;= 12 months; Yellow,
for a retention period &gt; 12 months; and Red, when there is no data
retention policy provided. Algorithm 1 shows our interpretation of
the Data Retention icon. It is worth mentioning that our interpretation
is based on the available annotations from the OPP-115 dataset. Hence,
it is not the only representation that can be built from the Disconnect
icons, and others may adopt their own understanding.</p>
      <p>For the evaluation of our approach, we intend to generate risk
factors according to the OPP-115 experts’ annotations and use them as a
gold standard. We believe the final error will be close to the sum of the
error rate of the deep learning module (predicting high-level classes) and
the error caused by the incomplete set of rules in the rule
executor component. Considering the fact that we are now able to
predict the correct high-level classes with 79% F1, with a careful
definition of rules for extracting attribute values, we expect to
achieve reasonable accuracy at the end of our pipeline.</p>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSION</title>
      <p>In this study, we proposed the application of deep learning models
and rule-based information extraction to automatically present a
structured view of risk factors in privacy policies. In particular, we
presented a hybrid approach that takes advantage of the OPP-115
dataset. This approach is of paramount importance in supporting
users to consciously agree with the terms and conditions of online
services, and in performing shallow compliance checking, where a
high-risk score can be assigned to “indefinitely” and “unspecified” values.
As next steps, we plan to implement the proposed architecture
and run empirical evaluations to validate the presented hypothesis,
i.e., that users will be more motivated to read privacy policies when a
color-coded structured view is presented to them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Harkous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fawaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lebret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schaub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Shin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          .
          <article-title>Polisis: Automated analysis and presentation of privacy policies using deep learning</article-title>
          .
          <source>CoRR</source>
          , abs/1802.02561,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Obar</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Oeldorf-Hirsch</surname>
          </string-name>
          .
          <article-title>The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services</article-title>
          .
          <source>Information, Communication &amp; Society</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Tesfay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kiyomoto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Serna</surname>
          </string-name>
          .
          <article-title>PrivacyGuide: Towards an implementation of the EU GDPR on internet privacy policy evaluation</article-title>
          .
          <source>In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics</source>
          ,
          <source>IWSPA '18</source>
          , pages
          <fpage>15</fpage>
          -
          <lpage>21</lpage>
          , New York, NY, USA,
          <year>2018</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Voigt</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Von dem Bussche</surname>
          </string-name>
          .
          <source>The EU General Data Protection Regulation (GDPR): A Practical Guide</source>
          , 1st Ed., Cham: Springer International Publishing,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schaub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Dara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cherivirala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Leon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zimmeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Sathyendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Russell</surname>
          </string-name>
          , et al.
          <article-title>The creation and analysis of a website privacy policy corpus</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>1330</fpage>
          -
          <lpage>1340</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Zaeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>German</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Barber</surname>
          </string-name>
          .
          <article-title>PrivacyCheck: Automatic summarization of privacy policies using data mining</article-title>
          .
          <source>ACM Trans. Internet Technol</source>
          .,
          <volume>18</volume>
          (
          <issue>4</issue>
          ):
          <fpage>53:1</fpage>
          -
          <lpage>53:18</lpage>
          , Aug.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>