Toward Domain-Guided Controllable Summarization of
                            Privacy Policies
              Moniba Keymanesh                                               Micha Elsner                        Srinivasan Parthasarathy
              keymanesh.1@osu.edu                                       elsner.14@osu.edu                          parthasarathy.2@osu.edu
             The Ohio State University                               The Ohio State University                     The Ohio State University

ABSTRACT                                                                               concatenating the most important sentences in the document. The
Companies’ privacy policies are often skipped by the users as they                     abstractive systems are more flexible while the extractive models
are too long, verbose, and difficult to comprehend. Identifying the                    enjoy better factuality [13]. However, existing summarization tech-
key privacy and security risk factors mentioned in these unilateral                    niques perform poorly on contracts. Unsupervised methods [14, 15]
contracts and effectively incorporating them in a summary can                          rely on structural features of documents, such as lexical repetition,
assist users in making a more informed decision when asked to                          to identify and extract important content. These heuristics work
agree to the terms and conditions. However, existing summarization                     poorly on the legal language used in contracts [16]. Supervised
methods fail to integrate domain knowledge into their framework                        methods [7, 9, 17] can learn to cope with the features of a particular
or rely on a large corpus of annotated training data. We propose a                     domain. However, training these complex neural summarization
hybrid approach to identify sections of privacy policies with a high                   models with thousands of parameters requires a large corpus of
privacy risk factor. We incorporate these sections into summaries                      documents and their summaries. Currently existing corpora in the
by selecting the riskiest content from different privacy topics. Our                   legal domain are not large enough to train such models. We pro-
approach enables users to select the content to be summarized                          pose a hybrid approach for extractive summarization of privacy
within a controllable length. Users can view a summary that cap-                       contracts: using existing annotated resources, we train a classifier
tures different privacy factors or a summary that covers the riskiest                  to predict which pieces of content are most relevant to users [1].
content. Our approach outperforms the domain-agnostic baselines                        In particular, we identify parts of the contract which place users
by up to 27% in ROUGE-1 score and 50% in METEOR score using                            at risk by imposing unsafe data practices on them, such as selling
plain English reference summaries while relying on significantly                       email addresses to third parties or allowing the company to appro-
less training data in comparison to abstractive approaches.                            priate user-generated content. Next, we use this risk classifier for
                                                                                       content selection within an extractive summarization pipeline. The
ACM Reference Format:
                                                                                       classifier is substantially less expensive than learning to summa-
Moniba Keymanesh, Micha Elsner, and Srinivasan Parthasarathy. 2020.
Toward Domain-Guided Controllable Summarization of Privacy Policies. In
                                                                                       rize directly but enables our approach to outperform a selection of
Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop,             domain-agnostic unsupervised summarization methods.
24 August 2020, San Diego, US. ACM, New York, NY, USA, 7 pages.                           Prior computational work on privacy policies has used infor-
                                                                                       mation extraction and natural language processing methods to
1     INTRODUCTION AND RELATED WORK                                                    classify segments of these documents into different data practice
                                                                                       categories [18–20]. Another trajectory of work has focused on pre-
Privacy policies and terms of service are unilateral contracts by which
                                                                                       senting a graphical “at-a-glance” description of the privacy policies
companies are required to inform users about their data collection,
                                                                                       to the user. For example, PrivacyGuide [21] and PrivacyCheck [22]
processing, and sharing practices. Users are required to agree to
                                                                                       define a few privacy factors and map each factor to a risk level
abide by the terms before they can use any service. However, many
                                                                                       using a data mining model. Relying on these “at-a-glance” descrip-
users do not read or understand these contracts [1]. Thus, they often
                                                                                       tion methods raises several concerns. First, there is no way for the
end up consenting to terms that may not be aligned with legislation
                                                                                       user to check the factuality of the predicted risk classes or inter-
such as the General Data Protection Regulation (GDPR)1 [2]. This
                                                                                       pret the reasoning behind them. Moreover, users tend to have an
behavior is often because these contracts are too long and difficult
                                                                                       easier time comprehending the content when provided in natu-
to comprehend [3]. Summarization is an intuitive way to assist users
                                                                                       ral language. Researchers also have focused on assigning a risk
with conscious agreement by generating a condensed equivalent
                                                                                       factor–green, yellow, or red–to each segment of the privacy poli-
of the content. Broadly, there are two main lines of summarization
                                                                                       cies [23, 24]. However, summarizing the text may benefit users
systems: abstractive and extractive. The abstractive paradigm [4–
                                                                                       more than directly presenting the classifier output. We draw on
10] aims to create an abstract representation of the input text and
                                                                                       these approaches in building our own classifier. The first module of
involves various text rewriting operations such as paraphrasing,
                                                                                       our framework extends prior work [23, 24] to highlight segments
deletion, and reordering. The extractive paradigm [11, 12], on the
                                                                                       of privacy policies that have a higher risk. We employ a pre-trained
other hand, creates a summary by identifying and subsequently
                                                                                       encoder and convolutional neural network to classify sentences
1 https://eugdpr.org/
                                                                                       of the contracts into different risk levels. To address the limita-
                                                                                       tions of previous work, we incorporate the domain information
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).                                     predicted by the classifier in the form of a summary by comparing
NLLP @ KDD 2020, August 24th, San Diego, US                                            a risk-focused and a coverage-focused content selection mecha-
© 2020 Copyright held by the owner/author(s).                                          nism. The coverage-focused selection mechanism aims to reduce


the information redundancy by covering the riskiest sentence from         by repeatedly applying the convolution filter 𝑤 to a window of tokens
each privacy topic. We evaluate the effectiveness of employing a          $t_{i:i+h-1}$. Each element $c_i$ in feature map $c = [c_1, c_2, \ldots, c_{n-h+1}]$
classifier on identifying the domain knowledge for summarization.         is then obtained from:
We also evaluate the quality of summaries extracted by our two
content selection criteria. Using our approach users can view a                                  $c_i = f(w . A[i : i + h - 1] + b)$
summary that captures different privacy factors or a summary that
                                                                          where 𝐴[𝑖 : 𝑗] is the sub-matrix of 𝐴 from row 𝑖 to 𝑗 corresponding
covers the riskiest content. We release our dataset of 151 privacy
                                                                          to a window of tokens 𝑡𝑖 to 𝑡 𝑗 and "." represents the dot product
policies annotated with risk labels to assist future research.
                                                                          between the filter 𝑤 and the sub-matrices. 𝑏 ∈ 𝑅 represents the
2     METHODOLOGY                                                         bias term and 𝑓 is an activation function such as a rectified linear
Given a privacy policy document 𝐷 consisting of a sequence of 𝑛           unit. We use multiple kinds of filters by using various region sizes.
sentences {𝑠 1, 𝑠 2, ...𝑠𝑛 } and a sentence budget 𝑚 such that 𝑚 < 𝑛      This extracts various types of features from bigrams, trigrams, and
our summarization model extracts a risk-aware summary with                so on. The dimensionality of the feature map 𝑐 generated by each
𝑚 sentences. For each sentence 𝑠𝑖 ∈ 𝐷 we predict a binary label           convolution filter is different for sentences with various lengths
𝑦𝑖 (where a value of 1 means 𝑠𝑖 is included in the summary). We           and filters with different heights. We apply a max-over-time [25]
achieve this by computing an inclusion probability 𝑝 (𝑦𝑖 |𝑠𝑖 , 𝐷, 𝜃 )     pooling operation to downsample each feature map 𝑐 by taking the
for each sentence 𝑠𝑖 . 𝜃 are the model’s parameters. We aim to max-       maximum value over the window defined by a pool size 𝑝. The max-
imize the inclusion probability for risky sections of the privacy         pooling operation naturally deals with variable sentence lengths.
policies and minimize it for non-risky sections. We also would like       The outputs generated from each filter map are concatenated to
to cover different privacy factors within the sentence budget 𝑚 by        build a fixed-length feature vector for the penultimate layer. This
reducing the redundancy. The main intuition behind our proposed           feature vector is then fed to a fully connected softmax layer that
approach is that users when going through the privacy policies are        predicts a probability distribution over the risk level categories. We
most interested in knowing how their information can potentially          apply dropout [30] as a means of regularization in the softmax layer.
be abused [1]. Thus, a condensed equivalent of the terms should           Our objective is to minimize the binary cross-entropy. The trainable
include such risky sections. Next, we explain the architecture of         model parameters include the weight vectors 𝑤 of the filters, the
our risk prediction model and our content selection mechanisms.           bias term 𝑏 in the activation function, and the weight vector of the
                                                                          softmax function. We minimize the loss using stochastic gradient
2.1     Risk Prediction                                                   descent and back-propagation [31].

Given the content of privacy policies, the first step in our frame-       2.1.2 Pretrained Word Vectors. Prior research indicates that
work is to identify the associated risk class with each sentence of       better word representations can improve performance in a vari-
the contract. We rely on a crowd-sourcing project called TOS;DR2          ety of natural language understanding (NLU) tasks [32]. We use
to automatically annotate 151 privacy contracts. TOS;DR has annotated     ELMo [29], a deep contextualized word representation model, to
several snippets of privacy contracts based on the average                map each token 𝑡𝑖 in sentence 𝑠𝑖 in contract 𝐷 to its correspond-
Internet user’s perception of risk. We explain our dataset extraction     ing contextual embedding 𝑣𝑖 with length 1024 3 . ELMo uses a bi-
in section 3. We use this dataset to train our risk classifier. Prior     directional LSTM [34] for language modeling and considers the
research has exploited word embeddings and Convolutional Neural           context of the words when assigning them to their embeddings4 .
Networks (CNN) for sentence classification [25–28]. These simple
architectures achieve strong empirical performance over a range of
                                                                          2.2      Content Selection and Redundancy
text classification tasks. Our model is a slight variant of the CNN                Reduction
architecture proposed in [25].                                            Given the probability distributions over the risk categories, we
2.1.1 Model architecture. Let $s_j = \{t_1, t_2, \ldots, t_n\}$ be the $j$-th sentence   apply two content selection mechanisms to account for the summarization
                                                                          budget 𝑚 and minimize the information redundancy.
in the contract 𝐷 and $v_i \in R^d$ be the $d$-dimensional vector
                                                                          The first mechanism focuses on including the most "risky" sections
representation of token $t_i$ in this sequence. Word representations
                                                                          while the second mechanism focuses on covering diverse privacy
are the output of a pretrained encoder [29] and will be discussed in Section
                                                                          factors. Next, we explain these two variations of our model.
2.1.2. We build the sentence matrix $A \in R^{n \times d}$ by concatenating
the word vectors $v_1$ to $v_n$:                                          2.2.1 Risk-Focused Content Selection: Given a privacy policy
                 $A_{1:n} = v_1 \oplus v_2 \oplus \ldots \oplus v_n$       contract 𝐷 with sentences $\{s_1, \ldots, s_n\}$, a summarization budget 𝑚,
                                                                          and risk score 𝑝 (𝑦𝑖 = 1|𝑠𝑖 , 𝐷, 𝜃 ) predicted for 𝑠𝑖 by the risk classifier,
    Following [25] we apply convolution filters to this matrix to         the risk-focused selection mechanism assembles a summary by
produce new features. The length of the filters is equal to the di-       extracting the top 𝑚 sentences that have the highest risk score.
mensionality of the word vectors 𝑑. The height or region size of the
filter is denoted by ℎ and is the number of rows (word vectors) that      3 Model was trained on the One billion word benchmark [33] and was obtained from
are considered jointly when applying the convolution filter. The feature  https://github.com/allenai/allennlp
map $c \in R^{n-h+1}$ of the convolution operation is then obtained       4 BERT [35], the current state of the art for language model pretraining, has achieved
                                                                          strong results in many NLU tasks with minimal fine-tuning. However, our preliminary
                                                                          results of fine-tuning BERT did not outperform our results from ELMo word vectors
2 https://TOS;DR.org                                                      and the task-specific architecture explained in Section 2.1.1.
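
The convolution and pooling steps of Section 2.1.1 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `feature_map` and `max_over_time`, the toy matrix, and the filter values are all hypothetical, and the pooling here takes the maximum over the whole feature map (the paper instead pools over a window of size 𝑝).

```python
from typing import Callable, List

Matrix = List[List[float]]


def relu(x: float) -> float:
    """Rectified linear unit, used as the activation function f."""
    return max(0.0, x)


def feature_map(A: Matrix, w: Matrix, b: float,
                f: Callable[[float], float] = relu) -> List[float]:
    """Compute c_i = f(w . A[i : i+h-1] + b) for every window of h rows of A.

    A is the n x d sentence matrix, w is an h x d filter, b is the bias.
    The result has n - h + 1 entries.
    """
    n, h = len(A), len(w)
    c = []
    for i in range(n - h + 1):
        window = A[i:i + h]
        # Dot product between the filter and the h x d sub-matrix.
        dot = sum(w[r][k] * window[r][k]
                  for r in range(h) for k in range(len(w[r])))
        c.append(f(dot + b))
    return c


def max_over_time(c: List[float]) -> float:
    """Keep the largest activation (assumption: one pool over the whole map)."""
    return max(c)


# Toy sentence with n = 4 tokens and d = 2 dimensions (illustrative values).
A = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
w = [[1.0, 1.0], [1.0, -1.0]]  # one filter with region size h = 2
c = feature_map(A, w, b=0.0)
pooled = max_over_time(c)
```

Because the pooled value is a single number per filter regardless of 𝑛, concatenating the pooled outputs of all filters yields a fixed-length vector, which is why the pooling step handles variable sentence lengths.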


2.2.2 Coverage-Focused Content Selection: Given a privacy                   for designing the risk classifier, and the training details. We discuss
policy contract 𝐷 with sentences {𝑠 1, ...𝑠𝑛 }, a summarization budget      our evaluation criteria in Section 4.2.
𝑚, and risk scores 𝑝 (𝑦𝑖 = 1|𝑠𝑖 , 𝐷, 𝜃 ), the coverage-focused selection    4.1     Hyperparameters and Training Details
method finds 𝑚 privacy factors by clustering sentences for which
the risk score is larger than a predefined value of 𝛼. Next, the riskiest   For the CNN model, we use two filter region sizes 3 and 4 each
sentence from each privacy factor cluster is selected to be included        of which has 50 output filters. We use rectified linear unit as the
in the summary. Note that if fewer than 𝑚 sentences have a risk             activation function of the convolution layer. The pool size in the
score greater than 𝛼, the summary will have fewer than 𝑚 sentences.         max pooling operation is set to 50. We apply dropout with a rate
To find privacy topics of a contract, we apply k-means [36] to              of 20%. We optimize the binary cross-entropy loss using stochastic
sentence representations. Sentence representations are obtained             gradient descent with a learning rate of 0.01. To account for the
through concatenating the word vectors. The number of clusters is set       class imbalance problem, we randomly under-sample the majority
to $\min(m, |r|)$ where $r = \{s_i \mid p(y_i = 1) > \alpha\}$.             class (non-risky) with a rate of 10%. We also apply SMOTE
                                                                            oversampling [37] on the minority class (risky) with rate 50%. We train
3     DATASET EXTRACTION                                                    our model on this resampled dataset for 20 epochs and weight the
                                                                            loss function inversely proportional to class frequencies in the input
In this section, we explain the dataset that we compiled from the           data. To set the value of risk threshold 𝛼 in the content selection
TOS;DR website and privacy contracts of 151 companies. TOS;DR               module, we used the ROC curve of the validation set of each fold.
is a website dedicated to rating and explaining the privacy policies of     We set 𝛼 for each fold to the threshold value that achieves an 80% true
companies in plain English. Members of the website’s commu-                 positive rate.
nity classify specific sections of privacy policies into "bad", "good",
"blocker", and "neutral" categories and provide summaries for them.
We collected the user agreement contracts of 151 services that were
                                                                            4.2     Evaluation Metrics
annotated on TOS;DR from the companies’ websites. Some compa-               In our experiments, we seek to answer two questions: i. how well
nies have several such contracts e.g. privacy policy, terms of service,     does our model identify the risky sentences in the contracts? and
and cookie policy. In this case, all the contracts were merged into a       ii. what content selection method leads to more "human-like" sum-
single document. Next, we compared each sentence of the contract            maries? To answer the first question we report the Macro-F1 and
with specific snippets that were annotated on TOS;DR. If the cor-           Micro-F1 score of our classifier. To answer the second question,
responding sentence or a very similar sentence was annotated by             we evaluate the quality of the extracted summaries by our model
the TOS;DR contributors, the same label was used. Otherwise, it             by computing the average F1-score for ROUGE-1, ROUGE-2, and
was annotated as "neutral". The assumption behind our annotation            ROUGE-L [38] metrics (which respectively measure the unigram
schema is that, if a section was not annotated by the contributors, it      overlap, bigram overlap, and longest common subsequence between
most likely does not include a privacy risk and thus, is considered         the reference summary and the summary to be evaluated). ROUGE
neutral. NLTK was used to segment the contracts into sentences.             metrics fail to capture semantic similarity beyond n-grams [39].
Jaccard similarity of the vocabulary was used to measure the simi-          Thus, we also report the METEOR score [40] which goes beyond
larity of the sentences. Two sentences from the same contract were          the surface matches and accounts for stems and synonyms while
considered similar if the Jaccard similarity of their tokens was more       finding the matches.6 We evaluate our model using 5-fold cross-
than 50%. We combined the "bad" and "blocker" sections to build the         validation. In each fold, contracts of 96 companies are used for
"risky" class. The "good" and "neutral" classes were also combined          training, 24 contracts are used for validation, and the rest is used
to build the "non-risky" class. This dataset is highly imbalanced           for testing. We explain our baselines in Section 4.3 and our experi-
with 61674 non-risky sentences and only 719 risky sentences. To             mental results in Section 5.
build the ground truth risk-aware summary of each privacy policy
we concatenate the plain English summaries of the snippets that             4.3     Summarization Baselines
have a "risky" label. The dataset statistics of the 151 privacy policies    We compare the performance of our domain-aware extractive sum-
and their corresponding summaries are presented in Table 1. Our             marization model with the following unsupervised baselines. Un-
dataset is available online 5 .                                             like the evaluation setup in [16], we run the models on the entire
                                                                            contract. For methods that require a word limit as the budget, a
    Dataset                          Min      Max      Median     Mean
                                                                            compression ratio 𝑟 is multiplied by the average number of to-
    Privacy Policies                 61       1707          350   411.6     kens in all contracts (10488.7) to compute the word limit. Similarly,
    Plain English Summaries          1         53            1     3.5      the compression ratio of 𝑟 is multiplied by the average number of
                                                                            sentences in all contracts (413.1) to build a sentence limit.
Table 1: The min, max, median, and average number of sentences
in 151 privacy contracts and their summaries.                                     • TextRank: An algorithm introduced in [14] that uses
                                                                                    PageRank to compute an importance score for each sentence. Sentences
                                                                                    with the highest importance score are then extracted
4     EXPERIMENTS
                                                                                    to build a summary until a word limit is satisfied.
In this section, we discuss our data augmentation mechanism to
reduce the data imbalance problem, our hyperparameter choices
                                                                            6 We use the pyrouge and NLTK Python packages for computing ROUGE and METEOR
5 www.github.com/senjed/Summarization-of-Privacy-Policies                   values respectively.
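
The two content selection mechanisms of Section 2.2 can be sketched as follows. This is an illustrative reading of the description, not the released code: the function names, the deterministic k-means initialisation (first 𝑘 points), and the toy scores and vectors are assumptions, and sentence vectors are taken to be any fixed-length representation.

```python
from typing import Dict, List, Sequence


def risk_focused(scores: Sequence[float], m: int) -> List[int]:
    """Indices of the m sentences with the highest risk score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:m]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iters: int = 20) -> List[int]:
    """Plain Lloyd's k-means; centroids start at the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def coverage_focused(scores: Sequence[float], vectors: List[Sequence[float]],
                     m: int, alpha: float) -> List[int]:
    """Cluster sentences with score > alpha and keep the riskiest one per cluster."""
    risky = [i for i, s in enumerate(scores) if s > alpha]
    if not risky or m <= 0:
        return []
    k = min(m, len(risky))  # number of clusters is min(m, |r|)
    assign = kmeans([vectors[i] for i in risky], k)
    best: Dict[int, int] = {}
    for idx, cluster in zip(risky, assign):
        if cluster not in best or scores[idx] > scores[best[cluster]]:
            best[cluster] = idx
    return sorted(best.values())


# Toy example: sentences 0 and 1 are near-duplicates, as are 3 and 4.
scores = [0.9, 0.8, 0.1, 0.7, 0.6]
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 10.0], [10.1, 10.0]]
rf_summary = risk_focused(scores, m=2)
cf_summary = coverage_focused(scores, vectors, m=2, alpha=0.5)
```

On this toy input the risk-focused selection returns the two near-duplicate sentences, while the coverage-focused selection returns one sentence from each cluster, which is the redundancy reduction the second mechanism is designed for.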


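The label transfer described in Section 3 can be sketched as below, under stated assumptions: tokens are obtained by lowercasing and whitespace splitting (the paper does not specify the tokenizer used for the Jaccard computation), and `label_sentence`, `to_binary`, and the example snippet are hypothetical names and data introduced for illustration.

```python
from typing import Iterable, List, Tuple


def jaccard(a_tokens: Iterable[str], b_tokens: Iterable[str]) -> float:
    """Jaccard similarity between two token vocabularies."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def label_sentence(sentence: str, snippets: List[Tuple[str, str]],
                   threshold: float = 0.5) -> str:
    """Return the label of the most similar annotated snippet.

    A sentence inherits a snippet's label only if their Jaccard similarity
    exceeds the threshold (50% in the paper); otherwise it is "neutral".
    """
    tokens = sentence.lower().split()  # assumption: whitespace tokenization
    best_label, best_sim = "neutral", threshold
    for text, label in snippets:
        sim = jaccard(tokens, text.lower().split())
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


def to_binary(label: str) -> str:
    """Merge "bad" and "blocker" into risky; "good" and "neutral" into non-risky."""
    return "risky" if label in {"bad", "blocker"} else "non-risky"


# Hypothetical annotated snippet and contract sentences.
snippets = [("we may sell your email address to third parties", "bad")]
first = label_sentence("We may sell your email address to third parties.", snippets)
second = label_sentence("You can contact us by email.", snippets)
```

The first sentence shares 8 of 10 distinct tokens with the snippet (similarity 0.8), so it inherits the "bad" label; the second shares none and falls back to "neutral".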
                         Compression Ratio = 1/64                      Compression Ratio = 1/16
                 P        R      Macro-F1   Micro-F1           P        R      Macro-F1   Micro-F1
    CNN + RF   22.40    28.13     61.94      98.01            9.86    59.74     56.65      93.10
    CNN + CF   19.64    24.06     60.26      97.95           12.19    52.65     58.51      94.94
Table 2: Precision (P), Recall (R), Macro-F1, and Micro-F1 of the CNN classifier with two different content selection mechanisms,
risk-focused (RF) and coverage-focused (CF), at two different compression ratios, 1/64 and 1/16.

      • KLSum: Introduced in [15], KLSum aims to minimize the               two times better in terms of recall. When the compression ratio
        Kullback-Leibler (KL) divergence between the input document         is 1/16, the risk-focused method captures many more risky sections
        and proposed summary by greedily selecting sentences.               and achieves a recall of 59.74. However, with this increase in recall,
      • Lead-K: A common baseline in news summarization that                the false positive rate also increases. On the other hand, the
        extracts the first k sentences of the document until a word         coverage-focused method is better at preserving the precision at
        limit is reached.                                                   higher budgets (only a 7.45-point drop in precision with a 28.59-point increase
      • Random: This baseline picks random sentences of the document        in recall). This is because coverage-focused content selection only extracts
        until a word limit is satisfied. For this baseline, we              sentences with a risk score greater than 𝛼,
        report the average results over 10 runs.                            which naturally puts an upper bound on the false positive rate.
      • Upper Bound Baseline: This baseline picks all the sen-              We conclude that both mechanisms are moderately successful at
        tences in a contract with ground truth label "risky". This          identifying the risky sections of contracts. We also conclude that at
        baseline indicates the performance upper bound of an ex-            higher compression ratios, the risk-focused mechanism can be used
        tractive method on our dataset.                                     where recall is more essential while the coverage-focused mecha-
                                                                            nism can be used when precision is more of interest. In the next
5     RESULTS                                                               section, we examine whether the domain information given by the
In this section, we discuss our experiments conducted using 5-fold          risk classifier can improve the quality of summaries in comparison
cross-validation. We shared our training details in Section 4.1. As         to domain-agnostic extractive summarization baselines.
an example, summaries extracted by our model and the baselines              5.2    Summarization Results:
from the privacy policy of Brainly 7 are displayed in Figure 1. It can be
                                                                            In this section, we evaluate the quality of the summaries extracted
seen that both of the summaries generated by our method indi-
                                                                            by our model and the baselines. We introduced our evaluation met-
cate that third party advertising companies will be able to collect
                                                                            rics in Section 4.2 and our baselines in Section 4.3. We compare
information about use of Brainly. KLSum misses this information
                                                                            the summaries against two types of reference summaries. The first
and the traditional lead-k heuristic which is very effective for news
                                                                            type of summary is built by assembling all the sentences that have
performs poorly on the contracts. This indicates the advantage of
                                                                            ground truth "risky" label. These sentences are derived directly
injecting domain-specific knowledge into content selection.
                                                                            from text of the contract. We will refer to this reference summary
                                                                            as "quote text" reference. The second type of summary is derived
5.1      Classification Results:
                                                                            by assembling the plain English summary of the "risky" sections
In this section, we evaluate the performance of our model discussed         written by the TOS;DR contributors. The summarization results
in Section 2.1.1 and study the effect of different content selection        using the quote text summaries are presented in Table 3. The summarization
mechanisms on the risk prediction task. We evaluate our summaries           results using the plain English reference summaries are
at two compression ratios of 1/64 and 1/16. The summarization budget
                                                                            presented in Table 4.
𝑚 at each compression ratio 𝑟 is obtained by multiplying 𝑟 by the average
number of sentences (or words) in the contracts. Thus, at the               5.2.1 Extracting the risky content: As can be seen in Table 3,
compression ratio of 1/64, summaries are restricted to the maximum          at both compression ratios, both variations of our model outperform
                                                                            the baselines. At a compression ratio of 1/64, CNN + RF achieves
length of 6 sentences or 164 words. Similarly, at the compression
ratio of 1/16, summaries are limited to the maximum length of 29 sentences
                                                                            the best ROUGE and METEOR results with 49.8% improvement
or 656 words. We report the precision, recall, Micro-F1, and                in ROUGE-1, 124.6% improvement in ROUGE-2, 56.3% improvement
Macro-F1 of our risk classifier with two different content selection        in ROUGE-L, and 65.6% improvement in METEOR in comparison
mechanisms, namely risk-focused (RF) and coverage-focused (CF),             to the best performing domain-agnostic baseline for each
                                                                            metric. At a compression ratio of 1/16, CNN + CF achieves the best
in Table 2. As can be seen in the table, the Micro-F1 scores of both
content selection methods are quite high. However, the best Macro-          ROUGE results by improving ROUGE-1 by 12.2%, ROUGE-2 by
F1 value is achieved by the risk-focused approach and is 61.94. The         30.2%, ROUGE-L by 8.8%, and METEOR by 23.7% in comparison
large gap between the two values is due to the high level of class          the the best performing baseline for each metric. The improve-
imbalance in our dataset (1 positive sample for every 100 negative          ment in METEOR score is found to be statistically significant using
samples). At 64 1 compression ratio, risk-focused performs more than        Wilcoxon signed ranked test [41] with p-value < 0.01 (Bonferroni
                                                                            corrected [42] to account for multiple testing). Similar to our obser-
7 https://Brainly.com                                                       vation in classification task, we find that the risk-focused content
Toward Domain-Guided Controllable Summarization of
Privacy Policies                                                                                                    NLLP @ KDD 2020, August 24th, San Diego, US


 Plain English Summary: The Privacy Policy states, "We and our third party partners may also use cookies and tracking technologies for advertising
 purposes.". In the Privacy Policy, it states that, "Although we do our best to honor the privacy preferences of our users, we are unable to respond to Do
 Not Track signals set by your browser at this time." The Privacy Poilicy says Brainly can track usage information and personal information "through a
 variety of tracking technologies, including cookies, web beacons, Locally Stored Objects (LSOs such as Flash or HTML5), log files, and similar technology
 (collectively, “tracking technologies”)." If Brainly aims to "preserve all content posted on the site," then we can conclude that such personal data is still
 necessary for the purpose of the site. There are places on the site where answers without usernames or profile pictures are visible. The Cookie Policy
 states, "Service oparator [sic] informs that restricting the use of cookies may affect some of the functionalities available on the Website." For users not in
 europe, brainly reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the brainly services, your brainly
 subscription, and/or any products, services, functionality, information, content or other material. <truncated>

 CNN + RF: We participate in interest-based advertising and use third party advertising companies to serve you targeted advertisements based on your
 online browsing history and your interests. We permit third party online advertising networks, social media companies and other third party services,
 to collect, information about your use of our service over time so that they may play or display ads on our service, on other websites, apps or services
 you may use, and on other devices you may use. We may share a common account identifier (such as an email address or user id) or hashed data with
 our third party advertising partners to help identify you across devices. Brainly reserves the right to moderate the Brainly services and to remove, screen,
 or edit your content from the Brainly services at our sole discretion, at any time, and for any reason or for no reason, with no notice to you. Brainly
 reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the Brainly services, your Brainly subscription,
 and/or any products, services, functionality, information, content or other materials available on, through or in connection with the Brainly services
 and/or your Brainly subscription, including, but not limited to, the mobile software, and/or your access to some or all of them without cause and without
 notice. In the event that Brainly suspends or terminates your account, the Brainly services or your Brainly subscription, you acknowledge and agree
 that you shall receive no refund or exchange for any unused time on a Brainly subscription or any subscription fees or anything else.

 CNN + CF: We participate in interest-based advertising and use third party advertising companies to serve you targeted advertisements based on
 your online browsing history and your interests. We permit third party online advertising networks, social media companies and other third party
 services, to collect, information about your use of our service over time so that they may play or display ads on our service, on other websites, apps
 or services you may use, and on other devices you may use. We may share a common account identifier (such as an email address or user id) or hashed
 data with our third party advertising partners to help identify you across devices. To the fullest extent permitted by applicable law, no arbitration or
 claim under these terms shall be joined to any other arbitration or claim, including any arbitration or claim involving any other current or former user
 of the Brainly services or a Brainly subscription, and no class arbitration proceedings shall be permitted. We may modify or update this privacy policy
 from time to time to reflect the changes in our business and practices, and so you should review this page periodically. If you object to any changes,
 you may close your account. Continuing to use our service after we publish changes to this privacy policy means that you are consenting to the changes.

 Lead-K: Welcome to Brainly!. Brainly operates a group of social learning networks for students and educators. Brainly inspires students to share and
 explore knowledge in a collaborative community and engage in peer-to-peer educational assistance, which is made available on www.Brainly.com and
 any www.Brainly.com sub-domains(the “website”) as well as the Brainly.com mobile application (the “app”) (the “website” and the “app” are collectively
 the “Brainly services”. We have two sets of terms and conditions: part(a) sets out the terms that apply to our users unless you are based in Europe and
 part (b) sets out the terms that apply to our users in Europe. It is important that you read and understand the terms that apply to you when you use
 the Brainly services before using the Brainly services. Part (a): terms and conditions applicable to users unless you are based in Europe. This part and
 the documents referred to within it set out the terms and conditions that apply to your use of Brainly services if you access Brainly services from within
 the united states or other countries except Europe. The Cookie Policy states, "Service oparator [sic] informs that restricting the use of cookies may
 affect some of the functionalities available on the Website."

 KLSum: Brainly reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the Brainly services, your Brainly
 subscription, and/or any products, services, functionality, information, content or other materials available on, through or in connection with the Brainly
 services and/or your Brainly subscription, including, but not limited to, the mobile software, and/or your access to some or all of them without cause and
 without notice. Brainly makes no warranty that the Brainly services and/or any products, services, functionality, information, content or other materials
 available on, through or in connection with the Brainly services or your Brainly subscription, including, but not limited to, the mobile software, will meet
 your requirements, or that the Brainly services or Brainly subscriptions will operate uninterrupted or in a timely, secure, or error-free manner, or as to the
 accuracy or completeness of any information or content accessible from or provided in connection with the Brainly services or Brainly subscriptions,
 regardless of whether any information or content is marked as “verified”. You must not: use Brainly services other than for its intended purpose as set out
 in the terms of use; <truncated for presentation purpose. Rest of the summary includes examples of misuse of the Brainly services.>


Figure 1: The summaries extracted by our model (CNN + RF and CNN + CF) and the baselines from the privacy policy and cookie policy of Brainly at compression ratio of 1/64.
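
The length limit applied to the summaries in Figure 1 follows from the budget rule of Section 5.1 (compression ratio times average contract length). The arithmetic can be sketched as below. This is a minimal illustration, not the authors' code: the function name `summary_budget`, the use of flooring, and the constant `AVG_WORDS` are assumptions. In particular, the average of 10,496 words is a hypothetical value chosen only because it reproduces the reported word budgets of 164 and 656; the paper does not state it in this section.

```python
from fractions import Fraction
import math


def summary_budget(ratio, average_length):
    """Return the budget m = floor(r * average length).

    Flooring is an assumption; the text only says the ratio is
    multiplied by the average number of sentences (or words).
    """
    return math.floor(ratio * average_length)


# Hypothetical average contract length in words (not given in the text);
# chosen so that the results match the reported budgets.
AVG_WORDS = 10496

words_at_1_64 = summary_budget(Fraction(1, 64), AVG_WORDS)
words_at_1_16 = summary_budget(Fraction(1, 16), AVG_WORDS)
print(words_at_1_64, words_at_1_16)
```

Using `Fraction` keeps the ratios exact, so the budget does not depend on floating-point rounding.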
selection achieves higher recall and thus a better METEOR score than the coverage-focused mechanism. On the other hand, when the summarization budget is increased, the ROUGE values for this method drop slightly. This is because, in most of the contracts, the number of risky sentences is smaller than the budget at the ratio of 1/16 (29 sentences).

5.2.2 Building Human-like summaries: We present our summarization results using the plain English summaries as reference summaries in Table 4. At the compression ratio of 1/64, both variations of


                                   Compression Ratio = 1/64                                      Compression Ratio = 1/16
                    ROUGE-1          ROUGE-2         ROUGE-L       METEOR          ROUGE-1         ROUGE-2           ROUGE-L           METEOR
    CNN + RF           43.09             31.21        36.80           41.98            34.0            24.96             24.83            40.03
    CNN + CF           40.45             28.69        34.01           41.55           37.93            28.82             29.23            43.91
    Textrank            28               13.89        22.06            22.4           33.78            22.12             26.85            35.49
    KLSum              28.75             13.14        23.53           25.34           24.74            11.36             18.86            26.95
    Lead-k             25.57              9.09        20.25           19.54           25.67            11.33             19.77            26.85
    Random             24.26              6.45        18.78           18.11           24.43             9.85             18.08            27.01
Table 3: ROUGE-1, ROUGE-2, ROUGE-L, and METEOR scores of our model (highlighted in light gray) in comparison to the baselines at compression ratios 1/64 and 1/16. RF refers to the risk-focused content selection while CF refers to the coverage-focused content selection. The quote text of the risky sections was used to build the reference summaries.
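
The relative improvements quoted for Table 3 compare our model with the best performing baseline for each metric. A minimal sketch of that calculation, using the 1/64 columns of Table 3, is given below. The dictionary layout and the function name `relative_improvement` are illustrative assumptions, not part of our implementation.

```python
# Scores copied from the Compression Ratio = 1/64 columns of Table 3.
TABLE3_RATIO_1_64 = {
    "CNN + RF": {"ROUGE-1": 43.09, "ROUGE-2": 31.21, "ROUGE-L": 36.80, "METEOR": 41.98},
    "Textrank": {"ROUGE-1": 28.00, "ROUGE-2": 13.89, "ROUGE-L": 22.06, "METEOR": 22.40},
    "KLSum": {"ROUGE-1": 28.75, "ROUGE-2": 13.14, "ROUGE-L": 23.53, "METEOR": 25.34},
    "Lead-k": {"ROUGE-1": 25.57, "ROUGE-2": 9.09, "ROUGE-L": 20.25, "METEOR": 19.54},
    "Random": {"ROUGE-1": 24.26, "ROUGE-2": 6.45, "ROUGE-L": 18.78, "METEOR": 18.11},
}
BASELINES = ("Textrank", "KLSum", "Lead-k", "Random")


def relative_improvement(scores, model, baselines):
    """Percent gain of `model` over the best baseline, separately per metric."""
    gains = {}
    for metric, value in scores[model].items():
        best_baseline = max(scores[name][metric] for name in baselines)
        gains[metric] = 100.0 * (value - best_baseline) / best_baseline
    return gains


improvements = relative_improvement(TABLE3_RATIO_1_64, "CNN + RF", BASELINES)
for metric, gain in improvements.items():
    print(f"{metric}: {gain:.2f}% over the best baseline")
```

Note that the best baseline differs by metric (KLSum for ROUGE-1, ROUGE-L, and METEOR; Textrank for ROUGE-2), which is why the maximum is taken per metric rather than per system. The computed gains agree with the percentages reported in the text to within 0.1 percentage points.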

                                      Compression Ratio = 1/64                                    Compression Ratio = 1/16
                        ROUGE-1          ROUGE-2      ROUGE-L         METEOR         ROUGE-1         ROUGE-2          ROUGE-L           METEOR
    Upper Bound            22.45              13.7       18.27            22.32         22.56            13.95            18.49            23.03
    CNN + RF               13.97              6.08       9.83             16.58          9.07             3.94            5.53             12.07
    CNN + CF               12.39              4.81       8.51             14.93         10.18            4.54             6.58             13.16
    Textrank               10.94              2.78       7.51              11.2         10.08             3.37            6.37             12.47
    KLSum                  10.96              2.43       7.34             12.54          8.37             1.92             5.26            11.06
    Lead-k                 11.21               1.9        7.9             11.04          9.33             2.44             5.96            11.87
    Random                 11.44              1.87       8.03             12.02          9.13             2.32             5.73            12.45
Table 4: Performance of our model (highlighted in light gray) in comparison to the baselines at compression ratios 1/64 and 1/16. RF refers to the risk-focused content selection while CF refers to the coverage-focused content selection. The plain English summaries of risky sections were used to build the reference summaries.
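
The significance statements accompanying Tables 3 and 4 rely on a Wilcoxon signed-rank test with Bonferroni correction. The sketch below illustrates the idea in plain Python with an exact two-sided test obtained by enumerating sign assignments, which is practical only for small samples. It is an illustration under stated assumptions, not our evaluation code: the per-document scores in `ours` and `baseline` are invented, and the number of corrected tests (4, one per metric) is a hypothetical choice.

```python
from itertools import product


def signed_ranks(x, y):
    """Return (ranks, W+) for paired samples, using average ranks for ties."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    return ranks, w_plus


def wilcoxon_exact_p(x, y):
    """Exact two-sided p-value: enumerate all 2^n sign assignments."""
    ranks, w_plus = signed_ranks(x, y)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed:
            extreme += 1
    return extreme / 2 ** len(ranks)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-document METEOR scores for ten contracts.
ours = [41, 45, 39, 50, 47, 44, 52, 38, 46, 43]
baseline = [25, 30, 28, 31, 27, 35, 33, 30, 29, 26]

p_value = wilcoxon_exact_p(ours, baseline)
corrected = bonferroni([p_value] * 4)  # assumes four tests, one per metric
print(p_value, corrected[0])
```

For real evaluation sets with many documents, a library routine based on a normal approximation would be preferable to exhaustive enumeration.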

our model outperform the baselines. Our CNN + RF model increases the METEOR score by 32.2% over KLSum and 48% over Textrank. This improvement is found to be statistically significant (with p-value < 0.01). CNN + CF outperforms the baselines on all evaluation metrics. However, the improvement is not statistically significant. At the compression ratio of 1/16, CNN + CF outperforms all domain-agnostic baselines. This improvement, however, is not statistically significant. At this compression ratio, CNN + RF achieves results comparable to Textrank. We conclude from our experiments that our domain-aware extractive model does moderately better than the baselines at lower compression ratios; however, due to the high level of abstraction in the plain English summaries of TOS;DR [16], a fully extractive approach cannot mimic the human-like qualities of the plain English summaries. This can also be seen by looking at the performance of the upper bound baseline.

6    CONCLUSION AND DISCUSSION
In this paper, we proposed a domain-aware extractive model for summarizing privacy contracts. Our model employs a convolutional neural network to identify risky sections of the contracts. We build summaries using a risk-focused and a coverage-focused content selection mechanism. Our approach enables users to select the content to be summarized within a controllable length while relying on substantially less training data than existing supervised summarization methods. Our two content selection mechanisms enable users to build budgeted summaries of contracts based on their preference of coverage vs. risk. In spite of the moderate success in classification on our realistically imbalanced dataset, we observed a noticeable improvement in ROUGE and METEOR metrics in comparison to domain-agnostic baselines. We believe the summaries generated by our method can be improved in multiple ways. First, the classifier itself and the redundancy reduction system could be improved, bringing content selection performance closer to the upper bound scores derived using a perfect classifier. Second, our summaries would be more accessible if written in plain English rather than legalese [2]. An abstractive system could be used to rewrite the contract text in this way. However, the abstractive summaries should not change the legal interpretation of the content and should be linkable to the original content to be considered binding. In addition to improving the system, it is also necessary to conduct more extensive evaluation experiments, involving human readers as well as automated metrics. This will help determine the most effective ways to present information from click-through contracts so that users can understand their terms and make a more informed decision. We plan to explore whether the risk classifier module can be used independently to enhance the productivity of annotators by identifying the sections that need to be summarised. This could facilitate annotating larger resources for training abstractive models.

ACKNOWLEDGEMENT
We are immensely grateful to Prof. Junyi Jessy Li, Prof. Bryan H. Choi, Dr. Daniel Preoţiuc-Pietro, Mayank Kulkarni, and three anonymous reviewers for valuable discussions.


REFERENCES
 [1] Lorrie Faith Cranor, Praveen Guduru, and Manjula Arjula. User interfaces for privacy agents. TOCHI, 2006.
 [2] Jonathan A Obar and Anne Oeldorf-Hirsch. The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services. ICS, 2020.
 [3] Aleecia M McDonald and Lorrie Faith Cranor. The cost of reading privacy policies. ISJLP, 2008.
 [4] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv:1509.00685, 2015.
 [5] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, et al. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv:1602.06023, 2016.
 [6] Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. Distraction-based neural networks for modeling document. In IJCAI, 2016.
 [7] Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. arXiv:1704.04368, 2017.
 [8] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. Abstractive document summarization with a graph-based attentional neural model. In ACL, 2017.
 [9] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv:1705.04304, 2017.
[10] Ritesh Sarkhel*, Moniba Keymanesh*, Arnab Nandi, and Srinivasan Parthasarathy. Transfer learning for abstractive summarization at controllable budgets. arXiv:2002.07845, 2020.
[11] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI, 2017.
[12] Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. Graph-based neural multi-document summarization. arXiv:1706.06681, 2017.
[13] Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. Faithful to the original: Fact aware neural abstractive summarization. In AAAI, 2018.
[14] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In EMNLP, 2004.
[15] Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In NAACL, 2009.
[16] Laura Manor and Junyi Jessy Li. Plain english summarization of contracts. arXiv:1906.00424, 2019.
[17] Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. Bottom-up abstractive summarization. arXiv:1808.10792, 2018.
[18] Frederick Liu, Shomir Wilson, Peter Story, et al. Towards automatic classification of privacy policy text. 2018.
[19] Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, et al. The creation and analysis of a website privacy policy corpus. In ACL, 2016.
[20] Sebastian Zimmeck and Steven M Bellovin. Privee: An architecture for automatically analyzing web privacy policies. 2014.
[21] Welderufael B Tesfay, Peter Hofmann, Toru Nakamura, Shinsaku Kiyomoto, and Jetzabel Serna. Privacyguide: Towards an implementation of the eu gdpr on internet privacy policy evaluation. In IWSPA, 2018.
[22] Razieh Nokhbeh Zaeem, Rachel L German, and K Suzanne Barber. Privacycheck: Automatic summarization of privacy policies using data mining. TOIT, 2018.
[23] Najmeh Mousavi Nejad, Damien Graux, and Diego Collarana. Towards measuring risk factors in privacy policies. In ICAIL, 2019.
[24] Hamza Harkous, Kassem Fawaz, Rémi Lebret, et al. Polisis: Automated analysis and presentation of privacy policies using deep learning. 2018.
[25] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. JMLR, 12(Aug):2493–2537, 2011.
[26] Yoon Kim. Convolutional neural networks for sentence classification. arXiv:1408.5882, 2014.
[27] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv:1404.2188, 2014.
[28] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv:1510.03820, 2015.
[29] Matthew E Peters, Mark Neumann, Mohit Iyyer, et al. Deep contextualized word representations. arXiv:1802.05365, 2018.
[30] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[31] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 1986.
[32] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. arXiv:1705.00108, 2017.
[33] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv:1312.3005, 2013.
[34] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45, 1997.
[35] Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
[36] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE TPAMI, 2002.
[37] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. JAIR, 2002.
[38] Chin-Yew Lin and Eduard Hovy. Manual and automatic evaluation of summaries. In ACL, 2002.
[39] Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. Recent advances in document summarization. Knowledge and Information Systems, 2017.
[40] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, 2014.
[41] Frank Wilcoxon, SK Katti, and Roberta A Wilcox. Critical values and probability levels for the wilcoxon rank sum test and the wilcoxon signed rank test. Selected tables in mathematical statistics, 1970.
[42] Charles W Dunnett. New tables for multiple comparisons with a control. Biometrics, 1964.