Toward Domain-Guided Controllable Summarization of Privacy Policies

Moniba Keymanesh (keymanesh.1@osu.edu), Micha Elsner (elsner.14@osu.edu), Srinivasan Parthasarathy (parthasarathy.2@osu.edu)
The Ohio State University

ABSTRACT

Companies' privacy policies are often skipped by users because they are too long, verbose, and difficult to comprehend. Identifying the key privacy and security risk factors mentioned in these unilateral contracts and effectively incorporating them in a summary can assist users in making a more informed decision when asked to agree to the terms and conditions. However, existing summarization methods fail to integrate domain knowledge into their framework or rely on a large corpus of annotated training data. We propose a hybrid approach to identify sections of privacy policies with a high privacy risk factor. We incorporate these sections into summaries by selecting the riskiest content from different privacy topics. Our approach enables users to select the content to be summarized within a controllable length. Users can view a summary that captures different privacy factors or a summary that covers the riskiest content. Our approach outperforms the domain-agnostic baselines by up to 27% in ROUGE-1 score and 50% in METEOR score using plain English reference summaries while relying on significantly less training data in comparison to abstractive approaches.

ACM Reference Format:
Moniba Keymanesh, Micha Elsner, and Srinivasan Parthasarathy. 2020. Toward Domain-Guided Controllable Summarization of Privacy Policies. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 7 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). NLLP @ KDD 2020, August 24th, San Diego, US.

1 INTRODUCTION AND RELATED WORK

Privacy policy and terms of service are unilateral contracts by which companies are required to inform users about their data collection, processing, and sharing practices. Users are required to agree to abide by the terms before they can use any service. However, many users do not read or understand these contracts [1]. Thus, they often end up consenting to terms that may not be aligned with legislation such as the General Data Protection Regulation (GDPR, https://eugdpr.org/) [2]. This behavior is often because these contracts are too long and difficult to comprehend [3]. Summarization is an intuitive way to assist users with conscious agreement by generating a condensed equivalent of the content. Broadly, there are two main lines of summarization systems: abstractive and extractive. The abstractive paradigm [4–10] aims to create an abstract representation of the input text and involves various text rewriting operations such as paraphrasing, deletion, and reordering. The extractive paradigm [11, 12], on the other hand, creates a summary by identifying and subsequently concatenating the most important sentences in the document. Abstractive systems are more flexible, while extractive models enjoy better factuality [13]. However, existing summarization techniques perform poorly on contracts. Unsupervised methods [14, 15] rely on structural features of documents, such as lexical repetition, to identify and extract important content. These heuristics work poorly on the legal language used in contracts [16]. Supervised methods [7, 9, 17] can learn to cope with the features of a particular domain. However, training these complex neural summarization models with thousands of parameters requires a large corpus of documents and their summaries. Currently existing corpora in the legal domain are not large enough to train such models. We propose a hybrid approach for extractive summarization of privacy contracts: using existing annotated resources, we train a classifier to predict which pieces of content are most relevant to users [1]. In particular, we identify parts of the contract which place users at risk by imposing unsafe data practices on them, such as selling email addresses to third parties or allowing the company to appropriate user-generated content. Next, we use this risk classifier for content selection within an extractive summarization pipeline. The classifier is substantially less expensive than learning to summarize directly but enables our approach to outperform a selection of domain-agnostic unsupervised summarization methods.

Prior computational work on privacy policies has used information extraction and natural language processing methods to classify segments of these documents into different data practice categories [18–20]. Another trajectory of work has focused on presenting a graphical "at-a-glance" description of the privacy policies to the user. For example, PrivacyGuide [21] and PrivacyCheck [22] define a few privacy factors and map each factor to a risk level using a data mining model. Relying on these "at-a-glance" description methods raises several concerns. First, there is no way for the user to check the factuality of the predicted risk classes or interpret the reasoning behind them. Moreover, users tend to have an easier time comprehending the content when it is provided in natural language. Researchers have also focused on assigning a risk factor (green, yellow, or red) to each segment of the privacy policies [23, 24]. However, summarizing the text may benefit users more than directly presenting the classifier output. We draw on these approaches in building our own classifier. The first module of our framework extends prior work [23, 24] to highlight segments of privacy policies that have a higher risk. We employ a pre-trained encoder and a convolutional neural network to classify sentences of the contracts into different risk levels. To address the limitations of previous work, we incorporate the domain information predicted by the classifier in the form of a summary by comparing a risk-focused and a coverage-focused content selection mechanism.
The coverage-focused selection mechanism aims to reduce the information redundancy by covering the riskiest sentence from each privacy topic. We evaluate the effectiveness of employing a classifier for identifying the domain knowledge for summarization. We also evaluate the quality of summaries extracted by our two content selection criteria. Using our approach, users can view a summary that captures different privacy factors or a summary that covers the riskiest content. We release our dataset of 151 privacy policies annotated with risk labels to assist future research.

2 METHODOLOGY

Given a privacy policy document 𝐷 consisting of a sequence of 𝑛 sentences {𝑠1, 𝑠2, ..., 𝑠𝑛} and a sentence budget 𝑚 such that 𝑚 < 𝑛, our summarization model extracts a risk-aware summary with 𝑚 sentences. For each sentence 𝑠𝑖 ∈ 𝐷 we predict a binary label 𝑦𝑖 (where a value of 1 means 𝑠𝑖 is included in the summary). We achieve this by computing an inclusion probability 𝑝(𝑦𝑖 | 𝑠𝑖, 𝐷, 𝜃) for each sentence 𝑠𝑖, where 𝜃 are the model's parameters. We aim to maximize the inclusion probability for risky sections of the privacy policies and minimize it for non-risky sections. We also would like to cover different privacy factors within the sentence budget 𝑚 by reducing the redundancy. The main intuition behind our proposed approach is that users, when going through privacy policies, are most interested in knowing how their information can potentially be abused [1]. Thus, a condensed equivalent of the terms should include such risky sections. Next, we explain the architecture of our risk prediction model and our content selection mechanisms.

2.1 Risk Prediction

Given the content of privacy policies, the first step in our framework is to identify the risk class associated with each sentence of the contract. We rely on a crowd-sourcing project called TOS;DR (https://tosdr.org) to automatically annotate 151 privacy contracts. TOS;DR has annotated several snippets of privacy contracts based on the average Internet user's perception of risk. We explain our dataset extraction in Section 3. We use this dataset to train our risk classifier. Prior research has exploited word embeddings and Convolutional Neural Networks (CNN) for sentence classification [25–28]. These simple architectures achieve strong empirical performance over a range of text classification tasks. Our model is a slight variant of the CNN architecture proposed in [25].

2.1.1 Model architecture. Let 𝑠𝑗 = {𝑡1, 𝑡2, ..., 𝑡𝑛} be the 𝑗-th sentence in the contract 𝐷 and 𝑣𝑖 ∈ 𝑅^𝑑 be the d-dimensional vector representation of token 𝑡𝑖 in this sequence. Word representations are the output of a pretrained encoder [29] and will be discussed in Section 2.1.2. We build the sentence matrix 𝐴 ∈ 𝑅^(𝑛×𝑑) by concatenating the word vectors 𝑣1 to 𝑣𝑛:

    𝐴1:𝑛 = 𝑣1 ⊕ 𝑣2 ⊕ ... ⊕ 𝑣𝑛

Following [25], we apply convolution filters to this matrix to produce new features. The length of the filters is equal to the dimensionality of the word vectors 𝑑. The height or region size of the filter is denoted by ℎ and is the number of rows (word vectors) that are considered jointly when applying the convolution filter. The feature map 𝑐 ∈ 𝑅^(𝑛−ℎ+1) of the convolution operation is then obtained by repeatedly applying the convolution filter 𝑤 to a window of tokens 𝑡𝑖:𝑖+ℎ−1. Each element 𝑐𝑖 in the feature map 𝑐 = [𝑐1, 𝑐2, ..., 𝑐𝑛−ℎ+1] is then obtained from:

    𝑐𝑖 = 𝑓(𝑤 · 𝐴[𝑖 : 𝑖 + ℎ − 1] + 𝑏)

where 𝐴[𝑖 : 𝑗] is the sub-matrix of 𝐴 from row 𝑖 to 𝑗 corresponding to a window of tokens 𝑡𝑖 to 𝑡𝑗, and "·" represents the dot product between the filter 𝑤 and the sub-matrices. 𝑏 ∈ 𝑅 represents the bias term and 𝑓 is an activation function such as a rectified linear unit. We use multiple kinds of filters by using various region sizes. This extracts various types of features from bigrams, trigrams, and so on. The dimensionality of the feature map 𝑐 generated by each convolution filter is different for sentences with various lengths and filters with different heights. We apply a max-over-time [25] pooling operation to downsample each feature map 𝑐 by taking the maximum value over the window defined by a pool size 𝑝. The max-pooling operation naturally deals with variable sentence lengths. The outputs generated from each filter map are concatenated to build a fixed-length feature vector for the penultimate layer. This feature vector is then fed to a fully connected softmax layer that predicts a probability distribution over the risk level categories. We apply dropout [30] as a means of regularization in the softmax layer. Our objective is to minimize the binary cross-entropy. The trainable model parameters include the weight vectors 𝑤 of the filters, the bias term 𝑏 in the activation function, and the weight vector of the softmax function. We minimize the loss using stochastic gradient descent and back-propagation [31].

2.1.2 Pretrained Word Vectors. Prior research indicates that better word representations can improve performance in a variety of natural language understanding (NLU) tasks [32]. We use ELMo [29], a deep contextualized word representation model, to map each token 𝑡𝑖 in sentence 𝑠𝑖 in contract 𝐷 to its corresponding contextual embedding 𝑣𝑖 of length 1024.³ ELMo uses a bi-directional LSTM [34] for language modeling and considers the context of the words when assigning them to their embeddings.⁴

³ The model was trained on the One Billion Word Benchmark [33] and was obtained from https://github.com/allenai/allennlp.
⁴ BERT [35], as the current state of the art for language model pretraining, has achieved amazing results in many NLU tasks with minimal fine-tuning. However, our preliminary results of fine-tuning BERT did not outperform our results from ELMo word vectors and the task-specific architecture explained in Section 2.1.1.

2.2 Content Selection and Redundancy Reduction

Given the probability distributions over the risk categories, we apply two content selection mechanisms to account for the summarization budget 𝑚 and minimize the information redundancy. The first mechanism focuses on including the most "risky" sections while the second mechanism focuses on covering diverse privacy factors. Next, we explain these two variations of our model.

2.2.1 Risk-Focused Content Selection: Given a privacy policy contract 𝐷 with sentences {𝑠1, ..., 𝑠𝑛}, a summarization budget 𝑚, and the risk score 𝑝(𝑦𝑖 = 1 | 𝑠𝑖, 𝐷, 𝜃) predicted for 𝑠𝑖 by the risk classifier, the risk-focused selection mechanism assembles a summary by extracting the top 𝑚 sentences that have the highest risk score.
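To make the classifier concrete, the sketch below shows one way to implement the CNN of Section 2.1.1 with the hyperparameters reported in Section 4.1 (filter region sizes 3 and 4 with 50 filters each, ReLU, 20% dropout, SGD with a learning rate of 0.01, and binary cross-entropy). This is an illustrative reconstruction in Keras, not the authors' released code; it assumes precomputed 1024-dimensional ELMo token vectors (Section 2.1.2) padded to a fixed sentence length, and it collapses the two-class softmax into an equivalent single sigmoid output.

# Illustrative sketch of the CNN risk classifier (Sections 2.1.1 and 4.1); not the authors' code.
# Assumes sentences are already mapped to padded 1024-d ELMo token vectors.
from tensorflow.keras import layers, models, optimizers

MAX_LEN, EMB_DIM = 100, 1024            # assumed sentence-length cap; 1024-d ELMo vectors

sentence = layers.Input(shape=(MAX_LEN, EMB_DIM), name="sentence_matrix_A")
pooled = []
for region_size in (3, 4):              # two filter heights h (Section 4.1)
    feature_map = layers.Conv1D(filters=50, kernel_size=region_size,
                                activation="relu")(sentence)    # feature map c
    pooled.append(layers.GlobalMaxPooling1D()(feature_map))     # max-over-time pooling
penultimate = layers.Concatenate()(pooled)                      # fixed-length feature vector
penultimate = layers.Dropout(0.2)(penultimate)                  # 20% dropout for regularization
risk_score = layers.Dense(1, activation="sigmoid")(penultimate) # p(y_i = 1 | s_i)

model = models.Model(sentence, risk_score)
model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")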


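Given per-sentence risk scores from the classifier, both selection rules reduce to a few lines of code. The sketch below is a hypothetical implementation of the risk-focused rule above and of the coverage-focused rule described next in Section 2.2.2; risk_scores, sentence_vectors (one fixed-length vector per sentence), m, and alpha are assumed inputs, and scikit-learn's KMeans stands in for the k-means step.

# Hypothetical sketch of the two content selection rules (Sections 2.2.1 and 2.2.2).
# risk_scores: array of p(y_i = 1 | s_i); sentence_vectors: one fixed-length vector per
# sentence; m: sentence budget; alpha: risk threshold. Returns sentence indices in document order.
import numpy as np
from sklearn.cluster import KMeans

def risk_focused(risk_scores, m):
    # Section 2.2.1: keep the top-m sentences by predicted risk.
    return sorted(np.argsort(risk_scores)[::-1][:m])

def coverage_focused(risk_scores, sentence_vectors, m, alpha):
    # Section 2.2.2: cluster the above-threshold sentences into min(m, |r|) privacy
    # topics and keep the riskiest sentence of each cluster.
    risky = np.where(risk_scores > alpha)[0]
    if len(risky) == 0:
        return []
    k = min(m, len(risky))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(sentence_vectors[risky])
    picked = [risky[labels == c][np.argmax(risk_scores[risky[labels == c]])]
              for c in range(k)]
    return sorted(picked)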
2.2.2 Coverage-Focused Content Selection: Given a privacy policy contract 𝐷 with sentences {𝑠1, ..., 𝑠𝑛}, a summarization budget 𝑚, and risk scores 𝑝(𝑦𝑖 = 1 | 𝑠𝑖, 𝐷, 𝜃), the coverage-focused selection method finds 𝑚 privacy factors by clustering sentences for which the risk score is larger than a predefined value 𝛼. Next, the riskiest sentence from each privacy factor cluster is selected to be included in the summary. Note that if fewer than 𝑚 sentences have a risk score greater than 𝛼, the summary will have fewer than 𝑚 sentences. To find the privacy topics of a contract, we apply k-means [36] to sentence representations. Sentence representations are obtained by concatenating the word vectors. The number of clusters is set to min(𝑚, |𝑟|) where 𝑟 = {𝑠𝑖 | 𝑝(𝑦𝑖 = 1) > 𝛼}.

3 DATASET EXTRACTION

In this section, we explain the dataset that we compiled from the TOS;DR website and the privacy contracts of 151 companies. TOS;DR is a website dedicated to rating and explaining privacy policies of companies in plain English. Members of the website's community classify specific sections of privacy policies into "bad", "good", "blocker", and "neutral" categories and provide summaries for them. We collected the user agreement contracts of 151 services that were annotated on TOS;DR from the companies' websites. Some companies have several such contracts, e.g., a privacy policy, terms of service, and a cookie policy. In this case, all the contracts were merged into a single document. Next, we compared each sentence of the contract with the specific snippets that were annotated on TOS;DR. If the corresponding sentence or a very similar sentence was annotated by the TOS;DR contributors, the same label was used. Otherwise, it was annotated as "neutral". The assumption behind our annotation schema is that, if a section was not annotated by the contributors, it most likely does not include a privacy risk and thus is considered neutral. NLTK was used to segment the contracts into sentences. Jaccard similarity of the vocabulary was used to measure the similarity of the sentences. Two sentences from the same contract were considered similar if the Jaccard similarity of their tokens was more than 50%. We combined the "bad" and "blocker" sections to build the "risky" class. The "good" and "neutral" classes were also combined to build the "non-risky" class. This dataset is highly imbalanced, with 61,674 non-risky sentences and only 719 risky sentences. To build the ground truth risk-aware summary of each privacy policy, we concatenate the plain English summaries of the snippets that have a "risky" label. The dataset statistics of the 151 privacy policies and their corresponding summaries are presented in Table 1. Our dataset is available online at www.github.com/senjed/Summarization-of-Privacy-Policies.

    Dataset                    Min     Max    Median    Mean
    Privacy Policies            61    1707       350   411.6
    Plain English Summaries      1      53         1     3.5

Table 1: The min, max, median, and average number of sentences in 151 privacy contracts and their summaries.

4 EXPERIMENTS

In this section, we discuss our data augmentation mechanism to reduce the data imbalance problem, our hyperparameter choices for designing the risk classifier, and the training details. We discuss our evaluation criteria in Section 4.2.

4.1 Hyperparameters and Training Details

For the CNN model, we use two filter region sizes, 3 and 4, each of which has 50 output filters. We use the rectified linear unit as the activation function of the convolution layer. The pool size in the max pooling operation is set to 50. We apply dropout with a rate of 20%. We optimize the binary cross-entropy loss using stochastic gradient descent with a learning rate of 0.01. To account for the class imbalance problem, we randomly under-sampled the majority class (non-risky) with a rate of 10%. We also apply SMOTE over-sampling [37] on the minority class (risky) with a rate of 50%. We train our model on this resampled dataset for 20 epochs and weight the loss function inversely proportional to the class frequencies in the input data. To set the value of the risk threshold 𝛼 in the content selection module, we used the ROC curve of the validation set of each fold. We set 𝛼 for each fold to the threshold value that achieves an 80% true positive rate.

4.2 Evaluation Metrics

In our experiments, we seek to answer two questions: (i) how well does our model identify the risky sentences in the contracts? and (ii) which content selection method leads to more "human-like" summaries? To answer the first question, we report the Macro-F1 and Micro-F1 scores of our classifier. To answer the second question, we evaluate the quality of the summaries extracted by our model by computing the average F1-score for the ROUGE-1, ROUGE-2, and ROUGE-L [38] metrics (which respectively measure the unigram overlap, bigram overlap, and longest common subsequence between the reference summary and the summary to be evaluated). ROUGE metrics fail to capture semantic similarity beyond n-grams [39]. Thus, we also report the METEOR score [40], which goes beyond surface matches and accounts for stems and synonyms while finding the matches.⁶ We evaluate our model using 5-fold cross-validation. In each fold, the contracts of 96 companies are used for training, 24 contracts are used for validation, and the rest are used for testing. We explain our baselines in Section 4.3 and our experimental results in Section 5.

⁶ We use the pyrouge and NLTK Python packages for computing ROUGE and METEOR values, respectively.
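As a concrete illustration of the threshold selection described in Section 4.1, the snippet below picks, for one fold, the highest classifier threshold whose true positive rate on the validation set reaches 80%. It is a hypothetical sketch using scikit-learn, not the authors' code; y_val and val_scores stand for the validation labels and the predicted risk scores.

# Hypothetical sketch of choosing the risk threshold alpha for one fold (Section 4.1):
# take the ROC threshold whose true positive rate first reaches the 80% target.
import numpy as np
from sklearn.metrics import roc_curve

def pick_alpha(y_val, val_scores, target_tpr=0.8):
    fpr, tpr, thresholds = roc_curve(y_val, val_scores)  # thresholds are sorted high to low
    return thresholds[np.argmax(tpr >= target_tpr)]      # first (highest) threshold meeting the TPR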



4.3 Summarization Baselines

We compare the performance of our domain-aware extractive summarization model with the following unsupervised baselines. Unlike the evaluation setup in [16], we run the models on the entire contract. For methods that require a word limit as the budget, the compression ratio 𝑟 is multiplied by the average number of tokens in all contracts (10488.7) to compute the word limit. Similarly, the compression ratio 𝑟 is multiplied by the average number of sentences in all contracts (413.1) to build a sentence limit.

• TextRank: An algorithm introduced in [14] that uses PageRank to compute an importance score for each sentence. Sentences with the highest importance score are then extracted to build a summary until a word limit is satisfied.
• KLSum: Introduced in [15], KLSum aims to minimize the Kullback-Leibler (KL) divergence between the input document and the proposed summary by greedily selecting sentences.
• Lead-K: A common baseline in news summarization that extracts the first k sentences of the document until a word limit is reached.
• Random: This baseline picks random sentences of the document until a word limit is satisfied. For this baseline, we report the average results over 10 runs.
• Upper Bound Baseline: This baseline picks all the sentences in a contract with the ground truth label "risky". This baseline indicates the performance upper bound of an extractive method on our dataset.

                 Compression Ratio = 1/64                     Compression Ratio = 1/16
               P        R     Macro-F1   Micro-F1          P        R     Macro-F1   Micro-F1
    CNN + RF  22.40   28.13     61.94      98.01         9.86    59.74     56.65      93.10
    CNN + CF  19.64   24.06     60.26      97.95        12.19    52.65     58.51      94.94

Table 2: Precision (P), Recall (R), Macro-F1, and Micro-F1 of the CNN classifier with two different content selection mechanisms, risk-focused (RF) and coverage-focused (CF), at two different compression ratios, 1/64 and 1/16.

5 RESULTS

In this section, we discuss our experiments conducted using 5-fold cross-validation. We shared our training details in Section 4.1. As an example, the summaries extracted by our model and the baselines from the privacy policy of Brainly (https://Brainly.com) are displayed in Figure 1. It can be seen that both of the summaries generated by our method indicate that third party advertising companies will be able to collect information about the use of Brainly. KLSum misses this information, and the traditional lead-k heuristic, which is very effective for news, performs poorly on the contracts. This indicates the advantage of injecting domain-specific knowledge into content selection.

5.1 Classification Results:

In this section, we evaluate the performance of our model discussed in Section 2.1.1 and study the effect of the different content selection mechanisms on the risk prediction task. We evaluate our summaries at two compression ratios of 1/64 and 1/16. The summarization budget 𝑚 at each compression ratio 𝑟 is obtained by multiplying 𝑟 by the average number of sentences (or words) in the contracts. Thus, at the compression ratio of 1/64, summaries are restricted to the maximum length of 6 sentences or 164 words. Similarly, at the compression ratio of 1/16, summaries are limited to the maximum length of 29 sentences or 656 words. We report the precision, recall, Micro-F1, and Macro-F1 of our risk classifier with two different content selection mechanisms, namely risk-focused (RF) and coverage-focused (CF), in Table 2. As can be seen in the table, the Micro-F1 scores of both content selection methods are quite high. However, the best Macro-F1 value is achieved by the risk-focused approach and is 61.94. The large gap between the two values is due to the high level of class imbalance in our dataset (1 positive sample for every 100 negative samples). At the 1/64 compression ratio, risk-focused performs more than two times better in terms of recall. When the compression ratio is 1/16, the risk-focused method captures many more risky sections and achieves a recall of 59.74. However, with this increase in recall, the false positive rate also increases. On the other hand, the coverage-focused method is better at preserving the precision at higher budgets (only a 7.45 point drop in precision with a 28.59 point increase in recall). This observation is caused by extracting only sentences with a risk score greater than 𝛼 in coverage-focused content selection, which naturally puts an upper bound on the false positive rate. We conclude that both mechanisms are moderately successful at identifying the risky sections of contracts. We also conclude that at higher compression ratios, the risk-focused mechanism can be used where recall is more essential, while the coverage-focused mechanism can be used when precision is more of interest. In the next section, we examine whether the domain information given by the risk classifier can improve the quality of summaries in comparison to domain-agnostic extractive summarization baselines.

5.2 Summarization Results:

In this section, we evaluate the quality of the summaries extracted by our model and the baselines. We introduced our evaluation metrics in Section 4.2 and our baselines in Section 4.3. We compare the summaries against two types of reference summaries. The first type of summary is built by assembling all the sentences that have the ground truth "risky" label. These sentences are derived directly from the text of the contract. We will refer to this reference summary as the "quote text" reference. The second type of summary is derived by assembling the plain English summaries of the "risky" sections written by the TOS;DR contributors. The summarization results using the quote text summaries are presented in Table 3. The summarization results using the plain English reference summaries are presented in Table 4.

5.2.1 Extracting the risky content: As can be seen in Table 3, at both compression ratios, both variations of our model outperform the baselines. At the compression ratio of 1/64, CNN + RF achieves the best ROUGE and METEOR results, with a 49.8% improvement in ROUGE-1, a 124.6% improvement in ROUGE-2, a 56.3% improvement in ROUGE-L, and a 65.6% improvement in METEOR in comparison to the best performing domain-agnostic baseline for each metric. At the compression ratio of 1/16, CNN + CF achieves the best ROUGE results, improving ROUGE-1 by 12.2%, ROUGE-2 by 30.2%, ROUGE-L by 8.8%, and METEOR by 23.7% in comparison to the best performing baseline for each metric. The improvement in METEOR score is found to be statistically significant using the Wilcoxon signed-rank test [41] with p-value < 0.01 (Bonferroni corrected [42] to account for multiple testing).
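The significance test just mentioned can be reproduced with standard tooling; the sketch below is a hypothetical example using SciPy, where scores_a and scores_b are paired per-contract METEOR scores for two systems and n_tests is the number of comparisons being corrected for.

# Hypothetical sketch of the paired significance test of Section 5.2.1:
# Wilcoxon signed-rank test on per-contract scores, with Bonferroni correction.
from scipy.stats import wilcoxon

def significantly_different(scores_a, scores_b, n_tests, alpha=0.01):
    stat, p_value = wilcoxon(scores_a, scores_b)   # paired, two-sided by default
    return p_value < alpha / n_tests               # Bonferroni-corrected decision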



 Plain English Summary: The Privacy Policy states, "We and our third party partners may also use cookies and tracking technologies for advertising
 purposes.". In the Privacy Policy, it states that, "Although we do our best to honor the privacy preferences of our users, we are unable to respond to Do
 Not Track signals set by your browser at this time." The Privacy Policy says Brainly can track usage information and personal information "through a
 variety of tracking technologies, including cookies, web beacons, Locally Stored Objects (LSOs such as Flash or HTML5), log files, and similar technology
 (collectively, “tracking technologies”)." If Brainly aims to "preserve all content posted on the site," then we can conclude that such personal data is still
 necessary for the purpose of the site. There are places on the site where answers without usernames or profile pictures are visible. The Cookie Policy
 states, "Service oparator [sic] informs that restricting the use of cookies may affect some of the functionalities available on the Website." For users not in
 europe, brainly reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the brainly services, your brainly
 subscription, and/or any products, services, functionality, information, content or other material. 

 CNN + RF: We participate in interest-based advertising and use third party advertising companies to serve you targeted advertisements based on your
 online browsing history and your interests. We permit third party online advertising networks, social media companies and other third party services,
 to collect, information about your use of our service over time so that they may play or display ads on our service, on other websites, apps or services
 you may use, and on other devices you may use. We may share a common account identifier (such as an email address or user id) or hashed data with
 our third party advertising partners to help identify you across devices. Brainly reserves the right to moderate the Brainly services and to remove, screen,
 or edit your content from the Brainly services at our sole discretion, at any time, and for any reason or for no reason, with no notice to you. Brainly
 reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the Brainly services, your Brainly subscription,
 and/or any products, services, functionality, information, content or other materials available on, through or in connection with the Brainly services
 and/or your Brainly subscription, including, but not limited to, the mobile software, and/or your access to some or all of them without cause and without
 notice. In the event that Brainly suspends or terminates your account, the Brainly services or your Brainly subscription, you acknowledge and agree
 that you shall receive no refund or exchange for any unused time on a Brainly subscription or any subscription fees or anything else.

 CNN + CF: We participate in interest-based advertising and use third party advertising companies to serve you targeted advertisements based on
 your online browsing history and your interests. We permit third party online advertising networks, social media companies and other third party
 services, to collect, information about your use of our service over time so that they may play or display ads on our service, on other websites, apps
 or services you may use, and on other devices you may use. We may share a common account identifier (such as an email address or user id) or hashed
 data with our third party advertising partners to help identify you across devices. To the fullest extent permitted by applicable law, no arbitration or
 claim under these terms shall be joined to any other arbitration or claim, including any arbitration or claim involving any other current or former user
 of the Brainly services or a Brainly subscription, and no class arbitration proceedings shall be permitted. We may modify or update this privacy policy
 from time to time to reflect the changes in our business and practices, and so you should review this page periodically. If you object to any changes,
 you may close your account. Continuing to use our service after we publish changes to this privacy policy means that you are consenting to the changes.

 Lead-K: Welcome to Brainly!. Brainly operates a group of social learning networks for students and educators. Brainly inspires students to share and
 explore knowledge in a collaborative community and engage in peer-to-peer educational assistance, which is made available on www.Brainly.com and
 any www.Brainly.com sub-domains(the “website”) as well as the Brainly.com mobile application (the “app”) (the “website” and the “app” are collectively
 the “Brainly services”. We have two sets of terms and conditions: part(a) sets out the terms that apply to our users unless you are based in Europe and
 part (b) sets out the terms that apply to our users in Europe. It is important that you read and understand the terms that apply to you when you use
 the Brainly services before using the Brainly services. Part (a): terms and conditions applicable to users unless you are based in Europe. This part and
 the documents referred to within it set out the terms and conditions that apply to your use of Brainly services if you access Brainly services from within
 the united states or other countries except Europe. The Cookie Policy states, "Service oparator [sic] informs that restricting the use of cookies may
 affect some of the functionalities available on the Website."

 KLSum: Brainly reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the Brainly services, your Brainly
 subscription, and/or any products, services, functionality, information, content or other materials available on, through or in connection with the Brainly
 services and/or your Brainly subscription, including, but not limited to, the mobile software, and/or your access to some or all of them without cause and
 without notice. Brainly makes no warranty that the Brainly services and/or any products, services, functionality, information, content or other materials
 available on, through or in connection with the Brainly services or your Brainly subscription, including, but not limited to, the mobile software, will meet
 your requirements, or that the Brainly services or Brainly subscriptions will operate uninterrupted or in a timely, secure, or error-free manner, or as to the
 accuracy or completeness of any information or content accessible from or provided in connection with the Brainly services or Brainly subscriptions,
 regardless of whether any information or content is marked as “verified”. You must not: use Brainly services other than for its intended purpose as set out
 in the terms of use; 


Figure 1: The summaries extracted by our model (CNN + RF and CNN + CF) and the baselines from the privacy policy and cookie policy of Brainly at compression ratio of 1/64.
Similar to our observation in the classification task, we find that the risk-focused content selection achieves more recall and thus a better METEOR score in comparison to the coverage-focused mechanism. On the other hand, by increasing the summarization budget, the ROUGE values for this method slightly drop. This is because, in most of the contracts, the number of risky sentences is smaller than the budget at the ratio of 1/16 (29 sentences).

5.2.2 Building Human-like summaries: We present our summarization results using the plain English summaries as reference summaries in Table 4.


                     Compression Ratio = 1/64                        Compression Ratio = 1/16
              ROUGE-1   ROUGE-2   ROUGE-L   METEOR        ROUGE-1   ROUGE-2   ROUGE-L   METEOR
    CNN + RF    43.09     31.21     36.80    41.98          34.0      24.96     24.83    40.03
    CNN + CF    40.45     28.69     34.01    41.55         37.93      28.82     29.23    43.91
    TextRank       28     13.89     22.06     22.4         33.78      22.12     26.85    35.49
    KLSum       28.75     13.14     23.53    25.34         24.74      11.36     18.86    26.95
    Lead-K      25.57      9.09     20.25    19.54         25.67      11.33     19.77    26.85
    Random      24.26      6.45     18.78    18.11         24.43       9.85     18.08    27.01

Table 3: ROUGE-1, ROUGE-2, ROUGE-L, and METEOR score of our model (highlighted in light gray) in comparison to the baselines in compression ratios 1/64 and 1/16. RF refers to the risk-focused content selection while CF refers to the coverage-focused content selection. The quote text of the risky sections was used to build the reference summaries.

                        Compression Ratio = 1/64                        Compression Ratio = 1/16
                 ROUGE-1   ROUGE-2   ROUGE-L   METEOR        ROUGE-1   ROUGE-2   ROUGE-L   METEOR
    Upper Bound    22.45      13.7     18.27    22.32          22.56     13.95     18.49    23.03
    CNN + RF       13.97      6.08      9.83    16.58           9.07      3.94      5.53    12.07
    CNN + CF       12.39      4.81      8.51    14.93          10.18      4.54      6.58    13.16
    TextRank       10.94      2.78      7.51     11.2          10.08      3.37      6.37    12.47
    KLSum          10.96      2.43      7.34    12.54           8.37      1.92      5.26    11.06
    Lead-K         11.21       1.9       7.9    11.04           9.33      2.44      5.96    11.87
    Random         11.44      1.87      8.03    12.02           9.13      2.32      5.73    12.45

Table 4: Performance of our model (highlighted in light gray) in comparison to the baselines in compression ratios 1/64 and 1/16. RF refers to the risk-focused content selection while CF refers to the coverage-focused content selection. The plain English summaries of the risky sections were used to build the reference summaries.

At the compression ratio of 1/64, both variations of our model outperform the baselines. Our CNN + RF model increases the METEOR score by 32.2% over KLSum and 48% over TextRank. This improvement is found to be statistically significant (with p-value < 0.01). CNN + CF outperforms the baselines over all evaluation metrics; however, the improvement is not statistically significant. At the compression ratio of 1/16, CNN + CF outperforms all domain-agnostic baselines. This improvement, however, is not statistically significant. At this compression ratio, CNN + RF achieves comparable results with TextRank. We conclude from our experiments that our domain-aware extractive model does moderately better than the baselines at lower compression ratios; however, due to the high level of abstraction in the plain English summaries of TOS;DR [16], a fully-extractive approach cannot mimic the human-like qualities of the plain English summaries. This can also be seen by looking at the performance of the upper bound baseline.

6 CONCLUSION AND DISCUSSION

In this paper, we proposed a domain-aware extractive model for summarizing privacy contracts. Our model employs a convolutional neural network to identify risky sections of the contracts. We build summaries by using a risk-focused and a coverage-focused content selection mechanism. Our approach enables users to select the content to be summarized within a controllable length while relying on substantially less training data in comparison to the existing supervised summarization methods. Our two different content selection mechanisms enable users to build budgeted summaries of contracts based on their preference of coverage vs. risk. In spite of the moderate success in classification on our realistically imbalanced dataset, we observed a noticeable improvement in ROUGE and METEOR metrics in comparison to domain-agnostic baselines. We believe the summaries generated by our method can be improved in multiple ways. First, the classifier itself, and the redundancy reduction system, could be improved, bringing content selection performance closer to the upper bound scores derived using a perfect classifier. Secondly, our summaries would be more accessible if written in plain English rather than legalese [2]. An abstractive system could be used to rewrite the contract text in this way. However, the abstractive summaries should not change the legal interpretation of the content and should be linkable to the original content to be considered binding. In addition to improving the system, it is also necessary to conduct more extensive evaluation experiments, involving human readers as well as automated metrics. This will help determine the most effective ways to present information from click-through contracts so that users can understand their terms and make a more informed decision. We are planning to explore whether the risk classifier module can be used independently to enhance the productivity of annotators by identifying the sections that need to be summarized. This can potentially facilitate annotating larger resources for training abstractive models.

ACKNOWLEDGEMENT

We are immensely grateful to Prof. Junyi Jessy Li, Prof. Bryan H. Choi, Dr. Daniel Preoţiuc-Pietro, Mayank Kulkarni, and three anonymous reviewers for valuable discussions.


REFERENCES
[1] Lorrie Faith Cranor, Praveen Guduru, and Manjula Arjula. User interfaces for privacy agents. TOCHI, 2006.
[2] Jonathan A Obar and Anne Oeldorf-Hirsch. The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services. ICS, 2020.
[3] Aleecia M McDonald and Lorrie Faith Cranor. The cost of reading privacy policies. ISJLP, 2008.
[4] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv:1509.00685, 2015.
[5] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv:1602.06023, 2016.
[6] Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. Distraction-based neural networks for modeling document. In IJCAI, 2016.
[7] Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. arXiv:1704.04368, 2017.
[8] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. Abstractive document summarization with a graph-based attentional neural model. In ACL, 2017.
[9] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv:1705.04304, 2017.
[10] Ritesh Sarkhel*, Moniba Keymanesh*, Arnab Nandi, and Srinivasan Parthasarathy. Transfer learning for abstractive summarization at controllable budgets. arXiv:2002.07845, 2020.
[11] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI, 2017.
[12] Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. Graph-based neural multi-document summarization. arXiv:1706.06681, 2017.
[13] Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. Faithful to the original: Fact aware neural abstractive summarization. In AAAI, 2018.
[14] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into text. In EMNLP, 2004.
[15] Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In NAACL, 2009.
[16] Laura Manor and Junyi Jessy Li. Plain English summarization of contracts. arXiv:1906.00424, 2019.
[17] Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. Bottom-up abstractive summarization. arXiv:1808.10792, 2018.
[18] Frederick Liu, Shomir Wilson, Peter Story, et al. Towards automatic classification of privacy policy text. 2018.
[19] Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, et al. The creation and analysis of a website privacy policy corpus. In ACL, 2016.
[20] Sebastian Zimmeck and Steven M Bellovin. Privee: An architecture for automatically analyzing web privacy policies. 2014.
[21] Welderufael B Tesfay, Peter Hofmann, Toru Nakamura, Shinsaku Kiyomoto, and Jetzabel Serna. PrivacyGuide: Towards an implementation of the EU GDPR on internet privacy policy evaluation. In IWSPA, 2018.
[22] Razieh Nokhbeh Zaeem, Rachel L German, and K Suzanne Barber. PrivacyCheck: Automatic summarization of privacy policies using data mining. TOIT, 2018.
[23] Najmeh Mousavi Nejad, Damien Graux, and Diego Collarana. Towards measuring risk factors in privacy policies. In ICAIL, 2019.
[24] Hamza Harkous, Kassem Fawaz, Rémi Lebret, et al. Polisis: Automated analysis and presentation of privacy policies using deep learning. 2018.
[25] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. JMLR, 12(Aug):2493–2537, 2011.
[26] Yoon Kim. Convolutional neural networks for sentence classification. arXiv:1408.5882, 2014.
[27] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv:1404.2188, 2014.
[28] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv:1510.03820, 2015.
[29] Matthew E Peters, Mark Neumann, Mohit Iyyer, et al. Deep contextualized word representations. arXiv:1802.05365, 2018.
[30] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[31] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 1986.
[32] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. arXiv:1705.00108, 2017.
[33] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv:1312.3005, 2013.
[34] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45, 1997.
[35] Jacob Devlin, Ming-Wei Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
[36] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE TPAMI, 2002.
[37] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. JAIR, 2002.
[38] Chin-Yew Lin and Eduard Hovy. Manual and automatic evaluation of summaries. In ACL, 2002.
[39] Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. Recent advances in document summarization. Knowledge and Information Systems, 2017.
[40] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
[41] Frank Wilcoxon, SK Katti, and Roberta A Wilcox. Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables in Mathematical Statistics, 1970.
[42] Charles W Dunnett. New tables for multiple comparisons with a control. Biometrics, 1964.