Toward Domain-Guided Controllable Summarization of Privacy Policies

Moniba Keymanesh, The Ohio State University, keymanesh.1@osu.edu
Micha Elsner, The Ohio State University, elsner.14@osu.edu
Srinivasan Parthasarathy, The Ohio State University, parthasarathy.2@osu.edu

ABSTRACT
Users often skip companies' privacy policies because they are too long, verbose, and difficult to comprehend. Identifying the key privacy and security risk factors mentioned in these unilateral contracts, and effectively incorporating them in a summary, can help users make a more informed decision when asked to agree to the terms and conditions. However, existing summarization methods either fail to integrate domain knowledge into their framework or rely on a large corpus of annotated training data. We propose a hybrid approach to identify sections of privacy policies with a high privacy risk factor. We incorporate these sections into summaries by selecting the riskiest content from different privacy topics. Our approach enables users to select the content to be summarized within a controllable length: users can view a summary that captures different privacy factors or a summary that covers the riskiest content. Using plain English reference summaries, our approach outperforms domain-agnostic baselines by up to 27% in ROUGE-1 score and 50% in METEOR score. It also relies on significantly less training data than abstractive approaches.

ACM Reference Format:
Moniba Keymanesh, Micha Elsner, and Srinivasan Parthasarathy. 2020. Toward Domain-Guided Controllable Summarization of Privacy Policies. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 7 pages.

1 INTRODUCTION AND RELATED WORK
Privacy policies and terms of service are unilateral contracts by which companies are required to inform users about their data collection, processing, and sharing practices. Users must agree to abide by the terms before they can use any service. However, many users do not read or understand these contracts [1]. As a result, they often end up consenting to terms that may not be aligned with legislation such as the General Data Protection Regulation (GDPR)¹ [2]. This behavior is often because these contracts are too long and difficult to comprehend [3]. Summarization is an intuitive way to support conscious agreement by generating a condensed equivalent of the content.

Broadly, there are two main lines of summarization systems: abstractive and extractive. The abstractive paradigm [4–10] aims to create an abstract representation of the input text and involves various text-rewriting operations such as paraphrasing, deletion, and reordering. The extractive paradigm [11, 12], on the other hand, creates a summary by identifying and then concatenating the most important sentences in the document. Abstractive systems are more flexible, while extractive models enjoy better factuality [13].

However, existing summarization techniques perform poorly on contracts. Unsupervised methods [14, 15] rely on structural features of documents, such as lexical repetition, to identify and extract important content. These heuristics work poorly on the legal language used in contracts [16]. Supervised methods [7, 9, 17] can learn to cope with the features of a particular domain. However, training these complex neural summarization models, with thousands of parameters, requires a large corpus of documents and their summaries. Existing corpora in the legal domain are not large enough to train such models.

¹ https://eugdpr.org/

We propose a hybrid approach for extractive summarization of privacy contracts. Using existing annotated resources, we train a classifier to predict which pieces of content are most relevant to users [1]. In particular, we identify parts of the contract that place users at risk by imposing unsafe data practices on them. Examples include selling email addresses to third parties or allowing the company to appropriate user-generated content. Next, we use this risk classifier for content selection within an extractive summarization pipeline. The classifier is substantially less expensive than learning to summarize directly. Even so, it enables our approach to outperform a selection of domain-agnostic unsupervised summarization methods.

Prior computational work on privacy policies has used information extraction and natural language processing methods to classify segments of these documents into different data practice categories [18–20]. Another line of work has focused on presenting a graphical "at-a-glance" description of privacy policies to the user. For example, PrivacyGuide [21] and PrivacyCheck [22] define a few privacy factors and map each factor to a risk level using a data mining model.

Relying on these "at-a-glance" descriptions raises several concerns. First, the user has no way to check the factuality of the predicted risk classes or to interpret the reasoning behind them. Second, users tend to comprehend content more easily when it is provided in natural language. Researchers have also focused on assigning a risk factor (green, yellow, or red) to each segment of a privacy policy [23, 24]. However, summarizing the text may benefit users more than directly presenting the classifier output.

We draw on these approaches in building our own classifier. The first module of our framework extends prior work [23, 24] to highlight segments of privacy policies that carry a higher risk. We employ a pre-trained encoder and a convolutional neural network to classify sentences of the contracts into different risk levels.
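The sentence-level risk classifier mentioned above can be illustrated with a small example; its architecture is detailed in Section 2.1.1. The following is a minimal sketch of a Kim-style CNN forward pass [25] in plain Python, not the authors' implementation.

The sketch makes several assumptions:

• the word vectors of a sentence are already available as lists of floats (the paper uses 1024-dimensional ELMo vectors);
• global max-over-time pooling replaces the paper's pool size p;
• dropout and training are omitted.

The function names and toy weights are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]  # rows are word vectors, i.e. the sentence matrix A (n x d)


def conv_feature_map(A: Matrix, w: Matrix, b: float) -> Vector:
    """Apply one h x d filter w to every window of h consecutive rows of A.

    Returns c = [c_1, ..., c_{n-h+1}] with c_i = relu(w . A[i:i+h-1] + b).
    """
    h, n = len(w), len(A)
    if n < h:
        raise ValueError("sentence is shorter than the filter height")
    c = []
    for i in range(n - h + 1):
        s = b
        for k in range(h):  # dot product between the filter and the sub-matrix
            s += sum(wk * a for wk, a in zip(w[k], A[i + k]))
        c.append(max(0.0, s))  # ReLU as the activation function f
    return c


def softmax(z: Sequence[float]) -> Vector:
    top = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - top) for x in z]
    total = sum(e)
    return [x / total for x in e]


def predict_risk(A: Matrix,
                 filters: List[Tuple[Matrix, float]],
                 W_out: Matrix,
                 b_out: Vector) -> Vector:
    """Forward pass: convolution -> max-over-time pooling -> softmax.

    filters: (w, b) pairs, possibly with different heights h.
    W_out, b_out: one weight row and one bias per risk class (hypothetical).
    """
    # One pooled feature per filter, independent of the sentence length.
    pooled = [max(conv_feature_map(A, w, b)) for w, b in filters]
    logits = [sum(wj * xj for wj, xj in zip(row, pooled)) + bk
              for row, bk in zip(W_out, b_out)]
    return softmax(logits)


if __name__ == "__main__":
    # Toy sentence of n = 3 tokens with d = 2 dimensional word vectors.
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    filters = [([[1.0, 0.0], [0.0, 1.0]], 0.0),               # region size h = 2
               ([[-1.0, 0.0], [0.0, 0.0], [0.0, -1.0]], 0.5)]  # region size h = 3
    W_out = [[0.0, 0.0], [1.0, 0.0]]  # class 0 = non-risky, class 1 = risky
    print(predict_risk(A, filters, W_out, [0.0, 0.0]))  # approx. [0.119, 0.881]
```

Here each filter contributes exactly one pooled value, so the concatenated feature vector has a fixed size. With a finite pool size, such as the 50 reported in Section 4.1, a long sentence would instead yield several pooled values per filter.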
To address the limitations of previous work, we incorporate the domain information predicted by the classifier into a summary. We compare two content selection mechanisms for this: a risk-focused one and a coverage-focused one. The coverage-focused mechanism aims to reduce information redundancy by covering the riskiest sentence from each privacy topic. We evaluate how effectively a classifier can identify the domain knowledge needed for summarization. We also evaluate the quality of the summaries extracted by our two content selection criteria. With our approach, users can view either a summary that captures different privacy factors or a summary that covers the riskiest content. We release our dataset of 151 privacy policies annotated with risk labels to assist future research.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
NLLP @ KDD 2020, August 24th, San Diego, US
© 2020 Copyright held by the owner/author(s).

2 METHODOLOGY
Consider a privacy policy document $D$ consisting of a sequence of $n$ sentences $\{s_1, s_2, \dots, s_n\}$ and a sentence budget $m$ with $m < n$. Our summarization model extracts a risk-aware summary of $m$ sentences. For each sentence $s_i \in D$, we predict a binary label $y_i$, where a value of 1 means that $s_i$ is included in the summary. We achieve this by computing an inclusion probability $p(y_i \mid s_i, D, \theta)$ for each sentence $s_i$, where $\theta$ denotes the model's parameters.

We aim to maximize the inclusion probability for risky sections of the privacy policy and to minimize it for non-risky sections. We also want to cover different privacy factors within the sentence budget $m$ by reducing redundancy. The main intuition behind our approach is that users reading privacy policies are most interested in how their information can potentially be abused [1]. A condensed equivalent of the terms should therefore include such risky sections. Next, we explain the architecture of our risk prediction model and our content selection mechanisms.

2.1 Risk Prediction
Given the content of a privacy policy, the first step in our framework is to identify the risk class associated with each sentence of the contract. We rely on a crowd-sourcing project called TOS;DR² to automatically annotate 151 privacy contracts. TOS;DR has annotated many snippets of privacy contracts based on the average Internet user's perception of risk. We explain our dataset extraction in Section 3. We use this dataset to train our risk classifier. Prior research has exploited word embeddings and Convolutional Neural Networks (CNNs) for sentence classification [25–28]. These simple architectures achieve strong empirical performance over a range of text classification tasks. Our model is a slight variant of the CNN architecture proposed in [25].

² https://TOS;DR.org

2.1.1 Model architecture. Let $s_j = \{t_1, t_2, \dots, t_n\}$ be the $j$-th sentence in the contract $D$. Let $v_i \in \mathbb{R}^d$ be the $d$-dimensional vector representation of token $t_i$ in this sequence. The word representations are the output of a pretrained encoder [29] and are discussed in Section 2.1.2. We build the sentence matrix $A \in \mathbb{R}^{n \times d}$ by concatenating the word vectors $v_1$ to $v_n$:

$$A_{1:n} = v_1 \oplus v_2 \oplus \dots \oplus v_n$$

Following [25], we apply convolution filters to this matrix to produce new features. The width of each filter equals the dimensionality $d$ of the word vectors. The height, or region size, of a filter is denoted by $h$. It is the number of rows (word vectors) considered jointly when the filter is applied. The feature map $c \in \mathbb{R}^{n-h+1}$ is obtained by repeatedly applying the convolution filter $w$ to a window of tokens $t_{i:i+h-1}$. Each element $c_i$ of the feature map $c = [c_1, c_2, \dots, c_{n-h+1}]$ is computed as:

$$c_i = f\left(w \cdot A[i : i+h-1] + b\right)$$

Here $A[i:j]$ is the sub-matrix of $A$ from row $i$ to row $j$, corresponding to the window of tokens $t_i$ to $t_j$. The symbol "$\cdot$" denotes the dot product between the filter $w$ and the sub-matrix. $b \in \mathbb{R}$ is the bias term, and $f$ is an activation function such as a rectified linear unit.

We use several kinds of filters with different region sizes, which extract features from bigrams, trigrams, and so on. The dimensionality of the feature map $c$ differs across sentence lengths and filter heights. We therefore apply a max-over-time pooling operation [25] that downsamples each feature map $c$ by taking the maximum value over a window defined by a pool size $p$. This max-pooling operation naturally handles variable sentence lengths. The outputs of all filter maps are concatenated into a fixed-length feature vector for the penultimate layer. This feature vector is then fed to a fully connected softmax layer, which predicts a probability distribution over the risk level categories. We apply dropout [30] in the softmax layer as a means of regularization.

Our objective is to minimize the binary cross-entropy. The trainable parameters are the weight vectors $w$ of the filters, the bias term $b$ of the activation function, and the weight vector of the softmax function. We minimize the loss using stochastic gradient descent and back-propagation [31].

2.1.2 Pretrained Word Vectors. Prior research indicates that better word representations can improve performance on a variety of natural language understanding (NLU) tasks [32]. We use ELMo [29], a deep contextualized word representation model. It maps each token $t_i$ in sentence $s_j$ of contract $D$ to its contextual embedding $v_i$, which has length 1024.³ ELMo uses a bidirectional LSTM [34] for language modeling and takes the context of each word into account when assigning its embedding.⁴

³ The model was trained on the One Billion Word Benchmark [33] and was obtained from https://github.com/allenai/allennlp.
⁴ BERT [35], the current state of the art in language model pretraining, has achieved strong results on many NLU tasks with minimal fine-tuning. However, in our preliminary experiments, fine-tuning BERT did not outperform ELMo word vectors combined with the task-specific architecture explained in Section 2.1.1.

2.2 Content Selection and Redundancy Reduction
Given the probability distributions over the risk categories, we apply two content selection mechanisms that respect the summarization budget $m$ and minimize information redundancy. The first mechanism focuses on including the most "risky" sections. The second focuses on covering diverse privacy factors. Next, we explain these two variations of our model.

2.2.1 Risk-Focused Content Selection: The inputs are a privacy policy contract $D$ with sentences $\{s_1, \dots, s_n\}$ and a summarization budget $m$. For each sentence $s_i$, the risk classifier also predicts a risk score $p(y_i = 1 \mid s_i, D, \theta)$. The risk-focused mechanism assembles a summary by extracting the top $m$ sentences with the highest risk scores.

2.2.2 Coverage-Focused Content Selection: This method takes the same inputs: a privacy policy contract $D$ with sentences $\{s_1, \dots, s_n\}$, a summarization budget $m$, and risk scores $p(y_i = 1 \mid s_i, D, \theta)$. It finds $m$ privacy factors by clustering the sentences whose risk score exceeds a predefined threshold $\alpha$. Next, the riskiest sentence from each privacy factor cluster is selected for the summary. If fewer than $m$ sentences have a risk score greater than $\alpha$, the summary will have fewer than $m$ sentences.

To find the privacy topics of a contract, we apply k-means [36] to sentence representations. Sentence representations are obtained by concatenating the word vectors. The number of clusters is set to $\min(m, |r|)$, where $r = \{s_i \mid p(y_i = 1) > \alpha\}$.

3 DATASET EXTRACTION
In this section, we explain the dataset that we compiled from the TOS;DR website and the privacy contracts of 151 companies. TOS;DR is a website dedicated to rating and explaining companies' privacy policies in plain English. Members of the website's community classify specific sections of privacy policies into "bad", "good", "blocker", and "neutral" categories and write summaries for them. We collected, from the companies' websites, the user agreement contracts of 151 services annotated on TOS;DR. Some companies have several such contracts, e.g., a privacy policy, terms of service, and a cookie policy. In that case, all the contracts were merged into a single document.

Next, we compared each sentence of the contract with the specific snippets annotated on TOS;DR. If the TOS;DR contributors had annotated the corresponding sentence or a very similar one, we used the same label. Otherwise, the sentence was annotated as "neutral". Our annotation schema assumes that a section the contributors did not annotate most likely contains no privacy risk, and it is therefore treated as neutral. NLTK was used to segment the contracts into sentences. Sentence similarity was measured as the Jaccard similarity of their vocabularies. Two sentences from the same contract were considered similar if the Jaccard similarity of their tokens was more than 50%.

We combined the "bad" and "blocker" sections to build the "risky" class. The "good" and "neutral" classes were combined to build the "non-risky" class. The resulting dataset is highly imbalanced, with 61,674 non-risky sentences and only 719 risky sentences. To build the ground-truth risk-aware summary of each privacy policy, we concatenate the plain English summaries of the snippets that have a "risky" label. Table 1 presents statistics for the 151 privacy policies and their corresponding summaries. Our dataset is available online.⁵

Table 1: The min, max, median, and average number of sentences in 151 privacy contracts and their summaries.

| Dataset                 | Min | Max  | Median | Mean  |
|-------------------------|-----|------|--------|-------|
| Privacy Policies        | 61  | 1707 | 350    | 411.6 |
| Plain English Summaries | 1   | 53   | 1      | 3.5   |

⁵ www.github.com/senjed/Summarization-of-Privacy-Policies

4 EXPERIMENTS
In this section, we discuss three things: our data augmentation mechanism for reducing the data imbalance problem, our hyperparameter choices for the risk classifier, and the training details. We discuss our evaluation criteria in Section 4.2.

4.1 Hyperparameters and Training Details
For the CNN model, we use two filter region sizes, 3 and 4, each with 50 output filters. A rectified linear unit is the activation function of the convolution layer. The pool size of the max-pooling operation is set to 50. We apply dropout with a rate of 20%. We optimize the binary cross-entropy loss using stochastic gradient descent with a learning rate of 0.01.

To account for class imbalance, we randomly under-sampled the majority class (non-risky) at a rate of 10%. We also apply SMOTE oversampling [37] to the minority class (risky) at a rate of 50%. We train our model on this resampled dataset for 20 epochs. We also weight the loss function inversely proportional to class frequencies in the input data. To set the risk threshold $\alpha$ in the content selection module, we used the ROC curve of the validation set of each fold. For each fold, $\alpha$ is the threshold value that achieves an 80% true positive rate.

4.2 Evaluation Metrics
Our experiments address two questions: (i) how well does our model identify the risky sentences in the contracts, and (ii) which content selection method leads to more "human-like" summaries? For the first question, we report the Macro-F1 and Micro-F1 scores of our classifier. For the second question, we compute the average F1-score of the ROUGE-1, ROUGE-2, and ROUGE-L [38] metrics for the summaries our model extracts. These metrics respectively measure the unigram overlap, bigram overlap, and longest common subsequence between the reference summary and the evaluated summary. ROUGE metrics fail to capture semantic similarity beyond n-grams [39]. We therefore also report the METEOR score [40], which goes beyond surface matches and accounts for stems and synonyms when finding matches.⁶

We evaluate our model using 5-fold cross-validation. In each fold, contracts of 96 companies are used for training and 24 for validation. The rest are used for testing. We explain our baselines in Section 4.3 and our experimental results in Section 5.

⁶ We use the pyrouge and NLTK Python packages to compute ROUGE and METEOR values, respectively.

4.3 Summarization Baselines
We compare our domain-aware extractive summarization model with the following unsupervised baselines. Unlike the evaluation setup in [16], we run the models on the entire contract. Some methods require a word limit as the budget. For these, we multiply a compression ratio $r$ by the average number of tokens across all contracts (10488.7). Similarly, multiplying $r$ by the average number of sentences across all contracts (413.1) gives the sentence limit.

• TextRank: An algorithm introduced in [14] that uses PageRank to compute an importance score for each sentence. Sentences with the highest importance scores are then extracted to build a summary until a word limit is satisfied.
• KLSum: Introduced in [15], KLSum greedily selects sentences to minimize the Kullback-Leibler (KL) divergence between the input document and the proposed summary.
• Lead-K: A common baseline in news summarization. It extracts the first k sentences of the document until a word limit is reached.
• Random: This baseline picks random sentences from the document until a word limit is satisfied. For this baseline, we report the average results over 10 runs.
• Upper Bound Baseline: This baseline picks all the sentences in a contract with the ground-truth label "risky". It indicates the performance upper bound of an extractive method on our dataset.

5 RESULTS
In this section, we discuss our experiments, conducted using 5-fold cross-validation. We shared our training details in Section 4.1. As an example, Figure 1 shows the summaries that our model and the baselines extracted from the privacy policy of Brainly.⁷ Both summaries generated by our method indicate that third-party advertising companies can collect information about the use of Brainly. KLSum misses this information. The traditional Lead-K heuristic, which is very effective for news, performs poorly on the contracts. This indicates the advantage of injecting domain-specific knowledge into content selection.

⁷ https://Brainly.com

Plain English Summary: The Privacy Policy states, "We and our third party partners may also use cookies and tracking technologies for advertising purposes.". In the Privacy Policy, it states that, "Although we do our best to honor the privacy preferences of our users, we are unable to respond to Do Not Track signals set by your browser at this time." The Privacy Poilicy says Brainly can track usage information and personal information "through a variety of tracking technologies, including cookies, web beacons, Locally Stored Objects (LSOs such as Flash or HTML5), log files, and similar technology (collectively, “tracking technologies”)." If Brainly aims to "preserve all content posted on the site," then we can conclude that such personal data is still necessary for the purpose of the site. There are places on the site where answers without usernames or profile pictures are visible. The Cookie Policy states, "Service oparator [sic] informs that restricting the use of cookies may affect some of the functionalities available on the Website." For users not in europe, brainly reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the brainly services, your brainly subscription, and/or any products, services, functionality, information, content or other material.

CNN + RF: We participate in interest-based advertising and use third party advertising companies to serve you targeted advertisements based on your online browsing history and your interests. We permit third party online advertising networks, social media companies and other third party services, to collect, information about your use of our service over time so that they may play or display ads on our service, on other websites, apps or services you may use, and on other devices you may use. We may share a common account identifier (such as an email address or user id) or hashed data with our third party advertising partners to help identify you across devices. Brainly reserves the right to moderate the Brainly services and to remove, screen, or edit your content from the Brainly services at our sole discretion, at any time, and for any reason or for no reason, with no notice to you. Brainly reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the Brainly services, your Brainly subscription, and/or any products, services, functionality, information, content or other materials available on, through or in connection with the Brainly services and/or your Brainly subscription, including, but not limited to, the mobile software, and/or your access to some or all of them without cause and without notice. In the event that Brainly suspends or terminates your account, the Brainly services or your Brainly subscription, you acknowledge and agree that you shall receive no refund or exchange for any unused time on a Brainly subscription or any subscription fees or anything else.

CNN + CF: We participate in interest-based advertising and use third party advertising companies to serve you targeted advertisements based on your online browsing history and your interests. We permit third party online advertising networks, social media companies and other third party services, to collect, information about your use of our service over time so that they may play or display ads on our service, on other websites, apps or services you may use, and on other devices you may use. We may share a common account identifier (such as an email address or user id) or hashed data with our third party advertising partners to help identify you across devices. To the fullest extent permitted by applicable law, no arbitration or claim under these terms shall be joined to any other arbitration or claim, including any arbitration or claim involving any other current or former user of the Brainly services or a Brainly subscription, and no class arbitration proceedings shall be permitted. We may modify or update this privacy policy from time to time to reflect the changes in our business and practices, and so you should review this page periodically. If you object to any changes, you may close your account. Continuing to use our service after we publish changes to this privacy policy means that you are consenting to the changes.

Lead-K: Welcome to Brainly!. Brainly operates a group of social learning networks for students and educators. Brainly inspires students to share and explore knowledge in a collaborative community and engage in peer-to-peer educational assistance, which is made available on www.Brainly.com and any www.Brainly.com sub-domains(the “website”) as well as the Brainly.com mobile application (the “app”) (the “website” and the “app” are collectively the “Brainly services”. We have two sets of terms and conditions: part(a) sets out the terms that apply to our users unless you are based in Europe and part (b) sets out the terms that apply to our users in Europe. It is important that you read and understand the terms that apply to you when you use the Brainly services before using the Brainly services. Part (a): terms and conditions applicable to users unless you are based in Europe. This part and the documents referred to within it set out the terms and conditions that apply to your use of Brainly services if you access Brainly services from within the united states or other countries except Europe.

KLSum: Brainly reserves the right, in its sole discretion, to immediately modify, suspend or terminate your account, the Brainly services, your Brainly subscription, and/or any products, services, functionality, information, content or other materials available on, through or in connection with the Brainly services and/or your Brainly subscription, including, but not limited to, the mobile software, and/or your access to some or all of them without cause and without notice. Brainly makes no warranty that the Brainly services and/or any products, services, functionality, information, content or other materials available on, through or in connection with the Brainly services or your Brainly subscription, including, but not limited to, the mobile software, will meet your requirements, or that the Brainly services or Brainly subscriptions will operate uninterrupted or in a timely, secure, or error-free manner, or as to the accuracy or completeness of any information or content accessible from or provided in connection with the Brainly services or Brainly subscriptions, regardless of whether any information or content is marked as “verified”. You must not: use Brainly services other than for its intended purpose as set out in the terms of use;

Figure 1: The summaries extracted by our model (CNN + RF and CNN + CF) and the baselines from the privacy policy and cookie policy of Brainly at a compression ratio of 1/64.

5.1 Classification Results:
In this section, we evaluate the model discussed in Section 2.1.1. We also study how the different content selection mechanisms affect the risk prediction task. We evaluate our summaries at two compression ratios, 1/64 and 1/16. The summarization budget $m$ at each compression ratio $r$ is obtained by multiplying $r$ by the average number of sentences (or words) in the contracts. At a compression ratio of 1/64, summaries are therefore restricted to at most 6 sentences or 164 words. At a compression ratio of 1/16, they are limited to at most 29 sentences or 656 words.

Table 2 reports the precision, recall, Micro-F1, and Macro-F1 of our risk classifier under the two content selection mechanisms, risk-focused (RF) and coverage-focused (CF). The Micro-F1 scores of both content selection methods are quite high. The best Macro-F1 value, 61.94, is achieved by the risk-focused approach. The large gap between Micro-F1 and Macro-F1 is due to the high level of class imbalance in our dataset (1 positive sample for every 100 negative samples). At a 1/64 compression ratio, the risk-focused method performs more than two times better in terms of recall.

At a compression ratio of 1/16, the risk-focused method captures many more risky sections and achieves a recall of 59.74. However, this increase in recall comes with a higher false positive rate. The coverage-focused method, on the other hand, is better at preserving precision at higher budgets. Its precision drops by only 7.45 points while its recall rises by 28.59 points. This happens because coverage-focused selection only extracts sentences with a risk score greater than $\alpha$, which naturally puts an upper bound on the false positive rate.

We conclude that both mechanisms are moderately successful at identifying the risky sections of contracts. At higher compression ratios, the risk-focused mechanism suits settings where recall matters most, and the coverage-focused mechanism suits settings where precision matters more. In the next section, we examine whether the domain information from the risk classifier improves summary quality compared with domain-agnostic extractive summarization baselines.

Table 2: Precision (P), Recall (R), Macro-F1, and Micro-F1 of the CNN classifier with two different content selection mechanisms, risk-focused (RF) and coverage-focused (CF), at two different compression ratios, 1/16 and 1/64.

| Model    | P (1/64) | R (1/64) | Macro-F1 (1/64) | Micro-F1 (1/64) | P (1/16) | R (1/16) | Macro-F1 (1/16) | Micro-F1 (1/16) |
|----------|----------|----------|-----------------|-----------------|----------|----------|-----------------|-----------------|
| CNN + RF | 22.40    | 28.13    | 61.94           | 98.01           | 9.86     | 59.74    | 56.65           | 93.10           |
| CNN + CF | 19.64    | 24.06    | 60.26           | 97.95           | 12.19    | 52.65    | 58.51           | 94.94           |

5.2 Summarization Results:
In this section, we evaluate the quality of the summaries extracted by our model and the baselines. We introduced our evaluation metrics in Section 4.2 and our baselines in Section 4.3. We compare the summaries against two types of reference summaries. The first type assembles all the sentences with the ground-truth "risky" label. These sentences come directly from the text of the contract, and we refer to this reference as the "quote text" reference. The second type assembles the plain English summaries of the "risky" sections written by the TOS;DR contributors. Table 3 presents the results against the quote text references, and Table 4 presents the results against the plain English references.

Table 3: ROUGE-1, ROUGE-2, ROUGE-L, and METEOR scores of our model (CNN + RF and CNN + CF rows) in comparison to the baselines at compression ratios 1/64 and 1/16. RF refers to risk-focused content selection, while CF refers to coverage-focused content selection. The quote text of the risky sections was used to build the reference summaries.

| Model    | ROUGE-1 (1/64) | ROUGE-2 (1/64) | ROUGE-L (1/64) | METEOR (1/64) | ROUGE-1 (1/16) | ROUGE-2 (1/16) | ROUGE-L (1/16) | METEOR (1/16) |
|----------|----------------|----------------|----------------|---------------|----------------|----------------|----------------|---------------|
| CNN + RF | 43.09          | 31.21          | 36.80          | 41.98         | 34.0           | 24.96          | 24.83          | 40.03         |
| CNN + CF | 40.45          | 28.69          | 34.01          | 41.55         | 37.93          | 28.82          | 29.23          | 43.91         |
| Textrank | 28             | 13.89          | 22.06          | 22.4          | 33.78          | 22.12          | 26.85          | 35.49         |
| KLSum    | 28.75          | 13.14          | 23.53          | 25.34         | 24.74          | 11.36          | 18.86          | 26.95         |
| Lead-k   | 25.57          | 9.09           | 20.25          | 19.54         | 25.67          | 11.33          | 19.77          | 26.85         |
| Random   | 24.26          | 6.45           | 18.78          | 18.11         | 24.43          | 9.85           | 18.08          | 27.01         |

Table 4: Performance of our model (CNN + RF and CNN + CF rows) in comparison to the baselines at compression ratios 1/64 and 1/16. RF refers to risk-focused content selection, while CF refers to coverage-focused content selection. The plain English summaries of the risky sections were used to build the reference summaries.

| Model       | ROUGE-1 (1/64) | ROUGE-2 (1/64) | ROUGE-L (1/64) | METEOR (1/64) | ROUGE-1 (1/16) | ROUGE-2 (1/16) | ROUGE-L (1/16) | METEOR (1/16) |
|-------------|----------------|----------------|----------------|---------------|----------------|----------------|----------------|---------------|
| Upper Bound | 22.45          | 13.7           | 18.27          | 22.32         | 22.56          | 13.95          | 18.49          | 23.03         |
| CNN + RF    | 13.97          | 6.08           | 9.83           | 16.58         | 9.07           | 3.94           | 5.53           | 12.07         |
| CNN + CF    | 12.39          | 4.81           | 8.51           | 14.93         | 10.18          | 4.54           | 6.58           | 13.16         |
| Textrank    | 10.94          | 2.78           | 7.51           | 11.2          | 10.08          | 3.37           | 6.37           | 12.47         |
| KLSum       | 10.96          | 2.43           | 7.34           | 12.54         | 8.37           | 1.92           | 5.26           | 11.06         |
| Lead-k      | 11.21          | 1.9            | 7.9            | 11.04         | 9.33           | 2.44           | 5.96           | 11.87         |
| Random      | 11.44          | 1.87           | 8.03           | 12.02         | 9.13           | 2.32           | 5.73           | 12.45         |

5.2.1 Extracting the risky content: Table 3 shows that both variations of our model outperform the baselines at both compression ratios. The one exception is ROUGE-L at 1/16, where CNN + RF trails TextRank. At a compression ratio of 1/64, CNN + RF achieves the best ROUGE and METEOR results. Compared with the best-performing domain-agnostic baseline for each metric, it improves ROUGE-1 by 49.8%, ROUGE-2 by 124.6%, ROUGE-L by 56.3%, and METEOR by 65.6%. At a compression ratio of 1/16, CNN + CF achieves the best results. Compared with the best-performing baseline for each metric, it improves ROUGE-1 by 12.2%, ROUGE-2 by 30.2%, ROUGE-L by 8.8%, and METEOR by 23.7%. The improvement in METEOR score is statistically significant under the Wilcoxon signed-rank test [41], with p-value < 0.01 (Bonferroni corrected [42] to account for multiple testing).

As in the classification task, risk-focused content selection achieves higher recall and therefore a better METEOR score than the coverage-focused mechanism. However, as the summarization budget grows, the ROUGE values for this method drop slightly. The reason is that in most contracts the number of risky sentences is smaller than the budget at a ratio of 1/16 (29 sentences).

5.2.2 Building Human-like summaries: Table 4 presents our summarization results using the plain English summaries as references. At a compression ratio of 1/64, both variations of our model outperform the baselines. Our CNN + RF model increases the METEOR score by 32.2% over KLSum and by 48% over TextRank, and this improvement is statistically significant (p-value < 0.01). CNN + CF outperforms the baselines on all evaluation metrics, but the improvement is not statistically significant.

At a compression ratio of 1/16, CNN + CF outperforms all domain-agnostic baselines, although this improvement is not statistically significant either. At this compression ratio, CNN + RF achieves results comparable to TextRank. Our experiments show that our domain-aware extractive model does moderately better than the baselines at lower compression ratios. However, the plain English summaries of TOS;DR are highly abstractive [16], so a fully extractive approach cannot mimic the human-like qualities [...]

[...] of the moderate success in classification on our realistically imbalanced dataset, we observed a noticeable improvement in ROUGE and METEOR metrics compared with domain-agnostic baselines.

We believe the summaries generated by our method can be improved in several ways. First, the classifier itself and the redundancy reduction system could be improved, bringing content selection performance closer to the upper-bound scores derived using a perfect classifier. Second, our summaries would be more accessible if written in plain English rather than legalese [2]. An abstractive system could be used to rewrite the contract text in this way. However, abstractive summaries should not change the legal interpretation of the content. They should also be linkable to the original content in order to be considered binding.
In addition to im- in the plain English summaries. This can also be seen by looking at proving the system, it is also necessary to conduct more extensive the performance of the upper bound baseline. evaluation experiments, involving human readers as well as auto- mated metrics. This will help determine the most effective ways to 6 CONCLUSION AND DISCUSSION present information from click-through contracts so that users can In this paper, we proposed a domain-aware extractive model for understand their terms and make a more informed decision. We are summarizing the privacy contracts. Our model, employs a convolu- planning to explore if the risk classifier module can be used indepen- tional neural network to identify risky sections of the contracts. We dently to enhance the productivity of annotators by identifying the build summaries by using a risk-focused and a coverage-focused sections that need to be summarised. This can potentially facilitate content selection mechanism. Our approach enables users to select annotating larger resources for training abstractive models. the content to be summarized within a controllable length while relying on substantially less training data in comparison to the exist- ACKNOWLEDGEMENT ing supervised summarization methods. Our two different content We are immensely grateful to Prof. Junyi Jessy Li, Prof. Bryan selection mechanisms enable users to build budgeted summaries H. Choi, Dr. Daniel Preoţiuc-Pietro, Mayank Kulkarni, and three of contracts based on their preference of coverage vs risk. In spite anonymous reviewers for valuable discussions. Toward Domain-Guided Controllable Summarization of Privacy Policies NLLP @ KDD 2020, August 24th, San Diego, US REFERENCES [34] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. [1] Lorrie Faith Cranor, Praveen Guduru, and Manjula Arjula. User interfaces for IEEE transactions on Signal Processing, 45, 1997. privacy agents. TOCHI, 2006. 
[35] Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional [2] Jonathan A Obar and Anne Oeldorf-Hirsch. The biggest lie on the internet: transformers for language understanding. arXiv:1810.04805, 2018. Ignoring the privacy policies and terms of service policies of social networking [36] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth services. ICS, 2020. Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis [3] Aleecia M McDonald and Lorrie Faith Cranor. The cost of reading privacy policies. and implementation. IEEE TPAMI, 2002. Isjlp, 2008. [37] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. [4] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model Smote: synthetic minority over-sampling technique. JAIR, 2002. for abstractive sentence summarization. arXiv:1509.00685, 2015. [38] Chin-Yew Lin and Eduard Hovy. Manual and automatic evaluation of summaries. [5] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, et al. Abstractive text summa- In ACL, 2002. rization using sequence-to-sequence rnns and beyond. arXiv:1602.06023, 2016. [39] Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. Recent advances in document [6] Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. Distraction- summarization. Knowledge and Information Systems, 2017. based neural networks for modeling document. In IJCAI, 2016. [40] Michael Denkowski and Alon Lavie. Meteor universal: Language specific trans- [7] Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summa- lation evaluation for any target language. In Proceedings of the ninth workshop rization with pointer-generator networks. arXiv:1704.04368, 2017. on statistical machine translation, 2014. [8] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. Abstractive document summarization [41] Frank Wilcoxon, SK Katti, and Roberta A Wilcox. 
Critical values and probability with a graph-based attentional neural model. In ACL, 2017. levels for the wilcoxon rank sum test and the wilcoxon signed rank test. Selected [9] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model tables in mathematical statistics, 1970. for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017. [42] Charles W Dunnett. New tables for multiple comparisons with a control. Bio- [10] Ritesh Sarkhel*, Moniba Keymanesh*, Arnab Nandi, and Srinivasan Parthasarathy. metrics, 1964. Transfer learning for abstractive summarization at controllable budgets. arXiv:2002.07845, 2020. [11] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI, 2017. [12] Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srini- vasan, and Dragomir Radev. Graph-based neural multi-document summarization. arXiv preprint arXiv:1706.06681, 2017. [13] Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. Faithful to the original: Fact aware neural abstractive summarization. In AAAI, 2018. [14] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In EMNLP, 2004. [15] Aria Haghighi and Lucy Vanderwende. Exploring content models for multi- document summarization. In NAACL, 2009. [16] Laura Manor and Junyi Jessy Li. Plain english summarization of contracts. arXiv:1906.00424, 2019. [17] Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. Bottom-up abstrac- tive summarization. arXiv preprint arXiv:1808.10792, 2018. [18] Frederick Liu, Shomir Wilson, Peter Story, et al. Towards automatic classification of privacy policy text. 2018. [19] Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, et al. The creation and analysis of a website privacy policy corpus. In ACL, 2016. [20] Sebastian Zimmeck and Steven M Bellovin. Privee: An architecture for automati- cally analyzing web privacy policies. 2014. 
[21] Welderufael B Tesfay, Peter Hofmann, Toru Nakamura, Shinsaku Kiyomoto, and Jetzabel Serna. Privacyguide: Towards an implementation of the eu gdpr on internet privacy policy evaluation. In IWSPA, 2018. [22] Razieh Nokhbeh Zaeem, Rachel L German, and K Suzanne Barber. Privacycheck: Automatic summarization of privacy policies using data mining. TOIT), 2018. [23] Najmeh Mousavi Nejad, Damien Graux, and Diego Collarana. Towards measuring risk factors in privacy policies. In ICAIL, 2019. [24] Hamza Harkous, Kassem Fawaz, Rémi Lebret, et al. Polisis: Automated analysis and presentation of privacy policies using deep learning. 2018. [25] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. JMLR, 12(Aug):2493–2537, 2011. [26] Yoon Kim. Convolutional neural networks for sentence classification. arXiv:1408.5882, 2014. [27] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014. [28] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv:1510.03820, 2015. [29] Matthew E Peters, Mark Neumann, Mohit Iyyer, et al. Deep contextualized word representations. arXiv:1802.05365, 2018. [30] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Rus- lan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012. [31] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning repre- sentations by back-propagating errors. nature, 1986. [32] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108, 2017. 
[33] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
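Section 5.2.2 reports relative METEOR gains for CNN + RF at a compression ratio of 1/64: 32.2% over KLSum and 48% over TextRank. These gains can be recomputed from the Table 4 scores. The Python sketch below does that check. It assumes each gain is measured relative to the baseline score, (ours - baseline) / baseline. The text does not state this formula, but it reproduces the quoted figures. The names `METEOR_1_64` and `relative_gain` are illustrative only and are not part of the authors' code.

```python
# METEOR scores at compression ratio 1/64 with plain English reference
# summaries, copied from Table 4. The dictionary name is illustrative.
METEOR_1_64 = {
    "Upper Bound": 22.32,
    "CNN + RF": 16.58,
    "CNN + CF": 14.93,
    "TextRank": 11.2,
    "KLSum": 12.54,
    "Lead-k": 11.04,
    "Random": 12.02,
}


def relative_gain(ours: float, baseline: float) -> float:
    """Return the relative improvement of `ours` over `baseline`, in percent.

    Assumption: "increases the METEOR score by X%" is read as
    (ours - baseline) / baseline, not as an absolute difference in points.
    """
    return 100.0 * (ours - baseline) / baseline


if __name__ == "__main__":
    for name in ("KLSum", "TextRank"):
        gain = relative_gain(METEOR_1_64["CNN + RF"], METEOR_1_64[name])
        print(f"CNN + RF vs {name}: {gain:.1f}% METEOR gain")
```

Running the script prints gains of 32.2% over KLSum and 48.0% over TextRank. An absolute difference would give 4.04 and 5.38 METEOR points, which do not match the quoted percentages. That mismatch is why the relative form is assumed here.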