Privacy-Preserving Important Passage Retrieval

Luís Marujo1,2,3, José Portêlo2,3, David Martins de Matos2,3, João P. Neto2,3, Anatole Gershman1, Jaime Carbonell1, Isabel Trancoso2,3, Bhiksha Raj1
1 Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
2 Instituto Superior Técnico, Lisboa, Portugal; 3 INESC-ID, Lisboa, Portugal
{lmarujo,anatoleg,jgc,bhiksha}@cs.cmu.edu, {jose.portelo,david.matos,joao.neto,isabel.trancoso}@inesc-id.pt

ABSTRACT
State-of-the-art important passage retrieval methods obtain very good results, but do not take privacy issues into account. In this paper, we present a privacy-preserving method that relies on creating secure representations of documents. Our approach allows third parties to retrieve important passages from documents without learning anything regarding their content. We use a hashing scheme known as Secure Binary Embeddings to convert a key phrase and bag-of-words representation to bit strings in a way that allows the computation of approximate distances instead of exact ones. Experiments show that our secure system yields results similar to those of its non-private counterpart on both clean text and noisy speech-recognized text.

A potential problem for the deployment of such methods is that they usually assume that the input data are in the public domain. However, these data may come from social network profiles, medical records or other private documents, and their owners may not want to, or may not even be allowed to, share them with third parties. Consider the scenario where a company has millions of classified documents. The company needs to retrieve the most important passages from those documents, but lacks the computational power or know-how to do so. At the same time, it cannot give access to the documents to a third party with such capabilities, because they may contain sensitive information. As a result, the company must obfuscate its own data before sending it to the third party,
a requirement that is seemingly at odds with the objective of extracting important passages from it.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]; I.2.7 [Natural Language Processing]: Text analysis; K.4.1 [Computers and Society]: Public Policy Issues—privacy

General Terms
Algorithms, Experimentation

Keywords
Secure Passage Retrieval, Important Passage Retrieval, KP-Centrality, Secure Binary Embeddings, Data Privacy, Automatic Key Phrase Extraction

1. INTRODUCTION
Important Passage Retrieval (IPR) is the problem of extracting the most important passages in a body of text. By “important”, we mean those passages that capture most of the key information the text is attempting to convey.

In this paper, we propose a new privacy-preserving technique for IPR based on Secure Binary Embeddings (SBE) [3] that enables exactly this: it provides a mechanism for obfuscating the data, while still achieving near state-of-the-art performance in IPR.

SBEs are a form of locality-sensitive hashing which converts data arrays such as bag-of-words vectors to obfuscated bit strings through a combination of random projections followed by banded quantization. The method has information-theoretic guarantees of security, ensuring that the original data cannot be recovered from the bit strings. At the same time, it also provides a mechanism for locally computing distances between vectors that are close to one another, without revealing the global geometry of the data, consequently enabling tasks such as IPR. This is possible because, unlike other hashing methods which require exact matches for performing classification tasks, SBEs allow for near-exact matching: the hashes can be used to estimate the distances between vectors that are very close, but provably provide no information whatsoever about the distance between vectors that are farther apart. The usefulness of SBE has already been shown for implementing a privacy-preserving speaker verification system [21], yielding promising results.
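To make the retrieval task concrete, the toy sketch below (our own illustration with made-up passages; a generic centrality baseline, not the KP-Centrality model used in this paper) ranks passages by their summed cosine similarity to every other passage:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_passages(passages):
    """Rank passages by aggregate similarity to all other passages
    (a minimal centrality heuristic, for illustration only)."""
    bows = [Counter(p.lower().split()) for p in passages]
    scores = [sum(cosine(bi, bj) for j, bj in enumerate(bows) if j != i)
              for i, bi in enumerate(bows)]
    return sorted(range(len(passages)), key=lambda i: -scores[i])

passages = [
    "the company stores millions of classified documents",
    "classified documents may contain sensitive information",
    "weather stayed pleasant all day long",
]
order = rank_passages(passages)  # indices, most central passage first
```

The two passages that share vocabulary reinforce each other and rank above the unrelated one; centrality-based IPR methods refine this basic intuition.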
Of the various solutions proposed, state-of-the-art solutions for IPR based on centrality achieve excellent results [24].

The remainder of the paper is structured as follows. In Section 2 we briefly present related work on Important Passage Retrieval and on privacy-preserving methods in IR. In Section 3 we detail the two stages of the important passage retrieval technique. Section 4 presents the method for obtaining a secure document representation. We describe our approach to privacy-preserving important passage retrieval in Section 5. Section 6 describes the datasets used and illustrates our approach with some experiments. Finally, we present some conclusions and plans for future work.

Copyright is held by the author/owner(s).
PIR'14, Privacy-Preserving IR: When Information Retrieval Meets Privacy and Security, SIGIR 2014 Workshop, July 11th, 2014, Gold Coast, Australia.

2. RELATED WORK

2.1 Important Passage Retrieval
Text and speech information sources influence the complexity of important passage retrieval approaches differently. For textual passage retrieval, it is common to use complex information, such as syntactic [30], semantic [28], and discourse information [29], either to assess relevance or to reduce the length of the output. However, speech important passage retrieval approaches face an extra layer of complexity, caused by speech-related issues like recognition errors or disfluencies. As a result, it is useful to exploit speech-specific information (e.g., acoustic/prosodic features [15], recognition confidence scores [33]), or to improve both the assessment of relevance and the intelligibility of automatic speech recognizer transcriptions (by using related information [23]). These problems not only increase the difficulty of determining the salient information, but also constrain the applicability of passage retrieval techniques to speech passage retrieval. Nevertheless, shallow text summarization approaches such as Latent Semantic Analysis (LSA) [9] and Maximal Marginal Relevance (MMR) [4] seem to achieve performance comparable to approaches using specific speech-related features [20]. In addition, discourse features are starting to gain some importance in speech retrieval [15, 35].

Closely related to the important passage retrieval used in this work are approaches based on unsupervised key phrase extraction methods, which are used to reinforce passage retrieval [34, 31, 11, 25, 27]. Namely, they propose the use of key phrases to summarize news articles [11] and meetings [25]. In [11], the authors explored both supervised and unsupervised methods with a limited set of features to extract key phrases as a first step towards important passage retrieval. Furthermore, the important passage retrieval method used in this work adapts the centrality retrieval model, which plays an important role in the whole process. This kind of model adaptation is explored in [25], where the first stage of their method consists of a simple key phrase extraction step based on part-of-speech patterns; these key phrases are then used to define the relevance and redundancy components of an MMR summarization model.

Most IPR methods could easily be made secure using the method described in Section 4. We opted to use the KP-Centrality method described in the next section because it is the current state-of-the-art IPR method.

2.2 Privacy Preserving Methods
In this work, we focused on creating a method that performs important passage retrieval while keeping the information in the original documents private. There is a large body of literature on important passage retrieval and on privacy-preserving or secure methods. To the best of our knowledge, the combination of both research lines has not been explored yet. However, there are some recent works combining information retrieval and privacy. Most of these works use data encryption [8, 17, 7, 12] to transfer the data in a secure way. This does not solve our problem, because the content of the document would be decrypted by the retrieval method and would therefore not remain confidential to it. Another alternative secure information retrieval methodology is to obfuscate queries, which hides the user's topical intention [19], but does not secure the documents' content.

In many areas, the interest in privacy-preserving methods where two or more parties wish to jointly perform a given operation without disclosing their private information is not new, and several techniques, such as Garbled Circuits [32], Homomorphic Encryption [18] and Locality-Sensitive Hashing [6], have been introduced. However, they all have limitations regarding the Important Passage Retrieval task we wish to address. Until recently, Garbled Circuits were extremely inefficient to use due to several intrinsic issues, and even now it is difficult to adapt them when the computation of non-linear operations is required. Solutions to many of these problems have been developed, such as performing offline computation of the oblivious transfers, using shorter ciphers, evaluating XOR gates for 'free', etc. [1]. Systems based on Homomorphic Encryption introduce substantial computational overhead and usually require extremely long amounts of time to evaluate any function of interest. Locality-Sensitive Hashing allows for near-exact match detection between data points, but does not provide any actual notion of distance, leading to degraded performance in some applications. As a result, we chose Secure Binary Embeddings [3] as the data privacy method for our approach, as it exhibits none of the disadvantages mentioned above for the task at hand. We describe this technique in depth in Section 4.

3. IMPORTANT PASSAGE RETRIEVAL
To determine the most important sentences of an information source, we used the KP-Centrality model [24]. We chose this model for its adaptability to different types of information sources (e.g., text, audio and video) and its state-of-the-art performance. It is based on the notion of combining key phrases with support sets. A support set is a group of the most semantically related passages. These semantic passages are selected using heuristics based on the passage order method [22]. This type of heuristic explores the structure of the input source to partition the candidate passages for inclusion in the support set into two subsets: the ones closer to the passage associated with the support set under construction, and the ones farther apart. These heuristics use a permutation, d_{i1}, d_{i2}, ..., d_{i(N-1)}, of the distances of the passages s_k to the passage p_i related to the support set under construction, with d_{ik} = dist(s_k, p_i), 1 ≤ k ≤ N−1, where N is the number of passages, corresponding to the order of occurrence of the passages s_k in the input source. The metric normally used is the cosine distance.

The KP-Centrality method consists of two steps, as illustrated in Figure 1. First, it extracts key phrases using a supervised approach [13] and combines them with a bag-of-words model in a compact matrix representation, given by:

  [ w(t_1, p_1) ... w(t_1, p_N)   w(t_1, k_1) ... w(t_1, k_M) ]
  [     ...            ...             ...            ...     ]   (1)
  [ w(t_T, p_1) ... w(t_T, p_N)   w(t_T, k_1) ... w(t_T, k_M) ]

where w is a function of the number of occurrences of term t_i in passage p_j or key phrase k_l, T is the number of terms, and M is the number of key phrases. Then, using a segmented information source I ≜ p_1, p_2, ..., p_N, a support set S_i is computed for each passage p_i using:
  S_i ≜ {s ∈ I ∪ K : sim(s, p_i) > ε_i ∧ s ≠ p_i},   (2)

for i = 1, ..., N + M. Passages are then ranked, excluding the key phrases K (artificial passages), according to:

  arg max_{s ∈ (∪_{i=1}^{n} S_i) − K} |{S_i : s ∈ S_i}|.   (3)

Figure 1: Flowchart of the Important Passage Retrieval method.

4. SECURE BINARY EMBEDDINGS
A Secure Binary Embedding (SBE) is a scheme that converts real-valued vectors to bit sequences using band-quantized random projections. These bit sequences, which we will refer to as hashes, possess an interesting property: if the Euclidean distance between two vectors is lower than a threshold, the Hamming distance between their hashes is proportional to the Euclidean distance between the vectors; if it is higher, the hashes provide no information about the true distance between the two vectors. This scheme relies on the concept of Universal Quantization [2], which redefines scalar quantization by forcing the quantization function to have non-contiguous quantization regions.

Given an L-dimensional vector x ∈ R^L, the universal quantization process converts it to an M-bit binary sequence, where the m-th bit is given by

  q_m(x) = Q( (⟨x, a_m⟩ + w_m) / ∆ ).   (4)

Here ⟨·, ·⟩ represents a dot product, a_m ∈ R^L is a projection vector comprising L i.i.d. samples drawn from N(µ = 0, σ²), ∆ is a precision parameter, and w_m is a random dither drawn from a uniform distribution over [0, ∆]. Q(·) is a quantization function given by Q(x) = ⌊x⌋ mod 2. We can represent the complete quantization into M bits compactly in vector form:

  q(x) = Q( ∆^{-1} (Ax + w) ),   (5)

where q(x) is an M-bit binary vector, which we will refer to as the hash of x, A ∈ R^{M×L} is a matrix composed of the row vectors a_m, ∆ is a diagonal matrix with entries ∆, and w ∈ R^M is a vector composed of the dither values w_m.

The universal 1-bit quantizer of Equation 4 maps the real line onto 1/0 in a banded manner, where each band is ∆ wide. Figure 2 compares conventional scalar 1-bit quantization (left panel) with the equivalent universal 1-bit quantization (right panel).

Figure 2: 1-bit quantization functions.

The binary hash generated by the Universal Quantizer of Equation 5 has the following property [3]: the probability that the i-th bits, q_i(x) and q_i(x′), of the hashes of two vectors x and x′ are identical depends only on the Euclidean distance d_E = ‖x − x′‖ between the vectors, and not on their actual values. As a consequence, the following relationship can be shown [3]: given any two vectors x and x′ at Euclidean distance d_E, except with probability at most e^{−2t²M}, the normalized (per-bit) Hamming distance d_H(q(x), q(x′)) between the hashes of x and x′ is bounded by

  1/2 − (1/2) e^{−(π σ d_E / (√2 ∆))²} − t  ≤  d_H(q(x), q(x′))  ≤  1/2 − (4/π²) e^{−(π σ d_E / (√2 ∆))²} + t,   (6)

where t is the control factor.

The above bound means that the Hamming distance d_H(q(x), q(x′)) is correlated with the Euclidean distance d_E between the two vectors if d_E is lower than a threshold (which depends on ∆). Specifically, for small d_E, the expected Hamming distance E[d_H(q(x), q(x′))] can be shown to be bounded by √(2π^{-1}) σ ∆^{-1} d_E, which is linear in d_E. However, if the distance between x and x′ is higher than this threshold, then d_H(q(x), q(x′)) is bounded by 0.5 − 4π^{-2} exp(−0.5 π² σ² ∆^{-2} d_E²), which rapidly converges to 0.5 and effectively gives us no information whatsoever about the true distance between x and x′.

In order to illustrate how this scheme works, we randomly generated pairs of vectors in a high-dimensional space (L = 1024) and plotted the normalized Hamming distance between their hashes against the Euclidean distance between them (Figure 3). The number of bits in each hash is also shown in the figures.

Figure 3: SBE behavior as a function of ∆, for two values of M.

We note that in all cases, once the normalized distance exceeds ∆, the Hamming distance between the hashes of two vectors ceases to provide any information about the true distance between the vectors.
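This behavior is easy to reproduce numerically. The sketch below (our own NumPy illustration of Equations 4 and 5, not the authors' code; all parameter values are arbitrary) hashes a vector, a nearby vector, and a distant one, and compares the normalized Hamming distances of the hashes:

```python
import numpy as np

rng = np.random.default_rng(0)

L, M = 1024, 4096          # input dimensionality and number of hash bits
sigma, delta = 1.0, 1.0    # projection scale and precision parameter

A = rng.normal(0.0, sigma, size=(M, L))   # random projection matrix (rows a_m)
w = rng.uniform(0.0, delta, size=M)       # random dither in [0, delta]

def sbe_hash(x):
    """Universal 1-bit quantization: q(x) = floor((Ax + w)/delta) mod 2."""
    return np.floor((A @ x + w) / delta).astype(int) % 2

def hamming(q1, q2):
    """Normalized (per-bit) Hamming distance between two hashes."""
    return float(np.mean(q1 != q2))

x = rng.normal(size=L) / np.sqrt(L)                  # roughly unit-norm vector
near = x + 0.05 * rng.normal(size=L) / np.sqrt(L)    # small Euclidean distance
far = rng.normal(size=L) / np.sqrt(L)                # large Euclidean distance

d_near = hamming(sbe_hash(x), sbe_hash(near))  # small: tracks d_E
d_far = hamming(sbe_hash(x), sbe_hash(far))    # saturates near 0.5
```

For the nearby pair the Hamming distance remains proportional to the small Euclidean distance, while for the distant pair it saturates near 0.5, revealing essentially nothing about the true distance.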
We will find this property useful in developing our privacy-preserving system. Changing the value of the precision parameter ∆ allows us to adjust the distance threshold up to which the Hamming distance is informative. Increasing the number of bits M leads to a reduction of the variance of the Hamming distance. A converse property of the embeddings is that, for all x′ except those that lie within a small radius of any x, d_H(q(x), q(x′)) provides little information about how close x′ is to x. It can be shown that the embedding provides information-theoretic security beyond this radius if the embedding parameters A and w are unknown to the potential eavesdropper. Any algorithm attempting to recover a signal x from its embedding q(x), or to infer anything about the relationship between two signals sufficiently far apart using only their embeddings, will fail to do so. Furthermore, even in the case where A and w are known, it appears computationally intractable to derive x from q(x) unless one can guess a starting point very close to x. In effect, it is infeasible to invert the SBE without strong a priori assumptions about x.

5. SECURE IMPORTANT PASSAGE RETRIEVAL
Our approach to a privacy-preserving important passage retrieval system closely follows the formulation presented in Section 3, and is illustrated in Figure 4. However, there is a very important difference in terms of who performs each of the steps. Typically there is only one party involved, Alice, who owns the original documents, performs key phrase extraction, combines the key phrases with the bag-of-words model in a compact matrix representation, computes the support sets for each document, and finally uses them to retrieve the important passages. In our scenario, Alice does not know how to extract the important passages or does not possess the computational power to do so. Therefore, she must outsource the retrieval process to a third party, Bob, who has these capabilities. However, Alice must first obfuscate the information contained in the compact matrix representation. If Bob received this information as is, he could use the term frequencies to infer the contents of the original documents and gain access to private or classified information Alice does not wish to disclose to anyone. Alice therefore computes binary hashes of her compact matrix representation using the method described in Section 4, keeping the randomization parameters A and w to herself. She sends these hashes to Bob, who computes the support sets and extracts the important passages. Because Bob receives binary hashes instead of the original matrix representation, he must use the normalized Hamming distance instead of the cosine distance in this step, since it is the metric the SBE hashes best relate to. Finally, Bob returns the hashes corresponding to the important passages to Alice, who then uses them to recover the information she desires.

Figure 4: Flowchart of the Secure Important Passage Retrieval method.

6. EXPERIMENTS
In this section we illustrate the performance of our privacy-preserving approach to Important Passage Retrieval and how it compares to its non-private counterpart. We start by presenting the datasets used in our experiments, then describe the experimental setup, and finally present some results.

6.1 Datasets
In order to evaluate our approach, we performed experiments on the English version of the Concisus dataset [26] and the Portuguese Broadcast News (PT BN) dataset [23]. The Concisus dataset is composed of seventy-eight event reports and respective summaries, distributed across three different types of events: aviation accidents, earthquakes, and train accidents. This corpus also contains comparable data in Spanish. However, since our Automatic Key Phrase Extraction (AKE) system uses some language-dependent features, we opted not to use this part of the dataset, either in previous work [24] or in this one.

The PT BN dataset consists of automatic transcriptions of eighteen broadcast news stories in European Portuguese, which are part of a news program. The news stories cover several generic topics, such as society, politics and sports, among others. For each news story, there is a human-produced abstract, used as reference.

6.2 Setup
We extracted key phrases from both datasets using the MAUI toolkit [16], expanded with shallow semantic features such as the number of named entities, part-of-speech tags and four n-gram domain model probabilities. This expanded feature set leads to improvements in the quality of the extracted key phrases. For the Concisus dataset, we extracted yet additional features, such as the detection of rhetorical devices, which further improved the key phrase extraction process [13]. As for the PT BN dataset, we only used the shallow semantic features, as the remaining features were not available [14].

6.3 Results
We present some baseline experiments in order to obtain reference values for our approach. We generated three-passage summaries for the Concisus dataset, of the kind commonly found in online news websites like Google News. In the experiments using the PT BN dataset, the summary size was determined by the size of the reference human summaries, which consisted of about 10% of the input news story. In both sets of experiments, we used the cosine and the Euclidean distance, since the former is the usual metric for computing textual similarity, while the latter is the one that relates to the Secure Binary Embeddings technique. All results are presented in terms of ROUGE [10], in particular ROUGE-1, which is the most widely used evaluation measure for this scenario. The results obtained for the Concisus and PT BN datasets are presented in Tables 1 and 2, respectively.

  Metric               ROUGE-1
  Cosine distance      0.575
  Euclidean distance   0.507

Table 1: KP-Centrality results with 40 key phrases using the Concisus dataset.

  Metric               ROUGE-1
  Cosine distance      0.612
  Euclidean distance   0.590

Table 2: KP-Centrality results with 40 key phrases using the Portuguese Broadcast News dataset.

We considered forty key phrases in our experiments, since this is the usual choice when news articles are considered [13]. As expected, we notice some slight degradation when the Euclidean distance is used, but we still achieve better results than other state-of-the-art methods such as default centrality [22] and LexRank [5]. Reported results in the literature include ROUGE-1 = 0.443 and 0.531 using default Centrality, and ROUGE-1 = 0.428 and 0.471 using LexRank, for the Concisus and PT BN datasets, respectively [24]. This means that the forced change of metric due to the intrinsic properties of SBE does not affect the validity of our approach in any way.

For our privacy-preserving approach, we performed experiments using different values for the SBE parameters. The results obtained in terms of ROUGE for the Concisus and PT BN datasets are presented in Tables 3 and 4, respectively. Leakage refers to the fraction of SBE hashes for which the normalized Hamming distance d_H is proportional to the Euclidean distance d_E between the original data vectors. The amount of leakage is exclusively controlled by ∆. Bits per coefficient (bpc) is the ratio between the number of measurements M and the dimensionality of the original data vectors L, i.e., bpc = M/L.

  leakage   ~5%     ~25%    ~50%    ~75%    ~95%
  bpc=4     0.365   0.437   0.465   0.486   0.495
  bpc=8     0.424   0.384   0.436   0.452   0.500
  bpc=16    0.384   0.416   0.450   0.463   0.517

Table 3: KP-Centrality using SBE and the Concisus dataset, in terms of ROUGE-1.

  leakage   ~5%     ~25%    ~50%    ~75%    ~95%
  bpc=4     0.314   0.340   0.470   0.478   0.562
  bpc=8     0.327   0.324   0.486   0.507   0.527
  bpc=16    0.338   0.336   0.520   0.473   0.550

Table 4: KP-Centrality using SBE and the Portuguese Broadcast News dataset, in terms of ROUGE-1.

As expected, increasing the amount of leakage (i.e., increasing ∆) leads to improved retrieval results. Surprisingly, changing the value of bpc does not lead to improved performance. The reason might be that the KP-Centrality method uses support sets that consider multiple partial representations of the documents. Nevertheless, the most significant result is that, for 95% leakage, there is an almost negligible loss of performance. This scenario does not violate our privacy requisites in any way: although most of the distances between hashes are known, there is no way to use this information to learn anything about the original underlying information.

7. CONCLUSIONS AND FUTURE WORK
In this work, we introduced a privacy-preserving technique for performing Important Passage Retrieval that performs similarly to its non-private counterpart. Our Secure Binary Embeddings based approach provides secure document representations that allow sensitive information to be processed by third parties without any risk of disclosure. Although there was some slight degradation of results relative to the baseline, our approach still outperforms other state-of-the-art methods such as default Centrality and LexRank, with the important advantage that no private or classified information is disclosed to third parties.

For future work, we intend to use the secure representation based on Secure Binary Embeddings for multi-document important passage retrieval. Another research line we would like to pursue is applying this privacy-preserving technique to other Information Retrieval tasks, such as the retrieval of classified military and medical records.

8. ACKNOWLEDGMENTS
We would like to thank FCT for supporting this research through PEst-OE/EEI/LA0021/2013, the Carnegie Mellon Portugal Program, PTDC/EIA-CCO/122542/2010, and grants SFRH/BD/33769/2009 and SFRH/BD/71349/2010. We would like to thank NSF for supporting this research through grant 1017256.

9. REFERENCES
[1] M. Bellare, V. T. Hoang, S. Keelveedhi, and P. Rogaway. Efficient garbling from a fixed-key blockcipher. In IEEE Symposium on SP, pages 478–492. IEEE, 2013.
[2] P. Boufounos. Universal rate-efficient scalar quantization. IEEE Transactions on Information Theory, 58(3):1861–1872, 2012.
[3] P. Boufounos and S. Rane. Secure binary embeddings for privacy preserving nearest neighbors. In WIFS 2011, pages 1–6. IEEE, 2011.
[4] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR 1998, pages 335–336. ACM, 1998.
[5] G. Erkan and D. R. Radev. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479, 2004.
[6] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
[7] W. Jiang and B. Samanthula. N-gram based secure similar document detection. In Y. Li, editor, Data and Applications Security and Privacy XXV, volume 6818 of Lecture Notes in Computer Science, pages 239–246. Springer Berlin Heidelberg, 2011.
[8] W. Jiang, L. Si, and J. Li. Protecting source privacy in federated search. In SIGIR 2007, pages 761–762. ACM, 2007.
[9] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240, 1997.
[10] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens and S. Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81. Association for Computational Linguistics, 2004.
[11] M. Litvak and M. Last. Graph-based keyword extraction for single-document summarization. In Coling 2008: MMIES, pages 17–24. Coling 2008 Organizing Committee, 2008.
[12] W. Lu, A. Varna, and M. Wu. Confidentiality-preserving image search: A comparative study between homomorphic encryption and distance-preserving randomization. IEEE Access, 2:125–141, 2014.
[13] L. Marujo, A. Gershman, J. Carbonell, R. Frederking, and J. P. Neto. Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. In Proceedings of LREC 2012, 2012.
[14] L. Marujo, M. Viveiros, and J. P. Neto. Keyphrase cloud generation of broadcast news. In Interspeech 2011. ISCA, September 2011.
[15] S. R. Maskey and J. Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Proceedings of EUROSPEECH - INTERSPEECH 2005, 2005.
[16] O. Medelyan, V. Perrone, and I. H. Witten. Subject metadata support powered by Maui. In Proceedings of JCDL '10, page 407, New York, USA, 2010.
[17] M. Murugesan, W. Jiang, C. Clifton, L. Si, and J. Vaidya. Efficient privacy-preserving similar document detection. The VLDB Journal, 19(4):457–475, 2010.
[18] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of EUROCRYPT '99, pages 223–238, 1999.
[19] H. Pang, X. Xiao, and J. Shen. Obfuscating the topical intention in enterprise text search. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 1168–1179, April 2012.
[20] G. Penn and X. Zhu. A critical reassessment of evaluation baselines for speech summarization. In Proceedings of ACL-08: HLT, pages 470–478. ACL, 2008.
[21] J. Portelo, B. Raj, P. Boufounos, I. Trancoso, and A. Alberto. Speaker verification using secure binary embeddings. In EUSIPCO, 2013.
[22] R. Ribeiro and D. M. de Matos. Revisiting centrality-as-relevance: Support sets and similarity as geometric proximity. Journal of Artificial Intelligence Research, 42:275–308, 2011.
[23] R. Ribeiro and D. M. de Matos. Improving speech-to-text summarization by using additional information sources. In Multi-Source, Multilingual Information Extraction and Summarization, Theory and Applications of NLP. Springer, 2013.
[24] R. Ribeiro, L. Marujo, D. Martins de Matos, J. P. Neto, A. Gershman, and J. Carbonell. Self reinforcement for important passage retrieval. In SIGIR 2013, pages 845–848. ACM, 2013.
[25] K. Riedhammer, B. Favre, and D. Hakkani-Tür. Long story short – global unsupervised models for keyphrase based meeting summarization. Speech Communication, 52:801–815, 2010.
[26] H. Saggion and S. Szasz. The Concisus corpus of event summaries. In Proceedings of LREC 2012, 2012.
[27] R. Sipos, A. Swaminathan, P. Shivaswamy, and T. Joachims. Temporal corpus summarization using submodular word coverage. In Proceedings of CIKM, pages 754–763. ACM, 2012.
[28] R. I. Tucker and K. Spärck Jones. Between shallow and deep: an experiment in automatic summarising. Technical Report 632, University of Cambridge, 2005.
[29] V. R. Uzêda, T. A. S. Pardo, and M. das Graças Volpe Nunes. A comprehensive comparative evaluation of RST-based summarization methods. ACM Transactions on Speech and Language Processing, 6(4):1–20, 2010.
[30] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova. Beyond SumBasic: Task-focused summarization and lexical expansion. Information Processing and Management, 43:1606–1618, 2007.
[31] X. Wan, J. Yang, and J. Xiao. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proceedings of the 45th ACL, pages 552–559, 2007.
[32] A. C.-C. Yao. Protocols for secure computations. In Foundations of Computer Science (FOCS), pages 160–164, 1982.
[33] K. Zechner and A. Waibel. Minimizing word error rate in textual summaries of spoken language. In Proceedings of NAACL, pages 186–193, 2000.
[34] H. Zha. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In SIGIR 2002, pages 113–120. ACM, 2002.
[35] J. J. Zhang, R. H. Y. Chan, and P. Fung. Extractive speech summarization using shallow rhetorical structure modeling. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1147–1157, 2010.