Privacy-Preserving Important Passage Retrieval

Luís Marujo1,2,3, José Portêlo2,3, David Martins de Matos2,3, João P. Neto2,3, Anatole Gershman1, Jaime Carbonell1, Isabel Trancoso2,3, Bhiksha Raj1
1 Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
2 Instituto Superior Técnico, Lisboa, Portugal; 3 INESC-ID, Lisboa, Portugal
{lmarujo,anatoleg,jgc,bhiksha}@cs.cmu.edu, {jose.portelo,david.matos,joao.neto,isabel.trancoso}@inesc-id.pt

ABSTRACT
State-of-the-art important passage retrieval methods obtain very good results, but do not take privacy issues into account. In this paper, we present a privacy-preserving method that relies on creating secure representations of documents. Our approach allows third parties to retrieve important passages from documents without learning anything regarding their content. We use a hashing scheme known as Secure Binary Embeddings to convert a key phrase and bag-of-words representation to bit strings in a way that allows the computation of approximate distances instead of exact ones. Experiments show that our secure system yields results similar to those of its non-private counterpart on both clean text and noisy speech-recognized text.

A potential problem for the deployment of such methods is that they usually assume that the input data are in the public domain. However, these data may come from social network profiles, medical records or other private documents, and their owners may not want to, or may not even be allowed to, share them with third parties. Consider the scenario where a company has millions of classified documents. The company needs to retrieve the most important passages from those documents, but lacks the computational power or know-how to do so. At the same time, it cannot give access to the documents to a third party with such capabilities, because they may contain sensitive information. As a result, the company must obfuscate its own data before sending it to the third party,
a requirement that is seemingly at odds with the objective of extracting important passages from it.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]; I.2.7 [Natural Language Processing]: Text analysis; K.4.1 [Computers and Society]: Public Policy Issues—privacy

General Terms
Algorithms, Experimentation

Keywords
Secure Passage Retrieval, Important Passage Retrieval, KP-Centrality, Secure Binary Embeddings, Data Privacy, Automatic Key Phrase Extraction

1. INTRODUCTION
Important Passage Retrieval (IPR) is the problem of extracting the most important passages in a body of text. By “important”, we mean those passages that capture most of the key information the text is attempting to convey.

In this paper, we propose a new privacy-preserving technique for IPR based on Secure Binary Embeddings (SBE) [3] that enables exactly this: it provides a mechanism for obfuscating the data, while still achieving near state-of-the-art performance in IPR.

SBEs are a form of locality-sensitive hashing which converts data arrays such as bag-of-words vectors to obfuscated bit strings through a combination of random projections followed by banded quantization. The method has information-theoretic guarantees of security, ensuring that the original data cannot be recovered from the bit strings. At the same time, it also provides a mechanism for locally computing distances between vectors that are close to one another, without revealing the global geometry of the data, consequently enabling tasks such as IPR. This is possible because, unlike other hashing methods which require exact matches for performing classification tasks, SBEs allow for near-exact matching: the hashes can be used to estimate the distances between vectors that are very close, but provably provide no information whatsoever about the distance between vectors that are farther apart. The usefulness of SBE has already been shown for implementing a privacy-preserving speaker verification system [21], yielding promising results.
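To make the retrieval task concrete, the toy sketch below (our own illustration with made-up passages; a generic centrality baseline, not the KP-Centrality model used in this paper) ranks passages by their summed cosine similarity to every other passage:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_passages(passages):
    """Rank passages by aggregate similarity to all other passages
    (a minimal centrality heuristic, for illustration only)."""
    bows = [Counter(p.lower().split()) for p in passages]
    scores = [sum(cosine(bi, bj) for j, bj in enumerate(bows) if j != i)
              for i, bi in enumerate(bows)]
    return sorted(range(len(passages)), key=lambda i: -scores[i])

passages = [
    "the company stores millions of classified documents",
    "classified documents may contain sensitive information",
    "weather stayed pleasant all day long",
]
order = rank_passages(passages)  # indices, most central passage first
```

The two passages that share vocabulary reinforce each other and rank above the unrelated one; centrality-based IPR methods refine this basic intuition.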
Of the various solutions proposed, state-of-the-art solutions for IPR based on centrality achieve excellent results [24].

The remainder of the paper is structured as follows. In Section 2 we briefly present related work on Important Passage Retrieval and on privacy-preserving methods in IR. In Section 3 we detail the two stages of the important passage retrieval technique. Section 4 presents the method for obtaining a secure document representation. We describe our approach to privacy-preserving important passage retrieval in Section 5. Section 6 describes the datasets used and illustrates our approach with some experiments. Finally, we present some conclusions and plans for future work.

Copyright is held by the author/owner(s).
PIR'14, Privacy-Preserving IR: When Information Retrieval Meets Privacy and Security, SIGIR 2014 Workshop, July 11th, 2014, Gold Coast, Australia.

2. RELATED WORK

2.1 Important Passage Retrieval
Text and speech information sources influence the complexity of important passage retrieval approaches differently. For textual passage retrieval, it is common to use complex information, such as syntactic [30], semantic [28], and discourse information [29], either to assess relevance or to reduce the length of the output. However, speech important passage retrieval approaches face an extra layer of complexity, caused by speech-related issues like recognition errors or disfluencies. As a result, it is useful to exploit speech-specific information (e.g., acoustic/prosodic features [15], recognition confidence scores [33]), or to improve both the assessment of relevance and the intelligibility of automatic speech recognizer transcriptions (by using related information [23]). These problems not only increase the difficulty of determining the salient information, but also constrain the applicability of passage retrieval techniques to speech passage retrieval. Nevertheless, shallow text summarization approaches such as Latent Semantic Analysis (LSA) [9] and Maximal Marginal Relevance (MMR) [4] seem to achieve performance comparable to approaches using specific speech-related features [20]. In addition, discourse features are starting to gain some importance in speech retrieval [15, 35].

Closely related to the important passage retrieval used in this work are approaches based on unsupervised key phrase extraction methods, which are used to reinforce passage retrieval [34, 31, 11, 25, 27]. Namely, they propose the use of key phrases to summarize news articles [11] and meetings [25]. In [11], the authors explored both supervised and unsupervised methods with a limited set of features to extract key phrases as a first step towards important passage retrieval. Furthermore, the important passage retrieval method used in this work adapts the centrality retrieval model, which plays an important role in the whole process. This kind of model adaptation is explored in [25], where the first stage of their method consists of a simple key phrase extraction step based on part-of-speech patterns; these key phrases are then used to define the relevance and redundancy components of an MMR summarization model.

Most IPR methods could easily be made secure using the method described in Section 4. We opted to use the KP-Centrality method described in the next section because it is the current state-of-the-art IPR method.

2.2 Privacy Preserving Methods
In this work, we focused on creating a method that performs important passage retrieval while keeping the information in the original documents private. There is a large body of literature on important passage retrieval and on privacy-preserving or secure methods. To the best of our knowledge, the combination of both research lines has not been explored yet. However, there are some recent works combining information retrieval and privacy. Most of these works use data encryption [8, 17, 7, 12] to transfer the data in a secure way. This does not solve our problem, because the content of the document would be decrypted by the retrieval method and would therefore not remain confidential to it. Another alternative secure information retrieval methodology is to obfuscate queries, which hides the user's topical intention [19], but does not secure the documents' content.

In many areas, the interest in privacy-preserving methods where two or more parties wish to jointly perform a given operation without disclosing their private information is not new, and several techniques, such as Garbled Circuits [32], Homomorphic Encryption [18] and Locality-Sensitive Hashing [6], have been introduced. However, they all have limitations regarding the Important Passage Retrieval task we wish to address. Until recently, Garbled Circuits were extremely inefficient to use due to several intrinsic issues, and even now it is difficult to adapt them when the computation of non-linear operations is required. Solutions to many of these problems have been developed, such as performing offline computation of the oblivious transfers, using shorter ciphers, evaluating XOR gates for 'free', etc. [1]. Systems based on Homomorphic Encryption introduce substantial computational overhead and usually require extremely long amounts of time to evaluate any function of interest. Locality-Sensitive Hashing allows for near-exact match detection between data points, but does not provide any actual notion of distance, leading to degraded performance in some applications. As a result, we chose Secure Binary Embeddings [3] as the data privacy method for our approach, as it exhibits none of the disadvantages mentioned above for the task at hand. We describe this technique in depth in Section 4.

3. IMPORTANT PASSAGE RETRIEVAL
To determine the most important sentences of an information source, we used the KP-Centrality model [24]. We chose this model for its adaptability to different types of information sources (e.g., text, audio and video) and its state-of-the-art performance. It is based on the notion of combining key phrases with support sets. A support set is a group of the most semantically related passages. These semantic passages are selected using heuristics based on the passage order method [22]. This type of heuristic explores the structure of the input source to partition the candidate passages for inclusion in the support set into two subsets: the ones closer to the passage associated with the support set under construction, and the ones farther apart. These heuristics use a permutation, d_{i1}, d_{i2}, ..., d_{i(N-1)}, of the distances of the passages s_k to the passage p_i related to the support set under construction, with d_{ik} = dist(s_k, p_i), 1 ≤ k ≤ N−1, where N is the number of passages, corresponding to the order of occurrence of the passages s_k in the input source. The metric normally used is the cosine distance.

The KP-Centrality method consists of two steps, as illustrated in Figure 1. First, it extracts key phrases using a supervised approach [13] and combines them with a bag-of-words model in a compact matrix representation, given by:

  [ w(t_1, p_1) ... w(t_1, p_N)   w(t_1, k_1) ... w(t_1, k_M) ]
  [     ...            ...             ...            ...     ]   (1)
  [ w(t_T, p_1) ... w(t_T, p_N)   w(t_T, k_1) ... w(t_T, k_M) ]

where w is a function of the number of occurrences of term t_i in passage p_j or key phrase k_l, T is the number of terms, and M is the number of key phrases. Then, using a segmented information source I ≜ p_1, p_2, ..., p_N, a support set S_i is computed for each passage p_i using:
  S_i ≜ {s ∈ I ∪ K : sim(s, p_i) > ε_i ∧ s ≠ p_i},   (2)

for i = 1, ..., N + M. Passages are then ranked, excluding the key phrases K (artificial passages), according to:

  arg max_{s ∈ (∪_{i=1}^{n} S_i) − K} |{S_i : s ∈ S_i}|.   (3)

Figure 1: Flowchart of the Important Passage Retrieval method.

4. SECURE BINARY EMBEDDINGS
A Secure Binary Embedding (SBE) is a scheme that converts real-valued vectors to bit sequences using band-quantized random projections. These bit sequences, which we will refer to as hashes, possess an interesting property: if the Euclidean distance between two vectors is lower than a threshold, the Hamming distance between their hashes is proportional to the Euclidean distance between the vectors; if it is higher, the hashes provide no information about the true distance between the two vectors. This scheme relies on the concept of Universal Quantization [2], which redefines scalar quantization by forcing the quantization function to have non-contiguous quantization regions.

Given an L-dimensional vector x ∈ R^L, the universal quantization process converts it to an M-bit binary sequence, where the m-th bit is given by

  q_m(x) = Q( (⟨x, a_m⟩ + w_m) / ∆ ).   (4)

Here ⟨·, ·⟩ represents a dot product, a_m ∈ R^L is a projection vector comprising L i.i.d. samples drawn from N(µ = 0, σ²), ∆ is a precision parameter, and w_m is a random dither drawn from a uniform distribution over [0, ∆]. Q(·) is a quantization function given by Q(x) = ⌊x⌋ mod 2. We can represent the complete quantization into M bits compactly in vector form:

  q(x) = Q( ∆^{-1} (Ax + w) ),   (5)

where q(x) is an M-bit binary vector, which we will refer to as the hash of x, A ∈ R^{M×L} is a matrix composed of the row vectors a_m, ∆ is a diagonal matrix with entries ∆, and w ∈ R^M is a vector composed of the dither values w_m.

The universal 1-bit quantizer of Equation 4 maps the real line onto 1/0 in a banded manner, where each band is ∆ wide. Figure 2 compares conventional scalar 1-bit quantization (left panel) with the equivalent universal 1-bit quantization (right panel).

Figure 2: 1-bit quantization functions.

The binary hash generated by the Universal Quantizer of Equation 5 has the following property [3]: the probability that the i-th bits, q_i(x) and q_i(x′), of the hashes of two vectors x and x′ are identical depends only on the Euclidean distance d_E = ‖x − x′‖ between the vectors, and not on their actual values. As a consequence, the following relationship can be shown [3]: given any two vectors x and x′ at Euclidean distance d_E, except with probability at most e^{−2t²M}, the normalized (per-bit) Hamming distance d_H(q(x), q(x′)) between the hashes of x and x′ is bounded by

  1/2 − (1/2) e^{−(π σ d_E / (√2 ∆))²} − t  ≤  d_H(q(x), q(x′))  ≤  1/2 − (4/π²) e^{−(π σ d_E / (√2 ∆))²} + t,   (6)

where t is the control factor.

The above bound means that the Hamming distance d_H(q(x), q(x′)) is correlated with the Euclidean distance d_E between the two vectors if d_E is lower than a threshold (which depends on ∆). Specifically, for small d_E, the expected Hamming distance E[d_H(q(x), q(x′))] can be shown to be bounded by √(2π^{-1}) σ ∆^{-1} d_E, which is linear in d_E. However, if the distance between x and x′ is higher than this threshold, then d_H(q(x), q(x′)) is bounded by 0.5 − 4π^{-2} exp(−0.5 π² σ² ∆^{-2} d_E²), which rapidly converges to 0.5 and effectively gives us no information whatsoever about the true distance between x and x′.

In order to illustrate how this scheme works, we randomly generated pairs of vectors in a high-dimensional space (L = 1024) and plotted the normalized Hamming distance between their hashes against the Euclidean distance between them (Figure 3). The number of bits in each hash is also shown in the figures.

Figure 3: SBE behavior as a function of ∆, for two values of M.

We note that in all cases, once the normalized distance exceeds ∆, the Hamming distance between the hashes of two vectors ceases to provide any information about the true distance between the vectors.
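This behavior is easy to reproduce numerically. The sketch below (our own NumPy illustration of Equations 4 and 5, not the authors' code; all parameter values are arbitrary) hashes a vector, a nearby vector, and a distant one, and compares the normalized Hamming distances of the hashes:

```python
import numpy as np

rng = np.random.default_rng(0)

L, M = 1024, 4096          # input dimensionality and number of hash bits
sigma, delta = 1.0, 1.0    # projection scale and precision parameter

A = rng.normal(0.0, sigma, size=(M, L))   # random projection matrix (rows a_m)
w = rng.uniform(0.0, delta, size=M)       # random dither in [0, delta]

def sbe_hash(x):
    """Universal 1-bit quantization: q(x) = floor((Ax + w)/delta) mod 2."""
    return np.floor((A @ x + w) / delta).astype(int) % 2

def hamming(q1, q2):
    """Normalized (per-bit) Hamming distance between two hashes."""
    return float(np.mean(q1 != q2))

x = rng.normal(size=L) / np.sqrt(L)                  # roughly unit-norm vector
near = x + 0.05 * rng.normal(size=L) / np.sqrt(L)    # small Euclidean distance
far = rng.normal(size=L) / np.sqrt(L)                # large Euclidean distance

d_near = hamming(sbe_hash(x), sbe_hash(near))  # small: tracks d_E
d_far = hamming(sbe_hash(x), sbe_hash(far))    # saturates near 0.5
```

For the nearby pair the Hamming distance remains proportional to the small Euclidean distance, while for the distant pair it saturates near 0.5, revealing essentially nothing about the true distance.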
We will find this property useful in developing our privacy-preserving system. Changing the value of the precision parameter ∆ allows us to adjust the distance threshold up to which the Hamming distance is informative. Increasing the number of bits M leads to a reduction of the variance of the Hamming distance. A converse property of the embeddings is that, for all x′ except those that lie within a small radius of any x, d_H(q(x), q(x′)) provides little information about how close x′ is to x. It can be shown that the embedding provides information-theoretic security beyond this radius if the embedding parameters A and w are unknown to the potential eavesdropper. Any algorithm attempting to recover a signal x from its embedding q(x), or to infer anything about the relationship between two signals sufficiently far apart using only their embeddings, will fail to do so. Furthermore, even in the case where A and w are known, it appears computationally intractable to derive x from q(x) unless one can guess a starting point very close to x. In effect, it is infeasible to invert the SBE without strong a priori assumptions about x.

5. SECURE IMPORTANT PASSAGE RETRIEVAL
Our approach to a privacy-preserving important passage retrieval system closely follows the formulation presented in Section 3, and is illustrated in Figure 4. However, there is a very important difference in terms of who performs each of the steps. Typically there is only one party involved, Alice, who owns the original documents, performs key phrase extraction, combines the key phrases with the bag-of-words model in a compact matrix representation, computes the support sets for each document, and finally uses them to retrieve the important passages. In our scenario, Alice does not know how to extract the important passages or does not possess the computational power to do so. Therefore, she must outsource the retrieval process to a third party, Bob, who has these capabilities. However, Alice must first obfuscate the information contained in the compact matrix representation. If Bob received this information as is, he could use the term frequencies to infer the contents of the original documents and gain access to private or classified information Alice does not wish to disclose to anyone. Alice therefore computes binary hashes of her compact matrix representation using the method described in Section 4, keeping the randomization parameters A and w to herself. She sends these hashes to Bob, who computes the support sets and extracts the important passages. Because Bob receives binary hashes instead of the original matrix representation, he must use the normalized Hamming distance instead of the cosine distance in this step, since it is the metric the SBE hashes best relate to. Finally, Bob returns the hashes corresponding to the important passages to Alice, who then uses them to recover the information she desires.

Figure 4: Flowchart of the Secure Important Passage Retrieval method.

6. EXPERIMENTS
In this section we illustrate the performance of our privacy-preserving approach to Important Passage Retrieval and how it compares to its non-private counterpart. We start by presenting the datasets used in our experiments, then describe the experimental setup, and finally present some results.

6.1 Datasets
In order to evaluate our approach, we performed experiments on the English version of the Concisus dataset [26] and the Portuguese Broadcast News (PT BN) dataset [23]. The Concisus dataset is composed of seventy-eight event reports and respective summaries, distributed across three different types of events: aviation accidents, earthquakes, and train accidents. This corpus also contains comparable data in Spanish. However, since our Automatic Key Phrase Extraction (AKE) system uses some language-dependent features, we opted not to use this part of the dataset, either in previous work [24] or in this one.

The PT BN dataset consists of automatic transcriptions of eighteen broadcast news stories in European Portuguese, which are part of a news program. The news stories cover several generic topics, such as society, politics and sports, among others. For each news story, there is a human-produced abstract, used as reference.

6.2 Setup
We extracted key phrases from both datasets using the MAUI toolkit [16], expanded with shallow semantic features such as the number of named entities, part-of-speech tags and four n-gram domain model probabilities. This expanded feature set leads to improvements in the quality of the extracted key phrases. For the Concisus dataset, we extracted yet additional features, such as the detection of rhetorical devices, which further improved the key phrase extraction process [13]. As for the PT BN dataset, we only used the shallow semantic features, as the remaining features were not available [14].

6.3 Results
We present some baseline experiments in order to obtain reference values for our approach. We generated three-passage summaries for the Concisus dataset, of the kind commonly found in online news websites like Google News. In the experiments using the PT BN dataset, the summary size was determined by the size of the reference human summaries, which consisted of about 10% of the input news story. In both sets of experiments, we used the cosine and the Euclidean distance, since the former is the usual metric for computing textual similarity, while the latter is the one that relates to the Secure Binary Embeddings technique. All results are presented in terms of ROUGE [10], in particular ROUGE-1, which is the most widely used evaluation measure for this scenario. The results obtained for the Concisus and PT BN datasets are presented in Tables 1 and 2, respectively.

  Metric               ROUGE-1
  Cosine distance      0.575
  Euclidean distance   0.507

Table 1: KP-Centrality results with 40 key phrases using the Concisus dataset.

  Metric               ROUGE-1
  Cosine distance      0.612
  Euclidean distance   0.590

Table 2: KP-Centrality results with 40 key phrases using the Portuguese Broadcast News dataset.

We considered forty key phrases in our experiments, since this is the usual choice when news articles are considered [13]. As expected, we notice some slight degradation when the Euclidean distance is used, but we still achieve better results than other state-of-the-art methods such as default centrality [22] and LexRank [5]. Reported results in the literature include ROUGE-1 = 0.443 and 0.531 using default Centrality, and ROUGE-1 = 0.428 and 0.471 using LexRank, for the Concisus and PT BN datasets, respectively [24]. This means that the forced change of metric due to the intrinsic properties of SBE does not affect the validity of our approach in any way.

For our privacy-preserving approach, we performed experiments using different values for the SBE parameters. The results obtained in terms of ROUGE for the Concisus and PT BN datasets are presented in Tables 3 and 4, respectively. Leakage refers to the fraction of SBE hashes for which the normalized Hamming distance d_H is proportional to the Euclidean distance d_E between the original data vectors. The amount of leakage is exclusively controlled by ∆. Bits per coefficient (bpc) is the ratio between the number of measurements M and the dimensionality of the original data vectors L, i.e., bpc = M/L.

  leakage   ~5%     ~25%    ~50%    ~75%    ~95%
  bpc=4     0.365   0.437   0.465   0.486   0.495
  bpc=8     0.424   0.384   0.436   0.452   0.500
  bpc=16    0.384   0.416   0.450   0.463   0.517

Table 3: KP-Centrality using SBE and the Concisus dataset, in terms of ROUGE-1.

  leakage   ~5%     ~25%    ~50%    ~75%    ~95%
  bpc=4     0.314   0.340   0.470   0.478   0.562
  bpc=8     0.327   0.324   0.486   0.507   0.527
  bpc=16    0.338   0.336   0.520   0.473   0.550

Table 4: KP-Centrality using SBE and the Portuguese Broadcast News dataset, in terms of ROUGE-1.

As expected, increasing the amount of leakage (i.e., increasing ∆) leads to improved retrieval results. Surprisingly, changing the value of bpc does not lead to improved performance. The reason might be that the KP-Centrality method uses support sets that consider multiple partial representations of the documents. Nevertheless, the most significant result is that, for 95% leakage, there is an almost negligible loss of performance. This scenario does not violate our privacy requisites in any way: although most of the distances between hashes are known, there is no way to use this information to learn anything about the original underlying information.

7. CONCLUSIONS AND FUTURE WORK
In this work, we introduced a privacy-preserving technique for performing Important Passage Retrieval that performs similarly to its non-private counterpart. Our Secure Binary Embeddings based approach provides secure document representations that allow sensitive information to be processed by third parties without any risk of disclosure. Although there was some slight degradation of results relative to the baseline, our approach still outperforms other state-of-the-art methods such as default Centrality and LexRank, with the important advantage that no private or classified information is disclosed to third parties.

For future work, we intend to use the secure representation based on Secure Binary Embeddings for multi-document important passage retrieval. Another research line we would like to pursue is applying this privacy-preserving technique to other Information Retrieval tasks, such as the retrieval of classified military and medical records.

8. ACKNOWLEDGMENTS
We would like to thank FCT for supporting this research through PEst-OE/EEI/LA0021/2013, the Carnegie Mellon Portugal Program, PTDC/EIA-CCO/122542/2010, and grants SFRH/BD/33769/2009 and SFRH/BD/71349/2010. We would like to thank NSF for supporting this research through grant 1017256.

9. REFERENCES
[1] M. Bellare, V. T. Hoang, S. Keelveedhi, and P. Rogaway. Efficient garbling from a fixed-key blockcipher. In IEEE Symposium on SP, pages 478–492. IEEE, 2013.
[2] P. Boufounos. Universal rate-efficient scalar quantization. IEEE Transactions on Information Theory, 58(3):1861–1872, 2012.
[3] P. Boufounos and S. Rane. Secure binary embeddings for privacy preserving nearest neighbors. In WIFS 2011, pages 1–6. IEEE, 2011.
[4] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR 1998, pages 335–336. ACM, 1998.
[5] G. Erkan and D. R. Radev. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479, 2004.
[6] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
[7] W. Jiang and B. Samanthula. N-gram based secure similar document detection. In Y. Li, editor, Data and Applications Security and Privacy XXV, volume 6818 of Lecture Notes in Computer Science, pages 239–246. Springer Berlin Heidelberg, 2011.
[8] W. Jiang, L. Si, and J. Li. Protecting source privacy in federated search. In SIGIR 2007, pages 761–762. ACM, 2007.
[9] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240, 1997.
[10] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens and S. Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81. Association for Computational Linguistics, 2004.
[11] M. Litvak and M. Last. Graph-based keyword extraction for single-document summarization. In Coling 2008: MMIES, pages 17–24. Coling 2008 Organizing Committee, 2008.
[12] W. Lu, A. Varna, and M. Wu. Confidentiality-preserving image search: A comparative study between homomorphic encryption and distance-preserving randomization. IEEE Access, 2:125–141, 2014.
[13] L. Marujo, A. Gershman, J. Carbonell, R. Frederking, and J. P. Neto. Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. In Proceedings of LREC 2012, 2012.
[14] L. Marujo, M. Viveiros, and J. P. Neto. Keyphrase cloud generation of broadcast news. In Interspeech 2011. ISCA, September 2011.
[15] S. R. Maskey and J. Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Proceedings of EUROSPEECH - INTERSPEECH 2005, 2005.
[16] O. Medelyan, V. Perrone, and I. H. Witten. Subject metadata support powered by Maui. In Proceedings of JCDL '10, page 407, New York, USA, 2010.
[17] M. Murugesan, W. Jiang, C. Clifton, L. Si, and J. Vaidya. Efficient privacy-preserving similar document detection. The VLDB Journal, 19(4):457–475, 2010.
[18] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of EUROCRYPT '99, pages 223–238, 1999.
[19] H. Pang, X. Xiao, and J. Shen. Obfuscating the topical intention in enterprise text search. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 1168–1179, April 2012.
[20] G. Penn and X. Zhu. A critical reassessment of evaluation baselines for speech summarization. In Proceedings of ACL-08: HLT, pages 470–478. ACL, 2008.
[21] J. Portelo, B. Raj, P. Boufounos, I. Trancoso, and A. Alberto. Speaker verification using secure binary embeddings. In EUSIPCO, 2013.
[22] R. Ribeiro and D. M. de Matos. Revisiting centrality-as-relevance: Support sets and similarity as geometric proximity. Journal of Artificial Intelligence Research, 42:275–308, 2011.
[23] R. Ribeiro and D. M. de Matos. Improving speech-to-text summarization by using additional information sources. In Multi-Source, Multilingual Information Extraction and Summarization, Theory and Applications of NLP. Springer, 2013.
[24] R. Ribeiro, L. Marujo, D. Martins de Matos, J. P. Neto, A. Gershman, and J. Carbonell. Self reinforcement for important passage retrieval. In SIGIR 2013, pages 845–848. ACM, 2013.
[25] K. Riedhammer, B. Favre, and D. Hakkani-Tür. Long story short – global unsupervised models for keyphrase based meeting summarization. Speech Communication, 52:801–815, 2010.
[26] H. Saggion and S. Szasz. The Concisus corpus of event summaries. In Proceedings of LREC 2012, 2012.
[27] R. Sipos, A. Swaminathan, P. Shivaswamy, and T. Joachims. Temporal corpus summarization using submodular word coverage. In Proceedings of CIKM, pages 754–763. ACM, 2012.
[28] R. I. Tucker and K. Spärck Jones. Between shallow and deep: an experiment in automatic summarising. Technical Report 632, University of Cambridge, 2005.
[29] V. R. Uzêda, T. A. S. Pardo, and M. das Graças Volpe Nunes. A comprehensive comparative evaluation of RST-based summarization methods. ACM Transactions on Speech and Language Processing, 6(4):1–20, 2010.
[30] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova. Beyond SumBasic: Task-focused summarization and lexical expansion. Information Processing and Management, 43:1606–1618, 2007.
[31] X. Wan, J. Yang, and J. Xiao. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proceedings of the 45th ACL, pages 552–559, 2007.
[32] A. C.-C. Yao. Protocols for secure computations. In Foundations of Computer Science (FOCS), pages 160–164, 1982.
[33] K. Zechner and A. Waibel. Minimizing word error rate in textual summaries of spoken language. In Proceedings of NAACL, pages 186–193, 2000.
[34] H. Zha. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In SIGIR 2002, pages 113–120. ACM, 2002.
[35] J. J. Zhang, R. H. Y. Chan, and P. Fung. Extractive speech summarization using shallow rhetorical structure modeling. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1147–1157, 2010.