=Paper=
{{Paper
|id=Vol-1986/SML17_paper_3
|storemode=property
|title=Topical Sentence Embedding for Query Focused Document Summarization
|pdfUrl=https://ceur-ws.org/Vol-1986/SML17_paper_3.pdf
|volume=Vol-1986
|authors=Yang Gao,Linjing Wei,Heyan Huang,Qian Liu
|dblpUrl=https://dblp.org/rec/conf/ijcai/GaoWHL17
}}
==Topical Sentence Embedding for Query Focused Document Summarization==
Topical Sentence Embedding for Query Focused Document Summarization Yang Gao Linjing Wei Beijing Institute of Technology (BIT); BIT; Beijing Advanced Innovation Center for Beijing Engineering Research Imaging Technology, Capital Normal University Center of High Volume Language weilinjing@bit.edu.cn Information Processing and Cloud Computing Applications gyang@bit.edu.cn Heyan Huang Qian Liu BIT; Beijing Engineering Research BIT; Beijing Advanced Innovation Center for Center of High Volume Language Imaging Technology, Capital Normal University Information Processing and liuqian2013@bit.edu.cn Cloud Computing Applications hhy63@bit.edu.cn 1 Introduction Text summarization is an important task in natural language Abstract processing, which is expected to understand the meaning of the documents and then produce a coherent, informative but brief summarization of the original document with in a Distributed vector representation for sentences limited length. The main approaches of text summarization have been utilized in summarization area, since can be divided into two categories: extractive and genera- it simplifies semantic cosine calculation between tive. Most extractive summarization systems extract parts sentence to sentence as well as sentence to doc- of the document (a few sentences or a few words) that are ument. Many extension works have been done deemed interesting by some metric (i.e., inverse-document to incorporate latent topics and word embedding, frequency) and join them to form a summary. Conven- however, few of them assign sentences with ex- tionally, selecting sentences rely on feature engineering ap- plicit topics. Besides, much sentence embedding proach in terms of extracting surface feature statistics (i.e., framework follows the same spirit of prediction TFIDF cosine similarity) to compare with query and docu- task about a word in the sentence, which omits ment representation. the sentence-to-sentence coherence. To address Recently, distributed vector semantic representation for these problems, we proposed a novel sentence words and sentences have achieved overwhelming success embedding framework to collaborate the current in summarization area [KMTD14, KNY15, YP15], since it sentence representation, word-based content and converts high-dimensional and sparse linguistic data into a topic assignment of the sentence to predict the controllable and dense dimension of semantic vectors. It next sentence representation. The experiments on becomes more straightforward for generic summarization summarization tasks show our model outperforms to compute similarity (or relevance to some extents) and fa- state-of-the-art methods. cilitates semantic calculation. Delighted by the successful word2vec model [MCCD13, MSC+ 13], Paragraph Vector c by the paper’s authors. Copying permitted for private and Copyright ⃝ (PV) [LM14] model (i.e., the paragraph can be sentence, academic purposes. paragraph or document) also contributes to predict the next In:A. In: Proceedings Editor, B. of IJCAI Workshop Coeditor on Semantic (eds.): Proceedings Machine of the XYZ Learning Workshop, word given sequential word context and the current para- Location,(SML 2017), Country, Aug 19-25 2017,published DD-MMM-YYYY, Melbourne, Australia. at http://ceur-ws.org graph representation. It inherits the semantic representa- tion and its efficiency, further captures the word order for Word2Vec method contains two models: CBOW and Skip- sentence representation. Moreover, the sentence vector can gram model. CBOW aims at predicting the target word us- benefit summaries since it directly characterises the rele- ing the context words in the sliding window. The objective vance between queries and candidate sentences. of CBOW is to maximize the average log probability, However, most of the sentence embedding models D [LM14, YP15] are trained as the prediction task about a 1 ! L= log P r(wi | C; W ). (1) word in the sentence. In these models, sentences are inde- D i=1 pendently learnt via their local word content but often omit where, wi is the target word, C is the word contexts and the coherent relationship between sentences. Summariza- W is is word matrix, D is the corpus size. Different from tion system focuses more on comprehensive attributes of CBOW, Skip-gram aims to predict context words given the sentences, such as sentence coherence, sentence topic, sen- target word. We ignore the details of this approach here. tence representation and so on. Utilizing the conventional Paragraph Vector (PV): sentence vectors may neglect the coherence between candi- It [LM14] is an unsupervised algorithm that learns fixed- date sentences as well as sentence topics. Although, mod- length semantic representations of variable-length of texts, els incorporating topic and word embedding models, such which follows the same predicting task with Word2Vec. as TWE [LLCS15], have achieved successful results in The only change is the concatenate vector constructed from some NLP tasks, at sentence level, very few work focuses W and S, where S is sentence matrix instead of individual on representing sentences with topics. For example, given W . The PV model is a strong alternative sentence model, a user’s query that emphasises on possible plans, progress and it is widely applied in learning representations for se- and problems with hydroelectric projects. The query con- quential data. tain complex topics like “plans”, “progress”, “problems” Work on extractive summarization spans a large range and “hydroelectric projects”. Nevertheless, normal vector- of approaches. Most existing systems [Gal06, YGVS07] based models can retrieve those relevant sentences that only use rank model to select the sentences with highest scores emphasis on one or two aspects of the query. It is problem- to form the summarization. However, multi-document texts atic to capture all the aspects of the query . often describe one central topic and some sub-topics, which In order to tackle the problems, we propose a novel sen- cannot be described only depending on ranking model. tence embedding learning framework to enhance sentence Then we focus on how to rank the sentences and collab- representation by incorporating multi-topic semantics for orate topic coverage. summarization task, called Topical Sentence Embedding A variety of features were defined to measure the (TSE) model. Gaussian distributions are utilised to model relevance, including TF-IDF cosine similarity [NVM06, mixtured centralities of the embedding space, which cap- YGVS07], cue words [LH00], topic theme [HL05], and ture a prior preference of topic for sentence prediction. In WordNet similarity [OLLL11], etc. However, these fea- addition, instead of training to predict words in the docu- tures usually suffer from lacking of deep understand- ment, our proposed model represents one sentence by pre- ing semantics mechanism, which fail to meet the query dicting the next sentence via jointly training the words in need. Since Mikolov et al. [MCCD13] proposed the the current sentence and the topic of the sentence. efficient word embedding method, there is a surge of The rest of this paper is organized as follows. Section works [LM14, LLCS15] focusing on embedding models 2 summarizes the basic methods of embedding models and for capturing the linguistic regularities. Embedding mod- summarization systems. We then introduce a newly sum- els [KMTD14, KNY15, YP15, CLW+ 15] for words and marization framework in Section 3, especially in Section sentences also have encouraged summarization tasks from 3.2, the novel TSE model is proposed. Section 4 reports the perspective of semantic relevance computing, such as the experimental results and corresponding analysis. Fi- DocEmb and CNNLM. However, aforementioned methods nally, we conclude the paper. usually reward semantic similarity without considering of topic coverage, which fail to meet the summary need. Topic-based methods have been proved their successes 2 Background and Related Work for summarization. Parveen et al. [PRS15] proposed an ap- We firstly introduce the Word2Vec and the PV model to in- proach, which is based on a weighted graphical represen- vestigate the basic framework of training embedding model tation of documents obtained by topic modeling. [GNJ07] for words and sentences. measured topic concentration in a direct manner: a sen- Word2Vec: tence was considered relevant to the query if it contained at The basic assumption behind Word2Vec [MCCD13] is that least one word from the query. While these work assume the representation of co-occurred words have the similar that documents related to the query only talk about one representation in the semantic space. To this target, a slid- topic. Tang et al. [TYC09] proposed a unified probabilistic ing window is employed on the input text stream, where approach to uncover query-oriented topics and four scor- the central word is the target word and others are contexts. ing methods to calculate the importance of each sentence in the document collection. Wang et al. [WLZD08] propose to represent the probability distribution for sampling a vec- a new multi-document summarization framework (SNMF) tor x from the GMM. based on sentence-level semantic analysis and symmetric Subsequently, we can infer the posterior probability dis- non-negative matrix factorization. The symmetric matrix tribution of topics. For each sentence s, the posterior dis- factorization has been shown to be equivalent to normal- tribution of its topic is ized spectral clustering and is used to group sentences into πz N (vec(s)|µz , Σz ) clusters. Futhermore, several approaches incorporate vec- q(zs = k) = "K (3) tor representations with topics , such as NTM [CLL+ 15], k=1 N (vec(s)|µk , Σk ) TWE [LLCS15] and GMNTM [YCT15], have collaborated Based on the distribution, the topic of sentence s both benefits of semantic representation and classified top- can be vectorized as vec(Ts ) = [q(zs = 1), q(zs = ics. This motivates us to investigate the cooperation models 2), · · · , q(zs = K)]. for summarization system. Generative Sentence Embedding 3 The Framework for Query-focused Sum- The assumption of the TSE is that sentences are coher- marization ent and associated with their neighbours. Consequently, we model one sentence as a prediction task based on se- Extracting salient sentences is the main task in this study. mantic structure of the previous sentences. The semantic is At sentence level, the sentence embedding and sentence represented by collaborating sentence topic, sentence rep- ranking are utilised to enable sentence relevance to the user resentation and its content. The Negative Sampling (NEG) queries and extract salient summaries. method is applied in [MCCD13] which is an efficient ap- proximation method. Therefore, we carry on the similar 3.1 The Proposed TSE Model estimation schema in our model. Inheriting the superiority of the PV model that constructs a Definition 1. Label l!s : A label of sentence s# is 1 or 0. The continuous semantic space, the novel architecture of learn- label of positive sample is 1, the label of negative samples ing sentence representation, called TSE model, as shown in are 0. For ∀# s ∈ S, the Figure 1. $ s 1, s# = s; l (# s) = (4) 0, s# ̸= s; s s* classifier 1 0 Let Xs be a concatenation of given information of current sentence for predicting the next sentence, s, s′ concatenate be the current sentence. Xs = vec(Ts′ ) ⊕ vec(s′ ) ⊕ vec(w1 )⊕, · · · , ⊕vec(wm ). We incorporate the vectors as the input, which includes topics, sentence embedding, and w1 w2 w3 . . . wn-1 wn sÿ Ts its content of words. Context Given the collection S, we show how to learn represen- tation of sentences and topics. In this paper, we concentrate GMM T1 T2 T3 ... Tk-3 Tk-2 Tk-1 Tk to exploit the latent relationship between sentences. Sub- sequently, the target sentence s is predicted purely by the information from previous sentence, namely Xs . So the Figure 1: The structure of the proposed TSE model objective of TSE is to maximize the probability % % % Topic Vectorization by GMM G= g(s) = p(u|Xs ) (5) Let K represent the number of topics, V is the size of s∈S s∈S u∈{s∪s− } vector, and W represent word dictionary. S denotes the sentence collection, in which s is one of the sentences. Let Instead of using softmax function as prediction proba- vec(Ts ) be the topic vector of sentence s. The vectors of bility, we directly use its negative sampling approxima- sentences and words are represented as vec(s) ∈ RV and tion. &The prediction objective function of sentence s is vec(w) ∈ RV . πk ∈ R, µk ∈ RV , Σk ∈ RV ×V and g(s)= s∈S p(u|Xs), and the probability function is rep- "K resented as follows k=1 πk = 1 are denoted as mixture weights, means and covariance matrices, respectively. The parameters of the $ σ(XsT θu ), ls (# u) = 1 GMM are collectively represented by λ = {πk , µk , Σk }, p(u|Xs ) = (6) where k = 1, · · · , K. Given the collection of parameters, 1 − σ(Xs θ ), ls (# T u u) = 0 we use !K or write as a whole P (x|λ) = πk N (x|µk , Σk ) (2) p(u|Xs ) = [σ(XsT θu )]l (!u) · [1 − σ(XsT θu )]1−l (!u) (7) s s k=1 where σ(x) = 1/(1 + exp(−x)) and θu ∈ RV is the pa- 4.1 Dataset and Evaluation Metrics rameter of Xs . In this study, we use the standard summarization bench- The objective function is taken log-likelihood and de- mark DUC2005 and DUC20061 for evaluation. DUC2005 fined as contains 50 query-oriented summarization tasks. For each ! query, a relevant document cluster is assumed to be “re- L= ls (u) log[σ(XsT θu )]+ trieved”, which contains 25-50 documents. DUC2006 con- s∈S tains 50 query-oriented summarization tasks as well and (1 − ls (u))(nE(s∗ ∼ N(S))) log[1 − σ(XsT θu )] each query contains 25 documents. Thus, the task is to (8) generate a summary from the document cluster for answer- where nE(·) is number of n negative samples as Definition ing the query2. The length of a result summary is limited 1, and n is set to 10 empirically. Considering convenience by 250 words. in estimation, we rewrite the final objective function as We conducted evaluations by ROUGE [LH03] metrics. The measure evaluates the quality of the summarization by L(s, u) = ls (u) · log[σ(XsT θu )]+ counting the number of overlapping units, such as n-grams. (9) Basically, ROUGE-N is n-gram recall measure. [1 − ls (u)] · log[1 − σ(XsT θu )] 4.2 Baseline Models and Settings Parameters Estimation We compare the TSE model with several query-focused The parameters {λ, θu , Xs }, where λ = {πk , µk , Σk } summarization methods. are estimated by maximizing the likelihood of the objec- tive function jointly. A two-phase iteration process is con- • TF-IDF: this model uses TF-IDF [NVM06] for scor- ducted. ing words and sentences. Given {θu , Xs }, stochastic gradient descent (SGD) is • Lead: take the first sentences one by one from the adopted in updating parameters of the GMM. Given λ, document in the collection, where documents are or- the gradient of θu is calculated using the back propagation dered randomly. It is often used as an official baseline based on the objective in Eq. 9. of DUC. • LDA: this method uses Latent Dirichlet 3.2 Sentence Ranking Allocation[BNJ03] to learn the topic model. Af- Sentence ranking aims to measure the relevant sentences ter learned the topic model, we give max score to the with consideration of query information. In this paper, word of the same topic with query. The reader can relevance ranking of sentences primarily relys on seman- refer to the paper [TYC09] for the details. tic vector-based cosine similarity [KMTD14] that is a promising measure to compute relatedness for summariza- • SNMF: this system [WLZD08] is for topic-biased tion. Additionally, statistics features (i.e., TFIDF score summarization. It utilised non-negative matrix factor- [NVM06]). In summary, the ranking score is formulated ization (SNMF) to cluster sentences and from which as: selected multi-coverage summary sentences. • Word2Vec: the vector representations of words can nw be learned by Word2Vec [MCCD13, MSC+ 13] mod- ! Score(S) = α T F IDF (wt ) + βsim(vec(s), vec(Q)) els. The sentence representation is calculated by using t=1 an average of all word embeddings in the sentence. + γsim(vec(Ts ), vec(TQ )) • PV: PV [LM14] learns sentence vectors based on (10) Word2Vec Model. Thus, we use the same parame- where Q is the query, sim(·) represents the function to ters as that in our approach to calculate the scores of compute similarity, and we use cosine similarity in this pa- sentences. per. α, β and γ are parameters in the summarization sys- tem. • TWE: TWE [LLCS15] employs LDA to refine Skip- gram model. It learns topical word embeddings based on both words and their topics. The sentence repre- 4 Experiments sentation is calculated by using an average of all word In this section, we present experiments to evaluate the per- vectors in the sentence. formance of our method in query focused multi-document 1 http://duc.nist.gov/data.html summarization task. 2 In DUC, the query is also called “narrative” or “topic” Table 1: Overall ROUGE evaluation (%) of different models for DUC2005 and DUC 2006 DUC2005 DUC2006 Method ROUGE-1 ROUGE-2 ROUGE-1 ROUGE-2 LEAD 29.71 4.69 32.61 5.71 TF-IDF 33.56 5.20 35.93 6.53 Avg-DUC 34.34 6.02 37.95 7.54 SNMF 35.0 6.04 37.14 7.52 Word2Vec 34.59 5.48 36.33 6.34 PV 35.41 6.14 37.52 7.41 DocEmb 30.59 4.69 32.77 5.61 LDA 31.70 5.33 33.07 6.02 TWE 35.05 6.06 37.58 6.52 TSE 36.28 6.53 37.96 7.56 Impr 2.46 6.35 0.03 0.27 Table 2: Influence analysis of each factor for the TSE summarization, evaluated on DUC2005 Method Rouge-1 Rouge-2 ratio 1 ratio 2 TF-IDF sen sim topic √ √ × √ √ 35.54 6.37 2.04% 2.45% √ × √ 34.88 5.99 3.86% 8.27% × 35.92 6.47 0.99% 0.91% Note that all the baselines are conducted similar with of each element as shown in Table 2. We calculate the per- the proposed summary framework as unsupervised query- centage that the TSE is superior to the one neglecting one focused summarization system. feature, denoted as ratio 1 for ROUGE-1 metrics and ratio The learning rate η is set to 0.05 and gradually reduced 2 for ROUGE-2. As shown the ratio 1 is 3.86% and ratio to 0.0001 as training converge. The word2vec is addition- 2 is highly up to 8.27%, it illustrates that sentence sim- ally trained by English Gigaword Fifth Edition 3 and di- ilarity computation by our proposed sentence embedding mension is set to 256. The dimension of PV is set to 128, plays a consistently dominant role for the summary. On and the TWE is 64, similar as the proposed TSE model. the contrary, it has improving space for utilizing topics for summary. 4.3 Experimental Results and Discussion 5 Conclusion In this subsection, we give a report of experimental results and analysis. Table 1 shows the overall summarization This work proposes a novel sentence embedding model performances of the proposed model and baseline mod- which wisely incorporates sentence coherence and topic els. It can be observed that our approach gives the best characteristics in the learning process. It can automatically summary compare to any other method in ROUGE metrics generates distributed representations for sentences as well over two benchmark datasets, which strongly demonstrates as assigns sentences with semantic and meaningful topics. the outstanding performance of the proposed summariza- We conduct extensive experiments on DUC query-focused tion model. Impr denotes the relative improvements over summarization datasets. Utilizing the superiority of the the best of the nine baselines. We can find that the pro- proposed TSE that facilitates sentence ranking, the system posed TSE sentence embedding consistently outperforms achieves competitive performance. A promising future di- the baselines from 0.03% to 6.35%. rection is to strengthen topic optimization during the sen- Experimental results have validated our proposed model tence learning. With the assistance of semantic topic, we that exploits sentence similarity and topic information can can extract sentence-based saliance topic representation as improve the overall performance. Nevertheless, they could direct summary. not point out impact of the designed measure of sentence similarity. Hence, we keep consistency for our algorithm Acknowledgments framework except for removing the part of features while calculating sentence ranking, to investigate the importance This work is supported by National Basic Research Pro- gram of China (973 Program, Grant No.2013CB329303), 3 https://catalog.ldc.upenn.edu/LDC2011T07 National Nature Science Foundation of China (Grant No.61602036), and Beijing Advanced Innovation Center [LM14] Quoc V. Le and Tomas Mikolov. Distributed for Imaging Technology (BAICIT-2016007). representations of sentences and documents. Computer Science, 4:1188–1196, 2014. References [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and [BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jeffrey Dean. Efficient estimation of word rep- Jordan. Latent dirichlet allocation. JMLR, resentations in vector space. Computer Sci- 3:993–1022, 2003. ence, 2013. [CLL+ 15] Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, [MSC+ 13] Tomas Mikolov, Ilya Sutskever, Kai Chen, and Heng Ji. A novel neural topic model and Greg Corrado, and Jeffrey Dean. Distributed its supervised extension. In Proceedings of representations of words and phrases and their AAAI’15, pages 2210–2216, 2015. compositionality. 26:3111–3119, 2013. [NVM06] Ani Nenkova, Lucy Vanderwende, and Kath- [CLW+ 15] Kuan Yu Chen, Shih Hung Liu, Hsin Min leen Mckeown. A compositional context sen- Wang, Berlin Chen, and Hsin Hsi Chen. sitive multi-document summarizer: explor- Leveraging word embeddings for spoken doc- ing the factors that influence summarization. ument summarization. Computer Science, In Proceedings of SIGIR’06, pages 573–580, 2015. 2006. [Gal06] M. Galley. A skip-chain conditional random [OLLL11] You Ouyang, Wenjie Li, Sujian Li, and Qin field for ranking meeting utterances by impor- Lu. Applying regression models to query- tance. In Proceedings of EMNLP’07, 2006. focused multi-document summarization. In- formation Processing & Management An In- [GNJ07] Surabhi Gupta, Ani Nenkova, and Dan Juraf- ternational Journal, 2011. sky. Measuring importance and query rele- vance in topic-focused multi-document sum- [PRS15] Daraksha Parveen, Hans-Martin Ramsl, and marization. 2007. Michael Strube. Topical coherence for graph- based extractive summarization. In Proceed- [HL05] Sanda Harabagiu and Finley Lacatusu. Topic ings of EMNLP’15, pages 1949–1954, 2015. themes for multi-document summarization. In Proceedings of SIGIR’05, pages 202–209, [TYC09] Jie Tang, Limin Yao, and Dewei Chen. Multi- 2005. topic based query-oriented summarization. In Proceedings of SDM’09, pages 1147–1158, [KMTD14] Mikael Kågebäck, Olof Mogren, Nina Tah- 2009. masebi, and Devdatt Dubhashi. Extractive summarization using continuous vector space [WLZD08] Dingding Wang, Tao Li, Shenghuo Zhu, and models. In Proceedings of EACL’14, 2014. Chris Ding. Multi-document summarization via sentence-level semantic analysis and sym- [KNY15] Hayato Kobayashi, Masaki Noguchi, and metric matrix factorization. In Proceedings of Taichi Yatsuka. Summarization based on SIGIR’08, pages 307–314. ACM, 2008. embedding distributions. In Proceedings of EMNLP’15, 2015. [YCT15] Min Yang, Tianyi Cui, and Wenting Tu. Ordering-sensitive and semantic-aware topic [LH00] Chin Yew Lin and Eduard Hovy. The au- modeling. In Proceedings of AAAI’15, 2015. tomated acquisition of topic signatures for [YGVS07] Wen Tau Yih, Joshua Goodman, Lucy Vander- text summarization. In Proceedings of COL- wende, and Hisami Suzuki. Multi-document ING’00, pages 495–501, 2000. summarization by maximizing informative [LH03] Chin Yew Lin and Eduard Hovy. Auto- content-words. In Proceedings of IJCAI’07, matic evaluation of summaries using n-gram pages 1776–1782, 2007. co-occurrence statistics. In Proceedings of [YP15] Wenpeng Yin and Yulong Pei. Optimizing ACL’03, 2003. sentence modeling and selection for document summarization. In Proceedings of IJCAI’15, [LLCS15] Yang Liu, Zhiyuan Liu, Tat Seng Chua, and 2015. Maosong Sun. Topical word embeddings. In Proceedings of AAAI’15, 2015.