<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improving Neural Abstractive Text Summarization with Prior Knowledge Position Paper</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gaetano</forename><surname>Rossiello</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari &quot;Aldo Moro&quot;</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari &quot;Aldo Moro&quot;</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><surname>Semeraro</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari &quot;Aldo Moro&quot;</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Di Ciano</surname></persName>
							<email>m.diciano@innova.puglia.it</email>
							<affiliation key="aff1">
								<orgName type="institution">InnovaPuglia S.p.A</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gaetano</forename><surname>Grasso</surname></persName>
							<email>g.grasso@innova.puglia.it</email>
							<affiliation key="aff1">
								<orgName type="institution">InnovaPuglia S.p.A</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Improving Neural Abstractive Text Summarization with Prior Knowledge Position Paper</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E2B9CD0A7D1F48EC16D163E9A01F1E95</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Abstractive text summarization is a complex task whose goal is to generate a concise version of a text without necessarily reusing the sentences of the original source, while still preserving its meaning and key contents. In this position paper we address this issue by modeling the problem as a sequence-to-sequence learning task and exploiting Recurrent Neural Networks (RNNs). Moreover, we discuss the idea of combining RNNs and probabilistic models in a unified way in order to incorporate prior knowledge, such as linguistic features. We believe that this approach can achieve better performance than state-of-the-art models in generating well-formed summaries.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Information overload is a problem of modern digital society, caused by the explosion in the amount of information produced both on the World Wide Web and in enterprise environments. For textual information this problem is even more significant, due to the high cognitive load required to read and understand a text. Automatic text summarization tools are thus useful for quickly grasping a large amount of information.</p><p>The goal of summarization is to produce a shorter version of a source text while preserving the meaning and the key contents of the original. This is a very complex problem, since it requires emulating the cognitive ability of human beings to generate summaries. For this reason, text summarization poses open challenges in both natural language understanding and generation. Due to the difficulty of this task, research has long focused on the extractive side of summarization, where the generated summary is a selection of relevant sentences from the source text in a copy-paste fashion. Once the parametric models are trained, a decoder module greedily generates a summary, word by word, through a beam search algorithm.</p><p>The aim of these neural-network-based works is to provide a fully data-driven approach to the abstractive summarization task, in which the models automatically learn a representation of the relationships between the words in the input document and those in the output summary, without using complex hand-crafted linguistic features. Indeed, the experiments highlight significant improvements of these deep architectures over extractive and abstractive state-of-the-art methods evaluated on various datasets, including the gold-standard DUC-[ ], using several variants of the ROUGE metric [ ].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Motivation</head><p>The proposed neural attention-based models for abstractive summarization are still at an early stage, and thus show some limitations. First, they require a large amount of training data in order to learn a representation that properly captures the (soft) alignments between the original text and the related summary. Moreover, since these deep models learn linguistic regularities relying only on statistical co-occurrences of words in the training set, grammatical and semantic errors can occur in the generated summaries. Finally, these models work only at the sentence level and are effective for sentence compression rather than document summarization, where both the input text and the target summary consist of several sentences.</p><p>In this position paper we present our ongoing research on abstractive text summarization. Taking up the idea of casting the summarization task as a sequence-to-sequence learning problem, we study approaches to infuse prior knowledge into an RNN in a unified manner, in order to overcome the aforementioned limits. In the first stage of our research we focus on methodologies for introducing syntactic features, such as part-of-speech tags and named entities.</p><p>We believe that informing the neural network about the specific role of each word during the training phase may lead to the following advantages. By introducing information about the syntactic role of each word, the network can learn the correct collocation of words belonging to a certain part-of-speech class; this can improve the model by avoiding grammatical errors and producing well-formed summaries. Furthermore, the summarization task lacks the large amounts of training data required by these models, especially in specific domains, and the introduction of prior knowledge can help reduce the amount of data needed in the training phase.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methodology</head><p>In this section we provide a general view of our proposed model, starting from a formal definition of the abstractive summarization problem and moving to a discussion of the proposed approach for introducing prior knowledge into neural networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>. Model</head><p>Let us denote by x = {x 1 , x 2 , . . . , x n } and y = {y 1 , y 2 , . . . , y m }, with n &gt; m, two sequences, where x i , y j ∈ V and V is the vocabulary. x and y represent the sequences of words of the input text and the output summary over the vocabulary V , respectively.</p><p>The summarization problem consists in finding an output sequence y that maximizes the conditional probability of y given an input sequence x:</p><formula xml:id="formula_0">arg max y∈V P (y|x) ( )</formula><p>The conditional probability distribution P can be modeled by a neural network, with the aim of learning a set of parameters θ from a training set T = {(x 1 , y 1 ), . . . , (x k , y k )} of source text and target summary pairs. Thus, the problem is to find the parameters that yield a good approximation of the probability P (y|x) = P (y|x; θ).</p><p>The parametric model is trained to generate the next word in the summary, conditioned on the previous words and the source text. The conditional probability P can then be factored as follows:</p><formula xml:id="formula_1">P (y|x; θ) = ∏ |y| t=1 P (y t |{y 1 , . . . , y t−1 }, x; θ) ( )</formula><p>Since this is a typical sequence-to-sequence learning problem, the parametric function that computes the conditional probability can be modeled by RNNs using an encoder-decoder paradigm. Figure shows a graphical example. The encoder is an RNN that reads one token at a time from the input source and returns a fixed-size vector representing the input text. The decoder is another RNN that generates the words of the summary, conditioned on the vector representation returned by the first network.</p><p>Formally, at time t the decoder RNN computes the probability of the word y t given the last hidden state h t and the context input c, where</p><formula xml:id="formula_2">P (y t |{y 1 , . . .
, y t−1 }, x; θ) = g θ (h t , c) ( )</formula><formula xml:id="formula_3">h t = g θ (y t−1 , h t−1 , c) ( )</formula><p>The context vector c is the output of the encoder and encodes the representation of the whole input source. This vector is fundamental for informing the decoder about the input representation during the generation of the next word. Attention-based mechanisms [ ] [ ] [ ] are integrated to help the network remember certain aspects of the input, and the performance of the whole architecture often depends on how these attention-based components are modeled.</p><p>A simple way to model g θ is to use an Elman RNN [ ]. Hence:</p><formula xml:id="formula_4">h t = sigmoid(W 1 y t−1 + W 2 h t−1 + W 3 c) ( ) P (y t |{y 1 , . . . , y t−1 }, x; θ) = softmax(W 4 h t + W 5 c) ( )</formula><p>where the W i are matrices of parameters learned during the training phase.</p><p>In tasks involving language modeling, variants of RNNs such as Long Short-Term Memory (LSTM) [ ] and the Gated Recurrent Unit (GRU) [ ] have shown impressive performance and mitigate the vanishing gradient problem.</p><p>Finally, the decoder generates summaries by assigning probability values word by word. In order to find a sequence that maximizes equation ( ), a beam search algorithm is commonly used.</p><p>The whole architecture is inspired by [ ] and [ ], which use this setting to solve a machine translation problem by learning soft alignments between source and target sentences. However, the summarization problem has two significant differences: the words in both sequences x and y share the same vocabulary V , and the problem is constrained by length, since the target summary must be shorter than the input source. Although in [ ], [ ] and [ ] the authors adopt the same paradigm to solve the abstractive summarization task while taking these constraints into account, their proposals concern only the summarization of single sentences. 
This restriction makes the summarization task closer to a machine translation problem, where the lengths of the source and the target are similar. Conversely, at the document level of summarization, where the summary is far shorter than the original text, the length constraint is stronger. Designing neural models to solve summarization at the document or multi-document level is a promising future direction that we want to explore.</p></div>
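The decoder equations above and the beam search used at generation time can be sketched together in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the weight matrices are random, the vocabulary is a handful of word ids, attention is omitted, and c is a fixed vector standing in for the encoder output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, h_prev, c, W):
    # h_t = sigmoid(W1 y_{t-1} + W2 h_{t-1} + W3 c)
    h_t = sigmoid(W["W1"] @ y_prev + W["W2"] @ h_prev + W["W3"] @ c)
    # P(y_t | y_<t, x) = softmax(W4 h_t + W5 c)
    p_t = softmax(W["W4"] @ h_t + W["W5"] @ c)
    return h_t, p_t

def beam_decode(c, W, vocab_size, beam_width=3, max_len=4):
    """Beam search over the decoder: keep the beam_width best partial
    sequences, ranked by cumulative log-probability, extending each one
    word at a time (fixed output length for simplicity)."""
    h0 = np.zeros(W["W2"].shape[0])
    y0 = np.zeros(vocab_size)  # "start" input: all-zeros instead of a one-hot
    beams = [([], 0.0, h0, y0)]
    for _ in range(max_len):
        candidates = []
        for seq, score, h, y in beams:
            h_t, p_t = decoder_step(y, h, c, W)
            for w in range(vocab_size):
                candidates.append(
                    (seq + [w], score + np.log(p_t[w]), h_t, np.eye(vocab_size)[w])
                )
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy dimensions (illustrative only): vocabulary of 5 word ids, hidden size 4
rng = np.random.default_rng(0)
Vn, Hn = 5, 4
W = {"W1": rng.normal(size=(Hn, Vn)), "W2": rng.normal(size=(Hn, Hn)),
     "W3": rng.normal(size=(Hn, Hn)), "W4": rng.normal(size=(Vn, Hn)),
     "W5": rng.normal(size=(Vn, Hn))}
c = rng.normal(size=Hn)  # context vector, normally produced by the encoder RNN
summary_ids = beam_decode(c, W, Vn)  # a sequence of word ids
```

In a real system the word ids would index an embedding table and decoding would stop at an end-of-sequence symbol; both are elided here to keep the sketch close to the equations.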
<div xmlns="http://www.tei-c.org/ns/1.0"><head>. Proposed approach</head><p>In our preliminary research we focus on techniques for incorporating prior knowledge into a neural network, starting with lexical and syntactic information such as part-of-speech and named-entity tags. The core idea is to replace the softmax of each RNN layer with a log-linear model or a probabilistic graphical model, such as a factor graph. This replacement does not pose any problem, because the softmax function converts the output of the network into probability values and can be seen as a special case of this extended version of the RNN [ ]. Thus, the use of probabilistic models allows the probability value to be conditioned on an extra feature vector that represents the lexical and syntactic information of each word.</p><p>We believe that this approach can learn a better representation of the input context vector during training and can help the decoder in the generation phase. In this way, the decoder can assign to the next word a probability value related to the specific lexical role of that word in the generated summary. This can allow the model to decrease the number of grammatical errors in the summary, even with a smaller training set, since the linguistic regularities are supported by the extra vector of syntactic features.</p></div>
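One minimal way to realize this replacement can be sketched as a log-linear output layer that adds a score term for the syntactic features to the usual pre-softmax scores. The matrix U and the feature vector f_t below are our own illustrative names, not the paper's notation; with U = 0 the layer reduces to the plain softmax of the decoder.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loglinear_word_dist(h_t, c, f_t, W4, W5, U):
    """Log-linear word distribution conditioned on an extra feature vector
    f_t (e.g. one-hot part-of-speech / named-entity indicators):
        P(y_t | .) ∝ exp(W4 h_t + W5 c + U f_t)
    Setting U = 0 recovers the ordinary softmax layer softmax(W4 h_t + W5 c)."""
    return softmax(W4 @ h_t + W5 @ c + U @ f_t)

# Toy shapes (illustrative): vocab 5, hidden size 4, 3 syntactic feature classes
rng = np.random.default_rng(1)
V, H, F = 5, 4, 3
W4, W5 = rng.normal(size=(V, H)), rng.normal(size=(V, H))
U = rng.normal(size=(V, F))
h_t, c = rng.normal(size=H), rng.normal(size=H)
f_t = np.eye(F)[0]  # e.g. "the next word should be a noun"
p = loglinear_word_dist(h_t, c, f_t, W4, W5, U)
p_plain = loglinear_word_dist(h_t, c, f_t, W4, W5, np.zeros((V, F)))
```

The design point is that the feature term only re-weights the scores, so training remains end-to-end differentiable and the extra knowledge enters as a simple additive bias in log-space.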
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation Plan</head><p>We plan to evaluate our models on gold-standard datasets for the summarization task, such as the DUC-[ ], Gigaword [ ] and CNN/DailyMail [ ] corpora, as well as on a local government dataset of documents made available by InnovaPuglia S.p.A. (consisting of projects and funding proposals), using several variants of the ROUGE metric [ ].</p><p>ROUGE is a recall-based metric which assesses how many n-grams of the human reference summaries appear in the generated summaries. This metric is designed to evaluate extractive methods rather than abstractive ones, so the former are advantaged.</p><p>Evaluation in summarization is a complex problem and still an open challenge, for three main reasons. First, given an input text, there are different summaries that preserve the original meaning. Furthermore, the words that compose the summary may not appear at all in the original source. Finally, the ROUGE metric cannot measure the grammatical quality of the generated summary. To overcome these issues we plan an in-vivo experiment with a user study.</p></div>
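As a concrete reference point, ROUGE-N recall can be sketched as clipped n-gram overlap counted on the reference side. This is a simplified single-reference version for illustration, not the official ROUGE toolkit.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: the fraction of reference n-grams that also occur in
    the candidate summary, with counts clipped to the candidate's counts."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Reference has 6 unigrams; the candidate matches "the" (once), "cat", "sat"
score = rouge_n_recall("the cat sat", "the cat sat on the mat", n=1)  # → 0.5
```

Because the metric only counts surface n-gram overlap, a grammatically broken summary that reuses reference words can score as well as a fluent one, which is exactly the limitation noted above.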
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusions and Future Work</head><p>In this position paper we have outlined our ongoing research on abstractive text summarization using deep learning models. Abstractive summarization is a harder task than extractive summarization, in which a summary is produced by selecting the most relevant sentences from the input source text. We propose a novel approach that combines probabilistic models with neural networks in a unified way in order to incorporate prior knowledge, such as linguistic features. Building on this approach, as future work we plan to also integrate semantic knowledge, so that the neural network can jointly learn word and knowledge embeddings by exploiting knowledge bases and lexical thesauri. Moreover, the generation of abstractive summaries from single documents or multiple documents is another promising direction that we want to investigate.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>fashion [ ] [ ]. Over the past years, a few works have been proposed to solve the abstractive side of summarization, which aims to produce from scratch a new cohesive text not necessarily present in the original source [ ] [ ]. Abstractive summarization requires deep understanding of and reasoning over the text: determining the explicit or implicit meaning of each element, such as words, phrases, sentences and paragraphs, and making inferences about their properties [ ] in order to generate the new sentences that compose the summary. Recently, riding the wave of the prominent results of modern deep learning models in many natural language processing tasks [ ] [ ], several groups have started to exploit deep neural networks for abstractive text summarization [ ] [ ] [ ]. 
These deep architectures share the idea of casting the summarization task as a neural machine translation problem [ ], where models trained on large amounts of data learn the alignments between the input text and the target summary through an attention-based encoder-decoder paradigm. In detail, in [ ] the authors propose a feed-forward neural network based on a neural language model [ ] with an attention-based encoder, while the models proposed in [ ] and [ ] use an attention encoder within a sequence-to-sequence framework modeled by RNNs [ ].</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. .</head><label>.</label><figDesc>Fig. . An example of the encoder-decoder paradigm for sequence-to-sequence learning [ ].</figDesc><graphic coords="4,158.28,116.83,298.80,68.70" type="bitmap" /></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>References</head><p>. Bahdanau, D., Cho, K., <ref type="bibr">Bengio</ref> </p></div>			</div>
			<div type="references">

				<listBibl/>
			</div>
		</back>
	</text>
</TEI>
