<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Abstractive Text Summarization using Transfer Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ekaterina</forename><surname>Zolotareva</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Tsegaye</forename><forename type="middle">Misikir</forename><surname>Tashu</surname></persName>
						</author>
						<author role="corresp">
							<persName><forename type="first">Tomáš</forename><surname>Horváth</surname></persName>
							<email>tomas.horvath@inf.elte.hu</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Informatics</orgName>
								<orgName type="institution">ELTE-Eötvös Loránd University</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Department of Data Science and Engineering</orgName>
								<orgName type="institution">Telekom Innovation Laboratories Pázmány Péter sétány</orgName>
								<address>
									<addrLine>1/C</addrLine>
									<postCode>1117</postCode>
									<settlement>Budapest</settlement>
									<country key="HU">Hungary</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Abstractive Text Summarization using Transfer Learning</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A1D0F3CB9B5C29803E2C5A86A7C225AE</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recently, abstractive text summarization has achieved success in switching from linear models via sparse and handcrafted features to nonlinear neural network models via dense inputs. This success comes from the application of deep learning models on natural language processing tasks where these models are capable of modeling intricate patterns in data without handcrafted features. In this work, the text summarization problem has been explored using Sequence-to-sequence recurrent neural networks and Transfer Learning with a Unified Textto-Text Transformer approaches. Experimental results showed that the Transfer Learning-based model achieved considerable improvement for abstractive text summarization.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Summarization is closely related to data compression and information understanding both of which are key to information science and retrieval. The technology of text summarization can improve information extraction systems and also allows readers to quickly view a large number of documents for important information. Indeed, automatic summarization has been recently recognized as one of the most important natural language processing (NLP) tasks, yet one of the least solved one.</p><p>In the literature, there are two main approaches to text summarization. While extractive methods are arguably well suited for identifying the most relevant information, such techniques may lack the fluency and coherency of human-generated summaries. Abstractive text summarization is the task of generating a summary consisting of a few sentences that capture the salient ideas of the input text document. The adjective 'abstractive' is used to denote a summary that is not a mere selection of a few existing passages or sentences extracted from the source, but a compressed paraphrasing of the main contents of the document, potentially using vocabulary unseen in the source document <ref type="bibr" target="#b8">[9]</ref>.</p><p>Abstractive summarization has shown the most promise towards addressing issues in extracting important information from the text documents but Abstractive generation may produce sentences not seen in the original input document. Motivated by neural network success in machine translation experiments, the attention-based encoder-decoder paradigm has recently been widely studied in abstractive summarization. By dynamically accessing the relevant pieces of information based on the hidden states of the decoder during the generation of the output sequence, the model revisits the input and attends to important information.</p><p>Recent abstractive document summarization models are yet not able to achieve convincing performance. In this paper, we investigate the Transfer learning for abstractive text summarization to address a key challenge in summarization, which is to optimally compress the original document while preserving the key concepts in the original document. The rest of this paper is organized as follows: Section 2 provides an overview of the existing works and approaches. In Section 3, the approach to be investigated is introduced. Section 5 presents Experimental setting ,data sets used and results. Finally, Section 6 presents the discussion and concludes the paper and discusses prospective plans for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>The number of summarization models introduced every year has been increasing rapidly. Advancements in neural network architectures <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b10">11]</ref>, and the availability of largescale data enabled the transition from systems based on expert knowledge and heuristics to data-driven approaches powered by end-to-end deep neural models. Current approaches to text summarization utilize advanced attention and copying mechanisms <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b11">12]</ref> multi-task and multi-reward training techniques <ref type="bibr" target="#b6">[7]</ref>, graph-based methods that involve arranging the input text in a graph and then using ranking or graph traversal algorithms in order to construct the summary <ref type="bibr" target="#b4">[5]</ref> [13], reinforcement learning strategies <ref type="bibr" target="#b3">[4]</ref>, and hybrid extractive-abstractive models <ref type="bibr" target="#b5">[6]</ref>.</p><p>This work is based on the most recent and novel Text-To-Text Transfer Transformer (T5) <ref type="bibr" target="#b9">[10]</ref> and on one of the main known Sequence to sequence (Seq2Seq) model <ref type="bibr" target="#b5">[6]</ref>. The T5 model, pre-trained on Colossal Clean Crawled Corpus (C4), achieved state-of-the-art results on many NLP benchmarks while being flexible enough to be finetuned to a variety of important tasks.</p><p>It is possible to formulate most NLP tasks in a "text-totext" format -that is, a task where the model is fed some text for context or conditioning and is then asked to produce some output text. This approach provides a consistent training objective both for pre-training and finetuning. Specifically,the model is trained with a maximum likelihood objective regardless of the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">The Transformer: Model Architecture</head><p>Most competitive and successful neural sequence transduction models have an encoder-decoder structure <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b10">11]</ref>. Here, the encoder maps an input sequence of symbol representations (x 1 , ..., x n ) to a sequence of continuous representations z = (z 1 , ..., z n ) <ref type="bibr" target="#b13">[14]</ref>. Given z, the decoder then generates an output sequence (y 1 , ..., y m ) of symbols one element at a time. At each step, the model is automatically regressive, with the previously generated symbols being consumed as additional input when generating the next step. The Transformer <ref type="bibr" target="#b13">[14]</ref> follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure <ref type="figure" target="#fig_1">1</ref>, respectively (See <ref type="bibr" target="#b13">[14]</ref> for more).</p><p>Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has a multi-head selfattention mechanism, and a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers followed by layer normalization. That is, the output of each sublayer is LayerNorm(x + Sublayer(x)) Where Sublayer(x) is the function implemented by the sub-layer itself <ref type="bibr" target="#b13">[14]</ref>.</p><p>Decoder: The decoder also consists of a stack of N = 6 identical layers. The decoder inserts a third sub-layer which, in addition to the two sub-layers, provides multihead attention to the output of the encoder stack. Similar to the encoder, a residual connection around each of the two sub-layers is used, followed by a layer normalization. To prevent positions from paying attention to subsequent positions, a modified self-attention sub-layer is used in the decoder <ref type="bibr" target="#b13">[14]</ref>.</p><p>Attention: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, the keys, the values and the output are all vectors <ref type="bibr" target="#b13">[14]</ref>. The output can be calculated as a weighted sum of the values, where the weight assigned to each value is calculated by a compatibility function of the query with the corresponding key.</p><p>The advantage of using multi-head attention allows the model to share information from different representation Figure <ref type="figure" target="#fig_1">1</ref>: The Transformer -Model Architecture <ref type="bibr" target="#b13">[14]</ref> subspaces at different positions. With a single attention head this is prevented by averaging <ref type="bibr" target="#b13">[14]</ref>. The Transformer uses multi-head attention in the following manner:</p><p>• In "encoder-decoder attention" layers, the queries come from the previous decoder layer and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b1">2]</ref>.</p><p>• The encoder contains self-attention layers. In a selfattention layer, all keys, values and queries come from the same location, in this case from the output of the previous layer in the encoder. 
Each position in the encoder can attend to all positions in the previous layer of the encoder <ref type="bibr" target="#b13">[14]</ref>.</p><p>• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position <ref type="bibr" target="#b13">[14]</ref>.</p></div>
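<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the attention function above concrete, the following minimal NumPy sketch computes single-head scaled dot-product attention, producing each output as a weighted sum of the value vectors; the array names, dimensions, and random inputs are illustrative assumptions and do not correspond to the implementation used in this work.</p><p>
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    d_k = Q.shape[-1]
    # Compatibility of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Each output row is a weighted sum of the value vectors.
    weights = softmax(scores, axis=-1)
    return weights @ V

# Toy example: 4 queries attending over 6 key-value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
</p></div>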
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">T5 approach</head><p>Attention Masks: A major distinguishing factor for different architectures is the "mask" used by different attention mechanisms in the model. Recall that the selfattention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length <ref type="bibr" target="#b9">[10]</ref>. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let y i refer to the i th element of the output sequence and x j refer to the j th entry of the input sequence. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q.</p><p>The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:</p><formula xml:id="formula_0">y i = ∑ j w i, j x j<label>(1)</label></formula><p>Where w i, j is the scalar weight produced by the selfattention mechanism as a function of x i and x j . The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output time step.</p><p>Encoder-Decoder: An encoder-decoder Transformer consists of two layers of stacks: the encoder, which is fed an input sequence, and the decoder, which generates a new output sequence. The encoder uses a "fully visible" attention mask. The "fully visible" masking allows a selfattention mechanism to pay attention to each input of its output. This form of masking is suitable when the attention is over a "prefix", i.e. a context that is provided to the model that will later be used to make predictions. The selfattention operations in the decoder of the transformer use a "causal" masking pattern. Within model training process, approaching with "causal" mask let decoder prevent the model from attending to the j th entry during handling i th input sequence for j &gt; i. This is used during training so that the model cannot "see into the future" while producing its output.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Sequence to Sequence Model</head><p>The Recurrent Neural Network(RNN) is a natural generalization of feed forward neural networks to sequences. Given a sequence of inputs(x 1 , ..., x T ), a standard RNN computes a sequence of outputs (y 1 , ..., y T ) by iterating the equation 2 and 3:</p><formula xml:id="formula_1">h t = sigmoid(W hx x t +W hh h t−1 )<label>(2)</label></formula><formula xml:id="formula_2">y t = W yh h t<label>(3)</label></formula><p>The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and the output sequences have different lengths with complicated and non-monotonic relationships.</p><p>Sequence learning consists of mapping the input sequence with one RNN to a vector of fixed size and then mapping the vector with another RNN to the target sequence. Although it could work in principle, since the RNN is supplied with all relevant information, it would be difficult to train the RNNs due to the resulting long-term dependencies. However, the Long Short-Term Memory (LSTM) is known to learn problems with long-range time dependencies, so an LSTM can be successful in this setting.</p><p>The objective of the LSTM is to estimate the conditional probability p(y 1 , ..., y M |x 1 , ..., x M ) where (x 1 , ..., x M ) is an input sequence and (y 1 , ..., y M ) is its corresponding output sequence whose length M may differ from M. The LSTM computes the conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x 1 , ..., x M ) given by the last hidden state of the LSTM, and then computing the probability of (y 1 , ..., y M ) with a standard LSTM language model formulation whose initial hidden state is set to the representation v of (x 1 , ..., x T ): In this equation, each p(y m |v, y 1 , ..., y m−1 ) distribution is represented with a soft max over all the words in the vocabulary. The LSTM formulation from Graves has been used. It is require that each sentence ends with a special end-of-sentence symbol "&lt;EOS&gt;", which enables the model to define a distribution over sequences of all possible lengths.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Setting and Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Dataset Selection</head><p>The experiment was carried out on the BBC News dataset provided by Kaggle<ref type="foot" target="#foot_0">1</ref> . The dataset consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005 and includes five class labels which are business, entertainment, politics, sport, technology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Data Preprocessing</head><p>In preprocessing the documnets, the following tasks were performed: tokenization using the NLTK<ref type="foot" target="#foot_1">2</ref> tokenizer; removing punctuation marks, determiners, and prepositions; a transformation to lower-case; stopword removal and word stemming. In the stop word removal step, the words that are in the english stop word list were removed. After removing the stopwords, the words have been stemmed to their roots.</p><p>Python was used to implement the proposed LSH-based AEE algorithm. The Scikit-learn <ref type="foot" target="#foot_2">3</ref> , gensim <ref type="foot" target="#foot_3">4</ref> and the Numpy<ref type="foot" target="#foot_4">5</ref> and PyTorch<ref type="foot" target="#foot_5">6</ref> libraries were used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">T5 Model Hyper-Parameter Setting</head><p>The following parameters were selected by taking into account the computation power and resources at hand. Therefore, We selected the Hyper parameters using the manual configuration method. The dataset is split into 80% training data and 20% testing data with sample function from pandas framework.</p><p>• TRAIN_BATCH_SIZE = 2 (default: 64)</p><p>• VALID_BATCH_SIZE = 2 (default: 1000)</p><p>• TRAIN_EPOCHS = 2 (default: 10)</p><p>• VAL_EPOCHS = 1 (default: 10)</p><p>• LEARNING_RATE = 1 e − 4 (default: 0.01)</p><p>• SEED = 42 (default: 42) Initiating Fine-Tuning for the model on BBC News dataset:</p><p>• Epoch: 0, Loss: 14.0325</p><p>• Epoch: 0, Loss: 2.9507</p><p>• Epoch: 1, Loss: 2.8506</p><p>• Epoch: 1, Loss: 2.0221</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Seq2Seq Model Settings</head><p>Abstractive summarization neural network model is built using TensorFlow and Keras machine learning and neural networks python libraries.</p><p>First, set up the maximum cleaned text and summary lengths based on the distribution of sequence lengths from the chosen sample. Add "sostok" -START and "eostok" -END tokens to the reference summary as this will help the model to determine when the sequence starts and ends respectively. The dataset is split into 80% for training data and 20% for testing data with train_test_split package from sklearn.model_selection.</p><p>Then, both the training and testing data are tokenized to form the vocabulary and converted the word sequences into equal length integer sequences by using Tokenizer and pad sequences modules from keras.preprocessing package.</p><p>Our Seq2Seq model has three LSTM layers for the encoder network and a single LSTM layer for the decoder network with an embedding layer on both the encoder and decoder network. The custom attention layer was also used to remember the lengthy sequences, and the output layer uses the SoftMax activation function. The hidden layers have a dimension of 256 units and the embedding layers have a size of 200 units. Besides, a drop-out value of 0.4 is used in each hidden layer to reduce model overfitting and improve performance. These layers have been implemented and the model is built using different wrappers like Input, LSTM, Embedding, Dense from the tensorflow.keras.layers.</p><p>Different values for each hyper-parameters was used and the following hyper-parameters setting were selected during training based on the their performance :</p><p>• Epochs = 25</p><p>• Optimizer = "rmsprop"</p><p>• Batch size = 64</p><p>• Latent dimension = 256</p><p>• Embedding dimension = 200</p><p>• Loss function = "sparse_categorical_crossentropy" Hyper parameters were selected using the manual configuration method. In the accuracy and loss values are determined and analyzed. After training phase comes the inference phase, in which we input the testing data to our model and get the output predicted summary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Evaluation Metrics</head><p>In Text Summarization, summary evaluation is an essential chore. Manual and semi-automatic evaluation of largescale summarization models is costly and cumbersome. Much effort has been made to develop automatic metrics that would allow for fast and cheap evaluation of models. The ROUGE package introduced by Lin <ref type="bibr" target="#b7">[8]</ref> offers a set of automatic metrics based on the lexical overlap between candidate and reference summaries .</p><p>We used ROUGE metrics for our evaluation process. ROUGE refers to Recall Oriented Understudy for Gisting Evaluation which is an automatic summary evaluation </p><formula xml:id="formula_3">ROUGE − n = ∑ S∈RS ∑ gram n ∈S Count match (gram n ) ∑ S∈RS ∑ gram n ∈S Count(gram n )<label>(5)</label></formula><p>Where RS is a set of reference summaries, n stands for the length of the n-gram, gram n , and Countmatch(gram n ) is the maximum number of n-grams co-occurring in a generated summary and a set of reference summaries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ROUGE-L:</head><p>It denotes the Longest Common Subsequence (LCS) matching between the reference summary and system generated summary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Results</head><p>The experimental results of Text-To-Text Transfer Transformer (T5) method were compared with attention based Sequence to sequence based methods. The experimental results are presented in Table <ref type="table">1</ref>   <ref type="bibr" target="#b9">[10]</ref>, the Transformer or T5 framework, to create a multi-sentence summary. Experiments were carried out to verify the effectiveness of the proposed method. Experimental results on the BBC News dataset showed that the T5 model performed well in the abstractive document summarization. The future direction is to study the Transformer method for the task of summarizing multiple documents and also to very the T5 approach on other benchmark dataset.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure2: Multi-Head Attention<ref type="bibr" target="#b13">[14]</ref> </figDesc><graphic coords="3,115.51,80.50,113.39,141.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>p(y 1</head><label>1</label><figDesc>, ..., y M |x 1 , ..., x M ) = M ∏ m=1 p(y m |v, y 1 , ..., y m−1 ) (4)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="2,323.86,80.50,198.43,255.12" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3 6</head><label>3</label><figDesc>and Table 2. The Results shown in Table 1 are from Transformer (T5) method and the results in table 2 are the baseline method. According to the experimental results presented, Text-To-Text Transfer Transformer (T5) based abstractive text summarization outperformed the baseline attention based seq2seq approach in all of the matrices used. Sample prediction results from the test are presented in ConclusionIn this paper, we have dealt with the demanding task of abstractive document summarization. We used a newly</figDesc><table><row><cell></cell><cell cols="3">ROUGE-1 ROUGE-2 ROUGE-L</cell></row><row><cell>F1</cell><cell>0.313</cell><cell>0.193</cell><cell>0.262</cell></row><row><cell>Precision</cell><cell>0.388</cell><cell>0.275</cell><cell>0.289</cell></row><row><cell>Recall</cell><cell>0.324</cell><cell>0.132</cell><cell>0.199</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Results on the BBC test set using Seq2Seq Model</figDesc><table><row><cell cols="2">Results</cell></row><row><cell>Generated Text</cell><cell>Actual text</cell></row><row><cell>Veteran Labour MP</cell><cell>Labour s Cunning-</cell></row><row><cell>and former Cabinet</cell><cell>ham to stand down</cell></row><row><cell>minister Jack Cun-</cell><cell>Veteran Labour MP</cell></row><row><cell>ningham has said he</cell><cell>and former Cabinet</cell></row><row><cell>will stand down at</cell><cell>minister Jack Cun-</cell></row><row><cell>the next election Mr</cell><cell>ningham has said he</cell></row><row><cell>Blair said He was</cell><cell>will stand down...</cell></row><row><cell>an...</cell><cell></cell></row><row><cell>Ministers would not</cell><cell>CSA could close</cell></row><row><cell>rule out scrapping</cell><cell>says minister Minis-</cell></row><row><cell>the Child Support</cell><cell>ters would not rule</cell></row><row><cell>Agency if it failed to</cell><cell>out scrapping the</cell></row><row><cell>improve Work and</cell><cell>Child Support...</cell></row><row><cell>Pension Secretary...</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Sample results using T5 model introduced approach</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.kaggle.com/pariza/bbc-news-summary</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.nltk.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://scikit-learn.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://radimrehurek.com/gensim/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://www.numpy.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://www.pytorch.org/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgment</head><p>The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Neural Machine Translation by Jointly Learning to Align and Translate</title>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">3rd International Conference on Learning Representations, ICLR 2015</title>
				<editor>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Yann</forename><surname>Lecun</surname></persName>
		</editor>
		<meeting><address><addrLine>San Diego, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">May 7-9, 2015. 2015</date>
		</imprint>
	</monogr>
	<note>Conference Track Proceedings</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Neural machine translation by jointly learning to align and translate</title>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.0473</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents</title>
		<author>
			<persName><forename type="first">Arman</forename><surname>Cohan</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-2097</idno>
		<ptr target="https://www.aclweb.org/anthology/N18-2097" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-06">June 2018</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="615" to="621" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">BanditSum: Extractive Summarization as a Contextual Bandit</title>
		<author>
			<persName><forename type="first">Yue</forename><surname>Dong</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/d18-1409</idno>
		<ptr target="https://doi.org/10.18653/v1/d18-1409" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">Ellen</forename><surname>Riloff</surname></persName>
		</editor>
		<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">tober 31 -November 4, 2018. 2018</date>
			<biblScope unit="page" from="3739" to="3748" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Lexrank: Graph-based lexical centrality as salience in text summarization</title>
		<author>
			<persName><forename type="first">Günes</forename><surname>Erkan</surname></persName>
		</author>
		<author>
			<persName><surname>Dragomir R Radev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="457" to="479" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Bottom-Up Abstractive Summarization</title>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Gehrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuntian</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/d18-1443</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">Ellen</forename><surname>Riloff</surname></persName>
		</editor>
		<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-11-04">October 31 -November 4, 2018. 2018</date>
			<biblScope unit="page" from="4098" to="4109" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Improving Abstraction in Text Summarization</title>
		<author>
			<persName><forename type="first">Wojciech</forename><surname>Kryscinski</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/d18-1207</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">Ellen</forename><surname>Riloff</surname></persName>
		</editor>
		<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-11-04">October 31 -November 4, 2018. 2018</date>
			<biblScope unit="page" from="1808" to="1817" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">ROUGE: A Package for Automatic Evaluation of Summaries</title>
		<author>
			<persName><forename type="first">Chin-Yew</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004-07">July 2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
	<note>Text Summarization Branches Out</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond</title>
		<author>
			<persName><forename type="first">Ramesh</forename><surname>Nallapati</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/K16-1028</idno>
		<ptr target="https://www.aclweb.org/anthology/K16-1028" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning</title>
				<meeting>The 20th SIGNLL Conference on Computational Natural Language Learning<address><addrLine>Berlin, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2016-08">Aug. 2016</date>
			<biblScope unit="page" from="280" to="290" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">Colin</forename><surname>Raffel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.10683</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Sequence to Sequence Learning with Neural Networks</title>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 27</title>
				<editor>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</editor>
		<imprint>
			<publisher>Associates, Inc</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="3104" to="3112" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Abstractive Document Summarization with a Graph-Based Attentional Neural Model</title>
		<author>
			<persName><forename type="first">Jiwei</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaojun</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianguo</forename><surname>Xiao</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P17-1108</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 55th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2017-07">July 2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1171" to="1181" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Queryoriented text summarization based on hypergraph transversals</title>
		<author>
			<persName><forename type="first">H</forename><surname>Van Lierde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tommy</forename><forename type="middle">W S</forename><surname>Chow</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ipm.2019.03.003</idno>
		<ptr target="https://doi.org/10.1016/j.ipm.2019.03.003" />
	</analytic>
	<monogr>
		<title level="j">Information Processing Management</title>
		<idno type="ISSN">0306- 4573</idno>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="1317" to="1338" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Attention is All you Need</title>
		<author>
			<persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<imprint>
			<publisher>Associates, Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Google&apos;s neural machine translation system: Bridging the gap between human and machine translation</title>
		<author>
			<persName><forename type="first">Yonghui</forename><surname>Wu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1609.08144</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
