Text Summarization of Product Titles

Joan Xiao (joan.xiao@gmail.com), Figure Eight, Inc., San Francisco, CA
Robert Munro∗ (robert.munro@gmail.com), Lilt, Inc., San Francisco, CA
ABSTRACT

In this work, we investigate the problem of summarizing titles of e-commerce products. With the increase in popularity of voice shopping due to smart phones and (especially) in-home speech devices, it is necessary to shorten long text-based titles to more succinct titles that are appropriate for speech. We present two extractive summarization approaches using a bi-directional long short-term memory encoder-decoder network with an attention mechanism. The first approach treats the problem as a multi-class named entity recognition problem, while the second treats it as a binary-class named entity recognition problem. As a comparison, we also evaluate two abstractive summarization approaches using the same neural network architecture. We compare the results with automated (ROUGE) and human evaluation. Our experiments demonstrate the effectiveness of both extractive summarization approaches.

KEYWORDS

extractive summarization, abstractive summarization, neural networks, voice shopping, named entity recognition

ACM Reference Format:
Joan Xiao and Robert Munro. 2019. Text Summarization of Product Titles. In Proceedings of SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 7 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION

Online marketplaces often have millions of products, and product titles are typically made quite long on purpose so that the products can be found by search engines. A typical 20-word title can be easily skimmed when it is text, but it provides a bad experience when it needs to be read out loud. With voice shopping estimated to hit $40+ billion across the U.S. and U.K. by 2022,¹ short versions or summaries of product titles are desired to improve the user experience of voice shopping.

We worked with one of the largest online e-commerce platforms, which is also one of the largest producers of in-home devices. They firmly believe that voice-based search is an important future interface for online commerce, and they are expanding into speech-based shopping. With them, we identified that a desired short title should contain only the essential words that are present in the original product title, with no additional words. The essential words fall into the following categories:

    • BRAND: brand name of the product
    • FUNCTION: what the product does
    • VARIATION: variation (color, flavor, etc.)
    • SIZE: size information
    • COUNT: count information

A product title may or may not have all 5 attributes above; oftentimes VARIATION, SIZE, or COUNT may not be present. Some examples of original product titles and desired short titles are shown in Figure 1.

Summarization techniques are classified into two categories: extractive and abstractive. Extractive summarization identifies and extracts key segments of the text, then assembles them to compose a summary. Abstractive summarization generates a summary from scratch without being constrained to reusing phrases from the original text.

In this work we apply two extractive summarization and two abstractive summarization approaches to a dataset of e-commerce product titles, and compare results using both ROUGE-1 and ROUGE-2 scores and human judgments. The evaluation results show that the extractive summarization models consistently perform much better than the abstractive summarization models.

We conclude that extractive summarization is effective for title summarization at scale. For titles up to 36 words in length, the summarization is as good as human summarization.

∗ Research conducted during employment at Figure Eight, Inc.
¹ https://www.prnewswire.com/news-releases/voice-shopping-set-to-jump-to-40-billion-by-2022-rising-from-2-billion-today-300605596.html

2 BACKGROUND & RELATED WORK

2.1 Extractive Summarization

Most work on automatic summarization has focused on extractive summarization. [18] proposed a simple approach to extractive summarization: select the top sentences, ranked by how many of the document's high-frequency words each sentence contains. [12] enhanced this mechanism by utilizing additional information such as cue words, title and heading words, and sentence location.

Various approaches based on graphs [13], topic modeling [33] and supervised learning have been proposed since then. Supervised learning methods typically model this as a classification problem: whether or not a sentence in the original document should be included in the summary. Hidden Markov Models [10] and Conditional Random Fields [29] are among the most common supervised learning techniques used for summarization.

Recently, deep neural networks [7, 21–23, 35] have become popular for extractive summarization. To date, the majority of these approaches focus on summarizing multiple documents, or a single document with multiple sentences.
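For illustration, the frequency-based selection idea of [18] can be sketched in a few lines. This is our own reconstruction of the general idea, not code from the cited work; the stopword list is a placeholder.

```python
# Illustrative sketch of frequency-based extractive summarization in the spirit
# of [18]: rank sentences by how many of the document's most frequent content
# words they contain, then keep the top-ranked sentences in document order.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for"}  # placeholder list

def summarize(document, num_sentences=2, top_k_words=10):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    words = [w for w in re.findall(r"[a-z]+", document.lower()) if w not in STOPWORDS]
    top_words = {w for w, _ in Counter(words).most_common(top_k_words)}
    # Score each sentence by the number of top high-frequency words it contains.
    scored = sorted(
        sentences,
        key=lambda s: sum(w in top_words for w in re.findall(r"[a-z]+", s.lower())),
        reverse=True,
    )
    selected = set(scored[:num_sentences])
    # Re-emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in selected)

doc = ("Voice shopping is growing quickly. Long product titles read poorly aloud. "
       "Short titles improve the voice shopping experience.")
print(summarize(doc, num_sentences=1))
```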




                                  Figure 1: Examples of original product titles and desired short titles


In our work we focus on extractive summarization of product titles, which are single "sentences", although the sentences here are sentence fragments. Since we identified that a desired short title should contain only the words that fall into the 5 categories (BRAND, FUNCTION, VARIATION, SIZE and COUNT), the problem is reduced to identifying the words in these categories, which can be treated as a Named Entity Recognition problem. Once the essential words are identified, a short title can be composed by assembling these words together.

2.2 Named Entity Recognition

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, quantities, etc. NER systems have been created using linguistic grammar-based techniques as well as statistical models such as machine learning.

Traditional machine learning approaches have been dominated by applying Hidden Markov Models [6], Decision Trees [28], Support Vector Machines [3], and Conditional Random Fields [20] to hand-crafted features. [9] pioneered a neural network model that requires little feature engineering and instead learns important features from word embeddings [31] trained on large quantities of unlabeled text. Since then, CNN, LSTM, and bidirectional LSTM models using word- and character-level feature extractors ([1, 8, 15, 16, 19, 25, 34]) have been reported to achieve state-of-the-art results on the CoNLL-2003 NER task [26].

In our work we experiment with two NER-based approaches for extractive summarization.

2.3 Abstractive Summarization

The task of abstractive sentence summarization was formalized around the DUC-2003 and DUC-2004 competitions [24]. Inspired by the success of the attention model in neural machine translation, [5] proposed a sequence-to-sequence encoder-decoder LSTM [14] with an attention mechanism for this problem, showing state-of-the-art performance on the DUC tasks. Since then, more work using deep neural networks has focused on handling out-of-vocabulary words [22] and discouraging repetition [27].

As a comparison with the extractive approaches, we experiment with two abstractive summarization models on the same dataset.

3 OUR APPROACHES

We first manually extracted named entities corresponding to the classes of BRAND, FUNCTION, VARIATION, SIZE, and COUNT, then constructed ground truth labels separately for each model. Once a model is trained, it makes predictions on titles from the test set. In the case of the extractive summarization models, shorter titles are composed from the predicted named entities.

Figure 2 illustrates how the labels for each model are generated from the annotations of named entities of a product title. Figure 3 describes how a short title is generated from each model's prediction using the same example.

3.1 Extractive Summarization (Multi-class NER)

We treat the summarization problem as a multi-class sequence labeling problem, where each class corresponds to the category of a word in the product title, i.e., whether a word is a BRAND, FUNCTION, VARIATION, SIZE, COUNT, or none of these. Once we have the predicted classes of all words in the title, we create a short (summary) title by concatenating all words that are classified as having a non-trivial entity class.

In this study, we obtained the ground-truth labels for NER using the data annotation platform Figure Eight. Crowd workers were asked to extract named entities (BRAND, FUNCTION, VARIATION, SIZE, COUNT) from the product titles. We then construct a label for each title using a BIO tag scheme. The product titles and these labels (Figure 2) are then fed into a neural network. For each predicted sequence of a title, we construct a short title using the named entities extracted from the prediction, in the fixed order of BRAND, FUNCTION, VARIATION, SIZE, COUNT (Figure 3).

3.2 Extractive Summarization (Binary NER)

In this approach, we treat the summarization problem as a binary NER problem, where a word in a title belongs to the positive class if the word is included in the summary, in contrast with the previous multi-class NER model. We re-use the ground-truth labels from the multi-class NER task above by transforming each entity class to the positive class ("1") and the non-entity class to the negative class ("0"). The product titles and these labels are then fed into a neural network (Figure 2).




Figure 2: How labels are generated from annotations for each model. Bold words in the labels for the abstractive models
indicate the difference in the order of words of the entity SIZE.




Figure 3: How a shorter title is generated from each model's prediction. Bold words in the short titles generated from the extractive models indicate the difference in the order of the entity SIZE.


For each predicted sequence of a title, we construct a short title by including the words predicted to be in the positive class, in the same order as they appear in the original title (Figure 3).

3.3 Abstractive Summarization (Ordered)

For the abstractive summarization task, the ground-truth labels are constructed from the annotated named entities in the order of BRAND, FUNCTION, VARIATION, SIZE, and COUNT, the same order as in the multi-class NER approach (Figure 2).

3.4 Abstractive Summarization (Unordered)

Since the ground-truth labels for the abstractive summarization approach above are generated in a specific order, the words in the short title may not occur in the same order as they do in the source. We are curious to know whether this re-ordering of the words affects the result of the summarization. Therefore, we made one change from the ordered abstractive summarization approach: using the same annotated named entities, we keep the words in the same order as they originally appear in the source (Figure 2).
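As a complementary sketch of Figure 3, the following illustrates the two extractive composition rules (again our own code, not the authors'): the multi-class model emits the extracted entities in the fixed order BRAND, FUNCTION, VARIATION, SIZE, COUNT, while the binary model keeps the selected words in their original source order. The example predictions are hypothetical.

```python
# Minimal sketch of short-title composition from model predictions (Figure 3).
# Not the authors' original code; the example tags below are hypothetical.

ENTITY_ORDER = ["BRAND", "FUNCTION", "VARIATION", "SIZE", "COUNT"]

def short_title_multiclass(tokens, bio_tags):
    """Multi-class NER: group extracted words by entity type, then emit them
    in the fixed order BRAND, FUNCTION, VARIATION, SIZE, COUNT."""
    by_type = {etype: [] for etype in ENTITY_ORDER}
    for token, tag in zip(tokens, bio_tags):
        if tag != "O":
            by_type[tag.split("-", 1)[1]].append(token)
    return " ".join(w for etype in ENTITY_ORDER for w in by_type[etype])

def short_title_binary(tokens, binary_tags):
    """Binary NER: keep the words predicted as positive, in source order."""
    return " ".join(t for t, keep in zip(tokens, binary_tags) if keep == "1")

# Hypothetical predictions for a hypothetical example title
tokens = "Acme Premium Green Tea Bags Matcha 100 Count".split()
bio = ["B-BRAND", "O", "B-FUNCTION", "I-FUNCTION", "I-FUNCTION",
       "B-VARIATION", "B-COUNT", "I-COUNT"]
binary = ["1", "0", "1", "1", "1", "1", "1", "1"]

print(short_title_multiclass(tokens, bio))  # entities re-ordered by entity type
print(short_title_binary(tokens, binary))   # kept words in original order
```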


                                 Test Set                1000 Random Titles
    Model                    ROUGE-1   ROUGE-2          ROUGE-1   ROUGE-2
    NER_Gold                    -         -              75.32     50.13
    Multi-class NER           84.71     65.98            75.00     50.43
    Binary NER                84.09     67.87            75.06     58.07
    Ordered Abstractive       78.83     47.85            67.47     41.66
    Unordered Abstractive     80.70     64.91            72.01     53.92

Table 1: ROUGE-1 and ROUGE-2 F1 scores on the test set and on the 1000 random titles. Bold indicates the model with the highest ROUGE-1 or ROUGE-2 score on each dataset.



    NER_Gold       Multi-class NER       Binary NER       Ordered Abstractive      Unordered Abstractive         Human Summarization
    7.02 ± 1.72      6.77 ± 1.75          6.78 ± 1.80         6.39 ± 1.79               6.47 ± 1.70                   7.70 ± 1.76
                                                 Table 2: Human evaluation on accuracy.



                                Method                  Succinctness     Combined (accuracy and succinctness)
                               NER_Gold                  9.54 ± 0.83                  8.28 ± 0.96
                             Multi-class NER             9.53 ± 0.85                  8.15 ± 0.97
                               Binary NER                9.53 ± 0.77                  8.16 ± 1.02
                           Human Summarization           8.76 ± 1.35                  8.23 ± 1.09
              Table 3: Human evaluation on succinctness, and combined evaluation on accuracy and succinctness.



    Method                    % of Titles with Factual Errors
    Ordered Abstractive                  29.1
    Unordered Abstractive                26.8
    Human Summarization                   0.19

Table 4: Human evaluation on non-factualness.

4 EXPERIMENTAL SETUP

4.1 Dataset

Our dataset consists of 56,200 product titles in English, randomly selected from the following categories:

    • Baby Products
    • Beauty
    • Drugstore
    • Fresh Perishable
    • Fresh Produce
    • Grocery
    • Home
    • Kitchen
    • Office Products
    • Pantry

The dataset is randomly split into a training set of size 37,300, a validation set of size 9,300, and a test set of size 9,600.

4.2 Evaluation

We evaluated the four approaches with the standard ROUGE metric [17], reporting the F1 scores on each model's test set for ROUGE-1 and ROUGE-2 against their corresponding ground truth labels.

In addition, we selected 1000 random product titles from the test set and asked crowd workers to manually summarize them. The crowd workers were instructed to summarize in a similar manner to how the short titles of the NER model are generated: identify keywords corresponding to BRAND, FUNCTION, VARIATION, SIZE and COUNT, and then create a short title using these keywords in the order they appear in this list.

We then asked different crowd workers to compare the short titles produced by the models with the human summarization results on the following metrics:

    • Accuracy: on a scale of 1-10, how accurately each short title describes the product.
    • Non-factualness: whether the short title has factual errors. Only the two abstractive models were compared with human summarization.
    • Succinctness: on a scale of 1-10, how succinct each short title is. A short title is rated as 10 if it does not contain any non-essential words that could be removed without affecting how accurately it describes the product. The abstractive models are excluded from this evaluation due to the non-factualness problem.

For each metric above, 3 crowd workers were assigned to rate the short titles of each product title, and the average of the 3 workers' ratings is used as the aggregated rating.

Finally, in order to have a single metric to evaluate the short titles (excluding the titles generated by the abstractive models), we combined the human evaluation ratings on accuracy and succinctness by taking the average of these two ratings for each title.
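For reference, the ROUGE-N F1 scores reported above can be computed as clipped n-gram precision and recall and their harmonic mean. The sketch below is our own minimal illustration and not the official ROUGE package [17]; the example titles are hypothetical.

```python
# Minimal sketch of ROUGE-N F1 between a candidate short title and a reference
# title (illustrative only; the paper uses the standard ROUGE metric [17]).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())          # clipped n-gram overlap
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example titles
reference = "Acme Green Tea Bags Matcha 100 Count"
candidate = "Acme Green Tea Bags 100 Count"
print(round(rouge_n_f1(candidate, reference, 1), 4))  # ROUGE-1 F1
print(round(rouge_n_f1(candidate, reference, 2), 4))  # ROUGE-2 F1
```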


4.3 Model Architecture

For simplicity, we used the same bi-directional LSTM encoder-decoder network with an attention mechanism for all 4 approaches. Both the encoder and decoder are two-layer LSTMs with 512 hidden units. Dropout [30] is applied at the decoder and at both the source and target word embeddings, and beam search of length 5 is used during inference. We trained the models on Amazon SageMaker.²

² https://aws.amazon.com/sagemaker/

5 RESULTS

5.1 Results on Test Set

Table 1 lists the ROUGE-1 and ROUGE-2 F1 scores on each model's test set against their corresponding ground truth labels. On both metrics, the two extractive models perform better than the two abstractive models, and Unordered Abstractive does better than Ordered Abstractive.

5.2 Results Compared with Human Summarization

Table 1 also shows the ROUGE-1 and ROUGE-2 F1 scores on the 1000 random titles when evaluated against human summarization. For comparison purposes, we also include the short titles generated directly from the labels used by the NER models, shown as "NER_Gold" in the table.

ANOVA and post-hoc tests on ROUGE-1 scores show that there is no significant difference between the two extractive models, the extractive models are significantly better than both abstractive models, and the unordered abstractive model is significantly better than the ordered abstractive model.

On ROUGE-2 scores, the binary NER model is significantly better than the unordered abstractive model, which is better than multi-class NER and NER_Gold, which are in turn better than the ordered abstractive model.

It is interesting to note that the unordered abstractive model achieves higher scores than the ordered abstractive model, and it even achieves a higher ROUGE-2 score than the multi-class NER model. This suggests that preserving the source order of the words in the target labels has a significant impact on the abstractive model's performance.

For both ROUGE-1 and ROUGE-2 scores, there is no statistically significant difference between multi-class NER and NER_Gold.

5.3 Human Evaluation on Accuracy

Table 2 lists the average and standard deviation of the crowd workers' ratings on all 5 versions of short titles, plus the human-summarized titles.

ANOVA and post-hoc tests on the ratings show results consistent with the ROUGE-1 evaluation above: there is no significant difference among the extractive models, nor among the abstractive models. However, NER_Gold is rated significantly higher than the two NER models, because the NER models fail to identify some named entities in some cases. And not surprisingly, human summarization is rated as the most accurate of all.

5.4 Human Evaluation on Non-Factualness

The abstractive models are known to struggle with handling out-of-vocabulary words and often make non-factual errors [27]. We were curious about whether the two abstractive models perform differently in terms of non-factualness. Table 4 shows the percentage of titles rated as having factual errors. An ANOVA test shows that there is no significant difference between the two abstractive models.

5.5 Human Evaluation on Succinctness

As the abstractive models make factual errors, this evaluation includes only the extractive models and human summarization.

Table 3 shows the average and standard deviation of the human evaluation results on succinctness. There is no statistical difference among the extractive models and NER_Gold, but interestingly human summarization is rated as the least succinct of all. Some examples (Figure 4) indicate that human summarization tends to include words related to product variations which are not captured by the models, and human raters do not think these variations are essential to describe the product.

5.6 Combined Human Evaluation on Accuracy and Succinctness

Table 3 also shows the average and standard deviation of the combined human evaluation results. Again, there is no statistically significant difference between the two extractive models, and it is interesting to note that even though NER_Gold is significantly better than the two extractive models, there is no statistically significant difference between human summarization and any of the other 3 versions.

To understand how the ratings vary with the length of product titles, we show in Figure 5 the average combined rating broken down by the number of words in the product titles, and Table 5 shows the word count distribution of these product titles. We see that the two NER models perform very close to human summarization unless the product titles are extremely long (37 or more words, which accounts for only 0.2% of the titles).

6 CONCLUSION

We applied four different deep learning based approaches to product title summarization on a dataset of 56,200 product titles, and used both ROUGE scores and human judgments to evaluate the results on 1000 random titles from the test set. The evaluation results show that the extractive summarization models consistently perform much better than the abstractive summarization models, and overall there is no statistically significant difference between the two extractive models and human summarization.

There are several avenues for future work. First, in this study we used the same neural network architecture for all models, so we did not use the latest neural network architectures for NER, and this is evident in the gap in accuracy between NER_Gold and the NER models when the product titles are longer (Figure 5). We plan to adopt state-of-the-art architectures such as ELMo [25] and Flair [1] contextual embeddings for the two NER models in a future study. In addition, we plan to experiment with self-attention transformer [32] based models such as OpenAI GPT [2], BERT [11], and the cloze-driven pretrained model of [4].




             Figure 4: Human summarization is rated lower than NER models on succinctness for some product titles.




Figure 5: Average combined rating by word count, showing that automated (extractive) summarization is equal to human-
summarization for titles up to 36 words in length.

                                       Word Count           2-6      7-11      12-16     17-21   22-26      27-31      32-36     37-41
                                       % of Titles          11.9     39.9       20.6      10.7    7.8        6.8        2.1       0.2
                                                     Table 5: Word count distribution of product titles.



These models do not use recurrent neural networks and therefore do not restrict their prediction performance to short sequences, and all have achieved competitive results on the CoNLL-2003 NER task.

Second, for abstractive summarization, even with the high percentage of titles containing non-factual errors (Table 4), the ROUGE-1 and ROUGE-2 scores and the human evaluation on accuracy are still considerably high, which suggests that abstractive summarization may achieve good results if the non-factual errors can be eliminated. We plan to explore the copy mechanism of pointer-generator approaches ([22, 27]) in a future study.

REFERENCES
[1] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 1638–1649. https://www.aclweb.org/anthology/C18-1139
[2] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical Report, OpenAI.
[3] Masayuki Asahara and Yuji Matsumoto. 2003. Japanese Named Entity Extraction with Redundant Morphological Analysis. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL '03). Association for Computational Linguistics, Stroudsburg, PA, USA, 8–15. https://doi.org/10.3115/1073445.1073447
[4] Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke S. Zettlemoyer, and Michael Auli. 2019. Cloze-driven Pretraining of Self-attention Networks. CoRR abs/1903.07785 (2019).
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2015).
[6] Daniel M. Bikel, Scott Miller, Richard M. Schwartz, and Ralph M. Weischedel. 1997. Nymble: a High-Performance Learning Name-finder. In ANLP.
[7] Jianpeng Cheng and Mirella Lapata. 2016. Neural Summarization by Extracting Sentences and Words. CoRR abs/1603.07252 (2016).
[8] Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4 (2016), 357–370.
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
[10] John M. Conroy and Dianne P. O'Leary. 2001. Text Summarization via Hidden Markov Models. In SIGIR.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018).
[12] H. P. Edmundson. 1969. New Methods in Automatic Extracting. J. ACM 16 (1969), 264–285.
[13] Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. Artif. Intell. Res. 22 (2004), 457–479.


[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
[15] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR abs/1508.01991 (2015).
[16] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In HLT-NAACL.
[17] Chin-Yew Lin. 2004. ROUGE: A Package For Automatic Evaluation Of Summaries. In ACL 2004.
[18] Hans Peter Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development 2 (1958), 159–165.
[19] Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. CoRR abs/1603.01354 (2016).
[20] Andrew McCallum and Wei Li. 2003. Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In CoNLL.
[21] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2016. SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents. In AAAI.
[22] Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In CoNLL.
[23] Ramesh Nallapati, Bowen Zhou, and Mingbo Ma. 2017. Classify or Select: Neural Architectures for Extractive Document Summarization. CoRR abs/1611.04244 (2017).
[24] Paul Over, Hoa Dang, and Donna K. Harman. 2007. DUC in context. Inf. Process. Manage. 43 (2007), 1506–1520.
[25] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.
[26] Erik Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In CoNLL.
[27] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL.
[28] Satoshi Sekine. 1998. Description of the Japanese NE System Used for MET-2. In MUC.
[29] Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, and Zheng Chen. 2007. Document Summarization Using Conditional Random Fields. In IJCAI.
[30] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014), 1929–1958.
[31] Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In IJCAI.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.
[33] Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-Document Summarization using Sentence-based Topic Models. In ACL/IJCNLP.
[34] Zhilin Yang, Ruslan R. Salakhutdinov, and William W. Cohen. 2016. Multi-Task Cross-Lingual Sequence Tagging from Scratch. CoRR abs/1603.06270 (2016).
[35] Wenpeng Yin and Yulong Pei. 2015. Optimizing Sentence Modeling and Selection for Document Summarization. In IJCAI.