Text Summarization of Product Titles

Joan Xiao (joan.xiao@gmail.com), Figure Eight, Inc., San Francisco, CA
Robert Munro* (robert.munro@gmail.com), Lilt, Inc., San Francisco, CA
* Research conducted during employment at Figure Eight, Inc.

ABSTRACT
In this work, we investigate the problem of summarizing titles of e-commerce products. With the increase in popularity of voice shopping due to smart phones and (especially) in-home speech devices, it is necessary to shorten long text-based titles to more succinct titles that are appropriate for speech. We present two extractive summarization approaches using a bi-directional long short-term memory encoder-decoder network with attention mechanism. The first approach treats the problem as a multi-class named entity recognition problem, while the second approach treats it as a binary-class named entity recognition problem. As a comparison, we also evaluate two abstractive summarization approaches using the same neural network architecture. We compare the results with automated (ROUGE) and human evaluation. Our experiments demonstrate the effectiveness of both extractive summarization approaches.

KEYWORDS
extractive summarization, abstractive summarization, neural networks, voice shopping, named entity recognition

ACM Reference Format:
Joan Xiao and Robert Munro. 2019. Text Summarization of Product Titles. In Proceedings of SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 7 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION
Online marketplaces often have millions of products, and the product titles are typically intentionally made quite long for the purpose of being found by search engines. A typical 20-word title can be easily skimmed when it is text, but it provides a bad experience when it needs to be read out loud. With voice shopping estimated to hit $40+ billion across the U.S. and U.K. by 2022 (https://www.prnewswire.com/news-releases/voice-shopping-set-to-jump-to-40-billion-by-2022-rising-from-2-billion-today-300605596.html), short versions or summaries of product titles are desired to improve the user experience of voice shopping.

We worked with one of the largest online e-commerce platforms, which is also one of the largest producers of in-home devices. They firmly believe that voice-based search is an important future interface for online commerce, and they are expanding into speech-based shopping. With them, we identified that a desired short title should contain only the essential words that are present in the original product title, with no additional words. The essential words fall into the following categories:

• BRAND: brand name of the product
• FUNCTION: what the product does
• VARIATION: variation (color, flavor, etc.)
• SIZE: size information
• COUNT: count information

A product title may or may not have all 5 attributes above - oftentimes VARIATION, SIZE, or COUNT may not be present. Some examples of the original product titles and desired short titles are shown in Figure 1.

Summarization techniques are classified into two categories: extractive and abstractive. Extractive summarization identifies and extracts key segments of the text, then assembles them to compose a summary. Abstractive summarization generates a summary from scratch without being constrained to reusing phrases from the original text.

In this work we apply two extractive summarization and two abstractive summarization approaches to summarize a dataset of e-commerce product titles, and compare results using both ROUGE-1 and ROUGE-2 scores and human judgments. The evaluation results show that extractive summarization models consistently perform much better than abstractive summarization models.

We conclude that extractive summarization is effective for title summarization at scale. For titles up to 36 words in length, the summarization is as good as human summarization.

Figure 1: Examples of original product titles and desired short titles.

2 BACKGROUND & RELATED WORK

2.1 Extractive Summarization
Most work on automatic summarization has focused on extractive summarization. [18] proposed a simple approach to extractive summarization by selecting the top sentences ranked by the number of high-frequency words they contain. [12] enhanced this mechanism by utilizing additional information such as cue words, title and heading words, and sentence location.

Various approaches based on graphs [13], topic modeling [33] and supervised learning have been proposed since then. Supervised learning methods typically model this as a classification problem on whether a sentence in the original document should be included in the summary or not. Hidden Markov Models [10] and Conditional Random Fields [29] are among the most common supervised learning techniques used for summarization.

Recently deep neural networks [7, 21-23, 35] have become popular for extractive summarization. To date, the majority of these approaches focus on summarizing multiple documents, or a single document with multiple sentences.
In our work we focus on extractive summarization of product titles, which are single "sentences", although the sentences here are sentence fragments. Since we identified that a desired short title should contain only the words that fall into the 5 categories (BRAND, FUNCTION, VARIATION, SIZE and COUNT), the problem is reduced to identifying the words in these categories, which can be treated as a Named Entity Recognition problem. Once the essential words are identified, a short title can be composed by assembling these words together.

2.2 Named Entity Recognition
Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, quantities, etc. NER systems have been created using linguistic grammar-based techniques as well as statistical models such as machine learning.

Traditional machine learning approaches have been dominated by applying Hidden Markov Models [6], Decision Trees [28], Support Vector Machines [3], and Conditional Random Fields [20] to hand-crafted features. [9] pioneered a neural network model that requires little feature engineering and instead learns important features from word embeddings [31] trained on large quantities of unlabeled text. Since then, CNN, LSTM, and bidirectional LSTM models using word- and character-level feature extractors ([1, 8, 15, 16, 19, 25, 34]) have been reported to achieve state-of-the-art results on the CoNLL-2003 NER task [26].

In our work we experiment with two NER-based approaches for extractive summarization.

2.3 Abstractive Summarization
The task of abstractive sentence summarization was formalized around the DUC-2003 and DUC-2004 competitions [24]. Inspired by the success of the attention model in neural machine translation, [5] proposed a sequence-to-sequence encoder-decoder LSTM [14] with attention mechanism for this problem, showing state-of-the-art performance on the DUC tasks. Since then, further work with deep neural networks has focused on handling out-of-vocabulary words [22] and discouraging repetition [27].

As a comparison with the extractive approaches, we experiment with two abstractive summarization models on the same dataset.

3 OUR APPROACHES
We first manually extracted named entities corresponding to the classes of BRAND, FUNCTION, VARIATION, SIZE, and COUNT, then constructed ground-truth labels separately for each model. Once a model is trained, it makes predictions on titles from the test set. In the case of the extractive summarization models, shorter titles are composed from the predicted named entities.

Figure 2 illustrates how the labels for each model are generated from the annotations of named entities of a product title. Figure 3 describes how a short title is generated from each model's prediction using the same example.

Figure 2: How labels are generated from annotations for each model. Bold words in the labels for the abstractive models indicate the difference in the order of words of the entity SIZE.

Figure 3: How the shorter title is generated from each model's prediction. Bold words in the short titles generated from the extractive models indicate the difference in the order of the entity SIZE.

3.1 Extractive Summarization (Multi-class NER)
We treat the summarization problem as a multi-class sequence labeling problem, where each class corresponds to the category of a word in the product title, i.e., whether a word is a BRAND, FUNCTION, VARIATION, SIZE, COUNT, or none of these. Once we have the predicted classes of all words in the title, we create a short (summary) title by concatenating all words that are classified as having a non-trivial entity class.

In this study, we obtained the ground-truth labels for NER using the data annotation platform Figure Eight. Crowd workers were asked to extract named entities (BRAND, FUNCTION, VARIATION, SIZE, COUNT) from the product titles. We then construct a label for each title using a BIO tag scheme. The product titles and these labels (Figure 2) are then fed into a neural network. For each predicted sequence of a title, we construct a short title using the named entities extracted from the prediction, in the fixed order of BRAND, FUNCTION, VARIATION, SIZE, COUNT (Figure 3).
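To make the label construction and title composition concrete, the sketch below builds BIO labels from annotated entity phrases and assembles a short title from a predicted tag sequence in the fixed entity order. It is a minimal illustration rather than the paper's implementation: the whitespace tokenization, the function names, and the example title are our own assumptions.

```python
# Minimal sketch (not the paper's code): BIO labels from entity annotations,
# and short-title composition from a predicted tag sequence.
ENTITY_ORDER = ["BRAND", "FUNCTION", "VARIATION", "SIZE", "COUNT"]

def bio_labels(tokens, annotations):
    """annotations: {entity_class: list of annotated phrases}; whitespace tokens assumed."""
    labels = ["O"] * len(tokens)
    for entity, phrases in annotations.items():
        for phrase in phrases:
            words = phrase.split()
            for i in range(len(tokens) - len(words) + 1):
                if tokens[i:i + len(words)] == words and labels[i] == "O":
                    labels[i] = "B-" + entity
                    for j in range(i + 1, i + len(words)):
                        labels[j] = "I-" + entity
                    break
    return labels

def short_title(tokens, predicted_labels):
    """Concatenate predicted entity words in the fixed order BRAND, FUNCTION, VARIATION, SIZE, COUNT."""
    by_entity = {e: [] for e in ENTITY_ORDER}
    for token, label in zip(tokens, predicted_labels):
        if label != "O":
            by_entity[label[2:]].append(token)  # strip the B-/I- prefix
    return " ".join(w for e in ENTITY_ORDER for w in by_entity[e])

# Hypothetical title for illustration only.
tokens = "Acme Organic Green Tea 20 Count 1.5 oz".split()
labels = bio_labels(tokens, {"BRAND": ["Acme"], "FUNCTION": ["Green Tea"],
                             "VARIATION": ["Organic"], "SIZE": ["1.5 oz"], "COUNT": ["20 Count"]})
print(labels)
print(short_title(tokens, labels))  # Acme Green Tea Organic 1.5 oz 20 Count
```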
3.2 Extractive Summarization (Binary NER)
In this approach, we treat the summarization problem as a binary NER problem, where a word in a title belongs to the positive class if the word is included in the summary, in contrast with the previous multi-class NER model. We re-use the ground-truth labels from the multi-class NER task above by transforming each entity class to the positive class ("1") and the non-entity class to the negative class ("0"). The product titles and these labels are then fed into a neural network (Figure 2). For each predicted sequence of a title, we construct a short title by including the words predicted in the positive class, in the same order as they appear in the original title (Figure 3).
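The binary labels and the order-preserving title composition can be derived directly from the multi-class BIO labels, for example as in the small sketch below (illustrative only; the helper names are ours, not the paper's).

```python
# Minimal sketch (not the paper's code): binary NER labels from multi-class BIO
# labels, and a short title that preserves the original word order.
def to_binary(bio_tags):
    return ["1" if tag != "O" else "0" for tag in bio_tags]

def short_title_in_source_order(tokens, binary_tags):
    return " ".join(t for t, tag in zip(tokens, binary_tags) if tag == "1")

tokens = ["Acme", "Premium", "Green", "Tea"]          # hypothetical example
tags = ["B-BRAND", "O", "B-FUNCTION", "I-FUNCTION"]
print(to_binary(tags))                                 # ['1', '0', '1', '1']
print(short_title_in_source_order(tokens, to_binary(tags)))  # Acme Green Tea
```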
3.3 Abstractive Summarization (Ordered)
For the abstractive summarization task, the ground-truth labels are constructed from the annotated named entities in the order of BRAND, FUNCTION, VARIATION, SIZE, and COUNT, the same as in the multi-class NER approach (Figure 2).

3.4 Abstractive Summarization (Unordered)
Since the ground-truth labels for the abstractive summarization approach above are generated in a specific order, the words in the short title may not occur in the same order as they do in the source. We are curious to know whether the re-ordering of the words affects the result of the summarization. Therefore, we made one change from the ordered abstractive summarization approach, using the same annotated named entities but keeping the words in the same order as they originally appear in the source (Figure 2).

                         Test Set             1000 Random Titles
Model                    ROUGE-1   ROUGE-2    ROUGE-1   ROUGE-2
NER_Gold                 -         -          75.32     50.13
Multi-class NER          84.71     65.98      75.00     50.43
Binary NER               84.09     67.87      75.06     58.07
Ordered Abstractive      78.83     47.85      67.47     41.66
Unordered Abstractive    80.70     64.91      72.01     53.92
Table 1: ROUGE-1 and ROUGE-2 on the test set and 1000 random titles. Bold indicates the model with the highest ROUGE-1 or ROUGE-2 score on each dataset.

Method                   Accuracy
NER_Gold                 7.02 ± 1.72
Multi-class NER          6.77 ± 1.75
Binary NER               6.78 ± 1.80
Ordered Abstractive      6.39 ± 1.79
Unordered Abstractive    6.47 ± 1.70
Human Summarization      7.70 ± 1.76
Table 2: Human evaluation on accuracy.

Method                   Succinctness    Combined (accuracy and succinctness)
NER_Gold                 9.54 ± 0.83     8.28 ± 0.96
Multi-class NER          9.53 ± 0.85     8.15 ± 0.97
Binary NER               9.53 ± 0.77     8.16 ± 1.02
Human Summarization      8.76 ± 1.35     8.23 ± 1.09
Table 3: Human evaluation on succinctness, and combined evaluation on accuracy and succinctness.

Method                   % of Titles with Factual Errors
Ordered Abstractive      29.1
Unordered Abstractive    26.8
Human Summarization      0.19
Table 4: Human evaluation on non-factualness.

4 EXPERIMENTAL SETUP

4.1 Dataset
Our dataset consists of 56,200 product titles in English, randomly selected from the following categories:
• Baby Products
• Beauty
• Drugstore
• Fresh Perishable
• Fresh Produce
• Grocery
• Home
• Kitchen
• Office Products
• Pantry
The dataset is randomly split into a training set of size 37,300, a validation set of size 9,300, and a test set of size 9,600.

4.2 Evaluation
We evaluated the four approaches with the standard ROUGE metric [17], reporting F1 scores on each model's test set for ROUGE-1 and ROUGE-2 against the corresponding ground-truth labels.

In addition, we selected 1000 random product titles from the test set and asked crowd workers to manually summarize them. The crowd workers were instructed to summarize in a similar manner to how the short titles of the NER model are generated: identify keywords corresponding to BRAND, FUNCTION, VARIATION, SIZE and COUNT, and then create a short title using these keywords in the order they appear in this list.

We then asked different crowd workers to compare the short titles produced by the models with the human summarization results on the following metrics:
• Accuracy: on a scale of 1-10, how accurately each short title describes the product.
• Non-factualness: whether the short title has factual errors. Only the two abstractive models were compared with human summarization on this metric.
• Succinctness: on a scale of 1-10, how succinct each short title is. A short title is rated as 10 if it does not contain any non-essential words that could be removed without affecting how accurately it describes the product. The abstractive models are excluded from this evaluation due to the non-factualness problem.
For each metric above, 3 crowd workers were assigned to rate the short titles for each product title, and the average of the 3 workers' ratings is used as the aggregated rating.

Finally, in order to have a single metric to evaluate the short titles (excluding the titles generated by the abstractive models), we combined the human evaluation ratings on accuracy and succinctness by taking the average of these two ratings for each title.
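ROUGE-1 and ROUGE-2 F1 can be computed from unigram and bigram overlap between a candidate and a reference. The sketch below is a simplified version for exposition (whitespace tokens, no stemming or stopword handling, single reference), not the full ROUGE toolkit of [17].

```python
# Simplified ROUGE-N F1: clipped n-gram overlap between candidate and reference.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical candidate/reference pair for illustration.
print(rouge_n_f1("acme green tea 20 count", "acme organic green tea", 1))  # ~0.667
print(rouge_n_f1("acme green tea 20 count", "acme organic green tea", 2))  # ~0.286
```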
4.3 Model Architecture
For simplicity, we used the same bi-directional LSTM encoder-decoder network with attention mechanism for all 4 approaches. Both encoder and decoder are two-layer LSTMs with 512 hidden units. Dropout [30] is used at the decoder and at both the source and target word embeddings, and beam search with beam size 5 is used during inference. We trained the models on Amazon SageMaker (https://aws.amazon.com/sagemaker/).
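The paper does not release code. The PyTorch sketch below mirrors the stated setup (two-layer bi-directional LSTM encoder, two-layer LSTM decoder with 512 hidden units, attention, dropout); the embedding size, dropout rate, and the particular dot-product attention variant are our assumptions, and beam-search decoding is omitted for brevity. For the extractive approaches the target sequence would be the tag sequence, and for the abstractive approaches the target short title.

```python
import torch
import torch.nn as nn

class Seq2SeqSummarizer(nn.Module):
    """Sketch of a bi-LSTM encoder / LSTM decoder with dot-product attention.

    Hyper-parameters follow the paper (2 layers, 512 hidden units, dropout on
    embeddings and decoder); everything else is an assumption for illustration.
    """

    def __init__(self, vocab_size, emb_dim=256, hidden=512, layers=2, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.emb_dropout = nn.Dropout(dropout)
        # Bidirectional encoder: hidden//2 per direction so states concatenate to `hidden`.
        self.encoder = nn.LSTM(emb_dim, hidden // 2, num_layers=layers,
                               bidirectional=True, batch_first=True, dropout=dropout)
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=layers,
                               batch_first=True, dropout=dropout)
        self.attn_out = nn.Linear(hidden * 2, hidden)
        self.generator = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # src: (batch, src_len) token ids; tgt: (batch, tgt_len) shifted target ids.
        enc_out, (h, c) = self.encoder(self.emb_dropout(self.embed(src)))

        def merge(state):
            # (layers*2, batch, hidden//2) -> (layers, batch, hidden): join directions.
            layers2, batch, dim = state.shape
            state = state.view(layers2 // 2, 2, batch, dim)
            return torch.cat([state[:, 0], state[:, 1]], dim=-1)

        dec_out, _ = self.decoder(self.emb_dropout(self.embed(tgt)), (merge(h), merge(c)))
        # Dot-product attention over encoder states for every decoder position.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))    # (batch, tgt_len, src_len)
        context = torch.bmm(scores.softmax(dim=-1), enc_out)    # (batch, tgt_len, hidden)
        combined = torch.tanh(self.attn_out(torch.cat([dec_out, context], dim=-1)))
        return self.generator(combined)                          # logits over target vocab

# Usage sketch: logits = Seq2SeqSummarizer(vocab_size=30000)(src_ids, tgt_ids)
```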
5 RESULTS

5.1 Results on Test Set
Table 1 lists the ROUGE-1 and ROUGE-2 F1 scores on each model's test set against the corresponding ground-truth labels. On both metrics, the two extractive models perform better than the two abstractive models, and Unordered Abstractive does better than Ordered Abstractive.

5.2 Results Compared with Human Summarization
Table 1 also shows the ROUGE-1 and ROUGE-2 F1 scores on the 1000 random titles when evaluated against human summarization. For comparison purposes, we also include the short titles generated directly from the gold labels used by the NER models, named "NER_Gold" in the table.

ANOVA and post-hoc tests on the ROUGE-1 scores show that there is no significant difference between the two extractive models, that the extractive models are significantly better than both abstractive models, and that the unordered abstractive model is significantly better than the ordered abstractive model.

On ROUGE-2 scores, the binary NER model is significantly better than the unordered abstractive model, which is better than multi-class NER and NER_Gold, which are in turn better than the ordered abstractive model.

It is interesting to note that the unordered abstractive model achieves higher scores than the ordered abstractive model, and it even achieves a higher ROUGE-2 score than the multi-class NER model. This suggests that preserving the order of the words in the target labels has a significant impact on the abstractive model's performance.

For both ROUGE-1 and ROUGE-2 scores, there is no statistically significant difference between multi-class NER and NER_Gold.

5.3 Human Evaluation on Accuracy
Table 2 lists the average and standard deviation of the crowd workers' ratings on all 5 versions of short titles, plus the human-summarized titles.

ANOVA and post-hoc tests on the ratings show results consistent with the ROUGE-1 evaluation above: there is no significant difference among the extractive models or among the abstractive models. However, NER_Gold is rated significantly higher than the two NER models, because the NER models fail to identify some named entities in some cases. And not surprisingly, human summarization is rated as the most accurate of all.

5.4 Human Evaluation on Non-Factualness
The abstractive models are known to struggle with handling out-of-vocabulary words and often make non-factual errors [27]. We were curious about whether the two abstractive models perform differently in terms of non-factualness. Table 4 shows the percentage of titles rated as having factual errors. An ANOVA test shows that there is no significant difference between the two abstractive models.

5.5 Human Evaluation on Succinctness
As the abstractive models make factual errors, this evaluation includes only the extractive models and human summarization.

Table 3 shows the average and standard deviation of the human evaluation results on succinctness. There is no statistically significant difference among the extractive models and NER_Gold, but interestingly human summarization is rated as the least succinct of all. Some examples (Figure 4) indicate that human summarization tends to include words related to product variations which are not captured by the models, and human raters do not consider these variations essential to describe the product.

Figure 4: Human summarization is rated lower than NER models on succinctness for some product titles.

5.6 Combined Human Evaluation on Accuracy and Succinctness
Table 3 also shows the average and standard deviation of the combined human evaluation results. Again, there is no statistically significant difference between the two extractive models, and it is interesting to note that even though NER_Gold is significantly better than the two extractive models, there is no statistically significant difference between human summarization and any of the other 3 versions.

To understand how the ratings vary with the length of product titles, we show in Figure 5 the average combined rating broken down by the number of words in the product titles, and Table 5 shows the word count distribution of these product titles. We see that the two NER models perform very close to human summarization unless the product titles are extremely long (37 or more words, which accounts for only 0.2% of the titles).

Figure 5: Average combined rating by word count, showing that automated (extractive) summarization is equal to human summarization for titles up to 36 words in length.

Word Count     2-6    7-11   12-16   17-21   22-26   27-31   32-36   37-41
% of Titles    11.9   39.9   20.6    10.7    7.8     6.8     2.1     0.2
Table 5: Word count distribution of product titles.
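The one-way ANOVA and post-hoc comparisons reported throughout this section can be reproduced with standard statistical packages. A sketch follows, assuming per-title ratings are held in flat arrays per system; the variable names and the synthetic placeholder data (drawn to match the means and standard deviations in Table 2) are illustrative, not the paper's actual data or code.

```python
# Illustrative one-way ANOVA and Tukey HSD post-hoc test over per-title ratings.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
ratings = {  # placeholder samples; real input would be the aggregated crowd ratings
    "multi_class_ner": rng.normal(6.77, 1.75, 1000),
    "binary_ner": rng.normal(6.78, 1.80, 1000),
    "human": rng.normal(7.70, 1.76, 1000),
}

print(f_oneway(*ratings.values()))  # F statistic and p-value across all groups

values = np.concatenate(list(ratings.values()))
groups = np.repeat(list(ratings.keys()), [len(v) for v in ratings.values()])
print(pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05))  # pairwise comparisons
```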
6 CONCLUSION
We applied four different deep learning based approaches to product title summarization on a dataset of 56,200 product titles, and used both ROUGE scores and human judgments to evaluate the results on 1000 random titles from the test set. The evaluation results show that the extractive summarization models consistently perform much better than the abstractive summarization models, and overall there is no statistically significant difference between the two extractive models and human summarization.

There are several avenues for future work. First, in this study we used the same neural network architecture for all models, so we did not use the most recent architectures for NER; this is evident in the gap in accuracy between NER_Gold and the NER models when the product titles are longer (Figure 5). We plan to adopt state-of-the-art contextual embeddings such as ELMo [25] and Flair [1] for the two NER models in future studies. In addition, we plan to experiment with self-attention transformer [32] based models such as OpenAI GPT [2], BERT [11], and the cloze-driven pretrained model of [4]. These models do not use recurrent neural networks, so their prediction performance is not restricted to short sequences, and all have achieved competitive results on the CoNLL-2003 NER task.

Second, for abstractive summarization, even with the high percentage of titles containing non-factual errors (Table 4), the ROUGE-1 and ROUGE-2 scores and the human evaluation on accuracy are still considerably high, which suggests that abstractive summarization may achieve good results if the non-factual errors are eliminated. We plan to explore the copy mechanism of pointer-generator approaches ([22, 27]) in future work.

REFERENCES
[1] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 1638-1649. https://www.aclweb.org/anthology/C18-1139
[2] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical Report, OpenAI.
[3] Masayuki Asahara and Yuji Matsumoto. 2003. Japanese Named Entity Extraction with Redundant Morphological Analysis. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL '03). Association for Computational Linguistics, Stroudsburg, PA, USA, 8-15. https://doi.org/10.3115/1073445.1073447
[4] Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke S. Zettlemoyer, and Michael Auli. 2019. Cloze-driven Pretraining of Self-attention Networks. CoRR abs/1903.07785 (2019).
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2015).
[6] Daniel M. Bikel, Scott Miller, Richard M. Schwartz, and Ralph M. Weischedel. 1997. Nymble: a High-Performance Learning Name-finder. In ANLP.
[7] Jianpeng Cheng and Mirella Lapata. 2016. Neural Summarization by Extracting Sentences and Words. CoRR abs/1603.07252 (2016).
[8] Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4 (2016), 357-370.
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research 12 (2011), 2493-2537.
[10] John M. Conroy and Dianne P. O'Leary. 2001. Text Summarization via Hidden Markov Models. In SIGIR.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018).
[12] H. P. Edmundson. 1969. New Methods in Automatic Extracting. J. ACM 16 (1969), 264-285.
[13] Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. Artif. Intell. Res. 22 (2004), 457-479.
[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735-1780.
[15] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR abs/1508.01991 (2015).
[16] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In HLT-NAACL.
[17] Chin-Yew Lin. 2004. ROUGE: A Package For Automatic Evaluation Of Summaries. In ACL 2004.
[18] Hans Peter Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development 2 (1958), 159-165.
[19] Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. CoRR abs/1603.01354 (2016).
[20] Andrew McCallum and Wei Li. 2003. Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In CoNLL.
[21] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2016. SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents. In AAAI.
[22] Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In CoNLL.
[23] Ramesh Nallapati, Bowen Zhou, and Mingbo Ma. 2017. Classify or Select: Neural Architectures for Extractive Document Summarization. CoRR abs/1611.04244 (2017).
[24] Paul Over, Hoa Dang, and Donna K. Harman. 2007. DUC in context. Inf. Process. Manage. 43 (2007), 1506-1520.
[25] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.
[26] Erik Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In CoNLL.
[27] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL.
[28] Satoshi Sekine. 1998. Description of the Japanese NE System Used for MET-2. In MUC.
[29] Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, and Zheng Chen. 2007. Document Summarization Using Conditional Random Fields. In IJCAI.
[30] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014), 1929-1958.
[31] Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In IJCAI.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.
[33] Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-Document Summarization using Sentence-based Topic Models. In ACL/IJCNLP.
[34] Zhilin Yang, Ruslan R. Salakhutdinov, and William W. Cohen. 2016. Multi-Task Cross-Lingual Sequence Tagging from Scratch. CoRR abs/1603.06270 (2016).
[35] Wenpeng Yin and Yulong Pei. 2015. Optimizing Sentence Modeling and Selection for Document Summarization. In IJCAI.