                                   Abstractive Summarization of
                             Spoken and Written Instructions with BERT

Alexandra Savelieva∗† (UC Berkeley School of Information, saveale@ischool.berkeley.edu)
Bryan Au-Yeung∗ (UC Berkeley School of Information, bkauyeung@berkeley.edu)
Vasanth Ramani∗† (UC Berkeley School of Information, rlvasanth@ischool.berkeley.edu)




                        Figure 1: A screenshot of a How2 YouTube video with transcript and model generated summary.
ABSTRACT
Summarization of speech is a difficult problem due to the spontaneity of the flow, disfluencies, and other issues that are not usually encountered in written texts. Our work presents the first application of the BERTSum model to conversational language. We generate abstractive summaries of narrated instructional videos across a wide variety of topics, from gardening and cooking to software configuration and sports. In order to enrich the vocabulary, we use transfer learning and pretrain the model on a few large cross-domain datasets in both written and spoken English. We also do preprocessing of transcripts to restore sentence segmentation and punctuation in the output of an ASR system. The results are evaluated with ROUGE and Content-F1 scoring for the How2 and WikiHow datasets. We engage human judges to score a set of summaries randomly selected from a dataset curated from HowTo100M and YouTube. Based on blind evaluation, we achieve a level of textual fluency and utility close to that of summaries written by human content creators. The model beats current SOTA when applied to WikiHow articles that vary widely in style and topic, while showing no performance regression on the canonical CNN/DailyMail dataset. Due to the high generalizability of the model across different styles and domains, it has great potential to improve accessibility and discoverability of internet content. We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.

CCS CONCEPTS
• Human-centered computing → Accessibility systems and tools; • Computing methodologies → Information extraction.

KEYWORDS
Text Summarization; Natural Language Processing; Information Retrieval; Abstraction; BERT; Neural Networks; Virtual Assistant; Narrated Instructional Videos; Language Modeling

ACM Reference Format:
Alexandra Savelieva, Bryan Au-Yeung, and Vasanth Ramani. 2020. Abstractive Summarization of Spoken and Written Instructions with BERT. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse’20). ACM, New York, NY, USA, 9 pages.

∗ Equal contribution.
† Also with Microsoft.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
KDD Converse’20, August 2020,
© 2020 Copyright held by the owner/author(s).
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   INTRODUCTION
The motivation behind our work involves making the growing amount of user-generated online content more accessible. In order to help users digest information, our research focuses on improving automatic summarization tools. Many creators of online content use a variety of casual language, filler words, and professional jargon. Hence, summarization of text implies not only an extraction of important information from the source, but also a transformation to a more coherent and structured output.




                                     Figure 2: A taxonomy of summarization types and methods.


In this paper we focus on both extractive and abstractive summarization of narrated instructions in both written and spoken forms. Extractive summarization is a simple classification problem: the model identifies the most important sentences in the document and classifies whether each sentence should be included in the summary. Abstractive summarization, on the other hand, requires language generation capabilities to create summaries containing novel words and phrases not found in the source text. Language models for summarization of conversational texts often face issues with fluency, intelligibility, and repetition. This is the first attempt to use a BERT-based model for summarizing spoken language from ASR (speech-to-text) inputs. We are aiming to develop a generalized tool that can be used across a variety of domains for How2 articles and videos. Success in solving this problem opens up possibilities for extension of the summarization model to other applications in this area, such as summarization of dialogues in conversational systems between humans and bots [13].

The rest of this paper is divided into the following sections:
• A review of state-of-the-art summarization methods;
• A description of the datasets of texts, conversations, and summaries used for training;
• Our application of BERT-based text summarization models [17] and fine-tuning on auto-generated scripts from instructional videos;
• Suggested improvements to evaluation methods in addition to the metrics [12] used by previous research;
• Analysis of experimental results and comparison to benchmarks.

2   PRIOR WORK
A taxonomy of summarization types and methods is presented in Figure 2. Prior to 2014, summarization was centered on extracting lines from single documents using statistical models and neural networks, with limited success [23] [17]. The work on sequence-to-sequence models by Sutskever et al. [22] and Cho et al. [2] opened up new possibilities for neural networks in natural language processing. From 2014 to 2015, LSTMs (a variety of RNN) became the dominant approach in the industry, achieving state-of-the-art results. Such architectural changes became successful in tasks such as speech recognition, machine translation, parsing, and image captioning. These results paved the way for abstractive summarization, which began to score competitively against extractive summarization. In 2017, a paper by Vaswani et al. [25] provided a solution to the 'fixed length vector' problem, enabling neural networks to focus on important parts of the input for prediction tasks. Applying attention mechanisms with transformers became more dominant for tasks such as translation and summarization.

In abstractive video summarization, models which incorporate variations of LSTM and deep layered neural networks have become state-of-the-art performers. In addition to textual inputs, recent research in multi-modal summarization incorporates visual and audio modalities into language models to generate summaries of video content. However, generating compelling summaries from conversational texts using transcripts or a combination of modalities is still challenging. The deficiency of human annotated data has limited the amount of benchmarked datasets available for such


research [18] [10]. Most work in the field of document summarization relies on structured news articles. Video summarization focuses on heavily curated datasets with structured time frames, topics, and styles [4]. Additionally, video summarization has traditionally been accomplished by isolating and concatenating important video frames using natural language processing techniques [5]. Above all, there are often inconsistencies and stylistic changes in spoken language that are difficult to translate into written text. In this work, we approach video summarization by extending top performing single-document text summarization models [19] to a combination of narrated instructional videos, texts, and news documents of various styles, lengths, and literary attributes.

Table 1: Training and Testing Datasets

  Total Training Dataset Size                    535,527
  CNN/DailyMail                                  90,266 and 196,961
  WikiHow Text                                   180,110
  How2 Videos                                    68,190
  Total Testing Dataset Size                     5,195 videos
  YouTube (DIY Videos and How-To Videos)         1,809
  HowTo100M                                      3,386

Table 2: Additional Dataset Statistics

  YouTube Min / Max Length                       4 / 1,940 words
  YouTube Avg Length                             259 words
  HowTo100M Sample Min / Max Length              5 / 6,587 words
  HowTo100M Sample Avg Length                    859 words

3   METHODOLOGY

3.1   Data Collection
We hypothesize that our model's ability to form coherent summaries across various texts will benefit from training across larger amounts of data. Table 1 illustrates the sizes of the textual and video datasets. All training datasets include written summaries. The language and length of the data span from informal to formal and from single-sentence to short-paragraph styles.
• CNN/DailyMail dataset [7]: CNN and DailyMail includes a combination of news articles and story highlights, with an average length of 119 words per article and 83 words per summary. Articles were collected from 2007 to 2015.
• WikiHow dataset [9]: a large-scale text dataset containing over 200,000 single-document summaries. WikiHow is a consolidated set of recent 'How To' instructional texts compiled from wikihow.com, ranging from topics such as 'How to deal with coronavirus anxiety' to 'How to play Uno.' These articles vary in size and topic but are structured to instruct the user. The first sentences of each paragraph within the article are concatenated to form a summary.
• How2 Dataset [20]: This YouTube compilation has videos (8,000 videos, approximately 2,000 hours) averaging 90 seconds in length and 291 words in transcript length. It includes human-written summaries which video owners were instructed to write to maximize the audience. Summaries are two to three sentences in length, with an average length of 33 words.

Despite the development of instructional datasets such as WikiHow and How2, advancements in summarization have been limited by the availability of human annotated transcripts and summaries. Such datasets are difficult to obtain and expensive to create, often resulting in repetitive usage of singular-tasked and highly structured data. As seen with samples in the How2 dataset, only videos with a certain length and a structured summary are used for training and testing. To extend our research boundaries, we complemented existing labeled summarization datasets with auto-generated instructional video scripts and human-curated descriptions.

We introduce a new dataset obtained by combining several 'How-To' and Do-It-Yourself YouTube playlists along with samples from the published HowTo100Million Dataset [16]. To test the plausibility of using this model in the wild, we selected videos across different conversational texts that have no corresponding summaries or human annotations. The selected 'How-To' [24] and 'DIY' [6] datasets are instructional playlists covering different topics, from downloading mainstream software to home improvement. The 'How-To' playlist uses machine voice-overs in the videos to aid instruction, while the 'DIY' playlist has videos with a human presenter. The HowTo100Million Dataset is a large-scale dataset of over 100 million video clips taken from narrated instructional videos across 140 categories. Our dataset incorporates a sample across all categories and utilizes the natural language annotations from automatically transcribed narrations provided by YouTube.

3.2   Preprocessing
Due to the diversity and complexity of our input data, we built a preprocessing pipeline for aligning the data to a common format. We observed issues with lack of punctuation, incorrect wording, and extraneous introductions, which impacted model training. With these challenges, our model misinterpreted text segment boundaries and produced poor quality summaries. In exceptional cases, the model failed to produce any summary. In order to maintain the fluency and coherency of human written summaries, we cleaned and restored sentence structure as shown in Figure 3. We applied entity detection from spaCy [8], an open-source software library for advanced natural language processing, and NLTK, the Natural Language Toolkit for symbolic and statistical natural language processing [15], to remove introductions and anonymize the inputs of our summarization model. We split sentences and tokenized using the Stanford CoreNLP toolkit on all datasets, and preprocessed the data in the same manner as See et al. [21].
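To make the cleaning step above concrete, here is a minimal sketch of such a pipeline, assuming spaCy and its en_core_web_sm model are installed. The greeting list, the [NAME] placeholder, and the function name are our own illustrative choices rather than the exact implementation used in this work, which also relies on NLTK and the Stanford CoreNLP tokenizer.

    # Minimal preprocessing sketch (illustrative, not the exact pipeline used here):
    # restore sentence boundaries, drop greeting-style introductions, and anonymize
    # person names so the summarizer does not latch onto speaker identities.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    GREETINGS = ("hi", "hello", "hey", "welcome", "my name is")

    def preprocess_transcript(text: str) -> str:
        kept = []
        for sent in nlp(text).sents:                        # spaCy restores sentence segmentation
            s = sent.text.strip()
            if s and not s.lower().startswith(GREETINGS):   # drop extraneous introductions
                kept.append(s)
        cleaned = " ".join(kept)
        # Replace person entities with a placeholder to anonymize the input.
        doc = nlp(cleaned)
        out, last = [], 0
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                out.append(cleaned[last:ent.start_char] + "[NAME]")
                last = ent.end_char
        out.append(cleaned[last:])
        return "".join(out)

    print(preprocess_transcript(
        "hi everyone, my name is Alex. today we plant rudbeckia in full sun."))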




                                  Figure 3: A pipeline for preprocessing of texts for summarization.


3.3   Summarization models
We utilized the BertSum models proposed in [14] for our research. This includes both extractive and abstractive summarization models, which employ a document-level encoder based on BERT. The transformer architecture applies a pretrained BERT encoder with a randomly initialized Transformer decoder. It uses two different learning rates: a low rate for the encoder and a separate, higher rate for the decoder to enhance learning.

We used a 4-GPU Linux machine and initialized a baseline by training an extractive model on 5,000 video samples from the How2 dataset. Initially, we applied BERT base uncased with 10,000 steps and fine-tuned the summarization model and the BERT layer, selecting the top-performing epoch sizes. We followed this initial model by training the abstractive model on How2 and WikiHow individually.

The best version of the abstractive summarization model was trained on our aggregated dataset of the CNN/DailyMail, WikiHow, and How2 datasets, with a total of 535,527 examples and 210,000 steps. We used a training batch size of 50 and ran the model for 20 epochs. By controlling the order of the datasets on which we trained our model, we were able to improve the fluency of summaries. As stated in previous research, the original model contained more than 180 million parameters and used two Adam optimizers with β1 = 0.9 and β2 = 0.999 for the encoder and decoder respectively. The encoder used a learning rate of 0.002 and the decoder a learning rate of 0.2, to ensure that the encoder was trained with more accurate gradients while the decoder became stable. The results of the experiments are discussed in Section 4.
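The two-rate schedule can be illustrated with the schematic PyTorch sketch below. The model.encoder and model.decoder attributes are placeholders, and the actual BertSum implementation also uses warmup schedules, so this only shows the idea of separate Adam optimizers with the learning rates quoted above.

    # Schematic sketch of the fine-tuning setup described above: two Adam optimizers
    # with beta1=0.9 and beta2=0.999, a small learning rate for the pretrained BERT
    # encoder and a larger one for the randomly initialized Transformer decoder.
    import torch

    def build_optimizers(model):
        enc_opt = torch.optim.Adam(model.encoder.parameters(), lr=0.002, betas=(0.9, 0.999))
        dec_opt = torch.optim.Adam(model.decoder.parameters(), lr=0.2, betas=(0.9, 0.999))
        return enc_opt, dec_opt

    def training_step(model, batch, enc_opt, dec_opt):
        loss = model(batch)        # assume the model returns the summarization loss
        enc_opt.zero_grad()
        dec_opt.zero_grad()
        loss.backward()
        enc_opt.step()             # the encoder moves slowly, preserving pretrained weights
        dec_opt.step()             # the decoder learns faster until it becomes stable
        return loss.item()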
We hypothesized that the training order is important to the model, in the same way humans learn. The idea of applying curriculum learning [1] in natural language processing has been a growing topic of interest [26]. We begin training on highly structured samples before moving to more complicated but predictable language structure. Only after training on textual scripts do we proceed to video scripts, which present additional challenges of ad-hoc flow and conversational language.

3.4   Scoring of results
Results were scored using ROUGE, the standard metric for abstractive summarization [11]. While we expected a correlation between good summaries and high ROUGE scores, we observed examples of poor summaries with high scores and good summaries with low ROUGE scores. An illustrative example of why ROUGE metrics are not sufficient is presented in the Appendix, Figure 10.

Additionally, we added Content F1 scoring, a metric proposed by Carnegie Mellon University [3], to focus on the relevance of content. Similar to ROUGE, Content F1 scores summaries with a weighted f-score and a penalty for incorrect word order. It also discounts stop and buzz words that frequently occur in the How-To domain, such as "learn from experts how to in this free online video".
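As an illustration of how a candidate summary can be scored against a reference, the sketch below computes ROUGE with the open-source rouge-score package and a deliberately simplified, buzzword-insensitive token-overlap F1. The real Content F1 metric [3] additionally weights the f-score and penalizes incorrect word order, so content_f1 here is only a toy stand-in.

    # Illustrative scoring sketch; content_f1 is a simplified stand-in for Content F1.
    from rouge_score import rouge_scorer

    BUZZWORDS = {"learn", "free", "online", "video", "expert", "experts",
                 "this", "in", "from", "how", "to"}

    def content_f1(reference: str, candidate: str) -> float:
        ref = {w.strip(".,!?") for w in reference.lower().split()} - BUZZWORDS
        cand = {w.strip(".,!?") for w in candidate.lower().split()} - BUZZWORDS
        overlap = len(ref & cand)
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(cand), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    reference = "growing rudbeckia requires full hot sun and good drainage."
    candidate = "learn how to grow rudbeckia in full sun in this free online video."

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    print(scorer.score(reference, candidate))   # ROUGE-1 / ROUGE-L precision, recall, F1
    print(content_f1(reference, candidate))     # buzzword-insensitive overlap F1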
To score passages with no written summaries, we surveyed human judges with a framework for evaluation built using Python, Google Forms, and Excel spreadsheets. Summaries included in the surveys were randomly sampled from our dataset to avoid biases. In order to avoid asymmetrical information between human- and machine-generated summaries, we removed capitalized text. We asked two types of questions: the first, a Turing-test question, asked participants to distinguish AI-generated from human-generated descriptions; the second asked for quality ratings of the summaries. Below are the definitions of the criteria, for clarity:
• Fluency: Does the text have a natural flow and rhythm?
• Usefulness: Does it have enough information to make a user decide whether they want to spend time watching the video?
• Succinctness: Does the text look concise, or does it have redundancy?
• Consistency: Are there any non sequiturs - ambiguous, confusing or contradicting statements in the text?
• Realisticity: Is there anything that seems far-fetched and bizarre in word combinations and doesn't look "normal"?
Options for grading summaries are as follows: 1: Bad, 2: Below Average, 3: Average, 4: Good, 5: Great.

4   EXPERIMENTS AND RESULTS
4.1   Training
The BertSum model is the best performing model on the CNN/DailyMail dataset, producing state-of-the-art results (Row 6 in Table 3). BertSum supports both extractive and abstractive summarization techniques. Our baseline results were obtained by applying this extractive BertSum model, pretrained on CNN/DailyMail, to How2 videos.


But the model produced very low scores for our scenario. Summaries generated from the model were incoherent, repetitive, and uninformative. Despite the poor overall performance, the model performed better in the health sub-domain within How2 videos. We explained this as a symptom of heavy coverage of that topic in news reports in CNN/DailyMail. We realized that extractive summarization is not the strongest model for our goal: most YouTube videos are presented with a casual conversational style, while summaries have higher formality. We pivoted to abstractive summarization to improve performance.

The abstractive model uses an encoder-decoder architecture, combining the same pretrained BERT encoder with a randomly initialized Transformer decoder. It uses a special technique where the encoder portion is kept almost the same, with a very low learning rate, and a separate learning rate is created for the decoder to make it learn better. In order to create a generalizable abstractive model, we first trained on a large corpus of news. This allowed our model to understand structured texts. We then introduced WikiHow, which exposes the model to the How-To domain. Finally, we trained and validated on the How2 dataset, narrowing the focus of the model to a selectively structured format. In addition to ordered training, we experimented with training the model using random sets of homogeneous samples. We discovered that training the model using an ordered set of samples performed better than random ones.

The cross entropy chart in Figure 4 shows that the model is neither overfitting nor underfitting the training data. A good fit is indicated by the convergence of the training and validation lines.

Figure 4: Cross Entropy: Training vs Validation

Figure 5 shows the model's accuracy metric on the training and validation sets. The model is validated using the How2 dataset against the training dataset. The model improves as expected with more steps.

Figure 5: Accuracy: Training vs Validation

4.2   Evaluation
The BertSum model trained on CNN/DailyMail [14] resulted in state-of-the-art scores when applied to samples from those datasets. However, when tested on our How2 test dataset, it gave very poor performance and showed a lack of generalization in the model (see Row 1 in Table 3). Looking at the data, we found that the model tends to pick the first one or two sentences for the summary. We hypothesized that removing introductions from the text would help improve ROUGE scores. Our model improved by a few ROUGE points after applying the preprocessing described in Section 3.2 above. Another improvement came from adding word deduping to the output of the model, as we observed repetition occurring on rare words which are unfamiliar to the model. We still did not achieve scores higher than 22.5 ROUGE-1 F1 and 20 ROUGE-L F1 (initial scores achieved from training with only the CNN/DailyMail dataset and tested on How2 data). Reviewing scores and texts of individual summaries showed that the model performed better on some topics, such as medicine, while scoring lower on others, such as sports.
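The word-deduping step mentioned above can be as simple as collapsing immediate repetitions of the same token; the helper below is our minimal illustration of that idea, not the exact rule used in our pipeline.

    # Collapse immediate word repetitions such as "clip clip clip" -> "clip".
    def dedupe_repeated_words(text: str) -> str:
        out = []
        for token in text.split():
            if not out or token.lower() != out[-1].lower():
                out.append(token)
        return " ".join(out)

    print(dedupe_repeated_words(
        "learn how to do the the the a in this free video clip clip clip series"))
    # -> "learn how to do the a in this free video clip series"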
The differences in conversational style between the video scripts and the news stories (on which the models were pretrained) impacted the quality of the model output. In our initial application of the extractive summarization model pretrained on the CNN/DailyMail dataset, stylistic errors manifested in a distinct way. The model considered initial introductory sentences to be important in generating summaries (this phenomenon is referred to by [15] as N-lead, where N is the number of important first sentences). Our model generated short, simply worded summaries such as "hi!" and "hello, this is ".

Retraining abstractive BertSum on How2 gave a very interesting, unexpected result: the model converged to a state of spitting out the same meaningless summary of buzzwords that are common to most videos, regardless of the domain: "learn how to do the the the a in this free video clip clip clip series clip clip on how to make a and expert chef and expert in this unique and expert and expert. to utilize and professional . this unique expert for a professional."

In our next series of experiments, we used the extended dataset for training. Even though the ROUGE scores for BertSum Model 1 (see Table 3) are not drastically different from those of BertSum Models 2 and 3, the quality of the summaries from the perspective of human judges is qualitatively different.

Our best results on How2 videos (see experiment 4 in Table 3) were accomplished by leveraging the full set of labeled datasets (CNN/DM, WikiHow, and How2 videos) with an order-preserving configuration. The best ROUGE scores we obtained for video summarization are comparable to the best results among news documents [14] (see row 9 in Table 3).

Finally, we beat the current best results on WikiHow. The current benchmark Rouge-L score for the WikiHow dataset is 26.54 (Row 8 in Table 3). Our model uses the BERT abstractive summarization model to produce a Rouge-L score of 36.8 (Row 5 in Table 3), outperforming the


current benchmark score by 10.26 points. Compared to the Pointer Generator+Coverage model, the improvement on Rouge-1 and Rouge-L is about 10 points each. We got the same results when testing for WikiHow using BertSum with ordered training on the entire How2, WikiHow, and CNN/DailyMail dataset.

With our initial results, we achieved fluent and understandable video descriptions which give a clear idea about the content. Our scores did not surpass scores from other researchers [20] despite employing BERT. However, our summaries appear to be more fluent and useful in content for users looking at summaries in the How-To domain. Some examples are given in Appendix C.

Abstractive summarization was helpful for reducing the effects of speech-to-text errors, which we observed in some video transcripts, especially auto-generated closed captioning in the additional dataset that we created as part of this project (transcripts in How2 videos were reviewed and manually corrected, so spelling errors there are less common). For example, in one of the samples in our test dataset, closed captioning confuses the speaker's words "how you get a text from a YouTube video" with "how you get attacks from a YouTube video". As there is usually a lot of redundancy in explanations, the model is still able to figure out sufficient context to produce a meaningful summary. We did not observe situations where the summaries failed to match the topic of the video because of spelling errors that frequently occur in ASR-generated scripts without human supervision, but ensuring correct boundaries between sentences by using spaCy to fix punctuation errors at the preprocessing stage made a very big difference.

Based on these observations, we decided that the model generated strong results comparable to human written descriptions. To analyze the differences in summary quality, we leveraged the help of human experts to evaluate conversational characteristics between our summaries and the descriptions that users provide for their videos on YouTube. We recruited a diverse group of 30+ volunteers to blindly evaluate a set of 25 randomly selected video summaries that were generated by our model and video descriptions from our curated conversational dataset. We created two types of questions: one, a version of the famous Turing test, was a challenge to distinguish AI- from human-curated descriptions and used the framework described in Section 3.4. Participants were made aware that there was an equal possibility that some, all, or none of these summary outputs were machine generated in this classification task. The second question collected a distribution of ratings addressing conversation quality. The aggregated results for both evaluations are in Figures 6-8. We observed zero perfect scores on the Turing test answers. Results included many false positives and false negatives (Appendix D).

The quality of our test output is comparable to YouTube summaries. "Realistic" text is the main growth opportunity, because the abstractive model is prone to generating incoherent sentences that are grammatically correct. Human authors are prone to making language use errors. The advantage of using abstractive summarization models is that they allow us to mitigate some issues with video authors' grammar.

Figure 6: Scores of human judges in the challenge to distinguish ML-generated summaries from actual video annotations on YouTube

Figure 7: Distribution of average FP and FN ratio per question

Figure 8: Quality assessment of generated summaries

5   CONCLUSION
The contributions of our research address multiple issues that we identified in pursuit of generalizing the BertSum model for summarization of instructional video scripts throughout the training process.
• We explored how different combinations of training data and parameters impact the training performance of the BertSum abstractive summarization model.
• We came up with novel preprocessing steps for auto-generated closed captioning scripts before summarization.
• We generalized the BertSum abstractive summarization model to auto-generated instructional video scripts with a quality level that is close to randomly sampled descriptions created by YouTube users.


• We designed and implemented a new framework for blind, unbiased review that produces more actionable and objective scores, augmenting ROUGE, BLEU and Content F1.

All the artifacts listed above are available in our repository for the benefit of future researchers¹. Overall, the results we have obtained so far on amateur narrated instructional videos make us believe that we were able to come up with a trained model that generates summaries from ASR (speech-to-text) scripts of quality competitive with human-curated descriptions on YouTube. With the limited availability of labeled summary datasets, our future plan is to create several benchmark models to extend the human evaluation framework with human curated summaries. Given the successes of generalized summaries across informal and formal styles of conversation, we believe that investigating the application of these summarization models to human-chatbot dialogues is an important direction for future work.

¹ https://github.com/alebryvas/berk266/ - it is not a public repository yet, but we can provide access upon request.

Table 3: Comparison of results

  Model                                                 Pretraining Data                     Test Set   Rouge-1          Rouge-L          Content-F1
  1. BertSum, BertSum with pre+post processing          CNN/DM                               How2       18.08 to 22.47   18.01 to 20.07   26.0
  2. BertSum with random training                       How2, 1/50 Sampled-WikiHow, CNN/DM   How2       24.4             21.45            18.7
  3. BertSum with random training and postprocessing    How2, 1/50 Sampled-WikiHow, CNN/DM   How2       26.32            22.47            32.9
  4. BertSum with ordered training                      How2, WikiHow, CNN/DM                How2       48.26            44.02            36.4
  5. BertSum                                            WikiHow                              WikiHow    35.91            34.82            29.8
  6. BertSum [14]                                       CNN/DM                               CNN/DM     43.23            39.63            Out of Scope
  7. Multi-modal Model [18]                             How2                                 How2       59.3             59.2             48.9
  8. MatchSum (BERT-base) [27]                          WikiHow                              WikiHow    31.85            29.58            Not Available
  9. Lead 3 for WikiHow [9]                             Not Applicable                       CNN/DM     40.34            36.57            Not Available
  10. Lead 3 for CNN/DM [9]                             Not Applicable                       WikiHow    26.00            24.25            Not Available
  11. Lead 3 for How2 [9]                               Not Applicable                       How2       23.66            20.69            16.2

ACKNOWLEDGMENTS
We would first like to thank Shruti Palaskar, whose mentorship and guidance were invaluable throughout our research. We would also like to thank Jack Xue, James McCaffrey, Jiun-Hung Chen, Jon Rosenberg, Isidro Hegouaburu, Sid Reddy, Mike Tamir, and Troy Deck for their insights and feedback. We also thank the survey participants for taking time to complete our human evaluation study.

REFERENCES
[1] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML (ACM International Conference Proceeding Series), Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman (Eds.), Vol. 382. ACM, 41–48. http://dblp.uni-trier.de/db/conf/icml/icml2009.html#BengioLCW09
[2] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR abs/1406.1078 (2014). arXiv:1406.1078 http://arxiv.org/abs/1406.1078
[3] Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
[4] B. Erol, D.-S. Lee, and J. Hull. 2003. Multimodal summarization of meeting recordings. In 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698), Vol. 3. III–25.
[5] Berna Erol, Dar-Shyang Lee, and Jonathan J. Hull. 2003. Multimodal summarization of meeting recordings. In Proceedings of the 2003 IEEE International Conference on Multimedia and Expo, ICME 2003, 6-9 July 2003, Baltimore, MD, USA. IEEE Computer Society, 25–28. https://doi.org/10.1109/ICME.2003.1221239
[6] GardenFork. 2018. DIY How-to Videos. Retrieved July 15, 2020 from https://www.youtube.com/playlist?list=PL05C1F99A68D37472
[7] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 1693–1701. http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.pdf
[8] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. (2017). To appear.
[9] Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A Large Scale Text Summarization Dataset. CoRR abs/1810.09305 (2018). arXiv:1810.09305 http://arxiv.org/abs/1810.09305
[10] Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, and Chengqing Zong. 2017. Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1092–1102. https://doi.org/10.18653/v1/D17-1114
[11] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://www.aclweb.org/anthology/W04-1013
[12] Chin-Yew Lin and Eduard Hovy. 2002. Manual and Automatic Evaluation of Summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization - Volume 4 (Philadelphia, Pennsylvania) (AS '02). Association for Computational Linguistics, USA, 45–51. https://doi.org/10.3115/1118162.1118168


[13] Chunyi Liu, Peng Wang, Xu Jiang, Zang Li, and Jieping Ye. 2019. Automatic Dialogue Summary Generation for Customer Service. 1957–1965. https://doi.org/10.1145/3292500.3330683
[14] Yang Liu and Mirella Lapata. 2019. Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3730–3740. https://doi.org/10.18653/v1/D19-1387
[15] Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. CoRR cs.CL/0205028 (2002). http://dblp.uni-trier.de/db/journals/corr/corr0205.html#cs-CL-0205028
[16] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. CoRR abs/1906.03327 (2019). arXiv:1906.03327 http://arxiv.org/abs/1906.03327
[17] Ani Nenkova. 2005. Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference. 1436–1441.
[18] Shruti Palaskar, Jindřich Libovický, Spandana Gella, and Florian Metze. 2019. Multimodal Abstractive Summarization for How2 Videos. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6587–6596. https://doi.org/10.18653/v1/P19-1659
[19] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 379–389. https://doi.org/10.18653/v1/D15-1044
[20] Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-scale Dataset for Multimodal Language Understanding. CoRR abs/1811.00347 (2018). arXiv:1811.00347 http://arxiv.org/abs/1811.00347
[21] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. CoRR abs/1704.04368 (2017). arXiv:1704.04368 http://arxiv.org/abs/1704.04368
[22] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. CoRR abs/1409.3215 (2014). arXiv:1409.3215 http://arxiv.org/abs/1409.3215
[23] Krysta M. Svore, Lucy Vanderwende, and Christopher J. C. Burges. 2007. Enhancing Single-Document Summarization by Combining RankNet and Third-Party Sources. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, Jason Eisner (Ed.). ACL, 448–457. https://www.aclweb.org/anthology/D07-1047/
[24] How to Videos. 2020. How-to Videos. Retrieved July 20, 2020 from https://www.youtube.com/channel/UC_qTn8RzUXBP5VJ0q2jROGQ
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
[26] Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. 2020. Curriculum Learning for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 6095–6104. https://www.aclweb.org/anthology/2020.acl-main.542
[27] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and X. Huang. 2020. Extractive Summarization as Text Matching. ArXiv abs/2004.08795 (2020).

A   MODEL DETAILS
Extractive summarization is generally a binary classification task with labels indicating whether sentences should be included in the summary. Abstractive summarization, on the other hand, requires language generation capabilities to create summaries containing novel words and phrases not found in the source text.

The architecture in Figure 9 shows the BertSum model. It uses a novel document-level encoder based on BERT which can encode a document and obtain representations for its sentences. A [CLS] token is added before every sentence, instead of just one [CLS] token as in the original BERT model. The abstractive model uses an encoder-decoder architecture, combining the same pretrained BERT encoder with a randomly initialized Transformer decoder. The model uses a special technique where the encoder portion is kept almost the same, with a very low learning rate, and a separate learning rate is used for the decoder to make it learn better.

Figure 9: BertSum Architecture. From [14]
     Empirical Methods in Natural Language Processing and Computational Natural          C    EXAMPLES OF COMPARISON OF OUR
     Language Learning, June 28-30, 2007, Prague, Czech Republic, Jason Eisner (Ed.).
     ACL, 448–457. https://www.aclweb.org/anthology/D07-1047/                                 MODEL OUTPUT VS BENCHMARK [18]
[24] How to Videos. 2020. How-to Videos. Retrieved July 20, 2020 from https://www.
     youtube.com/channel/UC_qTn8RzUXBP5VJ0q2jROGQ
                                                                                              AND REFERENCE SUMMARIES
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,            Below examples were selected to illustrate several aspects of the
     Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All
     You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/
                                                                                         problem. First, we share URLs of the videos so that the reader
     1706.03762                                                                          may view the original content. Second, we share the final result
[26] Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and                of abstractive summarization with our current best model version
     Yongdong Zhang. 2020. Curriculum Learning for Natural Language Understand-
     ing. In Proceedings of the 58th Annual Meeting of the Association for Computa-      (Summary Abs). For comparison, we provide summaries from cur-
     tional Linguistics. Association for Computational Linguistics, Online, 6095–6104.   rent Benchmark for How2 videos that bypasses our model in terms
     https://www.aclweb.org/anthology/2020.acl-main.542                                  of scores, but, as can be seen in these examples, not in the fluency
[27] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and X. Huang.
     2020. Extractive Summarization as Text Matching. ArXiv abs/2004.08795 (2020).       and usefulness. Reference represents the actual YouTube video de-
                                                                                         scription curated by the authors. For contrast, we show Summary
                                                                                         Ext - the result of extractive summarization, which explains why
A     MODEL DETAILS                                                                      abstractive summarization is a better fit for the purpose, as we are
Extractive summarization is generally a binary classification task                       trying to accomplish style conversion from spoken for the source
with labels indicating whether sentences should be included in the                       text to written for the target summary. Since BertSum is uncased,
summary. Abstractive summarization, on the other hand, requires                          all texts below were converted to lower case for consistency.
language generation capabilities to create summaries containing                                • Video 1: https://www.youtube.com/watch?v=F_4UZ3bGMP8
novel words and phrases not found in the source text.                                          • Summary Abs 1: growing rudbeckia requires full hot sun and
   The architecture in the Figure 9 shows the BertSum model. It                                  good drainage. grow rudbeckia with tips from a gardening
uses a novel documentation level encoder based on BERT which                                     specialist in this free video on plant and flower care. care for
can encode a document and obtain representation for the sentences.                               rudbeckia with gardening tips from an experienced gardener.
CLS token is added to every sentence instead of just 1 CLS token                               • Benchmark 1: growing black - eyed - susan is easy with these
in the original BERT model. Abstractive model uses an encoder-                                   tips, get expert gardening tips in this free gardening video .
decoder architecture, combining the same pretrained BERT encoder                               • Reference 1: growing rudbeckia plants requires a good deal
with a randomly initialized Transformer decoder. The model uses a                                of hot sun and plenty of good drainage for water. start a
special technique where the encoder portion is almost kept same                                  rudbeckia plant in the winter or anytime of year with advice


• Summary Ext 1: make sure that your plants are in your garden. get your plants. don't go to the flowers. go to your garden's soil. put them in your plants in the water. take care of your flowers.
• Video 2: https://www.youtube.com/watch?v=LbsGHj2Akao
• Summary Abs 2: camouflage thick arms by wearing sleeves that are not close to the arms and that have a line that goes all the way to the waist. avoid wearing jackets and jackets with tips from an image consultant in this free video on fashion. learn how to dress for fashion modeling.
• Benchmark 2: hide thick arms and arms by wearing clothes that hold the arms in the top of the arm. avoid damaging the arm and avoid damaging the arms with tips from an image consultant in this free video on fashion.
• Reference 2: hide thick arms by wearing clothes sleeves that almost reach the waist to camouflage the area. conceal the thickness at the top of the arms with tips from an image consultant in this free video on fashion.
• Summary Ext 2: make sure that you have a look at the top of the top. if you want to wear the right arm. go to the shoulder. wear a long-term shirts. keep your arm in your shoulders. don't go out.

D   EXAMPLES OF FALSE POSITIVES AND FALSE NEGATIVES FROM SURVEY RESULTS
False Negative (FN): Survey participants believed sample summaries were written by robots when the samples were written by humans.
False Positive (FP): Survey participants believed sample summaries were written by humans when the samples were written by a robot.

FN Examples:
• "permanently fix flat atv tires with tireject ??. dry rot, bead leaks, nails, sidewall punctures are no issue. these 30yr old atv tires permanently sealed and back into service in under 5 min. they sealed right up and held air for the first time in a long time. this liquid rubber and kevlar are a permanent repair and will protect from future punctures."
• "how to repair a bicycle tire : how to remove the tube from bicycle tires. by using handy tire levers, expert cyclist shows how to remove the tube from bicycle tires, when changing a flat tire, in this free bicycle repair video."

FP Examples:
• "learn about the parts of a microscope with expert tips and advice on refurbishing antiques from a professional photographer in this free video clip on home astronomy and buildings. learn more about how to use a light microscope with a demonstration from a science teacher."
• "watch as a seasoned professional demonstrates how to use a deep fat fryer in this free online video about home pool care. get professional tips and advice from an expert on how to organize your kitchen appliance and kitchen appliance for special occasions."