                     Automatic Slide Generation for Scientific Papers
                                   Athar Sefid                                                                 Jian Wu
                       Pennsylvania State University                                                  Old Dominion University
                            azs5955@psu.edu                                                               jwu@cs.odu.edu

                               Prasenjit Mitra                                                              C. Lee Giles
                       Pennsylvania State University                                               Pennsylvania State University
                           pmitra@ist.psu.edu                                                           giles@ist.psu.edu

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
SciKnow '19, November 19–22, 2019, Marina del Rey, CA

ABSTRACT
We describe our approach for automatically generating presentation slides for scientific papers using deep neural networks. Such slides can give authors a starting point for their slide generation process. Extractive summarization techniques are applied to rank and select important sentences from the original document. Previous work identified important sentences based only on a limited number of features extracted from the position and structure of sentences in the paper. Our method extends previous work by (1) extracting a more comprehensive list of surface features, (2) considering the semantics, or meaning, of the sentence, and (3) using the context around the current sentence to rank sentences. Once the sentences are ranked, salient sentences are selected using Integer Linear Programming (ILP). Our results show the efficacy of our model for summarization and the slide generation task.

CCS CONCEPTS
• Computing methodologies → Neural networks; Classification and regression trees; • Information systems → Document topic models; Retrieval models and ranking.

KEYWORDS
Natural Language Processing, Text Mining, Slide Generation, Summarization, Deep Learning

1    INTRODUCTION
Scientific results are usually disseminated by publications and research presentations. The latter has become a convention in almost all scientific domains since it provides an efficient way for others to grasp the major contribution of a paper. Currently, PowerPoint automates slide formatting and thus significantly reduces the effort needed to make professional and useful slides. However, PowerPoint cannot automatically generate slides based on the content of research papers. Automatically generating slides would not only save presenters' time in preparing presentations; it also provides a way to summarize a paper and can be used to generate summaries for papers that do not have publicly available slides.
   The two major approaches to summarization are abstractive and extractive methods. Abstractive approaches summarize text using words not necessarily appearing in the original document. Extractive summarization focuses on identifying important constituents, usually at the sentence level, and connecting them to generate a text snippet that is usually significantly shorter than the original document. This method has been applied in slide generation tasks, e.g., [6], because (1) the selected sentences have correct grammatical structure and (2) they are scientifically/technically consistent with the original article.
   The two major components of extractive summarization are sentence scoring and sentence selection. The goal is to keep salient sentences while excluding redundant information. Sentence scoring is usually converted to a regression problem. The scores depend on features extracted from the current sentence and its contextual sentences.

Figure 1: Slide Generation flow chart

   The main contributions of this paper are the following.
   (1) To the best of our knowledge, we are the first to use deep neural models to encode sentences and their context as new features in sentence ranking to build slides.
   (2) We combine regression with integer linear programming (ILP) to select salient sentences. Then, noun phrases are extracted from the selected sentences to build the first-level bullet points in slides.


Here, we focus on the textual components. This approach can later be enhanced with visual elements, e.g., figures and tables, in order to make better, more complex slides.

2     RELATED WORK
There has been much work on text summarization using both abstractive and extractive methods. However, automatic generation of slides, which can be seen as a specific form of summarization of scholarly papers, has not been well studied.

2.1     Summarization
Unsupervised Models. Identifying important sentences for generating a limited-length summary can be formalized as an optimization problem, which can be NP-hard. Maximum Marginal Relevance [2] has been used to heuristically select salient sentences while keeping redundancy with the already selected summary as low as possible. Sentence selection can also be cast as a Knapsack problem that maximizes the total score of the selected sentences under a length limit and can be solved with Integer Linear Programming (ILP), an optimization method with linear constraints and a linear objective function [13]. Graph-based summarization methods model a document as a graph in which vertices are sentences and edges encode the similarity between them. One example is TextRank [14], which extends Google's PageRank and selects important sentences that have a high degree of token overlap with other sentences. LexRank [5], a stochastic graph-based method, measures sentence importance by the eigenvector centrality of the sentences in the graph.
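To make the knapsack-style formulation above concrete, the following is a minimal, hypothetical sketch using the PuLP library: sentences with precomputed salience scores are selected to maximize total score under a word-budget constraint. The scores, lengths, and budget are illustrative assumptions, not values from this paper, and the real systems cited here use richer constraint sets.

```python
# Hypothetical knapsack-style ILP for sentence selection (not the exact
# constraint set of any cited system): maximize total salience subject
# to a summary length budget.
import pulp

scores = [0.81, 0.42, 0.67, 0.15]   # assumed per-sentence salience scores
lengths = [22, 15, 30, 9]           # sentence lengths in words (assumed)
budget = 40                         # maximum summary length in words (assumed)

prob = pulp.LpProblem("sentence_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(scores))]

prob += pulp.lpSum(scores[i] * x[i] for i in range(len(scores)))              # objective
prob += pulp.lpSum(lengths[i] * x[i] for i in range(len(scores))) <= budget   # length limit

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(len(scores)) if x[i].value() == 1]
print(selected)
```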
Feature-Based Models. Many traditional extractive summarization approaches use feature-based models; term frequency is an effective feature, and others include the length and position of sentences [18]. A Support Vector Regressor has been applied to score and extract salient sentences [9]. Others have modeled a document as a sequence of sentences and labeled each sentence with 1 or 0, where 1 indicates that the corresponding sentence belongs to the summary and 0 that it does not [19]. Hidden Markov Models (HMM) have been adopted to solve such sequence labeling problems [4].
Deep Neural Network Models. State-of-the-art models now use deep neural networks, which benefit from both the semantics of the sentences and their structure [7, 8]. Zhou et al. used multiple layers of bidirectional Gated Recurrent Units (GRU) to jointly learn the scoring and selection of sentences [23]. This work applies a bi-GRU to make sentence embeddings and then combines the sentence vectors with another bi-GRU layer to make document-level representations.

2.2     Slide Generation
Previously, a method was proposed to generate slides from documents by identifying important topics and then adding their related sentences to the summary [12]. A tool called PPSGen applies a Support Vector Regressor for sentence ranking and then ILP to select important sentences with a set of sophisticated constraints [6]. Another method generates slides from phrases extracted from papers [22]; the model learns the saliency of each phrase and the hierarchical relationship between pairs of phrases to make the bullet points and to determine their place in the slide. However, that model was tested on a limited set of only 175 paper-slide pairs.
   Compared with previous works, we use a combination of feature-based and deep neural network methods to score sentences. Then, ILP is applied to build the summary. To transform the summary into a slide format, first-level bullet points are generated using key phrases extracted from the selected sentences.

3     MODEL
The summarization system consists of two steps. The first is to train a model to calculate scores used to identify important sentences. The second is to extract salient sentences under constraints. The model is trained on a corpus of aligned pairs of papers and slides. In the training process, we investigate Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM) networks using pre-trained word embeddings (WE) to represent the semantic features of sentences. This semantic representation is enhanced by combining it with other surface features of the current sentence and its contextual features. The high-level architecture is shown in Figure 1.

Figure 2: Architecture to predict the salience score of a sentence S_i. The S_j (j ≠ i) are contextual sentences. Three types of embeddings are combined as input to an MLP that outputs the score of sentence S_i. The CNN could be replaced with an LSTM or a GRU.

3.1      Sentence Labeling
In this section, we describe how to generate the ground truth by assigning scores to the sentences of a paper given a paper-slide pair. An academic paper D can be represented as a sequence of sentences D = {s_1, s_2, ..., s_n}. Each sentence is represented by the set of its unigrams and bigrams, such as s_i = {t_1, t_2, ..., t_k, t_1 t_2, t_2 t_3, ..., t_{k-1} t_k}, where t_i is the i-th token in the sentence. The corresponding presentation of the document D can be represented as a sequence of slides P = {p_1, p_2, ..., p_m}. The salience score of each sentence is determined by the highest Jaccard similarity of unigrams and bigrams between the sentence and all slides in the corresponding presentation:

    \forall s_i \in D, \quad \mathit{score}_{s_i} = \max_{p_j \in P} \mathrm{Jaccard}(s_i, p_j)    (1)

The Jaccard similarity of the sets s_i and p_j is defined as \mathrm{Jaccard}(s_i, p_j) = \frac{|s_i \cap p_j|}{|s_i \cup p_j|}.
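As a concrete illustration of this labeling step, here is a minimal sketch (our paraphrase, not the authors' released code) that scores each sentence by its best Jaccard overlap with any slide, using the unigram-plus-bigram sets described above.

```python
# Minimal sketch of the ground-truth labeling in Eq. (1): each sentence gets the
# maximum Jaccard similarity between its unigram+bigram set and any slide's set.
def ngram_set(tokens):
    unigrams = set(tokens)
    bigrams = {(a, b) for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def salience_scores(paper_sentences, slides):
    """paper_sentences: list of token lists; slides: list of token lists."""
    slide_sets = [ngram_set(slide) for slide in slides]
    scores = []
    for sent in paper_sentences:
        s = ngram_set(sent)
        scores.append(max((jaccard(s, p) for p in slide_sets), default=0.0))
    return scores

# Toy usage with made-up tokens:
paper = [["we", "propose", "a", "model"], ["results", "are", "promising"]]
slides = [["proposed", "model"], ["promising", "results"]]
print(salience_scores(paper, slides))
```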


   Having labeled the sentences with their scores, we can train different models to learn the scoring function.

3.2     Sentence Ranking
The goal of the sentence ranking module is to train a model f(s_i | \phi) that minimizes the Mean Squared Error (MSE) defined below:

    \forall s_i, \quad 0 \leq f(s_i \mid \phi) \leq 1    (2)

    \mathit{MSE} = \sum_{s_i} \left( f(s_i \mid \phi) - \mathit{score}_{s_i} \right)^2    (3)

in which \phi is the sentence embedding combined with syntactic and contextual features.
   The sentence ranking model consists of three modules. The first models the meaning of a sentence using a deep neural network that embeds the sentence itself into a vector. The second evaluates the capability of the sentence to summarize its contextual sentences. The last extracts a variety of surface features from the sentence structure and its position in the paper. The outputs of all three modules are combined to form the input to a final Multi-Layer Perceptron (MLP) with two hidden layers, which acts as a regressor and predicts the final score of a sentence. Figure 2 depicts the embedding layers used to construct the sentence representation vectors.
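A minimal sketch of this regressor follows, assuming PyTorch and made-up feature dimensions and hidden sizes (the excerpt does not specify them): the three embeddings are concatenated and passed through a two-hidden-layer MLP trained with an MSE loss.

```python
# Illustrative two-hidden-layer MLP regressor over concatenated sentence,
# context, and surface-feature embeddings, trained with the MSE loss of Eq. (3).
# Dimensions and hidden sizes are assumptions, not values reported in the paper.
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    def __init__(self, semantic_dim=300, context_dim=6, surface_dim=20, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(semantic_dim + context_dim + surface_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),   # one way to keep f(s_i | phi) in [0, 1], as Eq. (2) requires
        )

    def forward(self, semantic, context, surface):
        phi = torch.cat([semantic, context, surface], dim=-1)
        return self.mlp(phi).squeeze(-1)

model = SentenceScorer()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step on random tensors standing in for real features/labels.
semantic, context, surface = torch.randn(8, 300), torch.randn(8, 6), torch.randn(8, 20)
target = torch.rand(8)                     # gold salience scores in [0, 1]
loss = loss_fn(model(semantic, context, surface), target)
loss.backward()
optimizer.step()
```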
3.2.1 Semantic Embedding. In this layer, the semantics of a sentence are encoded into a vector. We compare the performance of convolutional and recurrent networks in order to choose the best architecture for semantic embedding.
   Convolutional Neural Network (CNN): In this layer, the semantic representation of a sentence s_i is generated by concatenating embeddings of bigrams. The intuition is that bigrams preserve the sequential information of the text. Convolutional layers and element-wise max-pooling on top of the bigram embeddings capture important patterns in the sentence that match similar patterns in the reference slides. The bigram representations are calculated by concatenating the pre-trained word embeddings of the tokens in the bigram:

    \mathit{bigram}_j = [v_j, v_{j+1}]    (4)

in which v_j and v_{j+1} stand for the current and the next word vectors in sentence s_i.

    V_{\mathit{bigram}_j} = \tanh(\mathit{bigram}_j \cdot W_c + b)    (5)

where the filter W_c \in \mathbb{R}^{2|v_j| \times |v_j|} is applied to the vector of two words to make new features, the activation function is the hyperbolic tangent, and b is a bias term. The sentence embedding is the element-wise maximum over the bigram vectors:

    V_{s_i} = \max_{0 < j < k} V_{\mathit{bigram}_j}    (6)
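As a rough sketch of this bigram encoder (our reading of Eqs. (4)-(6), with an assumed embedding size, not the authors' implementation): each bigram is the concatenation of two word vectors, a shared tanh filter produces bigram features, and element-wise max-pooling over bigram positions yields the sentence vector.

```python
# Sketch of the bigram-CNN semantic embedding of Eqs. (4)-(6): concatenate
# adjacent word vectors, apply a shared tanh filter W_c, then element-wise
# max-pool over bigram positions. Embedding size 300 is an assumption.
import torch
import torch.nn as nn

class BigramCNNEncoder(nn.Module):
    def __init__(self, word_dim=300):
        super().__init__()
        # W_c in R^{2|v| x |v|}, applied to each concatenated bigram, plus bias b.
        self.filter = nn.Linear(2 * word_dim, word_dim)

    def forward(self, word_vectors):
        # word_vectors: (sentence_length, word_dim) pre-trained embeddings.
        bigrams = torch.cat([word_vectors[:-1], word_vectors[1:]], dim=-1)  # Eq. (4)
        bigram_feats = torch.tanh(self.filter(bigrams))                      # Eq. (5)
        return bigram_feats.max(dim=0).values                                # Eq. (6)

encoder = BigramCNNEncoder()
sentence = torch.randn(12, 300)   # stand-in for 12 pre-trained word vectors
v_s = encoder(sentence)           # sentence semantic embedding, shape (300,)
```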
   Recurrent units such as the GRU and LSTM update previous activations instead of replacing the entire activation as a vanilla RNN does.
   LSTM: The LSTM tracks long-term dependencies via input (i_t), forget (f_t), and output (o_t) gates; the input gate regulates how much of the new cell state to keep, the forget gate forgets a portion of the existing memory, and the output gate sends important information in the cell state to the next layers of the network.
   GRU: The GRU operates using a reset gate and an update gate. The reset gate decides how much of the past information to forget, and the update gate helps the model determine how much of the past information needs to be passed to the future.
   The inputs to the recurrent units are pre-trained word vectors, and the unit outputs are combined to build the sentence semantic embeddings.
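For comparison, a minimal recurrent-encoder sketch is given below, assuming a single-layer PyTorch GRU and using the final hidden state as the sentence vector; how the unit outputs are combined is not specified in this excerpt, so that choice is an assumption.

```python
# Sketch of a recurrent semantic encoder: run a GRU over pre-trained word
# vectors and use the final hidden state as the sentence embedding.
# (Using the last hidden state is an assumption, not stated in the paper.)
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    def __init__(self, word_dim=300, hidden_dim=300):
        super().__init__()
        self.gru = nn.GRU(input_size=word_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, word_vectors):
        # word_vectors: (batch, sentence_length, word_dim)
        _, h_n = self.gru(word_vectors)   # h_n: (num_layers, batch, hidden_dim)
        return h_n[-1]                    # (batch, hidden_dim) sentence embeddings

encoder = GRUEncoder()
sentences = torch.randn(4, 12, 300)      # 4 sentences of 12 word vectors each
embeddings = encoder(sentences)          # shape (4, 300)
```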
3.2.2 Context Embedding. The context of a sentence is defined as the sentences before and after the current sentence. Intuitively, a sentence that includes an abstract of its context is more important and should have a higher chance of being selected for the summary of the paper. For instance, some paragraphs end with a sentence that summarizes the whole paragraph; such a sentence probably has a high word overlap with the other sentences of the paragraph. Contextual sentence relations have been used to model attention weights of sentences and then to identify important sentences [17]. The context information is embedded as the following vectors:

    \mathit{prev}_{s_i} = [\cos(V_{s_i}, V_{s_{i-1}}),\ \cos(V_{s_i}, V_{s_{i-2}}),\ \cos(V_{s_i}, V_{s_{i-3}})]^\top, \quad \mathit{next}_{s_i} = [\cos(V_{s_i}, V_{s_{i+1}}),\ \cos(V_{s_i}, V_{s_{i+2}}),\ \cos(V_{s_i}, V_{s_{i+3}})]^\top    (7)
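A minimal sketch of these context features follows, assuming cosine similarity over the sentence vectors from Section 3.2.1 and zero-padding at document boundaries (boundary handling is not specified in the excerpt).

```python
# Sketch of the context embedding of Eq. (7): cosine similarities between a
# sentence vector and the three preceding / three following sentence vectors.
# Boundary handling (zero-padding here) is an assumption.
import torch
import torch.nn.functional as F

def context_features(sentence_vectors, i, window=3):
    """sentence_vectors: (num_sentences, dim); returns the prev/next vectors of Eq. (7)."""
    n = sentence_vectors.size(0)
    v_i = sentence_vectors[i]
    prev, nxt = [], []
    for k in range(1, window + 1):
        prev.append(F.cosine_similarity(v_i, sentence_vectors[i - k], dim=0)
                    if i - k >= 0 else torch.tensor(0.0))
        nxt.append(F.cosine_similarity(v_i, sentence_vectors[i + k], dim=0)
                   if i + k < n else torch.tensor(0.0))
    return torch.stack(prev), torch.stack(nxt)

doc = torch.randn(10, 300)                   # stand-in sentence embeddings
prev_s4, next_s4 = context_features(doc, 4)  # each has shape (3,)
```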
                                                                               The second feature is the relative position of the sentence in the
Tangent and b is bias term.
                                                                            section in units of sentences.
                         Vsi =        max         Vbiдr am j          (6)      The other features #NP, #VP, #S, and “height” are obtained from
                                 0