                     Automatic Slide Generation for Scientific Papers
                                   Athar Sefid                                                                 Jian Wu
                       Pennsylvania State University                                                  Old Dominion University
                            azs5955@psu.edu                                                               jwu@cs.odu.edu

                               Prasenjit Mitra                                                              C. Lee Giles
                       Pennsylvania State University                                               Pennsylvania State University
                           pmitra@ist.psu.edu                                                           giles@ist.psu.edu

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
SciKnow '19, November 19–22, 2019, Marina del Rey, CA

ABSTRACT
We describe our approach for automatically generating presentation slides for scientific papers using deep neural networks. Such slides can give authors a starting point for their slide generation process. Extractive summarization techniques are applied to rank and select important sentences from the original document. Previous work identified important sentences based only on a limited number of features extracted from the position and structure of sentences in the paper. Our method extends previous work by (1) extracting a more comprehensive list of surface features, (2) considering the semantics, or meaning, of the sentence, and (3) using the context around the current sentence to rank sentences. Once the sentences are ranked, salient sentences are selected using Integer Linear Programming (ILP). Our results show the efficacy of our model for summarization and the slide generation task.

CCS CONCEPTS
• Computing methodologies → Neural networks; Classification and regression trees; • Information systems → Document topic models; Retrieval models and ranking.

KEYWORDS
Natural Language Processing, Text Mining, Slide Generation, Summarization, Deep Learning

1    INTRODUCTION
Scientific results are usually disseminated by publications and research presentations. The latter has become a convention in almost all scientific domains since it provides an efficient way for others to grasp the major contribution of a paper. Currently, PowerPoint automates slide formatting and thus significantly reduces the effort needed to make professional and useful slides. However, PowerPoint cannot automatically generate slides based on the content of research papers. Automatically generating slides would not only save presenters' time in preparing presentations; it also provides a way to summarize a paper and can be used to generate summaries for papers that do not have publicly available slides.
   The two major approaches to summarization are abstractive and extractive methods. Abstractive approaches summarize text using words not necessarily appearing in the original document. Extractive summarization focuses on identifying important constituents, usually at the sentence level, and connecting them to generate a text snippet that is usually significantly shorter than the original document. This method has been applied in slide generation tasks, e.g., [6], because (1) the selected sentences have correct grammatical structure and (2) they are scientifically/technically consistent with the original article.
   The two major components of extractive summarization are sentence scoring and sentence selection. The goal is to keep salient sentences while excluding redundant information. Sentence scoring is usually converted to a regression problem. The scores depend on features extracted from the current sentence and its contextual sentences.

Figure 1: Slide Generation flow chart

   The main contributions of this paper are the following.
   (1) To the best of our knowledge, we are the first to use deep neural models to encode sentences and their context as new features in sentence ranking to build slides.
   (2) We combine regression with integer linear programming (ILP) to select salient sentences. Then, noun phrases are extracted from the selected sentences to build the first-level bullet points in slides.


Here, we focus on the textual components. This approach can later be enhanced with visual elements, e.g., figures and tables, in order to make better, more complex slides.

2     RELATED WORK
There has been much work on text summarization using both abstractive and extractive methods. However, automatic generation of slides, which can be seen as a specific form of summarization of scholarly papers, has not been well studied.

2.1     Summarization
Unsupervised Models. Identifying important sentences for generating a limited-length summary can be formalized as an optimization problem, which can be NP-hard. Maximum Marginal Relevance [2] has been used to heuristically select salient sentences while keeping redundancy with the already selected summary as low as possible. Sentence selection can also be cast as a Knapsack problem that maximizes the total score of the selected sentences under a length limit and can be solved with Integer Linear Programming (ILP), an optimization method with linear constraints and a linear objective function [13]. Graph-based summarization methods model a document as a graph in which vertices are sentences and edges encode the similarity between them. One example is TextRank [14], which extends Google's PageRank and selects important sentences that have a high degree of token overlap with other sentences. LexRank [5], a stochastic graph-based method, measures sentence importance by the eigenvector centrality of the sentences in the graph.
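To make the knapsack-style formulation above concrete, the following is a minimal, hypothetical sketch using the PuLP library: sentences with precomputed salience scores are selected to maximize total score under a word-budget constraint. The scores, lengths, and budget are illustrative assumptions, not values from this paper, and the real systems cited here use richer constraint sets.

```python
# Hypothetical knapsack-style ILP for sentence selection (not the exact
# constraint set of any cited system): maximize total salience subject
# to a summary length budget.
import pulp

scores = [0.81, 0.42, 0.67, 0.15]   # assumed per-sentence salience scores
lengths = [22, 15, 30, 9]           # sentence lengths in words (assumed)
budget = 40                         # maximum summary length in words (assumed)

prob = pulp.LpProblem("sentence_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(scores))]

prob += pulp.lpSum(scores[i] * x[i] for i in range(len(scores)))              # objective
prob += pulp.lpSum(lengths[i] * x[i] for i in range(len(scores))) <= budget   # length limit

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(len(scores)) if x[i].value() == 1]
print(selected)
```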
Feature-Based Models. Many traditional extractive summarization approaches use feature-based models; term frequency is an effective feature, and others include the length and position of sentences [18]. A Support Vector Regressor has been applied to score and extract salient sentences [9]. Others have modeled a document as a sequence of sentences and labeled each sentence with 1 or 0, where 1 indicates that the corresponding sentence belongs to the summary and 0 that it does not [19]. Hidden Markov Models (HMM) have been adopted to solve such sequence labeling problems [4].
Deep Neural Network Models. State-of-the-art models now use deep neural networks, which benefit from both the semantics of the sentences and their structure [7, 8]. Zhou et al. used multiple layers of bidirectional Gated Recurrent Units (GRU) to jointly learn the scoring and selection of sentences [23]. This work applies a bi-GRU to make sentence embeddings and then combines the sentence vectors with another bi-GRU layer to make document-level representations.

2.2     Slide Generation
Previously, a method was proposed to generate slides from documents by identifying important topics and then adding their related sentences to the summary [12]. A tool called PPSGen applies a Support Vector Regressor for sentence ranking and then ILP to select important sentences with a set of sophisticated constraints [6]. Another method generates slides from phrases extracted from papers [22]; the model learns the saliency of each phrase and the hierarchical relationship between pairs of phrases to make the bullet points and to determine their place in the slide. However, that model was tested on a limited set of only 175 paper-slide pairs.
   Compared with previous works, we use a combination of feature-based and deep neural network methods to score sentences. Then, ILP is applied to build the summary. To transform the summary into a slide format, first-level bullet points are generated using key phrases extracted from the selected sentences.

3     MODEL
The summarization system consists of two steps. The first is to train a model to calculate scores used to identify important sentences. The second is to extract salient sentences under constraints. The model is trained on a corpus of aligned pairs of papers and slides. In the training process, we investigate Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM) networks using pre-trained word embeddings (WE) to represent the semantic features of sentences. This semantic representation is enhanced by combining it with other surface features of the current sentence and its contextual features. The high-level architecture is shown in Figure 1.

Figure 2: Architecture to predict the salience score of a sentence S_i. The S_j (j ≠ i) are contextual sentences. Three types of embeddings are combined as input to an MLP that outputs the score of sentence S_i. The CNN could be replaced with an LSTM or a GRU.

3.1      Sentence Labeling
In this section, we describe how to generate the ground truth by assigning scores to the sentences of a paper given a paper-slide pair. An academic paper D can be represented as a sequence of sentences D = {s_1, s_2, ..., s_n}. Each sentence is represented by the set of its unigrams and bigrams, such as s_i = {t_1, t_2, ..., t_k, t_1 t_2, t_2 t_3, ..., t_{k-1} t_k}, where t_i is the i-th token in the sentence. The corresponding presentation of the document D can be represented as a sequence of slides P = {p_1, p_2, ..., p_m}. The salience score of each sentence is determined by the highest Jaccard similarity of unigrams and bigrams between the sentence and all slides in the corresponding presentation:

    \forall s_i \in D, \quad \mathit{score}_{s_i} = \max_{p_j \in P} \mathrm{Jaccard}(s_i, p_j)    (1)

The Jaccard similarity of the sets s_i and p_j is defined as \mathrm{Jaccard}(s_i, p_j) = \frac{|s_i \cap p_j|}{|s_i \cup p_j|}.
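As a concrete illustration of this labeling step, here is a minimal sketch (our paraphrase, not the authors' released code) that scores each sentence by its best Jaccard overlap with any slide, using the unigram-plus-bigram sets described above.

```python
# Minimal sketch of the ground-truth labeling in Eq. (1): each sentence gets the
# maximum Jaccard similarity between its unigram+bigram set and any slide's set.
def ngram_set(tokens):
    unigrams = set(tokens)
    bigrams = {(a, b) for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def salience_scores(paper_sentences, slides):
    """paper_sentences: list of token lists; slides: list of token lists."""
    slide_sets = [ngram_set(slide) for slide in slides]
    scores = []
    for sent in paper_sentences:
        s = ngram_set(sent)
        scores.append(max((jaccard(s, p) for p in slide_sets), default=0.0))
    return scores

# Toy usage with made-up tokens:
paper = [["we", "propose", "a", "model"], ["results", "are", "promising"]]
slides = [["proposed", "model"], ["promising", "results"]]
print(salience_scores(paper, slides))
```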


   Having labeled the sentences with their scores, we can train different models to learn the scoring function.

3.2     Sentence Ranking
The goal of the sentence ranking module is to train a model f(s_i | \phi) that minimizes the Mean Squared Error (MSE) defined below:

    \forall s_i, \quad 0 \leq f(s_i \mid \phi) \leq 1    (2)

    \mathit{MSE} = \sum_{s_i} \left( f(s_i \mid \phi) - \mathit{score}_{s_i} \right)^2    (3)

in which \phi is the sentence embedding combined with syntactic and contextual features.
   The sentence ranking model consists of three modules. The first models the meaning of a sentence using a deep neural network that embeds the sentence itself into a vector. The second evaluates the capability of the sentence to summarize its contextual sentences. The last extracts a variety of surface features from the sentence structure and its position in the paper. The outputs of all three modules are combined to form the input to a final Multi-Layer Perceptron (MLP) with two hidden layers, which acts as a regressor and predicts the final score of a sentence. Figure 2 depicts the embedding layers used to construct the sentence representation vectors.
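A minimal sketch of this regressor follows, assuming PyTorch and made-up feature dimensions and hidden sizes (the excerpt does not specify them): the three embeddings are concatenated and passed through a two-hidden-layer MLP trained with an MSE loss.

```python
# Illustrative two-hidden-layer MLP regressor over concatenated sentence,
# context, and surface-feature embeddings, trained with the MSE loss of Eq. (3).
# Dimensions and hidden sizes are assumptions, not values reported in the paper.
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    def __init__(self, semantic_dim=300, context_dim=6, surface_dim=20, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(semantic_dim + context_dim + surface_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),   # one way to keep f(s_i | phi) in [0, 1], as Eq. (2) requires
        )

    def forward(self, semantic, context, surface):
        phi = torch.cat([semantic, context, surface], dim=-1)
        return self.mlp(phi).squeeze(-1)

model = SentenceScorer()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step on random tensors standing in for real features/labels.
semantic, context, surface = torch.randn(8, 300), torch.randn(8, 6), torch.randn(8, 20)
target = torch.rand(8)                     # gold salience scores in [0, 1]
loss = loss_fn(model(semantic, context, surface), target)
loss.backward()
optimizer.step()
```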
3.2.1 Semantic Embedding. In this layer, the semantics of a sentence are encoded into a vector. We compare the performance of convolutional and recurrent networks in order to choose the best architecture for semantic embedding.
   Convolutional Neural Network (CNN): In this layer, the semantic representation of a sentence s_i is generated by concatenating embeddings of bigrams. The intuition is that bigrams preserve the sequential information of the text. Convolutional layers and element-wise max-pooling on top of the bigram embeddings capture important patterns in the sentence that match similar patterns in the reference slides. The bigram representations are calculated by concatenating the pre-trained word embeddings of the tokens in the bigram:

    \mathit{bigram}_j = [v_j, v_{j+1}]    (4)

in which v_j and v_{j+1} stand for the current and the next word vectors in sentence s_i.

    V_{\mathit{bigram}_j} = \tanh(\mathit{bigram}_j \cdot W_c + b)    (5)

where the filter W_c \in \mathbb{R}^{2|v_j| \times |v_j|} is applied to the vector of two words to make new features, the activation function is the hyperbolic tangent, and b is a bias term. The sentence embedding is the element-wise maximum over the bigram vectors:

    V_{s_i} = \max_{0 < j < k} V_{\mathit{bigram}_j}    (6)
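As a rough sketch of this bigram encoder (our reading of Eqs. (4)-(6), with an assumed embedding size, not the authors' implementation): each bigram is the concatenation of two word vectors, a shared tanh filter produces bigram features, and element-wise max-pooling over bigram positions yields the sentence vector.

```python
# Sketch of the bigram-CNN semantic embedding of Eqs. (4)-(6): concatenate
# adjacent word vectors, apply a shared tanh filter W_c, then element-wise
# max-pool over bigram positions. Embedding size 300 is an assumption.
import torch
import torch.nn as nn

class BigramCNNEncoder(nn.Module):
    def __init__(self, word_dim=300):
        super().__init__()
        # W_c in R^{2|v| x |v|}, applied to each concatenated bigram, plus bias b.
        self.filter = nn.Linear(2 * word_dim, word_dim)

    def forward(self, word_vectors):
        # word_vectors: (sentence_length, word_dim) pre-trained embeddings.
        bigrams = torch.cat([word_vectors[:-1], word_vectors[1:]], dim=-1)  # Eq. (4)
        bigram_feats = torch.tanh(self.filter(bigrams))                      # Eq. (5)
        return bigram_feats.max(dim=0).values                                # Eq. (6)

encoder = BigramCNNEncoder()
sentence = torch.randn(12, 300)   # stand-in for 12 pre-trained word vectors
v_s = encoder(sentence)           # sentence semantic embedding, shape (300,)
```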
   Recurrent units such as the GRU and LSTM update previous activations instead of replacing the entire activation as a vanilla RNN does.
   LSTM: The LSTM tracks long-term dependencies via input (i_t), forget (f_t), and output (o_t) gates; the input gate regulates how much of the new cell state to keep, the forget gate forgets a portion of the existing memory, and the output gate sends important information in the cell state to the next layers of the network.
   GRU: The GRU operates using a reset gate and an update gate. The reset gate decides how much of the past information to forget, and the update gate helps the model determine how much of the past information needs to be passed to the future.
   The inputs to the recurrent units are pre-trained word vectors, and the unit outputs are combined to build the sentence semantic embeddings.
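For comparison, a minimal recurrent-encoder sketch is given below, assuming a single-layer PyTorch GRU and using the final hidden state as the sentence vector; how the unit outputs are combined is not specified in this excerpt, so that choice is an assumption.

```python
# Sketch of a recurrent semantic encoder: run a GRU over pre-trained word
# vectors and use the final hidden state as the sentence embedding.
# (Using the last hidden state is an assumption, not stated in the paper.)
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    def __init__(self, word_dim=300, hidden_dim=300):
        super().__init__()
        self.gru = nn.GRU(input_size=word_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, word_vectors):
        # word_vectors: (batch, sentence_length, word_dim)
        _, h_n = self.gru(word_vectors)   # h_n: (num_layers, batch, hidden_dim)
        return h_n[-1]                    # (batch, hidden_dim) sentence embeddings

encoder = GRUEncoder()
sentences = torch.randn(4, 12, 300)      # 4 sentences of 12 word vectors each
embeddings = encoder(sentences)          # shape (4, 300)
```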
3.2.2 Context Embedding. The context of a sentence is defined as the sentences before and after the current sentence. Intuitively, a sentence that includes an abstract of its context is more important and should have a higher chance of being selected for the summary of the paper. For instance, some paragraphs end with a sentence that summarizes the whole paragraph; such a sentence probably has a high word overlap with the other sentences of the paragraph. Contextual sentence relations have been used to model attention weights of sentences and then to identify important sentences [17]. The context information is embedded as the following vectors:

    \mathit{prev}_{s_i} = [\cos(V_{s_i}, V_{s_{i-1}}),\ \cos(V_{s_i}, V_{s_{i-2}}),\ \cos(V_{s_i}, V_{s_{i-3}})]^\top, \quad \mathit{next}_{s_i} = [\cos(V_{s_i}, V_{s_{i+1}}),\ \cos(V_{s_i}, V_{s_{i+2}}),\ \cos(V_{s_i}, V_{s_{i+3}})]^\top    (7)
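A minimal sketch of these context features follows, assuming cosine similarity over the sentence vectors from Section 3.2.1 and zero-padding at document boundaries (boundary handling is not specified in the excerpt).

```python
# Sketch of the context embedding of Eq. (7): cosine similarities between a
# sentence vector and the three preceding / three following sentence vectors.
# Boundary handling (zero-padding here) is an assumption.
import torch
import torch.nn.functional as F

def context_features(sentence_vectors, i, window=3):
    """sentence_vectors: (num_sentences, dim); returns the prev/next vectors of Eq. (7)."""
    n = sentence_vectors.size(0)
    v_i = sentence_vectors[i]
    prev, nxt = [], []
    for k in range(1, window + 1):
        prev.append(F.cosine_similarity(v_i, sentence_vectors[i - k], dim=0)
                    if i - k >= 0 else torch.tensor(0.0))
        nxt.append(F.cosine_similarity(v_i, sentence_vectors[i + k], dim=0)
                   if i + k < n else torch.tensor(0.0))
    return torch.stack(prev), torch.stack(nxt)

doc = torch.randn(10, 300)                   # stand-in sentence embeddings
prev_s4, next_s4 = context_features(doc, 4)  # each has shape (3,)
```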
                                                                               The second feature is the relative position of the sentence in the
Tangent and b is bias term.
                                                                            section in units of sentences.
                         Vsi =        max         Vbiдr am j          (6)      The other features #NP, #VP, #S, and “height” are obtained from
                                 0