           Multi-span Style Extraction for Generative Reading Comprehension

                                   Junjie Yang1,3,4 , Zhuosheng Zhang2,3,4 , Hai Zhao2,3,4*
                         1 SJTU-ParisTech Elite Institute of Technology, Shanghai Jiao Tong University
                2 Department of Computer Science and Engineering, Shanghai Jiao Tong University
     3 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering,
                               Shanghai Jiao Tong University, Shanghai, China
        4 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
                jj-yang@sjtu.edu.cn, zhangzs@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn

                                Abstract
  Generative machine reading comprehension (MRC) requires a model to generate well-formed answers. For this
  type of MRC, the answer generation method is crucial to model performance. However, generative models,
  which are supposed to be the right models for the task, generally perform poorly. At the same time,
  single-span extraction models have proven effective for extractive MRC, where the answer is constrained to
  a single span in the passage. Nevertheless, they tend to produce incomplete answers or introduce redundant
  words when applied to generative MRC. Thus, we extend the single-span extraction method to multi-span,
  proposing a new framework which enables generative MRC to be smoothly solved as multi-span extraction.
  Thorough experiments demonstrate that this novel approach can alleviate the dilemma between generative
  models and single-span models and produce answers with better-formed syntax and semantics.

Figure 1: Example of how a well-formed answer is generated by the multi-span style extraction.

                        Introduction
Machine Reading Comprehension (MRC) is considered a nontrivial challenge in natural language understanding.
Recently, we have seen continuous success in this area, partially benefiting from the release of massive and
well-annotated datasets from both academic (Rajpurkar, Jia, and Liang 2018; Reddy, Chen, and Manning 2019)
and industry (Bajaj et al. 2018; He et al. 2018) communities.

The widely used span-extraction models (Seo et al. 2017; Ohsugi et al. 2019; Lan et al. 2020) formulate the
MRC task as predicting the start and end positions of the answer span inside the given passage. They have
proven effective on tasks that constrain the answer to be an exact span in the passage (Rajpurkar, Jia, and
Liang 2018). However, for generative MRC tasks whose answers are highly abstractive, single-span extraction
based methods easily suffer from incomplete answers or redundant words. Thus, there still exists a large gap
between the performance of single-span extraction baselines and human performance.

In the meantime, we have observed that utilizing multiple spans appearing in the question and passage to
compose the well-formed answer could be a promising way to alleviate these drawbacks. Figure 1 shows how the
mechanism of multi-span style extraction works for an example from the MS MARCO task (Bajaj et al. 2018),
where the well-formed answer cannot simply be extracted as a single span from the input text.

Therefore, in this work, we propose a novel answer generation approach that takes advantage of the
effectiveness of span extraction and the concise spirit of the multi-span style to synthesize the free-form
answer, together with a framework as a whole for multi-passage generative MRC. We call our framework MUSST,
for MUlti-Span STyle extraction. Our framework is also empowered by a well pre-trained language model as the
encoder component of our model; it provides deep understanding of both the input passage and question, and
models the information interaction between them.

* Corresponding author. This paper was partially supported by the National Key Research and Development
Program of China (No. 2017YFB0304100), Key Projects of the National Natural Science Foundation of China
(U1836222 and 61733011), the Huawei-SJTU long term AI project, and Cutting-edge Machine Reading
Comprehension and Language Model.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).

                                      Figure 2: Our framework MUSST


We conduct a series of experiments and the corresponding ablations on the MS MARCO v2.1 dataset.

Our main contributions in this paper can be summarized as follows (the code is publicly available at
https://github.com/chunchiehy/musst):

• We propose a novel multi-span answer annotator to transform the initial well-formed answer into a series
  of spans distributed over the question and passage.

• We generalize the single-span extraction method to the multi-span style by introducing a lightweight but
  powerful answer generator, which supports the extraction of a varying number of answer spans during
  prediction.

• To make better use of the large dataset for the passage ranking task, we propose dynamic sampling during
  the training of the ranker, which selects the passage most likely to entail the answer.

                                                MUSST
In this section, we present our proposed framework, MUSST, for the multi-passage generative MRC task.
Figure 2 depicts the general architecture of our framework, which consists of a passage ranker, a multi-span
answer annotator, and a question-answering module.

Passage ranker

Problem formulation  Given a question Q and a set of k candidate passages P = {P1, P2, ..., Pk}, the passage
ranker is responsible for ranking the passages based on their relevance to the question. In other words, the
model is requested to output the conditional probability distribution P(y|Q, P; θ), where θ denotes the
model parameters and P(y = i|Q, P; θ) denotes the probability that passage Pi can be used to answer
question Q.

Encoder  For each input question and passage pair (Q, Pi), we represent it as a single packed sequence of
length n of the form "[CLS] Q [SEP] Pi [SEP]". We pass the whole sequence into a contextualized encoder to
produce its contextualized representation E ∈ R^{n×h}, where h denotes the hidden size of the Transformer
blocks. Following the fine-tuning strategy of Devlin et al. (2019) for the classification task, we take the
final hidden vector c ∈ R^h corresponding to the first input token ([CLS]) as the input's aggregate
representation. Our encoder also models the interaction between the question and the passage.

Ranker  The ranker scores each passage by its relevance to the question. Given the output c of the encoding
layer, we pass it through a fully connected multi-layer perceptron consisting of two linear transformations
with a tanh activation in between:

    s = softmax(W2 tanh(W1 c + b1) + b2) ∈ R^2,    ui = s0  and  ri = s1

where W1 ∈ R^{h×h}, W2 ∈ R^{2×h}, b1 ∈ R^h and b2 ∈ R^2 are trainable parameters. Here, ri and ui are
respectively the relevance and unrelevance scores for the pair (Q, Pi). The relevance scores are subsequently
normalized across all candidate passages of the same question:

    r̂i = exp(ri) / Σ_{j=1..k} exp(rj)

Here, r̂i indicates the probability that passage Pi entails the answer to question Q.
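The ranking head is small enough to sketch directly. Below is a minimal, hypothetical PyTorch sketch of how
such a head could sit on top of a Transformers encoder; the names (PassageRanker, "albert-base-v2") and the
exact packing call are our own illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PassageRanker(nn.Module):
    """Scores one (question, passage) pair with the two-layer MLP head described above."""
    def __init__(self, encoder_name="albert-base-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        h = self.encoder.config.hidden_size
        # Two linear transformations with a tanh in between -> (unrelevance, relevance).
        self.scorer = nn.Sequential(nn.Linear(h, h), nn.Tanh(), nn.Linear(h, 2))

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        c = out.last_hidden_state[:, 0]            # [CLS] aggregate representation
        s = torch.softmax(self.scorer(c), dim=-1)  # s_0 = u_i, s_1 = r_i
        return s[:, 1], s[:, 0]                    # relevance r_i, unrelevance u_i

# Pack "[CLS] Q [SEP] P_i [SEP]" for every candidate passage of one question,
# then normalize the relevance scores across the candidates.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
question = "how long should a central air conditioner last"
passages = ["10 to 20 years - sometimes longer. ...", "An unrelated passage."]
batch = tokenizer([question] * len(passages), passages, padding=True,
                  truncation=True, max_length=256, return_tensors="pt")
ranker = PassageRanker()
r, u = ranker(**batch)
r_hat = torch.softmax(r, dim=0)   # probability that each passage entails the answer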
Training  We define a question-passage pair in which the passage entails the answer to the question as a
positive training sample; the positive passage is denoted P+. During the training phase, we adopt negative
sampling with one negative sample: for each positive instance (Q, P+), we randomly sample a negative passage
P− from the unselected passages of the same question. The model is trained by minimizing the following cost
function:

    J(θ) = − (1/T) Σ_{t=1..T} [ log r(Qt, Pt+) + log u(Qt, Pt−) ]

where T is the number of questions in the training set, r(Qt, Pt+) denotes the relevance score of (Qt, Pt+)
and u(Qt, Pt−) denotes the unrelevance score of (Qt, Pt−).

Moreover, motivated by Liu et al. (2019), we resample the negative training instances at the beginning of
each training epoch, so that the same training pattern is not reused for a question in every epoch. We name
this dynamic sampling.
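A minimal sketch of the dynamic sampling idea, assuming a simple in-memory representation of the training
examples; the class and field names (RankerDataset, "others") are illustrative and not taken from the
released code.

import random

class RankerDataset:
    """One positive pair plus one freshly drawn negative passage per question."""
    def __init__(self, examples):
        # Each example: {"question": str, "positive": str, "others": [str, ...]}
        self.examples = examples
        self.resample_negatives()

    def resample_negatives(self):
        """Dynamic sampling: call at the beginning of every training epoch."""
        self.pairs = []
        for ex in self.examples:
            if ex["others"]:
                negative = random.choice(ex["others"])
                self.pairs.append((ex["question"], ex["positive"], negative))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

# Training loop skeleton: maximize the relevance score of the positive pair
# and the unrelevance score of the sampled negative pair.
# for epoch in range(num_epochs):
#     dataset.resample_negatives()
#     for question, pos, neg in dataset:
#         r_pos, _ = ranker(**tokenize(question, pos))   # tokenize(...) is hypothetical
#         _, u_neg = ranker(**tokenize(question, neg))
#         loss = -(torch.log(r_pos) + torch.log(u_neg)).mean()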
Syntactic multi-span answer annotator

In this section, we introduce our syntactic multi-span answer annotator. Before training our
question-answering module, we need to extract non-overlapping spans from the question and passage based on
the original answer in the training dataset. Our annotator is responsible for transforming the original
answer phrase into multiple spans distributed over the question and passage, subject to syntactic
constraints. The attempt to extract the answer spans syntactically is motivated by our intuition that human
editors compose the original answer in an analogous way.

As shown in the middle of Figure 2, we transform the answer phrase into a parsing tree and traverse the
parsing tree in a depth-first search (DFS) manner. At each visit of a subtree, we check whether the span
represented by the subtree appears in the question or passage text, and we obtain a span list after
traversing the whole parsing tree. However, in some cases the original answer still cannot be perfectly
composed from the words of the input text, even in a multi-span style. We discard these bad samples by
comparing their edit distances with a threshold value that is set beforehand.

An important final step is to prune the answer span list. The pruning procedure sticks to the following
principle: if two spans adjacent in the list are contiguous in the original text, we join them together.
Pruning heavily reduces the number of spans needed to recover the original answer phrase. The complete
procedure of our annotator is described in Algorithm 1.

Algorithm 1 Syntactic Multi-span Answer Annotation

Input: Question Q = {q1, q2, ..., qm}, passage P = {p1, p2, ..., pn} and gold answer A = {a1, a2, ..., ak}
Parameter: Edit distance threshold dmax
Output: A list of start and end positions of answer spans in the question and passage
 1: Let M be an empty list
 2: Pack question Q and passage P into a single sequence C
 3: Get the syntactic parsing tree T of answer A with a constituency parser
 4: Let S be the stack of subtrees to be traversed
 5: Initialize S with the root R of the tree T
 6: while S is not empty do
 7:     V = POP(S)
 8:     Get the list of all leaves of subtree V: L = {l1, l2, ..., ln}
 9:     if L is a sublist of C then
10:         Get the start index s and end index e of L in C with the Knuth-Morris-Pratt pattern searching algorithm
11:         Add (s, e) to the span position list M
12:     else
13:         for each child subtree U of V (from right to left) do
14:             PUSH(S, U)
15:         end for
16:     end if
17: end while
18: Reconstruct answer A′ from the span position list M
19: d = EDITDISTANCE(A, A′)
20: if d > dmax then
21:     Empty the list M
22: end if
23: M* = PRUNING(M)
24: return M*
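To make Algorithm 1 concrete, here is a hedged Python sketch of the annotation step. It assumes token-level
inputs and an nltk.Tree produced by a constituency parser (the paper uses Stanford CoreNLP together with
NLTK); the helper names are ours, and a naive sublist search stands in for Knuth-Morris-Pratt.

from nltk import edit_distance
from nltk.tree import Tree

def find_sublist(needle, haystack):
    """Return the start index of `needle` inside `haystack`, or -1 if absent."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

def annotate_spans(question_tokens, passage_tokens, answer_tree, d_max):
    context = question_tokens + passage_tokens           # packed sequence C
    spans, stack = [], [answer_tree]                      # DFS over subtrees
    while stack:
        subtree = stack.pop()
        leaves = subtree.leaves() if isinstance(subtree, Tree) else [subtree]
        start = find_sublist(leaves, context)
        if start >= 0:                                    # subtree text occurs in C
            spans.append((start, start + len(leaves) - 1))
        elif isinstance(subtree, Tree):
            stack.extend(reversed(list(subtree)))         # children visited left to right
    # Discard samples whose reconstruction strays too far from the gold answer.
    reconstructed = [tok for s, e in spans for tok in context[s:e + 1]]
    if edit_distance(" ".join(reconstructed), " ".join(answer_tree.leaves())) > d_max:
        return []
    # Pruning: merge spans that are adjacent in the list and contiguous in C.
    pruned = []
    for s, e in spans:
        if pruned and pruned[-1][1] + 1 == s:
            pruned[-1] = (pruned[-1][0], e)
        else:
            pruned.append((s, e))
    return pruned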
Question-answering module

Problem formulation  Given a question Q and a passage P, the question-answering module is requested to
answer the question based on the information provided by the passage. In other words, the model outputs the
conditional probability distribution P(y|Q, P), where P(y = A|Q, P) denotes the probability that A is the
answer.

Question-passage reader  The architecture of the reader is analogous to the encoder module of the ranker in
the previous section, where we take a pre-trained language model as the encoder. But instead of using only
the aggregate representation, we pass the whole output of the last layer on to predict the answer spans, as
follows:

    M = Encoder(Q, P) ∈ R^{h×n}

where n is the length of the input token sequence and h is the hidden size of the encoder.

Multi-span style answer generator  Our answer generator is responsible for composing the answer in a
multi-span extraction style. Let m be the number of spans to be extracted.

We treat each single-span prediction as a single-span extraction MRC task. Following Lan et al. (2020), we
adopt a linear layer to predict the start and end positions of the span in the input sequence. It is worth
noticing that our model is also able to predict answer spans from the question. The probability distribution
of the j-th span's start position over the input tokens is obtained by

    p̂^{j,start} = softmax(W^s_j M + b^s_j)

where W^s_j ∈ R^{1×h} and b^s_j ∈ R are trainable parameters and p̂^{j,start}_k denotes the probability of
token k being the start of answer span j. The end position distribution of answer span j is obtained by the
analogous formula

    p̂^{j,end} = softmax(W^e_j M + b^e_j)

Training and inference  During training, we add a special virtual span, whose start and end positions both
equal the length of the input sequence, at the end of the annotated answer span list. This approach enables
our model to generate a varying number of answer spans during prediction, with the virtual span serving as a
stop symbol. The cost function is defined as follows:

    J(θ) = − (1/T) Σ_{t=1..T} Σ_{j=1..m_t} [ log p̂^{j,start}_{y_t^{j,start}} + log p̂^{j,end}_{y_t^{j,end}} ]

where T is the number of training samples, m_t is the number of answer spans for sample t, and y_t^{j,start}
and y_t^{j,end} are the true start and end positions of the t-th sample's j-th span.

During inference, at each time step j, we choose the answer span (k, l), k < l, with the maximum value of
p̂^{j,start}_k · p̂^{j,end}_l. The decoding procedure terminates when the stop span is predicted. Sometimes
the model tends to generate the same spans repeatedly. To alleviate this repetition problem, at each
prediction time step j we mask out the span positions predicted at previous time steps (< j) when computing
the probability distributions of the new start and end positions. Since the masking depends on the
previously predicted spans, we call it conditional masking. The extracted spans are finally joined together
to form the final answer phrase.
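The span heads and the conditional-masking decoder are simple enough to sketch. The following hypothetical
PyTorch module mirrors the description above (one start/end linear head per prediction step, a virtual stop
span, and masking of previously selected positions); the names and details are our own assumptions, not the
authors' released code.

import torch
import torch.nn as nn

class MultiSpanGenerator(nn.Module):
    def __init__(self, hidden_size, max_spans=9):
        super().__init__()
        self.max_spans = max_spans
        # One (start, end) linear head per prediction step j.
        self.start_heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(max_spans)])
        self.end_heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(max_spans)])

    @torch.no_grad()
    def decode(self, M):
        """Greedy decoding with conditional masking.

        M: encoder output of shape (seq_len, hidden_size); we assume one extra
        row has been appended at index seq_len - 1 to stand for the virtual stop span.
        """
        seq_len = M.size(0)
        used = torch.zeros(seq_len, dtype=torch.bool)
        spans = []
        for j in range(self.max_spans):
            start_logits = self.start_heads[j](M).squeeze(-1)
            end_logits = self.end_heads[j](M).squeeze(-1)
            # Conditional masking: forbid positions inside previously chosen spans.
            start_logits[used] = float("-inf")
            end_logits[used] = float("-inf")
            p_start = torch.softmax(start_logits, dim=-1)
            p_end = torch.softmax(end_logits, dim=-1)
            # Choose (k, l) maximizing p_start[k] * p_end[l]; triu keeps k <= l
            # (single-token spans are allowed in this sketch).
            joint = torch.triu(p_start.unsqueeze(1) * p_end.unsqueeze(0))
            flat = int(torch.argmax(joint))
            k, l = flat // seq_len, flat % seq_len
            if k == seq_len - 1:              # virtual stop span predicted
                break
            spans.append((k, l))
            used[k:l + 1] = True
        return spans  # later joined into the final answer phrase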
                       Experiments
Dataset
We evaluate our framework on MS MARCO v2.1 (Bajaj et al. 2018), a large-scale open-domain generative MRC
task; the datasets can be obtained from the official site (https://microsoft.github.io/msmarco/). MS MARCO
v2.1 provides two MRC tasks: Question Answering (QA) and Natural Language Generation (NLG). The statistics
of the corresponding datasets are presented in Table 1. Both datasets consist of questions sampled from
Bing's search logs, and each question is accompanied by an average of ten passages that may contain the
answers. QA and NLG are subsets of ALL, which also contains the unanswerable questions.

Distinct from the QA task, the NLG task requires the model to provide a well-formed answer that could be
read and understood by a natural speaker without any additional context. Therefore NLG-style answers are
more abstract than QA-style answers. Table 1 also shows the percentage of examples where the answer can be
extracted as a single span from the gold passage. Unsurprisingly, the answers from the QA set are much more
likely to match a span in the passage than the ones in the NLG set. Moreover, Nishida et al. (2019) state
that the QA task prefers more concise answers than the NLG task, averaging 13.1 words versus 16.6 words.
Therefore, the NLG set is more suitable for evaluating model performance on generative MRC.

BLEU-1 (Papineni et al. 2002) and ROUGE-L (Lin 2004) are adopted as the official evaluation metrics (the
official evaluation scripts can be found at
https://github.com/microsoft/MSMARCO-Question-Answering/tree/master/Evaluation), while the official
leaderboard chooses ROUGE-L as the main metric. In the meantime, we use Mean Average Precision (MAP) and
Mean Reciprocal Rank (MRR) for our ranker.

  Dataset    Train                Dev                Test
  ALL        808,731              101,093            101,092
  QA         503,370 (63.39%)     55,636 (45.40%)    –
  NLG        153,725 (12.57%)     12,467 (24.99%)    –

Table 1: Statistics of the MS MARCO v2.1 dataset. The numbers in parentheses indicate the percentage of
examples whose answer is a single span in the gold passage.

Baseline models
We compare our MUSST with the following baseline models: single-span extraction and seq2seq. For the
single-span extraction baseline, we employ the model for the SQuAD dataset from ALBERT (Lan et al. 2020);
this model is trained only with samples where the answer is a single span in the passage. In the meantime,
we adopt the Transformer model from Vaswani et al. (2017) as our seq2seq baseline. For a fair comparison,
the baseline models share the same passage ranker as MUSST.

Implementation details
For the multi-span answer annotation, we use the constituency parser from Stanford CoreNLP (Manning et al.
2014). The NLTK package (https://www.nltk.org) is also used to implement our annotator. The maximum edit
distance between the answer reconstructed from the annotated spans and the original answer is 32 and 8 for
the NLG and QA training sets, respectively.

The ranker and question-answering module of MUSST are implemented with PyTorch (https://pytorch.org) and the
Transformers package (https://github.com/huggingface/transformers). We adopt ALBERT (Lan et al. 2020) as the
encoder in our models and initialize it with the pre-trained weights before fine-tuning. We choose
ALBERT-base as the encoder of the passage ranker and ALBERT-xlarge for the question-answering module.

Following Lan et al. (2020), we use SentencePiece (Kudo and Richardson 2018) to tokenize our inputs with a
vocabulary size of 30,000. We adopt the Adam optimizer (Kingma and Ba 2015) to minimize the cost function,
and use two types of regularization during training: dropout and L2 weight decay. Hyperparameter details for
training the different modules of our framework are presented in Table 2. MUSST-NLG and MUSST-QA are trained
on the NLG and QA subsets, respectively, and the maximum number of spans for them is set to 9 and 5,
respectively.
We trained the passage ranker and the question-answering module of MUSST-NLG on a machine with four Tesla
P40 GPUs, while the question-answering module of MUSST-QA is trained with eight GeForce GTX 1080 Ti GPUs. It
takes roughly 9 hours to train the passage ranker; for the question-answering modules of MUSST-NLG and
MUSST-QA, the training time is about 10 hours and 17 hours, respectively.

The single-span baseline is implemented with the same packages as MUSST, while the seq2seq baseline is
implemented with Fairseq (Ott et al. 2019).

  Hyperparameter            Ranker    MUSST-QA    MUSST-NLG
  Learning rate             1e-5      3e-5        3e-5
  Learning rate decay       Linear    Linear      Linear
  Training epochs           3         3           5
  Warmup rate               0.1       0.1         0.1
  Adam ε                    1e-6      1e-6        1e-6
  Adam β1                   0.9       0.9         0.9
  Adam β2                   0.999     0.999       0.999
  MSN                       256       256         256
  Batch size                128       32          32
  Encoder dropout rate      0         0           0
  Classifier dropout rate   0.1       0.1         0.1
  Weight decay              0.01      0.01        0.01

Table 2: Training hyperparameters of the different modules of MUSST on the MS MARCO v2.1 dataset. Here,
MUSST-QA and MUSST-NLG refer to their question-answering modules. MSN means maximum sequence length.

Results

  Model          QA                      NLG
                 ROUGE-L    BLEU-1       ROUGE-L    BLEU-1
  Single-span    47.96      50.22        53.10      49.08
  Seq2seq        –          –            56.42      53.89
  MUSST-QA       48.44      49.54        –          –
  MUSST-NLG      –          –            66.24      64.23

Table 3: Performance comparison with our baselines on the QA and NLG development sets. Here, we use the same
single ranker for MUSST and the baselines.

Table 3 shows the results of our single model and the baseline models on the QA and NLG development sets.
MUSST significantly outperforms the baselines, including the generative seq2seq model, on the NLG set in
terms of both ROUGE-L and BLEU-1. Even on the QA set, our model yields better results in terms of ROUGE-L.
Table 4 compares our model with the competing models on the leaderboard. Although our model utilizes only a
standalone classifier for passage ranking, multi-span style extraction still lets it rival state-of-the-art
approaches.

                                      Analysis and discussions
Effect of maximum number of spans
Figure 3 presents the distribution of span numbers, for samples with edit distance less than 4, over the QA
and NLG training sets after the annotation procedure. Most QA-style answers consist of only one span, while
the NLG-style answers are distributed more uniformly over the range [1, 9].

Figure 3: Distribution of training samples with edit distance less than 4 over the number of annotated
answer spans. For better illustration, we filter out the samples that include more than 9 spans.

To better understand the effect of the maximum number of spans to be generated by the answer generator, we
let it vary in the range [2, 12] and conduct experiments on the NLG set with our best single passage ranker.
The edit distance threshold is set to 8. The results are presented in Figure 4. Generally, increasing the
maximum number of spans augments the token coverage rate and thus yields better results, but the gain
becomes less significant once the maximum number of spans is already large enough. From Figure 4, we can see
that the results vary imperceptibly once the maximum number of spans reaches 5. However, since each span
only introduces about 4k parameters, which is negligible compared with the encoder (60M), we still choose
the maximum number to be 9, which corresponds to the best performance on the development set.

Figure 4: Effect of maximum number of spans.
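As a rough check on the "4k parameters per span" figure above (assuming the ALBERT-xlarge encoder with
hidden size h = 2048), each additional prediction step adds only one start head (W^s_j, b^s_j) and one end
head (W^e_j, b^e_j):

    2 × (h + 1) = 2 × (2048 + 1) = 4098 ≈ 4k parameters,

which is indeed negligible next to the roughly 60M parameters of the encoder.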
  Model                      Answer Generation    Ranking                    NLG Task         QA Task          Overall
                                                                             R-L     B-1      R-L     B-1      Average
  Human                      –                    –                          63.2    53.0     53.9    48.5     54.65
  Unpublished
  PALM                       Unknown                                         49.8    49.9     51.8    50.7     50.55
  Multi-doc Enriched BERT    Unknown                                         32.5    37.7     54.0    56.5     45.18
  Published
  BiDAF (a) ♠                Single-span          Confidence score           16.9     9.3     24.0    10.6     15.20
  ConZNet (b) ♠              Pointer-Generator    Unknown                    42.1    38.6      –       –         –
  VNET (c) ♠                 Single-span          Answer verification        48.4    46.8     51.6    54.3     50.28
  Deep Cascade QA (d) ♠      Single-span          Cascade                    35.1    37.4     52.0    54.6     44.78
  Masque QA (e) †            Pointer-Generator    Joint trained classifier   28.5    39.9     52.2    43.7     41.08
  Masque NLG (e) †           Pointer-Generator    Joint trained classifier   49.6    50.1     48.9    48.8     49.35
  MUSST-NLG †                Multi-span           Standalone classifier      48.0    45.8     49.0    51.6     48.60

Table 4: The performance of our framework and competing models on the MS MARCO v2.1 test set. All the
results presented here reflect the MS MARCO leaderboard (microsoft.github.io/msmarco/) as of 28 May 2020.
♠ refers to a model whose results are not reported in the original published paper; BiDAF for MS MARCO is
implemented by the official MS MARCO team. † refers to an ensemble submission; whether the other competing
models are ensembles is unclear. (a) Seo et al. (2017); (b) Indurthi et al. (2018); (c) Wang et al. (2018b);
(d) Yan et al. (2019); (e) Nishida et al. (2019).


Ablation study on model design choice
We perform ablation experiments that quantify the individual contribution of the design choices of MUSST.
Table 5 shows the results on the NLG development set. Both pruning and conditional masking contribute to
model performance: pruning helps the model converge more easily by reducing the number of spans, while
conditional masking generates better answers without suffering from the repetition problem. We also observe
that using the gold passage can significantly improve question answering, which shows there is still large
room for improvement in the passage ranker.

  Model                       ROUGE-L    BLEU-1
  MUSST                       66.24      64.23
  w/o pruning                 64.66      60.36
  w/o conditional masking     65.50      64.31
  MUSST w/ gold passage       75.39      74.41

Table 5: Ablation study on the NLG development set.

Quality of multi-span answer annotator
On the NLG development set, we evaluate the answers generated by our syntactic multi-span annotator. The
results show that our annotated answers obtain 89.35 BLEU-1 and 90.19 ROUGE-L with the gold passages, which
demonstrates the effectiveness of our annotator. For MUSST, the corresponding results are 74.41 and 75.39
(Table 5), so there is still much room for improvement with respect to the question-answering module.

Effect of edit distance threshold
Figure 5 shows the results of MUSST on the NLG development set for various edit distance thresholds.
Interestingly, it indicates that BLEU-1 is impacted more heavily by the variation of the edit distance
threshold than ROUGE-L. Setting the edit distance threshold too large may also damage model performance by
introducing too many incomplete samples.

Figure 5: Effect of edit distance threshold.

Effect of encoder size
Table 6 presents experimental results with ALBERT encoders of various sizes. Unsurprisingly, the model
yields stronger results as the encoder gets larger.
  Encoder          Parameters    ROUGE-L    BLEU-1
  ALBERT-base      12M           62.03      60.48
  ALBERT-large     18M           64.93      61.67
  ALBERT-xlarge    60M           66.24      64.23

Table 6: Effect of ALBERT encoder size.

Performance of the ranker
Table 7 presents our ranker performance in terms of MAP and MRR. The results show that dynamic sampling
leads to slightly better results.

  Model                     Training set    MAP      MRR
  Bing (initial ranking)    –               34.62    35.00
  MUSST (single)            QA              71.10    71.56
  w/o dynamic sampling      QA              70.82    71.26

Table 7: The performance of the ranker with various configurations on the QA development set.

Case study
To get an intuitive view of the prediction ability of MUSST, we show a prediction example from MS MARCO v2.1
for the baseline and MUSST in Table 8. The comparison indicates that our model effectively extracts useful
spans, yielding a more complete answer that can be understood independently of the question and passage
context.

  Question: how long should a central air conditioner last
  Selected Passage: 10 to 20 years - sometimes longer. You should have a service tech come out once a year
  for a tune up. You wouldn't run your car without regular maintenance and tune ups and you shouldn't run
  your a/c that way either - if you want it to last as long as possible. Source(s): 20 years working for a
  major manufacturer of central heating and air conditioning.
  Reference Answer: A Central air conditioner lasts for in between 10 and 20 years. / A central air
  conditioner should last for 10 to 20 years.
  Prediction (Baseline): 10 to 20 years.
  Prediction (MUSST): a central air conditioner should last for 10 to 20 years.

Table 8: A prediction example from the baseline and MUSST. The underlined texts are the spans predicted by
our model to compose the final answer phrase.
                                          Related work
Generative MRC
Generative MRC is considered a more challenging task where answers are free-form human-generated text. More
recently, we have seen an emerging wave of generative MRC tasks, including MS MARCO (Bajaj et al. 2018),
NarrativeQA (Kočiský et al. 2018), DuReader (He et al. 2018) and CoQA (Reddy, Chen, and Manning 2019).

The earliest approaches tried to generate the answer in a single-span extractive way (Tay et al. 2018; Tay,
Luu, and Hui 2018; Wang et al. 2018b; Yan et al. 2019; Ohsugi et al. 2019). Models using a single-span
extractive method are effective for datasets where the abstractive behavior of answers consists mostly of
small modifications to spans in the context (Ohsugi et al. 2019; Yatskar 2019), whereas for datasets with
deeply abstractive answers, this method fails to yield promising results. The first attempts to generate the
answer in a generative way applied an RNN-based seq2seq attentional model to synthesize the answer, such as
S-NET (Tan et al. 2018); seq2seq learning was first introduced by Sutskever, Vinyals, and Le (2014) for
machine translation. The most recent models adopt a hybrid neural network, the Pointer-Generator (See, Liu,
and Manning 2017), to generate the answer, such as ConZNet (Indurthi et al. 2018), MHPGM (Bauer, Wang, and
Bansal 2018) and Masque (Nishida et al. 2019). The Pointer-Generator was first proposed for abstractive text
summarization; it can copy words from the source via a pointer network while retaining the ability to
produce novel words through the generator. Different from ConZNet and MHPGM, Masque adopts a
Transformer-based (Vaswani et al. 2017) Pointer-Generator, while the previous ones utilize GRU (Cho et al.
2014) or LSTM (Hochreiter and Schmidhuber 1997).

Multi-passage MRC
For each question-answer pair, a multi-passage MRC dataset contains more than one passage as the reading
context; examples include SearchQA (Dunn et al. 2017), TriviaQA (Joshi et al. 2017), MS MARCO, and DuReader.

Existing approaches designed specifically for multi-passage MRC can be classified into two categories:
pipeline and end-to-end. Pipeline-based models (Chen et al. 2017; Wang et al. 2018a; Clark and Gardner 2018)
adopt a ranker to first rank all the passages based on their relevance to the question and then utilize a
question-answering module to read the selected passages. The ranker can be based on traditional information
retrieval methods (BM25 or TF-IDF) or employ a neural re-ranking model. End-to-end models (Wang et al.
2018b; Tan et al. 2018; Nishida et al. 2019) read all the provided passages at the same time and produce for
each passage a candidate answer with a score, which is subsequently compared among passages to find the
final answer. Passage ranking and answer prediction are usually done jointly as multi-task learning. More
recently, Yan et al. (2019) proposed a cascade learning model to balance the effectiveness and efficiency of
the two approaches mentioned above.
Pre-trained language model in MRC
Employing pre-trained language models has become common practice for tackling MRC tasks (Zhang, Zhao, and
Wang 2020). The appearance of more elaborate architectures, larger corpora, and better-designed pre-training
objectives has sped up the achievement of new state-of-the-art results in MRC (Devlin et al. 2019; Liu et
al. 2019; Yang et al. 2019; Lan et al. 2020). Moreover, Glass et al. (2019) adopt span selection, an MRC
task, as an auxiliary pre-training task. Another mainstream line of research attempts to drive the
improvements during fine-tuning, which includes integrating better verification strategies for unanswerable
questions (Zhang, Yang, and Zhao 2020), incorporating explicit linguistic features (Zhang et al. 2020b,c),
leveraging external knowledge for commonsense reasoning (Lin et al. 2019) or enhancing the matching network
for multi-choice MRC (Zhang et al. 2020a; Zhu, Zhao, and Li 2020). In addition, Hu et al. (2019) introduced
multi-span extraction to obtain the top-k most likely spans for multi-type MRC. However, different from our
work, this method is more suitable for predicting a set of independent answer spans than for generating a
complete sentence.

                                           Conclusion
In this work, we present a novel solution to generative MRC, the multi-span style extraction framework
MUSST, and show that it is capable of alleviating the problems of generating incomplete answers or
introducing redundant words encountered by single-span extraction models. We apply our model to a
challenging generative MRC dataset, MS MARCO v2.1, and significantly outperform the single-span extraction
baseline. This work indicates a new research line for generative MRC in addition to the two existing
methods, single-span extraction and seq2seq generation. With the support of only a standalone ranking
classifier, our proposed method still gives overall performance approaching the state of the art, showing
great potential.
                                           References
Bajaj, P.; Campos, D.; Craswell, N.; Deng, L.; Gao, J.; Liu, X.; Majumder, R.; McNamara, A.; Mitra, B.; Nguyen, T.; Rosenberg, M.; Song, X.; Stoica, A.; Tiwary, S.; and Wang, T. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268.
Bauer, L.; Wang, Y.; and Bansal, M. 2018. Commonsense for Generative Multi-Hop Question Answering Tasks. In Empirical Methods in Natural Language Processing (EMNLP), 4220–4230.
Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Association for Computational Linguistics (ACL), 1870–1879.
Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
Clark, C.; and Gardner, M. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Association for Computational Linguistics (ACL), 845–855.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 4171–4186.
Dunn, M.; Sagun, L.; Higgins, M.; Guney, V. U.; Cirik, V.; and Cho, K. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv preprint arXiv:1704.05179.
Glass, M.; Gliozzo, A.; Chakravarti, R.; Ferritto, A.; Pan, L.; Bhargav, G. P. S.; Garg, D.; and Sil, A. 2019. Span Selection Pre-training for Question Answering. arXiv preprint arXiv:1909.04120.
He, W.; Liu, K.; Liu, J.; Lyu, Y.; Zhao, S.; Xiao, X.; Liu, Y.; Wang, Y.; Wu, H.; She, Q.; Liu, X.; Wu, T.; and Wang, H. 2018. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. In Proceedings of the Workshop on Machine Reading for Question Answering, 37–46.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735–1780.
Hu, M.; Peng, Y.; Huang, Z.; and Li, D. 2019. A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1596–1606.
Indurthi, S. R.; Yu, S.; Back, S.; and Cuayáhuitl, H. 2018. Cut to the Chase: A Context Zoom-in Network for Reading Comprehension. In Empirical Methods in Natural Language Processing (EMNLP), 570–575.
Joshi, M.; Choi, E.; Weld, D.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Association for Computational Linguistics (ACL), 1601–1611.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
Kočiský, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K. M.; Melis, G.; and Grefenstette, E. 2018. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics (TACL) 6: 317–328.
Kudo, T.; and Richardson, J. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 66–71.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations (ICLR).
Lin, B. Y.; Chen, X.; Chen, J.; and Ren, X. 2019. KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2829–2839.
Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S. J.; and McClosky, D. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, 55–60.
Nishida, K.; Saito, I.; Nishida, K.; Shinoda, K.; Otsuka, A.; Asano, H.; and Tomita, J. 2019. Multi-style Generative Reading Comprehension. In Association for Computational Linguistics (ACL), 2273–2284.
Ohsugi, Y.; Saito, I.; Nishida, K.; Asano, H.; and Tomita, J. 2019. A Simple but Effective Method to Incorporate Multi-turn Context with BERT for Conversational Machine Comprehension. In Proceedings of the First Workshop on NLP for Conversational AI, 11–17.
Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Association for Computational Linguistics (ACL), 311–318.
Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Association for Computational Linguistics (ACL), 784–789.
Reddy, S.; Chen, D.; and Manning, C. D. 2019. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics (TACL) 7: 249–266.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Association for Computational Linguistics (ACL), 1073–1083.
Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional Attention Flow for Machine Comprehension. In International Conference on Learning Representations (ICLR).
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 3104–3112.
Tan, C.; Wei, F.; Yang, N.; Du, B.; Lv, W.; and Zhou, M. 2018. S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence.
Tay, Y.; Luu, A. T.; and Hui, S. C. 2018. Multi-Granular Sequence Encoding via Dilated Compositional Units for Reading Comprehension. In Empirical Methods in Natural Language Processing (EMNLP), 2141–2151.
Tay, Y.; Luu, A. T.; Hui, S. C.; and Su, J. 2018. Densely Connected Attention Propagation for Reading Comprehension. In Advances in Neural Information Processing Systems (NIPS), 4906–4917.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS), 5998–6008.
Wang, S.; Yu, M.; Guo, X.; Wang, Z.; Klinger, T.; Zhang, W.; Chang, S.; Tesauro, G.; Zhou, B.; and Jiang, J. 2018a. R3: Reinforced Ranker-Reader for Open-Domain Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence.
Wang, Y.; Liu, K.; Liu, J.; He, W.; Lyu, Y.; Wu, H.; Li, S.; and Wang, H. 2018b. Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. In Association for Computational Linguistics (ACL), 1918–1927.
Yan, M.; Xia, J.; Wu, C.; Bi, B.; Zhao, Z.; Zhang, J.; Si, L.; Wang, R.; Wang, W.; and Chen, H. 2019. A Deep Cascade Model for Multi-Document Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 7354–7361.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems (NIPS), 5754–5764.
Yatskar, M. 2019. A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2318–2323.
Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2020a. DCMN+: Dual co-matching network for multi-choice reading comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 9563–9570.
Zhang, Z.; Wu, Y.; Zhao, H.; Li, Z.; Zhang, S.; Zhou, X.; and Zhou, X. 2020b. Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 9628–9635.
Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; and Wang, R. 2020c. SG-Net: Syntax-Guided Machine Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence.
Zhang, Z.; Yang, J.; and Zhao, H. 2020. Retrospective Reader for Machine Reading Comprehension. arXiv preprint arXiv:2001.09694.
Zhang, Z.; Zhao, H.; and Wang, R. 2020. Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond. arXiv preprint arXiv:2005.06249.
Zhu, P.; Zhao, H.; and Li, X. 2020. Dual multi-head co-attention for multi-choice reading comprehension. arXiv preprint arXiv:2001.09415.