           Multi-span Style Extraction for Generative Reading Comprehension

                                   Junjie Yang1,3,4 , Zhuosheng Zhang2,3,4 , Hai Zhao2,3,4*
                         1 SJTU-ParisTech Elite Institute of Technology, Shanghai Jiao Tong University
                2 Department of Computer Science and Engineering, Shanghai Jiao Tong University
     3 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering,
                               Shanghai Jiao Tong University, Shanghai, China
        4 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
                jj-yang@sjtu.edu.cn, zhangzs@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn

                                Abstract
  Generative machine reading comprehension (MRC) requires a model to generate well-formed answers. For this
  type of MRC, the answer generation method is crucial to model performance. However, generative models,
  which are supposed to be the right models for the task, generally perform poorly. At the same time,
  single-span extraction models have proven effective for extractive MRC, where the answer is constrained to
  a single span in the passage. Nevertheless, they tend to produce incomplete answers or introduce redundant
  words when applied to generative MRC. Thus, we extend the single-span extraction method to multi-span,
  proposing a new framework which enables generative MRC to be smoothly solved as multi-span extraction.
  Thorough experiments demonstrate that this novel approach can alleviate the dilemma between generative
  models and single-span models and produce answers with better-formed syntax and semantics.

Figure 1: Example of how a well-formed answer is generated by the multi-span style extraction.

                        Introduction
Machine Reading Comprehension (MRC) is considered a nontrivial challenge in natural language understanding.
Recently, we have seen continuous success in this area, partially benefiting from the release of massive and
well-annotated datasets from both academic (Rajpurkar, Jia, and Liang 2018; Reddy, Chen, and Manning 2019)
and industry (Bajaj et al. 2018; He et al. 2018) communities.

The widely used span-extraction models (Seo et al. 2017; Ohsugi et al. 2019; Lan et al. 2020) formulate the
MRC task as predicting the start and end positions of the answer span inside the given passage. They have
proven effective on tasks that constrain the answer to be an exact span in the passage (Rajpurkar, Jia, and
Liang 2018). However, for generative MRC tasks whose answers are highly abstractive, single-span extraction
based methods easily suffer from incomplete answers or redundant words. Thus, there still exists a large gap
between the performance of single-span extraction baselines and human performance.

In the meantime, we have observed that utilizing multiple spans appearing in the question and passage to
compose the well-formed answer could be a promising way to alleviate these drawbacks. Figure 1 shows how the
mechanism of multi-span style extraction works for an example from the MS MARCO task (Bajaj et al. 2018),
where the well-formed answer cannot simply be extracted as a single span from the input text.

Therefore, in this work, we propose a novel answer generation approach that takes advantage of the
effectiveness of span extraction and the concise spirit of the multi-span style to synthesize the free-form
answer, together with a framework as a whole for multi-passage generative MRC. We call our framework MUSST,
for MUlti-Span STyle extraction. Our framework is also empowered by a well pre-trained language model as the
encoder component of our model; it provides deep understanding of both the input passage and question, and
models the information interaction between them.

* Corresponding author. This paper was partially supported by the National Key Research and Development
Program of China (No. 2017YFB0304100), Key Projects of the National Natural Science Foundation of China
(U1836222 and 61733011), the Huawei-SJTU long term AI project, and Cutting-edge Machine Reading
Comprehension and Language Model.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).

                                      Figure 2: Our framework MUSST


We conduct a series of experiments and the corresponding ablations on the MS MARCO v2.1 dataset.

Our main contributions in this paper can be summarized as follows (the code is publicly available at
https://github.com/chunchiehy/musst):

• We propose a novel multi-span answer annotator to transform the initial well-formed answer into a series
  of spans distributed over the question and passage.

• We generalize the single-span extraction method to the multi-span style by introducing a lightweight but
  powerful answer generator, which supports the extraction of a varying number of answer spans during
  prediction.

• To make better use of the large dataset for the passage ranking task, we propose dynamic sampling during
  the training of the ranker, which selects the passage most likely to entail the answer.

                                                MUSST
In this section, we present our proposed framework, MUSST, for the multi-passage generative MRC task.
Figure 2 depicts the general architecture of our framework, which consists of a passage ranker, a multi-span
answer annotator, and a question-answering module.

Passage ranker

Problem formulation  Given a question Q and a set of k candidate passages P = {P1, P2, ..., Pk}, the passage
ranker is responsible for ranking the passages based on their relevance to the question. In other words, the
model is requested to output the conditional probability distribution P(y|Q, P; θ), where θ denotes the
model parameters and P(y = i|Q, P; θ) denotes the probability that passage Pi can be used to answer
question Q.

Encoder  For each input question and passage pair (Q, Pi), we represent it as a single packed sequence of
length n of the form "[CLS] Q [SEP] Pi [SEP]". We pass the whole sequence into a contextualized encoder to
produce its contextualized representation E ∈ R^{n×h}, where h denotes the hidden size of the Transformer
blocks. Following the fine-tuning strategy of Devlin et al. (2019) for the classification task, we take the
final hidden vector c ∈ R^h corresponding to the first input token ([CLS]) as the input's aggregate
representation. Our encoder also models the interaction between the question and the passage.

Ranker  The ranker scores each passage by its relevance to the question. Given the output c of the encoding
layer, we pass it through a fully connected multi-layer perceptron consisting of two linear transformations
with a tanh activation in between:

    s = softmax(W2 tanh(W1 c + b1) + b2) ∈ R^2,    ui = s0  and  ri = s1

where W1 ∈ R^{h×h}, W2 ∈ R^{2×h}, b1 ∈ R^h and b2 ∈ R^2 are trainable parameters. Here, ri and ui are
respectively the relevance and unrelevance scores for the pair (Q, Pi). The relevance scores are subsequently
normalized across all candidate passages of the same question:

    r̂i = exp(ri) / Σ_{j=1..k} exp(rj)

Here, r̂i indicates the probability that passage Pi entails the answer to question Q.
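The ranking head is small enough to sketch directly. Below is a minimal, hypothetical PyTorch sketch of how
such a head could sit on top of a Transformers encoder; the names (PassageRanker, "albert-base-v2") and the
exact packing call are our own illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PassageRanker(nn.Module):
    """Scores one (question, passage) pair with the two-layer MLP head described above."""
    def __init__(self, encoder_name="albert-base-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        h = self.encoder.config.hidden_size
        # Two linear transformations with a tanh in between -> (unrelevance, relevance).
        self.scorer = nn.Sequential(nn.Linear(h, h), nn.Tanh(), nn.Linear(h, 2))

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        c = out.last_hidden_state[:, 0]            # [CLS] aggregate representation
        s = torch.softmax(self.scorer(c), dim=-1)  # s_0 = u_i, s_1 = r_i
        return s[:, 1], s[:, 0]                    # relevance r_i, unrelevance u_i

# Pack "[CLS] Q [SEP] P_i [SEP]" for every candidate passage of one question,
# then normalize the relevance scores across the candidates.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
question = "how long should a central air conditioner last"
passages = ["10 to 20 years - sometimes longer. ...", "An unrelated passage."]
batch = tokenizer([question] * len(passages), passages, padding=True,
                  truncation=True, max_length=256, return_tensors="pt")
ranker = PassageRanker()
r, u = ranker(**batch)
r_hat = torch.softmax(r, dim=0)   # probability that each passage entails the answer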
Training  We define a question-passage pair in which the passage entails the answer to the question as a
positive training sample; the positive passage is denoted P+. During the training phase, we adopt negative
sampling with one negative sample: for each positive instance (Q, P+), we randomly sample a negative passage
P− from the unselected passages of the same question. The model is trained by minimizing the following cost
function:

    J(θ) = − (1/T) Σ_{t=1..T} [ log r(Qt, Pt+) + log u(Qt, Pt−) ]

where T is the number of questions in the training set, r(Qt, Pt+) denotes the relevance score of (Qt, Pt+)
and u(Qt, Pt−) denotes the unrelevance score of (Qt, Pt−).

Moreover, motivated by Liu et al. (2019), we resample the negative training instances at the beginning of
each training epoch, so that the same training pattern is not reused for a question in every epoch. We name
this dynamic sampling.
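A minimal sketch of the dynamic sampling idea, assuming a simple in-memory representation of the training
examples; the class and field names (RankerDataset, "others") are illustrative and not taken from the
released code.

import random

class RankerDataset:
    """One positive pair plus one freshly drawn negative passage per question."""
    def __init__(self, examples):
        # Each example: {"question": str, "positive": str, "others": [str, ...]}
        self.examples = examples
        self.resample_negatives()

    def resample_negatives(self):
        """Dynamic sampling: call at the beginning of every training epoch."""
        self.pairs = []
        for ex in self.examples:
            if ex["others"]:
                negative = random.choice(ex["others"])
                self.pairs.append((ex["question"], ex["positive"], negative))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

# Training loop skeleton: maximize the relevance score of the positive pair
# and the unrelevance score of the sampled negative pair.
# for epoch in range(num_epochs):
#     dataset.resample_negatives()
#     for question, pos, neg in dataset:
#         r_pos, _ = ranker(**tokenize(question, pos))   # tokenize(...) is hypothetical
#         _, u_neg = ranker(**tokenize(question, neg))
#         loss = -(torch.log(r_pos) + torch.log(u_neg)).mean()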
Syntactic multi-span answer annotator

In this section, we introduce our syntactic multi-span answer annotator. Before training our
question-answering module, we need to extract non-overlapping spans from the question and passage based on
the original answer in the training dataset. Our annotator is responsible for transforming the original
answer phrase into multiple spans distributed over the question and passage, subject to syntactic
constraints. The attempt to extract the answer spans syntactically is motivated by our intuition that human
editors compose the original answer in an analogous way.

As shown in the middle of Figure 2, we transform the answer phrase into a parsing tree and traverse the
parsing tree in a depth-first search (DFS) manner. At each visit of a subtree, we check whether the span
represented by the subtree appears in the question or passage text, and we obtain a span list after
traversing the whole parsing tree. However, in some cases the original answer still cannot be perfectly
composed from the words of the input text, even in a multi-span style. We discard these bad samples by
comparing their edit distances with a threshold value that is set beforehand.

An important final step is to prune the answer span list. The pruning procedure sticks to the following
principle: if two spans adjacent in the list are contiguous in the original text, we join them together.
Pruning heavily reduces the number of spans needed to recover the original answer phrase. The complete
procedure of our annotator is described in Algorithm 1.

Algorithm 1 Syntactic Multi-span Answer Annotation

Input: Question Q = {q1, q2, ..., qm}, passage P = {p1, p2, ..., pn} and gold answer A = {a1, a2, ..., ak}
Parameter: Edit distance threshold dmax
Output: A list of start and end positions of answer spans in the question and passage
 1: Let M be an empty list
 2: Pack question Q and passage P into a single sequence C
 3: Get the syntactic parsing tree T of answer A with a constituency parser
 4: Let S be the stack of subtrees to be traversed
 5: Initialize S with the root R of the tree T
 6: while S is not empty do
 7:     V = POP(S)
 8:     Get the list of all leaves of subtree V: L = {l1, l2, ..., ln}
 9:     if L is a sublist of C then
10:         Get the start index s and end index e of L in C with the Knuth-Morris-Pratt pattern searching algorithm
11:         Add (s, e) to the span position list M
12:     else
13:         for each child subtree U of V (from right to left) do
14:             PUSH(S, U)
15:         end for
16:     end if
17: end while
18: Reconstruct answer A′ from the span position list M
19: d = EDITDISTANCE(A, A′)
20: if d > dmax then
21:     Empty the list M
22: end if
23: M* = PRUNING(M)
24: return M*
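To make Algorithm 1 concrete, here is a hedged Python sketch of the annotation step. It assumes token-level
inputs and an nltk.Tree produced by a constituency parser (the paper uses Stanford CoreNLP together with
NLTK); the helper names are ours, and a naive sublist search stands in for Knuth-Morris-Pratt.

from nltk import edit_distance
from nltk.tree import Tree

def find_sublist(needle, haystack):
    """Return the start index of `needle` inside `haystack`, or -1 if absent."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

def annotate_spans(question_tokens, passage_tokens, answer_tree, d_max):
    context = question_tokens + passage_tokens           # packed sequence C
    spans, stack = [], [answer_tree]                      # DFS over subtrees
    while stack:
        subtree = stack.pop()
        leaves = subtree.leaves() if isinstance(subtree, Tree) else [subtree]
        start = find_sublist(leaves, context)
        if start >= 0:                                    # subtree text occurs in C
            spans.append((start, start + len(leaves) - 1))
        elif isinstance(subtree, Tree):
            stack.extend(reversed(list(subtree)))         # children visited left to right
    # Discard samples whose reconstruction strays too far from the gold answer.
    reconstructed = [tok for s, e in spans for tok in context[s:e + 1]]
    if edit_distance(" ".join(reconstructed), " ".join(answer_tree.leaves())) > d_max:
        return []
    # Pruning: merge spans that are adjacent in the list and contiguous in C.
    pruned = []
    for s, e in spans:
        if pruned and pruned[-1][1] + 1 == s:
            pruned[-1] = (pruned[-1][0], e)
        else:
            pruned.append((s, e))
    return pruned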
Question-answering module

Problem formulation  Given a question Q and a passage P, the question-answering module is requested to
answer the question based on the information provided by the passage. In other words, the model outputs the
conditional probability distribution P(y|Q, P), where P(y = A|Q, P) denotes the probability that A is the
answer.

Question-passage reader  The architecture of the reader is analogous to the encoder module of the ranker in
the previous section, where we take a pre-trained language model as the encoder. But instead of using only
the aggregate representation, we pass the whole output of the last layer on to predict the answer spans, as
follows:

    M = Encoder(Q, P) ∈ R^{h×n}

where n is the length of the input token sequence and h is the hidden size of the encoder.

Multi-span style answer generator  Our answer generator is responsible for composing the answer in a
multi-span extraction style. Let m be the number of spans to be extracted.

We treat each single-span prediction as a single-span extraction MRC task. Following Lan et al. (2020), we
adopt a linear layer to predict the start and end positions of the span in the input sequence. It is worth
noticing that our model is also able to predict answer spans from the question. The probability distribution
of the j-th span's start position over the input tokens is obtained by

    p̂^{j,start} = softmax(W^s_j M + b^s_j)

where W^s_j ∈ R^{1×h} and b^s_j ∈ R are trainable parameters and p̂^{j,start}_k denotes the probability of
token k being the start of answer span j. The end position distribution of answer span j is obtained by the
analogous formula

    p̂^{j,end} = softmax(W^e_j M + b^e_j)

Training and inference  During training, we add a special virtual span, whose start and end positions both
equal the length of the input sequence, at the end of the annotated answer span list. This approach enables
our model to generate a varying number of answer spans during prediction, with the virtual span serving as a
stop symbol. The cost function is defined as follows:

    J(θ) = − (1/T) Σ_{t=1..T} Σ_{j=1..m_t} [ log p̂^{j,start}_{y_t^{j,start}} + log p̂^{j,end}_{y_t^{j,end}} ]

where T is the number of training samples, m_t is the number of answer spans for sample t, and y_t^{j,start}
and y_t^{j,end} are the true start and end positions of the t-th sample's j-th span.

During inference, at each time step j, we choose the answer span (k, l), k < l, with the maximum value of
p̂^{j,start}_k · p̂^{j,end}_l. The decoding procedure terminates when the stop span is predicted. Sometimes
the model tends to generate the same spans repeatedly. To alleviate this repetition problem, at each
prediction time step j we mask out the span positions predicted at previous time steps (< j) when computing
the probability distributions of the new start and end positions. Since the masking depends on the
previously predicted spans, we call it conditional masking. The extracted spans are finally joined together
to form the final answer phrase.
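The span heads and the conditional-masking decoder are simple enough to sketch. The following hypothetical
PyTorch module mirrors the description above (one start/end linear head per prediction step, a virtual stop
span, and masking of previously selected positions); the names and details are our own assumptions, not the
authors' released code.

import torch
import torch.nn as nn

class MultiSpanGenerator(nn.Module):
    def __init__(self, hidden_size, max_spans=9):
        super().__init__()
        self.max_spans = max_spans
        # One (start, end) linear head per prediction step j.
        self.start_heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(max_spans)])
        self.end_heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(max_spans)])

    @torch.no_grad()
    def decode(self, M):
        """Greedy decoding with conditional masking.

        M: encoder output of shape (seq_len, hidden_size); we assume one extra
        row has been appended at index seq_len - 1 to stand for the virtual stop span.
        """
        seq_len = M.size(0)
        used = torch.zeros(seq_len, dtype=torch.bool)
        spans = []
        for j in range(self.max_spans):
            start_logits = self.start_heads[j](M).squeeze(-1)
            end_logits = self.end_heads[j](M).squeeze(-1)
            # Conditional masking: forbid positions inside previously chosen spans.
            start_logits[used] = float("-inf")
            end_logits[used] = float("-inf")
            p_start = torch.softmax(start_logits, dim=-1)
            p_end = torch.softmax(end_logits, dim=-1)
            # Choose (k, l) maximizing p_start[k] * p_end[l]; triu keeps k <= l
            # (single-token spans are allowed in this sketch).
            joint = torch.triu(p_start.unsqueeze(1) * p_end.unsqueeze(0))
            flat = int(torch.argmax(joint))
            k, l = flat // seq_len, flat % seq_len
            if k == seq_len - 1:              # virtual stop span predicted
                break
            spans.append((k, l))
            used[k:l + 1] = True
        return spans  # later joined into the final answer phrase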
                       Experiments
Dataset
We evaluate our framework on MS MARCO v2.1 (Bajaj et al. 2018), a large-scale open-domain generative MRC
task; the datasets can be obtained from the official site (https://microsoft.github.io/msmarco/). MS MARCO
v2.1 provides two MRC tasks: Question Answering (QA) and Natural Language Generation (NLG). The statistics
of the corresponding datasets are presented in Table 1. Both datasets consist of questions sampled from
Bing's search logs, and each question is accompanied by an average of ten passages that may contain the
answers. QA and NLG are subsets of ALL, which also contains the unanswerable questions.

Distinct from the QA task, the NLG task requires the model to provide a well-formed answer that could be
read and understood by a natural speaker without any additional context. Therefore NLG-style answers are
more abstract than QA-style answers. Table 1 also shows the percentage of examples where the answer can be
extracted as a single span from the gold passage. Unsurprisingly, the answers from the QA set are much more
likely to match a span in the passage than the ones in the NLG set. Moreover, Nishida et al. (2019) state
that the QA task prefers more concise answers than the NLG task, averaging 13.1 words versus 16.6 words.
Therefore, the NLG set is more suitable for evaluating model performance on generative MRC.

BLEU-1 (Papineni et al. 2002) and ROUGE-L (Lin 2004) are adopted as the official evaluation metrics (the
official evaluation scripts can be found at
https://github.com/microsoft/MSMARCO-Question-Answering/tree/master/Evaluation), while the official
leaderboard chooses ROUGE-L as the main metric. In the meantime, we use Mean Average Precision (MAP) and
Mean Reciprocal Rank (MRR) for our ranker.

  Dataset    Train                Dev                Test
  ALL        808,731              101,093            101,092
  QA         503,370 (63.39%)     55,636 (45.40%)    –
  NLG        153,725 (12.57%)     12,467 (24.99%)    –

Table 1: Statistics of the MS MARCO v2.1 dataset. The numbers in parentheses indicate the percentage of
examples whose answer is a single span in the gold passage.

Baseline models
We compare our MUSST with the following baseline models: single-span extraction and seq2seq. For the
single-span extraction baseline, we employ the model for the SQuAD dataset from ALBERT (Lan et al. 2020);
this model is trained only with samples where the answer is a single span in the passage. In the meantime,
we adopt the Transformer model from Vaswani et al. (2017) as our seq2seq baseline. For a fair comparison,
the baseline models share the same passage ranker as MUSST.

Implementation details
For the multi-span answer annotation, we use the constituency parser from Stanford CoreNLP (Manning et al.
2014). The NLTK package (https://www.nltk.org) is also used to implement our annotator. The maximum edit
distance between the answer reconstructed from the annotated spans and the original answer is 32 and 8 for
the NLG and QA training sets, respectively.

The ranker and question-answering module of MUSST are implemented with PyTorch (https://pytorch.org) and the
Transformers package (https://github.com/huggingface/transformers). We adopt ALBERT (Lan et al. 2020) as the
encoder in our models and initialize it with the pre-trained weights before fine-tuning. We choose
ALBERT-base as the encoder of the passage ranker and ALBERT-xlarge for the question-answering module.

Following Lan et al. (2020), we use SentencePiece (Kudo and Richardson 2018) to tokenize our inputs with a
vocabulary size of 30,000. We adopt the Adam optimizer (Kingma and Ba 2015) to minimize the cost function,
and use two types of regularization during training: dropout and L2 weight decay. Hyperparameter details for
training the different modules of our framework are presented in Table 2. MUSST-NLG and MUSST-QA are trained
on the NLG and QA subsets, respectively, and the maximum number of spans for them is set to 9 and 5,
respectively.
We trained the passage ranker and the question-answering module of MUSST-NLG on a machine with four Tesla
P40 GPUs, while the question-answering module of MUSST-QA is trained with eight GeForce GTX 1080 Ti GPUs. It
takes roughly 9 hours to train the passage ranker; for the question-answering modules of MUSST-NLG and
MUSST-QA, the training time is about 10 hours and 17 hours, respectively.

The single-span baseline is implemented with the same packages as MUSST, while the seq2seq baseline is
implemented with Fairseq (Ott et al. 2019).

  Hyperparameter            Ranker    MUSST-QA    MUSST-NLG
  Learning rate             1e-5      3e-5        3e-5
  Learning rate decay       Linear    Linear      Linear
  Training epochs           3         3           5
  Warmup rate               0.1       0.1         0.1
  Adam ε                    1e-6      1e-6        1e-6
  Adam β1                   0.9       0.9         0.9
  Adam β2                   0.999     0.999       0.999
  MSN                       256       256         256
  Batch size                128       32          32
  Encoder dropout rate      0         0           0
  Classifier dropout rate   0.1       0.1         0.1
  Weight decay              0.01      0.01        0.01

Table 2: Training hyperparameters of the different modules of MUSST on the MS MARCO v2.1 dataset. Here,
MUSST-QA and MUSST-NLG refer to their question-answering modules. MSN means maximum sequence length.

Results

  Model          QA                      NLG
                 ROUGE-L    BLEU-1       ROUGE-L    BLEU-1
  Single-span    47.96      50.22        53.10      49.08
  Seq2seq        –          –            56.42      53.89
  MUSST-QA       48.44      49.54        –          –
  MUSST-NLG      –          –            66.24      64.23

Table 3: Performance comparison with our baselines on the QA and NLG development sets. Here, we use the same
single ranker for MUSST and the baselines.

Table 3 shows the results of our single model and the baseline models on the QA and NLG development sets.
MUSST significantly outperforms the baselines, including the generative seq2seq model, on the NLG set in
terms of both ROUGE-L and BLEU-1. Even on the QA set, our model yields better results in terms of ROUGE-L.
Table 4 compares our model with the competing models on the leaderboard. Although our model utilizes only a
standalone classifier for passage ranking, multi-span style extraction still lets it rival state-of-the-art
approaches.

                                      Analysis and discussions
Effect of maximum number of spans
Figure 3 presents the distribution of span numbers, for samples with edit distance less than 4, over the QA
and NLG training sets after the annotation procedure. Most QA-style answers consist of only one span, while
the NLG-style answers are distributed more uniformly over the range [1, 9].

Figure 3: Distribution of training samples with edit distance less than 4 over the number of annotated
answer spans. For better illustration, we filter out the samples that include more than 9 spans.

To better understand the effect of the maximum number of spans to be generated by the answer generator, we
let it vary in the range [2, 12] and conduct experiments on the NLG set with our best single passage ranker.
The edit distance threshold is set to 8. The results are presented in Figure 4. Generally, increasing the
maximum number of spans augments the token coverage rate and thus yields better results, but the gain
becomes less significant once the maximum number of spans is already large enough. From Figure 4, we can see
that the results vary imperceptibly once the maximum number of spans reaches 5. However, since each span
only introduces about 4k parameters, which is negligible compared with the encoder (60M), we still choose
the maximum number to be 9, which corresponds to the best performance on the development set.

Figure 4: Effect of maximum number of spans.
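As a rough check on the "4k parameters per span" figure above (assuming the ALBERT-xlarge encoder with
hidden size h = 2048), each additional prediction step adds only one start head (W^s_j, b^s_j) and one end
head (W^e_j, b^e_j):

    2 × (h + 1) = 2 × (2048 + 1) = 4098 ≈ 4k parameters,

which is indeed negligible next to the roughly 60M parameters of the encoder.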
  Model                      Answer Generation    Ranking                    NLG Task         QA Task          Overall
                                                                             R-L     B-1      R-L     B-1      Average
  Human                      –                    –                          63.2    53.0     53.9    48.5     54.65
  Unpublished
  PALM                       Unknown                                         49.8    49.9     51.8    50.7     50.55
  Multi-doc Enriched BERT    Unknown                                         32.5    37.7     54.0    56.5     45.18
  Published
  BiDAF (a) ♠                Single-span          Confidence score           16.9     9.3     24.0    10.6     15.20
  ConZNet (b) ♠              Pointer-Generator    Unknown                    42.1    38.6      –       –         –
  VNET (c) ♠                 Single-span          Answer verification        48.4    46.8     51.6    54.3     50.28
  Deep Cascade QA (d) ♠      Single-span          Cascade                    35.1    37.4     52.0    54.6     44.78
  Masque QA (e) †            Pointer-Generator    Joint trained classifier   28.5    39.9     52.2    43.7     41.08
  Masque NLG (e) †           Pointer-Generator    Joint trained classifier   49.6    50.1     48.9    48.8     49.35
  MUSST-NLG †                Multi-span           Standalone classifier      48.0    45.8     49.0    51.6     48.60

Table 4: The performance of our framework and competing models on the MS MARCO v2.1 test set. All the
results presented here reflect the MS MARCO leaderboard (microsoft.github.io/msmarco/) as of 28 May 2020.
♠ refers to a model whose results are not reported in the original published paper; BiDAF for MS MARCO is
implemented by the official MS MARCO team. † refers to an ensemble submission; whether the other competing
models are ensembles is unclear. (a) Seo et al. (2017); (b) Indurthi et al. (2018); (c) Wang et al. (2018b);
(d) Yan et al. (2019); (e) Nishida et al. (2019).


Ablation study on model design choice
We perform ablation experiments that quantify the individual contribution of the design choices of MUSST.
Table 5 shows the results on the NLG development set. Both pruning and conditional masking contribute to
model performance: pruning helps the model converge more easily by reducing the number of spans, while
conditional masking generates better answers without suffering from the repetition problem. We also observe
that using the gold passage can significantly improve question answering, which shows there is still large
room for improvement in the passage ranker.

  Model                       ROUGE-L    BLEU-1
  MUSST                       66.24      64.23
  w/o pruning                 64.66      60.36
  w/o conditional masking     65.50      64.31
  MUSST w/ gold passage       75.39      74.41

Table 5: Ablation study on the NLG development set.

Quality of multi-span answer annotator
On the NLG development set, we evaluate the answers generated by our syntactic multi-span annotator. The
results show that our annotated answers obtain 89.35 BLEU-1 and 90.19 ROUGE-L with the gold passages, which
demonstrates the effectiveness of our annotator. For MUSST, the corresponding results are 74.41 and 75.39
(Table 5), so there is still much room for improvement with respect to the question-answering module.

Effect of edit distance threshold
Figure 5 shows the results of MUSST on the NLG development set for various edit distance thresholds.
Interestingly, it indicates that BLEU-1 is impacted more heavily by the variation of the edit distance
threshold than ROUGE-L. Setting the edit distance threshold too large may also damage model performance by
introducing too many incomplete samples.

Figure 5: Effect of edit distance threshold.

Effect of encoder size
Table 6 presents experimental results with ALBERT encoders of various sizes. Unsurprisingly, the model
yields stronger results as the encoder gets larger.
  Encoder          Parameters    ROUGE-L    BLEU-1
  ALBERT-base      12M           62.03      60.48
  ALBERT-large     18M           64.93      61.67
  ALBERT-xlarge    60M           66.24      64.23

Table 6: Effect of ALBERT encoder size.

Performance of the ranker
Table 7 presents our ranker performance in terms of MAP and MRR. The results show that dynamic sampling
leads to slightly better results.

  Model                     Training set    MAP      MRR
  Bing (initial ranking)    –               34.62    35.00
  MUSST (single)            QA              71.10    71.56
  w/o dynamic sampling      QA              70.82    71.26

Table 7: The performance of the ranker with various configurations on the QA development set.

Case study
To get an intuitive view of the prediction ability of MUSST, we show a prediction example from MS MARCO v2.1
for the baseline and MUSST in Table 8. The comparison indicates that our model effectively extracts useful
spans, yielding a more complete answer that can be understood independently of the question and passage
context.

  Question: how long should a central air conditioner last
  Selected Passage: 10 to 20 years - sometimes longer. You should have a service tech come out once a year
  for a tune up. You wouldn't run your car without regular maintenance and tune ups and you shouldn't run
  your a/c that way either - if you want it to last as long as possible. Source(s): 20 years working for a
  major manufacturer of central heating and air conditioning.
  Reference Answer: A Central air conditioner lasts for in between 10 and 20 years. / A central air
  conditioner should last for 10 to 20 years.
  Prediction (Baseline): 10 to 20 years.
  Prediction (MUSST): a central air conditioner should last for 10 to 20 years.

Table 8: A prediction example from the baseline and MUSST. The underlined texts are the spans predicted by
our model to compose the final answer phrase.
                                          Related work
Generative MRC
Generative MRC is considered a more challenging task where answers are free-form human-generated text. More
recently, we have seen an emerging wave of generative MRC tasks, including MS MARCO (Bajaj et al. 2018),
NarrativeQA (Kočiský et al. 2018), DuReader (He et al. 2018) and CoQA (Reddy, Chen, and Manning 2019).

The earliest approaches tried to generate the answer in a single-span extractive way (Tay et al. 2018; Tay,
Luu, and Hui 2018; Wang et al. 2018b; Yan et al. 2019; Ohsugi et al. 2019). Models using a single-span
extractive method are effective for datasets where the abstractive behavior of answers consists mostly of
small modifications to spans in the context (Ohsugi et al. 2019; Yatskar 2019), whereas for datasets with
deeply abstractive answers, this method fails to yield promising results. The first attempts to generate the
answer in a generative way applied an RNN-based seq2seq attentional model to synthesize the answer, such as
S-NET (Tan et al. 2018); seq2seq learning was first introduced by Sutskever, Vinyals, and Le (2014) for
machine translation. The most recent models adopt a hybrid neural network, the Pointer-Generator (See, Liu,
and Manning 2017), to generate the answer, such as ConZNet (Indurthi et al. 2018), MHPGM (Bauer, Wang, and
Bansal 2018) and Masque (Nishida et al. 2019). The Pointer-Generator was first proposed for abstractive text
summarization; it can copy words from the source via a pointer network while retaining the ability to
produce novel words through the generator. Different from ConZNet and MHPGM, Masque adopts a
Transformer-based (Vaswani et al. 2017) Pointer-Generator, while the previous ones utilize GRU (Cho et al.
2014) or LSTM (Hochreiter and Schmidhuber 1997).

Multi-passage MRC
For each question-answer pair, a multi-passage MRC dataset contains more than one passage as the reading
context; examples include SearchQA (Dunn et al. 2017), TriviaQA (Joshi et al. 2017), MS MARCO, and DuReader.

Existing approaches designed specifically for multi-passage MRC can be classified into two categories:
pipeline and end-to-end. Pipeline-based models (Chen et al. 2017; Wang et al. 2018a; Clark and Gardner 2018)
adopt a ranker to first rank all the passages based on their relevance to the question and then utilize a
question-answering module to read the selected passages. The ranker can be based on traditional information
retrieval methods (BM25 or TF-IDF) or employ a neural re-ranking model. End-to-end models (Wang et al.
2018b; Tan et al. 2018; Nishida et al. 2019) read all the provided passages at the same time and produce for
each passage a candidate answer with a score, which is subsequently compared among passages to find the
final answer. Passage ranking and answer prediction are usually done jointly as multi-task learning. More
recently, Yan et al. (2019) proposed a cascade learning model to balance the effectiveness and efficiency of
the two approaches mentioned above.
Pre-trained language model in MRC
Employing pre-trained language models has become common practice for tackling MRC tasks (Zhang, Zhao, and
Wang 2020). The appearance of more elaborate architectures, larger corpora, and better-designed pre-training
objectives has sped up the achievement of new state-of-the-art results in MRC (Devlin et al. 2019; Liu et
al. 2019; Yang et al. 2019; Lan et al. 2020). Moreover, Glass et al. (2019) adopt span selection, an MRC
task, as an auxiliary pre-training task. Another mainstream line of research attempts to drive the
improvements during fine-tuning, which includes integrating better verification strategies for unanswerable
questions (Zhang, Yang, and Zhao 2020), incorporating explicit linguistic features (Zhang et al. 2020b,c),
leveraging external knowledge for commonsense reasoning (Lin et al. 2019) or enhancing the matching network
for multi-choice MRC (Zhang et al. 2020a; Zhu, Zhao, and Li 2020). In addition, Hu et al. (2019) introduced
multi-span extraction to obtain the top-k most likely spans for multi-type MRC. However, different from our
work, this method is more suitable for predicting a set of independent answer spans than for generating a
complete sentence.

                                           Conclusion
In this work, we present a novel solution to generative MRC, the multi-span style extraction framework
MUSST, and show that it is capable of alleviating the problems of generating incomplete answers or
introducing redundant words encountered by single-span extraction models. We apply our model to a
challenging generative MRC dataset, MS MARCO v2.1, and significantly outperform the single-span extraction
baseline. This work indicates a new research line for generative MRC in addition to the two existing
methods, single-span extraction and seq2seq generation. With the support of only a standalone ranking
classifier, our proposed method still gives overall performance approaching the state of the art, showing
great potential.
                                           References
Bajaj, P.; Campos, D.; Craswell, N.; Deng, L.; Gao, J.; Liu, X.; Majumder, R.; McNamara, A.; Mitra, B.; Nguyen, T.; Rosenberg, M.; Song, X.; Stoica, A.; Tiwary, S.; and Wang, T. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268.
Bauer, L.; Wang, Y.; and Bansal, M. 2018. Commonsense for Generative Multi-Hop Question Answering Tasks. In Empirical Methods in Natural Language Processing (EMNLP), 4220–4230.
Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Association for Computational Linguistics (ACL), 1870–1879.
Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
Clark, C.; and Gardner, M. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Association for Computational Linguistics (ACL), 845–855.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 4171–4186.
Dunn, M.; Sagun, L.; Higgins, M.; Guney, V. U.; Cirik, V.; and Cho, K. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv preprint arXiv:1704.05179.
Glass, M.; Gliozzo, A.; Chakravarti, R.; Ferritto, A.; Pan, L.; Bhargav, G. P. S.; Garg, D.; and Sil, A. 2019. Span Selection Pre-training for Question Answering. arXiv preprint arXiv:1909.04120.
He, W.; Liu, K.; Liu, J.; Lyu, Y.; Zhao, S.; Xiao, X.; Liu, Y.; Wang, Y.; Wu, H.; She, Q.; Liu, X.; Wu, T.; and Wang, H. 2018. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. In Proceedings of the Workshop on Machine Reading for Question Answering, 37–46.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735–1780.
Hu, M.; Peng, Y.; Huang, Z.; and Li, D. 2019. A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1596–1606.
Indurthi, S. R.; Yu, S.; Back, S.; and Cuayáhuitl, H. 2018. Cut to the Chase: A Context Zoom-in Network for Reading Comprehension. In Empirical Methods in Natural Language Processing (EMNLP), 570–575.
Joshi, M.; Choi, E.; Weld, D.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Association for Computational Linguistics (ACL), 1601–1611.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
Kočiský, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K. M.; Melis, G.; and Grefenstette, E. 2018. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics (TACL) 6: 317–328.
Kudo, T.; and Richardson, J. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 66–71.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations (ICLR).
Lin, B. Y.; Chen, X.; Chen, J.; and Ren, X. 2019. KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2829–2839.
Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S. J.; and McClosky, D. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, 55–60.
Nishida, K.; Saito, I.; Nishida, K.; Shinoda, K.; Otsuka, A.; Asano, H.; and Tomita, J. 2019. Multi-style Generative Reading Comprehension. In Association for Computational Linguistics (ACL), 2273–2284.
Ohsugi, Y.; Saito, I.; Nishida, K.; Asano, H.; and Tomita, J. 2019. A Simple but Effective Method to Incorporate Multi-turn Context with BERT for Conversational Machine Comprehension. In Proceedings of the First Workshop on NLP for Conversational AI, 11–17.
Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Association for Computational Linguistics (ACL), 311–318.
Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Association for Computational Linguistics (ACL), 784–789.
Reddy, S.; Chen, D.; and Manning, C. D. 2019. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics (TACL) 7: 249–266.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Association for Computational Linguistics (ACL), 1073–1083.
Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional Attention Flow for Machine Comprehension. In International Conference on Learning Representations (ICLR).
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 3104–3112.
Tan, C.; Wei, F.; Yang, N.; Du, B.; Lv, W.; and Zhou, M. 2018. S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence.
Tay, Y.; Luu, A. T.; and Hui, S. C. 2018. Multi-Granular Sequence Encoding via Dilated Compositional Units for Reading Comprehension. In Empirical Methods in Natural Language Processing (EMNLP), 2141–2151.
Tay, Y.; Luu, A. T.; Hui, S. C.; and Su, J. 2018. Densely Connected Attention Propagation for Reading Comprehension. In Advances in Neural Information Processing Systems (NIPS), 4906–4917.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS), 5998–6008.
Wang, S.; Yu, M.; Guo, X.; Wang, Z.; Klinger, T.; Zhang, W.; Chang, S.; Tesauro, G.; Zhou, B.; and Jiang, J. 2018a. R3: Reinforced Ranker-Reader for Open-Domain Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence.
Wang, Y.; Liu, K.; Liu, J.; He, W.; Lyu, Y.; Wu, H.; Li, S.; and Wang, H. 2018b. Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. In Association for Computational Linguistics (ACL), 1918–1927.
Yan, M.; Xia, J.; Wu, C.; Bi, B.; Zhao, Z.; Zhang, J.; Si, L.; Wang, R.; Wang, W.; and Chen, H. 2019. A Deep Cascade Model for Multi-Document Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 7354–7361.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems (NIPS), 5754–5764.
Yatskar, M. 2019. A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2318–2323.
Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2020a. DCMN+: Dual co-matching network for multi-choice reading comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 9563–9570.
Zhang, Z.; Wu, Y.; Zhao, H.; Li, Z.; Zhang, S.; Zhou, X.; and Zhou, X. 2020b. Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 9628–9635.
Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; and Wang, R. 2020c. SG-Net: Syntax-Guided Machine Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence.
Zhang, Z.; Yang, J.; and Zhao, H. 2020. Retrospective Reader for Machine Reading Comprehension. arXiv preprint arXiv:2001.09694.
Zhang, Z.; Zhao, H.; and Wang, R. 2020. Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond. arXiv preprint arXiv:2005.06249.
Zhu, P.; Zhao, H.; and Li, X. 2020. Dual multi-head co-attention for multi-choice reading comprehension. arXiv preprint arXiv:2001.09415.