Biomedical question-focused multi-document summarization: ILSP and AUEB at BioASQ3

Prodromos Malakasiotis1,2, Emmanouil Archontakis1, Ion Androutsopoulos1,2, Dimitrios Galanis2, and Harris Papageorgiou2

1 Dept. of Informatics, Athens University of Economics and Business, Greece
rulller@aueb.gr, man.arcon@gmail.com, ion@aueb.gr
http://nlp.cs.aueb.gr/
2 Institute for Language and Speech Processing, Research Center ‘Athena’, Greece
malakasiotis@ilsp.gr, galanisd@ilsp.gr, xaris@ilsp.gr
http://www.ilsp.gr/

Abstract. Question answering systems aim to find answers to natural language questions by searching in document collections (e.g., repositories of scientific articles or the entire Web) and/or structured data (e.g., databases, ontologies). Strictly speaking, the answer to a question might sometimes be simply ‘yes’ or ‘no’, a named entity, or a set of named entities. In practice, however, a more elaborate answer is often also needed, ideally a summary of the most important information from relevant documents and structured data. In this paper, we focus on generating summaries from documents that are known to be relevant to particular questions. We describe the joint participation of AUEB and ILSP in the corresponding subtask of the bioasq3 competition, where participants produce multi-document summaries of given biomedical articles that are relevant to English questions prepared by biomedical experts.

Keywords: biomedical question answering, text summarization

1 Introduction

Biomedical experts are extremely short of time. They also need to keep up with scientific developments happening at a pace that is probably faster than in any other science. The online biomedical bibliographic database PubMed currently comprises approximately 21 million references and was growing at a rate often exceeding 20,000 articles per week in 2011 (see http://www.ncbi.nlm.nih.gov/pubmed/). Figure 1 shows the number of biomedical articles indexed by PubMed per year since 1964. Rich sources of structured biomedical information, like the Gene Ontology, umls, or Diseasome, are also available (see http://www.geneontology.org/, http://www.nlm.nih.gov/research/umls/, and http://diseasome.eu/). Obtaining sufficient and concise answers from this wealth of information is a challenging task for traditional search engines, which instead of answers return lists of (possibly) relevant documents that the experts themselves have to study. Consequently, there is growing interest in biomedical question answering (QA) systems [3, 4], which aim to produce more concise answers. To foster research in biomedical QA, the bioasq project (http://www.bioasq.org/) has been constructing benchmark datasets and evaluation services and organizing international biomedical QA competitions since 2012 [20].

[Fig. 1. Number of new PubMed articles (blue line) indexed per year over the period 1964-2013, and the respective logarithmic trend (red dashed line); the vertical axis shows published articles in thousands.]

Given a question expressed in natural language, QA systems aim to provide answers by searching in document collections (e.g., repositories of scientific articles or the entire Web) and/or structured data (e.g., databases, ontologies).
Strictly speaking, the answer to a question might sometimes be simply a ‘yes’ or ‘no’ (e.g., in biomedical questions like “Do CpG islands co-localize with transcription start sites?”), a named entity (e.g., in “What is the methyl donor of DNA (cytosine-5)-methyltransferases?”), or a set of named entities (e.g., in “Which species may be used for the biotechnological production of itaconic acid?”). Following the terminology of bioasq, we call short answers of this kind ‘exact’ answers. In practice, however, a more elaborate answer is often needed, ideally a paragraph summarizing the most important information from relevant documents and structured data; bioasq calls answers of this kind ‘ideal’ answers. In this paper, we focus on generating ‘ideal’ answers (summaries) from documents that are known to be relevant to particular questions. We describe our participation in the corresponding subtask of the bioasq3 competition (Task 3b, Phase B, generation of ‘ideal’ answers), where the participants produce summaries of biomedical articles that are relevant to English questions prepared by biomedical experts. In this particular subtask, the input is a question along with the PubMed articles that a biomedical expert identified as relevant to the question; in effect, a perfect search engine is assumed (see Fig. 2). More precisely, in bioasq3 only the abstracts of the articles were available; hence, we summarize sets of abstracts (one set per question). We also note that the abstracts contain annotations showing the snippets (one or more consecutive sentences each) that the biomedical experts considered most relevant to the corresponding questions. We do not use the snippet annotations of the experts, since our system includes its own mechanisms to assess the importance of each sentence. Hence, our system may be at a disadvantage compared to systems that use the snippet annotations of the experts. Nevertheless, the experimental results we present indicate that it still performs better than its competitors.

[Fig. 2. Using QA, multi-document summarization, and concept-to-text generation to produce ‘exact’ and ‘ideal’ answers to English biomedical questions. The blue box indicates the focus of our participation in bioasq3. We did not consider rdf triples. The figure shows the example question “Do CpG islands co-localize with transcription start sites?” being turned into a query (e.g., “CpG islands” AND “transcription start sites”) for a search engine, which returns documents, rdf triples, etc.; QA, summarization, and nlg components then produce the ‘exact’ answer (“Yes.”) and the ‘ideal’ answer (summary): “Yes. It is generally known that the presence of a CpG island around the TSS is related to the expression pattern of the gene. CGIs (CpG islands) often extend into downstream transcript regions. This provides an explanation for the observation that the exon at the 5' end of the transcript, flanked with the transcription start site, shows a remarkably higher CpG density than the downstream exons.”]

We also note that when relevant structured information is also available (e.g., rdf triples), concept-to-text natural language generation (nlg) [1] can also be used to produce ‘ideal’ answers or texts to be given as additional input documents to the summarizer. We did not consider nlg, however, since in bioasq3 the questions were not accompanied by manually selected (by the biomedical experts) relevant structured information, unlike bioasq1 and bioasq2, and we do not yet have mechanisms to select structured information automatically.
Section 2 below describes the different versions of the multi-document summarizer that we used. Section 3 reports our experimental results. Section 4 concludes and provides directions for future work.

2 Our question-focused multi-document summarizer

We now discuss how the ‘ideal’ answers (summaries) of our system are produced. Recall that for each question, a set of documents (article abstracts) known to be relevant to the question is given. Our system is an extractive summarizer, i.e., it includes in each summary sentences of the input documents, without rephrasing them. The summarizer attempts to select the sentences that are most relevant to the question, also trying to avoid including redundant sentences in the summary, i.e., pairs of sentences that convey the same information. bioasq restricts the maximum size of each ‘ideal’ answer to 200 words; including redundant sentences wastes space and is also penalized when experts manually assess the responses of the systems [20]. The summarizer does not attempt to repair (e.g., replace pronouns by their referents), order, or aggregate the selected sentences [6]; we leave these important issues for future work.

2.1 Baseline 1 and Baseline 2

As a starting point, we used the extractive summarizer of Galanis et al. [7, 8]. Two versions of the summarizer, known as Baseline 1 and Baseline 2, have been used as baselines for ‘ideal’ answers in all three years of the bioasq competition; Baseline 1 and Baseline 2 are the ilp2 and greedy-red methods, respectively, of Galanis et al. [8], and Baseline 2 had also participated in TAC 2008 [9]. Both versions employ a Support Vector Regression (svr) model [5] to assign a relevance score rel(s_i) to each sentence s_i of the relevant documents of a question q; we use the svr implementation of libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) with an rbf kernel and libsvm’s parameter tuning facilities. An svr learns a function f : R^n → R in order to predict a real value y_i ∈ R given a feature vector x_i ∈ R^n that represents an instance. In our case, x_i is a feature vector representing a sentence s_i of the relevant documents of a question q, and y_i is the relevance score of s_i. Consult Galanis et al. [7, 8] for a discussion of the features that were used in the svr of Baseline 1 and Baseline 2. During training, for each q we compute the rouge-2 and rouge-su4 scores [13] between each s_i and the gold ‘ideal’ answer of q (provided by an expert), and we take y_i to be the average of the rouge-2 and rouge-su4 scores. The motivation for using these scores is that they are the two most commonly used measures for the automatic evaluation of machine-generated summaries against gold ones. Roughly speaking, both measures compute the word bigram recall of the summary (or sentence) being evaluated against, possibly multiple, gold summaries. However, rouge-su4 also considers skip bigrams (pairs of words with other, ignored, intervening words) with a maximum distance of 4 words between the words of each skip bigram. Both measures have been found to correlate well with human judgements in extractive summarization [13]; hence, training a component (e.g., an svr) to predict the rouge score of each sentence can be particularly useful. Intuitively, a sentence with a high rouge score has a high overlap with the gold summaries; and since the gold summaries contain the sentences that human authors considered most important, a sentence with a high rouge score is most likely also important.
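To make the training signal concrete, the sketch below illustrates how per-sentence regression targets of this kind could be built and fed to an rbf-kernel svr. It is our own simplified illustration, not the system’s code: the set-based rouge-2 and rouge-su4 approximations ignore the clipping and unigram details of the official rouge toolkit, scikit-learn’s SVR (a libsvm wrapper) stands in for the libsvm setup of the paper, and featurize and the data layout are hypothetical.

```python
# A simplified sketch (not the system's code) of building per-sentence SVR
# training targets as the average of approximate ROUGE-2 and ROUGE-SU4 recall
# against the gold 'ideal' answer.

from itertools import combinations
import numpy as np
from sklearn.svm import SVR   # libsvm-based SVR; stands in for the libsvm setup of the paper

def bigrams(tokens):
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def skip_bigrams(tokens, max_skip=4):
    # word pairs at most max_skip positions apart (ROUGE-SU4-style skip bigrams)
    return {(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2) if j - i <= max_skip}

def recall(candidate, gold):
    return len(candidate & gold) / max(len(gold), 1)

def rouge_target(sentence_tokens, gold_tokens):
    r2 = recall(bigrams(sentence_tokens), bigrams(gold_tokens))
    su4 = recall(skip_bigrams(sentence_tokens), skip_bigrams(gold_tokens))
    return 0.5 * (r2 + su4)          # y_i: average of the two ROUGE scores

def train_relevance_svr(training_questions, featurize):
    # training_questions: dicts with the question, its relevant-document
    # sentences, and the gold 'ideal' answer (hypothetical data layout)
    X, y = [], []
    for q in training_questions:
        gold = q["ideal_answer"].split()
        for sentence in q["sentences"]:
            X.append(featurize(q["question"], sentence, q["documents"]))
            y.append(rouge_target(sentence.split(), gold))
    model = SVR(kernel="rbf")        # C and gamma would be tuned, as in the paper
    model.fit(np.array(X), np.array(y))
    return model
```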
Baseline 1 uses Integer Linear Programming (ilp) to jointly maximize the relevance and diversity (non-redundancy) of the selected sentences s_i, respecting at the same time the maximum allowed summary length; we use the implementation of the Branch and Cut algorithm of the gnu Linear Programming Kit (glpk, http://sourceforge.net/projects/winglpk/). The ilp model maximizes the following objective function:

\max_{b,x} \; \lambda \sum_{i=1}^{n} \frac{l_i}{l_{max}} \alpha_i x_i + (1 - \lambda) \sum_{i=1}^{|B|} \frac{b_i}{n}    (1)

subject to:

\sum_{i=1}^{n} l_i x_i \le l_{max}    (2)

\sum_{g_j \in B_i} b_j \ge |B_i| \, x_i, \quad \text{for } i = 1, \dots, n    (3)

\sum_{s_i \in S_j} x_i \ge b_j, \quad \text{for } j = 1, \dots, |B|    (4)

where α_i is the relevance score rel(s_i) of sentence s_i normalized in [0, 1]; l_i is the word length of s_i; l_max is the maximum allowed summary length in words; n is the number of input sentences (sentences in the given relevant documents); B is the set of all the word bigrams in the input sentences; x_i and b_i show which sentences s_i and word bigrams, respectively, are present in the summary; B_i is the set of word bigrams that occur in sentence s_i; g_j ranges over the word bigrams in B_i; and S_j is the set of sentences that contain bigram g_j. Constraint (2) ensures that the maximum allowed summary length is not exceeded. Constraint (3) ensures that if an input sentence is included in the summary, then all of its word bigrams are also included. Constraint (4) ensures that if a word bigram is included in the summary, then at least one sentence that contains it is also included. The first sum of Eq. 1 maximizes the total relevance of the selected sentences. The second sum maximizes the number of distinct bigrams in the summary, in effect minimizing the redundancy of the included sentences. Finally, λ ∈ [0, 1] controls how much the model tries to maximize the total relevance of the selected sentences at the expense of non-redundancy and vice versa. Consult Galanis et al. [7, 8] for a more detailed explanation of the ilp model.
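As an illustration of Eqs. (1)–(4), the following sketch expresses the same model with the PuLP modelling library and its default solver; it is our own illustration under assumptions, not the system’s implementation, which uses glpk and the svr-derived relevance scores. Tokenized sentences and normalized relevance scores are assumed to be given.

```python
# A sketch of the sentence-selection ILP of Eqs. (1)-(4), written with PuLP.
# `sentences` is a list of (tokens, relevance) pairs, with relevance already
# normalized to [0, 1]; l_max is the summary length limit in words.

from pulp import LpProblem, LpVariable, LpMaximize, lpSum

def ilp_summary(sentences, l_max=200, lam=0.8):
    n = len(sentences)
    lengths = [len(tokens) for tokens, _ in sentences]
    alphas = [rel for _, rel in sentences]

    # word bigrams of each sentence, and the set B of all distinct bigrams
    bigrams_of = [{(a, b) for a, b in zip(tokens, tokens[1:])} for tokens, _ in sentences]
    all_bigrams = sorted(set().union(*bigrams_of))
    index = {g: j for j, g in enumerate(all_bigrams)}

    x = [LpVariable(f"x_{i}", cat="Binary") for i in range(n)]                 # sentence selected?
    b = [LpVariable(f"b_{j}", cat="Binary") for j in range(len(all_bigrams))]  # bigram in summary?

    prob = LpProblem("question_focused_summary", LpMaximize)

    # Eq. (1): relevance of the selected sentences plus distinct-bigram coverage
    prob += lam * lpSum((lengths[i] / l_max) * alphas[i] * x[i] for i in range(n)) \
            + ((1 - lam) / n) * lpSum(b)

    # Eq. (2): respect the maximum summary length
    prob += lpSum(lengths[i] * x[i] for i in range(n)) <= l_max

    # Eq. (3): a selected sentence brings all of its bigrams into the summary
    for i in range(n):
        prob += lpSum(b[index[g]] for g in bigrams_of[i]) >= len(bigrams_of[i]) * x[i]

    # Eq. (4): a bigram counts only if some sentence containing it is selected
    for g, j in index.items():
        prob += lpSum(x[i] for i in range(n) if g in bigrams_of[i]) >= b[j]

    prob.solve()
    return [i for i in range(n) if x[i].value() == 1]
```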
Baseline 2 first uses the trained svr to rank the sentences s_i of the relevant documents of q by decreasing relevance rel(s_i). It then greedily examines each s_i from highest to lowest rel(s_i). If the cosine similarity between s_i and any of the sentences that have already been added to the summary exceeds a threshold t, then s_i is discarded; the cosine similarity is computed by representing each sentence as a bag of words (using Boolean features), and t is tuned on development data. Otherwise, if s_i fits in the remaining available summary space, it is added to the summary; if it does not fit, the summary construction process stops.

Baselines 1 and 2 were trained on news articles, as discussed by Galanis et al. [7, 8], and were used in bioasq without retraining and without modifying the features of their svr. However, there are many differences between news and biomedical articles, and many of the features that were used in the svr of Baselines 1 and 2 are irrelevant to biomedical articles. For example, Baselines 1 and 2 use a feature that counts the names of organizations, persons, etc. in sentence s_i, as identified by a named entity recognizer that does not support biomedical entity types (e.g., names of genes, diseases). They also use a feature that considers the order of s_i in the document it was extracted from, based on the intuition that news articles usually list the most important information first, a convention that does not always hold in biomedical abstracts.

Hence, we also experimented with modified versions of Baselines 1 and 2, discussed below, which were trained on bioasq datasets and used different feature sets.

2.2 The ILP-SUM-0 and ILP-SUM-1 summarizers

The first new version of our summarizer, called ilp-sum-0, is the same as Baseline 1 (the baseline that uses ilp, with the same features in its svr), but was trained on bioasq data, as discussed in Section 3 below. Another version, ilp-sum-1, is the same as ilp-sum-0 and was also trained on bioasq data, but it uses a different feature set in its svr, still close to the features of Baselines 1 and 2 [7, 8], but modified for biomedical questions and articles. The features of ilp-sum-1 are listed below; all the features of all the versions of the summarizer, including Baselines 1 and 2, are normalized in [0, 1]. Features (1.5) and (1.7) are also illustrated in the code sketch after this list.

(1.1) Word overlap: The number of common words between the question q and each sentence s_i of the relevant documents of q, after removing stop words and duplicate words from q and s_i.

(1.2) Stemmed word overlap: The same as Feature (1.1), but the words of q and s_i are stemmed, after removing stop words.

(1.3) Levenshtein distance: The Levenshtein distance [11] between q and s_i, taking insertions, deletions, and replacements to operate on entire words.

(1.4) Stemmed Levenshtein distance: The same as Feature (1.3), but the words of q and s_i are stemmed before computing the Levenshtein distance.

(1.5) Content word frequency: The average frequency CF(s_i) of the content words of sentence s_i in the relevant documents of q, as defined by Schilder and Kondadadi [18]:

CF(s_i) = \frac{\sum_{j=1}^{c(s_i)} p_c(w_j)}{c(s_i)}

where c(s_i) is the number of content words in sentence s_i, p_c(w_j) = m / M, m is the number of occurrences of content word w_j in the relevant documents of q, and M is the total number of content word occurrences in the relevant documents of q.

(1.6) Stemmed content word frequency: The same as Feature (1.5), but the content words of the relevant documents of q (and their sentences s_i) are stemmed before computing CF(s_i).

(1.7) Document frequency: The average document frequency of the content words of sentence s_i in the relevant documents of q, as defined by Schilder and Kondadadi [18]:

DF(s_i) = \frac{\sum_{j=1}^{c(s_i)} p_d(w_j)}{c(s_i)}

where p_d(w_j) = d / D, d is the number of relevant documents of q that contain the content word w_j, and D is the number of relevant documents of q.

(1.8) Stemmed document frequency: The same as Feature (1.7), but the content words of the relevant documents of q (and their sentences s_i) are stemmed before computing DF(s_i).
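The sketch below is our own illustration of the CF(s_i) and DF(s_i) scores defined above; the whitespace tokenizer and the tiny stop-word list are placeholder assumptions, since the exact preprocessing is not specified here.

```python
# A minimal sketch of the content word frequency (Feature 1.5) and document
# frequency (Feature 1.7) scores of Schilder and Kondadadi [18].

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "is", "are", "to", "with"}  # toy list

def content_words(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def cf_and_df(sentence, relevant_docs):
    """CF(s): average p_c(w) = m/M of the sentence's content words.
       DF(s): average p_d(w) = d/D of the sentence's content words."""
    corpus_counts = Counter(w for doc in relevant_docs for w in content_words(doc))
    M = sum(corpus_counts.values())            # total content-word occurrences in the documents
    doc_counts = Counter(w for doc in relevant_docs for w in set(content_words(doc)))
    D = len(relevant_docs)                     # number of relevant documents of the question

    words = content_words(sentence)
    if not words or M == 0 or D == 0:
        return 0.0, 0.0
    cf = sum(corpus_counts[w] / M for w in words) / len(words)
    df = sum(doc_counts[w] / D for w in words) / len(words)
    return cf, df
```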
2.3 The ILP-SUM-2 and GR-SUM-2 summarizers

In recent years, continuous space vector representations of words, also known as word embeddings, have been found to capture several morphosyntactic and semantic properties of words [12, 14–17]. bioasq employed the popular word2vec tool [14–16] to construct embeddings for a vocabulary of 1,701,632 words occurring in biomedical texts, using a corpus of 10,876,004 English abstracts of biomedical articles from PubMed (see https://code.google.com/p/word2vec/ and http://bioasq.lip6.fr/tools/BioASQword2vec/ for further details). The ilp-sum-2 and gr-sum-2 versions of our summarizer use the following features in their svr, which are based on the bioasq word embeddings, in addition to Features (1.1)–(1.8) of ilp-sum-1; a code sketch of these features follows the list. ilp-sum-2 also uses the ilp model (like Baseline 1, ilp-sum-0, and ilp-sum-1), whereas gr-sum-2 uses the greedy approach of Baseline 2 instead (see Section 2.1).

(2.1) Euclidean similarity of centroids: This is computed as:

ES(q, s_i) = \frac{1}{1 + ED(\vec{q}, \vec{s_i})}    (5)

where \vec{q} and \vec{s_i} are the centroid vectors of q and s_i, respectively, defined below, and ED(\vec{q}, \vec{s_i}) is the Euclidean distance between \vec{q} and \vec{s_i}. The centroid \vec{t} of a text t (question or sentence) is computed as:

\vec{t} = \frac{1}{|t|} \sum_{i=1}^{|t|} \vec{w_i} = \frac{\sum_{j=1}^{|V|} \vec{w_j} \cdot TF(w_j, t)}{\sum_{j=1}^{|V|} TF(w_j, t)}    (6)

where |t| is the number of words (tokens) in t, \vec{w_i} is the embedding (vector) of the i-th word (token) of t, |V| is the number of (distinct) words in the vocabulary, and TF(w_j, t) is the term frequency (number of occurrences) of the j-th vocabulary word in the text t. Tokens for which we have no embeddings are ignored when computing the features of this section.

(2.2) Euclidean similarity of IDF-weighted centroids: The same as Feature (2.1), except that the centroid of a text t (question or sentence) now also takes into account the inverse document frequencies of the words in t:

\vec{t} = \frac{\sum_{j=1}^{|V|} \vec{w_j} \cdot TF(w_j, t) \cdot IDF(w_j)}{\sum_{j=1}^{|V|} TF(w_j, t) \cdot IDF(w_j)}    (7)

where IDF(w_j) = \log \frac{|D|}{|D(w_j)|}, |D| = 10,876,004 is the total number of abstracts in the corpus the word embeddings were obtained from, and |D(w_j)| is the number of those abstracts that contain the word w_j.

(2.3) Pairwise Euclidean similarities: To compute this set of features (8 features in total), we create two bags, one with the tokens (word occurrences) of the question q and one with the tokens of the sentence s_i. We then compute the similarity ES(w, w′) (as in Eq. 5) for every pair of tokens w, w′ of q and s_i, respectively, and we construct the following features:
- the average of the similarities ES(w, w′), for all the token pairs w, w′ of q and s_i, respectively,
- the median of the similarities ES(w, w′),
- the maximum similarity ES(w, w′),
- the average of the two largest similarities ES(w, w′),
- the average of the three largest similarities ES(w, w′),
- the minimum similarity ES(w, w′),
- the average of the two smallest similarities ES(w, w′),
- the average of the three smallest similarities ES(w, w′).

(2.4) IDF-weighted pairwise Euclidean similarities: The same set of features (8 features) as Features (2.3), but the Euclidean similarity ES(w, w′) of each pair of tokens w, w′ is multiplied by IDF(w) · IDF(w′) / maxidf^2 to reward pairs with high idf scores. The idf scores are computed as in Feature (2.2), and maxidf is the maximum idf score of the words we have embeddings for.
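The following sketch is our own illustration of Features (2.1)–(2.3), not the system’s code; it assumes a dictionary emb mapping in-vocabulary words to their word2vec vectors and a dictionary idf of idf scores.

```python
# A sketch of the embedding-based features: the Euclidean similarity of
# (optionally IDF-weighted) centroids, Eqs. (5)-(7), and the eight pairwise
# similarity statistics of Features (2.3). Out-of-vocabulary tokens are skipped.

import numpy as np

def euclidean_sim(u, v):                                   # Eq. (5)
    return 1.0 / (1.0 + np.linalg.norm(u - v))

def centroid(tokens, emb, idf=None):                       # Eq. (6), or Eq. (7) if idf is given
    vectors, weights = [], []
    for w in tokens:
        if w in emb:
            vectors.append(emb[w])
            weights.append(idf.get(w, 1.0) if idf else 1.0)
    if not vectors:
        return None
    return np.average(np.stack(vectors), axis=0, weights=weights)

def centroid_similarity(question_tokens, sentence_tokens, emb, idf=None):   # Features (2.1)/(2.2)
    q, s = centroid(question_tokens, emb, idf), centroid(sentence_tokens, emb, idf)
    return euclidean_sim(q, s) if q is not None and s is not None else 0.0

def pairwise_similarity_features(question_tokens, sentence_tokens, emb):    # Features (2.3)
    sims = sorted(euclidean_sim(emb[w], emb[v])
                  for w in question_tokens if w in emb
                  for v in sentence_tokens if v in emb)
    if not sims:
        return [0.0] * 8
    return [float(np.mean(sims)), float(np.median(sims)),
            sims[-1], float(np.mean(sims[-2:])), float(np.mean(sims[-3:])),
            sims[0], float(np.mean(sims[:2])), float(np.mean(sims[:3]))]
```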
3 Experimental results

We used the datasets of bioasq1 and bioasq2 to train and tune the four new versions of our summarizer (ilp-sum-0, ilp-sum-1, ilp-sum-2, gr-sum-2). We then used the dataset of bioasq3 to test the two best new versions of our summarizer (ilp-sum-2, gr-sum-2) on unseen data, and to compare them against Baseline 1, Baseline 2, and the other systems that participated in bioasq3.

3.1 Experiments on BioASQ1 and BioASQ2 data

The bioasq1 and bioasq2 datasets consist of 3 and 5 batches, respectively, called Batches 1–3 and Batches 4–8 in this section. Each batch contains approximately 100 questions, along with relevant documents and ‘ideal’ answers provided by the biomedical experts.

In a first experiment, we aimed to tune the λ parameter of ilp-sum-0, ilp-sum-1, and ilp-sum-2, which use the ilp model of Section 2.1, and to compare the three systems. Figure 3 shows the average rouge scores of the three systems on Batches 4–6, for different values of λ, each time using the remaining batches of Batches 1–6 to train them (i.e., to train their svrs); Batches 7–8 were reserved for another experiment, discussed below. In more detail, we first computed the rouge-2 and rouge-su4 scores on Batch 4, training the systems on Batches 1–3 and 5–6. We then computed the average of the rouge-2 and rouge-su4 scores of Batch 4, i.e., rouge(Batch 4) = 1/2 · (rouge-2(Batch 4) + rouge-su4(Batch 4)), for each λ value. We repeated the same process for Batches 5 and 6, obtaining rouge(Batch 5) and rouge(Batch 6), for each λ value. Finally, we computed (and show in Fig. 3) the average 1/3 · (rouge(Batch 4) + rouge(Batch 5) + rouge(Batch 6)), for each λ value.

[Fig. 3. Average rouge-2 and rouge-su4 scores on Batches 4–6 of bioasq1 and bioasq2, for different λ values, each time using the five other batches of Batches 1–6 for training. The plot shows one curve per system (ilp-sum-0, ilp-sum-1, ilp-sum-2), with λ on the horizontal axis and the average rouge score on the vertical axis.]

Figure 3 shows that ilp-sum-2 performs better than ilp-sum-1, which in turn outperforms ilp-sum-0. The differences in the rouge scores are larger for greater values of λ, because greater λ values place more emphasis on the rel(s_i) scores returned by the svr, which are affected by the different feature sets of the three systems. For λ > 0.8, the rouge scores decline, because the systems place too much emphasis on avoiding redundant sentences. The best of the three systems, ilp-sum-2, achieves its best performance for λ = 0.8.

In a second experiment, we compared ilp-sum-2, which is the best of our new versions that use the ilp model, against gr-sum-2, which uses the same features, but the greedy approach instead of the ilp model. We set λ = 0.8 in ilp-sum-2, based on Fig. 3. In gr-sum-2, we set the cosine similarity threshold (Section 2.1) to t = 0.4, based on Galanis et al. [7, 8]. Figure 4 shows the average rouge-2 and rouge-su4 score of each system on Batches 7 and 8, using an increasingly larger training dataset, consisting of Batches 1–3, 1–4, 1–5, or 1–6. A first observation is that ilp-sum-2 outperforms gr-sum-2. Moreover, it seems that both systems would benefit from more training data.

[Fig. 4. Average rouge-2 and rouge-su4 scores on Batches 7–8 of bioasq1 and bioasq2, using increasingly more of Batches 1–6 for training. The plot shows one curve per system (gr-sum-2, ilp-sum-2), with the batches used for training (1–3, 1–4, 1–5, 1–6) on the horizontal axis.]
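For clarity, the leave-one-batch-out averaging of the first experiment above can be summarized as in the sketch below; train, summarize, and rouge_scores are hypothetical placeholders for the corresponding steps of the pipeline and of the rouge evaluation, not actual bioasq tooling.

```python
# A schematic sketch of the lambda-tuning protocol of Section 3.1: for each
# lambda, hold out Batches 4-6 in turn, train on the five remaining batches of
# Batches 1-6, and average ROUGE(batch) = (ROUGE-2 + ROUGE-SU4) / 2.

def tune_lambda(batches_1_to_6, lambdas, train, summarize, rouge_scores):
    best_lam, best_avg = None, float("-inf")
    for lam in lambdas:
        per_batch = []
        for held_out in batches_1_to_6[3:]:                  # Batches 4, 5, 6
            training = [b for b in batches_1_to_6 if b is not held_out]
            model = train(training, lam)                     # hypothetical training step
            r2, su4 = rouge_scores(summarize(model, held_out), held_out)
            per_batch.append(0.5 * (r2 + su4))               # ROUGE(held-out batch)
        avg = sum(per_batch) / len(per_batch)                # average over Batches 4-6
        if avg > best_avg:
            best_lam, best_avg = lam, avg
    return best_lam
```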
3.2 Experiments on BioASQ3 data

In bioasq3, we participated with ilp-sum-2 (with λ = 0.8) and gr-sum-2 (with t = 0.4), both trained on all 8 batches of bioasq1 and bioasq2. Baseline 1 and Baseline 2, which are also versions of our own summarizer, were used again as the official baselines for ‘ideal’ answers, as in bioasq1 and bioasq2, i.e., without modifying their features or retraining them for biomedical data. The test dataset of bioasq3 contained five new batches, hereafter called bioasq3 Batches 1–5; these are different from Batches 1–8 of bioasq1 and bioasq2.

For each bioasq3 batch, Table 1 shows the rouge-2, rouge-su4, and average of rouge-2 and rouge-su4 scores of the four versions of our summarizer (ilp-sum-2, gr-sum-2, Baseline 1, Baseline 2), ordered by decreasing average of rouge-2 and rouge-su4. The results of the three other best (in terms of average rouge-2 and rouge-su4) participants per batch are also shown, as part-sys-1, part-sys-2, part-sys-3; part-sys-1 is not necessarily the same system in all batches, and similarly for part-sys-2 and part-sys-3 (the results of all the systems can be found at http://participants-area.bioasq.org/results/3b/phaseB/). The four versions of our summarizer are the best four systems in all five batches of Table 1.

Table 1. Results of four versions of our summarizer (ilp-sum-2, gr-sum-2, Baseline 1, Baseline 2) on the bioasq3 batches, along with the results of the three other best systems (part-sys-1, part-sys-2, part-sys-3) per batch. Baselines 1 and 2 were not retrained or otherwise modified for biomedical data. ilp-sum-2 and gr-sum-2 were trained on the datasets of bioasq1 and bioasq2. The total numbers of systems and teams that participated in each batch are shown in brackets.

bioasq3 Batch 1 (15 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
ilp-sum-2    0.4050   0.4213     0.4132
Baseline 1   0.4033   0.4217     0.4125
gr-sum-2     0.3829   0.4052     0.3941
Baseline 2   0.3604   0.3787     0.3696
part-sys-1   0.2940   0.3071     0.3006
part-sys-2   0.2934   0.3066     0.3000
part-sys-3   0.2929   0.3069     0.2999

bioasq3 Batch 2 (16 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
Baseline 1   0.4657   0.4860     0.4759
Baseline 2   0.4201   0.4493     0.4347
ilp-sum-2    0.4071   0.4460     0.4266
gr-sum-2     0.3934   0.4249     0.4092
part-sys-1   0.3597   0.3770     0.3684
part-sys-2   0.3561   0.3742     0.3652
part-sys-3   0.3523   0.3710     0.3617

bioasq3 Batch 3 (17 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
ilp-sum-2    0.4843   0.5155     0.4999
Baseline 2   0.4586   0.4806     0.4696
gr-sum-2     0.4482   0.4756     0.4619
Baseline 1   0.4396   0.4661     0.4529
part-sys-1   0.3834   0.3950     0.3892
part-sys-2   0.3836   0.3941     0.3889
part-sys-3   0.3796   0.3906     0.3851

bioasq3 Batch 4 (17 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
Baseline 1   0.4742   0.4947     0.4845
ilp-sum-2    0.4718   0.4942     0.4830
gr-sum-2     0.4480   0.4708     0.4594
Baseline 2   0.4345   0.4506     0.4426
part-sys-1   0.3864   0.3906     0.3885
part-sys-2   0.3606   0.3711     0.3659
part-sys-3   0.3627   0.3684     0.3656

bioasq3 Batch 5 (17 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
Baseline 1   0.3947   0.4252     0.4100
ilp-sum-2    0.3698   0.4039     0.3869
gr-sum-2     0.3698   0.4039     0.3869
part-sys-1   0.3752   0.3945     0.3849
part-sys-2   0.3751   0.3910     0.3831
part-sys-3   0.3731   0.3930     0.3831
Baseline 2   0.3406   0.3766     0.3586

As in the experiments of Section 3.1, Table 1 shows that ilp-sum-2 consistently outperforms gr-sum-2. Similarly, Baseline 2 (which uses the greedy approach) performs better than Baseline 1 (which uses the ilp model) only in the third batch. It is also surprising that ilp-sum-2 and gr-sum-2 do not always perform better than Baselines 1 and 2, even though the former systems were tailored for biomedical data by modifying their features and retraining them on the datasets of bioasq1 and bioasq2. This may be due to the fact that Baseline 1 and Baseline 2 were trained on larger datasets than ilp-sum-2 and gr-sum-2 [7, 8]. Hence, training our summarizer on more data, even from another domain (news), may be more important than training it on data from the application domain (biomedical data, in the case of bioasq) and modifying its features. It would be interesting to check if the conclusions of Table 1 continue to hold when the systems are ranked by the manual evaluation scores (provided by biomedical experts) of their ‘ideal’ summaries, as opposed to using rouge scores. At the time this paper was written, the manual evaluation scores of the ‘ideal’ answers of bioasq3 had not been announced.

4 Conclusions and future work

We presented four new versions (ilp-sum-0, ilp-sum-1, ilp-sum-2, gr-sum-2) of an extractive question-focused multi-document summarizer that we used to construct ‘ideal’ answers (summaries) in bioasq3. The summarizer employs an svr to assign relevance scores to the sentences of the given relevant abstracts, and an ilp model or an alternative greedy strategy to select the most relevant sentences while avoiding redundant ones. The two official bioasq baselines for ‘ideal’ answers, Baseline 1 and Baseline 2, are also versions of the same summarizer; they use the ilp model and the greedy approach, respectively, but they were trained on news articles and their features are not always appropriate for biomedical data. By contrast, the four new versions were trained on data from bioasq1 and bioasq2. ilp-sum-0, ilp-sum-1, and ilp-sum-2 all use the ilp model, but ilp-sum-0 uses the original features of Baselines 1 and 2, ilp-sum-1 uses a slightly modified feature set, and ilp-sum-2 uses a more extensive feature set that includes features based on biomedical word embeddings. gr-sum-2 uses the same features as ilp-sum-2, but with the greedy mechanism.

A preliminary set of experiments on bioasq1 and bioasq2 data indicated that ilp-sum-2 performs better than ilp-sum-0 and ilp-sum-1, showing the importance of modifying the feature set.
ilp-sum-2 was also found to perform better than gr-sum-2, which uses the same feature set, showing the benefit of using the ilp model instead of the greedy approach. Our experiments also indicated that ilp-sum-2 and gr-sum-2 would probably benefit from more training data.

In bioasq3, we participated with ilp-sum-2 and gr-sum-2, tuned and trained on bioasq1 and bioasq2 data. Along with Baselines 1 and 2, which are also versions of our own summarizer, ilp-sum-2 and gr-sum-2 were the best four systems in terms of rouge scores in all five batches of bioasq3. Again, ilp-sum-2 consistently outperformed gr-sum-2, but surprisingly ilp-sum-2 and gr-sum-2 did not always perform better than Baselines 1 and 2. This may be due to the fact that Baselines 1 and 2 were trained on more data, suggesting that the size of the training set may be more important than improving the feature set or using data from the biomedical domain.

Future work could consider repairing, ordering, or aggregating the sentences of the ‘ideal’ answers, as already noted. The centroid vectors of ilp-sum-2 and gr-sum-2 could also be replaced by paragraph vectors [10] or by vectors obtained using recursive neural networks [19]. Another possible improvement could be to use metamap [2] (see http://metamap.nlm.nih.gov/), a tool that maps biomedical texts to concepts derived from umls. We could then compute new features that measure the similarity between a question and a sentence in terms of biomedical concepts.

Acknowledgements

The work of the first author was funded by the Athens University of Economics and Business Research Support Program 2014-2015, “Action 2: Support to Postdoctoral Researchers”.

References

1. Androutsopoulos, I., Lampouras, G., Galanis, D.: Generating natural language descriptions from OWL ontologies: the NaturalOWL system. Journal of Artificial Intelligence Research 48, 671–715 (2013)
2. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program.
In: Proceedings of the American Medical Informatics Association Symposium. pp. 18–20. Washington DC, USA (2001)
3. Athenikos, S., Han, H.: Biomedical question answering: A survey. Computer Methods and Programs in Biomedicine 99(1), 1–24 (2010)
4. Bauer, M., Berleant, D.: Usability survey of biomedical question answering systems. Human Genomics 6(1)(17) (2012)
5. Drucker, H., Burges, C.J., Kaufman, L., Smola, A., Vapnik, V.: Support vector regression machines. Advances in Neural Information Processing Systems 9, 155–161 (1997)
6. Filippova, K., Strube, M.: Sentence fusion via dependency graph compression. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 177–185. Honolulu, Hawaii (2008)
7. Galanis, D.: Automatic generation of natural language summaries. Ph.D. thesis, Department of Informatics, Athens University of Economics and Business (2012)
8. Galanis, D., Lampouras, G., Androutsopoulos, I.: Extractive multi-document summarization with integer linear programming and support vector regression. In: Proceedings of COLING 2012. pp. 911–926. Mumbai, India (2012)
9. Galanis, D., Malakasiotis, P.: AUEB at TAC 2008. In: Proceedings of the Text Analysis Conference. pp. 42–47. Gaithersburg, MD (2008)
10. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning. pp. 1188–1196. Beijing, China (2014)
11. Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710 (1966)
12. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, 211–225 (2015)
13. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the ACL workshop ‘Text Summarization Branches Out’. pp. 74–81. Barcelona, Spain (2004)
14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the Workshop at the International Conference on Learning Representations. Scottsdale, AZ, USA (2013)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the Conference on Neural Information Processing Systems. Lake Tahoe, NV (2013)
16. Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies. Atlanta, GA (2013)
17. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Doha, Qatar (2014)
18. Schilder, F., Kondadadi, R.: FastSum: Fast and accurate query-based multi-document summarization. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies, Short Papers. pp. 205–208. Columbus, Ohio (2008)
19. Socher, R., Huval, B., Manning, C.D., Ng, A.Y.: Semantic compositionality through recursive matrix-vector spaces. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 1201–1211. Jeju Island, Korea (2012)
20. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artieres, T., Ngonga, A., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., Paliouras, G.: An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16(138) (2015)