Biomedical question-focused multi-document summarization: ILSP and AUEB at BioASQ3

Prodromos Malakasiotis1,2, Emmanouil Archontakis1, Ion Androutsopoulos1,2, Dimitrios Galanis2, and Harris Papageorgiou2

1 Dept. of Informatics, Athens University of Economics and Business, Greece
rulller@aueb.gr, man.arcon@gmail.com, ion@aueb.gr
http://nlp.cs.aueb.gr/
2 Institute for Language and Speech Processing, Research Center ‘Athena’, Greece
malakasiotis@ilsp.gr, galanisd@ilsp.gr, xaris@ilsp.gr
http://www.ilsp.gr/

Abstract. Question answering systems aim to find answers to natural language questions by searching in document collections (e.g., repositories of scientific articles or the entire Web) and/or structured data (e.g., databases, ontologies). Strictly speaking, the answer to a question might sometimes be simply ‘yes’ or ‘no’, a named entity, or a set of named entities. In practice, however, a more elaborate answer is often also needed, ideally a summary of the most important information from relevant documents and structured data. In this paper, we focus on generating summaries from documents that are known to be relevant to particular questions. We describe the joint participation of AUEB and ILSP in the corresponding subtask of the bioasq3 competition, where participants produce multi-document summaries of given biomedical articles that are relevant to English questions prepared by biomedical experts.

Keywords: biomedical question answering, text summarization

1 Introduction

Biomedical experts are extremely short of time. They also need to keep up with scientific developments happening at a pace that is probably faster than in any other science. The online biomedical bibliographic database PubMed currently comprises approximately 21 million references and was growing at a rate often exceeding 20,000 articles per week in 2011 (see http://www.ncbi.nlm.nih.gov/pubmed/). Figure 1 shows the number of biomedical articles indexed by PubMed per year since 1964. Rich sources of structured biomedical information, like the Gene Ontology, umls, or Diseasome, are also available (see http://www.geneontology.org/, http://www.nlm.nih.gov/research/umls/, and http://diseasome.eu/). Obtaining sufficient and concise answers from this wealth of information is a challenging task for traditional search engines, which instead of answers return lists of (possibly) relevant documents that the experts themselves have to study. Consequently, there is growing interest in biomedical question answering (QA) systems [3, 4], which aim to produce more concise answers. To foster research in biomedical QA, the bioasq project (http://www.bioasq.org/) has been constructing benchmark datasets and evaluation services and organizing international biomedical QA competitions since 2012 [20].

[Fig. 1. Number of new PubMed articles (blue line) indexed per year over the period 1964-2013, and the respective logarithmic trend (red dashed line); the vertical axis shows published articles in thousands.]

Given a question expressed in natural language, QA systems aim to provide answers by searching in document collections (e.g., repositories of scientific articles or the entire Web) and/or structured data (e.g., databases, ontologies).
Strictly speaking, the answer to a question might sometimes be simply a ‘yes’ or ‘no’ (e.g., in biomedical questions like “Do CpG islands co-localize with transcription start sites?”), a named entity (e.g., in “What is the methyl donor of DNA (cytosine-5)-methyltransferases?”), or a set of named entities (e.g., in “Which species may be used for the biotechnological production of itaconic acid?”). Following the terminology of bioasq, we call short answers of this kind ‘exact’ answers. In practice, however, a more elaborate answer is often needed, ideally a paragraph summarizing the most important information from relevant documents and structured data; bioasq calls answers of this kind ‘ideal’ answers. In this paper, we focus on generating ‘ideal’ answers (summaries) from documents that are known to be relevant to particular questions. We describe our participation in the corresponding subtask of the bioasq3 competition (Task 3b, Phase B, generation of ‘ideal’ answers), where the participants produce summaries of biomedical articles that are relevant to English questions prepared by biomedical experts. In this particular subtask, the input is a question along with the PubMed articles that a biomedical expert identified as relevant to the question; in effect, a perfect search engine is assumed (see Fig. 2). More precisely, in bioasq3 only the abstracts of the articles were available; hence, we summarize sets of abstracts (one set per question). We also note that the abstracts contain annotations showing the snippets (one or more consecutive sentences each) that the biomedical experts considered most relevant to the corresponding questions. We do not use the snippet annotations of the experts, since our system includes its own mechanisms to assess the importance of each sentence. Hence, our system may be at a disadvantage compared to systems that use the snippet annotations of the experts. Nevertheless, the experimental results we present indicate that it still performs better than its competitors.

[Fig. 2. Using QA, multi-document summarization, and concept-to-text generation to produce ‘exact’ and ‘ideal’ answers to English biomedical questions. The blue box indicates the focus of our participation in bioasq3. We did not consider rdf triples. The figure shows the example question “Do CpG islands co-localize with transcription start sites?” being turned into a query (e.g., “CpG islands” AND “transcription start sites”) for a search engine, which returns documents, rdf triples, etc.; QA, summarization, and nlg components then produce the ‘exact’ answer (“Yes.”) and the ‘ideal’ answer (summary): “Yes. It is generally known that the presence of a CpG island around the TSS is related to the expression pattern of the gene. CGIs (CpG islands) often extend into downstream transcript regions. This provides an explanation for the observation that the exon at the 5' end of the transcript, flanked with the transcription start site, shows a remarkably higher CpG density than the downstream exons.”]

We also note that when relevant structured information is also available (e.g., rdf triples), concept-to-text natural language generation (nlg) [1] can also be used to produce ‘ideal’ answers or texts to be given as additional input documents to the summarizer. We did not consider nlg, however, since in bioasq3 the questions were not accompanied by manually selected (by the biomedical experts) relevant structured information, unlike bioasq1 and bioasq2, and we do not yet have mechanisms to select structured information automatically.
Section 2 below describes the different versions of the multi-document summarizer that we used. Section 3 reports our experimental results. Section 4 concludes and provides directions for future work.

2 Our question-focused multi-document summarizer

We now discuss how the ‘ideal’ answers (summaries) of our system are produced. Recall that for each question, a set of documents (article abstracts) known to be relevant to the question is given. Our system is an extractive summarizer, i.e., it includes in each summary sentences of the input documents, without rephrasing them. The summarizer attempts to select the sentences that are most relevant to the question, also trying to avoid including redundant sentences in the summary, i.e., pairs of sentences that convey the same information. bioasq restricts the maximum size of each ‘ideal’ answer to 200 words; including redundant sentences wastes space and is also penalized when experts manually assess the responses of the systems [20]. The summarizer does not attempt to repair (e.g., replace pronouns by their referents), order, or aggregate the selected sentences [6]; we leave these important issues for future work.

2.1 Baseline 1 and Baseline 2

As a starting point, we used the extractive summarizer of Galanis et al. [7, 8]. Two versions of the summarizer, known as Baseline 1 and Baseline 2, have been used as baselines for ‘ideal’ answers in all three years of the bioasq competition; Baseline 1 and Baseline 2 are the ilp2 and greedy-red methods, respectively, of Galanis et al. [8], and Baseline 2 had also participated in TAC 2008 [9]. Both versions employ a Support Vector Regression (svr) model [5] to assign a relevance score rel(s_i) to each sentence s_i of the relevant documents of a question q; we use the svr implementation of libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) with an rbf kernel and libsvm’s parameter tuning facilities. An svr learns a function f : R^n → R in order to predict a real value y_i ∈ R given a feature vector x_i ∈ R^n that represents an instance. In our case, x_i is a feature vector representing a sentence s_i of the relevant documents of a question q, and y_i is the relevance score of s_i. Consult Galanis et al. [7, 8] for a discussion of the features that were used in the svr of Baseline 1 and Baseline 2. During training, for each q we compute the rouge-2 and rouge-su4 scores [13] between each s_i and the gold ‘ideal’ answer of q (provided by an expert), and we take y_i to be the average of the rouge-2 and rouge-su4 scores. The motivation for using these scores is that they are the two most commonly used measures for the automatic evaluation of machine-generated summaries against gold ones. Roughly speaking, both measures compute the word bigram recall of the summary (or sentence) being evaluated against, possibly multiple, gold summaries. However, rouge-su4 also considers skip bigrams (pairs of words with other, ignored, intervening words) with a maximum distance of 4 words between the words of each skip bigram. Both measures have been found to correlate well with human judgements in extractive summarization [13]; hence, training a component (e.g., an svr) to predict the rouge score of each sentence can be particularly useful. Intuitively, a sentence with a high rouge score has a high overlap with the gold summaries; and since the gold summaries contain the sentences that human authors considered most important, a sentence with a high rouge score is most likely also important.
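To make the training signal concrete, the sketch below illustrates how per-sentence regression targets of this kind could be built and fed to an rbf-kernel svr. It is our own simplified illustration, not the system’s code: the set-based rouge-2 and rouge-su4 approximations ignore the clipping and unigram details of the official rouge toolkit, scikit-learn’s SVR (a libsvm wrapper) stands in for the libsvm setup of the paper, and featurize and the data layout are hypothetical.

```python
# A simplified sketch (not the system's code) of building per-sentence SVR
# training targets as the average of approximate ROUGE-2 and ROUGE-SU4 recall
# against the gold 'ideal' answer.

from itertools import combinations
import numpy as np
from sklearn.svm import SVR   # libsvm-based SVR; stands in for the libsvm setup of the paper

def bigrams(tokens):
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def skip_bigrams(tokens, max_skip=4):
    # word pairs at most max_skip positions apart (ROUGE-SU4-style skip bigrams)
    return {(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2) if j - i <= max_skip}

def recall(candidate, gold):
    return len(candidate & gold) / max(len(gold), 1)

def rouge_target(sentence_tokens, gold_tokens):
    r2 = recall(bigrams(sentence_tokens), bigrams(gold_tokens))
    su4 = recall(skip_bigrams(sentence_tokens), skip_bigrams(gold_tokens))
    return 0.5 * (r2 + su4)          # y_i: average of the two ROUGE scores

def train_relevance_svr(training_questions, featurize):
    # training_questions: dicts with the question, its relevant-document
    # sentences, and the gold 'ideal' answer (hypothetical data layout)
    X, y = [], []
    for q in training_questions:
        gold = q["ideal_answer"].split()
        for sentence in q["sentences"]:
            X.append(featurize(q["question"], sentence, q["documents"]))
            y.append(rouge_target(sentence.split(), gold))
    model = SVR(kernel="rbf")        # C and gamma would be tuned, as in the paper
    model.fit(np.array(X), np.array(y))
    return model
```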
Baseline 1 uses Integer Linear Programming (ilp) to jointly maximize the relevance and diversity (non-redundancy) of the selected sentences s_i, respecting at the same time the maximum allowed summary length; we use the implementation of the Branch and Cut algorithm of the gnu Linear Programming Kit (glpk, http://sourceforge.net/projects/winglpk/). The ilp model maximizes the following objective function:

\max_{b,x} \; \lambda \sum_{i=1}^{n} \frac{l_i}{l_{max}} \alpha_i x_i + (1 - \lambda) \sum_{i=1}^{|B|} \frac{b_i}{n}    (1)

subject to:

\sum_{i=1}^{n} l_i x_i \le l_{max}    (2)

\sum_{g_j \in B_i} b_j \ge |B_i| \, x_i, \quad \text{for } i = 1, \dots, n    (3)

\sum_{s_i \in S_j} x_i \ge b_j, \quad \text{for } j = 1, \dots, |B|    (4)

where α_i is the relevance score rel(s_i) of sentence s_i normalized in [0, 1]; l_i is the word length of s_i; l_max is the maximum allowed summary length in words; n is the number of input sentences (sentences in the given relevant documents); B is the set of all the word bigrams in the input sentences; x_i and b_i show which sentences s_i and word bigrams, respectively, are present in the summary; B_i is the set of word bigrams that occur in sentence s_i; g_j ranges over the word bigrams in B_i; and S_j is the set of sentences that contain bigram g_j. Constraint (2) ensures that the maximum allowed summary length is not exceeded. Constraint (3) ensures that if an input sentence is included in the summary, then all of its word bigrams are also included. Constraint (4) ensures that if a word bigram is included in the summary, then at least one sentence that contains it is also included. The first sum of Eq. 1 maximizes the total relevance of the selected sentences. The second sum maximizes the number of distinct bigrams in the summary, in effect minimizing the redundancy of the included sentences. Finally, λ ∈ [0, 1] controls how much the model tries to maximize the total relevance of the selected sentences at the expense of non-redundancy and vice versa. Consult Galanis et al. [7, 8] for a more detailed explanation of the ilp model.
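As an illustration of Eqs. (1)–(4), the following sketch expresses the same model with the PuLP modelling library and its default solver; it is our own illustration under assumptions, not the system’s implementation, which uses glpk and the svr-derived relevance scores. Tokenized sentences and normalized relevance scores are assumed to be given.

```python
# A sketch of the sentence-selection ILP of Eqs. (1)-(4), written with PuLP.
# `sentences` is a list of (tokens, relevance) pairs, with relevance already
# normalized to [0, 1]; l_max is the summary length limit in words.

from pulp import LpProblem, LpVariable, LpMaximize, lpSum

def ilp_summary(sentences, l_max=200, lam=0.8):
    n = len(sentences)
    lengths = [len(tokens) for tokens, _ in sentences]
    alphas = [rel for _, rel in sentences]

    # word bigrams of each sentence, and the set B of all distinct bigrams
    bigrams_of = [{(a, b) for a, b in zip(tokens, tokens[1:])} for tokens, _ in sentences]
    all_bigrams = sorted(set().union(*bigrams_of))
    index = {g: j for j, g in enumerate(all_bigrams)}

    x = [LpVariable(f"x_{i}", cat="Binary") for i in range(n)]                 # sentence selected?
    b = [LpVariable(f"b_{j}", cat="Binary") for j in range(len(all_bigrams))]  # bigram in summary?

    prob = LpProblem("question_focused_summary", LpMaximize)

    # Eq. (1): relevance of the selected sentences plus distinct-bigram coverage
    prob += lam * lpSum((lengths[i] / l_max) * alphas[i] * x[i] for i in range(n)) \
            + ((1 - lam) / n) * lpSum(b)

    # Eq. (2): respect the maximum summary length
    prob += lpSum(lengths[i] * x[i] for i in range(n)) <= l_max

    # Eq. (3): a selected sentence brings all of its bigrams into the summary
    for i in range(n):
        prob += lpSum(b[index[g]] for g in bigrams_of[i]) >= len(bigrams_of[i]) * x[i]

    # Eq. (4): a bigram counts only if some sentence containing it is selected
    for g, j in index.items():
        prob += lpSum(x[i] for i in range(n) if g in bigrams_of[i]) >= b[j]

    prob.solve()
    return [i for i in range(n) if x[i].value() == 1]
```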
Baseline 2 first uses the trained svr to rank the sentences s_i of the relevant documents of q by decreasing relevance rel(s_i). It then greedily examines each s_i from highest to lowest rel(s_i). If the cosine similarity between s_i and any of the sentences that have already been added to the summary exceeds a threshold t, then s_i is discarded; the cosine similarity is computed by representing each sentence as a bag of words (using Boolean features), and t is tuned on development data. Otherwise, if s_i fits in the remaining available summary space, it is added to the summary; if it does not fit, the summary construction process stops.

Baselines 1 and 2 were trained on news articles, as discussed by Galanis et al. [7, 8], and were used in bioasq without retraining and without modifying the features of their svr. However, there are many differences between news and biomedical articles, and many of the features that were used in the svr of Baselines 1 and 2 are irrelevant to biomedical articles. For example, Baselines 1 and 2 use a feature that counts the names of organizations, persons, etc. in sentence s_i, as identified by a named entity recognizer that does not support biomedical entity types (e.g., names of genes, diseases). They also use a feature that considers the order of s_i in the document it was extracted from, based on the intuition that news articles usually list the most important information first, a convention that does not always hold in biomedical abstracts.

Hence, we also experimented with modified versions of Baselines 1 and 2, discussed below, which were trained on bioasq datasets and used different feature sets.

2.2 The ILP-SUM-0 and ILP-SUM-1 summarizers

The first new version of our summarizer, called ilp-sum-0, is the same as Baseline 1 (the baseline that uses ilp, with the same features in its svr), but was trained on bioasq data, as discussed in Section 3 below. Another version, ilp-sum-1, is the same as ilp-sum-0 and was also trained on bioasq data, but it uses a different feature set in its svr, still close to the features of Baselines 1 and 2 [7, 8], but modified for biomedical questions and articles. The features of ilp-sum-1 are listed below; all the features of all the versions of the summarizer, including Baselines 1 and 2, are normalized in [0, 1]. Features (1.5) and (1.7) are also illustrated in the code sketch after this list.

(1.1) Word overlap: The number of common words between the question q and each sentence s_i of the relevant documents of q, after removing stop words and duplicate words from q and s_i.

(1.2) Stemmed word overlap: The same as Feature (1.1), but the words of q and s_i are stemmed, after removing stop words.

(1.3) Levenshtein distance: The Levenshtein distance [11] between q and s_i, taking insertions, deletions, and replacements to operate on entire words.

(1.4) Stemmed Levenshtein distance: The same as Feature (1.3), but the words of q and s_i are stemmed before computing the Levenshtein distance.

(1.5) Content word frequency: The average frequency CF(s_i) of the content words of sentence s_i in the relevant documents of q, as defined by Schilder and Kondadadi [18]:

CF(s_i) = \frac{\sum_{j=1}^{c(s_i)} p_c(w_j)}{c(s_i)}

where c(s_i) is the number of content words in sentence s_i, p_c(w_j) = m / M, m is the number of occurrences of content word w_j in the relevant documents of q, and M is the total number of content word occurrences in the relevant documents of q.

(1.6) Stemmed content word frequency: The same as Feature (1.5), but the content words of the relevant documents of q (and their sentences s_i) are stemmed before computing CF(s_i).

(1.7) Document frequency: The average document frequency of the content words of sentence s_i in the relevant documents of q, as defined by Schilder and Kondadadi [18]:

DF(s_i) = \frac{\sum_{j=1}^{c(s_i)} p_d(w_j)}{c(s_i)}

where p_d(w_j) = d / D, d is the number of relevant documents of q that contain the content word w_j, and D is the number of relevant documents of q.

(1.8) Stemmed document frequency: The same as Feature (1.7), but the content words of the relevant documents of q (and their sentences s_i) are stemmed before computing DF(s_i).
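The sketch below is our own illustration of the CF(s_i) and DF(s_i) scores defined above; the whitespace tokenizer and the tiny stop-word list are placeholder assumptions, since the exact preprocessing is not specified here.

```python
# A minimal sketch of the content word frequency (Feature 1.5) and document
# frequency (Feature 1.7) scores of Schilder and Kondadadi [18].

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "is", "are", "to", "with"}  # toy list

def content_words(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def cf_and_df(sentence, relevant_docs):
    """CF(s): average p_c(w) = m/M of the sentence's content words.
       DF(s): average p_d(w) = d/D of the sentence's content words."""
    corpus_counts = Counter(w for doc in relevant_docs for w in content_words(doc))
    M = sum(corpus_counts.values())            # total content-word occurrences in the documents
    doc_counts = Counter(w for doc in relevant_docs for w in set(content_words(doc)))
    D = len(relevant_docs)                     # number of relevant documents of the question

    words = content_words(sentence)
    if not words or M == 0 or D == 0:
        return 0.0, 0.0
    cf = sum(corpus_counts[w] / M for w in words) / len(words)
    df = sum(doc_counts[w] / D for w in words) / len(words)
    return cf, df
```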
2.3 The ILP-SUM-2 and GR-SUM-2 summarizers

In recent years, continuous space vector representations of words, also known as word embeddings, have been found to capture several morphosyntactic and semantic properties of words [12, 14–17]. bioasq employed the popular word2vec tool [14–16] to construct embeddings for a vocabulary of 1,701,632 words occurring in biomedical texts, using a corpus of 10,876,004 English abstracts of biomedical articles from PubMed (see https://code.google.com/p/word2vec/ and http://bioasq.lip6.fr/tools/BioASQword2vec/ for further details). The ilp-sum-2 and gr-sum-2 versions of our summarizer use the following features in their svr, which are based on the bioasq word embeddings, in addition to Features (1.1)–(1.8) of ilp-sum-1; a code sketch of these features follows the list. ilp-sum-2 also uses the ilp model (like Baseline 1, ilp-sum-0, and ilp-sum-1), whereas gr-sum-2 uses the greedy approach of Baseline 2 instead (see Section 2.1).

(2.1) Euclidean similarity of centroids: This is computed as:

ES(q, s_i) = \frac{1}{1 + ED(\vec{q}, \vec{s_i})}    (5)

where \vec{q} and \vec{s_i} are the centroid vectors of q and s_i, respectively, defined below, and ED(\vec{q}, \vec{s_i}) is the Euclidean distance between \vec{q} and \vec{s_i}. The centroid \vec{t} of a text t (question or sentence) is computed as:

\vec{t} = \frac{1}{|t|} \sum_{i=1}^{|t|} \vec{w_i} = \frac{\sum_{j=1}^{|V|} \vec{w_j} \cdot TF(w_j, t)}{\sum_{j=1}^{|V|} TF(w_j, t)}    (6)

where |t| is the number of words (tokens) in t, \vec{w_i} is the embedding (vector) of the i-th word (token) of t, |V| is the number of (distinct) words in the vocabulary, and TF(w_j, t) is the term frequency (number of occurrences) of the j-th vocabulary word in the text t. Tokens for which we have no embeddings are ignored when computing the features of this section.

(2.2) Euclidean similarity of IDF-weighted centroids: The same as Feature (2.1), except that the centroid of a text t (question or sentence) now also takes into account the inverse document frequencies of the words in t:

\vec{t} = \frac{\sum_{j=1}^{|V|} \vec{w_j} \cdot TF(w_j, t) \cdot IDF(w_j)}{\sum_{j=1}^{|V|} TF(w_j, t) \cdot IDF(w_j)}    (7)

where IDF(w_j) = \log \frac{|D|}{|D(w_j)|}, |D| = 10,876,004 is the total number of abstracts in the corpus the word embeddings were obtained from, and |D(w_j)| is the number of those abstracts that contain the word w_j.

(2.3) Pairwise Euclidean similarities: To compute this set of features (8 features in total), we create two bags, one with the tokens (word occurrences) of the question q and one with the tokens of the sentence s_i. We then compute the similarity ES(w, w′) (as in Eq. 5) for every pair of tokens w, w′ of q and s_i, respectively, and we construct the following features:
- the average of the similarities ES(w, w′), for all the token pairs w, w′ of q and s_i, respectively,
- the median of the similarities ES(w, w′),
- the maximum similarity ES(w, w′),
- the average of the two largest similarities ES(w, w′),
- the average of the three largest similarities ES(w, w′),
- the minimum similarity ES(w, w′),
- the average of the two smallest similarities ES(w, w′),
- the average of the three smallest similarities ES(w, w′).

(2.4) IDF-weighted pairwise Euclidean similarities: The same set of features (8 features) as Features (2.3), but the Euclidean similarity ES(w, w′) of each pair of tokens w, w′ is multiplied by IDF(w) · IDF(w′) / maxidf^2 to reward pairs with high idf scores. The idf scores are computed as in Feature (2.2), and maxidf is the maximum idf score of the words we have embeddings for.
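The following sketch is our own illustration of Features (2.1)–(2.3), not the system’s code; it assumes a dictionary emb mapping in-vocabulary words to their word2vec vectors and a dictionary idf of idf scores.

```python
# A sketch of the embedding-based features: the Euclidean similarity of
# (optionally IDF-weighted) centroids, Eqs. (5)-(7), and the eight pairwise
# similarity statistics of Features (2.3). Out-of-vocabulary tokens are skipped.

import numpy as np

def euclidean_sim(u, v):                                   # Eq. (5)
    return 1.0 / (1.0 + np.linalg.norm(u - v))

def centroid(tokens, emb, idf=None):                       # Eq. (6), or Eq. (7) if idf is given
    vectors, weights = [], []
    for w in tokens:
        if w in emb:
            vectors.append(emb[w])
            weights.append(idf.get(w, 1.0) if idf else 1.0)
    if not vectors:
        return None
    return np.average(np.stack(vectors), axis=0, weights=weights)

def centroid_similarity(question_tokens, sentence_tokens, emb, idf=None):   # Features (2.1)/(2.2)
    q, s = centroid(question_tokens, emb, idf), centroid(sentence_tokens, emb, idf)
    return euclidean_sim(q, s) if q is not None and s is not None else 0.0

def pairwise_similarity_features(question_tokens, sentence_tokens, emb):    # Features (2.3)
    sims = sorted(euclidean_sim(emb[w], emb[v])
                  for w in question_tokens if w in emb
                  for v in sentence_tokens if v in emb)
    if not sims:
        return [0.0] * 8
    return [float(np.mean(sims)), float(np.median(sims)),
            sims[-1], float(np.mean(sims[-2:])), float(np.mean(sims[-3:])),
            sims[0], float(np.mean(sims[:2])), float(np.mean(sims[:3]))]
```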
3 Experimental results

We used the datasets of bioasq1 and bioasq2 to train and tune the four new versions of our summarizer (ilp-sum-0, ilp-sum-1, ilp-sum-2, gr-sum-2). We then used the dataset of bioasq3 to test the two best new versions of our summarizer (ilp-sum-2, gr-sum-2) on unseen data, and to compare them against Baseline 1, Baseline 2, and the other systems that participated in bioasq3.

3.1 Experiments on BioASQ1 and BioASQ2 data

The bioasq1 and bioasq2 datasets consist of 3 and 5 batches, respectively, called Batches 1–3 and Batches 4–8 in this section. Each batch contains approximately 100 questions, along with relevant documents and ‘ideal’ answers provided by the biomedical experts.

In a first experiment, we aimed to tune the λ parameter of ilp-sum-0, ilp-sum-1, and ilp-sum-2, which use the ilp model of Section 2.1, and to compare the three systems. Figure 3 shows the average rouge scores of the three systems on Batches 4–6, for different values of λ, each time using the remaining batches of Batches 1–6 to train them (i.e., to train their svrs); Batches 7–8 were reserved for another experiment, discussed below. In more detail, we first computed the rouge-2 and rouge-su4 scores on Batch 4, training the systems on Batches 1–3 and 5–6. We then computed the average of the rouge-2 and rouge-su4 scores of Batch 4, i.e., rouge(Batch 4) = 1/2 · (rouge-2(Batch 4) + rouge-su4(Batch 4)), for each λ value. We repeated the same process for Batches 5 and 6, obtaining rouge(Batch 5) and rouge(Batch 6), for each λ value. Finally, we computed (and show in Fig. 3) the average 1/3 · (rouge(Batch 4) + rouge(Batch 5) + rouge(Batch 6)), for each λ value.

[Fig. 3. Average rouge-2 and rouge-su4 scores on Batches 4–6 of bioasq1 and bioasq2, for different λ values, each time using the five other batches of Batches 1–6 for training. The plot shows one curve per system (ilp-sum-0, ilp-sum-1, ilp-sum-2), with λ on the horizontal axis and the average rouge score on the vertical axis.]

Figure 3 shows that ilp-sum-2 performs better than ilp-sum-1, which in turn outperforms ilp-sum-0. The differences in the rouge scores are larger for greater values of λ, because greater λ values place more emphasis on the rel(s_i) scores returned by the svr, which are affected by the different feature sets of the three systems. For λ > 0.8, the rouge scores decline, because the systems place too much emphasis on avoiding redundant sentences. The best of the three systems, ilp-sum-2, achieves its best performance for λ = 0.8.

In a second experiment, we compared ilp-sum-2, which is the best of our new versions that use the ilp model, against gr-sum-2, which uses the same features, but the greedy approach instead of the ilp model. We set λ = 0.8 in ilp-sum-2, based on Fig. 3. In gr-sum-2, we set the cosine similarity threshold (Section 2.1) to t = 0.4, based on Galanis et al. [7, 8]. Figure 4 shows the average rouge-2 and rouge-su4 score of each system on Batches 7 and 8, using an increasingly larger training dataset, consisting of Batches 1–3, 1–4, 1–5, or 1–6. A first observation is that ilp-sum-2 outperforms gr-sum-2. Moreover, it seems that both systems would benefit from more training data.

[Fig. 4. Average rouge-2 and rouge-su4 scores on Batches 7–8 of bioasq1 and bioasq2, using increasingly more of Batches 1–6 for training. The plot shows one curve per system (gr-sum-2, ilp-sum-2), with the batches used for training (1–3, 1–4, 1–5, 1–6) on the horizontal axis.]
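For clarity, the leave-one-batch-out averaging of the first experiment above can be summarized as in the sketch below; train, summarize, and rouge_scores are hypothetical placeholders for the corresponding steps of the pipeline and of the rouge evaluation, not actual bioasq tooling.

```python
# A schematic sketch of the lambda-tuning protocol of Section 3.1: for each
# lambda, hold out Batches 4-6 in turn, train on the five remaining batches of
# Batches 1-6, and average ROUGE(batch) = (ROUGE-2 + ROUGE-SU4) / 2.

def tune_lambda(batches_1_to_6, lambdas, train, summarize, rouge_scores):
    best_lam, best_avg = None, float("-inf")
    for lam in lambdas:
        per_batch = []
        for held_out in batches_1_to_6[3:]:                  # Batches 4, 5, 6
            training = [b for b in batches_1_to_6 if b is not held_out]
            model = train(training, lam)                     # hypothetical training step
            r2, su4 = rouge_scores(summarize(model, held_out), held_out)
            per_batch.append(0.5 * (r2 + su4))               # ROUGE(held-out batch)
        avg = sum(per_batch) / len(per_batch)                # average over Batches 4-6
        if avg > best_avg:
            best_lam, best_avg = lam, avg
    return best_lam
```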
3.2 Experiments on BioASQ3 data

In bioasq3, we participated with ilp-sum-2 (with λ = 0.8) and gr-sum-2 (with t = 0.4), both trained on all 8 batches of bioasq1 and bioasq2. Baseline 1 and Baseline 2, which are also versions of our own summarizer, were used again as the official baselines for ‘ideal’ answers, as in bioasq1 and bioasq2, i.e., without modifying their features or retraining them for biomedical data. The test dataset of bioasq3 contained five new batches, hereafter called bioasq3 Batches 1–5; these are different from Batches 1–8 of bioasq1 and bioasq2.

For each bioasq3 batch, Table 1 shows the rouge-2, rouge-su4, and average of rouge-2 and rouge-su4 scores of the four versions of our summarizer (ilp-sum-2, gr-sum-2, Baseline 1, Baseline 2), ordered by decreasing average of rouge-2 and rouge-su4. The results of the three other best (in terms of average rouge-2 and rouge-su4) participants per batch are also shown, as part-sys-1, part-sys-2, part-sys-3; part-sys-1 is not necessarily the same system in all batches, and similarly for part-sys-2 and part-sys-3 (the results of all the systems can be found at http://participants-area.bioasq.org/results/3b/phaseB/). The four versions of our summarizer are the best four systems in all five batches of Table 1.

Table 1. Results of four versions of our summarizer (ilp-sum-2, gr-sum-2, Baseline 1, Baseline 2) on the bioasq3 batches, along with the results of the three other best systems (part-sys-1, part-sys-2, part-sys-3) per batch. Baselines 1 and 2 were not retrained or otherwise modified for biomedical data. ilp-sum-2 and gr-sum-2 were trained on the datasets of bioasq1 and bioasq2. The total numbers of systems and teams that participated in each batch are shown in brackets.

bioasq3 Batch 1 (15 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
ilp-sum-2    0.4050   0.4213     0.4132
Baseline 1   0.4033   0.4217     0.4125
gr-sum-2     0.3829   0.4052     0.3941
Baseline 2   0.3604   0.3787     0.3696
part-sys-1   0.2940   0.3071     0.3006
part-sys-2   0.2934   0.3066     0.3000
part-sys-3   0.2929   0.3069     0.2999

bioasq3 Batch 2 (16 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
Baseline 1   0.4657   0.4860     0.4759
Baseline 2   0.4201   0.4493     0.4347
ilp-sum-2    0.4071   0.4460     0.4266
gr-sum-2     0.3934   0.4249     0.4092
part-sys-1   0.3597   0.3770     0.3684
part-sys-2   0.3561   0.3742     0.3652
part-sys-3   0.3523   0.3710     0.3617

bioasq3 Batch 3 (17 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
ilp-sum-2    0.4843   0.5155     0.4999
Baseline 2   0.4586   0.4806     0.4696
gr-sum-2     0.4482   0.4756     0.4619
Baseline 1   0.4396   0.4661     0.4529
part-sys-1   0.3834   0.3950     0.3892
part-sys-2   0.3836   0.3941     0.3889
part-sys-3   0.3796   0.3906     0.3851

bioasq3 Batch 4 (17 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
Baseline 1   0.4742   0.4947     0.4845
ilp-sum-2    0.4718   0.4942     0.4830
gr-sum-2     0.4480   0.4708     0.4594
Baseline 2   0.4345   0.4506     0.4426
part-sys-1   0.3864   0.3906     0.3885
part-sys-2   0.3606   0.3711     0.3659
part-sys-3   0.3627   0.3684     0.3656

bioasq3 Batch 5 (17 systems, 6 teams)
System       rouge-2  rouge-su4  Avg.
Baseline 1   0.3947   0.4252     0.4100
ilp-sum-2    0.3698   0.4039     0.3869
gr-sum-2     0.3698   0.4039     0.3869
part-sys-1   0.3752   0.3945     0.3849
part-sys-2   0.3751   0.3910     0.3831
part-sys-3   0.3731   0.3930     0.3831
Baseline 2   0.3406   0.3766     0.3586

As in the experiments of Section 3.1, Table 1 shows that ilp-sum-2 consistently outperforms gr-sum-2. Similarly, Baseline 2 (which uses the greedy approach) performs better than Baseline 1 (which uses the ilp model) only in the third batch. It is also surprising that ilp-sum-2 and gr-sum-2 do not always perform better than Baselines 1 and 2, even though the former systems were tailored for biomedical data by modifying their features and retraining them on the datasets of bioasq1 and bioasq2. This may be due to the fact that Baseline 1 and Baseline 2 were trained on larger datasets than ilp-sum-2 and gr-sum-2 [7, 8]. Hence, training our summarizer on more data, even from another domain (news), may be more important than training it on data from the application domain (biomedical data, in the case of bioasq) and modifying its features. It would be interesting to check if the conclusions of Table 1 continue to hold when the systems are ranked by the manual evaluation scores (provided by biomedical experts) of their ‘ideal’ summaries, as opposed to using rouge scores. At the time this paper was written, the manual evaluation scores of the ‘ideal’ answers of bioasq3 had not been announced.

4 Conclusions and future work

We presented four new versions (ilp-sum-0, ilp-sum-1, ilp-sum-2, gr-sum-2) of an extractive question-focused multi-document summarizer that we used to construct ‘ideal’ answers (summaries) in bioasq3. The summarizer employs an svr to assign relevance scores to the sentences of the given relevant abstracts, and an ilp model or an alternative greedy strategy to select the most relevant sentences while avoiding redundant ones. The two official bioasq baselines for ‘ideal’ answers, Baseline 1 and Baseline 2, are also versions of the same summarizer; they use the ilp model and the greedy approach, respectively, but they were trained on news articles and their features are not always appropriate for biomedical data. By contrast, the four new versions were trained on data from bioasq1 and bioasq2. ilp-sum-0, ilp-sum-1, and ilp-sum-2 all use the ilp model, but ilp-sum-0 uses the original features of Baselines 1 and 2, ilp-sum-1 uses a slightly modified feature set, and ilp-sum-2 uses a more extensive feature set that includes features based on biomedical word embeddings. gr-sum-2 uses the same features as ilp-sum-2, but with the greedy mechanism.

A preliminary set of experiments on bioasq1 and bioasq2 data indicated that ilp-sum-2 performs better than ilp-sum-0 and ilp-sum-1, showing the importance of modifying the feature set.
ilp-sum-2 was also found to perform better than gr-sum-2, which uses the same feature set, showing the benefit of using the ilp model instead of the greedy approach. Our experiments also indicated that ilp-sum-2 and gr-sum-2 would probably benefit from more training data.

In bioasq3, we participated with ilp-sum-2 and gr-sum-2, tuned and trained on bioasq1 and bioasq2 data. Along with Baselines 1 and 2, which are also versions of our own summarizer, ilp-sum-2 and gr-sum-2 were the best four systems in terms of rouge scores in all five batches of bioasq3. Again, ilp-sum-2 consistently outperformed gr-sum-2, but surprisingly ilp-sum-2 and gr-sum-2 did not always perform better than Baselines 1 and 2. This may be due to the fact that Baselines 1 and 2 were trained on more data, suggesting that the size of the training set may be more important than improving the feature set or using data from the biomedical domain.

Future work could consider repairing, ordering, or aggregating the sentences of the ‘ideal’ answers, as already noted. The centroid vectors of ilp-sum-2 and gr-sum-2 could also be replaced by paragraph vectors [10] or by vectors obtained using recursive neural networks [19]. Another possible improvement could be to use metamap [2] (see http://metamap.nlm.nih.gov/), a tool that maps biomedical texts to concepts derived from umls. We could then compute new features that measure the similarity between a question and a sentence in terms of biomedical concepts.

Acknowledgements

The work of the first author was funded by the Athens University of Economics and Business Research Support Program 2014-2015, “Action 2: Support to Postdoctoral Researchers”.

References

1. Androutsopoulos, I., Lampouras, G., Galanis, D.: Generating natural language descriptions from OWL ontologies: the NaturalOWL system. Journal of Artificial Intelligence Research 48, 671–715 (2013)
2. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program.
In: Proceedings of the American Medical Informatics Association Symposium. pp. 18–20. Washington DC, USA (2001)
3. Athenikos, S., Han, H.: Biomedical question answering: A survey. Computer Methods and Programs in Biomedicine 99(1), 1–24 (2010)
4. Bauer, M., Berleant, D.: Usability survey of biomedical question answering systems. Human Genomics 6(1)(17) (2012)
5. Drucker, H., Burges, C.J., Kaufman, L., Smola, A., Vapnik, V.: Support vector regression machines. Advances in Neural Information Processing Systems 9, 155–161 (1997)
6. Filippova, K., Strube, M.: Sentence fusion via dependency graph compression. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 177–185. Honolulu, Hawaii (2008)
7. Galanis, D.: Automatic generation of natural language summaries. Ph.D. thesis, Department of Informatics, Athens University of Economics and Business (2012)
8. Galanis, D., Lampouras, G., Androutsopoulos, I.: Extractive multi-document summarization with integer linear programming and support vector regression. In: Proceedings of COLING 2012. pp. 911–926. Mumbai, India (2012)
9. Galanis, D., Malakasiotis, P.: AUEB at TAC 2008. In: Proceedings of the Text Analysis Conference. pp. 42–47. Gaithersburg, MD (2008)
10. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning. pp. 1188–1196. Beijing, China (2014)
11. Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710 (1966)
12. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, 211–225 (2015)
13. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the ACL workshop ‘Text Summarization Branches Out’. pp. 74–81. Barcelona, Spain (2004)
14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the Workshop at the International Conference on Learning Representations. Scottsdale, AZ, USA (2013)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the Conference on Neural Information Processing Systems. Lake Tahoe, NV (2013)
16. Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies. Atlanta, GA (2013)
17. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Doha, Qatar (2014)
18. Schilder, F., Kondadadi, R.: FastSum: Fast and accurate query-based multi-document summarization. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies, Short Papers. pp. 205–208. Columbus, Ohio (2008)
19. Socher, R., Huval, B., Manning, C.D., Ng, A.Y.: Semantic compositionality through recursive matrix-vector spaces. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 1201–1211. Jeju Island, Korea (2012)
20. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artieres, T., Ngonga, A., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., Paliouras, G.: An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16(138) (2015)