Biomedical Question Answering using the YodaQA System: Prototype Notes

Petr Baudiš and Jan Šedivý
Dept. of Cybernetics, Czech Technical University, Technická 2, Praha, Czech Republic
baudipet@fel.cvut.cz

Abstract. We briefly outline the YodaQA open domain question answering system and its initial adaptation to the biomedical domain for the purposes of the BIOASQ challenge (question answering task 3B) at CLEF 2015.

Keywords: Question answering, linked data, natural language processing, bioinformatics.

1 Introduction

The YodaQA system for open domain factoid English question answering has been published recently [1,2]. The system is a fully open source, modular pipeline inspired by the IBM Watson DeepQA system [3]. So far, the system has not been specialized for any particular domain; this work represents the first such effort.

The BIOASQ challenge [4] aims at semantic indexing and question answering in the biomedical domain using a variety of knowledge bases from that domain. BIOASQ 2015 comprises tasks 3A and 3B; task 3A concerns semantic indexing and task 3B concerns question answering, in which we participated. Task 3B is further split into phase A (information retrieval) and phase B (answer production); the phases are evaluated separately, with gold standard phase A results (documents, snippets and triples) available for phase B. Our system participated in the phase B evaluation.

The paper is structured as follows. In Sec. 2, we briefly outline the YodaQA system in its original form. In Sec. 3, we discuss the changes made to the system for the biomedical domain. In Sec. 4, we review the system performance.

2 YodaQA Summary

The YodaQA pipeline is implemented mainly in Java, using the Apache UIMA framework [5]. A detailed technical description of the pipeline is included in a technical report [1]. The system maps an input question to an ordered list of answer candidates in a pipeline fashion, encompassing the following stages:

– Question Analysis extracts natural language features from the input and produces in-system representations of the question. We currently build just a naive representation of the question as a bag-of-features. The most important characterization of the question is a set of clues (keywords, keyphrases and concept clues that exactly match enwiki titles) and possible lexical answer types.

– Answer Production generates a set of candidate answers based on the question, typically by performing a Primary Search in the knowledge bases according to the question clues and either directly using search results as candidate answers or filtering relevant passages from the result text (the Passage Extraction) and generating candidate answers from the filtered passages (the Passage Analysis). Answers are produced from text passages by a simple strategy of considering all named entities and noun phrases.

– Answer Analysis generates various answer features based on detailed analysis. Most importantly, this concerns lexical type determination and coercion to the question type. Other features include distance from clues in passages or text overlap with clues.

– Answer Merging and Scoring consolidates the set of answers, removing duplicates and using a machine-learned classifier (logistic regression) to score answers by their features; a minimal sketch of this scoring step is given after the list.
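To make the scoring stage concrete, the sketch below shows, in plain Java (the language the pipeline is written in), how a logistic-regression classifier can turn a candidate answer's feature vector into a confidence score. The class name, feature names and weight values are illustrative assumptions for this note, not the actual YodaQA code; in the real pipeline the weights are learned from training data rather than fixed by hand.

  import java.util.Map;

  public class AnswerScorer {
      /** Logistic-regression score: sigmoid of a weighted sum of answer features. */
      public static double score(Map<String, Double> features,
                                  Map<String, Double> weights, double bias) {
          double z = bias;
          for (Map.Entry<String, Double> f : features.entrySet())
              z += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
          return 1.0 / (1.0 + Math.exp(-z));  // estimated probability the answer is correct
      }

      public static void main(String[] args) {
          // Hypothetical feature values for one candidate answer (names are assumptions).
          Map<String, Double> features = Map.of(
                  "typeCoercionMatch", 1.0,   // answer LAT coerced to the question type
                  "clueOverlap", 0.4,         // text overlap with question clues
                  "passageDistance", 0.2);    // normalized distance from clues in the passage
          // Hand-picked weights for illustration only; YodaQA learns these from data.
          Map<String, Double> weights = Map.of(
                  "typeCoercionMatch", 2.1,
                  "clueOverlap", 1.3,
                  "passageDistance", -0.8);
          System.out.printf("score = %.3f%n", score(features, weights, -1.0));
      }
  }

The scored candidates are then sorted by this value to produce the final answer ranking.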
3 YodaQA Domain Adaptation

We made a variety of adjustments to fit our end-to-end pipeline to the BIOASQ task. The changes are available within the public open source code base (https://github.com/brmson/yodaqa) in the d/clef2015-bioasq branch.

As a minor technical change, we enhanced our question analysis for imperative and otherwise specifically phrased questions, which were uncommon in our TREC-based open domain dataset.

Our system is designed to answer just factoid questions, while the BIOASQ challenge also includes list and yes-no questions. For list questions, we simply use the top 5 answer candidates returned by the system. Since we have not implemented any textual entailment algorithm yet and there was no easy way to skip yes-no questions, we simply answer yes to all of them, as that answer was significantly more prevalent in the training dataset.

Similarly, our system is designed to return narrow answers, not sentence-length answers that include background and justification. Therefore, we always return an empty string as the ideal answer and supply our answers as the exact answer. However, the BIOASQ definition of exact answers might be more stringent than ours, which simply requires that the gold standard answer is a sub-string of the produced answer (e.g. “the red color” would be acceptable for the gold standard “red” in our scenario). Therefore, we modified our system to require exact matches during training, and disabled a heuristic in answer analysis which attempts to find a focus word in the answer and run the analysis (such as title lookup and type coercion) on it instead of on the whole answer.

As we focus on phase B of the question answering task, we do not perform an explicit primary search on questions and instead base answer production on the search result snippets (generated by phase A) supplied along with the question on program input; these snippets are equivalent to the passages our primary search would produce. (We did not participate in phase A due to a lack of resources on our side.)

To further improve the accuracy of the system, we implemented an enhanced answer production strategy (inspired by the Jacana QA system) that approaches the problem of identifying the answer in a text passage in a way similar to named entity recognition: as a (token) sequence tagging task (with begin-inside-outside labels) that uses a conditional random field model to predict the labels [6]. However, we use a significantly simplified feature set: just part-of-speech tags, named entity labels and dependency labels, as token sequence unigrams, bigrams and trigrams.

A crucial feature for scoring answers is information on successful type coercion. This involves identification of the Lexical Answer Type (LAT) in the question (typically an easy task using a few fixed heuristics), production of a set of LATs describing the answer, and question-answer type coercion using straight string matching and WordNet [7] hypernymy relations. In an open domain QA system, the most useful LAT source is looking up the answer as a Wikipedia article by title and using the category information (in practice, the DBpedia [8] rdf:type ontology can be leveraged for this task). Since this may fail for the specialized terminology of the biomedical domain, we also look up the answer in the Gene Ontology [9] using the GOLR endpoint (matching any of the bioentity, bioentity label, bioentity name and synonym fields) and use the type field as the answer LAT; this is successful in correctly identifying answers such as gene and protein names.
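As an illustration of this lookup, the sketch below queries a GOLR (Solr) service for an answer string over the fields named above and requests the type field, which would then serve as the answer LAT. The base URL, the underscore spelling of the field names and the example gene name are assumptions made for this sketch, not a description of the exact YodaQA integration.

  import java.net.URI;
  import java.net.URLEncoder;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.charset.StandardCharsets;

  public class GolrTypeLookup {
      // Assumed base URL of a GOLR Solr select handler; adjust to the deployment actually used.
      private static final String GOLR = "http://golr.geneontology.org/solr/select";

      /** Returns the raw JSON response for an answer string; the "type" field of the
       *  top hits would be used as the answer LAT. Field names are assumed to use
       *  underscores (bioentity_label etc.) as in a typical Solr schema. */
      public static String lookup(String answer) throws Exception {
          String quoted = "\"" + answer.replace("\"", "\\\"") + "\"";
          String q = String.join(" OR ",
                  "bioentity:" + quoted,
                  "bioentity_label:" + quoted,
                  "bioentity_name:" + quoted,
                  "synonym:" + quoted);
          String url = GOLR + "?wt=json&rows=3&fl=type,bioentity_label"
                  + "&q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);
          HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
          return HttpClient.newHttpClient()
                  .send(req, HttpResponse.BodyHandlers.ofString())
                  .body();
      }

      public static void main(String[] args) throws Exception {
          // Example query for a (hypothetical) gene-name answer.
          System.out.println(lookup("BRCA1"));
      }
  }

A successful lookup of this kind yields LATs such as gene or protein, which then feed into the type coercion feature described above.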
4 Results

To evaluate the performance of our system, we split the (randomly reshuffled) reference training dataset provided for Task 3B into a local dev/train set (100 questions) and a test set (100 questions); the rest of the questions were left unused, as we opted for a smaller dataset because the Gene Ontology access was causing quite a slow-down. For comparison, we also include the baseline version performance on the “curated” factoid open domain dataset [2]; this figure is measured using the open domain metric, which permits gold standard substrings (see above). The results are summarized in Table 1.

Table 1. Benchmark results of various pipeline variants on the test split of the dataset. MRR is the Mean Reciprocal Rank, (1/|Q|) Σ_{q∈Q} 1/r_q, where r_q is the rank of the first correct answer for question q.

Pipeline                    Recall  Accuracy-at-1  MRR
final                       33.0%   10.0%          0.132
w/o GeneOntology            33.0%    8.0%          0.120
w/o G.O., CRF               33.0%    5.5%          0.114
final, ignoring yes/no q.   43.5%   10.1%          0.148
open domain                 79.3%   32.6%          0.420

References

1. Baudiš, P.: YodaQA: A Modular Question Answering System Pipeline. In: POSTER 2015 - 19th International Student Conference on Electrical Engineering
2. Baudiš, P.: YodaQA: A Modular Question Answering System Pipeline. In: Sixth International Conference of the CLEF Association, CLEF'15, Toulouse, September 8-11, 2015. Volume 9283 of LNCS, Springer (2015)
3. Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A.A., Lally, A., Murdock, J.W., Nyberg, E., Prager, J., et al.: Building Watson: An overview of the DeepQA project. AI Magazine 31(3) (2010) 59-79
4. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M.R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., et al.: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16(1) (2015) 138
5. Ferrucci, D., Lally, A.: UIMA: An architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng. 10(3-4) (September 2004) 327-348
6. Yao, X., Van Durme, B., et al.: Answer extraction as sequence tagging with tree edit distance. In: HLT-NAACL (2013) 858-867
7. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38(11) (1995) 39-41
8. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., et al.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web (2014)
9. Gene Ontology Consortium, et al.: Gene Ontology Consortium: going forward. Nucleic Acids Research 43(D1) (2015) D1049-D1056