=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-QA-MaoEt2014
|storemode=property
|title=NCBI at the 2014 BioASQ Challenge Task: Large-scale Biomedical Semantic Indexing and Question Answering
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-QA-MaoEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/MaoWL14
}}
==NCBI at the 2014 BioASQ Challenge Task: Large-scale Biomedical Semantic Indexing and Question Answering==
NCBI at the 2014 BioASQ challenge task: large-scale biomedical semantic indexing and question answering

Yuqing Mao, Chih-Hsuan Wei, Zhiyong Lu*
National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
{yuqing.mao,chih-hsuan.wei,zhiyong.lu}@nih.gov

Abstract. In this paper we report our participation in the 2014 BioASQ challenge tasks on biomedical semantic indexing and question answering. For the biomedical semantic indexing task (Task 2a), where participating teams are provided with PubMed articles and asked to return relevant MeSH terms, we built on our previous learning-to-rank framework with a special focus on systematically incorporating the results of complementary methods for improved performance. For the question answering task (Task 2b), where teams are provided with natural language questions and asked to return responses in the form of documents, snippets, concepts and RDF triples (Phase A) and direct answers (Phase B), we relied on PubMed search engines and our state-of-the-art named entity recognition tools, such as DNorm and tmVar, in Phases A and B, respectively. The official challenge results demonstrate that we consistently performed better than the baseline approaches for Task 2a and Task 2b (Phase B), and ranked among the top-tier systems in the 2014 challenge.

Keywords: MeSH Indexing; Biomedical Semantic Indexing; Hierarchical Text Classification; Learning to Rank; Biomedical Question Answering.

1 Introduction

Over the past decade, a number of community-wide challenge evaluations have been held for various research topics in the biomedical natural language processing (BioNLP) field, such as document retrieval [1, 2], named entity recognition [3-5], and information extraction [6, 7]. Unlike other challenges such as BioCreative [8, 9], the BioASQ challenge (http://www.bioasq.org/) is a newly organized shared task with a unique focus on biomedical semantic indexing and question answering. As in the previous year [10], the BioASQ 2014 challenge consists of two tasks: automated semantic (MeSH) indexing (Task 2a) and question answering (Task 2b).

More specifically, for Task 2a, participating teams are provided with a set of newly published articles in PubMed and are asked to automatically predict the most relevant MeSH terms for each article in the given set. For evaluation, team predictions are compared to the gold-standard terms curated by human indexers. MeSH indexing is an important task for the US National Library of Medicine (NLM) because indexed MeSH terms can be used implicitly or explicitly for searching the biomedical literature in PubMed [11]. Indexed MeSH terms also play a role in many other scientific investigations [12-14] in biomedical informatics research. However, like many other curation tasks, manual MeSH indexing is labor-intensive and time-consuming. As shown in [15, 16], it can take weeks or even months for an article to be manually indexed with relevant MeSH terms after it first enters PubMed. In response, many automated systems for assisting MeSH indexing have been proposed in the past [15-17]. Some automated systems, such as the NLM Medical Text Indexer (MTI) and its newer version, Medical Text Indexer First Line (MTIFL) [18], are already being used in the NLM production pipelines to assist human annotators with indexing MeSH main headings and main heading/subheading pairs [19].

Task 2b is a biomedical question-answering task.
For this task, teams are provided with 100 natural language questions in each batch (five test batches in total) and asked to return answers in two phases. In Phase A, the participating teams should return relevant documents, concepts, RDF triples and snippets for each question. In Phase B, the teams should return “exact” and “ideal” answers. Exact answers depend on the question type, which can be categorized as follows:

• Yes/no questions: the answer is either yes or no
• Factoid questions: the answer is a named entity
• List questions: the answer is a list of named entities
• Summary questions: no exact answer is needed

Ideal answers are paragraph-sized summaries that are required for all four types of questions. For both phases of Task 2b, the question type is known to the participants.

2 Methods

2.1 Task 2a

For Task 2a, our overall approach builds on our previous research, where we first proposed to reformulate the MeSH prediction task as a ranking problem in 2010 [16]: our approach first retrieves an initial list of MeSH terms as candidates for each target article. Next, we apply a learning-to-rank algorithm to re-rank the candidate MeSH terms based on the learned associations between the document text and each candidate MeSH term. More specifically, each main heading (MH) candidate is represented as a feature vector x_i = (x_{1i}, x_{2i}, ..., x_{mi}), where m is the number of features (e.g., neighborhood features, unigram/bigram features, etc.). The learning objective is to find a ranking function f(x) that assigns a score to each main heading based on its feature vector; these scores are then used to rank relevant main headings of the target document ahead of irrelevant ones. Finally, we prune the ranked list and return a number of top candidates as the final system output.

Through our participation in the indexing task of BioASQ 2013 [20], we demonstrated several useful extensions, such as using a different learning-to-rank algorithm with an enriched set of learning features, as well as using different methods for pruning the list and selecting top candidates from the ranked list.

In BioASQ 2014, we further expanded our approach in the following aspects. First, we built binary SVM classifiers using bag-of-words features, one for each MeSH term, as suggested by [21]. Second, predicted MeSH terms from these binary classifiers and from NLM’s MTI system were added to our list of candidate MeSH terms, in addition to those already collected from the neighbor documents. Third, we limited the neighbor documents to newly indexed articles (the last five years) and used a more recent and larger training set, along with a new list-pruning method for selecting the final terms from our ranked list. Lastly, we applied several post-processing techniques, such as string matching to identify “Age Check Tags” in the abstract, to enhance the final system output.
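For illustration, the following Python fragment is a minimal sketch of the list-pruning step, not the code used in our system. It assumes a candidate list already scored and sorted by the learning-to-rank model, and applies the 2014 rule from Table 1 (S_{i+1} < S_i · log(i) · λ) under one plausible reading, namely that the list is cut at the first large score drop; the value of λ and the minimum/maximum list sizes are placeholder assumptions.

```python
import math

def prune_ranked_list(scored_terms, lam=0.5, min_keep=3, max_keep=25):
    """Cut a ranked MeSH candidate list using the score-drop rule S_(i+1) < S_i * log(i) * lambda.

    scored_terms: (term, score) pairs sorted by descending score.
    lam, min_keep and max_keep are illustrative values, not the parameters
    tuned for our official runs.
    """
    if not scored_terms:
        return []
    keep = [scored_terms[0]]
    for i in range(1, len(scored_terms)):  # i is the 1-based rank of the previous term
        s_prev, s_curr = scored_terms[i - 1][1], scored_terms[i][1]
        # One reading of the rule: stop at the first sufficiently large relative drop.
        if len(keep) >= min_keep and s_curr < s_prev * math.log(i) * lam:
            break
        keep.append(scored_terms[i])
        if len(keep) >= max_keep:
            break
    return [term for term, _ in keep]

# Toy example with made-up learning-to-rank scores.
candidates = [("Humans", 2.9), ("Neoplasms", 2.4), ("Mutation", 2.1),
              ("Mice", 0.6), ("Cell Line", 0.5)]
print(prune_ranked_list(candidates))  # -> ['Humans', 'Neoplasms', 'Mutation']
```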
Table 1 shows a detailed list of key differences between our current system and our 2013 system. In addition to the differences described above, the table also includes a few other notable modifications, such as upgrading our lexicon to the MeSH 2014 version.

Table 1. Major differences between our current work and our previous approach in BioASQ 2013 (in both cases, the general learning-to-rank framework [16] was used).

Notable differences | BioASQ 2013 | BioASQ 2014
Learning-to-rank algorithm | MART | LambdaMART
Neighbor documents | Retrieved from the entire MEDLINE database | Retrieved from documents indexed after 2009
List of candidate MeSH terms | All MeSH terms in neighbor documents | All MeSH terms in neighbor documents plus MTI and binary classifier results
Features used in the learning-to-rank algorithm | All features in [16] plus a new feature representing the MTI results | All features in [16] plus two new features representing the MTI and binary classifier results
MeSH version | MeSH 2013 | MeSH 2014
Training data for the learning-to-rank algorithm | 1,000 documents from the select BioASQ journal list | 5,000 documents from the select BioASQ journal list
Method for selecting the number of predicted MeSH terms from the ranked list | S_{i+1} / S_i < i / (i + 1 + α), where S_i is the score of the predicted MeSH term at position i | S_{i+1} < S_i · log(i) · λ, where S_i is the score of the predicted MeSH term at position i
Post-processing techniques | None | Refine “Age Check Tags”; add tags such as “Europe” to European foreign journals

2.2 Task 2b – Phase A

For returning relevant documents, we used PubMed search functions. Given a search query, PubMed provides users with two results-ranking options: by date or by relevance. Furthermore, we computed the cosine similarity (Eq. 1) between the question (q) and each sentence (s) in a retrieved article. The sentence in the abstract with the highest score was returned as a snippet. We did not use full text in this work.

cos(q, s) = (q · s) / (‖q‖ ‖s‖) = ( Σ_{t∈q∩s} q_t · s_t ) / ( √(Σ_{t∈q} q_t²) · √(Σ_{t∈s} s_t²) )    (1)

For concept recognition, we used a dictionary-lookup method to mine disease, chemical and GO terms, and used our previously developed gene normalization tool, GenNorm [22], to identify gene/protein mentions. In addition, we used MetaMap [23] to extract MeSH concepts from the questions. For snippets, we only returned results when the relevant concepts were genes/proteins.

2.3 Task 2b – Phase B

In Phase B, the gold-standard relevant documents, concepts, snippets and RDF triples from Phase A become available to the participants. In particular, we used the relevant documents and snippets for returning “exact” answers in Phase B.

“Exact” answers: For Yes/no questions, we simply returned “yes” as the “exact” answer because of its strong performance on the previous training data. No “exact” answers were needed for Summary-type questions.

For Factoid and List-type questions, we developed a three-step approach for returning “exact” answers. The first step was to automatically determine the type of desired answer: 1) numbers; 2) multiple choices; or 3) bio-concepts (see examples in Table 2). If bio-concepts are desired, we further classified them into sub-types: 3a) genes/proteins; 3b) chemicals/drugs; 3c) disorders/syndromes; 3d) mutations/variations; and 3e) species/viruses. Based on this strategy and the previous year’s data, we developed a set of regular expression patterns to identify the different answer types and sub-types for a given question. When no match is found, the question is discarded from further processing (i.e., no “exact” answer is returned). Otherwise, it is passed to the next step.
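To make this answer-type step concrete, the sketch below shows a regex-based question classifier. The patterns are hypothetical examples written for this description only; they are not the patterns used in our submission, which were derived from the BioASQ training questions.

```python
import re

# Illustrative patterns only. Order matters: more specific patterns (e.g., mutations)
# are checked before more general ones (e.g., genes).
ANSWER_TYPE_PATTERNS = [
    ("numbers", re.compile(r"^(how many|what is the (incidence|prevalence|number))", re.I)),
    ("multiple_choice", re.compile(r"\b\w+ or (an? )?\w+\?$", re.I)),
    ("bio_concept:mutation", re.compile(r"^which (gene )?mutations?", re.I)),
    ("bio_concept:gene", re.compile(r"^which (gene|protein)s?\b", re.I)),
    ("bio_concept:chemical", re.compile(r"^which (drug|chemical)s?\b", re.I)),
    ("bio_concept:disease", re.compile(r"^which (disease|disorder|syndrome)s?\b", re.I)),
    ("bio_concept:species", re.compile(r"^which (virus|species|organism)", re.I)),
]

def classify_question(question):
    """Return the first matching answer type, or None to skip the question."""
    for answer_type, pattern in ANSWER_TYPE_PATTERNS:
        if pattern.search(question):
            return answer_type
    return None

print(classify_question("Which gene is involved in CADASIL?"))                    # bio_concept:gene
print(classify_question("How many genes does the human hoxD cluster contain?"))   # numbers
print(classify_question("Is the transcriptional regulator BACH1 an activator or a repressor?"))  # multiple_choice
```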
The next step was to generate candidate answers for the different answer types in Factoid and List-type questions. For 1), we identified all numbers in the relevant snippets as candidate answers. For 2), the candidates were mined from the question itself. For 3), we applied our PubTator tool [24-26] to the relevant documents to obtain entity recognition results when generating the candidate answers. PubTator is equipped with several competition-winning text-mining algorithms for automatically extracting bio-concepts from free text (GenNorm [22] for genes, tmChem [4] for chemicals, DNorm [27] for diseases, SR4GN [28] for species, and tmVar [29] for mutations).

Table 2. Three answer types for Factoid and List-type questions.

Answer type | Example questions
1) Numbers | How many genes does the human hoxD cluster contain? What is the incidence of Edwards’ syndrome in the European population?
2) Multiple choices | Is the transcriptional regulator BACH1 an activator or a repressor?
3) Bio-concepts | Which gene is involved in CADASIL? Which drugs affect insulin resistance in obesity? Which disease is caused by mutations in the Calsequestrin 2 (CASQ2) gene? Which gene mutations are responsible for isolated non-compaction cardiomyopathy? Which virus is Cidofovir (Vistide) indicated for?

The last step was to rank the candidate answers. For each candidate, we calculated its cosine similarity against the relevant snippets and ranked the candidates by these similarity scores. We then returned up to the maximum number of allowed answers (e.g., no more than five answers for Factoid-type questions).

“Ideal” answers: For returning “ideal” answers, we used the same method as for retrieving relevant snippets in Phase A. That is, we scored each gold-standard snippet against the question using cosine similarity and returned the one with the highest score. This method was applied to all questions regardless of their type.
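As an illustration of the cosine-similarity scoring used for both snippet selection (Eq. 1) and candidate-answer ranking, the following minimal sketch uses raw token-count vectors. The simple tokenizer, the absence of term weighting, and the aggregation by maximum similarity over snippets are simplifying assumptions made for the example, not a description of our exact implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer; a simplification of the actual preprocessing."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bags of words (Eq. 1 with raw term counts)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_snippet(question, sentences):
    """Phase A style selection: return the sentence most similar to the question."""
    q_tokens = tokenize(question)
    return max(sentences, key=lambda s: cosine(q_tokens, tokenize(s)))

def rank_candidates(candidates, snippets, max_answers=5):
    """Rank candidate answers by their best similarity to any relevant snippet."""
    snippet_tokens = [tokenize(s) for s in snippets]
    ranked = sorted(candidates,
                    key=lambda c: max(cosine(tokenize(c), st) for st in snippet_tokens),
                    reverse=True)
    return ranked[:max_answers]

# Toy example with made-up snippets and candidates.
snippets = ["NOTCH3 mutations cause CADASIL, a hereditary small-vessel disease."]
print(best_snippet("Which gene is involved in CADASIL?", snippets))
print(rank_candidates(["BRCA1", "NOTCH3"], snippets))  # -> ['NOTCH3', 'BRCA1']
```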
3 Results

3.1 Task 2a

Task 2a was organized in three consecutive periods (batches) of five weeks each. Each week, participants had a limited response time (less than one day) to return their predicted MeSH terms for a set of newly indexed articles in PubMed.

For Task 2a, team results were evaluated with multiple measures. The two main measures are the flat “label-based micro F-measure (MiF)” and the hierarchical “Lowest Common Ancestor F-measure (LCA-F)” [30]. Table 3 shows our best results on the Task 2a Batch 3 Week 2 test set, which contains the largest number of test articles (5,717) with known answers among all 15 test sets of Task 2a, as of June 30, 2014. In this run, we incorporated both the MTI and binary classifier results and applied all three post-processing methods listed in Table 1. According to the official website, we ranked first in both the flat (MiF) and the hierarchical (LCA-F) F-measures on this test set. We also obtained the highest recall scores in both the flat and hierarchical measures (MiR and LCA-R).

Table 3. Official results (as of June 30, 2014) for our best run (L2R-n2) on the Batch 3 Week 2 test set, plus the results of several baseline methods. Our best results among all team submissions are highlighted in bold.

Systems | MiF | MiP | MiR | LCA-F | LCA-P | LCA-R
NCBI (L2R-n2) | 0.6052 | 0.6191 | 0.5919 | 0.5105 | 0.5324 | 0.5208
Default MTI | 0.5640 | 0.5921 | 0.5385 | 0.4834 | 0.5245 | 0.4770
MTI First Line | 0.5520 | 0.6257 | 0.4939 | 0.4686 | 0.5434 | 0.4382
BioASQ_Baseline | 0.2666 | 0.2413 | 0.2978 | 0.3120 | 0.3224 | 0.3299

3.2 Task 2b – Phase A

The test dataset of Task 2b was released in five batches over a period of three months, each containing 100 questions (we did not submit results for Batch 2 of Phase A). Several measures were used to evaluate the team submissions. Table 4 shows our submission for the final (fifth) batch according to the official results released on June 30, 2014, where we obtained the best F-measure and mean precision for returning relevant concepts (highlighted in bold).

Table 4. Official results for our best submission on the Batch 5 Phase A test set. Our best results among all submissions are highlighted in bold.

Result type | Mean precision | Recall | F-measure | MAP | GMAP
Documents | 0.2124 | 0.1450 | 0.1384 | 0.0903 | 0.0005
Concepts | 0.4572 | 0.391 | 0.3848 | 0.297 | 0.0634
RDF triples | 0.0455 | 0.001 | 0.0021 | 0.001 | 0.0000
Snippets | 0.0655 | 0.038 | 0.0409 | 0.024 | 0.0001

3.3 Task 2b – Phase B

Table 5 shows our official results for all five batches, for three types of questions: Yes/No, Factoid and List. (Official results for the Summary-type questions were not available in the case of “exact” answers at the time of writing, and no results had been released in the case of “ideal” answers for any question type.) When considering the official measures (accuracy for Yes/No questions, mean reciprocal rank (MRR) for Factoid questions, and mean F-measure for List questions), we achieved consistently better results than the two BioASQ baseline approaches. In addition, we obtained the highest results for the Yes/No questions in Batches 1 and 5, and for the Factoid questions in Batches 3 and 5 (highlighted in bold in Table 5).

Table 5. Official results of our submissions for the Phase B test sets in the case of “exact” answers. Our best results among all submissions are highlighted in bold.

Batch | Yes/No: Accuracy | Factoid: Strict Acc. | Factoid: Lenient Acc. | Factoid: MRR | List: Mean precision | List: Recall | List: F-measure
Batch 1 | 0.9375 | 0.1852 | 0.1852 | 0.1852 | 0.0618 | 0.0929 | 0.0723
Batch 2 | 0.8214 | − | − | − | 0.1596 | 0.2057 | 0.1618
Batch 3 | 0.8333 | 0.0417 | 0.1250 | 0.0833 | 0.1195 | 0.1780 | 0.1373
Batch 4 | 0.8750 | 0.0938 | 0.1250 | 0.1042 | − | − | −
Batch 5 | 1.0000 | 0.1379 | 0.1724 | 0.1466 | − | − | −

Note that there are no official evaluation results for our submissions for the Factoid-type questions in Batch 2 or the List-type questions in Batches 4 and 5. This is likely due to a data format issue in our submissions.

4 Discussion & Conclusion

In the BioASQ 2014 challenge on automated MeSH indexing, our learning-to-rank based approach shows improved and competitive performance among all participating systems. Moreover, we demonstrate that our learning-to-rank method is a general and robust framework that allows the systematic integration of results from other methods for improved performance. When we included predicted results from a knowledge-based approach (MTI) and a text classification method, we were able to achieve the highest recall by both flat and hierarchical measures while still maintaining high precision. For instance, compared to one of the baseline systems, MTI First Line, we achieved a much higher recall (59% vs. 49%) with almost the same level of precision (62%) (see Table 3 for details).

Our best results for Task 2b are in the “exact” answers to the Factoid-type questions (see Table 5), where we used the results of our previously developed named entity recognition (NER) tools. In fact, this approach appears to perform better than relying on the gold-standard concepts from Phase A, based on our comparative analysis.

In conclusion, we participated in both tasks of the BioASQ 2014 challenge and ranked among the top teams for both tasks.
In the future, we are interested in exploring the use of our high-performing MeSH prediction methods in practical applications (e.g., supporting instant MeSH indexing), as well as the role of our state-of-the-art automated entity recognition tools in question answering tasks.

Acknowledgements

We would like to thank the BioASQ 2014 task organizers as well as the authors of the NLM's MTI system for providing the task and baseline data. We also thank Dr. Ritu Khare for her help in proofreading the manuscript. This research is supported by the NIH Intramural Research Program, National Library of Medicine.

References

1. Kim, S., Kim, W., Wei, C.-H., Lu, Z., Wilbur, W.J.: Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database: The Journal of Biological Databases & Curation 2012, bas042 (2012)
2. Lu, Z., Kim, W., Wilbur, W.J.: Evaluating relevance ranking strategies for MEDLINE retrieval. Journal of the American Medical Informatics Association 16, 32-36 (2009)
3. Leaman, R., Khare, R., Lu, Z.: NCBI at 2013 ShARe/CLEF eHealth Shared Task: Disorder Normalization in Clinical Notes with DNorm. Proceedings of the CLEF 2013 Evaluation Labs and Workshop. CLEF, Valencia, Spain (2013)
4. Leaman, R., Wei, C.-H., Lu, Z.: NCBI at the BioCreative IV CHEMDNER Task: Recognizing chemical names in PubMed articles with tmChem. BioCreative IV Challenge Evaluation Workshop, vol. 2, pp. 34 (2013)
5. Lu, Z., Kao, H.-Y., Wei, C.-H., Huang, M., Liu, J., Kuo, C.-J., Hsu, C.-N., Tsai, R.T., Dai, H.-J., Okazaki, N.: The gene normalization task in BioCreative III. BMC Bioinformatics 12, S2 (2011)
6. Krallinger, M., Vazquez, M., Leitner, F., Salgado, D., Chatr-aryamontri, A., Winter, A., Perfetto, L., Briganti, L., Licata, L., Iannuccelli, M.: The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12, S3 (2011)
7. Mao, Y., Van Auken, K., Li, D., Arighi, C.N., Lu, Z.: The gene ontology task at BioCreative IV. Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, vol. 1, pp. 119-127. BioCreative, Bethesda, Maryland (2013)
8. Arighi, C.N., Wu, C.H., Cohen, K.B., Hirschman, L., Krallinger, M., Valencia, A., Lu, Z., Wilbur, J.W., Wiegers, T.C.: BioCreative-IV virtual issue. Database 2014, bau039 (2014)
9. Wu, C.H., Arighi, C.N., Cohen, K.B., Hirschman, L., Krallinger, M., Lu, Z., Mattingly, C., Valencia, A., Wiegers, T.C., Wilbur, W.J.: BioCreative-2012 virtual issue. Database: The Journal of Biological Databases & Curation 2012, bas049 (2012)
10. Partalas, I., Gaussier, E., Ngomo, A.-C.N.: Results of the First BioASQ Workshop. Proceedings of the First Workshop on BioASQ, vol. 1094. BioASQ@CLEF, Valencia, Spain (2013)
11. Lu, Z., Kim, W., Wilbur, W.J.: Evaluation of query expansion using MeSH in PubMed. Information Retrieval 12, 69-80 (2009)
12. Névéol, A., Doğan, R.I., Lu, Z.: Author keywords in biomedical journal articles. Proceedings of the American Medical Informatics Association Symposium, vol. 2010, pp. 537. Washington, D.C. (2010)
13. Doğan, R.I., Lu, Z.: Click-words: learning to predict document keywords from a user perspective. Bioinformatics 26, 2767-2775 (2010)
14. Doms, A., Schroeder, M.: GoPubMed: exploring PubMed with the gene ontology. Nucleic Acids Research 33, W783-W786 (2005)
15. Huang, M., Lu, Z.: Learning to annotate scientific publications. Proceedings of the 23rd International Conference on Computational Linguistics, pp. 463-471. Association for Computational Linguistics, Beijing, China (2010)
16. Huang, M., Névéol, A., Lu, Z.: Recommending MeSH terms for annotating biomedical articles. Journal of the American Medical Informatics Association 18, 660-667 (2011)
17. Kim, W., Aronson, A.R., Wilbur, W.J.: Automatic MeSH term assignment and quality assessment. Proceedings of the American Medical Informatics Association Symposium, pp. 319. Washington, D.C. (2001)
18. Mork, J.G., Yepes, A.J.J., Aronson, A.R.: The NLM Medical Text Indexer System for Indexing Biomedical Literature. Proceedings of the First Workshop on BioASQ, vol. 1094. BioASQ@CLEF, Valencia, Spain (2013)
19. Névéol, A., Shooshan, S.E., Humphrey, S.M., Mork, J.G., Aronson, A.R.: A recent advance in the automatic indexing of the biomedical literature. Journal of Biomedical Informatics 42, 814-823 (2009)
20. Mao, Y., Lu, Z.: NCBI at the 2013 BioASQ challenge task: Learning to rank for automatic MeSH indexing. Technical report (2013)
21. Tsoumakas, G., Laliotis, M., Markantonatos, N., Vlahavas, I.P.: Large-Scale Semantic Indexing of Biomedical Publications. Proceedings of the First Workshop on BioASQ, vol. 1094. BioASQ@CLEF, Valencia, Spain (2013)
22. Wei, C.-H., Kao, H.-Y.: Cross-species gene normalization by species inference. BMC Bioinformatics 12, S5 (2011)
23. Aronson, A.R., Lang, F.-M.: An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17, 229-236 (2010)
24. Wei, C.-H., Harris, B.R., Li, D., Berardini, T.Z., Huala, E., Kao, H.-Y., Lu, Z.: Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database: The Journal of Biological Databases & Curation 2012, bas041 (2012)
25. Wei, C.-H., Kao, H.-Y., Lu, Z.: PubTator: A PubMed-like interactive curation system for document triage and literature curation. Proceedings of the BioCreative 2012 Workshop, pp. 145-150. BioCreative, Washington, D.C. (2012)
26. Wei, C.-H., Kao, H.-Y., Lu, Z.: PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41, W518-W522 (2013)
27. Leaman, R., Doğan, R.I., Lu, Z.: DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 29, 2909-2917 (2013)
28. Wei, C.-H., Kao, H.-Y., Lu, Z.: SR4GN: a species recognition software tool for gene normalization. PLoS ONE 7, e38460 (2012)
29. Wei, C.-H., Harris, B.R., Kao, H.-Y., Lu, Z.: tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 29, 1433-1439 (2013)
30. Kosmopoulos, A., Partalas, I., Gaussier, E., Paliouras, G., Androutsopoulos, I.: Evaluation Measures for Hierarchical Classification: a unified view and novel approaches. CoRR abs/1306.6802 (2013)