
Machine Reading of Biomedical Texts about Alzheimer's Disease

Roser Morante, Martin Krallinger, Alfonso Valencia, Walter Daelemans

CLiPS, University of Antwerp, Prinsstraat 13, B-2000 Antwerpen, Belgium
walter.daelemans@ua.ac.be

CNIO, Melchor Fernandez Almagro 3, 28029 Madrid, Spain
avalencia@cnio.es

This report describes the task Machine reading of biomedical texts about Alzheimer's disease, which is a task of the Question Answering for Machine Reading Evaluation (QA4MRE) Lab at CLEF 2013. The task aims at exploring the ability of a machine reading system to answer questions about a scientific topic, namely Alzheimer's disease. As in the QA4MRE task, participant systems were asked to read a document and identify the answers to a set of questions about information that is stated or implied in the text. A background collection was provided for systems to acquire background knowledge. Three teams participated in the task submitting a total of 13 runs. The highest score obtained by a team was 0.42 c@1, which is clearly above baseline.

Introduction

This report describes the second edition of the task Machine reading of biomedical texts about Alzheimer's disease, organised as part of the Question Answering for Machine Reading Evaluation (QA4MRE)1 Lab at CLEF 2013. The task aims at exploring the ability of a machine reading system (4; 13) to answer questions about a scientific topic, namely Alzheimer's disease (AD), based on a background collection of scientific texts.

As in the QA4MRE task (9), participant systems were asked to read a document and identify the answers to a set of questions about information that is stated or implied in the text. Questions are multiple choice, each with five options and only one correct answer. The detection of correct answers is specifically designed to require various kinds of inference and the consideration of previously acquired background knowledge. Knowledge acquisition can be performed from a document collection, called the background collection, provided by the organization. Participants were provided with the same background collection as in the 2012 edition, the Alzheimer's Disease Literature Corpus (ADLC corpus) (6). The evaluation was performed on four reading tests with ten multiple choice questions each, following the setup of the 2012 edition.

1 http://celct.fbk.eu/QA4MRE/

To solve the task, participants could make use of existing resources, such as ontologies or databases, and tools, such as named entity taggers, event extractors and parsers. In order to keep the task reasonably simple for systems, the task organizers provided the texts of the background collection and the test documents processed at several levels of linguistic analysis (lemmas, part-of-speech, named entities, chunking, dependency parsing) with publicly available state-of-the-art tools.

AD was chosen as a topic of the QA4MRE Lab because there is a particular interest in more efficient processing of Alzheimer-related literature, as this condition constitutes a considerable health challenge for an aging population (Citron 2010). The increasing importance of AD is reflected in the recently approved US National Alzheimer's Project Act,2 which will result in considerable funding being made available for research on this disease and for financing better data infrastructure resources. Currently, the illness is being analyzed from various perspectives in a growing number of scientific studies (5; 1; 2).

The report is organised as follows. Section 2 provides information about the Alzheimer's Disease Literature Corpus and Section 3 about the test data. Section 4 explains the process followed to annotate the data. Section 5 deals with the design of questions. In Section 6 the evaluation process is explained and in Section 7 details about the number of participating systems and runs are presented as well as their results. Finally, Section 8 closes the paper with some conclusions.

Background collection: the Alzheimer's Disease Literature Corpus

The background collection is a collection of texts about Alzheimer's disease called the Alzheimer's Disease Literature Corpus (ADLC corpus). Participants could use it for their systems to acquire reading capabilities and to obtain knowledge about Alzheimer's disease that could help in answering the questions about the test documents. The texts have been carefully selected to be as specific as possible for this topic, and the corpus should constitute a comprehensive resource for this task in particular and for text mining efforts tailored to the Alzheimer's disease field in general. Although the use of the background collection is recommended, it is not mandatory. The background collection is released subject to signing a license agreement.3 It contains the following sets of documents:

PubMed abstracts. 66,222 abstracts obtained by running in PubMed the search shown in Figure 1. The abstracts were provided in XML format, and with the annotations described in Section 4.

Open Access full articles PMC. 8,249 Open Access full articles from PubMed Central in PDF format. These articles have been selected by first performing the search in Figure 1 and then selecting the full articles that belong to the PubMed Central Open Access subset and that were available on 1.03.2012. 7,512 of these articles were provided in text format, which was obtained by converting the PDF files into text by using the tool LA-PDFText4 (10). 7,447 of these articles were also provided with annotations.

(((((("Alzheimer Disease"[Mesh] OR "Alzheimer's disease antigen"[Supplementary Concept] OR "APP protein, human"[Supplementary Concept] OR "PSEN2 protein, human"[Supplementary Concept] OR "PSEN1 protein, human"[Supplementary Concept]) OR "Amyloid beta-Peptides"[Mesh]) OR "donepezil"[Supplementary Concept]) OR ("gamma-secretase activating protein, human"[Supplementary Concept] OR "gamma-secretase activating protein, mouse"[Supplementary Concept])) OR "amyloid beta-protein (1-42)"[Supplementary Concept]) OR "Presenilins"[Mesh]) OR "Neurofibrillary Tangles"[Mesh] OR "Alzheimer's disease"[All Fields] OR "Alzheimer's Disease"[All Fields] OR "Alzheimer s disease"[All Fields] OR "Alzheimers disease"[All Fields] OR "Alzheimer's dementia"[All Fields] OR "Alzheimer dementia"[All Fields] OR "Alzheimer-type dementia"[All Fields] NOT "non-Alzheimer"[All Fields] NOT ("non-AD"[All Fields] AND "dementia"[All Fields]) AND (hasabstract[text] AND English[lang])

Figure 1. PubMed search used to select the abstracts of the background collection.

2 http://aspe.hhs.gov/daltcp/napa/#NAPA

3 The ADLC corpus can be downloaded from the following link: http://celct.fbk.eu/ResPubliQA/index.php?page=Pages/bg_collection_pilot.php

Open Access full articles PMC, smaller set. This smaller set contains 1,041 full text articles from PubMed Central in HTML and text format. The articles are also provided with annotations. For these articles the text version has been converted from the PubMed HTML version. To select these documents a search was performed on PubMed using Alzheimer's disease related keywords and restricting the search to the last three years. The search was performed on 3.02.2012. Only a subset of the articles obtained by the search has been included in the collection.

Elsevier full articles. This set contains 379 full text articles from Elsevier and 103 abstracts. The documents are provided in XML and text format. They are also provided with annotations. The text files have been obtained by converting the XML files into text. The articles in this subset have been selected from a list of articles provided by Professor Tim Clark from the Massachusetts Alzheimer's Disease Research Center, USA. The list contains bibliographic records representing 45 core hypotheses in Alzheimer's disease. Elsevier kindly provided the articles from this list that were Elsevier publications.

Test data

The test set is composed of 4 reading tests, each consisting of 10 questions about 1 document, with 5 answer choices per question. In total there were 40 questions and 200 choices/options. Participating systems were required to answer these 40 questions by choosing in each case one answer from the five alternatives. Systems could leave questions unanswered.

4 LA-PDFText is available at http://code.google.com/p/lapdftext/
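The shape of the test set can be made concrete with a small data model. The following is only a sketch: the class names `Question` and `ReadingTest` and the placeholder texts are hypothetical and do not reflect the format in which the test data was actually distributed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    text: str
    options: List[str]  # the five answer choices

    def __post_init__(self) -> None:
        # The task fixes the number of choices per question at five.
        if len(self.options) != 5:
            raise ValueError("each question must have exactly five options")


@dataclass
class ReadingTest:
    document: str
    questions: List[Question] = field(default_factory=list)


def build_placeholder_test_set() -> List[ReadingTest]:
    """Build 4 reading tests of 10 questions each, filled with dummy text."""
    tests = []
    for t in range(1, 5):
        questions = [
            Question(f"question {t}.{q}", [f"option {o}" for o in range(1, 6)])
            for q in range(1, 11)
        ]
        tests.append(ReadingTest(f"document {t}", questions))
    return tests


test_set = build_placeholder_test_set()
n_questions = sum(len(t.questions) for t in test_set)
n_options = sum(len(q.options) for t in test_set for q in t.questions)
print(n_questions, n_options)
```

Counting the questions and options of the placeholder set reproduces the totals of 40 questions and 200 options given above.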

The test documents were selected using the PubMed query shown in Figure 2. Then, based on manual examination of the abstracts, the MyMiner system was used to classify the articles and retain those relevant for the task. The full text of the articles whose abstracts were found to be relevant was retrieved, and the 4 most relevant articles for the task were chosen based on a quick inspection of the full text.

((((("Alzheimer Disease"[Mesh] OR "Alzheimer's disease antigen"[Supplementary Concept] OR "APP protein, human"[Supplementary Concept] OR "PSEN1 protein, human"[Supplementary Concept]) OR "Amyloid beta-Peptides"[Mesh]) OR "donepezil"[Supplementary Concept]) OR ("gamma-secretase activating protein, human"[Supplementary Concept] OR "gamma-secretase activating protein, mouse"[Supplementary Concept])) OR "Presenilins"[Mesh]) OR "Alzheimer's disease"[All Fields] OR "Alzheimer's Disease"[All Fields] OR "Alzheimer s disease"[All Fields] OR "Alzheimers disease"[All Fields] OR "Alzheimer's dementia"[All Fields] OR "Alzheimer dementia"[All Fields] OR "Alzheimer-type dementia"[All Fields] NOT "non-Alzheimer"[All Fields] NOT ("non-AD"[All Fields] AND "dementia"[All Fields]) AND (hasabstract[text] AND English[lang]) AND ("loattrfree full text"[sb] AND ("2013/01/01"[PDAT] : "2014/12/31"[PDAT]))

Figure 2. PubMed query used to select the test documents.
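A PubMed query of this kind can also be submitted programmatically. The sketch below only builds a request URL for the NCBI E-utilities ESearch endpoint and does not send it. It is an illustration, not the procedure the organizers followed; the helper name `build_esearch_url` and the shortened query fragment are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 100) -> str:
    """Return an ESearch URL for a PubMed query (hypothetical helper)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Shortened fragment of the query, for illustration only.
fragment = (
    '("Alzheimer Disease"[Mesh] OR "Presenilins"[Mesh]) '
    "AND (hasabstract[text] AND English[lang])"
)
url = build_esearch_url(fragment)

# Decoding the URL again recovers the original query term unchanged.
decoded = parse_qs(urlparse(url).query)
print(decoded["term"][0] == fragment)
```

Encoding the term with `urlencode` matters here because the query contains quotes, brackets and spaces that are not valid in a raw URL.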

The test documents were provided in text format. They were first converted automatically from PDF into text format and then the text version was corrected manually, paying attention to symbols that express relevant information about Alzheimer's disease. The captions of figures and tables were also included, but not the figures and tables themselves. Participants were not expected to process the contents of tables and figures. A sample of a test document with questions can be downloaded from the QA4MRE website.5 The test documents and the questions were also provided with annotations.

Data annotation

The documents in the background collection, the test documents, and the questions were provided with annotations in a column format as shown in Figure 3.

The annotations were obtained automatically with the dependency parser GDep (11), a UMLS (3) based NE tagger developed at CLiPS, and the ABNER NE tagger (12). The content of the columns is specified in Table 1.
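A column format of this kind is typically read one token per line. The sketch below assumes tab-separated columns and a blank line between sentences, as in CoNLL-style files; both are assumptions, and the four columns in the sample (index, token, lemma, part-of-speech) are hypothetical rather than the actual column inventory of the corpus.

```python
from typing import Iterable, List


def read_column_format(lines: Iterable[str]) -> List[List[List[str]]]:
    """Group token lines into sentences; each token is a list of column values."""
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence (assumed convention).
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample with columns: index, token, lemma, part-of-speech.
sample = [
    "1\tAPP\tAPP\tNN\n",
    "2\tbinds\tbind\tVBZ\n",
    "\n",
    "1\tTau\tTau\tNN\n",
]
parsed = read_column_format(sample)
print(len(parsed), parsed[0][1][2])
```

The reader keeps columns as anonymous lists so that it does not commit to a column layout that is not confirmed here.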

Question design

As in the QA4MRE task, questions are in multiple choice format and focus on testing the comprehension of one single document. The questions posed for this task should address aspects that are of biomedical relevance and that have proven important in previous efforts such as BioCreative6, the Genomics TREC track7 or the BioNLP shared tasks.8 This should enable participants to make use of resources developed for these competitions and will establish a link between this pilot task and previous efforts. Additionally, since machine reading of biomedical texts is a new task, it seemed more appropriate to restrict the types of questions. Therefore a restricted set of named entity types associated with the questions was defined, as well as a list of question types. The expected answer types for the multiple choice answers depend on the allowed entity types.

5 http://celct.fbk.eu/QA4MRE/index.php?page=Pages/downloads.php

Named entities

The categories of named entities considered for this task are the following:

- GENE PROT. Genes and gene products (proteins, mRNA).
- CHEM DRUG. Chemicals/drugs/pharmacological agents.
- DIS SYMPT. Disease/symptoms.
- EXP METHOD. Experimental method/qualifier.
- SPEC ORG. Species/organism.
- PATH PROC. Pathway/Biological process.
- ANAT CELL. Anatomical/cellular/subcellular structures.
- MUT PTM. Mutations/genetic variations/posttranslational modifications.
- ADV TOXIC. Adverse effect/toxic endpoints.
- DOSE. Dose of a given treatment.
- TIMING. Schedule of treatments (timing).
- PAT CHAR. Patient characteristics: age, gender, sex, race, population, animal strain.
- MOL MARKER. Molecular marker.

6 http://www.biocreative.org

7 http://ir.ohsu.edu/genomics

8 http://sites.google.com/site/bionlpst

In order to identify the named entities above, the following lexico-semantic resources and tools can be used, among others: ABNER, BANNER, Genia Tagger, BioThesaurus, BioLexicon, UMLS, LINNAEUS tagger, OrganismTagger, MeSH, and Gene Ontology (and other ontologies from OBO).

The test documents were processed with UMLS and the BANNER tagger before making the questions, so that questions would refer only to entities that can be automatically identified with existing resources.

Question types

Based on examination of the relationships between the various entity types we compiled the following collection of biomedically relevant question types:

Experimental evidence/qualifier. This question type refers to experimental techniques, methods or models used to generate or validate a given discovery. Examples include animal models used for a given in vivo study, interaction detection methods used to detect protein interactions, imaging techniques for visualization or localization of a particular protein.

Protein-protein interaction. This question type refers to the detection of an interaction partner of a given protein. Examples include physical binding of two proteins in a protein-protein complex or more transient interaction in phosphorylation of one protein by another.

Gene synonymy relation. This question type tries to establish relations between two entity mentions of genes or proteins that actually refer to the same biological entity. For instance, this relation exists between 'APP' and 'amyloid beta (A4) precursor protein'. Alternative aliases of a gene name or symbol are included here, as well as typographical variants and acronyms and their corresponding expanded forms.

Organism source relation. This question type refers to the actual organism source for a given protein or gene. An example would be the genes encoded in the human genome or expressed in humans.

Regulatory relation. This question type refers to gene regulatory relationships between two bio-entities (protein and gene), i.e. whether one bio-entity affects the gene expression of another entity (e.g. transcription factor target gene relation).

Increase (improvement, higher expression). This is a more specific question type of the regulatory relation. It refers to cases where one bio-entity causes the upregulation (increased expression) of another bio-entity.

Decrease (depletion, reduction). This is a more specific question type of the regulatory relation. It refers to cases where one bio-entity causes the downregulation (decreased expression) of another bio-entity.

Inhibition/disruption/impaired. This question type refers to cases where one bio-entity blocks or inhibits another bio-entity. Examples include drugs blocking a given protein or enzyme, or proteins that inhibit a particular biological process or pathway.

As Table 2 shows, not all question types are equally frequent. Balancing the question types is difficult given the constraint that only 4 test documents are provided. Three types occur only once or twice.

Question type
Protein-protein interaction
Experimental evidence/qualifier
Increase
Gene synonymy relation
Regulatory relation
Inhibition/disruption/impaired
Organism source relation
Decrease

Table 2. Question types.

Questions can be assigned a degree of difficulty: simple, medium and complex.

Simple. Factual questions that can be answered using information from the target document and whose textual evidence is contained multiple times in the paper, i.e. several text snippets support the correct answer. The answer is found almost verbatim in the paper.

Medium. The correct answer is phrased in a way that requires the use of lexico-semantic dictionaries and name alias recognition capabilities to handle lexico-semantic alternations of keywords and entities.

Complex. Reasoning must be applied to answer this question. Choosing the correct answer requires combining pieces of evidence. Such questions might need ad hoc axiomatic knowledge and abductive processes.

A collection of criteria for question difficulty classification was followed. Aspects that influence question difficulty include:

- Are the ontological relations encoded in the question? If they are encoded the question should be easier.
- If keyword-based indexing and conceptual indexing are required the question is less easy.
- Script-like questions such as 'how is an anatomical structure assembled?' should be more difficult since answering them requires combining several units of information.
- Template questions about successive temporal events (biological processes, disease stages) should be more difficult since they also require several units of information.
- Is it necessary to process morphological alternations such as phosphorylate lexicalized as the nominalization phosphorylation? In this case the degree of difficulty should be simple/medium, depending on other characteristics of the question.
- Is it necessary to process lexical alternations? The usage of synonyms or semantically related terms derived from ontologies is necessary to increase the recall.
- Is it necessary to process semantic alternations and paraphrases? This involves finding relations between multi-term paraphrases and single terms, textual patterns, or complex examination between word building terms within the ontology.
- Is it necessary to process terminological variants and high level indexes comprising terms and their variants for retrieval? A variant recognition module is required as well as weighting of matching between questions and documents.
- How big is the paragraph window size of the evidence text? Is it a continuous span of text? The bigger the window size, the more difficult the question is. Non-continuous spans are more difficult to process than continuous ones.
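The name alias recognition that Medium questions call for can be illustrated with a dictionary lookup. This is a minimal sketch under strong assumptions: the alias table holds only the 'APP' example given in the text, and the function names `normalize` and `same_entity` are hypothetical. A real system would draw aliases from resources such as UMLS or BioThesaurus.

```python
import re

# Tiny illustrative alias table: lower-cased mention -> canonical symbol.
ALIASES = {
    "app": "APP",
    "amyloid beta (a4) precursor protein": "APP",
}


def normalize(mention: str) -> str:
    """Map a mention to its canonical symbol if it is a known alias."""
    key = re.sub(r"\s+", " ", mention.strip().lower())
    return ALIASES.get(key, mention.strip())


def same_entity(a: str, b: str) -> bool:
    """Return True when both mentions normalize to the same symbol."""
    return normalize(a) == normalize(b)


print(same_entity("APP", "Amyloid beta (A4)  precursor protein"))
print(same_entity("APP", "PSEN1"))
```

Lower-casing and whitespace collapsing handle only the simplest typographical variants; morphological and semantic alternations from the list above would need additional processing.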

As for the distribution of questions depending on difficulty degree, 26 questions were assigned Medium, 13 were assigned Simple and 1 was assigned Complex.

Answers

As in the main task, systems are not required to answer every question, since the c@1 measure (7) was used for evaluation. This measure encourages systems to reduce the number of incorrect answers while maintaining the number of correct ones by leaving some questions unanswered. Systems were asked to choose the right answer among five choices.

Evaluation

As in the main task, participants were allowed to submit a maximum of 10 runs. Each run should be categorized as one of the following types, depending on the resources that have been used to assist in answering the questions:

1. No external resource was used (only the test document).
2. Only the test document and the associated background collection were used.
3. The test document and other resources were used, but not the background collection.
4. The test document together with the background collection and other resources were used.
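The four run types amount to the combinations of two yes/no properties: whether the background collection was used and whether other resources were used. A short sketch follows; the function name `run_type` and its parameters are hypothetical and were not part of the submission procedure.

```python
def run_type(used_background: bool, used_other_resources: bool) -> int:
    """Return the run type (1 to 4) for a given combination of resources."""
    if used_background and used_other_resources:
        return 4  # test document + background collection + other resources
    if used_other_resources:
        return 3  # test document + other resources, no background collection
    if used_background:
        return 2  # test document + background collection only
    return 1      # test document only


for background in (False, True):
    for other in (False, True):
        print(background, other, run_type(background, other))
```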

Evaluation was performed automatically following the same procedure as in the QA4MRE task. Each question received one (and only one) of the three following assessments:

- Correct if the system selected the correct answer among the five candidate ones of the given question.
- Incorrect if the system selected one of the wrong answers.
- NoA if the system chose not to answer the question.

The main evaluation measure used was c@1 (7), which takes into account the option of not answering certain questions. The formulation of c@1 is given in (1). The overall c@1 is calculated over the 40 questions of the test collection.

$$c@1 = \frac{1}{n}\left(n_R + n_U \frac{n_R}{n}\right) \tag{1}$$

where
$n_R$: number of questions correctly answered.
$n_U$: number of questions unanswered.
$n$: total number of questions.
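The c@1 measure can be computed directly from the per-question assessments. The sketch below assumes that each question carries one of the labels Correct, Incorrect or NoA; the function name `c_at_1` and the example counts are illustrative and are not taken from any submitted run.

```python
from typing import Iterable


def c_at_1(assessments: Iterable[str]) -> float:
    """Compute c@1 from a sequence of 'Correct' / 'Incorrect' / 'NoA' labels."""
    labels = list(assessments)
    n = len(labels)
    if n == 0:
        raise ValueError("at least one question is required")
    n_r = labels.count("Correct")  # correctly answered questions
    n_u = labels.count("NoA")      # unanswered questions
    # Unanswered questions are credited at the system's observed rate n_r / n.
    return (n_r + n_u * n_r / n) / n


# Illustrative run over 40 questions: 10 correct, 20 incorrect, 10 unanswered.
example = ["Correct"] * 10 + ["Incorrect"] * 20 + ["NoA"] * 10
score = c_at_1(example)
print(score)
```

In this example the same system would score 0.25 if it had answered the 10 unanswered questions incorrectly, so abstaining on questions it cannot answer raises its score.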

As a secondary measure systems are evaluated on accuracy, which is the traditional measure applied to question answering evaluations that do not distinguish between answered and unanswered questions. The formulation of accuracy is given in (2). The overall accuracy is calculated over the 40 questions of the test collection.

$$\mathrm{accuracy} = \frac{n_R + n_{UR}}{n} \tag{2}$$

where
$n_R$: number of questions correctly answered.
$n_{UR}$: number of unanswered questions whose candidate answer was correct.
$n$: total number of questions.

More information about the evaluation procedure can be found in (9).
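Accuracy differs from c@1 in that an unanswered question still counts as correct when the candidate answer the system attached to it was right. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def accuracy(n_correct: int, n_unanswered_correct: int, n_total: int) -> float:
    """Accuracy: answered-correct plus unanswered-with-correct-candidate, over n."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_correct + n_unanswered_correct) / n_total


# Illustrative counts: 10 correct answers, 10 unanswered questions of which
# 3 had a correct candidate answer, 40 questions in total.
acc = accuracy(n_correct=10, n_unanswered_correct=3, n_total=40)
print(acc)
```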

Participation and results

Out of the 12 groups that had previously registered and signed the license agreement to download the background collection, a total of 3 groups participated submitting 13 runs. Table 3 shows the list of participating teams and the reference to their reports.

Table 4 provides information about the number of runs per team and the scores of the best run in terms of c@1. A random baseline is calculated, assuming that a system answers all questions. This baseline has five possibilities when trying to answer a question: it can select the correct answer to the question, or it can select one of the four incorrect answers. Since exactly one of the five options is correct, the expected overall result is 0.20. One of the participating systems scores below baseline and one scores just below baseline, whereas the team that obtained the best results is clearly above baseline with a c@1 score of 0.42. This team also ran experiments on the test set of the 2012 edition, obtaining 0.39 c@1, which is lower than the maximum c@1 score obtained in 2012, 0.55.
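The baseline value can be checked with exact arithmetic. The sketch below assumes a system that answers every question by picking one of the five options uniformly at random, so that no question is left unanswered and c@1 reduces to the fraction of correct answers; the helper `c_at_1` is a hypothetical restatement of the c@1 definition.

```python
from fractions import Fraction


def c_at_1(n_r: Fraction, n_u: Fraction, n: int) -> Fraction:
    """c@1 from (expected) counts, following the definition of the measure."""
    return (n_r + n_u * n_r / n) / n


n = 40                                 # questions in the test collection
p_correct = Fraction(1, 5)             # one correct option out of five
expected_correct = n * p_correct       # expected number of correct answers
baseline = c_at_1(expected_correct, Fraction(0), n)
print(expected_correct, baseline, float(baseline))
```

With 40 questions the random system is expected to answer 8 correctly, which gives the baseline of 0.20.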

All teams take a question answering approach. The team that obtained the highest scores, lims, applies a method that exploits discourse relations focusing on complex questions, such as causal questions. They create a question typology and detect the kind of discourse relation between the candidate answers and the question. The detection of discourse relations is rule-based, using information from parse trees and connectors.

The cmuq team participates with a UIMA-based pipeline system which integrates the Configuration Space Exploration (CSE) framework for building and exploring configuration spaces for information systems. They performed 1020 experiments in order to find the best parameter configuration by means of CSE. Their best run is obtained by matching the named entities in the answer choices with the named entities in candidate sentences extracted from the background collection based on Lucene queries built from the questions.
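The idea of matching named entities in answer choices against named entities in candidate sentences can be sketched with set overlap. This is a loose illustration and not the cmuq system: it involves no UIMA, CSE or Lucene, the scoring rule and the abstention rule are assumptions, and the entity sets are made up.

```python
from typing import Dict, List, Optional, Set


def score_choices(
    choice_entities: Dict[str, Set[str]],
    sentence_entities: List[Set[str]],
) -> Dict[str, int]:
    """Score each choice by its best entity overlap with any candidate sentence."""
    return {
        choice: max((len(entities & sent) for sent in sentence_entities), default=0)
        for choice, entities in choice_entities.items()
    }


def pick_answer(scores: Dict[str, int]) -> Optional[str]:
    """Return the single best-scoring choice, or None to leave the question unanswered."""
    best = max(scores.values(), default=0)
    if best == 0:
        return None  # no evidence at all
    winners = [choice for choice, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else None  # abstain on ties


# Made-up entity sets for three answer choices and two retrieved sentences.
choices = {"A": {"APP", "PSEN1"}, "B": {"donepezil"}, "C": set()}
sentences = [{"APP", "PSEN1", "mouse"}, {"donepezil", "APP"}]
scores = score_choices(choices, sentences)
print(scores, pick_answer(scores))
```

Returning `None` on ties or missing evidence fits the c@1 setting, where leaving a question unanswered is preferable to a likely incorrect answer.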

The bite team adapts the EAGLi question-answering system (http://eagl.unige.ch/EAGLi) using the content of MEDLINE as background knowledge. This approach was not efficient enough to perform above baseline. More details about the approaches taken by participating systems are available in the corresponding articles in this volume.

Table 5 illustrates the mean c@1 scores for each of the 4 reading tests considering all systems. This shows the difficulty of each particular test. Test 4 at 0.13 appears to be a very hard test, whereas Test 1 at 0.34 seems to be somewhat easier.

Test 1    Test 2    Test 3    Test 4
0.34      0.23      0.20      0.13

Table 5. Mean c@1 scores for each reading test.

The scores per run are provided in Table 6 in terms of overall c@1, median and standard deviation of c@1, and overall accuracy.

Conclusions

This report presented the second edition of the task Machine Reading of Biomedical Texts about Alzheimer's Disease, which was organised as a task of the QA4MRE Lab at CLEF 2013. The task focused on biomedical texts about Alzheimer's disease in English. Participating systems had to answer reading tests about the test documents provided. Each reading test consisted of 10 multiple choice questions about a document. The best system obtained a c@1 score of 0.42, which is clearly above baseline. As in the first edition of the task in 2012, many teams downloaded the data, although far fewer teams uploaded results. The reasons for this should be analyzed in order to decide about a future edition of the task.

Acknowledgments

This work was made possible through financial support from the University of Antwerp (GOA project BIOGRAPH). We are grateful to the organizers of the QA4MRE Lab at CLEF 2013 for hosting the pilot task. Vincent Van Asch, Florian Geitner, Cartic Ramakrishnan, Gully A.P.C. Burns, Pamela Forner, and Giovanni Moretti provided technical support. Elsevier was kind enough to allow us to include some of their articles in the background collection. We are grateful to Anita de Waard and Antony Scerri for providing the Elsevier documents.

[1] Al-Mubaid, H., Singh, R.: A new text mining approach for finding protein-to-disease associations. American Journal of Biochemistry and Biotechnology 1(3), 145-152 (2009)

[2] Barbosa-Silva, A., Fontaine, J., Donnard, E., Stussi, F., Ortega, J., Andrade-Navarro, M.: PESCADOR, a web-based tool to assist text-mining of biointeractions extracted from PubMed queries. BMC Bioinformatics 12 (2011)

[3] Bodenreider, O.: The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research 32(Suppl. 1), D267-D270 (2004)

[4] Etzioni, O., Banko, M., Cafarella, M.J.: Machine reading. In: Proceedings of the 21st National Conference on Artificial Intelligence. vol. 2, pp. 1517-1519. Boston, Massachusetts (2006)

[5] Gao, Y., Kinoshita, J., Wu, E., Miller, E., Lee, R., Seaborne, A., Cayzer, S., Clark, T.: SWAN: A distributed knowledge infrastructure for Alzheimer disease research. Web Semantics: Science, Services and Agents on the World Wide Web 4(3), 222-228 (2006)

[6] Morante, R., Krallinger, M., Valencia, A., Daelemans, W.: Machine reading of biomedical texts about Alzheimer's disease. In: CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers (2012)

[7] Peñas, A., Rodrigo, A.: A simple measure to assess the non-response. In: Proceedings of 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT 2011). pp. 1415-1424 (2011)

[8] Peñas, A., Hovy, E.H., Forner, P., Rodrigo, A., Sutcliffe, R.F.E., Sporleder, C., Forascu, C., Benajiba, Y., Osenova, P.: Overview of QA4MRE at CLEF 2012: Question answering for machine reading evaluation. In: CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers (2012)

[9] Peñas, A., Hovy, E.H., Forner, P., Sutcliffe, R., Morante, R., Rodrigo, A.: Overview of Question Answering for Machine Reading Evaluation 2011-2013. In: Proceedings of the Fourth International Conference of the CLEF Initiative, CLEF 2013 (2013)

[10] Ramakrishnan, C., Patnia, A., Hovy, E., Burns, G.: Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine 7(1), 7 (2012)

[11] Sagae, K., Tsujii, J.: Dependency parsing and domain adaptation with LR models and parser ensembles. In: Proceedings of the CoNLL 2007 Shared Task. Joint Conferences on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07). pp. 1044-1050. Prague, Czech Republic (2007)

[12] Settles, B.: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in texts. Bioinformatics 21(14), 3191-3192 (2005)

[13] Strassel, S., Adams, D., Goldberg, H., Herr, J., Keesing, R., Oblinger, D., Simpson, H., Schrag, R., Wright, J.: The DARPA machine reading program - encouraging linguistic and reasoning research with a series of reading tasks. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). Valletta, Malta (2010)