Disease and Disorder Template Filling Using Rule-Based and Statistical Approaches

Thierry Hamon1,2, Cyril Grouin1, and Pierre Zweigenbaum1

1 LIMSI–CNRS, Campus universitaire d'Orsay, bât. 508, rue John von Neumann, F-91405 Orsay, France
2 Université Paris 13, Villetaneuse, France

Abstract. We present the participation of LIMSI in Task 2 of the 2014 ShARe/CLEF eHealth Evaluation Lab. We used a hybrid approach based on a rule-based system and supervised classifiers, depending on the properties of the attributes. The rule-based system identified course, severity, and body location attributes based on the annotations of the training set and resources obtained from the UMLS. The HeidelTime system was used to identify the dates. A MaxEnt model was trained to detect negation and uncertainty based on the disorder and surrounding words. A decision tree detected the relation to document time based on the position of the disorder in the document and on the words in the current sentence. Our system obtained a global 5th position out of ten ranked teams (accuracy of 0.804), and ranked 2nd for the detection of the relation to document time (accuracy of 0.322).

Keywords: natural language processing, medical records, machine learning

1 Introduction

Medical records contain a wealth of information on patients covering their hospital stays, including health conditions, diagnoses, tests performed, and treatments. A large part of this information is held in free text. Information extraction from free-text medical records now has a long history [5, 16, 22]. While these earlier text analysis systems aimed at a detailed representation of text contents, more recent shared tasks (e.g., i2b2/VA 2010 [21]) have generally handled medical entities such as medical problems (aka disorders) as atomic. This was the case of the 2013 ShARe/CLEF eHealth T2 task [19], which required detecting disorder spans and their concept unique identifiers (CUIs).

In contrast, the 2014 ShARe/CLEF eHealth T2 shared task [11] focuses on the attributes of such disorders. According to the task guidelines [4], the attributes can be divided into five categories: anatomical information concerning the location of the disorders in the body (BL); assertions on the disorder concerning negation and uncertainty indications (NI, UI); clinical information describing the disorder severity (SV) and its course (CC); contextual information to identify the subject who experiences the disorder (SC) and the condition (CO) in which the disorder exists; and temporal information, including the time expression related to the disorder (TE) and the temporal relation between the disorder and the time of the document (DT).

General methods to perform this task include knowledge-based methods, which specify in which condition an attribute should be recognized for a given disorder mention, e.g., by detecting terms in lexicons or by matching lexico-syntactic patterns; and machine-learning methods, which learn to detect the presence of an attribute from a feature representation of each disorder mention. The overall approach of the LIMSI team is hybrid: depending on the properties of the attributes, we used rule-based methods relying on linguistic and terminological resources (BL, SV, CC, TE) or supervised classifiers (NI, UI, DT) to identify and normalise disorder attributes.

This paper is organized as follows. In Sec. 2 we present related work on attribute recognition in clinical texts.
Then, we detail the materials and methods we used for each attribute in Sec. 3. Results are presented and discussed in Sec. 4.

2 Related Work

Negation (NI), uncertainty (UI), subject (SC), and conditional existence (CO) were part of the task to address in the i2b2/VA 2010 [21] and i2b2 2012 [18] challenges, under a single category called "assertion". Most of the top-ten ranked systems obtained a high F-measure of around 0.93, using supervised methods or hybrid systems.

Temporal expressions (TE) and relations (including the one in DT) were addressed in the i2b2 2012 challenge [18], albeit in a slightly different way. Temporal expressions were to be detected anywhere in the text and did not need to be related to a specific disorder. Each event needed to be anchored to the patient timeline through a temporal relation. Events included "problems", which were close to the "disorders" addressed in the present task. Besides, temporal relation targets could be any event or temporal expression, including the admission or discharge dates. The top-ten systems obtained F-measures of 0.45–0.66 for finding the normalized value of a temporal expression (Timex3), and 0.43–0.69 for temporal relations between any pair of events.

Anatomical parts (BL) have been included in manual annotations in a few corpora, including MiPACQ [1] and Quaero [13]. Roberts et al. [14] report an F-measure of 0.86 for extracting the anatomical site of an actionable finding in radiology reports.

3 Materials and Methods

3.1 Data

The corpus used for the 2014 ShARe/CLEF eHealth evaluation lab consists of de-identified plain-text EMRs from the MIMIC II database, version 2.5 [15]. The EMR documents were extracted from the intensive-care unit setting and included discharge summaries, electrocardiography reports, echography reports, and radiology reports. The training set contained 299 documents and a total of 182,056 words, while the test set contained 133 documents and a total of 153,558 words (see Tab. 1). In Tab. 2, we give a few statistics for each attribute from the training corpus.

Table 1. Description of the corpora

               Training     Test
Documents           299      133
Words           182,056  153,558
Distinct CUIs     1,356    1,141

Table 2. Statistics of classes for each attribute from the training corpus

Attribute                   Classes
Body Location (BL)          null (5,131), C0000726 (403), C0003501 (295), C0225897 (271), C0278454 (215), C0026264 (208), C0018787 (183), C0817096 (172), C0024109 (134), C0031050 (127), C0007226 (103), C0018792 (101), etc.
Conditional Class (CO)      false (10,993), true (560), null (1)
Course Class (CC)           unmarked (10,887), increased (234), decreased (186), improved (101), worsened (67), resolved (63), changed (12), null (3), no (1)
DocTime Class (DT)          overlap (6,851), before overlaps (2,814), before (1,391), after (442), unknown (55)
Generic Class (GC)          false (11,553), null (1)
Negation Indicator (NI)     no (9,349), yes (2,205)
Severity Class (SV)         unmarked (10,344), moderate (671), severe (410), slight (128), null (1)
Subject Class (SC)          patient (11,467), family member (72), other (13), donor other (1)
Temporal Expression (TE)    none (8,094), date (3,266), duration (131), time (62), overlap (1)
Uncertainty Indicator (UI)  no (13,539), yes (1,014), null (1)
3.2 System Design

Three types of methods were used in our system, depending on the properties of each attribute:

– (i) because unbalanced distributions are hard to process (see Tab. 2), attributes with a very large majority class (SC, CO, GC) were handled minimally: the majority value was systematically returned for such attributes;1
– attributes with more variation were handled with either
  • (ii) human-designed resources and rules, if clear clues could be collected and organized to make a decision for such attributes (CC, BL, SV, TE);
  • or (iii) supervised classification, if some of the clues played a less categorical role in decision-making (NI, UI, DT).

1 Subject Class=patient; Conditional Class=false; Generic Class=false.

We detail below the methods used for attributes with more variation: rule-based detection of temporal expressions (Sec. 3.3), resource-based detection of body location, severity and course (Sec. 3.4), supervised detection of negation and uncertainty (Sec. 3.5), and supervised detection of the DocTime class (Sec. 3.6).

3.3 Rule-Based Detection of Temporal Expressions

To identify the temporal expressions, we used the rule-based temporal tagger HeidelTime [17], which we tuned for clinical texts during the 2012 i2b2 challenge [9]. This tuned version of HeidelTime includes linguistic patterns specific to medical and especially clinical temporal expressions, such as postoperative day four, day of life, etc. For the CLEF eHealth challenge, we only used the date expressions HeidelTime recognises, since the other temporal expressions (duration and time) were too rare in the training set and their recognition decreased the performance of our system on that corpus.

3.4 Resource-Based Detection of Body Location, Severity and Course

The recognition of the terms for the attributes body location (BL), course (CC), and severity (SV) was based on resources specifically built for each attribute. Since course and severity were marked with fairly regular clues in the training corpus (see Tab. 2), we used the annotations of the training set as resources to identify linguistic expressions related to these two attributes.

Terminological resources used for the recognition of terms referring to body locations were built from the training annotations as well as from UMLS Metathesaurus terms from selected source vocabularies. During preliminary experiments on the training set, we observed that terms found in some UMLS vocabularies tend to decrease the quality of the annotation. For this reason, we only considered UMLS terms obtained from four source vocabularies:

– Health Level Seven Vocabulary (HL7),
– Metathesaurus Forms of FDA National Drug Code Directory (FDA),
– University of Washington Digital Anatomist (UWDA),
– and UMLS Metathesaurus specific terms (MTH).

Ambiguous annotations such as "a" or "his" occurring in the training annotations were removed from the body location resource we used. We considered the CUIs as fine-grained semantic tags associated with the BL terms.

These resources were used by the TermTagger Perl module2 to recognise SV and CC mentions and BL terms. The clinical texts were also semantically tagged with the CUIs associated with BL terms. Term tagging is integrated in the Ogmios platform [10], which first performs POS-tagging with GeniaTagger [20]. For each disorder mention, a post-processing step selected the BL, SV and CC terms found in the sentence where the disorder occurs.

2 http://search.cpan.org/~thhamon/Alvis-TermTagger/
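To make this step concrete, the following minimal Python sketch applies lexicon lookup within the sentence of a disorder mention. It is not the actual Alvis TermTagger / Ogmios pipeline: the lexicon entries and term-to-CUI pairs are illustrative, real entries may span several tokens, and the real system works on POS-tagged text.

```python
# Minimal sketch of resource-based attribute tagging (illustrative lexicons;
# the submitted system used the Alvis TermTagger Perl module instead).

BODY_LOCATION = {"heart": "C0018787", "chest": "C0817096", "atrium": "C0018792"}
SEVERITY = {"slight", "moderate", "severe"}
COURSE = {"increased", "decreased", "improved", "worsened", "resolved", "changed"}

def tag_sentence(tokens):
    """Collect BL (with its CUI as a semantic tag), SV and CC mentions
    found in the sentence containing a disorder mention."""
    bl = [(t, BODY_LOCATION[t.lower()]) for t in tokens if t.lower() in BODY_LOCATION]
    sv = [t for t in tokens if t.lower() in SEVERITY]
    cc = [t for t in tokens if t.lower() in COURSE]
    return bl, sv, cc

print(tag_sentence("Severe heart failure , worsened since admission".split()))
# -> ([('heart', 'C0018787')], ['Severe'], ['worsened'])
```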
3.5 Supervised Detection of Negation and Uncertainty

System Description. Based upon an empirical analysis of the training corpus, we prepared a list of clues we found relevant to detect negation and uncertainty:

– Negation clues: negative, no, not, without, denies, deny;
– Uncertainty clues: appear, assess, could, evaluate, likely, may, possible, possibility, possibilities, prior, probable, questionable, somewhat, suggesting, suspicion, unknown.

We also marked the PATIENT/TEST subsection header as an uncertainty clue, since we observed that disorders in this subsection were associated with an uncertainty indicator in the training corpus. These clues were then used to mark the part of a sentence following a clue as negated or uncertain, thereby implementing a simplified scope detection method.

In order to detect negated and uncertain disorders, we designed two distinct models based upon the Maximum Entropy framework [2, 7] as implemented in the Wapiti toolkit3 [12]: one model for negation identification (NI), and one for uncertainty identification (UI). Our models rely on both surface and external features:

– Surface features: (i) the whole entity, (ii) each token from the entity as a bag of words, (iii) the capitalization of each token among four schemas (all upper case, all lower case, combination of upper and lower case, not relevant), and (iv) the three tokens preceding the entity to process;
– External features: (i) the Concept Unique Identifier (CUI) of the whole entity as found in the UMLS Metathesaurus [3], and (ii) whether the part of the sentence where the entity is found is negated or uncertain, based upon negation and uncertainty clues found before the current entity.

Example. For the entity "Allergies to Drugs" in the sentence "Patient recorded as having No Known Allergies to Drugs", we used the following features:

– Whole entity: Allergies to Drugs;
– Tokens from the entity (bag of words): Allergies, to, Drugs;
– Capitalization of each token: Mm, mm, Mm (i.e., the first and third tokens combine lower and upper case while the second token is only in lower case);
– Three tokens preceding the entity (bag of words): having, No, Known;
– CUI of the entity: C0013182;
– Part of the sentence where the entity is found marked as negated or uncertain: NEG (the clue "no" was found in the left context of the entity).

3 http://wapiti.limsi.fr/
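The sketch below reproduces this feature extraction on the example sentence. It is a simplification: the submitted system fed such features to MaxEnt models trained with Wapiti, and only the negation clue list is shown here (the same mechanism applies to uncertainty).

```python
# Sketch of the NI/UI feature extraction described above (simplified; the
# submitted system trained Wapiti MaxEnt models on features of this kind).

NEG_CLUES = {"negative", "no", "not", "without", "denies", "deny"}

def cap_schema(token):
    """Four-way capitalization schema of a token."""
    if token.isupper():
        return "MM"        # all upper case
    if token.islower():
        return "mm"        # all lower case
    if any(c.isupper() for c in token) and any(c.islower() for c in token):
        return "Mm"        # combination of upper and lower case
    return "na"            # not relevant (digits, punctuation, ...)

def features(tokens, start, end, cui):
    """Features for the entity tokens[start:end] bearing UMLS concept cui."""
    entity = tokens[start:end]
    return {
        "entity": " ".join(entity),                # (i) whole entity
        "bow": sorted(set(entity)),                # (ii) entity tokens
        "caps": [cap_schema(t) for t in entity],   # (iii) capitalization
        "left3": tokens[max(0, start - 3):start],  # (iv) 3 preceding tokens
        "cui": cui,                                # external: UMLS CUI
        # external: simplified scope detection -- a clue occurring before
        # the entity marks its part of the sentence as negated
        "negated": any(t.lower() in NEG_CLUES for t in tokens[:start]),
    }

sent = "Patient recorded as having No Known Allergies to Drugs".split()
print(features(sent, 6, 9, "C0013182"))
# caps -> ['Mm', 'mm', 'Mm'], left3 -> ['having', 'No', 'Known'], negated -> True
```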
3.6 Supervised Detection of the Temporal Relation to Document Time

The Document Time attribute encodes the temporal relation between a disorder and the date of the document. Clinical reports often follow the chronological order of reported events, and a study of the training corpus confirmed this principle. It also showed that the document structuring into sections often goes together with specific distributions of temporal relations in each section. For example, the Chief Complaint section typically covers past disorders, the Pertinent Results section describes disorders which overlap the hospital stay, and the Medications on Discharge section mentions disorders that may occur after discharge. We therefore emphasized the use of document structure as an important clue to determine the temporal relation of a disorder. To do so, we compiled a list of the most frequent section headers found in the training corpus, and encoded it as patterns to detect 26 section types. We also modeled the position of a disorder in a document as both its character offset and its relative position, obtained by cutting the text into five equal-sized bins.

In principle, verb tense should also contribute to relative time positioning; unfortunately, we could not test it for want of time.

We addressed this sub-task as a supervised classification task with four classes: before, before overlaps, overlap, after. For each disorder, we collected the following features (a sketch follows this section):

– position in the text (absolute, and discretized into five equal bins);
– document type, section type, and their conjunction;
– tokens in the sentence, as a bag of words.

The conversion of sentences into bags of words considered the absence or presence of each word with at least 10 occurrences in the set of sentences for each class.

We tested several classifiers from the Weka toolkit [8] by training and testing them in ten-fold cross-validation on the training set (see Tab. 3): majority class (ZeroR: overlap), a rule set operating on a single feature (OneR, which selected the conjoined feature document type + section type), Naïve Bayes (NB), Decision Tree (J48, confidence threshold 0.4, minimal number of instances per leaf 10), k Nearest Neighbors (kNN with k = 1, 3, 5), and SVM (SMO with a polynomial kernel). The best results on the training set before the submission were obtained by the decision tree, which was therefore used as the classifier for the test corpus. As Tab. 3 shows, although slightly better results could be achieved after the submission with similarity-based classifiers such as kNN or SVM, the obtained range seems to be close to the maximum that can be obtained with the current features.

Table 3. Document Time attribute: accuracy on the training set with various classifiers (ten-fold cross-validation). The kNN and SVM results were obtained after the submission.

Classifier  ZeroR  OneR   NB     J48    kNN (k = 1, 3, 5)    SVM
Accuracy    0.593  0.739  0.788  0.814  0.842, 0.823, 0.813  0.844
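The following Python sketch makes the feature representation concrete, substituting scikit-learn's DecisionTreeClassifier for Weka's J48; the toy documents, section names, and labels are invented for illustration.

```python
# Analogous sketch of the DocTime classification step. The submitted system
# used Weka's J48 (confidence 0.4, >= 10 instances per leaf); scikit-learn's
# DecisionTreeClassifier plays the same role here on toy data.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def doctime_features(offset, doc_len, doc_type, section, sentence):
    feats = {
        "offset": offset,                                   # absolute position
        "bin": min(4, int(5 * offset / max(1, doc_len))),   # five equal bins
        "doc_type": doc_type,
        "section": section,
        "doc+sec": doc_type + "|" + section,                # conjoined feature
    }
    for w in sentence.lower().split():                      # sentence bag of words
        feats["w=" + w] = 1
    return feats

train = [
    (doctime_features(120, 5000, "DS", "ChiefComplaint",
                      "patient presented with chest pain"), "before"),
    (doctime_features(2600, 5000, "DS", "PertinentResults",
                      "ct showed a small effusion"), "overlap"),
    (doctime_features(4700, 5000, "DS", "MedsOnDischarge",
                      "monitor for recurrent rash"), "after"),
]
X, y = zip(*train)
clf = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
clf.fit(list(X), list(y))
print(clf.predict([doctime_features(200, 5000, "DS", "ChiefComplaint",
                                    "worsening chest pain")]))
```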
3.7 Submissions

We submitted two system outputs based upon the predictions of the previous modules. The only difference between the two submissions pertained to the Temporal Expression attribute: the first submission only targeted the classes date and none, which were most often found with this attribute (see Tab. 2), while the second submission also took into account the less represented time and duration classes.

4 Results and Discussion

4.1 Evaluation Metrics

The official evaluation measure is the overall average accuracy, where the accuracy of each attribute is defined as

    Accuracy = Correct / Total                    (1)

where Correct is the number of entities with a correctly predicted value and Total is the number of entities in the gold standard annotations.

4.2 Results on the Training Set

To estimate the performance of the system, two methods can be used. Admittedly, the method which best helps predict future results on unseen data is cross-validation, i.e., preparing a system based on a subset of the training data and testing it on the rest, repeating the process on different splits of the training data. This is easy to do for machine-learning systems: Table 3 showed the accuracy obtained on the Document Time attribute with ten-fold cross-validation on the training set. For knowledge-based systems, however, it is more cumbersome to use multiple splits of the same dataset, since the human knowledge engineer / system developer cannot "forget" the data she has seen in a previous split to prepare a new version of the system. Working on one split is possible, although less predictive of future results.

What we present here is simply the application of the system modules prepared on the training set and tested on the training set itself. While this is not in principle highly predictive of future results, it often gives an idea of where the system stands. Table 4 shows the overall results obtained this way, while Table 5 provides detailed information for each attribute. The obtained results are likely to be optimistic, especially for machine-learning systems, which generally tend to overfit the training data. We return to them when examining the results on the test data.

Table 4. Results on the training set

Submission  Accuracy  F-measure  Recall  Precision
#1          0.884     0.684      0.674   0.693
#2          0.882     0.682      0.677   0.686

Table 5. Detailed results on the training set for each attribute

Method          Attribute  Accuracy  F-measure  Recall  Precision
Default value   GC         1.000     0.000      0.000   0.000
                SC         0.992     0.000      0.000   0.000
                CO         0.950     0.000      0.000   0.000
Resource-based  SV         0.877     0.420      0.573   0.332
                CC         0.859     0.388      0.832   0.253
                BL         0.511     0.375      0.404   0.350
Rule-based      TE (#1)    0.692     0.071      0.040   0.289
                TE (#2)    0.678     0.104      0.065   0.258
MaxEnt          NI         0.966     0.905      0.827   0.998
                UI         0.989     0.936      0.884   0.994
Decision tree   DT         N/A       N/A        N/A     N/A

4.3 Global Results

Table 6 shows the official results we achieved on the test set. Our first submission ranked 6th out of 12 submissions, and 5th out of 10 participating teams. The overall accuracy is not much lower than that obtained on the training data (−0.08); however, recall, precision, and F-measure are much lower (roughly halved).

Table 6. Official results on the test set

Submission  Accuracy  F-measure  Recall  Precision
#1          0.804     0.315      0.303   0.330
#2          0.801     0.315      0.290   0.333

4.4 Detailed Results per Attribute

Table 7 displays the detailed results we achieved on the test set for each attribute. For the Temporal Expression (TE) attribute, we indicate the results we achieved for both submissions. For the other attributes, there is no difference between submissions #1 and #2. We also indicate the δ value between our submissions and the best submission for each attribute.

Table 7. Detailed results on the test set for each attribute. The δ value for each attribute is its difference to the best system.

Method          Attribute  Accuracy  δ       F-measure  Recall  Precision
Default value   GC         1.000     −0.000  0.000      0.000   0.000
                SC         0.984     −0.011  0.000      0.000   0.000
                CO         0.936     −0.042  0.000      0.000   0.000
Resource-based  SV         0.900     −0.082  0.395      0.282   0.663
                CC         0.853     −0.118  0.281      0.172   0.765
                BL         0.504     −0.293  0.277      0.248   0.313
Rule-based      TE (#1)    0.839     −0.025  0.092      0.186   0.061
                TE (#2)    0.806     −0.058  0.126      0.156   0.106
MaxEnt          NI         0.902     −0.067  0.722      0.879   0.612
                UI         0.801     −0.159  0.026      0.018   0.044
Decision tree   DT         0.322     −0.006  0.322      0.322   0.322

The results for attributes handled through default values or resources are very close to those obtained on the training set. Surprisingly, the accuracy obtained for the rule-based TE attribute is much better on the test set than on the training set. A possible explanation is that the test set only contained discharge summaries, whereas the training set also contained echography, ECG, and radiography examination reports; temporal expressions may be more regular in discharge summaries.
4.5 Discussion

For highly unbalanced attributes (GC, SC, CO), the decision not to process these attributes and to return the majority class instead proved sound: we achieved our best accuracy values on these three attributes. We notice that most teams did the same for GC (which did not vary at all in the training set), four other teams did the same for CO (we ranked 5th, ahead of 2 teams), and one other team did so for SC (we ranked 5th, ahead of 4 teams). For SC, the distance to the best team, which obtained near-perfect results, is only 0.009; for CO, it was 0.042: there is more to gain there with a more precise strategy.

For attributes relying on lists, the resource-based approach obtained moderate results. For example, the Body Location attribute takes a high number of distinct CUI values, which makes it difficult to detect with high accuracy. The simple dictionary-based method that we used to detect BL mentions, combined with a co-occurrence-based method to associate them with a disorder, underperformed compared to other participants (−0.29 wrt. the best system). The detection of the CC and SV attributes, based solely on clue words found in the training set, also underperformed wrt. other participants, both ranking last with differences of −0.12 and −0.08 respectively wrt. the best system.

Our choice to process Temporal Expressions with the HeidelTime tool and specifically designed rules allowed us to achieve an accuracy of 0.839, with a small δ of 0.025 wrt. the best system. The addition of the less represented time and duration classes was detrimental to this module.

The MaxEnt model we designed for negation identification performed well, achieving a 0.902 accuracy with a small δ of 0.067 wrt. the best system on this attribute (the amplitude of accuracy on this attribute is 0.207 between the first and the last system). However, the MaxEnt model we created for uncertainty identification obtained quite low results, with an accuracy of 0.801, our system ranking last on this attribute. Given its similarity of design to the negation identification module and the very low precision and recall scores it obtained, we suspect this might be due to a bug in this module.

The detection of the DT attribute (temporal relation to document time), with an emphasis on the position of the disorder in the document structure (section type and relative position in the document), performed on par with the best system. Its use of the document type as one of the features may have helped it perform well on the test set, which only contained discharge summaries, in contrast to the training set, which included four types of documents. We have seen in further experiments on the training set that the use of similarity-based classifiers (kNN or SVM) instead of the decision tree might improve its results. Besides, it currently does not take verb tense into account, which can be expected to be an important clue for this attribute.

Finally, let us note that the accuracy scores obtained by the participants on the test corpus for this temporal relation task are the lowest among all attributes. They are much lower than those obtained on the training set (0.81 for our classifier in ten-fold cross-validation). They are also much lower than those obtained in the i2b2 2012 challenge on temporal relation detection (F-measures of the ten best systems in the 0.43–0.69 range) [18]. Our own work in the i2b2 2012 challenge [6] studied the relative recall of our classifiers.
The temporal relations of i2b2 2012 that were closest to the DT attribute of the present task were those between an event and the admission (AD) or discharge (DD) date. For these two relations, we obtained F-measures, recalls, and precisions of (0.86, 0.80, 0.94) and (0.63, 0.51, 0.83) respectively (see Figure 4 in [6], relations timex3_event_dd_hc and timex3_event_ad_hpi), also much higher than the scores for the DT attribute. However, the events in i2b2 2012 included more event types than disorders alone, which may change the difficulty of the task.

5 Conclusion and Perspectives

We designed several systems to address the disease and disorder template filling task of ShARe/CLEF eHealth 2014. We chose the method to use (either a rule-based or a supervised approach) depending on the characteristics of each attribute: resource-based for attributes where a dictionary was an important component (e.g., BL), rule-based where patterns were important (TE), and supervised machine learning where the determination of the attribute value was based on distributions of features and relied on a study of their context (e.g., NI and DT).

While we achieved a high accuracy by using default values in the case of very unbalanced attributes, we consider that this is not satisfactory. A better study of the contexts occurring near disorders should allow us to highlight clues that could be used either to produce rules or to train statistical models (taking into account the specific distribution of values of these attributes). The resource-based methods that we used probably need to be complemented with additional features to take better account of their context of occurrence. The supervised methods obtained high accuracies on the NI and UI attributes on the training set, although the UI module underperformed on the test set, where we suspect a bug. The accuracy on the DT attribute was low for all participants, pointing at it as the hardest of all attributes: our system performed on par with the top system on this attribute, and we discussed directions to improve it further.

Acknowledgments

We acknowledge the Shared Annotated Resources (ShARe) project funded by the United States National Institutes of Health under grant number R01GM090187. This work was partly funded through the Accordys4 project, funded by ANR under grant number ANR-12-CORD-0007-03.

4 Accordys: Agrégation de Contenus et de COnnaissances pour Raisonner à partir de cas de DYSmorphologie fœtale, Content and Knowledge Aggregation for Case-based Reasoning in the field of Fetal Dysmorphology (ANR 2012–2015).

References

1. Daniel Albright, Arrick Lanfranchi, Anwen Fredriksen, William F. Styler IV, Colin Warner, Jena D. Hwang, Jinho D. Choi, Dmitriy Dligach, Rodney D. Nielsen, James Martin, Wayne Ward, Martha Palmer, and Guergana K. Savova. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc, 20(5):922–930, Sep-Oct 2013.
2. Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
3. Olivier Bodenreider. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res, 32:D267–D270, 2004.
4. Noémie Elhadad, Wendy W. Chapman, Tim O'Gorman, Martha Palmer, and Guergana K. Savova. The ShARe schema for the syntactic and semantic annotation of clinical texts. 2014. Under review.
5. Carol Friedman, Philip O. Alderson, John H. M. Austin, James J. Cimino, and Stephen B. Johnson. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc, 1(2):161–174, 1994.
6. Cyril Grouin, Natalia Grabar, Thierry Hamon, Sophie Rosset, Xavier Tannier, and Pierre Zweigenbaum. Eventual situations for timeline extraction from clinical reports.
J Am Med Inform Assoc, 20(5):820–827, Sep-Oct 2013.
7. Silviu Guiasu and Abe Shenitzer. The principle of maximum entropy. The Mathematical Intelligencer, 7(1), 1985.
8. Mark A. Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explor Newsl, 11(1), 2009.
9. Thierry Hamon and Natalia Grabar. Tuning HeidelTime for identifying time expressions in clinical texts in English and French. In Proc of the International Workshop on Health Text Mining and Information Analysis (LOUHI 2014), pages 101–105, Gothenburg, Sweden, April 2014.
10. Thierry Hamon, Adeline Nazarenko, Thierry Poibeau, Sophie Aubin, and Julien Derivière. A robust linguistic platform for efficient and domain specific web content analysis. In Proceedings of RIAO 2007, Pittsburgh, USA, 2007. 15 pages.
11. Liadh Kelly, Lorraine Goeuriot, Gondy Leroy, Hanna Suominen, Tobias Schreck, Danielle L. Mowery, Sumithra Velupillai, Wendy W. Chapman, Guido Zuccon, and Joao Palotti. Overview of the ShARe/CLEF eHealth evaluation lab 2014. In Proceedings of the ShARe/CLEF eHealth Evaluation Lab. Springer-Verlag, 2014.
12. Thomas Lavergne, Olivier Cappé, and François Yvon. Practical very large scale CRFs. In Proc of ACL, pages 504–513, Uppsala, Sweden, July 2010.
13. Aurélie Névéol, Cyril Grouin, Jérémy Leixa, Sophie Rosset, and Pierre Zweigenbaum. The Quaero French medical corpus: A ressource for medical entity recognition and normalization. In Proc of BioTxtM, Reykjavik, Iceland, 2014.
14. Kirk Roberts, Bryan Rink, Sanda M. Harabagiu, Richard H. Scheuermann, Seth Toomay, Travis Browning, Teresa Bosler, and Ronald Peshock. A machine learning approach for identifying anatomical locations of actionable findings in radiology reports. In AMIA Annu Symp Proc, volume 2012, pages 779–788, 2012.
15. Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-Wei Lehman, George B. Moody, Thomas Heldt, Tin H. Kyaw, Benjamin E. Moody, and Roger G. Mark. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access ICU database. Crit Care Med, 39:952–960, 2011.
16. Naomi Sager, Carol Friedman, and Margaret S. Lyman, editors. Medical Language Processing: Computer Management of Narrative Data. Addison Wesley, Reading, MA, 1987.
17. Jannik Strötgen and Michael Gertz. Temporal tagging on different domains: Challenges, strategies, and gold standards. In Proc of LREC, pages 3746–3753, 2012.
18. Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. Evaluating temporal relations in clinical text: 2012 i2b2 challenge overview. J Am Med Inform Assoc, 20(5):806–813, Sep-Oct 2013.
19. Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana K. Savova, Noémie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J. F. Jones, Johannes Leveling, Liadh Kelly, Lorraine Goeuriot, David Martinez, and Guido Zuccon. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In Proceedings of CLEF 2013, Lecture Notes in Computer Science, Berlin Heidelberg, 2013. Springer.
20. Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. Developing a robust part-of-speech tagger for biomedical text.
In Proc of Advances in Informatics – 10th Panhellenic Conference on Informatics, LNCS 3746, pages 382–392, 2005.
21. Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc, 18(5):552–556, Sep-Oct 2011.
22. Pierre Zweigenbaum. Menelas: an access system for medical records using natural language. Computer Methods and Programs in Biomedicine, 45:117–120, 1994.