The NLM Medical Text Indexer System for Indexing Biomedical Literature

James G. Mork¹, Antonio J. Jimeno Yepes²,¹, Alan R. Aronson¹

¹ National Library of Medicine, Bethesda, MD, USA
{mork,alan}@nlm.nih.gov
² NICTA Victoria Research Lab, Melbourne, Australia
antonio.jimeno@gmail.com

Abstract. In the face of a growing workload and dwindling resources, the US National Library of Medicine (NLM) created the Indexing Initiative project in the mid-1990s. This cross-library team’s mission is to explore indexing methodologies that can help ensure that MEDLINE and other NLM document collections maintain their quality and currency and thereby contribute to NLM’s mission of maintaining quality access to the biomedical literature. The NLM Medical Text Indexer (MTI) is the main product of this project and has been providing indexing recommendations based on the Medical Subject Headings (MeSH) vocabulary since 2002. In 2011, NLM expanded MTI’s role by designating it as the first-line indexer (MTIFL) for a few journals; today the MTIFL workflow includes about 100 journals and continues to increase. Due to a close collaboration with the Index Section at NLM, MTI continues to grow and expand its ability to provide assistance to the indexers. This paper provides an overview of MTI’s functionality, performance, and its evolution over the years.

Keywords: Indexing methods, Text categorization, MeSH, MEDLINE

1 Introduction

The NLM Medical Text Indexer (MTI) system [1] is the primary product and focus of the Indexing Initiative [2]. MTI produces both semi- and fully-automated indexing recommendations based on the Medical Subject Headings (MeSH®)¹ controlled vocabulary and has been in use at NLM since 2002. MTI is in daily use to assist Indexers, Catalogers, and NLM’s History of Medicine Division (HMD) in their indexing efforts.
Every weeknight MTI provides recommendations for approximately 4,000 new citations for Indexing and processes a mixed file of approximately 7,000 old and new records for both Cataloging and HMD. MTI was also used on a regular basis between 2002 and 2012 to provide fully-automated keyword indexing for NLM’s Gateway² meeting abstract collection, which was not manually indexed. In 2011, MTI was designated as the First-Line Indexer (MTIFL) for 14 journals (89 in 2013) because of its success with those publications. For MTIFL journals, MTI indexing is treated like human indexing and is, of course, subject to the normal manual review process. MEDLINE® Indexers and Revisers consult MTI recommendations for approximately 58% of the articles they index, and the MTI recommendations are tightly integrated into the Cataloging and HMD system. Although mainly used in indexing efforts for processing MEDLINE citations³ consisting of identifier, title, and abstract, MTI is also capable of processing arbitrary biomedical text.

MTI provides an ordered list of MeSH Main Headings (MH), Subheadings (SH), and CheckTags (CT)⁴ as a final result. MHs are the main descriptors or headings from the MeSH Vocabulary (e.g., Lung). SHs are used to qualify the MHs (e.g., Lung/abnormalities means that the article is about the abnormalities associated with the Lung more than the Lung itself), and CTs are a special type of MH that is required to be included for each article and covers species, sex, human age groups, historical periods, pregnancy, and various types of research support (e.g., Male).

¹ http://www.nlm.nih.gov/pubs/factsheets/mesh.html
² http://www.nlm.nih.gov/pubs/factsheets/gateway.html

2 Processing Overview

The Indexing Initiative explored several indexing methods [2], eventually implementing two of the best ones as a prototype indexing system which became the NLM Medical Text Indexer (MTI).
Normal MTI processing involves receiving a daily XML-formatted MEDLINE⁵ file which contains a list of Completed, In-Process, and In-Data-Review citations and a list of Deleted PMIDs (PubMed® Unique Identifiers). All processing is done offline, and the MTI results are then stored in a database for later use by the Indexers. This preloading of the results is necessary since MTI takes too long to be run in real time for the Indexers. Fig. 1 depicts the processing flow as MEDLINE citations are processed through the various components of the MTI system. Each of the major MTI components is described briefly below.

Fig. 1. MTI Process Flow Diagram

³ http://www.nlm.nih.gov/bsd/mms/medlineelements.html
⁴ http://www.nlm.nih.gov/mesh/features2003.html
⁵ http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html

MetaMap Indexing (MMI) [3]: a method that applies a ranking function to concepts found by MetaMap [4]. Generally speaking, the MMI ranking function was designed to indicate the characterizing power or “aboutness” of a given concept for a piece of text, e.g., a MEDLINE citation. It is the product of a frequency factor and a relevance factor, which is essentially measured by MeSH Tree depth. For concepts found in the title of the citation, there is a simplified form of the function which maximizes the frequency factor.

PubMed Related Citations [5]: the neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths. MTI currently uses two methods for determining PubMed Related Citations (PRC) for the text it is processing. If MTI is working with a MEDLINE citation and there are enough indexed PRC defined by the PubMed system⁶, MTI uses that list of PRC. If MTI is processing free-form text or there is an insufficient number of indexed PRC, MTI will default to using the in-house TexTool⁷ implementation of PRC.
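The choice between the two PRC sources described above can be sketched as a simple fallback. This is only an illustration under stated assumptions, not MTI's actual code: the function name `choose_related_citations`, the `textool_prc` callable, and the cutoff `MIN_INDEXED_PRC` are hypothetical, since the text does not say how many indexed neighbors count as "enough".

```python
from typing import Callable, List, Optional

# Hypothetical cutoff; the paper does not state the real threshold.
MIN_INDEXED_PRC = 10


def choose_related_citations(
    pubmed_prc: Optional[List[str]],
    textool_prc: Callable[[str], List[str]],
    text: str,
    min_indexed: int = MIN_INDEXED_PRC,
) -> List[str]:
    """Pick the list of related citations to use for one input.

    pubmed_prc  -- indexed neighbors supplied by PubMed, or None when the
                   input is free-form text with no PMID.
    textool_prc -- stand-in for the in-house TexTool neighbor search.
    text        -- the title/abstract or free text being indexed.
    """
    # Prefer PubMed's own neighbor list when it is available and large enough.
    if pubmed_prc is not None and len(pubmed_prc) >= min_indexed:
        return pubmed_prc
    # Otherwise (free text, or too few indexed neighbors) fall back.
    return textool_prc(text)
```

Passing the fallback in as a callable keeps the expensive neighbor search from running when PubMed already supplies a usable list.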
MEDLINE is the indexed subset of PubMed.

Restrict to MeSH [6]: a method which finds the closest MHs to UMLS® Metathesaurus®⁸ concepts. Three basic approaches can be used to map a UMLS concept to MeSH: through synonyms, through built-in mappings, and through inter-concept relationships. These approaches can be combined into a strategy that maximizes both specificity (selected MeSH terms are relevant) and sensitivity (the number of concepts that fail to be mapped to MeSH is small).

Extract MeSH Descriptors: retrieving the MeSH Heading lines from the PRC in MEDLINE format and tracking whether the MeSH Heading is a main (starred) term or not. Note that MTI does not recommend main vs. non-main status to the Indexers, but the status is tracked internally to see if MTI is improving or not.

Clustering and Ranking [7]: the ranked lists of MHs produced by the methods described so far must be clustered into a single, final list of recommended indexing terms. The task here is to provide a weighting of the confidence or strength of belief in the assignment, and rank the suggested headings appropriately.

Post-Processing: once all of the recommendations are ranked and selected, validation of the recommendations is done based on the targeted end-user. Typically, CTs are added based on triggers from the text and for the remaining recommended headings, a machine learning algorithm is applied to add frequently occurring CTs [8,9], and finally MTI performs subheading attachment [10-12] to individual headings and for the text in general.

⁶ http://www.nlm.nih.gov/pubs/factsheets/pubmed.html
⁷ http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/TexTool/
⁸ http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html

Not all citations processed by MTI go through all of the components listed above. MTI has various filtering levels and special handling rules which require different processing pathways.
Basic filtering rules have evolved over time based on ambiguities in the UMLS Metathesaurus, ambiguity in the text, feedback from Indexers, etc.

3 MTI Filtering and Post-Processing

MTI has three levels of filtering which can be selected depending on the circumstances. Base Filtering, or High Recall Filtering, is performed for all citations and free text, regardless of whether any further filtering has been selected or not. High Recall Filtering is used for MEDLINE indexing recommendations and tends to provide a list of approximately 25 recommendations with most of the good recommendations near the top of the list. Balanced Recall/Precision Filtering provides filtering which looks at the compatibility and context of the recommendations based on what path(s) made the recommendation and provides a good balance between number of recommendations and the filtering out of good recommendations. Balanced Recall/Precision Filtering was developed for use in the fully-automatic processing of the NLM Gateway abstracts and is now used for MTIFL processing. High Precision Filtering is the last filtering option and provides the highest level of accuracy by requiring recommendations to come from both MetaMap (MMI) and PubMed Related Citations (PRC). This provides a small list of quality MTI recommendations while filtering out many good recommendations as well. The High Precision Filtering option is not currently used since it provides such a short list of recommendations.

Once filtering is accomplished, post-processing is performed regardless of the filtering level used.
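The High Precision Filtering rule described above, which keeps a recommendation only when both MMI and PRC produced it, can be sketched as follows. The data layout (a mapping from each recommended heading to the set of paths that proposed it) and the function name `high_precision_filter` are assumptions made for illustration; MTI's internal representation is not described in this paper.

```python
from typing import Dict, List, Set

# Path labels taken from the text: MetaMap Indexing and PubMed Related Citations.
REQUIRED_PATHS: Set[str] = {"MMI", "PRC"}


def high_precision_filter(recommendations: Dict[str, Set[str]]) -> List[str]:
    """Keep only headings that were recommended by every required path.

    recommendations maps a MeSH heading to the set of paths that proposed it
    (hypothetical structure). Input order is preserved in the result.
    """
    return [
        heading
        for heading, paths in recommendations.items()
        if REQUIRED_PATHS <= paths  # subset test: both MMI and PRC present
    ]


example = {
    "Lung": {"MMI", "PRC"},
    "Humans": {"PRC"},
    "Asthma": {"MMI"},
}
print(high_precision_filter(example))  # prints ['Lung']
```

In this toy input two of the three headings are dropped, which mirrors the trade-off noted in the text: the surviving list is short and may exclude good recommendations that only one path found.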
Post-processing involves cleaning up the final recommendation list by removing any terms that survived the filtering process but are invalid for the target audience; filling out the list of terms by adding CTs, Geographicals, and other MHs based on the text, a machine learning algorithm, and lookup lists; and then finally attaching subheadings to the individual MHs and creating a global list of subheadings applicable to the text.

Since MeSH indexing can be viewed as a categorization task, we use machine learning in the post-processing stage in an effort to improve both Recall and Precision on the most frequently used terms in MeSH [8,9].

MTI’s final step in creating its indexing recommendations is to perform subheading attachment [10-12]. Subheading attachment is currently only done for the Indexers since Cataloging and HMD do not utilize subheadings. Due to the complexity of the data manipulation required for subheading attachment, it is not provided as a user option to MTI. Subheadings are not attached to every MH recommended by MTI; the subheading attachment algorithms use several linguistic and statistical methods to determine what is appropriate for each MH based on the text and which subheadings are allowable for each MH. MeSH specifies a subset of the subheadings that are allowed for each MH, so the subheading attachment algorithms utilize these rules to ensure that non-allowed combinations are not recommended by MTI. Based on the results of two user-centered studies [13,14], at most three subheadings are attached to each MH.

4 MTI Performance

MTI has shown a steady increase in usage and acceptance by the NLM indexers since 2002 when it first started producing recommendations for them. MTI is now a mature indexing tool that benefits greatly from a close collaborative relationship with its customers.
The strides that MTI has been able to make over the last two years would not have been possible without the continued collaboration with the Index Section, which provides much needed expertise and insight into the indexing task.

MTI was able to provide recommendations for over 93% of the total number of citations that were indexed in 2012. We use the human indexing as a gold standard and compare that against the MTI recommendations to calculate Precision, Recall, and F1-measure. Overall F1 has improved from 0.3875 in 2008 to 0.5481 in 2012 (+41.45%).

We look forward to the results of the 2013 BioASQ Challenge to see how MTI performs against other systems. This will be the first opportunity for such a comparison.

Future Direction

Several research topics that are planned for the future include: utilizing full text now that it is becoming more available, assisting in Gene Link and Chemical Flag identification, utilizing sections identified in Structured Abstracts to help weight recommendations, identifying whether author/publisher supplied keywords might benefit MTI, and expanding machine learning usage to help improve problematic MeSH Headings. We also look forward to expanding the number of MTIFL journals.

Acknowledgements

The Medical Text Indexer Team benefits from a very close collaboration with the NLM Index Section. This collaboration provides a deeper understanding of the manual indexing process and insights into other possible avenues where MTI might be used to assist in the indexing process at NLM.

This work was partly supported by the Intramural Research Program of the NIH, National Library of Medicine. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

We would like to thank our colleagues François Lang and Willie Rogers for providing direct and indirect support of MTI.
We would also like to extend special acknowledgment to Hua Florence Chang who was the original creator of MTI. Florence’s foresight has provided us with a robust and tunable program.

References

1. Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative's Medical Text Indexer. Medinfo. 2004 Sept.;2004: 268-272.
2. Aronson AR, Bodenreider O, Chang HF, Humphrey SM, Mork JG, Nelson SJ, Rindflesch TC, and Wilbur WJ. The NLM Indexing Initiative. Proc AMIA Symp 2000;:17-21.
3. Aronson AR. The MMI Ranking Function Whitepaper (1997). Available at http://skr.nlm.nih.gov/papers/references/ranking.pdf.
4. Aronson AR and Lang FM. (2010). An Overview of MetaMap: Historical Perspective and Recent Advances. J Am Med Inform Assoc. 2010 May 1;17(3):229-36.
5. Lin, J., & Wilbur, W. J. (2007). PubMed related articles: a probabilistic topic-based model for content similarity. BMC bioinformatics, 8(1), 423.
6. Bodenreider O, Nelson SJ, Hole WT, and Chang HF. Beyond Synonymy: Exploiting the UMLS Semantics in Mapping Vocabularies. Proc AMIA Symp 1998;:815-9.
7. Medical Text Indexer (MTI) Processing Flow Whitepaper. Available at http://skr.nlm.nih.gov/resource/Medical_Text_Indexer_Processing_Flow.pdf.
8. Jimeno-Yepes, A., Mork, J.G., Demner-Fushman, D., and Aronson, A.R. Automatic algorithm selection for MeSH Heading indexing based on meta-learning. International Symposium on Languages in Biology and Medicine, Singapore, December, 2011.
9. Jimeno-Yepes, Antonio, Mork JG, Demner-Fushman D, Aronson AR. Comparison and combination of several MeSH indexing approaches. AMIA Annual Symposium Proceedings. Vol. 2013. American Medical Informatics Association, 2013.
10. Névéol A., Mork J.G., Aronson A.R. Automatic Indexing of Specialized Documents: Using Generic vs. Domain-Specific Document Representations. Proc BioNLP 2007 Workshop, 183-92.
11. Névéol A., Shooshan S.E., Humphrey S.M., Rindflesch T.C. and Aronson A.R. Multiple Approaches to Fine-Grained Indexing of the Biomedical Literature. Proc Pacific Symposium on Biocomputing 2007, 292-303.
12. Névéol A, Shooshan SE, Mork JG, Aronson AR. Fine-Grained Indexing of the Biomedical Literature: MeSH Subheading Attachment for a MEDLINE Indexing Tool. AMIA Annu Symp Proc. 2007;:553-7.
13. A MEDLINE Indexing Experiment Using Terms Suggested by MTI Whitepaper, June 2002. Available at http://ii.nlm.nih.gov/resources/ResultsEvaluationReport.pdf.
14. Ruiz M.E. and Aronson A.R. User-centered Evaluation of the MTI System, 2007 Whitepaper. Available at http://ii.nlm.nih.gov/resources/MTIEvaluation-Final.pdf.