
Identifying Publication Types Using Machine Learning

Antonio J. Jimeno Yepes (0, 1), antonio.jimeno@gmail.com

James G. Mork (1), mork@nlm.nih.gov

Alan R. Aronson (1), alan@nlm.nih.gov

(0) NICTA Victoria Research Lab, Melbourne, Australia

(1) National Library of Medicine, Bethesda, MD, USA

Every year the number of journals and the number of articles to be indexed grow at the U.S. National Library of Medicine (NLM), causing an ever-increasing demand on the highly qualified but relatively small, dedicated staff of indexers. We present a methodology for identifying MeSH (Medical Subject Headings) Publication Types to assist the indexers in the categorization of these MEDLINE citations. Publication Types are used by the indexer to describe the type or genre of an article rather than what the article is about, making this a different kind of text categorization problem from identifying MeSH Descriptors. Our goal is to apply a machine learning approach to recommending Publication Types, which will save indexers time by providing a precise list of possible Publication Types for each article. Our experiments involved several different machine learning methods to provide Publication Type recommendations, which were then evaluated against the gold standard of human indexing. Our results show that machine learning in most cases adds a great deal to the overall performance of recommending Publication Types. Our experiments also show that in some cases, either the full text of the article or feature engineering will be required to accurately produce some Publication Type recommendations.

Keywords: Indexing methods, Text categorization, Machine learning, MeSH, MEDLINE

The MEDLINE®/PubMed® database contains over 21 million citations¹. It currently grows at the rate of around 800,000 indexed citations per year, covering almost 6,000 international biomedical journals² in 58 languages. These new citations are manually indexed by a relatively small, dedicated staff of indexers at the U.S. National Library of Medicine (NLM). In this paper, we will use the terms article and citation interchangeably, although they refer to two distinct entities in the indexing world. Indexers index from the full text of an article, and the results of that effort, along with the title and abstract from the article, are stored as a citation in the MEDLINE/PubMed database. The indexers use the Medical Subject Headings (MeSH®)³ controlled vocabulary to summarize the central points of full text articles. The 2013 MeSH vocabulary consists of 26,853 MeSH Descriptors⁴, which are further qualified by a set of 83 MeSH Qualifiers (Subheadings). For example, Aspirin/therapeutic use illustrates the MeSH Descriptor Aspirin being qualified by the MeSH Qualifier therapeutic use, showing that the article is not about Aspirin in general but, more specifically, about the therapeutic uses of Aspirin. There are also 214,816 Supplementary Concepts available to the indexer for detailing important chemicals, drugs, or proteins identified in the articles. In addition to summarizing the main points of each article, the indexer is also responsible for other curation tasks such as assigning one or more Publication Types, which define the genre of the article.

¹ http://mbr.nlm.nih.gov

² www.nlm.nih.gov/bsd/bsd_key.html

Publication Types (PTs)⁵ are a special type of MeSH Heading that are used to indicate what an article is rather than what it is about. There are 61 PTs identified in the four MeSH Publication Characteristics (V) Tree top-level sub-trees that the indexers typically use. These four sub-trees describe a wide range of document types or genres for PTs: Publication Components [V01] (e.g., Architectural Drawings), Publication Formats [V02] (e.g., Eulogies), Study Characteristics [V03] (e.g., Clinical Trial), and Support of Research [V04] (U.S. Government and non-U.S. Government), with some PTs included in multiple sub-trees. Multiple PTs can be assigned to the same article by the indexer.

The ever-increasing demand for indexing (502,056 indexed in 2002 to 760,903 indexed in 2012, and with NLM expecting to index over one million articles annually within a few years) is a growing and burdensome workload in a time of dwindling resources. NLM created the NLM Indexing Initiative (II) [1] project to explore indexing methodologies that could assist indexers by providing tools to increase their productivity while maintaining their high quality of indexing. The II project has previously shown that the right tools can help significantly reduce the amount of time required to manually index articles: MetaMap [2], which identifies Unified Medical Language System (UMLS®) concepts in biomedical text; the NLM Medical Text Indexer (MTI) [3], which provides indexing recommendations and acts as a First Line Indexer for a select number of journals; and our previous success with machine learning, which provides recommendations for twelve of the most commonly used MeSH Check Tags [4, 5] in MTI with an 80% success rate.

In 2004, testing MTI’s PT recommendations showed that MTI was not very good at the task, as shown in Table 1. MTI has two main methods of summarizing what a citation is about: MetaMap Indexing (MMI) [2] and the PubMed Related Citations (PRC) [6] algorithm. MMI analyzes the citation, identifying the Unified Medical Language System (UMLS) concepts that best match the text of the citation. MTI then maps these UMLS concepts to the MeSH vocabulary using the Restrict-to-MeSH [7] mappings, which are based primarily on the semantic relationships of the UMLS concepts. The PRC algorithm is a modified k-NN algorithm which relies on document similarity to identify potentially relevant MeSH Descriptors. Both of the MTI methods are focused on summarizing the contents of the citation and not on analyzing the type of document being processed, which accounts for MTI’s poor performance with PTs. MTI performed so poorly on PTs that it was not used for 46 of the 61 PTs from the beginning, and we stopped recommending the remaining 15 PTs altogether on November 10, 2004.

³ http://www.nlm.nih.gov/pubs/factsheets/mesh.html

⁴ http://www.nlm.nih.gov/mesh/intro_record_types.html

⁵ http://www.nlm.nih.gov/mesh/pubtypes.html

Our goal now is to consider the task of recommending PTs as a text categorization task using machine learning, which could save indexers even more time by providing a precise list of possible PTs for each article. There is no previous work on using machine learning in the context of PTs, though a review of existing work for MeSH indexing [4, 5, 8, 9] illustrates many cases where machine learning has been applied effectively. In addition, a large corpus of indexed MEDLINE citations is available as training data.

There are several challenges to our approach:

1. The indexers index from the full text of an article in making their determinations of which PTs to assign, while we are currently limited by license restrictions to just the title and abstract found in the MEDLINE citation.

2. Inconsistency between MeSH indexers [10], due to different interpretations of the article and different understanding of MeSH, could result in an inconsistent gold standard and provide less than optimal training for the algorithms.

3. Changes to the indexing policy over time can introduce inconsistencies in the machine learning training. For example, if we have trained with years 2010, 2011, and 2012 and a new Publication Type was added in 2011, we have the potential for inconsistencies in the 2010 training data due to articles that look like they should have the new Publication Type assigned but do not. To help limit this problem, we have created a training set with MEDLINE citations from the last three years.

4. 18 of the 61 Publication Types commonly used by the indexers are found in multiple Publication Characteristics MeSH tree sub-trees. For example, the Publication Type Letter appears in the Publication Components (V01) and Publication Formats (V02) sub-trees. This presents a possible ambiguity problem and at the very least introduces possibly confusing documents for the machine learning training.

Methods

We have studied the use of various machine learning algorithms, testing their ability to accurately recommend several different types of PTs for MEDLINE citations. We have selected the following ten PTs to see if we could provide reliable recommendations:

• Case Reports: Clinical presentations that eventually lead to a diagnosis.

• Clinical Trial: Work that is the report of a pre-planned clinical study.

• Congresses: Published records of the papers delivered at or issued on the occasion of individual congresses, symposia, and meetings.

• Controlled Clinical Trial (CCT): Work consisting of a clinical trial involving one or more test treatments and at least one control treatment.

• Editorial: Work consisting of a statement of the opinions, beliefs, and policy of the editor or publisher of a journal.

• English Abstract: English Abstracts of foreign articles.

• In Vitro: Studies using excised tissues.

• Meta-Analysis: Work consisting of studies using a quantitative method of combining the results of independent studies.

• Randomized Controlled Trial (RCT): Similar to Controlled Clinical Trial, but requires that the treatments to be administered are selected by a random process.

• Review: An article or book published after examination of published material on a subject.

These ten PTs were selected because they are among the most frequently used PTs and provide a good cross-category sample of the four Publication Characteristics MeSH tree sub-trees for PTs. We also limited our set to ten PTs to facilitate training and evaluation.

As mentioned before, changes in the indexing policy can have a dramatic effect on how articles are indexed and can create inconsistencies in a large training corpus if special care is not taken. To reduce the chance of this, we have focused on the last three full indexing years using the 2012 MEDLINE Baseline. We used the Medline Baseline Repository Query Tool⁶ to identify a list of PMIDs (PubMed Unique Identifiers) for Date Completed (date indexing was applied to the citation) ranging from January 1, 2009 to December 31, 2011. The Query Tool also allowed us to randomly divide the list of PMIDs into Training (2/3) and Testing (1/3) sets. We ended up with 1,784,061 randomly selected PMIDs for Training and 878,718 for Testing. Once we had the two lists of PMIDs, we extracted the actual citations from the 2012 MEDLINE Baseline in XML format for use with our MTI ML machine learning package⁷. The MTI ML package was developed as part of the Indexing Initiative effort to provide machine learning algorithms optimized for large text categorization tasks and capable of combining several text categorization solutions. It is available subject to the MetaMap Terms and Conditions⁸.
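The random division into Training and Testing sets described above can be illustrated with a short sketch. It assumes a plain list of PMIDs and Python's standard `random` module; the function name `split_pmids`, the seed, and the example identifiers are hypothetical and say nothing about how the Query Tool itself is implemented.

```python
import random

def split_pmids(pmids, train_fraction=2 / 3, seed=0):
    """Randomly assign each PMID to Training or Testing (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for pmid in pmids:
        # Each PMID goes to Training with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(pmid)
        else:
            test.append(pmid)
    return train, test

pmids = list(range(1, 3001))  # made-up identifiers, not real PMIDs
train, test = split_pmids(pmids)
print(len(train), len(test))
```

Because every PMID is assigned independently, the two sets are disjoint and only approximately in a 2:1 ratio.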

Certain types of articles require special indexing. For example, a Comment On article, which is an article commenting on a different article, is indexed by simply using the indexing from the originating article. For a Review type of article, the indexer uses fewer MeSH Headings that tend to be more general in nature than they would use for a non-Review article. For these reasons, when we assembled the final data set, we also filtered out the articles requiring special handling to create as clean a data set as possible. Specifically, we removed the following types⁹ of articles from our data sets: OLDMEDLINE, PubMed-not-MEDLINE, articles with no indexing, CommentOn, RetractionOf, PartialRetractionOf, UpdateIn, RepublishedIn, ErratumFor, and ReprintOf. This left us with 1,321,512 articles for Training and 651,617 articles for Testing. The data sets used for these experiments are available from our Indexing Initiative Data Sets and Test Collections web page¹⁰.

⁶ http://mbr.nlm.nih.gov

⁷ http://ii.nlm.nih.gov/MTI_ML/index.shtml

⁸ http://metamap.nlm.nih.gov/MMTnCs.shtml

⁹ http://www.nlm.nih.gov/bsd/licensee/elements_alphabetical.html
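The filtering of articles that require special handling can be sketched as a simple predicate. The record layout below (a dict with `status`, `mesh_headings`, and `ref_types` keys) is a hypothetical simplification chosen for illustration, not the MEDLINE XML schema, and the example records are made up.

```python
# Hypothetical record layout: "status", "mesh_headings", and "ref_types"
# are illustrative keys, not actual MEDLINE XML element names.
EXCLUDED_STATUSES = {"OLDMEDLINE", "PubMed-not-MEDLINE"}
EXCLUDED_REF_TYPES = {
    "CommentOn", "RetractionOf", "PartialRetractionOf", "UpdateIn",
    "RepublishedIn", "ErratumFor", "ReprintOf",
}

def keep_citation(citation):
    """Return True if the citation needs no special handling."""
    if citation.get("status") in EXCLUDED_STATUSES:
        return False
    if not citation.get("mesh_headings"):  # articles with no indexing
        return False
    # Drop the citation if any of its reference types is on the excluded list.
    return EXCLUDED_REF_TYPES.isdisjoint(citation.get("ref_types", ()))

citations = [
    {"pmid": 1, "status": "MEDLINE", "mesh_headings": ["Aspirin"], "ref_types": []},
    {"pmid": 2, "status": "MEDLINE", "mesh_headings": ["Aspirin"], "ref_types": ["CommentOn"]},
    {"pmid": 3, "status": "PubMed-not-MEDLINE", "mesh_headings": [], "ref_types": []},
    {"pmid": 4, "status": "MEDLINE", "mesh_headings": [], "ref_types": []},
]
kept = [c["pmid"] for c in citations if keep_citation(c)]
print(kept)
```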

The task of assigning PTs to a MEDLINE citation can be seen as a text categorization task [4, 8], in which the PTs are the categories to be assigned. In our experiments, we trained binary classifiers to predict whether or not an article should be indexed with a given PT. We selected learning methods that can be trained in a reasonable time given the large number of citations under consideration. Among these methods are a linear SVM implementation based on Hinge loss and modified Huber loss, and an implementation of AdaBoostM1 that uses decision trees as the base learner. In addition, we considered Naïve Bayes and Logistic Regression from the Mallet¹¹ package.

SVM has been shown to perform well on text categorization tasks [11]. We used a linear-kernel SVM implementation trained with stochastic gradient descent, using either Hinge loss or the modified Huber loss proposed by Zhang [12] and used by Yeganova et al. [13]; the latter has been shown to improve on the performance of Hinge loss in the case of very imbalanced training sets. It is a wide margin classifier with a quadratic loss function. We restricted our study to linear kernels due to the size of our data sets, but it would be worth exploring efficient implementations for learning with more complex kernels.
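To make the two loss functions concrete, the following is a minimal sketch of a linear classifier over Boolean features trained with stochastic gradient descent. It is not the MTI ML implementation: the learning rate, regularization constant, number of epochs, and toy examples are all assumptions chosen for illustration. The modified Huber loss is written in its usual form, (1 - m)^2 for margins m >= -1 (zero once m >= 1) and -4m otherwise.

```python
def hinge_grad(margin):
    """Derivative of max(0, 1 - m) with respect to the margin m = y * score."""
    return -1.0 if margin < 1.0 else 0.0

def modified_huber_grad(margin):
    """Derivative of the modified Huber loss with respect to the margin."""
    if margin >= 1.0:
        return 0.0
    if margin >= -1.0:
        return -2.0 * (1.0 - margin)  # quadratic region
    return -4.0                       # linear region for badly misclassified points

def train_sgd(examples, loss_grad, epochs=20, lr=0.1, reg=1e-4):
    """examples: list of (set of active features, label), label in {-1, +1}."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, y in examples:
            score = bias + sum(weights.get(f, 0.0) for f in features)
            g = loss_grad(y * score) * y  # gradient of the loss w.r.t. the score
            for f in features:
                w = weights.get(f, 0.0)
                # L2 penalty applied only to the active features (a simplification).
                weights[f] = w - lr * (g + reg * w)
            bias -= lr * g
    return weights, bias

def predict(model, features):
    weights, bias = model
    return 1 if bias + sum(weights.get(f, 0.0) for f in features) > 0 else -1

# Toy examples (made up): +1 = should receive the PT, -1 = should not.
examples = [
    ({"randomized", "trial"}, 1),
    ({"case", "report"}, -1),
    ({"randomized", "placebo"}, 1),
    ({"review", "literature"}, -1),
]
hinge_model = train_sgd(examples, hinge_grad)
mhl_model = train_sgd(examples, modified_huber_grad)
print(predict(hinge_model, {"randomized"}), predict(mhl_model, {"review"}))
```

The only difference between the two variants is the gradient function passed to `train_sgd`; the modified Huber gradient grows with the size of the margin violation, whereas the Hinge gradient is constant.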

One of the algorithms that we have extensively used is AdaBoostM1 (Ada) using an implementation of decision trees based on C4.5 as the base learning algorithm. In previous work, Ada had performed well on the Check Tags set [8, 9], and we were interested in evaluating its performance with a larger, more diverse set of terms. Our implementation of C4.5 relies on binary features, which provide a more efficient implementation of the decision tree in terms of memory and time required for training. The SVM and AdaBoostM1 implementations are available from the MTI ML package¹², which has been used in several MeSH indexing research efforts and has become part of the MTI system. The MTI ML tool is already configured to work with MEDLINE citations and provides several configuration options to deal with different MEDLINE citation fields. The MTI ML package has also been extended to export the preprocessing of the articles for use by the Mallet package using its SVMLight¹³ interface.

¹⁰ http://ii.nlm.nih.gov/DataSets/index.shtml#2013_BioASQ

¹¹ http://mallet.cs.umass.edu

¹² http://ii.nlm.nih.gov/MTI_ML/index.shtml

¹³ http://svmlight.joachims.org
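The boosting procedure can be sketched as follows. This is a simplified illustration of AdaBoost.M1 for the two-class case in which decision stumps (a test on a single binary feature) stand in for the C4.5 trees used in the experiments; the function names and toy data are hypothetical and do not come from the MTI ML package.

```python
import math

def best_stump(examples, weights, vocabulary):
    """Pick the single-feature test with the lowest weighted error."""
    best = None
    for feature in vocabulary:
        for label_if_present in (True, False):
            # The stump predicts label_if_present when the feature occurs,
            # and the opposite label otherwise.
            error = sum(w for (feats, y), w in zip(examples, weights)
                        if ((feature in feats) == label_if_present) != y)
            if best is None or error < best[0]:
                best = (error, feature, label_if_present)
    return best

def adaboost_m1(examples, rounds=10):
    """examples: list of (set of active binary features, bool label)."""
    vocabulary = sorted(set().union(*(feats for feats, _ in examples)))
    weights = [1.0 / len(examples)] * len(examples)
    ensemble = []  # (vote weight, feature, label_if_present)
    for _ in range(rounds):
        error, feature, label_if_present = best_stump(examples, weights, vocabulary)
        if error >= 0.5:  # weak learner no better than chance: stop
            break
        perfect = error == 0.0
        beta = max(error, 1e-10) / (1.0 - max(error, 1e-10))
        ensemble.append((math.log(1.0 / beta), feature, label_if_present))
        if perfect:  # nothing left to re-weight
            break
        # Down-weight correctly classified examples, then renormalize.
        weights = [w * beta if ((feature in feats) == label_if_present) == y else w
                   for (feats, y), w in zip(examples, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, feats):
    """Weighted vote of the stumps; True means the PT is recommended."""
    vote = sum(alpha if ((feature in feats) == label_if_present) else -alpha
               for alpha, feature, label_if_present in ensemble)
    return vote > 0

# Toy data (made up).
examples = [
    ({"randomized", "trial"}, True),
    ({"randomized", "placebo"}, True),
    ({"case", "report"}, False),
    ({"review", "trial"}, False),
]
ensemble = adaboost_m1(examples)
print(predict(ensemble, {"randomized", "double"}), predict(ensemble, {"case"}))
```

Restricting the weak learner to binary feature tests keeps each round to one pass over the vocabulary, in the same spirit as the binary-feature decision trees described above.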

Results

For each machine learning method, we trained with up to four different feature variations. In all cases, we considered only Boolean features (either the feature appears in the citation or it does not):

1. Base method, which includes the text from the Title and Abstract fields. The text was tokenized and lowercased, and no stemming was applied.

2. Base method plus added text features (-F). For the added text features, we add the following fields to the default Title and Abstract fields for training: Journal Unique Identifier, Author Affiliations, Author Names, and Grant Agencies. Some of these features rely on the authors or the institutions continuing to work on the same type of publications, which might change after some time. The plan is to retrain the learning algorithms to avoid any concept drift.

3. Base method plus bigrams (-B).

4. Base method plus added text features plus bigrams (-BF).
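The four feature variations listed above can be sketched as one function with two switches. The tokenizer (a regular expression over lowercase alphanumerics), the dictionary keys, the `field=value` naming of the added features, and the example citation are all assumptions made for illustration; the paper does not specify these details.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters; no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())

def boolean_features(citation, use_bigrams=False, use_fields=False):
    """Return the set of features present in the citation (Boolean features)."""
    features = set()
    for text in (citation["title"], citation["abstract"]):
        tokens = tokenize(text)
        features.update(tokens)
        if use_bigrams:  # the -B variations
            features.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    if use_fields:       # the -F variations
        for field in ("journal_id", "affiliations", "authors", "grant_agencies"):
            for value in citation.get(field, []):
                features.add(f"{field}={value.lower()}")
    return features

# Made-up citation; every value here is hypothetical.
citation = {
    "title": "A randomized controlled trial of aspirin",
    "abstract": "Patients were randomized to aspirin or placebo.",
    "journal_id": ["J-0001"],
    "affiliations": ["Example University"],
    "authors": ["Doe J"],
    "grant_agencies": ["Example Agency"],
}
base = boolean_features(citation)
bf = boolean_features(citation, use_bigrams=True, use_fields=True)
print(len(base), len(bf))
```

Representing each citation as a set makes the Boolean nature of the features explicit: a term that occurs five times contributes exactly the same feature as one that occurs once.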

Due to time constraints, AdaBoostM1 was only trained using the first two variations. We used five different machine learning methods: Modified Huber Loss (Mhl), Hinge Loss (Sgd), Naïve Bayes (NB), Logistic Regression (LR), and AdaBoostM1 (Ada). So, in the table under methods, “Mhl-BF” means Modified Huber Loss using bigrams and added text features. We have also highlighted the four PTs (CO, EA, IV, and MA) where we have baseline results from early MTI performance. Even though the two sets of results are not directly comparable, the difference in performance is substantial. Of the four PTs for which we have baseline performance information, three show a dramatic improvement with machine learning: Congresses improves from 0.3397 to 0.7113 (+109%), English Abstract improves from 0.0010 to 0.8359 (+83,490%, roughly an 836-fold increase), and Meta-Analysis improves from 0.2674 to 0.7742 (+190%). Interestingly, In Vitro actually shows a decrease in performance, from 0.1679 to 0.1610 (-4%).

Mhl-BF and LR-BF have the best performance among the evaluated methods. These two classifiers have already shown better performance compared to other algorithms in existing work on MeSH indexing [4, 5, 8, 9]. Adding features from the article fields seems to improve the performance compared with using only the Title and Abstract fields. Using bigrams slightly improves the performance.

Discussion

Not surprisingly with machine learning, there is no clear winning method that works best for all of the Publication Types, echoing the findings for MeSH indexing [4, 5, 8, 9]. The Logistic Regression (LR) method provides the highest F1 measures for six of the ten PTs in our study, making it the best overall performer. Even within the LR results, the highest measures come from both the default (LR) and the Base method plus added text features plus bigrams (LR-BF) variations, with large differences in performance between the two. The results for the Modified Huber Loss (Mhl), Hinge Loss (Sgd), and AdaBoostM1 (Ada) methods were very close to the results for the LR method and, depending on retraining, might in some cases perform slightly better than the LR method.

The Naïve Bayes method was far behind all of the other methods. This effect is more dramatic when the ratio of positives to negatives is smaller. This has already been explained by Rennie et al. [14]: it is due to the imbalance between the classes, under which the Naïve Bayes classifier favors the majority class. In addition, this effect is more dramatic with a larger set of dependent features, in which the decision boundary is pushed by the related features, favoring the majority class even more.

Case Reports, Congresses, English Abstract, Meta-Analysis, Randomized Controlled Trial, and Review all have F1 measures above 0.700 making them promising candidates for future integration into the indexing process. The remaining PTs Clinical Trial, Controlled Clinical Trial, Editorial, and In Vitro all have F1 measures too low for consideration at this time but provide the kernel for further research into improving their performance.
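For reference, the F1 measure used throughout is the harmonic mean of precision and recall. The sketch below computes it from raw counts and applies the 0.700 threshold mentioned above; the counts in the example are made up and are not results from our experiments.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0  # no correct recommendations: F1 is defined as 0 here
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

def is_promising(f1, threshold=0.700):
    """Apply the threshold used in the discussion above."""
    return f1 > threshold

# Hypothetical counts for one PT.
example_f1 = f1_score(true_positives=80, false_positives=20, false_negatives=20)
print(round(example_f1, 4), is_promising(example_f1))
```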

The overall results are promising enough to warrant expanding the experiments to include more PTs to see how they will perform.

If we focus on the 480,631 citations in the 2013 MEDLINE Baseline with a 2012 Publication Date, we can see that several of our high performing PTs were also some of the most frequently used PTs. Review (46,808) is fourth, Case Reports (27,662) fifth, English Abstract (14,208) tenth, and Randomized Controlled Trial (11,408) twelfth. By providing accurate Publication Type recommendations to the indexers, we will help make their jobs easier and more efficient.

Two of the PTs intrigued us enough to warrant a deeper study for very different reasons.

English Abstract performed very well (0.8359) in our experiments, but we could not understand why it did not reach 1.0000. The rule for identifying whether an article is actually an English Abstract is very clear, more so than for most of the PTs. If an article has a title in brackets (meaning it was translated into English) and contains an abstract, it should receive the English Abstract Publication Type. What we found in talking with an indexer is that English Abstract is actually not added by the indexer at all. The rule is straightforward enough that there is a program in place to automatically assign this Publication Type to articles before the indexing is released to the MEDLINE/PubMed database. During our false positive error analysis, we found that the majority of cases met the definition of English Abstract but simply did not have the Publication Type assigned, and this is very likely the cause of not meeting our goal of 1.0000.
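The rule stated above is simple enough to write down directly. The sketch below is our own illustration of that rule, not the program NLM uses; the function name is hypothetical, and tolerating a trailing period after the closing bracket is an assumption about how titles are formatted.

```python
def is_english_abstract(title, abstract):
    """Bracketed (translated) title plus a non-empty abstract."""
    # Assumption: a period may follow the closing bracket, e.g. "[Title]."
    core = title.strip().rstrip(".").strip()
    bracketed = core.startswith("[") and core.endswith("]")
    has_abstract = bool(abstract and abstract.strip())
    return bracketed and has_abstract

# Made-up examples.
print(is_english_abstract("[Treatment of hypertension in the elderly].", "Background: ..."))
print(is_english_abstract("Treatment of hypertension in the elderly.", "Background: ..."))
```

A rule of this kind is an example of the feature engineering that could complement the learned classifiers for this Publication Type.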

In Vitro, on the other hand, actually performed worse than our MTI baseline, and we wanted to find out what might be causing this anomaly. In Vitro was designated as a Check Tag when our MTI baseline measure was taken and changed to being a Publication Type shortly thereafter. As a Check Tag, indexers would have used In Vitro much differently than as a Publication Type, since Check Tags are based on the main topics found in the article and PTs describe the type or genre of the article. This may account for some of the differences in performance, but there had to be additional reasons for such a low F1 measure (0.1610) for In Vitro. We only used the last three years of MEDLINE in our experiments, so this time period would only include In Vitro as a Publication Type. So, we should not be confusing the machine learning algorithms by providing them with contradictory data. What we found in our error analysis was that in almost all of the false negatives that we manually reviewed, the information for designating the article as In Vitro was located in the full text of the article, usually in the Methods section, where the authors describe how they performed their research. This fact alone explains the low performance for In Vitro and highlights one of the challenges we mentioned earlier (full text versus only using title and abstract) to successfully recommend Publication Types.

Conclusion and Future Work

We have evaluated the automatic assignment of PTs to MEDLINE articles based on machine learning, which extends our previous machine learning efforts using MTI. We find that for the majority (6 of 10) of PTs the performance is quite good, with F1 measures above 0.700, while further work is required for the rest of them. The results also show that, in addition to the title and abstract text, further information provided by other fields in the MEDLINE citation results in improved performance. The discussion section shows that feature engineering might provide improved performance, for instance, in the English Abstract case.

Future work will involve expanding the experiments to include most of the remaining frequently used PTs to see if we can identify the set of PTs that perform the best and that would provide the most assistance to the indexers. We will also be exploring the use of openly available full text from PubMed Central¹⁴ to see if the full text would benefit In Vitro as well as other poorly performing PTs.

Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was also partly supported by the Intramural Research Program of the NIH, National Library of Medicine. The authors would also like to thank Preeti Kochar, a senior indexer at the U.S. National Library of Medicine, for her valuable insights into how the Publication Types work from an indexer’s perspective.

¹⁴ http://www.ncbi.nlm.nih.gov/pmc/

References

1. Aronson A.R., Bodenreider O., Chang H.F., Humphrey S.M., Mork J.G., Nelson S.J., Rindflesch T.C., Wilbur W.J. (2000). The NLM indexing initiative. Proc AMIA Symp 2000: 17-21.

2. Aronson A.R. and Lang F.M. (2010). An Overview of MetaMap: Historical Perspective and Recent Advances. J Am Med Inform Assoc. 2010 May 1; 17(3): 229-36.

3. Aronson A.R., Mork J.G., Gay C.W., Humphrey S.M., Rogers W.J. (2004). The NLM Indexing Initiative's Medical Text Indexer. Medinfo 2004; 11(Pt 1): 268-72.

4. Jimeno-Yepes A., Mork J.G., Demner-Fushman D., and Aronson A.R. (2011). Automatic algorithm selection for MeSH Heading indexing based on meta-learning. International Symposium on Languages in Biology and Medicine, Singapore, December 2011.

5. Jimeno-Yepes A., Mork J.G., Demner-Fushman D., Aronson A.R. Comparison and combination of several MeSH indexing approaches. AMIA Annual Symposium Proceedings. Vol. 2013. American Medical Informatics Association, 2013.

6. Lin J. and Wilbur W.J. (2007). PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 8(1), 423.

7. Bodenreider O., Nelson S.J., Hole W.T., and Chang H.F. Beyond Synonymy: Exploiting the UMLS Semantics in Mapping Vocabularies. Proc AMIA Symp 1998: 815-9.

8. Jimeno-Yepes A., Wilkowski B., Mork J.G., Demner-Fushman D., and Aronson A.R. (2012). MeSH indexing: machine learning and lessons learned. ACM SIGHIT International Health Informatics Symposium, Miami, FL, USA, 2012.

9. Jimeno-Yepes A., Mork J.G., Demner-Fushman D., Aronson A.R. A one-size-fits-all indexing method does not exist: automatic selection based on meta-learning. JCSE, vol. 6, no. 2, pp. 151-160, 2012.

10. Funk M.E. and Reid C.A. Indexing consistency in MEDLINE. Bulletin of the Medical Library Association, 71(2): 176, 1983.

11. Joachims T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. (C. Nédellec & C. Rouveirol, Eds.) Machine Learning: ECML-98, 1398(2), 2-7. Springer.

12. Zhang T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.

13. Yeganova L., Comeau D.C., Kim W., Wilbur W.J. Text mining techniques for leveraging positively labeled data. In Proceedings of BioNLP 2011 Workshop (pp. 155-163). Association for Computational Linguistics.

14. Rennie J.D., Shih L., Teevan J., Karger D.R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. ICML. Vol. 3. 2003.