
NTTMU System in the 2nd Social Media Mining for Health Applications Shared Task

Chen-Kai Wang
1 Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan, R.O.C.
2 Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan, R.O.C.
3 Institute of Information Science, Academia Sinica, Taipei, Taiwan

In this study, we describe our methods for automatically classifying Twitter posts that describe adverse drug reactions and medication intake. We developed classifiers using linear support vector machines (SVM) and Naïve Bayes Multinomial (NBM) models. We extracted features to build our models and conducted experiments to examine their effectiveness as part of our participation in the AMIA 2017 Social Media Mining for Health Applications shared task. For both tasks, the best-performing models on the test sets were trained using NBM with n-gram, part-of-speech and lexicon features, and achieved F-scores of 0.295 and 0.615, respectively.

Methods



Lexicon-based features: We used the ADR lexicon compiled in our previous work [5] to derive two binary features for each tweet: one indicates the presence of a drug name and the other the presence of an ADR mention.
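The two binary lexicon features can be illustrated with a minimal Python sketch. The lexicon entries, the function names and the whole-word matching rule below are assumptions made for illustration; the actual ADR lexicon from [5] and the matching procedure used in our system are not reproduced here.

```python
import re

# Hypothetical toy lexicons; the ADR lexicon from [5] is not reproduced here.
DRUG_NAMES = {"seroquel", "prozac", "humira"}
ADR_MENTIONS = {"headache", "nausea", "weight gain"}


def _contains_any(text, terms):
    """Return True if any lexicon term occurs in text as a whole word or phrase."""
    for term in terms:
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            return True
    return False


def lexicon_features(tweet):
    """Map a tweet to two binary features (assumed names): drug name and ADR presence."""
    lowered = tweet.lower()
    return {
        "has_drug_name": int(_contains_any(lowered, DRUG_NAMES)),
        "has_adr_mention": int(_contains_any(lowered, ADR_MENTIONS)),
    }
```

For instance, `lexicon_features("Seroquel gave me a terrible headache")` sets both features to 1, while a tweet that mentions neither a listed drug nor a listed reaction yields two zeros.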

In addition to the above features, we tried to exploit a likely-positive dataset [6] and employed different term weighting methods, such as the transformed weight-normalized complement Naïve Bayes (TWCNB) [7]. TWCNB combines a Naïve Bayes classifier with weighted features based on term frequency, inverse document frequency, length normalization and complement class weighting. Unfortunately, none of these achieved a significant improvement over the above feature sets. We report the details in the Results section.
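To make the four TWCNB factors concrete, the following pure-Python sketch applies them in the order in which they are commonly described: a log term-frequency transform, inverse document frequency, length normalization, and weight-normalized complement class estimates. It is an illustration under stated assumptions rather than the implementation used in our experiments; the function names, the smoothing value `alpha` and the toy data in the usage note are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_twcnb(docs, labels, alpha=1.0):
    """Sketch of TWCNB training. docs: list of token lists; labels: class labels."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency of each term, used for the IDF factor.
    df = Counter(tok for doc in docs for tok in set(doc))

    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # Term-frequency transform log(1 + tf), multiplied by IDF log(N / df).
        vec = {t: math.log(1 + c) * math.log(n_docs / df[t]) for t, c in counts.items()}
        # Length normalization to unit Euclidean norm.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {t: v / norm for t, v in vec.items()}
        weighted.append(vec)

    weights = {}
    for c in set(labels):
        # Complement class weighting: accumulate over documents NOT in class c.
        comp = defaultdict(float)
        for vec, y in zip(weighted, labels):
            if y != c:
                for t, v in vec.items():
                    comp[t] += v
        total = sum(comp.values()) + alpha * len(vocab)
        logs = {t: math.log((comp.get(t, 0.0) + alpha) / total) for t in vocab}
        # Weight normalization: divide by the sum of absolute log weights.
        z = sum(abs(v) for v in logs.values())
        weights[c] = {t: v / z for t, v in logs.items()}
    return weights


def predict_twcnb(weights, doc):
    """Choose the class whose complement matches the document worst (lowest score)."""
    counts = Counter(doc)

    def score(c):
        return sum(n * weights[c].get(t, 0.0) for t, n in counts.items())

    return min(sorted(weights), key=score)
```

As a usage example, training on four toy documents, two labelled "adr" (for example `["headache", "nausea", "drug"]`) and two labelled "none" (for example `["happy", "sunny", "day"]`), and then calling `predict_twcnb` on `["nausea", "headache"]` selects the "adr" class, because those terms are rare in the complement of that class.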

Results

For task 1, configuration 14, which uses the NBM algorithm with all proposed features, achieved the highest F-measure, 49.92%. For task 2, the same configuration (denoted as 4 in Table 2) also achieved the highest F-measure, 63.34%.
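For readers less familiar with the metric, the sketch below computes precision, recall and F-measure from raw counts, assuming the usual harmonic-mean definition. The function name and the counts in the usage note are hypothetical and are not taken from our experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F-measure) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure as the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 8 true positives, 2 false positives and 8 false negatives, the precision is 0.8, the recall is 0.5, and the F-measure is 0.8 / 1.3, or about 0.615.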

In this paper, we gave a brief introduction to our systems based on the SVM and NBM algorithms and conducted experiments to study the effectiveness of different features and preprocessing steps. We observed that the best configurations for both tasks were based on spell-checked and dosage-replaced tweets along with n-gram, POS and lexicon features.
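As an illustration of what dosage replacement can look like, the sketch below substitutes a placeholder token for dosage expressions such as "20mg". The regular expression, the list of units and the placeholder `_DOSAGE_` are assumptions made for this example; they are not the rules used in our system.

```python
import re

# Hypothetical pattern: a number, an optional space, and a small set of dosage units.
DOSAGE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.IGNORECASE)


def replace_dosage(tweet, placeholder="_DOSAGE_"):
    """Replace each dosage expression in the tweet with a single placeholder token."""
    return DOSAGE_PATTERN.sub(placeholder, tweet)
```

Under these assumptions, `replace_dosage("took 20mg of prozac")` returns `"took _DOSAGE_ of prozac"`, so that tweets mentioning different dosages share the same token.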

PSB SMM4H Shared Task 1 Results

PSB SMM4H Shared Task 2 Results

G. Holmes, A. Donkin, and I. H. Witten, "Weka: A machine learning workbench," in Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on, 1994, pp. 357-361.

J. Jonnagaddala, T. R. Jue, and H.-J. Dai, "Binary classification of Twitter posts for adverse drug reactions," presented at the Proceedings of the Social Media Mining Shared Task Workshop at the Pacific Symposium on Biocomputing, Big Island, Hawaii, 2016.

R. T.-H. Tsai, H.-C. Hung, H.-J. Dai, Y.-W. Lin, and W.-L. Hsu, "Exploiting Likely-Positive and Unlabeled Data to Improve the Identification of Protein-Protein Interaction Articles," presented at the 6th InCoB - Sixth International Conference on Bioinformatics, 2007.

M. Timonen, "Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion," 2013.

A. Sarker and G. Gonzalez, "Portable automatic text classification for adverse drug reaction detection via multi-corpus training," J Biomed Inform, vol. 53, pp. 196-207, Feb 2015.

A. Klein, A. Sarker, M. Rouhizadeh, K. O'Connor, and G. Gonzalez, "Detecting Personal Medication Intake in Twitter: An Annotated Corpus and Baseline Classification System," BioNLP 2017, pp. 136-142, 2017.

O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith, "Improved part-of-speech tagging for online conversational text with word clusters," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2013.