Comparing Word Embeddings for Document Screening based on Active Learning

Andres Carvallo and Denis Parra [0000-0001-9878-8761]
Computer Science Department, Pontificia Universidad Catolica de Chile, Santiago, Chile
afcarvallo@uc.cl, dparra@ing.puc.cl

Abstract. Document screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches attempt to reduce the workload of physicians who need to screen and label hundreds or thousands of documents in order to answer specific clinical questions. Previous works have attempted to semi-automate document screening and report promising results, but their evaluation is conducted on small datasets, which hinders generalization. Moreover, although some recent works use newly introduced neural language models, no previous work has compared, for this task, the performance of different language models based on neural word embeddings, which have achieved good results in recent years on several NLP tasks. In this work, we evaluate the performance of two popular neural word embeddings (Word2vec and GloVe) in an active learning-based setting for document screening in EBM, with the goal of reducing the number of documents that physicians need to label in order to answer clinical questions. We evaluate these methods on a small public dataset (HealthCLEF 2017) as well as a larger one (Epistemonikos). Our experiments indicate that Word2vec has less variance and better overall performance than GloVe when using active learning strategies based on uncertainty sampling.

Keywords: active learning · evidence-based medicine · document screening · word embeddings

1 Introduction

Evidence-based Medicine (EBM) is a practice that provides scientific evidence to support medical decisions. Nowadays this evidence is obtained from biomedical journals, usually accessible through the PubMed portal (https://www.ncbi.nlm.nih.gov/pubmed/), a search engine that provides free access to abstracts of biomedical research articles as well as to the MEDLINE database. An existing problem is finding relevant documents within a massive volume of documents, given a clinical question or a query. As a consequence, searching for and screening articles related to clinical questions about medical problems can take long and sometimes consumes a large part of a physician's workday [15, 6]. When people conduct this repetitive task, there is a good chance of overlooking important articles, which can have a negative impact on decisions such as the patient's treatment [12]. Moreover, the publication of medical papers has grown exponentially over the last decade. Since 2005, PubMed has indexed more than 1 million articles per year, which means that searching for and manually screening medical evidence will become increasingly difficult for physicians without the support of information retrieval and machine learning algorithms. For this reason, systems have emerged to support experts in the collection of evidence, such as Embase (https://www.elsevier.com/solutions/embase-biomedical-research), DARE (https://www.crd.york.ac.uk/CRDWeb/) and Epistemonikos (https://www.epistemonikos.org/en). In this article, we work with data from Epistemonikos, which helps expert physicians review and validate scientific evidence grouped by medical questions to facilitate its subsequent search. Our goal is to improve the efficiency and efficacy of document screening in the practice of EBM.
In other words, we aim to reduce the effort physicians spend screening documents to find the evidence needed to answer a medical question. We use an active learning approach and experiment with a large dataset of medical questions, unlike previous works, some of them very recent [13], which use very small datasets. In this short paper, we contribute by: i) experimenting on both a large dataset (Epistemonikos, 987 questions) and a small dataset (CLEF, 50 questions), showing evidence of the generalization of our approaches, and ii) comparing the performance of documents represented with two state-of-the-art neural word embeddings (Word2vec [14] and GloVe [18]) as well as with traditional relevance feedback [5].

2 Related Work

The task of finding the documents relevant to a medical question through citation screening has been studied as the total recall problem: given a medical topic or question, find all the documents that are relevant to it. Recently, CLEF Task 2 [11] posed the challenge of prioritizing which documents to screen in order to reduce the work overload for experts. It provides a public dataset with medical topics and a set of candidate documents; participants have to rank the documents by relevance for every medical subject in the minimum number of iterations, making the document screening process more efficient. In the literature, the approaches to this problem follow two general lines: information retrieval and machine learning methods.

In the information retrieval area, there have been many attempts to solve the problem using techniques such as relevance feedback [5], query expansion [13], and ranking and inference based on external knowledge [8]. However, they do not ensure the level of recall necessary to capture all the evidence related to a medical question.

From the machine learning community, the approach is to automate or semi-automate the screening or review of medical articles previously selected as relevant to a medical question, by learning from the labeling patterns of the physicians conducting the review. There have been efforts to solve this problem using automatic classification [2, 3, 1, 16, 21], comparing classifiers such as Naive Bayes, k-NN, and SVM with different text representations, such as word embeddings and bags of clinical terms extracted from titles and abstracts. There is also literature that has used active learning [9, 7, 22, 15] for medical topic detection and clinical text classification. Moreover, a few deep learning models have been proposed for classifying relevant evidence and categorizing documents into medical questions [4, 10]. In general, most prior work has used datasets of up to 50 medical topics/questions and 200,000 documents; in this work, we use a dataset of close to 1,000 medical questions and 370,000 potential documents, allowing models to generalize and obtain better efficacy results compared to the state of the art. In addition, unlike previous work, we compare two neural word embedding models for document representation (Word2vec [14] and GloVe [18]) in order to assess their performance on the biomedical document screening task.
3 Proposed Solution

The process of finding documents that answer a clinical question requires first retrieving a set of candidate documents. Then, physicians perform the document screening, verifying that the title and abstract of each document are related to the medical question; this process can demand considerable time and cognitive effort from experts.

Problem formulation: Given a medical question q and a set of candidate documents C = {c1, c2, ..., cn}, we need to ask an oracle (a physician) O to label these documents as relevant or not relevant to q. We want to avoid asking the oracle to label every document, so we select an informative sample to be labeled by the expert. With these labels, we train a predictive model M. It might be necessary to ask for labels over many iterations in order to refine the model, ending up with several models M0, M1, ..., Mk.

In our case, we use an active learning (AL) approach [20]. Using an AL strategy A (e.g., uncertainty sampling, query-by-committee), we sample a set of unlabeled documents X from C and ask the oracle O to label them. With the labeled items, we then train a machine learning model Mi(X, Y) on the new observations X ⊂ C with the labels Y given by O (binary: Y = 1 means relevant and Y = 0 means not relevant). Then, we use the trained model Mi to predict relevance labels for unobserved documents and, using the active learning strategy A, we select new items to be labeled by O in order to create an updated version of the model, Mi+1. In each iteration we evaluate the model (e.g., precision and recall), and we stop after a fixed number of iterations or once the model converges.

To address this problem, we developed a system that starts with a small proportion of documents labeled as relevant or not relevant for each medical question, used to train a first version of the machine learning model Mi. Then, using the active learning strategy, we choose instances to be labeled by a physician based on the title and abstract text features, represented internally as word embeddings (GloVe and Word2vec). After the physician adds the labels, they are used to train a machine learning model Mi+1 that predicts the relevance of new unlabeled documents, and a new iteration begins.

The performance of our approach depends first on the machine learning algorithm chosen and second on the active learning strategy that selects unlabeled examples to build the labeled dataset used as input for the supervised learning algorithms. The strategies used in this experiment are uncertainty sampling and random sampling, given their lower complexity compared to others such as error-based, gradient-based and variance-reduction strategies [20]. The machine learning algorithms trained with the newly labeled examples are random forests, logistic regression, and neural networks. Regarding the active learning sampling strategies, random sampling chooses random documents to train the machine learning model and is usually used as a baseline against other approaches, whereas uncertainty sampling looks for the records whose label predictions have the highest uncertainty, making them potentially more informative to label before training or updating a model.
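The loop below is a minimal sketch of this pool-based procedure using scikit-learn (the actual experiments rely on the libact library [23]). It assumes documents are already represented as fixed-length embedding vectors (Section 4); the names ask_oracle, seed_idx and the batch size are illustrative placeholders rather than the exact implementation.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Assumes X_pool is an (n_docs x 300) matrix of document embeddings and
# ask_oracle(i) is a stand-in for the physician's relevance judgment of document i.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def uncertainty_sampling_loop(X_pool, seed_idx, seed_labels, ask_oracle,
                              n_iterations=10, batch_size=10):
    labeled_idx = list(seed_idx)    # small starting set (must contain both classes)
    labels = list(seed_labels)      # 1 = relevant, 0 = not relevant
    model = RandomForestClassifier(n_estimators=100)

    for _ in range(n_iterations):
        model.fit(X_pool[labeled_idx], labels)

        # Score every still-unlabeled document in the pool.
        unlabeled_idx = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
        proba = model.predict_proba(X_pool[unlabeled_idx])[:, 1]

        # Uncertainty sampling: query the documents whose predicted
        # probability is closest to 0.5 (least confident predictions).
        uncertainty = np.abs(proba - 0.5)
        query_idx = unlabeled_idx[np.argsort(uncertainty)[:batch_size]]

        # The oracle (physician) labels the queried documents.
        labeled_idx.extend(query_idx)
        labels.extend(ask_oracle(i) for i in query_idx)

    return model, labeled_idx, labels
```

Random sampling, the baseline strategy, is obtained by replacing the uncertainty-based selection with a random choice of batch_size unlabeled documents.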
4 Experiments

Dataset. For the experiments we used two datasets: CLEF (https://sites.google.com/site/clefehealth2017/task-2) and Epistemonikos. Both have a similar distribution of documents per question, where the majority of medical questions contain approximately 200 relevant documents. On the one hand, the CLEF dataset contains only 50 medical questions and 200,000 related documents, crawled from PubMed using each document id. On the other hand, the Epistemonikos Evidence Synthesis Project is a collaborative initiative established in 2012 with the objective of collecting, organizing and comparing all relevant evidence for health decision-making through a multilingual platform. This dataset is composed of 987 medical questions and 372,829 potential documents. In both datasets, each medical question is associated with a Systematic Review (hereinafter, SR), a type of article that collects and synthesizes the most relevant primary studies and trials related to a question. The information available for documents in both datasets consists of the title, abstract, authors, year, and a label indicating whether the document is relevant (or not) to the question or medical subject. In the case of the Epistemonikos data, the labels were previously curated by senior medical students, who had to select papers related to a set of medical questions.

Document representation: for each document we lowercase the concatenation of title and abstract, then remove stop words, and we use GloVe [18] and Word2vec [14] to obtain a 300-dimensional embedding for each word. Finally, using average pooling over the word vectors, we obtain a single vector per document.
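A minimal sketch of this representation step is shown below, assuming pretrained 300-dimensional vectors loaded with gensim; the embedding file name and the abbreviated stop-word set are illustrative placeholders, not the exact resources used in the experiments.

```python
# Sketch: represent a document as the average of its word embeddings.
# The model path and stop-word list are illustrative placeholders.
import numpy as np
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format("embeddings-300d.bin", binary=True)
STOP_WORDS = {"the", "of", "and", "in", "to", "a"}  # in practice, a full stop-word list

def embed_document(title, abstract, dim=300):
    """Lowercase title+abstract, drop stop words, average the word vectors."""
    tokens = (title + " " + abstract).lower().split()
    vectors = [word_vectors[t] for t in tokens
               if t not in STOP_WORDS and t in word_vectors]
    if not vectors:                   # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)   # average pooling -> one 300-d document vector
```

GloVe vectors can be loaded the same way after converting them to word2vec format (e.g., with gensim's glove2word2vec script).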
4.1 Offline Active Learning Setup

We ran a simulation of the active learning labeling process of documents for medical questions. As each medical question has a different number of relevant documents, we sample non-relevant documents so that the relevant ones correspond to 5% of the total, making the distribution of documents similar to that of the CLEF dataset [13]. We filtered out some of the medical questions, keeping those with more than five and fewer than 2,000 relevant documents, ending up with 987 questions. We compared the results of applying active learning on the CLEF dataset, which contains 50 SRs (Systematic Reviews), with those on Epistemonikos. For each medical question, we hide the document labels and leave only five documents with their respective labels to start building the model, and then iterate with active learning to receive feedback from the oracle. For the predictions made by the machine learning model in each iteration, we sorted the results by the predicted probability of being relevant, so the evaluation metrics were calculated over the ranked list of potential candidates given by each strategy. The parameters chosen for the machine learning algorithms were: for the neural networks, five hidden layers, ReLU activations, a learning rate of 1e-05, momentum of 0.9, 100 neurons per layer and the Adam optimizer; for the random forest, 100 estimators. Experiments were programmed in Python 3 using the libact [23], scikit-learn [17], pandas and gensim libraries. Code for these experiments will be published in a GitHub repository after notification.

Evaluation metrics. We evaluated our proposed active learning strategies with traditional IR metrics also used by Lee et al. [13]: precision@k, recall@k and mean average precision (MAP). We report the metrics obtained after ten iterations, with ten documents labeled per iteration.
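For reference, the sketch below shows how these ranking metrics can be computed from a list of candidate ids ranked by predicted relevance; it is a generic implementation of the standard formulas, not the exact evaluation code used in the experiments.

```python
# Ranking metrics over a list of candidate ids sorted by predicted relevance.
def precision_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    # Precision@i accumulated at each position i where a relevant document appears,
    # normalized by the total number of relevant documents.
    hits, precisions = 0, []
    for i, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# MAP is the mean of average_precision over all medical questions.
```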
5 Results and Discussion

Table 1 presents the results. The first column indicates the dataset as well as the type of embedding. The second column shows the active learning strategy together with the learning model. The following seven columns show recall at three cut-off levels (R@10, R@20, R@30), precision at three cut-off levels (Pr@10, Pr@20, Pr@30), and mean average precision (MAP).

Table 1: Average results (standard deviation in parentheses) for recall@k (R@k), precision@k (Pr@k) and Mean Average Precision (MAP) on the Epistemonikos and CLEF datasets, using active learning strategies (US: uncertainty sampling, RS: random sampling) with a batch of 10 documents per feedback iteration, for Word2vec and GloVe representations.

Dataset            AL-Model              R@10           R@20           R@30           Pr@10          Pr@20          Pr@30          MAP
Epistemonikos      US-NN                 0.377 (0.02)   0.542 (0.04)   0.627 (0.05)   0.856 (0.045)  0.747 (0.053)  0.654 (0.053)  0.900 (0.001)
987 SRs            US-RF                 0.414 (0.03)   0.590 (0.06)   0.679 (0.08)   0.926 (0.058)  0.799 (0.075)  0.696 (0.078)  0.975 (0.001)
GloVe 300 dim      US-LR                 0.307 (0.03)   0.435 (0.04)   0.498 (0.05)   0.760 (0.047)  0.670 (0.057)  0.587 (0.058)  0.875 (0.001)
                   RS-LR                 0.052 (0.001)  0.094 (0.003)  0.127 (0.004)  0.413 (0.012)  0.362 (0.01)   0.315 (0.01)   0.625 (0.07)
Epistemonikos      US-NN                 0.391 (0.02)   0.562 (0.04)   0.645 (0.05)   0.877 (0.04)   0.760 (0.05)   0.663 (0.05)   0.932 (0.02)
987 SRs            US-RF                 0.417 (0.03)   0.596 (0.05)   0.687 (0.07)   0.934 (0.04)   0.807 (0.06)   0.704 (0.06)   0.973 (0.001)
Word2vec 300 dim   US-LR                 0.413 (0.02)   0.593 (0.03)   0.684 (0.04)   0.912 (0.04)   0.791 (0.04)   0.691 (0.04)   0.958 (0.02)
                   RS-LR                 0.021 (0.01)   0.054 (0.002)  0.125 (0.002)  0.283 (0.01)   0.192 (0.01)   0.293 (0.01)   0.463 (0.05)
CLEF               US-NN                 0.427 (0.01)   0.573 (0.01)   0.665 (0.01)   0.841 (0.02)   0.782 (0.02)   0.702 (0.01)   0.935 (0.02)
50 SRs             US-RF                 0.416 (0.01)   0.583 (0.01)   0.688 (0.01)   0.865 (0.01)   0.871 (0.01)   0.729 (0.01)   0.965 (0.01)
GloVe 300 dim      US-LR                 0.146 (0.019)  0.206 (0.04)   0.228 (0.03)   0.758 (0.014)  0.555 (0.04)   0.446 (0.07)   0.957 (0.04)
                   RS-LR                 0.033 (0.01)   0.077 (0.002)  0.099 (0.002)  0.392 (0.01)   0.278 (0.01)   0.322 (0.01)   0.374 (0.05)
CLEF               US-NN                 0.402 (0.01)   0.552 (0.01)   0.687 (0.01)   0.865 (0.01)   0.753 (0.01)   0.726 (0.01)   0.985 (0.01)
50 SRs             US-RF                 0.428 (0.01)   0.586 (0.01)   0.689 (0.01)   0.867 (0.01)   0.785 (0.01)   0.713 (0.01)   0.989 (0.01)
Word2vec 300 dim   US-LR                 0.170 (0.026)  0.249 (0.047)  0.297 (0.06)   0.892 (0.018)  0.805 (0.018)  0.723 (0.017)  0.930 (0.08)
                   RS-LR                 0.019 (0.01)   0.045 (0.002)  0.095 (0.002)  0.192 (0.01)   0.382 (0.01)   0.273 (0.01)   0.172 (0.05)
Epistemonikos      Rel. Feed. (Rocchio)  0.28           0.35           0.42           -              -              -              0.45
TF-IDF             BM25                  0.18           0.25           0.32           -              -              -              0.35

As shown in Table 1, for the Epistemonikos dataset, uncertainty sampling based on RF is a clear winner for recall@10, which means that this strategy captures relevant documents in the first ten positions for both GloVe and Word2vec representations of titles and abstracts. In the case of the HealthCLEF dataset, the model that achieves the best results on recall@10 is also RF, followed by NN. If we compare the performance of both word embeddings, we observe that, in general, Word2vec has better and more stable performance, while GloVe presents larger variations across ML models.

6 Conclusion and Future Work

In this article we supported results from previous studies showing that active learning with an uncertainty sampling strategy yields good results for the task of biomedical document screening. Moreover, we contribute by comparing two popular word embeddings to represent documents: Word2vec and GloVe. The best results were obtained using the Word2vec document representation and random forests as the learning algorithm. The GloVe document representation also yields competitive results, but it seems more sensitive to the classification model used: it performs well with random forests but shows poor performance with neural networks and logistic regression. Moreover, our experiments indicate that these results are consistent on both the small public HealthCLEF dataset and the larger Epistemonikos dataset, giving evidence of generalization.

For future work, we will try other machine learning models and active learning strategies, and evaluate the results using the CLEF metrics [11]. We will also test other paradigms for more scalable learning, such as weak supervision. With respect to embeddings, we will test different values of sensitive parameters, as mentioned by Roy et al. [19]. Finally, we will conduct a user study with actual physicians in order to evaluate the performance of our approach online.

7 Acknowledgements

We acknowledge the Epistemonikos Foundation, the Chilean research agency Conicyt (Fondecyt grant 1191791), and the Millennium Institute IMFD.

References

1. Adeva, J.G., Atxa, J.P., Carrillo, M.U., Zengotitabengoa, E.A.: Automatic text classification to support systematic reviews in medicine. Expert Systems with Applications 41(4), 1498–1508 (2014)
2. Bekhuis, T., Tseytlin, E., Mitchell, K.J., Demner-Fushman, D.: Feature engineering and a proposed decision-support system for systematic reviewers of medical evidence. PLoS ONE 9(1), e86277 (2014)
3. Choi, S., Ryu, B., Yoo, S., Choi, J.: Combining relevancy and methodological quality into a single ranking for evidence-based medicine. Information Sciences 214, 76–90 (2012)
4. Del Fiol, G., Michelson, M., Iorio, A., Cotoi, C., Haynes, R.B.: A deep learning method to automatically identify reports of scientifically rigorous clinical research from the biomedical literature: Comparative analytic study. J Med Internet Res 20(6) (Jun 2018)
5. Donoso-Guzmán, I., Parra, D.: An interactive relevance feedback interface for evidence-based health care. In: 23rd International Conference on Intelligent User Interfaces. pp. 103–114. ACM (2018)
6. Elliott, J.H., Turner, T., Clavisi, O., Thomas, J., Higgins, J.P., Mavergames, C., Gruen, R.L.: Living systematic reviews: an emerging opportunity to narrow the evidence-practice gap. PLoS Medicine 11(2), e1001603 (2014)
7. Figueroa, R.L., Zeng-Treitler, Q., Ngo, L.H., Goryachev, S., Wiechmann, E.P.: Active learning for clinical text classification: is it better than random sampling? Journal of the American Medical Informatics Association 19(5), 809–816 (2012)
8. Goodwin, T.R., Harabagiu, S.M.: Knowledge representations and inference techniques for medical question answering. ACM Transactions on Intelligent Systems and Technology (TIST) 9(2), 14 (2018)
9. Hashimoto, K., Kontonatsios, G., Miwa, M., Ananiadou, S.: Topic detection using paragraph vectors to support active learning in systematic reviews. Journal of Biomedical Informatics 62, 59–65 (2016)
10. Hughes, M., Li, I., Kotoulas, S., Suzumura, T.: Medical text classification using convolutional neural networks. Stud Health Technol Inform 235, 246–250 (2017)
11. Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: CLEF 2017 technologically assisted reviews in empirical medicine overview. In: CEUR Workshop Proceedings. vol. 1866, pp. 1–29 (2017)
12. Keselman, A., Smith, C.A.: A classification of errors in lay comprehension of medical documents. Journal of Biomedical Informatics 45(6), 1151–1163 (2012)
13. Lee, G.E., Sun, A.: Seed-driven document ranking for systematic reviews in evidence-based medicine. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. pp. 455–464. ACM (2018)
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
15. Miwa, M., Thomas, J., O'Mara-Eves, A., Ananiadou, S.: Reducing systematic review workload through certainty-based screening. Journal of Biomedical Informatics 51, 242–253 (2014)
16. Mo, Y., Kontonatsios, G., Ananiadou, S.: Supporting systematic reviews using LDA-based document representations. Systematic Reviews 4(1), 172 (2015)
17. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
18. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
19. Roy, D., Ganguly, D., Bhatia, S., Bedathur, S., Mitra, M.: Using word embeddings for information retrieval: How collection and term normalization choices affect performance. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. pp. 1835–1838. ACM (2018)
20. Settles, B.: Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1), 1–114 (2012)
21. Wallace, B.C., Small, K., Brodley, C.E., Lau, J., Schmid, C.H., Bertram, L., Lill, C.M., Cohen, J.T., Trikalinos, T.A.: Toward modernizing the systematic review pipeline in genetics: efficient updating via data mining. Genetics in Medicine 14(7), 663 (2012)
22. Wallace, B.C., Small, K., Brodley, C.E., Trikalinos, T.A.: Active learning for biomedical citation screening. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 173–182. ACM (2010)
23. Yang, Y.Y., Lee, S.C., Chung, Y.A., Wu, T.E., Chen, S.A., Lin, H.T.: libact: Pool-based active learning in Python (2017)