Exploring the Effects of Different Embedding Algorithms and Neural Architectures on Early Detection of Alzheimer's Disease

Minni Jain, Rishabh Doshi, Vibhu Sehra and Divyashikha Sethia
Department of Computer Engineering, Delhi Technological University, Delhi, India
minnijain@dtu.ac.in (M. Jain); doshirishabh26@gmail.com (R. Doshi); vibhusehra@gmail.com (V. Sehra); divyashikha@dtu.ac.in (D. Sethia)

ISIC'2021: International Semantic Intelligence Conference, Feb 25-27, 2021, New Delhi, India
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Alzheimer's Disease (AD) is an irreversible, progressive neurodegenerative disorder that deteriorates the cognitive and linguistic abilities of a person over time. Although ample research has been done on the early detection of AD, it remains a challenging task. Doctors use the patient's history, laboratory tests, and changes in behaviour to diagnose the disease. Because language impairments accompany this disease, Natural Language Processing (NLP) techniques can help automate its detection. This work analyzes the effect of different embedding models on the DementiaBank dataset for detecting the disease. It applies both generic and domain-specific word embeddings to three deep learning models: CNN, Bidirectional LSTM (BLSTM), and CNN+BLSTM. Results indicate that for a specific picture description task such as the cookie theft description, domain-specific word embeddings tend to work better. Lastly, we discuss how the results are affected by the choice of embedding model (FastText, Word2Vec, GloVe).

Keywords
Alzheimer's Disease, Natural Language Processing, Word Embeddings, Deep Learning, Cookie Theft Description Task

1. Introduction

Alzheimer's Disease (AD) is a brain disorder that slowly damages the nerve connections in the brain. It is the most common type of dementia, and its symptoms include communication difficulties, memory loss, poor judgment, and changes in mood and personality¹. More than 50 million people are diagnosed with Alzheimer's Disease every year². This challenge has grown substantially over the years with the ageing of the population and the age-related nature of many dementia-producing neurodegenerative diseases [1], and the number of AD cases will continue to grow in the coming years. There is no proven health care method to cure AD; hence, it is necessary to develop new methods to detect AD in a patient. Around 50 to 90% of dementia cases are left undiagnosed by standard clinical examinations [1], so early detection of Alzheimer's Disease remains a massive issue. The disease progresses over the years, and patients can sometimes have it for 20 years before showing symptoms, at which point medical treatment is of little use. Early detection of Alzheimer's is therefore still a challenge in medical science. There have been many attempts to diagnose the disease with the help of neuroimaging techniques, but non-imaging techniques are essential to personalize the treatment for a patient and to monitor disease progression. Machine learning can detect the language deficits that often accompany dementia and can therefore be used for early detection of Alzheimer's Disease. Many Natural Language Processing (NLP) techniques have previously been proposed to help in early detection of AD; these techniques treat the problem as a supervised learning problem. Research works such as [2, 3, 4] made use of transcripts obtained from interviews with patients to detect Alzheimer's disease using various machine learning and deep learning algorithms. Further, other studies such as [5, 6, 7] used acoustic features obtained from the audio recordings of the interviews for the classification task.

¹ https://www.alz.org/alzheimers-dementia/10_signs
² https://www.alz.org/alzheimers-dementia/facts-figures
Our study aims to explore the effect of various word embeddings and neural architectures on transcripts obtained from the cookie theft description task of DementiaBank. This paper makes use of both generic and domain-specific word embeddings, the latter trained on the transcripts themselves. Out of all the presented models, the CNN + Bidirectional LSTM model that uses FastText domain-specific word embeddings provides the best results. Sentences obtained from the transcripts are input to the models, and the output is the predicted label (Healthy or Alzheimer's); no feature engineering is involved in the process. Hence, this paper investigates how the task of detecting Alzheimer's Disease is affected by the use of various domain-specific and generic embeddings on different neural architectures.

The rest of the paper is organized as follows: section 2 discusses related work, sections 3 and 4 present our proposed work and experimental setup, sections 5 and 6 present our results and discussion, and section 7 concludes with future work.

2. Related Work

This section discusses previous research in the field of Alzheimer's detection using various machine learning and deep learning techniques.

2.1. Machine Learning Techniques

Existing research on early detection of Alzheimer's Disease using natural language processing has made use of various machine learning techniques. [8] used three different machine learning algorithms (Decision Trees, Support Vector Machines, and K-Nearest Neighbours) on a sample of 80 conversations, achieving the best accuracy of 79.5% with their Decision Tree model. [9] proposed a Support Vector Machine model using 14 lexical features, nine syntactic features, and n-grams extracted from the Pitt Corpus in the DementiaBank dataset, with 99 dementia transcripts and 99 control transcripts. They used the Area Under Curve (AUC) metric to test the performance of the algorithm, achieving a maximum AUC score of 0.93 with the top 1000 features obtained using the Leave Pair Out Cross-Validation (LPOCV) technique.

Further, [7] used the DementiaBank dataset to extract acoustic and semantic measures to predict the clinical scores of the patients using a bivariate dynamic Bayes network. [5] extracted acoustic features from the DementiaBank dataset and created a regression model to predict the clinical scores (MMSE) used for dementia prediction. [6] applied acoustic features to various machine learning models such as Logistic Regression, KNN, Naive Bayes, a dummy classifier, and Random Forests, achieving the best accuracy of 78% with the Logistic Regression classifier.

2.2. Deep Learning Techniques

[10] made use of deep neural networks and achieved an accuracy of 87.5% using sparse vector representations of 4-gram and 5-gram features. The dataset was equally divided, using 99 dementia transcripts and 99 control transcripts. More recently, [2] proposed three different deep learning models (2D-CNN, LSTM, and 2D-CNN + RNN) on the complete DementiaBank dataset, which consists of 1017 Alzheimer's transcripts and 243 control transcripts. They used each utterance as a separate data sample, thereby obtaining 14362 utterance samples, and achieved their best accuracy of 91.1% with the CNN-RNN model by feeding word embeddings along with POS-tagged data to the classifier. [3] used a Hierarchical Attention Network (HAN) on the transcripts obtained from the DementiaBank dataset, combining word embeddings with demographic features for the prediction task and obtaining an accuracy of 86.9%. [11] proposed a model that combined a bidirectional hierarchical recurrent neural network with an attention mechanism for dementia detection. [12] showed that a fine-tuned BERT model outperformed models that used hand-crafted feature engineering. Table 4 summarizes the approaches used by previous research works.

Table 4: Comparison of the proposed work with results and techniques of existing work

  Author                      Accuracy  Model                           Technique
  Orimaye et al. (2018) [10]  87.5%     Neural Network                  4-5 n-grams
  Karlekar et al. (2018) [2]  82.8%     2D-CNN                          Word Embeddings
  Karlekar et al. (2018) [2]  83.7%     RNN                             Word Embeddings
  Karlekar et al. (2018) [2]  91.1%     2D-CNN + RNN                    Word Embeddings with POS-tagged data
  Kong et al. (2019) [3]      86.9%     Hierarchical Attention Network  Word Embeddings
  Proposed work               90.6%     1D-CNN + BLSTM                  Domain-specific FastText Word Embedding

3. Proposed Work

3.1. Preprocessing

This work uses the transcripts in the DementiaBank dataset [13], which are available in the form of CHAT transcription [14]. The transcripts are passed through a series of steps, illustrated in Fig. 1: the PyLangAcq library [15], a powerful library that can handle CHAT data, reads the transcripts; all obtained utterances are converted to lowercase; and all punctuation is removed. We use 99 transcripts from each set (Dementia and Control) from the Cookie Theft task, following [9, 10], which used an equal number of dementia and control patients. A sketch of these steps in code is given below.

Figure 1: Proposed approach for early detection of Alzheimer's Disease (pipeline: transcripts from DementiaBank → read with the PyLangAcq library → convert words to lowercase and remove punctuation → create domain-specific word embeddings (Word2Vec, GloVe, FastText) → pass the word embeddings (domain-specific and generic) through the classifier (CNN, BLSTM, CNN+BLSTM) → result).
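As a concrete illustration of the preprocessing steps above, the following is a minimal Python sketch. It assumes a recent version of the pylangacq API; restricting the text to the participant tier ("PAR") and the exact cleaning order are our assumptions, as the paper does not specify them, and the file path is hypothetical.

    import string
    import pylangacq  # PyLangAcq library [15] for reading CHAT transcripts

    def preprocess_transcript(cha_path):
        """Read one CHAT (.cha) transcript and return its cleaned word list."""
        reader = pylangacq.read_chat(cha_path)
        # Keep the patient's words (participant tier "PAR"); this restriction
        # is an assumption, not stated explicitly in the paper.
        words = reader.words(participants="PAR")
        # Lowercase everything and strip all punctuation characters.
        cleaned = [w.lower().translate(str.maketrans("", "", string.punctuation))
                   for w in words]
        return [w for w in cleaned if w]  # drop tokens that were pure punctuation

    tokens = preprocess_transcript("cookie/001-0.cha")  # hypothetical path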
3.2. Word Embeddings Used for Early Detection of Alzheimer's Disease

This work uses three types of word embeddings: Word2Vec [16], GloVe [17], and FastText [18]. These embeddings are chosen because they are widely used and have different architectures, which may indicate the best way to proceed with the problem at hand. All the word embeddings have a 300-dimensional vector representation for each word. For each of the types mentioned above, two word embeddings are used: domain-specific and generic. The maximum size of a transcript is 498 words; hence, each transcript is represented as a (500, 300) embedding matrix.

3.2.1. Domain-Specific Word Embeddings

Domain-specific word embeddings are trained on a specific corpus containing data from the domain of interest. They are highly effective for that domain but require extra training time. All the transcripts from DementiaBank are used to create the domain-specific word embeddings. The Gensim library [19] is used to create the Word2Vec [16] and FastText [18] embeddings from the corpus, and the glove library³ is used to create the GloVe embeddings [17]. A code sketch of this training step is given below.

³ https://github.com/JonathanRaiman/glove

3.2.2. Generic Word Embeddings

Generic word embeddings are trained on vast generic corpora; hence, they reduce training time and often give outstanding results. The work uses the pretrained GloVe [17] embeddings trained on 6 billion tokens; the pretrained Word2Vec embeddings, which include word vectors for a vocabulary of 3 million words and phrases trained on roughly 100 billion words from a Google News dataset; and the pretrained FastText [18] embeddings, which contain vectors for 1 million words trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset, totalling 16 billion tokens.
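The domain-specific training described in Section 3.2.1 might look as follows with Gensim; this is a sketch assuming the Gensim 4 API (vector_size was called size in older releases). The window, min_count, and epochs settings are illustrative assumptions, since the paper reports only the 300-dimensional vector size.

    from gensim.models import Word2Vec, FastText

    # Tokenized DementiaBank transcripts (tiny placeholder corpus here;
    # in the paper, all transcripts from the dataset are used).
    corpus = [
        ["the", "boy", "is", "stealing", "cookies", "from", "the", "jar"],
        ["the", "sink", "is", "overflowing", "while", "mother", "dries", "dishes"],
    ]

    # 300-dimensional domain-specific embeddings, matching the paper;
    # window, min_count, and epochs are illustrative assumptions.
    w2v = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, epochs=10)
    ft = FastText(sentences=corpus, vector_size=300, window=5, min_count=1, epochs=10)

    print(w2v.wv["cookies"].shape)  # (300,)
    print(ft.wv["cookies"].shape)   # (300,)

The GloVe embeddings follow a different workflow (a separate co-occurrence counting step) through the glove library cited above.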
3.3. Deep Learning Models Used

This section explains the deep learning models used for the classification of control and dementia patients. The Keras functional API [20] is used to create all the deep learning models explained below. To address the concern of overfitting, we apply an L2 regularizer [21] to the kernel of each layer. Due to the small size of the dataset, the research makes use of 10-fold cross-validation on each model. The models attempt to capture the language impairments that are often seen in the early phases of dementia. The Appendix provides the details of the model architectures.

3.3.1. CNN Model

The CNN model consists of a combination of 1D convolution layers with an increasing number of kernels, followed by MaxPool layers and a dense network. We use the Tanh activation for the 1D convolution layers, ReLU [22] as the activation function for the dense layers, and Softmax for classification.

3.3.2. Bi-Directional LSTM Model

The model has a series of Bidirectional LSTM layers and Dropout [23] layers, followed by a dense network for classification. The Dropout layers are added to prevent overfitting, with the dropout rate kept at 30%. All the layers use the default Tanh activation except the last one, which uses Softmax for classification.

3.3.3. Hybrid CNN + Bi-Directional LSTM Model

This model is a combination of the above two models. We pass the embeddings through a series of 1D convolutional layers followed by a MaxPooling layer, with two Bidirectional LSTM layers stacked over the MaxPool layer and a dense network at the end. Fig. 2 illustrates the proposed model. The activation used for the CNN and Bidirectional LSTM layers is Tanh, while the dense layers use ReLU [22] activation, followed by a Softmax function for classification.

Figure 2: Pictorial representation of the CNN+BLSTM model used.

3.4. Training Details

The above models are trained using the Adam optimizer [24] for 30 epochs each, with binary cross-entropy as the loss function. The L2 regularization [21] applied in each layer has λ = 10⁻⁵. A code sketch of the hybrid model and this training configuration is given below.
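To make Sections 3.3.3 and 3.4 concrete, below is one possible Keras functional-API reading of the hybrid model as specified in Appendix A.1.3; it is a sketch, not the authors' released code. The return_sequences flag on the first LSTM is an implementation detail the appendix leaves implicit, and binary cross-entropy over a 2-unit softmax with one-hot labels follows the paper's stated setup.

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    l2 = regularizers.l2(1e-5)  # L2 regularization with lambda = 1e-5 (Sec. 3.4)

    # Input: one transcript as a (500, 300) matrix of word embeddings (Sec. 3.2)
    inputs = keras.Input(shape=(500, 300))

    # Convolutional block (filters, kernel size) as listed in Appendix A.1.3;
    # Tanh activations on Conv/LSTM layers as stated in Sec. 3.3.3.
    x = layers.Conv1D(8, 3, activation="tanh", kernel_regularizer=l2)(inputs)
    x = layers.Conv1D(10, 3, activation="tanh", kernel_regularizer=l2)(x)
    x = layers.MaxPooling1D(3)(x)
    x = layers.Conv1D(16, 3, activation="tanh", kernel_regularizer=l2)(x)
    x = layers.Conv1D(20, 3, activation="tanh", kernel_regularizer=l2)(x)
    x = layers.MaxPooling1D(3)(x)

    # Stacked bidirectional LSTMs; return_sequences=True on the first is
    # required for stacking (left implicit in the appendix).
    x = layers.Bidirectional(layers.LSTM(8, return_sequences=True,
                                         kernel_regularizer=l2))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Bidirectional(layers.LSTM(16, kernel_regularizer=l2))(x)

    # Dense head: ReLU layers followed by a 2-way softmax (Healthy / AD)
    x = layers.Dense(64, activation="relu", kernel_regularizer=l2)(x)
    x = layers.Dense(32, activation="relu", kernel_regularizer=l2)(x)
    outputs = layers.Dense(2, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X, y_onehot, epochs=30, batch_size=10)  # Sec. 3.4 / Appendix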
4. Experimental Details

This work uses the Pitt Corpus, the largest English dataset available in DementiaBank [13]. DementiaBank is part of the TalkBank project initiated by Carnegie Mellon University and funded by the National Institute on Aging; the project encourages research on human communication. It uses the Codes for the Human Analysis of Transcripts (CHAT) system [14], which provides automatic analysis and testing and is commonly used across many datasets to provide uniformity and ease of use. Participants from each group (Control and Dementia) visited annually for an interview. The Pitt Corpus [13] is a collection of transcripts and audio files collected as part of a longitudinal study on Alzheimer's and related dementias conducted at the University of Pittsburgh School of Medicine. The dataset contains interviews of patients with possible Alzheimer's along with control patients, comprising transcripts of 104 control patients and 208 dementia patients, with patient ages ranging from 49 to 90 years. It comprises four different tests on the patients:

• Cookie Theft: Patients see an image provided by the Boston Diagnostic Aphasia Examination, and then the patients (Control and Dementia) recall the events taking place in the image (Fig. 3).
• Fluency: This task is done only for dementia patients, who respond to a word fluency task.
• Recall: The dementia patients undergo a story recall test.
• Sentence: The dementia patients perform a sentence construction task.

The work uses the Cookie Theft part of the corpus, as it contains the maximum number of participants and has been used by previous researchers.

Figure 3: Boston cookie theft description task (picture stimulus).

5. Results

All three neural models (1D CNN, Bidirectional LSTM (BLSTM), and 1D CNN + Bidirectional LSTM (CNN + BLSTM)) use both the generic and the domain-specific word embeddings of each embedding model. For domain-specific word embeddings, we achieved maximum accuracies of 89.9%, 85%, and 90.6% with the FastText embedding for the CNN, BLSTM, and CNN + BLSTM models, respectively. For pre-trained word embeddings, the maximum accuracies obtained were 85.2% with GloVe for both CNN and BLSTM, and 85.5% with FastText for CNN + BLSTM. The baseline model is a constant-label classifier, which gives the same output for any input and therefore achieves an accuracy of 50%, since we have two classes. Accuracy, precision, recall, and F1-score are used as the evaluation metrics; previous deep learning works such as [2] used accuracy, [10] used AUC (Area Under Curve), and [3] used precision, recall, and F1-score. Tables 1, 2, and 3 summarize the results obtained by using the three embedding models (GloVe, Word2Vec, FastText) with the three deep learning models, and Fig. 4 compares the F1-scores achieved by these models, making clear that domain-specific FastText embeddings outperform all the other embeddings. Generally, the performance of the domain-specific word embeddings was better than that of the generic word embeddings; the probable causes are discussed in the next section. A sketch of the evaluation protocol in code follows the tables.

Table 1: Results obtained for the CNN model

  Word Embedding               Accuracy  Precision  Recall  F1-score
  FastText (generic)           0.85      0.86       0.85    0.85
  FastText (domain-specific)   0.90      0.92       0.90    0.91
  GloVe (generic)              0.85      0.85       0.85    0.85
  GloVe (domain-specific)      0.83      0.83       0.81    0.82
  Word2Vec (generic)           0.77      0.78       0.77    0.77
  Word2Vec (domain-specific)   0.80      0.80       0.80    0.80

Table 2: Results obtained for the BLSTM model

  Word Embedding               Accuracy  Precision  Recall  F1-score
  FastText (generic)           0.80      0.85       0.80    0.82
  FastText (domain-specific)   0.85      0.86       0.85    0.85
  GloVe (generic)              0.85      0.88       0.85    0.86
  GloVe (domain-specific)      0.84      0.85       0.84    0.84
  Word2Vec (generic)           0.74      0.75       0.74    0.74
  Word2Vec (domain-specific)   0.80      0.80       0.80    0.80

Table 3: Results obtained for the CNN+BLSTM model

  Word Embedding               Accuracy  Precision  Recall  F1-score
  FastText (generic)           0.86      0.86       0.85    0.85
  FastText (domain-specific)   0.91      0.91       0.91    0.91
  GloVe (generic)              0.84      0.85       0.83    0.84
  GloVe (domain-specific)      0.87      0.88       0.87    0.87
  Word2Vec (generic)           0.77      0.79       0.78    0.78
  Word2Vec (domain-specific)   0.80      0.80       0.80    0.80

Figure 4: Comparison of F1-scores achieved by different neural models and word embeddings.
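The 10-fold cross-validation protocol with the four reported metrics could be sketched as follows with scikit-learn. This is a sketch under stated assumptions: the paper does not report fold seeding or the metric-averaging scheme, so weighted averaging and shuffled stratified folds are our choices here.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import precision_recall_fscore_support, accuracy_score

    def evaluate_10fold(build_model, X, y):
        """10-fold CV reporting accuracy, precision, recall, and F1.
        `build_model` returns a fresh compiled Keras model; X is a numpy
        array of shape (n_samples, 500, 300) and y holds integer 0/1 labels."""
        scores = []
        for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
            model = build_model()
            y_train = np.eye(2)[y[train_idx]]  # one-hot for the 2-way softmax
            model.fit(X[train_idx], y_train, epochs=30, batch_size=10, verbose=0)
            y_pred = model.predict(X[test_idx]).argmax(axis=1)
            p, r, f1, _ = precision_recall_fscore_support(
                y[test_idx], y_pred, average="weighted")
            scores.append((accuracy_score(y[test_idx], y_pred), p, r, f1))
        return np.mean(scores, axis=0)  # mean accuracy, precision, recall, F1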
6. Discussion

The paper aims to explore how different word embedding models and types of embeddings perform on different neural models, using both the domain-specific and the generic word embeddings to classify the transcripts. Since the domain-specific word embeddings are trained on the same corpus being classified, they generally provide better results. As the Cookie Theft data consists of descriptions of a single image, the vocabulary found in the transcripts is limited, and as a result it is easier to learn the relationships between words. Domain-specific FastText and Word2Vec embeddings accordingly provide better results than their generic counterparts, while GloVe embeddings provide similar results for both types. If we had a combination of different tasks (not only Cookie Theft) with a larger corpus and vocabulary, generic embeddings might perform better.

Results indicate that Word2Vec has the lowest accuracy among the three embedding models. This is likely because domain-specific Word2Vec requires a larger corpus to develop semantic relations, as it only captures local word relations. The domain-specific FastText embedding gives the best result since it does not require a large corpus: it breaks each word into character n-grams, thereby effectively increasing the vocabulary size. A small illustration of this subword behaviour is given at the end of this section.

Results also indicate that the hybrid CNN + BLSTM model achieves the highest accuracy of 90.6%. The CNN + BLSTM model works better than either model alone because:

• The CNN model captures the short-term dependencies in the text.
• The LSTM model captures long-term dependencies in the text, and the Bidirectional LSTM improves on the LSTM by training two LSTM cells on a single input sequence instead of one.

Compared to similar previous works such as [2] and [3], which use a word embedding layer trained along with the neural architecture, this study uses three word embedding models and creates both a domain-specific and a pre-trained embedding from each, in order to identify how different embedding models, and the type of data on which the embeddings are trained, affect the performance of detecting Alzheimer's Disease. Moreover, [2] breaks down each transcript into utterances and treats them as separate data samples, thereby creating 14362 samples, compared to our 198 samples, which are complete transcripts of individual patients.
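The subword argument above can be made concrete with a small Gensim example (Gensim 4 API assumed; the toy sentences are ours): FastText composes a vector for a word it never saw from that word's character n-grams, whereas Word2Vec has no such fallback.

    from gensim.models import FastText, Word2Vec

    sentences = [["the", "boy", "steals", "a", "cookie"],
                 ["the", "girl", "laughs"]]

    ft = FastText(sentences=sentences, vector_size=300, min_count=1, min_n=3, max_n=6)
    w2v = Word2Vec(sentences=sentences, vector_size=300, min_count=1)

    print("cookies" in ft.wv.key_to_index)  # False: "cookies" was never trained
    print(ft.wv["cookies"].shape)           # (300,): composed from char n-grams
    # w2v.wv["cookies"] would raise a KeyError: Word2Vec has no subword fallback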
7. Conclusion and Future Work

This study employs three word embedding algorithms on three different neural models that make use of CNN and Bidirectional LSTM for Alzheimer's Disease classification. For each word embedding algorithm, two different types of word embeddings were used, domain-specific and generic, and it was found that domain-specific word embeddings performed better than generic word embeddings. This work was limited by the small amount of data available. In future, we may gather a larger dataset that could help create more generalized embeddings, and we may also extend the dataset to speakers of different languages.

A. Appendix

A.1. Neural Model Details

We used the following neural models. The batch size was kept at 10. The last dense layer of each model uses a Softmax activation function; the other dense layers use a rectified linear activation function. A Keras rendering of the A.1.1 specification is sketched after this appendix.

A.1.1. CNN Model

Each CNN-1D layer in brackets represents (number-of-filters, kernel-size):

CNN-1D(8,3) → CNN-1D(10,3) → MaxPool-1D(3) → CNN-1D(12,3) → CNN-1D(14,3) → MaxPool-1D(3) → Flatten() → Dense(20, ReLU) → Dense(10, ReLU) → Dense(2, Softmax)

A.1.2. BLSTM Model

Each LSTM layer in brackets represents (number-of-LSTM-cells-in-that-layer):

Bidir(LSTM(16)) → Dropout(0.3) → Bidir(LSTM(8)) → Bidir(LSTM(4)) → Bidir(LSTM(2)) → Dropout(0.2) → Dense(8) → Dense(2, Softmax)

A.1.3. CNN+BLSTM Model

CNN-1D(8,3) → CNN-1D(10,3) → MaxPool-1D(3) → CNN-1D(16,3) → CNN-1D(20,3) → MaxPool-1D(3) → Bidir(LSTM(8)) → BatchNorm() → Bidir(LSTM(16)) → Dense(64, ReLU) → Dense(32, ReLU) → Dense(2, Softmax)
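For reference, the A.1.1 specification translates line-for-line into Keras as follows; this is our reading of the notation (a sketch), with Tanh on the convolutions and ReLU on the dense layers per Section 3.3.1. The L2 kernel regularizer of Section 3.4 is omitted here for brevity.

    from tensorflow.keras import Sequential, layers

    cnn = Sequential([
        layers.Conv1D(8, 3, activation="tanh", input_shape=(500, 300)),
        layers.Conv1D(10, 3, activation="tanh"),
        layers.MaxPooling1D(3),
        layers.Conv1D(12, 3, activation="tanh"),
        layers.Conv1D(14, 3, activation="tanh"),
        layers.MaxPooling1D(3),
        layers.Flatten(),
        layers.Dense(20, activation="relu"),
        layers.Dense(10, activation="relu"),
        layers.Dense(2, activation="softmax"),  # Healthy vs. Alzheimer's
    ])
    cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])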
References

[1] M. W. Bondi, E. C. Edmonds, D. P. Salmon, Alzheimer's disease: past, present, and future, Journal of the International Neuropsychological Society 23 (2017) 818-831.
[2] S. Karlekar, T. Niu, M. Bansal, Detecting linguistic characteristics of Alzheimer's dementia by interpreting neural models, arXiv preprint arXiv:1804.06440 (2018).
[3] W. Kong, H. Jang, G. Carenini, T. Field, A neural model for predicting dementia from language, in: Machine Learning for Healthcare Conference, 2019, pp. 270-286.
[4] S. O. Orimaye, J. S.-M. Wong, K. J. Golden, Learning predictive linguistic features for Alzheimer's disease and related dementias using verbal utterances, in: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, 2014, pp. 78-87.
[5] S. Al-Hameed, M. Benaissa, H. Christensen, Simple and robust audio-based detection of biomarkers for Alzheimer's disease, in: 7th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT), 2016, pp. 32-36.
[6] V. Masrani, Detecting dementia from written and spoken language, Ph.D. thesis, University of British Columbia, 2018.
[7] M. Yancheva, F. Rudzicz, Vector-space topic models for detecting Alzheimer's disease, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016, pp. 2337-2346.
[8] C. Guinn, A. Habash, Language analysis of speakers with dementia of the Alzheimer's type, in: AAAI Fall Symposium Series, 2012, pp. 8-13.
[9] S. O. Orimaye, J. S.-M. Wong, K. J. Golden, C. P. Wong, I. N. Soyiri, Predicting probable Alzheimer's disease using linguistic deficits and biomarkers, BMC Bioinformatics 18 (2017).
[10] S. O. Orimaye, J. S.-M. Wong, C. P. Wong, Deep language space neural network for classifying mild cognitive impairment and Alzheimer-type dementia, PLoS ONE 13 (2018).
[11] Y. Pan, B. Mirheidari, M. Reuber, A. Venneri, D. Blackburn, H. Christensen, Automatic hierarchical attention neural network for detecting AD, in: Proceedings of Interspeech, 2019, pp. 4105-4109. doi:10.21437/Interspeech.2019-1799.
[12] A. Balagopalan, B. Eyre, F. Rudzicz, J. Novikova, To BERT or not to BERT: Comparing speech and language-based approaches for Alzheimer's disease detection, 2020. arXiv:2008.01551.
[13] J. T. Becker, F. Boller, O. L. Lopez, J. Saxton, K. McGonigle, The natural history of Alzheimer's disease: Description of study cohort and accuracy of diagnosis, Archives of Neurology 51 (1994) 585-594.
[14] B. MacWhinney, The CHILDES project: tools for analyzing talk, Child Language Teaching and Therapy 8 (2000).
[15] J. L. Lee, R. Burkholder, G. B. Flinn, E. R. Coppess, Working with CHAT transcripts in Python, Technical Report TR-2016-02, Department of Computer Science, University of Chicago, 2016.
[16] T. Mikolov, et al., Efficient estimation of word representations in vector space, in: Proceedings of the International Conference on Learning Representations, 2013.
[17] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.
[18] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv:1607.04606 (2016).
[19] R. Rehurek, P. Sojka, Software framework for topic modelling with large corpora, in: Proceedings of the International Workshop on New Challenges for NLP Frameworks, 2010.
[20] Keras: Deep learning for humans, https://github.com/fchollet/keras, 2015. Last accessed Nov 2019.
[21] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85-117.
[22] A. F. Agarap, Deep learning using rectified linear units (ReLU), arXiv:1803.08375 (2018).
[23] N. Srivastava, et al., Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929-1958.
[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv:1412.6980 (2014).
[25] L. Hebert, P. A. Scherr, J. L. Bienias, D. A. Bennett, D. A. Evans, Alzheimer disease in the US population: prevalence estimates using the 2000 census, Archives of Neurology 60 (2003) 1119-1122.