Cataplexy Detection: Neurologists, You Are Not Alone! (Discussion Paper)

Ilaria Bartolini, Andrea Di Luzio
Department of Computer Science and Engineering (DISI), Alma Mater Studiorum, University of Bologna, Italy

Abstract
Narcolepsy with cataplexy is a severe lifelong disorder characterized, among other symptoms, by the sudden loss of bilateral facial muscle tone triggered by emotions (cataplexy). In this extended abstract, we present two methodologies for the automatic analysis of patients' videos able to assist neurologists in diagnosing the disease and/or detecting attacks. Indeed, recent findings demonstrated that the detection of abnormal motor behaviors in video recordings of patients undergoing emotional stimulation is effective in characterizing the disease symptoms. Such motor behaviors (ptosis, mouth opening, head drop), however, still have to be discovered by neurologists through manual inspection of patients' videos. Automatic content-based video analysis is clearly of immediate help here. Experimental results conducted on real data support the effectiveness of the presented automated techniques.

Keywords
Video-based classification of cataplexy, automatic video content analysis, motor behavior patterns, data analysis for health

1. Introduction

Narcolepsy with cataplexy is a rare disorder, mainly arising in young adults/children, characterized by daytime sleepiness, sudden loss of muscle tone while awake triggered by emotional stimuli (cataplexy), hallucinations, sleep paralysis, and disturbed nocturnal sleep [13]. A recent approach for the detection of the disease is based on the analysis of video recordings of patients undergoing emotional stimulation made on-site by medical specialists [16]. According to this methodology, cataplexy is present if any of three abnormal motor behaviors is detected in the patient video: ptosis (a drooping or falling of the upper eyelid), head drop, and smile/mouth opening [13]. Such patterns, however, still have to be manually detected by neurologists through visual inspection of videos, due to the complete absence of automatic technological solutions able to properly support neurologists in such a delicate task.

It is evident that a tool able to detect the “correct” facial expression changes (i.e., the disease symptoms) from video recordings of patients would be able to automatically identify the presence of the disease. This could be extremely helpful, not only to support neurologists in diagnosing the disease, but also to monitor everyday activities in a non-invasive way and provide early warnings in the event of the insurgence of a crisis. Indeed, it is well known that the synergistic use of Machine Learning (ML) techniques can help alleviate the burden on medical specialists when analyzing patient data, thus improving diagnostic consistency and accuracy [10].
In [2], we introduced the CAT-CAD (Computer-Aided Diagnosis for CATaplexy) tool, which exploits ML techniques for the automatic analysis of video recordings made on patients undergoing emotional stimulation through the viewing of funny movies designed to evoke laughter. By means of a user-friendly GUI, CAT-CAD effectively supports neurologists with (1) the automatic detection of disease symptoms, and thus disease recognition/monitoring, and (2) advanced functionalities for video playback and browsing/retrieval. CAT-CAD is the first tool to allow the automatic recognition of cataplexy symptoms based on the analysis of patients' video recordings.

In this extended abstract, we report details on the video analyzer for the automatic detection of cataplexy symptoms. This component of the CAT-CAD system is built on top of SHIATSU, a general and extensible framework for video retrieval based on the (semi-)automatic hierarchical semantic annotation of videos exploiting the analysis of their visual content [3], and exploits features managed through the Windsurf software library [4].

After reviewing some background information, we detail the methodologies used to automatically analyze videos: a pattern-based technique able to recognize facial patterns and a novel approach based on convolutional neural networks (Section 2). Finally, we provide results obtained from an extensive experimental evaluation comparing the performance of the two video analysis approaches on a benchmark containing recordings from real patients (Section 3) and conclude (Section 4).

1.1. Background and Related Work

Narcolepsy with cataplexy usually arises in adolescence or young adulthood, but the diagnosis is typically established only after a long period, with a mean delay (across Europe) from symptom onset to diagnosis of 14 years [12]. The diagnostic delay is due not only to the failure to recognize the symptoms of the disease, but also to the misinterpretation of cataplexy phenomena as the expression of other disorders, such as episodes of loss of consciousness of epileptic nature, strength reductions due to neuromuscular disorders, or behavioral disorders of psychiatric or neuropsychiatric relevance in childhood.

Few scientific studies have considered the video-polygraphic features of cataplexy in adults, and only recently has the motor phenotype of childhood cataplexy been described exclusively using video recordings of attacks evoked by watching funny cartoons [13]. These studies showed that, within the physiological response to laughter, the distinctive elements of cataplexy, called motor behavior patterns, are particularly evident at the level of facial expression changes. In particular, the three most recurrent motor phenomena (often displayed by patients affected by the disease) are ptosis, head drop, and smile/mouth opening [13].

To the best of our knowledge, CAT-CAD is the first study on the automatic recognition of cataplexy from patient video recordings. However, the automatic detection of facial motor phenomena similar to those used to diagnose cataplexy is common in other contexts. For example, the detection of eyelid closure, head pose, or mouth opening is useful for the automatic recognition of fatigue/drowsiness in vehicle drivers [11, 9, 6].
The “verbatim” use of such techniques in the context of cataplexy diagnosis is however inappropriate, since the peculiar motor patterns are somewhat different, even if they can be detected using similar facial features.

2. The CAT-CAD Video Analyzer

The core of the CAT-CAD tool is the real-time analysis of patients' videos to detect the presence of disease symptoms (i.e., ptosis, head drop, and smile/mouth opening). Two different approaches were developed to perform video analysis, with the idea of comparing their relative performance and possibly combining them to achieve the best possible result in recognizing the different motor phenomena:

• The Pattern-Based approach (Section 2.1) is built on the automatic detection of facial features in video frames.
• The Deep Learning approach (Section 2.2) exploits three convolutional neural networks, each trained to detect a specific motor phenomenon.

2.1. Video Analyzer: Pattern-Based Approach

The first video analyzer implemented in CAT-CAD for the detection of cataplexy motor phenomena exploits facial landmarks, as detected by OpenFace [1]. The first step of the pattern characterization process consists in detecting and extracting the patients' facial landmarks of interest from each video frame; this is necessary because it is safe to assume that different patients have different facial features. By applying OpenFace to each video frame, we thus extract, for each video, a time series of multi-dimensional feature vectors, each vector characterizing the facial landmarks extracted from a single frame.

2.1.1. Ptosis

Ptosis is a drooping or falling of the upper eyelid. It should not, however, be mistaken for a (regular) eye blink: ptosis is therefore detected whenever the eyes stay closed for a period of time longer than a typical blink. For each frame, a 12-dimensional feature vector is extracted, containing the (x, y) coordinates of six landmarks characterizing the shape of the eye. The Eye Aspect Ratio (EAR) is then defined as the ratio of the eye height to the eye width (averaged over the left and right eye). EAR is partially invariant to head pose and fully invariant to uniform image scaling and in-plane face rotation. The semantics of EAR are as follows: when an eye is closing, EAR approaches zero, whereas when the eye is completely open, EAR attains its maximum value (which varies from person to person). Therefore, we establish the presence or absence of ptosis by measuring the length of the sub-sequence of consecutive frames with closed eyes (EAR lower than a threshold): a “long enough” sequence indicates ptosis.
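To make the ptosis detector concrete, the following Python sketch illustrates the EAR computation and the duration-based filtering described above. It assumes the six (x, y) eye landmarks per frame are already available (e.g., extracted with OpenFace); the landmark ordering, the EAR threshold, and the minimum closure duration are illustrative assumptions, not the actual CAT-CAD parameters (Section 3.1 explains how the real thresholds are chosen).

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """EAR for one eye, given six (x, y) landmarks in the usual
    6-point eye model p1..p6, with p1/p4 the eye corners.
    Height/width ratio, averaging the two vertical segments
    (illustrative formulation)."""
    width = np.linalg.norm(eye[0] - eye[3])            # p1-p4
    height = (np.linalg.norm(eye[1] - eye[5]) +        # p2-p6
              np.linalg.norm(eye[2] - eye[4])) / 2.0   # p3-p5
    return height / width

def detect_ptosis(frames_landmarks, ear_threshold=0.2,
                  fps=30, min_closed_s=0.4):
    """Flag frames belonging to a 'long enough' run of closed eyes.
    frames_landmarks: one (left_eye, right_eye) landmark pair per
    frame. Threshold and duration values are illustrative."""
    min_run = int(min_closed_s * fps)  # longer than a typical blink
    closed = [(eye_aspect_ratio(l) + eye_aspect_ratio(r)) / 2.0
              < ear_threshold
              for l, r in frames_landmarks]
    flags = [False] * len(closed)
    run_start = None
    for i, c in enumerate(closed + [False]):   # sentinel flushes last run
        if c and run_start is None:
            run_start = i
        elif not c and run_start is not None:
            if i - run_start >= min_run:       # not a regular blink
                flags[run_start:i] = [True] * (i - run_start)
            run_start = None
    return flags
```

The duration filter is what separates ptosis from a regular blink: short closed-eye runs are discarded, and only runs longer than min_closed_s are flagged as symptomatic.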
2.1.2. Head Drop

For head drop, an 8-D feature vector is extracted from each frame, containing the (x, y) coordinates of three landmarks (the external corner of each eye and the point immediately below the tip of the nose) together with the rotation of the head around the X and Z axes. The Center of Gravity (CoG) of the three landmarks is then used to measure the rotation around the Y axis. Head drop is detected whenever the rotation around any of the three axes exceeds a threshold.

2.1.3. Smile/Mouth Opening

For the third motor phenomenon, an 8-D feature vector is extracted from each frame, containing the (x, y) coordinates of four landmarks characterizing the shape of the mouth. The Mouth Aspect Ratio (MAR) is then defined as the ratio of the mouth width to the mouth height. Like EAR, MAR is partially invariant to head pose and fully invariant to uniform image scaling and in-plane face rotation. When the mouth is closed, MAR attains its maximum value (which varies from person to person), whereas when the mouth is completely open, MAR reaches its lowest value; intermediate values characterize various types of smile. We thus consider cataplectic mouth opening to be present if the current MAR is lower than a threshold, indicating that the patient is smiling widely or opening her mouth.
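In the same spirit as the ptosis sketch above, the following is a minimal sketch of the head-drop and mouth-opening checks just described. The axis conventions, the use of the CoG displacement (normalized by the inter-ocular distance) as a proxy for the Y-axis rotation, and all thresholds are assumptions for illustration only, not the CAT-CAD implementation.

```python
import numpy as np

def mouth_aspect_ratio(mouth: np.ndarray) -> float:
    """MAR from four (x, y) mouth landmarks, assumed ordered as the
    two horizontal corners followed by the top/bottom mid-lip points.
    Width/height, so a wide-open mouth yields a small value."""
    width = np.linalg.norm(mouth[0] - mouth[1])
    height = np.linalg.norm(mouth[2] - mouth[3])
    return width / height

def head_drop(frame_feats, thresholds=(0.2, 0.2, 0.2)):
    """frame_feats: dict with the head rotations around X and Z
    (e.g., from OpenFace head pose, in radians) and the three
    landmarks used for the CoG. The rest-position normalization
    used as a Y-rotation proxy is a hypothetical choice."""
    rot_x, rot_z = frame_feats["rot_x"], frame_feats["rot_z"]
    cog = frame_feats["landmarks"].mean(axis=0)  # center of gravity
    # Horizontal CoG displacement w.r.t. a per-patient rest position,
    # normalized by the inter-ocular distance.
    rot_y = (cog[0] - frame_feats["rest_cog_x"]) / frame_feats["eye_dist"]
    return (abs(rot_x) > thresholds[0] or
            abs(rot_y) > thresholds[1] or
            abs(rot_z) > thresholds[2])

def mouth_opening(mouth, mar_threshold=1.5):
    """Cataplectic smile/mouth opening: MAR below a threshold
    (illustrative value)."""
    return mouth_aspect_ratio(mouth) < mar_threshold
```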
2.2. Video Analyzer: Deep Learning Approach

The alternative video analysis tool is based on convolutional neural networks (CNNs). The CNN architecture used in this work is based on the DeXpression network [7], which achieves excellent performance in expression recognition, and has been implemented using TensorFlow (https://www.tensorflow.org/). Our CNN architecture consists of three different types of blocks:

1. an Input Block, which performs image pre-processing,
2. a Feature Extraction Block, inspired by the architectural principles introduced by GoogLeNet [15], which is repeated four times, and
3. an Output Block, which produces the result class from the features extracted by the previous layers.

We trained three different networks, one for each motor phenomenon to be recognized. The three networks share the same architecture, but the learned weights clearly differ, due to the use of different training classes. Note that, with this approach, each frame is analyzed per se: contrary to the pattern-based approach, no information on frame sequences, such as the duration of an eyelid closure or of a head drop, can be extracted. The three neural networks were trained for 8 epochs each, for a total time of about 12 hours.

For each video frame, the face of the patient is first detected by means of OpenFace and then cropped. The resulting image is converted to grayscale and downsized to 320 × 320 pixels. Cropping the images is necessary in order to provide the CNN with face details only (thus preventing the surrounding environment from distracting the learning).
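As an illustration of this pipeline, the following sketch assembles the three block types with the Keras API bundled with TensorFlow, together with a hypothetical preprocessing helper (the face bounding box is assumed to come from the OpenFace detection step). The filter counts and layer parameters are illustrative: the exact DeXpression [7]/CAT-CAD configuration is not reproduced here.

```python
import cv2
import tensorflow as tf
from tensorflow.keras import layers, Model

def preprocess(frame, bbox):
    """Crop the detected face, convert to grayscale, downsize to
    320x320 (bbox is a hypothetical (x, y, w, h) from OpenFace)."""
    x, y, w, h = bbox
    face = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (320, 320))[..., None] / 255.0

def feat_ex_block(x, filters):
    """Feature Extraction Block with two parallel paths, in the
    spirit of GoogLeNet-style modules [15]; sizes are illustrative."""
    a = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    a = layers.Conv2D(filters, 3, padding="same", activation="relu")(a)
    b = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b = layers.Conv2D(filters, 1, padding="same", activation="relu")(b)
    x = layers.Concatenate()([a, b])
    return layers.MaxPooling2D(2)(x)

def build_detector(input_shape=(320, 320, 1)):
    """One binary detector (symptom present/absent); three such
    networks are trained, one per motor phenomenon."""
    inp = layers.Input(shape=input_shape)
    # Input Block: initial convolution + pooling on the face image.
    x = layers.Conv2D(64, 7, strides=2, padding="same",
                      activation="relu")(inp)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    # Feature Extraction Block, repeated four times.
    for filters in (64, 96, 128, 160):
        x = feat_ex_block(x, filters)
    # Output Block: produce the result class.
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(2, activation="softmax")(x)
    return Model(inp, out)

model = build_detector()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```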
3. Experimental Evaluation

The benchmark used for our experimental evaluation consists of a population of patients of the Outpatient Clinic for Narcolepsy of the University of Bologna, who were assessed for the presence of cataplexy by way of a neurophysiological and biological diagnosis [16]. The first (experimental) group of patients includes 14 subjects displaying symptoms of the disease. Training of the video analyzers was performed using an inter-patient separation scheme, where patients were randomly assigned to non-overlapping training and test sets, respecting the sex and age distribution. In particular, 11 patients were included in the training set (thus, their entire videos were used to train each analyzer), while the remaining 3 patients were used to test the accuracy of the tool. The second group includes 44 different subjects showing no sign of the disease. Among those, 14 patients were selected as a control group, so as to follow the same sex and age distribution of the experimental group.

For the deep learning approach, data augmentation was performed by adding, to each training set frame, seven additional images obtained through: (i) 3 rotations with a random angle between −45° and +45°, (ii) 3 translations with a random shift between −50 and 50 pixels, and (iii) 1 horizontal flip. The final training sets consist of 191,140 labeled images for ptosis, 61,216 labeled images for head drop, and 108,196 labeled images for mouth opening.

3.1. Performance Measures

To objectively evaluate the performance of our analyzers, each frame can be labeled according to a confusion matrix as correctly or incorrectly recognized for each of the two available classes (in our case, motor phenomenon actually present or absent). The four possible outcomes are tp (true positive, a frame where the symptom is correctly detected as present), fn (false negative, symptom wrongly not detected), tn (true negative, symptom correctly not detected), and fp (false positive, symptom wrongly detected as present). From the confusion matrix, the performance measures used in our experiments are defined as follows.

Recall/Sensitivity (R) is defined as the fraction of the frames showing signs of the disease (positives) that are correctly identified: R = tp / (tp + fn). R is therefore used to measure the accuracy of a technique in recognizing the presence of the disease.

Specificity (S) is the fraction of frames not showing the disease (negatives) that are correctly classified: S = tn / (tn + fp). S thus expresses the ability of a technique to avoid false alarms (which can lead to expensive/invasive exams).

Precision (P) is another popular metric, besides R and S, which are the fundamental prevalence-independent statistics. P is defined as the fraction of positively classified frames that are correct, and assesses the predictive power of the classifier: P = tp / (tp + fp).

Accuracy (A) measures the fraction of correct decisions, so as to assess the overall effectiveness of the algorithm: A = (tp + tn) / (tp + fp + fn + tn).

Balanced Score (F1) is a commonly used measure combining P and R in a single metric, computed as their harmonic mean: F1 = 2 / (1/P + 1/R) = 2·tp / (2·tp + fp + fn).

For the pattern-based approach, the thresholds used for the detection of ptosis, head drop, and mouth opening were chosen as the ones providing the best classification performance on the test set [8]. To this end, a Receiver Operating Characteristic (ROC) graph is used for each threshold, and the threshold value maximizing the harmonic mean of the R and S measures is chosen as the optimal one: this represents a suitable metric for imbalanced classification, seeking an equilibrium between the two measures [14].
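This threshold-selection procedure can be sketched as follows, assuming per-frame detector scores (e.g., EAR values) and ground-truth labels are available as arrays. The candidate grid and the "lower score indicates the symptom" convention (which holds for EAR and MAR, but would be reversed for rotation angles) are illustrative.

```python
import numpy as np

def confusion_counts(predicted, actual):
    """Per-frame confusion-matrix counts from boolean arrays."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    tp = np.sum(predicted & actual)
    fp = np.sum(predicted & ~actual)
    fn = np.sum(~predicted & actual)
    tn = np.sum(~predicted & ~actual)
    return tp, fp, fn, tn

def recall_specificity(tp, fp, fn, tn):
    r = tp / (tp + fn) if tp + fn else 0.0
    s = tn / (tn + fp) if tn + fp else 0.0
    return r, s

def select_threshold(scores, actual, candidates):
    """Sweep candidate thresholds (each corresponding to one point
    of the ROC curve) and keep the one maximizing the harmonic mean
    of R and S. A frame is classified positive when its score falls
    below the threshold."""
    best_t, best_hm = None, -1.0
    for t in candidates:
        tp, fp, fn, tn = confusion_counts(scores < t, actual)
        r, s = recall_specificity(tp, fp, fn, tn)
        hm = 2 * r * s / (r + s) if r + s else 0.0
        if hm > best_hm:
            best_t, best_hm = t, hm
    return best_t, best_hm
```

For instance, select_threshold(ear_values, ptosis_labels, np.linspace(0.05, 0.4, 50)) would return the EAR threshold balancing sensitivity against false alarms on the given frames (the grid bounds here are purely illustrative).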
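Separately, the data augmentation step described at the beginning of this section admits a simple sketch. The paper does not specify the image-processing library used, so the choice of scipy.ndimage, as well as the interpolation and border-filling settings, are assumptions.

```python
import numpy as np
from scipy import ndimage as ndi  # illustrative library choice

def augment(face):
    """Produce the seven additional training images for one cropped
    grayscale face (2-D array): 3 random rotations in [-45, +45]
    degrees, 3 random translations in [-50, +50] pixels, and 1
    horizontal flip."""
    out = []
    for _ in range(3):
        angle = np.random.uniform(-45, 45)
        out.append(ndi.rotate(face, angle, reshape=False, mode="nearest"))
    for _ in range(3):
        shift = np.random.uniform(-50, 50, size=2)  # (dy, dx) shift
        out.append(ndi.shift(face, shift, mode="nearest"))
    out.append(np.fliplr(face))  # horizontal flip
    return out
```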
3.2. Overall Performance

Tables 1 and 2 show the performance of the proposed classification techniques over cataplectic and non-cataplectic patients, respectively; the reported values were obtained by averaging individual values weighted by the recordings' length. The tables report classification results for the detection of the three symptoms as well as for overall cataplexy which, we recall, is recognized as present in a frame whenever any of the three motor patterns is detected. To compare the performance of the two alternative video analyzers, the best value obtained for each of the five considered performance measures (specificity only for the control group) is marked with an asterisk (*).

                          pattern-based                          deep learning
motor phenomenon     R      S      P      A      F1        R      S      P      A      F1
ptosis             *0.72  *0.84  *0.82  *0.78  *0.77      0.71   0.67   0.68   0.69   0.70
mouth opening      *0.78   0.76   0.75  *0.76  *0.77      0.72  *0.81  *0.79  *0.76   0.75
head drop           0.60  *0.94  *0.89  *0.77  *0.72     *0.67   0.81   0.78   0.74  *0.72
overall            *0.75  *0.79  *0.79  *0.77  *0.77      0.70   0.74   0.73   0.72   0.71

Table 1: Performance of the proposed approaches for cataplectic patients.

motor phenomenon     pattern-based (S)     deep learning (S)
ptosis                     0.99                  0.83
mouth opening              0.98                  0.83
head drop                  0.99                  0.81
overall                    0.98                  0.66

Table 2: Specificity of the proposed approaches for non-cataplectic subjects.

The above results lead us to draw the following considerations:

• The pattern-based approach leads to significantly better results than its deep learning counterpart. In particular, for cataplectic subjects the former attains the best performance in 85% of the metrics (17 out of 4 × 5 = 20 performance measures).
• When considering specific motor phenomena, the pattern-based approach consistently outperforms the deep learning approach in detecting ptosis, while the latter exhibits superior measures only for specificity and precision in detecting mouth opening and for recall in head drop detection.
• The superior specificity of the pattern-based technique is confirmed on non-cataplectic subjects, with an overall specificity of 98%.

A possible explanation for the inferior performance of the deep learning approach is that it cannot discriminate between quick and long eye blinks/head drops, since each frame is analyzed individually by the CNN. It is therefore likely that the higher number of false positives is due to the CNN wrongly detecting “regular” eye blinks or head movements as ptosis or head drop.

For non-cataplectic subjects, it is interesting to note that the performance of the deep learning approach for the overall detection of cataplexy is noticeably worse than that attained for the single motor phenomena. For such patients, false positives for ptosis, head drop, and mouth opening occur in different frames. Indeed, due to the absence of positive cases, the set of false positive frames for overall cataplexy coincides with the union of the frames wrongly classified by any specific motor phenomenon detector.

Finally, we include a brief discussion of the efficiency of the proposed techniques. On our experimental setup, which involved a commodity (low-end) machine, we were able to extract the EAR, CoG, and MAR descriptors in real time for each video frame. Since this is the most time-consuming operation of the pattern-based approach, the whole automatic detection process can be performed online during a single emotional-stimulation video recording session. On the other hand, our current implementation of the deep learning approach only achieves a throughput of 18.5 frames/s, and is thus unable to attain real-time performance (recall that the frame rate of the videos is 30 frames/s). The breakdown is the following: when analyzing a single frame, about 50% of the time is spent detecting the position of the patient's face, about 25% cropping the image (retaining only the face), and 25% classifying the frame with the three neural networks. The bottleneck of the whole computation is clearly the face detection phase, which we implemented using the OpenFace library instead of other, faster methods (such as the well-known Haar cascade filter). This choice stems from the observation that quicker filters often fail to identify the face within the image, especially in videos with excessive head movement, which is the common case for cataplectic subjects.

4. Conclusions

In this extended abstract, we reported details on the video analyzer of CAT-CAD for the automatic detection of cataplexy symptoms (ptosis, head drop, and smile/mouth opening). Two different approaches were introduced for the detection of disease symptoms: the Pattern-Based approach is based on the analysis of facial features, using the OpenFace framework, while the Deep Learning approach uses CNNs, as implemented in TensorFlow. An extensive comparative experimental evaluation conducted on a benchmark of real patient recordings demonstrated the accuracy of the proposed techniques.

When comparing the effectiveness of the two video analyzers in detecting cataplexy symptoms, the pattern-based approach achieves superior performance. One possible explanation for the inferior detection accuracy of the deep learning approach is that 2D CNNs are unable to properly take into account the temporal dimension that correlates subsequent frames in a video. The use of 3D CNNs could be an interesting avenue to pursue, and we plan to consider their inclusion in CAT-CAD.

References

[1] T. Baltrušaitis, A. Zadeh, Y. C. Lim, L.-P. Morency, OpenFace 2.0: Facial Behavior Analysis Toolkit, in: Proceedings of FG 2018, Xi'an, China, May 2018.
[2] I. Bartolini, A. Di Luzio, CAT-CAD: A Computer-Aided Diagnosis Tool for Cataplexy, Computers, 2021, 10(4).
[3] I. Bartolini, M. Patella, C. Romani, SHIATSU: Tagging and Retrieving Videos Without Worries, Multimedia Tools and Applications, 2013, 63(2).
[4] I. Bartolini, M. Patella, G. Stromei, The Windsurf Library for the Efficient Retrieval of Multimedia Hierarchical Data, in: Proceedings of SIGMAP 2011, Seville, Spain, July 2011.
[5] I. Bartolini et al., Automatic Detection of Cataplexy, Sleep Medicine, 2018, 52.
[6] L. M. Bergasa et al., Analysing Driver's Attention Level Using Computer Vision, in: Proceedings of ITSC 2008, Beijing, China, June 2008.
[7] P. Burkert et al., DeXpression: Deep Convolutional Neural Network for Expression Recognition, arXiv, September 2015.
[8] T. Fawcett, An Introduction to ROC Analysis, Pattern Recognition Letters, 2006, 27(8).
[9] J. Jo, H. G. Jung, K. R. Park, J. Kim, Vision-Based Method for Detecting Driver Drowsiness and Distraction in Driver Monitoring System, Optical Engineering, 2011, 50(12).
[10] L. Lazli, M. Boukadoum, O. A. Mohamed, A Survey on Computer-Aided Diagnosis of Brain Disorders through MRI Based on Machine Learning and Data Mining Methodologies with an Emphasis on Alzheimer Disease Diagnosis and the Contribution of the Multimodal Fusion, Applied Sciences, 2020, 10.
[11] B. Mandal, L. Li, G. S. Wang, J. Lin, Towards Detection of Bus Driver Fatigue Based on Robust Visual Analysis of Eye State, IEEE Transactions on Intelligent Transportation Systems, 2017, 18(3).
[12] F. Pizza et al., Clinical and Polysomnographic Course of Childhood Narcolepsy with Cataplexy, Brain, 2013, 136(12).
[13] G. Plazzi et al., Complex Movement Disorders at Disease Onset in Childhood Narcolepsy with Cataplexy, Brain: A Journal of Neurology, 2011, 134(12).
[14] F. Provost, Machine Learning from Imbalanced Data Sets 101, in: Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets, Austin, TX, July 2000.
[15] C. Szegedy et al., Going Deeper with Convolutions, in: Proceedings of CVPR 2015, Boston, MA, June 2015.
[16] S. Vandi et al., A Standardized Test to Document Cataplexy, Sleep Medicine, 2019, 53.