An Explainable Convolutional Neural Network for the Detection of Drug Abuse Giulia Tufo1,*,† , Meriam Zribi1,† , Paolo Pagliuca2,† and Francesca Pitolli1 1 Department of Basic and Applied Sciences for Engineering, Università degli Studi Roma La Sapienza, Via Antonio Scarpa 14, Roma 2 Institute of Cognitive Sciences and Technologies, National Research Council (CNR), Via Gian Domenico Romagnosi 18/A, Roma Abstract The spread of Artificial Intelligence methods in many contexts is undeniable. Different models have been proposed and applied to real-world applications in sectors like economy, industry, medicine, healthcare and sports. Nevertheless, the reasons of why such techniques work are not investigated in depth, thus posing questions about explainability, transparency and trust. In this work, we introduce a novel Deep Learning approach for the problem of drug abuse detection. Specifically, we design a Convolutional Neural Network model analyzing lateral-flow tests and discriminating between normal and abnormal assays. Moreover, we provide evidence regarding the attributes that enable our model to address the considered task, aiming to identify which parts of the input exert a significant influence on the network’s output. This understanding is crucial for applying our methodology in real-world scenarios. The results obtained demonstrate the validity of our approach. In particular, the proposed model achieves an excellent accuracy in the classification of the lateral-flow tests and outperforms two state-of-the-art deep networks. Additionally, we provide supporting data for the model’s explainability, ensuring a precise understanding of the relationship between attributes and output, a key factor in comprehending the internal workings of the neural network. Keywords Drug abuse detection, Lateral-flow tests, Explainability, Convolutional Neural Networks 1. Introduction Artificial Intelligence (AI) is a field in considerable and continuous expansion that is part of our lives and has spread to many sectors, like economy [1], industry [2], sports [3], medicine [4, 5] and healthcare [6]. Focusing on these two latter fields, AI provides a valid support for helping doctors and other professionals to make diagnosis [7] and predictions [8], explain and analyze medical data [9, 10]. Moreover, the use of assistive robots in rehabilitation and elderly monitoring is widespread nowadays [11, 12]. EXPLIMED - First Workshop on Explainable Artificial Intelligence for the medical domain - 19-20 October 2024, Santiago de Compostela, Spain * Corresponding author. † These authors contributed equally. $ giulia.tufo@uniroma1.it (G. Tufo); meriam.zribi@uniroma1.it (M. Zribi); paolo.pagliuca@istc.cnr.it (P. Pagliuca); francesca.pitolli@uniroma1.it (F. Pitolli)  0009-0003-8187-274X (G. Tufo); 0009-0003-6232-3745 (M. Zribi); 0000-0002-3780-3347 (P. Pagliuca); 0000-0002-7159-0533 (F. Pitolli) © 2024 Copyright © 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings Figure 1: Example of a lateral flow test for drug abuse detection. Figure 2: Adulterant guide chart. For each adulterant, the list of colors identifying normality and abnormality of the test is provided. Image taken from [13]. In particular, AI is a unique tool for analyzing huge amounts of data effectively and on time [14]. Human evaluation, constrained by various factors such as subjectivity, limited computational capacity, past and personal experiences, fatigue, stress, and data quality (such as image resolution and/or lighting conditions), may be prone to generate inaccurate predictions and/or faults. Especially in medicine and healthcare, the error minimization is paramount, since it might affect the diagnosis of potential diseases, prompt interventions, therapies for rehabilitation and other aspects. A largely applied approach in the medical data analysis and classification relies on the use of Convolutional Neural Networks (CNNs) [15–17]. CNNs enable the analysis of broad datasets containing thousands of data faster than human operators. Notwithstanding the abundance of the examples, a major concern is the lack of explainability of some models proposed in the literature, posing a challenge that even involves the developers themselves. This turns out to be critical in the fields of medicine and healthcare, where the use of explainable AI approaches is pivotal [18–20]. In this work, we analyzed the issue of detecting the presence of substances/drugs in rapid lateral-flow tests (Fig. 1) [21]. Similar works investigating this topic are those reported in [22–24]. In particular, in [23] the authors propose an image processing algorithm combined with a Least Squares Support Vector Machine (LS-SVM) to investigate pH indicator paper assays. Their approach achieve excellent performances in terms of accuracy. The analysis of the test results is generally made by human operators. As we stated above, the interpretation of the test is affected by factors like the subjectivity of the person and/or her/his physical and mental conditions, with consequent possible errors. Instead, we propose a novel Computer Vision (CV) approach based on the use of a deep CNN. Specifically, we employed the model introduced in [25, 26] with the addition of pooling layers [27] in the convolutional part of the network. The use of pooling allows us to reduce the complexity of the problem without losing accuracy. Indeed, pooling helps the model to become invariant to small translations of the input [27]. The model must distinguish between normal and abnormal results in lateral-flow tests analyzed for the detection of drug abuse. The primary goal of the model is to verify the suitability of the sample to ensure it has not been compromised in any way. Once the sample’s suitability is confirmed, the analysis can proceed to investigate the presence of narcotic substances. The cartridge undergoes a color change upon contact with the human biological sample (urine). Based on the detected color gradation, it can be concluded whether the sample has been adulterated or not. A test is considered as “abnormal” if any of the adulterants is not compliant with the corresponding guide (Fig. 2). While the detection of strips in rapid check tests with Deep Learning techniques has already been addressed in the literature [22, 28–30], to our knowledge the use of a CNN model to verify that the biological sample is indeed urine and has not been tampered with in the adulterant section of lateral-flow tests has not been investigated. Collected results indicate the validity of our approach: the proposed model manages to discriminate between normal and abnormal tests. Moreover, we discuss the reasons of why such model is effective, thus providing evidence of its explainability, which represents a paramount property to successfully apply the methodology in real-world scenarios. The main contributions of our work can be summarized as follows: • we propose a novel Deep Learning (DL) approach to address the issues related to the visual inspection of lateral-flow tests, which are generally examined by human operators, with a focus on the adulterant section of the assay; • we apply the recently proposed ConvNet3_4 model [25, 26] to discriminate between normal and abnormal tests; • we achieve an excellent classification capability proving the validity of the model; • we compare the model with two state-of-the-art deep networks and we demonstrate the superiority of our approach; • we provide a thorough analysis of the relevant features extracted by the model in order to associate the proper output class to each image. The remainder of the article is structured as follows: section 2 contains a description of the methodology we applied with respect to the considered problem. Results of our experiments are provided in section 3. Finally, our conclusions and final remarks are reported in section 4. Figure 3: Image showing the entire setup: the lateral flow test is inserted into the device. Acquisition is made through a camera placed at the front. Figure 4: Left: Adulterant section of the lateral flow test (area inside the blue rectangle). Right: Example of image used for model training. It highlights the portion of the adulterant section considered. The adulterants taken into account are: pH, OX and GL. We filled the image with a black box in order to minimize the impact of meaningless pixels on the model’s prediction. 2. Materials and methods As stated in section 1, our study focuses on assessing the suitability of samples through the analysis of lateral-flow tests for the detection of substance abuse (Fig. 1). All images used were obtained from a specialized medical devices company and were labeled by professional laboratory technicians. An example of images obtained with this setup is shown in Fig. 3. Our analysis focuses on the suitability of the sample by examining the portion of the test image containing the six different adulterants (see Fig. 4 left, highlighted area in the blue rectangle): Specific Gravity (SG), pH, Oxidant (OX), Creatinine (CRE), Nitrite (NI), and Glutaraldehyde (GL). Due to the presence of elements belonging to a specific class only (i.e., samples being either always normal or always abnormal), we concentrated our analysis on only three adulterants - pH, OX, and GL - for which we managed to collect data belonging to both classes. Consequently, we created a dataset consisting of 181 images. Fig. 4 right provides an example of the input image where we cropped the specific portion of interest, filling the remaining part with black pixels. The size of pictures is 215 × 225 pixels. Our model was then trained exclusively on images considering these components. Given the small size of our dataset, we employed the data augmentation technique [31], which is crucial to avoid overfitting when the amount of available data is limited [32]. Furthermore, due to the difficulty to collect normal assays, the original dataset is unbalanced between the two classes, with 133 images of abnormal tests and only 48 pictures of normal assays (the ratio is around 2.77). The class imbalance problem is a major concern in Machine Learning (ML) and Deep Learning (DL) [33]. In fact, training models on unbalanced data may result in learning most from the larger class, with consequent sub-optimal performances and poor generalization capabilities (for instance, a model could associate one class to all the input data regardless of the image features). Aiming to mitigate such issue, we first apply transformations so as to balance the two types of data (see Table 1), thus obtaining 300 images equally split between the two classes, 80% of which constitute the training set and the remaining 20% are the test set. The balancing operation has been performed in order to ensure that each image and its variation(s) cannot be in both training and test sets, hence excluding the possibility of overfitting. Then, we use the RandomAdjustSharpness transformation [34] (parameters: 𝑓 𝑎𝑐𝑡𝑜𝑟 ∈ [0, 0.25, 0.5, 0.75, 1.25, 1.5, 2, 2.5, 3]; 𝑝𝑟𝑜𝑏 = 1.0) to widen the set of input images in both training and test sets. The type of transformations employed to increase the number of data has been chosen carefully by taking into account the specific nature of the problem and the criticality of modifying image colors (see Fig. 2). Overall, the final training set consists of 2400 images, while the final test set contains 600 images. Our model has been trained 10 times, each one starting with a different network weight initialization. The use of multiple replications minimize the risk of overestimating the model’s performance due to lucky conditions. Training lasts 50 epochs, the learning rate is set to 10−4 and the batch size is set to 16. The model’s optimizer is Adam [35] with weight decay, whose value is 10−2 . The size of pooling filters is 2 × 2. The experimental parameters have been derived from [26] and are empirically determined. Before training our model, we applied the k-fold cross-validation technique [36] to verify whether our deep network is suitable for the considered problem and mitigate the data overfitting issue [37]. We set 𝑘 = 5 and measured the average accuracy of the model by computing the Cross-Entropy (CE) loss metric. The average CE obtained during cross-validation phase is 92.917%, that is the proposed model achieves sufficiently good categorization performances and correctly discriminates between normal and abnormal tests. Therefore, we can state that our model is suitable for the considered problem. Aiming to demonstrate the novelty and efficacy of the proposed model, we perform a compar- ison with the DenseNet121 [38] and ResNet18 [39] pre-trained networks, which represent two state-of-the-art models. We choose these networks since they are characterized by a number of trainable parameters of similar orders of magnitude compared to our approach, as we will illustrate in the next section. This allows us to perform a fair evaluation of the presented model. 3. Results In this section we provide the outcomes of our experiments. As for the k-fold cross-validation phase, we employed the CE loss as a performance measure. Fig. 5, left shows the CE loss of the model during training. As it can be observed, the training error quickly decreases and stabilizes from the epoch 20 (see Fig. 5 left, blue curve). Conversely, the test error increases in the first 10 epochs (see Fig. 5 left, red curve), then it starts decreasing and almost stabilizes from epoch 20-25. The peak at epoch 40 (see Fig. 5 left, red curve) is due Table 1 List of transformations applied to the original images in order to balance pictures between the two output classes (i.e., “normal” and “abnormal”). Specifically, we generate 17 images of abnormal tests and 102 pictures of normal assays. The resulting set contains 300 images equally distributed among the two classes. For further details about the transformations, the reader is referred to [34]. Output class Type # of transforms Parameter Center crop 𝑠𝑖𝑧𝑒 : (215, 205) Abnormal + pad 17 𝑝𝑎𝑑𝑑𝑖𝑛𝑔 : 5 𝑓 𝑖𝑙𝑙 = 0 Center crop 𝑠𝑖𝑧𝑒 : (215, 205) Normal + pad 48 𝑝𝑎𝑑𝑑𝑖𝑛𝑔 : 5 𝑓 𝑖𝑙𝑙 = 0 Center crop 𝑠𝑖𝑧𝑒 : (205, 195) Normal + pad 48 𝑝𝑎𝑑𝑑𝑖𝑛𝑔 : 10 𝑓 𝑖𝑙𝑙 = 0 Center crop 𝑠𝑖𝑧𝑒 : (205, 195) Normal + pad 6 𝑝𝑎𝑑𝑑𝑖𝑛𝑔 : 15 𝑓 𝑖𝑙𝑙 = 0 0.8 2250 # of correct images 2000 0.6 1750 1500 S1 training S2 Loss test 1250 S3 0.4 0 10 20 30 40 50 S4 S5 600 S6 S7 # of correct images 0.2 500 S8 S9 S10 400 0.0 300 0 10 20 30 40 50 0 10 20 30 40 50 Epochs Epochs Figure 5: Model classification capability during training. Left: Error curve during training. Data are obtained by averaging 10 replications of the experiment. Right: Number of images correctly classified during training. Curves show how many images are correctly categorized as “normal” in each replication (labeled as S1, S2, . . . , S10). Top: data referring to training set. Bottom: data collected on the images belonging to test set. to a sudden increase of the error observed in one of the 10 replications (see Fig. A.1), probably related to a particular batch of images. The model achieves an average accuracy of 97.683%, i.e. a very good classification capability. Specifically, the proposed approach succeeds in correctly categorizing all the lateral-flow assays in both the training set and the test set in 4 out of 10 replications and reaches a test accuracy over 98% in 9 replications (see Fig. A.2). Overall, these results imply that our outcomes have been obtained systematically and are not due to chance or lucky initialization of the network’s weights. Instead, the proposed model is able to extract Confusion matrix 1.0 300 250 0.8 Abnormal 300 0 200 True Positive Rate 0.6 True label AUC=1.0 150 0.4 100 Normal 0 300 0.2 50 0.0 0 Abnormal Normal 0.0 0.2 0.4 0.6 0.8 1.0 Predicted label False Positive Rate Figure 6: Analysis of the best model’s categorization. Left: Classification results of best model found. The confusion matrix indicates the number of correctly/wrongly categorized images in the test set. Data in the main diagonal represent correct model predictions, while data outside the diagonal indicate classification errors. Right: ROC curve of the best model. The AUC score is indicated in the legend. Original Image Saliency Integrated Gradients 0.0 0.2 0.4 0.6 0.8 1.0 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 Figure 7: Feature maps of a lateral-flow test labeled as “normal”. Left: original image of the assay. Middle: Saliency feature map. Colors in the bar range from white (absence of saliency) to blue (positive saliency). Right: Integrated Gradients feature map. Colors in the bar range from red (negative attribution) to green (positive attribution). Original Image Saliency Integrated Gradients 0.0 0.2 0.4 0.6 0.8 1.0 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 Figure 8: Feature maps of a lateral-flow test labeled as “abnormal”. The abnormality of the assay is due to the OX and GL adulterants. Left: original image of the assay. Middle: Saliency feature map. Colors in the bar range from white (absence of saliency) to blue (positive saliency). Right: Integrated Gradients feature map. Colors in the bar range from red (negative attribution) to green (positive attribution). Table 2 Analysis of the accuracy and the efficiency of the ConvNet3_4, DenseNet121 and ResNet18 models. With regard to ConvNet3_4, we considered the best model for the comparison. Bold values denote the best outcomes (concerning both Model parameters and Training time, the lower the better). Interestingly, the ConvNet3_4 model requires remarkably less time than the other two networks during training. ConvNet3_4 DenseNet121 ResNet18 Accuracy 100% 52.5% 50.667% Model parameters 3464527 6955906 11177538 Training time (s) 513 11668 3790 the relevant features from the input images and predict the corresponding output class. The oscillations of the CE during training may be due to the randomization of the order of input images across epochs. If we analyze the number of lateral-flow assays correctly categorized in each replication (see Fig. 5 right), we can observe how all the networks manage to properly associate each test in the training set to the right class in around 20 epochs (Fig. 5 right, top figure). The classification of images in the test set is subject to oscillations and stabilizes after around 20 epochs except for one replication (Fig. 5 right, bottom figure). This outcome is not surprising since the latter set is used as a tool for validating the model. By considering the best model only, Fig. 6 left explains whether and how the model categorizes the lateral-flow assays in the test set. The latter consists of 300 images of abnormal tests and 300 pictures of normal assays (see Fig. 6 left). As it can be seen, the model manages to correctly classify all the 600 images in the test set. Fig. 6 right illustrates the Receiver Operating Characteristic (ROC) curve of the best model, which plots the true positive rate against the false positive rate. Because the Area Under Curve (AUC) is 1.0, our best model corresponds to a perfect classifier. As we mentioned above, we compared the outcomes of our model with those achieved with the DenseNet121 and ResNet18 networks. The analysis is illustrated in Table 2 and reveals that the ConvNet3_4 model is notably superior to both DenseNet121 and ResNet18 with respect to the accuracy: pre-trained networks manage to correctly classify only around half of the images, i.e. they are not able to discriminate between the two possible output classes (see also Fig. B.1). The result is in line with those reported in [26]. Moreover, our ConvNet3_4 model is also remarkably better than DenseNet121 and ResNet18 in terms of efficiency (see the significant discrepancy concerning the training time in Table 2), which represents a pivotal property for a model applicability in real scenarios. The results we presented so far demonstrate that our model is suitable to address the consid- ered problem and achieve excellent performances in terms of classification capability. Nonethe- less, as we stated in section 1, a worthwhile aspect in the fields of medicine and healthcare is the explainability of the proposed models, which is necessary to practically employ them in real-application cases. To this end, we performed a feature analysis by using two different techniques: Saliency [40] and Integrated Gradients (IG) [41]. Both methods are widely employed to interpret the outcomes of a model’s classification [42–44]. The former method allows the identification of the parts of the image contributing more to the output prediction. An example of the feature map extracted through the Saliency method is shown in Fig. 7 middle. Saliency values range from 0 (absence of saliency) to 1 (positive saliency), as indicated in Fig. 7 middle. For a more detailed description of the Saliency algorithm, the reader is referred to [40]. Con- versely, the IG method identifies the regions of the image that most influenced the model’s classification decision by considering the entire input-output trajectory and the reference input distribution (baselines) used in the attribution calculation. As pointed out in [41], the IG method first considers the input image 𝑥 and a baseline 𝑥′ , which is an input characterized by absence of features. Then, a straightline path from 𝑥′ to 𝑥 is taken into account. IG computes the gradients at all points along such path. The integrated gradients are obtained as the cumulative sum of these gradients. Put more formally, if we denote our CNN model with 𝐹 : R𝑛 → [0, 1], the integrated gradients along the 𝑖𝑡ℎ dimension is calculated as: 𝜕𝐹 (𝑥′ + 𝛼 × (𝑥 − 𝑥′ )) ∫︁ 1 ′ 𝐼𝐺𝑖 (𝑥) = (𝑥𝑖 − 𝑥 𝑖 ) × 𝑑𝛼 (1) 𝛼=0 𝜕𝑥𝑖 where 𝜕𝐹𝜕𝑥(𝑥) 𝑖 indicates the gradient of 𝐹 (𝑥) along the 𝑖𝑡ℎ dimension. Overall, the IG method enables to detect the portions of the picture providing positive (parts in green, see Fig. 7 right) and negative (parts in red, see Fig. 7 right) contributions to the output prediction. Fig. 7 shows the image of a lateral-flow test categorized as “normal” (Fig. 7 left), the Saliency feature map (Fig. 7 middle) and the Integrated Gradients heat map (Fig. 7 right). Fig. 8 contains the same data for an “abnormal” assay. The colorbars below the heat maps specify respectively the intensity of the saliency and the magnitude of the importance attribution of each region of the image to the model’s prediction. By examining the feature maps associated to a normal lateral-flow test (Fig. 7), we can observe that the Saliency method returns a positive saliency for the pH and GL adulterants and a slightly positive saliency for the OX adulterant (Fig. 7 middle). Concerning the Integrated Gradients technique, the heat map displays a positive attribution for the pH and GL adulterants and a slightly positive attribution for the OX adulterant (see Fig. 7 right). Therefore, in the case of a normal assay, the model assigns the same importance to the portions of the image containing the considered adulterants. This outcome is in line with the actual test result. If we look at the relevant features highlighted by the Integrated Gradients technique with regard to an abnormal assay, we can see that a slightly positive attribution is conferred to the pH, OX and GL adulterants (Fig. 8 right). As far as the Saliency algorithm is concerned, it returns a positive saliency for the GL adulterant and a slightly positive attribution for the pH and OX adulterants (Fig. 8 middle). Also in this case, the results are coherent with the actual test outcome, which indicates non-compliance for the OX and GL adulterants. Overall, our findings suggest that the proposed model primarily relies on the portions of the image containing the pH, OX and GL adulterants. However, the amount of contribution strongly depends on the test result (e.g., normal or abnormal) and the specific color gradation of the adulterants. Indeed, except for one case, the OX adulterant is characterized by soft nuances tending to be as similar as the background color of the image. Similarly, the normality and abnormality of the pH adulterant are defined based on subtle color gradations. Therefore, distinguishing between the two cases might be challenging even for laboratory operators. Finally, it is worth noting that cropping the picture does not affect the classification capability of the model. Indeed, the use of black pixels providing no information allows the model to focus only on the remaining parts of the image, which contains the relevant data. To summarize, our outcomes demonstrate the capability of the ConvNet3_4 model to extract the most relevant features of the input image in order to generate a precise prediction of the output class. In particular, the identification of the portions containing the adulterants as the key elements of the input image implies that the model is capable of making assumptions from a medical point of view. Indeed, discriminating between the color nuances of the considered adulterants (see Fig. 2) is far from trivial even for experienced and well-trained operators. 4. Discussion and conclusions The spread of AI methods poses questions about the explainability of such models, particularly when they are applied in real-world contexts. Especially in medicine and healthcare, using explainable and trustworthy approaches is paramount in order to help doctors and other professionals to make diagnoses of possible diseases, design adequate therapies for prevention or rehabilitation, explain and collect historical data. Generally, analyzing huge amount of medical data is addressed through DL methods and CNNs represent a widespread tool, although they are often tailored to specific applications. This represents a major concern in the possibility to develop cross-cutting tools. In this work, we proposed a novel approach for the problem of automatically classify lateral-flow tests for drug abuse detection. Specifically, we considered the adulterant section of an assay and we trained a CNN model for the ability to categorize tests (i.e., normal or abnormal) by analyzing the pH, OX and GL adulterants only. We used the network introduced in [25, 26] with the addition of pooling layers in the convolutional part of the model. The use of pooling enables the development of slim networks that can be used in real-world scenarios, particularly when dealing with limited hardware resources. We verified the suitability of the model through a 5-fold cross-validation and we ran the training 10 times. We collected promising results on the chosen task, with an excellent average accuracy. The proposed approach is also notably superior to two state-of-the-art deep networks. Moreover, we provided an explainability of our model by performing a feature analysis. Our outcomes reveal the importance of some portions of the input image (those containing the adulterants), while other parts affect the final prediction only partially. In spite of the good results we achieved, further research is needed to generalize our approach. First, we are collecting samples so as to broaden our analyses to the SG, CRE and NI adulterants, which are out of the scope of this work, and increase the size of our dataset. Owning a significant high number of images is pivotal to apply the model in a real case, since medical data usually include hundreds or thousands of pictures. However, extending the analysis to all adulterants implies considering input images of different sizes, with an unavoidable effect on the model’s performance. Furthermore, the ConvNet3_4 model proves effective at dealing with the considered problem, with a very good classification capability, and outperforms the DenseNet121 and ResNet18 pre-trained networks. Nonetheless, we observe the presence of oscillations during training due to the sensitivity to specific batches of input images. Aiming to address such undesired behavior, in the future we will consider the possibility to adapt the learning rate and/or the weight decay during training. In addition, with respect to the explainability issue, we provide evidence of the reasons behind the success of our model. Nonetheless, because the visualization of feature maps revealed the model identifying sometimes as regions of interest those that should not influence it (e.g., the portions of the cartridge container surrounding the adulterants, see Figs. 7 - 8), in future developments we plan to create a mask. This mask, overlaid on the original image, will eliminate areas of low interest, allowing the model to focus solely on relevant regions. Finally, we are investigating the applicability of our model to other datasets in the medical and health care fields with the aim to generalize the validity of the approach. References [1] M. Jrad, A role of artificial intelligence in the context of economy: Bibliometric analysis and systematic literature review, International Journal of Membrane Science and Technology 10 (2023) 1563–86. [2] Z. Jan, F. Ahamed, W. Mayer, N. Patel, G. Grossmann, M. Stumptner, A. Kuusk, Artificial in- telligence for industry 4.0: Systematic review of applications, challenges, and opportunities, Expert Systems with Applications 216 (2023) 119456. [3] D. Araújo, M. Couceiro, L. Seifert, H. Sarmento, K. Davids, Artificial intelligence in sport performance analysis, Routledge, 2021. [4] C. J. Haug, J. M. Drazen, Artificial intelligence and machine learning in clinical medicine, 2023, New England Journal of Medicine 388 (2023) 1201–1208. [5] O. Marques, Artificial intelligence and medicine: The big picture, in: AI for Radiology, CRC Press, 2024, pp. 1–17. [6] D. Houfani, S. Slatnia, O. Kazar, H. Saouli, A. Merizig, Artificial intelligence in healthcare: a review on predicting clinical needs, International Journal of Healthcare Management 15 (2022) 267–275. [7] J. G. Richens, A. Buchard, Artificial intelligence for medical diagnosis, in: Artificial Intelligence in Medicine, Springer, 2022, pp. 181–201. [8] R. G. Nadakinamani, A. Reyana, S. Kautish, A. Vibith, Y. Gupta, S. F. Abdelwahab, A. W. Mohamed, et al., Clinical data analysis for prediction of cardiovascular disease using machine learning techniques, Computational intelligence and neuroscience 2022 (2022). [9] F. Khader, T. Han, G. Müller-Franzes, L. Huck, P. Schad, S. Keil, E. Barzakova, M. Schulze- Hagen, F. Pedersoli, V. Schulz, et al., Artificial intelligence for clinical interpretation of bedside chest radiographs, Radiology 307 (2022) e220510. [10] J. Tveit, H. Aurlien, S. Plis, V. D. Calhoun, W. O. Tatum, D. L. Schomer, V. Arntsen, F. Cox, F. Fahoum, W. B. Gallentine, et al., Automated interpretation of clinical electroencephalo- grams using artificial intelligence, JAMA neurology (2023). [11] S. Coşar, M. Fernandez-Carmona, R. Agrigoroaie, J. Pages, F. Ferland, F. Zhao, S. Yue, N. Bellotto, A. Tapus, Enrichme: Perception and interaction of an assistive robot for the elderly at home, International Journal of Social Robotics 12 (2020) 779–805. [12] G. D’Onofrio, D. Sancarlo, M. Raciti, D. Reforgiato, A. Mangiacotti, A. Russo, F. Ricciardi, A. Vitanza, F. Cantucci, V. Presutti, et al., Mario project: experimentation in the hospital setting, in: Ambient Assisted Living: Italian Forum 2017 8, Springer, 2019, pp. 289–303. [13] Craig Medical: Adulterant Validity Chart Interpretation, Rapidcheck pro 10 dsc with adul- terant check, https://www.craigmedical.com/Drug_5Panel_DSC-Adulterant.htm, 2024. [14] H.-J. Jang, K.-O. Cho, Applications of deep learning for the analysis of medical data, Archives of pharmacal research 42 (2019) 492–504. [15] D. Sarvamangala, R. V. Kulkarni, Convolutional neural networks in medical image under- standing: a survey, Evolutionary intelligence 15 (2022) 1–22. [16] X. Yao, X. Wang, S.-H. Wang, Y.-D. Zhang, A comprehensive survey on convolutional neural network in medical image analysis, Multimedia Tools and Applications 81 (2022) 41361–41405. [17] H. Yu, L. T. Yang, Q. Zhang, D. Armstrong, M. J. Deen, Convolutional neural networks for medical image analysis: state-of-the-art, comparisons, improvement and perspectives, Neurocomputing 444 (2021) 92–110. [18] S. Reddy, Explainability and artificial intelligence in medicine, The Lancet Digital Health 4 (2022) e214–e215. [19] W. Samek, K.-R. Müller, Towards explainable artificial intelligence, Explainable AI: interpreting, explaining and visualizing deep learning (2019) 5–22. [20] B. H. Van der Velden, H. J. Kuijf, K. G. Gilhuijs, M. A. Viergever, Explainable artificial intelligence (xai) in deep learning-based medical image analysis, Medical Image Analysis 79 (2022) 102470. [21] G. Tufo, M. Zribi, F. Pitolli, P. Pagliuca, Advanced computer vision techniques for drug abuse detection, in: 21st edition of the IMACS world congress (IMACS2023), 2023, p. 226. [22] A. Carrio, C. Sampedro, J. L. Sanchez-Lopez, M. Pimienta, P. Campoy, Automated low-cost smartphone-based lateral flow saliva test reader for drugs-of-abuse detection, Sensors 15 (2015) 29569–29593. [23] M. H. Tania, K. T. Lwin, A. M. Shabut, M. Najlah, J. Chin, M. A. Hossain, Intelligent image-based colourimetric tests using machine learning framework for lateral flow assays, Expert Systems with Applications 139 (2020) 112843. [24] V. Turbé, C. Herbst, T. Mngomezulu, S. Meshkinfamfard, N. Dlamini, T. Mhlongo, T. Smit, V. Cherepanova, K. Shimada, J. Budd, et al., Deep learning of hiv field-based rapid tests, Nature medicine 27 (2021) 1165–1170. [25] M. Zribi, P. Pagliuca, F. Pitolli, Convolutional neural networks for the automatic control of consumables for analytical laboratories, in: BUILD-IT2023 worskhop, 2023, pp. 95–97. [26] M. Zribi, P. Pagliuca, F. Pitolli, A computer vision-based quality assessment technique for the automatic control of consumables for analytical laboratories, Expert Systems with Applications (2024, in press). [27] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016. [28] H. J. Min, H. A. Mina, A. J. Deering, E. Bae, Development of a smartphone-based lateral- flow imaging system using machine-learning classifiers for detection of salmonella spp., Journal of Microbiological Methods 188 (2021) 106288. [29] S. Yan, C. Liu, S. Fang, J. Ma, J. Qiu, D. Xu, L. Li, J. Yu, D. Li, Q. Liu, Sers-based lateral flow assay combined with machine learning for highly sensitive quantitative analysis of escherichia coli o157: H7, Analytical and Bioanalytical Chemistry 412 (2020) 7881–7890. [30] Y. Zha, Y. Li, J. Zhou, X. Liu, K. S. Park, Y. Zhou, Dual-mode fluorescent/intelligent lateral flow immunoassay based on machine learning algorithm for ultrasensitive analysis of chloroacetamide herbicides, Analytical Chemistry (2024). [31] M. A. Tanner, W. H. Wong, The calculation of posterior distributions by data augmentation, Journal of the American statistical Association 82 (1987) 528–540. [32] C. Shorten, T. M. Khoshgoftaar, A survey on image data augmentation for deep learning, Journal of big data 6 (2019) 1–48. [33] N. Japkowicz, S. Stephen, The class imbalance problem: A systematic study, Intelligent data analysis 6 (2002) 429–449. [34] PyTorch Illustration of transforms, https://pytorch.org/vision/main/auto_examples/plot_ transforms.html, 2024. [35] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014). [36] C. Schaffer, Selecting a classification method by cross-validation, Machine learning 13 (1993) 135–143. [37] L. A. Yates, Z. Aandahl, S. A. Richards, B. W. Brook, Cross validation for model selection: a review with examples from ecology, Ecological Monographs 93 (2023) e1557. [38] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2017, pp. 4700–4708. [39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. [40] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, arXiv preprint arXiv:1312.6034 (2013). [41] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: International conference on machine learning, PMLR, 2017, pp. 3319–3328. [42] I. Čík, A. D. Rasamoelina, M. Mach, P. Sinčák, Explaining deep neural network using layer- wise relevance propagation and integrated gradients, in: 2021 IEEE 19th world symposium on applied machine intelligence and informatics (SAMI), IEEE, 2021, pp. 000381–000386. [43] G. Li, Y. Yu, Visual saliency detection based on multiscale deep cnn features, IEEE transactions on image processing 25 (2016) 5012–5024. [44] M. Schwegler, C. Müller, A. Reiterer, Integrated gradients for feature assessment in point cloud-based data sets, Algorithms 16 (2023) 316. Appendix A. ConvNet3_4 3.0 2.5 2.0 training Loss 1.5 test 1.0 0.5 0.0 0 10 20 30 40 50 Epochs Figure A.1: Error curve of the worst ConvNet3_4 model out of 10 replications. The training error decreases and goes to 0 after few epochs. Instead, the test error increases at the beginning of training and oscillates; the sudden increase at epoch 40 prevents this replication from achieving a good accuracy on images in the test set. 600 500 400 # of classifications Correct 300 Wrong 200 100 0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 Figure A.2: Number of correct and wrong classifications with respect to the images in the test set. Data refer to 10 replications of the experiment. B. Pre-trained models Confusion matrix Confusion matrix 190 200 180 Abnormal 123 177 170 Abnormal 90 210 180 160 160 True label True label 150 140 140 Normal 108 192 130 Normal 86 214 120 120 100 110 Abnormal Normal Abnormal Normal Predicted label Predicted label Figure B.1: Analysis of the classification capability of pre-trained networks on the images belonging to the test set. Data in the main diagonal denote correct predictions, while values outside the diagonal represent faulty categorizations. Left: DenseNet121 model. Right: ResNet18 model.