Exploring Composite Dataset Biases for Heart Sound Classification Davoud Shariat Panah1 , Andrew Hines2 , and Susan Mckeever1 1 School of Computing, Technological University Dublin, Ireland 2 School of Computer Science, University College Dublin, Ireland d19127274@mytudublin.ie, andrew.hines@ucd.ie, susan.mckeever@tudublin.ie Abstract. In the last few years, the automatic classification of heart sounds has been widely studied as a screening method for heart disease. Some of these studies have achieved high accuracies in heart abnormality prediction. However, for such models to assist clinicians in the detection of heart abnormalities, it is of critical importance that they are gener- alisable, working on unseen real-world data. Despite the importance of generalisability, the presence of bias in the leading heart sound datasets used in these studies has remained unexplored. In this paper, we explore the presence of potential bias in heart sound datasets. Using a small set of spectral features for heart sound representation, we demonstrate ex- perimentally that it is possible to detect sub-datasets of PhysioNet, the leading dataset of the field, with 98% accuracy. We also show that sensors which have been used to capture recordings of each dataset are likely the main cause of the bias in these datasets. Lack of awareness of this bias works against generalised models for heart sound diagnostics. Our find- ings call for further research on the bias issue in heart sound datasets and its impact on the generalisability of heart abnormality prediction models. Keywords: Bias, PhysioNet Dataset, Heart Sound, Machine Learning 1 Introduction Cardiac auscultation is a cost-effective and non-invasive technique that has been used by physicians to diagnose heart disease for over 200 years [15]. Ausculta- tion involves listening and interpreting the patient’s heart beat, typically using a stethoscope. However, the accuracy of this diagnostic method is influenced by different factors such as the auscultation skills of the clinicians and the capacity of the human auditory system to detect low-frequency sounds [20]. In recent times, the development of heart sound classification models for automatic detec- tion of heart abnormalities has been an active area of research [13, 11, 4, 23, 16]. Given that the ultimate goal of such systems is to assist clinicians with their decision making, the generalisability of these models to unseen real-world data is of great importance. One of the main causes of poor generalisation of predictive models is dataset bias [22, 21]. Unintended bias can be introduced into datasets at different stages Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 D. Shariat Panah, A. Hines, S. Mckeever of the data collection and generation process. Consequently, the source of bias will differ across datasets, including historical, representation and measurement bias [18]. Supervised machine learning models are heavily influenced by the char- acteristics of the data they are trained on. Bias in the data may result in a suboptimal model which would be biased towards some particular features of the dataset [22]. While such a model might show a high level of accuracy on the dataset used for training and evaluation, it may not offer the same performance when deployed to production. In other words, the generalisability of such model can be affected by the fact that the training and the real-world data come from different distributions. In addition to the diagnostically salient acoustic characteristics, heart sound recordings are susceptible to a variety of factors. These can be grouped as: hu- man factors regarding the patient (e.g. age, resting/moving state, fitness levels); context and environmental factors (e.g. room noise, stethoscope placement); and system factors (e.g. stethoscope specifications such as acoustic coupling, digital sampling rate, acoustic dynamic intensity range). Recent work has also shown differences in parameters of the transmitted sound exist between digital stetho- scopes [12]. Despite the fact that any of these factors can be a potential source of bias in heart sound datasets, the impact of bias on datasets used for data-driven classification models has not been explored. At the same time, the heavy reliance on a small set of datasets in this area of research also stresses the importance of potential bias in these datasets. Currently, there are few publicly available heart sound datasets which have been employed to build heart abnormality prediction models. One of these which has been widely used as a gold standard dataset by researchers since 2016 is the PhysioNet heart sound dataset [9]. We explore the presence of potential bias in the PhysioNet heart sound dataset and its im- pact on the generalisability of heart disease prediction models. Our key finding is that bias is present in the PhysioNet meta-dataset, and that the sound cap- turing sensor (the digital stethescope) used is likely the principal contributor to this bias. Our paper is structured as follows: In section 2, we provide a brief overview of the heart sound classification problem. We also give an overview of the available datasets and specifications of sensors used to capture heart sounds. In section 3, we provide the details of the experimental methodology, including the chosen datasets, pre-processing, feature representations, classification model, and met- rics. In section 4, we give a detailed analysis of the results. In section 5, the results are discussed. Conclusions and future directions are presented in section 6. 2 Background and Related Work Heart sounds are a product of vibrations in heart muscles. These vibrations are in turn the result of blood flow with the opening and closure of heart valves. A normal heartbeat cycle is composed of two separate sounds, called first heart sound (S1) and second heart sound (S2). In some cases, a third and fourth sound might also be present, which can be a sign of heart abnormality. In addition Exploring Composite Dataset Biases for Heart Sound Classification 3 Fig. 1: Phonocardiogram of a normal heart sound. to these sounds, heart valve defects can also produce a whooshing or swishing sound, which is called murmur. A phonocardiogram is a visual representation of a heart sound showing heart sound amplitudes over a period of time. Figure 1 shows the phonocardiogram of a normal heart sound. Physicians use a device called stethoscope to monitor the heart sounds. By examining the timing, duration, intensity, and pitch of the heart sounds, they can differentiate normal and abnormal sounds [10]. Acoustic stethoscopes have been used by clinicians for over 200 years. However, in recent years, they have evolved into digital auscultation devices with multiple functionalities. 3M Littmann [1], Eko Core [6], and Jabes [7] are just some examples of available electronic stetho- scopes in the market. Specifications of these three stethoscopes have been pro- vided in Table 1. These devices allow their users to significantly amplify heart sounds. They also eliminate ambient noises by filtering out unwanted frequency ranges or through active noise cancellation. Littman and Jabes stethoscopes also offer functionality to enable users to switch between different frequency modes which have been tailored to heart and lung sounds frequency ranges. Such fea- tures enable digital stethoscopes to offer a higher level of sound quality than their acoustic counterparts, which in turn will assist practitioners to make a more accurate diagnosis [19]. Referring to Table 1, we think that characteristics such as different frequency ranges, digital sampling rates, and frequency modes can be potential causes of bias in sounds recorded by such sensors. Table 1: Specifications of three digital stethoscopes available in the market. Frequency Sample Noise Frequency Stethoscope Amplification Range (Hz) Rate (Hz) Reduction Modes 3M Littmann 20 – 2000 8000 Up to 24x Yes Yes 3200 Yes - Active Eko Core 20 – 2000 4000 Up to 40x No noise cancellation Jabes 20 – 1000 8000 Up to 20x Yes Yes Currently, a few heart sound datasets are available to researchers, including but not limited to PASCAL [2] and PhysioNet datasets. PhysioNet is by far the most extensive heart sound meta-dataset, comprising six smaller databases with 4 D. Shariat Panah, A. Hines, S. Mckeever different number of recordings. These six databases were contributed by differ- ent research groups to the PhysioNet Computing in Cardiology 2016 challenge. The heart sounds available in each database have been recorded using digital stethoscopes/microphones in clinical as well as non-clinical environments. Phy- sioNet meta-dataset contains 3240 recordings, out of which 665 samples belong to normal subjects and 2575 samples to abnormal ones. Since its release in 2016, many researchers, including [13, 11, 4, 23] have used PhysioNet as a benchmark dataset to validate their proposed algorithms. Creating a heart sound classification system generally involves four steps: data acquisition and pre-processing, segmentation, feature extraction, and heart sound classification [5]. It must be noted that one or some of these steps might not be present in particular cases such as deep learning models. Potes et al. [13] applied the Springer segmentation algorithm [17] to segment heart sounds and then used time and frequency domain features to train an ensemble model which combines an AdaBoost classifier with convolutional neural network (CNN). Their method achieved a mean accuracy of 86% on the PhysioNet/CinC 2016 challenge test set and won the challenge. The goal of this challenge was to classify heart sound recordings into either normal or abnormal categories. In [11], Noman et al. extracted Mel-Frequency Cepstral Coefficients (MFCCs) from short segment heart sound signals. They then built a deep learning architecture which com- bines a 1D-CNN that receives raw heart sound signals, and a 2D-CNN that takes MFCCs as input. Their proposed method achieved a mean accuracy of 88.14% on the PhysioNet dataset. The method proposed by Dominguez-morales et al. [4] split heart sound recordings into fixed-length segments and extracts frequency bands of each segment. Sonogram images were generated for each of the samples and then fed into a CNN model. This method achieved a mean ac- curacy of 94.16% on Physionet dataset. High reliance on the PhysioNet dataset as a benchmark dataset in recent work [13, 11, 4, 23] indicates that PhysioNet is currently the gold standard dataset in this field. Although some of the studies mentioned above have achieved high accuracies on PhysioNet meta-dataset, the potential presence of bias in this dataset and its impact on the generalisability of the proposed models has been overlooked. The fact that the PhysioNet meta-dataset is imbalanced across its constituent databases increases the risk that the models built using this dataset will be biased towards the characteristics of one of its sub-databases. As a result, while the resultant models might achieve high accuracies on this particular dataset, such models may show lower performance when we use them in real-world scenarios to classify heart sounds. Previous studies were motivated to design models with a higher levels of heart sound classification accuracy. We explore the PhysioNet dataset from a different point of view. We investigate the presence of bias in PhysioNet, the leading dataset in the field, aiming to establish the presence and main cause of such bias. Exploring Composite Dataset Biases for Heart Sound Classification 5 Table 2: Details of the databases used for training and evaluation of the model. Adapted from [9]. Sensor Subject # Gender Recording Sample Database Age frequency Sensor type Rec (F/M)% position rate (Hz) response Normal 117 Nine 20 Hz – Training-a Unknown Unknown different 44100 Meditron 20 kHz Abnormal 292 positions Normal 386 3M Tricuspid 20 Hz – Training-b Unknown 38/62 4000 Littmann Area 1 kHz Abnormal 104 E4000 Normal 80 20 Hz – Training-f 56 ± 16 62/38 Apex 8000 Jabes 1 kHz Abnormal 34 3 Experimental Methodology Our first objective is to find out if there is any bias resulting from PhysioNet’s construction through combining sub-datasets of heart sounds that were sourced from a variety of independent research studies. To do so, we train a classification model using three different sub-datasets of PhysioNet dataset and see if we can distinguish the recordings of each sub-dataset with an accuracy higher than random guess. PhysioNet sub-datasets have some differences across multiple attributes such as age distribution of subjects, auscultation positions and sensors used to capture the sounds. As a result, any of these attributes might be a potential cause of bias in this meta-dataset. In this regard, our second objective is to find out which attributes play a more significant role in introducing bias in the PhyioNet dataset. This section presents the datasets, pre-processing, feature representations, classification model and the evaluation metrics computed. The experiments were implemented in Python 3.8 using Librosa 0.8.0 library for feature extraction and Scikit-learn 0.23.2 for the classification models. 3.1 Datasets In order to train and evaluate the classification model, we use three out of six databases which are available in PhysioNet heart sound meta-dataset. Table 2 shows the details of the selected databases. These databases include both normal and abnormal heart sounds and been recorded using three different electronic stethoscopes (sensors). We excluded Training-c and Training-d databases because the number of normal samples in these databases are small. As the sensors which were used to capture heart sounds are different across the normal and abnormal classes of Training-e database, we also do not use this database. It must be noted that the sampling rate of the recordings available in PhysioNet dataset is 2000 Hz. 6 D. Shariat Panah, A. Hines, S. Mckeever We label the recordings of each of these databases according to their sensors. Then we choose 80 recordings from the normal class of each of the three datasets. Out of each of these sets of 80 samples, 50 recordings will be used for training, and 30 recordings will be used for testing the classification model. Therefore, we will have a training set containing 150 normal samples across three different databases, and a test set of 90 normal samples. We also create a separate test set which includes 90 abnormal recordings from the same databases. In Table 3, the details of the training and test sets which are used to train and evaluate the classification model have been provided. After creating the training and test sets, pre-processing, feature extraction, and classification steps will be carried out as described in section 3.2, 3.3, and 3.4, respectively. Table 3: Distributions of samples selected from each dataset in training and test sets. The training set contains only normal recording. The classes include Jabes, Littmann, and Meditron. Sensor(class) Training set Normal test set Abnormal test set Jabes 50 30 30 Littmann 50 30 30 Meditron 50 30 30 Total 150 90 90 3.2 Pre-processing Given that each heart sound recording has a different duration, to make the length of samples consistent, we use only the first five seconds of each sample. Five seconds is long enough to capture several cardiac cycles. Also, given that recordings of each database have a different range of amplitudes, we normalise all samples to have an amplitude between -1 and 1 with the following equation: S (t) − min[S (t)] S ´(t) = 2 −1 (1) max [S (t)] − min [S (t)] where S (t) and S ´(t) are the original and normalised signals, respectively. 3.3 Feature Extraction Numerous time-domain, frequency-domain and time-frequency features have been employed in the area of heart sound classification [5]. Unlike previous work, in this paper we are not aiming to design a heart abnormality prediction model – our main goal is to determine whether combining training data from a variety of sources introduces bias into the dataset. As heart sounds can be classified by listening to them with a stethoscope, basic acoustic features that capture features salient to human perception are applied. Given that heart sounds are fundamentally periodic beats, both temporal and spectral features are captured. In this regard, after the pre-processing step, we extract four different spectral features from each sample, including spectral centroid, spectral roll-off, spectral Exploring Composite Dataset Biases for Heart Sound Classification 7 bandwidth and spectral contrast. To reduce the dimensionality of feature vec- tors, for all features, we calculate the average value of the feature across the frames of the sample, as in [23]. (a) Spectral centroid is a measure that shows where the centre of mass of the spectrum is located. It represents the brightness of the sound signal [14] and is calculated as follows (as used in [23]): P x (n) f (n) Sc = n (2) x (n) where x (n) represents the spectral magnitude of frequency bin n, and f (n) is the centre frequency of that bin. (b) Spectral bandwidth is the order-p statistic of the signal spectrum and distinguishes high bandwidth sounds from low bandwidth sounds. It is calculated using the following equation (as used in [14]): ! p1 X p Sb = x (n) (f (n) − Sc ) (3) n where x (n) is the spectral magnitude at frequency bin n, f (n) is the centre frequency of that bin, and Sc represents the spectral centroid. It must be noted that in default Librosa implementation of this feature, p is equal to 2. (c) Spectral roll-off point is a frequency so that 85% of spectral energy lies below that frequency. This feature is calculated using the following equation (as used in [23]): f N ! X X x (n) = 0.85 x (n) (4) n=1 n=1 where f is the roll-off frequency, x (n) is the spectral magnitude at frequency bin n, and N is the total number of frequency bins. (d) Spectral contrast is defined as the difference between spectral peaks and valleys measured in sub-bands by octave-scale filters. For more information, please refer to [8]. 3.4 Classification Model After the feature extraction step, we train a linear Support Vector Machine (SVM) classifier [3]. We selected SVM classifier based on the data volumes avail- able, and achieved similar results with other classifiers such as KNN and random forest. We perform a grid search with 4-fold cross-validation on the training set (as described in Table 3) to optimise the c value for the SVM. After training, the model is evaluated using the test sets described in Table 3. 8 D. Shariat Panah, A. Hines, S. Mckeever 3.5 Metrics We use two different metrics to evaluate the classification model. The first one is recall, which is also called sensitivity or true positive rate. It shows the fraction of positive examples which have been classified correctly and is calculated using the following equation: T rue positive Recall = . (5) T rue positive + F alse negative As summarised in Table 3, the prepared training and test sets are balanced across classes. Therefore, we also use accuracy metric to evaluate our model at dataset level. This metric is defined as the ratio of the number of correct predictions to the total number of examples and is calculated as follows: T rue positive + T rue negative Accuracy = . (6) All examples 4 Results 4.1 Investigation of the presence of dataset bias Figure 2 illustrates the distributions of the extracted features from the training set across three different datasets. As shown in Figure 2, the median value of the majority of features is distinct across different datasets. Also, in the case of Jabes dataset, we can observe that the distributions of spectral centroid and spectral roll-off features are entirely distinct from the feature distributions of the other two datasets. The observations can be a sign of potential bias across datasets. To find out if selected datasets are biased or not, we evaluate the SVM model which was trained using normal recordings of three different datasets on the normal test set (as described in Table 3). Given that normal recordings must be consistent across different datasets, if dataset bias was not present, we could not expect to see an accuracy higher than chance. Figure 3 (left confusion matrix) depicts the evaluation results. According to Figure 3, we can see when we evaluate the model on the test set with normal recordings, the recall for each of the classes is at least 97%. Given that this is a three-class classification problem, the random guess accuracy would be 13 ≈ 0.33. We can see that the overall accuracy on the normal test set is 98% which is significantly higher than the chance. This observation clearly indicates the presence of bias in selected PhysioNet sub-datasets. 4.2 Exploring the cause of the dataset bias As we mentioned earlier, in addition to sensors, PhysioNet sub-datasets are different across multiple attributes such as age distribution of the subjects and auscultation positions used to record the heartbeat sounds. To find out what is Exploring Composite Dataset Biases for Heart Sound Classification 9 Fig. 2: Distributions of extracted features across three different datasets. Fig. 3: Confusion matrices for the SVM model which has been trained using four spectral features. The left confusion matrix shows the performance of model tested on normal test set, and the right one shows the performance on the abnormal test set. the main cause of bias in PhysioNet dataset, we perform the above experiment again, but this time instead of testing on the normal test set, we evaluate the model on the abnormal test set (as described in Table 3). In other words, we use the same SVM model which was trained on the normal heart sounds and evaluate it on a test set which contains various abnormal heart sounds from different subjects. This way, we can make sure that the recordings in training and test sets are considerably different in terms of content. If the model can still predict the datasets with an accuracy higher than chance, this will indicate that sensor is likely the main cause of bias in datasets and the role of the other 10 D. Shariat Panah, A. Hines, S. Mckeever attributes in dataset bias is negligible compared to that of sensors. The reason is that sensor is the only attribute which is certainly consistent across the three datasets available in normal and abnormal test sets. Figure 3 (right confusion matrix) shows the result of this experiment. We can see that the recall values for all three classes are still much higher than the random guess. Also, the overall accuracy is 96% which is significantly higher than the chance (33.3%). We observe that despite evaluating the model on a test set with very different content, the model still achieves a near to perfect accuracy. This observation suggests that the sensor is likely the main source of bias in selected heart sound databases. Given that other attributes like the age distribution of the subjects differ across normal and abnormal test sets, if they were the main source of bias, we could not except to see an accuracy much higher than chance. 5 Discussion In section 4, we examined the presence of bias across three heart sound datasets available as part of the PhysioNet meta-dataset We showed that we could accu- rately classify sub-datasets of the PhysioNet meta-dataset and concluded that bias is certainly present in this meta-dataset. We also demonstrated that the role of attributes such as the age of the subjects and auscultation positions in introducing bias could not be as significant as the sensor, and sensor is likely the main cause of bias in PhysioNet dataset. It is important to note that, due to the existence of multiple attributes in each of the PhysioNet sub-datasets, we cannot assert that the other attributes do not play any role in introducing bias into this dataset. We do not have access to sufficient data to precisely examine the role of all attributes involved. We carried out our experiments on three out of six subset datasets of Phy- sioNet meta-dataset. This meta-dataset has been assembled by pooling smaller datasets from different resources. As it was reported in Table 2, each of these datasets contains a different number of recordings. This means that when we use PhysioNet meta-dataset to train heart abnormality prediction models, we can expect a bias in our models towards the characteristics of databases with the highest number of samples. That is to say, employing PhysioNet dataset for training heart disease prediction models may not necessarily lead to models with better generalisability than that of models trained with smaller datasets as the models can be biased towards a proportion of the PhyioNet meta-dataset. It is worth noting that PhysioNet meta-dataset has been used as a gold standard dataset in the majority of studies in the area of heart sound classification since 2016. The main goal of such studies is to build prediction models which can be used as a tool for initial screening of heart disease. However, according to the results of our experiments, the generalisability of the models built using this meta-dataset to unseen data in real-world settings seems implausible. Indeed, any model which is built using this meta-dataset must be evaluated using real-world data to validate that it can reproduce the reported performance. In addition to Exploring Composite Dataset Biases for Heart Sound Classification 11 this, we must also consider bias as an important factor when we want to create a heart abnormality prediction system using the PhysioNet meta-dataset. As we mentioned in section 2, in the majority of cases, building a heart abnormality prediction model involves four different steps: pre-processing, segmentation, fea- ture extraction, and classification. Our design decisions in each of these steps can determine the level of which the resultant model will be influenced by the dataset bias. 6 Conclusion and Future Work In this paper, we investigated the presence of potential bias in PhysioNet heart sound meta-dataset. We chose three sub-datasets of this dataset and labelled the recordings of each one based on the sensor used to capture them. Then we built an SVM model using four spectral features. The model was able to detect recordings of each of the PhysioNet sub-datasets with an accuracy of 98%, which is way above chance. This indicates that bias is undoubtedly present in PhysioNet dataset, the gold standard dataset in the field. We also showed that sensors are likely the main cause of this bias. Our findings necessitate further investigations into the impact of this bias issue in the PhysioNet dataset on the generalisability of the heart sound classification models. A comprehensive analysis of the different feature representations which are being used in the field of heart abnormality prediction in terms of their level of robustness to sensor bias can be a future direction. In addition to this, looking into the possibility of alleviating the bias through data preprocessing techniques can also be an interesting future work. Acknowledgements This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (D-REAL) under Grant No. 18/CRT/6224. References 1. 3M: Littmann electronic stethoscope model 3200, https://www.littmann.com/3M/en US/littmann-stethoscopes/, last accessed 2020/09/16. 2. Bentley, P., Nordehn, G., Coimbra, M., Mannor, S.: The pascal classifying heart sounds challenge (2011), http://www.peterjbentley.com/heartchallenge/index.html 3. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory. pp. 144–152 (1992) 4. Dominguez-Morales, J., Jimenez-Fernandez, A., Dominguez-Morales, M., Jimenez-Moreno, G.: Deep neural networks for the recognition and classification of heart murmurs using neuromorphic auditory sensors. IEEE Trans. Biomed. Circuits Syst 12, 24–34 (2018) 12 D. Shariat Panah, A. Hines, S. Mckeever 5. Dwivedi, A., Imtiaz, S., Rodriguez-Villegas, E.: Algorithms for automatic analysis and classification of heart sounds–a systematic review. IEEE Access 7, 8316–8345 (2019) 6. Ekohealth: Core digital stethoscope - electronic stethoscopes — eko, https://shop.ekohealth.com/products/core-digital-stethoscope, last accessed 2020/09/16. 7. Jabes: Jabes electronic stethoscope, https://www.allheart.com/jabes-electronic-stethoscope/p/jsjabes3/, last accessed 2020/09/16. 8. Jiang, D.N., Lu, L., Zhang, H.J., Tao, J.H.: Lian-hong cai: Music type classification by spectral contrast feature. In: Proceedings. IEEE International Conference on Multimedia and Expo. p. 113–116. IEEE, Lausanne, Switzerland (2002) 9. Liu, C., Springer, D., Li, Q., Moody, B., Juan, R., Chorro, F., Castells, F., Roig, J., Silva, I., Johnson, A., Syed, Z., Schmidt, S., Papadaniil, C., Hadjileontiadis, L., Naseri, H., Moukadem, A., Dieterlen, A., Brandt, C., Tang, H., Samieinasab, M., Samieinasab, M., Sameni, R., Mark, R.: Clifford, g.d.: An open access database for the evaluation of heart sound algorithms. Physiol. Meas 37, 2181–2213 (2016) 10. McGee, S.: Auscultation of the Heart: General Principles. Evidence-Based Physical Diagnosis. Elsevier Health Sciences, 4th edn. (2017) 11. Noman, F., Ting, C.M., Salleh, S.H., Ombao, H.: Short-segment heart sound classification using an ensemble of deep convolutional neural networks. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP. p. 1318–1322 (2019) 12. Nowak, L.J., Nowak, K.M.: Sound differences between electronic and acoustic stethoscopes. BioMedical Engineering OnLine 17(1), 104 (2018) 13. Potes, C., Parvaneh, S., Rahman, A., Conroy, B.: Ensemble of feature based and deep learning-based classifiers for detection of abnormal heart sounds. In: 2016 Computing in Cardiology Conference September. vol. 14 (2016) 14. Sharma, G., Umapathy, K., Krishnan, S.: Trends in audio signal feature extraction methods. Applied Acoustics 158, 107020 (2020) 15. Shaver: J.a.: Cardiac auscultation: A cost-effective diagnostic skill. Current Problems in Cardiology 20, 447–530 (1995) 16. Son, G.Y., Kwon, S., et al.: Classification of heart sound signal using multiple features. Applied Sciences 8(12), 2344 (2018) 17. Springer, D.B., Tarassenko, L., Clifford, G.D.: Logistic regression-hsmm-based heart sound segmentation. IEEE Transactions on Biomedical Engineering 63(4), 822–832 (2015) 18. Suresh, H., Guttag: J.v.: A framework for understanding unintended consequences of machine learning (2020), arXiv:1901.10002 [cs, stat]. 19. Tavel: M.e.: Cardiac auscultation: a glorious past–and it does have a future! Circulation 113, 1255–1259 (2006) 20. Tavel, M.E.: Cardiac auscultation: a glorious past—but does it have a future? Circulation 93(6), 1250–1253 (1996) 21. Tommasi, T., Patricia, N., Caputo, B., Tuytelaars: T.: A deeper look at dataset bias (2015), arXiv:1505.01257 [cs]. 22. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR 2011. pp. 1521–1528. IEEE (2011) 23. Yadav, A., Singh, A., Dutta, M.K., Travieso, C.M.: Machine learning-based classification of cardiac diseases from pcg recorded heart sounds. Neural Computing and Applications pp. 1–14 (2019)