Optimizing Factors Influencing on Accuracy of Biometrical Cardiometry

Marat R. Bogdanov1,2, Aleksander A. Dumchikov2, Vadim M. Kartak1,2, and Aigul I. Fabarisova2

1 Ufa State Aviation Technical University, Ufa, Russia
redfoxufa@gmail.com
2 Bashkir State Pedagogical University named after M. Akmullah, Ufa, Russia
kvmail@mail.ru

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: S. Belim et al. (eds.): OPTA-SCL 2018, Omsk, Russia, published at http://ceur-ws.org

Abstract. This paper discusses several aspects of biometric person identification based on electrocardiograms. It considers the signal preprocessing routine, classification with the support vector machines algorithm, and tuning of the classification hyperparameters.

Keywords: Biometric person identification · Electrocardiogram · Machine learning · Support vector machines · Hyperparameter tuning

1 Introduction

Biometric methods of person identification are becoming more popular. Fingerprint, face, voice and retina recognition are widely used in security systems. Over time, however, vulnerabilities of these traditional methods of biometric identification have been revealed. Researchers are therefore increasingly turning their attention to biometric features such as electrocardiograms, electroencephalograms and DNA [1]. In this paper, we discuss some practical aspects of person identification using the electrocardiogram (ECG).

2 Motivation and Aim

Biometric person identification is a classification problem. To solve it, we consider algorithms from some finite set and choose the one that gives the smallest prediction error [3].

Let us introduce some notation. Let $X$ be a space of objects and $Y$ a set of answers.
\[ X^l = (x_i, y_i)_{i=1}^{l} \tag{1} \]
is a training set, where $l$ is the sample size.
\[ y_i = y^*(x_i) \tag{2} \]
are the known answers on the training objects, $y^*$ being the target dependence to be approximated.
\[ A_t = \{a : X \to Y\} \tag{3} \]
are models of algorithms, where $t \in T$ and $T$ indexes the models under consideration.
\[ \mu_t : (X \times Y)^l \to A_t \tag{4} \]
are learning methods.

It is required to find a method $\mu_t$ with the best generalizing power. When searching for a method $\mu_t$, we often have to solve the following subtasks:
– Choice of the best model $A_t$ (model selection).
– Choice of the learning method $\mu_t$ for a given model $A_t$ (in particular, optimization of hyperparameters).
– Feature selection:
\[ F = \{f_j : X \to D_j : j = 1, \dots, n\} \tag{5} \]
is a set of features. The learning method $\mu_J$ uses only the features $J \subseteq F$.

The quality of learning from precedents is assessed as follows. $L(a, x)$ is the cost function of the algorithm $a$ on the object $x$.
\[ Q(a, X^l) = \frac{1}{l} \sum_{i=1}^{l} L(a, x_i) \tag{6} \]
is the quality functional of the algorithm $a$ on the sample $X^l$.

In this case we consider an internal quality criterion, measured on the training set $X^l$:
\[ Q_\mu(X^l) = Q(\mu(X^l), X^l) \tag{7} \]
and an external criterion, which evaluates the quality of learning on a hold-out set $X^k$ [2]:
\[ Q_\mu(X^l, X^k) = Q(\mu(X^l), X^k). \tag{8} \]

In this paper we consider the following aspects of biometric person identification: feature selection, model selection, choice of learning method (tuning of hyperparameters), and assessment of the quality of learning.

3 Feature Selection

We used the MGH/MF Waveform Database hosted at the physionet.org resource [8], [2]. The Massachusetts General Hospital/Marquette Foundation (MGH/MF) Waveform Database is a comprehensive collection of electronic recordings of hemodynamic and electrocardiographic waveforms of stable and unstable patients in critical care units, operating rooms, and cardiac catheterization laboratories. It is the result of a collaboration between physicians, biomedical engineers and nurses at the Massachusetts General Hospital. The database consists of recordings from 250 patients and represents a broad spectrum of physiologic and pathophysiologic states.
Individual recordings vary in length from 12 to 86 minutes, and in most cases are about an hour long [8], [2]. The typical recording includes three ECG leads, arterial pressure, pulmonary arterial pressure, central venous pressure, respiratory impedance, and airway CO2 waveforms. The raw sampling rate of 1440 samples per second per signal was reduced by a factor of two to yield an effective rate of 360 samples per second per signal relative to real time [8], [2].

At the preprocessing stage we used the biosppy Python library. The package enables the development of pattern recognition and machine learning workflows for the analysis of biosignals, including ECG [5]. Using biosppy, we extracted the first lead from each electrocardiogram and applied a low-pass filter to reduce redundancy. After low-pass filtering, R-peaks were extracted from the ECG signal using the wfdb Python library by Chen Xie and Julien Dubiel [6]. This software allows peaks and QRS complexes to be extracted from electrocardiograms.

We chose the amplitude and temporal features of the Q, R and S peaks ($Q_x$, $Q_y$, $R_x$, $R_y$, $S_x$, $S_y$), which gives 6 features in total. The feature table, together with the class label vector, was randomly split into a training set and a testing set in the ratio 75:25 for subsequent cross-validation. We trained a classifier on the training set and measured classification accuracy on the testing set.

4 Model Selection

We used the Support Vector Machines (SVM) algorithm for classification. Support Vector Machines are based on the concept of decision planes that define decision boundaries. A decision plane is one that separates sets of objects having different class memberships. SVM is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables [4].
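The hold-out evaluation described in Section 3, with an SVM as the classifier, might be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in the paper: the feature table is synthetic (random Gaussian clusters standing in for the six Q, R, S features), and the helper `make_synthetic_features` together with its parameters is hypothetical. The calls `train_test_split`, `SVC` and `accuracy_score` come from scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def make_synthetic_features(n_subjects=4, beats_per_subject=50, seed=0):
    """Return a fake (n_samples, 6) feature table and a class label vector.

    Hypothetical stand-in for the real (Qx, Qy, Rx, Ry, Sx, Sy) table:
    every subject is a Gaussian cluster around its own random centre.
    """
    rng = np.random.default_rng(seed)
    centres = rng.normal(0.0, 5.0, size=(n_subjects, 6))
    X = np.vstack(
        [c + rng.normal(0.0, 1.0, size=(beats_per_subject, 6)) for c in centres]
    )
    y = np.repeat(np.arange(n_subjects), beats_per_subject)
    return X, y


X, y = make_synthetic_features()

# Random 75:25 split into a training set and a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# C-SVM classifier with an RBF kernel (illustrative settings).
clf = SVC(C=1.0, kernel="rbf", gamma="auto")
clf.fit(X_train, y_train)

# Accuracy is measured on the held-out testing set only.
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"hold-out accuracy: {test_accuracy:.2f}")
```

Fixing `random_state` makes the random split reproducible, which helps when comparing different classifier settings on the same partition.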
To construct an optimal hyperplane, SVM employs an iterative training algorithm that minimizes an error function. According to the form of the error function, SVM models can be classified into four distinct groups:
– Classification SVM Type 1 (also known as C-SVM classification)
– Classification SVM Type 2 (also known as nu-SVM classification)
– Regression SVM Type 1 (also known as epsilon-SVM regression)
– Regression SVM Type 2 (also known as nu-SVM regression)

We used the Classification SVM Type 1 (C-SVM classification) model.

5 Classification SVM Type 1

For this type of SVM, training involves the minimization of the error function
\[ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \tag{9} \]
subject to the constraints
\[ y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N, \tag{10} \]
where $C$ is the capacity constant, $w$ is the vector of coefficients, $b$ is a constant, and $\xi_i$ are parameters for handling nonseparable data (inputs). The index $i$ labels the $N$ training cases. Note that $y_i \in \{-1, +1\}$ represents the class labels and $x_i$ represents the independent variables. The mapping $\phi$ associated with the kernel transforms data from the input (independent) space to the feature space. It should be noted that the larger $C$ is, the more the error is penalized. Thus, $C$ should be chosen with care to avoid overfitting.

6 Kernel Functions

\[ K(X_i, X_j) = \begin{cases} X_i \cdot X_j & \text{Linear} \\ (\gamma X_i \cdot X_j + C)^d & \text{Polynomial} \\ \exp(-\gamma |X_i - X_j|^2) & \text{RBF} \\ \tanh(\gamma X_i \cdot X_j + C) & \text{Sigmoid} \end{cases} \tag{11} \]
where $K(X_i, X_j) = \phi(X_i) \cdot \phi(X_j)$; that is, the kernel function represents a dot product of input data points mapped into the higher-dimensional feature space by the transformation $\phi$. Note that $C$ in (11) is the additive constant of the polynomial and sigmoid kernels rather than the capacity constant of (9), and $d$ is the degree of the polynomial.

7 Gamma is an Adjustable Parameter of Certain Kernel Functions

The RBF is by far the most popular choice of kernel type used in Support Vector Machines. This is mainly because of its localized and finite response across the entire range of the real x-axis [7].
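To make Eq. (11) and the role of gamma concrete, the four kernels can be written as a minimal plain-Python sketch. The function names, the sample vectors `xi` and `xj`, and the parameter name `coef0` (used here for the additive constant written C in Eq. (11), to keep it apart from the capacity constant) are illustrative assumptions, not part of the original work.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma, coef0, degree):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma):
    # Squared Euclidean distance between the two points.
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma, coef0):
    return math.tanh(gamma * dot(u, v) + coef0)


xi = [1.0, 2.0]
xj = [2.0, 0.0]

# A larger gamma makes the RBF response more localized:
# the kernel value for the same pair of points decays faster.
for gamma in (1e-3, 1e-1, 1.0):
    print(gamma, rbf_kernel(xi, xj, gamma))
```

For a fixed pair of points the RBF value lies in (0, 1] and shrinks as gamma grows, so small gamma values such as 1e-3 treat distant samples as still similar, while large values restrict the influence of each sample to its close neighbourhood.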
The support vector machine classifier provided by the sklearn Python library uses the following default hyperparameters: C=1.0, kernel='rbf', gamma='auto'. With these default parameters, classification of the electrocardiograms gave an accuracy score of 0.93. We tuned the classification hyperparameters with the Grid Search procedure, varying C over [1, 10, 100, 1000], kernel over ['linear', 'rbf'], and gamma over [1e-3, 1e-4]. Tuning gave the following best parameter set: 'kernel': 'rbf', 'C': 10, 'gamma': 0.001. With these parameters the accuracy score was 0.99.

8 Results and Discussion

During preprocessing of the electrocardiograms we extracted the first lead of each signal and applied a low-pass filter to reduce redundancy. We then extracted cardiac cycles from the leads and the Q, R and S peaks from the cardiac cycles. Using the amplitude and temporal features of the peaks, we composed a feature table containing 6 features ($Q_x$, $Q_y$, $R_x$, $R_y$, $S_x$, $S_y$) and a class label vector $y$. We then randomly split the feature table and the class label vector into a training set and a testing set in the ratio 75:25 for subsequent cross-validation. The training set was used to train a classifier, and the testing set was used to assess the quality of learning. The SVM classifier provided by the sklearn Python library showed an accuracy score of 0.93 with default options. The best hyperparameters found were 'kernel': 'rbf', 'C': 10, 'gamma': 0.001. With these parameters the accuracy score improved to 0.99.

References

1. Abdulmonam, O., Ahlal, H., Fawzia, E.: Vulnerabilities of biometric authentication. Threats and countermeasures. IJICT 4(11), 947-958 (2014)
2. Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P., Mietus, J., Moody, G., Peng, C., Stanley, H.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), 1-6 (2000)
3. Machine Learning course by K.V. Voroncov: http://www.MachineLearning.ru/wiki [On-line; accessed 12-April-2011]
4. Support Vector Machines (SVM) Introductory Overview: http://www.statsoft.com/Textbook/Support-Vector-Machines [On-line; accessed 26-May-2012]
5. The biosppy Toolbox: http://biosppy.readthedocs.io/en/stable/ [On-line; accessed 24-March-2016]
6. The WFDB Python Toolbox: https://pypi.python.org/pypi/wfdb [On-line; accessed 11-July-2015]
7. Ting-Fan, W., Chih-Jen, L., Weng, R., Singer, Y. (eds.): Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975-1005 (2004)
8. Welch, J., Ford, P., Teplick, R., Rubsamen, R.: The Massachusetts General Hospital-Marquette Foundation hemodynamic and electrocardiographic database: comprehensive collection of critical care waveforms. J Clinical Monitoring 7(1), 96-97 (1991)