Methods of Primary Processing Handwriting Samples at User Authentication Using a Probabilistic Neural Network

Anatolii Davydenko1[0000-0001-6466-1690], Olena Vysotska2[0000-0002-9543-1385] and Tetiana Shmelova2[0000-0002-9737-6906]

1 Pukhov Institute for Modeling in Energy Engineering of NAS of Ukraine, Kyiv, Ukraine
davidenkoan@gmail.com
2 National Aviation University, Kyiv, Ukraine
Lek_Vys@ukr.net

Abstract. This article analyzes dynamic biometric methods for authenticating users of automated systems. The feasibility of using them to organize access control and information privacy in automated systems is substantiated. Two authentication methods are selected for further analysis and use: keystroke pattern and handwriting. For each of these methods, a set of handwriting features is formed for analysis during user authentication. A probabilistic neural network is selected as the recognition mechanism, since this type of neural network is suitable for solving object recognition problems. A technology for primary processing of handwriting samples is then proposed, which increases the probability of correct user recognition. The stages of this process are considered and their expediency is justified. It is formulated which errors may occur in handwriting samples and why, which of them should be removed, and which should be corrected. For authentication by handwriting, a technology is proposed for selecting the most significant points whose characteristics should be analyzed during recognition. A quality criterion for the analyzed characteristics is also proposed. A number of experiments were performed using the developed software. Their results confirm the correctness of the proposed methods and the effectiveness of the proposed technology of primary processing of handwriting samples.
Keywords: multifactor authentication, identification, authentication, biometrics, keystroke pattern, handwriting, primary processing of handwriting samples.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CybHyg-2019: International Workshop on Cyber Hygiene, Kyiv, Ukraine, November 30, 2019

1 Introduction

In recent years, computer technology has become involved in practically all areas of activity. Information is stored and processed using either existing software or software developed specifically to implement the business processes of a particular organization. An important aspect is ensuring the confidentiality of information and implementing access control. User authentication is a critical component in organizing access to and privacy of information in the software used, so its development is an urgent scientific problem. There are various methods of authentication, including authentication using keys, digital signatures, passwords, biometrics, etc. [1-8]. Where higher requirements are placed on the level of information security, it makes sense to use multifactor authentication. This paper uses biometric authentication methods. Their advantages over other methods are the following: a person is the carrier of their own "biometric password", so there is no need to remember it and it cannot be forgotten or lost; such passwords have a high degree of uniqueness; and they are difficult to falsify.
Recognition methods based on the analysis of dynamic biometric characteristics, such as keystroke pattern and handwriting, have additional advantages: they do not require additional expensive equipment (for keystroke pattern); they increase the degree of multifactor authentication of information system users; and they can be used not only to authenticate users but also to monitor their work. This gives significant advantages to businesses that critically depend on their workers' level of attention during work. If monitoring detects a significant deviation of a user's handwriting characteristics from their average values for that user, it thereby detects either an anomalous state of the user, who for some reason (external factors, illness, etc.) is distracted from work, or an unauthorized change of user. All of this makes the use of biometric authentication methods, and especially dynamic biometric methods such as recognition of users by their keystroke pattern and handwriting, relevant and expedient.

Despite these advantages, biometric recognition methods based on the analysis of dynamic characteristics have some drawbacks. The main one is that a person's dynamic characteristics are somewhat unstable. In addition, some instability in handwriting characteristics is caused by the features of the devices used to transmit a handwriting sample to the computer. In this work, a probabilistic neural network is chosen to solve the problem of user recognition [2-4,9]. This kind of neural network [9-10] copes with pattern recognition well enough, but only if the instability of the analyzed characteristics in the sample is slight; more significant deviations need to be corrected or removed from the recognition process to ensure a high probability of correct recognition.
That is why this article is dedicated to the primary processing of keystroke pattern and handwriting samples. The task of this work was to develop methods for the primary processing of such samples that increase the probability of correct recognition of users.

2 Problem statement

This work improves the technology of authenticating users of automated systems by their keystroke pattern and handwriting. The operation of any biometric authentication system consists of the following two modes [2-4]:

1. Registration of users in the automated system, that is, the creation of the set of legal users of this system

$$USL = \bigcup_{t=1}^{l} \{USL_t\} = \{USL_1, USL_2, \ldots, USL_t, \ldots, USL_d, \ldots, USL_l\},$$

where $t = \overline{1, l}$; $l$ is the number of legal users; $USL_d$ is a legal user who appears as the authorized, authenticated party. In this mode, a database of training samples of the analyzed biometric characteristics of each user is accumulated. That is, a set of training samples is created in the training samples database:

$$O = \bigcup_{t=1}^{l} \bigcup_{j=1}^{z_t} \{O_{t,j}\}; \quad O_{t,j} = \bigcup_{i=1}^{ks\_k} \{u_{t,j,i}\},$$

so that the set $O$ takes the form

$$O = \bigcup_{t=1}^{l} \bigcup_{j=1}^{z_t} \bigcup_{i=1}^{ks\_k} \{u_{t,j,i}\},$$

where $j = \overline{1, z_t}$; $z_t$ is the number of training samples of the t-th user in the training samples database; $u_{t,j,i}$ (written $u_{ijt}$ in the formulas below) is the value of the i-th characteristic in the j-th training sample of the t-th user; $ks\_k$ is the number of characters in the key word whose input or writing dynamics are analyzed during authentication. In this case, samples of keystroke pattern and handwriting, respectively, are accumulated. The volume of this database depends on which biometric characteristics of the person are analyzed. For dynamic characteristics, it is desirable to accumulate at least several hundred samples for each user. It is also advisable to update this database periodically.

2. Authentication of registered users.
In this mode, the user undergoing the authentication procedure presents their biometric password, and the automated system, using the authentication mechanism (in this case a probabilistic neural network), authenticates the user by comparing the presented sample with the samples stored in the training samples database. As a result of the recognition, the neural network determines the name of the registered user to whom the entered biometric password most likely belongs. If this name matches the user name, the authentication is considered successful.

As stated earlier, dynamic biometric characteristics are somewhat unstable, which reduces the probability of correct user recognition (P). To correct this shortcoming, this work develops a technology of primary processing of handwriting samples that corrects some errors and removes others, thereby increasing the probability of correct recognition of users. The effectiveness of this technology is tested experimentally.

3 The solution to this problem

To solve this problem, databases of keystroke pattern and handwriting training samples were first accumulated. The list of features that make up a handwriting sample depends on the device used to transmit the sample to the computer. A standard keyboard is used to transmit a keystroke pattern sample. A graphic tablet, as one kind of touch-sensitive device, is used in this work to transmit a handwriting sample. When authenticating by keystroke pattern, the user is prompted to enter a password (key phrase), and the dynamics of its input are analyzed for recognition. When authenticating by handwriting, the user is asked to write the password word on the graphic tablet; the written word is then analyzed for correctness and for the dynamics of its writing. Users are given the condition that the characters of the key phrase must be written not joined together, but separately.
This condition greatly simplifies the recognition process and reduces the resources required by the biometric authentication system. In both cases a Ukrainian word, word combination or letter combination is used as the key phrase.

When recognizing by keystroke pattern, the following features can be analyzed for each character of the password word: the time interval between entering (pressing) two adjacent password characters; the time interval between releasing two adjacent characters; the time interval between pressing and releasing each password character. All of these features, or only some of them, can be analyzed at the same time. In addition, in this work each keystroke pattern sample stores the number of keys that were mistakenly pressed when the sample was created.

In handwriting recognition, the following features can be analyzed for each point of a password word character: the X coordinate; the Y coordinate; the point type; the pressure with which the user presses the pen (or a similar device) on the touch surface when creating the point; the angle of change of the writing direction at the point; the time elapsed from the beginning of the character to the given point; the speed of movement of the pen from the previous point to the given point. The following features can also be analyzed for the image of each password word character: the area of the character image; the number of points of the character that are analyzed during recognition; the angle of inclination of the character; the frequency with which the system records the points of the character; the number of repeated points in the character image (points created in succession with the same two coordinates) [2-4].

Some stages of primary processing are required for samples of both handwriting and keystroke pattern, but most of the steps are specific to one of these biometric characteristics.
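As an illustration of the three keystroke timing features listed above (press-to-press, release-to-release and press-to-release intervals), the following minimal sketch computes them from recorded key events. It is not the authors' software: the `KeyEvent` record, the function name `keystroke_features` and the use of integer millisecond timestamps are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KeyEvent:
    """One typed character (hypothetical record, not taken from the paper)."""
    char: str
    press_ms: int    # moment the key was pressed, in milliseconds
    release_ms: int  # moment the key was released, in milliseconds


def keystroke_features(events: List[KeyEvent]) -> Dict[str, List[int]]:
    """Compute the three timing features for one typed key phrase."""
    pairs = list(zip(events, events[1:]))  # pairs of adjacent characters
    return {
        # interval between pressing two adjacent characters
        "press_to_press": [b.press_ms - a.press_ms for a, b in pairs],
        # interval between releasing two adjacent characters
        "release_to_release": [b.release_ms - a.release_ms for a, b in pairs],
        # interval between pressing and releasing each character
        "press_to_release": [e.release_ms - e.press_ms for e in events],
    }


sample = [KeyEvent("a", 0, 100), KeyEvent("b", 250, 400), KeyEvent("c", 500, 600)]
features = keystroke_features(sample)
print(features)
```

A key phrase of n characters yields n - 1 values for each of the first two features and n values for the third, so every sample of the same key phrase has the same number of features.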
Next, let us look at the processing steps common to handwriting and keystroke pattern samples and at those specific to each [2-4].

First, it is necessary to choose the right password word, whose keyboard input or writing on the graphic tablet will be analyzed. It is advisable to choose as the key phrase a text that is often used in the field in which the organization using this biometric authentication system operates. If a person often types or writes specific words or phrases, they develop a characteristic handwriting for them, which makes sense to analyze for recognition of that person.

Second, not all typing or writing characteristics give the same quality of recognition for a particular group of users. That is, in each specific organization, for each group of people, it is necessary to choose the handwriting features that are most characteristic.

Third, the training sample database will always contain erroneous samples, and these must be processed. Some of these errors are specific to the individual user, so they should be left in the database, but if the deviation is significant, using such a handwriting sample for training will reduce the probability of correct recognition. For example, if a certain user pauses longer before typing a particular character than before typing other keys, or writes one character larger than the other letters, then these are characteristic features of that user's handwriting, not errors. But if a person is distracted or sick and, as a result, the typing speed drops significantly or the number of mistakenly pressed keys is too high, or if, when writing the password on the graphic tablet, the user wrote something superfluous besides the required text, then the sample is erroneous and should be removed from the database of training samples. In addition, there are errors associated with the specifics of the equipment used to transmit handwriting to the computer.
For example, if a user wrote the password in the upper left corner of the tablet today and writes it 2 inches lower tomorrow, then without some data correction the system would perceive these samples as the handwriting of different users or as different passwords.

To summarize all of the above, the primary processing of handwriting samples should include the following steps [2-4]:

1. Choosing the right password word (key phrase), the typing or writing of which is characteristic of a specific group of people in a particular organization.
2. Selecting the handwriting characteristics that should be analyzed when these users enter (write) this key phrase.
3. Deleting or correcting handwriting samples that contain erroneous data.

Next, let us take a closer look at the criteria for selecting the optimal key phrase and the features of its input (writing) that will be analyzed for recognition. We then consider which data are erroneous for each of these types of handwriting, which of these errors should be deleted and which should be corrected, and the algorithm for performing these actions.

To select the optimal key phrase and the characteristics of its input (writing) that will be analyzed for recognition, it is necessary first to accumulate a small database of training data (a trial database) for different variants of the password word. The best key phrase is then chosen, and a complete database of training samples of the required volume is accumulated for it. There are different techniques for determining which handwriting is best for recognition. In this paper, it is proposed to use the function $H_i$, which is calculated from the mathematical expectation $M_{it}$ (1) and the dispersion $S_{it}$ (2) of each analyzed trait:

$$M_{it} = \frac{\sum_{j=1}^{z_t} u_{ijt}}{z_t}; \tag{1}$$

$$S_{it} = \frac{\sum_{j=1}^{z_t} (u_{ijt} - M_{it})^2}{z_t - 1}. \tag{2}$$

Accordingly, the function $H_i$ is calculated by formula (3):

$$H_i = \frac{\sum_{r=1}^{l-1} \sum_{c=r+1}^{l} \dfrac{S_{ir} + S_{ic}}{|M_{ir} - M_{ic}|}}{komb}, \tag{3}$$

where $komb$ is the number of combinations of $l$ taken two at a time (the number of possible pairs of users for comparison), calculated by formula (4):

$$komb = \frac{l!}{2! \cdot (l-2)!} = \frac{l!}{2 \cdot (l-2)!}. \tag{4}$$

The main requirement for the analyzed traits is that $H_i < 1$, and the smaller the value of $H_i$, the better the recognition by that trait. After $H_i$ has been determined for all characteristics, it is necessary to select the key phrase that has the largest number of features with the best $H_i$ values and, for this phrase, to select the desired number of the best features. After that, a complete database of training samples is accumulated. Then, as stated earlier, erroneous samples must be removed from the accumulated database.

Keystroke pattern samples and handwriting samples have different errors. For keystroke pattern samples, an error is either entering a wrong password character (error of the first type) or too long a pause before performing a specific action, that is, pressing or releasing a key (error of the second type). To determine the percentage of mistakenly pressed keys, that is, the percentage of errors of the first type, formula (5) is used:

$$Och = \frac{100 \cdot ko}{kol + ko}, \tag{5}$$

where $ko$ is the number of presses of erroneous keys and $kol$ is the number of presses of correct keys. In addition, the probability of correctly recognizing all users is significantly influenced by the amplitude (spread) of the percentage of errors among users (6). The smaller the amplitude, the lower the recognition quality.

$$AOch = Och_{max} - Och_{min}, \tag{6}$$

where $Och_{max}$ and $Och_{min}$ are, respectively, the maximum and minimum error percentage among all users.
To determine samples that have too long a time interval before a certain action, that is, samples with an error of the second type, we must first calculate by formula (7) the arithmetic mean of the i-th characteristic of the t-th user over all training samples:

$$Sr_{it} = \frac{\sum_{j=1}^{z_t} u_{ijt}}{z_t}. \tag{7}$$

Then each trait in the training sample is checked against condition (8). If the condition is not met for at least one trait, the sample is recognized as a keystroke pattern sample with an error of the second type and is deleted from the database.

$$u_{ijt} - Sr_{it} \le k \cdot Sr_{it}, \tag{8}$$

where $k$ is a coefficient selected depending on the required recognition accuracy. The smaller this coefficient, the more accurate the accumulated training data will be, but the fewer samples will remain after selection.

For handwriting samples, errors are in most cases related to the specifics of using a graphic tablet. They can be divided into two groups: errors that need to be deleted and errors that need to be corrected. There are five types of errors that need to be deleted:

1. Type 1 error is a sequence of points with zero pressure (except for the first such point in each sequence). These errors occur when the user moves the pen a short distance above the work area of the graphic tablet without touching it.
2. Type 2 error is an accidental point (or a small number of points). It occurs if the user accidentally touches the work area of the graphic tablet with the pen.
3. Type 3 error is a repetition, that is, a succession of consecutive points in which the coordinate values on both axes (X and Y) have not changed (except when one of the points has zero pressure). It occurs when a data packet was sent to the computer because of a change not in a coordinate on one of the axes (X or Y) but in another parameter.
4. Type 4 error is an accidental large outlier (usually forming an acute angle) at the beginning of a line. Such errors are caused either by the inertia of the tablet or by the shaking of the user's hand.
5.
Type 5 error is a poor-quality sample that is rejected because the key phrase image cannot be split into the given number of character images. It occurs if the wrong key phrase is entered, if the user has little experience with the graphic tablet, or if the user has written some characters joined together rather than separately. With this error, the image of the written phrase cannot be divided into the given number of images of individual characters, so such samples are discarded.

Some of these types of errors are illustrated in Fig. 1.

Fig. 1. Examples of erroneous data that require deletion

There are also three types of errors that need to be corrected (Fig. 2):

1. Type 6 error is a different angle of the key phrase image, relative to the axes of the tablet work area, for different users. To correct such errors, a correction of type 1 is performed: character-by-character rotation of the character images to normalize their angle of inclination relative to the coordinate axes (Fig. 3).
2. Type 7 error is a different location of the key phrase image in the tablet work area for different users. To correct such errors, a correction of type 2 is performed: a character-by-character shift of the image of each character to the center of a workspace of the selected size (Fig. 4).
3. Type 8 error is a different size of the key phrase image in the work area of the graphic tablet for different users. To correct such errors, a correction of type 3 is performed: character-by-character proportional scaling (stretching / contraction) of the image of each character over the entire workspace of the selected size (Fig. 5).

Fig. 2. Sample image before correction
Fig. 3. The result of character rotation of the image
Fig. 4. The result of character shifting of the image
Fig. 5. Result of character-proportional scaling of the image

Removing and correcting these errors significantly increases the probability of correct recognition, which has been verified through experiments.

In addition to the primary processing already mentioned, this work includes another very important and appropriate stage. The characteristics of not all points of a password word character are analyzed, but only those of the most significant control points [3-4]. This step is necessary because the image of a single character typically consists of approximately 100 points, and several characteristics are analyzed for each point, so analyzing all of them would consume too many resources during recognition, which is not appropriate. In this paper, there are three types of control points (Fig. 6):

1. The starting and ending points of each line are the point where the pen touches the graphic tablet and the point where the pen is lifted from it. In Fig. 6 these points are indicated by squares (points 1 and 15). Stored data packets with zero pressure are used to determine such points.
2. The angular points of the lines are the points that lie on a bend of a line. In Fig. 6 these points are shown by circles (points 2,3,4,5,6,7,9,10,11,12,13). A bend of a line is a change of direction of the line, which can be determined from a change in the sign of the coordinate increment along one of the axes (or both).
3. The points of intersection are the points that lie at the intersection of lines. In Fig. 6 these points are shown by triangles (points 8 and 14).

Fig. 6. Example of arrangement of control points
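Two of the steps described above can be sketched in Python: the shift and proportional scaling of a character image (corrections of types 2 and 3), and the selection of the start, end and angular control points of one line. This is a simplified illustration, not the authors' implementation. The function names are invented, one character or line is assumed to be given as a list of (x, y) points, a single scale factor for both axes is an assumed interpretation of "proportional" scaling, and rotation (correction of type 1) and the search for intersection points are left out.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def shift_and_scale(points: Sequence[Point], size: float) -> List[Point]:
    """Fit one character image into a size x size workspace (corrections 2 and 3)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y
    extent = max(width, height)
    # One factor for both axes keeps the proportions of the character.
    factor = size / extent if extent else 1.0
    off_x = (size - width * factor) / 2   # center the image horizontally
    off_y = (size - height * factor) / 2  # center the image vertically
    return [((x - min_x) * factor + off_x, (y - min_y) * factor + off_y)
            for x, y in points]


def _sign(value: float) -> int:
    return (value > 0) - (value < 0)


def control_points(line: Sequence[Point]) -> List[int]:
    """Indices of the start point, the angular (bend) points and the end point."""
    if len(line) < 2:
        return list(range(len(line)))
    found = [0]            # starting point of the line
    last_sx = last_sy = 0  # last non-zero direction along X and along Y
    for k in range(1, len(line)):
        sx = _sign(line[k][0] - line[k - 1][0])
        sy = _sign(line[k][1] - line[k - 1][1])
        # A bend is a change of sign of the coordinate increment on either axis.
        bend_x = sx != 0 and last_sx != 0 and sx != last_sx
        bend_y = sy != 0 and last_sy != 0 and sy != last_sy
        if bend_x or bend_y:
            found.append(k - 1)  # the direction changed at the previous point
        last_sx = sx or last_sx
        last_sy = sy or last_sy
    found.append(len(line) - 1)  # ending point of the line
    return found


v_shape = [(0, 10), (1, 5), (2, 0), (3, 5), (4, 10)]
print(control_points(v_shape))
print(shift_and_scale([(10, 10), (20, 30)], size=100))
```

For the V-shaped line the sketch keeps only the two ends and the vertex, which shows how control points reduce the number of analyzed points compared with using every recorded point.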
The search for intersection points is complicated by the fact that the image points lie at some distance from each other (the distance between neighboring points can be more than 1 pixel). In addition, in order to use a probabilistic neural network as the recognition mechanism, all samples must have the same number of features. With correct placement of control points, approximately the same number of analyzed points is obtained for the images of different characters, in contrast to the case when the characteristics of all image points are analyzed.

4 Experimental results

To verify the correct operation of the proposed methods and the efficiency of the proposed technology of primary processing of samples, a number of experiments were conducted using the developed software. These experiments modeled the deployment of a biometric system for authenticating users of an automated system by handwriting and by keystroke pattern, respectively.

In the case of keystroke pattern, a group of 10 people were first asked to enter one of the words 200 times for each of the 3 proposed word sets. Then, using the technology described in the paper, the set of words ending in the letter combination "ізація" was selected for further use. After that, each user was asked to enter one of the words ending in the letter combination "ізація" 1,500 times. Based on the collected data, a series of experiments was conducted to determine how the most critical parameters and the proposed technology of primary processing of handwriting samples affect the probability of correct recognition. During the experiments, situations with 2 to 10 users working in the automated system were simulated. These experiments were conducted over about two weeks.
In the case of handwriting recognition, the experimental conditions were about the same as those for keystroke pattern, except that users did not write the entire word on the graphic tablet, but only the letter combination ("ізація" was selected). Users were also given the condition that the characters should be written not joined together but separately. In addition, for handwriting, the distribution of the number of control points in the images of different characters, formed by the technology proposed in this work, was determined for further analysis. The main results of the experiments are shown in the graphs (Fig. 7-11). Consider the conditions of the experiments in more detail.

1. In the first experiment, the hypothesis that the percentage of errors made by system users influences the probability of their correct recognition by keystroke pattern (P) was tested. This experiment was performed with 6 handwriting characteristics analyzed and 1500 training samples for each user. The results (Fig. 7) show that the smaller the percentage of errors (Och), the greater the probability of correct user recognition (P).
2. In the second experiment, the hypothesis that deleting training samples with gross errors from the training sample database affects the quality of the analyzed traits (Hi) was tested. This experiment was performed for authentication of users by their keystroke pattern. The results (Fig. 8) show that after the removal of training samples with gross errors the quality of the features improves significantly, i.e., the Hi criterion decreases and its value becomes almost the same for all traits.
3. In the third experiment, the hypothesis that deleting training samples with gross errors from the training sample database affects the probability of correct recognition by keystroke pattern (P) was tested.
This experiment was performed with 6 handwriting characteristics analyzed and 100 training samples for each user. The results (Fig. 9) show that after removing the training samples with gross errors the probability of correct recognition (P) increases significantly, and that the smaller the allowed deviation (k) of at least one of the traits from its mean, the greater the probability of correct recognition (P) that is achieved.
4. The fourth experiment determined the distribution of the number of control points in the images of different characters for authentication by handwriting. The results (Fig. 10) show that when control points are placed according to the technology described in this work, approximately the same number of analyzed points is obtained for the images of different characters, which makes it possible to use a probabilistic neural network as the recognition mechanism and reduces the resources involved.
5. In the fifth experiment, the hypothesis that performing data correction (correction of errors of types 6-8) when authenticating users by their handwriting affects the probability of correct recognition of a character and of the password word was tested. The results (Fig. 11) show that the probability of correct recognition increases significantly when the specified data correction is performed.

5 Conclusions

The article is devoted to the analysis of methods of biometric authentication by dynamic characteristics. Critical features and authentication methods have been selected for identifying users by keystroke pattern and handwriting. Primary processing of handwriting samples is proposed to reduce the neural network training time and increase the probability of correct user recognition.

References

1.
Kosheva N.A., Maznychenko N.I.: Identification of users of information and computer systems: analysis and forecasting of approaches. Information Processing Systems, № 6 (113), pp. 215-223 (2013).
2. Vysotska O., Davydenko A.: Keystroke Pattern Authentication of Computer Systems Users as One of the Steps of Multifactor Authentication. Advances in Computer Science for Engineering and Education II. ICCSEEA 2019. Advances in Intelligent Systems and Computing, vol. 938, pp. 356-368 (2019), DOI: https://doi.org/10.1007/978-3-030-16621-2_33, last accessed: 01.11.2019.
3. Korchenko O., Davydenko A., Vysotskaya O.: Method of authentication of information systems users by their handwriting with multi-step correction of primary data. Information security, vol. 21, №1, pp. 40-51 (2019), DOI: https://doi.org/10.18372/2410-7840.21.13546, last accessed: 01.11.2019.
4. Vysotska O.O., Davydenko A.M.: Analysis of data pre-processing technology for authentication of users of computer systems by keyword pattern and handwriting. Modeling and information technologies. Collection of scientific works, vol. 55, pp. 34-41 (2010).
5. Ilyenko A.V.: Modern ways of improving the procedure for the formation and verification of digital signature. Science-Based Technologies, 1(37), pp. 61-66 (2018), https://doi.org/10.18372/2310-5461.37.12370
6. Kazmirchuk S., Ilyenko A., Ilyenko S.: Digital signature authentication scheme with message recovery based on the use of elliptic curves. In: Hu Z., Petoukhov S., Dychka I., He M. (eds) Advances in Computer Science for Engineering and Education II. ICCSEEA 2019. Advances in Intelligent Systems and Computing, vol. 938, pp. 279-288 (2019), https://doi.org/10.1007/978-3-030-16621-2_26, last accessed: 01.11.2019.
7. Gattal Abdeljalil, Chibani Youcef: Segmentation and Recognition Strategy of Handwritten Connected Digits Based on the Oriented Sliding Window. 2012 International Conference on Frontiers in Handwriting Recognition, pp. 297-301 (2012).
8. Furukawa Takeshi: The New Method of Identification of Handwriting Using Volumes of Indentations. 2012 International Conference on Frontiers in Handwriting Recognition, pp. 163-168 (2012).
9. Kallan R.: Basic concepts of neural networks. Translated from English. Publishing house "Williams" (2001).
10. Dychka I., Chernyshev D., Tereikovskyi I., Tereikovska L., Pogorelov V.: Malware Detection Using Artificial Neural Networks. Advances in Computer Science for Engineering and Education II. Advances in Intelligent Systems and Computing, vol. 938, pp. 13-22 (2019).