=Paper=
{{Paper
|id=Vol-2803/paper8
|storemode=property
|title=Method of user authentication on the basis of recognition of
computer handwriting peculiarities (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2803/paper8.pdf
|volume=Vol-2803
|authors=Leonid S. Kryzhevich
}}
==Method of user authentication on the basis of recognition of computer handwriting peculiarities (short paper)==
Leonid S. Kryzhevich

Kursk State University, 33 Radisheva str., Kursk, 305000, Russian Federation

Abstract: This article examines the following hypothesis: each person has unique peculiarities of text typing. The process of typing can be expressed as a set of metrics and analyzed with statistical methods.

Keywords: normal distribution, de Moivre–Laplace integral theorem, Pearson's nonparametric test <math>\chi^2</math>

Models and Methods for Researching Information Systems in Transport, Dec. 11-12, 2020, St. Petersburg, Russia. EMAIL: Leonid@programist.ru (L.S. Kryzhevich); ORCID: 0000-0002-6736-498X (L.S. Kryzhevich). © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

===1. Introduction===

Nowadays people keep almost all kinds of data in digital form, in databases or cloud storage services that can be accessed online: important documents, contracts, banking data, passwords. If such data are stolen, people can lose personal or business information, and their bank accounts can be emptied. The number of attackers who want to steal information is therefore increasing.

There are different ways to protect information, but they constantly become outdated. To detect an intruder, it is necessary to establish whether the person really has system access rights. This has led to the idea of authenticating users by their digital handwriting. Each person has unique peculiarities of text typing. People type texts at a definite speed, and the duration of keystrokes can vary as well. We decided to measure these characteristics and analyze them.

===2. Conditions of the experiment===

An experiment was carried out to obtain test results. About one hundred students of the faculty of mathematics, physics and information science of Kursk State University participated in the experiment [1]. Their task was to type a text of at least four sentences. Meanwhile, a special program recorded the following characteristics for each symbol:
* the time of the key event, counted from the moment the program was started (in milliseconds);
* the ASCII code of the key;
* whether the key was pressed (1) or released (0).

Figure 1 shows the file that contains the statistical data for further analysis.

Figure 1: data file

The purpose of the experiment is to determine the individual features of one typing session in order to find out how it differs from the test patterns of other users.

===3. Data analysis===

Let us examine the statistics of the first feature noted, the duration of a keystroke. If we take the consecutive measurements for the same symbol in pairs (when it was pressed and when it was released) and subtract the press time from the release time, we obtain the press duration for each symbol. Let us depict the durations for all the symbols in a two-dimensional chart. The horizontal axis shows the duration of a keystroke in milliseconds, and the vertical axis shows the frequency of a keystroke (the ratio of the number of keystrokes of a given duration to the total number of keystrokes). If the data are sorted by press duration, the chart looks as follows (Figure 2).

Figure 2: the time/frequency bar chart for the first typing session of a test person

====3.1. Checking for normal distribution====

Let us assume that this distribution is normal. To check this, we analyze the data with Pearson's nonparametric test <math>\chi^2</math>.

Let us divide the series into fourteen disjoint intervals and count the number of test values that fall into each interval. Each interval must contain at least five key press results [2]. Following this rule, we average the values within each interval by the arithmetic mean and build a new chart (Figure 3).

Figure 3: the averaged time/frequency bar chart for the first typing session of a test person

To find out whether the distribution is normal, we use Pearson's test <math>\chi^2</math> [3]. The following indices are needed:
* <math>x_i</math>: the abscissa, i.e. time;
* <math>f_i</math>: frequency;
* <math>x_i \cdot f_i</math>: the product used to calculate the weighted arithmetic mean;
* <math>S</math>: cumulative frequency, calculated by adding each previous frequency to the following one;
* <math>|x_i - \bar{x}| \cdot f_i</math>: the absolute difference between the current <math>x_i</math> and the weighted arithmetic mean, multiplied by the current frequency;
* <math>(x_i - \bar{x})^2 \cdot f_i</math>: the difference between the current <math>x_i</math> and the weighted arithmetic mean, raised to the second power and multiplied by the current frequency;
* <math>p_i = f_i / f</math>: the relative frequency, i.e. the ratio of the frequency to the total sum <math>f = \sum f_i</math>.

These values are necessary for further calculations; they are collected in Table 1. The weighted arithmetic mean is

<math>\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{47656}{479} = 99.49</math>

{| class="wikitable"
|+ Table 1: The calculation table for empirical frequencies of the first typing session
! <math>x_i</math> !! Number, <math>f_i</math> !! Relative frequency, <math>p_i = f_i/f</math> !! <math>x_i \cdot f_i</math> !! Cumulative relative frequency !! <math>\lvert x_i - \bar{x} \rvert \cdot f_i</math> !! <math>(x_i - \bar{x})^2 \cdot f_i</math> !! Cumulative frequency, <math>S</math>
|-
| 48 || 10 || 0.0209 || 480 || 0.0209 || 514.906 || 26512.824 || 10
|-
| 56 || 14 || 0.0292 || 784 || 0.0501 || 608.868 || 26480.059 || 24
|-
| 64 || 30 || 0.0626 || 1920 || 0.1127 || 1064.718 || 37787.492 || 54
|-
| 72 || 35 || 0.0731 || 2520 || 0.1858 || 962.171 || 26450.669 || 89
|-
| 80 || 51 || 0.106 || 4080 || 0.2918 || 994.021 || 19374.069 || 140
|-
| 88 || 52 || 0.109 || 4576 || 0.4008 || 597.511 || 6865.769 || 192
|-
| 96 || 59 || 0.123 || 5664 || 0.5238 || 205.946 || 718.875 || 251
|-
| 104 || 50 || 0.104 || 5200 || 0.6278 || 225.47 || 1016.732 || 301
|-
| 112 || 54 || 0.113 || 6048 || 0.7408 || 675.507 || 8450.187 || 355
|-
| 120 || 37 || 0.0772 || 4440 || 0.818 || 758.848 || 15563.505 || 392
|-
| 128 || 30 || 0.0626 || 3840 || 0.8806 || 855.282 || 24383.567 || 422
|-
| 136 || 27 || 0.0564 || 3672 || 0.937 || 985.754 || 35989.269 || 449
|-
| 144 || 16 || 0.0334 || 2304 || 0.9704 || 712.15 || 31697.379 || 465
|-
| 152 || 14 || 0.0292 || 2128 || 0.9996 || 735.132 || 38601.311 || 479
|-
! Total !! 479 !! 1 !! 47656 !! !! 9896.284 !! 299891.708 !!
|}

Let us find the mode of the distribution. The mode is the most frequent value among the examined indices. In our case the mode is <math>x_i = 96</math> (its frequency is 59). The median is also <math>x_i = 96</math>, because it is the first value at which the cumulative frequency exceeds 479/2 ≈ 240.

In symmetrical distribution series the mode and the median coincide with the average value (<math>\bar{x} = Me = Mo</math>), and in moderately asymmetrical series they are related as follows: <math>3(\bar{x} - Me) \approx \bar{x} - Mo</math>.

The range of deviation, which is the difference between the maximum and minimum values of <math>x</math>, is <math>R = 152 - 48 = 104</math>.

The mean deviation is

<math>d = \frac{\sum |x_i - \bar{x}| f_i}{\sum f_i} = \frac{9896.284}{479} = 20.66</math>

The dispersion measures the scatter of all the values in the series around the average value:

<math>D = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i} = \frac{299891.708}{479} = 626.079</math>

The mean square deviation is

<math>\sigma = \sqrt{D} = \sqrt{626.079} = 25.022</math>

Let us check the assumption that <math>X</math> is normally distributed with Pearson's chi-squared test

<math>K = \sum \frac{(n_i - n_i^*)^2}{n_i^*}</math>

where <math>n_i</math> are the observed frequencies and <math>n_i^*</math> are the theoretical frequencies, calculated according to the formula

<math>n_i^* = \frac{n h}{\sigma} \varphi_i</math>

The following values are used in the formula: <math>n = 479</math>, <math>h = 8</math> (the interval width), <math>\sigma = 25.022</math>, <math>\bar{x} = 99.49</math>, and <math>\varphi_i</math> is the corresponding value from Laplace's table for <math>u_i = (x_i - \bar{x})/\sigma</math>.

The theoretical frequencies are calculated in Table 2. Now we should compare the empirical and theoretical frequencies.
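The descriptive statistics above can be reproduced with a short script. The following is a minimal Python sketch under stated assumptions, not the program used in the experiment: `hold_durations` and `weighted_stats` are hypothetical helper names, the event tuples `(time_ms, key_code, state)` are an assumed in-memory form of the data file described in Section 2, and `sample_events` is made up for illustration. The grouped values `xs` and `fs` are the first two columns of Table 1.

```python
import math


def hold_durations(events):
    """Pair press (state 1) and release (state 0) events of the same key.

    events: iterable of (time_ms, key_code, state) in chronological order
    (an assumed layout). Returns a list of (key_code, duration_ms).
    """
    pressed_at = {}
    durations = []
    for time_ms, key_code, state in events:
        if state == 1:
            # Keep the first press time; repeated press events are ignored.
            pressed_at.setdefault(key_code, time_ms)
        elif key_code in pressed_at:
            durations.append((key_code, time_ms - pressed_at.pop(key_code)))
    return durations


def weighted_stats(xs, fs):
    """Weighted mean, mean deviation, dispersion, sigma, mode and median."""
    n = sum(fs)
    mean = sum(x * f for x, f in zip(xs, fs)) / n
    mean_dev = sum(abs(x - mean) * f for x, f in zip(xs, fs)) / n
    dispersion = sum((x - mean) ** 2 * f for x, f in zip(xs, fs)) / n
    mode = xs[fs.index(max(fs))]
    median = None
    cumulative = 0
    for x, f in zip(xs, fs):
        cumulative += f
        if cumulative > n / 2:  # first value whose cumulative frequency exceeds n/2
            median = x
            break
    return {
        "n": n,
        "mean": mean,
        "mean_dev": mean_dev,
        "dispersion": dispersion,
        "sigma": math.sqrt(dispersion),
        "mode": mode,
        "median": median,
    }


# Made-up events for illustration only: three keys, two of them overlapping.
sample_events = [(0, 72, 1), (96, 72, 0), (150, 105, 1),
                 (200, 33, 1), (254, 105, 0), (288, 33, 0)]
durations = hold_durations(sample_events)

# Grouped data of the first typing session (Table 1).
xs = [48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152]
fs = [10, 14, 30, 35, 51, 52, 59, 50, 54, 37, 30, 27, 16, 14]
stats = weighted_stats(xs, fs)

print(durations)
print({name: round(value, 3) for name, value in stats.items()})
```

With the Table 1 data the sketch gives a mean of about 99.49, a mean deviation of about 20.66, a dispersion of about 626.08 and a mean square deviation of about 25.02, in agreement with the values above.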
{| class="wikitable"
|+ Table 2: The calculation table for theoretical frequencies of the first typing session
! <math>i</math> !! <math>x_i</math> !! <math>u_i</math> !! <math>\varphi_i</math> !! <math>n_i^*</math>
|-
| 1 || 48 || -2.0578 || 0.0478 || 7.32
|-
| 2 || 56 || -1.7381 || 0.0878 || 13.446
|-
| 3 || 64 || -1.4184 || 0.1456 || 22.298
|-
| 4 || 72 || -1.0987 || 0.2179 || 33.371
|-
| 5 || 80 || -0.779 || 0.2943 || 45.071
|-
| 6 || 88 || -0.4592 || 0.3589 || 54.965
|-
| 7 || 96 || -0.1395 || 0.3951 || 60.509
|-
| 8 || 104 || 0.1802 || 0.3918 || 60.003
|-
| 9 || 112 || 0.4999 || 0.3521 || 53.923
|-
| 10 || 120 || 0.8197 || 0.285 || 43.647
|-
| 11 || 128 || 1.1394 || 0.2083 || 31.901
|-
| 12 || 136 || 1.4591 || 0.1374 || 21.043
|-
| 13 || 144 || 1.7788 || 0.0818 || 12.527
|-
| 14 || 152 || 2.0986 || 0.044 || 6.739
|}

Table 3: The calculation table for comparison of theoretical and empirical frequencies of the first typing session

Table 3 is used to find the observed value of Pearson's test,

<math>\chi^2 = \sum \frac{(n_i - n_i^*)^2}{n_i^*}</math>

It includes the following indices: <math>i</math>, the sequence number; <math>n_i</math>, the observed frequencies; <math>n_i^*</math>, the theoretical frequencies; <math>(n_i - n_i^*)</math>, the difference between the observed and theoretical frequencies; <math>(n_i - n_i^*)^2 / n_i^*</math>, the difference raised to the second power and divided by the theoretical frequency.

The more the observed value <math>K_{emp}</math> exceeds <math>K_{cr}</math>, the stronger the evidence against our main hypothesis [3]. The bound <math>K_{cr} = \chi^2(k - r - 1; \alpha)</math> is taken from the <math>\chi^2</math> distribution tables, where <math>k = 14</math> is the number of intervals, <math>r = 2</math> is the number of parameters determined from the series (<math>\bar{x}</math> and <math>\sigma</math>), and the significance level is <math>\alpha = 0.05</math>:

<math>K_{cr}(0.05; 11) = 19.67514</math>; <math>K_{emp} = 17.99</math>.

The observed value of Pearson's statistic does not fall into the critical region (<math>K_{emp} < K_{cr}</math>), so the hypothesis of a normal distribution is not rejected.

The second typing session of the same person is analyzed in the same way. For it, the mode is 96.
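The Pearson check for the first typing session can be sketched in a few lines of Python. This is an illustrative sketch rather than the author's program: the function names are hypothetical, the density φ is computed directly instead of being read from Laplace's table (so the computed theoretical frequencies differ slightly from the tabulated ones), and the critical value 19.67514 is simply the constant quoted above. The observed frequencies come from Table 1 and the tabulated theoretical frequencies from Table 2.

```python
import math


def normal_density(u):
    """Standard normal density phi(u)."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)


def theoretical_frequencies(xs, n, h, mean, sigma):
    """n*_i = n * h / sigma * phi((x_i - mean) / sigma) for every interval."""
    return [n * h / sigma * normal_density((x - mean) / sigma) for x in xs]


def pearson_statistic(observed, expected):
    """Sum of (n_i - n*_i)^2 / n*_i over all intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


xs = [48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152]
observed = [10, 14, 30, 35, 51, 52, 59, 50, 54, 37, 30, 27, 16, 14]   # Table 1
table2_expected = [7.32, 13.446, 22.298, 33.371, 45.071, 54.965, 60.509,
                   60.003, 53.923, 43.647, 31.901, 21.043, 12.527, 6.739]

K_CRITICAL = 19.67514  # chi-squared bound for 11 degrees of freedom, alpha = 0.05

# Statistic based on the tabulated theoretical frequencies of Table 2.
k_emp = pearson_statistic(observed, table2_expected)

# The same statistic with phi evaluated directly (no table rounding).
computed_expected = theoretical_frequencies(xs, n=479, h=8, mean=99.49, sigma=25.022)
k_emp_direct = pearson_statistic(observed, computed_expected)

print(f"K_emp (Table 2 values) = {k_emp:.2f}")
print(f"K_emp (direct phi)     = {k_emp_direct:.2f}")
print("normality not rejected:", k_emp < K_CRITICAL)
```

With the Table 2 values the statistic comes out at about 17.99, matching the figure quoted above. If SciPy is available, the critical value could instead be obtained from `scipy.stats.chi2.ppf(0.95, 11)`; the sketch avoids that dependency.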
Figure 4: the averaged time/frequency bar chart for the second typing session of a test person

Half of the sum of the cumulative frequency is 216. It is reached at <math>x_i = 96</math>. Thus, the median is 96.

The range of deviation is 152 - 56 = 96.

Let us create Table 4 for the second distribution in the way described above.

The mean deviation is

<math>d = \frac{\sum |x_i - \bar{x}| f_i}{\sum f_i} = \frac{7168.921}{430} = 16.67</math>

That is, on average each value of the series deviates from the mean by 16.67.

Let us calculate the dispersion:

<math>D = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i} = \frac{175228.847}{430} = 407.509</math>

The mean square deviation is

<math>\sigma = \sqrt{D} = \sqrt{407.509} = 20.187</math>

We can check the assumption that <math>X</math> is normally distributed with Pearson's chi-squared test [3]. We calculate the theoretical frequencies with <math>n = 430</math>, <math>h = 8</math> (the interval width), <math>\sigma = 20.187</math>, <math>\bar{x} = 98.36</math>:

<math>n_i^* = \frac{n h}{\sigma} \varphi_i = \frac{430 \cdot 8}{20.187} \varphi_i = 170.41 \, \varphi_i</math>

Let us calculate the theoretical frequencies (Table 5), using the corresponding values from Laplace's table.

{| class="wikitable"
|+ Table 5: The calculation table for theoretical frequencies of the second typing session
! <math>i</math> !! <math>x_i</math> !! <math>u_i</math> !! <math>\varphi_i</math> !! <math>n_i^*</math>
|-
| 1 || 56 || -2.0983 || 0.044 || 7.498
|-
| 2 || 64 || -1.702 || 0.0925 || 15.763
|-
| 3 || 72 || -1.3057 || 0.1691 || 28.816
|-
| 4 || 80 || -0.9094 || 0.2637 || 44.937
|-
| 5 || 86 || -0.6122 || 0.3292 || 56.098
|-
| 6 || 88 || -0.5131 || 0.3485 || 59.387
|-
| 7 || 96 || -0.1168 || 0.3961 || 67.499
|-
| 8 || 104 || 0.2795 || 0.3825 || 65.181
|-
| 9 || 112 || 0.6758 || 0.3166 || 53.951
|-
| 10 || 120 || 1.0721 || 0.2227 || 37.95
|-
| 11 || 128 || 1.4684 || 0.1354 || 23.073
|-
| 12 || 136 || 1.8647 || 0.0694 || 11.826
|-
| 13 || 144 || 2.261 || 0.0303 || 5.163
|-
| 14 || 152 || 2.6573 || 0.0116 || 1.977
|}

Let us compare the empirical and theoretical frequencies. Calculation Table 6 for the second typing session includes the above-mentioned values and helps us to determine the observed value of the test:

<math>\chi^2 = \sum \frac{(n_i - n_i^*)^2}{n_i^*}</math>

{| class="wikitable"
|+ Table 6: The calculation table for comparison of theoretical and empirical frequencies of the second typing session
! <math>i</math> !! <math>n_i</math> !! <math>n_i^*</math> !! <math>n_i - n_i^*</math> !! <math>(n_i - n_i^*)^2</math> !! <math>(n_i - n_i^*)^2 / n_i^*</math>
|-
| 1 || 5 || 7.498 || 2.498 || 6.2398 || 0.832
|-
| 2 || 15 || 15.7627 || 0.7627 || 0.5818 || 0.0369
|-
| 3 || 35 || 28.816 || -6.184 || 38.242 || 1.327
|-
| 4 || 38 || 44.9366 || 6.9366 || 48.1161 || 1.071
|-
| 5 || 45 || 56.0983 || 11.0983 || 123.1722 || 2.196
|-
| 6 || 53 || 59.3872 || 6.3872 || 40.796 || 0.687
|-
| 7 || 56 || 67.4986 || 11.4986 || 132.2176 || 1.959
|-
| 8 || 51 || 65.181 || 14.181 || 201.102 || 3.085
|-
| 9 || 41 || 53.9512 || 12.9512 || 167.7325 || 3.109
|-
| 10 || 35 || 37.9499 || 2.9499 || 8.7016 || 0.229
|-
| 11 || 30 || 23.0732 || -6.9268 || 47.98 || 2.079
|-
| 12 || 15 || 11.8263 || -3.1737 || 10.0723 || 0.852
|-
| 13 || 8 || 5.1634 || -2.8366 || 8.0465 || 1.558
|-
| 14 || 3 || 1.9767 || -1.0233 || 1.0471 || 0.53
|-
! ∑ !! 430 !! 430 !! !! !! 19.551
|}

According to the principle described above, <math>K_{cr}(0.05; 11) = 19.67514</math> and <math>K_{emp} = 19.55</math>. Thus <math>K_{emp} < K_{cr}</math>, and the distribution is normal.

====3.2. Comparison of series====

Two sets of samples for one person are shown in Figure 5.

Figure 5: joint graphs for the sets of samples of the first and the second typing sessions

For clarity, we can depict the graphs as bar charts (Figure 6). The red bars denote the averaged chart of the first typing session, and the blue bars relate to the second typing session.

Figure 6: the graphs of the sample sets

To determine how much the typing style of one test person differs between his or her own sessions, we should examine the crossing area of the graphs [4]. The first set of samples crosses the second set completely. Therefore, we consider the second set to be the crossing area, whereas the first set of samples is the joining area.

We should use the following formula:

<math>\sum_{i=1}^{n} h \cdot l_{max} - \sum_{i=1}^{n} h \cdot l_{min}</math>

where:
* <math>h</math> is the width of the bars;
* <math>l_{max} = \max(l_{i1}, l_{i2})</math> is the maximum of the bar heights, which are grouped in pairs from the two graphs;
* <math>l_{min} = \min(l_{i1}, l_{i2})</math> is the minimum value, respectively.

According to this formula, for the first typing session we get

<math>\sum_{i=1}^{n} h \cdot l_{max}</math> = 0.010438 + 0.029228 + 0.06263 + 0.073069 + 0.093946 + 0.108559 + 0.11691 + 0.104384 + 0.085595 + 0.073069 + 0.06263 + 0.031315 + 0.016701 + 0.006263 = 2.0459.

For the second typing session we get

<math>\sum_{i=1}^{n} h \cdot l_{min}</math> = 0.020876827 + 0.03131524 + 0.073068894 + 0.079331942 + 0.106471816 + 0.110647182 + 0.123173278 + 0.106471816 + 0.112734864 + 0.077244259 + 0.06263048 + 0.056367432 + 0.033402923 + 0.029227557 = 1.749.

The hit rate <math>K_1 = 1.749 / 2.0459 = 0.85510 \approx 86\%</math> is the level of coincidence between the two results of the same user.

We can also check the hit rate between the normal distributions that correspond to these sets [5]. We use the de Moivre–Laplace integral formula for the normal distribution:

<math>\Phi(x) = \frac{1}{\sigma \sqrt{2\pi}} \int_{0}^{+\infty} e^{-\frac{(t - m)^2}{2\sigma^2}} \, dt</math>

where:
* <math>\sigma</math> is the standard deviation;
* <math>t</math> is the duration of a keystroke in milliseconds;
* <math>m</math> is the expected value.

According to this function, we can plot the graphs of the two normal distributions, which are shown in Figure 7.

Figure 7: graph of normal distributions, corresponding to both samples

The hit rate of the theoretical graphs is

<math>K_2 = \frac{S_1 \cap S_2}{S_1 \cup S_2} = \frac{53.082}{59.075} = 0.899582 \approx 90\%</math>

where <math>S_1</math> is the area bounded by the first graph and <math>S_2</math> is the area bounded by the second graph. This is the level of coincidence.

Even taking into consideration the high error level, we have 86% coincidence for the empirical values and 90% coincidence for the theoretical values. Therefore, we can conclude that each person has individual peculiarities in the duration of key presses, which he or she follows while typing texts.

===4. Scaling by multiple series===

Figure 8 shows the range of the expected value of the time for which different keys are held pressed [4] in the sessions of the same user on different days (the days are marked in different colours).

Figure 8: the graph for the cases of normal distribution

In the bottom right corner, on the axis of ordinates, we can see the average time for which the keys are held pressed.

Figure 9 shows the same characteristics for the typing sessions of different users.

The comparative analysis of the results allows us to conclude that the time for which different keys are held pressed is a very informative value that reflects a user's typing technique [6]. Despite the partly random scatter of the averaged key hold times, the statistical analysis of the differences makes it possible to identify various typing sessions of the same user and to distinguish the typing of different users [7].

Figure 9: the graph for the distribution of typing sessions of different users

The results of the experiment show that, in most cases, the key hold times are random samples that are normally distributed.

===5. Summary===

This method of identification during user authorization can be used with samples of various sizes. The K value of each user can differ slightly between typing sessions. How close the K value is to 1 depends on how developed the user's keyboard handwriting is. If a user has weak typing skills, the critical value of K for his or her authorization can be determined from a comparative analysis of several typing sessions [8]. The further analysis of typing sessions of such users can be made more accurate if we exclude the keys whose hold times have a high standard deviation (for example, far higher than the standard deviation of the whole typing session).

===References===

[1] Aragón-Mendizábal E., Delgado-Casas C., Romero-Oliva M. F., A comparative study of handwriting and computer typing in note-taking by university students, Comunicar (2016). doi: 10.3916/C48-2016-10

[2] Summary and classification of statistics, 2018, URL: http://www.grandars.ru/student/statistika/gruppirovka-statisticheskih-dannyh.html

[3] Shulenin V. P., Mathematical statistics, NTL Publishing House, Tomsk, 2012.

[4] Kryzhevich L. S., Rakov A. S., Kostenko I. V., Arkhipova V. V., Lukin D. E., Testing statistical hypotheses about the time parameters of keytyping, "Problems of cybersecurity, modeling and information processing in modern sociotechnical systems", KSU, Kursk, 2017.

[5] Gmurman V. E., Probability theory and mathematical statistics, 9th edition, Vysshaya shkola, Moscow, 2003.

[6] Fedorowich L. M., Côté J. N., Effects of standing on typing task performance and upper limb discomfort, vascular and muscular indicators, Applied Ergonomics (2018). doi: 10.1016/j.apergo.2018.05.009

[7] Kryzhevich L. S., Matyushina S. N., Kostenko I. V., Providing access to electronic equipment based on computer handwriting recognition, "Current research in the field of exact sciences and their study in secondary and higher educational institutions", KSU, Kursk, 2015.

[8] Yoo W. G., Effects of different computer typing speeds on acceleration and peak contact pressure of the fingertips during computer typing, Journal of Physical Therapy Science (2015). doi: 10.1589/jpts.27.57
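The empirical hit rate <math>K_1</math> of Section 3.2 can be reproduced from the two frequency columns. The sketch below is a hedged Python illustration, not the author's implementation: `hit_rate` is a hypothetical name, bars are paired by interval index (an assumption that matches the terms listed in Section 3.2), and raw counts are used because a common bar width <math>h</math> and a common normalising constant cancel in the ratio. The counts are <math>f_i</math> from Table 1 and <math>n_i</math> from Table 6.

```python
def hit_rate(first, second):
    """Ratio of the crossing area to the joining area of two bar charts.

    first, second: bar heights of the two sessions, paired by interval index.
    """
    if len(first) != len(second):
        raise ValueError("both sessions must have the same number of bars")
    crossing = sum(min(a, b) for a, b in zip(first, second))  # sum of l_min
    joining = sum(max(a, b) for a, b in zip(first, second))   # sum of l_max
    return crossing / joining


# Observed frequencies of the first (Table 1) and second (Table 6) sessions.
first_session = [10, 14, 30, 35, 51, 52, 59, 50, 54, 37, 30, 27, 16, 14]
second_session = [5, 15, 35, 38, 45, 53, 56, 51, 41, 35, 30, 15, 8, 3]

k1 = hit_rate(first_session, second_session)
print(f"K1 = {k1:.4f}")  # K1 = 0.8551
```

The result, 419/490 ≈ 0.8551, agrees with the 86% reported in Section 3.2. A session could then be accepted when this ratio exceeds a per-user critical value, which Section 5 suggests deriving from several sessions of the same user; the paper does not give a specific threshold.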