Method of user authentication on the basis of recognition of computer handwriting peculiarities

Leonid S. Kryzhevich

Kursk State University, 33 Radisheva str., Kursk, 305000, Russian Federation

This article deals with the following hypothesis: each person has unique peculiarities of text typing. The process of typing can be expressed in the form of various metrics and analyzed with the help of statistical methods.

Keywords: normal distribution, de Moivre-Laplace integral theorem, Pearson's nonparametric test χ2

1. Introduction

Nowadays people keep almost all kinds of data in digital form, in databases or in cloud storage services that can be accessed online. Important documents, contracts, banking data and passwords can all be stored this way. If such data are stolen, people can lose personal or business information, and their bank accounts can be emptied. Consequently, the number of attackers who try to steal such information is increasing.

There are different ways to protect information, but they constantly become outdated. To detect an intruder, it is necessary to find out whether the person actually has the right to access the system. This has led to the idea of authenticating users by their digital (keyboard) handwriting.

Each person has unique typing peculiarities. People type at a characteristic speed, and the duration of individual keystrokes varies as well. We decided to measure these characteristics and analyze them.

2. Conditions of the experiment

An experiment was carried out to obtain test data. About one hundred students of the Faculty of Mathematics, Physics and Information Science of Kursk State University participated in the experiment [1]. Their task was to type a text of at least four sentences. Meanwhile, a special program recorded the following characteristics for each key event: the time of the event, measured from the moment the program was started (in milliseconds); the ASCII code of the key; and whether the key was pressed (1) or released (0).

Figure 1 shows the data file that contains the statistical data for further analysis.

3. Data analysis

Let us analyze the first of these characteristics, the keystroke time. If we take the consecutive measurements for the same symbol in pairs (the moment it was pressed and the moment it was released) and subtract the press time from the release time, we obtain the duration of the press for each symbol. Let us plot the press durations for all the symbols in a two-dimensional chart. The horizontal axis shows the keystroke duration in milliseconds, and the vertical axis shows the keystroke frequency (the ratio of the number of keystrokes of a given duration to the total number of keystrokes). If the data are sorted by press duration, the chart looks as shown in Figure 2.
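The pairing of press and release events described above can be sketched in a few lines of Python. This is only an illustration: the tuple layout `(time_ms, key_code, flag)`, the function names and the sample log are assumptions, since the exact format of the data file is not reproduced in the text.

```python
from collections import Counter

def press_durations(events):
    """Pair each press (flag 1) with the next release (flag 0) of the same key.

    `events` is an iterable of (time_ms, key_code, flag) tuples in time order
    (an assumed layout). Returns a list of (key_code, duration_ms) pairs.
    """
    pressed_at = {}   # key_code -> time of the press that is still open
    durations = []
    for time_ms, key_code, flag in events:
        if flag == 1:
            # Keep the first press time if the key auto-repeats while held.
            pressed_at.setdefault(key_code, time_ms)
        elif key_code in pressed_at:
            durations.append((key_code, time_ms - pressed_at.pop(key_code)))
    return durations

def relative_frequencies(durations):
    """Share of keystrokes having each duration, sorted by duration."""
    counts = Counter(duration for _, duration in durations)
    total = sum(counts.values())
    return {d: counts[d] / total for d in sorted(counts)}

# Hypothetical log with two overlapping keystrokes (codes 72 and 105).
events = [(1000, 72, 1), (1080, 105, 1), (1096, 72, 0), (1176, 105, 0)]
durations = press_durations(events)
freqs = relative_frequencies(durations)
```

Keeping a dictionary of open presses lets the sketch handle overlapping keystrokes, where the next key goes down before the previous one is released.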

Let us assume that this distribution is normal. To check this, we analyze the data with Pearson's nonparametric test χ2.

Let us divide our series into fourteen disjoint intervals. For each interval we count the number of test values that fall into it. Each interval must contain at least five observations [2]. If this rule is followed, we can average the values within each interval (arithmetic mean) and build a new chart (Figure 3).

Let us find the mode of this distribution. The mode is the most frequent value among the examined values. In our case, the mode is xi = 96 (its frequency is 59).

The median is also x = 96, because it is the first value at which the cumulative frequency exceeds 479/2 ≈ 240.
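The way the mode and the median are read off a grouped series can be sketched as follows. The full frequency row of the first session is not reproduced in the text, so the sketch uses the frequency row quoted for the second typing session (430 keystrokes) and assumes that it refers to the same grid of values 48, 56, ..., 152; the function names are illustrative.

```python
def grouped_mode(values, freqs):
    """Value with the highest frequency."""
    return max(zip(values, freqs), key=lambda pair: pair[1])[0]

def grouped_median(values, freqs):
    """First value whose cumulative frequency exceeds half of the total."""
    half = sum(freqs) / 2
    cumulative = 0
    for value, freq in zip(values, freqs):
        cumulative += freq
        if cumulative > half:
            return value
    raise ValueError("empty series")

# Assumed grid of values and the second-session frequencies quoted in the paper.
x = list(range(48, 153, 8))
n2 = [5, 15, 35, 38, 45, 53, 56, 51, 41, 35, 30, 15, 8, 3]

mode = grouped_mode(x, n2)
median = grouped_median(x, n2)
```

Under these assumptions both functions return 96 for that row.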

In symmetrical distribution series the values of the mode and the median coincide with the average value (xav = Me = Mo), and in moderately asymmetrical series they are related as follows: 3·(xav − Me) ≈ xav − Mo.

The range of variation, which is the difference between the maximum and minimum values of x, is R = 152 − 48 = 104.

We can calculate the mean deviation:

$$d = \frac{\sum |x_i - \bar{x}| \, f_i}{\sum f_i} = \frac{9896.284}{479} = 20.66.$$

Let us calculate the dispersion:

$$D = \frac{\sum (x_i - \bar{x})^2 \, f_i}{\sum f_i} = \frac{299891.708}{479} = 626.079.$$
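As a quick arithmetic check, the two ratios and the standard deviation that follows from the dispersion can be recomputed directly. The variable names are illustrative; the sums and the sample size are the ones quoted in the text.

```python
import math

n = 479                   # total number of keystrokes in the first session
sum_abs_dev = 9896.284    # sum of |x_i - mean| * f_i
sum_sq_dev = 299891.708   # sum of (x_i - mean)^2 * f_i

mean_deviation = sum_abs_dev / n
dispersion = sum_sq_dev / n
sigma = math.sqrt(dispersion)   # standard deviation as the square root of D
```

The square root of the dispersion rounds to 25.022, the σ value used in the subsequent calculations.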

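The theoretical frequencies are found by the formula $n_i^* = \frac{n h}{\sigma} \varphi_i$. The following indices are used in the formula: n = 479, h = 8 (the interval width), σ = 25.022, xav = 99.49, φi – the appropriate value from Laplace's table for $u_i = (x_i - x_{av})/\sigma$. We can calculate the theoretical frequencies in Table 2.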

Now we should compare the empirical and theoretical frequencies.

Table columns: xi; |xi − xav|·pi; (xi − xav)²·pi.

We can create one more table, Table 3, which helps us find the observed value of Pearson's test:

$$\chi^2 = \sum \frac{(n_i - n_i^*)^2}{n_i^*}.$$

Table 3 should include the following indices: i – the sequence number; ni – the observed frequencies; ni* – the theoretical frequencies; (ni − ni*) – the difference between the observed and theoretical frequencies; (ni − ni*)²/ni* – the squared difference divided by the corresponding theoretical frequency.

Next we should calculate the following indices: Kemp – the observed value of the test statistic, and Kcr – the theoretical value of the bound of the critical region.

i    1        2        3        4        5        6        7        8    9    10   11   12   13   14
xi   48       56       64       72       80       88       96       104  112  120  128  136  144  152
ui   -2.0578  -1.7381  -1.4184  -1.0987  -0.779   -0.4592  -0.1395
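The calculation of the ui values and the theoretical frequencies can be sketched as below. It assumes that φ is the standard normal density and that the theoretical frequency is (n·h/σ)·φ(ui), which is consistent with the parameters n = 479, h = 8, σ = 25.022 and xav = 99.49 given in the text; the function names are illustrative.

```python
import math

def normal_density(u):
    """Standard normal density (assumed to be the tabulated function phi)."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def theoretical_frequencies(xs, n, h, mean, sigma):
    """Return (u_i, n_i*) pairs with u_i = (x_i - mean) / sigma."""
    scale = n * h / sigma
    result = []
    for x in xs:
        u = (x - mean) / sigma
        result.append((u, scale * normal_density(u)))
    return result

xs = list(range(48, 153, 8))   # the x_i values 48, 56, ..., 152
table = theoretical_frequencies(xs, n=479, h=8, mean=99.49, sigma=25.022)

for i, (u, n_star) in enumerate(table, start=1):
    print(f"{i:2d} {xs[i - 1]:4d} {u:8.4f} {n_star:7.2f}")
```

The first ui values produced this way agree with the ui row above to within rounding of σ and xav.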

The more Kemp exceeds Kcr, the stronger the evidence against the main hypothesis [3].

The bound Kcr = χ2(k − r − 1; α) is found from the χ2 distribution tables, where k = 14 is the number of intervals, r = 2 is the number of parameters determined from the series (xav and σ), and the significance level is α = 0.05.

Kcr(0.05;11) = 19.67514; Kemp = 17.99.

The observed value of Pearson's statistic does not fall into the critical region (Kemp < Kcr). It is therefore fair to say that the data in the series are consistent with a normal distribution.
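The decision rule can be sketched as follows. The critical value is taken from the text rather than computed, and the three-interval series at the end is a hypothetical example included only to show how the statistic is accumulated; all function names are illustrative.

```python
def pearson_chi2(observed, theoretical):
    """Observed value of Pearson's statistic: sum of (n_i - n_i*)^2 / n_i*."""
    if len(observed) != len(theoretical):
        raise ValueError("series must have the same length")
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))

def degrees_of_freedom(k, r):
    """k intervals, r parameters estimated from the sample."""
    return k - r - 1

def is_consistent_with_normal(k_emp, k_cr):
    """The hypothesis is not rejected while K_emp stays below K_cr."""
    return k_emp < k_cr

K_CR = 19.67514   # table value quoted in the text for alpha = 0.05
df = degrees_of_freedom(k=14, r=2)
first_session_ok = is_consistent_with_normal(17.99, K_CR)

# Hypothetical three-interval series, only to illustrate the statistic.
example = pearson_chi2([10, 20, 10], [12.0, 16.0, 12.0])
```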

Following the same procedure, we can use Pearson's test to check a second data series (Figure 4) obtained from the same person for a different text extract.

$$n_i^* = \frac{n h}{\sigma} \varphi_i \;\Rightarrow\; n_i^* = \frac{430 \cdot 8}{20.187} \varphi_i = 170.41 \, \varphi_i.$$

i    1   2   3   4   5   6   7   8   9   10  11  12  13  14  ∑
ni   5   15  35  38  45  53  56  51  41  35  30  15  8   3   430

Let us calculate the theoretical frequencies (Table 5) using the appropriate values from Laplace's table.

Let us compare the empirical and theoretical frequencies. We can create a calculation table, Table 6, for the second typing session, which includes the values mentioned above. The table helps us determine the observed value of the test:

$$\chi^2 = \sum \frac{(n_i - n_i^*)^2}{n_i^*}.$$

By the same principle, we obtain Kcr(0.05;11) = 19.67514 and Kemp = 19.55. Thus Kemp < Kcr, so the distribution is normal.

3.2. Comparison of series

Two sets of samples for one person are shown in Figure 5.

For clarity, we can present the graphs as bar charts (Figure 6). The red bars denote the averaged chart of the first typing session, and the blue bars relate to the second typing session. To determine how much a person's typing style differs from one session to another, we should examine the area where the graphs intersect [4]. The first set of samples covers the second set completely. Therefore, we consider the second set to be the intersection area and the first set to be the union area.

We should use the following formula:

$$K = \frac{\sum_{i=1}^{n} h \cdot l_{min}}{\sum_{i=1}^{n} h \cdot l_{max}},$$

where h is the width of the bars; lmax = max(li1, li2) is the larger of the two bar heights in the i-th pair of bars from the two graphs; and lmin = min(li1, li2) is the smaller one.

According to this formula, for the first typing session we obtain $\sum_{i=1}^{n} h \cdot l_{max}$ = 0.020876827 + 0.03131524 + 0.073068894 + 0.079331942 + 0.106471816 + 0.110647182 + 0.123173278 + 0.106471816 + 0.112734864 + 0.077244259 + 0.06263048 + 0.056367432 + 0.033402923 + 0.029227557 = 2.0459.

For the second typing session we obtain $\sum_{i=1}^{n} h \cdot l_{min}$ = 0.010438 + 0.029228 + 0.06263 + 0.073069 + 0.093946 + 0.108559 + 0.11691 + 0.104384 + 0.085595 + 0.073069 + 0.06263 + 0.031315 + 0.016701 + 0.006263 = 1.749.

The hit rate K1 = 1.749 / 2.0459 ≈ 0.855 ≈ 86% is the level of coincidence between the two results of the same user.
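The ratio can be recomputed from the two lists of bar heights quoted above by taking the smaller and the larger value in each pair. In this sketch the bar width h is left at 1 as an assumption; a common bar width scales both sums equally and cancels in the ratio. The function and variable names are illustrative.

```python
heights_a = [0.010438, 0.029228, 0.06263, 0.073069, 0.093946, 0.108559, 0.11691,
             0.104384, 0.085595, 0.073069, 0.06263, 0.031315, 0.016701, 0.006263]
heights_b = [0.020876827, 0.03131524, 0.073068894, 0.079331942, 0.106471816,
             0.110647182, 0.123173278, 0.106471816, 0.112734864, 0.077244259,
             0.06263048, 0.056367432, 0.033402923, 0.029227557]

def overlap_ratio(first, second, h=1.0):
    """Return (intersection area, union area, ratio) for paired bar heights."""
    lower = sum(h * min(a, b) for a, b in zip(first, second))
    upper = sum(h * max(a, b) for a, b in zip(first, second))
    return lower, upper, lower / upper

lower, upper, k1 = overlap_ratio(heights_a, heights_b)
```

With these inputs the ratio comes out at about 0.855, in line with the K1 value above.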

We can also check the hit rate between the normal distributions that correspond to these sets [5]. For this we use the de Moivre–Laplace integral formula for the normal distribution:

$$\Phi(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{0}^{+\infty} e^{-\frac{(t-m)^2}{2\sigma^2}} \, dt,$$

where σ is the standard deviation; t is the keystroke duration in milliseconds; m is the expected value.

Using this function, we can plot the graphs of the two normal distributions, which are shown in Figure 7. The level of coincidence is

$$K_2 = \frac{S_1 \cap S_2}{S_1 \cup S_2} = \frac{53.082}{59.075} \approx 0.90 = 90\%.$$
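The same intersection-over-union idea can be sketched numerically for two normal density curves. The first pair of parameters (99.49, 25.022) is taken from the first session; the second pair (96.0, 20.187), the integration range of 0 to 300 ms and the midpoint rule are assumptions made for illustration, so the result is not expected to reproduce the areas quoted above.

```python
import math

def normal_pdf(t, m, sigma):
    """Normal density with expected value m and standard deviation sigma."""
    return math.exp(-((t - m) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def density_overlap(m1, s1, m2, s2, t_min=0.0, t_max=300.0, steps=30000):
    """Ratio of intersection area to union area of two normal densities."""
    dt = (t_max - t_min) / steps
    intersection = 0.0
    union = 0.0
    for k in range(steps):
        t = t_min + (k + 0.5) * dt          # midpoint of the k-th slice
        a = normal_pdf(t, m1, s1)
        b = normal_pdf(t, m2, s2)
        intersection += min(a, b) * dt
        union += max(a, b) * dt
    return intersection / union

# Second-session parameters are assumed values, not results from the paper.
k2_sketch = density_overlap(99.49, 25.022, 96.0, 20.187)
```

Identical curves give a ratio of 1, and the ratio decreases as the curves move apart or their widths diverge.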

Even allowing for the high error level, we obtain 86% coincidence for the empirical values and 90% coincidence for the theoretical values. Therefore, we can conclude that each person has individual peculiarities in the duration of key presses while typing.

4. Scaling by multiple series

Figure 8 shows the range of the expected value of the holding time for different keys [4] in sessions of the same user on different days (the days are marked in different colours).

In the bottom right corner, on the ordinate axis, we can see the average holding time of the keys.

Figure 9 shows the same characteristics for the typing sessions of different users.

5. Summary

The comparative analysis of the results allows us to conclude that the holding time of different keys is a very informative value that characterizes a user's typing technique [6]. Despite the partly random scatter of the averaged holding times, statistical analysis of the differences makes it possible to recognize different typing sessions of the same user and to distinguish the typing of different users [7].

This method of identification can be used during user authorization with samples of various sizes. The K value of each user can differ slightly between typing sessions. How close the K value is to 1 depends on how developed the user's keyboard handwriting is. If a user has weak typing skills, the critical value of K for his or her authorization can be determined from a comparative analysis of several of his or her typing sessions [8]. The analysis of typing sessions of such users can be made more accurate if we disregard the keys whose holding times have a high standard deviation (for example, far higher than the standard deviation of the whole typing session).

[1] Aragón-Mendizábal E., Delgado-Casas, Romero-Oliva M. F., A comparative study of handwriting and computer typing in note-taking by university students, Comunicar (2016). doi: 10.3916/C48-2016-10

[2] Summary and classification of statistics, 2018, URL: http://www.grandars.ru/student/statistika/gruppirovkastatisticheskih-dannyh.html

[3] Shulenin V.P., Mathematical statistics, NTL Publishing House, Tomsk, 2012.

[4] Kryzhevich L.S., Rakov A.S., Kostenko I.V., Arkhipova V.V., Lukin D.E., Testing statistical hypotheses about the time parameters of keytyping, "Problems of cybersecurity, modeling and information processing in modern sociotechnical systems", KSU, Kursk, 2017.

[5] Gmurman V.E., Probability theory and mathematical statistics, 9th edition, Vysshaya shkola, Moscow, 2003.

[6] Fedorowich L. M., Côté J. N., Effects of standing on typing task performance and upper limb discomfort, vascular and muscular indicators, Applied Ergonomics (2018). doi: 10.1016/j.apergo.2018.05.009

[7] Kryzhevich L. S., Matyushina S. N., Kostenko I. V., Providing access to electronic equipment based on computer handwriting recognition, "Current research in the field of exact sciences and their study in secondary and higher educational institutions", KSU, Kursk, 2015.

[8] Yoo W. G., Effects of different computer typing speeds on acceleration and peak contact pressure of the fingertips during computer typing, Journal of Physical Therapy Science (2015). doi: 10.1589/jpts.27.57