<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recovery gaps in experimental data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roman Kaminskyi</string-name>
          <email>Roman.M.Kaminskyi@lpnu.ua</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliia Kunanets</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Systems and Networks Department, Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera Street, 32a, 79013, Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In statistical analysis, data sets often lack the values of some properties, parameters or characteristics. In this situation, the problem of recovering missing data becomes relevant. It is almost impossible to establish the true value that was missed, but there are many methods, both simple and more complex, for replacing such values. This study describes the characteristics of several gap-filling methods and gives examples of their application to data tables and time series.</p>
      </abstract>
      <kwd-group>
        <kwd>exclusion method</kwd>
        <kwd>replacement methods</kwd>
        <kwd>filling gaps with the average value</kwd>
        <kwd>filling gaps with median</kwd>
        <kwd>method of closest neighbors</kwd>
        <kwd>filling data model with value</kwd>
        <kwd>filling gaps without matching</kwd>
        <kwd>regression analysis</kwd>
        <kwd>maximum likelihood method</kwd>
        <kwd>EM algorithm</kwd>
        <kwd>ZET algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Database tables and time series are widely used to describe various static and dynamic
systems. However, when the results of studies of such systems are analyzed, the values
of some data are often missing. Data with gaps are much harder to process, because the
estimates of statistical characteristics become biased. Recovering the missing data is
therefore a primary step, and it requires not only applying suitable recovery methods
but also understanding the nature of the gaps.</p>
      <p>The problem of missing data values is highly relevant in sociology, image recognition,
cluster analysis, and other fields. It also arises frequently in the identification of time
series, when a priori information about the parameter values is incomplete.
Objective causes of gaps include failure of measuring and recording equipment,
obstacles that arise during monitoring, and interference superimposed on the observed
signal. Subjective causes include lapses in data registration and the inability to
obtain full and accurate information because of a fuzzy qualitative situation,
psychological factors, the properties of memory, and delayed sensorimotor reaction.</p>
    </sec>
    <sec id="sec-2">
      <title>Causes and problems of missing data recovery</title>
      <p>The presence of missing values in data significantly limits the possibility of
using the required processing methods. In this situation, important parameters are
the frequency of the gaps and the existence of regularities in their occurrence.
In complex systems, data are usually organized in two forms: tables and time series.
In a table, each row holds the characteristics (parameters) of one studied object
(a state, a computer, a method, etc.), and each column holds the values of one specific
characteristic across the objects. These are mostly numerical values measured on a
predetermined scale.</p>
      <p>Time series characterize the dynamics of an object as the change of a specific defining
indicator whose value most adequately reflects the behavior of the object.
Situations that require decision-making generate the following needs for the recovery of
missing data:
1. Filling all gaps in tables or time series.
2. Filling only some gaps.
3. Filling the gaps on the basis of information contained in the table.
4. Filling each subsequent gap on the basis of the initial information and of the
values already predicted, taking into account the trends for previously filled
gaps.</p>
      <p>Existing gap-filling methods differ substantially and serve different recovery needs.
In every case, the quality of the recovered data is an important consideration when
filling in the missing values.</p>
      <p>
        It should be taken into account that the values of the characteristics in a table, or
the levels of a time series, are derived in one way or another from data that may
contain gaps, yet the tables or time series themselves carry the most information
about these data. When the amount of data is large and the number of gaps is small,
the characteristics of the available data differ only slightly from the true values for
the entire population (full data availability). In this case, finding a replacement for
a missing value is not very difficult, since the nature of the data can be established
(at least statistically). Otherwise, it is necessary to apply several methods and choose
the one that best satisfies a chosen quality criterion, for example the characteristics
of descriptive statistics or the mean square deviation from the trend under its various
models.
R. Little [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] considered it expedient, when processing incomplete data with missing time series
levels (treating the existing levels as elements of a sample), to determine the primary
statistical characteristics of the studied sample in the following ways:
1. without gaps;
2. with missing values filled with zeros;
3. with missing values filled with average values;
4. with missing values filled with numerical characteristics of the distribution of
levels: the mode, the median or the quartiles;
5. with missing levels filled with quasi-random numbers drawn from a normal
distribution whose mean and standard deviation coincide with the corresponding
characteristics of the initial series with gaps.
      </p>
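      <p>The five treatments listed above can be compared side by side. The following is a minimal sketch, not the procedure used in [1]: the function names <monospace>describe</monospace> and <monospace>variants</monospace> are hypothetical, gaps are assumed to be encoded as <monospace>None</monospace>, and only the mean, median and standard deviation are reported.</p>
      <preformat>
```python
import random
import statistics


def describe(values):
    """Primary statistics of a complete sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def variants(levels, seed=0):
    """Statistics of a series under five treatments of its gaps (None)."""
    known = [v for v in levels if v is not None]
    mean = statistics.mean(known)
    median = statistics.median(known)
    sd = statistics.stdev(known)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible

    def fill(make):
        # Replace every gap with a value produced by make().
        return [v if v is not None else make() for v in levels]

    return {
        "without gaps": describe(known),
        "zeros": describe(fill(lambda: 0.0)),
        "mean": describe(fill(lambda: mean)),
        "median": describe(fill(lambda: median)),
        "normal": describe(fill(lambda: rng.gauss(mean, sd))),
    }
```
      </preformat>
      <p>Calling <monospace>variants</monospace> on a series and comparing the resulting dictionaries shows how strongly each treatment shifts the mean and shrinks or inflates the standard deviation.</p>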
    </sec>
    <sec id="sec-3">
      <title>Methods for filling gaps in data tables</title>
      <p>Practical work with databases shows that tables, in particular "object-attribute"
tables such as Table 1, are likely to contain a large number of gaps. The attributes of
the objects in this table are physical parameters, each with its own dimension.
Therefore, the gaps, which are marked in the table with an empty-cell character, should
be filled by comparison with analogous objects. Statistical analysis across the values
of different attributes of a single object is inadmissible, because it leads to errors
and loss of confidence in the data. Recovery of missing data in tables makes sense only
when there are just a few gaps.</p>
      <p>Table 1. "Object-attribute" table: the rows are the objects Y1, Y2, ..., Yn, and the
columns hold the values of their attributes h, l and m (h1, ..., hn, and so on).</p>
      <p>The most common methods that can recover missing data without loss of reliability
are the following.</p>
      <p>Exclusion method. This method is applied to tables and consists in removing the
rows that contain gaps. In general, it reduces the informativeness of the table and the
objectivity of any analysis based on the data thus obtained, and it lowers the adequacy
of the constructed models.</p>
    </sec>
    <sec id="sec-7">
      <p>Replacement methods are diverse. We consider only those that, in our opinion, are
expedient for recovering gaps in data obtained in a scientific experiment.</p>
      <p>
        Filling gaps with the average value of the column. A missing value is replaced by the
average of the values of the given attribute that are available in the table, on the
assumption that the objects are equivalent. This method is applicable only if the data
are missing at random (MAR) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], that is, when the gaps are random variables. Its disadvantage is that it distorts
the distribution of the data and reduces the variance of the initial data.
      </p>
      <p>
        Filling gaps with the median [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The median is one of the most robust characteristics of a sample, since it is
hardly affected by extreme values. It can be used successfully for a table. However,
with a small number of objects and several gaps the median can vary quite
significantly, so it should be determined anew after each filled gap, and the gaps
should be filled in random order.
      </p>
      <p>
        Method of closest neighbors. The method is based on the assumption that the missing
value is close to the available values in the rows neighboring the row with the gap.
To fill the gap, the values of the relevant attribute in these neighboring rows are
averaged, with weights inversely proportional to the distance between the cell with
the gap and the cell with the available value of the given attribute. For a large
number of gaps, this method is ineffective.
      </p>
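      <p>The three simple replacement methods described above can be sketched for a single table column as follows. This is an illustration under stated assumptions rather than a reference implementation: gaps are encoded as <monospace>None</monospace>, the function names are hypothetical, and the "distance" in the closest-neighbors method is taken to be the difference of row indices, which the text does not specify.</p>
      <preformat>
```python
import random
import statistics


def fill_mean(column):
    """Replace every gap (None) with the mean of the available values."""
    mean = statistics.mean(v for v in column if v is not None)
    return [mean if v is None else v for v in column]


def fill_median(column, seed=0):
    """Fill gaps in random order, recomputing the median after each fill."""
    out = list(column)
    gaps = [i for i, v in enumerate(out) if v is None]
    random.Random(seed).shuffle(gaps)
    for i in gaps:
        out[i] = statistics.median(v for v in out if v is not None)
    return out


def fill_neighbors(column, k=2):
    """Fill each gap with the inverse-distance weighted mean of the k
    nearest rows that have a value (distance = row index difference)."""
    known = [(i, v) for i, v in enumerate(column) if v is not None]
    out = list(column)
    for i, v in enumerate(column):
        if v is None:
            nearest = sorted(known, key=lambda p: abs(p[0] - i))[:k]
            weights = [1.0 / abs(j - i) for j, _ in nearest]
            total = sum(w * x for w, (_, x) in zip(weights, nearest))
            out[i] = total / sum(weights)
    return out
```
      </preformat>
      <p>For the column <monospace>[1.0, 2.0, None, 4.0, 10.0]</monospace>, the mean fill is pulled toward the large value 10.0, while the median and the two-neighbor fill stay between the adjacent values 2.0 and 4.0.</p>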
      <p>
        Filling with model values. A model of the values of the given attribute is built, and
the gap is filled with the value of this model that corresponds to the gap.
      </p>
      <p>
        Filling gaps without matching. The gap is filled with a constant value from an
external source, for example the value of a previous observation from the same study.
      </p>
      <p>
        Regression analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To apply this method, the data must satisfy the MAR condition as well as the
prerequisites of regression analysis. Its disadvantage is that the quality of gap
recovery depends on the choice of the regression model.
      </p>
      <p>
        The methods listed above are simple techniques and are usually applied during data
pre-processing and preparation for analysis. Other methods, classified as complex, are
also used. This category is divided into two subcategories: global methods and local
methods. The global methods are described first.
      </p>
      <p>
        Maximum likelihood method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This method uses the existing observations to find the parameter values that
maximize the likelihood function; the resulting mathematical expectations serve as the
target values for each missing entry. Its application becomes difficult when the
number of missing values is large.
      </p>
      <p>
        EM algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This algorithm alternates between two stages. In the first stage, the E
(expectation) step, the conditional mathematical expectations of the missing values
are determined from the complete and partly incomplete data and substituted for each
gap. After all the missing values are filled, the mean, variance, correlation and
covariance are determined. In the second stage, the M (maximization) step, the
parameter estimates are updated so that the structure of the data with the filled
values matches the structure of the complete observations as closely as possible. The
two steps are repeated until the estimates stop changing.
      </p>
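      <p>To make the two stages concrete, the sketch below runs EM for one very simple case. Everything specific in it is an assumption made for illustration: the data are modeled as bivariate normal, the attribute <monospace>x</monospace> is fully observed, only <monospace>y</monospace> has gaps (encoded as <monospace>None</monospace>), and the function name <monospace>em_fill</monospace> is hypothetical. It is not the algorithm analyzed in [6].</p>
      <preformat>
```python
def em_fill(x, y, n_iter=100):
    """EM for a bivariate normal model: x complete, y with None for gaps.

    Returns y with each gap replaced by its conditional expectation E[y | x].
    """
    n = len(x)
    mx = sum(x) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    known = [b for b in y if b is not None]
    # Start from mean imputation, which gives a valid covariance matrix.
    my = sum(known) / len(known)
    ey = [my if b is None else b for b in y]
    sxy = sum((a - mx) * (e - my) for a, e in zip(x, ey)) / n
    syy = sum((e - my) ** 2 for e in ey) / n
    for _ in range(n_iter):
        beta = sxy / sxx          # regression slope of y on x
        resid = syy - beta * sxy  # conditional variance of y given x
        # E step: expected y and y^2 for every row under current parameters.
        ey = [my + beta * (a - mx) if b is None else b for a, b in zip(x, y)]
        ey2 = [e * e + resid if b is None else b * b for e, b in zip(ey, y)]
        # M step: re-estimate the moments from the completed statistics.
        my = sum(ey) / n
        sxy = sum(a * e for a, e in zip(x, ey)) / n - mx * my
        syy = sum(ey2) / n - my * my
    return ey
```
      </preformat>
      <p>The E step adds the conditional variance to the expected square of each filled value, which is what distinguishes EM from repeatedly substituting regression predictions; a fixed number of iterations is used here instead of a convergence check to keep the sketch short.</p>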
      <p>ZET algorithm. This is a local gap-filling algorithm. Local algorithms mainly take
into account the dependencies in the part of the data that contains the gap, that is,
in some neighborhood of the gap. To identify these dependencies, the ZET algorithm
involves all the rows and columns of the initial data table. Local algorithms are highly
efficient in comparison with other known gap-filling algorithms. Various modifications
of them are used for real-world tasks.</p>
      <p>When some objects in a table have incomplete sets of attributes, the following
procedure is often used.</p>
      <p>1. Select similar objects with complete sets of attributes, that is, objects with no
missing attribute values.
2. On a subset of these objects, simulate missing values in a way typical for this
type of data.
3. Recover the simulated gaps with various methods.
4. Determine which method gives the best match between the replacements and the
true values, according to a given criterion.
5. Use this method to recover the values that are actually missing in the attributes
of the objects of the given set.</p>
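      <p>The selection procedure above can be sketched as a small experiment on complete data. The sketch makes several assumptions that are not in the text: gaps are simulated uniformly at random, the criterion is the mean absolute error on the hidden values, and the names <monospace>choose_method</monospace>, <monospace>mean_fill</monospace> and <monospace>median_fill</monospace> are hypothetical.</p>
      <preformat>
```python
import random
import statistics


def mean_fill(col):
    """Fill gaps (None) with the mean of the available values."""
    m = statistics.mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def median_fill(col):
    """Fill gaps (None) with the median of the available values."""
    m = statistics.median(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def choose_method(complete, methods, n_gaps=4, trials=200, seed=0):
    """Hide n_gaps values at random, restore them with every method and
    return the method with the smallest mean absolute error."""
    rng = random.Random(seed)
    errors = {name: 0.0 for name in methods}
    for _ in range(trials):
        hidden = rng.sample(range(len(complete)), n_gaps)
        gappy = [None if i in hidden else v for i, v in enumerate(complete)]
        for name, fill in methods.items():
            restored = fill(gappy)
            err = sum(abs(restored[i] - complete[i]) for i in hidden)
            errors[name] += err / n_gaps
    averaged = {name: total / trials for name, total in errors.items()}
    return min(averaged, key=averaged.get), averaged
```
      </preformat>
      <p>A call such as <monospace>choose_method(values, {"mean": mean_fill, "median": median_fill})</monospace> returns the name of the winning method together with the averaged errors; the winner is then applied to the objects whose values are really missing.</p>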
      <p>Example. To illustrate the methods, consider a hypothetical example of filling
missing values in a table using the spreadsheet application MS Excel. The objects in
the table are characterized by a single attribute, the value of an indicator. The
attribute vector is a sequence of random numbers in cells C3:C32, uniformly distributed
over the range x<sub>i</sub> ∈ [1, 10], as shown in Fig. 1. The sequence contains
n = 30 values, which is practically the smallest representative sample size, so
conclusions based on it can be considered reliable.
Suppose that the attribute is not given for several objects, that is, the relevant
column of the "object-attribute" table contains missing values. To fill the gaps we
use the average of the sample with gaps, a weighted average of the neighboring values,
and the median. To simulate the gaps, we remove the values C7, C13, C20 and C24 from
the sequence, which gives the attribute vector with gaps in cells E3:E32. The average
and the median are determined with the procedure "Data → Data Analysis → Descriptive
Statistics". The weighted average is determined by the formula
<disp-formula><label>(1)</label><tex-math>x_i = 0.5\,(0.191\,x_{i-2} + 0.309\,x_{i-1} + 0.309\,x_{i+1} + 0.191\,x_{i+2})</tex-math></disp-formula>
where x<sub>i</sub> is the missing value.</p>
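      <p>A weighted average of the four neighbors of a gap, with the coefficients of formula (1), might be computed as in the sketch below. Two points are assumptions rather than the authors' procedure. First, the coefficients as printed already sum to 1, so the leading factor 0.5 would halve the result; the sketch therefore divides by the sum of the weights actually used instead of applying that factor. Second, neighbors that fall outside the series or are themselves gaps (encoded as <monospace>None</monospace>) are simply skipped. The name <monospace>weighted_fill</monospace> is hypothetical.</p>
      <preformat>
```python
# Coefficients of formula (1), keyed by the offset from the gap.
WEIGHTS = {-2: 0.191, -1: 0.309, 1: 0.309, 2: 0.191}


def weighted_fill(values, i):
    """Weighted average of the available neighbors of the gap at index i."""
    total = 0.0
    weight_sum = 0.0
    for offset, w in WEIGHTS.items():
        j = i + offset
        # Skip neighbors outside the series and neighbors that are gaps.
        if j in range(len(values)) and values[j] is not None:
            total += w * values[j]
            weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("no available neighbors around the gap")
    # Normalizing keeps the result a true average of the neighbors.
    return total / weight_sum
```
      </preformat>
      <p>Because the weights are symmetric, a gap in a locally linear stretch of data is restored to the value on the line through its neighbors.</p>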
      <p>The indicators of descriptive statistics for the recovered data are shown in Fig. 2.
The attribute values are uniformly distributed, so the main characteristics of
descriptive statistics against which the results of each method can be compared are the
arithmetic mean, median, standard deviation and sum (kurtosis, asymmetry and mode are
not informative for such data). From these indicators we determine the relative error
of the data with missing values, and of the data with replaced values, with respect to
the original data in cells C3:C32. The relative errors are presented in Table 2.</p>
    </sec>
    <sec id="sec-10">
      <p>Table 2. Relative error of the indicators of descriptive statistics (average, median, standard deviation), in %.</p>
      <p>In this example, filling the missing data with the median value proved the most
suitable method.</p>
      <p>Remark. This example illustrates only the procedure, not the solution of a specific
problem.</p>
      <p>Since the criterion for assessing the quality of a method is the relative error of the
indicators of descriptive statistics, the smaller this error, the better the replacement
of values by the given method.</p>
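      <p>The criterion can be written down directly. The sketch below computes, in percent, the relative error of the four indicators named in the example (mean, median, standard deviation, sum) for a candidate sample against the original one. The function names are hypothetical, and the original indicators are assumed to be non-zero.</p>
      <preformat>
```python
import statistics


def indicators(sample):
    """Descriptive statistics used as the comparison criterion."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),
        "sum": sum(sample),
    }


def relative_errors(original, candidate):
    """Relative error, in percent, of each indicator of the candidate
    sample with respect to the original sample."""
    ref = indicators(original)
    got = indicators(candidate)
    return {k: 100.0 * abs(got[k] - ref[k]) / abs(ref[k]) for k in ref}
```
      </preformat>
      <p>The candidate may be the sample with the gaps simply left out or the sample after any of the replacement methods; the method with the smallest errors is preferred.</p>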
    </sec>
    <sec id="sec-12">
      <title>Methods of filling gaps in time series</title>
      <p>Representing data as time series and analyzing them is becoming increasingly
common in scientific studies. Time series analysis is especially important for studying
data streams in information systems and networks, for modeling processes in various
systems and phenomena, and for predicting situations and system dynamics on the basis
of monitoring of their state.</p>
      <p>The main cause of gaps in time series is the inability to obtain information at
certain points in time. In addition, the means of measurement, observation or
registration may be misconfigured or damaged, may not match the measured quantities,
or may have inappropriate measurement limits (mismatched scales, low sensitivity, long
measurement time), or the data may be recorded by unskilled personnel.</p>
      <p>A characteristic feature of time series is that the nature of the gaps has its own
peculiarities in each subject area. In practice, however, the nature of the gaps is often
ignored, and one or more of the simplest and most accessible methods are applied. A time
series with gaps is shown in Fig. 2.</p>
      <p>At present there is no single methodology for recovering missing values or processing
missing data. Choosing the most appropriate gap-filling method for a particular
situation is often a complicated task in its own right, which can take much more time
and effort than processing the data with the recovered values.</p>
      <p>Different methods can be used to fill gaps in time series, but in each case the choice
of method must be justified and the results interpreted. Unlike the attributes in a
table, the levels of a time series are of equal standing: they have the same nature
and physical content, their values are measured on the same scale, they are dependent
random variables with the same distribution and, most importantly, they are tied to
the ordered sequence of time moments at which they were recorded.</p>
      <p>Two groups of methods can be distinguished for filling missing levels in time series:
those that use the values of individual statistics and those that use the values of a
trend model. The following statistics are used to fill missing levels.</p>
      <p>Filling the missing levels with the average value. The gap is filled either with the
overall average of all values of the series, or with the average computed over a chosen
interval that contains the gap. This method is easy to implement, and the mechanism
that produced the gap can be ignored. Its disadvantage is that it distorts the
distribution of the data and reduces the variance of the initial data.</p>
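      <p>The two variants of average filling, overall and over an interval around the gap, can be sketched as follows. The symmetric window defined by <monospace>half_width</monospace>, the fallback to the overall average when the window holds no values, and the function name are assumptions made for illustration; gaps are encoded as <monospace>None</monospace>.</p>
      <preformat>
```python
def fill_with_average(levels, half_width=None):
    """Fill gaps (None) with the overall mean of the series or, when
    half_width is given, with the mean of a window around the gap."""
    known_all = [v for v in levels if v is not None]
    out = list(levels)
    for i, v in enumerate(levels):
        if v is None:
            if half_width is None:
                pool = known_all
            else:
                window = levels[max(0, i - half_width): i + half_width + 1]
                # Fall back to the overall mean if the window is all gaps.
                pool = [w for w in window if w is not None] or known_all
            out[i] = sum(pool) / len(pool)
    return out
```
      </preformat>
      <p>For a series with a trend or an isolated extreme level, the windowed variant follows the local behavior of the series, whereas the overall average pulls the filled level toward the center of all values.</p>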
      <p>Filling gaps with the median. The median is one of the most robust characteristics
of a sample, since it is hardly affected by extreme values. It can be used successfully
for time series as well.</p>
      <p>Filling with the distribution mode. When the value of a missing level has to be
found and the amount of data is sufficient, the mode is determined and used to fill
the gap.</p>
      <p>Remark. The absence of an inflection point in the envelope of the variation series
indicates that the distribution of the time series levels has no mode.</p>
      <p>Filling with the values of a trend model. The essence of this method is that a trend
model is built in the form of an appropriate function, usually a nonlinear one. The
values of the missing levels are then taken from this model (function) according to
the numbers of these levels.</p>
      <p>Example. For the time series shown in Fig. 3, the most appropriate method for filling
the missing levels is to calculate their values from a model of its trend.
Let the initial series have n = 80 levels with gaps: the missing levels x<sub>i</sub>
are those with i = 20, 55, 67, 70, 71.</p>
      <p>The missing levels are filled as follows. A trend model is constructed using, for
example, a third-degree polynomial as the approximating function. This choice is
motivated by the following consideration. Because the form of the trend is unknown,
the approximating function should be able to reflect growth, decline and changes in
tendency. A third-degree polynomial is a cubic parabola, which has an inflection
point, so it can "catch" existing, albeit fairly general, changes in the trend.
The procedure is as follows. First, the trend of the initial series is approximated
with a third-degree polynomial, and the value of the first gap is determined from the
trend equation and used to fill it. Next, the series with the filled gap is
approximated by a polynomial of the same degree, and the value for the second gap is
determined from the resulting trend model (approximating function), and so on for the
remaining gaps. After this procedure is completed, the time series graph has the form
shown in Fig. 3.</p>
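      <p>The sequential procedure described above can be sketched in plain Python. The sketch assumes an ordinary least-squares fit of the cubic, solved through the normal equations after rescaling the level numbers to [-1, 1] for numerical stability; neither detail is stated in the text, and the function names <monospace>fit_cubic</monospace> and <monospace>fill_by_trend</monospace> are hypothetical. Gaps are encoded as <monospace>None</monospace>, and levels are numbered from 1.</p>
      <preformat>
```python
def fit_cubic(ts, xs):
    """Least-squares cubic through the points (ts, xs); returns trend(t)."""
    lo, hi = min(ts), max(ts)

    def scale(t):
        # Map level numbers to [-1, 1] to keep the system well conditioned.
        return 2.0 * (t - lo) / (hi - lo) - 1.0

    us = [scale(t) for t in ts]
    # Normal equations A c = b for the basis 1, u, u^2, u^3.
    A = [[sum(u ** (i + j) for u in us) for j in range(4)] for i in range(4)]
    b = [sum(x * u ** i for u, x in zip(us, xs)) for i in range(4)]
    # Gaussian elimination with partial pivoting.
    for k in range(4):
        p = max(range(k, 4), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, 4):
            f = A[r][k] / A[k][k]
            for c in range(k, 4):
                A[r][c] -= f * A[k][c]
            b[r] -= f * b[k]
    coef = [0.0] * 4
    for k in range(3, -1, -1):
        tail = sum(A[k][c] * coef[c] for c in range(k + 1, 4))
        coef[k] = (b[k] - tail) / A[k][k]
    return lambda t: sum(c * scale(t) ** i for i, c in enumerate(coef))


def fill_by_trend(levels):
    """Fill gaps one at a time, refitting the cubic trend after each fill."""
    filled = list(levels)
    gaps = [i for i, v in enumerate(filled) if v is None]
    for g in gaps:
        known = [(i + 1, v) for i, v in enumerate(filled) if v is not None]
        trend = fit_cubic([t for t, _ in known], [v for _, v in known])
        filled[g] = trend(g + 1)
    return filled
```
      </preformat>
      <p>Each filled level takes part in the next fit, so later gaps are estimated from a series that already contains the earlier replacements, as in the procedure described in the text.</p>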
      <p>In Fig. 3, the thin line shows the trend approximated with a third-degree polynomial
after the fourth missing level was filled.</p>
      <p>Analysis of the coefficients of the approximating polynomials showed that in this
case they differ very little from one another, and almost all trend lines merge,
because the difference between them is much smaller than what can be displayed at the
given scale and pixel size. For a more detailed analysis, we compare the changes in the
parameters of descriptive statistics of the series, which are presented in Table 3.
These results show that almost every parameter "feels" the change in the time series
when a missing value is replaced with a value determined from the model. Only the mode,
interval, minimum and maximum remain unchanged to within three decimal places. This
can be explained by the fact that replacing the gaps with values from the models hardly
affects the distribution of the time series levels. The changes in kurtosis and
asymmetry can be considered insignificant, since these parameters describe features of
the shape of the distribution and density curves that cannot be detected visually. The
interval, minimum and maximum remain unchanged because the model values lie inside
the interval, that is, they do not exceed the extreme levels.</p>
      <p>Thus, the problem of filling gaps in real tables or time series can be solved with
different methods, but it is always necessary to keep in mind the features of the source
that generates the data, the situation at the time of data collection, and the
requirements for using the data with filled gaps.</p>
      <p>Table 3. Parameters of descriptive statistics of the series: average, standard error, median, mode, standard deviation, dispersion of levels, kurtosis, asymmetry, interval (scope), minimum, maximum, sum, number of levels.</p>
    </sec>
    <sec id="sec-26">
      <title>Conclusions</title>
      <p>This research shows that even quite simple methods of filling missing values in
tables and time series can give good results, provided that the amount of data is
representative.</p>
      <p>Using the indicators of descriptive statistics as a criterion for evaluating the
replacement of missing values is well founded and sufficient for most real
situations. The random data considered in the examples can be regarded as
stationary random sequences, at least with respect to their mean and variance.
However, in the case of significant nonlinearities and a relatively small number of
missing values, the basic method is to construct an empirical model and match
its properties within the framework of the given task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Little</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          :
          <article-title>Statistical analysis with missing data</article-title>
          . John Wiley &amp; Sons, Inc. (
          <year>1987</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abnane</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Idri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Missing data techniques in analogy-based software development effort estimation</article-title>
          .
          <source>Journal of Systems and Software</source>
          ,
          <volume>117</volume>
          ,
          <fpage>595</fpage>
          -
          <lpage>611</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
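          <string-name>
            <surname>Valencia</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Astudillo-Castro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gajardo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flores</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Calculation of statistic estimates of kinetic parameters from substrate uncompetitive inhibition equation using the median method</article-title>
          .
          <source>Data in Brief</source>
          <volume>11</volume>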
          ,
          <fpage>567</fpage>
          -
          <lpage>571</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Sato-Ilic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Knowledge-based Comparable Predicted Values in Regression Analysis</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>114</volume>
          ,
          <fpage>216</fpage>
          -
          <lpage>223</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.-Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>J.-Y.</given-names>
          </string-name>
          :
          <article-title>Maximum likelihood estimation method for dual-rate Hammerstein systems</article-title>
          .
          <source>International Journal of Control, Automation and Systems</source>
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <fpage>698</fpage>
          -
          <lpage>705</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
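          <string-name>
            <surname>Balakrishnan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wainwright</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Statistical guarantees for the EM algorithm: From population to sample-based analysis</article-title>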
          .
          <source>Ann. Statist</source>
          .
          <volume>45</volume>
          (
          <issue>1</issue>
          )
          <fpage>77</fpage>
          -
          <lpage>120</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>