<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Some approaches for improving quality of tabular data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matrosov Institute for System Dynamics</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Control Theory</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SB RAS</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irkutsk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Russia slv@icc.ru</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Mathematics and Information Technologies of Irkutsk State University</institution>
          ,
          <addr-line>Irkutsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A spreadsheet is one of popular forms for presentation and transferring data of the same types. The area of using this kind of documents is very widespread. Extraction tables from spreadsheets and their understanding are significant tasks that allow getting useful information for further use, for example in processes of integration data that obtained from various sources. As rule tables in spreadsheets create by humans and for humans use. This feature could be the reason that tables may contain messy data such as misprints, errors of calculation, incorrect structure etc. It leads to the complication of automated table processing and understanding. This paper has discussed some approaches to data cleanse that improve the quality of tabular data. The approaches consist of checking and correction of cells calculation and spelling errors. We use phonetic words similarity to correct spelling mistakes in words and heuristic algorithms to detect calculated values in cells.</p>
      </abstract>
      <kwd-group>
        <kwd>spreadsheet data quality data cleanse data correction phonetic algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Spreadsheet is a convenient format of documents to store and present sets
of semistructured data. Since their inception in the 1980s, they have
revolutionised the organisation and storage of data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This kind of documents are
used everywhere, where required to present information of the same type, from
everyday tasks, such as planning a family budget, to professional ones (storing
records about the company’s clients and various kinds of statistical reports).
Thus, spreadsheets are a valuable source of information. Extraction and
subsequent integration of data from spreadsheet within the subject area to do
multicriteria analysis, to identify dependencies between various factors, and to find
patterns.
      </p>
      <p>Copyright c 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Typically, a person is the author of spreadsheets. Most documents of this type
are created by people rather than automatically by machines. Consequently,
data and their organisation may have errors. Some studies report that about
more 90% of documents contain some errors [
        <xref ref-type="bibr" rid="ref1 ref6">6,1</xref>
        ] of various types. It may be
different typos, calculation errors, the inconsistency of measurements, etc. The
presence of errors leads to a decline in data quality. Low data quality complicates
automated table processing such as data extraction and understanding. This is
especially important in data integration processes. In this paper, we propose two
approaches that allow to improve the quality of textual (on Russian language)
and numerical values in spreadsheets.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        To integration, data must be carefully assembled from a variety of sources around
the domain, cleansed and quality-assured [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Messy data complicate the
processing process. For example it is impossible to correctly compare data from different
sources with classifiers values or among themselves. Thus one of the important
issues before data integration is a data cleanse. Data cleanse is the process of
identifying and correcting errors and inconsistencies in order to improve the
quality of the initial (raw) data [13]. This process includes many aspects, such
as identifying spelling errors, missing data, the presence of fictitious, encoded,
composite values, logical inconsistencies. Different methods are used for
detecting and cleansing such errors. These are fuzzy string comparison, phonetic
comparison, neural networks usage etc. This paper is reviewed the errors related to
spelling and calculation.
      </p>
      <p>Estimation of strings similarity is possible with fuzzy string comparison
methods. The algorithms are used to implement fuzzy search that based on the system
of associative access to words that are contained in the text index of the full-text
document storage. The letters that form word(s) act as search units. The search
process is significantly accelerated with the help of a specially created index
containing fragments with links to words in which these fragments were
encountered. The neighbourhoods of permissible distortions are set, with the help of
which the accuracy and completeness of the search are easily regulated [10]. The
size of the distortion is the percentage of different fragments and the allowed
displacements of their positions. Then the words are selected, the fragments of
which coincided with the fragments of the query words. Fuzzy string comparison
methods are independent from the language alphabet. For raising the quality of
misspelling detection the features of the language should be used.</p>
      <p>In Russian language, following errors are possible [11]:
– morphological i.e. when a person tries to write all audible sounds by
letters [12];
– phonologic, i.e. preservation of a writing of phonemes spelling regardless the
word of change;
– phonetic, i.e. words are written as they heard;
– traditional due to historic or traditional style of old times or as in a language
this word was borrowed from.</p>
      <p>
        A lot of spelling errors in Russian language are associated with language
phonetic norms. The type of mistakes generally depends on persons’ education
level [9]. Some of mistakes may be detected and fixed using phonetic algorithms.
Algorithms of phonetic coding are based on the sequence of letters in a word
and their pronunciation rules, converted into an encoding text (code, index, key).
Thereby phonetic algorithms index words by their sound. The conclusion about
the similarity of two different words in sound is based on the coincidence or
similarity of their coding text. Phonetic coding algorithms are effective when
comparing acoustic data with text samples. They have not lost their relevance
even despite the emergence and development of new models and methods of
speech recognition, as they are the basis for their applying in practice. Rules
of words phonetic coding depend on language phonetic norms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and useful for
estimation similarity of words by their pronunciation.
      </p>
      <p>Data cleansing processes concern not only with strings. Numerical values may
require a clean also. Some of them may have different formats of presentation,
may require to undergo cleansing or may be as a result of numerical dependency.</p>
      <p>Checking and verification of numerical data in tables that are result of
calculations, i.e. ”sum”, ”average”, ”minimum”, ”maximum”, etc. is also reviewed in
the paper. Results of calculation are redundant information in data integration
processes. These values also may contain errors, because different techniques
could me used for computation (not necessarily of using specialised software
tools). Thus, identification of calculated values is one of the ways to improve
data quality and, accordingly, to simplify data analysis procedures, including
during integration. Detection of calculated values is one of the components of
data cleanse. Consider some approaches to detection and correction types of
messy data listed above.
3
3.1</p>
    </sec>
    <sec id="sec-3">
      <title>Typos and misspelling correction</title>
      <sec id="sec-3-1">
        <title>Technique of correction</title>
        <p>Often different spelling errors could arise in tables that are formed by persons
without using special dictionaries, classifiers etc. Types of spelling errors in this
case depend on particularities of the language [11].</p>
        <p>Textual information, presented in tabular view has some specificity. As a
rule, these words which are names of something and written in the singular and
nominative case. In this instance, possible not to use rules of words formation
for estimation of their similarity. It is possible to apply phonetic algorithms in
this case (evaluate not only strings similarity).</p>
        <p>Well-known phonetic algorithms are based on words coding using Latin
alphabet characters and the pronunciation rules of English language. Thereby we
suggest to use Polyphon [8] algorithm for coding words of Russian language. The
common stages of the Polyphon are as follows:
– substitution of Latin letters which are similar to Russian with Russian ones;
– removal of all non-Russian alphabet characters from the input string;
– modification of letters before dividers, i.e., special letters as “ъ” and “ь”;
– transformation of doubled characters into one (e.g. “oo” to “o” );
– reduction of a row of the same letters into one letter [8];
– transformation of specially defined character sequences (e.g. “сч” to “ш”).
In details, stages of Polyphon are describe in [7].</p>
        <p>Use of phonetic algorithms in combination with fuzzy string comparison
algorithms increases quality of identification and automated elimination of spelling
errors in the initial (raw) data. In this research, we use the following scheme of
algorithms combination:
– for each word in the dictionary, its Polyphon code is calculated and divided
into trigrams;
– for each value row from user table, the Polyphon code is also calculated, and
divided into trigrams;
– trigrams of Polyphon codes are compared;
– the originals of phonetically similar reference words are compared character
by character with the original words from the tables using “similarity points”
(in percentage).</p>
        <p>If some words have the same level of “similarity points”, it indicates phonetic
similarity. In this case, fuzzy string comparison is used.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Experimental results</title>
        <p>
          The accuracy of the words matching was made. We matched in pairs each
misspelled word with each reference word from Ozhegov’ [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] dictionary. The identical
(have more than one meaning) words were reviewed as one word. Thus, the
initial amount of words for error introduction is 11601. In these words some spelling
errors were made [8]. The approach allowed to correctly identify 93.3% of unique
words, 6% of words were incorrectly identified or had identical codes and points
of similarity for different words, 0.7% of words are not identified at all.
        </p>
        <p>The results of the experiment demonstrate that fuzzy phonetic comparison
with trigrams has results worse results than by using prime coding [7,8]. However
we were able to raise uniqueness of words identification. Accordingly its allow to
improve the quality of textual data cleanse.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Numerical dependencies detection</title>
      <p>Numerical values are the basis for data analysis. Some of values are the results
of calculation (”sum”, ’average”, ”minimum”, ”maximum”, etc.) of another values.
This values could be as a result of hidden numerical dependencies. For
example when the decision about dependency acceptable after data analysis only but
not from the columns/rows labels. Detection of such values significant for data
processing. Firstly, these values are redundant in the consolidated database, due
to they may be calculated with insignificant computational costs. Some of the
numerical dependencies may contain errors. Authors of tables may use different
ways for getting these values, both using special spreadsheets tools like
”formulas” and calculate with a calculator. The variety of possibilities increases the
likelihood of errors. We suggest a set of heuristics for identification of numerical
dependencies, including hidden dependencies.</p>
      <p>The calculated value can be indicated by a label in the header, set by formula,
and to be a result of numerical dependencies. The first two of them require
verification. Thus, all of the features are used for detection and verification of
these values. The complex approach allows for increasing the quality of data
deletion.
4.1
One way to identify calculated values (numerical dependencies) is to search for
keyword values in table labels (headers). It is supposed that a table has
hierarchical header or table itself presented in canonical form. The variety of different
names (table 1) of keywords for denotation of numerical dependencies on Russian
and English are supported.
# Group Keywords
1 sum сумма, сум, итог, итого, всего, sum, summarising
2 average среднее, средн, average, avg
3 minimum минимум, мин, наименьшее, minimum, min
4 maximum максимум, макс, наибольшее, maximum, max</p>
      <p>If a keyword found in the cell, a hypothesis is put forward about the location
of the calculated values and their type. If a keyword is found in the header row,
the calculated column is considered, and the dependency search is performed
to the left and right of it. If the keyword is located in the column header, an
assumption is made about the calculated row, and then the analysis is performed
at the top and bottom of it.
4.2</p>
      <sec id="sec-4-1">
        <title>Analysis of formulas</title>
        <p>When parsing a column, for each row, it searches for the range that contains
the initial position of the calculation. To do this, the cells from the calculated
column to the beginning of the table are searched. For each cell, conditions which
depend on the type of calculations are checked, i.e. analyses all cells from the
calculated column to the beginning (ending) of the table.</p>
        <p>If a formula is written in a cell from a supposedly calculated column, it is
analysed. Thus this algorithm is analysed all the cells to find the beginning of
calculations, to extract this data from the formula, thereby reducing the amount
of processed data.</p>
        <p>The formula in the cell is retrieved as a string and processed using regular
expressions. Such formulas as SUM, MIN, MAX, AVG are analysed. For the
formula processing, its arguments must be inside the table and correspond to
the search direction. If one of these conditions is absent, or the formula is written
in an inappropriate form, a numeric value is extracted from the formula and an
iteration over the cells is performed to find dependencies using the technique of
cells data analyses.
4.3</p>
      </sec>
      <sec id="sec-4-2">
        <title>Cells data analysis</title>
        <p>Process of numerical dependencies in cells detection assumes sequential
processing of all cells in the table in the case when borders of area were not established
on the base of the keywords in labels or values of formulas. Consider reviewing
the technique of cells data analyses for dependencies, presented in table 1 on
the sample of ”sum” dependency. All other dependencies (given in 4.2) are made
analogously in regard to formulas. For each type of dependency, a search is
carried out in columns from right to left, left to right, top-down and bottom-up.
The sample of technique is presented in table 2.</p>
        <p>Let columns A-D in the table 2 were already processed and no calculated
values were found in them. Consider the search of a dependency of ”sum” type
for column E. For each row to the left of the requested column in question, a
search is performed for the cell that contains the first summand for the sum from
the cell in the calculated column. For example, in row 1, the columns from D to A
are explored, at each step the contents of the cells are summed up and compared
with the number in cell E1. The number of terms must be greater than one since
otherwise, the case is possible when the column contains duplicate values and
not the sum.</p>
        <p>The situation is possible when the term is suitable for several columns, not
the one. It is demonstrated in row 2 of table 2. Therefore, a range of column
indexes is written for each row, which can be the first summand for the assumed
sum. The acceptable border range check for other rows. If the number of rows
with the first term in one column is 80% or more of the total number of rows in
the table, it is concluded that the column for which the dependency search was
performed, is the sum - i.e., the calculated value.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Tabular data, such as reports, statistical reviews are a valuable source of
information. However, some of the tables might have errors and negatively affect to
automated table processing and understanding. In this regard, data cleansing is
needed as one of the first steps of data integration. In this paper we suggest two
approaches that allow to improve quality of tabular data. The suggested in the
paper technique is based on phonetic coding and its codes fuzzy comparison for
textual values and calculated values detection for numerical data are presented.
It allows to clean tabular data from messy values, identify some of redundant
data, and check the correctness of calculated values.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The research was supported by the Program of the Fundamental Research of the
Siberian Branch of the Russian Academy of Sciences, project num. IV.38.1.2 (reg.
num. АААА-А17-117032210079-1). Results are achieved using the Centre of
collective usage ¾Integrated information network of Irkutsk scientific educational
complex.
7. Paramonov, V., Shigarov, A., Ruzhnikov, G., Cherkashin, E.: Phonetic string
matching for languages with cyrillic alphabet. In: Advances in Intelligent
Systems and Computing, pp. 301–311. Springer International Publishing (Aug 2018).
https://doi.org/10.1007/978-3-319-99981-4_28
8. Paramonov, V.V., Shigarov, A.O., Ruzhnikov, G.M., Belykh, P.V.: Polyphon: An
algorithm for phonetic string matching in russian language. In: Communications in
Computer and Information Science, pp. 568–579. Springer International Publishing
(2016). https://doi.org/10.1007/978-3-319-46254-7_46
9. Parubchenko, L.: Hypercorrection errors (in russian). Russian Literature pp. 23–27
(04 2005)
10. Sharapov, R., Sharapova, E.: The problem of fuzzy duplicate detection of large
texts. In: CEUR Workshop Proceedings. vol. 2212, pp. 270–277 (2018)
11. Skripnik, Y., Smolenskaya, T.: Phonetics of modern Russian Language (in
Russian), p. 152. VoSIGI, Stavropol (2010)
12. Valgina, N., Rozental, D., Fomina, M.: Modern Russian Language: Textbook (in</p>
      <p>Russian). Logos, Moscow, 6-th edn. (2002)
13. Zhao, L., Yuan, S.S., Peng, S., Wang, L.T.: A new efficient data cleansing method.</p>
      <p>In: Lecture Notes in Computer Science, pp. 484–493. Springer Berlin Heidelberg
(2002). https://doi.org/10.1007/3-540-46146-9_48</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erwig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Header and unit inference for spreadsheets through spatial analyses</article-title>
          .
          <source>In: Proceedings. 2004 IEEE Symposium on Visual Languages and Human Centric Computing(VLHCC)</source>
          . vol.
          <volume>00</volume>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>172</lpage>
          (09
          <year>2004</year>
          ). https://doi.org/10.1109/VLHCC.
          <year>2004</year>
          .29
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alotaibi</surname>
            ,
            <given-names>Y.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meftah</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          :
          <article-title>Review of distinctive phonetic features and the arabic share in related modern research</article-title>
          .
          <source>Turkish Journal of Electrical Engineering and Computer Sciences</source>
          <volume>21</volume>
          ,
          <fpage>1426</fpage>
          -
          <lpage>1439</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ciganek</surname>
          </string-name>
          , J.:
          <article-title>Design and implementation of open-data data warehouse</article-title>
          .
          <source>In: 2019 6th International Conference on Advanced Control Circuits and Systems (ACCS) &amp; 2019 5th International Conference on New Paradigms in Electronics &amp; information Technology (PEIT)</source>
          .
          <source>IEEE (Nov</source>
          <year>2019</year>
          ). https://doi.org/10.1109/accspeit48329.
          <year>2019</year>
          .
          <volume>9062879</volume>
          , https://doi.org/10.1109/accs-peit48329.
          <year>2019</year>
          .9062879
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Guerrero</surname>
          </string-name>
          , H.:
          <article-title>Excel Data Analysis</article-title>
          . Springer International Publishing (
          <year>2019</year>
          ). https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -01279-3
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ozhegov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Dictionary of Russian language</article-title>
          .
          <source>About 53000 words. (Словарь русского языка: Ок. 53 000 слов)</source>
          (in Rusian). Oniks, World and education, Moscow,
          <volume>24</volume>
          <fpage>edn</fpage>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Panko</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>Spreadsheet errors: What we know. what we think we can do</article-title>
          .
          <source>CoRR abs/0802</source>
          .3457 (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>