=Paper= {{Paper |id=Vol-1108/paper3 |storemode=property |title=Semi-automatic Data Extraction from Tables |pdfUrl=https://ceur-ws.org/Vol-1108/paper3.pdf |volume=Vol-1108 |dblpUrl=https://dblp.org/rec/conf/rcdl/AstrakhantsevTV13 }} ==Semi-automatic Data Extraction from Tables == https://ceur-ws.org/Vol-1108/paper3.pdf
               Semi-automatic Data Extraction from Tables
     © Nikita Astrakhantsev                      © Denis Turdakov                            © Natalia Vassilieva
            ISPRAS                                   ISPRAS                                       HP Labs
            Moscow                                   Moscow                                   Saint-Petersburg
     astrakhantsev@ispras.ru                    turdakov@ispras.ru                            vassilieva@hp.com


                       Abstract                                    suggest The Periodic Table of the Elements as an
                                                                   example of the table requiring semantic knowledge for
    This paper describes a novel approach to                       interpretation; the authors of this survey also state that it
    automate extraction of useful information from                 is easy to contrive other examples “that are challenging
    tables and to record the knowledge procured in                 even from a human perspective”.
    a structured data repository. The approach is                      We present a semi-automatic approach that tracks
    based on modeling a behavior of an expert,                     actions of a domain expert, when he/she begins to
    who collects tabular data and maps them to a                   process a table (map cell data to a relational scheme),
    predefined relational schema. Experimental                     derives regularities/patterns of expert behavior, and
    results demonstrate that the proposed approach                 applies them to the rest of the table in order to predict
    predicts expert decisions with high accuracy                   further mappings.
    and thus significantly minimizes the time                          This paper is organized as follows: In Section 2 we
    required of an expert for data aggregation.                    give an overview of related works. Section 3 discusses
                                                                   the table format used in our work. Section 4 describes
1 Introduction                                                     the general architecture of our prototype, while
                                                                   Section 5 describes it in details. Section 6 presents our
Tables are widely used in scientific, financial and other
                                                                   experimental evaluation. We conclude in Section 7.
analytical documents to concisely communicate
information to human readers. A table usually contains
information about objects of the same type. These                  2 Related work
objects can be easily perceived by the reader through              Silva et. al [11] survey about 50 works devoted to tables
content and relations between different table elements             processing in details. Authors outline several table-
(cells, rows, etc.), because he/she is experienced in table        related tasks and corresponding table representation
reading and is able to take in both semantic and                   models, which serve as input and output for these tasks.
structural information. Sometimes two tables with                  The span of tasks goes from location of a table in a
exactly the same structure are interpreted completely              document to semantic interpretation of the information
different just because of slight difference in the external        contained in the table. There are more works focused on
context or because the content of some cells differs.              the basic low-level table related tasks than on the more
    However, it is hard to automate table analysis and             knowledge based ones. A deep interpretation of the
information is extracted mainly by hand by interested              table is almost always requires context specific
parties. A common scenario used by experts is manual               knowledge. Existing solutions for extracting
cell by cell extraction of data from a table into a                information are very domain specific and designed for
relational database, and then using OLAP or other                  particular table types.
techniques to generate reports and perform analytics                   Zanibbi et. al [15] present the table recognition
over these data. Efforts required for manual extraction            literature in terms of the interaction of table models,
are considerable, while the time of experts is always              observations, transformations, and inferences. Most of
costly.                                                            described methods are fully automatic and consider the
    There is no end-to-end solution for automatic                  task of table processing in isolation from further usage
information extraction from arbitrary tables. And as it            of the extracted information.
appears to us, construction of a fully automatic                       Embley et al. [2] extract data from XML tables and
instrument is hardly feasible. It might be possible to             map them to a given target database schema with 96/85
parse automatically the structure of any table, but                precision and 93/91 recall (depending on a domain —
semantic interpretation of tabular data requires the               car advertisement and cell-phone correspondingly).
knowledge of a domain expert. Embley et al. [1]                    However, their approach requires a hand-crafted
                                                                   ontology, which is costly and, more important, cannot
                                                                   always be in place due to high specificity of certain
Proceedings of the 15th All-Russian Conference                     documents.
"Digital Libraries: Advanced Methods and                               More       recent     work     of     Embley     and
Technologies, Digital Collections" ― RCDL-2013,                    Krishnamoorthy [3] transforms CSV or HTML tables
Yaroslavl, Russia, October 14-18 2013.




                                                              14
into a canonical representation in order to obtain a                 Thus, except for the highly specialized programs
target representation, one of which is Relational table.         like VeriClick and commercial tools like Google Refine
A canonical table representation is based on Header              or MS Excel, we are not aware of any research on
Paths, a purely syntactic technique that relates headers         interactive information extraction from tables; and the
and data cells. Further transformation into a target             authors of VeriClick confirm this observation [5]. It
representation is performed by specially defined                 also should be noted that most works try to convert a
relational algebra. In contrast to the previous approach,        table into some more usable format, but not to extract
this one does not use any semantic knowledge; also it            needed parts of information from a table.
demands a user to point out the top-left data cell in
cases where proposed heuristics cannot define such a             3 Table format
cell automatically. Precision of correct Header Paths
construction is about 74%. In [8] the authors evolve this        Like other basic notions, 'table' has a lot of different
approach by using interactive tool VeriClick [9], ‘a             definitions.
macro-enabled spreadsheet interface that provides                    Peterman et al. [10] suggest the following intuitive
ground-truthing,      confirmation,     correction,   and        definition: “tables have a regular repetitive structure
verification functions for CSV tables’. However, this            along one axis so that the data type is determined either
tool is used only to locate so-called ‘critical cells’:          by the horizontal or vertical indices.” The definition
corner cells that allows to distinguish header and data          given by Lopresti et al. [7] consists of similar items:
regions in a table.                                                  1. 2-D cell assembly for presenting information;
    Vasudevan et. al [14] address a closely related task:            2. Regular, repetitive structure along at least one
to automate data extraction from financial reports                         axis;
presented in PDF documents with many tables. The                     3. Datatype determined by either horizontal or
proposed approach shows good results (95.7% precision                      vertical index.
and 78.4% recall), but it requires a quality review stage            Tijerino et al. [12] uses standard definition of a
and, since it is based on domain knowledge heuristics,           relational table.
the range of its application is severely limited.                    In this work we consider a table to be a set of cells
    Gatterbauer et al. [4] extract information from web          with some text content. In other words, it is the only
tables by using two-dimensional visual model provided            property of a table that is used explicitly; other
by web browsers instead of tree-based (HTML)                     properties like repetitive structure or the same datatype
representation. Fumarola et al. [5] combine knowledge            in a column or a row are considered by our method
about the visual structure of the Web page and the               implicitly.
HTML markup for web lists extraction. Their tool is                  Our prototype takes HTML tables as an input;
applicable for web tables, too; moreover, the authors            therefore we use terminology from HTML in order to
evaluate accuracy on the same dataset as Gatterbauer             describe cell properties; for example, colspan as a
and report very good quality (more than 99% precision            relative width of a table cell. However, our method does
and recall on table records). But this dataset is domain-        not depend on any specific properties of HTML format.
independent, and target format of table analysis is more             We use classical spreadsheet addressing for cells.
general than ours; therefore, it cannot be directly              For example, A2 of Table 1 is an empty cell in the first
compared with our method.                                        column and the second row with rowspan equal 2.
    Looking at semi-automatic tools, Google Refine 1
should be noted. It is "a power tool for working with            Table 1: Example of a source table with spreadsheet
messy data, cleaning it up, transforming it from one                                coordinates
format into another, extending it with web services, and             A         B          C         D        E
linking it to databases like Freebase". However, Google          1 Secret budget
Refine is mostly for data cleansing and is not capable of        2             FY 2009              FY 2010
performing information extraction and interpretation. It         3             Oper.      Capital Oper. Capital
also cannot track user decisions and anticipate them.            4 HP          10         20        30       40
    Similarly, Microsoft Excel2 can perform such                 5 Oracle      50         60        10       20
transformations using formulas and macros, but it                6 Samsung 12             34        56       78
demands user to define these formulas explicitly.
    Praedea3 is another semi-automatic tool that uses            4 General architecture
predefined text-mining models (chosen by user) for
extracting required data from unstructured documents             Assume that a user processes a table step-by-step.
and mapping them to database or XML scheme. It is not            During each step the user selects multiple cells from the
capable to process a table in case there is no suitable          table and maps them to a record in a relational database.
predefined model for it.                                         Let us define user feedback as information about
                                                                 mappings from a single user step. See Table 2 for an
1 http://code.google.com/p/google-refine                         example of the initial user feedback provided by
                                                                 him/her while processing Table 1.
2 http://office.microsoft.com/en-us/excel/
3 http://www.praedea.com




                                                            15
 Table 2: User feedback example: two first columns                  forthcoming actions. The key idea of the approach is to
 represent data about the target relational scheme;                 consider the shift from one user step to the next one.
        two last ones represent table cell data                     More precisely, a shift consists of row and column
Relation         Attribute        Cell          Value               offsets (possibly zero) for each cell represented in a pair
name             name             position                          of sequential user feedbacks. Since any table contains
Report           Company          A4            HP                  several objects of the same type and structure, we try to
Report           Operating        B4            10                  recognize it by tracking sequential user steps.
Report           Financial Year B2              2009                    Table 3 shows an example of a shift from the first
   Figure 1 shows the general architecture of the                   feedback (underlined words) to the second one (bolded
prototype.                                                          words). Note that some cells remain fixed, e.g. B2.

            Figure 1: General architecture                             Table 3: Shift example: from (HP, 10, 2009) to
                                                                                      (Oracle, 50, 2009)
                                                                        A           B          C         D       E
                                                                     1 Secret budget
                                                                     2              FY 2009              FY 2010
                                                                     3              Oper.      Capital Oper. Capital
                                                                     4 HP           10         20        30      40
                                                                     5 Oracle       50         60        10      20
                                                                     6 Samsung 12              34        56      78

                                                                        Shift approach is used in different ways depending
                                                                    on the number of the available user feedbacks. In
                                                                    section 5.1 we describe a regular phase of our
                                                                    algorithm: when at least two user feedbacks are
                                                                    available and we can explicitly compute shifts. Section
                                                                    5.2 presents the phase when we have only one user
                                                                    feedback and have to guess a correct shift. Section 5.3
                                                                    considers the phase when user starts to process a table
                                                                    and we have to predict his/her actions based on the
                                                                    previously processed tables.

                                                                    5.1 Regular phase
    The prototype takes as input an HTML table and                  After two user steps we can construct a shift by
enriches it by the predefined Natural Language                      computing row and column offsets explicitly for every
Processing pipeline; currently we add only information              cell. For example, after two steps of processing Table 1
about Named Entity types.                                           the following shift is constructed: the first two cells are
    Enriched table is further processed by the main part            shifted down by one cell and the third cell stays at the
of the prototype - Table processor on the diagram. It               same position.
generates output, which has the same format as the user                 The constructed shift is further applied to cells of
feedback, i.e. information about mappings that assumed              the last user step in order to get new cells and create a
to be obtained by the user. This output is reviewed by              new row in a relational database with the same
the user and he/she returns feedback. The prototype                 relation/attribute information and new values taken
stores information from the user feedback into the                  from new cells.
Target database and uses this feedback in order to                      As it was shown in the example above, sometimes
generate the next output. History means a local storage             the value returned by the user is not just a text content
of the previous feedbacks, including ones from other                of the cell, but its derivative: compare “FY 2009” in
tables that were already processed by the user.                     Table 1 with just “2009” in the feedback, Table 2.
Information about previous actions is also used by the              Solution of this problem is based on the observation
prototype for the output generation.                                that shifted cells share similar structure of their content.
                                                                    We store the way of producing the value from the cell
5 Shift approach                                                    in form of a regular expression. More precisely, we
                                                                    have prepared a set of patterns, which should be tested
Our approach is based on two observations: (a) each                 to match cell content against the value returned by the
table is a layout structure for storing similar objects; and        user within his/her feedback. Currently there are two
(b) closely located tables (e.g. from the same document             types of patterns: (a) taking a substring of original cell
or Web-site) usually share similar structure and context.           content, and (b) replacing a substring of original cell
A user processes similar objects from the same table                content with another predefined string. Patterns of both
sequentially by repeating similar actions; and this                 types contain corresponding regular expressions to be
repetition allows the system to learn and anticipate the            applied to cell content.




                                                               16
                                       Table 4: Example of cell value patterns
Pattern                           Description                               Cell value                  Feedback value
.*::::0                           Takes the value as is                     HP                          HP
(.*)\s(.*)::::1::::2              Splits by space and tries each part       Financial year              year
(\d+)::::1                        Takes only digits from the content        146%                        146
thousand::::000::::ReplaceAll     Replaces all words thousand by 3 zeros    45 thousand                 45000

    Patterns of the first type also contain matching               metrics taken from SimMetrics4, the best results have
group numbers to be taken into account; patterns of the            been obtained using Levenstein distance [6]. In
second type contain a replacement string and a key                 addition, we modified string metric as follows:
word ReplaceAll in order to distinguish this type. See                 1. All digits are considered to be equal characters,
Table 4 for the examples of cell value patterns.                          because, as it is written above, we want to
Sequence of colons (::::) serves as a delimiter.                          capture strings of the same attribute, or data type,
    These patterns work as follows. Assume, have a cell                   and difference between numeric strings tells
with content “FY 2009” and the corresponding                              nothing about difference between attributes. Note
feedback value is “2009”. We try to apply the first                       that we do not consider all numbers to be equal,
pattern (.*) to cell content, which means that we simply                  because much difference in orders of magnitude
take the whole content as is. The result does not match                   can indirectly indicate different attributes.
with the feedback value. Then we try the next pattern                  2. If one of the strings is empty, we use pre-defined
((.*)\s(.*)::::1::::2). The result matches with the                       value (0.5): sometimes cell values are missed, for
feedback value and we store this pattern for the                          example, it can mean that the previous value or
corresponding cells of the shift.                                         some default value should be taken instead.
If there are more than two steps produced by the user,                    Unmodified metrics return zero similarity for
then we construct candidate shifts for all paired                         such cases, but cell value absence does not
combinations of user feedbacks, assess them, and                          necessarily indicate the difference in attributes.
choose the best shift in order to apply it to cells of the             3. If both strings are long (more than 3 words), we
last user feedback. For example, if there are 3                           use pre-defined value (0.8): again, difference in
feedbacks, we construct 3 candidate shifts: 1-to-2, 1-to-                 long texts does not reflect difference in
3, and 2-to-3; all 3 candidate shifts are assessed                        attributes.
independently, and the best shift is then applied to the               Named entity type consistency: predefined value
last, 3rd feedback.                                                for 3 cases depending on named entity types of cells
    Shift assessment is performed by computing linear              contents:
combination of the following 4 features:                               1. Named entity types are equal – value is 1;
    Average shift length: normalized sum of lengths of                 2. Named entity types are unequal – value is 0;
cells offsets. For the example shift, there are 2 offsets              3. Named entity types are both undefined – value
having length 2 and 1 offset having length 0, so the                      is 0.5;
feature value is 2/3.                                                  Motivation is the same as in the previous feature,
    Motivation: people tend to process the table                   but here we utilize information about named entity
sequentially, taking closest cells if possible. In other           types in order to check attribute consistency. We use a
words, shorter shifts are preferable by users.                     conditional random field (CRF) model with a
    We also tried different weights for different                  combination of different popular features applied in
directions, e.g. bigger weight for the right offset than           supervised named entity recognition [13]. There are 6
for the down one, because top-down processing seem to              supported named entity types: Acronym, Date,
be more common, but experiments show that different                Location, Numeric, Organization, Person.
weights do not introduce any positive effect.                          Coefficients for the linear combination are chosen
    Cells offsets consistency: normalized number of                experimentally to maximize the accuracy: we test all
most common cells offsets. For the shift in the example,           possible values from 0 to 1 with step 0.2 so that their
there are 2 types of offsets: 2 down-by-1 offsets and 1            sum equals to 1. We found the best accuracy to be
remain-fixed offset, so the feature value is 2/3 again.            obtained with 0, 0.4, 0.2, 0.4 correspondingly; the
    Motivation: a shift usually contains similar offsets of        further granulation does not change the result.
cells, e.g. all 4 cells are shifted down by 1 more often,
then when 2 cells are shifted down by 1 and other 2                5.2 One User Feedback Phase
cells are shifted down by 2 or right by any number.                    Given one user feedback we need a shift to apply it
    Average text similarity: normalized value of string            to this user feedback as it is done in Regular phase.
similarity metrics computed over corresponding                     There are two ways: take a most appropriate shift from
(shifted) cells contents.                                          the previously processed tables or construct a shift from
    Motivation: when we shift one cell to another, both
of them must contain values of the same attribute, and
                                                                   4
different values of the same attributes are usually                 SimMetrics is a Similarity Metric Library provided by
similar text strings.                                              UK Sheffield University
    To compute such similarity we tried several string             http://sourceforge.net/projects/simmetrics/




                                                              17
scratch. To find the most appropriate shift we check all                           Table 6: Source of stored feedback
stored shifts for applicability, and then assess similarity                   A           B        C
of all applicable shifts modified in accordance with the                  1   Public budget
current table. To assess the modified shift we use the                    2   Company Capital Share of market
linear combination as in section 5.1, but coefficients are                3   Dell        100      27%
re-estimated (0.2, 0, 0.6, 0.2).                                          4   Lenovo      105      33%
    This way is very similar to No user feedback phase,
see Section 5.3 for details. In short words, modification                          Table 7: Applied stored feedback
means that we replace contents of cells in the shift by               Relation      Attribute      Cell Value Cell
the contents of the corresponding cells of the current                                                            pattern
table. For example, if there is a stored shift (from some             Report        Company        A4     HP      .*::::0
previously processed table) with just one cell offset –               Report        Capital        B4     10      .*::::0
C4-to-C5 with contents “USA”-to-“Russia” – then we
                                                                      Report        Share of       C4     20      (\d+)::::1
modify the shift so that now contents of the cell offset is
                                                                                    market
“20”-to-“60”: we take contents of C4 and C5 cells of
Table 1. Note that if the shift contains cell offset like
                                                                          After that we construct a shift from the stored
B1-to-B2 with ordinary colspan and rowspan (all equal
                                                                      feedback (Table 5) to the obtained feedback (Table 7)
to 1), then such the shift is inapplicable for the Table 1,
                                                                      and assess it in the way described in section 3.1 with re-
because B1 and B2 cells of this table have different
                                                                      estimated coefficients (0.2, 0, 0.4, 0.4). Such
colspans.
                                                                      assessment allows us to choose the most similar table
    If there is no suitable shift, i.e. similarity of the best
                                                                      among all already processed ones.
shift does not reach a predefined threshold, then we
                                                                          Note that the stored feedback from the 3rd row, but
construct a new one. For this purpose, we iterate over
                                                                      not 4th, is not applicable to Table 1, because A3 cells
combinations of all possible offsets of row and column
                                                                      have different rowspans.
for each cell. To limit combinatorial explosion we do
not consider offsets above predefined thresholds: 5 for
rows and 3 for columns. Each offset combination is                    6 Evaluation
actually a shift that can be assessed as it is described              We compiled a set of 30 tables containing financial
above; coefficients are left the same.                                reports and a set of more than 150 corresponding user
                                                                      feedbacks5. We use the following test metrics: accuracy,
5.3 No User Feedback Phase
                                                                      precision and recall. Accuracy shows the fraction of
When no user feedback is available for the current                    outputs that fully match the user feedback: if at least
table, we can use information about previously                        one value in the tool output is wrong, the whole answer
processed tables. We store all user feedbacks during                  is considered to be wrong. Precision and recall
table processing for the case if some of them have a                  characterize the number of correct cell mappings.
structure (set of cell positions) appropriate for a new               Obviously, precision and recall can be much higher than
table. To choose the most appropriate user feedback we                accuracy, because many answers are partially correct.
first check all of them for applicability, i.e. current table             Table 8 shows the results for the regular phase; the
must have cells in all cell positions of the user feedback,           2nd and the 3rd rows show efficiency of shift
and these cells must have the same characteristics —                  constructing module and shift choosing module
particularly, colspan and rowspan. Then all applicable                respectively. We consider the shift constructor to work
user feedbacks are assessed by constructing and                       correctly if there is a right output (maybe not the chosen
assessing special "fake" shifts from the stored feedback              one); for the shift chooser we count only those outputs
to the feedback obtained by applying the stored                       when there is a correct shift constructed and thereby the
feedback structure to current table. Assume we have a                 shift chooser had a chance to choose it.
feedback shown in Table 5 from the already processed
Table 6 and we just begin to process Table 1.                                     Table 8: Regular phase results
                                                                      Total accuracy                           74%
        Table 5: Example of stored feedback                           Shift constructor accuracy               78%
 Relation Attribute     Cell Value       Cell                         Shift chooser accuracy                   94%
                                         pattern                      Precision                                84%
 Report    Company      A4      Lenovo .*::::0                        Recall                                   80%
 Report    Capital      B4      105      .*::::0
 Report    Market       C4      33       (\d+)::::1                       Table 9 shows the results for the first 2 phases. Easy
           share                                                      to see that these results depend on feedbacks from
                                                                      previously processed tables, because each user feedback
    Then we take a structure of the stored feedback, that             is stored during the processing and affects the following
is actually a cell position and a cell pattern, and apply it          outputs. However, the order of tables inside the
to the considered Table 1, see Table 7.
                                                                      5
                                                                          Dataset's URL : http://modis.ispras.ru/datasets/td.zip




                                                                 18
document may be valuable; therefore, we shuffle order                 7 Conclusions and Future Work
of blocks containing tables from the same document.
    It is worth mentioning that the results of the first              In this paper, we focused on the task of semi-automatic
two phases also depend on stored shifts (feedbacks, for               data extraction from tables and mapping them to
the first phase) and at the begging of the work, without              relational scheme; we introduce novel approach that
any stored shift, we face the problem of cold start. That             tracks user decisions to predict forthcoming ones; we
is why we add tests with prepared set of 5 simple stored              evaluated it on our test data.
feedbacks and shifts from other tables.                                   Shift approach provides general scheme for semi-
    In addition, we also run tests when all feedbacks and             automatic table processing, but many problems are out
shifts from the same 30 test tables are stored and used               of scope of this research. One of them is extraction of
for the testing. Of course, it is not an absolutely fair test,        value from table cell. In most cases it is sufficient to
because there are stored feedbacks and shifts with                    take text substrings (see cell B2 in Table 1), but
strictly the same text content, but these tests may help to           sometimes certain table cell is actually a set of different
understand if the mistakes are caused by the problem of               attribute values that should be processed by more
missing stored shifts or the incorrect choice of the                  complex methods. For instance, a price list of a
stored shift to be applied.                                           hardware store often contains cells like the following:
                                                                          HP "Pavilion dm4-2102er" QJ453EA (Core i5
 Table 9: Results for No user feedback and One user                   2430M-2.40GHz, 6144MB, HD6470M, WebCam)
                   feedback phases                                        Another direction of further research is related to
Phase      Number       Accuracy Precision Recall                     target scheme: currently we copy information about
           of stored                                                  relation and attribute for shifted cells, but methods that
No user    0            4%          11%       7%                      are more sophisticated can consider semantics of both
feedback feedbacks
           5            4%          8%        6%                      table content and relational scheme.
           all          16%         28%       25%
One user 0              43%         91%       90%                     References
feedback 5              48%         87%       86%                      [1] D. W. Embley, M. Hurst, D. Lopresti, G. Nagy.
           all          86%         87%       86%                         Table-processing paradigms: a research survey.
                                                                          International Journal of Document Analysis and
    Table 10 shows results for each module for one user                   Recognition (IJDAR), 8(2-3), p. 66-86, 2006.
feedback phase. The similarity estimator chooses the                   [2] D. W. Embley, C. Tao, S. W. Liddle. Automating
most similar shift among all stored ones. The similarity                  the Extraction of Data from HTML Tables with
threshold is estimated perfectly: if the similarity                       Unknown Structure. Knowledge Engineering, 54
estimator chooses an appropriate shift among the stored                   (1), p. 3-28, 2005.
ones, then we always choose it and never try to                        [3] D. W. Embley, M. Krishnamoorthy. Factoring
construct our one instead. The shift constructor works                    Web Tables. Proceedings of 24th International
not bad, it means that our algorithm constructs most of                   Conference on Industrial Engineering and Other
possible shifts, but the shift chooser makes a lot of                     Applications of Applied Intelligent Systems,
mistakes.                                                                 p. 253-263, 2011.
                                                                       [4] F. Fumarola, T. Weninger, R. Barber, D. Malerba,
   Table 10: Accuracy of modules for one user                             and J. Han. HyLiEn: a hybrid approach to general
                 feedback phase                                           list extraction on the web. Proceedings of the 20th
 Evaluated      0 stored    5 stored   All stored                         international conference companion on World wide
 module         shifts      shifts     shifts                             web, pp. 35–36, 2011.
  Similarity        100%          100%       100%                      [5] W. Gatterbauer, P. Bohunsky, M. Herzog, B.
  threshold
  Similarity        100%          100%       100%                         Krüpl, and B. Pollak. Towards domain-independent
  estimator                                                               information extraction from web tables.
  Shift chooser     25%           20%        0%
                                                                          Proceedings of the 16th international conference on
  Shift             80%           77%        33%                          World Wide Web, pp. 71–80, 2007.
  constructor
                                                                       [6] V. I. Levenshtein. Binary codes capable of
    Some mistakes in the regular phase could be                           correcting deletions, insertions, and reversals.
explained by the following: sometimes a user goes from                    Soviet Physics Doklady, 10: p. 707–10, 1966.
the top to the bottom of the left part of the table (e.g.              [7] D. Lopresti, G. Nagy. A tabular survey of
takes all values from the first and the second columns)                   automated table processing. Graphics Recognition
and then he/she repeats the similar actions for the right                 Recent Advances, p. 93-120, 2000.
part (takes all values from the first and the third
                                                                       [8] G. Nagy, S. Seth, D. Jin, D. W. Embley,
columns). Our tool cannot predict a correct shift at the
                                                                          S. Machado, and M. Krishnamoorthy. Data
moment when the user switches to the right part, that is
                                                                          extraction from web tables: The devil is in the
why results for shift constructor are low.
                                                                          details. Document Analysis and Recognition
                                                                          (ICDAR), pp. 242–246, 2011.




                                                                 19
 [9] G. Nagy and M. Tamhankar. VeriClick: an
    efficient tool for table format verification.
    IS&T/SPIE Electronic Imaging, p. 82970M–
    82970M, 2012.
[10] C. Peterman, C.H. Chang, H. Alam. A system for
    table under-standing. Proceedings of the
    Symposium on Document Im-age Understanding
    Technology (SDIUT’97), p. 55–62, 1997.
[11] A. C. Silva, A. M. Jorge, L. Torg. Design of an
    end-to-end method to extract information from
    tables. International Journal on Document Analysis
    and Recognition (8), No. 2-3, p. 144-171, 2006.
[12] Y. A. Tijerino, D. W. Embley, D. W. Lonsdale,
    Y. Ding, G. Nagy. Towards ontology generation
    from tables. World Wide Web 8, no. 3, 261-285,
    2005.
[13] M. Tkachenko, A. Simanovsky. Named entity
    recognition: Exploring features. Proceedings of
    KONVENS 2012, p. 118-127, 2012.
[14] B. G. Vasudevan, A.G. Parvathy, A. Kumar,
    R. Balakrishnan. Automated Knowledge-based
    Information Extraction from Financial Reports.
    Knowledge Engineering and Management, 7(5),
    p. 61-68, 2009.
[15] R. Zanibbi, D. Blostein, J. R. Cordy. A survey of
    table recognition: Models, observations,
    transformations, and inferences. International
    Journal of Document Analysis and Recognition,
    7(1), Springer, Heidelberg, p. 1–16, 2004.




                                                         20