=Paper=
{{Paper
|id=Vol-1108/paper3
|storemode=property
|title=Semi-automatic Data Extraction from Tables
|pdfUrl=https://ceur-ws.org/Vol-1108/paper3.pdf
|volume=Vol-1108
|dblpUrl=https://dblp.org/rec/conf/rcdl/AstrakhantsevTV13
}}
==Semi-automatic Data Extraction from Tables==
Semi-automatic Data Extraction from Tables

© Nikita Astrakhantsev, ISPRAS, Moscow, astrakhantsev@ispras.ru
© Denis Turdakov, ISPRAS, Moscow, turdakov@ispras.ru
© Natalia Vassilieva, HP Labs, Saint-Petersburg, vassilieva@hp.com
Proceedings of the 15th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL-2013), Yaroslavl, Russia, October 14-18, 2013.

Abstract

This paper describes a novel approach to automate extraction of useful information from tables and to record the procured knowledge in a structured data repository. The approach is based on modeling the behavior of an expert who collects tabular data and maps them to a predefined relational schema. Experimental results demonstrate that the proposed approach predicts expert decisions with high accuracy and thus significantly reduces the time required of an expert for data aggregation.

1 Introduction

Tables are widely used in scientific, financial and other analytical documents to concisely communicate information to human readers. A table usually contains information about objects of the same type. These objects can be easily perceived by the reader through content and relations between different table elements (cells, rows, etc.), because he/she is experienced in table reading and is able to take in both semantic and structural information. Sometimes two tables with exactly the same structure are interpreted completely differently just because of a slight difference in the external context or because the content of some cells differs.

However, it is hard to automate table analysis, and information is extracted mainly by hand by interested parties. A common scenario used by experts is manual cell-by-cell extraction of data from a table into a relational database, followed by OLAP or other techniques to generate reports and perform analytics over these data. Efforts required for manual extraction are considerable, while the time of experts is always costly.

There is no end-to-end solution for automatic information extraction from arbitrary tables, and, as it appears to us, construction of a fully automatic instrument is hardly feasible. It might be possible to parse the structure of any table automatically, but semantic interpretation of tabular data requires the knowledge of a domain expert. Embley et al. [1] suggest the Periodic Table of the Elements as an example of a table requiring semantic knowledge for interpretation; the authors of this survey also state that it is easy to contrive other examples "that are challenging even from a human perspective".

We present a semi-automatic approach that tracks the actions of a domain expert when he/she begins to process a table (map cell data to a relational scheme), derives regularities/patterns of the expert's behavior, and applies them to the rest of the table in order to predict further mappings.

This paper is organized as follows: in Section 2 we give an overview of related work. Section 3 discusses the table format used in our work. Section 4 describes the general architecture of our prototype, while Section 5 describes it in detail. Section 6 presents our experimental evaluation. We conclude in Section 7.

2 Related work

Silva et al. [11] survey about 50 works devoted to table processing in detail. The authors outline several table-related tasks and corresponding table representation models, which serve as input and output for these tasks. The span of tasks goes from location of a table in a document to semantic interpretation of the information contained in the table. There are more works focused on the basic low-level table-related tasks than on the more knowledge-based ones. Deep interpretation of a table almost always requires context-specific knowledge. Existing solutions for extracting information are very domain-specific and designed for particular table types.

Zanibbi et al. [15] present the table recognition literature in terms of the interaction of table models, observations, transformations, and inferences. Most of the described methods are fully automatic and consider the task of table processing in isolation from further usage of the extracted information.

Embley et al. [2] extract data from XML tables and map them to a given target database schema with 96/85 precision and 93/91 recall (depending on the domain: car advertisements and cell phones, correspondingly). However, their approach requires a hand-crafted ontology, which is costly and, more importantly, cannot always be in place due to the high specificity of certain documents.
More recent work of Embley and Krishnamoorthy [3] transforms CSV or HTML tables into a canonical representation in order to obtain a target representation, one of which is a relational table. The canonical table representation is based on Header Paths, a purely syntactic technique that relates headers and data cells. Further transformation into a target representation is performed by a specially defined relational algebra. In contrast to the previous approach, this one does not use any semantic knowledge; it also requires the user to point out the top-left data cell in cases where the proposed heuristics cannot identify such a cell automatically. Precision of correct Header Path construction is about 74%. In [8] the authors evolve this approach by using the interactive tool VeriClick [9], "a macro-enabled spreadsheet interface that provides ground-truthing, confirmation, correction, and verification functions for CSV tables". However, this tool is used only to locate so-called "critical cells": corner cells that make it possible to distinguish header and data regions in a table.

Vasudevan et al. [14] address a closely related task: to automate data extraction from financial reports presented in PDF documents with many tables. The proposed approach shows good results (95.7% precision and 78.4% recall), but it requires a quality review stage and, since it is based on domain knowledge heuristics, the range of its application is severely limited.

Gatterbauer et al. [5] extract information from web tables by using the two-dimensional visual model provided by web browsers instead of the tree-based (HTML) representation. Fumarola et al. [4] combine knowledge about the visual structure of the Web page and the HTML markup for web list extraction. Their tool is applicable to web tables, too; moreover, the authors evaluate accuracy on the same dataset as Gatterbauer and report very good quality (more than 99% precision and recall on table records). But this dataset is domain-independent, and the target format of table analysis is more general than ours; therefore, it cannot be directly compared with our method.

Looking at semi-automatic tools, Google Refine (http://code.google.com/p/google-refine) should be noted. It is "a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase". However, Google Refine is mostly for data cleansing and is not capable of performing information extraction and interpretation. It also cannot track user decisions and anticipate them. Similarly, Microsoft Excel (http://office.microsoft.com/en-us/excel/) can perform such transformations using formulas and macros, but it requires the user to define these formulas explicitly.

Praedea (http://www.praedea.com) is another semi-automatic tool; it uses predefined text-mining models (chosen by the user) for extracting required data from unstructured documents and mapping them to a database or XML scheme. It cannot process a table if there is no suitable predefined model for it.

Thus, except for highly specialized programs like VeriClick and commercial tools like Google Refine or MS Excel, we are not aware of any research on interactive information extraction from tables; the authors of VeriClick confirm this observation [9]. It also should be noted that most works try to convert a table into some more usable format, but not to extract the needed parts of information from a table.

3 Table format

Like other basic notions, 'table' has a lot of different definitions.

Peterman et al. [10] suggest the following intuitive definition: "tables have a regular repetitive structure along one axis so that the data type is determined either by the horizontal or vertical indices." The definition given by Lopresti et al. [7] consists of similar items:
1. 2-D cell assembly for presenting information;
2. Regular, repetitive structure along at least one axis;
3. Datatype determined by either horizontal or vertical index.
Tijerino et al. [12] use the standard definition of a relational table.

In this work we consider a table to be a set of cells with some text content. In other words, this is the only property of a table that is used explicitly; other properties, like repetitive structure or the same datatype in a column or a row, are considered by our method implicitly.

Our prototype takes HTML tables as input; therefore we use terminology from HTML in order to describe cell properties, for example, colspan as the relative width of a table cell. However, our method does not depend on any specific properties of the HTML format.

We use classical spreadsheet addressing for cells. For example, A2 of Table 1 is an empty cell in the first column and the second row with rowspan equal to 2.

Table 1: Example of a source table with spreadsheet coordinates

    |  A       |  B       |  C       |  D       |  E
  1 |  Secret budget
  2 |          |  FY 2009            |  FY 2010
  3 |          |  Oper.   |  Capital |  Oper.   |  Capital
  4 |  HP      |  10      |  20      |  30      |  40
  5 |  Oracle  |  50      |  60      |  10      |  20
  6 |  Samsung |  12      |  34      |  56      |  78
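To make this representation concrete, the following sketch models a table as a set of cells with text content, spreadsheet coordinates, and colspan/rowspan; the class and field names are illustrative assumptions, not the prototype's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    """A table cell: text content plus spreadsheet coordinates and HTML-style spans."""
    col: str          # spreadsheet column letter, e.g. "A"
    row: int          # 1-based row index, e.g. 2
    content: str      # raw text content, possibly empty
    colspan: int = 1  # relative width, as in HTML
    rowspan: int = 1  # relative height, as in HTML

    @property
    def address(self) -> str:
        """Classical spreadsheet address, e.g. 'A2'."""
        return f"{self.col}{self.row}"

# A fragment of Table 1: cell A2 is empty and spans two rows.
table1 = {
    "A2": Cell("A", 2, "", rowspan=2),
    "B2": Cell("B", 2, "FY 2009", colspan=2),
    "A4": Cell("A", 4, "HP"),
    "B4": Cell("B", 4, "10"),
}
print(table1["A2"].address, table1["A2"].rowspan)  # A2 2
```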
4 General architecture

Assume that a user processes a table step by step. During each step the user selects several cells from the table and maps them to a record in a relational database. Let us define a user feedback as the information about the mappings from a single user step. See Table 2 for an example of the initial user feedback provided while processing Table 1.
Table 2: User feedback example: the first two columns represent data about the target relational scheme; the last two represent table cell data

Relation name | Attribute name | Cell position | Value
Report        | Company        | A4            | HP
Report        | Operating      | B4            | 10
Report        | Financial Year | B2            | 2009
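In code, a user feedback of this kind could be represented as a list of cell-to-attribute mappings; this is a minimal sketch with assumed names, mirroring Table 2.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Mapping:
    """One cell-to-attribute mapping inside a user step (one row of Table 2)."""
    relation: str   # target relation name, e.g. "Report"
    attribute: str  # target attribute name, e.g. "Company"
    cell: str       # spreadsheet position of the source cell, e.g. "A4"
    value: str      # value the user stored, possibly a derivative of the cell text

# The initial feedback of Table 2: one user step over Table 1.
feedback_1: List[Mapping] = [
    Mapping("Report", "Company", "A4", "HP"),
    Mapping("Report", "Operating", "B4", "10"),
    Mapping("Report", "Financial Year", "B2", "2009"),
]
```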
Figure 1 shows the general architecture of the prototype.

Figure 1: General architecture

The prototype takes as input an HTML table and enriches it by a predefined Natural Language Processing pipeline; currently we add only information about Named Entity types.

The enriched table is further processed by the main part of the prototype, the Table processor in the diagram. It generates an output, which has the same format as the user feedback, i.e. information about mappings that the user is assumed to produce. This output is reviewed by the user, and he/she returns feedback. The prototype stores information from the user feedback into the Target database and uses this feedback in order to generate the next output. History denotes a local storage of the previous feedbacks, including those from other tables that were already processed by the user. Information about previous actions is also used by the prototype for output generation.

5 Shift approach

Our approach is based on two observations: (a) each table is a layout structure for storing similar objects; and (b) closely located tables (e.g. from the same document or Web site) usually share similar structure and context. A user processes similar objects from the same table sequentially by repeating similar actions, and this repetition allows the system to learn and anticipate the forthcoming actions. The key idea of the approach is to consider the shift from one user step to the next one. More precisely, a shift consists of row and column offsets (possibly zero) for each cell represented in a pair of sequential user feedbacks. Since any table contains several objects of the same type and structure, we try to recognize this structure by tracking sequential user steps.

Table 3 shows an example of a shift from the first feedback, (HP, 10, 2009), to the second one, (Oracle, 50, 2009). Note that some cells remain fixed, e.g. B2.

Table 3: Shift example: from (HP, 10, 2009) in cells A4, B4, B2 to (Oracle, 50, 2009) in cells A5, B5, B2

    |  A       |  B       |  C       |  D       |  E
  1 |  Secret budget
  2 |          |  FY 2009            |  FY 2010
  3 |          |  Oper.   |  Capital |  Oper.   |  Capital
  4 |  HP      |  10      |  20      |  30      |  40
  5 |  Oracle  |  50      |  60      |  10      |  20
  6 |  Samsung |  12      |  34      |  56      |  78
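As an illustration, a shift can be computed as a list of per-cell (row, column) offsets between two sequential feedbacks; the sketch below assumes single-letter column names and positional correspondence of cells, which is enough for the Table 3 example.

```python
def parse_address(addr: str):
    """'B4' -> ('B', 4); single-letter columns are assumed for brevity."""
    return addr[0], int(addr[1:])

def compute_shift(prev_cells, next_cells):
    """Per-cell (row offset, column offset) between two sequential feedbacks.

    Cells are matched positionally: the i-th cell of the previous step
    corresponds to the i-th cell of the next step.
    """
    shift = []
    for a, b in zip(prev_cells, next_cells):
        col_a, row_a = parse_address(a)
        col_b, row_b = parse_address(b)
        shift.append((row_b - row_a, ord(col_b) - ord(col_a)))
    return shift

# The shift of Table 3: (HP, 10, 2009) -> (Oracle, 50, 2009).
print(compute_shift(["A4", "B4", "B2"], ["A5", "B5", "B2"]))
# [(1, 0), (1, 0), (0, 0)] -- two cells move down by one row, B2 stays fixed
```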
The shift approach is used in different ways depending on the number of available user feedbacks. In Section 5.1 we describe the regular phase of our algorithm, when at least two user feedbacks are available and we can compute shifts explicitly. Section 5.2 presents the phase when we have only one user feedback and have to guess a correct shift. Section 5.3 considers the phase when the user starts to process a table and we have to predict his/her actions based on previously processed tables.

5.1 Regular phase

After two user steps we can construct a shift by computing row and column offsets explicitly for every cell. For example, after two steps of processing Table 1 the following shift is constructed: the first two cells are shifted down by one cell and the third cell stays at the same position.

The constructed shift is then applied to the cells of the last user step in order to obtain new cells and create a new row in the relational database with the same relation/attribute information and new values taken from the new cells.
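Continuing the sketch above (same assumptions about cell addresses), applying the constructed shift to the cells of the last step predicts the next record of Table 1:

```python
def parse_address(addr: str):
    """'B4' -> ('B', 4); single-letter columns are assumed for brevity."""
    return addr[0], int(addr[1:])

def apply_shift(last_cells, shift):
    """Apply per-cell (row, col) offsets to the cells of the last user step.

    Relation and attribute names are simply copied from the last feedback;
    only the predicted cell addresses are returned here.
    """
    predicted = []
    for addr, (drow, dcol) in zip(last_cells, shift):
        col, row = parse_address(addr)
        predicted.append(f"{chr(ord(col) + dcol)}{row + drow}")
    return predicted

# Predicting the third record of Table 1 from the (Oracle, 50, 2009) step:
print(apply_shift(["A5", "B5", "B2"], [(1, 0), (1, 0), (0, 0)]))
# ['A6', 'B6', 'B2'] -> (Samsung, 12, 2009)
```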
As was shown in the example above, sometimes the value returned by the user is not just the text content of a cell, but its derivative: compare "FY 2009" in Table 1 with just "2009" in the feedback, Table 2. Our solution to this problem is based on the observation that shifted cells share a similar structure of their content. We store the way of producing the value from the cell in the form of a regular expression. More precisely, we have prepared a set of patterns, which are tested to match cell content against the value returned by the user within his/her feedback. Currently there are two types of patterns: (a) taking a substring of the original cell content, and (b) replacing a substring of the original cell content with another predefined string. Patterns of both types contain the corresponding regular expressions to be applied to the cell content.
Table 4: Example of cell value patterns

Pattern                       | Description                                | Cell value     | Feedback value
.*::::0                       | Takes the value as is                      | HP             | HP
(.*)\s(.*)::::1::::2          | Splits by space and tries each part        | Financial year | year
(\d+)::::1                    | Takes only digits from the content         | 146%           | 146
thousand::::000::::ReplaceAll | Replaces all words "thousand" by 3 zeros   | 45 thousand    | 45000

Patterns of the first type also contain the numbers of the matching groups to be taken into account; patterns of the second type contain a replacement string and the keyword ReplaceAll in order to distinguish this type. See Table 4 for examples of cell value patterns. A sequence of colons (::::) serves as a delimiter.

These patterns work as follows. Assume we have a cell with content "FY 2009" and the corresponding feedback value is "2009". We try to apply the first pattern (.*) to the cell content, which means that we simply take the whole content as is. The result does not match the feedback value. Then we try the next pattern ((.*)\s(.*)::::1::::2). The result matches the feedback value, and we store this pattern for the corresponding cells of the shift.
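A minimal sketch of this pattern-matching step is given below; the pattern strings follow the "::::" notation of Table 4, but the concrete regular expressions (in particular for the ReplaceAll example) are illustrative assumptions rather than the prototype's actual pattern set.

```python
import re

PATTERNS = [
    ".*::::0",                            # take the cell content as is
    r"(.*)\s(.*)::::1::::2",              # split by whitespace and try each part
    r"(\d+)::::1",                        # take only the digits
    r"\s*thousand::::000::::ReplaceAll",  # replace " thousand" with three zeros
]

def find_pattern(cell_content: str, feedback_value: str):
    """Return the first pattern that produces feedback_value from cell_content."""
    for pattern in PATTERNS:
        parts = pattern.split("::::")
        if parts[-1] == "ReplaceAll":
            # Replacement-type pattern: a regex and a replacement string.
            regex, replacement = parts[0], parts[1]
            if re.sub(regex, replacement, cell_content) == feedback_value:
                return pattern
        else:
            # Substring-type pattern: a regex plus matching-group numbers to try.
            regex, groups = parts[0], [int(g) for g in parts[1:]]
            match = re.search(regex, cell_content)
            if match and any(match.group(g) == feedback_value for g in groups):
                return pattern
    return None

print(find_pattern("FY 2009", "2009"))       # (.*)\s(.*)::::1::::2
print(find_pattern("45 thousand", "45000"))  # \s*thousand::::000::::ReplaceAll
```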
If there are more than two steps produced by the user, then we construct candidate shifts for all pairwise combinations of user feedbacks, assess them, and choose the best shift in order to apply it to the cells of the last user feedback. For example, if there are 3 feedbacks, we construct 3 candidate shifts: 1-to-2, 1-to-3, and 2-to-3; all 3 candidate shifts are assessed independently, and the best shift is then applied to the last, 3rd feedback.

Shift assessment is performed by computing a linear combination of the following 4 features.

Average shift length: the normalized sum of the lengths of cell offsets. For the example shift, there are 2 offsets of length 1 and 1 offset of length 0, so the feature value is 2/3.

Motivation: people tend to process the table sequentially, taking the closest cells if possible. In other words, shorter shifts are preferred by users. We also tried different weights for different directions, e.g. a bigger weight for the right offset than for the down one, because top-down processing seems to be more common, but experiments show that different weights do not introduce any positive effect.

Cell offsets consistency: the normalized number of the most common cell offsets. For the shift in the example, there are 2 types of offsets: 2 down-by-1 offsets and 1 remain-fixed offset, so the feature value is 2/3 again.

Motivation: a shift usually contains similar offsets of cells; e.g., all 4 cells are shifted down by 1 more often than 2 cells are shifted down by 1 while the other 2 are shifted down by 2 or right by any number.

Average text similarity: the normalized value of a string similarity metric computed over corresponding (shifted) cell contents.

Motivation: when we shift one cell to another, both of them must contain values of the same attribute, and different values of the same attribute are usually similar text strings.

To compute such similarity we tried several string metrics taken from SimMetrics (a similarity metric library provided by the University of Sheffield, http://sourceforge.net/projects/simmetrics/); the best results have been obtained using the Levenshtein distance [6]. In addition, we modified the string metric as follows (a sketch follows the list):
1. All digits are considered to be equal characters, because, as noted above, we want to capture strings of the same attribute, or data type, and the difference between numeric strings tells nothing about the difference between attributes. Note that we do not consider all numbers to be equal, because a large difference in orders of magnitude can indirectly indicate different attributes.
2. If one of the strings is empty, we use a predefined value (0.5): sometimes cell values are missing; for example, it can mean that the previous value or some default value should be taken instead. Unmodified metrics return zero similarity for such cases, but the absence of a cell value does not necessarily indicate a difference in attributes.
3. If both strings are long (more than 3 words), we use a predefined value (0.8): again, a difference in long texts does not reflect a difference in attributes.
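The modified similarity could look like the following sketch, which substitutes a plain dynamic-programming Levenshtein distance for the SimMetrics implementation; the normalization by the longer string length is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def modified_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity with the three modifications above."""
    if not a or not b:
        return 0.5                      # missing value: neutral rather than zero
    if len(a.split()) > 3 and len(b.split()) > 3:
        return 0.8                      # two long texts: assumed comparable
    # All digits are treated as the same character (string length is preserved,
    # so a large difference in orders of magnitude still lowers the similarity).
    a = "".join("0" if ch.isdigit() else ch for ch in a)
    b = "".join("0" if ch.isdigit() else ch for ch in b)
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(modified_similarity("FY 2009", "FY 2010"))  # 1.0 -- digits collapsed
print(modified_similarity("HP", ""))              # 0.5 -- empty cell
```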
Named entity type consistency: a predefined value for 3 cases depending on the named entity types of the cell contents:
1. the named entity types are equal – the value is 1;
2. the named entity types are unequal – the value is 0;
3. both named entity types are undefined – the value is 0.5.

The motivation is the same as for the previous feature, but here we utilize information about named entity types in order to check attribute consistency. We use a conditional random field (CRF) model with a combination of popular features applied in supervised named entity recognition [13]. There are 6 supported named entity types: Acronym, Date, Location, Numeric, Organization, Person.

The coefficients for the linear combination are chosen experimentally to maximize the accuracy: we test all possible values from 0 to 1 with step 0.2 such that their sum equals 1. We found the best accuracy to be obtained with 0, 0.4, 0.2, 0.4, correspondingly; finer granularity does not change the result.
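Putting the four features together, the assessment could be sketched as below. The per-pair averaging of named-entity consistency, the city-block length of an offset, and the stand-in similarity function are assumptions; the weights are the ones reported above (note that the zero weight makes the length feature inert).

```python
from collections import Counter

# Reported coefficients for (shift length, offset consistency,
# text similarity, named-entity type consistency).
WEIGHTS = (0.0, 0.4, 0.2, 0.4)

def avg_shift_length(offsets):
    """Normalized sum of offset lengths; 2/3 for [(1, 0), (1, 0), (0, 0)]."""
    return sum(abs(r) + abs(c) for r, c in offsets) / len(offsets)

def offset_consistency(offsets):
    """Share of the most common offset; 2/3 for the Table 3 example."""
    return Counter(offsets).most_common(1)[0][1] / len(offsets)

def avg_text_similarity(content_pairs, similarity):
    """Mean string similarity over (old cell text, new cell text) pairs."""
    return sum(similarity(a, b) for a, b in content_pairs) / len(content_pairs)

def ne_consistency(ne_pairs):
    """1 if types are equal, 0 if unequal, 0.5 if both are undefined (None)."""
    def one(a, b):
        if a is None and b is None:
            return 0.5
        return 1.0 if a == b else 0.0
    return sum(one(a, b) for a, b in ne_pairs) / len(ne_pairs)

def assess_shift(offsets, content_pairs, ne_pairs, similarity):
    features = (
        avg_shift_length(offsets),      # weighted 0 in the reported setting
        offset_consistency(offsets),
        avg_text_similarity(content_pairs, similarity),
        ne_consistency(ne_pairs),
    )
    return sum(w * f for w, f in zip(WEIGHTS, features))

# Scoring the Table 3 shift with a stand-in similarity function.
score = assess_shift(
    offsets=[(1, 0), (1, 0), (0, 0)],
    content_pairs=[("HP", "Oracle"), ("10", "50"), ("FY 2009", "FY 2009")],
    ne_pairs=[("Organization", "Organization"), ("Numeric", "Numeric"), ("Date", "Date")],
    similarity=lambda a, b: 1.0 if a == b else 0.5,
)
print(round(score, 2))  # 0.8
```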
5.2 One User Feedback Phase

Given one user feedback, we need a shift to apply to this feedback, as is done in the regular phase. There are two ways: take the most appropriate shift from previously processed tables, or construct a shift from scratch.
To find the most appropriate shift we check all stored shifts for applicability, and then assess the similarity of all applicable shifts modified in accordance with the current table. To assess a modified shift we use the linear combination as in Section 5.1, but the coefficients are re-estimated (0.2, 0, 0.6, 0.2).

This way is very similar to the no-user-feedback phase; see Section 5.3 for details. In short, modification means that we replace the contents of cells in the shift by the contents of the corresponding cells of the current table. For example, if there is a stored shift (from some previously processed table) with just one cell offset, C4-to-C5 with contents "USA"-to-"Russia", then we modify the shift so that the contents of the cell offset become "20"-to-"60": we take the contents of cells C4 and C5 of Table 1. Note that if the shift contains a cell offset like B1-to-B2 with ordinary colspan and rowspan (all equal to 1), then such a shift is inapplicable to Table 1, because the B1 and B2 cells of this table have different colspans.

If there is no suitable shift, i.e. the similarity of the best shift does not reach a predefined threshold, then we construct a new one. For this purpose, we iterate over combinations of all possible row and column offsets for each cell. To limit the combinatorial explosion, we do not consider offsets above predefined thresholds: 5 for rows and 3 for columns. Each offset combination is actually a shift that can be assessed as described above; the coefficients are left the same.
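A sketch of this brute-force construction is shown below; assuming offsets that only move down or right, every candidate combination is itself a shift to be scored by the assessment function.

```python
from itertools import product

MAX_ROW_OFFSET = 5   # thresholds from the text
MAX_COL_OFFSET = 3

def candidate_shifts(n_cells):
    """Enumerate every combination of per-cell (row, col) offsets within the thresholds.

    Offsets are assumed non-negative here (moving down or right); each yielded
    candidate is a shift that can be assessed like any other.
    """
    per_cell = [(r, c) for r in range(MAX_ROW_OFFSET + 1)
                       for c in range(MAX_COL_OFFSET + 1)]
    yield from product(per_cell, repeat=n_cells)

# Even for a 3-cell feedback this yields (6 * 4) ** 3 = 13824 candidates,
# which is why the thresholds are needed to limit the combinatorial explosion.
print(sum(1 for _ in candidate_shifts(3)))  # 13824
```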
5.3 No User Feedback Phase

When no user feedback is available for the current table, we can use information about previously processed tables. We store all user feedbacks during table processing in case some of them have a structure (a set of cell positions) appropriate for a new table. To choose the most appropriate user feedback we first check all of them for applicability, i.e. the current table must have cells in all cell positions of the user feedback, and these cells must have the same characteristics, particularly colspan and rowspan. Then all applicable user feedbacks are assessed by constructing and assessing special "fake" shifts from the stored feedback to the feedback obtained by applying the stored feedback structure to the current table. Assume we have the feedback shown in Table 5 from the already processed Table 6, and we just begin to process Table 1.

Table 5: Example of stored feedback

Relation | Attribute    | Cell | Value  | Cell pattern
Report   | Company      | A4   | Lenovo | .*::::0
Report   | Capital      | B4   | 105    | .*::::0
Report   | Market share | C4   | 33     | (\d+)::::1

Then we take the structure of the stored feedback, which is actually the cell positions and cell patterns, and apply it to the considered Table 1, see Table 7.

Table 6: Source of stored feedback

    |  A       |  B       |  C
  1 |  Public budget
  2 |  Company |  Capital |  Share of market
  3 |  Dell    |  100     |  27%
  4 |  Lenovo  |  105     |  33%

Table 7: Applied stored feedback

Relation | Attribute       | Cell | Value | Cell pattern
Report   | Company         | A4   | HP    | .*::::0
Report   | Capital         | B4   | 10    | .*::::0
Report   | Share of market | C4   | 20    | (\d+)::::1

After that we construct a shift from the stored feedback (Table 5) to the obtained feedback (Table 7) and assess it in the way described in Section 5.1, with re-estimated coefficients (0.2, 0, 0.4, 0.4). Such an assessment allows us to choose the most similar table among all already processed ones.

Note that a stored feedback taken from the 3rd row of Table 6, rather than the 4th, would not be applicable to Table 1, because the A3 cells have different rowspans.
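The applicability check could be sketched as follows; the representation of spans as (colspan, rowspan) pairs and the toy span values for Table 1 are assumptions used only for illustration.

```python
def is_applicable(stored_positions, stored_spans, table_spans):
    """Check whether a stored feedback structure fits a new table.

    stored_positions: cell addresses used by the stored feedback, e.g. ["A4", "B4", "C4"]
    stored_spans:     their (colspan, rowspan) pairs in the source table
    table_spans:      dict mapping addresses to (colspan, rowspan) in the new table
    """
    for addr, spans in zip(stored_positions, stored_spans):
        if addr not in table_spans:       # no cell at this position in the new table
            return False
        if table_spans[addr] != spans:    # colspan/rowspan differ, structure does not fit
            return False
    return True

# A toy fragment of Table 1: A2 spans two rows, so there is no ordinary A3 cell,
# and a feedback anchored at the 3rd row would be rejected.
table1_spans = {"A2": (1, 2), "B2": (2, 1), "A4": (1, 1), "B4": (1, 1), "C4": (1, 1)}
print(is_applicable(["A4", "B4", "C4"], [(1, 1)] * 3, table1_spans))  # True
print(is_applicable(["A3", "B3", "C3"], [(1, 1)] * 3, table1_spans))  # False
```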
6 Evaluation

We compiled a set of 30 tables containing financial reports and a set of more than 150 corresponding user feedbacks (the dataset is available at http://modis.ispras.ru/datasets/td.zip). We use the following test metrics: accuracy, precision and recall. Accuracy shows the fraction of outputs that fully match the user feedback: if at least one value in the tool output is wrong, the whole answer is considered to be wrong. Precision and recall characterize the number of correct cell mappings. Obviously, precision and recall can be much higher than accuracy, because many answers are partially correct.
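For concreteness, these metrics could be computed as in the sketch below, where each output is a set of cell mappings; the tuple layout is an assumption, while the metric definitions follow the text above.

```python
def evaluate(predicted_outputs, expected_outputs):
    """Accuracy over whole outputs, precision/recall over individual cell mappings."""
    exact = sum(p == e for p, e in zip(predicted_outputs, expected_outputs))
    correct = sum(len(p & e) for p, e in zip(predicted_outputs, expected_outputs))
    n_predicted = sum(len(p) for p in predicted_outputs)
    n_expected = sum(len(e) for e in expected_outputs)
    return {
        "accuracy": exact / len(expected_outputs),
        "precision": correct / n_predicted,
        "recall": correct / n_expected,
    }

# One fully correct output and one output with a single wrong value:
predicted = [{("A5", "Company", "Oracle"), ("B5", "Operating", "50")},
             {("A6", "Company", "Samsung"), ("B6", "Operating", "99")}]
expected = [{("A5", "Company", "Oracle"), ("B5", "Operating", "50")},
            {("A6", "Company", "Samsung"), ("B6", "Operating", "12")}]
print(evaluate(predicted, expected))  # accuracy 0.5, precision 0.75, recall 0.75
```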
Table 8 shows the results for the regular phase; the 2nd and the 3rd rows show the efficiency of the shift constructing module and the shift choosing module, respectively. We consider the shift constructor to work correctly if a right output is constructed (maybe not the chosen one); for the shift chooser we count only those outputs for which a correct shift was constructed and the shift chooser thereby had a chance to choose it.

Table 8: Regular phase results

Total accuracy             | 74%
Shift constructor accuracy | 78%
Shift chooser accuracy     | 94%
Precision                  | 84%
Recall                     | 80%

Table 9 shows the results for the first 2 phases. It is easy to see that these results depend on feedbacks from previously processed tables, because each user feedback is stored during the processing and affects the following outputs. However, the order of tables inside the document may be valuable; therefore, we shuffle the order of blocks containing tables from the same document.
It is worth mentioning that the results of the first two phases also depend on stored shifts (feedbacks, for the first phase), and at the beginning of the work, without any stored shift, we face the problem of cold start. That is why we add tests with a prepared set of 5 simple stored feedbacks and shifts from other tables.

In addition, we also run tests where all feedbacks and shifts from the same 30 test tables are stored and used for the testing. Of course, this is not an absolutely fair test, because there are stored feedbacks and shifts with exactly the same text content, but these tests may help to understand whether the mistakes are caused by the problem of missing stored shifts or by the incorrect choice of the stored shift to be applied.

Table 9: Results for the No user feedback and One user feedback phases

Phase             | Number of stored feedbacks | Accuracy | Precision | Recall
No user feedback  | 0                          | 4%       | 11%       | 7%
No user feedback  | 5                          | 4%       | 8%        | 6%
No user feedback  | all                        | 16%      | 28%       | 25%
One user feedback | 0                          | 43%      | 91%       | 90%
One user feedback | 5                          | 48%      | 87%       | 86%
One user feedback | all                        | 86%      | 87%       | 86%

Table 10 shows the results for each module of the one user feedback phase. The similarity estimator chooses the most similar shift among all stored ones. The similarity threshold is estimated perfectly: if the similarity estimator chooses an appropriate shift among the stored ones, then we always choose it and never try to construct our own instead. The shift constructor works reasonably well, which means that our algorithm constructs most of the possible shifts, but the shift chooser makes a lot of mistakes.

Table 10: Accuracy of modules for the one user feedback phase

Evaluated module     | 0 stored shifts | 5 stored shifts | All stored shifts
Similarity threshold | 100%            | 100%            | 100%
Similarity estimator | 100%            | 100%            | 100%
Shift chooser        | 25%             | 20%             | 0%
Shift constructor    | 80%             | 77%             | 33%

Some mistakes in the regular phase could be explained by the following: sometimes a user goes from the top to the bottom of the left part of the table (e.g. takes all values from the first and the second columns) and then repeats the similar actions for the right part (takes all values from the first and the third columns). Our tool cannot predict a correct shift at the moment when the user switches to the right part; that is why the results for the shift constructor are low.

7 Conclusions and Future Work

In this paper, we focused on the task of semi-automatic data extraction from tables and mapping the data to a relational scheme; we introduced a novel approach that tracks user decisions to predict forthcoming ones, and we evaluated it on our test data.

The shift approach provides a general scheme for semi-automatic table processing, but many problems are out of the scope of this research. One of them is extraction of a value from a table cell. In most cases it is sufficient to take text substrings (see cell B2 in Table 1), but sometimes a table cell is actually a set of different attribute values that should be processed by more complex methods. For instance, a price list of a hardware store often contains cells like the following:
HP "Pavilion dm4-2102er" QJ453EA (Core i5 2430M-2.40GHz, 6144MB, HD6470M, WebCam)

Another direction of further research is related to the target scheme: currently we copy information about the relation and attribute for shifted cells, but more sophisticated methods could consider the semantics of both the table content and the relational scheme.

References

[1] D. W. Embley, M. Hurst, D. Lopresti, G. Nagy. Table-processing paradigms: a research survey. International Journal of Document Analysis and Recognition (IJDAR), 8(2-3), p. 66-86, 2006.
[2] D. W. Embley, C. Tao, S. W. Liddle. Automating the Extraction of Data from HTML Tables with Unknown Structure. Knowledge Engineering, 54(1), p. 3-28, 2005.
[3] D. W. Embley, M. Krishnamoorthy. Factoring Web Tables. Proceedings of the 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, p. 253-263, 2011.
[4] F. Fumarola, T. Weninger, R. Barber, D. Malerba, and J. Han. HyLiEn: a hybrid approach to general list extraction on the web. Proceedings of the 20th International Conference Companion on World Wide Web, p. 35-36, 2011.
[5] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak. Towards domain-independent information extraction from web tables. Proceedings of the 16th International Conference on World Wide Web, p. 71-80, 2007.
[6] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, p. 707-710, 1966.
[7] D. Lopresti, G. Nagy. A tabular survey of automated table processing. Graphics Recognition: Recent Advances, p. 93-120, 2000.
[8] G. Nagy, S. Seth, D. Jin, D. W. Embley, S. Machado, and M. Krishnamoorthy. Data extraction from web tables: The devil is in the details. Document Analysis and Recognition (ICDAR), p. 242-246, 2011.
[9] G. Nagy and M. Tamhankar. VeriClick: an efficient tool for table format verification. IS&T/SPIE Electronic Imaging, p. 82970M-82970M, 2012.
[10] C. Peterman, C. H. Chang, H. Alam. A system for table understanding. Proceedings of the Symposium on Document Image Understanding Technology (SDIUT'97), p. 55-62, 1997.
[11] A. C. Silva, A. M. Jorge, L. Torgo. Design of an end-to-end method to extract information from tables. International Journal on Document Analysis and Recognition, 8(2-3), p. 144-171, 2006.
[12] Y. A. Tijerino, D. W. Embley, D. W. Lonsdale, Y. Ding, G. Nagy. Towards ontology generation from tables. World Wide Web, 8(3), p. 261-285, 2005.
[13] M. Tkachenko, A. Simanovsky. Named entity recognition: Exploring features. Proceedings of KONVENS 2012, p. 118-127, 2012.
[14] B. G. Vasudevan, A. G. Parvathy, A. Kumar, R. Balakrishnan. Automated Knowledge-based Information Extraction from Financial Reports. Knowledge Engineering and Management, 7(5), p. 61-68, 2009.
[15] R. Zanibbi, D. Blostein, J. R. Cordy. A survey of table recognition: Models, observations, transformations, and inferences. International Journal of Document Analysis and Recognition, 7(1), Springer, Heidelberg, p. 1-16, 2004.