Semi-automatic Data Extraction from Tables

© Nikita Astrakhantsev (ISPRAS, Moscow) astrakhantsev@ispras.ru
© Denis Turdakov (ISPRAS, Moscow) turdakov@ispras.ru
© Natalia Vassilieva (HP Labs, Saint Petersburg) vassilieva@hp.com

Proceedings of the 15th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL-2013), Yaroslavl, Russia, October 14-18, 2013.

Abstract

This paper describes a novel approach to automating the extraction of useful information from tables and recording the acquired knowledge in a structured data repository. The approach is based on modeling the behavior of an expert who collects tabular data and maps it to a predefined relational schema. Experimental results demonstrate that the proposed approach predicts expert decisions with high accuracy and thus significantly reduces the amount of expert time required for data aggregation.

1 Introduction

Tables are widely used in scientific, financial and other analytical documents to communicate information to human readers concisely. A table usually contains information about objects of the same type. A reader perceives these objects easily through the content of the table and the relations between its elements (cells, rows, etc.), because he or she is experienced in table reading and is able to take in both semantic and structural information. Sometimes two tables with exactly the same structure are interpreted completely differently just because of a slight difference in the external context or in the content of some cells.

However, table analysis is hard to automate, and such information is extracted mostly by hand by interested parties. A common scenario is that an expert manually extracts data from a table, cell by cell, into a relational database and then uses OLAP or other techniques to generate reports and perform analytics over these data. The effort required for manual extraction is considerable, while the time of experts is always costly.

There is no end-to-end solution for automatic information extraction from arbitrary tables, and, as it appears to us, the construction of a fully automatic instrument is hardly feasible. It might be possible to parse the structure of any table automatically, but the semantic interpretation of tabular data requires the knowledge of a domain expert. Embley et al. [1] suggest the Periodic Table of the Elements as an example of a table that requires semantic knowledge for interpretation; the authors of this survey also state that it is easy to contrive other examples "that are challenging even from a human perspective".

We present a semi-automatic approach that tracks the actions of a domain expert as he or she begins to process a table (mapping cell data to a relational schema), derives regularities and patterns of the expert's behavior, and applies them to the rest of the table in order to predict further mappings.

This paper is organized as follows. Section 2 gives an overview of related work. Section 3 discusses the table format used in our work. Section 4 describes the general architecture of our prototype, while Section 5 describes it in detail. Section 6 presents our experimental evaluation. We conclude in Section 7.
2 Related work

Silva et al. [11] survey about 50 works devoted to table processing in detail. The authors outline several table-related tasks and the corresponding table representation models, which serve as input and output for these tasks. The span of tasks goes from locating a table in a document to semantic interpretation of the information contained in the table. There are more works focused on the basic, low-level table-related tasks than on the knowledge-based ones. A deep interpretation of a table almost always requires context-specific knowledge. Existing solutions for extracting information are very domain-specific and designed for particular table types.

Zanibbi et al. [15] present the table recognition literature in terms of the interaction of table models, observations, transformations, and inferences. Most of the described methods are fully automatic and consider the task of table processing in isolation from the further use of the extracted information.

Embley et al. [2] extract data from XML tables and map them to a given target database schema with 96/85 precision and 93/91 recall (for the car advertisement and cell-phone domains, respectively). However, their approach requires a hand-crafted ontology, which is costly and, more importantly, cannot always be available due to the high specificity of certain documents.

More recent work by Embley and Krishnamoorthy [3] transforms CSV or HTML tables into a canonical representation in order to obtain a target representation, one option being a relational table. The canonical table representation is based on Header Paths, a purely syntactic technique that relates headers and data cells. The further transformation into a target representation is performed by a specially defined relational algebra. In contrast to the previous approach, this one does not use any semantic knowledge; it also requires the user to point out the top-left data cell in cases where the proposed heuristics cannot find such a cell automatically. The precision of correct Header Path construction is about 74%. In [8] the authors evolve this approach with the interactive tool VeriClick [9], "a macro-enabled spreadsheet interface that provides intuitive ground-truthing, confirmation, correction, and verification functions for CSV tables". However, this tool is used only to locate so-called "critical cells": corner cells that allow distinguishing the header and data regions of a table.

Vasudevan et al. [14] address a closely related task: automating data extraction from financial reports presented as PDF documents with many tables. The proposed approach shows good results (95.7% precision and 78.4% recall), but it requires a quality review stage and, since it is based on domain-knowledge heuristics, the range of its application is severely limited.
Gatterbauer et al. [5] extract information from web tables by using the two-dimensional visual model provided by web browsers instead of the tree-based (HTML) representation. Fumarola et al. [4] combine knowledge about the visual structure of the web page with the HTML markup for web list extraction. Their tool is applicable to web tables as well; moreover, the authors evaluate accuracy on the same dataset as Gatterbauer and report very good quality (more than 99% precision and recall on table records). But this dataset is domain-independent and the target format of their table analysis is more general than ours; therefore, it cannot be directly compared with our method.

Among semi-automatic tools, Google Refine (http://code.google.com/p/google-refine) should be noted. It is "a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase". However, Google Refine is aimed mostly at data cleansing and is not capable of information extraction and interpretation. It also cannot track user decisions and anticipate them. Similarly, Microsoft Excel (http://office.microsoft.com/en-us/excel/) can perform such transformations using formulas and macros, but it requires the user to define these formulas explicitly. Praedea (http://www.praedea.com) is another semi-automatic tool: it uses predefined text-mining models (chosen by the user) for extracting required data from unstructured documents and mapping them to a database or XML schema. It cannot process a table if there is no suitable predefined model for it.

Thus, except for highly specialized programs like VeriClick and commercial tools like Google Refine or MS Excel, we are not aware of any research on interactive information extraction from tables; the authors of VeriClick confirm this observation [9]. It should also be noted that most works try to convert a table into some more usable format rather than extract the needed pieces of information from it.

3 Table format

Like other basic notions, 'table' has many different definitions. Peterman et al. [10] suggest the following definition: "tables have a regular repetitive structure along one axis so that the data type is determined either by the horizontal or vertical indices." The definition given by Lopresti et al. [7] consists of similar items:
1. a 2-D cell assembly for presenting information;
2. a regular, repetitive structure along at least one axis;
3. a datatype determined by either the horizontal or the vertical index.
Tijerino et al. [12] use the standard definition of a relational table.

In this work we consider a table to be a set of cells with some text content. In other words, this is the only property of a table that is used explicitly; other properties, such as a repetitive structure or a common datatype within a column or row, are taken into account by our method implicitly.

Our prototype takes HTML tables as input; therefore we use HTML terminology to describe cell properties, for example, colspan for the relative width of a table cell. However, our method does not depend on any specific properties of the HTML format.

We use classical spreadsheet addressing for cells. For example, A2 of Table 1 is an empty cell in the first column and the second row with rowspan equal to 2.

Table 1: Example of a source table with spreadsheet coordinates

      A        B        C        D        E
1     Secret budget
2              FY 2009           FY 2010
3              Oper.    Capital  Oper.    Capital
4     HP       10       20       30       40
5     Oracle   50       60       10       20
6     Samsung  12       34       56       78
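To make this minimal table model concrete, the sketch below shows one possible in-memory representation (the class and field names are our own illustration, not the prototype's code): a table is simply a map from spreadsheet addresses to cells, and each cell keeps its text content together with the HTML-style colspan and rowspan.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """A table cell: text content plus the HTML-style span attributes."""
    content: str
    colspan: int = 1
    rowspan: int = 1

# A table is just a mapping from spreadsheet addresses to cells;
# structural properties (repeated rows, column datatypes) stay implicit.
table1 = {
    "A1": Cell("Secret budget", colspan=5),
    "A2": Cell("", rowspan=2),
    "B2": Cell("FY 2009", colspan=2), "D2": Cell("FY 2010", colspan=2),
    "B3": Cell("Oper."), "C3": Cell("Capital"),
    "D3": Cell("Oper."), "E3": Cell("Capital"),
    "A4": Cell("HP"),     "B4": Cell("10"), "C4": Cell("20"),
    "D4": Cell("30"),     "E4": Cell("40"),
    "A5": Cell("Oracle"), "B5": Cell("50"), "C5": Cell("60"),
    "D5": Cell("10"),     "E5": Cell("20"),
    # row 6 (Samsung) omitted for brevity
}
```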
4 General architecture

Assume that a user processes a table step by step. During each step the user selects multiple cells from the table and maps them to a record in a relational database. Let us define user feedback as the information about the mappings from a single user step. See Table 2 for an example of the initial user feedback provided by the user while processing Table 1.

Table 2: User feedback example: the first two columns represent data about the target relational schema; the last two represent table cell data

Relation name   Attribute name   Cell position   Value
Report          Company          A4              HP
Report          Operating        B4              10
Report          Financial Year   B2              2009

Figure 1 shows the general architecture of the prototype.

Figure 1: General architecture

The prototype takes an HTML table as input and enriches it with a predefined natural language processing pipeline; currently we add only information about named entity types. The enriched table is then processed by the main part of the prototype, the Table processor in the diagram. It generates output that has the same format as the user feedback, i.e. the information about mappings that the user is assumed to produce. This output is reviewed by the user, who returns feedback. The prototype stores the information from the user feedback in the Target database and uses this feedback to generate the next output. History is a local storage of previous feedbacks, including those from other tables already processed by the user. Information about previous actions is also used by the prototype for output generation.

5 Shift approach

Our approach is based on two observations: (a) each table is a layout structure for storing similar objects; and (b) closely located tables (e.g. from the same document or web site) usually share similar structure and context. A user processes similar objects from the same table sequentially by repeating similar actions, and this repetition allows the system to learn and anticipate the forthcoming actions.

The key idea of the approach is to consider the shift from one user step to the next. More precisely, a shift consists of the row and column offsets (possibly zero) for each cell represented in a pair of sequential user feedbacks. Since any table contains several objects of the same type and structure, we try to recognize this structure by tracking sequential user steps.

Table 3 shows an example of a shift from the first feedback (cells A4, B4, B2) to the second one (cells A5, B5, B2). Note that some cells remain fixed, e.g. B2.

Table 3: Shift example: from (HP, 10, 2009) to (Oracle, 50, 2009)

      A        B        C        D        E
1     Secret budget
2              FY 2009           FY 2010
3              Oper.    Capital  Oper.    Capital
4     HP       10       20       30       40
5     Oracle   50       60       10       20
6     Samsung  12       34       56       78
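The following sketch (our own illustration, not the prototype's code) shows one way to derive a shift from two consecutive feedbacks: for each attribute we compute the row and column offset between the cells used in the two steps.

```python
import re

def parse_address(addr: str) -> tuple[int, int]:
    """Split a spreadsheet address like 'B4' into (row, column) indices."""
    col_letters, row_digits = re.match(r"([A-Z]+)(\d+)", addr).groups()
    col = 0
    for ch in col_letters:
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(row_digits), col

def compute_shift(prev_feedback: dict, next_feedback: dict) -> dict:
    """Row/column offsets per attribute between two sequential feedbacks."""
    shift = {}
    for attr, prev_addr in prev_feedback.items():
        r1, c1 = parse_address(prev_addr)
        r2, c2 = parse_address(next_feedback[attr])
        shift[attr] = (r2 - r1, c2 - c1)
    return shift

# First step uses (HP, 10, 2009); the second uses (Oracle, 50, 2009).
step1 = {"Company": "A4", "Operating": "B4", "Financial Year": "B2"}
step2 = {"Company": "A5", "Operating": "B5", "Financial Year": "B2"}
print(compute_shift(step1, step2))
# {'Company': (1, 0), 'Operating': (1, 0), 'Financial Year': (0, 0)}
```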
The shift approach is used in different ways depending on the number of available user feedbacks. Section 5.1 describes the regular phase of our algorithm, when at least two user feedbacks are available and we can compute shifts explicitly. Section 5.2 presents the phase when we have only one user feedback and have to guess a correct shift. Section 5.3 considers the phase when the user starts to process a table and we have to predict his or her actions based on previously processed tables.

5.1 Regular phase

After two user steps we can construct a shift by computing the row and column offsets explicitly for every cell. For example, after two steps of processing Table 1 the following shift is constructed: the first two cells are shifted down by one cell and the third cell stays at the same position.

The constructed shift is then applied to the cells of the last user step in order to obtain new cells and create a new row in the relational database with the same relation/attribute information and new values taken from the new cells.

As the example above shows, sometimes the value returned by the user is not just the text content of the cell but a derivative of it: compare "FY 2009" in Table 1 with just "2009" in the feedback in Table 2. Our solution to this problem is based on the observation that shifted cells share a similar structure of their content. We store the way of producing the value from the cell in the form of a regular expression. More precisely, we have prepared a set of patterns that are tested by matching the cell content against the value returned by the user within his or her feedback. Currently there are two types of patterns: (a) taking a substring of the original cell content, and (b) replacing a substring of the original cell content with another predefined string. Patterns of both types contain the corresponding regular expressions to be applied to the cell content. Patterns of the first type also contain the numbers of the matching groups to be taken into account; patterns of the second type contain a replacement string and the keyword ReplaceAll in order to distinguish this type. See Table 4 for examples of cell value patterns. The sequence of colons (::::) serves as a delimiter.

Table 4: Examples of cell value patterns

Pattern                          Description                              Cell value       Feedback value
.*::::0                          Takes the value as is                    HP               HP
(.*)\s(.*)::::1::::2             Splits by space and tries each part      Financial year   year
(\d+)::::1                       Takes only digits from the content       146%             146
thousand::::000::::ReplaceAll    Replaces the word thousand by 3 zeros    45 thousand      45000

These patterns work as follows. Assume we have a cell with content "FY 2009" and the corresponding feedback value is "2009". We try to apply the first pattern, (.*), to the cell content, which means that we simply take the whole content as is. The result does not match the feedback value. Then we try the next pattern, (.*)\s(.*)::::1::::2. The result matches the feedback value, and we store this pattern for the corresponding cells of the shift.
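A minimal sketch of how such patterns could be applied (our own reconstruction of the mechanism; the actual prototype may differ in details): each pattern is split on the :::: delimiter, interpreted either as a regular expression with candidate group numbers or as a ReplaceAll rule, and kept if one of its results reproduces the value the user returned.

```python
import re

PATTERNS = [".*::::0", r"(.*)\s(.*)::::1::::2", r"(\d+)::::1",
            "thousand::::000::::ReplaceAll"]

def pattern_results(pattern: str, content: str):
    """Yield every value a pattern can produce from the cell content."""
    parts = pattern.split("::::")
    if parts[-1] == "ReplaceAll":          # type (b): substring replacement
        yield content.replace(parts[0], parts[1])
        return
    match = re.search(parts[0], content)   # type (a): substring extraction
    if match:
        for group_no in parts[1:]:
            yield match.group(int(group_no))

def find_pattern(content: str, feedback_value: str):
    """Return the first stored pattern that reproduces the user's value."""
    for pattern in PATTERNS:
        if feedback_value in pattern_results(pattern, content):
            return pattern
    return None

print(find_pattern("FY 2009", "2009"))   # (.*)\s(.*)::::1::::2
print(find_pattern("146%", "146"))       # (\d+)::::1
```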
In other types in order to check attribute consistency. We use a words, shorter shifts are preferable by users. conditional random field (CRF) model with a We also tried different weights for different combination of different popular features applied in directions, e.g. bigger weight for the right offset than supervised named entity recognition [13]. There are 6 for the down one, because top-down processing seem to supported named entity types: Acronym, Date, be more common, but experiments show that different Location, Numeric, Organization, Person. weights do not introduce any positive effect. Coefficients for the linear combination are chosen Cells offsets consistency: normalized number of experimentally to maximize the accuracy: we test all most common cells offsets. For the shift in the example, possible values from 0 to 1 with step 0.2 so that their there are 2 types of offsets: 2 down-by-1 offsets and 1 sum equals to 1. We found the best accuracy to be remain-fixed offset, so the feature value is 2/3 again. obtained with 0, 0.4, 0.2, 0.4 correspondingly; the Motivation: a shift usually contains similar offsets of further granulation does not change the result. cells, e.g. all 4 cells are shifted down by 1 more often, then when 2 cells are shifted down by 1 and other 2 5.2 One User Feedback Phase cells are shifted down by 2 or right by any number. Given one user feedback we need a shift to apply it Average text similarity: normalized value of string to this user feedback as it is done in Regular phase. similarity metrics computed over corresponding There are two ways: take a most appropriate shift from (shifted) cells contents. the previously processed tables or construct a shift from Motivation: when we shift one cell to another, both of them must contain values of the same attribute, and 4 different values of the same attributes are usually SimMetrics is a Similarity Metric Library provided by similar text strings. UK Sheffield University To compute such similarity we tried several string http://sourceforge.net/projects/simmetrics/ 17 scratch. To find the most appropriate shift we check all Table 6: Source of stored feedback stored shifts for applicability, and then assess similarity A B C of all applicable shifts modified in accordance with the 1 Public budget current table. To assess the modified shift we use the 2 Company Capital Share of market linear combination as in section 5.1, but coefficients are 3 Dell 100 27% re-estimated (0.2, 0, 0.6, 0.2). 4 Lenovo 105 33% This way is very similar to No user feedback phase, see Section 5.3 for details. In short words, modification Table 7: Applied stored feedback means that we replace contents of cells in the shift by Relation Attribute Cell Value Cell the contents of the corresponding cells of the current pattern table. For example, if there is a stored shift (from some Report Company A4 HP .*::::0 previously processed table) with just one cell offset – Report Capital B4 10 .*::::0 C4-to-C5 with contents “USA”-to-“Russia” – then we Report Share of C4 20 (\d+)::::1 modify the shift so that now contents of the cell offset is market “20”-to-“60”: we take contents of C4 and C5 cells of Table 1. 
5.2 One User Feedback Phase

Given one user feedback, we need a shift to apply to it, as is done in the regular phase. There are two ways: take the most appropriate shift from the previously processed tables or construct a shift from scratch.

To find the most appropriate shift we check all stored shifts for applicability and then assess the similarity of all applicable shifts, modified in accordance with the current table. To assess a modified shift we use the same linear combination as in Section 5.1, but the coefficients are re-estimated (0.2, 0, 0.6, 0.2). This procedure is very similar to the no-user-feedback phase; see Section 5.3 for details. In short, modification means that we replace the contents of the cells in the shift with the contents of the corresponding cells of the current table. For example, if there is a stored shift (from some previously processed table) with just one cell offset – C4-to-C5 with contents "USA"-to-"Russia" – then we modify the shift so that the contents of the cell offset become "20"-to-"60": we take the contents of cells C4 and C5 of Table 1.

Note that if a stored shift contains a cell offset such as B1-to-B2 with ordinary colspan and rowspan (all equal to 1), then this shift is inapplicable to Table 1, because the B1 and B2 cells of this table have different colspans.

If there is no suitable shift, i.e. the similarity of the best shift does not reach a predefined threshold, then we construct a new one. For this purpose we iterate over combinations of all possible row and column offsets for each cell. To limit the combinatorial explosion we do not consider offsets above predefined thresholds: 5 for rows and 3 for columns. Each offset combination is in fact a shift that can be assessed as described above; the coefficients are left the same.
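A sketch of that enumeration (again our own illustration; whether negative offsets are allowed is our assumption, since the text only states the magnitude thresholds): every cell of the single feedback gets every row offset up to 5 and every column offset up to 3, and the Cartesian product of these per-cell offsets forms the candidate shifts to be scored.

```python
from itertools import product

MAX_ROW_OFFSET = 5   # thresholds stated in the text
MAX_COL_OFFSET = 3

def candidate_shifts(cells):
    """Enumerate candidate shifts: one (row, col) offset per feedback cell."""
    per_cell = [
        [(dr, dc)
         for dr in range(-MAX_ROW_OFFSET, MAX_ROW_OFFSET + 1)
         for dc in range(-MAX_COL_OFFSET, MAX_COL_OFFSET + 1)]
        for _ in cells
    ]
    for combo in product(*per_cell):
        yield dict(zip(cells, combo))

cells = ["A4", "B4", "B2"]
shifts = candidate_shifts(cells)
print(next(shifts))   # first candidate: every cell offset by (-5, -3)
```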
5.3 No User Feedback Phase

When no user feedback is available for the current table, we can use information about previously processed tables. We store all user feedbacks during table processing in case some of them have a structure (a set of cell positions) appropriate for a new table. To choose the most appropriate user feedback we first check all of them for applicability: the current table must have cells in all cell positions of the stored user feedback, and these cells must have the same characteristics, in particular colspan and rowspan. Then all applicable user feedbacks are assessed by constructing and assessing special "fake" shifts from the stored feedback to the feedback obtained by applying the structure of the stored feedback to the current table.

Assume we have the feedback shown in Table 5, obtained from the already processed Table 6, and we have just begun to process Table 1.

Table 5: Example of stored feedback

Relation   Attribute      Cell   Value    Cell pattern
Report     Company        A4     Lenovo   .*::::0
Report     Capital        B4     105      .*::::0
Report     Market share   C4     33       (\d+)::::1

Table 6: Source of stored feedback

      A          B        C
1     Public budget
2     Company    Capital  Share of market
3     Dell       100      27%
4     Lenovo     105      33%

We then take the structure of the stored feedback, which is in fact a set of cell positions and cell patterns, and apply it to Table 1; see Table 7.

Table 7: Applied stored feedback

Relation   Attribute         Cell   Value   Cell pattern
Report     Company           A4     HP      .*::::0
Report     Capital           B4     10      .*::::0
Report     Share of market   C4     20      (\d+)::::1

After that we construct a shift from the stored feedback (Table 5) to the obtained feedback (Table 7) and assess it as described in Section 5.1, with re-estimated coefficients (0.2, 0, 0.4, 0.4). Such an assessment allows us to choose the most similar table among all already processed ones.

Note that a stored feedback taken from the 3rd row of Table 6, rather than the 4th, would not be applicable to Table 1, because the A3 cells have different rowspans.
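The applicability test from this phase is straightforward to sketch (our own reconstruction, reusing the Cell class and table1 mapping from the Section 3 sketch; the flat dictionary layout simply mirrors the columns of Table 5 and is our own naming): a stored feedback is applicable only if the new table has a cell at every stored position with the same colspan and rowspan, and if so, the stored positions and cell patterns are re-applied to the contents of the new table.

```python
def is_applicable(stored_feedback, new_table):
    """A stored feedback fits a new table only if every referenced position
    exists there with the same colspan and rowspan."""
    for entry in stored_feedback:
        cell = new_table.get(entry["cell"])
        if cell is None:
            return False
        if (cell.colspan, cell.rowspan) != (entry["colspan"], entry["rowspan"]):
            return False
    return True

def apply_structure(stored_feedback, new_table):
    """Re-apply stored positions (keeping the stored cell patterns) to the
    new table, producing the 'obtained' feedback used for the fake shift."""
    return [dict(entry, value=new_table[entry["cell"]].content)
            for entry in stored_feedback]

stored = [
    {"relation": "Report", "attribute": "Company", "cell": "A4",
     "pattern": ".*::::0", "colspan": 1, "rowspan": 1, "value": "Lenovo"},
    {"relation": "Report", "attribute": "Capital", "cell": "B4",
     "pattern": ".*::::0", "colspan": 1, "rowspan": 1, "value": "105"},
]
if is_applicable(stored, table1):           # table1 from the Section 3 sketch
    print(apply_structure(stored, table1))  # values become 'HP' and '10'
```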
6 Evaluation

We compiled a set of 30 tables containing financial reports and a set of more than 150 corresponding user feedbacks (the dataset is available at http://modis.ispras.ru/datasets/td.zip). We use the following test metrics: accuracy, precision and recall. Accuracy is the fraction of outputs that fully match the user feedback: if at least one value in the tool output is wrong, the whole answer is considered wrong. Precision and recall characterize the number of correct cell mappings. Obviously, precision and recall can be much higher than accuracy, because many answers are partially correct.

Table 8 shows the results for the regular phase; the 2nd and 3rd rows show the efficiency of the shift constructing module and the shift choosing module, respectively. We consider the shift constructor to work correctly if a right output exists (even if it is not the chosen one); for the shift chooser we count only those outputs for which a correct shift was constructed, so that the shift chooser had a chance to choose it.

Table 8: Regular phase results

Total accuracy              74%
Shift constructor accuracy  78%
Shift chooser accuracy      94%
Precision                   84%
Recall                      80%

Table 9 shows the results for the first two phases. It is easy to see that these results depend on feedbacks from previously processed tables, because each user feedback is stored during processing and affects the following outputs. However, the order of tables inside a document may matter; therefore, we shuffle the order of blocks containing tables from the same document.

It is worth mentioning that the results of the first two phases also depend on stored shifts (feedbacks, for the first phase), and at the beginning of the work, without any stored shift, we face the cold start problem. That is why we add tests with a prepared set of 5 simple stored feedbacks and shifts from other tables. In addition, we also run tests in which all feedbacks and shifts from the same 30 test tables are stored and used for testing. Of course, this is not an absolutely fair test, because there are stored feedbacks and shifts with exactly the same text content, but these tests help to understand whether the mistakes are caused by missing stored shifts or by an incorrect choice of the stored shift to be applied.

Table 9: Results for No user feedback and One user feedback phases

Phase               Number of stored feedbacks   Accuracy   Precision   Recall
No user feedback    0                            4%         11%         7%
                    5                            4%         8%          6%
                    all                          16%        28%         25%
One user feedback   0                            43%        91%         90%
                    5                            48%        87%         86%
                    all                          86%        87%         86%

Table 10 shows the results for each module of the one user feedback phase. The similarity estimator chooses the most similar shift among all stored ones. The similarity threshold is estimated perfectly: if the similarity estimator chooses an appropriate shift among the stored ones, then we always choose it and never try to construct our own instead. The shift constructor works reasonably well, which means that our algorithm constructs most of the possible shifts, but the shift chooser makes a lot of mistakes.

Table 10: Accuracy of modules for one user feedback phase

Evaluated module       0 stored shifts   5 stored shifts   All stored shifts
Similarity threshold   100%              100%              100%
Similarity estimator   100%              100%              100%
Shift chooser          25%               20%               0%
Shift constructor      80%               77%               33%

Some mistakes in the regular phase can be explained as follows: sometimes a user goes from the top to the bottom of the left part of the table (e.g. takes all values from the first and the second columns) and then repeats similar actions for the right part (takes all values from the first and the third columns). Our tool cannot predict a correct shift at the moment when the user switches to the right part, which is why the results for the shift constructor are low.

7 Conclusions and Future Work

In this paper we focused on the task of semi-automatic data extraction from tables and mapping the data to a relational schema; we introduced a novel approach that tracks user decisions in order to predict forthcoming ones, and we evaluated it on our test data.

The shift approach provides a general scheme for semi-automatic table processing, but many problems are out of the scope of this research. One of them is the extraction of a value from a table cell. In most cases it is sufficient to take text substrings (see cell B2 in Table 1), but sometimes a table cell is actually a set of different attribute values that should be processed by more complex methods. For instance, a price list of a hardware store often contains cells like the following: HP "Pavilion dm4-2102er" QJ453EA (Core i5 2430M-2.40GHz, 6144MB, HD6470M, WebCam).

Another direction of further research is related to the target schema: currently we copy the relation and attribute information for shifted cells, but more sophisticated methods could consider the semantics of both the table content and the relational schema.

References

[1] D. W. Embley, M. Hurst, D. Lopresti, G. Nagy. Table-processing paradigms: a research survey. International Journal of Document Analysis and Recognition (IJDAR), 8(2-3), p. 66-86, 2006.
[2] D. W. Embley, C. Tao, S. W. Liddle. Automating the Extraction of Data from HTML Tables with Unknown Structure. Data & Knowledge Engineering, 54(1), p. 3-28, 2005.
[3] D. W. Embley, M. Krishnamoorthy. Factoring Web Tables. Proceedings of the 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, p. 253-263, 2011.
[4] F. Fumarola, T. Weninger, R. Barber, D. Malerba, J. Han. HyLiEn: a hybrid approach to general list extraction on the web. Proceedings of the 20th International Conference Companion on World Wide Web, p. 35-36, 2011.
[5] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, B. Pollak. Towards domain-independent information extraction from web tables. Proceedings of the 16th International Conference on World Wide Web, p. 71-80, 2007.
[6] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, p. 707-710, 1966.
[7] D. Lopresti, G. Nagy. A tabular survey of automated table processing. Graphics Recognition: Recent Advances, p. 93-120, 2000.
[8] G. Nagy, S. Seth, D. Jin, D. W. Embley, S. Machado, M. Krishnamoorthy. Data extraction from web tables: The devil is in the details. Document Analysis and Recognition (ICDAR), p. 242-246, 2011.
[9] G. Nagy, M. Tamhankar. VeriClick: an efficient tool for table format verification. IS&T/SPIE Electronic Imaging, p. 82970M, 2012.
[10] C. Peterman, C. H. Chang, H. Alam. A system for table understanding. Proceedings of the Symposium on Document Image Understanding Technology (SDIUT'97), p. 55-62, 1997.
[11] A. C. Silva, A. M. Jorge, L. Torgo. Design of an end-to-end method to extract information from tables. International Journal on Document Analysis and Recognition, 8(2-3), p. 144-171, 2006.
[12] Y. A. Tijerino, D. W. Embley, D. W. Lonsdale, Y. Ding, G. Nagy. Towards ontology generation from tables. World Wide Web, 8(3), p. 261-285, 2005.
[13] M. Tkachenko, A. Simanovsky. Named entity recognition: Exploring features. Proceedings of KONVENS 2012, p. 118-127, 2012.
[14] B. G. Vasudevan, A. G. Parvathy, A. Kumar, R. Balakrishnan. Automated Knowledge-based Information Extraction from Financial Reports. Knowledge Engineering and Management, 7(5), p. 61-68, 2009.
[15] R. Zanibbi, D. Blostein, J. R. Cordy. A survey of table recognition: Models, observations, transformations, and inferences. International Journal of Document Analysis and Recognition, 7(1), p. 1-16, 2004.