<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Information Extraction for Semi-structured Email Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Science Media Center</string-name>
          <email>firstname.lastname@sciencemediacenter.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cologne</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germany</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TH Koln - University of Applied Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information extraction is a requirement for enhanced IR techniques. To surpass rigid extraction rules in wrappers based on XPath or CSS selectors, we present a new extraction method extending the FleXPath method that was used for structured XML retrieval in INEX. We expand this method to work with semi-structured HTML and present a case study and a short evaluation based on a corpus of emails from scientific publishers.</p>
      </abstract>
      <kwd-group>
        <kwd>Information extraction</kwd>
        <kwd>Semi-structured documents</kwd>
        <kwd>Emails</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Information extraction is a requirement and a necessity for enhanced information retrieval techniques like entity-based search, semantic search, or other approaches that make use of properly structured and annotated information [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Searching for or within structured information like XML documents (INEX) allows semantic annotations to be used directly, but usually, source documents are not semantically structured at all and only allow full-text search. In this work, we use a newsletter corpus that is not formally structured but semi-structured, with some recurring parts of the mails (like titles, dates, or authors). We want to extract this information to allow rich IR techniques like filtering, browsing, or ranking based on these semantic annotations.
      </p>
      <p>
        Usually, information extraction [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] relies on structural layouts and syntactical patterns [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] within the source documents. Wrappers are used to exploit these structures. Wrappers use pattern matching procedures and heavily rely on predefined or learned extraction rules [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A more general approach for wrapper construction is XPath [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or derivations like OXPath [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Both systems are used to address and locate parts and nodes in XML/HTML documents and to extract their content. Especially for HTML documents, CSS selectors are an alternative to XPath [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        A common issue in real-life web information extraction is that XPath or CSS-based extraction rules are not flexible and are vulnerable to even slight changes in the source documents' structures. Due to this, the process of adjusting an existing wrapper to new requirements is costly. Amer-Yahia et al. introduced a technique called FleXPath [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to overcome this issue. They developed a mixture of structured XPath and full-text XQuery-based search techniques to extract information from structured XML documents. Thus, FleXPath allows both database-style querying and full-text search. The results from both query approaches are scored and ranked on structural and full-text search features. This approach was successfully evaluated in INEX [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>We argue that automated web information extraction could benefit from this idea and that the FleXPath approach could be expanded to semi-structured HTML documents. Our main goal is to yield better coverage for structured information extraction. Due to the nature of XPath or CSS-based selectors, any structural change to the source of extraction will affect the quality of the returned data. This paper aims to create and to test an information extraction concept for semi-structured documents.</p>
      <p>We will demonstrate the feasibility of our approach by implementing our algorithm in a demonstrator called FleXipy and testing it against a sample document collection containing different types of semi-structured HTML data. We will test whether our approach results in a more robust and more complete outcome than "normal" XPath/CSS-based extraction frameworks.</p>
    </sec>
    <sec id="sec-2">
      <title>FleXipy - Information Extraction for HTML</title>
      <p>In this section, we present FleXipy, an extraction mechanism that incorporates both path-based and text-based features. FleXipy always starts with a strict path-based extraction rule that was configured by the wrapper generator. Only if this strict extraction rule fails to verify and returns an empty node are the additional extraction features activated. First, we generate a list of possible node candidates by looking in the neighborhood of the originally configured path. In a second step, we use different text-based features to rank the node candidates to find the best-matching node and path to correct the broken extraction rule.</p>
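      <p>The control flow described above can be sketched as follows. This is a minimal sketch with hypothetical callables standing in for the path-based and text-based modules; the actual FleXipy implementation is not reproduced in this paper.</p>
      <preformat><![CDATA[
```python
# Minimal sketch of the FleXipy control flow (hypothetical helper callables).
def flexipy_extract(extract_strict, find_candidates, rank_candidates):
    """Strict path rule first; fall back to candidate search only on failure."""
    text = extract_strict()               # rigid XPath/CSS extraction rule
    if text:                              # rule verified -> no fallback needed
        return text
    candidates = find_candidates()        # path-based modules (template/subtree search)
    ranked = rank_candidates(candidates)  # text-based scoring and ranking
    return ranked[0] if ranked else None
```
]]></preformat>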
      <sec id="sec-2-1">
        <title>Main Components of FleXipy</title>
        <p>The core of FleXipy is divided into three major components: (1) modules for path-based tree interactions that use rigid patterns like XPath, (2) modules for textual, content-based interactions that use full-text search patterns, and finally (3) a module to rank the node candidates.</p>
        <p>Path-based Modules There are two different methods for finding node candidates with FleXipy. One requires a full XPath to the node where the desired text should be found. The other requires an XPath that selects a structural description to look for anywhere in the DOM tree. When given a full XPath, we use it as a template. This method is useful in cases where the desired text should be in this node but moved to a nearby sibling due to minimal structural changes. Using a limited breadth-first search on the DOM tree followed by a depth-limited search, we check all siblings for existing text, and their full XPaths are added to the list of possible candidates. In FleXipy, this is called template search.</p>
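        <p>The template search can be sketched as follows. We use a hypothetical minimal Node class standing in for a DOM element (real code would operate on lxml elements); the breadth and depth limits are illustrative assumptions.</p>
        <preformat><![CDATA[
```python
# Sketch of the template search: collect text-bearing siblings (and their
# shallow descendants) of the node the strict XPath pointed at.
from collections import deque

class Node:
    """Hypothetical stand-in for a DOM element with parent/child links."""
    def __init__(self, tag, text=None, children=()):
        self.tag, self.text, self.children = tag, text, list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def template_search(target, max_breadth=3, max_depth=2):
    candidates = []
    if target.parent is None:
        return candidates
    siblings = [s for s in target.parent.children if s is not target]
    queue = deque((s, 0) for s in siblings[:max_breadth])   # breadth limit
    while queue:
        node, depth = queue.popleft()
        if node.text and node.text.strip():                 # text-bearing node
            candidates.append(node)
        if depth < max_depth:                               # depth limit
            queue.extend((c, depth + 1) for c in node.children)
    return candidates
```
]]></preformat>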
        <p>The second method requires a structural description of a subtree to search for in the DOM tree, e.g., //tbody/tr/td/h1. Any found expression containing text will be considered a candidate for further analysis. In cases where no candidate can be found, it is recommended and supported to reduce the size of the subtree to enlarge the search space. This is called subtree search.</p>
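        <p>The subtree search idea can be sketched with the standard library's ElementTree path syntax; the function name is ours, not FleXipy's public API.</p>
        <preformat><![CDATA[
```python
# Sketch of the subtree search: match a structural pattern anywhere in the
# tree; every text-bearing match becomes a node candidate.
import xml.etree.ElementTree as ET

def subtree_search(root, steps):
    """steps is a structural description such as ["tbody", "tr", "td", "h1"];
    './/' anchors it at any depth, like //tbody/tr/td/h1 in XPath."""
    matches = root.findall(".//" + "/".join(steps))
    return [m for m in matches if m.text and m.text.strip()]
```
]]></preformat>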
        <p>The two path-based searches return a list of candidates. We calculate the distances between the candidates in the DOM tree and the originally configured XPath expression. This distance is normalized for the template search. For the subtree search, no distance can be calculated; therefore the distance is set to d = 0. All other distances contribute to the path score as follows: p = 1 − d/d_max. A smaller distance leads to a higher path score, with 0 ≤ p ≤ 1.</p>
        <p>Text-based Modules Text- and content-based modules are components that test the content of all candidates found by the path-based modules. These modules represent a verification process. Depending on the configured rules, the interaction can be expressed using the Ratcliff/Obershelp fuzzy string matching algorithm or a combination of rules that verify the structural features of the extracted text. These rules can match text features like length and formatting, check for prefixes, or use text similarity scores for text that may only change slightly across sources. Every configured rule symbolizes a verification task for the extracted text and should "penalize" candidates for not passing a pattern rule. The text score is t = 1 − f/n, where f is the number of failed verifications and n is the number of configured rules. The result of this process is a score of 0 ≤ t ≤ 1 for all text-based features.</p>
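        <p>The two scores can be transcribed directly as plain functions (the names are ours; the edge-case handling for zero denominators is an assumption):</p>
        <preformat><![CDATA[
```python
# The path score p = 1 - d/d_max and text score t = 1 - f/n as defined above.
def path_score(d, d_max):
    """Candidates closer to the configured XPath score higher, 0 <= p <= 1."""
    return 1.0 if d_max == 0 else 1.0 - d / d_max

def text_score(f, n):
    """f failed verifications out of n configured rules penalize a candidate."""
    return 1.0 if n == 0 else 1.0 - f / n
```
]]></preformat>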
        <p>Candidate Ranking The two module types return two scores for every node candidate: p for all path-based features and t for all textual features. A simple linear combination of both normalized scores p and t then leads to a final ranking R (Eq. 1):</p>
        <p>R = p · (1 − α) + t · α
(1)
where 0 ≤ α ≤ 1 is usually set to 0.7. The value for α is a purely heuristically determined best-practice value that works for the test data sets in our case study (see Section 3).</p>
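        <p>Eq. 1, as we read it from the garbled source (with α weighting the text score and 1 − α the path score), transcribes to:</p>
        <preformat><![CDATA[
```python
# Eq. 1 as a function; alpha = 0.7 is the paper's heuristic default.
# The placement of alpha on the text score reflects our reading of the
# (typographically damaged) original equation.
def rank_score(p, t, alpha=0.7):
    """Linear combination R = p * (1 - alpha) + t * alpha, scores in [0, 1]."""
    return p * (1.0 - alpha) + t * alpha
```
]]></preformat>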
      </sec>
      <sec id="sec-2-2">
        <title>Example Extraction with FleXipy</title>
        <p>To further clarify how the FleXipy algorithm works, we will showcase a step-by-step extraction in comparison to a conventional XPath-based extraction approach. The headlines, in this case, are formatted using CSS, and we also assume that headlines have a minimum length of 30 characters. Therefore, we can make use of functions provided by the FleXipy framework addressing these characteristics. These directives will be added to the FleXipy configuration file.</p>
        <p>
          During a simulated extraction process, we got an email where the title can be found not in //div/p[7]/span[1] but in //div/p[8]/span[1]. Using a normal XPath-based wrapper, the extraction will fail since the path has changed. The FleXipy framework will first check the given XPath expression against the configured directives. If they match, FleXipy is done; if not, FleXipy will use its path-based modules to look in the surroundings of the configured XPath for nodes that contain any form of text, assuming that the wanted text only moved slightly. In this case, the mentioned template search will look for siblings of the p and span nodes, ultimately also covering the wanted text. When reaching the configured limits, all found nodes containing text will then be checked against the text-based directives. Combining the distance and the probability that a found text is the wanted text, the framework will then calculate a ranking of candidates and provide it for further processing.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A Case Study on FleXipy for Emails</title>
      <p>To test the feasibility of the FleXipy approach, we present the results of a case study with semi-structured email texts. The source of our test collection is a set of extracted email bodies in HTML from the Google Digital News Initiative-funded PRepublicatIOn Radar (PRIOR) project (https://ir.web.th-koeln.de/projects/prior/). The background of PRIOR is to enable science journalists to keep up with the latest scientific research in relevant domains of knowledge. Scientific publishers offer science journalists exclusive access to prepublications of upcoming research articles under a very strict embargo. These prepublications are distributed exclusively via email and are very heterogeneous by nature, as each publisher has its own format. The PRIOR project uses a fully automated extraction pipeline written in Python that utilizes the web crawling framework Scrapy and XPath-based wrappers for information extraction. The daily work with these emails shows that these wrappers are highly vulnerable to deviations in the source data, which is the case for most of the emails. Most of the time, the emails are highly unstructured or do not share a common set of formatting rules, making them nearly impossible to process by standard wrappers. These issues make the unstructured and heterogeneous email bodies from the PRIOR project a perfect test bed for evaluating the FleXipy approach.</p>
      <sec id="sec-3-1">
        <title>Evaluation Setup</title>
        <p>We built a small test collection out of 75 PRIOR emails from the four scientific journals Science, Jama, Lancet, and Cell. Each email may contain many embargo announcements that consist of different metadata fields. We manually identified and annotated 202 of these metadata fields (also called slots) within the source emails. The fields were of the four following types: embargo dates, contact details, titles of the embargoed articles, and short abstracts. For each field, we manually defined the correct XPath to extract the information to test the different system configurations.</p>
        <p>Two different configurations were part of our evaluation: a simple XPath-based extraction without any modifications (called baseline) and a full-featured FleXipy configuration (called Full-FleXipy). The baseline represents the standard extraction process done by frameworks like Scrapy. From each journal, we randomly took one email as training data to manually extract the XPath expression and used it to create a simple wrapper for the information extraction. The same XPath expression was later configured in FleXipy. The training emails were removed from the evaluation corpus. These extracted XPath expressions were then used as a rigid set of extraction rules for the remaining emails. Since emails are heterogeneous by nature, the results of this simulated extraction process are expected to be insufficient. After generating our baseline, we created additional text-based rule sets (like prefix patterns) for the training data and activated the path-based modules.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation Metrics</title>
        <p>
          We compared the results of FleXipy against manually annotated data for all slots. The comparison relies on two different measurements: F1 and the slot error rate [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (SER). When evaluating, we have four different kinds of results for a slot: (1) we correctly extracted a slot (hit), (2) we have slots which contain incorrect data (substitution), (3) we have slots without any data (deletion), and (4) slots which were configured and extracted but cannot be found in the source data and should not have been extracted (insertion).
        </p>
        <p>Returned slots are either a hit, a substitution, or an insertion, but only slots which count as a hit are relevant in terms of precision and recall and therefore F1. We decided to use the slot error rate as an additional measurement, as it introduces a performance measure for information extraction that gives adequate weight to the different error types. It is defined as follows (Eq. 2):
SER = (substitutions + deletions + insertions) / (hits + substitutions + deletions)
(2)
where a lower SER value indicates a better extraction performance.</p>
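        <p>The two measurements transcribe directly into code (function names are ours); note that returned slots are hits, substitutions, and insertions, while the reference contains hits, substitutions, and deletions:</p>
        <preformat><![CDATA[
```python
# SER (Eq. 2) and F1 computed from the four slot outcome counts.
def slot_error_rate(hits, substitutions, deletions, insertions):
    """Errors divided by the number of slots in the reference data."""
    return (substitutions + deletions + insertions) / (hits + substitutions + deletions)

def f1(hits, substitutions, deletions, insertions):
    """F1 from slot counts: only hits count as correct returns."""
    returned = hits + substitutions + insertions   # everything the system emitted
    in_reference = hits + substitutions + deletions
    precision = hits / returned if returned else 0.0
    recall = hits / in_reference if in_reference else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```
]]></preformat>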
      </sec>
      <sec id="sec-3-3">
        <title>Findings</title>
        <p>We evaluated four different entry types (embargo dates, contact information, titles, and abstracts) for four different scientific journals (Science, Jama, Lancet, and Cell). We report SER and F1-scores (see Table 1). The statistical significance of differences in the average performance is determined using a two-sided Student's t-test. Significant improvements over the baselines (p &lt; 0.05) are indicated with an asterisk.</p>
        <p>[Table 1: SER and F1 scores for the baseline and Full-FleXipy per journal (Science, Jama, Lancet, Cell, avg) and entry type.]</p>
        <p>We see clear improvements in the average extraction performance for all four extracted entry types on both F1 and SER. Especially the contact information shows a significant improvement, from an average SER of 1.06 to 0.17 and an improvement of F1 from 0.1 to 0.67. The other improvements are still evident but not statistically significant. A small increase can also be seen for the most heterogeneous slot defined, increasing the average F1 for abstracts from 0.37 to 0.43. Only one single slot entry type for one single journal had a loss compared to the baseline: the embargo dates from the Science journal. Here, the SER value rose from 0.38 to 0.44 while the F1 still increased from 0.69 to 0.72.</p>
        <p>We can also see that the information saved in the embargo slot was extracted perfectly for three of the four providers. For Cell, FleXipy was capable of finding all missing titles and contacts and fixing them accordingly. For Jama, there is no change in the result at all, since the baseline result was already near perfect from the beginning.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Outlook</title>
      <p>We presented and evaluated an information extraction method to complement simple XPath or CSS-based wrappers. We implemented our approach, which incorporates path-based and text-based extraction rules and a candidate ranking module, in a demonstrator called FleXipy. The evaluation on email newsletters from different scientific publishers and journals showed clear improvements compared to a simple XPath baseline. The improvements are possible because FleXipy can search for alternative candidates when the original XPath selector would only return an empty node (deletions). Additionally, FleXipy could be used to correct the two other extraction error types, substitutions and insertions, as the text-based modules allow checking on additional features (like text patterns) rather than only on path features.</p>
      <p>Although the evaluation showed a clear improvement and returned nearly perfect results for some entry types, we have to emphasize the preliminary character of this evaluation. It is only a case study with 202 manually annotated slots, resulting in a rather small test collection. To get a more reliable result, the size of the collection should be increased. Also, the configured rules used to verify the data can be improved by increasing the size of the training data.</p>
      <p>An interesting result is the opposing performance of SER and F1 for the embargo entries in Science. This is the case when the extraction results shift from cases of substitutions to cases of deletions. This leads to a decreasing number of returned documents and therefore influences precision and recall. The increase in F1 does not clearly describe an overall increase in the extraction result, since there could still be fewer correctly extracted slots. These contradicting results for the embargo dates in Science are a compelling case that demonstrates the justification for using SER to complement the precision/recall-fixed perspective of F1. While we increase the number of relevant slots in the result, we introduce more and different error types.</p>
      <p>The overall results of this case study are promising for the further development of this approach to improve web information extraction. FleXipy is not a complete extraction framework (like, e.g., Scrapy) but can be part of such frameworks to complement XPath or CSS-based wrappers. For productive use, this framework should be extended with more possibilities to write extraction rules for the source data.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was part of the PRIOR project, which was funded by the Google Digital News Initiative (https://newsinitiative.withgoogle.com/dnifund/dni-projects/priorprepublication-radar-round-4/).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FleXPath: flexible structure and full-text querying for XML. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, pp. 83-94. ACM, New York, NY, USA (2004). https://doi.org/10.1145/1007568.1007581</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Amer-Yahia, S., Lalmas, M.: XML search: languages, INEX and scoring. ACM SIGMOD Record 35(4), 16-23 (Dec 2006). https://doi.org/10.1145/1228268.1228271</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Balog, K.: Meet the Data. In: Balog, K. (ed.) Entity-Oriented Search, pp. 25-53. The Information Retrieval Series, Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-93935-3_2</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Chamberlain, S., Ram, K., Grolemund, G.: Extracting data from the web: APIs and beyond. In: The R User Conference 2016. Stanford University, Stanford, California (2016), https://github.com/ropensci-training/user2016-tutorial/blob/master/slides.pdf</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10), 1411-1428 (2006)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Jurafsky, D., Martin, J.H.: Speech and Language Processing, vol. 3. Pearson, London (2014)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R., et al.: Performance measures for information extraction. In: Proceedings of DARPA Broadcast News Workshop, pp. 249-252. Herndon, VA (1999)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Melton, J., Buxton, S.: Querying XML: XQuery, XPath, and SQL/XML in Context. The Morgan Kaufmann Series in Data Management Systems, Elsevier Science (2011), https://books.google.de/books?id=EuYRXgDqVp0C</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Michels, C., Fayzrakhmanov, R.R., Ley, M., Sallinger, E., Schenkel, R.: OXPath-based data acquisition for DBLP. In: 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pp. 319-320. IEEE Computer Society (2017). https://doi.org/10.1109/JCDL.2017.7991609</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Neumann, M., Steinberg, J., Schaer, P.: Web-Scraping for Non-Programmers: Introducing OXPath for Digital Library Metadata Harvesting. Code4Lib Journal 2017(38) (Oct 2017), https://journal.code4lib.org/articles/13007</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>