Information Extraction for Semi-structured Email Corpora

Hendrik Adam1 and Philipp Schaer2 [0000-0002-8817-4632]

1 Science Media Center, Cologne, Germany
firstname.lastname@sciencemediacenter.de
2 TH Köln - University of Applied Sciences, Cologne, Germany
firstname.lastname@th-koeln.de

Abstract. Information extraction is a prerequisite for enhanced IR techniques. To move beyond the rigid extraction rules of wrappers based on XPath or CSS selectors, we present a new extraction method that extends the FleXPath approach originally used for structured XML retrieval in INEX. We adapt this method to semi-structured HTML and present a case study and a short evaluation based on a corpus of emails from scientific publishers.

Keywords: Information extraction · Semi-structured documents · Emails

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Information extraction is a prerequisite for enhanced information retrieval techniques like entity-based search, semantic search, or other approaches that rely on properly structured and annotated information [3]. Searching for or within structured information like XML documents (INEX) makes it possible to use semantic annotations directly, but source documents are usually not semantically structured at all and only allow full-text search. In this work, we use a newsletter corpus that is not formally structured but semi-structured, with some recurring parts in the mails (like titles, dates, or authors). We want to extract this information to enable rich IR techniques like filtering, browsing, or ranking based on these semantic annotations.

Information extraction [6] usually relies on structural layouts and syntactical patterns [5] within the source documents. Wrappers exploit these structures: they use pattern-matching procedures and rely heavily on predefined or learned extraction rules [9]. A more general approach to wrapper construction is XPath [8] or derivations such as OXPath [10]. Both are used to address and locate parts and nodes of XML/HTML documents and to extract their content. For HTML documents in particular, CSS selectors are an alternative to XPath [4].

A common issue in real-life web information extraction is that XPath- or CSS-based extraction rules are inflexible and break under even slight changes to the structure of the source documents. Because of this, adjusting an existing wrapper to new requirements is costly. Amer-Yahia et al. introduced a technique called FleXPath [1] to overcome this issue. They developed a mixture of structured XPath and full-text XQuery-based search techniques to extract information from structured XML documents. FleXPath thus supports both database-style querying and full-text search. The results of both query approaches are scored and ranked on structural and full-text search features. This approach was successfully evaluated in INEX [2].

We argue that automated web information extraction can benefit from this idea and that the FleXPath approach can be extended to semi-structured HTML documents. Our main goal is to achieve better coverage for structured information extraction: due to the nature of XPath- or CSS-based selectors, any structural change to the extraction source affects the quality of the returned data. This paper aims to create and test an information extraction concept for semi-structured documents.
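To make this brittleness concrete, the following minimal Python sketch shows a conventional rigid wrapper; lxml, the sample markup, and the selector are our illustrative choices here and not taken from any of the cited systems.

    # A minimal sketch of a rigid XPath wrapper, assuming lxml is available;
    # the markup and the selector are invented examples.
    from lxml import html

    EMAIL_BODY = """
    <div id="email-wrapper"><div>
      <p><span>Some headline that the wrapper should extract</span></p>
    </div></div>
    """

    tree = html.fromstring(EMAIL_BODY)
    # The strict rule silently returns an empty list as soon as the source
    # structure drifts, e.g., when an extra <p> appears before the headline.
    nodes = tree.xpath('//*[@id="email-wrapper"]/div/p[1]/span[1]')
    print(nodes[0].text_content() if nodes else "extraction failed")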
We demonstrate the feasibility of our approach by implementing our algorithm in a demonstrator called FleXipy and testing it against a sample document collection containing different types of semi-structured HTML data. We test whether our approach yields a more robust and more complete outcome than "normal" XPath/CSS-based extraction frameworks.

2 FleXipy - Information Extraction for HTML

In this section we present FleXipy, an extraction mechanism that incorporates both path-based and text-based features. FleXipy always starts with a strict path-based extraction rule that was configured by the wrapper generator. Only if this strict extraction rule fails to verify and returns an empty node are the additional extraction features activated. First, we generate a list of possible node candidates by looking in the neighborhood of the originally configured path. In a second step, we use different text-based features to rank the node candidates in order to find the best-matching node and path and to correct the failing extraction rule.

2.1 Main Components of FleXipy

The core of FleXipy is divided into three major components: (1) modules for path-based tree interactions that use rigid patterns like XPath, (2) modules for textual, content-based interactions that use full-text search patterns, and (3) a module to rank the node candidates.

Path-based Modules There are two different methods for finding node candidates with FleXipy. One requires a full XPath to the node where the desired text should be found. The other requires an XPath that describes a structure to look for anywhere in the DOM tree.

When given a full XPath, we use it as a template. This method is useful in cases where the desired text should be in this node but has moved to a nearby sibling due to minimal structural changes. Using a limited breadth-first search on the DOM tree followed by a depth-limited search, we check all siblings for existing text and add their full XPaths to the list of possible candidates. In FleXipy, this is called template search.

The second method requires a structural description of a subtree to search for in the DOM tree, e.g., //tbody/tr/td/h1. Any found expression containing text is considered a candidate for further analysis. In cases where no candidate can be found, it is recommended and supported to reduce the size of the subtree to enlarge the search space. This is called subtree search.

The two path-based searches return a list of candidates. We calculate the distances between the candidates in the DOM tree and the originally configured XPath expression. This distance is normalized for the template search. For the subtree search, no distance can be calculated; therefore the distance is set to d = 0. All distances contribute to the path score as follows: p = 1 − d/dmax. A smaller distance leads to a higher path score, with 0 ≤ p ≤ 1.

Text-based Modules Text- and content-based modules test the content of all candidates found by the path-based modules. These modules represent a verification process. Depending on the configured rules, the interaction can be expressed using the Ratcliff/Obershelp fuzzy string matching algorithm or a combination of rules that verify the structural features of the extracted text. These rules can match text features like length and formatting, check for prefixes, or use text similarity scores for text that may change only slightly between sources. Every configured rule symbolizes a verification task for the extracted text and "penalizes" candidates that do not pass a pattern rule. The text score is t = 1 − f/n, where f is the number of failed verifications and n is the number of configured rules. The result of this process is a score 0 ≤ t ≤ 1 covering all text-based features.
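As a minimal sketch of the two scores defined above, the following Python fragment implements p = 1 − d/dmax and t = 1 − f/n; the function names, signatures, and example rules are illustrative assumptions, not FleXipy's actual API.

    # Sketch of the two scoring steps, assuming the path-based modules have
    # already measured the DOM distance d of each candidate.
    from typing import Callable, List

    def path_score(d: float, d_max: float) -> float:
        """p = 1 - d/d_max; subtree-search candidates enter with d = 0."""
        return 1.0 if d_max == 0 else 1.0 - d / d_max

    def text_score(text: str, rules: List[Callable[[str], bool]]) -> float:
        """t = 1 - f/n, where f of the n configured rules fail on the text."""
        failed = sum(1 for rule in rules if not rule(text))
        return 1.0 - failed / len(rules)

    # Two rules in the spirit of the paper: a minimum length and a prefix check.
    rules = [lambda s: len(s) >= 30, lambda s: not s.startswith("Embargo:")]
    print(text_score("Sweet Potato History Casts Doubt on Early Contact", rules))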
Candidate Ranking The two module types return two scores for every node candidate: p for all path-based features and t for all textual features. A simple linear combination of the two normalized scores then leads to a final ranking R (Eq. 1):

R = p × (1 − α) + t × α    (1)

where 0 ≤ α ≤ 1 is usually set to 0.7. The value for α is a purely heuristically determined best-practice value that works for the test data sets in our case study (see Section 3).
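Continuing the sketch above, Eq. 1 can be implemented in a few lines; the candidate layout is again an illustrative assumption:

    # Ranking sketch for Eq. 1: R = p * (1 - alpha) + t * alpha.
    def rank_candidates(candidates, alpha=0.7):
        """candidates: (xpath, p, t) triples; returns (R, xpath), best first."""
        scored = [(p * (1 - alpha) + t * alpha, xpath)
                  for xpath, p, t in candidates]
        return sorted(scored, reverse=True)

    candidates = [("//div/p[7]/span[1]", 1.0, 0.0),  # configured path, text fails
                  ("//div/p[8]/span[1]", 0.9, 1.0)]  # moved node, text verifies
    print(rank_candidates(candidates)[0])  # -> roughly (0.97, '//div/p[8]/span[1]')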
2.2 Example Extraction with FleXipy

To further clarify how the FleXipy algorithm works, we showcase a step-by-step extraction in comparison to a conventional XPath-based extraction approach.

    ...
    <div id="email-wrapper">
      <div>
        <p>...</p>
        <p>...</p>
        <p>...</p>
        <p>...</p>
        <p>...</p>
        <p>...</p>
        <p>
          <span style="...">Sweet Potato History Casts Doubt on Early
            Contact between Polynesia and the Americas</span>
          <br>
          <span style="...">...</span>
        </p>
        <p>...</p>
        ...
      </div>
    </div>
    ...

Fig. 1. Top: Section from an embargo email from the journal Cell. Bottom: Corresponding HTML code for this email.

Figure 1 shows the head of an article in one of the typical embargo emails handled in PRIOR (for an explanation of PRIOR, see Section 3). For this example, we would like to extract the title ("Sweet Potato History Casts ..."). The starting point for the extraction with both methods is manually finding the XPath pointing to the headline. Using the HTML snippet from Figure 1, the XPath expression needed is //*[@id="email-wrapper"]/div/p[7]/span[1]. We can see from the expression that there are two nodes with siblings of the same kind (p and span). Due to the heterogeneous nature of emails, this can cause problems, since the headline node may still move within the DOM tree of the same provider. A typical XPath-based wrapper will then extract no data or unwanted data.

When using FleXipy, we can make use of the text-based modules to identify unwanted or missing data. Looking at our example, it is clear that the headlines in this case are formatted using CSS, and we also assume that headlines have a minimum length of 30 characters. Therefore, we can make use of functions provided by the FleXipy framework addressing these characteristics. These directives are added to the FleXipy configuration file.

During a simulated extraction process, we received an email where the title is found not in //div/p[7]/span[1] but in //div/p[8]/span[1]. With a normal XPath-based wrapper, the extraction fails since the path has changed. The FleXipy framework first checks the given XPath expression against the configured directives. If they match, FleXipy is done; if not, FleXipy uses its path-based modules to look in the surroundings of the configured XPath for nodes that contain any form of text, assuming that the wanted text has only moved slightly. In this case, the template search looks for siblings of the p and span nodes, ultimately also covering the wanted text. When the configured limits are reached, all found nodes containing text are checked against the text-based directives. Combining the distance and the probability that a found text is the wanted one, the framework calculates a ranking of candidates and provides it for further processing.
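The paper's configuration file is not reproduced here, so the following sketch is a purely hypothetical illustration of how the directives for the title slot of Fig. 1 (a CSS-styled node with a minimum length of 30 characters) could be written down:

    # Hypothetical directives for the title slot from Fig. 1; FleXipy's real
    # configuration syntax is not shown above, so names are invented.
    TITLE_SLOT = {
        # Strict rule tried first; template search only starts if it fails.
        "xpath": '//*[@id="email-wrapper"]/div/p[7]/span[1]',
        # Text-based directives used to verify every candidate node
        # (candidates are assumed to be lxml elements).
        "checks": [
            lambda node: len(node.text_content()) >= 30,  # headlines are long
            lambda node: node.get("style") is not None,   # headlines carry CSS
        ],
    }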
3 A Case Study on FleXipy for Emails

To test the feasibility of the FleXipy approach, we present the results of a case study with semi-structured email texts. The source of our test collection is a set of extracted email bodies in HTML from the Google Digital News Initiative-funded PRepublicatIOn Radar (PRIOR) project3. The background of PRIOR is to enable science journalists to keep up with the latest scientific research in relevant domains of knowledge. Scientific publishers offer science journalists exclusive access to prepublications of upcoming research articles under a very strict embargo. These prepublications are distributed exclusively via email and are very heterogeneous by nature, as each publisher has its own format. The PRIOR project uses a fully automated extraction pipeline written in Python that utilizes the web crawling framework Scrapy and XPath-based wrappers for information extraction. The daily work with these emails shows that these wrappers are highly vulnerable to deviations in the source data, which occur in most of the emails. Most of the time, the emails are highly unstructured or do not share a common set of formatting rules, making them nearly impossible to process with standard wrappers. These issues make the unstructured and heterogeneous email bodies from the PRIOR project a perfect test bed for evaluating the FleXipy approach.

3 https://ir.web.th-koeln.de/projects/prior/

3.1 Evaluation Setup

We built a small test collection of 75 PRIOR emails from the four scientific journals Science, Jama, Lancet, and Cell. Each email may contain many embargo announcements that consist of different metadata fields. We manually identified and annotated 202 of these metadata fields (also called slots) within the source emails. The fields were of the following four types: embargo dates, contact details, titles of the embargoed articles, and short abstracts. For each field, we manually defined the correct XPath to extract the information in order to test the different system configurations.

Two different configurations were part of our evaluation: a simple XPath-based extraction without any modifications (called baseline) and a full-featured FleXipy configuration (called Full-FleXipy). The baseline represents the standard extraction process carried out by frameworks like Scrapy. From each journal, we randomly took one email as training data, manually extracted the XPath expressions from it, and used them to create a simple wrapper for the information extraction. The same XPath expressions were later configured in FleXipy. The training emails were removed from the evaluation corpus. The extracted XPath expressions were then used as a rigid set of extraction rules for the remaining emails. Since emails are heterogeneous by nature, the results of this simulated extraction process are expected to be insufficient. After generating our baseline, we created additional text-based rule sets (like prefix patterns) from the training data and activated the path-based modules.

3.2 Evaluation Metrics

We compared the results of FleXipy against manually annotated data for all slots. The comparison relies on two different measurements: F1 and the slot error rate (SER) [7]. When evaluating, there are four different kinds of results for a slot: (1) we correctly extracted a slot (hit), (2) a slot contains incorrect data (substitution), (3) a slot contains no data at all (deletion), and (4) a slot was configured and extracted but cannot be found in the source data and should not have been extracted (insertion). Returned slots are either a hit, a substitution, or an insertion, but only slots that count as a hit are relevant in terms of precision and recall and therefore F1. We decided to use the slot error rate as an additional measurement, as it is a performance measure for information extraction that gives adequate weight to the different error types. It is defined as follows (Eq. 2):

SER = (substitutions + deletions + insertions) / (hits + substitutions + deletions)    (2)

where a lower SER value indicates a better extraction performance.
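As a small worked example of Eq. 2, the following sketch computes the SER from counts of the four result types; the counts are invented and not taken from Table 1:

    # Slot error rate from Eq. 2; lower values indicate better extraction.
    def slot_error_rate(hits, substitutions, deletions, insertions):
        return (substitutions + deletions + insertions) / \
               (hits + substitutions + deletions)

    # Invented example: 8 hits, 1 substitution, 1 deletion, 0 insertions.
    print(slot_error_rate(8, 1, 1, 0))  # -> 0.2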
3.3 Findings

We evaluated four different entry types (embargo dates, contact information, titles, and abstracts) for four different scientific journals (Science, Jama, Lancet, and Cell). We report SER and F1 scores (see Table 1). The statistical significance of differences in the average performance is determined using a two-sided Student's t-test. Significant improvements over the baseline (p < 0.05) are indicated with an asterisk.

                   Embargo      Contact      Title        Abstract
                   SER   F1     SER   F1     SER   F1     SER   F1
    Baseline
      Science      0.38  0.69   0.69  0.40   0.50  0.53   0.63  0.38
      Jama         0.00  1.00   nan   0.00   0.00  1.00   0.12  0.88
      Lancet       0.73  0.29   1.50  0.00   0.93  0.13   0.80  0.21
      Cell         1.00  0.00   1.00  0.00   1.00  0.00   1.00  0.00
      avg          0.53  0.49   1.06  0.10   0.61  0.41   0.64  0.37
    Full-FleXipy
      Science      0.44  0.72   0.00  1.00   0.19  0.84   0.38  0.63
      Jama         0.00  1.00   nan   0.00   0.00  1.00   0.12  0.88
      Lancet       0.00  1.00   0.50  0.67   0.33  0.67   0.80  0.21
      Cell         0.00  1.00   0.00  1.00   0.00  1.00   1.00  0.00
      avg          0.11  0.93   0.17* 0.67*  0.13  0.88   0.57  0.43

Table 1. Results of the information extraction for four different field types and four different journals, comparing the baseline approach and the full FleXipy implementation. We report the slot error rate (SER) and F1 scores. Significance was tested using a two-sided t-test; significant improvements over the baseline are marked with an asterisk. Field types not existing in a journal are represented by nan values.

We see clear improvements in the average extraction performance for all four extracted entry types on both F1 and SER. Especially the contact information shows a significant improvement, from an average SER of 1.06 to 0.17 and an improvement of F1 from 0.10 to 0.67. The other improvements are still evident but not statistically significant. A small increase can also be seen for the most heterogeneous slot type: the average F1 for abstracts rises from 0.37 to 0.43. Only a single slot entry type for a single journal showed a loss compared to the baseline: the embargo dates from the journal Science. Here the SER value rose from 0.38 to 0.44, while F1 still increased from 0.69 to 0.72. We can also see that the information saved in the embargo slot was extracted perfectly for three of the four providers. For Cell, FleXipy was capable of finding all missing titles and contacts and fixing them accordingly. For Jama, there is no change in the result at all, since the baseline result was already near perfect from the beginning.

4 Discussion and Outlook

We presented and evaluated an information extraction method to complement simple XPath- or CSS-based wrappers. We implemented our approach, which incorporates path-based and text-based extraction rules and a candidate ranking module, in a demonstrator called FleXipy. The evaluation on email newsletters from different scientific publishers and journals showed clear improvements compared to a simple XPath baseline. The improvements are possible because FleXipy can search for alternative candidates when the original XPath selector would only return an empty node (deletions). Additionally, FleXipy could be used to correct the two other extraction error types, substitutions and insertions, as the text-based modules allow checking additional features (like text patterns) rather than only path features.

Although the evaluation showed a clear improvement and returned nearly perfect results for some entry types, we have to emphasize the preliminary character of this evaluation. It is only a case study with 202 manually annotated slots, resulting in a rather small test collection. To get more reliable results, the size of the collection should be increased. Also, the configured rules used to verify the data can be improved by increasing the size of the training data.
An interesting result is the opposing behavior of SER and F1 for the embargo entries in Science. This happens when the extraction results shift from substitutions to deletions. Such a shift decreases the number of returned slots and therefore influences precision and recall. The increase in F1 does not necessarily indicate an overall improvement of the extraction result, since there may still be fewer correctly extracted slots. These contradicting results for the embargo dates in Science are a compelling case for using SER to complement the precision/recall-centered perspective of F1: while we increase the number of relevant slots in the result, we introduce more and different error types.

The overall results of this case study are promising for the further development of this approach to improve web information extraction. FleXipy is not a complete extraction framework (like, e.g., Scrapy) but can be part of such frameworks to complement XPath- or CSS-based wrappers. For productive use, the framework should be extended with more possibilities to write extraction rules for the source data.

Acknowledgments This work was part of the PRIOR project that was funded by the Google Digital News Initiative4.

4 https://newsinitiative.withgoogle.com/dnifund/dni-projects/prior-prepublication-radar-round-4/

References

1. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FleXPath: Flexible structure and full-text querying for XML. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. pp. 83–94. SIGMOD '04, ACM, New York, NY, USA (2004). https://doi.org/10.1145/1007568.1007581
2. Amer-Yahia, S., Lalmas, M.: XML search: languages, INEX and scoring. ACM SIGMOD Record 35(4), 16–23 (Dec 2006). https://doi.org/10.1145/1228268.1228271
3. Balog, K.: Meet the Data. In: Balog, K. (ed.) Entity-Oriented Search, pp. 25–53. The Information Retrieval Series, Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-93935-3_2
4. Chamberlain, S., Ram, K., Grolemund, G.: Extracting data from the web: APIs and beyond. In: The R User Conference 2016. Stanford University, Stanford, California (2016), https://github.com/ropensci-training/user2016-tutorial/blob/master/slides.pdf
5. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10), 1411–1428 (2006)
6. Jurafsky, D., Martin, J.H.: Speech and Language Processing, vol. 3. Pearson, London (2014)
7. Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R., et al.: Performance measures for information extraction. In: Proceedings of the DARPA Broadcast News Workshop. pp. 249–252. Herndon, VA (1999)
8. Melton, J., Buxton, S.: Querying XML: XQuery, XPath, and SQL/XML in Context. The Morgan Kaufmann Series in Data Management Systems, Elsevier Science (2011), https://books.google.de/books?id=EuYRXgDqVp0C
9. Michels, C., Fayzrakhmanov, R.R., Ley, M., Sallinger, E., Schenkel, R.: OXPath-based data acquisition for dblp. In: 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017. pp. 319–320. IEEE Computer Society (2017). https://doi.org/10.1109/JCDL.2017.7991609
10. Neumann, M., Steinberg, J., Schaer, P.: Web-Scraping for Non-Programmers: Introducing OXPath for Digital Library Metadata Harvesting. Code4Lib Journal 2017(38) (Oct 2017), https://journal.code4lib.org/articles/13007