Ontology Mediated Information Extraction with M ASTRO S YSTEM -T Domenico Lembo1 , Yunyao Li2 Lucian Popa2 , Kun Qian2 , and Federico Maria Scafoglieri?1 1 Dip. di Ingegneria Informatica, Automatica e Gestionale Sapienza Università di Roma, Rome, Italy 2 IBM Almaden Research Center, San Jose, California Abstract. In several data-centric application domains, the need arises to extract valuable information from unstructured text documents. The recent paradigm of Ontology Mediated Information Extraction (OMIE) faces this problem by taking into account the knowledge expressed by a domain ontology, and reasoning over it to improve the quality of extracted data. M ASTRO S YSTEM -T is a novel tool for OMIE, developed by Sapienza University and IBM Almaden Research. In this work, we demonstrate its usage for information extraction over real-world financial text documents from the U.S. EDGAR system. Introduction One of the basic problems of the data-centric information era is the processing of huge amount of unstructured data. If the information inside them is to be automatically ma- nipulated and analyzed, it must be first rearranged into a structured form in which the relevant “facts” can be easily accessed. Information Extraction (IE) provides support to this problem. It refers to the task of automatically organizing gathered data into a structured representation, typically a spread-sheet or a database [11, 6, 4]. Various statistical, rule-based, and learning based approaches for IE have been proposed along the years, leveraging techniques from NLP, machine learning, computational linguistics, databases and knowledge representa- tion (see, e.g., [7, 2, 5, 1]). In this frame of reference, ontologies, which provide formal and explicit specifications of conceptualizations, have been recognized to play an im- portant role in IE [12]. However, despite ontology-based IE has been so far the subject of several investigations [12, 10], how to exploit the reasoning abilities offered by an ontology to improve the extraction process has not yet been specifically studied. Ontology Mediated Information Extraction (OMIE) [9, 8] is a new paradigm for IE which aims at filling this gap. It properly seeks to use the semantic knowledge expressed in ontologies to improve query answering over unstructured data (specifically raw text). ? Work done while at IBM Almaden - Research. Work supported in part by European Research Council under the European Union’s Horizon 2020 Programme through the ERC Advanced Grant WhiteMech (No. 834228) Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons Li- cense Attribution 4.0 International (CC BY 4.0) Domenico Lembo, Yunyao Li Lucian Popa, Kun Qian, and Federico Maria Scafoglieri In this work, we demonstrate M ASTRO S YSTEM -T, a new OMIE system born from a collaboration between the University of Rome “La Sapienza” and IBM Research Al- maden. In particular, after a brief presentation of the system architecture and its main functionalities, we show an OMIE application involving a set of real-world financial text documents coming from the U.S repository of Electronic Data Gathering, Analysis and Retrieval system (EDGAR). Interestingly, with M ASTRO S YSTEM -T we are able to extract data at query time, without having to materialize them in advance. We dis- cuss this feature together with some preliminary experiments that show how ontology reasoning allow us to increase the quality of the extracted data. System Overview The OMIE framework, on which M ASTRO S YSTEM -T is based, is an adaptation of the well-know framework of Ontology Based Data Access (OBDA) [13]. In an OBDA system, an ontology is mapped to an external source database through declarative map- pings, which specify the semantic relationship between the ontology vocabulary and the data (mainly relational) at the sources. In OMIE the data source is instead a repository of unstructured text documents, which are “linked” to the ontology through so-called extraction assertions. Input GUI Mastro System-T Sparql Q.A. Engine Porjects Query Manager Query HUB Ontology Ontology Manager SPARQL Endpoint Extraction Extraction Assertions Assertions Manger Specification Manager System-T Interface System T Documents AQL Engine Fig. 1: M ASTRO S YSTEM -T Architecture This connection between OBDA and OMIE is also reflected in the implementation of our tool. M ASTRO S YSTEM -T, whose architecture is showed in Fig. 1, is a specific tuning of the OBDA engine M ASTRO 1 [3] in order to interface it with S YSTEM -T [1], an IE commercial tool developed at IBM Almaden. The inputs to the system are: – An ontology, specified in any of the standard syntaxes for OWL 2. The ontology is automatically approximated by M ASTRO in the standard profile OWL2QL, to guarantee tractability of query answering. 1 http://obdasystems.com/mastro Ontology Mediated Information Extraction with M ASTRO S YSTEM -T – A set of extraction assertions (EAs) of the form φ(~x) ; P (~x), where P is a pred- icate of the ontology, φ(~x) is a rule-based extractor, and ~x are “frontier variables”, through which, intuitively, data extracted from the source documents instantiate the ontology predicate P [9]. EAs are managed by the ‘Extraction Assertion Manager’ module. The extractors are specified into a declarative rule-based language, and can be combined together with relational algebra operators. Specifically, they are written in AQL, a concrete language used by S YSTEM -T, which is in charge of their processing. In simple terms, S YSTEM -T evaluates extractors over a text and produces a set of spans, i.e., pairs of indexes that identify substrings in the text that are used to construct the individuals that instantiate the ontology. – A set of textual documents, which are managed by S YSTEM -T. – The user’s queries, expressed in standard SPARQL, which are parsed and managed by the ‘Query Manager’ module. Query Sparql Answers Query Q.A. Engine Query Rewriter Ontology Spans to Ontology Extraction SPARQL to AQL Answer Assertions AQL Query Spans System-T Documents Fig. 2: Query Answering Workflow Note that, following the principles of OBDA, in OMIE, the facts of the ontology are not materialized, but they are virtually defined through the extraction assertions. The main reasoning service is Query Answering (QA), which is carried out through query rewriting techniques adapted from those used in M ASTRO, as described in [9]. M ASTRO S YSTEM -T computes answers to the user’s SPARQL queries posed over the ontology by transforming them into AQL extractors and delegating their execution for information extraction from a given document to S YSTEM -T. M ASTRO S YSTEM -T triggers only the extraction assertions useful to generate the answers to the user’s query at hand and returns always the most up-to-date answer. This is particularly suited for dy- namic scenarios, where source documents change frequently and query answers cannot be computed on the basis of outdated materializations. In a nutshell, the query transformation process realized by the ‘QA Engine’ includes an ontology-based query rewriting phase, and a further reformulation step that uses extraction assertions to transform the query over the ontology into a set of extractors to be executed over the text documents. The complete workflow is illustrated in Fig. 2. Domenico Lembo, Yunyao Li Lucian Popa, Kun Qian, and Federico Maria Scafoglieri Fig. 3: User Interface Demonstration We demonstrate M ASTRO S YSTEM -T in a real world financial domain. The Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) is a public platform where companies acting in the U.S. are required by law to enter a range of information for government controls. EGARD is mainly composed by a large amount of raw text subject to significant updates over time. Since human effort is not sufficient to process this amount of data, there is the need for a mechanism that can automate the extraction phase by always providing the most up-to-date information and allowing data sharing and standardization. To prove the effectiveness of M ASTRO S YSTEM -T in this context, we have created an ad-hoc ontology around the concept of company and a set of extraction assertions, and we have selected a set of text documents from EDGARD concerning the top five fortune companies. We then issued a set of queries, and processed them with and without the reasoning activated, in order to highlight its role in the extraction phase. To deactivate the reasoning we simply ask the system to skip the ontology-based query rewriting phase, which actually means that it ignores all ontology axioms. With respect to the tests that we have carried out, the reasoning mainly impacts on the recall. This is due to the fact that the compilation of the ontology inside the query leads to use a set of extractors that otherwise wouldn’t have been triggered. As an example, in Table 1 we report the values of precision, recall and f-measure of the query that requires all companies, i.e., SELECT ?X WHERE {?X a :Company}. Conclusions Our preliminary tests show that reasoning over the ontology through M ASTRO S YSTEM -T may improve the quality of certain extractions. We have also shown how Ontology Mediated Information Extraction with M ASTRO S YSTEM -T Without Reasoning With Reasoning Gap Precision 81.82% 82.71% +0.89% Recall 66.8% 76.26% +9.46% F-Measure 73.59% 79.35% +5.76% Table 1: Company query results in our system data can be extracted at query-time, i.e., without having to materialize in advance all instances of the ontology, which always guarantees up-to-date answers. We are currently working to incorporate in M ASTRO S YSTEM -T additional capabilities, e.g., to support entity linking, and reduce the design effort required for the specification of extraction assertions. References 1. L. Chiticariu, R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, and S. Vaithyanathan. Systemt: An algebraic approach to declarative information extraction. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 128–137, 2010. 2. H. Cunningham. Gate, a general architecture for text engineering. Comput. Humanit., 36(2):223–254, 2002. 3. G. De Giacomo, D. Lembo, M. Lenzerini, A. Poggi, R. Rosati, M. Ruzzi, and D. F. Savo. MASTRO: A reasoner for effective ontology-based data access. In Proc. of the 1st Int. Workshop on OWL Reasoner Evaluation (ORE 2012), volume 858 of CEUR, 2012. 4. R. Fagin, B. Kimelfeld, F. Reiss, and S. Vansummeren. Document spanners: A formal ap- proach to information extraction. J. of the ACM, 62(2):12, 2015. 5. R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 541–550, 2011. 6. D. Jurafsky and J. H. Martin. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition. Pren- tice Hall, Pearson Education International, 2009. 7. J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th Int. Conf. on Machine Learning (ICML), pages 282–289, 2001. 8. D. Lembo, Y. Li, L. Popa, and F. M. Scafoglieri. Ontology mediated information extraction in financial domain with mastro system-t. In D. Burdick and J. Pujara, editors, Proc. of the 6th Int. ACM Workshop on Data Science for Macro-Modeling, (DSMM 2020), pages 3:1–3:6. ACM, 2020. 9. D. Lembo and F. M. Scafoglieri. Ontology-based document spanning systems for informa- tion extraction. Int. J. Semantic Comput., 14(1):3–26, 2020. 10. H. Saggion, A. Funk, D. Maynard, and K. Bontcheva. Ontology-based information extraction for business intelligence. In Proc. of the 6th Int. Semantic Web Conference, and the 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007), pages 843–856, 2007. 11. S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008. 12. D. C. Wimalasuriya and D. Dou. Ontology-based information extraction: An introduction and a survey of current approaches. Information Sciences, 36(3):306–323, 2010. 13. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Za- kharyaschev. Ontology-based data access: A survey. In Proc. of the 27th Int. Joint Conf. on Artificial Intelligence, (IJCAI 2018), pages 5511–5519, 2018.