<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology Mediated Information Extraction with MASTRO SYSTEM-T</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico Lembo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yunyao Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucian Popa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kun Qian</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Maria Scafoglieri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dip. di Ingegneria Informatica, Automatica e Gestionale, Sapienza Università di Roma</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Almaden Research Center</institution>
          ,
          <addr-line>San Jose, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In several data-centric application domains, the need arises to extract valuable information from unstructured text documents. The recent paradigm of Ontology Mediated Information Extraction (OMIE) addresses this problem by taking into account the knowledge expressed by a domain ontology and reasoning over it to improve the quality of the extracted data. MASTRO SYSTEM-T is a novel tool for OMIE, developed by Sapienza University and IBM Almaden Research. In this work, we demonstrate its usage for information extraction over real-world financial text documents from the U.S. EDGAR system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One of the basic problems of the data-centric information era is the processing of huge
amounts of unstructured data. If the information they contain is to be automatically
manipulated and analyzed, it must first be rearranged into a structured form in which the
relevant “facts” can be easily accessed.</p>
      <p>
        Information Extraction (IE) addresses this problem. It refers to the task
of automatically organizing gathered data into a structured representation, typically a
spreadsheet or a database [
        <xref ref-type="bibr" rid="ref11 ref4 ref6">11, 6, 4</xref>
        ]. Various statistical, rule-based, and learning-based
approaches to IE have been proposed over the years, leveraging techniques from
NLP, machine learning, computational linguistics, databases, and knowledge
representation (see, e.g., [
        <xref ref-type="bibr" rid="ref1 ref2 ref5 ref7">7, 2, 5, 1</xref>
        ]). In this frame of reference, ontologies, which provide formal
and explicit specifications of conceptualizations, have been recognized to play an
important role in IE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, although ontology-based IE has so far been the subject
of several investigations [
        <xref ref-type="bibr" rid="ref10 ref12">12, 10</xref>
        ], how to exploit the reasoning abilities offered by an
ontology to improve the extraction process has not yet been specifically studied.
      </p>
      <p>
        Ontology Mediated Information Extraction (OMIE) [
        <xref ref-type="bibr" rid="ref8 ref9">9, 8</xref>
        ] is a new paradigm for IE
that aims to fill this gap. It seeks to use the semantic knowledge expressed
in ontologies to improve query answering over unstructured data (specifically raw text).
In this work, we demonstrate MASTRO SYSTEM-T, a new OMIE system born from a
collaboration between the University of Rome “La Sapienza” and IBM Research
Almaden. In particular, after a brief presentation of the system architecture and its main
functionalities, we show an OMIE application involving a set of real-world financial
text documents from the U.S. Electronic Data Gathering, Analysis,
and Retrieval system (EDGAR). Notably, with MASTRO SYSTEM-T we are able
to extract data at query time, without having to materialize them in advance. We
discuss this feature together with some preliminary experiments that show how ontology
reasoning allows us to increase the quality of the extracted data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>System Overview</title>
      <p>
        The OMIE framework, on which MASTRO SYSTEM-T is based, is an adaptation of
the well-known framework of Ontology-Based Data Access (OBDA) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In an OBDA
system, an ontology is mapped to an external source database through declarative
mappings, which specify the semantic relationship between the ontology vocabulary and the
data (mainly relational) at the sources. In OMIE the data source is instead a repository
of unstructured text documents, which are “linked” to the ontology through so-called
extraction assertions.
      </p>
      <sec id="sec-2-2">
        <title>System T</title>
        <p>Fig. 1: MASTRO SYSTEM-T Architecture (figure labels: inputs: SPARQL query, ontology, extraction assertions, documents; MASTRO SYSTEM-T: Query Manager, Ontology Manager, Extraction Assertions Manager, SYSTEM-T Interface; SYSTEM-T: AQL Engine; GUI: Projects Hub, SPARQL Endpoint, Specification Manager)</p>
        <p>
          This connection between OBDA and OMIE is also reflected in the implementation
of our tool. MASTRO SYSTEM-T, whose architecture is shown in Fig. 1, is a specific
tuning of the OBDA engine MASTRO (<ext-link ext-link-type="uri" xlink:href="http://obdasystems.com/mastro">http://obdasystems.com/mastro</ext-link>) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] that interfaces it with SYSTEM-T [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ],
a commercial IE tool developed at IBM Almaden. The inputs to the system are:
        </p>
        <list list-type="bullet">
          <list-item>
            <p>An ontology, specified in any of the standard syntaxes for OWL 2. The ontology
is automatically approximated by MASTRO in the standard profile OWL 2 QL, to
guarantee tractability of query answering.</p>
          </list-item>
          <list-item>
            <p>A set of extraction assertions (EAs) of the form
<inline-formula><tex-math>E(\vec{x}) \leadsto P(\vec{x})</tex-math></inline-formula>, where
<inline-formula><tex-math>P</tex-math></inline-formula> is a predicate of the ontology,
<inline-formula><tex-math>E(\vec{x})</tex-math></inline-formula> is a rule-based extractor, and
<inline-formula><tex-math>\vec{x}</tex-math></inline-formula> are “frontier variables”
through which, intuitively, data extracted from the source documents instantiate the
ontology predicate <inline-formula><tex-math>P</tex-math></inline-formula> [
              <xref ref-type="bibr" rid="ref9">9</xref>
              ]. EAs are managed by the ‘Extraction Assertion Manager’
module. The extractors are specified in a declarative rule-based language and
can be combined with relational algebra operators. Specifically, they are
written in AQL, a concrete language used by SYSTEM-T, which is in charge of
processing them. In simple terms, SYSTEM-T evaluates extractors over a text and
produces a set of spans, i.e., pairs of indexes that identify substrings in the text, which
are used to construct the individuals that instantiate the ontology.</p>
          </list-item>
          <list-item>
            <p>A set of textual documents, which are managed by SYSTEM-T.</p>
          </list-item>
          <list-item>
            <p>The user’s queries, expressed in standard SPARQL, which are parsed and managed
by the ‘Query Manager’ module.</p>
          </list-item>
        </list>
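        <p>As an illustration of how spans and extraction assertions fit together, the following is a minimal Python sketch under stated assumptions. It is not AQL and does not use the SYSTEM-T API: the regular expression, the sample text, and the names Span, ExtractionAssertion, company_extractor, and instantiate are all hypothetical and introduced only for this example.</p>

```python
import re
from typing import Callable, List, NamedTuple, Tuple


class Span(NamedTuple):
    """A pair of indexes identifying a substring of a document."""
    begin: int
    end: int


# Hypothetical extractor: a plain regular expression standing in for an AQL rule.
COMPANY_PATTERN = re.compile(r"\b[A-Z][A-Za-z]+ (?:Inc|Corp)\b")


def company_extractor(text: str) -> List[Span]:
    """Return one span per match of the (illustrative) company pattern."""
    return [Span(m.start(), m.end()) for m in COMPANY_PATTERN.finditer(text)]


class ExtractionAssertion(NamedTuple):
    """Binds an extractor to an ontology predicate (here, a unary one)."""
    extractor: Callable[[str], List[Span]]
    predicate: str


def instantiate(assertion: ExtractionAssertion, text: str) -> List[Tuple[str, str]]:
    """Build (predicate, individual) facts from the spans the extractor returns."""
    return [
        (assertion.predicate, text[s.begin:s.end])
        for s in assertion.extractor(text)
    ]


# Made-up sample text, not taken from EDGAR.
doc = "Walmart Inc filed its annual report. Apple Inc filed a quarterly report."
ea = ExtractionAssertion(company_extractor, ":Company")
spans = company_extractor(doc)
facts = instantiate(ea, doc)
print(spans)
print(facts)
```

        <p>In this sketch each span is a (begin, end) pair, and the individual that instantiates the hypothetical predicate :Company is the substring that the span identifies.</p>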
        <p>Fig. 2: Query answering workflow (figure labels: SPARQL query, Query Rewriter, SPARQL to AQL, AQL query, SYSTEM-T, answer spans, Spans to Ontology, query answers, Q.A. Engine)</p>
        <p>Note that, following the principles of OBDA, the facts of the ontology are not
materialized in OMIE: they are virtually defined through the extraction assertions.</p>
        <p>
          The main reasoning service is Query Answering (QA), which is carried out through
query rewriting techniques adapted from those used in MASTRO, as described in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
MASTRO SYSTEM-T computes answers to the user’s SPARQL queries posed over the
ontology by transforming them into AQL extractors and delegating their execution
over a given document to SYSTEM-T. MASTRO SYSTEM-T
triggers only the extraction assertions needed to answer the user’s query
at hand and always returns the most up-to-date answer. This is particularly suited to
dynamic scenarios, where source documents change frequently and query answers cannot
be computed on the basis of outdated materializations.
        </p>
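        <p>The difference between answering at query time and answering from a materialization can be sketched as follows. This is a toy Python illustration, not the behaviour of MASTRO SYSTEM-T itself: the document repository, the sample texts, the regular expression standing in for an AQL extractor, and the function answer_at_query_time are all hypothetical.</p>

```python
import re
from typing import Dict, List

# Hypothetical document repository: document id -> current text.
documents: Dict[str, str] = {"filing-1": "Acme Corp reported revenue."}

# Illustrative extractor; a regular expression stands in for an AQL rule.
COMPANY = re.compile(r"\b[A-Z][a-z]+ Corp\b")


def answer_at_query_time() -> List[str]:
    """Run the extractor over the documents as they are when the query arrives."""
    return sorted(
        m.group(0)
        for text in documents.values()
        for m in COMPANY.finditer(text)
    )


# A materialized snapshot is computed once, ahead of time.
materialized = answer_at_query_time()

# The source document changes after the snapshot was taken.
documents["filing-1"] = "Acme Corp and Globex Corp reported revenue."

# Extracting at query time sees the updated text; the snapshot does not.
fresh = answer_at_query_time()
print(materialized)
print(fresh)
```

        <p>Here the snapshot no longer reflects the updated document, whereas the answer computed when the query is posed does.</p>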
        <p>In a nutshell, the query transformation process realized by the ‘QA Engine’ includes
an ontology-based query rewriting phase, and a further reformulation step that uses
extraction assertions to transform the query over the ontology into a set of extractors to
be executed over the text documents. The complete workflow is illustrated in Fig. 2.</p>
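        <p>The two phases described above can be sketched in Python as follows. This is a deliberately simplified illustration, not the rewriting algorithm of MASTRO: it assumes queries that ask for the instances of a single class and an ontology made only of subclass axioms, and every class name, extractor name, and function in it is hypothetical.</p>

```python
from typing import Dict, List, Set

# Hypothetical ontology: "key is a subclass of each value".
SUBCLASS_OF: Dict[str, List[str]] = {
    ":PublicCompany": [":Company"],
    ":Bank": [":Company"],
}

# Hypothetical extraction assertions: class -> extractors that populate it.
EXTRACTION_ASSERTIONS: Dict[str, List[str]] = {
    ":Company": ["CompanyNameExtractor"],
    ":PublicCompany": ["TickerSymbolExtractor"],
    ":Bank": ["BankNameExtractor"],
}


def rewrite(query_class: str, reasoning: bool = True) -> Set[str]:
    """Phase 1: expand the queried class with all its (transitive) subclasses."""
    rewritten = {query_class}
    if not reasoning:
        return rewritten  # ontology axioms are ignored
    changed = True
    while changed:
        changed = False
        for sub, supers in SUBCLASS_OF.items():
            if sub not in rewritten and any(s in rewritten for s in supers):
                rewritten.add(sub)
                changed = True
    return rewritten


def unfold(classes: Set[str]) -> Set[str]:
    """Phase 2: collect only the extractors linked to the rewritten classes."""
    return {e for c in classes for e in EXTRACTION_ASSERTIONS.get(c, [])}


with_reasoning = unfold(rewrite(":Company"))
without_reasoning = unfold(rewrite(":Company", reasoning=False))
print(sorted(with_reasoning))
print(sorted(without_reasoning))
```

        <p>In this sketch, skipping the rewriting phase leaves only the extractor directly linked to the queried class, while rewriting also selects the extractors linked to its subclasses.</p>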
        <p>Fig. 3: User Interface</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Demonstration</title>
      <p>We demonstrate MASTRO SYSTEM-T in a real-world financial domain. The Electronic
Data Gathering, Analysis, and Retrieval system (EDGAR) is a public platform where
companies operating in the U.S. are required by law to enter a range of information for
government controls. EDGAR mainly consists of a large amount of raw text that is subject
to significant updates over time. Since human effort is not sufficient to process this
amount of data, there is a need for a mechanism that automates the extraction phase,
always provides the most up-to-date information, and allows data sharing and
standardization. To prove the effectiveness of MASTRO SYSTEM-T in this context, we
created an ad hoc ontology around the concept of company and a set of extraction
assertions, and we selected a set of text documents from EDGAR concerning the
top five Fortune companies. We then issued a set of queries and processed them with
and without reasoning activated, in order to highlight its role in the extraction phase.
To deactivate reasoning, we simply ask the system to skip the ontology-based query
rewriting phase, which means that it ignores all ontology axioms. In
the tests that we have carried out, reasoning mainly affects recall. This is
because compiling the ontology into the query leads to the use of a set
of extractors that otherwise would not have been triggered. As an example, in Table 1
we report the precision, recall, and f-measure of the query that asks for all
companies, i.e., SELECT ?X WHERE { ?X a :Company }.</p>
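      <p>For reference, the three measures can be computed as in the following Python sketch. The function name and the toy answer sets are hypothetical and chosen only to show the computation; they are not the values reported in Table 1.</p>

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Compare a set of extracted answers against a gold-standard set."""
    true_positives = len(extracted.intersection(gold))
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: placeholder identifiers, not real extraction results.
gold = {"A", "B", "C", "D"}
answers_without_reasoning = {"A", "B"}
answers_with_reasoning = {"A", "B", "C", "E"}

scores_without = precision_recall_f1(answers_without_reasoning, gold)
scores_with = precision_recall_f1(answers_with_reasoning, gold)
print(scores_without)
print(scores_with)
```

      <p>In this toy setting the additional answers raise recall from 0.5 to 0.75, at the cost of one incorrect answer that lowers precision.</p>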
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>Our preliminary tests show that reasoning over the ontology through MASTRO
SYSTEM-T may improve the quality of certain extractions. We have also shown how,
in our system, data can be extracted at query time, i.e., without having to materialize in
advance all instances of the ontology, which always guarantees up-to-date answers. We
are currently working to incorporate additional capabilities in MASTRO SYSTEM-T,
e.g., to support entity linking and to reduce the design effort required for the specification
of extraction assertions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnamurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Reiss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaithyanathan</surname>
          </string-name>
          .
          <article-title>SystemT: An algebraic approach to declarative information extraction</article-title>
          .
          <source>In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pages
          <fpage>128</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>H.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>GATE, a general architecture for text engineering</article-title>
          .
          <source>Comput. Humanit.</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>254</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ruzzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Savo</surname>
          </string-name>
          .
          <article-title>MASTRO: A reasoner for effective ontology-based data access</article-title>
          .
          <source>In Proc. of the 1st Int. Workshop on OWL Reasoner Evaluation (ORE 2012)</source>
          , volume
          <volume>858</volume>
          of CEUR,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kimelfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Reiss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vansummeren</surname>
          </string-name>
          .
          <article-title>Document spanners: A formal approach to information extraction</article-title>
          .
          <source>J. of the ACM</source>
          ,
          <volume>62</volume>
          (
          <issue>2</issue>
          ):
          <fpage>12</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <article-title>Knowledge-based weak supervision for information extraction of overlapping relations</article-title>
          .
          <source>In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pages
          <fpage>541</fpage>
          -
          <lpage>550</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition</article-title>
          . Prentice Hall, Pearson Education International,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proc. of the 18th Int. Conf. on Machine Learning (ICML)</source>
          , pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Scafoglieri</surname>
          </string-name>
          .
          <article-title>Ontology mediated information extraction in financial domain with MASTRO SYSTEM-T</article-title>
          . In D. Burdick and J. Pujara, editors,
          <source>Proc. of the 6th Int. ACM Workshop on Data Science for Macro-Modeling (DSMM 2020)</source>
          , pages
          <fpage>3:1</fpage>
          -
          <lpage>3:6</lpage>
          . ACM,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Scafoglieri</surname>
          </string-name>
          .
          <article-title>Ontology-based document spanning systems for information extraction</article-title>
          .
          <source>Int. J. Semantic Comput.</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Funk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          .
          <article-title>Ontology-based information extraction for business intelligence</article-title>
          .
          <source>In Proc. of the 6th Int. Semantic Web Conference, and the 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007)</source>
          , pages
          <fpage>843</fpage>
          -
          <lpage>856</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          .
          <article-title>Information extraction</article-title>
          .
          <source>Foundations and Trends in Databases</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>261</fpage>
          -
          <lpage>377</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Wimalasuriya</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
          .
          <article-title>Ontology-based information extraction: An introduction and a survey of current approaches</article-title>
          .
          <source>J. of Information Science</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>306</fpage>
          -
          <lpage>323</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kontchakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zakharyaschev</surname>
          </string-name>
          .
          <article-title>Ontology-based data access: A survey</article-title>
          .
          <source>In Proc. of the 27th Int. Joint Conf. on Artificial Intelligence (IJCAI 2018)</source>
          , pages
          <fpage>5511</fpage>
          -
          <lpage>5519</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>