<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Coupling Ontologies with Document Spanners</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico Lembo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Maria Scafoglieri</string-name>
          <email>scafoglierig@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIAG</institution>
          ,
          <addr-line>Sapienza Univerista di Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Information extraction (IE) refers to the task of turning text documents into
a structured form, in order to make the information contained therein
automatically processable. Recently, Fagin et al. have carried out a foundational study
on rule-based IE, and introduced a formal framework based on the notion of
(document) spanner [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]. A spanner P is a function that maps a given string
(D) to a relation, i.e., a set of tuples, over its spans. A span is a pair of indices
that identify substrings of D. For example, given the string DL-2018 DL-2019,
the spans [4; 8i and [12; 16i identify the substrings 2018 and 2019, respectively.
Fagin et al. study possible representations of spanners and analyze how the use
of some algebraic operations on the relations extracted from strings in uences
the spanner expressiveness. In particular, they consider spanners de ned by
regular expressions with capture variables (a.k.a. \regex formulas ") and relational
algebra. Regex formulas di er from classical regular expressions since they allow
for mapping sub-matches of regular expressions, in the form of spans, to
variables. For example, in the regex formula xf[0 9]+g, x is a variable matching
to spans of nonempty strings consisting only of digits. Applied, for instance, to
the string DL-2018 DL-2019, it returns the (unary) relation f([4; 8i); ([12; 16i)g.
      </p>
      <p>
        In our paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we construct on the notion of document spanners and
propose a formal framework for coupling them with ontologies. Through this
framework we aim at structuring information extracted from text documents according
to the intensional knowledge expressed by the ontology. To this aim, we adapt
the well-known framework for Ontology-based Data Access (OBDA), in which
an ontology is mapped to an external source database through declarative
mappings, which specify the semantic relationship between the ontology vocabulary
and the data at the sources [
        <xref ref-type="bibr" rid="ref13 ref3">3,13</xref>
        ]. OBDA is a powerful paradigm for data access
and governance, for its ability to shift data management and integration at the
conceptual level. In OBDA, however, ontologies have been used so far only on
top of relational databases, with very few exceptions (as, e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). In our paper
we thus enrich OBDA with the capability of accessing unstructured information
contained in text documents.
      </p>
      <p>
        Our rst contribution is the notion of Ontology-based document spanning
(OBDS) system, in which an ontology is linked to text documents through
extraction assertions, which in OBDS act exactly as mapping assertions in OBDA.
In particular, an extraction assertion associates a document spanner P to a query
q speci ed over the ontology. In OBDSs queries over the ontology are conjunctive
queries (CQs), and document spanners are de ned as regex formulas extended
with the relational algebra operators union, projection, join and string selection
(the class of such spanners is studied in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and is denoted with [[RGXf[; ;1; =g]]).
      </p>
      <p>An OBDS E is thus a pair hT ; Ri, where T is a Description Logic (DL) TBox
and R is a set of extraction assertions of the form</p>
      <p>
        P (v1; : : : ; vn) ;
(v1; : : : ; vn)
where P (v1; : : : ; vn) is a document spanner in [[RGXf[; ;1; =g]], associated to
variables v1; : : : ; vn, and (v1; : : : ; vn) is a CQ with free variables v1; : : : ; vn.
Atoms of this CQ are built over v1; : : : ; vn, and possibly over other existential
variables and/or constants, and object terms denoting individuals \constructed"
from the spans returned by P when applied to a document (as in OBDA
mappings [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). We notice that extraction assertions we have de ned correspond to a
powerful form of GLAV mapping assertions [
        <xref ref-type="bibr" rid="ref5 ref9">9,5</xref>
        ]. We also say that assertions are
GAV, if (v1; : : : ; vn) does not contain existential variables. An interpretation
I is a model for E w.r.t. a document D if I is a model for T , and I satis es R
w.r.t. D. As in OBDA, we adopt a sound interpretation for extraction assertions,
according to which, intuitively, satisfying R w.r.t. D means that, for each
extraction assertion in R, (t) evaluates to true in I, for each tuple t = t1; : : : ; tn
of substrings corresponding to the spans returned by P , where (t) is the CQ
in which every vi is substituted with ti (see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for details).
      </p>
      <p>
        A second contribution of our paper is on query answering, i.e., how to
compute the certain answers to queries posed over the ontology of an OBDS system
E = hT ; Ri, i.e., those answers that hold in each model of E . We focus on CQs
and OBDS systems whose ontology is speci ed in the Description Logic (DL)
DL-LiteR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is well-known that in OBDA, when the ontology is in DL-LiteR
and mapping assertions are GLAV, CQ answering is rst-order rewritable, i.e.,
it can be reduced to the evaluation of a rst-order query over the
underlying database [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore, the rewriting has the same expressiveness of the
database queries used in the mapping. A natural question is now whether a
similar behaviour shows up also in DL-LiteR OBDS systems, i.e., whether we
can reduce query answering to the execution of a document spanner of the same
expressiveness of the spanners in R. We positively answer to this question, by
providing an algorithm that rewrites every CQ issued over a DL-LiteR OBDS
system into a spanner belonging to [[RGXf[; ;1; =g]]. To this aim, we adapt to
OBDS a technique from OBDA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In a nutshell: we rst split each extraction
assertion into a pair of GAV and LAV assertions; we then rewrite the query
according to the TBox by using, e.g., the PerfectRef algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; we then further
rewrite the query according to the \LAV part" of R, by means of an algorithm
for view-based query rewriting, like Minicon [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; nally we rewrite the query
with respect to the \GAV part" of R, essentially by unfolding the query
predicates with the spanners they are associated with. Since evaluating such spanners
is polynomial in the size of the input document, we can also conclude that CQ
answering in this setting is polynomial in data complexity.
      </p>
      <p>
        We nally note that there is a lot of previous work on use of ontologies in IE
(see, e.g., [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for a survey). To the best of our knowledge, our paper is the rst
one studying query answering over ontologies populated from text documents,
in the spirit of OBDA. Also, we believe that our OBDS framework may pave the
way for an in-depth investigation of the role of ontologies in IE.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Botoeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cogrel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rezk</surname>
          </string-name>
          , and
          <string-name>
            <surname>G. Xiao.</surname>
          </string-name>
          <article-title>OBDA beyond relational DBs: A study for MongoDB</article-title>
          .
          <source>In Proc. of the 29th Int. Workshop on Description Logic (DL)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          .
          <article-title>Tractable reasoning and e cient query answering in description logics: The DL-Lite family</article-title>
          .
          <source>J. of Automated Reasoning</source>
          ,
          <volume>39</volume>
          (
          <issue>3</issue>
          ):
          <volume>385</volume>
          {
          <fpage>429</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          .
          <article-title>Ontologybased data access and integration</article-title>
          .
          <source>In Encyclopedia of Database Systems, Second Edition</source>
          . Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ruzzi</surname>
          </string-name>
          .
          <article-title>Using OWL in data integration</article-title>
          .
          <source>In Semantic Web Information Management - A Model-Based Perspective</source>
          , pages
          <volume>397</volume>
          {
          <fpage>424</fpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z. G.</given-names>
            <surname>Ives</surname>
          </string-name>
          .
          <article-title>Principles of Data Integration</article-title>
          . Morgan Kaufmann,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kimelfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Reiss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vansummeren</surname>
          </string-name>
          .
          <article-title>Document spanners: A formal approach to information extraction</article-title>
          .
          <source>J. of the ACM</source>
          ,
          <volume>62</volume>
          (
          <issue>2</issue>
          ):
          <fpage>12</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kimelfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Reiss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vansummeren</surname>
          </string-name>
          .
          <article-title>Declarative cleaning of inconsistencies in information extraction</article-title>
          .
          <source>ACM Trans. on Database Systems</source>
          ,
          <volume>41</volume>
          (
          <issue>1</issue>
          ):6:
          <issue>1</issue>
          {6:
          <fpage>44</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Scafoglieri</surname>
          </string-name>
          .
          <article-title>A formal framework for coupling document spanners with ontologies</article-title>
          .
          <source>In Proc. of the 2nd IEEE Int. Conf. on Arti cial Intelligence and Knowledge Engineering (AIKE)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          .
          <article-title>Data integration: A theoretical perspective</article-title>
          .
          <source>In Proc. of the 21st ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS)</source>
          , pages
          <fpage>233</fpage>
          {
          <fpage>246</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          .
          <article-title>Linking data to ontologies</article-title>
          .
          <source>J. on Data Semantics</source>
          , X:
          <volume>133</volume>
          {
          <fpage>173</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>R.</given-names>
            <surname>Pottinger</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Minicon: A scalable algorithm for answering queries using views</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>10</volume>
          (
          <issue>2-3</issue>
          ):
          <volume>182</volume>
          {
          <fpage>198</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Wimalasuriya</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
          .
          <article-title>Ontology-based information extraction: An introduction and a survey of current approaches</article-title>
          .
          <source>Information Sciences</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <volume>306</volume>
          {
          <fpage>323</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. G. Xiao,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kontchakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zakharyaschev</surname>
          </string-name>
          .
          <article-title>Ontology-based data access: A survey</article-title>
          .
          <source>In Proc. of the 27th Int. Joint Conf. on Arti cial Intelligence (IJCAI)</source>
          , pages
          <fpage>5511</fpage>
          {
          <fpage>5519</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>