<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Handling uncertainty in information extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maurice van Keulen</string-name>
          <email>m.vankeulen@utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mena B. Habib</string-name>
          <email>m.badiehhabibmorgan@utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Twente, Faculty of EEMCS</institution>
          ,
          <addr-line>Enschede</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>(Too) much data is still inaccessible for data processing, because it is
(Too) much data is still inaccessible for data processing, because it is
unstructured, textually embedded in documents, webpages, or text fields in databases.
Information extraction (IE) is a technology capable of extracting entities, facts,
and relations. IE helps to turn the web into a real ‘web of data’ [BHBL09].</p>
      <p>In the Neogeography-project [HvK11b], we focus on named entity extraction
(NEE) from database text fields and short messages. NEE typically consists of
phases like recognition (which phrases are named entities), matching and
enrichment (lookups in reference databases and dictionaries possibly adding
information), and disambiguation (to which real-world object does a phrase refer).</p>
      <p>Because natural language is highly ambiguous and computers are still
incapable of ‘real’ semantic understanding, NEE (and IE in general) is a highly
imperfect process. For example, it is ambiguous how to interpret the word “Paris”:
it could be a first name, a city, etc. Even when resolved to a city, a lookup in
GeoNames1 shows that there are numerous other places called “Paris” besides
the capital of France. In [HvK11a], we found that around 46% of toponyms2 have
two or more, 35% three or more, and 29% four or more references in GeoNames.</p>
      <p>Although probabilistic and fuzzy techniques abound, some aspects
often remain absolute: an extraction rule either recognizes and annotates a phrase
or it does not, only the top item of a ranking is passed to the next phase, etc. We envision
an approach that fundamentally treats annotations and extracted information
as uncertain throughout the process. We humans happily deal with doubt and
misinterpretation every day; why shouldn’t computers?</p>
      <p>We envision developing information extractors ‘Sherlock Holmes style’ —
“when you have eliminated the impossible, whatever remains, however
improbable, must be the truth” — by adopting the principles and requirements below.
– Annotations are uncertain, hence we process both the annotations and the
information about the uncertainty surrounding them.
1 http://www.geonames.org
2 A toponym is any name that refers to a location including, e.g., names of buildings.</p>
      <p>[Fig. 1(a): the example sentence “Paris Hilton stayed in the Paris Hilton” with candidate annotations a1–a9]</p>
      <p>An annotation a = (b, e, τ) declares a phrase ϕ from position b to position e to be interpreted
as entity type τ . For example, a8 in Fig. 1(a) declares ϕ = “Paris Hilton” from
b = 1 to e = 2 to be interpreted as type τ = Person. An interpretation I = (A, U )
of a sentence s consists of an annotation set A and a structure U representing the
uncertainty among the annotations. In the sequel, we discuss what U should be,
but for now view it as a set of random variables (RVs) R with their dependencies.</p>
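      <p>To make the model concrete, here is a minimal sketch in Python (our own illustrative code; the class and variable names are not from the paper):</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """An annotation a = (b, e, tau): the phrase from token b to token e
    (1-based, inclusive, as in the paper's example) has entity type tau."""
    b: int
    e: int
    tau: str

tokens = "Paris Hilton stayed in the Paris Hilton".split()

# a8 from Fig. 1(a): "Paris Hilton" (tokens 1..2) interpreted as a Person.
a8 = Annotation(b=1, e=2, tau="Person")
phrase = " ".join(tokens[a8.b - 1 : a8.e])  # the phrase the annotation covers
```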
      <p>Rather unconventionally, we don’t start with an empty A, but with a ‘no
knowledge’ point-of-view where any phrase can have any interpretation. So our
initial A is {a | a = (b, e, τ) ∧ τ ∈ T ∧ ϕ from b to e is a phrase of s}, where T is the
set of possible types.</p>
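      <p>A sketch of this ‘no knowledge’ starting point (illustrative code, assuming phrases are contiguous token spans): for the 7-token sentence of Fig. 1(a) and three types it indeed yields 28 · 3 = 84 annotations.</p>

```python
from itertools import product

def initial_annotations(tokens, types, max_len=None):
    """'No knowledge' start: every contiguous phrase gets every possible type."""
    k = len(tokens)
    l = max_len if max_len is not None else k
    spans = [(b, e) for b in range(1, k + 1)
                    for e in range(b, min(b + l - 1, k) + 1)]
    return [(b, e, tau) for (b, e), tau in product(spans, types)]

tokens = "Paris Hilton stayed in the Paris Hilton".split()
A = initial_annotations(tokens, ["Person", "Toponym", "City"])
# 7 tokens give 7*8/2 = 28 phrases; with 3 types this is 84 annotations.
```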
      <p>With T finite, A is also finite. More importantly, |A| = O(k·l·t) where k = |s|
is the length of s, l is the maximum length of phrases considered, and t = |T|.
Hence, A grows linearly with each of them. In the example of Fig. 1(a), T =
{Person, Toponym, City} and we have 28 · |T| = 84 annotations. Even though we
envision a more ingenious implementation, no probabilistic database would be
severely challenged by a complete annotation set for a typical text field.</p>
    </sec>
    <sec id="sec-2">
      <title>Knowledge application is conditioning</title>
      <p>We explain how to ‘apply knowledge’ in our approach by means of the example
of Fig. 1, i.e., with our A of 84 (possible) annotations and an ontology only
containing Person, Toponym, and City. Suppose we want to add the knowledge
Person —dnc— City. The effect should be the removal of some annotations and
an adjustment of the probabilities of the remaining ones.</p>
      <p>An initially promising idea is to store A in a probabilistic database such as
MayBMS [HAKO09]. In MayBMS, each tuple is associated with a world set
descriptor (wsd) containing a set of RV assignments from a world set table (see
Fig. 2). The RVs are assumed to be independent. For example, the 3rd annotation
tuple only exists when x18 = 1, which is the case with a probability of 0.8. Each
annotation can be seen as a probabilistic event, and in our starting point all these
events are independent. Hence, we can store A by associating each annotation
tuple ai with one boolean RV xi. Consequently, the database size is linear in |A|.</p>
      <p>[Fig. 2: initial annotation set stored in a probabilistic database (MayBMS-style)]</p>
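      <p>A minimal sketch of this storage scheme (our own code, MayBMS-style in spirit only; P(x21 = 1) = 0.6 is chosen consistent with the running example): with independent RVs, the existence probability of a tuple is the product of the probabilities of its wsd assignments.</p>

```python
# World set table: (RV, value) -> probability; RVs are independent booleans.
world_set = {
    ("x18", 1): 0.8, ("x18", 0): 0.2,  # a18: "Paris Hilton" is a Person
    ("x21", 1): 0.6, ("x21", 0): 0.4,  # a21: "Paris" is a City
}

# World set descriptors: each annotation tuple exists in the worlds its wsd describes.
wsd = {
    "a18": {"x18": 1},
    "a21": {"x21": 1},
}

def tuple_probability(descriptor):
    """Existence probability of a tuple: product over its independent RV assignments."""
    p = 1.0
    for rv, value in descriptor.items():
        p *= world_set[(rv, value)]
    return p
```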
      <p>Adding knowledge such as Person —dnc— City means that certain RVs become
dependent and that certain combinations of RV assignments become impossible.
Let us focus on two individual annotations a21 (“Paris” is a City) and a18 (“Paris
Hilton” is a Person). These two annotations become mutually exclusive. The
process of adjusting the probabilities is called conditioning [KO08]. It boils down
to redistributing the remaining probability mass. Fig. 3 illustrates this for a = a21
and b = a18. The now-impossible case a ∧ b carries a probability mass of 0.48, so the
remaining probability mass is 1 − 0.48 = 0.52. Hence, the remaining cases are
rescaled: P(¬a ∧ ¬b) = 0.08/0.52 ≈ 0.15, P(a ∧ ¬b) = 0.12/0.52 ≈ 0.23, and
P(¬a ∧ b) = 0.32/0.52 ≈ 0.62.
A first attempt is to replace x21 and x18 with one fresh three-valued RV x0 with
the probabilities just calculated, i.e., wsd(a21) = {x0 = 1} and wsd(a18) = {x0 = 2}
with P(x0 = 0) = 0.15, P(x0 = 1) = 0.23, and P(x0 = 2) = 0.62. Unfortunately,
since annotations massively overlap, we face a combinatorial explosion: for this
rule alone, we may end up with one RV with up to 2^(2·28) = 2^56 ≈ 7 · 10^16 cases.</p>
      <p><bold>Solution directions</bold> What we are looking for in this paper is a structure that is
expressive enough to capture all dependencies between RVs and at the same time
allows for scalable processing of conditioning operations. The work of [KO08]
represents dependencies resulting from queries with a tree of RV assignments.
We are also investigating the shared correlations work of [SDG08].</p>
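      <p>The conditioning computation can be checked numerically (a sketch; the marginals P(a) = 0.6 and P(b) = 0.8 are taken from the running example):</p>

```python
# Independent boolean RVs a (a21: "Paris" is a City) and b (a18: "Paris Hilton"
# is a Person) with the marginals of the running example.
pa, pb = 0.6, 0.8
joint = {
    (av, bv): (pa if av else 1.0 - pa) * (pb if bv else 1.0 - pb)
    for av in (0, 1) for bv in (0, 1)
}

# Person -dnc- City makes a and b mutually exclusive: the world where both hold
# becomes impossible; conditioning redistributes the remaining mass 1 - 0.48 = 0.52.
del joint[(1, 1)]
mass = sum(joint.values())
cond = {world: p / mass for world, p in joint.items()}

# The three remaining worlds can be encoded as one fresh three-valued RV x0.
x0 = {0: cond[(0, 0)], 1: cond[(1, 0)], 2: cond[(0, 1)]}
```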
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>We envision an approach where information extractors are developed based on
an ontology definition process for knowledge about possible (in)correctness of
annotations. Main properties are treating annotations as fundamentally
uncertain and interactive addition of knowledge starting from a ‘no knowledge hence
everything is possible’ situation. The feasibility of the approach hinges on
efficient storage and conditioning of probabilistic dependencies. We discuss this very
problem, argue that a trivial approach doesn’t work, and propose two solution
directions: the conditioning approach of MayBMS and the shared correlations
work of Getoor et al.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [BHBL09]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data: The story so far</article-title>
          .
          <source>Int'l J. on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [HAKO09]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          .
          <article-title>MayBMS: a probabilistic database management system</article-title>
          .
          <source>In Proc. of the 35th SIGMOD Int'l Conf. on Management Of Data</source>
          , Providence, Rhode Island, pages
          <fpage>1071</fpage>
          -
          <lpage>1074</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [HvK11a]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Habib</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          .
          <article-title>Named entity extraction and disambiguation: The reinforcement effect</article-title>
          .
          <source>In Proc. of the 5th Int'l Workshop on Management of Uncertain Data (MUD)</source>
          ,
          Seattle, 29 Aug
          , pages
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [HvK11b]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Habib</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          .
          <article-title>Neogeography: The challenge of channelling large and ill-behaved data streams</article-title>
          .
          <source>Technical Report TR-CTIT-11-08</source>
          , Enschede,
          <year>2011</year>
          . ISSN 1381-3625, http://eprints.eemcs.utwente.nl/19854.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [KO08]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          .
          <article-title>Conditioning probabilistic databases</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>313</fpage>
          -
          <lpage>325</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [SDG08]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          .
          <article-title>Exploiting shared correlations in probabilistic databases</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>809</fpage>
          -
          <lpage>820</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [vKdK09]
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>de Keijzer</surname>
          </string-name>
          .
          <article-title>Qualitative effects of knowledge rules and user feedback in probabilistic data integration</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1191</fpage>
          -
          <lpage>1217</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>