<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Handling uncertainty in information extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maurice van Keulen</string-name>
          <email>m.vankeulen@utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mena B. Habib</string-name>
          <email>m.badiehhabibmorgan@utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Twente, Faculty of EEMCS</institution>
          ,
          <addr-line>Enschede</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>(Too) much data is still inaccessible for data processing, because it is
(Too) much data is still inaccessible for data processing, because it is
unstructured, textually embedded in documents, webpages, or text fields in databases.
Information extraction (IE) is a technology capable of extracting entities, facts,
and relations. IE helps to turn the web into a real ‘web of data’ [BHBL09].</p>
      <p>In the Neogeography-project [HvK11b], we focus on named entity extraction
(NEE) from database text fields and short messages. NEE typically consists of
phases like recognition (which phrases are named entities), matching and
enrichment (lookups in reference databases and dictionaries possibly adding
information), and disambiguation (to which real-world object does a phrase refer).</p>
      <p>Because natural language is highly ambiguous and computers are still
incapable of ‘real’ semantic understanding, NEE (and IE in general) is a highly
imperfect process. For example, it is ambiguous how to interpret the word “Paris”:
it could be a first name, a city, etc. Even when resolved to a city, a lookup in
GeoNames1 shows that there are numerous other places called “Paris” besides
the capital of France. In [HvK11a], we found that around 46% of toponyms2 have
two or more, 35% three or more, and 29% four or more references in GeoNames.</p>
      <p>Although probabilistic and fuzzy techniques abound, some aspects
often remain absolute: an extraction rule either recognizes and annotates a phrase
or it does not, only the top item of a ranking is passed to the next phase, etc. We envision
an approach that fundamentally treats annotations and extracted information
as uncertain throughout the process. We humans happily deal with doubt and
misinterpretation every day; why shouldn’t computers?</p>
      <p>We envision developing information extractors ‘Sherlock Holmes style’ —
“when you have eliminated the impossible, whatever remains, however
improbable, must be the truth” — by adopting the principles and requirements below.
– Annotations are uncertain, hence we process both the annotations and the
information about the uncertainty surrounding them.
1 http://www.geonames.org
2 A toponym is any name that refers to a location including, e.g., names of buildings.</p>
      <p>[Fig. 1(a): the example sentence “Paris Hilton stayed in the Paris Hilton” with candidate annotations a1–a9]</p>
      <p>An annotation a = (b, e, τ) declares a phrase ϕ from position b to position e to be interpreted
as entity type τ . For example, a8 in Fig. 1(a) declares ϕ = “Paris Hilton” from
b = 1 to e = 2 to be interpreted as type τ = Person. An interpretation I = (A, U )
of a sentence s consists of an annotation set A and a structure U representing the
uncertainty among the annotations. In the sequel, we discuss what U should be,
but for now view it as a set of random variables (RVs) R with their dependencies.</p>
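      <p>To make the model concrete, here is a minimal sketch in Python (our own illustrative code; the class and variable names are not from the paper):</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """An annotation a = (b, e, tau): the phrase from token b to token e
    (1-based, inclusive, as in the paper's example) has entity type tau."""
    b: int
    e: int
    tau: str

tokens = "Paris Hilton stayed in the Paris Hilton".split()

# a8 from Fig. 1(a): "Paris Hilton" (tokens 1..2) interpreted as a Person.
a8 = Annotation(b=1, e=2, tau="Person")
phrase = " ".join(tokens[a8.b - 1 : a8.e])  # the phrase the annotation covers
```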
      <p>Rather unconventionally, we don’t start with an empty A, but with a ‘no
knowledge’ point-of-view where any phrase can have any interpretation. So our
initial A is {a | a = (b, e, τ) ∧ τ ∈ T ∧ ϕ from b to e is a phrase of s}, where T is the
set of possible types.</p>
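      <p>A sketch of this ‘no knowledge’ starting point (illustrative code, assuming phrases are contiguous token spans): for the 7-token sentence of Fig. 1(a) and three types it indeed yields 28 · 3 = 84 annotations.</p>

```python
from itertools import product

def initial_annotations(tokens, types, max_len=None):
    """'No knowledge' start: every contiguous phrase gets every possible type."""
    k = len(tokens)
    l = max_len if max_len is not None else k
    spans = [(b, e) for b in range(1, k + 1)
                    for e in range(b, min(b + l - 1, k) + 1)]
    return [(b, e, tau) for (b, e), tau in product(spans, types)]

tokens = "Paris Hilton stayed in the Paris Hilton".split()
A = initial_annotations(tokens, ["Person", "Toponym", "City"])
# 7 tokens give 7*8/2 = 28 phrases; with 3 types this is 84 annotations.
```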
      <p>With T finite, A is also finite. More importantly, |A| = O(k·l·t) where k = |s|
is the length of s, l is the maximum length of phrases considered, and t = |T|.
Hence, A grows linearly with each of them. In the example of Fig. 1(a), T =
{Person, Toponym, City} and we have 28 · |T| = 84 annotations. Even though we
envision a more ingenious implementation, no probabilistic database would be
severely challenged by a complete annotation set for a typical text field.</p>
    </sec>
    <sec id="sec-2">
      <title>Knowledge application is conditioning</title>
      <p>We explain how to ‘apply knowledge’ in our approach by means of the example
of Fig. 1, i.e., with our A of 84 (possible) annotations and an ontology only
containing Person, Toponym, and City. Suppose we want to add the knowledge
Person —dnc— City. The effect should be the removal of some annotations and
an adjustment of the probabilities of the remaining ones.</p>
      <p>An initially promising idea is to store A in a probabilistic database such as
MayBMS [HAKO09]. In MayBMS, each tuple is associated with a world set
descriptor (wsd) containing a set of RV assignments from a world set table (see
Fig. 2). The RVs are assumed to be independent. For example, the 3rd annotation
tuple only exists when x18 = 1, which is the case with a probability of 0.8. Each
annotation can be seen as a probabilistic event, and in our starting point all these
events are independent. Hence, we can store A by associating each annotation
tuple ai with one boolean RV xi. Consequently, the database size is linear in |A|.</p>
      <p>[Fig. 2: initial annotation set stored in a probabilistic database (MayBMS-style)]</p>
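      <p>A minimal sketch of this storage scheme (our own code, MayBMS-style in spirit only; P(x21 = 1) = 0.6 is chosen consistent with the running example): with independent RVs, the existence probability of a tuple is the product of the probabilities of its wsd assignments.</p>

```python
# World set table: (RV, value) -> probability; RVs are independent booleans.
world_set = {
    ("x18", 1): 0.8, ("x18", 0): 0.2,  # a18: "Paris Hilton" is a Person
    ("x21", 1): 0.6, ("x21", 0): 0.4,  # a21: "Paris" is a City
}

# World set descriptors: each annotation tuple exists in the worlds its wsd describes.
wsd = {
    "a18": {"x18": 1},
    "a21": {"x21": 1},
}

def tuple_probability(descriptor):
    """Existence probability of a tuple: product over its independent RV assignments."""
    p = 1.0
    for rv, value in descriptor.items():
        p *= world_set[(rv, value)]
    return p
```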
      <p>Adding knowledge such as Person —dnc— City means that certain RVs become
dependent and that certain combinations of RV assignments become impossible.
Let us focus on two individual annotations a21 (“Paris” is a City) and a18 (“Paris
Hilton” is a Person). These two annotations become mutually exclusive. The
process of adjusting the probabilities is called conditioning [KO08]. It boils down
to redistributing the remaining probability mass. Fig. 3 illustrates this for a = a21
and b = a18. The now-impossible case a ∧ b carries a probability mass of 0.48, so the
remaining probability mass is 1 − 0.48 = 0.52. Hence, the remaining cases are
rescaled: P(¬a ∧ ¬b) = 0.08/0.52 ≈ 0.15, P(a ∧ ¬b) = 0.12/0.52 ≈ 0.23, and
P(¬a ∧ b) = 0.32/0.52 ≈ 0.62.
A first attempt is to replace x21 and x18 with one fresh three-valued RV x0 with
the probabilities just calculated, i.e., wsd(a21) = {x0 = 1} and wsd(a18) = {x0 = 2}
with P(x0 = 0) = 0.15, P(x0 = 1) = 0.23, and P(x0 = 2) = 0.62. Unfortunately,
since annotations massively overlap, we face a combinatorial explosion: for this
rule alone, we may end up with one RV with up to 2^(2·28) = 2^56 ≈ 7 · 10^16 cases.</p>
      <p><bold>Solution directions</bold> What we are looking for in this paper is a structure that is
expressive enough to capture all dependencies between RVs and at the same time
allows for scalable processing of conditioning operations. The work of [KO08]
represents dependencies resulting from queries with a tree of RV assignments.
We are also investigating the shared correlations work of [SDG08].</p>
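      <p>The conditioning computation can be checked numerically (a sketch; the marginals P(a) = 0.6 and P(b) = 0.8 are taken from the running example):</p>

```python
# Independent boolean RVs a (a21: "Paris" is a City) and b (a18: "Paris Hilton"
# is a Person) with the marginals of the running example.
pa, pb = 0.6, 0.8
joint = {
    (av, bv): (pa if av else 1.0 - pa) * (pb if bv else 1.0 - pb)
    for av in (0, 1) for bv in (0, 1)
}

# Person -dnc- City makes a and b mutually exclusive: the world where both hold
# becomes impossible; conditioning redistributes the remaining mass 1 - 0.48 = 0.52.
del joint[(1, 1)]
mass = sum(joint.values())
cond = {world: p / mass for world, p in joint.items()}

# The three remaining worlds can be encoded as one fresh three-valued RV x0.
x0 = {0: cond[(0, 0)], 1: cond[(1, 0)], 2: cond[(0, 1)]}
```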
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>We envision an approach where information extractors are developed based on
an ontology definition process for knowledge about possible (in)correctness of
annotations. Main properties are treating annotations as fundamentally
uncertain and interactive addition of knowledge starting from a ‘no knowledge hence
everything is possible’ situation. The feasibility of the approach hinges on
efficient storage and conditioning of probabilistic dependencies. We discuss this very
problem, argue that a trivial approach doesn’t work, and propose two solution
directions: the conditioning approach of MayBMS and the shared correlations
work of Getoor et al.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [BHBL09]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data: The story so far</article-title>
          .
          <source>Int'l J. on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [HAKO09]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          .
          <article-title>MayBMS: a probabilistic database management system</article-title>
          .
          <source>In Proc. of the 35th SIGMOD Int'l Conf. on Management Of Data</source>
          , Providence, Rhode Island, pages
          <fpage>1071</fpage>
          -
          <lpage>1074</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [HvK11a]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Habib</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          .
          <article-title>Named entity extraction and disambiguation: The reinforcement effect</article-title>
          .
          <source>In Proc. of the 5th Int'l Workshop on Management of Uncertain Data (MUD)</source>
          ,
          Seattle, 29 Aug
          , pages
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [HvK11b]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Habib</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          .
          <article-title>Neogeography: The challenge of channelling large and ill-behaved data streams</article-title>
          .
          <source>Technical Report TR-CTIT-11-08</source>
          , Enschede,
          <year>2011</year>
          . ISSN 1381-3625, http://eprints.eemcs.utwente.nl/19854.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [KO08]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          .
          <article-title>Conditioning probabilistic databases</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>313</fpage>
          -
          <lpage>325</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [SDG08]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          .
          <article-title>Exploiting shared correlations in probabilistic databases</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>809</fpage>
          -
          <lpage>820</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [vKdK09]
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>de Keijzer</surname>
          </string-name>
          .
          <article-title>Qualitative effects of knowledge rules and user feedback in probabilistic data integration</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1191</fpage>
          -
          <lpage>1217</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>