
Ontologies for Multilingual Extraction

Deryle W. Lonsdale (2)

David W. Embley (0), embley@cs.byu.edu

Stephen W. Liddle (1), liddle@byu.edu

(0) Computer Science, Brigham Young University
(1) Information Systems, Brigham Young University
(2) Linguistics & English Lang., Brigham Young University

2010

26 30

In our global society, multilingual barriers sometimes prohibit and often discourage people from accessing a wider variety of goods and services. We propose multilingual extraction ontologies as an approach to resolving these issues. Our ontologies provide a conceptual framework for a narrow domain of interest. Grounding narrow-domain ontologies linguistically enables them to map relevant utterances and text to meaningful concepts in the ontology. Our prior work includes leveraging large-scale lexicons and terminology resources for grounding and augmenting ontological content [14]. Linguistically grounding ontologies in multiple languages enables cross-language communication within the scope of the various ontologies' domains. We quantify the success of linguistically grounded ontologies by measuring precision and recall of extracted concepts, and we can gauge the success of automated cross-linguistic-mapping construction by measuring the speed of creation and the accuracy of generated lexical resources.

INTRODUCTION

Though English has so far served as the principal language for Internet use (with currently 28.7% of all users), its relative importance is rapidly diminishing. Chinese users, for example, comprise 21.7% of Internet users and their growth in numbers between 2000 and 2009 has been 1,018.7%; the growth in Spanish users has been 631.3% over the last decade. Since more people want to access web information in more languages, this poses a substantial challenge and opportunity for research and business organizations whose interest is in providing multilingual access to web content.

The BYU Data Extraction research Group (DEG)¹ has worked for years on tools, such as its Ontology Extraction System (OntoES), to enable access to web content of various types: car advertisements, obituaries, clinical trial data, and biomedical information. The group to date has focused on English web data, while understanding the eventual need to extend OntoES to other languages. This appears to be an opportune time for our group to enter the area of multilingual information extraction and show how the DEG infrastructure is poised to make significant contributions in this area, as it already has in extracting English information.

¹ This work was funded in part by U.S. National Science Foundation grants for the TIDIE (IIS-0083127) and TANGO (IIS-0414644) projects.

There are currently a few efforts in the area of multilingual information extraction. Some focus on very narrow domains, such as technical information for oil drilling and exploration in Norwegian and English. Others are more general but involve more than two languages, such as accessing European train system schedules. The U.S. government (NIST TREC), the European Union (7th Framework CLEF), and Japan (NTCIR) all have initiatives to help further the development and evaluation of multilingual information retrieval and data extraction systems. Of course, Google and other companies interested in web content and market share are enabling multilingual access to the Internet.

Almost all of the existing efforts involve a typical scenario that might include: collecting a query in the user’s language, translating that query into the language of the web pages to be searched, locating the answers, and then returning the relevant results to the user or to someone who can help the user understand their content. This approach is fraught with problems since machine translation (MT), a core component in the process, is still a developing technology.

For reasons discussed below, we believe that our approach has technical and linguistic merit, and can introduce a fresh perspective on multilingual information extraction. Our ontology-based techniques are ideal for extracting content in various languages without having to rely directly on MT. By carefully developing the knowledge resources necessary, we can extend DEG-type processing to other languages in a modular fashion.

THE ONTOLOGY-BASED APPROACH

2.1 Extraction Ontologies

Just over a decade ago, the BYU Data-Extraction research Group (DEG) began its work on information extraction. In a 1999 paper, DEG researchers described an efficacious way to combine ontologies with simple natural language processing [5].² The idea is to declare a narrow domain ontology for an application of interest and augment its concepts with linguistic recognizers. Coupling recognizers with conceptual modeling turns a conceptual ontology into an extraction ontology. When applied to data-rich semi-structured text, an extraction ontology recognizes linguistic elements that identify concept instances for the object and relationship sets in the ontology’s conceptual model. We call our system OntoES, Ontology-based Extraction System.

² Recently, others have begun to combine ontologies with natural language processing [11, 2]. The combination has been called “linguistically grounding ontologies.”

Consider, for example, a typical car ad. Its content can be modeled with a conceptual ontology such as that shown in Figure 1. With linguistic recognizers added for concepts such as Make, Model, Year, Price, and Mileage, the domain ontology becomes an extraction ontology.

We have developed a form-based tool [15] that helps users to develop ontologies, including declaring recognizers and associating them with ontological concepts. It also permits users to specify regular expressions that recognize traditional value phrases for car prices such as “$15,900”, “7,595”, and “$9500”, with optional dollar signs and commas. Users can also declare additional recognizers for other expected price expressions such as “15 grand”. To help make recognizers more precise, users can declare exception expressions, left and right context expressions, units expressions, and even keyword phrases such as “MSRP” and “our price” to help sort out various prices that might appear. Figure 2 shows snippets from recognizer declarations for car ads data.

Applying the recognizers of all the concepts in a car-ads extraction ontology to a car ad annotates, extracts, and organizes the facts from that ad. The result is a machine-readable cache of facts that users can query or use to perform data analysis or other automated tasks.³

³ See http://deg.byu.edu for a working online demonstration of the system.

To verify that a carefully designed extraction ontology for car ads can indeed annotate, extract, and organize facts for query and analysis, DEG researchers have conducted experiments with hundreds of car ads from various on-line sources containing thousands of fact instances. In one experiment, when an existing OntoES car ads ontology was hand-tuned on a corpus of 100 development documents and then tested on an unseen corpus of about 110 car ads, the system extracted 1003 attributes with recall measures of 94% and precision measures nearing 100% [6].

Figure 2: Snippets from recognizer declarations for car ads data.

    Price
      internal representation: Integer
      external representation: \$[1-9]\d{0,2},?\d{3} | \d?\d [Gg]rand | ...
      context keywords: price|asking|obo|neg(\.|otiable)|...
      ...
      LessThan(p1: Price, p2: Price) returns (Boolean)
        context keywords: (less than|<|under|...)\s*{p2} |...
      ...
    Make
      ...
      external representation: CarMake.lexicon
      ...
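To make the role of value patterns and context keywords more concrete, the following Python sketch applies regular expressions patterned on the Price declaration quoted in this section. It is only an illustration under stated assumptions: the function name `find_prices`, the 20-character keyword window, and the exact patterns are hypothetical and are not taken from the OntoES implementation.

```python
import re

# Value patterns loosely modeled on the Price declaration:
# dollar amounts with an optional "$" and comma, or "<n> grand".
DOLLARS = re.compile(r"\$?\b([1-9]\d{0,2}),?(\d{3})\b")
GRAND = re.compile(r"\b(\d?\d) [Gg]rand\b")
# Context keywords that make a nearby number more likely to be a price.
KEYWORDS = re.compile(r"price|asking|obo|neg(\.|otiable)", re.IGNORECASE)


def find_prices(text, window=20):
    """Return (value, has_keyword_context) pairs in order of appearance."""
    hits = []
    for m in DOLLARS.finditer(text):
        hits.append((m.start(), m.end(), int(m.group(1) + m.group(2))))
    for m in GRAND.finditer(text):
        hits.append((m.start(), m.end(), int(m.group(1)) * 1000))
    results = []
    for start, end, value in sorted(hits):
        # Look for a keyword within `window` characters of the candidate.
        context = text[max(0, start - window):end + window]
        results.append((value, bool(KEYWORDS.search(context))))
    return results
```

Note that a bare model year such as “1984” also satisfies the dollar pattern in this sketch, which is one reason the keyword, context, and exception expressions described above matter in practice.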

Recently, DEG researchers have experimented with information extraction in Japanese. Figure 3 shows an OntoES extraction ontology that can extract information from Japanese car ads analogous to the English one shown earlier. The concept names are in Japanese as are the regular-expression recognizers. Yen amounts range from 10,000 yen to 9,999,999 yen whereas dollar amounts range from $100 to $99,999. The critical observation is that the structure of the Japanese ontology is identical to the structure of the English ontology.

This type of ontology-based matching across languages at the lexical level indicates a possible strategy for providing a cross-linguistic bridge through concepts rather than relying on traditional means of translation. Similar approaches have been tried in such areas as machine translation (e.g. [ 4 ]) and cross-linguistic information retrieval [ 12 ].

As currently implemented, OntoES extraction ontologies can “read” and “write” in any single language. The car-ad examples here are in English and Japanese, but extraction ontologies work the same for all languages. To “read” means to recognize instance values for ontological concepts, to extract them, and to appropriately link related values together based on the associated conceptual relationships and constraints. To “write” means to list the facts recorded in the ontological structure. Having “read” a typical car ad, OntoES might write:

    Year: 1984
    Make: Dodge
    Model: W100
    Price: $2,000
    Feature: 4x4
    Feature: Pickup
    Accessory: 12.5x35” mud tires

In addition, based on the constraints, OntoES knows and can write several meta statements about an ontology. Examples: “an Accessory is a Feature” (white triangles denote hyponym/hypernym is-a constraints); “Trim is part of ModelTrim” (black triangles denote meronym/holonym is-part-of constraints); “Car has at most one Make” (the participation constraint 0:1 on Car for Make denotes that Car objects in car ads associate with Make names between 0 and 1 times, or “at most once”).
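The “write” step and a participation-constraint check can be sketched in a few lines of Python. This is a simplified illustration, not the OntoES data model: the names `CONSTRAINTS`, `write_facts`, and `violations` are hypothetical, and only the 0:1 constraint on Make is taken from the text; the other bounds are assumed for the sake of the example.

```python
from collections import defaultdict

# Hypothetical participation constraints for Car: concept -> (min, max),
# where None means unbounded. Only "Car has at most one Make" (0:1) comes
# from the text; the remaining entries are assumptions.
CONSTRAINTS = {
    "Make": (0, 1),
    "Year": (0, 1),
    "Price": (0, 1),
    "Feature": (0, None),
}


def write_facts(facts):
    """List extracted (concept, value) facts, one 'Concept: value' per line."""
    return "\n".join(f"{concept}: {value}" for concept, value in facts)


def violations(facts, constraints=CONSTRAINTS):
    """Return the concepts whose value count breaks a min:max constraint."""
    counts = defaultdict(int)
    for concept, _ in facts:
        counts[concept] += 1
    bad = []
    for concept, (low, high) in constraints.items():
        n = counts[concept]
        if n < low or (high is not None and n > high):
            bad.append(concept)
    return bad
```

For example, a fact list containing two Make values for a single car would be reported by `violations`, while any number of Feature values would pass.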

As currently implemented, however, OntoES cannot read in one language and write in another. This cross-linguistic ability to read in one language and then translate to and write in another language is the essence of our multilingual-oriented development. For example, we expect to be able to read the price in yen from a Japanese car ad and write “Price: $24,124” and to read the Kanji symbols for the make and write “Make: Mitsubishi”. To ensure this level of functionality, we need to encode unit or currency conversion routines for values like Price and to encode cross-linguistic lexicons for named entities such as Make. In principle, encoding this cross-linguistic mapping is currently possible, but it represents a fair amount of manual effort. We are currently finding ways to largely automate this mapping. In addition, we are adding two other capabilities to the system that will similarly enhance extraction and query processing: compound recognizers and patterns.
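A minimal Python sketch of such a cross-linguistic mapping might pair a small Make lexicon with a currency conversion routine. Everything here is an assumption for illustration: the lexicon entries, the yen pattern, the function names, and the exchange rate passed in by the caller are hypothetical and do not describe OntoES itself.

```python
import re

# Hypothetical cross-linguistic lexicon for the Make concept (Japanese -> English).
MAKE_JA_TO_EN = {"三菱": "Mitsubishi", "日産": "Nissan", "トヨタ": "Toyota"}

# Assumed pattern for yen prices such as "2,200,000円"; not taken from OntoES.
YEN = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)円")


def yen_to_usd(yen, yen_per_dollar):
    """Convert a yen amount to whole dollars at a caller-supplied rate."""
    return round(yen / yen_per_dollar)


def read_ja_write_en(ad_text, yen_per_dollar):
    """Read Make and Price from a Japanese ad and write them as English facts."""
    lines = []
    for ja, en in MAKE_JA_TO_EN.items():
        if ja in ad_text:
            lines.append(f"Make: {en}")
            break
    m = YEN.search(ad_text)
    if m:
        yen = int(m.group(1).replace(",", ""))
        lines.append(f"Price: ${yen_to_usd(yen, yen_per_dollar):,}")
    return lines
```

The conversion rate is left as a parameter because it changes over time; the ontology structure itself (a Car with a Make and a Price) is the same on both sides of the mapping.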

Compound recognizers allow OntoES to directly recognize ontological relationships beyond simple concepts. For a query like: “Find Nissans for sale with years between 1995 and 2005.”, we need to recognize each of the years as well as the between constraint that relates them. Our previous work has implemented compound recognizers for operators in free-form queries [ 1 ], but we now seek to linguistically ground these types of ontological relationships.
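As an illustration of what a compound recognizer for the “between” constraint might look like at the surface level, the Python sketch below recognizes two Year values and the relationship between them in a single pattern. The pattern, the tuple representation of the constraint, and the function names are assumptions made for this example and are not the recognizers described in [1].

```python
import re

# Assumed surface pattern for a "between" constraint over two Year values.
YEAR = r"(?:19|20)\d{2}"
BETWEEN = re.compile(rf"between\s+({YEAR})\s+and\s+({YEAR})", re.IGNORECASE)


def recognize_year_between(query):
    """Return ('Between', 'Year', low, high) if the query states such a constraint."""
    m = BETWEEN.search(query)
    if m is None:
        return None
    low, high = sorted(int(g) for g in m.groups())
    return ("Between", "Year", low, high)


def satisfies(constraint, year):
    """Check one extracted Year value against a recognized constraint."""
    _, _, low, high = constraint
    return low <= year <= high
```

Applied to the query quoted above, the sketch yields a single constraint relating both years, which can then be tested against Year values extracted from individual ads.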

Patterns will allow OntoES to identify and extract from structured text. For example, car ads often appear as a table with Price in one column, Year in another column, and Make and Model in a third column. Detecting patterns in documents will allow OntoES to apply specialized extraction rules and likely improve extraction accuracy. By extending our work with table patterns [8], we expect to fully exploit patterns in text.

2.2 Multilingual Mappings

We are extending in a principled way the cross-linguistic effectiveness of our OntoES system by adapting it for use in processing data-rich documents in languages other than English. Though the OntoES system was originally designed to handle English-language documents, it was implemented according to standard web-related software engineering principles and best practices: version control, integrated development environments, standardized data markup and encoding (XML, RDF, and OWL), Unicode character representation, and tractability (SWRL rules and Pellet-based reasoning). Consequently, we anticipate that internationalization of the system should be relatively straightforward, not requiring wholesale rewrites of crucial components. This should allow us to handle web pages in any language, given appropriate linguistic knowledge sources. Since OntoES does not need to parse out the grammatical structure of webpage text, only lower-level lexical (word-based) information is necessary for linguistic processing.

The system’s lexical knowledge is highly modular, with specific resources encoded as user-selectable lexicons. The information used to build up existing content for the English lexicons includes a mix of implicit knowledge and existing resources. Some lexicon entries were created by students during class and project work; other entries were developed from existing lexical resources (e.g. the US Census Bureau for personal names, the World Factbook for country names, Ethnologue for language names, etc.). We are developing analogous lexicons for other languages, and adapting OntoES as necessary to accommodate them in its processing. As was the case for English, this involves some hand-crafting of relevant material, as well as finding and converting existing data sources in other languages for targeted types of lexical information. Often this is relatively straightforward: for example, WordNet is a sizable and important component for English OntoES, and similar and compatible resources exist for other languages. However, we also need to rely on linguistic knowledge and experience to find, convert, and implement appropriate cross-linguistic lexical resources.
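The idea of modular, user-selectable lexicons can be illustrated with a short Python sketch in which the same concept is grounded by a different word list per language. The lexicon contents, the `LEXICONS` structure, and the function names are hypothetical stand-ins; real lexicons would be far larger and loaded from external resources.

```python
import re

# Tiny stand-in lexicons, keyed by language and then by concept.
LEXICONS = {
    "en": {"Make": ["Dodge", "Nissan", "Mitsubishi"]},
    "ja": {"Make": ["ダッジ", "日産", "三菱"]},
}


def build_recognizer(entries):
    """Compile lexicon entries into one alternation, longest entries first."""
    ordered = sorted(entries, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


def recognize(concept, text, language):
    """Return all lexicon entries for `concept` found in `text`."""
    recognizer = build_recognizer(LEXICONS[language][concept])
    return recognizer.findall(text)
```

Only the lexicon selected by `language` changes between calls; the concept name and the recognition logic stay the same.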

In the realm of cross-linguistic extraction systems, OntoES has a clear advantage. We claim that ontologies, which lie at the crux of our extraction approach, can serve as viable interlinguas. We are currently substantiating this claim. Since an ontology represents a conceptualization of items and relationships of interest (e.g. interesting properties of a car, information needed to set up a doctor’s appointment, etc.), a given ontology should be appropriate cross-linguistically with perhaps occasionally some slight cultural adaptation. For example, in our prior work on extraction from obituaries [5] we found that worldwide cultural and dialect differences were readily apparent even in English material. Certain terms for events like “tenth day kriya”, “obsequies”, and “cortege” were found only in English obituaries announcing events outside of America. Since our lexical resources serve as a “grounding” of the lowest-level concepts from ontologies with the lexical content of the web pages, substituting one language’s lexicon for another’s provides OntoES with a true cross-linguistic capability.

2.3 Ongoing Work

Our current work involves several separate but related tasks. We are locating annotated corpora in other languages amenable for evaluation purposes, and collecting and annotating interesting multilingual web material of our own. We are also developing prototype lexicons and recognizers for these target languages. Of course, our work requires us to develop and adapt prototype ontologies for target languages for sample concepts in data-rich domains.

In addition, we are enhancing extraction ontologies by enabling them to (1) explicitly discover and extract relationships among object instances of interest, and (2) discover patterns of interest from which they can more certainly identify and extract both object instances and relationship instances of interest. This involves devising, investigating, designing, coding, and evaluating algorithms for compound recognizers and for pattern discovery and patterned information extraction.

Finally, we are evaluating system performance using standard metrics and gold-standard annotated data.
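For reference, the standard precision and recall measures over extracted facts can be computed as in the following Python sketch. The representation of facts as (concept, value) pairs and the function name are assumptions for illustration.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall of extracted facts against gold annotations."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: fraction of extracted facts that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of gold-standard facts that were extracted.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

A system that extracts three facts, all correct, from an ad annotated with four gold-standard facts would score a precision of 1.0 and a recall of 0.75.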

CONCLUSION

Though our multilingual extraction work is an interesting effort in its own right, we also expect it to contribute to our larger effort to create a Web of Knowledge [7, 9]. Our research centers on resolving some of the tough technical issues involved in a community-wide effort to deploy the semantic web [16], in concert with efforts at Yahoo!, Google, and elsewhere to extract information from the web and integrate it into community portals so that community members can better discover, search, query, and track interesting community information [3, 10, 13]. Multilingual extraction ontologies have the far-reaching potential to play a significant role as semantic-web work finds its way into mainstream use in global communities.

[1] Al-Muhammed and Embley. Ontology-based constraint recognition for free-form service requests. In Proceedings of the 23rd International Conference on Data Engineering (ICDE'07), pages 366-375, Istanbul, Turkey, 2007.

[2] Buitelaar, Cimiano, Haase, and Sintek. Towards linguistically grounded ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC'09), pages 111-125, Heraklion, Greece, 2009.

[3] DeRose, Shen, Chen, Doan, and Ramakrishnan. Building structured web community portals: A top-down, compositional, and incremental approach. In Proceedings of the 33rd Very Large Database Conference (VLDB'07), pages 23-28, Vienna, Austria, 2007.

[4] B. J. Dorr. Machine Translation: A view from the lexicon. MIT Press, Cambridge, MA, 1993.

[5] Embley, D. Campbell, Jiang, Liddle, Lonsdale, Y.-K. Ng, and Smith. Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering, 31(3):227-251, 1999.

[6] Embley, D. Campbell, Liddle, and Smith. Ontology-based extraction and structuring of information from data-rich unstructured documents. In Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM'98), pages 52-59, Washington D.C., 1998.

[7] Embley, Liddle, Lonsdale, G. Nagy, Tijerino, Clawson, Crabtree, Ding, Jha, Lian, Lynn, Padmanabhan, Peters, Tao, Watts, Woodbury, and Zitzelberger. A conceptual-model-based computational alembic for a web of knowledge. In Proceedings of the 27th International Conference on Conceptual Modeling (ER08), pages 532-533, 2008.

[8] Embley, Tao, and Liddle. Automating the extraction of data from HTML tables with unknown structure. Data & Knowledge Engineering, 54(1):3-28, 2005.

[9] Embley and Zitzelberger. Theoretical foundations for enabling a web of knowledge. In Proceedings of the 6th International Symposium on Foundations of Information and Knowledge Systems (FoIKS10), Sophia, Bulgaria, 2010.

[10] Halevy, Norvig, and Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, March/April 2009.

[11] Hunter, Lu, Firby, W. B. Jr., H. Johnson, P. Ogren, and Cohen. OpenDMAP: An open source, ontology-driven, concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics, 9(8), 2008.

[12] Kishida. Technical issues of cross-language information retrieval: A review. Information Processing and Management: an International Journal, 41:433-455, 2005.

[13] Kumar, Pang, Ramakrishnan, Tomkins, Bohannon, Keerthi, and Merugu. A web of concepts. In Proceedings of the 2009 Symposium on Principles of Database Systems, pages 1-12, Providence, RI, 2009.

[14] Lonsdale, D. W. Embley, Ding, Xu, and Hepp. Reusing ontologies and language components for ontology generation. Data & Knowledge Engineering, 69:318-330, 2010.

[15] Tao, Embley, and Liddle. FOCIH: Form-based ontology creation and information harvesting. In Proceedings of the 28th International Conference on Conceptual Modeling (ER 2009), pages 346-359, Gramado, Brazil, 2009.

[16] W3C (World Wide Web Consortium) Semantic Web Activity Page. http://www.w3.org/2001/sw/.