=Paper=
{{Paper
|id=None
|storemode=property
|title=Ontologies for Multilingual Extraction
|pdfUrl=https://ceur-ws.org/Vol-571/paper1.pdf
|volume=Vol-571
|dblpUrl=https://dblp.org/rec/conf/www/LonsdaleEL10
}}
==Ontologies for Multilingual Extraction==
Deryle W. Lonsdale, Linguistics & English Lang., Brigham Young University, lonz@byu.edu
David W. Embley, Computer Science, Brigham Young University, embley@cs.byu.edu
Stephen W. Liddle, Information Systems, Brigham Young University, liddle@byu.edu
ABSTRACT

In our global society, multilingual barriers sometimes prohibit and often discourage people from accessing a wider variety of goods and services. We propose multilingual extraction ontologies as an approach to resolving these issues. Our ontologies provide a conceptual framework for a narrow domain of interest. Grounding narrow-domain ontologies linguistically enables them to map relevant utterances and text to meaningful concepts in the ontology. Our prior work includes leveraging large-scale lexicons and terminology resources for grounding and augmenting ontological content [14]. Linguistically grounding ontologies in multiple languages enables cross-language communication within the scope of the various ontologies’ domains. We quantify the success of linguistically grounded ontologies by measuring precision and recall of extracted concepts, and we can gauge the success of automated cross-linguistic-mapping construction by measuring the speed of creation and the accuracy of generated lexical resources.

1. INTRODUCTION

Though English has so far served as the principal language for Internet use (with currently 28.7% of all users), its relative importance is rapidly diminishing. Chinese users, for example, comprise 21.7% of Internet users and their growth in numbers between 2000 and 2009 has been 1,018.7%; the growth in Spanish users has been 631.3% over the last decade. Since more people want to access web information in more languages, this poses a substantial challenge and opportunity for research and business organizations whose interest is in providing multilingual access to web content.

The BYU Data Extraction Research Group (DEG)¹ has worked for years on tools, such as its Ontology Extraction System (OntoES), to enable access to web content of various types: car advertisements, obituaries, clinical trial data, and biomedical information. The group to date has focused on English web data, while understanding the eventual need to extend OntoES to other languages. This appears to be an opportune time for our group to enter the area of multilingual information extraction and show how the DEG infrastructure is poised to make significant contributions in this area, as it already has in extracting English information.

There are currently a few efforts in the area of multilingual information extraction. Some focus on very narrow domains, such as technical information for oil drilling and exploration in Norwegian and English. Others are more general but involve more than two languages, such as accessing European train system schedules. The U.S. government (NIST TREC), the European Union (7th Framework CLEF), and Japan (NTCIR) all have initiatives to help further the development and evaluation of multilingual information retrieval and data extraction systems. Of course, Google and other companies interested in web content and market share are enabling multilingual access to the Internet.

Almost all of the existing efforts involve a typical scenario that might include: collecting a query in the user’s language, translating that query into the language of the web pages to be searched, locating the answers, and then returning the relevant results to the user or to someone who can help the user understand their content. This approach is fraught with problems since machine translation (MT), a core component in the process, is still a developing technology.

For reasons discussed below, we believe that our approach has technical and linguistic merit, and can introduce a fresh perspective on multilingual information extraction. Our ontology-based techniques are ideal for extracting content in various languages without having to rely directly on MT. By carefully developing the knowledge resources necessary, we can extend DEG-type processing to other languages in a modular fashion.

¹ This work was funded in part by U.S. National Science Foundation grants for the TIDIE (IIS-0083127) and TANGO (IIS-0414644) projects.

Copyright is held by the author/owner(s).
WWW2010, April 26-30, 2010, Raleigh, North Carolina.

2. THE ONTOLOGY-BASED APPROACH

2.1 Extraction Ontologies

Just over a decade ago, the BYU Data Extraction Research Group (DEG) began its work on information extraction. In a 1999 paper, DEG researchers described an efficacious way to combine ontologies with simple natural language processing [5].² The idea is to declare a narrow domain ontology for an application of interest and augment its concepts with linguistic recognizers. Coupling recognizers with conceptual modeling turns a conceptual ontology into an extraction ontology. When applied to data-rich semi-structured text, an extraction ontology recognizes linguistic elements that identify concept instances for the object and relationship sets in the ontology’s conceptual model. We call our system OntoES, Ontology-based Extraction System.

² Recently, others have begun to combine ontologies with natural language processing [11, 2]. The combination has been called “linguistically grounding ontologies.”

Consider, for example, a typical car ad. Its content can be modeled with a conceptual ontology such as that shown in Figure 1. With linguistic recognizers added for concepts such as Make, Model, Year, Price, and Mileage, the domain ontology becomes an extraction ontology.

Figure 1: Extraction Ontology for Car Ads.

We have developed a form-based tool [15] that helps users to develop ontologies including declaring recognizers and associating them with ontological concepts. It also permits users to specify regular expressions that recognize traditional value phrases for car prices such as “$15,900”, “7,595”, and “$9500”, with optional dollar signs and commas. Users can also declare additional recognizers for other expected price expressions such as “15 grand”. To help make recognizers more precise, users can declare exception expressions, left and right context expressions, units expressions, and even keyword phrases such as “MSRP” and “our price” to help sort out various prices that might appear. Figure 2 shows snippets from recognizer declarations for car ads data.

 Price
   internal representation: Integer
   external representation: \$[1-9]\d{0,2},?\d{3}
     | \d?\d [Gg]rand | ...
   context keywords: price|asking|obo|neg(\.|otiable)|...
   ...
   LessThan(p1: Price, p2: Price) returns (Boolean)
   context keywords: (less than|<|under|...)\s*{p2} |...
   ...
 Make
   ...
   external representation: CarMake.lexicon
   ...

Figure 2: Sample Recognizer Declarations for Car Ads.

Applying the recognizers of all the concepts in a car-ads extraction ontology to a car ad annotates, extracts, and organizes the facts from that ad. The result is a machine-readable cache of facts that users can query or use to perform data analysis or other automated tasks.³

³ See http://deg.byu.edu for a working online demonstration of the system.

To verify that a carefully designed extraction ontology for car ads can indeed annotate, extract, and organize facts for query and analysis, DEG researchers have conducted experiments with hundreds of car ads from various on-line sources containing thousands of fact instances. In one experiment, when an existing OntoES car ads ontology was hand-tuned on a corpus of 100 development documents and then tested on an unseen corpus of about 110 car ads, the system extracted 1003 attributes with recall measures of 94% and precision measures nearing 100% [6].

Recently, DEG researchers have experimented with information extraction in Japanese. Figure 3 shows an OntoES extraction ontology that can extract information from Japanese car ads analogous to the English one shown earlier. The concept names are in Japanese as are the regular-expression recognizers. Yen amounts range from 10,000 yen to 9,999,999 yen whereas dollar amounts range from $100 to $99,999. The critical observation is that the structure of the Japanese ontology is identical to the structure of the English ontology.

Figure 3: Japanese Extraction Ontology for Car Ads.

This type of ontology-based matching across languages at the lexical level indicates a possible strategy for providing a cross-linguistic bridge through concepts rather than relying on traditional means of translation. Similar approaches have been tried in such areas as machine translation (e.g. [4]) and cross-linguistic information retrieval [12].
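The observation that the English and Japanese ontologies share one structure can be illustrated with a minimal Python sketch in which a single Price concept carries one recognizer per language. The English pattern is the one given in the Figure 2 snippet; the Japanese pattern, the Concept class, and the sample ad strings are assumptions made for illustration and are not taken from OntoES or from Figure 3.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology concept with a value recognizer per language (hypothetical class)."""
    name: str
    # Maps a language code to a compiled regular expression for value phrases.
    recognizers: dict = field(default_factory=dict)

    def extract(self, text: str, lang: str) -> list:
        """Return every value phrase recognized in `text` for language `lang`."""
        return [m.group(0) for m in self.recognizers[lang].finditer(text)]


# The concept is declared once; only the recognizers differ by language.
PRICE = Concept(
    name="Price",
    recognizers={
        # Pattern from the Figure 2 snippet (its elided alternatives are omitted).
        "en": re.compile(r"\$[1-9]\d{0,2},?\d{3}|\d?\d [Gg]rand"),
        # Assumed pattern for comma-grouped yen amounts; not the Figure 3 recognizer.
        "ja": re.compile(r"[1-9]\d{0,2}(?:,\d{3})+円"),
    },
)

print(PRICE.extract("Asking $15,900 obo, or 15 grand cash.", "en"))
print(PRICE.extract("価格: 1,280,000円 (税込)", "ja"))
```

Under this arrangement, supporting another language amounts to adding an entry to the recognizer table while the concept itself stays unchanged.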
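The recall and precision figures reported for the car-ad experiment follow the usual definitions: precision is the share of extracted facts that are correct, and recall is the share of gold-standard facts that were extracted. The sketch below assumes that facts are compared as exact (ad, concept, value) triples; the function name and the sample facts are hypothetical and are not data from the experiment.

```python
def precision_recall(extracted: set, gold: set) -> tuple:
    """Return (precision, recall) of extracted facts against a gold standard."""
    correct = len(extracted & gold)  # facts that are both extracted and in the gold set
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (ad id, concept, value) facts, for illustration only.
gold = {
    ("ad1", "Year", "1999"),
    ("ad1", "Make", "Nissan"),
    ("ad1", "Price", "$9500"),
    ("ad2", "Make", "Dodge"),
}
extracted = {
    ("ad1", "Year", "1999"),
    ("ad1", "Make", "Nissan"),
    ("ad1", "Price", "$9500"),
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case every extracted fact is correct but one gold fact is missed, which mirrors the pattern of high precision with somewhat lower recall.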
As currently implemented, OntoES extraction ontologies can “read” and “write” in any single language. The car-ad examples here are in English and Japanese, but extraction ontologies work the same for all languages. To “read” means to recognize instance values for ontological concepts, to extract them, and to appropriately link related values together based on the associated conceptual relationships and constraints. To “write” means to list the facts recorded in the ontological structure. Having “read” a typical car ad, OntoES might write:

 Year: 1984
 Make: Dodge
 Model: W100
 Price: $2,000
 Feature: 4x4
 Feature: Pickup
 Accessory: 12.5x35” mud tires

In addition, based on the constraints, OntoES knows and can write several meta statements about an ontology. Examples: “an Accessory is a Feature” (white triangles denote hyponym/hypernym is-a constraints); “Trim is part of ModelTrim” (black triangles denote meronym/holonym is-part-of constraints); “Car has at most one Make” (the participation constraint 0:1 on Car for Make denotes that Car objects in car ads associate with Make names between 0 and 1 times, or “at most once”).

As currently implemented, however, OntoES cannot read in one language and write in another. This cross-linguistic ability to read in one language and then translate to and write in another language is the essence of our multilingual-oriented development. For example, we expect to be able to read the price in yen from a Japanese car ad and write “Price: $24,124” and to read the Kanji symbols for the make and write “Make: Mitsubishi”. To assure this level of functionality, we need to encode unit or currency conversion routines for values like Price and to encode cross-linguistic lexicons for named entities such as Make. In principle, encoding this cross-linguistic mapping is currently possible, but represents a fair amount of manual effort. We are currently finding ways to largely automate this mapping. In addition, we are adding two other capabilities to the system that will similarly enhance extraction and query processing: compound recognizers and patterns.

Compound recognizers allow OntoES to directly recognize ontological relationships beyond simple concepts. For a query like “Find Nissans for sale with years between 1995 and 2005.”, we need to recognize each of the years as well as the between constraint that relates them. Our previous work has implemented compound recognizers for operators in free-form queries [1], but we now seek to linguistically ground these types of ontological relationships.

Patterns will allow OntoES to identify and extract from structured text. For example, car ads often appear as a table with Price in one column, Year in another column, and Make and Model in a third column. Detecting patterns in documents will allow OntoES to apply specialized extraction rules and likely improve extraction accuracy. By extending our work with table patterns [8], we expect to fully exploit patterns in text.

2.2 Multilingual Mappings

We are extending in a principled way the cross-linguistic effectiveness of our OntoES system by adapting it for use in processing data-rich documents in languages other than English. Though the OntoES system was originally designed to handle English-language documents, it was implemented according to standard web-related software engineering principles and best practices: version control, integrated development environments, standardized data markup and encoding (XML, RDF, and OWL), Unicode character representation, and tractability (SWRL rules and Pellet-based reasoning). Consequently, we anticipate that internationalization of the system should be relatively straightforward, not requiring wholesale rewrites of crucial components. This should allow us to handle web pages in any language, given appropriate linguistic knowledge sources. Since OntoES does not need to parse out the grammatical structure of webpage text, only lower-level lexical (word-based) information is necessary for linguistic processing.

The system’s lexical knowledge is highly modular, with specific resources encoded as user-selectable lexicons. The information used to build up existing content for the English lexicons includes a mix of implicit knowledge and existing resources. Some lexicon entries were created by students during class and project work; other entries were developed from existing lexical resources (e.g. the US Census Bureau for personal names, the World Factbook for country names, Ethnologue for language names, etc.). We are developing analogous lexicons for other languages, and adapting OntoES as necessary to accommodate them in its processing. As was the case for English, this involves some hand-crafting of relevant material, as well as finding and converting existing data sources in other languages for targeted types of lexical information. Often this is relatively straightforward: for example, WordNet is a sizable and important component for English OntoES, and similar and compatible resources exist for other languages. However, we also need to rely on linguistic knowledge and experience to find, convert, and implement appropriate cross-linguistic lexical resources.

In the realm of cross-linguistic extraction systems, OntoES has a clear advantage. We claim that ontologies, which lie at the crux of our extraction approach, can serve as viable interlinguas. We are currently substantiating this claim. Since an ontology represents a conceptualization of items and relationships of interest (e.g. interesting properties of a car, information needed to set up a doctor’s appointment, etc.), a given ontology should be appropriate cross-linguistically with perhaps occasionally some slight cultural adaptation. For example, in our prior work on extraction from obituaries [5] we found that worldwide cultural and dialect differences were readily apparent even in English material. Certain terms for events like “tenth day kriya”, “obsequies”, and “cortege” were found only in English obituaries announcing events outside of America. Since our lexical resources serve as a “grounding” of the lowest-level concepts from ontologies with the lexical content of the web pages, substituting one language’s lexicon for another’s provides OntoES a true cross-linguistic capability.

2.3 Ongoing Work

Our current work involves several separate but related tasks. We are locating annotated corpora in other languages amenable to evaluation, and collecting and annotating interesting multilingual web material of our own. We are also developing prototype lexicons and recognizers for these target languages. Of course, our work requires us to develop and adapt prototype ontologies for target languages for sample concepts in data-rich domains.

In addition, we are enhancing extraction ontologies by enabling them to (1) explicitly discover and extract relationships among object instances of interest, and (2) discover patterns of interest from which they can more certainly identify and extract both object instances and relationship instances of interest. This involves devising, investigating, designing, coding, and evaluating algorithms for compound recognizers and for pattern discovery and patterned information extraction.

Finally, we are evaluating system performance using standard metrics and gold-standard annotated data.

3. CONCLUSION

Though it is an interesting effort in its own right, we expect our multilingual extraction work also to contribute to our larger effort to create a Web of Knowledge [7, 9]. Our research centers on resolving some of the tough technical issues involved in a community-wide effort to deploy the semantic web [16], and it proceeds in concert with efforts at Yahoo!, Google, and elsewhere to extract information from the web and integrate it into community portals to enable community members to better discover, search, query, and track interesting community information [3, 10, 13]. Multilingual extraction ontologies have the far-reaching potential to play a significant role as semantic-web work finds its way into mainstream use in global communities.

4. REFERENCES

[1] M. Al-Muhammed and D. Embley. Ontology-based constraint recognition for free-form service requests. In Proceedings of the 23rd International Conference on Data Engineering (ICDE’07), pages 366–375, Istanbul, Turkey, 2007.
[2] P. Buitelaar, P. Cimiano, P. Haase, and M. Sintek. Towards linguistically grounded ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC’09), pages 111–125, Heraklion, Greece, 2009.
[3] P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building structured web community portals: A top-down, compositional, and incremental approach. In Proceedings of the 33rd Very Large Database Conference (VLDB’07), pages 23–28, Vienna, Austria, 2007.
[4] B. J. Dorr. Machine Translation: A view from the lexicon. MIT Press, Cambridge, MA, 1993.
[5] D. Embley, D. Campbell, Y. Jiang, S. Liddle, D. Lonsdale, Y.-K. Ng, and R. Smith. Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering, 31(3):227–251, 1999.
[6] D. Embley, D. Campbell, S. Liddle, and R. Smith. Ontology-based extraction and structuring of information from data-rich unstructured documents. In Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM’98), pages 52–59, Washington D.C., 1998.
[7] D. Embley, S. Liddle, D. Lonsdale, G. Nagy, Y. Tijerino, R. Clawson, J. Crabtree, Y. Ding, P. Jha, Z. Lian, S. Lynn, R. Padmanabhan, J. Peters, C. Tao, R. Watts, C. Woodbury, and A. Zitzelberger. A conceptual-model-based computational alembic for a web of knowledge. In Proceedings of the 27th International Conference on Conceptual Modeling (ER08), pages 532–533, 2008.
[8] D. Embley, C. Tao, and S. Liddle. Automating the extraction of data from HTML tables with unknown structure. Data & Knowledge Engineering, 54(1):3–28, 2005.
[9] D. Embley and A. Zitzelberger. Theoretical foundations for enabling a web of knowledge. In Proceedings of the 6th International Symposium on Foundations of Information and Knowledge Systems (FoIKS10), Sofia, Bulgaria, 2010.
[10] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, March/April 2009.
[11] L. Hunter, Z. Lu, J. Firby, W. B. Jr., H. Johnson, P. Ogren, and K. Cohen. OpenDMAP: An open source, ontology-driven, concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics, 9(8), 2008.
[12] K. Kishida. Technical issues of cross-language information retrieval: A review. Information Processing and Management: an International Journal, 41:433–455, 2005.
[13] R. Kumar, B. Pang, R. Ramakrishnan, A. Tomkins, P. Bohannon, S. Keerthi, and S. Merugu. A web of concepts. In Proceedings of the 2009 Symposium on Principles of Database Systems, pages 1–12, Providence, RI, 2009.
[14] D. Lonsdale, D. W. Embley, Y. Ding, L. Xu, and M. Hepp. Reusing ontologies and language components for ontology generation. Data & Knowledge Engineering, 69:318–330, 2010.
[15] C. Tao, D. Embley, and S. Liddle. FOCIH: Form-based ontology creation and information harvesting. In Proceedings of the 28th International Conference on Conceptual Modeling (ER 2009), pages 346–359, Gramado, Brazil, 2009.
[16] W3C (World Wide Web Consortium) Semantic Web Activity Page. http://www.w3.org/2001/sw/.