=Paper= {{Paper |id=None |storemode=property |title=Ontologies for Multilingual Extraction |pdfUrl=https://ceur-ws.org/Vol-571/paper1.pdf |volume=Vol-571 |dblpUrl=https://dblp.org/rec/conf/www/LonsdaleEL10 }} ==Ontologies for Multilingual Extraction== https://ceur-ws.org/Vol-571/paper1.pdf
                    Ontologies for Multilingual Extraction

         Deryle W. Lonsdale                        David W. Embley                      Stephen W. Liddle
       Linguistics & English Lang.                  Computer Science                     Information Systems
        Brigham Young University                Brigham Young University              Brigham Young University
             lonz@byu.edu                        embley@cs.byu.edu                       liddle@byu.edu



ABSTRACT
In our global society, multilingual barriers sometimes prohibit and often discourage people from accessing a wider variety of goods and services. We propose multilingual extraction ontologies as an approach to resolving these issues. Our ontologies provide a conceptual framework for a narrow domain of interest. Grounding narrow-domain ontologies linguistically enables them to map relevant utterances and text to meaningful concepts in the ontology. Our prior work includes leveraging large-scale lexicons and terminology resources for grounding and augmenting ontological content [14]. Linguistically grounding ontologies in multiple languages enables cross-language communication within the scope of the various ontologies’ domains. We quantify the success of linguistically grounded ontologies by measuring precision and recall of extracted concepts, and we can gauge the success of automated cross-linguistic-mapping construction by measuring the speed of creation and the accuracy of generated lexical resources.

1. INTRODUCTION
Though English has so far served as the principal language for Internet use (with currently 28.7% of all users), its relative importance is rapidly diminishing. Chinese users, for example, comprise 21.7% of Internet users and their growth in numbers between 2000 and 2009 has been 1,018.7%; the growth in Spanish users has been 631.3% over the last decade. Since more people want to access web information in more languages, this poses a substantial challenge and opportunity for research and business organizations whose interest is in providing multilingual access to web content.

The BYU Data Extraction research Group (DEG)¹ has worked for years on tools, such as its Ontology Extraction System (OntoES), to enable access to web content of various types: car advertisements, obituaries, clinical trial data, and biomedical information. The group to date has focused on English web data, while understanding the eventual need to extend OntoES to other languages. This appears to be an opportune time for our group to enter the area of multilingual information extraction and show how the DEG infrastructure is poised to make significant contributions in this area as it already has in extracting English information.

¹ This work was funded in part by U.S. National Science Foundation grants for the TIDIE (IIS-0083127) and TANGO (IIS-0414644) projects.
Copyright is held by the author/owner(s).
WWW2010, April 26-30, 2010, Raleigh, North Carolina.

There are currently a few efforts in the area of multilingual information extraction. Some focus on very narrow domains, such as technical information for oil drilling and exploration in Norwegian and English. Others are more general but involve more than two languages, such as accessing European train system schedules. The U.S. government (NIST TREC), the European Union (7th Framework CLEF), and Japan (NTCIR) all have initiatives to help further the development and evaluation of multilingual information retrieval and data extraction systems. Of course, Google and other companies interested in web content and market share are enabling multilingual access to the Internet.

Almost all of the existing efforts involve a typical scenario that might include: collecting a query in the user’s language, translating that query into the language of the web pages to be searched, locating the answers, and then returning the relevant results to the user or to someone who can help the user understand their content. This approach is fraught with problems since machine translation (MT), a core component in the process, is still a developing technology.

For reasons discussed below, we believe that our approach has technical and linguistic merit, and can introduce a fresh perspective on multilingual information extraction. Our ontology-based techniques are ideal for extracting content in various languages without having to rely directly on MT. By carefully developing the knowledge resources necessary, we can extend DEG-type processing to other languages in a modular fashion.

2. THE ONTOLOGY-BASED APPROACH

2.1 Extraction Ontologies
Just over a decade ago, the BYU Data-Extraction research Group (DEG) began its work on information extraction. In a 1999 paper, DEG researchers described an efficacious way to combine ontologies with simple natural language processing [5].² The idea is to declare a narrow domain ontology for an application of interest and augment its concepts with linguistic recognizers. Coupling recognizers with conceptual modeling turns a conceptual ontology into an extraction ontology. When applied to data-rich semi-structured text, an extraction ontology recognizes linguistic elements that identify concept instances for the object and relationship sets in the ontology’s conceptual model. We call our system OntoES, Ontology-based Extraction System.

² Recently, others have begun to combine ontologies with natural language processing [11, 2]. The combination has been called “linguistically grounding ontologies.”

Consider, for example, a typical car ad. Its content can be modeled with a conceptual ontology such as that shown in Figure 1. With linguistic recognizers added for concepts such as Make, Model, Year, Price, and Mileage, the domain ontology becomes an extraction ontology.

Figure 1: Extraction Ontology for Car Ads.

We have developed a form-based tool [15] that helps users to develop ontologies including declaring recognizers and associating them with ontological concepts. It also permits users to specify regular expressions that recognize traditional value phrases for car prices such as “$15,900”, “7,595”, and “$9500”, with optional dollar signs and commas. Users can also declare additional recognizers for other expected price expressions such as “15 grand”. To help make recognizers more precise, users can declare exception expressions, left and right context expressions, units expressions, and even keyword phrases such as “MSRP” and “our price” to help sort out various prices that might appear. Figure 2 shows snippets from recognizer declarations for car ads data.

Price
   internal representation: Integer
   external representation: \$[1-9]\d{0,2},?\d{3}
                            | \d?\d [Gg]rand | ...
   context keywords: price|asking|obo|neg(\.|otiable)|...
   ...
   LessThan(p1: Price, p2: Price) returns (Boolean)
   context keywords: (less than|<|under|...)\s*{p2} |...
   ...
Make
   ...
   external representation: CarMake.lexicon
   ...

Figure 2: Sample Recognizer Declarations for Car Ads.

Applying the recognizers of all the concepts in a car-ads extraction ontology to a car ad annotates, extracts, and organizes the facts from that ad. The result is a machine-readable cache of facts that users can query or use to perform data analysis or other automated tasks.³

³ See http://deg.byu.edu for a working online demonstration of the system.

To verify that a carefully designed extraction ontology for car ads can indeed annotate, extract, and organize facts for query and analysis, DEG researchers have conducted experiments with hundreds of car ads from various on-line sources containing thousands of fact instances. In one experiment, when an existing OntoES car ads ontology was hand-tuned on a corpus of 100 development documents and then tested on an unseen corpus of about 110 car ads, the system extracted 1003 attributes with recall measures of 94% and precision measures nearing 100% [6].

Recently, DEG researchers have experimented with information extraction in Japanese. Figure 3 shows an OntoES extraction ontology that can extract information from Japanese car ads analogous to the English one shown earlier. The concept names are in Japanese as are the regular-expression recognizers. Yen amounts range from 10,000 yen to 9,999,999 yen whereas dollar amounts range from $100 to $99,999. The critical observation is that the structure of the Japanese ontology is identical to the structure of the English ontology.

Figure 3: Japanese Extraction Ontology for Car Ads.

This type of ontology-based matching across languages at the lexical level indicates a possible strategy for providing a cross-linguistic bridge through concepts rather than relying on traditional means of translation. Similar approaches have been tried in such areas as machine translation (e.g. [4]) and cross-linguistic information retrieval [12].
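As a rough illustration of the recognizer declarations in Figure 2, the following Python sketch pairs the two Price patterns listed there with a conversion to the Integer internal representation, and uses the listed context keywords to flag matches that occur near a price cue. The function name, the 30-character context window, and the boolean context flag are assumptions made for this example, not a description of how OntoES is implemented. The alternatives elided in the figure ("...") are not reproduced, so a bare amount such as “7,595” is not matched here because the first listed pattern requires a dollar sign.

```python
import re

# External representations for Price, copied from Figure 2.
# Each pattern is paired with a converter to the Integer internal representation.
PRICE_PATTERNS = [
    (re.compile(r"\$([1-9]\d{0,2}),?(\d{3})"),
     lambda m: int(m.group(1) + m.group(2))),
    (re.compile(r"(\d?\d) [Gg]rand"),
     lambda m: int(m.group(1)) * 1000),
]

# Context keywords for Price, copied from Figure 2.
PRICE_CONTEXT = re.compile(r"price|asking|obo|neg(\.|otiable)", re.IGNORECASE)

# Characters inspected on each side of a match (a hypothetical choice).
CONTEXT_WINDOW = 30


def recognize_prices(text):
    """Return a list of (matched_text, integer_value, has_context) tuples."""
    results = []
    for pattern, to_int in PRICE_PATTERNS:
        for m in pattern.finditer(text):
            start = max(0, m.start() - CONTEXT_WINDOW)
            window = text[start:m.end() + CONTEXT_WINDOW]
            has_context = PRICE_CONTEXT.search(window) is not None
            results.append((m.group(0), to_int(m), has_context))
    return results


if __name__ == "__main__":
    ad = "1984 Dodge W100 pickup, 4x4, asking $2,000 obo"
    for hit in recognize_prices(ad):
        print(hit)
```

In this sketch the year “1984” is not reported as a price because neither pattern matches it; a fuller recognizer would also need the exception and left/right context expressions described above.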



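The observation that the English and Japanese ontologies share one structure can be sketched as a single list of concepts to which each language attaches its own recognizers. The code below is illustrative only: the Japanese price pattern, the small lexicons, and all names are assumptions for this example and are not taken from Figure 3. Prices are kept in the currency of the source ad; no conversion is attempted.

```python
import re

# Language-neutral concepts; every language attaches recognizers to these keys.
CONCEPTS = ("Make", "Price")

# Hypothetical per-language recognizers. The English Price pattern follows
# Figure 2; the Japanese pattern and both lexicons are assumed for illustration.
RECOGNIZERS = {
    "en": {
        "Price": re.compile(r"\$([1-9]\d{0,2},?\d{3})"),
        "Make": {"Dodge": "Dodge", "Nissan": "Nissan", "Mitsubishi": "Mitsubishi"},
    },
    "ja": {
        "Price": re.compile(r"([1-9]\d{0,2}(?:,?\d{3}){1,2})円"),
        "Make": {"三菱": "Mitsubishi", "日産": "Nissan"},
    },
}


def extract(text, lang):
    """Fill the shared concept slots using the recognizers of one language."""
    record = {}
    for concept in CONCEPTS:
        recognizer = RECOGNIZERS[lang][concept]
        if isinstance(recognizer, dict):
            # Lexicon lookup: map a surface form to a concept-level identifier.
            for surface, identifier in recognizer.items():
                if surface in text:
                    record[concept] = identifier
                    break
        else:
            # Regular-expression recognizer: store the amount as an integer.
            m = recognizer.search(text)
            if m:
                record[concept] = int(m.group(1).replace(",", ""))
    return record


if __name__ == "__main__":
    print(extract("1984 Dodge W100, asking $2,000", "en"))
    print(extract("三菱 パジェロ 1,250,000円", "ja"))
```

Both calls return records with the same keys, which is the sense in which the concept structure, rather than a translation step, serves as the bridge between languages.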
As currently implemented, OntoES extraction ontologies can “read” and “write” in any single language. The car-ad examples here are in English and Japanese, but extraction ontologies work the same for all languages. To “read” means to recognize instance values for ontological concepts, to extract them, and to appropriately link related values together based on the associated conceptual relationships and constraints. To “write” means to list the facts recorded in the ontological structure. Having “read” a typical car ad, OntoES might write:

     Year: 1984
     Make: Dodge
     Model: W100
     Price: $2,000
     Feature: 4x4
     Feature: Pickup
     Accessory: 12.5x35” mud tires

In addition, based on the constraints, OntoES knows and can write several meta statements about an ontology. Examples: “an Accessory is a Feature” (white triangles denote hyponym/hypernym is-a constraints); “Trim is part of ModelTrim” (black triangles denote meronym/holonym is-part-of constraints); “Car has at most one Make” (the participation constraint 0:1 on Car for Make denotes that Car objects in car ads associate with Make names between 0 and 1 times, or “at most once”).

As currently implemented, however, OntoES cannot read in one language and write in another. This cross-linguistic ability to read in one language and then translate to and write in another language is the essence of our multilingual-oriented development. For example, we expect to be able to read the price in yen from a Japanese car ad and write “Price: $24,124” and to read the Kanji symbols for the make and write “Make: Mitsubishi”. To assure this level of functionality, we need to encode unit or currency conversion routines for values like Price and to encode cross-linguistic lexicons for named entities such as Make. In principle, encoding this cross-linguistic mapping is currently possible, but represents a fair amount of manual effort. We are currently finding ways to largely automate this mapping. In addition, we are adding two other capabilities to the system that will similarly enhance extraction and query processing: compound recognizers and patterns.

Compound recognizers allow OntoES to directly recognize ontological relationships beyond simple concepts. For a query like “Find Nissans for sale with years between 1995 and 2005.”, we need to recognize each of the years as well as the between constraint that relates them. Our previous work has implemented compound recognizers for operators in free-form queries [1], but we now seek to linguistically ground these types of ontological relationships.

Patterns will allow OntoES to identify and extract from structured text. For example, car ads often appear as a table with Price in one column, Year in another column, and Make and Model in a third column. Detecting patterns in documents will allow OntoES to apply specialized extraction rules and likely improve extraction accuracy. By extending our work with table patterns [8], we expect to fully exploit patterns in text.

2.2 Multilingual Mappings
We are extending in a principled way the cross-linguistic effectiveness of our OntoES system by adapting it for use in processing data-rich documents in languages other than English. Though the OntoES system was originally designed to handle English-language documents, it was implemented according to standard web-related software engineering principles and best practices: version control, integrated development environments, standardized data markup and encoding (XML, RDF, and OWL), Unicode character representation, and tractability (SWRL rules and Pellet-based reasoning). Consequently, we anticipate that internationalization of the system should be relatively straightforward, not requiring wholesale rewrites of crucial components. This should allow us to handle web pages in any language, given appropriate linguistic knowledge sources. Since OntoES does not need to parse out the grammatical structure of webpage text, only lower-level lexical (word-based) information is necessary for linguistic processing.

The system’s lexical knowledge is highly modular, with specific resources encoded as user-selectable lexicons. The information used to build up existing content for the English lexicons includes a mix of implicit knowledge and existing resources. Some lexicon entries were created by students during class and project work; other entries were developed from existing lexical resources (e.g. the US Census Bureau for personal names, the World Factbook for country names, Ethnologue for language names, etc.). We are developing analogous lexicons for other languages, and adapting OntoES as necessary to accommodate them in its processing. As was the case for English, this involves some hand-crafting of relevant material, as well as finding and converting existing data sources in other languages for targeted types of lexical information. Often this is relatively straightforward: for example, WordNet is a sizable and important component for English OntoES, and similar and compatible resources exist for other languages. However, we also need to rely on linguistic knowledge and experience to find, convert, and implement appropriate cross-linguistic lexical resources.

In the realm of cross-linguistic extraction systems, OntoES has a clear advantage. We claim that ontologies, which lie at the crux of our extraction approach, can serve as viable interlinguas. We are currently substantiating this claim. Since an ontology represents a conceptualization of items and relationships of interest (e.g. interesting properties of a car, information needed to set up a doctor’s appointment, etc.), a given ontology should be appropriate cross-linguistically with perhaps occasionally some slight cultural adaptation. For example, in our prior work on extraction from obituaries [5] we found that worldwide cultural and dialect differences were readily apparent even in English material. Certain terms for events like “tenth day kriya”, “obsequies”, and “cortege” were found only in English obituaries announcing events outside of America. Since our lexical resources serve as a “grounding” of the lowest-level concepts from ontologies with the lexical content of the web pages, substituting one language’s lexicon for another’s provides OntoES with a true cross-linguistic capability.

2.3 Ongoing Work
Our current work involves several separate but related tasks. We are locating annotated corpora in other languages amenable for evaluation purposes, and collecting and annotating interesting multilingual web material of our own. We are also developing prototype lexicons and recognizers for these target languages. Of course, our work requires us to develop and adapt prototype ontologies for target languages for sample concepts in data-rich domains.

In addition, we are enhancing extraction ontologies by enabling them to (1) explicitly discover and extract relationships among object instances of interest, and (2) discover patterns of interest from which they can more certainly identify and extract both object instances and relationship instances of interest. This involves devising, investigating, designing, coding, and evaluating algorithms for compound recognizers and for pattern discovery and patterned information extraction.

Finally, we are evaluating system performance using standard metrics and gold-standard annotated data.

3. CONCLUSION
Though our multilingual extraction work is an interesting effort in its own right, we expect it to also contribute to our larger effort to create a Web of Knowledge [7, 9]. Our research centers around resolving some of the tough technical issues involved in a community-wide effort to deploy the semantic web [16] and in concert with efforts at Yahoo!, Google, and elsewhere to extract information from the web and integrate it into community portals to enable community members to better discover, search, query, and track interesting community information [3, 10, 13]. Multilingual extraction ontologies have the far-reaching potential to play a significant role as semantic-web work finds its way into mainstream use in global communities.

4. REFERENCES
[1] M. Al-Muhammed and D. Embley. Ontology-based constraint recognition for free-form service requests. In Proceedings of the 23rd International Conference on Data Engineering (ICDE’07), pages 366–375, Istanbul, Turkey, 2007.
[2] P. Buitelaar, P. Cimiano, P. Haase, and M. Sintek. Towards linguistically grounded ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC’09), pages 111–125, Heraklion, Greece, 2009.
[3] P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building structured web community portals: A top-down, compositional, and incremental approach. In Proceedings of the 33rd Very Large Database Conference (VLDB’07), pages 23–28, Vienna, Austria, 2007.
[4] B. J. Dorr. Machine Translation: A view from the lexicon. MIT Press, Cambridge, MA, 1993.
[5] D. Embley, D. Campbell, Y. Jiang, S. Liddle, D. Lonsdale, Y.-K. Ng, and R. Smith. Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering, 31(3):227–251, 1999.
[6] D. Embley, D. Campbell, S. Liddle, and R. Smith. Ontology-based extraction and structuring of information from data-rich unstructured documents. In Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM’98), pages 52–59, Washington D.C., 1998.
[7] D. Embley, S. Liddle, D. Lonsdale, G. Nagy, Y. Tijerino, R. Clawson, J. Crabtree, Y. Ding, P. Jha, Z. Lian, S. Lynn, R. Padmanabhan, J. Peters, C. Tao, R. Watts, C. Woodbury, and A. Zitzelberger. A conceptual-model-based computational alembic for a web of knowledge. In Proceedings of the 27th International Conference on Conceptual Modeling (ER08), pages 532–533, 2008.
[8] D. Embley, C. Tao, and S. Liddle. Automating the extraction of data from HTML tables with unknown structure. Data & Knowledge Engineering, 54(1):3–28, 2005.
[9] D. Embley and A. Zitzelberger. Theoretical foundations for enabling a web of knowledge. In Proceedings of the 6th International Symposium on Foundations of Information and Knowledge Systems (FoIKS10), Sofia, Bulgaria, 2010.
[10] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, March/April 2009.
[11] L. Hunter, Z. Lu, J. Firby, W. B. Jr., H. Johnson, P. Ogren, and K. Cohen. OpenDMAP: An open source, ontology-driven, concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics, 9(8), 2008.
[12] K. Kishida. Technical issues of cross-language information retrieval: A review. Information Processing and Management: an International Journal, 41:433–455, 2005.
[13] R. Kumar, B. Pang, R. Ramakrishnan, A. Tomkins, P. Bohannon, S. Keerthi, and S. Merugu. A web of concepts. In Proceedings of the 2009 Symposium on Principles of Database Systems, pages 1–12, Providence, RI, 2009.
[14] D. Lonsdale, D. W. Embley, Y. Ding, L. Xu, and M. Hepp. Reusing ontologies and language components for ontology generation. Data & Knowledge Engineering, 69:318–330, 2010.
[15] C. Tao, D. Embley, and S. Liddle. FOCIH: Form-based ontology creation and information harvesting. In Proceedings of the 28th International Conference on Conceptual Modeling (ER 2009), pages 346–359, Gramado, Brazil, 2009.
[16] W3C (World Wide Web Consortium) Semantic Web Activity Page. http://www.w3.org/2001/sw/.


