=Paper= {{Paper |id=None |storemode=property |title=Modeling Wordlists via Semantic Web |pdfUrl=https://ceur-ws.org/Vol-571/paper7.pdf |volume=Vol-571 |dblpUrl=https://dblp.org/rec/conf/www/PoornimaG10 }} ==Modeling Wordlists via Semantic Web== https://ceur-ws.org/Vol-571/paper7.pdf
        Modeling Wordlists via Semantic Web Technologies

                        Shakthi Poornima                                                    Jeff Good
                    Department of Linguistics                                       Department of Linguistics
              State University of New York at Buffalo                         State University of New York at Buffalo
                        Buffalo, NY USA                                                 Buffalo, NY USA
                     poornima@buffalo.edu                                            jcgood@buffalo.edu


ABSTRACT
We describe an abstract model for the traditional linguistic wordlist and provide an instantiation of the model in RDF/XML intended to be usable both for linguistic research and machine applications.

Categories and Subject Descriptors
E.2 [Data]: Data Storage Representations

Keywords
wordlists, interoperation, RDF

Copyright is held by the author/owner(s).
WWW2010, April 26-30, 2010, Raleigh, North Carolina.

1.   INTRODUCTION
  Lexical resources are of potential value to both traditional descriptive linguistics and computational linguistics.¹ However, the kinds of lexicons produced in the course of linguistic description are not typically easily exploitable in natural language processing applications, despite the fact that they cover a much larger portion of the world’s languages than lexicons specifically designed for NLP applications. In fact, one particular descriptive linguistic product, a wordlist, can be found for around a third to a half of the world’s seven thousand or so languages, though wordlists have not played a prominent role in NLP to the best of our knowledge.

¹ Funding for the work described here has been provided by NSF grant BCS-0753321 in the context of a larger-scale project, Lexicon-Enhancement via the Gold Ontology, headed by researchers at the Institute for Language Information and Technology at Eastern Michigan University. More information can be found at http://linguistlist.org/projects/lego.cfm.

  Wordlists are widely employed by descriptive linguists as a first step towards the creation of a dictionary or as a means to quickly gather information about a language for the purposes of language comparison (especially in parts of the world where languages are poorly documented). Because of this, they exist for many more languages than do full lexicons. While the lexical information they contain is quite sparse, they are relatively consistent in their structure across resources. As we will see, this makes them good candidates for exploitation in the creation of a multilingual database consisting of rough translational equivalents that lacks precision but has coverage well beyond what would otherwise be available.
  This paper describes an effort to convert around 2700 wordlists covering more than 1500 languages (some wordlists represent dialects) and close to 500,000 forms into an RDF format to make them more readily accessible in a Semantic Web context.² This may well represent the largest single collection of wordlists anywhere and certainly represents the largest collection in a standardized format. While the work described here was originally conceived to support descriptive and comparative linguistics, we will argue that the use of Semantic Web technologies has the additional beneficial effect of making these resources more readily usable in other domains, in particular certain NLP applications.

² These wordlists were collected by Timothy Usher and Paul Whitehouse in the context of traditional comparative linguistic research, and represent an enormous effort without which the work described here would not have been possible.

  We approach this work as traditional, not computational, linguists, and our current goal is not to encode the available materials with new information but rather to transfer the information they contain into a more exploitable format. Semantic Web technologies allow us to represent traditional linguistic data in a way we believe remains faithful to the original creator’s conception and, at the same time, to produce a resource that can serve purposes for which it was not originally intended (e.g., simplistic kinds of translation). Our work, therefore, indicates that the Semantic Web offers a promising approach for representing the work of descriptive linguists in ways of use to computational linguists.

2.     MODELING A WORDLIST
  We illustrate the basic structure of a wordlist in (1), which gives a typical presentation format. Here, the language being described is French, with English labels used to index general meanings.

(1)     man       homme
        woman     femme

  The information encoded in a wordlist is quite sparse. In general, wordlists give no indication of morphosyntactic features (e.g., part of speech), nor of fine-grained semantics. Meanings are most usually indexed simply by the use of labels drawn from languages of wider communication (e.g., English or Spanish), though the intent is not to translate between languages but, rather, to find the closest semantic
match in the target language for what is presumed to be a general concept. The notional relationship between a meaning and a form in a wordlist is not one of defining (as is the case in a monolingual dictionary) or translating (as is the case in a bilingual dictionary), but rather something we term counterpart, following [1]. This is not a particularly precise relation, but it is not intended to be. Specifying too much would make it difficult to collect wordlists rapidly, which is otherwise one of their most desirable features.
   The concepts that one sees in traditional linguistic wordlists have often been informally standardized across languages and projects through the use of what we call here concepticons. Concepticons are curated sets of concepts, minimally indexed via words from one language of wider communication but, perhaps, also described more elaborately using multiple languages (e.g., English and Spanish) as well as illustrative example sentences. They may include concepts of such general provenance that counterparts would be expected to occur in almost all languages, such as "to eat", or concepts only relevant to a certain geographical region or language family. For instance, Amazonian languages do not have words for "mosque", and Siberian languages do not have a term for "toucan" [1, p.5-6].
   To the extent that the same concepticon can be employed across wordlists, it can be understood as a kind of interlingua, though it is not usually conceptualized as such by descriptive linguists. The concepticon we are employing is based on three available concept lists. The most precise and recently published list is that of the Loanword Typology (LWT) project [1], which consists of around 1400 entries.

3.    WORDLISTS AND SEMANTIC WEB
   Each wordlist in our RDF datanet consists of two components: metadata and a set of entries. The metadata gives relevant identifying information for the wordlist, e.g., a unique identifier, the ISO 639-3 code, the related Ethnologue language name, alternate language names, reference(s), the compilers of the wordlist, etc. The entries set consists of all entries in the wordlist. The structure of our entries is quite simple: a reference to an external entry in the concepticon employed by our project is paired with a form in the target language using the counterpart relationship discussed above. Obviously, this structure could be elaborated. However, it is sufficient for this first stage of the work and, we believe, serves as an appropriate baseline for further specification.
   In cases where there is more than one form attached to a concept, we create a separate concept-form mapping for each form. For instance, the entry in (2) from a wordlist of North Asmat, a language spoken in Indonesia, associates the concept "grandfather" with two counterparts, whose relationship to each other has not been specified in our source.

(2)   grandfather: -ak, afak

   An RDF/XML fragment describing one of the two forms in (2) is given in Figure 1 for illustrative purposes. In addition to drawing on standard RDF constructs, we also draw on descriptive linguistic concepts from GOLD³ (General Ontology for Linguistic Description), which is intended to be a sharable ontology for language documentation and description. The key data encoded by our RDF representation of wordlists is the counterpart mapping between a particular wordlist concept (lego:concept) drawn from our concepticon and a form (gold:formUnit) found in a given wordlist.

³ http://linguistics-ontology.org/. Similar ontologies such as SKOS could also be used in lieu of GOLD.

[RDF/XML listing not preserved in this copy; the only surviving content is the form value -ak]
Figure 1: Wordlist Entry RDF Fragment

   An important feature of our RDF model, illustrated in Figure 1, is that the counterpart relation does not relate a meaning directly to a form but rather to a linguistic sign (gold:LinguisticSign) whose form feature then contains the relevant specification. This structure would allow additional information (e.g., part of speech, definition, example) about the lexical element specified by the given form to be added to the representation at the level of the linguistic sign, if it were to become available.

4.     PROSPECTS
   The data model described here was originally designed to promote lexical data interoperability for descriptive linguistic purposes. At the same time, it makes visible the similarities between a concepticon and an interlingua, thus opening up the possibility of straightforward exploitation in NLP contexts of a data type produced in a descriptive linguistic context. Furthermore, because the model is expressed as an RDF graph rather than in a more parochial XML format, it can be more easily processed. Potential NLP applications for this datanet involve tasks where simple word-to-word mapping across languages may be useful. One such example is the PanImages⁴ search of the PanLex project, which facilitates cross-lingual image searching. More work could be done to promote interoperability, of course. For example, we could devise an LMF [2] expression of our model, though we leave this for the future.

⁴ http://www.panimages.org/

5.     REFERENCES
[1] In M. Haspelmath and U. Tadmor (eds.), Loanwords in the world’s languages: A comparative handbook. 2009.
[2] G. Francopoulo, et al. Multilingual resources for NLP in the lexical markup framework (LMF). Language Resources and Evaluation, 43:57–70, 2009.
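
The entry structure described in Section 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reconstruction of Figure 1 or of the project's actual RDF/XML: the class name `Wordlist`, the function `to_triples`, the `ex:` property names, the identifier value, and the placeholder language code are all hypothetical. Only `lego:concept`, `gold:LinguisticSign`, and `gold:formUnit` come from the text. The sketch shows the two points the section makes: each attested form gets its own concept-form mapping, and the counterpart relation points to a linguistic sign rather than directly to a form.

```python
from dataclasses import dataclass, field


@dataclass
class Wordlist:
    """A wordlist: identifying metadata plus a set of entries."""
    identifier: str       # unique identifier (placeholder value below)
    iso_639_3: str        # ISO 639-3 code of the language
    language_name: str
    # Each raw entry pairs a concepticon label with one or more forms.
    entries: list = field(default_factory=list)


def to_triples(wordlist):
    """Emit one concept -> sign -> form chain per attested form.

    The "ex:" property names are hypothetical stand-ins for the
    counterpart relation and the form feature described in the text.
    """
    triples = []
    counter = 0
    for concept, forms in wordlist.entries:
        concept_node = f"lego:{concept}"
        triples.append((concept_node, "rdf:type", "lego:concept"))
        for form in forms:
            counter += 1
            sign = f"_:sign{counter}"   # blank node for the linguistic sign
            unit = f"_:form{counter}"   # blank node for the form unit
            triples += [
                (concept_node, "ex:counterpart", sign),
                (sign, "rdf:type", "gold:LinguisticSign"),
                (sign, "ex:form", unit),
                (unit, "rdf:type", "gold:formUnit"),
                (unit, "ex:string", form),
            ]
    return triples


north_asmat = Wordlist(
    identifier="wordlist-0001",   # hypothetical identifier
    iso_639_3="und",              # placeholder ("undetermined"), not the real code
    language_name="North Asmat",
    entries=[("grandfather", ["-ak", "afak"])],   # example (2)
)

triples = to_triples(north_asmat)
for triple in triples:
    print(triple)
```

Because the sign is a node of its own, further properties such as part of speech could later be attached to `_:sign1` without touching the concept or the form.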




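
The simple word-to-word mapping mentioned in Section 4 follows from treating the concepticon as a pivot: two forms are linked when their wordlists attach them to the same concept. The sketch below illustrates this lookup under stated assumptions. The function names `build_index` and `counterparts` are hypothetical, the French forms come from example (1), and the Spanish forms are supplied here purely for illustration and are not taken from the wordlist collection. The result is a set of rough counterparts, not translations in the strict sense.

```python
from collections import defaultdict


def build_index(wordlists):
    """Map each concept to {language: [forms]} across all wordlists."""
    index = defaultdict(dict)
    for language, entries in wordlists.items():
        for concept, forms in entries.items():
            index[concept][language] = list(forms)
    return index


def counterparts(form, source, target, index):
    """Return target-language forms that share a concept with `form`."""
    results = []
    for by_language in index.values():
        if form in by_language.get(source, []):
            results.extend(by_language.get(target, []))
    return results


wordlists = {
    "French": {"man": ["homme"], "woman": ["femme"]},      # from example (1)
    "Spanish": {"man": ["hombre"], "woman": ["mujer"]},    # illustrative only
}

index = build_index(wordlists)
print(counterparts("homme", "French", "Spanish", index))
```

A form with no shared concept simply yields an empty list, which reflects the sparse and uneven coverage that the paper attributes to wordlists.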