From language documentation data to LLOD: A case study in Turkic lemon dictionaries

Christian Chiarcos, Désirée Walther, Maxim Ionov

Applied Computational Linguistics, Goethe-Universität, Robert-Mayer-Straße 11-15, 60325 Frankfurt, Germany

Keywords: Lemon-OntoLex model, Turtle, Turkic languages, SPARQL

In this paper, we describe the Lemon-OntoLex modeling of dictionaries created within language documentation efforts. We focus on exemplary resources for two less-resourced languages from the Turkic language family, Chalkan and Tuvan. Both datasets have been converted into a Linked Data representation using the Lemon-OntoLex data model, with an extensible converter written in Python. We compare the conversion process for the two lexical resources, analyze the difficulties we encountered, and discuss the cases that caused the most common problems. Furthermore, we evaluate the quality of the converted dictionaries using specially designed SPARQL queries and by manually checking random samples of the data. Finally, we describe the future application of this data within a lexicographic-comparative workbench designed to facilitate language contact studies.

Background and Motivation

Linguistic Linked Open Data (LLOD) has become popular in the language resource community in recent years, with particular success in the area of machine-readable dictionaries, where the Lemon-OntoLex model has now been widely adopted. While Linked Open Data has also been receiving substantial attention in language documentation and typology for a considerable period [6,8,13,7], the application of Lemon-OntoLex has rarely been discussed in this context so far. Here, we describe the application of Lemon-OntoLex to dictionary data from two exemplary low-resource languages from the Turkic language family.

The research described in this paper is conducted as part of the BMBF-funded Research Group "Linked Open Dictionaries (LiODi)" (2015-2020) at the Goethe-Universität Frankfurt, Germany. Our activities focus on uses of Linked Data to facilitate the integration of data across different dictionaries, or between dictionaries and corpora. As a cooperation between researchers from natural language processing and empirical linguistics, LiODi aims at developing methodologies and algorithmic solutions to facilitate research on comparative lexicography in the context of linguistic, cultural, sociological and historical studies. In particular, we develop a workbench that facilitates the cross-linguistic search for semantically and/or phonologically related words in various languages. LiODi is a joint effort of the Applied Computational Linguistics (ACoLi) lab at the Institute of Computer Science and the Institute of Empirical Linguistics at Goethe University Frankfurt, Germany, with a focus on Turkic languages (pilot phase, 2015-2016) and on languages of the Caucasus (main phase, 2017-2020), along with selected contact languages.

Dictionaries are one main type of data in the project, and the conversion of an etymological dictionary of the Turkic languages has previously been described by [2]. Here, this approach is extended to another category of lexical data: dictionaries and word lists created as part of language documentation efforts for two Turkic languages, Chalkan and Tuvan. By means of an implemented converter, lexical language documentation data exported as XML are converted into an RDF representation based on the Lemon-OntoLex model. We then assess to what extent a generic converter for lexical resources is feasible and how much effort is needed to extend the converter to a new resource. Finally, evaluation steps are carried out using SPARQL queries on a SPARQL endpoint as well as a random check of the output data.

Data

The lexical resources described in this paper are part of the RELISH project¹ [5] and can be accessed with the LEXUS tool² [12], which allows exporting them in an XML format following the LEXUS schema.

The Chalkan language³ is a variety of Northern Altai, a Turkic language spoken in South Siberia [11, p. 67]. Northern Altai has no fixed literary norm and many different dialects, which are mainly spoken in rural areas. It is heavily influenced by Russian. As a result, there are strong differences in grammaticalization between the written and the spoken language [11, p. 74].

The Tuvan language also belongs to the Turkic language family. It is spoken in the Republic of Tyva, a part of the Russian Federation, to the southwest of the Tofalaria region in Siberia. It is also spoken by some Mongol speakers [4, p. 2]. Like other languages of the Altai region, Tuvan has been strongly influenced by Russian, and it also shows strong Mongolian influence. Initially, Tuvan orthography was based on the Latin alphabet, but since 1943 it has been written only in Cyrillic [3].

Modeling

The modeling is based on the Lemon-OntoLex model. Additionally, the vocabularies SKOS [10], LexInfo [9] and RDFS are used. Furthermore, a new namespace is created, which contains the lexical entries, forms, senses and concepts.

According to the Lemon-OntoLex model, Form, Lexical Sense, Lexical Entry, Lexical Concept and Concept Set are represented as classes. The lexical form contains the written and the phonetic representation. The lexical sense only carries usage information, expressed as a SKOS note, if this information is present. The meaning itself is factored out into a lexical concept, a decision that also enables semantic search over a set of lexical concepts. Thus, words with the same meaning refer to a common concept. This concept refers not only to the relevant lexical entries, but also to the corresponding lexical senses. The core of the Lemon-OntoLex model was extended by homonym relations between lexical entries. We represent homonym relations with the LexInfo vocabulary rather than in other possible ways (e.g. using the Lexico-Semantic Relations of the vartrans module) because we were already using LexInfo to represent parts of speech, and our goal was to use as few vocabularies as possible.
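The data model described above can be illustrated with a minimal Python sketch that emits Turtle for a single entry. This is not the converter itself: the `ontology` namespace IRI, the identifier layout, the function names, and the placeholder word are assumptions made for illustration, and the choice of `ontolex:canonicalForm` and `skos:note` follows our reading of the description rather than the published code.

```python
# Namespace IRIs for OntoLex, SKOS and LexInfo are the published ones;
# the "ontology" IRI is a hypothetical stand-in for the project namespace.
PREFIXES = {
    "ontolex": "http://www.w3.org/ns/lemon/ontolex#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "lexinfo": "http://www.lexinfo.net/ontology/2.0/lexinfo#",
    "ontology": "http://example.org/ontology/",
}


def literal(text, lang):
    """Return a language-tagged Turtle literal with basic escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_triples(entry_id, written, lang, pos, definition, concept_id,
                  phonetic=None, usage=None):
    """Build entry, form, sense and concept triples for one dictionary entry."""
    entry = f"ontology:{entry_id}"
    form = f"ontology:{entry_id}_form"    # assumed naming scheme
    sense = f"ontology:{entry_id}_sense"  # assumed naming scheme
    concept = f"ontology:{concept_id}"
    triples = [
        (entry, "a", "ontolex:LexicalEntry"),
        (entry, "lexinfo:partOfSpeech", f"lexinfo:{pos}"),
        (entry, "ontolex:canonicalForm", form),
        (form, "a", "ontolex:Form"),
        (form, "ontolex:writtenRep", literal(written, lang)),
        (entry, "ontolex:sense", sense),
        (sense, "a", "ontolex:LexicalSense"),
        (sense, "ontolex:isLexicalizedSenseOf", concept),
        (concept, "a", "ontolex:LexicalConcept"),
        (concept, "skos:definition", literal(definition, "en")),
        (concept, "ontolex:isEvokedBy", entry),
    ]
    if phonetic is not None:
        triples.append((form, "ontolex:phoneticRep", literal(phonetic, lang)))
    if usage is not None:
        # Usage information lives on the sense as a SKOS note.
        triples.append((sense, "skos:note", literal(usage, "en")))
    return triples


def to_turtle(triples):
    """Serialize triples as Turtle with a prefix header."""
    header = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    body = [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(header + [""] + body)


# "placeholder" is not a real Tuvan word; "tyv" is the ISO 639-3 code for Tuvan.
example = entry_triples("tuvan_0001_0", "placeholder", "tyv", "noun",
                        "hunting", "concept_tuvan_0001")
print(to_turtle(example))
```

Because the definition is attached to the concept rather than to the sense, two entries with the same meaning can share one concept simply by passing the same `concept_id`.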

Implementation

The main goal of the converter is to implement the modeling decisions for the Chalkan lexical resource and to reuse them for the Tuvan data. The extension to Tuvan proved demanding because the two lexical resources, both in XML format, differ in their features and element hierarchies.

In the Chalkan lexical resource, lexical entries have to be split, since several parts of speech were originally assigned to a single lexical entry. A particular challenge is mapping the abbreviated parts of speech to the correct terms in the LexInfo vocabulary. We keep these abbreviations in a dictionary so that they can be reused for other lexical resources. Unlike a Tuvan entry, however, a Chalkan entry always has exactly one lexical form.
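The splitting step can be sketched in Python as follows. The abbreviation table, the `_0`, `_1`, ... suffix scheme, and the function name are hypothetical; only the general idea (one new entry per part of speech, all of them linked pairwise by homonym relations) is taken from the description above.

```python
# Hypothetical abbreviation table; the abbreviations used in the source
# dictionaries are not listed in the paper. The targets are LexInfo terms.
POS_ABBREVIATIONS = {
    "n": "noun",
    "v": "verb",
    "adj": "adjective",
    "adv": "adverb",
}


def split_by_pos(entry_id, pos_abbreviations):
    """Split one source entry into one entry per part of speech.

    Returns (entries, homonyms): `entries` maps each new entry id to a
    LexInfo part-of-speech term, and `homonyms` lists directed
    (entry, other_entry) pairs, so every new entry is related to all others.
    """
    entries = {}
    for index, abbreviation in enumerate(pos_abbreviations):
        if abbreviation not in POS_ABBREVIATIONS:
            # Fail loudly instead of silently dropping information.
            raise ValueError(f"unknown part-of-speech abbreviation: {abbreviation!r}")
        entries[f"{entry_id}_{index}"] = f"lexinfo:{POS_ABBREVIATIONS[abbreviation]}"
    ids = list(entries)
    homonyms = [(a, b) for a in ids for b in ids if a != b]
    return entries, homonyms


entries, homonyms = split_by_pos("chalkan_42", ["n", "v", "adj", "adv"])
# An entry split four ways yields three homonym relations per new entry.
print(len(entries), sum(1 for a, _ in homonyms if a == "chalkan_42_0"))
```

Keeping the abbreviation table separate from the splitting logic is what allows it to be reused, or swapped out, for another lexical resource.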

Based on these experiences, a fully generic converter can only be implemented by ignoring resource-specific exceptions, which leads to a loss of information. For this reason, abbreviations and lexicon-specific exceptions are implemented explicitly.

Splitting lexical entries, creating several forms from lexeme chains, and generating different lexical senses, concepts and homonym relations for a single entry make the converter multifaceted, precisely because the two lexical resources differ in their features. For this reason, the converter cannot be applied to a new lexicon right away, but its modular architecture makes it easy to adapt it to the specifics of a new resource.

Above all, modularizing the individual modeling steps provides a good overview, makes the converter easily extensible, and allows existing modules to be reused. The converter automatically recognizes the lexical resource, and the namespaces required for the output file are encapsulated, which keeps them manageable.

Evaluation

Important facts about the output files

The Chalkan dataset was converted to 3 165 lexical entries. Splitting entries produced 208 additional entries, which are linked to each other by homonym relations, so that the resulting dataset contains 3 373 lexical entries in total. An original lexical entry was split into between two and four entries; with four entries, each of them has the maximum of three homonym relations.

The Chalkan data is modeled with 68 286 triples, which offer many access points for search queries. In particular, parts of speech are represented by terms from the LexInfo vocabulary [9] instead of the original abbreviations.

The Tuvan data, with its 7 482 lexical entries, has some peculiarities. After the initial conversion, several written representations contained chains of different lexemes separated by commas, which led to their assignment to the same entry. The longest lexeme chain contains eleven lexemes. Where a lexical entry had to be split, it was split into two entries, which are linked to each other by a homonym relation. A total of 17 additional lexical entries were generated during the modeling of the Tuvan data, bringing the total number of lexical entries to 7 499. The data is represented by 136 580 triples. Almost all modeling decisions for Chalkan could be applied to Tuvan; lexeme chains are among the exceptions.
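Separating a lexeme chain into individual forms can be sketched as below. The function names, the `_form_N` identifier scheme and the placeholder strings are assumptions; the sketch uses the generic `ontolex:lexicalForm` property because the paper does not state which form, if any, is treated as canonical.

```python
def split_lexeme_chain(written_rep):
    """Split a comma-separated lexeme chain into its individual lexemes."""
    return [part.strip() for part in written_rep.split(",") if part.strip()]


def form_triples(entry_id, written_rep, lang):
    """Create one ontolex:Form per lexeme in the chain, all on the same entry."""
    triples = []
    for index, lexeme in enumerate(split_lexeme_chain(written_rep)):
        form = f"ontology:{entry_id}_form_{index}"  # hypothetical naming scheme
        triples.append((f"ontology:{entry_id}", "ontolex:lexicalForm", form))
        triples.append((form, "a", "ontolex:Form"))
        triples.append((form, "ontolex:writtenRep", f'"{lexeme}"@{lang}'))
    return triples


# Placeholder strings, not real Tuvan data; "tyv" is the ISO 639-3 code for Tuvan.
chain = "lexeme1, lexeme2,lexeme3"
print(split_lexeme_chain(chain))
print(len(form_triples("tuvan_0002_0", chain, "tyv")))
```

An entry without commas passes through unchanged as a single form, so the same function can be applied to every written representation.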

Conversion evaluation

Since two dictionaries are available as a result, queries do not have to be restricted to one lexicon but can address both lexicons.

After looking into the lexical resources, the question arises whether there are different entries that have the same definition but have not yet been merged into one lexical concept. In that case, the same meaning is stored separately for each entry. We investigate the example "hunting", which is frequent in these languages for cultural reasons. If the SPARQL query shown in Fig. 2 returns results, the concepts concerned can be merged, so that the lexical entries refer to the same lexical concept.

    [...]
      ?concept skos:definition "hunting"@en ;
               skos:definition ?definition .
      ?concept ontolex:isEvokedBy ?entry .
    }

Fig. 2. Checks whether "hunting" is associated with multiple concepts as well as entries.

    entry                    concept                          definition
    ontology:tuvan...4cb8_0  ontology:concept_tuvan...c94cca  "hunting"@en
    ontology:tuvan...2a00_0  ontology:concept_tuvan...452a10  "hunting"@en

Fig. 3. Output of query in Fig. 2.

The result of the query (Figure 3) shows that two lexical entries with two different lexical concepts for the definition "hunting" exist in the Tuvan dictionary.
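The same check can be run over query results for all definitions at once. The following Python sketch groups (entry, concept, definition) rows by definition and proposes a merge target for every duplicated concept; the function names and the rule of keeping the alphabetically first concept are assumptions, and the rows reproduce the abbreviated identifiers from Fig. 3 as printed.

```python
from collections import defaultdict


def find_unmerged_concepts(rows):
    """Return definitions that are attached to more than one concept."""
    concepts_by_definition = defaultdict(set)
    for _entry, concept, definition in rows:
        concepts_by_definition[definition].add(concept)
    return {
        definition: sorted(concepts)
        for definition, concepts in concepts_by_definition.items()
        if len(concepts) > 1
    }


def merge_map(unmerged):
    """Map every duplicate concept to the first concept of its group."""
    mapping = {}
    for concepts in unmerged.values():
        target = concepts[0]
        for concept in concepts[1:]:
            mapping[concept] = target
    return mapping


# Rows as shown in Fig. 3; the "..." elisions are kept as in the figure.
rows = [
    ("ontology:tuvan...4cb8_0", "ontology:concept_tuvan...c94cca", '"hunting"@en'),
    ("ontology:tuvan...2a00_0", "ontology:concept_tuvan...452a10", '"hunting"@en'),
]
unmerged = find_unmerged_concepts(rows)
print(unmerged)
print(merge_map(unmerged))
```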

In addition, it was examined whether a word in a language can have several meanings, which was the case in Chalkan.

For the definition of "father", it was examined whether the spelling is the same or similar in both dictionaries. In fact, there were even two matches, reflecting the similarity of both languages.

Additionally, a sample-based check could be carried out because the IDs of the lexical resource were transferred to the output file: the original identifiers from the lexical resource are used as part of the URI of each lexical entry, form and sense. Using these IDs, forms, senses and lexical entries could be drawn by random sampling and checked for correctness against the source.
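A sample-based check of this kind might be organized as in the following sketch. The base IRI, the URI layout and the function names are hypothetical and do not reproduce the URIs of the published data; the point is only that a source ID embedded in a URI can be recovered and looked up in the original XML.

```python
import random
from urllib.parse import quote, unquote

BASE = "http://example.org/ontology/"  # hypothetical namespace


def make_uri(resource, kind, source_id):
    """Embed the original source ID in the URI of an entry, form or sense."""
    return f"{BASE}{resource}/{kind}/{quote(source_id, safe='')}"


def source_id_of(uri):
    """Recover the original source ID from a URI built by make_uri."""
    return unquote(uri.rsplit("/", 1)[1])


def draw_sample(uris, size, seed=0):
    """Draw a reproducible random sample of URIs for manual checking."""
    rng = random.Random(seed)
    pool = sorted(uris)
    return rng.sample(pool, min(size, len(pool)))


uris = [make_uri("tuvan", "entry", f"id-{n}") for n in range(100)]
for uri in draw_sample(uris, 5):
    # Each sampled ID can now be looked up in the source XML by hand.
    print(source_id_of(uri), uri)
```

Fixing the seed makes the sample reproducible, so a second reviewer can inspect exactly the same items.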

Application

The primary goal of the LiODi project is to develop a workbench and associated methodologies to facilitate language contact studies in Eurasia and the Caucasus area. The Comparative-Lexicographical Workbench (Fig. 4) provides novel search functionalities that extend those of existing platforms, namely form-based search and gloss-based (meaning-based) search, currently applied to the Turkic language family and its contact languages. In form-based search, given a lexeme in a particular language, say, Chalkan, and a set of related languages, say, the Turkic languages in general, the system retrieves phonologically similar lexemes for the respective target languages.

Both search functionalities aim to detect candidate cognates. The data provided by Starling represents a gold standard, but can also be directly integrated into the search process:

In Fig. 5, we query for Chalkan ana and possible cognates from Turkic (as an inherited word) or Mongolic (as a possible source of loan words). The results are organized according to the taxonomic status of the varieties in www.multitree.org. They include a gloss from a Chalkan dictionary (marked by subscript C), but in addition provide form-based matches (subscript +) from the Starling dictionaries (S), e.g., with Turkish ana and its etymologically corresponding forms.

A prototype of this workbench is available.⁴ Albeit still limited in coverage and functionality, it illustrates a core strength of the Lemon-OntoLex-based approach: given a number of bilingual dictionaries, we can use identical SPARQL fragments to retrieve word lists over which a transitive closure can then be calculated.⁵ This is illustrated in Fig. 5 with a screenshot of the workbench prototype for the Chalkan word küski and the corresponding SPARQL query,⁶ with corresponding properties in different Lemon dictionaries being highlighted.
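The transitive closure over word lists can be sketched in Python as a breadth-first traversal of translation pairs. This is an illustration under assumptions, not the workbench code: the function name is hypothetical, and all glosses except the Chalkan query word küski are placeholders rather than real dictionary data.

```python
from collections import deque


def transitive_translations(start, pairs):
    """Return every (language, word) node reachable from `start`
    by following translation pairs across dictionaries."""
    graph = {}
    for source, target in pairs:
        graph.setdefault(source, set()).add(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


# Word pairs as they might be retrieved from three bilingual dictionaries.
pairs = [
    (("chalkan", "küski"), ("russian", "RU_GLOSS")),   # Chalkan-Russian
    (("russian", "RU_GLOSS"), ("english", "EN_GLOSS")),  # Russian-English
    (("english", "EN_GLOSS"), ("turkic", "TRK_FORM_1")),  # English-Turkic
    (("english", "EN_GLOSS"), ("turkic", "TRK_FORM_2")),
    (("english", "OTHER"), ("turkic", "TRK_FORM_3")),    # unrelated pair
]
reachable = transitive_translations(("chalkan", "küski"), pairs)
candidates = sorted(word for lang, word in reachable if lang == "turkic")
print(candidates)
```

Tagging every word with its language keeps homographs in different languages apart, which matters once several dictionaries share a pivot language.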

The development of Lemon towards a community standard is still in progress, even though the publication of the W3C community report in May 2016⁷ set a stable milestone. While many early adopters of Lemon still use older specifications or resource-specific extensions (e.g., DBnary), it is expected that these will eventually converge towards the current specification. For example, DBnary still uses the older Lemon-Monnet model at the time of writing, but is currently being migrated to Lemon-OntoLex (Gilles Sérasset via the OntoLex mailing list, 2017-03-10). In the course of this process, we will eventually be able to apply the same SPARQL property path to internal and external resources to retrieve bilingual word pairs that may then be the basis for online transitive search across dictionaries.

Even more so, Lemon (together with the SPARQL concept of federation with the keyword SERVICE) allows us to query remote data sources directly. However, querying a local graph is normally the more scalable solution.

Summary and Outlook

We described the application of Lemon-OntoLex to two dictionaries compiled in the context of language documentation efforts, an area where the application of Lemon-OntoLex has rarely been discussed before. We focused on exemplary resources for two less-resourced languages from the Turkic language family, Chalkan and Tuvan. Both datasets have been converted into a Linked Data representation using the Lemon-OntoLex data model, with an extensible converter written in Python. Finally, their application within a comparative-lexicographical workbench has been described, where Linked Data makes it possible to formulate transitive queries over dictionaries from entire language families.

A proof-of-principle implementation of this workbench is currently available (http://dbserver.acoli.cs.uni-frankfurt.de:5000/search/?query=%D0%B0%D1%80%D1%8B&originLang=&targetLang=trk). Using the Chalkan-Russian data described in this paper, the Russian-English DBnary⁸ and the English-Turkic etymological dictionary described in [1], it performs a transitive search for cognate candidates across two pivot languages (Russian and English): Lemon-based transitive sense links yield semantically corresponding forms, and the result set is ordered according to phonological (graphological) similarity with the requested word per sense. Top-level matches are thus the most likely cognate candidates. For a limited number of small dictionaries with a few thousand words, such as those created as part of the language documentation efforts described here, our vanilla system is able to perform an effective online search without any further optimization. With more data being produced as part of the project, scalability issues are a likely area of future studies.
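Ordering a result set by graphological similarity can be sketched as follows. The paper does not name the similarity measure used by the workbench, so normalized Levenshtein distance is an assumption here, as are the function names and all candidate strings other than the query form ana.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize edit distance to a score between 0 (different) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank_candidates(query, candidates):
    """Order candidate forms by decreasing similarity to the query form."""
    return sorted(candidates, key=lambda form: similarity(query, form), reverse=True)


# "ane" and "xyz" are made-up strings used only to show the ordering.
print(rank_candidates("ana", ["xyz", "ana", "ane"]))
```

A purely string-based measure of this kind compares spellings, not sounds, so forms written in different scripts would need to be transliterated to a common alphabet first.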

With the publication of this paper, the converter and the Chalkan data will be published under open licenses. For the Tuvan data, we are still waiting for legal clearance, but the original XML that serves as the basis for the conversion is publicly available from The Language Archive.⁹ With the converter provided, its Linked Data edition can be recreated locally.

Fig. 1. Common data model of Tuvan and Chalkan based on the Lemon-OntoLex model.

Fig. 4. Design study: Form-based search in the Comparative-Lexicographical Workbench.

Fig. 5. Transitive query for Chalkan küski via Russian-English DBnary to the Starling Turkic etymological dictionary: (a) Workbench visualization, (b) SPARQL query (slightly simplified).

1 https://tla.mpi.nl/relish/
2 https://tla.mpi.nl/tools/tla-tools/older-tools/lexus/
3 Sometimes the language name is spelled Chelkan.
4 http://dbserver.acoli.cs.uni-frankfurt.de:5000/
5 At the moment, this is done on the fly; with greater amounts of data in the system, optimizations will become necessary.
6 Note that this query has been slightly simplified with respect to prefix declarations, matching typed and untyped strings, and different Lemon namespaces.
7 https://www.w3.org/2016/05/ontolex/
8 http://kaiko.getalp.org/sparql

Acknowledgments

The research described in this paper was conducted in the project 'Linked Open Dictionaries (LiODi, 2015-2020)', funded by the German Ministry for Education and Research (BMBF) as an Early Career Research Group on eHumanities.

9 https://tla.mpi.nl/relish/

Bibliography

[1] Frank Abromeit and Christian Fäth. Linking the tower of babel: Modelling a massive set of etymological dictionaries as RDF. In LDL 2016 5th Workshop on Linked Data in Linguistics: Managing, Building and Using Linked Language Resources, page 11, 2016.

[2] Frank Abromeit, Christian Chiarcos, Christian Fäth, and Maxim Ionov. Linking the Tower of Babel: Modelling a massive set of etymological dictionaries as RDF. In John P. McCrae, Christian Chiarcos, Elena Montiel Ponsoda, Thierry Declerck, Petya Osenova, and Sebastian Hellmann, editors, Proceedings of the 5th Workshop on Linked Data in Linguistics (LDL-2016): Managing, Building and Using Linked Language Resources, Portoroz, Slovenia, 2016.

[3] Simon Ager. Tuvan. 2016.

[4] G. D. S. Anderson. Auxiliary Verb Constructions in Altai-Sayan Turkic. Turcologica Series. Harrassowitz, 2004.

[5] Helen Aristar-Dry, Sebastian Drude, Menzo Windhouwer, Jost Gippert, and Irina Nevskaya. "Rendering endangered lexicons interoperable through standards harmonization": the RELISH project. In LREC 2012: 8th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), 2012.

[6] Scott Farrar and William D. Lewis. The GOLD community of practice: An infrastructure for linguistic data on the web. Language Resources and Evaluation, 41(1), 2007.

[7] Robert Forkel. The Cross-Linguistic Linked Data project. In 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing, Reykjavik, Iceland, May 2014.

[8] William D. Lewis and Fei Xia. Developing ODIN: A multilingual repository of annotated language data for hundreds of the world's languages. Literary and Linguistic Computing, 25(3), 2010.

[9] John McCrae, Philipp Cimiano, and Paul Buitelaar. LexInfo ontology 2. 2010.

[10] Alistair Miles and Sean Bechhofer. SKOS simple knowledge organization system namespace document - HTML variant. 2009.

[11] Irina Nevskaya. Locational and directional relations and tense and aspect marking in Chalkan, a South Siberian Turkic language. Studies in Language Companion Series. John Benjamins Publishing Company, 2014.

[12] Jacquelijn Ringersma and Marc Kemps-Snijders. Creating multimedia dictionaries of endangered languages using LEXUS. In Interspeech 2007: 8th Annual Conference of the International Speech Communication Association. ISCA-Int. Speech Communication Assoc, 2007.

[13] Andrea C. Schalley. Tyto - A collaborative research tool for linked linguistic data. In Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann, editors, Linked Data in Linguistics, pages 139-149. Springer, Heidelberg, 2012.