=Paper=
{{Paper
|id=Vol-2084/paper9
|storemode=property
|title=Digital Cultural Heritage and Revitalization of Endangered Finno-Ugric Languages
|pdfUrl=https://ceur-ws.org/Vol-2084/paper9.pdf
|volume=Vol-2084
|authors=Anisia Katinskaia,Roman Yangarber
|dblpUrl=https://dblp.org/rec/conf/dhn/KatinskaiaY18
}}
==Digital Cultural Heritage and Revitalization of Endangered Finno-Ugric Languages==
Anisia Katinskaia and Roman Yangarber

University of Helsinki, Finland

first.last@cs.helsinki.fi

Abstract. The preservation of linguistic diversity has long been recognized as a crucial, integral part of supporting our cultural heritage. Yet many “minority” languages (those that lack official state status) are in decline, many severely endangered. We present a prototype system aimed at “heritage” speakers of endangered Finno-Ugric languages. Heritage speakers are people who have heard the language used by the older generations while they were growing up, and who possess a considerable passive competency, well beyond the “beginner” level, but are lacking in active fluency.

Our system is based on natural language processing and artificial intelligence. It assists the learners by allowing them to learn from arbitrary texts of their choice, and by creating exercises that engage them in active production of language, rather than in passive memorization of material. Continuous automatic assessment helps guide the learner toward improved fluency. We believe that providing such AI-based tools will help bring these languages to the forefront of the modern digital age, raise prestige, and encourage the younger generations to become involved in reversal of language decline.

===1 Introduction===

The rapidly developing computational technologies are expanding into every domain related to languages. Computers are used as tools to support language learning and maintenance. During the last decade, many online, widely accessible language learning tools have emerged. However, most of them do not cover or support minority languages.

In this paper, we introduce Revita, a freely available online platform designed to support learning/tutoring for endangered languages, beyond the beginner level.
Revita currently works for several endangered or minority languages: Udmurt, Meadow Mari, Erzya, Komi-Zyrian, Komi-Permiak, North Saami, and Sakha. (The system can be viewed at revita.cs.helsinki.fi.) All of these are Finno-Ugric languages, except for the last, which is a Turkic language. Most of the languages are represented inside the Russian Federation (RF), each with a moderate-to-small number of speakers. The system also works for several “major” languages: currently Finnish, Swedish, German, Russian, and Kazakh. This support for major languages evolved in Revita in part for practical reasons. For instance, Finnish is closely related and structurally similar to other Finno-Ugric languages. The Russian language exerts a powerful influence on most of the above-mentioned languages, since they are represented primarily inside the RF, where Russian is the only official state language. Most communication (including written) in these languages exhibits common, spontaneous code-switching into Russian.

Revita can automatically generate a large variety of exercises, created from arbitrary, real texts that are chosen and uploaded by the learners themselves. Alternatively, the texts can be uploaded by teachers, or shared by learners with one another. The key research challenge underlying the system is to adapt the level of exercises to each individual user, depending on her level of competence. The system tries to estimate the level of competence based on the learner’s answers to exercises. In this way, Revita lies at the intersection of two well-established research areas: Intelligent Tutoring Systems (ITS) and Computer-Assisted Language Learning (CALL).

This paper is structured as follows: Section 2 presents prior work in computational tools for language learning.
Section 3 describes the system in greater detail and positions Revita in the field of Educational Data Mining (EDM), and Section 4 presents conclusions and pointers for future work.

===2 Prior work===

The idea of using computers to support language learning emerged over 50 years ago. The research field of CALL (computer-assisted language learning) encompasses a broad variety of technologies, which serve to support language learning and teaching. CALL is briefly defined as “the search for and study of applications of the computer in language teaching and learning,” [7]. One of the first CALL systems, PLATO, was created in the early 1970s [4], and today one can find many systems, most of them commercial, and most aimed toward beginners and focusing on a limited set of (commercially popular) languages.

The idea of using computer-assisted language learning for endangered languages is not widespread, and in the field of revitalization, CALL is a relatively new concept. A number of studies investigate how technology can influence the revitalization process [1,11,12,13,15]. With the increase of accessibility of new technologies, the methodology of language revitalization is also evolving. For example, tools for recording and sharing language data collected from native speakers in authentic contexts are accessible, and the tools facilitate the tasks to a much greater extent than was possible previously. However, we still encounter skeptical views regarding the use of computational technologies in teaching languages. Villa [11] recognizes the effectiveness of the computer as a tool for collecting data, but also underlines that while the computer provides new opportunities for teaching, the new technology creates new pedagogical demands for its effective implementation. The main concern is the threat of the absence of an authentic communicative environment in settings where language learning is aided by computer.
Nevertheless, with the rapid expansion of the Internet, computational technology is receiving wide application for revitalization purposes. The Hawaiian language is a good example of how technology can assist in supporting languages. In preservation and dissemination of language materials and developing multiple models of communication, technology played a most significant role for Hawaiian [14]. For example, the Ulukau website (http://ulukau.org) provides access to valuable resources for teaching and revitalization. This is important, considering the shortage of texts available for Hawaiian. The Leoki bulletin system [3,16] provides communication by email, online chats and conferences, announcements about the Hawaiian language, online order forms for the purchase of Hawaiian language books, dictionary databases where users can suggest new words, issues of newspapers, posted stories and songs, information about educational support, and so on. Leoki has provided an opportunity for communication among speakers separated by distance, which is crucial, since for some students e-communication is the only chance to use the language outside of the classroom.

Some applications and platforms are already in use for supporting endangered languages. Memrise (https://www.memrise.com/) is a learning platform for courses created by users, and it includes several courses in Irish and Welsh. Chickasaw Language Basic (https://itunes.apple.com/us/app/chickasaw-language-basic/id448797486?mt=8) is an application for Apple mobile devices for learning Chickasaw, a Native American language of the Muskogean language family. It offers videos, songs, words and phrases in Chickasaw. Aikuma (http://www.aikuma.org/aikuma-app.html) is an application for collecting data from endangered language speakers. The users can easily make recordings, enhance them with metadata, offer phrase-by-phrase translations, and share them with other users of the app. Tusaalanga (http://www.tusaalanga.ca/ios/about) is an iOS application for learning five Nunavut dialects (spoken in Northern Canada), which has dialogues with audio, grammatical lessons and glossaries with audio. The “Ma! Iwaidja” dictionary (https://play.google.com/store/apps/details?id=com.pollen.maiwaidjadictionary) is an application for learning Iwaidja, an Australian language. Any user can insert new words, phrases, and their translations. The Skidegate Haida Language application (http://www.firstvoices.com/en/Hlgaagilda-Xaayda-Kil) is for Haida, which is spoken by the Haida people in the Haida Gwaii Archipelago, off the coast of Canada, and on Prince of Wales Island in Alaska. This application has a bilingual dictionary and a collection of phrases. Words and phrases are illustrated by pictures and audio. Users have an option to edit the content and replace it with their own images and audio recordings. “Learn Manx” (https://play.google.com/store/apps/details?id=com.anspear.language.manx) is an application for Manx, a Goidelic Celtic language, which has become extinct as a first language, but has around 2000 speakers. The application includes words and basic phrases, bilingual dictionaries, and a flashcard learning system; it allows the recording of responses to questions to compare with model responses, and offers exercises on grammar and comprehension. Most of these applications have some language materials; they often offer the possibility for users to add more data from informants, and include dictionaries and a limited set of phrases to learn.

Several popular commercial language learning systems cover languages considered to be endangered. For example, Duolingo offers courses for learners of Irish and Guaraní, an indigenous language of South America. Rosetta Stone, a commercial language-learning software provider, has established the Endangered Language Program (https://www.rosettastone.com/endangered), whose goal is to revitalize several endangered languages. The program claims to provide support for Chickasaw, Mohawk, Chitimacha, Inuktitut, Inupiat, and Navajo. (However, we do not find these languages in the list of languages available for practicing by a registered user on the platform.)

We do not draw a clear distinction between using CALL for language maintenance in general vs. for teaching/learning endangered languages and revitalization of “heritage” languages. The latter are languages spoken by people whose ancestral language can be considered indigenous, may lack official status in the area where it is spoken, and may also be endangered [10]. In any case, whether the heritage language is endangered or not, the learner uses it to regain or retain access to the ancestral culture linked to this language.

A number of studies examine ways in which CALL can be helpful for learning heritage languages. For example, in [8], the authors examine how computer-mediated communication (CMC) has helped Russian heritage speakers in the USA in the acquisition of academic-level literacy. CMC includes all forms of communication mediated by computer: email, forums, chat-rooms, messages, etc. All of these forms of communication with instructors and other learners can help improve writing and reading in the heritage language, in a range of registers. As the authors stress, this can help learners to observe the target language in use, to access relevant resources about the language, and to anchor the oral language, with which learners are more familiar, in the written form. The use of CMC was shown to have a positive effect on vocabulary acquisition, spelling skills, composing messages, “spoken” writing, grammatical competence, and attention to punctuation. More communication in the target language also entails a growth of interest in the heritage culture, and in exploring the cultural identity of the learners.

Another example of a CALL system for learning a heritage language is described in [6]. The system provides exercises to learners of Runyakitara, a Bantu language. The language is not endangered, as it is spoken by over 6 million people in Western Uganda. The system is aimed at native speakers of Runyakitara who have limited competence in this language. Usually these are children of Runyakitara migrants who have a very rudimentary knowledge of the language. The aim of the project is to introduce such learners to their native language, develop literacy skills and increase the learners’ respect toward and pride in their culture. The main focus is noun morphology, which is difficult to learn. The system for practice includes morphological exercises and testing of the learners’ knowledge of morphology and vocabulary; it also provides scores which can help the teacher to evaluate the learners’ progress. All nouns in the system were extracted from a Runyankore-Rukiga dictionary, Kashoboorozi, and parsed by a finite-state morphological analyzer. (This approach is relevant in our context, since Revita also makes use of finite-state morphological analyzers as low-level supporting components.) The system offers exercises, such as plural forms of nouns. The user should type in an answer and can receive feedback about whether the answer was correct. The learner can also get supplementary material for grammatical explanations. The system saves information about the user, including the dates of practice sessions, the material covered and the scores. The teacher can obtain information about scores of all learners by lessons. Results of pre-testing and post-testing (after completing all exercises) showed a significant increase in grammar scores. The experiment with the system demonstrated its effectiveness for the task of learning the heritage language by its native speakers, a majority of whom reported that they would like to continue using it.

===3 System description===

====3.1 Features of the Revita system====

In this section we describe the main features of the Revita language learning system [5]. The system is developed based on the idea of providing users the possibility to learn languages actively, rather than passively absorbing learning materials. This active model of learning has broad implications and manifests itself in the following:

* Users seek out learning materials which are of interest to them. In this way the learner collects a database of materials and exercises, which can be useful for pedagogical purposes for endangered languages.
* The user develops her active language skills by producing language forms in the context of the story; in most cases, exercises involve unrestricted language forms. (In some settings, such as mobile use, where typing may be less convenient, the system may offer multiple-choice exercises.)
* The feedback provided should offer more insight than simply “correct” vs. “incorrect”. Further, rather than revealing the correct answer immediately after the first failed attempt, a more clever approach should push the student further to seek out the correct answer by herself.
* Advanced modes of using the Revita system (which are currently under development) expand on the idea of giving advanced users (those who also have some competency in linguistic terminology) the possibility to filter exercises by linguistic concepts, and to monitor and direct their own learning progress more actively and at a finer level.

Fig. 1. Story practice mode for the Udmurt language.

The system has a small “public” library of stories for every language, which are available for all users, including non-registered users. It can be problematic for users to find suitable learning material when beginning to work on learning a language; the system offers some texts to begin practicing with stories.
Revita also offers links to sources of other authentic texts, including newspapers, popular journals, etc., as well as information about the language. We plan to extend the public library for all languages with materials and links supplied by experts in the respective languages.

A central idea behind the platform is that users will actively add their own learning materials to their private, personal libraries. Texts can be uploaded from a personal computer (.txt or .doc files), by copying and pasting from other sources, or by loading from a Website: the user provides the URL of a Web page containing text material she wishes to use for practice. Users can also upload stacks of flashcards containing words to practice, with their translations or definitions. In the future we plan to extend these types of learning materials with audio data. For endangered languages this option can be very important, because then the system will also serve as a tool for preserving language data, as many CALL systems do. Uploaded materials can be shared with other users.

We consider the possibility to practice using material chosen and uploaded by the user to be a key feature of the Revita system, because such material is particularly suited to keep the user interested and motivated. Another aspect which makes the practicing process more engaging is that it employs real-world texts, rather than artificial texts specially constructed for exercises. Such “natural” materials can better help the learner to get immersed in contemporary culture and the life of the speakers of the target language.

There are several exercise modes available to users. Without registration, users can read stories in the reading mode and do exercises in the practice mode, but their results are not saved and cannot be used for adapting future exercise sessions to their competence level.
The reading mode is the simplest; it allows the learner to become acquainted with the story, and to ask for translations of unfamiliar words in context.

The practice mode is currently the main type of exercise mode in Revita, see Figure 1. The learner chooses a story to practice and receives it piece by piece, with some of the words obscured for cloze exercises. When each story is uploaded, it is analyzed by several natural language processing (NLP) modules (including morphological analyzers), and all possible candidates for exercises are chosen and saved. The choice of exercises in a given session is random, but it depends on the previous answers given by the user: exercises which were too easy or too difficult to answer are chosen for the new exercises with lower probability, in order to avoid boring the student (with questions that are too easy) or discouraging the student (with questions that are too difficult) too frequently.

In Figure 1, the user has opened the story in the Practice mode (in another tab, a story is open in the Reading mode). The highlighted snippet contains the current exercises. Exercises are of two kinds: white boxes contain base forms of words, where the user must type in the inflected form correct for the context; drop-down menus are multiple-choice questions. The snippet above was answered previously: correct answers are in green, incorrect in blue, with a magnifying glass to allow inspection of the “mistake.” Each correct answer gives one point (an apple) in the score box. Any word can be clicked to obtain a translation, shown in the left panel, into a choice of languages. In the lower left are buttons for entering symbols not found on common keyboards. Progress bars indicate the proportion of correct answers (left) and of the story covered (right).

The user can receive two types of exercises in the practice mode: multiple choice, more suitable for non-inflected words, or cloze quizzes for inflected parts of speech. In the latter case, the user receives the base form of a hidden word, and needs to insert a correct grammatical form of the word appropriate for its context. The correct answer is the answer appearing in the story. The possibility to insert different forms acceptable in the same context is under development. The same story can be practiced more than once, as the generated exercises will differ from the previous ones, because they are influenced by the history of the previous answers.

After answering questions, the user receives immediate feedback and the next piece of the story with new exercises. We plan to develop the feedback functionality further, so that it provides not only the correct answer but a series of hints that can help guide the user toward the solution over several attempts, and adapts to the particular user.

For another type of exercise, the system can generate crosswords from the stories. Entries in the crossword are chosen following the same principles as for the practice mode, considering the history of previous answers. A crossword is built from the hidden words, which should be inserted back into the story in the correct grammatical form. The user receives the translations of the required words as hints; in this way, the user actively expands both the vocabulary and grammatical competency.

A competition feature is available for some exercise modes: while practicing with a story, the user tries to beat an opponent who is doing the same exercises based on the same story. This injects timing and speed into the practice sessions, where the learner has additional constraints due to the opponent’s advance through the session. Currently, the opponent is a bot that attempts to model as closely as possible the user’s own recent answering and timing patterns. Thus, the idea of this mode is for the user to compete exactly with herself, trying to improve on the correctness and speed of her answers.
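The history-dependent random choice of exercises described for the practice mode can be illustrated with a small sketch. It assumes that each candidate exercise carries a count of the learner's correct answers and attempts, and it gives lower weight to candidates that look very easy or very hard. The function names, the smoothed success estimate, and the weighting formula are all illustrative assumptions; the paper does not specify how Revita computes these probabilities.

```python
import random


def estimated_success(correct, attempts):
    """Laplace-smoothed share of correct answers; 0.5 when there is no history."""
    return (correct + 1) / (attempts + 2)


def selection_weight(p_correct, floor=0.05):
    """Hypothetical weighting: highest near p = 0.5, falling toward `floor`
    for exercises that are almost always right or almost always wrong."""
    return floor + 4.0 * p_correct * (1.0 - p_correct)


def pick_exercises(candidates, history, k, rng=None):
    """Draw up to k distinct candidates at random, biased by answer history.

    candidates: list of hashable exercise ids (an assumed representation)
    history:    dict mapping exercise id -> (correct, attempts)
    """
    rng = rng or random.Random()
    pool = list(candidates)
    chosen = []
    while pool and len(chosen) < k:
        weights = [
            selection_weight(estimated_success(*history.get(c, (0, 0))))
            for c in pool
        ]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)  # sample without replacement
        chosen.append(pick)
    return chosen


if __name__ == "__main__":
    # Toy history: "talo" is always answered correctly, "vesi" never.
    history = {"talo": (9, 9), "vesi": (0, 9), "käsi": (4, 9)}
    print(pick_exercises(["talo", "vesi", "käsi", "uusi"], history, k=2))
```

Because the floor weight is positive, even very easy or very hard exercises are still drawn occasionally, which matches the description of them being chosen "with lower probability" rather than being excluded.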
During the practice sessions, the user can request translations for (most) unfamiliar words in the text. All such requests are automatically added into the user’s set of flashcards, as we assume that these words are good for the user to review later. Flashcards can be used for vocabulary practice sessions apart from reading the stories. (Flashcards are also a common feature in some language-learning platforms.)

====3.2 Small vs. big languages====

One of the main advantages of the Revita system is that it is relatively easy to add a new language, if a morphological analyzer and any other required NLP modules are available for that language. With such modules and some assistance from language experts for the required testing and adjustments, Revita can be used to generate exercises from any uploaded story in the new language. Some of the more advanced features depend on the availability of large quantities of language data for developing more robust language models. Finding sufficient quantities of data can be problematic for some of the smaller languages, thus limiting the capacity of the learning platform. Thus, to some extent Revita provides different functionality for different languages, depending on the available language data.

===4 Conclusions and current work===

We have presented Revita, an on-line platform that allows us to explore several large-scale, important challenges recognized in digital humanities today.

In the area of cultural heritage, it helps us address the global problem of language endangerment by bringing state-of-the-art AI tools used for learning “larger” languages to benefit endangered minority languages. Not coincidentally, by embodying an on-line, technological solution, it in part addresses the important sub-problem of raising the prestige of the minority languages, by injecting these languages into the center of modernized discourse. It is well understood that prestige is a crucial social factor in language decline and endangerment.
We believe that our approach provides exciting new and rich sources of data for studying and modeling the process of language learning, with the aim of enhancing the learning experience. It opens opportunities for new research in educational data mining. We are currently investigating application of state-of-the-art methodologies in the context of Revita, including Bayesian knowledge tracing [9] and knowledge space theory [2].

In other scientific areas, such as CALL and ITS, it allows us to bring the latest advances in AI to bear on the modern understanding of pedagogical methodology.

===Acknowledgments===

This research was supported in part by the FinUgRevita Project, funded by the Academy of Finland, Grant No. 267097.

===References===

1. Buszard-Welcher, L.: Can the web help save my language. The green book of language revitalization in practice pp. 331–45 (2001)

2. Doignon, J.P., Falmagne, J.C.: Knowledge spaces. Springer Science & Business Media (2012)

3. Hale, C.: How do you say computer in Hawaiian? (1995)

4. Hart, R.: Language study and the PLATO system. Studies in Language Learning 3(1), 1–24 (1981)

5. Katinskaia, A., Nouri, J., Yangarber, R.: Revita: a system for language learning and supporting endangered languages. In: Proceedings of the Joint 6th Workshop on NLP for Computer Assisted Language Learning and 2nd Workshop on NLP for Research on Language Acquisition at NoDaLiDa. Linköping University Electronic Press (2017)

6. Katushemererwe, F., Nerbonne, J.: Computer-assisted language learning (CALL) in support of (re)-learning native languages: the case of Runyakitara. Computer Assisted Language Learning 28(2), 112–129 (2015)

7. Levy, M.: Computer-assisted language learning: Context and conceptualization. Oxford University Press (1997)

8. Meskill, C., Anthony, N.: Computer mediated communication: tools for instructing Russian heritage language learners. Heritage Language Journal 6(1), 1–22 (2008)

9. Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L.J., Sohl-Dickstein, J.: Deep knowledge tracing. In: NIPS: Advances in Neural Information Processing Systems. pp. 505–513 (2015)

10. Revithiadou, A., Kourtis-Kazoullis, V., Soukalopoulou, M., Konstantoudakis, K., Zarras, C.: Developing CALL for heritage languages: The 7 Keys of the Dragon. The EuroCALL Review 23(2), 38–57 (2015)

11. Villa, D.J.: Integrating technology into minority language preservation and teaching efforts: An inside job (2002)

12. Ward, M.: The additional uses of CALL in the endangered language context. ReCALL 16(2), 345–359 (2004)

13. Ward, M., Genabith, J.: CALL for endangered languages: Challenges and rewards. Computer Assisted Language Learning 16(2-3), 233–258 (2003)

14. Warschauer, M.: Technology and indigenous language revitalization: Analyzing the experience of Hawai’i. Canadian Modern Language Review 55(1), 139–159 (1998)

15. Warschauer, M.: Technology and social inclusion: Rethinking the digital divide. MIT Press (2004)

16. Warschauer, M., Donaghy, K., Kuamoʻo, H.: Leoki: A powerful voice of Hawaiian language revitalization. Computer Assisted Language Learning 10(4), 349–361 (1997)