=Paper=
{{Paper
|id=Vol-3232/paper25
|storemode=property
|title=Archive Infrastructure and Spoken Language Corpora for Saami Languages in Finland
|pdfUrl=https://ceur-ws.org/Vol-3232/paper25.pdf
|volume=Vol-3232
|authors=Marko Jouste,Jukka Mettovaara,Petter Morottaja,Niko Partanen
|dblpUrl=https://dblp.org/rec/conf/dhn/JousteMMP22
}}
==Archive Infrastructure and Spoken Language Corpora for Saami Languages in Finland==
Marko Jouste¹, Jukka Mettovaara¹, Petter Morottaja¹ and Niko Partanen²
¹ University of Oulu, Finland
² University of Helsinki, Finland
Abstract
This study presents the results of an Aanaar Saami pilot project in the Saami Culture Archive,
University of Oulu. The project has established a set of conventions to transcribe and annotate
Aanaar Saami recordings in the archive’s collection and created a mechanism through which
grammatically annotated but anonymous versions can be imported to the Korp search interface
in the Language Bank of Finland. The practices make wide use of Saami language
technology and of Finnish computational research infrastructure, and they can be extended
later to other Saami languages in the archive.
Keywords
Saami studies, Aanaar Saami, research infrastructure, language technology
1. Introduction
There are three Saami languages spoken in Finland: Aanaar (Inari) Saami, North Saami and Skolt
Saami. Multimedia materials in these languages have been archived over the past one
hundred years. Although the materials were previously archived in different institutions around the
country, currently various linguistic, folkloristic and ethnomusicological recordings have been
organized in a corpus in the Saami Culture Archive of the University of Oulu. This does not mean that
all Saami materials in Finland would be accessible in one location, and the neighboring countries have
their own archival infrastructures as well. The Giellagas Institute for Saami Studies at the University of
Oulu has a nation-wide responsibility to organize, introduce and provide Saami language and cultural
studies as well as research at the academic level, and the work within the Saami Culture Archive directly
serves these responsibilities. The archive staff also has the needed cultural and linguistic competence
to work with the Saami materials.
Additionally, new materials are actively collected. In recent years, this has especially taken place in
connection with language revitalization work. The purpose has been to support language teaching,
planning and research (for information on language revitalization work in the Saami context, see Olthuis
et al. [1], and the studies by Pasanen [2], [3]). The Aanaar Saami and Skolt Saami languages have
undergone significant and successful revitalization efforts, and in both languages, there is an increasing
demand for language materials that are suitable for second language learners and also for fluent
community members who want to study the language in more depth.
From the point of view of modern language communities and researchers, the archived collections,
even the ones in the Saami Culture Archive, can be considered what is often called legacy data:
materials that were collected in a more remote past by researchers who have since passed away or
are no longer working with these materials themselves. The concept of legacy addresses the fact
that these materials, which are not originally ours, are under our curation today.
Holton et al. [4] discuss how in many instances this has resulted in the materials and the information
they contain being transferred from the indigenous communities to non-indigenous archives. Our case
study provides a contrary case. These materials are hosted in a Saami archive, within the same
institutional structure that also supports Saami education and other fields. Dobrin and Schwartz [5]
have argued that the main challenge in the use of legacy data is the way the materials are connected
to the intentions and ideas of their original creators: they have a social history that
must be understood when the data is used. We believe that the Saami Culture Archive, as a Saami
organization, is in a strong position to evaluate how the materials can be used in modern times, as the
perspective is not that of an outsider but of an insider. Discussion about different aspects is still
ongoing: O’Meara and Good noted a decade ago that social and legal concerns in making
legacy materials more openly available had not been thoroughly studied [4], and this conversation
continues today. We think that case studies like ours illustrate various ways in which wider
accessibility can be achieved while ensuring the appropriateness of use. Best practices will develop
over time, when different approaches can be evaluated over a longer period. At the same time, we think
that the archives should have an active role in this process as well, since especially with legacy data
the archives play a major role as curators of these materials.
The 6th Digital Humanities in the Nordic and Baltic Countries 2022 Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022.
EMAIL: marko.jouste@oulu.fi (A. 1); jukka.mettovaara@oulu.fi (A. 2); petter.morottaja@oulu.fi (A. 3); niko.partanen@helsinki.fi (A. 4)
ORCID: 0000-0003-1971-054X (A. 1); 0000-0002-4727-6704 (A. 2); 0000-0001-8584-3880 (A. 4)
© 2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
The materials archived in the Saami Culture Archive are closely connected to other work done at
the Giellagas Institute of the University of Oulu, including indigenous cultural work, teaching and
research. As these materials originate from an indigenous culture, there are several specific questions
that need to be addressed, especially with regard to access and cultural information therein. At the same
time, there is value in making the materials as accessible as possible for the communities themselves,
ideally online, so that they can be used to their full potential in language teaching, planning and
maintenance. This connects to the idea of repurposing the archival data for language reclamation, as
discussed by Lukaniec [7, 323], who points to numerous pedagogical tools, teaching aids, dictionaries
and other resources which can be created from archival resources – once they are transformed into
a structured and normalized format that functions as a corpus.
We recognize that for the Saami materials we discuss here, the above-mentioned Saami communities
are the foremost group that has the need for and interest in these resources, and for whom access must
also be provided and with whom the conditions of use need to be negotiated. At the same time, we also
want to find a solution whereby the materials can be used without significant barriers, especially when
those barriers arise from the needs and practices of the majority culture. For example, we think that a
language learner should be able to easily check in a corpus how a word is used without needing
university-account-based authentication, which would quickly become the situation if the materials were
accessible strictly for verified educational use.
In our solution, the materials are described and analysed by specialists in these languages and
cultures, many of whom are also native speakers. We use a manual tagging method where personally
identifiable and sensitive information is marked and can thereby be restricted or removed from
subsequent derivations. These versions will be made available in the Language Bank of Finland². In
this way, parts of the archived materials can be used as language learning tools, or as example sentences,
e.g., for dictionary infrastructure, while also taking into account the cultural integrity and sensitivity of
the data. We use a national computational infrastructure, which is secure and allows for continuous
refinement of the materials. What is specific in our approach is the extensive use of modern language
technology, combined with a close connection to the needs of language communities.
In our study, we describe the workflow and evaluate the time spent in different work phases, thereby
providing concrete numeric guidelines for similar future projects. We also analyse and test in detail the
sensitive information tagging method which we have designed, with the goal of being able to estimate
how much of the tagged content needs to be pseudonymized and anonymized. Similar tagging methods
have been used before [7], but concrete estimations of how well they function are important and still
scarce. We also provide accurate statistics about the size of the resulting corpus, and describe it in a
manner that will directly benefit the new users, both in digital humanities and other fields of research.
The case study we report was a pilot project for Aanaar Saami that was conducted in May–December
2021. This pilot was successful, and, in the future, similar work will be extended to the North Saami
and Skolt Saami languages, thereby covering all Saami languages spoken in Finland. The model will
certainly be developed further as the work advances, but, at this point, we want to create a solid
foundation for later work.
² https://www.kielipankki.fi/language-bank
2. Aanaar Saami materials in the Giellagas Corpus of Spoken Saami
As we described in the introduction, during the past decades various recorded Saami materials have
been relocated to the Saami Culture Archive. Under these circumstances, the work on them can be
planned systematically and on a long-term basis from the perspective of one Saami organization. The
oldest recordings in the archive are wax cylinders from 1913, and the majority originate after the 1950s
when modern recording equipment started to become available. For a detailed description of the history
of Aanaar Saami recordings, see the summary in Jouste’s dissertation [8, 50–68].
The Aanaar Saami materials currently hosted in the archive total approximately 92 hours. The earlier
archival sources of these materials are the archives of the Finno-Ugrian Society, the archive of the
Finnish Literature Society and the Tape Archive of the Finnish Language in the Institute for the
Languages of Finland. In addition, new recordings have been made within the “Complementary Aanaar
Saami Language Education” (CASLE) program for adults (for more information on the project, see [1]),
where the recording work was part of the course work. Most of the Aanaar Saami recordings were
made by outside researchers, but especially the recordings connected to language teaching represent a
more collaborative and community-oriented approach.
Before the project started, all recordings were digitized but mainly untranscribed. Approximately
5% of the recordings had some level of transcription, half of which was in a time-aligned XML format,
customarily used in linguistic research nowadays. From this starting point, it was deemed important
that workflows be built through which the amount of systematically structured and annotated materials
could be increased. A pilot project was designed to meet this need.
3. Pilot project
The pilot project was initiated in spring 2021 and it had four contributors. It is a part of a larger
initiative to strengthen the infrastructure in the Saami Culture Archive. The goal was to establish a
workflow from transcribed and annotated recordings into the services of the Language Bank of Finland.
In a larger perspective, this would need to be connected to an updated archive management system in
the Saami Culture Archive, and digital preservation processes that are also organized on a national level.
For the time being, however, we have aimed at a solution that reaches concrete outcomes with resources
that we already have and that interacts with working practices that already exist. Essentially, in the
future when new recordings are transcribed for different purposes, integrating them into this workflow
will be easy, and they can, if desired, be made accessible and searchable in ways that allow respectful
treatment of personally identifiable information.
The system is very modular in the sense that the current data storage could be changed into
something else with little effort, as this would essentially mean copying and transferring the file system.
The transcriptions are stored in ELAN files, but this is an XML structure that can readily be transformed
into other formats as desired. The main building blocks in this structure are utterance level time stamps
and hierarchical annotations at word level. Although we currently provide access through the Korp
search interface at the Language Bank of Finland, we are not bound to a single interface or collaborator.
Naturally, after having invested this much work in our solution, we would not start to implement radical
changes without very careful thought, but in the long term we believe this modularity will come to play
an important role.
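As a rough illustration of why the ELAN files are easy to transform, the sketch below reads utterance-level time stamps and texts from an ELAN XML string using only the Python standard library. The element and attribute names follow the general ELAN annotation format; the tier name `orth` and the sample utterances are invented for this example and are not the project's actual tier structure.

```python
import xml.etree.ElementTree as ET

# Hand-written miniature of an ELAN (.eaf) file. The tier name "orth" and the
# utterance texts are hypothetical placeholders, not project data.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="orth" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_utterances(eaf_xml, tier_id):
    """Return (start_ms, end_ms, text) tuples for one time-aligned tier."""
    root = ET.fromstring(eaf_xml)
    # Map time slot identifiers to millisecond values.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
    }
    utterances = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="")
            utterances.append((start, end, text))
    return utterances


for start, end, text in read_utterances(SAMPLE_EAF, "orth"):
    print(start, end, text)
```

Once the utterances are available as plain tuples like this, writing them out in another format is a matter of changing the output step only, which is the sense in which the storage format is replaceable.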
The motive for the pilot project emerged from the specific new value and benefit that the spoken
language corpus has. The significance of constructing a spoken language corpus for, e.g., revitalization
and educational use can hardly be overemphasized. Speech is the primary form of language, and it
differs fundamentally from written communication. For example, much of spoken phonological and
lexical variation is oftentimes not present in the codified written standard, and even the ways of
constructing sentences and phrases often differ from those of written conventions. A spoken corpus
enables us to examine features of natural discourse, such as use of interjections, repetition, self-
correction, discourse particles and code-switching, that are characteristically part of spoken
communication. The information gained through studying these features helps both educators to teach
the language and learners to learn the language in a more natural way. In addition, the contents of the
corpus provide a significant source of oral tradition and cultural knowledge.
3.1. Data selection
The transcribed materials were initially selected to represent geographical variants of Aanaar Saami,
consisting of recordings from both eastern and western areas of Aanaar Saami, and later on more
transcription tasks were initiated by Giellagas Institute students of Aanaar Saami language and
coordinated by the Saami Culture Archive. All of the ongoing transcription projects were joined in this
pilot project in order to finish the transcription tasks and standardize the transcription conventions used.
The transcribed recordings were chosen mainly from the older recordings spanning from 1913 to 1992.
The recordings included adequately represent the main traditional speaking areas of Aanaar Saami and
the temporal span of the archive recordings as a whole.
3.2. Data storage
The educational institutions in Finland can access the computing environments maintained and
developed by CSC – IT Center for Science³ free of charge. The Saami Culture Archive also uses several
of the CSC products in the current project. As the services are built on the national level, institutions
like our own can benefit from tools in which the basic security and maintenance are coordinated on a
higher level. At the same time, we believe, the practices of one organization should be relatively easily
transferable to other organizations within the context of Finland.
The context of the Saami Culture Archive is that of a medium-sized multimedia archive. The
collection is large, but not in the millions of items, and although the collection keeps growing, it can
still be handled effectively by the archival staff. Several individuals often need concurrent access to the
same materials, and while the collections are used, adjacent resources are customarily created. For audio
and video recordings this means that new transcriptions are added to the archive, and these improve the
further usability of the resources for new archive users.
We have used CSC’s Allas object storage system to store the edited materials and to share the files
among the archive and project workers. The access can be controlled easily and everyone has access to
the same files. The files from Allas can also be accessed in CSC’s computing environment Puhti, which
has been a large advantage. This setup is not perfect, as we would benefit from more granular version
control and logging solutions, but as this is the recommended solution for the research data currently
processed in Finnish institutions of higher education, we have wanted to adopt this as our solution, too.
In spring 2022 we also participated in Allas service’s user interview process, hoping that our
experiences will eventually also be beneficial for the implementation of future versions.
3.3. Transcription conventions
The transcriptions were done with ELAN software [10]. The basic, initial transcription (non-
standard transcription) is based on the orthographical rules of Aanaar Saami with some additions to
mark phonemically relevant quantity distinctions (i.e. the half-long consonants and vowels, short
diphthongs, and short consonant clusters). In the non-standard transcription tier, the goal was also to
mark the phonemic and morphological variations resulting in non-standard word forms. The transcriber
could also include features characteristic of spoken conversation, like unfinished
or interrupted words, laughter, coughing, and pauses, but this was not emphasized in the project.
Unclear parts of the speech were marked, with possible guesses of unidentified words by the transcriber.
In the next step of the transcription, the non-standard transcription was standardized (tier
Orthography). At this stage, the phonemic diacritic markers, discourse-analytical markings and unclear
words were removed or simplified, resulting in a transcription that follows the orthographical rules of
Aanaar Saami. The main idea of the orthographical tier is to make it easier to conduct word queries and
to use the orthography-based Aanaar Saami morphosyntactic analyser designed by the Giellatekno group
at UiT – The Arctic University of Norway⁴. The phonologically more accurate transcription, however, is
saved and can be added to queries based on orthographical word forms and morphosyntactic descriptions.
³ https://www.csc.fi/en/solutions-for-research
⁴ https://giellatekno.uit.no/
In the initial stage of the transcription, the recording was screened for possibly sensitive or private
material in three categories: 1) place names, 2) personal names and 3) sensitive information. These parts
were flagged to be filtered or censored in later stages. Identifying place names and personal names is
generally a straightforward task for the transcriber. The third category, however, leaves more room for
interpretation. We took a relatively cautious stance and instructed transcribers to flag anything
that might feel like sensitive information; the final judgment on the flagging of possibly sensitive
information was left to the core staff of the project.
3.4. Annotation conventions
The transcriptions are annotated with the methods presented by Gerstenberger et al. [11]. They have
previously been used in documentation of Pite Saami [12] and Komi [13]. The idea is that a rule-based
morphosyntactic analyser is run over ELAN’s transcription tier, and the resulting annotations are
written directly to the ELAN file. The resulting tiers are then manually corrected. In the initial stage,
the output of the analyser was inspected and analysed before manual correction was even considered.
This allowed correcting typing errors and especially transcription specific conventions such as
hesitation and unclear words, which are marked as described above. After the manual correction was
done, an extensive list of remaining issues was created. This makes it possible to address these questions
in following project phases.
Especially in lemmatization and morphological analysis, the quality was so high that even the most
common unrecognized words were relatively rare. We used this as a justification to continue the manual
correction in order to achieve a small gold corpus. It would be possible and desirable to envision
workflows where the analyser is developed more in parallel with the transcription work, but, as our
project team was small, resources were limited, and the goal of the work was a concrete annotated
corpus, we followed the process described here.
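The inspection of analyser output described above, where the most common unrecognized words are reviewed before manual correction, could be sketched as follows. This assumes a tab-separated lookup output with one `wordform<TAB>analysis<TAB>weight` line per reading, in which an unrecognized word form carries an analysis ending in `+?`, as HFST-style lookup tools commonly print. The text does not state the exact format the project's script consumed, and the sample lines are invented.

```python
from collections import Counter

# Invented sample in an assumed HFST-style lookup format:
# wordform <TAB> analysis <TAB> weight, blank line between tokens.
SAMPLE_LOOKUP = (
    "kielâ\tkielâ+N+Sg+Nom\t0.0\n"
    "\n"
    "xyzzy\txyzzy+?\tinf\n"
    "\n"
    "xyzzy\txyzzy+?\tinf\n"
)


def unrecognized_counts(lookup_output):
    """Count word forms whose analysis is marked as unknown ('+?')."""
    counts = Counter()
    for line in lookup_output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank separator lines
        wordform, analysis = fields[0], fields[1]
        if analysis.endswith("+?"):
            counts[wordform] += 1
    return counts


# The most frequent unknown forms are the first candidates for fixing,
# either as typing errors or as gaps in the analyser's lexicon.
print(unrecognized_counts(SAMPLE_LOOKUP).most_common(10))
```

A frequency list like this separates typing errors and transcription conventions, which can be fixed in the files, from genuine lexicon gaps that belong on the list of remaining issues.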
Although the analyser returns information about Aanaar Saami morphology and syntax, the latter
was not included in the manual correction phase after initial tests were done. The syntactic analysis was
simply still at too rudimentary a stage with missing and wrong syntactic tags. This information was
kept in the files, but it was too slow to correct it entirely by hand. Time estimates for this work
phase are also included in Table 1.
The resulting annotations include first the lemma, which means the word form that is found as the
headword in dictionaries. In the subordinate tiers, we store part-of-speech information and the
morphological analysis. Part of speech contains only one tag, as there is only a small number of
available categories. However, for the morphological analysis a string of feature tags is provided. These
tags convey information such as tense, person, number, case, derivational elements and other
grammatical-category information. The system is similar for all languages in the GiellaLT infrastructure
and extensively documented⁵. This approach is very beneficial for the work on Aanaar Saami, as the
same searches should be relatively easy to compare between different Saami varieties when annotated
with a similar system. From this point of view, inclusion of more languages in the workflow described
in this study is very important for the Saami Culture Archive as well.
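The three annotation layers described above (lemma, a single part-of-speech tag, and a string of feature tags) can be illustrated with a small parser for a `+`-separated analysis string of the kind GiellaLT analysers produce. This is a simplified sketch: it assumes the first tag after the lemma is the part of speech, which will not hold for every analysis (compounds and derivations carry additional tags), and the example string is illustrative and not taken from the corpus.

```python
from typing import NamedTuple, Tuple


class Analysis(NamedTuple):
    lemma: str                 # dictionary headword
    pos: str                   # single part-of-speech tag
    features: Tuple[str, ...]  # remaining morphological feature tags


def parse_analysis(analysis: str) -> Analysis:
    """Split 'lemma+POS+Tag+Tag' into the three annotation layers.

    Assumption for this sketch: the first tag is the part of speech.
    """
    lemma, *tags = analysis.split("+")
    if not tags:
        return Analysis(lemma, "", ())
    return Analysis(lemma, tags[0], tuple(tags[1:]))


result = parse_analysis("kielâ+N+Sg+Nom")
print(result.lemma, result.pos, "+".join(result.features))
```

Keeping the feature tags as a tuple, not as one opaque string, makes it simple to query for a single category such as number or case across differently ordered tag strings.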
3.5. Anonymization
There are many situations where personal and identifiable information in the archived data should
not be made public. Information about living persons falls under the European General Data Protection
Regulation (GDPR), and as the materials discussed here were recorded long before the current legislation, the
conventions used when the original recordings were made are often very different from modern
practices. We are currently applying relatively strict measures of anonymization for the archival data,
although situations may arise where the old age of the materials or possibility of making new
agreements with individuals participating in the recordings could allow more open sharing. At the same
time, we have to be careful when applying practices developed for the majority cultures in an indigenous
context. When the participants in the recordings can still be reached, discussing and agreeing with them
on new online distribution could be one way to improve the understanding of how the materials can be
shared. Naturally, when the original work was done decades ago, it would have been impossible to even
envision that someone would like to listen to the recording online, for example.
⁵ https://giellalt.github.io/lang-smn/
The flagging conventions on which the anonymization is based were described above for the initial
transcription. For this, we use an independent tier that is not connected to any other tier, so adjustments
can be made at any time and at varying granularities. The flags are entirely independent of changed
annotation boundaries on other tiers, as they are connected to time codes instead of annotations. To
illustrate this, an annotation that tags some segment as containing identifiable information can span a
very short segment or a longer span, depending on the situation and certainty of the annotator. When
the actual filtering of this information is done, the segments that overlap the annotation even partially
are hidden. This way the identifiable information is removed relatively roughly, so that we can
confidently make the remaining texts openly available. At the same time, if there is a later need to refine
any annotation, we can simply adjust, remove and add the tags in new versions. We do err on the side
of caution but recognize that this kind of tagging scheme is never final and may change when more
work is done on the corpus.
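The overlap-based filtering described above can be sketched in a few lines. The sketch assumes that both utterances and flags are available as millisecond intervals; the tuple layout, the label strings and the choice to treat intervals that merely touch at a boundary as non-overlapping are assumptions made for this example, not details given in the text.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open time intervals share any stretch of time."""
    return a_start < b_end and b_start < a_end


def filter_flagged(utterances, flags):
    """Drop every utterance that overlaps a flagged interval, even partially.

    utterances: iterable of (start_ms, end_ms, text)
    flags:      iterable of (start_ms, end_ms, label)
    """
    kept = []
    for start, end, text in utterances:
        flagged = any(
            overlaps(start, end, f_start, f_end)
            for f_start, f_end, _label in flags
        )
        if not flagged:
            kept.append((start, end, text))
    return kept


# Hypothetical data: one short flag straddles the first two utterances,
# so both are hidden and only the third remains.
utterances = [(0, 1000, "a"), (1000, 2000, "b"), (2000, 3000, "c")]
flags = [(900, 1100, "personal name")]
print(filter_flagged(utterances, flags))
```

Because the flags live on their own time-aligned tier, widening or narrowing a flag changes which utterances are hidden without any edits to the transcription tiers.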
3.6. The Language Bank of Finland version
The version stored in the Language Bank of Finland will be made openly available in 2022, but as
described, personally identifiable information has been filtered out. There are references to the original
archival materials, and it is possible to request access to them through the Saami Culture Archive. The
material is organized so that searches are possible by all the parameters we have annotated in the
previous stages, with the exception of the cultural comments.
Figure 1: Upcoming Korp-interface view in the Language Bank of Finland⁶. The final version is intended
to be as fully translated into Aanaar Saami as possible; note that the draft version in the screenshot
also contains Finnish.
⁶ https://korp.csc.fi/
In the Korp interface [14] there are three possible search methods: Simple, Extended and Advanced.
Figure 1 above displays the Extended query mode, where combinations of word sequences and
annotations can be searched in a relatively simple way: horizontal elements are sequential words, and
vertical elements represent annotations, which allows searching for a specific part of speech and
morphological category, for example. At the same time, the search query currently used always
becomes visible in the Advanced view in the CQP query language, the language of the Corpus Query
Processor [15]. This
is a well standardized and widely used search format, which many users may already be familiar with.
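To make the relation between the Extended view and the underlying query concrete, the sketch below assembles a CQP-style query string: each bracketed token corresponds to one horizontal element, and the conditions joined with `&` inside a bracket correspond to the vertical annotation constraints. The attribute names `pos` and `msd` are assumptions for illustration, since the attribute names exposed for this corpus are not given in the text, and values are inserted without any escaping.

```python
def cqp_token(**attrs):
    """Build one CQP token expression, e.g. [pos = "N" & msd = ".*Sg.*"].

    With no constraints, returns the match-anything token [].
    Values are treated as regular expressions by CQP and are not escaped here.
    """
    if not attrs:
        return "[]"
    conditions = " & ".join(
        f'{name} = "{value}"' for name, value in attrs.items()
    )
    return f"[{conditions}]"


def cqp_sequence(*tokens):
    """Join token expressions into a query over consecutive tokens."""
    return " ".join(tokens)


# A singular noun followed directly by any verb (hypothetical attribute names).
query = cqp_sequence(cqp_token(pos="N", msd=".*Sg.*"), cqp_token(pos="V"))
print(query)
```

Generating the string programmatically is mainly useful for repeated or batch searches; for one-off searches the Extended view produces the same kind of expression interactively.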
The search results are displayed in a typical keyword-in-context view, also known as KWIC. This
displays all results vertically below one another with the matching word highlighted. By clicking
individual words, it is possible to see further information about the line in question on the right side.
This includes the original analysis, but also information about the pilot project, corpus and the details
needed to request the original audio and ELAN files from the Saami Culture Archive.
If the segment has been flagged in a previous work phase, the content of those utterances will not
be shown at all in this view. It could be possible to create a more fine-grained filtering here, for example,
by taking into account that the flag for place names should only filter those tokens tagged as places.
This, however, is already error prone, both from technical and human perspectives, as one would need
to ascertain that the place is always tagged correctly and systematically. Again, this could also be
verified by an additional manual check.
4. Evaluation of time spent in different tasks
Some of the work in the pilot project was technical and some linguistic. Evaluation of the time spent
in different work phases is not trivial, but, in Table 1, we provide a careful evaluation based on our
experience. For the technical work, we must emphasize that we used an already very functional
software⁷ which the authors have made available under an open license. The version used in our pilot
project was adjusted to the tier structure in the project, but the foundation is the same with a rule-based
analyser and processing the XML files with Python.
Initial adjustment of the script was thereby relatively fast, taking approximately one week of
work time. As the project progressed, we adjusted the software for the new needs that emerged. This
involved adaptations for new transcription conventions and the processing of flagging information. We
also added a new output option for VRT files used by the backend of the Korp interface. All this code
is available on GitHub⁸. Although CSC’s Allas service functioned very well, there was an occasional
need to check the files, organize them and verify which work step each of them was currently in. All in
all, this resulted in approximately one month of software-engineering related work. If the underlying
analyser had been modified during the project, the amount of work would have probably increased by
several months, and this would have also then involved highly specialized work where both linguistic
and technical skills are crucial.
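The VRT output mentioned above is a verticalized text format: structural units are wrapped in XML-like tags, and each token occupies one line with its annotations separated by tabs. The sketch below shows the general shape only. The column order (word, lemma, part of speech, morphological tags), the `text` and `sentence` element names and the `id` attribute are assumptions for illustration and are not taken from the project's exporter; token values are written without escaping.

```python
from xml.sax.saxutils import quoteattr


def to_vrt(text_id, sentences):
    """Serialize annotated sentences into a minimal VRT-style string.

    sentences: list of sentences, each a list of
               (word, lemma, pos, msd) tuples.
    """
    lines = [f"<text id={quoteattr(text_id)}>"]
    for sentence in sentences:
        lines.append("<sentence>")
        for word, lemma, pos, msd in sentence:
            # One token per line, positional attributes separated by tabs.
            lines.append("\t".join((word, lemma, pos, msd)))
        lines.append("</sentence>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# Hypothetical one-token sentence.
print(to_vrt("demo", [[("kielâ", "kielâ", "N", "Sg+Nom")]]), end="")
```

Utterances removed by the flagging step would simply never be passed to a function like this, so the exported file contains no trace of their content.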
As shown in Table 1, the technical work is still minor when compared to the workload of even one
hour of transcription. To set this into scale, the whole Aanaar Saami collection in the Saami Culture
Archive is 92 hours, as described above. Applying the whole linguistic workflow to one hour of
recording would amount to approximately 35 hours of work.
{| class="wikitable"
|+ Table 1: Estimation of the time spent in different work phases
! Work phase !! Time (for 1 h of transcription)
|-
| Adjusting the script initially || One week
|-
| Modifying the script during the project || Two weeks
|-
| Maintaining the data repository || One week
|-
| Non-standard transcription (and flagging) || 10 h
|-
| Orthographic transcription || 5 h
|-
| Morphological checking || 10 h
|-
| Syntactic checking || 10 h
|}
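The per-hour figures in Table 1 add up to the approximately 35 hours of work per recorded hour mentioned above. The short calculation below reproduces that sum and, as a naive linear extrapolation that is not a figure reported in the study, scales it to the 92 hours of Aanaar Saami recordings in the archive. Real effort would vary with audio quality and annotator experience, as the text notes.

```python
# Hours of work per one hour of recording, taken from Table 1.
HOURS_PER_RECORDED_HOUR = {
    "Non-standard transcription (and flagging)": 10,
    "Orthographic transcription": 5,
    "Morphological checking": 10,
    "Syntactic checking": 10,
}


def linguistic_hours(recorded_hours, phases=HOURS_PER_RECORDED_HOUR):
    """Linear estimate of linguistic work for a given amount of recordings."""
    return recorded_hours * sum(phases.values())


print(linguistic_hours(1))   # per recorded hour, all four phases
print(linguistic_hours(92))  # naive extrapolation to the whole collection

# Variant without syntactic checking, which was left out of manual correction.
without_syntax = {
    phase: hours
    for phase, hours in HOURS_PER_RECORDED_HOUR.items()
    if phase != "Syntactic checking"
}
print(linguistic_hours(92, without_syntax))
```

The one-off technical phases in Table 1 (script adjustment and repository maintenance) are deliberately left out, since they do not scale with the amount of recorded material.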
⁷ https://github.com/langdoc/elan-fst
⁸ https://github.com/nikopartanen/giellagas
The details vary from recording to recording, as the quality of audio and possible original
transcription, the amount of overlapping speech, hesitations, code-switching and pauses all influence
how much time is needed. Naturally, also the experience of the transcribers and annotators matters. In
our pilot project, both transcribers, one native speaker and one advanced learner, were highly qualified
in the language.
One figure that is currently missing concerns the speed of annotations if they were done entirely
manually. We believe, however, that this would be many times slower than what we can currently
present. The morphological and syntactic checking is also a task that could become faster if the analyser
were systematically improved, but the improvement in speed would not be linear, as the checking would
still be important and done for each token.
5. Further work
Transcription of Aanaar Saami materials will be continued gradually by volunteer Aanaar Saami
students, who will benefit from the newly established conventions, which bring stability and
continuity to the work. When untranscribed Aanaar Saami recordings are used and transcribed, for
example, in BA and MA level theses, the resulting transcriptions will be archived and integrated into
the corpus through the conventions described in this study.
The Saami Culture Archive will continue to maintain the infrastructure which has been set up during
the pilot project. Further collaboration with the Language Bank of Finland and UiT – The Arctic
University of Norway will also be fostered, as their services and tools are essential for our work. The
Language Bank of Finland offers the corpus search interface in a setting that is already becoming
familiar for different user groups, and the GiellaLT language technology is the backbone of the
morphosyntactic analysis which our pipeline depends on. Nonetheless, even from the point of view of
the Saami Culture Archive, there are several issues where local infrastructure could still be improved.
Most of these components involve local data storage and metadata handling.
Currently, only very general metadata is included in the corpus, as the openly available version is
anonymized. The Saami Culture Archive, however, has much more detailed information about the
recordings. At the moment, the metadata is not directly searchable, as the archival data management
infrastructure has yet to be fully established, and most of the information is stored in spreadsheets. One
solution could be to use open-source archive repository management tools, such as AtoM⁹. At the same
time, national Finnish infrastructure is being developed, and the Saami Culture Archive is
actively monitoring this development.
Regardless of which management system is used with the repository, the metadata would ideally be
published, at least on some level, so that users could more easily judge whether the archive contains
materials useful to them. At the same time, the Language Bank of Finland version already gives users
a new kind of view into the materials, which can make the content of the archive better known.
Additionally, the metadata in the archive could be expanded based on the new transcriptions. For
example, the fact that some recordings contain sensitive information should be carried over from the
transcription level to the item description level.
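The step of carrying sensitivity information from the transcription level to the item description level
could be sketched roughly as follows. This is only an illustration and does not describe the archive's
actual data model: the Segment and ItemRecord classes, their field names, and the
propagate_sensitivity function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcribed segment of a recording (hypothetical structure)."""
    start_ms: int
    end_ms: int
    text: str
    sensitive: bool = False  # assumed flag set by the transcriber


@dataclass
class ItemRecord:
    """Item-level description of a recording (hypothetical structure)."""
    item_id: str
    title: str
    contains_sensitive: bool = False
    sensitive_segment_count: int = 0


def propagate_sensitivity(item: ItemRecord, segments: List[Segment]) -> ItemRecord:
    """Summarise segment-level sensitivity flags in the item-level record."""
    flagged = sum(1 for segment in segments if segment.sensitive)
    item.contains_sensitive = flagged > 0
    item.sensitive_segment_count = flagged
    return item
```

With such a summary in place, an item-level catalogue could indicate that a recording contains
sensitive passages without exposing the passages themselves.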
When the corpus is available to the users, we should also build upon the feedback we receive. The
current setup aims to be broadly useful and to have a low barrier to entry, but it is still a complex search
system with its own query language. It is therefore possible that the user instructions, and even some
parts of the pipeline, will need to be redesigned to be more useful in practice. We may not need to
build a complete user feedback system, but it is important to acknowledge that we do not yet know who
all the users will be and what kind of backgrounds and needs they have, especially over a longer time
frame.
As the pilot project has proven to work well, the same model will be used for organizing and
expanding the spoken corpora of North Saami and Skolt Saami. We also hope to find, in the near future,
a way to link the corresponding audio to word queries. This involves both technical and ethical
considerations. First, hosting the audio files and linking the relevant segments is not trivial. Second,
the audio would be identifiable, and any solution would need to account for the removed segments that
contain personal information, as the intention is not to make those publicly available.
9 https://www.accesstomemory.org/en
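One way to respect the removed segments when linking audio to search hits is to link only those words
whose time span does not overlap any redacted interval. The sketch below illustrates this idea under
stated assumptions: the tuple layout (token, start_ms, end_ms), the millisecond intervals, and the
function names are hypothetical and are not taken from the project's pipeline.

```python
from typing import List, Tuple

Word = Tuple[str, int, int]   # (token, start_ms, end_ms), assumed layout
Interval = Tuple[int, int]    # (start_ms, end_ms) of a removed segment


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def linkable_words(words: List[Word], redacted: List[Interval]) -> List[Word]:
    """Keep only the words whose audio span touches no redacted interval."""
    return [
        word
        for word in words
        if not any(overlaps(word[1], word[2], r_start, r_end)
                   for r_start, r_end in redacted)
    ]
```

A word that only partly overlaps a removed segment is excluded as well, which errs on the side of
not publishing identifiable audio.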
The materials could also be extended, with an emphasis on modern language (e.g. Yle archives and
possibly entirely new materials recorded to be openly published from the start). In the future,
the accumulated parallel audio and text data will support the development of speech recognition and
speech synthesis for Aanaar Saami, which in turn would make the workflow described in this paper even
more efficient.
6. Conclusion
In this paper, we have given a detailed description of the conventions for developing an Aanaar Saami
spoken language corpus in the Saami Culture Archive at the University of Oulu. The pilot project
provides a way to make archived materials easily accessible and searchable while respecting the privacy
and anonymity of individual participants in the recordings. Language technology is used to make the
annotation phase more efficient, but the whole data curation still depends primarily on the linguistic
and cultural knowledge of the specialists working with the materials.
Although the chosen structures are influenced by the conventions often seen in linguistics and
language documentation, we do believe that the work can also benefit other adjacent fields, especially
in the context of digital humanities. The recordings and their transcriptions store a wealth of cultural
information about Aanaar Saami life and history in the 20th century, and they are also important
documents for topics such as toponymy. The main audience of our work is the Aanaar Saami
language community, primarily the language learners and users, but the conventions we have used are
not restricted in their usability to just a few specific purposes.
7. References
[1] M.-L. Olthuis, S. Kivelä and T. Skutnabb-Kangas, Revitalising indigenous languages: How to
recreate a lost generation, Multilingual Matters, Bristol, 2013.
[2] A. Pasanen, Kuávsui já peeivičuovâ. 'Sarastus ja päivänvalo': Inarinsaamen kielen revitalisaatio.
Uralica Helsingiensia 9, Finno-Ugrian Society, Helsinki, 2015.
[3] A. Pasanen, “This Work is Not for Pessimists”: Revitalization of Inari Sámi Language, in: L.
Hinton, L. Huss and G. Roche (Eds.), The Routledge handbook of language revitalization,
Routledge, New York, NY, 2018, pp. 364–372.
[4] G. Holton, W. Y. Leonard and P. L. Pulsifer, Indigenous Peoples, Ethics, and Linguistic Data,
in: A. L. Berez-Kroeker, B. McDonnell, E. Koller, L. B. Collister (Eds.), The Open Handbook of
Linguistic Data Management, The MIT Press, 2022.
[5] L. M. Dobrin and S. Schwartz, The social lives of linguistic legacy materials. Language
Documentation and Description 21 (2021) 1–36.
[6] C. O’Meara and J. Good, Ethical issues in legacy language resources. Language &
Communication 30.3 (2010) 162–170.
[7] M. Lukaniec, Managing Data from Archival Documentation for Language Reclamation, in: A. L.
Berez-Kroeker, B. McDonnell, E. Koller, L. B. Collister (Eds.), The Open Handbook of Linguistic
Data Management, The MIT Press, 2022.
[8] M. Jouste, Tullâčalmaaš kirdâččij 'tulisilmillä lenteli' - Inarinsaamelainen 1900-luvun alun
musiikkikulttuuri paikallisen perinteen ja ympäröivien kulttuurien vuorovaikutuksessa. [The One
Who Flew with the Fire eyes - The Musical Culture of the Aanaar Sámi People in the Interaction
of the Local Tradition and the Neighbouring Cultures]. Acta Universitatis Tamperensis 1650,
Tampere University Press, 2011. urn:isbn:978-951-44-8551-0.
[9] N. Partanen, R. Blokland and M. Rießler, A pseudonymization method for language documentation
corpora: an experiment with spoken Komi, in: 6th International Workshop on Computational
Linguistics of Uralic Languages, January 10–11 2020, Vienna, Austria, 2020, pp. 1–8.
[10] ELAN (Version 6.3), Max Planck Institute for Psycholinguistics, Nijmegen, The Language
Archive, 2022. URL: https://archive.mpi.nl/tla/elan.
[11] C. Gerstenberger, N. Partanen, M. Rießler and J. Wilbur, Instant Annotations: Applying NLP
Methods to the Annotation of Spoken Language Documentation Corpora, in: International
Workshop for Computational Linguistics of Uralic Languages. The Association for Computational
Linguistics, 2017, pp. 25–36.
[12] J. Wilbur, ELAN as a search engine for hierarchically structured, tagged corpora, in: Proceedings
of the Fifth International Workshop on Computational Linguistics for Uralic Languages, 2019, pp.
90–103.
[13] J. Rueter, N. Partanen, M. Hämäläinen and T. Trosterud, Overview of open-source morphology
development for the Komi-Zyrian language: Past and future, in: Proceedings of the Seventh
International Workshop on Computational Linguistics of Uralic Languages, The Association for
Computational Linguistics, 2021.
[14] L. Borin, M. Forsberg and J. Roxendal, Korp – the corpus infrastructure of Språkbanken, in:
Proceedings of the Eighth International Conference on Language Resources and Evaluation
(LREC'12), 2012, pp. 474–478.
[15] S. Evert, The CQP query language tutorial, IMS Stuttgart, CWB version 2.2.b90, 2005.