=Paper=
{{Paper
|id=Vol-1399/paper4
|storemode=property
|title=Ainm.ie: Breathing New Life into a Canonical Collection of Irish-language Biographies
|pdfUrl=https://ceur-ws.org/Vol-1399/paper4.pdf
|volume=Vol-1399
|dblpUrl=https://dblp.org/rec/conf/bd/RaghallaighC15
}}
==Ainm.ie: Breathing New Life into a Canonical Collection of Irish-language Biographies==
Brian Ó Raghallaigh, Gearóid Ó Cleircín
Dublin City University
Dublin, Ireland
brian.oraghallaigh@dcu.ie, gearoid.ocleircin@dcu.ie
Abstract
In this paper we present the Ainm.ie online collection of Irish-language biographies. This collection is a product of a project to
retro-digitise the Beathaisnéis series, published between 1986 and 2007, as well as ongoing biographical work to expand and enrich the
collection. The Beathaisnéis series comprised biographical accounts of 1,650 lives and an additional 520 amendments and
supplementary articles, written in Irish. Persons were chosen for inclusion in this series according to their relevance to the
Irish-language world. This canonical collection is an invaluable research tool for Irish-language scholars, historians and others but the
print volumes risked becoming obsolete and inaccessible. As well as producing a digital version of the original texts, a key aim of the
Ainm.ie project has been to ensure the continuity of biographical research in Irish by providing an online platform for publication. This
paper introduces the Beathaisnéis collection and explains its context. It goes on to describe the digitisation process and the editorial
work carried out to enrich the digital version, as well as the motivation for this version. Finally, it discusses how the project has
facilitated contemporary biographical research in Irish.
Keywords: Irish language, digitisation, biographical dictionary
1. Introduction

In this paper we present the Ainm.ie online collection of Irish-language biographies. This collection is a product of a project to retro-digitise the Beathaisnéis series (Breathnach & Ní Mhurchú, 1986-2007), written and published between 1986 and 2007, as well as ongoing biographical work to expand and enrich the collection. With additions since initial digitisation and publication online, the collection now comprises 1,720 biographies, with a further 10-15 being added annually.

In addition to presenting the digitisation project and the resulting digital resource, we motivate the creation of this resource by looking at the advantages of the digital version of the collection over the original print version.

While the Ainm.ie website is bilingual, the biographical accounts are available in Irish only. The site contains a number of browse and search facilities, which draw on a limited set of metadata stored for each biography. This metadata was added as part of the Ainm.ie project, as described in Section 4.

The Ainm.ie project is one of multiple research projects being carried out by Fiontar that involve the identification of valuable non-digital language resources, their digitisation where necessary, and the application of web, database, and language technology to these resources to widen access and availability, and to increase effectiveness and usability (Ó Raghallaigh & Měchura, 2014).

2. Beathaisnéis: some context

The original Beathaisnéis series comprises biographical accounts of 1,650 lives and an additional 520 amendments and supplementary articles, written in Irish. Persons were chosen for inclusion in this series according to their relevance to the Irish-language world. Some of the persons are nationally renowned and also appear in other national biographical resources, but most are not widely known outside of the small Irish-language community. The Beathaisnéis project can therefore be seen as an alternative dictionary of national(ist) biography, using connection with the Irish language as a yardstick for inclusion. The timeline covered, from the 17th century to the present day, encompasses a period during which Irish went from being the dominant language of the country to the language of a socially and geographically disparate minority. The persons included reflect this, with 17th-century chieftains, scribes and theologians rubbing shoulders with 19th-century revivalists and revolutionaries as well as modern-day folk singers, academics and language activists.

The original authors of the nine-volume print series, Diarmuid Breathnach and Máire Ní Mhurchú, were colleagues in the archives of the Irish state broadcaster RTÉ. They were often asked to provide biographical information on relatively well-known figures in the Irish-language community, particularly for obituaries, and they became increasingly aware of the lack of an authoritative biographical dictionary. In order to fill in the blanks they regularly had to do basic biographical research themselves, contacting relatives and tracking down birth and death records. Eventually, in 1979, they decided to start work on a biographical dictionary themselves with the intention of covering the 100-year period from 1882 to 1982 (Breathnach & Ní Mhurchú, 2001:17). This was published in five volumes between 1986 and 1997. Breathnach and Ní Mhurchú subsequently went on to expand the scope of the project to take in the
periods from 1560 to 1881 and from 1983 to 2007. It was while working on the final published volume in the early 2000s that they began to think about passing on the responsibility to a new generation of biographers. They were extremely enthusiastic about the potential of a digital version and provided a significant amount of information and support during the early years of the project while continuing to draft new biographies.

3. Advantages of the digital version

The Ainm.ie project was inspired by the digitisation of other canonical biographical resources like the Oxford Dictionary of National Biography [1] and the Australian Dictionary of Biography [2], which proved that the move from print to digital could make such collections more accessible and potentially increase their user base. These projects also highlighted the kind of added value that the digital edition could bring, such as quicker updates, regular thematic features and the potential for using the biographical data in new and interesting ways.

[1] http://www.oxforddnb.com/ Accessed on 24 June 2015.
[2] http://adb.anu.edu.au/ Accessed on 24 June 2015.

The initial application for funding for the Ainm.ie project in 2009 coincided with the publication of the nine-volume Dictionary of Irish Biography (McGuire & Quinn, 2009), which was made available concurrently in an online version. [3] Advice was sought from researchers in the Royal Irish Academy, where the Dictionary of Irish Biography (DIB) project is based, regarding best practice in creating an online biographical collection.

[3] http://dib.cambridge.org/ Accessed on 24 June 2015.

3.1 Accessibility

It became clear from examining similar online collections and from talking to colleagues in the field that a digital version of Beathaisnéis could offer some substantial benefits. The most obvious of these was the possibility of making the material more accessible. Breathnach and Ní Mhurchú had deliberately published the collection in relatively small volumes with the intention of setting themselves achievable targets. Another benefit for the authors was that publication of subsequent volumes in the series allowed them to include corrections and additions to previous editions (2001:25). However, the reality of a nine-volume collection published over a twenty-year period was that volumes regularly went out of print, a problem that was exacerbated by the limited size of the Irish-language publishing market. A freely accessible digital edition would make the entire collection available to all.

3.2 New possibilities

Creating a digital version of the collection would enhance its usability. Full-text search and clickable cross-references would undoubtedly allow users to drill down into the collection more quickly than before. Other possibilities such as named entity extraction and network analysis would also be opened up by the creation of a digital version of the collection.

Making the digital version available online would open up the potential to link the biographies with equivalents in other online collections, such as the DIB, and to link metadata to other online resources.

3.3 Public interaction

Putting the collection online would open up the editorial process, allowing members of the public to suggest inclusions, to highlight errors and to provide various other types of feedback. This has been encouraged and facilitated by the use of Twitter and Facebook accounts to share news and features such as the Biography of the week.

4. Retro-digitisation

The first stage of the Ainm.ie project involved the retro-digitisation of the nine volumes of the Beathaisnéis series. Volumes 5, 6, 8 and 9 were made available by the publishers in a QuarkXPress publishing format that could be exported to Microsoft DOC format. These volumes were exported in this way, before being checked, exported to text, cleaned and processed for publication online.

Volumes 1, 2, 3, 4 and 7, which were not available in any digital format from which text could be extracted, were scanned and converted to Microsoft DOCX format using OCR, before being checked, exported to text, cleaned and processed. Scanning and OCR were carried out by outside contractors. Checking the texts that were created using OCR involved reinstating characters lost or misinterpreted during the automatic recognition stage.

Before exporting the volumes to text, bold and italic text formatting in the DOC and DOCX documents was converted to a form of markdown that could be retained after exporting to text. Markdown is a plain text formatting syntax. [4] Our version of markdown involved enclosing bold formatted text between asterisks (e.g. *this is a bold example*), and enclosing italic formatted text between plusses (e.g. +this is an italics example+).

[4] http://daringfireball.net/projects/markdown/ Accessed on 24 June 2015.

The volumes were then exported to text, and some programmatic cleaning was carried out, e.g. spurious line breaks and superfluous white space were removed. Once cleaned, individual biographies were extracted from the volumes in text format, and saved as individual text files. The individual biographies were then processed.

4.1 Pre-processing

Before the biographies were added to the Ainm.ie database, a number of pre-processing tasks were carried out. These tasks included the extraction of basic metadata
and the insertion of cross-references.

Firstly, each file was assigned a unique identifier. Each file was then converted to a simple XML format which comprised a header containing metadata relating to the article and a body containing the biography text.

Basic metadata was then added to each file. Firstly, global metadata regarding the volume and collection was inserted. Secondly, each person's first name, surname, date of birth, and date of death, where given, were parsed and extracted from the first line of each source file, and inserted into the metadata header of the XML file.

Legacy textual cross-references, e.g. "[q.v.]", "[B1]" (i.e. Beathaisnéis, Volume 1), "[B2]", etc., were then removed and replaced with cross-references marked up with a target identifier, i.e. the unique ID number of the target biography. These new cross-references were created programmatically by searching for each name for which there was a biography in the collection. Where a match was found, it was tagged with the target identifier. These cross-references were subsequently manually verified.

Further named entities were then searched for and tagged in the body of each biography. Placenames, as well as a closed set of publications, organisations, educational institutions, professions and political parties, were tagged during this stage of pre-processing. Some of the lists of named entities were based on indexes included in the Beathaisnéis series; others were compiled specifically for this purpose.

Placenames found were tagged with target identifiers from Logainm.ie, the Placenames Database of Ireland [5], the authoritative source for Irish toponymic data, and a dataset also developed and hosted by Fiontar, in conjunction with the Placenames Branch of the Government of Ireland. Place objects in Logainm.ie contain toponymic and geographic data, as well as links to other geographical databases, such as GeoNames. Tagging of placenames was done programmatically by searching for each placename in the Logainm.ie database in each of the biographies. Base, mutated and inflected forms of each placename were searched for using a linguistically aware search algorithm. In cases of ambiguity, where multiple places in Logainm.ie had the same name, all possible references were added, and the correct one was selected by hand afterwards.

[5] http://www.logainm.ie/en/ Accessed on 24 June 2015.

4.2 Editorial processing

The files were then committed to a central Subversion data repository to which an editorial team was granted access. Editors worked on local working copies of the repository, and committed changes as they worked. Editors worked on the XML files using a locally installed XML editor. A stylesheet was developed to facilitate user-friendly editing.

The editorial team enhanced the collection in a number of ways. Firstly, all automatic pre-processing was checked, and OCR errors were corrected. In addition, a style guide was developed for the digital edition in an attempt to standardise items such as references, quotations, dates and numbers as well as certain spelling and grammatical issues. The original volumes were published over a twenty-year period and thus contained a certain number of inconsistencies that could be cleaned up. The style guide is now circulated to new contributors to ensure consistency.

The most significant editorial enhancement was the integration of supplementary notes into the primary biographies. As mentioned in Section 3, the authors had included new information relating to over 500 biographies in appendices at the end of each volume. This allowed them to include new research that had come to light since the primary biography was published and also to correct any factual inaccuracies. The digital edition provided an opportunity to amend the relevant accounts to reflect this additional information. Careful redrafting was necessary in some instances where the supplementary information was extensive, and the original authors were consulted when appropriate. This element of the editorial process continues today as newly published research is reviewed and new information relevant to a biography is added. The editorial team also accept submissions from the public via email and correct inaccuracies or add minor details when verified.

4.3 Post-processing

Once all biographies had been checked and enhanced by the editorial team, the collection of biographies was prepared for publication online. This stage in the project involved the development of a tool to export the collection from the repository into a purpose-built relational database.

For each biography in the collection, the tool extracts the metadata from the article's XML header before inserting it into the database in normalised form. The tool then extracts and cleans the XML body before inserting it into the database. The tool is now run weekly to update the Ainm.ie database.

5. Tools and resources

Drawing on Fiontar's experiences from the Téarma.ie [6] and Logainm.ie projects, web and database technologies were harnessed to publish the biographies online. A web application was built to present the biographies in a user-friendly way to a new audience.

[6] http://www.tearma.ie/Home.aspx Accessed on 24 June 2015.

The Ainm.ie web application comprises a home page, an information section, a number of tools for browsing and searching the collection, and a biography viewer. The home page also includes a Biography of the week widget which can be embedded on other sites.
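The database behind this web application is populated by the export tool described in Section 4.3. The following is a minimal sketch of that step, not the project's actual tool: it reads the metadata from an article's XML header, cleans the body, and writes both to a relational table. The element names (`article`, `header`, `forename`, `surname`, `born`, `died`, `body`, `place`), the `biography` table, the identifier values and the use of SQLite are all assumptions made for illustration; the paper does not specify the XML schema or the database engine.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical article layout. The name and dates are the example given in
# the paper; the id, the place reference and the body text are invented.
SAMPLE = """<article id="42">
  <header>
    <forename>Áine</forename>
    <surname>Ní Raghallaigh</surname>
    <born>1868</born>
    <died>1942</died>
  </header>
  <body>Téacs   samplach faoi
    <place ref="x1">áit</place>.</body>
</article>"""


def create_schema(conn):
    """Create the (assumed) single-table schema if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS biography ("
        "id INTEGER PRIMARY KEY, forename TEXT, surname TEXT, "
        "born INTEGER, died INTEGER, body TEXT)"
    )


def export_article(xml_text, conn):
    """Insert or update one article's metadata and cleaned body text."""
    root = ET.fromstring(xml_text)
    header = root.find("header")

    def year(tag):
        # Dates are only recorded "where given", so a missing one maps to NULL.
        value = header.findtext(tag)
        return int(value) if value else None

    # Flatten inline markup and collapse runs of white space in the body.
    body = " ".join("".join(root.find("body").itertext()).split())

    # INSERT OR REPLACE keeps repeated (e.g. weekly) runs idempotent.
    conn.execute(
        "INSERT OR REPLACE INTO biography VALUES (?, ?, ?, ?, ?, ?)",
        (int(root.get("id")), header.findtext("forename"),
         header.findtext("surname"), year("born"), year("died"), body),
    )


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    export_article(SAMPLE, connection)
    print(connection.execute("SELECT * FROM biography").fetchall())
```

A real export would presumably spread the metadata over several related tables and keep the tagged named entities rather than flattening them; the single table here is only meant to show the header-then-body flow.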
The first of the browsing tools is the alphabetical list. This tool groups the biographies alphabetically according to surname, and comprises a paging tool to browse the letters of the alphabet. One of the novel aspects of this tool is that it lists women under both the feminine and masculine forms of their surnames. For example, the biography of Áine Ní Raghallaigh (1868-1942) is listed both under “N” and under “O”, amongst instances of the masculine form of that surname, i.e. “Ó Raghallaigh”. This feature is language specific.

The second browsing tool is the themes tool. This tool allows users to generate lists of biographies that share named entities. It uses the tagged named entities in the body of the biographies to build visual tag clouds. The named entities include placenames, publications, organisations, educational institutions, professions and political parties, all of which were tagged and verified in the pre-processing and editing stages. The tag clouds are rebuilt each time the database is updated.

The third browsing tool is the timeline tool. This tool groups the biographies by birth and death dates. Once a year is selected from the timeline, a list of persons born in that year and a list of persons who died in that year are presented to the user.

Additional browsing tools are incorporated into the biography viewer, in the right-hand column. The first is a Wikipedia-style infobox. This infobox contains links to other persons that share an occupation with the current person. The second tool lists persons in the collection with the same surname as the person being viewed. This tool is linguistically aware in that it lists both men and women with the same surname. The third tool lists biographies that contain cross-references to the current biography. Finally, all cross-references from the current biography to other biographies in the collection, or to places in the Placenames Database of Ireland (Logainm.ie), are clickable hyperlinks. These links are created during the transformation of the biography from database entry to web page.

Finally, the full text of all biographies in the collection can be searched using the search tool. This tool can be accessed from the home page.

6. Continuity

A central concern of this project from day one was ensuring the continuity of Ainm.ie as an authoritative Irish-language biographical resource. As mentioned in Section 2, the original authors were keen to hand over the responsibility to a younger generation of researchers, so it was important to create a sustainable structure. To this end, a panel of ‘joint-editors’ was established in 2013 to write new biographies and to provide information regarding the update of existing biographies in the collection. This editorial panel produces 10-15 new biographies per year. A shortlist of candidates for biography is agreed upon at an annual meeting between the joint-editors and the publisher, with each joint-editor being allocated a number of biographies to work on. The texts are then processed and published by Fiontar.

7. Future plans

Fiontar currently has a limited amount of funding to host and maintain the website, which makes it somewhat difficult to plan major developments. We would like to further develop the site in a number of ways, with a view to strengthening links with other projects and resources, and thus enhancing the user experience. We hope to enhance the search and browsing tools by incorporating the Irish Surnames Index we are developing as part of the Dúchas.ie project, a collaboration with University College Dublin to digitise the National Folklore Collection of Ireland (Ó Cleircín et al., 2014). This resource would facilitate the suggestion of related biographies based on relationships between different surnames. We also intend to link the entries in this collection, where possible, to related entries in other collections and databases. We undertook a comparable project with Logainm.ie, using linked data to connect places in the Placenames Database of Ireland with places in other datasets such as GeoNames (Lopes et al., 2014). Finally, we plan to redesign the home page of the site to enhance usability. We also plan to enhance the editorial experience by developing web-based editorial tools which would supersede the current setup, which involves offline editing of individual XML files checked out from a repository.

8. Acknowledgements

The project is a partnership between Cló Iar-Chonnacht, an Irish-language specialist publisher that holds the copyright to the material, and Fiontar, Dublin City University, who developed the technical solution described in this paper. Funding for the project was provided by the Irish government.

9. References

Breathnach, D. & Ní Mhurchú, M. (1986-2007). Beathaisnéis (9 volumes). Dublin: An Clóchomhar.
Breathnach, D. & Ní Mhurchú, M. (2001). 1882-1982 Beathaisnéis: Fiontar Taighde. Studia Hibernica, 31, pp. 17-25.
Lopes, N., Grant, R., Ó Raghallaigh, B., Ó Carragáin, E., Collins, S., & Decker, S. (2014). Linked Logainm: Enhancing Library Metadata using Linked Data of Irish Place Names. Communications In Computer And Information Science, 416, Theory and Practice of Digital Libraries, pp. 65-76.
McGuire, J. & Quinn, J. (Eds.) (2009). Dictionary of Irish Biography. Cambridge: Cambridge University Press.
Ó Cleircín, G., Bale, A. & Ó Raghallaigh, B. (2014). Dúchas.ie: Ré Nua i Stair Chnuasach Bhéaloideas Éireann. Béaloideas, 82, pp. 85-99.
Ó Raghallaigh, B. & Měchura, M. B. (2014). Developing high-end reusable tools and resources for Irish-language terminology, lexicography, onomastics (toponymy), folkloristics, and more, using modern web and database technologies. Proceedings of the First Celtic Language Technology Workshop (CLTW), 23 August 2014, Dublin, pp. 66-70.