=Paper= {{Paper |id=Vol-1399/paper4 |storemode=property |title=Ainm.ie: Breathing New Life into a Canonical Collection of Irish-language Biographies |pdfUrl=https://ceur-ws.org/Vol-1399/paper4.pdf |volume=Vol-1399 |dblpUrl=https://dblp.org/rec/conf/bd/RaghallaighC15 }} ==Ainm.ie: Breathing New Life into a Canonical Collection of Irish-language Biographies== https://ceur-ws.org/Vol-1399/paper4.pdf
       Ainm.ie: Breathing new life into a canonical collection of Irish-language
                                     biographies
                                      Brian Ó Raghallaigh, Gearóid Ó Cleircín
                                                     Dublin City University
                                                        Dublin, Ireland
                                      brian.oraghallaigh@dcu.ie, gearoid.ocleircin@dcu.ie

                                                               Abstract
In this paper we present the Ainm.ie online collection of Irish-language biographies. This collection is a product of a project to
retro-digitise the Beathaisnéis series, published between 1986 and 2007, as well as ongoing biographical work to expand and enrich the
collection. The Beathaisnéis series comprised biographical accounts of 1,650 lives and an additional 520 amendments and
supplementary articles, written in Irish. Persons were chosen for inclusion in this series according to their relevance to the
Irish-language world. This canonical collection is an invaluable research tool for Irish-language scholars, historians and others but the
print volumes risked becoming obsolete and inaccessible. As well as producing a digital version of the original texts, a key aim of the
Ainm.ie project has been to ensure the continuity of biographical research in Irish by providing an online platform for publication. This
paper introduces the Beathaisnéis collection and explains its context. It goes on to describe the digitisation process and the editorial
work carried out to enrich the digital version, as well as the motivation for this version. Finally, it discusses how the project has
facilitated contemporary biographical research in Irish.

Keywords: Irish language, digitisation, biographical dictionary



1. Introduction

In this paper we present the Ainm.ie online collection of Irish-language biographies. This
collection is a product of a project to retro-digitise the Beathaisnéis series (Breathnach & Ní
Mhurchú, 1986-2007), written and published between 1986 and 2007, as well as ongoing biographical
work to expand and enrich the collection. With additions since initial digitisation and publication
online, the collection now comprises 1,720 biographies, with a further 10-15 being added annually.

In addition to presenting the digitisation project and the resulting digital resource, we will
motivate the creation of this digital resource by looking at the advantages of the digital version
of the collection over the original print version.

While the Ainm.ie website is bilingual, the biographical accounts are available in Irish only. The
site contains a number of browse and search facilities, which draw on a limited set of metadata
stored for each biography. This metadata was added as part of the Ainm.ie project, as described in
Section 4.

The Ainm.ie project is one of multiple research projects being carried out by Fiontar that involve
the identification of valuable non-digital language resources, their digitisation where necessary,
and the application of web, database, and language technology to these resources to widen access
and availability, and to increase effectiveness and usability (Ó Raghallaigh & Měchura, 2014).

2. Beathaisnéis: some context

The original Beathaisnéis series comprises biographical accounts of 1,650 lives and an additional
520 amendments and supplementary articles, written in Irish. Persons were chosen for inclusion in
this series according to their relevance to the Irish-language world. Some of the persons are
nationally renowned and also appear in other national biographical resources but most are not
widely known outside of the small Irish-language community. The Beathaisnéis project can therefore
be seen as an alternative dictionary of national(ist) biography, using connection with the Irish
language as a yardstick for inclusion. The timeline covered, from the 17th century to the present
day, encompasses a period during which Irish went from being the dominant language of the country
to the language of a socially and geographically disparate minority. The persons included reflect
this, with 17th century chieftains, scribes and theologians rubbing shoulders with 19th century
revivalists and revolutionaries as well as modern day folk singers, academics and language
activists.

The original authors of the nine volume print series, Diarmuid Breathnach and Máire Ní Mhurchú,
were colleagues in the archives of the Irish state broadcaster RTÉ. They were often asked to
provide biographical information on relatively well-known figures in the Irish-language community,
particularly for obituaries, and they became increasingly aware of the lack of an authoritative
biographical dictionary. In order to fill in the blanks they regularly had to do basic biographical
research themselves, contacting relatives and tracking down birth and death records. Eventually, in
1979, they decided to start work on a biographical dictionary themselves with the intention of
covering the 100 year period from 1882 to 1982 (Breathnach & Ní Mhurchú, 2001:17). This was
published in five volumes between 1986 and 1997. Breathnach and Ní Mhurchú subsequently went on to
expand the scope of the project to take in the
periods from 1560 to 1881 and from 1983 to 2007. It was while working on the final published volume
in the early 2000s that they began to think about passing on the responsibility to a new generation
of biographers. They were extremely enthusiastic about the potential of a digital version and
provided a significant amount of information and support during the early years of the project
while continuing to draft new biographies.

3. Advantages of the digital version

The Ainm.ie project was inspired by the digitisation of other canonical biographical resources like
the Oxford Dictionary of National Biography¹ and the Australian Dictionary of Biography², which
proved that the move from print to digital could make such collections more accessible and
potentially increase their user base. These projects also highlighted the kind of added value that
the digital edition could bring, such as quicker updates, regular thematic features and the
potential for using the biographical data in new and interesting ways.

¹ http://www.oxforddnb.com/ Accessed on 24 June 2015.
² http://adb.anu.edu.au/ Accessed on 24 June 2015.

The initial application for funding for the Ainm.ie project in 2009 coincided with the publication
of the nine-volume Dictionary of Irish Biography (McGuire & Quinn, 2009), which was made available
concurrently in an online version.³ Advice was sought from researchers in the Royal Irish Academy,
where the Dictionary of Irish Biography (DIB) project is based, regarding best practice in creating
an online biographical collection.

³ http://dib.cambridge.org/ Accessed on 24 June 2015.

3.1 Accessibility

It became clear from examining similar online collections and from talking to colleagues in the
field that a digital version of Beathaisnéis could offer some substantial benefits. The most
obvious of these was the possibility of making the material more accessible. Breathnach and Ní
Mhurchú had deliberately published the collection in relatively small volumes with the intention of
setting themselves achievable targets. Another benefit for the authors was that publication of
subsequent volumes in the series allowed them to include corrections and additions to previous
editions (2001:25). However, the reality of a nine-volume collection published over a twenty-year
period was that volumes regularly went out of print, a fact that was exacerbated by the limited
size of the Irish-language publishing market. A freely accessible digital edition would make the
entire collection available to all.

3.2 New possibilities

Creating a digital version of the collection would enhance its usability. Full text search and
clickable cross-references would undoubtedly allow users to drill down into the collection more
quickly than before. Other possibilities such as named entity extraction and network analysis would
also be opened up by the creation of a digital version of the collection.

Making the digital version available online would open up the potential to link the biographies
with equivalents in other online collections, such as the DIB, and to link metadata to other online
resources.

3.3 Public interaction

Putting the collection online would open up the editorial process, allowing members of the public
to suggest inclusions, to highlight errors and to provide various other types of feedback. This has
been encouraged and facilitated by the use of Twitter and Facebook accounts to share news and
features such as the Biography of the week.

4. Retro-digitisation

The first stage of the Ainm.ie project involved the retro-digitisation of the nine volumes of the
Beathaisnéis series. Volumes 5, 6, 8 and 9 were made available by the publishers in a QuarkXPress
publishing format that could be exported to Microsoft DOC format. These volumes were exported in
this way, before being checked, exported to text, cleaned and processed for publication online.

Volumes 1, 2, 3, 4 and 7, which were not available in any digital format from which text could be
extracted, were scanned and converted to Microsoft DOCX format using OCR, before being checked,
exported to text, cleaned and processed. Scanning and OCR were carried out by outside contractors.
Checking the texts that were created using OCR involved the reinstating of characters lost or
misinterpreted during the automatic recognition stage.

Before exporting the volumes to text, bold and italic text formatting in the DOC and DOCX documents
was converted to a form of Markdown that could be retained after exporting to text. Markdown is a
plain text formatting syntax.⁴ Our version of Markdown involved enclosing bold formatted text
between asterisks (e.g. *this is a bold example*), and enclosing italic formatted text between
plusses (e.g. +this is an italic example+).

⁴ http://daringfireball.net/projects/markdown/ Accessed on 24 June 2015.

The volumes were then exported to text, and some programmatic cleaning was carried out, e.g.
spurious line breaks and superfluous white space were removed. Once cleaned, individual biographies
were extracted from the volumes in text format, and saved as individual text files. The individual
biographies were then processed.

4.1 Pre-processing

Before the biographies were added to the Ainm.ie database, a number of pre-processing tasks were
carried out. These tasks included the extraction of basic metadata
and the insertion of cross-references.

Firstly, each file was assigned a unique identifier. Each file was then converted to a simple XML
format which comprised a header containing metadata relating to the article and a body containing
the biography text.

Basic metadata was then added to each file. Firstly, global metadata regarding the volume and
collection was inserted. Secondly, each person's first name, surname, date of birth, and date of
death, where given, were parsed and extracted from the first line of each source file, and inserted
into the metadata header of the XML file.

Legacy textual cross-references, i.e. "[q.v.]", "[B1]" (i.e. Beathaisnéis/Volume 1), "[B2]", etc.,
were then removed and replaced with cross-references tagged/marked up with a target identifier,
i.e. the unique ID number of the target biography. These new cross-references were created
programmatically by searching for each name for which there was a biography in the collection.
Where a match was found, it was tagged with the target identifier. These cross-references were
subsequently manually verified.

Further named entities were then searched for and tagged in the body of each biography. Placenames,
as well as a closed set of publications, organisations, educational institutions, professions and
political parties, were tagged during this stage of pre-processing. Some of the lists of named
entities were based on indexes included in the Beathaisnéis series; others were compiled
specifically for this purpose.

Placenames found were tagged with target identifiers from Logainm.ie, the Placenames Database of
Ireland⁵, the authoritative source for Irish toponymic data, and a dataset also developed and
hosted by Fiontar, in conjunction with the Placenames Branch of the Government of Ireland. Place
objects in Logainm.ie contain toponymic and geographic data, as well as links to other geographical
databases, such as GeoNames. Tagging of placenames was done programmatically by searching for each
placename in the Logainm.ie database in each of the biographies. Base, mutated and inflected forms
of each placename were searched for using a linguistically aware search algorithm. In cases of
ambiguity, where multiple places in Logainm.ie had the same name, all possible references were
added, and the correct one was selected by hand afterwards.

⁵ http://www.logainm.ie/en/ Accessed on 24 June 2015.

4.2 Editorial processing

The files were then committed to a central Subversion data repository to which an editorial team
was granted access. Editors worked on local working copies of the repository, and committed changes
as they worked. Editors worked on the XML files using a locally installed XML editor. A stylesheet
was developed to facilitate user-friendly editing.

The editorial team enhanced the collection in a number of ways. Firstly, all automatic
pre-processing was checked, and OCR errors were corrected. In addition, a style guide was developed
for the digital edition in an attempt to standardise items such as references, quotations, dates
and numbers as well as certain spelling and grammatical issues. The original volumes were published
over a twenty-year period and thus contained a certain number of inconsistencies that could be
cleaned up. The style guide is now circulated to new contributors to ensure consistency.

The most significant editorial enhancement was the integration of supplementary notes into the
primary biographies. As mentioned in Section 3, the authors had included new information relating
to over 500 biographies in appendices at the end of each volume. This allowed them to include new
research that had come to light since the primary biography was published and also to correct any
factual inaccuracies. The digital edition provided an opportunity to amend the relevant accounts to
reflect this additional information. Careful redrafting was necessary in some instances where the
supplementary information was extensive, and the original authors were consulted when appropriate.
This element of the editorial process continues today as newly published research is reviewed and
new information relevant to a biography is added. The editorial team also accept submissions from
the public via email and correct inaccuracies or add minor details when verified.

4.3 Post-processing

Once all biographies had been checked and enhanced by the editorial team, the collection of
biographies was prepared for publication online. This stage in the project involved the development
of a tool to export the collection from the repository into a purpose-built relational database.

For each biography in the collection, the tool extracts the metadata from the article's XML header
before inserting it into the database in normalised form. The tool then extracts and cleans the XML
body before inserting it into the database. The tool is now run weekly to update the Ainm.ie
database.

5. Tools and resources

Drawing on Fiontar's experiences from the Téarma.ie⁶ and Logainm.ie projects, web and database
technologies were harnessed to publish the biographies online. A web application was built to
present the biographies in a user-friendly way to a new audience.

⁶ http://www.tearma.ie/Home.aspx Accessed on 24 June 2015.

The Ainm.ie web application comprises a home page, an information section, a number of tools for
browsing and searching the collection, and a biography viewer. The home page also includes a
Biography of the week widget which can be embedded on other sites.
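
To make the conversion step of Section 4.1 more concrete, the following Python sketch wraps one plain-text biography in a header/body XML structure and fills the header from the first line of the file. It is an illustration only: the paper does not give the actual heading layout or XML vocabulary, so the heading pattern ("Surname, Forename (born - died)"), the element names (`article`, `header`, `volume`, `forename`, `surname`, `born`, `died`, `body`), the function name `to_xml` and the identifier and volume number in the usage lines are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Assumed heading layout: "Surname, Forename (1868 - 1942)"; the dates are optional.
HEADING = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<forename>[^(]+?)\s*"
    r"(?:\(\s*(?P<born>\d{4})?\s*[-–]\s*(?P<died>\d{4})?\s*\))?\s*$"
)


def to_xml(article_id, raw_text, volume):
    """Wrap one plain-text biography in a hypothetical header/body XML structure."""
    first_line, _, rest = raw_text.partition("\n")
    match = HEADING.match(first_line.strip())
    if match is None:
        raise ValueError(f"Unrecognised heading: {first_line!r}")

    article = ET.Element("article", id=str(article_id))
    header = ET.SubElement(article, "header")
    # Global metadata about the source volume goes in first.
    ET.SubElement(header, "volume").text = str(volume)
    # Personal metadata is written only where the heading supplies it.
    for name in ("forename", "surname", "born", "died"):
        value = match.group(name)
        if value:
            ET.SubElement(header, name).text = value.strip()
    ET.SubElement(article, "body").text = rest.strip()
    return article


# Placeholder identifier, volume number and body text.
article = to_xml(1, "Ní Raghallaigh, Áine (1868 - 1942)\nSample body text.", volume=1)
undated = to_xml(2, "Ó Raghallaigh, Seán\nSample body text.", volume=1)
print(ET.tostring(article, encoding="unicode"))
```

Headings that do not fit the assumed pattern raise an error rather than being guessed at, which mirrors the paper's practice of checking all automatic pre-processing by hand.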

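
The mutation-aware placename search mentioned in Section 4.1 can be approximated as follows. This is a simplified sketch, not the algorithm used in the project: it generates only the initial mutations of Irish (lenition, eclipsis, and the h-, n- and t- prefixes) and ignores inflected forms such as genitives. It deliberately over-generates (for example, it lenites every initial "s"), which is harmless for matching because impossible forms never occur in the text. The function names, the sample sentence and the place identifiers are hypothetical.

```python
import re

LENITABLE = set("bcdfgmpst")
ECLIPSIS = {"b": "m", "c": "g", "d": "n", "f": "bh", "g": "n", "p": "b", "t": "d"}
VOWELS = set("aeiouáéíóú")


def mutated_forms(name):
    """Return the base form of a name plus its possible initial mutations."""
    first, rest = name[0], name[1:]
    low = first.lower()
    forms = {name}
    if low in LENITABLE:
        forms.add(first + "h" + rest)            # lenition: Corcaigh -> Chorcaigh
    if low in ECLIPSIS:
        forms.add(ECLIPSIS[low] + name)          # eclipsis: Corcaigh -> gCorcaigh
    if low == "s":
        forms.add("t" + name)                    # t-prefix: Sionainn -> tSionainn
    if low in VOWELS:
        forms.update({"h" + name, "n" + name})   # Éirinn -> hÉirinn, nÉirinn
    return forms


def tag_places(text, gazetteer):
    """Find placenames in text; gazetteer maps base names to lists of place IDs."""
    surface_to_base = {}
    for base in gazetteer:
        for form in mutated_forms(base):
            surface_to_base[form] = base
    # Longest alternatives first so multi-word names win over shorter ones.
    alternatives = sorted(surface_to_base, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(map(re.escape, alternatives)) + r")(?!\w)"
    )
    # Ambiguous names keep every candidate ID, to be resolved by hand later.
    return [
        (m.start(), m.group(), gazetteer[surface_to_base[m.group()]])
        for m in pattern.finditer(text)
    ]


# Hypothetical gazetteer: the IDs are placeholders, not real Logainm.ie identifiers.
GAZETTEER = {"Corcaigh": [1], "Gaillimh": [2], "Cill Mhuire": [3, 4]}
TEXT = "Rugadh i gCorcaigh é ach chaith sé tamall i nGaillimh agus i gCill Mhuire."
hits = tag_places(TEXT, GAZETTEER)
for start, surface, ids in hits:
    print(start, surface, ids)
```

Returning every candidate identifier for an ambiguous name matches the workflow described in the paper, where all possible references were added and the correct one was selected manually.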

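
The export step described in Section 4.3 (metadata from the XML header into normalised tables, followed by the cleaned body) might look roughly like the sketch below. The paper does not describe the database schema, the XML vocabulary or the implementation language of the export tool, so the table and column names, the element names, the use of SQLite and the sample article (including its identifier and place reference) are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical article file; element names, IDs and body text are placeholders.
SAMPLE = """<article id="42">
  <header>
    <forename>Áine</forename>
    <surname>Ní Raghallaigh</surname>
    <born>1868</born>
    <died>1942</died>
  </header>
  <body>Sample text mentioning <place ref="100">Baile Átha Cliath</place>
     and   some stray   white space.</body>
</article>"""

SCHEMA = """
CREATE TABLE biography (
    id INTEGER PRIMARY KEY, forename TEXT, surname TEXT,
    born INTEGER, died INTEGER, body TEXT);
CREATE TABLE place_ref (
    biography_id INTEGER REFERENCES biography(id), place_id INTEGER);
"""


def export_article(conn, xml_text):
    """Insert one article: header fields into columns, cleaned body as text."""
    root = ET.fromstring(xml_text)
    bio_id = int(root.get("id"))
    header = root.find("header")
    body = root.find("body")

    def year(tag):
        value = header.findtext(tag)
        return int(value) if value else None

    # Clean the body: drop inline tags and collapse runs of white space.
    plain = " ".join("".join(body.itertext()).split())
    conn.execute(
        "INSERT OR REPLACE INTO biography VALUES (?, ?, ?, ?, ?, ?)",
        (bio_id, header.findtext("forename"), header.findtext("surname"),
         year("born"), year("died"), plain),
    )
    # Tagged placenames go into their own table, one row per reference.
    conn.execute("DELETE FROM place_ref WHERE biography_id = ?", (bio_id,))
    conn.executemany(
        "INSERT INTO place_ref VALUES (?, ?)",
        [(bio_id, int(p.get("ref"))) for p in body.iter("place")],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
export_article(conn, SAMPLE)
print(conn.execute("SELECT surname, born, died FROM biography").fetchall())
```

Because existing rows are replaced rather than duplicated, a tool along these lines could be re-run on a schedule, in the spirit of the weekly update mentioned above.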
The first of the browsing tools is the alphabetical list. This tool groups the biographies
alphabetically according to the surname, and comprises a paging tool to browse the letters of the
alphabet. One of the novel aspects of this tool is that it lists women under both the feminine and
masculine forms of their surnames. For example, the biography of Áine Ní Raghallaigh (1868 - 1942)
will be listed both under “N” and under “O”, amongst instances of the masculine form of that
surname, i.e. “Ó Raghallaigh”. This feature is language specific.

The second browsing tool is the themes tool. This tool allows users to generate lists of
biographies that share named entities. This tool uses the tagged named entities in the body of the
biographies to build visual tag clouds. The named entities include placenames, publications,
organisations, educational institutions, professions and political parties, all of which were
tagged and verified in the pre-processing and editing stage. The tag clouds are rebuilt each time
the database is updated.

The third browsing tool is the timeline tool. This tool groups the biographies by birth and death
dates. Once a year is selected from the timeline, a list of persons born in that year as well as a
list of persons who died in that year are presented to the user.

Additional browsing tools are incorporated into the biography viewer, in the right-hand column. The
first is a Wikipedia-style infobox. This infobox contains links to other persons that share an
occupation with the current person. The second tool lists persons in the collection with the same
surname as the person being viewed. This tool is linguistically aware in that it will list both men
and women with the same surname. The third tool lists biographies that contain cross-references to
the current biography. Finally, all cross-references from the current biography to other
biographies in the collection, or to places in the Placenames Database of Ireland (Logainm.ie), are
clickable hyperlinks. These links are created during the transformation of the biography from
database entry to web page.

Finally, the full text of all biographies in the collection can be searched using the search tool.
This tool can be accessed from the home page.

6. Continuity

A central concern of this project from day one was ensuring the continuity of Ainm.ie as an
authoritative Irish-language biographical resource. As mentioned in Section 2, the original authors
were keen to hand over the responsibility to a younger generation of researchers, so it was
important to create a sustainable structure. To this end, a panel of ‘joint-editors’ was
established in 2013 to write new biographies and to provide information regarding the update of
existing biographies in the collection. This editorial panel produces 10-15 new biographies per
year. A shortlist of candidates for biography is agreed upon at an annual meeting between the
joint-editors and the publisher, with each joint-editor being allocated a number of biographies to
work on. The texts are then processed and published by Fiontar.

7. Future plans

Fiontar currently has a limited amount of funding to host and maintain the website, which makes it
somewhat difficult to plan major developments. We would like to further develop the site in a
number of ways, with a view to strengthening links with other projects and resources, and thus
enhancing the user experience. We hope to enhance the search and browsing tools by incorporating
the Irish Surnames Index we are developing as part of the Dúchas.ie project, a collaboration with
University College Dublin to digitise the National Folklore Collection of Ireland (Ó Cleircín et
al., 2014). This resource would facilitate the suggestion of related biographies based on
relationships between different surnames. We also intend to link the entries in this collection,
where possible, to related entries in other collections and databases. We undertook a comparable
project with Logainm.ie, using linked data to connect places in the Placenames Database of Ireland
with places in other datasets such as GeoNames (Lopes et al., 2014). Finally, we plan to redesign
the home page of the site to enhance usability. We also plan to enhance the editorial experience by
developing web-based editorial tools which would supersede the current setup, which involves
offline editing of individual XML files checked out from a repository.

8. Acknowledgements

The project is a partnership between Cló Iar-Chonnacht, an Irish-language specialist publisher that
holds the copyright to the material, and Fiontar, Dublin City University, who developed the
technical solution described in this paper. Funding for the project was provided by the Irish
government.

9. References

Breathnach, D. & Ní Mhurchú, M. (1986-2007). Beathaisnéis (9 volumes). Dublin: An Clóchomhar.
Breathnach, D. & Ní Mhurchú, M. (2001). 1882-1982 Beathaisnéis: Fiontar Taighde. Studia Hibernica,
  31, pp. 17-25.
Lopes, N., Grant, R., Ó Raghallaigh, B., Ó Carragáin, E., Collins, S., & Decker, S. (2014). Linked
  Logainm: Enhancing Library Metadata using Linked Data of Irish Place Names. Communications In
  Computer And Information Science, 416, Theory and Practice of Digital Libraries, pp. 65-76.
McGuire, J. & Quinn, J. (Eds.) (2009). Dictionary of Irish Biography. Cambridge: Cambridge
  University Press.
Ó Cleircín, G., Bale, A. & Ó Raghallaigh, B. (2014). Dúchas.ie: Ré Nua i Stair Chnuasach
  Bhéaloideas Éireann. Béaloideas, 82, pp. 85-99.
Ó Raghallaigh, B. & Měchura, M. B. (2014). Developing high-end reusable tools and resources for
  Irish-language terminology, lexicography, onomastics (toponymy), folkloristics, and more, using
  modern web and database technologies. Proceedings of the First Celtic Language Technology
  Workshop (CLTW), 23 August 2014, Dublin, pp. 66-70.