Ainm.ie: Breathing new life into a canonical collection of Irish-language biographies

Brian Ó Raghallaigh, Gearóid Ó Cleircín
Dublin City University
Dublin, Ireland
brian.oraghallaigh@dcu.ie, gearoid.ocleircin@dcu.ie

Abstract

In this paper we present the Ainm.ie online collection of Irish-language biographies. This collection is a product of a project to retro-digitise the Beathaisnéis series, published between 1986 and 2007, as well as ongoing biographical work to expand and enrich the collection. The Beathaisnéis series comprised biographical accounts of 1,650 lives and an additional 520 amendments and supplementary articles, written in Irish. Persons were chosen for inclusion in this series according to their relevance to the Irish-language world. This canonical collection is an invaluable research tool for Irish-language scholars, historians and others, but the print volumes risked becoming obsolete and inaccessible. As well as producing a digital version of the original texts, a key aim of the Ainm.ie project has been to ensure the continuity of biographical research in Irish by providing an online platform for publication. This paper introduces the Beathaisnéis collection and explains its context. It goes on to describe the digitisation process and the editorial work carried out to enrich the digital version, as well as the motivation for this version. Finally, it discusses how the project has facilitated contemporary biographical research in Irish.

Keywords: Irish language, digitisation, biographical dictionary

1. Introduction

In this paper we present the Ainm.ie online collection of Irish-language biographies. This collection is a product of a project to retro-digitise the Beathaisnéis series (Breathnach & Ní Mhurchú, 1986-2007), written and published between 1986 and 2007, as well as ongoing biographical work to expand and enrich the collection. With additions since initial digitisation and publication online, the collection now comprises 1,720 biographies, with a further 10-15 being added annually.

In addition to presenting the digitisation project and the resulting digital resource, we will motivate the creation of this digital resource by looking at the advantages of the digital version of the collection over the original print version.

While the Ainm.ie website is bilingual, the biographical accounts are available in Irish only. The site contains a number of browse and search facilities, which draw on a limited set of metadata stored for each biography. This metadata was added as part of the Ainm.ie project, as described in Section 4.

The Ainm.ie project is one of multiple research projects being carried out by Fiontar that involve the identification of valuable non-digital language resources, their digitisation where necessary, and the application of web, database, and language technology to these resources to widen access and availability, and to increase effectiveness and usability (Ó Raghallaigh & Měchura, 2014).

2. Beathaisnéis: some context

The original Beathaisnéis series comprises biographical accounts of 1,650 lives and an additional 520 amendments and supplementary articles, written in Irish. Persons were chosen for inclusion in this series according to their relevance to the Irish-language world. Some of the persons are nationally renowned and also appear in other national biographical resources, but most are not widely known outside of the small Irish-language community. The Beathaisnéis project can therefore be seen as an alternative dictionary of national(ist) biography, using connection with the Irish language as a yardstick for inclusion. The timeline covered, from the 17th century to the present day, encompasses a period during which Irish went from being the dominant language of the country to the language of a socially and geographically disparate minority. The persons included reflect this, with 17th century chieftains, scribes and theologians rubbing shoulders with 19th century revivalists and revolutionaries as well as modern-day folk singers, academics and language activists.

The original authors of the nine-volume print series, Diarmuid Breathnach and Máire Ní Mhurchú, were colleagues in the archives of the Irish state broadcaster RTÉ. They were often asked to provide biographical information on relatively well-known figures in the Irish-language community, particularly for obituaries, and they became increasingly aware of the lack of an authoritative biographical dictionary. In order to fill in the blanks they regularly had to do basic biographical research themselves, contacting relatives and tracking down birth and death records. Eventually, in 1979, they decided to start work on a biographical dictionary themselves with the intention of covering the 100-year period from 1882 to 1982 (Breathnach & Ní Mhurchú, 2001:17). This was published in five volumes between 1986 and 1997. Breathnach and Ní Mhurchú subsequently went on to expand the scope of the project to take in the periods from 1560 to 1881 and from 1983 to 2007. It was while working on the final published volume in the early 2000s that they began to think about passing on the responsibility to a new generation of biographers. They were extremely enthusiastic about the potential of a digital version and provided a significant amount of information and support during the early years of the project while continuing to draft new biographies.

3. Advantages of the digital version

The Ainm.ie project was inspired by the digitisation of other canonical biographical resources like the Oxford Dictionary of National Biography¹ and the Australian Dictionary of Biography², which proved that the move from print to digital could make such collections more accessible and potentially increase their user base. These projects also highlighted the kind of added value that the digital edition could bring, such as quicker updates, regular thematic features and the potential for using the biographical data in new and interesting ways.

The initial application for funding for the Ainm.ie project in 2009 coincided with the publication of the nine-volume Dictionary of Irish Biography (McGuire & Quinn, 2009), which was made available concurrently in an online version.³ Advice was sought from researchers in the Royal Irish Academy, where the Dictionary of Irish Biography (DIB) project is based, regarding best practice in creating an online biographical collection.

¹ http://www.oxforddnb.com/ Accessed on 24 June 2015.
² http://adb.anu.edu.au/ Accessed on 24 June 2015.
³ http://dib.cambridge.org/ Accessed on 24 June 2015.

3.1 Accessibility

It became clear from examining similar online collections and from talking to colleagues in the field that a digital version of Beathaisnéis could offer some substantial benefits. The most obvious of these was the possibility of making the material more accessible. Breathnach and Ní Mhurchú had deliberately published the collection in relatively small volumes with the intention of setting themselves achievable targets. Another benefit for the authors was that publication of subsequent volumes in the series allowed them to include corrections and additions to previous editions (Breathnach & Ní Mhurchú, 2001:25). However, the reality of a nine-volume collection published over a twenty-year period was that volumes regularly went out of print, a fact that was exacerbated by the limited size of the Irish-language publishing market. A freely accessible digital edition would make the entire collection available to all.

3.2 New possibilities

Creating a digital version of the collection would enhance its usability. Full text search and clickable cross-references would undoubtedly allow users to drill down into the collection more quickly than before. Other possibilities such as named entity extraction and network analysis would also be opened up by the creation of a digital version of the collection.

Making the digital version available online would open up the potential to link the biographies with equivalents in other online collections, such as the DIB, and to link metadata to other online resources.

3.3 Public interaction

Putting the collection online would open up the editorial process, allowing members of the public to suggest inclusions, to highlight errors and to provide various other types of feedback. This has been encouraged and facilitated by the use of Twitter and Facebook accounts to share news and features such as the Biography of the week.

4. Retro-digitisation

The first stage of the Ainm.ie project involved the retro-digitisation of the nine volumes of the Beathaisnéis series. Volumes 5, 6, 8 and 9 were made available by the publishers in a QuarkXPress publishing format that could be exported to Microsoft DOC format. These volumes were exported in this way, before being checked, exported to text, cleaned and processed for publication online.

Volumes 1, 2, 3, 4 and 7, which were not available in any digital format from which text could be extracted, were scanned and converted to Microsoft DOCX format using OCR, before being checked, exported to text, cleaned and processed. Scanning and OCR were carried out by outside contractors. Checking the texts that were created using OCR involved reinstating characters lost or misinterpreted during the automatic recognition stage.

Before exporting the volumes to text, bold and italic text formatting in the DOC and DOCX documents was converted to a form of markdown that could be retained after exporting to text. Markdown is a plain text formatting syntax.⁴ Our version of markdown involved enclosing bold formatted text between asterisks (e.g. *this is a bold example*), and enclosing italic formatted text between plusses (e.g. +this is an italic example+).

⁴ http://daringfireball.net/projects/markdown/ Accessed on 24 June 2015.

The volumes were then exported to text, and some programmatic cleaning was carried out, e.g. spurious line breaks and superfluous white space were removed. Once cleaned, individual biographies were extracted from the volumes in text format, and saved as individual text files. The individual biographies were then processed.

4.1 Pre-processing

Before the biographies were added to the Ainm.ie database, a number of pre-processing tasks were carried out. These tasks included the extraction of basic metadata and the insertion of cross-references.

Firstly, each file was assigned a unique identifier. Each file was then converted to a simple XML format which comprised a header containing metadata relating to the article and a body containing the biography text.

Basic metadata was then added to each file. Firstly, global metadata regarding the volume and collection was inserted. Secondly, each person's first name, surname, date of birth, and date of death, where given, were parsed and extracted from the first line of each source file, and inserted into the metadata header of the XML file.

Legacy textual cross-references, i.e. "[q.v.]", "[B1]" (i.e. Beathaisnéis/Volume 1), "[B2]", etc., were then removed and replaced with cross-references marked up with a target identifier, i.e. the unique ID number of the target biography. These new cross-references were created programmatically by searching for each name for which there was a biography in the collection. Where a match was found, it was tagged with the target identifier. These cross-references were subsequently manually verified.

Further named entities were then searched for and tagged in the body of each biography. Placenames, as well as a closed set of publications, organisations, educational institutions, professions and political parties, were tagged during this stage of pre-processing. Some of the lists of named entities were based on indexes included in the Beathaisnéis series; others were compiled specifically for this purpose.

Placenames found were tagged with target identifiers from Logainm.ie, the Placenames Database of Ireland⁵, the authoritative source for Irish toponymic data, and a dataset also developed and hosted by Fiontar, in conjunction with the Placenames Branch of the Government of Ireland. Place objects in Logainm.ie contain toponymic and geographic data, as well as links to other geographical databases, such as GeoNames. Tagging of placenames was done programmatically by searching for each placename in the Logainm.ie database in each of the biographies. Base, mutated and inflected forms of each placename were searched for using a linguistically aware search algorithm. In cases of ambiguity, where multiple places in Logainm.ie had the same name, all possible references were added, and the correct one was selected by hand afterwards.

⁵ http://www.logainm.ie/en/ Accessed on 24 June 2015.

4.2 Editorial processing

The files were then committed to a central Subversion data repository to which an editorial team was granted access. Editors worked on local working copies of the repository, and committed changes as they worked. Editors worked on the XML files using a locally installed XML editor. A stylesheet was developed to facilitate user-friendly editing.

The editorial team enhanced the collection in a number of ways. Firstly, all automatic pre-processing was checked, and OCR errors were corrected. In addition, a style guide was developed for the digital edition in an attempt to standardise items such as references, quotations, dates and numbers as well as certain spelling and grammatical issues. The original volumes were published over a twenty-year period and thus contained a certain number of inconsistencies that could be cleaned up. The style guide is now circulated to new contributors to ensure consistency.

The most significant editorial enhancement was the integration of supplementary notes into the primary biographies. As mentioned in Section 3, the authors had included new information relating to over 500 biographies in appendices at the end of each volume. This allowed them to include new research that had come to light since the primary biography was published and also to correct any factual inaccuracies. The digital edition provided an opportunity to amend the relevant accounts to reflect this additional information. Careful redrafting was necessary in some instances where the supplementary information was extensive, and the original authors were consulted when appropriate. This element of the editorial process continues today as newly published research is reviewed and new information relevant to a biography is added. The editorial team also accepts submissions from the public via email and corrects inaccuracies or adds minor details when verified.

4.3 Post-processing

Once all biographies had been checked and enhanced by the editorial team, the collection of biographies was prepared for publication online. This stage in the project involved the development of a tool to export the collection from the repository into a purpose-built relational database.

For each biography in the collection, the tool extracts the metadata from the article's XML header before inserting it into the database in normalised form. The tool then extracts and cleans the XML body before inserting it into the database. The tool is now run weekly to update the Ainm.ie database.

5. Tools and resources

Drawing on Fiontar's experiences from the Téarma.ie⁶ and Logainm.ie projects, web and database technologies were harnessed to publish the biographies online. A web application was built to present the biographies in a user-friendly way to a new audience.

⁶ http://www.tearma.ie/Home.aspx Accessed on 24 June 2015.

The Ainm.ie web application comprises a home page, an information section, a number of tools for browsing and searching the collection, and a biography viewer. The home page also includes a Biography of the week widget which can be embedded on other sites.

The first of the browsing tools is the alphabetical list. This tool groups the biographies alphabetically according to surname, and comprises a paging tool to browse the letters of the alphabet. One of the novel aspects of this tool is that it lists women under both the feminine and masculine forms of their surnames. For example, the biography of Áine Ní Raghallaigh (1868 - 1942) will be listed both under “N” and under “O”, amongst instances of the masculine form of that surname, i.e. “Ó Raghallaigh”. This feature is language specific.

The second browsing tool is the themes tool. This tool allows users to generate lists of biographies that share named entities. This tool uses the tagged named entities in the body of the biographies to build visual tag clouds. The named entities include placenames, publications, organisations, educational institutions, professions and political parties, all of which were tagged and verified in the pre-processing and editing stage. The tag clouds are rebuilt each time the database is updated.

The third browsing tool is the timeline tool. This tool groups the biographies by birth and death dates. Once a year is selected from the timeline, a list of persons born in that year as well as a list of persons who died that year are presented to the user.

Additional browsing tools are incorporated into the biography viewer, in the right-hand column. The first is a Wikipedia-style infobox. This infobox contains links to other persons that share an occupation with the current person. The second tool lists persons in the collection with the same surname as the person being viewed. This tool is linguistically aware in that it will list both men and women with the same surname. The third tool lists biographies that contain cross-references to the current biography. Finally, all cross-references from the current biography to other biographies in the collection, or to places in the Placenames Database of Ireland (Logainm.ie), are clickable hyperlinks. These links are created during the transformation of the biography from database entry to web page.

Finally, the full text of all biographies in the collection can be searched using the search tool. This tool can be accessed from the home page.

6. Continuity

A central concern of this project from day one was ensuring the continuity of Ainm.ie as an authoritative Irish-language biographical resource. As mentioned in Section 2, the original authors were keen to hand over the responsibility to a younger generation of researchers, so it was important to create a sustainable structure. To this end, a panel of ‘joint-editors’ was established in 2013 to write new biographies and to provide information regarding the update of existing biographies in the collection. This editorial panel produces 10-15 new biographies per year. A shortlist of candidates for biography is agreed upon at an annual meeting between the joint-editors and the publisher, with each joint-editor being allocated a number of biographies to work on. The texts are then processed and published by Fiontar.

7. Future plans

Fiontar currently has a limited amount of funding to host and maintain the website, which makes it somewhat difficult to plan major developments. We would like to further develop the site in a number of ways, with a view to strengthening links with other projects and resources, and thus enhancing the user experience. We hope to enhance the search and browsing tools by incorporating the Irish Surnames Index we are developing as part of the Dúchas.ie project, a collaboration with University College Dublin to digitise the National Folklore Collection of Ireland (Ó Cleircín et al., 2014). This resource would facilitate the suggestion of related biographies based on relationships between different surnames. We also intend to link the entries in this collection, where possible, to related entries in other collections and databases. We undertook a comparable project with Logainm.ie, using linked data to connect places in the Placenames Database of Ireland with places in other datasets such as GeoNames (Lopes et al., 2014). Finally, we plan to redesign the home page of the site to enhance usability. We also plan to enhance the editorial experience by developing web-based editorial tools which would supersede the current setup, which involves offline editing of individual XML files checked out from a repository.

8. Acknowledgements

The project is a partnership between Cló Iar-Chonnacht, an Irish-language specialist publisher that holds the copyright to the material, and Fiontar, Dublin City University, which developed the technical solution described in this paper. Funding for the project was provided by the Irish government.

9. References

Breathnach, D. & Ní Mhurchú, M. (1986-2007). Beathaisnéis (9 volumes). Dublin: An Clóchomhar.

Breathnach, D. & Ní Mhurchú, M. (2001). 1882-1982 Beathaisnéis: Fiontar Taighde. Studia Hibernica, 31, pp. 17-25.

Lopes, N., Grant, R., Ó Raghallaigh, B., Ó Carragáin, E., Collins, S., & Decker, S. (2014). Linked Logainm: Enhancing Library Metadata using Linked Data of Irish Place Names. Communications in Computer and Information Science, 416, Theory and Practice of Digital Libraries, pp. 65-76.

McGuire, J. & Quinn, J. (Eds.) (2009). Dictionary of Irish Biography. Cambridge: Cambridge University Press.

Ó Cleircín, G., Bale, A. & Ó Raghallaigh, B. (2014). Dúchas.ie: Ré Nua i Stair Chnuasach Bhéaloideas Éireann. Béaloideas, 82, pp. 85-99.

Ó Raghallaigh, B. & Měchura, M. B. (2014). Developing high-end reusable tools and resources for Irish-language terminology, lexicography, onomastics (toponymy), folkloristics, and more, using modern web and database technologies. Proceedings of the First Celtic Language Technology Workshop (CLTW), 23 August 2014, Dublin, pp. 66-70.
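As an illustration of the text cleaning and the lightweight markup described in Section 4 (bold text between asterisks, italic text between plusses), the following Python sketch shows one way such steps could look. It is not the project's code: the function names are hypothetical, the paragraph-splitting rule (blank lines separate paragraphs) is an assumption, and the paper does not say how the markers are later rendered, so the conversion to HTML tags is assumed purely for illustration.

```python
import re


def clean_text(text: str) -> str:
    """Remove spurious line breaks and superfluous white space.

    Assumption: blank lines separate paragraphs; any other line break
    is treated as an export artefact and replaced by a single space.
    """
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # str.split() with no argument splits on any run of white space.
        joined = " ".join(paragraph.split())
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)


def markers_to_html(text: str) -> str:
    """Turn *bold* and +italic+ markers into HTML tags (assumed target)."""
    text = re.sub(r"\*(.+?)\*", r"<b>\1</b>", text)
    text = re.sub(r"\+(.+?)\+", r"<i>\1</i>", text)
    return text


if __name__ == "__main__":
    raw = "A  line\nbroken   here.\n\n\nNext\tparagraph. "
    print(clean_text(raw))
    print(markers_to_html("*this is a bold example* and +this is an italic example+"))
```

A real pipeline would also need to guard against literal asterisks or plus signs in the source text, which this sketch does not attempt.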
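The metadata step described in Section 4.1 (a unique identifier, global volume and collection metadata, and name and date fields parsed from the first line of each source file) can be sketched in Python as below. This is a minimal sketch under stated assumptions: the paper gives neither the layout of the first line nor the XML schema, so the "Surname, Forename (birth-death)" pattern, the element names and the function name are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical first-line layout: "Surname, Forename (1868-1942)".
# The date part and each individual year are optional ("where given").
FIRST_LINE = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<forename>[^(]+?)\s*"
    r"(?:\((?P<birth>\d{4})?\s*[-–]\s*(?P<death>\d{4})?\))?\s*$"
)


def build_article(article_id: int, volume: int, source: str) -> ET.Element:
    """Wrap one plain-text biography in a header/body XML structure."""
    first_line, _, body_text = source.partition("\n")
    article = ET.Element("article", id=str(article_id))

    header = ET.SubElement(article, "header")
    # Global metadata about the collection and volume.
    ET.SubElement(header, "collection").text = "Beathaisnéis"
    ET.SubElement(header, "volume").text = str(volume)

    # Per-person metadata parsed from the first line, where present.
    match = FIRST_LINE.match(first_line.strip())
    if match:
        for field in ("surname", "forename", "birth", "death"):
            value = match.group(field)
            if value:
                ET.SubElement(header, field).text = value.strip()

    ET.SubElement(article, "body").text = body_text.strip()
    return article


if __name__ == "__main__":
    sample = "Ní Raghallaigh, Áine (1868-1942)\nBiography text goes here."
    print(ET.tostring(build_article(1, 1, sample), encoding="unicode"))
```

Keeping the parsed fields in a separate header element mirrors the header/body split described in the paper and makes the later export to a relational database a matter of reading the header alone.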
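The alphabetical list described in Section 5, which files women under both the feminine and the masculine form of their surname, could be approximated with a sketch like the one below. The rules are a deliberate simplification and an assumption on our part, not the site's algorithm: feminine prefixes are mapped to masculine ones, the lenition that follows a feminine prefix is undone by dropping the "h", and an "h" is prefixed to vowel-initial stems after "Ó". Real surname data has exceptions that these rules do not cover, and all names in the code are hypothetical.

```python
import unicodedata
from typing import List

# Simplified prefix mapping (assumption, not the project's rule set).
FEMININE_TO_MASCULINE = {"Ní": "Ó", "Uí": "Ó", "Nic": "Mac", "Mhic": "Mac"}
LENITABLE = "bcdfgmpst"
VOWELS = "aeiouáéíóú"


def masculine_form(surname: str) -> str:
    """Return the masculine form of a feminine surname, else the input."""
    prefix, _, stem = surname.partition(" ")
    masculine_prefix = FEMININE_TO_MASCULINE.get(prefix)
    if masculine_prefix is None or not stem:
        return surname
    # Undo lenition after the feminine prefix, e.g. "Mhurchú" -> "Murchú".
    if len(stem) > 1 and stem[0].lower() in LENITABLE and stem[1] == "h":
        stem = stem[0] + stem[2:]
    # "Ó" prefixes "h" to a stem that begins with a vowel.
    if masculine_prefix == "Ó" and stem[0].lower() in VOWELS:
        stem = "h" + stem
    return f"{masculine_prefix} {stem}"


def index_letter(surname: str) -> str:
    """First letter with any accent stripped, so "Ó" files under "O"."""
    return unicodedata.normalize("NFD", surname[0])[0].upper()


def index_letters(surname: str) -> List[str]:
    """All letters of the alphabetical list under which a surname appears."""
    letters = {index_letter(surname), index_letter(masculine_form(surname))}
    return sorted(letters)


if __name__ == "__main__":
    print(masculine_form("Ní Raghallaigh"), index_letters("Ní Raghallaigh"))
```

With these rules the example from the paper, Áine Ní Raghallaigh, is filed under both "N" and "O", alongside "Ó Raghallaigh".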