A Quantitative Analysis of Biographical Data from Ainm, the Irish-language Biographical Database

Úna Bhreathnach, Cathal Burke, Jeaic Mag Fhinn, Gearóid Ó Cleircín, Brian Ó Raghallaigh
Fiontar & Scoil na Gaeilge, Dublin City University
Dublin, Ireland
E-mail: Una.Bhreathnach@dcu.ie, Cathal.Burke@dcu.ie, Jeaic.MagFhinn@dcu.ie, Gearoid.OCleircin@dcu.ie, Brian.ORaghallaigh@dcu.ie

Abstract

This paper looks at some trends identifiable in the biographical data contained in the Ainm collection of Irish-language related biographies. The data structure is described and the reasons for its particular structure are outlined. The structured data is then analysed to identify some notable patterns and significant gaps in the Ainm biographical collection. These features and omissions are discussed in the context of the creation of both the original print biographical dictionary (the Beathaisnéis series) and the more recent digital version (www.ainm.ie).

Keywords: Mining biographies for structured information; quantitative analysis; biographical dictionaries; digitizing biographical data; Irish-language biography; Irish biography

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. The Ainm project: background

The Ainm (the Irish word for ‘name’) project is an online biographical database focused on people, mainly (although not exclusively) Irish, who had a connection to the Irish language. It is written in Irish and has been available online since 2011.

The database evolved from, and now significantly expands, the Beathaisnéis (‘biography’) series of published biographies (Breathnach & Ní Mhurchú 1986-2007). The authors of the Beathaisnéis series, Diarmuid Breathnach and Máire Ní Mhurchú, intended to create a dictionary of biography, using relevance to the Irish-language world as the main yardstick for inclusion, and with a strong focus on lives associated with the Gaelic Revival and the period 1882-1982, which are covered in five volumes (Breathnach & Ní Mhurchú 1986, 1990, 1992, 1994 & 1997). The scope was subsequently expanded, in three further volumes, to the previous periods, 1782-1881 and 1560-1881 (Breathnach & Ní Mhurchú 1999 & 2001) and to the subsequent period 1983-2002 (Breathnach & Ní Mhurchú 2002), with a further volume of supplements, amendments and indexes (Breathnach & Ní Mhurchú 2007).

The process by which the Beathaisnéis volumes were digitised, tagged and edited by Fiontar & Scoil na Gaeilge, DCU, is described in detail elsewhere (Ó Raghallaigh & Ó Cleircín 2015). Over 600 of the previously published biographies were updated to reflect corrections and additional information which came to light after publication. After the retirement of the Beathaisnéis authors from the project, and the digitisation of the print material, DCU and Cló Iar-Chonnacht (the publishers of the series) secured a small amount of funding to update and expand the biographies. A panel of contributors was established in 2013 to continue writing biographies, mainly of recently deceased subjects but also of overlooked lives, with between 10 and 15 new lives added annually.

In addition to new biographies, further content has been added to the website. Thematic essays are added annually to provide an introduction to different categories of biography (e.g. participants in the 1916 Rising; traditional singers; folklore collectors). Visualisation features have been developed too, in particular an interactive map displaying placenames tagged in the various lives. A feature for displaying the social networks of individuals is also in development.

The result to date is a collection of 1,756 biographies, with an average length of 1,223 words and 37 tags or cross-references in each. 1,652 of these biographies are from the original series and 104 have been added since 2010. These biographies overlap with the much larger English-language Dictionary of Irish Biography [1] (c. 420 also feature there). The Ainm database is used widely, with an average of 1,143 searches per day (14/10/2018 - 05/03/2019).

[1] dib.cambridge.org

2. The Ainm database: data structure

The biographical entries are stored as XML data (using the SQL Server XML Data Type) in a relational database. This allows us to store and modify the XML data in an efficient and transactional way. It also allows us to conveniently log changes and store versions of the biographical entries. Each entry in the biographies table comprises a unique ID, a text TITLE, and an XML document. The ID field is a permanent unique identifier within the database and can be used to access the biography over the web in HTML [2] or XML [3] format. The HTML is generated from the underlying XML using an XSL transformation.

3. The Ainm database: metadata

The XML data follows the TEI (Text Encoding Initiative) guidelines for the most part. In addition to the required elements (among them <forename/>, <surname/>, <birth/>, <death/> and <sex/>), <addName/>, <school/>, <university/> and <occupation/> are included as metadata in the <header/> element, where known. These metadata are displayed in the biography title and infobox on the public website [4]. The dates are also used in the timeline [5] and thematic tag cloud [6] tools.

In addition to the entry-level (biographical) metadata contained in the header, certain entity types have been tagged inline in the biography <text/> element. These include placenames (<placeName/>), publications (<opus/>), Gaelic League branches (<conradh/>), educational institutions (<eduInst/>) and political parties (<party/>). This information is used to create the aforementioned tag clouds. Placename tags include a reference to the Placename Database of Ireland [7] where the place is in Ireland and a reference to GeoNames [8] where the place is outside of Ireland. People (<persName/>) are also tagged in the <text/> element.
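As a rough illustration of how an entry of this kind might be read programmatically, the following Python sketch parses a small invented document and collects the header metadata and the inline tags. The element names are those listed above; the root element name, the nesting, the sample person and values, and the helper names are assumptions made for the example and are not taken from the actual Ainm schema.

```python
import xml.etree.ElementTree as ET

# Invented sample entry. Only the element names come from the paper; the
# <bio> root, the nesting and every value here are hypothetical.
SAMPLE = """<bio>
  <header>
    <forename>Seán</forename>
    <surname>Ó Samplach</surname>
    <birth>1850</birth>
    <death>1920</death>
    <sex>m</sex>
    <occupation>teacher</occupation>
    <occupation>writer</occupation>
  </header>
  <text>Born in <placeName>Corcaigh</placeName>, he wrote
  <opus>Leabhar Samplach</opus> with <persName>Máire Ní Shamplach</persName>.</text>
</bio>"""

# Inline entity types named in the paper.
INLINE_TAGS = ("placeName", "persName", "opus", "conradh", "eduInst", "party")


def header_metadata(root):
    """Return header fields as {element name: [values]} (fields may repeat)."""
    fields = {}
    for child in root.find("header"):
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields


def inline_entities(root):
    """Return the text of every inline tag in <text>, grouped by tag name."""
    found = {tag: [] for tag in INLINE_TAGS}
    for element in root.find("text").iter():
        if element.tag in found:
            found[element.tag].append("".join(element.itertext()).strip())
    return found


root = ET.fromstring(SAMPLE)
print(header_metadata(root))
print(inline_entities(root))
```

Grouping header values in lists rather than single strings is a design choice for the sketch: it lets a subject carry more than one occupation without one value overwriting another.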
People tags include a cross-reference where the person is in the database. A <bibliography/> element is included after the <text/> element in some of the newer biographies.

Figure 1: TEI elements used in the biography of Pádraig Mac Piarais

The original biographical entries digitised from Beathaisnéis Volumes 1–9 (Breathnach & Ní Mhurchú, 1986–2007) have ID numbers in the range 1–1999. Stubs (short entries) from the original volumes are in the range 2000–2999. New biographical entries written and added between 2010 and 2016 are in the range 3000–3999, and entries added since 2017 are numbered 5000+.

Figure 2: Web view of the biography of Pádraig Mac Piarais

For the original collection of biographies, all inline tagging (i.e. markup of named entities in the body text of the biographical entry) was done automatically using a purpose-built tagger written in Python. The tagger searched for and tagged named entities (i.e. names, placenames, publications, institutions, and political parties). The tagger included a custom NLP function to deal with initial mutation (e.g. gCorcaigh) of entities in the Irish-language text. This function tagged the mutated entity and inserted the base form (e.g. Corcaigh) as an attribute of the entity.

All inline tagging was subsequently manually checked. Newer biographies are tagged manually. Older biographies are gradually being re-checked as they are prepared for use as “biography of the week” on Twitter, Facebook and in the project newsletter. User feedback is also considered.

The Beathaisnéis collection (on which the Ainm database is primarily based) is the result of the passionate work of two committed amateur biographers. One of the challenges this poses for the creation of standardised metadata is that the original authors did not complete profile sheets or index cards such as those commonly used by other dictionaries of national biography (Warren, 2018; Reinert et al, 2015), which can make the creation of entry-level metadata (e.g. where the person was born) relatively straightforward. While Breathnach and Ní Mhurchú did follow a typical formula in constructing their biographies, they did not list key elements such as profession, religion, gender or place of birth/death independently of the text. In order to retrospectively register such information (i.e. the entry-level metadata as opposed to the named-entity recognition previously described) it has been necessary to manually extract the relevant details, a slow process that is still ongoing in the case of certain elements. Figure 3, taken from a typical entry, displays the categories for which data has been extracted to date: date of birth, date of death, place of birth, gender, school, third-level education and occupation. Other common categories such as religion and place of death have yet to be extracted.

Figure 3: Typical categories used for data extraction (example: Tomás Ó Flannghaile)

[2] e.g. www.ainm.ie/Bio.aspx?ID=454
[3] e.g. www.ainm.ie/Bio.aspx?ID=454&xml=true
[4] e.g. www.ainm.ie/Bio.aspx?ID=454
[5] www.ainm.ie/Timeline.aspx
[6] www.ainm.ie/Tags.aspx
[7] www.logainm.ie
[8] www.geonames.org

4. Quantitative analysis

4.1 Background

The digitisation of biographical collections offers the opportunity to examine collections at a scale not previously possible when only utilising biographical text. With the creation and linking of standardised datasets from unstructured text, the overall contents of the collection can be revealed and trends can be analysed. The Finnish BiographySampo [9] (Tamper et al, 2018), for example, illustrates the potential for such examination, interrogating the biographies as a collection of linked data, as does the Netherlands’ BiographyNet [10] (Fokkens et al, 2017). Not only do such datasets illustrate the overall makeup of a national biography, and therein the history of the nation, but also the history of the collection itself, as Warren (2018) suggests in his examination of the Oxford Dictionary of National Biography (ODNB); the digitisation of national biographies can be used, he says, to analyse both the history of the nation and that of the dictionary itself. In doing so, Warren illustrates not only what the ODNB contains, but also how it came to be:

“...investigating the ODNB (1.) in its entirety and (2.) as an historically contingent digital artifact offers wider purchase on the historical knowledge it makes available and the historical knowledge-making it constrains...” (2018)

Searches by occupation, birthplace or year tell less about their individual importance in the history of the nation than they do about the imagination and biases of the various contributors. Warren notes that mothers of women in the ODNB are quite often queens, and mothers in general tend to be actresses, teachers and noblewomen, while fathers are frequently landowners, army officers, clergymen or merchants. National biographies do not generally attempt to capture the typical member of the nation, but rather the atypical, the exceptional names deemed important by the biography’s contributors. It might seem odd that “naval officer” is the third most frequent profession in the ODNB and that Britain and the Defeat of Napoleon (1996) is the most referenced monograph in the entirety of the ODNB, but the context of the ODNB’s construction casts light on these seeming oddities: a prolific naval historian, Sir John Laughton (1830-1915), was responsible for 1,000 biographies of naval figures, roughly 1 out of every 38 Dictionary of National Biography entries, all of which, it was decided, would be added to the subsequent ODNB. This required additional research which often referenced Britain and the Defeat of Napoleon (1996).

Likewise, the most common years of death in the ODNB reflect not only periods of illness or bloodshed, but also the inherent biases contributors brought in with their selection of subjects:

“The local peaks in 1883 and 1908... once again remind us to attend to the data infrastructure. Rather than marking some hitherto unknown plague afflicting the Victorian aristocracy, 1883 marks the point at which contemporaneous deaths ceased to be meaningful to Stephen, his deputy Sidney Lee, and their collaborators.” (Warren, 2018)

What biases, designed or unintended, can therefore be found within Ainm? Inspection of the most common years of birth or death reveals the clearest influence that the selection criteria and original aim of the biographical dictionary had on the dictionary itself. The original impetus for the Beathaisnéis was to produce a biographical dictionary based on 100 years of the Gaelic Revival, covering those who had died between 1882 and 1982. The scope of the project gradually changed as the authors decided to include lives from both before and after that arbitrary period. Nonetheless, the fact that the first five volumes focused exclusively on that 100-year period and that only two of the nine volumes cover the period from 1560 to 1881 means that the collection is inevitably biased towards the period from the mid 19th century onwards. This is shown clearly in Figure 4.

The image of the nation captured in the pages of the ODNB is less a reflection of the nation’s history than of the making of the dictionary itself (Warren, 2018). Between 1450 and 2000, France, the Netherlands, and the United States of America all apparently supersede England, Scotland, Ireland, and Wales in importance in the ODNB, Warren claims, because of the presumed Englishness of each biographical subject. Relevant countries besides England were mentioned specifically, while any reference to the dictionary’s own nation was left assumed, and therefore left out. Likewise, there is a presumed continuity among the Ainm biographies: each person played a particular role in the Irish-language world. Their relevance to this world is fundamental to their inclusion in this collection and remains the principal criterion for evaluating suitability.

The first volume of the Beathaisnéis series outlined criteria for inclusion:

“...Irish speakers who did something remarkable or who achieved a level of excellence in their lives. Undoubtedly there are also Irish speakers who wouldn’t earn a place in the national pantheon but who are still of importance or who are well-known in the context of the Revival period. Both types are included in this volume and will be included in future collections.” [11] (Breathnach & Ní Mhurchú, 1986, 11)

Although it might not seem noteworthy to specify the first official language of the nation in a national biography, the case of Irish is somewhat exceptional, in that it exists simultaneously as official and minoritised, essential to the establishment and imagination of the nation while only being spoken by a minority of the same nation. These biographies, therefore, aim specifically to capture, and write into being, the Irish-language nation not previously recorded in biography. In highlighting the important lives of the nation which are relevant to the Irish language, the writing of these biographies inherently asserts the importance of the Irish language itself in both the basis and imagination of the nation.

[9] https://seco.cs.aalto.fi/projects/biografiasampo/en/
[10] http://www.biographynet.nl/
[11] Authors’ translation.

4.2 Timespan

Magnus Ó Domhnaill (c.1490-1563) is the earliest-born in the collection, with 2017 the most recent year of death. Spanning seven different centuries, there are 306 different years of birth. An interesting pattern emerges when we analyse the years of birth and death at the level of centuries. Most significantly, 815 people were born in the 19th century, 46% of the lives. If we add in those who were born in the 20th century, 514 people, we reach 1329 lives, 75% of the total collection. Furthermore, 1064 people died in the 20th century, 62% of the total. Therefore, having been born in the 19th century, and having died in the 20th century, the majority of lives lived through the revival period of Irish, which is in line with the understanding that the Beathaisnéis project initially centred around those most active in reviving the language during the late 19th and early 20th century. There are however 110 people with no year of birth or year of death stored as metadata. (In most cases, these lives were in the form of short ‘stub’ articles by the Beathaisnéis authors, where birth and death dates were not included in the title and therefore not automatically extracted. This is an area for future improvement. Having identified this gap, we will begin manually extracting available missing data in order to store it accordingly.)

Figure 4: The predominance of the 19th Century (year of birth) and the 20th Century (year of death) in the above graph helps to further illustrate the original aims of the Beathaisnéis project.

4.3 Gender

Men account for almost 90% (1,580) of the biographies. There are only 176 biographical accounts of women. These figures highlight a major gender imbalance in the collection, and surely represent the greatest area for future research opportunities, but they are not out of sync with other international biographical databases (Farr, 2012). Of the women included in the collection, 86% were born from the year 1847 onwards and 76% died in the twentieth century.

Figure 5: The top 10 places of birth recorded in the Ainm database (where metadata is available)
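The century-level figures reported in Section 4.2 amount to grouping years of birth (or death) by century and counting. A minimal Python sketch of that grouping follows; the sample years are invented rather than drawn from the Ainm database, the function names are hypothetical, and the convention that a century runs from year 01 to year 00 (so that 1900 falls in the 19th century) is an assumption, since the paper does not state which convention was used.

```python
from collections import Counter


def century_of(year: int) -> int:
    """Ordinal century of a year, assuming 1801-1900 counts as the 19th."""
    return (year - 1) // 100 + 1


def tally_by_century(years):
    """Count lives per century, skipping entries with no recorded year."""
    return Counter(century_of(y) for y in years if y is not None)


# Invented years of birth; None stands for a missing metadata value.
sample_births = [1490, 1782, 1847, 1879, 1882, 1901, None]

for century, count in sorted(tally_by_century(sample_births).items()):
    print(f"{century}th century: {count}")
```

Entries without a recorded year are skipped rather than guessed, which mirrors the way lives with missing metadata fall outside the reported tallies.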
Of the biographies written since 2013, 26% are of women, showing a significantly increased representation.

4.4 Birthplace: country and county

A connection to the Irish language is the primary condition for inclusion in the collection, yet 20 different countries are represented in the database. Ireland (including Northern Ireland) is the best-represented country with 1,217 people. England is next on the list with 63. Germany, Scotland and the United States are next, each with 15. The other countries represented in the database are India, Norway, Switzerland, Sweden, Italy, Wales, France, the Netherlands, Australia, Denmark, Malta, Belgium, China, the Czech Republic and Japan. The total number of people born outside of Ireland is 140, or 8% of the collection.

399 people have no recorded place of birth stored as metadata. There are a number of very short biographies (134) which lack key biographical information and require further research. Filling this gap represents an area of future improvement for the project. It was not possible for the original authors to find records of a place of birth for some 103 lives from the 16th, 17th and 18th centuries.

Each county in Ireland is represented in the collection (Figure 6), with Cork (the largest county by area) being the best-represented county with 197 people, or 16% of those born in Ireland, and Fermanagh and Leitrim (both small counties) being the least represented counties with 3 lives each. The top six represented counties are Cork, Dublin, Galway, Kerry, Donegal and Waterford; all except Dublin contain an Irish-speaking area, or ‘Gaeltacht’. These counties represent 42% of the collection. The number of people born in England, 63, is higher than the number born in any of the other 26 counties.

While each province is widely represented in the collection, Munster is the best-represented province with 534 lives (44% of those born in Ireland), almost double that of Leinster (280), followed by Connacht (210) and Ulster (193). This noticeable difference has previously been alluded to by the original authors of the biographies. Although they made an effort not to neglect other areas (referring to Connacht and northern Leinster particularly), it is clear that there were more people of interest to them in Munster, due mainly to the historic strength of Irish in the province, particularly around the time of the revival period: ‘…there is a sort of nucleus or kernel of literacy, as you’d say, in Munster, and especially in Cork, and maybe part of Kerry as well. But, I saw figures from the time of the revival… seventy or eighty percent of the people reading the language were in Cork.’ [12] Diarmuid Breathnach also states his belief that ‘they had very good Irish, especially those from Cork and a lot of Munster people particularly.’ [13] Of the biographies published since 2013, 17% are from Munster. This reduced proportion may be attributed to the fact that the Irish-speaking community is no longer Munster-dominated, or to the bias of the original Beathaisnéis authors.

Figure 6: Province of birth for those born in Ireland (where metadata is available)

[12] Translation of extract from unpublished interview with Diarmuid Breathnach and Máire Ní Mhurchú, 2010.
[13] Translation of extract from interview with Diarmuid Breathnach and Máire Ní Mhurchú, 2010, www.ainm.ie/Info.aspx?Topic=resources.en

4.5 Education and profession

A university education is recorded for 767 subjects; another 989 do not have metadata stored regarding university education, and some of these were also university educated. Of those with available metadata, 40% of women (71) attained some form of university education, in comparison to 44% of men (696).
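The counts quoted in Sections 4.3 to 4.5 can be cross-checked against one another with a few lines of arithmetic. The Python sketch below does this using only figures stated in the text. The choice of denominators is an assumption: the whole collection for gender and for those born abroad, those born in Ireland for the province share, and all women or all men in the collection for the university shares. These denominators reproduce the quoted percentages. The variable and function names are hypothetical.

```python
TOTAL_LIVES = 1756

# Counts as quoted in the text.
gender = {"men": 1580, "women": 176}
birthplace = {"Ireland": 1217, "abroad": 140, "not recorded": 399}
provinces = {"Munster": 534, "Leinster": 280, "Connacht": 210, "Ulster": 193}
university = {"women": 71, "men": 696}
university_recorded, university_missing = 767, 989


def pct(part: int, whole: int) -> int:
    """Share of `part` in `whole`, as a percentage rounded to a whole number."""
    return round(100 * part / whole)


# Do the sub-counts add up to the totals they are supposed to partition?
totals_ok = {
    "gender": sum(gender.values()) == TOTAL_LIVES,
    "birthplace": sum(birthplace.values()) == TOTAL_LIVES,
    "provinces": sum(provinces.values()) == birthplace["Ireland"],
    "university": (
        sum(university.values()) == university_recorded
        and university_recorded + university_missing == TOTAL_LIVES
    ),
}

# Percentages recomputed under the assumed denominators.
shares = {
    "men, of collection": pct(gender["men"], TOTAL_LIVES),
    "born abroad, of collection": pct(birthplace["abroad"], TOTAL_LIVES),
    "Munster, of Irish-born": pct(provinces["Munster"], birthplace["Ireland"]),
    "women with university": pct(university["women"], gender["women"]),
    "men with university": pct(university["men"], gender["men"]),
}

print(totals_ok)
print(shares)
```

Under these assumptions every partition sums to its stated total, and the recomputed shares agree with the rounded percentages given in the text.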
University College Dublin (173) was the most commonly attended university in the database, followed by Trinity College Dublin (111), St. Patrick’s College, Drumcondra, Dublin (71), National University of Ireland, Galway (69), St. Patrick’s College, Maynooth (61), and National University of Ireland, Cork (56). 45 attended either Oxford University, Cambridge or Harvard, but only 6 of these were born in Ireland. There are accounts of people attending university all across Europe, most notably universities in England, Germany, France, Italy, Spain and Belgium. Of these, the Irish Colleges of Rome and Paris, and Leuven University, Belgium, appear more frequently than others. The preponderance of these religious institutions can be attributed to the large number of clergymen who travelled abroad for education during the period from the 16th to 18th centuries when this was prohibited to Catholics in Ireland.

Of the 1,756 lives, 1,690 have at least one occupation recorded, with ‘teacher’ being the most common profession among both men and women, representing 20% of men and 24% of women (21% in total). This makes sense in the context of the central role the Irish language played, and continues to play, in the education system; however, the original aims of the collection certainly influence this propensity towards education, given the necessarily central role of teachers in the revival of any language. Many of those involved with the Gaelic Revival spent time teaching Irish to others.

There is a very high proportion of clergymen. There are 239 Catholic priests (bishops, archbishops, Christian brothers, Franciscans, Jesuits) and 42 Protestant ministers, which represents around 18% of men documented. In comparison, there are only two nuns, Mary Bonaventure Browne and Máire Treasa Ó Murchú, recorded in the collection.

Most of those teachers and clergymen had a second occupation for which they were more recognised; clergymen were often professors. For both men and women, writers, scholars and poets complete the top five professions: being a published writer was one of the suggested criteria for inclusion in the collection. [14] This preponderance of writers, and the initial suggestion for their inclusion, also corresponds with the original focus on the Gaelic Revival, in which the construction of a modern, written literature in Irish played an important role. Other occupations to feature highly on the list include civil servants, musicians, singers and folklore collectors, politicians, lecturers, translators and editors; there are also lawyers, doctors, astronomers, actors, journalists, artists, engineers, miners, broadcasters, soldiers, and publishers.

Figure 7: Professional demographics for subjects born pre and post the Great Famine (where metadata is available)

Priests, poets, and writers dominate the professions early on, in the 16th and 17th centuries. The 19th century (see Figure 7) sees a decline in poets, 70% of whom were born before the start of the Great Famine (1845); this can be attributed to the decline of the bardic poet tradition in Irish. The numbers of teachers, civil servants, politicians and translators begin to rise around the same time. The end of the 19th century and beginning of the 20th century also see a rise in folklore, music, and song collectors, no doubt due to the desire to recuperate all that was lost in the previous century of famine, emigration and political unrest.

[14] Unpublished interview with Diarmuid Breathnach and Máire Ní Mhurchú, 2010.

5. Conclusion

The Ainm example highlights some issues which confront digitisers of biographical dictionaries: omissions or unstructured data in original material, and text which is not easily tagged. These issues are still being addressed by the editorial team.

The preponderance of 19th and 20th century lives in Ainm is a reflection of the original editorial aims, rather than of the most important era for the Irish language, which had begun to decline as a literary and administrative language long before then. Quantitative analysis can be used to confirm the authors’ acknowledged bias towards certain regions (Munster) and professions (writers), as well as the usual gender disparity. As Warren (2018) found for the ODNB, so too for Ainm: it tells the history of both the nation and of itself.

6. Acknowledgements

The Ainm project is a partnership between Cló Iar-Chonnacht, an Irish-language specialist publisher that holds the copyright to the material, and the Gaois research group in Fiontar & Scoil na Gaeilge, Dublin City University, who developed and maintain the database. Funding for the project is provided by the Irish Government.

7. References

Breathnach, D., Ní Mhurchú, M. (1986-2007). Beathaisnéis (9 volumes). Dublin: An Clóchomhar.
Farr, M. (2012). Review of Online Dictionaries of National Biography, (review no. 1259), https://reviews.history.ac.uk/review/1259, (accessed 09.05.2019).
Fokkens, A., ter Braake, S., Ockeloen, N., Vossen, P., Legêne, S., Schreiber, G., de Boer, V. (2017). BiographyNet: Extracting Relations Between People and Events. In: Bernád, Á. Z., Gruber, C., & Kaiser, M. eds., Europa baut auf Biographien: Aspekte, Bausteine, Normen und Standards für eine europäische Biographik. Wien: New Academic Press, pp. 193–224.
Muir, R. (1996). Britain and the Defeat of Napoleon, 1807-1815. New Haven: Yale University Press.
Ó Raghallaigh, B., Ó Cleircín, G. (2015). Ainm.ie: Breathing new life into a canonical collection of Irish-language biographies. In Biographical Data in a Digital World (BD). Amsterdam: CEUR-WS.org, pp. 20–23, http://ceur-ws.org/Vol-1399/paper4.pdf, (accessed 15.05.2019).
Reinert, M., Schrott, M. and Ebneth, B. (2015). From Biographies to Data Curation - The Making of www.deutsche-biographie.de. In Biographical Data in a Digital World (BD). Amsterdam: CEUR-WS.org, pp. 13–19, http://ceur-ws.org/Vol-1399/paper3.pdf, (accessed 15.05.2019).
Tamper, M., Leskinen, P., Apajalahti, K., Hyvönen, E. (2018). Using Biographical Texts as Linked Data for Prosopographical Research and Applications. In: Ioannides, M. et al. (eds) Digital Heritage. Progress in Cultural Heritage: Documentation, Preservation, and Protection. EuroMed 2018. Lecture Notes in Computer Science, vol 11196. Springer, Cham.
Warren, C. (2018). Historiography's Two Voices: Data Infrastructure and History at Scale in the Oxford Dictionary of National Biography (ODNB). Journal of Cultural Analytics. 22.11.2018, 10.31235/osf.io/rbkdh, (accessed 08.05.2019).