=Paper=
{{Paper
|id=Vol-2119/paper4
|storemode=property
|title=Small Lives, Big Meanings. Expanding the Scope of Biographical Data through Entity Linkage and Disambiguation
|pdfUrl=https://ceur-ws.org/Vol-2119/paper4.pdf
|volume=Vol-2119
|authors=Lodewijk Petram,Jelle van Lottum,Rutger van Koert,Sebastiaan Derks
|dblpUrl=https://dblp.org/rec/conf/bd/PetramLKD17
}}
==Small Lives, Big Meanings. Expanding the Scope of Biographical Data through Entity Linkage and Disambiguation==
Lodewijk Petram, Jelle van Lottum, Rutger van Koert, Sebastiaan Derks

Huygens ING, KNAW Humanities Cluster, Oudezijds Achterburgwal 185, 1012 DK Amsterdam

E-mail: {lodewijk.petram; jelle.van.lottum; sebastiaan.derks}@huygens.knaw.nl; rutger.van.koert@di.huc.knaw.nl

===Abstract===
The Huygens institute for Dutch history and culture aims to facilitate and enhance collaborative research with and on biographical data. We give a brief outline of the Huygens ING digital biographical data policy, describe how we share our data with the world, and explain how we facilitate the exploration of similarities and interconnections between the Huygens data, external data collections and user-uploaded datasets, without imposing selection criteria. Finally, we present a use case that shows how our policy and infrastructure enable researchers to employ large collections of ambiguous biographical data, hitherto mainly used for genealogical reference, for addressing innovative, challenging research questions.

Keywords: biographical data, entity matching, disambiguation, digital infrastructure, genealogical data, prosopography

===1. Introduction===
Daniel Engel was born in Danzig (present-day Gdańsk in Poland) and signed up with the Delft branch of the Dutch East India Company (VOC) on the first of October, 1766. He worked as an ordinary seaman during the seven-month journey to Batavia (now Jakarta), stayed there for nine months and then sailed back to Europe on the same ship. Engel was probably illiterate, as he signed with a cross. [1]

This is all we can infer about the life of Daniel Engel from his employment record – too little by far to deserve an entry in the Biography Portal of the Netherlands [2], the online collection of biographies of prominent people from Dutch history, maintained by the Huygens Institute for Dutch history and culture (Huygens ING). Engel was simply one of the many thousands of men from German lands who joined the ranks of the VOC in the seventeenth and eighteenth centuries.

But there is more on Daniel Engel. It seems he joined the VOC two more times, in 1788 and 1792, as a boatswain’s mate and able seaman, respectively. The latter employment record furthermore shows that Engel died in Asia, on the second of October, 1798. There is also mention of a Daniel Engel from Danzig in the interrogation transcripts of the English admiralty, dating from the Fourth Anglo-Dutch War (1780-1784), when the English seized many Dutch ships. This sailor worked as a boatswain on a merchant’s ship that was supposed to have brought cargo from Curaçao to Rotterdam in 1782. He was born in 1753 or 1754. [3]

It is likely that these four data observations refer to the same individual: together they form a logical career path of an eighteenth-century sailor, even though Daniel Engel would have been only twelve or thirteen years old when he first sailed to Asia. This mini-biography is hardly revolutionary – historians have pieced together bits of biographical information from multiple sources for ages (e.g. Ogborn, 2008) – and it is also still not worthy of an entry in the Biography Portal. However, advances in digital techniques now allow for (semi-)automated matching of large numbers of data entities. Disambiguating the just under 800,000 person entities in the VOC employment records has become feasible, and the same holds for data observations in other large, digitized source collections. This opens up possibilities for employing the many snapshots of persons’ lives that are available in e.g. genealogical sources and historical employment records in large-scale prosopographical analyses that may be instrumental in answering urgent, challenging research questions.

At Huygens ING, we seek to connect such collections of disambiguated data to our traditional, mostly highly curated sets of biographical data, with the intent to create an integrated environment that meets the needs of researchers working on a broad range of research questions. In the remainder of this paper, we outline the Huygens ING digital biographical data policy and how we aim to incorporate data on the lives of both prominent people and small fry in our new linked open data infrastructure, give a short overview of the technique we use for (semi-)automatically matching entities from one or multiple sources, and finally present a research use case.

===2. Huygens ING and Biographical Data===
The mission statement of Huygens ING reads: ‘Innovating history: unravelling history with new technology’. The institute tries to accomplish this mission by developing and applying new, advanced digital tools that help open up historical sources, which are often difficult to access and use, and hence stimulate innovation in research. The institute’s updated digital biographical data policy reflects this mission.

Traditionally, biographical dictionaries have formed the heart of the Huygens ING biographical data collection. The institute has a long history of editing biographical dictionaries and publishing these as book series or, in more recent times, making the entries available digitally through separate web interfaces. The development of the Biography Portal, essentially an index to the various biographical dictionaries, was a first effort to bring together the available biographical data.

Huygens ING is now gradually entering a new stage, in which all biographical data are migrated to the institute’s new digital infrastructure. Structured data on person entities are interlinked with a text browser, in which the original texts of the biographical dictionaries and other book and source collections are made available. A user can thus easily search for a person and view related entries in biographical dictionaries and mentions in other texts.

So far, the new infrastructure largely resembles a re-fashioned Biography Portal. What is new, however, is that the structured data are ingested into a linked open data (LOD) environment, and can hence easily be linked with other datasets (both internal and external, national and international). To guarantee optimal findability and re-usability of our data on persons, we align all person entities to those linked open data ontologies that are most used in the Arts and Humanities, and by cultural heritage institutions, both within the Netherlands and internationally: CIDOC-CRM, Wikidata, schema.org and FOAF.

Furthermore, the new digital infrastructure is specifically designed as a humanities research environment. Whereas the traditional book volumes, web interfaces, and even the Biography Portal first and foremost served as reference works – a typical researcher would use them to look up information on one or a small number of persons – the new environment offers better search (elasticsearch) and functionality to explore similarities and interconnections, thus allowing users who practice collective biography and prosopography to easily collect data on the groups of people of their interest (cf. Harders and Lipphardt, 2006). Researchers can furthermore link data elements across multiple sources and use data observations to enrich their own datasets. Finally, they can query the data through the API or download selections of data in various file formats, and then analyse the data offline or using tools for data analysis and visualisation that are available on the internet. In short, the data are ready to be used by researchers.

The Huygens ING digital infrastructure thus has an interactive character; the focus is not solely on making data available, but also, and especially, on allowing researchers to use and share them. In parallel to this, we aim to facilitate and enhance collaborative research with and on biographical data, which comprises, in our view, any biographical data that might be of interest to academia. Researchers are welcome to upload their own data, which they can link up to our data, or make connections between the data in the infrastructure and external datasets. Furthermore, to accommodate researchers’ needs, we are currently developing an entity matching tool, which will become available within the digital infrastructure, that allows researchers to easily find candidate matches between entities from multiple datasets. After validation, the matches will be linked to a resolved entity. We will go further into the details of this tool in the next section.

To gather and present the data in clear, domain-specific collections, our infrastructure consists of multiple, interconnected instances. The curated Huygens ING datasets on the history of knowledge, Dutch history and literary studies are available for reference and analysis in Data Huygens ING. [4] This data hub is directly linked to that of CLARIAH [5], the Dutch national digital infrastructure project for the Arts and Humanities. The benefit of this set-up is that it enables us to validate and manage the data within the domain context, and it also helps us implement our data provenance policy. Huygens ING provides comprehensive provenance information for all its datasets and presents this in a form that is both understandable for humans and interoperable with other data infrastructures within the semantic web. On dataset level, the provenance information consists of a short and general description of the dataset, a list of most-used sources, and information on selection criteria and information extraction techniques that were applied in the process of compiling the dataset. This information is available as an introductory text to the dataset and is also added, in short form and modelled using the P-PLAN Ontology (Garijo and Gil, 2012), to every record. As such it will enable researchers who see an isolated data observation in the LOD cloud to learn about the context in which the data observation came about. Additionally, on record level, we provide specific references to sources. We encourage users to provide the same information for user-uploaded datasets in the CLARIAH data hub. Furthermore, for all data in the infrastructure, technical provenance information is automatically retained. This allows users to see when a particular dataset was originally uploaded and by whom, and which edits were made on a particular data element, either manually or by built-in tooling, such as for entity matching.

===3. Automated Record Linkage===
The record linkage tool we are currently developing enables users to find matches between entities in one or more sets of data observations, selected from the structured data repository within our digital research environment or external LOD sources. For the time being, the tool is primarily intended for finding matches between person entities. It allows users to measure name similarity and refine candidate matches using rules that are e.g. based on geographical data or dates.

We chose to develop the tool in a PostgreSQL environment for the relatively speedy matching results it offers, especially when using trigram matching. The tool downloads selected RDF triples, automatically converts them into CSV format and loads them into the PostgreSQL environment. In the matching process, it creates a new dataset with matched entities, which, after validation by the user, is returned to the LOD environment. This new dataset includes full provenance information about the matching parameters that were applied (algorithm and additional rules) and the user doing the final validation step. All provenance data are automatically retained during the process of candidate generation and validation.

The tool offers various methods for measuring string similarity, which can be used for matching names and toponyms: trigram matching (the preferred method, for speed reasons; it uses the similarity function in the PostgreSQL (9.5) pg_trgm module), Levenshtein distance, and (Double) Metaphone. When geocodes are available, locations can also be matched using the PostgreSQL extension PostGIS. This extension allows users to find matches based on either an exact geographical location or a user-set range around a geographic point.

To start the matching procedure, a user first manually selects data fields for matching and then creates a set of refinement rules, tailored to the data at hand, to improve matching results and/or exclude irrelevant matching candidates. For example, if a user wants to match entities from a birth register with a faculty list, he could create a rule that discards candidates that would have been under eighteen or over one hundred years of age when employed at university. Another rule could state that candidates who are between age 25 and 65 when employed at university should get higher scores.

The tool leads the user through an iterative matching procedure (cf. e.g. Efremova et al., 2014; Idrissou et al., 2017). Users are encouraged to set strict rules at first. This will yield a relatively small number of high-quality candidates, from which the user can then select matches for approval. After this first matching and validation round, the matched records are split from the original dataset and sent to a new dataset, which only contains validated data. The user can then let the tool iterate once or multiple times over the remaining original data using different sets of matching rules to generate additional candidate sets, from which approved matches can be added to the set of validated data. Taken together, the steps in the matching procedure yield results with high precision and recall.

===4. Research Use Case: Sailors’ Careers===
The research project ‘Human capital, immigration and the early modern Dutch economy: job mobility of native and immigrant workers in the maritime labour market, c.1700-1800 (HUMIGEC)’ illustrates the potential of our infrastructure and entity linkage tool for academic research. This project’s research question originates from a currently hotly debated topic in both the political and the public arena: what is the economic contribution of migrant workers to a recipient economy? This is a difficult question to answer for modern economies, let alone for economies from the past, since historical statistics on education or training levels of workers are largely lacking. Without these, simply having estimates of the size of the migrant influx is not sufficient. After all, it makes a huge difference whether migrants are non-skilled, skilled or become skilled during their careers in the recipient economy.

Although for the pre-1800 period sources containing clear indicators of education or training levels are rare, we do have large numbers of historical employment records. However, such sources often provide no more than a snapshot of a person’s life and are therefore relatively limited in their use. But by matching entities from multiple source collections, these records become much more meaningful. Matching a sufficiently large number of entities was hitherto practically impossible, due to the simple fact that these data collections are large and manually finding matches takes a lot of time, but our automated entity matching tool enables us to do so – and in the near future other scholars as well. In the case of HUMIGEC, the tool helps us to reconstruct individual careers, which in turn makes it possible to compare the relative success of migrant and native workers. As the success of careers is a good indicator of skills, this assessment will allow us to address the central research question of the project.

We selected the maritime sector of the eighteenth-century Dutch Republic as a case study in HUMIGEC, because this was a key sector of the economy, characterised by a high level of migrant participation. Moreover, its workers were well documented: we have almost 800,000 employment records of the VOC, digitised by a number of archival institutions in the Netherlands, that cover the entire eighteenth century, and c. 15,500 records on Dutch mercantile marine crews from the Prize Paper Dataset compiled by HUMIGEC’s PI Jelle van Lottum. [6] Each record in both collections contains data on a sailor’s name, place of birth, rank on board and start date of the employment. For the sailors in the Prize Paper Dataset, we also know their age when questioned by the English admiralty.

By matching entities within and between these datasets, as shown by the example of Daniel Engel in the introduction to this paper, we can (partially) reconstruct sailors’ careers, which we can then use to compare the level of job mobility (i.e. promotion or job switching) of non-migrant and migrant workers (Gibbons and Waldman, 1999). This will give us insight into the extent to which migrants succeeded in gaining skills (i.e. human capital) during their careers, and allow us to compare this to non-migrants.

We use the entity alignment tool introduced in the previous section to find data observations that are probably related to the same individual. We first look for data observations with a high level of name similarity, measured on the basis of trigram matching, and filter out irrelevant results by applying a set of rules based on dates (for example, a person cannot have sailed out before birth or after death, cannot have been employed on two ships at the same time, cannot have been in Asia and Europe at the same time, etc.) and domain expertise (it is e.g. unlikely that a person who had worked as an ordinary seaman on a single trip rejoined the ranks of the VOC as a captain).

Next, we use the sailors’ places of birth as a check on matches. Since we have to deal with quite a bit of variation in toponym spelling – all records were written down by clerks who often did not speak the same language as the sailors in front of them and who were also frequently unfamiliar with the towns and villages, often in the German lands and Scandinavia, mentioned by the sailors – we decided to try to standardise place names and reconcile them to their modern-day GeoNames equivalents. We have so far standardised around 30,000 unique toponym attestations and aim to at least double this number before project end. The standardised toponyms allow us to first look for exact matches. Thereafter, using the geo coordinates given back by GeoNames, we geo-group locations to find possible additional matches. In this way, we also catch sailors who used their birth place and region interchangeably.

For all remaining person entity matches, suggested on the basis of name similarity, but not corroborated by matching places of birth, we perform a final birthplace check by measuring string similarity of the original place name attestations, so as to account for possible mistakes by clerks or transcribers of the original documents – Norden in East Frisia might easily have been misunderstood as Naarden close to Amsterdam.

The scope of our project does not allow for experiments with standardising person names. We therefore rely on the trigram matching algorithm to cope with spelling variations in names. However, for a follow-up project to HUMIGEC, we are thinking of also standardising person names, beginning with native workers’ names. To this end, we would use the Database of Surnames in The Netherlands [7] to standardise family names, and group variants of given names on the basis of data generated by Gerrit Bloothooft (e.g. Bloothooft and Schraagen, 2015).

This paper is not well-suited for going deeply into socioeconomic analysis and statistical results – incidentally, HUMIGEC is still an ongoing research project and we currently only have very preliminary results – but a brief reflection on methodology is in order. First of all, it is important to stress that our method is far from perfect. At best, it gives us a limited view on career paths in the Dutch eighteenth-century maritime sector, for the available sources do not cover the entire sector and we have no ground truth for assessing the performance of the entity linkage process. We do, however, have a set of manually-matched entities that we use for a superficial assessment of our matching method. However, these matches are self-evidently incomplete and are furthermore likely to be biased towards non-standard names.

That same bias will be present in the automatically generated matching candidates: disambiguating employment records of sailors with common names, who were born in large towns and cities, is in many cases simply impossible, both for humans and computers. This gives reason for some concern about the representativeness of our study, but then again, sailors with non-standard names were not atypical because of their unusual names. Sixtus van den Hoek from Delft, for example, a sailor we could easily trace in eight different VOC employment records, is not unrepresentative because of his name – were his name Jan de Jong, he would not suddenly become a synecdoche for maritime life (cf. Van Lottum, Brock and Sumnall, 2015). Moreover, since we base our analysis on a large number of observations, we think the bias in our sample towards non-standard names will not have a significant influence on our results. However, to check whether the non-standard names that are likely to be overrepresented in our sample were not typical for a certain class of eighteenth-century society, we will compare them to the family names of Amsterdam’s highest-income tertile, derived from registers of a 1742 income tax (Oldewelt, 1945), and to the family names in Amsterdam’s birth, marriage and death registers from the mid-eighteenth century. [8]

A discussion of the digital heuristics involved in our project will naturally also be included in the general description of the set of validated record matches and, in very short form, in each record’s P-Plan provenance. So, if for example a future researcher of the Asian activities of the VOC would see that some person entities from the Official letters of the United East India Company – a Huygens ING digital resource that will be added to our LOD infrastructure in due course – were connected to records detailing sailors’ careers and others not, he would know that this could have as much to do with selection bias in the linkset as with the actual careers of these people.

===5. Conclusions===
Biography as a historical method has traditionally mainly been used as a means to illustrate qualitative themes, generally based on one or a small set of case studies. From around the turn of the century the online availability of national biographical dictionaries in e.g. the Netherlands, Germany, the United Kingdom and Australia allowed for larger-scale biographical research and the formation of collective biographies (cf. Arthur, 2015; Carter, 2012). But these were inevitably limited by the scope of the online biographical collection and influenced by the selection criteria (and biases) of its editors.

The Huygens research infrastructure and biographical data policy, however, allow researchers to go one step further. The institute makes available all biographical data contained in its collection, both highly curated data from biographical dictionaries and person data retrieved from various textual sources. Furthermore, as illustrated by the HUMIGEC research case, researchers can use the infrastructure to semi-automatically connect external datasets to the core data or disambiguate their own data. In HUMIGEC, we use the large number of mini-biographies obtained through digital methods as a means of illustrating wider social and economic processes. Indeed, as Paul Arthur predicted, this approach is ‘a demonstration of biography’s greatly increased capacity, in the digital era, to activate cross-disciplinary investigation, and become a dynamic agent for integrating and connecting individual lives and their historical contexts’ (Arthur, 2015).

Digital advances such as the one described in this paper are blurring the boundaries between (collective) biography, prosopography and other socioeconomic research methods. In parallel with this development, all biographical data observations, however insignificant they may seem at first sight, may become very meaningful and instrumental to answering important research questions when disambiguated and combined with other data. Huygens ING aims to facilitate and enhance the full range of biography methods by making available a digital infrastructure that welcomes all biographical data – be they on the lives of prominent people or small fry – and offering functionality for exploration of similarities and interconnections between data observations.

===6. Acknowledgements===
We thank Ania Ahamed and Jessica den Oudsten for their research assistance, and the Amsterdam City Archives for sharing their genealogical data with us. HUMIGEC received funding from CLARIAH. [9]

===7. References===
Arthur, P. (2015). Re-imagining a Nation: The Australian Dictionary of Biography Online. European Journal of Life Writing, 4, pp. 108–124.

Bloothooft, G., Schraagen, M. (2015). Learning Name Variants from Inexact High-Confidence Matches. In G. Bloothooft, P. Christen, K. Mandemakers, M. Schraagen (Eds.), Population Reconstruction. Cham: Springer, pp. 61–83.

Carter, P. (2012). Opportunities for National Biography Online: The Oxford Dictionary of National Biography, 2005–2012. In M. Nolan, C. Fernon, The ADB’s Story. Canberra: ANU Press, pp. 345–371.

Efremova, J., Ranjbar-Sahraei, B., Oliehoek, F.A., Calders, T., Tuyls, K. (2014). A Baseline Method for Genealogical Entity Resolution. Proceedings Workshop Population Reconstruction: 19-21 February 2014, Amsterdam.

Garijo, D., Gil, Y. (2012). Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data. Proceedings of the 2nd International Workshop on Linked Science: 12/11/2012, Boston, USA.

Gibbons, R., Waldman, M. (1999). Careers in organizations: theory and evidence. In O. Ashenfelter, D. Card (Eds.), Handbook of labor economics. Vol. 3B. Amsterdam: Elsevier, pp. 2373–2437.

Harders, L., Lipphardt, V. (2006). Kollektivbiografie in der Wissenschaftsgeschichte als qualitative und problemorientierte Methode. Traverse, 13(2), pp. 81–91.

Idrissou, A.K., Hoekstra, R., Harmelen, F. van, Khalili, A., Besselaar, P. van den (2017). Is my sameAs the same as your sameAs? Lenticular Lenses for Context-Specific Identity. Proceedings of the Knowledge Capture Conference, Austin, TX, USA, December 04 - 06, 2017. Article No. 23.

Lottum, J. van, Brock, A., Sumnall, C. (2015). Mobility, Migration and Human Capital in the Long Eighteenth Century: The Life of Joseph Anton Ponsaing. In M. Fusaro et al. (Eds.), Law, Labour, and Empire. Comparative Perspectives on Seafarers, c. 1500-1800. Basingstoke: Palgrave Macmillan, pp. 158–176.

Ogborn, M. (2008). Global lives. Britain and the world, 1550-1800. Cambridge: Cambridge University Press.

Oldewelt, W.F.H. (1945). Kohier van de personeele quotisatie te Amsterdam over het jaar 1742. 2 vols. Amsterdam: Genootschap Amstelodamum.

===Footnotes===
[1] These and other VOC employment data: VOC Opvarenden database (http://www.gahetna.nl/collectie/index/nt00444/view/NT00444_OPVARENDEN and http://dutchshipsandsailors.nl/).

[2] http://www.biografischportaal.nl/

[3] Prize Paper Dataset, cf. footnote 6.

[4] https://data.huygens.knaw.nl/

[5] https://www.clariah.nl/; https://anansi.clariah.nl/

[6] Data collected in the Economic and Social Research Council (ESRC) project no. RES-062-23-3339: Migration, human capital and labour productivity: The international maritime labour market in Europe, c. 1650–1815.

[7] http://www.cbgfamilienamen.nl/

[8] https://archief.amsterdam/indexen/

[9] https://www.clariah.nl/en/projects/research-pilots/granted-pilot-research-projects/humigec
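To make the trigram matching mentioned in the section on automated record linkage more concrete, the following minimal Python sketch approximates the idea behind the pg_trgm similarity function: both strings are split into padded three-character sequences, and the score is the number of shared trigrams divided by the number of distinct trigrams overall. This is an illustration under stated assumptions (the padding convention and the shared-over-union ratio), not the tool's own code, and it will not reproduce PostgreSQL's results in every edge case. The function names are hypothetical.

```python
from __future__ import annotations

import re


def trigrams(text: str) -> set[str]:
    """Return the set of padded character trigrams of `text`.

    Assumption: every lower-cased word is padded with two leading blanks
    and one trailing blank before trigrams are extracted.
    """
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    pairs = [
        ("Daniel Engel", "Daniel Engel"),
        ("Daniel Engel", "Daniel Engell"),
        ("Norden", "Naarden"),
    ]
    for left, right in pairs:
        score = trigram_similarity(left, right)
        print(f"{left!r} vs {right!r}: {score:.3f}")
```

Under these assumptions, identical strings score 1.0, while Norden and Naarden share only four of eleven distinct trigrams, which suggests why a string comparison of place names is used as a last check rather than as the first filter.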
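The refinement rules described in the section on automated record linkage (birth register matched against a faculty list) could be sketched as follows. The sketch assumes that each candidate carries a name-similarity score, a year of birth and a year of employment; the `Candidate` class, the function name and the size of the score bonus are hypothetical choices made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name_similarity: float  # e.g. a trigram score between 0 and 1
    birth_year: int         # taken from the birth register
    employment_year: int    # taken from the faculty list


def apply_age_rules(candidate: Candidate, bonus: float = 0.1) -> Optional[float]:
    """Return an adjusted score, or None when the candidate is discarded.

    Rule 1: discard candidates under eighteen or over one hundred years
            of age at the time of employment.
    Rule 2: raise the score of candidates aged 25 to 65 (the bonus value
            is an arbitrary, hypothetical choice), capped at 1.0.
    """
    age = candidate.employment_year - candidate.birth_year
    if age < 18 or age > 100:
        return None
    if 25 <= age <= 65:
        return min(1.0, candidate.name_similarity + bonus)
    return candidate.name_similarity


if __name__ == "__main__":
    for cand in (
        Candidate(0.8, 1700, 1712),  # twelve years old: discarded
        Candidate(0.8, 1700, 1720),  # twenty years old: kept, no bonus
        Candidate(0.8, 1700, 1740),  # forty years old: kept with bonus
    ):
        print(cand, "->", apply_age_rules(cand))
```

Returning `None` for discarded candidates keeps the exclusion rule separate from the scoring rule, so either can be tightened or relaxed between matching rounds.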
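Matching locations within a user-set range around a geographic point, as described for the PostGIS extension and for the geo-grouping of birthplaces, rests on a distance calculation between coordinate pairs. The plain Python sketch below uses the haversine formula as a stand-in; it is not PostGIS, and the coordinates for Norden and Naarden are rough values supplied purely for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_range(point_a, point_b, radius_km: float) -> bool:
    """True when two (lat, lon) points lie within `radius_km` of each other."""
    return haversine_km(*point_a, *point_b) <= radius_km


if __name__ == "__main__":
    # Approximate, illustrative coordinates (latitude, longitude).
    norden = (53.60, 7.21)
    naarden = (52.30, 5.16)
    print(f"distance: {haversine_km(*norden, *naarden):.0f} km")
    print("same 25 km group:", within_range(norden, naarden, 25.0))
```

A distance check of this kind complements string similarity: two toponyms that look alike on paper can still be rejected as the same birthplace when their coordinates lie far apart.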
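The iterative matching procedure (strict rules first, validated matches split off, further rounds over the remaining records) can be sketched in a few lines of Python. Several simplifications are assumed here: `difflib.SequenceMatcher` stands in for the trigram measure, only pairs of records are linked rather than clusters of several records per person, and the user's approval step is modelled as a callback. All names are hypothetical; the example records reuse two names mentioned in the paper plus an invented one.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_score(a: str, b: str) -> float:
    """Stand-in similarity measure (difflib ratio), not pg_trgm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def iterate_matching(records, thresholds, validate):
    """Run one matching round per threshold, strictest first.

    records    : dict mapping record id -> name
    thresholds : similarity cut-offs, one per round, in descending order
    validate   : callable(id_a, id_b) -> bool, the user's approval step
    Returns (validated pairs, records still unmatched).
    """
    remaining = dict(records)
    validated = []
    for threshold in thresholds:
        candidates = [
            (a, b)
            for a, b in combinations(sorted(remaining), 2)
            if name_score(remaining[a], remaining[b]) >= threshold
        ]
        for a, b in candidates:
            # Skip pairs whose records were already split off this round.
            if a in remaining and b in remaining and validate(a, b):
                validated.append((a, b))
                del remaining[a], remaining[b]
    return validated, remaining


if __name__ == "__main__":
    records = {
        1: "Sixtus van den Hoek",
        2: "Sixtus van den Hoek",
        3: "Jan de Jong",
        4: "Jan de Jongh",
        5: "Pieter Smit",  # invented name for illustration
    }
    matches, rest = iterate_matching(records, [1.0, 0.9], lambda a, b: True)
    print("validated:", matches)
    print("unmatched:", rest)
```

Because approved records leave the pool after each round, later and looser rounds only have to consider what the stricter rounds could not resolve, which mirrors the procedure's aim of keeping the early candidate sets small and of high quality.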
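The date-based rules used in the sailors' careers use case (no voyage before birth or after death, no two employments at the same time) amount to a consistency check over a candidate career. The Python sketch below shows one way such a check might look. The `Employment` class and function name are hypothetical; in the example, the start date of 1 October 1766 and the date of death come from the Daniel Engel records discussed in the introduction, whereas all other dates are invented placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass
class Employment:
    start: date
    end: date


def dates_consistent(jobs: Sequence[Employment],
                     born: Optional[date] = None,
                     died: Optional[date] = None) -> bool:
    """Check whether a set of employment records can belong to one person."""
    ordered = sorted(jobs, key=lambda job: job.start)
    for job in ordered:
        if born is not None and job.start < born:
            return False  # sailed out before birth
        if died is not None and job.start > died:
            return False  # sailed out after death
    for earlier, later in zip(ordered, ordered[1:]):
        if later.start < earlier.end:
            return False  # employed on two ships at the same time
    return True


if __name__ == "__main__":
    career = [
        Employment(date(1766, 10, 1), date(1768, 6, 1)),   # end date invented
        Employment(date(1788, 1, 1), date(1789, 6, 1)),    # dates invented
        Employment(date(1792, 1, 1), date(1798, 10, 2)),   # start date invented
    ]
    print(dates_consistent(career, born=date(1753, 1, 1), died=date(1798, 10, 2)))
```

Rules drawn from domain expertise, such as implausible jumps in rank, would need a separate check, since they express likelihood rather than strict impossibility.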