Persons Linking in Baptism Records? Jaroslav Rozman1[0000−0001−8443−433X] and František Zbořil2 1 Brno University of Technology, Brno, Czech Republic rozmanj@fit.vutbr.cz 2 Brno University of Technology, Brno, Czech Republic zborilf@fit.vutbr.cz Abstract. This paper describes models that can be automatically cre- ated from genealogical records (baptisms, marriages and burials). Those models represent various relationships between people mentioned in the records. The most important relationship is child - father or mother, but there can be found others - grandparents, godparents, best men, midwifes or priests. Usual goal in genealogy is to create models that are based only on child - parents relationships. Such models are called family trees. In this paper we describe whole procedure neccessary for creating such models. We are using rewritten baptisms records of small village that covers year range between cca 1607 to 1899. Those records are loaded from simple database and they are transformed to the struc- ture containing all neccessary information about one person. From this first structure we create another one, that contains all persons mentioned in the record and which is suitable for comparison. After comparison the probability that both persons in two records are the same is computed. If the probability is smaller than threshold, the record is added to the output database, if it is bigger, it is merged with the second record. Be- cause the rewritten records were hand-connected to the family tree in genealogical SW and all persons got its own ID, we are then able to find succes rate of our approach. Keywords: Genealogy · Baptism records · Records linkage. 1 Introduction Genealogy, the science about family trees, is widely widespread around the world, e.g. in the USA it is one of the most common hobbies. There are a lot of web servers, where people can upload their family trees and share it with other people (Myheritage, Geni, etc.). Most genealogy web servers allow people to upload their family trees and then searching engines search for possible matches. But those family trees are without any links to the physical records. So we came with a slightly different approach. First, we rewrite the records with references to the particular record on the page of parish book and the searching is done after it. ? This work was supported by TACR No. TL01000130, by BUT project FIT-S-17-4014 and the IT4IXS: IT4Innovations Excellence in Science project (LQ1602). 2 J. Rozman et al. The original idea of this approach is to help genealogists to avoid multi- ple repeated searches in the books. Typically most genealogists do not rewrite records, they just go through it, find records that interests them and continue searching in another book. But usually, when they are far enough in the past, the number of searched persons in one book grows and it is not easy to find them all during one search, so it is neccessary to go through the same parish book multiple times. And at that moment it can appear that it would be easier to rewrite the whole book already in the beginning. Apart of it, many genealo- gists search independently in the same books. So if the rewrited records would be accessible for everybody during the moment of rewritting then people could cooperate. Then it could make things much easier. Advantage of this approach is that one genealogist can rewrite only those records that interests him/her, other genealogists rewrite records that they are interested in and in such a way the whole book can be rewritten. Rewriting of just a few records that a person is interested in is not so demanding comparing to the case when one person has to rewrite the whole book, which can have few hundreds of pages with thousands of records. The goal of our project is to create database suitable for community rewriting of parish books. But our project has also higher level part. We want to create another database, where not the records, but the persons and relationships be- tween them will be the main part. We want to perform relationship modelling, i. e. to connect identical persons appearing in the first database. People there can have different roles (child, father, mother, etc.) that will be described in later sections. So after such connection we can have not only the basic family tree but we can also know relationships among children, parents and godparents and other. In the general sense, we have some persons and we know in which records (documents) they are appearing and in what roles. Even though genealogy is widely spread hobby, the record linkage in this area is not commonly used. The reason probably is the small amount of data. The parish books are being rewritten only in few countries, for example the BALSAC project3 in Quebec is probably one of the first, another example is HisKi4 in Finland. There are some examples, where small part of the country were rewritten and data were used for record linkage. The work on connecting records in small town is described in [1]. Here for the determining of score of first or last names authors used weights that depend on the frequency of the particular name in the database. For example name ”Joseph” is very common, so its weight is much smaller than ”Lucas” which is very rare. This means that if they try to determine if two Josephs are the same person the probability will be much smaller than for two Lucases. The record linking were done in three steps: in first step they tried to find common couples and associate them with all of their children, in second step, they link marriage certificates into pedigrees and in the final step they linked both previous datasets together. Because they did 3 http://balsac.uqac.ca/english 4 http://hiski.genealogia.fi/hiski/9asgip?en Persons Linking in Baptism Records 3 not have the ground truth, they validated the results by set of test like number of marriages per individual, number of children per marriage and other. Other example is from the Val Borbera valley in Italy [2]. Here again authors are using both birth and marriage records. That is because birth records do not contain names of children’s grandparents and without this information it is more-or-less impossible to create pedigree. Similar as we do they use blocking where they test is both persons in two records have same sex an if date of birth of one person is in the birth interval of second person. The validation of results is done by searching for the birth record of the child’s father, then they find father’s marriage record, compare names of fathers parents in birth record and marriage record and finally they check if name of wife in marriage record is the same as name of mother in child’s birth record. To our best knowledge the only papers that work with ground truth are [5] and [3]. In [5] the authors used database created by domain experts with birth, marriage, burials and census data from Isle of Skye between 1861 and 1901. Their results are mainly focused on evaluation of record to record than person to person and as matching tool they used algorithm that was originally meant to use for authors matching, so their results is hard to compare to others. In [3] the authors are using their tool described in [4]. They work with database with more than 100 thousands persons obtained from a domain expert. They used Bayes probability to determine if two records are about the same person. Bayes probability here has similar effect as in [1] - the probability is depends on the frequency of the person’s name. Because they have ground truth for the dataset they used, they were able to compute accuracy (between 57% and 65%) depending on the linking method they used. The neural networks are used in [6], [7] or [8]. In [6] the author also used ground truth, unfortunatelly, there is not explained how the matched data looked and how the matching was done. The rest of this paper is organised as follows: next section is short introduc- tion to parish books, 3rd section is about preparations of records, 4th section about record linking, 5th is about dataset we used and 6th is testing and finally there is a conclusion. 2 Parish Books The duty of writing church registers was ordered in 1563 by Trident council. It was ordered again for Czech lands in 1591. So the oldest church registers in Czech Republic has been founded around the year 1600. But due to the various wars and other disasters (fires mostly) it is common for most villages to have church registers since about 1650. There are three kinds of church registers (parish books) -– baptisms, marriages and burials. Since only allowed religion at that time was Catholicism, the church registers started as catholic. Later, when also Evangelicism and Judaism were allowed, there were more kinds of church registers, but together with allowing other religions, the uniform printed form was ordered, so we use only the catholic churches registers as example. 4 J. Rozman et al. As we stated before there were three kinds of church registers. Because these registers were first used only for ecclesiastical purposes, it does not have infor- mation about date of birth (or death), but only about baptisms (or burials). Since 1784, when they were declared as public (not ecclesiastical any more) doc- uments, they started to contain also information about date of birth (death). Church registers (or more exactly, the latter one we can call civil registers) con- tain private information, so they are freely accessible only when the time from last record is more than 100 years in birth registers and more than 75 years in marriage and death registers. It means we can assume they are freely accessible up to about 1900 in births and 1910-1920 in marriages/burials. Those church registers are usually scanned and they are freely accessible via internet. The language, that was used varies, but we can generally say that the oldest church registers were written in the Czech language, then in Latin and since 1784 in German and then again in the Czech language. The structure of records since 1784 till about 1900 did not change, so we are stating this structure here (see Fig. 1). The structure before 1784 was similar, but there were less information. Usually the info about child’s (spouse’s) grand- parents and the reason of death was missing. Since in this paper we deal with the birth records only, we provide the structure of birth records as follows: Parish/birth record – Date of birth and baptism – Name of priest – Name of child – Male/Female – Il/legitimate – Name of father, occupation, place of residence, names and place or residence of his parents – Father’s religion – Name of mother, names and place or residence of her parents – Mother’s religion – Names, occupations and place of residence of godfathers – Usually the name of midwife was added – There were sometimes remarks about date of death, marriage or other 3 Records Preparation With cooperation with our colleagues from Philosophical Faculty of Masaryk University we have created a template for rewritting of the baptism records. The template has almost 150 items, because we included full adresses and occupations for all possible people (up to 13 persons) appearing in the records and we also added some items for information that are added extra to the record (date of wedding or death in baptism records, etc). We used names of the child and all its mentioned ancestors for linking, we also assume that we use names of godparents, Persons Linking in Baptism Records 5 Fig. 1. Example of baptism/birth record with heading from the church register. On the upper image there is baptism/birth record from May 1899 (17th was the birth, 23rd was the baptism). The child is Matilda, father is Josef Řičánek and mother is Matilda. The father’s occupation is hajný. For both father and mother there are also names (Mikuláš Řičánek, Magdalena Sláma, and Engelbert Přichystal, Vincencie Polák ), occupations (dělnı́k, výměnkář ) and villages (Ubcı́ch, Bukovince) of their parents. This is what we labeled as 4GP record (4 grandparents known). Additional informations here is date, place and name of groom (on lower left corner) and dates of birth of both parents (16/3 (18)53, 9/3 (18)58). On the lower image there is baptism record from 1.1.1734. The child’s name is Josephus Stephanus and his father is Georgius Snaschel and mother is Anna. This record we labeled as F+M (father and mother). 6 J. Rozman et al. because it can be of a big help in old records where only the first and the last name of father and first name of mother is used. And sometime the father’s last name was changed, because at that time the last names were not stable. In such case the names of godparents can help, because people usually used only one or two pairs of godparents for all their children, so we can compare first names of parents and the names of godparents and based on that decide if the children are siblings. 3.1 Structure of One Person For each baptism record described in previous subsection we create the same structure for every person mentioned in the record. The information about one person is as follows: – ID - id of the person – IDrecord - id of the record from parish book, used for not comparing persons from the same record – IDgedcom - id from genealogical SW where family tree was manually created, used for presicion/recall computing – role - role of the person in the record (CHILD, FATHER, MOTHER, MOTH- ERSFATHER, etc.) – birth date – birth date range - for parents and grantparents – baptism date – death date – death date range – weddings date – weddings date range – first name - there can be more names – last name – multiples – sex – religion – occupation - there can be more occupations for one person – place of birth (village, street, house No.) – places of live (village, street, house No.) - for parents, taken from place of birth of their child, there can be more addresses – identities - array of ID of person’s who was determined as identical – fathers - array of ID of father’s of persons from identities – mothers - array of ID of mother’s of persons from identities – partners - array of ID of husbands/wives (in fact second parent of child) – children - array of ID of offsprings Every record for comparing consists of such information about every person that is in the baptism record. During creating of records we fill the ranges based on the date of baptism and some assumptions: Persons Linking in Baptism Records 7 – The birth date range is for ancestors of the child, where we suppose that man can have children between 15 and 65 years (women between 15 and 55). – Person can have the first wedding in 15 and last at the time of death. – Person can live up to 100 years. – If we know, that a person had wedding or child born, we know, that such person had to die after such date (except minus 9 monthes for father). – If a child is ”illegitimate”, we know that wedding of the parents has to be after this date, if it is ”legitimate” the wedding has to be before this date. This structure is also used in the second database where the connections are kept. That is why also identities, fathers, mothers, partners and children are part of the record. When the records are created from the baptism record, the IDs of fathers, mothers and children are added and their probabilities/scores are set to 1.0. After the search for the identity is performed, the ID of second person is add to the ”identities” together with the probability/score and also other fields are updated (address, etc.). 3.2 Records for Comparing When comparing records for linking, the obvious way is to compare each record with others. Obviously it is not neccessary to compare children whose birth dates are more than 65/55 years apart because the probability that such old man/woman will have children is very low. There are also compared only persons with same sex. Because we want to find if any person in one record in a parish book is the same as any person in other records, we decided to split every record in parish book to as many records as is the number of persons mentioned in the record. It means that from the upper image in Fig. 1 we got 7 other records (for 1 child, 2 parents, 4 grandparents), from the lower image we got 3 records (1 child, 2 parents). For every child all its ancestors will be in the new record. Same for fa- ther/mother - all his/her ancestors will be in the new record and because name of wife/husband is important information for record linking it will be also added. In case of grandparents there will be only their husband/wife in the new record, in case that their father is also known, he will be also added. From those informations we create one long record whose items will be com- pared with others. It means that all records have to have the same structure, because of that we have to add items for husband/wife for the records with child, although it is obvious that newborn baby can not have any partner. All structures are shown in Table 1. Substructures contains particular items for com- paring and are same for all persons (Ch, F, M, FF, FM,..., MgF):   Ch, F, M, ..., M gF = f irstname, lastname, religion, occupation, address 8 J. Rozman et al. Child Ch F FF FM FgF M MF MM MgF 0 0 000 0 0 00 Father F FF 0 0 0 FM FgF 0 0 M MF 0 0 0 MM MgF 0 0 Mother M MF 0 0 0 MM MgF 0 0 F FF 0 0 0 FM FgF 0 0 F. father FF 0 0 0 0 0 0 0 0 FM FgF 0 0 0 0 0 00 F. mother FM FgF 0 0 0 0 0 0 0 FF 0 0 0 0 0 0 00 F. gr.father FgF 0 0 0 0 0 0 0 0 0 0 000 0 0 00 M. father MF 0 0 0 0 0 0 0 0 MM MgF 0 0 0 0 0 00 M. mother MM MgF 0 0 0 0 0 0 0 MF 0 0 0 0 0 0 00 M. gr. father MgF 0 0 0 0 0 0 0 0 0 0 000 0 0 00 Table 1. Structures of all persons created from one record for comparing. The zero means this part is empty (for example there is usually not information about father’s grandfather from his father side in the baptism record, there is only info about grand- father from his mother’s side - FgF). The five columns with all zeros are not neccessary, but they are there for records consistency. FgF MgF FF FM MF MM FgF MgF F M Ch FF FM MF MM F M Ch Fig. 2. Scheme of comparing child (boy - is compared only to males) in one record with every possible person (male) in another record. There is also child - child comparison, because there can be two records with the same child (for example when two parish books have time overlap). Persons Linking in Baptism Records 9 4 Records Linking The records, structured as described in previous sections are compared everyone to everyone. This would means lots of computing, so we limited comparing only for people of the same sex and we also check range of birth date and compare only persons, that are inside this range. Resulting score is between 0 and 1 and the threshold for marking as identity is set to 0.85. Also we applied weights for marking more significant items like names or address where weights are set to 1.0, occupation can change during time, so we set the weights to 0.5 and religion is usually catholic, so we set it to 0.1. There are three various variable types in the comparison record. There are dates for births, strings for last names and lists. In the lists there are date ranges, occupations and baptism names. Dates are compared as a set of integers, for string comparison we are currently using Levenshtein distance and the resulting real-type number is used for identity score. In date comparison the result is either 0 or 1. Because one person can have more occupations it is neccessary to compare every occupation to every occupation. Similar with given names - a person can be given more names at the baptism, but later only one is usually used, therefore it is neccessary to compare all the given names. In the first pass over data we only add matching IDs to the identities field (see information about person in subsection 3.1) of the examined record. In the second pass over data we check all identities in the examined record (and recursively in the records from identity) so we are able to copy information to the examined (first occurence) record. The information means copying info about children, husbands/wifes, occupations and adresses, we also sort all children according to the birth date and then change possible date ranges of births, marriages and deaths for its ancestors. 5 Datasets Testing was done on the dataset created from birth records for one village be- tween years 1607 (1635 respectively) and 1899. There are 1961 records that were manually connected to the family trees in genealogical software. Then it was exported to the .csv file together with the IDs (here we called it IDgedcom) for every person. Those IDs allow us checking if the matching was correct or not. Here we have to state, that such approach has ”disadvantage”, because our data are somehow ”ideal”. That is because we lost small differences that certainly were in the original records - e.g. same person can be once written as ”Jan” and once as ”Johann” or also the writing of surnames or occupations can be slightly different. But we suppose that we can allow such simplification because we suppose that in our final system the words in the database will be normal- ized anyway (probably except of the last names, that will be normalized only partially). From this data we created two kinds of datasets - in the first set we tried to imitate original records so we erased ancestors that were not mentioned in the 10 J. Rozman et al. original records (because when we created the dataset, we have 4 grandparents in almost every record, which does not correspond with the reality). We got 4 datasets that we marked as 4GP (four grandparents), MF (mother’s father), MLN (mother’s last name) and F+M (father and mother). In the Table 2 we can see how much information is available. In the second kind of datasets we chose only those records that contain all 4 grandparents. From original 1961 records we got 1097 records. In these datasets we again deleted ancestors according to the Table 2, but this time we did it for all 1097 records. Father Mother Father’s fath. Fathers’s moth. Mother’s fath. Mother’s moth. 4GP 1+1 1+1 1+1 1+1 1+1 1+1 MF 1+1 1+1 0+0 0+0 1+1 0+0 MLN 1+1 1+1 0+0 0+0 0+0 0+0 F+M 1+1 1+0 0+0 0+0 0+0 0+0 Table 2. Table shows what information is available in various records for 6 ancestors. First is first name, second is last name. 1 means it is known, 0 means it is unknown. 6 Testing and Discussion First we describe results from the second datasets, because they are better for evaluation of the algorithm, while first datasets more correspond to the real baptism records, but does not give so good idea about working of the algorithm. From the Table 3 we can see that number of records, comparisons and time is decreasing as the number of persons in records is decreasing. We can see that for MLN and F+M the number of records is the same, because the number of persons in the records does not change - in MLN we know the last name of mother. Recall is ratio of true positive/(true positive + false negative). This value is approximatelly the same for all four cases, about 96%. This means we are quite successful in finding true matches. What is changing significantly is precision. This value corresponds to what amount of pairs marked as matches are really true matches. This value is about 60%, which means that among matches is about 40% of false positives. This value is quite hing and moreover the precision decreases very significantly for F+M. This means that almost 90% of matches marked as positive matches were in fact false positives. This is caused by lack of information in the records. For example, we know that somebody’s name is Maria and only other information we have is time range of her birthdate, which can be 150 years. This means that we connect all Marias whose birth ranges intersects and because Maria was very common first name, we get lot of false positive matches. If we examine what caused false matches in the F+M dataset, we found that from 11 735 false matches, 11 691 were caused by wrongly matching somebody to CHILD. FATHER is wrongly matched only in 998 cases, but MOTHER in 10 781 cases. In MLN dataset, where we know mother’s last name, the ratio of false FATHER:MOTHER matches is only 998:643, so the Persons Linking in Baptism Records 11 number of false mothers matches is even smaller then for fathers. From the Table 3 we can see that best results are for MF, so the obvious solution would be to delete father’s parents and mother’s mother from 4GP records. Unfortunatelly, from Table 4 we can see, that the time where we have all four grandpatents is quite small (in our case 1858 - 1899, for mother’s father or mother’s last name 1791 - 1857). Luckily, approximately 60% of all people born between 1600 and 1900 were born after 1800. The resulting presicion mainly in the F+M dataset is not too high, but we have to keep in mind that those results are based only on baptism records and even human genealogists are not able to connect persons among generations when they do not know their last names. This can be solved only by adding other informations like from burial records or better from marriage records where we can usually get brides last names. Recall [%] Precision [%] No. of records No. of comparisons Time [s] 4GP 94.0 56.9 7679 6 251 838 97.0 MF 98.2 67.1 4388 3 066 204 33.2 MLN 96.9 55.3 3291 2 159 885 18.0 F+M 97.3 12.5 3291 2 075 102 18.2 Table 3. This results are from the second kind of dataset. The time range here is the same for all four datasets (1735 - 1899) and the number of original records is also same for all four datasets (1097). Number of comparisons differs, because records in different datasets contains different number of persons (see Table 2). Year Range Recall [%] Precision [%] No. of records No. of comparisons Time [s] All 1607 - 1899 96.0 42.9 12102 15 747 936 180.7 4GP 1858 -1899 94.4 74.2 4434 2 970 692 52.0 MF 1812 - 1857 97.6 64.9 2349 1 110 679 13.2 MLN 1791 - 1812 96.3 98.9 694 128 485 1.1 F+M 1607 - 1790 94.2 22.3 1425 400 928 3.0 Table 4. Comparison of datasets with data corresponding to the real parish books. ”All” means all data (4GP, MF, MLN, F+M) are together in one file. Number of original records here (in All or together in others) is 1961. In the first kind of datasets we have again 4GP, MF, MLN and F+M and then all those combined together. This datasets better reflects reality, because more into the past, less information were provided (and also more mistakes and missing records). Results from this datasets can be seen on Table 4. Again we can see that values for recall are quite high, about 95% and for precision they goes down from 74.2% for 4GP to 22.3% for F+M, but again with exception for MLN, where they have 98.9%, which is very high value, that can be caused by small size of this dataset. 12 J. Rozman et al. 7 Conclusion In this paper we described chain of tasks that is neccessary to perform when we want to load baptism records from one database, find identical persons and store resulting family trees into another database. We have created vector-based comparison and tested it on about 2000 baptism records connected into family tree from one small village. Dataset we have used was not from raw baptism data, but it was exported from genealogical software and turned into two datasets, one that correspond to real records and second mainly for testing. Advantage of this approach is that we have IDs from genealogical SW for every person that allow us to find out if the matching was correct or not and then compute recall and precision. Our future work will be mainly aimed on conditional probability in the com- parisons of names. Now we assume that all the names have the same probability, but this is not true. Other improvement can be using of neural network for the decision if two records are the same or not. In the approaches with classical or conditional probability there are lot of weights, because every item of the record has different significance and there is very difficult to tune them all. This problem could be solved by neural networks where the input would be a vector of values as is described in Section 4 and the output then would be matching probability of the persons. We also want to create same datasets from marriage and burial records and we are working on the rewriting of the complete original records that will be supplied by person IDs from genealogical software. References 1. Dintelman, S., Maness, T.: Reconstituting the Population of a Small European Town Using Probabilistic Record Linking: A Case Study, Family History Technology Workshop, BYU 2009 2. Milani, G., Masciullo, C., et al.: Computer-based genealogy reconstruction in founder populations. J. of Biomedical Informatics, 44: 997–1003, 2011 3. Malmi, E. , Gionis, A., Solin, A.: Computationally Inferred Genealogical Networks Uncover Long-Term Trends in Assortative Mating, Proceedings of WWW Confer- ence, 883-892, Lyon, 2018 4. Malmi, E., Rasa, M., Gionis, A.: AncestryAI: A Tool for Exploring Computationally Inferred Family Trees, Proceedings of International World Wide Web Conference Comittee, 2017 5. Christen, P.: Application of Advanced Record Linkage Techniques for Complex Pop- ulation Reconstruction, 2016, ArXiv e-prints (Dec 2016), arXiv:cs.DB/1612.04286 6. Wilson, D. R.: Beyond Probabilistic Record Linkage: Using Neural Networks and Complex Features to Improve Genealogical Record Linkage, Proceedings of Inter- national Joint Conference on Neural Networks, USA, 2011 7. Pixton B., Giraud-Carrier, Ch.: Using Structured Neural Networks for Records Link- age, In Proceedings of the Sixth Annual Workshopon Technology for Family History and Genealogical Research. 8. Gottapu, R. D., Dagli, C., Bharami, A.: Entity Resolution Using Convolutional Neu- ral Networks, Procedia Computer Science 95, 2016, doi: 10.1016/j.procs.2016.09.306