=Paper= {{Paper |id=Vol-1399/paper14 |storemode=property |title=How to Make it in History. Working Towards a Methodology of Canon Research with Digital Methods |pdfUrl=https://ceur-ws.org/Vol-1399/paper14.pdf |volume=Vol-1399 |dblpUrl=https://dblp.org/rec/conf/bd/BraakeF15 }} ==How to Make it in History. Working Towards a Methodology of Canon Research with Digital Methods== https://ceur-ws.org/Vol-1399/paper14.pdf
          How to Make it in History. Working Towards a Methodology of Canon
                             Research with Digital Methods
                                          Serge ter Braake and Antske Fokkens
                                History and Computational Linguistics, VU University Amsterdam
                                    De Boelelaan 1105 1081 HV Amsterdam, the Netherlands
                                            s.ter.braake@vu.nl, antske.fokkens@vu.nl

                                                                 Abstract
This paper proposes a methodology for studying canonisation of people in history with digital methods. These canons are for the most
part culturally determined. For a select group of people, there is no doubt that they merit the necessary attention, but there is a large gray
field of ‘second rate’ individuals who had an impact on history of which only a small group is included in more than a footnote. This
makes the attention people get from historians rather arbitrary, subjective and unacademic. Digital humanities technologies can help us
to work around this arbitrariness and to get insight into the canonisation processes.

Keywords: Canonisation, Fame, Ngrams, Named Entity Recognition


                       1    Introduction

This paper proposes a methodology for studying canonisation of people in history with digital methods.1 With ‘canonisation of
people in history’ we mean the repeated mentioning of people in any history book (e.g. a study on British Parliament), reference
work (e.g. a biographical dictionary), newspaper, website or actual canon (e.g. the ‘Canon van Nederland’).2 Canons are for the
most part determined culturally, rather than by the actual impact people had in history. The example of the continuous
underrepresentation of women in history works makes this only too clear (Bosch, 2014). For a select group of people, there is no
doubt that they merit the necessary attention in historiography, but there is a large gray field of ‘second rate’ individuals who
had an impact on history of which only a small group is included in more than a footnote. This makes the focus of historians on a
relatively limited group of people rather arbitrary, subjective and unacademic. Digital humanities technologies can help us to
work around this arbitrariness and to get insight into the canonisation processes.

In this paper we take canonisation of individuals in the Netherlands as our example, but the same methodology could be applied to
other countries. The rest of this paper is structured as follows. In Section 2, we introduce the phenomenon of canonisation in
history and the role digital humanities can play. Section 3 discusses the different sources that could provide an answer to our
question. In Section 4, we provide a breakdown of the available biographical data and tools for the Netherlands, how to make good
use of them and what their limitations are. We propose a methodology for making the best use of digital methods in combination
with traditional methods for canon breaking research in Section 5. In Section 6 we show some preliminary results, followed by our
conclusions.

   1 All URLs in this paper were last retrieved on 31 May 2015
   2 http://www.entoen.nu/

                       2    Canonisation and Digital Humanities

Canonisation of people and events in history is an unfortunate, but natural process. Once individuals are mentioned and remembered
in various sources, they enter the frameworks people use to maintain their memory for a longer period of time (Halbwachs, 1985,
p.29). This means that once well-embedded in collective memory or historiography, a person does not leave it easily and that those
that did not make it are doomed to oblivion, unless they are (re)discovered. The urge to make formalised ‘canons’ of what everyone
should know about history, no matter how useful for education and public history, reinforces this process. This means that
historians could be ‘blind’ to large groups of potentially historically interesting people and events. Canonisation therefore
impedes historical innovation and it needs to be studied in order to break it.

The problem of biases in historiography is well known, but there has been little research into how selection processes work and
what this could mean for our knowledge and views of history as a whole. For the historian, this effect of reinforcing what we think
we know about history and continuously forgetting/ignoring what we do not know poses a major, and as yet still underestimated,
problem (Sample, 2012; Earhart, 2012).

One of the main challenges in addressing this problem is that identifying influential people that did not make it into the history
books is a process of collecting needles from a haystack. Historians need to go through vast amounts of data that contain
references to influential people and find those people that are forgotten despite being as influential as their famous or
semi-famous contemporaries. Digital methods are necessary to carry out such research in an efficient way.

The advent of the digital age has in general sparked a new interest in frequency lists, which help us in understanding
canonisation processes. Ngram viewers can tell us the frequency of a (combination of) words within a certain corpus of texts over
time, we can count the number of words used by members of parliament and the kind of terms they use

and we can call up fame rankings of people who are mentioned in Wikipedia.3 Such lists are particularly interesting for humanities
researchers, since they give them the opportunity to approach old topics in a different way.4 Computer software is able to analyse
much more text than any human could ever do, which allows humanities researchers to back up interpretations based on anecdotal
evidence with actual numbers and to formulate or test hypotheses more quickly. With the Google Ngram viewer, based on the words in
millions of books, it is for example easy to see how the popularity of Anne Frank rises quickly after the Second World War.5

The creators of the Google Ngram viewer have run some interesting experiments with their corpus (Michel et al., 2011). The most
closely related to our goals are the ones on the rise to fame of all famous people between 1800 and 2000 and the ‘Science Hall of
Fame.’6 The first experiment used the 740,000 names of persons in Wikipedia and 42,358 names in the database of the Encyclopedia
Britannica. This yielded interesting results, e.g. 1) Most people experienced a quick rise to fame followed by a slow decline after
the peak; 2) Most people enjoyed their peak circa 75 years after their births; 3) People increasingly become more famous more
quickly, but also are forgotten more easily (Michel et al., 2011, p.180).

Online biographical dictionaries and Ngram viewers give ample possibilities for investigating who became famous and why, even when
taking all the source biases and limitations of the tools into account. It is more challenging, however, to look for the people who
did not become famous, while they were prominent enough in their own time. Even if the data in Google Books and the KB Ngram viewer
are less discriminative than the biographical dictionaries, they do not solve this problem. When the creators of the Google Ngram
viewer did their research on the fame of people between 1800 and 2000 they used existing lists of people from Wikipedia and the
Encyclopedia Britannica. Even if the lists from the Encyclopedia ‘reflect a process of expert curation that began in 1768’ (Michel
et al., 2011, p.180), they are still biased and subjective. Logically, the people who are left out of Wikipedia and the
Encyclopedia do not show up in the two aforementioned analyses either and therefore, to a certain extent, the canon reaffirms
itself.

These experiments are, in other words, top-down: existing lists were used to match with records of the past. The experiment can
show that certain people are not mentioned as much as one would expect, but not that certain people or events were ‘hot topics’
during a certain time and have been forgotten since. For a complete picture, records need to be queried that do not have the bias
of modern records. We want to scan for any names in a wide variety of not only books, but also sources like journals, newspapers,
pamphlets and archive material, and see what happens to their fame in the course of centuries. In Section 3, we will say a bit more
about the potentially interesting sources to use to get a grasp on these ‘missing persons.’

   3 Wikirank: http://wikirank.di.unimi.it/index.html; Pantheon: http://pantheon.media.mit.edu/methods
   4 e.g. When was the word potato used for the first time? ‘De DBNL ngram-viewer van de KB’:
     https://www.youtube.com/watch?v=XpMqypF46RY
   5 https://books.google.com/ngrams/ search for ‘Anne Frank’, on 13 May 2015.
   6 http://www.sciencemag.org/site/feature/misc/webfeat/gonzoscientist/episode14/index.xhtml

                       3    The sources

To map canonisation in history we need to make a distinction between the different sources we can use. There are contemporary
sources (e.g. a pamphlet from 1581 scolding William of Orange) on the one hand and sources written after the death of a person
(e.g. a biography on William of Orange from 1978) on the other. Similarly, there are sources with a conscious selection of people
(e.g. historical sources like a biographical dictionary) and sources that select less consciously or not at all (like a list of
land owners). Obviously we can have both contemporary and later sources with and without a conscious selection, as can be seen in
Figure 1.

The contemporary sources are needed to see how famous a person was in his or her own time. We will see in Section 6, Table 2 for
example, that the politician Johan Rudolph Thorbecke was extremely prominent in the newspapers of his time. It is logical to assume
that a person is often most famous in his or her own time, but the examples of the painter Vincent van Gogh and Anne Frank already
show that this is not always the case. The sources published after the death of an individual show how the fame of a person
developed. Even though Thorbecke remained one of the canonised figures from Dutch history, his fame declined over the years, as can
be seen from the sources after his death. Obviously, for historical figures before the nineteenth century this starting point will
be difficult to determine, due to the lack of sources.

It is more complex to make a distinction between sources that consciously select people to write about and sources that do not. A
biographical dictionary is a good example of a source that does consciously select individuals. One of the main questions of any
editor of a biographical dictionary is who is noteworthy enough to get an entry and who is not. A history book on the Dutch Revolt
is already a less clear example of selection. Obviously, any historian selects the people and events he or she deems important
enough to describe. The mentioning of individuals might, however, have to do with the selection of an event (e.g. presence at a
certain battle) rather than with any selection of persons. This is why we consider prosopographical studies, group biographies, as
good examples of sources that do not consciously select individuals. Prosopographies are quantitative studies on larger groups of
people. The category a person belongs to (e.g. officers at the Council of Holland) determines whether someone is selected for the
study, not the person him- or herself. Newspapers also select what they deem the most important news, but that is mostly driven by
popular demand and not by historical judgments on who is influential enough to include.

The differences between these sources have to be taken into account for any historical interpretation of the results. In

the following section we shall see how this is already facilitated.

Figure 1: Schematic view of sources for research on canonisation processes

                       4    Available tools and data

To investigate who became a well-known person, who did not, and why, we need at least the following data from as many records as
possible: names, dates and places of birth and ‘claims to fame’ (i.e. why did or could someone become famous?). In theory any
record of people or events could be suitable for our purpose, from medieval chronicles to early modern newspapers, to modern school
books. For a full picture of canonisation a wide variety of sources needs to be consulted, from each category listed in Figure 1.
Lists of famous people are strongly dependent on the kind of medium that is consulted, as will be demonstrated in Section 6. The
enormous amount of data from these sources could never be close-read by one person. We therefore need digital methods to speed up
our research (Wilkens, 2012, p. 251, 255; Michel et al., 2011, p. 176). In this section we will discuss a non-exhaustive selection
of what we deem to be the most obvious sources to start such research, and how they relate to the sources mentioned in Section 3.

It is relatively easy to trace the people who did make it in history for a top down analysis of canonisation. Biographical
dictionaries list the supposedly most noteworthy men and women from, for example, a country, profession, time period or political
movement. Many countries host a dictionary of national biography online, offering increasingly enhanced options for research.7
People described in biographical dictionaries were selected by the editors, often after large consulting rounds. Sometimes the
availability of experts also had an influence on who was included and who was not (Hanssen, 1995, p. 78; Nadel, 1984, p. 52). In
the Biography Portal of the Netherlands (BP), 23 such biographical datasets are gathered,8 resulting in biographical data on over
75,000 individuals. These individuals can be analysed on common characteristics, such as age, gender and claim to fame. The dataset
of the BP is used in our bottom-up analysis in Section 6. These biographical dictionaries are excellent examples of sources with a
conscious selection after the deaths of individuals. Because, if all is well, individuals only have one entry in a biographical
dictionary, their fame can be ‘measured’ by looking at how often they occur in other people’s biographies.

Resources such as DBpedia,9 a structured dataset in RDF based on the data in Wikipedia, offer similar possibilities for group
analyses of ‘famous’ people. The advantage of these datasets over biographical dictionaries is that they are bigger, dynamic, more
inclusive and edited by ‘the crowd’ rather than by a selected group of editors. Wikipedia does particularly well in providing
reliable basic data on individuals. One of the disadvantages is that DBpedia and Wikipedia have clear biases as well, which are
more often grounded in ‘Geek hobbies’ than in academic tradition (Rosenzweig, 2011). The broad criteria Wikipedia uses for
inclusion nevertheless make it a source with a less conscious selection. Furthermore, it provides continuously updated information
on people, both during their life and after their death (e.g. actor Leonard Nimoy (†2015) had an extensive entry on Wikipedia
during his life, which is still being adjusted and complemented as we speak).10

For data on Dutch people that were even less consciously selected the KB (National Library of the Netherlands) Ngram viewer is a
good resource to start.11 The main advantage of the KB Ngram viewer is that it uses the words in over 9 million digitised newspaper
pages from the Netherlands and thereby also covers people and events that were once considered worth mentioning and might have been
forgotten in historiography. Unfortunately, the biases which are introduced by the limited availability of digitised newspapers
will also influence the results provided here.12

   7 e.g. Oxford Dictionary of National Biography: http://www.oxforddnb.com/; Deutsche Biographie:
     http://www.deutsche-biographie.de; Australian Dictionary of Biography: http://adb.anu.edu.au/
   8 http://www.biografischportaal.nl
   9 http://www.dbpedia.com
  10 https://en.wikipedia.org/?title=Leonard_Nimoy
  11 http://kbkranten.politicalmashup.nl/
  12 This point was also made clearly by Bram Mellink in his presentation at ‘studiedag God in Nederland 3.0’ (21 November 2014)
     entitled ‘Zoekt en gij zult vinden. Digitale onderzoeksmethoden, religiegeschiedenis en het probleem van de ondoorzichtige
     dorpsrel (1951-1952)’ Slides: http://www.religiegeschiedenis.nl/rg/docs/PresentatieBramMellink.pdf

The data derived from Google Books and made accessible in the Google Ngram viewer and its raw datasets are less specific for the
Dutch situation, but still useful. They are less biased by preselections of digitisation than the newspaper archive. The ‘black
box’ of the Ngram viewer, however, makes it impossible to see to what extent sources with a

conscious or less conscious selection of people were used. The Ngram viewer calculates the word frequency in a selection of 5
million out of the 15 million books scanned by Google. The Ngrams are available in corpora of several languages, though not in
Dutch (Michel et al., 2011).13 Furthermore, there are Google NGrams for Dutch, which is a dataset of 133 billion words extracted
from open websites between October and December 2008,14 and the DBNL Ngram viewer, which searches in Dutch literary texts.15 For
this paper we have used all four of these Ngram viewers.

  13 https://books.google.com/ngrams/info
  14 http://www.let.rug.nl/gosse/bin/Web1T5_freq.perl and https://catalog.ldc.upenn.edu/LDC2009T25
  15 http://www.dbnl.org/zoek/ngram.php

                       5    A Methodology for Canon Research

In this section, we propose a method for fruitful computational analysis of canon formation with digital historical data. As
mentioned above, Ngram viewers are suitable for ‘top-down’ research on canons, when you know which people you are looking for. We
want to combine this approach with a bottom-up approach, where the starting point is not an existing list of names, but all the
names from as many resources as possible from all categories as described in Section 3. This way we can also find whose fame did
not last for centuries and formulate ideas on why this is the case.

Another identified problem with Ngram viewers is that they provide little context and provenance information. Especially for a
historian, it is important to know where information came from, to check the reliability and to see the context (Fokkens et al.,
2014). We therefore need to meet the need for provenance and context by making a division between the original data and a layer
above the original data (a supraset) where computational reasoning has taken place. Both the provenance of the original data and
that of the processes that took place manipulating them should be traceable (Ockeloen et al., 2013; Moreau and Groth, 2013). To
facilitate both a bottom-up approach and insight into context and provenance we suggest the following steps:

1) To investigate canonisation, we need to identify all names in our datasets and not restrict ourselves to predefined lists. We
are, after all, not only looking for the people who made it to the canon, but also for the ones that were forgotten. We therefore
need an approach for Named Entity Recognition (NER) to filter out all names from our sources. A commonly used state-of-the-art
named entity recognizer for English reports a 90% F-score (Finkel et al., 2005). However, there are fewer training sets for Dutch
and the task we need in this step is easier than the typical NER task: we are producing lists of person names for historians to
study. We therefore mainly need very high recall on identifying person names. Precision is less important, because historians can
simply discard expressions that do not refer to a person in their final analysis. Furthermore, we are not interested in names that
do not refer to people and standard NER approaches are trained to identify locations, organisations and miscellaneous names in
addition to names of people. The exact method we followed for this paper is described in Section 6.

2) Initially, all names should be considered as belonging to unique individuals and we should assign all of them an
Internationalized Resource Identifier (IRI).16 We cannot simply assume that the same name refers to the same person. By assigning
all names unique IRIs to start with there is no risk of polluting the original data. Any errors can always be traced back to the
original source this way (de Boer et al., 2014).

3) The third step is to disambiguate all the names and establish which can be linked to the same person. It is not trivial to do
this automatically,17 but it can be done (as by Veres (Bohannon, 2011)) by comparing the mentioned dates, places, other people and
professions in the context. Ideally, the probability of each match should also be indicated. The role of the historian is vital in
writing an algorithm for this task, to provide the historical context and establish what can be considered evidence for a match
between two people.

4) Most efforts in digitising data revolve around specific ‘canonised’ topics. We therefore need a non-digitised control dataset to
establish in what way the fact that we can only use digitised sources for computational analyses influences the results. For this,
a historian still needs to go through the archives to analyse non-digitised sources and write down the names and generic data like
dates of birth and death and ‘claim to fame’. Of course the historian will once again have to take into account the different kinds
of sources as mentioned in Section 3. This set should be analysed both apart from and together with the digital set.

5) We would then be able to draw up graphs and tables of which people were mentioned often in what works, when, where and how,
which would provide insight into the canonisation of Dutch history.

6) Finally, a more detailed survey should be done by the historian. The leads provided by technology should be followed to see the
context and find explanations for the findings. We need access to provenance and context to give room for theory and to assess the
meaning of all these numbers (see Hall (2012) for a similar argument).

For this paper we performed step 1 and applied a basic approach to address step 3.

  16 IRIs are generalizations of URIs that support Unicode.
  17 Note that there is no predefined ontology, which makes this a different task from standard named entity disambiguation as in
     (Mendes et al., 2011).
  18 http://nl.wikipedia.org/wiki/De_grootste_Nederlander

                       6    Results

6.1 Top down approach

In this section we will discuss the results from a top down approach for investigating who is most famous in Dutch history. Since
any existing fame list would do as a starting point, we took the top 25 of the Dutch TV elections of the ‘Grootste Nederlander’
(Greatest Dutch person).18 We then ranked them basing ourselves on the Google (books) Ngram viewer (for English), the KB Ngram
viewer, taking the words from Dutch newspapers, the DBNL Ngram

viewer, containing words from mostly literary texts, and Google Ngrams for Dutch, which contains all words used on the Internet at the end of 2008. With these sets we have sources from historiography, the news, cultural texts and the Internet, which together should provide a rather balanced set of sources with much and less selection, both from the period during and after individuals’ lives. We ranked the individuals by their highest score in one year, since for the limited scope of this paper it would go too far to calculate a balanced average for each individual.
We faced several challenges in identifying the right people. The spelling of names is possibly the biggest issue here. Before the nineteenth century there was no standardised spelling of names, which results in many varieties in not only contemporary sources, but also in modern works. Even if a particular name is usually spelled the same way, poor OCR quality could still bias the results. The options to use wildcards in the viewers to catch all variations are often very limited.
Another problem is caused by people with the same name. William the Silent, number two in the elections (see Table 1), is most commonly known as William of Orange. The hits we receive for ‘William of Orange’ in the Google Ngram viewer, however, may refer to the leader of the Dutch revolt (†1584) we are looking for, but also to his great-grandson, the later King of England (†1702), number 72 in the TV elections. Pollution with instances of the king of England could be especially significant in the English corpus of Google books. We therefore only used his nickname ‘William the Silent’ in this corpus. Despite the significant reduction in hits, he still ranks number 1 in Google books, which further justified our decision.
Identifying the humanist scholar Desiderius Erasmus poses a problem because he is known as ‘Erasmus’. Dropping his first name would lead to many additional hits from other people and Google NGrams for Dutch does not even facilitate searching for unigrams. The same applies to the philosopher Baruch de Spinoza. A quick search in the World Biographical Information System19 shows us that while there are 789 hits for Erasmus, there are ‘only’ 8 hits for Spinoza (and most of them refer to the correct and the same person), indicating that the risk of pollution is lower. Still, results in Google Ngrams seem significantly inflated for the unigram Spinoza, giving him an extremely high score in 1883. The year 1883 does not have a high score when searching for bigrams of ‘Baruch Spinoza’, or trigrams of ‘Baruch de Spinoza’, which strongly suggests that too much pollution occurs when the first name is dropped. We therefore added the results for ‘Baruch Spinoza’ and ‘Baruch de Spinoza’, whilst knowing the score does not reflect all references to him.
There are also people who were known under different names or titles during their lives, such as members of the royalty. We had to search for both princess and queen Juliana and princess and queen Wilhelmina to obtain the best result. For widely known people like them this problem can be circumvented quite easily, but in other cases specific domain knowledge is needed to find all instances. To give just one example: Dutch treasurer Vincent Cornelisz from the first half of the sixteenth century was very famous in his time, but is currently unknown to a wide audience. In history books he is not only referred to as Vincent Cornelisz, but also as Vincent van Mierop (a name which was used for the first time by his son, not by him), or as Vincent Cornelisz van Mierop. In records of his own time, he was so well known that often he was simply referred to as master Vincent, which ironically means that the fame in his own time causes a problem in tracing his fame in our time (ter Braake, 2007, p. 375).

Rank   Elections 2004            Total NGram viewers
1      Pim Fortuijn              Koningin Wilhelmina
2      Willem van Oranje         Willem van Oranje
3      Willem Drees              Koningin Juliana
4      Antoni van Leeuwenhoek    Vincent van Gogh
5      Desiderius Erasmus        Rembrandt van Rijn
6      Johan Cruijff             Anne Frank
7      Michiel de Ruyter         Johan Thorbecke
8      Anne Frank                Christiaan Huygens
9      Rembrandt van Rijn        Desiderius Erasmus
10     Vincent van Gogh          Prins Claus

Table 1: The most famous Dutch people in history according to the 2004 TV elections and the Ngram viewers in the tables below

In Tables 1, 2, and 3, we see the top ten occurrences of famous people when searching for the original TV elections top 25. The highest average position in all Ngram viewers is listed in the right column of Table 1. It is very clear that the fame of a person depends greatly on the kind of medium that is used. Number 1 of the TV elections, the politician Pim Fortuijn, only features in the Ngrams for Dutch, which is not surprising since the other lists are for the years 1800-2000 and he only rose to fame in the twenty-first century. Queen Wilhelmina, the number 1 in the Total Ngrams list, surprisingly did not make it to the top 10 of the elections. The same can be said for the other members of the royalty, prince Claus and queen Juliana. Apparently they were and are very famous, but are not considered of too much historical significance by the Dutch people. Prime minister Thorbecke claims a high position in the overall ranking due to the many mentions in Dutch newspapers in the middle of the nineteenth century. Christiaan Huygens owes his position primarily to the fact that the DBNL has many of his private letters in its collection. Dutch soccer player Marco van Basten does not make it to the overall top ten, but does score highly in the newspapers and on the Internet. William of Orange/the Silent and painter Vincent van Gogh are the only people who feature in every list. If anything, these tables show how relative fame is. The more (heterogeneous) big datasets we have at our disposal, the more balanced the picture will become. In the following subsection we will explore what happens when we use a bottom up approach and try to find the famous people that do not feature on any preexisting list.

  19 http://db.saur.de/WBIS/basicSearch.jsf The system hosts biographies on 6 million people from 58 biographical archives all over the world.
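
The ranking step of the top down approach (adding up the hits for the search variants of one person and ranking people by their highest score in a single year) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure used to produce Tables 1 to 3: the function names, the dictionary layout and all frequencies below are hypothetical.

```python
from collections import defaultdict


def combined_series(variant_series):
    """Sum the yearly scores of all search variants of one person.

    `variant_series` maps a search string to a {year: score} dictionary.
    """
    totals = defaultdict(float)
    for series in variant_series.values():
        for year, score in series.items():
            totals[year] += score
    return dict(totals)


def peak_score(variant_series):
    """Highest combined score in any single year (0.0 if there is no data)."""
    return max(combined_series(variant_series).values(), default=0.0)


def rank_by_peak(people):
    """Order people from the highest to the lowest peak score."""
    return sorted(people, key=lambda name: peak_score(people[name]), reverse=True)


# Hypothetical, made-up scores for one Ngram viewer (illustration only).
viewer = {
    "Baruch de Spinoza": {
        "Baruch Spinoza": {1880: 1.0, 1883: 2.0},
        "Baruch de Spinoza": {1883: 1.5, 1900: 0.5},
    },
    "Anne Frank": {"Anne Frank": {1960: 3.0, 1995: 4.0}},
}

print(rank_by_peak(viewer))
```

Summing the variants before taking the maximum mirrors the way the results for ‘Baruch Spinoza’ and ‘Baruch de Spinoza’ were added; a person searched under one name only simply has a single entry in the inner dictionary.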

Rank   Google Ngram viewer       KB Ngram viewer
1      William of Orange         Johan Thorbecke
2      Anne Frank                Koningin Juliana
3      Koningin Wilhelmina       Koningin Wilhelmina
4      Vincent van Gogh          Prins Claus
5      Johan Thorbecke           William of Orange
6      Antoni van Leeuwenhoek    Rembrandt van Rijn
7      Koningin Juliana          Vincent van Gogh
8      Christiaan Huygens        Johan van Oldenbarneveldt
9      Desiderius Erasmus        Marco van Basten
10     Rembrandt van Rijn        Desiderius Erasmus

Table 2: The most famous Dutch people in history according to Google Ngram viewer (1800-2000) and KB Ngram viewer (1800-2000)

Rank   DBNL Ngram viewer             Google Ngrams for Dutch
1      Christiaan Huygens            Marco van Basten
2      Rembrandt van Rijn            Anne Frank
3      Johan Thorbecke               Pim Fortuijn
4      Desiderius Erasmus            Koningin Wilhelmina
5      William of Orange             Johan Cruijff
6      Baruch de Spinoza             Toon Hermans
7      Koningin Wilhelmina           Koningin Juliana
8      Willem Drees                  Willem van Oranje
9      Vincent van Gogh              Vincent van Gogh
10     Johan van Oldenbarneveldt     Prins Claus

Table 3: The most famous Dutch people in history according to DBNL Ngram viewer (1800-2000) and Google Ngrams for Dutch (2008)

6.2 Bottom Up Approach
As mentioned in Section 5, it is relatively easy to identify names with tools for Named Entity Recognition. For this particular study, we use a highly simplistic but effective pattern-matching approach. We select word sequences that begin with a capitalised word (e.g. Willem) and end with a capitalised word (e.g. Oranje), which works fine for Dutch (but would be quite useless for German, which capitalises all nouns). Because both the first and last word must start with a capital letter, we avoid the inclusion of words that start the sentence.20 The algorithm allows for two sequential lower case words within the name, since it is customary to write prepositions and determiners in Dutch names in lower case when they are preceded by a first name or initials. The algorithm can thus capture names such as Johan Derk van der Capellen tot den Pol, but no names where three lowercase words follow each other, which are extremely rare in Dutch.
For our particular use case, we primarily aim for recall, because (1) historians can immediately filter out the invalid patterns found by our approach and (2) bad patterns are often singletons in the corpus, having no or little influence on the top and middle of our frequency based lists. For these reasons, precision can be as low as 5% or 10% and the approach will still serve its purpose (though higher precision does make the historian’s job easier). Our basic pattern-matching approach is thus preferable for this particular research over the more sophisticated machine-learning approaches that have higher precision, but lower recall.
We tested our method on the data of the Biography Portal of the Netherlands, an aggregated dataset of 23 different sources, all with their own limitations and biases.21 A biographical dictionary in itself is a ‘canon’ of noteworthy people and will therefore not reveal many ‘forgotten’ people. The Portal nevertheless provides a suitable dataset to try out our methodology. It provides a large volume of descriptive texts in Dutch and the output of our algorithm will reveal what person names occur most in these texts (in their own and in other people’s biographies), thereby showing us a measure of fame after all. The principle of applying our method does not differ from applying it to a set that did not apply any form of selection.
With our approach we could easily get the number of occurrences of entries such as Willem van Oranje (William of Orange). The fact that we also got results from ‘Tweede Kamer’ (Dutch parliament, 882 hits), ‘Den Haag’ (The Hague, 830 hits) and ‘Staten van Holland’ (States of Holland, 420 hits) shows an interesting overall bias towards political history, but these hits can be easily discarded for our purpose here. You do not need to be a domain expert to easily see that these expressions do not refer to people.
Named entity disambiguation is more problematic. After discarding the false hits we have Willem I, Willem II, Willem III, Willem IV and Willem V ranking in the top 10 of our list, but unfortunately there have been many counts, dukes and stadtholders over the centuries who go by that name and title. A problem of a different nature is that we have Willem I, Willem van Oranje and the prins van Oranje ranking high, which could all refer to the same person: William the Silent (of Orange), the number 2 from the TV elections and the overall ranking in Table 1. Hits such as ‘Van den Bergh’ also cause identity problems, since without the context we cannot see which Van den Bergh this is, or even if he or she is an actual historical person or just a historian who is cited often. Some of the results are quite telling, however. We are quite sure that ‘Karel V’ will almost always refer to emperor Charles V (and perhaps a few times to the fourteenth century French King) and that Frederik Hendrik and prins Maurits refer to the famous sons of William the Silent. Domela Nieuwenhuis must refer to the social anarchist Ferdinand Domela Nieuwenhuis, since he has quite a unique name.
In a first attempt at named entity disambiguation we investigated the possibilities of applying time constraints based on metadata and temporal expressions in the text. This way count Willem II (thirteenth century), stadtholder Willem II (seventeenth century) and king Willem II (nineteenth century) would be easily separated.
We implemented a basic approach that tackles the time constraint of identity, which is based on the idea that people can only personally interact with someone who was alive at the same time as they were. Because this is the case, we assume that in the typical case, people who are mentioned in someone’s biography will be a contemporary of the biography’s subject. In order to establish which mentions refer to the same person, we extracted the date of birth and date of death from the metadata of the biographies in our corpus. While going through the corpus to identify names, we only merged names when the lifespans of the subjects either overlapped or were at most 50 years apart from each other. This baseline ensures that, if the reference in the text itself is not about the far past or future, it is at least possible that the texts refer to the same person.
Because there may be people alive at the same time who have the same name and 50 years offers quite a range, the approach does not offer any guarantees that references to different people are not combined, but it helps to solve some of the clearer cases where sources do not talk about the same person. It solves, for instance, the issue of high nobility with the exact same name. They are either from a different era altogether, or they have a different number behind their name.

Rank   BP first results       BP second results
1      Willem I               Willem I (1772)
2      Willem III             Karel V (1500)
3      Prins van Oranje       Willem II (1792)
4      Karel V                Willem III (1650)
5      Willem II              Willem V (1748)
6      Willem V               Domela Nieuwenhuis (1846)
7      Frederik Hendrik       Frederik Hendrik (1584)
8      Domela Nieuwenhuis     Willem III (1817)
9      Willem IV              Lodewijk Napoleon (1778)
10     Prins Maurits          Willem IV (1711)

Table 4: Results from the Biography Portal of the Netherlands, without (left) and with (right) time disambiguation. The second column also shows year of birth

The results of this approach are quite promising. Table 4 shows that while we previously were not able to distinguish between Willem III, the nineteenth century king, and Willem III, the seventeenth century stadtholder, we now have them listed as two different individuals. It also shows that Willem I does not refer to William the Silent at all, as one may expect from the lists from our top down approach, but to nineteenth century king Willem I. Looking at the tables it seems that the Biography Portal of the Netherlands, and then most likely especially the two biggest dictionaries included in there from the nineteenth and early twentieth century, are strongly biased towards the House of Orange. Further research might show that many people were included in the dictionaries because of their link to king Willem I.
By refining this method, for example by automatically merging similar instances like ‘Willem I’ and ‘Koning Willem I’, and by applying it to a larger and a wider variety of datasets we would come closer to seeing canonisation patterns than traditional research could have ever brought us.
In an attempt to trace the ‘forgotten’ individuals we made a list of the people who do get mentioned frequently in the texts from the BP, but who do not have a biographical entry of their own. The results of this exercise were interesting enough, but do still involve quite a lot of manual work from the historian. Many people in the list we generated did have their own entry after all, but are mentioned in a slightly different way. Politician P.W.A. Cort van der Linden, for example, is often mentioned as Cort van der Linden (16 times) and similar issues occur with many other politicians from the nineteenth and twentieth century. Moreover, some individuals are known under various alternative names. For instance, sixteenth century duke Karel van Gelre is listed as Karel van Egmond.
The people who are mentioned most frequently in the texts and who really do not have their own biographical entry are listed in Table 5. We find an important religious figure, a communist philosopher (probably mainly thanks to the biographical dictionary on socialists included in the Portal), no less than eight French rulers, an English king and a German emperor in the top 12. It does not bring us closer to the forgotten people in Dutch history, but does show a clear connection of Dutch elites with French royalty (or a bias in the dictionaries towards France or people involved with France). We also encounter the previously mentioned problem of how to identify people who are mostly known by one name. To detect Erasmus in the Ngram viewers we had to search for Desiderius Erasmus. In Table 5 we see 15 mentions of Napoleon Bonaparte, while there will be many more for just Napoleon. To find them, however, we would have to expand our algorithm to include one word instances as well, which would result in too much noise for our analyses for this basic version of our algorithm.

Rank   Individuals without their own biography    Number of mentions
1      Jezus Christus                             > 75
2      Karel II (king of England)                 60
3      Lodewijk XIV (king of France)              40
4      Lodewijk VIII (king of France)             25
5      Lodewijk XI (king of France)               25
6      Frans I (king of France)                   23
7      Lodewijk XII (king of France)              18
8      Karl Marx (German philosopher)             18
9      Lodewijk XVI (king of France)              16
10     Jozef II (German emperor)                  15
11     Lodewijk XIII (king of France)             15
12     Napoleon Bonaparte (French emperor)        15

Table 5: People mentioned most frequently in the Biography Portal of the Netherlands, without their own biographical entry

To trace the individuals who were noteworthy in their own time, but are forgotten in history, we are more likely to be successful when analysing sources with a semi-conscious selection, as mentioned in Figure 1. We applied our method to a sample of 99 historic Dutch newspaper texts provided by the Koninklijke Bibliotheek.22 The sample is too small to provide indications of ‘forgotten people’, but the outcome of this test shows that our method can be applied successfully to these articles. The outcome furthermore confirmed our observation based on data from the BP that phrases that do not correspond to a name generally occur only once and therefore do not form a hindrance for the historian, given that a single mention does not point to (contemporary) fame.

  20 Names such as Willem II are identified, because the sources use Roman capital letters to add numbers to nobility with the same first name.
  21 http://www.biografischportaal.nl
  22 http://lab.kbresearch.nl/get/Downloads

7    Conclusions
In this paper we addressed the importance of research on canon formation in historical research. Before the advent of digital technologies and the availability of digitised data, this could only be done tentatively. We have shown that despite many methodological and technical problems, there is a decent amount of data available and there are tools that facilitate group analyses of famous people.
In section 5, we proposed a method to complement a top-down approach of analysing people still famous now with a bottom-up approach, which gives more room for unbiased selection, context and provenance of the data. The basic means to carry out such research are available. Even though methodologies for task 3) are still in a preliminary stage and the work in 4) and 5) is still labor intensive, the possibilities provided by digital humanities make this research feasible.
In section 6, we discussed some difficulties in applying a top-down approach and have also discussed the first results of a bottom-up approach. A close collaboration between historians and computer scientists is a requirement to make such research successful, especially in the named entity disambiguation. Expert domain knowledge combined with complex algorithms is needed to match as many individuals correctly as possible and to signal false positives. Eventually such exercises can help us to explain why some people only get 15 minutes of fame and others live on in memory over centuries.
The approaches we presented in this paper are relatively basic. We explained that this is not an issue for named entity recognition, because precision is of minor importance for the historian investigating canonisation. We plan to experiment with alternative versions of the algorithm, including a version that can handle single names such as Erasmus and Napoleon. However, given that our basic algorithm already yields interesting results, future work will mainly focus on better disambiguation. We expect that standard methods for named entity disambiguation are not the most suitable for this task and data, because they tend to make use of the content words used in the text and address a wider range of named entities than just people. We therefore expect most from a domain and target entity specific approach that combines frequency of the first and last name, information about time and place, as well as social networks.
The most important next step, however, will be to apply the methods outlined in this paper to new datasets that also provide a contemporary perspective and/or use semi-conscious selection. Contemporary sources play a vital role in identifying people who were famous and fell into oblivion, thus providing the necessary means to compare and identify what aspects contribute to canonisation once initial fame is achieved.

8    Acknowledgements
This work was supported by the BiographyNet project http://www.biographynet.nl (Nr. 660.011.308), funded by the Netherlands eScience Center (http://esciencecenter.nl/). Partners in this project are the Netherlands eScience Center, the Huygens/ING Institute of the Royal Dutch Academy of Sciences and VU University Amsterdam. We would like to thank Dr. Ronald Sluijter for his insightful comments on an earlier version of this paper. All remaining errors are our own.

9    References
J. Bohannon. 2011. Google books, wikipedia, and the future
   of culturomics. Science, 331:135.
M. Bosch. 2014. 1001 vrouwen in perspectief. traditie en
   verandering van het biografische woordenboek in
   nederland en elders. BMGN, LCHR, 129(1):55–76.
V. de Boer, J. Leinenga, M. van Rossum, and R. Hoekstra.
   2014. Dutch ships and sailors linked data cloud. In
   Proceedings of the International Semantic Web Conference
   (ISWC 2014), 19-23 October, Riva del Garda, Italy.
A.E. Earhart. 2012. Can information be unfettered?: Race
   and the new digital humanities canon. In M. K. Gold,
   editor, Debates in the Digital Humanities, pages 309–318.
   University of Minnesota Press, Minneapolis, London.
J.R. Finkel, T. Grenager, and Ch. Manning. 2005.
   Incorporating non-local information into information
   extraction systems by gibbs sampling. In Proceedings of
   the 43rd Annual Meeting on Association for Computational
   Linguistics (ACL ’05). Association for Computational
   Linguistics, Stroudsburg, PA, USA.
A.S. Fokkens, S. ter Braake, N. Ockeloen, P. Vossen,
   S. Legêne, and G. Schreiber. 2014. Biographynet:
   Methodological issues when nlp supports historical
   research. In Proceedings of the 9th edition of the
   Language Resources and Evaluation Conference (LREC),
   Reykjavik, Iceland, May.
M. Halbwachs. 1985. Das kollektive Gedächtnis. Mit einem
   Geleitwort zur deutschen Ausgabe von Heinz Maus.
   Fischer, Frankfurt am Main.
G. Hall. 2012. Has critical theory run out of time for
   data-driven scholarship? In M. K. Gold, editor, Debates
   in the Digital Humanities, pages 127–132. University of
   Minnesota Press, Minneapolis, London.
L. Hanssen. 1995. Op zoek naar een onbekende.
   biografische lexicons als wetenschappelijk hulpmiddel.
   Biografisch Bulletin, 5(1):77–83.
P. N. Mendes, M. Jakob, A. García-Silva, and Ch. Bizer.
   2011. Dbpedia spotlight: shedding light on the web of
   documents. In 7th International Conference on Semantic
   Systems (I-Semantics ’11).
J.B. Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian
   Veres, Matthew K. Gray, William Brockman, The
   Google Books Team, Joseph P. Pickett, Dale Hoiberg,
   Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker,
   Martin A. Nowak, and Erez Lieberman Aiden. 2011.
   Quantitative analysis of culture using millions of
   digitized books. Science, 331:176–182.

L. Moreau and P. Groth. 2013. Provenance: An Introduc-
   tion to PROV. Synthesis Lectures on the Semantic Web:
   Theory and Technology. Morgan & Claypool.
I. Nadel. 1984. Biography. Fiction, fact & form. MacMil-
   lan, London and Basingstoke.
N. Ockeloen, A.S. Fokkens, S. ter Braake, P. Vossen,
   V. de Boer, G. Schreiber, and S. Legêne. 2013. Biog-
   raphynet: Managing provenance at multiple levels and
   from different perspectives. In Proceedings of the Work-
   shop on Linked Science (LISC2013) at ISWC (2013).
R. Rosenzweig. 2011. Wikipedia: Can history be open
   source? In R. Rosenzweig, editor, Clio Wired. The
   Future of the Past in the Digital Age, pages 51–82.
   Columbia University Press, New York.
M.L. Sample. 2012. Unseen and unremarked on: Don
   DeLillo and the failure of the digital humanities. In
   M. K. Gold, editor, Debates in the Digital Humanities,
   pages 187–201. University of Minnesota Press, Min-
   neapolis, London.
S. ter Braake. 2007. Met Recht en Rekenschap. De
   ambtenaren bij het Hof van Holland en de Haagse
   Rekenkamer in de Habsburgse Tijd (1483-1558). Ver-
   loren, Hilversum.
M. Wilkens. 2012. Canons, close reading, and the evolu-
   tion of method. In M. K. Gold, editor, Debates in the
   Digital Humanities, pages 249–258. University of Min-
   nesota Press, Minneapolis, London.
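
As an illustration of the bottom up approach of Section 6.2, the sketch below combines the two steps described there: extracting word sequences that start and end with a capitalised word and contain at most two consecutive lower case words in between, and merging identical name strings only when the lifespans of the biography subjects overlap or lie at most 50 years apart. It is a minimal sketch under stated assumptions and not the implementation behind Tables 4 and 5: the tokeniser, the function names `extract_names` and `merge_by_lifespan`, the chaining of lifespans sorted by year of birth, and the three sample sentences with their lifespans are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed tokeniser: words (optionally with internal . ' or -) or single
# punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:[.'-]\w+)*|[^\w\s]")


def is_cap(token):
    return token[0].isupper()


def is_lower_word(token):
    return token.isalpha() and token.islower()


def extract_names(text, max_lower=2):
    """Return word sequences that start and end with a capitalised word and
    contain at most `max_lower` consecutive lower case words in between."""
    tokens = TOKEN_RE.findall(text)
    names, i = [], 0
    while i < len(tokens):
        if not is_cap(tokens[i]):
            i += 1
            continue
        end, lower_run, j = i, 0, i + 1
        while j < len(tokens):
            if is_cap(tokens[j]):
                end, lower_run = j, 0
            elif is_lower_word(tokens[j]) and lower_run < max_lower:
                lower_run += 1
            else:
                break
            j += 1
        if end > i:  # a single capitalised word is not treated as a name
            names.append(" ".join(tokens[i:end + 1]))
        i = end + 1
    return names


def merge_by_lifespan(mentions, max_gap=50):
    """Group mentions of the same name string when the lifespans of the
    biography subjects overlap or lie at most `max_gap` years apart.
    Chaining lifespans sorted by year of birth is an assumption."""
    by_name = defaultdict(list)
    for name, (birth, death) in mentions:
        by_name[name].append((birth, death))
    merged = {}
    for name, spans in by_name.items():
        groups = []
        for birth, death in sorted(spans):
            if groups and birth - groups[-1]["until"] <= max_gap:
                groups[-1]["until"] = max(groups[-1]["until"], death)
                groups[-1]["count"] += 1
            else:
                groups.append({"from": birth, "until": death, "count": 1})
        merged[name] = groups
    return merged


# Hypothetical biography texts with the lifespan of each biography's subject.
biographies = [
    ("Na 1647 diende hij stadhouder Willem II.", (1610, 1672)),
    ("In 1650 overleed Willem II.", (1625, 1690)),
    ("Na 1840 regeerde Willem II.", (1800, 1870)),
]
mentions = [(name, lifespan)
            for text, lifespan in biographies
            for name in extract_names(text)]
merged = merge_by_lifespan(mentions)
print(merged["Willem II"])
```

The pattern is deliberately greedy: a capitalised sentence-initial word followed by one or two lower case words and a name would be glued to that name. This fits the recall-over-precision trade-off argued for in Section 6.2, where invalid patterns are expected to be rare singletons that the historian can discard.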



