<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Web Based Named Entity Linking for Digital Humanities and Heritage Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesca Frontini</string-name>
          <email>Francesca.Frontini@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen Brando</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Gabriel Ganascia</string-name>
          <email>Jean-Gabriel.Ganascia@lip6.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Linguistica Computazionale CNR</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Labex OBVIL, LIP6, CNRS</institution>
          ,
          <addr-line>4 place Jussieu, 75005, Paris</addr-line>
        </aff>
      </contrib-group>
      <fpage>77</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>This paper proposes a graph-based methodology for automatically disambiguating authors' mentions in a corpus of French literary criticism. Candidate referents are identified and evaluated using a graph-based named entity linking algorithm, which exploits a knowledge base built out of two different resources (DBpedia and the BnF linked data). The algorithm expands previous ones applied for word sense disambiguation and entity linking, with good results. Its novelty resides in the fact that it successfully combines a generic knowledge base such as DBpedia with a domain-specific one, thus enabling the efficient annotation of minor authors. This will help specialists to follow mentions of the same author in different works of literary criticism, and thus to investigate their literary appreciation over time.</p>
      </abstract>
      <kwd-group>
        <kwd>named-entity linking</kwd>
        <kwd>linked data</kwd>
        <kwd>digital humanities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Named Entities (NE) are linguistic expressions that stand as rigid
designators for referents; such entities typically include names of persons, geographical
places and organizations, but also temporal references such as dates. Enriching
mentions with a link to their referent by means of a unique identifier is crucial for the
semantic annotation of texts. This is done by pointing to an external resource,
such as a Uniform Resource Identifier (URI) in the Linked Open Data (LOD)
cloud. Segments of text referring to a Named Entity are known as entity
mentions.</p>
      <p>
        Named Entity Linking (NEL) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a subtask of Named Entity Recognition
and Disambiguation (NERD). NERD algorithms automatically detect entities
in texts and assign them to a given class. The NEL module assigns a unique
identifier to the detected entities, thus disambiguating them by pointing to their
referent. Linking is crucial, since the same mention can represent different
entities in different contexts, while at the same time one entity can be mentioned in
the text in different forms. So for instance the mention "Goncourt" can refer
to either of the two Goncourt brothers, Edmond or Jules. At the same time Jules
de Goncourt can be referred to in the text as "Goncourt", "J. Goncourt", "J.
de Goncourt", and so on. This means that, in order to automatically retrieve all
passages in a text where Jules de Goncourt is mentioned, it is necessary not only
to annotate all these mentions as Named Entities of the class person, but also to
provide them with a unique key that distinguishes them from those of other
people, in this case those of Edmond. The bibliographic identifier "Goncourt,
Jules de (1830-1870)", as well as the links &lt;http://www.idref.fr/027835995&gt;
and &lt;http://fr.dbpedia.org/page/Jules_de_Goncourt&gt;, are examples of such an
identifier.
      </p>
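      <p>To make the ambiguity concrete, the Goncourt example above can be rendered as a minimal lookup table. The following Python sketch is purely illustrative (the dictionary layout and function are not the system described later in this paper); it shows how one surface form may yield several candidate URIs while another is unambiguous:</p>
      <preformat>
```python
# Illustrative sketch: a lookup table mapping surface forms to candidate URIs.
# The mapping layout is an assumption made for this example.
CANDIDATES = {
    "Goncourt": [
        "http://fr.dbpedia.org/page/Edmond_de_Goncourt",
        "http://fr.dbpedia.org/page/Jules_de_Goncourt",
    ],
    "J. de Goncourt": ["http://fr.dbpedia.org/page/Jules_de_Goncourt"],
}

def candidates_for(mention):
    """Return every known referent URI for a surface form."""
    return CANDIDATES.get(mention, [])

print(len(candidates_for("Goncourt")))        # 2: ambiguous mention
print(len(candidates_for("J. de Goncourt")))  # 1: unambiguous mention
```
      </preformat>
      <p>Disambiguation is only needed for mentions like "Goncourt", where the lookup returns more than one candidate.</p>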
      <p>Besides ensuring disambiguation, linking also performs an important
additional task, namely textual enrichment, in that it connects the mention with
sources of additional information (such as DBpedia in the previous example)
that need not be stored in the text but can be accessed when required. In the
case of Edmond de Goncourt, additional information from DBpedia can tell us
what books he authored, where he was born, and so on.</p>
      <p>The main issue with NEL in the digital humanities is that mentions of persons
often refer to individuals that are not listed in general ontologies such as Yago
or DBpedia, which constitute the typical knowledge base for linking in other
domains. Such individuals are often present in other knowledge bases, notably
bibliographical linked data repositories (such as the French National Library's BnF
linked data repository). On the other hand, linking requires access to
ontological knowledge, in that choosing between two individuals having the same name
may require comparing the context of the mention with a priori knowledge. In
this respect, knowledge bases such as DBpedia remain an important source of
general knowledge of the world. Thus the ideal linking algorithm for literary
criticism texts combines general and domain-specific sources. The experiment
described here goes in this direction.</p>
      <p>The paper first presents previous approaches to NEL, then the proposed
graph-based disambiguation algorithm, based on the notion of centrality, and
finally describes the experiment carried out on the corpus and its results. Some
conclusions and suggestions for further improvement of the algorithm are
given at the end.</p>
    </sec>
    <sec id="sec-2">
      <title>Previous approaches</title>
      <p>Previous approaches to NEL can be divided into two main families: those using
text similarity and those using graph-based methods. Both are unsupervised
and do not rely on pre-annotated corpora for training.</p>
      <p>
        The best-known tool of the first group is DBpedia Spotlight [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which performs
NER and DBpedia linking at the same time. Spotlight identifies the candidates
for each mention by computing string similarity between the mention and the
DBpedia labels; it then decides which entry is the most likely by comparing the
text surrounding the mention with the textual description of each candidate.
The referent whose description is most similar to the context of the mention
in terms of TF/IDF is chosen. This method is known to be very efficient, but
it can only provide linking towards resources such as DBpedia, whose entries
come with a description in the form of unstructured text. Other knowledge
bases do not provide a textual description for their entries; such is the case of
the bibliographical databases that constitute the ideal linking target for mentions of
authors.
      </p>
      <p>Graph-based approaches rely on formalised knowledge described in graph
form, built from a Knowledge Base (KB) (e.g. the Wikipedia article
network, Freebase, DBpedia, etc.). Reasoning can be performed through graph
analysis operations. It is thereby possible to at least partially reproduce the
decision process by which humans disambiguate mentions. A reader may
decide that the mention "James" refers to the philosopher "William James" and not
to the writer "Henry James" because it occurs in the same context as "Hume" and
"Kant". In the same way, such algorithms build a graph out of the candidates
available for each possible referent in a given context and use the relative position
of each candidate within the graph to choose the correct referent for each mention.
The graph is built for a context (such as a paragraph) possibly containing more
than one mention, so that the disambiguation of one mention is helped by the
other ones.</p>
      <p>
        This kind of approach is similar to the one used in Word Sense
Disambiguation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where a set of words in a given sentence needs to be labeled with the
appropriate sense label using the information contained in a lexical database
such as WordNet. The key idea of this approach is that, for all ambiguous words
in the context, senses that belong to the same semantic space should be
selected, and that in this way two ambiguous words can mutually disambiguate
each other. More specifically, a subgraph is built, constituted only of the relevant
links between the possible senses of the different words, and then, among the
alternative sense labelings, the most central is chosen. This procedure, when applied
to such context-specific subgraphs, ensures that in the end the senses chosen for
the words are the ones best connected to each other.
      </p>
      <p>
        Centrality is an abstract concept, and it can be calculated using different
algorithms4. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] the experiment was carried out using the following
algorithms: Indegree, Betweenness, Closeness, and PageRank, as well as a
combination of all these metrics using a voting system. Results showed the advantage of
using centrality with respect to other similarity measures. While the
combination of all centrality algorithms scores best, Indegree centrality seems to be
the best performing of the individual measures in terms of precision.
      </p>
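      <p>As an illustration of the measures named above, the following Python sketch computes two of them by hand on a toy undirected graph inspired by the James/Hume/Kant example; the graph and vertex names are invented, and in an undirected graph Indegree reduces to plain degree:</p>
      <preformat>
```python
from collections import deque

# Toy undirected graph (adjacency sets); vertices are invented for illustration.
GRAPH = {
    "james_w": {"hume", "kant"},
    "james_h": {"hume"},
    "hume": {"james_w", "james_h", "kant"},
    "kant": {"james_w", "hume"},
}

def degree(node):
    # Indegree of an undirected graph is simply the number of neighbours.
    return len(GRAPH[node])

def closeness(node):
    # Closeness: (n - 1) divided by the sum of shortest-path distances (BFS).
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nb in GRAPH[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

print(degree("james_w"), degree("james_h"))  # 2 1
```
      </preformat>
      <p>Both measures rank William James above Henry James here, because the former is better connected to the shared context (Hume and Kant).</p>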
      <p>
        This graph-based approach has been applied to NEL, where mentions take
the place of words and Wikipedia articles that of WordNet synsets. Here too,
centrality measures are computed on the Wikipedia structure in order to use
its rich set of relations to disambiguate mentions. More specifically, in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] English
texts were disambiguated using a graph that relies only on the English Wikipedia
and is constituted of the links and categories found in Wikipedia articles.
So for instance the edges of the graph represent whether ArticleA links to
ArticleB, or whether ArticleA has CategoryC. Here too "local" centrality is
then used to assign the correct link to the ambiguous mention. We have chosen
a graph-based approach to NEL, which is described in the next section.
4 For a discussion of the notion of centrality see also [10].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Our approach</title>
      <p>Our approach to disambiguate NE mentions is a graph-based one. Vertices are
represented by URIs of mention candidates (e.g. dbpedia:Victor Hugo) as well as
URIs of concepts (e.g. foaf:Person) or individuals connected to at least two
different candidates. Edges are semantic relations de ned explicitly between URIs
(e.g. \type"). The graph is undirected and their vertices and edges are a priori
unweighted. We take advantage of the notion of centrality in Graph Theory to
link a NE mention with the URI of the most probable candidate for that
mention. In other words, we want to nd the subset of vertices of di erent candidates
having the greatest number of edges among them. The edges and vertices of the
graph are built leveraging knowledge from di erent LOD sources whose nature
is graph-based.</p>
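      <p>The selection rule just described can be sketched in a few lines of Python; the edge list below is a toy stand-in for the combined RDF graph (the specific nodes, such as yago:RomanticPoets and dbpedia:Romantisme, are illustrative assumptions), and plain degree stands in for the configurable centrality measure:</p>
      <preformat>
```python
# Toy sketch of the selection rule: candidate URIs are vertices, linked through
# shared LOD nodes, and the candidate with the highest degree wins.
EDGES = [
    ("dbpedia:Victor_Hugo", "yago:RomanticPoets"),
    ("dbpedia:Alphonse_de_Lamartine", "yago:RomanticPoets"),
    ("dbpedia:Alfred_de_Vigny", "yago:RomanticPoets"),
    ("dbpedia:Victor_Hugo", "dbpedia:Romantisme"),
    ("dbpedia:Alfred_de_Vigny", "dbpedia:Romantisme"),
    ("bnf:Auriane_Vigny", "foaf:Person"),
]

def degree_of(vertex):
    """Number of edges touching a vertex in the combined graph."""
    return sum(1 for a, b in EDGES if vertex in (a, b))

def best_candidate(candidates):
    """Pick the candidate URI with the highest degree."""
    return max(candidates, key=degree_of)

print(best_candidate(["dbpedia:Alfred_de_Vigny", "bnf:Auriane_Vigny"]))
```
      </preformat>
      <p>Here the poet Alfred de Vigny wins over the homonymous candidate because he shares two concept nodes with the other candidates in the context.</p>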
      <p>We illustrate the proposed approach with an example. Let us consider the
following sentence from a French text of literary criticism written by Albert Thibaudet
(1936):</p>
      <p>Quant au rythme, si Victor Hugo a dépassé Lamartine, il n'a pas été plus
loin que Vigny. (As for rhythm, if Victor Hugo surpassed Lamartine, he went
no further than Vigny.)</p>
      <p>Victor Hugo, Lamartine and Vigny are the three mentions automatically
recognized by a NER algorithm; they now need to be linked to an identifier.</p>
      <p>For each mention, the NEL algorithm selects possible candidates by exact
string matching between the current mention and dictionary entries (e.g. Hugo, M.
Hugo) and retrieves the corresponding URIs from the listed LOD sources. An
excerpt of the candidates for the three named entities from the example is given
below, shown as distinguishing personal information instead of URIs for
readability's sake.</p>
      <p>Thanks to the URIs, it is possible to retrieve from the Web of Data the
associated RDF graph for each candidate and combine them into a single graph. It
should contain only those predicates involving at least two candidates of different
mentions, because we only want the predicates that play an important role in
the disambiguation process. Calculating the centrality of every candidate will
then give us the best candidates for the three mentions. Figure 1 shows an
excerpt of the resulting graph, where the chosen mention candidates are marked in
bold. We can notice that the vertex yago:RomanticPoets is the one that
influences the centrality measure the most, because it is shared by the three chosen
candidates. Likewise, other vertices connected to the chosen nodes, such as
dbpedia:romanticisme and dbpedia:Alexandru_Macedonski, are influential.</p>
      <p>Named Entities are disambiguated and referenced within the context of a
paragraph, so in principle two (identical) mentions of the same author within
one paragraph will always receive the same link, while the same mention in
different paragraphs might be assigned a different referent, depending on the
other mentions it occurs with.</p>
      <p>
        The NEL task is commonly defined in such a way that it does not assume
the existence of the correct referent among the candidates in the knowledge base
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This is due to the fact that Wikipedia/DBpedia can hardly be a complete
knowledge base even for textual genres such as contemporary newspaper
articles, and completeness is even less likely for the corpus that is the object of our
experiment: French literary criticism texts contain references not only to famous
authors, but also to minor figures that are not listed in Wikipedia.
Therefore our proposal is to aim for a quasi-complete reference base for the task of
referencing authors.
      </p>
      <p>Our approach relies heavily on a lookup dictionary; this is the subject
of the following section.</p>
    </sec>
    <sec id="sec-4">
      <title>LOD-based lookup dictionary</title>
      <p>
        Linked data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is an important way of publishing knowledge in the Semantic
Web. Such data is easily available via web services; LOD is composed of triples
of the form (subject, predicate, object), where subjects are URIs, objects
may be URIs or data-typed literals, and predicates represent binary relations.
Queries can be run in the SPARQL language, and data is provided with a
dereferenceable and persistent identifier called a URI (Uniform Resource Identifier). Many
of the available linked data sets are of great interest for digital humanities [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and
for the domain of literary criticism in particular. More specifically, information
on authors of French texts can be found in the French version of DBpedia on
the one hand, and in the catalogue of the Bibliothèque Nationale de France (BnF)
on the other.
      </p>
      <p>The French DBpedia is constituted of the articles of the French version of
Wikipedia. In DBpedia, entries are classified under one or more of the types of the
DBpedia ontology. So for instance the author known as Stendhal5 is classified
as Person, Artist, Writer, and, at the top level, as Thing. Moreover, authors are
linked to each other by horizontal relations such as InfluencedBy, and, indirectly,
by being linked to the same concept, such as Romanticism. BnF entries list all
authors of books ever published in France; their entries contain information on
dates of birth and death, gender, alternative names, and works authored. For instance
the BnF entry for Voltaire6 gives several alternative names such as François-Marie
Arouet (Voltaire's real name), Wolter, Good Natur'd Wellwisher, and so on.</p>
      <p>Most crucially, BnF links its entries to the DBpedia ones when they exist, thus
making it very easy to connect the two resources in one knowledge graph.
Moreover, BnF entries also list the author's Idref, which is the official identification
system used by French universities and higher education establishments to
identify, track and manage the documents in their possession. The combination of
these two sources was considered able to grant sufficient coverage for a corpus
of French literary criticism; thus the BnF and DBpedia SPARQL endpoints
were queried for all authors, retrieving their biographic information (name,
surname, alternative names, dates of birth and death, title, ...) in structured form.</p>
      <p>In order to be able to retrieve all possible mentions of an author, this
information was processed into a dictionary of authors that contains all alternative
names of an author, plus a series of automatically generated alternative forms,
with the links to the BnF and DBpedia entries. Automatically generated alternative
names are of the form:
- surname only (Rousseau)
- initials + surname (J.J. Rousseau, JJ Rousseau, ...)
- title + surname (M. Rousseau, M Rousseau)
Given the domain (French literature), this procedure ensures the retrieval
of at least one candidate URI for most mentions. At the same time, the mass of
information present in the BnF repository generates several homonyms and
makes most mentions ambiguous; thus good disambiguation becomes crucial.</p>
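      <p>The variant patterns listed above can be sketched as a small Python function. The exact generation rules of our dictionary builder are richer than this, so the function below is only an illustration of the three listed patterns (the helper name and its output shape are assumptions):</p>
      <preformat>
```python
# Illustrative sketch of alternative-name generation for the lookup dictionary.
def name_variants(first_names, surname):
    """Generate surname-only, initials + surname and title + surname variants."""
    initials = [n[0] for n in first_names]
    dotted = ".".join(initials) + "."   # "J.J."
    plain = "".join(initials)           # "JJ"
    return {
        surname,                        # Rousseau
        dotted + " " + surname,         # J.J. Rousseau
        plain + " " + surname,          # JJ Rousseau
        "M. " + surname,                # M. Rousseau
        "M " + surname,                 # M Rousseau
    }

variants = name_variants(["Jean", "Jacques"], "Rousseau")
print(sorted(variants))
```
      </preformat>
      <p>Every generated variant is stored in the dictionary together with the author's BnF and DBpedia URIs, so that any of these surface forms retrieves the same candidate set.</p>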
    </sec>
    <sec id="sec-5">
      <title>Implementation of the NEL algorithm</title>
      <p>The NEL algorithm processes a file in XML-TEI format7; NE mentions carry
NER annotations (e.g. the tag &lt;persName&gt;) for every paragraph; the
5 http://fr.dbpedia.org/page/Stendhal
6 http://data.bnf.fr/11928669/voltaire/
7 http://www.tei-c.org/index.xml
algorithm is devised to process one single class at a time (here Person). It uses
a lookup dictionary per class, listing superficial forms and their associated URIs
from LOD sources, as described in the previous section. The algorithm produces
an enriched version of the input file indicating the chosen candidate for each
mention. We developed our implementation in Java; RDF data is processed
with the Jena API8; graphs are manipulated with the JgraphT API9; and
implementations of centrality measures are available in the social network analysis
tool JgraphT-SNA10.
8 https://jena.apache.org/
9 http://jgrapht.org
10 https://bitbucket.org/sorend/jgrapht-sna
In particular, the algorithm performs the following steps
for every paragraph of the XML-TEI file:</p>
      <sec id="sec-5-1">
        <title>1. look for URIs of mention candidates in the dictionary</title>
        <p>2. retrieve the RDF graphs of those URIs
3. simplify and combine graphs then compute the selected centrality measure
4. choose URI of candidate with the higher score per mention then write results
in TEI le</p>
        <p>The algorithm searches (1) for possible candidates of mentions by exact string
matching between the mentions of the current paragraph and superficial forms in the
dictionary; there must be at least one ambiguous mention to continue. It retrieves
the URIs (BnF, DBpedia) of mention candidates from dictionary entries. Next, the
RDF graph is retrieved (2) for every URI and converted to a JgraphT-compatible
graph, where RDF objects and subjects are vertices and RDF predicates are
edges. Irrelevant edges and vertices are removed from the graphs: we keep edges
which involve at least two vertices representing candidate URIs. Information
coming from different sources is combined into a single graph (3); the way we
combine graphs is straightforward. The fusion is implicitly done thanks to one
of the main LOD principles, which consists of reusing vocabularies published
in the LOD vocabulary cloud. In other words, edges (predicates) and vertices
(URI nodes) should be shared by at least two graphs associated with candidates of
different mentions. The selected centrality measure (e.g. closeness) is calculated
on the resulting graph. Finally, the algorithm chooses (4) the URI of the mention
candidate with the highest centrality score and annotates the input XML-TEI
file with this information.</p>
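        <p>The gate in step (1), which lets disambiguation run only when at least one mention in the paragraph is ambiguous, can be sketched as follows; the dictionary shape is an assumption made for illustration:</p>
        <preformat>
```python
# Illustrative dictionary: surface form -> list of candidate URIs.
DICTIONARY = {
    "Goncourt": ["bnf:Jules_de_Goncourt", "bnf:Edmond_de_Goncourt"],
    "Hugo": ["dbpedia:Victor_Hugo"],
}

def needs_disambiguation(mentions, dictionary):
    """Continue only if at least one mention has more than one candidate."""
    return any(len(dictionary.get(m, [])) > 1 for m in mentions)

print(needs_disambiguation(["Hugo", "Goncourt"], DICTIONARY))  # True
print(needs_disambiguation(["Hugo"], DICTIONARY))              # False
```
        </preformat>
        <p>Paragraphs whose mentions are all unambiguous (or unknown) skip the graph construction entirely.</p>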
        <p>Furthermore, the simplification of the graphs and the calculation of centrality measures
on the combined graph are crucial parts of the algorithm (3). This step is detailed
in Algorithm 1. It essentially removes edges which are irrelevant to the
centrality measure; in other words, it deletes every non-candidate vertex that is
connected to fewer than two candidate vertices, together with its edges.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experiments and results</title>
      <p>This section describes the experiment settings used to test our proposal, as well
as preliminary results, which are encouraging.</p>
      <sec id="sec-6-1">
        <title>Experiment settings</title>
        <p>Algorithm 1 NEL: simplify and combine graphs, compute centrality
Require: graphs: the graphs of candidates per mention; measure: a centrality measure
for graph in graphs do
    initialize vertexToDelete
    for vertex in graph do
        if vertex is not a candidate then
            initialize vertexCheck
            for each edge of vertex do
                if vertex1 notEqual vertex AND vertex1 is a candidate then
                    vertexCheck.add(vertex1)
                end if
                if vertex2 notEqual vertex AND vertex2 is a candidate then
                    vertexCheck.add(vertex2)
                end if
            end for
            if size of vertexCheck &lt; 2 then
                vertexToDelete.add(vertex)
            end if
        end if
    end for
    graph.removeAllVertices(vertexToDelete)
    chosenURIs = calculateCentrality(measure, graph)
end for
return chosenURIs, the chosen candidate per mention</p>
        <p>In order to evaluate the performance of the algorithm, the linking is performed on correctly
identified and classified authors.</p>
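        <p>The pruning step of Algorithm 1 can be sketched in Python as follows; the actual implementation is in Java on JgraphT, and the toy edge list below is invented for illustration:</p>
        <preformat>
```python
# Sketch of Algorithm 1's pruning: delete every non-candidate vertex that does
# not connect at least two candidate vertices, then keep the surviving edges.
def simplify(edges, candidates):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    to_delete = set()
    for vertex, neighbours in adjacency.items():
        if vertex not in candidates:
            linked = {n for n in neighbours if n in candidates}
            if len(linked) >= 2:
                continue  # useful hub shared by several candidates: keep it
            to_delete.add(vertex)
    return [(a, b) for a, b in edges
            if a not in to_delete and b not in to_delete]

EDGES = [
    ("cand:Hugo", "yago:RomanticPoets"),
    ("cand:Vigny", "yago:RomanticPoets"),
    ("cand:AurianeVigny", "foaf:Person"),  # touches only one candidate
]
kept = simplify(EDGES, {"cand:Hugo", "cand:Vigny", "cand:AurianeVigny"})
print(kept)
```
        </preformat>
        <p>Only the hub shared by two candidates survives; the centrality measure is then computed on the pruned graph.</p>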
        <p>The test corpus consists of a French text of literary criticism titled "Une thèse
sur le symbolisme" (A thesis about Symbolism); it is the first volume of the
work named "Réflexions sur la littérature" (Reflections on literature), published
by Albert Thibaudet in 1938.</p>
        <p>The text is drawn from a larger "Corpus critique"11, published in TEI by
the Labex OBVIL and containing a large collection of critical essays by different
authors.</p>
        <p>
          The chosen text presents a particularly high density of authors' mentions:
each paragraph generally contains an average of 2-3 mentions, which are
treated at the same time by the algorithm. Mentions of authors were
manually annotated by two experts in French literature; the URIs assigned to
mentions are those from Idref12. The guidelines for manual annotation were those
proposed by the MUC7 conferences, as well as those defined by the XML/TEI
standard. The resulting test corpus contains 1021 manually annotated mentions of
11 http://obvil.paris-sorbonne.fr/corpus/critique/
12 www.idref.fr
person entities. We measure the precision of the proposed NEL approach in
terms of the attribution of the right URI to a mention, with respect to the URI
manually assigned by humans. The authors' lookup dictionary was
automatically built in advance thanks to the BnF LOD source, which is rich in SameAs
predicates pointing to DBpedia and Idref URIs. The resulting lookup
dictionary is composed of 4,218,798 author names, including alternative names
(e.g. M. Lamartine, Monsieur Lamartine, etc.). We chose 3 centrality measures
commonly used in social network analysis and in the word-sense disambiguation
problem; these are: DegreeCentrality [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], BrandesBetweennessCentrality [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
FreemanClosenessCentrality [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], as implemented in the JgraphT-SNA tool.
        </p>
        <sec id="sec-6-1-1">
          <title>Results and Analysis</title>
          <p>The test results with the three algorithms are shown in Table 1.</p>
          <p>Precision is calculated by comparing the number of correctly assigned links against
the total number of manually annotated author entities. The best result is obtained
with BrandesBetweennessCentrality, with a precision of 0.74. DegreeCentrality
has comparable performance, while FreemanClosenessCentrality heavily
underperforms with respect to the other centrality measures. The last column of
Table 1 shows the number of empty links over the total.</p>
          <p>These first results are satisfactory: though far from the 85% accuracy that
is normally achieved by similar algorithms on the news domain, such levels of
precision are nevertheless remarkable, considering that in many cases the text
discusses minor authors, unknown today, who are not necessarily listed in
DBpedia. Moreover, the use of BnF makes the number of candidates (and thus the
possibility of error) explode, with sometimes as many as 20 or more possible
candidates for a mention.</p>
          <p>To quantify author incompleteness in both the DBpedia and BnF data sets
used in this experiment, we count the number of mentions for which the algorithm
(using the DegreeCentrality measure) does not find any corresponding URI in the
chosen KB. In this manner, there are 160 author mentions, out of the 1021 mentions
identified in the corpus by the algorithm, that have no match in DBpedia, that
is, around 16%. Remarkably, there are only 23 mentions (i.e. 2%) that have no
match in either BnF or DBpedia. Notice that all authors in this test set that are
in DBpedia are also in BnF.</p>
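          <p>The percentages above follow directly from the reported counts:</p>
          <preformat>
```python
# Quick check of the coverage figures reported above.
total = 1021        # manually annotated author mentions
no_dbpedia = 160    # mentions with no DBpedia match
no_match = 23       # mentions with no match in BnF or DBpedia

print(round(100 * no_dbpedia / total))  # 16
print(round(100 * no_match / total))    # 2
```
          </preformat>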
          <p>The most frequent mistakes for the DegreeCentrality and
BrandesBetweennessCentrality measures (the most similar and most precise ones) concern the
following authors: Vielé-Griffin, Francis (1864-1937); Boileau, Nicolas
(1636-1711); Barrès, Maurice (1862-1923); Payen, Fernand (1872-1946); Lefranc, Abel
(1863-1952); Shakespeare, William (1564-1616); Spencer, Herbert (1820-1903);
Goncourt, Edmond de (1822-1896) and his brother Goncourt, Jules de (1830-1870);
Mentré, François (1877-1950). The algorithm makes four types of mistakes.</p>
          <p>MISSING CANDIDATES - In 23 cases the algorithm is unable to retrieve
any candidate from the lookup dictionary, since the author is not present in any
knowledge base. This is the case for the author Francis Vielé-Griffin. In other cases
the correct entity is present but not associated with the required pseudonym.
This is the case for William Shakespeare's alleged alter ego William Stanley13.
This alias is not listed in the dictionary for Shakespeare; therefore, it is not
possible to assign both mentions to the same person (and thus the same URI).</p>
          <p>MISSING CONTEXT - In some rare cases only one ambiguous author's
mention is present in a single paragraph, so the algorithm resorts to a fallback
strategy, choosing the entity with the most links overall. Sometimes this
strategy causes errors, as in the case of "Vigny", for whom, in isolation, the
wrong link to Auriane Vigny is chosen.</p>
          <p>INCOMPLETE INFORMATION - In some cases the context of the
sentence should be sufficient to produce a correct disambiguation, but the NEL
algorithm makes mistakes due to a lack of links in the knowledge base, which
prevents the centrality measure from producing the desired result. For instance,
"Shakespeare", when mentioned in the context of the Shakespearian critic Abel Lefranc,
should produce the correct link to William, but Nicolas is chosen instead.
Clearly, explicit links between Abel Lefranc and the object of his studies are
missing from the knowledge bases. Ancient authors also tend to cause problems
due to lack of information; e.g. the Greek author Lysias is mistaken for a
homonymous French revolutionary collective.</p>
          <p>WRONG, MISLEADING INFORMATION - Sometimes the
knowledge bases contain wrong or misleading information. For instance, there exists a
BnF entry for the "Ronsard family", classified as foaf:Person, which is chosen
instead of the correct assignment, namely one of its members, Pierre de Ronsard.
The opposite also happens: some mentions refer to both Goncourt brothers
as a collective noun, but the algorithm chooses one of the two. Finally, wrong
or misleading pseudonyms are sometimes listed in BnF for an author, causing
wrong candidates to be injected into the graph and sometimes selected. For
instance, "Descartes" is listed as a pseudonym of the novelist Horace Walpole, and
thus Walpole is sometimes wrongly chosen as the link for the philosopher Descartes.</p>
          <p>Error analysis also shows that sometimes relevant information that is present
in the knowledge base is not used in the decision process because it cannot
be encoded in the graph in the form of links. A typical example is temporal
information which is encoded in the form of dates (data-typed literals). In other
words the fact that - for a given context - two candidate referents lived in the
same period of time cannot be taken into account.
13 Stanley is believed by some to be the real author behind Shakespeare's works.</p>
          <p>To evaluate the impact of the temporal dimension, we evaluated
against an index from which we removed authors born after the publication date
of the work. The results show a slight improvement, with DegreeCentrality
reaching a precision of 0.78 and BrandesBetweennessCentrality 0.77. A greater
improvement may be obtained using a more sophisticated graph-building
algorithm that transforms information about dates of birth and death into links
that can connect authors in a measurable way.</p>
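          <p>The suggested transformation of date literals into graph links can be sketched as follows; the entries and their dates are illustrative assumptions (Auriane Vigny's dates in particular are invented for the example), and the overlap rule is one possible design:</p>
          <preformat>
```python
# Sketch: turn birth/death literals into "contemporary" edges so that temporal
# compatibility can contribute to the centrality score.
AUTHORS = {
    "bnf:Jules_de_Goncourt": (1830, 1870),
    "bnf:Edmond_de_Goncourt": (1822, 1896),
    "bnf:Auriane_Vigny": (1970, None),  # invented dates for a modern homonym
}

def overlap(span_a, span_b):
    """True when the two life spans intersect (None death = still alive)."""
    birth_a, death_a = span_a
    birth_b, death_b = span_b
    death_a = death_a or 9999
    death_b = death_b or 9999
    return death_a >= birth_b and death_b >= birth_a

def contemporary_edges(authors):
    """Build an edge between every pair of authors whose lives overlapped."""
    names = sorted(authors)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlap(authors[a], authors[b]):
                edges.append((a, b))
    return edges

print(contemporary_edges(AUTHORS))
```
          </preformat>
          <p>Edges produced this way could be merged into the candidate graph before computing centrality, so that a nineteenth-century context no longer promotes a present-day homonym.</p>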
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and future work</title>
      <p>We presented an algorithm to perform NEL on a corpus of 19th-century
literary criticism, with the specific goal of disambiguating and referencing author
mentions for research purposes. The NEL module is meant to be used in
combination with a NER module, and will help researchers in the creation of digital
literary editions enriched with information about authors. The main purpose of
this work is to help scholars in the history of literature to perform complex queries in
order to study the literary appreciation of authors over time, and to investigate the
history of literary criticism in French literature. More specifically, the enrichment
of the aforementioned "Corpus critique" is meant to enhance ongoing research
in the history of scientific ideas, and to provide a way to follow the
dissemination of theories and concepts defined by Charles Darwin, Claude Bernard and Henri
Bergson in non-scientific texts of their time.</p>
      <p>The reported experiment shows how combining different sources can be useful
to perform linking on a domain specific corpus with satisfying results. While the
precision is not yet state of the art, it is nevertheless remarkable, considering
that this is the first time that graph centrality algorithms have been used for
NEL combining DBpedia with a domain specific source. Tests showed significant
differences between one implementation of centrality and the other two. Error
analysis suggests possible improvements of the algorithm, including the ad hoc
transformation of temporal information - present in the knowledge base in the
form of literals - into links of the context graph. Another possible evolution of the
algorithm would be to assign different weights to the edges, so that, for instance,
sharing the same literary circle becomes a more important relation than being
born in the same town. Weights would be learned from manually annotated data.
Further experiments will be carried out with different corpora and on different
categories of entities, notably places.</p>
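      <p>The edge-weighting idea could be realised by scoring each candidate with a weighted sum over its relations, the per-relation weights being learned from annotated data. A simplified Python sketch follows; the relation names and weight values are invented for illustration and are not part of the present system.</p>

```python
# Hypothetical sketch of weighted candidate scoring: each relation type
# carries a weight (hand-set here; in practice learned from annotated data),
# and a candidate's score is the weighted sum of its edges in the context graph.

RELATION_WEIGHTS = {
    "same_literary_circle": 3.0,  # assumed stronger evidence of relatedness
    "same_birthplace": 0.5,       # assumed weaker evidence
}

def weighted_degree(candidate_edges):
    """candidate_edges: list of (neighbour, relation_type) pairs.
    Unknown relation types default to weight 1.0."""
    return sum(RELATION_WEIGHTS.get(rel, 1.0) for _, rel in candidate_edges)

edges = [("Gautier", "same_literary_circle"), ("Balzac", "same_birthplace")]
print(weighted_degree(edges))  # → 3.5
```

      <p>The same weights could equally be fed to a weighted variant of a betweenness centrality computation rather than to this simple degree score.</p>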
      <p>Experimenting with the size of the context will also be necessary, in order
to find the best trade-off between efficiency and informativeness. A more ample
context (ideally a whole chapter) may produce a better graph of candidates, such
that all mentions can disambiguate each other correctly. But at the same time
this may introduce noise, and also generate a graph so big that its construction
and the calculation of centrality may require too much time.</p>
      <p>
        Another possible evolution of the algorithm could be to improve the graph
fusion procedure. So far, our strategy does not handle the proper fusion of
individuals that are described heterogeneously by the different sources (e.g. Victor
Hugo as described by the BnF, as described by DBpedia, and so on). In this study
we chose to approach the problem from a quantitative point of view, and thus to
consider existing knowledge as it is, without a pre-processing step. In the future,
we foresee making use of strategies commonly applied in Conceptual Graphs for
information fusion [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In this way, the resulting graph would better concentrate
domain knowledge (i.e. avoid redundancy and conflicts) and thus allow a
more accurate centrality measure to be calculated.
      </p>
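      <p>As a first approximation, fusing heterogeneous descriptions of the same individual amounts to merging their property sets under a single node while recording conflicting values. The Python sketch below illustrates this simplified merge policy only; it is not the Conceptual Graphs fusion method of [6], and the property names are invented.</p>

```python
# Simplified sketch of node fusion: descriptions of one individual coming
# from different sources (e.g. BnF and DBpedia) are merged into a single
# property set; agreeing values are kept once, conflicts are recorded.

def fuse(descriptions):
    """descriptions: list of property dicts for one individual.
    Returns (merged_properties, conflicts)."""
    merged, conflicts = {}, {}
    for desc in descriptions:
        for prop, value in desc.items():
            if prop not in merged:
                merged[prop] = value
            elif merged[prop] != value:
                conflicts.setdefault(prop, {merged[prop]}).add(value)
    return merged, conflicts

bnf = {"name": "Victor Hugo", "birth": 1802}
dbpedia = {"name": "Victor Hugo", "birth": 1802, "occupation": "writer"}
merged, conflicts = fuse([bnf, dbpedia])
print(merged)     # → {'name': 'Victor Hugo', 'birth': 1802, 'occupation': 'writer'}
print(conflicts)  # → {}
```

      <p>A single fused node per individual would remove the redundancy that currently dilutes the centrality scores across duplicate candidates.</p>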
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work was supported by French state funds managed by the ANR within
the Investissements d'Avenir programme under reference ANR-11-IDEX-0004-02
and by an IFER Fernand Braudel Scholarship awarded by FMSH.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst</source>
          .
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brandes</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>A faster algorithm for betweenness centrality</article-title>
          .
          <source>Journal of Mathematical Sociology</source>
          <volume>25</volume>
          (
          <issue>2</issue>
          ),
          <fpage>163</fpage>
          –
          <lpage>177</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          :
          <article-title>A set of measures of centrality based on betweenness</article-title>
          .
          <source>Sociometry</source>
          pp.
          <fpage>35</fpage>
          –
          <lpage>41</lpage>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hachey</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curran</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Graph-based named entity linking with wikipedia</article-title>
          .
          <source>In: Web Information System Engineering – WISE 2011</source>
          , pp.
          <fpage>213</fpage>
          –
          <lpage>226</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hachey</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nothman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curran</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Evaluating entity linking with wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>194</volume>
          ,
          <fpage>130</fpage>
          –
          <lpage>150</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Laudy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganascia</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          :
          <article-title>Information fusion using conceptual graphs: a tv programs case study</article-title>
          .
          <source>In: ICCS</source>
          . pp.
          <fpage>158</fpage>
          –
          <lpage>165</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Silva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia Spotlight: shedding light on the web of documents</article-title>
          .
          <source>In: Proceedings of the 7th International Conference on Semantic Systems</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          .
          ACM
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nadeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A survey of named entity recognition and classification</article-title>
          .
          <source>Lingvisticae Investigationes</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>26</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Entity linking: Finding extracted entities in a knowledge base</article-title>
          .
          <source>In: Multi-source, Multilingual Information Extraction and Summarization</source>
          , pp.
          <fpage>93</fpage>
          –
          <lpage>115</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Rochat</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Character Networks and Centrality</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Lausanne (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Unsupervised graph-based word sense disambiguation using measures of word semantic similarity</article-title>
          .
          <source>In: ICSC</source>
          . vol.
          <volume>7</volume>
          , pp.
          <fpage>363</fpage>
          –
          <lpage>369</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Van Hooland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Wilde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steiner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Exploring entity recognition and disambiguation for cultural heritage collections</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>