=Paper=
{{Paper
|id=Vol-2314/paper6
|storemode=property
|title=Linking historical sources to established knowledge bases in order to inform entity linkers in cultural heritage
|pdfUrl=https://ceur-ws.org/Vol-2314/paper6.pdf
|volume=Vol-2314
|authors=Gary Munnelly,Annalina Caputo,Séamus Lawless
|dblpUrl=https://dblp.org/rec/conf/comhum/MunnellyCL18
}}
==Linking historical sources to established knowledge bases in order to inform entity linkers in cultural heritage==
<pdf width="1500px">https://ceur-ws.org/Vol-2314/paper6.pdf</pdf>
<pre>
    Linking historical sources to established knowledge bases in order to
                  inform entity linkers in cultural heritage

                        Gary Munnelly, Annalina Caputo, Séamus Lawless
                                         Adapt Centre
                                    Trinity College Dublin
                                            Ireland
            {gary.munnelly, annalina.caputo, seamus.lawless}@adaptcentre.ie


                     Abstract

    A problem for researchers applying Entity Linking techniques to niche Cultural Heritage
    collections is the availability of Knowledge Bases with adequate coverage for their domain.
    While it is possible to generate a specialised Knowledge Base from available resources, this
    can result in a collection which is semantically annotated, but remains separate from other
    collections due to the use of a unique vocabulary. This paper presents a linking scheme for
    mapping a newly created Knowledge Base of significant Irish people to DBpedia for the purposes
    of enriching the new Knowledge Base, facilitating integration with other collections and
    enabling multi-Knowledge Base Entity Linking, which has been the subject of some research.
    The method is described and evaluated, showing that it achieves a high level of performance
    on a new Knowledge Base constructed from the Oxford Dictionary of National Biography and the
    Dictionary of Irish Biography.

1   Introduction

While Entity Linking (EL) has seen much development over the years (Bunescu and Pasca, 2006;
Milne and Witten, 2008; Ratinov et al., 2011; Yosef et al., 2011; Usbeck et al., 2014; Waitelonis
and Sack, 2016; Brando et al., 2016), it is hindered by several limitations when applied to
Cultural Heritage (CH) collections. Most notable is a significant under-representation of
entities in common Knowledge Bases (KB) such as DBpedia (Agirre et al., 2012; Van Hooland et al.,
2015; Munnelly and Lawless, 2018b). Consequently, EL systems which are informed by such KBs are
ill equipped for annotating niche CH collections such as those studied by historians.

A possible solution is to construct tailored KBs from resources used by scholars investigating CH
material. Such KBs would presumably be more appropriate for annotating the kinds of specialised
collections in question. However, taking this approach can hobble one of the greatest benefits of
annotating a collection with semantic resources, namely the ability to integrate with other
collections which are annotated using the same vocabulary.

This paper discusses a linking method which was developed while constructing a specialised KB for
notable Irish historical figures. The approach is intended to identify corresponding entities in
DBpedia for each entity in the new KB where such equivalents exist. This facilitates
communication between collections annotated with the specialised KB and others annotated with
more general KBs. Moreover, EL with respect to multiple ontologies is made possible, provided the
EL service in question supports such an operation. Research by Brando et al. (Brando et al.,
2016) has shown that it is beneficial to EL in CH when a specialised KB can be integrated with a
more general one, and an EL service can perform linking across both resources in unison.

In Section 2 this paper discusses related work and surrounding context which motivated the
development of this method. Section 3 describes the method in question. An evaluation is carried
out in Section 4 and Section 5 provides concluding remarks.

2   Related work

The overarching research related to this paper investigates methods of performing EL on primary
source Irish historical archives. This research focuses on two entity types – people and
locations.

Using DBpedia as a KB, previous work has

Proceedings of the Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)


shown that only ~23% of entities in a manually annotated gold standard subset of the collection
could be annotated with a corresponding entity URI (Munnelly and Lawless, 2018b). This
illustrates that an overwhelming number of entities in the collection cannot be identified
without using an alternative KB. Furthermore, the challenging nature of the language in the
collection (the documents are rife with spelling inconsistencies) means that EL systems struggle
to identify a correct referent even when the entity exists in the KB.

The challenges faced with regard to geographic features have been largely mitigated using either
GeoNames (GeoNames) or GeoHive (Debruyne et al., 2016) as KBs. Both of these linked data
resources have significantly better coverage of Irish geography than DBpedia. They also have the
added benefit of attempting to identify and link against their counterparts in the DBpedia
ontology where possible. This means that a collection which is annotated with GeoNames or GeoHive
entities can communicate, at least in part, with a collection that is annotated with DBpedia.

Identifying a suitable KB to represent people in the collection is more problematic. It is an
unfortunate fact that most individuals do not matter enough to be documented in any commonly
available KB.

Two resources used by historians in this domain are the Oxford Dictionary of National Biography
(ODNB)¹ and the Dictionary of Irish Biography (DIB)². Both are collections of biographies written
by historians about notable Irish and English historical figures. The subject of each article is
usually a single entity which corresponds to a person. Titles contain the subject's forename,
surname and variant names, and links between related biographies exist in the text of each
article. Hence they exhibit structural properties similar to those that originally made Wikipedia
a useful KB for EL. They are of greater specificity to the history of the British Isles than
other more general resources and thus may help to fill some of the gaps in DBpedia, or at the
very least limit the scope of the linker's search to entities that are relevant to this
geographic region.

The goal of this work is to connect entries in ODNB and DIB with their corresponding entries in
DBpedia, such that a new KB built on these resources would be linked with their counterparts in a
larger, more established semantic resource where such counterparts exist. This also helps to
identify entities in ODNB and DIB which are not yet documented in DBpedia, showing where an EL
system that is informed by a KB based on ODNB and DIB may be better equipped for linking in Irish
historical archives.

3   Method

In order to facilitate the integration of a KB derived from ODNB or DIB with DBpedia, an approach
for linking biographies to their DBpedia counterparts was developed. First, all DBpedia entities
belonging to the class dbo:Person are indexed using Solr³. The name of each entity, the full text
of the Wikipedia article from which they are derived, and anchor text on incoming links to the
article were indexed.

Anchor text indicates alternative surface forms which may refer to an entity. For example, the
DIB biography for the 7th Earl of Mayo uses his full name and excludes his title, “Dermot Robert
Wyndham-Bourke”, while his name in DBpedia is given as “Dermot Bourke 7th Earl of Mayo”. Indexing
anchor text can help to loosely capture the equivalence of these two references, assuming that
Wikipedia uses the anchor text “Dermot Robert Wyndham-Bourke” to link to the Earl of Mayo's
Wikipedia article from some other resource. However, it can also introduce some unwanted noise.
For example, the anchor text for “Mountrath” has been found to point to the entity “Sir Charles
Coote”. Using anchor text as a source of surface forms can thus be something of a double-edged
sword and it is worth investigating whether or not the effects of indexing this information are
ultimately beneficial for a specific use case.

For each biography entry in ODNB and DIB b ∈ B, the title b_title is executed as a query against
Solr. Matches on the title field and anchor text are boosted over matches in the article's
content. A list of up to ten top-ranked candidates P_b is returned. The best matching DBpedia
referent p_b^* ∈ P_b for a given biography is the one that maximises the expression:

    p_b^* = \arg\max_{p \in P_b} \Psi(b, p)                                        (1)

where Ψ(b, p) is computed as a linear combination of content similarity and name similarity.
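
As a rough illustration of Equation (1), the selection step can be sketched in Python as below.
This is a minimal sketch under stated assumptions, not the authors' implementation: the
`Candidate` record, the `best_candidate` helper and the `jaccard` placeholder similarity are all
hypothetical names introduced here, and the only property taken from the text is that Ψ is a
weighted sum of a name similarity and a content similarity, with weights assumed to sum to 1.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one DBpedia candidate returned by retrieval."""
    uri: str
    name: str
    article: str


Similarity = Callable[[str, str], float]


def jaccard(a: str, b: str) -> float:
    """Placeholder token-overlap similarity (NOT the similarity used in the paper)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def psi(name_score: float, content_score: float, alpha: float, beta: float) -> float:
    """Linear combination of name similarity and content similarity."""
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta are assumed to sum to 1")
    return alpha * name_score + beta * content_score


def best_candidate(
    title: str,
    content: str,
    candidates: Sequence[Candidate],
    name_sim: Similarity,
    content_sim: Similarity,
    alpha: float,
    beta: float,
) -> Tuple[Optional[Candidate], Optional[float]]:
    """Return the candidate maximising psi and its score, or (None, None) if there are none."""
    best, best_score = None, None
    for cand in candidates:
        score = psi(name_sim(title, cand.name), content_sim(content, cand.article), alpha, beta)
        if best_score is None or score > best_score:
            best, best_score = cand, score
    return best, best_score
```

Passing the similarity functions in as arguments keeps the selection logic independent of how
name and content similarity are actually computed, so the placeholder can be swapped for any
other pair of scoring functions.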
1. http://www.oxforddnb.com/
2. http://dib.cambridge.org/
3. http://lucene.apache.org/solr/
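
The candidate retrieval step described in Section 3 (a title query with matches on title and
anchor text boosted over article content, returning up to ten candidates) could be issued to
Solr roughly as follows. The paper does not say which query parser was used; this sketch assumes
Solr's eDisMax parser and its `qf` field boosts. The core name `dbpedia_people`, the field names
`title`, `anchor_text` and `content`, and the boost values are all hypothetical.

```python
from urllib.parse import urlencode


def build_candidate_query(
    title: str,
    base_url: str = "http://localhost:8983/solr/dbpedia_people/select",  # hypothetical core
    rows: int = 10,
) -> str:
    """Build a Solr select URL that boosts title and anchor-text matches over content."""
    params = {
        "defType": "edismax",                      # assumed query parser
        "q": title,                                # biography title used as the query
        "qf": "title^5 anchor_text^3 content",     # hypothetical fields and boosts
        "rows": rows,                              # up to ten top-ranked candidates
        "fl": "id,title,score",
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_candidate_query("Dermot Robert Wyndham-Bourke")
```

Only the URL is constructed here; sending the request and parsing the JSON response are left
out so that the sketch does not depend on a running Solr instance.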




For a given candidate p ∈ P_b, content similarity Ω between the biography b_content and the
candidate's Wikipedia article p_article is computed using negative Word Mover's Distance (WMD)
(Kusner et al., 2015) as implemented in gensim (Řehůřek and Sojka, 2010). This method establishes
a vector representation of documents using word embeddings and then computes the distance between
points in the two representations. Essentially, the dissimilarity of two documents is measured by
examining how far the vector representations of words in one document must travel through space
before the document will semantically match its counterpart. This is obviously a very
computationally expensive operation. Similarity is found by subtracting the normalised distance
from 1. Word embeddings are computed using a Word2Vec model (Mikolov et al., 2013) trained on a
Wikipedia dump excluding redirects, disambiguation pages etc.

The name similarity function Φ is based on the Monge-Elkan Method (Monge and Elkan, 1996). The
biography title b_title and name of a candidate p_name are lower-cased and tokenized. Stop words
are removed yielding two sets of tokens T_b and T_p. The sets are added to a bipartite graph with
edge weights computed using Jaro-Winkler similarity (Winkler, 1990). An optimal mapping
T_b ↦ T_p is found using Edmonds' blossom algorithm (Edmonds, 1965) giving W, the set of weighted
edges which comprise the mapping. Name similarity is the generalised mean of the edge weights in
W as described by Jimenez et al. (Jimenez et al., 2009) where m = 2 in this experiment:

    \Phi(b, p) = \left( \frac{1}{|W|} \sum_{w \in W} w^{m} \right)^{\frac{1}{m}}    (2)

This yields the final formulation of Ψ as a function of the form:

    \Psi(b, p) = \alpha \, \Phi(b_{title}, p_{name}) + \beta \, \Omega(b_{content}, p_{article})    (3)

where α and β are tuning parameters chosen such that α + β = 1.

A hard threshold τ is applied to p_b^*, enforcing a minimum similarity between a biography and
its final chosen referent p_b^*:

    p_b^* = \begin{cases} p_b^*, & \text{if } \Psi(b, p_b^*) > \tau \\ \text{NIL}, & \text{otherwise} \end{cases}    (4)

NIL indicates that a biography does not have a DBpedia counterpart.

4   Evaluation

The approach described is essentially an EL solution. The service receives as input a surface
form and some context which may help to identify the subject of the reference. Solr performs the
candidate retrieval process, identifying a subset of candidates to which the surface form might
be referring. The linking method then proceeds to identify the most likely referent from the pool
of candidates. This means that it is possible to evaluate the performance of the method using EL
benchmarking tools. For the initial investigation, the BAT Framework (Cornolti et al., 2013) was
used to assess performance⁴. The choice to use BAT instead of the more commonly employed GERBIL
(Usbeck et al., 2015) at this point in the evaluation was for scrutability of the results.

Two ground truth, gold standard subsets were derived from a random sample of 200 biographies
obtained from both DIB and ODNB (400 samples in total). A human annotator manually linked each
sample with a corresponding DBpedia URI if an equivalent entity could be identified in the
DBpedia ontology. Where no URI could be established, a NIL label was applied.

Ultimately 64 of the ODNB samples and 72 of the DIB samples were labelled as NIL. This would
suggest that approximately 36% of entities in DIB and 32% of entities in ODNB are not documented
in DBpedia. This is somewhat disappointing as it suggests that the number of entities gained from
using ODNB and DIB as source KBs is not as high as may be desirable. However, one must still
remember that this KB has the effect of limiting the scope of the EL system's search to a
geographic region, which is undoubtedly beneficial.

For the purposes of the evaluation the values of α and β were fixed at α = 0.1 and β = 0.9. This
choice of weighting was due to the fact that a comparison with the name has already been
partially performed by the candidate retrieval process. The strongest feature for identifying a
referent is thus a comparison of the description of the entities as provided in the biography
content and the text of the Wikipedia article. Even so, it was found that lending some small
weight to the similarity between

4. https://github.com/marcocor/bat-framework




Figure 1: Change in performance on DIB with values of τ (plot of F1 against Threshold (τ), both
axes from 0 to 1). Note that optimal threshold is seen when τ = 0.55.

Figure 2: Change in performance on ODNB with values of τ (plot of F1 against Threshold (τ), both
axes from 0 to 1). Note that optimal threshold is seen when τ = 0.55.


surface forms yielded a slight increase in the F1 score for the linking method.

The method was tested by evaluating the quality of the links established by the method for
varying values of τ. A threshold similarity of τ = 0.55 was found to give the best results. This
threshold yields the best trade-off between the method annotating a biography with a DBpedia URI
or a NIL label. However, as can be seen in Figures 1 and 2, the method is highly sensitive to the
value of τ, with a slight variation resulting in a dramatic drop-off in performance.

Arguably, given the need for accuracy when constructing KBs for academic study, a sub-optimal
threshold τ > 0.55 may be desirable. This will result in fewer overall links to DBpedia, but
makes the algorithm more conservative, reducing the number of false positives.

During the initial evaluation subject to the conditions above, this approach achieved an F1 score
of 81.5% on DIB, but only 67.5% on ODNB. Some of the imprecision stems from Solr as 43.1% of
incorrect labels on ODNB and 45.9% of incorrect labels on DIB can be ascribed to the correct
referent not being among the results returned by the search engine. However the remaining
disparity in performance was somewhat alarming and subject to investigation.

It was found that the problem arose from multiple articles in ODNB which do not contain text.
They are simply pictorial renderings of their subject. Consequently, the WMD algorithm had no
content by which to compare the biography to Wikipedia articles. A follow-up investigation
generated a new gold standard subset for ODNB with a minimum threshold of 50 words on the content
of the biography for inclusion. The performance of the method improved dramatically on this
collection, but still lagged slightly behind that of DIB, with an F1 score of 77.5%. The
remaining disparity was ascribed to two challenging article types in ODNB:

  1. ODNB contains disambiguation pages which list individuals who have the same surname.
     Identifying these pages programmatically is challenging and so it is difficult to filter
     them.

  2. Some articles discuss more than one person, where multiple entities' stories are
     inextricably linked, e.g., the famous serial killers Burke and Hare. Note that this is also
     a problem with DIB.

     Collection            τ      F1 (%)
     DIB                   0.55   81.5
     ODNB                  0.55   67.5
     ODNB (filtered)       0.55   77.5

     Table 1: Summary of results

4.1    Further analysis

In an attempt to evaluate the relative performance of this linking method with respect to other
state of the art EL systems, a comparative analysis was conducted. For this evaluation, the
GERBIL benchmarking platform was used (Usbeck et al., 2015).
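
The threshold behaviour evaluated in Section 4 (the NIL rule of Equation (4) and the sweep over τ
summarised in Figures 1 and 2) can be sketched as below. This is an illustrative sketch only:
the scoring convention (precision over non-NIL predictions, recall over non-NIL gold labels) is
an assumption and may differ from what the BAT Framework computes, and the function names, URIs
and scores in the toy data are invented for the example.

```python
from typing import List, Optional, Sequence, Tuple


def apply_threshold(uri: str, score: float, tau: float) -> Optional[str]:
    """Keep the best candidate only if its score exceeds tau; None stands for NIL."""
    return uri if score > tau else None


def f1_at(tau: float, scored: Sequence[Tuple[str, float]], gold: Sequence[Optional[str]]) -> float:
    """F1 of thresholded predictions against gold labels (None means NIL)."""
    tp = fp = fn = 0
    for (uri, score), truth in zip(scored, gold):
        pred = apply_threshold(uri, score, tau)
        if pred is not None and pred == truth:
            tp += 1
            continue
        if pred is not None:
            fp += 1  # linked, but wrong or should have been NIL
        if truth is not None:
            fn += 1  # a gold link that was missed or mislinked
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def sweep(scored: Sequence[Tuple[str, float]], gold: Sequence[Optional[str]]) -> Tuple[float, float]:
    """Try tau = 0.00, 0.05, ..., 1.00 and return the first (tau, F1) pair with the highest F1."""
    taus: List[float] = [i / 20 for i in range(21)]
    return max(((t, f1_at(t, scored, gold)) for t in taus), key=lambda pair: pair[1])


# Toy data (invented): best candidate and its score per biography, with gold labels.
toy_scored = [("dbr:A", 0.9), ("dbr:B", 0.7), ("dbr:C", 0.5), ("dbr:D", 0.3)]
toy_gold = ["dbr:A", "dbr:B", None, None]
best_tau, best_f1 = sweep(toy_scored, toy_gold)
```

Raising τ above the best value trades recall for precision, which is the conservative setting
the text suggests may suit KB construction for academic study.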



 Annotator            Micro F1   Micro Precision   Micro Recall   Macro F1   Macro Precision   Macro Recall
 Babelfy              0.5333     0.7304            0.4200         0.4200     0.4200            0.4200
 DBpedia Spotlight    0.0099     0.5000            0.0050         0.0050     0.0050            0.0050
 FOX                  0.5112     0.5833            0.4550         0.4550     0.4550            0.4550
 KEA                  0.3437     0.3935            0.3050         0.3050     0.3050            0.3050
 Munnelly             0.8221     0.8241            0.8200         0.8200     0.8200            0.8200
 PBOH                 0.0000     0.0000            0.0000         0.0000     0.0000            0.0000

Table 2: GERBIL results for DIB. Munnelly (the method presented in this paper) clearly outperforms all
other available services in the evaluation.

 Annotator            Micro F1   Micro Precision   Micro Recall   Macro F1   Macro Precision   Macro Recall
 Babelfy              0.6222     0.8522            0.4900         0.4900     0.4900            0.4900
 DBpedia Spotlight    0.0000     0.0000            0.0000         0.0000     0.0000            0.0000
 FOX                  0.4921     0.6667            0.3900         0.3900     0.3900            0.3900
 KEA                  0.5133     0.6259            0.4350         0.4350     0.4350            0.4350
 Munnelly             0.7700     0.7700            0.7700         0.7700     0.7700            0.7700
 PBOH                 0.0000     0.0000            0.0000         0.0000     0.0000            0.0000

Table 3: GERBIL results for ODNB. Munnelly (the method presented in this paper) clearly outperforms
all other available services in the evaluation.
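
Tables 2 and 3 report both micro- and macro-averaged scores. The sketch below shows how the two
averages can diverge when every document holds exactly one gold entity but a system sometimes
abstains. It rests on assumptions: a per-document precision or recall with a zero denominator is
scored as 0 (GERBIL's exact conventions are not described in this paper), and the per-document
counts (84 correct links, 31 wrong links, 85 abstentions over 200 documents) are an inference
chosen to be arithmetically consistent with the Babelfy row of Table 2, not figures reported by
the authors.

```python
from typing import Dict, Sequence, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1, with zero denominators scored as 0 (assumed convention)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(per_doc: Sequence[Counts]) -> Dict[str, Tuple[float, float, float]]:
    """Micro pools counts over the collection; macro averages per-document scores."""
    micro = prf(sum(c[0] for c in per_doc), sum(c[1] for c in per_doc), sum(c[2] for c in per_doc))
    doc_scores = [prf(*c) for c in per_doc]
    n = len(doc_scores)
    macro = (
        sum(s[0] for s in doc_scores) / n,
        sum(s[1] for s in doc_scores) / n,
        sum(s[2] for s in doc_scores) / n,
    )
    return {"micro": micro, "macro": macro}


# One gold entity per document: a correct link, a wrong link, or no answer at all.
docs = [(1, 0, 0)] * 84 + [(0, 1, 1)] * 31 + [(0, 0, 1)] * 85
scores = micro_macro(docs)
```

Under these assumptions abstentions lower macro precision (each unanswered document scores 0)
but leave micro precision untouched, which is why a system can show a micro precision well above
its macro precision while micro and macro recall coincide.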

GERBIL is built on the BAT framework and so the results it produces were expected to be somewhat
comparable with those presented in the previous section. However, in the interests of
thoroughness a simple web interface to the biography linking method was set up so that GERBIL
could directly benchmark the method.

The two gold standard collections – DIB and ODNB (filtered) – were converted into NIF documents
(Hellmann et al., 2012). The surface form which named the subject of the biography was injected
at the beginning of the article with some modification:

1. The names were originally in surname, forename order. This was reversed.

2. Where multiple formulations of a name were present between parentheses, e.g., different
   languages or nicknames, these alternate formulations were collapsed and removed from the
   surface form.

This was intended to yield a surface form which was more easily recognisable by EL systems. The
generated surface form was marked as the only entity in the document. This clearly gives an
advantage to EL systems which perform analysis on the context of a mention rather than
relationships between entities. However, given the nature of the problem being tackled this is an
appropriate bias.

GERBIL was configured to perform a D2KB evaluation, that is, the EL systems were provided with
the surface form and the context of the mention. Their sole task was to identify a referent for
the surface form.

At the time of the experiment, only 5 of the 17 EL services that are registered with GERBIL were
available. These were Babelfy, DBpedia Spotlight, FOX, KEA, and PBOH (Moro et al., 2014; Mendes
et al., 2011; Waitelonis and Sack, 2016; Speck and Ngomo, 2014; Ganea et al., 2016). The
experiment was configured to run with these five services.

It should be noted that FOX is essentially AGDISTIS (Usbeck et al., 2014) with an entity
recognition layer before the disambiguation phase. Given that this is a D2KB task, FOX can
arguably be considered an evaluation of AGDISTIS with some caveats. Namely, FOX maintains its own
deployment of AGDISTIS which is not necessarily in line with the most recent version, and the
entity recognition stage in FOX is mandatory, meaning that even in the D2KB task it will attempt
to spot entities. GERBIL compensates for this when computing the results of the evaluation, but
it may still



Figure 3: Change in performance of DBpedia Spotlight on DIB with values of τ (plot of F1 against
Threshold (τ), both axes from 0 to 1). Note that the performance is much more consistent than
earlier plots.

Figure 4: Change in performance of DBpedia Spotlight on ODNB with values of τ (plot of F1 against
Threshold (τ), both axes from 0 to 1). Note that the performance is much more consistent than
earlier plots.

lead to some skew in the final figures. Even so, the inclusion of FOX is helpful considering that
the default AGDISTIS service was not online.

Under these conditions, GERBIL evaluated the EL systems. For brevity, only the precision, recall
and F1 measures for the overall linking task are reported in Tables 2 and 3. The full results of
the evaluation on both DIB and ODNB are available online⁵ ⁶.

Both Macro and Micro F1 measures are reported. Macro and Micro take slightly different views of
the collection. Macro treats each input document as an individual disambiguation problem,
computing precision and recall for each document and then averaging the results across the whole
collection. Micro treats the entire collection as one large disambiguation problem and computes
precision and recall for all annotations in the gold standard. Given these definitions, we can
expect that the results for Micro and Macro precision, recall and F1 measure will be roughly (if
not exactly) equal for this specific evaluation, given that each document contains only one
entity. However, as previously mentioned, some services will attempt to perform Named Entity
Recognition even when the specified task is D2KB. This can result in some disparity between the
results of Micro and Macro evaluation.

The figures presented seem to confirm that the method described by this paper performs
significantly better on this linking task than the other systems evaluated. However, it is
difficult to understand precisely why this is the case since GERBIL does not provide access to
the internal machinations of the evaluation. In particular, the performance of DBpedia Spotlight
is unexpectedly low. Without knowing how GERBIL is calling this API endpoint, the reason for this
seemingly complete failure cannot be determined.

Spotlight's performance is particularly suspicious as this is a task where it should perform
reasonably well. It uses a language model to compare the context of a mention with descriptions
of known entities, meaning it relies on contextual features to identify a referent; an approach
which this experiment favours.

A specific evaluation using Spotlight's disambiguation API endpoint⁷ was performed with a custom
script, using the content of each biography individually and the injected surface form as
previously described. The responses from the server were dumped to a series of CSV files. As with
the evaluation described in Section 4, the value of the confidence threshold for annotation was
varied. Under these conditions, Spotlight performed considerably better, with F1 scores reported
by BAT between 0.25 and 0.465 for ODNB and scores between 0.52 and 0.555 for DIB, depending on
the value of the confidence threshold, which ranged from 0 to 1. A summary of these results can be

5. http://gerbil.aksw.org/gerbil/experiment?id=201810190004
6. http://gerbil.aksw.org/gerbil/experiment?id=201810190005
7. http://model.dbpedia-spotlight.org/en/disambiguate


Proceedings of the Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)


seen in Figures 3 and 4. Notably, these results are much more stable than the values shown
in Figures 1 and 2.
   This direct examination suggests Spotlight performs much better at this annotation task
than the results of the GERBIL experiment indicate. This should not be considered an attempt
to undermine GERBIL, which is an important effort to provide a consistent benchmark for the
tumultuous challenge of evaluating EL systems. But it does strongly highlight the need for
low-level scrutability, reporting and configuration of the APIs being evaluated, a feature
that GERBIL ostensibly lacks.
   It is, nevertheless, reassuring to see that the scores for this paper’s method (designated
“Munnelly” in the tables of results) conform to those values obtained by the earlier BAT
investigation.

5   Conclusion

Given the task at hand, the method of linking presented in this paper seems to identify
referents in DBpedia with a reassuring level of accuracy. Indeed, the method is not
restricted to this simple use case, as it is for all intents and purposes a fully
implemented EL system. Given a set of surface forms and a context it should provide a set of
suitable referents for the inputs.
   However, this method falls into a common EL trap, which is the trade-off between
performance and time. The more accurate an EL method is, the more computationally expensive
it is expected to become. This is especially true of this approach, which requires as much
as a minute to identify a referent for a single entity.
   While this approach was initially conceived as an ad-hoc solution to a specific problem,
its performance in the evaluation is encouraging and future work may seek to further
investigate the construction of an EL service based on this approach, provided the issue
with time and computational complexity can be resolved. The current implementation is known
to perform several wasteful operations, the results of which could be cached or even
pre-computed and indexed to improve performance.
   At the time of writing, the annotation task for linking ODNB, DIB and DBpedia has been
completed and included in a custom KB (Munnelly and Lawless, 2018a). Ongoing work is
investigating the usefulness of these links for improving the quality of EL on Irish CH
datasets.

Acknowledgments

The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres
Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

References

Agirre, Eneko, Ander Barrena, Oier Lopez de Lacalle, Aitor Soroa, Samuel Fernando, and Mark
Stevenson (2012). Matching cultural heritage items to Wikipedia. In Nicoletta Calzolari
(Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard,
Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, eds., Proceedings of the
Eighth International Conference on Language Resources and Evaluation (LREC’12). European
Language Resources Association (ELRA).

Brando, Carmen, Francesca Frontini, and Jean-Gabriel Ganascia (2016). REDEN: Named Entity
Linking in Digital Literary Editions Using Linked Data Sets. Complex Systems Informatics and
Modeling Quarterly, 7:60–80.

Bunescu, Razvan C. and Marius Pasca (2006). Using Encyclopedic Knowledge for Named Entity
Disambiguation. In European Chapter of the Association for Computational Linguistics, vol. 6,
pages 9–16.

Cornolti, Marco, Paolo Ferragina, and Massimiliano Ciaramita (2013). A Framework for
Benchmarking Entity-Annotation Systems. In Proceedings of the 22nd International Conference
on World Wide Web, pages 249–260. New York, NY, USA: ACM.

Debruyne, Christophe, Éamonn Clinton, Lorraine McNerney, Atul Nautiyal, and Declan O’Sullivan
(2016). Serving Ireland’s Geospatial Information as Linked Data. In International Semantic
Web Conference (Posters & Demos).

Edmonds, Jack (1965). Paths, Trees, and Flowers. Canadian Journal of Mathematics,
17(3):449–467.

Ganea, Octavian-Eugen, Marina Ganea, Aurelien Lucchi, Carsten Eickhoff, and Thomas Hofmann
(2016). Probabilistic Bag-of-Hyperlinks Model for Entity Linking. In Proceedings of the 25th
International Conference on World Wide Web, pages 927–938. International World Wide Web
Conferences Steering Committee.

GeoNames (2018). Geonames. URL http://geonames.org/ (accessed 2018-10-19).

Hellmann, Sebastian, Jens Lehmann, and Sören Auer (2012). NIF: An Ontology-Based and
Linked-Data-Aware NLP Interchange Format. Working Draft, page 252.

Jimenez, Sergio, Claudia Becerra, Alexander Gelbukh, and Fabio Gonzalez (2009). Generalized
Mongue-Elkan Method for Approximate Text String Comparison. In International Conference on
Intelligent Text Processing and Computational Linguistics, pages 559–570. Springer.

Kusner, Matt, Yu Sun, Nicholas Kolkin, and Kilian Weinberger (2015). From Word Embeddings to
Document Distances. In International Conference on Machine Learning, pages 957–966.

Mendes, Pablo N., Max Jakob, Andrés García-Silva, and Christian Bizer (2011). DBpedia
Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International
Conference on Semantic Systems, pages 1–8. New York, NY, USA: ACM.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean (2013). Distributed
Representations of Words and Phrases and their Compositionality. In Advances in neural
information processing systems, pages 3111–3119.

Milne, David and Ian H. Witten (2008). Learning to Link with Wikipedia. In Proceedings of the
17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 509–518. New
York, NY, USA: ACM.

Monge, Alvaro and Charles Elkan (1996). The Field Matching Problem: Algorithms and
Applications. In Proceedings of the Second International Conference on Knowledge Discovery
and Data Mining, pages 267–270.

Moro, Andrea, Francesco Cecconi, and Roberto Navigli (2014). Multilingual Word Sense
Disambiguation and Entity Linking for Everybody. In International Semantic Web Conference
(Posters & Demos), pages 25–28.

Munnelly, Gary and Séamus Lawless (2018a). Constructing a Knowledge Base for Entity Linking
on Irish Cultural Heritage Collections. Procedia Computer Science, 137:199–210. Proceedings
of the 14th International Conference on Semantic Systems, 10th–13th of September 2018,
Vienna, Austria.

Munnelly, Gary and Séamus Lawless (2018b). Investigating Entity Linking in Early English
Legal Documents. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries,
pages 59–68. New York, NY, USA: ACM.

Ratinov, Lev, Dan Roth, Doug Downey, and Mike Anderson (2011). Local and Global Algorithms
for Disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies, vol. 1, pages 1375–1384.
Association for Computational Linguistics.

Řehůřek, Radim and Petr Sojka (2010). Software Framework for Topic Modelling with Large
Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages
45–50. Valletta, Malta: ELRA. http://is.muni.cz/publication/884893/en.

Speck, René and Axel-Cyrille Ngonga Ngomo (2014). Named entity recognition using FOX. In
Proceedings of the ISWC 2014 Posters & Demonstrations Track, pages 85–88. CEUR-WS.org. URL
http://ceur-ws.org/Vol-1272/paper_70.pdf.

Usbeck, Ricardo, Axel-Cyrille Ngonga Ngomo, Michael Röder, Daniel Gerber, Sandro Athaide
Coelho, Sören Auer, and Andreas Both (2014). AGDISTIS – Graph-Based Disambiguation of Named
Entities Using Linked Data. In International Semantic Web Conference, pages 457–471.
Springer.

Usbeck, Ricardo, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin
Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, et al. (2015).
GERBIL: General Entity Annotator Benchmarking Framework. In Proceedings of the 24th
International Conference on World Wide Web, pages 1133–1143. New York, NY, USA: ACM.

Van Hooland, Seth, Max De Wilde, Ruben Verborgh, Thomas Steiner, and Rik Van de Walle (2015).
Exploring entity recognition and disambiguation for cultural heritage collections. Digital
Scholarship in the Humanities, 30(2):262–279.

Waitelonis, Jörg and Harald Sack (2016). Named Entity Linking in #Tweets with KEA. In
#Microposts – 6th Workshop on Making Sense of Microposts, pages 61–63. CEUR-WS.org. URL
http://ceur-ws.org/Vol-1691/paper_14.pdf.

Winkler, William (1990). String Comparator Metrics and Enhanced Decision Rules in the
Fellegi-Sunter Model of Record Linkage. In Proceedings of the Section on Survey Research
Methods, pages 354–359.

Yosef, Mohamed Amir, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, and Gerhard Weikum
(2011). Aida: An Online Tool for Accurate Disambiguation of Named Entities in Text and
Tables. Proceedings of the VLDB Endowment, 4(12):1450–1453.
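
Supplementary sketch. The Macro and Micro measures reported in the evaluation can be
illustrated with a few lines of Python. This is an illustration only and is not the scoring
code used by GERBIL or BAT: the names Doc, prf, micro and macro are hypothetical, the
documents are made-up toy data, and averaging the per-document F1 for the Macro figure is an
assumption about how that figure is aggregated.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    gold: set        # gold-standard (mention, entity) pairs for one document
    predicted: set   # pairs returned by a (hypothetical) linker


def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from raw counts; empty denominators give 0."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro(docs):
    """Pool every annotation in the collection, then score once."""
    tp = sum(len(d.gold & d.predicted) for d in docs)
    n_pred = sum(len(d.predicted) for d in docs)
    n_gold = sum(len(d.gold) for d in docs)
    return prf(tp, n_pred, n_gold)


def macro(docs):
    """Score each document separately, then average the per-document scores."""
    scores = [prf(len(d.gold & d.predicted), len(d.predicted), len(d.gold))
              for d in docs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Toy collection: four documents, one gold entity each (made-up identifiers).
# The linker returns exactly one annotation per document; three are correct.
clean = [
    Doc({("m1", "A")}, {("m1", "A")}),
    Doc({("m2", "B")}, {("m2", "B")}),
    Doc({("m3", "C")}, {("m3", "C")}),
    Doc({("m4", "D")}, {("m4", "X")}),
]

# Same collection, but for the first document the service also performs its
# own entity recognition and returns three spurious extra annotations.
noisy = [
    Doc({("m1", "A")}, {("m1", "A"), ("s1", "P"), ("s2", "Q"), ("s3", "R")}),
    *clean[1:],
]

print("clean micro:", micro(clean), "macro:", macro(clean))
print("noisy micro:", micro(noisy), "macro:", macro(noisy))
```

With exactly one annotation per document the two views coincide (precision, recall and F1
are all 0.75 in the toy data). Once a single document carries spurious annotations they
diverge: Micro precision falls to 3/7 while Macro precision is 0.5625, which is the kind of
disparity expected when a service performs Named Entity Recognition on a D2KB task.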



</pre>