=Paper=
{{Paper
|id=Vol-2314/paper6
|storemode=property
|title=Linking historical sources to established knowledge bases in order to inform entity linkers in cultural heritage
|pdfUrl=https://ceur-ws.org/Vol-2314/paper6.pdf
|volume=Vol-2314
|authors=Gary Munnelly,Annalina Caputo,Séamus Lawless
|dblpUrl=https://dblp.org/rec/conf/comhum/MunnellyCL18
}}
==Linking historical sources to established knowledge bases in order to inform entity linkers in cultural heritage==
Gary Munnelly, Annalina Caputo, Séamus Lawless

ADAPT Centre, Trinity College Dublin, Ireland

{gary.munnelly, annalina.caputo, seamus.lawless}@adaptcentre.ie

Proceedings of the Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)

===Abstract===
A problem for researchers applying Entity Linking techniques to niche Cultural Heritage collections is the availability of Knowledge Bases with adequate coverage for their domain. While it is possible to generate a specialised Knowledge Base from available resources, this can result in a collection which is semantically annotated, but remains separate from other collections due to the use of a unique vocabulary. This paper presents a linking scheme for mapping a newly created Knowledge Base of significant Irish people to DBpedia for the purposes of enriching the new Knowledge Base, facilitating integration with other collections and enabling multi-Knowledge Base Entity Linking, which has been the subject of some research. The method is described and evaluated, showing that it achieves a high level of performance on a new Knowledge Base constructed from the Oxford Dictionary of National Biography and the Dictionary of Irish Biography.

===1 Introduction===
While Entity Linking (EL) has seen much development over the years (Bunescu and Pasca, 2006; Milne and Witten, 2008; Ratinov et al., 2011; Yosef et al., 2011; Usbeck et al., 2014; Waitelonis and Sack, 2016; Brando et al., 2016), it is hindered by several limitations when applied to Cultural Heritage (CH) collections. Most notable is a significant under-representation of entities in common Knowledge Bases (KB) such as DBpedia (Agirre et al., 2012; Van Hooland et al., 2015; Munnelly and Lawless, 2018b). Consequently, EL systems which are informed by such KBs are ill equipped for annotating niche CH collections such as those studied by historians.

A possible solution is to construct tailored KBs from resources used by scholars investigating CH material. Such KBs would presumably be more appropriate for annotating the kinds of specialised collections in question. However, taking this approach can hobble one of the greatest benefits of annotating a collection with semantic resources, namely the ability to integrate with other collections which are annotated using the same vocabulary.

This paper discusses a linking method which was developed while constructing a specialised KB for notable Irish historical figures. The approach is intended to identify corresponding entities in DBpedia for each entity in the new KB where such equivalents exist. This facilitates communication between collections annotated with the specialised KB and others annotated with more general KBs. Moreover, EL with respect to multiple ontologies is made possible, provided the EL service in question supports such an operation. Research by Brando et al. (2016) has shown that it is beneficial to EL in CH when a specialised KB can be integrated with a more general one, and an EL service can perform linking across both resources in unison.

In Section 2 this paper discusses related work and the surrounding context which motivated the development of this method. Section 3 describes the method in question. An evaluation is carried out in Section 4 and Section 5 provides concluding remarks.

===2 Related work===
The overarching research related to this paper investigates methods of performing EL on primary source Irish historical archives. This research focuses on two entity types: people and locations. Using DBpedia as a KB, previous work has shown that only ∼23% of entities in a manually annotated gold standard subset of the collection could be annotated with a corresponding entity URI (Munnelly and Lawless, 2018b). This illustrates that an overwhelming number of entities in the collection cannot be identified without using an alternative KB. Furthermore, the challenging nature of the language in the collection (the documents are rife with spelling inconsistencies) means that EL systems struggle to identify a correct referent even when the entity exists in the KB.

The challenges faced with regard to geographic features have been largely mitigated using either GeoNames (GeoNames, 2018) or GeoHive (Debruyne et al., 2016) as KBs. Both of these linked data resources have significantly better coverage of Irish geography than DBpedia. They also have the added benefit of attempting to identify and link against their counterparts in the DBpedia ontology where possible. This means that a collection which is annotated with GeoNames or GeoHive entities can communicate, at least in part, with a collection that is annotated with DBpedia.

Identifying a suitable KB to represent people in the collection is more problematic. It is an unfortunate fact that most individuals do not matter enough to be documented in any commonly available KB. Two resources used by historians in this domain are the Oxford Dictionary of National Biography (ODNB; http://www.oxforddnb.com/) and the Dictionary of Irish Biography (DIB; http://dib.cambridge.org/). Both are collections of biographies written by historians about notable Irish and English historical figures. The subject of each article is usually a single entity which corresponds to a person. Titles contain the subject's forename, surname and variant names, and links between related biographies exist in the text of each article. Hence they exhibit structural properties similar to those that originally made Wikipedia a useful KB for EL. They are of greater specificity to the history of the British Isles than other more general resources and thus may help to fill some of the gaps in DBpedia, or at the very least limit the scope of the linker's search to entities that are relevant to this geographic region.

The goal of this work is to connect entries in ODNB and DIB with their corresponding entries in DBpedia, such that a new KB built on these resources would be linked with their counterparts in a larger, more established semantic resource where such counterparts exist. This also helps to identify entities in ODNB and DIB which are not yet documented in DBpedia, showing where an EL system that is informed by a KB based on ODNB and DIB may be better equipped for linking in Irish historical archives.

===3 Method===
In order to facilitate the integration of a KB derived from ODNB or DIB with DBpedia, an approach for linking biographies to their DBpedia counterparts was developed. First, all DBpedia entities belonging to the class dbo:Person are indexed using Solr (http://lucene.apache.org/solr/). The name of each entity, the full text of the Wikipedia article from which it is derived, and the anchor text on incoming links to the article are indexed.

Anchor text indicates alternative surface forms which may refer to an entity. For example, the DIB biography for the 7th Earl of Mayo uses his full name and excludes his title, "Dermot Robert Wyndham-Bourke", while his name in DBpedia is given as "Dermot Bourke 7th Earl of Mayo". Indexing anchor text can help to loosely capture the equivalence of these two references, assuming that Wikipedia uses the anchor text "Dermot Robert Wyndham-Bourke" to link to the Earl of Mayo's Wikipedia article from some other resource. However, it can also introduce some unwanted noise. For example, the anchor text "Mountrath" has been found to point to the entity "Sir Charles Coote". Using anchor text as a source of surface forms can thus be something of a double-edged sword and it is worth investigating whether or not the effects of indexing this information are ultimately beneficial for a specific use case.

For each biography entry <math>b \in B</math> in ODNB and DIB, the title <math>b_{title}</math> is executed as a query against Solr. Matches on the title field and anchor text are boosted over matches in the article's content. A list of up to ten top-ranked candidates <math>P_b</math> is returned. The best matching DBpedia referent <math>p^*_b \in P_b</math> for a given biography is the one that maximises the expression:

:<math>p^*_b = \underset{p \in P_b}{\operatorname{argmax}}\; \Psi(b, p) \qquad (1)</math>

where <math>\Psi(b, p)</math> is computed as a linear combination of content similarity and name similarity.

For a given candidate <math>p \in P_b</math>, content similarity <math>\Omega</math> between the biography <math>b_{content}</math> and the candidate's Wikipedia article <math>p_{article}</math> is computed using negative Word Mover's Distance (WMD) (Kusner et al., 2015) as implemented in gensim (Řehůřek and Sojka, 2010). This method establishes a vector representation of documents using word embeddings and then computes the distance between points in the two representations. Essentially, the dissimilarity of two documents is measured by examining how far the vector representations of words in one document must travel through space before the document will semantically match its counterpart. This is obviously a very computationally expensive operation. Similarity is found by subtracting the normalised distance from 1. Word embeddings are computed using a Word2Vec model (Mikolov et al., 2013) trained on a Wikipedia dump excluding redirects, disambiguation pages etc.

The name similarity function <math>\Phi</math> is based on the Monge-Elkan method (Monge and Elkan, 1996). The biography title <math>b_{title}</math> and the name of a candidate <math>p_{name}</math> are lower-cased and tokenized. Stop words are removed, yielding two sets of tokens <math>T_b</math> and <math>T_p</math>. The sets are added to a bipartite graph with edge weights computed using Jaro-Winkler similarity (Winkler, 1990). An optimal mapping <math>T_b \mapsto T_p</math> is found using Edmonds' blossom algorithm (Edmonds, 1965), giving <math>W</math>, the set of weighted edges which comprise the mapping. Name similarity is the generalised mean of the edge weights in <math>W</math> as described by Jimenez et al. (2009), where <math>m = 2</math> in this experiment:

:<math>\Phi(b, p) = \left( \frac{1}{|W|} \sum_{w \in W} w^m \right)^{\frac{1}{m}} \qquad (2)</math>

This yields the final formulation of <math>\Psi</math> as a function of the form:

:<math>\Psi(b, p) = \alpha\, \Phi(b_{title}, p_{name}) + \beta\, \Omega(b_{content}, p_{article}) \qquad (3)</math>

where <math>\alpha</math> and <math>\beta</math> are tuning parameters chosen such that <math>\alpha + \beta = 1</math>.

A hard threshold <math>\tau</math> is applied to <math>p^*_b</math>, enforcing a minimum similarity between a biography and its final chosen referent <math>p^*_b</math>:

:<math>p^*_b = \begin{cases} p^*_b, & \text{if } \Psi(b, p^*_b) > \tau \\ \text{NIL}, & \text{otherwise} \end{cases} \qquad (4)</math>

NIL indicates that a biography does not have a DBpedia counterpart.

===4 Evaluation===
The approach described is essentially an EL solution. The service receives as input a surface form and some context which may help to identify the subject of the reference. Solr performs the candidate retrieval process, identifying a subset of candidates to which the surface form might be referring. The linking method then proceeds to identify the most likely referent from the pool of candidates.

This means that it is possible to evaluate the performance of the method using EL benchmarking tools. For the initial investigation, the BAT Framework (Cornolti et al., 2013; https://github.com/marcocor/bat-framework) was used to assess performance. BAT was chosen over the more commonly employed GERBIL (Usbeck et al., 2015) at this point in the evaluation because its results are easier to scrutinise.

Two ground truth, gold standard subsets were derived from a random sample of 200 biographies obtained from each of DIB and ODNB (400 samples in total). A human annotator manually linked each sample with a corresponding DBpedia URI if an equivalent entity could be identified in the DBpedia ontology. Where no URI could be established, a NIL label was applied.

Ultimately 64 of the ODNB samples and 72 of the DIB samples were labelled as NIL. This would suggest that approximately 36% of entities in DIB and 32% of entities in ODNB are not documented in DBpedia. This is somewhat disappointing as it suggests that the number of entities gained from using ODNB and DIB as source KBs is not as high as may be desirable. However, one must still remember that this KB has the effect of limiting the scope of the EL system's search to a geographic region, which is undoubtedly beneficial.

For the purposes of the evaluation the values of α and β were fixed at α = 0.1 and β = 0.9. This choice of weighting was due to the fact that a comparison with the name has already been partially performed by the candidate retrieval process. The strongest feature for identifying a referent is thus a comparison of the description of the entities as provided in the biography content and the text of the Wikipedia article. Even so, it was found that lending some small weight to the similarity between surface forms yielded a slight increase in the F1 score for the linking method.

The method was tested by evaluating the quality of the links established by the method for varying values of τ. A threshold similarity of τ = 0.55 was found to give the best results. This threshold yields the best trade-off between the method annotating a biography with a DBpedia URI or a NIL label. However, as can be seen in Figures 1 and 2, the method is highly sensitive to the value of τ, with a slight variation resulting in a dramatic drop-off in performance.

Figure 1: Change in performance on DIB with values of τ (F1 plotted against threshold τ, both axes from 0 to 1). Note that the optimal threshold is seen when τ = 0.55.

Figure 2: Change in performance on ODNB with values of τ (F1 plotted against threshold τ, both axes from 0 to 1). Note that the optimal threshold is seen when τ = 0.55.

Arguably, given the need for accuracy when constructing KBs for academic study, a sub-optimal threshold τ > 0.55 may be desirable. This will result in fewer overall links to DBpedia, but makes the algorithm more conservative, reducing the number of false positives.

During the initial evaluation subject to the conditions above, this approach achieved an F1 score of 81.5% on DIB, but only 67.5% on ODNB. Some of the imprecision stems from Solr, as 43.1% of incorrect labels on ODNB and 45.9% of incorrect labels on DIB can be ascribed to the correct referent not being among the results returned by the search engine. However, the remaining disparity in performance was somewhat alarming and subject to investigation.

It was found that the problem arose from multiple articles in ODNB which do not contain text. They are simply pictorial renderings of their subject. Consequently, the WMD algorithm had no content by which to compare the biography to Wikipedia articles. A follow-up investigation generated a new gold standard subset for ODNB with a minimum threshold of 50 words on the content of the biography for inclusion. The performance of the method improved dramatically on this collection, but still lagged slightly behind that of DIB, with an F1 score of 77.5%. The remaining disparity was ascribed to two challenging article types in ODNB:

# ODNB contains disambiguation pages which list individuals who have the same surname. Identifying these pages programmatically is challenging and so it is difficult to filter them.
# Some articles discuss more than one person, where multiple entities' stories are inextricably linked, e.g., the famous serial killers Burke and Hare. Note that this is also a problem with DIB.

{| class="wikitable"
|+ Table 1: Summary of results
! Collection !! τ !! F1 (%)
|-
| DIB || 0.55 || 81.5
|-
| ODNB || 0.55 || 67.5
|-
| ODNB (filtered) || 0.55 || 77.5
|}

====4.1 Further analysis====
In an attempt to evaluate the relative performance of this linking method with respect to other state of the art EL systems, a comparative analysis was conducted. For this evaluation, the GERBIL benchmarking platform was used (Usbeck et al., 2015). GERBIL is built on the BAT framework and so the results it produces were expected to be somewhat comparable with those presented in the previous section. However, in the interests of thoroughness a simple web interface to the biography linking method was set up so that GERBIL could directly benchmark the method.

The two gold standard collections, DIB and ODNB (filtered), were converted into NIF documents (Hellmann et al., 2012). The surface form which named the subject of the biography was injected at the beginning of the article with some modification:

# The names were originally in surname, forename order. This was reversed.
# Where multiple formulations of a name were present between parentheses, e.g., different languages or nicknames, these alternate formulations were collapsed and removed from the surface form.

This was intended to yield a surface form which was more easily recognisable by EL systems. The generated surface form was marked as the only entity in the document. This clearly gives an advantage to EL systems which perform analysis on the context of a mention rather than relationships between entities. However, given the nature of the problem being tackled this is an appropriate bias.

GERBIL was configured to perform a D2KB evaluation, that is, the EL systems were provided with the surface form and the context of the mention. Their sole task was to identify a referent for the surface form.

At the time of the experiment, only 5 of the 17 EL services that are registered with GERBIL were available. These were Babelfy, DBpedia Spotlight, FOX, KEA, and PBOH (Moro et al., 2014; Mendes et al., 2011; Waitelonis and Sack, 2016; Speck and Ngomo, 2014; Ganea et al., 2016). The experiment was configured to run with these five services.

It should be noted that FOX is essentially AGDISTIS (Usbeck et al., 2014) with an entity recognition layer before the disambiguation phase. Given that this is a D2KB task, FOX can arguably be considered an evaluation of AGDISTIS with some caveats. Namely, FOX maintains its own deployment of AGDISTIS which is not necessarily in line with the most recent version, and the entity recognition stage in FOX is mandatory, meaning that even in the D2KB task it will attempt to spot entities. GERBIL compensates for this when computing the results of the evaluation, but it may still lead to some skew in the final figures. Even so, the inclusion of FOX is helpful considering that the default AGDISTIS service was not online.

Under these conditions, GERBIL evaluated the EL systems. For brevity, only the precision, recall and F1 measures for the overall linking task are reported in Tables 2 and 3. The full results of the evaluation on both DIB and ODNB are available online (http://gerbil.aksw.org/gerbil/experiment?id=201810190004 and http://gerbil.aksw.org/gerbil/experiment?id=201810190005).

{| class="wikitable"
|+ Table 2: GERBIL results for DIB. Munnelly (the method presented in this paper) clearly outperforms all other available services in the evaluation.
! Annotator !! Micro F1 !! Micro Precision !! Micro Recall !! Macro F1 !! Macro Precision !! Macro Recall
|-
| Babelfy || 0.5333 || 0.7304 || 0.4200 || 0.4200 || 0.4200 || 0.4200
|-
| DBpedia Spotlight || 0.0099 || 0.5000 || 0.0050 || 0.0050 || 0.0050 || 0.0050
|-
| FOX || 0.5112 || 0.5833 || 0.4550 || 0.4550 || 0.4550 || 0.4550
|-
| KEA || 0.3437 || 0.3935 || 0.3050 || 0.3050 || 0.3050 || 0.3050
|-
| Munnelly || 0.8221 || 0.8241 || 0.8200 || 0.8200 || 0.8200 || 0.8200
|-
| PBOH || 0.0000 || 0.0000 || 0.0000 || 0.0000 || 0.0000 || 0.0000
|}

{| class="wikitable"
|+ Table 3: GERBIL results for ODNB. Munnelly (the method presented in this paper) clearly outperforms all other available services in the evaluation.
! Annotator !! Micro F1 !! Micro Precision !! Micro Recall !! Macro F1 !! Macro Precision !! Macro Recall
|-
| Babelfy || 0.6222 || 0.8522 || 0.4900 || 0.4900 || 0.4900 || 0.4900
|-
| DBpedia Spotlight || 0.0000 || 0.0000 || 0.0000 || 0.0000 || 0.0000 || 0.0000
|-
| FOX || 0.4921 || 0.6667 || 0.3900 || 0.3900 || 0.3900 || 0.3900
|-
| KEA || 0.5133 || 0.6259 || 0.4350 || 0.4350 || 0.4350 || 0.4350
|-
| Munnelly || 0.7700 || 0.7700 || 0.7700 || 0.7700 || 0.7700 || 0.7700
|-
| PBOH || 0.0000 || 0.0000 || 0.0000 || 0.0000 || 0.0000 || 0.0000
|}

Both Macro and Micro F1 measures are reported. Macro and Micro take slightly different views of the collection. Macro treats each input document as an individual disambiguation problem, computing precision and recall for each document and then averaging the results across the whole collection. Micro treats the entire collection as one large disambiguation problem and computes precision and recall for all annotations in the gold standard. Given these definitions, we can expect that the results for Micro and Macro precision, recall and F1 measure will be roughly (if not exactly) equal for this specific evaluation, given that each document contains only one entity. However, as previously mentioned, some services will attempt to perform Named Entity Recognition even when the specified task is D2KB. This can result in some disparity between the results of Micro and Macro evaluation.

The figures presented seem to confirm that the method described by this paper performs significantly better on this linking task than the other systems evaluated. However, it is difficult to understand precisely why this is the case since GERBIL does not provide access to the internal machinations of the evaluation. In particular, the performance of DBpedia Spotlight is unexpectedly low. Without knowing how GERBIL is calling this API endpoint, the reason for this seemingly complete failure cannot be determined.

Spotlight's performance is particularly suspicious as this is a task where it should perform reasonably well. It uses a language model to compare the context of a mention with descriptions of known entities, meaning it relies on contextual features to identify a referent, an approach which this experiment favours.

A specific evaluation using Spotlight's disambiguation API endpoint (http://model.dbpedia-spotlight.org/en/disambiguate) was performed with a custom script, which submitted the content of each biography individually together with the injected surface form as previously described. The responses from the server were dumped to a series of CSV files. As with the evaluation described in Section 4, the value of the confidence threshold for annotation was varied. Under these conditions, Spotlight performed considerably better, with F1 scores reported by BAT between 0.25 and 0.465 for ODNB and between 0.52 and 0.555 for DIB, depending on the value of the confidence threshold, which ranged from 0 to 1. A summary of these results can be seen in Figures 3 and 4. It is notable that they are much more stable than the values shown in Figures 1 and 2.

Figure 3: Change in performance of DBpedia Spotlight on DIB with values of τ (F1 plotted against threshold τ, both axes from 0 to 1). Note that the performance is much more consistent than earlier plots.

Figure 4: Change in performance of DBpedia Spotlight on ODNB with values of τ (F1 plotted against threshold τ, both axes from 0 to 1). Note that the performance is much more consistent than earlier plots.

This direct examination suggests Spotlight performs much better at this annotation task than the results of the GERBIL experiment indicate. This should not be considered as an attempt to undermine GERBIL, which is an important attempt at providing a consistent benchmark for the tumultuous challenge of evaluating EL systems. But it does strongly highlight the need for low level scrutability, reporting and configuration of the APIs being evaluated, which is a feature that GERBIL ostensibly lacks.

It is nevertheless reassuring to see that the scores for this paper's method (designated "Munnelly" in the tables of results) conform to those values obtained by the earlier BAT investigation.

===5 Conclusion===
Given the task at hand, the method of linking presented in this paper seems to identify referents in DBpedia with a reassuring level of accuracy. Indeed, the method is not restricted to this simple use case, as it is for all intents and purposes a fully implemented EL system. Given a set of surface forms and a context it should provide a set of suitable referents for the inputs.

However, this method falls into a common EL trap which is the trade-off between performance and time. The more accurate an EL method is, the more computationally expensive it is expected to become. This is especially true of this approach, which requires as much as a minute to identify a referent for a single entity.

While this approach was initially conceived as an ad-hoc solution to a specific problem, its performance in the evaluation is encouraging and future work may seek to further investigate the construction of an EL service based on this approach, provided the issue with time and computational complexity can be resolved. The current implementation is known to perform several wasteful operations, the results of which could be cached or even pre-computed and indexed to improve performance.

At the time of writing, the annotation task for linking ODNB, DIB and DBpedia has been completed and included in a custom KB (Munnelly and Lawless, 2018a). Ongoing work is investigating the usefulness of these links for improving the quality of EL on Irish CH datasets.

===Acknowledgments===
The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

===References===
* Agirre, Eneko, Ander Barrena, Oier Lopez de Lacalle, Aitor Soroa, Samuel Fernando, and Mark Stevenson (2012). Matching cultural heritage items to wikipedia. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, eds., Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA).
* Brando, Carmen, Francesca Frontini, and Jean-Gabriel Ganascia (2016). REDEN: Named Entity Linking in Digital Literary Editions Using Linked Data Sets. Complex Systems Informatics and Modeling Quarterly, 7:60–80.
* Bunescu, Razvan C. and Marius Pasca (2006). Using Encyclopedic Knowledge for Named Entity Disambiguation. In European Chapter of the Association for Computational Linguistics, vol. 6, pages 9–16.
* Cornolti, Marco, Paolo Ferragina, and Massimiliano Ciaramita (2013). A Framework for Benchmarking Entity-Annotation Systems. In Proceedings of the 22nd International Conference on World Wide Web, pages 249–260. New York, NY, USA: ACM.
* Debruyne, Christophe, Éamonn Clinton, Lorraine McNerney, Atul Nautiyal, and Declan O'Sullivan (2016). Serving Ireland's Geospatial Information as Linked Data. In International Semantic Web Conference (Posters & Demos).
* Edmonds, Jack (1965). Paths, Trees, and Flowers. Canadian Journal of Mathematics, 17(3):449–467.
* Ganea, Octavian-Eugen, Marina Ganea, Aurelien Lucchi, Carsten Eickhoff, and Thomas Hofmann (2016). Probabilistic Bag-of-Hyperlinks Model for Entity Linking. In Proceedings of the 25th International Conference on World Wide Web, pages 927–938. International World Wide Web Conferences Steering Committee.
* GeoNames (2018). Geonames. URL http://geonames.org/ (accessed 2018-10-19).
* Hellmann, Sebastian, Jens Lehmann, and Sören Auer (2012). NIF: An Ontology-Based and Linked-Data-Aware NLP Interchange Format. Working Draft, page 252.
* Jimenez, Sergio, Claudia Becerra, Alexander Gelbukh, and Fabio Gonzalez (2009). Generalized Mongue-Elkan Method for Approximate Text String Comparison. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 559–570. Springer.
* Kusner, Matt, Yu Sun, Nicholas Kolkin, and Kilian Weinberger (2015). From Word Embeddings to Document Distances. In International Conference on Machine Learning, pages 957–966.
* Mendes, Pablo N., Max Jakob, Andrés García-Silva, and Christian Bizer (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. New York, NY, USA: ACM.
* Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in neural information processing systems, pages 3111–3119.
* Milne, David and Ian H. Witten (2008). Learning to Link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 509–518. New York, NY, USA: ACM.
* Monge, Alvaro and Charles Elkan (1996). The Field Matching Problem: Algorithms and Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270.
* Moro, Andrea, Francesco Cecconi, and Roberto Navigli (2014). Multilingual Word Sense Disambiguation and Entity Linking for Everybody. In International Semantic Web Conference (Posters & Demos), pages 25–28.
* Munnelly, Gary and Séamus Lawless (2018a). Constructing a Knowledge Base for Entity Linking on Irish Cultural Heritage Collections. Procedia Computer Science, 137:199–210. Proceedings of the 14th International Conference on Semantic Systems 10th – 13th of September 2018 Vienna, Austria.
* Munnelly, Gary and Séamus Lawless (2018b). Investigating Entity Linking in Early English Legal Documents. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries, pages 59–68. New York, NY, USA: ACM.
* Ratinov, Lev, Dan Roth, Doug Downey, and Mike Anderson (2011). Local and Global Algorithms for Disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pages 1375–1384. Association for Computational Linguistics.
* Řehůřek, Radim and Petr Sojka (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50. Valletta, Malta: ELRA. http://is.muni.cz/publication/884893/en.
* Speck, René and Axel-Cyrille Ngonga Ngomo (2014). Named entity recognition using FOX. In Proceedings of the ISWC 2014 Posters & Demonstrations Track, pages 85–88. CEUR-WS.org. URL http://ceur-ws.org/Vol-1272/paper_70.pdf.
* Usbeck, Ricardo, Axel-Cyrille Ngonga Ngomo, Michael Röder, Daniel Gerber, Sandro Athaide Coelho, Sören Auer, and Andreas Both (2014). AGDISTIS – Graph-Based Disambiguation of Named Entities Using Linked Data. In International Semantic Web Conference, pages 457–471. Springer.
* Usbeck, Ricardo, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, et al. (2015). GERBIL: General Entity Annotator Benchmarking Framework. In Proceedings of the 24th International Conference on World Wide Web, pages 1133–1143. New York, NY, USA: ACM.
* Van Hooland, Seth, Max De Wilde, Ruben Verborgh, Thomas Steiner, and Rik Van de Walle (2015). Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities, 30(2):262–279.
* Waitelonis, Jörg and Harald Sack (2016). Named Entity Linking in #Tweets with KEA. In #Microposts – 6th Workshop on Making Sense of Microposts, pages 61–63. CEUR-WS.org. URL http://ceur-ws.org/Vol-1691/paper_14.pdf.
* Winkler, William (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Section on Survey Research Methods, pages 354–359.
* Yosef, Mohamed Amir, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, and Gerhard Weikum (2011). Aida: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. Proceedings of the VLDB Endowment, 4(12):1450–1453.
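The name similarity Φ of Section 3 (Jaro-Winkler edge weights, an optimal token mapping, then a generalised mean with m = 2) can be illustrated with a small, self-contained sketch. This is not the authors' implementation. The stop-word list is hypothetical, the Jaro-Winkler function is a plain textbook version, and the optimal mapping is found by brute force over permutations rather than with Edmonds' blossom algorithm, which is only practical because names contain few tokens. Taking W to be the edges of a mapping from the shorter token list into the longer one is also an assumption.

```python
from itertools import permutations

# Hypothetical stop-word list; the paper does not say which one was used.
STOP_WORDS = {"of", "the", "and"}


def jaro(s, t):
    """Textbook Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_flags, t_flags = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_hit = [c for c, f in zip(s, s_flags) if f]
    t_hit = [c for c, f in zip(t, t_flags) if f]
    transpositions = sum(a != b for a, b in zip(s_hit, t_hit)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scaling=0.1):
    """Jaro similarity boosted by a shared prefix of up to four characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def tokens(name):
    """Lower-case, split on non-alphanumerics and drop stop words."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def name_similarity(title, name, m=2):
    """Generalised mean (exponent m) of the edge weights in the best mapping."""
    t_b, t_p = tokens(title), tokens(name)
    if not t_b or not t_p:
        return 0.0
    small, large = sorted((t_b, t_p), key=len)
    # Brute-force stand-in for the optimal bipartite matching.
    best = max(
        ([jaro_winkler(a, b) for a, b in zip(small, perm)]
         for perm in permutations(large, len(small))),
        key=sum,
    )
    return (sum(w ** m for w in best) / len(best)) ** (1 / m)


print(name_similarity("Dermot Robert Wyndham-Bourke",
                      "Dermot Bourke 7th Earl of Mayo"))
```

Because the mapping is order-free, a title in surname, forename order scores the same as its reversed form.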
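Equations (1), (3) and (4) of Section 3 combine into a short decision rule: score each candidate with Ψ, keep the best one, and fall back to NIL if it does not clear τ. The sketch below assumes the name and content similarities have already been computed. The function names, the candidate tuples and the example URIs and scores are illustrative only and do not come from the paper.

```python
def combined_score(name_sim, content_sim, alpha=0.1, beta=0.9):
    """Equation (3): weighted sum of name and content similarity."""
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    return alpha * name_sim + beta * content_sim


def choose_referent(candidates, tau=0.55):
    """Equations (1) and (4): best candidate, or None (NIL) below the threshold.

    candidates is a list of (uri, name_sim, content_sim) tuples.
    """
    if not candidates:
        return None
    best_uri, best_score = max(
        ((uri, combined_score(n, c)) for uri, n, c in candidates),
        key=lambda pair: pair[1],
    )
    return best_uri if best_score > tau else None


# Made-up similarities for two hypothetical candidates.
example = [("dbr:Candidate_A", 0.9, 0.7), ("dbr:Candidate_B", 1.0, 0.4)]
print(choose_referent(example))           # Candidate_A scores 0.72
print(choose_referent(example, tau=0.8))  # nothing clears 0.8, so NIL
```

With α = 0.1 the second candidate's perfect name match cannot outweigh its weaker content similarity, which mirrors the weighting rationale given in Section 4.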
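The two surface-form modifications listed in Section 4.1 (reversing surname, forename order and removing parenthesised variants) might look roughly like the following. The exact title format of ODNB and DIB entries is not given in the paper, so the regular expression, the comma convention and the example titles are assumptions.

```python
import re


def surface_form(title):
    """Drop parenthesised variants, then turn 'Surname, Forename' into 'Forename Surname'."""
    without_variants = re.sub(r"\s*\([^)]*\)", "", title)
    surname, comma, forenames = without_variants.partition(",")
    if not comma:
        # No comma: assume the title is already in reading order.
        return " ".join(without_variants.split())
    return " ".join(f"{forenames} {surname}".split())


# Hypothetical titles, for illustration only.
print(surface_form("Coote, Charles (Cootes, Chas.)"))
print(surface_form("Burke and Hare"))
```

Titles with more than one comma (for example, trailing honorifics) would need extra handling that this sketch does not attempt.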
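The relationship between the Micro and Macro columns in Tables 2 and 3 can be checked with a small calculation. The sketch assumes each document has exactly one gold entity, that a system returns at most one annotation per document, and that precision is taken as 0 when nothing correct is returned. The per-document outcomes below are not from the paper: the counts (84 correct, 31 wrong, 85 unanswered out of 200) were chosen only because they reproduce the Babelfy row of Table 2.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, with 0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def counts(gold, predicted):
    """(tp, fp, fn) for a document with one gold entity and at most one prediction."""
    if predicted is None:
        return 0, 0, 1
    if predicted == gold:
        return 1, 0, 0
    return 0, 1, 1


def micro(docs):
    """Pool the counts over the whole collection, then compute P/R/F1 once."""
    totals = [sum(col) for col in zip(*(counts(g, p) for g, p in docs))]
    return prf(*totals)


def macro(docs):
    """Compute P/R/F1 per document, then average each measure."""
    per_doc = [prf(*counts(g, p)) for g, p in docs]
    return tuple(sum(col) / len(per_doc) for col in zip(*per_doc))


# Illustrative outcomes: 84 correct, 31 wrong, 85 with no annotation returned.
docs = [("e", "e")] * 84 + [("e", "x")] * 31 + [("e", None)] * 85
print([round(v, 4) for v in micro(docs)])
print([round(v, 4) for v in macro(docs)])
```

Under these assumptions every per-document score is either 0 or 1, so the three Macro measures collapse to the fraction of correctly linked documents, which also equals Micro recall. Micro precision differs whenever a system declines to annotate some documents.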