<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>REBL: Entity Linking at Scale</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chris Kamphuis</string-name>
          <email>chris@cs.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faegheh Hasibi</string-name>
          <email>f.hasibi@cs.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jimmy Lin</string-name>
          <email>jimmylin@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjen P. de Vries</string-name>
          <email>arjen@cs.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <addr-line>Nijmegen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Waterloo</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>REBL is an extension of the Radboud Entity Linker (REL) for batch entity linking. REBL was developed after we encountered unforeseen issues when trying to link the large MS MARCO v2 web document collection with REL. In this paper we discuss the issues we ran into and our solutions to mitigate them. REBL makes it easier to isolate the GPU-heavy operations from the CPU-heavy operations by separating the mention detection stage from the candidate selection and entity disambiguation stages. By improving the entity disambiguation module, we lowered the time needed for linking documents by an order of magnitude. The code for REBL is publicly available on GitHub.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Entity linking is the task of automatically identifying entity mentions in text and
linking them to the corresponding entries in a knowledge base (KB). It fulfils a key role in
knowledge-grounded understanding of text and has proven effective for diverse tasks in
information retrieval [
        <xref ref-type="bibr" rid="ref1">1, 2, 3, 4, 5, 6, 7</xref>
        ], natural language processing [8, 9], and
recommendation [10]. Utilizing entity annotations in these downstream tasks depends on annotating
text corpora with an entity linking method. Due to the complexity of entity linking
systems, this process is often performed by a third-party entity linking toolkit; examples include
DBpedia Spotlight [11], TAGME [12], Nordlys [13], GENRE [14], and REL [15].
A caveat of existing entity linking toolkits is that they have not been designed for batch
processing large numbers of documents. Existing toolkits are primarily optimised
to annotate individual documents, one at a time. This severely restricts the utilization of
state-of-the-art entity linking tools such as REL and GENRE, which employ neural approaches and require
GPUs for fast operation. Annotating millions of documents incurs significant computational
overhead, to the extent that annotating a large text corpus becomes practically infeasible
using modest computational resources. Batch entity linking is, however, necessary to
build today’s data-hungry machine learning models, considering large text corpora like the new
MS MARCO v2 (12M web documents) [16].
      </p>
      <p>This paper describes our experience with optimizing the Radboud Entity Linking (REL) toolkit
for batch processing large corpora. REL annotates individual documents efficiently, requiring
only modest computational resources, while performing competitively with state-of-the-art
methods on effectiveness. It treats entity linking as a modular problem
consisting of three stages:
(i) Mention detection. The goal of this step is to identify all text spans in a document
that might refer to an entity. If a text span that refers to an entity is not identified properly in
this stage, the system cannot correctly link the entity in later stages.
(ii) Candidate selection. For every detected mention, REL considers up to 4 + 3 (= 7) candidate
entities. Four candidate entities are selected based on their prior occurrence probability p(e|m)
(for entity e given mention m). These priors are pre-calculated from Wikipedia hyperlinks and
the CrossWikis [17] corpus. The other three entities are chosen based on the similarity of their
embeddings to the contextual embedding of the mention (considering a context of at most
200 word tokens).
(iii) Entity disambiguation. The goal of this final step is to map each mention to the correct
entity in the knowledge base. The candidate entities for each mention are obtained from the
previous stage, and REL implements the Ment-norm method proposed by Le and Titov [18].
This paper explains the challenges of batch processing in REL and presents the approaches we
found to overcome them. We show that our updated toolkit, REBL, improves REL's
efficiency by a factor of 9.5, decreasing the processing time per document (excluding mention
detection) on a sample of 5000 MS MARCO documents from 1.23 seconds to 0.13 seconds. We
demonstrate that REBL enables the annotation of a large corpus like MS MARCO v2 given
modest computational resources. We discuss potential improvements that could further improve
the efficiency of batch entity linking. The REBL code and toolkit are publicly available
at https://github.com/informagi/REBL.</p>
    </sec>
    <sec id="sec-2">
      <title>2. From REL to REBL</title>
      <p>The objective that led to this paper was to link the MS MARCO v2 collection [16]. This
collection contains 11,959,635 documents split into 60 compressed files, totalling roughly 33GB
in size. Decompressed, these files are in JSON Lines format (every line represents a JSON
document). Documents have five fields: url, title, headings, body, and docid. For our experiments
we wanted to link the title, headings, and body of the documents. We link to the 2019-07 Wikipedia
dump, one of the two dumps REL was initially developed on. It is, however,
straightforward to take another dump of Wikipedia and derive another REL instance.
To ease linking data at this scale, we separated the GPU-heavy mention detection stage
from the CPU-heavy candidate selection and entity disambiguation stages; the modified code
can be found on GitHub.1 The inputs for mention detection are the compressed MS MARCO v2
document files, and its output consists of the mentions found and their locations in the documents,
in Apache Parquet format.2 These files, together with the source text, are the input for the
subsequent phases (candidate selection and entity disambiguation). The final output consists
of Parquet files containing spans of text and their linked entities. In the following, we discuss
what we changed in the mention detection, candidate selection, and entity disambiguation steps
to make REL better suited to linking the MS MARCO v2 collection.</p>
      <sec id="sec-2-1">
        <title>2.1. Mention Detection</title>
        <p>REL [15] uses Flair [19], a state-of-the-art named entity recognition system, for mention
detection. Flair uses the segtok3 package to segment an (Indo-European) document into sentences,
internally represented as Sentence objects. These sentences are split into words and symbols
represented as Token objects. When creating these representations, however, it is not possible
to recreate the source text exactly, as Flair collapses consecutive whitespace characters.
REL corrects for this to preserve correct span locations with respect
to the source text, which is an inefficient process. For REBL, we set out to construct the
underlying data structures ourselves. To do this, we used the syntok4 package, the
successor of segtok. The author of both packages claims that syntok
segments sentences better than segtok.</p>
        <p>When constructing the sentences from the Token objects, we ran into another issue originating
from Flair's data handling: Flair removes various zero-width Unicode characters
from the source text: zero width space (U+200B), zero width non-joiner (U+200C), variation
selector-16 (U+FE0F), and zero width no-break space (U+FEFF). These characters occur rarely,
but in a collection as big and diverse as MS MARCO v2 they do appear in some
documents. When encountering these characters, we construct the Token objects such that
the span and offset of each token still refer to its location in the source text.</p>
        <p>For the case of the zero width space, we updated the syntok package: although the zero width
space is not considered a whitespace character according to the Unicode standard, it should be
treated as a character that separates two words. For the other Unicode characters removed by
Flair, we manually update the spans in the Token objects created by Flair such that they refer
correctly to positions in the source text. Now, when Flair identifies a series of tokens as a
possible mention, we can directly read off the location in the source text from the Token objects.
Flair supports named entity recognition in batches, so that multiple batches of text can be sent
to the GPU for faster inference. Because REL had been designed to tag one document at a
time, it did not use this functionality. REBL exploits this feature, allowing the user to specify
the number of documents to be tagged simultaneously.</p>
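        <p>The span-correction idea can be sketched as follows: a simplified re-alignment that maps cleaned tokens back to offsets in the original text, skipping the zero-width characters the tokenizer silently dropped. This is an illustrative sketch, not the actual REBL code; it assumes tokens appear in order and are separated only by whitespace and/or zero-width characters.</p>

```python
# Zero-width characters that the tokenizer strips (see text): zero width
# space, zero width non-joiner, variation selector-16, zero width no-break space.
ZERO_WIDTH = {"\u200b", "\u200c", "\ufe0f", "\ufeff"}

def realign(tokens, source):
    """Map each cleaned token back to its (start, end) span in `source`.
    Returns a list of (token, start, end) with end exclusive."""
    spans = []
    pos = 0
    for tok in tokens:
        i = 0          # characters of `tok` matched so far
        start = None
        while i != len(tok):
            ch = source[pos]
            if ch == tok[i]:
                if start is None:
                    start = pos
                i += 1
            elif ch.isspace() or ch in ZERO_WIDTH:
                pass   # separator or dropped character; skip it
            else:
                raise ValueError("token does not match source text")
            pos += 1
        spans.append((tok, start, pos))
    return spans
```

Note that U+200B is (correctly, per the Unicode standard) not reported as whitespace by Python's `str.isspace`, which is why it must be handled explicitly.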
        <p>1https://github.com/informagi/REBL
2https://github.com/apache/parquet-format
3https://github.com/fnl/segtok
4https://github.com/fnl/syntok</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Candidate Selection and Entity Disambiguation</title>
        <p>REL makes use of a p(e|m) prior, where e is an entity and m is a mention. These priors are stored
in a (SQLite) database, and up to 100 priors per mention are considered. Converting between
the client representation and the representation stored in the database, however, incurred a large
serialization cost. We switched to a format that is faster to load, with the additional benefit of a
considerably smaller database.5 We experimented with the column-oriented DuckDB
database as an alternative, but found that SQLite was (still) more efficient as a key-value store, at
least in DuckDB's current state of development.</p>
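        <p>The exact storage format is REBL-internal; as a sketch of the general idea, candidate lists can be packed as fixed-width binary records in a SQLite key-value table, so that loading a mention's priors is a single decode rather than per-row deserialization:</p>

```python
import sqlite3
import struct

def create(conn):
    conn.execute("CREATE TABLE priors (mention TEXT PRIMARY KEY, blob BLOB)")

def put(conn, mention, candidates):
    """candidates: list of (entity_id, prior) pairs, packed as fixed-width
    records: int64 entity id followed by float64 prior (16 bytes each)."""
    blob = b"".join(struct.pack("qd", eid, p) for eid, p in candidates)
    conn.execute("INSERT INTO priors VALUES (?, ?)", (mention, blob))

def get(conn, mention):
    """Decode the whole candidate list in one pass over the blob."""
    row = conn.execute("SELECT blob FROM priors WHERE mention = ?",
                       (mention,)).fetchone()
    if row is None:
        return []
    blob = row[0]
    return [struct.unpack_from("qd", blob, off) for off in range(0, len(blob), 16)]
```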
        <p>We found that the entity disambiguation stage took much longer than reported in the original
REL paper. This difference is explained by the length of the documents to be linked. The
documents evaluated by van Hulst et al. [15] were on average 323 tokens long, with an average
of 42 mentions to consider. An MS MARCO v2 document contains on average 1800 tokens,
with 84 possible mentions per document.6 Per mention, 100 tokens to the left and 100
tokens to the right are considered as context for the disambiguation model. Larger
documents result in larger memory consumption per context and per document, with higher
processing costs as a consequence.</p>
        <p>We improved the efficiency of the entity disambiguation step such that it can be run in a
manageable time. REL recreates database cursors for every transaction; we rewrote the REL
database code such that a single database cursor is created for the entity disambiguation module.
Within a document, the same queries were issued to the database multiple times; this happens,
for example, when a mention occurs multiple times within a document. By caching the output
of these queries, we significantly lowered the number of database calls needed. We
cached all database calls per segment of the collection, as we ran the process for every
segment separately.</p>
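        <p>The single-cursor-plus-cache pattern can be sketched as follows (a hypothetical wrapper class, not REL's actual code; it assumes a simple priors table keyed by mention):</p>

```python
import sqlite3

class PriorStore:
    """One long-lived cursor for the whole disambiguation run, plus a
    per-segment cache so repeated queries hit SQLite only once."""
    def __init__(self, conn):
        self.cur = conn.cursor()   # created once, not per transaction
        self.cache = {}

    def lookup(self, mention):
        if mention not in self.cache:
            self.cur.execute("SELECT blob FROM priors WHERE mention = ?",
                             (mention,))
            row = self.cur.fetchone()
            self.cache[mention] = row[0] if row is not None else None
        return self.cache[mention]

    def clear(self):
        """Call between collection segments to bound memory use."""
        self.cache.clear()
```

Caching trades memory for speed; clearing per segment keeps the trade-off bounded, matching how we ran the process per segment.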
        <p>By default, REL keeps embeddings on the GPU after they are loaded. This, however,
slowed down disambiguation when many documents are processed consecutively, because
operations like normalization were carried out over all embeddings on the GPU. By clearing
these embeddings as soon as a document has been processed, a significant speedup was achieved.
Finally, after retrieving the embeddings from the database, REL put them in a Python list. We
rewrote the REL code such that the binary data is loaded directly into NumPy arrays, a format
that PyTorch operates on natively.</p>
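        <p>The last change can be sketched as follows: decode the raw float32 blobs fetched from the database into one contiguous NumPy array, which `torch.from_numpy` can then wrap without copying. The dtype and blob layout here are assumptions for illustration.</p>

```python
import numpy as np

def blobs_to_array(blobs, dim):
    """Decode a list of raw float32 embedding blobs into one (n, dim)
    NumPy array in a single pass, avoiding per-value Python lists.
    torch.from_numpy(arr) can then wrap the result zero-copy."""
    return np.frombuffer(b"".join(blobs), dtype=np.float32).reshape(-1, dim)
```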
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Effects on Execution</title>
      <p>In the mention detection stage, we improved tokenization and applied batching. In the MS
MARCO v2 collection, 411,906 documents contain tokens that were silently removed by
Flair, which is 3.4% of all documents in the collection. The MS MARCO v1 collection does
not contain documents with these characters; the documents in that version of the
collection were (probably) sanitized before publishing. Batching documents in the mention
detection stage decreased the average time for finding all named entities. We used batches
of size 10, as the documents are relatively large. The optimal batch size will depend on the
available GPU memory.</p>
      <p>5The table that represents the priors shrank from 9.6GB to 2.2GB.
6These figures are calculated over the body field; we also tagged the shorter title and headings fields.</p>
      <p>A few documents in the MS MARCO v2 collection could not be linked. This happened only
in extraordinary cases where entity linking did not make sense in the first place, an
example being a document consisting only of numbers.7 In that case, the syntok package created one
long Sentence object that could not fit in GPU memory.</p>
      <p>For candidate selection and entity disambiguation, the changes described in Section 2.2 pay off
as follows. The default setting of REL was to keep embeddings in GPU memory after they were
loaded; by clearing them from GPU memory after every document, a speedup was achieved.
When an entity occurs within a document, there is a high probability of it occurring multiple
times; by caching the database calls, we increase memory usage but lower the time needed for
candidate selection and entity disambiguation. Finally, by representing the candidates better in
the database, we save on conversion time, further lowering the time needed for candidate
selection.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future work</title>
      <p>We introduced REBL, an extension of the Radboud Entity Linker. We utilize REL's modular
design to separate the GPU-heavy mention detection stage from the CPU-heavy candidate
selection and entity disambiguation stages, as many researchers have dedicated GPU and CPU
machines. The mention detection module has been made more robust and reliable, using a
better segmenter and preserving location metadata correctly. The candidate selection and
entity disambiguation steps were updated to improve their runtime, especially for longer
documents.
7The source document was a price list in PDF format.</p>
      <p>Although it is now possible to run REL [15] on MS MARCO v2 [16] in a (for us) reasonable
time, we identified further improvements that we are actively working on.
Detected mentions are compared to all other mentions during the candidate selection step; the
complexity of this step is O(n²), with n the number of mentions found in a document,
which is especially problematic for longer documents. As we are only interested in mentions
that are similar, we expect that it may be worthwhile to implement a locality-sensitive hashing
algorithm to decrease the number of comparisons needed in this stage. However, we would
need to run additional experiments to ensure the effectiveness of the model does not suffer.
REBL now implements a two-step approach that writes intermediate results to the file system
in Parquet format; a streaming variant would be preferable. We have also kept SQLite as the
database backend, but will consider specialized key-value stores to speed up candidate selection
and entity disambiguation. We will revisit DuckDB upon progress in the implementation of
zero-cost positional joins.</p>
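      <p>The locality-sensitive hashing direction could, for instance, take the shape of MinHash banding over character trigrams. The following is an illustrative sketch only; whether it preserves linking effectiveness is exactly what the additional experiments would need to establish.</p>

```python
import hashlib
from collections import defaultdict

def shingles(mention, n=3):
    """Character n-gram set of a mention (lowercased)."""
    m = mention.lower()
    return {m[i:i + n] for i in range(max(1, len(m) - n + 1))}

def minhash(sh, seeds):
    """One min-hash value per seed; similar shingle sets tend to agree."""
    return [min(int(hashlib.md5((s + g).encode()).hexdigest(), 16) for g in sh)
            for s in seeds]

def lsh_buckets(mentions, seeds, band=2):
    """Group mentions whose signatures agree on some band of `band` values.
    Only mentions sharing a bucket need pairwise comparison, replacing the
    O(n^2) all-pairs step for long documents."""
    buckets = defaultdict(set)
    for mention in mentions:
        sig = minhash(shingles(mention), seeds)
        for b in range(0, len(sig), band):
            buckets[(b, tuple(sig[b:b + band]))].add(mention)
    return buckets
```

More seeds and narrower bands trade recall of similar pairs against the number of candidate comparisons.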
      <p>The candidate selection stage considers the context of a mention. This context has to be
constructed from the source document; as a result, we load the source data a second time
during candidate selection. Alternatively, we could output the mention context in the mention
detection stage, which would speed up the remaining stages. However, this would significantly
increase the size of the mention detection output. More experiments are needed to strike the
right balance here.</p>
      <p>Overall, it has become clear that a data-processing-oriented perspective on entity linking is
necessary for efficient solutions. Having made quite a few implicit design choices explicit,
re-evaluating them might lead to more effective entity linking as well. The last word on entity
linking at scale has not been written!</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is part of the research program Commit2Data with project number 628.011.001
(SQIREL-GRAPHS), which is (partly) financed by the Netherlands Organisation for Scientific
Research (NWO).</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[2] E. J. Gerritse, F. Hasibi, A. P. de Vries, Graph-Embedding Empowered Entity Retrieval, in: Proceedings of the 42nd European Conference on Information Retrieval, 2020, pp. 97–110.
[3] C. Xiong, J. Callan, T.-Y. Liu, Word-Entity Duet Representations for Document Ranking, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, 2017, pp. 763–772.
[4] F. Hasibi, K. Balog, S. E. Bratsberg, Exploiting Entity Linking in Queries for Entity Retrieval, in: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR ’16, 2016, pp. 209–218.
[5] K. Balog, H. Ramampiaro, N. Takhirov, K. Nørvåg, Multi-Step Classification Approaches to Cumulative Citation Recommendation, in: Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, OAIR ’13, 2013, pp. 121–128.
[6] R. Reinanda, E. Meij, M. de Rijke, Mining, Ranking and Recommending Entity Aspects, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, 2015, pp. 263–272.
[7] S. Chatterjee, L. Dietz, BERT-ER: Query-specific BERT Entity Representations for Entity Ranking, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, 2022, pp. 1466–1477.
[8] T. Lin, Mausam, O. Etzioni, Entity Linking at Web Scale, in: Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), 2012, pp. 84–88.
[9] D. Ferrucci, Introduction to “This is Watson”, IBM Journal of Research and Development 56 (2012) 1:1–1:15. doi:10.1147/JRD.2012.2184356.
[10] Y. Yang, O. Irsoy, K. S. Rahman, Collective Entity Disambiguation with Structured Gradient Tree Boosting, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 777–786.
[11] P. N. Mendes, M. Jakob, A. García-Silva, C. Bizer, DBpedia Spotlight: Shedding Light on the Web of Documents, in: Proceedings of the 7th International Conference on Semantic Systems, I-Semantics ’11, 2011, pp. 1–8.
[12] P. Ferragina, U. Scaiella, TAGME: On-the-Fly Annotation of Short Text Fragments (by Wikipedia Entities), in: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, 2010, pp. 1625–1628.
[13] F. Hasibi, K. Balog, D. Garigliotti, S. Zhang, Nordlys: A Toolkit for Entity-Oriented and Semantic Search, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, 2017, pp. 1289–1292.
[14] N. De Cao, G. Izacard, S. Riedel, F. Petroni, Autoregressive Entity Retrieval, in: International Conference on Learning Representations, 2021.
[15] J. M. van Hulst, F. Hasibi, K. Dercksen, K. Balog, A. P. de Vries, REL: An Entity Linker Standing on the Shoulders of Giants, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2197–2200.
[16] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, in: CoCo@NIPS, 2016.
[17] V. I. Spitkovsky, A. X. Chang, A Cross-Lingual Dictionary for English Wikipedia Concepts, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), 2012, pp. 3168–3175.
[18] P. Le, I. Titov, Improving Entity Linking by Modeling Latent Relations between Mentions, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1595–1604.
[19] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP, in: Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 54–59.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Gerritse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. P. de Vries</surname>
          </string-name>
          ,
          <article-title>Entity-Aware Transformers for Entity Search</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1455</fpage>
          -
          <lpage>1465</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>