<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contextual NN Ensemble Retrieval Approach for Semantic Postal Address Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>El Moundir Faraoun</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nédra Mellouli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Millot</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Myriam Lamolle</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ESILV DVRC, Léonard de Vinci group</institution>
          ,
          <addr-line>12 Av. Léonard de Vinci Paris La défense</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIASD Paris 8 University</institution>
          ,
          <addr-line>2 rue de la liberté, Saint-Denis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>TEDIES, TALK solutions, 45 Av. de Paris</institution>
          ,
          <addr-line>Monéteau</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>96</fpage>
      <lpage>111</lpage>
      <abstract>
        <p>The biggest challenge today regarding courier services (delivery of small to medium-sized parcels) is the problem of Address Matching. With the expansion of geographical data and the diversity of formats in which it is received, traditional matching methods are becoming increasingly obsolete due to the lack of conformity of delivery information with postal address writing standards. These new constraints are affecting parcel delivery quality in terms of deliverables, cost and environmental impact. This research focuses on courier delivery data (i.e. postal addresses of recipients) in the context of matching French postal addresses. We introduce a new ensemble retrieval approach to the problem through a voting system leveraging multiple k-Nearest Neighbors search algorithms, called kNN-vote, which effectively transforms the Address Matching task into an Address Retrieval task. kNN-vote returns the top-k best normalized addresses similar to a given query (a non-normalized delivery address). The system takes advantage of several address representations, in particular Pre-trained Transformer-Based Sentence Embeddings. The system has been tested on a real database of French delivery addresses. The method meets high expectations, returning exactly matched addresses with a success rate of up to 96% in top 10 as well as 86% in top 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Address matching or transport entity alignment</kwd>
        <kwd>Recipients/consignees identification or pairing</kwd>
        <kwd>Recovery of recipients</kwd>
        <kwd>Address retrieval</kwd>
        <kwd>Ensemble NN retrieval models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The transport Entity Alignment problem, also known as the postal Address Matching (AM) problem,
is inherently an NLP task, given that a postal address is essentially a short sentence with a
specific arrangement of Named Entities (i.e. attributes or features such as Road Name or Door Number),
which places it within the scope of Entity Matching (EM). The task involves effectively processing
and comparing the structural components of a pair of addresses (a, b) for accurate matching (i.e. deciding
whether a and b refer to the same real-world object).</p>
      <p>Carriers identify delivery addresses received via EDI (Electronic Data Interchange) by matching them
with recipient addresses already registered in their database. Nothing could be simpler at first
glance, except that delivery addresses are increasingly received in non-normalized forms. The addresses
received are often incorrect and/or noisy, so distinguishing a valid address from an invalid one becomes a
very challenging task. The anomalies present in a delivery address can be: (1) writing errors, including
typographic ones, spelling mistakes, repetition, or the absence of specific address features; (2) address
noise, which may involve personal information such as names, phone numbers, or requests for appointments;
(3) lastly, semantic or contextual errors, which include the presence of features from unrelated addresses,
feature replacements (e.g. "avenue" instead of "street"), feature aliases such as abbreviations or acronyms, as
well as polysemous features and, finally, addresses represented by their semantic synonyms, typically
named zones or parks.</p>
      <p>Take, for example, the following real delivery address received by a French carrier: "avenue du
g n ral leclerc centre commercial auchan 89200 avallon". Here the correct Road Type is "rue"
instead of "avenue", and the typographic error "g n ral" is intended as "général". The address also
lacks a Door Number. Finally, note that "centre commercial auchan" is a
semantic synonym for the address. Such anomalies distort the structure of an address and prevent it
from being paired with a valid address record.</p>
      <p>
        The AM problem is traditionally solved with a binary “Match/No Match” classification of address
pairs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] mainly relying on neural network-based methods [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">2, 3, 4, 5, 1, 6</xref>
        ]; yet, the task itself is framed
in a scenario of matching address records between two tables or deduplicating records within a single table.
In the delivery context, however, this correspondence is a search for information similar to a given
query (the received address). We are therefore dealing with an unsupervised Information Retrieval (IR) problem
in which each new address is treated as a query, which may be valid or incorrectly formatted, and for
which we try to find valid "candidate" addresses in the database. This formalization is particularly relevant
since it allows the candidates retrieved for a delivery address to be ranked by contextual similarity.
Furthermore, the number of "candidate" addresses is relatively small, which reduces computation time
compared with aligning the query against all reference address records.
      </p>
      <p>
        Our objective in this research is to take advantage of the various possible representations of addresses,
in particular Transformer-Based Sentence Embeddings, in the context of Information Retrieval. We
propose an ensemble multi-embeddings-model approach based on the k-Nearest Neighbors algorithm
(kNN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], with a voting process between multiple kNN search models.
      </p>
      <p>The remainder of this paper is organized as follows. Section 2 reviews the work carried out
in relation to address matching. Section 3 formalizes the Address Retrieval problem. We describe
our approach in Section 4 and present its experimental settings in Section 5. Results are detailed
and discussed in Section 6. We conclude this work by considering its limitations and prospects for
improvement in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work and State-of-the-Art</title>
      <p>
        The existing solutions for Address Matching can be summed up in two approaches. The first is based
on string similarity measures or matching rules [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. The problem remains that these methods
rely mainly on structural comparisons between addresses, and they quickly become obsolete
when faced with addresses that are written differently but retain the same semantic meaning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In
fact, textual similarity distances such as Levenshtein and others [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14">10, 11, 12, 13, 14</xref>
        ], are used for address
matching. These distances depend on the choice of a similarity threshold, which is generally high. This
score remains very approximate and eliminates the possibility of matching pairs below the chosen
threshold. Other methods are based on decision tree matching rules [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These methods improve the
matching performance but require systematic calibration of the rules by experts due to the diversity of
address writing models.
      </p>
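To make the threshold limitation above concrete, here is a minimal sketch of a threshold-based string matcher. `difflib.SequenceMatcher.ratio()` stands in for the Levenshtein-style similarity used in the cited work, and the 0.85 threshold is an illustrative assumption of a "generally high" cut-off.

```python
from difflib import SequenceMatcher

def string_match(query, reference, threshold=0.85):
    """Declare a pair a match when its character-level similarity clears the threshold."""
    return SequenceMatcher(None, query, reference).ratio() >= threshold

reference = "16 avenue jean jaures 89000 auxerre"
```

A small typo still clears the threshold, while a semantically equivalent rewriting of the same location (the "centre commercial" alias from the introduction) falls below it and is wrongly rejected, which is precisely the weakness discussed above.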
      <p>
        A second approach, based on machine learning (ML) or deep learning (DL) architectures, aims to
learn the semantic similarity between addresses [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]. These methods mainly rely on vector
representations of address elements such as Word2Vec [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or FastText [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to use them as input
data of ML (e.g. Random Forest, XGBoost) or DL inference models (e.g. ESIM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], ABLC [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) for
classification. However, in order to get those word embeddings, a parsing step is needed, which is the
process of segmenting addresses into their essential features or elements (e.g. Road Number, Road
Name or Postal Code). Various parsing techniques were used for this task. For instance, the latter
studies used respectively CRFs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], heuristic rules [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the Jieba algorithm and the Trie syntax tree
algorithm. Nonetheless, these methods often fall short of properly parsing noisy, erroneous
addresses. Furthermore, the lack of context between the words of an address, due to the static nature of
word embeddings, means that these methods may fail to match certain ambiguous addresses, such as
synonymous or polysemous ones [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and addresses that are too distorted by noise and errors.
      </p>
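As a concrete illustration of the parsing step these methods depend on, the sketch below segments a clean French address into features using simple heuristic rules. The rule set (digit door numbers, a fixed road-type list, 5-digit postal codes) is our own simplified assumption, not the exact method of any cited study.

```python
import re

# Assumed, simplified road-type vocabulary for illustration only.
ROAD_TYPES = {"rue", "avenue", "boulevard", "impasse", "place", "chemin"}

def parse_address(address):
    """Heuristically segment an address sentence into its features."""
    tokens = address.lower().split()
    features = {}
    if tokens and tokens[0].isdigit():
        features["DoorNumber"] = tokens.pop(0)
    if tokens and tokens[0] in ROAD_TYPES:
        features["RoadType"] = tokens.pop(0)
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"\d{5}", tok):  # French postal codes have 5 digits
            features["RoadName"] = " ".join(tokens[:i])
            features["PostalCode"] = tok
            features["CityName"] = " ".join(tokens[i + 1:])
            break
    return features
```

On a normalized address the heuristics work, but on the noisy example from the introduction they silently mis-segment (no door number is found and the alias text is swallowed into the road name), which is exactly the failure mode noted above.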
      <p>
        Recently, the advent of pre-trained transformer encoders [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], like Roberta [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], has transformed
various tasks by introducing hyper-contextualized word embeddings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This breakthrough has enabled
the achievement of state-of-the-art performances through fine-tuning these encoders for specific tasks,
particularly in Entity Matching [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ]. In the context of Address Matching, a model named GeoRoberta
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been proposed. It generates geographical knowledge for addresses by fine-tuning a
Roberta encoder on the task of address feature tag detection. It also obtains a textual
encoding of the Google Maps API geographical coordinates of addresses based on Geohash. It is worth
noting that GeoRoberta is itself based on a pre-trained Roberta encoder. It generates augmented
contextualized embeddings for an address pair by combining, at input, the elements of both addresses and
their Geohash encodings. The output embeddings are then fused with a second augmented pair
of addresses, obtained by combining the feature tag embeddings and their Geohash tag embeddings. This final
fused representation is fed into a classification layer for the address matching task. The
approach integrates textual and geographical data, leveraging the power of pre-trained transformers,
which allows polysemous and synonymous addresses to be matched more efficiently. However, the
generation of Geohash coordinates relies on Google geocoding, which is likely to be wrong for certain
ambiguous or excessively erroneous addresses.
      </p>
      <p>
        We argue that the use of sentence embeddings to represent addresses in the context of similar
information retrieval is much better suited in terms of representation quality [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This type of representation
relies on the training of Transformer-Based Bi-Encoders [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] for the Semantic Textual Similarity (STS) task.
It succeeds in reducing the distance between two addresses in a latent space even when they are
expressed differently. Moreover, it solves the problem of synonymous addresses and enables the resolution
of Address Matching through Information Retrieval algorithms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Such a solution was introduced in
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] by fine-tuning a DistilBert [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] Bi-Encoder on address pairs and using it to retrieve the top-k "candidate"
addresses for a query, after which a Cross-Encoder fine-tuned for address pair classification is used as
a re-ranker of the top candidates. Taking this idea further, we propose several types of representations,
vector (sentence and word embeddings) and raw (the textual address content itself). This gives rise to several lists
of k normalized candidate addresses via the ensemble kNN algorithms, and we propose to re-rank
them through a vote based on the maximum number of appearances of a given candidate (i.e. its term
frequency) among the ensemble kNN models.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Address Retrieval formalization</title>
      <p>In this section, we introduce the address structure and define its schema, allowing us to formalize
the Address Retrieval (AR) problem. We focus on French reference addresses and consider
only French address features. Therefore, the correct structure of any address is the one that follows
the official representation model of French postal addresses, namely any address that contains the
basic features of that model, making it possible to precisely identify the geographical point of the
recipient. The features of a correct French address are described in Fig. 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Address model</title>
        <p>Address structure definition: Let V be a vocabulary set, which includes all permissible instances
of the possible features of an address. For example, "avenue" might be an instance of the feature
RoadType. We define A as the set of all correctly structured, normalized address sentences. A
normalized address in this case follows the official model of a French address.</p>
        <p>Within A, there exists a reference set R ⊂ A such that, ∀ a ∈ R, a is both normalized and
corresponds to an actual real-world location. Thus R is the set of all normalized valid addresses
with a real geographical point.</p>
        <p>An address model ℳ is defined by a structure function ⪯(·). This function takes a sequence of
elements from the vocabulary V and produces a normalized address in A. Specifically, ⪯(·) :
V^n → A, where n can be 4 or 5, representing the number of components in an address, n = 4
being the special case where an address doesn't need a RoadType. The components c1, ..., cn ∈ V must
satisfy:
• c1 is an instance of DoorNumber,
• cn is an instance of CityName,
• ⪯ is a partial order relation defined on V, denoted (V, ⪯), such that for any 1 ≤ i &lt; j ≤ n,
∃ (ci, cj) ∈ V × V with ci ⪯ cj,
• ⪯(c1, ..., cn) ↦→ a ∈ A such that a = c1 c2 ... cn.</p>
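A minimal sketch of the structure function follows: it accepts a component sequence of length 4 or 5 in the assumed order DoorNumber [RoadType] RoadName PostalCode CityName and emits the concatenated normalized address sentence, or rejects the sequence. The feature tests (digits for DoorNumber, a small road-type list, 5-digit postal codes) are illustrative assumptions, not the paper's exact rules.

```python
import re

ROAD_TYPES = {"rue", "avenue", "boulevard", "impasse", "place"}  # assumed subset

def structure(components):
    """Sketch of the structure function: map a valid component sequence to a
    normalized address sentence, or return None if the model is violated."""
    n = len(components)
    if n not in (4, 5):          # n = 4 is the special case without a RoadType
        return None
    door, *middle, postal, city = components
    if not door.isdigit():       # c1 must instantiate DoorNumber
        return None
    if not re.fullmatch(r"\d{5}", postal):  # French postal codes have 5 digits
        return None
    if n == 5 and middle[0] not in ROAD_TYPES:
        return None
    return " ".join(components)  # a = c1 c2 ... cn
```

As the examples below illustrate, such a check only validates structure: a structurally valid sequence may still name a nonexistent location.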
        <p>Within this formalization framework, ⪯(·) can be viewed as a grammar allowing us to
generate address sentences that are syntactically and semantically correct. Moreover, if a ∈ R, then a is
a normalized address with a real-world location.</p>
        <p>The latter definition allows us to consider any address that follows the address model ℳ as normalized.
That being said, an address can be normalized but nonexistent. The following examples of French
addresses illustrate this point:
• (i) "16 avenue jean jaures 89000 auxerre" is a normalized existing address.</p>
        <p>• (ii) "16 rue jean jaures 89300 joigny" is a normalized address but nonexistent.</p>
        <p>Although the second address is technically correct in its structure, a simple anomaly such as the
replacement of the RoadType, PostalCode and CityName feature instances means it does not correspond to a real
location. In such a case, the ensemble kNN multi-embeddings models are interesting since the
semantic context of the address is taken into account.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Address retrieval</title>
        <p>Now that the address structure formalism is defined through the lexicographic order relation on
the feature instances of an address, we assume in the remainder of this work that an address is simply
a structured sentence with a particular context (a.k.a. an address sentence). We define the problem of
Address Retrieval as a problem of semantic search over textual documents.</p>
        <p>Address embedding definition: Let a be an address sentence. Given a textual encoder E, an
address representation is defined as the output E(a) where a is the input. We define E0 as id(·) (i.e. the
identity function); therefore, an address representation can be:
• raw (i.e. the textual content of the address itself), through E0,
• a vector embedding, through a neural encoder E.</p>
        <p>In the rest of the paper, and for the sake of simplicity, we refer to the ensemble of raw and vector
model embeddings as multi-embeddings models.
Contextual kNN Address Retrieval task: For a given address query q and an encoder E, we want
to obtain a query representation x_q ∈ R^d for kNN retrieval. The neighborhood of
x_q is then constructed by fetching its k nearest neighbors from a set of reference address sentence
representations X_R ⊂ R^d according to a distance function d(· , ·) : R^d × R^d → R.</p>
        <p>More formally, the k nearest neighbors of x_q can be obtained by:
N_q := {x_1, x_2, . . . , x_k | d(x_q, x_i) are the k smallest distances, i ∈ I_q}
(1)
where I_q denotes the set of indices in [|X_R|] = {1, ..., |X_R|} pointing to the k neighbors with
the smallest distances to x_q.</p>
        <p>
          Although the distance d depends on the encoder fixed for the address representation, our kNN
retrieval model remains generic. For example, if E is a Transformer-Based Bi-Encoder model, then the
distance d would be a cosine-like distance. Roughly speaking, our kNN model has three parameters:
k, the query representation x_q and the distance d [
          <xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>
          ].
        </p>
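The retrieval of Eq. (1) can be sketched as a brute-force kNN search. Cosine distance is used here as the assumed d for bi-encoder vectors, and the toy two-dimensional vectors are purely illustrative.

```python
import math

def cosine_distance(u, v):
    """d(u, v) = 1 - cos(u, v), the distance assumed for bi-encoder embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def knn(x_q, retrieval_set, k, dist=cosine_distance):
    """Return the indices of the k representations closest to the query x_q,
    i.e. the index set I_q of Eq. (1), ranked by smallest distance."""
    ranked = sorted(range(len(retrieval_set)),
                    key=lambda i: dist(x_q, retrieval_set[i]))
    return ranked[:k]
```

Swapping in another distance function (e.g. a string distance over raw text) yields the other retrieval models of Section 4 without changing the search logic.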
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Our Approach</title>
      <p>Ensemble voting for multi-embeddings kNN models is a robust technique that exploits the strengths of
different embedding methods to improve prediction accuracy. By generating multiple embeddings for
the same data and combining the predictions of multiple kNN models through voting, we can achieve
better performance and more reliable results. This approach is particularly useful for our task, in which
different embeddings capture different aspects of the addresses. In order to perform the task of correct
address retrieval, we go through the following steps: (1) data pre-processing and deduplication
for both delivery and reference addresses, (2) offline fine-tuning of different Bi-Encoders on the STS
task in order to construct multiple retrieval sets of normalized address embeddings, (3) kNN retrieval
model construction (see Fig. 2) and (4) online aggregation of the different search results through the
design of a voting schema (see Fig. 3).</p>
      <sec id="sec-4-1">
        <title>4.1. Data pre-processing</title>
        <p>Before fine-tuning the Bi-Encoders, it was necessary to go through two pre-processing steps followed by
a deduplication step:
• The first step is the cleaning of both delivery and reference addresses; it involves removing
accents and punctuation that might be present in the data.
• The second step concerns the removal of interfering elements. This step is only applied to the
delivery addresses, given that all reference addresses are supposed to be correct and normalized.
It removes a set of unnecessary symbols that can be found in non-normalized addresses
(e.g. ‘+’, ‘*’, ‘&amp;’, etc.).
• The third step is the deduplication of delivery address records. By removing exact duplicates,
we ensured that our fine-tuning process was efficient and not biased by redundant data points.</p>
        <p>The final step is dataset creation for the Bi-Encoder fine-tuning. This step includes another cleaning
process, explained in detail in Section 5.1.</p>
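The three steps above can be sketched as follows. The accent handling, the set of interfering symbols and the use of exact-duplicate removal are illustrative assumptions about the cleaning rules, not the carrier's exact pipeline.

```python
import string
import unicodedata

INTERFERING = set("+*&#@|")  # assumed symbol set for the second step

def clean(address, delivery=True):
    """Steps one and two: strip accents and punctuation, then interfering symbols."""
    # Step 1: decompose accented characters and drop the combining marks
    no_accents = "".join(c for c in unicodedata.normalize("NFD", address)
                         if unicodedata.category(c) != "Mn")
    drop = set(string.punctuation)
    if delivery:                       # step two applies to delivery addresses only
        drop |= INTERFERING
    kept = "".join(c for c in no_accents if c not in drop)
    return " ".join(kept.lower().split())

def deduplicate(addresses):
    """Step three: remove exact duplicates while preserving order."""
    return list(dict.fromkeys(addresses))
```
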
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Offline fine-tuning of Bi-Encoders</title>
        <p>Here, we have as an input a set of (delivery, reference) address pairs. The aim of this step is to fine-tune
multiple bi-encoders to generate the address sentence vector embeddings.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Bi-Encoder</title>
          <p>
            Bi-Encoders are Siamese Transformer networks, generally fine-tuned on Semantic Textual Similarity
tasks for the purpose of generating meaningful sentence embeddings. Typically, a pre-trained
transformer model is first chosen as the training base of the Bi-Encoder. We use two types of pre-trained
models:
• “Camembert-base” [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ], a model specific to the French language,
• “XLM-Roberta-base” [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ], a multilingual model,
which we both adapted on a large corpus of French postal addresses by continuing their training on
the Masked Language Modeling (MLM) task. We also used the MLM objective to train a third small
Roberta-based model [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] from scratch on the same corpus.
          </p>
          <p>
            Given an address sentence pair (a, b), a forward pass of the transformer over each tokenized address
generates token embeddings for both a and b. Mean pooling is then applied to each address's token
representations, resulting in two fixed-length vectors, which are our address sentence embeddings.
For a given STS task, the best semantic address matching performance is found through
the optimization of an objective function such as the “contrastive loss” [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] which is used mainly in
neural networks for classification and matching tasks, such as similarity learning. It is often used in
Siamese networks to train models to learn similar representations for pairs of similar samples and
dissimilar representations for pairs of dissimilar samples. Readers interested in exploring the Bi-Encoder
architecture can refer to [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ].
          </p>
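The mean pooling step can be sketched as below: the attention mask marks real tokens (1) versus padding (0), padded positions are excluded from the average, and the toy embedding values are illustrative.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token embeddings of the non-padding positions into one
    fixed-length sentence embedding."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for emb, mask in zip(token_embeddings, attention_mask):
        if mask:                     # skip padding positions
            count += 1
            for j in range(dim):
                sums[j] += emb[j]
    return [s / count for s in sums]
```
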
          <p>
            In our case, our sentences are postal addresses that are no more than a few words long. In addition, all
addresses share more or less the same vocabulary, which repeats itself, such as road types or city names.
All this reduces the diversity of context between dissimilar addresses. This constraint led us to believe
that using a basic objective function would not succeed in creating a sufficient gap in terms of distance
between dissimilar addresses. To overcome this, we decided to use the "Multiple Negatives Ranking Loss"
(MNRL) objective function [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ], which is often used in the context of ranking and information retrieval
tasks and therefore more suited to our similarity search task. This approach is supported by findings in
[
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] which highlights that including multiple negatives in each batch enhances the model’s ability to
distinguish between dissimilar examples without the need to specifically design hard negative pairs.
Finding truly efective negative examples can be challenging and significantly impact the performance,
making MNRL’s ability to utilize multiple negatives in a straightforward manner highly advantageous
which leads to better performance and more robust embeddings.
          </p>
          <p>Multiple Negatives Ranking Loss definition: Given n address sentence embedding pairs
[(x_a1, x_r1), ..., (x_an, x_rn)] between query and reference address sentences (a_1, ..., a_n) and (r_1, ..., r_n),
the pairs (a_i, r_i) are labeled as similar, and the pairs (a_i, r_j) where i ≠ j are labeled as not similar. The loss
function is as follows:</p>
          <p>ℒ = −(1/n) Σ_{i=1..n} [ S(x_ai, x_ri) − log Σ_{j=1..n} exp(S(x_ai, x_rj)) ]
(2)</p>
          <p>For one positive pair (a_i, r_i) in a given batch of positive address pairs, this function uses all the
normalized reference addresses r_j of the other positive pairs to form n − 1 negative pairs
(a_i, r_j). This strategy helps the model widen the distance between negative examples, where S
is the score function (generally S(x_a, x_r) = cos_sim(x_a, x_r)). This loss function helps reduce the
impact of the lack of context in the addresses.</p>
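Eq. (2) can be written out directly for one batch. Cosine similarity is used as the assumed score function S, and each query's negatives are the reference embeddings of the other pairs in the batch.

```python
import math

def cos_sim(u, v):
    """Cosine similarity, the assumed score function S."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl(query_embs, reference_embs):
    """Multiple Negatives Ranking Loss over one batch of n positive pairs (Eq. 2)."""
    n = len(query_embs)
    total = 0.0
    for i in range(n):
        # score the i-th query against every reference in the batch:
        # j == i is the positive, the n - 1 others act as in-batch negatives
        scores = [cos_sim(query_embs[i], reference_embs[j]) for j in range(n)]
        total += scores[i] - math.log(sum(math.exp(s) for s in scores))
    return -total / n
```

A batch of well-separated pairs yields a lower loss than one whose references are nearly interchangeable, which is exactly the gap the objective is meant to enforce.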
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Retrieval set creation</title>
          <p>Having a dataset of normalized reference address sentences R and a fine-tuned Bi-Encoder E, we
can generate a retrieval sentence embedding set X_R through a forward pass over all address instances
of R. This embedding set is later used at inference time to retrieve the nearest neighbors of a given query.</p>
          <p>4.3. kNN retrieval models</p>
          <p>kNN-vote is an ensemble Information Retrieval system based on the search results of multiple kNN
models, all similar in their operation but very different in the representations of the searched
addresses on which they are based. In general, an individual kNN search model is a k-Nearest Neighbors
algorithm which takes as a parameter a distance d specific to the type of representation of the searched
points (e.g. a Levenshtein or Jaccard distance for a raw textual representation). The algorithm computes all the
distances between a query and the search points previously registered in the retrieval reference set
X_R (X_R = R for the raw textual representation) and returns the list of the k most similar points,
i.e. those having the smallest distance to the query. Table 1 shows the different combinations of (encoding,
similarity) that can be used in a kNN search model (kNN Retriever) within the voting system. The
table illustrates the possible types of address representation previously mentioned in Section 3.2, that is,
(1) the raw textual representation, through which we have different kNN search models, each with a
well-defined type of string distance (see Table 1); and (2) the vector representation, divided into two types:
• traditional embeddings, built by mean pooling the static word embeddings of address
elements, such as Word2Vec,
• contextual sentence embeddings of postal addresses, fine-tuned for textual similarity.</p>
          <p>Without any a priori hypotheses about the origin of the errors, we carried out an empirical
search for the best address representation spaces with the appropriate similarity measures. We simply
applied the various representations and similarity measures from the literature, comparing eleven string
similarity measures for the raw representations and four vector similarity measures for the static and
dynamic embedding representations (see Table 1). Some of the string similarity measures, such as
"Ratio" or "Token_set_ratio", are taken from the fuzzywuzzy library, as they enable more robust and
flexible comparisons by incorporating tokenization and sorting mechanisms. Unlike traditional metrics
like Levenshtein and Jaro, which focus solely on character-level edits, fuzzywuzzy's methods account
for word order and partial matches, making them more suitable for real-world text data. Together with the
chosen sentence embedding models (see Section 4.2), we had a total of 31 kNN models.
The advantage here is that it gives us a maximum of individual candidate lists of retrieved addresses,
allowing us, firstly, to compare the performance of each kNN Retriever model and, secondly, to use them
to identify the candidates common to the lists as the most similar candidates. Fig. 2 shows the
architecture of a single kNN Retriever. We finally define the similarity search process as follows: (1)
we convert a query q to the desired representation type to obtain x_q; (2) x_q is then passed to the kNN
Retriever, which computes the distances between x_q and all the representations
in the retrieval set in order to return the k address indices most similar to x_q, ranked by
smallest distance.</p>
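A single kNN Retriever can be sketched as one (representation, distance) pairing. The ratio-style string distance below stands in for one raw-text row of Table 1, the class name and reference addresses are illustrative assumptions, and a vector-based retriever is obtained by swapping in an embedding encoder and a cosine distance.

```python
from difflib import SequenceMatcher

def string_distance(a, b):
    """A ratio-style string distance for the raw textual representation."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

class NNRetriever:
    """One kNN search model: an encoder, a distance, and a retrieval set."""
    def __init__(self, encode, dist, reference_addresses):
        self.encode = encode                   # id(.) for the raw representation
        self.dist = dist
        self.reps = [encode(a) for a in reference_addresses]

    def retrieve(self, query, k):
        x_q = self.encode(query)               # step (1): represent the query
        ranked = sorted(range(len(self.reps)), # step (2): rank by distance
                        key=lambda i: self.dist(x_q, self.reps[i]))
        return ranked[:k]

references = ["16 avenue jean jaures 89000 auxerre",
              "2 rue de la liberte 93200 saint denis"]
raw_retriever = NNRetriever(lambda s: s, string_distance, references)
```
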
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Ensemble voting retrieval system</title>
        <p>The system is termed "multi-embeddings models" due to its dual approach, leveraging both raw address
representations and advanced deep learning (DL) techniques for vector text representations in address
matching. The core functionality of the system involves returning a final list of k′ similar candidate
addresses through a voting process. Among the kNN ensemble models, the voting process is based
on the maximum number of occurrences of a candidate address for a given query. It should be noted
that this system needs two important types of values: (1) the number of repetitions of each candidate
address c in the different k-lists; (2) the different similarity scores of a pair (q, c) for which c appeared
with different kNN models.
4.4.1. Retrieval Flow
1. Candidate address lists retrieval: The system begins by retrieving the k-lists of candidate addresses
using the ensemble kNN retrieving pipeline. Each model in the ensemble provides a list of address
indices for a given query.
2. Voting process:
• Repetition counting: The first step in the voting process is to count the number of repetitions
of each candidate address c across the different k-lists.
• Grouping and sorting: Candidate indices are then grouped based on their repetition counts.</p>
        <p>This creates "bags" of indices, where each bag contains one or more indices pointing to
associated addresses. The bags are then sorted by their number of repetitions.
• In-bag max pooling of similarity scores: Within each bag, the system collects the similarity
scores of each address from the different kNN models in which it appeared. Max pooling
is then applied to these scores to determine the maximum similarity score of each address
within the bag.
• In-bag ranking: The addresses are then sorted within each bag based on their maximum
similarity scores.
3. Final address list retrieval:
• Final output: All the bags are concatenated, resulting in a sorted list of addresses where the
top candidate address has been repeated the most times and possesses the highest similarity
score.
• Cut-off value choice: The system sets the value of n (the number of neighbors to return)
and computes performance metrics to evaluate the effectiveness of the address matching
process. The value of n is not necessarily equal to k, since the voting process ultimately
ranks all the candidates of the k-lists combined, which naturally produces more than k
candidates depending on how heterogeneous the k-lists are.</p>
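The retrieval flow above can be condensed into a single sort key (repetition count, then max-pooled similarity); a minimal sketch in Python, where `knn_vote`, `k_lists` and `scores` are illustrative names rather than the authors' implementation:

```python
from collections import Counter, defaultdict

def knn_vote(k_lists, scores):
    """Aggregate the k-lists of several kNN models by majority vote.

    k_lists: one list of candidate address indices per kNN model.
    scores: dict mapping (model_index, candidate) -> similarity score.
    Returns all candidates ranked by (repetitions, max-pooled score).
    """
    # (1) count repetitions of each candidate across the k-lists
    counts = Counter(c for lst in k_lists for c in lst)
    # (2) max-pool each candidate's similarity scores over the models
    pooled = defaultdict(float)
    for m, lst in enumerate(k_lists):
        for c in lst:
            pooled[c] = max(pooled[c], scores[(m, c)])
    # (3) sorting on the pair (count, pooled score), descending, reproduces
    #     the bag grouping, in-bag ranking and concatenation in one step
    return sorted(counts, key=lambda c: (counts[c], pooled[c]), reverse=True)
```

Note that the returned list naturally contains more than k candidates (the union of the k-lists), matching the n-versus-k remark above.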
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Settings</title>
      <sec id="sec-5-1">
        <title>5.1. Data description</title>
        <p>In our experiments, we use real private postal address data made available by a carrier in the region of
Yonne, France. This data consists of two database tables, a table of approximately 1M non-normalized
addresses of deliveries received via EDI and another table of registered recipients of more than 42K
normalized postal addresses. After the de-duplication step mentioned above, and due to the presence of a
large number of identical delivery instances, just over 85% of all delivery address instances have been
de-duplicated, mainly because most deliveries are business addresses. As a result, we are left with just
over 147K distinct delivery addresses.</p>
        <p>Dataset creation: We are in an offline training setup (i.e. our ensemble kNN retriever does not need
training but rather takes advantage of the different representations, vector or raw, of postal addresses
in order to search for the most similar addresses). That said, the creation of a dataset of address pairs
(i.e. non-normalized query-address, normalized reference-address) is necessary for two reasons: (1)
the offline fine-tuning of the different sentence representation models for the addresses and (2) using
the dataset in the final performance test of the kNN-vote system. To do this, we use the recipient keys
associated with the records in the two tables to create a dataset of over 147K address pairs. The dataset
is then divided between training data and test data with respective proportions of 90% and 10%. The
same test data will be used to evaluate kNN-vote. A second cleaning is carried out on the training
dataset to eliminate certain non-normalized entry addresses likely to reduce the learning quality of the
Bi-Encoder, such as addresses having only the postal code and the city name. This type of address
completely lacks the context linking it to its supposed normalized counterpart. Around 0.8% of
the training data was impacted by this second cleaning. Table 2 shows some examples of this kind
of address.</p>
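This second cleaning step can be sketched as a simple heuristic filter; the regular expression, the threshold and the function name below are illustrative assumptions, not the authors' code:

```python
import re

POSTAL_CODE = re.compile(r"\d{5}")  # French postal codes are 5 digits

def lacks_context(address: str) -> bool:
    """Flag addresses consisting only of a postal code plus a city name,
    which carry no context linking them to a normalized counterpart."""
    tokens = address.split()
    code_tokens = [t for t in tokens if POSTAL_CODE.fullmatch(t)]
    others = [t for t in tokens if not POSTAL_CODE.fullmatch(t)]
    # a postal code, at most two remaining tokens, and no street number
    return bool(code_tokens) and len(others) <= 2 and not any(
        any(ch.isdigit() for ch in t) for t in others)
```

Filtering the non-normalized side of the training pairs with such a predicate keeps only addresses that retain street-level context.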
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Bi-encoders fine-tuning parameters</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Fine-tuning Base</title>
          <p>The three chosen base transformers were trained on a corpus of approximately 950K official French
postal addresses from the Yonne region, France, and adjacent regions, taken from the official governmental
website. The complete training of the three encoders was carried out over 5 iterations and no
parameter optimization was done. The aim here was simply to adapt the three language models to
postal addresses and have them as a basis for fine-tuning the Bi-Encoders. The "transformers" package
from HuggingFace was used to train these language models.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Bi-encoders fine-tuning</title>
          <p>
            The three Bi-Encoders were fine-tuned according to the best combination of hyper-parameters presented
in Table 3. Both the Camembert-base and XLM-Roberta-base architectures used for the first two Bi-Encoders'
fine-tuning are described in detail in [
            <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
]. As for the third, a custom pre-trained Roberta-small
architecture (6 layers, 128 hidden units, 8 heads and 8 million parameters) is used. The three Bi-Encoders
were fine-tuned on a local server with an NVIDIA Tesla A100 graphics card (20 GB) via the SBERT
"sentence-transformers" package.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Models evaluation</title>
        <p>
          For the evaluation of the proposed voting approach, we compare it with our different individual kNN
models in addition to the bi-encoder (BI_DistilBert) model proposed by Duarte et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], where they
        </p>
        <p>[Table 3: Bi-Encoder fine-tuning hyper-parameters. The flattened values include: batch size 32, optimizer AdamW, learning rate 2e-5, scheduler WarmupLinear, 100 warmup steps, weight decay 0.01, MNR loss, and epoch counts of 19 and 20; a 2e-3 value also appears among the flattened columns.]</p>
        <p>
use DistilBert Multilingual as a basis for fine-tuning their model. To remain consistent with the cited
research, we consider a value of k = 10 neighbors, but we also take the time to test other values of
k with respect to our individual systems. The models were evaluated based on two metrics: (1) the
existence ratio (ER), which is the proportion of correctly predicted positive pairs out of all pairs in the
test dataset, and (2) the MRR, i.e. the Mean Reciprocal Rank, a measure used to evaluate the
quality of the appearance ranks of correct query responses in information retrieval systems. For a
sample of queries Q, with rank_i the position of the correct searched address for a query q_i ∈ Q,
i = 1, ..., |Q|, the MRR formula can be defined as follows:
MRR = (1/|Q|) · Σ_{i=1}^{|Q|} 1/rank_i. (3)
The primary objective of the models is to achieve a maximum ER at the exact matching level (i.e. the
predicted address is exactly the address sought for the query). In addition, two types of ER are computed:
(1) the ER of the correct predictions in the first rank (top 1) and (2) the ER of the correct predictions
among the k address candidates (top k). We are also interested in the matching ER at the road level (i.e.
the predicted address is at least in the correct road of the searched address). This type of ER is all the
more important since, in practical cases, carriers will generally be able to successfully deliver parcels as
long as they are in the same road as the delivery address [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
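The two metrics can be sketched directly from the definitions above; `existence_ratio` and `mean_reciprocal_rank` are illustrative names, not the authors' evaluation code:

```python
def existence_ratio(predictions, gold, top):
    """ER: share of queries whose gold address appears among the
    first `top` candidates returned for that query."""
    hits = sum(1 for preds, g in zip(predictions, gold) if g in preds[:top])
    return hits / len(gold)

def mean_reciprocal_rank(predictions, gold):
    """MRR = (1/|Q|) * sum over queries of 1/rank_i, with rank_i the
    1-based position of the gold address; a miss contributes 0."""
    total = 0.0
    for preds, g in zip(predictions, gold):
        if g in preds:
            total += 1.0 / (preds.index(g) + 1)
    return total / len(gold)
```

Calling `existence_ratio` with `top=1` and `top=k` yields the "top 1" and "top k" ERs discussed above.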
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Comparison of individual NN models</title>
        <p>Our first intent was to compare individual kNN systems in order to identify the best performing model
in terms of top k ER at the exact search level (top k exact). The results illustrated in Fig. 4a show the
superiority of kNN models based on the different sentence representations, and this comes down to
the quality of the hyper-contextualized embeddings in comparison, for example, with word embeddings
like Word2Vec or FastText. We also note that models based on raw representations are generally
more efficient than Word2Vec and FastText. The reason is probably the enormous loss
of information in the static embeddings due to the mean pooling used to create the address vectors.
Increasing the value of k positively impacts the existence ratio across all the models because the larger
the list of neighbors, the greater the chance of more difficult addresses being retrieved. However, the
increase in the existence ratio varies between 5% for sentence embeddings, 11% for raw embeddings
and 26% for static embeddings for a k value between 5 and 120, as shown in Figure 4a. This can be
explained by the level of accuracy of the sentence embeddings, as the majority of positive pairs are
already identified within the first 5 candidate addresses. In contrast, the raw and static embedding
models require a very high k value of up to 120. In terms of MRR, the results in Figure 4b are consistent
with the existence ratios, as the best models should have the highest MRR at the lowest possible k
value. The fine-tuned sentence kNN embedding models retrieve the searched addresses at
the highest ranks compared to the other models. Furthermore, they remain stable as k increases, thus
demonstrating their strong retrieval ability even with the earliest candidates, thanks to their ability to
capture address context. This was expected as well, as the very purpose of sentence transformers is
to learn how to reduce the distance between vectors of positive address pairs, even if they are very
different syntactically, whereas models based on string similarity distances only perform well when the
addresses are relatively similar syntactically.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. kNN multi-embeddings models experiment results</title>
        <sec id="sec-6-2-1">
          <title>6.2.1. Multi-embeddings models instances</title>
          <p>We wanted to test the performance of the voting system using the set of individual kNN models while
having the flexibility to select different subsets to maximize the voting efficiency. Fig. 5 illustrates the
top k exact ER results of the chosen subsets that had the overall better performance. We observe
that the subset of sentence-only models generally performs better. If we further exclude from the latter
the kNN models based on Roberta from scratch (camembert + XLM), we see a small increase in ERs
at k values of 5 and 10. This increase is due to the original pre-training of Camembert and
XLM_Roberta. It shows the impact that language models (pre-trained on large language corpora) have
in terms of performance quality when used in other tasks such as the STS task. The voting system
with all models is the least efficient, and this can be explained by the large differences between the
neighbor lists returned by the sentence models and the other models. In other words, it is natural
that kNN retrievers with the most mistakes in predicting positive pairs impact the ability of the vote to
systematically propose a high number of repetitions for the sought-after addresses. This explanation
becomes even more coherent when we remove the static vector models from the vote (sentence + raw):
we notice a clear improvement in ERs. As for the MRR results, we observe in Fig. 6 that, in general,
voting systems based on sentence models succeed in recovering more positive normalized addresses at
the highest ranks.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>6.2.2. Discussion</title>
          <p>We compare ourselves with BI_DistilBert. We take into account the addresses found at the
road level and we also consider the top 1 results. We remain consistent regarding the value of k = 10.
Table 4 illustrates the best individual kNN models and voting systems in comparison with BI_DistilBert.
We find that BI_DistilBert performs better than the raw models. However, it remains below the ER
results of the individual kNN sentence models, and this is due to two main reasons. First, the base of
the model, which is multilingual DistilBert, was not pre-trained on a corpus of postal addresses before
fine-tuning its Bi-Encoder; we believe it is important that base language models learn the
structure of a postal address independently of the similarity task. Second, the additional difficulty that
our dataset brings. Indeed, our addresses are much more difficult in terms of the errors and noise likely
to occur. kNN-vote systems are better overall, supporting our intuition that aggregating results from
multiple sources significantly improves similarity search performance. We do, however, note exceptions
to the rule. Some individual kNN models such as (A) and (B) come before the (G) and (H) vote systems.
This decrease in ERs confirms that aggregation alone does not always guarantee better results and
that a high and heterogeneous number of models used in the voting process negatively impacts the
prediction quality. This is why the individual performance of the models used in the vote must also
be taken into account. More specifically, the vote will be more likely to have superior results if it uses,
as its aggregation sources, search models that are the least wrong in their predictions. (I) manages to
compete with the two best individual kNN sentence models but adds no improvement, in particular
in the top 1 exact. It is undoubtedly the participation of the Rsent models in the vote that prevents it from
standing out from the other search systems, since Rsent is significantly less efficient than Csent and
XLMsent. In conclusion, the best vote is the one that uses the Csent and XLMsent models, with an ER top
1 exact of 86.2% and an ER top 10 exact of 96%, thus demonstrating the ability of the voting system to
retrieve more positive address pairs in the top 1.</p>
          <p>Inference time: We measured the retrieval time for 100 address queries to compare the various
solutions, as shown in Table 4. Retrieval times for voting systems (between 51s and 173s) are notably
longer than those of individual kNN models. Despite being measured without optimization in an experimental
setup, we find these times acceptable for business applications.
</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this work we focused on the problem of matching postal addresses. We first showed that this task can
be simply formalised as an information retrieval problem where models such as kNN have been shown
to be efficient in both computation time and accuracy. For this purpose, we have assumed that an address
is a sentence described with a set of entities and, consequently, could contain erroneous or
noisy elements. However, the positions of the entities have an impact on address recognition. For these
reasons, we have proposed using different address representation spaces, such as the word embedding
space or the sentence embedding space with a pre-trained transformer. Each representation contributes
in part to the search for the closest address in that space. In order to aggregate the contribution of the
different spaces, we proposed a kNN ensemble model based on a voting system called kNN-vote. The
experimental results show that our system performs very well, achieving an accuracy of around 96% in
the top 10 and 86.2% in the top 1. This system shows its value for this type of task, even though the
voting algorithm is still very naive for the time being. In fact, the algorithm favours addresses with a
maximum number of repetitions and re-ranks them solely on the basis of the highest similarity score,
hence the impact of the number of voters on the number of appearances. In addition, the system's
focus on the highest score of an address, without taking into account the overall quality of the scores,
can lead to the dominance of a single score, even if other scores are more indicative. As a perspective,
we are improving the voting process in order to consider and reinforce the potential effectiveness of a
model with a lower but more significant score.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Thanks to the French ANRT (Association Nationale de la Recherche et de la Technologie) for funding this
project under the "Cifre convention for thesis funding" (https://www.anrt.asso.fr/) and to the developers of
TEDIES, TALK solutions, who assisted in this project (https://site.tedies.eu/).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guermazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sellami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Boucelma</surname>
          </string-name>
          ,
          <article-title>Georoberta: A transformer-based approach for semantic address matching</article-title>
          ,
          <source>in: HAL</source>
          ,
          <year>2023</year>
          . URL: https://hal.science/hal-04465164.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Comber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Arribas-Bel</surname>
          </string-name>
          ,
          <article-title>Machine learning innovations in address matching: A practical comparison of word2vec and crfs</article-title>
          ,
          <source>in: Transactions in GIS</source>
          , volume
          <volume>23</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>334</fpage>
          -
          <lpage>348</lpage>
          . URL: https://doi.org/10.1111/tgis.12522.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guermazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sellami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Boucelma</surname>
          </string-name>
          ,
          <article-title>Address validation in transportation and logistics: A machine learning based entity matching approach</article-title>
          , in: Communications in Computer and Information Science, volume
          <volume>1323</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>320</fpage>
          -
          <lpage>334</lpage>
          . URL: https://doi.org/10.1007/978-3-030-65965-3_21.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Du</surname>
          </string-name>
          , T. Liu,
          <article-title>A deep learning architecture for semantic address matching</article-title>
          , in:
          <source>International Journal of Geographical Information Science</source>
          , volume
          <volume>34</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>559</fpage>
          -
          <lpage>576</lpage>
          . URL: https://doi.org/10.1080/13658816.2019.1681431.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          , G. Chen,
          <article-title>Deep contrast learning approach for address semantic matching</article-title>
          ,
          <source>in: Applied Sciences</source>
          , volume
          <volume>11</volume>
          ,
          <year>2021</year>
          , p.
          <fpage>7608</fpage>
          . URL: https://doi.org/10.3390/app11167608.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Duarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <article-title>Improving address matching using siamese transformer networks</article-title>
          ,
          <source>in: Lecture Notes in Computer Science</source>
          , volume
          <volume>14116</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>425</lpage>
          . URL: https://doi.org/10.1007/978-3-031-49011-8_33.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Rakotondrasoa</surname>
          </string-name>
          , et al.,
          <article-title>Quantitative comparison of nearest neighbor search algorithms</article-title>
          , in: arXiv,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2307.05235.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gschwind</surname>
          </string-name>
          ,
          <article-title>Fast record linkage for company entities</article-title>
          , in: IEEE Conference Publication,
          <year>2020</year>
          . URL: https://ieeexplore.ieee.org/document/9006095.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mengjun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Qingyun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Mingjun</surname>
          </string-name>
          ,
          <article-title>A new method of chinese address extraction based on address tree model</article-title>
          ,
          <source>in: Acta Geodaetica et Cartographica Sinica</source>
          , volume
          <volume>44</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          ,
          <article-title>Binary codes capable of correcting deletions, insertions and reversals</article-title>
          ,
          <source>in: Soviet Phys. Doklady</source>
          , volume
          <volume>10</volume>
          ,
          <year>1966</year>
          , p.
          <fpage>707</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Damerau</surname>
          </string-name>
          ,
          <article-title>A technique for computer detection and correction of spelling errors</article-title>
          ,
          <source>in: Communications of the ACM</source>
          , volume
          <volume>7</volume>
          ,
          <year>1964</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jaccard</surname>
          </string-name>
          ,
          <article-title>Distribution de la flore alpine dans le bassin des dranses et dans quelques regions voisines</article-title>
          ,
          <source>in: Bulletin de La Société Vaudoise Des Sciences Naturelles</source>
          , volume
          <volume>37</volume>
          ,
          <year>1901</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>272</lpage>
          . URL: https://www.scirp.org.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaro</surname>
          </string-name>
          ,
          <article-title>Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida</article-title>
          , in
          <source>: Journal of the American Statistical Association</source>
          , volume
          <volume>84</volume>
          ,
          <year>1989</year>
          , pp.
          <fpage>414</fpage>
          -
          <lpage>414</lpage>
          . URL: https://doi.org/10.2307/2289924.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Winkler</surname>
          </string-name>
          ,
          <article-title>String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage</article-title>
          ,
          <source>in: ERIC</source>
          ,
          <year>1990</year>
          . URL: https://eric.ed.gov/?id=ED325505.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          , in: arXiv,
          <year>2013</year>
          . URL: https://arxiv.org/abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information</article-title>
          , in: arXiv, volume
          <volume>5</volume>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1607.04606.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>An introduction to conditional random fields</article-title>
          , in: arXiv,
          <year>2010</year>
          . URL: https://arxiv.org/abs/1011.4088.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: NeurIPS,
          <year>2017</year>
          . URL: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , in: arXiv,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Deep entity matching with pre-trained language models</article-title>
          ,
          <source>in: Proceedings of the VLDB Endowment</source>
          , volume
          <volume>14</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>60</lpage>
          . URL: https://doi.org/10.14778/3421424.3421431.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>U.</given-names>
            <surname>Brunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stockinger</surname>
          </string-name>
          ,
          <article-title>Entity matching with transformer architectures - a step forward in data integration</article-title>
          ,
          <source>in: EDBT</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5441/002/edbt.2020.58.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          . URL: https://doi.org/10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          , in: arXiv,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with GPUs</article-title>
          ,
          <source>arXiv preprint arXiv:1702.08734</source>
          (
          <year>2017</year>
          ). URL: https://doi.org/10.48550/arXiv.1702.08734.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Bentley</surname>
          </string-name>
          ,
          <article-title>Multidimensional binary search trees used for associative searching</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>18</volume>
          (
          <year>1975</year>
          )
          <fpage>509</fpage>
          -
          <lpage>517</lpage>
          . URL: https://doi.org/10.1145/361002.361007.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Omohundro</surname>
          </string-name>
          ,
          <article-title>Five Balltree Construction Algorithms</article-title>
          ,
          <source>Technical Report</source>
          , International Computer Science Institute, Berkeley, CA,
          <year>1989</year>
          . URL: https://www.icsi.berkeley.edu/pubs/techreports/TR-89-063.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dupont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>de la Clergerie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Seddah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>CamemBERT: a tasty French language model</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7203</fpage>
          -
          <lpage>7219</lpage>
          . URL: https://doi.org/10.18653/v1/2020.acl-main.645.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , in: arXiv,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1911.02116.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <article-title>Learning a similarity metric discriminatively, with application to face verification</article-title>
          ,
          <source>in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)</source>
          ,
          <year>2005</year>
          . URL: https://doi.org/10.1109/CVPR.2005.202.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lukacs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Miklos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          ,
          <article-title>Efficient natural language response suggestion for smart reply</article-title>
          , in: arXiv,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1705.00652.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chernyavskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ilvovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kalinin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Batch-softmax contrastive loss for pairwise sentence scoring tasks</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>126</lpage>
          . doi: 10.18653/v1/2022.naacl-main.9.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>