<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Contextual 𝑘NN Ensemble Retrieval Approach for Semantic Postal Address Matching</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">El</forename><forename type="middle">Moundir</forename><surname>Faraoun</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">LIASD Paris 8 University</orgName>
								<address>
									<addrLine>2 rue de la liberté</addrLine>
									<settlement>Saint-Denis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">TEDIES</orgName>
								<orgName type="institution">TALK solutions</orgName>
								<address>
									<addrLine>45 Av. de Paris</addrLine>
									<settlement>Monéteau</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nédra</forename><surname>Mellouli</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">LIASD Paris 8 University</orgName>
								<address>
									<addrLine>2 rue de la liberté</addrLine>
									<settlement>Saint-Denis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="laboratory" key="lab1">ESILV DVRC</orgName>
								<orgName type="laboratory" key="lab2">Léonard de Vinci group</orgName>
								<address>
									<addrLine>12 Av. Léonard de Vinci</addrLine>
									<settlement>Paris La défense</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stéphane</forename><surname>Millot</surname></persName>
							<affiliation key="aff1">
								<orgName type="laboratory">TEDIES</orgName>
								<orgName type="institution">TALK solutions</orgName>
								<address>
									<addrLine>45 Av. de Paris</addrLine>
									<settlement>Monéteau</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Myriam</forename><surname>Lamolle</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">LIASD Paris 8 University</orgName>
								<address>
									<addrLine>2 rue de la liberté</addrLine>
									<settlement>Saint-Denis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Contextual 𝑘NN Ensemble Retrieval Approach for Semantic Postal Address Matching</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3EFD204124CD1A9845934DEAE60FFE38</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:23+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Address matching or transport entity alignment</term>
					<term>Recipients/consignees identification or pairing</term>
					<term>Recovery of recipients</term>
					<term>Address retrieval</term>
					<term>Ensemble 𝑘NN retrieval models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The biggest challenge today regarding courier services (delivery of small to medium-sized parcels) is the problem of Address Matching. With the expansion of geographical data and the diversity of formats in which it is received, traditional matching methods are becoming increasingly obsolete due to the lack of conformity of delivery information with postal address writing standards. These new constraints affect parcel delivery quality in terms of deliverables, cost and environmental impact. This research focuses on courier delivery data (i.e. recipients' postal addresses) in the context of matching French postal addresses. We introduce a new ensemble retrieval approach to the problem through a voting system leveraging multiple k-Nearest Neighbors search algorithms, called 𝑘NN-vote, which effectively transforms the Address Matching task into an Address Retrieval task. 𝑘NN-vote returns the top normalized addresses most similar to a given query (a non-normalized delivery address). The system takes advantage of several address representations, in particular pre-trained Transformer-based sentence embeddings. It has been tested on a real database of French delivery addresses. The method meets high expectations, returning exactly matched addresses with a success rate of up to 96% in the top 10 and 86% in the top 1.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The transport Entity Alignment problem, also known as the postal Address Matching (AM) problem, is inherently an NLP task, given that a postal address is mainly structured as a short sentence with a specific arrangement of Named Entities (i.e. attributes or features such as Road Name or Door Number), which places it within the scope of Entity Matching (EM). The task involves effectively processing and comparing the structural components of a pair of addresses (𝑎, 𝑏) for accurate matching (i.e. determining whether 𝑎 and 𝑏 refer to the same real-world object).</p><p>Carriers identify delivery addresses received via EDI (Electronic Data Interchange) by matching them with recipient addresses already registered in their database. Nothing could be simpler at first glance, except that delivery addresses are increasingly received in non-normalized forms. The addresses received are often incorrect and/or noisy, so distinguishing a valid address from an invalid one becomes very challenging. The anomalies present in a delivery address can be: (1) writing errors, including typographic ones, spelling mistakes, repetitions, or the absence of specific address features; (2) address noise, which may involve personal information such as names, phone numbers, or requests for appointments; (3) semantic or contextual errors, which include the presence of features from unrelated addresses, feature replacements (e.g. "avenue" instead of "street"), feature aliases such as abbreviations or acronyms, polysemous features, and addresses represented by their semantic synonyms, typically named zones or parks.</p><p>IAL@ECML-PKDD'24: 8th Intl. Worksh. &amp; Tutorial on Interactive Adaptive Learning, Sep. 9th, 2024, Vilnius, Lithuania el.moundir.faraoun@gmail.com (E. M. Faraoun); n.mellouli@iut.univ-paris8.fr (N. 
Mellouli); Stephane.millot@edies.fr (S. Millot); m.lamolle@iut.univ-paris8.fr (M. Lamolle) Take, for example, the following real delivery address received by a French carrier: "avenue du g n ral leclerc centre commercial auchan 89200 avallon". Here the correct Road Type is "rue" instead of "avenue", and the typographic error "g n ral" is intended as "general". The address also lacks a Door Number. Finally, we note that "centre commercial auchan" is a semantic synonym for the address. Such anomalies distort the structure of an address and prevent it from being paired with a valid address record.</p><p>The AM problem is traditionally solved with a binary "Match/No Match" classification of address pairs <ref type="bibr" target="#b0">[1]</ref> mainly relying on neural network-based methods <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b5">6</ref>]; yet, the task is typically framed as matching address records between two tables or deduplicating records within a single table. In the delivery context, however, this matching amounts to searching for information similar to a given request (the received address). Thus, we are dealing with an unsupervised Information Retrieval (IR) problem in which each new address is treated as a query, which may be valid or incorrectly formatted, and for which we try to find valid "candidate" addresses in the database. This formalization is highly relevant since it allows the candidates retrieved for a delivery address to be ranked by contextual similarity. 
Furthermore, the number of "candidate" addresses is relatively small, which reduces computation time compared to aligning all reference address records.</p><p>Our objective in this research is to take advantage of the various possible representations of addresses, in particular Transformer-based sentence embeddings, in the context of Information Retrieval. We propose an ensemble multi-embedding-model approach based on the 𝑘-Nearest Neighbors algorithm (𝑘NN) <ref type="bibr" target="#b6">[7]</ref>, with a voting process between multiple 𝑘NN search models.</p><p>The remainder of this paper is organized into 7 sections. Section 2 reviews the work carried out in relation to address matching. Section 3 formalizes the Address Retrieval problem. We describe our approach in Section 4 and present its experimental settings in Section 5. Results are detailed and discussed in Section 6. We conclude this work by considering its limitations and prospects for improvement in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work and State-of-the-Art</head><p>The existing solutions for Address Matching fall into two approaches. The first is based on string similarity measures or matching rules <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. However, these methods rely mainly on structural comparisons between addresses, and they quickly become obsolete when faced with addresses that are written differently but retain the same semantic meaning <ref type="bibr" target="#b3">[4]</ref>. In fact, textual similarity distances such as Levenshtein and others <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref> are used for address matching. These distances depend on the choice of a similarity threshold, which is generally high. Such a score remains very approximate and rules out matching pairs that fall below the chosen threshold. Other methods are based on decision tree matching rules <ref type="bibr" target="#b8">[9]</ref>. These methods improve matching performance but require systematic calibration of the rules by experts, due to the diversity of address writing models.</p><p>A second approach, based on machine learning (ML) or deep learning (DL) architectures, aims to learn the semantic similarity between addresses <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>. These methods mainly rely on vector representations of address elements such as Word2Vec <ref type="bibr" target="#b14">[15]</ref> or FastText <ref type="bibr" target="#b15">[16]</ref>, used as input to ML models (e.g. Random Forest, XGBoost) or DL inference models (e.g. 
ESIM <ref type="bibr" target="#b3">[4]</ref>, ABLC <ref type="bibr" target="#b4">[5]</ref>) for classification. However, obtaining those word embeddings requires a parsing step, i.e. the process of segmenting addresses into their essential features or elements (e.g. Road Number, Road Name or Postal Code). Various parsing techniques have been used for this task: the aforementioned studies used, respectively, CRFs <ref type="bibr" target="#b16">[17]</ref>, heuristic rules <ref type="bibr" target="#b2">[3]</ref>, the Jieba<ref type="foot" target="#foot_0">2</ref> algorithm and the Trie syntax tree algorithm. Nonetheless, these methods often fall short in properly parsing noisy, erroneous addresses. Furthermore, the lack of context between words in an address, due to the static nature of word embeddings, suggests that these methods may fail to match certain ambiguous addresses, such as synonymous or polysemous ones <ref type="bibr" target="#b0">[1]</ref>, and addresses that are too distorted by noise and errors.</p><p>Recently, the advent of pre-trained transformer encoders <ref type="bibr" target="#b17">[18]</ref>, like Roberta <ref type="bibr" target="#b18">[19]</ref>, has transformed various tasks by introducing hyper-contextualized word embeddings <ref type="bibr" target="#b0">[1]</ref>. This breakthrough has enabled state-of-the-art performance through fine-tuning these encoders for specific tasks, particularly in Entity Matching <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>. In the context of Address Matching, a model named GeoRoberta <ref type="bibr" target="#b0">[1]</ref> has been proposed. It generates geographical knowledge for addresses by fine-tuning a Roberta encoder for the task of address feature tag detection. 
It also obtains a textual encoding of the GoogleMaps API<ref type="foot" target="#foot_1">3</ref> geographical coordinates of addresses based on Geohash<ref type="foot" target="#foot_2">4</ref>. It is worth noting that GeoRoberta is based on a pre-trained Roberta encoder as well. It generates augmented contextualized embeddings for an address pair by combining, at input, elements of both addresses and their Geohash encodings. The output embeddings are afterwards fused with a second augmented pair of addresses, combining the feature tag embeddings and their Geohash tag embeddings. This final fused representation is fed into a matching classification layer for the address matching task. The approach integrates textual and geographical data, leveraging the power of pre-trained transformers, which allows polysemous and synonymous addresses to be matched more efficiently. However, the generation of Geohash coordinates is based on Google geocoding, which is likely to be wrong for certain ambiguous or excessively erroneous addresses.</p><p>We argue that the use of sentence embeddings to represent addresses in the context of similar information retrieval is much better suited in terms of representation quality <ref type="bibr" target="#b5">[6]</ref>. This type of representation relies on training Transformer-based Bi-Encoders <ref type="bibr" target="#b21">[22]</ref> for the Semantic Textual Similarity (STS) task. It succeeds in reducing the distance between two addresses in a latent space even when they are expressed differently. Moreover, it solves the problem of synonymous addresses and allows the resolution of Address Matching through Information Retrieval algorithms <ref type="bibr" target="#b6">[7]</ref>. 
Such a solution was introduced in <ref type="bibr" target="#b5">[6]</ref> by fine-tuning a DistilBert <ref type="bibr" target="#b22">[23]</ref> Bi-Encoder on address pairs and using it to retrieve the top "candidate" addresses for a query, after which a Cross-Encoder fine-tuned for address pair classification is used as a re-ranker of the top candidates. To take this idea further, we propose several types of representations, both vector (sentence and word embeddings) and raw (the textual address content). These give rise to several lists of 𝑘 normalized candidate addresses via the ensemble 𝑘NN algorithms, which we finally re-rank through a vote based on the maximum number of appearances of a given candidate (i.e. term frequency) among the ensemble 𝑘NN models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Address Retrieval formalization</head><p>In this section, we introduce the address structure and define its schema, allowing us to formalize the Address Retrieval (AR) problem. We focus on French reference addresses and consider only French address features. Therefore, the correct structure of any address is the one that follows the official representation model<ref type="foot" target="#foot_3">5</ref> of French postal addresses, namely any address that contains the basic features of that model, which make it possible to precisely identify the geographical point of the recipient. The features of a correct French address are described in Fig. <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Address model</head><p>Address structure definition: Let 𝑉 be a vocabulary set, which includes all permissible instances of the possible features of a given address. For example, "avenue" might be an instance of the feature RoadType. We define 𝐷 𝑛𝑜𝑟𝑚 as the set of all correctly structured, normalized address sentences. A normalized address, in this case, follows the official model of a French address.</p><p>Within 𝐷 𝑛𝑜𝑟𝑚 , there exists a reference set 𝐷 𝑟𝑒𝑓 such that, ∀ 𝑎 ∈ 𝐷 𝑟𝑒𝑓 , 𝑎 is both normalized and corresponds to an actual real-world location. Thus 𝐷 𝑟𝑒𝑓 is the set of all normalized valid addresses with a real geographical point. </p><formula xml:id="formula_0">• 𝑥 1 is an instance of DoorNumber, • 𝑥 𝑛 is an instance of CityName,</formula><p>• ⪯ is a partial order relation defined on 𝑉 , denoted (𝑉, ⪯), such that for any</p><formula xml:id="formula_1">1 ≤ 𝑖 &lt; 𝑗 ≤ 𝑛, ∃ (𝑥 𝑖 , 𝑥 𝑗 ) ∈ 𝑉 × 𝑉 , and 𝑥 𝑖 ⪯ 𝑥 𝑗 , • 𝑓 ⪯ (𝑥 1 , ..., 𝑥 𝑛 ) ↦ → 𝑎 ∈ 𝐷 𝑛𝑜𝑟𝑚 such that 𝑎 = 𝑥 1 𝑥 2 ...𝑥 𝑛 .</formula><p>Within this formalization framework, 𝑓 ⪯ (•) can be viewed as a grammar allowing us to generate address sentences that are syntactically and semantically correct. Moreover, if 𝑎 ∈ 𝐷 𝑟𝑒𝑓 , then 𝑎 is a normalized address with a real-world location.</p><p>This definition allows us to consider any address that follows the address model ℳ as normalized. That being said, an address can be normalized yet nonexistent. The following examples of French addresses illustrate this point:</p><p>• (i) "16 avenue jean jaures 89000 auxerre" is a normalized, existing address. • (ii) "16 rue jean jaures 89300 joigny" is a normalized but nonexistent address.</p><p>Although the second address is technically correct in its structure, a simple anomaly, such as the replacement of the RoadType, PostalCode and CityName instances, means it does not correspond to a real location. 
In such cases, the ensemble multi-embedding 𝑘NN models are valuable since they take the address's semantic context into account.</p></div>
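To make the grammar 𝑓⪯ concrete, here is a minimal Python sketch. The feature order used is a simplified, hypothetical subset (DoorNumber, RoadType, RoadName, PostalCode, CityName); the official French model defines the full feature list and its ordering.

```python
# Hypothetical, simplified feature order for the address model M.
# The official French specification defines more features and their order.
FEATURE_ORDER = ["DoorNumber", "RoadType", "RoadName", "PostalCode", "CityName"]

def f_order(features):
    """Grammar f_<=: concatenate feature instances following the order
    on V, yielding a normalized address sentence a = x1 x2 ... xn."""
    return " ".join(features[f] for f in FEATURE_ORDER if f in features)

# Example (i) from the text, as a feature dictionary:
f_order({"DoorNumber": "16", "RoadType": "avenue", "RoadName": "jean jaures",
         "PostalCode": "89000", "CityName": "auxerre"})
# -> "16 avenue jean jaures 89000 auxerre"
```

Note that, as in the text, a sentence generated this way is normalized by construction but need not exist in 𝐷 𝑟𝑒𝑓.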
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Address retrieval</head><p>Now that the address structure formalism is defined through the lexicographic order relation on the feature instances of an address, we assume in the remainder of this work that an address is simply a structured sentence with a particular context (a.k.a. an address sentence). We define the problem of Address Retrieval as a problem of semantic search over textual documents.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Address embedding definition:</head><p>Let 𝑎 be an address sentence. Given a textual encoder 𝐸, an address representation is defined as the output of 𝐸 with 𝑎 as input. We define 𝐸 0 as id(•) (i.e. the identity function) and, therefore, an address representation can be:</p><p>• raw (i.e. the textual content of the address itself) through 𝐸 0 • a vector embedding through a neural encoder 𝐸.</p><p>In the rest of the paper, for the sake of simplicity, we refer to the ensemble of raw and vector model embeddings as multi-embeddings models.</p><p>Contextual 𝑘NN Address Retrieval task: We want to obtain, for a given address query 𝑞 and through an encoder 𝐸, a query representation 𝑒 𝑞 ∈ 𝒳 𝐸 for 𝑘NN retrieval. The neighborhood of 𝑒 𝑞 is then constructed by fetching its 𝑘 nearest neighbors from a set of reference address sentence representations 𝒳 𝐷 𝑟𝑒𝑓 ⊂ 𝒳 𝐸 according to a distance function 𝑑 : 𝒳 𝐷 𝑟𝑒𝑓 × 𝒳 𝐷 𝑟𝑒𝑓 → ℝ. More formally, the 𝑘 nearest neighbors of 𝑒 𝑞 can be obtained by:</p><formula xml:id="formula_2">𝒦 := {𝑖 1 , 𝑖 2 , . . . , 𝑖 𝑘 | 𝑑(𝑒 𝑞 , 𝑒 𝑖 𝑗 ) are the 𝑘 smallest distances, 𝑖 𝑗 ∈ [|𝒳 𝐷 𝑟𝑒𝑓 |]}<label>(1)</label></formula><p>where 𝒦 denotes the set of indices in [|𝒳 𝐷 𝑟𝑒𝑓 |] = {1, ..., |𝒳 𝐷 𝑟𝑒𝑓 |} pointing to the 𝑘 neighbors with the smallest distances (closest to 0).</p><p>Although the distance 𝑑 depends on the encoder chosen for an address representation, our 𝑘NN retrieval model remains generic. For example, if 𝐸 is a Transformer-based Bi-Encoder model, then the distance 𝑑 would be a 𝑐𝑜𝑠𝑖𝑛𝑒-like distance. Roughly speaking, our 𝑘NN model has three parameters: 𝑘, the representation 𝐸 and the distance 𝑑 <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26]</ref>.</p></div>
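Eq. (1) can be sketched in a few lines of Python. The brute-force search below, with a cosine distance standing in for 𝑑, is illustrative only; a production system would typically use an approximate nearest-neighbor index over 𝒳 𝐷 𝑟𝑒𝑓.

```python
import math

def cosine_distance(u, v):
    """A cosine-like distance, as used when E is a Bi-Encoder."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def knn_retrieve(e_q, X_ref, k, d=cosine_distance):
    """Return the indices of the k reference representations in X_ref
    with the smallest distance to the query representation e_q (Eq. 1)."""
    dists = [(d(e_q, e_i), i) for i, e_i in enumerate(X_ref)]
    dists.sort()  # smallest distance first
    return [i for _, i in dists[:k]]
```

For instance, with `X_ref = [[1, 0], [0, 1], [0.9, 0.1]]` and query `[1, 0]`, the two nearest neighbors are indices 0 and 2.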
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Our Approach</head><p>Ensemble voting over multi-embedding 𝑘NN models is a robust technique that exploits the strengths of different embedding methods to improve prediction accuracy. By generating multiple embeddings for the same data and combining the predictions of multiple 𝑘NN models through voting, we can achieve better performance and more reliable results. This approach is particularly useful for our task, in which different embeddings capture different aspects of the addresses. To perform the task of correct address retrieval, we follow these steps: (1) data pre-processing and deduplication for both delivery and reference addresses, (2) offline fine-tuning of different Bi-Encoders on the STS task in order to construct multiple retrieval sets of normalized address embeddings, (3) 𝑘NN retrieval model construction (see Fig. <ref type="figure" target="#fig_1">2</ref>) and (<ref type="formula">4</ref>) online aggregation of the different search results through the design of a vote schema (see Fig. <ref type="figure" target="#fig_2">3</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Data pre-processing</head><p>Before fine-tuning the Bi-Encoders, it was necessary to go through two pre-processing steps followed by a deduplication step:</p><p>• The first step is the cleaning of both delivery and reference addresses; it involves removing accents and punctuation that might be present in the data. • The second step concerns the removal of interfering elements. This step is only applied to the delivery addresses, given that all reference addresses are supposed to be correct and normalized. It removes a set of unnecessary symbols that can be found in non-normalized addresses (e.g. '+', '*', '&amp;', . . . etc.). • The third step is the deduplication of delivery address records. By removing exact duplicates, we ensured that our fine-tuning process was efficient and not biased by redundant data points.</p><p>The final step is dataset creation for the Bi-Encoder fine-tuning. This step includes another cleaning process, explained in detail in Section 5.1.</p></div>
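The three steps above can be sketched as follows. The punctuation list and the symbol set `NOISE` are illustrative assumptions, not the exact configuration used in our pipeline.

```python
import re
import unicodedata

# Hypothetical set of interfering symbols (step 2); the real list is larger.
NOISE = set("+*&#@|_~^")

def clean(address):
    """Step 1: lowercase, strip accents and punctuation, collapse spaces."""
    s = unicodedata.normalize("NFKD", address.lower())
    s = "".join(c for c in s if not unicodedata.combining(c))  # drop accents
    s = re.sub(r"[.,;:!?'\"()\-/]", " ", s)                    # drop punctuation
    return re.sub(r"\s+", " ", s).strip()

def strip_noise(address):
    """Step 2 (delivery addresses only): remove interfering symbols."""
    tokens = []
    for tok in address.split():
        t = "".join(c for c in tok if c not in NOISE)
        if t:
            tokens.append(t)
    return " ".join(tokens)

def deduplicate(addresses):
    """Step 3: remove exact duplicate records, preserving order."""
    return list(dict.fromkeys(addresses))
```

For example, `clean("16, Avenue Jean-Jaurès")` yields `"16 avenue jean jaures"`.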
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Offline fine-tuning of Bi-Encoders</head><p>Here, we have as input a set of (delivery, reference) address pairs. The aim of this step is to fine-tune multiple Bi-Encoders to generate the address sentence vector embeddings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Bi-Encoder</head><p>Bi-Encoders are Siamese Transformer networks, generally fine-tuned on Semantic Textual Similarity tasks for the purpose of generating meaningful sentence embeddings. Typically, a pre-trained transformer model is first chosen as the training base of the Bi-Encoder. We use two types of pre-trained models:</p><p>• "Camembert-base" <ref type="bibr" target="#b26">[27]</ref>, a model specific to the French language, • "XLM-Roberta-base" <ref type="bibr" target="#b27">[28]</ref>, a multilingual model, both of which we adapted to a large corpus of French postal addresses by continuing their training on the Masked Language Modeling (MLM) task. We also used the MLM objective to train a third, small Roberta-based model <ref type="bibr" target="#b18">[19]</ref> from scratch on the same corpus.</p><p>Given an address sentence pair (𝑎, 𝑏), a forward pass of the transformer over each tokenized address generates token embeddings for both 𝑎 and 𝑏. Mean pooling is then applied to each address's token representations, resulting in two fixed-length vectors which serve as our address sentence embeddings. For a specific STS task, the best semantic address matching performance is found through the optimization of an objective function such as the "contrastive loss" <ref type="bibr" target="#b28">[29]</ref>, which is used mainly in neural networks for classification and matching tasks, such as similarity learning. It is often used in Siamese networks to train models to learn similar representations for pairs of similar samples and dissimilar representations for pairs of dissimilar samples. Readers interested in exploring the Bi-Encoder architecture can refer to <ref type="bibr" target="#b21">[22]</ref>.</p><p>In our case, our sentences are postal addresses that are no more than a few words long. 
In addition, all addresses share more or less the same repeating vocabulary, such as road types or city names. All this reduces the diversity of context between dissimilar addresses. This constraint led us to believe that a basic objective function would not create a sufficient gap, in terms of distance, between dissimilar addresses. To overcome this, we decided to use the "Multiple Negatives Ranking Loss" (MNRL) objective function <ref type="bibr" target="#b29">[30]</ref>, which is often used in ranking and information retrieval tasks and is therefore better suited to our similarity search task. This approach is supported by findings in <ref type="bibr" target="#b30">[31]</ref>, which highlight that including multiple negatives in each batch enhances the model's ability to distinguish between dissimilar examples without the need to specifically design hard negative pairs. Finding truly effective negative examples can be challenging and significantly impacts performance, making MNRL's ability to utilize multiple negatives in a straightforward manner highly advantageous, leading to better performance and more robust embeddings.</p><p>Multiple Negatives Ranking Loss definition: Given 𝑁 address sentence embedding pairs [(𝑒 𝑎 1 , 𝑒 𝑏 1 ), ..., (𝑒 𝑎 𝑁 , 𝑒 𝑏 𝑁 )] between query and reference address sentences (𝑎 1 , ..., 𝑎 𝑁 ) and (𝑏 1 , ..., 𝑏 𝑁 ), where each (𝑎 𝑖 , 𝑏 𝑖 ) is labeled as similar and each (𝑎 𝑖 , 𝑏 𝑗 ) with 𝑖 ̸ = 𝑗 is labeled as not similar, the loss function is as follows:</p><formula xml:id="formula_3">− 1 𝑁 𝑁 ∑︁ 𝑖=1 ⎡ ⎣ 𝑆(𝑒 𝑎 𝑖 , 𝑒 𝑏 𝑖 ) − 𝑙𝑜𝑔 𝑁 ∑︁ 𝑗=1 𝑒 𝑆(𝑒𝑎 𝑖 ,𝑒 𝑏 𝑗 ) ⎤ ⎦<label>(2)</label></formula><p>This function allows the model to consider, in a given batch of positive address pairs, for one sample (𝑎 𝑖 , 𝑏 𝑖 ), the 𝑁 − 1 negative pairs (𝑎 𝑖 , 𝑏 𝑗 ) formed using all the normalized reference addresses 𝑏 𝑗 from the other positive pairs. 
This strategy helps the model widen the distance between negative examples (𝑎 𝑖 , 𝑏 𝑗 ), where 𝑆 is the score function (generally 𝑆(𝑒 𝑎 𝑖 , 𝑒 𝑏 𝑖 ) = 𝑐𝑜𝑠𝑖𝑛𝑒(𝑒 𝑎 𝑖 , 𝑒 𝑏 𝑖 )). This loss function helps reduce the impact of the limited context in addresses.</p></div>
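As a sanity check of Eq. (2), here is a plain-Python sketch of the loss for a batch of embedding pairs, with cosine as the score function 𝑆. In practice the loss is computed by the training framework over the model's output tensors; this version only illustrates the arithmetic.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl(pairs, S=cosine):
    """Multiple Negatives Ranking Loss (Eq. 2) over a batch of positive
    pairs (e_a_i, e_b_i): every other b_j in the batch serves as one of
    the N-1 in-batch negatives for a_i."""
    N = len(pairs)
    total = 0.0
    for e_a, e_b in pairs:
        pos = S(e_a, e_b)                                          # S(e_a_i, e_b_i)
        lse = math.log(sum(math.exp(S(e_a, b_j)) for _, b_j in pairs))
        total += pos - lse
    return -total / N
```

For a batch of two orthogonal, perfectly aligned pairs, the loss evaluates to log(e + 1) − 1 ≈ 0.313, and it shrinks as positives score higher than in-batch negatives.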
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Retrieval set creation</head><p>Given a dataset of normalized reference address sentences 𝐷 𝑟𝑒𝑓 and a fine-tuned Bi-Encoder 𝐸, we can generate a retrieval sentence embedding set 𝒳 𝐷 𝑟𝑒𝑓 through a forward pass over all address instances of 𝐷 𝑟𝑒𝑓 . This embedding set is later used at inference time for the retrieval of a given query's nearest neighbors.</p></div>
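A sketch of this step; `toy_encode` is a hypothetical hashing stand-in, not the fine-tuned Bi-Encoder, and only illustrates the one-pass construction of 𝒳 𝐷 𝑟𝑒𝑓.

```python
import hashlib

def toy_encode(address, dim=16):
    """Stand-in for the fine-tuned Bi-Encoder E: a hashed bag-of-words
    vector. Illustrative only; real embeddings come from the trained model."""
    v = [0.0] * dim
    for tok in address.split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    return v

def build_retrieval_set(D_ref, encode=toy_encode):
    # One forward pass per normalized reference address -> X_{D_ref}.
    return [encode(a) for a in D_ref]
```

At inference time, the list returned here plays the role of the pre-registered retrieval set searched by each 𝑘NN model.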
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">𝑘NN retrieval models</head><p>𝑘NN-vote is an Ensemble Information Retrieval system based on the search results of multiple 𝑘NN models, all similar in their operation but very different in the representations of the searched addresses on which they are based. In general, an individual 𝑘NN search model is a 𝑘-Nearest Neighbors algorithm which takes as a parameter a distance 𝑑 specific to the type of representation of the searched points (e.g. a Levenshtein or Jaccard distance for a raw textual representation). The algorithm computes all the distances between a query and the search points previously pre-registered in the retrieval reference set 𝒳 𝐷 𝑟𝑒𝑓 (𝒳 𝐷 𝑟𝑒𝑓 = 𝐷 𝑟𝑒𝑓 for the raw textual representation) and returns the list of the 𝑘 most similar points, those having the smallest distance to the query. Table <ref type="table" target="#tab_0">1</ref> shows the different combinations of (encodings, similarities) that can be used in a 𝑘NN search model (𝑘NN Retriever) within the voting system. The table illustrates the possible types of address representation previously mentioned in Section 3.2, that is: (1) the raw textual representation, through which we have different 𝑘NN search models, each with a well-defined type of string distance (see Table <ref type="table" target="#tab_0">1</ref>); and (2) the vector representation, divided into two types:</p><p>• traditional embeddings built by mean pooling the static word embeddings of address elements, such as Word2Vec, • contextual sentence embeddings of postal addresses, fine-tuned for textual similarity. Without any a priori hypotheses about the origin of the errors, we have carried out an empirical search for the best address representation spaces with the appropriate similarity measures. 
We simply applied the various representations and similarity measures in the literature and compared eleven string similarity measures for the raw representations and four vector similarity measures for the static and dynamic embedding representations (see Table <ref type="table" target="#tab_0">1</ref>). Some of the string similarity measures, such as "Ratio" or "Token_set_ratio", are taken from the fuzzywuzzy library<ref type="foot" target="#foot_4">6</ref> as they enable more robust and flexible comparisons by incorporating tokenization and sorting mechanisms. Unlike traditional metrics like Levenshtein and Jaro, which focus solely on character-level edits, fuzzywuzzy's methods account for word order and partial matches, making them more suitable for real-world text data. Together with the chosen sentence embedding models (see Section 4.2), this gives a total of 31 𝑘NN models. The advantage here is to obtain a maximum of individual candidate lists of retrieved addresses in order to, firstly, compare the performance of each 𝑘NN Retriever model and, secondly, use them to identify the candidates common to the lists as the most similar candidates. Fig. <ref type="figure" target="#fig_1">2</ref> shows the architecture of a single 𝑘NN Retriever. We finally define the similarity search process as follows: (1) we convert a query 𝑞 into the desired representation type to obtain 𝑒 𝑞 ; (2) 𝑒 𝑞 is then passed into the 𝑘NN Retriever, which computes the distances between 𝑒 𝑞 and all the representations in the retrieval set in order to return the 𝑘 address indices most similar to 𝑞, ranked by smallest distance. </p></div>
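A raw-representation 𝑘NN Retriever can be sketched as follows. Here difflib's `SequenceMatcher.ratio` is used purely as a stand-in for a fuzzywuzzy measure such as "Ratio"; any of the string similarities in Table 1 could be plugged in as the distance 𝑑.

```python
import difflib

def ratio_distance(a, b):
    """String distance derived from a similarity ratio in [0, 1];
    difflib's ratio stands in here for fuzzywuzzy's `ratio`."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def raw_knn_retriever(query, D_ref, k=3, d=ratio_distance):
    """Raw-representation kNN Retriever: X_{D_ref} = D_ref, and the k
    reference addresses with the smallest string distance are returned
    as (index, similarity) pairs, most similar first."""
    scored = sorted((d(query, a), i) for i, a in enumerate(D_ref))
    return [(i, 1.0 - dist) for dist, i in scored[:k]]
```

For example, querying `"16 av jean jaures 89000 auxerre"` against a reference list containing `"16 avenue jean jaures 89000 auxerre"` ranks that normalized address first despite the abbreviated road type.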
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Ensemble voting retrieval system</head><p>The system is termed "multi-embeddings models" because of its dual approach, leveraging both raw address representations and advanced deep learning (DL) vector text representations for address matching. The core functionality of the system is to return a final list of 𝑚 similar candidate addresses through a voting process. Across the 𝑘NN ensemble models, the vote is based on the maximum number of occurrences of a candidate address for a query. This system relies on two key values: (1) the number of repetitions of each candidate address 𝑖 across the different 𝑘-lists; (2) the similarity scores of the pair (𝑞, 𝑖) from the different 𝑘NN models in which 𝑖 appeared.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.1.">Retrieval Flow</head><p>1. Candidate address lists retrieval: The system begins by retrieving the 𝑘-lists of candidate addresses using the ensemble 𝑘NN retrieval pipeline. Each model in the ensemble provides a list of address indices for a given query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Voting process:</head><p>• Repetition counting: The first step of the voting process counts the number of repetitions of each candidate address 𝑖 across the different 𝑘-lists. • Grouping and sorting: Candidate indices are then grouped by their repetition counts.</p><p>This creates "bags" of indices, where each bag contains one or more indices pointing to the associated addresses. The bags are then sorted by repetition count. • In-bag max pooling of similarity scores: Within each bag, the system collects, for each address, the similarity scores from the different 𝑘NN models in which it appeared. Max pooling is applied to these scores to obtain the maximum similarity score of each address in the bag. • In-bag ranking: The addresses within each bag are then sorted by their maximum similarity scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Final address list retrieval:</head><p>• Final output: All the bags are concatenated, resulting in a sorted list of addresses in which the top candidate has been repeated the most times and has the highest similarity score. • Cut-off value choice: The system sets the value of 𝑚 (the number of neighbors to return) and computes performance metrics to evaluate the effectiveness of the address matching process. The value of 𝑚 is not necessarily equal to 𝑘, since the voting process ranks all the candidates of the combined 𝑘-lists, which naturally produces more than 𝑘 candidates depending on how heterogeneous the 𝑘-lists are. </p></div>
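The voting flow above can be sketched in a few lines. This is our own minimal rendition, not the authors' code; the interface is an assumption, with `scores[j]` mapping each candidate index of model j's 𝑘-list to that model's similarity score:

```python
from collections import defaultdict
from typing import Dict, List

def knn_vote(k_lists: List[List[int]],
             scores: List[Dict[int, float]],
             m: int) -> List[int]:
    """Aggregate the k-lists of the ensemble: count repetitions, group
    candidates into bags by repetition count, max-pool the similarity
    scores within each bag, rank, concatenate, and cut off at m."""
    counts: Dict[int, int] = defaultdict(int)
    pooled: Dict[int, float] = defaultdict(float)   # in-bag max pooling
    for model_scores, k_list in zip(scores, k_lists):
        for i in k_list:
            counts[i] += 1
            pooled[i] = max(pooled[i], model_scores[i])
    bags: Dict[int, List[int]] = defaultdict(list)  # repetition count -> indices
    for i, c in counts.items():
        bags[c].append(i)
    ranked: List[int] = []
    for c in sorted(bags, reverse=True):            # bags sorted by repetitions
        ranked.extend(sorted(bags[c], key=lambda i: pooled[i], reverse=True))
    return ranked[:m]                               # cut-off at m candidates

# Three toy 3-lists voting over five candidate address indices
final = knn_vote(
    k_lists=[[1, 2, 3], [2, 3, 4], [2, 5, 1]],
    scores=[{1: 0.9, 2: 0.8, 3: 0.7},
            {2: 0.95, 3: 0.6, 4: 0.5},
            {2: 0.7, 5: 0.4, 1: 0.85}],
    m=4)
```

Candidate 2 appears in all three lists, so it leads the final ranking regardless of candidates 1 and 3 sharing a bag with it in none.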
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental Settings</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Data description</head><p>In our experiments, we use real private postal address data made available by a carrier in the region of Yonne, France. The data consist of two database tables: a table of approximately 1M non-normalized delivery addresses received via EDI and a table of registered recipients containing more than 42K normalized postal addresses. After the de-duplication step mentioned above, and owing to the large number of identical delivery instances, just over 85% of all delivery address instances were de-duplicated, mainly because most deliveries are business addresses. As a result, we are left with just over 147K distinct delivery addresses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dataset creation:</head><p>We are in an offline training setup (i.e. our ensemble 𝑘NN retriever needs no training but instead exploits the different representations, vector or raw, of postal addresses to search for the most similar ones). That said, creating a dataset of address pairs (i.e. non-normalized query-address, normalized reference-address) is necessary for two reasons: (1) the offline fine-tuning of the different sentence representation models for addresses, and (2) the final performance test of the 𝑘NN-vote system. To do this, we use the recipient keys associated with the records in the two tables to create a dataset of over 147K address pairs. The dataset is then divided into training and test data with respective proportions of 90% and 10%. The same test data is used to evaluate 𝑘NN-vote. A second cleaning is carried out on the training dataset to eliminate non-normalized input addresses likely to degrade the learning quality of the Bi-Encoder, such as addresses containing only the postal code and the city name. Such addresses completely lack the context linking them to their supposed normalized counterparts. Around 0.8% of the training data was affected by this second cleaning. Table <ref type="table" target="#tab_1">2</ref> shows some examples of such addresses.</p></div>
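The second cleaning and the 90/10 split can be illustrated as follows. Both functions are our own sketches: `lacks_context` is a simple token-subset heuristic that assumes the postal code and city of the normalized counterpart are available, which is one possible way to detect the context-less addresses of Table 2:

```python
import random
from typing import List, Tuple

def lacks_context(raw_addr: str, postal_code: str, city: str) -> bool:
    """True when the raw address contains nothing beyond the postal code
    and the city name (e.g. "trichey 89430 trichey"), i.e. no door number
    or road name linking it to its normalized counterpart."""
    return set(raw_addr.split()) <= {postal_code} | set(city.split())

def split_pairs(pairs: List[Tuple[str, str]],
                train_frac: float = 0.9,
                seed: int = 0) -> Tuple[list, list]:
    """Shuffle the (query-address, reference-address) pairs and split 90/10."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

flagged = lacks_context("trichey 89430 trichey", "89430", "trichey")
kept = lacks_context("4 rue maillet 89430 trichey", "89430", "trichey")
train, test = split_pairs([(str(i), str(i)) for i in range(100)])
```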
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Bi-encoders fine-tuning parameters</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1.">Fine-tuning Base</head><p>The three chosen base transformers were trained on a corpus of approximately 950K official French postal addresses from the Yonne region, France, and adjacent regions, taken from the official governmental address website <ref type="foot" target="#foot_5">7</ref> . The complete training of the three encoders was carried out over 5 iterations and no parameter optimization was done. The aim was simply to adapt the three language models to postal addresses and use them as a basis for fine-tuning the Bi-Encoders. The "transformers" package from HuggingFace<ref type="foot" target="#foot_6">8</ref> was used to train these language models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2.">Bi-encoders fine-tuning</head><p>The three Bi-Encoders were fine-tuned with the best combination of hyper-parameters presented in Table <ref type="table" target="#tab_2">3</ref>. The Camembert-base and XLM-Roberta-base architectures used for the first two Bi-Encoders are explored in detail in <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref>; for the third, a custom pre-trained Roberta-small architecture (6 layers, 128 hidden units, 8 heads, 8M parameters) is used. The three Bi-Encoders were fine-tuned on a local server with an NVIDIA Tesla A100 graphics card (20 GB) via the SBERT<ref type="foot" target="#foot_7">9</ref> "sentence-transformers" package. </p></div>
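A Bi-Encoder fine-tuning run with the sentence-transformers package looks roughly as follows. This is a configuration sketch under stated assumptions, not the authors' script: the checkpoint path "camembert-adapted" is a placeholder for the MLM-adapted base model of Section 5.2.1, and the loss choice (in-batch negatives over positive pairs) is our assumption, since the actual hyper-parameters are those of Table 3:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a Bi-Encoder from the domain-adapted checkpoint (placeholder path)
word_emb = models.Transformer("camembert-adapted", max_seq_length=64)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
bi_encoder = SentenceTransformer(modules=[word_emb, pooling])

# Positive pairs: (raw delivery address, normalized reference address)
train_examples = [
    InputExample(texts=["4 rue mailet 89430 trichy",
                        "4 rue maillet 89430 trichey"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Assumed loss: other pairs in the batch serve as negatives
loss = losses.MultipleNegativesRankingLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```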
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Models evaluation</head><p>To evaluate the proposed voting approach, we compare it with our different individual 𝑘NN models, in addition to the bi-encoder (BI_DistilBert) model proposed by Duarte et al. <ref type="bibr" target="#b5">[6]</ref>, which uses DistilBert Multilingual as the basis for fine-tuning. To remain consistent with the cited research, we consider 𝑘 = 10 neighbors, but we also test other values of 𝑘 for our individual systems. The models were evaluated with two metrics: (1) the existence ratio (ER), i.e. the proportion of correctly predicted positive pairs out of all pairs in the test dataset, and (2) the MRR, i.e. the Mean Reciprocal Rank, which measures the quality of the ranks at which correct query responses appear in information retrieval systems. For a sample of queries 𝑄, with 𝑟𝑎𝑛𝑘 𝑖 the position of the correct searched address for a query 𝑞 𝑖 ∈ 𝑄, 𝑖 = 1, ..., |𝑄|, the MRR is defined as follows:</p><formula xml:id="formula_4">𝑀𝑅𝑅 = (1/|𝑄|) ∑_{𝑖=1}^{|𝑄|} 1/𝑟𝑎𝑛𝑘 𝑖 ,<label>(3)</label></formula><p>The primary objective of the models is to achieve a maximum ER at the exact matching level (i.e. the predicted address is exactly the address sought for the query). In addition, two types of ER are computed:</p><p>(1) the ER of correct predictions at the first rank (top 1) and (2) the ER of correct predictions among the 𝑘 address candidates (top k). We are also interested in the matching ER at the road level (i.e. the predicted address is at least on the correct road of the searched address). This type of ER matters all the more because, in practice, carriers can generally deliver parcels successfully as long as they are on the same road as the delivery address <ref type="bibr" target="#b5">[6]</ref>.</p></div>
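Both metrics are straightforward to compute from the ranked candidate lists; a minimal sketch (our own code, with 1-based ranks as in Eq. (3) and a zero contribution when the correct address is absent from the list):

```python
from typing import List, Sequence

def existence_ratio(results: List[Sequence[int]],
                    gold: List[int], top: int = 10) -> float:
    """ER: share of queries whose correct address index appears
    among the first `top` retrieved candidates."""
    hits = sum(1 for res, g in zip(results, gold) if g in res[:top])
    return hits / len(gold)

def mean_reciprocal_rank(results: List[Sequence[int]],
                         gold: List[int]) -> float:
    """MRR over the sample of queries, Eq. (3); rank is 1-based."""
    total = 0.0
    for res, g in zip(results, gold):
        if g in res:
            total += 1.0 / (list(res).index(g) + 1)
    return total / len(gold)

# Three queries: correct index is 1 for each; found at ranks 1, 2, and never
results = [[1, 2, 3], [2, 1, 3], [9, 8, 7]]
gold = [1, 1, 1]
er = existence_ratio(results, gold, top=3)       # 2 of 3 queries hit
mrr = mean_reciprocal_rank(results, gold)        # (1 + 1/2 + 0) / 3
```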
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Comparison of individual 𝑘NN models</head><p>Our first intent was to compare the individual 𝑘NN systems in order to identify the best-performing model in terms of top k ER at the exact search level (top k exact). The results in Fig. <ref type="figure" target="#fig_4">4a</ref> show the superiority of the 𝑘NN models based on sentence representations, which comes down to the quality of the contextualized embeddings compared, for example, with word embeddings such as Word2Vec or FastText. We also note that models based on raw representations generally perform better than Word2Vec and FastText, probably because of the considerable loss of information in the static embeddings caused by the mean pooling used to create the address vectors.</p><p>Increasing the value of 𝑘 improves the existence ratio across all models, because the larger the list of neighbors, the greater the chance that harder addresses are retrieved. However, the gains in existence ratio vary: about 5% for sentence embeddings, 11% for raw representations and 26% for static embeddings as 𝑘 grows from 5 to 120, as shown in Figure <ref type="figure" target="#fig_4">4a</ref>. This is explained by the accuracy of the sentence embeddings, as the majority of positive pairs are already identified within the first 5 candidate addresses. In contrast, the raw and static embedding models require a very high 𝑘 value, up to 120. In terms of MRR, the results in Figure <ref type="figure" target="#fig_4">4b</ref> are consistent with the existence ratios, as the best models should reach the highest MRR at the lowest possible 𝑘. The fine-tuned sentence-embedding 𝑘NN models retrieve the searched addresses at the highest ranks compared to the other models. 
Furthermore, they remain stable as 𝑘 increases, demonstrating their strong retrieval ability even among the earliest candidates, thanks to their capacity to capture address context. This was expected, as the very purpose of sentence transformers is to learn to reduce the distance between vectors of positive address pairs even when they are syntactically very different, whereas models based on string similarity distances only perform well when the addresses are relatively similar syntactically.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">kNN multi-embeddings models experiment results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1.">Multi-embeddings models instances</head><p>We wanted to test the performance of the voting system over the set of individual 𝑘NN models while keeping the flexibility to select different subsets in order to maximize the voting efficiency. Fig. <ref type="figure">5</ref> shows the top k exact ERs of the chosen subsets with the best overall performance. We observe that the subset of sentence-only models generally performs better. If we further exclude from it the 𝑘NN models based on the from-scratch Roberta (camembert + XLM), we see a small increase in ERs at 𝑘 values of 5 and 10. This improvement is due to the original pre-training of Camembert and XLM_Roberta; it shows how much language models pre-trained on large corpora contribute when reused in other tasks such as STS. The voting system with all models is the least efficient, which can be explained by the large differences between the neighbor lists returned by the sentence models and the other models. In other words, it is natural that the 𝑘NN retrievers making the most mistakes in predicting positive pairs weaken the vote's ability to systematically assign a high number of repetitions to the sought-after addresses. This explanation is reinforced when we remove the static vector models from the vote (sentence + raw): we then notice a clear improvement in ERs. As for the MRR results, Fig. <ref type="figure">6</ref> shows that, in general, voting systems based on sentence models recover more positive normalized addresses at the highest ranks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2.">Discussion</head><p>We now compare our approach with BI_DistilBert, taking into account the addresses found at the road level, also considering the top 1 results, and keeping 𝑘 = 10 for consistency. Table <ref type="table" target="#tab_3">4</ref> shows the best individual 𝑘NN models and voting systems alongside BI_DistilBert. BI_DistilBert performs better than the raw models, but remains below the ER results of the individual 𝑘NN sentence models, for two main reasons. First, its base model, multilingual DistilBert, was not pre-trained on a corpus of postal addresses before the Bi-Encoder fine-tuning; we believe it is important for base language models to learn the structure of a postal address independently of the similarity task. Second, our dataset is more difficult: our addresses contain far more errors and noise. 𝑘NN-vote systems are better overall, supporting our intuition that aggregating results from multiple sources significantly improves similarity search performance. We do, however, note exceptions to the rule. Some individual 𝑘NN models, such as (A) and (B), rank above the (G) and (H) voting systems. This decrease in ERs confirms that aggregation alone does not always guarantee better results and that a high and heterogeneous number of models in the voting process hurts prediction quality. This is why the individual performance of the models used in the vote must also be taken into account. More specifically, the vote is more likely to produce superior results if its aggregation sources are the search models that make the fewest prediction errors. (I) manages to compete with the two best individual 𝑘NN sentence models but brings no improvement, particularly in top 1 exact. 
It is undoubtedly the participation of the Rsent models in the vote that prevents it from standing out from the other search systems, since Rsent is noticeably less accurate than Csent and XLMsent. In conclusion, the best vote is the one that uses the Csent and XLMsent models, with an ER top 1 exact of 86.2% and an ER top 10 exact of 96%, demonstrating the ability of the voting system to retrieve more positive address pairs at the top 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inference time:</head><p>We measured the retrieval time for 100 address queries to compare the various solutions, as shown in Table <ref type="table" target="#tab_3">4</ref>. Retrieval times for the voting systems (between 51 s and 173 s) are notably longer than those of the individual 𝑘NN models. Although the measurements were made without optimization in an experimental setup, we find these times acceptable for business applications. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>In this work we focused on the problem of matching postal addresses. We first showed that this task can be formalised simply as an information retrieval problem, in which models such as 𝑘NN have been shown to be efficient in both computation time and accuracy. For this purpose, we assumed that an address is a sentence described by a set of entities and, consequently, that it may contain erroneous or noisy elements. Moreover, the positions of the entities have an impact on address recognition. For these reasons, we proposed using different address representation spaces, such as the word embedding space or the sentence embedding space of a pre-trained transformer. Each representation contributes in part to the search for the closest address in its space. In order to aggregate the contributions of the different spaces, we proposed an ensemble of 𝑘NN models based on a voting system, called 𝑘NN-vote. The experimental results show that our system performs very well, achieving an accuracy of around 96% in the top 10 and 86.2% in the top 1. The system proves its value for this type of task, even though the voting algorithm is still quite naive for the time being: it favours addresses with the maximum number of repetitions and re-ranks them solely on the basis of the highest similarity score. Hence, the number of voters directly affects the repetition counts. In addition, the system's focus on an address's highest score, without taking the overall quality of the scores into account, can let a single score dominate even when other scores are more indicative. 
As future work, we are improving the voting process in order to account for and reinforce the potential effectiveness of a model with a lower but more significant score.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: French Postal Address Features</figDesc><graphic coords="4,128.41,65.60,338.47,108.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: 𝑘NN Search Model Architecture.</figDesc><graphic coords="8,150.98,101.90,293.33,130.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Ensemble Vote Process.</figDesc><graphic coords="9,25.94,114.41,543.40,154.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) 𝑘NN's top k exact ERs (b) 𝑘NN's top k exact MRRs</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Evaluation of Individual 𝑘NN models with regards to metrics: ER and MRR</figDesc><graphic coords="12,162.25,323.42,270.78,234.54" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :Figure 6 :</head><label>56</label><figDesc>Figure 5: 𝑘NN-vote top k exact ERs</figDesc><graphic coords="13,128.41,160.36,338.46,66.86" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Different Combinations used for 𝑘NN Retriever</figDesc><table><row><cell>Representation</cell><cell>Encoding</cell><cell>Similarity</cell></row><row><cell>Raw</cell><cell>Textual content</cell><cell>Jaro, Jaro-Winkler, Levenshtein, Jaccard, Damerau-Levenshtein, Ratio, Token set ratio, Token sort ratio, Partial ratio, Set ratio, Seq ratio</cell></row><row><cell>Vector</cell><cell>Csent (Camembert Bi-Encoder), XLMsent (Xlm Roberta Bi-Encoder), Rsent (Roberta custom Bi-Encoder), wvavg (Word2Vec word embeddings averaged), ftavg (fastText word embeddings averaged)</cell><cell>Cosine, Euclidean, Correlation, Cityblock</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Examples of Delivery Address Deletion</figDesc><table><row><cell>Received address</cell><cell>Normalized address</cell><cell>Justification for deletion</cell></row><row><cell>trichey 89430 trichey</cell><cell>4 rue maillet 89430 trichey</cell><cell>This address only has the postal code and the name of the city</cell></row><row><cell>89160 89160 sambourg</cell><cell>11 rue d argenteuil 89160 sambourg</cell><cell>Another example where the door number and road name are missing</cell></row><row><cell>xxxx 89240 pourrain</cell><cell>30 route d aillant 89240 pourrain</cell><cell>Here 'xxxx' is used as a placeholder because the expediter only had the recipient's name and needed to fill in something for the incomplete address</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Best Found Fine-tuning Hyper-parameters for Bi-Encoders</figDesc><table><row><cell>Bi-Encoder</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Existence Ratios of the Best Methods</figDesc><table><row><cell>System</cell><cell>Top 1 exact</cell><cell>Top 1</cell><cell>Top 10 exact</cell><cell>Top 10</cell><cell>MRR</cell><cell>Time in s (100 queries)</cell></row><row><cell>(A) Csent_cosine</cell><cell>0.857</cell><cell>0.916</cell><cell>0.959</cell><cell>0.970</cell><cell>0.895</cell><cell>12</cell></row><row><cell>(B) XLMsent_cosine</cell><cell>0.856</cell><cell>0.917</cell><cell>0.954</cell><cell>0.968</cell><cell>0.893</cell><cell>13</cell></row><row><cell>(C) Rsent_cosine</cell><cell>0.799</cell><cell>0.879</cell><cell>0.930</cell><cell>0.956</cell><cell>0.847</cell><cell>6</cell></row><row><cell>(D) Token_set_ratio</cell><cell>0.740</cell><cell>0.829</cell><cell>0.866</cell><cell>0.889</cell><cell>0.791</cell><cell>8</cell></row><row><cell>(E) Ratio</cell><cell>0.730</cell><cell>0.834</cell><cell>0.871</cell><cell>0.891</cell><cell>0.785</cell><cell>5</cell></row><row><cell>(F) BI_DistilBert</cell><cell>0.763</cell><cell>0.793</cell><cell>0.918</cell><cell>0.939</cell><cell>0.826</cell><cell>10</cell></row><row><cell>(G) all models</cell><cell>0.760</cell><cell>0.872</cell><cell>0.950</cell><cell>0.962</cell><cell>0.830</cell><cell>173</cell></row><row><cell>(H) sentence + raw</cell><cell>0.801</cell><cell>0.896</cell><cell>0.957</cell><cell>0.967</cell><cell>0.859</cell><cell>144</cell></row><row><cell>(I) sentence only</cell><cell>0.852</cell><cell>0.920</cell><cell>0.959</cell><cell>0.972</cell><cell>0.894</cell><cell>69</cell></row><row><cell>(J) camembert + XLM</cell><cell>0.862</cell><cell>0.921</cell><cell>0.960</cell><cell>0.972</cell><cell>0.900</cell><cell>51</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://github.com/fxsjy/jieba El Moundir Faraoun et al. CEUR Workshop Proceedings 96-111</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://developers.google.com/maps/documentation/geocoding?hl=fr</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">http://geohash.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://www.upu.int/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://github.com/seatgeek/fuzzywuzzy</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://adresse.data.gouv.fr/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">https://huggingface.co/docs/transformers/index</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">https://www.sbert.net/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Thanks to the french ANRT (Association Nationale de la Recherche et de la Technologie) for funding this project under the "Cifre convention for thesis funding" https://www.anrt.asso.fr/ and to the developers of TEDIES, TALK solutions who assisted in this project https://site.tedies.eu/.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Georoberta: A transformer-based approach for semantic address matching</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Guermazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sellami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Boucelma</surname></persName>
		</author>
		<ptr target="https://hal.science/hal-04465164" />
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>HAL</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Machine learning innovations in address matching: A practical comparison of word2vec and crfs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Comber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Arribas-Bel</surname></persName>
		</author>
		<idno type="DOI">10.1111/tgis.12522</idno>
		<ptr target="https://doi.org/10.1111/tgis.12522" />
	</analytic>
	<monogr>
		<title level="j">Transactions in GIS</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="334" to="348" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Address validation in transportation and logistics: A machine learning based entity matching approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Guermazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sellami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Boucelma</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-65965-3_21</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-65965-3_21" />
	</analytic>
	<monogr>
		<title level="m">Communications in Computer and Information Science</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">1323</biblScope>
			<biblScope unit="page" from="320" to="334" />
		</imprint>
	</monogr>
	<note>Communications in Computer and Information Science</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A deep learning architecture for semantic address matching</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.1080/13658816.2019.1681431</idno>
		<ptr target="https://doi.org/10.1080/13658816.2019.1681431" />
	</analytic>
	<monogr>
		<title level="j">International Journal of Geographical Information Science</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="559" to="576" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep contrast learning approach for address semantic matching</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>She</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.3390/app11167608</idno>
		<ptr target="https://doi.org/10.3390/app11167608" />
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">7608</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Improving address matching using siamese transformer networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Duarte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Oliveira</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-49011-8_33</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-49011-8_33" />
	</analytic>
	<monogr>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<biblScope unit="volume">14116</biblScope>
			<biblScope unit="page" from="413" to="425" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Quantitative comparison of nearest neighbor search algorithms</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Rakotondrasoa</surname></persName>
		</author>
		<idno>arXiv, 2023</idno>
		<ptr target="https://arxiv.org/abs/2307.05235" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Fast record linkage for company entities</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gschwind</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/document/9006095" />
	</analytic>
	<monogr>
		<title level="m">IEEE Conference Publication</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A new method of chinese address extraction based on address tree model</title>
		<author>
			<persName><forename type="first">K</forename><surname>Mengjun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Qingyun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Mingjun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Acta Geodaetica et Cartographica Sinica</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="99" to="107" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Binary codes capable of correcting deletions, insertions and reversals</title>
		<author>
			<persName><forename type="first">V</forename><surname>Levenshtein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Soviet Phys. Doklady</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">707</biblScope>
			<date type="published" when="1966">1966</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A technique for computer detection and correction of spelling errors</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">J</forename><surname>Damerau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="171" to="176" />
			<date type="published" when="1964">1964</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines</title>
		<author>
			<persName><forename type="first">P</forename><surname>Jaccard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bulletin de La Société Vaudoise Des Sciences Naturelles</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="241" to="272" />
			<date type="published" when="1901">1901</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jaro</surname></persName>
		</author>
		<idno type="DOI">10.2307/2289924</idno>
		<ptr target="https://doi.org/10.2307/2289924" />
	</analytic>
	<monogr>
		<title level="j">Journal of the American Statistical Association</title>
		<imprint>
			<biblScope unit="volume">84</biblScope>
			<biblScope unit="page" from="414" to="420" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">E</forename><surname>Winkler</surname></persName>
		</author>
		<ptr target="https://eric.ed.gov/?id=ED325505" />
		<imprint>
			<date type="published" when="1990">1990</date>
			<publisher>ERIC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<ptr target="https://arxiv.org/abs/1301.3781" />
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Enriching word vectors with subword information</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.04606</idno>
		<ptr target="https://arxiv.org/abs/1607.04606" />
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">5</biblScope>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">An introduction to conditional random fields</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1011.4088</idno>
		<ptr target="https://arxiv.org/abs/1011.4088" />
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html" />
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<ptr target="https://arxiv.org/abs/1907.11692" />
		<title level="m">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Deep entity matching with pre-trained language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Suhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename><surname>Tan</surname></persName>
		</author>
		<idno type="DOI">10.14778/3421424.3421431</idno>
		<ptr target="https://doi.org/10.14778/3421424.3421431" />
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="50" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Entity matching with transformer architectures -a step forward in data integration</title>
		<author>
			<persName><forename type="first">U</forename><surname>Brunner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stockinger</surname></persName>
		</author>
		<idno type="DOI">10.5441/002/edbt.2020.58</idno>
		<ptr target="https://doi.org/10.5441/002/edbt.2020.58" />
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/d19-1410</idno>
		<ptr target="https://doi.org/10.18653/v1/d19-1410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</title>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.01108</idno>
		<ptr target="https://arxiv.org/abs/1910.01108" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Billion-scale similarity search with gpus</title>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Douze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1702.08734</idno>
		<idno type="arXiv">arXiv:1702.08734</idno>
		<ptr target="https://doi.org/10.48550/arXiv.1702.08734" />
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Multidimensional binary search trees used for associative searching</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Bentley</surname></persName>
		</author>
		<idno type="DOI">10.1145/361002.361007</idno>
		<ptr target="https://doi.org/10.1145/361002.361007" />
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="509" to="517" />
			<date type="published" when="1975">1975</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Five Balltree Construction Algorithms</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Omohundro</surname></persName>
		</author>
		<ptr target="https://www.icsi.berkeley.edu/pubs/techreports/TR-89-063.pdf" />
		<imprint>
			<date type="published" when="1989">1989</date>
			<pubPlace>Berkeley, CA</pubPlace>
		</imprint>
		<respStmt>
			<orgName>International Computer Science Institute</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">CamemBERT: a tasty French language model</title>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Muller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Suárez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dupont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>De La Clergerie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Seddah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.645</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.acl-main.645" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7203" to="7219" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">Unsupervised cross-lingual representation learning at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.02116</idno>
		<ptr target="https://arxiv.org/abs/1911.02116" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Learning a similarity metric discriminatively, with application to face verification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chopra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<idno type="DOI">10.1109/cvpr.2005.202</idno>
		<ptr target="https://doi.org/10.1109/cvpr.2005.202" />
	</analytic>
	<monogr>
		<title level="m">IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR&apos;05)</title>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Efficient natural language response suggestion for smart reply</title>
		<author>
			<persName><forename type="first">M</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Al-Rfou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Strope</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lukacs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Miklos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kurzweil</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1705.00652</idno>
		<ptr target="https://arxiv.org/abs/1705.00652" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Batch-softmax contrastive loss for pairwise sentence scoring tasks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chernyavskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ilvovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kalinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.naacl-main.9</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<meeting>the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2022-07">July 2022</date>
			<biblScope unit="page" from="116" to="126" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
