<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Matching with 7B LLMs: A Study on Prompting Strategies and Hardware Limitations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ioannis Arvanitis-Kasinikos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Papadakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National and Kapodistrian University of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Entity Matching (EM) is a fundamental task in data management, involving the identification and linking of records that refer to the same real-world entity across different datasets. While Large Language Models (LLMs) have shown promise in addressing complex natural language processing tasks, their substantial computational requirements often limit their practical applicability. In this work, we investigate the use of 7B parameter LLMs with 4-bit quantization for EM tasks executable on commodity hardware. We explore various prompting strategies, including zero-shot, few-shot, and general matching definition prompts, to evaluate their effectiveness in improving EM accuracy. Experiments are conducted on two benchmark product datasets, which present varying levels of complexity and challenge in product descriptions. Our findings demonstrate that 7B parameter LLMs can effectively perform EM, with the Orca2 model consistently outperforming others across different prompting strategies and datasets. The study highlights that few-shot prompting significantly enhances performance over zero-shot approaches, emphasizing the importance of task-specific examples and careful prompt design. We also examine the impact of example order in few-shot prompts and find that it has a substantial effect on model performance. Finally, we examine hardware limitations, demonstrating that effective EM can be achieved with resource-constrained models.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Matching</kwd>
        <kwd>7B LLMs</kwd>
        <kwd>Zero-Shot Prompts</kwd>
        <kwd>Few-Shot Prompts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Entity Resolution (ER) constitutes a vital task in data
management that involves identifying and linking records from
different datasets that refer to the same real-world entity
[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. In many domains, including e-commerce,
healthcare, and finance, accurate ER is essential for ensuring data
quality, enabling effective data integration, and supporting
informed decision-making [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, this task is
challenging due to data inconsistencies, incompleteness, and
ambiguity across different sources [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        As an example, consider the product descriptions in
Figure 1. Despite corresponding to the same object (Sony
headphones), there are significant variations in product names,
attributes, and dimensions. These discrepancies illustrate
the challenges in reconciling variations across datasets,
particularly when dealing with unstructured text and linguistic
differences. Accurate ER in scenarios like this is crucial for
product catalog integration, price comparison, and
recommendation systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Due to its quadratic time complexity, ER solutions
typically implement the Filtering-Verification framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The Filtering step, often called Blocking, significantly
reduces the computational cost to the most similar candidate
pairs, which are the most likely matches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The
Verification step performs Entity Matching (EM), which essentially
determines whether two records are duplicates, describing
the same real-world object. In the following, we exclusively
focus on EM.
      </p>
      <p>
        Traditional EM solutions typically rely on rule-based
approaches, string similarity metrics, or machine learning
algorithms [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. However, these methods can
struggle with complex linguistic variations and contextual
understanding, while requiring domain expertise and heavy
human involvement [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This is addressed by more recent
state-of-the-art approaches that leverage deep learning (DL)
techniques [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, they require substantial amounts
of training data, which are rarely available.
      </p>
      <p>
        Recent advancements in NLP, particularly in Large
Language Models (LLMs), offer new possibilities for addressing
EM challenges [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. LLMs possess advanced
capabilities for natural language understanding, which allows them
to process and interpret complex textual descriptions [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Most importantly, LLM-based EM can be performed in
zero-shot settings, requiring no training instances, a
characteristic particularly attractive for out-of-the-box solutions.
      </p>
      <p>
        In this work, we evaluate the performance of 7B
parameter LLMs in entity matching tasks. While larger LLMs
with hundreds of billions of parameters have shown
impressive results [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], their computational requirements often
make them impractical for many real-world applications. By
employing these LLMs, which excel in natural language
understanding and semantic similarity assessment, this work
seeks to address EM challenges in real-world datasets with
linguistic variations and unstructured text, while also
highlighting their suitability for execution on commodity
hardware. The focus on 7B parameter LLMs is motivated by their
potential for efficient deployment on commodity hardware,
making them more suitable for practical applications.
      </p>
      <p>To this end, we perform an extensive experimental
evaluation that considers the models’ ability to handle different
types of EM scenarios. We explore novel zero-shot,
few-shot, and general matching definition prompting strategies
to assess their effectiveness in improving matching accuracy.
Our goal is to bridge the gap between the advanced
capabilities of LLMs and the practical constraints of real-world EM
applications, potentially paving the way for more efficient
and accurate ER techniques in diverse domains.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>There is a plethora of recent LLM-based EM methods,
because LLMs offer several advantages over traditional EM
solutions: (i) contextual understanding, as they understand
the context and semantics of entity descriptions better than
traditional string matching techniques. (ii) robustness, since
LLMs are typically more capable of addressing variations
in how entity information is expressed. (iii) zero-shot and
few-shot learning, i.e., LLMs can accomplish EM tasks with
no or minimal examples of matching decisions. These
characteristics render LLMs ideal for most EM tasks, especially
those with complex, unstructured product descriptions.</p>
      <p>
        The seminal work on LLM-based EM [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] investigated
the effectiveness of GPT3-175B in EM, focusing on three
key parameters: (i) problem definition, exploring different
phrasings such as “Are Product A and Product B the same?”
or “Are Product A and Product B equivalent?”. (ii) in-context
learning, comparing zero-shot with few-shot approaches.
The former involves no examples in the prompt,
while the latter includes a couple of examples, which are
selected randomly or by experts. (iii) entity serialization,
testing the use of all attributes or just a subset of them.
Their experimental analysis led to the following conclusions:
(i) few-shot learning significantly outperforms zero-shot
approaches, (ii) attribute selection yields better results than
using all attributes, (iii) problem definition has a substantial
impact on performance, (iv) LLM performance is comparable
to the state-of-the-art DL-based matching algorithms.
      </p>
      <p>
        A detailed study was conducted in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], using six LLMs,
three hosted and three open-source ones. The experiments
explored additional parameters such as problem definition,
language complexity, output specification, entity
serialization, in-context learning, instructions, and fine-tuning. The
experimental results revealed that: (i) no single prompt
consistently outperformed all others across different scenarios.
(ii) Open-source LLMs showed comparable effectiveness
to hosted models. (iii) LLMs performed competitively with
deep learning-based matchers, even in zero-shot settings. (iv)
Few-shot and instruction-based prompts generally
outperformed zero-shot approaches. (v) Fine-tuning significantly
improved effectiveness.
      </p>
      <p>
        In another line of research, three distinct prompting
strategies were explored in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]: (i) Match prompts, which
contain traditional pair-wise questions. E.g., “Do these two
records refer to the same real-world entity? Record 1:
[details]. Record 2: [details].” (ii) Comparison prompts, which
ask for the most similar entity to a given reference. E.g.,
“Which of these two records is more consistent with the
given record? Given Record: [details]. (A) Record 1:
[details]. (B) Record 2: [details].” (iii) Selection prompts, which
identify a matching entity from a set of candidates. E.g.,
“Select a record from the following list that refers to the
same real-world entity as the given record: Given Record:
[details]. Options: 1. [details] 2. [details] 3. [details]...”
The experimental results show that incorporating record
interactions through the comparison and selection prompts
significantly improves EM performance across various
scenarios; among the two, the selection prompts are the
top performers in most cases. However, they suffer from
position bias, because their accuracy decreases when the
duplicate record is placed lower in the list of candidates.
      </p>
      <p>
        BatchER [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] aims to reduce the costs for hosted LLMs
through batch processing, exploring various methods for
question batching and demonstration selection. The
experimental results demonstrate that batch prompting
outperforms match prompts in both effectiveness and cost, with
the top performance achieved by diversity-based question
batching combined with covering-based demonstration
selection.
      </p>
      <p>These studies collectively demonstrate the potential of
LLMs in entity matching tasks, highlighting the importance
of prompt engineering, the competitiveness of open-source
models, and the effectiveness of batching strategies for
improved efficiency. This work builds upon and extends the
existing ones by focusing specifically on 7B parameter LLMs
with 4-bit quantization. Unlike previous studies that
primarily use larger, more resource-intensive models, our work
explores the potential of smaller and more accessible LLMs
for EM tasks. In this context, we perform a
comprehensive evaluation of various novel prompting strategies,
including zero-shot, few-shot, and general matching
definition approaches, across multiple models and datasets. This
approach offers insights into the practical applicability of
LLMs in resource-constrained environments, bridging the
gap between advanced language models and real-world EM
challenges.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Definition</title>
      <p>
        Applied after Filtering, Entity Matching is typically
formulated as a binary classification problem [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. More formally:
Given two records r1 and r2, the task is to determine whether
they refer to the same entity. This is often expressed as a
function m(r1, r2) → {0, 1}, where 1 indicates a match
(also called duplicate) and 0 indicates a non-match.
      </p>
      <p>In LLM-based settings, EM is framed as a natural language
inference task. The LLM is provided with descriptions of
two records and asked to determine if they refer to the same
entity, returning “True” for a match and “False” otherwise.</p>
      <p>
In all cases, EM performance is measured with respect to:
• Precision, i.e., the proportion of correctly identified
matches out of all predicted matches.
• Recall, i.e., the proportion of correctly identified matches
out of all actual matches.
• F-measure, i.e., the harmonic mean of precision and recall,
providing a balanced measure of performance.
• Run-time, i.e., the time taken to complete the ER process.
The first three measures are defined in [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], with higher
values indicating higher effectiveness; for the last one, lower
values indicate higher time efficiency (see the sketch below).
      </p>
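      <p>For illustration, these measures can be computed from the binary decisions over the candidate pairs as in the following minimal Python sketch (an illustrative helper, not part of our experimental code):</p>
      <preformat>
def em_metrics(y_true, y_pred):
    """Compute Precision, Recall and F-Measure for binary EM decisions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
      </preformat>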
    </sec>
    <sec id="sec-4">
      <title>4. EM Prompts</title>
      <p>We now present the EM prompts that are examined in our
work. The basic prompt is presented in Figure 2(a). It
consists of an instruction that describes the input and the
desired output. It lacks any examples and thus constitutes a
zero-shot EM prompt, which tests the model’s ability to generalize
to new tasks or domains it has not been trained on.</p>
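      <p>As an illustration of how such a zero-shot prompt can be issued, the following Python sketch queries a model through the Ollama Python client; the template paraphrases Figure 2(a), and the model tag is an assumption rather than our exact configuration:</p>
      <preformat>
import ollama  # Python client for the Ollama runtime

# Paraphrased zero-shot template; the verbatim prompt appears in Figure 2(a).
ZERO_SHOT_TEMPLATE = (
    "Do the following two product records refer to the same real-world entity? "
    "Answer only with True or False.\n"
    "Record 1: {r1}\nRecord 2: {r2}"
)

def zero_shot_match(r1, r2, model="orca2"):  # "orca2" is an assumed model tag
    prompt = ZERO_SHOT_TEMPLATE.format(r1=r1, r2=r2)
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    # The instruction constrains the output to a True/False label.
    return response["message"]["content"].strip().lower().startswith("true")
      </preformat>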
      <p>A concise few-shot EM prompt extends the zero-shot one
with the examples in Figure 2(b). To provide a balanced
context, there are two examples that include a pair of matching
entities and a pair of non-matching ones. These examples
serve as a form of weak supervision, allowing the LLM to
learn from the provided instances and generalize to
similar cases. Note that the examples in Figure 2(b) have been
carefully selected from dataset D1 (see Table 1) so that they
capture typical variations in product descriptions that are
encountered in the full dataset.</p>
      <p>
Note that LLM responses to few-shot prompts suffer from
position bias [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], because the order of examples in the EM
prompt might alter the matching decision. This means that
in the example of Figure 2(b), the response for a specific
candidate pair might be True (i.e., matching) if the
positive example precedes the negative one and False (i.e.,
non-matching) otherwise. For this reason, we define two types
of few-shot prompts:
1. TF, where the True example is followed by the False one, as
in Figure 2(b).
2. FT, where the False example is followed by the True one.
Note that with multiple examples per prompt, as in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
more arrangements are possible. In this work, though, we
exclusively consider the two variations of the few-shot EM
prompt that involve one example per match type.
      </p>
      <p>To increase the robustness of LLMs to few-shot EM
prompts, we consider two matching approaches for each
candidate pair, querying with both the TF and FT prompts, as sketched below:
1. The union approach labels a candidate pair as True if
either the TF or FT prompt results in a True response.
2. The intersection approach labels a candidate pair as True
only if both the TF and FT prompts yield a True response.</p>
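      <p>A minimal sketch of the two approaches is given below, assuming an illustrative llm_label() helper that issues a prompt and parses the True/False answer; the demonstration pairs stand in for the examples of Figure 2(b):</p>
      <preformat>
import ollama

def llm_label(prompt, model="orca2"):  # assumed model tag
    """Issue a prompt and parse the True/False answer."""
    resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"].strip().lower().startswith("true")

def few_shot_prompt(r1, r2, examples):
    """Prepend ordered (record_a, record_b, label) demonstrations to the question."""
    demos = "\n".join(f"Record 1: {a}\nRecord 2: {b}\nAnswer: {label}"
                      for a, b, label in examples)
    return (f"{demos}\nDo the following records refer to the same entity? "
            f"Answer only with True or False.\nRecord 1: {r1}\nRecord 2: {r2}")

def combined_match(r1, r2, pos, neg, mode="intersection"):
    tf = llm_label(few_shot_prompt(r1, r2, [pos, neg]))  # True example first (TF)
    ft = llm_label(few_shot_prompt(r1, r2, [neg, pos]))  # False example first (FT)
    return (tf or ft) if mode == "union" else (tf and ft)
      </preformat>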
      <sec id="sec-4-1">
        <title>4.1. Domain-specific Zero-Shot Prompts</title>
        <p>The above prompts are generic enough to apply to any
domain. In our experimental analysis, we also consider
domain-specific ones, which are crafted for the product
matching task. More specifically, we devise a zero-shot
prompt that involves general matching definitions,
providing the LLM with explicit guidance on how to determine if
two records refer to the same product.</p>
        <p>The core assumption of this approach is that the records
are described by a clean, aligned schema. This is necessary
for building a schema-aware generic definition of
duplicate records. In the product matching task, we use four
key product attributes: (i) product name, (ii) features, (iii)
manufacturer, and (iv) model number. We use them in two
different configurations:
1. The composite domain-specific EM prompt concatenates
all four criteria in the above sequence, as in Figure 3. The
goal is to facilitate more nuanced matching decisions.
2. The atomic domain-specific EM prompt uses only the
model number as the matching criterion. We selected
this attribute because it provides the cleanest and most
distinctive values.</p>
        <p>These two configurations were chosen after preliminary
tests that suggested that they yield the best performance
among all other combinations of these four attributes.</p>
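        <p>For illustration, the two configurations can be phrased as prompt templates along the following lines; these are paraphrased sketches rather than the verbatim prompts of Figure 3:</p>
        <preformat>
# Composite domain-specific prompt: all four matching criteria (paraphrased sketch).
COMPOSITE_PROMPT = (
    "Two records describe the same product if they share the same product name, "
    "the same features, the same manufacturer, and the same model number. "
    "Based on this definition, do the following records refer to the same product? "
    "Answer only with True or False.\nRecord 1: {r1}\nRecord 2: {r2}"
)

# Atomic domain-specific prompt: the model number as the sole criterion.
ATOMIC_PROMPT = (
    "Two records describe the same product if they have the same model number. "
    "Based on this definition, do the following records refer to the same product? "
    "Answer only with True or False.\nRecord 1: {r1}\nRecord 2: {r2}"
)
        </preformat>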
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Analysis</title>
      <p>Experimental Settings. All experiments were
implemented in Python v3.12.0 and Ollama1 v0.1.22. All
experiments were carried out on a server running Ubuntu 22.04.1
LTS, equipped with an 8-core Intel Core i7-9700K @ 3.6 GHz,
32GB RAM and an NVIDIA GeForce GTX 1080 Ti with 11GB of VRAM.</p>
      <p>Due to the limited size of the available VRAM, our study
focuses on 7-billion-parameter LLMs with optimizations
such as quantization, which in our case replaces the
32-bit floating-point model weights with 4-bit integers. This
reduces the model size, while maintaining reasonable
performance levels. In other words, quantization lowers
effectiveness, due to the fewer parameters and the lower precision of
the model’s weights, but significantly reduces run-times and
memory consumption. Therefore, our experimental results
are useful for resource-constrained applications, which run
LLMs on commodity hardware.</p>
      <p>
        LLMs. There is a plethora of open-source LLMs, with
newer models introduced on a rather frequent basis. During
our study, two models were quite popular: Llama 2 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
with 7B parameters and a context length of 4096, as well
as Mistral [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], with 7.3B parameters. However, preliminary
experiments demonstrated that both of them were
inappropriate for the EM tasks considered in this work. Llama
2 consistently responded with “True” for every candidate
pair, while Mistral failed to provide a response according to
given instructions – it indicated an inability to respond in
certain cases or gave explanations for its decisions instead
of a “True” or “False” label.
      </p>
      <p>
        In their place, we considered the following open-source
models, which demonstrated high effectiveness in our
preliminary experiments:
1. Orca2 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Built by Microsoft Research, Orca2 is a
family of models fine-tuned on Meta’s Llama 2 using
synthetic data.
2. OpenHermes2. This is a Mistral 7B model fine-tuned with
fully open datasets, showcasing strong multi-turn chat
skills.
1https://ollama.com
2https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B
3. Zephyr [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. A 7B parameter model fine-tuned on
Mistral, it achieves results similar to Llama 2 70B Chat in
various benchmarks. It is trained on a distilled dataset,
improving grammar and chat results.
4. Mistral-OpenOrca3. This is a 7B parameter model,
fine-tuned on top of Mistral 7B using the OpenOrca dataset.
5. Stable-Beluga4. This is a Llama 2 based model fine-tuned
on an Orca-style dataset.
6. Llama-Pro [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]: An 8B parameter expansion of Llama 2
that specializes in integrating both general language
understanding and domain-specific knowledge, particularly
in programming and mathematics.
      </p>
      <p>In all cases, we use the default latest model with 4-bit
quantization and 7B parameters.</p>
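      <p>As a usage sketch, the evaluation loop per model reduces to labeling every candidate pair and timing the process; the model tags below are assumptions about the default 4-bit Ollama variants, and zero_shot_match() is the illustrative helper sketched in Section 4:</p>
      <preformat>
import time

# Assumed Ollama tags for the six models; default tags resolve to 4-bit quantized weights.
MODELS = ["orca2", "openhermes", "zephyr",
          "mistral-openorca", "stable-beluga", "llama-pro"]

def evaluate(model, candidate_pairs):
    """Label every candidate pair with the zero-shot prompt and report the run-time."""
    start = time.time()
    predictions = [zero_shot_match(r1, r2, model=model)  # helper from the Section 4 sketch
                   for r1, r2 in candidate_pairs]
    return predictions, time.time() - start
      </preformat>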
      <p>Datasets. We used two real-world datasets with products
that are widely used in the ER literature: (i) D1 is the
Abt-Buy dataset, which comprises product listings from two
online retailers, Abt Electronics and Buy.com. (ii) D2 is the
Walmart-Amazon dataset, which contains product listings
from two other online retailers, Walmart and Amazon. D1
primarily focuses on electronic products, while D2 covers a
broader range of product categories, matching diverse entity
types. Both datasets present important challenges, such as
variations in product names and descriptions across retailers,
inconsistent use of model numbers and other identifiers,
differences in the level of detail provided for each product,
variations in formatting and units (e.g., dimensions, weights)
as well as missing or null values in certain fields.</p>
      <p>
        Their technical characteristics are summarized in Table
1. Note that each dataset comprises two individually clean
data sources, whose sizes are reported in column
#Entities. Note also that we apply the prompts to the candidate
pairs generated by a state-of-the-art blocking implemented
3https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca
4https://huggingface.co/stabilityai/StableBeluga2
by PyJedAI [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], version 0.1.6. Following [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], we use
kNNJoin, which identifies the k nearest neighbors of each
entity. We fine-tuned it, maximizing blocking precision for a
blocking recall of at least 90%, as reported in the rightmost
columns of Table 1. This configuration uses cleaning (i.e.,
stop-word removal and stemming) and cosine similarity in
both datasets. For Abt-Buy, k was set to 4, while the
attribute values were converted into a multiset of character
trigrams. For Walmart-Amazon, k was set to 2, while the
attribute values were converted into a multiset of character
four-grams.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Zero-Shot Prompting Results</title>
        <p>We now examine the relative performance of the selected
LLMs over D1 and D2, when coupled with the basic
zero-shot EM prompt of Figure 2(a).</p>
        <p>We observe that Orca2, OpenHermes, and Zephyr
consistently rank as the top three models with respect to
F-Measure in both datasets. The last two models switch
their ranking positions in the two datasets, whereas Orca2
maintains the lead. The superior performance of Orca2,
which demonstrates its robustness under diverse EM
settings, can be attributed to its fine-tuning on synthetic
data designed for reasoning tasks. This enhances its
capability to understand and compare complex product
descriptions. OpenHermes is fine-tuned on fully open datasets with
strong multi-turn chat skills, leveraging advanced language
understanding to perform well. Zephyr’s competitive
performance probably results from its training on a distilled
dataset that improves grammar and chat results, aiding in
better interpretation of entity attributes. The lower
performance of Mistral-OpenOrca, Stable-Beluga, and Llama-Pro
is probably due to the less specialized training data or the
smaller model capacities for the specific nuances of EM.</p>
        <p>Note that all models exhibit much higher recall than
precision in both datasets. This means that they are prone to
label a candidate pair as matching, at the cost of introducing
numerous false positives. Orca2 consistently exhibits the
highest precision, thus yielding the highest F-Measure, too.</p>
        <p>Note also that all models exhibit markedly lower
effectiveness in D2 compared to D1. This suggests that D2 presents
greater EM challenges, potentially due to more diverse or
complex product descriptions. While D1 is restricted to
electronics, D2 covers a broader range of products and
includes more variation in descriptions, attributes, and data
quality, rendering EM more difficult. Furthermore, D1 has
a 1:1 matching between its two data sources, whereas D2
has a much lower ratio of matches, adding another layer of
complexity to the task. The substantial performance gap
between D1 and D2 underscores the significant impact of
data characteristics on model effectiveness.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Few-Shot Prompting Results</title>
        <p>We now examine the performance of the aforementioned
few-shot prompts over D1 and D2. We disregard
Mistral-OpenOrca, Stable-Beluga, and Llama-Pro, because they
exhibited significantly lower effectiveness and less consistent
performance in the zero-shot experiments – preliminary
experiments verified their poor performance in few-shot
settings, too. For brevity, we focus on the top three
performing models, namely Orca2, OpenHermes, and Zephyr.</p>
        <p>The results are reported in Figure 5. Based on preliminary
experiments, we randomly select the examples included in
the few-shot prompts from the candidate pairs of the same
dataset. The same examples are used in all prompts issued
on a particular dataset.</p>
        <p>In both datasets, we observe the same patterns as regards
the relative performance of TF and FT few-shot prompts:
For Orca2, there is a substantial improvement when using
the latter; OpenHermes is more robust to position bias, as
there is no significant difference between the two prompt
strategies; Zephyr works best when coupled with the TF
few-shot prompts. These patterns highlight that the impact
of position bias on each model is consistent across the two
datasets. Note also that with the exception of Orca2 with
TF prompts, all models achieve higher recall than precision,
remaining more prone to label a candidate pair as matching.</p>
        <p>It is also interesting to compare the union approach with
the intersection one. For OpenHermes and Zephyr, the
latter yields significantly higher F-Measure: by considering
as duplicates only the candidate pairs that are marked as
matching by both TF and FT few-shot prompts, the
reduction in recall is much lower than the increase in precision
(as a result, recall remains much higher than precision for
both models). This means that considering only the
common matches of TF and FT prompts leads to more accurate
performance. Note that these patterns are consistent for
both models over both datasets.</p>
        <p>This is not the case with Orca2, whose performance varies
significantly across the two datasets. In D1, the same F1
score is achieved for both approaches, because the
intersection raises precision by 12%, while reducing recall to the
same degree. In D2, though, the intersection reduces recall
by 23% and increases precision by 16%, thus yielding a much
lower F-Measure. Note that in both datasets, the recall of
the model gets lower than its precision in combination with
the intersection approach, unlike the union one.</p>
        <p>Overall, we can conclude that Orca2 works best when
coupled with FT few-shot prompts, while OpenHermes and
Zephyr maximize their effectiveness when intersecting the
matches of TF and FT prompts. Among them, the top
performers over D1 and D2 are Orca2 (F1=0.799) and Zephyr
(F1=0.531), respectively.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Domain-specific Zero-Shot Prompting</title>
      </sec>
      <sec id="sec-5-4">
        <title>Results</title>
        <p>In this section, we compare the atomic domain-specific
prompt with the composite one. As in Section 5.2, we
exclusively consider the three top performing models with
respect to the zero-shot prompts: Orca2, OpenHermes, and
Zephyr. Their performance is reported in Figure 6.</p>
        <p>We observe that in all cases, the atomic prompt
outperforms the composite one to a significant extent – the only
exception corresponds to Zephyr in D1, where the
composite prompt increases F-Measure almost by 15%. This pattern
should be attributed to the short, distinctive and clean
values provided by the model number. This way, it reduces
the noise from other product attributes like product name,
which are typically associated with long and diverse texts.</p>
        <p>Similar to the above strategies, all LLMs exhibit much
higher recall than precision. This means that they remain
prone to mark a candidate pair as a match at the cost of
introducing false positives – a behavior that permeates all
prompt strategies we have examined.</p>
        <p>Among the three models, Orca2 is consistently better,
albeit to a minor extent in D2. This consistent performance
underscores Orca2’s effectiveness in EM tasks under quite
different prompt designs.</p>
        <p>We can conclude that domain-specific zero-shot prompts
offer an effective and reliable alternative in datasets with a
clean schema of known characteristics.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.4. Comparison of Prompting Strategies</title>
        <p>We now compare the three top-performing models (Orca2,
OpenHermes, and Zephyr) with respect to effectiveness and
time efficiency across the three strategies of EM prompts
discussed in Section 4. Note that among the few-shot and
domain-specific variants, for each LLM we only consider
the one with the highest F-Measure in both datasets. Their
performance is reported in Table 2.</p>
        <p>For Orca2, we observe that the FT few-shot prompts are
the top performers in D1. The atomic domain-specific ones
follow at a very close distance in terms of F-Measure, while
exhibiting a much lower run-time. This means that the
domain-specific prompts offer a significantly better balance
between effectiveness and time efficiency. In D2, this
strategy scores the highest F-Measure for a slightly higher
run-time than the second best approach (zero-shot prompts).
For these reasons, Orca2 works best in combination with
the atomic domain-specific prompts.</p>
        <p>Regarding OpenHermes, the differences between the
three types of prompts are minor in terms of F-Measure.
As expected, the fastest approach in both datasets
corresponds to the zero-shot prompts. This configuration also
achieves the highest F-Measure in D1, while in D2, it ranks
second, within a negligible distance from the top (&lt;0.5%).
Therefore, we can conclude that the zero-shot prompts are
the best choice for OpenHermes.</p>
        <p>For Zephyr, there is a clear winner in the case of D1: the
intersection of few-shot prompts. It exhibits, though, the
highest run-time by a large extent. This is expected, as it
queries the LLM twice per candidate pair. In the case of D2,
the same strategy takes a minor lead over the composite
domain-specific prompts.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.5. Comparison to Baselines</title>
        <p>
          To put the performance of the selected 7B LLMs into
perspective, we compare it with three state-of-the-art EM
approaches from the literature:
1. ZeroER [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], an unsupervised approach that requires
no labelled datasets, learning Gaussian mixture
models for matching and non-matching candidate pairs.
2. Magellan [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], a supervised approach combining
binary classifiers with a series of hand-crafted features
based on string similarity measures.
3. DeepMatcher [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], a framework leveraging the
synergy between language models and Deep Learning
classification.
        </p>
        <p>For each method, we consider its best performance as
reported in the literature. The results are reported in Table 3.</p>
        <p>We observe mixed patterns. In D1, all LLM
configurations in Table 2, even the zero-shot prompts, outperform
all three baseline methods to a significant extent (&gt;21%).
This is remarkable, because the simplest prompt strategy
requires neither domain expertise nor the labeling of
candidate pairs, unlike Magellan and DeepMatcher, whose
performance is derived from large training and validation sets,
which amount to 60% and 20% of all candidate pairs, respectively.</p>
        <p>The situation is reversed in D2, where all baseline
methods achieve a much better performance. In fact, the highest
F-measure of Orca2 is lower by 16.5% than the worst baseline
(ZeroER). This should be attributed to the more challenging
settings of D2, which have already been discussed in
Section 5.1. Note also that the records in D2 are noisier, with
a much higher portion of missing values. Its records are
also longer, an aspect that is crucial for the 7B LLMs we are
considering in this study, due to their limited attention
window. These settings favor the learning-based functionality
of the baseline methods, which take a clear lead over the
learning-free functionality of 7B LLMs. Another reason for
the poor performance of the latter is that they emphasize
recall at the expense of precision, significantly decreasing
their F-Measure in D2, due to the very low portion of matches
in comparison to the total number of entities from each data
source. Therefore, more advanced strategies are required
for boosting the performance of 7B LLMs in datasets with
characteristics similar to those of D2.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions &amp; Future Work</title>
      <p>Focusing on 7B open-source LLMs, we examined the
performance of three main prompt strategies: (i) the basic,
domain-agnostic zero-shot prompt, (ii) the few-shot prompt
with one example per type of matches, and (iii) the
domain-specific zero-shot prompt. We considered several variants
for the last two strategies and applied all of them on two
established benchmark datasets for product matching. Testing
six popular LLMs, we reached the following conclusions:
• Few-shot and domain-specific prompting significantly
improve upon the zero-shot approaches,
highlighting the value of task-specific prompts.
• In few-shot prompts, the response of LLMs is generally
sensitive to the order of examples. This suggests that careful
prompt engineering is crucial for optimal performance in
real-world ER applications.
• This sensitivity can be addressed by the intersection
approach to few-shot prompting, which consistently
achieves much better results, increasing precision at a
higher rate than it reduces recall.
• Orca2 consistently outperformed the other LLMs across
most prompting strategies and datasets, demonstrating
high robustness and effectiveness. In fact, the relative
performance of the best models (Orca2 &gt; OpenHermes &gt;
Zephyr) remained largely consistent across prompt
strategies and datasets, suggesting inherent strengths in their
base architectures.
• The use of 4-bit quantization and 7B parameter models
demonstrated the potential for effective EM with limited
computational resources. The effectiveness of the
considered models is competitive with established,
learning-based EM approaches, especially in datasets with a low
portion of missing values and short entity descriptions.</p>
      <p>In the future, we plan to explore LLMs’ capability in
matching entities across different languages and to enhance
the interpretability and explainability of LLM decisions.
Acknowledgments. This work was partially funded by
the EU project STELAR (Horizon Europe – Grant No.
101070122).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <article-title>Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection</article-title>
          , Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ioannou</surname>
          </string-name>
          , E. Thanos, T. Palpanas, The Four Generations of Entity Resolution, Morgan &amp; Claypool Publishers,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Big data integration</article-title>
          ,
          <source>in: ICDE</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1245</fpage>
          -
          <lpage>1248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          , G. Papadakis,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <article-title>An overview of end-to-end entity resolution for big data</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>53</volume>
          (
          <year>2021</year>
          )
          <volume>127</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
          :
          <fpage>42</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <article-title>Entity resolution in the web of data</article-title>
          ,
          <source>in: 23rd International World Wide Web Conference</source>
          , WWW,
          <year>2014</year>
          , pp.
          <fpage>203</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Building a broad knowledge graph for products</article-title>
          ,
          <source>in: ICDE</source>
          ,
          <year>2019</year>
          , p.
          <fpage>25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <article-title>A survey of indexing techniques for scalable record linkage and deduplication</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>24</volume>
          (
          <year>2012</year>
          )
          <fpage>1537</fpage>
          -
          <lpage>1555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Skoutas</surname>
          </string-name>
          , E. Thanos,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <article-title>A survey of blocking and filtering techniques for entity resolution</article-title>
          , CoRR abs/1905.06167 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Verykios</surname>
          </string-name>
          ,
          <article-title>Duplicate record detection: A survey</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>19</volume>
          (
          <year>2007</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jurek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chi</surname>
          </string-name>
          , W. Liu,
          <article-title>A novel ensemble learning approach to unsupervised record linkage</article-title>
          ,
          <source>Inf. Syst</source>
          .
          <volume>71</volume>
          (
          <year>2017</year>
          )
          <fpage>40</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <article-title>Automatic record linkage using seeded nearest neighbour and support vector machine classification</article-title>
          ,
          <source>in: SIGKDD</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Active learning based entity resolution using markov logic</article-title>
          ,
          <source>in: PAKDD</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>338</fpage>
          -
          <lpage>349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , E. Ioannou, T. Palpanas,
          <article-title>Entity resolution: Past, present and yet-to-come</article-title>
          ,
          <source>in: EDBT</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>647</fpage>
          -
          <lpage>650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nikoletos</surname>
          </string-name>
          , E. Ioannou,
          <string-name>
            <surname>G. Papadakis,</surname>
          </string-name>
          <article-title>The five generations of entity resolution on web data</article-title>
          ,
          <source>in: ICWE</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>469</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Peeters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>Entity matching using large language models</article-title>
          ,
          <source>CoRR abs/2310</source>
          .11244 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Orr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>Can foundation models wrangle your data?</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>16</volume>
          (
          <year>2022</year>
          )
          <fpage>738</fpage>
          -
          <lpage>746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , X. Han,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          , L. Sun,
          <article-title>Match, compare, or select? an investigation of large language models for entity matching</article-title>
          ,
          <source>CoRR abs/2405</source>
          .16884 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fan</surname>
          </string-name>
          , X. Han,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Cost-effective in-context learning for entity resolution: A design space exploration</article-title>
          ,
          <source>CoRR abs/2312</source>
          .03987 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>CoRR abs/2307</source>
          .09288 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          ,
          <source>CoRR abs/2310</source>
          .06825 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Corro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          , et al.,
          <article-title>Orca 2: Teaching small language models how to reason</article-title>
          ,
          <source>CoRR abs/2311</source>
          .11045 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beeching</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          , et al.,
          <article-title>Zephyr: Direct distillation of LM alignment</article-title>
          ,
          <source>CoRR abs/2310</source>
          .16944 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Llama pro: Progressive llama with block expansion</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>6518</fpage>
          -
          <lpage>6537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nikoletos</surname>
          </string-name>
          , G. Papadakis,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>pyjedai: a lightsaber for link discovery</article-title>
          ,
          <source>in: ISWC Posters, Demos and Industry Tracks</source>
          , volume
          <volume>3254</volume>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Neuhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fisichella</surname>
          </string-name>
          , G. Papadakis,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nikoletos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Augsten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Nejdl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>Open benchmark for filtering techniques in entity resolution</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>33</volume>
          (
          <year>2024</year>
          )
          <fpage>1671</fpage>
          -
          <lpage>1696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawlani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <article-title>ZeroER: Entity resolution using zero labeled examples</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1149</fpage>
          -
          <lpage>1164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kirielle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <article-title>A critical re-evaluation of record linkage benchmarks for learning-based matching algorithms</article-title>
          , in: ICDE,
          <year>2024</year>
          , pp.
          <fpage>3435</fpage>
          -
          <lpage>3448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mudgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          , G. Krishnan,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arcaute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          ,
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Konda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          , et al.,
          <article-title>Magellan: Toward building entity matching management systems</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          (
          <year>2016</year>
          )
          <fpage>1197</fpage>
          -
          <lpage>1208</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>