=Paper=
{{Paper
|id=Vol-3931/paper4
|storemode=property
|title=Entity Matching with 7B LLMs: A Study on Prompting Strategies and Hardware Limitations
|pdfUrl=https://ceur-ws.org/Vol-3931/paper4.pdf
|volume=Vol-3931
|authors=Ioannis Arvanitis-Kasinikos,George Papadakis
|dblpUrl=https://dblp.org/rec/conf/dolap/Arvanitis-Kasinikos25
}}
==Entity Matching with 7B LLMs: A Study on Prompting Strategies and Hardware Limitations==
Ioannis Arvanitis-Kasinikos, George Papadakis
National and Kapodistrian University of Athens, Greece
Abstract

Entity Matching (EM) is a fundamental task in data management, involving the identification and linking of records that refer to the same real-world entity across different datasets. While Large Language Models (LLMs) have shown promise in addressing complex natural language processing tasks, their substantial computational requirements often limit their practical applicability. In this work, we investigate the use of 7B parameter LLMs with 4-bit quantization for EM tasks executable on commodity hardware. We explore various prompting strategies, including zero-shot, few-shot, and general matching definition prompts, to evaluate their effectiveness in improving EM accuracy. Experiments are conducted on two benchmark product datasets, whose product descriptions present varying levels of complexity and challenge. Our findings demonstrate that 7B parameter LLMs can effectively perform EM, with the Orca2 model consistently outperforming others across different prompting strategies and datasets. The study highlights that few-shot prompting significantly enhances performance over zero-shot approaches, emphasizing the importance of task-specific examples and careful prompt design. We also examine the impact of example order in few-shot prompts and find that it has a substantial effect on model performance. Finally, we examine hardware limitations, demonstrating that effective EM can be achieved with resource-constrained models.

Keywords

Entity Matching, 7B LLMs, Zero-Shot Prompts, Few-Shot Prompts
1. Introduction
Entity Resolution (ER) constitutes a vital task in data management that involves identifying and linking records from different datasets that refer to the same real-world entity [1, 2]. In many domains, including e-commerce, healthcare, and finance, accurate ER is essential for ensuring data quality, enabling effective data integration, and supporting informed decision-making [3]. However, this task is challenging due to data inconsistencies, incompleteness, and ambiguity across different sources [4, 5].

As an example, consider the product descriptions in Figure 1. Despite corresponding to the same object (Sony headphones), there are significant variations in product names, attributes, and dimensions. These discrepancies illustrate the challenges in reconciling variations across datasets, particularly when dealing with unstructured text and linguistic differences. Accurate ER in scenarios like this is crucial for product catalog integration, price comparison, and recommendation systems [6].

Figure 1: Two records with major differences describing the same product.

Due to its quadratic time complexity, ER solutions typically implement the Filtering-Verification framework [7]. The Filtering step, often called Blocking, significantly reduces the computational cost to the most similar candidate pairs, which are the most likely matches [8]. The Verification step performs Entity Matching (EM), which essentially determines whether two records are duplicates, describing the same real-world object. In the following, we exclusively focus on EM.

Traditional EM solutions typically rely on rule-based approaches, string similarity metrics, or machine learning algorithms [9, 10, 11]. However, these methods can struggle with complex linguistic variations and contextual understanding, while requiring domain expertise and heavy human involvement [12]. This is addressed by more recent state-of-the-art approaches that leverage deep learning (DL) techniques [13]. However, they require substantial amounts of training data, which are rarely available.

Recent advancements in NLP, particularly in Large Language Models (LLMs), offer new possibilities for addressing EM challenges [14, 15]. LLMs possess advanced capabilities for natural language understanding, which allows them to process and interpret complex textual descriptions [16]. Most importantly, LLM-based EM can be performed in zero-shot settings, requiring no training instances, a characteristic particularly attractive for out-of-the-box solutions.

In this work, we evaluate the performance of 7B parameter LLMs in entity matching tasks. While larger LLMs with hundreds of billions of parameters have shown impressive results [15, 16], their computational requirements often make them impractical for many real-world applications. By employing these LLMs, which excel in natural language understanding and semantic similarity assessment, this work seeks to address EM challenges in real-world datasets with linguistic variations and unstructured text, while also highlighting their suitability for execution on commodity hardware. The focus on 7B parameter LLMs is motivated by their potential for efficient deployment on commodity hardware, making them more suitable for practical applications.

To this end, we perform an extensive experimental evaluation that considers the models' ability to handle different types of EM scenarios. We explore novel zero-shot, few-shot, and general matching definition prompting strategies to assess their effectiveness in improving matching accuracy. Our goal is to bridge the gap between the advanced capabilities of LLMs and the practical constraints of real-world EM applications, potentially paving the way for more efficient and accurate ER techniques in diverse domains.

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
cs1180001@di.uoa.gr (I. Arvanitis-Kasinikos); gpapadis@di.uoa.gr (G. Papadakis)
https://gpapadis.wordpress.com (G. Papadakis)
ORCID: 0000-0002-7298-9431 (G. Papadakis)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2. Related Work

There is a plethora of recent LLM-based EM methods, because LLMs offer several advantages over traditional EM solutions: (i) contextual understanding, as they understand the context and semantics of entity descriptions better than traditional string matching techniques; (ii) robustness, since LLMs are typically more capable of addressing variations in how entity information is expressed; (iii) zero-shot and few-shot learning, i.e., LLMs can accomplish EM tasks with no or minimal examples of matching decisions. These characteristics render LLMs ideal for most EM tasks, especially those with complex, unstructured product descriptions.

Figure 2: (a) The basic zero-shot EM prompt, and (b) its few-shot extension.

The seminal work on LLM-based EM [16] investigated the effectiveness of GPT3-175B in EM, focusing on three key parameters: (i) problem definition, exploring different phrasings such as “Are Product A and Product B the same?” or “Are Product A and Product B equivalent?”; (ii) in-context learning, comparing zero-shot with few-shot approaches, where the former involve prompts with no examples, while the latter involve a couple of examples, selected randomly or by experts; (iii) entity serialization, testing the use of all attributes or just a subset of them. Their experimental analysis led to the following conclusions: (i) few-shot learning significantly outperforms zero-shot approaches, (ii) attribute selection yields better results than using all attributes, (iii) problem definition has a substantial impact on performance, (iv) LLM performance is comparable to the state-of-the-art DL-based matching algorithms.

A detailed study was conducted in [15], using six LLMs, three hosted and three open-source ones. The experiments explored additional parameters such as problem definition, language complexity, output specification, entity serialization, in-context learning, instructions, and fine-tuning. The experimental results revealed that: (i) no single prompt consistently outperformed all others across different scenarios; (ii) open-source LLMs showed comparable effectiveness to hosted models; (iii) LLMs performed competitively with deep learning-based matchers, even in zero-shot settings; (iv) few-shot and instruction-based prompts generally outperformed zero-shot approaches; (v) fine-tuning significantly improved effectiveness.

In another line of research, three distinct prompting strategies were explored in [17]: (i) Match prompts, which contain traditional pair-wise questions, e.g., “Do these two records refer to the same real-world entity? Record 1: [details]. Record 2: [details].” (ii) Comparison prompts, which ask for the most similar entity to a given reference, e.g., “Which of these two records is more consistent with the given record? Given Record: [details]. (A) Record 1: [details]. (B) Record 2: [details].” (iii) Selection prompts, which identify a matching entity from a set of candidates, e.g., “Select a record from the following list that refers to the same real-world entity as the given record: Given Record: [details]. Options: 1. [details] 2. [details] 3. [details]...” The experimental results show that incorporating record interactions through the comparison and selection prompts significantly improves EM performance across various scenarios; among the two, the selection prompts are the top performers in most cases. However, they suffer from position bias, because their accuracy decreases when the duplicate record is placed lower in the list of candidates.

BatchER [18] aims to reduce the costs for hosted LLMs through batch processing, exploring various methods for question batching and demonstration selection. The experimental results demonstrate that batch prompting outperforms match prompts in both effectiveness and cost, with the top performance achieved by diversity-based question batching combined with covering-based demonstration selection.

These studies collectively demonstrate the potential of LLMs in entity matching tasks, highlighting the importance of prompt engineering, the competitiveness of open-source models, and the effectiveness of batching strategies for improved efficiency. This work builds upon and extends the existing ones by focusing specifically on 7B parameter LLMs with 4-bit quantization. Unlike previous studies that primarily use larger, more resource-intensive models, our work explores the potential of smaller and more accessible LLMs for EM tasks. In this context, we perform a comprehensive evaluation of various novel prompting strategies, including zero-shot, few-shot, and general matching definition approaches, across multiple models and datasets. This approach offers insights into the practical applicability of LLMs in resource-constrained environments, bridging the gap between advanced language models and real-world EM challenges.

3. Problem Definition

Applied after Filtering, Entity Matching is typically formulated as a binary classification problem [3, 4]. More formally: given two records 𝑟1 and 𝑟2, the task is to determine whether they refer to the same entity. This is often expressed as a function 𝑓(𝑟1, 𝑟2) → {0, 1}, where 1 indicates a match (also called duplicate) and 0 indicates a non-match.

In LLM-based settings, EM is framed as a natural language inference task. The LLM is provided with descriptions of two records and asked to determine if they refer to the same entity, returning “True” for a match and “False” otherwise. In all cases, EM performance is measured with respect to:

• Precision, i.e., the proportion of correctly identified matches out of all predicted matches.
• Recall, i.e., the proportion of correctly identified matches out of all actual matches.
• F-measure, i.e., the harmonic mean of precision and recall, providing a balanced measure of performance.
• Run-time, i.e., the time taken to complete the ER process.

The first three measures are defined in [0, 1], with higher values indicating higher effectiveness. For the last one, lower values indicate higher time efficiency.
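The first three measures can be computed directly from the sets of predicted and ground-truth matching pairs. A minimal, illustrative sketch (function and variable names are ours, not taken from the paper's codebase):

```python
def evaluate_em(predicted: set, actual: set) -> dict:
    """Compute Precision, Recall and F-measure for a set of predicted
    matching pairs against the ground-truth matches."""
    tp = len(predicted & actual)  # correctly identified matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f_measure": f1}

# Toy example: 3 predicted pairs, 4 true matches, 2 correct predictions.
scores = evaluate_em({("a1", "b1"), ("a2", "b2"), ("a3", "b9")},
                     {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")})
```

Here the F-measure evaluates to 4/7 ≈ 0.571, illustrating how it penalizes the imbalance between a precision of 2/3 and a recall of 1/2.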
4. EM Prompts

We now present the EM prompts that are examined in our work. The basic prompt is presented in Figure 2(a). It consists of an instruction that describes the input and the desired output. It lacks any examples and thus constitutes a zero-shot EM prompt, which tests the model's ability to generalize to new tasks or domains it has not been trained on.

A concise few-shot EM prompt extends the zero-shot one with the examples in Figure 2(b). To provide a balanced context, there are two examples that include a pair of matching entities and a pair of non-matching ones. These examples serve as a form of weak supervision, allowing the LLM to learn from the provided instances and generalize to similar cases. Note that the examples in Figure 2(b) have been carefully selected from dataset 𝐷1 (see Table 1) so that they capture typical variations in product descriptions that are encountered in the full dataset.

Note that LLM responses to few-shot prompts suffer from position bias [17], because the order of examples in the EM prompt might alter the matching decision. This means that in the example of Figure 2(b), the response for a specific candidate pair might be True (i.e., matching) if the positive example precedes the negative one and False (i.e., non-matching) otherwise. For this reason, we define two types of few-shot prompts:

1. TF, where the True example is followed by the False one, as in Figure 2(b).
2. FT, where the False example is followed by the True one.

Note that with multiple examples per prompt, as in [17], more arrangements are possible. In this work, though, we exclusively consider the two variations of the few-shot EM prompt that involve one example per match type.

To increase the robustness of LLMs to few-shot EM prompts, we consider two matching approaches for each candidate pair, querying with both the TF and FT prompts:

1. The union approach labels a candidate pair as True if either the TF or the FT prompt results in a True response.
2. The intersection approach labels a candidate pair as True only if both the TF and FT prompts yield a True response.

4.1. Domain-specific Zero-Shot Prompts

The above prompts are generic enough to apply to any domain. In our experimental analysis, we also consider domain-specific ones, which are crafted for the product matching task. More specifically, we devise a zero-shot prompt that involves general matching definitions, providing the LLM with explicit guidance on how to determine if two records refer to the same product.

Figure 3: Domain-specific, zero-shot EM prompt for product matching.

The core assumption of this approach is that the records are described by a clean, aligned schema. This is necessary for building a schema-aware generic definition of duplicate records. In the product matching task, we use four key product attributes: (i) product name, (ii) features, (iii) manufacturer, and (iv) model number. We use them in two different configurations:

1. The composite domain-specific EM prompt concatenates all four criteria in the above sequence, as in Figure 3. The goal is to facilitate more nuanced matching decisions.
2. The atomic domain-specific EM prompt uses only the model number as the matching criterion. We selected this attribute because it provides the cleanest and most distinctive values.

These two configurations were chosen after preliminary tests that suggested that they yield the best performance among all other combinations of these four attributes.

5. Experimental Analysis

Experimental Settings. All experiments were implemented in Python v3.12.0 and Ollama¹ v0.1.22. All experiments were carried out on a server running Ubuntu 22.04.1 LTS, equipped with an Intel Core i7-9700K 8-core CPU @ 3.6 GHz, 32GB RAM and an NVIDIA GeForce GTX 1080 Ti with 11GB of VRAM.

Due to the limited size of the available VRAM, our study focuses on 7-billion-parameter LLMs with optimizations such as quantization, which in our case replaces the 32-bit floating-point model weights with 4-bit integers. This reduces the model size, while maintaining reasonable performance levels. In other words, quantization lowers effectiveness, due to the fewer parameters and the lower precision of the model's weights, but significantly reduces run-times and memory consumption. Therefore, our experimental results are useful for resource-constrained applications, which run LLMs on commodity hardware.

LLMs. There is a plethora of open-source LLMs, with newer models introduced on a rather frequent basis. During our study, two models were quite popular: Llama 2 [19], with 7B parameters and a context length of 4096, as well as Mistral [20], with 7.3B parameters. However, preliminary experiments demonstrated that both of them were inappropriate for the EM tasks considered in this work. Llama 2 consistently responded with “True” for every candidate pair, while Mistral failed to provide a response according to the given instructions: it indicated an inability to respond in certain cases or gave explanations for its decisions instead of a “True” or “False” label.

In their place, we considered the following open-source models, which demonstrated high effectiveness in our preliminary experiments:

1. Orca2 [21]. Built by Microsoft Research, Orca2 is a family of models fine-tuned on Meta's Llama 2 using synthetic data.
2. OpenHermes². This is a Mistral 7B model fine-tuned with fully open datasets, showcasing strong multi-turn chat

¹ https://ollama.com
² https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B
Dataset #Entities Duplicates Cartesian Product #Attributes Candidate Pairs Bl.Recall Bl.Precision
𝐷1 1,076-1,076 1,076 1.16×10⁶ 3 4,345 0.924 0.229
𝐷2 2,554-22,074 853 5.64×10⁷ 6 5,163 0.910 0.150
Table 1
Technical characteristics of the datasets used in the experimental analysis.
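The blocking quality figures in Table 1 are internally consistent: blocking precision equals the number of duplicates retained by blocking (blocking recall times the number of duplicates) divided by the number of candidate pairs. A quick sketch verifying this arithmetic (our own check, not code from the paper):

```python
def blocking_precision(bl_recall: float, duplicates: int, candidates: int) -> float:
    """Duplicate pairs surviving blocking, divided by all candidate pairs."""
    retained = bl_recall * duplicates  # matching pairs kept by blocking
    return retained / candidates

# D1 (Abt-Buy): 0.924 * 1,076 retained duplicates over 4,345 candidates -> ~0.229
p1 = blocking_precision(0.924, 1076, 4345)
# D2 (Walmart-Amazon): 0.910 * 853 over 5,163 candidates -> ~0.150
p2 = blocking_precision(0.910, 853, 5163)
```

Both values reproduce the Bl.Precision column of Table 1 up to rounding.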
Figure 4: Effectiveness of the zero-shot prompt in Figure 2(a) on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right).
skills and system prompt capabilities. It surpasses all previous versions of Nous-Hermes 13B and below.
3. Zephyr [22]. A 7B parameter model fine-tuned on Mistral, it achieves results similar to Llama 2 70B Chat in various benchmarks. It is trained on a distilled dataset, improving grammar and chat results.
4. Mistral-OpenOrca³. This is a 7B parameter model, fine-tuned on top of Mistral 7B using the OpenOrca dataset.
5. Stable-Beluga⁴. This is a Llama 2 based model fine-tuned on an Orca-style dataset.
6. Llama-Pro [23]. An 8B parameter expansion of Llama 2 that specializes in integrating both general language understanding and domain-specific knowledge, particularly in programming and mathematics.

In all cases, we use the default latest model with 4-bit quantization and 7B parameters.

Datasets. We used two real-world datasets with products that are widely used in the ER literature: (i) 𝐷1 is the Abt-Buy dataset, which comprises product listings from two online retailers, Abt Electronics and Buy.com. (ii) 𝐷2 is the Walmart-Amazon dataset, which contains product listings from two other online retailers, Walmart and Amazon. 𝐷1 primarily focuses on electronic products, while 𝐷2 covers a broader range of product categories, matching diverse entity types. Both datasets present important challenges, such as variations in product names and descriptions across retailers, inconsistent use of model numbers and other identifiers, differences in the level of detail provided for each product, variations in formatting and units (e.g., dimensions, weights), as well as missing or null values in certain fields.

Their technical characteristics are summarized in Table 1. Note that each dataset comprises two individually clean data sources, whose sizes are reported in column #Entities. Note also that we apply the prompts to the candidate pairs generated by a state-of-the-art blocking method implemented by PyJedAI [24], version 0.1.6. Following [25], we employ kNN-Join, which identifies the 𝑘 nearest neighbors of each entity. We fine-tuned it, maximizing blocking precision for a blocking recall of at least 90%, as reported in the rightmost columns of Table 1. This configuration uses cleaning (i.e., stop-word removal and stemming) and cosine similarity in both datasets. For Abt-Buy, 𝑘 was set to 4, while the attribute values were converted into a multiset of character trigrams. For Walmart-Amazon, 𝑘 was set to 2, while the attribute values were converted into a multiset of character four-grams.

³ https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca
⁴ https://huggingface.co/stabilityai/StableBeluga2

5.1. Zero-Shot Prompting Results

We now examine the relative performance of the selected LLMs over 𝐷1 and 𝐷2, when coupled with the basic zero-shot EM prompt of Figure 2(a).

We observe that Orca2, OpenHermes, and Zephyr consistently rank as the top three models with respect to F-Measure in both datasets. The last two models switch their ranking positions in the two datasets, whereas Orca2 maintains the lead. The superior performance of Orca2, which demonstrates its robustness under diverse EM settings, can be attributed to its fine-tuning on synthetic data designed for reasoning tasks. This enhances its capability to understand and compare complex product descriptions. OpenHermes is fine-tuned on fully open datasets with strong multi-turn chat skills, leveraging advanced language understanding to perform well. Zephyr's competitive performance probably results from its training on a distilled dataset that improves grammar and chat results, aiding in better interpretation of entity attributes. The lower performance of Mistral-OpenOrca, Stable-Beluga, and Llama-Pro is probably due to their less specialized training data or smaller model capacities for the specific nuances of EM.

Note that all models exhibit much higher recall than precision in both datasets. This means that they are prone to label a candidate pair as matching, at the cost of introducing numerous false positives. Orca2 consistently exhibits the highest precision, thus yielding the highest F-Measure, too. Note also that all models exhibit markedly lower effectiveness in 𝐷2 compared to 𝐷1. This suggests that 𝐷2 presents
Figure 5: Effectiveness of the few-shot prompts in Figure 2(b) on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right). From top to bottom, the TF prompts are presented first, followed by the FT prompts, the Union and the Intersection approaches.
greater EM challenges, potentially due to more diverse or complex product descriptions. While 𝐷1 is restricted to electronics, 𝐷2 covers a broader range of products and includes more variation in descriptions, attributes, and data quality, rendering EM more difficult. Furthermore, 𝐷1 has a 1:1 matching between its two data sources, whereas 𝐷2 has a much lower ratio of matches, adding another layer of complexity to the task. The substantial performance gap between 𝐷1 and 𝐷2 underscores the significant impact of data characteristics on model effectiveness.

5.2. Few-Shot Prompting Results

We now examine the performance of the aforementioned few-shot prompts over 𝐷1 and 𝐷2. We disregard Mistral-OpenOrca, Stable-Beluga, and Llama-Pro, because they exhibited significantly lower effectiveness and less consistent performance in the zero-shot experiments; preliminary experiments verified their poor performance in few-shot settings, too. For brevity, we focus on the top three performing models, namely Orca2, OpenHermes, and Zephyr.

The results are reported in Figure 5. Based on preliminary experiments, we randomly select the examples included in the few-shot prompts from the candidate pairs of the same dataset. The same examples are used in all prompts issued on a particular dataset.

In both datasets, we observe the same patterns as regards the relative performance of the TF and FT few-shot prompts: for Orca2, there is a substantial improvement when using the latter; OpenHermes is more robust to position bias, as there is no significant difference between the two prompt strategies; Zephyr works best when coupled with the TF few-shot prompts. These patterns highlight that the impact of position bias on each model is consistent across the two datasets. Note also that, with the exception of Orca2 with TF prompts, all models achieve higher recall than precision, remaining more prone to label a candidate pair as matching.

It is also interesting to compare the union approach with the intersection one. For OpenHermes and Zephyr, the latter yields significantly higher F-Measure: by considering as duplicates only the candidate pairs that are marked as matching by both the TF and FT few-shot prompts, the reduction in recall is much lower than the increase in precision (as a result, recall remains much higher than precision for both models). This means that considering only the common matches of the TF and FT prompts leads to more accurate performance. Note that these patterns are consistent for both models over both datasets.

This is not the case with Orca2, whose performance varies significantly across the two datasets. In 𝐷1, the same F1 score is achieved for both approaches, because the intersection raises precision by 12%, while reducing recall to the same degree. In 𝐷2, though, the intersection reduces recall by 23% and increases precision by 16%, thus yielding a much lower F-Measure. Note that in both datasets, the recall of the model gets lower than its precision in combination with the intersection approach, unlike the union one.

Overall, we can conclude that Orca2 works best when coupled with the FT few-shot prompts, while OpenHermes and Zephyr maximize their effectiveness when intersecting the matches of the TF and FT prompts. Among them, the top performers over 𝐷1 and 𝐷2 are Orca2 (F1=0.799) and Zephyr (F1=0.531), respectively.

5.3. Domain-specific Zero-Shot Prompting Results

In this section, we compare the atomic domain-specific prompt with the composite one. As in Section 5.2, we exclusively consider the three top performing models with respect to the zero-shot prompts: Orca2, OpenHermes, and Zephyr. Their performance is reported in Figure 6.

We observe that in almost all cases, the atomic prompt outperforms the composite one to a significant extent; the only exception corresponds to Zephyr in 𝐷1, where the composite prompt increases F-Measure by almost 15%. This pattern should be attributed to the short, distinctive and clean values provided by the model number. This way, it reduces the noise from other product attributes like the product name, which are typically associated with long and diverse texts.

Similar to the above strategies, all LLMs exhibit much higher recall than precision. This means that they remain prone to mark a candidate pair as a match at the cost of introducing false positives, a behavior that permeates all the prompt strategies we have examined.

Among the three models, Orca2 is consistently better, albeit to a minor extent in 𝐷2. This consistent performance underscores Orca2's effectiveness in EM tasks under quite different prompt designs.

We can conclude that domain-specific zero-shot prompts offer an effective and reliable alternative in datasets with a clean schema of known characteristics.
Figure 6: Effectiveness of the atomic and composite domain-specific zero-shot prompts on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right).
𝐷1 𝐷2
Prompt Strategy
Precision Recall F-Measure Run-time Precision Recall F-Measure Run-time
Zero-shot 0.664 0.956 0.784 32 min 0.397 0.740 0.517 23 min
FT Few-shot 0.768 0.834 0.799 41 min 0.420 0.515 0.463 33 min
Atomic Domain-specific 0.689 0.934 0.793 33 min 0.434 0.708 0.538 25 min
(a) Orca2
Zero-shot 0.584 0.963 0.727 31 min 0.309 0.864 0.455 23 min
Intersection Few-shot 0.683 0.718 0.700 40 min 0.378 0.585 0.459 33 min
Atomic Domain-specific 0.556 0.969 0.707 33 min 0.306 0.876 0.453 25 min
(b) OpenHermes
Zero-shot 0.572 0.965 0.718 32 min 0.329 0.942 0.488 24 min
Intersection Few-shot 0.667 0.877 0.757 43 min 0.408 0.761 0.531 34 min
Composite Domain-specific 0.573 0.960 0.718 39 min 0.372 0.913 0.529 30 min
(c) Zephyr
Table 2
Best performance per LLM in combination with the top performing variant per prompt strategy across both datasets.
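The TF/FT robustness scheme behind the "Intersection Few-shot" rows of Table 2 queries the model twice per candidate pair, once per example order, and then merges the two answers. A minimal sketch of that combination logic; the `ask` stub stands in for an actual call to a local Ollama model and is our own placeholder, not the authors' code:

```python
def match_with_order_robustness(pair, ask, combine: str = "intersection") -> bool:
    """Query an LLM with both example orders (TF and FT) and merge the answers.

    `ask(pair, order)` must return True/False for one prompt variant; in the
    actual experiments it would wrap a request to a locally served model.
    """
    tf_answer = ask(pair, "TF")  # True example first, then False
    ft_answer = ask(pair, "FT")  # False example first, then True
    if combine == "union":
        return tf_answer or ft_answer  # match if either order says True
    return tf_answer and ft_answer     # match only if both orders agree

# Stubbed demo: a model that flips its answer depending on example order.
flaky_answers = {"TF": True, "FT": False}
ask_stub = lambda pair, order: flaky_answers[order]
union_label = match_with_order_robustness(("r1", "r2"), ask_stub, "union")
inter_label = match_with_order_robustness(("r1", "r2"), ask_stub, "intersection")
```

The stub makes the trade-off visible: the union keeps the pair as a match (favoring recall), while the intersection rejects it (favoring precision), mirroring the behavior discussed in Section 5.2.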
5.4. Comparison of Prompting Strategies

We now compare the three top-performing models (Orca2, OpenHermes, and Zephyr) with respect to effectiveness and time efficiency across the three strategies of EM prompts discussed in Section 4. Note that among the few-shot and domain-specific variants, for each LLM we only consider the one with the highest F-Measure in both datasets. Their performance is reported in Table 2.

For Orca2, we observe that the FT few-shot prompts are the top performers in 𝐷1. The atomic domain-specific ones follow at a very close distance in terms of F-Measure, while exhibiting a much lower run-time. This means that the domain-specific prompts offer a significantly better balance between effectiveness and time efficiency. In 𝐷2, this strategy scores the highest F-Measure for a slightly higher run-time than the second best approach (zero-shot prompts). For these reasons, Orca2 works best in combination with the atomic domain-specific prompts.

Regarding OpenHermes, the differences between the three types of prompts are minor in terms of F-Measure. As expected, the fastest approach in both datasets corresponds to the zero-shot prompts. This configuration also achieves the highest F-Measure in 𝐷1, while in 𝐷2, it ranks second, within a negligible distance from the top (<0.5%). Therefore, we can conclude that the zero-shot prompts are the best choice for OpenHermes.

For Zephyr, there is a clear winner in the case of 𝐷1: the intersection of the few-shot prompts. It exhibits, though, the highest run-time by a large extent. This is expected, as it queries the LLM twice per candidate pair. In the case of 𝐷2, the same strategy takes a minor lead over the composite domain-specific prompts, which are faster by more than 10%. Due to its consistency, the best choice for Zephyr corresponds to the intersection of the TF and FT few-shot prompts.

Among the three 7B LLMs, the configuration consistently achieving (almost) the highest effectiveness in both datasets is Orca2 coupled with the atomic domain-specific prompts. Its efficiency is also rather high, given that its run-time is marginally higher than that of the fastest (zero-shot) configuration of the other two models.

Method 𝐷1 Source 𝐷2 Source
ZeroER 0.520 [26] 0.644 [27]
Magellan 0.436 [28] 0.719 [28]
DeepMatcher 0.628 [28] 0.669 [28]
Table 3
The F-Measure per dataset reported in the literature for three state-of-the-art EM algorithms.

5.5. Comparison to Baselines

To put the performance of the selected 7B LLMs into perspective, we compare it with three state-of-the-art EM approaches from the literature:

1. ZeroER [26], an unsupervised approach that requires no labelled datasets, learning Gaussian mixture models for matching and non-matching candidate pairs.

2. Magellan [29], a supervised approach combining binary classifiers with a series of hand-crafted features based on string similarity measures.
3. DeepMatcher [28], a framework leveraging the synergy between language models and Deep Learning classification.

For each method, we consider its best performance as reported in the literature. The results are reported in Table 1.

We observe mixed patterns. In 𝐷1, all LLM configurations in Table 2, even the zero-shot prompts, outperform all three baseline methods to a significant extent (>21%). This is remarkable, because the simplest prompt strategy requires neither domain expertise nor the labeling of candidate pairs, unlike Magellan and DeepMatcher, whose performance is derived from large training and validation sets, which amount to 60% and 20% of all candidate pairs, respectively.
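To make this contrast concrete, the following sketch shows how the two simplest prompt types can be assembled for a candidate pair of product descriptions. The wording, records, and examples below are illustrative sketches, not the exact prompts used in our experiments.

```python
# Illustrative builders for the two simplest prompt strategies.
# The wording and the records are hypothetical, not the exact
# prompts used in our experiments.

def zero_shot_prompt(rec_a: str, rec_b: str) -> str:
    # Domain-agnostic: needs nothing beyond the two records.
    return (
        "Do the following two product descriptions refer to the same "
        "real-world product? Answer with Yes or No.\n"
        f"Product A: {rec_a}\n"
        f"Product B: {rec_b}\n"
        "Answer:"
    )

def few_shot_prompt(rec_a: str, rec_b: str, examples) -> str:
    # Prepends one labelled example per class (match / non-match);
    # the order of these examples affects the response of 7B LLMs.
    demos = "\n".join(
        f"Product A: {a}\nProduct B: {b}\nAnswer: {label}"
        for a, b, label in examples
    )
    return demos + "\n" + zero_shot_prompt(rec_a, rec_b)
```

The zero-shot variant requires neither domain expertise nor labelled candidate pairs; the few-shot variant needs only one labelled example per class, at the price of sensitivity to example order.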
The situation is reversed in 𝐷2, where all baseline methods achieve a much better performance. In fact, the highest F-Measure of Orca2 is lower by 16.5% than the worst baseline (ZeroER). This should be attributed to the more challenging settings of 𝐷2, which have already been discussed in Section 5.1. Note also that the records in 𝐷2 are noisier, with a much higher portion of missing values. Its records are also longer, an aspect that is crucial for the 7B LLMs we are considering in this study, due to their limited attention window. These settings favor the learning-based functionality of the baseline methods, which take a clear lead over the learning-free functionality of 7B LLMs. Another reason for the poor performance of the latter is that they emphasize recall at the expense of precision, significantly decreasing their F-Measure in 𝐷2, due to the very low portion of matches in comparison to the total number of entities from each data source. Therefore, more advanced strategies are required for boosting the performance of 7B LLMs in datasets with characteristics similar to those of 𝐷2.
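The interplay between class imbalance, recall, and precision can be sketched numerically. The counts below are hypothetical, chosen only to mimic the low match ratio of 𝐷2; they show how a recall-oriented matcher collapses in F-Measure, and why intersecting the positive predictions of two runs, as in the intersection approach to few-shot prompting, raises precision faster than it lowers recall.

```python
# Hypothetical counts illustrating why recall-oriented matching hurts the
# F-Measure under heavy class imbalance, and how intersecting the positive
# predictions of two runs (cf. the intersection approach to few-shot
# prompting) can recover precision at a modest cost in recall.

def f_measure(tp: int, fp: int, fn: int) -> float:
    # Equivalent to 2PR/(P+R) with P = tp/(tp+fp) and R = tp/(tp+fn).
    return 2 * tp / (2 * tp + fp + fn)

# 200 true matches hidden among thousands of candidate pairs.
single_run = f_measure(tp=180, fp=1200, fn=20)   # recall 0.90, precision ~0.13
intersected = f_measure(tp=160, fp=300, fn=40)   # recall 0.80, precision ~0.35

print(round(single_run, 3), round(intersected, 3))  # prints: 0.228 0.485
```

With only 200 true matches, the spurious positives of the single run dominate the F-Measure; cutting false positives by three quarters at a small recall cost more than doubles it.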
6. Conclusions & Future Work

Focusing on 7B open-source LLMs, we examined the performance of three main prompt strategies: (i) the basic, domain-agnostic zero-shot prompt, (ii) the few-shot prompt with one example per type of matches, and (iii) the domain-specific zero-shot prompt. We considered several variants for the last two strategies and applied all of them on two established benchmark datasets for product matching. Testing six popular LLMs, we reached the following conclusions:

• Few-shot and domain-specific prompting significantly improve the performance of the zero-shot approaches, highlighting the value of task-specific prompts.

• In few-shot prompts, the response of LLMs is generally sensitive to the order of the examples. This suggests that careful prompt engineering is crucial for optimal performance in real-world ER applications.

• This sensitivity can be addressed by the intersection approach to few-shot prompting, which consistently achieves much better results, increasing precision at a higher rate than it reduces recall.

• Orca2 consistently outperformed the other LLMs across most prompting strategies and datasets, demonstrating high robustness and effectiveness. In fact, the relative performance of the best models (Orca2 > OpenHermes > Zephyr) remained largely consistent across prompt strategies and datasets, suggesting inherent strengths in their base architectures.

• The use of 4-bit quantization and 7B parameter models demonstrated the potential for effective EM with limited computational resources. The effectiveness of the considered models is competitive with established, learning-based EM approaches, especially in datasets with a low portion of missing values and short entity descriptions.

In the future, we plan to explore LLMs' capability in matching entities across different languages and to enhance the interpretability and explainability of LLM decisions.

Acknowledgments. This work was partially funded by the EU project STELAR (Horizon Europe – Grant No. 101070122).

References

[1] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer, 2012.
[2] G. Papadakis, E. Ioannou, E. Thanos, T. Palpanas, The Four Generations of Entity Resolution, Morgan & Claypool Publishers, 2021.
[3] X. L. Dong, D. Srivastava, Big data integration, in: ICDE, 2013, pp. 1245–1248.
[4] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Comput. Surv. 53 (2021) 127:1–127:42.
[5] K. Stefanidis, V. Efthymiou, M. Herschel, V. Christophides, Entity resolution in the web of data, in: 23rd International World Wide Web Conference, WWW, 2014, pp. 203–204.
[6] X. L. Dong, Building a broad knowledge graph for products, in: ICDE, 2019, p. 25.
[7] P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE Trans. Knowl. Data Eng. 24 (2012) 1537–1555.
[8] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, A survey of blocking and filtering techniques for entity resolution, CoRR abs/1905.06167 (2019).
[9] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng. 19 (2007) 1–16.
[10] A. Jurek, J. Hong, Y. Chi, W. Liu, A novel ensemble learning approach to unsupervised record linkage, Inf. Syst. 71 (2017) 40–54.
[11] P. Christen, Automatic record linkage using seeded nearest neighbour and support vector machine classification, in: SIGKDD, 2008, pp. 151–159.
[12] J. Fisher, P. Christen, Q. Wang, Active learning based entity resolution using Markov logic, in: PAKDD, 2016, pp. 338–349.
[13] G. Papadakis, E. Ioannou, T. Palpanas, Entity resolution: Past, present and yet-to-come, in: EDBT, 2020, pp. 647–650.
[14] K. Nikoletos, E. Ioannou, G. Papadakis, The five generations of entity resolution on web data, in: ICWE, 2024, pp. 469–473.
[15] R. Peeters, C. Bizer, Entity matching using large language models, CoRR abs/2310.11244 (2023).
[16] A. Narayan, I. Chami, L. J. Orr, C. Ré, Can foundation models wrangle your data?, Proc. VLDB Endow. 16 (2022) 738–746.
[17] T. Wang, H. Lin, X. Chen, X. Han, H. Wang, Z. Zeng, L. Sun, Match, compare, or select? An investigation of large language models for entity matching, CoRR abs/2405.16884 (2024).
[18] M. Fan, X. Han, J. Fan, C. Chai, N. Tang, G. Li, X. Du, Cost-effective in-context learning for entity resolution: A design space exploration, CoRR abs/2312.03987 (2023).
[19] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, CoRR abs/2307.09288 (2023).
[20] A. Q. Jiang, A. Sablayrolles, A. Mensch, et al., Mistral 7B, CoRR abs/2310.06825 (2023).
[21] A. Mitra, L. D. Corro, S. Mahajan, et al., Orca 2: Teaching small language models how to reason, CoRR abs/2311.11045 (2023).
[22] L. Tunstall, E. Beeching, N. Lambert, et al., Zephyr: Direct distillation of LM alignment, CoRR abs/2310.16944 (2023).
[23] C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, Y. Shan, P. Luo, Llama pro: Progressive llama with block expansion, in: ACL, 2024, pp. 6518–6537.
[24] K. Nikoletos, G. Papadakis, M. Koubarakis, pyJedAI: a lightsaber for link discovery, in: ISWC Posters, Demos and Industry Tracks, volume 3254, 2022.
[25] F. Neuhof, M. Fisichella, G. Papadakis, K. Nikoletos, N. Augsten, W. Nejdl, M. Koubarakis, Open benchmark for filtering techniques in entity resolution, VLDB J. 33 (2024) 1671–1696.
[26] R. Wu, S. Chaba, S. Sawlani, X. Chu, S. Thirumuruganathan, ZeroER: Entity resolution using zero labeled examples, in: SIGMOD, 2020, pp. 1149–1164.
[27] G. Papadakis, N. Kirielle, P. Christen, T. Palpanas, A critical re-evaluation of record linkage benchmarks for learning-based matching algorithms, in: ICDE, 2024, pp. 3435–3448.
[28] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: SIGMOD, 2018, pp. 19–34.
[29] P. Konda, S. Das, et al., Magellan: Toward building entity matching management systems, Proc. VLDB Endow. 9 (2016) 1197–1208.