=Paper= {{Paper |id=Vol-3931/paper4 |storemode=property |title=Entity Matching with 7B LLMs: A Study on Prompting Strategies and Hardware Limitations |pdfUrl=https://ceur-ws.org/Vol-3931/paper4.pdf |volume=Vol-3931 |authors=Ioannis Arvanitis-Kasinikos,George Papadakis |dblpUrl=https://dblp.org/rec/conf/dolap/Arvanitis-Kasinikos25 }} ==Entity Matching with 7B LLMs: A Study on Prompting Strategies and Hardware Limitations== https://ceur-ws.org/Vol-3931/paper4.pdf
                         Entity Matching with 7B LLMs: A Study on Prompting Strategies
                         and Hardware Limitations
                         Ioannis Arvanitis-Kasinikos1 , George Papadakis1
                         1
                             National and Kapodistrian University of Athens, Greece


                                           Abstract
                                           Entity Matching (EM) is a fundamental task in data management, involving the identification and linking of records that refer to the
                                           same real-world entity across different datasets. While Large Language Models (LLMs) have shown promise in addressing complex
                                           natural language processing tasks, their substantial computational requirements often limit their practical applicability. In this work,
                                           we investigate the use of 7B parameter LLMs with 4-bit quantization for EM tasks executable on commodity hardware. We explore
                                           various prompting strategies, including zero-shot, few-shot, and general matching definition prompts, to evaluate their effectiveness
                                           in improving EM accuracy. Experiments are conducted on two benchmark product datasets, whose product descriptions
                                           present varying levels of complexity. Our findings demonstrate that 7B parameter LLMs can effectively perform EM, with
                                           the Orca2 model consistently outperforming others across different prompting strategies and datasets. The study highlights that few-shot
                                           prompting significantly enhances performance over zero-shot approaches, emphasizing the importance of task-specific examples and
                                           careful prompt design. We also examine the impact of example order in few-shot prompts and find that it has a substantial effect on model
                                           performance. Finally, we discuss hardware limitations, demonstrating that effective EM can be achieved with resource-constrained
                                           models.

                                           Keywords
                                           Entity Matching, 7B LLMs, Zero-Shot Prompts, Few-Shot Prompts



                         1. Introduction
Entity Resolution (ER) constitutes a vital task in data management that involves identifying and linking records from different datasets that refer to the same real-world entity [1, 2]. In many domains, including e-commerce, healthcare, and finance, accurate ER is essential for ensuring data quality, enabling effective data integration, and supporting informed decision-making [3]. However, this task is challenging due to data inconsistencies, incompleteness, and ambiguity across different sources [4, 5].

As an example, consider the product descriptions in Figure 1. Despite corresponding to the same object (Sony headphones), there are significant variations in product names, attributes, and dimensions. These discrepancies illustrate the challenges in reconciling variations across datasets, particularly when dealing with unstructured text and linguistic differences. Accurate ER in scenarios like this is crucial for product catalog integration, price comparison, and recommendation systems [6].

Figure 1: Two records with major differences describing the same product.

Due to its quadratic time complexity, ER solutions typically implement the Filtering-Verification framework [7]. The Filtering step, often called Blocking, significantly reduces the computational cost by restricting it to the most similar candidate pairs, which are the most likely matches [8]. The Verification step performs Entity Matching (EM), which essentially determines whether two records are duplicates, i.e., describe the same real-world object. In the following, we exclusively focus on EM.

Traditional EM solutions typically rely on rule-based approaches, string similarity metrics, or machine learning algorithms [9, 10, 11]. However, these methods can struggle with complex linguistic variations and contextual understanding, while requiring domain expertise and heavy human involvement [12]. This is addressed by more recent state-of-the-art approaches that leverage deep learning (DL) techniques [13]. However, they require substantial amounts of training data, which are rarely available.

Recent advancements in NLP, particularly in Large Language Models (LLMs), offer new possibilities for addressing EM challenges [14, 15]. LLMs possess advanced capabilities for natural language understanding, which allow them to process and interpret complex textual descriptions [16]. Most importantly, LLM-based EM can be performed in zero-shot settings, requiring no training instances, a characteristic particularly attractive for out-of-the-box solutions.

In this work, we evaluate the performance of 7B parameter LLMs in entity matching tasks. While larger LLMs with hundreds of billions of parameters have shown impressive results [15, 16], their computational requirements often make them impractical for many real-world applications. By employing 7B models, which excel in natural language understanding and semantic similarity assessment, this work seeks to address EM challenges in real-world datasets with linguistic variations and unstructured text, while also highlighting their suitability for execution on commodity hardware. This focus on 7B parameter LLMs is motivated by their potential for efficient deployment, making them more suitable for practical applications.

To this end, we perform an extensive experimental evaluation that considers the models' ability to handle different types of EM scenarios. We explore novel zero-shot, few-shot, and general matching definition prompting strategies to assess their effectiveness in improving matching accuracy. Our goal is to bridge the gap between the advanced capabilities of LLMs and the practical constraints of real-world EM applications, potentially paving the way for more efficient and accurate ER techniques in diverse domains.

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
Email: cs1180001@di.uoa.gr (I. Arvanitis-Kasinikos); gpapadis@di.uoa.gr (G. Papadakis)
Homepage: https://gpapadis.wordpress.com (G. Papadakis)
ORCID: 0000-0002-7298-9431 (G. Papadakis)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. Related Work

There is a plethora of recent LLM-based EM methods, because LLMs offer several advantages over traditional EM solutions: (i) contextual understanding, as they capture the context and semantics of entity descriptions better than traditional string matching techniques; (ii) robustness, since LLMs are typically more capable of addressing variations in how entity information is expressed; (iii) zero-shot and few-shot learning, i.e., LLMs can accomplish EM tasks with no or minimal examples of matching decisions. These characteristics render LLMs ideal for most EM tasks, especially those with complex, unstructured product descriptions.

The seminal work on LLM-based EM [16] investigated the effectiveness of GPT3-175B in EM, focusing on three key parameters: (i) problem definition, exploring different phrasings such as "Are Product A and Product B the same?" or "Are Product A and Product B equivalent?"; (ii) in-context learning, comparing zero-shot with few-shot approaches, where the former involve no examples in the prompt, while the latter involve a couple of examples, selected randomly or by experts; (iii) entity serialization, testing the use of all attributes or just a subset of them. Their experimental analysis led to the following conclusions: (i) few-shot learning significantly outperforms zero-shot approaches, (ii) attribute selection yields better results than using all attributes, (iii) problem definition has a substantial impact on performance, and (iv) LLM performance is comparable to state-of-the-art DL-based matching algorithms.

A detailed study was conducted in [15], using six LLMs: three hosted and three open-source ones. The experiments explored additional parameters, such as problem definition, language complexity, output specification, entity serialization, in-context learning, instructions, and fine-tuning. The experimental results revealed that: (i) no single prompt consistently outperformed all others across different scenarios; (ii) open-source LLMs showed comparable effectiveness to hosted models; (iii) LLMs performed competitively with deep learning-based matchers, even in zero-shot settings; (iv) few-shot and instruction-based prompts generally outperformed zero-shot approaches; (v) fine-tuning significantly improved effectiveness.

In another line of research, three distinct prompting strategies were explored in [17]: (i) Match prompts, which contain traditional pair-wise questions, e.g., "Do these two records refer to the same real-world entity? Record 1: [details]. Record 2: [details]."; (ii) Comparison prompts, which ask for the most similar entity to a given reference, e.g., "Which of these two records is more consistent with the given record? Given Record: [details]. (A) Record 1: [details]. (B) Record 2: [details]."; (iii) Selection prompts, which identify a matching entity from a set of candidates, e.g., "Select a record from the following list that refers to the same real-world entity as the given record: Given Record: [details]. Options: 1. [details] 2. [details] 3. [details]...". The experimental results show that incorporating record interactions through the comparison and selection prompts significantly improves EM performance across various scenarios; among the two, the selection prompts are the top performers in most cases. However, they suffer from position bias, because their accuracy decreases when the duplicate record is placed lower in the list of candidates.

BatchER [18] aims to reduce the costs for hosted LLMs through batch processing, exploring various methods for question batching and demonstration selection. The experimental results demonstrate that batch prompting outperforms match prompts in both effectiveness and cost, with the top performance achieved by diversity-based question batching combined with covering-based demonstration selection.

These studies collectively demonstrate the potential of LLMs in entity matching tasks, highlighting the importance of prompt engineering, the competitiveness of open-source models, and the effectiveness of batching strategies for improved efficiency. This work builds upon and extends the existing ones by focusing specifically on 7B parameter LLMs with 4-bit quantization. Unlike previous studies that primarily use larger, more resource-intensive models, our work explores the potential of smaller and more accessible LLMs for EM tasks. In this context, we perform a comprehensive evaluation of various novel prompting strategies, including zero-shot, few-shot, and general matching definition approaches, across multiple models and datasets. This approach offers insights into the practical applicability of LLMs in resource-constrained environments, bridging the gap between advanced language models and real-world EM challenges.

Figure 2: (a) The basic zero-shot EM prompt, and (b) its few-shot extension.

3. Problem Definition

Applied after Filtering, Entity Matching is typically formulated as a binary classification problem [3, 4]. More formally: given two records 𝑟1 and 𝑟2, the task is to determine whether they refer to the same entity. This is often expressed as a function 𝑓(𝑟1, 𝑟2) → {0, 1}, where 1 indicates a match (also called a duplicate) and 0 indicates a non-match.

In LLM-based settings, EM is framed as a natural language inference task. The LLM is provided with descriptions of two records and asked to determine if they refer to the same entity, returning "True" for a match and "False" otherwise.

In all cases, EM performance is measured with respect to:

• Precision, i.e., the proportion of correctly identified matches out of all predicted matches.
• Recall, i.e., the proportion of correctly identified matches out of all actual matches.
• F-measure, i.e., the harmonic mean of precision and recall, providing a balanced measure of performance.
• Run-time, i.e., the time taken to complete the ER process.

The first three measures are defined in [0, 1], with higher values indicating higher effectiveness. For the last one, lower values indicate higher time efficiency.
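The three effectiveness measures can be computed directly from the model's binary decisions over the candidate pairs. As a minimal illustration (a hypothetical helper, not code from the paper):

```python
def em_metrics(predictions, gold):
    """Compute precision, recall and F-measure for binary EM decisions.

    predictions and gold map each candidate pair to True (match)
    or False (non-match).
    """
    tp = sum(1 for pair, pred in predictions.items() if pred and gold[pair])
    fp = sum(1 for pair, pred in predictions.items() if pred and not gold[pair])
    fn = sum(1 for pair, pred in predictions.items() if not pred and gold[pair])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

A model that labels every candidate pair as a match reaches perfect recall but poor precision, which is exactly the behavior the F-measure penalizes.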
4. EM Prompts

We now present the EM prompts that are examined in our work. The basic prompt is presented in Figure 2(a). It consists of an instruction that describes the input and the desired output. It lacks any examples and thus constitutes a zero-shot EM prompt, which tests the model's ability to generalize to new tasks or domains it has not been trained on.

A concise few-shot EM prompt extends the zero-shot one with the examples in Figure 2(b). To provide a balanced context, there are two examples that include a pair of matching entities and a pair of non-matching ones. These examples serve as a form of weak supervision, allowing the LLM to learn from the provided instances and generalize to similar cases. Note that the examples in Figure 2(b) have been carefully selected from dataset 𝐷1 (see Table 1) so that they capture typical variations in product descriptions that are encountered in the full dataset.

Note that LLM responses to few-shot prompts suffer from position bias [17], because the order of examples in the EM prompt might alter the matching decision. This means that in the example of Figure 2(b), the response for a specific candidate pair might be True (i.e., matching) if the positive example precedes the negative one and False (i.e., non-matching) otherwise. For this reason, we define two types of few-shot prompts:

1. TF, where the True example is followed by the False one, as in Figure 2(b).
2. FT, where the False example is followed by the True one.

Note that with multiple examples per prompt, as in [17], more arrangements are possible. In this work, though, we exclusively consider the two variations of the few-shot EM prompt that involve one example per match type.

To increase the robustness of LLMs to few-shot EM prompts, we consider two matching approaches for each candidate pair, querying with both the TF and FT prompts:

1. The union approach labels a candidate pair as True if either the TF or the FT prompt results in a True response.
2. The intersection approach labels a candidate pair as True only if both the TF and FT prompts yield a True response.

4.1. Domain-specific Zero-Shot Prompts

The above prompts are generic enough to apply to any domain. In our experimental analysis, we also consider domain-specific ones, which are crafted for the product matching task. More specifically, we devise a zero-shot prompt that involves general matching definitions, providing the LLM with explicit guidance on how to determine if two records refer to the same product.

The core assumption of this approach is that the records are described by a clean, aligned schema. This is necessary for building a schema-aware generic definition of duplicate records. In the product matching task, we use four key product attributes: (i) product name, (ii) features, (iii) manufacturer, and (iv) model number. We use them in two different configurations:

1. The composite domain-specific EM prompt concatenates all four criteria in the above sequence, as in Figure 3. The goal is to facilitate more nuanced matching decisions.
2. The atomic domain-specific EM prompt uses only the model number as the matching criterion. We selected this attribute because it provides the cleanest and most distinctive values.

These two configurations were chosen after preliminary tests that suggested that they yield the best performance among all other combinations of these four attributes.

Figure 3: Domain-specific, zero-shot EM prompt for product matching.

5. Experimental Analysis

Experimental Settings. All experiments were implemented in Python v3.12.0 and Ollama¹ v0.1.22. All experiments were carried out on a server running Ubuntu 22.04.1 LTS, equipped with an Intel Core i7-9700K 8-core CPU @ 3.6 GHz, 32GB RAM, and an NVIDIA GeForce GTX 1080 Ti with 11GB of VRAM.

Due to the limited size of the available VRAM, our study focuses on 7-billion-parameter LLMs with optimizations such as quantization, which in our case replaces the 32-bit floating-point model weights with 4-bit integers. This reduces the model size, while maintaining reasonable performance levels. In other words, quantization lowers effectiveness, due to the lower precision of the model's weights, but significantly reduces run-times and memory consumption. Therefore, our experimental results are useful for resource-constrained applications, which run LLMs on commodity hardware.

LLMs. There is a plethora of open-source LLMs, with newer models introduced on a rather frequent basis. During our study, two models were quite popular: Llama 2 [19], with 7B parameters and a context length of 4096, as well as Mistral [20], with 7.3B parameters. However, preliminary experiments demonstrated that both of them were inappropriate for the EM tasks considered in this work. Llama 2 consistently responded with "True" for every candidate pair, while Mistral failed to provide a response according to the given instructions: it indicated an inability to respond in certain cases or gave explanations for its decisions instead of a "True" or "False" label.

In their place, we considered the following open-source models, which demonstrated high effectiveness in our preliminary experiments:

1. Orca2 [21]. Built by Microsoft Research, Orca2 is a family of models fine-tuned on Meta's Llama 2 using synthetic data.
2. OpenHermes². This is a Mistral 7B model fine-tuned with fully open datasets, showcasing strong multi-turn chat

¹ https://ollama.com
² https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B
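The TF/FT querying scheme of Section 4 amounts to a disjunction (union) or conjunction (intersection) of the two prompt orderings. A minimal sketch follows; the prompt wording and the records are illustrative, and `query_llm` is a hypothetical stand-in for an actual LLM call (e.g., via Ollama), not an API from the paper:

```python
def build_few_shot_prompt(pair, true_example, false_example, order="TF"):
    """Assemble a few-shot EM prompt with the examples in the given order.

    Each example is a ((record_1, record_2), label) tuple; order is
    'TF' (True example first) or 'FT' (False example first).
    """
    examples = ([true_example, false_example] if order == "TF"
                else [false_example, true_example])
    lines = ["Do the following two records refer to the same product?",
             "Answer only with True or False."]
    for (rec1, rec2), label in examples:
        lines.append(f"Record 1: {rec1} / Record 2: {rec2} -> {label}")
    rec1, rec2 = pair
    lines.append(f"Record 1: {rec1} / Record 2: {rec2} ->")
    return "\n".join(lines)


def match(pair, query_llm, true_example, false_example, mode="union"):
    """Query with both orderings and combine the two boolean answers."""
    tf = query_llm(build_few_shot_prompt(pair, true_example, false_example, "TF"))
    ft = query_llm(build_few_shot_prompt(pair, true_example, false_example, "FT"))
    return (tf or ft) if mode == "union" else (tf and ft)
```

With a position-biased model that answers differently for the two orderings, the union approach leans towards recall, while the intersection approach leans towards precision.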
         Dataset        #Entities   Duplicates     Cartesian Product     #Attributes   Candidate Pairs     Bl.Recall   Bl.Precision
           𝐷1        1,076-1,076          1,076             1.16×10^6               3              4,345        0.924          0.229
           𝐷2       2,554-22,074            853             5.64×10^7               6              5,163        0.910          0.150

        Table 1: Technical characteristics of the datasets used in the experimental analysis.
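The Bl.Recall and Bl.Precision columns follow from simple ratios over the candidate pairs produced by blocking. For instance, 𝐷1's figures imply that roughly 994 of its 1,076 duplicates survive blocking; this count is inferred here from the reported 0.924 recall and is not stated in the paper:

```python
def blocking_metrics(retained_duplicates, total_duplicates, candidate_pairs):
    """Blocking recall: fraction of true duplicates that survive blocking.
    Blocking precision: fraction of candidate pairs that are duplicates."""
    return (retained_duplicates / total_duplicates,
            retained_duplicates / candidate_pairs)

# D1 (Abt-Buy): ~994 surviving duplicates among 4,345 candidate pairs
recall, precision = blocking_metrics(994, 1076, 4345)  # approx. (0.924, 0.229)
```

The same arithmetic reproduces 𝐷2's row with roughly 776 surviving duplicates among 5,163 candidate pairs.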




             Figure 4: Effectiveness of the zero-shot prompt in Figure 2(a) on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right).


   skills and system prompt capabilities. It surpasses all previous versions of Nous-Hermes 13B and below.
3. Zephyr [22]. A 7B parameter model fine-tuned on Mistral, it achieves results similar to Llama 2 70B Chat in various benchmarks. It is trained on a distilled dataset, improving grammar and chat results.
4. Mistral-OpenOrca³. This is a 7B parameter model, fine-tuned on top of Mistral 7B using the OpenOrca dataset.
5. Stable-Beluga⁴. This is a Llama 2 based model fine-tuned on an Orca-style dataset.
6. Llama-Pro [23]. An 8B parameter expansion of Llama 2 that specializes in integrating both general language understanding and domain-specific knowledge, particularly in programming and mathematics.

³ https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca
⁴ https://huggingface.co/stabilityai/StableBeluga2

In all cases, we use the default latest model with 4-bit quantization and 7B parameters.

Datasets. We used two real-world product datasets that are widely used in the ER literature: (i) 𝐷1 is the Abt-Buy dataset, which comprises product listings from two online retailers, Abt Electronics and Buy.com. (ii) 𝐷2 is the Walmart-Amazon dataset, which contains product listings from two other online retailers, Walmart and Amazon. 𝐷1 primarily focuses on electronic products, while 𝐷2 covers a broader range of product categories, matching diverse entity types. Both datasets present important challenges, such as variations in product names and descriptions across retailers, inconsistent use of model numbers and other identifiers, differences in the level of detail provided for each product, variations in formatting and units (e.g., dimensions, weights), as well as missing or null values in certain fields.

Their technical characteristics are summarized in Table 1. Note that each dataset comprises two individually clean data sources, whose sizes are reported in the column #Entities. Note also that we apply the prompts to the candidate pairs generated by a state-of-the-art blocking method implemented by PyJedAI [24], version 0.1.6. Following [25], we use kNN-Join, which identifies the 𝑘 nearest neighbors of each entity. We fine-tuned it, maximizing blocking precision for a blocking recall of at least 90%, as reported in the rightmost columns of Table 1. This configuration uses cleaning (i.e., stop-word removal and stemming) and cosine similarity in both datasets. For Abt-Buy, 𝑘 was set to 4, while the attribute values were converted into a multiset of character trigrams. For Walmart-Amazon, 𝑘 was set to 2, while the attribute values were converted into a multiset of character four-grams.

5.1. Zero-Shot Prompting Results

We now examine the relative performance of the selected LLMs over 𝐷1 and 𝐷2, when coupled with the basic zero-shot EM prompt of Figure 2(a).

We observe that Orca2, OpenHermes, and Zephyr consistently rank as the top three models with respect to F-Measure in both datasets. The last two models switch their ranking positions in the two datasets, whereas Orca2 maintains the lead. The superior performance of Orca2, which demonstrates its robustness under diverse EM settings, can be attributed to its fine-tuning on synthetic data designed for reasoning tasks. This enhances its capability to understand and compare complex product descriptions. OpenHermes is fine-tuned on fully open datasets with strong multi-turn chat skills, leveraging advanced language understanding to perform well. Zephyr's competitive performance probably results from its training on a distilled dataset that improves grammar and chat results, aiding in a better interpretation of entity attributes. The lower performance of Mistral-OpenOrca, Stable-Beluga, and Llama-Pro is probably due to less specialized training data or smaller model capacities for the specific nuances of EM.

Note that all models exhibit much higher recall than precision in both datasets. This means that they are prone to labeling a candidate pair as matching, at the cost of introducing numerous false positives. Orca2 consistently exhibits the highest precision, thus yielding the highest F-Measure, too.

Note also that all models exhibit markedly lower effectiveness in 𝐷2 compared to 𝐷1. This suggests that 𝐷2 presents greater EM challenges, potentially due to more diverse or complex product descriptions. While 𝐷1 is restricted to electronics, 𝐷2 covers a broader range of products and includes more variation in descriptions, attributes, and data

Figure 5: Effectiveness of the few-shot prompts in Figure 2(b) on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right). From top to bottom, the TF prompts are presented first, followed by the FT prompts, the Union and the Intersection approaches.

This is not the case with Orca2, whose performance varies significantly across the two datasets. In 𝐷1, the same F1 score is achieved for both approaches, because the intersection raises recall by 12%, while reducing precision to the
quality, rendering EM more difficult. Furthermore, 𝐷1 has              same degree. In 𝐷2 , though, the intersection reduces recall
a 1:1 matching between its two data sources, whereas 𝐷2                by 23% and increases precision by 16%, thus yielding a much
has a much lower ratio of matches, adding another layer of             lower F-Measure. Note that in both datasets, the recall of
complexity to the task. The substantial performance gap                the model gets lower than its precision in combination with
between 𝐷1 and 𝐷2 underscores the significant impact of                the intersection approach, unlike the union one.
data characteristics on model effectiveness.                              Overall, we can conclude that Orca2 works best when
                                                                       coupled with FT few-shot prompts, while OpenHermes and
5.2. Few-Shot Prompting Results                                        Zephyr maximize their effectiveness when intersecting the
                                                                       matches of TF and FT prompts. Among them, the top per-
We now examine the performance of the aforementioned                   formers over 𝐷1 and 𝐷2 are Orca2 (F1=0.799) and Zephyr
few-shot prompts over 𝐷1 and 𝐷2 . We disregard Mistral-                (F1=0.531), respectively.
OpenOrca, Stable-Beluga, and Llama-Pro, because they ex-
hibited significantly lower effectiveness and less consistent
                                                                       5.3. Domain-specific Zero-Shot Prompting
performance in the zero-shot experiments – preliminary
experiments verified their poor performance in few-shot                     Results
settings, too. For brevity, we focus on the top three perform-         In this section, we compare the atomic domain-specific
ing models, namely Orca2, OpenHermes, and Zephyr.                      prompt with the composite one. As in Section 5.2, we ex-
   The results are reported in Figure 5. Based on preliminary          clusively consider the three top performing models with
experiments, we randomly select the examples included in               respect to the zero-shot prompts: Orca2, OpenHermes, and
the few-shot prompts from the candidate pairs of the same              Zephyr. Their performance is reported in Figure 6.
dataset. The same examples are used in all prompts issued                 We observe that in all cases, the atomic prompt outper-
on a particular dataset.                                               forms the composite one to a significant extent – the only
   In both datasets, we observe the same patterns as regards           exception corresponds to Zephyr in 𝐷1 , where the compos-
the relative performance of TF and FT few-shot prompts:                ite prompt increases F-Measure almost by 15%. This pattern
For Orca2, there is a substantial improvement when using               should be attributed to the short, distinctive and clean val-
the latter; OpenHermes is more robust to position bias, as             ues provided by the model number. This way, it reduces
there is no significant difference between the two prompt              the noise from other product attributes like product name,
strategies; Zephyr works best when coupled with the TF                 which are typically associated with long and diverse texts.
few-shot prompts. These patterns highlight that the impact                Similar to the above strategies, all LLMs exhibit much
of position bias on each model is consistent across the two            higher recall than precision. This means that they remain
datasets. Note also that with the exception of Orca2 with              prone to mark a candidate pair as a match at the cost of
TF prompts, all models achieve higher recall than precision,           introducing false positives – a behavior that permeates all
remaining more prone to label a candidate pair as matching.            prompt strategies we have examined.
   It is also interesting to compare the union approach with              Among the three models, Orca2 is consistently better,
the intersection one. For OpenHermes and Zephyr, the lat-              albeit to a minor extent in 𝐷2 . This consistent performance
ter yields significantly higher F-Measure: by considering              underscores Orca2’s effectiveness in EM tasks under quite
as duplicates only the candidate pairs that are marked as              different prompt designs.
matching by both TF and FT few-shot prompts, the reduc-                   We can conclude that domain-specific zero-shot prompts
tion in recall is much lower than the increase in precision            offer an effective and reliable alternative in datasets with a
(as a result, recall remains much higher than precision for            clean schema of known characteristics.
both models). This means that considering only the com-
mon matches of TF and FT prompts leads to more accurate
performance. Note that these patterns are consistent for
both models over both datasets.
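The union and intersection combinations used in Section 5.2 amount to simple set operations over the matches returned by the TF and FT prompt runs. A minimal Python sketch, with hypothetical pair identifiers and helper names (not code from the experiments):

```python
# Sketch of the union and intersection approaches for combining the
# predictions of the TF and FT few-shot prompts.

def combine(tf_matches: set, ft_matches: set, mode: str) -> set:
    """Union labels a pair as matching if either prompt does;
    intersection requires agreement of both prompts."""
    if mode == "union":
        return tf_matches | ft_matches
    if mode == "intersection":
        return tf_matches & ft_matches
    raise ValueError(mode)

def precision_recall(predicted: set, gold: set) -> tuple:
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Toy example: pair ids labelled as matches by each prompt order.
tf = {1, 2, 3, 5}
ft = {2, 3, 4, 6}
gold = {2, 3, 4}

print(precision_recall(combine(tf, ft, "union"), gold))         # higher recall, lower precision
print(precision_recall(combine(tf, ft, "intersection"), gold))  # higher precision, lower recall
```

The toy data illustrates the trade-off discussed above: the union maximizes recall at the cost of precision, while the intersection discards pairs on which the two prompt orders disagree.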
          Figure 6: Effectiveness of the atomic and composite domain-specific zero-shot prompts in Figure 2(a) on top of the selected
          LLMs over 𝐷1 (left) and 𝐷2 (right).
                                                     𝐷1                                                    𝐷2
  Prompt Strategy                  Precision    Recall   F-Measure     Run-time      Precision        Recall   F-Measure    Run-time
  Zero-shot                           0.664     0.956           0.784    32 min         0.397        0.740        0.517      23 min
  FT Few-shot                         0.768     0.834          0.799     41 min         0.420        0.515        0.463      33 min
  Atomic Domain-specific              0.689     0.934           0.793    33 min         0.434        0.708        0.538      25 min
                                                               (a) Orca2
  Zero-shot                           0.584     0.963           0.727    31 min         0.309        0.864         0.455     23 min
  Intersection Few-shot               0.683     0.718           0.700    40 min         0.378        0.585         0.459     33 min
  Atomic Domain-specific              0.556     0.969           0.707    33 min         0.306        0.876         0.453     25 min
                                                          (b) OpenHermes
  Zero-shot                           0.572     0.965           0.718    32 min         0.329        0.942         0.488     24 min
  Intersection Few-shot               0.667     0.877           0.757    43 min         0.408        0.761         0.531     34 min
  Composite Domain-specific           0.573     0.960           0.718    39 min         0.372        0.913         0.529     30 min
                                                              (c) Zephyr
     Table 2
     Best performance per LLM in combination with the top performing variant per prompt strategy across both datasets.
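The F-Measure column of Table 2 is the harmonic mean of the reported Precision and Recall, which can be verified directly. A small Python check (`f_measure` is our helper name; the reported values are rounded to three decimals, hence the tolerance):

```python
# Consistency check for Table 2: F-Measure = harmonic mean of P and R.

def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Selected rows of Table 2: (precision, recall, reported F-Measure).
rows = [
    (0.664, 0.956, 0.784),  # Orca2, zero-shot, D1
    (0.584, 0.963, 0.727),  # OpenHermes, zero-shot, D1
    (0.689, 0.934, 0.793),  # Orca2, atomic domain-specific, D1
]
for p, r, f in rows:
    assert abs(f_measure(p, r) - f) <= 0.001
```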



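All of the strategies compared below issue one classification prompt per candidate pair. As a rough, hypothetical illustration of the simplest (zero-shot) variant: the template wording and the `serialize` helper are our own assumptions for the sketch, not the exact prompts of Figure 2.

```python
# Hypothetical sketch of issuing a zero-shot EM prompt per candidate pair.

def serialize(record: dict) -> str:
    """Flatten a record into 'attribute: value' text, skipping null fields."""
    return ", ".join(f"{k}: {v}" for k, v in record.items() if v is not None)

def zero_shot_prompt(record_a: dict, record_b: dict) -> str:
    return (
        "Do the following two product descriptions refer to the same "
        "real-world entity? Answer only Yes or No.\n"
        f"Product A: {serialize(record_a)}\n"
        f"Product B: {serialize(record_b)}"
    )

a = {"title": "Galaxy S21 128GB", "brand": "Samsung", "price": None}
b = {"title": "Samsung Galaxy S21 (128 GB)", "brand": "Samsung"}
print(zero_shot_prompt(a, b))
```

The few-shot and domain-specific variants extend such a template with labelled example pairs or with instructions targeting specific attributes (e.g., the model number), respectively.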
5.4. Comparison of Prompting Strategies

We now compare the three top-performing models (Orca2, OpenHermes, and Zephyr) with respect to effectiveness and time efficiency across the three strategies of EM prompts discussed in Section 4. Note that among the few-shot and domain-specific variants, for each LLM we only consider the one with the highest F-Measure in both datasets. Their performance is reported in Table 2.

   For Orca2, we observe that the FT few-shot prompts are the top performers in 𝐷1. The atomic domain-specific ones follow at a very close distance in terms of F-Measure, while exhibiting a much lower run-time. This means that the domain-specific prompts offer a significantly better balance between effectiveness and time efficiency. In 𝐷2, this strategy scores the highest F-Measure for a slightly higher run-time than the second-best approach (zero-shot prompts). For these reasons, Orca2 works best in combination with the atomic domain-specific prompts.

   Regarding OpenHermes, the differences between the three types of prompts are minor in terms of F-Measure. As expected, the fastest approach in both datasets corresponds to the zero-shot prompts. This configuration also achieves the highest F-Measure in 𝐷1, while in 𝐷2, it ranks second, within a negligible distance from the top (<0.5%). Therefore, we can conclude that the zero-shot prompts are the best choice for OpenHermes.

   For Zephyr, there is a clear winner in the case of 𝐷1: the intersection of few-shot prompts. It exhibits, though, the highest run-time by a large margin. This is expected, as it queries the LLM twice per candidate pair. In the case of 𝐷2, the same strategy takes a minor lead over the composite domain-specific prompts, which are faster by more than 10%. Due to its consistency, the best choice for Zephyr corresponds to the intersection of the TF and FT few-shot prompts.

   Among the three 7B LLMs, the configuration consistently achieving (almost) the highest effectiveness in both datasets is Orca2 coupled with the atomic domain-specific prompts. Its efficiency is also rather high, given that its run-time is marginally higher than that of the fastest (zero-shot) configuration of the other two models.

  Method        𝐷1      Source   𝐷2      Source
  ZeroER        0.520   [26]     0.644   [27]
  Magellan      0.436   [28]     0.719   [28]
  DeepMatcher   0.628   [28]     0.669   [28]

Table 3
The F-Measure per dataset reported in the literature for three state-of-the-art EM algorithms.

5.5. Comparison to Baselines

To put the performance of the selected 7B LLMs into perspective, we compare it with three state-of-the-art EM approaches from the literature:

    1. ZeroER [26], an unsupervised approach that requires no labelled datasets, learning Gaussian mixture models for matching and non-matching candidate pairs.

    2. Magellan [29], a supervised approach combining binary classifiers with a series of hand-crafted features based on string similarity measures.
    3. DeepMatcher [28], a framework leveraging the synergy between language models and Deep Learning classification.

For each method, we consider its best performance as reported in the literature. The results are reported in Table 3.

   We observe mixed patterns. In 𝐷1, all LLM configurations in Table 2, even the zero-shot prompts, outperform all three baseline methods to a significant extent (>21%). This is remarkable, because the simplest prompt strategy requires neither domain expertise nor the labeling of candidate pairs, unlike Magellan and DeepMatcher, whose performance is derived from large training and validation sets, which amount to 60% and 20% of all candidate pairs, respectively.

   The situation is reversed in 𝐷2, where all baseline methods achieve a much better performance. In fact, the highest F-Measure of Orca2 is lower by 16.5% than the worst baseline (ZeroER). This should be attributed to the more challenging settings of 𝐷2, which have already been discussed in Section 5.1. Note also that the records in 𝐷2 are noisier, with a much higher portion of missing values. Its records are also longer, an aspect that is crucial for the 7B LLMs we are considering in this study, due to their limited attention window. These settings favor the learning-based functionality of the baseline methods, which take a clear lead over the learning-free functionality of the 7B LLMs. Another reason for the poor performance of the latter is that they emphasize recall at the expense of precision, significantly decreasing their F-Measure in 𝐷2, due to the very low portion of matches in comparison to the total number of entities from each data source. Therefore, more advanced strategies are required for boosting the performance of 7B LLMs in datasets with characteristics similar to those of 𝐷2.

6. Conclusions & Future Work

Focusing on 7B open-source LLMs, we examined the performance of three main prompt strategies: (i) the basic, domain-agnostic zero-shot prompt, (ii) the few-shot prompt with one example per type of matches, and (iii) the domain-specific zero-shot prompt. We considered several variants for the last two strategies and applied all of them on two established benchmark datasets for product matching. Testing six popular LLMs, we reached the following conclusions:

• Few-shot and domain-specific prompting significantly improve the performance of the zero-shot approaches, highlighting the value of task-specific prompts.

• In few-shot prompts, the response of LLMs is generally sensitive to the order of examples. This suggests that careful prompt engineering is crucial for optimal performance in real-world ER applications.

• This sensitivity can be addressed by the intersection approach to few-shot prompting, which consistently achieves much better results, increasing precision at a higher rate than it reduces recall.

• Orca2 consistently outperformed the other LLMs across most prompting strategies and datasets, demonstrating high robustness and effectiveness. In fact, the relative performance of the best models (Orca2 > OpenHermes > Zephyr) remained largely consistent across prompt strategies and datasets, suggesting inherent strengths in their base architectures.

• The use of 4-bit quantization and 7B parameter models demonstrated the potential for effective EM with limited computational resources. The effectiveness of the considered models is competitive with established, learning-based EM approaches, especially in datasets with a low portion of missing values and short entity descriptions.

In the future, we plan to explore LLMs' capability in matching entities across different languages and to enhance the interpretability and explainability of LLM decisions.

Acknowledgments. This work was partially funded by the EU project STELAR (Horizon Europe – Grant No. 101070122).

References

 [1] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer, 2012.
 [2] G. Papadakis, E. Ioannou, E. Thanos, T. Palpanas, The Four Generations of Entity Resolution, Morgan & Claypool Publishers, 2021.
 [3] X. L. Dong, D. Srivastava, Big data integration, in: ICDE, 2013, pp. 1245–1248.
 [4] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Comput. Surv. 53 (2021) 127:1–127:42.
 [5] K. Stefanidis, V. Efthymiou, M. Herschel, V. Christophides, Entity resolution in the web of data, in: 23rd International World Wide Web Conference, WWW, 2014, pp. 203–204.
 [6] X. L. Dong, Building a broad knowledge graph for products, in: ICDE, 2019, p. 25.
 [7] P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE Trans. Knowl. Data Eng. 24 (2012) 1537–1555.
 [8] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, A survey of blocking and filtering techniques for entity resolution, CoRR abs/1905.06167 (2019).
 [9] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng. 19 (2007) 1–16.
[10] A. Jurek, J. Hong, Y. Chi, W. Liu, A novel ensemble learning approach to unsupervised record linkage, Inf. Syst. 71 (2017) 40–54.
[11] P. Christen, Automatic record linkage using seeded nearest neighbour and support vector machine classification, in: SIGKDD, 2008, pp. 151–159.
[12] J. Fisher, P. Christen, Q. Wang, Active learning based entity resolution using markov logic, in: PAKDD, 2016, pp. 338–349.
[13] G. Papadakis, E. Ioannou, T. Palpanas, Entity resolution: Past, present and yet-to-come, in: EDBT, 2020, pp. 647–650.
[14] K. Nikoletos, E. Ioannou, G. Papadakis, The five generations of entity resolution on web data, in: ICWE, 2024, pp. 469–473.
[15] R. Peeters, C. Bizer, Entity matching using large language models, CoRR abs/2310.11244 (2023).
[16] A. Narayan, I. Chami, L. J. Orr, C. Ré, Can foundation models wrangle your data?, Proc. VLDB Endow. 16 (2022) 738–746.
[17] T. Wang, H. Lin, X. Chen, X. Han, H. Wang, Z. Zeng,
     L. Sun, Match, compare, or select? an investigation
     of large language models for entity matching, CoRR
     abs/2405.16884 (2024).
[18] M. Fan, X. Han, J. Fan, C. Chai, N. Tang, G. Li, X. Du,
     Cost-effective in-context learning for entity resolu-
     tion: A design space exploration, CoRR abs/2312.03987
     (2023).
[19] H. Touvron, et al., Llama 2: Open foundation and
     fine-tuned chat models, CoRR abs/2307.09288 (2023).
[20] A. Q. Jiang, A. Sablayrolles, A. Mensch, et al., Mistral
     7b, CoRR abs/2310.06825 (2023).
[21] A. Mitra, L. D. Corro, S. Mahajan, et al., Orca 2: Teach-
     ing small language models how to reason, CoRR
     abs/2311.11045 (2023).
[22] L. Tunstall, E. Beeching, N. Lambert, et al., Zephyr: Di-
     rect distillation of LM alignment, CoRR abs/2310.16944
     (2023).
[23] C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, Y. Shan,
     P. Luo, Llama pro: Progressive llama with block ex-
     pansion, in: ACL, 2024, pp. 6518–6537.
[24] K. Nikoletos, G. Papadakis, M. Koubarakis, pyjedai: a
     lightsaber for link discovery, in: ISWC Posters, Demos
     and Industry Tracks, volume 3254, 2022.
[25] F. Neuhof, M. Fisichella, G. Papadakis, K. Nikoletos,
     N. Augsten, W. Nejdl, M. Koubarakis, Open benchmark
     for filtering techniques in entity resolution, VLDB J.
     33 (2024) 1671–1696.
[26] R. Wu, S. Chaba, S. Sawlani, X. Chu, S. Thirumuru-
     ganathan, Zeroer: Entity resolution using zero labeled
     examples, in: SIGMOD, 2020, pp. 1149–1164.
[27] G. Papadakis, N. Kirielle, P. Christen, T. Palpanas, A
     critical re-evaluation of record linkage benchmarks for
     learning-based matching algorithms, in: ICDE, 2024,
     pp. 3435–3448.
[28] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Kr-
     ishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep
     learning for entity matching: A design space explo-
     ration, in: SIGMOD, 2018, pp. 19–34.
[29] P. Konda, S. Das, et al., Magellan: Toward building
     entity matching management systems, Proc. VLDB
     Endow. 9 (2016) 1197–1208.