Entity Matching with 7B LLMs: A Study on Prompting Strategies and Hardware Limitations

Ioannis Arvanitis-Kasinikos¹, George Papadakis¹
¹ National and Kapodistrian University of Athens, Greece
cs1180001@di.uoa.gr (I. Arvanitis-Kasinikos); gpapadis@di.uoa.gr (G. Papadakis, https://gpapadis.wordpress.com, ORCID 0000-0002-7298-9431)

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Entity Matching (EM) is a fundamental task in data management, involving the identification and linking of records that refer to the same real-world entity across different datasets. While Large Language Models (LLMs) have shown promise in addressing complex natural language processing tasks, their substantial computational requirements often limit their practical applicability. In this work, we investigate the use of 7B parameter LLMs with 4-bit quantization for EM tasks executable on commodity hardware. We explore various prompting strategies, including zero-shot, few-shot, and general matching definition prompts, to evaluate their effectiveness in improving EM accuracy. Experiments are conducted on two benchmark product datasets that present varying levels of complexity in their product descriptions. Our findings demonstrate that 7B parameter LLMs can effectively perform EM, with the Orca2 model consistently outperforming others across different prompting strategies and datasets. The study highlights that few-shot prompting significantly enhances performance over zero-shot approaches, emphasizing the importance of task-specific examples and careful prompt design. We also examine the impact of example order in few-shot prompts and find that it has a substantial effect on model performance. Finally, we examine hardware limitations, demonstrating that effective EM can be achieved with resource-constrained models.

Keywords: Entity Matching, 7B LLMs, Zero-Shot Prompts, Few-Shot Prompts

1. Introduction

Entity Resolution (ER) constitutes a vital task in data management that involves identifying and linking records from different datasets that refer to the same real-world entity [1, 2]. In many domains, including e-commerce, healthcare, and finance, accurate ER is essential for ensuring data quality, enabling effective data integration, and supporting informed decision-making [3]. However, this task is challenging due to data inconsistencies, incompleteness, and ambiguity across different sources [4, 5].

As an example, consider the product descriptions in Figure 1. Despite corresponding to the same object (Sony headphones), they exhibit significant variations in product names, attributes, and dimensions. These discrepancies illustrate the challenges in reconciling variations across datasets, particularly when dealing with unstructured text and linguistic differences. Accurate ER in scenarios like this is crucial for product catalog integration, price comparison, and recommendation systems [6].

Figure 1: Two records with major differences describing the same product.

Due to its quadratic time complexity, ER solutions typically implement the Filtering-Verification framework [7]. The Filtering step, often called Blocking, significantly reduces the computational cost by restricting the search space to the most similar candidate pairs, which are the most likely matches [8]. The Verification step performs Entity Matching (EM), which essentially determines whether two records are duplicates, describing the same real-world object. In the following, we exclusively focus on EM.

Traditional EM solutions typically rely on rule-based approaches, string similarity metrics, or machine learning algorithms [9, 10, 11]. However, these methods can struggle with complex linguistic variations and contextual understanding, while requiring domain expertise and heavy human involvement [12]. This is addressed by more recent state-of-the-art approaches that leverage deep learning (DL) techniques [13]. However, they require substantial amounts of training data, which are rarely available.

Recent advancements in NLP, particularly in Large Language Models (LLMs), offer new possibilities for addressing EM challenges [14, 15]. LLMs possess advanced capabilities for natural language understanding, which allow them to process and interpret complex textual descriptions [16]. Most importantly, LLM-based EM can be performed in zero-shot settings, requiring no training instances, a characteristic particularly attractive for out-of-the-box solutions.

In this work, we evaluate the performance of 7B parameter LLMs in entity matching tasks. While larger LLMs with hundreds of billions of parameters have shown impressive results [15, 16], their computational requirements often make them impractical for many real-world applications. By employing smaller LLMs that excel in natural language understanding and semantic similarity assessment, this work seeks to address EM challenges in real-world datasets with linguistic variations and unstructured text. The focus on 7B parameter LLMs is motivated by their potential for efficient deployment on commodity hardware, which makes them suitable for practical applications.

To this end, we perform an extensive experimental evaluation that considers the models' ability to handle different types of EM scenarios. We explore novel zero-shot, few-shot, and general matching definition prompting strategies to assess their effectiveness in improving matching accuracy. Our goal is to bridge the gap between the advanced capabilities of LLMs and the practical constraints of real-world EM applications, potentially paving the way for more efficient and accurate ER techniques in diverse domains.
2. Related Work

There is a plethora of recent LLM-based EM methods, because LLMs offer several advantages over traditional EM solutions: (i) contextual understanding, as they grasp the context and semantics of entity descriptions better than traditional string matching techniques; (ii) robustness, since LLMs are typically more capable of addressing variations in how entity information is expressed; (iii) zero-shot and few-shot learning, i.e., LLMs can accomplish EM tasks with no or minimal examples of matching decisions. These characteristics render LLMs ideal for most EM tasks, especially those with complex, unstructured product descriptions.

Figure 2: (a) The basic zero-shot EM prompt, and (b) its few-shot extension.

The seminal work on LLM-based EM [16] investigated the effectiveness of GPT3-175B in EM, focusing on three key parameters: (i) problem definition, exploring different phrasings, such as "Are Product A and Product B the same?" or "Are Product A and Product B equivalent?"; (ii) in-context learning, comparing zero-shot with few-shot approaches – the former involves prompts without any examples, while the latter includes a couple of examples, selected randomly or by experts; (iii) entity serialization, testing the use of all attributes or just a subset of them. Their experimental analysis led to the following conclusions: (i) few-shot learning significantly outperforms zero-shot approaches, (ii) attribute selection yields better results than using all attributes, (iii) problem definition has a substantial impact on performance, and (iv) LLM performance is comparable to that of state-of-the-art DL-based matching algorithms.

A detailed study was conducted in [15], using six LLMs: three hosted and three open-source ones. The experiments explored additional parameters, such as problem definition, language complexity, output specification, entity serialization, in-context learning, instructions, and fine-tuning. The experimental results revealed that: (i) no single prompt consistently outperformed all others across different scenarios; (ii) open-source LLMs showed comparable effectiveness to hosted models; (iii) LLMs performed competitively with deep learning-based matchers, even in zero-shot settings; (iv) few-shot and instruction-based prompts generally outperformed zero-shot approaches; (v) fine-tuning significantly improved effectiveness.
In another line of research, three distinct prompting strategies were explored in [17]: (i) Match prompts, which contain traditional pair-wise questions, e.g., "Do these two records refer to the same real-world entity? Record 1: [details]. Record 2: [details]."; (ii) Comparison prompts, which ask for the most similar entity to a given reference, e.g., "Which of these two records is more consistent with the given record? Given Record: [details]. (A) Record 1: [details]. (B) Record 2: [details]."; (iii) Selection prompts, which identify a matching entity from a set of candidates, e.g., "Select a record from the following list that refers to the same real-world entity as the given record: Given Record: [details]. Options: 1. [details] 2. [details] 3. [details]...". The experimental results show that incorporating record interactions through the comparison and selection prompts significantly improves EM performance across various scenarios; among the two, the selection prompts are the top performers in most cases. However, they suffer from position bias, because their accuracy decreases when the duplicate record is placed lower in the list of candidates.

BatchER [18] aims to reduce the costs of hosted LLMs through batch processing, exploring various methods for question batching and demonstration selection. The experimental results demonstrate that batch prompting outperforms match prompts in both effectiveness and cost, with the top performance achieved by diversity-based question batching combined with covering-based demonstration selection.

These studies collectively demonstrate the potential of LLMs in entity matching tasks, highlighting the importance of prompt engineering, the competitiveness of open-source models, and the effectiveness of batching strategies for improved efficiency. This work builds upon and extends the existing ones by focusing specifically on 7B parameter LLMs with 4-bit quantization. Unlike previous studies that primarily use larger, more resource-intensive models, our work explores the potential of smaller and more accessible LLMs for EM tasks. In this context, we perform a comprehensive evaluation of various novel prompting strategies, including zero-shot, few-shot, and general matching definition approaches, across multiple models and datasets. This approach offers insights into the practical applicability of LLMs in resource-constrained environments, bridging the gap between advanced language models and real-world EM challenges.
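To make the three prompt types of [17] concrete, the snippet below collects them as Python template strings. The wording follows the examples quoted above, while the placeholder names (record_1, given, options) are illustrative choices of ours rather than the original paper's notation.

```python
# Illustrative templates for the three prompt types of [17]; the wording
# paraphrases the examples quoted above, and the placeholder names are
# hypothetical.

MATCH_TEMPLATE = (
    "Do these two records refer to the same real-world entity? "
    "Record 1: {record_1}. Record 2: {record_2}."
)

COMPARISON_TEMPLATE = (
    "Which of these two records is more consistent with the given record? "
    "Given Record: {given}. (A) Record 1: {record_1}. (B) Record 2: {record_2}."
)

SELECTION_TEMPLATE = (
    "Select a record from the following list that refers to the same "
    "real-world entity as the given record: Given Record: {given}. "
    "Options: {options}"
)

# Example instantiation of a match prompt:
print(MATCH_TEMPLATE.format(record_1="sony mdr-nc60 headphones",
                            record_2="sony noise canceling headphones mdr-nc60"))
```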
3. Problem Definition

Applied after Filtering, Entity Matching is typically formulated as a binary classification problem [3, 4]. More formally: given two records 𝑟1 and 𝑟2, the task is to determine whether they refer to the same entity. This is often expressed as a function 𝑓(𝑟1, 𝑟2) → {0, 1}, where 1 indicates a match (also called a duplicate) and 0 indicates a non-match.

In LLM-based settings, EM is framed as a natural language inference task: the LLM is provided with the descriptions of two records and asked to determine whether they refer to the same entity, returning "True" for a match and "False" otherwise.

In all cases, EM performance is measured with respect to:

• Precision, i.e., the proportion of correctly identified matches out of all predicted matches.
• Recall, i.e., the proportion of correctly identified matches out of all actual matches.
• F-measure, i.e., the harmonic mean of precision and recall, providing a balanced measure of performance.
• Run-time, i.e., the time taken to complete the ER process.

The first three measures are defined in [0, 1], with higher values indicating higher effectiveness. For the last one, lower values indicate higher time efficiency.
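Given the binary decisions collected over all candidate pairs, the first three measures can be computed directly. A minimal sketch, assuming matches are encoded as 1 and non-matches as 0:

```python
def evaluate_matching(y_true, y_pred):
    """Precision, Recall and F-measure for binary EM decisions
    (1 = match/duplicate, 0 = non-match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Four candidate pairs: one true positive, one miss, one false alarm.
print(evaluate_matching([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5, 0.5)
```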
4. EM Prompts

We now present the EM prompts that are examined in our work. The basic prompt is presented in Figure 2(a). It consists of an instruction that describes the input and the desired output. It lacks any examples and thus constitutes a zero-shot EM prompt, which tests the model's ability to generalize to new tasks or domains it has not been trained on.

A concise few-shot EM prompt extends the zero-shot one with the examples in Figure 2(b). To provide a balanced context, there are two examples: a pair of matching entities and a pair of non-matching ones. These examples serve as a form of weak supervision, allowing the LLM to learn from the provided instances and generalize to similar cases. Note that the examples in Figure 2(b) have been carefully selected from dataset 𝐷1 (see Table 1) so that they capture typical variations in product descriptions that are encountered in the full dataset.
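Since Figure 2 is not reproduced here, the sketch below is a plausible reconstruction of the two prompts, based only on the requirements stated above (an instruction demanding a True/False answer, plus one matching and one non-matching example). The exact wording and the example records are assumptions.

```python
# Plausible reconstruction of the prompts of Figure 2; the exact wording
# and the example records are assumptions, not the paper's originals.

ZERO_SHOT = (  # Figure 2(a): instruction only, no examples
    "Do the following two product records refer to the same real-world "
    "entity? Answer only with 'True' or 'False'.\n"
    "Record 1: {record_1}\nRecord 2: {record_2}\nAnswer:"
)

# Figure 2(b): one matching and one non-matching example (here invented).
MATCH_EXAMPLE = (
    "Record 1: sony dsc-w55 cyber-shot digital camera, 7.2 megapixels\n"
    "Record 2: sony cybershot dscw55 7.2mp camera\nAnswer: True\n"
)
NON_MATCH_EXAMPLE = (
    "Record 1: sony dsc-w55 cyber-shot digital camera, 7.2 megapixels\n"
    "Record 2: canon powershot a570is 7.1mp digital camera\nAnswer: False\n"
)

def zero_shot_prompt(record_1, record_2):
    return ZERO_SHOT.format(record_1=record_1, record_2=record_2)

def few_shot_prompt(record_1, record_2):
    """Prepend the two examples to the zero-shot instruction."""
    return MATCH_EXAMPLE + NON_MATCH_EXAMPLE + zero_shot_prompt(record_1, record_2)
```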
Note that LLM responses to few-shot prompts suffer from position bias [17], because the order of examples in the EM prompt might alter the matching decision. This means that in the example of Figure 2(b), the response for a specific candidate pair might be True (i.e., matching) if the positive example precedes the negative one and False (i.e., non-matching) otherwise. For this reason, we define two types of few-shot prompts:

1. TF, where the True example is followed by the False one, as in Figure 2(b).
2. FT, where the False example is followed by the True one.

Note that with multiple examples per prompt, as in [17], more arrangements are possible. In this work, though, we exclusively consider the two variations of the few-shot EM prompt that involve one example per match type.

To increase the robustness of LLMs to few-shot EM prompts, we consider two matching approaches that query each candidate pair with both the TF and FT prompts:

1. The union approach labels a candidate pair as True if either the TF or the FT prompt results in a True response.
2. The intersection approach labels a candidate pair as True only if both the TF and FT prompts yield a True response.
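Continuing the sketch above, the two orderings and the two combination approaches can be expressed as follows; ask_llm is a hypothetical helper that sends a prompt to the model and maps its reply to a boolean (a concrete version is sketched in Section 5).

```python
# Continues the sketch above; ask_llm is a hypothetical helper returning
# the model's True/False answer as a boolean.

def ordered_few_shot_prompt(record_1, record_2, order="TF"):
    """TF places the matching (True) example first; FT the non-matching one."""
    examples = (MATCH_EXAMPLE + NON_MATCH_EXAMPLE if order == "TF"
                else NON_MATCH_EXAMPLE + MATCH_EXAMPLE)
    return examples + zero_shot_prompt(record_1, record_2)

def classify_pair(record_1, record_2, ask_llm, mode="union"):
    """Query with both orders; union ORs the two answers, intersection ANDs them."""
    tf = ask_llm(ordered_few_shot_prompt(record_1, record_2, "TF"))
    ft = ask_llm(ordered_few_shot_prompt(record_1, record_2, "FT"))
    return (tf or ft) if mode == "union" else (tf and ft)
```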
4.1. Domain-specific Zero-Shot Prompts

The above prompts are generic enough to apply to any domain. In our experimental analysis, we also consider domain-specific ones, which are crafted for the product matching task. More specifically, we devise a zero-shot prompt that involves general matching definitions, providing the LLM with explicit guidance on how to determine whether two records refer to the same product.

The core assumption of this approach is that the records are described by a clean, aligned schema. This is necessary for building a schema-aware generic definition of duplicate records. In the product matching task, we use four key product attributes: (i) product name, (ii) features, (iii) manufacturer, and (iv) model number. We use them in two different configurations:

1. The composite domain-specific EM prompt concatenates all four criteria in the above sequence, as in Figure 3. The goal is to facilitate more nuanced matching decisions.
2. The atomic domain-specific EM prompt uses only the model number as the matching criterion. We selected this attribute because it provides the cleanest and most distinctive values.

These two configurations were chosen after preliminary tests suggested that they yield the best performance among all combinations of these four attributes.

Figure 3: Domain-specific, zero-shot EM prompt for product matching.
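Figure 3 is likewise not reproduced here; the sketch below shows one plausible way to assemble the two configurations from the four attributes, with the wording of the matching criteria being an assumption.

```python
# Plausible construction of the domain-specific zero-shot prompts; the
# wording of the matching criteria is an assumption.

CRITERIA = {
    "product name": "their product names refer to the same product",
    "features": "their features describe the same product",
    "manufacturer": "they have the same manufacturer",
    "model number": "they have the same model number",
}

def domain_specific_prompt(record_1, record_2, attributes):
    rules = "; and ".join(CRITERIA[a] for a in attributes)
    return ("Two product records describe the same product if: " + rules +
            ". Based on this definition, do the following records match? "
            "Answer only with 'True' or 'False'.\n"
            f"Record 1: {record_1}\nRecord 2: {record_2}\nAnswer:")

composite_prompt = domain_specific_prompt("...", "...", list(CRITERIA))
atomic_prompt = domain_specific_prompt("...", "...", ["model number"])
```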
5. Experimental Analysis

Experimental Settings. All experiments were implemented in Python v3.12.0 and Ollama (https://ollama.com) v0.1.22. They were carried out on a server running Ubuntu 22.04.1 LTS, equipped with an 8-core Intel Core i7-9700K @ 3.6 GHz, 32GB of RAM, and an NVIDIA GeForce GTX 1080 Ti with 11GB of VRAM.

Due to the limited size of the available VRAM, our study focuses on 7-billion-parameter LLMs with optimizations such as quantization, which in our case replaces the 32-bit floating-point model weights with 4-bit integers. This reduces the model size, while maintaining reasonable performance levels. In other words, effectiveness is lowered by both the small number of parameters and the reduced precision of the model's weights, but run-times and memory consumption drop significantly. Therefore, our experimental results are useful for resource-constrained applications, which run LLMs on commodity hardware.
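As a concrete version of the ask_llm helper assumed in Section 4, the sketch below queries a locally served, 4-bit quantized model through Ollama's documented /api/generate REST endpoint. The model tag ("orca2" refers to one of the models introduced below) and the answer parsing are our assumptions.

```python
import requests  # assumes a local Ollama server on its default port

def ask_llm(prompt: str, model: str = "orca2") -> bool:
    """Send an EM prompt to a 4-bit quantized 7B model served by Ollama
    and map the generated text to a boolean matching decision."""
    reply = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    reply.raise_for_status()
    answer = reply.json()["response"].strip().lower()
    return answer.startswith("true")  # any other reply counts as a non-match

# e.g.: ask_llm(zero_shot_prompt("sony mdr-nc60", "sony headphones mdr-nc60"))
```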
LLMs. There is a plethora of open-source LLMs, with newer models introduced on a rather frequent basis. During our study, two models were quite popular: Llama 2 [19], with 7B parameters and a context length of 4,096 tokens, as well as Mistral [20], with 7.3B parameters. However, preliminary experiments demonstrated that both of them were inappropriate for the EM tasks considered in this work: Llama 2 consistently responded with "True" for every candidate pair, while Mistral failed to provide a response according to the given instructions – it indicated an inability to respond in certain cases, or gave explanations for its decisions instead of a "True" or "False" label.

In their place, we considered the following open-source models, which demonstrated high effectiveness in our preliminary experiments:

1. Orca2 [21]. Built by Microsoft Research, Orca2 is a family of models fine-tuned on Meta's Llama 2 using synthetic data.
2. OpenHermes (https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B). This is a Mistral 7B model fine-tuned with fully open datasets, showcasing strong multi-turn chat skills and system prompt capabilities. It surpasses all previous versions of Nous-Hermes 13B and below.
3. Zephyr [22]. A 7B parameter model fine-tuned on Mistral, it achieves results similar to Llama 2 70B Chat in various benchmarks. It is trained on a distilled dataset, improving grammar and chat results.
4. Mistral-OpenOrca (https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca). This is a 7B parameter model, fine-tuned on top of Mistral 7B using the OpenOrca dataset.
5. Stable-Beluga (https://huggingface.co/stabilityai/StableBeluga2). This is a Llama 2 based model fine-tuned on an Orca-style dataset.
6. Llama-Pro [23]. An 8B parameter expansion of Llama 2 that specializes in integrating both general language understanding and domain-specific knowledge, particularly in programming and mathematics.

In all cases, we use the latest default model with 4-bit quantization and 7B parameters.

Datasets. We used two real-world product datasets that are widely used in the ER literature: (i) 𝐷1 is the Abt-Buy dataset, which comprises product listings from two online retailers, Abt Electronics and Buy.com; (ii) 𝐷2 is the Walmart-Amazon dataset, which contains product listings from two other online retailers, Walmart and Amazon. 𝐷1 primarily focuses on electronic products, while 𝐷2 covers a broader range of product categories, matching diverse entity types. Both datasets present important challenges, such as variations in product names and descriptions across retailers, inconsistent use of model numbers and other identifiers, differences in the level of detail provided for each product, variations in formatting and units (e.g., dimensions, weights), as well as missing or null values in certain fields.

Their technical characteristics are summarized in Table 1. Note that each dataset comprises two individually clean data sources, whose sizes are reported in the #Entities column.

Dataset  #Entities      Duplicates  Cartesian Product  #Attributes  Candidate Pairs  Bl.Recall  Bl.Precision
𝐷1       1,076-1,076    1,076       1.16×10⁶           3            4,345            0.924      0.229
𝐷2       2,554-22,074   853         5.64×10⁷           6            5,163            0.910      0.150

Table 1: Technical characteristics of the datasets used in the experimental analysis.

Note also that we apply the prompts to the candidate pairs generated by a state-of-the-art blocking method implemented by PyJedAI [24], version 0.1.6. Following [25], we use kNN-join, which identifies the 𝑘 nearest neighbors of each entity. We fine-tuned it, maximizing blocking precision for a blocking recall of at least 90%, as reported in the rightmost columns of Table 1. This configuration uses cleaning (i.e., stop-word removal and stemming) and cosine similarity in both datasets. For Abt-Buy, 𝑘 was set to 4, and the attribute values were converted into multisets of character trigrams. For Walmart-Amazon, 𝑘 was set to 2, and the attribute values were converted into multisets of character four-grams.
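To illustrate this blocking configuration, the sketch below approximates the kNN-join for Abt-Buy with scikit-learn, using character trigrams and cosine similarity. This is an illustrative re-implementation, not PyJedAI's actual API, and it omits the cleaning step (stop-word removal and stemming).

```python
# Illustrative kNN-join blocking (not PyJedAI's actual API): character
# n-gram count vectors with cosine similarity, mirroring the Abt-Buy setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

def knn_join(source_records, target_records, k=4, n=3):
    """Return candidate pairs (source_idx, target_idx) linking each target
    record to its k nearest source records over character n-gram vectors."""
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(n, n))
    source_vectors = vectorizer.fit_transform(source_records)
    target_vectors = vectorizer.transform(target_records)
    index = NearestNeighbors(n_neighbors=k, metric="cosine")
    index.fit(source_vectors)
    _, neighbors = index.kneighbors(target_vectors)
    return [(i, j) for j, row in enumerate(neighbors) for i in row]

# candidates = knn_join(abt_titles, buy_titles, k=4, n=3)        # D1 setup
# candidates = knn_join(walmart_titles, amazon_titles, k=2, n=4)  # D2 setup
```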
5.1. Zero-Shot Prompting Results

Figure 4: Effectiveness of the zero-shot prompt in Figure 2(a) on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right).

We now examine the relative performance of the selected LLMs over 𝐷1 and 𝐷2, when coupled with the basic zero-shot EM prompt of Figure 2(a).

We observe that Orca2, OpenHermes, and Zephyr consistently rank as the top three models with respect to F-Measure in both datasets. The last two switch their ranking positions across the two datasets, whereas Orca2 maintains the lead. The superior performance of Orca2, which demonstrates its robustness under diverse EM settings, can be attributed to its fine-tuning on synthetic data designed for reasoning tasks; this enhances its capability to understand and compare complex product descriptions. OpenHermes is fine-tuned on fully open datasets with strong multi-turn chat skills, leveraging advanced language understanding to perform well. Zephyr's competitive performance probably results from its training on a distilled dataset that improves grammar and chat results, aiding the interpretation of entity attributes. The lower performance of Mistral-OpenOrca, Stable-Beluga, and Llama-Pro is probably due to less specialized training data or smaller model capacities for the specific nuances of EM.

Note that all models exhibit much higher recall than precision in both datasets. This means that they are prone to label a candidate pair as matching, at the cost of introducing numerous false positives. Orca2 consistently exhibits the highest precision, thus yielding the highest F-Measure, too.

Note also that all models exhibit markedly lower effectiveness in 𝐷2 compared to 𝐷1. This suggests that 𝐷2 presents greater EM challenges, potentially due to more diverse or complex product descriptions. While 𝐷1 is restricted to electronics, 𝐷2 covers a broader range of products and includes more variation in descriptions, attributes, and data quality, rendering EM more difficult. Furthermore, 𝐷1 has a 1:1 matching between its two data sources, whereas 𝐷2 has a much lower ratio of matches, adding another layer of complexity to the task. The substantial performance gap between 𝐷1 and 𝐷2 underscores the significant impact of data characteristics on model effectiveness.

5.2. Few-Shot Prompting Results

Figure 5: Effectiveness of the few-shot prompts in Figure 2(b) on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right). From top to bottom: the TF prompts, the FT prompts, the Union approach, and the Intersection approach.

We now examine the performance of the aforementioned few-shot prompts over 𝐷1 and 𝐷2. We disregard Mistral-OpenOrca, Stable-Beluga, and Llama-Pro, because they exhibited significantly lower effectiveness and less consistent performance in the zero-shot experiments – preliminary experiments verified their poor performance in few-shot settings, too. For brevity, we focus on the top three performing models, namely Orca2, OpenHermes, and Zephyr.

The results are reported in Figure 5. Based on preliminary experiments, we randomly select the examples included in the few-shot prompts from the candidate pairs of the same dataset. The same examples are used in all prompts issued on a particular dataset.

In both datasets, we observe the same patterns as regards the relative performance of the TF and FT few-shot prompts: for Orca2, there is a substantial improvement when using the latter; OpenHermes is more robust to position bias, as there is no significant difference between the two prompt strategies; Zephyr works best when coupled with the TF few-shot prompts. These patterns highlight that the impact of position bias on each model is consistent across the two datasets. Note also that, with the exception of Orca2 with TF prompts, all models achieve higher recall than precision, remaining more prone to label a candidate pair as matching.

It is also interesting to compare the union approach with the intersection one. For OpenHermes and Zephyr, the latter yields a significantly higher F-Measure: by considering as duplicates only the candidate pairs that are marked as matching by both the TF and FT few-shot prompts, the reduction in recall is much lower than the increase in precision (as a result, recall remains much higher than precision for both models). This means that considering only the common matches of the TF and FT prompts leads to more accurate performance. Note that these patterns are consistent for both models over both datasets.

This is not the case with Orca2, whose performance varies significantly across the two datasets. In 𝐷1, the same F1 score is achieved for both approaches, because the intersection raises precision by 12%, while reducing recall to the same degree. In 𝐷2, though, the intersection reduces recall by 23% and increases precision by only 16%, thus yielding a much lower F-Measure. Note that in both datasets, the recall of the model gets lower than its precision in combination with the intersection approach, unlike with the union one.

Overall, we can conclude that Orca2 works best when coupled with the FT few-shot prompts, while OpenHermes and Zephyr maximize their effectiveness when intersecting the matches of the TF and FT prompts. Among them, the top performers over 𝐷1 and 𝐷2 are Orca2 (F1=0.799) and Zephyr (F1=0.531), respectively.
5.3. Domain-specific Zero-Shot Prompting Results

Figure 6: Effectiveness of the atomic and composite domain-specific zero-shot prompts on top of the selected LLMs over 𝐷1 (left) and 𝐷2 (right).

In this section, we compare the atomic domain-specific prompt with the composite one. As in Section 5.2, we exclusively consider the three top-performing models with respect to the zero-shot prompts: Orca2, OpenHermes, and Zephyr. Their performance is reported in Figure 6.

We observe that in almost all cases, the atomic prompt outperforms the composite one to a significant extent – the only exception corresponds to Zephyr in 𝐷1, where the composite prompt increases the F-Measure by almost 15%. This pattern should be attributed to the short, distinctive, and clean values provided by the model number, which reduce the noise from other product attributes, like the product name, that are typically associated with long and diverse texts.

Similar to the above strategies, all LLMs exhibit much higher recall than precision. This means that they remain prone to mark a candidate pair as a match at the cost of introducing false positives – a behavior that permeates all prompt strategies we have examined.

Among the three models, Orca2 is consistently better, albeit to a minor extent in 𝐷2. This consistent performance underscores Orca2's effectiveness in EM tasks under quite different prompt designs.

We can conclude that domain-specific zero-shot prompts offer an effective and reliable alternative in datasets with a clean schema of known characteristics.

5.4. Comparison of Prompting Strategies

We now compare the three top-performing models (Orca2, OpenHermes, and Zephyr) with respect to effectiveness and time efficiency across the three strategies of EM prompts discussed in Section 4. Note that among the few-shot and domain-specific variants, for each LLM we only consider the one with the highest F-Measure in both datasets. Their performance is reported in Table 2.

                            ---------------- 𝐷1 ----------------   ---------------- 𝐷2 ----------------
Prompt Strategy             Precision  Recall  F-Measure  Run-time  Precision  Recall  F-Measure  Run-time

(a) Orca2
Zero-shot                   0.664      0.956   0.784      32 min    0.397      0.740   0.517      23 min
FT Few-shot                 0.768      0.834   0.799      41 min    0.420      0.515   0.463      33 min
Atomic Domain-specific      0.689      0.934   0.793      33 min    0.434      0.708   0.538      25 min

(b) OpenHermes
Zero-shot                   0.584      0.963   0.727      31 min    0.309      0.864   0.455      23 min
Intersection Few-shot       0.683      0.718   0.700      40 min    0.378      0.585   0.459      33 min
Atomic Domain-specific      0.556      0.969   0.707      33 min    0.306      0.876   0.453      25 min

(c) Zephyr
Zero-shot                   0.572      0.965   0.718      32 min    0.329      0.942   0.488      24 min
Intersection Few-shot       0.667      0.877   0.757      43 min    0.408      0.761   0.531      34 min
Composite Domain-specific   0.573      0.960   0.718      39 min    0.372      0.913   0.529      30 min

Table 2: Best performance per LLM in combination with the top-performing variant per prompt strategy across both datasets.

For Orca2, we observe that the FT few-shot prompts are the top performers in 𝐷1. The atomic domain-specific ones follow at a very close distance in terms of F-Measure, while exhibiting a much lower run-time. This means that the domain-specific prompts offer a significantly better balance between effectiveness and time efficiency. In 𝐷2, this strategy scores the highest F-Measure for a slightly higher run-time than the second-best approach (zero-shot prompts). For these reasons, Orca2 works best in combination with the atomic domain-specific prompts.

Regarding OpenHermes, the differences between the three types of prompts are minor in terms of F-Measure. As expected, the fastest approach in both datasets corresponds to the zero-shot prompts. This configuration also achieves the highest F-Measure in 𝐷1, while in 𝐷2, it ranks second, within a negligible distance from the top (<0.5%). Therefore, we can conclude that the zero-shot prompts are the best choice for OpenHermes.

For Zephyr, there is a clear winner in the case of 𝐷1: the intersection of few-shot prompts. It exhibits, though, the highest run-time by a large extent. This is expected, as it queries the LLM twice per candidate pair. In the case of 𝐷2, the same strategy takes a minor lead over the composite domain-specific prompts, which are faster by more than 10%. Due to its consistency, the best choice for Zephyr corresponds to the intersection of the TF and FT few-shot prompts.

Among the three 7B LLMs, the configuration consistently achieving (almost) the highest effectiveness in both datasets is Orca2 coupled with the atomic domain-specific prompts. Its efficiency is also rather high, given that its run-time is marginally higher than that of the fastest (zero-shot) configuration of the other two models.

5.5. Comparison to Baselines

To put the performance of the selected 7B LLMs into perspective, we compare it with three state-of-the-art EM approaches from the literature:

1. ZeroER [26], an unsupervised approach that requires no labelled data, learning Gaussian mixture models for matching and non-matching candidate pairs.
2. Magellan [29], a supervised approach combining binary classifiers with a series of hand-crafted features based on string similarity measures.
3. DeepMatcher [28], a framework leveraging the synergy between language models and Deep Learning classification.

For each method, we consider its best performance as reported in the literature. The results are reported in Table 3.

Method        𝐷1     Source  𝐷2     Source
ZeroER        0.520  [26]    0.644  [27]
Magellan      0.436  [28]    0.719  [28]
DeepMatcher   0.628  [28]    0.669  [28]

Table 3: The F-Measure per dataset reported in the literature for three state-of-the-art EM algorithms.

We observe mixed patterns. In 𝐷1, all LLM configurations in Table 2, even the zero-shot prompts, outperform all three baseline methods to a significant extent (>21%). This is remarkable, because the simplest prompt strategy requires neither domain expertise nor the labeling of candidate pairs, unlike Magellan and DeepMatcher, whose performance is derived from large training and validation sets, which amount to 60% and 20% of all candidate pairs, respectively.

The situation is reversed in 𝐷2, where all baseline methods achieve much better performance. In fact, the highest F-Measure of Orca2 is lower by 16.5% than the worst baseline (ZeroER). This should be attributed to the more challenging settings of 𝐷2, which have already been discussed in Section 5.1. Note also that the records in 𝐷2 are noisier, with a much higher portion of missing values. Its records are also longer, an aspect that is crucial for the 7B LLMs considered in this study, due to their limited attention window. These settings favor the learning-based functionality of the baseline methods, which take a clear lead over the learning-free functionality of 7B LLMs. Another reason for the poor performance of the latter is that they emphasize recall at the expense of precision, significantly decreasing their F-Measure in 𝐷2, due to the very low portion of matches in comparison to the total number of entities from each data source. Therefore, more advanced strategies are required for boosting the performance of 7B LLMs in datasets with characteristics similar to those of 𝐷2.

6. Conclusions & Future Work

Focusing on 7B open-source LLMs, we examined the performance of three main prompt strategies: (i) the basic, domain-agnostic zero-shot prompt, (ii) the few-shot prompt with one example per type of matches, and (iii) the domain-specific zero-shot prompt. We considered several variants for the last two strategies and applied all of them on two established benchmark datasets for product matching. Testing six popular LLMs, we reached the following conclusions:

• Few-shot and domain-specific prompting significantly improve the performance of the zero-shot approaches, highlighting the value of task-specific prompts.

• In few-shot prompts, the response of LLMs is generally sensitive to the order of examples. This suggests that careful prompt engineering is crucial for optimal performance in real-world ER applications.

• This sensitivity can be addressed by the intersection approach to few-shot prompting, which consistently achieves much better results, increasing precision at a higher rate than it reduces recall.

• Orca2 consistently outperformed the other LLMs across most prompting strategies and datasets, demonstrating high robustness and effectiveness. In fact, the relative performance of the best models (Orca2 > OpenHermes > Zephyr) remained largely consistent across prompt strategies and datasets, suggesting inherent strengths in their base architectures.

• The use of 4-bit quantization and 7B parameter models demonstrated the potential for effective EM with limited computational resources. The effectiveness of the considered models is competitive with established, learning-based EM approaches, especially in datasets with a low portion of missing values and short entity descriptions.

In the future, we plan to explore LLMs' capability in matching entities across different languages and to enhance the interpretability and explainability of LLM decisions.

Acknowledgments
This work was partially funded by the EU project STELAR (Horizon Europe – Grant No. 101070122).
References

[1] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer, 2012.
[2] G. Papadakis, E. Ioannou, E. Thanos, T. Palpanas, The Four Generations of Entity Resolution, Morgan & Claypool Publishers, 2021.
[3] X. L. Dong, D. Srivastava, Big data integration, in: ICDE, 2013, pp. 1245–1248.
[4] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Comput. Surv. 53 (2021) 127:1–127:42.
[5] K. Stefanidis, V. Efthymiou, M. Herschel, V. Christophides, Entity resolution in the web of data, in: 23rd International World Wide Web Conference, WWW, 2014, pp. 203–204.
[6] X. L. Dong, Building a broad knowledge graph for products, in: ICDE, 2019, p. 25.
[7] P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE Trans. Knowl. Data Eng. 24 (2012) 1537–1555.
[8] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, A survey of blocking and filtering techniques for entity resolution, CoRR abs/1905.06167 (2019).
[9] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng. 19 (2007) 1–16.
[10] A. Jurek, J. Hong, Y. Chi, W. Liu, A novel ensemble learning approach to unsupervised record linkage, Inf. Syst. 71 (2017) 40–54.
[11] P. Christen, Automatic record linkage using seeded nearest neighbour and support vector machine classification, in: SIGKDD, 2008, pp. 151–159.
[12] J. Fisher, P. Christen, Q. Wang, Active learning based entity resolution using Markov logic, in: PAKDD, 2016, pp. 338–349.
[13] G. Papadakis, E. Ioannou, T. Palpanas, Entity resolution: Past, present and yet-to-come, in: EDBT, 2020, pp. 647–650.
[14] K. Nikoletos, E. Ioannou, G. Papadakis, The five generations of entity resolution on web data, in: ICWE, 2024, pp. 469–473.
[15] R. Peeters, C. Bizer, Entity matching using large language models, CoRR abs/2310.11244 (2023).
[16] A. Narayan, I. Chami, L. J. Orr, C. Ré, Can foundation models wrangle your data?, Proc. VLDB Endow. 16 (2022) 738–746.
[17] T. Wang, H. Lin, X. Chen, X. Han, H. Wang, Z. Zeng, L. Sun, Match, compare, or select? An investigation of large language models for entity matching, CoRR abs/2405.16884 (2024).
[18] M. Fan, X. Han, J. Fan, C. Chai, N. Tang, G. Li, X. Du, Cost-effective in-context learning for entity resolution: A design space exploration, CoRR abs/2312.03987 (2023).
[19] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, CoRR abs/2307.09288 (2023).
[20] A. Q. Jiang, A. Sablayrolles, A. Mensch, et al., Mistral 7B, CoRR abs/2310.06825 (2023).
[21] A. Mitra, L. D. Corro, S. Mahajan, et al., Orca 2: Teaching small language models how to reason, CoRR abs/2311.11045 (2023).
[22] L. Tunstall, E. Beeching, N. Lambert, et al., Zephyr: Direct distillation of LM alignment, CoRR abs/2310.16944 (2023).
[23] C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, Y. Shan, P. Luo, Llama Pro: Progressive Llama with block expansion, in: ACL, 2024, pp. 6518–6537.
[24] K. Nikoletos, G. Papadakis, M. Koubarakis, pyJedAI: a lightsaber for link discovery, in: ISWC Posters, Demos and Industry Tracks, volume 3254, 2022.
[25] F. Neuhof, M. Fisichella, G. Papadakis, K. Nikoletos, N. Augsten, W. Nejdl, M. Koubarakis, Open benchmark for filtering techniques in entity resolution, VLDB J. 33 (2024) 1671–1696.
[26] R. Wu, S. Chaba, S. Sawlani, X. Chu, S. Thirumuruganathan, ZeroER: Entity resolution using zero labeled examples, in: SIGMOD, 2020, pp. 1149–1164.
[27] G. Papadakis, N. Kirielle, P. Christen, T. Palpanas, A critical re-evaluation of record linkage benchmarks for learning-based matching algorithms, in: ICDE, 2024, pp. 3435–3448.
[28] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: SIGMOD, 2018, pp. 19–34.
[29] P. Konda, S. Das, et al., Magellan: Toward building entity matching management systems, Proc. VLDB Endow. 9 (2016) 1197–1208.