<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
<journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn>1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Constantin Orasan</string-name>
          <email>c.orasan@surrey.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhe Wu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shenbin Qian</string-name>
          <email>s.qian@surrey.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diptesh Kanojia</string-name>
          <email>d.kanojia@surrey.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samarth Agrawal</string-name>
          <email>samagrawal@ebay.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hadeel Saadany</string-name>
          <email>hadeel.saadany@bcu.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swapnil Bhosale</string-name>
          <email>s.bhosale@surrey.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <kwd-group>
          <kwd>E-commerce</kwd>
          <kwd>Search</kwd>
          <kwd>Matryoshka</kwd>
          <kwd>Representation Learning</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Birmingham City University</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Surrey</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>eBay Inc</institution>
          ,
          <addr-line>San Jose, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>eBay Inc</institution>
          ,
          <addr-line>Seattle, WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>E-commerce information retrieval (IR) systems struggle to simultaneously achieve high accuracy in interpreting complex user queries and maintain efficient processing of vast product catalogs. The dual challenge lies in precisely matching user intent with relevant products while managing the computational demands of real-time search across massive inventories. In this paper, we propose a Nested Embedding Approach to product Retrieval and Ranking, called NEAR2, which can achieve up to 12× efficiency in embedding size at inference time while introducing no extra cost in training, and which improves accuracy for various encoder-based Transformer models. We validate our approach using different loss functions for the retrieval and ranking task, including multiple negative ranking loss and online contrastive loss, on four different test sets with various IR challenges such as short and implicit queries. Our approach achieves improved performance at a smaller embedding dimension compared to existing models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>can lead to dissatisfaction or abandoned searches. Optimizing these systems to handle large-scale data
efficiently without compromising accuracy is a critical challenge in e-commerce search.</p>
      <p>
        In this paper, we propose a Nested Embedding Approach to product Retrieval and Ranking, called
NEAR2, which can achieve efficient product retrieval and ranking using much smaller embedding sizes
of encoder-based Transformer models [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This approach maintains performance comparable to the full
model without incurring additional training costs. Our evaluation results on various test sets that contain
different types of challenging queries, such as implicit and alphanumeric queries, indicate that NEAR2
can improve model performance on these challenging datasets using significantly smaller embedding
dimension sizes. Our contributions can be summarized as follows:
• We propose NEAR2, a nested embedding approach, which can achieve up to 12× efficiency in
embedding size and 100× smaller in memory usage during inference while introducing no extra
cost in training.
• We evaluate NEAR2 on four different test sets that contain various types of challenging queries.
      </p>
      <p>Evaluation results show that our approach achieves an improved performance using a much smaller
embedding dimension compared to any existing models.
• We conduct ablative experiments on different encoder-based models fine-tuned using different
IR loss functions. We find that NEAR2 is robust to different IR losses or loss combinations for
continued fine-tuning.
• We perform a qualitative analysis on retrieved product titles using challenging queries. Our analysis
re-affirms the superior performance of our approach and reveals that the similarity scores from
NEAR2 models are more reliable than those of baseline models.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Modern IR systems encounter several challenges that hinder their performance, particularly in dealing
with complex queries and data representation. Ambiguities in natural language, vocabulary mismatches,
and the need for scalable real-time processing pose significant challenges [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Traditional term-based
models often fail due to lexical gaps and polysemy, necessitating the transition to advanced semantic
models. Semantic retrieval with dense representations, powered by neural networks and pre-trained
language models (PTLMs) like BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], has shown remarkable improvements in handling context and
semantics. However, these models demand substantial computational resources and struggle with implicit
or alphanumeric queries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Similarly, interaction-based approaches focus on capturing query-document
dynamics through deep neural networks, such as the Deep Relevance Matching Model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], but often
sacrifice efficiency and scalability due to their inability to cache document embeddings offline and their
reliance on real-time computation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To bridge the gap between user intent and retrieved product titles
in search queries, Saadany et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] curated a dataset annotated with user-intent centrality scores, and
proposed a dual loss optimization strategy to fine-tune PTLMs on the dataset in a multi-task learning
setting, to solve such challenges.
      </p>
      <p>
        To address the efficiency issue, researchers have proposed a range of solutions aimed at enhancing
efficiency while maintaining accuracy. Efficiency issues can be tackled using DUET
models that employ local and distributed deep neural networks, which learn dense lower-dimensional
vector representations of the query and the document text for efficient retrieval [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Knowledge
distillation, where smaller models inherit knowledge from larger PTLMs, has proven effective in reducing
resource requirements without compromising performance for IR systems [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To mitigate computational
overhead, Wan et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed to use dimension reduction and distilled encoders to create lightweight
models for fast and efficient question-answer retrieval. Kusupati et al. [13] proposed Matryoshka
representation learning (MRL) which is able to encode information at different granularities, to adapt to the
computational constraints of various downstream tasks. In this paper, we tackle the challenges of accuracy
and efficiency using a nested embedding approach based on MRL to create lightweight embedding models
for IR tasks.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>This section describes our nested embedding approach in § 3.1 and the backbone models in § 3.2.</p>
      <sec id="sec-4-1">
        <title>3.1. Nested Embedding Training</title>
        <p>We utilize MRL with a ranking loss to train nested embeddings of different sizes on various models.
Matryoshka Representation Learning. MRL develops representations with diverse capacities within
the same higher-dimensional vector by explicitly optimizing sets of lower-dimensional vectors in a nested
manner, as illustrated in Figure 1.</p>
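<p>As a minimal, hypothetical illustration of this nesting (our own sketch, not the paper’s implementation; the re-normalization step and function name are assumptions), a full embedding can be truncated to any nested size and re-normalized before computing cosine similarity:</p>

```python
import numpy as np

def truncate_embedding(vec, m):
    """Keep the first m dimensions of a Matryoshka embedding and
    re-normalize so that cosine similarity stays well defined."""
    prefix = np.asarray(vec, dtype=np.float64)[:m]
    norm = np.linalg.norm(prefix)
    return prefix / norm if norm > 0 else prefix

full = np.random.default_rng(0).normal(size=768)
for m in (768, 512, 256, 128, 64):
    sub = truncate_embedding(full, m)
    # every nested prefix is a unit vector usable on its own
    assert sub.shape == (m,)
    assert np.isclose(np.linalg.norm(sub), 1.0)
```

<p>Each prefix can then be indexed and searched on its own, which is what makes the smaller inference-time embedding sizes possible.</p>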
        <p>The initial m dimensions of the Matryoshka representation, where m ∈ M, the set of nested
representation sizes, form a compact and information-dense vector that matches the accuracy of a separately trained
m-dimensional representation, but requires no extra training effort. As dimensionality increases, the
representation progressively incorporates more detailed information, providing a nested coarse-to-fine
representation. This approach maintains near-optimal accuracy relative to the full dimensional scale,
while avoiding substantial training or deployment costs [14].</p>
        <p>The MRL loss is formally defined in Equation 1, where ℒ_task is the loss for downstream tasks, such
as the cross-entropy loss for classification tasks. f_m(x) is the output of the m-th nested embedding
representation, and c_m is the importance weight for the m-th embedding representation.</p>
        <p>ℒ_MRL = ∑_{m ∈ M} c_m · ℒ_task(f_m(x), y)    (1)</p>
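<p>Equation 1 can be sketched as follows (a toy illustration with an identity “embedding” and a squared-error task loss; all names and values here are ours, not the paper’s):</p>

```python
import numpy as np

def mrl_loss(embed, x, y, nesting_dims, weights, task_loss):
    """Equation 1: a weighted sum of the downstream task loss over the
    nested prefixes f_m(x), one term per size m in M with weight c_m."""
    total = 0.0
    for m, c_m in zip(nesting_dims, weights):
        f_m = embed(x)[:m]          # m-dimensional nested representation
        total += c_m * task_loss(f_m, y)
    return total

# Toy check: the "embedding" is the input itself and the task loss is a
# squared error against the first m target values.
embed = lambda x: np.asarray(x, dtype=np.float64)
sq_err = lambda f, y: float(np.sum((f - np.asarray(y)[: len(f)]) ** 2))
loss = mrl_loss(embed, [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0],
                nesting_dims=[2, 4], weights=[0.5, 0.5], task_loss=sq_err)
```

<p>With a Transformer encoder, f_m(x) would simply be the first m dimensions of the pooled sentence embedding, so all nested sizes are trained in one forward pass.</p>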
        <p>MRL learns multiple nested embedding representations, each with a different size m ∈ M. The final
MRL loss is a weighted sum of the task losses for each of the nested representations. For our product
retrieval and ranking task, we set the multiple negative ranking loss (MNRL) [15] as our ℒ_task.
Multiple Negative Ranking Loss. MNRL measures the difference between relevant (positive) and
irrelevant (negative) examples associated with a given query. This technique ensures a clear separation by
reducing the distance between the query and positive samples while increasing the distance from negative
samples. Using multiple negative examples enhances the model’s ability to discern varying levels of
irrelevance, refining its optimization. The MNRL objective function is formulated as follows:</p>
        <p>ℒ_MNRL = ∑_{i=1}^{P} ∑_{j=1}^{N} max(0, s(q, n_j) − s(q, p_i) + margin)    (2)</p>
        <p>In Equation 2, P represents the number of positive samples; N denotes the number of negative samples;
q is the query; s is the similarity metric (cosine similarity in our case); p_i and n_j are the i-th positive and
j-th negative samples; and margin is a hyperparameter defining the ideal distance between positive and
negative samples based on the relevance score. The goal of MNRL is to minimize the similarity s(q, n_j)
to negative samples while simultaneously maximizing the similarity s(q, p_i) to positive samples.</p>
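<p>Equation 2 can be sketched as a hinge loss over all positive-negative pairs (a simplified illustration assuming cosine similarity and the 0.75 margin reported in Section 4.2; function and variable names are ours):</p>

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mnrl(query, positives, negatives, margin=0.75):
    """Equation 2: a hinge term for every (positive, negative) pair, so
    each positive must score at least `margin` above each negative."""
    return sum(
        max(0.0, cosine(query, n) - cosine(query, p) + margin)
        for p in positives
        for n in negatives
    )

# A positive identical to the query and an orthogonal negative already
# satisfy the margin, so the loss is zero.
loss = mnrl([1.0, 0.0], positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]])
```

<p>In practice MNRL is usually computed in-batch, with the other queries’ positives serving as negatives; the toy version above only mirrors the formula itself.</p>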
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Backbone Models</title>
        <p>We used encoder-based Transformer models as our backbone for training nested embeddings for efficient
product retrieval and ranking.</p>
        <p>
          Pre-trained Language Models We initially leveraged BERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a publicly available pre-trained
encoder Transformer model. For our specific use case in e-commerce, we also employed eBERT, a
proprietary multilingual language model pre-trained internally at eBay. This custom model was
pre-trained on a corpus of approximately three billion product titles, supplemented by data from
general-domain sources like Wikipedia and RefinedWeb.
        </p>
        <p>Expanding our experimental approach, we also incorporated eBERT-siam, a fine-tuned variant of
eBERT using a Siamese network architecture. This model aims to generate semantically aligned
embeddings for item titles, making it particularly effective for similarity-based search and retrieval tasks.
Consistent across all models, we maintained a uniform architectural design of 12 layers with a dimension
size of 768.</p>
        <p>
          User-intent Centrality Optimized (UCO) Models Saadany et al. [
          <xref ref-type="bibr" rid="ref3">3, 16</xref>
          ] show how current IR systems
have problems in achieving user-centric product retrieval and ranking due to implicit or alphanumeric
queries. They curated a dataset with user-intent centrality scores (see Section 4.1) and proposed a few
models optimized for user-intent using an MNRL loss for retrieval and ranking, and an online contrastive
loss (OCL) for user-intent centrality. OCL builds on the traditional contrastive loss (CL) [17] approach but
introduces a more focused strategy. While conventional CL uses a twin network to evaluate similarities
between all data point pairs from the same and different classes, OCL targets only the most challenging
and informative pairs within a batch. By prioritizing such cases, OCL refines the loss calculation to focus
on the most critical and complex relationships between data points.
        </p>
        <p>
          They applied the two losses in a transfer learning setup for eBERT and eBERT-siam models, and
performed fine-tuning for centrality classification. Their results indicate that the UCO models achieve an
improved performance for retrieval and ranking. Details can be found in Saadany et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>To improve model efficiency and meanwhile leverage optimized performance of the UCO models, we
continued training them using NEAR2 for both eBERT-UCO and eBERT-siam-UCO models.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Setup</title>
      <p>This section explains the datasets we used for training, validating and testing our approach in § 4.1.
Implementation details and evaluation metrics are presented in § 4.2 and § 4.3 respectively.</p>
      <sec id="sec-5-0">
        <title>4.1. Data</title>
        <p>We utilized eBay’s internal graded relevance (IGR) datasets to train our nested embedding representations.
These datasets comprise user search queries alongside the product titles retrieved on the platform. They
are annotated by humans following specific guidelines to generate two types of buyer-focused relevance
labels.</p>
        <p>The first is a relevance ranking scheme, where query-title pairs are assigned a rank from (1) Bad, (2)
Fair, (3) Good, (4) Excellent, to (5) Perfect. A “Perfect” rating signifies an exact match between the query
and title, indicating high confidence that the user’s needs are fully met, whereas a “Bad” rating indicates
no alignment between the query and the product title. This ranking methodology aligns with previous
studies [18, 19]. The second annotation type is a binary centrality score, derived through majority voting
among multiple annotators, indicating whether a product aligns with the user’s expressed query intent.
Centrality scoring differs from relevance ranking in that it assesses whether an item is an outlier or
unexpected in the retrieval set versus being a core match to user expectations.</p>
        <p>
          To compare the results of our approach with those reported in Saadany et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we utilized the
Common Queries (CQ), CQ Balanced (CQ-balanced), CQ Common String (CQ-common-str), and
CQ Alphanumeric (CQ-alphanum) test sets proposed in their paper. The CQ test set was constructed
using queries with both positive (relevancy &gt; 3) and negative (relevancy &lt; 3) titles, resulting in a dataset
skewed toward positive pairs due to the nature of e-commerce data collection. To address this imbalance,
a new version, CQ-balanced, was created with approximately equal numbers of positive and negative
query-title pairs. The CQ-common-str set was derived by selecting queries where the exact query string
appeared in both positive and negative titles, ensuring a strong correlation between relevance scores (both
graded relevance and binary centrality). Finally, CQ-alphanum was created to include only query-title
pairs containing alphanumeric characters, allowing for a more focused evaluation. Details about their
formulation can be found in Saadany et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. An example of the datasets and the size for each test set
can be seen in Figure 2 and Table 1.
        </p>
        <p>Figure 2: (a) The query “turtle” is a part of both positive and negative titles with very different product
search outputs; it could also be a part of the ambiguous query “turtles bepop”. (b) The query “turtles bepop”
is ambiguous as it could refer to the major antagonist, “Bepop”, alone or together with the other Ninja Turtles.</p>
        <p>Table 1 lists the sizes of the CQ, CQ-balanced, CQ-common-str and CQ-alphanum test sets.</p>
      </sec>
      <sec id="sec-5-1">
        <title>4.2. Implementation Details</title>
        <p>We continued training the PTLMs and the UCO models in § 3.2 for 2 epochs, using our nested embedding
approach at dimension sizes of 768, 512, 256, 128 and 64, on the query-title pairs using only the relevance
ranking scores (excluding pairs with a score of 3) of the IGR datasets.</p>
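<p>The label conversion implied above (relevancy greater than 3 as positive, less than 3 as negative, score-3 pairs excluded) can be sketched as follows; the helper name and toy rows are ours, not eBay’s pipeline:</p>

```python
def to_training_pairs(rows):
    """Map graded relevance labels (1-5) to binary training labels,
    dropping the ambiguous middle grade (3) as described above."""
    pairs = []
    for query, title, score in rows:
        if score == 3:
            continue                     # pairs with score 3 are excluded
        label = 1 if score > 3 else 0    # positive vs. negative pair
        pairs.append((query, title, label))
    return pairs

rows = [("plants", "Potted Peace Lily", 5),
        ("plants", "Garden gloves", 3),
        ("plants", "coins", 1)]
pairs = to_training_pairs(rows)
```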
        <p>During training, we ran a sequential evaluator on the ranking score data to validate all dimension
sizes. First, the evaluator computes the embeddings for both query and title and uses them to calculate
the cosine similarity. Then, it finds the most relevant product titles to the query (top 3, 5 and 10 titles) in
the corpus of all titles, with a maximum corpus size of 200,000. For all experiments, we set a batch size of 32,
a margin of 0.75 for the MNRL loss with the AdamW optimizer [20], and a learning rate of 5e-05.
Training one model using the above hyperparameters takes ≈ 1.5 hours on a single NVIDIA V100 GPU.</p>
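<p>The evaluator’s retrieval step described above can be sketched as a brute-force cosine search over the title corpus (a simplified stand-in for the internal evaluator; the function name and toy vectors are ours):</p>

```python
import numpy as np

def top_k_titles(query_emb, title_embs, k=10):
    """Rank a corpus of title embeddings against one query embedding by
    cosine similarity and return (indices, scores) of the top-k titles."""
    q = np.asarray(query_emb, dtype=np.float64)
    t = np.asarray(title_embs, dtype=np.float64)
    q = q / np.linalg.norm(q)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = t @ q                       # cosine similarity per title
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order.tolist(), sims[order].tolist()

titles = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
idx, scores = top_k_titles([1.0, 0.0], titles, k=2)
```

<p>With nested embeddings, the same search runs on the first m dimensions of the stored vectors, which is where the inference-time savings come from.</p>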
      </sec>
      <sec id="sec-5-2">
        <title>4.3. Evaluation Metrics</title>
        <p>We evaluated the model effectiveness through multiple established evaluation metrics including precision,
recall, normalized discounted cumulative gain (NDCG) [21] and mean reciprocal rank (MRR).</p>
        <p>Precision@k quantifies the ratio of pertinent items within the top-k recommended products, focusing
on their individual relevance. Conversely, recall@k assesses the proportion of successfully retrieved
relevant items compared to the total number of applicable products, regardless of their positioning.
NDCG provides a comprehensive assessment of recommendation quality by analyzing both the relevance
and positioning of suggested items. This metric compares the actual recommendation order against an
idealized ranking, offering a nuanced evaluation of recommendation performance. MRR focuses on
measuring the average ranking position of the first relevant item across different queries. A superior MRR
indicates the model’s capability to prominently feature highly relevant products, thereby enhancing user
experience and recommendation effectiveness.</p>
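<p>For a single ranked list, NDCG and MRR can be sketched as follows (simplified reference implementations for illustration; they take relevance grades and first-relevant ranks directly rather than model outputs):</p>

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: DCG of the actual ranking divided by the DCG of the ideal
    (descending) ordering of the same relevance grades."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

def mrr(first_relevant_ranks):
    """Mean reciprocal rank, given the 1-indexed rank of the first
    relevant item for each query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)
```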
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and Discussion</title>
      <p>
        Results achieved using NEAR2 with a dimension size of 64 are shown in Table 2. Since BERT and
eBERT were not fine-tuned on e-commerce data (eBERT was only pre-trained on e-commerce data), the
improvement achieved using our approach is substantial, as listed in Table A.1 in Appendix A. The values
are shown as the percentage increase (delta) of the evaluation metrics in comparison with those without
using NEAR2, as presented in Saadany et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Comparing results using NEAR2 against existing models, we find that our approach remarkably
improves performance on all test sets for all models in § 3.2, even using embeddings with a dimension
size of 64, which is 12× smaller in size and more than 100× smaller in memory usage than the full model
(see Table 3).</p>
      <p>When comparing results of different dimension sizes from the largest (768) to the smallest (64), as
shown in Table 4 for the CQ test set (BERT and eBERT results are in Table A.2 in Appendix A), we discover
that the drop in performance is not significant. Embeddings of some smaller dimensions are even slightly
better than larger ones. For example, the performance of the eBERT-siam model using NEAR2 at
dimension 512 is slightly better than at 768, which further indicates the effectiveness of our approach for
product retrieval and ranking.</p>
      <p>Across the CQ, CQ-balanced, CQ-common-str and CQ-alphanum test sets, the values of Table 2 show
improvements ranging from approximately +1.18% to +11.80% for the eBERT-siam, eBERT-UCO and
eBERT-siam-UCO models.</p>
      <p>Table 2: Delta in precision, recall, NDCG, and MRR at k on all the test sets for different encoder-based
models fine-tuned using NEAR2 at 64 dimensions of the entire embedding size (768).</p>
      <p>Table 3: Memory usage (MB) for embedding sizes of 768, 512, 256, 128 and 64.</p>
      <p>To further validate our approach, we qualitatively compared some product titles retrieved with and
without NEAR2. The comparison consistently confirmed the superior performance of our method. Full
details are presented in Appendix B.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Ablation Study</title>
      <p>To verify whether continual training using NEAR2 can help improve performance and efficiency when
models are initially trained with other losses, we conducted several experiments using eBERT and
eBERT-siam for ablation studies. First, we continued training the models using NEAR2 after they had
been fine-tuned using the MNRL and OCL losses respectively, to test whether our approach works on
each of the two individual losses. Second, we tested training these models using the MRL loss first, and
then continued fine-tuning on the MNRL and OCL losses in a multi-task learning setting. The results
are contrasted with training without using NEAR2, and are presented as the percentage increase (delta)
in the evaluation metrics in Table 5.</p>
      <p>Our ablative results suggest that applying the nested embedding approach to training embeddings with
lower dimensions can improve performance for all models fine-tuned using the MNRL or OCL losses
for retrieval and ranking, with more obvious improvement on the models trained using the OCL loss.
However, models trained with the MRL loss first, then fine-tuned using the MNRL and OCL losses, show
slight performance degradation in terms of NDCG and MRR. This suggests that our approach is most
effective when used after training the model with an IR task loss first.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion and Future Work</title>
      <p>E-commerce IR systems face the challenge of balancing accurate interpretation of complex user queries
with efficient processing of large product catalogs. To address this, we introduced NEAR2, a nested
embedding approach for efficient product retrieval and ranking. NEAR2 improves accuracy and achieves
up to 12× efficiency in embedding size and more than 100× smaller memory usage during inference,
without any increase in training costs. Tested across diverse datasets, including short and implicit queries
and alphanumeric queries, our method outperforms existing models with smaller embedding dimensions,
demonstrating both its robustness across challenging evaluation sets and its efficiency. Our qualitative
analysis reinforces the superior performance of our approach, demonstrating that embeddings generated
by NEAR2 models are significantly more reliable than those of baseline models when evaluated based on
similarity scores. For future work, we plan to: 1) evaluate our model performance through A/B testing in
deployment, 2) leverage internal data to refine larger decoder-based generalist embedding models like
NV-Embed-v2 [22], and 3) optimize these models using our NEAR2 approach.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (GPT-4) and Grammarly for grammar
and spelling checking. After using these tools/services, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-references">
      <title>References</title>
      <p>[…] Linguistics: Human Language Technologies: Industry Track, Association for Computational
Linguistics, Hybrid: Seattle, Washington + Online, 2022, pp. 334–343. URL: https://aclanthology.org/2022.naacl-industry.37. doi:10.18653/v1/2022.naacl-industry.37.</p>
      <p>[13] A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder,
K. Chen, S. Kakade, P. Jain, et al., Matryoshka representation learning, in: Advances in Neural
Information Processing Systems, 2022.</p>
      <p>[14] X. Li, Z. Li, J. Li, H. Xie, Q. Li, ESE: Espresso sentence embeddings, arXiv preprint arXiv:2402.14776 (2024).</p>
      <p>[15] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos,
R. Kurzweil, Efficient natural language response suggestion for smart reply, arXiv preprint arXiv:1705.00652 (2017).</p>
      <p>[16] H. Saadany, S. Bhosale, S. Agrawal, Z. Wu, C. Orasan, D. Kanojia, Product retrieval and ranking for
alphanumeric queries, in: Proceedings of the 33rd ACM International Conference on Information
and Knowledge Management, CIKM ’24, Association for Computing Machinery, New York, NY, USA,
2024, pp. 5564–5565. URL: https://doi.org/10.1145/3627673.3679080. doi:10.1145/3627673.3679080.</p>
      <p>[17] F. Carlsson, A. C. Gyllensten, E. Gogoulou, E. Y. Hellqvist, M. Sahlgren, Semantic re-tuning
with contrastive tension, in: International Conference on Learning Representations, 2021. URL:
https://openreview.net/forum?id=Ov_sMNau-PF.</p>
      <p>[18] Y. Jiang, Y. Shang, R. Li, W.-Y. Yang, G. Tang, C. Ma, Y. Xiao, E. Zhao, A unified neural network
approach to e-commerce relevance learning, in: Proceedings of the 1st International Workshop
on Deep Learning Practice for High-Dimensional Sparse Data, DLP-KDD ’19, Association for
Computing Machinery, New York, NY, USA, 2019. URL: https://doi.org/10.1145/3326937.3341259. doi:10.1145/3326937.3341259.</p>
      <p>[19] D. Kang, W. Jang, Y. Park, Evaluation of e-commerce websites using fuzzy hierarchical TOPSIS based
on E-S-QUAL, Applied Soft Computing 42 (2016) 53–65. URL: https://www.sciencedirect.com/science/article/pii/S1568494616300047. doi:10.1016/j.asoc.2016.01.017.</p>
      <p>[20] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on
Learning Representations, 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.</p>
      <p>[21] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst.
20 (2002) 422–446. URL: https://doi.org/10.1145/582415.582418. doi:10.1145/582415.582418.</p>
      <p>[22] C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, W. Ping, NV-Embed: Improved
techniques for training LLMs as generalist embedding models, 2024. URL: https://arxiv.org/abs/2405.17428. arXiv:2405.17428.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Additional Figures and Tables</title>
      <p>Table A.1: Delta in precision, recall, NDCG, and MRR at k on all the test sets for BERT and eBERT
fine-tuned using NEAR2 at 64 dimensions of the entire embedding size (768), with improvements ranging
from approximately +104.15% to +273.74%.</p>
      <p>Table A.2: Results for BERT and eBERT across dimension sizes, with improvements ranging from
approximately +114.99% to +265.57%.</p>
    </sec>
    <sec id="sec-11">
      <title>B. Detailed Qualitative Analysis</title>
      <p>To understand the performance improvements of our approach compared to existing models, we conducted
a qualitative analysis using examples from the CQ test set. Specifically, we generated inferences for
all instances in the CQ test set with eBERT and eBERT-siam (we mainly analyze results from eBERT;
results from eBERT-siam can be seen in Tables B.3 and B.4), using or not using the NEAR2 approach
at a dimension size of 64 (NEAR2@64). For each query, we retrieved the top 10 product titles and
ranked them based on their cosine similarity scores. To evaluate real-world performance, we selected two
representative queries: one short and implicit query and one long and detailed query. These examples
provided insights into how our approach performs relative to eBERT or eBERT-siam in practical scenarios.</p>
      <p>Short and Implicit Query. Table B.1 illustrates the retrieved titles, their rankings (from 1 to 10), and
their similarity scores (normalized against the minimum value) for the short and implicit query “plants”
with eBERT. Based on the gold label, the expected product title should include “potted plants”. For the
model using NEAR2@64, all retrieved product titles contained relevant keywords such as “plant” or “pot”,
along with detailed product descriptions. In contrast, the titles retrieved by the model without using
NEAR2@64 were significantly shorter, with many lacking the keyword “plant” and some, such as “coins”,
being entirely irrelevant to the query. Notably, the normalized similarity scores without using NEAR2@64
are much lower than those using NEAR2@64, which accounts for the irrelevant titles retrieved. This
highlights the unreliability of the similarity scores from models without using NEAR2.</p>
      <sec id="sec-11-1">
        <title>Table B.1: Retrieved titles for the short and implicit query “plants” using or not using NEAR2@64 on eBERT</title>
        <p>With NEAR2@64, the retrieved titles include, for example, “Philodendron Micans Rooted Cutting Trailing
House Plant Cuttings Rare Plants”, “Tillandsia Mix 5 Plants Indoor Air Plant for House Vivarium Terrarium”
and “Spathiphyllum Peace Lily Indoor Plants 1 x Potted Lily House Plant 9cm Pot”. Without NEAR2@64,
the retrieved titles include “Avocado plant”, “coins”, “Begonia Butterfly”, “drinks cabinet”, “Eucalyptus tree”
and other short or irrelevant items.</p>
      </sec>
      <sec id="sec-11-2">
        <title>Table B.2: Retrieved titles for the long and detailed query “925 sterling silver triplet opal gemstone jewelry vintage pendant s-1.20” using or not using NEAR2@64 on eBERT</title>
      </sec>
      <sec id="sec-11-3">
        <title>Long and Detailed Query</title>
        <p>Table B.2 presents the retrieved titles, their rankings, and their normalized
similarity scores for the long and detailed query “925 sterling silver triplet opal gemstone jewelry vintage
pendant s-1.20” with eBERT. Given the specificity of the query, even searching with the exact gold label title did
not yield the exact product on eBay. However, the model using NEAR2@64 retrieved similar products,
as shown in Figure B.1(b). In contrast, the products retrieved using the top-ranked title from eBERT without
NEAR2@64, shown in Figure B.1(c), were significantly less relevant than those retrieved using
the gold label title in Figure B.1(a). These results further demonstrate the effectiveness of NEAR2@64.
As with the short query example in Table B.1, the normalized similarity scores from eBERT without
NEAR2@64 are much lower than those with it, further underscoring the limitations of the model without NEAR2.
</p>
        <p>Figure B.1: Products retrieved on eBay using (a) the gold label title, (b) the top-ranked title from eBERT using NEAR2@64, and (c) the top-ranked title from eBERT without NEAR2@64, for the query-title pairs in Table B.2.</p>
      </sec>
      <sec id="sec-11-4">
        <title>Performance Disparity</title>
        <p>To investigate the root cause of this performance disparity, we plotted the
distribution of the original similarity scores from eBERT for all retrieved query-title pairs in the CQ test set, as
shown in Figure B.2. The scores from the model using NEAR2@64 are well distributed between 0.5 and
1.0, reflecting nuanced relevance judgements. In contrast, the scores from eBERT without NEAR2@64
are clustered between 0.9 and 1.0, with most query-title pairs assigned a score near 0.95. This
concentration suggests that eBERT on its own fails to differentiate effectively between relevant and irrelevant titles,
leading to poor ranking performance. These findings further explain the superior performance of NEAR2@64 on
the evaluation metrics for the product retrieval and ranking tasks.</p>
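<p>A distribution check of this kind is easy to reproduce on toy data. The snippet below computes cosine similarities from full-dimensional embeddings and from a 64-dimensional truncation, then summarizes the spread of each score distribution; the random vectors, the 768/64 dimensions, and truncation as a stand-in for the NEAR2@64 representation are all illustrative assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_scores(query_vec, title_mat):
    """Cosine similarity between one query vector and each row of title_mat."""
    q = query_vec / np.linalg.norm(query_vec)
    t = title_mat / np.linalg.norm(title_mat, axis=1, keepdims=True)
    return t @ q

# Toy stand-ins: 768-d embeddings (eBERT-sized) and their first 64 dims
# (a crude proxy for a 64-d Matryoshka-style representation).
query = rng.normal(size=768)
titles = rng.normal(size=(1000, 768))

full_scores = cosine_scores(query, titles)
trunc_scores = cosine_scores(query[:64], titles[:, :64])

# Inspect the spread of each score distribution, as done for Figure B.2.
for name, s in [("768-d", full_scores), ("64-d", trunc_scores)]:
    print(f"{name}: min={s.min():.3f} max={s.max():.3f} std={s.std():.3f}")
```

On random Gaussian data the lower-dimensional scores spread more widely (their standard deviation scales roughly as 1/sqrt(d)); this is a statistical artifact of the toy setup, not a reproduction of the eBERT result itself.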
        <p>For product titles retrieved by eBERT-siam, whether for the short, implicit query or the long, detailed
query, the differences between the titles retrieved with and without NEAR2@64 are less pronounced
than those observed with eBERT. However, the similarity scores still show a notable distinction.
As illustrated in Figure B.3, the model using NEAR2@64 produces scores that are well distributed
between 0.45 and 1.0. In contrast, the scores from the model without it are more tightly
clustered between 0.65 and 1.0, with the majority of query-title pairs receiving scores between 0.75 and
0.9. These results are consistent with the findings from the eBERT model.</p>
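<p>The rankings reported in Tables B.1 through B.4 come from nearest-neighbour retrieval over title embeddings. A minimal version of that step is sketched below; the function name and the brute-force search are illustrative (a production system would use an approximate-nearest-neighbour index).</p>

```python
import numpy as np

def retrieve_top_k(query_vec, title_mat, k=10):
    """Rank titles by cosine similarity to the query embedding and
    return the indices and scores of the top-k.
    Brute-force search over all titles; illustrative only."""
    q = query_vec / np.linalg.norm(query_vec)
    t = title_mat / np.linalg.norm(title_mat, axis=1, keepdims=True)
    scores = t @ q
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return order, scores[order]
```

For example, a title whose embedding points in the same direction as the query is ranked first with a cosine score of 1.0.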
        <p>Gold label: CRAZY DAISY Shasta daisies Qty 2 PLANTS Hardy Perennial Healthy plants.</p>
        <p>[Table B.3: rankings, retrieved titles, and normalized similarity scores for the short and implicit query “plants” using or not using NEAR2@64 on eBERT-siam. Retrieved titles include “CRAZY DAISY Shasta daisies Qty 2 x Hardy Perennial healthy plants”, “Spathiphyllum Peace Lily Indoor Plants 1 x Potted Lily House Plant 9cm Pot”, “Leucanthemum Crazy Daisy in plant in 13cm pot approx” and “Aloe Vera Plant - Large Plant in Pot”.]</p>
      </sec>
      <sec id="sec-11-5">
        <title>Table B.4: Retrieved titles for the detailed query “925 sterling silver triplet opal gemstone jewelry vintage pendant s-1.20” using or not using NEAR2@64 on eBERT-siam</title>
        <p>[Table B.4: rankings, retrieved titles, and normalized similarity scores.]</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>