<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM SIGIR Workshop on eCommerce, July</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>(Vector) Space is Not the Final Frontier: Product Search as Program Synthesis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacopo Tagliabue</string-name>
          <email>jacopo.tagliabue@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ciro Greco</string-name>
          <email>ciro.greco@bauplanlabs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bauplan Labs</institution>
          ,
          <addr-line>New York City</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York City</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>27</volume>
      <issue>2023</issue>
      <abstract>
        <p>As ecommerce continues to grow, huge investments in ML and NLP for Information Retrieval are following. While the vector space model has dominated retrieval modelling in product search, even as vectorization itself changed greatly with the advent of deep learning, this position paper argues, in a contrarian fashion, that program synthesis provides significant advantages for many queries and for a significant number of players in the market. We detail the industry significance of the proposed approach, sketch implementation details, and address common objections, drawing from our experience building a similar system at Tooso.</p>
      </abstract>
      <kwd-group>
        <kwd>product search</kwd>
        <kwd>semantic parsing</kwd>
        <kwd>program synthesis</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>“Now, like all great plans, my strategy is so simple an idiot could have devised it”
– Zapp Brannigan</p>
      <p>
        The explosive growth of ecommerce [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] brought equally impressive innovation in Information
Retrieval (IR) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with product search now representing 30% to 60% of total online revenues [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,
4, 5</xref>
        ]. Building on decades of literature in web and document retrieval, product search is typically
modelled as a two-step process: candidate selection (retrieval [6]) and re-ranking [7, 8, 9, 10].
The most widespread model for retrieval is the vector space model (VSM) [11, 12, 13], according
to which relevance is approximated by the distance between a query vector and a product vector
in a suitable space. Even as deep learning drastically altered vectorization [14], it did not call
into question the tenets of the VSM, or the idea that re-ranking is needed to push irrelevant
items wrongfully retrieved down the page [15, 16]. It is important to remember that most real-world
search engines leverage the VSM in one form or another: sparse BM25 retrieval in Elasticsearch
may be implemented very differently from dense retrieval on Redis Vector Search1, but they all
share the core idea of the VSM, namely that relevance is fundamentally approximated by distance
in a vector space.
      </p>
      <p>We argue that program synthesis through semantic parsing provides a principled and viable
alternative to VSM for product search. In this perspective, search queries are (informal)
instructions for knowledge bases, as opposed to points in a vector space2. We shall defend two main
claims:
1. VSM is an indirect representation of meaning that is necessary for large unstructured
documents, such as those in web search; however, under different circumstances, where
search queries are interpreted against product catalogs, direct representation is feasible
and useful;
2. explicit representations unlock a powerful search experience where formal inferences
can be made to improve retrieval, while ranking is used as a device for personalization.</p>
      <p>Historically, ecommerce tech has focused mostly on the challenges of big players, while
a larger market share represented by mid-to-large websites has been neglected [18]. While we
recognize the intrinsic limits of position papers, we believe our contrarian argument will benefit
from the freedom allowed by this format. Our arguments proceed as follows: we first establish
some empirical facts about ecommerce search at the “Reasonable Scale”; we then showcase the
virtues of program synthesis, assuming a semantic oracle. Finally, we show how such a system
can actually be built.</p>
      <p>We believe this work to be valuable for a broad set of practitioners, solving specific use cases
in this segment of the market or working on SaaS solutions3. Even if most of the arguments we
2As we explain below (Fig. 4), our approach is to parse a search query into an intermediate semantic representation, and
then translate the latter into a program, handling the shopping query “as if it were instructions”; program synthesis
may also be performed directly from natural language [17]. We will refer to parsing and synthesis somewhat liberally
below, since it is clear how to move from one to the other.
3As business context for this blooming industry, Algolia and Bloomreach each raised &gt;USD200M in venture
money in the last few years [19, 20], and Coveo raised &gt;CAD200M at IPO [21].
present are theoretical, these ideas have been successfully implemented in a company before
(Tooso), and played an important role in its acquisition by a public market leader (TSX:CVO)4.</p>
    </sec>
    <sec id="sec-3">
      <title>2. An Industry Perspective</title>
      <p>“Hooray! A happy ending for the rich people.” – Dr. Zoidberg</p>
      <p>While the idiosyncrasies of product search have been partially documented before [25, 26],
most ecommerce systems are still designed from the same building blocks as document search:
VSM for retrieval, Machine Learning (ML) for re-ranking using all types of signals. In our
experience, the farther you go from planetary scale retailers, the less product search will
resemble web search.</p>
      <p>Because digital transformation is consistently taking place in the retail industry, most
ecommerce search systems are now deployed outside of Big Tech Retailers. We are going to describe
the mid-long tail of ecommerce implementations as the “Reasonable Scale” (RSc) [27, 28, 29, 30].
While RSc is intended to be a loose concept [18], practitioners typically know it when they see
it [31].</p>
      <p>A number of strategies need to be different at RSc. For instance, instead of several million
SKUs, RSc shops may have 10K to 100K products and still make &gt;100M USD in yearly revenues.
Queries on inventories of this size can easily have result sets of 10 golden items. In this context,
no re-ranking strategy will be able to hide irrelevant products from the user: for the typical
strategy of hiding results in page two5 to work, there should be a page two to begin with. Even
as inventory grows, VSM may go against shoppers’ preference: for price-sensitive items, users
often sort results by price [32]. When this happens, sub-optimal candidate selection can hurt
the experience (Fig. 3)(see also the cases discussed in [33] with regard to prices and sizes).</p>
      <p>To paint a more quantitative picture of the RSc, we can leverage our unique and privileged
position as SaaS practitioners with access to dozens of different real-world deployments. In
particular, there are two main facts that turn out to be crucial for our approach (Section 4):
4While most of these ideas have been developed in 2017-2019, we have updated our arguments to reflect the most
recent advancements in the field.
5“The best place to hide a dead body is page two of Google.”</p>
      <p>
        1. product search mostly deals with short queries in the form of Noun Phrases (NPs)
describing entities and properties (e.g. “red shoes” or “Dell laptop”) [34]. Query examples
from RSc shops can also be found in [35] (Table 1) and [
        <xref ref-type="bibr" rid="ref6">36</xref>
        ];
2. a small number of queries account for a significant portion of the distribution, making
superior relevance for top queries extremely impactful for the overall experience. In the
frequency distribution of a month of anonymous query data sampled from three RSc
shops in two languages, the top 1-to-5% queries account for half of the total individual
queries (Fig. 2).
      </p>
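      <p>To make the second fact concrete, head coverage can be computed directly from a query log. A minimal sketch in Python, with illustrative counts rather than the shops’ actual data:</p>

```python
from collections import Counter

def head_coverage(query_counts, top_fraction):
    """Share of total query volume covered by the top `top_fraction` of distinct queries."""
    counts = sorted(query_counts.values(), reverse=True)
    k = max(1, int(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)

# Hypothetical power-law-like query log (illustrative counts, not real shop data).
log = Counter({"shoes": 500, "nike shoes": 300, "dress": 200, "red dress": 60,
               "ski gloves": 20, "prada purple shoes": 5, "dell laptop": 4,
               "blue shoes under 100": 3, "running shoes": 2, "socks": 1})
print(round(head_coverage(log, 0.2), 2))  # the top 20% of distinct queries dominate volume
```

      <p>On real RSc logs, curves of this kind are what motivates prioritizing superior relevance for head queries.</p>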
      <p>The first observation is important as parsing gets harder with longer queries; the second
observation is important as it indicates how to align technological objectives with business
outcomes – i.e., solving parsing for short queries is a very good place to start.</p>
      <p>Taken together, they both re-affirm the peculiarities of product search, but from a novel and
unusual angle: interestingly, both facts are not true for web or big-scale ecommerce search – as
the numbers of users / items get larger and revenues grow into billions, the tail of the query
distribution gets both longer and more important. In other words, while the general linguistic
behavior for users of Amazon or Facebook is also NP-based, the tail is disproportionally more
important: the tail is longer, as big catalogs invite a larger set of inputs, and the tail is more
valuable, as marginal improvements in rare queries translate into sizable monetary gains. While
we believe our approach can be used, under the appropriate circumstances, at any scale, its
novelty and impact are more easily noticeable for RSc deployments.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Searching with an oracle</title>
      <p>
        Originally developed for large documents and long queries, VSM is a useful approximation, as it
provides a retrieval strategy that avoids explicitly modelling meaning, which has long been
thought to be an intractable problem: what would be the logical form [
        <xref ref-type="bibr" rid="ref7">37</xref>
        ] of this Wikipedia
page6? As we argue below, the challenges of explicit representations are eased for product
search: on the query side, real-world data shows that NP-like queries are very impactful (Section
6https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
2); on the item side, products are remarkably different from long documents: products are
well-defined entities, which can be described through a sortal (i.e. the type of object, e.g. “shoes”) and
a few key properties (e.g. color, material, size, brand, price - crucially those more often used by
shoppers [
        <xref ref-type="bibr" rid="ref6 ref8">38, 36</xref>
        ]). In other words, products already come into an IR system as (quasi) structured
information.
      </p>
      <p>What would a search-as-parsing experience look like? We first sketch the general experience
we have in mind through a “parsing oracle” (PO) - i.e. an idealized system that is able to:
• at runtime, return the logical form of a query;
• at indexing time, given a product (as contained in a digital catalog [24]), return its
properties.</p>
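      <p>The two capabilities of the PO can be written down as an interface that any concrete implementation (Section 4) must provide; the method names below are our illustrative assumptions:</p>

```python
from typing import Protocol

class ParsingOracle(Protocol):
    """Idealized parsing oracle with its two capabilities (names are illustrative)."""

    def parse_query(self, query: str) -> str:
        """At runtime: return the logical form of a search query."""
        ...

    def parse_product(self, product: dict) -> dict:
        """At indexing time: return a product's typed properties (sortal, color, brand, ...)."""
        ...
```

      <p>Keeping the two sides behind one interface makes explicit that queries and products must share a single domain of interpretation.</p>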
      <p>
        Under the proposed approach, a query is parsed into a logical form (parsing), which is mapped
to a machine code to be executed over the target domain (synthesis): in Fig. 4 we find lambda
expressions and SQL [
        <xref ref-type="bibr" rid="ref9">39</xref>
        ], but the proposal is broadly compatible with any explicit formalism.
In other words, the meaning of “Prada purple shoes” is neither boolean operators over TF-IDF
weights, nor a BERT-based embedding, but (something like):
λx.[Prada(x) &amp; shoes(x) &amp; purple(x)].7
      </p>
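      <p>Once predicates are typed, mapping such a logical form to executable code is mechanical. A minimal sketch of the synthesis step, where the table name, column names, and the conjunction-of-filters format are our illustrative assumptions rather than a fixed schema:</p>

```python
# Illustrative predicate typing: each typed predicate maps to a column filter.
PREDICATE_TYPES = {"Prada": "brand", "purple": "color", "shoes": "product_type"}

def synthesize_sql(predicates):
    """Translate a conjunction of typed predicates into a SQL query over a catalog table."""
    clauses = [f"{PREDICATE_TYPES[p]} = '{p}'" for p in predicates]
    return "SELECT sku FROM products WHERE " + " AND ".join(clauses)

print(synthesize_sql(["Prada", "shoes", "purple"]))
# SELECT sku FROM products WHERE brand = 'Prada' AND product_type = 'shoes' AND color = 'purple'
```

      <p>The same logical form could just as well be compiled to a graph query or a lambda expression over a knowledge base; the proposal is agnostic about the target formalism.</p>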
      <p>
        Viewing queries as (small) programs to execute has several advantages. First, it provides
the ability to apply filters that are already available and show their application to the user –
this is often desirable but in a VSM-system it requires an additional module to be trained and
maintained. Second, the explicit and easy-to-debug “trace” of the query enables principled
fallback strategies. As an illustration, assume the user issues the query “purple shoes”, which
has no perfect matches. The logical form that (roughly) states “retrieve an object of type shoes,
with purple as a color”, allows us to reason about the next best thing available, and provide
a graceful fallback message (e.g. “we don’t have purple shoes, but we thought you could like
dark red shoes”)8. An explicit parse leads us to recognize that different tokens in the query
have different psychological importance for the shopper: if the retrieval goal revolves around
shoes, the system should retrieve items that are still shoes while never retrieving purple items
that are not shoes. In this perspective, parsing both yields the exact linguistic intent and lays
down possible compositional fallback strategies. Crucially, fallback strategies can be ML-driven,
domain-driven, or heuristic-driven and may change from one deployment to the next: by
turning queries into code, we make it easier to incorporate constraints (including probabilistic
ones) into an interpretable search plan.
7With Prada, shoes, purple as predicates of type brand, sortal, color, respectively.
8Note that while IR explanations are often used to improve recommender systems [
        <xref ref-type="bibr" rid="ref10">40</xref>
        ], search may benefit from
them for similar reasons.
      </p>
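      <p>Such a compositional fallback can be sketched as predicate relaxation that never touches the sortal; the color-substitution table below is a hypothetical, deployment-specific resource:</p>

```python
# Hypothetical per-deployment resource: acceptable substitutes for a missing property value.
NEAREST_COLOR = {"purple": ["dark red", "violet"]}

def fallback_plans(sortal, color):
    """Ordered retrieval plans: exact match first, then color substitutes, then sortal only.
    The sortal (e.g. 'shoes') is never relaxed; only property predicates are."""
    plans = [{"product_type": sortal, "color": color}]
    for alt in NEAREST_COLOR.get(color, []):
        plans.append({"product_type": sortal, "color": alt})
    plans.append({"product_type": sortal})  # last resort: any shoes, never purple non-shoes
    return plans

plans = fallback_plans("shoes", "purple")
```

      <p>Because each plan is explicit, the system can also explain to the shopper which constraint was relaxed.</p>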
      <p>
        To further appreciate the experience, it is useful to contrast what would have happened under
plausible implementations of VSM. Under a sparse vector space, the shopper would typically
either get a No result page, or – as the opposite extreme – receive irrelevant items from an
OR expansion: non-shoes that are purple, shoes that are green9. Under a dense vector space,
retrieval would provide a set of items, but no principled way to cut the set at the right position
(when is a “near” vector not near enough?) or explain its choice. Both are open problems [
        <xref ref-type="bibr" rid="ref11">41</xref>
        ],
and no solution is known, especially given data constraints of the RSc [29].
      </p>
      <p>
        There is another, subtler, way to appreciate the impact of PO on search especially relevant
for SaaS players [
        <xref ref-type="bibr" rid="ref12">42</xref>
        ], whose job is to develop solutions deployed on dozens of independent shops
in several languages and verticals. When you have two shops in the same market (as Shop A
and Shop B below), PO gets you re-usable abstractions. Overlapping parse trees and product
properties can help with cold start scenarios: if a model is matching “Adidas” and “Nike” as
brands with high affinity, it can be ported to a new shop to bootstrap learning (i.e. launching
a new ecommerce without any behavioral data). As an even more extreme form of bootstrapping,
learning can be transferred between (similar) languages when appropriate resources exist: while
VSM models can make good use of multi-language embeddings, the power-law of RSc helps
us here as well, as most retailers would do most business in 1-3 languages.10 Of course,
reimagining search with PO opens up possibilities also outside of the search experience itself: just
to mention two obvious ones, finer-grained analytics (both about queries as expressing shoppers’
intent, and products, as a collection of human-readable properties), and cross-pollination with
data coming in and out of the PIM (Product Information Management).
      </p>
      <p>In this section, we argued that a large portion of the market would benefit from program
synthesis through semantic parsing, if such a system existed. We now show how such a system
can be built.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Building a Semantic Parser</title>
      <p>
        As PO itself has two components – a query and a product parser with a shared domain and
interpretation (in the sense of model theory [
        <xref ref-type="bibr" rid="ref13">43</xref>
        ]) – how do we bootstrap and scale both?
Assuming we use ML to train the parser, the hardest part is obtaining a training set for queries:
while (almost) any untrained human can annotate an ecommerce catalog, producing logical
forms requires a good deal of work by trained linguists. We will therefore break the problem
into pieces, by first assuming we have product representations available to build a training set
for query parsing, and then relaxing this assumption.
      </p>
      <p>
        Fig. 1 showcases the creation of a query dataset for a statistical parser (3)11, starting from
product representations (1) and a small grammar (2): our insight is that, instead of manual
annotation, we can programmatically generate golden triples &lt;query,logical form,SKUs&gt; by
synthesizing jointly queries, their logical form, and the result set, leveraging the isomorphism
9Far from being a theoretical possibility, this is the default experience for all websites using out-of-the-box open source tools like
Elasticsearch, or non-AI SaaS providers.
10Even the fallback strategies mentioned before can be ported: if “sneakers” is fallback for “shoes”, the same strategy
can be applied any time you have “shoes” available in the parse tree.
11The details of the parser are pretty unimportant, as there is substantial evidence that this is a solvable problem
with good enough data [
        <xref ref-type="bibr" rid="ref14">44</xref>
        ].
between product representations and logical forms. Moving the annotation problem away
from logical form helps us leverage further insights on the peculiarities of the RSc. First, it
should be stressed that extracting (most) product features (1, in Fig. 1) is easy: some attributes
come already structured, and statistically accurate labels are easy to obtain thanks to methods
applicable across shops [
        <xref ref-type="bibr" rid="ref15 ref16">45, 46</xref>
        ]. In particular, while recent large language models cannot be
directly used at runtime [
        <xref ref-type="bibr" rid="ref17 ref18">47, 48</xref>
        ], they are ideally suited to be a complementary strategy to more
traditional methods when it comes to entity extraction (or even as an oracle for offline usage
[
        <xref ref-type="bibr" rid="ref19">49</xref>
        ], see the Appendix). Product information is also important for other parts of the business,
which means labeling can piggyback on independently motivated processes (e.g. PIM).12
      </p>
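      <p>The joint generation of golden triples can be sketched as follows: each grammar slot is filled with attribute values found in the catalog, so query, logical form and result set are produced together and are consistent by construction. The toy catalog and the single color-type NP template are illustrative assumptions:</p>

```python
# Toy catalog: product representations with typed attributes (step 1 in Fig. 1).
CATALOG = [
    {"sku": "A1", "color": "purple", "product_type": "shoes"},
    {"sku": "A2", "color": "red", "product_type": "shoes"},
    {"sku": "B1", "color": "purple", "product_type": "bag"},
]

def generate_triples():
    """Generate <query, logical form, SKUs> triples from a one-rule NP grammar: COLOR TYPE."""
    colors = {p["color"] for p in CATALOG}
    types = {p["product_type"] for p in CATALOG}
    triples = []
    for c in sorted(colors):
        for t in sorted(types):
            query = f"{c} {t}"
            logical_form = f"lambda x.[{t}(x) & {c}(x)]"
            skus = [p["sku"] for p in CATALOG
                    if p["color"] == c and p["product_type"] == t]
            triples.append((query, logical_form, skus))
    return triples
```

      <p>Note that over-generation is harmless here: a synthetic query with an empty result set still teaches the parser a valid parse tree.</p>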
      <p>Second, the peculiarities of query distribution simplify the slot filling component (2, in Fig. 1):
even in a SaaS scenario where extreme scalability is paramount, NP queries are easy to generate
and then re-use – the queries “ski trousers”, “running shoes” and “ski gloves” (mentioned in
[35]) share the same logical form. Not only is the grammar simple enough to start with, but since the
final goal is to parse queries through a model trained on these synthetic NPs, we can err on the
side of recall and over-generate (as this will just create training sentences that nobody would use).</p>
      <p>
        Let’s now recap our approach as an actionable list:
1. at indexing time, extract product representations from the catalog to be indexed in a
knowledge base, through heuristics and/or models [
        <xref ref-type="bibr" rid="ref16 ref20">50, 46</xref>
        ];13
2. build a simple NP-focused grammar, to cover a significant part of the distribution. The
process can begin by annotating historical queries with simple logical forms, and then
generalize a grammar to simplify those trees. To give a sense of how this would work, we
selected Shop A and Shop B, multi-brand retailers in the apparel industry with catalog
sizes between 10k and 30k SKUs. We manually annotated historical queries to get a sense
of what grammar captures user behavior. A few hundred parses (respectively, 475 and
459) cover 43% and 25% of the entire query distribution for Shop A and Shop B;
3. use the product representation and the NP-grammar to generate a training set with
synthetic queries and golden parse trees (Fig. 1) – note that it is easy to augment the set
of parsable queries through paraphrases [
        <xref ref-type="bibr" rid="ref14">44</xref>
        ] or prompting [
        <xref ref-type="bibr" rid="ref21">51</xref>
        ];
4. train a standard parsing model [
        <xref ref-type="bibr" rid="ref22 ref23">52, 53</xref>
        ] on this dataset;
5. at runtime, use the parsing model on an incoming query, get the logical form and map
it to an executable code for the target knowledge base: retrieve the products, execute
fallback strategies if relevant.
      </p>
      <p>This strategy has consequences for two important pieces of the search experience, re-ranking
and type-ahead suggestions. Re-ranking in VSM is often needed to hide poor results, and may
even conflict with relevance objectives: e.g., popular products may sometimes outrank others
irrespective of query intent. A structured approach to retrieval allows ranking to be mostly
about personalization: given a relevant result set, which of the following “purple shoes” is
best for this shopper (based on several real-time and historical ranking signals)? Conversely,
ranking rules – both manual and learned – can be applied on a ceteris paribus level: only if
two items are equally relevant can popularity influence their ranking. Query suggestions are
12Product labeling can also be outsourced with no privacy concerns.
13We refer the readers to the Appendix for more details.
known to be important for a good search UX [32]: synthetic queries (Fig. 1) could be used to
suggest new and cold query types, as well as familiarize shoppers with the capabilities of the
parser; for example, suggesting “blue shoes under 100 USD” would gradually educate users in
using the search bar better.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Limitations and answers to common concerns</title>
      <sec id="sec-6-1">
        <title>5.1. Vectors strike back</title>
        <p>
          The explosion of NLP capabilities in recent years has established beyond any reasonable doubt
the virtues of distributional semantics [
          <xref ref-type="bibr" rid="ref24">54</xref>
          ]: it may therefore seem strange to defend program
synthesis for IR use cases. The quality of the vectorized representations for queries and products
increased dramatically (including exciting possibilities such as multi-modal understanding
[
          <xref ref-type="bibr" rid="ref25">55</xref>
          ]), but the problem with the VSM is still present even in the most sophisticated retailers: as
we observe in the result set in Fig. 3, the query “nintendo switch” is retrieving pens. While it
would be tempting to dismiss this as an artefact or an anecdote, it is on the contrary an essential
component of VSM: if relevance is distance in a vector space, there is no cut-off establishing
when far is too far. If we compare the result set to the typical response we would get from a
human assistant14, it is clear that the shared meaning of “nintendo switch” is very different. For
almost-web-scale catalogs, vector search is pragmatically an effective strategy, as the “very
close” products for most queries are enough to fill the first few result pages; for smaller catalogs,
however, the perceived relevance may quickly degrade and the VSM approach has no principled
countermeasure.
        </p>
        <p>As we discuss below, better vector representations are an essential component of any search
engine, and NLP breakthroughs are a welcome addition to the toolkit of any shop. However,
treating relevance solely as a distance calculation is an approximation, and should be recognized
as such: when we switch our attention from lexically-driven to compositionally-driven use
cases, how much value can we now unlock?</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Parsing vs rewriting</title>
        <p>Parsing is hardly the only query processing technique available to RSc shops: for example,
query rewriting is a popular approach to bridge the gap between the user’s intent (“red Nike
sneakers”) and inventory (burgundy Adidas shoes). However, it is important to realize that
the concerns of parsing and rewriting modules are distinct, and possibly complementary: you
can rewrite “sneakers” into “shoes” before parsing it into an object type, but rewriting by itself
does not challenge the fundamental assumption of VSM – effective rewriting may improve
recall, but does not unlock any of the relevance benefits that parsing provides (Section 3). From
an engineering perspective, it’s easy to see how a rewriting module could leave completely
14Anecdotally, note that ChatGPT’s response to the prompt “You are a shopper assistant at Best Buy, the famous
electronic retailer. You work in the video-game section. A shopper comes to you and ask for nintendo switch: what
product do you think she wants to buy?” is “If a shopper comes asking for a Nintendo Switch, it’s most likely that
they are referring to the Nintendo Switch console itself”.
untouched the retrieval machinery of VSM, while parsing requires re-thinking the strategy
entirely.</p>
        <p>
          Moreover, a crucial component of our proposal is the “zero-shot” adaptation obtained through
the loose isomorphism between products in a graph and grammars: since parsing is built through
product understanding, not explicit or implicit behavioral supervision, its sample efficiency
makes it ideal for RSc shops and horizontal scalability (see below); on the other side, modern
NLP-based rewriting through behavioral supervision [
          <xref ref-type="bibr" rid="ref26">56</xref>
          ] is better suited for big retailers.15
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Vertical vs horizontal scaling</title>
        <p>When thinking about “scalable” engineering, we think of diminishing marginal effort as we
“scale” along an important dimension. Since most IR is done at Big Tech scale, the implicit
notion of scalability is the B2C one: as a target shop grows in inventory and traffic, the long
tail of queries will expand and rare events become more important (Section 2). In this regime,
data-driven approaches are scalable: the more traffic, the more data, so statistical generalization
is a promising path to diminishing marginal effort – how hard is it to satisfy this shopper’s intent,
given we have already seen k million of them?</p>
        <p>As we hinted in this work, there is another concept of scalability in IR, which becomes evident
in B2B scenarios: if our system is used across multiple RSc shops, the marginal cost that will
dominate the business is deployment cost – how hard is it to get a new shop online, given we have
already put k online? The synthesis approach we championed has been developed mainly targeting
this second notion: if the marginal cost of tagging catalogs is diminishing (see the Appendix), the
cost of understanding queries on newer shops diminishes as well, irrespective of how much traffic
they get. While emphasis has been put on synthesis as the actual implementation mechanism
for our strategy, the broader, and perhaps novel insight, is that query performance is (in certain
cases) a by-product of product understanding and linguistic knowledge, both of which are more
scalable than practitioners typically realize.</p>
      </sec>
      <sec id="sec-6-4">
        <title>5.4. Parser fragility</title>
        <p>
          A critical point that has not been addressed is the “fragility” of parsing-first strategies: since no
parsing model would be perfect, what should we do when it fails? In our experience, the most
natural architecture is a two-tier system, such that, if parsing or program execution fail, the
system would resort to a traditional VSM strategy (e.g. a sparse / dense vector-based retrieval).
Considering the speed of an ML parser, we pay a tiny latency tax for the above-mentioned
benefits. When it comes to deployment, our recommendation is to use program synthesis on top
of a basic VSM retrieval, not as a replacement; philosophically however, our position remains
that VSM is an approximation to relevance, and should be treated as such.
15Five years after the deployment of the system in this paper, it is telling that leading tech retailers are starting to
use a product graph for rewriting as well [
          <xref ref-type="bibr" rid="ref27">57</xref>
          ].
        </p>
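          <p>The two-tier system described above can be sketched as a simple guard around the parsing path; parse_and_execute and vsm_retrieve below are hypothetical stand-ins for the real modules:</p>

```python
def search(query, parse_and_execute, vsm_retrieve):
    """Two-tier retrieval: try the parsing-first path; on any failure (no parse,
    empty result set, execution error) fall back to a traditional VSM retrieval."""
    try:
        results = parse_and_execute(query)
        if results:
            return results, "parser"
    except Exception:
        pass  # parsing or program execution failed: degrade gracefully
    return vsm_retrieve(query), "vsm"

# Illustrative stand-ins for the real modules.
hits, tier = search("purple shoes",
                    parse_and_execute=lambda q: [],       # parser finds no match
                    vsm_retrieve=lambda q: ["sku-42"])    # VSM backstop
```

          <p>Since the parser runs first and is fast, the fallback only adds latency on the (hopefully rare) failure path.</p>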
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>Motivated by query distributions and industry constraints, we argued that program synthesis
(through semantic parsing) is a feasible path for a better search experience at RSc, compared to
VSM alone as a relevance model. We showed that the usual worries associated with explicit
meaning representations are unwarranted, and maintained that the key insight to a novel view
on search is the “isomorphic” structure of (parsed) queries and product structure.</p>
      <p>The representation dichotomy between explicit-but-annotation-heavy representations and the
approximate-but-fully-learnable VSM is indeed a false one, and we sketched how a tiny initial linguistic
structure can help bootstrap a large-scale parsing system. We are confident, through 6 years of experience,
deployments and publications that RSc shops can benefit from it, and we hope this paper will
start a discussion with participants coming from different backgrounds. While this work hardly
constitutes the last word on the topic, it is hopefully a first step in leading the field away from
local optima, and embracing the peculiarities and opportunities of product search.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The dry prose of a scholarly paper cannot do justice to the adventure that is building an
early-stage startup: this paper would not have been possible without Tooso, the company pioneering
search-as-parsing at scale back in 2018-2019. We wish to thank first Mattia, Luca, Andrea,
Alessia, and then everybody else involved in that clumsy, special company: a challenge we were
willing to accept, one we were unwilling to postpone, and one we intended to win.</p>
      <p>Furthermore, we wish to thank Tracy Holloway King, Federico Bianchi, Patrick John Chia
and two anonymous reviewers for useful comments on a previous version of this paper.</p>
      <p>[6] G. Salton, A. Wong, C. S. Yang, A vector space model for automatic indexing, Commun.
ACM 18 (1975) 613–620. URL: https://doi.org/10.1145/361219.361220. doi:10.1145/361219.361220.
[7] B. Mitra, N. Craswell, An introduction to neural information retrieval, Foundations and
Trends® in Information Retrieval 13 (2018) 1–126. URL: https://www.microsoft.com/en-us/
research/publication/introduction-neural-information-retrieval/.
[8] N. Choudhary, N. Rao, S. Katariya, K. Subbian, C. K. Reddy, Anthem: Attentive hyperbolic
entity model for product search, in: WSDM ’22: The Fifteenth ACM International
Conference on Web Search and Data Mining, Phoenix, AZ, USA, February 21-25, 2022, WSDM
’22, Association for Computing Machinery, New York, NY, USA, 2022.
[9] C. Pei, Y. Zhang, Y. Zhang, F. Sun, X. Lin, H. Sun, J. Wu, P. Jiang, W. Ou, D. Pei, Personalized
context-aware re-ranking for e-commerce recommender systems, ArXiv abs/1904.06813
(2019).
[10] R. Li, Y. Jiang, W. Yang, G. Tang, S. Wang, C. Ma, W. He, X. Xiong, Y. Xiao, Y. E. Zhao,
From semantic retrieval to pairwise ranking: Applying deep learning in e-commerce
search, Proceedings of the 42nd International ACM SIGIR Conference on Research and
Development in Information Retrieval (2019).
[11] D. Gillick, A. Presta, G. S. Tomar, End-to-end retrieval in continuous space, arXiv preprint
arXiv:1811.08008 (2018).
[12] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Y. Wu, S. Edunov, D. Chen, W. tau Yih, Dense
passage retrieval for open-domain question answering, ArXiv abs/2004.04906 (2020).
[13] Y. Tay, V. Q. Tran, M. Dehghani, J. Ni, D. Bahri, H. Mehta, Z. Qin, K. Hui, Z. Zhao, J. Gupta,
T. Schuster, W. W. Cohen, D. Metzler, Transformer memory as a diferentiable search index,
2022. arXiv:2202.06991.
[14] R. Nogueira, W. Yang, K. Cho, J. J. Lin, Multi-stage document ranking with bert, ArXiv
abs/1910.14424 (2019).
[15] Y. Yan, Z. Liu, M. Zhao, W. Guo, W. P. Yan, Y. Bao, A practical deep online ranking system
in e-commerce recommendation, in: ECML/PKDD, 2018.
[16] D. Sorokina, E. Cantu-Paz, Amazon search: The joy of ranking products, in:
Proceedings of the 39th International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’16, Association for Computing Machinery,
New York, NY, USA, 2016, p. 459–460. URL: https://doi.org/10.1145/2911451.2926725.
doi:10.1145/2911451.2926725.
[17] D. Basin, Y. Deville, P. Flener, A. Hamfelt, J. Nilsson, Synthesis of programs in computational
logic, volume 3049, 2004, pp. 30–65. doi:10.1007/978- 3- 540- 25951- 0_2.
[18] J. Tagliabue, You do not need a bigger boat: Recommendations at reasonable scale in
a (mostly) serverless and open stack, in: Proceedings of the 15th ACM Conference on
Recommender Systems, RecSys ’21, Association for Computing Machinery, New York,
NY, USA, 2021, p. 598–600. URL: https://doi.org/10.1145/3460231.3474604. doi:10.1145/
3460231.3474604.
[19] Techcrunch, Search api startup algolia raises $150 million at $2.25
billion valuation, 2021. URL: https://techcrunch.com/2021/07/28/
search-api-startup-algolia-raises-150-million-at-2-25-billion-valuation/.
[20] Bloomreach, With $175 million in funding, bloomreach is authoring the next
chapter of e-commerce, 2022. URL: https://www.bloomreach.com/en/blog/2022/
with-usd175-million-in-funding-bloomreach-is-authoring-the-next-chapter-of-e-commerce.
[21] S. Marotta, Canada’s latest tech public debut swings amid soft
ipos, 2021. URL: https://www.bloomberg.com/news/articles/2021-11-25/
canada-s-latest-tech-public-debut-swings-amid-slew-of-soft-ipos.
[22] B. Requena, G. Cassani, J. Tagliabue, C. Greco, L. Lacasa, Shopper intent prediction from
clickstream e-commerce data with minimal browsing information, Scientific Reports 10
(2020) 2045–2322. doi:10.1038/s41598- 020- 73622- y.
[23] F. Bianchi, J. Tagliabue, B. Yu, L. Bigon, C. Greco, Fantastic embeddings and how to align
them: Zero-shot inference in a multi-shop scenario, in: Proceedings of the SIGIR 2020
eCom workshop, July 2020, Virtual Event, published at http://ceur-ws.org (to appear), 2020.</p>
      <p>URL: https://arxiv.org/abs/2007.14906.
[24] J. Tagliabue, C. Greco, J.-F. Roy, F. Bianchi, G. Cassani, B. Yu, P. J. Chia, Sigir 2021
e-commerce workshop data challenge, in: SIGIR eCom 2021, 2021.
[25] M. Tsagkias, T. H. King, S. Kallumadi, V. Murdock, M. de Rijke, Challenges and research
opportunities in ecommerce search and recommendations, in: SIGIR Forum, volume 54,
2020.
[26] E. Brenner, J. Zhao, A. Kutiyanawala, Z. Yan, End-to-end neural ranking for
ecommerce product search: an application of task models and textual embeddings, ArXiv
abs/1806.07296 (2018).
[27] J. Tagliabue, Mlops without much ops, 2022. URL: https://towardsdatascience.com/
mlops-without-much-ops-d17f502f76e8.
[28] M. Eric, Mlops is a mess but that’s to be expected, 2022. URL: https://www.mihaileric.com/
posts/mlops-is-a-mess/.
[29] J. Tagliabue, H. Bowne-Anderson, V. Tuulos, S. Goyal, R. Cledat, D. Berg, Reasonable scale
machine learning with open-source metaflow, 2023. arXiv:2303.11761.
[30] P. Molino, C. Ré, Declarative machine learning systems: The future of machine learning
will depend on it being in the hands of the rest of us., Queue 19 (2021) 46–76. URL:
https://doi.org/10.1145/3475965.3479315. doi:10.1145/3475965.3479315.
[31] D. Berg, R. K. Chirravuri, R. Cledat, S. Goyal, F. Hamad, V. Tuulos, Open-sourcing metaflow,
a human-centric framework for data science, 2019. URL: https://netflixtechblog.com/
open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9.
[32] J. Tagliabue, B. Yu, M. Beaulieu, How to grow a (product) tree: Personalized category
suggestions for eCommerce type-ahead, in: Proceedings of The 3rd Workshop on
eCommerce and NLP, Association for Computational Linguistics, Seattle, WA, USA, 2020, pp.
7–18. URL: https://aclanthology.org/2020.ecnlp-1.2. doi:10.18653/v1/2020.ecnlp- 1.2.
[33] T. H. King, White Roses, Red Backgrounds: Bringing Structured Representations to Search,
Springer International Publishing, Cham, 2023, pp. 191–215. URL: https://doi.org/10.1007/
978-3-031-21780-7_9. doi:10.1007/978- 3- 031- 21780- 7_9.
[34] A. Schade, J. Nielsen, Ecommerce User Experience Vol. 05: Search., 2022. URL: https:
//www.nngroup.com/reports/ecommerce-ux-search-including-faceted-search/.
[35] B. Yu, J. Tagliabue, C. Greco, F. Bianchi, “an image is worth a thousand features”:
Scal</p>
    </sec>
    <sec id="sec-9">
      <title>A. Implementation notes</title>
      <p>As the novelty of our proposal does not lie in new classifiers or NLP pipelines, we briefly
expand here on the implementation strategies sketched in Section 4. We count it as a strength of the
approach that tried-and-tested, off-the-shelf techniques can be successfully used to start:
any improvement to the methods below will make the parser even better.
Once basic tags (COLOR, BRAND, etc.) are defined as the building blocks of the knowledge
base and the logical forms, we need to know where and how each of these attributes can be
found starting from the product catalogs. We favor a declarative approach, where tags are
associated with strategies that are executed in series when parsing the catalog. For example,
Table 1 shows how three tags can be extracted from Shop A.</p>
      <p>We first have a configuration strategy, which simply points to the column in the catalog that
contains the attribute (typical for brands, prices, etc.); this leverages the structured nature of
catalogs, which is a huge simplifying factor when considering product search vis-à-vis web
search. We then have a model strategy, which relies on machine learning to accomplish tagging;
finally, we have a heuristic strategy, which builds on domain knowledge and catalog specifics.</p>
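      <p>To make the declarative approach concrete, the tag-to-strategy mapping of Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not Tooso's implementation: all names (e.g. parse_catalog_row, TAG_STRATEGIES) and the toy color heuristic are our own assumptions.

```python
from typing import Callable, Optional

# A strategy is any callable mapping a catalog row to a tag value (or None).
Strategy = Callable[[dict], Optional[str]]

def configuration(column: str) -> Strategy:
    """Configuration strategy: read the attribute straight from a catalog column."""
    return lambda row: row.get(column)

KNOWN_COLORS = {"red", "blue", "black", "white"}

def color_heuristic(row: dict) -> Optional[str]:
    """Heuristic strategy: recover a color from the product name, if any."""
    return next((w for w in row.get("name", "").lower().split()
                 if w in KNOWN_COLORS), None)

# Declarative, per-shop mapping: tag -> ordered list of strategies
# (a model strategy would slot in here as just another callable).
TAG_STRATEGIES: dict = {
    "BRAND": [configuration("brand")],
    "COLOR": [configuration("color"), color_heuristic],
}

def parse_catalog_row(row: dict) -> dict:
    """Execute the strategies in series; the first non-empty answer wins."""
    tags = {}
    for tag, strategies in TAG_STRATEGIES.items():
        for strategy in strategies:
            value = strategy(row)
            if value is not None:
                tags[tag] = value
                break
    return tags
```

Because every strategy shares the same callable interface, a trained classifier wraps into the series exactly like a configuration or a heuristic, which is what keeps the mapping declarative.</p>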
      <p>
        When discussing how to scale B2B product search across deployments, it is important to realize that
different strategies have different levels of granularity. Configurations are set per shop and
they are deterministic; models can typically be trained across shops (for entire industries, for
example) and can leverage the latest zero-shot classifiers in case no labels are wanted or needed
[
        <xref ref-type="bibr" rid="ref16">46</xref>
        ]; heuristics are more case-specific, but in our experience they retain some degree of re-use:
moreover, heuristics can be used to train new classifiers (using, for example, weak supervision
[
        <xref ref-type="bibr" rid="ref20">50</xref>
        ]), which will in turn reduce the reliance on heuristics.
      </p>
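      <p>As a minimal sketch of the heuristics-to-classifier path, the snippet below mimics the weak-supervision idea of [50]: heuristics act as labeling functions whose majority-voted outputs become training pairs for a future model. The function names, the toy voting scheme, and the MATERIAL heuristics are illustrative assumptions on our part.

```python
from collections import Counter

def label_with_heuristics(text, labeling_functions):
    """Apply each heuristic; None means 'abstain'; majority vote decides."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

# Two toy heuristics for a MATERIAL tag.
def lf_leather(text):
    return "leather" if "leather" in text.lower() else None

def lf_cotton(text):
    return "cotton" if "cotton" in text.lower() else None

# Weakly labeled pairs that could later train a classifier, reducing
# the runtime reliance on the heuristics themselves.
catalog = ["Leather ankle boots", "Cotton crew t-shirt", "Steel watch"]
training_data = [
    (text, label_with_heuristics(text, [lf_leather, lf_cotton]))
    for text in catalog
]
```

Items on which every heuristic abstains (the None labels) are simply dropped before training, so noisy-but-cheap rules bootstrap a model without any manual annotation.</p>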
      <p>
        Importantly, the very recent progress on large language models promises to greatly simplify
the actual building of a structured knowledge representation, offering even zero-shot graph
building from text [
        <xref ref-type="bibr" rid="ref28">58</xref>
        ]. While LLMs are still too slow, and not yet understood well enough, to
be directly involved in the runtime query path, they are definitely well suited to speed up the
offline component of our method (Fig. 1, sections 1 and 2 from the left).
      </p>
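      <p>As an illustration of how an LLM could slot into the offline component only, the sketch below performs zero-shot attribute extraction from a catalog description. Here llm_complete is a placeholder for any completion API (a stub stands in so the sketch is self-contained), and the prompt wording is our assumption.

```python
import json

# Hypothetical prompt template; the exact wording is an assumption.
PROMPT = (
    "Extract BRAND and COLOR from this product description and "
    "answer with a JSON object only:\n{description}"
)

def extract_attributes(description, llm_complete):
    """Run one offline extraction call and parse the JSON answer."""
    answer = llm_complete(PROMPT.format(description=description))
    return json.loads(answer)

# In the offline pipeline, llm_complete would wrap a real completion API;
# this stub keeps the sketch runnable.
def stub_llm(prompt):
    return '{"BRAND": "Acme", "COLOR": "red"}'

catalog_entry = "Acme red running sneakers with mesh upper"
attributes = extract_attributes(catalog_entry, stub_llm)
```

Since these calls happen at catalog-indexing time rather than on the query path, latency and occasional malformed answers are tolerable: failed parses can simply fall back to the strategies above.</p>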
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cramer-Flood</surname>
          </string-name>
          ,
          <source>Global Ecommerce</source>
          <year>2020</year>
          .
          <article-title>Ecommerce Decelerates amid Global Retail Contraction but Remains a Bright Spot</article-title>
          .,
          <year>2020</year>
          . URL: https://www.emarketer.com/content/global-ecommerce-2020.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. Narayanan.R</surname>
          </string-name>
          ,
          <article-title>Model-agnostic vs. model-intrinsic interpretability for explainable product search</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '21,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>5</fpage>
          -
          <lpage>15</lpage>
          . URL: https://doi.org/10.1145/3459637.3482276. doi:
          <volume>10</volume>
          .1145/ 3459637.3482276.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Commerce</surname>
          </string-name>
          ,
          <article-title>How Ecommerce Site Search Can Create a Competitive Advantage</article-title>
          .,
          <year>2021</year>
          . URL: https://www.bigcommerce.com/articles/ecommerce/site-search/#the-effectiveness-of-ecommerce-site-search-.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Alaimo</surname>
          </string-name>
          ,
          <volume>87</volume>
          %
          <article-title>of shoppers now begin product searches online</article-title>
          .,
          <year>2018</year>
          . URL: https://www.retaildive.com/news/87-of-shoppers-now-begin-product-searches-online/530139/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Compton</surname>
          </string-name>
          ,
          <article-title>Searching For ROI In Retail: The Time For A New Site Search Tool Is Now</article-title>
          ,
          <year>2021</year>
          . URL: https://www.forrester.com/blogs/searching-for-roi-in-retail-the-time-for-a-new-site-search-tool-is-now/?categoryid=a89c0000000AKp1AAG.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <article-title>Language in a (search) box: Grounding language learning in real-world human-machine interaction</article-title>
          ,
          <source>in: Proceedings of the</source>
          <year>2021</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>4409</fpage>
          -
          <lpage>4415</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.348. doi:10.18653/v1/2021.naacl-main.348.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Data recombination for neural semantic parsing</article-title>
          ,
          <source>ArXiv abs/1606</source>
          .03622 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Yu,</surname>
          </string-name>
          <article-title>Query2Prod2Vec: Grounded word embeddings for eCommerce, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>154</fpage>
          -
          <lpage>162</lpage>
          . URL: https://aclanthology.org/2021.naacl-industry.20. doi:10.18653/v1/2021.naacl-industry.20.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Improving text-to-sql with schema dependency learning</article-title>
          ,
          <source>ArXiv abs/2103</source>
          .04399 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Explainable recommendation: A survey and new perspectives</article-title>
          ,
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>14</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomkins</surname>
          </string-name>
          , Choppy:
          <article-title>Cut transformer for ranked list truncation</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '20,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>1513</fpage>
          -
          <lpage>1516</lpage>
          . URL: https://doi.org/10.1145/3397271. 3401188. doi:
          <volume>10</volume>
          .1145/3397271.3401188.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          , Applied Research at Reasonable Scale, https://medium.com/the-techlife/applied-research-at-reasonable-scale-8a74d2beed89
          ,
          <year>2022</year>
          . [Online; accessed 19-Feb-2023].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hodges</surname>
          </string-name>
          , Model Theory, in: E. N.
          <string-name>
            <surname>Zalta</surname>
          </string-name>
          (Ed.),
          <source>The Stanford Encyclopedia of Philosophy</source>
          , Spring 2022 ed., Metaphysics Research Lab, Stanford University,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Building a semantic parser overnight</article-title>
          ,
          <source>in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Beijing, China,
          <year>2015</year>
          , pp.
          <fpage>1332</fpage>
          -
          <lpage>1342</lpage>
          . URL: https://aclanthology.org/P15-1129. doi:
          <volume>10</volume>
          .3115/v1/
          <fpage>P15</fpage>
          - 1129.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Karnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jhala</surname>
          </string-name>
          ,
          <article-title>Product classification in e-commerce using distributional semantics</article-title>
          ,
          <source>in: COLING</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Terragni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Magalhães</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <article-title>Contrastive language and vision learning of general fashion concepts</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>12</volume>
          (
          <year>2022</year>
          ).
          <source>doi:10.1038/s41598-022-23052-9.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , P. Liang,
          <article-title>Evaluating verifiability in generative search engines</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>A.</given-names>
            <surname>Drozdov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scharli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Akyuurek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Compositional semantic parsing with large language models</article-title>
          ,
          <source>ArXiv abs/2209</source>
          .15003 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Ratner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Ehrenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Fries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          , Snorkel:
          <article-title>Rapid training data creation with weak supervision</article-title>
          ,
          <source>Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases 11</source>
          <volume>3</volume>
          (
          <year>2017</year>
          )
          <fpage>269</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soltan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hamza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Safari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Damonte</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Groves</surname>
          </string-name>
          , Clasp:
          <article-title>Few-shot cross-lingual data augmentation for semantic parsing</article-title>
          ,
          <source>in: AACL-IJCNLP</source>
          <year>2022</year>
          ,
          <year>2022</year>
          . URL: https://www.amazon.science/publications/clasp-few-shot-cross-lingual-data-augmentation-for-semantic-parsing.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Laferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          ,
          <source>in: Proceedings of the Eighteenth International Conference on Machine Learning</source>
          , ICML '
          <fpage>01</fpage>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2001</year>
          , p.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Joint extraction of entities and relations based on a novel tagging scheme</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>1227</fpage>
          -
          <lpage>1236</lpage>
          . URL: https://aclanthology.org/P17-1113. doi: 10.18653/v1/P17-1113.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <article-title>Word meaning in minds and machines</article-title>
          ,
          <source>Psychological Review</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Chia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          <article-title>“Does it come in black?” CLIP-like models are zero-shot recommenders</article-title>
          ,
          <source>in: Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          . URL: https://aclanthology.org/2022.ecnlp-1.22. doi: 10.18653/v1/2022.ecnlp-1.22.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goutam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>QUEEN: Neural query rewriting in e-commerce</article-title>
          ,
          <source>in: The Web Conference 2021</source>
          ,
          <year>2021</year>
          . URL: https://www.amazon.science/publications/queen-neural-query-rewriting-in-e-commerce.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>S.</given-names>
            <surname>Farzana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ristoski</surname>
          </string-name>
          ,
          <article-title>Knowledge graph-enhanced neural query rewriting</article-title>
          ,
          <source>in: Companion Proceedings of the ACM Web Conference 2023</source>
          , WWW '23 Companion, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>911</fpage>
          -
          <lpage>919</lpage>
          . URL: https://doi.org/10.1145/3543873.3587678. doi: 10.1145/3543873.3587678.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>V.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          , GraphGPT, https://github.com/varunshenoy/graphgpt,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>