<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>merce Taxonomies⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jetlir Duraj</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ishita Khan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kilian Merkelbach</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehran Elyasi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehran Elyasi</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>eBay Inc</institution>, <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Query Categorization</kwd>
        <kwd>E-commerce Search</kwd>
        <kwd>Taxonomies</kwd>
        <kwd>Chain-of-Thought Reasoning</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Search in e-Commerce is powered at its core by a structured representation of the inventory, often formulated as a category taxonomy. An important capability in e-Commerce with hierarchical taxonomies is to select a set of relevant leaf categories that are semantically aligned with a given user query. In this scope, we address a fundamental problem of search query categorization in real-world e-Commerce taxonomies. A correct categorization of a query not only provides a way to zoom into the correct inventory space, but also opens the door to multiple intent-understanding capabilities for a query. A practical and accurate solution to this problem has many applications in e-Commerce, including constraining retrieved items and improving the relevance of search results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Mapping user queries to relevant categories is essential in e-Commerce search and navigation, since
it enhances search relevance, user navigation, and inventory targeting. Traditionally, demand-based
methods leveraging user behavioral data like click-through rates have been researched in industry
and academic literature. However, these methods face issues like presentation bias and signal sparsity,
particularly with long-tail queries and new inventory (e.g., Joachims et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Xv et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
      <p>
        Recent semantic-based methods offer promising solutions by using linguistic and contextual
understanding to infer relevance, addressing sparsity and bias while generalizing to new scenarios. These
methods integrate query semantics with taxonomies for more precise mappings, but often lack
task-specific adaptations and focus on static representations (e.g., Dehghani et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Gao et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>
        We propose a novel semantic projection system that complements demand-based methods. We adapt
the chain-of-thought (CoT) reasoning paradigm for large language models (LLM) (see Wei et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) to
our specific problem of classification in a hierarchical taxonomy. Given a query, our system navigates
from root to leaf categories of the taxonomy, integrating query semantics and taxonomy details to
create precise, interpretable mappings. Our approach is orthogonal to demand-based methods and aims
to enrich and de-bias them.
      </p>
      <p>Similar in spirit to chain-of-thought (CoT) reasoning where an LLM solves a complex task by breaking
it down into smaller steps and tasks, the system we build operates through a structured, multi-step
process towards the solution. It iteratively predicts ranked categories at each taxonomy level, moving
from root to leaf categories. Our model dynamically adjusts prediction thresholds based on the semantic
information of the current category node, its children, and the query semantics.
⋆ Accepted for oral presentation at SIGIR eCom 2025.</p>
      <p>A key feature of our approach is its ability to specify context for the query — such as user intent of
buying, browsing, or accessory/complementary intents. Context specification enables more targeted and
contextually appropriate category mapping. Our model provides confidence scores for each prediction,
thus ranking categories with the aim of offering actionable insights and interpretability. Additionally, it
can serve as a diagnostic tool for refining and improving taxonomies, addressing structural noise often
misaligned with buyer signals.</p>
      <p>[Figure 1: System overview. (1) Category tree construction from taxonomy breadcrumbs and category descriptions; (2) CoT search for a search query (e.g., "4k drone gps") with an optional intent context (Buy, Browse, Accessories, Complementary, N/A); (3) taxonomy diagnostics and refinements.]</p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>Recommendation systems and personalized search in e-Commerce heavily depend on understanding
user interaction data semantically. Two main approaches exist in the literature: demand-based methods,
using behavioral data, and semantic methods, enhancing recommendations through query and content
understanding.</p>
        <p>
          <bold>Demand-based approaches to category prediction.</bold> These approaches use implicit feedback, like
click-through data, to personalize recommendations. Ai et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] introduced a model using local contexts
for improved ranking, while Joachims et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] highlighted click-through data’s utility despite biases.
Xv et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] used graph neural networks to better handle long-tail queries and cold-start products, thus
tackling the persisting challenge of data sparsity.
        </p>
        <p>
          <bold>Semantic-based methods and query understanding.</bold> The focus here is on understanding query
intent and matching it with relevant content. Guo et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] proposed a deep relevance matching model,
and Mitra and Craswell [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] introduced semantic embeddings for aligning queries with documents. These
methods help mitigate bias and sparsity issues. Dehghani et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] demonstrated weak supervision’s
effectiveness in sparse datasets, while Gao et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] explored semantic generalization in taxonomies.
        </p>
        <p>
          <bold>LLMs.</bold> LLMs are good at generalizing to unseen scenarios, which is a crucial capability for handling
long-tail queries in e-Commerce. Brown et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] showed LLMs’ strengths in few-shot learning.
Chain-of-thought reasoning (Wei et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], Kojima et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]) enhances hierarchical reasoning and
predictions.
        </p>
        <p>
          <bold>Taxonomy integration.</bold> Taxonomies offer a structured basis for improving category projections
and recommendations. Huang et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] showed how to integrate semantic signals and taxonomies in
e-Commerce search.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. High level overview</title>
        <p>The task of identifying semantically relevant leaf categories is a multi-label classification problem given
input data. The labels correspond to the leaf categories of a tree-structured taxonomy. This makes tree
path-finding algorithms a natural choice to study. Additionally, assessing the strength of the semantic
relationship between a query and leaf categories is crucial for applications. Therefore, we integrate
straightforward tree search methods with LLM-scoring for semantic relevance. In LLM-scoring, the
LLM is asked to provide a score measuring the strength of the semantic relation between a query and
an e-commerce category and its description. LLM-scoring for semantic relevance closely approximates
human judgment in our task, as demonstrated in Table 1.</p>
        <p>This table presents classification metrics for the semantic relevance of 4,897 query-leaf category pairs,
where human judges determine the ground truth. The LLMs we evaluated perform well in terms of F1,
precision, and recall. Mixtral-8x7B, a mixture-of-experts model that can be hosted locally, ranks third
in terms of F1 but has significantly higher inference speed than the other
two models considered.1 We use Mixtral-8x7B for the results of our methodology in this paper.</p>
        <p>Another reason for favoring LLM-scoring for semantic relevance over more generative LLM
approaches is the issue of instruction-following. Such issues cannot be entirely eliminated due to the
generative nature of current LLM architectures. When prompted to select directly from available
children in a category node, LLMs sometimes modify category names rather than reproduce the exact
names from inputs. This issue is less prevalent in closed-source LLMs like those from OpenAI or Gemini
(4% failure rate in our experiments), but is more pronounced in open-source models from Hugging Face,
which can be hosted locally and support large-scale inference.</p>
        <p>For these reasons, we focused on a scoring approach for the LLM component of our method: we ask
the LLM for a confidence score for semantic relevance, ranging from 1 (lowest) to 10 (highest). The
decision to continue searching at each category node is based on the semantic scores of its children,
along with other contextual, or algorithm-runtime information.</p>
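<p>To make the scoring interface concrete, the following is a minimal sketch of LLM-scoring for semantic relevance. The prompt wording, the llm callable, and the parsing fallback are illustrative assumptions, not the production prompt.</p>

```python
def build_scoring_prompt(query: str, category_path: str, category_desc: str) -> str:
    """Assemble a prompt asking the LLM for a 1-10 semantic-relevance score."""
    return (
        "Given an eBay user query and a candidate category, rate the semantic "
        "relevance of the category for the query on a scale from 1 (lowest) "
        "to 10 (highest). Answer with a single integer.\n"
        f"Query: {query}\n"
        f"Category path: {category_path}\n"
        f"Category description: {category_desc}\n"
        "Score:"
    )

def score_category(llm, query, category_path, category_desc) -> int:
    """Call the LLM and clamp the parsed score into the valid 1..10 range."""
    raw = llm(build_scoring_prompt(query, category_path, category_desc))
    digits = "".join(ch for ch in raw if ch.isdigit())
    score = int(digits) if digits else 1  # fall back to the lowest score on parse failure
    return max(1, min(10, score))
```

<p>Asking only for a number, rather than a generated category name, is what sidesteps the instruction-following failures described above.</p>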
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Breadth first search design (CoT BFS)</title>
        <p>Interpreting the categorization task as one of regression (score assignment) after classification, our
approach first solves the classification problem fully, in terms of relevant leaf categories, before addressing
the regression problem. Specifically, for a given query, at level 1 of the taxonomy, after scoring the
semantic relevance of all level-1 categories, we retain only the relatively most semantically relevant
children, pruning the rest. We use two query-dependent thresholds for selection: a selection-threshold
and a minimum-threshold, both ranging from 1 to 10. The relevance scores (1 to 10) of category children
are mapped to the standard normal distribution. The selection-threshold divided by 10 is applied to
the standardized scores to prune less relevant children. For example, with a selection-threshold of 9
(out of 10), children scoring below the mean plus 0.9 times the standard deviation of semantic scores
are pruned. To deal with potential high skewness of the score distribution at the lower end, a child’s
original semantic score must also exceed the minimum-threshold (the second threshold) to survive for
further exploration.</p>
        <p>1. The experiments took less than two hours to run within eBay’s infrastructure. OpenAI-GPT-4o-Mini is closed source, while
Llama3-70B is internally hosted.</p>
        <p>[Figure 2 (excerpt): child category ratings along the path Musical Instruments &amp; Gear &gt; Guitars &amp; Basses &gt; ...]</p>
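<p>The relative thresholding rule described above can be sketched as follows; the function name and the dictionary-of-scores interface are illustrative assumptions.</p>

```python
from statistics import mean, pstdev

def prune_children(scores, selection_threshold=9, minimum_threshold=8):
    """Keep children whose standardized score clears selection_threshold/10
    standard deviations above the mean, and whose raw 1-10 score also
    exceeds minimum_threshold."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    survivors = []
    for child, s in scores.items():
        # relative criterion: s must be at least mu + (threshold/10) * sigma
        relative_ok = (s - mu) >= (selection_threshold / 10.0) * sigma
        if relative_ok and s > minimum_threshold:
            survivors.append(child)
    return survivors
```

<p>With a selection-threshold of 9, a child survives only if it scores at least the mean plus 0.9 standard deviations and its raw score exceeds the minimum-threshold, matching the example in the text.</p>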
        <p>Next, the algorithm examines each subtree starting from the level-1 children that survived the initial
pruning. We repeat the pruning process for children in these subtrees to identify non-pruned
secondlevel nodes using the same relative thresholding procedure described above. This iterative process
continues until we reach nodes without children, which are leaf categories, and thus added to the set of
candidate leaf categories.</p>
        <p>Finally, because the search relies on relative rather than absolute semantic thresholding, we score the
final set of leaf categories using only leaf category information (categorization path and descriptions).
The surviving leaf categories with high semantic relevance, above the minimum-threshold, are the final
predictions.</p>
        <p>For our empirical application we choose the selection-threshold and minimum-threshold as follows:
given the range of semantic scoring between 1 (lowest) and 10 (highest), we never consider thresholds
below 6. Among thresholds 7, 8, 9, for both selection-threshold and minimum-threshold, we only look
at pairs where selection-threshold is above the minimum-threshold. Among such pairs, we pick the
one that does best in terms of F1-score against a human judgment dataset composed of about 1000
representative queries.2</p>
        <p>We refer to this method as the Chain-Of-Thought Breadth-First-Search (CoT BFS).</p>
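<p>A compact sketch of the CoT BFS loop, assuming hypothetical score_fn and prune_fn callables and tree nodes exposing a children attribute:</p>

```python
from collections import deque

def cot_bfs(root, query, score_fn, prune_fn, minimum_threshold=8):
    """Level-by-level CoT search: score the children of each surviving node,
    prune by the relative threshold, descend into survivors, then re-score
    the collected leaf candidates on their own information."""
    frontier = deque([root])
    candidates = []
    while frontier:
        node = frontier.popleft()
        if not node.children:  # reached a leaf category
            candidates.append(node)
            continue
        scores = {child: score_fn(query, child) for child in node.children}
        for child in prune_fn(scores):
            frontier.append(child)
    # final pass: keep leaves whose standalone score exceeds the minimum threshold
    return [leaf for leaf in candidates if score_fn(query, leaf) > minimum_threshold]
```

<p>The final re-scoring pass corresponds to the absolute check on the surviving leaves described above.</p>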
        <p>Figure 2 illustrates the CoT BFS categorization result for the query acoustic guitar, with a
selection-threshold of 9 and a minimum-threshold of 8. In the first step, CoT BFS narrows down to a single level-1
category: AllCats &gt; Musical Instruments &amp; Gear, which gets an intermediate semantic score of 10. All
other level-1 categories (out of 35) have low scores, with a mode score of 1 and a maximal
other score of 4 (category AllCats &gt; Music). The surviving level-1 category AllCats &gt; Musical Instruments
&amp; Gear has 16 children. The next classification step prunes 15 of these 16 children, because all
of them have semantic scores lower than 8. The child Guitars &amp; Basses has a semantic score of 10 and
survives. Further, the node AllCats &gt; Musical Instruments &amp; Gear &gt; Guitars &amp; Basses has 13 children.
Scoring these children in the next step prunes out all but three children. The surviving children are
the nodes: Classical Guitars (with a final score of 9), Acoustic Electric Guitars (final score of 9), and
Acoustic Guitars (final score of 10). At this point, the search stops, as the reached nodes are already
leaf categories of the category tree.</p>
        <p>2. Note that the maximal score of 10, i.e. the span of the range 1..10, is also a hyperparameter. These hyperparameters need to
be validated periodically over time, to account for distribution shifts of the queries, but also for changes in the taxonomy. Space
constraints preclude us from including a detailed analysis of the effect of the hyperparameters. Here we report qualitatively
the following: lowering the selection-threshold and minimum-threshold typically increases recall, but lowers precision. We
also found that using the alternative range 1..5 instead of 1..10 lowered both precision and recall, while using 1..20 resulted in
slightly higher recall, lower precision and lower F1.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Scalable approaches for CoT and LLM scoring</title>
        <p>For a given query, the total number of LLM calls in CoT BFS is of the same order as the number of
category nodes visited in the taxonomy. Experiments on large datasets of queries show that CoT BFS
visits between 1.7% and 24.8% of the total number of category nodes in eBay’s taxonomy. Given the
very high number of categories in eBay’s taxonomy, this demonstrates the efficiency of our method. Nonetheless,
to scale this method to millions of queries at low latency, modifications are needed. We propose two
approaches, the second more scalable than the first.</p>
        <sec>
          <title>2.3.1. CoT-k-NN hybrid BFS</title>
          <p>k-NN retrieval based on embeddings of category names or descriptions can be used as a filter at each
step of the tree search process. Instead of exhaustively rating each child node, only the subset surviving
the embedding-distance filter (between the user query and a textual representation of the category) is
scored by the LLM. This reduces the number of LLM calls at each node of the taxonomy and constrains
the search to the most promising directions.</p>
        </sec>
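<p>The per-node embedding filter of the hybrid approach can be sketched as follows; the cosine helper, embed_fn, score_fn, and the top_m cutoff are illustrative assumptions.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_filter_then_score(query_vec, children, embed_fn, score_fn, top_m=5):
    """Rank children by embedding similarity to the query, then LLM-score
    only the top_m most promising ones."""
    ranked = sorted(children, key=lambda c: cosine(query_vec, embed_fn(c)), reverse=True)
    shortlist = ranked[:top_m]
    return {child: score_fn(child) for child in shortlist}
```

<p>Only the shortlist reaches the LLM, so the number of LLM calls per node drops from the number of children to at most top_m.</p>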
        <sec id="sec-2-3-1">
          <title>2.3.2. k-NN-search + LLM scoring on leaf categories</title>
          <p>One replaces tree-search with a k-NN-search on leaf category embeddings as a pre-filter, followed by
LLM scoring of the candidates identified through k-NN. Running k-NN with many neighbors at the
beginning of the procedure, e.g. 20 neighbors, enhances recall. We use a variant of this method in
section 3.2 to construct a synthetic ground truth for evaluating CoT BFS.</p>
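<p>A minimal sketch of this flat variant, assuming precomputed leaf embeddings and a hypothetical score_fn; the cosine helper is repeated for self-containment.</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_leaf_candidates(query_vec, leaf_embeddings, k=20):
    """Return the k leaf categories closest to the query in embedding space."""
    ranked = sorted(leaf_embeddings.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [leaf for leaf, _ in ranked[:k]]

def categorize_flat(query, query_vec, leaf_embeddings, score_fn,
                    k=20, minimum_threshold=8):
    """k-NN pre-filter over leaf categories, followed by LLM scoring."""
    candidates = knn_leaf_candidates(query_vec, leaf_embeddings, k)
    scored = {leaf: score_fn(query, leaf) for leaf in candidates}
    return [leaf for leaf, s in scored.items() if s > minimum_threshold]
```

<p>A generous k in the pre-filter preserves recall; the LLM scoring pass then restores precision.</p>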
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimentation</title>
      <sec>
        <title>3.1. Baseline model: k-nearest neighbors categorization</title>
        <p>Our benchmark for the evaluation of CoT BFS is a k-NN search for leaf categories with k = 10, using
(not fine-tuned) embeddings from sentence-BERT. Cosine similarity is used as the metric for the k-NN.3
We provide detailed categorization performance, comparing our method’s F1, precision and recall
classification metrics in the micro, macro and sample aggregations. Micro aggregation considers
performance across all queries and leaf categories. Sample aggregation considers performance
per query and then aggregates. Macro aggregation considers performance per leaf category and then
aggregates.</p>
      </sec>
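<p>The three aggregations can be written out explicitly for set-valued predictions; the dictionary-of-sets interface is an illustrative assumption.</p>

```python
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def micro_f1(truth, pred):
    """Micro: pool true positives, predictions, and labels across all queries."""
    tp = sum(len(truth[q].intersection(pred[q])) for q in truth)
    p_total = sum(len(pred[q]) for q in truth)
    r_total = sum(len(truth[q]) for q in truth)
    prec = tp / p_total if p_total else 0.0
    rec = tp / r_total if r_total else 0.0
    return f1(prec, rec)

def samples_f1(truth, pred):
    """Samples: compute F1 per query, then average over queries."""
    per_query = []
    for q in truth:
        tp = len(truth[q].intersection(pred[q]))
        prec = tp / len(pred[q]) if pred[q] else 0.0
        rec = tp / len(truth[q]) if truth[q] else 0.0
        per_query.append(f1(prec, rec))
    return sum(per_query) / len(per_query)

def macro_f1(truth, pred):
    """Macro: compute F1 per leaf category, then average over categories."""
    labels = set()
    for q in truth:
        labels.update(truth[q])
        labels.update(pred[q])
    scores = []
    for lab in labels:
        tp = sum(1 for q in truth if lab in truth[q] and lab in pred[q])
        fp = sum(1 for q in truth if lab in pred[q] and lab not in truth[q])
        fn = sum(1 for q in truth if lab in truth[q] and lab not in pred[q])
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        scores.append(f1(prec, rec))
    return sum(scores) / len(scores)
```

<p>Micro rewards getting frequent categories right, macro weights rare leaf categories equally, and samples reflects per-query user experience.</p>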
      <sec id="sec-3-1">
        <title>3.2. Evaluation against baseline</title>
        <sec id="sec-3-1-1">
          <title>3.2.1. Human Judgment</title>
          <p>Human judgment offers both qualitative and quantitative evaluations by utilizing human intuition
and expertise. Evaluators review predicted categories for semantic relevance, though this process is
subjective and costly for large datasets. Despite its costs, human judgment captures nuances often
missed by other methods, and hence is indispensable. Our human judgment dataset includes 1018 queries
and 4897 query-category pairs, judged on semantic relevance (a Yes/No decision). We note that the
leaf categories for judgment were chosen based on user behavior signals, leading to presentation bias
influenced by eBay’s current models in production. The annotators are three eBay-funded independent
domain experts for eBay’s taxonomy.</p>
          <p>3. We have access to language models trained on eBay-specific data that typically perform better on eBay-related tasks than
general-purpose language models, but we do not present results from eBay-specific language models. This is
because our focus is on understanding how the CoT BFS approach performs with general-purpose, non-fine-tuned LLMs.
Furthermore, using publicly accessible language models helps with the reproducibility of the results.</p>
          <p>Table 2 shows the relative performance of the CoT BFS to the benchmark, assuming that the ground
truth is given by the human judgment. CoT BFS outperforms the baseline model, especially in relation
to the F1 score and precision.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2.2. AI Pseudo-Reference Method</title>
          <p>We use an AI pseudo-reference method to create a dataset that approximates ground truth without the
presentation bias found in demand signal datasets. Starting with 3000 user queries, a high-quality LLM
emulates human judgment on semantic relevance for query-category pairs. To avoid losing potentially
relevant leaf categories for LLM-scoring, we use a k-NN embedding-based search with a large number
of neighbors. Namely, we pick out the 100 most relevant categories for each query. From these, we
exclude those with cosine similarity below 0.01.</p>
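<p>The pseudo-reference construction can be sketched as follows; judge_fn stands in for the stronger judging LLM and, like the cosine helper, is an illustrative assumption.</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pseudo_reference(query, query_vec, leaf_embeddings, judge_fn,
                     n_neighbors=100, min_cosine=0.01):
    """Build pseudo ground truth: a wide k-NN pass over leaf categories,
    dropping near-orthogonal matches, then a judging LLM scores each
    surviving query-category pair from 1 to 10."""
    sims = {leaf: cosine(query_vec, vec) for leaf, vec in leaf_embeddings.items()}
    ranked = sorted(sims, key=sims.get, reverse=True)[:n_neighbors]
    kept = [leaf for leaf in ranked if sims[leaf] >= min_cosine]
    return {leaf: judge_fn(query, leaf) for leaf in kept}
```

<p>The wide neighborhood keeps recall high before judging, mirroring the 100-neighbor, 0.01-cosine setup described above.</p>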
          <p>Afterwards, each pair is scored from 1 to 10 using a superior LLM (OpenAI-GPT-4o-Mini) compared
to the locally hosted Mixtral-8x7B we use for CoT BFS. This hybrid method with a large number of neighbors, see also subsection
2.3.2, delivers a proxy for ground truth. Table 3 depicts the results. CoT BFS again outperforms the
baseline, especially in terms of F1 and precision.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.2.3. Retrieval Test</title>
          <p>The retrieval test evaluates predicted leaf categories by comparing recall and relevance between our
model and the baseline at the level of retrieved items. Items from the inventory are retrieved based on
leaf categories, with the estimated recall size showing the proportion of relevant items found. Relevance is
measured using an eBay-internal PEGFB model that has been trained on human judgment data, and
which classifies results into five graded relevance levels: Perfect, Excellent, Good, Fair, and Bad. The
retrieval test evaluation highlights the model’s practical utility in improving user satisfaction and search
efficiency. Our model significantly outperforms the k-NN benchmark in both recall and relevance, with
Mann-Whitney U test results showing highly significant differences in favor of CoT BFS.</p>
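<p>For reference, the Mann-Whitney U statistic used in the retrieval test can be computed in its pairwise-counting form; this is a sketch, and a production analysis would typically use a statistics library for the p-value.</p>

```python
def mann_whitney_u(x, y):
    """Pairwise-counting form of the Mann-Whitney U statistic for sample x
    against sample y: count pairs where x wins, with ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u
```

<p>A U value far above len(x) * len(y) / 2 indicates the first sample (e.g., CoT BFS relevance grades) stochastically dominates the second.</p>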
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Applications</title>
      <sec id="sec-4-1">
        <title>4.1. Context learning</title>
        <p>By learning from extensive real-world datasets, LLMs can identify patterns that reveal user intent and
preferences, enabling personalized search leading to higher semantic relevance. Contextual learning
can refine model outputs based on a given context of the query, ensuring that prediction results are
relevant. This capability is important for platforms like eBay, where discerning buyer intent enhances
the search experience. We consider two applications of context learning, one on user intent and one on
brand origin.</p>
        <p>More specifically, eBay’s taxonomy includes accessory-related categories across various merchandise
segments like electronics, automotive, and fashion. The CoT BFS approach can easily incorporate buyer
intent by modifying LLM prompts to include intents such as buying, seeking accessories, or looking for
complementary items. Figure 3 illustrates this for the query canon camera.</p>
        <p>[Figure 3: Accessory intent as search context guides query interpretation for the query "canon camera". Prompt (abridged): "Given an eBay user query and a categorization of the query, your task is to score the semantic relevance of a child category of the current categorization for the query. [...]" Inputs: the user query (e.g., "canon camera"); the search context (intent is finding accessories for the main product); the category path of the current node; the textual description of the parent category; and the name and description of the child category to be judged.]</p>
        <p>Without intent, the top category identified for the query is AllCats &gt; Cameras &amp; Photo &gt; Digital
Cameras with a score of 10. By injecting accessory intent as a search context into the prompt, the top
categories identified are AllCats &gt; Cameras &amp; Photo &gt; Camera, Drone &amp; Photo Accessories &gt; Accessory
Bundles and AllCats &gt; Cameras &amp; Photo &gt; Flashes &amp; Flash Accessories &gt; Other Flashes &amp; Flash Accs, with
a score of 9 each. More generally, table 5 shows how the average semantic scores for Accessory vs.
No-Accessory predicted categories change for 15 selected queries, when specifying accessory intent.</p>
        <p>Including buyer intent in CoT BFS leads to better targeting of relevant categories.</p>
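<p>Injecting intent into the scoring prompt can be sketched as follows; the prompt wording and parameter names are illustrative assumptions.</p>

```python
INTENTS = ("Buy", "Browse", "Accessories", "Complementary")

def build_contextual_prompt(query, category_path, child_name, child_desc, intent=None):
    """Inject an optional intent context into the child-scoring prompt."""
    context = ""
    if intent in INTENTS:
        context = f"Search context: the user intent is {intent}.\n"
    return (
        "Given an eBay user query and a categorization of the query, score the "
        "semantic relevance of a child category of the current categorization "
        "for the query, from 1 (lowest) to 10 (highest).\n"
        f"{context}"
        f"Query: {query}\n"
        f"Current category path: {category_path}\n"
        f"Child category: {child_name}. Description: {child_desc}\n"
        "Score:"
    )
```

<p>Because the context is a single prompt line, the same search procedure serves buying, browsing, accessory, and complementary intents without retraining.</p>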
        <p>Similarly, we consider the effect of injecting context regarding brand origin. To illustrate, for the
query sports car, we consider the two distinct contexts of brand origin is from Germany and brand origin
is from Italy. For this query, CoT BFS predicts exclusively sports car-related leaf categories from the
German brands Audi, BMW, Mercedes-Benz, Porsche in the first case, and exclusively leaf categories
from the Italian brands Alfa Romeo, De Tomaso, Ferrari, Fiat, Maserati, Lamborghini in the second.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Detecting issues and improving the taxonomy</title>
        <p>By analyzing query patterns and identifying gaps in category representation, CoT BFS can help provide
actionable insights for improving e-Commerce taxonomy structures.</p>
        <p>In this regard, we conducted an experiment using a representative sample of 25,000 queries processed
through the CoT BFS approach with high thresholds: a selection-threshold of 10 and a minimum-threshold of
9. There were 3,110 queries with empty model predictions, indicating that the current eBay taxonomy
lacks category nodes at the first few levels that are strongly semantically related to these queries. We
uncovered certain patterns by clustering these queries using k-NN search on embeddings. For instance,
two clusters of these “failing” queries correspond to the e-Commerce categories Designer Sunglasses
and Optical Instruments and Accessories. The closest leaf categories in the current eBay taxonomy for
the first identified cluster (Designer Sunglasses) are AllCats &gt; Clothing, Shoes &amp; Accessories &gt; Women &gt;
Women’s Accessories &gt; Sunglasses &amp; Sunglasses Accessories &gt; Sunglasses and AllCats &gt; Clothing, Shoes
&amp; Accessories &gt; Men &gt; Men’s Accessories &gt; Sunglasses &amp; Sunglasses Accessories &gt; Sunglasses, both at
depth 5 in the taxonomy. A similar issue is observed for the second cluster. These types of insights,
when drawn from large sets of user queries, can help product management teams in their taxonomy
enhancement work. E.g., introducing a level-2 category titled Optical Products and Eyewear with
subcategories such as Designer Sunglasses and Optical Instruments and Accessories might be beneficial
for the search experience.</p>
        <p>Ultimately, maintaining an e-Commerce taxonomy that provides high value to users involves complex
business and product management decisions. Our methodology offers tools to explore taxonomy issues,
with the goal of enhancing decision making in this setting.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future work</title>
      <p>In this study, we introduce a novel methodology for query categorization within hierarchical taxonomies.
It combines the world knowledge of LLMs and simple tree search algorithms to achieve high-quality
categorization and provide deep insights into the taxonomy.</p>
      <p>A/B tests are planned for the scalable methods presented in section 2.3. These involve direct tests,
where the model predictions are cached for use in production, but also indirect tests, where the A/B test
is on lower-latency categorization models trained with data that have been LLM-labeled via CoT BFS.</p>
      <p>Further, we developed a version of the CoT algorithm that uses absolute thresholding at each taxonomy
node, rather than the relative thresholding discussed in this paper; it is left out due to space constraints.
This method, called Chain-of-Thought Depth-first-search (CoT DFS), searches for leaf categories in a
depth-first manner and halts a path when encountering an intermediate node with low absolute semantic
relevance, as opposed to the relative thresholding described in this paper. Because of its more stringent
requirements, the CoT DFS approach leads to more queries with empty predictions. CoT DFS can leverage user query
activity and LLM semantic knowledge more effectively than CoT BFS for the purpose of taxonomy
diagnostics.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Joachims, L. Granka, B. Pan, H. Hembrooke, G. Gay, Accurately interpreting clickthrough data as implicit feedback, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, Association for Computing Machinery, New York, NY, USA, 2005, pp. 154-161. URL: https://doi.org/10.1145/1076034.1076063. doi:10.1145/1076034.1076063.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. Xv, C. Lin, W. Guan, J. Gou, X. Li, H. Deng, J. Xu, B. Zheng, E-commerce search via content collaborative graph neural network, in: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 2885-2897. URL: https://doi.org/10.1145/3580305.3599320. doi:10.1145/3580305.3599320.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Neural ranking models with weak supervision</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '17,
          Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , p.
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          . URL: https://doi.org/10.1145/3077136.3080832. doi:10.1145/3077136.3080832.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pantel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Modeling interestingness with deep neural networks</article-title>
          ,
          <source>in: Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2014</year>
          . URL: https://api.semanticscholar.org/CorpusID:2141094.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Learning a deep listwise context model for ranking refinement</article-title>
          ,
          <source>in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          , SIGIR '18,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>135</fpage>
          -
          <lpage>144</lpage>
          . URL: https://doi.org/10.1145/3209978.3209985. doi:10.1145/3209978.3209985.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>A deep relevance matching model for ad-hoc retrieval</article-title>
          ,
          <source>in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management</source>
          , CIKM '16,
          Association for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , p.
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          . URL: https://doi.org/10.1145/2983323.2983769. doi:10.1145/2983323.2983769.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <article-title>An introduction to neural information retrieval</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>126</lpage>
          . URL: http://dx.doi.org/10.1561/1500000061. doi:10.1561/1500000061.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sigler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2201.11903. arXiv:2201.11903.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>ArXiv abs/2205.11916</source>
          (
          <year>2022</year>
          ). URL: https://api.semanticscholar.org/CorpusID:249017743.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.-S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Acero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heck</surname>
          </string-name>
          ,
          <article-title>Learning deep structured semantic models for web search using clickthrough data</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>