<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM SIGIR Workshop on eCommerce, July</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Question Intent Taxonomy for E-commerce</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diji Yang</string-name>
          <email>dyang39@ucsc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Omar Alonso</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California Santa Cruz</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>18</volume>
      <issue>2024</issue>
      <abstract>
        <p>Effective question-intent understanding plays an important role in enhancing the performance of Question-Answering (QA) and Search systems. Previous research in open-domain QA has highlighted the value of intent taxonomies in comprehending data and facilitating answer generation and evaluation. However, existing taxonomies have limitations for specific domains. We are interested in question intent for e-commerce scenarios, where questions are specific to shopping activities.</p>
      </abstract>
      <kwd-group>
        <kwd>intent understanding</kwd>
        <kwd>question taxonomy</kwd>
        <kwd>question answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Question answering (QA), a longstanding task in NLP, has advanced rapidly in
recent years with the development of language models. Transformer-based models perform well
on most factoid QA datasets; however, they still perform poorly compared to humans
on datasets containing more complex problems. In closed-domain QA, such as AmazonQA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
this problem is more noticeable and requires relevant domain knowledge. As a result, QA applications
are limited in the scenarios in which they can be deployed for product-level services. At the same time,
early research suggests that accurate intent understanding forms the cornerstone for successful
information retrieval and contextually relevant answer generation [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. The goal of question
intent understanding is to categorize user queries into distinct intent classes. This categorization
aids in facilitating data comprehension, answer generation, and evaluation [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">4, 2, 5</xref>
        ]. It can also
be used as a signal for relevance ranking and improving diversity in search results.
      </p>
      <p>
        In practice, Broder [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] shows the importance of classifying user queries in web search and
how it reflects the real world. Intent taxonomies aid in categorizing questions based on their
inherent purpose and help in improved answer synthesis and evaluation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, it has
been observed that a single intent taxonomy may not be universally applicable across diverse
domains due to the specific nuances inherent in different contexts [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. Bolotova et al.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a unified intent taxonomy for non-factoid questions (NFQA). While the NFQA
taxonomy is effective in certain contexts, it falls short when applied to the fine-grained features
of e-commerce. Human-to-human three-way agreement when using NFQA stands at 49.13%,
indicating a lack of consensus in categorizing intent for e-commerce-related queries.
Furthermore, a noticeable category imbalance exacerbates the challenges in effectively classifying
e-commerce-related questions using NFQA. (∗Work done during internship at Amazon.)
      </p>
      <p>
        We propose EQA (E-commerce Question Answering) taxonomy, a tailored approach that
advocates for the creation and adoption of a bespoke taxonomy dedicated to e-commerce
questions. Specifically, recognizing the limitations of the existing NFQA taxonomy in accurately
reflecting the intent of e-commerce queries, we eliminate categories that show low
interrater agreement rates in the e-commerce context and introduce new categories that are more
contextually appropriate. EQA is designed to encapsulate the unique characteristics of
e-commerce data and user queries within this domain. Our taxonomy demonstrates that it can
represent users’ real information needs in the context of shopping scenarios. To operationalize
the EQA taxonomy, we leverage instruction fine-tuning [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to train an intent classifier for
e-commerce questions. Our experiments demonstrate the effectiveness of this approach in
accurately categorizing e-commerce queries.
      </p>
      <p>Our contribution can be summarized as follows:
• We propose a question intent taxonomy for e-commerce questions that can be used
in different shopping scenarios. Our quantitative and qualitative analyses confirm the
reliability of this taxonomy for e-commerce problems.
• We describe how to build classifiers when introducing a new taxonomy. While EQA is
based on e-commerce, we believe this methodology can be generalized to other domains.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Question intent</title>
      <p>NFQA is a comprehensive question intent taxonomy for open-domain question-answering
tasks. To explore the suitability of NFQA trained from open-domain questions for e-commerce
questions, two human annotators followed the NFQA taxonomy to label a set of in-domain
questions. Meanwhile, using the pre-trained classifiers of NFQA, we obtained NFQA predictions
from a deep learning model. The results show that human-to-human agreement stands at a
mere 49.13%. These disagreements occur mainly in the experience and evidence-based categories.
Moreover, we note that some classes do not faithfully reflect the information needs of the
question, e.g., debate. More quantitative analyses are covered in Section 4.</p>
    </sec>
    <sec id="sec-4">
      <title>3. E-commerce taxonomy</title>
      <p>The EQA taxonomy is presented in Table 1. In this section, we describe how EQA fulfills users’
information needs in closed domains and how to evaluate the taxonomy.</p>
      <p>Table 1: The EQA taxonomy. Each row gives the category, a description, and example questions.</p>
      <p>Instruction: The customer wants instructions, guidelines, or procedures to achieve something
with respect to a product or service. Examples: “Where is the doorknob? Once the code is entered,
how do you open the door?”; “How can I tell if this will work on my TV or BluRay player?”</p>
      <p>Opinion: The customer wants a subjective piece of information about a product, service, or
shopping category. Examples: “Is this product worth buying or will I end up sending it back?”;
“Any defect complaints?”</p>
      <p>Description: The customer wants a definition, description, explanation, or summary of a
product, service, or shopping category. Examples: “What are the sizes and types of blades that
come with the 5-blade package?”; “What are the measurements of this product?”</p>
      <p>Comparison: The customer wants a comparison of two or more products or services. Examples:
“What's the difference between exclusive Castiel and regular?”; “How does this knife compare
with a Kershaw?”</p>
      <p>Recommendation: The customer wants recommendations for a product or service. Examples:
“Looking for a bag for golf for drinks and snacks. Would this be a good choice?”; “I need a case
for a .357 with a six-inch barrel. Suggestions?”</p>
      <p>Factoid: The customer wants an objective piece of information about a product, service, or
shopping category. Examples: “Is the price and shipping for one bar or a set of two?”; “Does it
fit on Honda CRV 2014?”</p>
      <sec id="sec-4-1">
        <title>3.1. Information Needs and Question Intent</title>
        <p>
          The concept of information need refers to the foundational motivation driving users to
engage with search systems [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Questions are shaped by the askers’ specific contexts, akin to
how semantics in linguistics rely on contextual understanding [
          <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
          ]. Thus, analyzing
questions requires consideration of the broader context (i.e., question intent). In contrast to
open-domain QA or web search queries, users within the e-commerce domain pose questions
for more targeted purposes – specifically, to facilitate subsequent purchasing decisions. This
inherent focus renders the coarse-grained taxonomy of the general domain insufficient in
capturing the nuanced distinctions between e-commerce intents.
        </p>
        <p>Building upon the existing NFQA taxonomy, we first focus on the fact that the debate category
does not constitute a valid intent in e-commerce. When individuals take part in shopping,
they are not likely to anticipate engaging in formal debates with others. Even inquiries that
could potentially spark heated discussions on forums or in other contexts, such as “What's
the best graphics card for non-gamers?”, are more accurately read as cases where the user is
looking for advice on a purchase. This observation consequently highlights the significance of recommendation,
which is a prevalent intent within the e-commerce domain. Furthermore, opinions, representing
subjective insights from other consumers, prove instrumental in guiding purchasing decisions.
This intent commonly manifests in the form of queries seeking feedback on product usage
or opinions about comparable items. Correspondingly, inquiries for objective information
are encompassed by the description category. Answers to these queries can often be readily
found on the product page, such as technical details provided by the seller. Alongside these,
instruction and comparison persist as two enduring question types that are pertinent in
e-commerce.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Evaluation metrics</title>
        <p>To quantitatively assess the efficacy of the question intent taxonomy, we adopt two distinct
metrics that serve as indicators of its performance across specific datasets.</p>
        <p>Distribution of categories We analyze the distribution of each intent as a percentage
within the dataset. Although closely tied to dataset characteristics such as the data source, the
distribution offers valuable insights. Extremely unbalanced distributions often imply that the
current taxonomy struggles to establish effective boundaries for splitting the questions in the
given dataset.</p>
        <p>Human-to-human agreement Within a given dataset, a well-defined taxonomy should
facilitate consistent labeling by diverse human annotators. High agreement implies that the
categories in the taxonomy clearly delineate question intent, minimizing blurry classification
regions as well as controversial labels.</p>
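As an illustration, both measures (together with the distribution variance discussed in Section 4) reduce to a few lines of Python; the toy labels below are ours, not the paper's data:

```python
from collections import Counter

def category_distribution(labels):
    """Percentage of each intent category within a dataset."""
    total = len(labels)
    return {cat: 100.0 * n / total for cat, n in Counter(labels).items()}

def distribution_variance(distribution):
    """Population variance of the category percentages; higher values
    suggest a more unbalanced split across categories."""
    values = list(distribution.values())
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def three_way_agreement(annotations):
    """Percentage of items where all three annotators chose the same label."""
    agreed = sum(1 for a, b, c in annotations if a == b == c)
    return 100.0 * agreed / len(annotations)

# Toy labels, for illustration only.
dist = category_distribution(["Factoid", "Factoid", "Opinion", "Description"])
var = distribution_variance(dist)
agreement = three_way_agreement([("Factoid", "Factoid", "Factoid"),
                                 ("Opinion", "Opinion", "Description")])
```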
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Intent Classifier Design</title>
        <sec id="sec-4-3-1">
          <title>3.3.1. Model choice</title>
          <p>
            We chose the encoder-decoder model T5x [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] as the starting point, considering performance
and scalability. Benefiting from extensive pre-training data, T5x shows language understanding
ability, which is the prerequisite for a language model to predict the intent. Furthermore,
encoderdecoder architecture was born with an advantage over encoder-only models in classification
tasks [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. Regarding scalability, decoder-only models that perform well in various NLP tasks
tend to be large and thus difficult to deploy on lightweight devices [
            <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 19, 20</xref>
            ]. For all
experiments, we conduct fine-tuning on the Flan-T5-Large model [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
          </p>
        </sec>
        <sec id="sec-4-3-2">
          <title>3.3.2. Data Preparation</title>
          <p>
            Motivated by the success of few-shot learning in NLP [
            <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
            ], we pre-process the training data to
better serve the subsequent supervised fine-tuning. Emulating LIMA [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], we prioritize training
data quality through stratified sampling for balanced intent representation and manual filtering
to eliminate monotonous language patterns. For example, we diversified the comparison
intent questions to prevent overfitting to repetitive structures like “What is the difference
between A and B?”. Our processed training dataset emphasizes both representativeness across
all intent classes and linguistic diversity, which further enhances the robustness of the
fine-tuning process.
          </p>
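The balanced-sampling step can be sketched as follows (the per-class budget and helper names are ours, for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(examples, per_class, seed=0):
    """Sample up to `per_class` (question, intent) pairs per intent class,
    so every category is (roughly) equally represented in the training data."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for question, intent in examples:
        by_intent[intent].append((question, intent))
    sample = []
    for intent in sorted(by_intent):
        items = by_intent[intent]
        rng.shuffle(items)
        sample.extend(items[:per_class])
    return sample

# Toy pool: three Factoid and two Opinion questions, balanced down to two each.
pool = [("q1", "Factoid"), ("q2", "Factoid"), ("q3", "Factoid"),
        ("q4", "Opinion"), ("q5", "Opinion")]
sample = stratified_sample(pool, per_class=2)
```

Manual filtering for monotonous phrasing would follow this step; it is not automated here.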
        </sec>
        <sec id="sec-4-3-3">
          <title>3.3.3. Model Alignment</title>
          <p>
            To better align with the downstream task, i.e., intent classification, we adopt an instruction
fine-tuning paradigm [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. Specifically, we define the task as a seven-class classification problem
and conduct supervised fine-tuning to tailor the model to our specific requirements.
          </p>
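A minimal sketch of the instruction-format input follows; the prefix wording is a hypothetical stand-in (the paper does not publish its exact prompt), and we assume Not-a-question as the seventh class alongside the six of Table 1:

```python
# Hypothetical instruction prefix; the paper does not publish its exact prompt.
# We assume Not-a-question is the seventh class alongside the six of Table 1.
EQA_CATEGORIES = [
    "Instruction", "Opinion", "Description", "Comparison",
    "Recommendation", "Factoid", "Not-a-question",
]

INSTRUCTION_PREFIX = (
    "Classify the intent of the following e-commerce question into one of: "
    + ", ".join(EQA_CATEGORIES) + ".\nQuestion: "
)

def build_example(question, intent):
    """Pair an instruction-prefixed input with its target label, the
    (input, target) shape consumed by supervised fine-tuning."""
    return INSTRUCTION_PREFIX + question, intent

inp, target = build_example("Does it fit on Honda CRV 2014?", "Factoid")
```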
          <p>For prediction trustworthiness, in addition to the intent label, we record the model transition
probability at the generated token to approximate the confidence score. Particularly, the score
of the generated token t_i is determined by its conditional probability given all preceding tokens
(the given question), t_{&lt;i}. The overall score is computed by Equation 1, where z_i is the logit
corresponding to t_i, and the denominator is the sum of exponential logits over all tokens in the
vocabulary, ensuring normalization.</p>
          <p>P(t_i | t_{&lt;i}) = exp(z_i) / ∑_j exp(z_j)   (1)</p>
          <p>This formulation quantifies the model’s certainty or confusion in selecting   given the
preceding context, with lower scores indicating the model’s uncertainty about the prediction.
In practical applications, setting a threshold could advise users against placing too much trust
in uncertain predictions, thereby enhancing the reliability of classification results. In this work,
the threshold is manually set at 0.6. We envision that future research could develop an adaptive,
learnable threshold by training a simple neural network, such as a Multilayer Perceptron (MLP),
to improve the discernment of prediction reliability.</p>
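The confidence score of Equation 1 and the manual 0.6 threshold can be sketched in pure Python with toy logits (function names and the toy vocabulary are ours, not real model outputs):

```python
import math

def token_confidence(logits, token_id):
    """Equation 1: softmax probability of the generated token, i.e. exp(z_i)
    divided by the sum of exponentiated logits over the whole vocabulary."""
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[token_id] / sum(exps)

def trustworthy(logits, token_id, threshold=0.6):
    """Flag predictions whose confidence falls below the manual threshold."""
    return token_confidence(logits, token_id) >= threshold

# Toy logits over a 4-token "vocabulary"; index 2 is the generated intent token.
logits = [0.1, 0.2, 3.0, -1.0]
score = token_confidence(logits, 2)
```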
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments and Results</title>
      <sec id="sec-5-1">
        <title>4.1. Training details</title>
        <p>In line with the prompt design from recent instruction fine-tuning works, our training utilizes
an instruction prefix combined with dataset questions as input and human-annotated intent
categories as expected labels. The fine-tuning uses the cross-entropy loss and the Adam
optimizer over 20 training epochs. The entire fine-tuning process was completed in under
two hours on a single NVIDIA A-100 GPU, following the hyperparameter settings recommended
by T5x [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>Table 3 (NFQA column). Distribution (%): Factoid 17.43, Debate 0.57, Evidence-based 40.29,
Instruction 4.29, Experience 12.86, Comparison 18.57, Reason 6.00. Three-way agreement: 49.13.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Dataset</title>
        <p>
          Built on top of a product review-based e-commerce dataset, Amazon review data [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
AmazonQA [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is known as a community QA dataset. All questions, passages, and answers
in AmazonQA are extracted from real human interactions, which makes it an ideal dataset
for understanding the real information needs of users in the e-commerce domain. The official
test split includes 92,726 QA pairs. However, because computing human-to-human agreement
requires a significant amount of human effort, we use a subset of the original test data, i.e., 350
questions, as our test set.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Results and Analysis</title>
        <sec id="sec-5-3-1">
          <title>4.3.1. Intent Classifier</title>
          <p>Table 2 presents the intent distribution of questions in the AmazonQA dataset using EQA and
NFQA. Our analysis aimed to establish which taxonomy better represents the nature of queries in
an e-commerce context. The EQA taxonomy revealed a predominant focus on factoid questions,
constituting 51.98% of the dataset. This result aligns well with the nature of e-commerce
inquiries, where customers often seek specific, factual information about products. The next
most prominent categories in the EQA taxonomy were opinion (16.97%) and description (15.07%),
reflecting customers’ interest in reviews and detailed product descriptions. In contrast,
the NFQA taxonomy’s most prominent category was debate, accounting for 55.70%. However,
the concept of debate is less relevant in an e-commerce setting, as customers typically seek
concrete information rather than engage in discussions of a contentious nature. The factoid
category in NFQA, while still significant, was markedly lower at 16.03%, suggesting a less precise
alignment with the nature of e-commerce queries. Among the remaining categories with relatively
low rates, Not-a-question accounts for 7.56% under NFQA, higher than any other minor
category. This finding points to many cases where the NFQA classifier fails, which did not
occur with EQA. In both taxonomies, instruction and comparison share similar proportions.
NFQA assigns high rates to experience and evidence-based. While these categories are
relevant in broader information contexts, their specific applicability in e-commerce is less direct
compared to the recommendation category in EQA, which reflects a customer’s desire for
guidance to make informed purchasing decisions.</p>
          <p>Table 4: Example questions with their EQA predictions (confidence) and NFQA predictions.
(1) “I need to replace a defective Julie front brake on a Cannondale Scalpel MTB. Is this a good
replacement and will it bolt right on?” EQA: Recommendation (99%); NFQA: Debate.
(2) “Compare best 3G/4G internet access plan?” EQA: Comparison (99%); NFQA: Debate.
(3) “10′ is L, H or W?” EQA: Factoid (99%); NFQA: Not-Question.
(4) “What is the protein in this? Wish they would call that out on the site. Trying to decide
between this and Kay's naturals which is protein packed!” EQA: Description (99%); NFQA: Not-Question.
(5) “For a more permanent solution, do you super glue the pads along with the sticky adhesive
onto the glass or without the sticky adhesive?” EQA: Opinion (88%); NFQA: Instruction.</p>
          <p>Further statistical analysis revealed that the variance for the intent distribution in EQA
taxonomy was approximately 275.76, while for NFQA, it was slightly higher at 284.78. This
higher variance in the NFQA taxonomy indicates a broader spread in the distribution of question
types, which may imply less consistency in categorization relevance for e-commerce data.
Our experimental results on AmazonQA emphasize the adaptability of the EQA taxonomy
in capturing the intent of e-commerce customers, providing a more relevant and practical
categorization framework for analyzing customer queries in e-commerce scenarios.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>4.3.2. Human Evaluation</title>
          <p>We performed human annotation of a random sample of 350 data points, where half were from
AmazonQA, and another half were from our internal unpublished real e-commerce data. To
ensure reliability and reduce the subjectivity inherent in manual labeling, we recruited three
independent annotators and adopted a two-stage majority voting process for deriving the final
label. In the initial stage, data points where at least two annotators agreed on the label were
directly accepted, and these consensus labels were deemed final for those specific data instances.
Next, to address cases of complete annotator disagreement, we calculated the individual
accept rate of each annotator, defined as the proportion of their labels accepted in the
first stage. For data points with divergent annotations, the label proposed by the annotator
with the highest accept rate was chosen as the final label. We analyzed the distribution of each
intent category and calculated the rate of three-way agreement, which is the proportion of
three labelers providing the same label.</p>
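The two-stage procedure described above can be sketched as follows (helper names are ours; we break stage-2 ties by raw accepted counts, which ranks annotators identically to the accept rate since the denominator is shared):

```python
def two_stage_vote(annotations):
    """annotations: list of (a1, a2, a3) label triples, one per data point.
    Stage 1: accept the label where at least two annotators agree.
    Stage 2: for full-disagreement cases, take the label proposed by the
    annotator whose stage-1 labels were accepted most often."""
    finals = [None] * len(annotations)
    accepted = [0, 0, 0]  # per-annotator count of stage-1 accepted labels
    for i, triple in enumerate(annotations):
        for label in set(triple):
            if triple.count(label) >= 2:  # majority found
                finals[i] = label
                for k, lab in enumerate(triple):
                    if lab == label:
                        accepted[k] += 1
                break
    best = max(range(3), key=lambda k: accepted[k])  # highest accept rate
    for i, triple in enumerate(annotations):
        if finals[i] is None:  # complete disagreement
            finals[i] = triple[best]
    return finals

labels = two_stage_vote([
    ("Factoid", "Factoid", "Opinion"),      # stage 1: "Factoid"
    ("Opinion", "Description", "Opinion"),  # stage 1: "Opinion"
    ("Factoid", "Opinion", "Description"),  # stage 2: tie-break
])
```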
          <p>As reported in Table 3, the human evaluation results reveal a significant diference in the
faithfulness of the EQA and NFQA taxonomies in e-commerce contexts. While the EQA
taxonomy achieved a substantial three-way agreement rate of 76.88%, NFQA’s agreement rate was
notably lower at 49.13%. This diference highlights a key challenge with NFQA in e-commerce:
its categories are less tailored to the specific types of queries that arise in this domain. For
instance, NFQA’s broader categories, like Evidence-based and Reason, may lead to varied
interpretations among annotators when applied to the more focused needs of e-commerce
customers. The disagreement in NFQA suggests that its categories, possibly well-suited for
open-domain questions, are less intuitive and coherent for e-commerce queries, leading to more
subjective and inconsistent categorization. In contrast, EQA, with its higher agreement rate,
demonstrates a clear alignment with the distinct, often more pragmatic and product-focused
nature of e-commerce questions.</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>4.3.3. Qualitative analysis</title>
          <p>
            This section outlines three patterns in which the NFQA and EQA yield divergent outcomes.
As mentioned in Section 4.3.1, one notable issue with NFQA is its tendency to classify a large
number of questions as Debate, which is a less reasonable intent in the online shopping context.
For example, as illustrated in the first two examples in Table 4, questions that may carry a
debating intent in daily discussion (e.g., debates over “the best”) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] are, in an e-commerce
setting, more accurately interpreted as seeking product recommendations or comparisons.
Another commonly seen pattern is the misclassification of questions as Not-Question due to
the gap between NFQA’s pre-training data and the actual shopping queries, particularly failing
to recognize questions containing abbreviations. Our analysis of question length shows that
data labeled as Not-Question by NFQA averaged 27.85 tokens, contrasted with an average of
13.46 tokens for all other intents. This discrepancy further indicates that NFQA falls
short in processing longer queries in the e-commerce domain. The last example question seeks
an opinion, indicating a preference for judgment over direct steps. The distinction between EQA
and NFQA highlights their differing capacities to interpret the demands of e-commerce data.
Across these three patterns, EQA demonstrates alignment with human
intuition and consistently delivers high confidence scores.
          </p>
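The token-length comparison behind this observation reduces to a simple average; whitespace tokenization is our assumption here, since the paper does not state how tokens were counted:

```python
def mean_token_length(questions):
    """Average question length in whitespace-separated tokens."""
    return sum(len(q.split()) for q in questions) / len(questions)

# Illustrative comparison (toy examples, not the paper's measurement): the
# queries NFQA mislabeled as Not-Question were on average the longer ones.
mislabeled = ["I need to replace a defective Julie front brake on a Cannondale "
              "Scalpel MTB. Is this a good replacement and will it bolt right on?"]
others = ["10' is L, H or W?"]
gap = mean_token_length(mislabeled) - mean_token_length(others)
```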
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and Future Work</title>
      <p>We introduce the EQA taxonomy, tailored specifically for e-commerce queries. Our research
highlighted the limitations of generic taxonomies like NFQA in the e-commerce context and
demonstrated the need for a domain-specific solution. The development and validation of EQA,
coupled with an intent classifier trained using instruction fine-tuning, shows considerable promise for
question intent understanding in e-commerce. This approach offers a more accurate framework
for question categorization in e-commerce and sets a precedent for developing domain-specific
taxonomies in other specialized areas.</p>
      <p>EQA has proven reliable on e-commerce data; however, the effectiveness of its intent
labels for downstream tasks is still unproven. Moving forward, we anticipate the integration
of EQA classifiers into operational pipelines, enabling systematic evaluation of their efficacy
in supporting downstream tasks. Furthermore, the prospect of extending the methodologies
employed in this study to other domains, such as healthcare, by adapting domain-specific intent
taxonomies for classifier training points to an exciting direction for future research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rayasam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <article-title>Amazonqa: A review-based question answering task</article-title>
          , arXiv preprint arXiv:1908.04364 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Lehnert</surname>
          </string-name>
          ,
          <article-title>A conceptual theory of question answering</article-title>
          ,
          <source>in: Proc. of IJCAI</source>
          ,
          <year>1977</year>
          , p.
          <fpage>158</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shneiderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Byrd</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          , Clarifying Search:
          <article-title>A User-Interface Framework for Text Searches</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Graesser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Person</surname>
          </string-name>
          , Question asking during tutoring,
          <source>American Educational Research Journal</source>
          <volume>31</volume>
          (
          <year>1994</year>
          )
          <fpage>104</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pujari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <article-title>Can taxonomy help? improving semantic question matching using question taxonomy</article-title>
          ,
          <source>in: Proc. of ACL</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>499</fpage>
          -
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <article-title>A taxonomy of web search</article-title>
          ,
          <source>SIGIR Forum 36</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anubhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shandilya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sigalas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Beyond accurate answers: Evaluating open-domain question answering in enterprise search</article-title>
          ,
          <source>in: Proc. of CHIIR</source>
          ,
          <year>2023</year>
          , p.
          <fpage>308</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Suzuki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Taira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <article-title>Question classification using HDAG kernel</article-title>
          ,
          <source>in: Proc. of ACL Workshop on Multilingual Summarization and Question Answering</source>
          ,
          <publisher-name>ACL</publisher-name>
          ,
          <year>2003</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Learning question classifiers</article-title>
          ,
          <source>in: COLING 2002: The 19th International Conference on Computational Linguistics</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Hermjakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ravichandran</surname>
          </string-name>
          ,
          <article-title>A question/answer typology with surface text patterns</article-title>
          ,
          <source>in: Proc. of HLT</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bolotova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Blinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <article-title>A non-factoid question answering taxonomy</article-title>
          ,
          <source>in: Proc. of SIGIR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1196</fpage>
          -
          <lpage>1207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.11416</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sarzynska-Wawer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wawer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pawlak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Szymanowska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stefaniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Okruszek</surname>
          </string-name>
          ,
          <article-title>Detecting formal thought disorder by deep contextualized word representations</article-title>
          ,
          <source>Psychiatry Research</source>
          <volume>304</volume>
          (
          <year>2021</year>
          )
          <fpage>114135</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kusner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <article-title>A survey on contextual embeddings</article-title>
          ,
          <source>arXiv preprint arXiv:2003.07278</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Andor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gaffney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohiuddin</surname>
          </string-name>
          , et al.,
          <article-title>Scaling up models and data with t5x and seqio</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kementchedjhieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <article-title>An exploration of encoder-decoder approaches to multi-label classification for legal and biomedical text</article-title>
          ,
          <source>arXiv preprint arXiv:2305.05627</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , et al.,
          <article-title>PaLM: Scaling language modeling with pathways</article-title>
          ,
          <source>arXiv preprint arXiv:2204.02311</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>Exploiting cloze questions for few shot text classification and natural language inference</article-title>
          ,
          <source>arXiv preprint arXiv:2001.07676</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Efrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          , et al.,
          <article-title>LIMA: Less is more for alignment</article-title>
          ,
          <source>arXiv preprint arXiv:2305.11206</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Addressing complex and subjective product-related queries with customer reviews</article-title>
          ,
          <source>in: Proc. of WWW</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>625</fpage>
          -
          <lpage>635</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>