<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SEQ+MD: Learning Multi-Task as a SEQuence with Multi-Distribution Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Siqi Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Audrey Zhijiao Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Austin Clapp</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sheng-Min Shih</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoting Zhao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Boston University</institution>
          ,
          <addr-line>665 Commonwealth Ave, Boston, MA 02215</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Etsy</institution>
          ,
          <addr-line>117 Adams St, Brooklyn, NY 11201</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In e-commerce, ranking algorithms based on relevance and engagement signals have often shown improvement in sales and gross merchandise value (GMV). Designing such algorithms becomes particularly challenging when serving customers across diverse regional markets, as shopping preferences and cultural traditions vary significantly. We propose the SEQ+MD framework, which combines sequential learning for multi-task learning (MTL) with a region-based feature mask for handling multi-distribution data. This approach utilizes the sequential order within tasks and accounts for regional heterogeneity, enhancing performance on multi-source data. Unlike traditional sequential models that rely on tracking user interaction histories, SEQ operates on user-item feature pairs and generates task-specific predictions in sequence. Moreover, SEQ supports efficient parameter sharing across tasks and allows new tasks to be added easily. Notably, SEQ trained on data from only two tasks outperforms the baseline model trained on data from all three tasks when evaluated on the full three-task setting. Experiments on in-house data showed significant gains in high-value engagements, including add-to-cart and purchase actions. Furthermore, our multi-regional learning module can be flexibly applied to enhance other MTL applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-task Learning</kwd>
        <kwd>Mixed-distribution Learning</kwd>
        <kwd>E-commerce Search</kwd>
        <kwd>E-commerce Ranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In e-commerce, the design of item display algorithms is crucial for enhancing the customer shopping
experience [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. When a customer enters a query in the search window, the query typically goes
through two stages to render final search results: retrieval and re-ranking. In the first stage, retrieval
systems extract thousands of the most relevant items from millions of listings; in the re-ranking step,
the thousands of listings are further re-ranked such that the most relevant results are shown at the
top. Unlike traditional pattern-searching methods [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], machine learning offers possibilities for more
personalized search experiences [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The same search query from different users may yield completely
different listing displays.
      </p>
      <p>
        Designing effective machine learning algorithms for global e-commerce involves two major challenges.
First, models often need to handle multiple tasks with unevenly distributed data. For example, click data
is much more abundant than purchase data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Multi-task learning (MTL) improves performance by
enabling shared learning across tasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], as illustrated in Fig.1-(a), but it remains difficult to maintain
balanced training and promote effective communication between tasks [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. Second, regional
differences introduce significant variation in data distributions. In global marketplaces, users interact
with international listings, yet shopping behaviors differ across countries due to cultural preferences.
For instance, buyers in the UK are more likely to purchase cookie boxes as birthday gifts (Fig.2-(a)).
These differences influence both the distribution and relevance of features. As shown in Fig.2-(b), some
features are informative in certain regions but uninformative in others. Throughout this paper, we use
"country" and "region" interchangeably, though a region may refer to any geographic area.
      </p>
      <p>[Figure 1: (a) learning multi-task with experts and gates (prior work), with gates (g) and experts (e) combined per task; (b) learning multi-task as a sequence (ours), with per-task inputs x1, ..., xk fed to an unrolled RNN (R) sequence processor.]</p>
      <p>
        Existing methods usually address these two challenges separately. To the best of our knowledge,
no single model currently solves both challenges effectively. Regarding multi-task learning (MTL),
many approaches treat tasks independently [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ], ignoring their natural sequential structure.
Methods that consider task ordering either rely on user interaction sequences to predict the next
item [15], or use separate task-specific towers followed by conditional probability modeling [16, 17, 18].
Beyond sharing a base model, interactions between tasks are typically limited to shared experts or
gating mechanisms [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">7, 6, 5</xref>
        ]. For region-specific data, most models are trained without accounting for
regional variation, despite clear differences in input features across regions as shown in Fig. 2. While
incorporating regional information could improve performance, training separate models for each
region is inefficient and often ineffective due to imbalanced data availability, especially in regions with
limited samples.
      </p>
      <p>To this end, we propose the learning multi-task as a SEQuence + Multi-Distribution (SEQ+MD)
framework, which can tackle the two challenges simultaneously. For the multi-task component, we
observe that many user actions follow a natural sequence, such as clicking before purchasing, which can
be modeled effectively as a sequential learning problem. Rather than treating each task independently,
our SEQ architecture generates task predictions as a sequence, as shown in Fig.1-(b). The input pair of
user and item features is first encoded into a sequence, and the model then outputs a probability token
for each task in order. The most closely related work, HTLNet [18], also uses the output of earlier tasks as
input for later ones. However, their approach relies on separate task towers, while our SEQ model uses
a recurrent neural network (RNN) [19] that shares the same weights across tasks. This design supports
efficient expansion to new tasks and maintains strong performance without the need for additional
training. For handling mixed input distributions, we separate input features into region-invariant and
region-dependent groups. The region-dependent features are processed with a country embedding in
our multi-distribution (MD) learning module, meaning these features are transformed according to
their region, and then concatenated with the region-invariant features. An advantage of this approach
is that the MD module is easy to plug in and can enhance the performance of any multi-task learning
model on multi-source data.</p>
      <p>We evaluated our framework offline on our in-house data and observed a 1.8% performance increase
in the critical purchase task while keeping the click task performance positive compared to baseline
models. In summary, our contributions are:
• We introduced a new framework, SEQ, for multi-task learning with an adapted RNN architecture,
specifically designed to handle tasks with sequential order. SEQ not only extracts and utilizes the
sequential relations between tasks and reduces redundant computation among related tasks, but also
demonstrates excellent transferability when adding new tasks. By decomposing a complex task
into simpler, sequential tasks, SEQ effectively enhances the multi-task learning process.</p>
      <p>[Figure 2: (a) search “birthday” on CA and GB sites; (b) feature distribution shift between CA and GB (listing views per query count; user gift purchase count).]</p>
      <p>• We developed a module MD for learning regional data with different distributions. The MD
module enables the model to capture region-specific features while sharing region-invariant
features, allowing for effective training with a more extensive and diverse dataset.</p>
      <p>• Our in-house data experiments demonstrate improvements with this new framework.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Multi-task learning (MTL) trains models on multiple tasks simultaneously. By sharing information
across tasks, the model can learn more robust features, leading to improved performance on each
individual task. MTL can be categorized into two types: hard parameter sharing and soft parameter
sharing. Hard parameter sharing involves an architecture where certain layers are shared among all
tasks in the base model, while other layers remain specific to individual tasks in separate task "towers."
The "Shared-bottom" approach [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is one of the most popular methods within this category. Soft
parameter sharing uses trainable parameters to combine each layer’s outputs through linear combinations.
This approach often incorporates the concepts of experts and gates, which are multi-layer perceptrons
(MLPs) in the architecture design. Experts are responsible for learning with specific attention from
the features, while gates determine how to combine these attentions. Various methods differ based
on whether the experts and gates are shared among tasks or specific to individual tasks, as shown in
Fig. 1-(a). For example, MMoE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] shares all experts and gates parameters among the tasks; PLE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] includes
both task-specific and shared experts and gates; Adatt-sp [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] has task-specific experts, but all gates
are shared among tasks. Soft parameter sharing heavily relies on experts and gates for knowledge
sharing between multiple tasks. However, many related works often overlook the potential to utilize
relationships between tasks in MTL. For tasks with a sequential order, Recurrent Neural Networks
(RNNs) offer another, less explored, method to promote knowledge sharing.</p>
      <p>Sequence learning in e-commerce has been explored to model user behavior patterns [15, 20, 17, 18].
For instance, DPN [21] retrieves target-related user behavior patterns using a target-aware attention
mechanism, where user behaviors are represented by their shopping history—a sequence of purchased
listings. Similarly, Hidasi et al. [22] demonstrate the impressive performance of RNNs over classical
methods in session-based recommendations. GRU4Rec [23] takes the listing from the current event in
the session and outputs a set of scores indicating the likelihood of each listing being the next in the
session. However, these related works primarily focus on learning from listing interactions. To the
best of our knowledge, our work is the first to treat tasks themselves as a sequence in the context of
e-commerce.
      </p>
      <p>Multi-distribution learning trains models using data from various sources, each with distinct feature
distributions. Multi-regional data is an example of multi-distribution input, with prior work largely
focusing on language-agnostic approaches to create a unified, unbiased embedding space [24] or on
learning consistent similarities across different markets [25, 26]. In contrast, our approach utilizes
regionally distinct signals to enhance model diversification. Bonab et al. [27] propose learning in an
MTL setting where each market is treated as a task. However, this approach faces challenges when
market data is imbalanced, especially for smaller markets with limited data. Model-agnostic meta-learning
(MAML) [28] tackles this through a dual-loop training process: an inner loop optimizes each
market individually, while an outer loop optimizes across markets, but the need for separate parameter
fine-tuning for market adaptation makes MAML inefficient in this context. More recently, Market-Aware
(MA) models [29] have used market-specific embeddings to create market-adapted item embeddings.
Our MD module is similar to MA, yet we observed that not all features are region-specific [30], making
it more effective to distinguish between shared and region-specific features.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>In this section, we introduce our SEQ+MD framework, which includes two model components: a
multi-task learning architecture SEQ and a multi-distribution learning module MD. We provide formal
definitions for the problem followed by detailed explanations for our framework in the subsections.</p>
      <sec id="sec-3-1">
        <title>3.1. Problem Definition</title>
        <p>Consider an online shopping dataset that records users’ queries and interactions (e.g., click, purchase)
with the returned listings. Let D = {(x_i, y_i)}_{i=1}^{N} be the dataset with N samples, where x_i = (u_i, v_i):
u_i refers to the d_u-dimensional features about the user and query, v_i refers to the d_v-dimensional
features about the target listing, and y_i = {s_k}_{k=1}^{K} is the score set for K tasks. The score for each task is
calculated based on the user interaction sequences. A complete sequence would be ["click", "add to cart",
"purchase"]. The last action in this sequence represents the final step. For example, if the sequence
is ["click", "add to cart"], it means the user clicked on the listing and added it to the cart but did not
purchase it. If none of these actions occurred, the sequence is ["no interaction"]. We assign specific
scores to each action ("no interaction", "click", "add to cart", "purchase"), and the final task score is a
combination of these action scores.</p>
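As a concrete illustration of the labeling scheme above, the following sketch maps an interaction sequence to per-task labels and a combined score. The numeric per-action score values are hypothetical placeholders; the paper does not disclose the actual values used.

```python
# Hypothetical per-action scores; the paper does not disclose the actual values.
ACTION_SCORES = {"no interaction": 0.0, "click": 1.0, "add to cart": 2.0, "purchase": 4.0}

def task_scores(sequence):
    """Map a user interaction sequence to per-task labels and a combined score.

    Each task (click, add-to-cart, purchase) is labeled by whether the sequence
    reached that action; the combined score sums the scores of observed actions.
    """
    tasks = ["click", "add to cart", "purchase"]
    labels = {t: 1.0 if t in sequence else 0.0 for t in tasks}
    combined = sum(ACTION_SCORES[a] for a in sequence if a in ACTION_SCORES)
    return labels, combined

labels, score = task_scores(["click", "add to cart"])
```

A sequence ending at "add to cart" thus yields positive labels for click and add-to-cart but not purchase.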
        <p>The multi-task learning architecture SEQ focuses on making predictions for the K tasks simultaneously
given a single input x. Meanwhile, the multi-distribution learning module MD is designed for unified
learning across the entire input set {x_i}_{i=1}^{N}, where the distribution of x for certain regions shows
significant differences compared to other regions. (See Fig. 2-(b) for examples.) The multi-task learning
architecture and multi-distribution learning module can be applied separately. We combine these two
parts in our final framework and Fig. 3 shows the overall structure.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Learning Multi-Task as A SEQuence</title>
        <p>Some tasks naturally form a sequence, e.g., click, add to cart, purchase, where each action occurs in a
sequential order, conditional on the previous ones. However, most multi-task learning architectures
do not account for the sequential nature of the problem, making the output tasks order-agnostic and
interchangeable.</p>
        <p>Introducing "order" into multi-task learning offers several benefits. First, sequential ordering allows
the model to prioritize more complex tasks later in the sequence. In e-commerce, those later tasks
(e.g. purchase) are often more critical than earlier (e.g. click) tasks because of their higher monetization
values. At the same time, the data sparsity of the purchase task makes it more difficult to optimize. By
establishing a sequence, knowledge from earlier (and typically easier) tasks can be used to address later
(and often harder) tasks. Second, sequential ordering facilitates the transfer or addition of new
tasks. Since the model learns tasks in a "continuous" manner, adding new tasks in the sequence requires
minimal training cost. Journey Ranker [31] recognized the importance of task order by having each task
model predict the conditional probability based on the previous task. However, the MLP components in
their model are isolated, not fully utilizing the knowledge exchange of the sequential tasks.</p>
        <p>To address this, we connect RNNs [19] with multi-sequential-task learning. In an RNN [19], the prediction
of later tokens is based on previous tokens; similarly, our predictions for later user actions are conditioned
on previous actions. In an RNN [19], each token position shares the same set of weights (e.g., W_hh, W_xh,
and W_hs in Eq. 2, 3), with the only difference being the input token and the hidden input from previous
tokens. In our approach, as shown in Eq. 1, we process the single input feature through an MLP for each
token, transforming the input feature specifically for each task (see Fig. 3-(a)). The hidden input can be
seen as the knowledge passed down from previous actions. As shown in Eq. 2, the knowledge for the
current task k (h_k) comes from both the input for task k (MLP_k(x)) and the knowledge from the previous
task k − 1 (h_{k-1}). The score for task k (s_k) depends on the knowledge (h_k). A Gated Recurrent Unit (GRU) [32]
is applied in our SEQ architecture.</p>
        <p>[x_1, ..., x_K] = [MLP_1(x), ..., MLP_K(x)]   (1)
h_k = tanh(W_hh h_{k-1} + W_xh MLP_k(x))   (2)
s_k = W_hs h_k   (3)</p>
        <p>Fig. 3 shows our sequential task learning together with the MD module. Given a single input feature,
the first step is passing it through K − 1 MLPs to create a length-K sequence, where K is the number of
tasks. After passing through multiple layers of RNN, the output scores are in sequence form, with each
score token corresponding to a task.</p>
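The per-task input transformation (Eq. 1) and the shared recurrent scoring (Eq. 2-3) can be sketched in a few lines of numpy. A plain tanh RNN cell stands in for the GRU used in the actual architecture, and the dimensions and random weights are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, K = 8, 16, 3  # input dim, hidden dim, number of tasks (illustrative)

# Eq. 1: one task-specific input transformation per task (a single linear+tanh layer here).
task_mlps = [rng.normal(scale=0.1, size=(d_in, d_in)) for _ in range(K)]
# Shared recurrent weights, reused at every position in the task sequence (Eq. 2-3).
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hs = rng.normal(scale=0.1, size=d_h)

def seq_forward(x):
    """Produce one logit per task in sequence; h carries knowledge between tasks."""
    h = np.zeros(d_h)
    logits = []
    for k in range(K):
        x_k = np.tanh(x @ task_mlps[k])       # task-specific input (Eq. 1)
        h = np.tanh(h @ W_hh + x_k @ W_xh)    # shared recurrent update (Eq. 2)
        logits.append(float(h @ W_hs))        # score for task k (Eq. 3)
    return logits

logits = seq_forward(rng.normal(size=d_in))
```

Because W_hh, W_xh, and W_hs are shared across positions, appending a task only adds one small task MLP rather than a full task tower.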
        <p>To further strengthen the learning with sequence, we add the Descending Probability
Regularizer [31]. Based on the prior knowledge that the probability of a sequence of actions decreases from
the beginning to the end (i.e., the probability of a user "clicking" the listing is greater than or equal
to the probability of "purchasing"), we add a sigmoid multiplication at the end of the output. Each
output score is activated with a sigmoid function and then multiplied by the previous sigmoid scores.
As shown in Eq. 4, the score for task k, s̃_k, is the product of the sigmoid activations of the logits s_j from
all previous tasks. This ensures that the output probabilities of later actions are always smaller than
those of previous actions, aligning with the prior knowledge.</p>
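Eq. 4 reduces to a cumulative product of sigmoids over the task logits, which enforces non-increasing probabilities by construction (a minimal sketch):

```python
import numpy as np

def descending_probabilities(logits):
    """Descending probability regularizer (Eq. 4): each task probability is the
    product of the sigmoids of its own and all previous logits, so probabilities
    can never increase along the task sequence."""
    sig = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return np.cumprod(sig)

probs = descending_probabilities([2.0, 0.5, -1.0])  # e.g. click, add-to-cart, purchase logits
```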
        <p>s̃_k = ∏_{j=1}^{k} σ(s_j)   (4)</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Learning with Multi-Distribution Input</title>
        <p>Looking at the distribution of each raw input feature, we noticed that certain features exhibit multiple
distributions (e.g. average number of purchases; see examples in Fig.2-(b)). If the goal of training a
machine learning model is to learn the transition from an input distribution to the output distribution,
then this multi-distribution input will pose significant challenges to the model, ultimately leading to a
failure in learning [33].</p>
        <p>Fig. 4 shows the overall structure of the multi-distribution adaptor module. We first break the
input features into three parts: country features (which are the deciding factor of the distribution shift),
dependent features (with distribution shifts across countries), and invariant features (which are
country-agnostic). The feature split is done in a heuristic way: country features are manually selected,
and the dependent features and invariant features are separated with a distribution distance threshold,
i.e., when the average of the distribution distance among all countries is greater than a certain threshold,
the feature is categorized as a dependent feature.</p>
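The heuristic split can be sketched as follows: for each feature, average the pairwise distribution distances between country-level samples and compare against a threshold. The paper does not specify the distance measure, so the 1-D Wasserstein distance used here, along with the threshold value and synthetic data, is an assumption for illustration.

```python
import numpy as np
from itertools import combinations

def w1(a, b):
    """1-D Wasserstein distance for equal-size samples: mean gap between sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def split_features(samples_by_country, threshold=0.5):
    """samples_by_country: {country: array of shape (n, d)} with equal n per country.
    Returns (dependent_idx, invariant_idx) feature-index lists."""
    countries = list(samples_by_country)
    d = samples_by_country[countries[0]].shape[1]
    dependent, invariant = [], []
    for j in range(d):
        dists = [w1(samples_by_country[a][:, j], samples_by_country[b][:, j])
                 for a, b in combinations(countries, 2)]
        (dependent if np.mean(dists) > threshold else invariant).append(j)
    return dependent, invariant

rng = np.random.default_rng(1)
data = {
    "CA": np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1, 500)]),
    "GB": np.column_stack([rng.normal(2, 1, 500), rng.normal(0, 1, 500)]),  # feature 0 shifted
}
dep, inv = split_features(data, threshold=0.5)
```

With the synthetic shift above, feature 0 lands in the dependent group and feature 1 in the invariant group.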
        <p>After splitting the input features, different operations are applied to these three groups of features.
Country features are used to generate country mask weights for the dependent features. Country mask
weights have the same dimension as the dependent features, and elementwise-multiplication is performed
between the mask and dependent features. The multiplied input is fed into an MLP, which transforms
the output into invariant features. These are then concatenated with the invariant features from the
original input, resulting in a transformed input with consistent distributions.</p>
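A minimal sketch of this mask-and-transform step, with single-layer stand-ins for the MLPs and illustrative dimensions (the actual network shapes are not specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_countries, d_dep, d_inv = 10, 6, 4  # illustrative dimensions

W_mask = rng.normal(scale=0.1, size=(n_countries, d_dep))  # country -> mask weights
W_tr = rng.normal(scale=0.1, size=(d_dep, d_dep))          # dependent-feature transform

def md_transform(country_onehot, dependent, invariant):
    """Mask the region-dependent features with a country-conditioned mask,
    transform them toward an invariant space, and concatenate with the
    region-invariant features from the original input."""
    mask = country_onehot @ W_mask       # same dimension as the dependent features
    masked = mask * dependent            # element-wise multiplication
    transformed = np.tanh(masked @ W_tr) # single-layer stand-in for the MLP
    return np.concatenate([transformed, invariant])

country = np.eye(n_countries)[3]
out = md_transform(country, rng.normal(size=d_dep), rng.normal(size=d_inv))
```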
        <p>This multi-distribution adaptor module MD can be easily plugged in for all MTL frameworks. Adding
this module directly after the input and then sending the transformed input to the model is clean
and simple. We also explore other options for combining this adaptor module with our sequential
task learning framework, as shown in Fig. 3. Instead of concatenating the transformed dependent
features with the input feature directly, we can concatenate them with the invariant feature model
output from the previous layers. Block (b) in Fig. 3 shows how the multi-distribution module works in
our sequential learning architecture. Each task has its own country mask. For a single input (country
features, dependent features) transformed with the K task-specific country masks, the output is also a
length-K input sequence. Concatenated with the invariant feature output, the new input features can be
processed with the following sequential learning layers to finally obtain the task scores.</p>
        <p>Algorithm 1 SEQ+MD</p>
        <p>Input: Feature (u, v), heuristic feature selector F_s, network for generating country masks
MLP_countrymask, K networks for each task input transformation {MLP_task_k}, sequential learning
network RNN_seqtask.</p>
        <p>Output: Scores {s_k}_{k=1}^{K} for K tasks.
1: // Separate country features, dependent features, and invariant features from the input
2: (country, dependent, invariant) ← F_s(u, v)
3: // Generate country masks
4: m_k ← MLP_countrymask(country), k = 1, ..., K
5: // Transformed dependent features
6: d_k ← MLP_task_k(m_k ⊙ dependent)   ◁ ⊙ denotes element-wise multiplication
7: // Sequential task scoring with the descending probability regularizer
8: {s_k}_{k=1}^{K} ← RNN_seqtask([concat(d_k, invariant)]_{k=1}^{K}), s̃_k = ∏_{j=1}^{k} σ(s_j)   ◁ σ denotes the sigmoid function</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
To evaluate our methods, we conducted experiments on our offline in-house datasets. Four baseline
methods were selected for comparison. The Shared-Bottom model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is used as the baseline for all
other models, as it represents the most fundamental architecture in multi-task learning (MTL). Results
are reported as changes relative to the Shared-Bottom model, with its performance marked as
the 0% reference point. The other methods implemented for reference are MLMMOE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], PLE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
and Adatt [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Details of the baselines are described in Sec. 4.1.
      </p>
      <p>We used 14 days of offline in-house data for training and three days of data for evaluation, and we
report the relative increase in the average Normalized Discounted Cumulative Gain (NDCG) [34] in
the result tables (see Sec. 4.2 for more details). Due to the varying nature of different traffic sources, the
results are divided into two sections: Webpage search traffic (Web), and Mobile App search traffic (App).
We track multi-tasks across all traffic.</p>
      <p>The results focus on two main areas: the effectiveness of the sequential learning architecture for
MTL and the "plug-in" multi-distribution learning module for SOTA MTL methods. Ablation studies
and alternative designs are discussed in Sec. 5.</p>
      <sec id="sec-4-1">
        <title>4.1. Baseline Models</title>
        <p>We select a few state-of-the-art multi-task learning methods without any multi-distribution adjustments
as the baselines. For multi-distribution learning challenges, most related work [35, 36] focuses on
learning invariant features, whereas our goal is to better capture regional preferences. Thus, we
use training with single or multi-distribution data as the baselines for multi-distribution learning
comparisons.</p>
        <p>
          Shared-bottom [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is a hard parameter sharing method in MTL. It consists of a shared bottom layer for
all tasks, followed by separate "tower" layers for each task, which extend from the shared-bottom output.
Both the "bottom" and the "towers" are MLPs, with no knowledge sharing beyond the shared-bottom.
MLMMOE [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a soft parameter sharing method in MTL. It features experts and gates, which are MLPs
within the architecture. "ML" refers to multiple layers; except for the top task-specific gates, all other
experts and gates are shared among tasks.
        </p>
        <p>
          PLE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is another soft parameter sharing method in MTL. It includes two types of experts and gates:
task-specific and task-shared. Task-specific experts learn only for their individual tasks, and task-specific
gates accept input exclusively from the same task expert or the shared expert.
        </p>
        <p>
          Adatt-sp [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is also a soft parameter sharing method in MTL. All experts are task-specific, while all
gates take outputs from all experts as their input.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Datasets and Metrics</title>
        <p>We exclusively use our in-house data for experiments because public search datasets [37] often omit
feature details for data security reasons. This omission makes it difficult to isolate country features and
generate accurate country mask weights. Our offline in-house dataset contains over 20 million &lt;user,
query, listing&gt; interaction sequences from 10 regions and 2 platforms. Unless otherwise specified, we
train the models with data from all regions and platforms. Results are evaluated separately for each
platform. Normalized Discounted Cumulative Gain (NDCG) [34] is our evaluation metric, commonly
used for measuring the effectiveness of search engines by summing the gain of the results, discounted
by their ranked positions. The rankings of the search listings are ordered by the output scores from
the model, and NDCG is calculated based on the user interaction sequences. As discussed in Sec. 3.2,
e-commerce prioritizes the purchase task over click, making purchase-NDCG our prioritized metric for
model evaluation.</p>
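NDCG can be sketched as the DCG of the model-induced ranking divided by the DCG of the ideal (gain-sorted) ranking. The gain values below are illustrative stand-ins for the action-derived scores of Sec. 3.1:

```python
import numpy as np

def dcg(gains):
    """Discounted cumulative gain: each position's gain divided by log2(rank + 1)."""
    gains = np.asarray(gains, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    return float(np.sum(gains * discounts))

def ndcg(scores, gains):
    """Rank listings by model score, then compare the realized DCG
    to the DCG of the ideal (gain-sorted) ordering."""
    order = np.argsort(scores)[::-1]
    ideal = dcg(np.sort(gains)[::-1])
    return dcg(np.asarray(gains)[order]) / ideal if ideal > 0 else 0.0

# Gains derived from interaction sequences, e.g. purchase > add-to-cart > click.
val = ndcg(scores=[0.9, 0.2, 0.5], gains=[3.0, 0.0, 1.0])
```

A model that orders listings exactly by their gains achieves NDCG of 1.0; any inversion reduces it.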
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>SEQ. Table 1 presents the multi-task learning performance on click and purchase tasks across different
platforms. State-of-the-art MTL baseline methods demonstrate various improvements in the purchase
task but show a slight decline in the click task. In contrast, our SEQ model shows improvement across
all tasks, and adding the MD module (SEQ+MD) achieves the best NDCG on the critical purchase task. We
observed a performance drop in the click task after adding the MD module to SEQ, making the final
click performance only slightly positive compared to the shared-bottom baseline. This may be due
to the click data being noisier and having higher variance. Another possible explanation is that the
region-dependent features isolated by the MD module are more closely related to user/listing purchase
history, which may have a greater impact on the purchase task.</p>
        <p>
          MD: Multi-Distribution Learning Module. Table 2 illustrates the effectiveness of our
multi-distribution learning module as a "plug-in" component for state-of-the-art MTL methods. The adapted
models demonstrate overall improvements, with PLE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]+MD achieving the best performance for the
purchase task across all platforms. These results validate that our MD module can significantly enhance
MTL performance.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussions</title>
      <sec id="sec-5-1">
        <title>5.1. Will the sequential learning model benefit from more tasks?</title>
        <p>A significant advantage of learning multi-task sequences lies in the inherent properties of RNNs, where
weights are shared across all tokens in the sequence. This has two main benefits. First, it reduces
redundant calculations among related tasks. For instance, tasks like click and purchase share many
commonalities in the buyer’s decision process, i.e. a listing that a user clicks on is also likely to be
purchased. Second, by reinforcing the connections between tasks, later tasks in the sequence can
be learned more effectively by decomposing them and beginning with easier tasks. As the sequence
progresses, task difficulty can be seen as increasing, with earlier tasks acting as preprocessors for the
later ones. This recurrent learning process, from easier to harder tasks, is advantageous. For example,
predicting which listing is likely to be purchased is challenging, but if the model starts by learning click
behavior, it can learn better. We hypothesize that the sequential learning model will benefit from more
tasks. In our experiment, we add an add to cart task between the click and purchase sequence to better
reflect the buyer’s shopping journey. The results in Table 3 support this hypothesis.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Transferability from two-task to three-task</title>
        <p>An important consideration for multi-task models is how easily they can adapt to additional tasks, and
here the SEQ+MD model demonstrates a significant advantage. Adding new tasks requires almost no increase in
parameters, compared to state-of-the-art models, which increase parameter size by 30% on average.
Moreover, reusing weights trained on previous tasks can also lead to improved performance in new
task evaluations. Figure 5 illustrates the performance comparison of evaluating a three-task setup
using weights from a two-task model. The RNN in SEQ+MD uses consistent weights across sequence
positions, allowing a new task to be added by simply appending a token to the input sequence. This
setup enables predictions for the new task without fine-tuning or additional data. In our three-task
evaluation, we averaged the MLP weights from the click and purchase tasks to initialize the MLP weights
for the add to cart task. After transforming the inputs separately with three MLPs as a sequence, we
applied the RNN using weights trained on only two tasks. Notably, without exposure to add to cart
data during training, the model still outperforms the baseline trained on three tasks in both click and
purchase tasks. These results support our hypothesis that utilizing the sequential order of tasks can
improve multi-task learning effectiveness.</p>
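<p>The warm start described above can be sketched as follows (toy arrays standing in for the trained click and purchase MLP weights): the new add to cart head is initialized as their element-wise average, while the weight-tied RNN is reused unchanged:

```python
import numpy as np

# Hypothetical trained MLP weights for the click and purchase heads.
W_click = np.ones((4, 2))
W_purchase = 3 * np.ones((4, 2))

# Initialize the new add-to-cart head as the average of its neighbours
# in the task sequence; the shared RNN weights need no change at all.
W_cart = 0.5 * (W_click + W_purchase)
print(W_cart[0])  # each entry is the element-wise mean, here 2.0
```

With this initialization, the three transformed inputs can be fed to the two-task RNN as a length-three sequence with no fine-tuning.</p>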
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Ablation studies</title>
        <p>Learning multi-task as a sequence not only enhances knowledge sharing among tasks but also simplifies
the integration of output regularization. In our SEQ design, we incorporate a descending-probability
regularizer that constrains the model to output task scores in non-increasing order. This regularization is
based on the observation that the probability of a user purchasing a listing cannot exceed the probability
of them clicking on it, as a click typically precedes a purchase. The results in Fig. 6 demonstrate the
effectiveness of this regularizer.</p>
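<p>One simple way to express such a non-increasing constraint is a hinge on consecutive score differences (a generic formulation for illustration; the exact regularizer used in SEQ may differ):

```python
import numpy as np

def descending_penalty(scores):
    """Penalize any later-task score that exceeds the one before it.

    scores: per-task probabilities in funnel order, e.g. [click, cart, purchase].
    Only positive consecutive differences (order violations) contribute.
    """
    s = np.asarray(scores)
    diffs = s[1:] - s[:-1]  # positive exactly where the ordering is violated
    return float(np.maximum(diffs, 0.0).sum())

print(descending_penalty([0.9, 0.4, 0.1]))  # properly ordered: penalty 0.0
print(descending_penalty([0.2, 0.5, 0.1]))  # cart score above click: positive penalty
```

Added to the task losses, this term pushes the model toward outputs consistent with the click-before-purchase funnel.</p>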
      </sec>
      <sec id="sec-5-4">
        <title>5.4. How effective is the MD module compared to models trained on single-region data?</title>
        <p>
          Our SEQ+MD model demonstrates a superior ability to align with regional preferences compared
to other baselines. Figure 7 illustrates the changes in the percentage of domestic listings relative to
the shared-bottom [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] model baseline (all models are trained on all-regional data). Our in-house
analysis shows distinct regional preferences in CA and GB, where CA buyers tend to favor international
listings, while GB buyers lean towards domestic options. However, Fig. 7 shows that PLE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] consistently
returns more domestic listings, while AdaTT [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] consistently returns fewer, regardless of these regional
preferences. In contrast, our SEQ+MD model effectively captures these regional trends, providing more
accurate rankings that better align with the buyers’ preferences.</p>
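<p>The region-conditioning idea behind MD can be sketched generically as a per-region feature gate (illustrative only; the names, sizes, and exact mask design here are assumptions, not the SEQ+MD implementation):

```python
import numpy as np

D = 6  # illustrative feature dimension
regions = ["US", "CA", "GB"]

# One learnable mask vector per region (fixed random values here for
# illustration); a sigmoid keeps each gate strictly between 0 and 1.
rng = np.random.default_rng(1)
mask_logits = {r: rng.normal(size=D) for r in regions}

def masked_features(x, region):
    """Gate shared features with the region's mask before the shared tower."""
    gate = 1.0 / (1.0 + np.exp(-mask_logits[region]))
    return x * gate

x = np.ones(D)  # a shared feature vector
print(masked_features(x, "CA"))  # CA-specific re-weighting of the same features
```

Because only the small mask vectors differ per region, a single model can serve all markets while still bending feature importance toward each region's preferences.</p>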
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we introduce the SEQ+MD framework, which integrates sequential learning for
multitask problems with multi-distribution data. While SEQ and MD can be applied independently, their
combination yields stronger results, particularly on complex tasks. The motivation behind learning
multi-task as a sequence stems from the natural sequential order of tasks. Our experiments and analyses
highlight two primary benefits. First, SEQ reduces redundant computation across tasks and enhances
transferability between different task sets, requiring minimal additional parameters while effectively
utilizing weights from previous models. Second, by breaking down a complex task into simpler subtasks
that serve as processors in the sequence, the model demonstrates improved performance on more
challenging tasks. Additionally, our MD module effectively handles multi-distribution data and can also
enhance the performance of state-of-the-art multi-task learning models.</p>
      <p>Future work. 1. Improve robustness against noisy data. Even though the primary goal of
our approach is to improve performance on complex tasks such as add to cart and purchase, we see
opportunities in making SEQ+MD have a neutral impact on click compared to SEQ only. One hypothesis
is that click data tends to be noisier than other tasks, with a significant amount of "false clicks" present,
particularly on mobile platforms. For example, users may accidentally click on a listing due to the touch
screen’s sensitivity. Learning with task-specific noise within a multi-task learning framework could be
a valuable direction for future research. 2. Generalize multi-distribution data from region-wise to
other scenarios. While this paper focuses on regional differences as an example of multi-distribution,
other multi-distribution settings exist in e-commerce search data. For instance, different platforms (web,
app) may show distinct shopping patterns. Extending our MD module to address these scenarios could
be a promising research direction.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 for grammar and spelling checking.
After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
      <p>[15] E. Yuan, W. Guo, Z. He, H. Guo, C. Liu, R. Tang, Multi-behavior sequential transformer
recommender, in: Proceedings of the 45th international ACM SIGIR conference on research and
development in information retrieval, 2022, pp. 1642–1652.
[16] X. Ma, L. Zhao, G. Huang, Z. Wang, Z. Hu, X. Zhu, K. Gai, Entire space multi-task model: An
effective approach for estimating post-click conversion rate, in: The 41st International ACM SIGIR
Conference on Research &amp; Development in Information Retrieval, 2018, pp. 1137–1140.
[17] X. Tao, M. Ha, Q. Ma, H. Cheng, W. Lin, X. Guo, L. Cheng, B. Han, Task aware feature extraction
framework for sequential dependence multi-task learning, in: Proceedings of the 17th ACM
Conference on Recommender Systems, 2023, pp. 151–160.
[18] X. Tang, Y. Qiao, F. Lyu, D. Liu, X. He, Touch the core: Exploring task dependence among hybrid
targets for recommendation, in: Proceedings of the 18th ACM Conference on Recommender
Systems, 2024, pp. 329–339.
[19] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio,
Learning phrase representations using rnn encoder-decoder for statistical machine translation,
arXiv preprint arXiv:1406.1078 (2014).
[20] D. Xi, Z. Chen, P. Yan, Y. Zhang, Y. Zhu, F. Zhuang, Y. Chen, Modeling the sequential dependence
among audience multi-step conversions with multi-task learning in targeted display advertising,
in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining,
2021, pp. 3745–3755.
[21] H. Zhang, J. Pan, D. Liu, J. Jiang, X. Li, Deep pattern network for click-through rate prediction, in:
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2024, pp. 1189–1199.
[22] B. Hidasi, A. Karatzoglou, Recurrent neural networks with top-k gains for session-based
recommendations, in: Proceedings of the 27th ACM international conference on information and
knowledge management, 2018, pp. 843–852.
[23] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent
neural networks, arXiv preprint arXiv:1511.06939 (2015).
[24] A. Ahuja, N. Rao, S. Katariya, K. Subbian, C. K. Reddy, Language-agnostic representation learning
for product search on e-commerce platforms, in: Proceedings of the 13th International Conference
on Web Search and Data Mining, 2020, pp. 7–15.
[25] J. Cao, X. Cong, T. Liu, B. Wang, Item similarity mining for multi-market recommendation, in:
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2022, pp. 2249–2254.
[26] X. Li, Z. Qiu, J. Jiang, Y. Zhang, C. Xing, X. Wu, Conditional cross-platform user engagement
prediction, ACM Transactions on Information Systems 42 (2023) 1–28.
[27] H. Bonab, M. Aliannejadi, A. Vardasbi, E. Kanoulas, J. Allan, Cross-market product
recommendation, in: Proceedings of the 30th ACM International Conference on Information &amp; Knowledge
Management, 2021, pp. 110–119.
[28] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks,
in: International conference on machine learning, PMLR, 2017, pp. 1126–1135.
[29] S. Bhargav, M. Aliannejadi, E. Kanoulas, Market-aware models for efficient cross-market
recommendation, in: European Conference on Information Retrieval, Springer, 2023, pp. 134–149.
[30] J. Cao, S. Li, B. Yu, X. Guo, T. Liu, B. Wang, Towards universal cross-domain recommendation,
in: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining,
2023, pp. 78–86.
[31] C. H. Tan, A. Chan, M. Haldar, J. Tang, X. Liu, M. Abdool, H. Gao, L. He, S. Katariya, Optimizing
airbnb search journey with multi-task learning, in: Proceedings of the 29th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining, 2023, pp. 4872–4881.
[32] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks
on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
[33] B. Peng, The sample complexity of multi-distribution learning, in: The Thirty Seventh Annual
Conference on Learning Theory, PMLR, 2024, pp. 4185–4204.
[34] H. Valizadegan, R. Jin, R. Zhang, J. Mao, Learning to rank by optimizing ndcg measure, Advances
in neural information processing systems 22 (2009).
[35] J. Cha, K. Lee, S. Park, S. Chun, Domain generalization by mutual-information regularization with
pre-trained models, in: European conference on computer vision, Springer, 2022, pp. 440–457.
[36] J. Cha, S. Chun, K. Lee, H.-C. Cho, S. Park, Y. Lee, S. Park, Swad: Domain generalization by seeking
flat minima, Advances in Neural Information Processing Systems 34 (2021) 22405–22418.
[37] P. Li, R. Li, Q. Da, A.-X. Zeng, L. Zhang, Improving multi-scenario learning to rank in e-commerce
by exploiting task relationships in the label space, in: Proceedings of the 29th ACM International
Conference on Information &amp; Knowledge Management, 2020, pp. 2605–2612.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Lari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vaishnava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Manu</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence in e-commerce: Applications, implications and challenges</article-title>
          ,
          <source>Asian Journal of Management</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>235</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sarwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <article-title>Analysis of recommendation algorithms for e-commerce</article-title>
          ,
          <source>in: Proceedings of the 2nd ACM Conference on Electronic Commerce</source>
          ,
          <year>2000</year>
          , pp.
          <fpage>158</fpage>
          -
          <lpage>167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>De Mauro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sestino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacconi</surname>
          </string-name>
          ,
          <article-title>Machine learning and artificial intelligence use in marketing: a general taxonomy</article-title>
          ,
          <source>Italian Journal of Marketing</source>
          <year>2022</year>
          (
          <year>2022</year>
          )
          <fpage>439</fpage>
          -
          <lpage>457</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yoganarasimhan</surname>
          </string-name>
          ,
          <article-title>Search personalization using machine learning</article-title>
          ,
          <source>Management Science</source>
          <volume>66</volume>
          (
          <year>2020</year>
          )
          <fpage>1045</fpage>
          -
          <lpage>1070</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <article-title>Modeling task relationships in multi-task learning with multi-gate mixture-of-experts</article-title>
          ,
          <source>in: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery &amp; data mining</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1930</fpage>
          -
          <lpage>1939</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <article-title>Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations</article-title>
          ,
          <source>in: Proceedings of the 14th ACM Conference on Recommender Systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Adatt: Adaptive task-to-task fusion network for multitask learning in recommendations</article-title>
          ,
          <source>in: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>4370</fpage>
          -
          <lpage>4379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bellur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sundar</surname>
          </string-name>
          ,
          <article-title>Clicking, assessing, immersing, and sharing: An empirical model of user engagement with interactive media</article-title>
          ,
          <source>Communication Research</source>
          <volume>45</volume>
          (
          <year>2018</year>
          )
          <fpage>737</fpage>
          -
          <lpage>763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>On better exploring and exploiting task relationships in multitask learning: Joint model and feature learning</article-title>
          ,
          <source>IEEE transactions on neural networks and learning systems 29</source>
          (
          <year>2017</year>
          )
          <fpage>1975</fpage>
          -
          <lpage>1985</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>An overview of multi-task learning</article-title>
          ,
          <source>National Science Review</source>
          <volume>5</volume>
          (
          <year>2018</year>
          )
          <fpage>30</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A survey on multi-task learning</article-title>
          ,
          <source>IEEE transactions on knowledge and data engineering 34</source>
          (
          <year>2021</year>
          )
          <fpage>5586</fpage>
          -
          <lpage>5609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <article-title>Multitask learning</article-title>
          ,
          <source>Machine learning 28</source>
          (
          <year>1997</year>
          )
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hebert</surname>
          </string-name>
          ,
          <article-title>Cross-stitch networks for multi-task learning</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>3994</fpage>
          -
          <lpage>4003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bingel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          ,
          <article-title>Latent multi-task architecture learning</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>4822</fpage>
          -
          <lpage>4829</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>