<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Seeds for Data Splitting on Recommendation Accuracy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukas Wegmeth</string-name>
          <email>lukas.wegmeth@uni-siegen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Vente</string-name>
          <email>tobias.vente@uni-siegen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lennart Purucker</string-name>
          <email>contact@lennart-purucker.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joeran Beel</string-name>
          <email>joeran.beel@uni-siegen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Group, University of Siegen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>The evaluation of recommender system algorithms depends on randomness, e.g., when randomly splitting data into training and testing data. We suspect that failing to account for randomness in this scenario may lead to misrepresenting the predictive accuracy of recommendation algorithms. To understand the community's view of the importance of randomness, we conducted a paper study on 39 full papers published at the ACM RecSys 2022 conference. We found that the authors of 26 papers used some variation of a holdout split that requires a random seed. However, only five papers explicitly repeated experiments and averaged their results over different random seeds. This potentially problematic research practice motivated us to analyze the effect of data split random seeds on recommendation accuracy. Therefore, we train three common algorithms on nine public data sets with 20 data split random seeds, evaluate them on two ranking metrics with three different ranking cutoff values k, and compare the results. In the extreme case with k = 1, we show that depending on the data split random seed, the accuracy with traditional recommendation algorithms deviates by up to ∼6.3% from the mean accuracy achieved on the data set. Hence, we show that an algorithm may significantly over- or under-perform when the random seed for splitting the data is selected maliciously or negligently. To showcase a mitigation strategy and better research practice, we compare holdout to cross-validation and show that, again for k = 1, the accuracy of algorithms evaluated with cross-validation deviates only up to ∼2.3% from the mean accuracy achieved on the data set. Furthermore, we found that the deviation becomes smaller the higher the value of k for both holdout and cross-validation.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>random seed</kwd>
        <kwd>holdout</kwd>
        <kwd>validation</kwd>
        <kwd>cross-validation</kwd>
        <kwd>ranking</kwd>
        <kwd>reproducibility</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Finding the best algorithm for a recommendation task is challenging, as the experimental setup
requires many difficult design choices [1]. On a high level, these design choices include or
are affected by: the preprocessing of input data, the set of available algorithms and
hyperparameters, the evaluation metrics, the testing environment, and constraints on the experiment.
Understanding the effects of every choice in each step of the experimental pipeline is crucial
in obtaining a result that accurately represents the pipeline’s capabilities. Especially in
state-of-the-art papers, the evaluation procedure is often complex and lengthy, and sometimes the
performance increase over competing solutions is small [2, 3, 4]. Notably, many components
in a recommender systems evaluation pipeline require randomness, e.g., to split the data or
initialize random parameters for an algorithm.</p>
      <p>Any function that generates random output needs a random seed, e.g., an arbitrarily chosen
integer, for initialization purposes. For instance, in recommender systems, when data should be
randomly split, the sampling function requires a random seed to produce a random selection of
elements. A random generator, e.g., the aforementioned sampling function, always produces
the same chain of outputs when initialized with the same random seed. When the random seed
is explicitly set, repeated executions always run with the same randomization, e.g., the same
data split from the same input data. This is generally desirable and ensures the reproducibility
and repeatability of recommender systems research. However, the downside is that repeated
experiments with the same random seed remove randomness from the evaluation, ignoring
its potentially severe effects. We illustrate this with a recommender systems example in the
following paragraph.</p>
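      <p>The behavior described above can be illustrated with a minimal sketch that uses only the Python standard library. The function name holdout_split, the toy interactions, and the seed values 42 and 43 are illustrative assumptions and are not taken from the paper's code.</p>

```python
import random


def holdout_split(interactions, test_fraction, seed):
    """Randomly split a list of interactions into (train, test) for one seed."""
    rng = random.Random(seed)  # generator initialized with the data split random seed
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy (user, item) interactions; purely illustrative.
interactions = [(user, item) for user in range(50) for item in range(4)]

train_a, test_a = holdout_split(interactions, 0.2, seed=42)
train_b, test_b = holdout_split(interactions, 0.2, seed=42)
train_c, test_c = holdout_split(interactions, 0.2, seed=43)

print(test_a == test_b)  # same seed: the split is reproduced exactly
print(test_a == test_c)  # different seed: a different split in practice
```

      <p>Fixing the seed makes the split repeatable, but it also means that every rerun evaluates on the same single draw of the test set.</p>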
      <p>When recommender systems are evaluated with a randomized holdout split, the data is
split with one specific random seed. If experiments are not repeated with different holdout
splits, i.e., different random seeds, the evaluation of the recommender systems algorithm is
not protected against the impact of the randomness of data on the evaluated recommendation
performance. To illustrate, each data split, generated through a specific random seed, may
differ in its data distribution, i.e., in the interactions that make up the training and the testing
set. Specifically in recommender systems, some users or items may be evaluated with exceptionally
low recommendation accuracy, while others are evaluated with exceptionally high accuracy. Ideally, the data should be split
in a way that results in an average over these outliers, but this cannot be guaranteed. In the
worst case, the evaluated accuracy of an algorithm on one specific split could be an outlier that
significantly changes the judgment of a specific parameter setup or algorithm. This can also
be abused to obtain unnaturally good results that are not achievable on average. Furthermore,
replicating the exact result is unlikely if the random seed used in an experiment is unknown. We
assume that the effect of the randomness of data on the evaluated performance of recommender
systems algorithms is significant enough to warrant methods that mitigate the shortcomings of
a single holdout split.</p>
      <p>
        Paper Study To motivate a thorough analysis of our suspicion, we conducted a brief ad-hoc
study of the 39 full papers that were published at the ACM RecSys Conference 2022
(https://recsys.acm.org/recsys22/accepted-contributions/). We analyzed
(1) which type of validation procedure was used in the experiments, (2) whether the code and
random seeds used for the experiments are public, and (3) whether the authors acknowledged
randomness and repeated experiments that are influenced by randomness. Regarding (1),
we found that 26 papers (66.7%) use some form of holdout split to validate their results. Of
the remaining papers, 10 (25.6%) used time-based leave-one-out splits that do not require
randomness, and the remaining three (7.7%) did not perform and evaluate any experiments.
Regarding (2), we found that 20 papers (51.3%) contain links to code to reproduce the results
found in the papers. Of those 20 papers, 11 (28.2% of all 39 papers) allow for the configuration of random
seeds and provide default values, six (15.4%) contain static random seeds, and three (7.7%) do not
specify a random seed at all. Reproducing the results of papers without the original code is
difficult and time-consuming. The task is even more challenging since no authors explicitly
state which random seeds were used. Even in the papers that contain links to code, it
is unclear which random seed was used to obtain the results presented in the paper. Regarding
(3), we found that only five (12.8%) papers acknowledge and deal with the effects of randomness
during data-splitting on the recommendation accuracy. The authors of these papers found
that repeating their experiments three, five, or ten times with different random seeds was an
appropriate solution.
      </p>
      <p>Research Question We find the current practices surrounding random seeds, especially
during the data-splitting phase, worrying. Given that so many papers apply holdout splits but
do not repeat their experiments, we seek to answer the following research question: How
significant is the impact of random seeds used for splitting the data on recommendation
accuracy?</p>
      <p>We answer the research question by evaluating three different recommendation algorithms
on nine different data sets with 20 different random seeds. We provide results for the nDCG@k
and Precision@k metrics with k ∈ {1, 5, 10}. Furthermore, we propose cross-validation, a standard
practice in machine learning, as a mitigation strategy, and we showcase how it compares to the
results obtained with a holdout split. Our contribution quantifies the effect of random seeds
used during the data-splitting phase on recommendation accuracy. With this analysis, we aim
to increase the awareness of the recommender systems community of the importance of random
seeds for evaluating recommender systems. Furthermore, the code is open-source
(https://code.isg.beel.org/random-seed-effects/), and our experiments are reproducible.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In machine learning and deep learning, the effects of random seeds are well-explored. The
holdout split is a traditional method to validate machine learning models. Early works suggest
that repeated holdout, with different random seeds, is necessary to correctly estimate the error
of a given configuration [5]. Furthermore, in machine learning literature, statistical tests over
multiple algorithms and data sets are only reliable when experiments are repeated sufficiently
often [6]. In another article, the authors describe how static random seeds can impact the
validation procedure [7]. Specifically in deep learning, multiple authors report how random
seeds in model initialization affect performance [8, 9]. There is further evidence that this also extends
to the data split [10]. Moreover, some works show the strong effects of random seeds on the
stability of deep learning models [11, 12]. This effect is specifically exploited to improve deep
learning ensemble performance [13, 14]. Apart from misrepresented performance, the reported
statistical significance in comparing two algorithms may also be misleading due to poor random
seeding practices [15]. Due to its advantages, cross-validation has long been the standard in
machine learning [16]. The AutoML community understands the importance of
randomness, and the collaborative AutoML benchmark uses 10-fold cross-validation to mitigate
randomness by default [17, 18].</p>
      <sec id="sec-2-1">
        <title>Footnote 2: https://code.isg.beel.org/random-seed-effects/</title>
        <p>Recommender system evaluation is notoriously complex with many domain-specific problems
but also inherits the general machine learning problems stated above [19, 20]. To our knowledge,
there is no survey about the use of validation strategies in recommender systems research. In our
paper study (see Introduction), we found that holdout validation is still a widespread
data-splitting method. However, only a few authors apply repeated holdout validation, citing the
effects of randomness [21, 22, 23, 24, 25]. Additionally, while acknowledging randomness, none
of the authors cite any reference concerning randomness. We are unaware of any work that
concretely analyzes and quantifies this effect.</p>
        <p>Performing random holdout splits neglects temporal effects. In our paper study (see
Introduction), we found that a considerable number of researchers use time-based splitting techniques.
However, the majority still applies a random holdout split. Recommender systems are inherently
affected by the chronological order and distance of historical interactions, and ignoring this
when splitting the data may lead to temporal leakage [26, 27]. Our work does not focus on
temporal effects.</p>
        <p>Overall, only a few works in the recommender systems literature understand the
importance of cross-validation and use it [28, 29, 30, 31]. Interestingly, many popular recommender
systems libraries also natively include cross-validation [32, 33, 34, 35]. However, in our paper
study (see Introduction), we found that no authors use cross-validation. We are unaware of any
work that analyzes and quantifies how cross-validation mitigates the effects of the randomness
of data compared to holdout, specifically for recommender systems.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>Our experiments showcase the effects of random seeds used during the data-splitting phase
on recommendation accuracy. We denote such a random seed as data split random seed from
here on to distinguish it from usages of random seeds in other evaluation components. We
evaluate our pipeline with common design choices found in the literature such that the effects
we show in our results apply to many research results. Therefore, the analysis focuses on the
top-n ranking prediction task where the input data consists of binary user-item interactions.</p>
      <p>Data Sets &amp; Algorithms We evaluate nine publicly available and commonly used data sets:
Adressa [36], Amazon-CDs&amp;Vinyl [37], Gowalla [38], Hetrec-LastFM [39], MovieLens-1M [40],
Amazon-MusicalInstruments [37], Retailrocket (https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset),
Amazon-VideoGames [37], and Yelp (https://www.yelp.com/dataset). Table 1
shows statistical information on the data sets. We evaluate the data sets on three algorithms
from traditionally used categories: Implicit Matrix Factorization with Alternating Least Squares
(ALS), Item-based k-Nearest Neighbors (ItemKNN), and the baseline recommender Popularity
Recommender (Pop). We use the algorithm implementations from the LensKit library [41].</p>
      <sec id="sec-3-1">
        <title>Footnotes 3 and 4: https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset, https://www.yelp.com/dataset</title>
        <p>Preprocessing For the explicit feedback data sets Amazon and MovieLens, we treat a rating
higher than three as an interaction, following common practice [42, 43, 44]. For all data
sets, we remove duplicates, incomplete entries, and all features other than the user ID and
item ID. Following previous works, we perform 5-core filtering, which means that we prune the
data to ensure that all users and items have at least five interactions [45, 46, 47].</p>
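        <p>The 5-core filtering step can be sketched as follows. This is a minimal illustration that assumes interactions are (user, item) pairs and that pruning is repeated until no user or item falls below the threshold; the function name k_core_filter and the toy data are hypothetical, and the authors' actual implementation may differ.</p>

```python
from collections import Counter


def k_core_filter(interactions, k=5):
    """Repeatedly drop users and items with fewer than k interactions."""
    current = set(interactions)  # a set also removes duplicate pairs
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = {
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        }
        if len(kept) == len(current):  # nothing was pruned: fixed point reached
            return kept
        current = kept


# Toy data: five users who each interacted with the same five items,
# plus one user with only two interactions.
dense = [(user, item) for user in range(5) for item in range(5)]
sparse = [("u_sparse", 0), ("u_sparse", 1)]

filtered = k_core_filter(dense + sparse, k=5)
print(len(filtered))  # 25: only the dense block survives
```

        <p>The loop matters because removing a sparse user can push an item below the threshold, and vice versa, so a single pass does not guarantee the 5-core property.</p>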
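        <table-wrap>
          <label>Table 1</label>
          <caption><p>Statistical information on the data sets.</p></caption>
          <table>
            <thead>
              <tr><th>Data set</th><th>#Interactions</th><th>#Users</th></tr>
            </thead>
            <tbody>
              <tr><td>adressa</td><td>2,020,328</td><td>146,635</td></tr>
              <tr><td>cds-and-vinyl</td><td>1,075,615</td><td>87,712</td></tr>
              <tr><td>gowalla</td><td>2,018,421</td><td>64,115</td></tr>
              <tr><td>hetrec-lastfm</td><td>71,355</td><td>1,859</td></tr>
              <tr><td>movielens-1m</td><td>574,376</td><td>6,034</td></tr>
              <tr><td>musical-instruments</td><td>147,173</td><td>18,177</td></tr>
              <tr><td>retailrocket</td><td>240,938</td><td>22,178</td></tr>
              <tr><td>video-games</td><td>291,985</td><td>33,625</td></tr>
              <tr><td>yelp</td><td>3,999,684</td><td>268,658</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>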
Data Splitting Our primary analysis revolves around the effect of different data split random
seeds on recommendation accuracy. We want to show how different data split random seeds
affect the common practice of holdout splits and compare this to the machine learning standard
of cross-validation. Therefore, we perform 20 holdout splits with different data split random
seeds for each data set. We split the interactions with a ratio of 80% for training and 20% for
testing, following other works [48, 49, 50]. For cross-validation, we use the same data split
random seeds to generate five non-overlapping testing folds for each data set 20 times.</p>
        <p>Training &amp; Evaluation We train a model for each recommendation algorithm per data set
per data split random seed. For each data split random seed, we train one model with the holdout
training data and evaluate it on the holdout testing data. Additionally, for cross-validation, we
train five separate models, one on the training data of each of the five data split folds, evaluate them on
the respective testing data, and average the results. We train all models with default algorithm
hyperparameters as we look for the effects of data split random seeds without the confounding
effects of hyperparameter optimization. We then evaluate the predictions with the nDCG@k
and Precision@k metrics with k ∈ {1, 5, 10}. We choose different values for k to analyze the
effects at various cutoff points.</p>
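        <p>The protocol above (20 seeds, an 80/20 holdout split, and five non-overlapping test folds per seed) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' pipeline: the paper uses LensKit implementations of ALS, ItemKNN, and Pop, whereas the toy popularity scorer pop_precision_at_1, the synthetic data generator make_toy_interactions, and all other function names here are hypothetical stand-ins.</p>

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def holdout_split(interactions, seed, test_fraction=0.2):
    """One random 80/20 split of the interactions for a given seed."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def cross_validation_folds(interactions, seed, n_folds=5):
    """Yield (train, test) pairs whose test folds do not overlap."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    for fold in range(n_folds):
        test = shuffled[fold::n_folds]  # every n_folds-th interaction
        test_set = set(test)
        train = [pair for pair in shuffled if pair not in test_set]
        yield train, test


def pop_precision_at_1(train, test):
    """Precision@1 of a toy popularity recommender (stand-in for a real model)."""
    popularity = Counter(item for _, item in train)
    ranked = [item for item, _ in popularity.most_common()]
    seen, truth = defaultdict(set), defaultdict(set)
    for user, item in train:
        seen[user].add(item)
    for user, item in test:
        truth[user].add(item)
    hits = []
    for user, relevant in truth.items():
        # Recommend the most popular item the user has not interacted with in training.
        top = next((item for item in ranked if item not in seen[user]), None)
        hits.append(1.0 if top in relevant else 0.0)
    return mean(hits)


def make_toy_interactions(n_users=200, n_items=50, per_user=10, seed=0):
    """Synthetic, popularity-skewed (user, item) pairs; made-up data."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(n_items)]
    pairs = []
    for user in range(n_users):
        items = set()
        while len(items) != per_user:
            items.add(rng.choices(range(n_items), weights=weights)[0])
        pairs.extend((user, item) for item in sorted(items))
    return pairs


interactions = make_toy_interactions()
seeds = range(20)  # 20 data split random seeds

holdout_scores = [pop_precision_at_1(*holdout_split(interactions, s)) for s in seeds]
cv_scores = [
    mean(pop_precision_at_1(train, test)
         for train, test in cross_validation_folds(interactions, s))
    for s in seeds
]

print("holdout spread:", max(holdout_scores) - min(holdout_scores))
print("cross-validation spread:", max(cv_scores) - min(cv_scores))
```

        <p>Because each cross-validation score is an average over five folds that together cover every interaction once, one would typically expect it to vary less across seeds than a single holdout score, which is the comparison the paper reports on real data sets.</p>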
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results &amp; Conclusion</title>
      <p>To answer our research question, we quantify the expected variation in recommendation
accuracy of different data split random seeds when using holdout validation and compare this to
cross-validation. Therefore, we present the results of our experiments over all nine data sets
separated by the three algorithms, two metrics, three different cutoff values, and two validation
methods.</p>
      <p>Figure 1 shows a comprehensive summary of these experiments. There is one plot per
algorithm, per metric with each cutoff value k, and each plot compares holdout to
cross-validation on the y-axis. On the x-axis, the value 100 designates the mean accuracy of the
evaluations of one data set with 20 different data split random seeds. Furthermore, each plot
contains the evaluations of all nine data sets. To illustrate, we observe a few outliers for both
validation methods by looking at the plot for the algorithm ALS evaluated on Precision@1.
One of them is at ∼106 for holdout, which means that there exists a data split random seed
that achieved an accuracy that is ∼6% higher than the accuracy for its data set on average.
Conversely, another achieved a ∼5.5% lower accuracy than the mean for its data set. Finally, we
used a Wilcoxon test (α = 0.05) to verify whether the hypothesis H0, that the distributions of
the absolute deviation from the mean across seeds are equal for holdout and cross-validation,
holds per algorithm. We were able to reject H0 for each algorithm with p &lt; 0.001. Therefore,
the impact of the data split random seed on the evaluation of recommendation accuracy is
significantly different for holdout and cross-validation for all tested algorithms.</p>
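      <p>The normalization used on the x-axis, where 100 marks the mean accuracy over all seeds of one data set, can be written out as a short sketch. The function names and the four scores below are made-up illustrations, not values from the paper, and the Wilcoxon test itself is not reproduced here.</p>

```python
from statistics import mean


def relative_to_mean(scores):
    """Express each per-seed score as a percentage of the mean over all seeds."""
    average = mean(scores)
    return [100.0 * score / average for score in scores]


def max_abs_deviation(scores):
    """Largest absolute deviation from the mean, in percent of the mean."""
    return max(abs(value - 100.0) for value in relative_to_mean(scores))


# Hypothetical Precision@1 scores of one algorithm on one data set for four seeds.
holdout = [0.100, 0.106, 0.094, 0.100]

print([round(value, 1) for value in relative_to_mean(holdout)])  # [100.0, 106.0, 94.0, 100.0]
print(round(max_abs_deviation(holdout), 1))  # 6.0
```

      <p>In this reading, a point at 106 corresponds to a seed whose accuracy is 6% above the data set's mean over all seeds, which makes data sets with very different absolute accuracy comparable in one plot.</p>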
      <p>In the following paragraphs, we point to the main observations of these results, interpret their
meaning for our research goal, outline the limitations of this work, and conclude by answering
our research question.</p>
      <p>
        Observations We highlight seven observations that are numbered for reference: (1) In the
most extreme case of the non-baseline recommenders, which is ItemKNN evaluated with
Precision@1, there is one data split random seed that resulted in ∼6.3% higher accuracy than the
mean with a holdout split. (2) In the same case, since there is also a data split random seed that
achieves a ∼5.3% lower recommendation accuracy, the best-performing data split random seed
has a ∼12.2% higher accuracy than the worst-performing data split random seed. (3) Cross-validation
makes the effect much less noticeable, where the highest deviation is only ∼2.3% over the mean.
(4) Similarly, the lowest-performing data split random seed for cross-validation is only ∼1.8%
lower than the mean, resulting in an accuracy range of ∼4.2%. (5) In terms of the effect of
data split random seeds on accuracy, ALS and ItemKNN are similar across the board, while Pop
has a larger range and more outliers. (6) We observe that the deviation shrinks with larger k,
but it does so proportionally for both holdout and cross-validation. (7) There is no noticeable
difference in the observations between nDCG and Precision, and we note that nDCG@1 and
Precision@1 are equivalent.
      </p>
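      <p>Two of the observations above can be checked with a short derivation. The first part assumes binary relevance, the common log2-discounted definition of DCG, and at least one relevant test item per evaluated user; these are standard conventions but are not spelled out in the paper. The second part reproduces the ranges in observations (2) and (4), assuming they are ratios of the best to the worst relative accuracy.</p>

```latex
% Requires amsmath. rel_i is 1 if the item at rank i is a ground-truth interaction, else 0.
\begin{gather*}
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad
\mathrm{Precision@}k = \frac{1}{k}\sum_{i=1}^{k} rel_i \\
\mathrm{nDCG@}1 = \frac{rel_1/\log_2 2}{1/\log_2 2} = rel_1 = \mathrm{Precision@}1 \\
\text{holdout: } \frac{100 + 6.3}{100 - 5.3} = \frac{106.3}{94.7} \approx 1.122, \qquad
\text{cross-validation: } \frac{100 + 2.3}{100 - 1.8} = \frac{102.3}{98.2} \approx 1.042
\end{gather*}
```

      <p>The two ratios correspond to the ∼12.2% and ∼4.2% ranges stated in observations (2) and (4).</p>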
      <p>
        Interpretation Observation (1) shows that specific data split random seeds can result in
extreme performance gain over the mean performance. Therefore, it is possible to cherry-pick
a data split random seed that significantly overestimates the performance of an algorithm.
Similarly, observation (2) shows that the opposite may occur. Results that are outliers in
either direction can be achieved if the data split random seed is chosen negligently, potentially
skewing results. Specifically, a static data split random seed may return extreme results, which
can go unnoticed in repeated experiments. Observations (3) and (4) show how cross-validation
mitigates this effect by drastically reducing the magnitude of outliers. Furthermore, a logical
conclusion is that if cross-validation cannot be applied for any reason, averaging results over
repeated holdout with different data split random seeds is still expected to result in an accuracy
closer to the mean. Observation (5) shows how much the common baseline algorithm Pop reacts
to the randomness of data and how the traditional algorithms ALS and ItemKNN are similarly
affected by the randomness of data. Observation (6) may be explained by the fact that larger
cutoff values k lead to a higher probability that the recommended items contain ground-truth
interactions, reducing the impact of data split random seeds on the accuracy. However, while
the absolute deviation shrinks with larger k, cross-validation still mitigates the impact of data
split random seeds by a similar magnitude. Observation (7) hints at the similarity of Precision
and nDCG in recommender systems and shows that data split random seeds similarly affect
metrics that do and metrics that do not account for the position of the recommended items.
      </p>
      <p>[Figure 1. Panel titles recoverable from the extraction: Item-based k-Nearest Neighbors (ItemKNN), Popularity Recommender (Pop). y-axis: Validation Method (HO = holdout, CV = cross-validation). x-axis: Accuracy Depending on Data Split Random Seed - Relative to Mean Accuracy in %.]</p>
      <p>
        Limitations The results are obtained with non-optimized models. We acknowledge that
hyperparameter optimization may have an effect on the reported distribution. It may be
worthwhile to repeat the experiments with hyperparameter optimization in the future, but that
is an analysis with a different objective and incurs relatively high computational requirements.
Furthermore, we did not analyze the effects of data split random seeds with respect to data set
metadata. There may be a connection between the data set size, sparsity, and other features and
the variance of resulting performance scores. However, such a detailed analysis is outside the
scope of this paper. Still, it may be interesting to understand how much data set characteristics
affect the severity of the effects of data split random seeds in future work.
      </p>
      <p>Conclusion We answer our research question by quantifying how significant the impact
of data split random seeds on recommendation accuracy is. Our results, observations,
and interpretation show that the impact of data split random seeds on recommendation accuracy
may be significant enough to warrant steps to mitigate it. Even with a cutoff value of 10, e.g., ALS
evaluated with Precision@10, any one data split random seed may still lead to accuracy that is
up to ∼4% lower than the mean accuracy over all data split random seeds, potentially changing
the ranks of compared approaches. To illustrate, if an experiment evaluates an algorithm on
Precision@10 without repetition on a data split random seed that produces an exceptionally low
accuracy, the algorithm may appear up to ∼4% worse than it would be on average. Conversely,
if a new algorithm is evaluated with the same parameters, it could be estimated to be
∼4% better than on average, potentially opening a rift between the two compared algorithms.</p>
      <p>Given these results, we argue it is hard to trust a result evaluated with holdout validation on
a single data split random seed. We urge researchers to properly take randomness into account when
evaluating an algorithm. Based on our results, we are convinced that a single holdout
evaluation is not enough to gauge an evaluation pipeline’s performance accurately. Given the
distributions shown in Figure 1 and that we were able to reject H0 with p &lt; 0.001 for each
algorithm, we recommend performing at least repeated holdout validation or cross-validation,
and in the best case, repeated cross-validation. We acknowledge that experimenting may be
expensive, but misjudging algorithm performance may be more costly in the long term.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The OMNI cluster of the University of Siegen was used to compute the results presented in this
paper.
article/pii/S000437020200190X. doi:h t t p s : / / d o i . o r g / 1 0 . 1 0 1 6 / S 0 0 0 4 - 3 7 0 2 ( 0 2 ) 0 0 1 9 0 - X .
[14] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty
estimation using deep ensembles, 2017. a r X i v : 1 6 1 2 . 0 1 4 7 4 .
[15] C. Colas, O. Sigaud, P.-Y. Oudeyer, How many random seeds? statistical power analysis in
deep reinforcement learning experiments, 2018. a r X i v : 1 8 0 6 . 0 8 2 9 5 .
[16] S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Statistics</p>
      <p>Surveys 4 (2010) 40 – 79. URL: https://doi.org/10.1214/09-SS054. doi:1 0 . 1 2 1 4 / 0 9 - S S 0 5 4 .
[17] P. Gijsbers, E. LeDell, J. Thomas, S. Poirier, B. Bischl, J. Vanschoren, An open source automl
benchmark, 2019. a r X i v : 1 9 0 7 . 0 0 9 0 9 .
[18] P. Gijsbers, M. L. P. Bueno, S. Coors, E. LeDell, S. Poirier, J. Thomas, B. Bischl, J. Vanschoren,</p>
      <p>Amlb: an automl benchmark, 2022. a r X i v : 2 2 0 7 . 1 2 5 6 0 .
[19] F. Hernández del Olmo, E. Gaudioso, Evaluation of recommender systems: A new approach,
Expert Systems with Applications 35 (2008) 790–804. URL: https://www.sciencedirect.com/
science/article/pii/S0957417407002928. doi:h t t p s : / / d o i . o r g / 1 0 . 1 0 1 6 / j . e s w a . 2 0 0 7 . 0 7 . 0 4 7 .
[20] A. Said, A. Bellogín, Comparative recommender system evaluation: Benchmarking
recommendation frameworks, in: Proceedings of the 8th ACM Conference on Recommender
Systems, RecSys ’14, Association for Computing Machinery, New York, NY, USA, 2014, p.
129–136. URL: https://doi.org/10.1145/2645710.2645746. doi:1 0 . 1 1 4 5 / 2 6 4 5 7 1 0 . 2 6 4 5 7 4 6 .
[21] Z. He, H. Zhao, T. Yu, S. Kim, F. Du, J. McAuley, Bundle mcr: Towards conversational
bundle recommendation, in: Proceedings of the 16th ACM Conference on Recommender
Systems, RecSys ’22, Association for Computing Machinery, New York, NY, USA, 2022, p.
288–298. URL: https://doi.org/10.1145/3523227.3546755. doi:1 0 . 1 1 4 5 / 3 5 2 3 2 2 7 . 3 5 4 6 7 5 5 .
[22] S. Borg Bruun, M. Maistro, C. Lioma, Learning recommendations from user actions
in the item-poor insurance domain, in: Proceedings of the 16th ACM Conference on
Recommender Systems, RecSys ’22, Association for Computing Machinery, New York, NY,
USA, 2022, p. 113–123. URL: https://doi.org/10.1145/3523227.3546775. doi:1 0 . 1 1 4 5 / 3 5 2 3 2 2 7 .
3 5 4 6 7 7 5 .
[23] H. Chen, X. Li, K. Zhou, X. Hu, C.-C. M. Yeh, Y. Zheng, H. Yang, Tinykg: Memory-eficient
training framework for knowledge graph neural recommender systems, in: Proceedings
of the 16th ACM Conference on Recommender Systems, RecSys ’22, Association for
Computing Machinery, New York, NY, USA, 2022, p. 257–267. URL: https://doi.org/10.1145/
3523227.3546760. doi:1 0 . 1 1 4 5 / 3 5 2 3 2 2 7 . 3 5 4 6 7 6 0 .
[24] C. Almagor, Y. Hoshen, You say factorization machine, i say neural network - it’s all in
the activation, in: Proceedings of the 16th ACM Conference on Recommender Systems,
RecSys ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 389–398.</p>
      <p>URL: https://doi.org/10.1145/3523227.3551499. doi:1 0 . 1 1 4 5 / 3 5 2 3 2 2 7 . 3 5 5 1 4 9 9 .
[25] A. Rashed, S. Elsayed, L. Schmidt-Thieme, Context and attribute-aware sequential
recommendation via cross-attention, in: Proceedings of the 16th ACM Conference on
Recommender Systems, RecSys ’22, Association for Computing Machinery, New York, NY,
USA, 2022, p. 71–80. URL: https://doi.org/10.1145/3523227.3546777. doi:1 0 . 1 1 4 5 / 3 5 2 3 2 2 7 .
3 5 4 6 7 7 7 .
[26] O. Jeunen, Revisiting ofline evaluation for implicit-feedback recommender systems,
in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19,
Association for Computing Machinery, New York, NY, USA, 2019, p. 596–600. URL: https:
//doi.org/10.1145/3298689.3347069. doi:1 0 . 1 1 4 5 / 3 2 9 8 6 8 9 . 3 3 4 7 0 6 9 .
[27] Y. Ji, A. Sun, J. Zhang, C. Li, A critical study on data leakage in recommender system
ofline evaluation, ACM Trans. Inf. Syst. 41 (2023). URL: https://doi.org/10.1145/3569930.
doi:1 0 . 1 1 4 5 / 3 5 6 9 9 3 0 .
[28] L. Brozovsky, V. Petricek, Recommender system for online dating service, 2007.</p>
      <p>a r X i v : c s / 0 7 0 3 0 4 2 .
[29] A. Košir, A. Odić, M. Tkalčič, How to improve the statistical power of the 10-fold cross
validation scheme in recommender systems, in: Proceedings of the International Workshop
on Reproducibility and Replication in Recommender Systems Evaluation, RepSys ’13,
Association for Computing Machinery, New York, NY, USA, 2013, p. 3–6. URL: https:
//doi.org/10.1145/2532508.2532510. doi:1 0 . 1 1 4 5 / 2 5 3 2 5 0 8 . 2 5 3 2 5 1 0 .
[30] E. A. Sosnina, S. Sosnin, A. A. Nikitina, I. Nazarov, D. I. Osolodkin, M. V.
Fedorov, Recommender systems in antiviral drug discovery, ACS Omega 5
(2020) 15039–15051. URL: https://doi.org/10.1021/acsomega.0c00857. doi:1 0 . 1 0 2 1 / a c s o m e g a .
0 c 0 0 8 5 7 . a r X i v : h t t p s : / / d o i . o r g / 1 0 . 1 0 2 1 / a c s o m e g a . 0 c 0 0 8 5 7 , pMID: 32632398.
[31] D. I. Ignatov, J. Poelmans, G. Dedene, S. Viaene, A new cross-validation technique to
evaluate quality of recommender systems, in: M. K. Kundu, S. Mitra, D. Mazumdar, S. K. Pal
(Eds.), Perception and Machine Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg,
2012, pp. 195–202.
[32] M. D. Ekstrand, Lenskit for python: Next-generation software for recommender systems
experiments, in: Proceedings of the 29th ACM International Conference on Information
&amp; Knowledge Management, CIKM ’20, Association for Computing Machinery, New York,
NY, USA, 2020, p. 2999–3006. URL: https://doi.org/10.1145/3340531.3412778.
doi:10.1145/3340531.3412778.
[33] N. Hug, Surprise: A python library for recommender systems, Journal of Open Source
Software 5 (2020) 2174.
[34] V. W. Anelli, A. Bellogin, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini,
T. Di Noia, Elliot: A comprehensive and rigorous framework for reproducible recommender
systems evaluation, in: Proceedings of the 44th International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’21, Association for Computing
Machinery, New York, NY, USA, 2021, p. 2405–2414. URL:
https://doi.org/10.1145/3404835.3463245. doi:10.1145/3404835.3463245.
[35] Z. Gantner, S. Rendle, C. Freudenthaler, L. Schmidt-Thieme, Mymedialite: A free
recommender system library, in: Proceedings of the Fifth ACM Conference on Recommender
Systems, RecSys ’11, Association for Computing Machinery, New York, NY, USA, 2011, p.
305–308. URL: https://doi.org/10.1145/2043932.2043989. doi:10.1145/2043932.2043989.
[36] J. A. Gulla, L. Zhang, P. Liu, O. Özgöbek, X. Su, The adressa dataset for news
recommendation, in: Proceedings of the International Conference on Web Intelligence, WI
’17, Association for Computing Machinery, New York, NY, USA, 2017, p. 1042–1048. URL:
https://doi.org/10.1145/3106426.3109436. doi:10.1145/3106426.3109436.
[37] J. Ni, J. Li, J. McAuley, Justifying recommendations using distantly-labeled reviews and
fine-grained aspects, in: Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong
Kong, China, 2019, pp. 188–197. URL: https://aclanthology.org/D19-1018.
doi:10.18653/v1/D19-1018.
[38] E. Cho, S. A. Myers, J. Leskovec, Friendship and mobility: User movement in location-based
social networks, in: Proceedings of the 17th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’11, Association for Computing Machinery,
New York, NY, USA, 2011, p. 1082–1090. URL: https://doi.org/10.1145/2020408.2020579.
doi:10.1145/2020408.2020579.
[39] I. Cantador, P. Brusilovsky, T. Kuflik, 2nd workshop on information heterogeneity and
fusion in recommender systems (hetrec 2011), in: Proceedings of the 5th ACM conference
on Recommender systems, RecSys 2011, ACM, New York, NY, USA, 2011.
[40] F. M. Harper, J. A. Konstan, The movielens datasets: History and context, ACM Trans.
Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[41] M. D. Ekstrand, M. Ludwig, J. A. Konstan, J. T. Riedl, Rethinking the recommender research
ecosystem: Reproducibility, openness, and lenskit, in: Proceedings of the Fifth ACM
Conference on Recommender Systems, RecSys ’11, Association for Computing Machinery,
New York, NY, USA, 2011, p. 133–140. URL: https://doi.org/10.1145/2043932.2043958.
doi:10.1145/2043932.2043958.
[42] O. Barkan, R. Hirsch, O. Katz, A. Caciularu, N. Koenigstein, Anchor-based collaborative
filtering, in: Proceedings of the 30th ACM International Conference on Information &amp;
Knowledge Management, CIKM ’21, Association for Computing Machinery, New York,
NY, USA, 2021, p. 2877–2881. URL: https://doi.org/10.1145/3459637.3482056.
doi:10.1145/3459637.3482056.
[43] A. B. Melchiorre, N. Rekabsaz, C. Ganhör, M. Schedl, Protomf: Prototype-based matrix
factorization for effective and explainable recommendations, in: Proceedings of the
16th ACM Conference on Recommender Systems, RecSys ’22, Association for Computing
Machinery, New York, NY, USA, 2022, p. 246–256. URL:
https://doi.org/10.1145/3523227.3546756. doi:10.1145/3523227.3546756.
[44] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for
collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, WWW
’18, International World Wide Web Conferences Steering Committee, Republic and
Canton of Geneva, CHE, 2018, p. 689–698. URL: https://doi.org/10.1145/3178876.3186150.
doi:10.1145/3178876.3186150.
[45] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, Bert4rec: Sequential recommendation
with bidirectional encoder representations from transformer, in: Proceedings of the 28th
ACM International Conference on Information and Knowledge Management, CIKM ’19,
Association for Computing Machinery, New York, NY, USA, 2019, p. 1441–1450. URL:
https://doi.org/10.1145/3357384.3357895. doi:10.1145/3357384.3357895.
[46] Z. Yue, Z. He, H. Zeng, J. McAuley, Black-box attacks on sequential recommenders via
data-free model extraction, in: Proceedings of the 15th ACM Conference on Recommender
Systems, RecSys ’21, Association for Computing Machinery, New York, NY, USA, 2021, p.
44–54. URL: https://doi.org/10.1145/3460231.3474275. doi:10.1145/3460231.3474275.
[47] Z. Yue, H. Zeng, Z. Kou, L. Shang, D. Wang, Defending substitution-based profile pollution
attacks on sequential recommenders, in: Proceedings of the 16th ACM Conference on
Recommender Systems, RecSys ’22, Association for Computing Machinery, New York, NY,
USA, 2022, p. 59–70. URL: https://doi.org/10.1145/3523227.3546770.
doi:10.1145/3523227.3546770.
[48] X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, Kgat: Knowledge graph attention network for
recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery &amp; Data Mining, KDD ’19, Association for Computing Machinery,
New York, NY, USA, 2019, p. 950–958. URL: https://doi.org/10.1145/3292500.3330989.
doi:10.1145/3292500.3330989.
[49] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Analysis of recommendation algorithms for
e-commerce, in: Proceedings of the 2nd ACM Conference on Electronic Commerce, EC
’00, Association for Computing Machinery, New York, NY, USA, 2000, p. 158–167. URL:
https://doi.org/10.1145/352871.352887. doi:10.1145/352871.352887.
[50] T. Avny Brosh, A. Livne, O. Sar Shalom, B. Shapira, M. Last, Bruce: Bundle recommendation
using contextualized item embeddings, in: Proceedings of the 16th ACM Conference on
Recommender Systems, RecSys ’22, Association for Computing Machinery, New York, NY,
USA, 2022, p. 237–245. URL: https://doi.org/10.1145/3523227.3546754.
doi:10.1145/3523227.3546754.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <article-title>Evaluating recommender systems: Survey and framework</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1145/3556536.
          doi:10.1145/3556536.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Aspect re-distribution for learning better item embeddings in sequential recommendation</article-title>
          ,
          <source>in: Proceedings of the 16th ACM Conference on Recommender Systems</source>
          , RecSys '22,
          Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          . URL: https://doi.org/10.1145/3523227.3546764.
          doi:10.1145/3523227.3546764.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Next-item recommendations in short sessions</article-title>
          ,
          <source>in: Proceedings of the 15th ACM Conference on Recommender Systems</source>
          , RecSys '21,
          Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>282</fpage>
          -
          <lpage>291</lpage>
          . URL: https://doi.org/10.1145/3460231.3474238.
          doi:10.1145/3460231.3474238.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Caen: A hierarchically attentive evolution network for item-attribute-change-aware recommendation in the growing e-commerce environment</article-title>
          ,
          <source>in: Proceedings of the 16th ACM Conference on Recommender Systems</source>
          , RecSys '22,
          Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>278</fpage>
          -
          <lpage>287</lpage>
          . URL: https://doi.org/10.1145/3523227.3546773.
          doi:10.1145/3523227.3546773.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Estimating classification error rate: Repeated cross-validation, repeated holdout and bootstrap</article-title>
          ,
          <source>Computational Statistics &amp; Data Analysis</source>
          <volume>53</volume>
          (
          <year>2009</year>
          )
          <fpage>3735</fpage>
          -
          <lpage>3745</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0167947309001601. doi:https://doi.org/10.1016/j.csda.2009.04.009.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          ,
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>D'Amour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Adlam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Alipanahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beutel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hormozdiari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jerfel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karthikesalingam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mincu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montanari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nielson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sayres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrouff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sequeira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Veitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vladymyrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Webster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yadlowsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <article-title>Underspecification presents challenges for credibility in modern machine learning</article-title>
          ,
          <year>2020</year>
          .
          arXiv:2011.03395
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Phang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Févry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks</article-title>
          ,
          <year>2019</year>
          .
          arXiv:1811.01088
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Amir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>van de Meent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>On the impact of random seeds on the fairness of clinical classifiers</article-title>
          ,
          <year>2021</year>
          .
          arXiv:2104.06338
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping</article-title>
          ,
          <year>2020</year>
          .
          arXiv:2002.06305
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Madhyastha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>On model stability as a function of random seed</article-title>
          ,
          <year>2019</year>
          .
          arXiv:1909.10447
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mosbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andriushchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          ,
          <article-title>On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines</article-title>
          ,
          <year>2021</year>
          .
          arXiv:2006.04884
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Ensembling neural networks: Many could be better than all</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>137</volume>
          (
          <year>2002</year>
          )
          <fpage>239</fpage>
          -
          <lpage>263</lpage>
          . URL: https://www.sciencedirect.com/science/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>