<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0009-0004-6941-2203</contrib-id>
          <string-name>Miha Malenšek</string-name>
          <email>mihamalen@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-9916-8756</contrib-id>
          <string-name>Blaž Škrlj</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0009-0001-9669-5374</contrib-id>
          <string-name>Blaž Mramor</string-name>
          <email>blazmramor@hotmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jure Demšar</string-name>
          <email>jure.demsar@fri.uni-lj.si</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer and Information Science</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Outbrain</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many solutions that allow generation of artificial datasets with such characteristics. For that purpose, we developed a novel framework for generating synthetic datasets that are diverse and statistically coherent. Our framework allows for the creation of datasets with controlled attributes, enabling iterative modifications to fit specific experimental needs, such as introducing complex feature interactions, feature cardinality, or specific distributions. We demonstrate the framework's utility through use cases such as benchmarking probabilistic counting algorithms, detecting algorithmic bias, and simulating AutoML searches. Unlike existing methods that either focus narrowly on specific dataset structures or prioritize (private) data synthesis from real data, our approach provides a modular means to quickly generate completely synthetic datasets tailored to diverse experimental requirements. Our results show that the framework effectively isolates model behavior in unique situations and highlights its potential for significant advancements in the evaluation and development of recommender systems. The framework is available as a free, open-source Python package to facilitate research with minimal friction.</p>
      </abstract>
      <kwd-group>
        <kwd>dataset generation</kwd>
        <kwd>categorical datasets</kwd>
        <kwd>evaluating recommender systems</kwd>
        <kwd>probabilistic counting</kwd>
        <kwd>DeepFM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Recommender systems have significantly enhanced user experiences across various digital platforms,
from e-commerce to streaming services and advertisement. These systems analyze user preferences and
behaviours to suggest relevant items, thereby increasing user engagement and satisfaction. However,
evaluating and benchmarking these systems poses unique challenges due to the sensitivity and
availability of real-world data. Privacy concerns, data access restrictions, and the need for diverse testing
scenarios often limit researchers’ ability to conduct comprehensive evaluations and perform thorough
benchmarking.</p>
      <p>
        Synthetic data generation offers a promising solution to these challenges and is already widely used
in fields like computer vision and robotics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], yet still in its nascent stages in the field of recommender
systems. Prior research in the field mainly focused on maintaining data fidelity and ensuring user privacy.
For instance, Slokom et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced the SynRec framework to generate partially synthetic data
using CART, while Berlioz et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] applied differential privacy techniques to protect user information
with matrix factorization. Efforts to scale academic datasets to production standards include
Antulov-Fantulin et al.'s [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] memory-biased random walks and Belletti et al.'s [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] fractal expansions. Provalov et al.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] developed the SynEvaRec framework using GAN techniques for synthetic data generation based on real
data. All this successful research points to significant potential for advancements in recommender systems
through synthetic data, warranting further research on data fidelity, privacy-preserving techniques,
and new simulation methods.
      </p>
      <p>Despite this, there remains a need for frameworks that generate completely synthetic data tailored
specifically to the evaluation of recommender systems. Such data should possess characteristics
that make it suitable for rigorous testing and algorithmic development, without relying on
privacy-preserving transformations of real data. This paper addresses this gap by proposing a comprehensive
framework for generating diverse and statistically coherent synthetic datasets tailored to the evaluation
of recommender systems. Unlike techniques that focus on privacy or emulate real data, our approach
creates entirely artificial, production-scale data with specific qualities and attributes. Controlling the
generation process allows iterative modifications to fit specific experimental needs, such as introducing
complex feature interactions or increasing the cardinality of the dataset. Our deterministic generative
process allows for reproducibility and enables on-the-fly dataset modifications, reducing setup time for
experimental scenarios. We demonstrate the framework’s utility through use cases in benchmarking
algorithms, detecting algorithmic bias, and simulating AutoML searches.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Generating Categorical Datasets</title>
      <p>To streamline the creation of synthetic datasets for classification tasks and avoid reliance on imperfect or
complex real-world data samples, we introduce a comprehensive framework called
CategoricalClassification. The framework is available through the Python Package Index (PyPI, https://pypi.org/) and can be
installed with pip. The core of our solution is packaged into a Python class called
CategoricalClassification. Functionalities implemented in this class allow for rapid generation of production-scale synthetic
datasets with specific attributes tailored to the nature of the problem, such as sparsity, high cardinality
features or specific distributions. Additionally, our framework offers functionalities to augment these
datasets by incorporating feature combinations, correlations, and noise.</p>
      <p>The CategoricalClassification framework generates datasets composed of integer arrays that represent
various categorical values, from encoded categories and hashes to counts. All generation processes are
reproducible through the use of random seeds. For simpler datasets, a single function call can generate
a useful synthetic dataset, while the functionalities described in Table 1 streamline the generation of
more complex datasets with specific attributes to fit experimental needs. These can be specific feature
value sets and distributions, specific feature-class relationships, or user-defined feature combinations.
To demonstrate its capabilities, the example dataset seen in Figure 1 was generated, featuring various
types of distributions and feature cardinalities. Adding combinations of features or correlated features
is a simple matter of calling the appropriate function and specifying the desired (column) indices.</p>
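As an illustration of the kind of generation loop the framework automates, the following NumPy sketch builds a seeded categorical dataset with mixed cardinalities and distributions. The function and parameter names here are our own illustrative choices, not the actual CategoricalClassification API:

```python
import numpy as np

def make_categorical_dataset(n_samples, seed=0):
    """Illustrative generator (not the real API): seeded, integer-valued
    categorical features with different cardinalities and distributions."""
    rng = np.random.default_rng(seed)                   # fixed seed -> reproducible
    uniform_low = rng.integers(0, 7, n_samples)         # low cardinality, uniform
    skewed = rng.zipf(2.0, n_samples) % 1000            # high cardinality, long tail
    gaussian_like = np.clip(rng.normal(50, 10, n_samples), 0, 99).astype(int)
    X = np.column_stack([uniform_low, skewed, gaussian_like])
    # a simple class relation on the first feature, making it "relevant"
    y = (uniform_low >= 4).astype(int)
    return X, y

X, y = make_categorical_dataset(10_000, seed=42)
```

Because the generator is fully seeded, two calls with the same seed return bit-identical datasets, which is what makes experiments repeatable.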
      <p>To ensure familiarity, the API structure of our framework follows common principles seen in libraries
like NumPy or SciPy. The framework itself is available as an integrated module of the open-source tool
Outrank (https://github.com/outbrain/outrank), or as a standalone tool catclass (https://github.com/
98MM/msc_cc), both installable using pip.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Applications of custom synthetic datasets</title>
      <p>In this section we consider three applications that showcase how custom generated synthetic datasets
can give novel insights into methods commonly used in recommender systems.</p>
      <sec id="sec-4-1">
        <title>3.1. Use Case 1 – Benchmarking Probabilistic Counting Algorithms</title>
        <p>Categorical data streams are ubiquitous in modern recommender systems. Features include different
categories, identifiers, or aggregates of numeric features and similar counts. When monitoring online
streams, effectively counting the number of unique items in the stream becomes a challenging
computational problem. Exact counting methods (such as hashing) incur substantial memory overhead
that seldom scales in practice. To remedy this shortcoming, probabilistic counting algorithms were
introduced. In particular, we are interested in the HyperLogLog family – algorithms aimed at estimating
the number of unique items (cardinality) in a data batch (part of a stream). We observed that
probabilistic counters, albeit much more memory efficient, introduce some noise in terms of precision –
which is expected. However, the issue is that common algorithms do not discriminate between low- and
high-cardinality items. In practice this means that a low-cardinality feature, where the estimate is of high
impact (e.g., count of days in a week), is occasionally miscounted, resulting in errors that carry bigger
impact than, e.g., counting unique users on a site. To remedy this shortcoming, we introduce a caching
mechanism for arbitrary HyperLogLog-like algorithms, where, to a certain degree, the algorithm remains
deterministic, and only switches to a probabilistic mode of operation if its memory requirements exceed
the constrained (user-specified) value. See Figure 2 for a visualization of these results.</p>
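A minimal Python sketch of this caching idea follows. The class, the cache threshold, and the hash choice are our illustrative assumptions, not the implementation benchmarked in the paper; the HyperLogLog part is a standard textbook variant:

```python
import hashlib
import math

class CachedHLL:
    """Sketch: HyperLogLog with an exact-counting cache. Stays deterministic
    for low-cardinality features, switching to probabilistic estimation only
    once a user-specified memory budget (max_exact) is exceeded."""

    def __init__(self, p=10, max_exact=1024):
        self.p, self.m = p, 1 << p
        self.registers = [0] * self.m   # standard HLL registers
        self.exact = set()              # exact cache; None after eviction
        self.max_exact = max_exact

    def _hash64(self, item):
        return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        if self.exact is not None:
            self.exact.add(item)
            if len(self.exact) > self.max_exact:
                self.exact = None       # budget exceeded: probabilistic mode
        h = self._hash64(item)
        idx = h >> (64 - self.p)                      # first p bits pick a register
        w = h & ((1 << (64 - self.p)) - 1)            # remaining bits
        rank = (64 - self.p) - w.bit_length() + 1     # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        if self.exact is not None:
            return len(self.exact)      # exact, zero-error answer
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:             # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(round(est))
```

Under this scheme a low-cardinality feature such as day-of-week never leaves the exact cache, so its count is always correct; only genuinely high-cardinality features pay the (bounded) estimation error.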
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Use Case 2 – Detecting Algorithmic Bias</title>
        <p>
          Datasets with complex feature interactions present significant challenges for machine learning models.
As the complexity of these interactions increases, so does the difficulty of accurately modelling and
capturing these relationships. Moreover, complex feature interactions can lead to overfitting or
underfitting, resulting in poor generalization to unseen data, or an inability to capture intricate relationships
in the data. Traditional linear models, such as logistic regression, are efficient but unfortunately
often fail to capture complex feature interactions. Advanced models, such as the DeepFM model, which
combines factorization machines and deep neural networks [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ], are better equipped to handle such
complexities. As such, we are interested in how logistic regression and DeepFM models compare when
presented with datasets with increasingly complex feature interactions. To systematically evaluate bias
and performance, we generate an initial synthetic dataset with 4 relevant and 750 irrelevant features, 10k
samples, and a nonlinear class relation. We introduce 20% categorical noise and create various feature
interactions based on pairwise combinations of relevant features, which are subsequently removed
from our generated dataset. We then iteratively remove the resulting combination features, perform
minimal hyperparameter tuning, and evaluate DeepFM and logistic regression performance over one
epoch. When multiple types of combinations are present in the generated dataset, we observed an
increase in both the AUC and accuracy scores for both models.
        </p>
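The dataset construction described above can be sketched in a few lines of NumPy. The column counts, noise level, and the removal of the relevant columns follow the text; the specific nonlinear rule and the sum-of-squares combination are illustrative assumptions, and this is not the framework's API:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(42)
n = 10_000
relevant = rng.integers(0, 10, size=(n, 4))        # 4 relevant categorical features
irrelevant = rng.integers(0, 10, size=(n, 750))    # 750 irrelevant features
# an assumed nonlinear class relation on the relevant features
y = ((relevant[:, 0] * relevant[:, 1] + relevant[:, 2] ** 2) % 2).astype(int)

# pairwise combination features (here: sum of squares of two relevant columns);
# the relevant columns themselves are then dropped, as described in the text
combos = np.stack([relevant[:, i] ** 2 + relevant[:, j] ** 2
                   for i, j in combinations(range(4), 2)], axis=1)
X = np.hstack([combos, irrelevant])

# 20% categorical noise: overwrite a random subset of entries with random categories
mask = rng.random(X.shape) < 0.2
X[mask] = rng.integers(0, 10, size=int(mask.sum()))
```

The resulting matrix has 6 combination columns plus 750 irrelevant ones, so a model can only recover the class signal through the engineered interactions.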
        <p>As seen in Figure 3, DeepFM significantly outperformed logistic regression in all but one case – when
presented with features created with a sum-of-squares combination.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Use Case 3 – Simulating AutoML search</title>
        <p>Automated Machine Learning (AutoML) represents a paradigm shift in the way machine learning models
are developed and deployed. AutoML aims to decrease the time and effort required to develop robust
machine learning solutions by automating data pre-processing, feature ranking and engineering, model
selection, hyperparameter tuning and model evaluation.</p>
        <p>Building effective recommender systems involves managing complex feature interactions in a wide
feature space and often hinges on the fine-tuning of the underlying models and algorithms. One of the
most important AutoML operations in recommender systems is feature selection. More precisely, the
question of finding the smallest subset of features that will, given an ML algorithm, deliver optimal, or
close to optimal performance. In this way one optimizes the size of a model which is especially valuable
for deployments on a large scale and when we are dealing with memory limitations.</p>
        <p>A commonly used strategy is to iteratively grow the feature set by adding the next best feature that
is not already in the set. In the first step we find the best feature <italic>f</italic><sub>1</sub> by finding the model that gives the
best predictions with a single feature. After that, we find the best model predicting with <italic>f</italic><sub>1</sub> and another
feature, called <italic>f</italic><sub>2</sub>, etc. We repeat the process until we have a feature set of the desired size or until we
no longer see a performance boost. In this use case we use the aforementioned open-source package
called Outrank. Outrank implements the functionality for iterative feature set growth – given an
existing subset of features from our dataset, it provides a ranking of the remaining features by evaluating
the performance of all models containing the existing features and one of the remaining ones. For this
we use the so-called surrogate-SGD-prior heuristic of Outrank, which, under the hood, uses
Scikit-learn's SGDClassifier and computes a 4-fold cross-validation score with negative log loss.</p>
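The iterative growth procedure can be sketched generically as follows. This is a toy illustration with a plain least-squares surrogate standing in for Outrank's cross-validated SGDClassifier heuristic; the function names are ours, not Outrank's:

```python
import numpy as np

def forward_select(X, y, score_fn, k):
    """Greedy forward selection: at each step add the single feature
    that maximizes score_fn on the grown subset."""
    selected, remaining, history = [], list(range(X.shape[1])), []
    for _ in range(k):
        scores = {f: score_fn(X[:, selected + [f]], y) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append(scores[best])
    return selected, history

def lstsq_score(Xs, y):
    # toy surrogate: negative residual sum of squares of a linear fit
    A = np.column_stack([np.ones(len(Xs)), Xs])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return -float(np.sum((y - A @ beta) ** 2))
```

On data where one feature dominates the target, the procedure recovers it first and then adds the next most informative feature, mirroring the evolution of scores discussed below.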
        <p>For simulations of such AutoML searches, we generate three initial datasets with 4 relevant and
900 irrelevant features, and a nonlinear class relation, with 10k, 50k, and 100k samples. As in Section
3.2, we introduce 20% categorical noise and create various feature interactions based on pairwise
combinations of relevant features, which are subsequently removed from our generated dataset. Figure
4 shows the evolution of the scores of the models as new features are added.</p>
        <p>It is important to note that, due to the complexity of the search, Outrank has no hyperparameter
optimisation capability for the models that provide feature rankings. This explains why the results in
Figure 4 fall off quickly as feature complexity increases. The performance of the models
with hyperparameter tuning has thus been evaluated similarly as in Section 3.2 (see Figure 5). The
positive trend in the AutoML evolution of Outrank (Figure 4) seems to imply a positive trend in the
performance of the corresponding models trained with hyperparameter tuning. This, on the one hand, is
useful information for optimal feature selection. On the other hand, it shows that AutoML for feature
selection without hyperparameter optimisation can be misleading.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Discussion and Conclusions</title>
      <p>The framework presented in this paper offers a robust and versatile tool for creating synthetic datasets
for testing and evaluation of real-world recommender systems. By allowing precise control over dataset
attributes, researchers can design experiments that isolate specific aspects of algorithm performance,
especially when investigating scenarios that are not commonly encountered in real-life data. Our use
cases demonstrate the framework’s practical application.</p>
      <p>We showcased the usefulness of our synthetic data generation framework on several real-world
scenarios. In the first one, we precisely evaluated probabilistic counting algorithms. By testing the
algorithms on more than 2k synthetic datasets, we were able to highlight a key limitation – the probabilistic
counting algorithms' inability to discriminate between low- and high-cardinality items. By introducing
a caching mechanism, we enhanced an arbitrary HyperLogLog-like algorithm's precision for
low-cardinality features.</p>
      <p>In the second example, we used our framework’s ability to generate complex feature interactions to
systematically evaluate the performance of logistic regression and DeepFM models on datasets with
varying levels and amounts of feature interaction complexity. The results demonstrated DeepFM’s
superior ability to handle complex interactions, significantly outperforming logistic regression.</p>
      <p>Finally, we simulated AutoML search and evaluated its ability to identify relevant features. Through
these use cases we demonstrated our framework's efficacy in generating challenging datasets to simulate
real-world scenarios in a controlled environment.</p>
      <p>The three use cases demonstrate our framework’s capabilities at various stages in the recommender
systems’ pipeline. Evaluating probabilistic counting algorithms may decrease memory overhead when
determining feature cardinality. The framework’s ability to simulate properties of real-life data in a
controlled manner enables us to generate data tailored to a specific problem, enabling further insight
into model behavior (i.e. determining algorithm bias). By controlling the generation process we also
control feature relevance, enabling us to test key functionalities of AutoML systems integrated within
many recommender systems’ pipelines.</p>
      <p>Despite its strengths, there are areas for future improvement. By integrating advanced generative
models such as GANs or variational autoencoders we could further enrich the diversity and realism
of the synthetic datasets. Additionally, expanding the framework to support other types of machine
learning tasks, such as regression, could broaden its applicability and impact. Nevertheless, by enabling
controlled, repeatable experiments, our framework provides researchers and practitioners with a powerful
tool to advance the field, ultimately leading to more effective and reliable recommender systems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lesnikowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. d. S. P.</given-names>
            <surname>Moreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rabhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Byleen-Higley</surname>
          </string-name>
          ,
          <article-title>Synthetic data and simulators for recommendation systems: current state and future directions</article-title>
          ,
          <source>arXiv preprint arXiv:2112.11022</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Slokom</surname>
          </string-name>
          ,
          <article-title>Comparing recommender systems using synthetic data</article-title>
          ,
          <source>in: Proceedings of the 12th ACM Conference on Recommender Systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>548</fpage>
          -
          <lpage>552</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berlioz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Kaafar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Boreli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Berkovsky</surname>
          </string-name>
          ,
          <article-title>Applying differential privacy to matrix factorization</article-title>
          ,
          <source>in: Proceedings of the 9th ACM Conference on Recommender Systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Antulov-Fantulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bošnjak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zlatić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grčar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Šmuc</surname>
          </string-name>
          ,
          <article-title>Synthetic sequence generator for recommender systems-memory biased random walk on a sequence multilayer network</article-title>
          ,
          <source>in: Discovery Science: 17th International Conference, DS 2014, Bled, Slovenia, October 8-10, 2014, Proceedings 17</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Belletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Krichene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <article-title>Scalable realistic recommendation datasets through fractal expansions</article-title>
          ,
          <source>arXiv preprint arXiv:1901.08910</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Provalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stavinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chunaev</surname>
          </string-name>
          ,
          <article-title>SynEvaRec: A framework for evaluating recommender systems on synthetic data classes</article-title>
          ,
          <source>in: 2021 International Conference on Data Mining Workshops (ICDMW)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Deepctr: Easy-to-use, modular and extendible package of deep-learning based ctr models</article-title>
          , https://github.com/shenweichen/deepctr,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Deepfm: a factorization-machine based neural network for ctr prediction</article-title>
          ,
          <source>arXiv preprint arXiv:1703.04247</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>