<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IIR</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Caser+ and CosRec+: Closing the Gap Between CNNs and Attention Models in SRS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Siciliano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Purificato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Betello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Silvestri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Engineering Department, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>15</volume>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Sequential Recommender Systems (SRSs) have predominantly shifted toward neural-based models. Despite significant advances, Convolutional Neural Network (CNN)-based SRSs have been increasingly overshadowed by more powerful attention-based approaches. In this paper, we introduce a novel adaptation of two popular CNN-based SRSs, Caser and CosRec. We enhance their training by adjusting the convolution and pooling operations to process the entire input sequence simultaneously rather than focusing only on the most recent item. Experimental results show that these modified CNN-based models achieve improvements of up to +65% in NDCG@10 over their original versions. Code is available at https://github.com/antoniopurificato/recsys_conv_conf.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Sequential Recommendation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Background</title>
        <p>
          Sequential recommendation aims to predict the next item i_{n+1} based on a preceding sequence (i_1, . . . , i_n).
Directly training a model to output only the last element i_{n+1} can be inefficient for longer histories [17].
A more effective strategy, as adopted by sequence-to-sequence architectures [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], is to predict each
successive interaction: i_2 from (i_1), then i_3 from (i_1, i_2), and so forth [18]. In the neural
recommendation paradigm, each item is projected into a continuous embedding space [19], producing an input
representation as an n × h matrix, where n is the sequence length and h the embedding size. Classic recurrent-based recommenders like GRU4Rec [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] process one
timestep at a time. The hidden state at time t feeds into both the next recurrent cell and the output layer,
allowing information from (i_1, . . . , i_t) to accumulate and influence all future predictions.
Attention-based solutions like SASRec [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] use self-attention to evaluate all positions in the sequence
simultaneously. Masking restricts each timestep t so that it only sees past interactions (i_1, . . . , i_t).
        </p>
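        <p>The shifted-target scheme described above can be sketched as follows. This is an illustrative snippet, not the paper's code; the item ids are hypothetical.</p>

```python
# Sketch (assumed setup): building next-item targets for
# sequence-to-sequence training, as in SASRec-style recommenders.
def make_targets(sequence):
    """Given (i_1, ..., i_n), train on inputs (i_1, ..., i_{n-1})
    with targets (i_2, ..., i_n): position t predicts item t+1."""
    return sequence[:-1], sequence[1:]

inputs, targets = make_targets([3, 7, 7, 1, 9])
# inputs  = [3, 7, 7, 1]  (what the model sees at each step)
# targets = [7, 7, 1, 9]  (the next interaction to predict)
```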
        <p>
          CNNs, which originated in image processing, slide a convolutional filter across the input to extract
local patterns. Here we focus on two CNN-based recommenders: Caser [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and CosRec [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Caser and CosRec</title>
        <p>Caser applies two kinds of convolution. First, its vertical filters cover all n timesteps but only one
embedding dimension per filter. This yields a vector of size h × n_v, where n_v denotes the number of
vertical kernels. Second, horizontal convolutions use multiple filters with different temporal extents
k ∈ {1, . . . , n}. A kernel of shape k × h captures local patterns across k consecutive items. Each
horizontal filter produces a (n − k + 1)-long feature map, then a max pooling compresses each map into
a single value, giving an n_h-dimensional vector per extent. Concatenating all n pooled vectors yields a representation of length
n × n_h, which is then merged with the vertical features. The resulting vector of size (h × n_v) + (n × n_h)
feeds into a fully-connected layer to generate the final score for every potential next item.</p>
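        <p>These shapes can be traced in a minimal PyTorch sketch. This is an illustration of the dimensions above, not the authors' implementation; all hyperparameter values are invented.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: n timesteps, h embedding dims,
# n_v vertical kernels, n_h horizontal kernels per extent.
n, h, n_v, n_h, num_items = 5, 8, 4, 2, 100

E = torch.randn(1, 1, n, h)                   # one user's embedded sequence

vert = nn.Conv2d(1, n_v, kernel_size=(n, 1))  # covers all n timesteps
v = vert(E).view(1, -1)                       # -> (1, h * n_v)

h_feats = []
for k in range(1, n + 1):                     # temporal extents k = 1..n
    conv = nn.Conv2d(1, n_h, kernel_size=(k, h))
    fmap = conv(E).squeeze(3)                 # -> (1, n_h, n - k + 1)
    # global max over the temporal axis: one value per filter
    h_feats.append(F.max_pool1d(fmap, fmap.size(2)).squeeze(2))
hcat = torch.cat(h_feats, dim=1)              # -> (1, n * n_h)

z = torch.cat([v, hcat], dim=1)               # size (h * n_v) + (n * n_h)
scores = nn.Linear(z.size(1), num_items)(z)   # score for every candidate item
```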
        <p>CosRec follows a different design by first forming all possible pairs of embeddings. Specifically, it
constructs a 3D tensor of shape n × n × 2h, where each slice encodes the concatenated embeddings of an
item pair. This tensor is then passed through two convolutional blocks: each block contains a 1 × 1 and
a 3 × 3 convolution, followed by batch normalization and ReLU. With no padding, each block shrinks
the spatial dimensions by 2 on each axis, resulting in an (n − 4) × (n − 4) × c tensor at the end of the
pipeline, where c is the number of output channels. Finally, global average pooling across the first two dimensions produces a c-dimensional
summary vector, which is processed by a dense layer to obtain the output predictions.</p>
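        <p>The pairwise construction and the two shrinking blocks can be sketched as follows. This is a shape-level illustration under assumed sizes, not the original CosRec hyperparameters.</p>

```python
import torch
import torch.nn as nn

# Hypothetical sizes: n items, h embedding dims, c channels.
n, h, c = 7, 8, 16
x = torch.randn(n, h)                          # embedded sequence

# All item pairs: T[i, j] = concat(x_i, x_j)  -> shape (n, n, 2h)
T = torch.cat([x.unsqueeze(1).expand(n, n, h),
               x.unsqueeze(0).expand(n, n, h)], dim=2)
T = T.permute(2, 0, 1).unsqueeze(0)            # -> (1, 2h, n, n) for Conv2d

def block(c_in, c_out):
    # 1x1 then unpadded 3x3: each block shrinks both spatial axes by 2
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3), nn.BatchNorm2d(c_out), nn.ReLU())

out = block(c, c)(block(2 * h, c)(T))          # -> (1, c, n - 4, n - 4)
summary = out.mean(dim=(2, 3))                 # global average pool -> (1, c)
```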
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Caser+ and CosRec+</title>
        <p>[Fig. 1: (a) vertical convolution with reshape; (b) horizontal convolution.]</p>
        <p>[Table 1: Results in terms of NDCG@K and MAP@K, with K ∈ {10, 20}. Bold denotes the best model for a dataset by the metric, underlined
the second best. † indicates a statistically significant result of the new model w.r.t. its original version
and * means statistically significant w.r.t. SASRec, based on a Wilcoxon test with p-value &lt; 0.05.]</p>
        <p>To obtain Caser+, we adapt both convolutional components. For the vertical filters, we introduce left-padding of (n − 1) so that the
n-sized kernels initially cover just the first item, then the first two, and so on. This adjustment yields an
output tensor of shape n × h × n_v, allowing a sequential processing of the input, as depicted in Fig. 1a.</p>
        <p>The horizontal convolutions require a similar treatment. Each horizontal kernel, spanning k × h
where k ∈ {1, . . . , n}, is left-padded with k − 1 placeholder elements. This setup allows each set of n_h filters to
produce an n × n_h matrix, from which we compute a cumulative maximum across the temporal axis
instead of reducing along that axis. This preserves the intended max-pooling behavior at each timestep
while retaining the full sequence length. Finally, we concatenate the vertical and horizontal outputs,
resulting in a combined representation of shape n × (h × n_v + n × n_h), as illustrated in Fig. 1b.</p>
        <p>[Fig. 2: pairwise embeddings, CNN blocks, and progressive average pooling in CosRec+.]</p>
        <p>For CosRec+, padding is introduced so that all intermediate tensors remain of shape n × n × c. Specifically, the 1 × 1 convolution
requires no padding, while the 3 × 3 convolution is padded so that the output resolution stays constant
across layers. Next, we redefine the average pooling strategy. Instead of a global average across the
entire 2D space, we accumulate averages progressively. Starting with the top-left corner, we compute
the mean of the 1 × 1 submatrix. Then we move on to the 2 × 2 top-left submatrix, and so on up to the
full n × n matrix. This yields a condensed n × c output. A summary of this process is given in Fig. 2.</p>
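        <p>The two sequential modifications can be sketched as follows. This is an illustration under assumed shapes, not the released implementation.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: n timesteps, h embedding dims, one k x h kernel.
n, h, k = 5, 8, 3
E = torch.randn(1, 1, n, h)                    # embedded input sequence

# Caser+: left-pad by (k - 1) so a k x h kernel first sees only item 1,
# then items 1..2, and so on; the output keeps the full length n.
conv = nn.Conv2d(1, 1, kernel_size=(k, h))
fmap = conv(F.pad(E, (0, 0, k - 1, 0))).squeeze(3)   # -> (1, 1, n)
pooled = torch.cummax(fmap, dim=2).values            # running max per timestep

# CosRec+: progressive average pooling over growing top-left submatrices
# instead of one global average (one channel of the padded n x n tensor).
M = torch.randn(n, n)
prog = torch.stack([M[:t, :t].mean() for t in range(1, n + 1)])
# prog[t-1] is the mean of the t x t top-left submatrix; prog[n-1] is the
# global mean, matching the original pooling at the final timestep.
```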
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        Our setup mirrors that of [
        <xref ref-type="bibr" rid="ref12">12</xref>
]: interactions are treated as implicit feedback, users with fewer
than five interactions are removed, and a leave-one-out split is used. We use three well-known
datasets—MovieLens 1M (ML-1M) [20], Foursquare Tokyo (FS-TKY), and Foursquare New York
City (FS-NYC) [21]. To address RQ2, we also compare against the attention-based SASRec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. All
experiments were conducted using the EasyRec toolkit [22].
      </p>
      <sec id="sec-3-1">
        <title>3.1. Comparison w.r.t. Caser &amp; CosRec</title>
        <p>We train all models for 2000 epochs and show their results in Table 1. The modified architectures
yield better scores than the baseline models across all metrics. For instance, Caser+ improves
NDCG@10 over Caser by 0.2251 on FS-TKY and by 0.0367 on ML-1M. Similarly, CosRec+ improves
NDCG@10 over CosRec by 0.2216 and 0.0103, respectively.</p>
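        <p>For reference, NDCG@K under the leave-one-out protocol used here (one held-out relevant item per user) reduces to a simple rank discount. The rankings below are hypothetical.</p>

```python
import math

# NDCG@K with a single relevant item, as in leave-one-out evaluation.
def ndcg_at_k(ranked_items, relevant_item, k=10):
    """1 / log2(rank + 1) if the held-out item is in the top-k, else 0."""
    top_k = ranked_items[:k]
    if relevant_item not in top_k:
        return 0.0
    rank = top_k.index(relevant_item) + 1      # 1-based position
    return 1.0 / math.log2(rank + 1)

assert ndcg_at_k([42, 7, 13], 42) == 1.0       # hit at rank 1: perfect score
assert ndcg_at_k([7, 42, 13], 42) == 1.0 / math.log2(3)
```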
        <p>[Fig. 3: NDCG@10 over 2000 training epochs. (a) Caser and Caser+ on ML-1M. (b) CosRec and CosRec+ on FS.]</p>
        <p>In Fig. 3, across all epochs, the enhanced models consistently outperform their respective baselines.
Notably, CosRec+ reaches convergence in approximately 1000 epochs and achieves an NDCG@10 of
0.471, while the original CosRec struggles to surpass 0.30. This is especially important in low-resource
settings where only a limited number of training epochs can be run.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Comparison with SASRec</title>
        <p>From Table 1, on FS-TKY, CosRec+ demonstrates a clear advantage over SASRec across nearly all
metrics, achieving up to a 0.1941 increase in NDCG@10. On ML-1M, SASRec still holds the edge overall,
but the gaps have noticeably narrowed—our models trail by at most 0.0404 in NDCG@10.</p>
        <p>Fig. 4a shows that while SASRec eventually surpasses both CosRec and CosRec+, the CNN-based
models achieve higher test performance during the first 250 to 500 epochs, with NDCG@10 reaching
0.3524 for SASRec, 0.4010 for CosRec, and 0.4746 for CosRec+. Similarly, Fig. 4b illustrates that although
SASRec converges faster on FS-TKY, Caser+ overtakes it after epoch 1000.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>This work demonstrates that appropriately modifying convolution-based sequential recommenders can
substantially enhance their performance. Although our findings are not yet definitive, they suggest
that CNN-based SRSs can surpass attention-based approaches on certain datasets and under specific
conditions. In future work, we plan to conduct a more extensive hyperparameter search to determine
whether these revised convolutional architectures can achieve even greater improvements.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by projects FAIR (PE0000013) and SERICS (PE00000014), under the
MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU, and
project NEREO (Neural Reasoning over Open Data), funded by the Italian Ministry of Education and
Research (PRIN) Grant no. 2022AEFHAZ.</p>
      <p>[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[15] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation
with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM
International Conference on Information and Knowledge Management, 2019, pp. 1441–1450.</p>
      <p>[16] X. Du, H. Yuan, P. Zhao, J. Qu, F. Zhuang, G. Liu, Y. Liu, V. S. Sheng, Frequency enhanced hybrid
attention network for sequential recommendation, in: Proceedings of the 46th International ACM
SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 78–88.</p>
      <p>[17] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM Comput.
Surv. 51 (2018). URL: https://doi.org/10.1145/3190616. doi:10.1145/3190616.</p>
      <p>[18] G. Di Teodoro, F. Siciliano, N. Tonellotto, F. Silvestri, A theoretical analysis of recommendation loss
functions under negative sampling, in: 2025 International Joint Conference on Neural Networks
(IJCNN), IEEE, 2025.</p>
      <p>[19] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: A survey and new
perspectives, ACM Comput. Surv. 52 (2019). URL: https://doi.org/10.1145/3285029. doi:10.1145/3285029.</p>
      <p>[20] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact.
Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.</p>
      <p>[21] D. Yang, D. Zhang, V. W. Zheng, Z. Yu, Modeling user activity preference by leveraging user
spatial temporal characteristics in LBSNs, IEEE Transactions on Systems, Man, and Cybernetics:
Systems 45 (2015) 129–142. doi:10.1109/TSMC.2014.2327053.</p>
      <p>[22] F. Betello, A. Purificato, F. Siciliano, G. Trappolini, A. Bacciu, N. Tonellotto, F. Silvestri, A
reproducible analysis of sequential recommender systems, IEEE Access (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purificato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Are convolutional sequential recommender systems still competitive? introducing new models and insights</article-title>
          , in: 2025
          <source>International Joint Conference on Neural Networks (IJCNN)</source>
          , IEEE,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          ,
          <article-title>Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions</article-title>
          ,
          <source>IEEE transactions on knowledge and data engineering 17</source>
          (
          <year>2005</year>
          )
          <fpage>734</fpage>
          -
          <lpage>749</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Finite rank-biased overlap (frbo): A new measure for stability in sequential recommender systems</article-title>
          ,
          <source>in: Proc. of the 14th Italian Information Retrieval Workshop</source>
          , volume
          <volume>3802</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Investigating the robustness of sequential recommender systems against training data perturbations</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sbandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Mitigating extreme cold start in graph-based recsys through re-ranking</article-title>
          ,
          <source>in: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>4844</fpage>
          -
          <lpage>4851</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <article-title>The role of fake users in sequential recommender systems</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purificato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Eco-aware graph neural networks for sustainable recommendations</article-title>
          ,
          <source>in: International Workshop on Recommender Systems for Sustainability and Social Good</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Integrating item relevance in training loss for sequential recommender systems</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1114</fpage>
          -
          <lpage>1119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hidasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Session-based Recommendations with Recurrent Neural Networks</article-title>
          ,
          <source>in: Proc. ICLR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Self-attentive sequential recommendation</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purificato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cassarà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Sheaf4Rec: Sheaf neural networks for graph-based recommender systems</article-title>
          ,
          <source>ACM Transactions on Recommender Systems</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Personalized top-n sequential recommendation via convolutional sequence embedding</article-title>
          ,
          <source>in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining</source>
          , WSDM '18, Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>573</lpage>
          . URL: https://doi.org/10.1145/3159652.3159656. doi:10.1145/3159652.3159656.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>CosRec: 2D convolutional neural networks for sequential recommendation</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>2173</fpage>
          -
          <lpage>2176</lpage>
          . URL: https://doi.org/10.1145/3357384.3358113. doi:10.1145/3357384.3358113.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>