<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IIR</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Caser+ and CosRec+: Closing the Gap Between CNNs and Attention Models in SRS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Siciliano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Purificato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Betello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Silvestri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Engineering Department, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>15</volume>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Sequential Recommender Systems (SRSs) have predominantly shifted toward neural-based models. Despite significant advances, Convolutional Neural Network (CNN)-based SRSs have been increasingly overshadowed by more powerful attention-based approaches. In this paper, we introduce a novel adaptation of two popular CNN-based SRSs, Caser and CosRec. We enhance their training by adjusting the convolution and pooling operations to process the entire input sequence simultaneously rather than focusing only on the most recent item. Experimental results show that these modified CNN-based models achieve improvements of up to +65% in NDCG@10 over their original versions. Code is available at https://github.com/antoniopurificato/recsys_conv_conf.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Sequential Recommendation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Background</title>
        <p>
          Sequential recommendation aims to predict the next item i_{n+1} based on a preceding sequence (i_1, . . . , i_n).
Directly training a model to output only the last element i_{n+1} can be inefficient for longer histories [17].
A more effective strategy, as adopted by sequence-to-sequence architectures [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], is to predict each
successive interaction: i_2 from (i_1), then i_3 from (i_1, i_2), and so forth [18]. In the neural
recommendation paradigm, each item is projected into a continuous embedding space [19], producing an input
representation as an n × h matrix, where n is the sequence length and h the embedding size. Classic recurrent-based recommenders like GRU4Rec [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] process one
timestep at a time. The hidden state at time t feeds into both the next recurrent cell and the output layer,
allowing information from (i_1, . . . , i_t) to accumulate and influence all future predictions.
Attention-based solutions like SASRec [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] use self-attention to evaluate all positions in the sequence
simultaneously. Masking restricts each timestep t so that it only sees past interactions (i_1, . . . , i_t).
        </p>
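        <p>The shifted-target scheme described above can be sketched as follows. This is an illustrative snippet, not the paper's code; the item ids are hypothetical.</p>

```python
# Sketch (assumed setup): building next-item targets for
# sequence-to-sequence training, as in SASRec-style recommenders.
def make_targets(sequence):
    """Given (i_1, ..., i_n), train on inputs (i_1, ..., i_{n-1})
    with targets (i_2, ..., i_n): position t predicts item t+1."""
    return sequence[:-1], sequence[1:]

inputs, targets = make_targets([3, 7, 7, 1, 9])
# inputs  = [3, 7, 7, 1]  (what the model sees at each step)
# targets = [7, 7, 1, 9]  (the next interaction to predict)
```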
        <p>
          CNNs, which originated in image processing, slide a convolutional filter across the input to extract
local patterns. Here we focus on two CNN-based recommenders: Caser [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and CosRec [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Caser and CosRec</title>
        <p>Caser applies two kinds of convolution. First, its vertical filters cover all n timesteps but only one
embedding dimension per filter. This yields a vector of size h × n_v, where n_v denotes the number of
vertical kernels. Second, horizontal convolutions use multiple filters with different temporal extents
k ∈ {1, . . . , n}. A kernel of shape k × h captures local patterns across k consecutive items. Each
horizontal filter produces a (n − k + 1)-long feature map, then a max pooling compresses each map into
a single value, giving an n_h-dimensional vector per extent. Concatenating all n pooled vectors yields a representation of length
n × n_h, which is then merged with the vertical features. The resulting vector of size (h × n_v) + (n × n_h)
feeds into a fully-connected layer to generate the final score for every potential next item.</p>
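        <p>These shapes can be traced in a minimal PyTorch sketch. This is an illustration of the dimensions above, not the authors' implementation; all hyperparameter values are invented.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: n timesteps, h embedding dims,
# n_v vertical kernels, n_h horizontal kernels per extent.
n, h, n_v, n_h, num_items = 5, 8, 4, 2, 100

E = torch.randn(1, 1, n, h)                   # one user's embedded sequence

vert = nn.Conv2d(1, n_v, kernel_size=(n, 1))  # covers all n timesteps
v = vert(E).view(1, -1)                       # -> (1, h * n_v)

h_feats = []
for k in range(1, n + 1):                     # temporal extents k = 1..n
    conv = nn.Conv2d(1, n_h, kernel_size=(k, h))
    fmap = conv(E).squeeze(3)                 # -> (1, n_h, n - k + 1)
    # global max over the temporal axis: one value per filter
    h_feats.append(F.max_pool1d(fmap, fmap.size(2)).squeeze(2))
hcat = torch.cat(h_feats, dim=1)              # -> (1, n * n_h)

z = torch.cat([v, hcat], dim=1)               # size (h * n_v) + (n * n_h)
scores = nn.Linear(z.size(1), num_items)(z)   # score for every candidate item
```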
        <p>CosRec follows a different design by first forming all possible pairs of embeddings. Specifically, it
constructs a 3D tensor of shape n × n × 2h, where each slice encodes the concatenated embeddings of an
item pair. This tensor is then passed through two convolutional blocks: each block contains a 1 × 1 and
a 3 × 3 convolution, followed by batch normalization and ReLU. With no padding, each block shrinks
the spatial dimensions by 2 on each axis, resulting in an (n − 4) × (n − 4) × c tensor at the end of the
pipeline, where c is the number of output channels. Finally, global average pooling across the first two dimensions produces a c-dimensional
summary vector, which is processed by a dense layer to obtain the output predictions.</p>
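        <p>The pairwise construction and the two shrinking blocks can be sketched as follows. This is a shape-level illustration under assumed sizes, not the original CosRec hyperparameters.</p>

```python
import torch
import torch.nn as nn

# Hypothetical sizes: n items, h embedding dims, c channels.
n, h, c = 7, 8, 16
x = torch.randn(n, h)                          # embedded sequence

# All item pairs: T[i, j] = concat(x_i, x_j)  -> shape (n, n, 2h)
T = torch.cat([x.unsqueeze(1).expand(n, n, h),
               x.unsqueeze(0).expand(n, n, h)], dim=2)
T = T.permute(2, 0, 1).unsqueeze(0)            # -> (1, 2h, n, n) for Conv2d

def block(c_in, c_out):
    # 1x1 then unpadded 3x3: each block shrinks both spatial axes by 2
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3), nn.BatchNorm2d(c_out), nn.ReLU())

out = block(c, c)(block(2 * h, c)(T))          # -> (1, c, n - 4, n - 4)
summary = out.mean(dim=(2, 3))                 # global average pool -> (1, c)
```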
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Caser+ and CosRec+</title>
        <p>[Fig. 1: (a) vertical convolution with reshape; (b) horizontal convolution.]</p>
        <p>[Table 1: Results in terms of NDCG@K and MAP@K, with K ∈ {10, 20}. Bold denotes the best model for a dataset by the metric, underlined
the second best. † indicates a statistically significant result of the new model w.r.t. its original version
and * means statistically significant w.r.t. SASRec, based on a Wilcoxon test with p-value &lt; 0.05.]</p>
        <p>To obtain Caser+, we adapt both convolutional components. For the vertical filters, we introduce left-padding of (n − 1) so that the
n-sized kernels initially cover just the first item, then the first two, and so on. This adjustment yields an
output tensor of shape n × h × n_v, allowing a sequential processing of the input, as depicted in Fig. 1a.</p>
        <p>The horizontal convolutions require a similar treatment. Each horizontal kernel, spanning k × h
where k ∈ {1, . . . , n}, is left-padded with k − 1 placeholder elements. This setup allows each set of n_h filters to
produce an n × n_h matrix, from which we compute a cumulative maximum across the temporal axis
instead of reducing along that axis. This preserves the intended max-pooling behavior at each timestep
while retaining the full sequence length. Finally, we concatenate the vertical and horizontal outputs,
resulting in a combined representation of shape n × (h × n_v + n × n_h), as illustrated in Fig. 1b.</p>
        <p>[Fig. 2: pairwise embeddings, CNN blocks, and progressive average pooling in CosRec+.]</p>
        <p>For CosRec+, padding is introduced so that all intermediate tensors remain of shape n × n × c. Specifically, the 1 × 1 convolution
requires no padding, while the 3 × 3 convolution is padded so that the output resolution stays constant
across layers. Next, we redefine the average pooling strategy. Instead of a global average across the
entire 2D space, we accumulate averages progressively. Starting with the top-left corner, we compute
the mean of the 1 × 1 submatrix. Then we move on to the 2 × 2 top-left submatrix, and so on up to the
full n × n matrix. This yields a condensed n × c output. A summary of this process is given in Fig. 2.</p>
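        <p>The two sequential modifications can be sketched as follows. This is an illustration under assumed shapes, not the released implementation.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: n timesteps, h embedding dims, one k x h kernel.
n, h, k = 5, 8, 3
E = torch.randn(1, 1, n, h)                    # embedded input sequence

# Caser+: left-pad by (k - 1) so a k x h kernel first sees only item 1,
# then items 1..2, and so on; the output keeps the full length n.
conv = nn.Conv2d(1, 1, kernel_size=(k, h))
fmap = conv(F.pad(E, (0, 0, k - 1, 0))).squeeze(3)   # -> (1, 1, n)
pooled = torch.cummax(fmap, dim=2).values            # running max per timestep

# CosRec+: progressive average pooling over growing top-left submatrices
# instead of one global average (one channel of the padded n x n tensor).
M = torch.randn(n, n)
prog = torch.stack([M[:t, :t].mean() for t in range(1, n + 1)])
# prog[t-1] is the mean of the t x t top-left submatrix; prog[n-1] is the
# global mean, matching the original pooling at the final timestep.
```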
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        Our setup mirrors that of [
        <xref ref-type="bibr" rid="ref12">12</xref>
]: interactions are treated as implicit feedback, users with fewer
than five interactions are removed, and a leave-one-out split is used. We use three well-known
datasets—MovieLens 1M (ML-1M) [20], Foursquare Tokyo (FS-TKY), and Foursquare New York
City (FS-NYC) [21]. To address RQ2, we also compare against the attention-based SASRec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. All
experiments were conducted using the EasyRec toolkit [22].
      </p>
      <sec id="sec-3-1">
        <title>3.1. Comparison w.r.t. Caser &amp; CosRec</title>
        <p>We train all models for 2000 epochs and show their results in Table 1. The modified architectures
yield better scores than the baseline models across all metrics. For instance, Caser+ improves
NDCG@10 over Caser by 0.2251 on FS-TKY and by 0.0367 on ML-1M. Similarly, CosRec+ improves
NDCG@10 over CosRec by 0.2216 and 0.0103, respectively.</p>
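        <p>For reference, NDCG@K under the leave-one-out protocol used here (one held-out relevant item per user) reduces to a simple rank discount. The rankings below are hypothetical.</p>

```python
import math

# NDCG@K with a single relevant item, as in leave-one-out evaluation.
def ndcg_at_k(ranked_items, relevant_item, k=10):
    """1 / log2(rank + 1) if the held-out item is in the top-k, else 0."""
    top_k = ranked_items[:k]
    if relevant_item not in top_k:
        return 0.0
    rank = top_k.index(relevant_item) + 1      # 1-based position
    return 1.0 / math.log2(rank + 1)

assert ndcg_at_k([42, 7, 13], 42) == 1.0       # hit at rank 1: perfect score
assert ndcg_at_k([7, 42, 13], 42) == 1.0 / math.log2(3)
```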
        <p>[Fig. 3: NDCG@10 over 2000 training epochs. (a) Caser and Caser+ on ML-1M. (b) CosRec and CosRec+ on FS.]</p>
        <p>In Fig. 3, across all epochs, the enhanced models consistently outperform their respective baselines.
Notably, CosRec+ reaches convergence in approximately 1000 epochs and achieves an NDCG@10 of
0.471, while the original CosRec struggles to surpass 0.30. This is especially important in low-resource
settings where only a limited number of training epochs can be run.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Comparison with SASRec</title>
        <p>From Table 1, on FS-TKY, CosRec+ demonstrates a clear advantage over SASRec across nearly all
metrics, achieving up to a 0.1941 increase in NDCG@10. On ML-1M, SASRec still holds the edge overall,
but the gaps have noticeably narrowed—our models trail by at most 0.0404 in NDCG@10.</p>
        <p>Fig. 4a shows that while SASRec eventually surpasses both CosRec and CosRec+, the CNN-based
models achieve higher test performance during the first 250 to 500 epochs, with NDCG@10 reaching
0.3524 for SASRec, 0.4010 for CosRec, and 0.4746 for CosRec+. Similarly, Fig. 4b illustrates that although
SASRec converges faster on FS-TKY, Caser+ overtakes it after epoch 1000.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>This work demonstrates that appropriately modifying convolution-based sequential recommenders can
substantially enhance their performance. Although our findings are not yet definitive, they suggest
that CNN-based SRSs can surpass attention-based approaches on certain datasets and under specific
conditions. In future work, we plan to conduct a more extensive hyperparameter search to determine
whether these revised convolutional architectures can achieve even greater improvements.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by projects FAIR (PE0000013) and SERICS (PE00000014), under the
MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU, and
project NEREO (Neural Reasoning over Open Data), funded by the Italian Ministry of Education and
Research (PRIN) Grant no. 2022AEFHAZ.</p>
      <p>[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[15] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation
with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM
International Conference on Information and Knowledge Management, 2019, pp. 1441–1450.</p>
      <p>[16] X. Du, H. Yuan, P. Zhao, J. Qu, F. Zhuang, G. Liu, Y. Liu, V. S. Sheng, Frequency enhanced hybrid
attention network for sequential recommendation, in: Proceedings of the 46th International ACM
SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 78–88.</p>
      <p>[17] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM Comput.
Surv. 51 (2018). URL: https://doi.org/10.1145/3190616. doi:10.1145/3190616.</p>
      <p>[18] G. Di Teodoro, F. Siciliano, N. Tonellotto, F. Silvestri, A theoretical analysis of recommendation loss
functions under negative sampling, in: 2025 International Joint Conference on Neural Networks
(IJCNN), IEEE, 2025.</p>
      <p>[19] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: A survey and new
perspectives, ACM Comput. Surv. 52 (2019). URL: https://doi.org/10.1145/3285029. doi:10.1145/3285029.</p>
      <p>[20] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact.
Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.</p>
      <p>[21] D. Yang, D. Zhang, V. W. Zheng, Z. Yu, Modeling user activity preference by leveraging user
spatial temporal characteristics in LBSNs, IEEE Transactions on Systems, Man, and Cybernetics:
Systems 45 (2015) 129–142. doi:10.1109/TSMC.2014.2327053.</p>
      <p>[22] F. Betello, A. Purificato, F. Siciliano, G. Trappolini, A. Bacciu, N. Tonellotto, F. Silvestri, A
reproducible analysis of sequential recommender systems, IEEE Access (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purificato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Are convolutional sequential recommender systems still competitive? introducing new models and insights</article-title>
          , in: 2025
          <source>International Joint Conference on Neural Networks (IJCNN)</source>
          , IEEE,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          ,
          <article-title>Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions</article-title>
          ,
          <source>IEEE transactions on knowledge and data engineering 17</source>
          (
          <year>2005</year>
          )
          <fpage>734</fpage>
          -
          <lpage>749</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Finite rank-biased overlap (frbo): A new measure for stability in sequential recommender systems</article-title>
          ,
          <source>in: Proc. of the 14th Italian Information Retrieval Workshop</source>
          , volume
          <volume>3802</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Investigating the robustness of sequential recommender systems against training data perturbations</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sbandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Mitigating extreme cold start in graph-based recsys through re-ranking</article-title>
          ,
          <source>in: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>4844</fpage>
          -
          <lpage>4851</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Betello</surname>
          </string-name>
          ,
          <article-title>The role of fake users in sequential recommender systems</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purificato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Eco-aware graph neural networks for sustainable recommendations</article-title>
          ,
          <source>in: International Workshop on Recommender Systems for Sustainability and Social Good</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Integrating item relevance in training loss for sequential recommender systems</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1114</fpage>
          -
          <lpage>1119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hidasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Session-based Recommendations with Recurrent Neural Networks</article-title>
          ,
          <source>in: Proc. ICLR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Self-attentive sequential recommendation</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purificato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cassarà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Sheaf4Rec: Sheaf neural networks for graph-based recommender systems</article-title>
          ,
          <source>ACM Transactions on Recommender Systems</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Personalized top-n sequential recommendation via convolutional sequence embedding</article-title>
          ,
          <source>in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining</source>
          , WSDM '18, Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>573</lpage>
          . URL: https://doi.org/10.1145/3159652.3159656. doi:10.1145/3159652.3159656.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>CosRec: 2D convolutional neural networks for sequential recommendation</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>2173</fpage>
          -
          <lpage>2176</lpage>
          . URL: https://doi.org/10.1145/3357384.3358113. doi:10.1145/3357384.3358113.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>