<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Influence of distance measures and data characteristics on time performance in content-based and collaborative filtering datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksandr Marchenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maksym Shevchenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>60 Volodymyrska Street, Kyiv, 01033</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>99</fpage>
      <lpage>108</lpage>
      <abstract>
        <p>This paper presents a comparative study on the time performance of various distance measures within the vector-space model, applied to content-based and collaborative filtering datasets. Euclidean distance, inner product, and cosine similarity were evaluated. A custom experimental framework was developed to assess these measures. The impact of dataset size, dimensionality, and the number of closest vectors returned by queries was analyzed.</p>
      </abstract>
      <kwd-group>
        <kwd>Chroma DB</kwd>
        <kwd>collaborative filtering</kwd>
        <kwd>content-based filtering</kwd>
        <kwd>distance measures</kwd>
        <kwd>recommender systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recommender systems have become an integral part of our daily lives, influencing decisions in
various domains. From suggesting movies and TV shows on platforms like Netflix, to recommending
products on e-commerce sites such as Amazon, and even curating personalized music playlists on
Spotify, these systems enhance user experience by providing tailored content based on individual
preferences. Unlike traditional recommender systems that operate behind the scenes, dialog-based
systems engage users in interactive conversations to gather preferences and provide
recommendations through natural language interfaces. This approach offers a more human-centric
and engaging way to receive personalized suggestions, potentially leading to improved user
satisfaction and increased adoption of recommendation-driven applications [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Recommender systems typically rely on two main approaches: content-based filtering and
collaborative filtering. Content-based filtering relies on the attributes of items (or their content itself)
and user profiles to suggest similar items [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. For instance, if a user has shown interest in science
fiction movies, the system will recommend other science fiction movies based on the features such
as genre, director, or actors. Collaborative filtering leverages the preferences and behaviors of
multiple users to generate recommendations. It assumes that if users A and B have similar tastes,
items liked by user A are likely to be appreciated by user B as well [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Both methods aim to discover
new content users might enjoy, but they differ in their underlying assumptions and data
requirements.
      </p>
      <p>
        One common technique to implement both filtering types is representing user profiles and items
as vectors. This method leverages distance or similarity measures between vectors to identify
recommendations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In content-based filtering, the system compares the vectorized user profile
with vectorized item profiles to find similar items, ensuring recommendations align closely with the
user's interests (Fig. 1, left). In collaborative filtering, vectorized user profiles are compared
with each other, identifying users with similar tastes to suggest items they have enjoyed (Fig. 1,
right).
      </p>
      <p>
        Generative models, such as GPT-like models, are widely used to develop chat-based applications
due to their ability to generate human-like text. However, they are not always the ideal solution for
recommender systems [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. These models are trained on extensive corpora that include a vast
amount of data unrelated to recommendations, which can dilute their effectiveness in this specific
context. Additionally, they may lack relevant data that the system aims to recommend, and their
training data may not be updated frequently, limiting their ability to provide recent items. To address
these limitations, it is possible to fine-tune the model on relevant, domain-specific data or to
incorporate Retrieval-Augmented Generation techniques.
      </p>
      <p>
        Retrieval-Augmented Generation (RAG) enhances the capabilities of generative models by
allowing them to access external knowledge sources [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], making them more effective for
recommendation tasks. In a RAG-based system, the model does not rely solely on
pretrained knowledge but instead retrieves relevant data dynamically from a connected database or
vector store [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, when a user engages with a dialog-based system, the model generates
a query based on the conversation. This query is then used to search the vector store, which contains
pre-encoded user and item profiles. By retrieving relevant vectors, such as items similar to the
user's preferences, the system can provide up-to-date and accurate recommendations. This
approach combines the strengths of generative models with real-time, domain-specific data, ensuring
the recommendations are both contextually relevant and current.
      </p>
      <p>
        While numerous investigations have focused on estimating the impact of distance measures on
recommendation accuracy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], only a few of them have considered the time performance of these
measures [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Time performance is a critical factor in production systems, which must efficiently
handle large volumes of data and serve numerous users simultaneously. Slow response times can
degrade the user experience and strain system resources, which is unacceptable in real-time
applications like recommender systems. Optimizing the time performance is essential to ensure that
recommender systems can deliver timely and relevant recommendations, maintaining a seamless
user experience even under heavy load. The goal of this study is to assess the time performance of
different distance measures within vector stores. The results could provide valuable insights not only
for improving recommender systems but also for any other domain that relies on vector stores, from
information retrieval to natural language processing and beyond.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The experiment methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Hardware and software setup</title>
        <p>The experiment was performed on a Windows 10-powered machine equipped with an AMD Ryzen
1500X 3.7 GHz processor, 16 GB of DDR4 3200 MHz RAM, and a 512 GB M.2 SSD. While the time
performance results may vary slightly based on the hardware configuration, the patterns and trends
observed in the experiment should remain consistent across different setups, ensuring the
generalizability of the results. All experiments were implemented using Python 3 (version 3.10.5), a
widely adopted programming language in machine learning research due to its simplicity and
extensive library support.</p>
        <p>
          Chroma DB [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was used as the vector store in this experiment for several reasons. First, it
provides a Python library, making it straightforward to integrate into the experimental pipeline.
Second, Chroma allows local data storage without the need for an external API, which helps
eliminate delays related to internet connectivity. Finally, it integrates with LangChain [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a popular
Python framework for building controllable agentic workflows, including RAG applications, which
enhances the flexibility and functionality of the experiment. Chroma natively supports three distance
measures [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: Squared L2 (Euclidean distance), inner product (IP) and cosine similarity (CS). In
Chroma, these distances are defined as follows [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]:
Squared L2: d = ∑ᵢ₌₁ⁿ (Aᵢ − Bᵢ)²; (1)
        </p>
        <p>Inner product: d = 1.0 − ∑ᵢ₌₁ⁿ (Aᵢ × Bᵢ); (2)
Cosine similarity: d = 1.0 − ∑ᵢ₌₁ⁿ (Aᵢ × Bᵢ) / (√∑ᵢ₌₁ⁿ Aᵢ² ∙ √∑ᵢ₌₁ⁿ Bᵢ²), (3)
where n is the dimension of the vectors, Aᵢ and Bᵢ are the coordinates of vectors A and B,
respectively. All these distances were estimated and compared.</p>
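For reference, the three distance definitions above can be written as a short NumPy sketch. This is purely illustrative (Chroma computes these measures internally); the function names are ours:

```python
import numpy as np

def squared_l2(a, b):
    """Distance (1): sum over i of (A_i - B_i)^2."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.sum((a - b) ** 2))

def ip_distance(a, b):
    """Distance (2): 1.0 - sum over i of (A_i * B_i)."""
    return float(1.0 - np.dot(a, b))

def cosine_distance(a, b):
    """Distance (3): 1.0 - dot(A, B) / (||A|| * ||B||)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Note that identical vectors have zero distance under all three measures, although for the inner-product distance this holds only for unit-length vectors.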
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Content-based filtering dataset</title>
        <sec id="sec-2-2-1">
          <title>For the content-based filtering experiment, the All the News 2.0 dataset [<xref ref-type="bibr" rid="ref5">5</xref>] was used</title>
          <p>
            This dataset contains nearly 2.7 million news articles and essays from 27 American publications,
spanning the years 2016 to 2020. Each record in the dataset includes various attributes such as
publication date, author, title, article text, URL, and more (Fig. 2). However, for content-based
filtering, only the article (as the text to embed), title (as a metadata field), and row index (as the
record ID) were used to populate the vector store.
The default Chroma embedding function (the all-MiniLM-L6-v2 model) [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] was utilized to
convert each article into a 384-dimensional vector. To ensure manageable computational time, the
dataset was divided into three subsets containing 25,000 (25K), 75,000 (75K), and 125,000 (125K)
articles. Larger subsets were not created due to the extended time required to embed large datasets.
          </p>
          <p>Queries to the vector store were executed using 100 pre-embedded texts, which were not part of
the created datasets, simulating user profiles. Additionally, experiments were also conducted on
generated random data.</p>
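The generated query data can be simulated with random vectors of the article-embedding dimensionality. The exact generation procedure is not specified in the experiment description, so the following is only a plausible sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 100 synthetic query vectors with the same dimensionality (384)
# as the all-MiniLM-L6-v2 article embeddings.
DIM, N_QUERIES = 384, 100
queries = rng.normal(size=(N_QUERIES, DIM))

# Unit-normalize so the random queries resemble sentence embeddings,
# which are typically unit-length.
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
```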
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Collaborative filtering dataset</title>
        <p>
          For the collaborative filtering experiment, three sizes of the MovieLens (ML) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] dataset were used,
with 100,000 (100K), 1 million (1M), and 10 million (10M) ratings. The larger datasets were not utilized
due to limitations in compute resources.
        </p>
        <p>The dataset includes user-item interactions, where each interaction is represented in the format:
user ID, movie ID, provided rating and timestamp (Fig. 3). Ratings are given on a 5-star scale with
half-star increments, and the timestamp was omitted.</p>
        <p>For collaborative filtering, the user-item interactions should be represented as a utility matrix,
where each row corresponds to a user, and the columns represent their interactions with all items
(movies) from the dataset. Thus, an additional transformation was done to convert the entire data
into utility matrix form (Fig. 4).</p>
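<p>The transformation from rating records to a utility matrix can be sketched as follows. The interaction triples below are hypothetical; the experiment itself used the MovieLens files:</p>

```python
import numpy as np

# Hypothetical (user_id, movie_id, rating) interactions.
ratings = [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0), (3, 30, 2.0)]

# Map user and movie IDs to row and column indices.
users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})
u_idx = {u: i for i, u in enumerate(users)}
m_idx = {m: j for j, m in enumerate(movies)}

# One row per user, one column per movie; unrated cells stay 0.
utility = np.zeros((len(users), len(movies)))
for u, m, r in ratings:
    utility[u_idx[u], m_idx[m]] = r
```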
        <p>The width of the utility matrix and, accordingly, the dimensionality of the vectors varies
depending on the number of items (movies) in the dataset. For the 100K dataset, the matrix included
610 users and 9,724 items; for the 1M dataset, it included 6,040 users and 3,706 items; and for the 10M
dataset, there were 69,878 users and 10,677 items. This variation allowed for an estimation of how
vector dimensionality impacts retrieval time. To query the created vector stores, 10 records were
taken from the utility matrix itself, and 90 records were generated to simulate additional users and
their interactions.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Methodology</title>
        <p>
          For each dataset and distance function, 10 queries were conducted using 10 input vectors. The
number of closest vectors returned by Chroma DB was parametrized, starting from 10 and increasing
in increments up to 3,010 vectors. A smaller number of returned vectors is practical when the vector
store directly powers a recommender system, providing a concise set of recommendations. On the
other hand, a larger number of returned vectors is beneficial when the query results serve as input
for subsequent stages in a recommendation pipeline, such as in hybrid recommender systems [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ],
where additional post-processing or combination with other algorithms may be required (Fig. 5).
        </p>
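<p>The measurement loop can be sketched with a brute-force nearest-neighbour search standing in for Chroma DB. Chroma uses an HNSW index internally, so absolute timings differ, but the structure of the experiment (10 input vectors, a varying number of returned results) is the same; all names and sizes below are illustrative:</p>

```python
import time
import numpy as np

def top_k(store, query, k):
    """Indices of the k closest stored vectors by squared L2 distance."""
    d = np.sum((store - query) ** 2, axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(1)
store = rng.normal(size=(1000, 384))    # toy stand-in for the vector store
queries = rng.normal(size=(10, 384))    # 10 input vectors, as in the methodology

# Vary the number of returned vectors and record the elapsed time.
timings = {}
for k in (10, 110, 510):
    start = time.perf_counter()
    for q in queries:
        top_k(store, q, k)
    timings[k] = time.perf_counter() - start
```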
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Due to the extensive volume of data generated during these experiments, only key results and trends
are presented in this paper. Therefore, readers interested in the raw data and detailed results can
access them through the provided link:
https://github.com/Gurdel/distances_comparison_public/blob/main/results_raw.xlsx.</p>
      <sec id="sec-3-1">
        <title>3.1. Results on the News dataset</title>
        <p>As mentioned in the previous chapter, the content-based filtering experiments were divided into two
groups: one using real data and the other using generated data. The time performance across both
groups was nearly identical, indicating that the nature of the data did not significantly impact the
computational efficiency. Additionally, all evaluated distance measures demonstrated similar
response times. The minor variations observed in the results can likely be attributed to experimental
error rather than any inherent difference in the algorithms.</p>
        <p>Figure 6 illustrates the increase in average response time across all experiments conducted on the
News datasets. Although the dataset size increased by 200% (from 25K to 75K) and 400% (from 25K
to 125K), the average response time for all distance measures rose by only 36.5% and 58.5%,
respectively (Table 1). This indicates that the system scales efficiently, maintaining relatively low
increases in response time even as the dataset size grows significantly.</p>
        <p>Figure 7 shows the increase in response time as the number of closest vectors returned by each
query increases. The results are shown for the IP measure on 25K, 75K, and 125K News datasets
using real data for queries and 75K News dataset using generated data for queries. The increased
response time at the beginning of the experiment with real data can be attributed to the internal
processes of Chroma DB, such as initial indexing and caching mechanisms. When the experiment
was repeated using generated query data, these spikes were absent, indicating that the system was
already optimized after the initial runs. Across all database sizes and distance measures, the trends
for response time remained almost linear, suggesting that the system scales predictably and
efficiently as the number of retrieved vectors grows.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Results on the MovieLens dataset</title>
        <p>Since the smallest collaborative filtering dataset, ML 100K, contains only 610 records (users),
querying the vector store for more than 610 results would simply return all available data. During
queries requesting between 10 to 510 results, the response time remained consistent across all
distance measures and did not increase. Table 2 contains the average response time for all MovieLens
datasets when querying up to 510 results. As expected, larger datasets resulted in longer response
times. However, even in this small sample, a minor difference in performance was observed among
the distance measures (Fig. 8). CS exhibited slightly worse response times compared to IP and L2,
suggesting that CS may introduce additional computational overhead in the context of these datasets.</p>
        <p>During experiments on the ML 1M dataset, all distance measures exhibited linear trends in time
increases (Fig. 9). CS demonstrated a slightly worse average response time (1.47 seconds) compared
to IP and L2 measures, which averaged 1.36 and 1.3 seconds, respectively. However, the differences
between the measures are marginal, particularly when a higher number of vectors are returned.</p>
        <p>In contrast to previous datasets, cosine similarity exhibited significantly worse performance on
the ML 10M dataset compared to the other distance measures. As shown in Figure 10, the response
time for CS increases more rapidly than for IP and L2, both of which follow almost linear trends.
While L2 had an average response time of 2.84 seconds and IP averaged 3.27 seconds, CS took 4.4
seconds on average. This can be explained by the fact that, although all these measures have similar
time complexity O(n), CS requires more elementary operations to be computed according to its
formula. This highlights that its performance degradation with high-dimensional data makes IP or
L2 preferable for high-efficiency scenarios where response time is critical.</p>
        <p>An important observation can be drawn from the comparison of response times between the ML
10M and News 75K datasets. Although these datasets have a comparable number of vectors
(approximately 70,000 and 75,000, respectively), the dimensionality of the ML dataset is significantly
higher (10,677 vs. 384). It was initially assumed that retrieval time on the ML data would be much
slower compared to the News dataset. However, while the average response time for all measures in
the News 75K dataset is 5.7 seconds, the ML 10M dataset showed only 3.5 seconds across all
experiments. This discrepancy can be explained by the sparsity inherent in collaborative filtering
data. With far more items than any individual user interacts with, the utility matrix has many empty
(or zero) cells, allowing for substantial optimization in distance computations. This reduces the actual
number of calculations needed, leading to faster retrieval times despite the higher dimensionality.</p>
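<p>The sparsity optimization described above can be illustrated by storing only the non-zero coordinates of each user vector, so that a dot product touches only the items both users actually rated. This is a simplified sketch of the kind of optimization a vector store can apply, not Chroma's actual implementation:</p>

```python
# Sparse representation: {column_index: rating} for non-zero cells only.
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product over the intersection of non-zero coordinates."""
    if len(b) < len(a):          # iterate over the smaller vector
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

# Two hypothetical users who rated only a handful of the 10,677 movies.
user_a = {3: 4.0, 42: 5.0, 100: 3.0}
user_b = {42: 4.5, 100: 2.0, 9000: 1.0}

# Only the shared items (42 and 100) contribute:
# 5.0 * 4.5 + 3.0 * 2.0 = 28.5
```

The number of multiplications is bounded by the smaller number of ratings, not by the full vector dimensionality, which is why the high-dimensional but sparse ML 10M vectors can be compared faster than the dense News embeddings.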
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The experiments conducted on both content-based and collaborative filtering datasets provided
valuable insights into the time performance of different distance measures. All compared distance
measures showed linear trends in response time as the number of returned closest vectors increased.
This consistent behavior across different datasets suggests that the system scales predictably, making
it reliable for scenarios where larger sets of vectors need to be retrieved.</p>
      <p>Across most datasets, cosine similarity, inner product, and Euclidean distance exhibited similar
response times, with only minor differences in computational efficiency. However, as data
dimensionality increased, significant variations in performance were observed. CS showed a notably
higher response time on the high-dimension datasets, particularly in the ML 10M dataset, where its
performance lagged behind IP and L2. This performance gap can be attributed to the higher number
of elementary operations required by CS compared to the other measures.</p>
      <p>Additionally, the comparison between the ML 10M and News 75K datasets highlighted the impact
of data sparsity on performance, with the ML dataset showing faster response times despite its higher
dimensionality. This underscores the importance of considering data characteristics when evaluating
distance measures.</p>
      <p>Moreover, an additional contribution of this study was the development of an experimental
methodology and accompanying software to systematically evaluate the performance of various
distance measures. This framework enabled consistent testing across datasets and can be extended
for future research, providing a valuable tool for benchmarking distance-based retrieval methods in
recommender systems.</p>
      <sec id="sec-4-1">
        <title>4.1. Further steps</title>
        <p>To further validate our findings and explore the scalability of different distance measures, it is
essential to conduct experiments on larger datasets. By incorporating datasets beyond those tested in
this study, it will be possible to better understand how different distance measures scale and whether
the observed trends hold on even bigger datasets. This would provide more comprehensive insights
into the scalability of vector stores and distance measures when applied to real-world, large-scale
recommender systems.</p>
        <p>To ensure the generalizability of our results, it is crucial to experiment with other vector stores
beyond Chroma. Different vector stores may have varying optimizations and internal processes that
could affect performance. By comparing multiple vector stores, we can identify the most efficient
options for different recommendation scenarios.</p>
        <p>Additionally, future research should explore the use of other similarity measures. While our study
focused on the measures natively supported by Chroma DB, comparing them with alternative measures will
provide a comprehensive understanding of their performance characteristics and help identify the
most suitable measures for specific applications.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alcaraz</surname>
          </string-name>
          ,
          <article-title>Blending Fine-Tuning and RAG for Collaborative Filtering with LLMs</article-title>
          , 2023. URL: https://ai.plainenglish.io/blending-fine-tuning-and-rag-for-collaborative-filtering-with-llms3d71858485e4.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Improving Rocchio Algorithm for Updating User Profile in Recommender Systems</article-title>
          . In: X. Lin, Y. Manolopoulos, D. Srivastava, G. Huang (eds.),
          <source>Web Information Systems Engineering - WISE 2013. Lecture Notes in Computer Science</source>
          , vol.
          <volume>8180</volume>
          . Springer, Berlin, Heidelberg,
          <year>2013</year>
          . doi: https://doi.org/10.1007/978-3-642-41230-1_14.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chroma</surname>
          </string-name>
          , Embeddings, n.d. URL: https://docs.trychroma.com/guides/embeddings.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Chroma</surname>
          </string-name>
          , Usage Guide, n.d. URL: https://docs.trychroma.com/guides.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Components</surname>
          </string-name>
          ,
          <source>All the News 2.0</source>
          : 2.7 million news articles and essays from 27 American publications, n.d. URL: https://components.one/datasets/all-the-news-2-news-articles-dataset.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pramod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bafna</surname>
          </string-name>
          ,
          <article-title>Conversational recommender systems techniques, tools, acceptance, and adoption: A state of the art review</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>203</volume>
          (
          <year>2022</year>
          ) 117539, ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2022.117539.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fkih</surname>
          </string-name>
          ,
          <article-title>Similarity measures for Collaborative Filtering-based Recommender Systems: Review and experimental comparison</article-title>
          .
          <source>Journal of King Saud University - Computer and Information Sciences</source>
          <volume>34</volume>
          (
          <issue>9</issue>
          ) (
          <year>2022</year>
          )
          <fpage>7645</fpage>
          -
          <lpage>7669</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] GroupLens, MovieLens, n.d. URL: https://grouplens.org/datasets/movielens/.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Joy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Renumol</surname>
          </string-name>
          ,
          <article-title>Comparison of Generic Similarity Measures in E-learning Content Recommender System in Cold-Start Condition</article-title>
          .
          <source>In: Proceedings of the IEEE Bombay Section Signature Conference</source>
          ,
          <year>December 2020</year>
          . Bombay, India,
          <year>2020</year>
          ,
          <fpage>175</fpage>
          -
          <lpage>179</lpage>
          . doi: 10.1109/IBSSC51096.2020.9332162.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Saranya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Sadasivam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chandralekha</surname>
          </string-name>
          ,
          <article-title>Performance Comparison of Different Similarity Measures for Collaborative Filtering Technique</article-title>
          .
          <source>Indian Journal of Science and Technology</source>
          <volume>9</volume>
          (
          <issue>29</issue>
          ) (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>LangChain</surname>
          </string-name>
          , Vector stores, n.d. URL: https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Baxla</surname>
          </string-name>
          ,
          <article-title>Comparative study of similarity measures for item based top n recommendation</article-title>
          .
          <source>Unpublished thesis</source>
          (Bachelor in Computer Science),
          <source>National Institute of Technology Rourkela</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wijewickrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Petras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dias</surname>
          </string-name>
          ,
          <article-title>Selecting a text similarity measure for a content-based recommender system: A comparison in two corpora</article-title>
          .
          <source>The Electronic Library</source>
          <volume>37</volume>
          (
          <issue>3</issue>
          ) (
          <year>2019</year>
          )
          <fpage>506</fpage>
          -
          <lpage>527</lpage>
          . doi: https://doi.org/10.1108/EL-08-2018-0165.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <article-title>Hybrid Web Recommender Systems</article-title>
          . In: Brusilovsky,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Kobsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Nejdl</surname>
          </string-name>
          , W. (eds.),
          <source>The Adaptive Web. Lecture Notes in Computer Science</source>
          , vol.
          <volume>4321</volume>
          , pp. 377-408. Springer, Berlin, Heidelberg,
          <year>2007</year>
          . doi: https://doi.org/10.1007/978-3-540-72079-9_12.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>van Meteren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Someren</surname>
          </string-name>
          ,
          <source>Using Content-Based Filtering for Recommendation</source>
          ,
          <year>2000</year>
          . URL: https://users.ics.forth.gr/~potamias/mlnia/paper_6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <collab>Restack</collab>
          ,
          <source>Recommendation Systems Using RAG</source>
          ,
          <year>2024</year>
          . URL: https://www.restack.io/p/recommendation-systems-answer-using-rag-cat-ai.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Belhaouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fareed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Halim</surname>
          </string-name>
          ,
          <article-title>A collaborative filtering recommendation framework utilizing social networks</article-title>
          .
          <source>Machine Learning with Applications</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.-B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <article-title>Integrating Triangle and Jaccard similarities for recommendation</article-title>
          .
          <source>PLoS ONE</source>
          <volume>12</volume>
          (
          <issue>8</issue>
          ) (
          <year>2017</year>
          )
          <elocation-id>e0183570</elocation-id>
          . doi: https://doi.org/10.1371/journal.pone.0183570
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          ,
          <article-title>Understanding Biases in ChatGPT-based Recommender Systems: Provider Fairness, Temporal Stability, and Recency</article-title>
          . ArXiv, abs/2401.10545 (
          <year>2024</year>
          ), 27 pages
          . doi: https://doi.org/10.48550/arXiv.2401.10545.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>