<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conversational Recommender Systems based on Extracting Implicit Preferences with Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Woo-Seok Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wooseung Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hye-Jin Jeong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suwon Lee</string-name>
          <email>leesuwon@gnu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chie Hoon Song</string-name>
          <email>chsong01@gnu.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sang-Min Choi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Gyeongsang National University</institution>
          ,
          <addr-line>501, Jinju-daero, Jinju-si, Gyeongsangnam-do</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Management of Technology, Gyeongsang National University</institution>
          ,
          <addr-line>501, Jinju-daero, Jinju-si, Gyeongsangnam-do</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Conversational recommender systems (CRS) have gained significant attention for their ability to provide personalized recommendations through conversational interfaces. By understanding user preferences, CRS are increasingly being used in fields such as e-commerce, entertainment, and customer service. Large Language Models (LLMs) hold potential for recommendation systems due to their ability to understand and generate text, as well as their generalization and reasoning capabilities. In this paper, we propose a novel method that leverages LLMs to extract implicit information from conversations and explicitly incorporate it into recommendations. Our approach focuses on extracting implicit information, such as user-preferred categories, from conversations and explicitly adding it to the recommendation process to enhance performance. We utilize the Reddit-movie dataset, which provides rich conversational data, to extract users' implicitly preferred movie genres from conversations and explicitly incorporate this information into the conversation to recommend movies. Experimental results show that both the GPT-3.5-turbo and GPT-4 models perform exceptionally well at identifying user preferences and providing accurate recommendations. These findings demonstrate that utilizing implicit information extracted from conversations can effectively enhance recommendation quality, highlighting the potential of LLMs in conversational recommender systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Conversational Recommender Systems</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Implicit User Preference</kwd>
        <kwd>Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Conversational Recommender Systems (CRS) offer personalized recommendations by engaging in
direct conversations with users through conversational interfaces. These systems typically utilize
users’ past behavior data, explicit feedback, and information gathered during the conversation to make
recommendations [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ]. However, users’ needs are complex and ever-changing, which makes effectively understanding
and adapting to them a significant challenge [3]. A CRS must therefore not only detect these changes
but also understand the ambiguous and implicit preferences in user dialogue, continuously identifying
both the explicit requirements and implicit preferences of users during the conversation to provide the
most useful recommendations.
      </p>
      <p>With the advancement of Large Language Models (LLMs), the models are being actively utilized in
various fields [ 4, 5]. LLMs not only excel in understanding and generating text but also hold potential
as recommender systems through their generalization and reasoning abilities. For instance, LLMs can
be employed to generate new items that users might prefer by analyzing user reviews or conversation
logs [6].</p>
      <p>Recently, methods for leveraging LLMs in the CRS domain have also been proposed [7]. Integrating
LLMs into CRS can provide a more natural and flexible conversational interface, effectively uncovering
users’ hidden needs. For example, LLMs can precisely interpret ambiguous needs expressed by users
during a conversation and provide more sophisticated recommendations based on this understanding [3].</p>
      <p>In this paper, we propose a novel method that converts users’ implicit preferences within conversations
into explicit ones using LLMs. Additionally, we use the converted explicit information as labels to
create a multi-label classification model. We extract user preferences from conversations to explicitly
identify categorical information. We then use a classification model to quantify this information,
effectively reconstructing the conversations. The reconstructed conversations include explicit preference
information. By leveraging both categorical and quantitative information, we enhance recommendation
accuracy.</p>
      <p>Based on our approach, we can reveal the hidden preferences contained in a user’s conversations.
Moreover, we leverage the extracted information and an LLM to show that more accurate
recommendations can be derived when explicit information, such as numerical preferences, is included
in the conversation. Our experimental results also indicate that the integration of CRS and LLMs is a
crucial step toward the development of user-centric recommendation systems and suggests possibilities
for future research.</p>
      <p>The contributions of this paper are as follows:
• We propose a method to explicitly extract implicit preferences within conversations.
• We further suggest using the extracted preferences to design a multi-label model that quantifies
categorical data.
• Experimental results demonstrate that our proposed approach significantly enhances the
performance of CRS.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In general, a CRS understands the context and flow of conversations to offer appropriate recommendations
without explicit user statements [2]. This understanding is achieved using natural language processing
(NLP) techniques and reinforcement learning methods [8, 9]. By analyzing user responses in real time,
a CRS enhances recommendation accuracy and continuously learns to offer increasingly personalized
suggestions.</p>
      <p>LLMs can be used to analyze user conversations in CRS since they take textual information as input
and output related text. For this reason, research on utilizing LLMs for recommendation
systems has recently garnered significant attention [4, 6]. LLMs, trained on vast amounts of textual data, exhibit
advanced language understanding and generation capabilities, performing exceptionally well across
various domains. In recommendation systems, LLMs can deeply understand users’ linguistic expressions
and contexts, enabling more sophisticated recommendations. For example, LLMs analyze user-written
reviews or posts to discern preferences and provide personalized recommendations accordingly [10].
Additionally, LLMs leverage various forms of textual metadata related to recommended items, effectively
addressing the cold start problem for new users or items with sparse initial data [11, 12].</p>
      <p>Studies on integrating LLMs into CRS cover various aspects. LLMs excel in understanding
conversational contexts and discerning user intentions. By utilizing these capabilities, CRS can offer more
natural and human-like conversational interfaces. For example, CRS powered by LLMs can detect
subtle emotional shifts or changes in user interests during conversations and provide corresponding
recommendations in real-time [3]. Recent research has proposed methods to further personalize user
interactions through LLMs, incorporating real-time feedback to enhance recommendation accuracy and
user satisfaction. These studies contribute to overcoming the limitations of traditional recommendation
systems by integrating the robust language understanding capabilities of LLMs, ultimately delivering a
superior user experience. In this paper, we leverage the characteristics of these LLMs to extract features
containing user preferences from dialogues. Furthermore, we propose a recommendation model that
reflects the extracted features.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Our Approach</title>
      <p>In this section, we propose a method to utilize users’ implicit preferences explicitly. Fig. 1 shows
the overall process of our approach. In this example, we use the movie domain. First, we extract the
user’s preferred genres implicitly expressed within the conversations using the LLM. Then, we train a
multi-label classifier using the user’s conversations as input and the extracted genres as labels. When
the trained classifier receives the conversation as input, it outputs the predicted genre labels. These
labels are then explicitly added to the end of the conversation, which is inputted into the LLM along
with a prompt for movie recommendations. Finally, the LLM recommends a list of movies based on the
prompt.</p>
      <p>Our methodology consists of three main steps. The first step is to extract the user’s preferences from
the conversation and the second step is to train a multi-label classification model using the conversation
and the extracted preferences. The last step is to make recommendations using the conversation and
the classification model.</p>
      <sec id="sec-3-1">
        <title>3.1. Extraction of Implicit Preferences within Conversation</title>
        <p>We define the user conversation as C and the preference information extracted by the LLM from C
as G. Namely, G represents the item features, such as genres, that are positively expressed by a user u in
C. For example, if a user u shows a positive reaction to a particular movie genre in the conversation,
that genre can be considered part of G.</p>
        <p>To leverage the LLM for preference extraction, we configure prompts suited to this task.
Prompts are generated by combining the user’s conversation content with an instruction, structured as
shown in Fig. 2. For instance, an instruction such as “you reply me with user’s genre preference within
[Action, Adventure, Animation, . . . , Thriller, War, Western]" is included in the prompt along with the
conversation (C). Based on this prompt, the LLM extracts the genres (G) that the user is likely to prefer.
The movie genres referenced are based on the genre list from IMDb, which includes 25 genres.</p>
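        <p>The extraction step above can be sketched in Python as follows. This is a minimal illustration, not the authors’ exact prompt: the genre list is a placeholder subset of the 25 IMDb genres, and the instruction wording only approximates the one shown in Fig. 2.</p>

```python
# Sketch of the preference-extraction prompt (Section 3.1).
# GENRES is an illustrative subset of the 25 IMDb genres; the instruction
# wording approximates, but is not, the exact prompt in Fig. 2.
GENRES = ["Action", "Adventure", "Animation", "Comedy", "Romance",
          "Thriller", "War", "Western"]

def build_extraction_prompt(conversation, genres=GENRES):
    """Combine the instruction with the user conversation C so the LLM
    replies only with genres (G) the user expresses a liking for."""
    instruction = ("You reply with the user's genre preference within "
                   "[" + ", ".join(genres) + "].")
    return instruction + "\n\nConversation:\n" + conversation

prompt = build_extraction_prompt("I loved the last romantic comedy I watched!")
```

        <p>The returned string is sent to the LLM as-is; the reply is then parsed as the genre labels G.</p>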
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Quantifying Extracted Preferences</title>
        <p>We propose a method to quantify the extracted preferences by applying a multi-label classification
model. Using the conversation (C) as input and the user’s preferred genres (G) as labels, a multi-label
classification model is created. The model architecture is shown in Fig. 3. We first utilize BERT [13] to
embed the conversation; the BERT embedding converts C into vector form, making it understandable
for the model. These embedded values pass through three linear layers, each further refining the
understanding of the conversation and extracting important features. Finally, the output is generated
through a sigmoid layer with 25 units, one per genre. This layer outputs a probability value for
each genre to predict the user’s preferred genres.</p>
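        <p>The forward pass of this head can be sketched as follows, using NumPy with randomly initialized weights. The 768-dimensional input (a BERT [CLS] embedding), the hidden sizes 256 and 64, and the ReLU activations between the linear layers are illustrative assumptions; the paper does not specify them.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed dimensions: 768-d BERT [CLS] embedding in, 25 genre probabilities out.
# Hidden sizes (256, 64) and the ReLU activations are illustrative assumptions.
sizes = [768, 256, 64, 25]
layers = [(rng.normal(0.0, 0.02, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def predict_genre_probs(embedding):
    """Three linear layers followed by a 25-unit sigmoid output layer."""
    h = embedding
    for w, b in layers[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # hidden layers with ReLU
    w, b = layers[-1]
    return sigmoid(h @ w + b)           # one independent probability per genre

probs = predict_genre_probs(rng.normal(size=768))
```

        <p>Because each output unit has its own sigmoid, the 25 probabilities are independent, which is what allows several genres to exceed the threshold at once.</p>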
        <p>Labels with predicted probabilities exceeding a threshold are defined as the quantified preferred
genres Gq of user u. We suppose that quantifying the extracted preferences helps to clearly
understand the user’s preferences through comparison among the features. For this reason, Gq is
designed to enable quantitative comparison by attaching a numerical value, the predicted probability,
to the categorical values of G. For example, if the user’s preferred genres are predicted as
[Romance, Comedy], these probabilities can be included in the conversation to indicate numerically
how much the user prefers each genre.</p>
        <p>To utilize Gq in CRS, we explicitly append Gq to C. For instance, it can be added as follows:
“My favorite genres are [Comedy: 0.9814, Romance: 0.8694].” This restructured conversation is termed
C+Gq, indicating the original conversation (C) with the user’s preferred genres and their
prediction probabilities explicitly added. This makes it clear, in numerical terms, how strongly each
preference is reflected.</p>
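        <p>The restructuring of the conversation into its augmented form (the original conversation plus quantified genre preferences) can be sketched as below. The 0.5 threshold and the exact sentence template are assumptions for illustration, mirroring the example above.</p>

```python
def augment_conversation(conversation, genre_probs, threshold=0.5):
    """Append the quantified genres Gq to the conversation C, keeping only
    labels whose predicted probability exceeds the threshold."""
    kept = sorted(((g, p) for g, p in genre_probs.items() if p > threshold),
                  key=lambda gp: -gp[1])  # most-preferred genre first
    rendered = ", ".join("{}: {:.4f}".format(g, p) for g, p in kept)
    return conversation + " My favorite genres are [" + rendered + "]."

out = augment_conversation("Any suggestions for tonight?",
                           {"Comedy": 0.9814, "Romance": 0.8694, "War": 0.12})
# out: "Any suggestions for tonight? My favorite genres are [Comedy: 0.9814, Romance: 0.8694]."
```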
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Recommendation Process</title>
        <p>Finally, we construct the prompt used as input to the LLM. The prompt consists of an instruction
and the conversation containing the user’s preferred genres. The instruction comprises sentences
directing the LLM to recommend the top-k movies, followed by the user conversation. The conversation
also includes the predicted preferred genres (Gq). Using this prompt, the LLM can recommend the
top-k movies based on a user conversation that includes explicit preferences.</p>
        <p>The prompt for movie recommendations is shown in Fig. 4. For example, the instruction specifies the
role of the LLM as a recommendation system tasked with recommending a total of 20 movies, along with
details such as the number of recommendations. In addition, C+Gq includes information from the
previous process, such as [Comedy: 0.9814, Romance: 0.8694], as well as the user’s requirements. In response
to the prompt, the LLM recommends 20 movies that are slightly more focused on comedy than romance.</p>
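        <p>Assembling this recommendation prompt can be sketched as follows. The instruction wording here is illustrative rather than the exact text of Fig. 4, apart from the explanatory phrase quoted in Section 4.1.2.</p>

```python
def build_recommendation_prompt(augmented_conversation, k=20):
    """Top-k recommendation instruction followed by the augmented
    conversation (the original conversation plus quantified genres)."""
    instruction = (
        "You are a movie recommendation system. Recommend the top {} movies "
        "for the user. The parentheses at the end of the conversation are in "
        "the format 'User's preferred genre: value'. The higher the value, "
        "the more preferred the genre.".format(k))
    return instruction + "\n\n" + augmented_conversation
```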
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We conduct experiments using the GPT-3.5-turbo and GPT-4 models and the Reddit-movie dataset
[7]. We process the dataset to reconstruct C+G and C+Gq, and the performance of movie
recommendations is compared using these two types of conversation data.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Datasets and Evaluation Metrics</title>
          <p>We use the Reddit-movie dataset, which is based on actual user conversations from Reddit and
showcases the personalized tendencies of users. It includes a variety of conversations and personalized
preferences, making it suitable for CRS. We utilize the metrics Recall@K, NDCG@K, and MRR@K to
evaluate performance in our experiments [14], with K set to 1, 5, 10, and 20. These metrics are widely
used indicators for evaluating the performance of recommendation systems, measuring the accuracy of
the model at each value of K.</p>
          <p>• Recall@K: Measures the probability that an actually preferred item is included in the top-k
recommendations.
• NDCG@K: Evaluates the quality of the recommendation list, taking the order of the recommended
items into account.
• MRR@K: Based on the rank of the first correctly recommended item, with higher ranks receiving
higher scores.</p>
          <p>Through these metrics, the performance of the movie recommendation lists generated by each
GPT model can be quantitatively evaluated.</p>
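          <p>These three metrics can be implemented as a straightforward sketch, where <monospace>recommended</monospace> is the ranked list produced by the LLM and <monospace>relevant</monospace> is the set of ground-truth items:</p>

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top-k recommendations."""
    top = recommended[:k]
    return sum(1 for item in relevant if item in top) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Rank-aware list quality: hits are discounted by log2 of their position."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def mrr_at_k(recommended, relevant, k):
    """Reciprocal rank of the first relevant item within the top-k list."""
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0
```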
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Baselines</title>
          <p>The baseline for the experiment is to recommend movies that the user might prefer based on C
using the LLM. The LLM uses a prompt consisting of C and instructions to recommend 20 movies that
the user might prefer. In the comparative experiment, C is converted into C+G and C+Gq,
and the LLM recommends 20 movies in the same manner. For C+G, as shown in Fig. 5, G is
obtained through the LLM without using a multi-label classifier. Additionally, the prompt that converts
C to C+Gq includes the phrase “The parentheses at the end of the conversation are in the format
‘User’s preferred genre: value’. The higher the value, the more preferred the genre."</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>4.1.3. Implementation Details</title>
          <p>The multi-label classifier is set up by using the labels G generated by the LLM and C as inputs,
creating a model for each version of GPT. The training data is split into 80% for training and 20% for
validation. The classifier is trained until both the training loss and the validation loss are below 0.10.</p>
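          <p>The data split can be done with a simple helper like the following minimal sketch; the shuffle seed is an arbitrary choice for reproducibility, not taken from the paper.</p>

```python
import random

def split_train_val(examples, val_ratio=0.2, seed=42):
    """Shuffle (conversation, label) pairs and split them 80/20."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1.0 - val_ratio))
    return items[:cut], items[cut:]

train, val = split_train_val(range(100))
# len(train) == 80, len(val) == 20
```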
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>The experimental results are summarized in Table 1 for each model. When we utilize the GPT-3.5-turbo
model, we can observe that the performance of C+Gq surpasses that of using only C on some metrics;
however, on the other metrics, the performance of the original C is superior.</p>
        <p>In the case of C+G, we can observe that all metrics show lower performance compared to C.
For the GPT-4 model, we find that the performance of C+G exceeds that of C on all metrics. Similarly,
C+Gq also shows improved performance in all aspects compared to C. Thus, we confirm that
as the performance of the GPT model improves, explicitly expressing user preferences in conversations
enhances the performance of the CRS. Furthermore, this indicates that adding quantitative values to
categorical data through the proposed multi-label classifier improves recommendation performance
more than expressing preferences only categorically.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we have proposed a method for improving the performance of conversational recommender
systems (CRS) by extracting implicitly expressed user preferences from conversations using large
language models (LLMs) and then adding these preferences explicitly. Additionally, we have suggested
enhancing CRS performance by using a multi-label classifier to add quantitative values to categorical
preferences. We have also highlighted the performance differences between the GPT-3.5-turbo and
GPT-4 models, demonstrating that recommender systems leveraging the latest models yield better
results. In particular, our experiments using the GPT-4 model showed that both C+G and C+Gq
outperform the original C.</p>
      <p>Despite these contributions to recommendation performance, there remains room for improvement
in several respects. Our study is confined to the Reddit-movie dataset; additional research is required
to validate the generalizability of the methodology across different domains and datasets, such as
music, books, and food. Although we have used the GPT-3.5-turbo and GPT-4 models, further
experiments with other or more advanced LLMs are needed to identify performance differences across
models and determine the optimal one. Subsequent research can contribute to developing more
sophisticated and versatile recommender systems applicable in a wider range of scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The following are results of a study on the "Leaders in Industry-university Cooperation 3.0" Project supported
by the Ministry of Education and National Research Foundation of Korea (NRF), the "Regional Innovation
Strategy (RIS)" through the NRF funded by the Ministry of Education (MOE) (2021RIS-003), and the NRF
grant funded by the Korea government (MSIT) (No. RS-2022-00165785).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] Y. Sun, Y. Zhang, Conversational recommender system, in: The 41st International ACM SIGIR
Conference on Research &amp; Development in Information Retrieval, SIGIR ’18, Association for
Computing Machinery, New York, NY, USA, 2018, pp. 235–244. doi:10.1145/3209978.3210002.
[3] G. Zhang, User-centric conversational recommendation: Adapting the need of user with large
language models, in: Proceedings of the 17th ACM Conference on Recommender Systems,
RecSys ’23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 1349–1354.
doi:10.1145/3604915.3608885.
[4] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye,
Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, X. Xie, A survey on evaluation of large language models,
ACM Trans. Intell. Syst. Technol. 15 (2024). doi:10.1145/3641289.
[5] L. Wu, Z. Zheng, Z. Qiu, H. Wang, H. Gu, T. Shen, C. Qin, C. Zhu, H. Zhu, Q. Liu, H. Xiong, E. Chen,
A survey on large language models for recommendation, 2024. arXiv:2305.19860.
[6] W. Wei, X. Ren, J. Tang, Q. Wang, L. Su, S. Cheng, J. Wang, D. Yin, C. Huang, LLMRec: Large
language models with graph augmentation for recommendation, in: Proceedings of the 17th ACM
International Conference on Web Search and Data Mining, WSDM ’24, Association for Computing
Machinery, New York, NY, USA, 2024, pp. 806–815. doi:10.1145/3616855.3635853.
[7] Z. He, Z. Xie, R. Jha, H. Steck, D. Liang, Y. Feng, B. P. Majumder, N. Kallus, J. McAuley, Large
language models as zero-shot conversational recommenders, in: Proceedings of the 32nd ACM
International Conference on Information and Knowledge Management, CIKM ’23, Association
for Computing Machinery, New York, NY, USA, 2023, pp. 720–730. doi:10.1145/3583780.3614949.
[8] L. Zou, L. Xia, P. Du, Z. Zhang, T. Bai, W. Liu, J.-Y. Nie, D. Yin, Pseudo Dyna-Q: A reinforcement
learning framework for interactive recommendation, in: Proceedings of the 13th International
Conference on Web Search and Data Mining, WSDM ’20, Association for Computing Machinery,
New York, NY, USA, 2020, pp. 816–824. doi:10.1145/3336191.3371801.
[9] Y. Deng, Y. Li, F. Sun, B. Ding, W. Lam, Unified conversational recommendation policy learning
via graph-based reinforcement learning, in: Proceedings of the 44th International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR ’21, Association for
Computing Machinery, New York, NY, USA, 2021, pp. 1431–1441. doi:10.1145/3404835.3462913.
[10] F. Yang, Z. Chen, Z. Jiang, E. Cho, X. Huang, Y. Lu, PALR: Personalization aware LLMs for
recommendation, 2023. arXiv:2305.07622.
[11] S. Sanner, K. Balog, F. Radlinski, B. Wedin, L. Dixon, Large language models are competitive
near cold-start recommenders for language- and item-based preferences, in: Proceedings of
the 17th ACM Conference on Recommender Systems, RecSys ’23, Association for Computing
Machinery, New York, NY, USA, 2023, pp. 890–896. doi:10.1145/3604915.3608845.
[12] S. Agrawal, J. Trenkle, J. Kawale, Beyond labels: Leveraging deep learning and LLMs for content
metadata, in: Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23,
Association for Computing Machinery, New York, NY, USA, 2023, p. 1. doi:10.1145/3604915.3608883.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
doi:10.18653/v1/N19-1423.
[14] A. Said, A. Bellogín, Comparative recommender system evaluation: benchmarking recommendation
frameworks, in: Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14,
Association for Computing Machinery, New York, NY, USA, 2014, pp. 129–136.
doi:10.1145/2645710.2645746.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Christakopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <article-title>Towards conversational recommender systems</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '16, Association for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , pp.
          <fpage>815</fpage>
          -
          <lpage>824</lpage>
          . doi:10.1145/2939672.2939746.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>