Health Document Presentation in Patient-Centered Recommender Systems with Carousel Interfaces

Behnam Rahdari, School of Computing and Information, University of Pittsburgh, Pittsburgh, USA

Peter Brusilovsky (peterb@pitt.edu), School of Computing and Information, University of Pittsburgh, Pittsburgh, USA

Daqing He, School of Computing and Information, University of Pittsburgh, Pittsburgh, USA

Khushboo Thaker (k.thaker@pitt.edu), School of Computing and Information, University of Pittsburgh, Pittsburgh, USA

Mohammad Hassany, School of Computing and Information, University of Pittsburgh, Pittsburgh, USA

Youjia Wang, School of Nursing, University of Pittsburgh, Pittsburgh, USA

Young Ji Lee (leeyoung@pitt.edu), School of Nursing, University of Pittsburgh, Pittsburgh, USA

Heidi Donovan (donovanh@pitt.edu), School of Nursing, University of Pittsburgh, Pittsburgh, USA

Keywords: Generative AI, Personalized Health Information, Health Recommender Systems, User Interface Design

ISSN 1613-0073

Despite the increasing availability of health information, many users still find it difficult to navigate and comprehend this content effectively. Addressing these challenges requires innovative approaches, including personalized recommendations and more efficient methods of information delivery. In this paper, we explore the use of generative AI to improve access to health article recommendations within a carousel-based interface, using our system, HELPeR. We focus on both generating summaries of recommended articles and evaluating them through a three-stage online experiment with domain experts. The results reveal the potential and complexities of employing generative AI for summarizing recommended health articles for ovarian cancer patients and their caregivers.

Introduction

The increased availability of health information online has transformed the way patients and caregivers access knowledge about diseases, treatments, and health management. However, navigating this vast amount of information can be overwhelming, especially for non-experts. The complexity of medical terminology, coupled with the sheer volume of available content, often leaves users struggling to find and understand the information most relevant to their needs. This challenge underscores the need for more intuitive methods of information delivery that cater to varying levels of information needs and health literacy.

In response to these challenges, personalized recommendations and advanced methods of information delivery have emerged as key strategies to bridge the gap between users and the information they seek. These approaches aim to adapt the content to individual users, considering factors such as their treatment history, disease trajectory, and cognitive abilities. By offering personalized recommendations, these systems can guide users through relevant information more efficiently, potentially improving their understanding and engagement with health content.

Despite the progress made in this area, important questions remain unanswered. One of the key issues is understanding how users decide whether to explore a recommendation further, especially when only minimal information is provided. While visual cues such as images are effective in capturing users' attention in domains like entertainment and e-commerce, the health domain primarily relies on textual content, with images often serving a decorative role in most online health articles. This raises the question of how to present health information in a way that encourages deeper exploration without the visual allure that is typical in other fields.

This challenge became particularly evident in our work with the HELPeR system [1], depicted in Figure 1, which used a carousel-based recommendation interface to present relevant health documents to cancer patients. The carousel-based interface was important in our recommendation context because the system has no data to decide which information need brought the user to HELPeR in a particular session. While the system maintains a fine-grained model of user interests and knowledge, most users are interested in several health topics, and reliably identifying a user's chief concern in each session is as hard as guessing which movie genre a user would like to watch at each Netflix login. Just as Netflix uses a carousel-based interface to support human-AI collaboration and let the user choose the genre they prefer that day, HELPeR's carousel-based interface lets the user choose priority topics from the several most likely topics shown in the carousels.

The carousel format is familiar and effective in presenting multiple options in parallel, but in the context of health information, it presented unique difficulties. While carousels are useful for organizing and displaying visual content (such as movie posters or book covers), it is a challenge to choose textual content to present on a carousel card to ensure that this information is sufficient for the user to make an informed decision. This challenge was compounded by the fact that the recommended content was often not available in a cohesive, summarized format. The headers of health documents are usually too long to be readable or even fit on a card (Figure 1). Document summaries are either too complex or not available at all. To address this challenge, we turned to generative AI to create concise document overviews and summaries that could effectively fit within the carousel cards, helping users make informed decisions with the limited available space.

In this paper, we present our attempts to use Large Language Models (LLMs) to generate brief but informative overviews and summaries of health articles related to ovarian cancer that can fit on the cards of HELPeR's carousel-based recommendation interface. To assess the relevance, clarity, and informativeness of these AI-generated overviews and summaries, we performed a three-phase expert evaluation study. Our findings highlight the challenges of generating concise yet informative content that meets the needs of diverse users, offering valuable insights into the future of AI-driven health information delivery.

Our study contributes to this evolving landscape by evaluating the use of AI-generated summaries within a patient-centered health recommender system, specifically designed for ovarian cancer patients. Although previous research has established the potential of AI in personalizing health information, our focus is on understanding how these AI-generated summaries can be integrated into carousel interfaces to enhance user engagement and decision-making. Through a three-phase expert evaluation, we aim to provide insights that will inform the design of future recommender systems, making them more responsive to the diverse and changing needs of patients.

The paper is organized as follows. In the next section, we provide a review of related work in personalized health information systems. The methodology section explains the approach to generating summaries and a balanced document selection approach that we applied to minimize bias in our evaluation process. We then describe the evaluation study and present our findings. Finally, we discuss the implications of our results, outline the limitations of our study, and suggest directions for future work.

Related Work

The increasing availability of health information online has greatly influenced how patients and caregivers manage care and make informed decisions. Early recommender systems in this domain focused on providing personalized health information by tailoring content to user profiles that were often static. These systems, reported in [2,3], were important in demonstrating the benefits of personalized information delivery. However, as the complexity and volume of health information increased, so did the need for systems that could adapt to the evolving needs of users more effectively.

In response to these needs, the field has seen a shift towards more interactive and dynamic recommender systems. These systems, known as conversational and critique-based recommenders [4,5], allow users to actively engage with the recommendations, refining their information needs based on real-time feedback. Although these interactive models enhanced personalization, they also introduced new challenges, such as the cognitive load associated with continuous interaction, which could be particularly burdensome for users with lower health literacy.

To address these challenges, recent research explored visual interfaces for recommender systems [6], which offered greater expressive power and a better opportunity to add transparency [7] and user control [8] to the recommendation process. Among visual recommender interfaces, carousel-based interfaces became especially popular for their ability to display multiple pieces of content in a compact and navigable format [9,10] and to offer users simple control over the recommendation process. These interfaces combined power and simplicity, enabling users to quickly scan through recommendations and make informed decisions without feeling overwhelmed.

Despite these advancements, the challenge of effectively delivering personalized health information remains unsolved. As noted by Chi et al. [11] and Thaker et al. [12], users' information needs are not static; they can change frequently with changes in health status, treatment progress, and personal circumstances. This variability requires a more dynamic approach to content presentation, one that can adapt to the immediate context of the user. The integration of generative AI into health recommender systems, as explored in our work, offers a promising solution by creating concise, contextually relevant summaries that fit within the limited space of carousel interfaces [1].

Method

Generating the title and overview

The HELPeR system [1,13] is an interactive recommender system designed for ovarian cancer patients and their caregivers. It uses a knowledge base built from a curated collection of documents, which includes public health information, research articles, clinical trial results, and other relevant resources.

These documents undergo a rigorous curation process that involves sectioning, topic modeling, and key-phrase extraction to ensure that the information is reliable, up-to-date, and relevant to the needs of the users. For each document, HELPeR's knowledge base includes a range of textual and metadata information, such as the article title, topics, difficulty, relevant key-phrases, and the full text of each document section.

As mentioned above, neither document titles nor section content were immediately suitable for display on a carousel card. The titles were frequently long and confusing, and summaries were either very long or not available. To address this problem, the most recent version of our system explored the use of an LLM (footnote 1) to generate a document representation (title and summary) that can fit on a carousel card, providing concise information about the document behind the card.

The exact prompt used in our study to generate a title and summary for each document is shown below. This prompt was selected through a prompt-tuning process to maximize clarity and relevance within the constraints of a limited format. Note that we passed all information about each document to the LLM as part of the prompt:

```python
prompt = f"""
Given the following section of an article titled '{article_title}',
with the topic of {topic}, covering keywords: {keywords}
and containing this text: "{section_text}"

1-Generate a 4-7 word title reflecting the section's essence and aligning with the article's theme.

2-Write a 20-25 words, two-sentence summary capturing key points, serving as an informative overview of the article for readers.

Respond ONLY and PRECISELY in this format:
[{{ "title": "the generated title" }}, {{"summary": "the generated summary" }}]"""
```
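A response in the format requested by the prompt is a JSON list of two single-key objects, so it can be turned into a card representation with standard JSON parsing. The following is a minimal sketch under stated assumptions: the helper names `parse_card` and `within_limits`, the hand-written sample response, and the idea of checking word counts after generation are all illustrative and are not described as part of HELPeR.

```python
import json

def parse_card(raw: str) -> dict:
    """Parse a response of the form [{"title": ...}, {"summary": ...}]."""
    items = json.loads(raw)
    # Merge the list of single-key objects into one dictionary.
    card = {key: value for item in items for key, value in item.items()}
    if set(card) != {"title", "summary"}:
        raise ValueError(f"unexpected keys: {sorted(card)}")
    return card

def within_limits(card: dict) -> bool:
    """Check the 4-7 word title and 20-25 word summary limits from the prompt."""
    title_words = len(card["title"].split())
    summary_words = len(card["summary"].split())
    return 4 <= title_words <= 7 and 20 <= summary_words <= 25

# Hypothetical response text, written by hand for illustration only.
sample = (
    '[{"title": "Understanding Ovarian Cancer Treatment Options"}, '
    '{"summary": "This section outlines common treatment options for ovarian cancer. '
    'It explains what patients and caregivers can discuss with their care team '
    'before deciding."}]'
)
card = parse_card(sample)
print(card["title"], within_limits(card))
```

A check of this kind would only flag outputs that break the length constraints; it says nothing about whether the title or summary is accurate, which is what the expert evaluation addresses.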

Selecting a Diverse Subset of Annotations

Given the large number of documents in our collection related to ovarian cancer, selecting a manageable yet representative subset for our human-centered evaluation was essential. A subset that accurately represents the diversity of the full dataset is crucial to avoid biases that could skew our results, particularly because certain topics and audience types are over-represented in our collection. To address this, we used a modified version of the Maximal Marginal Relevance (MMR) algorithm [14], traditionally used to balance relevance and novelty in information retrieval tasks.

The primary challenge in selecting this subset was to ensure that it not only reflected the diversity of topics, domains, and audience types present in the dataset, but also maintained the overall distribution of the documents and topics in our collection. We adapted the MMR algorithm to emphasize the intrinsic properties of the document, allowing it to select a subset that balanced these two critical factors.

Given a set of all documents $D$, where $n = |D|$ denotes the total number of documents, we represented the document set as a matrix $X$, with each document encoded as a one-hot vector across $m$ categories. The subset $S$, initially empty, was populated by the execution of the MMR algorithm. The key parameter $\lambda$, which ranged from 0 to 1, modulated the trade-off between relevance and diversity.

The process began with the random selection of an initial document $d_q$ from $D$, which served as a pseudo-query, establishing the initial subset $S = \{d_q\}$. Then each document $d_i$ within $D$ was assigned a relevance score, $\mathrm{sim}(d_i)$, assumed for simplicity to follow a uniform distribution $\mathrm{sim}(d_i) \sim U(0, 1)$.

During each iteration, the algorithm evaluated each candidate document $d_c$ not yet included in $S$. The diversity score $\mathrm{div}(d_c, S)$ was calculated based on the average cosine dissimilarity between $d_c$ and all documents currently in $S$:

$$\mathrm{div}(d_c, S) = 1 - \frac{1}{|S|} \sum_{d_s \in S} \text{cosine\_similarity}(\mathbf{x}_{d_c}, \mathbf{x}_{d_s})$$

The MMR score for each candidate document was then calculated by combining the relevance and diversity scores:

$$\mathrm{MMR}(d_c) = \lambda \cdot \mathrm{sim}(d_c) + (1 - \lambda) \cdot \mathrm{div}(d_c, S)$$

The document $d_c$ with the highest MMR score was added to $S$, thereby progressively refining the document subset to be both relevant and diverse.
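The selection procedure described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names, the `seed` argument, the stopping rule (a target subset size `k`), and the toy two-category collection are assumptions introduced here.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_diverse_subset(X, k, lam=0.5, seed=0):
    """Greedy, query-free MMR selection; returns the chosen row indices of X."""
    rng = random.Random(seed)
    n = len(X)
    # Relevance scores drawn from U(0, 1).
    sim = [rng.random() for _ in range(n)]
    # A randomly chosen document acts as the pseudo-query and seeds S.
    selected = [rng.randrange(n)]
    while len(selected) < min(k, n):
        best_idx, best_score = None, -math.inf
        for c in range(n):
            if c in selected:
                continue
            # div(d_c, S): one minus the mean cosine similarity to documents in S.
            mean_sim = sum(cosine_similarity(X[c], X[s]) for s in selected) / len(selected)
            score = lam * sim[c] + (1.0 - lam) * (1.0 - mean_sim)
            if score > best_score:
                best_idx, best_score = c, score
        selected.append(best_idx)
    return selected

# Toy collection (assumed): eight documents in category 0, two in category 1.
X = [[1, 0]] * 8 + [[0, 1]] * 2
subset = select_diverse_subset(X, k=4, lam=0.0)
print(subset)
```

With `lam=0.0` only the diversity term matters, so the second pick always comes from the category that the first pick did not cover, even though that category is rare in the toy collection. Larger values of `lam` give more weight to the random relevance scores.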

Our modified approach departs from the traditional reliance on a fixed query in the MMR algorithm by dynamically calculating diversity relative to the evolving subset 𝑆. This modification is particularly advantageous in environments where queries are undefined or fluid, such as unsupervised document clustering or information retrieval systems where query independence is crucial. By structuring the selection process around these principles, we ensured that the resulting subset was both representative of the dataset's diversity and suitable for our human-centered evaluation of the AI-generated summaries.

As illustrated in Figure 3, the subset selected using the modified MMR algorithm shows a more balanced distribution of features across categories such as article knowledge level, audience type, and domain, compared to a randomly selected subset. The MMR-based selection achieves broader coverage of diverse topics, ensuring that underrepresented areas are included and thus mitigating the biases that are evident in the random sample, where certain categories, such as "Survivorship" and "Patient and Caregiver," dominate.

Experiment

Experimental Design

To evaluate the quality and consistency of AI-generated titles and summaries for documents in the HELPeR knowledge base, we designed a multi-phase study. We varied the articles and the order in which they were presented in each phase. This study design allows us to gain insights about both the quality of AI-generated content and the evaluation process itself.

Participants

The study involved two groups of participants: domain experts and nursing students with specialized knowledge in ovarian cancer. In the first phase, we assigned 100 unique documents to each of three domain experts on our research team. In the subsequent phases, three nursing students (not belonging to the research team) were recruited to review the same sample in two rounds. Each student reviewed 300 instances, with 100 documents overlapping between the two rounds. Including nursing students in addition to domain experts was important to better represent the perspective of caregivers and reduce evaluation bias.

Procedure

Phase 1

In the first phase, AI-generated titles and overviews were created for a sub-sample of 300 documents selected by the modified MMR algorithm. These summaries were then presented to three domain experts via a custom web interface (Figure 2). Each expert was tasked with evaluating 100 documents. To our surprise, only 55% of the summaries were accepted without modification, a number that raises many questions about the abilities of generative AI in summarizing health articles.

Phase 2

Following the insights gained from the first phase, we conducted unstructured interviews with the domain experts on our team to understand the reasons behind their high rejection rate. As it turned out, many rejections were related to missing or inaccurate annotations and to uncertainty about the decision-making criteria. To address these issues, several enhancements were made in the second phase. A detailed code-book was developed to provide clear guidelines and decision-making criteria for participants, ensuring consistency across evaluations. Additionally, the web interface was improved to include an interactive tutorial and more user-friendly features, including the ability to report inconsistent annotations and other issues that are not directly related to the quality of the AI-generated content. In this phase, three external nursing students, each with specialized knowledge in ovarian cancer, were recruited to evaluate the same 300 documents. Each student reviewed 100 documents in this phase.

Phase 3

In the final phase, the consistency of the evaluations was tested. The external nursing students were asked to review an additional set of 200 documents. This set included 100 new documents and 100 documents that they had reviewed in the previous round. These documents were randomly distributed. This phase aimed to measure the stability and reliability of their evaluations over time. At the end of this phase, each document had been reviewed by multiple experts, providing a comprehensive dataset for our analysis.

Results

Figure 5: Document quality ratings before and after improvements (left to right)

As explained above, the initial evaluation phase, conducted by domain experts, indicated that only 55% of the AI-generated summaries were accepted without modifications. This lower-than-expected acceptance rate highlighted several issues in the evaluation process, which we attempted to address by introducing the enhancements reviewed above. Following these enhancements, the second phase of our study, in which nursing students with specialized knowledge in ovarian cancer were recruited as evaluators, showed a significant change, with 98.7% of the content being accepted either outright or with minor edits. Specifically, 78.7% of the summaries were accepted as they were, while 20% required slight modifications. Only a small fraction, 1.3%, of the content was rejected. These results demonstrated that the enhancements implemented after the first phase, such as the code-book and UI improvements, had a substantial positive impact on the perceived quality of the AI-generated summaries. Notably, our results suggest that the main factor contributing to this improvement was the added ability for users to report issues with the generated content separately from judging its overall quality (Figure 5).

The final phase of the study focused on assessing the quality of the evaluation itself by analyzing the consistency between our raters. The results from this phase, shown in Table 1, indicated 88% agreement in the reviews, which speaks to the reliability of the evaluation process. However, a Cohen's Kappa score of 0.37 suggested moderate to low consistency among reviewers, pointing to individual differences in judgment and the influence of subjective factors on the evaluation process. We are aware that our small sample size could have a major influence on these results. Additionally, pairwise agreement percentages among reviewers varied, with the highest consistency observed between reviewers P2 and P3 at 92%.
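High raw agreement and a modest Cohen's Kappa can coexist because Kappa discounts the agreement expected by chance, which is large when most items receive the same label. The sketch below shows the computation for two raters. The label sequences are hand-made toy data chosen so that raw agreement is 88%; they are not the study's ratings, and the function names are illustrative.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's Kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: sum over labels of the product of marginal proportions.
    p_e = sum(freq_a[label] * freq_b[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Toy ratings (assumed): 85 both accept, 5 and 7 disagreements, 3 both reject.
rater_1 = ["accept"] * 85 + ["accept"] * 5 + ["reject"] * 7 + ["reject"] * 3
rater_2 = ["accept"] * 85 + ["reject"] * 5 + ["accept"] * 7 + ["reject"] * 3

print(f"agreement = {percent_agreement(rater_1, rater_2):.2f}")
print(f"kappa     = {cohens_kappa(rater_1, rater_2):.2f}")
```

In this toy case most items are labeled "accept" by both raters, so chance agreement is already high and Kappa falls well below the raw agreement figure.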

To better understand the individual differences between our reviewers, we analyzed the relationship between the duration of the review session and the consistency with peers. Our analysis, illustrated in Figure 4 (left), revealed a positive correlation between longer review times and higher consistency in evaluations. This suggests that a more deliberate review process leads to more stable and reliable judgments. Reviewers who spent less time on evaluations exhibited lower consistency, emphasizing the importance of thoroughness in assessing AI-generated content.

Overall, the results of this study, although preliminary in nature, underscore the challenges inherent in evaluating AI-generated summaries, particularly in a complex domain like ovarian cancer information. The changes made between phases substantially raised reviewers' judgments of the acceptability of the AI outputs, yet individual differences among reviewers remain a critical factor in the consistency of evaluations. These findings highlight the need for clear guidelines and rigorous training to ensure reliable and consistent assessments in future AI evaluation tasks.

Discussion, Limitations, and Future Work

The results of this study highlight both the potential and the challenges associated with using AI-generated summaries in the domain of ovarian cancer information. The improvements observed in the second phase, following the implementation of a comprehensive code-book (footnote 2) and user interface enhancements, underscore the importance of clear guidelines and a user-friendly evaluation environment. These adjustments significantly increased the acceptability of the AI outputs, demonstrating that careful attention to the evaluation process can substantially change how the quality of AI-generated content is judged. However, the study also revealed certain limitations. One of the primary concerns is the variability in reviewer judgments, as indicated by the moderate to low Cohen's Kappa score. This suggests that individual differences in interpretation and the inherent subjectivity in content evaluation can impact the consistency of the results. Additionally, the correlation between review time and consistency points to the importance of thorough evaluations, but it also raises questions about the feasibility of scaling such processes for larger datasets.

Another limitation is related to the specific domain of ovarian cancer, which, while critical, may not encompass the full spectrum of challenges that AI-generated content could face in other medical or health-related domains. Therefore, while the findings of this study are valuable, they may not be fully generalizable to other areas without further validation.

For future work, several avenues can be explored. First, expanding the evaluation to include a more diverse set of medical topics could help to understand how AI performs in different domains. Furthermore, exploring automated methods to assess the consistency of evaluations, while keeping the human in the loop, could reduce the reliance on manual review processes, making the evaluation of AI-generated content more scalable.

Finally, addressing the subjective nature of content evaluation by developing more objective criteria or incorporating a larger pool of evaluators could improve the robustness of the evaluation process. Future studies could also examine the impact of these AI-generated summaries on end-users, such as patients and caregivers.

Figure 1: The HELPeR system interface design includes the following main components: A: Recommended Articles, B: A carousel containing recommended articles within the same topic, C: A specific recommended article, D: Details of the selected item, E: Options for user interaction with the recommendation, and F: Reading list (Personal Library)

Figure 2: Evaluation Interface for AI-Generated Metadata

Figure 3: Comparison of Overall Distribution of Features in the Random Sample (top) vs. MMR-Selected Sample (bottom), 𝑛 = 1000, 𝜆 = 0.5. Color intensity corresponds to feature frequency.

Figure 4: Left: Agreement percentage goes up with more time spent on reviewing. Right: Time spent by each reviewer.

Table 1: Inter-Rating Consistency

Footnote 1: OpenAI APIs, gpt-3.5-turbo-1106

Footnote 2: https://bit.ly/helper-codebook

Acknowledgments

This work was supported by awards from the National Library of Medicine (NLM) of the National Institutes of Health (NIH) (Award Number: R01-LM013038).

References

[1] B. Rahdari, P. Brusilovsky, D. He, K. M. Thaker, Z. Luo, Y. J. Lee, Helper: An interactive recommender system for ovarian cancer patients and caregivers, in: Proceedings of the 16th ACM Conference on Recommender Systems, RecSys '22, Association for Computing Machinery, New York, NY, USA, 2022.

[2] K. Lee, K. Hoti, J. D. Hughes, L. M. Emmerton, Consumer use of "dr google": a survey on health information-seeking behaviors and navigational needs, Journal of medical Internet research 17 (2015) e4345.

[3] S. Kanthawala, A. Vermeesch, B. Given, J. Huh, Answers to health questions: Internet search results versus online health community responses, Journal of medical Internet research 18 (2016) e5369.

[4] B. Smyth, L. McGinty, J. Reilly, K. McCarthy, Compound critiques for conversational recommender systems, in: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, WI '04, IEEE Computer Society, USA, 2004.

[5] L. Chen, P. Pu, Critiquing-based recommenders: survey and emerging trends, User Modeling and User-Adapted Interaction 22 (2012).

[6] C. He, D. Parra, K. Verbert, Interactive recommender systems: A survey of the state of the art and future research challenges and opportunities, Expert Systems with Applications 56 (2016).

[7] K. Verbert, D. Parra-Santander, P. Brusilovsky, E. Duval, Visualizing recommendations to support exploration, transparency and controllability, in: the 2013 International Conference on Intelligent User Interfaces, IUI '2013, ACM Press, 2013.

[8] D. Jannach, S. Naveed, M. Jugovac, User control in recommender systems: Overview and interaction challenges, in: E-Commerce and Web Technologies: 17th International Conference, EC-Web 2016, Porto, Portugal, September 5-8, 2016, Revised Selected Papers 17, Springer, 2017.

[9] B. Rahdari, B. Kveton, P. Brusilovsky, The magic of carousels: Single vs. multi-list recommender systems, in: Proceedings of the 33rd ACM Conference on Hypertext and Social Media, 2022.

[10] N. Felicioni, M. Ferrari Dacrema, P. Cremonesi, A methodology for the offline evaluation of recommender systems in a user interface with multiple carousels, in: Adjunct Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, 2021.

[11] Y. Chi, D. He, W. Jeng, Laypeople's source selection in online health information-seeking process, Journal of the Association for Information Science and Technology 71 (2020).

[12] K. Thaker, Y. Chi, S. Birkhoff, D. He, H. Donovan, L. Rosenblum, P. Brusilovsky, V. Hui, Y. J. Lee, Exploring resource-sharing behaviors for finding relevant health resources: analysis of an online ovarian cancer community, JMIR cancer 8 (2022) e33110.

[13] K. Thaker, B. Rahadari, V. Hui, Z. Luo, Y. Wang, P. Brusilovsky, D. He, H. Donovan, Y. J. Lee, Helper: Interface design decision and evaluation, in: Innovation in Applied Nursing Informatics, IOS Press, 2024.

[14] J. Carbonell, J. Goldstein, The use of mmr, diversity-based reranking for reordering documents and producing summaries, in: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998.