Understanding Modality Preferences in Search Clarification

Leila Tavakoli∗ (RMIT University, Australia), Giovanni Castiglia (Polytechnic University of Bari, Italy), Federica Calò (Polytechnic University of Bari, Italy), Yashar Deldjoo (Polytechnic University of Bari, Italy), Hamed Zamani (University of Massachusetts Amherst, United States), and Johanne R. Trippas (RMIT University, Australia)

Abstract
This study is the first attempt to explore the impact of clarification question modality on user preference in search engines. We introduce the multi-modal search clarification dataset, MIMICS-MM, containing clarification questions with associated expert-collected and model-generated images. We analyse user preferences over different clarification modes, namely text, image, and a combination of both, through crowdsourcing, taking into account image and text quality, clarity, and relevance. Our findings demonstrate that users generally prefer multi-modal clarification over uni-modal approaches. We explore the use of automated image generation techniques and compare the quality, relevance, and user preference of model-generated images with human-collected ones. The study reveals that text-to-image generation models, such as Stable Diffusion, can effectively generate multi-modal clarification questions. By investigating multi-modal clarification, this research establishes a foundation for future advancements in search systems.

Keywords
multi-modal clarification, search clarification dataset, text-to-image generation

1. Introduction

Effective communication between users and intelligent systems is essential for accurately identifying a user’s information needs. One common obstacle encountered by Information Retrieval systems is the inherent ambiguity present in human language. Clarification questions can play a pivotal role in search interactions, allowing users to refine their queries and obtain more precise search results. Traditionally, clarification questions have been presented in a textual format, allowing users to respond with further textual input. Figure 1 shows an example of a clarification pane presented to users on the Bing search engine. In this scenario, the user is seeking information about setting up a distribution list in Outlook, and the clarification question aims to clarify the version of Outlook that the user is working with.

While clarification has become an important component of many conversational and interactive information-seeking systems [1], previous research has shown that even though clarification questions receive positive engagement, users do not interact with them frequently [2, 3]. Recent advancements in technology have introduced new modalities, such as visual prompts or multi-modal prompts, i.e., a combination of text and visuals. As emphasised in the most recent Alexa Prize TaskBot Challenge [4], there are instances in which multi-modal interactions (e.g., text and image) impact the user experience in conversational information-seeking systems [5].

1st Workshop on Multimodal Search and Recommendations (CIKM MMSR ’24), October 25, 2024, Boise, Idaho, USA
∗ Corresponding author.
Email: leila.tavakoli31@gmail.com (L. Tavakoli); g.castiglia@studenti.poliba.it (G. Castiglia); f.calo8@studenti.poliba.it (F. Calò); yashar.deldjoo@poliba.it (Y. Deldjoo); zamani@cs.umass.edu (H. Zamani); j.trippas@rmit.edu.au (J. R. Trippas)
ORCID: 0000-0002-5951-4052 (L. Tavakoli); 0000-0002-6767-358X (Y. Deldjoo); 0000-0002-0800-3340 (H. Zamani); 0000-0002-7801-0239 (J. R. Trippas)
Copyright © 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Figure 1: A clarification pane shown after a user query [3].

Although incorporating visual elements allows users to provide more context and improve query accuracy, the extent to which different modalities (such as text or image) can enhance user interaction in search engines is still uncertain. Previous studies have primarily focused on text-based clarification, neglecting the potential benefits of multi-modal approaches. By exploring user preferences over various modalities, we can investigate which modalities are perceived as more effective and intuitive for optimising both user experience and system performance.

We study multi-modal clarification questions from the user behaviour perspective and explore user preference on clarification question modalities, specifically focusing on text-only, visual-only, and multi-modal approaches. By systematically analysing user feedback, we can gain valuable insights into the advantages and limitations of each modality and the influential parameters.

A clarification pane typically consists of a multi-choice clarification question and a list of candidate answers [3]. A multi-modal clarification pane contains both visual and textual content for each candidate answer (see Figure 2). We aim to understand if adding a visual presentation to text-only clarification panes enhances the user experience. We explore user preferences over three modalities for clarification panes: (i) textual, (ii) visual, and (iii) multi-modal (i.e., a combination of the two). We randomly sample 100 query-clarification pairs from the MIMICS dataset [3]. Then, we create the visual and multi-modal clarification panes for the sampled query-clarification pairs through a controlled manual expert annotation process. Pairwise user preferences are collected for different modalities following a post-task questionnaire to answer the following research question:

• Do users prefer multi-modal clarification panes over uni-modal ones (i.e., textual or visual)?

In this study, we investigate the impact of the image quality, image/text clarity, and relevance of the text and image, in addition to various image aspects, on user preference. Finally, we explore whether generating corresponding images for the clarification panes can be automated using text-to-image generation models. The quality and the relevance of generated images, in addition to user preferences over human-collected and model-generated images, are investigated through manual annotation. Our experiments reveal that:

• In the majority of cases (70-80%), users prefer multi-modal clarification panes over visual-only and text-only clarification panes. They also prefer visual-only clarification over text-only clarification in 54% of cases.
• Crowdsourced workers prefer multi-modal clarification panes as they are easier to understand, which helps users make better and faster decisions.
• Image quality, clarity, and relevance, in addition to text clarity, have a direct impact on self-reported user perceptions.
• Text-to-image generation models, such as Stable Diffusion [6], are capable of automating image generation for creating multi-modal clarification panes.

Our contributions in this paper include:

• Gaining a better understanding of user preferences when it comes to different clarification modalities.
• Evaluating the influence of image and text properties on user preference. By investigating how different factors related to images and text affect user choices, we gain insights into the impact of these properties on search clarification.
• Exploring the capabilities of text-to-image generation models in the context of search clarification. By studying the effectiveness of these models in generating relevant images based on textual queries, we investigate their potential use in enhancing the search clarification process.

Overall, our findings provide valuable insights into how to better engage users with clarification in information-seeking systems. By understanding user preferences and leveraging multi-modal approaches, we can create more effective systems that cater to the needs of users in search clarification scenarios.

2. Related Work

Despite the growing interest in search clarification [7, 8, 9, 10, 11], there is a need for more research on improving user interaction with clarification questions. In addition, further integration of these approaches with the latest developments in multi-modal generative AI and IR systems could lead to more effective and intuitive user experiences [12]. Previous researchers such as Rao and Daumé III [13], Aliannejadi et al. [9], Zamani et al. [2], and Sekulić et al. [14] primarily focused on enhancing the effectiveness of clarification models in search systems. Still, there is a research gap regarding user preferences and perceptions of different modalities in search clarification.

This literature review highlights that research on multi-modal IR has overlooked search clarification. For example, Yang et al. [15] introduced an online video recommendation system incorporating multi-modal fusion and relevance feedback, Zha et al. [16] proposed Visual Query Suggestion for image search, Altinkaya and Smeulders [17] developed a stuttering detection model, Srinivasan and Setlur [18] explored utterance recommendations for visual analysis, Pantazopoulos et al. [19] integrated computer vision and conversational systems for socially assistive robots, and Ferreira et al. [20] presented TWIZ, a multi-modal conversational task wizard. None of these works addressed multi-modal clarification questions in the context of search systems. Hence, this area has a significant research gap, highlighting the need for further exploration and development.

3. Experimental Design

We now describe the methodology and structure of the data collection, including the experiments.

Query and clarification panes sampling. We used the MIMICS-Manual1 dataset to select textual clarification panes. We randomly selected 100 queries and their corresponding multi-choice clarification panes to create the MIMICS-MM dataset. The number of candidate answers in a clarification pane varies between two and five.

1 MIMICS-Manual is one of the three subsets (i.e., MIMICS-Click, MIMICS-ClickExplore, and MIMICS-Manual) of the MIMICS dataset, the largest available search clarification dataset [3]. It contains over 2,000 search queries with multiple clarification panes, landing result pages, and manually annotated three-point quality labels for clarification panes.
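For concreteness, the sampling step above could look roughly as follows. This is a minimal sketch rather than the authors' pipeline; the file name and column layout (query, question, option_1 to option_5) are assumptions about the public MIMICS release and may need adjusting.

```python
# Hypothetical sketch: sample 100 query-clarification pairs from MIMICS-Manual.
# File name and column names (query, question, option_1..option_5) are assumed.
import pandas as pd

manual = pd.read_csv("MIMICS-Manual.tsv", sep="\t")

# Keep one clarification pane per query and count its candidate answers.
option_cols = [f"option_{i}" for i in range(1, 6)]
panes = manual.drop_duplicates(subset="query").copy()
panes["num_options"] = panes[option_cols].notna().sum(axis=1)

# Restrict to panes with two to five candidate answers, then sample 100 pairs.
panes = panes[panes["num_options"].between(2, 5)]
mimics_mm = panes.sample(n=100, random_state=42)
mimics_mm.to_csv("MIMICS-MM-text.tsv", sep="\t", index=False)
```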
Clarification image collection. To assign an image to each candidate answer of the clarification panes, an expert annotator searched online for images corresponding to those candidate answers using the Google Images search engine.2 In total, 314 images were matched with 314 textual candidate answers. The annotator re-evaluated the quality of the images and, if needed, replaced them with images of greater quality.

Figure 2: An example of Task II (T vs. MM) with the post-task questionnaire (all of the questions were single-select).

Experimental design. Online experiments3 were conducted on Amazon Mechanical Turk (AMT) to gather user preference labels through Human Intelligence Tasks (HITs).4 We designed three tasks to collect judgements from AMT workers on user preferences over different modalities in search clarification. We ran pairwise comparisons as follows:

• Task I: text-only (T) vs. visual-only (V)
• Task II: text-only (T) vs. multi-modal (MM)
• Task III: visual-only (V) vs. multi-modal (MM)

A query and two modalities are shown in Figure 2. At the end of this data collection process, three different subsets were created.

Post-task questionnaire. After showing a query and two clarification question options, workers were presented with a post-task questionnaire assessing their presentation style preference and feedback (Figure 2). Thus, after inspecting the query, clarification question, and candidate answers, workers indicated which presentation they preferred (Q1). Workers were also asked to justify their preference with four questions. The second question (Q2) contained checkboxes with options about the text and images’ clarity, quality, and relevance. Workers were asked three more questions to obtain the motivation behind their choice: which modality was easier to understand (Q4), and which helped them make better (Q5) and faster (Q6) decisions, each on a 5-point slider (e.g., in Task II, labels 1 and 2 mean the text-only modality is preferred, label 0 means no preference, and labels 4 and 5 mean the multi-modal option is preferred).

2 We watermarked the images for copyright compliance.
3 Reviewed and approved according to RMIT University’s ethics procedures for research involving human subjects. The approval number is 66-19/22334.
4 Data collection was conducted in mid-March 2022.

Quality assurance. We included two quality assurance checks. First, each task contained a gold question (i.e., a question with a known answer) to ensure validity throughout the task (see Q3 in Figure 2). Workers who failed to answer the gold question were prohibited from completing other tasks, and their answers were removed. Second, we manually checked 10% of submitted HITs per task as a final quality assurance check. Invalid submissions were removed, and the corresponding workers were barred from completing subsequent tasks. We then opened those HITs to other workers. AMT pilot tasks were carried out5 to analyse the flow, acquire users’ feedback, check the quality of the collected data, and estimate the required time to finish each task and a fair pay rate.

Workers. Only workers based in Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States, with a minimum HIT approval rate of 98% and a minimum of 5,000 accepted HITs, were allowed to participate in the study, maximising the quality of the collected data and the likelihood that workers were either native English speakers or had a high level of English. Each HIT was assigned to at least three different AMT workers, enabling us to apply an agreement analysis to their modality preferences. The same number of workers was assigned to each question to avoid biasing the results towards questions with more assigned workers. In case of disagreements, we administered the HIT again to more workers until we achieved a final majority vote. Each worker was allowed to perform 25 tasks (a portion used for each launch). Workers had a five-minute time limit to finish the task and were compensated with 0.74 USD per HIT.
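As a small illustration of this agreement protocol, the aggregation of per-HIT preference labels could be sketched as below. This is not the authors' code; it simply shows a majority vote over at least three worker labels, with HITs lacking a strict majority flagged for relaunch to additional workers.

```python
# Illustrative majority-vote aggregation of worker preference labels per HIT.
# Labels are hypothetical (e.g., "T", "V", "MM"); HITs without a strict
# majority are flagged so they can be assigned to more workers.
from collections import Counter

def aggregate_preferences(labels):
    """Return the majority label, or None if another assignment is needed."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None

print(aggregate_preferences(["MM", "MM", "T"]))  # -> "MM"
print(aggregate_preferences(["MM", "T", "V"]))   # -> None (relaunch the HIT)
```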
Image generation for multi-modal clarification. Following the crowd-sourcing stage, we utilised two text-to-image generation models, namely Stable Diffusion6 [6] and Dall⋅E 27 [21]. These models were employed to produce images related to candidate answers, with the aim of exploring their potential in generating multi-modal clarifications. Our input for generating an image corresponding to a candidate answer of a clarification pane was the concatenation of the query and the candidate answer text. This input was used to generate corresponding images for all candidate answers, with each of the two models producing one image per candidate answer (i.e., two generated images per candidate answer).

Comparing human-collected versus computer-generated images. First, we evaluated and compared the generated images’ visual aspects with those of the manually collected images. We extracted the visual aspects of the images using OpenIMAJ [22], a tool for multimedia content analysis. The nine visual aspects investigated were brightness, colourfulness, naturalness, contrast, RGB contrast, sharpness, sharpness variation, saturation, and saturation variation [23]. We conducted a manual annotation to investigate the generated images’ relevance to the text, compare the images’ quality, and assess user preference over generated and collected images. Three annotators, two men and one woman, all with proficient English and a higher degree, completed the labelling. Each annotator labelled 314 generated images. We collected all annotations and aggregated them; in case of any disagreement, majority voting was used for the final label. We showed the annotators the concatenation of the query and the candidate answer as the text, together with the corresponding generated image. We asked annotators whether the image was relevant to the text on a binary scale (i.e., label 1 means relevant, and label 0 means irrelevant). This label was similar to the label collected for the human-collected images during crowd-sourcing. Then, we showed the collected image for the same text from the crowd-sourcing part and asked the annotators to compare the quality of the generated and collected images, regardless of the presented text, on a 3-point scale (i.e., the quality of the computer-generated image is higher (2), the qualities are the same (1), or the human-collected image has a higher quality (0)). Finally, the annotators were asked to indicate their preferred image between the two images on a 3-point scale (i.e., the annotator prefers the computer-generated image (2), has no preference (1), or prefers the human-collected image (0)).

5 The pilot study was conducted in February 2022.
6 Stable Diffusion is a neural text-to-image model that uses a diffusion model variant called the latent diffusion model. It is capable of generating photo-realistic images given text input. The Diffusers library available at https://github.com/huggingface/diffusers is used for this study.
7 Dall⋅E 2, created by OpenAI, generates synthetic images corresponding to an input text.
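To make the generation setup concrete, the sketch below shows how a prompt formed by concatenating the query and a candidate answer could be passed to a Stable Diffusion pipeline via the Diffusers library mentioned in footnote 6. It is a minimal illustration under assumed settings: the checkpoint, scheduler, and generation parameters used by the authors are not reported, the example query and candidate answer are hypothetical, and the Dall⋅E 2 images were produced through OpenAI's service rather than this library.

```python
# Minimal sketch (assumed checkpoint and parameters) of generating one
# candidate-answer image with Stable Diffusion through the Diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint, not stated in the paper
    torch_dtype=torch.float16,
).to("cuda")

query = "headaches"                     # hypothetical query
candidate_answer = "migraine"           # hypothetical candidate answer
prompt = f"{query} {candidate_answer}"  # concatenation of query and answer text

image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("candidate_answer.png")
```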
Table 1
Pairwise preference for clarification modality (%)

Task                     Prefer Text   Prefer Visual   Prefer Multi-Modal   No preference
Text vs. Visual          39†           54†             NA                   7†
Text vs. Multi-Modal     17†           NA              79†                  4†
Visual vs. Multi-Modal   NA            17              71†                  12

† Significantly different from the other two preferences (Tukey HSD test, p<0.05).

4. Results

In this section, we investigate the impact of various clarification modality characteristics and visual aspects of the images on user preference. Furthermore, we explore whether the clarification panes’ visual modality can be automated.8

User preference and clarification modality. We first investigated user preferences over the clarification modality in each pairwise comparison (i.e., text-only vs. visual-only, text-only vs. multi-modal, and visual-only vs. multi-modal). To understand whether the preferred modality in each pairwise comparison is significantly different from the other two options, we performed the Tukey honestly significant difference (HSD)9 test [25]. This statistical significance test helped us determine, for instance, whether the number of users who preferred multi-modal over text-only was significantly higher. Table 1 reports the percentage of user preference in each pairwise clarification modality comparison (i.e., the average across all user inputs). In Task 1, where the workers indicated their preferences between the text-only and visual-only clarifications, we observed that 54% of the workers preferred visual-only over text-only clarification panes. In Tasks 2 and 3, where the workers indicated their preferences between uni-modal and multi-modal clarification panes, the workers strongly preferred multi-modal clarification panes, regardless of whether the uni-modal clarification pane was text-only or visual-only. The workers’ preferences were significantly different from the other options, indicating that in 70-80% of the cases, a multi-modal clarification was preferred.

Post-task questionnaire analysis. We asked the workers to explain whether the text/image clarity, relevance, and image quality impacted their preferences. We calculated the Pearson correlations between the workers’ preferences and the characteristics of the clarification modalities in each task. In Task 1, we observed a positive correlation (𝜌=0.476) between user preference (i.e., preferring visual-only clarifications over text-only ones) and image quality. There was also a strong positive correlation (𝜌=0.677) between user preference and image clarity, and user preference had a strong negative correlation (𝜌=-0.686) with text clarity. The same correlation trends and orders were observed for user preference (i.e., preferring multi-modal clarifications over text-only ones) with image quality (𝜌=0.458), image clarity (𝜌=0.626), and text clarity (𝜌=-0.627). However, in Task 3, user preference (i.e., preferring multi-modal clarifications over visual-only ones) correlated only with text clarity (𝜌=0.505) and image clarity (𝜌=-0.301). A closer look at the workers’ feedback showed that the text and the image were relevant in more than 95% of clarification panes, which explains the low to zero correlations between user preference and the relevance of the text and the image. A Tukey HSD test showed that the calculated correlations were significantly different from each other. In the pairwise comparison between multi-modal and visual-only clarification panes, although the collected images for the clarification panes were the same, the workers preferred multi-modal clarification panes over the visual-only ones when the images were not clear; the text helped them understand the candidate answers of the clarification panes. The users preferred visual-only clarifications in more than 54% of cases when the text clarity was low and the image quality and clarity were high. However, the text and image were relevant in most cases.
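The correlation and significance analysis described above could be reproduced along the following lines. This is a hedged sketch, not the authors' released code (see footnote 8 for their repository): the data layout, column names, and the exact response variable fed into the Tukey HSD test are assumptions.

```python
# Hypothetical sketch of the analysis: Pearson correlations between worker
# preferences and pane characteristics, plus a Tukey HSD test across the
# three preference groups. All column names are assumed.
import pandas as pd
from scipy.stats import pearsonr
from statsmodels.stats.multicomp import pairwise_tukeyhsd

responses = pd.read_csv("task1_responses.csv")  # one row per HIT judgement

# Correlation between the binary preference (e.g., 1 = visual-only preferred,
# 0 = text-only preferred) and the self-reported characteristics.
for col in ["image_quality", "image_clarity", "text_clarity"]:
    rho, p = pearsonr(responses["preference"], responses[col])
    print(f"{col}: rho={rho:.3f}, p={p:.3g}")

# Tukey HSD over the preference groups (text / visual / no preference);
# "slider_score" stands in for a per-judgement numeric response.
tukey = pairwise_tukeyhsd(endog=responses["slider_score"],
                          groups=responses["preferred_modality"],
                          alpha=0.05)
print(tukey.summary())
```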
8 Our results and codes are publicly available for reproducibility at https://github.com/Leila-Ta/MIMICS-MM.
9 The Tukey HSD test is a post hoc test used when there are equal numbers of subjects in each group for which pairwise comparisons of the data are made [24].

Table 2
Motivations behind user preference (%)

                         T vs. V                     T vs. MM                         V vs. MM
Motivation               Prefer Text  Prefer Visual  Prefer Text  Prefer Multi-Modal  Prefer Visual  Prefer Multi-Modal
Easier to understand     25           31             7            61                  6              67
Better decision          22           36             6            68                  3              67
Faster decision          27           36             10           62                  6              66
None of the above        8            12             4            9                   9              5

In the post-task questionnaire, we investigated the users’ motivation for their preferences. We asked users whether the preferred modality was easier to understand and helped them make better and faster decisions. Table 2 shows the user preferences in each pairwise modality comparison. We see that when users preferred visual-only clarification panes over text-only ones, 31% of users believed that the visual-only clarification panes were easier to understand, and the visual-only modality helped 36% of users make better and faster decisions. When comparing multi-modal clarification panes with text-only and visual-only clarification panes, between 60% and 70% of users believed that multi-modal clarification panes were easier to understand and helped them make better and faster decisions. Table 2 also shows that there were small groups of users whose motivations behind their preferences were not listed in our questions.

User preference and impact of visual aspects. In the next step, we investigated the impact of the visual aspects of the collected images on user preference over the clarification modality. We calculated the point-biserial correlation10 [26] between the visual aspects of the images and user preference, image quality, and image clarity. The average value of each aspect was calculated across all candidate answers for each clarification pane, so one value was obtained per visual aspect for every clarification pane. There was a low correlation between the images’ visual aspects and user preference, as well as the image quality and clarity judged by the workers. To further explore the impact of the visual aspects of images on user preference, we developed a feature-level attribution explanation to rank the images’ visual characteristics by their contribution to user preference. We used the Gini importance of a random forest model, with the visual aspects as input features and user preference as the target label (i.e., 0 means text preferred over multi-modal and 1 means multi-modal preferred over text). Gini importance measures the relative significance of each feature in a random forest model; applied here, it indicates how much each visual aspect contributes to predicting user preference, which is particularly useful when visual information plays a significant role in the task.

10 The point-biserial correlation measures the relationship between a binary variable (i.e., user preference, image quality, and clarity) and a continuous variable (i.e., image aspects).
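A hedged sketch of this feature-attribution step is shown below, combining the point-biserial correlations with impurity-based (Gini) feature importances from a scikit-learn random forest. The feature matrix layout, file name, and column names are assumptions, not the authors' implementation.

```python
# Illustrative sketch (assumed data layout): point-biserial correlations and
# Gini feature importances for the nine visual aspects, predicting the binary
# preference label (0 = text preferred, 1 = multi-modal preferred).
import pandas as pd
from scipy.stats import pointbiserialr
from sklearn.ensemble import RandomForestClassifier

ASPECTS = ["brightness", "colourfulness", "naturalness", "contrast",
           "rgb_contrast", "sharpness", "sharpness_variation",
           "saturation", "saturation_variation"]

panes = pd.read_csv("task2_visual_aspects.csv")  # one row per clarification pane

# Point-biserial correlation between the binary preference and each aspect.
for aspect in ASPECTS:
    r, p = pointbiserialr(panes["preference"], panes[aspect])
    print(f"{aspect}: r={r:.3f}, p={p:.3g}")

# Gini (impurity-based) importances from a random forest classifier.
forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(panes[ASPECTS], panes["preference"])
ranked = sorted(zip(ASPECTS, forest.feature_importances_), key=lambda x: -x[1])
for aspect, importance in ranked:
    print(f"{aspect}: {importance:.3f}")
```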
We performed this analysis for Task 2, and the results indicate that brightness, naturalness, RGB contrast, sharpness variation, and saturation variation, among the studied aspects, accounted for more than 65% of the differences in user preferences. In particular, brightness and naturalness were the two most important visual features.

Automatic image generation for clarification panes. Finally, we investigated whether generating the corresponding images for the candidate answers could be automated. First, we compared the visual aspects (e.g., brightness, colourfulness, naturalness) of the generated images with those of the collected ones. We observed that the generated images had largely the same visual aspects as the collected ones; in particular, the Stable Diffusion model generated images with sharpness similar to that of the human-collected images. Second, we compared computer-generated images with human-collected ones regarding image relevance, quality, and user preference. Table 3 shows that 87% of the Stable Diffusion-generated images were relevant to the text. Even though only 20.7% of the generated images had a higher quality than the human-collected ones, more than 57% of the generated images had a quality higher than or equal to that of the collected ones. Only 12.7% of the generated images were preferred over the human-collected images. However, as seen in Table 3, 39.8% of the users either preferred the generated images or had no preference between the generated and collected images. A slight improvement in model performance was observed when we removed the irrelevant generated images from the collection (i.e., the percentage of generated images that had higher quality than the collected images rose from 20.7% to 21.2%, and the percentage of generated images that were preferred over collected images rose from 12.7% to 14.6%). The annotators preferred the human-collected images over ∼60% of the computer-generated images. This observation was expected, as the collected images were gathered through online searching to select the most suitable images, whereas the text-to-image model generated an image from text alone. Nevertheless, the Stable Diffusion model could generate relevant and high-quality images. Since users preferred a multi-modal clarification pane over a text-only one in ∼80% of cases, such a text-to-image model can ease and speed up the task of generating multi-modal clarification panes.

Table 3
Comparison of human-collected and computer-generated search clarification question images.

Collection Method                   Relevance   Image Quality1   Image Preference2
Human-Collected                     96%         42.7%            60.2%
Stable Diffusion Model-Generated    87%         20.7%            12.7%

1 36.6% of users indicated that the quality of the generated and collected images was the same. Given the continuous improvements in text-to-image generation models, the quality of generated images is anticipated to have increased significantly over the past two years.
2 27.1% of users indicated no preference between the generated and collected images.

5. Conclusions and Future Work

We aimed to understand the impact of clarification question modality on user preference. We introduced a novel multi-modal clarification dataset, MIMICS-MM. We created three modalities of text-only, visual-only, and multi-modal (a combination of both) clarification panes and presented them to users through crowdsourcing.
The research shows that users generally preferred multi-modal clarification panes over text-only and visual-only ones. Users found it easier to understand the information presented in multi-modal panes, which helped them make better and faster decisions. This implies that integrating text and visual elements improves comprehension and decision-making for users, particularly given that the models for generating clarifications are not yet performing optimally.

The study identified that when images were clear and of high quality, users favoured multi-modal panes. Therefore, ensuring that the visual content provided in clarification panes is of good quality and easily understandable is crucial. We also showed that when the images were unclear and of low quality, users preferred text-only clarification panes, even if the images were relevant. This suggests that when visual content is inadequate, relying solely on text can be more effective in conveying the necessary information.

We also explored the task of automatically generating corresponding images for text-only clarifications to turn them into multi-modal clarifications. The results indicated that text-to-image generation models, such as Stable Diffusion, can produce high-quality and relevant visual content, which indicates that automated generation techniques can produce multi-modal panes for search clarification. Nonetheless, it is crucial to note that these methods cannot yet fully replicate human annotation when gathering relevant images for text-only clarification panes. Users still strongly prefer images collected by humans over those generated by models.

Our objective in this study was to gain insight into user preferences regarding different clarification modalities in a search scenario rather than to examine the impact of clarification modality on search performance. As a result, we acknowledge that the participants in our study were not in a genuine search situation. We also recognise the potential impact of the dataset size. However, the statistically significant differences observed in our analysis form a reliable foundation for drawing valid conclusions. We have utilised robust statistical techniques to ensure the credibility of our findings, and it is unlikely that the observed effects are solely due to random chance.

This study suggests several research paths for the future:

• investigating the impact of clarification modality on search performance in real search situations,
• creating a more comprehensive dataset containing various aspects of queries to explore clarification modality further,
• developing advanced multi-modal language models to determine the most effective modality in different scenarios,
• investigating the impact of factors such as user demographics, task complexity, and content characteristics,
• improving image generation techniques to produce more preferable images, and
• exploring alternative modalities beyond text and images, such as audio or interactive elements [27, 5].

We think future work should also consider the development of more robust multi-modal clarification (e.g., images) using the latest advances in generative AI, large language models, and multi-modal foundation models (see [28, 29]).

Acknowledgements

This research was supported in part by the Centre for Intelligent Information Retrieval, in part by the Office of Naval Research contract number N000142212688, and in part by NSF grant number 2143434.
Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

References

[1] H. Zamani, J. R. Trippas, J. Dalton, F. Radlinski, Conversational information seeking, Foundations and Trends® in Information Retrieval 17 (2023) 244–456. URL: http://dx.doi.org/10.1561/1500000081. doi:10.1561/1500000081.
[2] H. Zamani, B. Mitra, E. Chen, G. Lueck, F. Diaz, P. N. Bennett, N. Craswell, S. T. Dumais, Analyzing and learning from user interactions for search clarification, in: Proceedings of SIGIR, 2020, pp. 1181–1190.
[3] H. Zamani, G. Lueck, E. Chen, R. Quispe, F. Luu, N. Craswell, MIMICS: A large-scale data collection for search clarification, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 3189–3196.
[4] The Alexa Prize TaskBot Challenge, 2021. URL: https://www.amazon.science/alexa-prize/taskbot-challenge.
[5] Y. Deldjoo, J. Trippas, H. Zamani, Towards multi-modal conversational information seeking, in: Proceedings of SIGIR, 2021.
[6] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
[7] P. Braslavski, D. Savenkov, E. Agichtein, A. Dubatovka, What do you mean exactly? Analyzing clarification questions in CQA, in: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, 2017, pp. 345–348.
[8] L. Tavakoli, J. R. Trippas, H. Zamani, F. Scholer, M. Sanderson, MIMICS-Duo: Offline & online evaluation of search clarification, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 3198–3208.
[9] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 475–484.
[10] J.-K. Kim, G. Wang, S. Lee, Y.-B. Kim, Deciding whether to ask clarifying questions in large-scale spoken language understanding, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 869–876.
[11] L. Tavakoli, J. R. Trippas, H. Zamani, F. Scholer, M. Sanderson, Online and offline evaluation in search clarification, ACM Trans. Inf. Syst. (2024). URL: https://doi.org/10.1145/3681786. doi:10.1145/3681786. Just Accepted.
[12] J. R. Trippas, D. Spina, F. Scholer, Adapting generative information retrieval systems to users, tasks, and scenarios, in: R. W. White, C. Shah (Eds.), Information Access in the Era of Generative AI, Springer Nature Switzerland AG, Cham, Switzerland, 2024.
[13] S. Rao, H. Daumé III, Answer-based adversarial training for generating clarification questions, arXiv preprint arXiv:1904.02281 (2019).
[14] I. Sekulić, M. Aliannejadi, F. Crestani, User engagement prediction for clarification in search, in: Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part I, Springer, 2021, pp. 619–633.
[15] B. Yang, T. Mei, X.-S. Hua, L. Yang, S.-Q. Yang, M. Li, Online video recommendation based on multimodal fusion and relevance feedback, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, 2007, pp. 73–80.
[16] Z.-J. Zha, L. Yang, T. Mei, M. Wang, Z. Wang, Visual query suggestion, in: Proceedings of the 17th ACM International Conference on Multimedia, 2009, pp. 15–24.
[17] M. Altinkaya, A. W. Smeulders, A dynamic, self supervised, large scale audiovisual dataset for stuttered speech, in: Proceedings of the 1st International Workshop on Multimodal Conversational AI, 2020, pp. 9–13.
[18] A. Srinivasan, V. Setlur, Snowy: Recommending utterances for conversational visual analysis, in: The 34th Annual ACM Symposium on User Interface Software and Technology, 2021, pp. 864–880.
[19] G. Pantazopoulos, J. Bruyere, M. Nikandrou, T. Boissier, S. Hemanthage, B. K. Sachish, V. Shah, C. Dondrup, O. Lemon, Vica: Combining visual, social, and task-oriented conversational AI in a healthcare setting, in: Proceedings of the 2021 International Conference on Multimodal Interaction, 2021, pp. 71–79.
[20] R. Ferreira, D. Silva, D. Tavares, F. Vicente, M. Bonito, G. Gonçalves, R. Margarido, P. Figueiredo, H. Rodrigues, D. Semedo, et al., TWIZ: The multimodal conversational task wizard, in: Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6997–6999.
[21] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125 (2022).
[22] J. S. Hare, S. Samangooei, D. P. Dupplaw, OpenIMAJ and ImageTerrier: Java libraries and tools for scalable multimedia analysis and indexing of images, in: Proceedings of the 19th ACM International Conference on Multimedia, 2011, pp. 691–694.
[23] C. Trattner, D. Moesslang, D. Elsweiler, On the predictability of the popularity of online recipes, EPJ Data Science 7 (2018) 1–39.
[24] A. Stoll, Post hoc tests: Tukey honestly significant difference test, The SAGE Encyclopedia of Communication Research Methods (2017) 1306–1307.
[25] J. W. Tukey, Comparing individual means in the analysis of variance, Biometrics (1949) 99–114.
[26] R. F. Tate, Correlation between a discrete and a continuous variable. Point-biserial correlation, The Annals of Mathematical Statistics 25 (1954) 603–607.
[27] J. R. Trippas, D. Spina, M. Sanderson, L. Cavedon, Towards understanding the impact of length in web search result summaries over a speech-only communication channel, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 991–994. URL: https://doi.org/10.1145/2766462.2767826. doi:10.1145/2766462.2767826.
[28] Y. Deldjoo, Z. He, J. McAuley, A. Korikov, S. Sanner, A. Ramisa, R. Vidal, M. Sathiamoorthy, A. Kasirzadeh, S. Milano, A review of modern recommender systems using generative models (Gen-RecSys), in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6448–6458.
[29] Y. Deldjoo, Z. He, J. McAuley, A. Korikov, S. Sanner, A. Ramisa, R. Vidal, M. Sathiamoorthy, A. Kasirzadeh, S. Milano, et al., Recommendation with generative models, arXiv (2024).