Teaching HCI Methods: Replicating a Study of Collaborative Search

Max L. Wilson
Mixed Reality Lab
University of Nottingham, UK
max.wilson@nottingham.ac.uk

Abstract
This paper describes the challenges experienced when replicating a user study that evaluated synergy in a collaborative search system. The original paper saw significant differences in collaborative performance, depending on the mode of collaboration. We were unable to replicate the findings, but experienced several challenges that created ambiguity and differences in the methods, which may have prevented us from doing so. These challenges and experiences, and their effect on our ability to replicate the findings, are described in detail.

Author Keywords
Collaborative search, Synergy, Replication

ACM Classification Keywords
H.5.3 [Group and Organization Interfaces]: Collaborative computing.; H.3.3 [Information Search and Retrieval]: Search Process.; H.3.7 [Digital Libraries]: User Issues.

Copyright is held by the author/owner(s).
This paper was submitted to RepliCHI 2013, a CHI'13 workshop.

Introduction
Hands-on experience of replicating an experiment is often considered a good method of teaching [2]. For this reason, a cohort of 6 MSc students was asked to replicate a user study, in order to learn the methodological and analytical skills required to do so. Further, we hoped to confirm the findings for the benefit of the wider community. Based upon the interests of the staff and students involved, we chose to replicate a user study of the synergetic effect experienced by users searching in collaboration, originally carried out by Shah and Gonzalez-Ibanez [5], herein referred to as the original researchers.

The original researchers studied their own collaborative search software (Coagmento, http://www.coagmento.org/), which had been evaluated previously [6], to examine synergy between collaborators in different group orientations. These orientations, as the primary independent variable, were co-located (same computer), co-located (different computers), and remotely located (different computers); individual searchers, automatically paired post hoc, were used as a baseline. The paper further contributed to the issue of evaluating synergy in collaborative search by presenting new applicable measures. This focus on measures provided additional learning benefit to the MSc students involved.

The MSc students were given an entire semester to coordinate and run the study, and each had to write about the results and the experience for their primary assessment. Support from the original researchers had been previously arranged by the staff.

Challenges Faced and Decisions Made
Significant challenges were faced throughout the replication attempt: in setting up the study, running the study, and analysing the results. These are described in turn below.

Setup Challenges
There were three major challenges in the setup phase: software procurement, data capture, and task design.

• Software Procurement - Initially, we expected the procurement of software to be very easy, as Coagmento can be easily downloaded from the website. After installing the software, however, we noticed several differences between its user interface and the system described in the original paper [5]. The original researchers told us their study was based on an earlier version of the software. At first, we decided to accept the difference in functionality and to report it as a limitation later if needed. The original researchers, however, agreed to try to roll back their functionality and provide us with a version that matched the evaluated version. This was very generous of the original researchers, and is not always an option for those wishing to replicate studies.

• Data Capture - After investigating which data must be captured for the study, we discovered that the original researchers captured the data at the server level. Again, we were faced with two options: video record the desktop and manually log the necessary data afterwards, or request access to the data from the original researchers. The original researchers were again generous and agreed to provide us with the logs.

• Task Design - One significant challenge we faced was task design. The study was based upon an open-ended exploratory recall task, based upon American political parties. Our third decision was whether we should keep the American political task focus, or choose a more temporally (since the political topic had become old) and culturally relevant task for the British university. Several alternatives were proposed before making the decision, and in the end a temporally and culturally relevant task was chosen that focused on the 2012 Olympics (see the Original Task Description and Revised Task Description sidebars). This decision was made because task relevance and inherent motivation are considered key factors in creating good work tasks for user studies [7, 1].

Original Task Description
A leading newspaper has hired your team to create a comprehensive report on the causes, effects, and consequences of the recent gulf oil spill. As a part of your contract, you are required to collect all the relevant information from any available online sources that you can find.

To prepare this report, search and visit any website that you want and look for specific aspects as given in the guideline below. As you find useful information, highlight and save relevant snippets. Make sure you also rate a snippet to help you in ranking them based on their quality and usefulness. Later, you can use these snippets to compile your report, no longer than 200 lines, as instructed.

Your report on this topic should address the following issues: description of how the oil spill took place, reactions by BP as well as various government and other agencies, impact on economy and life (people and animals) in the gulf, attempts to fix the leaking well and to clean the waters, long-term implications and lessons learned.

Revised Task Description
A leading newspaper has hired your team to create a comprehensive report on the causes, effects and consequences of the Olympic Games. As a part of your contract, you are required to collect all the relevant information from any available online sources that you can find.

To prepare this report, search and visit any website that you want and look for specific aspects as given in the guideline below. As you find useful information, highlight and save relevant snippets. Make sure you also rate a snippet to help you in ranking them based on their quality and usefulness. Later, you can use these snippets to compile your report, no longer than 200 lines, as instructed.

Your report on this topic should address the following issues: Impact on economy of host countries (people and animals), long-term implications on the host country, conditions and voting policy to become hosting nation and the next host country and their preparations to host the games.

Running the Study
There were three major challenges in the process of running the study: the experience of the research team, financial support for incentives, and time limitations.

• Research Team - As this replication was being used to teach new MSc students about the process of running a study, the first and most obvious challenge was that the study was run by inexperienced researchers. This challenge was further compounded by the necessity to teach many students at once. In this case, the original study was performed by one experienced PhD student, but the replication was carried out by 6 novice MSc students. Each MSc student required experience in designing study materials (like questionnaires), handling participants, and analysing the results. This means that there was likely to be high variance in each of the stages. To reduce variance, one final protocol was selected from those submitted by the students. However, there were not many constraints, apart from a default script, in terms of how, where, and when the researchers carried out the study with their participants.

• Financial Support for Incentives - As this was part of a taught module, rather than a funded research project, the students had to design alternative incentive methods. In the end, they chose a prize draw for a single prize (provided by the staff), but of a value much lower than a £10 voucher for each participant. There is some related work (e.g. [4]) into the style of different incentive structures, but the effect in this case was not clear.

• Time Limitations - Also because of the constraints of the taught module, the students had a limited amount of time to perform the study. Consequently, the students had to make a decision, also relating to the financial limitations, about how many participants to include in the study. The students managed 40 participants in the timeframe, rather than the 70 involved in the original research.

Analysing the Results
There were two major challenges in the analysis phase: data processing and data analysis.

• Data Processing - The main challenge experienced in the analysis phase was around the pre-processing of log data for analysis. The original researchers, for example, removed search engine results pages from their analysis of diverse website coverage, but the exact set of URLs considered as search engine results pages was implicit rather than explicit. In fact, any form of log processing and filtering in such a study is a possible source of variance, unless the exact rules are accessible to the replicating team. One challenging example is whether to include both a user's typo and then their correction when analysing log data. In our own experiment, we created filters to achieve the same goals as reported in the paper, but we could not guarantee that exactly the same data would be filtered as in the original research, given the same log; these elements of research methods are extremely difficult to report comprehensively in research publications.

• Data Analysis - For many methods, there are many variations in how they can be applied. In the case of this study, it was ambiguous how the data from the NASA Task Load Index (TLX) [3] was analysed. Many studies remove physical effort from the scale, as using a computer does not lend itself to variation in the physical effort questions. In this case, it was unclear exactly how the NASA TLX was applied, including whether pair-wise comparisons were made.

Study Outcome and Discussion
The outcome of our replication attempt was that we could not replicate any of the original findings, which we hope to report in detail in a future publication. In summary, we saw no differences across the different measures, where the original researchers found a number of differences. However, there are many possible reasons for the differences, and we begin with the limitations of our replication attempt.

Limitations of our Replication
Although we were somewhat privileged to have the support of the original authors, we also had several limitations in our attempt:

• Researchers - our study was performed by 6 novice researchers, who each took part in running the study, with different individual abilities
• Participants - we had fewer participants (40 instead of 70), but from a similar academic population
• Participant Motivation - as part of a teaching module, participants were volunteers found by the MSc students, and were not motivated in the same way as in the original study
• Software - although the original researchers provided rolled-back software for the study, the process of rolling back introduced bugs that sometimes made the software unresponsive

Possible Causes of Different Findings
There are many reasons, including those listed above, that may have affected the outcome of our study and prevented us from getting the same findings. On reflection, it is hard to estimate which element is likely to have had the biggest impact on our attempt to replicate the study. First, the performance of the software, after being rolled back, was not ideal and this alone may have obstructed the synergetic effect seen by the original researchers. Second, the study was performed by several novice researchers, who may simply not have performed the study effectively. Third, the differences in the number of participants and the lack of voucher-based motivation could have limited the performance of participants. Fourth, task design has been seen to have a large effect on task outcome, and so perhaps our culturally and temporally relevant task may not have been suitable. Finally, the processing of data for the analysis could simply have been different. Having some different or more comprehensive filtering rules may have led to significant differences in the measures.

Implications for RepliCHI
We chose to report this HCI replication, despite it being focused on a user study not published at an HCI venue, because of the sheer number of issues that it highlighted for a community that wants to better support replication. Our specific example leaves many open questions that we may wish to investigate:

• What should we do when presented with different software versions from the original study?
• Should we use original tasks? Or is it acceptable to replace them for increased temporal/cultural relevance?
• Where data processing is involved, how should we best support others who wish to replicate our studies?
• If we want to recommend replication as a form of teaching, what are the consequences of using groups of novice researchers?
• If we can't overcome these challenges, is there any value in replicating the studies?

Overall, the students experienced many challenges in trying to replicate the study, but learned a lot about study design and paper writing by doing so. For these educational reasons, the replication attempt provided a lot of value to the students. In terms of confirming the original study, we were unable to confirm the results, but were of course also unable to disprove them. This is perhaps a final challenge and discussion point for replication in HCI: we need to decide what we take away from studies that cannot replicate findings, and what value we gain from understanding them. From this experience report, we hope that researchers may learn about several decisions that they are likely to have to make when performing replications, and perhaps make more informed choices when the time comes.

Acknowledgements
We'd like to thank the original authors, Chirag Shah and Roberto Gonzalez-Ibanez, for their support: providing software and advice for the replication.

References
[1] Borlund, P. The concept of relevance in IR. Journal of the American Society for Information Science and Technology 54, 10 (2003), 913–925.
[2] Frank, M. C., and Saxe, R. Teaching replication. Perspectives on Psychological Science 7, 6 (2012), 600–604.
[3] Hart, S. NASA-Task Load Index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 50, SAGE Publications (2006), 904–908.
[4] Musthag, M., Raij, A., Ganesan, D., Kumar, S., and Shiffman, S. Exploring micro-incentive strategies for participant compensation in high-burden studies. In Proceedings of the 13th international conference on Ubiquitous computing, ACM (2011), 435–444.
[5] Shah, C., and González-Ibáñez, R. Evaluating the synergic effect of collaboration in information seeking. In SIGIR11: Proceedings of the 34th annual international ACM SIGIR conference on Research and development in information retrieval, July 24, vol. 28 (2011), 24–28.
[6] Shah, C., Marchionini, G., and Kelly, D. Learning design principles for a collaborative information seeking system. In Proceedings of the 27th international conference extended abstracts on Human factors in computing systems, ACM (2009), 3419–3424.
[7] Wildemuth, B., and Freund, L. Search tasks and their role in studies of search behaviors. In Third Annual Workshop on Human Computer Interaction and Information Retrieval, Washington DC (2009).
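
Note on explicit filtering rules
The Data Processing challenge discussed earlier arose because the set of URLs treated as search engine results pages was implicit. One way a study could make such rules explicit is to publish them as a small, executable filter alongside the paper. The following is a minimal sketch only: the rule list, the function names, and the use of unique hosts as a stand-in for "website coverage" are all illustrative assumptions, and are not the rules or measures used in either the original study or our replication.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical rule set (an assumption for illustration only):
# (host suffix, path prefix, query parameter that carries the search terms).
SERP_RULES = [
    ("google.com", "/search", "q"),
    ("google.co.uk", "/search", "q"),
    ("bing.com", "/search", "q"),
    ("search.yahoo.com", "/search", "p"),
]


def is_serp(url: str) -> bool:
    """Return True if the URL matches one of the explicit SERP rules."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    query = parse_qs(parsed.query)  # parameters with blank values are dropped
    for suffix, path_prefix, param in SERP_RULES:
        host_matches = host == suffix or host.endswith("." + suffix)
        if host_matches and parsed.path.startswith(path_prefix) and param in query:
            return True
    return False


def unique_content_hosts(urls):
    """Hosts visited in a log, excluding pages classified as SERPs.

    Counting unique hosts is an assumed proxy for website coverage.
    """
    return {urlparse(u).hostname for u in urls if not is_serp(u)}


if __name__ == "__main__":
    # Made-up log entries, not data from either study.
    example_log = [
        "https://www.google.com/search?q=olympics+2012",
        "https://www.bbc.co.uk/sport/olympics",
        "https://www.google.com/maps",
    ]
    print(sorted(unique_content_hosts(example_log)))
```

Because the rules live in a single list, a replicating team could apply exactly the same filter to their own logs, or report precisely which rules they changed and why.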