=Paper=
{{Paper
|id=Vol-2337/paper2
|storemode=property
|title=Data Sets for Spoken Conversational Search
|pdfUrl=https://ceur-ws.org/Vol-2337/paper2.pdf
|volume=Vol-2337
|authors=Johanne Trippas,Paul Thomas
|dblpUrl=https://dblp.org/rec/conf/chiir/TrippasT19
}}
==Data Sets for Spoken Conversational Search==
Johanne Trippas (RMIT University, Melbourne, Australia), johanne.trippas@rmit.edu.au
Paul Thomas (Microsoft, Canberra, Australia), pathom@microsoft.com

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
ABSTRACT
There is increasing interest in spoken conversational search—multi-turn interactions with a search engine, spoken in natural language—but until recently there was little public data to support research. We describe our experiences building two data sets for spoken conversational search: the Microsoft Information-Seeking Conversation set ("MISC") and the Spoken Conversational Search set ("SCSdata"). Each data set contains recordings of spoken interactions between two people collaborating on web search tasks, but relatively small differences in protocol have led to observably different data. We discuss some consequences of these differences, and describe attempts to reproduce analyses from one set to the other.
1 DATA SETS OVERVIEW

The increasing capability for natural-language, voice interactions with computers poses a range of research and engineering questions. To address these questions we need corresponding data—for example, recordings of conversations with information-gathering agents. Unfortunately, current systems cannot maintain a lengthy exchange, have trouble tracking context, and are largely unaware of non-verbal communication and of users' emotional state. In 2016–17 two separate groups tried to bridge the gap by recording information-seeking conversations between people, looking for structures which would help build new systems or evaluate old ones [cf. 5, 9, 19].

[Figure 1: Recording setup for both MISC and SCSdata. Tasks were assigned to a "seeker", who communicated with an "intermediary" who had access to a browser. From Thomas et al. [21].]

1.1 MISC

The Microsoft Information-Seeking Conversation data (MISC) is a set of recordings of spoken conversation between human "seekers" and "intermediaries" [21]. It was designed to support research on questions such as: do human intermediaries show behaviours which correlate with seeker satisfaction? Do seekers show behaviours which we could use as a baseline for online metrics appropriate to conversational agents? What role is played by politeness or other conversational norms? What tactics do we see in information-seeking conversation, and do particular structures help or impede progress or satisfaction? MISC has been used in unpublished work on these questions, in work on conversational style [20], on multimodal collaboration [14], and on conversational structures described below.

The study. The overall setup for both the MISC and SCSdata recordings is shown in Figure 1. Tasks were assigned to a "seeker", who was responsible for assembling information and deciding a final answer. They were connected over an audio link to an "intermediary", who stood in for a future software agent (SCSdata participants were located in the same room). The intermediary had unrestricted access to the web, including search engines. We recorded video and audio from both participants.

The data. The MISC data includes audio and video signals; transcripts; prosodic and linguistic signals; entry questions on demographics and personality; and post-task surveys on emotion, engagement, and effort. Screen recordings are also available, as is data on affective and physiological signals.

Reuse and reusability. We designed the MISC data with regard to our own future research, but intended from the start that it could be used by other researchers. Our participants consented to possible reuse and sharing, and were informed of their right to withdraw consent at any time, including post hoc. The study was approved by our internal ethics review board.

Although MISC includes a good deal of derived data, we have chosen to include the raw data wherever possible so as to enable (a) replication and (b) further unanticipated analyses. For example, we include the raw audio, from which we derived the included transcripts; and we include these transcripts, from which we derived data on word use. The only processing of the "raw" video and audio has been to segment by task. The full text of each pre-experiment and post-task question is also included. This policy has already enabled reuse inside our research group: for example, work by McDuff et al. [14], on the effect of facial expressivity and multimodal communication, was not anticipated when we collected MISC. We are not aware of any attempts to re-process the audio or video streams, but we hope this policy also makes reuse outside our own research group more likely.
We used standard instruments and standard processing tools where available:

• To help interpret physiological and affective signals, we used the UPPS Impulsive Behaviour Scale [27] and Cohen et al.'s perceived stress scale [8] (see also e.g. http://www.mindgarden.com/documents/PerceivedStressScale.pdf). These are commonly-used instruments and should be comparable across studies.
• MISC includes five tasks, one of which was used as a warm-up. We believed there may be a difference in behaviour and self-reports depending on the complexity and difficulty of the task, so we varied these in a controlled manner. We also wanted tasks that elicited an emotional response, which ruled out those from most past collections; instead we selected tasks from the Repository of Assigned Search Tasks (RepAST, https://ils.unc.edu/searchtasks/). Participants addressed the tasks using the open web, which may make it hard to reproduce some results but did allow intermediaries to use the full range of web search features.
• We measured effort with the NASA task load index (TLX) [16]. This is a commonly-used and well-validated scale which we were able to adopt with minimal modification (only omitting the question on physical effort). Post-hoc tests validated this modified scale (Krippendorff's α = 0.84 [21]).
• We measured engagement using a subset of the user engagement scale (UES) [17]. This proved very useful for our purposes, and again the modifications were validated post hoc (α = 0.85 [21]).
• Questions on per-task emotion used a widely recognised set of basic emotions, as well as a separate question about other emotions which we considered more likely during our tasks.
• Processing used standard tools, both to reduce effort and to aid reproducibility. We used openSMILE [11] for audio analysis; OpenFace [3] for coding facial actions; Microsoft Cognitive Services (https://www.microsoft.com/cognitive-services/en-us/speech-api) to produce transcripts; and Linguistic Inquiry and Word Count (LIWC) [18] for lexical analysis.
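None of this processing is exotic, and the acoustic step in particular can be re-run from the released raw audio. As a rough illustration (not the scripts used for MISC), the sketch below calls openSMILE's SMILExtract binary over each per-task recording; the configuration file name and directory layout are assumptions for the example.

```python
import subprocess
from pathlib import Path

def extract_audio_features(wav_path: Path, out_path: Path,
                           config: Path = Path("config/IS09_emotion.conf")) -> None:
    """Run openSMILE's SMILExtract over one task-segmented recording.

    Assumes the SMILExtract binary is on the PATH; the stock config file
    named here is illustrative, not necessarily the one used for MISC.
    """
    subprocess.run(
        ["SMILExtract",
         "-C", str(config),    # feature-extraction configuration
         "-I", str(wav_path),  # input audio, one file per task
         "-O", str(out_path)], # summary feature output
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical layout: one WAV per task under ./audio/.
    for wav in sorted(Path("audio").glob("*.wav")):
        extract_audio_features(wav, wav.with_suffix(".csv"))
```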
We are happy to make many of our processing scripts available for other researchers—a small number use in-house tools—although again there have been no requests so far.
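One reuse that should be straightforward from the released per-item responses is re-checking the post-hoc reliability figures quoted above. A minimal sketch, assuming the third-party Python package krippendorff and an entirely made-up response matrix (rows are scale items, columns are respondents); the numbers are illustrative only:

```python
import numpy as np
import krippendorff  # third-party package, assumed installed: pip install krippendorff

# Illustrative Likert-style responses: rows are the items of a scale
# (e.g. a modified TLX), columns are respondents; np.nan marks a
# missing answer. These values are made up for the example.
ratings = np.array([
    [5, 6, 6, 4, 7, 5],
    [5, 7, 6, 4, 6, np.nan],
    [4, 6, 5, 3, 7, 5],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha: {alpha:.2f}")
```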
Reporting and availability. We described our protocol in detail in our first publication [21]. This paper includes details of participants, the wording for all tasks and questions, and descriptive statistics including reliability measures.

The MISC data is available online at http://aka.ms/MISCv1.

1.2 SCSdata

The Spoken Conversational Search data set (SCSdata) contains the utterance transcriptions of a spoken information-seeking process between two actors. To the best of our knowledge, SCSdata was the first data set created with this experimental setup. It is also the first SCS data set for which the actions or utterances were labelled, albeit only for the first three turns [22]. However, the release of the fully labelled data set is planned.

The SCSdata was created to investigate the interaction behaviour between the two actors, including helping us to understand questions such as: what is the impact of audio-only interactions for search? How are information-dense documents transferred in an audio-only setting? What are the components or actions of an information-seeking process via audio, and what is the impact of query complexity on the interactions and interactivity in spoken conversational search? The SCSdata has been used in research published by the creators of the data set [22, 23] and has also been used recently in a study by the broader IR community [25].

The study. The SCSdata was created in a controlled laboratory study at RMIT University. We recorded the spoken interactions between seeker and intermediary (as explained in Section 1.1). We then transcribed the recordings with the transcription principles and protocols described by Trippas et al. [24]. Much detailed work went into creating highly accurate transcriptions, with the aim of increasing the reusability of the data set, including for indexing [13].

The data. The data includes the transcriptions of the audio signals, the codebook and labels for the first three utterances, and the backstories used in the setup. Other data, such as the audio, video, and pre- and post-task questionnaires, are not available due to ethics regulations.

The data is maintained by an author of this paper (Trippas).

Reuse and reusability. The SCSdata reuses nine backstories based on TREC Q02, R03, and T04 as described by Bailey, Moffat, Scholer, and Thomas [2]. These backstories follow the cognitive complexity framework of the Taxonomy of Learning [1].

Participants completed a pre-test questionnaire before starting the study. This questionnaire gathered demographic data such as age, gender, highest level of education, employment, and computer and search engine usage. Participants were also asked to complete a modified version of the Search Self-Efficacy Scale [4], to rate their own overall search skills, and to report whether they had experience with intelligent personal assistants such as Google Now, Siri, Amazon Alexa, or Cortana. Seekers and intermediaries were asked to complete pre- and post-task questionnaires throughout the study measuring interest in and knowledge about the task, experienced task difficulty, experienced conversational difficulty, experienced collaboration difficulty, experienced search presentation difficulty, overall difficulty, overall satisfaction, and open questions. Some of these questions were adapted or reproduced from Kelly, Arguello, Edwards, and Wu [12].

The SCSdata was designed with our own research questions in mind, while optimising the transcriptions and labelling for future use. We believe that the labelled data set is very valuable for the research community. The data set was recently updated, and it is planned to release the full labelling annotations and label creation methodology in the near future.

Reporting and availability. We described our experimental setup in the preliminary data analysis paper [22]. Fully documented information on the transcription protocol and labelling process is available in Trippas et al. [24]. That paper aimed to establish a protocol for transcribing spoken search interactions, minimising the likelihood that transcripts produced subsequently are inconsistent with each other.
Other details, such as the procedure of the study or questionnaire results, have not yet been published.

The SCSdata is available online via https://jtrippas.github.io/Spoken-Conversational-Search/.

2 COMPARING MISC AND SCSdata

In recent, unpublished work, one of us (Trippas) has developed a coding schema for annotating utterances in spoken conversational search. Initial development used SCSdata, but since MISC is very similar it has been reused to validate the schema. We offer below some observations on re-using MISC and SCSdata, based on this experience.

It is clearly valuable to have two data sets collected with such similar protocols, and for similar purposes. Coding conversations relies on having lengthy, naturalistic exchanges, and both SCSdata and MISC have several exchanges running to ten minutes. Both sets distinguish the "seeker" and "intermediary" roles, allowing direct comparison, and both include transcripts which could be coded more or less directly. However, some differences across the data sets did hamper reuse, or led to unexpected findings.

2.1 Protocol differences

First, while the SCSdata was manually transcribed, the MISC data is about ten times larger but has only been transcribed with a commercial speech-to-text system. Although the automatic speech recognition (ASR) system was state of the art, it was still prone to errors. (One common error was to inject "speech" when a participant was typing, as if the ASR was confused by keyboard noise.) These errors were discovered because a close reading was needed to label the MISC utterances for the validation of an annotation schema. The difference in transcription techniques also gave a different notion of utterance or turn: in SCSdata these are divided manually, while in MISC they are separated by pauses in the audio signal. Utterance-level statistics may not be directly comparable.
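The MISC release does not include the segmenter itself, but the general idea of splitting audio into turns at pauses is easy to sketch. The version below is illustrative only: it thresholds frame energy over a mono signal, and the threshold and minimum pause length are assumptions, not MISC's actual parameters.

```python
import numpy as np

def segment_by_pauses(samples: np.ndarray, sample_rate: int,
                      frame_ms: int = 30, silence_rms: float = 0.01,
                      min_pause_ms: int = 500) -> list[tuple[float, float]]:
    """Split a mono signal (floats in [-1, 1]) into utterance spans at long pauses.

    Returns (start_seconds, end_seconds) pairs. All parameters are
    illustrative assumptions; MISC's segmentation settings are not published.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    rms = np.array([np.sqrt(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
                    for i in range(n_frames)])
    voiced = rms > silence_rms

    min_pause_frames = max(1, min_pause_ms // frame_ms)
    spans, start, silent_run = [], None, 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i          # a new utterance begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # pause is long enough: close the current utterance
                spans.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        spans.append((start, n_frames - silent_run))

    to_seconds = frame_ms / 1000.0
    return [(s * to_seconds, e * to_seconds) for s, e in spans]
```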
The sets also differ in the pre- and post-task questions. The MISC questions and responses are part of the released data, and the published description includes descriptive statistics and basic validity checks. We hope this is useful for future work. The SCSdata protocol also added many pre- and post-task items (see Section 1.2), on overlapping themes but with different instruments. These have not been examined to date, so they may or may not be useful or comparable. Future SCSdata releases will not include this data.

Some apparently small differences between the SCSdata and MISC protocols have led to observable differences in the collected data. SCSdata participants were expressly prohibited from reading out the task statement verbatim and had to verbalise their information request; MISC participants were given no instruction on this matter. As a result, the MISC data include seekers reading out and repeating the task statements verbatim. More importantly, once both participants have the same statement, the roles of "seeker" and "intermediary" are blurred and the two act much more like peers. This has influenced the interactions in MISC, and the distribution of conversational moves.

The two protocols also differed at the end of each task. For MISC, "seekers" were asked to record an answer: this was meant partly to encourage participants to properly complete each task, and partly so researchers could look for differences in answer correctness or completeness (in the event, we have not been able to code the answers with any degree of reliability). SCSdata participants were not asked to record an answer, but were asked to say "stop search" when they were satisfied with the found information and could answer the information need. This again led to differences in behaviour, such as MISC "seekers" confirming spelling in order to write down the answer.

These differences were an unexpected nuisance: even with such similar protocols, it required some work to understand and account for the substantial differences in data. However, familiarity with the data meant that once we had observed the differences, they were easy to understand. A close reading of the published descriptions would have given the same hints. Further, it is likely that the differences were in fact useful for the validation, as they gave more variety and tested the coding schema in slightly different exchanges.

We also note some smaller differences. For the SCSdata recordings, a researcher was in the room with the participant; the MISC researchers were not. This may have led to some differences in the data, although we have not yet explored this. There is also a difference in audio quality. The audio files from the SCSdata are poor, because they were recorded through a video camera; using those recordings was never part of the experimental setup.

Finally, there are details of the protocol which may have resulted in minor differences between the sets. MISC featured a warm-up task, while SCSdata did not; MISC participants used a Windows PC, while SCSdata participants used a Mac; and MISC intermediaries started with Bing, SCSdata intermediaries with Google, although all were allowed to switch to any other site.

2.2 Terminology

There has been some inconsistency in terminology. At first the two actors of the SCSdata were referred to as the "user" (the participant with the search task) and "retriever" (the participant with the search engine) [22, 24]. In later publications describing the SCSdata, "user" became "seeker" and "retriever" became "intermediary" [23]. These terms match MISC.

Other terminology is not standard. Trippas et al. used "spoken conversational search" to emphasise the spoken channel, as opposed to multi-turn interactions with e.g. typing or selecting buttons. For the same scenario, Thomas et al. used the phrase "information-seeking conversation" to encourage a broader understanding encompassing negotiation and clarification, not just a traditional query/response "search" model. Other terms again are used elsewhere in the literature. Presumably in the near future this terminology, as well as the names of the different roles, will be standardised.

2.3 Task design

As explained in Section 1.2, the tasks used for the SCSdata were reused from research by Bailey et al. [2] and are based on the Taxonomy of Learning. Three of the five cognitive dimensions were used: Remember, Understand, and Analyse. However, it has been suggested that there are no clear interaction differences between Understand and Analyse tasks [22], which is consistent with the difficulties Moffat et al. reported when classifying tasks [15].
Table 1: MISC search tasks. These were controlled for complexity, difficulty, and likely emotional response.

Task  Difficulty  Complexity  Emotion   Task source
0     Warm-up     (NA)        (NA)      Buhi et al. [7], via RepAST
1     Low         Low         Positive  Modified TREC topic 442
2     Low         High        Negative  Broussard and Zhang [6], via RepAST
3     High        Low         (NA)      Newly created
4     High        High        Positive  White [26], via RepAST
The MISC tasks were gathered from different sources, and one task was created specifically for this study (Table 1). More specifically, the tasks used in MISC were chosen to elicit positive and negative emotions, and were based on two levels each of difficulty and complexity, as seen in Table 1. Since MISC uses only two levels, it would perhaps make sense to consider Understand and Analyse as high complexity, and Remember as low complexity, if task-to-task comparisons were needed. Alternatively, differences in interaction patterns may let us align tasks across the two sets. We have not yet explored these possibilities.
We have not yet explored these possibilities. [6] Ramona Broussard and Yan Zhang. 2013. Seeking treatment options: Consumers’
search behaviors and cognitive activities. J. American Society for Information
Science and Technology 50, 1 (2013), 1–10.
3 OBSERVATIONS [7] Eric R. Buhi, Ellen M. Daley, Hollie J. Fuhrmann, and Sarah A. Smith. 2009.
An observational study of how young people search for online sexual health
Two sets of spoken conversational searches—SCSdata and MISC— information. J American College Health 58, 2 (2009), 101–111.
were collected independently, by different teams, in different ge- [8] Sheldon Cohen, Tom Kamarck, and Robin Mermelstein. 1983. A global measure
ographical locations, to support different research. It is fortunate of perceived stress. J. Health and Social Behavior 24, 4 (Dec. 1983), 385–396.
[9] Penny J. Daniels, H. M. Brooks, and N. J. Belkin. 1985. Using problem structures for
that the data sets are similar enough so that we can make direct driving human-computer dialogues. In RIAO-85: Actes: Recherche d’Informations
comparisons, and use one set to verify observations from the other. Assistée par Ordinateur. 645–660.
[10] Birgit Endrass, Matthias Rehm, and Elisabeth André. 2009. Culture-specific
Despite being collected with very similar goals and methods, communication management for virtual agents. In Proc. Int. Confȯn Autonomous
relatively small differences in protocol made observable differences Agents and Multiagent Systems—Volume 1. 281–287.
to the data and we have had to be careful with reuse and compar- [11] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent
Developments in openSMILE, the Munich Open-Source Multimedia Feature
isons. This was made much easier by our familiarity with the data; Extractor. In Proc. ACM Multimedia. 835–838.
another researcher could quite reasonably choose these two data [12] Diane Kelly, Jaime Arguello, Ashlee Edwards, and Wan-ching Wu. 2015. De-
sets, compare them, and have difficulty. That this is possible de- velopment and evaluation of search tasks for IIR experiments using a cognitive
complexity framework. In Proc. Int. Conf. on the Theory of Information Retrieval.
spite careful design and description, and despite close similarity in 101–110.
protocol, may perhaps caution us about reuse in interactive studies [13] Martha Larson and Gareth JF Jones. 2012. Spoken content retrieval: A survey of
techniques and technologies. Foundations and Trends® in Information Retrieval 5,
generally. 4–5 (2012), 235–422.
We were however helped by the decision to explicitly allow the [14] Daniel McDuff, Paul Thomas, Mary Czerwinski, and Nick Craswell. 2017. Multi-
release of MISC’s raw data (not just, e.g., transcripts). Because audio modal analysis of vocal collaborative search: a public corpus and results. In Proc.
ACM Int. Conf. on Multimodal Interaction. 456–463.
was available, the transcription errors could be detected. Unfortu- [15] Alistair Moffat, Peter Bailey, Falk Scholer, and Paul Thomas. 2014. Assessing
nately ethical clearance precludes a similar release for SCSdata, and the cognitive complexity of information needs. In Proc. Australasian Document
this may limit reuse. Computing Symposium. ACM, 97–100.
[16] National Aeronautics and Space Administration Human Systems Integration
Communication between two people is very culture-specific [10]. Division. 2016. TLX @ NASA Ames. (2016). Retrieved January 2017 from
Even though both MISC and SCSdata were collected in English https://humansystems.arc.nasa.gov/groups/TLX/
[17] Heather L O’Brien and Elaine G Toms. 2010. The development and evaluation of
speaking countries, and all participants claimed native or high- a survey to measure user engagement. J American Society for Information Science
level English, we do not exclude that cultural differences played a and Technology 61, 1 (2010), 50–69.
role in the differences in the two data sets. Similarly, the difference [18] James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn. 2015.
The development and psychometric properties of LIWC2015. Technical Report.
in participant populations (more uniform in SCSdata, more varied University of Texas at Austin.
in MISC) may have resulted in differences in communication. [19] Rachel Reichman. 1985. Getting computers to talk like you and me. MIT Press,
Cambridge, Massachusetts.
[20] Paul Thomas, Mary Czerwinksi, Daniel McDuff, Nick Craswell, and Gloria Mark.
Spoken conversational search is still an immature field of inquiry, 2018. Style and alignment in information-seeking conversation. In Proc. ACM
and we should exercise some caution re-using data sets. Nuances SIGIR Conf. on Human Information Interaction and Retrieval. 42–51.
[21] Paul Thomas, Daniel McDuff, Mary Czerwinski, and Nick Craswell. 2017. MISC: A data set of information-seeking conversations. In Proc. Int. Workshop on Conversational Approaches to Information Retrieval.
[22] Johanne R. Trippas, Lawrence Cavedon, Damiano Spina, and Mark Sanderson. 2017. How do people interact in conversational speech-only search tasks: A preliminary analysis. In Proc. ACM SIGIR Conf. on Human Information Interaction and Retrieval. 325–328.
[23] Johanne R. Trippas, Damiano Spina, Lawrence Cavedon, Hideo Joho, and Mark Sanderson. 2018. Informing the design of spoken conversational search: Perspective paper. In Proc. ACM SIGIR Conf. on Human Information Interaction and Retrieval. 32–41.
[24] Johanne R. Trippas, Damiano Spina, Lawrence Cavedon, and Mark Sanderson. 2017. A conversational search transcription protocol and analysis. In Proc. Int. Workshop on Conversational Approaches to Information Retrieval.
[25] Svitlana Vakulenko, Kate Revoredo, Claudio Di Ciccio, and Maarten de Rijke. 2019. QRFA: A data-driven model of information-seeking dialogues. In Proc. European Conf. on Information Retrieval. To appear.
[26] Ryen W. White. 2004. Implicit feedback for interactive information retrieval. Ph.D. Dissertation. University of Glasgow.
[27] Stephen P. Whiteside and Donald R. Lynam. 2003. Understanding the role of impulsivity and externalizing psychopathology in alcohol abuse: Application of the UPPS impulsive behavior scale. 11, 3 (2003), 669–689.