PodCastle and Songle: Crowdsourcing-Based Web Services for Retrieval and Browsing of Speech and Music Content

Masataka Goto, Jun Ogata, Kazuyoshi Yoshii, Hiromasa Fujihara, Matthias Mauch, Tomoyasu Nakano
National Institute of Advanced Industrial Science and Technology (AIST)
1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan
m.goto [at] aist.go.jp

ABSTRACT
This paper describes two web services, PodCastle and Songle, that collect voluntary contributions by anonymous users in order to improve the experiences of users listening to speech and music content available on the web. These services use automatic speech-recognition and music-understanding technologies to provide content analysis results, such as full-text speech transcriptions and music scene descriptions, that let users enjoy content-based multimedia retrieval and active browsing of speech and music signals without relying on metadata. When automatic content analysis is used, however, errors are inevitable. PodCastle and Songle therefore provide an efficient error correction interface that lets users easily correct errors by selecting from a list of candidate alternatives.

Keywords
Multimedia retrieval, web services, spoken document retrieval, active music listening, wisdom of crowds, crowdsourcing

Figure 1: Screen snapshot of PodCastle's interface for correcting speech recognition errors. Competitive candidate alternatives are presented under the recognition results. A user corrected three errors in this excerpt by selecting from the candidates.

1. INTRODUCTION
Our goal is to provide end users with public web services based on speech recognition, music understanding, signal processing, machine learning, and crowdsourcing so that they can experience the benefits of state-of-the-art research-level technologies. Since the amount of speech and music data available on the web is always increasing, there is a growing need to retrieve this data. Unlike text data, however, the speech and music data itself cannot be used as an index for information retrieval. Although metadata or social tags are often put on speech and music, annotations such as categories or topics tend to be broad and insufficient for useful content-based information retrieval [1]. Furthermore, even if users can find their favorite content, listening to it takes time. Content-based active browsing that allows random access to a desired part of the content and facilitates deeper understanding of the content is important for improving the experiences of users listening to speech and music. We therefore developed two web services for content-based retrieval and browsing: PodCastle for speech data and Songle for music data.

PodCastle (http://en.podcastle.jp for the English version and http://podcastle.jp for the Japanese version) [6, 7, 15, 16] is a spoken document retrieval service that uses automatic speech recognition (ASR) technologies to provide full-text searching of the speech data in podcasts, individual audio or movie files on the web, and video clips on the video sharing services YouTube, Nico Nico Douga, and Ustream.tv. PodCastle enables users to find English and Japanese speech data containing a search term, read the full texts of the recognition results, and easily correct recognition errors by simply selecting from a list of candidate alternatives displayed on an error correction interface (Figure 1). The resulting corrections are used to improve the speech retrieval and recognition performance, and users can actively browse speech data by jumping to any word in the recognition results during playback. In our experience with its use over the past five years (since December 2006), more than 580,000 recognition errors were corrected by anonymous users, and we confirmed that PodCastle's speech recognition performance was improved by those corrections.

Following the success of PodCastle, we launched Songle (http://songle.jp) [8], an active music listening service that enriches music listening experiences by using music-understanding technologies based on signal processing. Songle serves as a showcase, demonstrating how people can benefit from music-understanding technologies, by enabling people to experience active music listening interfaces [5] on the web. Songle facilitates deeper understanding of music by visualizing automatically estimated music scene descriptions such as music structure, hierarchical beat structure, melody line, and chords (Figure 2). Users can actively browse music data by jumping to a chorus or repeated section during playback and can use a content-based retrieval function to find music with similar vocal timbres. Songle also features an efficient error correction interface that encourages people to help improve Songle by correcting estimation errors.
2. PODCASTLE: A SPOKEN DOCUMENT RETRIEVAL SERVICE IMPROVED BY USER CONTRIBUTIONS
In 2006 we launched an ASR-based speech retrieval service, called PodCastle [6, 7, 15, 16], that provides full-text searching of speech data available on the web, and since then we have been improving its functions. Just as there is a growing need for full-text search services over text web pages, there is a growing need for full-text speech retrieval services. Although there were research projects on speech retrieval [9, 12, 13, 20, 21, 24] before 2006, most did not provide public web services for podcasts. Two major exceptions, Podscope [17] and PodZinger [18], started web services in 2005 for speech retrieval targeting English-language podcasts. They displayed only parts of the speech recognition results, however, making it impossible to visually ascertain the detailed content of the speech data, and users who found speech recognition errors were offered no way to correct them. ASR technologies cannot avoid making recognition errors when processing the vast amount of speech data available on the web, because speech corpora covering the diversity of topics, vocabularies, and speaking styles cannot be prepared in advance. As a result, the users of a web service based on those technologies might be disappointed by its performance.

Our PodCastle web service therefore enables anonymous users to contribute by correcting speech-recognition errors. Since it provides the full text of speech recognition results, users can read those texts with a cursor moving in synchronization with the audio playback on a web browser. A user who finds a recognition error while listening can easily correct it by simply selecting an alternative from a list of candidates or typing the correct text on the error correction interface shown in Figure 1 [14]. The resulting corrections can then not only be immediately shared with other users and used to improve the spoken document retrieval performance for the corrected speech data, but also be used to gradually improve the speech recognition performance by training our speech recognizer so that other speech data can be searched more reliably. This approach can be described as collaborative training for speech-recognition technologies.

2.1 Three Functions of PodCastle
PodCastle supports three functions: retrieving, browsing, and annotating speech data. The retrieval and browsing functions let users understand the speech recognition performance better, and the annotation (error correction) function allows them to contribute to improved performance. This improved performance can then lead to a better user experience of retrieving and browsing speech data.

2.1.1 Retrieval Function
This function allows a full-text search of speech recognition results. When the user types in a search term, a list of speech data containing this term is displayed together with text excerpts of the speech recognition results around the highlighted search term. These excerpts can be played back individually. The user can access the full text of one of the search results by selecting that result and then switching over to the browsing function.
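The paper does not describe how the full-text index behind this function is implemented. As a rough illustration only, the minimal sketch below (all class and variable names are our own, hypothetical ones) shows one way that time-stamped ASR transcripts could be indexed so that a search hit can be mapped back to an episode and a playback position.

from collections import defaultdict

# Minimal sketch of a full-text index over time-stamped ASR transcripts.
# The names below are hypothetical and do not reflect PodCastle's code.
class TranscriptIndex:
    def __init__(self):
        # word -> list of (episode_id, start_time_in_seconds)
        self.postings = defaultdict(list)

    def add_episode(self, episode_id, words):
        """words: list of (word, start_time_in_seconds) pairs from a transcript."""
        for word, start in words:
            self.postings[word.lower()].append((episode_id, start))

    def search(self, term):
        """Return (episode_id, start_time) hits so that an excerpt around
        the highlighted term can be played back individually."""
        return list(self.postings.get(term.lower(), []))

# Usage
index = TranscriptIndex()
index.add_episode("episode-42", [("speech", 12.3), ("retrieval", 12.9)])
print(index.search("retrieval"))  # [('episode-42', 12.9)]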
Figure 2: Screen snapshot of Songle's main interface for music playback with the visualization of automatically estimated music scene descriptions.

2.1.2 Browsing (Reading) Function
With this function the user can view the transcribed text of the speech data. To make errors easy to discover, each word is colored according to the degree of reliability estimated during speech recognition. Furthermore, a cursor moves across the text in synchronization with the audio playback. Because the corresponding full-text result of speech recognition is available to external full-text search engines, it can be found by those engines.
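The paper does not give the actual mapping from recognition reliability to color. A minimal sketch of such a mapping, assuming a word-level confidence score in [0, 1] and purely illustrative thresholds, might look like this:

def reliability_color(confidence):
    """Map an ASR word confidence in [0, 1] to a display color.
    The thresholds and colors are illustrative, not PodCastle's actual values."""
    if confidence >= 0.9:
        return "black"    # reliable words are shown normally
    elif confidence >= 0.6:
        return "dimgray"  # somewhat uncertain words are de-emphasized
    else:
        return "red"      # likely errors stand out as candidates for correction

words = [("hello", 0.97), ("wold", 0.42)]
print([(w, reliability_color(c)) for w, c in words])
# [('hello', 'black'), ('wold', 'red')]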
2.1.3 Annotation (Error Correction) Function
This function lets users add annotations to correct any recognition errors. Here, annotation means transcribing the content of speech data, either by selecting the correct alternative from a list of competitive candidates or by typing in the correct text. On the error correction interface we proposed earlier [14] (Figure 1), a recognition result excerpt is shown around the cursor and scrolled in synchronization with the audio playback. Each word in the excerpt is accompanied by other candidate words generated beforehand by using a confusion network [11], which condenses the huge internal word graph of a large vocabulary continuous speech recognition (LVCSR) system. Users do not have to worry about temporal errors in word boundaries when typing in the correct text, because the temporal position of each word boundary is automatically adjusted when training the speech recognizer. Note that users are not expected to correct all the errors but to correct some errors according to their interests.
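A confusion network represents the recognizer's hypotheses as a sequence of word slots, each holding competing words with posterior probabilities, so the candidate list shown under each word can be read directly from the corresponding slot. The sketch below, with hypothetical data, illustrates that idea; it is not PodCastle's implementation.

# Hypothetical confusion network: one slot per word position, each slot a
# list of (word, posterior probability) pairs sorted by posterior.
confusion_network = [
    [("we", 0.94), ("he", 0.05)],
    [("launched", 0.61), ("lunched", 0.30), ("branched", 0.08)],
    [("podcastle", 0.88), ("pod", 0.07), ("castle", 0.04)],
]

def candidates(slot, k=3):
    """Top-k competitive candidates offered to the user for one word position."""
    return [word for word, _ in sorted(slot, key=lambda pair: -pair[1])[:k]]

def apply_correction(transcript, position, chosen_word):
    """Replace the recognized word at `position` with the user's selection."""
    corrected = list(transcript)
    corrected[position] = chosen_word
    return corrected

recognized = [slot[0][0] for slot in confusion_network]  # 1-best word sequence
print(candidates(confusion_network[1]))   # ['launched', 'lunched', 'branched']
print(apply_correction(recognized, 1, "launched"))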
2.2 Experiences with PodCastle
The Japanese version of PodCastle was released to the public at http://podcastle.jp on December 1st, 2006, and the English version was released at http://en.podcastle.jp on October 12th, 2011. Although in the Japanese version we used AIST's speech recognizer, we have collaborated with the University of Edinburgh's Centre for Speech Technology Research (CSTR) and in the English version used their speech recognizer. In addition to supporting audio podcasts, PodCastle has supported video podcasts since 2009 and in 2011 began supporting video clips on YouTube, Nico Nico Douga, and Ustream.tv (recorded videos). This additional support is implemented by transcribing the speech data in video clips and displaying an accompanying video screen in synchronization with the original PodCastle screen, as shown in Figure 1. PodCastle has also added functions for annotating speaker names and paragraphs (new lines), for marking (changing the color of) correct words that need no correction, and for showing the percentage of correction (which reaches 100% when all the words are marked as "correct"). When several users are correcting different parts of the same speech data, their corrections can be automatically shared (synchronized) and shown on their screens. This is useful for simultaneously and rapidly transcribing speech data together.

Figure 3: Cumulative usage statistics for PodCastle: the number of podcasts, the number of episodes (audio or video files), and the number of searches (queries). (The chart marks a June 2008 press release that was reported in TV news and newspapers.)

Figure 4: Cumulative usage statistics for PodCastle: the number of corrected episodes and the number of corrected words.

As shown in Figure 3, 877 Japanese speech programs (such as podcasts and YouTube channels), comprising 147,280 audio files, had been registered by January 1st, 2012. Of those audio files, 3,279 had been at least partially corrected, resulting in the correction of 580,765 words (Figure 4). We found that some speech programs registered in PodCastle were corrected almost every day or every week, and we confirmed that the performance was improved by the wisdom of the crowd.

For the collaborative training of our speech recognizer, we introduced a podcast-dependent acoustic model that is trained for each podcast by using transcripts corrected by anonymous users [15, 16]. Our experiments confirmed that the speech recognition performance for some podcasts that received many error corrections was improved by the acoustic model training (a relative error reduction of 21-33%) [15] and that the burden of error correction was reduced for those podcasts. We also confirmed that the performance was improved by language model training, and this will be reported in another paper.
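Conceptually, this collaborative training forms a loop in which corrected transcripts adapt a podcast-dependent acoustic model, and the adapted model then re-transcribes episodes that nobody has corrected yet. The sketch below outlines that loop; the recognizer and database objects and their methods are placeholders we invented for illustration, not the actual PodCastle components.

def collaborative_training_cycle(podcast, recognizer, corrections_db):
    """One conceptual pass of PodCastle-style collaborative training.
    `recognizer.adapt` and the `corrections_db` queries are hypothetical
    placeholders for the real adaptation and storage steps."""
    # 1. Collect episodes of this podcast whose transcripts users have corrected.
    corrected = corrections_db.corrected_episodes(podcast)
    pairs = [(ep.audio, ep.corrected_transcript) for ep in corrected]
    if not pairs:
        return recognizer
    # 2. Adapt a podcast-dependent acoustic model to the corrected data.
    adapted = recognizer.adapt(pairs)
    # 3. Re-transcribe the remaining episodes with the adapted model so the
    #    improvement also reaches episodes nobody has corrected yet.
    for ep in corrections_db.uncorrected_episodes(podcast):
        ep.transcript = adapted.transcribe(ep.audio)
    return adapted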
We have inferred some motivations for users to correct errors, though we cannot ask them directly since they are anonymous. These motivations can be categorized as follows:
• Error correction itself is enjoyable and interesting
Since the error correction interface is carefully designed to be useful and efficient, using it can be fun, especially for proficient users who master quick and accurate operations, somewhat like the fun some people find in video games.
• Users want to contribute
Some users correct errors not only for their own convenience but also to altruistically contribute to the improvement of speech recognition and retrieval.
• Users want their speech data to be correctly searched
The creators of speech data (such as podcasters) correct recognition errors in their own speech data so that it can be searched more accurately.
• Users like the content and cannot tolerate the presence of recognition errors in it
Some fans of famous artists or TV personalities correct errors because they like the speakers' voices and cannot tolerate recognition errors in their favorite content. We have indeed observed that such speech data generally receives more corrections than other kinds.

3. SONGLE: AN ACTIVE MUSIC LISTENING SERVICE IMPROVED BY USER CONTRIBUTIONS
In 2011 we launched a web service, called Songle [8], that allows web users to enjoy music by using active music listening interfaces [5], where active music listening is a way of listening to music through active interactions. In this context the word active does not mean that listeners create new music, but that they take control of their own listening experience. For example, an active music listening interface called SmartMusicKIOSK [4] has a chorus-search function that enables a user to directly access his or her favorite part of a song (and to skip other parts) while viewing a visual representation of its music structure. This facilitates deeper understanding of the music, but up to now the general public has not had the chance to use such research-level interfaces and technologies in their daily lives.

Toward the goal of enriching music listening experiences, Songle uses automatic music-understanding technologies to estimate music scene descriptions (musical elements) [3] of musical pieces (audio files) available on the web. A Songle user can enjoy playing back a musical piece while seeing the visualization of the estimated descriptions. In our current implementation, four major types of descriptions are automatically estimated and visualized for content-based music browsing: music structure (chorus sections and repeated sections), hierarchical beat structure (musical beats and bar lines), melody line (the fundamental frequency (F0) of the vocal melody), and chords (root note and chord type). Songle implements all the functions of the SmartMusicKIOSK interface and lets a user jump to and listen to the chorus by just pushing the next-chorus button. Songle thus makes it easier for a user to find desired parts of a piece.

Given the variety of musical pieces on the web, however, music scene descriptions are hard to estimate accurately. Because of the diversity of music genres and recording conditions and the complexity of sound mixtures, automatic music-understanding technologies cannot avoid making some errors. As a result, the users of a web service based on those technologies might be disappointed by its performance.

Our Songle web service therefore enables anonymous users to help improve its performance by correcting music-understanding errors. Each user can see the music-understanding visualizations on a web browser, where a moving cursor indicates the audio playback position. A user who finds an error while listening can easily correct it by selecting from a list of candidate alternatives or by providing an alternative description via an error correction interface. The resulting corrections are then shared and used to immediately improve the user experience with the corrected piece. We also plan to use such corrections to gradually improve music-understanding technologies through adaptive machine learning techniques so that the descriptions of other musical pieces can be estimated more accurately. This approach can be described as collaborative training for music-understanding technologies.

The alpha version of Songle was released to the public at http://songle.jp on October 20th, 2011. During this initial stage of the Songle launch we are focusing on popular songs with vocals. A user can register any song available on the web by providing the URL of its MP3 file, the URL of a web page including multiple MP3 URLs, or the URL of a music podcast (an RSS syndication feed including multiple MP3 URLs). In addition to contributing to the enrichment of music listening experiences, Songle will serve as a showcase in which everybody can experience music-understanding technologies and understand their nature: for example, what kinds of music or sound mixtures are difficult for the technologies to handle.
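The four types of music scene descriptions can be viewed as time-stamped annotations over an audio file. The following minimal data-model sketch (class and field names are our own, not Songle's) shows one way such descriptions could be represented:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Section:           # music structure: a chorus or repeated section
    start: float
    end: float
    is_chorus: bool

@dataclass
class Beat:              # hierarchical beat structure
    time: float
    is_bar_line: bool    # True for the first beat of a bar

@dataclass
class MelodyFrame:       # melody line as an F0 trajectory
    time: float
    f0_hz: float

@dataclass
class Chord:             # root note and chord type over a time span
    start: float
    end: float
    name: str            # e.g. "C", "Am7"

@dataclass
class MusicSceneDescription:
    structure: List[Section] = field(default_factory=list)
    beats: List[Beat] = field(default_factory=list)
    melody: List[MelodyFrame] = field(default_factory=list)
    chords: List[Chord] = field(default_factory=list)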
3.1 Three Functions of Songle
Songle supports three main functions: retrieving, browsing, and annotating songs. The retrieval and browsing functions facilitate deeper understanding of music, and the annotation (error correction) function allows users to contribute to the improvement of the music scene descriptions. The improved descriptions can lead to a better user experience of retrieving and browsing songs.

3.1.1 Retrieval Function
This function enables a user to retrieve a song by making a text search for the song title or artist name, or by making a selection from a list of artists or a list of songs whose descriptions were recently estimated or corrected. This function also shows various kinds of rankings. Following the idea of the active music listening interface called VocalFinder [2], which finds songs with similar vocal timbres, Songle provides a similarity graph of songs so that a user can retrieve a song according to vocal timbre similarity. The graph is a radially connected network in which nodes (songs) of similar vocal timbre are connected to the center node (a recommended or user-specified song). By traversing the graph while listening to nodes, a user can find a song with a favorite vocal timbre. By selecting a song, the user switches over to the within-song browsing function.
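Assuming each song has a precomputed vocal-timbre feature vector, the nodes connected to the center of such a similarity graph could be obtained roughly as in the sketch below, i.e., as a nearest-neighbor query under cosine similarity. The feature extraction itself, described in [2], is outside this sketch, and the function names are hypothetical.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vocal_similarity_neighbors(center_song, songs, timbre, k=8):
    """Return the k songs whose vocal-timbre vectors are most similar to the
    center song; these become the nodes connected to the center node.
    `timbre` maps song id -> feature vector and is assumed to be precomputed."""
    scored = [
        (other, cosine_similarity(timbre[center_song], timbre[other]))
        for other in songs if other != center_song
    ]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]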
3.1.2 Within-song Browsing Function
This function provides a content-based playback-control interface for within-song browsing, as shown in the upper half of Figure 2. The upper window is the global view showing the entire song, and the lower window is the local view magnifying the selected region. A user can view the following four types of automatically estimated music scene descriptions:
1. Music structure (chorus sections and repeated sections)
In the global view, the music map of the SmartMusicKIOSK interface [4] is shown below the playback controls, which include the buttons, time display, and playback slider. The music map is a graphical representation of the entire song structure and consists of chorus sections (the top row) and repeated sections (the five lower rows). On each row, colored sections indicate similar (repeated) sections. Clicking directly on a colored section plays that section.
2. Hierarchical beat structure (musical beats and bar lines)
At the bottom of the local view, musical beats corresponding to quarter notes are visualized by using small triangles. Bar lines are marked by larger triangles.
3. Melody line (F0 of the vocal melody)
The piano-roll representation of the melody line is shown above the beat structure in the local view. It is also shown in the lower half of the global view. For simplicity, the fundamental frequency (F0) can be visualized after being quantized to the closest semitone.
4. Chords (root note and chord type)
Chord names are written as text at the top of the local view. Twelve different colors are used to represent the twelve different root notes so that a user can notice the repetition of chord progressions.

3.1.3 Annotation (Error Correction) Function
This function allows users to add annotations to correct any estimation errors. Here, annotation means describing the contents of a song, either by modifying the estimated descriptions or by selecting the correct candidate if it is available. In the local view, a user can switch between editors for the four types of music scene descriptions.
1. Music structure (Figure 5(a))
The beginning and end points of every chorus or repeated section can be adjusted. It is also possible to add, move, or delete each section. This correction function improves the SmartMusicKIOSK experience.
2. Hierarchical beat structure (Figure 5(b))
Several alternative candidates for the beat structure can be selected at the bottom of the local view. If none of the candidates are appropriate, a user can enter the beat positions by tapping a key during music playback. Each beat position or bar line can also be changed directly. For fine adjustment it is possible to play the audio back with click tones at beats.
3. Melody line (Figure 5(c))
Songle allows note-level correction on the piano-roll representation of the melody line. Since the melody line is internally represented as the temporal trajectory of F0, more precise correction is also possible. More accurate melody annotations will lead to better similarity graphs of songs.
4. Chords (Figure 5(d))
Chord names can be corrected by choosing from candidates or by typing in chord names. Each chord boundary can also be adjusted. Chords can be played back along with the original song to make it easier to check their correctness.

Figure 5: Screen snapshots of Songle's annotation function for correcting music scene descriptions: (a) correcting music structure (chorus sections and repeated sections), (b) correcting hierarchical beat structure (musical beats and bar lines), (c) correcting melody line (F0 of the vocal melody), and (d) correcting chords (root note and chord type).

Note that users can simply enjoy active music listening without correcting errors. We understand that it is too difficult for some users to correct the above descriptions (especially chords). Designing an interface that makes it easier for them to make corrections will be another future challenge. Moreover, users are not expected to correct all errors, only some according to each user's interests.

When the music-understanding results are corrected by users, the original values are visualized as trails in different colors (the white, gray, or yellow marks in Figure 5) that anybody can distinguish. These trails are important to prevent overestimation of the automatic music-understanding performance after the user corrections. Moreover, all the correction histories are recorded, and the descriptions before and after corrections can be compared.
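A minimal sketch of how such a correction history could be recorded, keeping the original estimate alongside the user's value so that trails can be drawn and before/after descriptions compared, is shown below; the class and field names are hypothetical, not Songle's.

import time
from dataclasses import dataclass

@dataclass
class CorrectionRecord:
    """One entry in a description's correction history. Keeping the original
    estimate lets the interface draw it as a trail and lets descriptions
    before and after the correction be compared later."""
    description_type: str    # "structure", "beats", "melody", or "chords"
    original_value: object   # automatically estimated value
    corrected_value: object  # value entered or chosen by the user
    timestamp: float

history = []

def record_correction(description_type, original_value, corrected_value):
    entry = CorrectionRecord(description_type, original_value,
                             corrected_value, time.time())
    history.append(entry)    # the full history is kept; nothing is overwritten
    return entry

# Usage: a user corrects one chord label.
record_correction("chords", "Am", "Am7")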
4. DISCUSSION
We discuss how PodCastle and Songle could contribute to society and to academic research.

4.1 Contributions of PodCastle and Songle
PodCastle and Songle make social contributions by providing public web services that let people retrieve speech data by using speech-recognition technologies and enjoy active music listening interfaces built on music-understanding technologies. They also promote the popularization and use of speech-recognition and music-understanding technologies by raising user awareness. Users can grasp the nature of those technologies just by seeing the results obtained when the technologies are applied to speech data and songs available on the web. We risk attracting criticism when there are many errors, but we believe that sharing these results with users will promote the popularization of this research field.

PodCastle and Songle make academic contributions by demonstrating a new research approach to speech recognition and music understanding based on signal processing; this approach aims to improve the speech-recognition and music-understanding performances, as well as the usage rates, while benefiting from the cooperation of anonymous end users. This approach is designed to set into motion a positive spiral in which (1) we enable users to experience a service based on speech recognition or music understanding so that they better understand its performance, (2) users contribute to improving that performance, and (3) the improved performance leads to a better user experience, which encourages further use of the service at step (1) of the spiral. This is a social correction framework, in which users improve the performance by sharing their correction results over a web service. The game-based approach of Human Computation or GWAPs (games with a purpose) [22], like the ESP Game [23], often lacks step (3) and depends on the feeling of fun. In our framework, users gain a real sense of contributing for their own benefit and that of others and can be further motivated to contribute by seeing corrections made by other users. In this way, we can use the wisdom of the crowd, or crowdsourcing, to achieve a better user experience.

Another important technical contribution is that PodCastle and Songle let us investigate how much the performance of speech-recognition and music-understanding technologies can be improved by getting errors corrected through the cooperative efforts of users. Although we have already implemented a machine-learning mechanism that improves the performance of the speech-recognition technology on the basis of user corrections on PodCastle, we have not yet implemented such a mechanism for the music-understanding technology on Songle because it has only recently been launched. When we have collected enough corrections, we could implement such a mechanism on Songle as well. This study thus provides a framework for amplifying user contributions. In a typical Web 2.0 service like Wikipedia, improvements are limited to the items directly contributed (edited) by users. In PodCastle, the improvement of the speech recognition performance automatically spreads improvements to items not contributed by users. In Songle, improvements will also spread to other songs once we implement the improvement mechanism. This is a novel way of amplifying user contributions, which could go beyond Web 2.0 and Human Computation [22]. We hope that this study will show the importance and potential of incorporating and amplifying user contributions and that various other projects [10, 19] will follow this approach, thus adding a new dimension to this field of research.

One Web 2.0 principle is to trust users, and we think users can also be trusted with respect to the quality of their corrections. In fact, as far as we have assessed it, the quality of the correction results obtained so far has been high. One reason may be that PodCastle and Songle avoid relying on monetary rewards as Amazon Mechanical Turk does. Even if some users deliberately make inappropriate corrections (the vandalism problem), we will be able to develop countermeasures that evaluate the reliability of corrections acoustically. For example, we could validate whether the corrected descriptions are supported by acoustic phenomena. This will be another interesting research topic.

4.2 PodCastle and Songle as a Research Platform
We hope to extend PodCastle and Songle to serve as a research platform where other researchers can also exhibit the results of their own speech-recognition and music-understanding technologies. Since even in our current implementations of PodCastle and Songle a module for each technology can be executed anywhere in the world, its source and binary code need not be shared. The module can simply connect to our web server to receive an audio file and send back speech-recognition or music-understanding results via HTTP. The results should always be shown with clear acknowledgments/credits so that users can distinguish their sources.

This platform is especially useful for supporting additional languages in PodCastle. In fact, the English version of PodCastle was implemented on this platform, and CSTR's speech recognizer for English is executed at CSTR, University of Edinburgh.
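The paper does not specify the exchange protocol beyond the fact that audio is received and results are returned via HTTP. The sketch below, using only Python's standard library and hypothetical endpoint URLs, illustrates the kind of external module the platform envisions.

import json
import urllib.request

SERVER = "https://example.org/platform"  # hypothetical endpoints, not the real API

def run_external_module(analyze):
    """Fetch one audio file from the platform server, run a third-party
    speech-recognition or music-understanding module on it locally, and
    post the result back over HTTP. `analyze` is the researcher's own code."""
    # 1. Receive an audio file (the module can run anywhere in the world).
    with urllib.request.urlopen(SERVER + "/next-audio") as response:
        audio_bytes = response.read()
    # 2. Analyze it locally; source and binary code never leave the module.
    result = analyze(audio_bytes)  # e.g. a transcript or chord labels
    # 3. Send the result back together with a credit string so that users
    #    can see which research group produced it.
    payload = json.dumps({"result": result, "credit": "Example Lab"}).encode()
    request = urllib.request.Request(SERVER + "/results", data=payload,
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)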
5. CONCLUSION
We have described PodCastle, a spoken document retrieval service that provides a search engine for web speech data and is based on the wisdom of the crowd (crowdsourcing), and Songle, an active music listening service that is continually improved by anonymous user contributions. In our current implementations, full-text transcriptions of speech data and four types of music scene descriptions are recognized, estimated, and displayed through web-based interactive user interfaces. Since automatic speech-recognition and music-understanding technologies are not perfect, PodCastle and Songle allow users to make error corrections that are shared with other users, thus creating a positive spiral and giving users an incentive to keep making corrections. This platform will act both as a test-bed or showcase for new technologies and as a way of collecting valuable annotations.

Acknowledgments: We thank Youhei Sawada, Shunichi Arai, Kouichirou Eto, and Ryutaro Kamitsu for their web service implementation of PodCastle, Utah Kawasaki for the web service implementation of Songle, and Minoru Sakurai for the web design of PodCastle and Songle. We also thank the anonymous users of PodCastle and Songle for correcting errors. This work was supported in part by CREST, JST.

6. REFERENCES
[1] M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-based music information retrieval: Current directions and future challenges. Proceedings of the IEEE, 96(4):668–696, 2008.
[2] H. Fujihara, M. Goto, T. Kitahara, and H. G. Okuno. A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity-based music information retrieval. IEEE Trans. on ASLP, 18(3):638–648, 2010.
[3] M. Goto. A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, 43(4):311–329, 2004.
[4] M. Goto. A chorus-section detection method for musical audio signals and its application to a music listening station. IEEE Trans. on ASLP, 14(5):1783–1794, 2006.
[5] M. Goto. Active music listening interfaces based on signal processing. In Proc. of ICASSP 2007, 2007.
[6] M. Goto and J. Ogata. PodCastle: Recent advances of a spoken document retrieval service improved by anonymous user contributions. In Proc. of Interspeech 2011, 2011.
[7] M. Goto, J. Ogata, and K. Eto. PodCastle: A Web 2.0 approach to speech recognition research. In Proc. of Interspeech 2007, 2007.
[8] M. Goto, K. Yoshii, H. Fujihara, M. Mauch, and T. Nakano. Songle: A web service for active music listening improved by user contributions. In Proc. of ISMIR 2011, pages 311–316, 2011.
[9] L. Lee and B. Chen. Spoken document understanding and organization. IEEE Signal Processing Magazine, 22(5):42–60, 2005.
[10] S. Luz, M. Masoodian, and B. Rogers. Supporting collaborative transcription of recorded speech with a 3D game interface. In Proc. of KES 2010, 2010.
[11] L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: Word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373–400, 2000.
[12] Cambridge Multimedia Document Retrieval Project. http://mi.eng.cam.ac.uk/research/projects/mdr/.
[13] CMU Informedia Digital Video Library Project. http://www.informedia.cs.cmu.edu/.
[14] J. Ogata and M. Goto. Speech Repair: Quick error correction just by using selection operation for speech input interfaces. In Proc. of Eurospeech 2005, pages 133–136, 2005.
[15] J. Ogata and M. Goto. PodCastle: Collaborative training of acoustic models on the basis of wisdom of crowds for podcast transcription. In Proc. of Interspeech 2009, pages 1491–1494, 2009.
[16] J. Ogata, M. Goto, and K. Eto. Automatic transcription for a Web 2.0 service to search podcasts. In Proc. of Interspeech 2007, 2007.
[17] Podscope. http://www.podscope.com/.
[18] PodZinger. http://www.podzinger.com/.
[19] N. Ramzan, M. Larson, F. Dufaux, and K. Cluver. The participation payoff: Challenges and opportunities for multimedia access in networked communities. In Proc. of ACM MIR 2010, 2010.
[20] J.-M. V. Thong, P. J. Moreno, B. Logan, B. Fidler, K. Maffey, and M. Moores. Speechbot: An experimental speech-based search engine for multimedia content on the web. IEEE Trans. on Multimedia, 4(1):88–96, 2002.
[21] V. Turunen, M. Kurimo, and I. Ekman. Speech transcription and spoken document retrieval in Finnish. Machine Learning for Multimodal Interaction, 3361:253–262, 2005.
[22] L. von Ahn. Games with a purpose. IEEE Computer Magazine, 39(6):92–94, June 2006.
[23] L. von Ahn and L. Dabbish. Labeling images with a computer game. In Proc. of CHI 2004, pages 319–326, 2004.
[24] S. Whittaker, J. Hirschberg, J. Choi, D. Hindle, F. Pereira, and A. Singhal. SCAN: Designing and evaluating user interfaces to support retrieval from speech archives. In Proc. of ACM SIGIR 99, pages 26–33, 1999.