PodCastle and Songle: Crowdsourcing-Based Web Services for Retrieval and Browsing of Speech and Music Content

Masataka Goto, Jun Ogata, Kazuyoshi Yoshii, Hiromasa Fujihara, Matthias Mauch, Tomoyasu Nakano
National Institute of Advanced Industrial Science and Technology (AIST)
1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan
m.goto [at] aist.go.jp

ABSTRACT
This paper describes two web services, PodCastle and Songle, that collect voluntary contributions by anonymous users in order to improve the experiences of users listening to speech and music content available on the web. These services use automatic speech-recognition and music-understanding technologies to provide content analysis results, such as full-text speech transcriptions and music scene descriptions, that let users enjoy content-based multimedia retrieval and active browsing of speech and music signals without relying on metadata. When automatic content analysis is used, however, errors are inevitable. PodCastle and Songle therefore provide an efficient error correction interface that lets users easily correct errors by selecting from a list of candidate alternatives.
Keywords
Multimedia retrieval, web services, spoken document retrieval, active music listening, wisdom of crowds, crowdsourcing

Figure 1: Screen snapshot of PodCastle's interface for correcting speech recognition errors. Competitive candidate alternatives are presented under the recognition results. A user corrected three errors in this excerpt by selecting from the candidates.

1. INTRODUCTION
Our goal is to provide end users with public web services based on speech recognition, music understanding, signal processing, machine learning, and crowdsourcing so that they can experience the benefits of state-of-the-art research-level technologies. Since the amount of speech and music data available on the web is always increasing, there are growing needs for the retrieval of this data. Unlike text data, however, speech and music data itself cannot be used as an index for information retrieval. Although metadata or social tags are often put on speech and music, annotations such as categories or topics tend to be broad and insufficient for useful content-based information retrieval [1]. Furthermore, even if users can find their favorite content, listening to it takes time. Content-based active browsing that allows random access to a desired part of the content and facilitates deeper understanding of it is important for improving the experiences of users listening to speech and music. We therefore developed two web services for content-based retrieval and browsing: PodCastle for speech data and Songle for music data.

PodCastle (http://en.podcastle.jp for the English version and http://podcastle.jp for the Japanese version) [6, 7, 15, 16] is a spoken document retrieval service that uses automatic speech recognition (ASR) technologies to provide full-text searching of the speech data in podcasts, individual audio or movie files on the web, and video clips on the video sharing services YouTube, Nico Nico Douga, and Ustream.tv. PodCastle enables users to find English and Japanese speech data including a search term, read the full texts of their recognition results, and easily correct recognition errors by simply selecting from a list of candidate alternatives displayed on an error correction interface (Figure 1). The resulting corrections are used to improve the speech retrieval and recognition performance, and users can actively browse speech data by jumping to any word in the recognition results during playback. In our experience with its use over the past five years (since December 2006), over five hundred eighty thousand recognition errors were corrected by anonymous users, and we confirmed that PodCastle's speech recognition performance was improved by those corrections.

Following the success of PodCastle, we launched Songle (http://songle.jp) [8], an active music listening service that enriches music listening experiences by using music-understanding technologies based on signal processing. Songle serves as a showcase, demonstrating how people can benefit from music-understanding technologies, by enabling people to experience active music listening interfaces [5] on the web. Songle facilitates deeper understanding of music by visualizing automatically estimated music scene descriptions such as music structure, hierarchical beat structure, melody line, and chords (Figure 2). Users can actively browse music data by jumping to a chorus or repeated section during playback and can use a content-based retrieval function to find music with similar vocal timbres. Songle also features an efficient error correction interface that encourages people to help improve Songle by correcting estimation errors.

Figure 2: Screen snapshot of Songle's main interface for music playback with the visualization of automatically estimated music scene descriptions.
2. PODCASTLE: A SPOKEN DOCUMENT RETRIEVAL SERVICE IMPROVED BY USER CONTRIBUTIONS
In 2006 we launched an ASR-based speech retrieval service, called PodCastle [6, 7, 15, 16], that provides full-text searching of speech data available on the web, and since then we have been improving its functions. Just as there is a growing need for full-text search services for text web pages, there is a growing need for full-text speech retrieval services. Although there were research projects for speech retrieval [9, 12, 13, 20, 21, 24] before 2006, most did not provide public web services for podcasts. There were two major exceptions, Podscope [17] and PodZinger [18], which in 2005 started web services for speech retrieval targeting English-language podcasts. They only displayed parts of the speech recognition results, however, making it impossible to visually ascertain the detailed content of the speech data. Moreover, users who found speech recognition errors were offered no way to correct them. ASR technologies cannot avoid making recognition errors when processing the vast amount of speech data available on the web because speech corpora covering the diversity of topics, vocabularies, and speaking styles cannot be prepared in advance. As a result, the users of a web service using those technologies might be disappointed by its performance.

Our PodCastle web service therefore enables anonymous users to contribute by correcting speech-recognition errors. Since it provides the full text of speech recognition results, users can read those texts with a cursor moving in synchronization with the audio playback on a web browser. A user who finds a recognition error while listening can easily correct it by simply selecting an alternative from a list of candidates or typing the correct text on the error correction interface shown in Figure 1 [14]. The resulting corrections can then not only be immediately shared with other users and used to improve the spoken document retrieval performance for the corrected speech data, but also be used to gradually improve the speech recognition performance by training our speech recognizer so that other speech data can be searched more reliably. This approach can be described as collaborative training for speech-recognition technologies.

2.1 Three Functions of PodCastle
PodCastle supports three functions: retrieving, browsing, and annotating speech data. The retrieval and browsing functions let users understand the speech recognition performance better, and the annotation (error correction) function allows them to contribute to improved performance. This improved performance can then lead to a better user experience of retrieving and browsing speech data.

2.1.1 Retrieval Function
This function allows a full-text search of speech recognition results. When the user types in a search term, a list of speech data containing this term is displayed together with text excerpts of the speech recognition results around the highlighted search term. These excerpts can be played back individually. The user can access the full text of one of the search results by selecting that result and then switching over to the browsing function.

2.1.2 Browsing (Reading) Function
With this function the user can view the transcribed text of the speech data. To make errors easy to discover, each word is colored according to the degree of reliability estimated during speech recognition. Furthermore, a cursor moves across the text in synchronization with the audio playback. Because the corresponding full-text result of speech recognition is available to external full-text search engines, it can be found by those engines.
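As an illustration of this reliability coloring, the sketch below maps per-word confidence scores to CSS classes; the thresholds, class names, and helper functions are our own illustrative assumptions, not PodCastle's actual implementation.

```python
# Minimal sketch: render recognized words as HTML, colored by
# recognition confidence. Thresholds and class names are
# illustrative assumptions, not PodCastle's actual values.

def confidence_class(conf: float) -> str:
    """Map a word confidence in [0, 1] to a CSS class name."""
    if conf >= 0.9:
        return "word-reliable"      # shown in a normal color
    if conf >= 0.6:
        return "word-uncertain"     # highlighted as possibly wrong
    return "word-unreliable"        # strongly highlighted

def render_transcript(words: list[tuple[str, float]]) -> str:
    """words: (surface form, confidence) pairs from the recognizer."""
    spans = [
        f'<span class="{confidence_class(c)}">{w}</span>'
        for w, c in words
    ]
    return " ".join(spans)

print(render_transcript([("pod", 0.95), ("castle", 0.55), ("rocks", 0.97)]))
```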
2.1.3 Annotation (Error Correction) Function
This function lets users add annotations to correct any recognition errors. Here, annotation means transcribing the content of speech data, either by selecting the correct alternative from a list of competitive candidates or by typing in the correct text. On an error correction interface we proposed earlier [14] (Figure 1), a recognition result excerpt is shown around the cursor and scrolled in synchronization with the audio playback. Each word in the excerpt is accompanied by other candidate words generated beforehand by using a confusion network [11] that condenses the huge internal word graph of a large vocabulary continuous speech recognition (LVCSR) system. Users do not have to worry about temporal errors in word boundaries when typing in the correct text because the temporal position of each word boundary is automatically adjusted in training the speech recognizer. Note that users are not expected to correct all the errors, only some according to their interests.
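To make the candidate mechanism concrete, here is a minimal sketch of a confusion network as a sequence of word slots and of how a user's selection could be applied; the data structure, probabilities, and helper functions are illustrative assumptions, not PodCastle's internal format.

```python
# Minimal sketch of a confusion network: a sequence of time slots,
# each holding competing word hypotheses with posterior probabilities.
# Structure and values are illustrative, not PodCastle's format.

confusion_network = [
    [("pod", 0.92), ("pot", 0.05), ("pond", 0.03)],
    [("cast", 0.48), ("castle", 0.45), ("cost", 0.07)],
    [("full", 0.88), ("fall", 0.12)],
]

def best_path(network):
    """The 1-best hypothesis: the top candidate in every slot."""
    return [max(slot, key=lambda wp: wp[1])[0] for slot in network]

def apply_correction(network, slot_index, chosen_word):
    """Simulate a user selecting a candidate (or typing a new word):
    the chosen word gets probability 1.0 in its slot."""
    network[slot_index] = [(chosen_word, 1.0)]

print(best_path(confusion_network))    # ['pod', 'cast', 'full']
apply_correction(confusion_network, 1, "castle")
print(best_path(confusion_network))    # ['pod', 'castle', 'full']
```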
2.2 Experiences with PodCastle
The Japanese version of PodCastle was released to the public at http://podcastle.jp on December 1st, 2006, and the English version was released at http://en.podcastle.jp on October 12th, 2011. In the Japanese version we used AIST's speech recognizer; for the English version we collaborated with the University of Edinburgh's Centre for Speech Technology Research (CSTR) and used their speech recognizer.
In addition to supporting audio podcasts, PodCastle has supported video podcasts since 2009 and in 2011 began supporting video clips on YouTube, Nico Nico Douga, and Ustream.tv (recorded videos). This additional support is implemented by transcribing the speech data in video clips and displaying an accompanying video screen in synchronization with the original PodCastle screen, as shown in Figure 1. PodCastle also supports functions for annotating speaker names and paragraphs (new lines), marking (changing the color of) correct words that do not need any correction, and showing the percentage of correction (which becomes 100% when all the words are marked as "correct"). When several users are correcting different parts of the same speech data, those corrections can be automatically shared (synchronized) and shown on their screens. This is useful for simultaneously and rapidly transcribing speech data together.

As shown in Figure 3, 877 Japanese speech programs (such as podcasts and YouTube channels), comprising 147,280 audio files, had been registered by January 1st, 2012. Of those audio files, 3,279 had been at least partially corrected, resulting in the correction of 580,765 words (Figure 4). We found that some speech programs registered in PodCastle were corrected almost every day or every week, and we confirmed that the performance was improved by the wisdom of the crowd.

Figure 3: Cumulative usage statistics for PodCastle: the number of podcasts (877), the number of episodes (audio or video files; 147,280), and the number of searches (queries; 97,900). [Chart omitted; it also marks a June 2008 press release that was reported in TV news and newspapers.]

Figure 4: Cumulative usage statistics for PodCastle: the number of corrected episodes (audio or video files; 3,279) and the number of corrected words (580,765). [Chart omitted.]

For the collaborative training of our speech recognizer, we introduced a podcast-dependent acoustic model that is trained for each podcast by using transcripts corrected by anonymous users [15, 16]. Our experiments confirmed that the speech recognition performance for some podcasts that received many error corrections was improved by the acoustic model training (relative error reduction of 21-33%) [15] and that the burden of error correction was reduced for those podcasts. We also confirmed that the performance was improved by language model training; this will be reported in another paper.
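For readers unfamiliar with the metric, relative error reduction compares word error rates (WER) before and after training; a minimal worked example follows, with made-up WER values that are not figures from our experiments.

```python
# Relative word-error-rate (WER) reduction: (before - after) / before.
# The WER values below are made-up illustrations, not reported results.

def relative_error_reduction(wer_before: float, wer_after: float) -> float:
    return (wer_before - wer_after) / wer_before

# e.g. a podcast whose WER drops from 30% to 21% after acoustic
# model training shows a 30% relative error reduction:
print(relative_error_reduction(0.30, 0.21))  # 0.3, i.e. 30%
```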
We have inferred some motivations for users correcting errors, though we cannot directly ask since the users are anonymous. These motivations can be categorized as follows:

• Error correction itself is enjoyable and interesting
Since the error correction interface is carefully designed to be useful and efficient, using it would, especially for proficient users who master quick and accurate operations, be fun somewhat like the fun some people find in video games.

• Users want to contribute
Some users would often correct errors not only for their own convenience, but also to altruistically contribute to the improvement of speech recognition and retrieval.

• Users want their speech data to be correctly searched
The creators of speech data (like podcasters for podcasts) would correct recognition errors in their own speech data so that it can be searched more accurately.

• Users like the content and cannot tolerate the presence of recognition errors in it
Some fans of famous artists or TV personalities would correct errors because they like the speakers' voices and cannot tolerate the presence of recognition errors in their favorite content. We have indeed observed that such kinds of speech data generally receive more corrections than other kinds.

3. SONGLE: AN ACTIVE MUSIC LISTENING SERVICE IMPROVED BY USER CONTRIBUTIONS
In 2011 we launched a web service, called Songle [8], that allows web users to enjoy music by using active music listening interfaces [5], where active music listening is a way of listening to music through active interactions. In this context the word active does not mean that the listeners create new music but that they take control of their own listening experience. For example, an active music listening interface called SmartMusicKIOSK [4] has a chorus-search function that enables a user to directly access his or her favorite part of a song (and to skip other parts) while viewing a visual representation of its music structure. This facilitates deeper understanding, but up to now the general public has not had the chance to use such research-level interfaces and technologies in their daily lives.

Toward the goal of enriching music listening experiences, Songle uses automatic music-understanding technologies to estimate music scene descriptions (musical elements) [3] of musical pieces (audio files) available on the web. A Songle user can enjoy playing back a musical piece while seeing the visualization of the estimated descriptions. In our current implementation, four major types of descriptions are automatically estimated and visualized for content-based music browsing: music structure (chorus sections and repeated sections), hierarchical beat structure (musical beats and bar lines), melody line (fundamental frequency (F0) of the vocal melody), and chords (root note and chord type). Songle implements all the functions of the SmartMusicKIOSK interface and lets a user jump to and listen to the chorus by just pushing the next-chorus button. Songle thus makes it easier for a user to find desired parts of a piece.

Given the variety of musical pieces on the web, however, music scene descriptions are hard to estimate accurately. Because of the diversity of music genres and recording conditions and the complexity of sound mixtures, automatic music-understanding technologies cannot avoid making some errors.
As a result, the users of a web service using those technologies might be disappointed by its performance.

Our Songle web service therefore enables anonymous users to help improve its performance by correcting music-understanding errors. Each user can see the music-understanding visualizations on a web browser, where a moving cursor indicates the audio playback position. A user who finds an error while listening can easily correct it by selecting from a list of candidate alternatives or by providing an alternative description via an error correction interface. The resulting corrections are then shared and used to immediately improve the user experience with the corrected piece. We also plan to use such corrections to gradually improve music-understanding technologies through adaptive machine learning techniques so that the descriptions of other musical pieces can be estimated more accurately. This approach can be described as collaborative training for music-understanding technologies.

The alpha version of Songle was released to the public at http://songle.jp on October 20th, 2011. During the initial stage of the Songle launch we are focusing on popular songs with vocals. A user can register any song available on the web by providing the URL of its MP3 file, the URL of a web page including multiple MP3 URLs, or the URL of a music podcast (an RSS syndication feed including multiple MP3 URLs). In addition to contributing to the enrichment of music listening experiences, Songle will serve as a showcase in which everybody can experience music-understanding technologies and understand their nature: for example, what kinds of music or sound mixture are difficult for the technologies to handle.
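As an illustration of the registration input, the following sketch collects MP3 URLs from an RSS feed using only Python's standard library; the feed URL is a hypothetical placeholder, and the enclosure-based rule is a common RSS 2.0 convention, not Songle's documented parser.

```python
# Minimal sketch: collect MP3 URLs from a music podcast's RSS feed,
# the kind of input Songle accepts at registration time.
import urllib.request
import xml.etree.ElementTree as ET

def mp3_urls_from_feed(feed_url: str) -> list[str]:
    with urllib.request.urlopen(feed_url) as response:
        tree = ET.parse(response)
    urls = []
    # In RSS 2.0, each <item> may carry an
    # <enclosure url="..." type="audio/mpeg"> element.
    for enclosure in tree.iter("enclosure"):
        url = enclosure.get("url", "")
        if enclosure.get("type") == "audio/mpeg" or url.endswith(".mp3"):
            urls.append(url)
    return urls

# Example (hypothetical feed URL):
# print(mp3_urls_from_feed("http://example.com/music-podcast.rss"))
```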
3.1 Three Functions of Songle
Songle supports three main functions: retrieving, browsing, and annotating songs. The retrieval and browsing functions facilitate deeper understanding of music, and the annotation (error correction) function allows users to contribute to the improvement of music scene descriptions. The improved descriptions can lead to a better user experience of retrieving and browsing songs.

3.1.1 Retrieval Function
This function enables a user to retrieve a song by making a text search for the song title or artist name or by making a selection from a list of artists or a list of songs whose descriptions were recently estimated or corrected. This function also shows various kinds of rankings.

Following the idea of an active music listening interface called VocalFinder [2], which finds songs with similar vocal timbres, Songle provides a similarity graph of songs so that a user can retrieve a song according to vocal timbre similarity. The graph is a radially connected network in which nodes (songs) of similar vocal timbre are connected to the center node (a recommended or user-specified song). By traversing the graph while listening to nodes, a user can find a song having his or her favorite vocal timbre.

By selecting a song, the user switches over to the within-song browsing function.

3.1.2 Within-song Browsing Function
This function provides a content-based playback-control interface for within-song browsing, as shown in the upper half of Figure 2. The upper window is the global view showing the entire song, and the lower window is the local view magnifying the selected region. A user can view the following four types of music scene descriptions estimated automatically (a data-model sketch follows the list):

1. Music structure (chorus sections and repeated sections)
In the global view, the music map of the SmartMusicKIOSK interface [4] is shown below the playback controls, including the buttons, time display, and playback slider. The music map is a graphical representation of the entire song structure and consists of chorus sections (the top row) and repeated sections (the five lower rows). On each row, colored sections indicate similar (repeated) sections. Clicking directly on a colored section plays that section.

2. Hierarchical beat structure (musical beats and bar lines)
At the bottom of the local view, musical beats corresponding to quarter notes are visualized by using small triangles. Bar lines are marked by larger triangles.

3. Melody line (F0 of the vocal melody)
The piano roll representation of the melody line is shown above the beat structure in the local view. It is also shown in the lower half of the global view. For simplicity, the fundamental frequency (F0) can be visualized after being quantized to the closest semitone.

4. Chords (root note and chord type)
Chord names are written in text at the top of the local view. Twelve different colors are used to represent the twelve different root notes so that a user can notice the repetition of chord progressions.
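As a rough data-model reading of these four description types, the sketch below defines one possible representation, including the semitone quantization used for the piano-roll view; all class names and fields are our own illustrative assumptions, not Songle's internal or published format.

```python
# One possible data model for the four music scene description types,
# with F0-to-semitone quantization for the piano-roll view. All names
# and fields are illustrative assumptions, not Songle's actual format.
from dataclasses import dataclass
from math import log2

@dataclass
class Section:            # music structure
    start: float          # seconds
    end: float
    is_chorus: bool

@dataclass
class Beat:               # hierarchical beat structure
    time: float
    is_bar_line: bool     # drawn as a larger triangle

@dataclass
class MelodyPoint:        # melody line
    time: float
    f0_hz: float

@dataclass
class Chord:              # chords
    start: float
    end: float
    root: str             # one of the 12 root notes, e.g. "C", "F#"
    chord_type: str       # e.g. "maj", "min7"

def f0_to_semitone(f0_hz: float) -> int:
    """Quantize F0 to the closest MIDI semitone (A4 = 440 Hz = 69)."""
    return round(69 + 12 * log2(f0_hz / 440.0))

print(f0_to_semitone(261.63))  # 60, i.e. middle C
```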
3.1.3 Annotation (Error Correction) Function
This function allows users to add annotations to correct any estimation errors. Here, annotation means describing the contents of a song, either by modifying the estimated descriptions or by selecting the correct candidate if it is available. In the local view, a user can switch between editors for the four types of music scene descriptions.

1. Music structure (Figure 5(a))
The beginning and end points of every chorus or repeated section can be adjusted. It is also possible to add, move, or delete each section. This correction function improves the SmartMusicKIOSK experience.

2. Hierarchical beat structure (Figure 5(b))
Several alternative candidates for the beat structure can be selected at the bottom of the local view. If none of the candidates is appropriate, a user can enter the beat positions by tapping a key during music playback. Each beat position or bar line can also be changed directly. For fine adjustment it is possible to play the audio back with click tones at the beats.

3. Melody line (Figure 5(c))
Songle allows note-level correction on the piano roll representation of the melody line. Since the melody line is internally represented as the temporal trajectory of F0, more precise correction is also possible. More accurate melody annotations will lead to better similarity graphs of songs.

4. Chords (Figure 5(d))
Chord names can be corrected by choosing from candidates or by typing in chord names (see the parsing sketch at the end of this subsection). Each chord boundary can also be adjusted. Chords can be played back along with the original song to make it easier to check their correctness.

Note that users can simply enjoy active music listening without correcting errors. We understand that it is too difficult for some users to correct the above descriptions (especially chords). Designing an interface that makes it easier for them to make corrections will be another future challenge. Moreover, users are not expected to correct all errors, only some according to each user's interests.
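To make the chord-typing step concrete, a typed chord name can be split into a root note (which could select one of the twelve root-note colors) and a chord type; the parsing rule below is an illustrative assumption, not Songle's validator.

```python
# Minimal sketch: split a typed chord name into a root note and a
# chord type, e.g. to validate typed corrections. Illustrative only.
import re

def parse_chord(name: str) -> tuple[str, str]:
    """'F#m7' -> ('F#', 'm7'); a bare root like 'C' -> ('C', 'maj')."""
    m = re.match(r"^([A-G][#b]?)(.*)$", name)
    if not m:
        raise ValueError(f"not a chord name: {name!r}")
    return m.group(1), m.group(2) or "maj"

print(parse_chord("F#m7"))  # ('F#', 'm7')
print(parse_chord("C"))     # ('C', 'maj')
```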
Figure 5: Screen snapshots of Songle's annotation function for correcting music scene descriptions: (a) correcting music structure (chorus sections and repeated sections), (b) correcting hierarchical beat structure (musical beats and bar lines), (c) correcting melody line (F0 of the vocal melody), and (d) correcting chords (root note and chord type).

When the music-understanding results are corrected by users, the original values are visualized as trails with different colors (white, gray, or yellow marks in Figure 5) that can be distinguished by anybody. These trails are important to prevent overestimation of the automatic music-understanding performance after the user corrections. Moreover, all the correction histories are recorded, and descriptions before and after corrections can be compared.
4. DISCUSSION
We discuss how PodCastle and Songle could contribute to society and academic research.

4.1 Contributions of PodCastle and Songle
PodCastle and Songle make social contributions by providing public web services that let people retrieve speech data by using speech-recognition technologies and that let people enjoy active music listening interfaces with music-understanding technologies. They also promote the popularization and use of speech-recognition and music-understanding technologies by raising user awareness. Users can grasp the nature of those technologies just by seeing the results obtained when the technologies are applied to speech data and songs available on the web. We risk attracting criticism when there are many errors, but we believe that sharing these results with users will promote the popularization of this research field.

PodCastle and Songle make academic contributions by demonstrating a new research approach to speech recognition and music understanding based on signal processing; this approach aims to improve the speech-recognition and music-understanding performances as well as the usage rates while benefiting from the cooperation of anonymous end users. This approach is designed to set into motion a positive spiral where (1) we enable users to experience a service based on speech recognition or music understanding to let them better understand its performance, (2) users contribute to improving the performance, and (3) the improved performance leads to a better user experience, which encourages further use of the service at step (1) of this spiral. This is a social correction framework, where users can improve the performance by sharing their correction results over a web service. The game-based approach of Human Computation or GWAPs (games with a purpose) [22], like the ESP Game [23], often lacks step (3) and depends on the feeling of fun. In our framework, users gain a real sense of contributing for their own benefit and that of others and can be further motivated to contribute by seeing corrections made by other users. In this way, we can use the wisdom of the crowd, or crowdsourcing, to achieve a better user experience.
Another important technical contribution is that PodCastle and Songle let us investigate how much the performance of speech-recognition and music-understanding technologies can be improved by getting errors corrected through the cooperative efforts of users. Although we have already implemented a machine-learning mechanism that improves the performance of the speech-recognition technology on the basis of user corrections on PodCastle, we have not yet implemented such a mechanism for the music-understanding technology on Songle because it has only recently been launched. When we have collected enough corrections, we could implement such a mechanism on Songle as well. This study thus provides a framework for amplifying user contributions. In a typical Web 2.0 service like Wikipedia, improvements are limited to the items directly contributed (edited) by users. In PodCastle, improvements in speech recognition performance automatically spread to items not contributed by users. In Songle, improvements will also spread to other songs once we implement the improvement mechanism. This is a novel technology for amplifying user contributions, which could go beyond Web 2.0 and Human Computation [22]. We hope that this study will show the importance and potential of incorporating and amplifying user contributions and that various other projects [10, 19] following this approach will be undertaken, thus adding a new dimension to this field of research.

One Web 2.0 principle is to trust users, and we think users can also be trusted with respect to the quality of their corrections. In fact, as far as we have assessed the quality, the correction results obtained so far have been of high quality. One of the reasons would be that PodCastle and Songle avoid relying on monetary rewards as Amazon Mechanical Turk does. Even if some users deliberately make inappropriate corrections (the vandalism problem), we will be able to develop countermeasures that evaluate the reliability of corrections acoustically. For example, we could validate whether the corrected descriptions are supported by acoustic phenomena. This will be another interesting research topic.
4.2 PodCastle and Songle as a Research Platform
We hope to extend PodCastle and Songle to serve as a research platform where other researchers can also exhibit the results of their own speech-recognition and music-understanding technologies. Since even in our current implementations of PodCastle and Songle a module of each technology can be executed anywhere in the world, its source and binary code need not be shared. A module can simply connect to our web server to receive an audio file and send back speech-recognition or music-understanding results via HTTP. The results should always be shown with clear acknowledgments/credits so that users can distinguish the sources.

This platform is especially useful for supporting various languages in PodCastle. In fact, the English version of PodCastle was implemented on this platform, and the CSTR speech recognizer for English is executed at CSTR, University of Edinburgh.
5. CONCLUSION
We have described PodCastle, a spoken document retrieval service that provides a search engine for web speech data and is based on the wisdom of the crowd (crowdsourcing), and Songle, an active music listening service that is continually improved by anonymous user contributions. In our current implementations, full-text transcriptions of speech data and four types of music scene descriptions are recognized, estimated, and displayed through web-based interactive user interfaces. Since automatic speech-recognition and music-understanding technologies are not perfect, PodCastle and Songle allow users to make error corrections that are shared with other users, thus creating a positive spiral and giving users an incentive to keep making corrections. This platform will act both as a test-bed or showcase for new technologies and as a way of collecting valuable annotations.

Acknowledgments: We thank Youhei Sawada, Shunichi Arai, Kouichirou Eto, and Ryutaro Kamitsu for their web service implementation of PodCastle, Utah Kawasaki for the web service implementation of Songle, and Minoru Sakurai for the web design of PodCastle and Songle. We also thank the anonymous users of PodCastle and Songle for correcting errors. This work was supported in part by CREST, JST.

6. REFERENCES
[1] M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-based music information retrieval: Current directions and future challenges. Proceedings of the IEEE, 96(4):668–696, 2008.
[2] H. Fujihara, M. Goto, T. Kitahara, and H. G. Okuno. A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity-based music information retrieval. IEEE Trans. on ASLP, 18(3):638–648, 2010.
[3] M. Goto. A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, 43(4):311–329, 2004.
[4] M. Goto. A chorus-section detection method for musical audio signals and its application to a music listening station. IEEE Trans. on ASLP, 14(5):1783–1794, 2006.
[5] M. Goto. Active music listening interfaces based on signal processing. In Proc. of ICASSP 2007, 2007.
[6] M. Goto and J. Ogata. PodCastle: Recent advances of a spoken document retrieval service improved by anonymous user contributions. In Proc. of Interspeech 2011, 2011.
[7] M. Goto, J. Ogata, and K. Eto. PodCastle: A Web 2.0 approach to speech recognition research. In Proc. of Interspeech 2007, 2007.
[8] M. Goto, K. Yoshii, H. Fujihara, M. Mauch, and T. Nakano. Songle: A web service for active music listening improved by user contributions. In Proc. of ISMIR 2011, pages 311–316, 2011.
[9] L. Lee and B. Chen. Spoken document understanding and organization. IEEE Signal Processing Magazine, 22(5):42–60, 2005.
[10] S. Luz, M. Masoodian, and B. Rogers. Supporting collaborative transcription of recorded speech with a 3D game interface. In Proc. of KES 2010, 2010.
[11] L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: Word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373–400, 2000.
[12] Cambridge Multimedia Document Retrieval Project. http://mi.eng.cam.ac.uk/research/projects/mdr/.
[13] CMU Informedia Digital Video Library Project. http://www.informedia.cs.cmu.edu/.
[14] J. Ogata and M. Goto. Speech Repair: Quick error correction just by using selection operation for speech input interfaces. In Proc. of Eurospeech 2005, pages 133–136, 2005.
[15] J. Ogata and M. Goto. PodCastle: Collaborative training of acoustic models on the basis of wisdom of crowds for podcast transcription. In Proc. of Interspeech 2009, pages 1491–1494, 2009.
[16] J. Ogata, M. Goto, and K. Eto. Automatic transcription for a Web 2.0 service to search podcasts. In Proc. of Interspeech 2007, 2007.
[17] Podscope. http://www.podscope.com/.
[18] PodZinger. http://www.podzinger.com/.
[19] N. Ramzan, M. Larson, F. Dufaux, and K. Cluver. The participation payoff: Challenges and opportunities for multimedia access in networked communities. In Proc. of ACM MIR 2010, 2010.
[20] J.-M. V. Thong, P. J. Moreno, B. Logan, B. Fidler, K. Maffey, and M. Moores. SpeechBot: An experimental speech-based search engine for multimedia content on the web. IEEE Trans. on Multimedia, 4(1):88–96, 2002.
[21] V. Turunen, M. Kurimo, and I. Ekman. Speech transcription and spoken document retrieval in Finnish. Machine Learning for Multimodal Interaction, 3361:253–262, 2005.
[22] L. von Ahn. Games with a purpose. IEEE Computer Magazine, 39(6):92–94, June 2006.
[23] L. von Ahn and L. Dabbish. Labeling images with a computer game. In Proc. of CHI 2004, pages 319–326, 2004.
[24] S. Whittaker, J. Hirschberg, J. Choi, D. Hindle, F. Pereira, and A. Singhal. SCAN: Designing and evaluating user interfaces to support retrieval from speech archives. In Proc. of ACM SIGIR 99, pages 26–33, 1999.