<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>and TayloRVC: An Exploratory Analysis of Musical Deepfakes and Hosting Platforms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Feffer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zachary C. Lipton</string-name>
          <email>zlipton@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Donahue</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">US</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HCMIR23: 2nd Workshop on Human-Centric Music Information Research</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent advancements in voice conversion and text-to-speech technology have facilitated the creation of musical deepfakes, audio tracks featuring the voices of celebrity artists, typically without the artists' involvement. Several deepfakes have already gone viral, leaving the music industry scrambling to sort out the potential impacts. While the media have primarily focused on specific high-profile incidents, journalists and researchers have paid less attention to the broader trends in musical deepfakes, including the communities creating them, the modeling techniques that they employ, and the sites on which they congregate. In this paper, we investigate two leading sources of musical deepfake models, the AI Hub Discord server and the Uberduck website, which are dedicated to the training, utilization, and distribution of these deepfakes. Interestingly, musical deepfakes target hundreds of artists of different backgrounds, levels of success, and musical styles. In light of the economic, legal, and ethical issues raised by deepfakes of so many artists, we provide warnings about the generation of discriminatory forms of content and potential financial and contractual problems for artists. We recommend that more research be conducted in this area, especially to probe people's perceptions of this technology and devise approaches that mitigate potential harms.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfake</kwd>
        <kwd>GAN synthesis</kwd>
        <kwd>Diffusion models</kwd>
        <kwd>Artist identity and representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>https://mfeffer.github.io (M. Feffer); https://www.zacharylipton.com (Z. C. Lipton); https://chrisdonahue.com (C. Donahue)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>of musical deepfakes is modeling the narrower distribution of individual singing voices.
Compared to modeling broad music audio [7, 8, 9, 10], which requires new ML methods,
thousands of hours of training data, and specialized hardware, modeling singing voice is possible
with off-the-shelf methods, minutes of training data, and commodity hardware. To turn
generated vocals into a complete song, model outputs are combined with manually created musical
elements (e.g., mixing deepfaked rap with a human-composed beat).</p>
      <p>
        There are two broad categories of approaches for singing voice synthesis (SVS): (1) voice
conversion (VC) and (2) text-to-speech (TTS), differentiated primarily by the form of user
control they offer: VC is controlled by singing, and TTS is controlled by lyrics
represented as text. Both categories involve training models that estimate singing audio
from intermediary features: VC-based models use intermediaries that can be readily extracted
from input singing, such as fundamental frequency [11, 12] or representations from pre-trained
encoders [13], while TTS-based models use lyrics (and sometimes melody notes). In both cases,
intermediary features simplify the modeling problem (thereby decreasing compute and
data requirements) and afford an essential form of control for musical deepfakes: the ability to
specify lyrics (either by singing or by writing text). Popular SVS systems are complex pipelines
that compose several modules for feature extraction [14, 15, 13] and resynthesis [16, 17, 18].
Despite the underlying complexity, training and using models is made more broadly accessible
by the distribution of easy-to-use open source tools<sup>1</sup> and video tutorials.<sup>2</sup>
      </p>
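      <p>To make the notion of an intermediary feature concrete, the following is a minimal sketch of fundamental frequency estimation. It uses plain autocorrelation, not the estimators cited above [11, 12], and the function name, parameters, and test tone are illustrative assumptions rather than part of any SVS tool.</p>
      <preformat>
```python
import math


def estimate_f0(samples, sample_rate, f_min=80.0, f_max=1000.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak.

    Illustrative only: real voice conversion pipelines use more robust
    pitch trackers and operate frame by frame.
    """
    lag_lo = int(sample_rate / f_max)  # shortest period considered
    lag_hi = int(sample_rate / f_min)  # longest period considered
    best_lag, best_score = lag_lo, float("-inf")
    for lag in range(lag_lo, lag_hi + 1):
        # Correlate the signal with a copy of itself delayed by `lag` samples.
        score = sum(a * b for a, b in zip(samples, samples[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Hypothetical input: a pure 200 Hz tone standing in for one sung note.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(2048)]
print(estimate_f0(tone, SAMPLE_RATE))
```
      </preformat>
      <p>In a VC-style pipeline, a pitch contour of this kind would be extracted from the input singing and passed, together with content features, to a resynthesis module trained on the target voice.</p>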
      <p>Even with such accessible resources, technical and musical expertise are still required to
train and co-create with singing voice models. Hence, making convincing musical deepfakes
is, for the moment, primarily accessible to musical “prosumers” (e.g., “bedroom producers”
already familiar with technical music production tools). Additionally, considerable artistic
effort (composing and performing lyrics and producing backing tracks) is also required.
Despite these impediments, music streaming services are already being flooded with musical
deepfakes [19, 20, 21]. Such deepfakes have also gone viral on social media [22], prompting
everyone from listeners to musicians and record labels to seriously consider the issues these
capabilities raise [23]. Moreover, musical deepfakes may become even easier to create in the
future: the recent and rapid advancement in broad music audio generation methods suggests
that it may eventually be possible for anyone to generate convincing musical deepfakes without
technical or musical expertise.</p>
      <p>Analysis of initial trends in musical deepfaking, such as examining which types of artists
have been targeted, can help navigate these dilemmas or better prepare for future developments.
Surprisingly, except for examples that have gone viral, we find little coverage in that regard.
To this end, we explored AI Hub, a Discord community at the center of musical deepfake
creation [24], and scraped the website Uberduck.ai<sup>3</sup> (referred to as “Uberduck” going forward)
to gather information on current deepfake models. While AI Hub is a community effort
driven by prosumers sharing models, Uberduck is backed by a corporation and hosts models
that require comparatively less technical background. Our results suggest that hundreds of
musicians of diverse backgrounds have stakes in these issues. Based on our analysis, we also
offer recommendations for the future, including but not limited to research into the perceptions of
listeners and members of the music industry, as per Lee et al. [25].</p>
      <p><sup>1</sup> SoVITS and RVC are popular tools for VC: https://github.com/voicepaw/so-vits-svc-fork, https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/tree/main.</p>
      <p><sup>2</sup> Example video tutorial: https://www.youtube.com/watch?v=tZn0lcGO5OQ</p>
      <p><sup>3</sup> https://uberduck.ai/</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>We gathered data from AI Hub by recording the title, post date, and tags of all posts in the
voice models channel on May 31st, 2023. This means we gathered all posts ever made in the channel
from the Discord’s inception to May 31st. For Uberduck, we similarly downloaded details
of all voice models available on the site as of May 31st in the form of JSON metadata. For each
data source, we first manually labeled entries with relevant artist info, including the artist’s
name, race,<sup>4</sup> and whether the artist is deceased. We then used APIs for MusicBrainz [26] and
Spotify to gather additional data about each artist’s gender,<sup>5</sup> music genres, geographical region,
and popularity on a scale from 0 to 100 (with 100 being most popular).<sup>6</sup></p>
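      <p>The two-step labeling described above can be pictured as a join between manually labeled entries and per-artist metadata. The sketch below is a minimal illustration under stated assumptions: the field names, the placeholder artists, and the shape of the metadata dictionary are hypothetical, and no MusicBrainz or Spotify calls are shown.</p>
      <preformat>
```python
def merge_artist_records(manual_labels, api_metadata):
    """Join manually labeled entries with API-derived metadata by artist name."""
    merged = []
    for label in manual_labels:
        extra = api_metadata.get(label["artist"], {})
        record = dict(label)
        # Fields missing from the metadata stay None so that gaps remain visible.
        for field in ("gender", "genres", "region", "popularity"):
            record[field] = extra.get(field)
        merged.append(record)
    return merged


# Hypothetical toy inputs; none of these values come from the paper's data.
manual_labels = [
    {"artist": "Artist A", "race": "label-1", "deceased": False},
    {"artist": "Artist B", "race": "label-2", "deceased": True},
]
api_metadata = {
    "Artist A": {
        "gender": "female",
        "genres": ["pop"],
        "region": "Europe",
        "popularity": 87,
    },
}

records = merge_artist_records(manual_labels, api_metadata)
```
      </preformat>
      <p>Keeping unmatched fields as empty values, rather than dropping the entry, makes it easy to see which artists still need manual lookup.</p>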
    </sec>
    <sec id="sec-3">
      <title>3. Analysis</title>
      <p>Overall, we found that nearly 400 artists were represented in AI Hub models, and over 50 were
represented in Uberduck models. Additionally, for the first four weeks of May 2023, over 100
model posts were made per week in AI Hub. Based on retrieved metadata, users made different
models to utilize differing training approaches (e.g., SoVITS versus RVC) or capture artists
at different points in their careers (e.g., early versus contemporary Britney Spears). Table 1a
displays the ten most popular artists from AI Hub and Uberduck in terms of how many models
were made using their data, and Table 1b displays results of a random sample of models from
each source. Evidently, the most popular artists in AI Hub typically have more related models
than those in Uberduck. However, both lists highlight artists from a wide range of musical
styles and backgrounds. In particular, Juice WRLD appears next to Jungkook of BTS in one list,
and Kanye West and David Bowie appear in another.</p>
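      <p>The summaries in this paragraph (models per artist, posts per week, and a random sample of models) are simple aggregations over labeled entries. A minimal sketch follows, assuming each entry is a dictionary with hypothetical "artist" and "post_date" fields; the toy entries are placeholders, not the collected data.</p>
      <preformat>
```python
import random
from collections import Counter
from datetime import date


def top_artists(entries, k=10):
    """Return the k artists with the most models as (artist, count) pairs."""
    counts = Counter(e["artist"] for e in entries)
    return counts.most_common(k)


def posts_per_week(entries):
    """Count entries per (ISO year, ISO week) of their post date."""
    weeks = Counter()
    for e in entries:
        iso = e["post_date"].isocalendar()
        weeks[(iso[0], iso[1])] += 1
    return dict(sorted(weeks.items()))


def sample_artists(entries, n, seed=0):
    """Draw n entries without replacement; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    return [e["artist"] for e in rng.sample(entries, n)]


# Hypothetical toy entries.
ENTRIES = [
    {"artist": "Artist A", "post_date": date(2023, 5, 1)},
    {"artist": "Artist A", "post_date": date(2023, 5, 3)},
    {"artist": "Artist B", "post_date": date(2023, 5, 9)},
    {"artist": "Artist A", "post_date": date(2023, 5, 10)},
    {"artist": "Artist C", "post_date": date(2023, 5, 10)},
]

print(top_artists(ENTRIES, k=2))
print(posts_per_week(ENTRIES))
print(sample_artists(ENTRIES, n=2))
```
      </preformat>
      <p>Counting by artist name presumes that names were normalized during manual labeling, so that variant spellings of the same artist fall into one bucket.</p>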
      <p>The diversity of both data sources is also quantitatively illustrated in Figures 1 and 2. Namely,
Figure 1 shows the distribution of artists with deepfake models in each source grouped by race,
and Figure 2 does the same in each source grouped by gender, popularity score, and region. We
find that AI Hub has greater diversity across each criterion, featuring a bimodal popularity score
distribution and many artists from Europe and Asia. However, very popular Black American
male artists are most represented in each data source. The dominance of rap and hip hop styles
in Figure 3, which shows the top 10 most popular artists’ genres in each source, also supports this.</p>
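      <p>Grouped distributions of the kind shown in the figures can be computed with a few lines of counting. The sketch below is illustrative only: the attribute names, the bin width of 10, and the toy artists are assumptions, not the binning or data behind the figures.</p>
      <preformat>
```python
from collections import Counter


def distribution(artists, key):
    """Share of artists per category for one attribute (e.g. "region")."""
    counts = Counter(a[key] for a in artists)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def popularity_histogram(artists, bin_width=10):
    """Count artists per popularity bin; scores run from 0 to 100."""
    bins = Counter()
    for a in artists:
        # Clamp a score of exactly 100 into the top bin.
        start = min(a["popularity"] // bin_width * bin_width, 100 - bin_width)
        bins[start] += 1
    return dict(sorted(bins.items()))


# Hypothetical toy artists.
ARTISTS = [
    {"region": "Europe", "popularity": 12},
    {"region": "Asia", "popularity": 15},
    {"region": "North America", "popularity": 88},
    {"region": "North America", "popularity": 100},
    {"region": "Europe", "popularity": 91},
]

print(distribution(ARTISTS, "region"))
print(popularity_histogram(ARTISTS))
```
      </preformat>
      <p>A bimodal popularity distribution would appear in such a histogram as two well-separated clusters of occupied bins, as in the toy data, where scores gather near the low and high ends of the scale.</p>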
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>While our work sheds light on the deepfake models that currently exist, we emphasize that it is, at
best, a preliminary investigation. For instance, we only focus on the number of models pertaining
to each artist, but the number of songs generated would also be valuable information.<sup>7</sup></p>
      <p>Even so, our findings suggest some concerning possibilities. First, the usage of models
imitating East Asian and Black artists by creators who do not share those demographics could be
considered digital forms of yellowface and blackface, respectively [27, 5]. Similarly, voice and
text-to-speech models imitating deceased artists broach ethical and normative questions regarding
whether impersonation of the dead is appropriate. Deepfake models may also exacerbate issues
of music ownership, as musicians already have a tenuous grasp on music ownership (see, e.g.,
[28, 29]). Therefore, we recommend that more research be done in this area.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>(a) The ten artists with the most models in each source, with the number of models in parentheses. (b) Artists from a random sample of models from each source.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th colspan="2">(a) Artist (no. of models)</th>
              <th colspan="2">(b) Artist from Random Sample</th>
            </tr>
            <tr>
              <th>AI Hub</th>
              <th>Uberduck</th>
              <th>AI Hub</th>
              <th>Uberduck</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Michael Jackson (10)</td><td>Eminem (5)</td><td>Duki</td><td>Damon Albarn</td></tr>
            <tr><td>Juice WRLD (7)</td><td>Playboi Carti (3)</td><td>Noa Kirel</td><td>Noel Gallagher</td></tr>
            <tr><td>Playboi Carti (6)</td><td>Juice WRLD (3)</td><td>Trent Reznor</td><td>Nicki Minaj</td></tr>
            <tr><td>Eminem (5)</td><td>B La B (2)</td><td>Weird Al</td><td>Kanye West</td></tr>
            <tr><td>Notti Osama (5)</td><td>E-40 (2)</td><td>Kendrick Lamar</td><td>NLE Choppa</td></tr>
            <tr><td>Ariana Grande (4)</td><td>Freddie Mercury (2)</td><td>Winter</td><td>Lil Uzi Vert</td></tr>
            <tr><td>Britney Spears (4)</td><td>Lady Gaga (2)</td><td>Killy</td><td>Liam Gallagher</td></tr>
            <tr><td>Irene (4)</td><td>Lil Uzi Vert (2)</td><td>Jhene Aiko</td><td>Andy Bell</td></tr>
            <tr><td>Jungkook (4)</td><td>XXXTentacion (2)</td><td>Ice Spice</td><td>MC Ride</td></tr>
            <tr><td>Kanye West (4)</td><td>21 Savage (1)</td><td>Lil Tjay</td><td>David Bowie</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>4</sup> Determined via sources ranging from physical appearance to heritage. We are aware of the limitation that race is a social construct. Our aim is to illustrate the diversity of impacted artists.</p>
      <p><sup>5</sup> This was largely provided by MusicBrainz but was occasionally inferred from photos and articles in a manner similar to that employed for race. As such, we also emphasize awareness of gender as a social construct and again stress that our aim is to showcase the range of those affected.</p>
      <p><sup>6</sup> Resulting data available here: https://docs.google.com/spreadsheets/d/1tZa9YsTiFIYCF-gIndquFFnMNV_50TD81EFf4Z95ajE/edit?usp=sharing</p>
      <p><sup>7</sup> We briefly studied other parts of AI Hub and observed originals and covers channels where users shared original tracks or covers of existing songs made with deepfakes, respectively, and each channel appeared to have hundreds of messages exchanged in a given day. Further analysis of these channels could address some of these limitations.</p>
      <ref-list>
        <ref id="ref16">
          <mixed-citation>[16] X. Wang, J. Yamagishi, Using cyclic noise as the source signal for neural source-filter-based speech waveform model, in: Interspeech 2020, ISCA, 2020, pp. 1992–1996. URL: https://www.isca-speech.org/archive/interspeech_2020/wang20u_interspeech.html. doi:10.21437/Interspeech.2020-1018.</mixed-citation>
        </ref>
        <ref id="ref17">
          <mixed-citation>[17] J. Kong, J. Kim, J. Bae, Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17022–17033. URL: https://proceedings.neurips.cc/paper/2020/hash/c5d736809766d46260d816d8dbc9eb44-Abstract.html.</mixed-citation>
        </ref>
        <ref id="ref18">
          <mixed-citation>[18] J. Liu, C. Li, Y. Ren, F. Chen, Z. Zhao, Diffsinger: Singing voice synthesis via shallow diffusion mechanism, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 11020–11028. doi:10.1609/aaai.v36i10.21350.</mixed-citation>
        </ref>
        <ref id="ref19">
          <mixed-citation>[19] M. Savage, Deezer: Streaming service to detect and delete ‘deepfake’ ai songs, BBC News (2023). URL: https://www.bbc.com/news/entertainment-arts-65792580.</mixed-citation>
        </ref>
        <ref id="ref20">
          <mixed-citation>[20] A. Johnson, Spotify removes ‘tens of thousands’ of ai-generated songs: Here’s why, Forbes (2023). URL: https://www.forbes.com/sites/ariannajohnson/2023/05/09/spotify-removes-tens-of-thousands-of-ai-generated-songs-heres-why/?sh=601d69624f4a.</mixed-citation>
        </ref>
        <ref id="ref21">
          <mixed-citation>[21] A. Hoover, Spotify has an ai music problem—but bots love it, Wired (2023). URL: https://www.wired.com/story/spotify-ai-music-robot-listeners/.</mixed-citation>
        </ref>
        <ref id="ref22">
          <mixed-citation>[22] J. Coscarelli, An a.i. hit of fake ‘drake’ and ‘the weeknd’ rattles the music world, The New York Times (2023). URL: https://www.nytimes.com/2023/04/19/arts/music/ai-drake-the-weeknd-fake.html.</mixed-citation>
        </ref>
        <ref id="ref23">
          <mixed-citation>[23] E. Livni, L. Hirsch, S. Kessler, Who owns a song created by a.i.?, The New York Times (2023). URL: https://www.nytimes.com/2023/04/15/business/dealbook/artificial-intelligence-copyright.html.</mixed-citation>
        </ref>
        <ref id="ref24">
          <mixed-citation>[24] C. Xiang, Inside the discord where thousands of rogue producers are making ai music, 2023. URL: https://www.vice.com/en/article/y3wdj7/inside-the-discord-where-thousands-of-rogue-producers-are-making-ai-music.</mixed-citation>
        </ref>
        <ref id="ref25">
          <mixed-citation>[25] K. Lee, G. Hitt, E. Terada, J. H. Lee, Ethics of singing voice synthesis: Perceptions of users and developers, in: Proc. International Society for Music Information Retrieval Conference, 2022, pp. 733–740.</mixed-citation>
        </ref>
        <ref id="ref26">
          <mixed-citation>[26] A. Swartz, Musicbrainz: A semantic web service, IEEE Intelligent Systems 17 (2002) 76–77.</mixed-citation>
        </ref>
        <ref id="ref27">
          <mixed-citation>[27] A. Matamoros-Fernández, A. Rodriguez, P. Wikström, Humor that harms? examining racist audio-visual memetic media on tiktok during covid-19, Media and Communication 10 (2022) 180–191.</mixed-citation>
        </ref>
        <ref id="ref28">
          <mixed-citation>[28] Reuters, Pop star kesha releases first single after label dispute, Reuters (2016). URL: https://www.reuters.com/article/us-music-kesha-idUSKCN0XQ296.</mixed-citation>
        </ref>
        <ref id="ref29">
          <mixed-citation>[29] R. Brunner, Why is taylor swift re-recording her old albums?, 2021. URL: https://time.com/5949979/why-taylor-swift-is-rerecording-old-albums/.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Verdoliva</surname>
          </string-name>
          ,
          <article-title>Media forensics and deepfakes: An overview</article-title>
          ,
          <source>IEEE Journal of Selected Topics in Signal Processing</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>910</fpage>
          -
          <lpage>932</lpage>
          .
          doi:10.1109/JSTSP.2020.3002101.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Albahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Almalki</surname>
          </string-name>
          ,
          <source>Journal of Theoretical and Applied Information Technology</source>
          <volume>97</volume>
          (
          <year>2019</year>
          )
          <fpage>3242</fpage>
          -
          <lpage>3250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mirsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>The creation and detection of deepfakes: A survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>54</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          .
          doi:10.1145/3425780.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tracy</surname>
          </string-name>
          ,
          <article-title>A 'virtual rapper' was fired. questions about art and tech remain.</article-title>
          ,
          <source>The New York Times</source>
          (
          <year>2022</year>
          ). URL: https://www.nytimes.com/2022/09/06/arts/music/fn-meka-virtual-ai-rap.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sobande</surname>
          </string-name>
          ,
          <article-title>Spectacularized and branded digital (re)presentations of black people and blackness</article-title>
          ,
          <source>Television &amp; New Media</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>131</fpage>
          -
          <lpage>146</lpage>
          .
          doi:10.1177/1527476420983745.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Bonifacic</surname>
          </string-name>
          ,
          <article-title>“lost tapes of the 27 club” used google ai to “write” a new nirvana song</article-title>
          ,
          <year>2021</year>
          . URL: https://www.engadget.com/over-the-bridge-lost-tapes-of-the-27-club-223000315.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Jukebox: A generative model for music</article-title>
          ,
          <source>arXiv:2005.00341</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agostinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. I.</given-names>
            <surname>Denk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Borsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tagliasacchi</surname>
          </string-name>
          , et al.,
          <article-title>MusicLM: Generating music from text</article-title>
          ,
          <source>arXiv:2301.11325</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manilow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agostinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pietquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zeghidour</surname>
          </string-name>
          , et al.,
          <article-title>SingSong: Generating musical accompaniments from singing</article-title>
          ,
          <source>arXiv:2301.12662</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Copet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kreuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Remez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Adi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Défossez</surname>
          </string-name>
          ,
          <article-title>Simple and controllable music generation</article-title>
          ,
          <source>arXiv:2306.05284</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Morise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kawahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Katayose</surname>
          </string-name>
          ,
          <article-title>Fast and reliable f0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech</article-title>
          , in: Audio Engineering Society Conference: 35th International Conference: Audio for Games,
          <year>2009</year>
          . URL: http://www.aes.org/e-lib/browse.cfm?elib=15165.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Salamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bello</surname>
          </string-name>
          ,
          <article-title>Crepe: A convolutional representation for pitch estimation</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.-N.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bolte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H. H.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <article-title>Hubert: Self-supervised speech representation learning by masked prediction of hidden units</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>29</volume>
          (
          <year>2021</year>
          )
          <fpage>3451</fpage>
          -
          <lpage>3460</lpage>
          .
          doi:10.1109/TASLP.2021.3122291.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>van Niekerk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Carbonneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zaïdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Seuté</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kamper</surname>
          </string-name>
          ,
          <article-title>A comparison of discrete and soft speech units for improved voice conversion</article-title>
          ,
          <source>in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>6562</fpage>
          -
          <lpage>6566</lpage>
          . doi:10.1109/ICASSP43922.2022.9746484.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-I.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasegawa-Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Contentvec: An improved self-supervised speech representation by disentangling speakers</article-title>
          ,
          <source>in: Proceedings of the 39th International Conference on Machine Learning, PMLR</source>
          ,
          <year>2022</year>
          , pp. 18003–18017. URL: https://proceedings.mlr.press/v162/qian22b.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>