<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Inferring the meaning of chord sequences via lyrics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tom O'Hara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Computer Science Department, Texas State University-San Marcos</institution>
          ,
          <addr-line>TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>G</institution>
          ,
          <addr-line>Am, Bm, C, D, Em, F m, F</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper discusses how meanings associated with chord sequences can be inferred from word associations based on lyrics. The approach works by analyzing in-line chord annotations of lyrics to maintain co-occurrence statistics for chords and lyrics. This is analogous to the way parallel corpora are analyzed in order to infer translation lexicons. The result can benefit musical discovery systems by modeling how the chord structure complements the lyrics.</p>
      </abstract>
      <kwd-group kwd-group-type="general-terms">
        <kwd>Experimentation</kwd>
      </kwd-group>
      <kwd-group>
        <kwd>Music information retrieval</kwd>
        <kwd>Natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Experimentation</title>
      <p>Music information retrieval, Natural language processing</p>
      <sec id="sec-1-1">
        <title>INTRODUCTION</title>
        <p>A key task for music recommendation systems is to
determine whether an arbitrary song might match the mood
of the listener. An approach commonly used is for a system
to learn a classification model based on tagged data (i.e.,
supervised classification). For example, training data might
be prepared by collecting a large variety of songs and then
asking users to assign one or more mood categories to each
song. Based on these annotations, a model can be
developed to assign the most likely mood type for a song, given
features derived from the audio and lyrics.</p>
        <p>Such an approach works well for capturing the mood or
other meaning aspects of entire songs, but it is less suitable
for capturing similar aspects for segments of songs. The
main problem is that human annotations are generally only
done for entire songs. However, for complex songs this might
lead to improper associations being learned (e.g., a sad
introduction being tagged upbeat in a song that is otherwise
upbeat). Although segments could be annotated as well, doing
so would not be feasible: there would simply be too many
segments to annotate. Furthermore,
as the segments get smaller, the annotations would become
more subjective (i.e., less consistent). However, by using
lyrics in place of tagged data, learning could indeed be done
at the song segment level.</p>
        <p>
          Parallel text corpora were developed primarily to serve
multilingual populations but have proved invaluable for
inducing lexicons for machine translation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Similarly, a type
of resource intended for musicians can be exploited to
associate meaning with music. Guitarists learning new songs
often rely upon tablature notation (“tabs”) provided by others
to show the finger placement for a song measure by measure.
Tabs often include lyrics, enabling note sequences to be
associated with words. They also might indicate chords as an
aid to learning the sequence (as is often done in scores for
folk songs). In some cases, the chord annotations for lyrics
are sufficient for playing certain songs, such as those with
accompaniment provided primarily by guitar strumming.
        </p>
        <p>There are several web sites with large collections of tabs
and chord annotations for songs (e.g., about 250,000 via
www.chordie.com). These build upon earlier Usenet-based
guitar forums (e.g., alt.guitar.tab). Such repositories
provide a practical means to implement unsupervised learning
of the meaning of chord sequences from lyrics. As these
resources are willingly maintained by thousands of guitarists
and other musicians, a system based on them can be readily
kept current. This paper discusses how such resources can
be utilized for associating meaning with chords.
</p>
      </sec>
      <sec id="sec-1-2">
        <title>BACKGROUND</title>
        <p>
          There has been a variety of work in music information
retrieval on learning the meaning of music. Most approaches
have used supervised classification in which user tags serve
as ground truth for machine learning algorithms. A few
have inferred the labels based on existing resources. The
approaches differ mainly on the types of features used.
Whitman and Ellis [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] combine audio features based on signal
processing with features based on significant terms extracted
from reviews of the album in question, yielding an unsupervised
approach that relies only upon metadata about songs (e.g.,
author and title). Turnbull et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] use similar types of audio
features, but they incorporate tagged data describing the
song in terms of genre, instrumentality, mood, and other
attributes. Hu et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] combine word-level lyrics and audio
features, using tags derived from social media, filtered based
on degree of affect, and then revised by humans (i.e., partly
supervised). McKay et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] combine class-level lyric
features (e.g., part of speech frequencies and readability level)
with ones extracted from user tags from social media
(specifically Last.fm, see http://www.last.fm) as well as with features derived from general
term co-occurrence via web searches for the task of genre
classification.
        </p>
        <p>
          Parallel corpora are vital for machine translation. Fung
and Church [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] induce translation lexicons by tabulating
co-occurrence statistics over fixed-size blocks, from which
contingency tables are produced to derive mutual information
statistics. Melamed [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] improves upon similar approaches
by using a heuristic to avoid redundant links.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>PROCESS</title>
        <p>The overall task of processing is as follows: starting with a
large collection of lyrics with chord annotations, infer
meaning category labels for the chord sequences that occur, based
on word associations for the chord sequences. Several steps
are required: some make the lyrics more tractable for
processing, and an optional step applies a lyrics classifier
to refine the main induction step. The latter allows meaning
to be expressed in terms of high-level mood categories rather
than just words.</p>
        <fig id="fig1">
          <label>Figure 1</label>
          <caption>
            <p>Steps for inferring the meaning of chord sequences</p>
          </caption>
          <preformat>1. Obtain large collection of lyrics with chord annotations
2. Extract lyrics proper with annotations from dataset
3. Optional: Map lyrics from words to meaning categories
   (a) Get tagged data on meaning categories for lyrics
   (b) Preprocess lyrics and untagged chord annotations
   (c) Train to categorize over words and hypernyms
   (d) Classify each lyric line from chord annotations
4. Fill contingency table with chord(s)/token associations
5. Determine significant chord(s)/token associations.</preformat>
        </fig>
        <p>
          Figure 1 lists the steps involved. First the Internet is
searched to find and download a large sample of lyrics with
chord annotations. The resulting data is then passed through
a filter to remove extraneous text associated with the lyrics
(e.g., transcriber notes). Next, there is an optional step to
convert the lyrics into meaning categories (e.g., mood
labels). This requires a separate set of lyrics that have been
tagged with the corresponding labels. Annotations provided
by UCSD’s Computer Audition Laboratory (see http://cosmal.ucsd.edu/cal/projects/AnnRet) are used for
this purpose, specifically the CAL500 data set [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The
mapping process uses text categorization with word features and
also semantic categories in the form of WordNet ancestors
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Prior to categorization, both the CAL500 training data
and Usenet testing data are preprocessed to isolate
punctuation. However, no stemming is done (for simplicity). The
remaining steps are always done. The second-last step
computes contingency tables for the co-occurrence of chords
and target tokens. Then these are used in the final step to
derive co-occurrence statistics, such as mutual information.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <caption>
            <p>Example of in-line chord annotations</p>
          </caption>
          <preformat>[C] They’re gonna put me in the [F] movies
[C] They’re gonna make a big star out of [G] me
We’ll [C] make a film about a man that’s sad and [F] lonely
And [G7] all I have to do is act [C] naturally</preformat>
        </fig>
          <p>The most critical resource required is a large set of lyrics
with chord annotation. These annotations are often
specified in-line with lyrics using brackets to indicate when a
new chord occurs. Figure 2 shows an example. The Usenet
group alt.guitar.tab is used to obtain the data. This is done
by issuing a query for “CRD”, which is the name for this
type of chord annotation. The result is 8,000+ hits, each
of which is then downloaded. The chord annotation data is
used as is (e.g., without normalization into key of C).</p>
          <p>After the chord-annotated lyrics are downloaded,
postprocessing is needed to ensure that user commentary and
other additional material are not included. This is based
on a series of regular expressions. The lyrics are all
converted into a format more amenable for computing the
co-occurrence statistics, namely a tab-separated format with
the current chord name along with words from the lyrics
for which the chord applies. There will be a separate line
for each chord change in the song. Figure 3 illustrates this
format. This shows that special tokens are also included to
indicate the end of the line and paragraph (i.e., verse).</p>
          <p>Rather than just using the words from lyrics as the
meaning content, it is often better to use terms typically
associated with songs and musical phrases. This would eliminate
idiosyncratic associations between chords and words that
just happen to occur in lyrics for certain types of songs.
More importantly, it allows for better integration with
music recommendation systems, such as by using the music
labels employed by the latter.</p>
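          <p>As a concrete sketch of this conversion, the following Python parses in-line chord annotations into one chord/words pair per chord change. The regular expression, the EOL/EOP token spellings, and the function name are illustrative assumptions; the paper does not publish its actual scripts.</p>
          <preformat>import re

# Structural tokens for the tab-separated format (spellings assumed).
EOL = "EOL"   # end of lyric line
EOP = "EOP"   # end of paragraph (verse)

CHORD = re.compile(r"\[([A-G][#b]?[^]\s]*)\]")   # e.g., [C], [G7], [F#m]

def chord_word_pairs(lyrics):
    """Yield one (chord, words) pair per chord change, pairing each
    chord with the words sung while it applies."""
    chord, words = None, []
    for line in lyrics.splitlines():
        if not line.strip():              # blank line separates verses
            words.append(EOP)
            continue
        parts = CHORD.split(line)         # odd indices are chord names
        for i, part in enumerate(parts):
            if i % 2 == 1:
                # Words before the first chord are dropped in this sketch.
                if chord is not None:
                    yield chord, words
                chord, words = part, []
            else:
                words.extend(part.split())
        words.append(EOL)
    if chord is not None:
        yield chord, words

sample = "We'll [C] make a film about a man that's sad\nand [F] lonely"
for chord, words in chord_word_pairs(sample):
    print(chord + "\t" + " ".join(words))
# C    make a film about a man that's sad EOL and
# F    lonely EOL</preformat>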
          <p>
            A separate dataset of lyrics is used for lyric
classification. Although the overall process is unsupervised, it
incorporates a mapping from words to categories based on
supervised lyric classification. The source of the tagged data
is CAL500 [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], which uses 135 distinct categories. Several
of these are too specialized to be suitable for music
categorization based on general meaning, such as those related to
specific instruments or vocal characterization. Others are
usage related and highly subjective (e.g., music for driving).
Therefore, the categorization is based only on the emotion
categories. Table 1 shows the category labels used here.
Although relatively small, CAL500 has the advantage of
being much more reliable than tags derived from social media
like Last.fm. For instance, CAL500 uses a voting scheme to
filter tags with little agreement among the annotators.
          </p>
          <p>Out of the 500 songs annotated in CAL500, only 300 are
currently used due to problems resolving the proper naming
convention for artist and song in Lyric Wiki (see http://lyrics.wikia.com). In addition,
CAL500 provides multiple annotations per file, but for
simplicity only a single annotation is used here. The resulting
frequencies for the categories are shown in table 1.</p>
          <p>
            Categorization is performed using CMU’s Rainbow [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
Features are based on words as well as on semantic
classes akin to word senses. WordNet ancestors called
“hypernyms” [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] are used to implement this. See figure 4 for an
example. The use of these word classes is intended to get
around data sparsity issues, especially since the training set
is rather small. The idiosyncratic nature of lyrics compared
to other types of text collections makes this problem more
prominent.
          </p>
          <table-wrap id="tab2">
            <label>Table 2</label>
            <caption>
              <p>Contingency table cells</p>
            </caption>
            <table>
              <thead>
                <tr><th>X \ Y</th><th>+</th><th>-</th></tr>
              </thead>
              <tbody>
                <tr><td>+</td><td>XY</td><td>X¬Y</td></tr>
                <tr><td>-</td><td>¬XY</td><td>¬X¬Y</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>As neither part-of-speech tagging nor sense tagging is
applied, the hypernyms are retrieved for all parts of speech
and all senses. For example, for ’film’, seven distinct senses
would be used: five for the noun and two for the verb. In
all, 43 distinct tokens would be introduced. Naturally, this
introduces much noise, so TF/IDF filtering is used to
select those hypernyms that tend to only occur with specific
categories. (See [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] for other work using hypernyms in text
categorization.)
          </p>
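          <p>A minimal sketch of this hypernym lookup, assuming NLTK’s WordNet interface (the paper does not specify how WordNet is accessed):</p>
          <preformat># Assumes NLTK with the WordNet corpus installed:
#   pip install nltk; python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def hypernym_features(word):
    """Collect hypernym tokens over all senses and all parts of speech,
    since neither POS tagging nor sense tagging is applied."""
    features = set()
    for synset in wn.synsets(word):            # every sense, every POS
        for path in synset.hypernym_paths():   # ancestors from root to sense
            features.update(ancestor.name() for ancestor in path[:-1])
    return features

print(len(wn.synsets("film")))         # 7 senses: 5 noun, 2 verb
print(len(hypernym_features("film")))  # on the order of the 43 tokens
                                       # reported above (version-dependent)</preformat>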
          <p>Each line of the extracted chord annotations file (e.g.,
figure 3) is categorized as a mini-document, and the
highest-ranking category label is used, or N/A if none applies. To
allow for more context, all of the words from the verse for
the line are included in the mini-document. The final
result is a revised chord annotation file with one chord name
and one category per line (e.g., figure 3 modified to have
Light-Playful throughout on the right-hand side).</p>
          <p>Given the chord annotations involving either words or
meaning categories, the next stage is to compute the
co-occurrence statistics. This first tabulates the contingency
table entry for each pair of chord and target token, as
illustrated in table 2. (Alternatively, chord sequences of
length four can be used in place of single chords, as discussed
later. These are tabulated using
a sliding window over the chord annotations, as in n-gram
analysis.) This table shows that the chord G co-occurred
with the word ‘film’ once, out of the 2,213 instances for G.
The word itself only had one occurrence, and there were
17,522 instances where neither occurred. Next, the
average mutual information co-occurrence metric is derived as
follows:</p>
          <disp-formula>
            <tex-math>\sum_{x} \sum_{y} P(X = x, Y = y) \times \log_2 \frac{P(X = x, Y = y)}{P(X = x) \times P(Y = y)}</tex-math>
          </disp-formula>
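          <p>A minimal sketch of this computation, applied to the chord G / ‘film’ cell counts quoted above (the function name and data layout are ours):</p>
          <preformat>from math import log2

def average_mutual_information(both, x_only, y_only, neither):
    """Average mutual information between binary events X and Y, given
    the four contingency-table cells XY, X not-Y, not-X Y, not-X not-Y."""
    total = both + x_only + y_only + neither
    cells = {(True, True): both, (True, False): x_only,
             (False, True): y_only, (False, False): neither}
    mi = 0.0
    for (x, y), count in cells.items():
        joint = count / total
        p_x = (both + x_only if x else y_only + neither) / total
        p_y = (both + y_only if y else x_only + neither) / total
        if joint > 0:                  # skip empty cells (0 log 0 = 0)
            mi += joint * log2(joint / (p_x * p_y))
    return mi

# From the text: G and 'film' co-occur once; G occurs 2,213 times in
# all; 'film' occurs only once; neither occurs in 17,522 instances.
print(average_mutual_information(both=1, x_only=2212, y_only=0, neither=17522))</preformat>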
      </sec>
      <sec id="sec-1-4">
        <title>ANALYSIS</title>
        <p>At the very least, the system should be able to capture
broad generalizations regarding chords. For example, in
Western music, major chords are typically considered bright
and happy, whereas minor chords are typically considered
somber and sad. (Strictly speaking, it is the difference between
major and minor keys, but there is a close relation between
keys and chords [<xref ref-type="bibr" rid="ref8">8</xref>].)
Table 3 suggests that the chord meaning induction process
indeed does capture this generalization. By examining the
frequency of the pairs, it can be seen
that most cases shown fall under the major-as-happy versus
minor-as-sad dichotomy. There are a few low-frequency
exceptions, presumably since songs that are sad do not just
restrict themselves to minor chords, as that might be too
dissonant.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Sample chord/word associations</p>
          </caption>
          <table>
            <thead>
              <tr><th>Chord</th><th>Word</th></tr>
            </thead>
            <tbody>
              <tr><td>C</td><td>happy</td></tr>
              <tr><td>G</td><td>happy</td></tr>
              <tr><td>Dm</td><td>happy</td></tr>
              <tr><td>Em</td><td>happy</td></tr>
              <tr><td>F</td><td>bright</td></tr>
              <tr><td>Am</td><td>bright</td></tr>
              <tr><td>Bm</td><td>sad</td></tr>
              <tr><td>Bb</td><td>sad</td></tr>
              <tr><td>Em</td><td>sad</td></tr>
              <tr><td>Dm</td><td>sorrow</td></tr>
              <tr><td>C</td><td>sorrow</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The exceptions shown in the table might also be due to
the conventions of chord theory. In particular, chord
progressions for a specific key should contain only chords built
on the following formula, given the notes of the
corresponding major scale [<xref ref-type="bibr" rid="ref8">8</xref>]:
Maj(or), Min(or), Min, Maj, Maj, Min, Diminished.</p>
        <p>Therefore, for the key of C, proper chord sequences contain
only the chords C, Dm, Em, F, G, Am, and Bdim; likewise,
for the key of G: G, Am, Bm, C, D, Em, and F#dim. For
example, both Dm and Em are among the preferred chords
for the key of C major (hence reasonable for ‘happy’).</p>
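        <p>A short sketch (ours, for illustration) that derives these diatonic chords from the formula and the notes of the major scale:</p>
        <preformat># Triad quality on each degree of a major scale, per the formula:
# Maj, Min, Min, Maj, Maj, Min, Diminished.
QUALITIES = ["", "m", "m", "", "", "m", "dim"]

MAJOR_SCALES = {                  # scale notes for the two keys discussed
    "C": ["C", "D", "E", "F", "G", "A", "B"],
    "G": ["G", "A", "B", "C", "D", "E", "F#"],
}

def diatonic_chords(key):
    """Chords proper to a major key: one triad per scale degree."""
    return [note + quality
            for note, quality in zip(MAJOR_SCALES[key], QUALITIES)]

print(diatonic_chords("C"))   # ['C', 'Dm', 'Em', 'F', 'G', 'Am', 'Bdim']
print(diatonic_chords("G"))   # ['G', 'Am', 'Bm', 'C', 'D', 'Em', 'F#dim']</preformat>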
      <p>Of course, individual chords are limited in the meaning
they can convey, given that there are relatively few that
are used in practice, compared to the thousands of playable
chords that are possible. For example, only 60 chords
account for 90% of the occurrences in the sample from Usenet
(from a total of about 400 distinct chords). Therefore, the
ultimate test is on how well chord sequences are being treated.</p>
      <p>For simplicity, chord sequences were limited to length four.
This was chosen given the correspondence to the number
of quarter-note beats in a common time measure (i.e., 4/4
time). Over 4,000 distinct 4-chord sequences were found.
As 2,500 of these account for 90% of the occurrences, there
is a much wider variety of usage than for individual chords.</p>
      <p>Running the co-occurrence analysis over words runs into
data sparsity issues, so instead results are shown over the
mood categories inferred from the CAL500 tagged data.
Table 4 shows the top sequences for which a semantic label has
been inferred by the classifier (i.e., without guessing based
on prior probability). For the most part, the meaning
assignments seem reasonable, lending further support to the
claim that the process described here can capture the meaning
associated with chord sequences.</p>
      </sec>
      <sec id="sec-3-1">
        <title>CONCLUSION</title>
        <p>This paper has presented preliminary research illustrating
that it is feasible to learn the meaning of chord sequences
from lyrics annotated with chords. Thus, a large, untapped
resource can now be exploited for use in music
recommendation systems. An immediate area for future work is the
incorporation of objective measures for evaluation, which is
complicated given that the interpretation of chord sequences
can be highly subjective. Future work will also look into
additional aspects of music as features for modeling meaning
(e.g., tempo and note sequences). Lastly, as this approach
could be used to suggest chord sequences that convey moods
suitable for a particular set of lyrics, work will investigate
its use as a songwriting aid.
</p>
      </sec>
      <sec id="sec-3-2">
        <title>ACKNOWLEDGMENTS</title>
        <p>Dan Ponsford provided valuable feedback on the overall
process, and Per Egil Kummervold offered useful
suggestions. Douglas Turnbull granted access to CAL500, and
Cory McKay provided scripts for downloading lyrics.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Church</surname>
          </string-name>
          .
          <article-title>K-vec: A new approach for aligning parallel texts</article-title>
          .
          <source>In Proc. COLING</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Downie</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          .
          <article-title>Lyric text mining in music mood classification</article-title>
          .
          <source>In Proc. ISMIR</source>
          , pages
          <fpage>411</fpage>
          -
<lpage>416</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mansuy</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Hilderman</surname>
          </string-name>
          .
          <article-title>Evaluating WordNet features in text classification models</article-title>
          .
          <source>In Proc. FLAIRS</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering</article-title>
          . www.cs.cmu.edu/~mccallum/bow,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>McKay</surname>
          </string-name>
          et al.
          <article-title>Evaluating the genre classification performance of lyrical features relative to audio, symbolic and cultural features</article-title>
          .
          <source>In Proc. ISMIR</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Melamed</surname>
          </string-name>
          .
          <article-title>Models of translational equivalence among words</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <fpage>221</fpage>
          -
<lpage>249</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Miller</surname>
          </string-name>
          . Special issue on WordNet.
          <source>International Journal of Lexicography</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmidt-Jones</surname>
          </string-name>
          and R. Jones, editors.
          <source>Understanding Basic Music Theory. Connexions</source>
          ,
          <year>2007</year>
          . http://cnx.org/content/col10363/latest.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Turnbull</surname>
          </string-name>
          et al.
          <article-title>Semantic annotation and retrieval of music and sound effects</article-title>
          .
          <source>IEEE TASLP</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Whitman</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ellis</surname>
          </string-name>
          .
          <article-title>Automatic record reviews</article-title>
          .
          <source>In Proc. ISMIR</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>