<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A New Method to Combine Probability Estimates from Pairwise Binary Classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ondrej Šuch</string-name>
          <email>ondrejs@savbb.sk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Štefan Benˇ uš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Tinajová</string-name>
          <email>andrea.tinajova@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Constantine the Philosopher University and Slovak Academy of Sciences</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Slovak Academy of Sciences</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Žilina and Slovak Academy of Sciences</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>194</fpage>
      <lpage>199</lpage>
      <abstract>
        <p>Estimating class membership probabilities is an important step in many automated speech recognition systems. Since binary classifiers are usually easier to train, one common approach to this problem is to construct pairwise binary classifiers. Pairwise models yield an overdetermined system of equations for the class membership probabilities. Motivated by probabilistic arguments we propose a new way for estimating individual class membership probabilities, which reduces to solving a linear system of equations. A solution of this system is obtained by finding the unique non-zero eigenvector of total probability one, corresponding to eigenvalue one of a positive Markov matrix. This is a property shared by another algorithm previously proposed by Wu, Lin, and Weng. We compare properties of these methods in two settings: a theoretical three-way classification problem, and via classification of English monophthongs from TIMIT corpus. Index Terms: binary classifiers; multiclass classification; phoneme recognition; English vowels; TIMIT Inspired by Bradley-Terry model, Hastie and Tibshirani suggested [1] to require:</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Probabilistic approach underlies most current automatic
speech recognition (ASR) systems, and very likely also
human speech perception. In many ASR systems a
common task is to provide estimates of probabilities of a given
sample belonging to multiple classes given the observed
values of its features. These classes may represent various
phonemes, diphones or other kinds of linguistic categories.</p>
      <p>In machine learning it is easier to find the boundary
between two classes rather than the boundary separating
a class from many other classes [1]. Moreover, many
discriminative models are naturally suited to pairwise
classification, such as logistic regression, LDA or variants
of SVM. Thus given k classes Ci, one can readily
construct 2k pairwise discriminative models. Let us denote
by Mi j the model discriminating classes Ci and C j.
Suppose that Mi j is able not only to discriminate, but also to
compute the pairwise class membership probability ri j of
an object X with features f:</p>
      <p>ri j = ri j(X ) = p(X ∈ Ci| f, X ∈ Ci or X ∈ C j).</p>
      <p>
        Given the knowledge of ri j(X ) the question is then to
estimate multi-class probabilities pi where
pi = pi(X ) = p(X ∈ Ci| f).
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
      </p>
      <p>pi
pi + p j</p>
      <p>= ri j
∑ pi = 1
i
Note that there are 1 + 2k equations for k unknowns, so
the system of equations is over-determined for k ≥ 3 and
it may be not possible to solve them.</p>
      <p>
        In the next section we review several approaches which
have been suggested to find approximate solution of (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ). In
Section 3 we will propose a new method to combine
pairwise estimates. In Section 4 we will examine its
performance with synthetic as well as real world acoustic data.
In Conclusion we discuss findings of our experiments.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Existing Approaches</title>
      <p>
        One natural requirement for an algorithm which
determines probabilities pi is that if the system (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) has a
solution then the algorithm will find them exactly.
      </p>
      <p>Several approaches satisfying this requirement are
outlined in the work of Wu, Ling and Wen [2]. They consider
the following functionals:</p>
      <p>k k 1 1
δHT : min ∑[ ∑ (ri j k − 2 pi)]2,
p i=1 j: j6=i</p>
      <p>k k
δ1 : min ∑[ ∑ (ri j p j − r ji pi)]2,
p i=1 j: j6=i</p>
      <p>k k
δ2 : min ∑ ∑ (ri j p j − r ji pi)2,
p i=1 j: j6=i</p>
      <p>
        k k
δV : min ∑ ∑ (I{ri j&gt;r ji} p j − I{r ji&gt;ri j} pi)2,
p i=1 j: j6=i
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
(4)
(5)
(6)
(7)
(8)
(9)
where I is the indicator function. Each of the four
functionals is nonnegative. When the system (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) does have
a solution, each functional is zero at, and only at the
solution. One less satisfying feature of these approaches is
that they lack probabilistic motivation, unlike the method
we propose in the next section.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>New Method</title>
      <p>We will now describe our new algorithm. In general, one
has 0 ≤ ri j ≤ 1. To avoid complications arising from
degenerate cases we assume sharp inequalities 0 &lt; ri j &lt; 1,
which poses no difficulty in practical applications.</p>
      <p>
        Consider for a moment that an object X belongs to
the class Cm. Then for judging its similarity to other
classes one may restrict attention to the values rm j (and
r jm = 1 − rm j), since only classifiersMm j were trained on
values from the category Cm. But for those k − 1 values
equations (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) can be solved exactly, as we will now show.
      </p>
      <p>We have
This relation allows us to compute an estimate p(mm) of pm
explicitly as
p(mm) =</p>
      <p>1
∑
j6=m rm j
− (k − 2)
!−1
,
where the upper index indicates that the estimate of pm
is computed by taking into account only values rm j. The
remaining probabilities can be then computed by the
following formula:
1
p(jm) = p(mm) · rm j − 1 .</p>
      <p>
        Now we repeat this argument for m = 1, 2, . . . , k. In
general the estimates of pi thus obtained will be conflicting
i.e. in general p(jm) 6= p(jn), because given values ri j may
not allow for solving (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) consistently. We will now take
inspiration from the probability law p(A) = ∑i p(A|Bi)p(Bi),
if Bi is a partition of the probability space. We will require
that the estimate pˆi of pi should satisfy the following linear
system of equations:
pˆj = ∑ p(jm) pˆm, for j = 1, . . . , k.
      </p>
      <p>m
These requirements can be interpreted as imposing
selfconsistency on the estimates pˆ. One readily checks that
i
the matrix of the linear system (13) is Markov and
positive, thus (13) has a one-dimensional space of solutions.
Imposing an additional condition
(10)
(11)
(12)
(13)
(14)
first checks using (10) and (11) that p(mm) = pm and p(m) =
j
p j. It follows that the vector p j satisfies equations (13)
and (14). Since the solution of (13) and (14) is unique, the
method will yield the correct solution. However, this is an
ideal, very special situation that will generally not hold for
k ≥ 3.</p>
      <p>We have opted to do comparison testing of the
proposed method with the method of Wu, Ling and Wen [2]
that minimizes functional δ1 (6). The reason is that that
method also involves the construction of a positive Markov
matrix whose solution is their estimate of pm. We conduct
two experiments: one is an artificial three-way
classification problem, and the other a vowel recognition task.
4.1</p>
      <sec id="sec-3-1">
        <title>Three-Way Classification</title>
        <p>
          The system of equations (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) becomes over-determined for
k = 3. If one of the classifiers is unreliable then the
system (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) will not have a solution. In this section we present
the results of a synthetic experiment for three-way
classification.
        </p>
        <p>
          In our experiment we assume that only classifierM23 is
unreliable. In other words we assume that classifiers M12
and M13 discriminating respectively categories C1 versus
C2 and C1 versus C3 yield precise estimates of r12 and r13.
For a fixed value p1, p2 we thus set r12 = p1/(p1 + p2)
and r13 = p1/(p1 + p3) = p1/(1 − p2). Let pˆm and pWu
m
denote our and Wu’s estimates of pm. As r23 varies in
interval (
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ), define the absolute errors
and the relative error
Δ = sup |pˆi − pi|,
        </p>
        <p>i,r23
ΔWu = sup |piWu − pi|,</p>
        <p>i,r23
ΔrWelu = sup |piWu − pˆi|.</p>
        <p>i,r23
(15)
(16)
(17)</p>
        <p>The results of our experiment are shown in Table 1.
From the table it is clear that sometimes our method gives
more precise estimates, but for other values of p1, p2, Wu’s
method will yield more precise results. However, in all
cases, the relative error between our results and Wu’s
results is smaller than the absolute errors, often by an order
of one magnitude.
4.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Vowel Recognition</title>
        <p>Unlike consonants, vowels may be perceived
noncategorically by listeners [3], making it a good testing
ground for multi-class probabilistic estimates. We opted
for English language, because it has a large variety of
vowels and because there are large corpora of annotated speech
available. We worked with TIMIT, a phonetically
segmented corpus of American English [4]. Our categories
consisted of 15 monophthongs as shown in Table 2. For
∑ pˆm = 1
m
determines a unique estimate pˆm of pm.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation of the New Method</title>
      <p>
        First note that our algorithm will yield the correct solution
if the system (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) has a solution. In order to see that, one
p1
0.05
0.1
0.85
0.85
0.05
0.1
0.33
p2
0.05
0.1
0.1
0.05
0.85
0.85
0.33
      </p>
      <p>Δ
0.66
0.57
0.07
0.07
0.66
0.58
0.21
each of the categories we randomly chose their
realizations from the set of male speakers in the corpus. Each
realization was analyzed with a window 512 samples wide
(at 16kHz sampling rate its length was 32ms). If the
center of the window was less than 256 samples away from
the next phoneme, it was proportionally less likely to be
selected into our dataset. We have trained pairwise
classifiers using linear discriminant analysis (LDA). The feature
set was log-periodogram, where the analysis window was
weighted with Hanning window before computing FFT.</p>
      <p>We have performed comparison testing of our and Wu’s
method by selecting 500 random samples from the test
subset. Per phone results are shown in Table 3. The
key statistics is that overall there was 96% agreement
between most-likely classifications by our method and Wu’s
method.</p>
      <p>The overall success rate was slightly below 40% for both
our and Wu’s method. Due to the limitations of the
features (no F0, no vowel duration, no dynamic information,
no multiframe data), suboptimal performance may be
expected. For instance without intensity baseline, it is nearly
impossible to correctly distinguish some accented vowels.
vowel
iy
ih
eh
ae
aa
ah
ao
uh
uw
ux
er
ax
ix
axr
ax-h</p>
      <p>We decided to do a more detailed case study. From the test
subset we have chosen sentence SA1 spoken by speaker
MREB0 and examined each monophthong at two points
in time. The first was 5 milliseconds after the onset, and
the other one approximately near the vowel’s center. The
results are shown in Table 4.</p>
      <p>Likelihoods of most likely estimates of our and Wu’s
method are again quite close. There are two differences
between onset and center predictions. The first one is
misprediction of /er/ at the beginning of the word ‘greasy’,
which is quite understandable, since the vowel is preceded
by /r/. To gain an insight into the other mispredictions as
well as deeper insight into dynamical behavior of the
resulting multiclass classifier we present time plots in Fig. 1.
In Fig. 1a the mis-classification of /iy/ instead of TIMIT’s
/ix/ in the word ’in’ is shown. We speculate that the
problem might be attributed to greater weight put on F2, that
is relatively high and within the region for /iy/, compared
to F1 that is quite high and definitely within the region for
/ix/. In other words, the vowel might be a bit fronter than
canonical /ix/. In Fig. 1b, the first vowel of ’greasy’ is
mis-classified as /ux/ instead of TIMIT’s /iy/.</p>
      <p>This problem might be attributed to coarticulation from
the flanking consonants. The first vowel does have lower
F2, which is plausibly responsible for /ux/ prediction, but
it is preceded by /r/, which is commonly associated with
lip protrusion, which lowers F2. In Fig. 1c in the vowel of
word ’wash’, we see that it is only in the beginning in the
word ’wash’ that the classifier gives more weight to /ao/,
and then it increasingly agrees that the vowel is /aa/.
Wu’s method
our method
Wu’s method
our method
offset</p>
      <p>In this particular case, we conclude that our
classification is closer to the phonetic realization than TIMIT’s. The
beginning of the vowel is influenced by the preceding /w/
with lip rounding similar to /ao/. The rest of the vowel
sounds like an /aa/ to phonetically trained listeners, and
the formant values correspond to this perception. Finally,
Fig. 1d shows the preference for /aa/ as the first vowel of
’water’ in our model over /ao/ in TIMIT’s. Similarly to
Fig. 1c, this vowel sounds more, and its formant values
correspond to our model more, than to TIMIT’s. It should
be noted, however, that /ao/ and /aa/ have merged in
several American dialects and more tokens would be needed
for a more thorough analysis.</p>
      <p>A common way to improve the performance in
automatic speech recognition is to tune the parameters of the
system for a particular speaker. To that end we carried one
more experiment. We extracted formants for TIMIT
vowels spoken by speaker MREB0 using package phonTools
in R [5]. Next we performed pairwise LDA training as
previously but this time used values F1 and F2 for features
rather than the log-periodogram. These first two formants
are key perceptual features of vowels [6, 7, 8, 9]. Finally,
we performed multiclass classification on the first vowel
in the word ‘water’. The formants contours for this vowel
are shown in Fig. 2.</p>
      <p>The somewhat suprising results are shown in Fig. 3.
One would expect that it would have little problem with
classification of the vowel. As seen in Fig. 3, except for a
brief start, the classifier overwhelmingly believes that the
phoneme is much closer to /aa/ than TIMIT annotated /ao/.
However, compared to Fig. 1d the likelihood of /aa/ is
markedly smaller near the vowel’s boundaries.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have described a new method for combining
probability estimates from pairwise classifiers. It is quite general
and for its application needs only pairwise classifiers that
provide posterior likelihoods. We believe that since the
rationale for our method is probabilistically motivated, it
has the potential to edge out other methods in practice.
In particular by its construction it avoids the problem of
‘pairwise coupling’ approaches pointed out by G.
Hinton [1, pg. 467]. Another important feature is that the
resulting probabilities are computed as the dominant
eigenvector of a Markov matrix, allowing for efficient
computation via iterations when the matrix of binary likelihoods
varies slowly in time. Finally, since the method is not
hierarchical, it avoids compounding of errors common in
hierarchical approaches.</p>
      <p>In presented synthetic and phonetic experiments its
performance was very close to a method previously suggested
by Wu [2]. The classification of English vowels was
suboptimal, but that may not be indicative of performance in
real world scenarios for several reasons.</p>
      <p>• We have used all TIMIT vowel categories, some
of which are in previously published performance
benchmark tests fused because they are extremely
hard to discriminate.
• Other pairwise classifiers, for instance logistic
regression or SVM may yield better results.
• Based on the last experiment presented, we question
whether TIMIT annotation is consistent throughout
the corpus even for individual speakers.
1.591s
1.634s
(a) TIMIT annotation is /ix/ in the word ‘in’. We considered an
alternative classification that the vowel is /iy/.
29000
29200
29400
29600
(b) TIMIT annotation is /iy/ for the first vowel in the word
‘greasy’. We considered an alternative classification that the
vowel is /ux/.
0
.
1
8
.
liitrybaob .04060
p .</p>
      <p>2
.
0
0
.
0
0
.</p>
      <p>1
od .08
lilikhoe .06
iirsew .04
ap .20
0
.
0
0
.
1
8
.
ilitryboab .40060
p .</p>
      <p>2
.
0
0
.
0
0
.</p>
      <p>1
od .08
ililkeho .06
iirsew .40
ap .20
0
.
0
iy
ix
aa
ao
0
0
4
1
00 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
1
0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
6
0
0
8
1
y
iilt
b
a
b
o
r
p
8
.
0
4
.
0
0
.
0
2.438s
2.513s</p>
      <p>Further experiments with a complete ASR system may
shed more light on the applicability of the proposed
algorithm.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>Our research was supported by the project University
Science Park ITMS 26220220184 and grants APVV-0219-12,
APVV-14-0560 and VEGA 2/0197/15. The authors are
thankful to Paul Foulkes, K. Bachratá, and Martin Klimo
for helpful discussion.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Classification by pairwise coupling</article-title>
          .
          <source>Annals of Statistics</source>
          <volume>26</volume>
          (
          <issue>2</issue>
          ) (
          <year>1998</year>
          ),
          <fpage>451</fpage>
          -
          <lpage>471</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>T. -F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          -J.,
          <string-name>
            <surname>Weng</surname>
          </string-name>
          , R.:
          <article-title>Probability estimates for multi-class classification by pairwise coupling</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          (
          <year>2004</year>
          ),
          <fpage>975</fpage>
          -
          <lpage>1005</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Fry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abramson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eimas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liberman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The identification and discrimination of synthetic vowels</article-title>
          .
          <source>Language and Speech</source>
          <volume>5</volume>
          (
          <year>1962</year>
          ),
          <fpage>171</fpage>
          -
          <lpage>189</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>