<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hongchuan Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boyuan Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Centre for Computer Animation, Bournemouth University</institution>
        </aff>
      </contrib-group>
      <fpage>16</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>This paper focuses on multi-modal learning and introduces an AdaBoost-based approach for multi-modal learning. We address two foundational problems: (1) the difference between AdaBoost with homogeneous and heterogeneous weak learners; (2) a generalization metric. By addressing these research questions, this paper enhances our understanding of AdaBoost in the context of multi-modal learning through comprehensive experiments. The experimental results show that the heterogeneous structure is a trade-off between the performances of different weak learners rather than a clear synergy. The multi-modal learning model's performance depends on how the individual weak learners are composed, and the heterogeneous structure's advantage lies in harnessing the diverse strengths of individual weak learners, even though the improvement achieved is not overwhelmingly pronounced.</p>
      </abstract>
      <kwd-group>
        <kwd>AdaBoost</kwd>
        <kwd>Homogeneous weak learners</kwd>
        <kwd>Heterogeneous weak learners</kwd>
        <kwd>Multimodal learning</kwd>
        <kwd>Generalization metric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Multi-modal learning refers to the process of extracting
attributes from one or more data streams, known as
modalities, that have different dimensions. The goal is to
learn how to combine and project the extracted
heterogeneous features into a shared representation space. In
various applications, leveraging multiple modalities and
sensors can provide valuable contextual information for
a given task. Each modality, such as textual, visual, or
auditory, has its own structure and encoding mechanisms
for handling heterogeneous information harmoniously
within a conceptual framework.</p>
      <p>While the combination of different modalities or data
sources to enhance performance is an ongoing research
focus, it is often challenging to distinguish between noise,
concepts, and conflicts among the data sources in
practice.</p>
      <p>Among boosting algorithms, AdaBoost is widely
recognized as a prominent member. It converts a set of
weak learners into a strong learner. Typically, AdaBoost
is formulated using an additive model, where a linear
combination of base learners is employed to minimize
the exponential loss function. AdaBoost implementation
is straightforward and comprehensible, and it is known
for its resistance to overfitting [1]. Multi-modal learning
aims to tap the potential of multiple modality data, while
AdaBoost is a successful example of ensemble learning.
It is therefore natural to apply AdaBoost to multi-modal
learning.</p>
      <p>However, whether we use ensemble learning or
multi-modal learning techniques, they all encounter a
common challenge: generalization. Initially, these
algorithms were developed to tackle the issue of
generalization, where a pre-trained model can effectively handle
unseen domains. In this paper, we leverage the power
of AdaBoost and introduce novel multi-modal learning
methods based on it. Unlike the conventional
implementation of AdaBoost that assumes homogeneous weak
learners, in multi-modal learning scenarios each
modality may have its own individual learners, resulting in
heterogeneous learners.</p>
      <p>The main challenge we face in our proposed algorithm
involves two aspects: (1) assessing the performance
difference of AdaBoost with homogeneous and
heterogeneous classifiers, respectively; (2) establishing a
quantifiable metric for generalization, which has been lacking in
existing research. Our contributions in this paper are as
follows:</p>
      <p>1. We demonstrate that AdaBoost performs equally
well with homogeneous weak learners as with
heterogeneous weak learners.</p>
      <p>2. We introduce a new metric for measuring the
generalization capability of the proposed
algorithm. This metric allows us to assess how well
the algorithm generalizes to unseen data.</p>
      <p>By addressing these challenges and making these
contributions, our paper aims to enhance the understanding
and application of multi-modal learning techniques,
especially in the context of AdaBoost-based approaches.</p>
      <p>The first challenge is representation: learning how to
represent and summarize data from diverse modalities
while accounting for heterogeneity, noise levels, and
missing data. Deep networks have been employed to
represent visual, acoustic, and textual data, with recent
efforts focusing on fine-tuning these representations for
specific tasks [3].</p>
      <p>The second challenge is translation, which aims to
generate an entity in one modality based on information
from a different modality. An example of this is video
description generation. Previous work by [4] proposed
a system that describes human behavior in videos using
detected head and hand positions combined with
rule-based natural language generation. Evaluating
multi-modal translation methods is challenging, as there are
often multiple correct answers and subjective judgments
involved.</p>
      <p>The third challenge is alignment, which involves
finding relationships and correspondences between
subcomponents of instances across multiple modalities, for
instance, aligning a movie with its corresponding script
or book chapters. Dynamic Time Warping (DTW) and
Canonical Correlation Analysis (CCA) are commonly used
for multi-modal data alignment, and [5] introduced the
deep canonical time warping approach, which generalizes
deep CCA and DTW.</p>
      <p>The fourth challenge is fusion, which aims to integrate
information from multiple modalities to improve the
robustness of predictions. In the context of continuous
multi-modal emotion recognition, [6] demonstrated the
advantages of using LSTM models over graphical models
and SVMs.</p>
      <p>It is worth noting that these challenges and approaches
are part of a broader survey on multi-modal learning, and
further details can be found in [2].</p>
      <p>The main objective of generalization is to develop a model
from one or multiple distinct yet related domains (i.e.,
diverse training datasets) that can generalize effectively
on unseen testing domains. Ensemble learning leverages
the connections between multiple source domains by
employing specific model architecture designs and
training strategies to enhance generalization. The underlying
assumption is that any sample can be seen as a
combination of multiple source domains, resulting in an overall
prediction that combines the outputs of various
domain-specific models. [7] introduced domain-specific layers
corresponding to different source domains and learned
the linear aggregation of these layers to represent a test
sample. Similarly, [8] proposed Domain Adaptive
Ensemble Learning (DAEL), which comprises a CNN feature
extractor shared across domains and multiple
domain-specific classifier heads. Each classifier acts as an expert
for its own domain but a non-expert for others. The
objective of DAEL is to collaboratively train these experts
by teaching the non-experts with the expert knowledge,
encouraging the ensemble to effectively handle data from
previously unseen domains. This approach fosters
domain adaptation and allows the model to generalize well
across different domains.</p>
      <p>In this paper, we aim to tackle these challenges by
employing the AdaBoost algorithm, since it allows greater
diversity of models and features.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>3.1. Problem Description</title>
        <p>Consider two modalities generated from the sample set
S: X = {x_1, . . . , x_N} and Y = {y_1, . . . , y_N}, where n
denotes the index of samples, and x_n and y_n have a and
b dimensions respectively. Given the ground
truth labels L = {l_1, . . . , l_N}, where l_n ∈ {0, 1} or
multiple classes, we aim to train a multi-modal learning
model to map both X and Y into the same categorical
set of L.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. AdaBoost with heterogeneous weak learners</title>
        <sec id="sec-2-2-1">
          <p>In terms of the additive model in [9], the weak classifier
h(x) minimizes the classification error under the
distribution D_t over the training data. Its classification error rate
should be less than 0.5 under D_t. It can be noted that if
every h(x) satisfies this requirement, the resulting final
strong classifier H(x) still satisfies the error bound in [10].
Moreover, under the assumption that the error rate (i.e.,
the loss function) is convex, it is possible to prove through
Jensen's inequality that AdaBoost outperforms its
individual learners. Note that this holds regardless of where the
individual learners come from. These results imply that
the choice of homogeneous or heterogeneous weak
classifiers does not influence the performance of AdaBoost.
Our numerical experiments in Section 5.2.1 verify this
assertion.</p>
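          <p>As a minimal illustration of this weak-learnability condition, the following numpy sketch (ours, not the paper's code) carries out one round of discrete AdaBoost: any learner whose weighted error under D_t is below 0.5 earns a positive weight, wherever that learner comes from.</p>
          <preformat>
```python
import numpy as np

# Minimal numpy sketch (ours, not the paper's code) of one round of
# discrete AdaBoost: a weak learner is acceptable whenever its weighted
# error under D_t is below 0.5, regardless of its hypothesis class.
def adaboost_round(y_true, y_pred, w):
    """Update the sample distribution D_t for one weak learner.

    y_true, y_pred: arrays with labels in {-1, +1}; w: current distribution.
    Returns (alpha, updated distribution)."""
    eps = np.sum(w[y_pred != y_true])          # weighted error under D_t
    eps = max(float(eps), 1e-12)               # guard the log below
    assert 1.0 - 2.0 * eps > 0.0, "weak-learnability (error below 0.5) violated"
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # learner weight
    w = w * np.exp(-alpha * y_true * y_pred)   # up-weight mistakes
    return alpha, w / w.sum()                  # renormalise to a distribution

y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, -1, 1])            # one mistake at index 1
w0 = np.full(5, 0.2)
alpha, w1 = adaboost_round(y, pred, w0)        # alpha = 0.5 * ln(4)
```
          </preformat>
          <p>On the toy arrays, the misclassified sample's weight grows from 0.2 to 0.5 while the others shrink, which is the re-weighting behind the error bound cited above.</p>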
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Multi-modal learning based on AdaBoost</title>
        <sec id="sec-2-4-1">
          <p>The basic idea is that the different modalities X and Y
are bundled with weak learners together and are viewed
as heterogeneous learners. The sample set S and the
label set L are employed as the training dataset. The
proposed multi-modal learning model is implemented
based on AdaBoost as shown in Figure 1.</p>
          <p>The weak classifiers may be either homogeneous or
heterogeneous, which suits the scenario in which each
modality's data has its own individual classifiers. Moreover,
each sample in S may be a collection of multi-class data.
Under the AdaBoost scheme, we update the rule of the
sample distribution D_t over S.</p>
          <p>Note that different modalities may share the same
classifier and their combinations are still regarded as
independent heterogeneous learners. This can maximally
generalize weak learners.</p>
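          <p>One way to read the scheme in Figure 1 is sketched below (our illustrative reconstruction, not the authors' released code): each bundle of modality features and classifier is a candidate weak learner, and every boosting round keeps the bundle with the lowest weighted error. The synthetic modalities X_a, X_b and all parameters are hypothetical.</p>
          <preformat>
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch (our reconstruction, not the authors' released code):
# each modality is bundled with its own classifier, and the bundles act as
# heterogeneous weak learners inside a discrete AdaBoost loop. X_a, X_b and
# all parameters below are hypothetical synthetic stand-ins for X and Y.
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, n)
y = 2 * labels - 1                                   # labels in {-1, +1}
X_a = labels[:, None] + rng.normal(0, 1.0, (n, 4))   # modality X, a = 4 dims
X_b = labels[:, None] + rng.normal(0, 1.5, (n, 6))   # modality Y, b = 6 dims

# candidate weak learners: (modality features, classifier) bundles
bundles = [(X_a, DecisionTreeClassifier(max_depth=1)),
           (X_b, DecisionTreeClassifier(max_depth=1)),
           (X_a, GaussianNB()),
           (X_b, GaussianNB())]

w = np.full(n, 1.0 / n)              # sample distribution D_t
F = np.zeros(n)                      # additive ensemble score
for t in range(10):
    errs, preds = [], []
    for feats, clf in bundles:       # fit every bundle under D_t
        p = clf.fit(feats, y, sample_weight=w).predict(feats)
        preds.append(p)
        errs.append(w[p != y].sum()) # weighted error of this bundle
    best = int(np.argmin(errs))      # keep the strongest bundle this round
    eps = max(errs[best], 1e-12)
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    F += alpha * preds[best]
    w *= np.exp(-alpha * y * preds[best])
    w /= w.sum()                     # renormalise D_t

train_err = np.mean(np.sign(F) != y)
```
          </preformat>
          <p>Because the pool mixes classifiers and modalities, the loop naturally realizes the heterogeneous structure described above; restricting the pool to one bundle recovers the homogeneous case.</p>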
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Generalization metric</title>
      <p>Indeed, the generalization error of an AdaBoost
algorithm is influenced by the diversity of its individual
learners. This relationship is elucidated through the
error-ambiguity decomposition method introduced in [11]. It
is sensible to consider the diversity of a classifier as a
representation of its generalization capabilities. In other
words, a more robustly generalized classifier exhibits
greater diversity, leading to improved performance
metrics, such as a lower error rate.</p>
      <p>However, a significant challenge in this context is the
absence of a well-defined diversity measurement. While
it is intuitive to link diversity to better performance, there
is currently a lack of standardized and quantifiable
metrics to precisely evaluate and compare the diversity of
classifiers. Addressing this gap could potentially enhance
our understanding of how diversity impacts
generalization and lead to further improvements in ensemble
learning algorithms like AdaBoost.</p>
      <p>Kappa statistic: To measure AdaBoost diversity, we apply
the Kappa statistic to measure the pairwise
similarity/dissimilarity between two learners, and then average
all the pairwise measurements for the overall diversity.
This can be simply described in a binary classification
application. We have the following contingency table for
two learners h_i and h_j, where a + b + c + d = m are
non-negative variables showing the numbers of examples
satisfying the conditions specified by the corresponding
rows and columns:</p>
      <preformat>
                 h_j = 0    h_j = 1
      h_i = 0       a          b
      h_i = 1       c          d
      </preformat>
      <p>The Kappa statistic is then

κ = (Θ1 − Θ2) / (1 − Θ2),    (1)

where Θ1 = (a + d)/m is the observed agreement of the
two learners and Θ2 = ((a + b)(a + c) + (c + d)(b + d))/m²
is the agreement expected by chance.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments and analysis</title>
      <sec id="sec-3-1">
        <title>5.1. Data Collection</title>
        <p>5.1.1. Synthetic dataset</p>
        <p>To evaluate the performance of AdaBoost, we conducted
experiments using homogeneous weak learners and
heterogeneous weak learners respectively. For this purpose,
we generated a synthetic dataset consisting of 1000
samples, 10 features, and 2 classes using the Gaussian
function with zero mean and variance of 1.</p>
        <p>5.1.2. CIFAR-10 dataset</p>
        <p>The CIFAR-10 dataset [12] is a widely-used benchmark
for image classification. It comprises 60,000 color images
of size 32x32, distributed across 10 classes with 6,000
images per class. The dataset exhibits diverse and relatively
low-resolution images. To simulate a multi-modal
learning scenario, we extract three types of feature
representations: color-based features (HSV histogram),
shape-based features (Histogram of Oriented Gradients), and
texture-based features (Gabor filter). However, considering
that the original AdaBoost algorithm was designed for
binary classification, we selected two classes from the
CIFAR-10 dataset for experiments.</p>
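        <p>The paper does not include feature extraction code; the following numpy-only sketch suggests what the three CIFAR-10 feature types could look like (a real pipeline would more likely use scikit-image or OpenCV, and every bin count and filter parameter here is our assumption).</p>
        <preformat>
```python
import numpy as np

# Hypothetical numpy-only sketches of the three CIFAR-10 feature types
# (color, shape, texture); all parameters are our assumptions, not the
# paper's. `img` is an RGB array of shape (32, 32, 3) with values in [0, 1].
def hsv_hist(img, bins=8):
    """Color feature: normalised histogram over the hue channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mx, mn = img.max(-1), img.min(-1)
    diff = np.where(mx == mn, 1.0, mx - mn)       # avoid divide-by-zero
    hue = np.select([mx == r, mx == g],
                    [(g - b) / diff % 6.0, (b - r) / diff + 2.0],
                    (r - g) / diff + 4.0) / 6.0   # hue in [0, 1)
    h, _ = np.histogram(hue, bins=bins, range=(0, 1))
    return h / h.sum()

def hog_feat(img, bins=9):
    """Simplified shape feature: global histogram of gradient orientations."""
    gray = img.mean(-1)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return h / (h.sum() + 1e-12)

def gabor_feat(img, freq=0.25, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Texture feature: mean response magnitude of Gabor filters (via FFT)."""
    gray = img.mean(-1)
    yy, xx = np.mgrid[-7:8, -7:8]                 # 15x15 kernel grid
    feats = []
    for th in thetas:
        rot = xx * np.cos(th) + yy * np.sin(th)
        kern = np.exp(-(xx**2 + yy**2) / 18.0) * np.cos(2 * np.pi * freq * rot)
        resp = np.fft.irfft2(np.fft.rfft2(gray) * np.fft.rfft2(kern, gray.shape))
        feats.append(np.abs(resp).mean())
    return np.array(feats)

img = np.random.default_rng(1).random((32, 32, 3))
x = np.concatenate([hsv_hist(img), hog_feat(img), gabor_feat(img)])
```
        </preformat>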
        <sec id="sec-3-1-3">
          <p>5.1.3. Million Song Dataset</p>
          <p>We also design experiments based on AdaBoost for music
emotion recognition with the Million Song Dataset [13],
which refers to recognizing and classifying emotions in
music using multiple modalities (such as audio and
lyrics). We chose two different emotion categories as labels
based on the quadrant distribution in Russell's emotion
model, i.e., positive and negative [14]. We extract the
lyrics features from the MusiXmatch dataset derived from
the Million Song Dataset and a series of emotionally
representative acoustic features (i.e., Tempo, Beats, Harmonic,
Percussive, Root Mean Square, Zero Crossing Rate, Onset
Frames, Chroma short-time Fourier transform, Chroma
Energy Normalized, Chroma Constant-Q chromagram,
Mel-spectrogram, MFCC, Poly, Tonnetz, Spectral
bandwidth, Spectral roll-off, Spectral contrast, Spectral
centroid) with the librosa Python library [15].</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Results and analysis</title>
        <p>5.2.1. Experiment 1: Comparison of AdaBoost
with homogeneous and heterogeneous weak
learners</p>
        <sec id="sec-3-2-1">
          <p>We first performed AdaBoost on the synthetic dataset
and applied three weak learners in both homogeneous
and heterogeneous scenarios: Decision Tree (DT), Naive
Bayes (NB), and Perceptron (Per). The results are shown
in Tables 1, 2 and Figure 2. It can be noted that the choice
of homogeneous or heterogeneous weak learners does not
affect the AdaBoost performance.</p>
          <p>We further performed AdaBoost on CIFAR-10 with
homogeneous and heterogeneous weak learners
respectively. The results are shown in Tables 3, 4 and Figure
3. It can be noted that (1) the AdaBoost performance is
not influenced by homogeneous or heterogeneous weak
learners; (2) the architecture of heterogeneous weak
learners usually does not improve the AdaBoost performance.
This is reasonable since different weak learners in the
heterogeneous architecture have their individual
performances. This finally results in a trade-off between the
performances of different weak learners rather than synergy.</p>
        </sec>
        <sec id="sec-3-2-2">
          <p>We first performed AdaBoost with the homogeneous
weak learners (Decision Tree, Naive Bayes) on each
unimodal feature in the CIFAR-10 and the Million Song
Dataset, respectively. The results are shown in Tables 5, 6
and Figure 4.</p>
          <p>We further applied AdaBoost with homogeneous weak
learners (DT, NB) on the multi-modal dataset. To mock
multi-modal learning, we chose 4 combinations of the
features (HSV, Gabor, HOG) as multi-modal data. For the
music emotion recognition, there are two kinds of real
modality data available. The results are shown in Tables
7, 8. The performance of bundling the features with their
individual classifiers (i.e., HSV+DT, Gabor+DT,
HOG+SGD) on the CIFAR-10 as the weak learners in Table 9
is only comparable with that of multi-modal learning
with the single classifier of DT in Table 7. This is
acceptable since these three combinations in Table 9 may have
different performances.</p>
          <p>Experiment 1 justifies that the final result is a trade-off
between the performances of different weak learners
rather than a synergy.</p>
          <p>5.2.3. Experiment 3: AdaBoost based MLs'
diversities</p>
          <p>In the proposed multi-modal learning model (refer to Fig.
1), the weak learner can exhibit different compositions,
which can be categorized into the following types:</p>
          <p>1) The same classifier with different features, resulting
in multiple distinct weak learners.</p>
          <p>2) The same feature with different classifiers, leading
to multiple diverse weak learners.</p>
          <p>3) Different features with their individual classifiers,
yielding multiple weak learners.</p>
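          <p>The pairwise Kappa statistic of Eq. (1) used in these comparisons can be computed from two learners' prediction vectors as sketched below (our code, using the standard contingency-table definitions of Θ1 and Θ2; averaging over all learner pairs is left to the caller).</p>
          <preformat>
```python
import numpy as np

# Sketch of the pairwise Kappa of Eq. (1) from two binary prediction
# vectors, with the standard contingency-table definitions of Theta_1
# (observed agreement) and Theta_2 (chance agreement).
def pairwise_kappa(pred_i, pred_j):
    """Kappa between two prediction vectors with labels in {0, 1}.

    Undefined (division by zero) if both learners always emit one class."""
    pred_i = np.asarray(pred_i, dtype=int)
    pred_j = np.asarray(pred_j, dtype=int)
    m = pred_i.size
    # contingency cells: a = both predict 0, b = (0,1), c = (1,0), d = both 1
    a, b, c, d = np.bincount(2 * pred_i + pred_j, minlength=4)
    theta1 = (a + d) / m                                      # Theta_1
    theta2 = ((a + b) * (a + c) + (c + d) * (b + d)) / m**2   # Theta_2
    return (theta1 - theta2) / (1.0 - theta2)

p = np.array([0, 1, 1, 0, 1, 0])
assert pairwise_kappa(p, p) == 1.0      # identical learners: no diversity
```
          </preformat>
          <p>Identical learners give κ = 1 and perfectly disagreeing learners give κ = −1, so lower average κ indicates a more diverse ensemble.</p>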
          <p>To compare the structures of homogeneous and
heterogeneous weak learners, each weak learner is first used
in the AdaBoost homogeneous structure. Subsequently,
these weak learners are incorporated into the AdaBoost
heterogeneous structure. In each AdaBoost iteration, we
calculate the pairwise Kappa statistics of weak learners
originating from the AdaBoost and their average error
rates, which are then represented in a scatter plot. Herein
the origin (0,0) denotes error rate = 0 and Kappa = 0,
which is the ideal point. Figure 7 illustrates the results for
composition 2, while Figure 8 shows those for composition
1. Overall, the heterogeneous structure broadly
encompasses the results obtained from the homogeneous
structures. Figure 9 displays the results for composition 3. In
the homogeneous structure tests, we experimented with
various combinations of features and classifiers for weak
learner design, selecting 2 or 3 learners with satisfactory
performance for the heterogeneous structure test.</p>
          <p>Figure 8: Same classifier + Multiple features. CIFAR-10
dataset (above) and multi-modal music emotion recognition
(below). Homogeneous weak learners (left), heterogeneous
weak learners (right).</p>
          <p>Figure 9: Multiple features with individual classifiers.
CIFAR-10 dataset (above) and multi-modal music emotion
recognition (below). Homogeneous tests (left),
Heterogeneous tests (right).</p>
          <p>It is noteworthy that the selected weak learners,
composed of features and their individual classifiers,
exhibited good performance in the homogeneous structure
tests. Consequently, the heterogeneous structure
demonstrated improved performance compared to the results
of the homogeneous tests, such as the error rate in Figure
9. However, the extent of improvement was not
significant, suggesting that the overall outcome represents a
trade-off between the performances of different weak
learners rather than a clear synergy. The heterogeneous
structure did not lead to a distinct and prominent change
in performance.</p>
          <p>6. Conclusion</p>
          <p>In this paper, we conducted experiments and analysis to
explore AdaBoost-based multi-modal learning methods.
Our findings lead to the following conclusions:</p>
          <p>(1) The architecture of homogeneous or heterogeneous
weak learners does not significantly impact the
performance of AdaBoost.</p>
          <p>(2) In the architecture of heterogeneous weak learners,
each weak learner contributes individual performance,
and the ensemble learning result is a trade-off among the
performances of different weak learners rather than a
synergistic effect.</p>
          <p>(3) In multi-modal learning, each modality possesses
its own classifiers. To fully maximize the potential of
multi-modalities, it is preferable to bundle the modalities
with their individual classifiers as independent weak
learners for ensemble learning. However, neither
homogeneous nor heterogeneous architectures bring about a
distinct change.</p>
          <p>In future research, we plan to apply AdaBoost-based
multi-modal learning to address various challenges in the
field, such as representation, alignment, explainability,
and more. This will further demonstrate the potential and
effectiveness of AdaBoost in the context of multi-modal
learning.</p>
          <p>[1] Z.-H. Zhou, Large margin distribution learning, in:
Artificial Neural Networks in Pattern Recognition:
6th IAPR TC 3 International Workshop, ANNPR
2014, Montreal, QC, Canada, October 6-8, 2014,
Proceedings 6, Springer, 2014, pp. 1–11.</p>
          <p>[2] T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal
machine learning: A survey and taxonomy, IEEE
Transactions on Pattern Analysis and Machine
Intelligence 41 (2018) 423–443.</p>
          <p>[3] D. Wang, P. Cui, M. Ou, W. Zhu, Deep multimodal
hashing with orthogonal regularization, in:
Twenty-Fourth International Joint Conference on Artificial
Intelligence, 2015.</p>
          <p>[4] A. Kojima, T. Tamura, K. Fukunaga, Natural
language description of human activities from video
images based on concept hierarchy of actions,
International Journal of Computer Vision 50 (2002)
171–184.</p>
          <p>[5] G. Trigeorgis, M. A. Nicolaou, S. Zafeiriou, B. W.
Schuller, Deep canonical time warping, in:
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 5110–5118.</p>
          <p>[6] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller,
G. Rigoll, LSTM-modeling of continuous emotions in
an audiovisual affect recognition framework, Image
and Vision Computing 31 (2013) 153–163.</p>
          <p>[7] M. Mancini, S. R. Bulo, B. Caputo, E. Ricci, Best
sources forward: domain generalization through
source-specific nets, in: 2018 25th IEEE International
Conference on Image Processing (ICIP), IEEE, 2018,
pp. 1353–1357.</p>
          <p>[8] K. Zhou, Y. Yang, Y. Qiao, T. Xiang, Domain
adaptive ensemble learning, IEEE Transactions on Image
Processing 30 (2021) 8008–8018.</p>
          <p>[9] J. Friedman, T. Hastie, R. Tibshirani, Additive
logistic regression: a statistical view of boosting (with
discussion and a rejoinder by the authors), The
Annals of Statistics 28 (2000) 337–407.</p>
          <p>[10] P. Bartlett, Y. Freund, W. S. Lee, R. E. Schapire,
Boosting the margin: A new explanation for the
effectiveness of voting methods, The Annals of
Statistics 26 (1998) 1651–1686.</p>
          <p>[11] A. Krogh, J. Vedelsby, Neural network ensembles,
cross validation, and active learning, Advances in
Neural Information Processing Systems 7 (1994).</p>
          <p>[12] H. Li, H. Liu, X. Ji, G. Li, L. Shi, Cifar10-dvs: an
event-stream dataset for object classification,
Frontiers in Neuroscience 11 (2017) 309.</p>
          <p>[13] T. Bertin-Mahieux, D. P. Ellis, B. Whitman,
P. Lamere, The million song dataset (2011).</p>
          <p>[14] J. A. Russell, A circumplex model of affect, Journal
of Personality and Social Psychology 39 (1980) 1161.</p>
          <p>[15] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar,</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>