<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hongchuan Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boyuan Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Centre for Computer Animation, Bournemouth University</institution>
        </aff>
      </contrib-group>
      <fpage>16</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>This paper focuses on multi-modal learning and introduces an AdaBoost-based approach for multi-modal learning. We address two foundational problems: (1) the difference between AdaBoost with homogeneous and heterogeneous weak learners; (2) a generalization metric. By addressing these research questions, this paper enhances our understanding of AdaBoost in the context of multi-modal learning through comprehensive experiments. The experimental results show that the heterogeneous structure is a trade-off between the performances of different weak learners rather than a clear synergy. The multi-modal learning model's performance depends on how the individual weak learners are composed, and the heterogeneous structure's advantage lies in harnessing the diverse strengths of individual weak learners, even though the improvement achieved is not overwhelmingly pronounced.</p>
      </abstract>
      <kwd-group>
        <kwd>AdaBoost</kwd>
        <kwd>Homogeneous weak learners</kwd>
        <kwd>Heterogeneous weak learners</kwd>
        <kwd>Multimodal learning</kwd>
        <kwd>Generalization metric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Multi-modal learning refers to the process of extracting
attributes from one or more data streams, known as
modalities, that have different dimensions. The goal is to
learn how to combine and project the extracted
heterogeneous features into a shared representation space. In
various applications, leveraging multiple modalities and
sensors can provide valuable contextual information for
a given task. Each modality, such as textual, visual, or
auditory, has its own structure and encoding mechanisms
for handling heterogeneous information harmoniously
within a conceptual framework.</p>
      <p>While the combination of different modalities or data
sources to enhance performance is an ongoing research
focus, it is often challenging to distinguish between noise,
concepts, and conflicts among the data sources in
practice.</p>
      <p>Among boosting algorithms, AdaBoost is widely
recognized as a prominent member. It converts a set of
weak learners into a strong learner. Typically, AdaBoost
is formulated using an additive model, where a linear
combination of base learners is employed to minimize
the exponential loss function. AdaBoost implementation
is straightforward and comprehensible, and it is known
for its resistance to overfitting [1]. Multi-modal learning
aims to tap the potential of multiple modality data, while
AdaBoost is a successful example of ensemble learning.
It is therefore natural to apply AdaBoost to multi-modal
learning.</p>
      <p>However, whether we use ensemble learning or
multi-modal learning techniques, they all encounter a
common challenge: generalization. Initially, these
algorithms were developed to tackle the issue of
generalization, where a pre-trained model can effectively handle
unseen domains. In this paper, we leverage the power
of AdaBoost and introduce novel multi-modal learning
methods based on it. Unlike the conventional
implementation of AdaBoost that assumes homogeneous weak
learners, in multi-modal learning scenarios each
modality may have its own individual learners, resulting in
heterogeneous learners.</p>
      <p>The main challenge we face in our proposed algorithm
involves two aspects: (1) assessing the performance
difference of AdaBoost with homogeneous and
heterogeneous classifiers, respectively; (2) establishing a
quantifiable metric for generalization, which has been lacking in
existing research. Our contributions in this paper are as
follows:</p>
      <p>1. We demonstrate that AdaBoost performs equally
well with homogeneous weak learners as with
heterogeneous weak learners.</p>
      <p>2. We introduce a new metric for measuring the
generalization capability of the proposed
algorithm. This metric allows us to assess how well
the algorithm generalizes to unseen data.</p>
      <p>By addressing these challenges and making these
contributions, our paper aims to enhance the understanding
and application of multi-modal learning techniques,
especially in the context of AdaBoost-based approaches.</p>
      <p>The first challenge is representation: learning how to
represent and summarize data from diverse modalities
while accounting for heterogeneity, noise levels, and
missing data. Deep networks have been employed to
represent visual, acoustic, and textual data, with recent
efforts focusing on fine-tuning these representations for
specific tasks [3].</p>
      <p>The second challenge is translation, which aims to
generate an entity in one modality based on information
from a different modality. An example of this is video
description generation. Previous work by [4] proposed
a system that describes human behavior in videos using
detected head and hand positions combined with
rule-based natural language generation. Evaluating
multi-modal translation methods is challenging, as there are
often multiple correct answers and subjective judgments
involved.</p>
      <p>The third challenge is alignment, which involves
finding relationships and correspondences between
subcomponents of instances across multiple modalities, for
instance, aligning a movie with its corresponding script
or book chapters. Dynamic Time Warping (DTW) and
Canonical Correlation Analysis (CCA) are commonly used
for multi-modal data alignment, and [5] introduced the
deep canonical time warping approach, which generalizes
deep CCA and DTW.</p>
      <p>The fourth challenge is fusion, which aims to integrate
information from multiple modalities to improve the
robustness of predictions. In the context of continuous
multi-modal emotion recognition, [6] demonstrated the
advantages of using LSTM models over graphical models
and SVMs.</p>
      <p>It is worth noting that these challenges and approaches
are part of a broader survey on multi-modal learning, and
further details can be found in [2].</p>
      <p>The main objective of generalization is to develop a model
from one or multiple distinct yet related domains (i.e.,
diverse training datasets) that can generalize effectively
on unseen testing domains. Ensemble learning leverages
the connections between multiple source domains by
employing specific model architecture designs and
training strategies to enhance generalization. The underlying
assumption is that any sample can be seen as a
combination of multiple source domains, resulting in an overall
prediction that combines the outputs of various
domain-specific models. [7] introduced domain-specific layers
corresponding to different source domains and learned
the linear aggregation of these layers to represent a test
sample. Similarly, [8] proposed Domain Adaptive
Ensemble Learning (DAEL), which comprises a CNN feature
extractor shared across domains and multiple
domain-specific classifier heads. Each classifier acts as an expert
for its own domain but a non-expert for others. The
objective of DAEL is to collaboratively train these experts
by teaching the non-experts with the expert knowledge,
encouraging the ensemble to effectively handle data from
previously unseen domains. This approach fosters
domain adaptation and allows the model to generalize well
across different domains.</p>
      <p>In this paper, we aim to tackle these challenges by
employing the AdaBoost algorithm, since it allows greater
diversity of models and features.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>3.1. Problem Description</title>
        <p>Consider two modalities generated from the sample set
S: X = {x_1, . . . , x_N} and Y = {y_1, . . . , y_N}, where n
denotes the index of samples, and x_n and y_n have a and
b dimensions respectively. Given the ground
truth labels L = {l_1, . . . , l_N}, where l_n ∈ {0, 1} or
multiple classes, we aim to train a multi-modal learning
model to map both X and Y into the same categorical
set of L.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. AdaBoost with heterogeneous weak learners</title>
        <sec id="sec-2-2-1">
          <p>In terms of the additive model in [9], the weak classifier
h(x) minimizes the classification error under the
distribution D_t over the training data. Its classification error rate
should be less than 0.5 under D_t. It can be noted that if
every h(x) satisfies this requirement, the resulting final
strong classifier H(x) still satisfies the error bound in [10].
Moreover, under the assumption that the error rate (i.e.,
the loss function) is convex, it is possible to prove through
Jensen's inequality that AdaBoost outperforms its
individual learners. Note that this holds regardless of where the
individual learners come from. These results imply that
the choice of homogeneous or heterogeneous weak
classifiers does not influence the performance of AdaBoost.
Our numerical experiments in Section 5.2.1 verify this
assertion.</p>
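          <p>As a minimal illustration of this weak-learnability condition, the following numpy sketch (ours, not the paper's code) carries out one round of discrete AdaBoost: any learner whose weighted error under D_t is below 0.5 earns a positive weight, wherever that learner comes from.</p>
          <preformat>
```python
import numpy as np

# Minimal numpy sketch (ours, not the paper's code) of one round of
# discrete AdaBoost: a weak learner is acceptable whenever its weighted
# error under D_t is below 0.5, regardless of its hypothesis class.
def adaboost_round(y_true, y_pred, w):
    """Update the sample distribution D_t for one weak learner.

    y_true, y_pred: arrays with labels in {-1, +1}; w: current distribution.
    Returns (alpha, updated distribution)."""
    eps = np.sum(w[y_pred != y_true])          # weighted error under D_t
    eps = max(float(eps), 1e-12)               # guard the log below
    assert 1.0 - 2.0 * eps > 0.0, "weak-learnability (error below 0.5) violated"
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # learner weight
    w = w * np.exp(-alpha * y_true * y_pred)   # up-weight mistakes
    return alpha, w / w.sum()                  # renormalise to a distribution

y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, -1, 1])            # one mistake at index 1
w0 = np.full(5, 0.2)
alpha, w1 = adaboost_round(y, pred, w0)        # alpha = 0.5 * ln(4)
```
          </preformat>
          <p>On the toy arrays, the misclassified sample's weight grows from 0.2 to 0.5 while the others shrink, which is the re-weighting behind the error bound cited above.</p>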
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Multi-modal learning based on AdaBoost</title>
        <sec id="sec-2-4-1">
          <p>The basic idea is that the different modalities X and Y
are bundled with weak learners together and are viewed
as heterogeneous learners. The sample set S and the
label set L are employed as the training dataset. The
proposed multi-modal learning model is implemented
based on AdaBoost as shown in Figure 1.</p>
          <p>The weak classifiers may be either homogeneous or
heterogeneous, which suits the scenario in which each
modality's data has its own individual classifiers. Moreover,
each sample in S may be a collection of multi-class data.
Under the AdaBoost scheme, we update the rule of the
sample distribution D_t over S.</p>
          <p>Note that different modalities may share the same
classifier and their combinations are still regarded as
independent heterogeneous learners. This can maximally
generalize weak learners.</p>
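          <p>One way to read the scheme in Figure 1 is sketched below (our illustrative reconstruction, not the authors' released code): each bundle of modality features and classifier is a candidate weak learner, and every boosting round keeps the bundle with the lowest weighted error. The synthetic modalities X_a, X_b and all parameters are hypothetical.</p>
          <preformat>
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch (our reconstruction, not the authors' released code):
# each modality is bundled with its own classifier, and the bundles act as
# heterogeneous weak learners inside a discrete AdaBoost loop. X_a, X_b and
# all parameters below are hypothetical synthetic stand-ins for X and Y.
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, n)
y = 2 * labels - 1                                   # labels in {-1, +1}
X_a = labels[:, None] + rng.normal(0, 1.0, (n, 4))   # modality X, a = 4 dims
X_b = labels[:, None] + rng.normal(0, 1.5, (n, 6))   # modality Y, b = 6 dims

# candidate weak learners: (modality features, classifier) bundles
bundles = [(X_a, DecisionTreeClassifier(max_depth=1)),
           (X_b, DecisionTreeClassifier(max_depth=1)),
           (X_a, GaussianNB()),
           (X_b, GaussianNB())]

w = np.full(n, 1.0 / n)              # sample distribution D_t
F = np.zeros(n)                      # additive ensemble score
for t in range(10):
    errs, preds = [], []
    for feats, clf in bundles:       # fit every bundle under D_t
        p = clf.fit(feats, y, sample_weight=w).predict(feats)
        preds.append(p)
        errs.append(w[p != y].sum()) # weighted error of this bundle
    best = int(np.argmin(errs))      # keep the strongest bundle this round
    eps = max(errs[best], 1e-12)
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    F += alpha * preds[best]
    w *= np.exp(-alpha * y * preds[best])
    w /= w.sum()                     # renormalise D_t

train_err = np.mean(np.sign(F) != y)
```
          </preformat>
          <p>Because the pool mixes classifiers and modalities, the loop naturally realizes the heterogeneous structure described above; restricting the pool to one bundle recovers the homogeneous case.</p>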
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Generalization metric</title>
      <p>Indeed, the generalization error of an AdaBoost
algorithm is influenced by the diversity of its individual
learners. This relationship is elucidated through the
error-ambiguity decomposition method introduced in [11]. It
is sensible to consider the diversity of a classifier as a
representation of its generalization capabilities. In other
words, a more robustly generalized classifier exhibits
greater diversity, leading to improved performance
metrics, such as a lower error rate.</p>
      <p>However, a significant challenge in this context is the
absence of a well-defined diversity measurement. While
it is intuitive to link diversity to better performance, there
is currently a lack of standardized and quantifiable
metrics to precisely evaluate and compare the diversity of
classifiers. Addressing this gap could potentially enhance
our understanding of how diversity impacts
generalization and lead to further improvements in ensemble
learning algorithms like AdaBoost.</p>
      <p>Kappa statistic: To measure AdaBoost diversity, we apply
the Kappa statistic to measure the pairwise
similarity/dissimilarity between two learners, and then average
all the pairwise measurements for the overall diversity.
This can be simply described in a binary classification
application. We have the following contingency table for
two learners h_i and h_j, where a + b + c + d = m are
non-negative variables showing the numbers of examples
satisfying the conditions specified by the corresponding
rows and columns:</p>
      <preformat>
                 h_j = 0    h_j = 1
      h_i = 0       a          b
      h_i = 1       c          d
      </preformat>
      <p>The Kappa statistic is then

κ = (Θ1 − Θ2) / (1 − Θ2),    (1)

where Θ1 = (a + d)/m is the observed agreement of the
two learners and Θ2 = ((a + b)(a + c) + (c + d)(b + d))/m²
is the agreement expected by chance.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments and analysis</title>
      <sec id="sec-3-1">
        <title>5.1. Data Collection</title>
        <p>5.1.1. Synthetic dataset</p>
        <p>To evaluate the performance of AdaBoost, we conducted
experiments using homogeneous weak learners and
heterogeneous weak learners respectively. For this purpose,
we generated a synthetic dataset consisting of 1000
samples, 10 features, and 2 classes using the Gaussian
function with zero mean and variance of 1.</p>
        <p>5.1.2. CIFAR-10 dataset</p>
        <p>The CIFAR-10 dataset [12] is a widely-used benchmark
for image classification. It comprises 60,000 color images
of size 32x32, distributed across 10 classes with 6,000
images per class. The dataset exhibits diverse and relatively
low-resolution images. To simulate a multi-modal
learning scenario, we extract three types of feature
representations: color-based features (HSV histogram),
shape-based features (Histogram of Oriented Gradients), and
texture-based features (Gabor filter). However, considering
that the original AdaBoost algorithm was designed for
binary classification, we selected two classes from the
CIFAR-10 dataset for experiments.</p>
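        <p>The paper does not include feature extraction code; the following numpy-only sketch suggests what the three CIFAR-10 feature types could look like (a real pipeline would more likely use scikit-image or OpenCV, and every bin count and filter parameter here is our assumption).</p>
        <preformat>
```python
import numpy as np

# Hypothetical numpy-only sketches of the three CIFAR-10 feature types
# (color, shape, texture); all parameters are our assumptions, not the
# paper's. `img` is an RGB array of shape (32, 32, 3) with values in [0, 1].
def hsv_hist(img, bins=8):
    """Color feature: normalised histogram over the hue channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mx, mn = img.max(-1), img.min(-1)
    diff = np.where(mx == mn, 1.0, mx - mn)       # avoid divide-by-zero
    hue = np.select([mx == r, mx == g],
                    [(g - b) / diff % 6.0, (b - r) / diff + 2.0],
                    (r - g) / diff + 4.0) / 6.0   # hue in [0, 1)
    h, _ = np.histogram(hue, bins=bins, range=(0, 1))
    return h / h.sum()

def hog_feat(img, bins=9):
    """Simplified shape feature: global histogram of gradient orientations."""
    gray = img.mean(-1)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return h / (h.sum() + 1e-12)

def gabor_feat(img, freq=0.25, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Texture feature: mean response magnitude of Gabor filters (via FFT)."""
    gray = img.mean(-1)
    yy, xx = np.mgrid[-7:8, -7:8]                 # 15x15 kernel grid
    feats = []
    for th in thetas:
        rot = xx * np.cos(th) + yy * np.sin(th)
        kern = np.exp(-(xx**2 + yy**2) / 18.0) * np.cos(2 * np.pi * freq * rot)
        resp = np.fft.irfft2(np.fft.rfft2(gray) * np.fft.rfft2(kern, gray.shape))
        feats.append(np.abs(resp).mean())
    return np.array(feats)

img = np.random.default_rng(1).random((32, 32, 3))
x = np.concatenate([hsv_hist(img), hog_feat(img), gabor_feat(img)])
```
        </preformat>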
        <sec id="sec-3-1-3">
          <p>5.1.3. Million Song Dataset</p>
          <p>We also design experiments based on AdaBoost for music
emotion recognition with the Million Song Dataset [13],
which refers to recognizing and classifying emotions in
music using multiple modalities (such as audio and
lyrics). We chose two different emotion categories as labels
based on the quadrant distribution in Russell's emotion
model, i.e., positive and negative [14]. We extract the
lyrics features from the MusiXmatch dataset derived from
the Million Song Dataset and a series of emotionally
representative acoustic features (i.e., Tempo, Beats, Harmonic,
Percussive, Root Mean Square, Zero Crossing Rate, Onset
Frames, Chroma short-time Fourier transform, Chroma
Energy Normalized, Chroma Constant-Q chromagram,
Mel-spectrogram, MFCC, Poly, Tonnetz, Spectral
bandwidth, Spectral roll-off, Spectral contrast, Spectral
centroid) with the librosa Python library [15].</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Results and analysis</title>
        <p>5.2.1. Experiment 1: Comparison of AdaBoost
with homogeneous and heterogeneous weak
learners</p>
        <sec id="sec-3-2-1">
          <p>We first performed AdaBoost on the synthetic dataset
and applied three weak learners in both homogeneous
and heterogeneous scenarios: Decision Tree (DT), Naive
Bayes (NB), and Perceptron (Per). The results are shown
in Tables 1, 2 and Figure 2. It can be noted that the choice
of homogeneous or heterogeneous weak learners does not
affect the AdaBoost performance.</p>
          <p>We further performed AdaBoost on CIFAR-10 with
homogeneous and heterogeneous weak learners
respectively. The results are shown in Tables 3, 4 and Figure
3. It can be noted that (1) the AdaBoost performance is
not influenced by homogeneous or heterogeneous weak
learners; (2) the architecture of heterogeneous weak
learners usually does not improve the AdaBoost performance.
This is reasonable since different weak learners in the
heterogeneous architecture have their individual
performances. This finally results in a trade-off between the
performances of different weak learners rather than synergy.</p>
        </sec>
        <sec id="sec-3-2-2">
          <p>We first performed AdaBoost with the homogeneous
weak learners (Decision Tree, Naive Bayes) on each
unimodal feature in the CIFAR-10 and the Million Song
Dataset, respectively. The results are shown in Tables 5, 6
and Figure 4.</p>
          <p>We further applied AdaBoost with homogeneous weak
learners (DT, NB) on the multi-modal dataset. To mock
multi-modal learning, we chose 4 combinations of the
features (HSV, Gabor, HOG) as multi-modal data. For the
music emotion recognition, there are two kinds of real
modality data available. The results are shown in Tables
7, 8. The performance of bundling the features with their
individual classifiers (i.e., HSV+DT, Gabor+DT,
HOG+SGD) on the CIFAR-10 as the weak learners in Table 9
is only comparable with that of multi-modal learning
with the single classifier of DT in Table 7. This is
acceptable since these three combinations in Table 9 may have
different performances.</p>
          <p>Experiment 1 justifies that the final result is a trade-off
between the performances of different weak learners
rather than a synergy.</p>
          <p>5.2.3. Experiment 3: AdaBoost based MLs'
diversities</p>
          <p>In the proposed multi-modal learning model (refer to Fig.
1), the weak learner can exhibit different compositions,
which can be categorized into the following types:</p>
          <p>1) The same classifier with different features, resulting
in multiple distinct weak learners.</p>
          <p>2) The same feature with different classifiers, leading
to multiple diverse weak learners.</p>
          <p>3) Different features with their individual classifiers,
yielding multiple weak learners.</p>
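          <p>The pairwise Kappa statistic of Eq. (1) used in these comparisons can be computed from two learners' prediction vectors as sketched below (our code, using the standard contingency-table definitions of Θ1 and Θ2; averaging over all learner pairs is left to the caller).</p>
          <preformat>
```python
import numpy as np

# Sketch of the pairwise Kappa of Eq. (1) from two binary prediction
# vectors, with the standard contingency-table definitions of Theta_1
# (observed agreement) and Theta_2 (chance agreement).
def pairwise_kappa(pred_i, pred_j):
    """Kappa between two prediction vectors with labels in {0, 1}.

    Undefined (division by zero) if both learners always emit one class."""
    pred_i = np.asarray(pred_i, dtype=int)
    pred_j = np.asarray(pred_j, dtype=int)
    m = pred_i.size
    # contingency cells: a = both predict 0, b = (0,1), c = (1,0), d = both 1
    a, b, c, d = np.bincount(2 * pred_i + pred_j, minlength=4)
    theta1 = (a + d) / m                                      # Theta_1
    theta2 = ((a + b) * (a + c) + (c + d) * (b + d)) / m**2   # Theta_2
    return (theta1 - theta2) / (1.0 - theta2)

p = np.array([0, 1, 1, 0, 1, 0])
assert pairwise_kappa(p, p) == 1.0      # identical learners: no diversity
```
          </preformat>
          <p>Identical learners give κ = 1 and perfectly disagreeing learners give κ = −1, so lower average κ indicates a more diverse ensemble.</p>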
          <p>To compare the structures of homogeneous and
heterogeneous weak learners, each weak learner is first used
in the AdaBoost homogeneous structure. Subsequently,
these weak learners are incorporated into the AdaBoost
heterogeneous structure. In each AdaBoost iteration, we
calculate the pairwise Kappa statistics of weak learners
originating from the AdaBoost and their average error
rates, which are then represented in a scatter plot. Herein
the origin (0,0) denotes error rate = 0 and Kappa = 0,
which is the ideal point. Figure 7 illustrates the results for
composition 2, while Figure 8 shows those for composition
1. Overall, the heterogeneous structure broadly
encompasses the results obtained from the homogeneous
structures. Figure 9 displays the results for composition 3. In
the homogeneous structure tests, we experimented with
various combinations of features and classifiers for weak
learner design, selecting 2 or 3 learners with satisfactory
performance for the heterogeneous structure test.</p>
          <p>Figure 8: Same classifier + Multiple features. CIFAR-10
dataset (above) and multi-modal music emotion recognition
(below). Homogeneous weak learners (left), heterogeneous
weak learners (right).</p>
          <p>Figure 9: Multiple features with individual classifiers.
CIFAR-10 dataset (above) and multi-modal music emotion
recognition (below). Homogeneous tests (left),
Heterogeneous tests (right).</p>
          <p>It is noteworthy that the selected weak learners,
composed of features and their individual classifiers,
exhibited good performance in the homogeneous structure
tests. Consequently, the heterogeneous structure
demonstrated improved performance compared to the results
of the homogeneous tests, such as the error rate in Figure
9. However, the extent of improvement was not
significant, suggesting that the overall outcome represents a
trade-off between the performances of different weak
learners rather than a clear synergy. The heterogeneous
structure did not lead to a distinct and prominent change
in performance.</p>
          <p>6. Conclusion</p>
          <p>In this paper, we conducted experiments and analysis to
explore AdaBoost-based multi-modal learning methods.
Our findings lead to the following conclusions:</p>
          <p>(1) The architecture of homogeneous or heterogeneous
weak learners does not significantly impact the
performance of AdaBoost.</p>
          <p>(2) In the architecture of heterogeneous weak learners,
each weak learner contributes individual performance,
and the ensemble learning result is a trade-off among the
performances of different weak learners rather than a
synergistic effect.</p>
          <p>(3) In multi-modal learning, each modality possesses
its own classifiers. To fully maximize the potential of
multi-modalities, it is preferable to bundle the modalities
with their individual classifiers as independent weak
learners for ensemble learning. However, neither
homogeneous nor heterogeneous architectures bring about a
distinct change.</p>
          <p>In future research, we plan to apply AdaBoost-based
multi-modal learning to address various challenges in the
field, such as representation, alignment, explainability,
and more. This will further demonstrate the potential and
effectiveness of AdaBoost in the context of multi-modal
learning.</p>
          <p>[1] Z.-H. Zhou, Large margin distribution learning, in:
Artificial Neural Networks in Pattern Recognition:
6th IAPR TC 3 International Workshop, ANNPR
2014, Montreal, QC, Canada, October 6-8, 2014,
Proceedings 6, Springer, 2014, pp. 1–11.</p>
          <p>[2] T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal
machine learning: A survey and taxonomy, IEEE
Transactions on Pattern Analysis and Machine
Intelligence 41 (2018) 423–443.</p>
          <p>[3] D. Wang, P. Cui, M. Ou, W. Zhu, Deep multimodal
hashing with orthogonal regularization, in:
Twenty-Fourth International Joint Conference on Artificial
Intelligence, 2015.</p>
          <p>[4] A. Kojima, T. Tamura, K. Fukunaga, Natural
language description of human activities from video
images based on concept hierarchy of actions,
International Journal of Computer Vision 50 (2002)
171–184.</p>
          <p>[5] G. Trigeorgis, M. A. Nicolaou, S. Zafeiriou, B. W.
Schuller, Deep canonical time warping, in:
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 5110–5118.</p>
          <p>[6] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller,
G. Rigoll, LSTM-modeling of continuous emotions in
an audiovisual affect recognition framework, Image
and Vision Computing 31 (2013) 153–163.</p>
          <p>[7] M. Mancini, S. R. Bulo, B. Caputo, E. Ricci, Best
sources forward: domain generalization through
source-specific nets, in: 2018 25th IEEE International
Conference on Image Processing (ICIP), IEEE, 2018,
pp. 1353–1357.</p>
          <p>[8] K. Zhou, Y. Yang, Y. Qiao, T. Xiang, Domain
adaptive ensemble learning, IEEE Transactions on Image
Processing 30 (2021) 8008–8018.</p>
          <p>[9] J. Friedman, T. Hastie, R. Tibshirani, Additive
logistic regression: a statistical view of boosting (with
discussion and a rejoinder by the authors), The
Annals of Statistics 28 (2000) 337–407.</p>
          <p>[10] P. Bartlett, Y. Freund, W. S. Lee, R. E. Schapire,
Boosting the margin: A new explanation for the
effectiveness of voting methods, The Annals of
Statistics 26 (1998) 1651–1686.</p>
          <p>[11] A. Krogh, J. Vedelsby, Neural network ensembles,
cross validation, and active learning, Advances in
Neural Information Processing Systems 7 (1994).</p>
          <p>[12] H. Li, H. Liu, X. Ji, G. Li, L. Shi, Cifar10-dvs: an
event-stream dataset for object classification,
Frontiers in Neuroscience 11 (2017) 309.</p>
          <p>[13] T. Bertin-Mahieux, D. P. Ellis, B. Whitman,
P. Lamere, The million song dataset (2011).</p>
          <p>[14] J. A. Russell, A circumplex model of affect, Journal
of Personality and Social Psychology 39 (1980) 1161.</p>
          <p>[15] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar,</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>