Ensemble Learning for Multi-type Classification
in Heterogeneous Networks (Discussion Paper)

               Francesco Serafino, Gianvito Pio, and Michelangelo Ceci

 University of Bari, Dept. of Computer Science, Via Orabona, 4, 70125 Bari (Italy)

     SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the author(s).



         Abstract. In the literature, several methods have been proposed for
         the analysis of network data, but they usually focus on homogeneous
         networks. More recently, the complexity of real scenarios has impelled
         researchers to design classification and clustering methods able to work
         also on heterogeneous networks, which consist of different types of ob-
         jects and links. However, they often make assumptions on the structure
         of the network that are too restrictive or do not exploit different forms of
         network correlation and autocorrelation. Moreover, when several nodes
         of the network have missing values, standard methods can lead either to
         building incomplete classification models or to discarding relevant de-
         pendencies (correlation or autocorrelation). In this discussion paper, we
         describe an ensemble learning approach for multi-type classification that
         we proposed recently, which is able to exploit i) the possible presence
         of correlation and autocorrelation phenomena, and ii) the classification
         of instances (with missing values) of other node types in the network.
         Experiments performed on real-world datasets show that the proposed
         method is able to significantly outperform state-of-the-art algorithms.

1     Introduction
In the real world we can easily find objects which appear to be connected to
each other, thus forming complex networks. Connections among these objects
can represent different types of relationships, which can be found in several fields,
including biology, epidemiology, geography, finance, etc. Most of the works in the
literature about mining networked data focus on homogeneous networks, where
all the objects are of the same type (and can, accordingly, be represented by a
predefined set of features/characteristics) and the links among them describe a
single type of relationship. A common example of a homogeneous network is that
of social networks: objects represent people and links represent friendship
relationships. However, real scenarios are more complex due to the presence of
multiple types of objects that are connected through different types of links,
forming heterogeneous networks. For example, in well-known databases about
movies (e.g., IMDb) we have movies, actors, users, tags, etc. In the bio-medical
domain, databases contain genes, proteins, tissues, pathways and diseases. In all
these cases, objects of different types establish different types of relationships.
    Consequently, recent works have proposed new data mining methods that
work on heterogeneous information networks. For example, in [15] and [16] the
authors propose new clustering solutions, whereas in [6] and [7] the authors pro-
pose classification/prediction methods. Other methods that were initially pro-
posed for relational data can be almost directly applied for the analysis of het-
erogeneous information networks [9]. However, existing methods suffer from one
or more of the following limitations: i) they impose strict restrictions on the
structure of the network, which must be known a priori; ii) they are not able to
take into account (and possibly exploit) one of the main peculiarities of network
data, i.e., the presence of different forms of autocorrelation [1, 14], according to
which two connected nodes in the network tend to share some properties; iii)
they are not able to consider the possible presence of missing values for some
attributes, which can lead either to learning incomplete classification models or
to discarding possibly relevant dependencies; iv) they are not able to classify
objects of different types, where each type can have a different set of labels.
    In this discussion paper, we describe the approach proposed in [13], which
is able to work on heterogeneous networks with arbitrary structures and is able
to capture both correlation and autocorrelation phenomena which involve the
target objects (i.e., objects which are the main subject of the classification task).
Moreover, we exploit the same strategy to predict possibly relevant missing val-
ues belonging to other objects that appear strongly related to the target objects.
Methodologically, we extend the method Mr-SBC [2], in order to also capture
network autocorrelation phenomena, handle relevant missing values and per-
form multi-type classification. These last three issues are tackled by resorting to
a combined bagging-boosting ensemble learning solution able to exploit informa-
tion conveyed by objects (also of the same type) directly or indirectly connected
to the main subject(s) of the classification task. Specifically, we propose two ex-
tensions of the Mr-SBC algorithm: i) ST-MrSBC (Self-Training MrSBC),
which is able to capture possible autocorrelation phenomena by resorting to a
variant of the self-training method; ii) MT-MrSBC (Multi-Type MrSBC),
which iteratively analyzes objects of multiple types, in order to predict possible
missing values, belonging either to the target type of the main classification task
or to other object types that are strongly related to the main classification task.
We consider the network classification task according to the within-network set-
ting [4]: objects for which the class is known are linked to objects for which
the class must be estimated [10] (which can be either the subject of the main
classification task or other objects related to the main classification task). This
semi-supervised setting, which allows the classification phase to take advantage
of both labeled and unlabeled examples, leads to smoother predictions [3]. In
multi-type classification, we add an additional mechanism to smooth the pre-
diction function: capturing relationships among objects of different types, i.e.,
capturing correlations among labels of objects of different types.


2   Problem Statement and Background

Before describing the proposed method, in the following we introduce the no-
tation used. We work on heterogeneous networks, which we formally define as
$G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges among nodes.
Both nodes and edges can be of different types. Moreover:
• Each node type $T_p$ implicitly defines a subset of nodes $V_p \subseteq V$.
• Each node $v' \in V$ is associated with a node type $t_v(v') \in T$, where $T$ is the
finite set $\{T_p\}$ of all the possible types of nodes in the network.
• A node type $T_p$ defines a set of attributes $X_p = \{X_{p,1}, X_{p,2}, \ldots, X_{p,m_p}\}$.
• An edge type $R_j$ defines a subset of edges $E_j \subseteq (V_p \times V_q) \subseteq E$, where $V_p$ and
$V_q$ are not necessarily based on different types.
• An edge $e$ between two nodes $v'$ and $v''$ is associated with an edge type $R_j \in R$,
where $R$ is the finite set $\{R_j\}$ of possible edge types in the network.
    In the considered task, we define a role for each node type: $T_t$ (primary
targets), which are considered as the targets of the main classification task; $T_{st}$
(secondary targets), which are strongly related to the main classification task and
for which the prediction of missing values is considered relevant; $T_{tr}$ (task-relevant),
which are the remaining node types. An example is reported in Figure 1. Only
nodes of target (primary and secondary) types are actually classified, on the basis
of all the nodes. However, we are interested in maximizing the prediction accuracy
only for objects of the primary target types.

Fig. 1. An example of a heterogeneous network. The shape indicates the type of nodes:
triangles and circles are nodes of target types (primary or secondary), and can be
labeled (green) or unlabeled (white). Gray nodes belong to task-relevant types.
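
To make the notation concrete, the following is a minimal, hypothetical Python
sketch of how such a typed network could be represented in memory; all names
(Role, Node, HeterogeneousNetwork) are our own illustrative choices and do not
come from the implementation of [13].

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Dict, List, Optional, Tuple

    class Role(Enum):
        PRIMARY_TARGET = "Tt"     # main subjects of the classification task
        SECONDARY_TARGET = "Tst"  # types whose missing labels are worth predicting
        TASK_RELEVANT = "Ttr"     # all the remaining node types

    @dataclass
    class Node:
        node_id: int
        node_type: str                 # t_v(v') in T
        attributes: Dict[str, object]  # values of X_p = {X_{p,1}, ..., X_{p,m_p}}
        label: Optional[str] = None    # None means the node is unlabeled

    @dataclass
    class HeterogeneousNetwork:
        nodes: Dict[int, Node] = field(default_factory=dict)
        # edges grouped by edge type R_j, with E_j a subset of V_p x V_q
        edges: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
        roles: Dict[str, Role] = field(default_factory=dict)  # node type -> role

        def nodes_of_type(self, node_type: str) -> List[Node]:
            return [v for v in self.nodes.values() if v.node_type == node_type]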


3   The proposed ensemble learning method
In this section we describe the two solutions ST-MrSBC and MT-MrSBC. Both
take as input a partially labeled heterogeneous network and work iteratively. At
each iteration, they build an ensemble of Mr-SBC [2] classifiers from different
subsets of labeled nodes (either known or predicted in the previous iterations);
the combination of their outputs can possibly lead to a stronger model.
    Following the idea in [12] (for multi-label classification tasks), MT-MrSBC
shuffles all the target types. This is motivated by the fact that a predefined or-
dering of the analysis of target types can negatively affect the classification accu-
racy, since a wrong decision can inhibit the exploitation of relevant dependencies
and, consequently, can possibly enforce the exploitation of irrelevant/wrong de-
pendencies. In the literature, this phenomenon is also observed in random-scan
Gibbs sampling, as opposed to systematic-scan Gibbs sampling [8].
           Fig. 2. High-level description of ST-MrSBC and MT-MrSBC.

    Once the order of analysis has been defined, MT-MrSBC samples a subset
of nodes of the first target type and builds a weak (since built from a subset
of instances) predictive model through Mr-SBC. This predictive model is ap-
plied to classify unlabeled nodes which are then added to the labeled network.
The obtained probabilities are also stored for the final combination of the out-
puts. Then, we select a new target type and repeat the process. Note that, at
this stage, predictions performed for the previous target types are available to
Mr-SBC when building a prediction model for the new target. Target types are
shuffled again every time they are all processed. The number of iterations is lim-
ited by a user-defined threshold and the outputs obtained for each target type
over all the iterations are combined to obtain the final (strong) predictive model.
ST-MrSBC uses a similar process, but works on a single target type. Therefore, it
exploits predictions obtained on the same target type in the previous iterations.
The main difference with respect to the standard self-training is in the construc-
tion of the labeled network when a new iteration starts. Specifically, ST-MrSBC
builds an ensemble of weak classifiers from different subsets of labeled nodes,
instead of considering only the output of the last iteration.
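
The following is a compact, hypothetical sketch of this iterative scheme, reusing
the network representation sketched in Section 2 and assuming it is extended
with labeled_nodes, unlabeled_nodes and set_label helpers; train_weak and
classify stand in for the Mr-SBC base learner [2]. ST-MrSBC corresponds to
calling it with a single target type.

    import random
    from collections import defaultdict

    def mt_mrsbc(network, target_types, z, perc, train_weak, classify):
        # stored posteriors: (target type, node id) -> one distribution per iteration
        stored = defaultdict(list)
        for _ in range(z):
            random.shuffle(target_types)  # new random order once all types are processed
            for t in target_types:
                labeled = network.labeled_nodes(t)  # includes labels predicted earlier
                sample = random.sample(labeled, max(1, int(perc * len(labeled))))
                model = train_weak(sample, network)  # weak classifier (subset of instances)
                for node_id, probs in classify(model, network.unlabeled_nodes(t)):
                    stored[(t, node_id)].append(probs)  # kept for the final combination
                    # soft self-labeling: feeds later iterations and can be retracted
                    network.set_label(node_id, max(probs, key=probs.get))
        return stored  # combined at the end by averaging, see Eq. (2)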
Construction of labeled and unlabeled networks
At each iteration, we select the target attribute of the next target type and build
two separate networks: the first contains only labeled nodes, and the second con-
tains only unlabeled nodes. These networks are built by considering the complete
network and by removing unlabeled and labeled nodes, respectively. This means
that all the nodes of other target types, as well as nodes of task-relevant types,
are included in both networks. While the network of labeled nodes, for each tar-
get type, changes at different iterations, the network of unlabeled nodes remains
fixed, so the algorithm classifies the same nodes several times.
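
Under the same hypothetical network interface as above (copy and remove_node
are assumed helpers), this construction could be sketched as follows.

    def split_networks(network, target_type):
        # Both copies keep all nodes of the other target types and of the
        # task-relevant types; only nodes of `target_type` are filtered by label.
        labeled, unlabeled = network.copy(), network.copy()
        for v in network.nodes_of_type(target_type):
            if v.label is None:
                labeled.remove_node(v.node_id)    # drop unlabeled target nodes
            else:
                unlabeled.remove_node(v.node_id)  # drop labeled target nodes
        return labeled, unlabeled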
    Training instances are selected randomly at each iteration: the algorithm
selects, according to a uniform distribution, a given percentage perc of labeled
nodes from the labeled network. Note that,
at a given iteration, the labeled network could contain also nodes that were ini-
tially unlabeled but that were classified during a previous iteration. In the case
of ST-MrSBC, the last available predictions will regard the same target type,
whereas in the case of MT-MrSBC, the last available predictions will regard all
the target types (primary and secondary). This behavior is coherent with the
assumption that predictions performed on other target types (primary or sec-
ondary) can help in predicting the labels of nodes of the current target
type. This implies that, at the next iteration, the algorithm will take into account
some labels predicted at the previous iterations. This choice is, apparently, in
contrast with most of the self-training approaches, which usually select the most
reliable predictions for the next iterations. However, 1) this solution does not
necessarily lead to better results with respect to a random selection [5], and 2)
we do not use predictions as “hard” constraints, but in the next iterations we are
able to retract decisions that are not coherent with the new state of the network.
Estimation of probabilities
At the end of each iteration, we build a classification model through Mr-SBC for
the current target type from the network of labeled nodes. Recall that Mr-
SBC is a naïve Bayes classifier which relies on a set of first-order rules induced
from data stored in the tables of a relational database (see [2] for details).
    For each unlabeled node $v'$, the identified classification model is exploited
to compute the posterior probability (which takes into account both network
correlation and network autocorrelation) and its label $\psi_t(v')$, according to the
following equation based on Bayes' theorem:

$$\psi_t(v') = \arg\max_{Y_c} P(Y_c \mid R_{v'}) = \arg\max_c P(Y_c) \cdot P(R_{v'} \mid Y_c) / P(R_{v'}), \qquad (1)$$

where $Y_c$ is a possible class value of the target attribute $y$, and $R_{v'} \subseteq R_D$ is the
subset of first-order rules, identified by Mr-SBC, covering the object $v'$.
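
As an illustration, the computation of Equation (1) can be sketched as below; the
probability estimates (class priors, per-rule likelihoods and evidence) would come
from the rule statistics of Mr-SBC [2], and are represented here as plain dictionaries.

    from math import prod

    def posterior(prior, rule_likelihoods, evidence):
        # P(Y_c | R_v') = P(Y_c) * P(R_v' | Y_c) / P(R_v'), where P(R_v' | Y_c)
        # factorizes over the covering rules under the naive Bayes assumption
        return {c: prior[c] * prod(rule_likelihoods[c]) / evidence for c in prior}

    def classify_node(prior, rule_likelihoods, evidence):
        probs = posterior(prior, rule_likelihoods, evidence)
        return max(probs, key=probs.get)  # psi_t(v') = argmax_c P(Y_c | R_v')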
    The labeled nodes are then added to the labeled network for the subsequent
iterations, as described before. The probabilities are exploited at the end of the
entire process. In fact, after the last iteration, the final classifier combines the
probabilities computed during all the iterations: for each target type Tt and for
each node v 0 , the method computes the final probability as the average of the
probabilities computed over the ensemble in the following way:
$$\psi_t(v') = \arg\max_c \frac{1}{z} \sum_{k=1}^{z} P(Y_c \mid R_{v'}^{k}) = \arg\max_c \frac{1}{z} \sum_{k=1}^{z} \frac{P(Y_c)\, P(R_{v'}^{k} \mid Y_c)}{P(R_{v'}^{k})}, \qquad (2)$$

where $z$ is the number of iterations, i.e., the number of classifiers in the ensemble,
for each target type (the total number of iterations is $z \cdot |L_t|$) and $R_{v'}^{k}$ is the set
of rules identified from the labeled network at the $k$-th iteration. The rationale
behind the combination in Equation (2) is twofold: a) the predictions obtained at
each iteration are based on different training sets, which may focus on different
properties of the concept to be learned; b) the predictions are made more stable.
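
A direct rendering of this combination step, assuming the per-iteration posteriors
are stored as dictionaries as in the loop sketch above:

    def combine(stored_probs):
        # stored_probs: list of z dicts {class value: P(Y_c | R_v'^k)}, one per iteration
        z = len(stored_probs)
        avg = {c: sum(p[c] for p in stored_probs) / z for c in stored_probs[0]}
        return max(avg, key=avg.get)  # final label psi_t(v'), Eq. (2)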
               Dataset      #Nodes      #Edges   Primary targets   Secondary targets
               MOVIE         2,051      59,532   movies            users
               NBA          36,593      63,924   teams             players
               YELP      1,387,596   1,371,060   business, users   N/A
               IMDB        212,039     342,161   movies            users
               STACK        92,800     114,385   posts             users

         Table 1. Quantitative information about the considered datasets.



4   Experiments

We performed our experiments on five heterogeneous networks: MOVIE, NBA,
YELP, IMDB and STACK. Some quantitative information about these datasets
can be found in Table 1, while additional information can be found in [13].
In order to evaluate the different variants of the ensemble learning solutions we
propose, we compared them with the original version of Mr-SBC. This allows us
to evaluate the contribution of each aspect of our method, i.e., the iterative na-
ture of the ensemble-based self-training approach (introduced in ST-MrSBC)
and the multi-type classification of both primary and secondary targets (im-
plemented in MT-MrSBC). Moreover, we evaluate the performance of MT-
MrSBC by considering both a lexicographic (LexicographicMT-MrSBC) and
a random ordering (RandomMT-MrSBC) of target types.
    In order to perform a comparison with other systems, we also ran the exper-
iments with four competitor methods. In particular, we considered i) the rela-
tional version of the nearest neighbour algorithm (RelIBK), ii) the SVM-based
algorithm SMO (RelSMO), iii) the algorithm GNetMine [7], which is natively
able to work on heterogeneous networks, and iv) the algorithm HENPC [11],
recently proposed to solve the multi-type classification task in heterogeneous net-
works. However, RelIBK, RelSMO and HENPC were not able to finish within 3
days of execution, while GNetMine was not able to compute the results for the
Yelp dataset, since the system ran out of memory (on a server with 32 GB of RAM).
In these cases, we ran the experiments on a reduced set of nodes (about 1,000
nodes for each target type) of the datasets IMDB, STACK and YELP.
    As regards the parameter setting of ST-MrSBC and MT-MrSBC, we consid-
ered a random sampling of 20% of the nodes for each classifier in the ensemble
and values of the number of iterations per target type, z, ranging from 1 to 50.
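
In terms of the hypothetical loop sketch of Section 3, this setting would correspond
to a call along the lines of the following, with train_mrsbc and classify_mrsbc
being assumed wrappers around the Mr-SBC learner.

    # 20% uniform sampling per weak classifier; z weak classifiers per target type
    stored = mt_mrsbc(network, ["movies", "users"], z=50, perc=0.20,
                      train_weak=train_mrsbc, classify=classify_mrsbc)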
    We collected all the classification accuracy values and performed the Ne-
menyi post-hoc tests on single datasets, on all the datasets and on the reduced
datasets for the comparative evaluation. The results of the statistical tests are
depicted in Figure 3. In particular, we can observe that for the dataset NBA
and for the target type users of the dataset Yelp, the improvement provided
by LexicographicMT-MrSBC and RandomMT-MrSBC over the competitors is
statistically significant at p-value = 0.05. Focusing on LexicographicMT-MrSBC,
we observe two opposite and interesting cases: for the dataset IMDB it pro-
vides the best result, while for the target type Business of the dataset Yelp
it appears to be the worst approach. This instability is again caused by the
static ordering on the target types. Indeed, in the first case, the exploitation
of the dependency movies → users (which is surely strong, if we observe the
result) led LexicographicMT-MrSBC to obtain a better result with respect to
RandomMT-MrSBC, which alternatively (and randomly) exploited the depen-
dencies movies → users and users → movies (which appears weak). On the
contrary, in the second case, the static (unlucky) choice of the dependency to be
exploited brought LexicographicMT-MrSBC to the bottom of the ranking.
    Finally, the results obtained by our comparative evaluation (Figure 3, last
chart) show that the improvement provided by the proposed method gives
Mr-SBC the advantage of outperforming the competitors. Indeed, the original
Mr-SBC is not able to outperform the considered competitors, while the ensemble
learning solution (ST-MrSBC) outperforms all of them (although not statistically
significantly w.r.t. HENPC and RelIBK). The advantage comes from
the combination of capturing label dependencies between multiple types and of
the ensemble learning approach (MT-MrSBC), which improves the accuracy.
    Overall, we can conclude that the application of the proposed method, which
is able to capture both correlation and autocorrelation phenomena, as well as
to predict missing values, by exploiting the same classification method adopted
for primary targets, can lead to better, more stable predictions when applied to
real-world data organized in heterogeneous networks.


Fig. 3. Results of the Nemenyi post-hoc test on the average accuracy (one chart each
for MOVIE movies, IMDB movies, STACK posts, NBA teams, YELP business, YELP
users, all datasets, and the comparative evaluation). Better algorithms are positioned
on the right-hand side, and those that do not significantly differ in performance (at
p-value = 0.05) are connected with a line.
5    Conclusions
In this paper, we proposed an extension of the system Mr-SBC which works on
heterogeneous networks and is able to solve multi-type classification tasks,
by capturing both correlation and autocorrelation phenomena. Experiments per-
formed on real datasets show that both the proposed variants are able to signif-
icantly outperform the original Mr-SBC, especially with a random ordering of
the target types, as well as four other well-known competitor algorithms.
    As future work, we plan to study the sensitivity to the sampling size and the
possibility of selecting only a subset of labeled nodes for the next iteration.
Acknowledgments
We would like to acknowledge the European project MAESTRA - Learning from
Massive, Incompletely annotated, and Structured Data (ICT-2013-612944).
References
 1. P. Angin and J. Neville. A shrinkage approach for modeling non-stationary rela-
    tional autocorrelation. In ICDM, pages 707–712. IEEE Computer Society, 2008.
 2. M. Ceci, A. Appice, and D. Malerba. Mr-SBC: a multi-relational naive Bayes
    classifier. In PKDD 2003, pages 95–106. Springer, 2003.
 3. O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press,
    2nd edition, 2010.
 4. C. Desrosiers and G. Karypis. Within-network classification using local structure
    similarity. In ECML PKDD ’09, pages 260–275, Berlin, 2009. Springer-Verlag.
 5. Y. Guo, X. Niu, and H. Zhang. An extensive empirical study on semi-supervised
    learning. In ICDM, pages 186–195. IEEE, 2010.
 6. M. Ji, J. Han, and M. Danilevsky. Ranking-based classification of heterogeneous
    information networks. In SIGKDD ’11, pages 1298–1306, NY, USA, 2011. ACM.
 7. M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao. Graph regularized transduc-
    tive classification on heterogeneous information networks. In ECML PKDD 2010,
    volume 6321 of LNCS, pages 570–586. Springer Berlin, 2010.
 8. J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Publishing
    Company, Incorporated, 2008.
 9. C. Loglisci, M. Ceci, and D. Malerba. Relational mining for discovering changes
    in evolving networks. Neurocomputing, 150:265–288, 2015.
10. S. A. Macskassy and F. Provost. Classification in networked data: A toolkit and a
    univariate case study. J. Mach. Learn. Res., 8:935–983, May 2007.
11. G. Pio, F. Serafino, D. Malerba, and M. Ceci. Multi-type clustering and classifi-
    cation from heterogeneous networks. Inf. Sci., 425:107–126, 2018.
12. J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label
    classification. Machine Learning, 85(3):333–359, 2011.
13. F. Serafino, G. Pio, and M. Ceci. Ensemble learning for multi-type classification
    in heterogeneous networks. IEEE TKDE, 2018. DOI: 10.1109/TKDE.2018.2822307.
14. D. Stojanova, M. Ceci, A. Appice, and S. Dzeroski. Network regression with pre-
    dictive clustering trees. Data Min. Knowl. Discov., 25(2):378–413, 2012.
15. Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Rankclus: integrating
    clustering with ranking for heterogeneous information network analysis. In EDBT
    ’09, pages 565–576, New York, NY, USA, 2009. ACM.
16. Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information
    networks with star network schema. In ACM SIGKDD, pages 797–806, 2009.