=Paper= {{Paper |id=Vol-2718/paper05 |storemode=property |title=Loss Functions for Clustering in Multi-instance Learning |pdfUrl=https://ceur-ws.org/Vol-2718/paper05.pdf |volume=Vol-2718 |authors=Marek Dědič,Tomáš Pevný,Lukáš Bajer,Martin Holeňa |dblpUrl=https://dblp.org/rec/conf/itat/DedicPBH20 }} ==Loss Functions for Clustering in Multi-instance Learning== https://ceur-ws.org/Vol-2718/paper05.pdf
                    Loss Functions for Clustering in Multi-instance Learning

                                   Marek Dědič1,2 , Tomáš Pevný3 , Lukáš Bajer2 , Martin Holeňa4
1
    Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Trojanova 13, Prague, Czech Republic
                                   2
                                     Cisco Systems, Inc., Karlovo náměstı́ 10, Prague, Czech Republic
        3
           Faculty of Electrical Engineering, Czech Technical University in Prague, Karlovo náměstí 13, Prague, Czech Republic
             4
               Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: Multi-instance learning belongs to one of re-                   their work significantly by deciding about whole clusters
cently fast developing areas of machine learning. It is a su-             of servers instead of each one individually.
pervised learning method and this paper reports research                     A key prerequisite for the application of multi-instance
into its unsupervised counterpart, multi-instance cluster-                learning to clustering is that loss functions for clustering,
ing. Whereas traditional clustering clusters points, multi-               originally proposed for points, are adapted for bags. In this
instance clustering clusters bags, i.e. multisets of points or            paper, such an adaptation is outlined for three sophisticated
of other kinds of objects. The paper focuses on the problem               loss functions: contrastive predictive coding, triplet loss
of loss functions for clustering. Three sophisticated loss                and magnet loss.
functions used for clustering of points, contrastive predic-                 Of the three proposed methods, triplet loss and magnet
tive coding, triplet loss and magnet loss, are elaborated for             loss utilize some labels as part of the training process and
multi-instance clustering. Finally, they are compared on 18               so are not truly unsupervised. As such, the authors expect
benchmark datasets, as well as on a real-world dataset.                   them to outperform the method based on contrastive pre-
                                                                          dictive coding, which on the other hand is the most promis-
                                                                          ing and innovative approach from the theoretical point of
1     Introduction                                                        view.
                                                                             In the next section, basic properties of multi-instance
Multi-instance learning (MIL) belongs to recently fast de-                learning and clustering are briefly introduced. The adapta-
veloping areas of machine learning. Though it is a super-                 tion of the three considered loss functions is explained in
vised learning method, this paper addresses its application               Section 3. A comprehensive comparison of them is then
to unsupervised learning – clustering. Whereas traditional                presented in Section 4.
clustering is one of points, multi-instance clustering clus-
ters multisets of points, also known as bags. Such a group-
ing of the points into bags is considered a property of the               2   Multi-instance Learning and Clustering
problem at hand and therefore sourced from the input data.
   While there have been some previous attempts at multi-                 Multi-instance learning (MIL) was first described by [7].
instance clustering, they use pairwise relations between all              In its original form, it was developed and used for su-
points in all bags, quickly becoming unwieldy and compu-                  pervised learning. Some prior art also exists for unsu-
tationally infeasible. Our work uses multi-instance learn-                pervised learning, such as [5, 31], however, it uses pair-
ing as a general toolkit for learning representations of bags             wise instance distances as a basis for clustering, which
and then clusters the bags using those representations. This              doesn’t properly utilize the inherent structure of the data
builds on previous works by the authors and enables clus-                 and quickly becomes computationally infeasible.
tering of arbitrary data-structures by expressing them as                    The MIL paradigm is a type of representation learning
hierarchies of multi-instance problems and then using the                 on data which has some internal structure. Therefore, it
internal structure to better solve the problems at hand.                  views a sample as a bag (i.e. a multiset) of an arbitrary
   Multi-instance clustering is evaluated in Section 4 in                 number of objects. The basic elements of MIL are sam-
the application domain of computer and network security.                  ples from a space X and their corresponding labels from a
While this is only one of many possible applications of                   space Y (a space of classes). Compared to usual supervised
MIL due to its general expressive power for structured data,              learning, MIL replaces individual instances with bags of
it is the domain of choice for the authors. Here, MIL is                  instances from the space X such that every instance in X
used to represent user activity as a bag of network con-                  belongs to at least one bag from the bag-space B.
nections for each user in a fixed time window. For clus-                     [7] provides an example of a multi-instance problem
tering, this enables detection of compromised users based                 where each bag represents a key chain with some keys (in-
on their complex behaviours. Clustering opens a new win-                  stances). To solve the problem of finding which key opens
dow of opportunity here by e.g. grouping servers with sim-                a particular lock, a “proxy” MIL problem is presented – de-
ilar behaviour together, allowing a human analyst to boost                termining which key chain opens that particular lock. This
     Copyright ©2020 for this paper by its authors. Use permitted under   line of thinking leads to the pivotal definition of the label
Creative Commons License Attribution 4.0 International (CC BY 4.0).       of a bag as being positive iff the bag contains at least one
positive instance. In later works such as [23], this interpre-   The functions fI and fB are realized by a deep neural net-
tation of MIL is abandoned in favor of a more general one.       work using ReLU as its activation function. The aggrega-
An instance is no longer viewed as having a meaning in and       tion g is realized as an element-wise mean or maximum of
of itself, but only in the context of its bag. The notion of     all the vectors.
an instance-level label is dropped because in this interpre-        The structure of the data is highly exploited using the
tation, the bag is the atomic unit of interest. To this end,     embedded-space paradigm. The approach uses multiple
the embedded-space paradigm, described in Section 2.2, is        layers of MIL nested in each other – the instances of a
used.                                                            bag do not necessarily need to be feature vectors, but can
                                                                 be bag in themselves. The HTTP traffic of a particular
2.1   A probabilistic formulation of multi-instance              client is therefore represented as a bag of all second-level
      learning                                                   domains the client has exchanged datagrams with. Each
                                                                 second-level domain is then represented as a bag of indi-
A probabilistic way of describing multi-instance learning        vidual URLs which the client has connected to. Individual
was first introduced in [23] and builds on the previous work     URLs are then split into 3 parts, domain, path and query,
[19].                                                            and each part is represented as a bag of tokens, which can
   Let for the space X exist a measurable space (X , A),         then be broken down even further. In the end, the model
where A is a σ-algebra on X . Let P X denote the set of          consists of 5 nested MIL problems.
all probability measures on (X , A). A bag B is viewed as           A benefit of MIL is also its easy use for explainability
a random sample with replacement of a random variable            and interpretability. The authors of [22] present a way to
governed by a particular probability distribution pB ∈ P X ,     extract indicators of compromise and explain the decision
that is                                                          using the learned MIL model.
  B = {xi |xi ∼ pB , i ∈ {1, . . . , m}} where m ∈ N. (1)
                                                                 2.3   Clustering
2.2   Embedded-space paradigm for solving                        Clustering is a prime example of a problem typically asso-
      multi-instance problems                                    ciated with unsupervised learning. The problem at hand,
                                                                 however, is not one of clustering ordinary number vectors
While it is possible to use several approaches to solv-          in a linear space. Instead, a clustering of objects repre-
ing multi-instance problems, in this work, the embedded-         sented by bags is explored, as that is the problem solved for
space paradigm was used. For an overview of the other            datasets introduced later in Section 4. While [28] present a
paradigms, see [6].                                              clustering of bags using a modified Hausdorff distance on
   In the embedded space paradigm, labels are only defined       bags and [17] present a clustering using maximum mean
on the level of bags. In order for these bag labels to be        discrepancy, a different approach to clustering of bags is
learned, an embedding function of the form φ : B → X̄            used in this work. Among the main reasons for this choice
must be defined, where X̄ is a latent space, which may or        is the prohibitively high computational complexity of the
may not be identical to X . Using this function, each bag        previously mentioned approaches on large datasets and the
can be represented by an object φ (B) ∈ X̄ , which makes it      possibility of utilizing the representations previously intro-
possible to use any off-the-shelf supervised learning algo-      duced in [22].
rithm acting on X̄ . Among the simplest embedding func-             An approach based on the embedded-space paradigm for
tions are, e.g. element-wise minimum, maximum, and               MIL was chosen. In order to utilize the structure of the
mean. A more complicated embedding function may for              data, a MIL model is used to represent each bag in the la-
example apply a neural network to each instance of the bag       tent space X̄ . This presents the issue of how to train the em-
and subsequently pool the instances using one of the afore-      bedding function φ, because its learning in standard MIL
mentioned functions.                                             is supervised.
   Specifically to the experiments presented in Section 4,          The embedding function φ can be actually written as
[22] present an approach to learning to classify HTTP traf-      φ = φ (B, θ) where B ∈ B and θ are the parameters of the
fic by utilizing sets of URLs. Neural networks are used          embedding, typically learned during the training phase. In
to transform both instance-level and bag-level representa-       the context of clustering, these parameters still need to be
tions. The model is as follows: A neural network fI :            learned over some kind of training. This itself presents an-
X1 → X2 is used to transform instances, followed by an           other challenge though – off-the-shelf algorithms typically
aggregation function                                             work in some constant Hilbert space, whereas the latent
                       g : BX2 → X̄1 .                           space needs to change over the learning period. As is the
                                                                 case for MIL itself, end-to-end learning is used. A cluster-
Finally, a second deep neural network fB : X̄1 → X̄2 is          loss function LC : B → R is chosen and used to express
used to transform the representation of each bag. Combin-        the actual loss for the embedding model φ and its parame-
ing these functions gives the embedding function                 ters θ as
            φ (B) = fB (g ({fI (x)|x ∈ B})) .                               L (φ, θ) = LC ({φ (B, θ)|B ∈ B}) .              (2)
If the cluster-loss LC is chosen correctly, minimizing L               high-quality embeddings would need only a small amount
over the learning period will yield a latent space X̄ in which         of seed data points to reach relatively high accuracy. For
the bags are already naturally clustered according to the              this measure, the kNN algorithm was run thrice and the re-
design of the cluster-loss function. Applying any off-the-             sults averaged to compensate for high dependence on the
shelf clustering algorithm on X̄ will then give good results.          particular seed points chosen.
How to choose the cluster-loss function LC is the focus of
Section 3.
                                                                       3     Investigated Loss Functions
2.4   Clustering evaluation metrics                                    In Section 2.4, a method for learning a representation φ was
                                                                       described. In order to learn the parameters of the represen-
In our research, several clustering evaluation metrics have            tation, a clustering-loss function is required in the form
been used. The primary metrics are based on the Silhou-
ette coefficient, the secondary ones on the kNN algorithm.                                 LC : P M X̄ → R.
                                                                                                         
The metrics are somewhat different to the ones traditionally
used in evaluating clustering methods as all the datasets              In this section, three ways of constructing such a clustering-
used in this work are originally classification datasets and           loss function are explored. First, an unsupervised method
therefore have classes available. Each class can then be               constructing a clustering loss-function is presented, fol-
viewed as a cluster target and the learned clustering evalu-           lowed by two supervised methods.
ated against these targets.
   The first two metrics measure the homogeneity of clus-
                                                                       3.1   Contrastive Predictive Coding
ters (that is the property of instances of one cluster being
close to one another) and their separation (that is the prop-          Contrastive predictive coding (CPC) is a technique first
erty of instances of different clusters being far apart), see          introduced by [21]. The method builds the model on the
[9]. To measure the non-homogeneity of a cluster, the av-              ideas from predictive coding [8]. CPC represents time se-
erage distance between items in a cluster is measured and              ries by modelling future data from the past. To this end, the
averaged over all clusters, giving the following metric for            model learns high-level representations of the data and dis-
items xj in clusters Ci :                                              cards noise. The further in the future the model predicts,
                                       n
                                                                       the less shared information is available and thus the global
 nonhomo (C1 , . . . , Cn ) =
                                    1X X
                                          kxj − µ (Ci )k ,             structure needs to be inferred better.
                                    n i=1
                                           xj ∈Ci
                                                                          The core concept taken from CPC is a loss function
                                                                       called InfoNCE in the prior art. Given a time series of
where                                                                  values xt , an autoregressive model produces a context ct
                   µ (Ci ) =
                                1 X
                                     xj ,                              from all xi up to the point t. For a training set X, predicted
                               |Ci |
                                       xj ∈Ci
                                                                       future value xt+k , the current context ct , and a model fk ,
                                                                       the InfoNCE loss is defined as
with |Ci | being the number of instances in the bag Ci .                                                                       
   To measure the separation of clusters, the average dis-
                                                                             − EX log fk (xt+k , ct ) − log
                                                                                                               X
tance between cluster centres was taken:                                                                           fk (xj , ct ) .
                                                                                                             xj ∈X
                              2
 sep (C1 , . . . , Cn ) =
                                    X
                                      kµ (Ci ) − µ (Cj )k .
                          n (n − 1)                                    Optimizing this value will result in fk modelling the den-
                                       i,j∈n̂
                                        i 0 is a hyper-parameter of this method.
                 1 X  log (Dii ) − log     Dij                      (5)       To adapt the original method, the definitions need to be
                                         X       
          LCPC =                                 ,
                 n i=1                  j=1                                  modified to leverage the bagging of the data. The matrix y
                                                    j6=i                      is defined in the same way, only with labels on the level of
where n is the number of bags.                                                bags. The value ηij = 1 iff the bag Bj is a target neigh-
   Out of the three methods presented in this section, the                    bour for the bag Bi . Then, using the matrix D defined in
approach based on contrastive predictive coding seems the                     equation 4 directly instead of the linear transformation, the
most promising, as it is the only unsupervised method and                     loss function can be expressed as
could therefore be used in a broader class of applications.
This, however, also has its drawbacks – unsupervised meth-
                                                                                             X
                                                                                Ltriplet =         ηij Dij +
ods are generally harder to learn correctly. For that reason,                                ij
the two supervised methods were selected as a safer option.
                                                                                                   ηij (1 − yil ) max (0, 1 + Dij − Dil ) , (6)
                                                                                             X
                                                                                      +c
                                                                                             ijl
3.2     Triplet Loss
                                                                              where c > 0 is again a hyper-parameter of the method.
The performance of the k-nearest neighbour algorithm de-                      Although [29] suggest finding η as the k-nearest neigh-
pends heavily on the distance metric used. Typically, the                     bours of a data point, the loss has been simplified by setting
      1 On the off-chance a random bag would be close to B (1) , this would   ηij = yij for i, j = 1, . . . , n in this work, making all the
                                                          k
be outweighed by the other terms.                                             neighbours of a cluster its target neighbours.
3.3     Magnet Loss                                                    The magnet loss is then defined as

Magnet loss is an enhancement of the previously described                                             exp − 2σ1 2 Ni − α
                                                                                          n                                    !
                                                                                 1X
triplet loss, first introduced by [26]. Whereas the triplet            Lmagnet =       max 0, 1 + log                               ,
                                                                                 n i=1                       Mi
loss function essentially goes over all triplets of a data
                                                                                                                                 (7)
point, its target neighbour and a point from a different class
                                                                       where α ∈ R is a hyper-parameter of the method. Since
and optimizes their distances, magnet loss maintains an
                                                                       the method standardizes both of the terms of the innermost
explicit model of the distributions of different classes in
                                                                       fraction by σ 2 , α is the desired cluster separation gap ex-
the representation space. To capture the distributions, the
                                                                       pressed in the units of the variance.
model uses simple clustering models for each class. Dif-
                                                                          The magnet loss is taken almost verbatim from the orig-
ferently to triplet loss, for which the target neighbourhood
                                                                       inal article. The only change needed is a different way to
is set a priori and is based on similarity in the original in-
                                                                       obtain the representation space by taking
put space, magnet loss uses similarity in the space of rep-
resentations and updates it periodically. Magnet loss also                                          ri = φ (Bi ) .
pursues local separation instead of global – representations
of different classes should be separated, but can be inter-            In this usage, yi is the class label assigned to the bag Bi
leaved.                                                                (making magnet loss a supervised method) and n is the
   Let {(xi , yi )}i=1 be a training dataset where xi ∈ Rd
                    n
                                                                       number of bags in the dataset.
and yi are one of C discrete class labels. A general param-               Due to the choice of the hyper-parameter K being non-
eterized map f (·; Θ) embeds the inputs into a representa-             obvious for most problems, alternative clustering meth-
tion space:                                                            ods that would not need hyper-parameter tuning were ex-
                         ri = f (xi ; Θ) .                             plored. One such method, self-tuning spectral clustering
                                                                       [30] was implemented as a replacement for the K-means al-
For each class c, K cluster assignments are obtained with
                                                                       gorithm used in the original method. Section 4.3 presents
the K-means++ algorithm [18, 2] where K ∈ N is a hyper-
                                                                       a comparison of the performance of these two methods.
parameter of the method:
                                                                          Since magnet loss is an enhancement of triplet loss, it
                                                               2       is reasonable to assume it would perform the same or bet-
                                                                       ter than triplet loss. One drawback of magnet loss is the
                                      K X
                                                    1 X
                       = arg min
                                     X
      I1c , . . . , IK
                     c
                                                 r− c      s       .
                         I1c ,...,IK
                                   c               |Ik |               higher number of hyper-parameters which need to be tuned
                                     k=1 r∈I c           c
                                                                       in order for the method to work correctly. This problem is
                                            k          s∈Ik


The centre of each cluster is denoted µck :                            partially solved by the introduction of self-tuning spectral
                                                                       clustering [30], however, that is outside the scope of this
                                     1 X                               paper.
                           µck =             r.
                                    |Ikc | c
                                          r∈Ik
                                                                       4      Experimental Comparison
For each input, the centre of the cluster in which its repre-
sentation falls is denoted as                                          The three loss functions presented in Section 3 were exper-
                                                                       imentally evaluated on 18 publicly available datasets and
                       µi = arg min kri − µck k                        on a corporate dataset from the domain of computer secu-
                                  µc
                                   k
                                                                       rity. The 2 metrics presented in Section 2.4 were used, i.e.
and the squared distance of its representation from the cen-           the Silhouette coefficient and the accuracy of a kNN clas-
tre of the cluster is denoted as                                       sifier built on the embedding. Both of these metrics were
                                                                       tracked over the learning period to also evaluate how the
                           Ni = kri − µi k .
                                                 2                     models are learning in addition to their final performance.

The variance of the distance of the representations from
                                                                       4.1     Datasets
their respective centres can then be calculated as
                                     n
                                                                       The evaluation was done on a set of 18 publicly available
                            2
                          σ =
                                1 X
                                        Ni .                           datasets, listed in Table 1. All the datasets as used were
                              n − 1 i=1                                also made public2 .
                                                                          The models were also evaluated on a proprietary dataset
For ri , the dissimilarity to all the inputs of different classes      provided by Cisco Cognitive Intelligence, consisting of
is calculated as                                                       records of network connections from clients (e.g. user
                                                                       computers or mobile devices) to some on-line services.
                     K                              
                                      1
                                exp − 2 kri − µck k .
                    XX                             2
           Mi =
                                     2σ                                      2 https://github.com/marekdedic/MilDatasets.jl
                    c6=yi k=1
  Source         Datasets                                        of a bag. This is a natural embedding of a bag as its ex-
                                                                 pected value. This model is referred to as the mean model
  [4]            BrownCreeper, WinterWren                        in the rest of this work. A classification model has been
  [1]            Elephant, Fox, Tiger                            introduced as a target to beat. This model was realised by
  [7]            Musk1, Musk2                                    a MIL neural network identical to the previously described
  [27]           Mutagenesis1, Mutagenesis2                      one, with a final layer of 2 output neurons with an iden-
  [33]           Newsgroups1, Newsgroups2, Newsgroups3           tity activation function added. The accuracy of the model
  [24, 25]       Protein                                         has been evaluated by selecting the optimal threshold on its
  [15]           UCSBBreastCancer                                output. This model mirrors the model used in the proposed
  [32]           Web1, Web2, Web3, Web4                          methods, but replaces the clustering with simple classifica-
                                                                 tion. This model is referred to as the classification model
       Table 1: The 18 publicly available datasets used.         in the rest of this work.
                                                                    Some of the three proposed clustering-losses have some
                                                                 hyper-parameters which need to be tuned. A range of val-
The dataset represents HTTP traffic of more than 100 com-
                                                                 ues was tried for each hyper-parameter in order to select
panies. Two datasets were collected, each spanning 1 day
                                                                 the best configuration for each on each of the datasets.
of traffic. The training data was traffic from 2019-11-18,
                                                                 For Ltriplet , the values c ∈ {0.01, 0.1, 1, 10, 100} have
while the data used for testing was collected the follow-
                                                                 been tested. For Lmagnet , the values K ∈ {2, 3, 8, 16},
ing day, 2019-11-19. For each connection, a proprietary
                                                                 α ∈ {0, 0.1, 0.5} and the cluster index update frequency
classification system based on [14] provided labels, clas-
                                                                 in {5, 10, 25, 70} have been tested.
sifying the connections either as legitimate or malicious
(connected to malware activity). The data was sampled to
include 90 % of negative bags and 10 % of positive bags.         4.3   Comparison of Results
For each connection, 20 connection features were used, as
well as a MIL model of the server URL, which is visualized       Table 2 lists the accuracy of a kNN classifier build on the
in Figure 1.                                                     embedding for all of the methods and all of the datasets.
                                                                 Figure 2 shows the value of the Silhouette coefficient and
                                                                 the accuracy on 3 selected datasets. As can be seen, CPC
4.2     Experimental Design                                      is the worst performing of the methods overall, and, more
                                                                 significantly, shows no clear improvement over the learn-
The models for evaluation were implemented in the Julia
                                                                 ing periods for both the Silhouette coefficient and the kNN
programming language [3] using the Flux.jl framework for
                                                                 classifier accuracy. Between triplet loss and magnet loss,
machine learning [13] and the Mill.jl framework for multi-
                                                                 magnet loss is somewhat better in the classification accu-
instance learning3 .
                                                                 racy, whereas triplet loss is somewhat better in the Silhou-
   The particular architecture of the models was chosen
                                                                 ette coefficient, indicating better separation of individual
based upon previous experience with using neural net-
                                                                 clusters. Of note is also the differing behaviour for var-
works in multi-instance setting [22]. Several architec-
                                                                 ious datasets, for example the clear dominance of triplet
tures were tested and the best one selected for the exper-
                                                                 loss w.r.t. the Silhouette coefficient on the Web3 dataset.
iments. The embedding φ was realised by a MIL neural
                                                                 This shows that none of the methods is a clear winner in all
network consisting of 2 per-instance layers of 30 neurons,
                                                                 scenarios and the best one needs to be carefully selected.
followed by aggregation formed by concatenating element-
wise mean and element-wise maximum of all instances in
a bag, followed by 2 per-bag layers of 30 neurons. All the       4.4   Statistical significance of the results
neurons used the ReLU activation function [12]. Layer
                                                                 The null hypothesis that all three methods are the same as
weights were initialized using Glorot initialization [11],
                                                                 measured by the accuracy can be rejected at a 5 % level
bias vectors were initialized to zeros. ADAM [16] was used
                                                                 of significance by the Friedman two-way analysis of vari-
as the optimization method.
                                                                 ance by ranks [10]. Triplet and magnet losses are better
   For each of the datasets, 80 % of bags were randomly
                                                                 than CPC at a 5 % level of significance by the post-hoc
chosen as the training data, with the rest being testing data.
                                                                 Nemenyi pairwise test [20]. However, the null hypothesis
The models were trained using 100 mini-batches of size of
                                                                 that magnet loss and triplet loss are the same cannot be re-
50.
                                                                 jected at the 5 % significance level by this test. This further
   In order to provide some baseline against which the
                                                                 supports the differing results between these two methods as
models could be compared (as there is no prior art for
                                                                 mentioned in the previous subsection.
this problem), two other models were introduced. A
non-machine-learning model was introduced as a baseline,
which all models should surpass. This model implements           4.5   HTTP dataset
the embedding φ as an element-wise mean of all instances
                                                                 Table 3 compares the three methods on the proprietary
      3 https://github.com/pevnak/Mill.jl                        dataset from Cisco Cognitive Intelligence. Here, magnet
                                     designed token features    learned features of URL parts    learned features of an URL       learned features of a SLD

                                 ψ               evil
                         evil            d1
                                           (1)
                                                 . . . d(1)
                                                        n       φD
                                                                           evil.com

                                 ψ               com                       d1 . . . dn
                        com              d1
                                           (2)
                                                 . . . d(2)
                                                        n


                                                 path                                                http://evil.com/path/file?key=val&get=exploit.js
                                 ψ
                        path              (1)
                                         p1 . . . p(1)
                                                   n            φP
                                                                            path/file
                                                                                                φU
                                 ψ               file                      p1 . . . p n                 u1 . . . u3n
                         file             (2)
                                         p1 . . . p(2)
                                                   n


                                              key=val                                                                                    evil.com
                                 ψ
                    key=val                (1)     (1)               key=val&get=exploit.js                                   φSLD
                                         q1 . . . qn            φQ                                                                       s1 . . . sk
                                         get=exploit.js                    q1 . . . qn
                                 ψ
               get=exploit.js            q1
                                           (2)          (2)
                                                 . . . qn


                                                                                                         c1 . . . cm


           Figure 1: Hierarchical model of a URL. The vector c1 , . . . , cm represents the connection features.


 Dataset                    CPC          Triplet loss           Magnet loss               5     Conclusion
 BrownCreeper               0.900                       0.882             0.900           In our work, the underexplored research area of multi-
 Elephant                   0.575                       0.900             0.825           instance clustering was investigated. Three methods for
 Fox                        0.400                       0.675             0.725           clustering of bags were introduced, of which one is unsu-
 Musk1                      0.667                       0.889             0.889           pervised (CPC) and two are supervised (Triplet loss and
 Musk2                      0.750                       0.950             0.950           Magnet loss). For each of the methods, the prior art it
 Mutagenesis1               0.658                       0.816             0.912           builds on was presented, along with its modification for
 Mutagenesis2               0.708                       1.000             1.000           the purposes of multi-instance clustering. All three meth-
 Newsgroups1                0.533                       0.950             1.000           ods were theoretically and experimentally evaluated and
 Newsgroups2                0.600                       0.900             0.950           compared. The experiments were conducted first on pub-
 Newsgroups3                0.683                       0.700             0.850           licly available datasets in a reproducible fashion. Follow-
 Protein                    0.820                       0.923             0.897           ing that, a corporate dataset of network security data was
 Tiger                      0.642                       0.825             0.825           used as it is the intended application domain for this work.
 UCSBBreastCancer           0.333                       0.667             1.000              Comparing the methods on the publicly available
 Web1                       0.667                       0.733             0.800           datasets shows the method based on contrastive predictive
 Web2                       0.689                       0.800             0.733           coding to perform the worst, with the other having no statis-
 Web3                       0.733                       0.800             1.000           tically significant difference between them. Similar results
 Web4                       0.622                       0.800             0.930           were also obtained on the corporate dataset of HTTP traffic,
 WinterWren                 0.936                       0.955             0.936           albeit none of the results were as good as anticipated. The
                                                                                          method based on contrastive predictive coding performed
Table 2: The accuracy of the final models for the publicly                                poorly on both the publicly available datasets and the cor-
available datasets.                                                                       porate one, however, the comparison might not be fair as
                                                                                          the CPC method is unsupervised, whereas the other two
     Variant         CPC        Triplet loss             Magnet loss                      can utilize labels on the training data, giving them a strong
                                                                                          advantage.
     2 classes      0.920               0.910                   0.930                        The initial expectation of CPC being outperformed
     20 classes     0.893               0.868                   0.904                     proved to be true. Even as such, the result is interesting
                                                                                          and has a value of its own in providing a baseline and a
Table 3: The accuracy of the final models for the corporate                               comparison for these and future methods.
datasets.

                                                                                          Acknowledgement
loss is the best of the three methods, however, only by a                                 The research reported in this paper has been supported by
small margin over the other 2 methods. Since the datasets                                 the Czech Science Foundation (GAČR) grant 18-21409S
consisted of 90 % of negative samples however, none of the                                and partially supported by the GAČR grant 18-18080S.
results is particularly remarkable and some further tuning
would be needed if any of the methods were to be used in
real environment.
                                                                                    1.0
                     CPC
                    Triplet
        1.5
                   Magnet
                                                                                    0.9
                  Mean model




                                                                         Accuracy
                                                                                    0.8
        1.0
Ratio




                                                                                    0.7

        0.5                                                                                                                          CPC
                                                                                    0.6                                             Triplet
                                                                                                                                   Magnet
                                                                                                                             Classification model
        0.0                                                                         0.5
              0            25             50            75         100                    0       25              50         75                 100
                                     Learning step                                                           Learning step

                         (a) Musk2, Silhouette coeff.                                              (b) Musk2, accuracy
                                                                                    1.0
        2.0          CPC
                    Triplet
                   Magnet
                  Mean model
                                                                                    0.8
        1.5
                                                                         Accuracy
Ratio




        1.0                                                                         0.6



        0.5                                                                                                                          CPC
                                                                                    0.4                                             Triplet
                                                                                                                                   Magnet
                                                                                                                             Classification model
        0.0
              0            25             50            75         100                    0       25              50         75                 100
                                     Learning step                                                           Learning step

                  (c) UCSBBreastCancer, Silhouette coeff.                                     (d) UCSBBreastCancer, accuracy
                                                                                    1.0
                      CPC
                     Triplet
        100         Magnet
                   Mean model
                                                                                    0.8

         75
                                                                         Accuracy
Ratio




                                                                                    0.6
         50


                                                                                                                                     CPC
         25                                                                                                                         Triplet
                                                                                    0.4
                                                                                                                                   Magnet
                                                                                                                             Classification model
          0
              0            25             50            75         100                    0       25              50         75                 100
                                     Learning step                                                           Learning step

                         (e) Web3, Silhouette coeff.                                                   (f) Web3, accuracy

                                Figure 2: Accuracy and the Silhouette coefficient on three selected datasets.
References                                                            //www.tandfonline.com/doi/abs/10.1080/
                                                                      01621459.1937.10503522.
 [1]   Stuart Andrews, Ioannis Tsochantaridis, and             [11]   Xavier Glorot and Yoshua Bengio. “Understanding
       Thomas Hofmann. “Support Vector Machines                       the difficulty of training deep feedforward neural
       for Multiple-Instance Learning”. In: Advances in               networks”. In: Proceedings of the Thirteenth Inter-
       Neural Information Processing Systems 15 (Jan.                 national Conference on Artificial Intelligence and
       2002), pp. 561–568.                                            Statistics. Mar. 2010, pp. 249–256. url: http://
 [2]   David Arthur and Sergei Vassilvitskii. k-means++:              proceedings.mlr.press/v9/glorot10a.html
       The Advantages of Careful Seeding. Technical                   (visited on 09/01/2018).
       Report 2006-13. Stanford InfoLab, June 2006,            [12]   Richard H. R. Hahnloser et al. “Digital selection and
       pp. 1027–1035. url: http://ilpubs.stanford.                    analogue amplification coexist in a cortex-inspired
       edu:8090/778/.                                                 silicon circuit”. In: Nature 405.6789 (June 2000),
 [3]   J. Bezanson et al. “Julia: A Fresh Approach to Nu-             pp. 947–951. issn: 1476-4687. doi: 10 . 1038 /
       merical Computing”. In: SIAM Review 59.1 (Jan.                 35016072. url: https : / / www . nature . com /
       2017), pp. 65–98. issn: 0036-1445. doi: 10.1137/               articles/35016072 (visited on 08/27/2018).
       141000671. url: https : / / epubs . siam .              [13]   Mike Innes. “Flux: Elegant machine learning with
       org / doi / 10 . 1137 / 141000671 (visited on                  Julia”. In: Journal of Open Source Software (2018).
       09/01/2018).                                                   doi: 10.21105/joss.00602.
 [4]   Forrest Briggs, Xiaoli Fern, and Raviv Raich.           [14]   Ján Jusko. “Graph-based Detection of Malicious
       “Rank-loss support instance machines for MIML                  Network Communities”. PhD thesis. Prague: Czech
       instance annotation”. In: Proceedings of the ACM               Technical University in Prague, 2017. url: https:
       SIGKDD International Conference on Knowledge                   //dspace.cvut.cz/handle/10467/73702 (vis-
       Discovery and Data Mining. Aug. 2012. doi: 10 .                ited on 04/02/2019).
       1145/2339530.2339616.
                                                               [15]   Melih Kandemir, Chong Zhang, and Fred A.
 [5]   Ying Chen and Ou Wu. “Contextual Hausdorff                     Hamprecht. “Empowering Multiple Instance
       dissimilarity for multi-instance clustering”. In:              Histopathology Cancer Diagnosis by Cell Graphs”.
       Fuzzy Systems and Knowledge Discovery (FSKD).                  In: Medical Image Computing and Computer-
       Sichuan, China: IEEE Computer Society Press, May               Assisted Intervention – MICCAI 2014. Ed. by
       2012, pp. 870–873. isbn: 978-1-4673-0024-7. doi:               Polina Golland et al. Lecture Notes in Computer
       10.1109/FSKD.2012.6233889. url: https://                       Science. Cham: Springer International Publishing,
       ieeexplore.ieee.org/abstract/document/                         2014, pp. 228–235. isbn: 978-3-319-10470-6. doi:
       6233889/ (visited on 05/14/2018).                              10.1007/978-3-319-10470-6_29.
 [6]   Marek Dědič. “Optimalization of distances for           [16]   Diederik P. Kingma and Jimmy Ba. “Adam:
       multi-instance clustering”. MA thesis. Prague:                 A Method for Stochastic Optimization”. In:
       Czech Technical University in Prague, Jan. 2020.               arXiv:1412.6980 [cs] (Jan. 2017). arXiv:
 [7]   Thomas G. Dietterich, Richard H. Lathrop, and                  1412.6980. url: http : / / arxiv . org / abs /
       Tomás Lozano-Pérez. “Solving the multiple in-                  1412.6980 (visited on 06/25/2020).
       stance problem with axis-parallel rectangles”. In:      [17]   J. Kohout and T. Pevný. “Network Traffic Fin-
       Artificial Intelligence 89.1 (Jan. 1997), pp. 31–71.           gerprinting Based on Approximated Kernel Two-
       issn: 0004-3702. doi: 10.1016/S0004-3702(96)                   Sample Test”. In: IEEE Transactions on Information
       00034-3. (Visited on 05/31/2017).                              Forensics and Security 13.3 (Mar. 2018), pp. 788–
 [8]   P. Elias. “Predictive coding–I”. In: IRE Transactions          801. issn: 1556-6013. doi: 10.1109/TIFS.2017.
       on Information Theory 1.1 (Mar. 1955), pp. 16–                 2768018.
       24. issn: 2168-2712. doi: 10 . 1109 / TIT . 1955 .      [18]   J. MacQueen. “Some methods for classification and
       1055126.                                                       analysis of multivariate observations”. In: Proceed-
 [9]   Brian S. Everitt, Sabine Landau, and Morven Leese.             ings of the Fifth Berkeley Symposium on Mathe-
       Cluster Analysis. 4th ed. Taylor & Francis, 2001.              matical Statistics and Probability. Vol. 1. Berke-
       isbn: 978-0-340-76119-9.                                       ley, California: University of California Press, 1967,
[10]   Milton Friedman. “The Use of Ranks to Avoid the                pp. 281–297.
       Assumption of Normality Implicit in the Analysis of     [19]   Krikamol Muandet et al. “Learning from Distri-
       Variance”. In: Journal of the American Statistical             butions via Support Measure Machines”. In: Ad-
       Association 32.200 (Dec. 1937). Publisher: Taylor              vances in neural information processing systems.
       & Francis, pp. 675–701. issn: 0162-1459. doi: 10.              2012, pp. 10–18. url: http://papers.nips.cc/
       1080/01621459.1937.10503522. url: https:                       paper/4825-learning-from-distributions-
       via - support - measure - machines (visited on        [29]   Kilian Q Weinberger, John Blitzer, and Lawrence K.
       06/29/2017).                                                 Saul. “Distance Metric Learning for Large Margin
[20]   PB Nemenyi. “Distribution-free multiple compar-              Nearest Neighbor Classification”. In: Advances in
       isons (doctoral dissertation, princeton university,          Neural Information Processing Systems 18. Ed. by
       1963)”. In: Dissertation Abstracts International             Y. Weiss, B. Schölkopf, and J. C. Platt. MIT Press,
       25.2 (1963), p. 1233.                                        2006, pp. 1473–1480. url: http : / / papers .
                                                                    nips . cc / paper / 2795 - distance - metric -
[21]   Aaron van den Oord, Yazhe Li, and Oriol Vinyals.             learning - for - large - margin - nearest -
       “Representation Learning with Contrastive Predic-            neighbor-classification.pdf.
       tive Coding”. In: arXiv:1807.03748 [cs, stat] (Jan.
       2019). arXiv: 1807.03748. url: http://arxiv.          [30]   Lihi Zelnik-manor and Pietro Perona. “Self-Tuning
       org/abs/1807.03748 (visited on 12/29/2019).                  Spectral Clustering”. In: Advances in Neural Infor-
                                                                    mation Processing Systems 17. Ed. by L. K. Saul, Y.
[22]   Tomas Pevny and Marek Dedic. “Nested Multiple                Weiss, and L. Bottou. MIT Press, 2005, pp. 1601–
       Instance Learning in Modelling of HTTP network               1608. url: http://papers.nips.cc/paper/
       traffic”. In: arXiv:2002.04059 [cs] (Feb. 2020).             2619 - self - tuning - spectral - clustering .
       arXiv: 2002.04059. url: http : / / arxiv . org /             pdf (visited on 01/06/2020).
       abs/2002.04059 (visited on 06/25/2020).
                                                             [31]   Min-Ling Zhang and Zhi-Hua Zhou. “Multi-
[23]   Tomáš Pevný and Petr Somol. “Using Neural Net-               instance clustering with applications to multi-
       work Formalism to Solve Multiple-Instance Prob-              instance prediction”. en. In: Applied Intelligence
       lems”. In: Advances in Neural Networks - ISNN                31.1 (Aug. 2009), pp. 47–68. issn: 1573-7497. doi:
       2017. Springer, Cham, June 2017, pp. 135–142.                10 . 1007 / s10489 - 007 - 0111 - x. url: https :
       isbn: 978-3-319-59072-1. doi: 10 . 1007 / 978 -              / / doi . org / 10 . 1007 / s10489 - 007 - 0111 - x
       3 - 319 - 59072 - 1 _ 17. url: https : / / link .            (visited on 07/30/2020).
       springer.com/chapter/10.1007/978-3-319-
       59072-1_17 (visited on 05/23/2018).                   [32]   Zhi-Hua Zhou, Kai Jiang, and Ming Li. “Multi-
                                                                    Instance Learning Based Web Mining”. In: Applied
[24]   Soumya Ray and Mark Craven. “Learning Statistical            Intelligence 22.2 (Mar. 2005), pp. 135–147. issn:
       Models for Annotating Proteins with Function Infor-          1573-7497. doi: 10.1007/s10489-005-5602-z.
       mation using Biomedical Text”. In: BMC Bioinfor-             url: https://doi.org/10.1007/s10489-005-
       matics 6.1 (May 2005), S18. issn: 1471-2105. doi:            5602-z (visited on 01/01/2020).
       10.1186/1471-2105-6-S1-S18. url: https:
       //doi.org/10.1186/1471-2105-6-S1-S18                  [33]   Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li.
       (visited on 01/01/2020).                                     “Multi-Instance Learning by Treating Instances As
                                                                    Non-I.I.D. Samples”. In: Proceedings of the 26th
[25]   Soumya Ray and Mark Craven. “Supervised ver-                 International Conference On Machine Learning,
       sus multiple instance learning: An empirical com-            ICML 2009 (July 2008). doi: 10.1145/1553374.
       parison”. In: Proceedings of the 22nd Interna-               1553534.
       tional Conference on Machine Learning. Jan. 2005,
       pp. 697–704. doi: 10.1145/1102351.1102439.
[26]   Oren Rippel et al. “Metric Learning with Adaptive
       Density Discrimination”. In: arXiv:1511.05939 [cs,
       stat] (Mar. 2016). arXiv: 1511.05939. url: http:
       / / arxiv . org / abs / 1511 . 05939 (visited on
       06/05/2020).
[27]   A. Srinivasan, Stephen Muggleton, and Robert
       King. “Comparing the use of background knowl-
       edge by inductive logic programming systems”. In:
       Proceedings of the 5th International Workshop on
       Inductive Logic Programming. 1995, pp. 199–230.
[28]   Jun Wang and Jean-Daniel Zucker. “Solving
       Multiple-Instance Problem: A Lazy Learning
       Approach”. In: Proceedings of the Seventeenth
       International Conference on Machine Learning.
       Ed. by Pat Langley. Stanford University, Stanford,
       CA, USA: Morgan Kaufmann, 2000, pp. 1119–
       1125. url: http : / / cogprints . org / 2124/
       (visited on 07/01/2017).