=Paper= {{Paper |id=Vol-2718/paper05 |storemode=property |title=Loss Functions for Clustering in Multi-instance Learning |pdfUrl=https://ceur-ws.org/Vol-2718/paper05.pdf |volume=Vol-2718 |authors=Marek Dědič,Tomáš Pevný,Lukáš Bajer,Martin Holeňa |dblpUrl=https://dblp.org/rec/conf/itat/DedicPBH20 }} ==Loss Functions for Clustering in Multi-instance Learning== https://ceur-ws.org/Vol-2718/paper05.pdf

Loss Functions for Clustering in Multi-instance Learning

Marek Dědič1,2 , Tomáš Pevný3 , Lukáš Bajer2 , Martin Holeňa4
1
Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Trojanova 13, Prague, Czech Republic
2
Cisco Systems, Inc., Karlovo náměstı́ 10, Prague, Czech Republic
3
Faculty of Electrical Engineering, Czech Technical University in Prague, Karlovo náměstí 13, Prague, Czech Republic
4
Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: Multi-instance learning belongs to one of re- their work significantly by deciding about whole clusters
cently fast developing areas of machine learning. It is a su- of servers instead of each one individually.
pervised learning method and this paper reports research A key prerequisite for the application of multi-instance
into its unsupervised counterpart, multi-instance cluster- learning to clustering is that loss functions for clustering,
ing. Whereas traditional clustering clusters points, multi- originally proposed for points, are adapted for bags. In this
instance clustering clusters bags, i.e. multisets of points or paper, such an adaptation is outlined for three sophisticated
of other kinds of objects. The paper focuses on the problem loss functions: contrastive predictive coding, triplet loss
of loss functions for clustering. Three sophisticated loss and magnet loss.
functions used for clustering of points, contrastive predic- Of the three proposed methods, triplet loss and magnet
tive coding, triplet loss and magnet loss, are elaborated for loss utilize some labels as part of the training process and
multi-instance clustering. Finally, they are compared on 18 so are not truly unsupervised. As such, the authors expect
benchmark datasets, as well as on a real-world dataset. them to outperform the method based on contrastive pre-
dictive coding, which on the other hand is the most promis-
ing and innovative approach from the theoretical point of
1 Introduction view.
In the next section, basic properties of multi-instance
Multi-instance learning (MIL) belongs to recently fast de- learning and clustering are briefly introduced. The adapta-
veloping areas of machine learning. Though it is a super- tion of the three considered loss functions is explained in
vised learning method, this paper addresses its application Section 3. A comprehensive comparison of them is then
to unsupervised learning – clustering. Whereas traditional presented in Section 4.
clustering is one of points, multi-instance clustering clus-
ters multisets of points, also known as bags. Such a group-
ing of the points into bags is considered a property of the 2 Multi-instance Learning and Clustering
problem at hand and therefore sourced from the input data.
While there have been some previous attempts at multi- Multi-instance learning (MIL) was first described by [7].
instance clustering, they use pairwise relations between all In its original form, it was developed and used for su-
points in all bags, quickly becoming unwieldy and compu- pervised learning. Some prior art also exists for unsu-
tationally infeasible. Our work uses multi-instance learn- pervised learning, such as [5, 31], however, it uses pair-
ing as a general toolkit for learning representations of bags wise instance distances as a basis for clustering, which
and then clusters the bags using those representations. This doesn’t properly utilize the inherent structure of the data
builds on previous works by the authors and enables clus- and quickly becomes computationally infeasible.
tering of arbitrary data-structures by expressing them as The MIL paradigm is a type of representation learning
hierarchies of multi-instance problems and then using the on data which has some internal structure. Therefore, it
internal structure to better solve the problems at hand. views a sample as a bag (i.e. a multiset) of an arbitrary
Multi-instance clustering is evaluated in Section 4 in number of objects. The basic elements of MIL are sam-
the application domain of computer and network security. ples from a space X and their corresponding labels from a
While this is only one of many possible applications of space Y (a space of classes). Compared to usual supervised
MIL due to its general expressive power for structured data, learning, MIL replaces individual instances with bags of
it is the domain of choice for the authors. Here, MIL is instances from the space X such that every instance in X
used to represent user activity as a bag of network con- belongs to at least one bag from the bag-space B.
nections for each user in a fixed time window. For clus- [7] provides an example of a multi-instance problem
tering, this enables detection of compromised users based where each bag represents a key chain with some keys (in-
on their complex behaviours. Clustering opens a new win- stances). To solve the problem of finding which key opens
dow of opportunity here by e.g. grouping servers with sim- a particular lock, a “proxy” MIL problem is presented – de-
ilar behaviour together, allowing a human analyst to boost termining which key chain opens that particular lock. This
Copyright ©2020 for this paper by its authors. Use permitted under line of thinking leads to the pivotal definition of the label
Creative Commons License Attribution 4.0 International (CC BY 4.0). of a bag as being positive iff the bag contains at least one
positive instance. In later works such as [23], this interpre- The functions fI and fB are realized by a deep neural net-
tation of MIL is abandoned in favor of a more general one. work using ReLU as its activation function. The aggrega-
An instance is no longer viewed as having a meaning in and tion g is realized as an element-wise mean or maximum of
of itself, but only in the context of its bag. The notion of all the vectors.
an instance-level label is dropped because in this interpre- The structure of the data is highly exploited using the
tation, the bag is the atomic unit of interest. To this end, embedded-space paradigm. The approach uses multiple
the embedded-space paradigm, described in Section 2.2, is layers of MIL nested in each other – the instances of a
used. bag do not necessarily need to be feature vectors, but can
be bag in themselves. The HTTP traffic of a particular
2.1 A probabilistic formulation of multi-instance client is therefore represented as a bag of all second-level
learning domains the client has exchanged datagrams with. Each
second-level domain is then represented as a bag of indi-
A probabilistic way of describing multi-instance learning vidual URLs which the client has connected to. Individual
was first introduced in [23] and builds on the previous work URLs are then split into 3 parts, domain, path and query,
[19]. and each part is represented as a bag of tokens, which can
Let for the space X exist a measurable space (X , A), then be broken down even further. In the end, the model
where A is a σ-algebra on X . Let P X denote the set of consists of 5 nested MIL problems.
all probability measures on (X , A). A bag B is viewed as A benefit of MIL is also its easy use for explainability
a random sample with replacement of a random variable and interpretability. The authors of [22] present a way to
governed by a particular probability distribution pB ∈ P X , extract indicators of compromise and explain the decision
that is using the learned MIL model.
B = {xi |xi ∼ pB , i ∈ {1, . . . , m}} where m ∈ N. (1)
2.3 Clustering
2.2 Embedded-space paradigm for solving Clustering is a prime example of a problem typically asso-
multi-instance problems ciated with unsupervised learning. The problem at hand,
however, is not one of clustering ordinary number vectors
While it is possible to use several approaches to solv- in a linear space. Instead, a clustering of objects repre-
ing multi-instance problems, in this work, the embedded- sented by bags is explored, as that is the problem solved for
space paradigm was used. For an overview of the other datasets introduced later in Section 4. While [28] present a
paradigms, see [6]. clustering of bags using a modified Hausdorff distance on
In the embedded space paradigm, labels are only defined bags and [17] present a clustering using maximum mean
on the level of bags. In order for these bag labels to be discrepancy, a different approach to clustering of bags is
learned, an embedding function of the form φ : B → X̄ used in this work. Among the main reasons for this choice
must be defined, where X̄ is a latent space, which may or is the prohibitively high computational complexity of the
may not be identical to X . Using this function, each bag previously mentioned approaches on large datasets and the
can be represented by an object φ (B) ∈ X̄ , which makes it possibility of utilizing the representations previously intro-
possible to use any off-the-shelf supervised learning algo- duced in [22].
rithm acting on X̄ . Among the simplest embedding func- An approach based on the embedded-space paradigm for
tions are, e.g. element-wise minimum, maximum, and MIL was chosen. In order to utilize the structure of the
mean. A more complicated embedding function may for data, a MIL model is used to represent each bag in the la-
example apply a neural network to each instance of the bag tent space X̄ . This presents the issue of how to train the em-
and subsequently pool the instances using one of the afore- bedding function φ, because its learning in standard MIL
mentioned functions. is supervised.
Specifically to the experiments presented in Section 4, The embedding function φ can be actually written as
[22] present an approach to learning to classify HTTP traf- φ = φ (B, θ) where B ∈ B and θ are the parameters of the
fic by utilizing sets of URLs. Neural networks are used embedding, typically learned during the training phase. In
to transform both instance-level and bag-level representa- the context of clustering, these parameters still need to be
tions. The model is as follows: A neural network fI : learned over some kind of training. This itself presents an-
X1 → X2 is used to transform instances, followed by an other challenge though – off-the-shelf algorithms typically
aggregation function work in some constant Hilbert space, whereas the latent
g : BX2 → X̄1 . space needs to change over the learning period. As is the
case for MIL itself, end-to-end learning is used. A cluster-
Finally, a second deep neural network fB : X̄1 → X̄2 is loss function LC : B → R is chosen and used to express
used to transform the representation of each bag. Combin- the actual loss for the embedding model φ and its parame-
ing these functions gives the embedding function ters θ as
φ (B) = fB (g ({fI (x)|x ∈ B})) . L (φ, θ) = LC ({φ (B, θ)|B ∈ B}) . (2)
If the cluster-loss LC is chosen correctly, minimizing L high-quality embeddings would need only a small amount
over the learning period will yield a latent space X̄ in which of seed data points to reach relatively high accuracy. For
the bags are already naturally clustered according to the this measure, the kNN algorithm was run thrice and the re-
design of the cluster-loss function. Applying any off-the- sults averaged to compensate for high dependence on the
shelf clustering algorithm on X̄ will then give good results. particular seed points chosen.
How to choose the cluster-loss function LC is the focus of
Section 3.
3 Investigated Loss Functions
2.4 Clustering evaluation metrics In Section 2.4, a method for learning a representation φ was
described. In order to learn the parameters of the represen-
In our research, several clustering evaluation metrics have tation, a clustering-loss function is required in the form
been used. The primary metrics are based on the Silhou-
ette coefficient, the secondary ones on the kNN algorithm. LC : P M X̄ → R.

The metrics are somewhat different to the ones traditionally
used in evaluating clustering methods as all the datasets In this section, three ways of constructing such a clustering-
used in this work are originally classification datasets and loss function are explored. First, an unsupervised method
therefore have classes available. Each class can then be constructing a clustering loss-function is presented, fol-
viewed as a cluster target and the learned clustering evalu- lowed by two supervised methods.
ated against these targets.
The first two metrics measure the homogeneity of clus-
3.1 Contrastive Predictive Coding
ters (that is the property of instances of one cluster being
close to one another) and their separation (that is the prop- Contrastive predictive coding (CPC) is a technique first
erty of instances of different clusters being far apart), see introduced by [21]. The method builds the model on the
[9]. To measure the non-homogeneity of a cluster, the av- ideas from predictive coding [8]. CPC represents time se-
erage distance between items in a cluster is measured and ries by modelling future data from the past. To this end, the
averaged over all clusters, giving the following metric for model learns high-level representations of the data and dis-
items xj in clusters Ci : cards noise. The further in the future the model predicts,
n
the less shared information is available and thus the global
nonhomo (C1 , . . . , Cn ) =
1X X
kxj − µ (Ci )k , structure needs to be inferred better.
n i=1
xj ∈Ci
The core concept taken from CPC is a loss function
called InfoNCE in the prior art. Given a time series of
where values xt , an autoregressive model produces a context ct
µ (Ci ) =
1 X
xj , from all xi up to the point t. For a training set X, predicted
|Ci |
xj ∈Ci
future value xt+k , the current context ct , and a model fk ,
the InfoNCE loss is defined as
with |Ci | being the number of instances in the bag Ci .  
To measure the separation of clusters, the average dis-
− EX log fk (xt+k , ct ) − log
X
tance between cluster centres was taken: fk (xj , ct ) .
xj ∈X
2
sep (C1 , . . . , Cn ) =
X
kµ (Ci ) − µ (Cj )k .
n (n − 1) Optimizing this value will result in fk modelling the den-
i,j∈n̂
i 0 is a hyper-parameter of this method.
1 X log (Dii ) − log Dij  (5) To adapt the original method, the definitions need to be
X 
LCPC = ,
n i=1  j=1 modified to leverage the bagging of the data. The matrix y
j6=i is defined in the same way, only with labels on the level of
where n is the number of bags. bags. The value ηij = 1 iff the bag Bj is a target neigh-
Out of the three methods presented in this section, the bour for the bag Bi . Then, using the matrix D defined in
approach based on contrastive predictive coding seems the equation 4 directly instead of the linear transformation, the
most promising, as it is the only unsupervised method and loss function can be expressed as
could therefore be used in a broader class of applications.
This, however, also has its drawbacks – unsupervised meth-
X
Ltriplet = ηij Dij +
ods are generally harder to learn correctly. For that reason, ij
the two supervised methods were selected as a safer option.
ηij (1 − yil ) max (0, 1 + Dij − Dil ) , (6)
X
+c
ijl
3.2 Triplet Loss
where c > 0 is again a hyper-parameter of the method.
The performance of the k-nearest neighbour algorithm de- Although [29] suggest finding η as the k-nearest neigh-
pends heavily on the distance metric used. Typically, the bours of a data point, the loss has been simplified by setting
1 On the off-chance a random bag would be close to B (1) , this would ηij = yij for i, j = 1, . . . , n in this work, making all the
k
be outweighed by the other terms. neighbours of a cluster its target neighbours.
3.3 Magnet Loss The magnet loss is then defined as

Magnet loss is an enhancement of the previously described exp − 2σ1 2 Ni − α
n !
1X
triplet loss, first introduced by [26]. Whereas the triplet Lmagnet = max 0, 1 + log ,
n i=1 Mi
loss function essentially goes over all triplets of a data
(7)
point, its target neighbour and a point from a different class
where α ∈ R is a hyper-parameter of the method. Since
and optimizes their distances, magnet loss maintains an
the method standardizes both of the terms of the innermost
explicit model of the distributions of different classes in
fraction by σ 2 , α is the desired cluster separation gap ex-
the representation space. To capture the distributions, the
pressed in the units of the variance.
model uses simple clustering models for each class. Dif-
The magnet loss is taken almost verbatim from the orig-
ferently to triplet loss, for which the target neighbourhood
inal article. The only change needed is a different way to
is set a priori and is based on similarity in the original in-
obtain the representation space by taking
put space, magnet loss uses similarity in the space of rep-
resentations and updates it periodically. Magnet loss also ri = φ (Bi ) .
pursues local separation instead of global – representations
of different classes should be separated, but can be inter- In this usage, yi is the class label assigned to the bag Bi
leaved. (making magnet loss a supervised method) and n is the
Let {(xi , yi )}i=1 be a training dataset where xi ∈ Rd
n
number of bags in the dataset.
and yi are one of C discrete class labels. A general param- Due to the choice of the hyper-parameter K being non-
eterized map f (·; Θ) embeds the inputs into a representa- obvious for most problems, alternative clustering meth-
tion space: ods that would not need hyper-parameter tuning were ex-
ri = f (xi ; Θ) . plored. One such method, self-tuning spectral clustering
[30] was implemented as a replacement for the K-means al-
For each class c, K cluster assignments are obtained with
gorithm used in the original method. Section 4.3 presents
the K-means++ algorithm [18, 2] where K ∈ N is a hyper-
a comparison of the performance of these two methods.
parameter of the method:
Since magnet loss is an enhancement of triplet loss, it
2 is reasonable to assume it would perform the same or bet-
ter than triplet loss. One drawback of magnet loss is the
K X
1 X
= arg min
X
I1c , . . . , IK
c
r− c s .
I1c ,...,IK
c |Ik | higher number of hyper-parameters which need to be tuned
k=1 r∈I c c
in order for the method to work correctly. This problem is
k s∈Ik

The centre of each cluster is denoted µck : partially solved by the introduction of self-tuning spectral
clustering [30], however, that is outside the scope of this
1 X paper.
µck = r.
|Ikc | c
r∈Ik
4 Experimental Comparison
For each input, the centre of the cluster in which its repre-
sentation falls is denoted as The three loss functions presented in Section 3 were exper-
imentally evaluated on 18 publicly available datasets and
µi = arg min kri − µck k on a corporate dataset from the domain of computer secu-
µc
k
rity. The 2 metrics presented in Section 2.4 were used, i.e.
and the squared distance of its representation from the cen- the Silhouette coefficient and the accuracy of a kNN clas-
tre of the cluster is denoted as sifier built on the embedding. Both of these metrics were
tracked over the learning period to also evaluate how the
Ni = kri − µi k .
2 models are learning in addition to their final performance.

The variance of the distance of the representations from
4.1 Datasets
their respective centres can then be calculated as
n
The evaluation was done on a set of 18 publicly available
2
σ =
1 X
Ni . datasets, listed in Table 1. All the datasets as used were
n − 1 i=1 also made public2 .
The models were also evaluated on a proprietary dataset
For ri , the dissimilarity to all the inputs of different classes provided by Cisco Cognitive Intelligence, consisting of
is calculated as records of network connections from clients (e.g. user
computers or mobile devices) to some on-line services.
K
1
exp − 2 kri − µck k .
XX 2
Mi =
2σ 2 https://github.com/marekdedic/MilDatasets.jl
c6=yi k=1
Source Datasets of a bag. This is a natural embedding of a bag as its ex-
pected value. This model is referred to as the mean model
[4] BrownCreeper, WinterWren in the rest of this work. A classification model has been
[1] Elephant, Fox, Tiger introduced as a target to beat. This model was realised by
[7] Musk1, Musk2 a MIL neural network identical to the previously described
[27] Mutagenesis1, Mutagenesis2 one, with a final layer of 2 output neurons with an iden-
[33] Newsgroups1, Newsgroups2, Newsgroups3 tity activation function added. The accuracy of the model
[24, 25] Protein has been evaluated by selecting the optimal threshold on its
[15] UCSBBreastCancer output. This model mirrors the model used in the proposed
[32] Web1, Web2, Web3, Web4 methods, but replaces the clustering with simple classifica-
tion. This model is referred to as the classification model
Table 1: The 18 publicly available datasets used. in the rest of this work.
Some of the three proposed clustering-losses have some
hyper-parameters which need to be tuned. A range of val-
The dataset represents HTTP traffic of more than 100 com-
ues was tried for each hyper-parameter in order to select
panies. Two datasets were collected, each spanning 1 day
the best configuration for each on each of the datasets.
of traffic. The training data was traffic from 2019-11-18,
For Ltriplet , the values c ∈ {0.01, 0.1, 1, 10, 100} have
while the data used for testing was collected the follow-
been tested. For Lmagnet , the values K ∈ {2, 3, 8, 16},
ing day, 2019-11-19. For each connection, a proprietary
α ∈ {0, 0.1, 0.5} and the cluster index update frequency
classification system based on [14] provided labels, clas-
in {5, 10, 25, 70} have been tested.
sifying the connections either as legitimate or malicious
(connected to malware activity). The data was sampled to
include 90 % of negative bags and 10 % of positive bags. 4.3 Comparison of Results
For each connection, 20 connection features were used, as
well as a MIL model of the server URL, which is visualized Table 2 lists the accuracy of a kNN classifier build on the
in Figure 1. embedding for all of the methods and all of the datasets.
Figure 2 shows the value of the Silhouette coefficient and
the accuracy on 3 selected datasets. As can be seen, CPC
4.2 Experimental Design is the worst performing of the methods overall, and, more
significantly, shows no clear improvement over the learn-
The models for evaluation were implemented in the Julia
ing periods for both the Silhouette coefficient and the kNN
programming language [3] using the Flux.jl framework for
classifier accuracy. Between triplet loss and magnet loss,
machine learning [13] and the Mill.jl framework for multi-
magnet loss is somewhat better in the classification accu-
instance learning3 .
racy, whereas triplet loss is somewhat better in the Silhou-
The particular architecture of the models was chosen
ette coefficient, indicating better separation of individual
based upon previous experience with using neural net-
clusters. Of note is also the differing behaviour for var-
works in multi-instance setting [22]. Several architec-
ious datasets, for example the clear dominance of triplet
tures were tested and the best one selected for the exper-
loss w.r.t. the Silhouette coefficient on the Web3 dataset.
iments. The embedding φ was realised by a MIL neural
This shows that none of the methods is a clear winner in all
network consisting of 2 per-instance layers of 30 neurons,
scenarios and the best one needs to be carefully selected.
followed by aggregation formed by concatenating element-
wise mean and element-wise maximum of all instances in
a bag, followed by 2 per-bag layers of 30 neurons. All the 4.4 Statistical significance of the results
neurons used the ReLU activation function [12]. Layer
The null hypothesis that all three methods are the same as
weights were initialized using Glorot initialization [11],
measured by the accuracy can be rejected at a 5 % level
bias vectors were initialized to zeros. ADAM [16] was used
of significance by the Friedman two-way analysis of vari-
as the optimization method.
ance by ranks [10]. Triplet and magnet losses are better
For each of the datasets, 80 % of bags were randomly
than CPC at a 5 % level of significance by the post-hoc
chosen as the training data, with the rest being testing data.
Nemenyi pairwise test [20]. However, the null hypothesis
The models were trained using 100 mini-batches of size of
that magnet loss and triplet loss are the same cannot be re-
50.
jected at the 5 % significance level by this test. This further
In order to provide some baseline against which the
supports the differing results between these two methods as
models could be compared (as there is no prior art for
mentioned in the previous subsection.
this problem), two other models were introduced. A
non-machine-learning model was introduced as a baseline,
which all models should surpass. This model implements 4.5 HTTP dataset
the embedding φ as an element-wise mean of all instances
Table 3 compares the three methods on the proprietary
3 https://github.com/pevnak/Mill.jl dataset from Cisco Cognitive Intelligence. Here, magnet
designed token features learned features of URL parts learned features of an URL learned features of a SLD

ψ evil
evil d1
(1)
. . . d(1)
n φD
evil.com

ψ com d1 . . . dn
com d1
(2)
. . . d(2)
n

path http://evil.com/path/file?key=val&get=exploit.js
ψ
path (1)
p1 . . . p(1)
n φP
path/file
φU
ψ file p1 . . . p n u1 . . . u3n
file (2)
p1 . . . p(2)
n

key=val evil.com
ψ
key=val (1) (1) key=val&get=exploit.js φSLD
q1 . . . qn φQ s1 . . . sk
get=exploit.js q1 . . . qn
ψ
get=exploit.js q1
(2) (2)
. . . qn

c1 . . . cm

Figure 1: Hierarchical model of a URL. The vector c1 , . . . , cm represents the connection features.

Dataset CPC Triplet loss Magnet loss 5 Conclusion
BrownCreeper 0.900 0.882 0.900 In our work, the underexplored research area of multi-
Elephant 0.575 0.900 0.825 instance clustering was investigated. Three methods for
Fox 0.400 0.675 0.725 clustering of bags were introduced, of which one is unsu-
Musk1 0.667 0.889 0.889 pervised (CPC) and two are supervised (Triplet loss and
Musk2 0.750 0.950 0.950 Magnet loss). For each of the methods, the prior art it
Mutagenesis1 0.658 0.816 0.912 builds on was presented, along with its modification for
Mutagenesis2 0.708 1.000 1.000 the purposes of multi-instance clustering. All three meth-
Newsgroups1 0.533 0.950 1.000 ods were theoretically and experimentally evaluated and
Newsgroups2 0.600 0.900 0.950 compared. The experiments were conducted first on pub-
Newsgroups3 0.683 0.700 0.850 licly available datasets in a reproducible fashion. Follow-
Protein 0.820 0.923 0.897 ing that, a corporate dataset of network security data was
Tiger 0.642 0.825 0.825 used as it is the intended application domain for this work.
UCSBBreastCancer 0.333 0.667 1.000 Comparing the methods on the publicly available
Web1 0.667 0.733 0.800 datasets shows the method based on contrastive predictive
Web2 0.689 0.800 0.733 coding to perform the worst, with the other having no statis-
Web3 0.733 0.800 1.000 tically significant difference between them. Similar results
Web4 0.622 0.800 0.930 were also obtained on the corporate dataset of HTTP traffic,
WinterWren 0.936 0.955 0.936 albeit none of the results were as good as anticipated. The
method based on contrastive predictive coding performed
Table 2: The accuracy of the final models for the publicly poorly on both the publicly available datasets and the cor-
available datasets. porate one, however, the comparison might not be fair as
the CPC method is unsupervised, whereas the other two
Variant CPC Triplet loss Magnet loss can utilize labels on the training data, giving them a strong
advantage.
2 classes 0.920 0.910 0.930 The initial expectation of CPC being outperformed
20 classes 0.893 0.868 0.904 proved to be true. Even as such, the result is interesting
and has a value of its own in providing a baseline and a
Table 3: The accuracy of the final models for the corporate comparison for these and future methods.
datasets.

Acknowledgement
loss is the best of the three methods, however, only by a The research reported in this paper has been supported by
small margin over the other 2 methods. Since the datasets the Czech Science Foundation (GAČR) grant 18-21409S
consisted of 90 % of negative samples however, none of the and partially supported by the GAČR grant 18-18080S.
results is particularly remarkable and some further tuning
would be needed if any of the methods were to be used in
real environment.
1.0
CPC
Triplet
1.5
Magnet
0.9
Mean model

Accuracy
0.8
1.0
Ratio

0.7

0.5 CPC
0.6 Triplet
Magnet
Classification model
0.0 0.5
0 25 50 75 100 0 25 50 75 100
Learning step Learning step

(a) Musk2, Silhouette coeff. (b) Musk2, accuracy
1.0
2.0 CPC
Triplet
Magnet
Mean model
0.8
1.5
Accuracy
Ratio

1.0 0.6

0.5 CPC
0.4 Triplet
Magnet
Classification model
0.0
0 25 50 75 100 0 25 50 75 100
Learning step Learning step

75
Accuracy
Ratio

0.6
50

CPC
25 Triplet
0.4
Magnet
Classification model
0
0 25 50 75 100 0 25 50 75 100
Learning step Learning step

(e) Web3, Silhouette coeff. (f) Web3, accuracy

Figure 2: Accuracy and the Silhouette coefficient on three selected datasets.
References //www.tandfonline.com/doi/abs/10.1080/
01621459.1937.10503522.
[1] Stuart Andrews, Ioannis Tsochantaridis, and [11] Xavier Glorot and Yoshua Bengio. “Understanding
Thomas Hofmann. “Support Vector Machines the difficulty of training deep feedforward neural
for Multiple-Instance Learning”. In: Advances in networks”. In: Proceedings of the Thirteenth Inter-
Neural Information Processing Systems 15 (Jan. national Conference on Artificial Intelligence and
2002), pp. 561–568. Statistics. Mar. 2010, pp. 249–256. url: http://
[2] David Arthur and Sergei Vassilvitskii. k-means++: proceedings.mlr.press/v9/glorot10a.html
The Advantages of Careful Seeding. Technical (visited on 09/01/2018).
Report 2006-13. Stanford InfoLab, June 2006, [12] Richard H. R. Hahnloser et al. “Digital selection and
pp. 1027–1035. url: http://ilpubs.stanford. analogue amplification coexist in a cortex-inspired
edu:8090/778/. silicon circuit”. In: Nature 405.6789 (June 2000),
[3] J. Bezanson et al. “Julia: A Fresh Approach to Nu- pp. 947–951. issn: 1476-4687. doi: 10 . 1038 /
merical Computing”. In: SIAM Review 59.1 (Jan. 35016072. url: https : / / www . nature . com /
2017), pp. 65–98. issn: 0036-1445. doi: 10.1137/ articles/35016072 (visited on 08/27/2018).
141000671. url: https : / / epubs . siam . [13] Mike Innes. “Flux: Elegant machine learning with
org / doi / 10 . 1137 / 141000671 (visited on Julia”. In: Journal of Open Source Software (2018).
09/01/2018). doi: 10.21105/joss.00602.
[4] Forrest Briggs, Xiaoli Fern, and Raviv Raich. [14] Ján Jusko. “Graph-based Detection of Malicious
“Rank-loss support instance machines for MIML Network Communities”. PhD thesis. Prague: Czech
instance annotation”. In: Proceedings of the ACM Technical University in Prague, 2017. url: https:
SIGKDD International Conference on Knowledge //dspace.cvut.cz/handle/10467/73702 (vis-
Discovery and Data Mining. Aug. 2012. doi: 10 . ited on 04/02/2019).
1145/2339530.2339616.
[15] Melih Kandemir, Chong Zhang, and Fred A.
[5] Ying Chen and Ou Wu. “Contextual Hausdorff Hamprecht. “Empowering Multiple Instance
dissimilarity for multi-instance clustering”. In: Histopathology Cancer Diagnosis by Cell Graphs”.
Fuzzy Systems and Knowledge Discovery (FSKD). In: Medical Image Computing and Computer-
Sichuan, China: IEEE Computer Society Press, May Assisted Intervention – MICCAI 2014. Ed. by
2012, pp. 870–873. isbn: 978-1-4673-0024-7. doi: Polina Golland et al. Lecture Notes in Computer
10.1109/FSKD.2012.6233889. url: https:// Science. Cham: Springer International Publishing,
ieeexplore.ieee.org/abstract/document/ 2014, pp. 228–235. isbn: 978-3-319-10470-6. doi:
6233889/ (visited on 05/14/2018). 10.1007/978-3-319-10470-6_29.
[6] Marek Dědič. “Optimalization of distances for [16] Diederik P. Kingma and Jimmy Ba. “Adam:
multi-instance clustering”. MA thesis. Prague: A Method for Stochastic Optimization”. In:
Czech Technical University in Prague, Jan. 2020. arXiv:1412.6980 [cs] (Jan. 2017). arXiv:
[7] Thomas G. Dietterich, Richard H. Lathrop, and 1412.6980. url: http : / / arxiv . org / abs /
Tomás Lozano-Pérez. “Solving the multiple in- 1412.6980 (visited on 06/25/2020).
stance problem with axis-parallel rectangles”. In: [17] J. Kohout and T. Pevný. “Network Traffic Fin-
Artificial Intelligence 89.1 (Jan. 1997), pp. 31–71. gerprinting Based on Approximated Kernel Two-
issn: 0004-3702. doi: 10.1016/S0004-3702(96) Sample Test”. In: IEEE Transactions on Information
00034-3. (Visited on 05/31/2017). Forensics and Security 13.3 (Mar. 2018), pp. 788–
[8] P. Elias. “Predictive coding–I”. In: IRE Transactions 801. issn: 1556-6013. doi: 10.1109/TIFS.2017.
on Information Theory 1.1 (Mar. 1955), pp. 16– 2768018.
24. issn: 2168-2712. doi: 10 . 1109 / TIT . 1955 . [18] J. MacQueen. “Some methods for classification and
1055126. analysis of multivariate observations”. In: Proceed-
[9] Brian S. Everitt, Sabine Landau, and Morven Leese. ings of the Fifth Berkeley Symposium on Mathe-
Cluster Analysis. 4th ed. Taylor & Francis, 2001. matical Statistics and Probability. Vol. 1. Berke-
isbn: 978-0-340-76119-9. ley, California: University of California Press, 1967,
[10] Milton Friedman. “The Use of Ranks to Avoid the pp. 281–297.
Assumption of Normality Implicit in the Analysis of [19] Krikamol Muandet et al. “Learning from Distri-
Variance”. In: Journal of the American Statistical butions via Support Measure Machines”. In: Ad-
Association 32.200 (Dec. 1937). Publisher: Taylor vances in neural information processing systems.
& Francis, pp. 675–701. issn: 0162-1459. doi: 10. 2012, pp. 10–18. url: http://papers.nips.cc/
1080/01621459.1937.10503522. url: https: paper/4825-learning-from-distributions-
via - support - measure - machines (visited on [29] Kilian Q Weinberger, John Blitzer, and Lawrence K.
06/29/2017). Saul. “Distance Metric Learning for Large Margin
[20] PB Nemenyi. “Distribution-free multiple compar- Nearest Neighbor Classification”. In: Advances in
isons (doctoral dissertation, princeton university, Neural Information Processing Systems 18. Ed. by
1963)”. In: Dissertation Abstracts International Y. Weiss, B. Schölkopf, and J. C. Platt. MIT Press,
25.2 (1963), p. 1233. 2006, pp. 1473–1480. url: http : / / papers .
nips . cc / paper / 2795 - distance - metric -
[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. learning - for - large - margin - nearest -
“Representation Learning with Contrastive Predic- neighbor-classification.pdf.
tive Coding”. In: arXiv:1807.03748 [cs, stat] (Jan.
2019). arXiv: 1807.03748. url: http://arxiv. [30] Lihi Zelnik-manor and Pietro Perona. “Self-Tuning
org/abs/1807.03748 (visited on 12/29/2019). Spectral Clustering”. In: Advances in Neural Infor-
mation Processing Systems 17. Ed. by L. K. Saul, Y.
[22] Tomas Pevny and Marek Dedic. “Nested Multiple Weiss, and L. Bottou. MIT Press, 2005, pp. 1601–
Instance Learning in Modelling of HTTP network 1608. url: http://papers.nips.cc/paper/
traffic”. In: arXiv:2002.04059 [cs] (Feb. 2020). 2619 - self - tuning - spectral - clustering .
arXiv: 2002.04059. url: http : / / arxiv . org / pdf (visited on 01/06/2020).
abs/2002.04059 (visited on 06/25/2020).
[31] Min-Ling Zhang and Zhi-Hua Zhou. “Multi-
[23] Tomáš Pevný and Petr Somol. “Using Neural Net- instance clustering with applications to multi-
work Formalism to Solve Multiple-Instance Prob- instance prediction”. en. In: Applied Intelligence
lems”. In: Advances in Neural Networks - ISNN 31.1 (Aug. 2009), pp. 47–68. issn: 1573-7497. doi:
2017. Springer, Cham, June 2017, pp. 135–142. 10 . 1007 / s10489 - 007 - 0111 - x. url: https :
isbn: 978-3-319-59072-1. doi: 10 . 1007 / 978 - / / doi . org / 10 . 1007 / s10489 - 007 - 0111 - x
3 - 319 - 59072 - 1 _ 17. url: https : / / link . (visited on 07/30/2020).
springer.com/chapter/10.1007/978-3-319-
59072-1_17 (visited on 05/23/2018). [32] Zhi-Hua Zhou, Kai Jiang, and Ming Li. “Multi-
Instance Learning Based Web Mining”. In: Applied
[24] Soumya Ray and Mark Craven. “Learning Statistical Intelligence 22.2 (Mar. 2005), pp. 135–147. issn:
Models for Annotating Proteins with Function Infor- 1573-7497. doi: 10.1007/s10489-005-5602-z.
mation using Biomedical Text”. In: BMC Bioinfor- url: https://doi.org/10.1007/s10489-005-
matics 6.1 (May 2005), S18. issn: 1471-2105. doi: 5602-z (visited on 01/01/2020).
10.1186/1471-2105-6-S1-S18. url: https:
//doi.org/10.1186/1471-2105-6-S1-S18 [33] Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li.
(visited on 01/01/2020). “Multi-Instance Learning by Treating Instances As
Non-I.I.D. Samples”. In: Proceedings of the 26th
[25] Soumya Ray and Mark Craven. “Supervised ver- International Conference On Machine Learning,
sus multiple instance learning: An empirical com- ICML 2009 (July 2008). doi: 10.1145/1553374.
parison”. In: Proceedings of the 22nd Interna- 1553534.
tional Conference on Machine Learning. Jan. 2005,
pp. 697–704. doi: 10.1145/1102351.1102439.
[26] Oren Rippel et al. “Metric Learning with Adaptive
Density Discrimination”. In: arXiv:1511.05939 [cs,
stat] (Mar. 2016). arXiv: 1511.05939. url: http:
/ / arxiv . org / abs / 1511 . 05939 (visited on
06/05/2020).
[27] A. Srinivasan, Stephen Muggleton, and Robert
King. “Comparing the use of background knowl-
edge by inductive logic programming systems”. In:
Proceedings of the 5th International Workshop on
Inductive Logic Programming. 1995, pp. 199–230.
[28] Jun Wang and Jean-Daniel Zucker. “Solving
Multiple-Instance Problem: A Lazy Learning
Approach”. In: Proceedings of the Seventeenth
International Conference on Machine Learning.
Ed. by Pat Langley. Stanford University, Stanford,
CA, USA: Morgan Kaufmann, 2000, pp. 1119–
1125. url: http : / / cogprints . org / 2124/
(visited on 07/01/2017).