Cartification: Turning Similarities into Itemset Frequencies*

Bart Goethals
University of Antwerp, Belgium
bart.goethals@ua.ac.be

Extended Abstract

Suppose we are given a multi-dimensional dataset. For every point in the dataset, we create a transaction, or cart, in which we store the k nearest neighbors of that point for one of the given dimensions. This is repeated for every dimension. The resulting collection of carts can then be used to mine frequent itemsets; that is, sets of points, or clusters, that are frequently seen together in one or more of the dimensions. Essentially, this transformation, which we call cartification, combines multiple distance measures without suffering from the curse of dimensionality.

An important observation, which reveals the potential of cartified data, is that when the frequency of a single item is high, the item is often found in the neighborhoods of many points in one or more of the dimensions; in fact, the item lies central in a cluster of data points, and, if it is the most central, its frequency will be among the highest of all items in that cluster. Moreover, if an item is indeed part of a cluster, it will mainly receive its support from those transactions in the database that correspond to the relevant dimensions of the cluster, since for the other dimensions it will have wildly varying neighborhoods. This is very important, as it allows us to identify which dimensions are relevant for the cluster, as well as to circumvent the dreaded curse of dimensionality. This observation also holds for sets of items: itemsets with relatively high frequency lie centrally in a sub-structure of the data, and will mainly receive their support from the dimensions relevant to that sub-structure.
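The transformation described above can be sketched in a few lines. This is a minimal illustration, assuming numeric coordinates and per-dimension absolute difference as the distance measure; the helper name `cartify` is our own, not part of the abstract.

```python
def cartify(points, k):
    """Cartify a multi-dimensional dataset: for every point and every
    dimension, build a cart holding the k nearest neighbors of that
    point in that single dimension (the point itself included).

    points: list of equal-length coordinate tuples.
    Returns a list of carts, each a frozenset of 1-based point indices,
    grouped dimension by dimension.
    """
    n, ndims = len(points), len(points[0])
    carts = []
    for d in range(ndims):
        for i in range(n):
            # rank all points by their 1-d distance to point i in dimension d
            order = sorted(range(n), key=lambda j: abs(points[j][d] - points[i][d]))
            carts.append(frozenset(j + 1 for j in order[:k]))
    return carts
```

For an n-point, m-dimensional dataset this yields n·m carts of size k, on which any off-the-shelf frequent itemset miner can then be run.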
In fact, the latter effect will be even more pronounced for (large) sets than for single items, as it is increasingly unlikely that all points in the itemset lie closely together in an irrelevant dimension.

For example, assume we are given the two-dimensional dataset shown in Figure 1. For cartification, we first need to choose a threshold k, the number of nearest neighbors to be stored in each cart. Table 1b shows the resulting database for k = 3. The first two columns show the cartification for the x-dimension, and the second two columns for the y-dimension. Every row corresponds to the cart generated from one of the data points. For example, for point 3, the three nearest neighbors in the x-dimension are points 2, 3, and 4, while for the y-dimension these are 1, 3, and 5. In Figure 2a, we plot the frequencies of all items (points) in this cartified database. This plot shows there are two points, 3 and 9, with a high frequency.

* Work in collaboration with Jilles Vreeken.

[Fig. 1a: scatter plot of the original dataset, not reproduced here.]

Table 1b. The cartified database for k = 3:

       x cart             y cart
  1   {1, 2, 3}      1   {1, 2, 3}
  2   {1, 2, 3}      2   {1, 2, 3}
  3   {2, 3, 4}      3   {1, 3, 5}
  4   {3, 4, 5}      4   {3, 4, 5}
  5   {3, 4, 5}      5   {3, 4, 5}
  6   {5, 6, 7}      6   {4, 6, 8}
  7   {7, 8, 9}      7   {7, 8, 9}
  8   {7, 8, 9}      8   {7, 8, 9}
  9   {8, 9, 10}     9   {7, 9, 11}
 10   {9, 10, 11}   10   {9, 10, 11}
 11   {9, 10, 11}   11   {9, 10, 11}

Fig. 1. Example dataset: (a) original; (b) cartified.

[Fig. 2. Item Frequencies: (a) k = 3; (b) k = 6. Bar charts not reproduced here.]

As Figure 1 shows, these points are in the centers of the clusters {1, 2, 3, 4, 5} and {7, 8, 9, 10, 11}, respectively. Additionally, point 6 has a very low frequency, which corresponds to the point being an outlier w.r.t. all other points in the figure.
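The frequencies plotted in Figure 2a can be recomputed directly from the carts in Table 1b. The following check takes the cart contents verbatim from the table and counts, for each item, the number of carts containing it:

```python
from collections import Counter

# Carts from Table 1b (k = 3): x-dimension carts followed by y-dimension carts.
x_carts = [{1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {3, 4, 5}, {5, 6, 7},
           {7, 8, 9}, {7, 8, 9}, {8, 9, 10}, {9, 10, 11}, {9, 10, 11}]
y_carts = [{1, 2, 3}, {1, 2, 3}, {1, 3, 5}, {3, 4, 5}, {3, 4, 5}, {4, 6, 8},
           {7, 8, 9}, {7, 8, 9}, {7, 9, 11}, {9, 10, 11}, {9, 10, 11}]
carts = x_carts + y_carts

# The frequency of an item is the number of carts that contain it.
freq = Counter(item for cart in carts for item in cart)
```

Counting over all 22 carts, items 3 and 9 attain the maximum frequency of 10, while item 6, the outlier, reaches only 2, in line with the peaks and the dip in Figure 2a.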
Obviously, the choice of k is important with regard to which clusters, centers, or outliers can be identified; if only because, for a given k, only itemsets of size k or smaller can receive support. Essentially, k affects the granularity at which we consider centers and outliers. For example, looking back at the data in Figure 1, as cartified in Table 1b using k = 3, we already identified items 3 and 9 as cluster centers, and item 6 as an outlier. If we choose k = 6 instead, the resulting frequencies per item are shown in Figure 2b. Now item 6 has the maximum frequency of 22, far more than all other items, and represents the center of the cluster consisting of all items. At this granularity, no pronounced outliers can be identified, yet the most remote elements (i.e., 1, 2, 10, and 11) are still identifiable, as they have the least support. Hence, increasing k is like zooming out on the data, taking a more global point of view.

Let us illustrate this with another example, showing that not only the centroid items but also the clusters themselves can be clearly identified. Keeping a relatively close zoom, with k = 3 for instance, the largest itemsets with a frequency greater than 2 are {1, 2, 3}, {3, 4, 5}, {7, 8, 9}, and {9, 10, 11}. These sets correspond to the clusters in the data when we aim at finding small clusters. Next, we zoom out to k = 5, and now find {1, 2, 3, 4, 5} and {7, 8, 9, 10, 11} as the largest itemsets with support greater than 2. These sets represent the clusters in the data well. Note, however, that these are good clusters at the current level of zoom. Indeed, if we zoom in ‘too much’, every point is regarded as a cluster, and if we zoom out ‘too much’, all of the data automatically becomes one big cluster.
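The k = 3 case above can be verified with a brute-force enumeration over the carts of Table 1b; a real itemset miner (e.g. Apriori or Eclat) would of course be used on larger data, but for 11 points checking all 3-subsets directly is a sufficient sketch:

```python
from itertools import combinations

# Carts from Table 1b (k = 3), x- and y-dimension carts combined.
carts = [{1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {3, 4, 5}, {5, 6, 7},
         {7, 8, 9}, {7, 8, 9}, {8, 9, 10}, {9, 10, 11}, {9, 10, 11},
         {1, 2, 3}, {1, 2, 3}, {1, 3, 5}, {3, 4, 5}, {3, 4, 5}, {4, 6, 8},
         {7, 8, 9}, {7, 8, 9}, {7, 9, 11}, {9, 10, 11}, {9, 10, 11}]

def support(itemset):
    """Number of carts that contain the itemset as a subset."""
    return sum(1 for cart in carts if itemset <= cart)

# With k = 3, no itemset of size > 3 can receive support, so the largest
# frequent itemsets are among the 3-subsets of the 11 points.
frequent = [set(c) for c in combinations(range(1, 12), 3)
            if support(set(c)) > 2]
```

The enumeration recovers exactly the four small clusters named in the text: {1, 2, 3}, {3, 4, 5}, {7, 8, 9}, and {9, 10, 11}.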
Clearly, what is ‘enough’ for the task at hand is subjective, and as clustering is exploratory in nature, it is ultimately up to the data analyst to decide what types of clusters are potentially of interest, and hence, how to set k.

To conclude, preliminary cartification experiments on real data show that it allows us to efficiently discover centroid and outlying sets of points, subspace clusters, and clusterings, while not suffering from the curse of dimensionality. Multiple dimensions, numerical or categorical, as well as multiple distance measures, can be combined, and become represented in the frequency of clusters. Moreover, several efficient and scalable itemset mining techniques can be effectively applied to the cartified database, resulting in meaningful and interesting discoveries.
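The claim that numerical and categorical dimensions, and multiple distance measures, can be combined follows naturally from the construction: each dimension may bring its own distance function, since carts are built per dimension. The sketch below illustrates this; `cartify_mixed` and the two example measures are our own illustrative names, not part of the abstract.

```python
def cartify_mixed(points, k, distances):
    """Cartify with one distance function per dimension.

    distances[d] is a two-argument distance function for dimension d,
    so numeric and categorical dimensions can be mixed freely.
    Returns carts as frozensets of 1-based point indices.
    """
    n = len(points)
    carts = []
    for d, dist in enumerate(distances):
        for i in range(n):
            order = sorted(range(n),
                           key=lambda j: dist(points[j][d], points[i][d]))
            carts.append(frozenset(j + 1 for j in order[:k]))
    return carts

numeric = lambda a, b: abs(a - b)              # numeric dimension
categorical = lambda a, b: 0 if a == b else 1  # categorical: match/mismatch
```

After this step, the mixed-type data is just a transaction database, and all downstream itemset mining proceeds exactly as in the purely numeric case.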