WebDocs: a real-life huge transactional dataset.

Claudio Lucchese², Salvatore Orlando¹, Raffaele Perego², Fabrizio Silvestri²
¹ Dipartimento di Informatica, Università Ca’ Foscari di Venezia, Venezia, Italy, orlando@dsi.unive.it
² ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy, {r.perego,c.lucchese,f.silvestri}@isti.cnr.it


Characteristics of the dataset

   This short note describes the main characteristics of WebDocs, a huge
real-life transactional dataset we made publicly available to the Data
Mining community through the FIMI repository. We built WebDocs from a
spidered collection of web HTML documents. The whole collection contains
about 1.7 million documents, mainly written in English, and its size is
about 5 GB.

[Figure 1 plot, titled "Frequent itemsets in the WebDocs dataset". x-axis: support % (from 40 down to 5); y-axis: number of itemsets (logarithmic, 10^1 to 10^7).]

Figure 1. Number of frequent itemsets discovered in the WebDocs dataset as a function of the support threshold.

   The transactional dataset was built from the web collection in the
following way. All the web documents were preliminarily filtered by
removing HTML tags and the most common words (stopwords), and by applying
a stemming algorithm. Then we generated from each document a distinct
transaction containing the set of all the distinct terms (items) appearing
within the document itself.
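
The construction just described (tag removal, stopword filtering, stemming, one set of distinct terms per document) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline used to build WebDocs: the stopword list, the `toy_stem` suffix stripper, and the rule for assigning integer item ids are all hypothetical stand-ins, since the note does not say which ones were used.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption: the real list is not given).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


class _TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the document text with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def toy_stem(word):
    """Placeholder suffix stripper, NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_to_transaction(html, vocabulary):
    """Map one HTML document to a sorted list of distinct item ids.

    `vocabulary` maps each stemmed term to an integer id and is extended
    in place whenever a new term is seen (hypothetical id scheme).
    """
    text = strip_html(html).lower()
    terms = {
        toy_stem(word)
        for word in re.findall(r"[a-z]+", text)
        if word not in STOPWORDS
    }
    ids = {vocabulary.setdefault(t, len(vocabulary) + 1) for t in sorted(terms)}
    return sorted(ids)
```

Calling `document_to_transaction` on every document with a shared `vocabulary` dictionary yields one transaction per document. Because the terms are collected in a set, a term that occurs many times in a document still contributes a single item.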

   The resulting dataset has a size of about 1.48 GB. It contains exactly
1,692,082 transactions with 5,267,656 distinct items. The maximal length
of a transaction is 71,472. Figure 1 plots the number of frequent itemsets
as a function of the support threshold, while Figure 2 shows a bitmap
representing the horizontal dataset, where items were sorted by their
frequency. Note that, to reduce the size of the bitmap, each pixel was
obtained by counting the occurrences of a group of items with consecutive
ids in a subset of consecutive transactions and assigning a level of gray
proportional to that count.
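
The summary figures quoted above (number of transactions, number of distinct items, maximal transaction length) can be recomputed in a single pass over the data. The sketch below assumes the layout commonly used for FIMI repository files, namely one transaction per line with whitespace-separated integer item ids. The function names are illustrative, and `frequent_items` covers only itemsets of size one, not the full itemset counts plotted in Figure 1.

```python
import math
from collections import Counter


def dataset_stats(lines):
    """Return (n_transactions, n_distinct_items, max_length, item_counts).

    `lines` is any iterable of strings, e.g. an open text file in which
    each line holds the item ids of one transaction (assumed layout).
    """
    n_transactions = 0
    max_length = 0
    item_counts = Counter()
    for line in lines:
        items = {int(token) for token in line.split()}
        if not items:
            continue  # skip blank lines
        n_transactions += 1
        max_length = max(max_length, len(items))
        item_counts.update(items)  # each item counted once per transaction
    return n_transactions, len(item_counts), max_length, item_counts


def frequent_items(item_counts, n_transactions, support_pct):
    """Single items whose support is at least `support_pct` percent."""
    min_count = math.ceil(support_pct / 100 * n_transactions)
    return sorted(i for i, c in item_counts.items() if c >= min_count)
```

With a file on disk, a call such as `dataset_stats(open(path))` (where `path` is a placeholder for the local file name) streams the file line by line, so only the per-item counters are held in memory rather than the transactions themselves.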

                                                                                          Figure 2. Bitmap representing the dataset.
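
One way to produce a reduced bitmap of the kind shown in Figure 2 is sketched below. It is a guess at the general procedure, not the authors' code: the block sizes, the choice of rows for transactions and columns for items, the descending frequency order, and the mapping of the largest count to gray level 255 are all assumptions made for illustration.

```python
from collections import Counter


def renumber_by_frequency(transactions):
    """Relabel items 0, 1, 2, ... in order of decreasing frequency.

    Ties are broken by the original item id (an arbitrary choice).
    Returns the relabelled transactions and the number of distinct items.
    """
    counts = Counter(item for t in transactions for item in t)
    order = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: r for r, item in enumerate(order)}
    return [[rank[item] for item in t] for t in transactions], len(rank)


def density_grid(transactions, n_items, item_block, trans_block):
    """Count item occurrences per (transaction block, item block) cell."""
    n_rows = -(-len(transactions) // trans_block)  # ceiling division
    n_cols = -(-n_items // item_block)
    grid = [[0] * n_cols for _ in range(n_rows)]
    for t_index, transaction in enumerate(transactions):
        row = grid[t_index // trans_block]
        for item in transaction:
            row[item // item_block] += 1
    return grid


def to_gray(grid, levels=255):
    """Scale the counts linearly so that the largest count maps to `levels`."""
    peak = max((max(row) for row in grid if row), default=0)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    return [[count * levels // peak for count in row] for row in grid]
```

Each cell of the grid aggregates `trans_block` consecutive transactions and `item_block` consecutive item ids, so the image shrinks by those factors along each axis. Whether high counts are drawn dark or light is a rendering choice that the note does not specify.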