=Paper=
{{Paper
|id=Vol-126/paper-13
|storemode=property
|title=WebDocs: a real-life huge transactional dataset
|pdfUrl=https://ceur-ws.org/Vol-126/webdocs.pdf
|volume=Vol-126
|dblpUrl=https://dblp.org/rec/conf/fimi/LuccheseOPS04
}}
==WebDocs: a real-life huge transactional dataset==
Claudio Lucchese², Salvatore Orlando¹, Raffaele Perego², Fabrizio Silvestri²

¹ Dipartimento di Informatica, Università Ca’ Foscari di Venezia, Venezia, Italy, orlando@dsi.unive.it

² ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy, {r.perego, c.lucchese, f.silvestri}@isti.cnr.it
===Characteristics of the dataset===
This short note describes the main characteristics of WebDocs, a huge real-life transactional dataset that we made publicly available to the Data Mining community through the FIMI repository. We built WebDocs from a spidered collection of web HTML documents. The whole collection contains about 1.7 million documents, mainly written in English, and its size is about 5 GB.

[Figure 1 plot, titled "Frequent itemsets in the WebDocs dataset". x-axis: support %, from 40 down to 5; y-axis: number of itemsets, logarithmic scale from 10^1 to 10^7.]

Figure 1. Number of frequent itemsets discovered in the WebDocs dataset as a function of the support threshold.

The transactional dataset was built from the web collection in the following way. All the web documents were preliminarily filtered by removing HTML tags and the most common words (stopwords), and by applying a stemming algorithm. Then we generated from each document a distinct transaction containing the set of all the distinct terms (items) appearing within the document itself.
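
The document-to-transaction step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the stopword list, the suffix-stripping `toy_stem` function, and the way integer item ids are assigned are all assumptions made for illustration, since the note does not say which stopword list, stemmer, or id scheme was used.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list (the note does not name one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards the tags around them.

    Simplification: text inside <script> or <style> is kept as well.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def toy_stem(word):
    """Placeholder suffix stripper; NOT the stemming algorithm of the paper."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_to_transaction(html, vocabulary):
    """Map one document to a sorted list of distinct integer item ids.

    `vocabulary` is a dict shared across documents (term -> item id);
    unseen terms receive the next free id (an assumed id scheme).
    """
    words = re.findall(r"[a-z]+", strip_tags(html).lower())
    terms = {toy_stem(w) for w in words if w not in STOPWORDS}
    # Iterate in sorted order so that id assignment is deterministic.
    return sorted(vocabulary.setdefault(t, len(vocabulary)) for t in sorted(terms))
```

Calling `document_to_transaction` once per document with the same `vocabulary` dictionary yields one transaction per document. Repeated words within a document collapse to a single item, which matches the "set of all the distinct terms" wording above.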
The resulting dataset has a size of about 1.48 GB. It contains exactly 1,692,082 transactions with 5,267,656 distinct items. The maximal length of a transaction is 71,472. Figure 1 plots the number of frequent itemsets as a function of the support threshold, while Figure 2 shows a bitmap representing the horizontal dataset, where items were sorted by their frequency. Note that, to reduce the size of the bitmap, each pixel was obtained by counting the occurrences of a group of items with consecutive ids within a subset of consecutive transactions, and assigning a gray level proportional to that count.
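
The bitmap reduction just described can be sketched in a few lines of Python. This is an illustrative reading of the description, not the authors' code: the function name `density_bitmap`, the block-size parameters, the 0 to 255 gray range, and the choice to scale by the largest block count are assumptions. Item ids are assumed to be integers already ordered by frequency.

```python
def density_bitmap(transactions, num_items, item_block, tx_block):
    """Downsample a horizontal dataset into a grid of gray levels.

    Rows correspond to blocks of `tx_block` consecutive transactions,
    columns to blocks of `item_block` consecutive item ids. Each cell
    holds a value in [0, 255] proportional to the number of item
    occurrences falling in that block (scaling is an assumption).
    """
    rows = (len(transactions) + tx_block - 1) // tx_block
    cols = (num_items + item_block - 1) // item_block
    counts = [[0] * cols for _ in range(rows)]

    # Count item occurrences per (transaction block, item block) cell.
    for t_index, transaction in enumerate(transactions):
        for item in transaction:
            counts[t_index // tx_block][item // item_block] += 1

    # Scale by the largest count; fall back to 1 to avoid division by zero.
    peak = max((c for row in counts for c in row), default=0) or 1
    return [[round(255 * c / peak) for c in row] for row in counts]
```

For example, four transactions over four items with 2 by 2 blocks produce a 2 by 2 grid. Whether a higher count should be drawn darker or lighter is a rendering decision that the note leaves open.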
Figure 2. Bitmap representing the dataset.