WebDocs: a real-life huge transactional dataset.

Claudio Lucchese², Salvatore Orlando¹, Raffaele Perego², Fabrizio Silvestri²

¹ Dipartimento di Informatica, Università Ca’ Foscari di Venezia, Venezia, Italy, orlando@dsi.unive.it
² ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy, {r.perego,c.lucchese,f.silvestri}@isti.cnr.it

Characteristics of the dataset

This short note describes the main characteristics of WebDocs, a huge real-life transactional dataset that we made publicly available to the Data Mining community through the FIMI repository. We built WebDocs from a spidered collection of web HTML documents. The whole collection contains about 1.7 million documents, mainly written in English, and its size is about 5 GB.

The transactional dataset was built from the web collection in the following way. All the web documents were preliminarily filtered by removing HTML tags and the most common words (stopwords), and by applying a stemming algorithm. Then we generated from each document a distinct transaction containing the set of all the distinct terms (items) appearing within the document itself.

The resulting dataset has a size of about 1.48 GB. It contains exactly 1,692,082 transactions with 5,267,656 distinct items. The maximal length of a transaction is 71,472. Figure 1 plots the number of frequent itemsets as a function of the support threshold, while Figure 2 shows a bitmap representing the horizontal dataset, where items were sorted by their frequency. Note that, to reduce the size of the bitmap, it was obtained by counting the occurrences of a group of items with consecutive IDs in a block of consecutive transactions and assigning a level of gray proportional to that count.

Figure 1. Number of frequent itemsets discovered in the WebDocs dataset as a function of the support threshold. [Plot "Frequent itemsets in the WebDocs dataset": x-axis, support % (40 down to 5); y-axis, number of itemsets (logarithmic, 10^1 to 10^7).]

Figure 2. Bitmap representing the dataset.
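
The construction of a transaction from a web document, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build WebDocs: the stopword list, the suffix-stripping `naive_stem` function (a crude stand-in for a real stemming algorithm), the tokenization rule, and the `document_to_transaction` helper with its `item_ids` dictionary are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption; the real list is not given).
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "are"})


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_to_transaction(html, item_ids):
    """Return the sorted item ids of the distinct terms found in one document.

    item_ids maps each term to an integer id and is extended in place
    whenever a new term is met (ids start at 1).
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()

    # Keep each term once: a transaction is a set of distinct items.
    terms = set()
    for token in re.findall(r"[a-z]+", text):
        if token not in STOPWORDS:
            terms.add(naive_stem(token))

    # Sorting the terms makes the id assignment deterministic.
    return sorted(item_ids.setdefault(t, len(item_ids) + 1) for t in sorted(terms))
```

Calling `document_to_transaction` on every document with one shared `item_ids` dictionary yields one list of integer ids per document, that is, one transaction per document.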
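
The size reduction used for the bitmap of Figure 2 can be illustrated in the same spirit. The sketch below ranks items by decreasing frequency, counts the occurrences falling in each cell (a block of consecutive transactions crossed with a group of consecutively ranked items), and maps each count to a gray level proportional to it. The function name `gray_bitmap`, its parameters, the tie-breaking rule, and the scaling that maps the largest count to `levels` are assumptions; the note does not specify the block sizes or the exact gray scale.

```python
from collections import Counter


def gray_bitmap(transactions, rows_per_cell, items_per_cell, levels=255):
    """Return (counts, gray) for a downsampled view of a horizontal dataset.

    counts[r][c] is the number of item occurrences in transaction block r
    and item group c; gray[r][c] is that count scaled to 0..levels.
    """
    # Item frequencies; each item is counted once per transaction.
    freq = Counter(item for t in transactions for item in set(t))

    # Rank 0 is the most frequent item; ties are broken by item id.
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    rank = {item: r for r, (item, _) in enumerate(ordered)}

    # Ceiling divisions give the number of cells in each direction.
    n_rows = -(-len(transactions) // rows_per_cell)
    n_cols = -(-len(rank) // items_per_cell)
    counts = [[0] * n_cols for _ in range(n_rows)]

    for i, t in enumerate(transactions):
        for item in set(t):
            counts[i // rows_per_cell][rank[item] // items_per_cell] += 1

    # Gray level proportional to the count, with the peak mapped to `levels`.
    peak = max((c for row in counts for c in row), default=0)
    if peak == 0:
        return counts, [[0] * n_cols for _ in range(n_rows)]
    gray = [[c * levels // peak for c in row] for row in counts]
    return counts, gray
```

For example, with the four transactions `[1, 2, 3]`, `[1, 2]`, `[1, 3]`, `[1]` and cells of 2 transactions by 2 items, the most frequent items fall in the first column, so the top-left cell receives the highest count.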