<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WebDocs: a real-life huge transactional dataset.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Lucchese</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore Orlando</string-name>
          <email>orlando@dsi.unive.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele Perego</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Silvestri</string-name>
          <email>f.silvestri@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università Ca' Foscari di Venezia</institution>
          ,
          <addr-line>Venezia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ISTI-CNR, Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>FREQUENT ITEMSETS IN THE WEBDOCS DATASET</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Characteristics of the dataset</title>
      <p>This short note describes the main characteristics
of WebDocs, a huge real-life transactional dataset that we
made publicly available to the data mining
community through the FIMI repository. We built WebDocs
from a spidered collection of HTML web documents. The
whole collection contains about 1.7 million documents,
mainly written in English, and its size is about 5 GB.</p>
      <p>The transactional dataset was built from the web
collection in the following way. All the web documents
were first filtered by removing HTML tags and
the most common words (stopwords), and by applying
a stemming algorithm. We then generated from each
document a distinct transaction containing the set of all
the distinct terms (items) appearing in that
document.</p>
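      <p>The construction described above can be illustrated with a minimal Python sketch. It is not the code used to build WebDocs: the stopword list, the suffix-stripping stemmer, and the names <monospace>STOPWORDS</monospace>, <monospace>naive_stem</monospace> and <monospace>document_to_transaction</monospace> are assumptions made for illustration, since this note does not specify which stopword list or stemming algorithm was applied.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from html.parser import HTMLParser

# Hypothetical, tiny stopword list; the note does not say which list was used.
STOPWORDS = {"the", "a", "an", "and", "are", "of", "is", "in", "to"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def naive_stem(word):
    """Placeholder stemmer (assumption): strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_to_transaction(html):
    """Turn one HTML document into a transaction: a set of distinct stemmed terms."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z]+", text)          # crude tokenization (assumption)
    kept = (w for w in words if w not in STOPWORDS)
    return {naive_stem(w) for w in kept}         # a set keeps each term once
```
]]></preformat>
      <p>Because the result is a set, a term that occurs many times in a document contributes a single item to the transaction, which matches the definition given above.</p>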
      <p>The resulting dataset has a size of about 1.48 GB. It
contains exactly 1,692,082 transactions with 5,267,656
distinct items. The maximal length of a transaction is
71,472. Figure 1 plots the number of frequent
itemsets as a function of the support threshold, while
Figure 2 shows a bitmap representing the horizontal dataset,
where items were sorted by their frequency. Note that, to
reduce the size of the bitmap, each pixel was obtained by
counting the occurrences of a group of items
with consecutive IDs in a group of consecutive
transactions and assigning a gray level proportional to that
count.
</p>
      <p>[Figure 1: number of frequent itemsets (logarithmic axis, 10^1 to 10^5) versus support threshold (%, from 40 down to 5).]</p>
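      <p>The reduction used for the bitmap can be sketched as follows. This is only an illustration under stated assumptions: transactions are sets of integer item IDs already assigned in frequency order, the block sizes <monospace>item_block</monospace> and <monospace>txn_block</monospace> are hypothetical parameters, and the mapping of counts onto a 0 to 255 intensity scale is our choice, since the note only states that the gray level is proportional to the count.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def density_bitmap(transactions, num_items, item_block, txn_block):
    """Downsample a horizontal dataset into a grid of gray intensities.

    transactions: list of sets of integer item IDs in range(num_items).
    Each cell covers txn_block consecutive transactions and item_block
    consecutive item IDs, and stores an intensity proportional to the
    number of item occurrences that fall inside it.
    """
    rows = -(-len(transactions) // txn_block)   # ceiling division
    cols = -(-num_items // item_block)
    counts = [[0] * cols for _ in range(rows)]

    for t, items in enumerate(transactions):
        for item in items:
            counts[t // txn_block][item // item_block] += 1

    # Scale by the largest count; fall back to 1 to avoid dividing by zero.
    peak = max((c for row in counts for c in row), default=0) or 1
    return [[round(255 * c / peak) for c in row] for row in counts]
```
]]></preformat>
      <p>Whether the largest intensity is rendered as black or as white is a display decision that the sketch leaves open.</p>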
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>