<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Common Criteria for Genre Classification: Annotation and Granularity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marina Santini</string-name>
          <email>M.Santini@brighton.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Brighton</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present two experiments that use machine learning for automatically classifying web pages by genre. These experiments highlight the influence that genre annotation and genre granularity can have on the accuracy of the classification. From a practical point of view, these experiments show that a collection annotated with the criteria of 'objective sources' and consistent genre granularity ensures a very good classification accuracy (Experiment 1). Additionally, the classification model built out of such a collection can be exported more profitably for predictive tasks on an unclassified web page collection (Experiment 2). These experiments represent a starting point for a discussion about the need for common criteria for building a genre collection in the absence of an official genre-annotated benchmark.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In this paper, we present two experiments that use machine
learning for automatically classifying web pages by genre.</p>
      <p>
        Many definitions of genre have been proposed so far in
literary studies (e.g. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]), academic writing (e.g. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]),
professional settings (e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]), organizational
environment (e.g. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]), and so on. More specifically, in
automatic genre classification studies, genres have often been seen
as non-topical categories that could help reduce information
overload (e.g. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). In this area, not only text categories
such as ‘article’, ‘FAQs’, ‘home page’, etc. have been considered
to be genres, but also polarities, such as subjective-objective and
positive-negative ([
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), and style ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Regardless of the
different definitions and connotations, a classification by genre
has been acknowledged to be useful in information retrieval (e.g.
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], etc.), information filtering ([
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), digital libraries ([
        <xref ref-type="bibr" rid="ref19">19</xref>
        ])
and other practical applications.
      </p>
      <p>
        In this paper we present two experiments of genre
classification of web pages based on a simplified and intuitive
definition of genre, which is suitable for all kinds of genres –
including genres on the web – and for an automatic approach. In
our view, genres can be defined as named socio-cultural
communication artefacts, linked to a society or a community,
bearing standardized traits, leaving space for the creativity of the
text producer, and raising expectations in the text receiver. For
example, the personal home page (cf. also [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) has standard traits,
such as self-narration, personal interests, contact details, and often
pictures related to one’s life. However, these conventions do not
hinder the creativity of the producer, and as receivers, we expect a
blend of standardized information and personal touch. Though
unsophisticated, this definition of genre allows us to suggest a
practical solution to the main shortcoming in genre classification,
i.e. the lack of a genre-annotated benchmark. Because of this lack,
the main tendency has always been to build one’s own collection
according to subjective criteria for genre annotation and genre
granularity. This is especially true for genre studies based on
collections of web pages. Although building a genre-annotated
benchmark of web pages is difficult and maybe not feasible,
because annotating a web page by genre is both hard and
controversial (cf. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]), a few criteria should be discussed and
agreed upon. Without some kind of commonality, any comparison
becomes unfeasible. For instance, can we state that the 92%
accuracy achieved by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is better than the accuracy (about 70%)
achieved by [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]? The solution we suggest for building more
comparable genre collections is to exploit the socio-cultural
aspect of the concept of genre. As pointed out earlier, genres have
a function in a society, culture or community, i.e. they have a
social or public role that implies a number of conventions and
raises predictable expectations. This means that the role or the
function of different genres is recognized and correctly used in the
communication interaction. By leveraging this public and
collective acknowledgement, it is possible to create a
genre-annotated collection without involving human annotators. The key
is to download documents from genre-specific archives or portals
and use their membership in these containers as an automatic
membership in a specific genre. For example, eshops can be
randomly downloaded from the portal http://www.eshops.co.uk/
and considered to be eshops without any further manual
annotation or inter-rater agreement assessment. We also treat a
genre name used in the title of a document (for example, “Insects
Hotlist”) as public acknowledgement. The idea behind selecting
documents with a genre in the title or picking them randomly
from public resources, such as archives or portals, is the
following: if there is an archive, a portal or a website specialized
in, say, pointing to or collecting genres such as eshops, blogs or
search engines, this means that the documents pointed to or
collected there are considered to belong to these genres by the
collectivity of web users. We call this criterion ‘annotation by
objective sources’. A genre collection annotated by objective
sources tends to be more representative of intra-genre
variation than a collection annotated according to the genre
stereotypes that two, three, or more annotators have in mind.
We suggest that annotating a collection using objective sources is
faster and closer to real-world conditions.
      </p>
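      <p>The labelling step described above can be sketched as follows. This is a minimal illustration and not the procedure actually used to build the collection: the mapping SOURCE_TO_GENRE, the function names, and the pairing of each portal with a genre label are assumptions made for the example.</p>
      <preformat>
```python
from typing import Optional

# Hypothetical mapping from a genre-specific portal (URL prefix) to a genre label.
# The portals are named in this paper; the mapping itself is an assumption.
SOURCE_TO_GENRE = {
    "http://www.eshops.co.uk/": "eshop",
    "http://www.britblog.com/": "blog",
}


def genre_from_source(source_url: str) -> Optional[str]:
    """Return the genre implied by the container a page was downloaded from."""
    for prefix, genre in SOURCE_TO_GENRE.items():
        if source_url.startswith(prefix):
            return genre
    return None  # unknown source: leave the page unlabelled


def annotate(pages):
    """Label (page_url, source_url) pairs; pages from unknown sources are skipped."""
    labelled = []
    for page_url, source_url in pages:
        genre = genre_from_source(source_url)
        if genre is not None:
            labelled.append((page_url, genre))
    return labelled
```
      </preformat>
      <p>In this sketch no human judgement enters the labelling: membership in the source container alone decides the genre, which is the point of annotation by objective sources.</p>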
      <p>
        Genre granularity is also important when building a
collection for genre classification. In fact, genre palettes often
show different levels of granularity. For instance, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] includes in
his genre palette both FAQs (genre) and journalistic materials
(super-genre). We suggest the use of the prototype theory (cf. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
and [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) to achieve a consistent level of genre granularity. A
prototype is the most typical instance of a more encompassing or
fuzzy category. Categories that can be handled with prototype
theory can be ordered into a three-tiered hierarchy: superordinate
level, basic level and subordinate level. For example, the genre
‘advertisement’ represents the basic level (genre) of the
superordinate level ‘advertising’ (super-genre), while a ‘web ad’
represents the subordinate level (subgenre) of the basic level. The
basic level embodies the information level at which concepts are
most easily recognized, remembered and learned with respect to
their function. The basic level included in the prototype theory
should not be mixed up with document stereotypicality or
exemplarity. Building a genre collection choosing exemplars, i.e.
only stereotypical documents, to unambiguously represent a genre
can return biased results. According to the prototype theory,
instead, instances of a genre may vary in their prototypicality, thus
allowing intra-genre variation.
      </p>
      <p>
        The two experiments presented in this paper highlight the
influence that genre annotation and genre granularity can have on
the accuracy of genre classification of web pages. They were
designed to point out several issues (some already covered in
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]). In this paper, these two experiments allow us to emphasize
two general aspects of genre classification, one practical and one
theoretical. From a practical point of view these experiments show
that a collection annotated with the criteria of objective sources
and consistent genre granularity ensures a very good classification
accuracy (Experiment 1). Additionally, the classification model
built out of such a collection can be exported more profitably for
predictive tasks on an unclassified web page collection
(Experiment 2). From a theoretical point of view, they represent a
starting point for a discussion about the need for common criteria
in the absence of an official genre-annotated benchmark.
      </p>
      <p>In order to ensure replicability, all the materials used for
these experiments, including web page collections, feature sets
and the manual evaluation of Experiment 2, are available at
http://www.nltg.brighton.ac.uk/home/Marina.Santini/, bottom of the
page.</p>
      <p>The paper is organized as follows: Section 2 provides an
overview of recent work in genre classification of web pages;
Section 3 presents the web page collections and the two
experiments; conclusions are drawn in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>PREVIOUS WORK</title>
      <p>
        Several experiments have been recently carried out with genres
and web pages. Here we list the latest studies in order to show
how difficult it is to compare their results in the absence of common
criteria for corpus building and genre palettes.
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: Number of web pages: 2150; Annotation: single rater;
Categories: subjectivity, positive-ness. They tried to discriminate
among texts coming from different domains in terms of two
polarities: subjective vs. objective and positive vs. negative. Their
aim was to see how a classification model tuned on one domain
performed in another domain. According to their results, in single
domain classification the best accuracy is achieved with
MultiView-Ensemble (MVE) (see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for details) for subjectivity, and
with bag-of-words (BOW) features for positive-ness. In domain
transfer classification, the best accuracy is achieved with
Parts-of-Speech (POS) tags for subjectivity and MVE for positive-ness.
Although it is true that genres can be divided into more subjective
genres (e.g. editorials), or more objective genres (e.g. surveys),
and that the opposition positive-negative can suggest specific
genres (such as reviews), these two polarities can hardly be
considered as “genres” in themselves. Nonetheless, [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]’s
contribution is extremely valuable because they shed some light
on the performance of different feature sets across several
domains, providing insight into the extent of feature exportability.
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: Number of web pages: 2700; Annotation: one or more raters;
Categories: functional styles. They carried out an experiment on
style-dependent document ranking. Their research explored the
possibility of incorporating style-dependent ranking into ranking
schemata for searching the web and digital libraries. Their basic
idea was to reduce styles (more specifically, the five functional
styles theorized by the School of Prague) to a single continuous
parameter. Despite the promising preliminary results, they
could see little improvement in relevance ranking when stylistic
parameters were included.
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: Number of web pages: 343; Genre annotation: the author
plus at least one or more raters; Genres: abstract, call for
papers, FAQs, hub/sitemap, job description, resume/C.V.,
statistics, syllabus, technical paper. She tried out the efficiency of
several feature sets and automatic feature selection techniques on
a small corpus of 10 genres, using a number of classification
algorithms. Although her results can be considered only indicative
given the reduced number of pages per genre (an average of 20
web pages per genre class), she made interesting remarks about
discrimination across similar genres, and the influence of the
genre palette and document exemplarity on discrimination tasks.
Her best accuracy (92.1%) was achieved by one of the feature
combinations resulting from an automatic feature selection
technique.
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: Number of web pages: 321; Genre annotation: not stated;
Genres: personal, corporate, organizational home pages,
including also non-home pages, as noise. They tried the hard task
of home page genre discrimination. The best accuracy (71.4%) is
achieved on personal home pages with a single classifier, manual
feature selection, and without noisy pages.
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: Number of web pages: 1224; Genre annotation: two
graduate students; Genres: personal home page, public home
page, commercial home page, bulletin collection, link collection,
image collection, simple table/lists, input pages, journalistic
material, research report, official materials, FAQs, discussions,
product specification, informal texts (poem, fiction, etc.). They
investigated the efficiency of several feature sets to discriminate
across these 16 genres. They also tested the classification
efficiency on different parts of the web page space (title and
metacontent, body, and anchors). The best accuracy (75.7%) was
achieved with one of their feature sets when applied only to the
body and anchors.
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]: Number of web pages: 800; Genre annotation: three raters;
Genres: help, article, discussion, shop, portrayal (non-private),
portrayal (private), link collection, download. They worked out a
genre palette of eight genres following the outcome of a study on
genre usefulness. As they aimed at a classification performed on
the fly, they assessed features according to the computational
effort they required, giving preference to those requiring low or
medium effort. They achieved around 70% accuracy with
discriminant analysis on the palette of eight genres. Other results
relate to groups of genres tailored for web user profiles.
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the follow up [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: Number of web pages: 321; Genre
annotation: at least two raters; Genres: reportage-editorial,
research article, review, home page, Q&amp;A, specification. They
aimed at selecting genre-revealing terms from the training
document set using a collection of web pages annotated both at
topic level and at genre level. Their formula (the deviation
formula) makes use of both genre-classified documents and
subject-classified documents and eliminates terms that are more
subject-related than genre-related. They report a micro-average of
precision and recall of about 90%.
As already stressed, the absence of common criteria or evaluation
ground makes most of these experiments (see Table 1 for a
summary) difficult to compare, however fruitful each study can be
in itself. A cross-evaluation of these experiments remains virtually
unfeasible because genre palettes are mostly disparate. Even in
the case of ‘home page’, which is probably one of the few genres
common to several experiments, any comparison appears to be
difficult, because selection criteria and level of exemplarity are
not declared. The two criteria of annotation by objective sources
and consistent level of granularity are suggested to overcome this
lack of comparability.
      </p>
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-4">
      <title>7-Web-Genre Collection</title>
      <p>The 7-web-genre collection includes 200 English web pages per
genre, amounting to a total of 1,400 web pages (available online
at the URL reported in the Introduction). These web pages were
collected by the author of this paper in early spring 2005. This
collection was built with genres belonging to a consistent level of
granularity and applying the annotation by objective sources. The
seven web genres included in the collection are the following:
1. blog
2. eshop
3. FAQs
4. online newspaper front page
5. list
6. personal home page (footnote 2)
7. search page</p>
      <p>Footnote 2: ‘Personal home page’ is the basic level of the superordinate level ‘home
page’ and has ‘academic personal home page’, ‘administrative personal
home page’, etc. as subordinate level.</p>
      <p>The web pages included in the 7-web-genre collection were
randomly downloaded from the following public archives or
portals (download date: Feb-March 2005):</p>
      <sec id="sec-4-1">
        <title>Blogs:</title>
        <p>o http://www.britblog.com/
o http://www.nataliedarbeloff.com/augustinearchive.html</p>
      </sec>
      <sec id="sec-4-2">
        <title>Eshops:</title>
        <p>o http://www.shops.co.uk/
o http://www.eshops.co.uk/</p>
        <p>FAQs:
o http://www.cybernothing.org/faqs/net-abuse-faq.html
o http://www.irs.gov/faqs/
o http://www.copyright.gov/help/faq/
o http://www.aoml.noaa.gov/hrd/tcfaq/tcfaqHED.html</p>
        <p>Newspaper front pages belong to a number of different
online newspapers and are available at the Internet Archive:
o www.archive.org</p>
        <p>Personal home pages are heterogeneous, and include
academic and administrative personal home pages, as well as
more informal personal home pages. They were downloaded
from:
o http://dmoz.org/Society/People/Personal_Homepages/
o http://www.math.unl.edu/~mbritten/ldt/homepage.html
o http://www.bradley.edu/people/fac-staff.html
o http://www.daimi.au.dk/local/map/PeopleandLocationsPeopleFrame.html
o http://www.mit.edu/Home-byUser.html
o http://dir.yahoo.com/Society_and_Culture/People/Personal_Home_Pages
o http://hpsearch.uni-trier.de/hp/a-tree/</p>
        <p>Search pages come from:
o http://www.searchenginecolossus.com/</p>
        <p>The web pages included in the genre ‘list’ were selected by
searching keywords in Google and selecting relevant web pages
from the results. All the lists include one of the following
keywords (and orthographic variants) in the heading: checklist,
hot list, table of contents, and sitemap (see, for example, Insect
Hotlist at http://www.fi.edu/tfi/hotlists/insects.html).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>KI-04 corpus</title>
      <p>
        The KI-04 corpus was built following a palette of eight genres
suggested by a user study on genre usefulness ([
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). It includes
1,295 English web pages (HTML documents), but only 800 web
pages (100 per genre) were used in the experiment described in
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In Experiment 1, we used 1,205 web pages because some
web pages were empty (both original version, 1,295 web pages,
and working version, 1,205 web pages, are available online at the
URL reported in the Introduction). KI-04 corpus includes:
1. article (127 web pages)
2. download (151 web pages)
3. link collection (205 web pages)
4. portrayal (priv.) (126 web pages)
5. discussion (127 web pages)
6. help (139 web pages)
7. portrayal (non-priv.) (163 web pages)
8. shop (167 web pages)
The KI-04 corpus was collected using bookmarks from about five
people. Some genres were extended to get a better balance. The
corpus was sorted by three people, one of whom wrote a bachelor
thesis (in German) on the corpus building process. One of the
authors of [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] checked many of the pages, and most of the sorting
complied with his understanding of the genre categories. The
download date was January 26th, 2004.
      </p>
    </sec>
    <sec id="sec-6">
      <title>SPIRIT collection</title>
      <p>
        The SPIRIT collection is a random crawl carried out in 2001 (see
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). It contains single web pages and not full websites. The size
of the whole collection is about one terabyte, and the number of
web pages (mostly HTML files) is about 95 million. It is
multilingual and without any meta-information, apart from a short
header including the original URL, the date and time when the
pages were crawled from the web, and a few other details. It
represents a genuine slice of the real web. In Experiment 2, we
used only 1,000 English web pages (available online at the URL
reported in the Introduction) from this random, multilingual and
unclassified collection.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Experiment 1</title>
      <p>The practical aim of Experiment 1 was to build two single-label
discrete classification models, one out of the 7-web-genre
collection, the other from KI-04 corpus, and compare their
accuracy results. Both collections were submitted to the same
preprocessing. The unit of analysis was a single static web page in
HTML format.</p>
      <p>The feature set, called 1_set, used in Experiment 1 includes:
• the 50 most common words in English;
• 24 Part-of-Speech (POS) tags;
• 8 punctuation marks: full stop (.), colon (:), semi-colon (;), comma (,),
exclamation mark (!), question mark (?), apostrophe ('), and quotes (");
• genre-specific words (footnote 3);
• 28 HTML tags;
• 1 nominal attribute representing the length of the web page (SHORT,
MEDIUM and LONG).</p>
      <p>
(This feature set, together with a description, is available online at
the URL reported in the Introduction). The classification
algorithm used both in Experiments 1 and 2 is SMO (which
implements the Sequential Minimal Optimisation (SMO) for
training support vectors) with default parameters and logistic
regression model, from Weka machine learning workbench ([
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]).
Accuracy results, shown in Table 2, are averaged over stratified
10-fold cross-validation repeated 10 times.
As Table 2 shows, the accuracy of the model built with
the 7-web-genre collection is much higher than that of the model built
with the KI-04 corpus (+21.7%).
      </p>
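      <p>The evaluation protocol (stratified 10-fold cross-validation repeated 10 times, with accuracy averaged over all runs) can be sketched in plain Python as follows. This is an illustrative sketch, not the Weka implementation used in the experiments: the function names and the caller-supplied evaluate(train, test) callback, which is expected to train a classifier on the training indices and return its accuracy on the test indices, are assumptions.</p>
      <preformat>
```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k, rng):
    """Split item indices into k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled items of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def repeated_cv_accuracy(labels, evaluate, k=10, repeats=10, seed=0):
    """Average evaluate(train, test) over `repeats` runs of stratified k-fold CV."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            scores.append(evaluate(train, test))
    return mean(scores)
```
      </preformat>
      <p>With seven genres of 200 pages each, every fold in this sketch holds 140 pages (20 per genre), and the reported figure would be the mean of 100 accuracy values.</p>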
      <p>
        In order to see whether the feature set was too tailored or biased
towards the 7-web-genre collection, we compared the accuracy of
this feature set on KI-04 corpus with the accuracy rates reported
in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. To make this comparison possible, we ran discriminant
analysis using our feature set on KI-04 corpus. As [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] ran their
discriminant analysis only on 800 web pages while we used 1,205
      </p>
      <p>Footnote 3: Genre-specific words were selected through a cursory manual analysis.
A total of 13 sets of genre-specific words were built: 13 and not 15
because two sets were shared across the two collections, namely those
related to home-page/portrayal (priv) and eshop/shop. It is worth saying
that genre-specific words (available online at the URL reported in the
Introduction) are not numerous. For example, the genre-specific words for
the search web genre are only: search, crawl, directories, engine, find,
and see.</p>
      <p>
Our feature set performs better than [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]’s feature set. Although
the difference is rather small (+2.1%), it is statistically significant
(chi-square test). This means that our feature set is not biased
toward the 7-web-genre collection, but it performs significantly
better than [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]’s feature set on KI-04 corpus with discriminant
analysis, i.e. the same algorithm used in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Discussion
Experiment 1 compares the accuracies of two models built with
the same classification algorithm, the same feature set but
different web page collections, the 7-web-genre collection and
the KI-04 corpus. The accuracy on the 7-web-genre collection (1,400
web pages) is above 90% while the accuracy on KI-04 corpus is
definitely lower. A first thought was that our feature set did not
represent the genre palette of KI-04 corpus adequately. However,
after having compared the performance of our feature set with
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]’s feature set using the same algorithm (discriminant analysis)
on the same collection, we saw that the accuracy achieved by our
feature set was slightly higher than the accuracy stated in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
Although the KI-04 corpus contains eight genres, i.e. one genre more
than the 7-web-genre collection (error rate usually increases with
the number of categories), this does not justify such a wide
gap in classification accuracy. Also, it is important to stress
that genre-specific words are tailored to the genre palette. This
means that the genre-specific words used for the 7-web-genre
collection account for blogs, search, front page, etc., while those
employed for the KI-04 corpus include words related to articles,
discussion, download, etc. Since these two genre palettes have
two web genres in common, i.e. home page/portrayal (priv) and
eshop/shop, in these two cases the same set of genre-specific
words was used for both web genre collections. That the feature
set used on the KI-04 corpus is not biased towards the 7-web-genre
collection is confirmed by the results shown in Table 3,
where the performance of our feature set is higher than that of [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]'s
feature set.
      </p>
      <p>
        In conclusion, if neither the feature set nor the classification
algorithm is the cause of this large discrepancy in accuracy, then
the suspicion is that the selection of the web pages representing
genres in KI-04 corpus might be responsible for the lower
performance. Although the issue of subjectivity of the assignment
of genre to web pages needs further investigation (cf. also [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), for
the time being we interpret the higher performance on the
7-web-genre collection as a result of the application of the two criteria of
annotation by objective sources and consistent genre
granularity.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Experiment 2</title>
      <p>The goal of Experiment 2 was to see whether the classification
model built with the collection complying with the criteria of
annotation by objective sources and consistent genre granularity is
also more effective for predictive tasks. In other words,
predictions are used here as a kind of evaluation metric of the
efficiency of classification models.</p>
      <p>In this experiment we used the two classification models
built in the previous experiment together with additional models.
The practical aim was to make predictions on unclassified and
non-annotated web pages, i.e. 1,000 random English web pages
from the SPIRIT collection. The relevance of the agreed upon
web pages (see Tables 5 and 6) to a genre was manually assessed
by the author of this paper (the breakdown of this manual
evaluation is available online at the URL reported in the
Introduction).</p>
      <p>When making a prediction, the classifier returns a probability
score to be interpreted in terms of classification confidence. This
confidence score can be exploited when assessing the value of a
prediction and for setting a threshold for reliable guesses. In order
to get predictions on genre labels which were as reliable as
possible, we devised an approach inspired by co-training. The
basic idea was to exploit three different views (i.e. three different
feature sets) on the same data. When the three models built with
the three feature sets agreed on the same genre label (3-out-of-3
agreement) with a very high confidence score, namely &gt;=0.9, we took
this as an indication of a good prediction. Additionally, as we have
two web page collections with two different genre palettes, we can
have multi-label predictions. Ideally, a web page might get a
prediction of “personal home page”, following the palette adopted
in the 7-web-genre collection, and “portrayal (private)”, following
the genre palette adopted in KI-04 corpus. Also, as the two
palettes are mostly not overlapping, it is interesting to see which
palette is more suitable for the classification of this SPIRIT
random sample. From the previous experiment we had two
models built with a single feature set (1_set). To these models, we
added four models (two per collection) in order to get
three simultaneous views on each collection. The additional
models were built using the feature sets called 2_set and 3_set
(these feature sets, together with a description, are available
online at the URL reported in the Introduction).
2_set contains the following features:
• POS trigrams;
• 8 punctuation symbols (as above);
• genre-specific words (as above);
• 28 HTML tags (as above);
• 1 nominal attribute representing the length of the web page (as above).
3_set contains the following features:
• 86 linguistic facets (footnote 4);
• genre-specific words;
• 6 HTML facets;
• 1 nominal attribute representing the length of the web page (as above).
      </p>
      <p>Footnote 4: Linguistic facets and HTML facets are groups of features highlighting
an aspect of the communicative context that is reflected in the use of
language. They are listed in the URL reported in the Introduction.</p>
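      <p>The agreement rule used to accept a prediction (three models, one per feature set, agreeing on the same genre label with a confidence score of at least 0.9) can be sketched as follows. The function name, the input format, and the min_votes parameter are assumptions made for illustration; in particular, applying the same confidence threshold to a 2-out-of-3 agreement is an assumption, since the text does not specify it.</p>
      <preformat>
```python
from collections import Counter


def agreed_genre(predictions, threshold=0.9, min_votes=3):
    """Return the genre label backed by enough confident models, else None.

    predictions: one (label, confidence) pair per model (one per feature set).
    """
    # Keep only the labels predicted with a confidence at or above the threshold.
    confident = [label for label, confidence in predictions if confidence >= threshold]
    if not confident:
        return None
    label, votes = Counter(confident).most_common(1)[0]
    return label if votes >= min_votes else None
```
      </preformat>
      <p>For example, three (label, confidence) pairs of ("blog", 0.95), ("blog", 0.92) and ("blog", 0.99) yield "blog", whereas the same call with one confidence lowered to 0.85 yields no prediction under the default 3-out-of-3 rule.</p>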
      <p>However, predictions are even sparser with the models built using
the KI-04 corpus (Table 6). As there was no 3-out-of-3 agreement for
discussion, download, help, and portrayal (non-private), these
genres were evaluated with 2-out-of-3 agreement. No correct
guesses were returned for article, discussion, download, and help.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS</title>
      <p>In this paper we pointed out how classification models learned
from a web collection annotated by genre using the two criteria of
annotation by objective sources and consistent level of granularity
can return higher accuracy and a higher number of correct
predictions.
The annotation by objective sources is not only less subjective and
closer to real-world conditions, but also much faster than
annotation by human raters, which is usually time-consuming,
controversial, and expensive. Further, a collection built with a
consistent level of genre granularity seems to be learned more
profitably by the classifier. Together, these two criteria enhance
the performance of classification algorithms.</p>
      <p>However, a full comparison between the results achieved
with the two web page collections built with different criteria is
not entirely feasible because the two genre palettes are mostly
different. Nonetheless, these findings indicate a tendency
that can be investigated further in the future. It is also worth pointing
out that objective sources may still contain biases. Biases in web
collections relate to the well-known issue of 'corpus
representativeness', dating back to Chomsky’s aversion to the use
of corpora. Nowadays, however, with the web
available, biases can be alleviated by randomly sampling web
pages from several genre-specific web archives or portals.</p>
      <p>Although the two criteria of annotation by objective source
and consistent level of granularity represent a practical solution
that can help genre classification, the concept of genre remains
hard to capture computationally and statistically in its entirety.</p>
      <p>
        First, it would be interesting to investigate further the
ideal proportion among corpus size, number of features and
number of classes, and its influence on classification results. Also,
up to now only single-label discrete classification has been tried
in genre classification studies. Experiment 2 implicitly shows
a simple method that can be exploited for multi-label classification:
the use of concurrent genre palettes over the same unclassified
collection. Ideally, several classification models, each built
from a different collection annotated by external sources with
consistent granularity and each covering a different genre palette, can
suggest several genre labels for the same web page. Multi-genre
documents and genre hybridism are particularly common among
web pages, which appear much more unpredictable
and individualized than paper documents. Using concurrent genre
palettes might represent an alternative to the multi-faceted
approach by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. What is less reassuring is the absence of
proper evaluation metrics for multi-label problems. We leave
these problems open to further investigation and invite the genre
classification community to make use of the three collections
employed in these experiments and now available online.
      </p>
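      <p>The idea of concurrent genre palettes can be sketched as follows, under stated assumptions: each palette is represented by one classifier, and a page simply collects one label per palette. The two rule-based stand-in classifiers, the palette names and the genre labels are hypothetical placeholders and are not the models or palettes used in the experiments.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict

# A classifier maps the text of a web page to a single genre label.
Classifier = Callable[[str], str]


def concurrent_labels(page: str, models: Dict[str, Classifier]) -> Dict[str, str]:
    """Apply one single-label classifier per palette and keep every label."""
    return {palette: classify(page) for palette, classify in models.items()}


# Stand-in classifiers with toy rules (hypothetical, for illustration only).
def palette_a(page: str) -> str:
    return "faq" if "?" in page else "blog"


def palette_b(page: str) -> str:
    return "help" if "how" in page.lower() else "article"


labels = concurrent_labels(
    "How do I reset my password?",
    {"palette_a": palette_a, "palette_b": palette_b},
)
```
]]></preformat>
      <p>Each model remains a single-label classifier; the multi-label description of the page arises only from running the palettes side by side, which is why no change to the individual learners is needed in this sketch.</p>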
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Avneri</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Routing documents according to style</article-title>
          ,
          <source>Proc. First International Workshop on Innovative Internet Information Systems</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bhatia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <source>Analysing Genre: Language Use in Professional Settings</source>
          , Longman, London and New York,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Boese</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>Stereotyping the Web: Genre Classification of Web Documents</article-title>
          ,
          M.S. Thesis, Colorado State Univ.,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Boese</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Howe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Effects of Web Document Evolution on Genre Classification</article-title>
          ,
          <source>CIKM'05</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Braslavski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tselischev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Experiment on Style-Dependent Document Ranking</article-title>
          ,
          <source>Proc. of the 7th Russian Conference on Digital Libraries</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Dillon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gushrowski</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Genres and the Web: is the personal home page the first uniquely digital genre?</article-title>
          ,
          <source>JASIS</source>
          ,
          <volume>51</volume>
          (
          <issue>2</issue>
          ),
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Finn</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kushmerick</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>Learning to classify documents according to genre</article-title>
          ,
          <source>JASIST</source>
          , Special Issue,
          <volume>7</volume>
          (
          <issue>5</issue>
          ),
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>The SPIRIT collection: an overview of a large web collection</article-title>
          ,
          <source>SIGIR Forum</source>
          ,
          <volume>38</volume>
          (
          <issue>2</issue>
          )
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Karlgren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Stylistic Experiments for Information Retrieval</article-title>
          , Thesis submitted for the degree of Doctor of Philosophy, Stockholm University, Sweden,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Kennedy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shepherd</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Automatic Identification of Home Pages on the Web</article-title>
          ,
          <source>Proc. 38 HICSS</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kessler</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunberg</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>Automatic Detection of Text Genre</article-title>
          ,
          <source>Proc. 35 Annual Meeting of the ACL and 8th Conference of the EACL</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Kwasnik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crowston</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nilan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Roussinov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Identifying document genre to improve web search effectiveness</article-title>
          .
          <source>The Bulletin of the American Society for Information Science and Technology</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ),
          <fpage>23</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Genres, Registers, Text types, Domains, and Styles: Clarifying the concepts and navigating a path through the BNC Jungle</article-title>
          ,
          <source>Language Learning and Technology</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>37</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Myaeng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Automatic Identification of Text Genres and Their Roles in Subject-Based Categorization</article-title>
          ,
          <source>Proc. 37 HICSS</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Myaeng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Text Genre Classification with Genre-Revealing and Subject-Revealing Features</article-title>
          ,
          <source>Proc. 25 Annual International ACM SIGIR</source>
          ,
          <fpage>145</fpage>
          -
          <lpage>150</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Automatic Genre Detection of Web Documents</article-title>
          , in Su K., Tsujii J., Lee J., Kwong O. Y. (eds.)
          <source>Natural Language Processing</source>
          , Springer, Berlin,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Meyer zu Eissen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Genre Classification of Web Pages: User Study and Feasibility Analysis</article-title>
          , in Biundo S., Frühwirth T., Palm G. (eds.),
          <source>Advances in Artificial Intelligence</source>
          , Springer, Berlin,
          <fpage>256</fpage>
          -
          <lpage>269</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Paltridge</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Working with genre: A pragmatic perspective</article-title>
          ,
          <source>Journal of Pragmatics</source>
          ,
          <volume>24</volume>
          ,
          <fpage>393</fpage>
          -
          <lpage>406</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Müller-Kögler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Integrating Automatic Genre Analysis into Digital Libraries</article-title>
          , ACM/IEEE joint Conference on Digital Libraries, Roanoke, USA,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Rosmarin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <source>The Power of Genre</source>
          , University of Minnesota Press, Minneapolis,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Santini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Genres In Formation? An Exploratory Study of Web Pages using Cluster Analysis</article-title>
          ,
          <source>Proc. CLUK 05</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Santini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Some Issues in Automatic Genre Classification of Web Pages</article-title>
          .
          <source>Proc. of the JADT 2006 Besançon</source>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Swales</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <source>Genre Analysis</source>
          , Cambridge University Press, Cambridge,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Trosborg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (ed.),
          <source>Analysing Professional Genres</source>
          , J. Benjamins Publishing Company, Amsterdam,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <source>Data Mining: Practical Machine Learning Tools and Techniques</source>
          , Morgan Kaufmann Publishers, Amsterdam, second edition,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Yates</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Orlikowski</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <article-title>Genres of organizational communication: A structural approach to studying communications and media</article-title>
          ,
          <source>Academy of Management Review</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ),
          <fpage>299</fpage>
          -
          <lpage>326</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>