=Paper= {{Paper |id=Vol-205/paper-9 |storemode=property |title=Common Criteria for Genre Classification: Annotation and Granularity |pdfUrl=https://ceur-ws.org/Vol-205/paper9.pdf |volume=Vol-205 }} ==Common Criteria for Genre Classification: Annotation and Granularity== https://ceur-ws.org/Vol-205/paper9.pdf
                     Common Criteria for Genre Classification:
                          Annotation and Granularity
                               Marina Santini¹

¹ University of Brighton (UK); M.Santini@brighton.ac.uk

ABSTRACT
In this paper, we present two experiments that use machine learning for automatically classifying web pages by genre. These experiments highlight the influence that genre annotation and genre granularity can have on the accuracy of the classification. From a practical point of view, these experiments show that a collection annotated with the criteria of ‘objective sources’ and consistent genre granularity ensures a very good classification accuracy (Experiment 1). Additionally, the classification model built out of such a collection can be exported more profitably for predictive tasks on an unclassified web page collection (Experiment 2). These experiments represent a starting point for a discussion about the need for common criteria for building a genre collection in the absence of an official genre-annotated benchmark.

1       INTRODUCTION

In this paper, we present two experiments that use machine learning for automatically classifying web pages by genre.
      Many definitions of genre have been proposed so far in literary studies (e.g. [20]), academic writing (e.g. [23]), professional settings (e.g. [2] and [24]), organizational environment (e.g. [26]), and so on. More specifically, in automatic genre classification studies, genres have often been seen as non-topical categories that could help reduce information overload (e.g. [16] or [15]). In this area, not only text categories such as ‘article’, ‘FAQs’, ‘home page’, etc. have been considered to be genres, but also polarities, such as subjective-objective and positive-negative ([7]), and style ([1], [9] and [5]). Regardless of the different definitions and connotations, a classification by genre has been acknowledged to be useful in information retrieval (e.g. [9], [12], etc.), information filtering ([7]), digital libraries ([19]) and other practical applications.
      In this paper we present two experiments of genre classification of web pages based on a simplified and intuitive definition of genre, which is suitable for all kinds of genres (including genres on the web) and for an automatic approach. In our view, genres can be defined as named socio-cultural communication artefacts, linked to a society or a community, bearing standardized traits, leaving space for the creativity of the text producer, and raising expectations in the text receiver. For example, the personal home page (cf. also [6]) has standard traits, such as self-narration, personal interests, contact details, and often pictures related to one’s life. However, these conventions do not hinder the creativity of the producer, and as receivers, we expect a blend of standardized information and personal touch. Though unsophisticated, this definition of genre allows us to suggest a practical solution to the main shortcoming in genre classification, i.e. the lack of a genre-annotated benchmark. Because of this lack, the main tendency has always been to build one’s own collection according to subjective criteria for genre annotation and genre granularity. This is especially true for genre studies based on collections of web pages. Although building a genre-annotated benchmark of web pages is difficult and maybe not feasible, because annotating a web page by genre is both hard and controversial (cf. [21]), a few criteria should be discussed and agreed upon. Without some kind of commonality, any comparison becomes unfeasible. For instance, can we state that the 92% accuracy achieved by [3] is better than the accuracy (about 70%) achieved by [17]? The solution we suggest for building more comparable genre collections is to exploit the socio-cultural aspect of the concept of genre. As pointed out earlier, genres have a function in a society, culture or community, i.e. they have a social or public role that implies a number of conventions and raises predictable expectations. This means that the role or the function of different genres is recognized and correctly used in the communication interaction. By leveraging this public and collective acknowledgement, it is possible to create a genre-annotated collection without involving human annotators. The key is to download documents from genre-specific archives or portals and use their membership in these containers as an automatic membership in a specific genre. For example, eshops can be randomly downloaded from the portal http://www.eshops.co.uk/ and considered to be eshops without any further manual annotation or inter-rater agreement assessment. We also include in the public acknowledgement genres used in document titles (for example, “Insects Hotlist”). The idea behind selecting documents with a genre in the title or picking them up randomly from public resources, such as archives or portals, is the following: if there is an archive, a portal or a website specialized in, say, pointing to or collecting genres such as eshops, blogs or search engines, this means that the documents pointed to or collected there are considered to belong to these genres by the collectivity of web users. We call this criterion ‘annotation by objective sources’. A genre collection annotated by objective sources tends to be more representative in terms of intra-genre variation than a collection annotated relying on the genre stereotypicality that two, three, or more annotators have in mind. We suggest that annotating a collection using objective sources is faster and closer to real-world conditions.
      Genre granularity is also important when building a collection for genre classification. In fact, genre palettes often show different levels of granularity. For instance, [9] includes in his genre palette both FAQs (genre) and journalistic materials (super-genre). We suggest the use of the prototype theory (cf. [18] and [13]) to achieve a consistent level of genre granularity. A prototype is the most typical instance of a more encompassing or fuzzy category. Categories that can be dealt with by the prototype theory can be ordered into a three-tiered hierarchy: superordinate level, basic level and subordinate level. For example, the genre ‘advertisement’ represents the basic level (genre) of the superordinate level ‘advertising’ (super-genre), while a ‘web ad’ represents the subordinate level (subgenre) of the basic level. The
basic level embodies the information level at which concepts are most easily recognized, remembered and learned with respect to their function. The basic level included in the prototype theory should not be mixed up with document stereotypicality or exemplarity. Building a genre collection by choosing exemplars, i.e. only stereotypical documents, to unambiguously represent a genre can return biased results. According to the prototype theory, instead, instances of a genre may vary in their prototypicality, thus allowing intra-genre variation.
      The two experiments presented in this paper highlight the influence that genre annotation and genre granularity can have on the accuracy of genre classification of web pages. They were designed to point out several issues (some already covered in [22]). In this paper, these two experiments allow us to emphasize two general aspects of genre classification, one practical and one theoretical. From a practical point of view, these experiments show that a collection annotated with the criteria of objective sources and consistent genre granularity ensures a very good classification accuracy (Experiment 1). Additionally, the classification model built out of such a collection can be exported more profitably for predictive tasks on an unclassified web page collection (Experiment 2). From a theoretical point of view, they represent a starting point for a discussion about the need for common criteria in the absence of an official genre-annotated benchmark.
      In order to ensure replicability, all the materials used for these experiments, including web page collections, feature sets and the manual evaluation of Experiment 2, are available at http://www.nltg.brighton.ac.uk/home/Marina.Santini/, bottom of the page.
      The paper is organized as follows: Section 2 provides an overview of recent work in genre classification of web pages; Section 3 presents the web page collections and the two experiments; conclusions are drawn in Section 4.

2     PREVIOUS WORK
Several experiments have been recently carried out with genres and web pages. Here we list the latest studies in order to show how difficult it is to compare their results in the absence of common criteria for corpus building and genre palettes.

[7]: Number of web pages: 2150; Annotation: single rater; Categories: subjectivity, positive-ness. They tried to discriminate among texts coming from different domains in terms of two polarities: subjective vs. objective and positive vs. negative. Their aim was to see how a classification model tuned on one domain performed in another domain. According to their results, in single domain classification the best accuracy is achieved with Multi-View-Ensemble (MVE) (see [7] for details) for subjectivity, and with bag-of-words (BOW) features for positive-ness. In domain transfer classification, the best accuracy is achieved with Parts-of-Speech (POS) tags for subjectivity and MVE for positive-ness. Although it is true that genres can be divided into more subjective genres (e.g. editorials), or more objective genres (e.g. surveys), and that the opposition positive-negative can suggest specific genres (such as reviews), these two polarities can hardly be considered as “genres” in themselves. Nonetheless, [7]’s contribution is extremely valuable because they shed some light on the performance of different feature sets across several domains, providing insight into the extent of feature exportability.
[5]: Number of web pages: 2700; Annotation: one or more raters; Categories: functional styles. They carried out an experiment on style-dependent document ranking. Their research explored the possibility of incorporating style-dependent ranking into ranking schemata for searching the web and digital libraries. Their basic idea was to reduce styles (more specifically, the five functional styles theorized by the School of Prague) to a single continuous parameter. Despite the promising preliminary results, they could see little improvement in relevance ranking when stylistic parameters were included.
[3]: Number of web pages: 343; Genre annotation: the author plus one or more raters; Genres: abstract, call for papers, FAQs, hub/sitemap, job description, resume/C.V., statistics, syllabus, technical paper. She tried out the efficiency of several feature sets and automatic feature selection techniques on a small corpus of 10 genres, using a number of classification algorithms. Although her results can be considered only indicative given the reduced number of pages per genre (an average of 20 web pages per genre class), she made interesting remarks about discrimination across similar genres, and the influence of the genre palette and document exemplarity on discrimination tasks. Her best accuracy (92.1%) was achieved by one of the feature combinations resulting from an automatic feature selection technique.
[10]: Number of web pages: 321; Genre annotation: not stated; Genres: personal, corporate, organizational home pages, including also non-home pages, as noise. They tried the hard task of home page genre discrimination. The best accuracy (71.4%) is achieved on personal home pages with a single classifier, manual feature selection, and without noisy pages.
[16]: Number of web pages: 1224; Genre annotation: two graduate students; Genres: personal home page, public home page, commercial home page, bulletin collection, link collection, image collection, simple table/lists, input pages, journalistic material, research report, official materials, FAQs, discussions, product specification, informal texts (poem, fiction, etc.). They investigated the efficiency of several feature sets to discriminate across these 16 genres. They also tested the classification efficiency on different parts of the web page space (title and meta-content, body, and anchors). The best accuracy (75.7%) was achieved with one of their feature sets when applied only to the body and anchors.
[17]: Number of web pages: 800; Genre annotation: three raters; Genres: help, article, discussion, shop, portrayal (non-private), portrayal (private), link collection, download. They worked out a genre palette of eight genres following the outcome of a study on genre usefulness. As they aimed at a classification performed on the fly, they assessed features according to the computational effort they required, giving preference to those requiring low or medium effort. They achieved around 70% accuracy with discriminant analysis on the palette of eight genres. Other results relate to groups of genres tailored for web user profiles.
[14] and the follow up [15]: Number of web pages: 321; Genre annotation: at least two raters; Genres: reportage-editorial, research article, review, home page, Q&A, specification. They aimed at selecting genre-revealing terms from the training document set using a collection of web pages annotated both at topic level and at genre level. Their formula (the deviation formula) makes use of both genre-classified documents and subject-classified documents and eliminates terms that are more subject-related than genre-related. They report a micro-average of precision and recall of about 90%.
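As a reading aid for the micro-averaged figure quoted for [14] and [15], the following minimal sketch shows how a micro-average pools the per-genre counts before computing precision and recall once. The function name, the dictionary layout and the counts are illustrative assumptions; the numbers are invented and are not results from [14] or [15].

```python
def micro_average(per_genre_counts):
    """Pool true positives, false positives and false negatives over all
    genres, then compute precision and recall once on the pooled counts."""
    tp = sum(c["tp"] for c in per_genre_counts.values())
    fp = sum(c["fp"] for c in per_genre_counts.values())
    fn = sum(c["fn"] for c in per_genre_counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Invented counts for two of the genre labels named above (assumption,
# for illustration only).
example_counts = {
    "review": {"tp": 90, "fp": 10, "fn": 5},
    "Q&A": {"tp": 45, "fp": 5, "fn": 10},
}
precision, recall = micro_average(example_counts)
print(f"micro-precision={precision:.3f}, micro-recall={recall:.3f}")
```

When every page receives exactly one label, each misclassification counts once as a false positive and once as a false negative, so the pooled precision and recall coincide with plain accuracy. This may be why a single figure is reported for both.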
As already stressed, the absence of common criteria or evaluation ground makes most of these experiments (see Table 1 for a summary) difficult to compare, however fruitful each study can be in itself. A cross-evaluation of these experiments remains virtually unfeasible because genre palettes are mostly disparate. Even in the case of ‘home page’, which is probably one of the few genres in common in several experiments, any comparison appears to be difficult, because selection criteria and level of exemplarity are not declared. The two criteria of annotation by objective sources and consistent level of granularity are suggested to overcome this un-comparability.

                         Table 1. Summary Table

    Studies        No. of     Annotation             Labels
                   web pages
    [7]            2,150      single rater           subjectivity vs. objectivity, positive vs. negative
    [5]            2,700      one or more raters     public affairs style, everyday communication style,
                                                     scientific style, journalistic style, literary style
    [3]            343        two or more raters     abstract, call for papers, FAQs, hub/sitemap,
                                                     job description, resume/C.V., statistics, syllabus,
                                                     technical paper
    [10]           321        not stated             home pages (personal, corporate, organizational)
    [16]           1,224      two graduate students  personal home page, public home page, commercial
                                                     home page, bulletin collection, link collection,
                                                     image collection, simple table/lists, input pages,
                                                     journalistic material, research report, official
                                                     materials, FAQs, discussions, product specification,
                                                     informal texts
    [17]           800        3 raters               article, discussion, shop, portrayal (non-private),
                                                     portrayal (private), link collection, download
    [14] and [15]  321        at least two raters    reportage-editorial, research article, review,
                                                     home page, Q&A, specification

3          EXPERIMENTS

3.1           7-Web-Genre Collection

The 7-web-genre collection includes 200 English web pages per genre, amounting to a total of 1,400 web pages (available online at the URL reported in the Introduction). These web pages were collected by the author of this paper in early spring 2005. This collection was built with genres belonging to a consistent level of granularity and by applying the annotation by objective sources. The seven web genres included in the collection are the following:
1. blog
2. eshop
3. FAQs
4. online newspaper front page
5. list
6. personal home page²
7. search page

² ‘Personal home page’ is the basic level of the superordinate level ‘home page’ and has ‘academic personal home page’, ‘administrative personal home page’, etc. as subordinate level.

The web pages included in the 7-web-genre collection were randomly downloaded from the following public archives or portals (download date: Feb-March 2005):
•    Blogs:
          o     http://www.britblog.com/
          o     http://www.nataliedarbeloff.com/augustinearchive.html
•    Eshops:
          o     http://www.shops.co.uk/
          o     http://www.eshops.co.uk/
•    FAQs:
          o     http://www.cybernothing.org/faqs/net-abuse-faq.html
          o     http://www.irs.gov/faqs/
          o     http://www.copyright.gov/help/faq/
          o     http://www.aoml.noaa.gov/hrd/tcfaq/tcfaqHED.html
•    Newspaper front pages belong to a number of different online newspapers and are available at the Internet Archive:
          o     www.archive.org
•    Personal home pages are heterogeneous, and include academic and administrative personal home pages, as well as more informal personal home pages. They were downloaded from:
          o     http://dmoz.org/Society/People/Personal_Homepages/
          o     http://www.math.unl.edu/~mbritten/ldt/homepage.html
          o     http://www.bradley.edu/people/fac-staff.html
          o     http://www.daimi.au.dk/local/map/PeopleandLocationsPeopleFrame.html
          o     http://www.mit.edu/Home-byUser.html
          o     http://dir.yahoo.com/Society_and_Culture/People/Personal_Home_Pages
          o     http://hpsearch.uni-trier.de/hp/a-tree/
•    Search pages come from:
          o     http://www.searchenginecolossus.com/

The web pages included in the genre ‘list’ were selected by searching for keywords in Google and selecting relevant web pages from the results. All the lists include one of the following keywords (and orthographic variants) in the heading: checklist, hot list, table of content, and sitemap (see, for example, Insect Hotlist at http://www.fi.edu/tfi/hotlists/insects.html).

3.2      KI-04 corpus
KI-04 corpus was built following a palette of eight genres suggested by a user study on genre usefulness ([17]). It includes 1,295 English web pages (HTML documents), but only 800 web pages (100 per genre) were used in the experiment described in [17]. In Experiment 1, we used 1,205 web pages because some web pages were empty (both the original version, 1,295 web pages, and the working version, 1,205 web pages, are available online at the URL reported in the Introduction). KI-04 corpus includes:
1. article (127 web pages)
2. download (151 web pages)
3. link collection (205 web pages)
4. portrayal (priv) (126 web pages)
5. discussion (127 web pages)
6. help (139 web pages)
7. portrayal (non-priv) (163 web pages)
8. shop (167 web pages)
The KI-04 corpus was collected using bookmarks from about five people. Some genres were extended to get a better balance. The corpus was sorted by three people, one of whom wrote a bachelor thesis (in German) on the corpus building process. One of the authors of [17] checked many of the pages, and most of the sorting complied with his understanding of the genre categories. The download date was January 26th, 2004.
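The criterion of annotation by objective sources applied to the 7-web-genre collection (Section 3.1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the collection: the portal URLs are taken from Section 3.1, while the dictionary, the function name, the record layout and the page URLs are hypothetical.

```python
# Portal -> genre mapping. The portal URLs are those listed in Section 3.1;
# the mapping structure itself is an illustrative assumption.
PORTAL_GENRE = {
    "http://www.britblog.com/": "blog",
    "http://www.shops.co.uk/": "eshop",
    "http://www.eshops.co.uk/": "eshop",
    "http://dmoz.org/Society/People/Personal_Homepages/": "personal home page",
    "http://www.searchenginecolossus.com/": "search page",
}


def annotate_by_source(harvest):
    """Label each page with the genre of the portal that listed it.

    `harvest` is an iterable of (portal_url, page_url) pairs. Pages whose
    portal is not in PORTAL_GENRE are skipped rather than hand-labelled,
    so no human genre judgement enters the annotation.
    """
    labelled = []
    for portal_url, page_url in harvest:
        genre = PORTAL_GENRE.get(portal_url)
        if genre is not None:
            labelled.append((page_url, genre))
    return labelled


# Hypothetical harvest: the page URLs are placeholders, not real pages.
harvest = [
    ("http://www.eshops.co.uk/", "http://shop.example/"),
    ("http://www.britblog.com/", "http://blog.example/"),
    ("http://unknown-portal.example/", "http://page.example/"),
]
print(annotate_by_source(harvest))
```

The point of the design is that the genre label is a property of the container a page was found in, so no inter-rater agreement step is needed. A page from an unlisted source is left out instead of being labelled by hand.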
3.3       SPIRIT collection

The SPIRIT collection is a random crawl carried out in 2001 (see [8]). It contains single web pages and not full websites. The size of the whole collection is about one terabyte, and the number of web pages (mostly HTML files) is about 95 million. It is multilingual and without any meta-information, apart from a short header including the original URL, the date and time when the pages were crawled from the web, and a few other details. It represents a genuine slice of the real web. In Experiment 2, we used only 1,000 English web pages (available online at the URL reported in the Introduction) from this random, multilingual and unclassified collection.

3.4       Experiment 1

The practical aim of Experiment 1 was to build two single-label discrete classification models, one out of the 7-web-genre collection, the other from KI-04 corpus, and compare their accuracy results. Both collections were submitted to the same pre-processing. The unit of analysis was a single static web page in HTML format.
The feature set, called 1_set, used in Experiment 1 includes:
• the 50 most common words in English;
• 24 Part-of-Speech (POS) tags;
• 8 punctuation marks: full stop (.), colon (:), semi-colon (;), comma (,), exclamation mark (!), question mark (?), apostrophe ('), and quotes (");
• genre-specific words³;
• 28 HTML tags;
• 1 nominal attribute representing the length of the web page (SHORT, MEDIUM and LONG).
(This feature set, together with a description, is available online at the URL reported in the Introduction). The classification algorithm used both in Experiments 1 and 2 is SMO (which implements the Sequential Minimal Optimisation (SMO) for training support vectors) with default parameters and logistic regression model, from the Weka machine learning workbench ([25]). Accuracy results, shown in Table 2, are averaged over stratified 10-fold cross-validations repeated 10 times.

³ Genre-specific words were selected through a cursory manual analysis. A total of 13 sets of genre-specific words were built: 13 and not 15 because two sets were shared across the two collections, namely those related to home-page/portrayal (priv) and eshop/shop. It is worth saying that genre-specific words (available online at the URL reported in the Introduction) are not numerous. For example, genre-specific words for the search web genre are only: search, crawl, directories, engine, find, and see.

              Table 2. Averaged Accuracies with SMO

    Averaged Accuracy on the 7-web-genre collection    Averaged Accuracy on KI-04 corpus
    90.6%                                              68.9%

As you can see in Table 2, the accuracy of the model built with the 7-web-genre collection is much higher than that of the model built with KI-04 corpus, namely by 21.7 percentage points.
In order to see whether the feature set was too tailored or biased towards the 7-web-genre collection, we compared the accuracy of this feature set on KI-04 corpus with the accuracy rates reported in [17]. To make this comparison possible, we ran discriminant analysis using our feature set on KI-04 corpus. As [17] ran their discriminant analysis only on 800 web pages while we used 1,205 web pages, we converted all the results into percentages. A breakdown of the different accuracy rates achieved with discriminant analysis and two different feature sets is shown in Table 3.

              Table 3. Accuracy rates with discriminant analysis

    KI-04 corpus             Our feature set    [17]’s feature set
    Article                  80.3%              81.3%
    Discussion               76.4%              68.5%
    Download                 74.2%              79.6%
    Help                     59.7%              55.1%
    Link Collection          69.3%              67.6%
    Portrayal (non-priv)     59.5%              57.9%
    Portrayal (priv)         73.8%              67.7%
    Shop                     68.3%              66.9%
    Accuracy                 70.2%              68.1%

Our feature set performs better than [17]’s feature set. Although the difference is rather small (+2.1 percentage points), it is statistically significant (chi-square test). This means that our feature set is not biased toward the 7-web-genre collection, but it performs significantly better than [17]’s feature set on KI-04 corpus with discriminant analysis, i.e. the same algorithm used in [17].

3.4.1     Discussion
Experiment 1 compares the accuracies of two models built with the same classification algorithm and the same feature set but different web page collections, the 7-web-genre collection and KI-04 corpus. The accuracy on the 7-web-genre collection (1,400 web pages) is above 90% while the accuracy on KI-04 corpus is definitely lower. A first thought was that our feature set did not represent the genre palette of KI-04 corpus adequately. However, after having compared the performance of our feature set with [17]’s feature set using the same algorithm (discriminant analysis) on the same collection, we saw that the accuracy achieved by our feature set was slightly higher than the accuracy stated in [17]. Although KI-04 corpus contains eight genres, i.e. one genre more than the 7-web-genre collection (error rate usually increases with the number of categories), this does not justify such a wide gap in the classification accuracy. Also, it is important to stress that genre-specific words are tailored to the genre palette. This means that the genre-specific words used for the 7-web-genre collection account for blogs, search, front page, etc., while those employed for KI-04 corpus include words related to articles, discussion, download, etc. Since these two genre palettes have two web genres in common, i.e. home page/portrayal (priv) and eshop/shop, in these two cases the same set of genre-specific words was used for both web genre collections. That the feature set used on the KI-04 corpus is not biased towards the 7-web-genre collection is confirmed by the results shown in Table 3, where the performance of our feature set is higher than that of [17]'s feature set.
      In conclusion, if neither the feature set nor the classification algorithm is the cause of this large discrepancy in accuracy, then the suspicion is that the selection of the web pages representing genres in KI-04 corpus might be responsible for the lower performance. Although the issue of subjectivity of the assignment of genre to web pages needs further investigation (cf. also [4]), for the time being we interpret the higher performance on the 7-web-genre collection as a result of the application of the two criteria of
annotation by objective sources and consistent genre granularity.
3.5       Experiment 2
The goal of Experiment 2 was to see whether the classification model built with the collection complying with the criteria of annotation by objective source and consistent genre granularity is also more effective for predictive tasks. In other words, predictions are used here as a kind of evaluation metric of the efficiency of classification models.
      In this experiment we used the two classification models built in the previous experiment together with additional models. The practical aim was to make predictions on unclassified and non-annotated web pages, i.e. 1,000 random English web pages from the SPIRIT collection. The relevance of the agreed-upon web pages (see Tables 5 and 6) to a genre was manually assessed by the author of this paper (the breakdown of this manual evaluation is available online at the URL reported in the Introduction).
      When making a prediction, the classifier returns a probability score to be interpreted in terms of classification confidence. This confidence score can be exploited when assessing the value of a prediction and for setting a threshold for reliable guesses. In order to get predictions on genre labels which were as reliable as possible, we devised an approach inspired by co-training. The basic idea was to exploit three different views (i.e. three different feature sets) on the same data. When the three models built with the three feature sets agreed on the same genre label (3-out-of-3 agreement) at a very high confidence score, namely >=0.9, this was for us an indication of a good prediction. Additionally, as we have two web page collections with two different genre palettes, we can have multi-label predictions. Ideally, a web page might get a prediction of “personal home page”, following the palette adopted in the 7-web-genre collection, and “portrayal (private)”, following the genre palette adopted in the KI-04 corpus. Also, as the two palettes are mostly not overlapping, it is interesting to see which palette is more suitable for the classification of this SPIRIT random sample. From the previous experiment we had two models built with a single feature set (1_set). To these models, we add four additional models (two per collection) in order to get the three simultaneous views on each collection. The two additional models for each collection were built using the feature sets called 2_set and 3_set (these feature sets, together with a description, are available online at the URL reported in the Introduction).
2_set contains the following features:
•    POS trigrams;
•    8 punctuation symbols (as above);
•    genre-specific words (as above);
•    28 HTML tags (as above);
•    1 nominal attribute representing the length of the web page (as above).

3_set contains the following features:
•    86 linguistic facets4;
•    genre-specific words;
•    6 HTML facets;
•    1 nominal attribute representing the length of the web page (as above).

4
    Linguistic facets and HTML facets are groups of features highlighting an aspect in the communicative context that is reflected in the use of language. They are listed in the URL reported in the Introduction.

Table 4 shows the performance of the three feature sets on the two web genre collections.
    Table 4. Accuracies of three feature sets on two collections
Classification algorithm:    Averaged accuracy on the     Averaged accuracy on the
Weka SMO                     7-web-genre collection       KI-04 corpus
1_set                        90.6%                        68.9%
2_set                        89.4%                        64.1%
3_set                        88.8%                        65.9%

From the summary shown in Table 5, we can see that a very low number of pages were agreed upon by the three classification models (second column) built on the 7-web-genre collection. This is not necessarily bad when aiming at high precision (future work will explore the possibility of increasing precision).
    Table 5. Correct predictions with the 7-web-genre palette
7 WEB GENRE    # OF AGREED-UPON WEB     CORRECT    INCORRECT GUESSES    ERROR
PALETTE        PAGES (OUT OF 1,000)     GUESSES    AND UNCERTAIN        RATE
BLOG           17                       1          16                   0.94
ESHOP          11                       3          8                    0.73
FAQs           8                        1          7                    0.88
FRONTPAGE      7                        0          7                    1.00
LISTING        18                       7          11                   0.61
PHP            44                       10         34                   0.77
SPAGE          12                       6          6                    0.50
TOTAL          117                      28         89
PERCENTAGE     11.7%                    2.8%       8.9%

However, predictions are even sparser with the models built using the KI-04 corpus (Table 6). As there was no 3-out-of-3 agreement for discussion, download, help, and portrayal (non-private), these genres were evaluated with 2-out-of-3 agreement. No correct guesses were returned for article, discussion, download, and help.
    Table 6. Correct predictions with KI-04 corpus
KI-04 CORPUS               # OF AGREED-UPON WEB     CORRECT    INCORRECT GUESSES    ERROR
                           PAGES (OUT OF 1,000)     GUESSES    AND UNCERTAIN        RATE
ARTICLE                    4                        0          4                    1.00
DISCUSSION                 8                        0          8                    1.00
DOWNLOAD                   4                        0          4                    1.00
HELP                       3                        0          3                    1.00
LINK                       3                        3          0                    0.00
PORTRAYAL (NON-PRIVATE)    5                        1          4                    0.80
PORTRAYAL (PRIVATE)        7                        3          4                    0.57
SHOP                       6                        3          3                    0.50
TOTAL                      36                       10         26
PERCENTAGE                 3.6%                     1%         2.6%

3.5.1      Discussion
Experiment 2 shows that the classification models built with the 7-web-genre collection return a higher number of predictions. This seems to confirm the interpretation that using the two criteria of objective source annotation and consistent level of granularity ensures better classification models and consequently a higher number of correct predictions. Also, this experiment shows a useful methodology to follow for multi-genre classification of web pages, which can be refined and further investigated in future.

4      CONCLUSIONS
In this paper we pointed out how classification models learned from a web collection annotated by genre using the two criteria of annotation by objective source and consistent level of granularity can return higher accuracy and a higher number of correct predictions.
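As an illustration of how the predictions counted in Experiment 2 can be obtained, the agreement rule (three models, one per feature set, voting for the same genre label at a confidence of at least 0.9) can be sketched in Python as follows. This is a minimal sketch and not the code used in the experiments: the function names, the data layout, and the confidence scores in the example are hypothetical, and keeping the 0.9 threshold for the relaxed 2-out-of-3 agreement is an assumption, since the threshold used in that case is not stated.

```python
from collections import Counter

# ">=0.9" is the confidence threshold reported for Experiment 2.
CONFIDENCE_THRESHOLD = 0.9


def agreed_label(predictions, min_votes=3, threshold=CONFIDENCE_THRESHOLD):
    """Return the label that at least `min_votes` models predict with a
    confidence >= `threshold`, or None when there is no such agreement.

    predictions: one (label, confidence) pair per model, i.e. per feature set.
    """
    confident = [label for label, confidence in predictions if confidence >= threshold]
    if not confident:
        return None
    label, votes = Counter(confident).most_common(1)[0]
    return label if votes >= min_votes else None


def multi_label(predictions_by_palette, min_votes=3):
    """Apply the agreement rule separately to each genre palette, so that a
    page can receive one label per palette (multi-label prediction)."""
    labels = {}
    for palette, predictions in predictions_by_palette.items():
        label = agreed_label(predictions, min_votes=min_votes)
        if label is not None:
            labels[palette] = label
    return labels


# Hypothetical confidence scores for one web page (not taken from the experiments).
page = {
    "7-web-genre": [
        ("personal home page", 0.97),
        ("personal home page", 0.93),
        ("personal home page", 0.91),
    ],
    "KI-04": [
        ("portrayal (private)", 0.95),
        ("portrayal (private)", 0.92),
        ("shop", 0.96),
    ],
}

strict = multi_label(page)                # 3-out-of-3 agreement
relaxed = multi_label(page, min_votes=2)  # 2-out-of-3 agreement (assumed same threshold)
print(strict)
print(relaxed)
```

In this example the strict rule labels the page only under the 7-web-genre palette, while the relaxed rule also returns a KI-04 label. The same pattern appears in Table 6, where some KI-04 genres received predictions only under 2-out-of-3 agreement.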
The annotation by objective source is not only less subjective and closer to real-world conditions, but also much faster than annotation by human raters, which is usually time-consuming, controversial, and expensive. Further, a collection built with a consistent level of genre granularity seems to be learned more profitably by the classifier. Together, these two criteria enhance the performance of classification algorithms.
     However, a full comparison between the results achieved with the two web page collections built with different criteria is not entirely feasible because the two genre palettes are mostly different. Nonetheless, these findings are indicative of a tendency that can be further investigated in future. It is also worth pointing out that objective sources may still contain biases. Biases in web collections relate to the well-known issue of 'corpus representativeness', dating back to Chomsky’s aversion to the use of corpora. However, nowadays, with the web available, biases can be alleviated by randomly picking web pages from several genre-specific web archives or portals.
     Although the two criteria of annotation by objective source and consistent level of granularity represent a practical solution that can help genre classification, the concept of genre remains hard to capture computationally and statistically in its entirety.
     First, it would be interesting to investigate further the ideal proportion among corpus size, number of features and number of classes, and its influence on classification results. Also, up to now only single-label discrete classification has been tried out in genre classification studies. Experiment 2 implicitly shows an easy method that can be exploited for multi-label classification: the use of concurrent genre palettes over the same unclassified collection. Ideally, the use of several classification models built with different collections, annotated by external sources with a consistent granularity and including different genre palettes, can suggest several genre labels for the same web page. The issues of multi-genre documents and genre hybridism are particularly acute when dealing with web pages, which appear much more unpredictable and individualized than paper documents. Using concurrent genre palettes might represent an alternative to the multi-faceted approach by [11]. What is less reassuring is the absence of proper evaluation metrics for multi-label problems. We leave these problems open to further investigations and invite the genre classification community to make use of the three collections employed in these experiments and now available online.

5     REFERENCES
[1] Argamon, S., Koppel, M., Avneri, G. Routing documents according to style, Proc. First International Workshop on Innovative Internet Information Systems, 1998.
[2] Bhatia, V. Analysing Genre. Language Use in Professional Settings, Longman, London and New York, 1993.
[3] Boese, E. Stereotyping the Web: Genre Classification of Web Documents, M.S. Thesis, Colorado State Univ., 2005.
[4] Boese, E. and Howe, A. Effects of Web Document Evolution on Genre Classification, CIKM’05, 2005.
[5] Braslavski, P. and Tselischev, A. Experiment on Style-Dependent Document Ranking, Proc. of the 7th Russian Conference on Digital Libraries, 2005.
[6] Dillon, A. and Gushrowski, B. Genres and the Web: is the personal home page the first uniquely digital genre?, JASIS, 51(2), 2000.
[7] Finn, A. and Kushmerick, N. Learning to classify documents according to genre. JASIST, Special Issue, 7(5), 2006.
[8] Joho, H. and Sanderson, M. The SPIRIT collection: an overview of a large web collection, SIGIR Forum, 38(2), 2004.
[9] Karlgren, J. Stylistic Experiments for Information Retrieval, Thesis submitted for the degree of Doctor of Philosophy, Stockholm University, Sweden, 2000.
[10] Kennedy, A. and Shepherd, M. Automatic Identification of Home Pages on the Web, Proc. 38 HICSS, 2005.
[11] Kessler, B., Nunberg, G. and Schütze, H. Automatic Detection of Text Genre, Proc. 35 Annual Meeting of the ACL and 8th Conference of the EACL, 1997.
[12] Kwasnik, B., Crowston, K., Nilan, M. and Roussinov, D. Identifying document genre to improve web search effectiveness. The Bulletin of the American Society for Information Science and Technology, 27(2), 23–26, 2000.
[13] Lee, D. Genres, Registers, Text types, Domains, and Styles: Clarifying the concepts and navigating a path through the BNC Jungle, Language Learning and Technology, 5(3), 37-72, 2001.
[14] Lee, Y. and Myaeng, S. Automatic Identification of Text Genres and Their Roles in Subject-Based Categorization, Proc. 37 HICSS, 2004.
[15] Lee, Y. and Myaeng, S. Text Genre Classification with Genre-Revealing and Subject-Revealing Features, Proc. 25 Annual International ACM SIGIR, 145-150, 2002.
[16] Lim, C., Lee, K. and Kim, G. Automatic Genre Detection of Web Documents, in Su K., Tsujii J., Lee J., Kwong O. Y. (eds.), Natural Language Processing, Springer, Berlin, 2005.
[17] Meyer zu Eissen, S. and Stein, B. Genre Classification of Web Pages: User Study and Feasibility Analysis, in Biundo S., Frühwirth T., Palm G. (eds.), Advances in Artificial Intelligence, Springer, Berlin, 256-269, 2004.
[18] Paltridge, B. Working with genre: A pragmatic perspective, Journal of Pragmatics, 24, 393-406, 1995.
[19] Rauber, A. and Müller-Kögler, A. Integrating Automatic Genre Analysis into Digital Libraries, ACM/IEEE Joint Conference on Digital Libraries, Roanoke, USA, 2001.
[20] Rosmarin, A. The Power of Genre, University of Minnesota Press, Minneapolis, 1985.
[21] Santini, M. Genres In Formation? An Exploratory Study of Web Pages using Cluster Analysis, Proc. CLUK 05, 2005.
[22] Santini, M. Some Issues in Automatic Genre Classification of Web Pages, Proc. of JADT 2006, Besançon, 2006.
[23] Swales, J. Genre Analysis, Cambridge University Press, Cambridge, 1990.
[24] Trosborg, A. (ed.), Analysing Professional Genres, J. Benjamins Publishing Company, Amsterdam, 2000.
[25] Witten, I. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, Amsterdam, second edition, 2005.
[26] Yates, J. and Orlikowski, W. Genres of organizational communication: A structural approach to studying communications and media, Academy of Management Review, 17(2), 229-326, 1992.