=Paper=
{{Paper
|id=Vol-205/paper-9
|storemode=property
|title=Common Criteria for Genre Classification: Annotation and Granularity
|pdfUrl=https://ceur-ws.org/Vol-205/paper9.pdf
|volume=Vol-205
}}
==Common Criteria for Genre Classification: Annotation and Granularity==
Marina Santini (University of Brighton, UK; M.Santini@brighton.ac.uk)

ABSTRACT

In this paper, we present two experiments that use machine learning for automatically classifying web pages by genre. These experiments highlight the influence that genre annotation and genre granularity can have on the accuracy of the classification. From a practical point of view, these experiments show that a collection annotated with the criteria of ‘objective sources’ and consistent genre granularity ensures a very good classification accuracy (Experiment 1). Additionally, the classification model built out of such a collection can be exported more profitably for predictive tasks on an unclassified web page collection (Experiment 2). These experiments represent a starting point for a discussion about the need for common criteria for building a genre collection in the absence of an official genre-annotated benchmark.

1 INTRODUCTION

In this paper, we present two experiments that use machine learning for automatically classifying web pages by genre.

Many definitions of genre have been proposed so far in literary studies (e.g. [20]), academic writing (e.g. [23]), professional settings (e.g. [2] and [24]), organizational environment (e.g. [26]), and so on. More specifically, in automatic genre classification studies, genres have often been seen as non-topical categories that could help reduce information overload (e.g. [16] or [15]). In this area, not only text categories such as ‘article’, ‘FAQs’, ‘home page’, etc. have been considered to be genres, but also polarities, such as subjective-objective and positive-negative ([7]), and style ([1], [9] and [5]). Regardless of the different definitions and connotations, a classification by genre has been acknowledged to be useful in information retrieval (e.g. [9], [12], etc.), information filtering ([7]), digital libraries ([19]) and other practical applications.

In this paper we present two experiments of genre classification of web pages based on a simplified and intuitive definition of genre, which is suitable for all kinds of genres, including genres on the web, and for an automatic approach. In our view, genres can be defined as named socio-cultural communication artefacts, linked to a society or a community, bearing standardized traits, leaving space for the creativity of the text producer, and raising expectations in the text receiver. For example, the personal home page (cf. also [6]) has standard traits, such as self-narration, personal interests, contact details, and often pictures related to one’s life. However, these conventions do not hinder the creativity of the producer, and as receivers, we expect a blend of standardized information and personal touch. Though unsophisticated, this definition of genre allows us to suggest a practical solution to the main shortcoming in genre classification, i.e. the lack of a genre-annotated benchmark. Because of this lack, the main tendency has always been to build one’s own collection according to subjective criteria for genre annotation and genre granularity. This is especially true for genre studies based on collections of web pages. Although building a genre-annotated benchmark of web pages is difficult and maybe not feasible, because annotating a web page by genre is both hard and controversial (cf. [21]), a few criteria should be discussed and agreed upon. Without some kind of commonality, any comparison becomes unfeasible. For instance, can we state that the 92% accuracy achieved by [3] is better than the accuracy (about 70%) achieved by [17]?

The solution we suggest for building more comparable genre collections is to exploit the socio-cultural aspect of the concept of genre. As pointed out earlier, genres have a function in a society, culture or community, i.e. they have a social or public role that implies a number of conventions and raises predictable expectations. This means that the role or the function of different genres is recognized and correctly used in the communication interaction. Leveraging this public and collective acknowledgement, it is possible to create a genre-annotated collection without involving human annotators. The key is to download documents from genre-specific archives or portals and use their membership in these containers as an automatic membership in a specific genre. For example, eshops can be randomly downloaded from the portal http://www.eshops.co.uk/ and considered to be eshops without any further manual annotation or inter-rater agreement assessment. We also include in the public acknowledgement genres used as titles of documents (for example, “Insects Hotlist”). The idea behind selecting documents with a genre in the title or picking them up randomly from public resources, such as an archive or a portal, is the following: if there is an archive, a portal or a website specialized in, say, pointing to or collecting genres such as eshops, blogs or search engines, this means that the documents pointed to or collected there are considered to belong to these genres by the collectivity of web users. We call this criterion ‘annotation by objective sources’. A genre collection annotated by objective sources tends to be more representative of intra-genre variation than a collection annotated relying on the genre stereotypicality that two, three, or more annotators have in mind. We suggest that annotating a collection using objective sources is faster and closer to real-world conditions.

Genre granularity is also important when building a collection for genre classification. In fact, genre palettes often show different levels of granularity. For instance, [9] includes in his genre palette both FAQs (genre) and journalistic materials (super-genre). We suggest the use of the prototype theory (cf. [18] and [13]) to achieve a consistent level of genre granularity. A prototype is the most typical instance of a more encompassing or fuzzy category. Categories that can be dealt with by the prototype theory can be ordered into a three-tiered hierarchy: superordinate level, basic level and subordinate level. For example, the genre ‘advertisement’ represents the basic level (genre) of the superordinate level ‘advertising’ (super-genre), while a ‘web ad’ represents the subordinate level (subgenre) of the basic level. The basic level embodies the information level at which concepts are most easily recognized, remembered and learned with respect to their function. The basic level included in the prototype theory should not be mixed up with document stereotypicality or exemplarity. Building a genre collection by choosing exemplars, i.e. only stereotypical documents, to unambiguously represent a genre can return biased results. According to the prototype theory, instead, instances of a genre may vary in their prototypicality, thus allowing intra-genre variation.

The two experiments presented in this paper highlight the influence that genre annotation and genre granularity can have on the accuracy of genre classification of web pages. They were designed to point out several issues (some already covered in [22]). In this paper, these two experiments allow us to emphasize two general aspects of genre classification, one practical and one theoretical. From a practical point of view, these experiments show that a collection annotated with the criteria of objective sources and consistent genre granularity ensures a very good classification accuracy (Experiment 1). Additionally, the classification model built out of such a collection can be exported more profitably for predictive tasks on an unclassified web page collection (Experiment 2). From a theoretical point of view, they represent a starting point for a discussion about the need for common criteria in the absence of an official genre-annotated benchmark.

In order to ensure replicability, all the materials used for these experiments, including web page collections, feature sets and the manual evaluation of Experiment 2, are available at http://www.nltg.brighton.ac.uk/home/Marina.Santini/, bottom of the page.

The paper is organized as follows: Section 2 provides an overview of recent work in genre classification of web pages; Section 3 presents the web page collections and the two experiments; conclusions are drawn in Section 4.

2 PREVIOUS WORK

Several experiments have been recently carried out with genres and web pages. Here we list the latest studies in order to show how difficult it is to compare their results in the absence of common criteria for corpus building and genre palettes.

[7]: Number of web pages: 2150; Annotation: single rater; Categories: subjectivity, positive-ness. They tried to discriminate among texts coming from different domains in terms of two polarities: subjective vs. objective and positive vs. negative. Their aim was to see how a classification model tuned on one domain performed in another domain. According to their results, in single domain classification the best accuracy is achieved with Multi-View-Ensemble (MVE) (see [7] for details) for subjectivity, and with bag-of-words (BOW) features for positive-ness. In domain transfer classification, the best accuracy is achieved with Parts-of-Speech (POS) tags for subjectivity and MVE for positive-ness. Although it is true that genres can be divided into more subjective genres (e.g. editorials), or more objective genres (e.g. surveys), and that the opposition positive-negative can suggest specific genres (such as reviews), these two polarities can hardly be considered as “genres” in themselves. Nonetheless, [7]’s contribution is extremely valuable because they shed some light on the performance of different feature sets across several domains, providing insight into the extent of feature exportability.

[5]: Number of web pages: 2700; Annotation: one or more raters; Categories: functional styles. They carried out an experiment on style-dependent document ranking. Their research explored the possibility of incorporating style-dependent ranking into ranking schemata for searching the web and digital libraries. Their basic idea was to reduce styles (more specifically, the five functional styles theorized by the School of Prague) to a single continuous parameter. Regardless of the promising preliminary results, they could see little improvement in relevance ranking when stylistic parameters were included.

[3]: Number of web pages: 343; Genre annotation: the author plus one or more raters; Genres: abstract, call for papers, FAQs, hub/sitemap, job description, resume/C.V., statistics, syllabus, technical paper. She tried out the efficiency of several feature sets and automatic feature selection techniques on a small corpus of 10 genres, using a number of classification algorithms. Although her results can be considered only indicative given the reduced number of pages per genre (an average of 20 web pages per genre class), she made interesting remarks about discrimination across similar genres, and the influence of the genre palette and document exemplarity on discrimination tasks. Her best accuracy (92.1%) was achieved by one of the feature combinations resulting from an automatic feature selection technique.

[10]: Number of web pages: 321; Genre annotation: not stated; Genres: personal, corporate, organizational home pages, including also non-home pages, as noise. They tried the hard task of home page genre discrimination. The best accuracy (71.4%) is achieved on personal home pages with a single classifier, manual feature selection, and without noisy pages.

[16]: Number of web pages: 1224; Genre annotation: two graduate students; Genres: personal home page, public home page, commercial home page, bulletin collection, link collection, image collection, simple table/lists, input pages, journalistic material, research report, official materials, FAQs, discussions, product specification, informal texts (poem, fiction, etc.). They investigated the efficiency of several feature sets to discriminate across these 16 genres. They also tested the classification efficiency on different parts of the web page space (title and meta-content, body, and anchors). The best accuracy (75.7%) was achieved with one of their feature sets when applied only to the body and anchors.

[17]: Number of web pages: 800; Genre annotation: three raters; Genres: help, article, discussion, shop, portrayal (non-private), portrayal (private), link collection, download. They worked out a genre palette of eight genres following the outcome of a study on genre usefulness. As they aimed at a classification performed on the fly, they assessed features according to the computational effort they required, giving preference to those requiring low or medium effort. They achieved around 70% accuracy with discriminant analysis on the palette of eight genres. Other results relate to groups of genres tailored for web user profiles.

[14] and the follow-up [15]: Number of web pages: 321; Genre annotation: at least two raters; Genres: reportage-editorial, research article, review, home page, Q&A, specification. They aimed at selecting genre-revealing terms from the training document set using a collection of web pages annotated both at topic level and at genre level. Their formula (the deviation formula) makes use of both genre-classified documents and subject-classified documents and eliminates terms that are more subject-related than genre-related. They report a micro-average of precision and recall of about 90%.
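To make concrete how little the genre palettes listed above have in common, the following minimal sketch compares them pairwise. It is only an illustration: the labels are copied from the study summaries, lowercased, and compared by exact string match (an assumption, since near-synonyms such as ‘discussion’ and ‘discussions’ are then treated as different), and the names PALETTES, shared_labels and jaccard are hypothetical helpers, not part of any of the cited studies.

```python
from itertools import combinations

# Genre palettes as listed in the study summaries; labels are lowercased and
# compared by exact string match (an illustrative assumption).
PALETTES = {
    "[3]": {"abstract", "call for papers", "faqs", "hub/sitemap",
            "job description", "resume/c.v.", "statistics", "syllabus",
            "technical paper"},
    "[10]": {"personal home page", "corporate home page",
             "organizational home page"},
    "[16]": {"personal home page", "public home page", "commercial home page",
             "bulletin collection", "link collection", "image collection",
             "simple table/lists", "input pages", "journalistic material",
             "research report", "official materials", "faqs", "discussions",
             "product specification", "informal texts"},
    "[17]": {"help", "article", "discussion", "shop",
             "portrayal (non-private)", "portrayal (private)",
             "link collection", "download"},
    "[14]/[15]": {"reportage-editorial", "research article", "review",
                  "home page", "q&a", "specification"},
}


def shared_labels(study_a, study_b):
    """Return the genre labels that two palettes have in common."""
    return PALETTES[study_a] & PALETTES[study_b]


def jaccard(study_a, study_b):
    """Jaccard overlap of two palettes: |intersection| / |union|."""
    union = PALETTES[study_a] | PALETTES[study_b]
    return len(PALETTES[study_a] & PALETTES[study_b]) / len(union)


if __name__ == "__main__":
    # Print every pair of studies with its shared labels and overlap score.
    for a, b in combinations(PALETTES, 2):
        print(a, b, sorted(shared_labels(a, b)), round(jaccard(a, b), 3))
```

Under exact matching, most pairs of palettes share no label at all, and the few that do overlap share a single genre.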
As already stressed, the absence of common criteria or evaluation ground makes most of these experiments (see Table 1 for a summary) difficult to compare, however fruitful each study can be in itself. A cross-evaluation of these experiments remains virtually unfeasible because genre palettes are mostly disparate. Even in the case of ‘home page’, which is probably one of the few genres in common in several experiments, any comparison appears to be difficult, because selection criteria and level of exemplarity are not declared. The two criteria of annotation by objective sources and consistent level of granularity are suggested to overcome this lack of comparability.

Table 1. Summary Table

Studies | No. of web pages | Annotation | Labels
[7] | 2,150 | single rater | subjectivity vs. objectivity, positive vs. negative
[5] | 2,700 | one or more raters | public affairs style, everyday communication style, scientific style, journalistic style, literary style
[3] | 343 | two or more raters | abstract, call for papers, FAQs, hub/sitemap, job description, resume/C.V., statistics, syllabus, technical paper
[10] | 321 | not stated | home pages (personal, corporate, organizational)
[16] | 1,224 | two graduate students | personal home page, public home page, commercial home page, bulletin collection, link collection, image collection, simple table/lists, input pages, journalistic material, research report, official materials, FAQs, discussions, product specification, informal texts
[17] | 800 | three raters | help, article, discussion, shop, portrayal (non-private), portrayal (private), link collection, download
[14] and [15] | 321 | at least two raters | reportage-editorial, research article, review, home page, Q&A, specification

3 EXPERIMENTS

3.1 7-Web-Genre Collection

The 7-web-genre collection includes 200 English web pages per genre, amounting to a total of 1,400 web pages (available online at the URL reported in the Introduction). These web pages were collected by the author of this paper in early spring 2005. This collection was built with genres belonging to a consistent level of granularity and applying the annotation by objective sources. The seven web genres included in the collection are the following:

1. blog
2. eshop
3. FAQs
4. online newspaper front page
5. list
6. personal home page (see footnote 2)
7. search page

Footnote 2: ‘Personal home page’ is the basic level of the superordinate level ‘home page’ and has ‘academic personal home page’, ‘administrative personal home page’, etc. as subordinate level.

The web pages included in the 7-web-genre collection were randomly downloaded from the following public archives or portals (download date: Feb-March 2005):

• Blogs:
o http://www.britblog.com/
o http://www.nataliedarbeloff.com/augustinearchive.html
• Eshops:
o http://www.shops.co.uk/
o http://www.eshops.co.uk/
• FAQs:
o http://www.cybernothing.org/faqs/net-abuse-faq.html
o http://www.irs.gov/faqs/
o http://www.copyright.gov/help/faq/
o http://www.aoml.noaa.gov/hrd/tcfaq/tcfaqHED.html
• Newspaper front pages belong to a number of different online newspapers and are available at the Internet Archive:
o www.archive.org
• Personal home pages are heterogeneous, and include academic and administrative personal home pages, as well as more informal personal home pages. They were downloaded from:
o http://dmoz.org/Society/People/Personal_Homepages/
o http://www.math.unl.edu/~mbritten/ldt/homepage.html
o http://www.bradley.edu/people/fac-staff.html
o http://www.daimi.au.dk/local/map/PeopleandLocationsPeopleFrame.html
o http://www.mit.edu/Home-byUser.html
o http://dir.yahoo.com/Society_and_Culture/People/Personal_Home_Pages
o http://hpsearch.uni-trier.de/hp/a-tree/
• Search pages come from:
o http://www.searchenginecolossus.com/

The web pages included in the genre ‘list’ were selected by searching keywords in Google and selecting relevant web pages from the results. All the lists include one of the following keywords (and orthographic variants) in the heading: checklist, hot list, table of content, and sitemap (see, for example, Insect Hotlist at http://www.fi.edu/tfi/hotlists/insects.html).

3.2 KI-04 corpus

The KI-04 corpus was built following a palette of eight genres suggested by a user study on genre usefulness ([17]). It includes 1,295 English web pages (HTML documents), but only 800 web pages (100 per genre) were used in the experiment described in [17]. In Experiment 1, we used 1,205 web pages because some web pages were empty (both the original version, 1,295 web pages, and the working version, 1,205 web pages, are available online at the URL reported in the Introduction). The working version of the KI-04 corpus includes:

1. article (127 web pages)
2. download (151 web pages)
3. link collection (205 web pages)
4. portrayal (private) (126 web pages)
5. discussion (127 web pages)
6. help (139 web pages)
7. portrayal (non-private) (163 web pages)
8. shop (167 web pages)

The KI-04 corpus was collected using bookmarks from about five people. Some genres were extended to get a better balance. The corpus was sorted by three people, one of whom wrote a bachelor thesis (in German) on the corpus building process. One of the authors of [17] checked many of the pages, and most of the sorting complied with his understanding of the genre categories. The download date was January 26th, 2004.
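The criterion of annotation by objective sources used for the 7-web-genre collection (Section 3.1) can be sketched as a simple lookup from the archive or portal a page was collected from to a genre label. This is a minimal illustration under stated assumptions, not the procedure actually used to build the collection: the portal hosts are taken from the source list in Section 3.1, while the mapping structure, the function names genre_from_source and annotate, and the example.org page URLs are hypothetical.

```python
from urllib.parse import urlparse

# Host of the archive/portal -> genre label. The hosts come from the source
# list in Section 3.1; the table itself is an illustrative assumption.
SOURCE_GENRE = {
    "www.britblog.com": "blog",
    "www.shops.co.uk": "eshop",
    "www.eshops.co.uk": "eshop",
    "www.searchenginecolossus.com": "search page",
}


def genre_from_source(source_url):
    """Return the genre implied by the portal a page was collected from.

    Returns None when the portal is not a known genre-specific source.
    """
    host = urlparse(source_url).netloc.lower()
    return SOURCE_GENRE.get(host)


def annotate(pages):
    """Label pages by the portal that pointed to them.

    pages: iterable of (page_url, source_url) pairs.
    Pages collected from an unknown source are left out rather than guessed.
    """
    labelled = []
    for page_url, source_url in pages:
        genre = genre_from_source(source_url)
        if genre is not None:
            labelled.append((page_url, genre))
    return labelled


if __name__ == "__main__":
    # Hypothetical pages: only the first comes from a known portal.
    sample = [
        ("http://example.org/some-shop", "http://www.eshops.co.uk/"),
        ("http://example.org/unknown", "http://portal.example.net/"),
    ]
    print(annotate(sample))
```

The point of the design is that no human rater decides the label: the genre follows from membership in the source, and pages from unrecognized sources receive no label.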
3.3 SPIRIT collection

The SPIRIT collection is a random crawl carried out in 2001 (see [8]). It contains single web pages and not full websites. The size of the whole collection is about one terabyte, and the number of web pages (mostly HTML files) is about 95 million. It is multilingual and without any meta-information, apart from a short header including the original URL, the date and time when the pages were crawled from the web, and a few other details. It represents a genuine slice of the real web. In Experiment 2, we used only 1,000 English web pages (available online at the URL reported in the Introduction) from this random, multilingual and unclassified collection.

3.4 Experiment 1

The practical aim of Experiment 1 was to build two single-label discrete classification models, one from the 7-web-genre collection, the other from the KI-04 corpus, and compare their accuracy results. Both collections were submitted to the same pre-processing. The unit of analysis was a single static web page in HTML format.

The feature set, called 1_set, used in Experiment 1 includes:
• the 50 most common words in English;
• 24 Part-of-Speech (POS) tags;
• 8 punctuation marks: full stop (.), colon (:), semi-colon (;), comma (,), exclamation mark (!), question mark (?), apostrophe ('), and quotes (");
• genre-specific words (see footnote 3);
• 28 HTML tags;
• 1 nominal attribute representing the length of the web page (SHORT, MEDIUM and LONG).

Footnote 3: Genre-specific words were selected through a cursory manual analysis. A total of 13 sets of genre-specific words were built: 13 and not 15 because two sets were shared across the two collections, namely those related to home page/portrayal (priv) and eshop/shop. It is worth saying that genre-specific words (available online at the URL reported in the Introduction) are not numerous. For example, genre-specific words for the search web genre are only: search, crawl, directories, engine, find, and see.

(This feature set, together with a description, is available online at the URL reported in the Introduction.) The classification algorithm used in both Experiments 1 and 2 is SMO (which implements Sequential Minimal Optimisation for training support vectors) with default parameters and logistic regression model, from the Weka machine learning workbench ([25]). Accuracy results, shown in Table 2, are averaged over stratified 10-fold cross-validations repeated 10 times.

Table 2. Averaged Accuracies with SMO

Averaged accuracy on the 7-web-genre collection | Averaged accuracy on KI-04 corpus
90.6% | 68.9%

As you can see in Table 2, the accuracy of the model built with the 7-web-genre collection is much higher than that of the model built with the KI-04 corpus, namely +21.7 percentage points.

In order to see whether the feature set was too tailored or biased towards the 7-web-genre collection, we compared the accuracy of this feature set on the KI-04 corpus with the accuracy rates reported in [17]. To make this comparison possible, we ran discriminant analysis using our feature set on the KI-04 corpus. As [17] ran their discriminant analysis only on 800 web pages while we used 1,205 web pages, we converted all the results into percentages. A breakdown of the different accuracy rates achieved with discriminant analysis and two different feature sets is shown in Table 3.

Table 3. Accuracy rates with discriminant analysis

KI-04 corpus | Our feature set | [17]’s feature set
Article | 80.3% | 81.3%
Discussion | 76.4% | 68.5%
Download | 74.2% | 79.6%
Help | 59.7% | 55.1%
Link Collection | 69.3% | 67.6%
Portrayal (non-priv) | 59.5% | 57.9%
Portrayal (priv) | 73.8% | 67.7%
Shop | 68.3% | 66.9%
Accuracy | 70.2% | 68.1%

Our feature set performs better than [17]’s feature set. Although the difference is rather small (+2.1 percentage points), it is statistically significant (chi-square test). This means that our feature set is not biased toward the 7-web-genre collection; on the contrary, it performs significantly better than [17]’s feature set on the KI-04 corpus with discriminant analysis, i.e. the same algorithm used in [17].

3.4.1 Discussion

Experiment 1 compares the accuracies of two models built with the same classification algorithm and the same feature set but different web page collections, the 7-web-genre collection and the KI-04 corpus. The accuracy on the 7-web-genre collection (1,400 web pages) is above 90% while the accuracy on the KI-04 corpus is definitely lower. A first thought was that our feature set did not represent the genre palette of the KI-04 corpus adequately. However, after having compared the performance of our feature set with [17]’s feature set using the same algorithm (discriminant analysis) on the same collection, we saw that the accuracy achieved by our feature set was slightly higher than the accuracy stated in [17]. Although the KI-04 corpus contains eight genres, i.e. one genre more than the 7-web-genre collection (error rate usually increases with the number of categories), this does not justify such a wide gap in the classification accuracy. Also, it is important to stress that genre-specific words are tailored to the genre palette. This means that the genre-specific words used for the 7-web-genre collection account for blogs, search, front page, etc., while those employed for the KI-04 corpus include words related to articles, discussion, download, etc. Since these two genre palettes have two web genres in common, i.e. home page/portrayal (priv) and eshop/shop, in these two cases the same set of genre-specific words was used for both web genre collections. That the feature set used on the KI-04 corpus is not biased towards the 7-web-genre collection is confirmed by the results shown in Table 3, where the performance of our feature set is higher than that of [17]’s feature set.

In conclusion, if neither the feature set nor the classification algorithm is the cause of this large discrepancy in accuracy, then the suspicion is that the selection of the web pages representing genres in the KI-04 corpus might be responsible for the lower performance. Although the issue of subjectivity of the assignment of genre to web pages needs further investigation (cf. also [4]), for the time being we interpret the higher performance on the 7-web-genre collection as a result of the application of the two criteria of annotation by objective sources and consistent genre granularity.

3.5 Experiment 2

The goal of Experiment 2 was to see whether the classification model built with the collection complying with the criteria of annotation by objective sources and consistent genre granularity is also more effective for predictive tasks. In other words, predictions are used here as a kind of evaluation metric of the efficiency of classification models.

In this experiment we used the two classification models built in the previous experiment together with additional models. The practical aim was to make predictions on unclassified and non-annotated web pages, i.e. 1,000 random English web pages from the SPIRIT collection. The relevance of the agreed-upon web pages (see Tables 5 and 6) to a genre was manually assessed by the author of this paper (the breakdown of this manual evaluation is available online at the URL reported in the Introduction).

When making a prediction, the classifier returns a probability score to be interpreted in terms of classification confidence. This confidence score can be exploited when assessing the value of a prediction and for setting a threshold for reliable guesses. In order to get predictions on genre labels which were as reliable as possible, we devised an approach inspired by co-training. The basic idea was to exploit three different views (i.e. three different feature sets) on the same data. When the three models built with the three feature sets agreed on the same genre label (3-out-of-3 agreement) at a very high confidence score, namely >=0.9, this was for us an indication of a good prediction. Additionally, as we have two web page collections with two different genre palettes, we can have multi-label predictions. Ideally, a web page might get a prediction of “personal home page”, following the palette adopted in the 7-web-genre collection, and “portrayal (private)”, following the genre palette adopted in the KI-04 corpus. Also, as the two palettes are mostly not overlapping, it is interesting to see which palette is more suitable for the classification of this SPIRIT random sample. From the previous experiment we had two models built with a single feature set (1_set). To these models, we add four additional models (two per collection) in order to get the three simultaneous views on each collection. The two additional models per collection were built using the feature sets called 2_set and 3_set (these feature sets, together with a description, are available online at the URL reported in the Introduction).

2_set contains the following features:
• POS trigrams;
• 8 punctuation symbols (as above);
• genre-specific words (as above);
• 28 HTML tags (as above);
• 1 nominal attribute representing the length of the web page (as above).

3_set contains the following features:
• 86 linguistic facets (see footnote 4);
• genre-specific words;
• 6 HTML facets;
• 1 nominal attribute representing the length of the web page (as above).

Footnote 4: Linguistic facets and HTML facets are groups of features highlighting an aspect in the communicative context that is reflected in the use of language. They are listed at the URL reported in the Introduction.

Table 4 shows the performance of the three feature sets on the two web genre collections.

Table 4. Accuracies of three feature sets on two collections

Classification algorithm: Weka SMO | Averaged accuracy on the 7-web-genre collection | Averaged accuracy on KI-04 corpus
1_set | 90.6% | 68.9%
2_set | 89.4% | 64.1%
3_set | 88.8% | 65.9%

From the summary shown in Table 5, we can see that a very low number of pages were agreed upon by the three classification models (second column) built on the 7-web-genre collection. This is not necessarily bad when aiming at high precision (future work will explore the possibility of increasing precision).

Table 5. Correct predictions with the 7-web-genre palette

7 WEB GENRE PALETTE | # OF AGREED UPON WEB PAGES (OUT OF 1,000) | CORRECT GUESSES | INCORRECT GUESSES AND UNCERTAIN | ERROR RATE
BLOG | 17 | 1 | 16 | 0.94
ESHOP | 11 | 3 | 8 | 0.73
FAQs | 8 | 1 | 7 | 0.88
FRONTPAGE | 7 | 0 | 7 | 1.00
LISTING | 18 | 7 | 11 | 0.61
PHP | 44 | 10 | 34 | 0.77
SPAGE | 12 | 6 | 6 | 0.50
TOTAL | 117 | 28 | 89 |
PERCENTAGE | 11.7% | 2.8% | 8.9% |

However, predictions are even sparser with the models built using the KI-04 corpus (Table 6). As there was no 3-out-of-3 agreement for discussion, download, help, and portrayal (non-private), these genres were evaluated with 2-out-of-3 agreement. No correct guesses were returned for article, discussion, download, and help.

Table 6. Correct predictions with KI-04 corpus

KI-04 CORPUS | # OF AGREED UPON WEB PAGES (OUT OF 1,000) | CORRECT GUESSES | INCORRECT GUESSES AND UNCERTAIN | ERROR RATE
ARTICLE | 4 | 0 | 4 | 1.00
DISCUSSION | 8 | 0 | 8 | 1.00
DOWNLOAD | 4 | 0 | 4 | 1.00
HELP | 3 | 0 | 3 | 1.00
LINK | 3 | 3 | 0 | 0.00
PORTRAYAL (NON-PRIVATE) | 5 | 1 | 4 | 0.80
PORTRAYAL (PRIVATE) | 7 | 3 | 4 | 0.57
SHOP | 6 | 3 | 3 | 0.50
TOTAL | 36 | 10 | 26 |
PERCENTAGE | 3.6% | 1% | 2.6% |

3.5.1 Discussion

Experiment 2 shows that the classification models built with the 7-web-genre collection return a higher number of predictions. This seems to confirm the interpretation that using the two criteria of objective source annotation and consistent level of granularity ensures better classification models and consequently a higher number of correct predictions. Also, this experiment shows a useful methodology to follow for multi-genre classification of web pages, which can be refined and further investigated in future.

4 CONCLUSIONS

In this paper we pointed out how classification models learned from a web collection annotated by genre using the two criteria of annotation by objective sources and consistent level of granularity can return higher accuracy and a higher number of correct predictions.
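The agreement filter described for Experiment 2 (Section 3.5) can be sketched as follows. This is a minimal illustration, not the code used in the experiments: it assumes each of the three models returns one (label, confidence) pair per web page, the function names agreed_label and multi_label are hypothetical, and applying the same 0.9 threshold to the 2-out-of-3 case is an assumption, since the text states the threshold only for 3-out-of-3 agreement.

```python
from collections import Counter

CONFIDENCE_THRESHOLD = 0.9  # ">=0.9" as stated for 3-out-of-3 agreement


def agreed_label(predictions, min_agree=3, threshold=CONFIDENCE_THRESHOLD):
    """Return the genre label the models agree on, or None.

    predictions: list of (label, confidence) pairs, one per model (view).
    A label is accepted only if at least min_agree models predict it with
    confidence >= threshold.
    """
    confident = [label for label, conf in predictions if conf >= threshold]
    if not confident:
        return None
    label, votes = Counter(confident).most_common(1)[0]
    return label if votes >= min_agree else None


def multi_label(preds_7web, preds_ki04, min_agree=3):
    """Combine the agreed labels from the two palettes for one web page.

    Each argument holds the three (label, confidence) pairs produced by the
    models trained on one collection; the result has zero, one or two labels.
    """
    labels = [agreed_label(preds_7web, min_agree),
              agreed_label(preds_ki04, min_agree)]
    return [label for label in labels if label is not None]


if __name__ == "__main__":
    # Hypothetical confidence scores for a single web page.
    page_7web = [("personal home page", 0.97),
                 ("personal home page", 0.93),
                 ("personal home page", 0.91)]
    page_ki04 = [("portrayal (private)", 0.95),
                 ("portrayal (private)", 0.92),
                 ("portrayal (private)", 0.90)]
    print(multi_label(page_7web, page_ki04))
```

Passing min_agree=2 gives the relaxed 2-out-of-3 agreement used for the KI-04 genres that received no unanimous prediction.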
The annotation by objective sources is not only less subjective and closer to real-world conditions, but also much faster than annotation by human raters, which is usually time-consuming, controversial, and expensive. Further, a collection built with a consistent level of genre granularity seems to be learned more profitably by the classifier. Together, these two criteria enhance the performance of classification algorithms.

However, a full comparison between the results achieved with the two web page collections built with different criteria is not entirely feasible because the two genre palettes are mostly different. Nonetheless, these findings are indicative of a tendency that can be further investigated in future. It is also worth pointing out that objective sources may still contain biases. Biases in web collections relate to the well-known issue of ‘corpus representativeness’, dating back to Chomsky’s aversion to the use of corpora. However, nowadays, with the web available, biases can be alleviated by randomly picking web pages from several genre-specific web archives or portals.

Although the two criteria of annotation by objective sources and consistent level of granularity represent a practical solution that can help genre classification, the concept of genre remains hard to capture computationally and statistically in its entirety.

First, it would be interesting to investigate further the ideal proportion among corpus size, number of features and number of classes and its influence on classification results. Also, up to now only single-label discrete classification has been tried out in genre classification studies. Experiment 2 implicitly shows an easy method that can be exploited for multi-label classification:

[7] Finn, A. and Kushmerick, N. Learning to classify documents according to genre. JASIST, Special Issue, 7(5), 2006.
[8] Joho, H. and Sanderson, M. The SPIRIT collection: an overview of a large web collection, SIGIR Forum, 38(2), 2004.
[9] Karlgren, J. Stylistic Experiments for Information Retrieval, Thesis submitted for the degree of Doctor of Philosophy, Stockholm University, Sweden, 2000.
[10] Kennedy, A. and Shepherd, M. Automatic Identification of Home Pages on the Web, Proc. 38 HICSS, 2005.
[11] Kessler, B., Nunberg, G. and Schütze, H. Automatic Detection of Text Genre, Proc. 35 Annual Meeting of the ACL and 8th Conference of the EACL, 1997.
[12] Kwasnik, B., Crowston, K., Nilan, M. and Roussinov, D. Identifying document genre to improve web search effectiveness. The Bulletin of the American Society for Information Science and Technology, 27(2), 23–26, 2000.
[13] Lee, D. Genres, Registers, Text types, Domains, and Styles: Clarifying the concepts and navigating a path through the BNC Jungle, Language Learning and Technology, 5(3), 37-72, 2001.
[14] Lee, Y. and Myaeng, S. Automatic Identification of Text Genres and Their Roles in Subject-Based Categorization, Proc. 37 HICSS, 2004.
[15] Lee, Y. and Myaeng, S. Text Genre Classification with Genre-Revealing and Subject-Revealing Features, Proc. 25
Annual International ACM SIGIR, 145-150, 2002.
the use of concurrent genre palettes over the same unclassified
collection. Ideally, the use of several classification models built [16] Lim, C., Lee, K. and Kim G., Automatic Genre Detection of
with different collections annotated by external sources and a Web Documents, in Su K., Tsujii J., Lee J., Kwong O. Y.
consistent granularity, and including different genre palettes can (eds.) Natural Language Processing, Springer, Berlin, 2005.
suggest several genre labels for the same web page. Multi-genre [17] Meyer zu Eissen S. and Stein B. Genre Classification of Web
documents and genre hybridism are particularly acute when Pages: User Study and Feasibility Analysis, in Biundo S.,
dealing with web pages, which appear much more unpredictable Fruhwirth T., Palm G. (eds.), Advances in Artificial
and individualized than paper documents. Using concurrent genre Intelligence, Springer, Berlin, 256-269, 2004.
palettes might represent an alternative to the multi-faceted [18] Paltridge, B. Working with genre: A pragmatic perspective,
approach by [11]. What is less reassuring is the absence of a Journal of Pragmatics, 24, 393-406, 1995.
proper evaluation metrics for multi-label problems. We leave [19] Rauber, A. and Müller-Kögler, A. Integrating Automatic
these problems open to further investigations and invite the genre Genre Analysis into Digital Libraries, ACM/IEEE joint
classification community to make use of the three collections Conference on Digital Libraries, Roanoke, USA, 2001.
employed in these experiments and now available online.
[20] Rosmarin, A. The Power of Genre, University of Minnesota
5 REFERENCES Press, Minneapolis, 1985.
[1] Argamon, S., Koppel, M., Avneri, G. Routing documents [21] Santini, M. Genres In Formation? An Exploratory Study of
according to style, Proc. First International Workshop on Web Pages using Cluster Analysis, Proc. CLUK 05, 2005.
Innovative Internet Information Systems, 1998. [22] Santini, M. Some Issues in Automatic Genre Classification
[2] Bathia, V. Analysing Genre. Language Use in Professional of Web Pages. Proc. of the JADT 2006 Besançon 2006.
Settings, Longman, London and New York, 1993. [23] Swales, J. Genre Analysis, Cambridge University Press,
[3] Boese, E. Stereotyping the Web: Genre Classification of Web Cambridge, 1990.
Documents, M.S. Thesis, Colorado State Univ., 2005. [24] Trosborg, A. (ed.), Analysing Professional Genres, J.
[4] Boese, E and Howe A. Effects of Web Document Evolution Benjamins Publishing Company, Amsterdam, 2000.
on Genre Classification, CIKM’05, 2005. [25] Witten, I. and Frank, E. Data Mining: Practical Machine
[5] Bravslavski, P. and Tselischev, A. Experiment on Style- Learning Tools and Techniques, Morgan Kaufmann
Dependent Document Ranking, Proc. of the 7th Russian Publishers, Amsterdam, second edition, 2005.
Conference on Digital Libraries, 2005. [26] Yates, J., and Orlikowski, W. Genres of organizational
[6] Dillon, A. and Gushrowski, B. Genres and the Web: is the communication: A structural approach to studying
personal home page the first uniquely digital genre?, JASIS, communications and media, Academy of Management
51(2), 2000. Review, 17(2), 229-326, 1992.