=Paper=
{{Paper
|id=Vol-485/paper-4
|storemode=property
|title=A General Framework for Personalized Text Classification and Annotation
|pdfUrl=https://ceur-ws.org/Vol-485/paper4-F.pdf
|volume=Vol-485
|dblpUrl=https://dblp.org/rec/conf/um/BaruzzoDPT09
}}
==A General Framework for Personalized Text Classification and Annotation==
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
A General Framework for Personalized Text
Classification and Annotation?
Andrea Baruzzo, Antonina Dattolo, Nirmala Pudota, and Carlo Tasso
University of Udine, Via delle Scienze 206, 33100 Udine, Italy
{andrea.baruzzo, antonina.dattolo, nirmala.pudota,
carlo.tasso}@dimi.uniud.it
Abstract. The tremendous volume of digital contents available today
on the Web and the rapid spread of Web 2.0 sites, blogs and forums have
exacerbated the classical information overload problem. Moreover, they
have made even worse the challenge of finding new content appropriate
to individual needs. In order to alleviate these issues, new approaches
and tools are needed to provide personalized content recommendations
and classification schemata.
This paper presents the PIRATES framework: a Personalized Intelligent
Recommender and Annotator TEStbed for text-based content retrieval
and classification. Using an integrated set of tools, this framework lets
the users experiment, customize, and personalize the way they retrieve,
filter, and organize the large amount of information available on the Web.
Furthermore, the PIRATES framework undertakes a novel approach that
automates typical manual tasks such as content annotation and tagging,
by means of personalized tags recommendations and other forms of tex-
tual annotations (e.g. key-phrases).
1 Introduction
In the context of Semantic Web and Web 2.0 environments, finding an appropri-
ate content is regarded not only as a problem of information overload but also
as a problem of Web personalization [1], which deals with personalizing content
retrieval and access with respect to a specific user model. Moreover, this large
volume of data makes impractical or even impossible several manual activities
such as extracting small portions of relevant information from available con-
tents, or classifying contents according to a specific model of user interests [2].
As a consequence, the gap between the performance of traditional information
retrieval tools (e.g. search engines) and the user satisfaction in their use contin-
ues to grow. In order to alleviate this issue [3], more sophisticated approaches
and tools become necessary for providing personalized content recommendations
and classification. Furthermore, in a world of collaborative publishing we have
to take into account e-Learning, knowledge management and Web 2.0 as typical
?
The authors acknowledge the financial support of the Italian Ministry of Education,
University and Research (MIUR) within the FIRB project number RBIN04M8S8.
31
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
application environments. Indeed, we can discover new relevant information by
looking the community of people that, for example, share a common set of doc-
uments or use the same tags to label them. In this wider setting, automatic text
classification remains a significant research field with several challenges such as:
– Associating rich and precise semantics to information contents. For describ-
ing an object, people tend to assign to it a very small number of tags, based
on their knowledge background; of consequence, same tags, used by different
users, do not share a common semantics [4, 5].
– Adapting information retrieval strategies to an evolving user model, providing
run-time malleability to end-users [6]. Certainly, continuously updating a
user profile is more difficult than building a single static representation, and
requires the availability of some forms of user feedback to keep synchronized
the model.
– Finding relationships between contents and using a uniform method to share
and reuse tagging data amongst users or communities [7]. The topicality
criteria alone may not be sufficient to relate contents when there is no shared
semantics for a tag.
Our main goal in building the PIRATES framework is to empower social book-
marking tools, allowing users to easily add new contents in their personal col-
lection of links, automatically supporting them when categorizing by means of
keywords (tags) in a personalized and adaptive way. This work is a first step
towards the generation and sharing of personal information spaces described in
[8]. We have designed PIRATES keeping in mind several applications where it
can provide innovative adaptive tools enhancing user capabilities: in e’learning
for supporting the tutor and teacher activities for monitoring (in a personal-
ized fashion) student performance, behavior, and participation; in knowledge
management contexts (including for example scholarly publication repositories
and digital libraries [9]) for supporting document filtering and classification and
for alerting users in a personalized way about new posts or document uploads
relevant to their individual interests; in online marketing for monitoring and an-
alyzing the blogosphere where word-of-mouth and viral marketing are nowadays
more and more expanding and where consumer opinions can be listen.
The paper is organized as follows: Section 2 illustrates the overall architecture
and operation of PIRATES; Section 3 describes a typical interaction session and
Section 4 concludes the paper.
2 The PIRATES framework
PIRATES (Personalized Intelligent Recommender and Annotator TEStbed) is
a general framework for text-based content retrieval and categorization and
exploits social tagging, user modeling, and information extraction techniques.
Rather than proposing a rigid classification toolset, we have developed a testbed
platform for integrating (and experimenting with) various tools and techniques,
providing an interactive environment where users can customize the way they
32
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
Figure 1. Overall architecture of PIRATES.
retrieve and classify information on the Web. The main feature of PIRATES
concerns a novel approach that automates in a personalized way some typical
manual tasks (e.g. content annotation and tagging). The framework operates
on a set of input documents stored in the Information Base (IB) repository
and suggests for these some personalized tags and other forms of textual an-
notations (e.g. key-phrases) in order to classify them. The original documents
are then annotated with these tags, forming the Knowledge Base (KB) repos-
itory. Personalization is achieved exploiting user profiles (which represent the
user interests), personal ontologies, personal tags, etc., as discussed in Section 3.
Furthermore, PIRATES provides several mechanisms of user feedback that helps
to provide personalized adaptive information.
The PIRATES architecture is illustrated in Figure 1. On the left-hand side,
all the possible input sources are shown: single textual documents, specific IB
repositories which can be contained within an e-learning knowledge management
environment, and the Web, with specific (but not exclusive) focus on Web 2.0
portals, social networks, etc.. The right-hand side shows the suggested annota-
tions and the resulting KB repository. The main modules of PIRATES are:
33
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
– IEM (Information Extraction Module), which is based on the GATE platform
[10] to extract named entities, adjectives, proper names, etc. from input
documents, contained in the IB.
– SAT (Sentiment Analysis Tool ), which is a specific plug-in for personalized
sentiment analysis (typically to be activated for online marketing applica-
tions), that is capable of mining consumer opinions in the blogosphere and
classify them according to their polarity (positive, negative, or neutral) [11].
– KPEM (Key-Phrases Extraction Module), which implements a variation of the
KEA algorithm [12] for key-phrases extraction. KPEM identifies n-gram key-
phrases (typically n between 1 and 4) that summarize each input document.
This information is provided to the user, and is also given as input to the
subsequent modules.
– ORE (Ontology Reasoner Engine), which suggests new abstract concepts by
navigating through ontologies, classification schemata, thesauri, lexicon (such
as WordNet), etc. An abstract concept is identified by looking for a match
between the annotations found by the other modules (IEM, KPEM, IFT,
and STE) and the concepts stored in ontologies. When a match is found,
ORE navigates through the ontology, looking for the common parent node
which represents the more abstract term to suggest as annotation. ORE also
assists users in creating personal ontologies with techniques similar to those
described in [13].
– IFT (Information Filtering Tool ), which evaluates the relevance (in the sense
of topicality) of a document according to a specific model of user interests
represented with semantic (co-occurrence) networks [14].
– IFT Web Agents, which continuously monitor the Web (and the blogosphere)
looking for new information, cooperates with IFT to filter contents according
to the user model, and updates the IB repository. IFT and its Web agents
form together the Cognitive Filtering module discussed in [8].
– STE (Social Tagger Engine), which suggests new annotations for a document
relying on aggregated tags, i.e. the user’s personal tags (tags previously ex-
ploited) and the more popular tags used by the community of people that
classify the same document in social bookmarking sites such as Del.icio.us1 ,
Faviki2 or Bibsonomy3 . This social information is integrated with content-
based analysis techniques as discussed in [15].
3 A typical usage scenario
In this section we provide a typical scenario that illustrates a use case for our
framework. Consider a user interested to read scientific publications in the area
of software engineering. He trains the IFT tool providing the training data (e.g.
2-3 relevant papers in the field, some keywords and a short textual description
for the argument) in order to setup the user model. After training, the IFT
1
http://delicious.com
2
http://www.faviki.com/pages/welcome/
3
http://bibsonomy.org
34
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
Figure 2. The PIRATES user interface running our example
agents periodically monitor the Web (in our case especially Web 2.0 sites such
as Del.icio.us, Bibsonomy, CiteseerX4 , etc.), download new content and scrap
selected data from them to filter out irrelevant information (e.g. ads and nav-
igational links). When a relevant content (with respect to the user model) is
retrieved, the agents add it to the IB repository and informs the user with a
notification (e.g. an e-mail message). This information retrieval workflow has
been already discussed in [14, 16], so in the rest of the section we concentrate on
the classification features added by the PIRATES framework. Indeed, PIRATES
aims expressly to support the user in organizing the IB repository, easing the
work of classifying new contents by means of personalized tag suggestions.
Suppose now that an IFT agent notifies (among the others) the paper “A
UML Class Diagram Analyzer”5 . In order to classify this new content, the user
can enable some PIRATES annotator modules, as illustrated in the left side of
4
http://citeseerx.ist.psu.edu/
5
http://twiki.cin.ufpe.br/twiki/pub/SPG/GroupPublications/csduml04.pdf.
35
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
Figure 3. The PIRATES user interface running our example
Figure 2. Let us assume that he enables only IEM and KPEM modules in order
to extract, respectively:
– person’s names, organizations, and places (using IEM);
– keyphrases, i.e. n-grams long three terms at maximum (using KPEM).
With these settings, the framework produces the tag recommendations showed
in the right side of Figure 2. In particular, the suggested tags concern per-
sons such as the authors (Tiago Massoni, Rohit Gheyi, and Paulo Borba)
and the people acknowledged in the paper (Bordeau, Chang, Augusto Sampaio,
Franklin Ramalho and Rodrigo Ramos), locations (Brazil), and organizations
cited in the text (the Informatics Center of the Federal University of
Pernambuco, the Software Productivity Group, and the NASA). As keyphrases,
KPEM provides many terms related to Alloy specification language (Alloy,
Alloy Analyzer, snapshots), to UML (UML, UML Class Diagrams, OCL) and
to the specification of dependable systems (Critical Systems, Invariants).
36
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
(a) IEM, KPEM, and ORE outputs (b) Ontology reasoning
Figure 4. Personalized annotations proposed by PIRATES
The tag suggestions provided so far are extracted by the text present in
the input document: no personalization is present at all. Suppose now that the
user enables also the ORE module which exploits (in our example) a personal
ontology6 in the field of software engineering (see left side of Figure 3).
ORE implements a navigation strategy, taking in input the key-phrases ex-
tracted by other annotators (KPEM in this case). For four out of the suggested
key-phrases (i.e. Alloy, UML, OCL, and Invariants), ORE identifies a corre-
sponding one-to-one match in the ontology (see Figure 4(b)). Starting from these
nodes, ORE uses a spreading activation algorithm to find common ancestors rep-
resenting more abstract subjects. Then both one-to-one ontology mappings and
common ancestors are provided by PIRATES as potential tag recommendations,
as summarized in Figure 4(a). The ontology navigation process highlighted by
the spreading activation algorithm is depicted in Figure 4(b). In conclusion,
the ORE module recommends five new tags which are not present in the text
(i.e. Software Design Notation, Formal Specification Language, Design
by Contract, Formal Specification Techniques, and Software Design)7 .
6
We exploit an extended version of the existing domain ontology available from
http://www.seontology.org/.
7
Note also that tag Design by Contract was not already present nor in the input
document, nor in the original ontology, but it was added to the ontology by means
37
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
These tags represent abstractions of the key-phrases extracted by the other an-
notators available in PIRATES.
4 Conclusions
We believe that the presented framework is a promising approach to automatic,
personalized classification of Web contents. It is a first step in the direction of
automatically organize document repositories into personal concept maps, mov-
ing from information to knowledge. The development of PIRATES has been
planned in an incremental fashion, interleaved with experimental evaluation.
Several modules have been already developed and integrated in a testbed envi-
ronment: IEM with the sentiment analysis plug-in [16], KPEM with key-phrases
extraction capabilities, and the Cognitive Filtering comprising an extended ver-
sion of IFT capable to monitor Web 2.0 sources (specifically newsgoups, forums,
and blogs). The integration of these modules is currently being evaluated. Pro-
totyping and integration of ORE, SAT, and STE within PIRATES are ongoing
processes, and evaluation experiments are planned. Moreover, we are working
specifically on integrating the PIRATES modules in a Web-based version of the
environment, which let us validate each module thoroughly. Finally, we have
also planned to implement the conceptual map editor described in [8] in order
to completely validate the framework.
References
1. Brusilovsky, P., Tasso, C.: Preface to special issue on user modeling for web in-
formation retrieval. User Modeling and User-Adapted Interaction 14(2-3) (2004)
147–157
2. Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for person-
alized information access. In: The Adaptive Web. (2007) 54–89
3. Bunt, A., Carenini, G., Conati, C.: Adaptive content presentation for the web. In:
The Adaptive Web. (2007) 409–432
4. Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classification for auto-
mated tag suggestion. In: Proc. of the ECML/PKDD 2008 Discovery Challenge.
(2008)
5. Marchetti, A., Tesconi, M., Ronzano, F.: Semkey: A semantic collaborative tagging
system. (2007)
6. Lonchamp, J.: A platform for cscl practice and dissemination. In: ICALT ’06: Proc.
of the Sixth IEEE International Conference on Advanced Learning Technologies,
IEEE Computer Society (2006) 66–70
7. Kim, H., Yang, S., Jung, J., Kim, K., Breslin, J., Decker, S., Kim, H.: Combining
tags and the semanticweb for linked tagging data (2008)
8. Casoto, P., Dattolo, A., Ferrara, F., Pudota, N., Omero, P., Tasso, C.: Generating
and sharing personal information spaces. In: Proc. of the Workshop on Adaptation
for the Social Web, 5th ACM Int. Conf. on Adaptive Hypermedia and Adaptive
Web-Based Systems. (2008) 14–23
of a user feedback mechanism provided by PIRATES. This is where personalization
comes from.
38
Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, June 22-26, 2009
9. Omero, P., Polesello, N., Tasso, C.: Personalized intelligent information services
within an online digital library for medicine: the bibliomed system. In: IRCDL ’07:
Proc. of the Third Italian Research Conference on Digital Library Systems. (2007)
46–51
10. Cunningham, H.: Gate, a general architecture for language engineering. Computers
and the Humanities 36 (2002) 223–254
11. Casoto, P., Dattolo, A., Tasso, C.: Sentiment classification for the italian language:
A case study on movie reviews. Journal of Internet Technology 9(4) (2008) 365–373
12. Frank, E., Paynter, G., Witten, I., Gutwin, C., Nevill-Manning, C.: Domain-specific
keyphrase extraction. In: IJCAI ’99: Proc. of the Sixteenth International Joint
Conference on Artificial Intelligence, Morgan Kaufmann (1999) 668–673
13. Speretta, M., Gauch, S.: Using text mining to enrich the vocabulary of domain
ontologies. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM
International Conference on 1 (2008) 549–552
14. Tasso, C., Asnicar, F.A.: ifweb: a prototype of user model-based intelligent agent
for document filtering and navigation in the world wide web. In: Adaptive Systems
and User Modeling on the WWW, 6th UM Inter. Conf. (1997)
15. Tasso, C., Rossi, P., Virgili, C., Morandini, A.: Exploiting personalization tech-
niques in e-learning tools. In: SW-EL’04: Proc. of the Workshop on Applications
of Semantic Web Technologies for Adaptive Educational Hypermedia. (2004)
16. Pudota, N., Casoto, P., Dattolo, A., Omero, P., Tasso, C.: Towards bridging the
gap between personalization and information extraction. In: IRCDL ’08: Proc. of
the Forth Italian Research Conference on Digital Library Systems. (2008) 33–40
39