<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HNews: an enhanced multilingual hyperlinking news platform</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego De Cao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Previtali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <email>basilig@info.uniroma2.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Roma Tor Vergata</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe the HNews platform, a Web-based system addressing the general problem of aggregating and enriching news from different sources and languages. In the indexing stage, the news items gathered from RSS feeds or video streams are analyzed through Information Extraction tools. Their topical category information and Named Entity mentions are recognized and used to create semantic metadata, so as to enrich the information available for each news item. Moreover, a robust unsupervised Word Sense Disambiguation algorithm is applied to the available texts, which are thus further semantically annotated. This is used to align news items in different languages, such as Italian and English, and to support cross-lingual search. As a result, advanced search features, such as cross-lingual or typed entity-based queries, are enabled in HNews. In this paper, we also present the browser, which makes use of a spatial metaphor for the arrangement of the retrieved news. It makes it possible to capture different aspects such as the "semantic" similarity among news items, the timeliness of individual news items and their relevance with respect to an incoming user query.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As globalization emerges, information access across language boundaries is
becoming a critical issue. The World Wide Web has become accessible to more
and more countries, and technological advances overcome the network, interface
and computer system differences that constrain information access. In
particular, the World Wide Web is becoming a major "media" for news delivery
(e.g. broadcasting) and content creation. Consequently, the ability to search news
from different sources, media and languages has an increasing appeal for users.
The application of Information Retrieval techniques to the problems raised by
news aggregation in such heterogeneous scenarios is becoming a crucial
technological challenge. Currently, the major technology enabling information access
across different sources is the News Aggregator software. Aggregators reduce the
time and effort needed to regularly check websites for updates, as well as for
creating a unique integrated information space or a "personal newspaper".
Correspondingly, every aggregator has explored ways of integrating some Information
Retrieval capability in order to reduce the effort needed to satisfy real user
information needs.</p>
      <p>
        Research on ad hoc retrieval systems focuses on a variety of methods, ranging
from strongly lexicalized, statistical methods for relevance modeling to more
semantic-oriented techniques, usually based on deeper levels of text analysis and
language processing paradigms (such as parsing or semantic disambiguation
processes). In the former, so-called shallower, approaches, documents are retrieved
through simple matching mechanisms between texts and queries; ranking
according to relevance is the side effect of statistical models of term co-occurrences
in texts. Semantic approaches attempt to exploit, to a certain degree, the
syntactic and semantic information made available by linguistic analysis. In the
attempt to reproduce some level of text understanding, a meaning surrogate of
the input text is obtained, including also semantic (sometimes syntactic) indexes.
These are metadata that restrict the potential interpretations of the texts and
are supposed to improve retrieval accuracy. For example, a smaller number of
false positives is expected, as constraints at the semantic level can be imposed
to re-rank candidate documents. The technology that supports the detection,
disambiguation and formalization of meaningful information, nowadays called
semantic metadata, from unstructured texts is Information Extraction (hereafter
IE), as it has been studied since the early '90s ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>Notice that the availability of semantic information is particularly useful
in cross-linguistic scenarios, where strongly lexicalized statistical methods are
not of much help: they cannot be used to retrieve documents expressed in a
language different from the one used to query the IR system. In these cases,
beyond merely accepting extended character sets and performing language
identification, information retrieval systems should also help in locating information
across language boundaries. Moreover, access to distributed information is
complex also due to the heterogeneity of the sources and to the diversity of interests,
expectations and purposes of the target retrieval processes. Heterogeneity here
characterizes:
- Data typologies, as sources of information are characterized by different
media and content types.
- Data formats, as even the same content can be made available through a
media according to different levels of granularity and quality. Formats may
vary highly across and within archives.
- Contents, as the source information is not characterized by a specific
knowledge domain but is spread across heterogeneous semantic dimensions.
- Languages, as even an individual structured archive may well include
documents expressed in different natural languages.</p>
      <p>
        The major media channel for news is television. In previous work, the
problem of extracting semantic metadata from broadcast TV and radio news
has been discussed and a corresponding system, RitroveRAI, has been
presented [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It makes use of human language technologies for IE over multimedia
data (i.e. speech recognition and grammatical analysis of incoming news). The
HNews system, presented in this paper, extends RitroveRAI by integrating the
indexing of video news with the gathering and annotation of news from different
Web sources. News derived from newspaper portals on the Web are characterized
by texts that are less noisy than speech transcriptions. As a consequence, in
HNews a set of different natural language processing modules is applied and a
comprehensive enrichment of individual news items through semantic metadata
is obtained. The next section presents the overall structure of the applied
indexing process, while Section 3 describes the search interface. Section 4 concludes
this work by discussing applications and extensions of the system.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Enriching Web news through semantic metadata</title>
      <p>RSS is a family of web feed formats used to publish frequently updated news
contents in a standardized format, and it is adopted by most Web newspapers for
publishing timely updates. The HNews platform exploits these news aggregation
standards and collects news updates from independent RSS feed sources.
However, as rich metadata are required to improve the quality of the retrieval process,
the limitations of current RSS protocols must be carefully handled in a system
such as HNews. The idea of applying IE to contents requires that the
RSS-supported data gathering process is followed by an in-depth analysis according
to a family of advanced natural language processing tools. Unfortunately, the
RSS feeds of most newspapers make available only a summary of each news item,
on which only a small-scale linguistic analysis is possible. The summary is in
fact usually very short and insufficient to perform accurate extraction (e.g. sense
disambiguation is more complex when shorter contexts are targeted). As an
extension of a classical news aggregator, HNews provides crawling capabilities and
uses RSS links to access the corresponding complete news contents, i.e. full Web
article pages. A specific RSS processor for individual newspaper sources has
been developed for this purpose.</p>
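      <p>To make the two-stage gathering process concrete, the following is a minimal sketch, not the HNews implementation: feed entries whose RSS summary is too short for reliable linguistic analysis are completed by fetching the full article page. The entry fields, the word-count heuristic and the fetch_page hook are illustrative assumptions.</p>
      <p>
```python
def summary_sufficient(entry, min_words=40):
    """RSS summaries are often too short for reliable linguistic analysis
    (e.g. sense disambiguation); accept only sufficiently long ones."""
    return len(entry.get("summary", "").split()) >= min_words

def gather(entries, fetch_page):
    """Follow the RSS link to the full article whenever the feed only
    carries a short teaser, as described above for newspaper feeds."""
    items = []
    for e in entries:
        text = e.get("summary", "")
        if not summary_sufficient(e):
            text = fetch_page(e["link"])  # full Web article page
        items.append({"title": e["title"], "link": e["link"], "text": text})
    return items
```
      </p>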
      <p>
        Once the full textual content is made available, a cascade of NLP tools is
applied to extend contents with semantically rich metadata. The most interesting
among these metadata is the set of entities, such as the persons, locations,
organizations and temporal expressions mentioned in the news item. Accordingly, a
statistical Named Entity Recognition module is applied to detect and compile
lists of NEs (i.e. persons, cities, companies) that will be part of the metadata
related to the targeted item. The applied NER tool is further described in
Section 2.1. A relevant piece of information used to organize and retrieve news
items is the editorial class, i.e. the topical category corresponding to the news
content. While these are usually made available by the different providers through
the RSS format itself, the reference classification schemes adopted by the various
sources vary highly and differ from one another. In order to determine a unified
and comparable scheme, news from different sources are classified by the HNews
system into a set of predefined editorial categories, inspired by previous work
on this problem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The supervised statistical classification process adopted in
HNews is described in Section 2.2. Further analysis is carried out to disambiguate
the senses of relevant words through a Word Sense Disambiguation (WSD) stage,
as discussed in Section 2.3. WSD here is applied especially to natural language
queries in order to support cross-linguistic search. Finally, the comprehensive
set of information about an underlying news item (i.e. title, text content, named
entities, topical category and word senses) is indexed through a well-known
engine (i.e. Lucene [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). The above process is applied to the textual components
of Web pages, although according to a separate independent workflow (as
discussed in the description of the RitroveRAI system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), HNews is also able to
index TV broadcast news, whenever the segmented news and their speech
transcriptions are made available (see, for example, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) according to the
workflow described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The rich set of semantic metadata in HNews makes it possible to
integrate semantically typed information, letting heterogeneous sources and
different languages coexist, and to support flexible forms of querying and
conceptual aggregation. While Section 3 will discuss the navigation capabilities and
the resulting information space made available by the HNews dedicated GUI,
the rest of this section describes the main HLT processing stages applied by
HNews.
      </p>
      <p>
        Statistical Named Entity Recognition
The Named Entity Recognition (NER) task aims at the identification of all
named locations, persons and organizations, as well as dates, times, monetary
amounts and other numerical expressions that appear in free form in a text. NER
is a crucial step for the enrichment proposed by HNews, as it greatly improves
the performance of news aggregation. A reference system for NER is certainly
IdentiFinder [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which makes use of a variant of a Hidden Markov Model to identify
names, dates and numerical quantities. In the original proposal, the states of the
HMM are designed to correspond to the above classes, with an additional state
for tokens belonging to none of the classes. Each individual word is assumed to
be either part of a specific pre-determined class or not part of any class.
According to the definition of the task, one of the class labels, or the label that
represents "none of the classes", is assigned to every word. IdentiFinder uses word
features, which are language-dependent, such as capitalization, numeric symbols
and special characters, because they give good evidence for identifying tokens. A
version of an HMM-based NER has been designed at our Lab, and it has been
trained against annotated Web material in Italian for the major categories of
people, locations, organisations and dates. The resulting NER module is applied
in the HNews workflow both for news and for query processing. Moreover, given
the extremely noisy nature of speech transcriptions, two different HMM-based
recognizers are adopted, one for texts and one for the segments of TV broadcasts.
While performance close to 87% accuracy is obtained over standard textual
input, a non-negligible performance drop is observed over transcribed speech
material, where accuracy reaches about 70% on average.
      </p>
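      <p>As a sketch of the Viterbi decoding behind such an HMM tagger (a toy illustration, not the trained HNews module: the states, the gazetteer and the hand-set log-scores are all assumptions, with capitalization features standing in for a learned emission model):</p>
      <p>
```python
STATES = ["O", "PER", "LOC"]

# Hand-set log-scores; a trained recognizer would estimate these from data.
start_p = {"O": 0.0, "PER": -1.0, "LOC": -1.0}
trans_p = {s: {t: (0.0 if s == t else -1.0) for t in STATES} for s in STATES}
lexicon = {"PER": {"Ranieri"}, "LOC": {"Roma"}}

def emit_p(state, word):
    # Word features (capitalization, gazetteer membership) stand in for
    # the trained emission model of a real recognizer.
    if state == "O":
        return -2.0 if word[0].isupper() else 0.0
    if word in lexicon[state]:
        return 1.0
    return 0.0 if word[0].isupper() else -3.0

def viterbi(words):
    """Most likely label sequence under the toy HMM (log-space scores)."""
    V = [{s: start_p[s] + emit_p(s, words[0]) for s in STATES}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda q: V[-1][q] + trans_p[q][s])
            row[s] = V[-1][best] + trans_p[best][s] + emit_p(s, w)
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    last = max(STATES, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```
      </p>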
      <p>
        News Categorization
Text categorization has been traditionally modeled as a supervised machine
learning task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In HNews, a simple yet efficient model, i.e. Rocchio, is
applied. The model, described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], is a profile-based classifier, where a specific
cross-validation process allows optimization at the individual class level and yields
performance close to state-of-the-art systems (e.g. Support Vector Machines).
Given the set Ri of training documents classified under topic Ci (positive
examples), the set R̄i of documents not belonging to Ci (negative examples),
and given a document dh and a feature f, the Rocchio model [
        <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
        ] defines the weight Ω_fi of f in the profile of Ci as:
</p>
      <p>
        Ω_fi = max { 0, (β/|Ri|) · Σ_{dh ∈ Ri} ω_fh − (γ/|R̄i|) · Σ_{dh ∈ R̄i} ω_fh }   (1)
      </p>
      <p>
        where ω_fh is the weight of the feature f in the document dh. In equation 1,
the parameters β and γ control the relative impact of positive and negative
examples and determine the weight of f in the i-th profile. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the values β=16 and γ=4 were first proposed for the categorization of
low-quality images. These parameters indeed greatly depend on the training
corpus, and different settings of their values produce significant performance
variations.
      </p>
      <p>Notice that, in equation 1, features with a negative difference between positive
and negative relevance are set to 0. This represents an elegant feature selection
method: the 0-valued features are irrelevant in the similarity estimation. As a
result, the remaining features are optimally used, i.e. only for classes for which
they are selective. In this way, the minimal set of truly irrelevant features (those
having a weight of 0 for all the classes) can be better captured and removed.</p>
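      <p>The profile construction with its max(0, ·) clipping can be sketched as follows; documents are assumed to be dicts mapping features to their ω weights, and the defaults β=16, γ=4 follow the values cited above:</p>
      <p>
```python
def rocchio_profile(pos_docs, neg_docs, beta=16.0, gamma=4.0):
    """Rocchio profile of one class: averaged positive evidence minus
    averaged negative evidence, clipped at zero (feature selection)."""
    feats = set()
    for d in pos_docs + neg_docs:
        feats.update(d)
    profile = {}
    for f in feats:
        pos = sum(d.get(f, 0.0) for d in pos_docs) / len(pos_docs)
        neg = sum(d.get(f, 0.0) for d in neg_docs) / len(neg_docs)
        w = max(0.0, beta * pos - gamma * neg)
        if w > 0.0:
            profile[f] = w  # 0-valued features are dropped as irrelevant
    return profile
```
      </p>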
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a modified Rocchio model is presented that makes use of a single
parameter ρi per class, as follows:
</p>
      <p>
        Ω_fi = max { 0, (1/|Ri|) · Σ_{dh ∈ Ri} ω_fh − (ρi/|R̄i|) · Σ_{dh ∈ R̄i} ω_fh }   (2)
      </p>
      <p>
        Moreover, a practical method for estimating suitable values of the ρi
vector has been introduced. Each category in fact has its own set of relevant and
irrelevant features, and equation 2 depends on ρi for each class i. Now, if we
assume the optimal values of these parameters can be obtained by estimating
their impact on the classification performance, nothing prevents us from deriving
this estimation independently for each class i. This results in a vector of values ρi,
each one optimizing the performance of the classifier over the i-th class. The
estimation of the ρi is carried out by a typical cross-validation process. Two
data sets are used: the training set (about 70% of the annotated data) and a
validation set (the remaining 30%). First the categorizer is trained on the
training set, where the feature weights ω_fh are estimated. Then the profile
weights Ω_fi for the different classes are built by setting the parameters ρi to
those values optimizing accuracy on the validation set. The resulting categorizer
is then tested on a separate test set. Results on the Reuters benchmark are about
85%, close to more complex state-of-the-art classification models ([
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ]). An extensive discussion
of the performances reached over different benchmarks is reported in ([
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
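      <p>The per-class estimation loop can be sketched as follows; build_classifier and accuracy are hypothetical hooks standing in for the Rocchio training and the validation-set scoring described above, not part of any real API:</p>
      <p>
```python
def estimate_rho(classes, candidates, build_classifier, accuracy, val_set):
    """For each class, sweep the candidate rho values and keep the one
    maximizing accuracy on the held-out validation split."""
    best = {}
    for c in classes:
        scored = [(accuracy(build_classifier(c, r), c, val_set), r)
                  for r in candidates]
        best[c] = max(scored)[1]  # rho with the best validation score
    return best
```
      </p>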
      <p>A typical example of the obtained results is reported in Figure 1, which shows
the results of a retrieval session in HNews: the third column reports the topical
classification of each retrieved news item. While the last two entries are derived
from the "Corriere della Sera" (CdS) portal, and their topic label is already
available, i.e. "Categoria: Politica", the first originates from Ansa and lacks
any topic label. Column 3 in the Figure reports the labels automatically
assigned by the HNews classifier, which in the last two cases confirms the original
CdS classification. Notice how, while the first two news items deal with similar
topics, their focus is different, and this is very well reflected by the classifier.</p>
      <p>
        Applying Word Sense Disambiguation for Query Expansion in CLIR
Lexical ambiguity is a fundamental aspect of natural language. Word Sense
Disambiguation (WSD) investigates methods to automatically determine the
intended sense of a word in a given context, according to a predefined set of sense
definitions. These are usually provided by a reference semantic lexicon. The
impact of WSD on IR tasks is still an open issue, and large-scale assessment is
needed. Unsupervised systems are certainly very interesting for their
applicability to non-English, i.e. resource-poor, languages. While state-of-the-art systems
are usually supervised, porting them to other languages is mostly expensive, as
large-scale resources are needed. For this reason, unsupervised approaches to
inductive WSD are very appealing. In the framework of the HNews architecture,
we adopt a network-based model of WSD based on WordNet, as discussed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a variant of the PageRank algorithm, called personalized PageRank,
is presented. WordNet is taken as a network of senses, and a random walk
model of its links is defined. Then, sentences, or entire texts, are used to
initialize the state of the WordNet network, and the stable state of its "random walk"
is taken as the posterior statistics across the senses of the targeted words.
While the approach can be applied either to individual words or to entire
sentences, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] it has been shown that a distributional approach can improve the
personalized PageRank disambiguation algorithm, both in accuracy and in time
complexity. The initial state is determined by a topical expansion of
individual sentences through Latent Semantic Analysis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: the sentence
is first mapped into a vector in the LSA space, and the words closest to this
vector are retained and added to the sentence terms. Initializing the
network with this expanded lexicon allows the resulting PageRank to converge
faster and to more accurate sense distributions. Details of the technique are
discussed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a detailed evaluation of the adopted system for English is
reported; moreover, in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] we have reported an evaluation of the applicability of the
WSD system to the Italian language. Results over the Senseval '07 benchmark
are about 71.5% F1 for English, while the Evalita '07 benchmark, used to
evaluate Italian, yields about 52% F1. The gap is mainly due to the difference
between the WordNet versions used for the English and Italian languages. For its
application in HNews, a large collection of Web news, considered as the specific
domain corpus, has been used to derive the LSA model in which distributional
evidence is represented. We first built the classical vector space model and then
applied Latent Semantic Analysis. Notice that topical similarity across news
items is able to better characterize the typical contents for the suitable word
senses of all terms in a news item. When a sentence, a document or a query is
input, it is first expanded with the set of its closest words in the LSA space. The
expanded query is then used to trigger the personalized PageRank, which
provides the final preferences for the senses of individual terms, e.g. the nouns,
verbs or adjectives used in a query.
      </p>
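      <p>A minimal personalized-PageRank sketch over a toy sense graph may help fix ideas; the graph, the sense names and the teleport distribution below are hand-made stand-ins, whereas the real system walks WordNet and derives the teleport mass from the LSA-expanded context:</p>
      <p>
```python
def personalized_pagerank(graph, teleport, damping=0.85, iters=50):
    """Power iteration; `teleport` is the personalization distribution
    built from the (expanded) context, and biases the stable state."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {n: (1 - damping) * teleport.get(n, 0.0)
                   + damping * sum(rank[m] / len(graph[m])
                                   for m in nodes if n in graph[m])
                for n in nodes}
    return rank

# Two senses of "bank"; the context word "money" pulls mass toward the
# financial sense, so bank#finance ends up outranking bank#river.
graph = {"bank#finance": ["money#1"], "money#1": ["bank#finance"],
         "bank#river": ["river#1"], "river#1": ["bank#river"]}
teleport = {"money#1": 0.5, "bank#finance": 0.25, "bank#river": 0.25}
ranks = personalized_pagerank(graph, teleport)
```
      </p>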
      <p>
        While senses can be part of the document index, the most interesting aspect
of the adoption of WSD in HNews is that it is an enabling factor
for Cross-Language Information Retrieval (CLIR). In fact, although WordNet
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] was developed for the English language, several versions for other languages,
aligned with the English sense hierarchy, have been developed. MultiWordNet
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a WordNet for Italian that is strictly aligned with Princeton WordNet
1.6 at the synset level. A large number of English synsets are in fact put in
correspondence with an Italian synset: the words in the latter are thus synonyms
of each other, while being specific "translations" of the words in the source
English synset. The application of our WSD model to CLIR is thus
straightforward. First, an English query is processed and its terms and Named Entities
are extracted. Then nouns and verbs are disambiguated through our personalized
PageRank, and their WordNet senses are detected. Finally, a translated query in
Italian is obtained. It includes the original (language-neutral) Named Entities as
well as the Italian words obtained from the MultiWordNet synsets corresponding
to the selected senses of the English words. In this way, two versions (English and
Italian) of the same query are obtained, and documents written in both languages
can be returned. In Figure 2, the response to the English query "The death of
Arafat, leader of Palestina" is shown, where the first hits include news both
in Italian and in English. Notice that the scores are able to separate meaningful
from irrelevant news well.
      </p>
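      <p>The translation step just described can be sketched as follows; the synset identifiers and the tiny ALIGNED dictionary are illustrative stand-ins for the MultiWordNet alignment, not its real data:</p>
      <p>
```python
# English synset id to Italian lemmas of the aligned synset (toy stand-in).
ALIGNED = {
    "death.n.01": ["morte", "decesso"],
    "leader.n.01": ["leader", "capo"],
}

def translate_query(entities, disambiguated, aligned=ALIGNED):
    """entities: language-neutral NE strings; disambiguated: (word, synset)
    pairs chosen by the WSD step for the English query terms."""
    italian = []
    for word, synset in disambiguated:
        italian.extend(aligned.get(synset, [word]))  # fall back to source word
    return {"en": entities + [w for w, _ in disambiguated],
            "it": entities + italian}
```
      </p>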
    </sec>
    <sec id="sec-3">
      <title>The retrieval interface</title>
      <p>In a general IR scenario, the user interface must support the user in
submitting queries to the system and in triggering the navigation or browsing of
the returned documents. An example of the browser interface is shown in Figure 3.</p>
      <p>The interface is composed of five different frames, i.e. the top one, three in
the middle and the footer one (see the red boxes in Figure 3). The top frame
provides the query interface, where the user can edit queries as:
- simple expressions, i.e. bags of keywords
- short texts in two languages that are analyzed by HLT tools
- boolean combinations of simple queries</p>
      <p>A variable set of constraints, i.e. individual simple queries, can be designed
by the user, according to the different types of metadata considered. Relevant
fields of the indexes range in fact from the topical category to the full text
content or the Named Entities. The query shown in Fig. 3 is the expression
(Content: "Roma in Campionato" AND Person:"Ranieri" AND</p>
      <p>Organisation:"Roma")
that is, "Find all news that discuss the Roma team in the football league and
Ranieri, i.e. the current coach". It is also possible to specify whether an individual
condition must or may be satisfied, adding some further flexibility in the boolean
combination of individual constraints (Fig. 3).</p>
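      <p>Such a typed boolean query can be assembled as a Lucene-style query string; the field names mirror the index fields mentioned above, while build_query itself is an illustrative helper, not the HNews API:</p>
      <p>
```python
def build_query(clauses, op="AND"):
    """clauses: (field, text) pairs; quotes each text and joins with `op`."""
    parts = ['%s:"%s"' % (field, text) for field, text in clauses]
    return (" %s " % op).join(parts)

q = build_query([("Content", "Roma in Campionato"),
                 ("Person", "Ranieri"),
                 ("Organisation", "Roma")])
```
      </p>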
      <p>The middle frame includes three individual frames. In the central one, the
returned results for a query are shown, as already seen in Fig. 1. A user
click on any result triggers multiple visualization actions. The middle left frame is
used to show the video or the photos related to the selected news item. The
middle right frame shows the metadata related to individual news items, such
as publication dates and times, Web source or editorial category. The bottom
frame, at the footer, changes according to individual selections in the returned
news and it shows their textual contents.</p>
      <p>Whenever an interesting news item has been found, the user can browse the
Web, as links to the originating pages (at the source RSS portal) are made
available. Moreover, the HNews platform also supports a spatial metaphor,
where an information space is shown to the user, centered on the news item
selected in the set of returned answers. The space is obtained through the
mentioned Latent Semantic Analysis of the underlying collection. It aims at
capturing the semantic relatedness between news items (either Web or multimedia)
as well as between news items and Named Entities. A graph local to one selected
news item can in fact be obtained by retrieving all the news items closest to it in
the LSA space. The graph also expresses arcs among these news items whenever
close (i.e. similar) enough news pairs are involved. An example of the spatial view
is shown in Figure 4. In the navigation tool, different visual layers are made
available to capture different useful information:
- Every link represents the similarity of a news pair in the LSA space (the
same Latent Semantic Analysis performed for the WSD step and described in
Section 2.3), so that the closer two news items are, the more relevant they are
for the same queries.
- Different font sizes are used to discriminate timeliness: the more recent a
news item is, the larger its font will be. The resulting zooming effect tries to
balance coverage with timeliness: very old news will not limit the visualization
of more recent material, although it is still retrieved and shown.
- Colors from green to red are used to represent the relevance of a news item
for an input query: green expresses more relevant, i.e. better, responses, while
red is used for the worse ones.
- Shapes discriminate semantic types. While news items (i.e. textual objects)
are shown as colored boxes, ellipses are used to represent entities and their
semantic types, such as persons vs. organizations. Notice that typed entities
are represented in the LSA space just like documents, so that their positioning
in the network and their similarities are seamlessly computed and depicted.</p>
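      <p>The construction of such a local graph can be sketched as follows; the vectors are plain lists standing in for the LSA vectors of Section 2.3, and the similarity threshold is an assumed tuning parameter:</p>
      <p>
```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def local_graph(center, items, vectors, k=5, threshold=0.5):
    """k items closest to `center`, plus arcs among pairs of retained
    nodes whose similarity exceeds the threshold."""
    sims = sorted(((cosine(vectors[center], vectors[i]), i)
                   for i in items if i != center), reverse=True)
    nodes = [center] + [i for _, i in sims[:k]]
    edges = [(a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]
             if cosine(vectors[a], vectors[b]) > threshold]
    return nodes, edges
```
      </p>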
      <p>The resulting graphs have the desirable side effect of expressing a global
view on the information space. Links express distances in the LSA space, and
this implies that news items naturally organize into visual clusters, usually made
of topically related materials. In the example in Figure 4, news aggregates form
two major regions: the bottom one concerns mostly stories related to the
"economics" class, while the upper right one concerns mainly "sport" topics.</p>
      <p>As presented in Section 2.3, the system is also able to search news in different
languages as a side effect of word sense disambiguation. In Figure 2, it is shown
how a query can be submitted in English through a specific parameter (CLIR:
en -&gt; it) of the query frame (i.e. the upper frame). We already discussed the
returned results, which also include news from Italian channels such as Corriere
della Sera or Ansa. Notice that the reverse direction (i.e. queries in Italian and
documents in English) is also currently supported.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>
        In this paper, the HNews system for semantic annotation and indexing of news
from Web newspapers or TV broadcasts has been presented. A complex
family of language processing tools is adopted for Information Extraction, i.e. the
automatic recognition of different types of semantic information. Different models
and approaches are exploited in HNews to extract rich metadata, such as Named
Entities, topical categories or word senses. The resulting platform also supports
cross-lingual queries and advanced boolean combinations of simple queries acting
over texts and metadata. In particular, two languages are currently supported:
Italian and English. News items are downloaded from the Web on a continuous
basis. Indexing proceeds from RSS feeds, so that published materials are available
in almost real time. One of the most innovative aspects of the HNews system
with respect to previous experiences in this area (e.g. the Prestospace system,
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) is the browsing modality offered, which integrates the search and semantic
navigation functionalities. A quantitative model of semantic similarity is in fact
defined over the rich set of metadata, and connected graphs of news items and
Named Entities in the information space are correspondingly obtained. These are
quite effective for quickly focusing on the information of highest relevance and
interest, as their conceptual, rather than merely textual, nature is made explicit
in the graph. HNews is the basis for future developments targeted at supporting
the creation of a community of readers, but also producers, of news and other
contents. In this way, the HNews portal would be directly reusable to support
the gathering of user-generated contents. The exploitation of the latter for
developing models of large-scale, realistic opinion mining processes is the focus of
future research enabled by the HNews system.
      </p>
      <p>Fig. 4. HNews news navigator</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Pazienza</surname>
          </string-name>
          , M.T., ed.: Information Extraction:
          <article-title>A Multidisciplinary Approach to an Emerging Information Technology</article-title>
          ,
          <source>International Summer School</source>
          , SCIE-
          <volume>97</volume>
          , Frascati, Italy,
          <fpage>14</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>1997</year>
          . In Pazienza, M.T., ed.
          <source>: SCIE</source>
          . Volume
          <volume>1299</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cammisa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donati</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Ritroverai: A web application for semantic indexing and hyperlinking of multimedia news</article-title>
          . [
          <volume>9</volume>
          ]
          <fpage>97</fpage>
          -
          <lpage>111</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gospodnetic</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Advanced Text Indexing with Lucene</article-title>
          .
          <source>O'Reilly Media</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Messina</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boch</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimino</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bailer</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schallauer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allasia</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groppo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vigilante</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Creating rich metadata in the TV broadcast archives environment: The PrestoSpace project</article-title>
          .
          <source>In: Proceedings of the AXMEDIS Conference</source>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bikel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weischedel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An algorithm that learns what's in a name</article-title>
          .
          <source>Machine Learning Journal</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Machine learning in automated text categorization</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>34</volume>
          (
          <issue>1</issue>
          ) (
          <year>2002</year>
          )
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazienza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>NLP-driven IR: Evaluating performance over a text classification task</article-title>
          .
          <source>In: Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI</source>
          <year>2001</year>
          ), Seattle, Washington, USA (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ittner</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          :
          <article-title>Text categorization of low quality images</article-title>
          .
          <source>In: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval</source>
          . (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gil</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benjamins</surname>
            ,
            <given-names>V.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          , eds.:
          <source>The Semantic Web - ISWC 2005, 4th International Semantic Web Conference</source>
          , Galway, Ireland, November 6-10,
          <year>2005</year>
          , Proceedings. Volume
          <volume>3729</volume>
          of Lecture Notes in Computer Science, Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>De Cao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luciani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mesiano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossi</surname>
          </string-name>
          , R.:
          <article-title>Robust and efficient page rank for word sense disambiguation</article-title>
          .
          <source>In: Proceedings of TextGraphs-5: Graph-based Methods for Natural Language Processing</source>
          , Uppsala, Sweden (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soroa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Personalizing pagerank for word sense disambiguation</article-title>
          .
          <source>In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. EACL '09</source>
          , Morristown, NJ, USA, Association for Computational Linguistics (
          <year>2009</year>
          )
          <fpage>33</fpage>
          -
          <lpage>41</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge</article-title>
          .
          <source>Psychological Review</source>
          <volume>104</volume>
          (
          <year>1997</year>
          )
          <fpage>211</fpage>
          -
          <lpage>240</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>De Cao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luciani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mesiano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossi</surname>
          </string-name>
          , R.:
          <article-title>Enriched page rank for multilingual word sense disambiguation</article-title>
          .
          <source>In: Proceedings of the 2nd Italian Information Retrieval Workshop</source>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beckwith</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Introduction to WordNet: An on-line lexical database</article-title>
          .
          <source>International Journal of Lexicography</source>
          <volume>3</volume>
          (
          <issue>4</issue>
          ) (
          <year>1990</year>
          )
          <fpage>235</fpage>
          -
          <lpage>312</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pianta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bentivogli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girardi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>MultiWordNet: developing an aligned multilingual database</article-title>
          .
          <source>In: Proceedings of the First International Conference on Global WordNet</source>
          , Mysore, India (January 21-25,
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>