TrentinoMedia: Exploiting NLP and Background
 Knowledge to Browse a Large Multimedia News Store*

Roldano Cattoni¹, Francesco Corcoglioniti¹,², Christian Girardi¹, Bernardo Magnini¹,
                       Luciano Serafini¹, and Roberto Zanoli¹
              ¹ Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy
              ² DISI, University of Trento, Via Sommarive 14, 38123 Trento, Italy
      {cattoni,corcoglio,cgirardi,magnini,serafini,zanoli}@fbk.eu


         Abstract. TrentinoMedia provides access to a large and daily updated
         repository of multimedia news in the Trentino Province. Natural Language
         Processing (NLP) techniques are used to automatically extract knowledge from
         news, which is then integrated with background knowledge from (Semantic) Web
         resources and exploited to enable two interaction mechanisms that ease
         information access: entity-based search and contextualized semantic
         enrichment. TrentinoMedia is a real multimodal archive of public knowledge in
         Trentino and shows the potential of linking knowledge and multimedia and
         applying NLP on a large scale.


1     Introduction
Finding information about an entity in a large news collection using standard
keyword-based search can be time-consuming. Searching for a specific person, for
instance, may return a long list of news items about homonymous persons, which must be
checked and filtered manually. Understanding the content of a news item can also be
costly if the user is not familiar with the entities mentioned and needs information
about them.
     The TrentinoMedia system presented here shows how NLP and Semantic Web techniques
can help address these problems. TrentinoMedia supports “smart” access to a large and
dynamic (daily updated) repository of multimedia news in the Italian Trentino Province.
“Smart” means that NLP techniques are used to automatically extract knowledge about the
entities mentioned in the news. The extracted knowledge is then integrated with
background knowledge about the same entities gathered from (Semantic) Web resources, so
as to build a comprehensive knowledge base of entity descriptions linked to the news in
the collection. Exploiting this interlinking of knowledge and multimedia, two
interaction mechanisms are provided to ease information access:
    – entity-based search, enabling a user to find exactly the news about a specific entity;
    – contextualized semantic enrichment, i.e. the visualization of additional
      knowledge about a mentioned entity that may ease a user’s understanding of a
      news item.
     Two main usages are foreseen for TrentinoMedia: (i) a professional usage,
restricted to the news providers and aimed at addressing internal needs, including
automatic news documentation, support tools for journalists and integration of advanced
functionalities in existing editorial platforms; and (ii) an open use by citizens through
on-line subscriptions to news services, possibly delivered to mobile devices.

* This work was supported by the LiveMemories project (http://www.livememories.org) funded
  by the Autonomous Province of Trento (Italy) under the call “Major Project”.

Fig. 1: System architecture (frontend; KnowledgeStore; content processing pipeline with
content acquisition, resource preprocessing, mention extraction, coreference resolution,
linking and entity creation; inputs: news and background knowledge)

Fig. 2: Data model (Resource: type, DC metadata, multimedia file, inter-resource
relations; Mention: position, extent, extracted attributes, coreference cluster; Entity:
name; Triple: subject (implicit), predicate, object; Context: time, location, topic. A
mention is contained in a resource, denotes an entity and has a matching context; an
entity is described by triples; a triple holds in a context)

    The presented work has been carried out within the LiveMemories project, aimed
at automatically interpreting heterogeneous multimedia resources, transforming them
into “active memories”. The remainder of the paper presents the system architecture in
section 2 and the demonstrated user interface in section 3, while section 4 concludes.

2      System architecture
The architecture of TrentinoMedia is shown in figure 1 and includes three components:
the KnowledgeStore, a content processing pipeline and a Web frontend.
    The KnowledgeStore [3] builds on Hadoop³ and HBase⁴ to provide scalable storage for
multimedia news and background knowledge, which are represented according to the
(simplified) schema of figure 2. News are stored as multimedia resources, which include
texts, images and videos. Knowledge is stored as a contextualized ontology. It consists
of a set of entities (e.g., “Michael Schumacher”) which are described by
⟨subject, predicate, object⟩ RDF [2] triples (e.g., ⟨Michael Schumacher, pilot of,
Mercedes GP⟩). In turn, each triple is associated with the ⟨time, space, topic⟩ context
in which the represented fact holds (e.g., ⟨2012, World, Formula 1⟩). The representation
of contexts follows the Contextualized Knowledge Repository approach [5] and makes it
possible to accommodate “conflicting” knowledge holding under different circumstances
(e.g. the fact that Schumacher raced for different teams). Resources and entities are
linked by mentions, i.e. proper names in a news item that refer to an entity. Mentions
allow navigation from a news item to its mentioned entities and back, realizing the tight
interlinking of knowledge and multimedia on which the interaction mechanisms of
TrentinoMedia are based.
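
The data model described above can be illustrated with a minimal Python sketch. It is not the KnowledgeStore schema or API: all class, field and function names are assumptions made for illustration, as are the resource identifier and the second (Ferrari, 2004) triple, which only serves to show two facts holding in different contexts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Context:
    time: str
    space: str
    topic: str


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    context: Context  # the context in which the fact holds


@dataclass
class Entity:
    name: str
    triples: List[Triple] = field(default_factory=list)


@dataclass
class Resource:
    resource_id: str
    kind: str  # e.g. "text", "image" or "video"


@dataclass
class Mention:
    resource: Resource               # the resource containing the mention
    extent: str                      # surface form, e.g. "Schumi"
    entity: Optional[Entity] = None  # the entity the mention denotes, if known


def entities_in(resource, mentions):
    """Navigate from a news resource to the entities it mentions."""
    found = []
    for m in mentions:
        if m.resource is resource and m.entity is not None and m.entity not in found:
            found.append(m.entity)
    return found


def resources_about(entity, mentions):
    """Navigate from an entity back to the resources mentioning it."""
    found = []
    for m in mentions:
        if m.entity is entity and m.resource not in found:
            found.append(m.resource)
    return found


# Illustrative data: two "conflicting" facts that hold in different contexts.
schumacher = Entity("Michael Schumacher", [
    Triple("Michael Schumacher", "pilot of", "Mercedes GP",
           Context("2012", "World", "Formula 1")),
    Triple("Michael Schumacher", "pilot of", "Ferrari",
           Context("2004", "World", "Formula 1")),
])
news = Resource("news-001", "text")  # hypothetical identifier
mentions = [
    Mention(news, "Michael Schumacher", schumacher),
    Mention(news, "Schumi", schumacher),
]
```

Here `entities_in(news, mentions)` and `resources_about(schumacher, mentions)` follow the mention links in the two directions exploited by the interaction mechanisms.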
    Concerning the content processing pipeline, it integrates a number of NLP tools
to load, process and interlink news and background knowledge, resulting in the full
population of the schema in figure 2. Apart from the loading of background knowledge,
which is bootstrapped by manually selecting and wrapping the relevant knowledge
sources, the pipeline works automatically and incrementally, processing news as
they are collected daily. The rest of this section describes the processing steps of the
pipeline, while the user interface of the frontend is described in the next section.

 ³ http://hadoop.apache.org
 ⁴ http://hbase.apache.org

Table 1: Resource statistics
Provider        News      Images    Videos
l’Adige         733,738   21,525    -
VitaTrentina    33,403    14,198    -
RTTR            2,455     -         120 h
Fed. Coop.      1,402     -         -
Total           770,998   35,723    120 h

Table 2: Extraction, coreference and linking statistics
Entity type     Recognized mentions   Mention clusters   Linked clusters   Total entities
persons         5,566,174             340,147            5.03%             351,713
organizations   3,230,007             16,649             7.96%             17,129
locations       3,224,539             52,478             48.64%            52,478
Total           12,020,720            409,274            10.74%            421,320

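
The overall linking coverage in the last row of table 2 can be cross-checked as the cluster-weighted average of the per-type percentages. The short sketch below only re-does this arithmetic on the figures reported in the table; the variable names are illustrative.

```python
# Mention clusters and linked-cluster percentages per entity type (table 2).
clusters = {"persons": 340_147, "organizations": 16_649, "locations": 52_478}
linked_pct = {"persons": 5.03, "organizations": 7.96, "locations": 48.64}

total_clusters = sum(clusters.values())

# Estimated number of linked clusters, reconstructed from the rounded percentages.
linked = sum(clusters[k] * linked_pct[k] / 100 for k in clusters)

# Overall coverage as a percentage of all mention clusters.
overall_pct = 100 * linked / total_clusters
print(total_clusters, round(overall_pct, 2))
```

The result agrees with the totals reported in the table (409,274 clusters, 10.74% linked).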
Content acquisition. News are supplied daily by a number of news providers local to
the Trentino region. They are in Italian, cover a time period from 1999 to 2011 and con-
sist of text articles, images and videos. Loading of news is performed automatically and
table 1 shows some statistics about the news collected so far. Background knowledge
is collected manually through a set of ad-hoc wrappers from selected (Semantic) Web
sources, including selected pages of the Italian Wikipedia, sport-related community
sites and the sites of local and national public administrations and government bodies.
Overall, it consists of 352,244 facts about 28,687 persons and 1,806 organizations.
Resource preprocessing. Several operations are performed on stored news with the
goal of easing their further processing in the pipeline. Preprocessing includes the
extraction of speech transcriptions from audio and video news, the annotation of news
with a number of linguistic taggers (e.g., part-of-speech tagging and temporal
expression recognition, performed using the TextPro tool suite⁵ [6]) and the
segmentation of complex news into their components (e.g., the separation of individual
stories in a news broadcast or the extraction of texts, figures and captions from a
complex XML article).
Mention extraction. Textual news are processed with TextPro to recognize mentions
of three types of named entities: persons, organizations and geo-political / location
entities. For each mention, a number of attributes is extracted from the mention and its
surrounding text. Given the text “the German pilot Michael Schumacher”, for instance,
the system recognizes “Michael Schumacher” as a person mention and annotates it with
FIRST NAME “Michael”, LAST NAME “Schumacher”, ROLE “pilot” and NATIONALITY
“German”. Attributes are extracted based on a set of rules (e.g., to split first and last
names) and language-specific lists of words (e.g., for nationalities, roles, ...). Statistics
about the mentions recognized so far are reported in the second column of table 2.
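
The rule-and-list approach to attribute extraction can be sketched as follows. This is a toy illustration, not the TextPro implementation: the function name, the splitting rule and the English word lists are assumptions (the actual system works on Italian text with its own rules and lists).

```python
# Hypothetical, tiny word lists; the real lists are language-specific and larger.
ROLES = {"pilot", "mayor", "president"}
NATIONALITIES = {"German", "Italian", "Finnish"}


def extract_person_attributes(mention: str, left_context: str) -> dict:
    """Derive attributes from a person mention and the words preceding it."""
    attrs = {}
    tokens = mention.split()
    # Naive rule: first token is the first name, the rest is the last name.
    if len(tokens) >= 2:
        attrs["FIRST_NAME"] = tokens[0]
        attrs["LAST_NAME"] = " ".join(tokens[1:])
    elif tokens:
        attrs["LAST_NAME"] = tokens[0]
    # List lookup on the surrounding text.
    for word in left_context.split():
        if word in ROLES:
            attrs["ROLE"] = word
        elif word in NATIONALITIES:
            attrs["NATIONALITY"] = word
    return attrs


attrs = extract_person_attributes("Michael Schumacher", "the German pilot")
```

For the example text above, `attrs` holds the four attributes listed in the paragraph (first name, last name, role and nationality).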
Coreference resolution. This step consists in grouping together in a mention cluster
all the mentions that (are assumed to) refer to the same entity, e.g., deciding that the
mentions “Michael Schumacher” and “Schumi” in different news denote the same person.
Two coreference resolution systems are used. Person and organization mentions are
processed with JQT2 [8], a system based on the Quality Threshold algorithm [4] that
compares every pair of mentions and decides for coreference if their similarity score is
above a certain dynamic threshold; similarity is computed based on a rich set of features
(e.g., mention attributes and nearby words), while the threshold is higher for ambiguous
names, requiring more “evidence” to assume coreference. Location mentions are
processed with GeoCoder [1], a system based on geometric methods and on the idea that
locations in the same news are likely to be close to one another; it exploits the Google
Maps geo-referencing service⁶ and the GeoNames geographical database⁷. Statistics
about the mention clusters identified so far are reported in the third column of table 2.

 ⁵ http://textpro.fbk.eu
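
The idea of a name-dependent threshold can be illustrated with a simplified sketch. It is neither JQT2 nor the Quality Threshold algorithm: the Jaccard similarity, the linear threshold rule, the ambiguity counts and the restriction to mentions sharing the same surname are all assumptions chosen to keep the example small (name variants such as “Schumi” are not handled).

```python
from itertools import combinations


def similarity(a, b):
    """Jaccard overlap between the feature sets of two mentions."""
    fa, fb = a["features"], b["features"]
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def threshold(name, ambiguity, base=0.3, step=0.1, cap=0.9):
    """Require more evidence for names borne by many different people."""
    return min(cap, base + step * (ambiguity.get(name, 1) - 1))


def cluster(mentions, ambiguity):
    """Merge same-name mentions whose similarity reaches the threshold."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        a, b = mentions[i], mentions[j]
        if a["name"] == b["name"] and similarity(a, b) >= threshold(a["name"], ambiguity):
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(mentions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


MENTIONS = [
    {"name": "Schumacher", "features": {"pilot", "German", "Ferrari", "race"}},
    {"name": "Schumacher", "features": {"pilot", "German", "Mercedes", "race"}},
    {"name": "Rossi", "features": {"mayor", "Trento", "council"}},
    {"name": "Rossi", "features": {"mayor", "Trento", "football"}},
]
# Hypothetical estimate of how many distinct people share each surname.
AMBIGUITY = {"Schumacher": 1, "Rossi": 5}

clusters = cluster(MENTIONS, AMBIGUITY)
```

In this toy data the two “Schumacher” mentions are merged, while the two “Rossi” mentions stay apart even though they share features, because the more ambiguous surname raises the threshold.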
Linking and entity creation. The last step consists in linking mention clusters to
entities in the background knowledge and to external knowledge sources. Clusters of
location mentions are already linked to well-known GeoNames toponyms by GeoCoder.
Clusters of person and organization mentions are linked to entities in the background
knowledge by exploiting the representation of contexts. The algorithm [7] first
identifies the ⟨time, space, topic⟩ contexts most appropriate for a mention cluster among
the ones in the KnowledgeStore, based on the mention attributes and the metadata of
the containing news (e.g., the publication date). Then, it searches for a matching entity
only in those contexts, improving disambiguation. The fourth column of table 2 reports
the fraction of mention clusters linked by the systems, i.e. the linking coverage:
coverage is low in terms of clusters (10.74%) but higher in terms of mentions (31.03%),
meaning that the background knowledge mainly consists of popular (and thus frequently
mentioned) entities. New entities are then created and stored for unlinked mention
clusters, as they denote real-world entities unknown in the background knowledge; the
last column of table 2 reports the total number of entities obtained so far. Finally, all
the entities are associated with the corresponding Wikipedia pages using the
WikiMachine tool⁸.
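
The two-step, context-restricted linking can be sketched as below. This is a toy illustration and not the algorithm of [7]: the background knowledge layout, the exact-match context selection, the surname-based matching and the entity “Anna Schumacher” are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Context = Tuple[int, str, str]  # (time, space, topic)

# Hypothetical background knowledge: entity names indexed by context.
BACKGROUND: Dict[Context, List[str]] = {
    (2010, "World", "Formula 1"): ["Michael Schumacher", "Kimi Raikkonen"],
    (2010, "Trentino", "Politics"): ["Anna Schumacher"],  # invented entity
}


def link_cluster(surname: str, year: int, topic: str,
                 kb: Dict[Context, List[str]]) -> Optional[str]:
    """Return the matching background entity, or None if the cluster is unlinked."""
    # Step 1: select the contexts compatible with the cluster and news metadata.
    contexts = [c for c in kb if c[0] == year and c[2] == topic]
    # Step 2: look for matching entities only inside those contexts.
    candidates = [e for c in contexts for e in kb[c] if e.split()[-1] == surname]
    return candidates[0] if len(candidates) == 1 else None


linked = link_cluster("Schumacher", 2010, "Formula 1", BACKGROUND)
unlinked = link_cluster("Rossi", 2010, "Formula 1", BACKGROUND)
```

Without the context restriction, “Schumacher” would match two entities in this toy knowledge base; restricting the search to the Formula 1 context leaves a single candidate. A cluster with no match (here “Rossi”) would lead to the creation of a new entity.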

3    User Interface
The entry point of the TrentinoMedia Web interface is a search page supporting
entity-based search. The user supplies a proper name, which is looked up among the
entities in the KnowledgeStore, and a list of matching entities is returned for
disambiguation; entities are listed by type and distinguished with short labels generated
from stored information, as in figure 3, left side. By selecting an entity, the user is
presented with the list of news mentioning that entity, retrieved based on the
associations between entities, mentions and resources stored in the KnowledgeStore. A
descriptive card is also displayed, as shown on the right side of figure 3. It contains
all the information known about the entity, including: (i) background knowledge,
(ii) information carried by the attributes of the entity mentions and (iii) frequently
co-occurring and likely related entities. The example in figure 3 shows the potential but
also the weaknesses of processing noisy, real-world data with automatic NLP tools. In
particular, typos and the use of different names for the same entity (e.g., acronyms,
abbreviations) may cause coreference resolution to fail and identify multiple entities in
place of one, as happens with “F1” and “Formula 1”, “Raikkonen” and “Kimi Raikkonen”,
“Micheal Schumacher” and “Michael Schumacher”. Still, the use of additional information
extracted from texts (e.g., keywords) can often overcome the problem, as happens with
the correct coreference of “Schumacher”, “Michael” and “Michael Shumacher”.

 ⁶ http://code.google.com/apis/maps/
 ⁷ http://www.geonames.org/
 ⁸ http://thewikimachine.fbk.eu/

Fig. 3: Example of entity-based search for query “Schumacher” in TrentinoMedia.

Fig. 4: SmartReader showing a sport news.
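
The two steps of entity-based search (name lookup with disambiguation by type, then retrieval of the news mentioning the selected entity) can be sketched as follows. The identifiers, the in-memory tables and the token-based matching are assumptions for illustration and do not reflect the KnowledgeStore query interface.

```python
from collections import defaultdict

# Hypothetical entity table; the misspelled variant is kept as a separate
# entity, as in the coreference failure discussed for figure 3.
ENTITIES = [
    {"id": "e1", "type": "person", "name": "Michael Schumacher"},
    {"id": "e2", "type": "person", "name": "Micheal Schumacher"},
    {"id": "e3", "type": "person", "name": "Kimi Raikkonen"},
]
# Hypothetical entity-to-news associations derived from mentions.
MENTIONED_IN = {
    "e1": ["news-001", "news-002"],
    "e2": ["news-003"],
    "e3": ["news-002"],
}


def search(query):
    """Return the entities whose name contains the query token, grouped by type."""
    q = query.lower()
    hits = defaultdict(list)
    for entity in ENTITIES:
        if q in entity["name"].lower().split():
            hits[entity["type"]].append(entity)
    return dict(hits)


def news_about(entity_id):
    """Return the news linked to the selected entity through its mentions."""
    return MENTIONED_IN.get(entity_id, [])


results = search("Schumacher")
selected_news = news_about("e1")
```

Because retrieval goes through entities rather than keywords, selecting `e1` returns only the news linked to that entity, and the misspelled duplicate shows up as a distinct candidate in the disambiguation list.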
    The other interaction mechanism supported by TrentinoMedia, contextualized
semantic enrichment, is accessed through the SmartReader interface shown in figure 4,
which is displayed by selecting a news item. The SmartReader allows a user to read news
or watch videos while gaining access to related information linked in the KnowledgeStore
to the news and its mentioned entities. The interface is organized in two panels.
The left panel displays the text of the news or the video with its speech transcription
and allows the user to selectively highlight the recognized mentions of named entities.
The right panel provides contextual information that enriches the news or a selected
mention. It can display a cloud of automatically extracted keywords, each providing
access to related news. It can also show additional information about the selected
mention, by presenting: (i) the Wikipedia page associated with the mentioned entity,
(ii) a map displaying a location entity and (iii) a descriptive card with information
about the entity in the background knowledge. In the latter case, only facts which are
valid in the ⟨time, space, topic⟩ context of the news are shown, e.g., that “Schumacher
is a pilot of Mercedes GP in 2010”, so as to avoid overloading and confusing the user
with irrelevant information.
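
The context-based selection of facts for the descriptive card can be sketched as a simple filter. The names below are illustrative, exact context equality is an assumption (the system may use a richer notion of context compatibility), and the Ferrari fact is added only as an example of a fact holding in a different context.

```python
from typing import List, NamedTuple, Tuple

Context = Tuple[str, str, str]  # (time, space, topic)


class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: Context


def facts_for_card(facts: List[Fact], entity: str, news_context: Context) -> List[Fact]:
    """Keep only the facts about the entity that hold in the context of the news."""
    return [f for f in facts if f.subject == entity and f.context == news_context]


FACTS = [
    Fact("Michael Schumacher", "pilot of", "Mercedes GP", ("2010", "World", "Formula 1")),
    Fact("Michael Schumacher", "pilot of", "Ferrari", ("2004", "World", "Formula 1")),
]

card = facts_for_card(FACTS, "Michael Schumacher", ("2010", "World", "Formula 1"))
```

For a news item in the 2010 Formula 1 context, only the Mercedes GP fact reaches the card; the fact holding in the 2004 context is left out.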

4    Conclusions
TrentinoMedia shows how the application of NLP techniques and the interlinking of
knowledge and multimedia resources can be beneficial to users accessing information
contents. In particular, two mechanisms to exploit this interlinking are demonstrated:
entity-based search exploits links from knowledge (entities) to resources, while
semantic enrichment exploits links in the opposite direction. TrentinoMedia also shows
that NLP and Semantic Web technologies are mature enough to support the large scale
extraction, storage and processing of knowledge from multimedia resources.

References
1. Buscaldi, D., Magnini, B.: Grounding toponyms in an Italian local news corpus. In: Proc. of
   6th Workshop on Geographic Information Retrieval. pp. 15:1–15:5. GIR ’10 (2010)
2. Carroll, J.J., Klyne, G.: Resource description framework (RDF): Concepts and abstract syntax.
   W3C recommendation (2004), http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
3. Cattoni, R., Corcoglioniti, F., Girardi, C., Magnini, B., Serafini, L., Zanoli, R.: The
   KnowledgeStore: an entity-based storage system. In: Proc. of 8th Int. Conf. on Language
   Resources and Evaluation. LREC ’12 (2012)
4. Heyer, L.J., Kruglyak, S., Yooseph, S.: Exploring expression data: Identification and analysis
   of coexpressed genes. Genome Research 9(11), 1106–1115 (1999)
5. Homola, M., Tamilin, A., Serafini, L.: Modeling contextualized knowledge. In: Proc. of 2nd
   Int. Workshop on Context, Information And Ontologies. CIAO ’10, vol. 626 (2010)
6. Pianta, E., Girardi, C., Zanoli, R.: The TextPro tool suite. In: Proc. of 6th Int. Conf. on
   Language Resources and Evaluation. LREC ’08 (2008)
7. Tamilin, A., Magnini, B., Serafini, L.: Leveraging entity linking by contextualized background
   knowledge: A case study for news domain in Italian. In: Proc. of 6th Workshop on Semantic
   Web Applications and Perspectives. SWAP ’10 (2010)
8. Zanoli, R., Corcoglioniti, F., Girardi, C.: Exploiting background knowledge for clustering
   person names. In: Proc. of Evalita 2011 – Evaluation of NLP and Speech Tools for Italian
   (2012), to appear