A proposal for an ontology supported news
         reader and question-answer system

                           José Saias and Paulo Quaresma

                            Universidade de Évora, Portugal
                               jsaias|pq@di.uevora.pt


        Abstract. Reading the news is a very time-consuming task. We present
        a methodology for a system that automatically analyses “last hour”
        news articles, offering semantic-based features and real-time news
        reaction options.
        This proposal is based on an ontology knowledge representation, natural
        language processing and a logic-programming framework.


1     Introduction

 In the last decade the volume of information available on the web has grown
exponentially. There are many more sources of information, and each one seems to
produce many more potentially relevant documents. As an effect of globalization,
news from a remote point of the globe has gained importance and may influence
some aspects of our lives. Newspapers, TV and other media spread the news of
any event to the whole world.

 The average citizen can read the papers and watch some news programmes on TV.
However, it is not possible to be aware of everything that happens in the world.
On the other hand, most of the information carried by the media may not be
relevant to a given citizen. All this is especially important to professionals
whose activity relies on news analysis, such as stock market brokers, military
intelligence analysts or economists.

 Nowadays, the main newspapers have an online RSS¹ service where they publish
the latest news to all Internet users. Computer-based systems can help people
by allowing a quicker and broader analysis of the available sources. Filtering by
date and section (politics, economy) is not enough for today’s demands. Some
automatic processing of the body text of each article is needed in order to
capture the semantics expressed in it. This might involve NLP techniques and an
inference-enabled system. The semantic information captured from a document
is stored in a knowledge base. Ontologies allow the definition of class hierarchies,
object properties and relation rules, such as transitivity or functionality. The
information extracted from each news document is given a formal representation,
associated with an ontology. The resulting knowledge base holds the list of facts,
expressed as instances of ontology classes, in a semantic context. It is then
possible to make some inferences about them.

¹ Really Simple Syndication (sometimes also used for Rich Site Summary) is a
  popular XML format for Web content publication.

 This paper proposes an ontology-based methodology for news article processing
in order to:
 – cover a large number of documents
 – identify the most relevant documents
 – try to automatically understand some information in those documents
 – set a notification or action to be triggered when a certain ’thing’ happens
 – get automatic answers to some simple questions

 The initial knowledge base is described in the next section. Section 3 shows
the techniques used for news document analysis and knowledge base evolution.
In section 4 we present the system features and how they are accomplished. Fi-
nally, section 5 describes some related work and in section 6 some conclusions
and future work are pointed out.


2     Common sense Knowledge Base
 When we have an isolated sentence, it is usually difficult to automatically
capture its semantics. Our approach uses an ontology as the starting knowledge
base, with semantic information that helps to perform the sentence analysis and
the subsequent inferences and queries. Each element found in a sentence is
related to the starting knowledge base and also to the semantics of the
previously processed sentences.

 Our knowledge base, Senso, is still in development. It combines a taxonomy of
classes with a list of semantic information gathered from several sources. Some
influences were taken from previous work ([2] and [1]), where we had a top-level
OWL² ontology with some basic concepts. This ontology has a hierarchy of
concepts used to organize the set of concepts mentioned in Portuguese text
documents.

 We used OWL because it has the intended semantic features and is suitable
for web publication, allowing us to share parts of our knowledge base in a direct
and appropriate manner.
² OWL [4] is the short name for Web Ontology Language, a language proposed
  by the W3C to be used in the Semantic Web for the representation of
  ontologies. It is based on the earlier DAML+OIL (DARPA Agent Markup
  Language) language and is defined using RDF (Resource Description
  Framework).
 Besides the formal concept definitions and “IsA” relations, there are a few
simple facts about everyday life that might be very useful for document analysis.
This led us to ConceptNet [3], a freely available common sense knowledge base
and natural language processing toolkit. This tool gave us access to a semantic
network presently available in two versions, concise (200,000 assertions) and
full (1.6 million assertions), covering spatial, physical, social, temporal,
and psychological aspects of everyday life.
ConceptNet is available in English, and our work needs Portuguese. We started by
choosing a set of terms that we were interested in. Then we automatically
followed their relations in the ConceptNet semantic network for a couple of
nodes and collected those terms too. Having identified the terms and relations,
we ran an automatic translation. Finally, we filtered out some wrong
translations using another Portuguese dictionary.

 The top-level ontology and the translated ConceptNet terms were merged
manually. This work was done by several people using a web application where we
could browse the database, insert new classes, update relations between classes,
and add or remove facts.
Our current ontology (the Senso Knowledge Base) contains about 2000 concepts
connected by several relations: isA, usedFor, locatedAt, capableOf and madeOf.
All terms are written in Portuguese and they do not belong to a specific
domain of knowledge. These concepts and relations form a small common
sense knowledge base about places, entities and events. Some of the top-level
concepts are AbstractConcept (the root concept), Event, Time and Entity, as
shown in figure 1.
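
 To make the role of the relations more concrete, the following is a minimal
sketch (SWI-Prolog assumed) of how isA and usedFor facts could be written and
how transitivity of isA could be expressed as a rule. The predicate names
is_a/2, used_for/2 and kind_of/2, the English term names and the intermediate
class artifact are hypothetical; only the firearm facts and the top-level
concept names come from the text. This is not the actual Senso storage format.

```prolog
% Hypothetical excerpt in the style of the Senso relations
% (English names are used here only for readability).
is_a(firearm, weapon).
is_a(weapon, artifact).          % 'artifact' is an invented intermediate class
is_a(artifact, entity).
is_a(entity, abstract_concept).  % AbstractConcept is the root concept

used_for(firearm, murder).
used_for(firearm, hunt).
used_for(firearm, shoot).
used_for(firearm, protect).

% kind_of(?Sub, ?Super): transitive closure of is_a/2.
% Terminates as long as the is_a/2 facts contain no cycle.
kind_of(Sub, Super) :- is_a(Sub, Super).
kind_of(Sub, Super) :- is_a(Sub, Mid), kind_of(Mid, Super).
```

 With these facts, the goal kind_of(firearm, entity) succeeds through three
isA steps, and the conjunction kind_of(X, weapon), used_for(X, hunt) binds X to
firearm.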


                          Fig. 1. Senso top-level concepts


 Figure 2 shows a screenshot of the Senso KB web interface. We can see a search
in the ontology for terms matching a certain pattern (here: ão) and having an IsA
relation with the term animal. The result of such a query includes dog and lion
(cão and leão), which in Portuguese match the specified pattern, as shown in
figure 3.
              Fig. 2. Web interface for Senso Knowledge Base analysis
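
 The search described above can be sketched in a few lines of Prolog
(SWI-Prolog assumed). The predicate names is_a/2 and search/3 and the sample
entries are hypothetical, and only direct isA relations are considered; the
real web interface may work differently.

```prolog
:- encoding(utf8).

% Hypothetical Portuguese entries: is_a(Term, Class).
is_a('cão', animal).
is_a('leão', animal).
is_a('gato', animal).
is_a('pão', alimento).

% search(+Pattern, +Class, -Terms): terms directly related to Class by isA
% whose name contains Pattern as a substring.
search(Pattern, Class, Terms) :-
    findall(Term,
            ( is_a(Term, Class),
              once(sub_atom(Term, _, _, _, Pattern)) ),
            Terms).
```

 Under these assumptions, search('ão', animal, Terms) collects 'cão' and
'leão', while 'gato' (no match) and 'pão' (not an animal) are left out.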


Clicking any of those results shows that term’s details, as illustrated in
figure 4 for the term firearm. The lines shown there mean that a firearm is a
kind of weapon that might be used for actions like murder, hunt, shoot
something or protect.
The next section explains the document analysis performed by the system.


                         Fig. 3. Senso: partial query result


3     Fetching and processing the news

                    Fig. 4. Senso: some details on firearm term


  The proposed system might be used with any text document written in
Portuguese. In this case, we focus our work on news articles published daily by
the national media.
Some popular newspapers, like Público³ or Correio da Manhã⁴, have a “last hour”
news section on their web site, including an RSS channel. This is suitable for an
automatic search for any recently added news article.
We used a program to periodically collect the recent news from Público’s RSS
channel. As we can see in figure 5, each news item has some metadata fields:
title, description, author, category, publication date and time, and of course
the link to the web document containing the information. The category gives
us a first simple classification of the document, placing it in Economy, Politics,
International or Sports (in Portuguese, Desporto, like the item listed in figure
5). The publication date gives the temporal context for the semantic content we
find in the document. Later we will see some examples.

³ http://www.publico.pt/
⁴ http://www.correiodamanha.pt


             Fig. 5. Document source: RSS news channel from Público
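
 Since the publication date is later used as temporal context, it helps to
turn the RSS date string into a structured term. The sketch below
(SWI-Prolog 7 or later assumed, where double-quoted text is a string) does this
for the item/3 fact shown in this section. The helper names month_number/2,
pub_date/2 and item_date/2 are hypothetical and not part of the described
system.

```prolog
% The stored item, as in the paper: item(Id, Category, PubDate).
item(publico1259478, 'Desporto', 'Sun, 04 Jun 2006 16:09:00 GMT').

month_number("Jan", 1).  month_number("Feb", 2).  month_number("Mar", 3).
month_number("Apr", 4).  month_number("May", 5).  month_number("Jun", 6).
month_number("Jul", 7).  month_number("Aug", 8).  month_number("Sep", 9).
month_number("Oct", 10). month_number("Nov", 11). month_number("Dec", 12).

% pub_date(+PubDate, -Date): extract date(Y, M, D) from an RSS date string
% of the form 'Sun, 04 Jun 2006 16:09:00 GMT'.
pub_date(PubDate, date(Y, M, D)) :-
    % Split at spaces and strip the comma that follows the day name.
    split_string(PubDate, " ", ",", [_DayName, DayStr, MonthStr, YearStr | _]),
    number_string(D, DayStr),
    month_number(MonthStr, M),
    number_string(Y, YearStr).

% item_date(?Id, -Date): publication date of a stored item.
item_date(Id, Date) :-
    item(Id, _Category, PubDate),
    pub_date(PubDate, Date).
```

 For the item above, item_date(publico1259478, Date) yields
Date = date(2006, 6, 4).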


 Each document imported into the system has a text body. That text is
processed following a methodology based on natural language processing
techniques, namely a syntactical parser and a semantic analyzer able to obtain a
partial interpretation of the document.

  The tool used for the syntactical analysis is PALAVRAS [5], a syntactical
parser developed by E. Bick within the VISL project⁵. This parser is based on
the Constraint Grammar formalism and is able to cover a large percentage of
the Portuguese language. Because the parser output is in a non-standard format,
it was necessary to transform it into a structured form, such as XML and Prolog
terms. This was accomplished with the translation tool Xtractor⁶ [6], which
converts VISL output to Prolog and XML.
Let us consider a sentence in the above sports news item:

     “Marcus Grönholm venceu neste domingo o Rali da Grécia.”
     (in English: “Marcus Grönholm won the Greece Rally, this Sunday.”)

 As can be seen in figure 6, the parser correctly identified the subject, the
predicate and the direct object.

 The next step is the semantic analysis. The technique used for this process is
based on Discourse Representation Structures (DRS) [7]. The partial semantic
representation of a sentence is a DRS built with two lists, one with the rewritten
sentence and the other with the sentence discourse referents.
We perform only a restricted semantic analysis and are not able to handle every
aspect of the semantics: our focus is on the representation of concepts (nouns
and verbs) and the correct extraction of their properties (modifiers, agents,
objects). In the last section of this paper we point out some possible
improvements to this semantic analysis.
The previous news item is stored in the system with the following details:

item(publico1259478,
     'Desporto',
     'Sun, 04 Jun 2006 16:09:00 GMT').
...
sentence(publico1259478,
      [ name(A, 'Marcus_Grönholm' , ['M/F', 'S', 'Marcus_Grönholm' ],
           [ ] ),
        name(B, 'Rali_da_Grécia' , ['M', 'S', 'Rali_da_Grécia' ],
           [ ] ),
        'vencer'(A,B,
           [ modif(verb,'vencer', ['PS','3S','IND'] ) ] ),
        [ modif(temp,'domingo', ['M','S'],
                 [ modif(pronDet,'este', ['M','S'], []) ] ) ]     ],
      [ ref(A), ref(B) ] ).

... (other sentences)
⁵ http://visl.hum.sdu.dk/visl
⁶ It is also available to the other VISL users at http://abc.di.uevora.pt/xtractor

Parser PALAVRAS output:

    STA:fcl
    =SUBJ:prop('Marcus_Grönholm' M/F S)     Marcus_Grönholm
    =P:v-fin('vencer' PS 3S IND)     venceu
    =ADVL:pp
    ==H:prp('em' <sam->)     em
    ==P<:np
    ===>N:pron-det('este' M S <dem> <-sam>) este
    ===H:n('domingo' M S <temp>)     domingo
    =ACC:np
    ==>N:art('o' M S <artd>)         o
    ==H:prop('Rali_da_Grécia' M S) Rali_da_Grécia
    =.

And the Prolog representation:

 sentence(syn(sta(fcl,
    subj(prop('Marcus_Grönholm','M/F','S'),'Marcus_Grönholm'),
    p(v_fin('vencer','PS','3S','IND'),'venceu'),
     advl(pp, h(prp('em','<sam->'),'em'),
       p(np, n(pron_det('este','M','S','<dem>','<-sam>'),'este'),
       h(n('domingo','M','S','<temp>'),'domingo'))),
     acc(np, n(art('o','M','S','<artd>'),'o'),
       h(prop('Rali_da_Grécia','M','S'),'Rali_da_Grécia', '.'))))).


                     Fig. 6. Syntactical analysis for a sentence


   Notice that we store some metadata, like the item category and the time of
publication. This is useful when a sentence has a temporal modifier that is
related to the publication date, as in the examples of section 4.3.


 All documents retrieved from the newspaper RSS channel are processed using
this technique. This produces a knowledge base that we will use for inference
processes as we describe in the next section.


4     Using the system

 Once the news documents are obtained from the source and analyzed, they
become part of the second knowledge base used by the system. This is called
the facts knowledge base. Used together with the Senso ontology, it allows the
system to perform some inference and then offer useful features. Three main
features are presented in the following subsections.
4.1     Using a better search filter for news

  When we visit the newspaper site, we can search for news items from a given
category or for items whose text contains some pattern. This is not enough
when we want more than pattern matching. Suppose we want to find the
items about rally drivers who play some instrument. Simple syntactical searches
would probably return many documents with low relevance.
Our system can receive a search instruction and retrieve the set of documents
that satisfy its condition. It is possible to search for documents about some
entity using the description the user gave for it, in natural language. For the
above example, the entity description introduced by the user (in Portuguese) is:
’é piloto de rali’ and ’toca um instrumento’. Then, the Prolog syntax for the
search is:

searchItemList([ condition( [], [ piloto_de_rali(X),
                                  toca(X,Y), instrumento(Y)] ) ],
                            L ).

Suppose the document D1 has a sentence like:

      “Marcus Grönholm tocou piano até aos 16 anos de idade.”
      (in English: “Marcus Grönholm played the piano until 16 years old.”)

  Our ontology in Senso tells us that piano is an instrument. Now, if there
is a sentence in our facts knowledge base saying that Marcus Grönholm is a
piloto de rali (rally driver), then document D1 will indeed be included in L,
the returned list of relevant documents. This feature is available to the user
through a high-level web form. The Prolog query searchItemList/2 is processed
in the system backend and the list L is then presented to the user in a readable
format.
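
  The reasoning in this example can be sketched as follows (SWI-Prolog
assumed). This is a simplified illustration and not the actual searchItemList/2
implementation: the flat fact/2 store, the predicate names holds/2, satisfied/2
and search_item_list/2, the document identifiers and the choice to return every
document that contributed a fact are all assumptions.

```prolog
% Hypothetical ontology excerpt: is_a(Term, Class).
is_a(piano, instrumento).
is_a(guitarra, instrumento).

% Hypothetical flattened facts extracted from documents: fact(DocId, Fact).
fact(d1, toca(marcus_gronholm, piano)).
fact(d2, piloto_de_rali(marcus_gronholm)).
fact(d3, toca(maria, guitarra)).

% holds(+Goal, -Source): Goal is supported by a document fact,
% or (for unary goals) by an isA relation in the ontology.
holds(Goal, Doc) :- fact(Doc, Goal).
holds(Goal, senso) :- Goal =.. [Class, X], is_a(X, Class).

% satisfied(+Goals, -Sources): every goal holds; Sources lists where from.
satisfied([], []).
satisfied([Goal|Goals], [Source|Sources]) :-
    holds(Goal, Source),
    satisfied(Goals, Sources).

% search_item_list(+Goals, -Items): documents that contributed at least one
% fact to some way of satisfying all the goals.
search_item_list(Goals, Items) :-
    findall(Doc,
            ( satisfied(Goals, Sources),
              member(Doc, Sources),
              Doc \== senso ),
            Found),
    sort(Found, Items).
```

  With these facts, the goal
search_item_list([piloto_de_rali(X), toca(X,Y), instrumento(Y)], L) combines
the rally driver fact from d2, the piano fact from d1 and the ontology relation
for piano, so L contains d1 and d2 but not d3.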


4.2     Trigger an action if “something happens”

  Sometimes we want to schedule some action in case a certain event occurs.
Our system allows users to set actions that will be triggered when a news
item whose content satisfies a defined condition is found. At the moment,
these actions are only e-mail messages sent to the user. As an example, suppose
you are interested in economic transactions where a company buys another company
(the Portuguese word is empresa, below), but you only want to receive the alert
message if the news is published in November. Then you can instruct the system
with the following:
      Action:

  mail -s "ALERT: a Company is buying another company" you@di.uevora.pt

      Condition Syntax:

  conditionList([ condition([ metadata(pubDate,after,'2006-11') ],
                            [ empresa(X), comprar(X,Y), empresa(Y)] ) ]).
  When a news item is fetched and processed, its internal representation is
checked against each user-defined condition list. Each condition that is
satisfied triggers the execution of the associated action.
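
  A minimal sketch of such a check is given below (SWI-Prolog assumed). The
simplified item/3 and fact/2 stores, the sample companies and the predicate
names are hypothetical. We also assume that after '2006-11' means "published
in that month or later", and the sketch only collects the matching items
instead of sending e-mail.

```prolog
% Hypothetical simplified store: item(Id, Category, date(Y, M, D)).
item(n1, 'Economia', date(2006, 11, 20)).
item(n2, 'Economia', date(2006, 10, 3)).

% Hypothetical flattened facts extracted from each item: fact(ItemId, Fact).
fact(n1, empresa(alfa)).
fact(n1, empresa(beta)).
fact(n1, comprar(alfa, beta)).
fact(n2, empresa(gama)).
fact(n2, empresa(delta)).
fact(n2, comprar(gama, delta)).

% year_month(+Atom, -Year-Month): parse an atom such as '2006-11'.
year_month(Atom, Y-M) :-
    split_string(Atom, "-", "", [YearStr, MonthStr]),
    number_string(Y, YearStr),
    number_string(M, MonthStr).

% meta_ok(+Id, +Constraint): assumed reading of 'after' as "that month or later".
meta_ok(Id, metadata(pubDate, after, Atom)) :-
    year_month(Atom, Limit),
    item(Id, _, date(Y, M, _)),
    Y-M @>= Limit.              % standard order compares year, then month

% all_hold(+Id, +Goals): every goal matches a fact of the same item.
all_hold(_, []).
all_hold(Id, [Goal|Goals]) :- fact(Id, Goal), all_hold(Id, Goals).

% triggers(+Id, +Condition): metadata constraints and content goals both hold.
triggers(Id, condition(Meta, Goals)) :-
    forall(member(M, Meta), meta_ok(Id, M)),
    all_hold(Id, Goals).

% alerts(+Condition, -Ids): items for which the action would be executed.
alerts(Condition, Ids) :-
    findall(Id, ( item(Id, _, _), once(triggers(Id, Condition)) ), Ids).
```

  For the condition given above, alerts/2 returns only n1: item n2 describes
a purchase as well, but it was published in October.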

4.3     Question-Answer
 The most interesting feature of the system is the Question-Answer feature.
It receives a query written in natural language, in Portuguese, and searches for
answers based on the collected news documents and the Senso ontology.

 The analysis of a natural language query is split into three steps: Syntax,
Semantics, and Pragmatics. Each query is processed using the same natural
language tools used for the news texts. First, the query is parsed using the
methodology described before. As we did with the text sentences, the syntactical
structure of each query is translated into a First-Order Logic expression (a
DRS). Note that, at present, we are not able to deal with general unrestricted
queries nor to translate them from a syntactical into a semantic structure. This
is a quite complex NLP problem, and we have decided to deal only with specific
subsets of the Portuguese language, namely with interrogatives about specific
domains.

 The search for an answer is done by a logic-programming based module that
performs a pragmatic interpretation of the query DRS over the full system knowl-
edge base (Senso ontology and facts from the news). The inference process is done
with the Prolog resolution algorithm, which tries to unify the referent from the
query with facts extracted from the documents and expressed in DRS structures.

 As an example, we could enter a query like:

      “Quem ganhou o Rali da Grécia?” (in English: “Who won the Greece Rally?”)

      The DRS for such a query would be:

  query(q281,
      [ q(X, 'quem' , ['M/F', 'S', 'quem' ],
           [ ] ),
       name(Y, 'Rali_da_Grécia' , ['M', 'S', 'Rali_da_Grécia' ],
           [ ] ),
       'ganhar'(X, Y,
           [ modif(verb,'ganhar', ['PS','3S','IND'] ) ] ),
       [ ]    ],
      [ ref(X), ref(Y) ] ).

    The result displayed by the system web interface is given in figure 7. Note
that in this case there is only one result. In general, the response may include
zero or more values considered valid. For each possible response there is
also a link to the news item where the system found the answer.
                          Fig. 7. Question-Answer result
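
 The way such a query can be matched against a stored sentence may be
sketched as follows (SWI-Prolog assumed). The sentence/3 fact is the one shown
in section 3, and vencer and ganhar are treated as synonyms. The predicates
synonym/2, same_verb/2 and who/4 are hypothetical and only illustrate the
idea; they are not the pragmatic interpretation module of the system.

```prolog
:- encoding(utf8).

% The stored sentence DRS from section 3.
sentence(publico1259478,
    [ name(A, 'Marcus_Grönholm', ['M/F', 'S', 'Marcus_Grönholm'], []),
      name(B, 'Rali_da_Grécia', ['M', 'S', 'Rali_da_Grécia'], []),
      'vencer'(A, B, [modif(verb, 'vencer', ['PS', '3S', 'IND'])]),
      [modif(temp, 'domingo', ['M', 'S'],
             [modif(pronDet, 'este', ['M', 'S'], [])])] ],
    [ref(A), ref(B)]).

% Hypothetical synonym table taken from the ontology.
synonym(ganhar, vencer).

same_verb(Verb, Verb).
same_verb(Verb1, Verb2) :- synonym(Verb1, Verb2).
same_verb(Verb1, Verb2) :- synonym(Verb2, Verb1).

% who(+Verb, +ObjectName, -SubjectName, -Item): answer "who Verb-ed Object?".
who(Verb, ObjectName, SubjectName, Item) :-
    sentence(Item, Conditions, _Refs),
    member(name(Object, ObjectName, _, _), Conditions),
    member(Condition, Conditions),
    Condition =.. [StoredVerb, Subject, Object0, _Modifiers],
    Object0 == Object,            % same referent, without aliasing variables
    same_verb(Verb, StoredVerb),
    member(name(Subject0, SubjectName, _, _), Conditions),
    Subject0 == Subject.
```

 The goal who(ganhar, 'Rali_da_Grécia', Who, Item) then binds Who to
'Marcus_Grönholm' and Item to publico1259478. The referents are compared with
==/2 rather than unified, because unifying two distinct referents would wrongly
merge them.
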


 The previous question was answered because the concept vencer is defined
as a synonym of ganhar. Note also that there was only one fact in the
knowledge base about someone winning the Greece Rally. If we had an earlier
document with a sentence identifying last year’s winner, then we would probably
get two answers. We could then check the best solution by reading the text. The
precise query for this year’s winner would be:

   “Quem venceu o Rali da Grécia no ano de 2006?” (in English: “Who won the
Greece Rally in the year 2006?”)

 That would introduce a temporal modifier on the query DRS expression to be
checked against the date of publication of the document. Another example of
query about time is:

   “Quando é que Marcus Grönholm venceu o Rali da Grécia?” (in English:
“When did Marcus Grönholm win the Greece Rally?”)

  Once again, the referent of the interrogative term quando is matched against
the temporal modifier in the sentence DRS: “este domingo” (in English: this
Sunday). This information is then related to the news item publication date and
we infer the desired date, 2006-06-04. Similar treatment is given to temporal
expressions like today, this month, last year and others.
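
  As an illustration of this date inference, the sketch below (SWI-Prolog
assumed) resolves a weekday name against a publication date. It assumes that,
in a news text, “este domingo” denotes the most recent Sunday on or before the
publication date; the predicates weekday_number/2 and resolve_weekday/3 are
hypothetical.

```prolog
:- encoding(utf8).

% Portuguese weekday names, numbered as in day_of_the_week/2 (Monday = 1).
weekday_number('segunda-feira', 1).
weekday_number('terça-feira', 2).
weekday_number('quarta-feira', 3).
weekday_number('quinta-feira', 4).
weekday_number('sexta-feira', 5).
weekday_number('sábado', 6).
weekday_number(domingo, 7).

% resolve_weekday(+Name, +PubDate, -Resolved): most recent day called Name
% on or before PubDate, where both dates are date(Y, M, D) terms.
resolve_weekday(Name, date(Y, M, D), Resolved) :-
    weekday_number(Name, Target),
    day_of_the_week(date(Y, M, D), PubDay),
    DaysBack is (PubDay - Target + 7) mod 7,
    date_time_stamp(date(Y, M, D, 0, 0, 0, 0, -, -), Stamp),
    ResolvedStamp is Stamp - DaysBack * 86400,
    stamp_date_time(ResolvedStamp, DateTime, 0),   % 0 = UTC offset
    date_time_value(date, DateTime, Resolved).
```

  For the news item above, published on Sunday 4 June 2006,
resolve_weekday(domingo, date(2006, 6, 4), Date) gives Date = date(2006, 6, 4).
The same expression in an item published on Wednesday 7 June 2006 would also
resolve to 4 June.
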
5    Related Work
  There are other initiatives related to semantic content search. Ontologies
are used in [8] for the specific domain of International Affairs. That system
also has a natural language interface, but works with RDQL⁷ instead of the
Prolog logic resolution environment we adopted. Another field of research in this
context is the automatic mapping from an existing large news archive, in XML
standards, to OWL ontologies [9].
The semantic archive features provided by [10] include means to annotate news
materials, as well as semantic search and browsing capabilities. That system runs
inside the newspaper environment and uses an ontology specialized for the
newspaper library, while the system we present works alone and outside the
newspaper, allowing the use of many independent news sources, and our ontology
is about common sense knowledge and not about a specific domain.
The Rich News system [11] helps to annotate radio and television news. It is
specialized for the English language and combines automatic speech recognition
with information extraction using the KIM⁸ knowledge management platform.


6    Conclusions and Future Work
 We proposed a methodology for an ontology supported news reader and
question-answer system. The system is based on the initial knowledge in the
Senso ontology and on the semantic content extracted from documents.
The system we described has some interesting features. However, it exists only
as a prototype, and it must now be improved to support a large-scale facts
knowledge base.

 The accuracy of the results found for automatic question answering also needs
more work. In the inference process, the system is affected by:
 – the quality of the Senso ontology
 – the precision of the semantic information taken from the text sentences.
The ontology should be manually revised and extended. The semantic analysis
can be improved by adding a tool to identify inter-sentence anaphoric
references. Finally, the system needs to be fully evaluated.

⁷ RDQL is a query language for RDF based on SquishQL. For more detail visit
  http://jena.sourceforge.net/tutorial/RDQL/
⁸ http://www.ontotext.com/kim/index.html

References
1. José Saias and Paulo Quaresma. A methodology to create legal ontologies in a logic
   programming information retrieval system, pages 185–200. V.R. Benjamins et al.
   (Eds): Law and the Semantic Web, LNAI 3369. Springer-Verlag, 2005.
   ISBN: 3-540-25063-8.
2. Paulo Quaresma, Luis Quintano, Irene Rodrigues, José Saias, and Pedro Salgueiro.
   The University of Évora approach to QA@CLEF-2004. In Carol Peters, editor,
   Question-Answering Track of the Cross Language Evaluation Forum 2004, Bath, UK,
   September 2004.
3. H. Liu and P. Singh. ConceptNet: A Practical Commonsense Reasoning Toolkit. BT
   Technology Journal, Volume 22, Number 4. Kluwer Academic Publishers, October
   2004.
4. Michael Smith, Chris Welty and Deborah McGuinness. OWL Web Ontology Language
   Guide. Technical report, 2004. http://www.w3.org/TR/owl-guide/
5. Eckhard Bick. The Parsing System “Palavras”. Automatic Grammatical Analysis
   of Portuguese in a Constraint Grammar Framework. Aarhus University Press, 2000.
6. C. Gasperin, R. Vieira, R. Goulart and P. Quaresma. Extracting XML syntactic
   chunks from Portuguese corpora. In TALN’2003 - Workshop on Natural Language
   Processing of Minority Languages and Small Languages of the Conference on
   “Traitement Automatique des Langues Naturelles”, France, June 2003.
7. H. Kamp and U. Reyle. From Discourse to Logic. Kluwer: Dordrecht, 1993.
8. J. Contreras, V. Richard Benjamins, M. Blázquez, S. Losada, R. Salla, J. Sevilla,
   D. Navarro, J. Casillas, A. Mompó, D. Patón, Óscar Corcho, P. Tena, I. Martos. A
   Semantic Portal for the International Affairs Sector. In Proceedings of the EKAW
   2004 - 14th International Conference on Knowledge Engineering and Knowledge
   Management, pages 203-215. Springer, 2004.
9. Roberto García, Ferran Perdrix, Rosa Gil. Ontological Infrastructure for a Semantic
   Newspaper. In Semantic Web Annotations for Multimedia Workshop (SWAMM
   2006), World Wide Web Conference, Edinburgh, UK, 2006.
10. Pablo Castells, F. Perdrix, E. Pulido, M. Rico, V. Richard Benjamins, J. Contreras,
   J. Lorés. Neptuno: Semantic Web Technologies for a Digital Newspaper Archive. In
   ESWS 2004 - 1st European Semantic Web Symposium, Greece, pages 445-458, 2004.
11. Mike Dowman, Valentin Tablan, Hamish Cunningham, Borislav Popov. Web-assisted
   annotation, semantic indexing and search of television and radio news. In
   Proceedings of the 14th International World Wide Web Conference, Chiba, Japan,
   2005. ISBN: 1-59593-046-9, pages 225-234, ACM Press.