<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>First experiments of using semantic knowledge learned by ASIUM for an information extraction task using INTEX</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thierry Poibeau</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Information Extraction (IE) is a technology dedicated to the extraction of structured information from texts. This technique is used to highlight relevant sequences in the original text or to fill pre-defined templates [1]. Below is an example of a story concerning a terrorist attack in Turkey together with the corresponding entry in the database filled by the IE system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Semantic knowledge acquisition from texts remains a hard task
even for limited domains. This knowledge is crucial in order to
improve natural language applications like information extraction.
Approaches mixing machine learning (ML) and natural language
processing (NLP) obtain good results in a short development time (we
can cite, among others, M. E. Califf [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], R. Basili [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], S. Buchholz
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], D. Hindle [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], R. J. Mooney [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and E. Riloff [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]).
      </p>
      <p>
        We present here ASIUM, which cooperatively learns semantic
knowledge from syntactically parsed texts without prior manual
processing. This knowledge consists of subcategorization frames of
verbs and an ontology of concepts for a specific domain, following
the "domain dependence" defined by G. Grefenstette [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>ASIUM is based on an unsupervised conceptual clustering method
and provides an ergonomic user interface to help the knowledge
acquisition process.</p>
      <p>In this part, we will show how ASIUM is able to learn good quality
knowledge in a reasonable time from parsed text, even if the syntactic
parsing of the texts is noisy.
Our aim is to learn subcategorization frames of verbs and an ontology
for a specific domain from texts. Existing knowledge bases
like EUROWORDNET or WORDNET are frequently over-general for
applications in specific domains. These ontologies, although very
complete, are not suitable for processing texts in technical languages.
On the one hand, they are not purpose-directed ontologies: they may store
up to seven meanings and syntactic roles for a word, thus increasing
the risk of semantic ambiguity. In a specific domain, the vocabulary
as well as its possible usage is reduced, which makes ontologies such
as WORDNET overly general.</p>
    </sec>
    <sec id="sec-2">
      <title>Our approach</title>
      <p>ASIUM learns subcategorization frames like &lt;to drop&gt;
&lt;object: Explosive&gt; &lt;in: Public_Place&gt; for the verb
to drop. Both couples object: Explosive and in:
Public_Place are subcategories; object is a syntactic role and in
is a preposition, while Explosive and Public_Place are concepts
used as restrictions of selection. More generally, ASIUM learns verb
frames like: &lt;verb&gt; &lt;prep.|syntactic role: concept*&gt;*</p>
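      <p>The frame notation above can be represented by a small data structure. The following Python sketch is illustrative only; the names and the representation are our assumptions, not ASIUM's actual implementation:</p>

```python
# Illustrative sketch (not ASIUM's code): a subcategorization frame
# maps each slot -- a preposition or a syntactic role -- to the set of
# concepts allowed as its restriction of selection.
from dataclasses import dataclass, field

@dataclass
class Frame:
    verb: str
    # slot name -> concepts allowed as restriction of selection
    slots: dict = field(default_factory=dict)

# The paper's example: "to drop" "object: Explosive" "in: Public_Place"
drop = Frame("to drop", {"object": {"Explosive"}, "in": {"Public_Place"}})
print(drop.verb, sorted(drop.slots))
```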
      <p>These frames are more general than the ones defined in the LFG
formalism because the subcategories are verb arguments (subject,
direct object or indirect object) and adjuncts. In our framework,
restrictions of selection can be filled by an exhaustive list of nouns (in
canonical form) or by one or more concepts defined in an ontology.
The ontology represents generality relations between concepts in the
form of a directed acyclic graph (DAG). For example, the ontology
could define car, train and motorcycle as motorized vehicle,
and motorized vehicle as both vehicle and pollutant. Our
method learns such an ontology and subcategorization frames in an
unsupervised manner from texts in natural language. The concepts
formed have to be labeled by an expert.</p>
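      <p>The generality relations described above can be sketched as a DAG, using the paper's own example (a concept may have several parents, as motorized vehicle has both vehicle and pollutant). This is a minimal illustration under our own naming assumptions, not ASIUM's data structures:</p>

```python
# Sketch of the example ontology as a DAG: each concept maps to its
# direct parents; a concept may have several parents.
parents = {
    "car": {"motorized_vehicle"},
    "train": {"motorized_vehicle"},
    "motorcycle": {"motorized_vehicle"},
    "motorized_vehicle": {"vehicle", "pollutant"},
}

def ancestors(concept):
    """All concepts more general than `concept` in the DAG."""
    seen = set()
    stack = [concept]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# car is a motorized_vehicle, hence both a vehicle and a pollutant
print(sorted(ancestors("car")))
```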
    </sec>
    <sec id="sec-3">
      <title>Knowledge acquisition method</title>
      <p>On the other hand, WORDNET may
lack some specific terminology of the application domain.</p>
      <p>
        Unlike approaches that extend or specialize general
ontologies for a specific domain, like that of R. Basili [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we learn an ontology
and verb frames from the corpus, reducing the risk of inconsistency.
      </p>
      <p>Our previous attempts to automatically revise subcategorization
frames and a subset of an ontology acquired by a domain expert
have failed. Revising the acquired knowledge with respect to the
training texts required deep restructuring of the knowledge that
incremental and even cooperative ML revision methods were not able
to handle. The main reason was that the expert built the ontology and
the subcategorization frames with too many a priori assumptions that were not
reflected in the texts. This experiment illustrates one of the limitations
of manual acquisition by domain experts without linguists.</p>
      <p>
        The first step of the acquisition process is to automatically extract
syntactic frames from texts. We use the syntactic parser SYLEX
developed by P. Constant [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In case of syntactic
ambiguities, SYLEX gives all the different interpretations and ASIUM
uses all of them. Experiments have shown that the ML method works
well with these ambiguities and that the acquisition of semantic knowledge
is not affected. This method avoids a very time-consuming manual
disambiguation step. These frames are the same as
subcategorization frames but with concepts replaced by nouns: &lt;verb&gt; &lt;prep.
| role: head noun&gt;*
      </p>
      <p>ASIUM only uses the head nouns of complements and their links with
verbs. Adjectives and empty nouns are not used. Our experiments
have shown that this information was enough to learn semantic
knowledge even from a noisy syntactic parsing.</p>
      <p>
        The learning method relies on the observation of syntactic
regularities in the context of words [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We assume here that head nouns
occurring with the same couple verb+preposition/syntactic
role represent a so-called basic class and have a semantic
similarity, in the same line as Grefenstette [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Peat [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and others; but our
method is based on a double regularity model: ASIUM gathers nouns
together as representing a concept only if they share at least two
different (verb+preposition/syntactic role) contexts, as
in Grishman [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Experiments show that it forms more reliable
concepts, thus requiring less involvement from the user. Our similarity
measure computes the overlap between two lists of nouns (details in
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]). As usual in conceptual clustering, the validity of the learned
concepts relies on the quality of the similarity measure between clusters,
which increases with the size of their intersection.
      </p>
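      <p>The paper's exact measure is given in [16]; as one plausible instantiation, the sketch below scores two basic classes by the overlap of their noun lists, normalized by the smaller class. The function name and normalization are our assumptions:</p>

```python
# Plausible overlap measure between two basic classes (noun lists),
# normalized by the smaller class; this is an assumption, not the
# exact measure of the paper.
def overlap(class1, class2):
    a, b = set(class1), set(class2)
    inter = a & b
    if not inter:
        return 0.0
    return len(inter) / min(len(a), len(b))

c1 = ["father", "neighbour", "friend"]  # subjects of "to travel"
c2 = ["friend", "colleague"]            # subjects of "to drive"
print(overlap(c1, c2))  # 1 shared noun / min(3, 2) -> 0.5
```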
      <p>
        Basic classes are then successively aggregated by a bottom-up
breadth-first conceptual clustering method to form the concepts of
the ontology level by level with an expert validation and/or labelling
at each level. Thus a given cluster cannot be used in a new
construction before it has been validated. For complexity reasons, the
number of clusters to be aggregated is restricted to two, but this does
not affect the relevance of the learned concepts [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Verb
subcategorization frames are learned in parallel so that each new concept
fills the corresponding restriction of selection, resulting in a
generalization of the initial syntactic frames which allows covering
examples that did not occur as such in the texts. Thus, the clustering
process does not only identify the lists of nouns occurring after the
same verb+preposition/function but also augments these lists by
induction.
[Figure: induction during aggregation. When the basic classes C1 (for
V1,P1/F1) and C2 (for V2,P2/F2) are aggregated into a new concept,
the nouns which only appear in basic class C1 (resp. C2) are now
allowed with the couple V2,P2/F2 (resp. V1,P1/F1). This results in a
generalization of the knowledge found in the corpus.]
      </p>
      <p>For example, starting with these syntactic frames:
&lt;to travel&gt;
&lt;subject:[father,neighbour,friend]&gt;
&lt;by: [car,train]&gt;
&lt;to drive&gt;
&lt;subject:[friend,colleague]&gt;
&lt;object:[car,motorcycle]&gt;</p>
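      <p>Under a simple set representation, the induction step on the two frames above can be sketched as follows; the function and variable names are our assumptions. Aggregating the "by" class of to travel with the "object" class of to drive (which share car) licenses motorcycle after by and train as an object of to drive:</p>

```python
# Sketch (assumed names, not ASIUM's code) of the induction step:
# aggregating the basic classes of two contexts lets each noun be
# used in the other context as well.
basic_classes = {
    ("to travel", "by"): {"car", "train"},
    ("to drive", "object"): {"car", "motorcycle"},
}

def aggregate(ctx1, ctx2, classes):
    """Union two basic classes into one candidate concept; the nouns
    seen in only one context become licensed in both (induction)."""
    concept = classes[ctx1] | classes[ctx2]
    induced = {ctx1: concept - classes[ctx1], ctx2: concept - classes[ctx2]}
    return concept, induced

concept, induced = aggregate(("to travel", "by"), ("to drive", "object"),
                             basic_classes)
print(sorted(concept))               # candidate concept, e.g. "Vehicle"
print(induced[("to travel", "by")])  # newly covered example(s)
```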
      <p>Experts have to control the link between the new concept and the
verb because the threshold alone, fixed by the expert, cannot
measure the over-generalization risk. This validation process is relatively
quick thanks to the ergonomic user interface. ASIUM provides the
expert with the list of newly covered examples in order to estimate the
generality of the proposed concept. Moreover, the expert can use
functionalities provided by ASIUM to divide the learned concept
into sub-concepts when a proposed concept is overly general for
the target task.
The expressions modeled via transducers are for the most part syntactic
structures (the set of expressions equivalent to the notion of
"bombing") integrating some of the semantic classes furnished by the
ASIUM system.</p>
      <p>The homogeneous semantic lists learned by the ASIUM system are
introduced into the INTEX vocabulary. At this level, manual work is
necessary to exploit the semantic classes from ASIUM. These classes
are refined (merging of scattered classes, deletion of irrelevant
elements, addition of new elements, etc.). About ten hours were
dedicated, after the acquisition process, to the refinement of the data
furnished by ASIUM. This knowledge is then considered as a
resource for INTEX and is exploited either as dictionaries or as
transducers, depending on the nature of the information. If it is general
information that is not domain specific, we prefer to use a dictionary,
which can be reused; otherwise, we use a transducer.</p>
      <p>A dictionary is a list of words or phrases, each one being
accompanied by a tag and a list of features. The first names dictionary and
the locations dictionary are generic reusable resources. Below is a
sample of the location names dictionary:</p>
      <p>
        Abidjan,N+Loc+City;
Afghanistan,N+Loc+Country;
Allemagne,N+Loc+Country;
Allemagne de l’Est,N+Loc+Country;
Allemagne de l’Ouest,N+Loc+Country;
      </p>
      <p>
        As for D. Hindle [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or F. Pereira [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], our method gathers nouns
according to syntactic regularities of the arguments and adjuncts of
verbs. We suppose that in specialized texts, verbs are also
characterized by their adjuncts. G. Grefenstette [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposes to learn
something close to our "basic classes". Our "double similarity model"
learns a concept by gathering two basic classes only if they have a
good similarity. This model limits the number of irrelevant
concepts produced. M. R. Brent [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] learns only five subcategorization
frames from untagged texts with an automatic method. S. Buchholz
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] learns subcategorization frames very close to ours but with a
supervised method which is very time-consuming for the expert. In
the same way, WOLFIE (A. C. Thompson [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) with CHILL (J. M.
      </p>
      <p>
        Zelle [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]) learns ”case-roles” and a thesaurus from texts
syntactically parsed by CHILL but fully semantically annotated by hand.
      </p>
      <p>Case roles differ from our subcategorization frames in that our
prepositions or grammatical functions are replaced by semantic roles
like agent or patient. Contrary to the ontology learned by ASIUM, the
selectional restrictions learned by WOLFIE are attribute-value lists.</p>
      <p>
        An unsupervised learning approach like ASIUM delays concept
labelling until after the learning process and thus considerably reduces the
time needed by the expert. After ASIUM learning, the semantic roles
can be labelled by assuming that a couple verb+prep./function
represents a specific semantic role. E. Riloff in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] learns five
concepts from texts. She uses lists of nouns representing general
concepts (seeds) and a co-occurrence method to augment these lists into
concepts. These augmented lists are checked by the expert, who only
retains the nouns representing the concept. We can regard the basic classes
of ASIUM as seeds that are increased by our induction process.
      </p>
      <p>The main advantage is that the number of concepts is not limited to
five, and we learn the subcategorization frames of verbs in parallel
without any additional time-consuming validation.</p>
      <p>These items, structured in a list, are convenient for the dictionary
format, and the semantic lists elaborated from ASIUM accurately
complete the coverage of the initial INTEX dictionaries.</p>
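      <p>Reading entries in the dictionary format shown above (word, then a tag and "+"-separated features, as in "Abidjan,N+Loc+City;") can be sketched as follows; the format details are our assumptions based on the sample, not INTEX's specification:</p>

```python
# Sketch of parsing INTEX-style dictionary lines such as
# "Abidjan,N+Loc+City;" -- the exact format is an assumption
# based on the sample shown in the text.
def parse_entry(line):
    word, info = line.rstrip(";").split(",", 1)
    tag, *features = info.split("+")
    return word, tag, features

print(parse_entry("Abidjan,N+Loc+City;"))
```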
      <p>The transducer format is essentially used for more complex or
more variable data where linguistic phenomena such as insertion or
optionality may interfere.</p>
      <p>The figure presents an example of a transducer (transducer
"Person") allowing the recognition of person names such as
Monsieur Jean Dupont, tagging the recognized sequence as
&lt;Person&gt; ... &lt;/Person&gt;. The transducer recognizes a sequence composed
of a trigger word (Monsieur, M., Madame, Mme), a first name (Jean) and a proper name
(Dupont). But we must keep in mind that most of these elements
can be optional (Monsieur Dupont or Jean Dupont are correct
sequences) and that Dupont can be a word that is not listed in any
dictionary (it will then be considered as an unknown word).</p>
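      <p>The "Person" transducer can be approximated by a regular expression with optional components, as in the following sketch; the word lists are illustrative assumptions, not the actual INTEX graph:</p>

```python
# Sketch of the "Person" transducer as a regular expression: an
# optional trigger word, an optional first name, and a capitalized
# word (possibly unknown to any dictionary). Word lists are
# illustrative assumptions.
import re

TRIGGERS = r"(?:Monsieur|Madame|M\.|Mme)"
FIRSTNAMES = r"(?:Jean|Marie|Pierre)"
PERSON = re.compile(rf"(?:{TRIGGERS}\s+)?(?:{FIRSTNAMES}\s+)?[A-Z][a-z]+")

for s in ["Monsieur Jean Dupont", "Monsieur Dupont", "Jean Dupont"]:
    m = PERSON.match(s)
    print(m.group(0) if m else None)
```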
      <p>At this level, one can find two types of transducers: some are
generic, such as the "Person" one, and some others are domain-specific
and can be filled with the semantic knowledge acquired by ASIUM.</p>
      <p>The next figure illustrates a transducer (transducer "Weapon")
recognizing explosion de Det N (explosion of Det N), tagged as
&lt;Weapon&gt; explosion de &lt;DET&gt; Explosive &lt;/Weapon&gt;, where
the nominal phrase Det N recognizes nominal phrases elaborated
from the semantic class bombing, in which the following words
appear: bombe (bomb), obus (shell), grenade, etc.</p>
      <p>The elaboration of such transducers requires some linguistic
expertise to ultimately obtain a system recognizing the relevant sequences
without too much noise. Since the architecture of the system uses
cascading transducers, it is important that each level has a good
quality in order to allow the following analysis level to operate on a
solid background.</p>
    </sec>
    <sec id="sec-4">
      <title>3 The Information Extraction system</title>
    </sec>
    <sec id="sec-5">
      <title>3.1 Linguistic resources modeling</title>
      <p>The Information Extraction system is based on the INTEX tool-box
developed by the LADL laboratory. INTEX allows a rapid and
interactive development of automata and transducers to analyze texts.
A linguistic automaton recognizes expressions in texts, whereas a
transducer associates specific tags with words in the texts (for
example, assigning a syntactic category to a word). Transducers are efficient,
expressive and sufficient for a local analysis of texts. We chose this
approach because it allows the rapid development of an IE system
for a given domain with a strictly local analysis limited to the
sentence area. Our aim is to develop a highly portable system, even if
this means using more precise analysis strategies afterwards.
To elaborate the linguistic resources, we first used the semantic classes
defined by the ASIUM system. Before the experiment, the corpus
was separated into two parts: the training set and the test
set. The linguistic resources are constantly tested on the training set
during development. This development approach makes it possible to
evaluate the performance and to detect possible errors in the grammar
(a grammar with too many or too few constraints would
bring silence or noise during the analysis).</p>
      <p>
        IE is now a widespread research domain. The American Message
Understanding Conferences (MUC) provided a formidable
framework for the development of research in this area ([
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]). The
conferences are held about every two years and generally bring
together about fifteen teams working on IE systems. The elaboration of
the linguistic resources is for the most part manual work, even if some
attempts have been made to build more portable systems.
      </p>
      <p>
        At least two French-speaking projects have been developed which
are somewhat comparable with MUC systems: the
European project ECRAN and the EXIBUM project from the
University of Montreal (Canada). ECRAN developed a generic and
multilingual system tested on different corpora (movie reviews, stories
from the economic area, etc.) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. EXIBUM is a bilingual system
(French and English) that aims at processing agency news about
terrorist events in Algeria [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        Several other Information Extraction systems were developed for
specific kinds of information (dates, location names, etc.). For
example, D. Maurel [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] developed a system highlighting dates by means
of automata and acceptability tables. More recently, C. Belleil [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]
presented a system highlighting French toponyms, and J. Sénellart
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] a system recognizing minister names in the French
newspaper Le Monde. These approaches generally require exhaustive
descriptions of the concerned domain.
      </p>
      <p>
        Recent American work in the area proposed an approach mixing
corpus exploration and knowledge acquisition to feed IE systems. A
first well-known experiment is the AutoSlog system of E. Riloff [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
which finds relevant syntactic structures in texts from keywords given
to the system by the end user. In the framework of ECRAN, a
similar attempt was made to generalize relevant syntactic structures
from a training corpus and a general dictionary [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. The experiment
we present is different in that the learning system is
unsupervised and furnishes the IE system designer with a large amount of
knowledge extracted from the texts.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4 Experiment</title>
      <p>In our experiment, we used a corpus of texts from the French
journal Le Monde. Texts indexed under "terrorist event" were
extracted and manually filtered in order to be sure that they
really contain a terrorist event description. This corpus is of the
same kind as the one used for the experiments in the ECRAN project, so
that we will be able to compare our results.</p>
      <p>The time spent on the definition of the linguistic resources with
INTEX is estimated at about 15 hours. This duration has to be
compared with the two weeks needed for the manual resource
development of the ECRAN project.</p>
      <p>One hundred texts were used as the training corpus and fifteen
different texts as the test corpus. Texts are first parsed
with our system, and then some heuristics fill the extraction
template:</p>
      <p>Due to the structure of articles in Le Monde, the first date is
always the date of the article;
we assume that the second date is the date of the terrorist event;
the first two occurrences of locations found are stored and
usually identify the location of the terrorist event quite well;
the first occurrence of a number of victims or injured persons
is stored. If a text speaks of more than one terrorist event, we
assume that only the first one is relevant; we have chosen short texts to
protect us from this problem, which is inherent to long texts;</p>
      <p>only the first weapon linked with the terrorism event is stored.</p>
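      <p>The heuristics above can be sketched as a small template-filling function; the helper names and the input lists (dates, locations, counts and weapons in order of occurrence) are our assumptions, not the actual system:</p>

```python
# Sketch (assumed names) of the template-filling heuristics: the first
# date is the article date, the second the event date; the first two
# locations, the first casualty count and the first weapon are kept.
def fill_template(dates, locations, casualty_counts, weapons):
    return {
        "article_date": dates[0] if dates else None,
        "event_date": dates[1] if len(dates) > 1 else None,
        "locations": locations[:2],
        "victims": casualty_counts[0] if casualty_counts else None,
        "weapon": weapons[0] if weapons else None,
    }

t = fill_template(["12/05/1998", "11/05/1998"], ["Ankara", "Turquie"],
                  [3], ["bombe"])
print(t["event_date"], t["locations"], t["weapon"])
```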
      <p>These heuristics are very succinct, and we will have to specialize
them to perform information extraction on longer or less specialized
texts. We used these simple heuristics to evaluate our system
and compare it with the ECRAN one. With these heuristics, we obtain
good results on our corpus; most of the extraction systems
evaluated in the American MUC conferences used this kind of heuristics
in order to work around parsing problems.</p>
      <p>Our results have been evaluated by two human experts who did
not follow our experiment. Our performance indicators were defined
as:</p>
      <sec id="sec-6-1">
        <title>Evaluation indicators</title>
        <p>OK (O) if the extracted information is correct; FALSE (F) if the
extracted information is incorrect or not filled; NONE (N) if no
information was extracted and no information had to be extracted;
FALSE for all the other cases.</p>
        <p>Using these indicators, we can compute two different values:</p>
        <p>PRECISION1 (P1), the ratio of OK answers over OK and FALSE answers,
without taking the NONE answers into account.</p>
        <p>PRECISION2 (P2), same as P1 but with the NONE answers.</p>
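      <p>One plausible reading of these two measures (the text does not give explicit formulas, so this is our assumption) is that P1 ignores the NONE answers while P2 counts them as correct:</p>

```python
# Assumed formulas for the two precision measures: P1 ignores NONE
# answers, P2 counts them as correct answers.
def precisions(ok, false, none):
    p1 = ok / (ok + false)
    p2 = (ok + none) / (ok + false + none)
    return p1, p2

p1, p2 = precisions(ok=8, false=2, none=5)
print(round(p1, 2), round(p2, 2))
```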
        <p>The next table summarizes the results for the different elements of the
template: date of the story, location, date, number of dead persons,
number of injured persons, weapon, and the average over all slots.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5.1 Semantic Knowledge quality</title>
      <p>Semantic knowledge acquisition tools like ASIUM are always very
difficult to evaluate. Measuring the quality of an ontology, or
evaluating an ontology against another one, is not easy and heavily
depends on the application. So we will only present here some indicators
to give an idea of the quality of the acquired knowledge.</p>
      <p>Concept quality depends on two different elements. The first one
is the distance, which computes the similarity between classes in order
to create relevant concepts and perform relevant inductions. As usual
in conceptual clustering, the distance is a parameter of the concept
quality and of the quantity of the expert's work.</p>
      <p>This first qualitative element is very hard to estimate. In our
application, 16 of the first 19 classes proposed by ASIUM were
accepted by the expert. 447 inductions were proposed by ASIUM,
and 73% of these inductions were judged relevant by the expert.</p>
      <p>The second element which affects concept quality is the level
of generality of a concept. When ASIUM proposes a new concept,
the expert has to decide, from the generality of the concept, whether it
should be split or not. This work is easy for an expert because he has
a very good knowledge of the final application.</p>
      <p>For example, if ASIUM proposes the ”Organization” concept,
the expert has to decide if it is relevant for the task to identify
subconcepts like ”Military org.” and ”Politic org.”.</p>
      <p>
        The generality level in the application highly depends on the
subtlety of the template to be filled by the information extraction system.
Our previous experiments on this domain and on the cooking recipes
domain have shown that this work is simple and that expert choices
really depend on the task. (More explanations on the suitability of
the concepts for the main task and their unsuitability for
another task are given in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].)
      </p>
      <p>The results of the ASIUM system speed up the
definition of the paradigmatic classes filling states in the INTEX
transducers, even if certain classes need to be manually completed. For
example, the ASIUM semantic classes allowed us to rapidly complete
the graph representing the set of weapons or of persons
implicated in terrorist events. ASIUM provided a class in which terms
such as "bomb", "grenade", "explosive" or "car"
could appear, considering that a booby-trapped car is a kind of
weapon, etc.</p>
      <p>The description language provided by INTEX is richer than that
of ECRAN. The time spent modeling the INTEX linguistic
transducers was longer than that spent for ECRAN, since the constraints
and the empty transitions in automata and transducers have to be
manually designed so that the noise is kept at a low level.</p>
      <p>Such an evaluation, in which we deliberately limited the time
spent on the development of linguistic resources, shows the
importance of having accurate resources adapted to the task. Moreover, the
inescapable incompleteness of the developed resources when facing new
texts shows that this kind of system has to integrate dynamic
acquisition processes to assist the incremental enrichment of resources
as time goes by.</p>
      <p>The experiment was intended to show the time needed for the
development of a sufficient set of resources, in order to obtain results
equivalent to those of the ECRAN project. That is the reason why
we emphasize an evaluation of the amount of time spent on the
task rather than of the improvement potential. That is also the
reason why we focused on a limited template that only necessitates a
surface analysis. This limitation could certainly be overcome if we used
the knowledge acquired by ASIUM more accurately. Thus, we plan
to take into account a deeper linguistic analysis (anaphora resolution,
partial information merging, etc.).</p>
      <p>An example of a parsing error is the sentence "... pour médecins
sans frontières (MSF-Belgique) ont été blessés." ("Two Italian doctors
working for médecins sans frontières (MSF-Belgique) have been
injured."), where the passive subject was not correctly parsed
by the system;</p>
      <p>The silence for the weapon slot is frequently due to
incompleteness of semantic dictionaries.</p>
    </sec>
    <sec id="sec-8">
      <title>5 Discussion</title>
      <p>In this section, we comment on some of the results of this
experiment. The results obtained prove the interest of coupling a semantic
knowledge acquisition tool with the IE system, but those results are
not precise enough to assess the quality of the semantic
knowledge acquisition tool on their own. We examine here some indicators which
allow us to judge the quality of the semantic knowledge learned, and
then we present some comments on the information extraction.</p>
    </sec>
    <sec id="sec-9">
      <title>5.2 Comments on the extraction process</title>
      <p>The results we obtained during this experiment compare
favorably with those we obtained on the same corpus with
the ECRAN system, which achieved 0.89 precision. Moreover, the
results of the new system were obtained after a
reduced development phase: about 40 hours for the learning phase with
ASIUM and about 15 hours to format the knowledge base as INTEX
resources. The following comments can be made on this experiment:</p>
      <p>Having a good knowledge of the corpus is indubitably an
advantage for the system designer. The fact that one of the authors had
previously done the same task for ECRAN sped up the
development process, given that the search for relevant syntactic structures
was facilitated;</p>
    </sec>
    <sec id="sec-10">
      <title>6 Future work</title>
      <p>Not all the knowledge learned by ASIUM is used in this experiment,
especially the subcategorization frames. We showed that a surface
analysis is sufficient when the templates to be filled are no more complex
than those of ECRAN. The good quality obtained in a very short time
supports this idea.</p>
      <p>Nevertheless, in order to extract more specific information from
texts (like the name of the organization that performed the terrorist
event, the political affiliation of the victims or the nationality of the
attacker), we think that the use of subcategorization frames could be
very useful. Writing syntactic rules to perform relevant information
extraction becomes very hard because of the multiplicity of the
syntactic variations used in texts.</p>
      <p>Our current work is to create a cooperative acquisition system
to learn resources using the subcategorization frames learned by
ASIUM. The expert will be able to express rules using the
complements of verbs independently of the syntax. Active and passive forms
will be given the same representation by the system. For example,
the two following sentences will be equivalent: L'action terroriste est
revendiquée par le Front populaire de libération de la Palestine (FPLP) (The
terrorist event was claimed by the FPLP) and le Front populaire de libération
de la Palestine (FPLP) revendique l'action terroriste (The FPLP claimed
responsibility for the terrorist event). One example of a rule for this kind of
sentence is:</p>
      <sec id="sec-10-1">
        <title>Conceptual rules</title>
        <p>If the verb is "to claim" and the object belongs to the class
"Attack", then the subject is the attacker.</p>
      <p>This kind of rule allows us to differentiate people claiming
terrorist events, as in Un groupe terroriste libanais revendique l'attentat
antisémite de Buenos-Aires (A Lebanese terrorist group claims responsibility for
the anti-Semitic attack in Buenos-Aires), from an organization claiming a
right, as in les fondamentalistes musulmans revendiquent le droit de vote (Muslim
fundamentalists are claiming voting rights).</p>
      <p>Semantic rules allow us to make fine distinctions to accurately fill
fine-grained slots. The two next rules fill the field "Missile" or
"Attacker" depending on the concept (Explosive or Person), learned
by ASIUM, used as the subject of the verb to kill.</p>
      <p>If verb = "to kill" and subject = Person, then the subject is the attacker.
If verb = "to kill" and subject = Explosive, then the subject is the
missile used.</p>
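      <p>Applying the two rules above can be sketched as follows; the concept lookup table and function names are illustrative assumptions, not part of the envisaged system:</p>

```python
# Sketch of the two conceptual rules above: the slot filled by the
# subject of "to kill" depends on the concept of the subject noun.
# The concept lookup table is an illustrative assumption.
CONCEPT = {"terroriste": "Person", "bombe": "Explosive"}

def fill_slot(verb, subject):
    if verb == "to kill":
        if CONCEPT.get(subject) == "Person":
            return ("attacker", subject)
        if CONCEPT.get(subject) == "Explosive":
            return ("missile", subject)
    return None

print(fill_slot("to kill", "bombe"))
print(fill_slot("to kill", "terroriste"))
```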
      <p>We can see that, even if syntactic parsers generate errors and
ambiguities, ASIUM can check texts using the ontology and the
subcategorization frames previously learned. The information extraction
process will then only process sentences that are consistent with the
subcategorization frames. This also allows the detection of some parsing errors.</p>
        <p>The system we are designing will perform two different parsing
steps. First, it will use the syntax and the concepts learned by ASIUM to
pre-fill the frame. Second, it will use our ”conceptual rules” to fill
the frame more specifically.</p>
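        <p>The consistency check and the two-step filling could be combined as
in the following sketch. All names and data here (LEARNED_FRAMES, the parse
dictionary, the example rule) are hypothetical illustrations of the process
described above, not the actual system.</p>
        <p>
```python
# Subcategorization frames learned by ASIUM: verb -> admissible
# (subject class, object class) pairs. Parses matching no learned
# frame are discarded as probable parsing errors.
LEARNED_FRAMES = {"claim": {("Organization", "Attack"), ("Person", "Right")}}

def consistent(parse):
    """Keep only parses compatible with a learned subcategorization frame."""
    pairs = LEARNED_FRAMES.get(parse["verb"], set())
    return (parse["subject_class"], parse["object_class"]) in pairs

def fill_template(parse):
    # Step 1: pre-fill the frame from syntax and ASIUM concepts.
    template = {"Event": parse["verb"],
                "Subject": parse["subject"],
                "Object": parse["object"]}
    # Step 2: refine with a conceptual rule (the example rule from the text:
    # verb "claim" + object in class Attack => the subject is the attacker).
    if parse["verb"] == "claim" and parse["object_class"] == "Attack":
        template["Attacker"] = parse["subject"]
    return template

parse = {"verb": "claim",
         "subject": "the FPLP", "subject_class": "Organization",
         "object": "the attack", "object_class": "Attack"}
if consistent(parse):
    print(fill_template(parse)["Attacker"])  # the FPLP
```
</p>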
      </sec>
    </sec>
    <sec id="sec-11">
      <title>7 Conclusion</title>
      <p>We have described in this article an experiment in which we
coupled an INTEX-based information extraction system with the machine
learning system ASIUM. Using the semantic knowledge learned by ASIUM
reduced the development time of the linguistic resources of the
information extraction system, while the quality of the results remains
the same as in the European ECRAN project.</p>
      <p>The aim of this experiment was to validate our approach. We will
now explore a tighter integration of the two systems and examine how
to make better use of the semantic knowledge learned by ASIUM in order
to improve the quality of our results.</p>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgment</title>
      <p>The research of Thierry Poibeau is partially funded by a Cifre grant
between the Laboratoire Central de Recherches of Thomson-CSF and the
Laboratoire d’Informatique de l’Université de Paris-Nord.</p>
      <p>The authors wish to thank M. Rodde (Cristal-Gresec) and A. Balvet
(Université Paris X) for their contribution to the analysis of the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Pazienza</surname>
          </string-name>
          , ed.,
          <article-title>Information extraction (a multidisciplinary approach to an emerging information technology)</article-title>
          .
          <source>Berlin: Springer Verlag (Lecture Notes in Computer Science)</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Riloff</surname>
          </string-name>
          , “
          <article-title>Automatically Generating Extraction Patterns from Untagged Text</article-title>
          ,” in
          <source>13th National Conference on Artificial Intelligence (AAAI'96)</source>
          , (Portland, Oregon),
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Poibeau</surname>
          </string-name>
          , “
          <article-title>Mixing technologies for Intelligent Information Extraction</article-title>
          ,”
          <source>in Proceedings of the workshop on Intelligent Information Integration, 16th International Joint Conference on Artificial Intelligence</source>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>121</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Califf</surname>
          </string-name>
          ,
          <article-title>Relational Learning Techniques for Natural Language Information Extraction</article-title>
          .
          <source>PhD thesis</source>
          , Department of Computer Sciences, University of Texas at Austin,
          <year>February 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Pazienza</surname>
          </string-name>
          , “
          <article-title>Lexical Acquisition for Information Extraction</article-title>
          ,” in
          <source>Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology (M. T. Pazienza, ed.), LNAI Tutorial, Springer</source>
          , (Frascati, Italy),
          <year>July 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buchholz</surname>
          </string-name>
          , “
          <article-title>Distinguishing Complements from Adjuncts using Memory-Based Learning</article-title>
          ,”
          <source>in Proceedings of the ESSLLI'98 workshop on Automated Acquisition of Syntax and Parsing (B. Keller, ed.)</source>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hindle</surname>
          </string-name>
          , “
          <article-title>Noun classification from predicate-argument structures</article-title>
          ,” in
          <source>Proceedings of the 28th annual meeting of the Association for Computational Linguistics</source>
          , ACL, (Pittsburgh, PA), pp.
          <fpage>1268</fpage>
          -
          <lpage>1275</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Thompson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Tang</surname>
          </string-name>
          , “
          <article-title>Learning to Parse Natural Language Database Queries into Logical Form</article-title>
          ,” in
          <source>Proceedings of the ML-97 Workshop on Automata Induction, Grammatical Inference, and Language Acquisition</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Riloff</surname>
          </string-name>
          , “
          <article-title>Automatically Constructing a Dictionary for Information Extraction Tasks</article-title>
          ,”
          <source>in Proceedings of the Eleventh National Conference on Artificial Intelligence</source>
          , pp.
          <fpage>811</fpage>
          -
          <lpage>816</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Riloff</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Shepherd</surname>
          </string-name>
          , “
          <article-title>A Corpus-Based Approach for Building Semantic Lexicons</article-title>
          ,”
          <source>in Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2)</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          , “
          <article-title>Sextant: exploring unexplored contexts for semantic extraction from syntactic analysis</article-title>
          ,” in
          <source>Proceedings of the 30th annual meeting of the Association for Computational Linguistics</source>
          , ACL, (Newark, Delaware, USA), pp.
          <fpage>324</fpage>
          -
          <lpage>326</lpage>
          ,
          <year>June 1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Constant</surname>
          </string-name>
          , “
          <article-title>Reducing the complexity of encoding rule-based grammars</article-title>
          ,”
          <year>December 1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <source>Mathematical Structures of Language</source>
          . New York: Wiley,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Peat</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Willett</surname>
          </string-name>
          , “
          <article-title>The limitations of term co-occurrence data for query expansion in document retrieval systems</article-title>
          ,”
          <source>Journal of the American Society for Information Science</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>378</fpage>
          -
          <lpage>383</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Grishman</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sterling</surname>
          </string-name>
          , “
          <article-title>Generalizing Automatically Generated Selectional Patterns</article-title>
          ,” in
          <source>Proceedings of COLING'94, 15th International Conference on Computational Linguistics</source>
          , (Kyoto, Japan),
          <year>August 1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Faure</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Nédellec</surname>
          </string-name>
          , “
          <article-title>A Corpus-based Conceptual Clustering Method for Verb Frames and Ontology Acquisition</article-title>
          ,” in
          <source>LREC workshop on Adapting lexical and corpus resources to sublanguages and applications (P. Velardi, ed.)</source>
          , (Granada, Spain), pp.
          <fpage>5</fpage>
          -
          <lpage>12</lpage>
          , May
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Lee</surname>
          </string-name>
          , “
          <article-title>Distributional Clustering of English Words</article-title>
          ,” in
          <source>Proceedings of the 31st annual meeting of the Association for Computational Linguistics</source>
          , ACL, pp.
          <fpage>183</fpage>
          -
          <lpage>190</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Brent</surname>
          </string-name>
          , “
          <article-title>Automatic acquisition of subcategorization frames from untagged text</article-title>
          ,” in
          <source>Proceedings of the 29th annual meeting of the Association for Computational Linguistics</source>
          , ACL, pp.
          <fpage>209</fpage>
          -
          <lpage>214</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Thompson</surname>
          </string-name>
          , “
          <article-title>Acquisition of a Lexicon from Semantic Representations of Sentences</article-title>
          ,” in
          <source>33rd Annual Meeting of the Association for Computational Linguistics (ACL-95)</source>
          , (Boston, MA), July, pp.
          <fpage>335</fpage>
          -
          <lpage>337</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zelle</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          , “
          <article-title>Learning semantic grammars with constructive inductive logic programming</article-title>
          ,” in
          <source>Proceedings of the Eleventh National Conference on Artificial Intelligence</source>
          , pp.
          <fpage>817</fpage>
          -
          <lpage>822</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          “MUC-6,” in
          <source>Proceedings of the sixth Message Understanding Conference (MUC 6)</source>
          , (San Francisco), Morgan Kaufmann,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          “MUC-7,” in
          <source>Proceedings of the seventh Message Understanding Conference</source>
          , (San Francisco), Morgan Kaufmann,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Poibeau</surname>
          </string-name>
          , “
          <article-title>Extraction d'information : adaptation lexicale et calcul dynamique du sens</article-title>
          ,” in
          <source>Actes des rencontres internationales sur l'extraction, le filtrage et le résumé automatique (Rifra'98)</source>
          , (Sfax, Tunisia), pp.
          <fpage>141</fpage>
          -
          <lpage>153</lpage>
          ,
          <year>November 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kosseim</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lapalme</surname>
          </string-name>
          , “
          <article-title>EXIBUM : un système expérimental d'extraction bilingue</article-title>
          ,” in
          <source>Actes des rencontres internationales sur l'extraction, le filtrage et le résumé automatique (Rifra'98)</source>
          , (Sfax, Tunisia), pp.
          , Tunisia), pp.
          <fpage>129</fpage>
          -
          <lpage>140</lpage>
          ,
          <year>November 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Maurel</surname>
          </string-name>
          ,
          <article-title>Reconnaissance des séquences de mots par automate, adverbes de date du français</article-title>
          .
          <source>PhD thesis</source>
          , Université Paris 7,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Belleil</surname>
          </string-name>
          ,
          <article-title>Reconnaissance, typage et traitement des coréférences des toponymes français et de leurs gentilés par dictionnaire électronique relationnel</article-title>
          .
          <source>PhD thesis</source>
          , Université de Nantes,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sénellart</surname>
          </string-name>
          , “
          <article-title>Locating noun phrases with finite state transducers</article-title>
          ,” in
          <source>17th International Conference on Computational Linguistics (COLING'98)</source>
          , (Montréal), pp.
          <fpage>1212</fpage>
          -
          <lpage>1217</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Catizone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Pazienza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Velardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vindigni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wilks</surname>
          </string-name>
          , “
          <article-title>An empirical approach to Lexical Tuning</article-title>
          ,” in
          <source>Workshop on Adapting lexical and corpus resources to sublanguages and applications</source>
          , (Granada, Spain), May
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Faure</surname>
          </string-name>
          , “
          <article-title>Connaissances sémantiques acquises par Asium : exemples d'utilisations</article-title>
          ,” in
          <source>Journée du Réseau de sciences cognitives d'Ile-de-France</source>
          (RISC, ed.), p.
          <fpage>12</fpage>
          ,
          <year>October 1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>