=Paper=
{{Paper
|id=Vol-31/paper-2
|storemode=property
|title=First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX
|pdfUrl=https://ceur-ws.org/Vol-31/DFaure_6.pdf
|volume=Vol-31
|dblpUrl=https://dblp.org/rec/conf/ecai/FaureP00
}}
==First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX==
David Faure (L.R.I., UMR 86-23 du CNRS, Université Paris Sud, F-91405 Orsay Cedex, David.Faure@lri.fr) and Thierry Poibeau (Thomson-CSF, Laboratoire Central de Recherches, Domaine de Corbeville, F-91404 Orsay, Thierry.Poibeau@lcr.thomson-csf.com)
Abstract. Our aim in this article is to show how semantic knowledge learned for a specific domain can help the creation of a powerful information extraction system. We describe a first experiment of coupling an information extraction system based on INTEX with the machine learning system ASIUM. We will show how the semantic knowledge learned by ASIUM helps the user to write an information extraction system more efficiently, by reducing the time spent on the development of resources. Our approach is compared to the European ECRAN project, which aims at the same result, with respect to development time and performance.
1 Introduction
Information Extraction (IE) is a technology dedicated to the extraction of structured information from texts. This technique is used to highlight relevant sequences in the original text or to fill pre-defined templates [1]. Below is the example of a story concerning a terrorist attack in Turkey, together with the corresponding entry in the database filled by the IE system.

940815LM347810 Le Monde - 15 août 1994, page 6
TURQUIE: neuf blessés dans un attentat à la bombe.
Neuf personnes, dont trois touristes étrangers, ont été blessées par l'explosion d'une bombe vendredi 12 août dans une gare routière de la partie européenne d'Istanbul. (...) - (AFP.)

940815LM347810 Le Monde - August 15, 1994, page 6
TURKEY: nine persons were injured during a bomb attack.
Nine persons, three of them being foreign tourists, were injured by a bomb explosion on Friday August 12, at a bus station in the European part of Istanbul.

Date of the story: 15 août 1994
Loc.: TURQUIE, Istanbul
Date: Vendredi 12 août
Nb dead persons: -
Nb persons injured: neuf (nine)
Weapon: bombe (bomb)

Even if IE now seems to be a relatively mature technology, it suffers from a number of yet unsolved problems that limit its dissemination through industrial applications. Among these limitations, we can mention the fact that systems are not really portable from one domain to another. Even if a system uses some generic components, most of its knowledge resources are domain-dependent. Moving from one domain to another means re-developing some resources, which is a tedious and time-consuming task (for example, Riloff [2] mentions a 1500-hour development effort).

This fact was observed by one of the authors during the elaboration of a previous prototype with the same aim, in the framework of the European ECRAN project [3]. That system required resources manually defined from the reading of a huge amount of texts.

In order to decrease the time spent on the elaboration of resources for the IE system, we suggest using ASIUM, which learns semantic knowledge from texts. This knowledge is then used for the elaboration of the IE system. We also aim at reaching a better coverage thanks to the generalization process implemented in ASIUM.

We will first present the ASIUM system, which learns the semantic knowledge needed for the elaboration of an IE system similar to that of the ECRAN project. We will show to what extent it is possible to speed up the elaboration of resources without any decrease in the quality of the system. We will finish with some comments on this experiment and show how domain-specific knowledge acquired by ASIUM, such as the subcategorization frames of verbs, could be used to extract more precise information from texts.

2 Semantic Knowledge Acquisition

Semantic knowledge acquisition from texts remains a hard task, even for limited domains. This knowledge is crucial in order to improve natural language applications like information extraction. Approaches mixing machine learning (ML) and natural language processing (NLP) obtain good results in a short development time (we can cite, among others, M. E. Califf [4], R. Basili [5], S. Buchholz [6], D. Hindle [7], R. J. Mooney [8] and E. Riloff [9], [2], [10]).

We present here ASIUM, which cooperatively learns semantic knowledge from syntactically parsed texts without any previous manual processing. This knowledge consists of subcategorization frames of verbs and an ontology of concepts for a specific domain, following the "domain dependence" described by G. Grefenstette [11] ("a semantic structure developed for one domain would not be applicable to another").

ASIUM is based on an unsupervised conceptual clustering method and provides an ergonomic user interface (http://www.lri.fr/Francais/Recherche/ia/sujets/asium.html) to support the knowledge acquisition process.

In this part, we will show how ASIUM is able to learn good quality knowledge in a reasonable time from parsed text, even if the syntactic parsing of the texts is noisy.

2.1 Our approach

Our aim is to learn subcategorization frames of verbs and an ontology for a specific domain, from texts. Existing knowledge bases like EUROWORDNET or WORDNET are frequently over-general for applications in specific domains. These ontologies, although very complete, are not suitable for processing texts in technical languages. On the one hand, they are not purpose-directed ontologies: they may store up to seven meanings and syntactic roles for a word, thus increasing the risk of semantic ambiguity. In a specific domain, the vocabulary as well as its possible usage is reduced, which makes ontologies such
as WORDNET overly general. On the other hand, WORDNET may lack some of the specific terminology of the application domain.

Contrary to approaches that augment or specialize general ontologies for a specific domain, like that of R. Basili [5], we learn an ontology and verb frames from the corpus itself, reducing the risk of inconsistency. Our previous attempts to automatically revise subcategorization frames and a subset of an ontology acquired by a domain expert have failed. Revising the acquired knowledge with respect to the training texts required such a deep restructuring of the knowledge that incremental and even cooperative ML revision methods were not able to handle it. The main reason was that the expert built the ontology and the subcategorization frames with too many a priori assumptions that were not reflected in the texts. This experiment illustrates one of the limitations of manual acquisition by domain experts without linguists.

2.2 Learned knowledge

ASIUM learns subcategorization frames such as <to drop | object: Explosive | in: Public Place> for the verb to drop. Both couples, object: Explosive and in: Public Place, are subcategories; object is a syntactic role and in is a preposition, while Explosive and Public Place are concepts used as restrictions of selection. More generally, ASIUM learns verb frames of the form <verb> <| role: concept>*.

These frames are more general than the ones defined in the LFG (Lexical Functional Grammar) formalism because the subcategories cover both verb arguments (subject, direct object or indirect object) and adjuncts. In our framework, restrictions of selection can be filled by an exhaustive list of nouns (in canonical form) or by one or more concepts defined in an ontology. The ontology represents generality relations between concepts in the form of a directed acyclic graph (DAG). For example, the ontology could define car, train and motorcycle as motorized vehicle, and motorized vehicle as both vehicle and pollutant. Our method learns such an ontology and the subcategorization frames in an unsupervised manner (ASIUM is called unsupervised because no concept examples are provided as input) from texts in natural language. The concepts formed have to be labeled by an expert.
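To make the two kinds of learned knowledge concrete, here is a minimal sketch (Python, not the ASIUM implementation; the Frame and Ontology structures and their names are illustrative assumptions) of a verb frame whose restrictions of selection are concepts, and of an ontology stored as a DAG of generality relations, using the car/train/motorcycle example above.

```python
# Minimal sketch (not the ASIUM implementation) of the two kinds of learned
# knowledge: verb subcategorization frames whose restrictions of selection are
# concepts, and an ontology stored as a directed acyclic graph.
from dataclasses import dataclass, field

@dataclass
class Frame:
    verb: str
    # one (role or preposition) -> concept restriction per slot,
    # e.g. {"object": "Explosive", "in": "Public Place"}
    restrictions: dict

@dataclass
class Ontology:
    # child concept -> set of more general concepts (a DAG, not a tree)
    parents: dict = field(default_factory=dict)

    def add(self, concept, *more_general):
        self.parents.setdefault(concept, set()).update(more_general)

    def generalizations(self, concept):
        """All concepts reachable upward from `concept` in the DAG."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

# The example from the text: car/train/motorcycle -> motorized vehicle,
# and motorized vehicle -> both vehicle and pollutant.
onto = Ontology()
for noun in ("car", "train", "motorcycle"):
    onto.add(noun, "motorized vehicle")
onto.add("motorized vehicle", "vehicle", "pollutant")

drop = Frame("to drop", {"object": "Explosive", "in": "Public Place"})
print(drop.verb, drop.restrictions)
assert onto.generalizations("car") == {"motorized vehicle", "vehicle", "pollutant"}
```

Storing the ontology as child-to-parents sets keeps multiple generalizations (vehicle and pollutant at the same time) cheap to query, which matters once frames are matched against concepts rather than literal nouns.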
2.3 Knowledge acquisition method

The first step of the acquisition process is to automatically extract syntactic frames from the texts. We use the syntactic parser SYLEX developed by P. Constant [12]. In case of syntactic ambiguities, SYLEX produces all the different interpretations and ASIUM uses all of them. Experiments have shown that the ML method works well with these ambiguities and that the acquisition of semantic knowledge is not affected. This approach avoids a very time-consuming manual disambiguation step. These syntactic frames have the same form as the subcategorization frames, but with the concepts replaced by nouns: <verb> <| role: head noun>*.

ASIUM only uses the head nouns of complements and their links with verbs. Adjectives and empty nouns are not used. Our experiments have shown that this information is enough to learn semantic knowledge, even from a noisy syntactic parsing.

The learning method relies on the observation of syntactic regularities in the context of words [13]. We assume here that head nouns occurring with the same verb+preposition/syntactic role couple represent a so-called basic class and have a semantic similarity, in the same line as Grefenstette [11], Peat [14] or others, but our method is based on a double regularity model: ASIUM gathers nouns together as representing a concept only if they share at least two different verb+preposition/syntactic role contexts, as in Grishman [15]. Experiments show that this forms more reliable concepts, thus requiring less involvement from the user. Our similarity measure computes the overlap between two lists of nouns: Sim(C1, C2) = 1 for lists with the same nouns and Sim(C1, C2) = 0 for lists without any common nouns (details in [16]). As usual in conceptual clustering, the validity of the learned concepts relies on the quality of the similarity measure between clusters, which increases with the size of their intersection.

Basic classes are then successively aggregated by a bottom-up, breadth-first conceptual clustering method to form the concepts of the ontology level by level, with an expert validation and/or labelling at each level. Thus a given cluster cannot be used in a new construction before it has been validated. For complexity reasons, the number of clusters to be aggregated is restricted to two, but this does not affect the relevance of the learned concepts [16]. Verb subcategorization frames are learned in parallel, so that each new concept fills the corresponding restriction of selection, resulting in a generalization of the initial syntactic frames which covers examples that did not occur as such in the texts. Thus, the clustering process does not only identify the lists of nouns occurring after the same verb+preposition/function couple, but also augments these lists by induction.

[Figure: two basic classes with a common part of nouns are merged into a learned concept; the induced examples extend the noun list of each couple.]

The aggregation of two basic classes C1 and C2, found after two different verb+prep./function couples (V1,P1/F1 and V2,P2/F2), creates a new concept allowed after both V1,P1/F1 and V2,P2/F2. Thus, nouns which only appear in the basic class C1 (resp. C2) will now be allowed with the couple V2,P2/F2 (resp. V1,P1/F1). This results in a generalization of the knowledge found in the corpus, as presented in the figure.

For example, starting from syntactic frames in which the nouns below occur after the same verb+prep./function couples, ASIUM will learn two concepts,

Human: father; neighbor; friend; colleague.
Motorized Vehicle: car; train; motorcycle.

and the two corresponding subcategorization frames.

Experts have to control the link between the new concept and the verb, because the threshold alone, fixed by the expert, cannot measure the over-generalization risk. This validation process is relatively quick thanks to the ergonomic user interface. ASIUM provides the expert with the list of newly covered examples in order to estimate the generality of the proposed concept. Moreover, the expert can use functionalities provided by ASIUM to divide the learned concept into sub-concepts in case a proposed concept is overly general for the target task.
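The clustering step can be summarized with the following sketch. It is illustrative only: the exact similarity measure is the one detailed in [16], replaced here by a simple overlap coefficient that, like the measure described above, is 1 for identical noun lists and 0 for disjoint ones, and the threshold value is an arbitrary assumption.

```python
# Sketch of the double regularity idea (illustrative, not the ASIUM code).
# A basic class = the head nouns observed after one verb+preposition/role
# couple; two basic classes whose noun lists overlap enough are aggregated
# into a concept, and each couple inherits the other's nouns (induction).
from collections import defaultdict

def basic_classes(observations):
    """observations: iterable of (verb, prep_or_role, head_noun) triples."""
    classes = defaultdict(set)
    for verb, slot, noun in observations:
        classes[(verb, slot)].add(noun)
    return classes

def similarity(c1, c2):
    """Overlap in [0, 1]: 1 for identical noun lists, 0 for disjoint ones
    (the measure actually used by ASIUM is detailed in [16])."""
    union = c1 | c2
    return len(c1 & c2) / len(union) if union else 0.0

def aggregate(classes, threshold=0.3):
    """One clustering level: propose concepts from pairs of basic classes;
    the expert still has to validate and label each proposal."""
    proposals = []
    couples = list(classes)
    for i, a in enumerate(couples):
        for b in couples[i + 1:]:
            if similarity(classes[a], classes[b]) >= threshold:
                # Induction: nouns seen with only one couple become
                # allowed with the other one as well.
                proposals.append((a, b, classes[a] | classes[b]))
    return proposals

obs = [("to drop", "object", "bomb"), ("to explode", "subject", "bomb"),
       ("to drop", "object", "grenade"), ("to explode", "subject", "shell")]
for a, b, concept in aggregate(basic_classes(obs)):
    print(a, b, sorted(concept))
```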
2.4 Related work in semantic knowledge acquisition

As for D. Hindle [7] or F. Pereira [17], our method gathers nouns according to syntactic regularities of the arguments and adjuncts of verbs. We assume that in specialized texts, verbs are also characterized by their adjuncts. G. Grefenstette [11] proposes to learn something close to our "basic classes". Our "double similarity model" learns a concept by gathering two basic classes only if they have a good similarity. This model limits the number of non-relevant produced concepts. M. R. Brent [18] learns only five subcategorization frames from untagged texts with an automatic method. S. Buchholz [6] learns subcategorization frames very close to ours, but with a supervised method which is very time-consuming for the expert. In the same way, WOLFIE (C. A. Thompson [19]) with CHILL (J. M. Zelle [20]) learns "case-roles" and a thesaurus from texts syntactically parsed by CHILL but fully semantically annotated by hand. These case roles differ from our subcategorization frames because our prepositions or grammatical functions are replaced by semantic roles like agent or patient. Contrary to the ontology learned by ASIUM, the selectional restrictions learned by WOLFIE are attribute-value lists. An unsupervised learning approach like ASIUM delays concept labelling until after the learning process and so considerably reduces the time needed by the expert. After ASIUM learning, the semantic roles can be labelled by assuming that a verb+prep./function couple represents a specific semantic role. E. Riloff in [10] learns five concepts from texts. She uses lists of nouns representing general concepts (seeds) and a co-occurrence method to augment these lists into concepts. These augmented lists are checked by the expert, who only retains the nouns representing the concept. We can consider the basic classes of ASIUM as seeds that will be increased by our induction process. The main advantage is that the number of concepts is not limited to five, and the subcategorization frames of verbs are learned in parallel without additional time-consuming validation.

3 The Information Extraction system

The Information Extraction system is based on the INTEX tool-box, developed by the LADL laboratory (Laboratoire d'Automatique Documentaire et Linguistique de l'Université de Paris 7). INTEX allows a rapid and interactive development of automata and transducers to analyze texts. A linguistic automaton recognizes expressions in texts, whereas a transducer associates specific tags with words in the texts (for example, assigning a syntactic category to a word). Transducers are efficient, expressive and sufficient for a local analysis of texts. We chose this approach because it allows the rapid development of an IE system for a given domain with a strictly local analysis limited to the sentence area. Our aim is to develop a highly portable system, even if this means using more precise analysis strategies afterwards.

3.1 Linguistic resources modeling

To elaborate the linguistic resources, we first used the semantic classes defined by the ASIUM system. Before the experiment, the corpus was separated in two different parts: the training set and the test set. The linguistic resources are constantly tested on the training set during the development. This development approach allows us to evaluate the performance and to detect possible errors in the grammar (a grammar with too many or not enough constraints would bring silence or noise during the analysis). The expressions modeled via transducers are for the most part syntactic structures (the set of expressions equivalent to the notion of "bombing") integrating some of the semantic classes furnished by the ASIUM system (for example, the list of weapons which could be used in a bomb attack).

The homogeneous semantic lists learned by the ASIUM system are introduced into the INTEX vocabulary. At this level, a manual work is necessary to exploit the semantic classes from ASIUM. These classes are refined (merging of scattered classes, deletion of irrelevant elements, addition of new elements, etc.). About ten hours have been dedicated, after the acquisition process, to the refinement of the data furnished by ASIUM. This knowledge is then considered as a resource for INTEX and is exploited either as dictionaries or as transducers, depending on the nature of the information. If it is general information that is not domain-specific, we prefer to use a dictionary, which can be reused; otherwise, we use a transducer.

A dictionary is a list of words or phrases, each one being accompanied by a tag and a list of features (for example, in this case: Loc, City, Country, ...). The first names dictionary or the locations dictionary are generic reusable resources. Below is a sample of the location names dictionary; each line begins with a term, followed by an indication of its syntactic category (N for Noun) and semantic features (Loc to indicate a location, Country to indicate a country, etc.):

Abidjan,N+Loc+City;
Afghanistan,N+Loc+Country;
Allemagne,N+Loc+Country;
Allemagne de l'Ouest,N+Loc+Country;
Allemagne de l'Est,N+Loc+Country; ...

These items, structured in a list, are well suited to the dictionary format, and the semantic lists elaborated from ASIUM accurately complete the coverage of the initial dictionaries from INTEX.
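For illustration, the sketch below parses dictionary lines of the form shown in the sample into (term, category, features) triples, e.g. in order to merge an ASIUM noun list into an existing word list. It is not the INTEX dictionary compiler, and any parsing convention beyond the sample lines above is an assumption.

```python
# Sketch: reading dictionary lines of the form "term,N+Loc+City;" into
# (term, category, features) triples, e.g. to merge ASIUM noun lists into
# an existing word list. Illustrative only; INTEX has its own dictionary
# compiler and this is not it.
def parse_dictionary(lines):
    entries = []
    for line in lines:
        line = line.strip().rstrip(";")
        if not line:
            continue
        term, _, tags = line.partition(",")
        category, *features = tags.split("+")
        entries.append((term, category, features))
    return entries

sample = [
    "Abidjan,N+Loc+City;",
    "Afghanistan,N+Loc+Country;",
    "Allemagne de l'Ouest,N+Loc+Country;",
]
print(parse_dictionary(sample))
# [('Abidjan', 'N', ['Loc', 'City']), ...]
```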
The transducer format is essentially used for more complex or more variable data, where linguistic phenomena such as insertion or optionality may interfere.

[Figure: transducer "Person", combining a trigger word (monsieur, madame, M., Mme), an optional first name (FirstName), an optional particle (de) and a last name (LastName) or unknown word (UnknownWord).]

The figure presents an example of a transducer allowing the recognition of person names such as Monsieur Jean Dupont. The transducer recognizes a sequence composed of a trigger word (Monsieur), a first name (Jean) and a proper name (Dupont). But we must keep in mind that most of these elements can be optional (Monsieur Dupont or Jean Dupont are correct sequences) and that Dupont can be a word that is not listed in any dictionary (it will then be considered as an unknown word).
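As an approximation of what the "Person" transducer covers, the following regex sketch accepts an optional trigger word, an optional first name and a capitalized token that may be unknown to every dictionary. It is only a rough stand-in: INTEX compiles an actual finite-state transducer, and the trigger list and the first-name dictionary used here are assumptions.

```python
# Rough regex approximation of the "Person" transducer: an optional trigger
# word (monsieur, madame, M., Mme), an optional first name and a capitalized
# token that may be unknown to every dictionary. INTEX compiles a real
# finite-state transducer; this sketch only mimics its coverage.
import re

FIRST_NAMES = {"Jean", "Marie", "Pierre"}   # stand-in for the first-name dictionary
TRIGGER = r"(?:[Mm]onsieur|[Mm]adame|M\.|Mme)"

PERSON = re.compile(
    rf"(?:{TRIGGER}\s+)?"                    # optional trigger word
    r"(?:(?P<first>[A-Z][a-zà-ÿ]+)\s+)?"     # optional first name
    r"(?P<last>(?:de\s+)?[A-Z][a-zà-ÿ]+)"    # last name, possibly with a particle
)

def find_persons(text):
    for m in PERSON.finditer(text):
        # keep the match only if it is anchored by a trigger word
        # or by a first name listed in the dictionary
        starts_with_trigger = re.match(TRIGGER, m.group(0)) is not None
        if starts_with_trigger or m.group("first") in FIRST_NAMES:
            yield m.group(0)

print(list(find_persons("Monsieur Jean Dupont a rencontré Mme Durand.")))
# ['Monsieur Jean Dupont', 'Mme Durand']
```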
At this level, one can find two types of transducers: some are generic, such as the "Person" one, and some others are domain-specific and can be filled with the semantic knowledge acquired by ASIUM.

[Figure: transducer "Weapon", recognizing explosion de Det N, where N is taken from the class Explosive.]

The next figure illustrates a transducer recognizing explosion de Det N (explosion of Det N), where the nominal phrase Det N recognizes nominal phrases elaborated from the semantic class bombing, in which words such as bombe (bomb), obus (shell), grenade, etc. appear.
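A comparable sketch for the "Weapon" pattern restricts the noun of explosion de Det N to the semantic class provided by ASIUM (bombe, obus, grenade); the determiner list is an assumption, and INTEX would express this as a transducer rather than as a regular expression.

```python
# Sketch of the "Weapon" pattern: "explosion de <Det> <N>" where the noun is
# restricted to the semantic class learned by ASIUM (bombe, obus, grenade, ...).
# The determiner list is illustrative; INTEX would encode this as a transducer.
import re

WEAPONS = {"bombe", "obus", "grenade"}      # class provided by ASIUM
DETS = {"la", "le", "une", "un", "cette", "ce"}

PATTERN = re.compile(r"explosion\s+d(?:e\s+|')(\w+)\s+(\w+)", re.IGNORECASE)

def find_weapons(text):
    for det, noun in PATTERN.findall(text):
        if det.lower() in DETS and noun.lower() in WEAPONS:
            yield noun

print(list(find_weapons("Neuf personnes ont été blessées par l'explosion d'une bombe.")))
# ['bombe']
```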
The elaboration of such transducers requires some linguistic expertise in order to obtain, in fine, a system recognizing the relevant sequences without too much noise. Since the architecture of the system uses cascading transducers, it is important that each level has a good quality, in order to allow the following analysis level to operate on a solid basis. This kind of architecture indeed systematically amplifies the noise generated by the previous level. For example, the results of the transducer "Victim", which recognizes meurtre / assassinat / exécution de Person (meurtre = murder, assassinat = assassination, exécution = execution), will be better if those of the transducer "Person" are already good.

The different transducers defined are then minimized and determinized (these two operations optimize the analysis time). The overall set of transducers is composed of 1000 nodes and about 5000 arrows in our experiment.

3.2 Related work in Information Extraction

IE is now a widespread research domain. The American Message Understanding Conferences (MUC) provided a formidable framework for the development of research in this area ([21], [22]). The conferences are held about every two years and generally bring together about fifteen teams working on IE systems. The elaboration of the linguistic resources is for the most part a manual work, even if some attempts were made to obtain more portable systems.

At least two French-speaking projects have been developed which are somewhat comparable with MUC systems: the European project ECRAN and the EXIBUM project from the University of Montreal (Canada). ECRAN developed a generic and multilingual system tested on different corpora (movie reviews, stories from the economic area, etc.) [23]. EXIBUM is a bilingual system (French and English) that aims at processing agency news about terrorist events in Algeria [24].

Several other Information Extraction systems were developed for specific kinds of information (dates, location names, etc.). For example, D. Maurel [25] developed a system highlighting dates by means of automata and acceptability tables. More recently, C. Belleil [26] presented a system highlighting French toponyms and J. Sénellart [27] a system recognizing Minister names in the French newspaper Le Monde. These approaches generally require exhaustive descriptions of the concerned domain.

Recent American work in the area proposed an approach mixing corpus exploration and knowledge acquisition to feed IE systems. A first well-known experiment is AutoSlog from E. Riloff [2], which finds, in texts, relevant syntactic structures from keywords given to the system by the end-user. In the framework of ECRAN, a similar attempt was made to try to generalize relevant syntactic structures from a training corpus and a general dictionary [28]. The experiment we present is different considering that the learning system is not supervised and furnishes the IE system designer with a wide amount of knowledge extracted from the texts.

4 Experiment

In our experiment, we have used a corpus of texts from the French journal Le Monde. Texts indexed by the noun "terrorist event" have been extracted and manually filtered in order to be sure that they really contain a terrorist event description (the full corpus also contains other texts describing proceedings or terrorist menaces). This corpus is of the same kind as the one used for the experiments in the ECRAN project, so that we are able to compare our results.

The time spent on the definition of the linguistic resources with INTEX is estimated at about 15 hours. This duration has to be compared with the two weeks (about 80 hours) needed for the manual resource development of the ECRAN project.

A hundred texts have been used as a "training corpus" and fifteen different texts have been used as a "test corpus". Texts are first parsed with our system, and then some heuristics are used to fill the extraction template:

- due to the structure of the articles of Le Monde, the first date is always the date of the article;
- we assume that the second date is the one of the terrorist event;
- the first two occurrences of locations found are stored and usually identify quite well the location of the terrorist event;
- the first occurrence of a number of victims or injured persons is stored (if a text speaks of more than one terrorist event, we assume that only the first one is relevant; we have chosen short texts to prevent this problem, inherent to long texts);
- only the first weapon linked with the terrorist event is stored.

These heuristics are very succinct and we will have to specialize them to perform information extraction on longer or less specialized texts. We have used these simple heuristics to evaluate our system and compare it with the ECRAN one. With these heuristics, we obtain good results on our corpus, and most of the extraction systems evaluated in the American MUC conferences used this kind of heuristics in order to cope with parsing problems.
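The heuristics above amount to selecting the first (or first two) occurrences of each tagged element. A possible sketch, assuming the upstream transducers deliver a list of (tag, value) occurrences in text order (this input format is an assumption, not the actual INTEX output), is:

```python
# Sketch of the template-filling heuristics listed above, applied to the
# tagged occurrences produced upstream (here simply a list of (tag, value)
# pairs in text order; this input format is an assumption).
def fill_template(occurrences):
    dates = [v for t, v in occurrences if t == "Date"]
    locations = [v for t, v in occurrences if t == "Location"]
    dead = [v for t, v in occurrences if t == "NbDead"]
    injured = [v for t, v in occurrences if t == "NbInjured"]
    weapons = [v for t, v in occurrences if t == "Weapon"]
    return {
        "Date of the story": dates[0] if dates else None,     # first date = date of the article
        "Date": dates[1] if len(dates) > 1 else None,         # second date = date of the event
        "Loc.": locations[:2],                                # first two locations found
        "Nb dead persons": dead[0] if dead else None,         # first occurrence only
        "Nb persons injured": injured[0] if injured else None,
        "Weapon": weapons[0] if weapons else None,            # first weapon linked to the event
    }

occ = [("Date", "15 août 1994"), ("Location", "TURQUIE"), ("NbInjured", "neuf"),
       ("Weapon", "bombe"), ("Date", "vendredi 12 août"), ("Location", "Istanbul")]
print(fill_template(occ))
```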
Our results have been evaluated by two human experts who did not take part in our experiment. Our performance indicators were defined as:

- OK (O) if the extracted information is correct;
- FALSE (F) if the extracted information is incorrect or not filled;
- NONE (N) if no information was extracted and no information had to be extracted;
- FALSE for all the other cases.

Using these indicators, we can compute two different values:

- PRECISION 1 (P1), the ratio between the OK and FALSE answers, without taking the NONE answers into account;
- PRECISION 2 (P2), the same as P1 but with the NONE answers.
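Concretely, P1 = O / (O + F). The text does not spell out how the NONE answers enter P2; the reading that reproduces the published table (e.g. the Weapon row, 35/11/4, gives 0.76 and 0.78) is to count them as correct answers, as in the sketch below.

```python
# Sketch of the two precision figures. P1 ignores the NONE answers; for P2 the
# reading that reproduces the published table is to count NONE answers
# (nothing extracted, nothing to extract) as correct answers.
def precisions(ok, false, none):
    p1 = ok / (ok + false) if ok + false else 1.0
    p2 = (ok + none) / (ok + false + none)
    return round(p1, 2), round(p2, 2)

print(precisions(35, 11, 4))   # Weapon row of the table: (0.76, 0.78)
print(precisions(20, 5, 25))   # Nb dead persons row: (0.8, 0.9)
```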
The next table summarizes the results for the different elements of the template.

                      O     F    N    P1    P2
Date of the story     50    0    0    1.00  1.00
Location              45    5    0    0.90  0.90
Date                  49    1    0    0.98  0.98
Nb dead persons       20    5    25   0.80  0.90
Nb persons injured    26    9    15   0.74  0.82
Weapon                35    11   4    0.76  0.78
Average               37.5  5.2  7.3  0.86  0.89

We obtain a good quality for the extracted information in most of the elements:

- the date of the story is fully correct because we can use the structure of the article to extract it;
- the errors for the location slot are due to two "contradictory" locations found by the system; a more complete linguistic analysis, or a database providing lists of cities in different countries, would reduce this kind of errors;
- the errors in the number of dead or injured persons slots are frequently due to silence; our system, for example, fails on too complex syntactic forms like "Deux médecins italiens travaillant pour médecins sans frontières (MSF-Belgique) ont été blessés." (Two Italian doctors working for médecins sans frontières (MSF-Belgique) have been injured.), where the passive subject has not been correctly parsed by the system;
- the silence for the weapon slot is frequently due to the incompleteness of the semantic dictionaries.
5 Discussion

In this section, we comment on some of the results of this experiment. The results obtained prove the interest of coupling a semantic knowledge acquisition tool with the IE system. But those results are not precise enough to decide about the quality of the semantic knowledge acquisition tool itself. We will therefore examine some indicators which allow us to judge the quality of the semantic knowledge learned, and then we will present some comments on the information extraction.

5.1 Semantic Knowledge quality

Semantic knowledge acquisition tools like ASIUM are always very difficult to evaluate. Measuring the quality of an ontology, or evaluating an ontology against another one, is not easy and heavily depends on the applications. So, we will only present here some indicators that give an idea of the quality of the acquired knowledge.

Concept quality depends on two different elements. The first one is the distance which computes the similarity between classes in order to create relevant concepts and perform relevant inductions. As usual in conceptual clustering, the distance is a parameter of the concept quality and of the quantity of the expert's work. This first, qualitative element is very hard to estimate. In our application, 16 of the 19 first classes proposed by ASIUM have been accepted by the expert. 447 inductions have been proposed by ASIUM and 73% of these inductions have been judged relevant by the expert.

The second element which affects the concept quality is the level of generality of a concept. When ASIUM proposes a new concept, the expert has to decide from the generality of the concept whether it should be split or not. This work is easy for an expert because he has a very good knowledge of the final application. For example, if ASIUM proposes the "Organization" concept, the expert has to decide whether it is relevant for the task to identify sub-concepts like "Military org." and "Politic org.". The generality level in the application highly depends on the subtlety of the template to be filled by the information extraction system. Our previous experiments on this domain and on the cooking recipes domain have shown that this work is simple and that expert choices really depend on the task. (More explanations on the suitability of concepts for the main task and the unsuitability of these concepts for another task are given in [29].)

5.2 Comments on the extraction process

The results we obtained during this experiment compare satisfactorily with those that we obtained on the same corpus with the ECRAN system, which reached 0.89 precision. Moreover, the results of the new system were obtained after a reduced development phase: about 40 hours for the learning phase with ASIUM and about 15 hours to format the knowledge base as INTEX resources. The following comments can be made on this experiment:

- Having a good knowledge of the corpus is indubitably an advantage for the system designer. The fact that one of the authors had previously done the same task for ECRAN speeded up the development process, given that the search for relevant syntactic structures was facilitated.
- The results of the ASIUM system speed up the definition of the paradigmatic classes filling states in the INTEX transducers, even if certain classes need to be manually completed. For example, the ASIUM semantic classes allowed us to rapidly complete the graph representing the set of weapons or of persons implicated in terrorist events. ASIUM provided a class in which terms such as "bomb", "grenade", "explosive" or "car" could appear, considering that a booby-trapped car is a kind of weapon, etc.
- The description language provided by INTEX is richer than the one of ECRAN. The time spent to model the INTEX linguistic transducers was longer than the one spent for ECRAN, since the constraints and the empty transitions in automata and transducers have to be manually designed so that the noise is kept at a low level (the effort to manage empty transitions in graphs took about 5 hours, but allowed us to obtain a more efficient grammar than the one obtained by describing syntactic patterns with a set of regular expressions).

Such an evaluation, in which we deliberately limited the time spent on the development of linguistic resources, shows the importance of having accurate resources adapted to the task. Moreover, the inescapable incompleteness of the developed resources when facing new texts shows that this kind of system has to integrate dynamic acquisition processes to assist the incremental enrichment of resources as time goes by.

The experiment was intended to show the time needed for the development of a sufficient set of resources in order to obtain results equivalent to those of the ECRAN project. That is the reason why we emphasize an evaluation of the amount of time spent on the task rather than the improvement potential. That is also the reason why we focused on a limited template that only necessitates a surface analysis. This limitation could certainly be overcome if we used the knowledge acquired by ASIUM more extensively. Thus, we plan to take into account a deeper linguistic analysis (anaphora resolution, partial information merging, etc.).
6 Future work

Not all the knowledge learned by ASIUM is used in this experiment, especially the subcategorization frames. We showed that a surface analysis is sufficient when the templates to be filled are not more complex than those of ECRAN. The good quality obtained in a very short time supports this idea.

Nevertheless, in order to extract more specific information from texts (like the name of the organization that performs the terrorist event, the political membership of the victims or the attacker's nationality), we think that the use of subcategorization frames could be very useful. Writing syntactic rules in order to perform relevant information extraction becomes very hard because of the multiplicity of the syntactic variations used in texts.

Our current work is to create a cooperative acquisition system to learn resources using the subcategorization frames learned by ASIUM. The expert will be able to express rules using the complements of verbs independently of the syntax. Active and passive forms will be given the same representation by the system. For example, the two following sentences will be equivalent: L'action terroriste est revendiquée par le Front populaire de libération de la Palestine (FPLP) (The terrorist event was claimed by the FPLP) or le Front populaire de libération de la Palestine (FPLP) revendique l'action terroriste (The FPLP claimed responsibility for the terrorist event). One example of a rule for this kind of sentence can be:

If verb is "to claim" and object belongs to the class "Attack", then the subject is the attacker.

This kind of rule allows us to differentiate people claiming terrorist events, as in Un groupe terroriste libanais revendique l'attentat anti-sémite de Buenos-Aires (A Lebanese terrorist group claims the anti-semitic attack in Buenos-Aires), from an organization claiming a right, as in les fondamentalistes musulmans revendiquent le droit de vote (Muslim fundamentalists are claiming voting rights).

Semantic rules allow us to make fine distinctions in order to accurately fill fine-grained slots. The two next rules fill the field "Missile" or "Attacker" depending on the concept (Explosive or Person), learned by ASIUM, used as subject of the verb to kill:

If verb = "to kill" and subject = Person, then the subject is the attacker.
If verb = "to kill" and subject = Explosive, then the subject is the missile used.
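The sketch below shows how such conceptual rules could be applied once active and passive forms are normalized to the same (verb, role, filler) representation; the rule encoding, the concept lookup and the example lexicon are illustrative assumptions, not the system described here.

```python
# Sketch of the "conceptual rules" above, applied to a normalized frame in which
# active and passive forms are given the same (verb, role -> filler) representation.
# The rule encoding and the `concept_of` lookup are illustrative assumptions.
CONCEPTS = {"bombe": "Explosive", "FPLP": "Organization", "attentat": "Attack"}

def concept_of(noun):
    return CONCEPTS.get(noun, "Unknown")

RULES = [
    # (verb, role to test, concept required, slot to fill, role providing the value)
    ("to claim", "object", "Attack", "attacker", "subject"),
    ("to kill", "subject", "Person", "attacker", "subject"),
    ("to kill", "subject", "Explosive", "missile", "subject"),
]

def apply_rules(frame):
    """frame: {'verb': ..., 'subject': ..., 'object': ...} after normalization."""
    slots = {}
    for verb, role, concept, slot, source in RULES:
        if frame.get("verb") == verb and concept_of(frame.get(role)) == concept:
            slots[slot] = frame[source]
    return slots

# "le FPLP revendique l'action terroriste" and its passive counterpart both
# normalize to the same frame, so the same rule fires for both.
print(apply_rules({"verb": "to claim", "subject": "FPLP", "object": "attentat"}))
# {'attacker': 'FPLP'}
```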
We can see that, even if syntactic parsers generate errors and ambiguities, ASIUM can check the texts against the ontology and the subcategorization frames previously learned. The information extraction process will then operate only on sentences consistent with the subcategorization frames. This also allows some parsing errors to be detected.

The system we have in mind will carry out two different steps. First, we will use the syntax and the concepts learned by ASIUM to pre-fill the frame. Second, we will use our "conceptual rules" to fill the frame more specifically.

7 Conclusion

We have described in this article an experiment in which we coupled an information extraction system using INTEX with the machine learning system ASIUM. The development time of the linguistic resources of the information extraction system has been reduced by using the semantic knowledge learned by ASIUM. The quality of the results remains the same as in the European ECRAN project.

The aim of this experiment was to validate our approach. We will now explore a better integration of the two systems and examine how to better use the semantic knowledge learned by ASIUM in order to increase the quality of our results.

Acknowledgment

The research of Thierry Poibeau is partially funded by a CIFRE grant between the Laboratoire Central de Recherches of Thomson-CSF and the Laboratoire d'Informatique de l'Université de Paris-Nord. The authors want to acknowledge M. Rodde (Cristal-Gresec) and A. Balvet (Université Paris X) for their contribution during the analysis of the results.

REFERENCES

[1] M. T. Pazienza, ed., Information Extraction (A Multidisciplinary Approach to an Emerging Information Technology). Berlin: Springer Verlag (Lecture Notes in Computer Science), 1997.
[2] E. Riloff, "Automatically generating extraction patterns from untagged texts," in Proceedings of the 13th National Conference on Artificial Intelligence (AAAI'96), (Portland, Oregon), 1996.
[3] T. Poibeau, "Mixing technologies for Intelligent Information Extraction," in Proceedings of the Workshop on Intelligent Information Integration, 16th International Joint Conference on Artificial Intelligence, pp. 116–121, 1999.
[4] M. E. Califf, Relational Learning Techniques for Natural Language Information Extraction. PhD thesis, Department of Computer Sciences, University of Texas at Austin, February 1997.
[5] R. Basili and M. T. Pazienza, "Lexical Acquisition for Information Extraction," in Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology (M. T. Pazienza, ed.), (Frascati, Italy), LNAI Tutorial, Springer, July 1997.
[6] S. Buchholz, "Distinguishing Complements from Adjuncts using Memory-Based Learning," in Proceedings of the ESSLLI'98 Workshop on Automated Acquisition of Syntax and Parsing (B. Keller, ed.), pp. 41–48, 1998.
[7] D. Hindle, "Noun classification from predicate-argument structures," in Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL), Pittsburgh, PA, pp. 268–275, 1990.
[8] R. J. Mooney, A. C. Thompson, and R. L. Tang, "Learning to Parse Natural Language Database Queries into Logical Form," in Proceedings of the ML-97 Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, 1996.
[9] E. Riloff, "Automatically Constructing a Dictionary for Information Extraction Tasks," in Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 811–816, 1993.
[10] E. Riloff and J. Shepherd, "A Corpus-Based Approach for Building Semantic Lexicons," in Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2), 1997.
[11] G. Grefenstette, "Sextant: exploring unexplored contexts for semantic extraction from syntactic analysis," in Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL), (Newark, Delaware, USA), pp. 324–326, June 1992.
[12] P. Constant, "Reducing the complexity of encoding rule-based grammars," December 1996.
[13] Z. Harris, Mathematical Structures of Language. New York: Wiley, 1968.
[14] H. J. Peat and P. Willet, "The limitations of term co-occurrence data for query expansion in document retrieval systems," Journal of the American Society for Information Science, vol. 42, no. 5, pp. 378–383, 1991.
[15] R. Grishman and J. Sterling, "Generalizing Automatically Generated Selectional Patterns," in Proceedings of COLING'94, 15th International Conference on Computational Linguistics, (Kyoto, Japan), August 1994.
[16] D. Faure and C. Nédellec, "A Corpus-based Conceptual Clustering Method for Verb Frames and Ontology Acquisition," in LREC Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications (P. Velardi, ed.), (Granada, Spain), pp. 5–12, May 1998.
[17] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 183–190, 1993.
[18] M. R. Brent, "Automatic acquisition of subcategorization frames from untagged text," in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 209–214, 1991.
[19] C. A. Thompson, "Acquisition of a Lexicon from Semantic Representations of Sentences," in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), Boston, MA, pp. 335–337, 1995.
[20] J. M. Zelle and R. J. Mooney, "Learning semantic grammars with constructive inductive logic programming," in Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 817–822, 1993.
[21] "MUC-6," in Proceedings of the Sixth Message Understanding Conference (MUC-6), (San Francisco), Morgan Kaufmann, 1996.
[22] "MUC-7," in Proceedings of the Seventh Message Understanding Conference (MUC-7), (San Francisco), Morgan Kaufmann, 1998.
[23] T. Poibeau, "Extraction d'information : adaptation lexicale et calcul dynamique du sens," in Actes des rencontres internationales sur l'extraction, le filtrage et le résumé automatique (RIFRA'98), (Sfax, Tunisia), pp. 141–153, November 1998.
[24] L. Kosseim and G. Lapalme, "EXIBUM : un système expérimental d'extraction bilingue," in Actes des rencontres internationales sur l'extraction, le filtrage et le résumé automatique (RIFRA'98), (Sfax, Tunisia), pp. 129–140, November 1998.
[25] D. Maurel, Reconnaissance des séquences de mots par automate, adverbes de date du français. PhD thesis, Université Paris 7, 1989.
[26] C. Belleil, Reconnaissance, typage et traitement des coréférences des toponymes français et de leurs gentilés par dictionnaire électronique relationnel. PhD thesis, Université de Nantes, 1997.
[27] J. Sénellart, "Locating noun phrases with finite state transducers," in 17th International Conference on Computational Linguistics (COLING'98), (Montréal), pp. 1212–1217, 1998.
[28] R. Basili, R. Catizone, M. T. Pazienza, M. Stevenson, P. Velardi, M. Vindigni and Y. Wilks, "An empirical approach to Lexical Tuning," in Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, (Granada, Spain), May 1998.
[29] D. Faure, "Connaissances sémantiques acquises par Asium : exemples d'utilisations," in Journée du Réseau de sciences cognitives d'Ile-de-France (RISC, ed.), p. 12, October 1999.