=Paper=
{{Paper
|id=Vol-31/paper-2
|storemode=property
|title=First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX
|pdfUrl=https://ceur-ws.org/Vol-31/DFaure_6.pdf
|volume=Vol-31
|dblpUrl=https://dblp.org/rec/conf/ecai/FaureP00
}}
==First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX==
David Faure (L.R.I., UMR 86-23 du CNRS, Université Paris Sud, F-91405 Orsay Cedex, David.Faure@lri.fr) and Thierry Poibeau (Thomson-CSF, Laboratoire Central de Recherches, Domaine de Corbeville, F-91404 Orsay, Thierry.Poibeau@lcr.thomson-csf.com)
Abstract. Our aim in this article is to show how semantic knowledge learned for a specific domain can help the creation of a powerful information extraction system. We describe a first experiment of coupling an information extraction system based on INTEX with the machine learning system ASIUM. We will show how the semantic knowledge learned by ASIUM helps the user to write an information extraction system more efficiently, by reducing the time spent on the development of resources. Our approach is compared to the European ECRAN project, which aims at the same result, with respect to development time and performance.
1 Introduction
Information Extraction (IE) is a technology dedicated to the extraction of structured information from texts. This technique is used to highlight relevant sequences in the original text or to fill pre-defined templates [1]. Below is the example of a story concerning a terrorist attack in Turkey, together with the corresponding entry in the database filled by the IE system.

940815LM347810 Le Monde - 15 août 1994, page 6
TURQUIE: neuf blessés dans un attentat à la bombe.
Neuf personnes, dont trois touristes étrangers, ont été blessées par l'explosion d'une bombe vendredi 12 août dans une gare routière de la partie européenne d'Istanbul. (...) - (AFP.)

940815LM347810 Le Monde - August 15, 1994, page 6
TURKEY: nine persons were injured during a bomb attack.
Nine persons, three of them being foreign tourists, were injured by a bomb explosion on Friday August 12, at a bus station in the European part of Istanbul.

Date of the story: 15 août 1994
Loc.: TURQUIE, Istanbul
Date: Vendredi 12 août
Nb dead persons: -
Nb persons injured: neuf (nine)
Weapon: bombe (bomb)

Even if IE now seems to be a relatively mature technology, it suffers from a number of yet unsolved problems that limit its dissemination through industrial applications. Among these limitations, we can mention the fact that systems are not really portable from one domain to another. Even if a system uses some generic components, most of its knowledge resources are domain-dependent. Moving from one domain to another means re-developing some resources, which is a tedious and time-consuming task (for example, Riloff [2] mentions a 1500-hour development effort).

This fact was observed by one of the authors during the elaboration of a previous prototype with the same aim, in the framework of the European ECRAN project [3]. That system required resources manually defined from the reading of a huge amount of texts.

In order to decrease the time spent on the elaboration of resources for the IE system, we suggest using ASIUM, which learns semantic knowledge from texts. This knowledge is then used for the elaboration of the IE system. We also aim at reaching a better coverage thanks to the generalization process implemented in ASIUM.

We will first present the ASIUM system, which learns the semantic knowledge needed for the elaboration of an IE system similar to that of the ECRAN project. We will show to what extent it is possible to speed up the elaboration of resources without any decrease in the quality of the system. We will finish with some comments on this experiment and show how domain-specific knowledge acquired by ASIUM, such as the subcategorization frames of verbs, could be used to extract more precise information from texts.

2 Semantic Knowledge Acquisition

Semantic knowledge acquisition from texts remains a hard task, even for limited domains. This knowledge is crucial in order to improve natural language applications like information extraction. Approaches mixing machine learning (ML) and natural language processing (NLP) obtain good results in a short development time (we can cite, among others, M. E. Califf [4], R. Basili [5], S. Buchholz [6], D. Hindle [7], R. J. Mooney [8] and E. Riloff [9], [2], [10]).

We present here ASIUM, which cooperatively learns semantic knowledge from syntactically parsed texts without any previous manual processing. This knowledge consists of subcategorization frames of verbs and an ontology of concepts for a specific domain, following the "domain dependence" described by G. Grefenstette [11] ("a semantic structure developed for one domain would not be applicable to another").

ASIUM is based on an unsupervised conceptual clustering method and provides an ergonomic user interface (http://www.lri.fr/Francais/Recherche/ia/sujets/asium.html) to support the knowledge acquisition process.

In this part, we will show how ASIUM is able to learn good quality knowledge in a reasonable time from parsed text, even if the syntactic parsing of the texts is noisy.

2.1 Our approach

Our aim is to learn subcategorization frames of verbs and an ontology for a specific domain, from texts. Existing knowledge bases like EUROWORDNET or WORDNET are frequently over-general for applications in specific domains. These ontologies, although very complete, are not suitable for processing texts in technical languages. On the one hand, they are not purpose-directed ontologies: they may store up to seven meanings and syntactic roles for a word, thus increasing the risk of semantic ambiguity. In a specific domain, the vocabulary as well as its possible usage is reduced, which makes ontologies such
as WORDNET overly general. On the other hand, WORDNET may lack some of the specific terminology of the application domain.

Contrary to approaches that augment or specialize general ontologies for a specific domain, like that of R. Basili [5], we learn an ontology and verb frames from the corpus itself, reducing the risk of inconsistency. Our previous attempts to automatically revise subcategorization frames and a subset of an ontology acquired by a domain expert have failed. Revising the acquired knowledge with respect to the training texts required such a deep restructuring of the knowledge that incremental and even cooperative ML revision methods were not able to handle it. The main reason was that the expert built the ontology and the subcategorization frames with too many a priori assumptions that were not reflected in the texts. This experiment illustrates one of the limitations of manual acquisition by domain experts without linguists.

2.2 Learned knowledge

ASIUM learns subcategorization frames such as <to drop | object: Explosive | in: Public Place> for the verb to drop. Both couples, object: Explosive and in: Public Place, are subcategories; object is a syntactic role and in is a preposition, while Explosive and Public Place are concepts used as restrictions of selection. More generally, ASIUM learns verb frames of the form <verb> <| role: concept>*.

These frames are more general than the ones defined in the LFG (Lexical Functional Grammar) formalism because the subcategories cover both verb arguments (subject, direct object or indirect object) and adjuncts. In our framework, restrictions of selection can be filled by an exhaustive list of nouns (in canonical form) or by one or more concepts defined in an ontology. The ontology represents generality relations between concepts in the form of a directed acyclic graph (DAG). For example, the ontology could define car, train and motorcycle as motorized vehicle, and motorized vehicle as both vehicle and pollutant. Our method learns such an ontology and the subcategorization frames in an unsupervised manner (ASIUM is called unsupervised because no concept examples are provided as input) from texts in natural language. The concepts formed have to be labeled by an expert.
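To make the two kinds of learned knowledge concrete, here is a minimal sketch (Python, not the ASIUM implementation; the Frame and Ontology structures and their names are illustrative assumptions) of a verb frame whose restrictions of selection are concepts, and of an ontology stored as a DAG of generality relations, using the car/train/motorcycle example above.

```python
# Minimal sketch (not the ASIUM implementation) of the two kinds of learned
# knowledge: verb subcategorization frames whose restrictions of selection are
# concepts, and an ontology stored as a directed acyclic graph.
from dataclasses import dataclass, field

@dataclass
class Frame:
    verb: str
    # one (role or preposition) -> concept restriction per slot,
    # e.g. {"object": "Explosive", "in": "Public Place"}
    restrictions: dict

@dataclass
class Ontology:
    # child concept -> set of more general concepts (a DAG, not a tree)
    parents: dict = field(default_factory=dict)

    def add(self, concept, *more_general):
        self.parents.setdefault(concept, set()).update(more_general)

    def generalizations(self, concept):
        """All concepts reachable upward from `concept` in the DAG."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

# The example from the text: car/train/motorcycle -> motorized vehicle,
# and motorized vehicle -> both vehicle and pollutant.
onto = Ontology()
for noun in ("car", "train", "motorcycle"):
    onto.add(noun, "motorized vehicle")
onto.add("motorized vehicle", "vehicle", "pollutant")

drop = Frame("to drop", {"object": "Explosive", "in": "Public Place"})
print(drop.verb, drop.restrictions)
assert onto.generalizations("car") == {"motorized vehicle", "vehicle", "pollutant"}
```

Storing the ontology as child-to-parents sets keeps multiple generalizations (vehicle and pollutant at the same time) cheap to query, which matters once frames are matched against concepts rather than literal nouns.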
2.3 Knowledge acquisition method

The first step of the acquisition process is to automatically extract syntactic frames from the texts. We use the syntactic parser SYLEX developed by P. Constant [12]. In case of syntactic ambiguities, SYLEX produces all the different interpretations and ASIUM uses all of them. Experiments have shown that the ML method works well with these ambiguities and that the acquisition of semantic knowledge is not affected. This approach avoids a very time-consuming manual disambiguation step. These syntactic frames have the same form as the subcategorization frames, but with the concepts replaced by nouns: <verb> <| role: head noun>*.

ASIUM only uses the head nouns of complements and their links with verbs. Adjectives and empty nouns are not used. Our experiments have shown that this information is enough to learn semantic knowledge, even from a noisy syntactic parsing.

The learning method relies on the observation of syntactic regularities in the context of words [13]. We assume here that head nouns occurring with the same verb+preposition/syntactic role couple represent a so-called basic class and have a semantic similarity, in the same line as Grefenstette [11], Peat [14] or others, but our method is based on a double regularity model: ASIUM gathers nouns together as representing a concept only if they share at least two different verb+preposition/syntactic role contexts, as in Grishman [15]. Experiments show that this forms more reliable concepts, thus requiring less involvement from the user. Our similarity measure computes the overlap between two lists of nouns: Sim(C1, C2) = 1 for lists with the same nouns and Sim(C1, C2) = 0 for lists without any common nouns (details in [16]). As usual in conceptual clustering, the validity of the learned concepts relies on the quality of the similarity measure between clusters, which increases with the size of their intersection.

Basic classes are then successively aggregated by a bottom-up, breadth-first conceptual clustering method to form the concepts of the ontology level by level, with an expert validation and/or labelling at each level. Thus a given cluster cannot be used in a new construction before it has been validated. For complexity reasons, the number of clusters to be aggregated is restricted to two, but this does not affect the relevance of the learned concepts [16]. Verb subcategorization frames are learned in parallel, so that each new concept fills the corresponding restriction of selection, resulting in a generalization of the initial syntactic frames which covers examples that did not occur as such in the texts. Thus, the clustering process does not only identify the lists of nouns occurring after the same verb+preposition/function couple, but also augments these lists by induction.

[Figure: two basic classes with a common part of nouns are merged into a learned concept; the induced examples extend the noun list of each couple.]

The aggregation of two basic classes C1 and C2, found after two different verb+prep./function couples (V1,P1/F1 and V2,P2/F2), creates a new concept allowed after both V1,P1/F1 and V2,P2/F2. Thus, nouns which only appear in the basic class C1 (resp. C2) will now be allowed with the couple V2,P2/F2 (resp. V1,P1/F1). This results in a generalization of the knowledge found in the corpus, as presented in the figure.

For example, starting from syntactic frames in which the nouns below occur after the same verb+prep./function couples, ASIUM will learn two concepts,

Human: father; neighbor; friend; colleague.
Motorized Vehicle: car; train; motorcycle.

and the two corresponding subcategorization frames.

Experts have to control the link between the new concept and the verb, because the threshold alone, fixed by the expert, cannot measure the over-generalization risk. This validation process is relatively quick thanks to the ergonomic user interface. ASIUM provides the expert with the list of newly covered examples in order to estimate the generality of the proposed concept. Moreover, the expert can use functionalities provided by ASIUM to divide the learned concept into sub-concepts in case a proposed concept is overly general for the target task.
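The clustering step can be summarized with the following sketch. It is illustrative only: the exact similarity measure is the one detailed in [16], replaced here by a simple overlap coefficient that, like the measure described above, is 1 for identical noun lists and 0 for disjoint ones, and the threshold value is an arbitrary assumption.

```python
# Sketch of the double regularity idea (illustrative, not the ASIUM code).
# A basic class = the head nouns observed after one verb+preposition/role
# couple; two basic classes whose noun lists overlap enough are aggregated
# into a concept, and each couple inherits the other's nouns (induction).
from collections import defaultdict

def basic_classes(observations):
    """observations: iterable of (verb, prep_or_role, head_noun) triples."""
    classes = defaultdict(set)
    for verb, slot, noun in observations:
        classes[(verb, slot)].add(noun)
    return classes

def similarity(c1, c2):
    """Overlap in [0, 1]: 1 for identical noun lists, 0 for disjoint ones
    (the measure actually used by ASIUM is detailed in [16])."""
    union = c1 | c2
    return len(c1 & c2) / len(union) if union else 0.0

def aggregate(classes, threshold=0.3):
    """One clustering level: propose concepts from pairs of basic classes;
    the expert still has to validate and label each proposal."""
    proposals = []
    couples = list(classes)
    for i, a in enumerate(couples):
        for b in couples[i + 1:]:
            if similarity(classes[a], classes[b]) >= threshold:
                # Induction: nouns seen with only one couple become
                # allowed with the other one as well.
                proposals.append((a, b, classes[a] | classes[b]))
    return proposals

obs = [("to drop", "object", "bomb"), ("to explode", "subject", "bomb"),
       ("to drop", "object", "grenade"), ("to explode", "subject", "shell")]
for a, b, concept in aggregate(basic_classes(obs)):
    print(a, b, sorted(concept))
```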
2.4 Related work in semantic knowledge acquisition

As for D. Hindle [7] or F. Pereira [17], our method gathers nouns according to syntactic regularities of the arguments and adjuncts of verbs. We assume that in specialized texts, verbs are also characterized by their adjuncts. G. Grefenstette [11] proposes to learn something close to our "basic classes". Our "double similarity model" learns a concept by gathering two basic classes only if they have a good similarity. This model limits the number of non-relevant produced concepts. M. R. Brent [18] learns only five subcategorization frames from untagged texts with an automatic method. S. Buchholz [6] learns subcategorization frames very close to ours, but with a supervised method which is very time-consuming for the expert. In the same way, WOLFIE (C. A. Thompson [19]) with CHILL (J. M. Zelle [20]) learns "case-roles" and a thesaurus from texts syntactically parsed by CHILL but fully semantically annotated by hand. These case roles differ from our subcategorization frames because our prepositions or grammatical functions are replaced by semantic roles like agent or patient. Contrary to the ontology learned by ASIUM, the selectional restrictions learned by WOLFIE are attribute-value lists. An unsupervised learning approach like ASIUM delays concept labelling until after the learning process and so considerably reduces the time needed by the expert. After ASIUM learning, the semantic roles can be labelled by assuming that a verb+prep./function couple represents a specific semantic role. E. Riloff in [10] learns five concepts from texts. She uses lists of nouns representing general concepts (seeds) and a co-occurrence method to augment these lists into concepts. These augmented lists are checked by the expert, who only retains the nouns representing the concept. We can consider the basic classes of ASIUM as seeds that will be increased by our induction process. The main advantage is that the number of concepts is not limited to five, and the subcategorization frames of verbs are learned in parallel without additional time-consuming validation.

3 The Information Extraction system

The Information Extraction system is based on the INTEX tool-box, developed by the LADL laboratory (Laboratoire d'Automatique Documentaire et Linguistique de l'Université de Paris 7). INTEX allows a rapid and interactive development of automata and transducers to analyze texts. A linguistic automaton recognizes expressions in texts, whereas a transducer associates specific tags with words in the texts (for example, assigning a syntactic category to a word). Transducers are efficient, expressive and sufficient for a local analysis of texts. We chose this approach because it allows the rapid development of an IE system for a given domain with a strictly local analysis limited to the sentence area. Our aim is to develop a highly portable system, even if this means using more precise analysis strategies afterwards.

3.1 Linguistic resources modeling

To elaborate the linguistic resources, we first used the semantic classes defined by the ASIUM system. Before the experiment, the corpus was separated in two different parts: the training set and the test set. The linguistic resources are constantly tested on the training set during the development. This development approach allows us to evaluate the performance and to detect possible errors in the grammar (a grammar with too many or not enough constraints would bring silence or noise during the analysis). The expressions modeled via transducers are for the most part syntactic structures (the set of expressions equivalent to the notion of "bombing") integrating some of the semantic classes furnished by the ASIUM system (for example, the list of weapons which could be used in a bomb attack).

The homogeneous semantic lists learned by the ASIUM system are introduced into the INTEX vocabulary. At this level, a manual work is necessary to exploit the semantic classes from ASIUM. These classes are refined (merging of scattered classes, deletion of irrelevant elements, addition of new elements, etc.). About ten hours have been dedicated, after the acquisition process, to the refinement of the data furnished by ASIUM. This knowledge is then considered as a resource for INTEX and is exploited either as dictionaries or as transducers, depending on the nature of the information. If it is general information that is not domain-specific, we prefer to use a dictionary, which can be reused; otherwise, we use a transducer.

A dictionary is a list of words or phrases, each one being accompanied by a tag and a list of features (for example, in this case: Loc, City, Country, ...). The first names dictionary or the locations dictionary are generic reusable resources. Below is a sample of the location names dictionary; each line begins with a term, followed by an indication of its syntactic category (N for Noun) and semantic features (Loc to indicate a location, Country to indicate a country, etc.):

Abidjan,N+Loc+City;
Afghanistan,N+Loc+Country;
Allemagne,N+Loc+Country;
Allemagne de l'Ouest,N+Loc+Country;
Allemagne de l'Est,N+Loc+Country; ...

These items, structured in a list, are well suited to the dictionary format, and the semantic lists elaborated from ASIUM accurately complete the coverage of the initial dictionaries from INTEX.
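For illustration, the sketch below parses dictionary lines of the form shown in the sample into (term, category, features) triples, e.g. in order to merge an ASIUM noun list into an existing word list. It is not the INTEX dictionary compiler, and any parsing convention beyond the sample lines above is an assumption.

```python
# Sketch: reading dictionary lines of the form "term,N+Loc+City;" into
# (term, category, features) triples, e.g. to merge ASIUM noun lists into
# an existing word list. Illustrative only; INTEX has its own dictionary
# compiler and this is not it.
def parse_dictionary(lines):
    entries = []
    for line in lines:
        line = line.strip().rstrip(";")
        if not line:
            continue
        term, _, tags = line.partition(",")
        category, *features = tags.split("+")
        entries.append((term, category, features))
    return entries

sample = [
    "Abidjan,N+Loc+City;",
    "Afghanistan,N+Loc+Country;",
    "Allemagne de l'Ouest,N+Loc+Country;",
]
print(parse_dictionary(sample))
# [('Abidjan', 'N', ['Loc', 'City']), ...]
```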
The transducer format is essentially used for more complex or more variable data, where linguistic phenomena such as insertion or optionality may interfere.

[Figure: transducer "Person", combining a trigger word (monsieur, madame, M., Mme), an optional first name (FirstName), an optional particle (de) and a last name (LastName) or unknown word (UnknownWord).]

The figure presents an example of a transducer allowing the recognition of person names such as Monsieur Jean Dupont. The transducer recognizes a sequence composed of a trigger word (Monsieur), a first name (Jean) and a proper name (Dupont). But we must keep in mind that most of these elements can be optional (Monsieur Dupont or Jean Dupont are correct sequences) and that Dupont can be a word that is not listed in any dictionary (it will then be considered as an unknown word).
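As an approximation of what the "Person" transducer covers, the following regex sketch accepts an optional trigger word, an optional first name and a capitalized token that may be unknown to every dictionary. It is only a rough stand-in: INTEX compiles an actual finite-state transducer, and the trigger list and the first-name dictionary used here are assumptions.

```python
# Rough regex approximation of the "Person" transducer: an optional trigger
# word (monsieur, madame, M., Mme), an optional first name and a capitalized
# token that may be unknown to every dictionary. INTEX compiles a real
# finite-state transducer; this sketch only mimics its coverage.
import re

FIRST_NAMES = {"Jean", "Marie", "Pierre"}   # stand-in for the first-name dictionary
TRIGGER = r"(?:[Mm]onsieur|[Mm]adame|M\.|Mme)"

PERSON = re.compile(
    rf"(?:{TRIGGER}\s+)?"                    # optional trigger word
    r"(?:(?P<first>[A-Z][a-zà-ÿ]+)\s+)?"     # optional first name
    r"(?P<last>(?:de\s+)?[A-Z][a-zà-ÿ]+)"    # last name, possibly with a particle
)

def find_persons(text):
    for m in PERSON.finditer(text):
        # keep the match only if it is anchored by a trigger word
        # or by a first name listed in the dictionary
        starts_with_trigger = re.match(TRIGGER, m.group(0)) is not None
        if starts_with_trigger or m.group("first") in FIRST_NAMES:
            yield m.group(0)

print(list(find_persons("Monsieur Jean Dupont a rencontré Mme Durand.")))
# ['Monsieur Jean Dupont', 'Mme Durand']
```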
At this level, one can find two types of transducers: some are generic, such as the "Person" one, and some others are domain-specific and can be filled with the semantic knowledge acquired by ASIUM.

[Figure: transducer "Weapon", recognizing explosion de Det N, where N is taken from the class Explosive.]

The next figure illustrates a transducer recognizing explosion de Det N (explosion of Det N), where the nominal phrase Det N recognizes nominal phrases elaborated from the semantic class bombing, in which words such as bombe (bomb), obus (shell), grenade, etc. appear.
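A comparable sketch for the "Weapon" pattern restricts the noun of explosion de Det N to the semantic class provided by ASIUM (bombe, obus, grenade); the determiner list is an assumption, and INTEX would express this as a transducer rather than as a regular expression.

```python
# Sketch of the "Weapon" pattern: "explosion de <Det> <N>" where the noun is
# restricted to the semantic class learned by ASIUM (bombe, obus, grenade, ...).
# The determiner list is illustrative; INTEX would encode this as a transducer.
import re

WEAPONS = {"bombe", "obus", "grenade"}      # class provided by ASIUM
DETS = {"la", "le", "une", "un", "cette", "ce"}

PATTERN = re.compile(r"explosion\s+d(?:e\s+|')(\w+)\s+(\w+)", re.IGNORECASE)

def find_weapons(text):
    for det, noun in PATTERN.findall(text):
        if det.lower() in DETS and noun.lower() in WEAPONS:
            yield noun

print(list(find_weapons("Neuf personnes ont été blessées par l'explosion d'une bombe.")))
# ['bombe']
```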
The elaboration of such transducers requires some linguistic expertise in order to obtain, in fine, a system recognizing the relevant sequences without too much noise. Since the architecture of the system uses cascading transducers, it is important that each level has a good quality, in order to allow the following analysis level to operate on a solid basis. This kind of architecture indeed systematically amplifies the noise generated by the previous level. For example, the results of the transducer "Victim", which recognizes meurtre / assassinat / exécution de Person (meurtre = murder, assassinat = assassination, exécution = execution), will be better if those of the transducer "Person" are already good.

The different transducers defined are then minimized and determinized (these two operations optimize the analysis time). The overall set of transducers is composed of 1000 nodes and about 5000 arrows in our experiment.

3.2 Related work in Information Extraction

IE is now a widespread research domain. The American Message Understanding Conferences (MUC) provided a formidable framework for the development of research in this area ([21], [22]). The conferences are held about every two years and generally bring together about fifteen teams working on IE systems. The elaboration of the linguistic resources is for the most part a manual work, even if some attempts were made to obtain more portable systems.

At least two French-speaking projects have been developed which are somewhat comparable with MUC systems: the European project ECRAN and the EXIBUM project from the University of Montreal (Canada). ECRAN developed a generic and multilingual system tested on different corpora (movie reviews, stories from the economic area, etc.) [23]. EXIBUM is a bilingual system (French and English) that aims at processing agency news about terrorist events in Algeria [24].

Several other Information Extraction systems were developed for specific kinds of information (dates, location names, etc.). For example, D. Maurel [25] developed a system highlighting dates by means of automata and acceptability tables. More recently, C. Belleil [26] presented a system highlighting French toponyms and J. Sénellart [27] a system recognizing Minister names in the French newspaper Le Monde. These approaches generally require exhaustive descriptions of the concerned domain.

Recent American work in the area proposed an approach mixing corpus exploration and knowledge acquisition to feed IE systems. A first well-known experiment is AutoSlog from E. Riloff [2], which finds, in texts, relevant syntactic structures from keywords given to the system by the end-user. In the framework of ECRAN, a similar attempt was made to try to generalize relevant syntactic structures from a training corpus and a general dictionary [28]. The experiment we present is different considering that the learning system is not supervised and furnishes the IE system designer with a wide amount of knowledge extracted from the texts.

4 Experiment

In our experiment, we have used a corpus of texts from the French journal Le Monde. Texts indexed by the noun "terrorist event" have been extracted and manually filtered in order to be sure that they really contain a terrorist event description (the full corpus also contains other texts describing proceedings or terrorist menaces). This corpus is of the same kind as the one used for the experiments in the ECRAN project, so that we are able to compare our results.

The time spent on the definition of the linguistic resources with INTEX is estimated at about 15 hours. This duration has to be compared with the two weeks (about 80 hours) needed for the manual resource development of the ECRAN project.

A hundred texts have been used as a "training corpus" and fifteen different texts have been used as a "test corpus". Texts are first parsed with our system, and then some heuristics are used to fill the extraction template:

- due to the structure of the articles of Le Monde, the first date is always the date of the article;
- we assume that the second date is the one of the terrorist event;
- the first two occurrences of locations found are stored and usually identify quite well the location of the terrorist event;
- the first occurrence of a number of victims or injured persons is stored (if a text speaks of more than one terrorist event, we assume that only the first one is relevant; we have chosen short texts to prevent this problem, inherent to long texts);
- only the first weapon linked with the terrorist event is stored.

These heuristics are very succinct and we will have to specialize them to perform information extraction on longer or less specialized texts. We have used these simple heuristics to evaluate our system and compare it with the ECRAN one. With these heuristics, we obtain good results on our corpus, and most of the extraction systems evaluated in the American MUC conferences used this kind of heuristics in order to cope with parsing problems.
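The heuristics above amount to selecting the first (or first two) occurrences of each tagged element. A possible sketch, assuming the upstream transducers deliver a list of (tag, value) occurrences in text order (this input format is an assumption, not the actual INTEX output), is:

```python
# Sketch of the template-filling heuristics listed above, applied to the
# tagged occurrences produced upstream (here simply a list of (tag, value)
# pairs in text order; this input format is an assumption).
def fill_template(occurrences):
    dates = [v for t, v in occurrences if t == "Date"]
    locations = [v for t, v in occurrences if t == "Location"]
    dead = [v for t, v in occurrences if t == "NbDead"]
    injured = [v for t, v in occurrences if t == "NbInjured"]
    weapons = [v for t, v in occurrences if t == "Weapon"]
    return {
        "Date of the story": dates[0] if dates else None,     # first date = date of the article
        "Date": dates[1] if len(dates) > 1 else None,         # second date = date of the event
        "Loc.": locations[:2],                                # first two locations found
        "Nb dead persons": dead[0] if dead else None,         # first occurrence only
        "Nb persons injured": injured[0] if injured else None,
        "Weapon": weapons[0] if weapons else None,            # first weapon linked to the event
    }

occ = [("Date", "15 août 1994"), ("Location", "TURQUIE"), ("NbInjured", "neuf"),
       ("Weapon", "bombe"), ("Date", "vendredi 12 août"), ("Location", "Istanbul")]
print(fill_template(occ))
```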
Our results have been evaluated by two human experts who did not take part in our experiment. Our performance indicators were defined as:

- OK (O) if the extracted information is correct;
- FALSE (F) if the extracted information is incorrect or not filled;
- NONE (N) if no information was extracted and no information had to be extracted;
- FALSE for all the other cases.

Using these indicators, we can compute two different values:

- PRECISION 1 (P1), the ratio between the OK and FALSE answers, without taking the NONE answers into account;
- PRECISION 2 (P2), the same as P1 but with the NONE answers.
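Concretely, P1 = O / (O + F). The text does not spell out how the NONE answers enter P2; the reading that reproduces the published table (e.g. the Weapon row, 35/11/4, gives 0.76 and 0.78) is to count them as correct answers, as in the sketch below.

```python
# Sketch of the two precision figures. P1 ignores the NONE answers; for P2 the
# reading that reproduces the published table is to count NONE answers
# (nothing extracted, nothing to extract) as correct answers.
def precisions(ok, false, none):
    p1 = ok / (ok + false) if ok + false else 1.0
    p2 = (ok + none) / (ok + false + none)
    return round(p1, 2), round(p2, 2)

print(precisions(35, 11, 4))   # Weapon row of the table: (0.76, 0.78)
print(precisions(20, 5, 25))   # Nb dead persons row: (0.8, 0.9)
```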
The next table summarizes the results for the different elements of the template.

                      O     F    N    P1    P2
Date of the story     50    0    0    1.00  1.00
Location              45    5    0    0.90  0.90
Date                  49    1    0    0.98  0.98
Nb dead persons       20    5    25   0.80  0.90
Nb persons injured    26    9    15   0.74  0.82
Weapon                35    11   4    0.76  0.78
Average               37.5  5.2  7.3  0.86  0.89

We obtain a good quality for the extracted information in most of the elements:

- the date of the story is fully correct because we can use the structure of the article to extract it;
- the errors for the location slot are due to two "contradictory" locations found by the system; a more complete linguistic analysis, or a database providing lists of cities in different countries, would reduce this kind of errors;
- the errors in the number of dead or injured persons slots are frequently due to silence; our system, for example, fails on too complex syntactic forms like "Deux médecins italiens travaillant pour médecins sans frontières (MSF-Belgique) ont été blessés." (Two Italian doctors working for médecins sans frontières (MSF-Belgique) have been injured.), where the passive subject has not been correctly parsed by the system;
- the silence for the weapon slot is frequently due to the incompleteness of the semantic dictionaries.
5 Discussion

In this section, we comment on some of the results of this experiment. The results obtained prove the interest of coupling a semantic knowledge acquisition tool with the IE system. But those results are not precise enough to decide about the quality of the semantic knowledge acquisition tool itself. We will therefore examine some indicators which allow us to judge the quality of the semantic knowledge learned, and then we will present some comments on the information extraction.

5.1 Semantic Knowledge quality

Semantic knowledge acquisition tools like ASIUM are always very difficult to evaluate. Measuring the quality of an ontology, or evaluating an ontology against another one, is not easy and heavily depends on the applications. So, we will only present here some indicators that give an idea of the quality of the acquired knowledge.

Concept quality depends on two different elements. The first one is the distance which computes the similarity between classes in order to create relevant concepts and perform relevant inductions. As usual in conceptual clustering, the distance is a parameter of the concept quality and of the quantity of the expert's work. This first, qualitative element is very hard to estimate. In our application, 16 of the 19 first classes proposed by ASIUM have been accepted by the expert. 447 inductions have been proposed by ASIUM and 73% of these inductions have been judged relevant by the expert.

The second element which affects the concept quality is the level of generality of a concept. When ASIUM proposes a new concept, the expert has to decide from the generality of the concept whether it should be split or not. This work is easy for an expert because he has a very good knowledge of the final application. For example, if ASIUM proposes the "Organization" concept, the expert has to decide whether it is relevant for the task to identify sub-concepts like "Military org." and "Politic org.". The generality level in the application highly depends on the subtlety of the template to be filled by the information extraction system. Our previous experiments on this domain and on the cooking recipes domain have shown that this work is simple and that expert choices really depend on the task. (More explanations on the suitability of concepts for the main task and the unsuitability of these concepts for another task are given in [29].)

5.2 Comments on the extraction process

The results we obtained during this experiment compare satisfactorily with those that we obtained on the same corpus with the ECRAN system, which reached 0.89 precision. Moreover, the results of the new system were obtained after a reduced development phase: about 40 hours for the learning phase with ASIUM and about 15 hours to format the knowledge base as INTEX resources. The following comments can be made on this experiment:

- Having a good knowledge of the corpus is indubitably an advantage for the system designer. The fact that one of the authors had previously done the same task for ECRAN speeded up the development process, given that the search for relevant syntactic structures was facilitated.
- The results of the ASIUM system speed up the definition of the paradigmatic classes filling states in the INTEX transducers, even if certain classes need to be manually completed. For example, the ASIUM semantic classes allowed us to rapidly complete the graph representing the set of weapons or of persons implicated in terrorist events. ASIUM provided a class in which terms such as "bomb", "grenade", "explosive" or "car" could appear, considering that a booby-trapped car is a kind of weapon, etc.
- The description language provided by INTEX is richer than the one of ECRAN. The time spent to model the INTEX linguistic transducers was longer than the one spent for ECRAN, since the constraints and the empty transitions in automata and transducers have to be manually designed so that the noise is kept at a low level (the effort to manage empty transitions in graphs took about 5 hours, but allowed us to obtain a more efficient grammar than the one obtained by describing syntactic patterns with a set of regular expressions).

Such an evaluation, in which we deliberately limited the time spent on the development of linguistic resources, shows the importance of having accurate resources adapted to the task. Moreover, the inescapable incompleteness of the developed resources when facing new texts shows that this kind of system has to integrate dynamic acquisition processes to assist the incremental enrichment of resources as time goes by.

The experiment was intended to show the time needed for the development of a sufficient set of resources in order to obtain results equivalent to those of the ECRAN project. That is the reason why we emphasize an evaluation of the amount of time spent on the task rather than the improvement potential. That is also the reason why we focused on a limited template that only necessitates a surface analysis. This limitation could certainly be overcome if we used the knowledge acquired by ASIUM more extensively. Thus, we plan to take into account a deeper linguistic analysis (anaphora resolution, partial information merging, etc.).
6 Future work

Not all the knowledge learned by ASIUM is used in this experiment, especially the subcategorization frames. We showed that a surface analysis is sufficient when the templates to be filled are not more complex than those of ECRAN. The good quality obtained in a very short time supports this idea.

Nevertheless, in order to extract more specific information from texts (like the name of the organization that performs the terrorist event, the political membership of the victims or the attacker's nationality), we think that the use of subcategorization frames could be very useful. Writing syntactic rules in order to perform relevant information extraction becomes very hard because of the multiplicity of the syntactic variations used in texts.

Our current work is to create a cooperative acquisition system to learn resources using the subcategorization frames learned by ASIUM. The expert will be able to express rules using the complements of verbs independently of the syntax. Active and passive forms will be given the same representation by the system. For example, the two following sentences will be equivalent: L'action terroriste est revendiquée par le Front populaire de libération de la Palestine (FPLP) (The terrorist event was claimed by the FPLP) or le Front populaire de libération de la Palestine (FPLP) revendique l'action terroriste (The FPLP claimed responsibility for the terrorist event). One example of a rule for this kind of sentence can be:

If verb is "to claim" and object belongs to the class "Attack", then the subject is the attacker.

This kind of rule allows us to differentiate people claiming terrorist events, as in Un groupe terroriste libanais revendique l'attentat anti-sémite de Buenos-Aires (A Lebanese terrorist group claims the anti-semitic attack in Buenos-Aires), from an organization claiming a right, as in les fondamentalistes musulmans revendiquent le droit de vote (Muslim fundamentalists are claiming voting rights).

Semantic rules allow us to make fine distinctions in order to accurately fill fine-grained slots. The two next rules fill the field "Missile" or "Attacker" depending on the concept (Explosive or Person), learned by ASIUM, used as subject of the verb to kill:

If verb = "to kill" and subject = Person, then the subject is the attacker.
If verb = "to kill" and subject = Explosive, then the subject is the missile used.
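The sketch below shows how such conceptual rules could be applied once active and passive forms are normalized to the same (verb, role, filler) representation; the rule encoding, the concept lookup and the example lexicon are illustrative assumptions, not the system described here.

```python
# Sketch of the "conceptual rules" above, applied to a normalized frame in which
# active and passive forms are given the same (verb, role -> filler) representation.
# The rule encoding and the `concept_of` lookup are illustrative assumptions.
CONCEPTS = {"bombe": "Explosive", "FPLP": "Organization", "attentat": "Attack"}

def concept_of(noun):
    return CONCEPTS.get(noun, "Unknown")

RULES = [
    # (verb, role to test, concept required, slot to fill, role providing the value)
    ("to claim", "object", "Attack", "attacker", "subject"),
    ("to kill", "subject", "Person", "attacker", "subject"),
    ("to kill", "subject", "Explosive", "missile", "subject"),
]

def apply_rules(frame):
    """frame: {'verb': ..., 'subject': ..., 'object': ...} after normalization."""
    slots = {}
    for verb, role, concept, slot, source in RULES:
        if frame.get("verb") == verb and concept_of(frame.get(role)) == concept:
            slots[slot] = frame[source]
    return slots

# "le FPLP revendique l'action terroriste" and its passive counterpart both
# normalize to the same frame, so the same rule fires for both.
print(apply_rules({"verb": "to claim", "subject": "FPLP", "object": "attentat"}))
# {'attacker': 'FPLP'}
```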
We can see that, even if syntactic parsers generate errors and ambiguities, ASIUM can check the texts against the ontology and the subcategorization frames previously learned. The information extraction process will then operate only on sentences consistent with the subcategorization frames. This also allows some parsing errors to be detected.

The system we have in mind will carry out two different steps. First, we will use the syntax and the concepts learned by ASIUM to pre-fill the frame. Second, we will use our "conceptual rules" to fill the frame more specifically.

7 Conclusion

We have described in this article an experiment in which we coupled an information extraction system using INTEX with the machine learning system ASIUM. The development time of the linguistic resources of the information extraction system has been reduced by using the semantic knowledge learned by ASIUM. The quality of the results remains the same as in the European ECRAN project.

The aim of this experiment was to validate our approach. We will now explore a better integration of the two systems and examine how to better use the semantic knowledge learned by ASIUM in order to increase the quality of our results.

Acknowledgment

The research of Thierry Poibeau is partially funded by a CIFRE grant between the Laboratoire Central de Recherches of Thomson-CSF and the Laboratoire d'Informatique de l'Université de Paris-Nord. The authors want to acknowledge M. Rodde (Cristal-Gresec) and A. Balvet (Université Paris X) for their contribution during the analysis of the results.

REFERENCES

[1] M. T. Pazienza, ed., Information Extraction (A Multidisciplinary Approach to an Emerging Information Technology). Berlin: Springer Verlag (Lecture Notes in Computer Science), 1997.
[2] E. Riloff, "Automatically generating extraction patterns from untagged texts," in Proceedings of the 13th National Conference on Artificial Intelligence (AAAI'96), (Portland, Oregon), 1996.
[3] T. Poibeau, "Mixing technologies for Intelligent Information Extraction," in Proceedings of the Workshop on Intelligent Information Integration, 16th International Joint Conference on Artificial Intelligence, pp. 116–121, 1999.
[4] M. E. Califf, Relational Learning Techniques for Natural Language Information Extraction. PhD thesis, Department of Computer Sciences, University of Texas at Austin, February 1997.
[5] R. Basili and M. T. Pazienza, "Lexical Acquisition for Information Extraction," in Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology (M. T. Pazienza, ed.), (Frascati, Italy), LNAI Tutorial, Springer, July 1997.
[6] S. Buchholz, "Distinguishing Complements from Adjuncts using Memory-Based Learning," in Proceedings of the ESSLLI'98 Workshop on Automated Acquisition of Syntax and Parsing (B. Keller, ed.), pp. 41–48, 1998.
[7] D. Hindle, "Noun classification from predicate-argument structures," in Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL), Pittsburgh, PA, pp. 268–275, 1990.
[8] R. J. Mooney, A. C. Thompson, and R. L. Tang, "Learning to Parse Natural Language Database Queries into Logical Form," in Proceedings of the ML-97 Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, 1996.
[9] E. Riloff, "Automatically Constructing a Dictionary for Information Extraction Tasks," in Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 811–816, 1993.
[10] E. Riloff and J. Shepherd, "A Corpus-Based Approach for Building Semantic Lexicons," in Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2), 1997.
[11] G. Grefenstette, "Sextant: exploring unexplored contexts for semantic extraction from syntactic analysis," in Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL), (Newark, Delaware, USA), pp. 324–326, June 1992.
[12] P. Constant, "Reducing the complexity of encoding rule-based grammars," December 1996.
[13] Z. Harris, Mathematical Structures of Language. New York: Wiley, 1968.
[14] H. J. Peat and P. Willet, "The limitations of term co-occurrence data for query expansion in document retrieval systems," Journal of the American Society for Information Science, vol. 42, no. 5, pp. 378–383, 1991.
[15] R. Grishman and J. Sterling, "Generalizing Automatically Generated Selectional Patterns," in Proceedings of COLING'94, 15th International Conference on Computational Linguistics, (Kyoto, Japan), August 1994.
[16] D. Faure and C. Nédellec, "A Corpus-based Conceptual Clustering Method for Verb Frames and Ontology Acquisition," in LREC Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications (P. Velardi, ed.), (Granada, Spain), pp. 5–12, May 1998.
[17] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 183–190, 1993.
[18] M. R. Brent, "Automatic acquisition of subcategorization frames from untagged text," in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 209–214, 1991.
[19] C. A. Thompson, "Acquisition of a Lexicon from Semantic Representations of Sentences," in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), Boston, MA, pp. 335–337, 1995.
[20] J. M. Zelle and R. J. Mooney, "Learning semantic grammars with constructive inductive logic programming," in Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 817–822, 1993.
[21] "MUC-6," in Proceedings of the Sixth Message Understanding Conference (MUC-6), (San Francisco), Morgan Kaufmann, 1996.
[22] "MUC-7," in Proceedings of the Seventh Message Understanding Conference (MUC-7), (San Francisco), Morgan Kaufmann, 1998.
[23] T. Poibeau, "Extraction d'information : adaptation lexicale et calcul dynamique du sens," in Actes des rencontres internationales sur l'extraction, le filtrage et le résumé automatique (RIFRA'98), (Sfax, Tunisia), pp. 141–153, November 1998.
[24] L. Kosseim and G. Lapalme, "EXIBUM : un système expérimental d'extraction bilingue," in Actes des rencontres internationales sur l'extraction, le filtrage et le résumé automatique (RIFRA'98), (Sfax, Tunisia), pp. 129–140, November 1998.
[25] D. Maurel, Reconnaissance des séquences de mots par automate, adverbes de date du français. PhD thesis, Université Paris 7, 1989.
[26] C. Belleil, Reconnaissance, typage et traitement des coréférences des toponymes français et de leurs gentilés par dictionnaire électronique relationnel. PhD thesis, Université de Nantes, 1997.
[27] J. Sénellart, "Locating noun phrases with finite state transducers," in 17th International Conference on Computational Linguistics (COLING'98), (Montréal), pp. 1212–1217, 1998.
[28] R. Basili, R. Catizone, M. T. Pazienza, M. Stevenson, P. Velardi, M. Vindigni and Y. Wilks, "An empirical approach to Lexical Tuning," in Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, (Granada, Spain), May 1998.
[29] D. Faure, "Connaissances sémantiques acquises par Asium : exemples d'utilisations," in Journée du Réseau de sciences cognitives d'Ile-de-France (RISC, ed.), p. 12, October 1999.