=Paper=
{{Paper
|id=Vol-31/paper-4
|storemode=property
|title=SVETLAN' - A System to Classify Words in Context
|pdfUrl=https://ceur-ws.org/Vol-31/GChalendar_12.pdf
|volume=Vol-31
}}
==SVETLAN' - A System to Classify Words in Context==
SVETLAN' - A System to Classify Nouns in Context

Gaël de Chalendar (1) and Brigitte Grau (1, 2)
Abstract. Using semantic knowledge in NLP applications always improves their competence. Broad lexicons have been developed, but there are few resources for non-specialized domains which contain semantic information for words. In order to build such a base, we conceived a system, SVETLAN', able to learn categories of nouns from texts, whatever their domain. In order to avoid general classes that mix all the meanings of words, the classes are learned taking into account the contextual use of words.

(1) LIMSI/CNRS, BP 133, 91403 Orsay Cedex, France, email: {gael,grau}@limsi.fr
(2) IIE-CNAM, 18 allée J. Rostand, 91000 Evry, France

1 INTRODUCTION

Using semantic knowledge in NLP applications always improves their competence, as in Information Retrieval or Word Sense Disambiguation systems. Broad lexicons have been developed, but apart from WordNet [1] there are few existing resources that contain semantic information for words not specialized to very specific domains. Moreover, manual or automatic processes that build semantic categories of nouns usually lead to defining general categories. For example, words in WordNet are related to a Synset when they are synonymous; however, Synsets correspond to large categories, and there are shifts of meaning, so that when two words belonging to the same Synset are considered within a specific context, they often no longer share a common meaning. Automatic processes that extract knowledge from texts by statistical [2] or distributional [3], [4] approaches also build broad classes unless they are applied to specialized texts belonging to a very specific domain. On the other hand, we do not want to learn a general ontology, whatever the domain. As most words are polysemous, we claim that a semantic base has to deal with all the meanings of a word by associating each of them with its context of interpretation. Such semantic knowledge will allow information retrieval and Question/Answering systems, for example, to apply a deeper semantic analysis of texts, even on databases that contain texts about different domains, written in a general and common vocabulary, such as collections of newspaper articles.

In order to build such a base, we conceived a system, SVETLAN', able to learn categories of nouns in context from texts, whatever their domain. It is based on a distributional approach: nouns playing the same syntactic role with a verb in sentences related to the same topic, i.e. the same domain, are aggregated in the same class. SVETLAN' relies on knowledge about semantic domains automatically learned by SEGAPSITH [5].

2 OVERVIEW OF THE SYSTEM

The input data of SVETLAN' (see Fig. 1) are semantic domains together with the Thematic Units (TUs) that gave birth to them. Domains are sets of weighted words relevant for representing the same specific topic. They are automatically learned by aggregating similar thematic units, each made of a set of words. Each TU corresponds to a part of text that is homogeneous from a topic point of view and is delimited in a text by a topic segmentation process relying on lexical cohesion. Processed texts are newspaper articles that are pre-treated in order to retain only lemmatized content words.

[Figure 1. Schemata of Structured Domain learning: semantic domains, with their Thematic Units (TUs), form the input data; each TU points back to a text segment, from which a Syntactic Thematic Unit (STU) of <Verb - relation - Noun> triplets is extracted; the STUs of a domain are aggregated into a Structured Domain of <Verb - relation - Noun1, Noun2, ...> structures.]

The first step of SVETLAN' consists of retrieving the text segments of the original texts associated to the different TUs in order to parse their sentences. We then extract all the triplets constituted by
a verb, the head noun of a phrase and its syntactic role from the parser results in order to produce the Syntactic Thematic Units (STUs). The STUs belonging to the same semantic domain are aggregated to learn a Structured Domain. Aggregation groups the nouns playing the same syntactic role with a verb in order to form classes. As these aggregations are made within TUs belonging to the same domain, classes are context sensitive, which ensures a better homogeneity. A filtering step, based on the weights of the words in their domain, allows the system to eliminate nouns from classes when they are not very relevant in this context.

3 SEMANTIC DOMAIN LEARNING

We only give here a brief overview of the semantic domain learning module, which is described more precisely in [5]. This module incrementally builds topic representations, made of weighted words, from discourse segments delimited by SEGCOHLEX [6]. It works without any a priori classification or hand-coded pieces of knowledge. Processed texts are typically newspaper articles coming from Le Monde or AFP (Agence France Presse). They are pre-processed to keep only their lemmatized content words (adjectives, single or compound nouns, and verbs).

The topic segmentation implemented by SEGCOHLEX is based on a large collocation network, built from 24 months of the Le Monde newspaper, where a link between two words aims at capturing semantic and pragmatic relations between them. The strength of such a link is evaluated by the mutual information between its two words. The segmentation process relies on these links to compute a cohesion value for each position of a text. It assumes that a discourse segment is a part of text whose words refer to the same topic, that is, words that are strongly linked to each other in the collocation network and yield a high cohesion value. On the contrary, low cohesion values indicate topic shifts. After delimiting segments by an automatic analysis of the cohesion graph, only highly cohesive segments, named Thematic Units (TUs), are kept to learn topic representations. This segmentation method entails that a text is decomposed into small thematic units whose size is equivalent to a paragraph. Discourse segments, even related to the same topic, often develop different points of view. To enrich the particular description given by a text, we add to TUs those words of the collocation network that are particularly linked to the words found in the corresponding segment.

words                              occ  weight
examining judge                    58   0.501
police custody                     50   0.442
public property                    46   0.428
charging                           49   0.421
to imprison                        45   0.417
court of criminal appeal           47   0.412
receiving stolen goods             42   0.397
to presume                         45   0.382
criminal investigation department  42   0.381
fraud                              42   0.381

Figure 2. The most representative words of a domain about justice

Learning a complete description of a topic consists of merging all successive points of view, i.e. similar TUs, into a single memorized thematic unit, called a semantic domain. Each aggregation of a new TU increases the system's knowledge about one topic by reinforcing recurrent words and adding new ones. The weight of a word represents its importance relative to the topic and is computed from the number of occurrences of the word in the TUs. This method leads SEGAPSITH to learn specific topic representations, as opposed to [7], for example, whose method builds general topic descriptions such as economy, sport, etc.

We have applied the learning module of SEGAPSITH to one month (May 1994) of AFP newswires. Figure 2 shows an example of a domain about justice that gathers 69 TUs.

As some of these domains are close and refer to the same general topic, we have applied a hierarchical classification method based on their common words to organize them into separate general topics and to structure them. Figure 3 shows the hierarchies built about sport, police and stock exchange. Each leaf is a domain, named by its two most weighted words, while internal nodes are described by their name and their size, i.e. the number of common words found in their children.

[Figure 3. Three hierarchies of semantic domains: sport (leaves such as Pilot/Formula, Team_mate/Champion, To_beat/Finale, Tennis/Team_mate, Cycle_race/Stage), police (Police/Policeman, To_question/Arrest, Prison/Condemn) and stock exchange (Dollar/Billion, Money/Quarter, Rate/Rise).]

4 STRUCTURED DOMAIN LEARNING

As in [4], verbs allow us to categorize nouns. A class is defined by those nouns which play the same role relative to the same verb. In order to learn very homogeneous (3) classes, we only apply this principle to words belonging to the same context, i.e. a domain.

(3) We call homogeneous a class that contains words that denote the same concept in the corresponding domain.

4.1 Syntactic analysis

In order to find the verbs and their arguments in the texts, we use the syntactic analyzer Sylex [8], [9]. Figure 4 shows a small part of the results of Sylex for a sentence. The first part exhibits lexico-syntactic information for the words, and this for four different interpretations, pointed out by the string "taux 4" meaning an ambiguity rate of 4. This rate is due to the fact that Sylex cannot solve two ambiguities: the ambiguity of "laisse" between the verb "laisser" (to let) and the noun "laisse" (leash), and the ambiguity of "critique" between the verb "critiquer" (to criticize) and the noun "critique" (criticism). Note that Sylex does not consider the adjectival form, which is the right interpretation here. The second part shows the syntactic links found by Sylex. Between parentheses are references to the words in
the preceding analysis. Here Sylex has found four times the same interpretation in each of its possible analyses. In this case, we count one occurrence of the link. However, if it finds several times the same relation between a verb and different words, for example several possible subjects, then we keep all the different interpretations because we have no way to choose between them. We make the reasonable expectation that the false interpretations will have far fewer occurrences in the corpus and so will be filtered out during the rest of the processing.

******************** Phrase 193-466 ***********************************
"L'état de santé critique du pilote autrichien Karl Wendlinger (Sauber-Mercedes), victime d'un grave accident jeudi matin lors des premiers essais du Grand Prix de Monaco de Formule Un, laisse planer une menace sur le déroulement de la course, dimanche en Principauté."
******************** Partie 1 193-466 taux 4 **************************
<Lexico-Syntactic information>
193-195 (164) "L'" "le" [gs.1,avn,pdet.1] pdet : singulier elision dmaj
195-208 (165) "état de santé" "état de santé" [gs.1,nom.1] nom : masculin singulier mot_compose locsw
.......
382-388 (203) "laisse" "laisse" [gs.12,nom.1] nom : feminin singulier
389-395 (204) "planer" "planer" [gs.13,verbe] verbe : infinitif
.......
382-388 (211) "laisse" "laisser" [gs.13,verbe] verbe : singulier present indicatif subjonctif imperatif
389-395 (212) "planer" "planer" [gs.14,verbe] verbe : infinitif
.......
<Syntactic Links>
`L'état de santé critique' (164) ->- cn head ->- `du pilote autrichien' (170)
`planer' (204) ->- a2 head ->- `une menace' (205)
.......
`planer' (153) ->- a2 head ->- `une menace' (154)
.......
`planer' (161) ->- a2 head ->- `une menace' (162)
.......
`planer' (212) ->- a2 head ->- `une menace' (213)
`sur le déroulement' (66) ->- cn head ->- `de la course' (235)

Figure 4. An extract of a sentence analysis by Sylex

The results of Sylex are very detailed and not easy to parse directly with, say, Perl. Furthermore, we do not need all the information it extracts. In fact, we only need to find the verb with its links and the head nouns that are the arguments of these links. So, we have developed a formal grammar that extracts from these raw analyses the associations between a verb and its arguments. This grammar extracts links from the results of Sylex in the following format:

i#j verb # token1 # lemma1 # k rel # token2 # lemma2 # l

where i and j are the boundaries of the sentence that contains the link in the corpus; token1 and lemma1 are the token and the lemma of the verb respectively; rel is the syntactic relation, which can be "subject", "direct object" or a preposition ("to", "from", etc.); token2 and lemma2 are the token and the lemma of the head noun of the noun phrase pointed to by the relation; lastly, k and l are the indexes in the corpus of token1 and token2 respectively. Figure 5 shows some links that we have extracted from the results of Sylex.

token1     lemma1     rel      token2   lemma2
hang over  hang over  subject  threat   threat
play       play       object   cup      cup
hear       hear       of       sources  source

Figure 5. Examples of extracted links

Sylex, like other syntactic analyzers, has difficulties with some constructions and as a consequence introduces errors that can cause problems for the rest of the system. A common error is the misinterpretation of the passive form, which causes a subject to be analyzed as a direct object and, conversely, a direct object to be viewed as a subject. Another common problem is that Sylex often does not find any link in a phrase; that is what we will call silence. We will see in Section 5 that we can obtain good results despite these problems, thanks to the redundancy needed to validate the links in the next steps of the processing. But another consequence of this need for redundancy is that the system must use great quantities of text in order to create classes of a satisfactory size.

Having obtained the syntactic links in the texts, we want to group them according to the membership of their text segment in a Thematic Unit. So, we define a Syntactic Thematic Unit (STU) as a set of <Verb - syntactic relation - Noun> structures, i.e. syntactic relations instantiated with a verb and a noun. We will refer to these structures as Instantiated Syntactic Relations, or ISRs. We are able to relate the links extracted from the results of Sylex to the words contained in the domains because each domain in the thematic memory remembers which thematic units have been used to create it. In the same way, each thematic unit remembers the part of text it comes from.

4.2 Aggregation

In order to construct groups of words with very similar meanings, we want to group the nouns appearing with the same syntactic role in relation to a verb inside a domain. Then, a Structured Domain (SD) is a set of <Verb - syntactic relation - Noun1, ..., Nounn> structures, i.e. aggregated ISRs.

STUs related to the same domain are aggregated to form a Structured Domain. Aggregating an STU within an SD consists of:
- aggregating the ISRs that contain the same verb;
- adding new ISRs, i.e. adding new verbs with their arguments, made of a syntactic relation and the lemmatized form of a noun.

Figure 6 shows the aggregation of an SD and three ISRs. This example shows all the possible effects of the aggregation. In the figure, bold elements represent new or updated data. Aggregating an ISR into an SD that already contains the verb of the ISR increments the occurrence number of the verb, as for "to play" in the example. Similarly, the occurrence numbers of identical nouns related to the verb by the same relation are updated (as for "match"), and new relations with their associated nouns are added to the verb. In the example, the subject "champion" is added. An ISR with a new verb is simply added with an occurrence of 1, as for "to lose".
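In terms of data structures, the aggregation step just described can be sketched as follows. The dictionary layout, the function name, and the per-ISR counting convention are our own illustration, not the authors' implementation:

```python
from collections import defaultdict

def aggregate_isr(sd, verb, relation, noun):
    """Aggregate one Instantiated Syntactic Relation <verb - relation - noun>
    into a Structured Domain `sd`. A new verb is added with an occurrence of 1;
    a known verb has its occurrence incremented; and the noun count under the
    given relation is created or updated. (Counting one verb occurrence per
    ISR is our assumption; the paper does not fix the convention.)"""
    entry = sd.setdefault(
        verb, {"occ": 0, "rels": defaultdict(lambda: defaultdict(int))})
    entry["occ"] += 1
    entry["rels"][relation][noun] += 1

# The three ISR sources of Figure 6, aggregated into an empty domain:
sd = {}
for verb, rel, noun in [("to play", "subject", "champion"),
                        ("to play", "object", "match"),
                        ("to lose", "object", "championship")]:
    aggregate_isr(sd, verb, rel, noun)

assert sd["to play"]["occ"] == 2                      # known verb incremented
assert sd["to play"]["rels"]["object"]["match"] == 1  # noun stored under its relation
assert sd["to lose"]["occ"] == 1                      # new verb starts at 1
```

Replaying the ISRs of Figure 6 against an empty domain exercises the three effects at once: a known verb is incremented, a noun count is created or updated under its relation, and a new verb is added with an occurrence of 1.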
Syntactic Domain (source)

to play [4]   object   cup [3], match [1]
              with     ball [1]
to win [2]    subject  player [1]
              object   match [1]

Instantiated Syntactic Relations (sources)

to play   subject  champion
          object   match
to lose   object   championship

Syntactic Domain (result)

to play [6]   object   cup [3], match [2]
              with     ball [1]
              subject  champion [1]
to win [2]    subject  player [1]
              object   match [1]
to lose [1]   object   championship [1]

Figure 6. An example of the aggregation of three ISRs in an SD

Classes of nouns in the produced SDs contain many words that disturb their homogeneity. These words often belong to parts of the different TUs at the origin of the SD that are not closely related to the described topic. Either they result from an error of the topic segmentation process, or they correspond to a meaning of a verb scarcely used in the current context. Another possibility is that the ISR results from an error of Sylex. As these cases do not often recur for the same words in the same context, their nouns are weakly weighted in the corresponding domains. This characteristic gives us a means to filter the class content: each noun whose weight is lower than a threshold is removed from the class. By this selection, we reinforce the learning of classes of words according to their contextual use. Figure 7 shows two aggregated links, first obtained without filtering in its upper part, and their filtered counterparts in its lower part. The class associated to the verb "to establish" has been completely removed, as the weights of both "base" and "zone" are lower than the threshold, while the class related to the verb "to answer" with the "object" link has been reduced by removing "list". We can see on this example that this filtering is efficient: the verb "to establish" as well as the words "base" and "zone" are not very related to the domain of nuclear weapons from which this example is taken, and the usage of "to answer a list" has a very low probability. More details on the effects of the filtering process will be given in Section 5.

to establish  object  base, zone
to answer     object  document, question, list

to answer     object  document, question

Figure 7. Filtered aggregated links in a domain about nuclear weapons

In principle, the described operations are not very complicated. The difficulties come from the necessity to work with data coming from various tools. Furthermore, for performance and practical reasons, we do not apply the chain of tools text by text. The natural way to see the process would be to:
• read a text,
• extract the TUs from it,
• extract the corresponding STUs,
• add each TU to its domain,
• add each STU to its corresponding domain,
and, after the processing of all the texts, to filter the classes.

In fact, each computing step is done on the entire corpus and the results are then aligned. This allows us to save computation time, as we do not have to run each tool multiple times. However, we have to deal with dictionaries and indexes for various files and tools.

5 RESULTS

The experiments (4) we have conducted had as a goal to show that SVETLAN' learns classes of words which obviously belong to the same concept in the domain. To obtain such results, we have chosen to run our system on one month of AFP (Agence France Presse) wires, which form a corpus that is stylistically coherent but covers varied subjects with very polysemous and non-specific verbs.

(4) Please note that all the tests have been made in French, so the English examples that appear here are translations from French.

These wires are made of 4,500,000 words and 48,000 sentences in 6,000 texts. The thematic analysis gives 8,000 TUs aggregated into 2,000 domains. More details on these domains can be found in [5]. From these 48,000 sentences, 117,000 different Instantiated Syntactic Relations are extracted by Sylex. 24,000 of these links concern subjects, direct objects, or circumstantial complements introduced by a preposition, and are integrated in 1,531 Structured Domains.

After aggregating, but before filtering, the system obtains 431 aggregated links with two or more arguments, equivalent to 431 word classes. Some of them, such as <to manufacture - direct object - bomb, weapon>, are good. Nevertheless, other classes are heterogeneous, as <to return - direct object - territory, strip, context, synagogue> (here "strip" comes from the Gaza Strip), or clearly mix different meanings of a verb, like <to quit - direct object - base, government>, which mixes together the meanings "to leave a place" and "to retire from an institution". For the two latter cases, one can see the interest of taking into account the fact that the domains contain words with different weights representing their relevance to the domain: the higher the weight, the higher the relevance of the word in the domain. So we apply the aforesaid filter to our classes and retain only the nouns with weights higher than a threshold. The class <territory, strip, context, synagogue> is corrected to <territory, strip>, and <base, government> is removed.

Among the wrong classes, some are due to errors of Sylex, as <to confer - direct object - price, actor>, where "actor" should be linked to "to confer" by the preposition "to". The remaining ones are due to the extensive use of two different meanings of the verb in the same domain, as for <to conduct/to manage - direct object - delegation, negotiation> (in French: "conduire une négociation/une délégation"). This kind of error is inherent to the method we use and should be removed by other means. Note that the correctness of the links has been manually judged by ourselves. The precision measure used below is the ratio between the number of good classes and the total number of classes. We cannot define a recall measure because we have no way to know which classes we
miss. To our knowledge, there is no existing resource with associated classes that would allow us to formally judge the results.

We have tried two thresholds: 0.05 and 0.1. Figure 8 details the results for both.

Threshold  Total  Good      Sylex errors  Remaining errors
0.05       73     46 (63%)  13 (18%)      14 (19%)
0.1        38     27 (71%)   7 (18%)       4 (11%)

Figure 8. Results of the filtering for two thresholds

After filtering, a lot of classes are removed, but the remaining classes are well founded in most cases. An example of a class retained for both thresholds is:

<to injure - subject - colonist, soldier>

With a threshold set to 0.1 rather than to 0.05, we retain only 38 links, but we gain 8% in precision. If we ignore the errors due to Sylex, the real precision of SVETLAN' is 78% in the first case and 87% in the second. This is very good and shows the interest of choosing a good threshold.

Our experiments lead to homogeneous classes containing words denoting the same concept, though these classes contain few words. In order to directly see the interest of constructing and clustering classes of words guided by their belonging to a domain, it is interesting to see what kind of classes would be obtained by merging all the domains, that is to say, by creating context-free classes. So, we have applied the same aggregation principle to the same corpus, but without taking the domains into account. Just below, we show two classes for the verb "to replace"; the top one is made context-free and the bottom one is made inside a domain. This verb is very general: virtually everything can be replaced!

to replace  object  text, constitution, trousers, combustible, law, dinar, rod, film, circulation, judge, season, device, parliament, battalion, police, president, treaty

to replace  object  combustible, rod

The first group of words merges very different senses, while the second class, much smaller, is better because it contains words referring to very similar concepts: a rod of enriched uranium is nuclear combustible, thus the words "rod" and "combustible" actually denote the same concept in the nuclear domain. Another example is the following, for the verb "to attribute":

to attribute  object  talk, prize, decoration, pope, responsibility, television, attempt, letter, contract, ministry, jury, funds, authority, note, bonus, band, bombing

to attribute  object  prize, decoration

Obtaining meaningful classes with a corpus such as the AFP wires shows the efficiency of our method. Moreover, it is very good to obtain cohesive classes for very general and polysemous verbs.

At this time, the class sizes are small: they do not contain a lot of words. A way to enlarge them could be to regroup classes that are related to the same verb by the same syntactic relation in two domains belonging to the same hierarchy, i.e. a same more general context, assuming the words always have the same meaning. However, this method has to be tested on more results in order to prove its reliability. With our results, we would build for example (law, constitution, article, disposition) in the domain of "Law" and (rebel, force, northerner, leader) in the domain of "conflict".

SVETLAN', in collaboration with SEGAPSITH, allows an automatic learning of structured semantic domains. Instead of just having sets of weighted words describing semantic domains, the domains are described by a set of verbs related to classes of words by a syntactic link. Besides, we can also view this base as semantic classes, each one being related to its context of interpretation.

As SVETLAN' works with very specific domains, it builds small classes. In order to generalize them, we could apply a process analogous to ASIUM, which merges classes independently of the related verbs according to a similarity measure, even if, in our case, this generalization process would operate within a same general domain. Afterwards, ASIUM asks an expert to validate its results.

Words are often polysemous or ambiguous. However, when used in context, they only denote one meaning, and moreover this meaning is generally the same in different occurrences of a same context. When building classes of nouns according to their contextual use, we avoid mixing all the meanings of a word, either for the verbs or for the nouns. Such a result can be exhibited in the classes (law, constitution) and (law, article, disposition) in the juridical context, where the words "article", "constitution" and "disposition" do not attract synonyms of their other meanings, such as "section", "composition" and "aptitude".

6 RELATED WORKS

There is a lot of work dedicated to the formation of classes of words. These classes have very various statuses: they can contain words belonging to the same semantic field, or near synonyms.

WordNet [1] is a lexical database made by lexicographers. It aims at representing the sense of the bigger part of the lexicon. It is composed of Synsets; a Synset is a set of words that are synonymous, and these Synsets are linked by IS-A relations. Its coverage is large, but this is, in a sense, a shortcoming, as its classes are too large and do not refer to precise meanings. Indeed, the generality of its contents makes it difficult to use in real-sized applications, which are often centered on a domain. It can rarely be used without a lot of manual adaptation.

IMToolset, by Uri Zernik [2], extracts, for a word, several clusters of words from text. Each of these clusters reflects a different meaning of the studied word. This extraction is done by scanning the local contexts of the word, the 10 words surrounding it in the texts. These signatures are statistically analyzed and clustered. The result is groups of words that are similar to our domains but more focused on the sense of a single word.

We have already stressed some characteristics of ASIUM by D. Faure and C. Nedellec [4], and we give here some more details. ASIUM learns subcategorization frames of verbs and ontologies from text using syntactic analysis and a conceptual clustering algorithm. It analyses texts with Sylex and creates basic clusters of words appearing with the same verb and the same syntactic role or preposition, as does SVETLAN'. These basic classes are then clustered to create an ontology by means of a cooperative learning algorithm. The main difference with SVETLAN' is this cooperative generalization part: ASIUM depends on the expert, who has to validate, and possibly to split, the clusters made by the algorithm. This approach is justified for specialized technical texts, but ASIUM, applied to texts such as AFP wires, would certainly not be able to extract good basic classes. Furthermore, as each word does not occur a lot in these texts, the distance is not appropriate for the grouping of our classes. On the contrary, on technical texts and with the cooperation of an expert, ASIUM will certainly obtain better results than ours from the point of view of domain coverage.
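Before concluding, the weight-based filtering of Section 4 and the precision measure of Section 5 can be sketched as follows. The weight values are invented for illustration (only their order relative to the threshold matters), and dropping a class once no noun survives is our reading of the Figure 7 example, not the authors' code:

```python
def filter_classes(classes, weights, threshold):
    """Remove from each class the nouns whose weight in the domain is below
    the threshold; a class is dropped entirely when no noun survives.
    `classes` maps (verb, relation) to its list of nouns; `weights` maps a
    noun to its weight in the domain (hypothetical values below)."""
    filtered = {}
    for link, nouns in classes.items():
        kept = [n for n in nouns if weights.get(n, 0.0) >= threshold]
        if kept:
            filtered[link] = kept
    return filtered

def precision(good_classes, total_classes):
    # Ratio between the number of good classes and the total number of classes.
    return good_classes / total_classes

# Figure 7's example, with hypothetical weights:
weights = {"document": 0.30, "question": 0.25, "list": 0.02,
           "base": 0.03, "zone": 0.04}
classes = {("to establish", "object"): ["base", "zone"],
           ("to answer", "object"): ["document", "question", "list"]}
assert filter_classes(classes, weights, threshold=0.05) == \
    {("to answer", "object"): ["document", "question"]}
# Figure 8, threshold 0.05: 46 good classes out of 73 gives 63% precision.
assert round(100 * precision(46, 73)) == 63
```

With these weights, "to establish" loses both of its nouns and disappears, while "to answer" only loses "list", reproducing the behaviour described for Figure 7.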
7 CONCLUSION

The system SVETLAN' we propose, in conjunction with SEGAPSITH and the syntactic parser Sylex, extracts classes of words from raw text. These classes are created by gathering the nouns appearing with the same syntactic role after the same verb inside a context. This context is made by the aggregation of texts about similar subjects. The first experiments carried out give good results. But they also confirm that a great volume of data is necessary in order to extract a large quantity of lexical knowledge by the analysis of syntactic distributions. Moreover, the very low recall of the syntactic parser and its systematic errors on some constructions, for example the passive form, which is very common in the journalistic style of our corpus, reduce the number and the size of the classes. To solve this problem, we envisage trying another analyzer, or adding a post-processing step to Sylex that detects the passive form by using data already present in its output. These adaptations and the study of larger corpora will allow us to obtain a good coverage of numerous semantic domains. We will then be able to provide valuable semantic data useful in many applications, such as information retrieval systems or word sense disambiguation systems.
REFERENCES

[1] Christiane Fellbaum (ed.), WordNet: An Electronic Lexical Database, The MIT Press, 1998.
[2] Uri Zernik, TRAIN1 vs. TRAIN2: Tagging Word Senses in Corpus, RIAO'91, 1991.
[3] Gregory Grefenstette, Explorations in Automatic Thesaurus Discovery, Kluwer Academic Publishers, Boston, 1994.
[4] David Faure and Claire Nedellec, ASIUM: Learning subcategorization frames and restrictions of selection. In Y. Kodratoff (ed.), Proceedings of the 10th ECML Workshop on Text Mining, 1998.
[5] Olivier Ferret and Brigitte Grau, A Thematic Segmentation Procedure for Extracting Semantic Domains from Texts, Proceedings of ECAI'98, Brighton, 1998.
[6] Olivier Ferret, How to thematically segment texts by using lexical cohesion?, Proceedings of ACL-COLING'98 (student session), pp. 1481-1483, Montreal, Canada, 1998.
[7] C.-Y. Lin, Robust Automated Topic Identification, Doctoral Dissertation, University of Southern California, 1997.
[8] Patrick Constant, Analyse Syntaxique Par Couches, Ph.D. thesis, École Nationale Supérieure des Télécommunications, April 1991.
[9] Patrick Constant, L'analyseur linguistique SYLEX, 5ème école d'été du CNET, 1995.