=Paper= {{Paper |id=Vol-31/paper-4 |storemode=property |title=SVETLAN' - A System to Classify Words in Context |pdfUrl=https://ceur-ws.org/Vol-31/GChalendar_12.pdf |volume=Vol-31 }} ==SVETLAN' - A System to Classify Words in Context== https://ceur-ws.org/Vol-31/GChalendar_12.pdf
SVETLAN'
A System to Classify Nouns in Context

Gaël de Chalendar(1) and Brigitte Grau(1,2)

Abstract. Using semantic knowledge in NLP applications always improves their competence. Broad lexicons have been developed, but there are few resources for non-specialized domains that contain semantic information for words. In order to build such a base, we conceived a system, SVETLAN', able to learn categories of nouns from texts, whatever their domain. In order to avoid general classes that mix all the meanings of words, the classes are learned taking into account the contextual use of words.

1 LIMSI/CNRS, BP 133, 91 403 Orsay Cédex, France, email: {gael,grau}@limsi.fr
2 IIE-CNAM, 18 allée J. Rostand, 91 000 Evry, France

1 INTRODUCTION

Using semantic knowledge in NLP applications always improves their competence, as in Information Retrieval or Word Sense Disambiguation systems. Broad lexicons have been developed, but apart from WordNet [1] there are few existing resources that contain semantic information for words not specialized to very specific domains. Moreover, manual or automatic processes that build semantic categories of nouns usually lead to defining general categories. For example, words in WordNet are related to a Synset when they are synonymous; however, Synsets correspond to large categories, and there are shifts of meaning, so that when two words belonging to a same Synset are considered within a specific context, they often no longer share a common meaning. Automatic processes that extract knowledge from texts by statistical [2] or distributional [3], [4] approaches also lead to broad classes if they are not applied to specialized texts belonging to a very specific domain. On the other hand, we do not want to learn a general ontology, whatever the domain. As most words are polysemous, we claim that a semantic base has to deal with all the meanings of a word by associating them with their context of interpretation. Such semantic knowledge will allow information retrieval and question answering systems, for example, to apply deeper semantic analysis of texts, even on databases that contain non-technical texts on different domains, written in a general and common vocabulary, such as bases of newspaper articles.

In order to build such a base, we conceived a system, SVETLAN', able to learn categories of nouns in context from texts, whatever their domain. It is based on a distributional approach: nouns playing the same syntactic role with a verb in sentences related to the same topic, i.e. the same domain, are aggregated in the same class. SVETLAN' relies on knowledge about semantic domains automatically learned by SEGAPSITH [5].

2 OVERVIEW OF THE SYSTEM

The input data of SVETLAN' (see Fig. 1) are semantic domains together with the Thematic Units (TUs) that have given birth to them. Domains are sets of weighted words relevant to represent a same specific topic. They are automatically learned by aggregating similar thematic units, made of sets of words. Each TU corresponds to a part of text that is homogeneous from a topic point of view and is delimited from a text by a topic segmentation process relying on lexical cohesion. Processed texts are newspaper articles that are pre-treated in order to retain only lemmatized content words.

[Figure 1. Schemata of Structured Domain learning: a diagram showing a Domain with its TUs as input data, Text Segments with their STUs made of <Verb – relation – Noun> triplets, and the resulting Structured Domain made of <Verb – relation – N1, N2, …> entries.]

The first step of SVETLAN' consists of retrieving the text segments of the original texts associated with the different TUs in order to parse their sentences.
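The data flow just described can be made concrete with minimal data structures. The following Python sketch is illustrative only: the class names and the weight-update rule (simple cumulated occurrence counts) are our assumptions, not the actual SEGAPSITH implementation, whose weighting scheme is described in [5].

```python
from dataclasses import dataclass, field

@dataclass
class ThematicUnit:
    """A topically homogeneous text segment (hypothetical structure)."""
    segment_id: str                           # reference back to the source text span
    words: set = field(default_factory=set)   # lemmatized content words

@dataclass
class Domain:
    """A semantic domain: weighted words plus the TUs that built it."""
    weights: dict = field(default_factory=dict)  # word -> relevance weight
    tus: list = field(default_factory=list)      # aggregated ThematicUnits

    def aggregate(self, tu: ThematicUnit, occurrences: dict) -> None:
        """Add a similar TU: reinforce recurrent words, add new ones.
        Here the weight is simply the cumulated occurrence count."""
        self.tus.append(tu)
        for word in tu.words:
            self.weights[word] = self.weights.get(word, 0.0) + occurrences.get(word, 1)
```

Keeping the list of source TUs inside each domain is what later lets the system trace a domain back to its text segments.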
We then extract all the triplets constituted by a verb, the head noun of a phrase and its syntactic role from the parser results, in order to produce the Syntactic Thematic Units (STUs). The STUs belonging to a same semantic domain are aggregated altogether to learn a Structured Domain. Aggregation groups the nouns playing the same syntactic role with a verb, in order to form classes. As these aggregations are made within TUs belonging to a same domain, the classes are context sensitive, which ensures a better homogeneity. A filtering step, based on the weights of the words in their domain, allows the system to eliminate nouns from classes when they are not very relevant in this context.

3 SEMANTIC DOMAIN LEARNING

We give here only a brief overview of the semantic domain learning module; it is described more precisely in [5]. This module incrementally builds topic representations, made of weighted words, from discourse segments delimited by SEGCOHLEX [6]. It works without any a priori classification or hand-coded pieces of knowledge. Processed texts are typically newspaper articles coming from Le Monde or AFP (Agence France Presse). They are pre-processed to keep only their lemmatized content words (adjectives, single or compound nouns, and verbs).

The topic segmentation implemented by SEGCOHLEX is based on a large collocation network, built from 24 months of the Le Monde newspaper, where a link between two words aims at capturing semantic and pragmatic relations between them. The strength of such a link is evaluated by the mutual information between its two words. The segmentation process relies on these links to compute a cohesion value at each position of a text. It assumes that a discourse segment is a part of text whose words refer to the same topic, that is, whose words are strongly linked to each other in the collocation network and yield a high cohesion value. On the contrary, low cohesion values indicate topic shifts. After delimiting segments by an automatic analysis of the cohesion graph, only highly cohesive segments, named Thematic Units (TUs), are kept to learn topic representations. This segmentation method decomposes a text into small thematic units whose size is equivalent to a paragraph. Discourse segments, even when related to the same topic, often develop different points of view. To enrich the particular description given by a text, we add to TUs those words of the collocation network that are particularly linked to the words found in the corresponding segment.

    words                                 occ    weight
    examining judge                        58    0.501
    police custody                         50    0.442
    public property                        46    0.428
    charging                               49    0.421
    to imprison                            45    0.417
    court of criminal appeal               47    0.412
    receiving stolen goods                 42    0.397
    to presume                             45    0.382
    criminal investigation department      42    0.381
    fraud                                  42    0.381

Figure 2. The most representative words of a domain about justice

Learning a complete description of a topic consists of merging all successive points of view, i.e. similar TUs, into a single memorized thematic unit, called a semantic domain. Each aggregation of a new TU increases the system's knowledge about one topic by reinforcing recurrent words and adding new ones. The weights on words represent the importance of each word relative to the topic and are computed from the number of occurrences of these words in the TUs. This method leads SEGAPSITH to learn specific topic representations, as opposed to [7], for example, whose method builds general topic descriptions such as economy, sport, etc.

We have applied the learning module of SEGAPSITH to one month (May 1994) of AFP newswires. Figure 2 shows an example of a domain about justice that gathers 69 TUs.

As some of these domains are close and refer to the same general topic, we have applied a hierarchical classification method based on their common words to organize them into separate general topics and to structure them. Figure 3 shows the hierarchies built about sport, police and stock exchange. Each leaf is a domain, named by its two most weighted words, while internal nodes are described by their name and their size, i.e. the number of common words found in their children.

[Figure 3. Three hierarchies of semantic domains: sport (Team_mate/Championship 11, Tennis/Team_mate 50, with leaves Pilot/Formula, To_beat/Finale, Cycle_race/Stage); police (To_question/Arrest 6, with leaves Police/Policeman, Prison/Condemn); stock exchange (Money/Quarter 27, with leaves Dollar/Billion, Rate/Rise).]

4 STRUCTURED DOMAIN LEARNING

As in [4], verbs allow us to categorize nouns. A class is defined by those nouns which play a same role relative to a same verb. In order to learn very homogeneous(3) classes, we apply this principle only to words belonging to a same context, i.e. a domain.

3 We call homogeneous a class that contains words that denote a same concept in the corresponding domain.

4.1 Syntactic analysis

In order to find the verbs and their arguments in the texts, we use the syntactic analyzer Sylex [8], [9]. Figure 4 shows a small part of the results of Sylex for a sentence. The first part exhibits lexico-syntactic information for the words, and this for four different interpretations, pointed out by the string "taux 4", meaning an ambiguity rate of 4. This rate is due to the fact that Sylex cannot solve two ambiguities: the ambiguity of "laisse" between the verb "laisser" (to let) and the noun "laisse" (leash), and the ambiguity of "critique" between the verb ("to criticize") and the noun ("criticism"). Note that Sylex does not consider the adjectival form, which is the right interpretation here. The second part shows the syntactic links found by Sylex. Between parentheses are references to the words in the preceding analysis.
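The collocation network and its mutual-information link strengths can be sketched as follows. This is a simplified illustration: co-occurrence is counted at the text level rather than over the actual 24-month corpus, and the exact mutual-information estimate used by SEGCOHLEX may differ.

```python
import math
from collections import Counter
from itertools import combinations

def build_collocation_network(texts):
    """Score every co-occurring word pair by pointwise mutual information.

    `texts` is a list of tokenized, lemmatized texts. Each pair of words
    that co-occurs in at least one text receives a link whose strength is
    its PMI; pairs that never co-occur get no link at all."""
    word_counts = Counter()
    pair_counts = Counter()
    n_texts = len(texts)
    for tokens in texts:
        vocab = set(tokens)
        word_counts.update(vocab)
        pair_counts.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    network = {}
    for pair, n_ab in pair_counts.items():
        a, b = tuple(pair)
        # PMI = log2( P(a,b) / (P(a) * P(b)) ): positive when two words
        # co-occur more often than chance, the basis of a cohesion link.
        p_ab = n_ab / n_texts
        p_a = word_counts[a] / n_texts
        p_b = word_counts[b] / n_texts
        network[pair] = math.log2(p_ab / (p_a * p_b))
    return network
```

A segmenter can then sum the link strengths between each position's word and its neighbours to obtain the cohesion curve whose valleys mark topic shifts.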
Here Sylex has found four times the same interpretation in each of its possible analyses. In this case, we count one occurrence of the link. However, if it finds several times the same relation between a verb and different words, for example several possible subjects, then we keep all the different interpretations, because we have no way to choose between them. We make the reasonable assumption that the false interpretations will have far fewer occurrences in the corpus and so will be filtered out during the rest of the processing.

******************** Phrase 193-466 ***********************************
"L'état de santé critique du pilote autrichien Karl Wendlinger (Sauber-Mercedes), victime d'un grave accident jeudi matin lors des premiers essais du Grand Prix de Monaco de Formule Un, laisse planer une menace sur le déroulement de la course, dimanche en Principauté."
******************** Partie 1 193-466 taux 4 **************************
"L'état de santé critique du pilote autrichien Karl Wendlinger (Sauber-Mercedes), victime d'un grave accident jeudi matin lors des premiers essais du Grand Prix de Monaco de Formule Un, laisse planer une menace sur le déroulement de la course, dimanche en Principauté."
<Lexico-Syntactic information>
  193-195 (164) "L'" "le" [gs.1,avn,pdet.1] pdet : singulier elision dmaj
  195-208 (165) "état de santé" "état de santé" [gs.1,nom.1] nom : masculin singulier mot_compose locsw
  .......
  382-388 (203) "laisse" "laisse" [gs.12,nom.1] nom : feminin singulier
  389-395 (204) "planer" "planer" [gs.13,verbe] verbe : infinitif
  .......
  193-195 (16) "L'" "le" [gs.1,avn,pdet.1] pdet : singulier elision dmaj
  195-208 (117) "état de santé" "état de santé" [gs.1,nom.1] nom : masculin singulier mot_compose locsw
  .......
  382-388 (211) "laisse" "laisser" [gs.13,verbe] verbe : singulier present indicatif subjonctif imperatif
  389-395 (212) "planer" "planer" [gs.14,verbe] verbe : infinitif
  .......
<Syntactic Links>
`L'état de santé critique' (164) ->- cn head ->- `du pilote autrichien' (170)
`planer' (204) ->- a2 head ->- `une menace' (205)
.......
`planer' (153) ->- a2 head ->- `une menace' (154)
.......
`planer' (161) ->- a2 head ->- `une menace' (162)
.......
`planer' (212) ->- a2 head ->- `une menace' (213)
`sur le déroulement' (66) ->- cn head ->- `de la course' (235)

Figure 4. An extract of a sentence analysis by Sylex.

The results of Sylex are very detailed and not easy to parse directly with, say, Perl. Furthermore, we do not need all the information it extracts: in fact, we only need the verb with its links and the head nouns that are arguments of these links. So, we have developed a formal grammar that extracts from these raw analyses the associations between a verb and its arguments. This grammar extracts links from the results of Sylex in the following format:

    i#j verb # token1 # lemma1 # k rel # token2 # lemma2 # l

where i and j are the boundaries of the sentence that contains the link in the corpus; token1 and lemma1 are the token and the lemma of the verb respectively; rel is the syntactic relation, which can be "subject", "direct object" or a preposition ("to", "from", etc.); token2 and lemma2 are the token and the lemma of the head noun of the noun phrase pointed to by the relation; lastly, k and l are the indexes in the corpus of token1 and token2 respectively. Figure 5 shows some links that we have extracted from the results of Sylex.

    token1       lemma1       rel        token2     lemma2
    hang over    hang over    subject    threat     threat
    play         play         object     cup        cup
    hear         hear         of         sources    source

Figure 5. Examples of extracted links

Sylex, like other syntactic analyzers, has difficulties with some constructions and as a consequence introduces errors that can cause problems for the rest of the system. A common error is the misinterpretation of the passive form, which causes a subject to be analyzed as a direct object and, conversely, a direct object to be viewed as a subject. Another common error is that Sylex often does not find any link in a phrase; this is what we call silence. We will see in Section 5 that we can obtain good results despite these problems, thanks to the redundancy required to validate the links in the next steps of the processing. But another consequence of this need for redundancy is that the system must use great quantities of text in order to create classes of a satisfactory size.

Having obtained the syntactic links in the texts, we want to group them according to the membership of their text segment in a Thematic Unit. So, we define a Syntactic Thematic Unit (STU) as a set of <Verb – syntactic relation – Noun> structures, i.e. syntactic relations instantiated with a verb and a noun. We will refer to these structures as Instantiated Syntactic Relations, or ISRs. We are able to relate the links extracted from the results of Sylex to the words contained in the domains, because each domain in the thematic memory remembers which thematic units have been used to create it. In the same way, each thematic unit remembers the part of text it comes from.

4.2 Aggregation

In order to construct groups of words with very similar meanings, we want to group the nouns appearing with the same syntactic role in relation to a verb inside a domain. A Structured Domain (SD) is then a set of <Verb – syntactic relation – Noun1 … Nounn> structures, i.e. aggregated ISRs.

STUs related to a same domain are aggregated altogether to form a Structured Domain. Aggregating a STU within a SD consists of:
- aggregating their ISRs that contain a same verb;
- adding new ISRs, i.e. adding new verbs with their arguments, made of a syntactic relation and the lemmatized form of a noun.

Figure 6 shows the aggregation of a SD and three ISRs. This example shows all the possible effects of the aggregation; in the figure, bold elements represent new or updated data. Aggregating an ISR in a SD that already contains the verb of the ISR increments the occurrence number of the verb, as for play in the example. Similarly, the occurrence numbers of identical nouns related to the verb by the same relation are updated (as for match), and new relations with their associated nouns are added to the verb; in the example, the subject champion is added. An ISR with a new verb is simply added with an occurrence number of 1, as for to lose.
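A line in the extraction format above can be parsed mechanically. The following sketch assumes '#'-separated fields laid out exactly as shown; the real delimiters and spacing produced by the formal grammar may differ.

```python
import re
from typing import NamedTuple, Optional

class Link(NamedTuple):
    start: int          # i: sentence start boundary in the corpus
    end: int            # j: sentence end boundary
    verb_token: str
    verb_lemma: str
    verb_index: int     # k: corpus index of the verb token
    rel: str            # "subject", "direct object", or a preposition
    noun_token: str
    noun_lemma: str
    noun_index: int     # l: corpus index of the noun token

# One '#'-separated line per link, as in:
#   i#j verb # token1 # lemma1 # k rel # token2 # lemma2 # l
LINE = re.compile(
    r"(\d+)#(\d+)\s+verb\s*#\s*(.+?)\s*#\s*(.+?)\s*#\s*(\d+)\s+"
    r"(.+?)\s*#\s*(.+?)\s*#\s*(.+?)\s*#\s*(\d+)$"
)

def parse_link(line: str) -> Optional[Link]:
    """Parse one extracted link, returning None for malformed lines."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    i, j, vtok, vlem, k, rel, ntok, nlem, l = m.groups()
    return Link(int(i), int(j), vtok, vlem, int(k), rel, ntok, nlem, int(l))
```

Non-greedy groups let multi-word relations such as "direct object" survive the '#'-splitting intact.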
    Syntactic Domain (source):
    to play [4]     object     cup [3], match [1]
                    with       ball [1]
    to win [2]      subject    player [1]
                    object     match [1]

    Instantiated Syntactic Relations (sources):
    to play         subject    champion
                    object     match
    to lose         object     championship

    Syntactic Domain (result):
    to play [5]     object     cup [3], match [2]
                    with       ball [1]
                    subject    champion [1]
    to win [2]      subject    player [1]
                    object     match [1]
    to lose [1]     object     championship [1]

Figure 6. An example of the aggregation of three ISRs in a SD

The classes of nouns in the produced SDs contain many words that disturb their homogeneity. These words often belong to parts of the different TUs at the origin of the SD that are not closely related to the described topic: either they result from an error of the topic segmentation process, or they correspond to a meaning of a verb scarcely used in the current context. Another possibility is that the ISR results from an error of Sylex. As these cases do not often recur for the same words in the same context, their nouns are weakly weighted in the corresponding domains. This characteristic gives us a means to filter the class content: each noun whose weight is lower than a threshold is removed from the class. By this selection, we reinforce the learning of classes of words according to their contextual use. Figure 7 shows two aggregated links, first obtained without filtering in its upper part, and their filtered counterparts in its lower part. The class associated with the verb 'to establish' has been completely removed, as the weights of both 'base' and 'zone' are lower than the threshold, while the class related to the verb 'to answer' with the 'object' link has been reduced by removing 'list'. We can see in this example that this filtering is efficient: the verb 'to establish' as well as the words 'base' and 'zone' are not very related to the domain of 'nuclear weapons' from which this example is taken, and the usage of 'to answer a list' has a very low probability. More details on the effects of the filtering process will be given in Section 5.

    to establish    object    base, zone
    to answer       object    document, question, list

    to answer       object    document, question

Figure 7. Filtered aggregated links in a domain about nuclear weapons

In principle, the described operations are not very complicated. The difficulty comes from the necessity to work with data coming from various tools. Furthermore, for performance and practical reasons, we do not apply the chain of tools text by text. The natural way to see the process would be to:
• read a text,
• extract the TUs from it,
• extract the corresponding STUs,
• add each TU to its domain,
• add each STU to its corresponding domain,
and, after the processing of all the texts, to filter the classes.

In fact, each computing step is done on the entire corpus and the results are then aligned. This allows us to save computation time, as we do not have to run each tool multiple times. However, we have to deal with dictionaries and indexes for the various files and tools.

5 RESULTS

The goal of the experiments(4) we have conducted was to show that SVETLAN' learns classes of words that clearly belong to the same concept in the domain. To obtain such results, we chose to run our system on one month of AFP (Agence France Presse) wires, which form a corpus that is stylistically coherent but covers varied subjects, with very polysemous and non-specific verbs.

4 Please note that all the tests have been made in French, so the English examples that appear here are translations from French.

These wires are made of 4,500,000 words and 48,000 sentences in 6,000 texts. The thematic analysis gives 8,000 TUs aggregated into 2,000 domains; more details on these domains can be found in [5]. From these 48,000 sentences, 117,000 different Instantiated Syntactic Relations are extracted by Sylex. 24,000 of these links concern subjects, direct objects, or circumstantial complements introduced by a preposition, and are integrated into 1,531 Structured Domains.

After aggregating, but before filtering, the system obtains 431 aggregated links with two or more arguments, equivalent to 431 word classes. Some of them, such as <to manufacture – direct object – bomb, weapon>, are good. Nevertheless, other classes are heterogeneous, as <to return – direct object – territory, strip, context, synagogue> (here strip comes from the Gaza Strip), or clearly mix different meanings of a verb, like <to quit – direct object – base, government>, which mixes together the meanings "to leave a place" and "to retire from an institution". For the two latter cases, one can see the interest of taking into account the fact that the domains contain words with different weights representing their relevance to the domain: the higher the weight, the higher the relevance of the word in the domain. So we apply the aforesaid filter to our classes and retain only the nouns with weights higher than a threshold. The class <territory, strip, context, synagogue> is corrected to <territory, strip>, and <base, government> is removed.

Among the wrong classes, some are due to errors of Sylex, as <to confer – direct object – prize, actor>, where actor should be linked to to confer by the preposition to. The remaining ones are due to the extensive use of two different meanings of a verb in the same domain, as for <to conduct/to manage – direct object – delegation, negotiation> (in French: "conduire une négociation/une délégation"). This kind of error is inherent to the method we use and should be removed by other means. Note that the correctness of the links has been manually judged by ourselves. The precision measure used below is the ratio between the number of good classes and the total number of classes. We cannot define a recall measure, because we have no way to know which classes we miss. To our knowledge, there is no existing resource with associated classes that would allow us to formally judge the results.
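The aggregation of Section 4.2 and the weight filtering applied above can be sketched together. This is a simplification of the described process: the verb occurrence count is incremented once per ISR triple, and the domain weights are supplied externally (as learned by SEGAPSITH).

```python
def aggregate(sd, isrs):
    """Aggregate ISR triples (verb, relation, noun) into a Structured
    Domain, counting occurrences as in Figure 6.
    sd: {verb: {"count": int, "args": {relation: {noun: count}}}}"""
    for verb, rel, noun in isrs:
        entry = sd.setdefault(verb, {"count": 0, "args": {}})
        entry["count"] += 1                      # simplification: one per triple
        rel_nouns = entry["args"].setdefault(rel, {})
        rel_nouns[noun] = rel_nouns.get(noun, 0) + 1
    return sd

def filter_sd(sd, domain_weights, threshold):
    """Remove nouns whose weight in the domain is below the threshold,
    then drop any relations and verbs left without arguments."""
    filtered = {}
    for verb, entry in sd.items():
        args = {}
        for rel, nouns in entry["args"].items():
            kept = {n: c for n, c in nouns.items()
                    if domain_weights.get(n, 0.0) >= threshold}
            if kept:
                args[rel] = kept
        if args:
            filtered[verb] = {"count": entry["count"], "args": args}
    return filtered
```

Running `filter_sd` with the two thresholds below reproduces the pruning behaviour of Figure 7: weakly weighted nouns disappear, and emptied classes are removed outright.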
We have tried two thresholds: 0.05 and 0.1. Figure 8 details the results for both.

    Threshold    Total    Good        Sylex errors    Remaining errors
    0.05         73       46 (63%)    13 (18%)        14 (19%)
    0.1          38       27 (71%)     7 (18%)         4 (11%)

Figure 8. Results of the filtering for two thresholds

After filtering, many classes are removed, but the remaining classes are well founded in most cases. An example of a class retained for both thresholds is:

    <to injure – subject – colonist, soldier>

With a threshold set to 0.1 rather than 0.05, we retain only 38 links, but we gain 8% in precision. If we ignore the errors due to Sylex, the real precision of SVETLAN' is 78% in the first case and 87% in the second. This is very good and shows the interest of choosing a good threshold.

Our experiments lead to homogeneous classes containing words denoting a same concept, though these classes contain few words. In order to directly see the interest of constructing and clustering classes of words guided by their membership in a domain, it is interesting to see what kind of classes would be obtained by merging all domains, that is to say, by creating context-free classes. So, we have applied the same aggregation principle to the same corpus, but without taking into account the domains. Just below, we show two classes for the verb "to replace". The top one

context, assuming the words always have the same meaning. However, this method has to be tested on more results in order to prove its reliability. With our results, we would build for example (law, constitution, article, disposition) in the domain of "Law" and (rebel, force, northerner, leader) in the domain of "conflict".

SVETLAN', in collaboration with SEGAPSITH, allows an automatic learning of structured semantic domains. Instead of just having sets of weighted words for describing semantic domains, the domains are described by a set of verbs related to classes of words by a syntactic link. Besides, we can also view this base as semantic classes, each one being related to its context of interpretation.

As SVETLAN' works with very specific domains, it builds small classes. In order to generalize them, we could apply a process analogous to ASIUM, which merges classes independently of the related verbs according to a similarity measure, even if, in our case, this generalization process would operate within a same general domain. Afterwards, ASIUM asks an expert to validate its results.

Words are often polysemous or ambiguous. However, when used in context, they denote only one meaning, and moreover this meaning is generally the same in different occurrences of a same context. By building classes of nouns according to their contextual use, we avoid mixing all the meanings of a word, either for the verbs or for the nouns. Such a result can be exhibited in the classes (law, constitution) and (law, article, disposition) in the juridical context, where the words "article", "constitution" and "disposition" do not attract synonyms of their other meanings, such as "section", "composition" and "aptitude".

6 RELATED WORKS

There is a lot of work dedicated to the formation of classes of
is made context-free and the bottom one is made inside a domain.            words. These classes have very various status. They can contain
This verb is very general. Virtually everything can be replaced !           words belonging to the same semantic field or near synonymous.
                                                                                WordNet [1] is a lexical database made by lexicographers. It
WRUHSODFH      REMHFW     text, constitution, trousers, combustible,       aims at representing the sense of the bigger part of the lexicon. It is
                           law, dinar, rod, film, circulation, judge,       composed of Synsets. A Synset is a set of words that are synony-
                           season, device, parliament, battalion, police,   mous. These Synsets are linked by IS A relations. Its coverage is
                           president, treaty                                large but this is, in a sense, a shortcoming as its classes are too
WRUHSODFH      REMHFW     combustible, rod                                 large and do not refer to precise meanings. Indeed, the generality of
                                                                            its contents makes it difficult to use in real sized applications that
   The first group of words merges very different senses while the          are often centered on a domain. It rarely can be used without a lot
second class, much more little, is better because it contains words         of manual adaptation.
referring to very similar concepts: a rod of enriched uranium is                IMToolset, by Uri Zernik [2], extracts, for a word, several
nuclear combustible, thus the words “URG” and “FRPEXVWLEOH”                 clusters of words from text. Each of these clusters reflects a differ-
actually denote the same concept in the nuclear domain. Another             ent meaning of the studied word. This extraction is done by scan-
example is the following, for the verb “WRDWWULEXWH”:                      ning the local contexts of the word, the 10 words surrounding it in
                                                                            the texts. These signatures are statistically analyzed and clustered.
WRDWWULEXWH     REMHFW      talk, prize, decoration, pope, responsi-       The result is groups of words that are similar to our domains but
                             bility, television, attempt, letter, con-      more focused on the sense of a word alone.
                             tract, ministry, jury, funds, authority,           We have already stressed out some characteristics of ASIUM by
                             note, bonus, band, bombing                     D. Faure and C. Nedellec [4], and we give here some more details.
WRDWWULEXWH     REMHFW      prize, decoration                              ASIUM learns subcategorization frames of verbs and ontologies
                                                                            from text using syntactic analysis and a conceptual clustering
   Obtaining meaningful classes with a corpus such as $)3 shows             algorithm. It analyses texts with Sylex and creates basic clusters of
the efficiency of our method. Moreover, it is very good to obtain           words appearing with a same verb and a same syntactic role or
cohesive classes for verbs very general and polysemous.                     preposition, as do SVETLAN’. These basic classes are then clus-
   At this time, the class sizes are little. They do not contain a lot      tered to create an ontology by the mean of a cooperative learning
of words. A way to enlarge them could be to regroup classes that            algorithm. The main difference with SVETLAN’ is this coopera-
are related to the same verb, by the same syntactic relation in two         tive generalization part: ASIUM depends on the expert who has to
domains belonging to the same hierarchy, i. e. a same more general          valid, and possibly to split, the clusters made by the algorithm.
This approach is justified for specialized technical texts, but
ASIUM, applied on texts such as $)3 wires would certainly not be
able to extract good basic classes. Furthermore, as each word does
not occur a lot in these texts, the distance is not appropriate to the
grouping of our classes. On the contrary, on technical texts and
with the cooperation of an expert, ASIUM will certainly obtain
better results than ours from the point of view of the domain cover-
age.
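The basic clustering step that ASIUM and SVETLAN' share — grouping the nouns that occur after the same verb under the same syntactic role — can be sketched as below. This is only an illustrative reconstruction, not the authors' code: the triple format and the domain labels are hypothetical, and the example nouns are borrowed from the "to replace" classes discussed above.

```python
from collections import defaultdict

# Hypothetical input: (domain, verb, syntactic_role, noun) tuples, as they
# might come out of topic segmentation followed by a syntactic parse.
triples = [
    ("nuclear", "to replace", "object", "rod"),
    ("nuclear", "to replace", "object", "combustible"),
    ("justice", "to replace", "object", "law"),
    ("justice", "to replace", "object", "judge"),
    ("fashion", "to replace", "object", "trousers"),
]

def build_classes(triples, use_domains=True):
    """Group nouns that share a verb and a syntactic role.

    With use_domains=True (SVETLAN'-style), grouping is done inside each
    domain; with use_domains=False, the merge is context-free and mixes
    all the senses of a general verb.
    """
    classes = defaultdict(set)
    for domain, verb, role, noun in triples:
        key = (domain, verb, role) if use_domains else (verb, role)
        classes[key].add(noun)
    return dict(classes)

in_domain = build_classes(triples, use_domains=True)
context_free = build_classes(triples, use_domains=False)

print(in_domain[("nuclear", "to replace", "object")])  # the cohesive nuclear class
print(context_free[("to replace", "object")])          # all five nouns mixed together
```

The only difference between the two runs is whether the domain takes part in the grouping key, which is exactly what keeps the contextual classes from mixing word senses.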


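The precision figures of the filtering experiment (Figure 8) amount to simple bookkeeping over the extracted verb-noun links; a minimal sketch, using the counts of the 0.1-threshold run:

```python
def precision(good, total):
    """Fraction of extracted verb-noun links that are correct."""
    return good / total

# Counts from the threshold-0.1 run: 38 links kept, 27 good,
# 7 wrong because of Sylex, 4 wrong for other reasons.
total, good, sylex_errors = 38, 27, 7

raw = precision(good, total)                  # parser mistakes count against us
real = precision(good, total - sylex_errors)  # ignore links wrong only due to Sylex

print(f"raw precision:  {raw:.0%}")   # 71%
print(f"real precision: {real:.0%}")  # 87%
```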
CONCLUSION
The system SVETLAN' we propose extracts classes of words from
raw text, in conjunction with SEGAPSITH and the syntactic parser
Sylex. These classes are created by gathering the nouns that appear
with the same syntactic role after the same verb inside a context.
This context is built by the aggregation of texts about similar
subjects. The first experiments carried out give good results, but
they also confirm that a great volume of data is necessary in order
to extract a large quantity of lexical knowledge from the analysis
of syntactic distributions. Moreover, the very low recall of the
syntactic parser and its systematic errors on some constructions,
for example the passive form, which is very common in the
journalistic style of our corpus, reduce the number and size of the
classes. To solve this problem, we envisage trying another parser,
or adding to Sylex a post-processing step that detects the passive
form using data already present in its output. These adaptations,
and the study of larger corpora, will allow us to obtain a good
coverage of numerous semantic domains, and thus to provide
valuable semantic data useful in many applications, such as
information retrieval or word sense disambiguation systems.
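The passive-form post-processing envisaged above could start from a heuristic as simple as the following sketch. The token representation is hypothetical (real Sylex output differs), and the rule — the auxiliary "être" directly followed by a past participle — is only a first approximation for French.

```python
# Hypothetical lemmatized, POS-tagged tokens, as a post-processing step
# might see them; the tag set ("AUX", "PPAS", ...) is invented for this sketch.
Sentence = list[tuple[str, str]]  # (lemma, pos) pairs

def looks_passive(sentence: Sentence) -> bool:
    """Flag a French clause as passive when the auxiliary 'être'
    is directly followed by a past participle (a rough heuristic)."""
    for (lemma, _), (_, next_pos) in zip(sentence, sentence[1:]):
        if lemma == "être" and next_pos == "PPAS":
            return True
    return False

# "la loi est remplacée par ..." -> flagged as passive
print(looks_passive([("le", "DET"), ("loi", "NOUN"),
                     ("être", "AUX"), ("remplacer", "PPAS")]))  # True
```

A real post-processor would also have to handle adverbs between auxiliary and participle, and distinguish passives from "être" used as a copula with adjectival participles.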


REFERENCES

[1] Christiane Fellbaum (ed.), WordNet: An Electronic Lexical Database,
    The MIT Press, 1998.
[2] Uri Zernik, TRAIN1 vs. TRAIN2: Tagging Word Senses in Corpus,
    Proceedings of RIAO'91, 1991.
[3] Gregory Grefenstette, Explorations in Automatic Thesaurus Discovery,
    Kluwer Academic Publishers, Boston, 1994.
[4] David Faure and Claire Nedellec, ASIUM: Learning subcategorization
    frames and restrictions of selection. In Y. Kodratoff (ed.), Proceedings
    of the 10th ECML Workshop on Text Mining, 1998.
[5] Olivier Ferret and Brigitte Grau, A Thematic Segmentation Procedure
    for Extracting Semantic Domains from Texts, Proceedings of ECAI'98,
    Brighton, 1998.
[6] Olivier Ferret, How to thematically segment texts by using lexical
    cohesion?, Proceedings of ACL-COLING'98 (student session), pp.
    1481-1483, Montreal, Canada, 1998.
[7] C.-Y. Lin, Robust Automated Topic Identification, Doctoral
    Dissertation, University of Southern California, 1997.
[8] Patrick Constant, Analyse Syntaxique Par Couches, Ph.D. thesis, École
    Nationale Supérieure des Télécommunications, April 1991.
[9] Patrick Constant, L'analyseur linguistique SYLEX, 5ème école d'été
    du CNET, 1995.