SVETLAN’: A System to Classify Nouns in Context

Gaël de Chalendar¹ and Brigitte Grau¹ ²

Abstract. Using semantic knowledge in NLP applications always improves their competence. Broad lexicons have been developed, but there are few resources made for non-specialized domains which contain semantic information available for words. In order to build such a base, we conceived a system, SVETLAN’, able to learn categories of nouns from texts, whatever their domain. In order to avoid general classes mixing all the meanings of words, they are learned taking into account the contextual use of words.

1 INTRODUCTION

Using semantic knowledge in NLP applications always improves their competence, as in Information Retrieval or Word Sense Disambiguation systems. Broad lexicons have been developed, but apart from WordNet [1] there are few existing resources that contain semantic information for words and that are not specialized to very specific domains. Moreover, manual or automatic processes that build semantic categories of nouns usually lead to general categories. For example, words in WordNet are related to a Synset when they are synonymous; however, Synsets correspond to large categories, and meanings shift, so that when two words belonging to the same Synset are considered within a specific context, they often no longer share a common meaning. Automatic processes that extract knowledge from texts by using statistical [2] or distributional [3], [4] approaches also build broad classes if they are not applied to specialized texts belonging to a very specific domain. On the other hand, we do not want to learn a general ontology, whatever the domain is. As most words are polysemous, we claim that a semantic base has to deal with all the meanings of a word by associating them with their context of interpretation. Such semantic knowledge will allow information retrieval and Question/Answering systems, for example, to use a deeper semantic analysis of texts, even when applied to databases of non-technical texts that cover different domains and use a general, common vocabulary, such as collections of newspaper articles.

In order to build such a base, we conceived a system, SVETLAN’, able to learn categories of nouns in context from texts, whatever their domain. It is based on a distributional approach: nouns playing the same syntactic role with a verb in sentences related to the same topic, i.e. the same domain, are aggregated in the same class. SVETLAN’ relies on knowledge about semantic domains automatically learned by SEGAPSITH [5].

2 OVERVIEW OF THE SYSTEM

The input data of SVETLAN’ (see Fig. 1) are semantic domains together with the Thematic Units (TUs) that gave birth to them. Domains are sets of weighted words that are relevant to represent one specific topic. They are automatically learned by aggregating similar thematic units, which are themselves sets of words. Each TU corresponds to a part of a text that is homogeneous from a topic point of view and is delimited by a topic segmentation process relying on lexical cohesion. Processed texts are newspaper articles that are pre-processed in order to retain only lemmatized content words.

[Figure 1. Schemata of Structured Domain learning. Input data: a Domain and its TUs; each TU points to a Text Segment, which yields an STU made of V r N triplets; the STUs are aggregated into a Structured Domain made of V r N1, N2, … entries.]

The first step of SVETLAN’ consists of retrieving the text segments of the original texts associated with the different TUs in order to parse their sentences.
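As a rough illustration of the input data just described, the sketch below models a domain as a set of weighted words that remembers its TUs, each TU remembering the text segment it comes from. The class names, field names, offsets and sample text are all hypothetical and chosen for illustration only; the paper does not specify SVETLAN’'s actual data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ThematicUnit:
    """A TU: lemmatized content words plus a pointer to its source segment."""
    text_id: str   # hypothetical identifier of the source article
    start: int     # character offset where the segment starts
    end: int       # character offset where the segment ends (exclusive)
    words: tuple   # lemmatized content words of the segment


@dataclass
class Domain:
    """A semantic domain: weighted words and the TUs it was built from."""
    weights: dict = field(default_factory=dict)
    units: list = field(default_factory=list)


def segments_for_domain(domain, corpus):
    """Return the raw text segment behind each TU of a domain."""
    return [corpus[tu.text_id][tu.start:tu.end] for tu in domain.units]


# Toy data (invented): one article, one TU covering its first sentence.
corpus = {"afp-001": "The judge charged the suspect. Later, markets rose."}
tu = ThematicUnit("afp-001", 0, 30, ("judge", "charge", "suspect"))
justice = Domain(weights={"judge": 0.5, "suspect": 0.3}, units=[tu])
print(segments_for_domain(justice, corpus))
```

Keeping offsets rather than copies of the text is only one possible design; it makes it cheap to hand the original sentences of a TU to a parser later on.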
We then extract from the parser results all the triplets constituted by a verb, the head noun of a phrase and its syntactic role, in order to produce the Syntactic Thematic Units (STUs). The STUs belonging to the same semantic domain are aggregated together to learn a Structured Domain. Aggregation groups nouns playing the same syntactic role with a verb in order to form classes. As these aggregations are made within TUs belonging to the same domain, classes are context sensitive, which ensures a better homogeneity. A filtering step, based on the weights of the words in their domain, allows the system to eliminate nouns from classes when they are not very relevant in this context.

¹ LIMSI/CNRS, BP 133, 91 403 Orsay Cédex, France, email: {gael,grau}@limsi.fr
² IIE-CNAM, 18 allée J. Rostand, 91 000 Evry, France

3 SEMANTIC DOMAIN LEARNING

We only give here a brief overview of the semantic domain learning module, which is described more precisely in [5]. This module incrementally builds topic representations, made of weighted words, from discourse segments delimited by SEGCOHLEX [6]. It works without any a priori classification or hand-coded pieces of knowledge. Processed texts are typically newspaper articles coming from Le Monde or AFP (Agence France Presse). They are pre-processed to keep only their lemmatized content words (adjectives, single or compound nouns and verbs).

The topic segmentation implemented by SEGCOHLEX is based on a large collocation network, built from 24 months of the Le Monde newspaper, where a link between two words aims at capturing semantic and pragmatic relations between them. The strength of such a link is evaluated by the mutual information between its two words. The segmentation process relies on these links to compute a cohesion value for each position of a text. It assumes that a discourse segment is a part of text whose words refer to the same topic, that is, words that are strongly linked to each other in the collocation network and yield a high cohesion value. Conversely, low cohesion values indicate topic shifts. After delimiting segments by an automatic analysis of the cohesion graph, only highly cohesive segments, named Thematic Units (TUs), are kept to learn topic representations. This segmentation method decomposes a text into small thematic units, whose size is equivalent to a paragraph. Discourse segments, even when related to the same topic, often develop different points of view. To enrich the particular description given by a text, we add to TUs those words of the collocation network that are particularly linked to the words found in the corresponding segment.

Learning a complete description of a topic consists of merging all successive points of view, i.e. similar TUs, into a single memorized thematic unit, called a semantic domain. Each aggregation of a new TU increases the system’s knowledge about one topic by reinforcing recurrent words and adding new ones. Weights on words represent the importance of each word relative to the topic and are computed from the number of occurrences of these words in the TUs. This method leads SEGAPSITH to learn specific topic representations, as opposed to [7] for example, whose method builds general topic descriptions such as economy, sport, etc.

We have applied the learning module of SEGAPSITH to one month (May 1994) of AFP newswires. Figure 2 shows an example of a domain about justice that gathers 69 TUs.

| words | occ. | weight |
|---|---|---|
| examining judge | 58 | 0.501 |
| police custody | 50 | 0.442 |
| public property | 46 | 0.428 |
| charging | 49 | 0.421 |
| to imprison | 45 | 0.417 |
| court of criminal appeal | 47 | 0.412 |
| receiving stolen goods | 42 | 0.397 |
| to presume | 45 | 0.382 |
| criminal investigation department | 42 | 0.381 |
| fraud | 42 | 0.381 |

Figure 2. The most representative words of a domain about justice

As some of these domains are close and refer to the same general topic, we have applied a hierarchical classification method based on their common words to organize them into separate general topics and to structure them. Figure 3 shows the hierarchies built about sport, police and stock exchange. Each leaf is a domain, named by its two most heavily weighted words, while internal nodes are described by their name and their size, i.e. the number of common words found in their children.

[Figure 3. Three hierarchies of semantic domains. Sport: Pilot/Formula, Team_mate/Championship, To_beat/Finale, Tennis/Team_mate, Cycle_race/Stage (internal nodes of size 11 and 50). Police: Police/Policeman, To_question/Arrest, Prison/Condemn (internal node of size 6). Stock exchange: Dollar/Billion, Money/Quarter, Rate/Rise (internal node of size 27).]

4 STRUCTURED DOMAIN LEARNING

As in [4], verbs allow us to categorize nouns. A class is defined by those nouns which play the same role relative to the same verb. In order to learn very homogeneous³ classes, we only apply this principle to words belonging to the same context, i.e. a domain.

³ We call homogeneous a class that contains words that denote the same concept in the corresponding domain.

4.1 Syntactic analysis

In order to find the verbs and their arguments in the texts, we use the syntactic analyzer Sylex [8], [9]. Figure 4 shows a small part of the results of Sylex for a sentence. The first part exhibits lexico-syntactic information for the words, for four different interpretations, as indicated by the string “taux 4”, meaning an ambiguity rate of 4. This rate is due to the fact that Sylex cannot resolve two ambiguities: the ambiguity of “laisse” between the verb “laisser” (to let) and the noun “laisse” (leash), and the ambiguity of “critique” between the verb “to criticize” and the noun “criticism”. Note that Sylex does not consider the adjectival form, which is the right interpretation here. The second part shows the syntactic links found by Sylex. The numbers in parentheses are references to the words in the preceding analysis.
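To make the shape of these link lines concrete, here is a minimal sketch that reads them with a regular expression and keeps each distinct link once, ignoring the per-analysis reference numbers. The line pattern is inferred solely from the excerpt shown in Figure 4, so it is an assumption rather than a specification of Sylex output; the function names are hypothetical, and the authors themselves used a formal grammar rather than this approach.

```python
import re

# Assumed line shape, inferred from the Figure 4 excerpt only:
#   `source phrase' (ref) ->- relation head ->- `target phrase' (ref)
LINK_RE = re.compile(
    r"`(?P<src>.+?)' \((?P<src_ref>\d+)\) ->- (?P<rel>\S+) head ->- "
    r"`(?P<dst>.+?)' \((?P<dst_ref>\d+)\)"
)


def parse_link(line):
    """Return (source, relation, target) for a link line, or None."""
    match = LINK_RE.match(line.strip())
    if match is None:
        return None  # e.g. separator lines such as "......."
    return match["src"], match["rel"], match["dst"]


def distinct_links(lines):
    """Collect each link once, ignoring the reference numbers."""
    seen = []
    for line in lines:
        link = parse_link(line)
        if link is not None and link not in seen:
            seen.append(link)
    return seen


sample = [
    "`L'état de santé critique' (164) ->- cn head ->- `du pilote autrichien' (170)",
    "`planer' (204) ->- a2 head ->- `une menace' (205)",
    ".......",
    "`planer' (153) ->- a2 head ->- `une menace' (154)",
    "`planer' (212) ->- a2 head ->- `une menace' (213)",
]
for link in distinct_links(sample):
    print(link)
```

The non-greedy groups let a phrase contain an apostrophe (as in "L'état"), because the closing quote is only recognized when it is followed by a space and a parenthesized reference.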
In the example of Figure 4, Sylex has found the same interpretation four times, once in each of its possible analyses. In this case, we count one occurrence of the link. However, if it finds the same relation several times between a verb and different words, for example several possible subjects, then we keep all the different interpretations because we have no way to choose between them. We make the reasonable assumption that the false interpretations will have far fewer occurrences in the corpus and so will be filtered out during the rest of the processing.

    ******************** Phrase 193-466 ***********************************
    "L'état de santé critique du pilote autrichien Karl Wendlinger (Sauber-Mercedes), victime d'un grave accident jeudi matin lors des premiers essais du Grand Prix de Monaco de Formule Un, laisse planer une menace sur le déroulement de la course, dimanche en Principauté."
    ******************** Partie 1 193-466 taux 4 **************************
    "L'état de santé critique du pilote autrichien Karl Wendlinger (Sauber-Mercedes), victime d'un grave accident jeudi matin lors des premiers essais du Grand Prix de Monaco de Formule Un, laisse planer une menace sur le déroulement de la course, dimanche en Principauté."
    <Lexico-Syntactic information>
    193-195 (164) "L'" "le" [gs.1,avn,pdet.1] pdet : singulier elision dmaj
    195-208 (165) "état de santé" "état de santé" [gs.1,nom.1] nom : masculin singulier mot_compose locsw
    .......
    382-388 (203) "laisse" "laisse" [gs.12,nom.1] nom : feminin singulier
    389-395 (204) "planer" "planer" [gs.13,verbe] verbe : infinitif
    .......
    193-195 (16) "L'" "le" [gs.1,avn,pdet.1] pdet : singulier elision dmaj
    195-208 (117) "état de santé" "état de santé" [gs.1,nom.1] nom : masculin singulier mot_compose locsw
    .......
    382-388 (211) "laisse" "laisser" [gs.13,verbe] verbe : singulier autoontif antiontif anontif present indicatif subjonctif imperatif
    389-395 (212) "planer" "planer" [gs.14,verbe] verbe : infinitif
    .......
    <Syntactic Links>
    `L'état de santé critique' (164) ->- cn head ->- `du pilote autrichien' (170)
    `planer' (204) ->- a2 head ->- `une menace' (205)
    .......
    `planer' (153) ->- a2 head ->- `une menace' (154)
    .......
    `planer' (161) ->- a2 head ->- `une menace' (162)
    .......
    `planer' (212) ->- a2 head ->- `une menace' (213)
    `sur le déroulement' (66) ->- cn head ->- `de la course' (235)

Figure 4. An extract of a sentence analysis by Sylex.

The results of Sylex are very detailed and not easy to parse directly with, say, Perl. Furthermore, we do not need all the information it extracts. In fact, we only need to find the verb with its links and the head nouns that are the arguments of these links. So, we have developed a formal grammar that extracts from these raw analyses the associations between a verb and its arguments. This grammar extracts links from the results of Sylex in the following format:

    i#j verb # token1 # lemma1 # k rel # token2 # lemma2 # l

where i and j are the boundaries of the sentence that contains the link in the corpus; token1 and lemma1 are the token and the lemma of the verb respectively; rel is the syntactic relation, which can be "subject", "direct object" or a preposition ("to", "from", etc.); token2 and lemma2 are the token and the lemma of the head noun of the noun phrase pointed to by the relation; lastly, k and l are the indexes in the corpus of token1 and token2 respectively. Figure 5 shows some links that we have extracted from the results of Sylex.

| token1 | lemma1 | rel | token2 | lemma2 |
|---|---|---|---|---|
| hang over | hang over | subject | threat | threat |
| play | play | object | cup | cup |
| hear | hear | of | sources | source |

Figure 5. Examples of extracted links

Sylex, like other syntactic analyzers, has difficulties with some constructions and as a consequence introduces errors that can cause problems for the rest of the system. A common error is the wrong interpretation of the passive form, which causes a subject to be analyzed as a direct object and, conversely, a direct object to be viewed as a subject. Another common error is that Sylex often does not find any link in a phrase. This is what we will call silence. We will see in Section 5 that we can obtain good results despite these problems, thanks to the redundancy needed to validate the links in the next steps of processing. But another consequence of this need for redundancy is that the system must use great quantities of text in order to create classes of a satisfactory size.

Having obtained the syntactic links in the texts, we want to group them according to the Thematic Unit their text segment belongs to. So, we define a Syntactic Thematic Unit (STU) as a set of <Verb → syntactic relation → Noun> structures, i.e. syntactic relations instantiated with a verb and a noun. We will refer to these structures as Instantiated Syntactic Relations or ISRs. We are able to relate the links extracted from the results of Sylex to the words contained in the domains because each domain in the thematic memory remembers which thematic units have been used to create it. In the same way, each thematic unit remembers the part of text it comes from.

4.2 Aggregation

In order to construct groups of words with very similar meanings, we want to group the nouns appearing with the same syntactic role in relation to a verb inside a Domain. Thus, a Structured Domain (SD) is a set of <Verb → syntactic relation → Noun1, …, Nounn> structures, i.e. aggregated ISRs.

STUs related to the same domain are aggregated together to form a Structured Domain. Aggregating an STU within an SD consists of:
- aggregating their ISRs that contain the same verb;
- adding new ISRs, i.e. adding new verbs with their arguments, each made of a syntactic relation and the lemmatized form of a noun.

Figure 6 shows the aggregation of an SD and three ISRs. This example shows all the possible effects of the aggregation. In the figure, bold elements represent new or updated data. Aggregating an ISR into an SD that already contains the verb of the ISR increments the occurrence number of the verb, as for play in the example. Similarly, the occurrence numbers of nouns already related to the verb by the same relation are updated (as for match), and new relations with their associated nouns are added to the verb. In the example, the subject champion is added. An ISR with a new verb is simply added with an occurrence of 1, as for to lose.

Syntactic Domain: source

| verb | relation | nouns |
|---|---|---|
| to play [4] | object | cup [3], match [1] |
| | with | ball [1] |
| to win [2] | subject | player [1] |
| | object | match [1] |

Instantiated Syntactic Relations: sources

| verb | relation | noun |
|---|---|---|
| to play | subject | champion |
| | object | match |
| to lose | object | championship |

Syntactic Domain: result

| verb | relation | nouns |
|---|---|---|
| to play […] | object | cup [3], match [2] |
| | with | ball [1] |
| | subject | champion [1] |
| to win [2] | subject | player [1] |
| | object | match [1] |
| to lose [1] | object | championship [1] |

Figure 6. An example of the aggregation of three ISR in a SD

Classes of nouns in the produced SDs contain a lot of words that disturb their homogeneity. These words often belong to parts of the different TUs at the origin of the SD that are not closely related to the described topic. Either they result from an error of the topic segmentation process or they correspond to a meaning of a verb scarcely used in the current context. Another possibility is that the ISR results from an error of Sylex. As these cases do not often recur for the same words in the same context, their nouns are weakly weighted in the corresponding domains. This characteristic gives us a means to filter the class content: each noun whose weight is lower than a threshold is removed from the class. By this selection, we reinforce the learning of classes of words according to their contextual use. Figure 7 shows two aggregated links, first as obtained without filtering in its upper part and then their filtered counterparts in its lower part. The class associated with the verb ‘to establish’ has been completely removed, as the weights of both ‘base’ and ‘zone’ are lower than the threshold, while the class related to the verb ‘to answer’ with the ‘object’ link has been reduced by removing ‘list’. We can see in this example that this filtering is efficient: the verb ‘to establish’ and the words ‘base’ and ‘zone’ are not closely related to the domain of ‘nuclear weapons’ from which this example is taken, and the usage ‘to answer a list’ has a very low probability. More details on the effects of the filtering process are given in Section 5.

| | verb | relation | nouns |
|---|---|---|---|
| Before filtering | to establish | object | base, zone |
| | to answer | object | document, question, list |
| After filtering | to establish | object | (base and zone removed) |
| | to answer | object | document, question (list removed) |

Figure 7. Filtered aggregated links in a domain about nuclear weapons

In principle, the described operations are not very complicated. The difficulties come from the necessity to work with data coming from various tools. Furthermore, for performance and practical reasons, we do not apply the chain of tools text by text. The natural way to see the process would be to:
• read a text,
• extract the TUs from it,
• extract the corresponding STUs,
• add each TU to its domain,
• add each STU to its corresponding domain,
and, after the processing of all the texts, to filter the classes. In fact, each computing step is done on the entire corpus and the results are then aligned. This saves computation time as we do not have to run each tool multiple times. However, we have to deal with dictionaries and indexes for various files and tools.

5 RESULTS

The experiments⁴ we have conducted aimed to show that SVETLAN’ learns classes of words which obviously belong to the same concept in the domain. To obtain such results we have chosen to run our system on one month of AFP (Agence France Presse) wires, which form a corpus that is stylistically coherent but covers varied subjects with very polysemous and non-specific verbs. These wires are made of 4,500,000 words and 48,000 sentences in 6,000 texts. The thematic analysis gives 8,000 TUs aggregated in 2,000 domains. More details on these domains can be found in [5]. From these 48,000 sentences, 117,000 different Instantiated Syntactic Links are extracted by Sylex. 24,000 of these links concern subjects, direct objects, or circumstantial complements introduced by a preposition, and are integrated in 1,531 Structured Domains.

⁴ Please note that all the tests have been made in French. So the English examples that appear here are translations from French.

After aggregating, but before filtering, the system obtains 431 aggregated links with two or more arguments, equivalent to 431 word classes. Some of them, such as <to manufacture → direct object → bomb, weapon>, are good. Nevertheless, other classes are heterogeneous, such as <to return → direct object → territory, strip, context, synagogue> (here strip comes from the Gaza Strip), or clearly mix different meanings of a verb, like <to quit → direct object → base, government>, which mixes together the meanings "to leave a place" and "to retire from an institution". For the two latter cases, one can see the interest of taking into account the fact that the domains contain words with different weights representing their relevance to the domain. The higher the weight, the higher the relevance of the word in the domain. So we apply the aforesaid filter to our classes and retain only the nouns with weights higher than a threshold. The class <territory, strip, context, synagogue> is corrected to <territory, strip> and <base, government> is removed.

Among the wrong classes, some are due to errors of Sylex, such as <to confer → direct object → price, actor>, where actor should be linked to to confer by the preposition to. The remaining ones are due to the extensive use of two different meanings of the verb in the same domain, as for <to conduct/to manage → direct object → delegation, negotiation> (in French: "conduire une négociation/une délégation"). This kind of error is inherent to the method we use and should be removed by other means. Note that the correctness of the links has been manually judged by ourselves. The precision measure used below is the ratio between the number of good classes and the total number of classes. We cannot define a recall measure because we have no way to know which classes we miss. To our knowledge, there are no existing resources with associated classes that would allow us to formally judge the results.

We have tried two thresholds: 0.05 and 0.1. Figure 8 details the results for both.

| Threshold | Total | Good | Sylex errors | Remaining errors |
|---|---|---|---|---|
| 0.05 | 73 | 46 (63%) | 13 (18%) | 14 (19%) |
| 0.1 | 38 | 27 (71%) | 7 (18%) | 4 (11%) |

Figure 8. Results of the filtering for two thresholds

After filtering, a lot of classes are removed but the remaining classes are well founded in most cases. An example of a class retained for both thresholds is:

    <to injure → subject → colonist, soldier>

With a threshold set to 0.1 rather than 0.05, we retain only 38 links, but we gain 8 percentage points in precision. If we ignore the errors due to Sylex, the real precision of SVETLAN’ is 78% in the first case and 87% in the second. This is a very good result and shows the importance of choosing a good threshold.

Our experiments lead to homogeneous classes containing words denoting the same concept, though these classes contain few words. To see directly the interest of constructing and clustering classes of words guided by their belonging to a domain, it is interesting to see what kind of classes would be obtained by merging all domains, that is to say, by creating context-free classes. So, we have applied the same aggregation principle to the same corpus but without taking the domains into account. Below, we show two classes for the verb “to replace”. The top one is context-free and the bottom one is built inside a domain. This verb is very general. Virtually everything can be replaced!

| verb | relation | nouns |
|---|---|---|
| to replace | object | text, constitution, trousers, combustible, law, dinar, rod, film, circulation, judge, season, device, parliament, battalion, police, president, treaty |
| to replace | object | combustible, rod |

The first group of words merges very different senses, while the second class, which is much smaller, is better because it contains words referring to very similar concepts: a rod of enriched uranium is nuclear combustible (i.e. nuclear fuel), thus the words “rod” and “combustible” actually denote the same concept in the nuclear domain. Another example is the following, for the verb “to attribute”:

| verb | relation | nouns |
|---|---|---|
| to attribute | object | talk, prize, decoration, pope, responsibility, television, attempt, letter, contract, ministry, jury, funds, authority, note, bonus, band, bombing |
| to attribute | object | prize, decoration |

Obtaining meaningful classes with a corpus such as AFP shows the efficiency of our method. Moreover, it is very good to obtain cohesive classes for very general and polysemous verbs.

At this time, the classes are small. They do not contain a lot of words. A way to enlarge them could be to regroup classes that are related to the same verb by the same syntactic relation in two domains belonging to the same hierarchy, i.e. the same more general context, assuming the words always have the same meaning. However, this method has to be tested on more results in order to prove its reliability. With our results, we would build for example (law, constitution, article, disposition) in the domain of “Law” and (rebel, force, northerner, leader) in the domain of “conflict”.

SVETLAN’, in collaboration with SEGAPSITH, allows an automatic learning of structured semantic domains. Instead of just having sets of weighted words for describing semantic domains, domains are described by a set of verbs related to classes of words by a syntactic link. Besides, we can also view this base as a set of semantic classes, each one being related to its context of interpretation.

As SVETLAN’ works with very specific domains, it builds small classes. In order to generalize them, we could apply a process analogous to that of ASIUM, which merges classes independently of the related verbs according to a similarity measure, even if, in our case, this generalization process would operate within the same general domain. Afterwards, ASIUM asks an expert to validate its results.

Words are often polysemous or ambiguous. However, when used in context, they only denote one meaning, and moreover this meaning is generally the same in different occurrences of the same context. When building classes of nouns according to their contextual use, we avoid mixing all the meanings of a word, either for the verbs or for the nouns. Such a result can be seen in the classes (law, constitution) and (law, article, disposition) in the juridical context, where the words “article”, “constitution” and “disposition” do not attract synonyms of their other meanings, such as “section”, “composition” and “aptitude” for example.

6 RELATED WORKS

There are a lot of works dedicated to the formation of classes of words. These classes have very different statuses. They can contain words belonging to the same semantic field or near synonyms.

WordNet [1] is a lexical database made by lexicographers. It aims at representing the senses of the greater part of the lexicon. It is composed of Synsets. A Synset is a set of words that are synonymous. These Synsets are linked by IS-A relations. Its coverage is large, but this is, in a sense, a shortcoming, as its classes are too large and do not refer to precise meanings. Indeed, the generality of its contents makes it difficult to use in real-sized applications, which are often centered on a domain. It can rarely be used without a lot of manual adaptation.

IMToolset, by Uri Zernik [2], extracts, for a word, several clusters of words from text. Each of these clusters reflects a different meaning of the studied word. This extraction is done by scanning the local contexts of the word, i.e. the 10 words surrounding it in the texts. These signatures are statistically analyzed and clustered. The result is groups of words that are similar to our domains but more focused on the sense of a single word.

We have already stressed some characteristics of ASIUM by D. Faure and C. Nedellec [4], and we give here some more details. ASIUM learns subcategorization frames of verbs and ontologies from text using syntactic analysis and a conceptual clustering algorithm. It analyses texts with Sylex and creates basic clusters of words appearing with the same verb and the same syntactic role or preposition, as does SVETLAN’. These basic classes are then clustered to create an ontology by means of a cooperative learning algorithm. The main difference with SVETLAN’ is this cooperative generalization part: ASIUM depends on the expert, who has to validate, and possibly split, the clusters made by the algorithm. This approach is justified for specialized technical texts, but ASIUM, applied to texts such as AFP wires, would certainly not be able to extract good basic classes. Furthermore, as each word does not occur a lot in these texts, the distance is not appropriate to the grouping of our classes. On the contrary, on technical texts and with the cooperation of an expert, ASIUM will certainly obtain better results than ours from the point of view of the domain coverage.

7 CONCLUSION

The system SVETLAN’ we propose, in conjunction with SEGAPSITH and the syntactic parser Sylex, extracts classes of words from raw text.
These classes are created by gathering nouns appearing with the same syntactic role after the same verb inside a context. This context is made by the aggregation of texts about similar subjects. The first experiments carried out give good results. But they also confirm that a great volume of data is necessary in order to extract a large quantity of lexical knowledge by the analysis of syntactic distributions. Moreover, the very low recall of the syntactic parser and its systematic errors on some constructions, for example the passive form, which is very common in the journalistic style of our corpus, reduce the number and size of the classes. To solve this problem, we envisage trying another analyzer or adding a post-processing step to Sylex that detects the passive form by using data already in its output. These adaptations and the study of larger corpora will allow us to obtain a good coverage of numerous semantic domains. We will then be able to provide valuable semantic data, useful in a lot of applications such as information retrieval systems or word sense disambiguation systems.

REFERENCES

[1] Christiane Fellbaum, WordNet: an electronic lexical database, The MIT Press, 1998.
[2] Uri Zernik, TRAIN1 vs. TRAIN2: Tagging Word Senses in Corpus, RIAO’91, 1991.
[3] Gregory Grefenstette, Explorations in automatic thesaurus discovery, Kluwer Academic Pub., Boston, 1994.
[4] David Faure and Claire Nedellec, ASIUM, Learning subcategorization frames and restrictions of selection. In Y. Kodratoff ed., proceedings of 10th ECML – Workshop on text mining, 1998.
[5] Olivier Ferret and Brigitte Grau, A Thematic Segmentation Procedure for Extracting Semantic Domains from Texts, Proceedings of ECAI’98, Brighton, 1998.
[6] Olivier Ferret, How to thematically segment texts by using lexical cohesion? Proceedings of ACL-COLING'98 (student session), pp. 1481-1483, Montreal, Canada, 1998.
[7] C.-Y. Lin, Robust Automated Topic Identification, Doctoral Dissertation, University of Southern California, 1997.
[8] Patrick Constant, Analyse Syntaxique Par Couches. Ph.D thesis, École Nationale Supérieure des Télécommunications, April 1991.
[9] Patrick Constant, L'analyseur linguistique SYLEX. 5ème école d'été du CNET, 1995.