<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>IIE-CNAM</institution>
          ,
          <addr-line>18 allée J. Rostand, 91 000 Evry</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIMSI/CNRS</institution>
          ,
          <addr-line>BP 133, 91 403 Orsay Cédex</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Using semantic knowledge in NLP applications always improves their competence. Broad lexicons have been developed, but few resources provide semantic information about words outside specialized domains. To build such a base, we conceived a system, SVETLAN', able to learn categories of nouns from texts, whatever their domain. To avoid general classes that mix all the meanings of words, the categories are learned by taking into account the contextual use of words.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Using semantic knowledge in NLP applications, such as Information Retrieval or Word Sense Disambiguation systems, always improves their competence. Broad lexicons have been developed, but apart from WordNet [<xref ref-type="bibr" rid="ref1">1</xref>], few existing resources provide semantic information for words without being restricted to very specific domains. Moreover, manual or automatic processes that build semantic categories of nouns usually end up defining general categories. For example, words in WordNet are attached to the same Synset when they are synonymous; however, Synsets correspond to large categories, and meanings shift, so that two words belonging to the same Synset often no longer share a common meaning when they are considered within a specific context. Automatic processes that extract knowledge from texts by using statistical [<xref ref-type="bibr" rid="ref2">2</xref>] or distributional [<xref ref-type="bibr" rid="ref3">3</xref>], [<xref ref-type="bibr" rid="ref4">4</xref>] approaches also build broad classes unless they are applied to specialized texts belonging to a very specific domain. On the other hand, we do not want to learn a general ontology, whatever the domain is. As most words are polysemous, we claim that a semantic base has to deal with all the meanings of a word by associating each of them with its context of interpretation. Such semantic knowledge will allow information retrieval and Question/Answering systems, for example, to use a deeper semantic analysis of texts, even when they are applied to databases of non-technical texts that cover different domains and use a general, common vocabulary, such as collections of newspaper articles.
      </p>
      <p>
        In order to build such a base, we conceived a system,
SVETLAN’, able to learn categories of nouns in context from
texts, whatever their domain. It is based on a distributional
approach: nouns playing the same syntactic role with a verb in
sentences related to the same topic, i.e. the same domain, are
aggregated in the same class. SVETLAN’ relies on knowledge about
semantic domains automatically learned by SEGAPSITH [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Figure 1 diagram labels</title>
      <p>Labels recovered from the diagram: Domain; TU; Text Segment; STU (V, r, N); Structured Domain (V, r, N1, N2, …).</p>
    </sec>
    <sec id="sec-6">
      <title>Input data</title>
      <p>Figure 1. Schemata of Structured Domain learning</p>
      <p>The first step of SVETLAN’ consists of retrieving the text segments of the original texts associated with the different TUs in order to parse their sentences. From the parser results, we then extract all the triplets made of a verb, the head noun of a phrase and its syntactic role in order to produce the Syntactic Thematic Units (STUs). The STUs belonging to the same semantic domain are aggregated together to learn a Structured Domain. Aggregation groups the nouns that play the same syntactic role with a verb in order to form classes. As these aggregations are made within TUs belonging to the same domain, the classes are context sensitive, which ensures better homogeneity. A filtering step, based on the weights of the words in their domain, allows the system to eliminate nouns from classes when they are not very relevant in this context.</p>
      <p>
        We only give here a brief overview of the semantic domain learning module, which is described more precisely in [<xref ref-type="bibr" rid="ref5">5</xref>]. This module incrementally builds topic representations, made of weighted words, from discourse segments delimited by SEGCOHLEX [<xref ref-type="bibr" rid="ref6">6</xref>]. It works without any classification or hand-coded knowledge. Processed texts are typically newspaper articles coming from […] or […]. They are pre-processed to keep only their lemmatized content words (adjectives, single or compound nouns and verbs).
      </p>
      <p>The topic segmentation implemented by SEGCOHLEX is based on a large collocation network, built from 24 months of newspaper text, in which a link between two words aims at capturing the semantic and pragmatic relations between them. The strength of such a link is evaluated by the mutual information between its two words. The segmentation process relies on these links to compute a cohesion value for each position of a text. It assumes that a discourse segment is a part of a text whose words refer to the same topic, that is, whose words are strongly linked to each other in the collocation network and yield a high cohesion value. Conversely, low cohesion values indicate topic shifts. After segments have been delimited by an automatic analysis of the cohesion graph, only highly cohesive segments, named Thematic Units (TUs), are kept to learn topic representations. This segmentation method decomposes a text into small thematic units whose size is equivalent to a paragraph. Discourse segments, even when related to the same topic, often develop different points of view. To enrich the particular description given by a text, we add to TUs those words of the collocation network that are particularly linked to the words found in the corresponding segment.</p>
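      <p>The link strength and the cohesion value can be illustrated with a minimal sketch. It assumes that mutual information is the pointwise mutual information estimated from co-occurrence windows, and that the cohesion of a text window is the sum of the link strengths between its words. Neither formula is given in this paper (the actual method is described in [<xref ref-type="bibr" rid="ref6">6</xref>]), and the names build_network and cohesion are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter
from itertools import combinations


def build_network(windows):
    """Map each sorted word pair to its pointwise mutual information.

    `windows` is a list of word lists; two words co-occur when they
    appear in the same window (an assumption of this sketch).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for window in windows:
        unique = sorted(set(window))
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    total = len(windows)
    network = {}
    for (first, second), joint in pair_counts.items():
        p_joint = joint / total
        p_first = word_counts[first] / total
        p_second = word_counts[second] / total
        strength = math.log2(p_joint / (p_first * p_second))
        if strength > 0:  # keep only positively associated pairs
            network[(first, second)] = strength
    return network


def cohesion(words, network):
    """Sum the strengths of the links among the distinct words of a window."""
    unique = sorted(set(words))
    return sum(network.get(pair, 0.0) for pair in combinations(unique, 2))


if __name__ == "__main__":
    # Hypothetical toy data: two topics that never share a window.
    toy_windows = [["judge", "court"], ["judge", "court"],
                   ["rate", "dollar"], ["rate", "dollar"]]
    toy_network = build_network(toy_windows)
    print(cohesion(["judge", "court"], toy_network))  # words of one topic
    print(cohesion(["judge", "rate"], toy_network))   # words of two topics
```
      </preformat>
      <p>In this toy setting, a window whose words belong to one topic gets a higher cohesion value than a window that mixes two topics, which is the contrast the segmentation relies on to detect topic shifts.</p>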
      <p>Words (Figure 2): examining judge; police custody; public property; charging; to imprison; court of criminal appeal; receiving stolen goods; to presume; criminal investigation department; fraud.</p>
      <p>
        Learning a complete description of a topic consists of merging all successive points of view, i.e. similar TUs, into a single memorized thematic unit, called a semantic domain. Each aggregation of a new TU increases the system’s knowledge about one topic by reinforcing recurrent words and adding new ones. The weight of a word represents its importance relative to the topic and is computed from the number of occurrences of this word in the TUs. This method leads SEGAPSITH to learn specific topic representations, as opposed to [<xref ref-type="bibr" rid="ref7">7</xref>] for example, whose method builds general topic descriptions such as economy, sport, etc.
      </p>
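      <p>The incremental merging of similar TUs into a semantic domain can be sketched as follows. The similarity measure (the share of the TU's words already known to the domain), the 0.5 threshold and the weight (the fraction of aggregated TUs that contain the word) are assumptions made for illustration; the text only states that weights are computed from the number of occurrences of the words in the TUs. The names Domain and learn_domains are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def similarity(tu_words, domain):
    """Share of the TU's distinct words already present in the domain."""
    distinct = set(tu_words)
    if not distinct:
        return 0.0
    return sum(1 for word in distinct if word in domain.occurrences) / len(distinct)


class Domain:
    """A semantic domain: words with the number of TUs that contain them."""

    def __init__(self):
        self.occurrences = Counter()
        self.tu_count = 0

    def aggregate(self, tu_words):
        # Recurrent words are reinforced, new words are added.
        self.occurrences.update(set(tu_words))
        self.tu_count += 1

    def weights(self):
        return {word: count / self.tu_count
                for word, count in self.occurrences.items()}


def learn_domains(tus, threshold=0.5):
    """Aggregate each TU into the most similar domain, or open a new one."""
    domains = []
    for tu_words in tus:
        best = max(domains, key=lambda d: similarity(tu_words, d), default=None)
        if best is not None and similarity(tu_words, best) >= threshold:
            best.aggregate(tu_words)
        else:
            fresh = Domain()
            fresh.aggregate(tu_words)
            domains.append(fresh)
    return domains
```
      </preformat>
      <p>With this scheme, a word that recurs in most TUs of a domain ends up with a weight close to 1, while a word seen once in a large domain gets a weight close to 0.</p>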
      <p>We have applied the learning module of SEGAPSITH to one month (May 1994) of newswires. Figure 2 shows an example of a domain about justice that gathers 69 TUs.</p>
      <p>As some of these domains are close and refer to the same general topic, we have applied a hierarchical classification method, based on their common words, to organize them into separate general topics and to structure them. Figure 3 shows the hierarchies built about sport, police and stock exchange. Each leaf is a domain, named by its two most heavily weighted words, while internal nodes are described by their name and their size, i.e. the number of common words found in their children.</p>
      <p>Figure 3 (node labels recovered from the diagram; a trailing number is the size of an internal node): Team_mate/Championship 11; To_question/Arrest 6; Money/Quarter 27; Pilot/Formula; Tennis/Team_mate 50; Police/Policeman; Prison/Condemn; Dollar/Billion; Rate/Rise; To_beat/Finale; Cycle_race/Stage.</p>
    </sec>
    <sec id="sec-7">
      <title>Three hierarchies of semantic domains</title>
      <p>
        As in [<xref ref-type="bibr" rid="ref4">4</xref>], verbs allow us to categorize nouns. A class is defined by the nouns that play the same role relative to the same verb. In order to learn very homogeneous<fn id="fn3"><label>3</label><p>We call homogeneous a class that contains words that denote the same concept in the corresponding domain.</p></fn> classes, we only apply this principle to words belonging to the same context, i.e. a domain.
      </p>
      <p>
        In order to find the verbs and their arguments in the texts, we use the syntactic analyzer Sylex [<xref ref-type="bibr" rid="ref8">8</xref>], [<xref ref-type="bibr" rid="ref9">9</xref>]. Figure 4 shows a small part of the results of Sylex for a sentence. The first part exhibits lexico-syntactic information for the words, for four different interpretations, indicated by the string “[…]/[…]”, which means an ambiguity rate of 4. This rate is due to the fact that Sylex cannot resolve two ambiguities: the ambiguity of “laisse” between the verb “laisser” (to let) and the noun “laisse” (leash), and the ambiguity of “[…]” between the verb “[…]” and the noun “[…]”. Note that Sylex does not consider the adjectival form, which is the right interpretation here. The second part shows the syntactic links found by Sylex. The numbers in parentheses refer to the words in the preceding analysis. Here Sylex has found the same interpretation four times, once in each of its possible analyses. In this case, we count one occurrence of the link. However, if it finds the same relation several times between a verb and different words, for example several possible subjects, then we keep all the different interpretations because we have no way to choose between them. We make the reasonable assumption that the false interpretations will have far fewer occurrences in the corpus and so will be filtered out during the rest of the processing.
      </p>
      <preformat>******************** Phrase 193-466 ***********************************
"L'état de santé critique du pilote autrichien Karl Wendlinger (Sauber-Mercedes),
victime d'un grave accident jeudi matin lors des premiers essais du Grand Prix de
Monaco de Formule Un, laisse planer une menace sur le déroulement de la course,
dimanche en Principauté."
******************** Partie 1 193-466 taux **************************
"L'état de santé critique du pilote autrichien Karl Wendlinger (Sauber-Mercedes),
victime d'un grave accident jeudi matin lors des premiers essais du Grand Prix de
Monaco de Formule Un, laisse planer une menace sur le déroulement de la course,
dimanche en Principauté."</preformat>
      <preformat>&lt;LexicoSyntactic information&gt;
193-195 (164) "L'" "le" [gs.1,avn,pdet.1] pdet : singulier elision dmaj
195-208 (165) "état de santé" "état de santé" [gs.1,nom.1] nom : masculin singulier mot_compose locsw
...&lt;snip&gt;....
382-388 (203) "laisse" "laisse" [gs.12,nom.1] nom : feminin singulier
389-395 (204) "planer" "planer" [gs.13,verbe] verbe : infinitif
...&lt;snip&gt;....
193-195 (16) "L'" "le" [gs.1,avn,pdet.1] pdet : singulier elision dmaj
195-208 (117) "état de santé" "état de santé" [gs.1,nom.1] nom : masculin singulier mot_compose locsw
...&lt;snip&gt;....
382-388 (211) "laisse" "laisser" [gs.13,verbe] verbe : singulier autoontif antiontif anontif present indicatif subjonctif imperatif
389-395 (212) "planer" "planer" [gs.14,verbe] verbe : infinitif
...&lt;snip&gt;....

&lt;Syntactic Links&gt;
`L'état de santé critique' (164) -&gt;- cn head -&gt;- `du pilote autrichien' (170)
`planer' (204) -&gt;- a2 head -&gt;- `une menace' (205)
...&lt;snip&gt;....
`planer' (153) -&gt;- a2 head -&gt;- `une menace' (154)
...&lt;snip&gt;....
`planer' (161) -&gt;- a2 head -&gt;- `une menace' (162)
...&lt;snip&gt;....
`planer' (212) -&gt;- a2 head -&gt;- `une menace' (213)
`sur le déroulement' (66) -&gt;- cn head -&gt;- `de la course' (235)</preformat>
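      <p>The counting rule for links that recur across the alternative analyses of a sentence can be sketched as follows: a link found in several analyses of the same sentence counts once, while different links are all kept. The data layout (a list of analyses per sentence, each a list of verb, relation, noun tuples) and the function names are assumptions of this sketch, not the format produced by Sylex.</p>
      <preformat>
```python
from collections import Counter


def links_for_sentence(analyses):
    """Collapse the alternative analyses of one sentence into distinct links.

    `analyses` is a list of analyses; each analysis is a list of
    (verb, relation, noun) tuples.
    """
    distinct = set()
    for analysis in analyses:
        distinct.update(analysis)
    return distinct


def count_links(corpus):
    """Count, over all sentences, how often each distinct link occurs."""
    counts = Counter()
    for analyses in corpus:
        counts.update(links_for_sentence(analyses))
    return counts


if __name__ == "__main__":
    # The same link found in four analyses of one sentence counts once.
    link = ("planer", "a2", "menace")
    print(count_links([[[link], [link], [link], [link]]]))
```
      </preformat>
      <p>Because competing interpretations are all kept, a wrong one only disappears later, when its low number of occurrences in the corpus lets the filtering step discard it.</p>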
      <p>The results of Sylex are very detailed and not easy to parse directly with, say, Perl. Furthermore, we do not need all the information it extracts: we only need the verbs with their links and the head nouns that are the arguments of these links. So, we have developed a formal grammar that extracts from these raw analyses the associations between a verb and its arguments. This grammar extracts links from the results of Sylex in the following format:</p>
      <preformat>i # j
verb # token1 # lemma1 # k
rel # token2 # lemma2 # l</preformat>
      <p>where i and j are the boundaries, in the corpus, of the sentence that contains the link; token1 and lemma1 are the token and the lemma of the verb, respectively; rel is the syntactic relation, which can be "subject", "direct object" or a preposition ("to", "from", etc.); token2 and lemma2 are the token and the lemma of the head noun of the noun phrase pointed to by the relation; lastly, k and l are the indexes in the corpus of token1 and token2, respectively. Figure 5 shows some links that we have extracted from the results of Sylex.</p>
      <table-wrap>
        <label>Figure 5</label>
        <table>
          <thead>
            <tr><th>token1</th><th>lemma1</th><th>rel</th><th>token2</th><th>lemma2</th></tr>
          </thead>
          <tbody>
            <tr><td>hang over</td><td>hang over</td><td>subject</td><td>threat</td><td>threat</td></tr>
            <tr><td>play</td><td>play</td><td>object</td><td>cup</td><td>cup</td></tr>
            <tr><td>hear</td><td>hear</td><td>of</td><td>sources</td><td>source</td></tr>
          </tbody>
        </table>
      </table-wrap>
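      <p>Reading such link records back into a program could look like the sketch below. It assumes a three-line record whose fields are separated by "#": the sentence boundaries, then the verb line, then the relation line. This layout, the Link type and the indexes used in the example are assumptions made for illustration, not the exact output of the grammar.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    sentence_start: int
    sentence_end: int
    verb_token: str
    verb_lemma: str
    verb_index: int
    relation: str
    noun_token: str
    noun_lemma: str
    noun_index: int


def split_fields(line, expected):
    """Split a '#'-separated line and check its number of fields."""
    fields = [field.strip() for field in line.split("#")]
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}: {line!r}")
    return fields


def parse_link(record):
    """Parse one three-line record into a Link."""
    lines = [line for line in record.strip().splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError("a link record must have three lines")
    start, end = split_fields(lines[0], 2)
    _, verb_token, verb_lemma, verb_index = split_fields(lines[1], 4)
    relation, noun_token, noun_lemma, noun_index = split_fields(lines[2], 4)
    return Link(int(start), int(end), verb_token, verb_lemma, int(verb_index),
                relation, noun_token, noun_lemma, int(noun_index))


# Hypothetical record; the indexes are illustrative values only.
EXAMPLE = """
193 # 466
verb # hang over # hang over # 389
subject # threat # threat # 400
"""
```
      </preformat>
      <p>Calling parse_link(EXAMPLE) yields a Link whose relation is "subject" and whose noun lemma is "threat", which is the triplet the later aggregation steps work on.</p>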
    </sec>
    <sec id="sec-8">
      <title>Examples of extracted links</title>
      <p>Sylex, like other syntactic analyzers, has difficulties with some constructions and consequently introduces errors that can cause problems for the rest of the system. A common error is the misinterpretation of the passive form, which causes a subject to be analyzed as a direct object and, conversely, a direct object to be viewed as a subject. Another common error is that Sylex often does not find any link in a phrase; this is what we will call […]. We will see in Section 5 that we can obtain good results despite these problems thanks to the redundancy needed to validate the links in the next steps of processing. Another consequence of this need for redundancy, however, is that the system must use large quantities of text in order to create classes of a satisfactory size.</p>
      <p>Having obtained the syntactic links in the texts, we want to group them according to the Thematic Unit to which their text segment belongs. So, we define a Syntactic Thematic Unit (STU) as a set of &lt;verb, syntactic relation, noun&gt; structures, i.e. syntactic relations instantiated with a verb and a noun. We will refer to these structures as Instantiated Syntactic Relations, or ISRs. We are able to relate the links extracted from the results of Sylex to the words contained in the domains because each domain in the thematic memory remembers which thematic units have been used to create it. In the same way, each thematic unit remembers the part of text it comes from.</p>
      <p>In order to build groups of words with very similar meanings, we want to group the nouns that appear with the same syntactic role in relation to a verb inside a domain. A Structured Domain (SD) is then a set of &lt;verb, syntactic relation, noun 1, …, noun n&gt; structures, i.e. aggregated ISRs.</p>
      <p>The STUs related to the same domain are aggregated together to form a Structured Domain. Aggregating an STU into an SD consists of:
- aggregating their ISRs that contain the same verb;
- adding new ISRs, i.e. adding new verbs with their arguments, each made of a syntactic relation and the lemmatized form of a noun.</p>
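      <p>The two aggregation cases listed above can be sketched with nested counters. The StructuredDomain class, its method names and the example nouns are hypothetical; the sketch only mirrors the behaviour described in the text, namely incrementing the occurrence numbers of known verbs and nouns and adding new verbs, relations and nouns with an occurrence of 1.</p>
      <preformat>
```python
from collections import defaultdict


class StructuredDomain:
    """Verbs linked, through syntactic relations, to classes of nouns."""

    def __init__(self):
        self.verb_counts = defaultdict(int)
        # verb -> relation -> noun lemma -> number of occurrences
        self.classes = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def add_isr(self, verb, relation, noun):
        # A known verb or noun has its count incremented; a new one starts at 1.
        self.verb_counts[verb] += 1
        self.classes[verb][relation][noun] += 1

    def aggregate(self, stu):
        """Aggregate an STU, given as an iterable of (verb, relation, noun)."""
        for verb, relation, noun in stu:
            self.add_isr(verb, relation, noun)

    def noun_class(self, verb, relation):
        """Return the sorted nouns playing `relation` with `verb`."""
        return sorted(self.classes.get(verb, {}).get(relation, {}))


if __name__ == "__main__":
    domain = StructuredDomain()
    # Hypothetical ISRs for a sport domain.
    domain.aggregate([("to play", "object", "cup"),
                      ("to play", "object", "match"),
                      ("to lose", "object", "match")])
    print(domain.noun_class("to play", "object"))
```
      </preformat>
      <p>Keeping the counts per verb, relation and noun is a design choice of this sketch: it makes each class directly readable as the set of nouns stored under one verb and one relation.</p>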
      <p>
        Figure 6 shows the aggregation of an SD and three ISRs. This example shows all the possible effects of the aggregation. In the figure, bold elements represent new or updated data. Aggregating an ISR into an SD that already contains the verb of the ISR increments the occurrence number of the verb, as for “to play” in the example. Similarly, the occurrence numbers of the nouns already related to the verb by the same relation are updated (as for […]), and new relations with their associated nouns are added to the verb; in the example, the subject […] is added. An ISR with a new verb is simply added with an occurrence of 1, as for &lt;to lose, object, […]&gt;.
      </p>
      <p>Figure 6 (verbs and relations only). Syntactic Domain source: to play [4], relations “object” and “with”; to win [2], relations “subject” and “object”. Instantiated Syntactic Relations sources: to play, subject; to play, object; to lose, object. Syntactic Domain result: to play […].</p>
      <p>The classes of nouns in the produced SDs contain many words that disturb their homogeneity. These words often belong to parts of the TUs at the origin of the SD that are not closely related to the described topic. Either they result from an error of the topic segmentation process, or they correspond to a meaning of a verb that is scarcely used in the current context. Another possibility is that the ISR results from an error of Sylex. As these cases seldom recur for the same words in the same context, their nouns are weakly weighted in the corresponding domains. This characteristic gives us a means to filter the class content: each noun whose weight is lower than a threshold is removed from the class. By this selection, we reinforce the learning of classes of words according to their contextual use. Figure 7 shows two aggregated links, as first obtained without filtering in its upper part and their filtered counterparts in its lower part. The class associated with the verb ‘[…]’ has been completely removed, as the weights of both ‘[…]’ and ‘[…]’ are lower than the threshold, while the class related to the verb ‘[…]’ by the ‘[…]’ link has been reduced by removing ‘[…]’. This example shows that the filtering is effective: the verb ‘[…]’ and the words ‘[…]’ and ‘[…]’ are not closely related to the domain of ‘[…]’ from which this example is taken, and the usage of ‘[…]’ has a very low probability. More details on the effects of the filtering process are given in Section 5.</p>
      <p>In principle, the described operations are not very complicated. The difficulties come from the need to work with data coming from various tools. Furthermore, for performance and practical reasons, we do not apply the chain of tools text by text. The natural way to see the process would be to:
• read a text,
• extract the TUs from it,
• extract the corresponding STUs,
• add each TU to its domain,
• add each STU to its corresponding domain,
and, after the processing of all the texts, to filter the classes.</p>
      <p>In fact, each computing step is performed on the entire corpus and the results are then aligned. This saves computation time, as we do not have to run each tool multiple times. However, we have to deal with dictionaries and indexes for various files and tools.</p>
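      <p>The weight-based filtering of the classes, applied once all the texts have been processed, can be sketched as follows. The data layout, the function name filter_classes and the example weights are hypothetical. Treating a noun that is absent from the domain as having a weight of 0.0 and dropping a class that loses all its nouns are assumptions of this sketch.</p>
      <preformat>
```python
def filter_classes(classes, weights, threshold):
    """Remove from each class the nouns whose domain weight is too low.

    `classes` maps (verb, relation) to a collection of noun lemmas and
    `weights` maps each word of the domain to its weight.
    """
    filtered = {}
    for key, nouns in classes.items():
        kept = sorted(noun for noun in nouns
                      if weights.get(noun, 0.0) >= threshold)
        if kept:  # a class left without nouns disappears
            filtered[key] = kept
    return filtered


if __name__ == "__main__":
    # Hypothetical classes and weights for a nuclear domain.
    classes = {("to replace", "object"): ["combustible", "rod", "trousers"],
               ("to wear", "object"): ["trousers", "film"]}
    weights = {"combustible": 0.4, "rod": 0.3, "trousers": 0.02, "film": 0.01}
    print(filter_classes(classes, weights, threshold=0.1))
```
      </preformat>
      <p>Raising the threshold removes more nouns and more whole classes, so it trades the number of classes for their homogeneity.</p>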
      <p>The goal of the experiments<sup>4</sup> we have conducted was to show that SVETLAN’ learns classes of words that obviously belong to the same concept in the domain. To obtain such results, we have chosen to run our system on one month of […] wires, which form a corpus that is stylistically coherent but covers varied subjects with very polysemous and non-specific verbs.</p>
      <p>
        These wires are made of 4,500,000 words and 48,000 sentences
in 6,000 texts. The thematic analysis gives 8,000 TUs aggregated
in 2,000 domains. More details on these domains can be found in
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. From these 48,000 sentences, 117,000 different Instantiated
Syntactic Links are extracted by Sylex. 24,000 of these links
concern subject, direct object, or circumstantial complements
introduced by a preposition and are integrated in 1,531 Structured
Domains.
      </p>
      <p>After aggregation, but before filtering, the system obtains 431 aggregated links with two or more arguments, equivalent to 431 word classes. Some of them, such as […], are good. Nevertheless, other classes are heterogeneous, such as […] (here strip comes from the Gaza Strip), or clearly mix different meanings of a verb, like […], which mixes together the meanings "to leave a place" and "to retire from an institution". For the two latter cases, it is useful to take into account the fact that the domains contain words with different weights representing their relevance to the domain: the higher the weight, the higher the relevance of the word in the domain. So we apply the aforesaid filter to our classes and retain only the nouns whose weights are higher than a threshold. The class […] is corrected to […] and […] is removed.</p>
      <p>Among the wrong classes, some are due to errors of Sylex, as […], where […] should be linked to […] by the preposition […]. The others are due to the extensive use of two different meanings of the verb in the same domain, as for […] (in French: "conduire une négociation/une délégation"). This kind of error is inherent to the method we use and should be removed by other means. Note that we judged the correctness of the links manually ourselves. The precision measure used below is the ratio between the number of good classes and the total number of classes. We cannot define a recall measure because we have no way of knowing which classes we miss. To our knowledge, there is no existing resource with associated classes that would allow us to formally judge the results.</p>
      <p><sup>4</sup> Please note that all the tests have been made in French. So the English examples that appear here are translations from French.</p>
      <p>We have tried two thresholds: 0.05 and 0.1. Figure 8 details the results for both. After filtering, many classes are removed, but the remaining classes are well founded in most cases. An example of a class retained for both thresholds is […].</p>
      <p>With a threshold set to 0.1 rather than to 0.05, we retain only 38 links, but we gain 8% in precision. If we ignore the errors due to Sylex, the real precision of SVETLAN’ is 78% in the first case and 87% in the second case. These figures show how much the choice of the threshold matters.</p>
      <p>Our experiments lead to homogeneous classes containing words that denote the same concept, though these classes contain few words. To see directly the benefit of building and clustering classes of words under the guidance of the domain they belong to, it is interesting to see what kind of classes would be obtained by merging all the domains, that is to say, by creating context-free classes. So, we have applied the same aggregation principle to the same corpus, but without taking the domains into account. Just below, we show two classes for the verb “to replace”. The top one is context-free and the bottom one is built inside a domain. This verb is very general: virtually everything can be replaced!</p>
      <p>Context-free class: text, constitution, trousers, combustible, law, dinar, rod, film, circulation, judge, season, device, parliament, battalion, police, president, treaty.</p>
      <p>Class built inside a domain: combustible, rod.</p>
      <p>The first group of words merges very different senses, while the second class, much smaller, is better because it contains words referring to very similar concepts: a rod of enriched uranium is nuclear combustible (i.e. fuel), thus the words “combustible” and “rod” actually denote the same concept in the nuclear domain. Another example is the following, for the verb “[…]”:</p>
      <p>Context-free class: talk, prize, decoration, pope, responsibility, television, attempt, letter, contract, ministry, jury, funds, authority, note, bonus, band, bombing.</p>
      <p>Class built inside a domain: prize, decoration.</p>
      <p>Obtaining meaningful classes with a corpus such as […] shows the efficiency of our method. Moreover, obtaining cohesive classes for very general and polysemous verbs is an encouraging result.</p>
      <p>At this time, the classes are small: they do not contain many words. A way to enlarge them could be to group classes that are related to the same verb by the same syntactic relation in two domains belonging to the same hierarchy, i.e. to the same more general context, assuming that the words always have the same meaning. However, this method has to be tested on more results in order to prove its reliability. With our results, we would build for example (law, constitution, article, disposition) in the domain of “[…]” and (rebel, force, northerner, leader) in the domain of “[…]”.</p>
      <p>SVETLAN’, in collaboration with SEGAPSITH, allows the automatic learning of structured semantic domains. Instead of being described only by sets of weighted words, semantic domains are described by a set of verbs related to classes of words by a syntactic link. Besides, we can also view this base as a set of semantic classes, each one being related to its context of interpretation.</p>
      <p>As SVETLAN’ works with very specific domains, it builds small classes. In order to generalize them, we could apply a process analogous to that of ASIUM, which merges classes independently of the related verbs according to a similarity measure, even if, in our case, this generalization process would operate within the same general domain. Afterwards, ASIUM asks an expert to validate its results.</p>
      <p>Words are often polysemous or ambiguous. However, when used in context, they denote only one meaning, and moreover this meaning is generally the same in different occurrences of the same context. When building classes of nouns according to their contextual use, we avoid mixing all the meanings of a word, either for the verbs or for the nouns. Such a result can be seen in the classes (law, constitution) and (law, article, disposition) in the juridical context, where the words “[…]”, “[…]” and “[…]” do not attract synonyms of their other meanings such as “[…]”, “[…]” and “[…]”, for example.</p>
      <p>Many works are dedicated to the formation of classes of words. These classes have very different statuses: they can contain words belonging to the same semantic field or near-synonyms.</p>
      <p>
        WordNet [<xref ref-type="bibr" rid="ref1">1</xref>] is a lexical database made by lexicographers. It aims at representing the senses of most of the lexicon. It is composed of Synsets. A Synset is a set of words that are synonymous. These Synsets are linked by IS-A relations. Its coverage is large but this is, in a sense, a shortcoming, as its classes are too large and do not refer to precise meanings. Indeed, the generality of its contents makes it difficult to use in real-sized applications, which are often centered on a domain. It can rarely be used without a lot of manual adaptation.
      </p>
      <p>
        IMToolset, by Uri Zernik [<xref ref-type="bibr" rid="ref2">2</xref>], extracts, for a word, several clusters of words from text. Each of these clusters reflects a different meaning of the studied word. This extraction is done by scanning the local contexts of the word, i.e. the 10 words surrounding it in the texts. These signatures are statistically analyzed and clustered. The result is groups of words that are similar to our domains but more focused on the sense of a single word.
      </p>
      <p>
        We have already pointed out some characteristics of ASIUM, by D. Faure and C. Nedellec [<xref ref-type="bibr" rid="ref4">4</xref>], and we give here some more details. ASIUM learns subcategorization frames of verbs and ontologies from text using syntactic analysis and a conceptual clustering algorithm. It analyzes texts with Sylex and creates basic clusters of words appearing with the same verb and the same syntactic role or preposition, as SVETLAN’ does. These basic classes are then clustered to create an ontology by means of a cooperative learning algorithm. The main difference with SVETLAN’ is this cooperative generalization part: ASIUM depends on the expert, who has to validate, and possibly to split, the clusters made by the algorithm. This approach is justified for specialized technical texts, but ASIUM, applied to texts such as […] wires, would certainly not be able to extract good basic classes. Furthermore, as each word does not occur often in these texts, the distance is not appropriate for grouping our classes. Conversely, on technical texts and with the cooperation of an expert, ASIUM will certainly obtain better results than ours from the point of view of domain coverage.
      </p>
      <p>The system we propose, SVETLAN’, in conjunction with SEGAPSITH and the syntactic parser Sylex, extracts classes of words from raw text. These classes are created by gathering the nouns that appear with the same syntactic role with the same verb inside a context. This context is built by aggregating texts about similar subjects. The first experiments carried out give good results, but they also confirm that a large volume of data is necessary in order to extract a large quantity of lexical knowledge by analyzing syntactic distributions. Moreover, the very low recall of the syntactic parser and its systematic errors on some constructions, for example the passive form, which is very common in the journalistic style of our corpus, reduce the number and the size of the classes. To solve this problem, we envisage trying another analyzer or adding to Sylex a post-processing step that detects the passive form by using data already in its output. These adaptations and the study of larger corpora will allow us to obtain good coverage of numerous semantic domains. We will then be able to provide valuable semantic data for many applications, such as information retrieval or word sense disambiguation systems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Christiane</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <article-title>WordNet: an electronic lexical database</article-title>
          , The MIT Press,
          <year>1998</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Uri</given-names>
            <surname>Zernik</surname>
          </string-name>
          ,
          <article-title>TRAIN1 vs. TRAIN2: Tagging Word Senses in Corpus</article-title>
          ,
          <source>RIAO'91</source>
          ,
          <year>1991</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gregory</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <article-title>Explorations in automatic thesaurus discovery</article-title>
          , Kluwer Academic Pub., Boston,
          <year>1994</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>David</given-names>
            <surname>Faure</surname>
          </string-name>
          and
          <string-name>
            <given-names>Claire</given-names>
            <surname>Nedellec</surname>
          </string-name>
          ,
          <article-title>ASIUM, Learning subcategorization frames and restrictions of selection</article-title>
          . In Y. Kodratoff ed.,
          <source>Proceedings of 10th ECML - Workshop on text mining</source>
          ,
          <year>1998</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Ferret</surname>
          </string-name>
          and
          <string-name>
            <given-names>Brigitte</given-names>
            <surname>Grau</surname>
          </string-name>
          ,
          <article-title>A Thematic Segmentation Procedure for Extracting Semantic Domains from Texts</article-title>
          ,
          <source>Proceedings of ECAI'98</source>
          ,
          Brighton
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <article-title>How to thematically segment texts by using lexical cohesion?</article-title>
          <source>Proceedings of ACL-COLING'98 (student session)</source>
          , pp.
          <fpage>1481</fpage>
          -
          <lpage>1483</lpage>
          , Montreal, Canada,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <source>Robust Automated Topic Identification, Doctoral Dissertation</source>
          , University of Southern California, (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <source>Analyse Syntaxique Par Couches. Ph.D thesis</source>
          , École Nationale Supérieure des Télécommunications, April,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <article-title>L'analyseur linguistique SYLEX</article-title>
          .
          <source>5ème école d'été du CENT</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>