=Paper= {{Paper |id=Vol-101/paper-13 |storemode=property |title=Corpus Analysis to Extract Information |pdfUrl=https://ceur-ws.org/Vol-101/Fabrice_Even.pdf |volume=Vol-101 }} ==Corpus Analysis to Extract Information== https://ceur-ws.org/Vol-101/Fabrice_Even.pdf
                      Corpus Analysis to Extract Information
                                                        Fabrice Even
                                                  IRIN - University of Nantes
                                                    2 rue de la Houssinière
                                                    44322 Nantes, France
                                                   even@irin.univ-nantes.fr


ABSTRACT
This article presents a method for automatic information extraction from poor quality domain-specific corpora. The method is based on building a semi-formal ontology in order to model the information present in the corpus and its relations. The approach takes place in four steps: corpus normalization by a correcting process, ontology building from the texts and external knowledge, formalization of the model as a grammar, and the information extraction itself, performed by a tagging process using the grammar rules. After a description of the different stages of our method, an experiment on a French bank corpus is presented.

Keywords
Information extraction, modeling, ontology building, poor quality corpus, corpus correction.

INTRODUCTION
This research stems from the need to extract information from a poor quality corpus (without punctuation, with poor syntax and many abbreviations). In our approach, information extraction is driven by the modeling of the information needed and by its identification in the corpus. The modeling uses external knowledge sources to construct a semi-formal ontology that covers a part of the corpus domain, i.e. only the knowledge actually described in the corpus. This ontology is used to extract the information.
After a brief presentation of different ontology-building methods, our method is presented through a description of its stages: a partial correction of the corpus, ontology building from this corrected corpus and from external knowledge, the representation of the ontology by a grammar, and the information extraction engine based on this grammar. The results are then evaluated and analyzed.
We work on a French corpus composed of bank texts. These texts are compilations of interviews between bank employees and clients. Our goal is to automatically extract specific information about clients (future plans, evolution of their family situation, etc.) in order to expand a database.

1    ONTOLOGY BUILDING METHODS
There are many methods for building an ontology from a corpus. Most of them are based on text content: in these approaches, texts are the main source for knowledge acquisition [10], and concepts and relations result only from a corpus analysis, without external knowledge. Aussenac-Gilles et al. [1] follow this point of view but argue that there can be knowledge sources other than the corpus. Such an approach includes two steps. The first consists of the construction, by a terminological and linguistic analysis of the corpus, of a first set of concepts corresponding to terms (conceptual primitives [10]), and of the extraction of lexical relations. The result of this stage is a base of primitive concepts and first concept relations (a terminological knowledge base [1, 9]). In this stage, the designer has to select the terms and relations that will be modeled, keep those that are relevant and, when a term or relation has several meanings, decide which one must be kept. The second step is based on conceptual modeling through a study of the semantic relations between terms. This analysis yields new concept relations and new concepts, which are added to the first set. This new set of concepts and relations is then structured into a semantic network of concepts. An expert of the corpus domain must validate this network to state which relations are relevant (normalization [5]). The result of the entire process is a hierarchical structure over a set of terms of the domain [12]. This model is an ontology, which can be expressed in a formal or semi-formal representation.
Such a method is powerful for syntactically correct texts but not for poor quality corpora. The terminological step is conceivable after a partial correction of the corpus, but lexical and semantic relation extraction is impossible: linguistic tools are highly ineffective on syntactically and lexically poor corpora.
So for such poor quality corpora, our approach can be based on a terminological analysis for the identification of the first concepts, but a solution other than the classical methods must be found for the other modeling steps (terminological knowledge base and semantic network building).
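To make the two-step construction sketched above concrete, here is a minimal illustration of a terminological knowledge base as a set of primitive concepts plus a relation set. The class names and the toy terms are our own illustration, not taken from the cited systems.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    terms: list[str] = field(default_factory=list)  # lexical forms of the concept

@dataclass
class SemanticNetwork:
    concepts: dict[str, Concept] = field(default_factory=dict)
    # Relations are (source concept, relation label, target concept) triples.
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def add_primitive(self, term: str) -> Concept:
        # Step 1: one primitive concept per selected term.
        concept = self.concepts.setdefault(term.upper(), Concept(term.upper()))
        concept.terms.append(term)
        return concept

    def relate(self, src: str, label: str, dst: str) -> None:
        # Step 2: conceptual modeling adds relations between concepts,
        # to be validated by a domain expert.
        self.relations.add((src.upper(), label, dst.upper()))

net = SemanticNetwork()
net.add_primitive("studio")
net.add_primitive("purchase")
net.relate("purchase", "object", "studio")
print(len(net.concepts), len(net.relations))  # → 2 1
```

The expert validation and normalization stages would then prune or restructure `relations`; only the data shape is sketched here.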
2    CORPUS CORRECTION
The texts of our corpus are characterized by many typographical errors, spelling mistakes and the use of non-standard abbreviations.
These characteristics make a correction and normalization step necessary. Indeed, using the texts directly without correction yields a lot of undesirable information and causes poor coverage of the information. Modeling and information extraction therefore start from the corrected corpus.
This step deals with value formats (normalization of numbers with their unit), dates (a unique numeric representation and treatment of specific abbreviations such as days or months), abbreviation standardization (substitution of a unique form for the abbreviations specific to the texts or to the writers) and orthographical correction of lexical and typographical errors.
This correction process is carried out with a set of contextual rules written after a lexical study of the corpus. Automata are used to apply the rules to the texts.

3    DOMAIN MODEL BUILDING
According to Bachimont [3], there are no concepts independent of context or of the current problem that would allow building the whole knowledge of a particular domain. An ontology works as a theoretical framework of a domain and is built according to a current problem. The modeling process described here is based on this definition. The ontology is built from knowledge found in the corpus and from external knowledge (experts).

3.1    Initial ontology definition
The information searched for is first informally expressed (in natural language) and then converted into predicates. These predicates are described by information patterns. This work must be done with domain experts, or with final users of the information extraction who have a wide knowledge of the domain and are able to express exactly what information must be extracted. This step gives a set of concept hierarchies. This set is the first sub-ontology (the initial ontology), which is composed of predicative relations between concepts (there is a relation between two concepts when one of them is an attribute of the other). These relations are in accordance with the Attribute Consistency Postulate [8]: in each predicative relation, any value of an attribute is also an instance of the concept corresponding to that attribute.

3.2    Terminology definition
The terminology is built from the union of two sets of terms. The first is produced by a terminological study of the texts with linguistic tools such as ANA [6]. The other is built from a set of documents about the domain terminology in which terms that can be used in the texts are found (domain technical documents for example).

3.3    Normalization
There is no direct correspondence between each term and a concept of the initial ontology. But these concepts have to be bound to corpus terms, so a normalization process is necessary. This process takes place in three steps: initial ontology extension, terminological knowledge base (TKB) building and unification of the models.

3.3.1    Initial ontology extension
The initial ontology is revised with domain experts. Some new concepts derived from this first set of concepts are defined and added to the hierarchy. This gives an extended initial ontology.

3.3.2    TKB building
With these same experts and domain-specific documents, a new set of concepts is built from the terminology: the basic concepts. From these basic concepts, others are defined recursively by inheritance. The result is a set of small hierarchies, each with a unique ancestor whose last heir is a basic concept. These hierarchies are normalized: each father is divided into sons by a unique criterion. They also respect the Guarino rigidity property [7].

3.3.3    Models unification
The two preceding processes give an ontology linked to the current problem and a hierarchical structure linked to the texts. We proceed to the unification of these two models. The extended initial ontology is unified with a hierarchy if its ancestor is a concept of the initial ontology or if a relation can be built between this ancestor and concepts of this ontology.

After these three steps, the domain model is obtained; it covers all the concepts relevant for the information search. This model is first described by an oriented graph diagram. It defines a semi-formal ontology because it does not depend on a representation language [4].

4    FORMAL REPRESENTATION
To make the model usable, it is formalized into a grammar. As seen in the last section, two types of relation are found in this model: hierarchical relations and predicative relations. The grammar must represent these two types of relation, so two sorts of rules are defined. On the one hand, constituent rules represent hierarchical relations: when there is a hierarchical relation between two concepts A and B, with B a son of A, we say that B constitutes A. On the other hand, predicative rules represent predicative relations: when there is a predicative relation between two concepts C and D, D is an attribute of C (the type of the attribute depends on the relation). All these rules are written with a BNF-like notation.

4.1    Constituent rules
A concept C is defined by a set of rules Def(C). These rules concern terms or concepts. For each X from Def(C), X only defines C, never another concept. The notation for these rules is C ::= Def(C). There are three sorts of con-
stituent rules: select rules, conjunctive rules and disjunctive rules.

4.1.1    Select rules
The form of select rules is: C ::= B1 | B2 | ... . The concept C is defined by B1 or by B2, but not by both at the same time.
Examples:
 ::=  |  | 
 ::= ford | mercedes | …

4.1.2    Conjunctive rules
The form of conjunctive rules is: C ::= B1 + B2 + ... . The concept C is defined by a set of concepts, all of which are necessary to define C.
Example:  ::=  + 

4.1.3    Disjunctive rules
The form of disjunctive rules is: C ::= B1 v B2 v ... . The concept C is defined by B1, by B2, or by both of them.
Example:  ::=  v 

4.2    Predicative rules
These rules describe predicative concepts (also called predicates), i.e. concepts with attributes. Predicative relations define the links between these concepts and their attributes. Each of these rules defines a predicate by one descriptor and one main attribute: the object. For a given predicate, the descriptor is a unique concept; this means that a concept cannot be the descriptor of more than one rule. The object is one of a set of possible concepts (this set is defined by the model).
These rules can also have optional attributes. These attributes give more information about the predicate but are neither necessary nor sufficient to define it.
Predicative rules are written: P ::= (descriptor = D; object = O1 | O2 | O3 | ... ; option1 = A1 | A2 | A3 | ... ; option2 = B1 | B2 | ... ; ...)
Example: the PURCHASE predicate is described by figure 1.

Figure 1: PURCHASE predicate
 ::=
(
    descriptor =  ;
    object = 
               | 
               | ;
    date = ;
    amount = 
    location = 
)

5    EXTRACTION ENGINE
The extraction engine is based on the grammar modeling of the domain. It proceeds in four steps: rules database creation, two tagging processes and information collection.
The rules database is composed of two sets of rules (constituent and predicative) inferred from the grammar. Constituent tagging is based on the constituent rules of the database (each term and concept is tagged according to the database rules). The second tagging is based on the predicative rules of the database to instantiate predicates. After these steps, the information is collected directly. This information will expand a database.

5.1    Constituent tagging
Constituent tagging finds terms, and then concepts, in the text by recursive application of the constituent rules. Each time a concept is found, a tag marks it up.
Some specific concepts (with a known syntax) such as sums, rates or dates are tagged first. After that, tagging takes place in two steps: term tagging then concept propagation.
Some select rules define concepts from terms. In term tagging, these rules are applied to the corpus (for each rule C ::= t, tags of the concept C mark the term t in the text). When all these rules have been applied, every term of the grammar is tagged by concepts.
With concept propagation, new concepts are found. When a rule A ::= B exists, tags of A are added to the tags of B in the corpus, so the concept A is marked in the texts.
Constituent rules are applied until none of them is applicable. Then the corpus is completely tagged by constituent rules (cf. figure 2).

Figure 2: Example of constituent tagging
The text "buy studio london in 2003" becomes after constituent tagging:
buy
studio

london
in 2003

with the rules:
 ::= buy | bought
 ::= studio | apartment | loft
 ::= london | paris | tokyo
 ::=  | 
 ::=  |  | 

5.2    Predicative tagging
Application of the predicative rules detects in the texts the instances of the grammar predicates. Each time a predicate de-
scriptor is found, the process searches for one of the concepts defined as a possible object of this predicate.
Predicates are instantiated until no more instantiation is possible. This proceeds as follows. The text is processed from left to right. When a predicate's descriptor is recognized, the process looks for a correct object (concept or predicate) for this predicate before the next concept that is a descriptor of another untreated predicate instance. If a correct object is found, the attribute object is given a value for this predicate instance. Next, the process tries to give a value to the optional attributes by looking for correct concepts in the text located between the descriptor of this predicate and the next one. After that, the system treats the next descriptor in the text.
If no correct object is found, this descriptor is skipped and the system immediately treats the next descriptor. This process runs to the end of the text. At this point, if untreated descriptors (which define a predicate instance without a found object) are left, the process is repeated from the beginning of the text. The operation is repeated until no descriptors are left to treat, or only those that cannot be treated. If such descriptors remain, they are marked as defining empty predicate instances (instances without an object).
Predicate instantiation is made through a text tagging process. The system tags the descriptor with a predicate reference (the predicate name and an instance number to distinguish different instances of the same predicate). Each predicate attribute is tagged with the predicate reference and its type (Object, Date, Location...).
Example: from the extract described in figure 2, after applying the predicative rules, we obtain the tagged text and the predicate instance described by figure 3.

Figure 3: Example of predicative tagging

buy
studio

london
in
2003

Therefore we obtain this instance of the PURCHASE predicate:
[
    DESCRIPTOR = buy
    OBJECT = studio
    DATE = 2003
    LOCATION = london
    AMOUNT = ∅
]

A predicate P1 can be the object of another one (P2). In this case the attributes of P2 give values to the attributes of P1 when possible.

6    INFORMATION RETRIEVING
After constituent and predicative tagging, the tags make concepts and relations clearly readable in the corpus. In the retrieving step, all that has to be done is to specify the concepts to be searched for. With the tags, the system can easily locate these concepts and their different attributes. In this step, empty predicates are ignored. This information feeds a database in which the tables correspond to the grammar predicates.

7    RESULTS
We have a corpus of around one million records. Each record is taken from an interview between a client and a bank employee. It is composed of a numerical heading and a text area. The heading contains an identification number and the recording date. The text area is filled with the interview report written by the employee. The text size varies from record to record, from a few words to thirty. Before text analysis, the text area is processed to bring it into conformity with the Data Protection Act. Terminological extraction with ANA yields a first set of 15,000 term candidates. After filtering this set, 1,300 remain. Terminological documents (which contain about 350 terms) give 200 new terms, so the terminological step gives us a set of 1,500 terms.

7.1    Evaluation method
The goal of this research is to extract client events. These events are client projects and proposition refusals (by the bank or by the client). The result is a set of instances of the searched predicates. As a predicate has several attributes, three degrees of precision are defined, which depend on the way these attributes are given a value.
A predicate instance is called valid if the value is correct for the attributes that are given a value (not all the attributes need to be given a value). The validity rate is the number of valid instances per number of instances found.
A valid instance is called totally valid if all of its attributes are given a value, and partially valid if one or more attributes are not given a value.
A partially valid instance is called incomplete when one or more attributes are not given a value because of a process mistake, and complete when all of the missing attributes lack a value because of a lack of information in the corpus.
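These degrees of precision can be sketched as simple checks over extracted instances. The dict-based representation below (attribute → value, with None when no value was given) is an assumption for illustration, not the paper's actual data model:

```python
def is_valid(instance, expected):
    # Valid: every attribute that WAS given a value is correct
    # (unvalued attributes are not penalized here).
    return all(expected.get(attr) == val
               for attr, val in instance.items() if val is not None)

def is_totally_valid(instance, expected):
    # Totally valid: valid, and every attribute has a value.
    return is_valid(instance, expected) and all(
        val is not None for val in instance.values())

def validity_rate(pairs):
    # Number of valid instances per number of instances found.
    valid = sum(1 for inst, exp in pairs if is_valid(inst, exp))
    return valid / len(pairs) if pairs else 0.0

# Toy instance matching the PURCHASE example: amount is unvalued,
# so the instance is valid but only partially valid.
inst = {"object": "studio", "date": "2003", "location": "london", "amount": None}
gold = {"object": "studio", "date": "2003", "location": "london", "amount": None}
print(is_valid(inst, gold), is_totally_valid(inst, gold))  # → True False
```

Distinguishing incomplete from complete partially valid instances would additionally require knowing whether each missing value exists in the corpus, which only the expert alignment provides.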
7.2    Experimentation
Our experiment focuses on a representative sample of 10,000 records taken at random from the corpus. The experiment was carried out with the aim of extracting the clients' projects from this sample. The results are validated by experts who aligned the sample with the table PROJECT of the database filled by our system. According to them, there are 651 projects in this sample. The system detects 623 instances of the predicate PROJECT. These instances are detailed as follows: 589 valid instances, of which 72 are totally valid. The system also found 517 partially valid instances, of which 385 are complete and 132 are incomplete.

7.3    Analysis
The coverage rate is not revealing because most records (95%) do not contain projects. The recall rate (number of instances found per number of instances in the corpus) and the validity rate (respectively 95.7% and 94.5%) are both very satisfactory, but many instances are only partially valid (88% of the valid projects). 74.5% of these instances are partially valid because of missing information in the corpus, but the other 25.5% are imputable to the system. We are at present working to reduce the number of incomplete partially valid instances. To this end, after a predicative tagging from left to right, we are considering repeating the process from right to left, first to complete the optional arguments of the predicate instances already found, and then to try to treat the predicate descriptors left by the left-to-right pass (defining new predicate instances by detecting objects and arguments). After this right-to-left step, we will have to repeat the process from left to right, because some arguments of the new instances can be located to the left of their descriptor.

CONCLUSION
As usual information extraction processes are unusable on poor quality texts, we have described a method to extract information from this type of corpus. This approach is based on an ontology building led by the type of information to be searched for in the texts. Good results have been obtained, with a very wide coverage of the information in each record, even for records containing very little information. This method can easily be applied to other corpora and to other domains. The different parts of our system are based on generic methods and do not need modifications to be used with another corpus. Experiments are currently being carried out to that end.

REFERENCES
[1] Aussenac-Gilles N., Biébow B. and Szulman S., Corpus analysis for conceptual modeling. Proceedings of EKAW'2000, pp. 13-20, Juan-les-Pins, France, 2000.
[2] Aussenac-Gilles N., Bourigault D., Condamines A. and Gross C., How can knowledge acquisition benefit from terminology. Proceedings of the Ninth Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'95), Banff, Canada, 1995.
[3] Bachimont B., Modélisation linguistique et modélisation logique des ontologies : l'apport de l'ontologie formelle. Proceedings of IC'2001, pp. 349-368, Grenoble, France, 2001.
[4] Barry C., Cormier C., Kassel G. and Nobécourt J., Évaluation de langages opérationnels de représentation d'ontologies. Proceedings of IC'2001, pp. 309-327, Grenoble, France, 2001.
[5] Bouaud J., Bachimont B., Charlet J. and Zweigenbaum P., Methodological Principles for Structuring an Ontology. Proceedings of the IJCAI-95 Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, Canada, 1995.
[6] Enguehard C. and Pantéra L., Automatic Natural Acquisition of a Terminology. Journal of Quantitative Linguistics, Vol. 2, No. 1, pp. 27-32, 1995.
[7] Guarino N. and Welty C., A Formal Ontology of Properties. Proceedings of the ICAI-00 Workshop on Applications of Ontologies and Problem-Solving Methods, pp. 12/1-12/8, Las Vegas, United States, 2000.
[8] Guarino N., Concepts, Attributes and Arbitrary Relations: Some Linguistic and Ontological Criteria for Structuring Knowledge Bases. Data & Knowledge Engineering, Vol. 8(2), pp. 249-261, 1992.
[9] Lame G., Knowledge acquisition from texts towards an ontology of French law. Proceedings of EKAW'2000, pp. 53-62, Juan-les-Pins, France, 2000.
[10] Nobécourt J., A method to build formal ontologies from texts. Proceedings of EKAW'2000, pp. 21-27, Juan-les-Pins, France, 2000.
[11] Nestorov S. et al., Representative objects: concise representations of semistructured, hierarchical data. Proceedings of the International Conference on Data Engineering, pp. 79-90, Birmingham, United Kingdom, 1997.
[12] Swartout B., Patil R., Knight K. and Russ T., Towards distributed use of large-scale ontologies. Proceedings of the Tenth Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'96), pp. 32.1-32.19, Banff, Canada, 1996.