=Paper=
{{Paper
|id=Vol-1752/paper04
|storemode=property
|title=
Conceptual Modeling with Formal Concept Analysis on Natural Language Texts
|pdfUrl=https://ceur-ws.org/Vol-1752/paper04.pdf
|volume=Vol-1752
|authors=Mikhail Bogatyrev
|dblpUrl=https://dblp.org/rec/conf/rcdl/Bogatyrev16
}}
==
Conceptual Modeling with Formal Concept Analysis on Natural Language Texts
==
Conceptual Modeling with Formal Concept Analysis
on Natural Language Texts †
© Mikhail Bogatyrev
Tula State University,
Tula
okkambo@mail.ru
Abstract Conceptual modeling is one of the ways of modeling
semantics in the Natural Language Processing (NLP)
The paper presents conceptual modelling technique on [22]. Conceptual modeling is the process of
natural language texts. This technique combines the conceptualization of real world phenomena and creating
usage of two conceptual modeling paradigms: conceptual models as a result of conceptualization.
conceptual graphs and Formal Concept Analysis. Conceptual model is a graph which vertices are concepts
Conceptual graphs serve as semantic models of text and arrows or edges are links between concepts. Every
sentences and the data source for concept lattice – the conceptual model has its own semantics which
basic conceptual model in Formal Concept Analysis. represents the meanings of concepts and links.
With the use of conceptual graphs the Text Mining Conceptual modeling has long been applied for
problems of Named Entity Recognition and Relations databases and software modeling [19] and this term is
Extraction are solved. Then these solutions are applied also used in other fields including NLP. Entity
for creating concept lattice. The main problem Relationship Diagram (ERD) [19] is well known
investigated in the paper is the problem of creating representative of conceptual models. It describes the
formal contexts on a set of conceptual graphs. Its solution structure of database in terms of entities, relationships,
is based on the analysis of semantic roles and conceptual and constraints. These terms of entities, relationships,
patterns in conceptual graphs. Concept lattice built on and constraints are explicitly or implicitly present at
textual data is applied for knowledge extraction. many other conceptual models including ones discussed
Knowledge, sometimes interpreted as facts, can be in this paper.
extracted by using navigation in the lattice and Formal Concept Analysis (FCA) [13] is the paradigm
interpretation its concepts and hierarchical links between of conceptual modeling which studies how objects can
them. Experimental investigation of the proposed be hierarchically grouped together according to their
technique is performed on the annotated textual corpus common attributes. In the FCA, its conceptual model is
consisted of descriptions of biotopes of bacteria. the lattice of formal concepts (concept lattice) which is
built on the abstract sets treated as objects and their
†The paper concerns the work which is partially attributes. Concept lattices have been applied as an
supported by Russian Foundation of Basic Research, instrument for information retrieval and knowledge
grant № 15-07-05507 extraction in many applications. The number of FCA
applications now is growing up including applications in
1 Introduction social science, civil engineering, planning, biology,
psychology and linguistics [21], [22]. Several successful
implementations of FCA methods on fact extraction on
Knowledge extraction from textual data requires textual data [8] and Web data are known [15]. Although
more in-depth intensive analysis of this data. In the area the high level of abstraction makes FCA suitable for use
of Text Mining, some variants of knowledge extraction with data of any nature, its application to specific data
have been realized by solving such problems as often requires special investigation. It is fully relevant for
sentiment analysis, fact extraction and decision making using FCA on textual data.
support. To solve these problems it is necessary to have The main problem in creating concept lattice on
models that reflect semantics of textual data. It is textual data is building so called formal contexts on this
especially urgent when this data is presented as data. Formal context is matrix representation of the
unstructured natural language texts. relation on the sets of objects and attributes. So it is
needed to acquire words or word combinations from
texts which are interpreted as objects and attributes. To
restrict all possible combinations of words of such
Proceedings of the XVIII International Conference
meanings we need to select from them those ones which
«Data Analytics and Management in Data Intensive
are valued for solving concrete problem or the class of
Domains» (DAMDID/RCDL’2016), Ershovo, Russia,
problems. As a result a concept lattice created on texts
October 11 - 14, 2016
16
becomes domain specific. This is similar to the design of properties of A : {m M | g , m I g A}
ontologies and concept lattice is often considered as and B : {g G | g , m I m B} then the pair
framework of ontology [21].
Another paradigm of conceptual modeling is (A, B) that A B, B A is named as formal concept.
Conceptual Graphs (CGs) [24]. Conceptual graph is The sets A and B are closed by composition of mappings:
bipartite directed graph having two types of vertices: A '' A, B '' B ; A and B is called the extent and the
concepts and conceptual relations. Conceptual terms of intent of a formal context K = (G, M , I ) respectively.
entities and relationships are represented in conceptual By other words, a formal concept is a pair (A, B) of
graphs as its concepts and conceptual relations. subsets of objects and attributes which are connected so
Conceptual graphs have been applied for modeling that every object in A has every attribute in B, for every
many real life objects including texts. Acquiring object in G that is not in A, there is an attribute in B that
conceptual graphs from natural language texts is non- the object does not have and for every attribute in M that
trivial problem but it is quite solvable [3], [5]. is not in B, there is an object in A that does not have that
The main purpose of this paper is to show how two attribute.
paradigms of conceptual modeling - Conceptual Graphs The partial orders established by relations ф and Р
and Formal Concept Analysis - can be united in one
on the set G and M induce a partial order ≤ on the set of
modeling technique. The idea of joining these two
formal concepts. If for formal concepts (A1, B1) and (A2,
paradigms seems very attractive but not elaborated much
enough [22], [26]. B2), A1 ф A 2 and B2 Р B1 then (A1, B1) ≤ (A2, B2) and
Proposed technique is used in on-going project of formal concept (A1, B1) is less general than (A2, B2). This
creating fact extraction system working on biomedical order is represented by concept lattice. A lattice consists
data. Experimental investigation of it is performed on the of a partially ordered set in which every two elements
annotated textual corpus consisted of descriptions of have a unique supremum (also called a least upper bound
biotopes of bacteria. or join) and a unique infimum (also called a greatest
lower bound or meet).
2 CGs-FCA modeling According to the central theorem of FCA [13], a
collection of all formal concepts in the context
The proposed modeling technique named briefly as CGs- K = (G, M , I ) with subconcept-superconcept ordering
FCA modeling is based on using conceptual graphs and
≤ constitutes the concept lattice of K . Its concepts are
concept lattice. It may be applied for knowledge
extraction from textual data. In CGs-FCA modeling subsets of objects and attributes connected each other by
conceptual graphs serve as semantic models of text mappings A , B and ordered by a subconcept-
sentences and the data source for formal context of superconcept relation.
concept lattice. Concept lattice built on textual data is
applied for knowledge extraction. Knowledge,
sometimes interpreted as facts, can be extracted by using
navigation in the lattice and interpretation its concepts
and hierarchical links between them.
To illustrate CGs-FCA modeling, consider some
FCA basics.
2.1 Formal Concept Analysis basics
There are two basic notions FCA deals with: formal
context and concept lattice [13]. Formal context is a
triple K = (G, M , I ) , where G is a set of objects, M –
set of their attributes, I G M – binary relation
which represents facts of belonging attributes to objects.
The sets G and M are partially ordered by relations ф
and Р , correspondingly: G = (G, ф ) , M (M , Р ) .
Formal context may be represented by [0, 1] - matrix
K = {ki , j } in which units mark correspondence
Figure 1 Example of formal context and concept lattice.
between objects gi G and attributes m j M . The
concepts in the formal context have been determined by To illustrate these abstract definitions consider an
the following way. If for subsets of objects A G and example. Figure 1 shows simple formal context and
concept lattice composed on the sets G = {DNA, Virus,
attributes B M there are exist mappings (which may Prokaryotes, Eukaryotes, Bacterium} and M =
be functions also) A : A B and B : B A with {Membrane, Nucleus, Replication, Recombination}. The
17
set G is ordered according to sizes of its elements: DNA using corpus tagging and semantic models of
is smallest and bacterium is biggest ones. The set M has texts [10].
relative order: one part (Membrane, Nucleus)
characterizes microbiological structure of objects from We apply the second approach and use conceptual
G, but another part (Replication, Recombination) graphs for representing semantics of individual sentences
characterizes the way of breeding, and these parts are of a text.
incomparable. In the concept lattice the bacterium is
placed in the concept C1 = ({Prokaryotes, Eukaryotes, 2.2 The modeling process
Bacterium}, {Membrane, Replication}). In this concept,
three objects {Prokaryotes, Eukaryotes, Bacterium} Consider in general the process of CGs – FCA modeling.
constitute the extent of the concept; they are united by It includes the following steps.
their mutual attribute {Membrane, Replication} which 1. Acquiring a set of conceptual graphs from input
constitute the intent of the concept. The concept C1 is texts. As it is mentioned above conceptual graphs can be
more general concept than the concept C2 = acquired from texts by existing information systems. For
({Eukaryotes}, {Nucleus}). example they can be created by our system CGs Maker
1
Also on the Fig. 1 there are two different branches of . Some details about it can be found in [3], [5]. We use
concepts characterizing two families: the viruses and verb-centered approach for creating conceptual graphs.
DNA and prokaryotes, eukaryotes and bacteria. This According to this approach, a conceptual graph is
concept demonstrates the fact of separation of objects constructed so that there is the central concept in it which
from the set G into two important branches. The link is realized as a verb. If there are no verbs in a sentence
between them is the attribute “Membrane”. It is known then method also creates conceptual graph. Verb-
[7] that viruses can have a lipid shell formed from the centered approach is important for us since it provides
membrane of the host cell. Therefore, the membrane is predicate forms in the structures of conceptual graphs.
positioned in the formal context on the Fig. 1 as an These forms are mostly used for representing conceptual
attribute of the virus. graph semantics.
This example demonstrates specific ways of 2. Aggregating the set of conceptual graphs.
extracting knowledge from conceptual lattice: Aggregation is needed to exclude excessive dimension of
analyzing formal concepts in concept lattice; conceptual models, not related to useful information. We
analyzing conceptual structures in concept have tested following ways of conceptual graphs
lattice – its sub lattices in the general case. aggregation: conceptual graphs clustering and using
corpus tagging together with support of concept types in
conceptual graphs. Clusters of conceptual graphs need to
2.1.1 FCA on textual data be semantically interpreted which may lead to additional
investigations. The second method is more constructive
The main problem in applying FCA on textual data is the since it selects those conceptual graphs which concepts
problem of building formal context. If textual data is have mappings to certain domain. Such domain of terms
represented as natural language texts then this problem may be presented by corpus tagging or by thesaurus.
becomes acute. Some details of aggregation are below.
There are several approaches to the construction of 3. Creating formal contexts. This is the central point
formal contexts on the textual data, presented as separate of CGs – FCA modeling. One or several formal contexts
documents, as data corpora. One, mostly applied variant have been built on the aggregated conceptual graphs. The
is the context in which the objects are text documents and number of formal concepts and the method of building
the attributes are the terms from these documents. them depend on the problem being solved with CGs –
Another variant is building formal context directly on the FCA modeling.
texts and the formal context may represent various 4. Building concept lattice. Having a formal context
features of textual data: as input data, a concept lattice may be created by using
semantic relations (synonymy, hyponymy, various algorithms. There is a field of research in FCA
hypernymy) in a set of words for semantic devoted to creating and developing algorithms for
matching [16], concept lattice creation [21]. On the current stage of CGs
– FCA modeling technique we use standard solution of
verb-object dependencies from texts [10],
creating concept lattice realized in the open source tool
words and their lexico-syntactic contexts [20].
[27]. Nevertheless, here there are certain possibilities to
These lexical elements must be distinguished in texts as create new algorithms oriented on specific structure of
objects and attributes. There are following approaches formal contexts acquired from conceptual graphs. One of
to solve this problem: such structure is block-diagonal structure which arises
namely on using textual data as input.
adding special descriptions to texts which mark 5. Knowledge extraction from concept lattice. In
objects and attributes and partial order, concept lattice it is possible to identify connections
1
The lightweight online version of CGs Maker for
simple English and Russian texts can be found at http://85.142.138.156:8888 .
18
between its concepts according to the principle of similar “semantically”. That means that their concepts
"common – particular". Each concept may be interpreted have the same type. For example different names of
as the set of potential facts of certain level, which is bacteria belong to the type “bacterium” or the type “the
associated with other facts. So the knowledge extracted name of bacteria”.
from concept lattice may be interpreted a s facts. The second way of conceptual graphs aggregation is
based on supporting types of concepts by using external
2.3 Aggregation of conceptual graphs resources. Thesaurus or corpus tagging may be such
resource. Section 3 contains additional details.
In the theory of conceptual graphs aggregation means
replacing conceptual graphs by more general graphs [24]
. These general graphs may be created as new graphs or 2.4 Creating formal contexts
may be graphs or sub graphs from initial set of graphs.
Aggregation of conceptual graphs has semantic meaning The crucial step in the described process of CGs – FCA
and general graphs make up the context (not formal modeling is creating formal contexts on the set of
context) of initial set of graphs. conceptual graphs.
Clustering is a way of aggregation of conceptual At first glance, this problem seems simple: those
graphs. Graphs which are the nearest ones to the centers concepts of conceptual graphs which are connected by
of clusters have been treated as general graphs. "attribute" relation have been put into formal context as
We have studied several approaches for clustering its objects and attributes. Actually the solution is much
conceptual graphs [2] using various similarity measures. more complex.
There are two known similarity measures proposed in
[17], the conceptual similarity Fig. 2 shows an example of conceptual graph for the
sentence “Burkholderia phytofirmans belongs to the
2n( c )
sc beta-proteobacteria and was isolated from surface-
n( 1 ) n( 2 ) sterilized glomus vesiculiferum-infected onion roots.”
(1)
and relative similarity
2m( c )
sr
m c ( 1 ) m c ( 2 ) . (2)
Here 1 , 2 - conceptual graphs, c 1 2 is their
common sub graph, n ( i ) - number of concepts of graph
i , m ( ) - number of relations of graph i , m ( )
i c i
is the number of relations of conceptual graph i , at least
one of which belongs to the common sub graph c .
If two conceptual graphs have identical concepts then
their conceptual similarity has non zero value. Relative
similarity is non-zero when two conceptual graphs have
identical structures of patterns of conceptual relations.
We used conceptual and relative similarities (1), (2)
Figure 2 Conceptual graph for the sentence
and their combination in the experiments of conceptual
“Burkholderia phytofirmans belongs to the beta-
graphs clustering [2]. Except traditional algorithms of
proteobacteria and was isolated from surface-sterilized
clustering such as K-means, we used genetic clustering
glomus vesiculiferum-infected onion roots.”
algorithm with special encoding. The peculiarity of
implementing genetic algorithms for clustering is that
This graph has five conceptual relations “attribute”
there may be several final solutions i.e. several different
but only four of them indicate the objects and attributes
variants of clustering.
valid for formal context. Using “phytofirman” as object
All numerical characteristics of conceptual graphs
and “Burkholderia” as attribute in the formal context is
clustering results (number of clusters, dimensions of
wrong way because “Burkholderia phytofirmans” is
clusters, etc.) are not informative. Clusters of conceptual
known full name of this bacterium [6] and full names of
graphs need to be semantically interpreted. The way of
bacteria have to be objects in a formal context devoted to
that interpretation depends on the nature of the problem
bacteria. Word combinations denoting the names of
to be solved with conceptual graphs.
bacteria must be recognized before the building of
Both conceptual and relative similarity measures
conceptual graphs. There is no other way of doing this
share a common sub graph c . But two conceptual than to use an external source of information, for
graphs may have no common sub graph but may be example, the corpus tagging. So in this example the sub
19
graph - (attribute) - < burkholderia > is by huge amount of scientific publications in
useless for bacteria names recognizing. Bioinformatics and organizing them into corpora with
Remaining elements of conceptual graph on the Fig. access to full texts of articles via such systems as
2 are not useless and play significant roles in creating PubMed [25]. Information resources of PubMed have
formal context. Conceptual graph on the Fig. 2 represents been united in several subsystems presenting databases,
two facts: corpora and ontologies.
1. bacterium Burkholderia phytofirmans belongs to So called “research community around PubMed” [14]
beta-proteobacteria; forms data intensive domain in this area. It not only uses
2. this bacterium infects the onion. data from PubMed but also creates new data resources
To provide the presence information about those and and data mining tools including specialized languages
other facts in the formal contexts the following rules are for effective biomedical data processing [11].
implemented as mostly important when creating formal In our experiments we use PubMed vocabulary thesaurus
contexts. MeSH (Medical Subject Headings) as external resource
1. Not only individual concepts and relations, but also for supporting types of concepts in conceptual graphs.
patterns of connections between concepts in
conceptual graphs represented as sub graphs have 3.2 Data structures
been analyzed and processed. These patterns are
predicate forms