Application of Formal Contexts in the Analysis of Heterogeneous Biomedical Data

Mikhail Bogatyrev, Dmitry Orlov
Tula State University, Tula, Russia

Abstract
The paper proposes a method of conceptual modeling based on the use of formal contexts. The formal context is the central notion of Formal Concept Analysis (FCA), a lattice-based approach to data analysis. Biomedical data and the tasks of analyzing it within Biomedical Natural Language Processing are discussed. Two variants of formal contexts constructed on natural language texts are considered: contexts constructed with the use of keywords and contexts constructed with n-grams. Textual n-grams are acquired by using conceptual graphs and Abstract Meaning Representation (AMR) schemata. Both kinds of contexts are applied to the text clustering task. It is shown that classical FCA clustering on a keyword-based context produces a "bag-of-words" in clusters. We propose a clustering approach for n-gram-based multidimensional contexts that avoids the appearance of a bag-of-words in clusters. The method was tested on the abstracts of scientific articles from the PubMed databases.

Keywords
conceptual modeling, conceptual graphs, Biomedical Natural Language Processing, polyadic formal context

1. Introduction

Data analysis tasks are diverse. A task that is often a priority in data analysis is clustering. Clustering combines data into subsets (clusters) according to some proximity measure, which simplifies their further analysis.

This work relates to two areas that are directly and indirectly related to clustering: Formal Concept Analysis (FCA) [1] and Text Mining. Classical FCA studies sets of data related by an "object-attribute" relationship. This data forms the formal context on which the concept lattice is built. The formal concepts that make up the lattice are linked by a general-to-particular relationship and form a hierarchical conceptual data model.

A special feature of FCA is the mathematical rigor of the proposed solutions and their universality. The formal context, the central notion of FCA, is defined on arbitrary sets, so it can be applied to data of any nature. This advantage of FCA has a flip side: the need to adapt FCA to specific tasks, which often requires special research. As a result of such research, many applications of FCA are known in a variety of fields, from biology and natural sciences, software engineering and public networks to computational linguistics. The FCA application review [15] contains a structured analysis of them and is still relevant.

The current state of FCA is characterized by the use of multidimensional formal contexts and corresponding generalizations of FCA notions in data analysis problems. Formal concepts based on two-dimensional formal contexts represent a special solution to the data clustering problem: biclustering [10]. The use of multidimensional formal contexts extends cluster analysis to three-dimensional and n-dimensional clustering.
An important result here is progress in solving the well-known problem of cluster interpretability: n-dimensional clusters are more representative than ordinary ones, since they are built simultaneously on several sets.

In this paper, FCA clustering on heterogeneous biomedical data is investigated. The heterogeneity of biomedical data consists in the use of texts along with numerical values. These can be textual designations of drugs, genes, bacteria, etc., as well as natural language texts. We consider variants of formal contexts for text clustering problems, including contexts constructed using n-grams. We propose a method for constructing formal contexts on textual data in which n-grams are obtained from conceptual graphs and correspond to the Abstract Meaning Representation model of text [6]. The constructed formal contexts are then used in the task of clustering biomedical data.

It is shown that the use of standard FCA clustering algorithms on such contexts leads to the appearance of a "bag-of-words" in clusters, which makes them difficult to interpret. A version of clustering using n-gram data associations is proposed, which avoids the "bag-of-words" and allows interpreting clusters, in the context of data queries, as meaningful phrases.

The paper is organized as follows. Section 2 briefly introduces the main definitions of Formal Concept Analysis; Section 2.1 describes polyadic formal contexts and multimodal clusters. Section 3 contains a brief review of biomedical data analysis, including Biomedical Natural Language Processing in Section 3.1. Section 4 is devoted to constructing polyadic formal contexts on natural language texts: the use of keywords as attributes in a formal context is described in Section 4.1, and in Section 4.2 we discuss semantic features of formal contexts. In Section 5, the results of an experimental study applying the two variants of formal contexts to clustering are presented. Section 6 compares our results with related work. In Section 7 we conclude and discuss future work.

2. Elements of Formal Concept Analysis

We briefly recall the main definitions of FCA. Classical FCA [1] deals with two basic notions: the formal context and the concept lattice. A formal context is a triple K = (G, M, I), where G is a set of objects, M is a set of their attributes, and I ⊆ G × M is a binary relation which represents the facts of attributes belonging to objects. A formal context may be represented by a {0, 1}-matrix K = {k_ij} in which ones denote the relationship between objects g_i ∈ G and attributes m_j ∈ M.

Concepts in a formal context are defined in the following way. For subsets of objects A ⊆ G and attributes B ⊆ M, introduce the derivation mappings

A' := {m ∈ M | (g, m) ∈ I for all g ∈ A},
B' := {g ∈ G | (g, m) ∈ I for all m ∈ B}.

A pair of subsets (A, B) such that A' = B and B' = A is called a formal concept. The sets A and B are called the extent and the intent of the formal concept, respectively. In other words, a formal concept is a pair (A, B) of subsets of objects and attributes connected so that every object in A has every attribute in B; for every object in G that is not in A, there is an attribute in B that the object does not have; and for every attribute in M that is not in B, there is an object in A that does not have that attribute.
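As an illustration of these definitions, the following sketch implements the derivation mappings A' and B' and enumerates all formal concepts of a toy context by brute-force closure of object subsets. The context data here is illustrative only, not taken from the paper.

```python
from itertools import combinations

# A toy formal context K = (G, M, I); I is a set of (object, attribute) pairs.
G = {"t1", "t2", "t3"}
M = {"gene", "mutation", "protein"}
I = {("t1", "gene"), ("t1", "mutation"),
     ("t2", "gene"), ("t2", "mutation"), ("t2", "protein"),
     ("t3", "protein")}

def prime_objects(A):
    """A' = the set of attributes shared by every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def prime_attributes(B):
    """B' = the set of objects that have every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}

def is_formal_concept(A, B):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return prime_objects(A) == B and prime_attributes(B) == A

# Enumerate all formal concepts by closing every subset of G (brute force;
# practical FCA algorithms avoid this exponential enumeration).
concepts = set()
for r in range(len(G) + 1):
    for A in combinations(sorted(G), r):
        B = prime_objects(set(A))
        concepts.add((frozenset(prime_attributes(B)), frozenset(B)))

for A, B in concepts:
    print(sorted(A), sorted(B), is_formal_concept(set(A), set(B)))
```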
If for formal concepts (A1, B1) and (A2, B2) we have A1 ⊆ A2 and B2 ⊆ B1, then (A1, B1) ⩽ (A2, B2), and the formal concept (A1, B1) is less general than (A2, B2). This order makes the set of concepts a lattice, called the concept lattice. A lattice is a partially ordered set in which every two elements have a supremum (also called a least upper bound, or join) and an infimum (also called a greatest lower bound, or meet).

2.1. Polyadic FCA

Polyadic, or multidimensional, FCA is based on the notion of a multidimensional formal context. A multidimensional, n-ary formal context is defined by a relation R ⊆ D1 × D2 × … × Dn on data domains D1, D2, …, Dn. The context is an (n+1)-tuple

K = <K1, K2, …, Kn, R>,   (1)

where Ki ⊆ Di. Every n-ary context begets k-ary contexts, whose number is given by the Stirling formula S(n, k) = (1/k!) ∑_{i=0}^{k} (−1)^i C(k, i) (k − i)^n [7]. As shown in [19], a multidimensional n-ary context also contains formal concepts, which likewise form a lattice.

Already on two-dimensional formal contexts, and especially on multidimensional ones, not only formal concepts are of interest, but also "insufficiently dense concepts": two-dimensional, three-dimensional, and n-dimensional clusters. These clusters may contain useful information. By introducing the notion of bicluster density [11], one can investigate various biclustering options and estimate their significance [12]. An important result here is the statement proved in [11] that each concept contains a cluster, but the opposite is not true.

Clustering on multidimensional formal contexts is called multimodal clustering. In multimodal clustering, for any dimension of the formal context, the purpose of its processing is to find n-sets H = <X1, X2, …, Xn> which have the closure property [7]:

∀u = (x1, x2, …, xn) ∈ X1 × X2 × … × Xn: u ∈ R,   (2)
∀j = 1, 2, …, n, ∀xj ∈ Dj \ Xj: <X1, …, Xj ∪ {xj}, …, Xn> does not satisfy (2).

The sets H = <X1, X2, …, Xn> constitute multimodal clusters.

When solving any clustering problem, a proximity measure of the objects being clustered is used. In FCA clustering, the proximity of objects is given by their relation R, so it is actually written into the formal context: those objects are close to each other that have common attributes, and vice versa for attributes and objects.
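The closure property (2) can be checked directly. The sketch below, on a toy triadic context with purely illustrative data, tests whether a candidate triple of sets is full (every combination belongs to R) and maximal in every dimension, i.e., whether it is a multimodal cluster.

```python
from itertools import product

# Toy triadic context: R is a subset of D1 x D2 x D3 (illustrative data).
D = [{"t1", "t2"}, {"gene", "mutation"}, {"mouse", "human"}]
R = {("t1", "gene", "mouse"), ("t1", "mutation", "mouse"),
     ("t2", "gene", "mouse"), ("t2", "mutation", "mouse"),
     ("t2", "gene", "human")}

def is_full(X):
    """First condition of (2): every tuple of X1 x ... x Xn belongs to R."""
    return all(u in R for u in product(*X))

def is_multimodal_cluster(X):
    """Second condition of (2): no component can be extended keeping fullness."""
    if not is_full(X):
        return False
    for j in range(len(X)):
        for x in D[j] - X[j]:
            extended = [set(component) for component in X]
            extended[j].add(x)
            if is_full(extended):
                return False  # X_j could be extended, so X is not maximal
    return True

H = [{"t1", "t2"}, {"gene", "mutation"}, {"mouse"}]
print(is_multimodal_cluster(H))  # True: full and maximal in every dimension
```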
3. Tasks and Methods of Heterogeneous Biomedical Data Analysis

One of the areas where NLP applications are becoming more in demand is Bioinformatics. Data in Bioinformatics is often heterogeneous: it includes numeric and symbolic sequences as well as texts. Biomedical Natural Language Processing (BioNLP) [8] is a new area of research in Bioinformatics whose appearance was due to the avalanche-like growth of publications in the field of biomedicine.

3.1. Biomedical Natural Language Processing

The main purpose of BioNLP is to obtain new knowledge from published texts that is not completely contained in any individual publication. Initially, the main area of application of BioNLP methods was genomic studies. Over time, the subject matter of texts processed by BioNLP has expanded to other areas. BioNLP was formed as a research area with its own data, tasks and methods [8, 9]. They are summarized as follows.

BioNLP Data and Resources. The main types of data used in BioNLP are the texts of scientific publications, usually the abstracts of these publications. Along with the lexical and grammatical elements common to unstructured texts, they have their own specifics: characteristic terms (for example, names of genes), abbreviations, and the inclusion of numerical data in the text. A distinctive feature of biomedical data is synonymy: the same concept may be expressed using different words in a text. For example, "heart attack" and "myocardial infarction" refer to the same medical problem.

Natural language texts are usually assumed to be unstructured. The peculiarity of biomedical textual data is that it is actually semi-structured. The most used resource of modern BioNLP systems is the PubMed system [20]. This online knowledge base includes many special databases. PubMed is simultaneously an ontology and a text corpus with extra linguistic tagging. The tagging options are limited to hyperlinks to other publications cited in a work, as well as to publications of similar content. The tagging makes it easier to solve a number of problems on corpus data, for example, clustering problems.

Knowledge Sources. In addition to the data itself, biomedical information resources contain so-called knowledge sources. Among them is the Medical Subject Headings (MeSH) [13], which contains controlled vocabulary terms organized as a tree structure. Another important resource is the Unified Medical Language System [18], a compendium of controlled vocabularies. It has knowledge source databases and associated software tools for use by developers of BioNLP systems. Thus, almost any BioNLP task is solved not only directly on the texts, but also with the involvement of external resources.

BioNLP Tasks and Methods. All BioNLP tasks may be classified as more or less general. The general task of knowledge extraction is transformed into the tasks of fact extraction and event extraction; often these terms are not distinguished [2]. The atomic tasks, solved as part of the solution of more general ones, are Named Entity Recognition and Relation Extraction.

Named Entity Recognition. Named Entity Recognition (NER) is a standard task of BioNLP. It consists in automatically identifying occurrences of biological or medical terms in unstructured text. The named entities are names of genes, proteins, living organisms or diseases, depending on the domain to which the processed text belongs. NER typically consists of a three-stage process [9] that involves:

• determining an entity's substring boundaries within the text,
• assigning the entity to a predefined class or category, and
• selecting the preferred name or unique identifier of the concept that the entity names.

The performance of NER solutions is measured in terms of precision, recall, and F-score [9], since this task can be interpreted as a classification task.
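A minimal sketch of these three stages is given below, using a small hypothetical gene dictionary in place of the MeSH/UMLS lookups a real system would rely on; the dictionary entries and identifiers are invented for illustration.

```python
import re

# Hypothetical dictionary mapping surface forms to a class and an identifier;
# a real BioNLP system would consult MeSH/UMLS resources instead.
GENE_DICT = {
    "shp-2": ("Gene", "GENE:0001"),
    "tnf":   ("Gene", "GENE:0002"),
}

def recognize_entities(text):
    """Dictionary-based NER: boundaries, class, preferred identifier."""
    entities = []
    for form, (cls, ident) in GENE_DICT.items():
        for m in re.finditer(re.escape(form), text, flags=re.IGNORECASE):
            entities.append({
                "span": (m.start(), m.end()),  # stage 1: substring boundaries
                "text": m.group(0),
                "class": cls,                  # stage 2: predefined category
                "id": ident,                   # stage 3: unique identifier
            })
    return entities

sentence = "We describe how SHP-2 attenuates function in a mouse model."
for entity in recognize_entities(sentence):
    print(entity)
```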
Relation Extraction. Relation Extraction (RE) is another standard task of BioNLP. Relations are associations among biomedical entities. The simplest relations are binary, involving only pairwise associations between two entities, but biomedical relationships can involve more than two entities; this kind of relationship is relevant in the task of event extraction. In our time, named the genomic era, much BioNLP work has focused on automatically extracting interactions between genes and proteins. Other associations include interactions between proteins and mutations, proteins and their binding sites, genes and diseases, and genes and phenotypic context [8, 9].

Events (Facts) Extraction. As noted, events and facts are often not distinguished in the BioNLP literature. Strictly speaking, it is appropriate to consider a fact as a static object and to attribute some duration to an event. An additional distinction is that events can be nested. It is also known from the BioNLP literature that events are typically characterized by verbs or nominalized verbs in the text [6, 10]. This is roughly true because there exists a verb-centric model of meaning, according to which the meaning of a sentence is primarily reflected by its verbs. But in general, facts or events are identified by means of objects that are external to the text.

BioNLP Methods. All methods for solving BioNLP problems can be divided into two types: methods that work at the level of individual words and sentences of the text, and methods that use models that are external to the text. Such models include syntactic models, for example parse trees, as well as models of text semantics. The methods working "inside" the text are typical for the tasks of linguistics, where, for example, the peculiarities of the use of certain lexical elements in texts are studied. The tasks of information retrieval usually require both types of methods. For example, the solution of the NER problem described above includes three stages, the second and third of which involve the use of external data and conceptual models. The extraction of an entity's boundaries in the text is based on the linguistic notion of context: a region of text that surrounds selected elements of a sentence and is usually associated with a specific content of a text fragment. The notion of formal context used in FCA is not tied to specific boundaries in the text, which significantly expands the possibilities of text analysis using this notion.

Text Mining as a general technique is applied in BioNLP systems; however, special approaches and methods are also being developed here [2, 8, 9].

The statistical approach is the oldest one in BioNLP and has been applied both in the NER and in the RE tasks. It is based on the idea that if entities are repeatedly mentioned together, then there is a greater chance that they are related in some way. But the type and direction of this relation cannot be determined by co-occurrence statistics alone.
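A minimal sketch of such co-occurrence counting over per-abstract entity sets follows; the abstracts and entities are illustrative, and in a real pipeline the entity sets would come from an NER step.

```python
from collections import Counter
from itertools import combinations

# Illustrative entity sets, one per abstract (normally produced by NER).
abstracts = [
    {"SHP-2", "osteopetrosis", "mutation"},
    {"SHP-2", "mutation"},
    {"TNF", "osteopetrosis"},
]

pair_counts = Counter()
for entities in abstracts:
    # Count each unordered entity pair once per abstract.
    for pair in combinations(sorted(entities), 2):
        pair_counts[pair] += 1

# Frequently co-occurring pairs are candidate relations; note that the
# type and direction of a relation cannot be read off these counts.
for pair, n in pair_counts.most_common():
    print(pair, n)
```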
The rule-based approach uses linguistic patterns connected with particular relations. Unlike systems based on statistical term co-occurrences, the rule-based approach demonstrates high precision and low recall [6]. The rules used for relation extraction can be manually defined by domain experts, or they can be derived from annotated corpora by machine learning algorithms.

The classification-based approach, together with dictionary-based methods, is also frequently used to identify relations involving medical entities. Dictionaries, thesauri and ontologies constitute the set of external resources applied here [21, 22].

4. Constructing Polyadic Formal Contexts

Consider two approaches to building formal contexts on textual data: using keywords and using n-grams based on conceptual graphs. These approaches do not depend on the biomedical data domain: they use features of any text rather than the standard tasks of BioNLP.

4.1. The Use of Keywords

Keywords are a long-standing and frequently used tool in linguistics and Text Mining. They are still used in modern models, for example in the Word2Vec model [14] and in thematic text representation models. These models construct vectors containing the frequencies of occurrence of keywords in texts, which are compared using a proximity measure, often the cosine of the angle between the vectors. Similar proximity measures are also used in text clustering tasks.

Consider a formal context K_w = (T, W, I) that is built using keywords. Let T be a set of texts and W a set of keywords. The context matrix is binary, with elements that reflect the fact that keywords belong to certain texts. Each formal concept (A, B) in this context is a combination of elements of the sets T and W: A ⊆ T, B ⊆ W. It reflects the division of texts into subsets according to the occurrence of keywords. The set of formal concepts ∪_i (A_i, B_i) forms a lattice that defines a hierarchy of texts according to the presence of keywords in them.

This solution to the text clustering problem is constructed without the use of traditional linguistic proximity measures. The advantages of this clustering method compared to standard clustering methods are that there is no need to set the number of clusters in advance and that a cluster hierarchy is present in the form of the concept lattice. Compared to standard hierarchical clustering methods, this method works faster because it does not require multiple calculations of the proximity measure.

The disadvantage of using keywords in a formal context is the appearance of a "bag-of-words" in clusters: regardless of the method of obtaining keywords from texts, they are sets of words that are not related in meaning. If more than one text is included in a concept, then, having a "bag-of-words" in the concept, one can determine which word belongs to which text only by referring to the original formal context. So the problem of interpreting clustering results has no solution in this case.
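A minimal sketch of building the binary context K_w = (T, W, I) from raw texts is shown below; the texts and keyword list are illustrative, and keyword extraction itself is assumed to be a separate preceding step.

```python
# Build the binary context K_w = (T, W, I): matrix[t][j] = 1 iff
# keyword j occurs in text t (illustrative texts and keywords).
texts = {
    "PubMed-1": "a mutation causing osteopetrosis in mouse",
    "PubMed-2": "intracranial pressure and mutation effects",
    "PubMed-3": "intracranial aneurysm study",
}
keywords = ["mutation", "osteopetrosis", "intracranial"]

def build_context(texts, keywords):
    matrix = {}
    for name, body in texts.items():
        tokens = set(body.lower().split())  # naive tokenization
        matrix[name] = [1 if w in tokens else 0 for w in keywords]
    return matrix

for name, row in build_context(texts, keywords).items():
    print(name, row)
```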
4.2. Preservation of Semantics in Formal Contexts

To avoid the "bag-of-words", we need to use in the formal context objects that reflect the semantics of texts to some extent. Such objects include n-grams: sets of words in the form of sequences that have a certain meaning. In this paper, we use n-grams extracted from texts by constructing conceptual graphs [17] and corresponding to the Abstract Meaning Representation (AMR) of the text [6].

Conceptual graphs constitute a semantic model of text that belongs to the class of semantic networks. They play an important role as a conceptual modeling tool in the fields of mathematical linguistics, bioinformatics, and mathematical logic. A conceptual graph is a finite oriented connected bipartite graph [17] which has two different kinds of nodes: concepts and conceptual relations. Figure 1 shows a fragment of the conceptual graph for one of the sentences of the texts processed in our experiments, together with the marked elements of the AMR scheme. In the conceptual graph in Fig. 1, concepts are represented by rectangles and conceptual relations by ellipses. We used conceptual graphs in a number of studies [3, 4]. We obtain conceptual graphs similar to the one shown in Fig. 1 using a method [3] based on the solution of the Semantic Role Labeling problem [5]. The algorithm for acquiring conceptual graphs from text has the following main steps.

1. Dividing the text into sentences.
2. Dividing sentences into words, punctuation marks, and other symbols. Deleting stop words.
3. Determining the morphological features of words in sentences.
4. Defining semantic roles as conceptual relations in the conceptual graph. At this stage, lexico-semantic templates are used.
5. Constructing the conceptual graph visualization.

Figure 1: Fragment of the conceptual graph for the sentence "We describe a mouse with a different osteopetrosis-causing mutation."

The Abstract Meaning Representation of a text [6] is defined as a directed tree graph that fixes a certain concept in the text in such a way that sentences that have the same meaning from the point of view of this concept have the same AMR graph. The AMR graph usually corresponds to an AMR schema in the form of a phrase, for example: "who" – "what does" – "with whom". This scheme is a three-element one. An AMR schema can correspond to the entire text or to individual sentences and may have various numbers of elements. There are several approaches to building the AMR of a text; using conceptual graphs allows one to build quite complex AMR schemata.

An AMR scheme is constructed as a tuple <c1, c2, …, cn> whose elements are the concepts of a conceptual graph connected by conceptual relations corresponding to the meaning of the phrase of the AMR scheme. For the AMR scheme "who" – "what does" – "with whom", such relations are the well-known semantic roles of "agent" and "patient". In Fig. 1, the elements of the "who" – "what does" – "with whom" AMR scheme are marked; it is made up of the concepts <"SHP-2", "attenuates", "function">, which together form a meaningful phrase. Conceptual graphs allow us to define AMR schemes uniquely in the form of n-grams <c1, c2, …, cn>, in which the length n of the scheme is equal to the number of conceptual graph concepts involved in constructing the scheme. As a result, such n-grams constitute meaningful phrases.

A polyadic formal context based on AMR schemata is constructed as follows (a small sketch of this construction is given after the list).

1. A conceptual graph is constructed for each sentence of the processed text.
2. A specific AMR schema is created based on the elements of the conceptual graph.
3. The formal context is constructed as a multidimensional tensor. Its points k_{i,j,…,n} = {c_i, c_j, …, c_n} are the elements c_k, k = 1, 2, …, N, of the AMR scheme for each sentence, where N is the total number of concepts obtained on the processed text.

The number of points in the formal context matches the number of AMR schemata found in the text. The vast majority of points in the formal context are meaningful phrases, which is an important feature of this method.
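The sketch below illustrates this construction under the simplifying assumption that each sentence has already been reduced to role-to-concept pairs by the conceptual-graph pipeline; the parsed sentences here are hypothetical stand-ins for steps 1 and 2.

```python
# Hypothetical output of steps 1-2: each sentence pre-parsed into
# role -> concept, as a semantic-role-labelling pipeline would produce.
parsed_sentences = [
    {"agent": "SHP-2", "verb": "attenuates", "patient": "function"},
    {"agent": "mutation", "verb": "causes", "patient": "osteopetrosis"},
    {"agent": "SHP-2", "verb": "causes", "patient": "osteopetrosis"},
]

# The "who" - "what does" - "with whom" three-element AMR scheme.
SCHEME = ("agent", "verb", "patient")

def build_polyadic_context(sentences, scheme):
    """Step 3: the n-dimensional context as a sparse tensor, represented
    by its set of points <c_1, ..., c_n>, one point per AMR schema."""
    return {tuple(s[role] for role in scheme)
            for s in sentences
            if all(role in s for role in scheme)}

context = build_polyadic_context(parsed_sentences, SCHEME)
for point in sorted(context):
    print(point)  # each point is a meaningful phrase, not a bag of words
```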
5. Experimental Studies

Experimental studies of the developed approach were performed on the texts of the Active Gene Annotation Corpus (AGAC), which contains abstracts of scientific articles on biomedical topics from the PubMed system. The corpus was created for the BioNLP Shared Tasks 2019 competition and was offered as a data set for NER and RE tasks. The corpus contains 1000 unprocessed abstracts, and its size is about 300,000 tokens. The considered approach to clustering using formal contexts was applied to the task of studying the interrelations of texts.

5.1. Clustering Using Keywords

The first variant of clustering was performed using keywords. The experiment included the following stages.

1. Finding keywords in the set of texts.
2. Constructing a formal context K = (G, M, I), where G is the set of text names and M is the set of keywords.
3. Generating the concept lattice for the context.
4. Comparing the results of clustering with standard k-means clustering.

Options limiting the number of keywords in the range from 5 to 20 words were studied. As a result, the following regularity was obtained: increasing the number of keywords leads to more links appearing between texts. The link configuration in the concept lattice has a hierarchy that allows one to evaluate the generality/particularity of texts in terms of their keywords. To avoid a bulky presentation, we limit it to results for five texts. Figures 2a and 2b show, in the form of concept lattices, examples of clustering constructed for five texts and two sets of 5 and 20 keywords. The figures show the increase in the number of links between texts with the number of keywords: the concept lattice in Fig. 2b is more multilinked than the one in Fig. 2a.

Figure 2: Concept lattices constructed for five texts and two sets of 5 (a) and 20 (b) keywords.

Formal concepts in both lattices contain a "bag-of-words". So for solving the NER task, additional analysis of the words belonging to formal concepts is required. The relation extraction (RE) task has a general solution acquired from the lattices, which demonstrate how texts are linked by keywords.

Comparison with k-means clustering. As is known, the k-means clustering method produces as many clusters as the variable k specifies. In our example, k_min = 1, k_max = 5. The formal context in Fig. 3 demonstrates which words belong to which texts.

Figure 3: Formal context for five texts and 20 keywords.

It is clear from the context that all five variants of clustering are possible. The concept lattice in Fig. 2b also shows all possible variants of clustering. Indeed, the five concepts in Fig. 2b that have text names in light rectangles make up five clusters. Next, we see that, for example, the two concepts with the text names PubMed-28507206.txt and PubMed-28488085.txt can form a single cluster if we take into account their common keyword "intracranial", located higher in the lattice. This principle of fixing clusters is applied to the whole concept lattice: moving up the lattice, we add keywords to the concepts in the next node, leaving unchanged the text names that were in the lower nodes. Thus, FCA clustering with the use of keywords demonstrates an advantage over k-means clustering: FCA clustering potentially reveals all variants of clustering.

5.2. Clustering Using Polyadic Formal Contexts

In the next experiments, formal contexts constructed using n-grams, as described in Section 4.2, were studied. Three-, four-, and five-element n-grams were used, constructed according to the conceptual graphs and corresponding to AMR schemata. Fig. 4 shows the five-element AMR scheme which was used to construct the formal context. Formal contexts obtained on such n-grams are n-dimensional tensors whose points are combinations of words that have a certain meaning. FCA clustering of such contexts is possible with known FCA algorithms, for example the OAC [12] or Data Peeler [7] algorithms. However, such clustering will again lead to the appearance of a "bag-of-words" in clusters. In fact, if a point k_{i,j,…,n} = {c_i, c_j, …, c_n} of a formal context falls into a cluster, its elements (words) are combined into subsets with words from other points, forming the following cluster structure:

C = {{c_i, …, c_j}, …, {c_k, …, c_l}},   (3)

where the number of sublists is n, the length of the n-gram, and each sublist contains a "bag-of-words". Therefore, instead of standard FCA clustering, a different version of clustering was used, oriented toward application in question-answering systems.
In such systems, the formal context is the basis of the information resource that the system's users access. User queries are texts in the form of phrases that may correspond to AMR schemes. To process such queries, the structure of an n-dimensional formal context is transformed into a set of associations.

Figure 4: Example of a five-element AMR scheme. Verb is the main verb, a concept in the conceptual graph; "Agent", "Patient", "Attribute" are semantic roles; Concept_1, 2, 3, 4 are concepts of the conceptual graph.

An association is a set of points of the formal context ordered relative to a selected position of the AMR scheme. This corresponds to the logic of the AMR scheme: associations combine certain semantic elements in it. An association includes all words in the selected position of the AMR scheme. Therefore, an association is a cluster built on the basis of the proximity measure "belongs to a position of the AMR scheme as a certain grammatical element of a sentence". On the other hand, an association is a function A(x1, …, xp) whose argument can be a given word or a set of p words belonging to the k-th position of the AMR scheme. Associative queries are made to associations: queries that fix one of the variables x1, …, xp. Responses to such queries contain data belonging to points of the formal context according to the semantic meanings of the variable words x1, …, xp. The use of associations makes it possible to avoid the "bag-of-words" in text processing.
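A minimal sketch of associations and associative queries is given below, under the assumption that each context point carries the number of its source text, as in Fig. 5; all point data is illustrative.

```python
# Context points as (agent, verb, patient, text_number) tuples; the text
# number is carried along as in Fig. 5 (illustrative data).
context = {
    ("SHP-2", "attenuates", "function", 95),
    ("mutation", "causes", "osteopetrosis", 97),
    ("mutation", "affects", "development", 51),
}

def build_association(context, position):
    """Group context points by the word in the selected AMR-scheme
    position; each group is a cluster of points, not a bag of words."""
    association = {}
    for point in context:
        association.setdefault(point[position], set()).add(point)
    return association

def query(association, *words):
    """Associative query: fix one or more words at the selected position."""
    return set().union(*(association.get(w, set()) for w in words))

assoc = build_association(context, position=0)  # first scheme position
for point in query(assoc, "mutation"):          # e.g. "How does the mutation manifest itself?"
    print(point)
```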
The experiments included the following stages.

1. Building associations based on the selected positions of the AMR scheme of the formal context.
2. Creating queries to associations based on keywords that interest the user.
3. Getting query results as clusters containing points of the formal context.
4. Interpretation of the clusters.

Fig. 5 shows a fragment of the association built for the second position (Concept_1) of the AMR scheme in Fig. 4, in which the keyword "mutation" is highlighted. The natural numbers in the association lists are the numbers of the texts on which the AMR schemata are based. Most of the corpus texts are devoted to the study of various manifestations of mutation and its impact on organisms. Therefore, queries to associations were made using the keyword "mutation". The query results are clusters, which were also used for building nested associations.

Figure 5: Fragment of an association based on the AMR scheme.

The question that determines further actions with the resulting clusters is: how does the mutation manifest itself? This query was implemented on the clusters by building associations relative to the fourth position of the five-element AMR scheme. The keywords and text numbers obtained in the constructed associations were then presented for analysis.

The answers to queries to the associations are generated in table form. If the result of a query to associations for two elements is presented as a cross-table, it is interpreted as a two-dimensional formal context. In this case, it can be visualized as a concept lattice according to classical FCA. Fig. 6 shows the classical concept lattice built on the results of a query to associations for two elements.

Figure 6: Concept lattice based on the results of a query to associations.

The lattice in Fig. 6 allows one to evaluate the relationships of texts in the context of the word "mutation" and by means of the words used in the texts, marked in colored rectangles, which are attributes of text objects according to FCA. Some lattice concepts are related hierarchically: these are the concepts that include texts 95, 97, 51, 138, and 231. With standard FCA clustering, the data that makes up such hierarchically related concepts would generate a "bag-of-words" in clusters. The use of associations and their further visualization in the form of a concept lattice allows one to correctly investigate the relationships between texts.

Using this kind of clustering, the solution of the Named Entity Recognition (NER) and Relation Extraction (RE) tasks becomes more definite. These tasks may be solved with the use of corresponding associations. Associations built relative to the "who" and "with whom" elements of the AMR scheme are the most suitable for the NER task, and for the RE task the "verb" element of the AMR scheme together with the "who" and "with whom" elements is suitable as well.

6. Related Work

As mentioned previously, this work relates to the two areas of FCA and BioNLP, in which there is a significant number of works devoted to text analysis and clustering.

Within FCA, there are not many works devoted to text analysis. The work [23] contains a description of the general FCA approach to the problems of linguistics. Other works, for example [24], are devoted to solving individual problems there. Our work differs in that it uses a special formal context constructed from n-grams that have a semantic meaning. The use of n-grams is common in text analysis, but the use of conceptual graphs for this purpose, and accordingly the production of meaningful n-grams, is not described in these works.

FCA establishes its own approach to clustering, based on flat and polyadic formal contexts and supported by various algorithms for constructing concept lattices [7, 11, 12]. The review [25] contains descriptions of almost all FCA models and methods. Another approach to clustering, often applied in text analysis, is based on the use of vector models such as Word2Vec, Doc2Vec, etc. [14, 26]. These models construct vectors containing the frequencies of occurrence of keywords in texts, which are compared using a proximity measure, often the cosine of the angle between the vectors. FCA clustering has the advantage of not requiring a Euclidean proximity measure on the objects being clustered.

The clustering used in this work is distinguished by the fact that it has a "semantic coloring". Associations are built in such a way that they reveal the contents of the data, fixing their topic in the form of a semantic position of the AMR scheme. Abstract Meaning Representation is applied in BioNLP works, for example in [27], where it is used to extract combinations of words interpreted as events. This paper develops this approach towards its application in question-answering systems.

7. Conclusion

This paper describes a method for clustering multidimensional formal contexts built on natural language texts. The method uses conceptual graphs as the data source for AMR schemes. It should be noted that the use of conceptual graphs allows building AMR schemata of greater length than those considered in the paper. This will allow one to implement multidimensional formal contexts that reflect the content of the modelled text more fully and, accordingly, to extract more complete information from it. The method can be used in question-answering systems where queries in natural language correspond to the logic of AMR schemes.
A method for constructing formal contexts on textual data, in which n-grams are obtained from conceptual graphs and correspond to the Abstract Meaning Representation model of the text, has been proposed. The novelty of the work is as follows. First, we applied a special type of formal context based on the use of n-grams. Second, conceptual graphs are used to extract meaningful n-grams from the text. Third, non-standard clustering in the form of building associations is used to avoid a "bag-of-words" as the result of text clustering. Future work is oriented toward realizing these results in a prototype question-answering system.

Acknowledgments

The reported study was funded by the Russian Foundation for Basic Research according to research project No. 19-07-01178 and by RFBR and Tula Region according to research project No. 19-47-710007.

References

[1] Ganter, B., Stumme, G., Wille, R. (eds.): Formal Concept Analysis: Foundations and Applications. Lecture Notes in Artificial Intelligence, vol. 3626. Springer-Verlag, Berlin, 2005.
[2] Ananiadou, S., Pyysalo, S., Tsujii, J., Kell, D. B.: Event extraction for systems biology by text mining the literature. Trends in Biotechnology, vol. 28, no. 7, 2010.
[3] Bogatyrev, M. Y., Mitrofanova, O. A., Tuhtin, V. V.: Building Conceptual Graphs for Articles Abstracts in Digital Libraries. In: Proceedings of the Conceptual Structures Tool Interoperability Workshop (CS-TIW 2009) at the 17th International Conference on Conceptual Structures (ICCS'09), pp. 50-57, 2009.
[4] Bogatyrev, M.: Fact Extraction from Natural Language Texts with Conceptual Modeling. Communications in Computer and Information Science, vol. 706. Springer-Verlag, 2017.
[5] Gildea, D., Jurafsky, D.: Automatic Labeling of Semantic Roles. Computational Linguistics, vol. 28, 2002.
[6] Bos, J.: Expressive Power of Abstract Meaning Representations. Computational Linguistics, 42(3), 2016.
[7] Cerf, L., Besson, J., Robardet, C., Boulicaut, J. F.: Closed patterns meet n-ary relations. ACM Transactions on Knowledge Discovery from Data, 3(1), 2009.
[8] Cohen, K. B., Demner-Fushman, D.: Biomedical Natural Language Processing. John Benjamins Publishing Company, Philadelphia, 2014.
[9] Demner-Fushman, D., Cohen, K., Ananiadou, S., Tsujii, J.: Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistics, 2019.
[10] Hartigan, J. A.: Direct clustering of a data matrix. Journal of the American Statistical Association, vol. 67, no. 337, 1972.
[11] Ignatov, D. I., Kuznetsov, S. O., Zhukov, L. E., Poelmans, J.: Can triconcepts become triclusters? International Journal of General Systems, vol. 42, no. 6, 2013.
[12] Ignatov, D. I., Gnatyshak, D. V., Kuznetsov, S. O., Mirkin, B. G.: Triadic Formal Concept Analysis and triclustering: searching for optimal patterns. Machine Learning, April 2015.
[13] Medical Subject Headings, https://www.nlm.nih.gov/mesh/meshhome.html
[14] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[15] Poelmans, J., Ignatov, D. I., Kuznetsov, S. O., Dedene, G.: Formal concept analysis in knowledge processing: A survey on applications. Expert Systems with Applications, vol. 40, no. 16, 2013.
[16] Simpson, M. S., Demner-Fushman, D.: Biomedical Text Mining: A Survey of Recent Progress. In: Aggarwal, C. C., Zhai, C. (eds.) Mining Text Data. Springer, 2011.
[17] Sowa, J. F.: Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks Cole Publishing Co., Pacific Grove, CA, 2000.
[18] Unified Medical Language System, https://www.nlm.nih.gov/research/umls/
[19] Voutsadakis, G.: Polyadic concept analysis. Order, 19(3), 2002.
[20] U.S. National Library of Medicine, PubMed, http://www.ncbi.nlm.nih.gov/pubmed
[21] Biomedical natural language processing: tools and resources, http://bio.nlplab.org
[22] MetaMap, a tool for recognizing UMLS concepts in text, https://metamap.nlm.nih.gov
[23] Priss, U.: Linguistic Applications of Formal Concept Analysis. In: Ganter, B., Stumme, G., Wille, R. (eds.) Formal Concept Analysis: Foundations and Applications. LNAI 3626, Springer-Verlag, 2005.
[24] Falk, I., Gardent, C.: Combining Formal Concept Analysis and Translation to Assign Frames and Thematic Grids to French Verbs. In: Napoli, A., Vychodil, V. (eds.) CLA 2011. INRIA Nancy Grand Est and LORIA, 2011.
[25] Poelmans, J., Kuznetsov, S. O., Ignatov, D. I., Dedene, G.: Formal Concept Analysis in knowledge processing: A survey on models and techniques. Expert Systems with Applications, vol. 40, no. 16, 2013.
[26] Clark, S.: Vector Space Models of Lexical Meaning. In: Lappin, S., Fox, C. (eds.) The Handbook of Contemporary Semantic Theory, pp. 493-522. Blackwell Publishing, 2015.
[27] Rao, S., Marcu, D., Knight, K., Daumé III, H.: Biomedical Event Extraction using Abstract Meaning Representation. In: Proceedings of the BioNLP 2017 Workshop, Vancouver, Canada, August 4, 2017.