=Paper=
{{Paper
|id=None
|storemode=property
|title=Formal Concept Analysis Applied to Transcriptomic Data
|pdfUrl=https://ceur-ws.org/Vol-972/paper_short1.pdf
|volume=Vol-972
|dblpUrl=https://dblp.org/rec/conf/cla/AlamCNS12
}}
==Formal Concept Analysis Applied to Transcriptomic Data==
Mehwish Alam (2,3), Adrien Coulet (2,3), Amedeo Napoli (1,2), and Malika Smaïl-Tabbone (2,3)
1 CNRS, LORIA, UMR 7503, Vandoeuvre-lès-Nancy, F-54506, France
2 Inria, Villers-lès-Nancy, F-54600, France
3 Université de Lorraine, LORIA, UMR 7503, Vandoeuvre-lès-Nancy, F-54506, France
{mehwish.alam,adrien.coulet,amedeo.napoli,malika.smail}@inria.fr
Abstract. Identifying functions shared by genes responsible for cancer is a
challenging task. This paper describes the preparation work for applying
Formal Concept Analysis (FCA) to complex biological data. We present here a
preliminary experiment using these data on a core context with the addition
of domain knowledge. The resulting concept lattices are explored and some
interesting concepts are discussed. Our study shows how FCA can help the
domain experts in the exploration of complex data.
Keywords: Formal Concept Analysis, Knowledge Discovery, Transcriptomic Data.
1 Introduction
Over the past few years, large volumes of transcriptomic data have been
produced, but their analysis remains a challenging task because of the
complexity of the biological background. Some earlier studies aimed at
retrieving sets of genes sharing the same transcriptional behavior with the
help of Formal Concept Analysis [1, 2]. Further studies analyze gene
expression data by using gene annotations to determine whether a set of
differentially expressed genes is enriched with biological attributes [3, 4].
Several efforts have been made to integrate heterogeneous data [5]. For
example, at the Broad Institute, biological data were recently gathered from
multiple resources to get thousands of predefined genesets stored in the
Molecular Signature DataBase (MSigDB) [6]. A predefined geneset is a set of
genes known to have a specific property, such as their position on the genome
or their involvement in a molecular pathway.
This paper focuses on the preparation of biological data for data mining
guided by domain knowledge. The objective is to apply knowledge discovery
techniques for analyzing a list of differentially expressed genes and
identifying functions or pathways shared by these genes, which are assumed to
be responsible for cancer. Section 2 explains the proposed approach for
FCA-based analysis of biological data. Section 3 focuses on the conducted
experiment. Section 4 discusses the results. Section 5 concludes the paper.
© 2012 by the paper authors. CLA 2012, pp. 339–344. Copying permitted only for
private and academic purposes. Volume published and copyrighted by its editors.
Local Proceedings in ISBN 978–84–695–5252–0,
Universidad de Málaga (Dept. Matemática Aplicada), Spain.
2 The Proposed Framework
We rely on the standard definition of FCA fully described in [7] and adapt it
according to the current problem. Let G be the set of genes {g1, g2, g3, ..., gn},
and M be a set of attributes of MSigDB for describing genes. M will be
considered as a partition of three points of view, M = M1 ∪ M2 ∪ M3, with
Mi ∩ Mj = ∅ whenever i ≠ j.
The first set of attributes M1 refers to four types of attributes, “Location”,
“Pathway”, “Transcription Factors” and “GO Terms” (see Table 1). For our
convenience we have named MSigDB categories as types of attributes and used
only C1, C2, C3 and C5. The category C4 was not used as it keeps information
on sets of genes related to a certain kind of cancer, which is not useful for the
current problem. Thus we have a first context K1 = (G, M1, I1) where I1 denotes
the relation stating that gene gi has an attribute mj in M1.
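To make the notion of a context K1 = (G, M1, I1) more concrete, the following minimal Python sketch stores the incidence relation as a mapping from genes to attribute sets and implements the two standard FCA derivation operators. The gene names g1 to g3 and their attribute assignments are hypothetical toy data, not taken from the study.

```python
from typing import Dict, FrozenSet, Set

# A formal context stored as: gene -> set of attributes it has (the relation I).
Context = Dict[str, Set[str]]


def common_attributes(context: Context, genes: Set[str],
                      all_attributes: Set[str]) -> FrozenSet[str]:
    """Derivation A': the attributes shared by every gene in `genes`."""
    shared = set(all_attributes)
    for gene in genes:
        shared &= context[gene]
    return frozenset(shared)


def common_genes(context: Context, attributes: Set[str]) -> FrozenSet[str]:
    """Derivation B': the genes that have every attribute in `attributes`."""
    return frozenset(g for g, attrs in context.items() if attributes <= attrs)


# Hypothetical toy context (illustrative only).
k1: Context = {
    "g1": {"chr19q13", "PPAR Signaling"},
    "g2": {"chr19q13"},
    "g3": {"PPAR Signaling"},
}
m1 = set().union(*k1.values())

# A formal concept is a pair (extent, intent) closed under both derivations.
extent = common_genes(k1, {"chr19q13"})
intent = common_attributes(k1, set(extent), m1)
```

Here the pair (extent, intent) is the concept generated by the attribute "chr19q13"; applying the two operators in turn always yields such a closed pair.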
Types of Attributes               Description                                                   Data Provenance
C1: Positional Gene Sets          Location of the gene on the chromosome.                       Broad Institute
C2: Curated Gene Sets             Pathway                                                       KEGG, REACTOME, BIOCARTA
C3: Motif Gene Sets               Transcription Factors                                         Broad Institute
C4: Computational Gene Sets       Cancer Modules                                                Broad Institute
C5: Gene Ontology (GO) Gene Sets  Biological Process, Cellular Components, Molecular Functions  AmiGO
Table 1. Types of attributes from MSigDB
The second set of attributes M2 is related to the so-called “categories”, where
a category makes reference to a set of attributes with the “Pathway” type. For
example, “Cell Growth and Death” is a category (see Figure 1). The categories
in M2 determine a second context, K2 = (G, M2, I2), where M2 is the set of
categories and I2 denotes the relation between a gene and a category. Note
that the categories are only related to the “Pathway” type and that they can
be considered as domain knowledge.
Moreover, the third set of attributes, namely M3, refers to the so-called
“upper categories”, which are defined as groupings of categories. For the type
“Pathway” we thus have a hierarchy with two levels, categories and upper
categories (see Figure 1). The upper categories in M3 define a third context
K3 = (G, M3, I3), where M3 is the set of upper categories and I3 denotes the
relation between gene gi and an upper category mj. Upper categories are also
related to the “Pathway” type and, like categories, they can be considered as
domain knowledge.
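The way a two-level hierarchy could give rise to the relations I2 and I3 can be sketched as follows. This is only an illustration under the assumption that a gene is related to the category of each of its pathways, and to the upper category of each such category; the lookup tables and the function name `derive_relations` are hypothetical. The single hierarchy entry shown is the PPAR Signaling example given in the text, and gene "gX" is invented.

```python
from typing import Dict, Set, Tuple

# Hypothetical lookup tables for the hierarchy; only this one chain
# (pathway -> category -> upper category) is taken from the text.
PATHWAY_TO_CATEGORY: Dict[str, str] = {
    "PPAR Signaling": "kc:Endocrine System",
}
CATEGORY_TO_UPPER: Dict[str, str] = {
    "kc:Endocrine System": "kuc:Organismal System",
}


def derive_relations(
    gene_pathways: Dict[str, Set[str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build gene->categories (I2) and gene->upper categories (I3)."""
    i2: Dict[str, Set[str]] = {}
    i3: Dict[str, Set[str]] = {}
    for gene, pathways in gene_pathways.items():
        # Pathways missing from the lookup table contribute no category.
        categories = {PATHWAY_TO_CATEGORY[p] for p in pathways
                      if p in PATHWAY_TO_CATEGORY}
        i2[gene] = categories
        i3[gene] = {CATEGORY_TO_UPPER[c] for c in categories
                    if c in CATEGORY_TO_UPPER}
    return i2, i3


# "gX" is an invented gene with no pathway annotation.
i2, i3 = derive_relations({"PSPHL": {"PPAR Signaling"}, "gX": set()})
```

A gene without any pathway annotation simply receives no category and no upper category, which matches the remark that both levels are only related to the “Pathway” type.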
Fig. 1. Categories and upper categories in KEGG.
Now we consider the apposition of the three contexts K1, K2 and K3, which
yields the final context K = (G, M1 ∪ M2 ∪ M3, I1 ∪ I2 ∪ I3). For example, the
context in Table 2 shows five genes described by attributes of M1, M2 and M3.
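Apposition places contexts that share the same object set side by side. A minimal Python sketch is given below, again storing each context as a mapping from genes to attribute sets. The function name `apposition` and the gene "gX" are assumptions made for illustration, while the PSPHL attributes follow the example given in the text.

```python
from typing import Dict, Set

Context = Dict[str, Set[str]]


def apposition(*contexts: Context) -> Context:
    """Combine contexts over the same genes and pairwise disjoint attributes."""
    if not contexts:
        raise ValueError("at least one context is required")
    genes = set(contexts[0])
    seen_attributes: Set[str] = set()
    for ctx in contexts:
        if set(ctx) != genes:
            raise ValueError("all contexts must share the same gene set G")
        attributes = set().union(*ctx.values())
        if seen_attributes & attributes:
            raise ValueError("attribute sets must be pairwise disjoint")
        seen_attributes |= attributes
    # The combined relation is the union of the individual relations.
    return {g: set().union(*(ctx[g] for ctx in contexts)) for g in genes}


# Toy data: PSPHL follows the example in the text, "gX" is invented.
k1 = {"PSPHL": {"PPAR Signaling"}, "gX": {"chr19q13"}}
k2 = {"PSPHL": {"kc:Endocrine System"}, "gX": set()}
k3 = {"PSPHL": {"kuc:Organismal System"}, "gX": set()}
k = apposition(k1, k2, k3)
```

The disjointness check mirrors the condition Mi ∩ Mj = ∅ stated in Section 2. Here it is only applied to attributes that occur for at least one gene.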
3 Using FCA for Analyzing Genes
The framework described above was applied to three published lists of genes
corresponding to Cancer Modules defined in [8]: “Cancer Module 1” (Ovary
Genes), “Cancer Module 2” (Dorsal Root Ganglia Genes), and “Cancer Module 5”
(Lung Genes). For example, “PSPHL” is a gene whose “Pathway” attribute “PPAR
Signaling” belongs to the category “kc:Endocrine System” and the upper
category “kuc:Organismal System”. From these three lists of genes we built
three different contexts having the same form as the context in Table 2.
Then we obtained the three associated concept lattices with the help of the
Coron platform (http://coron.loria.fr). The concept lattice for Table 2 is
given in Figure 2. The global characteristics of the three concept lattices
are given in Table 3.
[Table 2: cross table (×) of the genes BTB03, PSPHL, CCT6A, QNGPT1 and MYC against attributes of M1, M2 and M3.]
Table 2. A toy example of formal context including domain knowledge.
The exploration of a given concept lattice is carried out following the “Iceberg
metaphor”, i.e., the lattice is explored level by level according to the support of
each concept, where the support of a concept is the cardinality of its extent.
In addition, we also used stability for extracting interesting frequent and stable
concepts [9].
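Support and stability can be illustrated on a very small context. The sketch below is a naive, exponential-time enumeration meant only for toy data; it is not the algorithm used by the Coron platform. It assumes the usual definition of stability as the fraction of subsets of the extent whose set of common attributes equals the intent of the concept. The context and all function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Context = Dict[str, Set[str]]
Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent, intent)


def intent_of(context: Context, genes: Iterable[str]) -> FrozenSet[str]:
    """Attributes shared by all given genes (all attributes if none given)."""
    shared = set().union(*context.values())
    for gene in genes:
        shared &= context[gene]
    return frozenset(shared)


def extent_of(context: Context, attributes: FrozenSet[str]) -> FrozenSet[str]:
    """Genes having all the given attributes."""
    return frozenset(g for g, a in context.items() if attributes <= a)


def all_subsets(items: List[str]) -> Iterable[Tuple[str, ...]]:
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def concepts(context: Context) -> Set[Concept]:
    """Enumerate every formal concept by closing each subset of genes."""
    found: Set[Concept] = set()
    for subset in all_subsets(list(context)):
        intent = intent_of(context, subset)
        found.add((extent_of(context, intent), intent))
    return found


def stability(context: Context, concept: Concept) -> float:
    """Fraction of subsets of the extent that still yield the same intent."""
    extent, intent = concept
    members = list(extent)
    hits = sum(1 for s in all_subsets(members)
               if intent_of(context, s) == intent)
    return hits / 2 ** len(members)


def iceberg(context: Context, min_support: int) -> List[Concept]:
    """Keep only the concepts whose support (extent size) is large enough."""
    return [c for c in concepts(context) if len(c[0]) >= min_support]


# Hypothetical toy context with attributes a, b, c.
toy: Context = {"g1": {"a", "b"}, "g2": {"a", "b"}, "g3": {"a"}, "g4": {"c"}}
frequent = iceberg(toy, min_support=2)
concept_a = (frozenset({"g1", "g2", "g3"}), frozenset({"a"}))
```

In this toy context, `stability(toy, concept_a)` counts the subsets of {g1, g2, g3} whose common attributes are exactly {a}; a concept whose intent survives the removal of many of its objects gets a value close to 1.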
Fig. 2. The concept lattice corresponding to Table 2.
Data Sets  No. of Genes  No. of Attributes  No. of Concepts  Levels
Module 1   361           3496               9,588            12
Module 2   378           3496               6,508            11
Module 5   419           3496               5,004            12
Table 3. Concept lattice statistics for the cancer modules with domain knowledge.
4 Results
In this study, biologists are interested in links between the input genes in terms
of pathways in which they participate, relationships between genes and their
positions etc. We obtained concepts with shared transcription factors, pathways,
locations of genes and GO terms. After the selection of concepts with a high
support (≥ 10), we observed that there were some concepts with pathways either
related to cell proliferation or apoptosis (expert interpretation). The addition
of domain knowledge gives an opportunity to obtain the pathway categories
shared by larger sets of genes (as categories and upper categories are there for
maximizing the grouping of objects, see below).
Table 4 shows the top-ranked concepts found in each module. For example, in
Table 4, we have the concept C4938: (KEGG Cytokine Cytokine Receptor
Interaction, kc:Signaling Molecules and Interaction, kuc:Environmental
Information Processing) and the concept C4995: (kc:Signaling Molecules and
Interaction, kuc:Environmental Information Processing). These two concepts are
such that C4938 ≤ C4995: the intent of C4938 contains the intent of C4995, so
the extent of C4938 is included in the extent of C4995 and C4995 has the
greater support (26 versus 11 in Table 4). Moreover, we observed that the
introduction of categories and upper categories in the global context allows
us to consider concepts that otherwise would not be frequent. The role of
categories and upper categories is thus to facilitate the observation of sets
of related genes.
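The link between the order on concepts and their support can be written out explicitly. The following is the standard FCA ordering, with (A_1, B_1) and (A_2, B_2) denoting the extent and intent of two concepts; these symbols are introduced here for illustration only.

```latex
(A_1, B_1) \le (A_2, B_2)
\;\iff\; A_1 \subseteq A_2
\;\iff\; B_2 \subseteq B_1
\quad\Longrightarrow\quad |A_1| \le |A_2| .
```

With the values reported in Table 4, the intent of C4938 is a superset of the intent of C4995, and the supports are 11 and 26 respectively, in line with this implication.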
This is a general way of obtaining larger sets of objects to interpret. When
available, one can introduce a hierarchy of attributes (this is domain
knowledge) and then insert the levels of each attribute in this hierarchy as
new attributes in the context. As a result, some classes of objects that could
not emerge before will appear based on these hierarchical indications. Given
the test data sets, the preliminary results obtained here constitute an
interesting and positive control, and confirm that FCA-based analysis offers
an efficient and practical procedure to explore complex and large sets of
genes.
5 Conclusion
The preliminary study presented here shows how FCA can be applied to complex
biological data and can give flexibility in using various types of attributes for
analyzing a list of genes. In addition, domain knowledge can be introduced and
guide the analysis.
Dataset   Concept ID  Intents                                                                                                                          Absolute Support  Stability
Module 1  9585        GGGAGGRR V$MAZ Q6                                                                                                                51                0.99
          9571        GO Membrane Part                                                                                                                 27                0.99
          9566        kc:Immune System, kuc:Organismal Systems                                                                                         25                0.99
          9402        chr19q13                                                                                                                         10                0.99
          9078        KEGG MAPK Signaling Pathway, kc:Signal Transduction, kuc:Environmental Information Processing                                    12                0.87
Module 2  6502        GGGAGGRR V$MAZ Q6                                                                                                                44                0.99
          6501        AACTTT UNKNOWN                                                                                                                   38                0.99
          6496        kc:Immune System, kuc:Organismal Systems                                                                                         15                0.99
          6388        chr6p21                                                                                                                          10                0.97
          6335        KEGG MAPK Signaling Pathway, kc:Signal Transduction, kuc:Environmental Information Processing                                    11                0.89
Module 5  5002        kuc:Cellular Processes                                                                                                           48                0.99
          5000        GGGAGGRR V$MAZ Q6                                                                                                                44                0.99
          4995        kc:Signaling Molecules and Interaction, kuc:Environmental Information Processing                                                 26                0.99
          4933        chr19q13                                                                                                                         11                0.99
          4985        kc:Immune System, kuc:Organismal Systems                                                                                         11                0.99
          4938        KEGG Cytokine Cytokine Receptor Interaction, kc:Signaling Molecules and Interaction, kuc:Environmental Information Processing    11                0.87
Table 4. Top ranked concepts for each cancer module
As for future work, we plan to take into account relationships between genes
and between terms (Gene Ontology relationships) and use the framework of
relational concept analysis.
References
1. Kaytoue-Uberall, M., Duplessis, S., Kuznetsov, S.O., Napoli, A.: Two FCA-Based Methods for Mining Gene Expression Data. In Ferré, S., Rudolph, S., eds.: ICFCA. Volume 5548 of Lecture Notes in Computer Science, Springer (2009) 251–266
2. Rioult, F., Boulicaut, J.F., Crémilleux, B., Besson, J.: Using Transposition for Pattern Discovery from Microarray Data. In: DMKD. (2003) 73–79
3. Berriz, G.F., King, O.D., Bryant, B., Sander, C., Roth, F.P.: Characterizing gene sets with FuncAssociate. Bioinfo. 19(18) (2003) 2502–2504
4. Doniger, S., Salomonis, N., Dahlquist, K., Vranizan, K., Lawlor, S., Conklin, B.: MAPPFinder: using Gene Ontology and GenMAPP to Create a Global Gene-expression Profile from Microarray Data. Genome Biology 4(1) (2003) R7
5. Galperin, M.Y., Fernández-Suarez, X.M.: The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research 40 (2012) 1–8
6. Liberzon, A.: Molecular Signatures Database (MSigDB) 3.0. Bioinfo. 27(12) (2011) 1739–1740
7. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer, Berlin/Heidelberg (1999)
8. Segal, E., Friedman, N., Koller, D., Regev, A.: A Module Map Showing Conditional Activity of Expression Modules in Cancer. Nat. Genet. 36 (2004) 1090–8
9. Kuznetsov, S.O.: On stability of a Formal Concept. Ann. Math. Artif. Intell. 49(1-4) (2007) 101–115