=Paper=
{{Paper
|id=Vol-1718/paper1
|storemode=property
|title=A Framework for Scalable Inference of Temporal Gene Regulatory Networks
based on Clustering and Multivariate Analysis
|pdfUrl=https://ceur-ws.org/Vol-1718/paper1.pdf
|volume=Vol-1718
|authors=Ricardo de Souza Jacomini,David Correa Martins-Jr,Felipe Leno da Silva,Anna Helena Reali Costa
|dblpUrl=https://dblp.org/rec/conf/ijcai/JacominiMSC16
}}
==A Framework for Scalable Inference of Temporal Gene Regulatory Networks based on Clustering and Multivariate Analysis==

A Framework for Scalable Inference of Temporal Gene Regulatory Networks based on Clustering and Multivariate Analysis∗

Ricardo de Souza Jacomini¹, David Correa Martins-Jr², Felipe Leno da Silva¹ and Anna Helena Reali Costa¹

¹ Escola Politécnica da Universidade de São Paulo – São Paulo, Brazil
² Universidade Federal do ABC – Santo André, Brazil

{ricardo.jacomini,f.leno,anna.reali}@usp.br, david.martins@ufabc.edu.br
Abstract

Genomic and transcriptomic information has been used as a starting point for the analysis of the origin and development of diseases, which has led to the development of many methods that model the dynamics of gene expression data. Gene Networks (GN) are widely used to model such information, and many methods have been developed for GN inference from temporal gene expression data. However, these data usually result in training sets composed of a small number of temporal samples for a large number of genes, which renders many GN inference methods infeasible to apply to real temporal expression data composed of thousands of genes, since their cost grows exponentially with the number of genes. In order to improve the scalability of GN inference, we propose a novel framework based on the Probabilistic Gene Networks model, in which we rely on a clustering preprocessing step to provide an approximate solution with reduced computational complexity. We compared our proposal with a similar approach without the clustering step, and our experiments show that the proposed framework achieves a substantial reduction in computation time while approximately preserving the prediction accuracy.

∗ We gratefully acknowledge funding from CAPES, CNPq (grants 304955/2014-0, 311608/2014-0) and the São Paulo Research Foundation (FAPESP) (grants 2011/50761-2, 2015/01587-0, 2015/16310-4).

1 Introduction

The modeling, inference and interpretation of gene regulatory networks (GRNs) from temporal gene expression data has drawn significant attention recently [Hecker et al., 2009; Marbach et al., 2012; Shmulevich and Dougherty, 2014], especially after the advent of large-scale gene expression measurement techniques such as cDNA microarrays [Shalon et al., 1996], SAGE [Velculescu et al., 1995] and, more recently, RNA-Seq [Wang et al., 2009]. This interest relies on the fact that genes play a major role in the control of cell functions. The GRN inference problem involves the discovery of complex regulatory relationships between biological molecules, which can describe not only diverse biological functions but also the dynamics of molecular activities. Once the network is recovered, intervention studies can be conducted to control the dynamics of biological systems, aiming to prevent or treat diseases [Shmulevich and Dougherty, 2014]. Genes and proteins usually form an intricate complex network in which, in many cases, the behavior of a given gene, measured by means of its expression level (i.e., mRNA abundance), depends on the multivariate and coordinated action of other genes and their byproducts (proteins) [Martins-Jr et al., 2008]. The importance of GRN reconstruction can also be seen through many initiatives, such as DREAM (Dialogue for Reverse Engineering Assessments and Methods) [Marbach et al., 2012].

There are two main approaches to model gene interactions [Shmulevich and Dougherty, 2014]: continuous and discrete. Continuous approaches mainly rely on differential equations to obtain a quantitative, detailed model of biochemical networks [De-Jong, 2002; Hecker et al., 2009], while discrete models describe gene interactions from a qualitative point of view. The main discrete models include graph-based ones such as Bayesian Networks [Friedman et al., 2000], Boolean Networks (BN) [Kauffman, 1969], and their stochastic versions, Probabilistic Boolean Networks (PBN) [Shmulevich et al., 2002] and Probabilistic Gene Networks (PGN), a simplified PBN model [Barrera et al., 2007]. Continuous models provide a detailed understanding of the system, but prior information about kinetic parameters and more experimental samples are required [Hecker et al., 2009]. On the other hand, discrete models are more useful to capture the global behavior of the system dynamics, requiring less data and being easier to implement and analyse [Hecker et al., 2009].

The literature that deals with the GRN inference problem is vast. Some examples of methods that deal with this problem include mutual-information-based feature selection [Liang et al., 1998; Lopes et al., 2014], relevance networks [Margolin et al., 2006; Faith et al., 2007], feature selection by maximum relevance/minimum redundancy [Meyer et al., 2007], and signal perturbation [Ideker et al., 2000; Carastan-Santos et al., 2016], among others. Even though there are many GRN inference methods in the literature [Markowetz and Spang, 2007; Hecker et al., 2009; De-Smet and Marchal, 2010; Marbach et al., 2012], GRN inference is considered an ill-posed problem, since for a given dataset of gene expression profiles there are many (if not infinitely many) networks capable of generating that same dataset. The problem is further hampered by the typically limited number of samples, the huge dimensionality (number of variables, i.e., genes), and the presence of noise [Hecker et al., 2009; Shmulevich and Dougherty, 2014].

In the specific context of discrete models, BNs and PBNs can generalize and capture the global behavior of biological systems [Kauffman, 1969; Shmulevich et al., 2002]. The main disadvantage of these models is the information loss caused by the required data quantization. However, quantization makes the BN and PBN models simpler to implement and analyse [Styczynski and Stephanopoulos, 2005; Ivanov and Dougherty, 2006], and many methods have been proposed to infer GRNs modeled as BN or PBN [Akutsu et al., 1999; Lahdesmaki and Shmulevich, 2003; Liang et al., 1998; Nam et al., 2006].

Although PBN genes have only two possible expression values, network inference is still difficult, since the curse of dimensionality still plays an important role. In this scenario, the PGN model simplifies the inference process by allowing local feature selection to search for the best subsets of genes to predict the behavior of a given target gene [Barrera et al., 2007]. Since exhaustive search is the only feature selection algorithm that guarantees optimality [Cover and van Campenhout, 1977], high-performance computing techniques are required when using this algorithm to search for predictor subsets of three or four dimensions (for higher dimensions, this technique is still impractical) [Borelli et al., 2013; Carastan-Santos et al., 2016]. An alternative to reduce the computational complexity of the exhaustive search is to apply some prior dimensionality reduction technique to restrict the search space of candidate predictor subsets for a given target. This is not trivial to do, since even the individually worst features may perform very well when combined to predict a given target, while the best individual features may perform poorly when combined [Pudil et al., 1994; Martins-Jr et al., 2008].

In order to alleviate the curse of dimensionality inherent to the GRN inference problem, and consequently its computational cost, this paper proposes a novel framework to infer PGNs. Our proposal relies on a clustering technique to reduce the search complexity when evaluating the possible predictor subsets and thus alleviate the computational cost of GRN inference. In addition, an intrinsically multivariate analysis is conducted to eliminate redundant features from each predictor subset [Martins-Jr et al., 2008] and, consequently, to obtain a minimal network. We experimentally compared the prediction quality of our proposal with GRN inference by an exhaustive search over all possible predictor subsets, which has prohibitive computational costs but is expected to achieve the best prediction quality. Our results on in silico data show that the approximate solution given by our framework achieves prediction quality very similar to that of the exhaustive search, while providing a substantial reduction of the computational complexity.

2 Probabilistic Gene Networks inference

The Probabilistic Gene Networks (PGN) model [Barrera et al., 2007] assumes that the temporal gene expression samples follow a first-order Markov chain in which each target gene at a given timepoint depends only on the values of its predictor subset at the previous time instant. The transition function is homogeneous (it does not change over time), almost deterministic (from any state, the system has a preferential state to go to) and conditionally independent (i.e., the expression value of a given gene depends only on its predictors, following the Markov hypothesis). These assumptions are important simplifications to deal with the limited number of samples typically available in real gene expression data.

Conceptually, PGN is a restricted type of PBN [Shmulevich et al., 2002]. While PBNs assume that variables are binary, PGNs allow gene expression values to be described by two or more discrete values. For example, Barrera et al. (2007) considered three possible states for each gene: -1 (underexpressed), 0 (normally expressed), +1 (overexpressed). However, it is worth noting that the number of statistical parameters (configuration values of a given predictor subset) doubles whenever a single binary predictor is added to the subset (i.e., it grows exponentially). Like a PBN, a PGN allows a target gene to have several different predictor functions. However, in a PGN all these functions necessarily take the same set of predictor genes as inputs, whereas a PBN target gene might be described by several transition functions that take different sets of predictors as inputs. Another important restriction is the quasi-determinism assumed by PGN, which implies that it is often possible to find very good predictors for every target in terms of prediction error.

PGN-based GRN inference methods rely on three fundamental steps [Barrera et al., 2007; Lopes et al., 2014]: (i) data quantization; (ii) feature selection; and (iii) determination of the logic function that minimizes the classification error of each target expression profile.
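To make these assumptions concrete, the fragment below is a minimal Python sketch (our own illustration, not the authors' implementation; the function names and toy profiles are hypothetical). It estimates the conditional distribution of a target given a predictor configuration from a quantized time series and applies the quasi-deterministic update of step (iii): the target at time t+1 receives the most probable value observed for its predictors' configuration at time t.

```python
from collections import Counter, defaultdict

def learn_prediction_logic(target_profile, predictor_profiles):
    """Estimate the quasi-deterministic transition logic of one PGN target.

    target_profile: quantized values y[t] for t = 0..M-1
    predictor_profiles: one quantized profile per predictor gene
    Returns a dict mapping a predictor configuration observed at time t to the
    most frequent target value at time t+1 (first-order Markov assumption).
    """
    counts = defaultdict(Counter)
    for t in range(len(target_profile) - 1):
        z = tuple(p[t] for p in predictor_profiles)   # predictor state at time t
        counts[z][target_profile[t + 1]] += 1         # target value at time t+1
    # quasi-determinism: keep only the most probable output per configuration
    return {z: c.most_common(1)[0][0] for z, c in counts.items()}

def step(logic, predictor_values, default=0):
    """Advance the target one timepoint from a predictor configuration."""
    return logic.get(tuple(predictor_values), default)

if __name__ == "__main__":
    # toy binary profiles with 8 timepoints (hypothetical data)
    y  = [0, 1, 1, 0, 1, 1, 0, 1]
    z1 = [1, 0, 0, 1, 0, 0, 1, 0]
    z2 = [0, 0, 1, 0, 0, 1, 0, 0]
    logic = learn_prediction_logic(y, [z1, z2])
    print(logic)                # {(1, 0): 1, (0, 0): 1, (0, 1): 0}
    print(step(logic, (1, 0)))  # predicted value of y at the next timepoint -> 1
```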
Feature selection is an extremely important step in this inference procedure. A feature selection problem consists of selecting a subset of features that represents the objects under study well. In our case, a feature selection algorithm basically searches for the subset of genes that best predicts a given target gene according to a criterion function, which assigns a quality value to a subset according to the expression profile of the target gene at the next time instant [Barrera et al., 2007; Borelli et al., 2013; Lopes et al., 2014].

Many feature selection algorithms have been proposed in the literature; some of them are computationally efficient but suboptimal. In fact, in the general case the only algorithm that guarantees optimality is the exhaustive search [Cover and van Campenhout, 1977]. This is due to the well-known nesting effect, in which a feature included in the solution subset might never be removed by a suboptimal feature selection algorithm, even if that feature is not in the optimal solution set. Similarly, a previously removed feature might never be inserted again into the current solution subset, even if it belongs to the optimal solution set [Pudil et al., 1994].
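As a rough illustration of the exhaustive strategy (a sketch under our own assumptions, not the authors' code; the criterion passed in the example is a trivial placeholder), the fragment below enumerates every candidate predictor subset of a fixed size p for one target and keeps the subset that maximizes a user-supplied criterion function:

```python
from itertools import combinations

def exhaustive_search(candidates, target, profiles, p, criterion):
    """Return the best size-p predictor subset for `target` and its score.

    candidates: candidate predictor gene names
    target: name of the target gene
    profiles: dict mapping gene name -> quantized temporal profile (list)
    criterion: function (predictor_profiles, target_profile) -> score
    """
    best_subset, best_score = None, float("-inf")
    for subset in combinations(candidates, p):     # all C(len(candidates), p) subsets
        score = criterion([profiles[g] for g in subset], profiles[target])
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

if __name__ == "__main__":
    # placeholder criterion: count matches between the first predictor at time t
    # and the target at time t+1 (for illustration only)
    def toy_criterion(preds, y):
        return sum(int(preds[0][t] == y[t + 1]) for t in range(len(y) - 1))

    profiles = {"g1": [0, 1, 0, 1], "g2": [1, 1, 0, 0],
                "g3": [0, 0, 1, 1], "target": [1, 0, 1, 0]}
    print(exhaustive_search(["g1", "g2", "g3"], "target", profiles, 2, toy_criterion))
```

In the framework described here, the criterion would be the CoD or the mutual information defined next, and the candidate list is restricted to the cluster representatives introduced in Section 3.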
The GRN inference framework proposed here (see Section 4) applies an exhaustive search over predictor subsets of a fixed dimension, adopting two criterion functions commonly used in feature-selection-based GRN inference methods [Martins-Jr et al., 2008; Lopes et al., 2014]: (i) the Coefficient of Determination (CoD), which is based on the Bayesian classification error [Dougherty et al., 2000]; and (ii) Mutual Information (MI), which is based on Shannon's entropy [Shannon, 2001].

The Coefficient of Determination (CoD) [Dougherty et al., 2000] for a target gene Y given a set of candidate predictor genes Z is a non-linear criterion function given by:

    CoD_Y(Z) = (ε_Y − ε_Y(Z)) / ε_Y                                  (1)

where ε_Y = 1 − max_{y∈Y} P(y) and ε_Y(Z) = 1 − Σ_{z∈Z} max_{y∈Y} P(z, y). Greater CoD values indicate better feature subspaces (CoD = 0 means that the feature subspace does not reduce the prior error, while CoD = 1 means that the error is totally eliminated).

In turn, the mutual information (MI) is defined as:

    I(Z, Y) = H(Y) − H(Y | Z)                                        (2)

where H(.) is the Shannon entropy, with H(Y) = −Σ_{y∈Y} P(y) log P(y) and H(Y | Z) = −Σ_{y∈Y, z∈Z} P(z) P(y | z) log P(y | z), where P(y) is the probability of Y = y and P(y | z) is the conditional probability of Y = y given that Z = z.
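To make the two criteria concrete, here is a small Python sketch (our own illustration; probabilities are estimated by simple counting and the toy profiles are hypothetical) that computes the CoD of Equation (1) and the mutual information of Equation (2) for a quantized target profile and a set of predictor profiles aligned one timepoint ahead:

```python
from collections import Counter
from math import log2

def _pairs(target, predictors):
    """Yield (predictor configuration at time t, target value at time t+1)."""
    for t in range(len(target) - 1):
        yield tuple(p[t] for p in predictors), target[t + 1]

def cod(target, predictors):
    """Coefficient of Determination (Equation 1) from empirical counts."""
    pairs = list(_pairs(target, predictors))
    n = len(pairs)
    py = Counter(y for _, y in pairs)
    pzy = Counter(pairs)
    err_prior = 1.0 - max(py.values()) / n            # epsilon_Y
    best_per_z = Counter()                            # max_y count(z, y) for each z
    for (z, y), c in pzy.items():
        best_per_z[z] = max(best_per_z[z], c)
    err_cond = 1.0 - sum(best_per_z.values()) / n     # epsilon_Y(Z)
    return 0.0 if err_prior == 0 else (err_prior - err_cond) / err_prior

def mutual_information(target, predictors):
    """Mutual information I(Z, Y) = H(Y) - H(Y|Z) (Equation 2), in bits."""
    pairs = list(_pairs(target, predictors))
    n = len(pairs)
    py = Counter(y for _, y in pairs)
    pz = Counter(z for z, _ in pairs)
    pzy = Counter(pairs)
    h_y = -sum((c / n) * log2(c / n) for c in py.values())
    h_y_given_z = -sum((c / n) * log2(c / pz[z]) for (z, _), c in pzy.items())
    return h_y - h_y_given_z

if __name__ == "__main__":
    y  = [0, 1, 1, 0, 1, 1, 0, 1]       # toy quantized target profile
    z1 = [1, 0, 0, 1, 0, 0, 1, 0]
    z2 = [0, 0, 1, 0, 0, 1, 0, 0]
    print(cod(y, [z1, z2]), mutual_information(y, [z1, z2]))
```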
However, it is not enough to select the best subset of predictors, since redundant genes might be present in these subsets. It is therefore important to perform a multivariate analysis of these predictors with the aim of reducing the number of predictors per target, thus simplifying the network.

The multivariate nature of the relationship of certain predictors with regard to the target leads to the already mentioned nesting effect, and can be estimated through the intrinsically multivariate prediction (IMP) phenomenon [Martins-Jr et al., 2008]. A set of genes Z is considered IMP given a target gene Y if the target behavior (expression profile) is strongly predicted by the combined expression profiles of Z and, at the same time, weakly predicted by any proper subset of Z. In this sense, the IMP score (IS) can be defined as [Martins-Jr et al., 2008]:

    IS(Z, Y) = J(Z, Y) − max_{Z′⊂Z} J(Z′, Y)                         (3)

where J(.) is the chosen criterion function, which evaluates the dependence of a target variable Y with regard to a candidate feature set Z (higher values imply higher dependence). IS(Z, Y) = 0 indicates that there is at least one redundant variable in Z, implying that Z should be reduced (Z is definitely not IMP with regard to the target). It is also possible to define a positive threshold to decide whether a feature set is IMP with regard to the target. In case the pair (Z, Y) is not IMP, Z can be reduced to the proper subset that presents the maximum J(.) value. This process is recursive: the reduction is applied until the IMP score of the current pair (Z, Y) is positive (or larger than the chosen threshold).
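The recursive reduction just described can be sketched as follows (an illustrative Python fragment under our own assumptions, not the authors' implementation; it accepts any criterion function such as the CoD or MI above, and the toy data are hypothetical): while the IMP score of Equation (3) is not above the chosen threshold, the subset is replaced by its proper subset with the highest criterion value.

```python
from itertools import combinations

def imp_score(subset, target, profiles, criterion):
    """IMP score of Equation (3): J(Z, Y) minus the best proper subset of Z."""
    full = criterion([profiles[g] for g in subset], profiles[target])
    best_proper = max(
        criterion([profiles[g] for g in sub], profiles[target])
        for r in range(1, len(subset))
        for sub in combinations(subset, r)
    )
    return full - best_proper

def reduce_subset(subset, target, profiles, criterion, threshold=0.0):
    """Recursively reduce Z while it is not IMP with regard to the target."""
    subset = tuple(subset)
    while len(subset) > 1 and imp_score(subset, target, profiles, criterion) <= threshold:
        # replace Z by the proper subset with maximum criterion value
        subset = max(
            (sub for r in range(1, len(subset)) for sub in combinations(subset, r)),
            key=lambda sub: criterion([profiles[g] for g in sub], profiles[target]),
        )
    return subset

if __name__ == "__main__":
    profiles = {"a": [0, 1, 0, 1, 0, 1], "b": [1, 1, 0, 0, 1, 1], "y": [1, 0, 1, 0, 1, 0]}
    # toy criterion: fraction of timepoints where the first predictor at t equals y at t+1
    crit = lambda preds, y: sum(preds[0][t] == y[t + 1] for t in range(len(y) - 1)) / (len(y) - 1)
    print(reduce_subset(("a", "b"), "y", profiles, crit))   # -> ('a',): gene b is redundant here
```

With the zero threshold used in Section 5, a subset is kept only when removing any of its genes strictly decreases the criterion value.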
Our proposed gene network inference framework performs an intrinsically multivariate prediction analysis on the subsets returned by the exhaustive search algorithm in order to discard redundant features and thus reduce the dimensionality of the selected subsets. This procedure helps to simplify the final networks.

3 Clustering

The use of clustering algorithms in gene expression data analysis can elucidate challenging biological and genomic issues, such as identifying the functionality of genes, finding out which genes are co-regulated (which could give clues about the functional annotation of genes), and revealing important genes that distinguish between abnormal and normal tissues [Zhao and Karypis, 2003]. Furthermore, some genes might be strongly correlated (have almost identical profiles), which suggests that they could be assigned to the same group, in such a way that the group could be represented by one of these genes. In this way, highly correlated genes belonging to the same cluster are never considered in the same candidate predictor subset, since highly correlated genes are considered redundant. Therefore, in practice, clustering can be a useful tool to discard many candidate subsets with redundant genes, which in turn implies a significant reduction of both the initial dimensionality and the search space, helping to alleviate scalability issues.

In this paper, we adopted k-means clustering as the initial step of the proposed GRN inference method (see Section 4) for two reasons. First, k-means produces a partition, i.e., each resulting cluster contains a list of genes and the intersection between the gene lists of different clusters is always empty (a gene cannot belong to more than one cluster). Second, k-means makes it possible to set the desired number of clusters (the parameter k indicates the number of clusters). This is important, since k becomes the resulting dimensionality of the GRN inference process. For instance, if k is set to a number on the order of dozens, the dimensionality of the process is reduced from N initial genes on the order of thousands to k gene clusters on the order of dozens.
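A minimal sketch of this clustering step (our own illustration in Python with scikit-learn, which we assume is available; the authors' implementation is in R, and the gene names and data below are hypothetical) groups the expression profiles into k clusters and picks, for each cluster, the gene whose profile is closest to the centroid as its representative:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_representatives(expr, gene_names, k, random_state=0):
    """Cluster genes by expression profile and return one representative per cluster.

    expr: array of shape (n_genes, n_timepoints) with (standardized) profiles
    gene_names: list with one name per row of expr
    k: desired number of clusters
    Returns (labels, representatives), where representatives[c] is the gene of
    cluster c whose profile is closest to the cluster centroid (Euclidean distance).
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(expr)
    representatives = {}
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(expr[members] - km.cluster_centers_[c], axis=1)
        representatives[c] = gene_names[members[np.argmin(dists)]]
    return km.labels_, representatives

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    expr = rng.normal(size=(100, 40))        # 100 hypothetical genes, 40 timepoints
    names = [f"g{i}" for i in range(100)]
    labels, reps = cluster_representatives(expr, names, k=10)
    print(sorted(reps.values()))             # the 10 candidate predictor genes
```

Only these representatives are later considered as candidate predictors, which is what reduces the feature selection search space from N genes to k clusters.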
4 Proposed GRN inference framework

Here we propose a new framework for GRN inference that follows the PGN model assumptions (see Section 2) and applies a clustering technique before feature selection to reduce the dimensionality of the possible predictor subsets. In addition, after the feature selection phase, minimal predictor subsets are found by removing redundant genes from each predictor subset through a multivariate analysis, making the networks as simple as possible. Figure 1 depicts the main steps involved, which are described as follows.

Figure 1: Block diagram with the main steps performed in the proposed framework to infer gene networks following the PGN model. The light blue box represents the gene expression profiles dataset taken as input by the framework, and the yellow box represents the output PGN.

(a) Gene expression standardization: Given a gene expression dataset, a transformation is first applied to the input data. In the experiments of Section 5, a Z-score standardization is applied, in which the expression e_i of a given gene i becomes e′_i = (e_i − µ_i) / σ_i, where µ_i and σ_i are the average and the standard deviation of the expressions of gene i, respectively. This transformation changes the data in such a way that expression values of a given gene below its own average become negative (underexpressed), while expressions above its own average become positive (overexpressed).

(b) Clustering: A clustering method is applied to group genes with similar expression profiles. Any clustering method that returns a partition and a list of members per cluster, including their respective representative genes (the genes that best represent their respective clusters according to a given criterion), can be used. It is desirable that the clustering method allows setting the number of clusters to be returned, or at least restricting the maximum number of clusters, since this number has a crucial impact on the feature search space. Here we adopted k-means clustering for the experiments described in Section 5.

(c) Quantization: Following the PGN model, the gene expression dataset is quantized so that each gene expression takes values from a finite set. We adopt binary quantization, where negative Z-scored values become 0 and positive Z-scored values become 1 (steps (a) and (c) are illustrated in the short sketch after this list).

(d) Feature selection: A feature selection algorithm is applied with each gene placed as the target, aiming to find the best predictor subset for that target according to a given criterion function. Only the cluster representatives (one gene per cluster) are taken as potential predictor genes; all other genes are ignored. If the defined number of clusters is small enough (one hundred at most), an exhaustive search is applicable to search for trios or even subsets of larger dimensions (4 or 5). Following the PGN model, the criterion function evaluates the prediction power of a candidate predictor subset with regard to the target expression at the next timepoint (first-order Markov chain).

(e) Multivariate analysis: This step is necessary to eliminate redundant genes from the best predictor subset found for a given target. This can be done by evaluating the IMP score (IS) according to Equation 3. In the experiments of Section 5, a subset Z selected to predict Y is reduced if IS(Z, Y) = 0. As already mentioned in Section 2, this reduction process is recursive: the reduction continues until IS(Z, Y) > 0, yielding the minimal predictor subset. Once the final predictor subset of each target is defined, the framework derives the dependence logics that rule the target expression profile based on that final predictor subset. These dependence logics are retrieved from the conditional probability distributions P(Y | Z) (where Y is the target and Z is the best predictor subset for Y), in such a way that for all z ∈ Z the output of Y is defined by {y | P(Y = y | Z = z) = max_{y∈Y} P(Y = y | Z = z)} (the logic outputs are those that minimize the Bayesian classification error of Y based on the values of Z). Thus, the expression of a gene at time t + 1 is given by applying the prediction logic to the expression values of its predictors at time t (these expressions are obtained from the quantized dataset). This is performed considering all timepoints.
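Steps (a) and (c) amount to a per-gene Z-score followed by thresholding at zero. A compact sketch (an illustration with numpy under our own assumptions, not the authors' R code):

```python
import numpy as np

def standardize_and_quantize(expr):
    """Z-score each gene's profile (step a) and binarize it at zero (step c).

    expr: array of shape (n_genes, n_timepoints) with raw expression values
    Returns (zscored, binary) arrays of the same shape; values below a gene's
    own mean map to 0 (underexpressed) and values above it map to 1.
    """
    mu = expr.mean(axis=1, keepdims=True)
    sigma = expr.std(axis=1, keepdims=True)
    zscored = (expr - mu) / sigma
    binary = (zscored > 0).astype(int)
    return zscored, binary

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expr = rng.lognormal(size=(5, 8))    # 5 hypothetical genes, 8 timepoints
    zscored, binary = standardize_and_quantize(expr)
    print(binary)
```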
Finally, the output of our framework is a set of predictor clusters for each desired target gene. If necessary, a PGN can be assembled by combining all desired target genes with their predictor sets. Thus, the output PGN is composed of all target genes linked to cluster representative genes.

4.1 Computational complexity analysis

As the computational complexity of the framework is dominated by the exhaustive search algorithm in the feature selection step (step d), we focus only on the complexity of this step. The other steps have negligible processing time, since they run in seconds even for very large datasets. Hence, the complexity is measured by the number of times the criterion function is evaluated during the feature selection step (we assume that one criterion function evaluation takes O(1) time, which holds for small predictor subset cardinalities and a small number of possible discrete expression values). Let N be the number of genes in the dataset, p be the fixed number of predictors in a predictor subset, and k be the number of clusters obtained in step (b). The complexity of inferring the gene network topology using the exhaustive search over cluster representatives is O(N × (k choose p)) = O(N × k^p). Since k is expected to be much smaller than N (k is on the order of tens or hundreds while N is on the order of thousands), the gain in computational time is substantial when compared to the pure exhaustive search, whose complexity is O(N × (N choose p)) = O(N^(p+1)). For example, in a dataset with N = 1000 and p = 3, the number of criterion function evaluations per target gene is 1.66 × 10^8 for the pure exhaustive search and 1.62 × 10^5 for our proposal with k = 100 (three orders of magnitude lower).
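These counts are easy to verify (a quick check in Python using the standard math.comb):

```python
from math import comb

N, k, p = 1000, 100, 3
print(comb(N, p))  # 166167000 ~ 1.66e8 evaluations per target for the pure exhaustive search
print(comb(k, p))  # 161700    ~ 1.62e5 evaluations per target with k = 100 cluster representatives
```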
5 Experimental Setup

We adopted SysGenSIM to generate the datasets for our experiments. SysGenSIM is an in silico generator that produces gene expression data from non-linear differential equations based on the biochemical dynamics of yeast [Pinna et al., 2011].

The following parameters were used to generate the datasets: 3 different expression profiles were generated with 40 samples (M = 40) each. The Barabási-Albert scale-free model was adopted to generate the network topology [Barabási and Albert, 1999], and the average input degree was set to 3. The number of genes was set to N = 100 and N = 1000 (one for each experiment). The cooperativity coefficient was set to follow a Gamma distribution and the degradation rate was constant. The biological variance of transcription, degradation and noise was set to follow a Gaussian distribution, and the other parameters were kept at the default values provided by the simulator. These parameters should be defined such that the distribution of the estimated "heritabilities" of the traits is close to those found in real data.

In the clustering step, the k-means method was applied to group genes with similar expression profiles. The parameter k, which indicates the number of clusters, was varied among 20, 30, 40, 50 and 100, and the Euclidean distance was adopted as the distance criterion. For each cluster, the gene with the minimum Euclidean distance to the cluster centroid, in terms of the expression profiles, was selected as the representative gene of the cluster.

After the clustering step, we placed each of the N genes of the input dataset as the target gene in turn, and performed an exhaustive search that evaluates all possible subsets of candidate predictor genes of size p = 3 to retrieve the best predictor subset, using the coefficient of determination (Equation 1) and the mutual information score (Equation 2) as criterion functions. Recall that the candidate predictors are only the representative genes of the clusters (one for each of the k clusters), which were retrieved in the previous step. A multivariate analysis was then applied to further discard redundant predictors from subsets that present a null IMP score with regard to their corresponding targets.

As we are interested in evaluating the structure of the inferred gene expression profile dynamics, the expression of a gene at time t_{i+1} is given by applying the prediction logic of its corresponding predictors to their expression values at time t_i, obtained from the quantized dataset. This is performed considering all timepoints. Each inferred binary gene expression profile is compared with the corresponding binary gene expression profile from the quantized dataset. The percentage of correctly predicted timepoints defines the accuracy (equivalently, one minus the Hamming distance between the two binary profiles divided by the number of timepoints of each profile). The average of the accuracies obtained for all target genes is taken as the overall accuracy of the inferred dataset (values between 0 and 1, where 1 means perfect accuracy and 0.5 is the expected value obtained by random guesses of the binary gene expression profile values). Figure 2 illustrates this assessment.

Figure 2: Evaluation of the framework. A quantized data sample corresponding to timepoint t_i is used to define the inferred sample at timepoint t_{i+1} by applying the prediction logics derived for every gene. Each inferred gene expression profile is evaluated by means of its accuracy (percentage of timepoints correctly inferred). The overall accuracy of the inferred dataset is given by the average accuracy of all inferred gene expression profiles.
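This evaluation loop can be summarized in a few lines (a self-contained Python sketch under our own assumptions; the per-gene prediction logics are given as plain dictionaries and the toy data are hypothetical): each timepoint of the quantized dataset is advanced one step through every gene's prediction logic and compared with the observed quantized value.

```python
def one_step_accuracy(binary, predictors, logic):
    """Average one-step prediction accuracy over all genes and timepoints.

    binary: dict gene -> quantized (0/1) profile of length M
    predictors: dict gene -> tuple of predictor gene names
    logic: dict gene -> {predictor configuration at t: predicted value at t+1}
    Returns the overall accuracy (1.0 = perfect, about 0.5 = random guessing).
    """
    per_gene = []
    for gene, profile in binary.items():
        hits = 0
        for t in range(len(profile) - 1):
            z = tuple(binary[p][t] for p in predictors[gene])
            predicted = logic[gene].get(z, 0)      # default output for unseen configurations
            hits += int(predicted == profile[t + 1])
        per_gene.append(hits / (len(profile) - 1))
    return sum(per_gene) / len(per_gene)

if __name__ == "__main__":
    # toy example: two genes predicting each other with hypothetical logics
    binary = {"g1": [0, 1, 0, 1, 0, 1], "g2": [1, 0, 1, 0, 1, 0]}
    predictors = {"g1": ("g2",), "g2": ("g1",)}
    logic = {"g1": {(1,): 1, (0,): 0}, "g2": {(0,): 0, (1,): 1}}
    print(one_step_accuracy(binary, predictors, logic))   # -> 1.0 for this toy data
```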
In our experiments, we compare both the accuracy and the execution time of GRN inference by pure exhaustive search and by our proposed framework. The pure exhaustive search was performed only for datasets composed of N = 100 genes, since this method was infeasible to compute on our hardware for N > 100 (see the computational complexity analysis in Section 4.1). In contrast, our proposal was executed for datasets composed of N = {100, 1000} genes. We evaluated the performance of our proposal for k = {20, 30, 40, 50, 100}, where k is the number of clusters.

Our framework was implemented in the R language (version 3.2.3). All experiments were executed on a computer with an Intel Xeon E7-2870 8-core CPU at 2.40 GHz and 32 GB of RAM, running the 64-bit Ubuntu Linux operating system.

6 Results and Discussion

Table 1 shows the average prediction accuracy of the inferred expression profiles, taking the quantized dataset as ground truth, for different numbers of clusters and for the pure exhaustive search with N = 100.

Table 1: Average prediction accuracy over all N gene expression profiles and observed processing time for a single gene, achieved by our proposal and by the pure exhaustive search (Exha. columns) for N = {100, 1000} genes. In our proposal, k-means clustering was adopted with k = {20, 30, 40, 50, 100}. In both cases, the exhaustive feature selection was applied using Mutual Information (MI) and the Coefficient of Determination (CoD) as criterion functions. Times are given in minutes (m) and days (d); times marked with ∗ are estimates.

                          N = 100 Genes                             N = 1000 Genes
               k=20    k=30    k=40    k=50    Exha.      k=20    k=30    k=40    k=50    k=100   Exha.
MI   Acc. (%)  82.57   84.84   85.41   86.86   89.84      81.99   84.30   85.78   86.75   89.62   −−
     Time      < 1m    ≈ 1m    ≈ 3m    ≈ 6m    ≈ 29m      < 1m    ≈ 1m    ≈ 3m    ≈ 6m    ≈ 29m   ∗20d
CoD  Acc. (%)  83.59   85.73   86.03   87.70   90.46      83.19   85.28   86.70   87.65   90.29   −−
     Time      < 1m    ≈ 1m    ≈ 2m    ≈ 5m    ≈ 28m      < 1m    ≈ 1m    ≈ 2m    ≈ 5m    ≈ 28m   ∗20d

It is noteworthy that the accuracy loss is very small when using our proposal, especially for k = {50, 100}, while the processing time is substantially reduced when compared to the pure exhaustive search. Note also that our proposal spent less than 30 minutes regardless of the total number of genes, which means that it is scalable in the total number of genes. As predicted by our theoretical analysis, the execution time of our proposal is affected by the number of clusters, and the overhead introduced by the clustering step is negligible.
The exhaustive search, in turn, is infeasible to execute for the larger number of genes. According to our estimates, if the pure exhaustive search were fully processed for a single target gene with N = 1000 genes, it would take about 28,800 minutes (20 days), which is roughly 1000 times longer than the processing time required by our proposed framework for N = 1000 and k = 100. Finally, it is also noteworthy that real expression data often have greater dimensionality than N = 1000, which means that our proposal may be a useful method in many domains in which the exhaustive search is inapplicable.

It can also be noted in our results that the prediction accuracies obtained with the CoD and the MI scores as search criterion functions were very similar. These experiments were repeated 10 times for k = 100 and N = 1000: the CoD score achieved an average accuracy of 90.29% with a standard deviation of 2%, while the MI score achieved 89.62% with a standard deviation of 3%.
7 Conclusion and future work

In this paper we describe a new framework for gene regulatory network inference, in which a clustering method is applied to reduce the complexity of the predictor subset search for each gene placed as target. We demonstrated the applicability of our proposal in experiments using synthetic data, for which it was able to preserve the prediction accuracy obtained by the pure exhaustive search while substantially reducing the computational complexity of the search. In addition, it is important to highlight that the synthetic datasets were generated by a complex and detailed model (non-linear differential equations based on the biochemical dynamics of yeast [Pinna et al., 2011]), while the PGN model on which our framework relies is much simpler. Even assuming a simpler model, our framework described the synthetic expression profiles with great accuracy (about 90%) for datasets with 1000 genes and 100 clusters.

The next step is to evaluate the performance of our proposal on real gene expression datasets. Moreover, since our proposal is a framework, several aspects of the different steps involved can be improved. For example, other clustering algorithms can be tested, as well as other distance metrics and other methods to define the representative genes. Also, the clustering algorithm can be applied after the quantization step, which might lead to clusters with less variability among their respective gene expression profiles.

Even though completely understanding and modeling the properties and structures of real biological systems is still an open problem, our proposal shows promise in assisting professionals in biomedicine and related areas in decision-making regarding the control of gene regulatory system dynamics. Our proposal also provides a viable system for environments with limited computing resources, which was not possible with previous works that applied exhaustive search as a way to guarantee the best predictor subset for each target. Finally, our proposal proved to be scalable, since we were able to increase the number of genes in the input expression data tenfold without increasing the processing time of the exhaustive feature selection for a single target gene, which implies that the total processing time increases linearly with the number of genes in the whole network.

References

[Akutsu et al., 1999] T. Akutsu, S. Miyano, S. Kuhara, et al. Identification of Genetic Networks from a Small Number of Gene Expression Patterns under the Boolean Network Model. In Proceedings of the Pacific Symposium on Biocomputing (PSB), volume 4, pages 17–28, 1999.

[Barabási and Albert, 1999] A. L. Barabási and R. Albert. Emergence of Scaling in Random Networks. Science, 286(5439):509–512, 1999.

[Barrera et al., 2007] J. Barrera, R. M. Cesar-Jr, D. C. Martins-Jr, R. Z. N. Vencio, E. F. Merino, M. M. Yamamoto, F. G. Leonardi, C. A. B. Pereira, and H. A. del Portillo. Constructing Probabilistic Genetic Networks of Plasmodium falciparum from Dynamical Expression Signals of the Intraerythrocytic Development Cycle. In Methods of Microarray Data Analysis V, chapter 2, pages 11–26. Springer, 2007.

[Borelli et al., 2013] F. F. Borelli, R. Y. de Camargo, D. C. Martins-Jr, and L. C. S. Rozante. Gene Regulatory Networks Inference using a Multi-GPU Exhaustive Search Algorithm. BMC Bioinformatics, 14(S5), 2013.

[Carastan-Santos et al., 2016] D. Carastan-Santos, R. Y. Camargo, D. C. Martins-Jr, S. W. Song, and L. C. S. Rozante. Finding Exact Hitting Set Solutions for Systems Biology Applications using Heterogeneous GPU Clusters. Future Generation Computer Systems, 2016. (in press).

[Cover and van Campenhout, 1977] T. M. Cover and J. M. van Campenhout. On the Possible Orderings in the Measurement Selection Problem. IEEE Transactions on Systems, Man and Cybernetics, 7(9):657–661, 1977.

[De-Jong, 2002] H. De-Jong. Modeling and Simulation of Genetic Regulatory Systems: A Literature Review. Journal of Computational Biology, 9(1):67–103, 2002.
[De-Smet and Marchal, 2010] R. De-Smet and K. Marchal. Advantages and Limitations of Current Network Inference Methods. Nature Reviews Microbiology, 8(10):717–729, 2010.

[Dougherty et al., 2000] E. R. Dougherty, S. Kim, and Y. Chen. Coefficient of Determination in Nonlinear Signal Processing. Signal Processing, 80:2219–2235, 2000.

[Faith et al., 2007] J. Faith, B. Hayete, J. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. Collins, and T. Gardner. Large-scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. PLoS Biology, 5:259–265, 2007.

[Friedman et al., 2000] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian Networks to Analyze Expression Data. Journal of Computational Biology, 7(3-4):601–620, 2000.

[Hecker et al., 2009] M. Hecker, S. Lambeck, S. Toepfer, E. Van-Someren, and R. Guthke. Gene Regulatory Network Inference: Data Integration in Dynamic Models: a Review. Biosystems, 96(1):86–103, 2009.

[Ideker et al., 2000] T. Ideker, V. Thorsson, and R. M. Karp. Discovery of Regulatory Interactions through Perturbation: Inference and Experimental Design. In Proceedings of the Pacific Symposium on Biocomputing (PSB), volume 5, pages 302–313, 2000.

[Ivanov and Dougherty, 2006] I. Ivanov and E. R. Dougherty. Modeling Genetic Regulatory Networks: Continuous or Discrete? Journal of Biological Systems, 14(2):219–229, 2006.

[Kauffman, 1969] S. A. Kauffman. Metabolic Stability and Epigenesis in Randomly Constructed Genetic Nets. Journal of Theoretical Biology, 22(3):437–467, 1969.

[Lahdesmaki and Shmulevich, 2003] H. Lahdesmaki and I. Shmulevich. On Learning Gene Regulatory Networks under the Boolean Network Model. Machine Learning, 52:147–167, 2003.

[Liang et al., 1998] S. Liang, S. Fuhrman, and R. Somogyi. REVEAL, a General Reverse Engineering Algorithm for Inference of Genetic Network Architectures. In Proceedings of the Pacific Symposium on Biocomputing (PSB), volume 3, pages 18–29, 1998.

[Lopes et al., 2014] F. M. Lopes, D. C. Martins-Jr, J. Barrera, and R. M. Cesar-Jr. A Feature Selection Technique for Inference of Graphs from their Known Topological Properties: Revealing Scale-free Gene Regulatory Networks. Information Sciences, 272:1–15, 2014.

[Marbach et al., 2012] D. Marbach, J. C. Costello, R. Küffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, M. Kellis, J. J. Collins, and G. Stolovitzky. Wisdom of Crowds for Robust Gene Network Inference. Nature Methods, 9(8):796–804, 2012.

[Margolin et al., 2006] A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. D. Favera, and A. Califano. ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics, 7(Suppl 1):S7, 2006.

[Markowetz and Spang, 2007] F. Markowetz and R. Spang. Inferring Cellular Networks – A Review. BMC Bioinformatics, 8(Suppl 6):S5, 2007.

[Martins-Jr et al., 2008] D. C. Martins-Jr, U. M. Braga-Neto, R. F. Hashimoto, M. L. Bittner, and E. R. Dougherty. Intrinsically Multivariate Predictive Genes. IEEE Journal of Selected Topics in Signal Processing, 2(3):424–439, 2008.

[Meyer et al., 2007] P. Meyer, K. Kontos, F. Lafitte, and G. Bontempi. Information-Theoretic Inference of Large Transcriptional Regulatory Networks. EURASIP Journal on Bioinformatics and Systems Biology, 2007:1–9, 2007.

[Nam et al., 2006] D. Nam, S. Seo, and S. Kim. An Efficient Top-down Search Algorithm for Learning Boolean Networks of Gene Expression. Machine Learning, 65:229–245, 2006.

[Pinna et al., 2011] A. Pinna, N. Soranzo, I. Hoeschele, and A. de la Fuente. Simulating Systems Genetics Data with SysGenSIM. Bioinformatics, 27(17):2459–2462, 2011.

[Pudil et al., 1994] P. Pudil, J. Novovicová, and J. Kittler. Floating Search Methods in Feature Selection. Pattern Recognition Letters, 15:1119–1125, 1994.

[Shalon et al., 1996] D. Shalon, S. J. Smith, and P. O. Brown. A DNA Microarray System for Analyzing Complex DNA Samples Using Two-color Fluorescent Probe Hybridization. Genome Research, pages 639–645, 1996.

[Shannon, 2001] C. E. Shannon. A Mathematical Theory of Communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 2001.

[Shmulevich and Dougherty, 2014] I. Shmulevich and E. R. Dougherty. Genomic Signal Processing. Princeton University Press, 2014.

[Shmulevich et al., 2002] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang. Probabilistic Boolean Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks. Bioinformatics, 18(2):261–274, 2002.

[Styczynski and Stephanopoulos, 2005] M. P. Styczynski and G. Stephanopoulos. Overview of Computational Methods for the Inference of Gene Regulatory Networks. Computers & Chemical Engineering, 29(3):519–534, 2005.

[Velculescu et al., 1995] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial Analysis of Gene Expression. Science, 270:484–487, 1995.

[Wang et al., 2009] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: A Revolutionary Tool for Transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.

[Zhao and Karypis, 2003] Y. Zhao and G. Karypis. Clustering in Life Sciences. Functional Genomics: Methods and Protocols, pages 183–218, 2003.