=Paper=
{{Paper
|id=Vol-1624/paper21
|storemode=property
|title=Continuous Target Variable Prediction with Augmented Interval Pattern Structures: A Lazy Algorithm
|pdfUrl=https://ceur-ws.org/Vol-1624/paper21.pdf
|volume=Vol-1624
|authors=Alexey Masyutin,Sergei O. Kuznetsov
|dblpUrl=https://dblp.org/rec/conf/cla/MasyutinK16
}}
==Continuous Target Variable Prediction with Augmented Interval Pattern Structures: A Lazy Algorithm==
Continuous Target Variable Prediction with Augmented Interval Pattern Structures: A Lazy Algorithm

Alexey Masyutin and Sergei O. Kuznetsov
National Research University Higher School of Economics, Moscow, Russia
alexey.masyutin@gmail.com, skuznetsov@hse.ru

Abstract. Pattern structures are known to provide a tool for predictive modeling and classification. However, in order to generate classification rules, a concept lattice has to be built, and this procedure may take much time and resources. Previous work showed that this problem can be avoided with a so-called lazy associative classification algorithm: it does not require lattice construction and is applicable to classification problems such as credit scoring. In this paper we adapt the method to the case of a continuous target variable, i.e. the regression problem, and apply it to recovery rate forecasting. We tune the parameters, assess the accuracy of the algorithm on bank data, and compare it to the models adopted in the bank's system and to other benchmarks.

1 Introduction

Banks and financial institutions take and mitigate credit risk on a daily basis. Credit risk commonly makes the largest contribution to bank losses compared to other types of risk, such as market, operational and liquidity risks [10]. The key to successful risk management is to adequately assess the probability of credit losses and the amount of a loan that is likely to be recovered in case of default. Accurate risk assessment is not only important for an individual bank, it is also crucial for the banking system as a whole. The problem is so vital that the banking industry is strictly regulated by central banks and the Basel supervisory committee, which even impose certain requirements on the predictive models used by banks [17]. Some predictive models, the so-called "black-box" models, provide good results that are hard to interpret. A major feature of risk-management practice is that, regardless of its accuracy, a model must not be a black box. That is why methods such as neural networks and SVM classifiers did not earn much trust within the banking community [4]. At the same time, the more accurate the model, the lower the capital charge the bank incurs. Banks therefore prefer accurate models that provide interpretable decision making. FCA-based algorithms seem helpful here since they rely on concepts that have an obvious interpretation: the intent of a concept can be read as a set of rules supported by the extent of the concept. Previous work showed that FCA-based interval pattern structure methods are applicable to credit scoring, which is a classification problem with a binary target variable [14,16]. Classifying credit applicants into good and potentially delinquent clients is the first part of credit risk assessment. The second part is to estimate the recovery rate in case of default, i.e. the proportion of the loan that will be collected by the bank [10]. Recovery rate prediction involves a continuous target variable. In this paper, we adapt the lazy classification algorithm based on interval pattern structures to the case of a continuous target variable, i.e. we introduce a modified lazy regression algorithm (MLRA).

The paper is structured as follows: Section 2 provides basic formal concept analysis definitions. Section 3 describes the architecture of MLRA and its parameters.
Section 4 describes the data used for the evaluation of the algorithm's accuracy and for the comparison with benchmarks such as random forests. Section 5 concludes the paper. Finally, pseudo-code for the algorithm is given in the Appendix.

2 Main Definitions

First, we recall some standard definitions related to FCA, see e.g. [1,2]. Let G be a set (of objects), let (D, ⊓) be a meet-semilattice (of all possible object descriptions) and let δ: G → D be a mapping. Then (G, D, δ), where D = (D, ⊓), is called a pattern structure [1], provided that the set δ(G) := {δ(g) | g ∈ G} generates a complete subsemilattice (D_δ, ⊓) of (D, ⊓), i.e., every subset X of δ(G) has an infimum ⊓X in (D, ⊓). Elements of D are called patterns and are naturally ordered by the subsumption relation ⊑: given c, d ∈ D one has c ⊑ d ⇔ c ⊓ d = c. The operation ⊓ is also called a similarity operation. A pattern structure (G, D, δ) gives rise to the following derivation operators (·)□:

  A□ = ⊓_{g ∈ A} δ(g)   for A ⊆ G,
  d□ = {g ∈ G | d ⊑ δ(g)}   for d ∈ (D, ⊓).

These operators form a Galois connection between the powerset of G and (D, ⊓). The pairs (A, d) satisfying A ⊆ G, d ∈ D, A□ = d, and A = d□ are called pattern concepts of (G, D, δ), with pattern extent A and pattern intent d. The operator (·)□□ is an algebraic closure operator on patterns, since it is idempotent, extensive, and monotone [1].

In the case of credit scoring we work with pattern structures on intervals, since a typical object-attribute data table is not binary but has many-valued attributes. Instead of binarizing (scaling) the data, one can work directly with many-valued attributes by applying interval pattern structures. For two intervals [a1, b1] and [a2, b2], with a1, b1, a2, b2 ∈ R, the meet operation is defined as [14]:

  [a1, b1] ⊓ [a2, b2] = [min(a1, a2), max(b1, b2)]

The concept-based learning model for the standard object-attribute representation (i.e., formal contexts) is naturally extended to pattern structures when we have a binary target attribute, i.e. a set of positive examples G+ and a set of negative examples G− [16]. However, what should we do when the target attribute is not a class label but a continuous variable? For that case we augment the definition of an interval pattern structure by equipping it with an additional component h.

Augmented interval pattern structures

Let us define an augmented interval pattern structure as a quadruple (G, D, δ, h), where a description d consists of two elements dx and dy (dy is an interval for the target attribute y ∈ R and dx is a vector of intervals for the explanatory attributes x which are supposed to predict the target attribute y), δ: G → D, and h ∈ H, where H is a family of density functions for the target attribute y, i.e. ∫_{−∞}^{+∞} h(s) ds = 1. We also use the notation δx and δy to distinguish between descriptions containing explanatory attributes and the target attribute, respectively. The definition of the meet operation is left unchanged.

Suppose we have an arbitrary set of objects A0 ⊆ G, i.e. A0 = {g1, g2, ..., gJ}, with δ(gj) = {δx, δy} = {[x_{1j}; x_{1j}], ..., [x_{Mj}; x_{Mj}], [y_j; y_j]} for j = 1, ..., J, where M is the number of explanatory attributes. Then we define the derivation operator in the following way: A0□ = (d0, h0), where d0 = {dx0, dy0}, dx0 = δx(g1) ⊓ ... ⊓ δx(gJ), and the target attribute description dy0 = δy(g1) ⊓ ... ⊓ δy(gJ), which is in fact a single interval [ymin, ymax], and h0: dy0 → [0; 1].
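To make the interval meet and the derivation on a set of objects concrete, here is a minimal sketch (our illustration, not the authors' code): numerical objects are treated as vectors of degenerate intervals, and the meet of several objects widens each attribute interval to cover all observed values. The names `meet_intervals` and `describe` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]

def meet_intervals(a: Interval, b: Interval) -> Interval:
    """[a1, b1] ⊓ [a2, b2] = [min(a1, a2), max(b1, b2)] -- the interval meet."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def describe(rows: List[List[float]]) -> List[Interval]:
    """Meet of the degenerate interval descriptions of several objects:
    one interval per attribute, spanning the observed values."""
    desc = [(v, v) for v in rows[0]]
    for row in rows[1:]:
        desc = [meet_intervals(d, (v, v)) for d, v in zip(desc, row)]
    return desc

# Two objects with two explanatory attributes each.
print(describe([[0.2, 1.5], [0.4, 1.1]]))  # [(0.2, 0.4), (1.1, 1.5)]
```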
Here h0 is, in effect, a density function of the target attribute based on the observations in A0, which we describe below. Let τ0, ..., τK be a partition of dy0 with τ0 = ymin, τK = ymax and ∆τ_i = (ymax − ymin)/K = τ_i − τ_{i−1}, i = 1, ..., K. Then:

  h([τ_{i−1}, τ_i)) = |{g ∈ A | [τ_{i−1}, τ_i) ⊑ δy(g)}| / |A|,   ∀i = 1, ..., K

Thus, h is a density function of the target attribute values of the objects in A. The derivation operator on descriptions returns the set of objects whose descriptions subsume the description dx0, whatever the target description dy0 and the density function h are:

  A0□ := (d0, h0)□ = dx0□ = A1,   where A1 = {g ∈ G | dx0 ⊑ δx(g)}.

Finally, A1□ = (d1, h1). Note that d1 = {dx0, dy1}, i.e. only the target attribute description dy is updated, and with it the density function h, while the explanatory-variable description dx0 remains the same.

In order to approach the target attribute prediction problem it is useful to define α-weak premises with allowed dropout. An h-augmented interval pattern d ∈ D is called an α-weak premise with allowed ω-dropout iff:

  1 − |{g ∈ A | dy^min − ω(m − dy^min) ≤ δy(g) ≤ dy^max + ω(dy^max − m)}| / |A| ≤ α

where d = (dx, dy), dy is a single interval [dy^min; dy^max] for the target attribute y, A = dx□, and m is the median of the density function h, which reflects the distribution of the target attribute within the interval dy based on the objects from A. The parameter α controls the frequency of hypothesis falsifications and the parameter ω controls the magnitude of falsification, i.e. how dramatically the hypothesis is falsified. In our case the magnitude is evaluated as the number of times δy(g) − dy^max exceeds dy^max − m if δy(g) > dy^max, or the number of times dy^min − δy(g) exceeds m − dy^min if δy(g) < dy^min. Note that in the case ω = 0 we apply the strictest criterion to consider a hypothesis falsified:

  1 − |{g ∈ A | dy^min ≤ δy(g) ≤ dy^max}| / |A| ≤ α  ⇔  1 − |{g ∈ A | dy ⊑ δy(g)}| / |A| ≤ α  ⇔  |{g ∈ A | dy ⋢ δy(g)}| / |A| ≤ α

3 Lazy predictive algorithm with continuous target attribute

Assume we have a set of objects G and a numerical context with a set of explanatory attributes x1, ..., xM and a target attribute y. In contrast to the classification problem, the context is not divided into positive and negative examples, since y takes numerical values. Now suppose we receive a test object gt with observable attributes x but with an unknown value of the target attribute y. Is there a way to predict y using the interval pattern structures approach? Indeed there is, and we describe it below and compare its accuracy with some benchmarks.

The first stage of the algorithm is mining α-weak premises with allowed ω-dropout; the second is to perform the prediction for the test object gt based on the mined premises. We start by choosing the subsample size parameter, which is the number of objects randomly extracted from G. Then we specify the α and ω parameters, which control the "anti-support" in terms of both frequency and magnitude. Upon randomly extracting some objects A0 = {g1, ..., gK}, we compute the pattern d0 = δ(g1) ⊓ ... ⊓ δ(gK) ⊓ δ(gt) and the density function h0 for the target attribute values. If d0 is an α-weak premise with allowed ω-dropout, it is added to the set of premises that will be used for prediction later. Together with the pattern it is necessary to store the density function h. But which of h0, h1 or another should we use? Here we introduce another parameter of the algorithm, called "capped".
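The α-weak premise test above can be expressed compactly in code. The following sketch is ours (names are hypothetical), and it assumes m is taken as the median of the target values of the extent A that fall inside dy, which is how we read the definition of h over dy.

```python
import statistics
from typing import List

def is_alpha_weak_premise(dy_min: float, dy_max: float,
                          y_in_extent: List[float],
                          alpha: float, omega: float) -> bool:
    """y_in_extent: target values δy(g) of all g in A = dx□.
    dy = [dy_min; dy_max] is the target interval of the premise."""
    inside_dy = [v for v in y_in_extent if dy_min <= v <= dy_max]
    m = statistics.median(inside_dy)          # median of h over dy, based on A
    lo = dy_min - omega * (m - dy_min)        # ω-widened lower bound
    hi = dy_max + omega * (dy_max - m)        # ω-widened upper bound
    ok = sum(1 for v in y_in_extent if lo <= v <= hi)
    return 1.0 - ok / len(y_in_extent) <= alpha

# With ω = 0 this reduces to counting objects falling outside dy itself.
print(is_alpha_weak_premise(0.2, 0.5, [0.1, 0.25, 0.3, 0.45, 0.9],
                            alpha=0.4, omega=0.0))  # True: 2 of 5 fall outside
```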
Capped is a boolean value; if it is true, the range dy1 of the target attribute in d1 is truncated to dy0, and the corresponding density function h1 is calculated on the truncated set of target values. If the capped parameter is false, we keep dy1 and calculate the density function based on all target values of objects in d0□ that fall into dy1. The whole procedure is repeated many times, which is controlled by the number of iterations parameter.

Having finished premise mining, we move on to the next stage, which is building a prediction for the target attribute based on the mined premises. In our case, the resulting prediction is defined by a mixture of the distributions from all premises. In practice, all target attribute values stored with the premises are put together to form a final distribution, and we tried both the average and the median of that distribution as the prediction for the target attribute. Such an approach takes into account the different support of the premises, since premises with a greater number of objects contribute more. However, one can argue that premises also differ in their anti-support and in the deviation of their target attribute values. Indeed, we would like to put more weight on predictions based on premises with a narrow range of target attribute values and on those with fewer falsifying examples from G. Therefore, we added target values to the final distribution with different weights, so both a weighted average and a weighted median were used as the forecast.

We introduced two boolean parameters which control the weighting scheme. The first is account for anti-support and the second is penalty for high deviation. When the account for anti-support parameter is true, the target values δy(g) of objects g ∈ A of the premise d are given a weight according to the anti-support of that premise:

  wa = |{g ∈ A | dy^min − ω(m − dy^min) ≤ δy(g) ≤ dy^max + ω(dy^max − m)}| / |A|

When penalty for high deviation is true, the weight decreases as the deviation of the target attribute values grows:

  wpen = 1 / σ(δy(g))

where σ(δy(g)) is the standard deviation of the target attribute values. If a parameter's value is false, the corresponding weight equals one. The final weight for the target attribute value of an object g, contributed to the aggregate distribution used for prediction, is defined as the product of the two weights:

  w(g) = wa · wpen

Finally, suppose that P is the set of mined α-weak premises with allowed ω-dropout. The prediction for the target attribute y of a test object gt can be based on the weighted average:

  δ̂y(gt) = ( Σ_{p∈P} Σ_{g∈Ap} δy(g) · w(g) ) / ( Σ_{p∈P} Σ_{g∈Ap} w(g) )

or on the weighted median:

  δ̂y(gt) = median( ∪_{p∈P} ∪_{g∈Ap} (δy(g), w(g)) )

In case P is an empty set, the prediction is the average or the median of all target attribute values in G, i.e. the prediction is based on the "naive" model.

4 Data and experiments

The data we used for the computation represent a pool of delinquent corporate client loans which were expected to be restructured. The restructuring process starts at the early stage when a client shows the first signs of insolvency. At that moment the bank chooses either to execute the default strategy, when court processes are launched and any disposable collateral is put up for sale, or to execute the restructuring strategy, when the funding conditions are revisited, usually resulting in a longer credit period.
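As an illustration of the voting stage, the sketch below (our code, with hypothetical function names) pools the (δy(g), w(g)) pairs from all premises and computes either the weighted average or the weighted median defined above.

```python
from typing import List, Tuple

def weighted_average(pairs: List[Tuple[float, float]]) -> float:
    """pairs: (target value, weight) collected over all g in all premises p ∈ P."""
    total_w = sum(w for _, w in pairs)
    return sum(y * w for y, w in pairs) / total_w

def weighted_median(pairs: List[Tuple[float, float]]) -> float:
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(pairs)
    half = sum(w for _, w in pairs) / 2.0
    acc = 0.0
    for y, w in pairs:
        acc += w
        if acc >= half:
            return y
    return pairs[-1][0]

example = [(0.2, 1.0), (0.35, 2.0), (0.9, 0.5)]
print(weighted_average(example))  # ~0.386
print(weighted_median(example))   # 0.35
```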
In the case of corporate clients, banks usually do not want to go to extremes right from the start, since launching court proceedings and selling collateral imply costs and time. The bank would also prefer to maintain relations with the client if the financial distress is temporary. So the decision whether to launch the default strategy is based, to a great extent, on the recovery expectations, which makes the problem of recovery prediction crucial for banking decision making. The recovery rate is a number between zero and one reflecting the share of the current exposure which the client is going to pay back over some time horizon. If the expected recovery rate is high, the bank prefers restructuring, and a court launch otherwise.

In this paper we use financial data from the balance sheets and profit and loss statements of 612 corporate clients of a top-10 Russian bank. Among other factors, we used the assets-to-liabilities ratio, debt-to-equity ratio, earnings before taxes and interest payments, return on assets, etc. These clients were assessed at the time of the early insolvency signals and the resulting recovery rate was collected. The data was randomly divided into two parts, with 70% of observations in one part and 30% in the other. The bigger part was used as the context with a known target attribute for the lazy algorithm, and the 30% part was used as a test set to evaluate the predictions and their accuracy. The same data partition was used to run random forests with different tunings, with the 70% part used as the training set and the rest as the test set. For the random forests, three parameters were tuned by grid search: minimum node size, number of trees, and number of feasible variables.

The accuracy of the predictions was evaluated in terms of mean absolute deviation (MAD):

  MAD = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|

where y_i is the target attribute (recovery rate) for the i-th client in the test set and ŷ_i is the prediction.

The random forests were run on the following parameter grid: minimum node size ranging from 30 to 100 with increment 10, number of trees taking the values 10, 30, 50 and 100, and number of feasible variables ranging from 5 to 45 with increment 5. As far as the lazy algorithm is concerned, we tuned seven parameters, four of them continuous and three boolean. Subsample size took the values 0.01, 0.02, 0.03, 0.04, 0.05, 0.1. Number of iterations: 100, 500, 1000, 2000. Alpha threshold: 0, 0.05, 0.01, 0.015, 0.02. Allowed dropout: 0, 0.1, 0.5, 1, 1.5.

For each combination of parameters we calculated the MAD on the test set, and in effect this produced metadata for the analysis. We obtained MAD distributions, which at the first step helped us to choose the forecast based on the weighted median rather than the weighted average, since the MAD distributions for the latter took dramatically higher values, which is, of course, undesirable. When building a new algorithm one has some intuition about its mechanics, and we performed a regression analysis of algorithm accuracy versus parameter values to check that intuition. The analysis was also important for determining a better parameter tuning and for explaining the variation in prediction accuracy. The results of the regression are presented in Table 1.
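For reference, the evaluation metric is straightforward to compute; the snippet below is a minimal sketch with hypothetical inputs `y_true` and `y_pred` for the test-set recovery rates.

```python
def mad(y_true, y_pred):
    """Mean absolute deviation between actual and predicted recovery rates."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

print(mad([0.5, 0.8, 0.1], [0.4, 0.9, 0.2]))  # 0.1
```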
Table 1. Regression analysis of the dependency between MAD and the algorithm parameters

  Coefficient                  Estimate   Std. Error   t value   p-value
  (Intercept)                   0.3288     0.0006      519.4     0.0000
  Subsample size                0.0155     0.0031        4.940   0.0000
  Number of iterations         -0.0004     0.0000      -18.05    0.0000
  Alpha-threshold              -0.0457     0.0270       -1.695   0.0903
  Allowed dropout              -0.0011     0.0004       -2.975   0.0030
  Capped                       -0.0022     0.0004       -5.401   0.0000
  Account for anti-support      0.0002     0.0004        0.624   0.5329
  Penalty for high deviation    0.0010     0.0004        2.433   0.0150

We see that increasing the number of iterations, allowing dropouts and using capped improve algorithm performance: the coefficients are negative and significant, so the overall prediction error decreases as these factors increase. Surprisingly, adjusting the account for anti-support and penalty for high deviation parameters does not yield a significant improvement in accuracy. We also expected some non-linear dependencies between MAD and parameter values since, intuitively, there has to be an optimal size of the randomly extracted subsample. Therefore, we supplement the regression output with single-factor scatter plots of the average MAD (across all other iterations) versus each parameter.

Fig. 1. Single-factor analysis of average MAD versus parameter value: continuous and boolean parameters

As expected, there is a local minimum for the size of the subsample extracted from G. This is quite natural: as the subsample size grows, its intersection with a test object results in a very generic description, which is likely to be falsified by objects whose target attribute values lie outside the target range of the premise description. According to the performed grid search, the range with the lowest MAD (0.247 - 0.290) on the test sample is achieved in the following parameter area: alpha-threshold = 1.5%, number of iterations = 10, subsample size = 1%, allowed dropout = 0.1. The result was compared to the benchmarks represented by the random forest tunings.

Fig. 2. MAD distribution shows that the lazy algorithm allows one to obtain a prediction error lower than that of the random forest tunings

5 Conclusion

Formal concept analysis offers attractive instruments to extract knowledge from data, since intents of concepts can be considered as association rules. FCA-based algorithms are suitable for predictive modeling in areas where clarity of model interpretation has high priority. However, previous work considered only classification problems, while continuous target attribute prediction, i.e. the regression problem, was out of focus. In this paper, we adjusted the lazy algorithm [3,16] so that it can produce continuous predictions. The adjustment required a new definition of an augmented interval pattern structure. In effect, the adjusted algorithm mines the premises (with the expected distribution of the target attribute) that are relevant to the test object, and the prediction is then made from the target attribute distribution, e.g. from its median. We applied the algorithm to delinquent corporate client loans in order to predict the recovery rate for each loan. The data comes from a pilot project with one of the top-10 banks in Russia. Mean absolute deviation was chosen as the accuracy metric. We performed a simple grid search by running the algorithm with different parameter values and chose the tuning with the lowest value of the metric. The accuracy of the algorithm was compared to benchmarks represented by random forests, since their predictions are also based on combinations of simple rules (Fig. 3).
Fig. 3. MAD distribution of the lazy algorithm versus the best random forest tuning and the naive model

The proposed modified lazy regression algorithm showed comparable quality in the greater number of runs, and in a certain parameter area it outperformed random forests. However, it has to be mentioned that the number of parameters is greater in our algorithm, which, in effect, results in greater algorithm complexity and more degrees of freedom. As an area for further research, one can consider keeping the density function h not only for the target attribute in premises, but also making use of density functions for the explanatory attributes. It can be expected that if the premises are mined not only based on the allowed dropout and alpha-threshold parameters, but also based on some properties of the attribute distributions, then the premises will be more relevant for the test objects and will produce more accurate predictions of the target attribute.

References

1. B. Ganter and S.O. Kuznetsov: Pattern structures and their projections. In: Conceptual Structures: Broadening the Base (H. Delugach, G. Stumme, Eds.), Lecture Notes in Computer Science, vol. 2120, pp. 129-142. Springer, Berlin/Heidelberg, 2001.
2. B. Ganter, R. Wille: Formal Concept Analysis: Mathematical Foundations. Springer, Berlin, 1999.
3. S.O. Kuznetsov: Scalable knowledge discovery in complex data with pattern structures. In: PReMI (P. Maji, A. Ghosh, M.N. Murty, K. Ghosh, S.K. Pal, Eds.), Lecture Notes in Computer Science, vol. 8251, pp. 30-39. Springer, 2013.
4. L. Thomas, D. Edelman, J. Crook: Credit Scoring and Its Applications. Monographs on Mathematical Modeling and Computation, SIAM, Philadelphia, pp. 107-117, 2002.
5. N. Siddiqi: Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley, ISBN 978-0-471-75451-0, 2005.
6. B. Baesens, T.V. Gestel, S. Viaene, M. Stepanova, J. Suykens: Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6), 627-635, 2003.
7. A. Ghodselahi: A Hybrid Support Vector Machine Ensemble Model for Credit Scoring. International Journal of Computer Applications (0975-8887), vol. 17, no. 5, 2011.
8. L. Yu, S. Wang, K.K. Lai: An intelligent agent-based fuzzy group decision making model for financial multicriteria decision support: the case of credit scoring. European Journal of Operational Research, vol. 195, pp. 942-959, 2009.
9. T.V. Gestel, B. Baesens, J.A. Suykens, D. Van den Poel: Bayesian kernel based classification for financial distress detection. European Journal of Operational Research, vol. 172, pp. 979-1003, 2006.
10. P. Ravi and V. Ravi: Bankruptcy Prediction in Banks and Firms via Statistical and Intelligent Techniques - A Review. European Journal of Operational Research, vol. 180, no. 1, pp. 1-28, 2007.
11. S.O. Kuznetsov and M.V. Samokhin: Learning closed sets of labeled graphs for chemical applications. In: ILP (S. Kramer, B. Pfahringer, Eds.), Lecture Notes in Computer Science, vol. 3625, pp. 190-208. Springer, 2005.
12. SAS Institute Inc.: Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner 12.1. Cary, NC: SAS Institute Inc., 2012.
13. R.R. Hocking: The Analysis and Selection of Variables in Linear Regression. Biometrics, 32, 1976.
14. M. Kaytoue, S.O. Kuznetsov, A. Napoli, S. Duplessis: Mining gene expression data with pattern structures in formal concept analysis. Information Sciences, vol. 181, no. 10, pp. 1989-2001, 2011.
15. A. Veloso and W. Meira, Jr.: Demand-Driven Associative Classification. Springer, 2011.
16. A. Masyutin, Y. Kashnitsky, S.O. Kuznetsov: Lazy Classification with Interval Pattern Structures: Application to Credit Scoring. In: Proc. 4th International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI 2015), pp. 43-54, 2015.
17. E.I. Altman, A. Resti and A. Sironi: The link between default and recovery rates: effects on the procyclicality of regulatory capital ratios. BIS Working Papers No. 113, Monetary and Economic Department, 2002.

Appendix

Algorithm 1 Lazy Regression by Sub-Samples with Continuous Target Attribute

Input:
  G – numerical context with explanatory variables x and a single target attribute y.
  M – number of explanatory attributes.
  sub.smpl – percentage of the context randomly used for intersection with the test object (parameter).
  num.iter – number of iterations (resamplings) during the premise mining (parameter).
  alpha.threshold – maximum allowable percentage of the context G representing the objects which falsify the premise (parameter).
  gt – test object.
Output:
  δ̂y(gt) – prediction produced by the voting rule.
  P – a set of premises, i.e. association rules produced for the test object gt; P can be empty.

for iter from 1 to num.iter do
  A0 = random.sample(G, size = sub.smpl · |G|)   (mine α-weak premises with allowed ω-dropout)
  d0 = δx(g1) ⊓ ... ⊓ δx(gs) ⊓ δx(gt), gs ∈ A0 ∀s
  Compute the empirical density function h0 for d0y.
  A1 = d0□
  if 1 − |{g ∈ A1 | d0y^min − ω(m − d0y^min) ≤ δy(g) ≤ d0y^max + ω(d0y^max − m)}| / |A1| ≤ α then
    Update the empirical density function h0 to h1 based on the new values of the target attribute in A1.
    Add (d0, h1) to the set P of α-weak premises with allowed ω-dropout.
  else
    Do nothing
  end if
end for
Define the weighting scheme wa, wpen.
Calculate the prediction from the mixture of distribution functions hp based on dpy, ∀p ∈ P, e.g. the weighted average:

  δ̂y(gt) = ( Σ_{p∈P} Σ_{g∈Ap} δy(g) · w(g) ) / ( Σ_{p∈P} Σ_{g∈Ap} w(g) )

If P is empty, then calculate the median of the target attribute values of all g ∈ G (naive prediction).
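For readers who prefer running code to pseudo-code, here is a minimal end-to-end sketch of Algorithm 1 in Python under simplifying assumptions of ours (not the authors' implementation): the context is given as a list of explanatory rows X with a parallel list of target values y, the capped option and the two weighting flags are omitted (all weights equal one), and the prediction is the plain median of the pooled target values.

```python
import random
import statistics
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]

def meet(desc: List[Interval], row: Sequence[float]) -> List[Interval]:
    """Widen each attribute interval of `desc` to cover the values in `row`."""
    return [(min(lo, v), max(hi, v)) for (lo, hi), v in zip(desc, row)]

def subsumes(desc: List[Interval], row: Sequence[float]) -> bool:
    """True if every value of `row` falls into the corresponding interval of `desc`."""
    return all(lo <= v <= hi for (lo, hi), v in zip(desc, row))

def lazy_regression(X: List[Sequence[float]], y: List[float],
                    x_test: Sequence[float], sub_smpl: float = 0.01,
                    num_iter: int = 1000, alpha: float = 0.015,
                    omega: float = 0.1) -> float:
    pooled: List[float] = []                      # δy(g) for g in every accepted premise
    n = max(1, int(sub_smpl * len(X)))
    for _ in range(num_iter):
        sample = random.sample(range(len(X)), n)
        # d0x: intersection of the subsample descriptions with the test object
        d0x = [(v, v) for v in x_test]
        for i in sample:
            d0x = meet(d0x, X[i])
        # d0y: target interval of the subsample; m: median of its distribution h0
        y_sub = [y[i] for i in sample]
        dy_min, dy_max = min(y_sub), max(y_sub)
        m = statistics.median(y_sub)
        # A1 = d0x□: all context objects whose descriptions subsume d0x
        y_ext = [y[j] for j in range(len(X)) if subsumes(d0x, X[j])]
        lo = dy_min - omega * (m - dy_min)        # ω-widened premise bounds
        hi = dy_max + omega * (dy_max - m)
        inside = sum(1 for v in y_ext if lo <= v <= hi)
        if 1.0 - inside / len(y_ext) <= alpha:    # α-weak premise with ω-dropout
            pooled.extend(y_ext)                  # keep its target distribution
    # naive prediction when no premise survived
    return statistics.median(pooled) if pooled else statistics.median(y)

# Toy usage: 20 synthetic loans with 2 attributes and a recovery-rate-like target.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(20)]
y = [0.3 * a + 0.5 * b for a, b in X]
print(lazy_regression(X, y, [0.5, 0.5], sub_smpl=0.1, num_iter=200))
```

The weighting options of Section 3 would be added by pooling (value, weight) pairs instead of raw values and replacing the final median with the weighted median shown earlier.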