<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Semantically Coherent Rules</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Gabriel</string-name>
          <email>agabriel@mayanna.org</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko@informatik.uni-mannheim.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frederik Janssen</string-name>
          <email>janssen@ke.tu-darmstadt.de</email>
        </contrib>
      </contrib-group>
      <fpage>49</fpage>
      <lpage>63</lpage>
      <abstract>
<p>The capability of building a model that can be understood and interpreted by humans is one of the main selling points of symbolic machine learning algorithms, such as rule or decision tree learners. However, those algorithms are most often optimized w.r.t. classification accuracy, but not the understandability of the resulting model. In this paper, we focus on a particular aspect of understandability, i.e., semantic coherence. We introduce a variant of a separate-and-conquer rule learning algorithm using a WordNet-based heuristic to learn rules that are semantically coherent. In an evaluation on different datasets, we show that the approach learns rules that are significantly more semantically coherent, without losing accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Rule Learning</kwd>
        <kwd>Semantic Coherence</kwd>
        <kwd>Interpretability</kwd>
        <kwd>Rule Learning Heuristics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Symbolic machine learning approaches, such as rule or decision tree induction,
have the advantage of creating a model that can be understood and interpreted
by human domain experts – unlike statistical models such as Support Vector
Machines. In particular, rule learning is one of the oldest and most intensively
researched fields of machine learning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>Despite this advantage, the actual understandability of a learned model has
received little attention so far. Most learning algorithms are optimized w.r.t.
classification accuracy, but not understandability. Most often, the latter is
measured rather naively, e.g., by the average number of rules and/or conditions,
without paying any attention to the relations among them.</p>
      <p>
        The understandability of a rule model comprises different dimensions. One
of those dimensions is semantic coherence, i.e., the semantic proximity of the
different conditions in a rule (or across the entire ruleset). Prior experiments
have shown that this coherence has a major impact on the reception of a rule
model. This notion is similar to the notion of semantic coherence of texts, which
is a key factor to understanding those texts [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        In a previous user study, we showed different rules describing the quality
of living in cities to users. The experiments showed that semantically coherent
rules, such as Cities with medium temperatures and low precipitation, are
favored over incoherent rules, such as Cities with medium temperatures where
many music albums have been recorded [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>
        In this paper, we discuss how separate-and-conquer rule learning algorithms
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] can be extended to support the learning of more coherent rules. We introduce
a new heuristic function that combines a standard heuristic (such as Accuracy
or m-Estimate) with a semantic one, and allows for adjusting the weight of
each component. With that weight, we are able to control the trade-off between
classification accuracy and semantic coherence.
      </p>
      <p>The rest of this paper is structured as follows. We begin by briefly introducing
separate-and-conquer rule learning. Next, our approach to learning semantically
coherent rules is detailed. In the subsequent evaluation, we introduce the datasets
and show the results; here, some exemplary rules are also given, which indeed
indicate semantic coherence between the conditions of the rules. After that, related
work is reviewed. Finally, the paper is concluded and future work is outlined.</p>
    </sec>
    <sec id="sec-2">
      <title>Separate-and-Conquer Rule Learning</title>
      <p>
        Separate-and-conquer rule learning is still amongst the most popular strategies
to induce a set of rules that can be used to classify unseen examples, i.e., correctly
map them to their respective classes. How exactly this strategy is implemented
varies among the different algorithms, but most of them fit into the framework of
separate-and-conquer. This led to the development of the so-called SeCo suite
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a versatile framework that allows most existing algorithms to be
configured properly. Owing to its flexibility and the convenient way it allows new
functions and extensions to be implemented, we chose this framework for our experiments.
      </p>
      <p>In essence, a separate-and-conquer rule learner proceeds in two major steps:
First, a single rule that fulfills certain quality criteria is learned from the data
(this is called the conquer step of the algorithm). Then, all (positive) examples
that are covered by this rule are removed from the dataset (the separate step),
and the algorithm proceeds by learning the next rule until all examples are
covered.</p>
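This two-step loop can be sketched in a few lines of Python. This is a minimal sketch only: the `Rule` class and the `learn_single_rule` callback are simplified stand-ins for the actual SeCo components, not the framework's API.

```python
class Rule:
    """A conjunctive rule: a mapping from attribute name to required value."""
    def __init__(self, conditions):
        self.conditions = conditions

    def covers(self, example):
        return all(example.get(a) == v for a, v in self.conditions.items())


def separate_and_conquer(examples, learn_single_rule):
    """Learn a ruleset: learn one rule fulfilling the quality criteria
    (conquer step), remove the positive examples it covers (separate step),
    and repeat until all positive examples are covered."""
    ruleset = []
    positives = [e for e in examples if e["class"] == "+"]
    while positives:
        rule = learn_single_rule(positives, examples)
        if rule is None:  # no acceptable rule found: stop early
            break
        ruleset.append(rule)
        positives = [e for e in positives if not rule.covers(e)]
    return ruleset
```

The callback is where the heuristic search described below takes place; the outer loop only alternates between learning and removing covered examples.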
      <p>Certainly, this strategy is only usable for binary data, as a notion of positive
and negative examples is mandatory; but then, if desired, it can guarantee that
every positive example is covered (completeness) and no negative one is covered
(consistency). There are different strategies to convert multi-class datasets to
binary ones; in this paper we used the ordered binarization implemented
in the SeCo framework. Here, the classes of the dataset are ordered
by their class frequency and the smallest class is defined to be the positive one,
whereas the other ones are treated as negative examples. After the number of
rules necessary to cover the smallest class is learned, all examples from it are
removed and the next smallest class is defined to be positive, while again the rest
of the examples are negative. The algorithm proceeds in this manner until all
classes except the largest one are covered. The resulting ruleset is a so-called
decision list: for each example that is to be classified, the rules are tested
from top to bottom, and the first one that covers the example is used for
prediction. If no rule covers the example, a default rule at the end of the list
assigns it to the largest class in the dataset.</p>
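Prediction with such a decision list can be sketched as follows; this is an illustrative sketch in which each rule is represented as a hypothetical predicate function rather than the SeCo rule object.

```python
def predict(decision_list, default_class, example):
    """Classify an example with an ordered ruleset (decision list):
    test the rules from top to bottom and predict with the first rule
    that covers the example; otherwise fall back to the default rule,
    which assigns the largest class in the dataset."""
    for covers, predicted_class in decision_list:
        if covers(example):
            return predicted_class
    return default_class
```

For instance, with two rules testing `size == "small"` and `size == "medium"`, an example with `size == "large"` falls through to the default class.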
      <p>
        A single rule is learned in a top-down fashion, meaning that it is initialized
as an empty rule and conditions are greedily added one by one until no more
negative examples are covered. Then, the best rule encountered during this
process is heuristically determined and returned. Note that this need not be the
last rule, i.e., the one covering no negative examples: consistency is not
assured, in order to avoid overfitting. A heuristic, in one way or another,
maximizes the number of covered positive examples while trying to cover as few negative
ones as possible. The literature shows a wide variety of different heuristics [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For the
experiments conducted in this paper we had to make a selection and chose three
well-known heuristics, namely Accuracy, Laplace Estimate, and m-Estimate,
as defined later. We are aware of the restrictions that come with our selection,
but we are confident that our findings regarding semantic coherence are not
tied to a certain type of heuristic but rather are universally valid.
      </p>
      <p>
        To keep it simple, we used the default algorithm implemented in the SeCo
framework. Namely, the configuration uses a top-down hill-climbing search (a
beam size of one) that refines a rule as long as negative examples are covered.
The learning of rules stops when the best rule covers more negative than positive
examples. The conditions of a rule test for equality (nominal attributes) or
use &lt; and ≥ (numerical attributes). No special pruning or post-processing
of rules is employed. For the m-Estimate, the parameter m was set to 22.466, as
suggested in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Enabling Semantic Coherence</title>
      <p>The key idea of this paper is to enrich the heuristic used for finding the best
condition with a semantic component: in addition to the goal of maximizing
covered positive examples while minimizing covered negative ones, the selected
condition should be as semantically coherent with the rest of the rule as possible.
In essence, we now have two components:
– the classic heuristic, which selects conditions based on statistical properties of the
data, and
– the semantic heuristic, which selects conditions based on their semantic coherence
with previous conditions.</p>
      <p>Hence, the new heuristic WH offers the possibility to trade off between
statistical validity (the classic heuristic CH) and semantic coherence (a semantic
heuristic SH). This is enabled by a parameter α that weights the two objectives:</p>
      <p>WH(Rule) = α · SH(Rule) + (1 − α) · CH(Rule), α ∈ [0, 1] (1)</p>
      <p>A higher value of α gives more weight to semantic coherence, while a value of
α = 0 is equivalent to classic rule learning using only the standard heuristic. We
expect that higher values of α lead to a decrease in predictive accuracy, because
the rule learning algorithm focuses less on the quality of the rule and more on
choosing conditions that are semantically coherent (which are likely not to have
a strong correlation with the rule’s accuracy). At the same time, higher values
of α should lead to more coherent rules.</p>
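Equation (1) translates directly into code. The sketch below uses our own function names, not those of the SeCo framework, and treats CH and SH as arbitrary callables scoring a rule:

```python
def weighted_heuristic(classic_h, semantic_h, alpha):
    """Build WH(rule) = alpha * SH(rule) + (1 - alpha) * CH(rule).
    alpha = 0 reduces to classic rule learning; alpha = 1 scores rules
    by semantic coherence alone."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    def wh(rule):
        return alpha * semantic_h(rule) + (1 - alpha) * classic_h(rule)

    return wh
```

Because WH is just a convex combination, any rule ranking produced with alpha = 0 coincides with the ranking of the classic heuristic alone.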
      <p>When learning rules, the first condition is selected using the classic
heuristic CH only (since a rule with only one condition is always coherent in itself).
Then, while growing the rule, the WH heuristic is used, which leads to
conditions being added that result in both a coherent and an accurate rule, according
to the trade-off specified by α.</p>
      <p>
        <bold>WordNet Similarity.</bold>
        There are different possibilities to measure the semantic relatedness between two
conditions. In this paper, we use an external source of linguistic information, i.e.,
WordNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. WordNet organizes words in so-called synsets, i.e., sets of synonymous
words. Those synsets are linked by hyponym and hypernym relations, among
others. Using those relations, the semantic distance between words in different
synsets can be computed.
      </p>
      <p>In the first step, we map each feature that can be used in a rule to one or
more synsets in WordNet. To do so, we search WordNet for the feature name.
In the following, we consider the case of measuring the semantic coherence
of two features named smartphone vendor and desktop.</p>
      <p>The search for synsets returns a list of synsets, ordered by relevance. The
search result for smartphone vendor is empty {}; the search result for desktop is
{desktop#n#1, desktop#n#2}, where desktop#n#1 describes a tabletop and
desktop#n#2 describes a desktop computer.</p>
      <p>If the list is not empty, we add it to the attribute label’s list of synset lists.
If, on the other hand, the list is empty, we check whether the attribute label is a
compound of multiple tokens and restart the search for each of the individual
tokens. We then add all non-empty synset lists that are returned to the list of
synset lists of the attribute label. The result for smartphone vendor is then
{{smartphone#n#1}, {vendor#n#1}}, while the result for desktop is
{{desktop#n#1, desktop#n#2}}.</p>
      <p>
        In the second step, we calculate the distance between two synsets using the
LIN [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] metric. We chose this metric as it performs well in comparison with
other metrics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and it outputs a score normalized to [0, 1]. (Note that at the moment, we
only use the names of the features to measure semantic coherence, but not the
nominal or numeric feature values that are used to build a condition. The ’n’ in
a synset identifier indicates that the synset describes a noun.)
      </p>
      <p>
        The LIN metric is based on the Information Content (IC) metric [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], a
measure for the particularity of a concept. The IC of a concept c is calculated
as the negative log likelihood, i.e., the negative logarithm
of the probability of encountering concept c:
      </p>
      <p>IC(c) = −log p(c) (2)</p>
      <p>
        Higher values denote more specific, less general concepts, while lower values
denote more abstract, more general concepts. The body of text used for the
calculation of the p(c) values in this work is the SemCor [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] corpus, a collection
of 100 passages from the Brown corpus which were semantically tagged “based
on the WordNet word sense definition” and thus provide the exact frequency
distribution of each synset; this covers roughly 25% of the synsets in WordNet
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>The LIN metric is calculated by dividing the Information Content (IC) of
the least common subsumer (lcs) of the two synsets by the sum of their Information
Content, and multiplying the result by two:</p>
      <p>lin(syn1, syn2) = 2 · IC(lcs) / (IC(syn1) + IC(syn2)) (3)
For each pair of synsets associated with two attributes,
we calculate the LIN metric. In our example, the corresponding values are
lin(smartphone#n#1, desktop#n#1) = 0.0,
lin(smartphone#n#1, desktop#n#2) = 0.5,
lin(vendor#n#1, desktop#n#1) = 0.0, and
lin(vendor#n#1, desktop#n#2) = 0.0.</p>
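Given IC values for two synsets and their least common subsumer, the LIN score defined above is straightforward to compute. The sketch below takes the IC values as plain numbers; in practice they would be derived from SemCor frequency counts:

```python
import math


def information_content(p):
    """IC(c) = -log p(c) for a concept occurring with probability p."""
    return -math.log(p)


def lin(ic_syn1, ic_syn2, ic_lcs):
    """LIN similarity: 2 * IC(lcs) / (IC(syn1) + IC(syn2)).
    Returns 0 when the denominator is zero, mirroring the fallback
    used for unsupported part-of-speech combinations."""
    denominator = ic_syn1 + ic_syn2
    if denominator == 0:
        return 0.0
    return 2 * ic_lcs / denominator
```

Two identical synsets (where the lcs is the synset itself) score 1.0, and synsets whose only common ancestor is the root (IC 0) score 0.0, which matches the normalization to [0, 1].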
      <p>In the third step, we choose the maximum value for each pair of synset lists,
so that we end up with the maximum similarity value per pair of tokens.
The overall semantic similarity of two attributes (att) is then computed as the
mean of those similarities across the tokens t:</p>
      <p>SH(att1, att2) = avg_{t1 ∈ att1, t2 ∈ att2} max_{syn1 ∈ t1, syn2 ∈ t2} lin(syn1, syn2) (4)</p>
      <p>This assigns each word pair the similarity value of the synset combination
that is most similar among all the synset combinations that arise from the two
lists of possible synsets for the two words. Thus, in our example, the SH value
assigned to smartphone vendor and desktop would be 0.25.</p>
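The max-then-average computation described above can be sketched as follows; the `lin` argument is any pairwise synset similarity function, here supplied in the test as a small lookup table reproducing the paper’s example values:

```python
from itertools import product


def attribute_similarity(tokens1, tokens2, lin):
    """SH between two attributes: for each pair of tokens (one from each
    attribute), take the maximum lin score over all synset combinations,
    then average those maxima over all token pairs.
    tokens1/tokens2: one list of candidate synsets per token."""
    pair_maxima = []
    for syns1, syns2 in product(tokens1, tokens2):
        best = max((lin(s1, s2) for s1, s2 in product(syns1, syns2)),
                   default=0.0)
        pair_maxima.append(best)
    return sum(pair_maxima) / len(pair_maxima) if pair_maxima else 0.0
```

With the synset lists and LIN values from the running example, this yields the SH value of 0.25 derived above: max(0.0, 0.5) for smartphone vs. desktop and max(0.0, 0.0) for vendor vs. desktop, averaged.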
      <p>To compute the semantic coherence of a rule, given the pairwise SH scores
for the attributes used in the rule, we use the mean of those pairwise scores to
assign a final score to the rule. (The LIN metric limits the similarity calculation
to synsets of the same part of speech and works only with nouns and verbs; our
implementation returns a similarity value of 0 in all other cases. All experiments
were carried out with minimum and maximum as aggregation functions as well,
but using the mean turned out to give the best results.)</p>
      <p><bold>Evaluation.</bold> We have conducted experiments with different classic heuristics on a number of
datasets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/),
shown in Table 1. The table depicts the overall number of attributes and the
percentage of attributes for which at least one matching synset was found in
WordNet. For the classic heuristic CH, we chose Accuracy, m-Estimate, and
Laplace Estimate, which are defined as follows:</p>
      <p>Accuracy := p − n ≡ (p + (N − n)) / (P + N) (5)</p>
      <p>Laplace Estimate := (p + 1) / (p + n + 2) (6)</p>
      <p>m-Estimate := (p + m · P / (P + N)) / (p + n + m) (7)</p>
      <p>
        where p, n denote the positive/negative examples covered by the rule and
P, N stand for the total numbers of positive/negative examples. Please see [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for more
details on these heuristics.
      </p>
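The three heuristics defined above in executable form, a small sketch with p, n, P, N as defined in the text and m defaulting to the value used in our experiments:

```python
def accuracy(p, n, P, N):
    """Accuracy: (p + (N - n)) / (P + N); ranks rules identically to p - n."""
    return (p + (N - n)) / (P + N)


def laplace_estimate(p, n):
    """Laplace Estimate: (p + 1) / (p + n + 2)."""
    return (p + 1) / (p + n + 2)


def m_estimate(p, n, P, N, m=22.466):
    """m-Estimate: smooths the precision p / (p + n) towards the
    class prior P / (P + N) with strength m."""
    return (p + m * P / (P + N)) / (p + n + m)
```

Note how the Laplace Estimate of an empty rule (p = n = 0) is 0.5, and how the m-Estimate converges to the class prior as m grows.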
      <p>In addition, we used the semantic heuristic SH based on WordNet as defined
above. For each experiment, we report the accuracy (single run of a ten-fold cross
validation) and the average semantic coherence of all the rules in the ruleset
(measured by SH), as well as the average rule length and the overall number of
conditions and rules in the ruleset.</p>
      <p>As datasets, we had to pick some that have attribute labels that carry
semantics, i.e., the attributes have to have meaningful names instead of, e.g., names
from att1 to att20 (which unfortunately is the case for the majority of datasets
in the UCI repository). We searched for datasets where we could map at least
two thirds of the attributes to at least one synset in WordNet. This led to the eight
datasets used for the experiments in this paper, which are listed in Table 1.</p>
      <sec id="sec-3-1">
        <title>Results</title>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Macro average accuracy across the eight datasets for different values of α.</p></caption>
          <table>
            <thead>
              <tr><th>Classic Heuristic</th><th>0.0</th><th>0.1</th><th>0.2</th><th>0.3</th><th>0.4</th><th>0.5</th><th>0.6</th><th>0.7</th><th>0.8</th><th>0.9</th><th>1.0</th></tr>
            </thead>
            <tbody>
              <tr><td>Accuracy</td><td>0.649</td><td>0.667</td><td>0.668</td><td>0.668</td><td>0.669</td><td>0.669</td><td>0.668</td><td>0.668</td><td>0.668</td><td>0.668</td><td>0.465</td></tr>
              <tr><td>m-Estimate</td><td>0.670</td><td>0.673</td><td>0.672</td><td>0.671</td><td>0.671</td><td>0.670</td><td>0.670</td><td>0.673</td><td>0.673</td><td>0.674</td><td>0.474</td></tr>
              <tr><td>Laplace</td><td>0.673</td><td>0.680</td><td>0.679</td><td>0.682</td><td>0.681</td><td>0.680</td><td>0.681</td><td>0.679</td><td>0.679</td><td>0.681</td><td>0.476</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Table 2 shows the macro average accuracy across the eight datasets for different
values of α. It can be observed that, except for α = 1, the accuracy does not
change significantly. This is an encouraging result, as it shows that a weight
of up to 0.9 can be assigned to the semantic heuristic without the learned
model losing much accuracy. How much further the coherence can be enforced
has to be examined by a more detailed inspection of parameter values
between 0.9 and 1.0. Interestingly, the trade-off between coherence and accuracy
seems to occur only at the edge, at very high parameter values. Clearly, a study of
these parameters would yield more insights; however, ensuring such high
coherence without a noticeable effect on accuracy is already a remarkable result
and seems sufficient for our purposes. Only when all weight is assigned to
the semantic heuristic (and none to the classic heuristic) does the accuracy drop
significantly, which is the expected result. In most of these cases, no rules are
learned at all, but only a default rule is created, assigning all examples to the
majority class.</p>
        <p>In Table 3, we report the macro average semantic coherence of the learned
rulesets across the eight datasets. The results have to be seen in context with
Table 2, as our primary goal was to increase semantic coherence while not losing
too much accuracy. Clearly, the higher the value of α, the more semantic
coherence will be achieved anyway, since the heuristic component uses the same
measure for semantic coherence as is reported in the evaluation in Table 3.
However, as confirmation, it can be observed that the semantic coherence is
indeed increased in all cases, although, when using m-Estimate as the classic
heuristic, the increase is not statistically significant. As stated above, for α = 1
no rules are learned in many cases, so the semantic coherence cannot be
computed there.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Exemplary Rules and Ruleset Properties</title>
        <p>Table 4 shows two rules learned for the primary-tumor dataset:
α = 0.0: peritoneum = yes, skin = yes, histologic-type = adeno → class = ovary
α = 0.8: peritoneum = yes, skin = yes, pleura = no, brain = no → class = ovary</p>
        <p>These results support our main claim, i.e., that it is possible to learn more
coherent rules without losing classification accuracy. What is surprising is that
even for α = 0.9, the accuracy does not drop. This may be explained by the
selection of the first condition in a rule, which is picked according to the classic
heuristic only and thus leads to growing a rule that has at least a moderate
accuracy. Furthermore, in many cases there may be a larger number of possible
variants for growing a rule that the learning algorithm can choose from, each
leading to a comparable value according to the classic heuristic, so adding weight
to the semantic heuristic can still lead to a reasonable rule.
The two rules learned for the primary-tumor dataset shown in Table 4
illustrate the difference between rules with and without semantic coherence. Both
rules cover two positive and no negative examples, i.e., according to any classic
heuristic, they are equally good. However, the second one can be considered
semantically more coherent, since three out of four attributes refer to body
parts (skin, pleura, and brain) and are thus semantically related.</p>
        <p>In order to further investigate the influence of the semantic heuristic on
general properties of the learned ruleset, we also looked at the average rule
length, the total number of rules, and the total number of conditions in a ruleset.
The results are depicted in Tables 5 and 6.</p>
        <p>In Table 5 we observe a mostly constant and sometimes increasing number
of rules for all but the last three datasets. This exception to the overall trend is
analyzed more closely for the primary-tumor dataset; the values for this dataset
are depicted in Fig. 1.</p>
        <p>When looking at the rulesets learned on the primary-tumor dataset, it can
be observed that many very specific rules for small classes, covering only a few
examples, are missing when the value of α is increased. A possible explanation
is that as long as there are many examples for a class, there are enough degrees
of freedom for the rule learner to respect semantic coherence. If, on the other
hand, the number of examples drops (e.g., for small classes), it becomes harder
to learn meaningful semantic rules, which leads the rule learner to ignore those
small example sets. Since only a small number of examples is affected by this,
the accuracy remains stable, or even rises slightly, as ignoring those small
sets may reduce the risk of overfitting.</p>
        <p>Note that a similar trend could be observed for the other two datasets
(hepatitis and glass, depicted in the lower part of Table 5). While the changes are
not as pronounced for the m-Estimate, those for the other two heuristics are clearly
significant. Interestingly, most often the rules at the beginning of the decision list
are similar, and at a certain point no rules are learned any more. Thus, similar
to the effect noticeable for the primary-tumor dataset, the subsequent low-coverage
rules are no longer induced.</p>
        <p>However, when looking at the average rule length (cf. Table 6), the only
significant change occurs when all weight is given to the semantic component.
The reason is that most often no rule is learned at all in this case.</p>
        <p>
          <bold>Semantically Coherent Rules in Relation to Characteristic Rules.</bold>
          When we inspected the rulesets and the behavior of our separate-and-conquer
learner in more detail, we found that semantically coherent rules interestingly
have a connection to so-called characteristic rules [
          <xref ref-type="bibr" rid="ref22 ref4">22, 4</xref>
          ]. Where a discriminant
rule tries to use as few conditions as possible with the goal of separating the
examples of a certain class from all the other ones, a characteristic rule has
as many conditions as possible that actually describe the examples at hand.
For instance, if the example to be described were an elephant, a discriminant
rule would concentrate on the few attributes an elephant has and no other
animal shows, such as its trunk, its gray color, or its huge ears. Instead, a
characteristic rule would list all attributes that indicate an elephant, such as four
legs, a tail, thick skin, etc. In essence, a discriminant rule has only conditions
that discriminate elephants from all other animals, whereas a characteristic rule
rather describes the elephant without the need to be discriminant, i.e., to use
only features no other animal has.
        </p>
        <p>Not surprisingly, a semantically coherent rule tends to show the same
properties. Often the induced rules consist of conditions that are not necessarily
important to discriminate the examples, but rather are semantically coherent
with the conditions located at earlier positions in these rules. This becomes
obvious when we take a look at the above example of the two rules, where the rule
without semantic influence has one condition fewer, although both of them have the
same coverage.</p>
        <p>However, the number of rules is strongly dependent on the attributes’
semantics. For most of the datasets where fewer rules are actually induced with our
approach, semantic coherence is hard to measure: the glass dataset contains
descriptions of chemical compounds, in the hepatitis dataset biochemical components are
used as features, and primary-tumor simply has considerably more classes.
A detailed examination of this phenomenon remains subject to future work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        Most of the work concerned with the trade-off between interpretability and
accuracy stems from the fuzzy rules community, where this trade-off is well known
and a number of papers have addressed the problem [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. There are
several ways to deal with it, e.g., (evolutionary) multiobjective
optimization [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], context adaptation, and hierarchical fuzzy modeling, as well as approaches
based on fuzzy partitioning, membership functions, rules, and rule bases. However, most
often the comprehensibility of fuzzy rules is measured by means such as the
transparency of the fuzzy partitions, the number of fuzzy rules and conditions, or the
complexity of reasoning, i.e., the defuzzification and inference mechanisms. As we
use classification rules in this paper, most of these techniques are not applicable.
      </p>
      <p>
        There are also some papers about comprehensibility in general. For example,
[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] deals with dimensionality reduction and with presenting
statistical models in a way that the user can grasp them more easily, e.g., with the help
of graphical representations. The interpretability of different model
classes is discussed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where the advantages and disadvantages of decision trees,
classification rules, decision tables, nearest neighbor, and Bayesian networks are
shown. Arguments are given why using model size on its own is not the best choice
for measuring comprehensibility, and it is demonstrated how
user-given constraints, such as monotonicity constraints, can be incorporated into
the classification model. For a general discussion of comprehensibility this is very
interesting; however, as single conditions of a rule are not compared against each
other, the scope is somewhat different from that of our work.
      </p>
      <p>
        A lot of papers try to induce a ruleset that has high accuracy as well as good
comprehensibility by employing genetic, evolutionary, or ant colony optimization
algorithms. Given the right measure for relating single conditions of a rule, or
even whole rules in a complete ruleset, this seems to be a promising direction;
unfortunately, most of the fitness functions do not take this into account. For
example, in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] an extension of an ant colony algorithm was derived to induce
unordered rulesets. The authors introduced a new measure for the comprehensibility of
rules, namely the prediction-explanation size. In essence, this measure is oriented
towards the actual prediction: it considers the average number of conditions
that have to be checked for predicting the class value. Therefore, not the total
number of conditions or rules is measured, as usual measures often do, but, for an
unordered ruleset, exactly those that are actually used for classifying the example
at hand. For ordered rulesets, rules that come before the classifying rule in the
decision list are also counted, as they have to be checked at prediction time as well.
Other algorithms are capable of multi-target learning [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and define as interesting
those rules that cover examples of infrequent classes in the dataset. Some papers
treat interpretability rather as a side effect [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with no
optimization of this objective during learning time. In contrast, [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] uses
a simple combination of accuracy maximization and size minimization in the
fitness function of the genetic algorithm.
      </p>
      <p>
        Some research is focused on specific problems where consequently rather
unique properties are taken into account [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. In this bioinformatics domain, only
the presence of an attribute (value) is of interest, whereas the absence is of no
concern. The contributions are two new versions of CN2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Ant-Miner [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]
which are able to incorporate this constraint.
      </p>
      <p>
        Another thread is concerned with the measures themselves. For example, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
surveyed objective (data-driven) measures of interestingness and defined a new
one, namely attribute surprisingness AttSurp: arguing that a user
is mostly interested in a rule that has high prediction performance but many
single attributes with a low information gain, the authors define AttSurp as
one divided by the information gain of all attributes in the rule. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] it is
argued that small disjuncts (i.e., rules that cover only a very small number of
positive examples) are indeed surprising, while most often not offering good
generalization or predictive quality. Here, AttSurp is also used; it differs from
most other interestingness measures in that not the whole rule body
is taken into account but single attributes, which one can also see as a property of
our algorithm. Interestingly, surprisingness is also related to Simpson’s Paradox.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we have examined an approach to increase the understandability
of a rule model by learning rules that are in themselves semantically coherent.
To do so, we have introduced a method for combining classic heuristics, tailored
to learning correct rule models, with semantic heuristics, tailored to learning
coherent rules. While we have only looked at the coherence of single rules, adding
means to control the coherence across a set of rules would be an interesting
extension for future work.</p>
      <p>An experiment with eight datasets from the UCI repository has shown that
it is possible to learn rules that are significantly more coherent, while not being
significantly less accurate. In fact, the accuracy of the learned model remained
constant in all cases, even when raising the influence of the semantic heuristic to
90% of the overall heuristic. These results show that, even at a very preliminary
stage, the proposed approach actually works.</p>
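      <p>
        Such a combination can be sketched as a convex mixture of a classic and a semantic heuristic. In the following sketch, precision stands in for the classic heuristic, and the weight alpha (0.9 corresponds to the 90% setting above) controls the semantic influence; the names and exact weighting scheme are illustrative assumptions, not our verbatim implementation:
      </p>

```python
def precision(p, n):
    """Classic coverage-based heuristic: fraction of covered
    examples that are positive (p positives, n negatives)."""
    return p / (p + n) if p + n else 0.0

def combined_heuristic(p, n, coherence, alpha=0.9):
    """Convex combination of a classic heuristic and a semantic
    coherence score in [0, 1]; alpha is the semantic weight."""
    return (1 - alpha) * precision(p, n) + alpha * coherence
```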
      <p>Furthermore, we have observed that in some cases, adding the semantic
heuristic may lead to more compact rule sets, which are still as accurate as
the original ones. Although we have a possible explanation, i.e., that it is
difficult for semantically enhanced heuristics to learn rules for small sets of examples,
we do not have statistically significant results here. An evaluation with synthetic
datasets may lead to more insights into the characteristics of datasets for which
this property holds, and help us to confirm or reject that hypothesis.</p>
      <p>
        Although we have evidence from previous research that semantically
coherent rules are perceived as more understandable, e.g., in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], we would like to
strengthen that argument with additional user studies. These may also help
reveal other characteristics a rule set should have beyond coherence, e.g., a minimum
or maximum length. For example, the experiments in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] have indicated that
less accurate rules (e.g., Countries with a high HDI are less corrupt) are
preferred over more accurate ones (e.g., Countries with an HDI higher than 6.243
are less corrupt).
      </p>
      <p>
        In this paper, we have only looked into one method of measuring semantic
coherence, i.e., a similarity metric based on WordNet. There are more possible
WordNet-based metrics, e.g., the Lesk [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the HSO [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] metrics, which
both work with adjectives and adverbs in addition to nouns and verbs, and
support arbitrary pairings of the POS classes. Furthermore, there are a number
of alternatives beyond WordNet, e.g., the use of Wikipedia [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] or a web search
engine [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Furthermore, in the realm of Linked Open Data, there are various
means to determine the relatedness of two concepts [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
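      <p>
        Whichever metric is chosen, the semantic coherence of a single rule can be scored as the mean pairwise similarity of the attributes in its body. A minimal sketch, where sim is a stand-in for any of the metrics discussed above:
      </p>

```python
from itertools import combinations

def rule_coherence(attributes, sim):
    """Mean pairwise semantic similarity of a rule's attributes.
    sim(a, b) is assumed to return a similarity in [0, 1]."""
    pairs = list(combinations(attributes, 2))
    if not pairs:
        return 1.0  # a single-condition rule is trivially coherent
    return sum(sim(a, b) for a, b in pairs) / len(pairs)
```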
      <p>So far, the approach uses only the classic heuristic to select the first
condition, which sometimes leads to rules that are not very coherent w.r.t. that
attribute, e.g., if no other attributes match the first one well semantically. Here,
it may help to introduce a semantic part in the selection of the first condition
as well, e.g., the average semantic distance of all other attributes to the one
selected. However, the impact of that variation on accuracy has to be carefully
investigated.</p>
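      <p>
        This variation can be sketched as follows: pick as the first condition's attribute the one whose average semantic similarity to all remaining candidates is highest. Again, sim stands in for a concrete WordNet-based metric, and the code is an illustrative assumption rather than our implementation:
      </p>

```python
def avg_similarity(attr, others, sim):
    """Mean semantic similarity of attr to the other candidates."""
    return sum(sim(attr, o) for o in others) / len(others)

def pick_first_attribute(attrs, sim):
    """Choose the attribute with the highest average similarity
    to all other candidate attributes (a sketch, not the paper's
    implementation)."""
    def score(a):
        return avg_similarity(a, [o for o in attrs if o != a], sim)
    return max(attrs, key=score)
```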
      <p>Another possible point for improvement is the selection of the final rule from
one refinement process. So far, we use the same combined heuristic for the
refinement and the selection, but it might make sense to use a different weight
here, or even entirely remove the semantic heuristic from that step, since the
coherence has already been assured by the selection of the conditions.</p>
      <p>In summary, we have introduced an approach that is able to explicitly trade
off semantic coherence and accuracy in rule learning, and we have shown that
it is possible to learn more coherent rules without losing accuracy. However, it
remains an open question whether or not our results are generalizable to other
types of rule learning algorithms that do not rely on a separate-and-conquer
strategy. We will inspect the impact on other rule learners in the near future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet</article-title>
          . In:
          <article-title>Computational linguistics and intelligent text processing</article-title>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>145</lpage>
          . Springer Berlin Heidelberg (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bojarczuk</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Discovering comprehensible classification rules by using genetic programming: a case study in a medical domain</article-title>
          . In:
          <string-name><surname>Banzhaf</surname>, <given-names>W.</given-names></string-name>,
          <string-name><surname>Daida</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Eiben</surname>, <given-names>A.E.</given-names></string-name>,
          <string-name><surname>Garzon</surname>, <given-names>M.H.</given-names></string-name>,
          <string-name><surname>Honavar</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Jakiela</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Smith</surname>, <given-names>R.E.</given-names></string-name>
          (eds.)
          <source>Proceedings of the Genetic and Evolutionary Computation Conference</source>
          . vol.
          <volume>2</volume>
          , pp.
          <fpage>953</fpage>
          -
          <lpage>958</lpage>
          . Morgan Kaufmann, Orlando, Florida, USA (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Budanitsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Evaluating WordNet-based Measures of Lexical Semantic Relatedness</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>32</volume>
          (
          <issue>1</issue>
          ),
          <fpage>13</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Attribute-oriented induction in relational databases</article-title>
          .
          <source>In: Knowledge Discovery in Databases</source>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>228</lpage>
          . AAAI/MIT Press (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cilibrasi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitányi</surname>
            ,
            <given-names>P.M.B.</given-names>
          </string-name>
          :
          <article-title>The Google similarity distance</article-title>
          .
          <source>CoRR abs/cs/0412098</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niblett</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The CN2 Induction Algorithm</article-title>
          .
          <source>Machine Learning</source>
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <fpage>261</fpage>
          -
          <lpage>283</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Falco</surname>
            ,
            <given-names>I.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cioppa</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarantino</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Discovering interesting classification rules with genetic programming</article-title>
          .
          <source>Applied Soft Computing</source>
          <volume>1</volume>
          (
          <issue>4</issue>
          ),
          <fpage>257</fpage>
          -
          <lpage>269</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : WordNet. Wiley Online Library (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On rule interestingness measures</article-title>
          .
          <source>Knowledge-Based Systems</source>
          <volume>12</volume>
          (
          <issue>56</issue>
          ),
          <fpage>309</fpage>
          -
          <lpage>315</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Comprehensible classification models: A position paper</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>Mar 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>On objective measures of rule surprisingness</article-title>
          .
          <source>In: Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . PKDD '
          <volume>98</volume>
          , Springer-Verlag, London, UK, UK (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Fürnkranz, J.:
          <article-title>Separate-and-Conquer Rule Learning</article-title>
          .
          <source>Artificial Intelligence Review</source>
          <volume>13</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>54</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Fürnkranz, J.,
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>ROC 'n' Rule Learning - Towards a Better Understanding of Covering Algorithms</article-title>
          .
          <source>Machine Learning</source>
          <volume>58</volume>
          (
          <issue>1</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>77</lpage>
          (
          <year>January 2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Fürnkranz, J.,
          <string-name>
            <surname>Gamberger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Lavrač, N.:
          <source>Foundations of Rule Learning</source>
          . Springer Berlin Heidelberg (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>St-Onge</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms</article-title>
          . In: Fellbaum,
          <string-name>
            <surname>C</surname>
          </string-name>
          . (ed.)
          <source>WordNet: An Electronic Lexical Database</source>
          , pp.
          <fpage>305</fpage>
          -
          <lpage>332</lpage>
          . MIT Press (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ishibuchi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nojima</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning</article-title>
          .
          <source>International Journal of Approximate Reasoning</source>
          <volume>44</volume>
          (
          <issue>1</issue>
          ),
          <fpage>4</fpage>
          -
          <lpage>31</lpage>
          (
          <year>Jan 2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Janssen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , Fürnkranz, J.:
          <article-title>On the quest for optimal rule learning heuristics</article-title>
          .
          <source>Machine Learning</source>
          <volume>78</volume>
          (
          <issue>3</issue>
          ),
          <fpage>343</fpage>
          -
          <lpage>379</lpage>
          (
          <year>Mar 2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Janssen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zopf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The SeCo-framework for rule learning</article-title>
          .
          <source>In: Proceedings of the German Workshop on Lernen, Wissen, Adaptivität (LWA 2012)</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conrath</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy</article-title>
          .
          <source>In: Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X)</source>
          . pp.
          <fpage>19</fpage>
          -
          <lpage>33</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kintsch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Dijk</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          :
          <article-title>Toward a model of text comprehension and production</article-title>
          .
          <source>Psychological Review</source>
          <volume>85</volume>
          (
          <issue>5</issue>
          ),
          <fpage>363</fpage>
          (
          <year>1978</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>An Information-Theoretic Definition of Similarity</article-title>
          . In: ICML. pp.
          <fpage>296</fpage>
          -
          <lpage>304</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. Michalski, R.S.:
          <article-title>A theory and methodology of inductive learning</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>20</volume>
          (
          <issue>2</issue>
          ),
          <fpage>111</fpage>
          -
          <lpage>162</lpage>
          (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leacock</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tengi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunker</surname>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          :
          <article-title>A Semantic Concordance</article-title>
          .
          <source>In: Proceedings of the workshop on Human Language Technology</source>
          . pp.
          <fpage>303</fpage>
          -
          <lpage>308</lpage>
          . Association for Computational Linguistics, Morristown, NJ, USA (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Noda</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Discovering interesting prediction rules with a genetic algorithm</article-title>
          .
          <source>In: Proceedings of the 1999 Congress on Evolutionary Computation</source>
          . pp.
          <fpage>1322</fpage>
          -
          <lpage>1329</lpage>
          . IEEE (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Otero</surname>
            ,
            <given-names>F.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Improving the interpretability of classification rules discovered by an ant colony algorithm</article-title>
          .
          <source>In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation</source>
          . pp.
          <fpage>73</fpage>
          -
          <lpage>80</lpage>
          . GECCO '13, ACM, New York, NY, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Parpinelli</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Data mining with an ant colony optimization algorithm</article-title>
          .
          <source>IEEE Transactions on Evolutionary Computation</source>
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <fpage>321</fpage>
          -
          <lpage>332</lpage>
          (
          <year>August 2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Generating possible interpretations for statistics from linked open data</article-title>
          .
          <source>In: 9th Extended Semantic Web Conference (ESWC)</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>DBpediaNYD - a silver standard benchmark dataset for semantic relatedness in DBpedia</article-title>
          . In: Workshop on NLP &amp; DBpedia (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Resnik</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Using Information Content to Evaluate Semantic Similarity in a Taxonomy</article-title>
          .
          <source>In: Proceedings of the 14th International Joint Conference on Artificial Intelligence</source>
          . vol.
          <volume>1</volume>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Shukla</surname>
            ,
            <given-names>P.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tripathi</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          :
          <article-title>A survey on interpretability-accuracy (i-a) trade-off in evolutionary fuzzy systems</article-title>
          . In:
          <string-name><surname>Watada</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Chung</surname>, <given-names>P.C.</given-names></string-name>,
          <string-name><surname>Lin</surname>, <given-names>J.M.</given-names></string-name>,
          <string-name><surname>Shieh</surname>, <given-names>C.S.</given-names></string-name>,
          <string-name><surname>Pan</surname>, <given-names>J.S.</given-names></string-name>
          (eds.) 5th
          <source>International Conference on Genetic and Evolutionary Computing</source>
          . pp.
          <fpage>97</fpage>
          -
          <lpage>101</lpage>
          . IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Smaldon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Improving the interpretability of classification rules in sparse bioinformatics datasets</article-title>
          .
          <source>In: Proceedings of AI-2006, the Twenty-sixth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence</source>
          . pp.
          <fpage>377</fpage>
          -
          <lpage>381</lpage>
          . Research and Development in Intelligent Systems XXIII
          , Springer London (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Strube</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponzetto</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          :
          <article-title>WikiRelate! Computing semantic relatedness using Wikipedia</article-title>
          .
          <source>In: Proceedings of the 21st National Conference on Artificial Intelligence</source>
          . pp.
          <fpage>1419</fpage>
          -
          <lpage>1424</lpage>
          . AAAI Press (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Vellido</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Guerrero</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lisboa</surname>
            ,
            <given-names>P.J.G.</given-names>
          </string-name>
          :
          <article-title>Making machine learning models interpretable</article-title>
          .
          <source>In: 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>