<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuliano Armano</string-name>
          <email>armano@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fanni</string-name>
          <email>francesca.fanni@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Giuliani</string-name>
          <email>alessandro.giuliani@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Electrical and Electronic Engineering, University of Cagliari</institution>
          ,
          <addr-line>Piazza d'Armi, I-09123 Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Performance metrics are used in various stages of the process aimed at solving a classification problem. Unfortunately, most of these metrics are in fact biased, meaning that they strictly depend on the class ratio, i.e., on the imbalance between negative and positive samples. After pointing to the source of bias for the most acknowledged metrics, novel unbiased metrics are defined, able to capture the concepts of discriminant and characteristic capability. The combined use of these metrics can give important information to researchers involved in machine learning or pattern recognition tasks, such as classifier performance assessment and feature selection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Several metrics are used in pattern recognition and machine learning in various tasks
concerning classifier building and assessment. An important category of these metrics
is related to confusion matrices. Accuracy, precision, sensitivity (also called recall) and
specificity are all relevant examples [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of metrics that belong to this category. As none
of the above metrics is able to give information about the process under assessment
in isolation, two different strategies have been adopted so far for assessing classifier
performance or feature importance: i) devising single metrics on top of other ones and
ii) identifying proper pairs of metrics able to capture the wanted information. The
former strategy is exemplified by F1 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and MCC (Matthews Correlation Coefficient) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which are commonly used in the process of model building and assessment. Typical
examples of the latter strategy are sensitivity vs. specificity diagrams, which allow
relevant information (e.g., ROC curves [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) to be drawn in a Cartesian space. Unfortunately,
regardless of the strategy adopted, most of the existing metrics are in fact
biased, meaning that they strictly depend on the class ratio, i.e., on the imbalance
between positive and negative samples. The adoption of biased metrics can only
be recommended when the statistics of the input data are available. In the event one wants
to assess the intrinsic properties of a classifier, or other relevant aspects in the process
of classifier building and evaluation, the adoption of biased metrics does not appear to be a
reliable choice. For this reason, some proposals have been made in the literature to
introduce unbiased metrics (see in particular the work of Flach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). In this paper, a pair of
unbiased metrics is proposed, able to capture the concepts of discriminant and
characteristic capability. The former is expected to measure to what extent positive samples
can be separated from the negative ones, whereas the latter is expected to measure to
what extent positive and negative samples can be grouped together. After giving
pragmatic definitions of these metrics, their semantics is discussed for binary classifiers and
binary features. An analysis focusing on the combined use of the corresponding metrics
in the form of Cartesian diagrams is also made.
      </p>
      <p>The remainder of the paper is organized as follows: after introducing the concept of
normalized confusion matrix, obtained by applying Bayes decomposition to any given
confusion matrix, in Section 2 a brief analysis of the most acknowledged metrics is
performed, pointing out that most of them are in fact biased. Section 3 introduces novel
metrics devised to measure the discriminant and characteristic capability of binary
classifiers or binary features. Section 4 reports experiments aimed at pointing out the
potential of Cartesian diagrams drawn using the proposed metrics. Section 5 highlights the
strengths and weaknesses of this paper and Section 6 draws conclusions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>As the concept of confusion matrix is central in this paper, let us preliminarily illustrate
the notation adopted for its components (also because the adopted notation slightly
differs from the most acknowledged one). When used for classifier assessment, the generic
element ξ<sub>ij</sub> of a confusion matrix Ξ accounts for the number of samples that satisfy the
property specified by the subscripts. Limiting our attention to binary problems, in which
samples are described by binary features, let us assume that 1 and 0 identify the
presence and the absence of a property.</p>
      <p>In particular, let us denote with Ξ<sub>c</sub>(P, N) the confusion matrix of a run in which
a classifier ĉ, trained on a category c, is fed with P positive samples and N negative
samples (with a total of M samples). With X̂<sub>c</sub> and X<sub>c</sub> random variables that account
for the output of classifier and oracle, respectively, the joint probability p(X<sub>c</sub>, X̂<sub>c</sub>) is proportional,
through M, to the expected value of Ξ<sub>c</sub>(P, N).</p>
      <p>Assuming statistical significance, the confusion matrix obtained from a single test
(or, better, averaged over multiple tests in which the values for P and N are left
unchanged) gives us reliable information on the performance of the classifier. In symbols:
<disp-formula><label>(1)</label><tex-math><![CDATA[
\Xi_c(P, N) \approx M \cdot p(X_c, \widehat{X}_c) = M \cdot p(X_c) \cdot p(\widehat{X}_c \mid X_c)
]]></tex-math></disp-formula>
In so doing, we assume that the transformation performed by ĉ can be isolated from the
inputs it processes, at least from a statistical perspective. Hence, the confusion matrix
for a given set of inputs can be written as the product between a term that accounts for
the number of positive and negative instances, on one hand, and a term that represents
the expected recognition / error rate of ĉ, on the other hand. In symbols:
<disp-formula><label>(2)</label><tex-math><![CDATA[
\Xi_c(P, N) = M \cdot
\underbrace{\begin{bmatrix} \omega_{00} & \omega_{01} \\ \omega_{10} & \omega_{11} \end{bmatrix}}_{\Omega(c) \,\approx\, p(X_c, \widehat{X}_c)}
= M \cdot
\underbrace{\begin{bmatrix} n & 0 \\ 0 & p \end{bmatrix}}_{O(c) \,\approx\, p(X_c)}
\cdot
\underbrace{\begin{bmatrix} \gamma_{00} & \gamma_{01} \\ \gamma_{10} & \gamma_{11} \end{bmatrix}}_{\Gamma(c) \,\approx\, p(\widehat{X}_c \mid X_c)}
]]></tex-math></disp-formula>
where:
<list list-type="bullet">
<list-item><p>ω<sub>ij</sub> ≈ p(X<sub>c</sub> = i, X̂<sub>c</sub> = j), with i, j = 0, 1, denotes the joint occurrence of correct
classifications (i = j) or misclassifications (i ≠ j). According to the total probability law,
Σ<sub>ij</sub> ω<sub>ij</sub> = 1.</p></list-item>
<list-item><p>p is the fraction of positive samples and n is the fraction of negative samples.</p></list-item>
<list-item><p>γ<sub>ij</sub> ≈ p(X̂<sub>c</sub> = j | X<sub>c</sub> = i), with i, j = 0, 1, denotes the fraction of inputs that have been
correctly classified (i = j) or misclassified (i ≠ j) by X̂<sub>c</sub>. γ<sub>00</sub>, γ<sub>01</sub>, γ<sub>10</sub>, and γ<sub>11</sub>
respectively denote the rate of true negatives, false positives, false negatives, and true
positives. According to the total probability law, γ<sub>00</sub> + γ<sub>01</sub> = γ<sub>10</sub> + γ<sub>11</sub> = 1.</p></list-item>
</list>
An estimate of the conditional probability p(X̂<sub>c</sub> | X<sub>c</sub>) for a classifier ĉ that accounts for a
category c will be called normalized confusion matrix hereinafter.</p>
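      <p>The decomposition of Equation (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the sample counts are hypothetical, rows are assumed to index the actual class and columns the predicted class (0 = negative, 1 = positive), and both classes are assumed to be non-empty.</p>
      <preformat>
```python
def normalize_confusion(counts):
    """Split a 2x2 confusion matrix of counts into (M, (n, p), gamma).

    counts[i][j] is the number of samples whose actual class is i and
    whose predicted class is j (0 = negative, 1 = positive).
    """
    neg_total = sum(counts[0])  # N: actual negatives
    pos_total = sum(counts[1])  # P: actual positives
    m = neg_total + pos_total   # M: all samples
    n, p = neg_total / m, pos_total / m
    # Row-normalize: gamma[i][j] estimates p(predicted = j | actual = i).
    gamma = [[c / sum(row) for c in row] for row in counts]
    return m, (n, p), gamma


def rebuild_counts(m, priors, gamma):
    """Recompose the counts as M * diag(n, p) * gamma."""
    return [[m * priors[i] * gamma[i][j] for j in range(2)] for i in range(2)]


counts = [[80, 20], [5, 45]]  # hypothetical run: N = 100, P = 50
m, priors, gamma = normalize_confusion(counts)
rebuilt = rebuild_counts(m, priors, gamma)
```
      </preformat>
      <p>Here gamma plays the role of the normalized confusion matrix, so gamma[0][1] and gamma[1][1] are the false and true positive rates; rebuilt matches counts up to floating-point rounding.</p>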
      <p>The separation between inputs and the intrinsic behavior of a classifier reported
in Equation (2) suggests an interpretation that recalls the concept of transfer function,
where a set of inputs is applied to ĉ. In fact, Equation (2) highlights the separation of the
optimal behavior of a classifier from the deterioration introduced by its actual filtering
capabilities. In particular, O ≈ p(X<sub>c</sub>) represents the optimal behavior obtainable when
ĉ acts as an oracle, whereas Γ ≈ p(X̂<sub>c</sub> | X<sub>c</sub>) represents the expected deterioration caused
by the actual characteristics of the classifier. Hence, under the assumption of statistical
significance of experimental results, any confusion matrix can be decomposed into
optimal behavior and expected deterioration using Bayes' theorem.</p>
      <p>A different interpretation holds for confusion matrix subscripts when they are used
to investigate binary features. In this case i still denotes the actual category, whereas
j denotes the truth value of the binary feature (with 0 and 1 made equivalent to false
and true, respectively). However, as a binary feature can always be thought of as a very
simple classifier whose classification output reflects the truth value of the feature in the
given samples, all definitions and comments concerning classifiers can be applied to
binary features as well.</p>
      <p>Let us now examine the most acknowledged metrics deemed useful for pattern
recognition and machine learning according to the above perspective. The classical
definitions for accuracy (a), precision (π), and recall (ρ) can be given in terms of false
positive rate (fp), true positive rate (tp), and class ratio (σ, the imbalance between
negative and positive samples) as follows:
<disp-formula><label>(3)</label><tex-math><![CDATA[
\begin{aligned}
a &= \frac{\operatorname{trace}(\Omega)}{|\Omega|} = \omega_{00} + \omega_{11}
   = \frac{\sigma\,(1 - \gamma_{01}) + \gamma_{11}}{\sigma + 1}
   = \frac{\sigma\,(1 - fp) + tp}{\sigma + 1} \\
\pi &= \frac{\omega_{11}}{\omega_{01} + \omega_{11}}
   = \left(1 + \sigma\,\frac{\gamma_{01}}{\gamma_{11}}\right)^{-1}
   = \left(1 + \sigma\,\frac{fp}{tp}\right)^{-1} \\
\rho &= \frac{\omega_{11}}{\omega_{11} + \omega_{10}} = \gamma_{11} = tp
\end{aligned}
]]></tex-math></disp-formula>
Equation (3) highlights the dependence of accuracy and precision on the class
ratio, only recall being unbiased. Note that the expression concerning accuracy has been
obtained taking into account that p + n = 1 and σ = n / p imply p = 1/(σ + 1) and n = σ/(σ + 1).</p>
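      <p>The dependence on the class ratio can be checked numerically. The sketch below assumes the closed forms of Equation (3), with sigma standing for the ratio between negative and positive samples; the function names and the values of fp and tp are hypothetical.</p>
      <preformat>
```python
def accuracy(fp, tp, sigma):
    """Accuracy as a function of fp, tp and the class ratio sigma = n / p."""
    return (sigma * (1.0 - fp) + tp) / (sigma + 1.0)


def precision(fp, tp, sigma):
    """Precision as a function of fp, tp and sigma (requires tp > 0)."""
    return 1.0 / (1.0 + sigma * fp / tp)


def recall(fp, tp, sigma):
    """Recall equals tp; sigma is accepted only to show it has no effect."""
    return tp


fp, tp = 0.1, 0.8  # hypothetical classifier
for sigma in (1.0, 10.0, 100.0):
    print(sigma, round(accuracy(fp, tp, sigma), 3),
          round(precision(fp, tp, sigma), 3), recall(fp, tp, sigma))
```
      </preformat>
      <p>With these rates, moving from a balanced test set (sigma = 1) to a heavily imbalanced one (sigma = 100) takes precision from roughly 0.89 down to roughly 0.07, while recall stays at 0.8.</p>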
      <p>As pointed out, when the goal is to assess the intrinsic properties of a classifier or
a feature, biased metrics do not appear to be a proper choice, leaving room for alternative
definitions aimed at dealing with the imbalance between negative and positive samples.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Flach gave definitions of some unbiased metrics starting from classical ones.
In practice, unbiased metrics can be obtained from classical ones by setting the
imbalance σ to 1. In the following, if needed, unbiased metrics will be denoted using the
subscript u.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Definition of Novel Metrics</title>
      <p>To our knowledge, no satisfactory definitions have been given so far that account
for the need of capturing the potential of a model according to its discriminant and
characteristic capability. With the goal of filling this gap, let us spend a few words on the
expected behavior of any metric intended to measure them. Without loss of generality,
let us assume the metrics to be defined in [−1, +1]. As for the discriminant capability,
we expect its value to be close to +1 when a classifier or feature partitions a given set
of samples in strong accordance with the corresponding class labels. Conversely, the
metric is expected to be close to −1 when the partitioning occurs in strong discordance
with the class labels. As for the characteristic capability, we expect its value to be close
to +1 when a classifier or feature tends to cluster most of the samples as if they were
in fact belonging to the main category. Conversely, the metric is expected to be close
to −1 when most of the samples are clustered as belonging to the alternate category.<sup>1</sup>
An immediate consequence of the desired behavior is that the above properties are not
independent. In other words, regardless of their definition, the metrics devised to
measure discriminant and characteristic capability of a classifier or feature (say δ and
φ, hereinafter) are expected to show an orthogonal behavior. In particular, when the
absolute value of one metric is about 1, the other should be close to 0.</p>
      <p>Let us now characterize δ and φ in more detail, focusing on classifiers only
(similar considerations can also be made for features):
<list list-type="bullet">
<list-item><p>fp ≈ 0 and tp ≈ 1: we expect δ ≈ +1 and φ ≈ 0, meaning that the classifier is
able to partition the samples almost in complete accordance with the class labels.</p></list-item>
<list-item><p>fp ≈ 1 and tp ≈ 1: we expect δ ≈ 0 and φ ≈ +1, meaning that almost all samples
are recognized as belonging to the main class label.</p></list-item>
<list-item><p>fp ≈ 0 and tp ≈ 0: we expect δ ≈ 0 and φ ≈ −1, meaning that almost all samples
are recognized as belonging to the alternate class label.</p></list-item>
<list-item><p>fp ≈ 1 and tp ≈ 0: we expect δ ≈ −1 and φ ≈ 0, meaning that the classifier is
able to partition the domain space almost in complete discordance with the class
labels (however, this ability can still be used for classification purposes by simply
turning the classifier output into its opposite).</p></list-item>
</list></p>
      <p>The determinant of the normalized confusion matrix is the starting point for giving
proper definitions of δ and φ able to satisfy the constraints and boundary conditions
discussed above.<fn><label>1</label><p>It is worth noting that the definition of characteristic capability proposed in this paper is
in partial disagreement with the classical concept of “characteristic property” acknowledged by
most machine learning and pattern recognition researchers. The classical definition only
focuses on samples that belong to the main class, whereas the conceptualization adopted in this
paper applies to all samples. The motivation of this choice should become clearer later on.</p></fn>
With ρ = γ<sub>11</sub> = tp denoting sensitivity and ρ̄ = γ<sub>00</sub> = 1 − fp denoting specificity, the determinant can be rewritten as follows:
<disp-formula><label>(4)</label><tex-math><![CDATA[
\begin{aligned}
\Delta &= \gamma_{00}\,\gamma_{11} - \gamma_{01}\,\gamma_{10}
        = \gamma_{00}\,\gamma_{11} - (1 - \gamma_{00})\,(1 - \gamma_{11}) \\
       &= \gamma_{00}\,\gamma_{11} - 1 + \gamma_{11} + \gamma_{00} - \gamma_{00}\,\gamma_{11}
        = \gamma_{11} + \gamma_{00} - 1 \\
       &= \rho + \bar{\rho} - 1 = tp - fp
\end{aligned}
]]></tex-math></disp-formula>
When Δ = 0, the classifier under assessment has no discriminant capability, whereas
Δ = +1 and Δ = −1 correspond to the highest discriminant capability, from the positive
and negative side, respectively. It is clear that the simplest definition of δ is to make it
coincide with Δ, as the latter has all the properties required of a discriminant
capability metric.</p>
      <p>As for φ, considering the definition of δ and the constraints that must apply to a
metric intended to measure the characteristic capability, the following definition appears
appropriate, being actually dual with respect to δ also from a syntactic point of view:
<disp-formula><label>(5)</label><tex-math><![CDATA[
\varphi = \rho - \bar{\rho} = tp + fp - 1
]]></tex-math></disp-formula></p>
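      <p>As a quick check of the two definitions, the following sketch evaluates δ = tp − fp and φ = tp + fp − 1 at the four corners of the (fp, tp) unit square and compares δ with the determinant of a normalized confusion matrix. Function names and the sample matrix are hypothetical.</p>
      <preformat>
```python
def delta(fp, tp):
    """Discriminant capability: rho + rho_bar - 1 = tp - fp."""
    return tp - fp


def phi(fp, tp):
    """Characteristic capability: rho - rho_bar = tp + fp - 1."""
    return tp + fp - 1.0


def determinant(gamma):
    """Determinant of a 2x2 normalized confusion matrix."""
    return gamma[0][0] * gamma[1][1] - gamma[0][1] * gamma[1][0]


# Corner cases of the (fp, tp) unit square, as (fp, tp) -> (delta, phi).
corners = {(fp, tp): (delta(fp, tp), phi(fp, tp))
           for fp in (0.0, 1.0) for tp in (0.0, 1.0)}

gamma = [[0.8, 0.2], [0.1, 0.9]]  # hypothetical: fp = 0.2, tp = 0.9
```
      </preformat>
      <p>The four entries of corners reproduce the boundary conditions listed for classifiers: whenever one of the two metrics reaches ±1, the other is 0.</p>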
      <p>The two measures can be taken in combination for investigating properties of
classifiers or features. The run of a classifier over a specific test set, different runs of a
classifier over multiple test sets, and the statistics about the presence/absence of a feature on
a specific dataset are all examples of potential use cases. However, while reporting
information about classifier or feature properties in φ-δ diagrams, one should be aware
that the φ-δ space is constrained by a rhomboidal shape. This shape depends on the
constraints that apply to δ, φ, tp, and fp.</p>
      <p>In particular, as δ = tp − fp and φ = tp + fp − 1, the following relations hold:
<disp-formula><label>(6)</label><tex-math><![CDATA[
\delta = -\varphi + (2\,tp - 1) = +\varphi + (-2\,fp + 1)
]]></tex-math></disp-formula>
Considering fp and tp as parameters, we can easily draw the corresponding isometric
curves in the φ-δ space. Figure 2 shows their behavior for tp = {0, 0.5, 1} and for
fp = {0, 0.5, 1}.</p>
      <p>As the definitions of δ and φ are given as linear transformations over tp and fp, it
is not surprising that the isometric curves of fp and tp drawn in the φ-δ space are
again straight lines.</p>
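      <p>The rhomboidal constraint and the two forms of Equation (6) can be verified on a grid of (fp, tp) values. The sketch below is illustrative only; it relies on the observation that δ = tp − fp and φ = tp + fp − 1 give |φ| + |δ| = max(|2 tp − 1|, |2 fp − 1|), which cannot exceed 1. Function and variable names are hypothetical.</p>
      <preformat>
```python
def to_phi_delta(fp, tp):
    """Map a point of the (fp, tp) unit square to (phi, delta)."""
    return tp + fp - 1.0, tp - fp


def delta_on_tp_isometric(phi, tp):
    """delta along the line of constant tp: delta = -phi + (2 tp - 1)."""
    return -phi + (2.0 * tp - 1.0)


def delta_on_fp_isometric(phi, fp):
    """delta along the line of constant fp: delta = +phi + (1 - 2 fp)."""
    return phi + (1.0 - 2.0 * fp)


grid = [i / 10.0 for i in range(11)]
points = [(fp, tp) + to_phi_delta(fp, tp) for fp in grid for tp in grid]

# Largest value of |phi| + |delta| over the grid; the rhombus bound is 1.
max_l1 = max(abs(phi) + abs(delta) for _, _, phi, delta in points)

# Largest disagreement between Equation (6) and the direct definition.
max_gap = max(
    max(abs(delta - delta_on_tp_isometric(phi, tp)),
        abs(delta - delta_on_fp_isometric(phi, fp)))
    for fp, tp, phi, delta in points)
```
      </preformat>
      <p>Lines of constant tp have slope −1 and lines of constant fp have slope +1 in the φ-δ plane, which is why the admissible region is a rhombus.</p>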
      <p>[Figure: φ-δ space. The rhombus centered in (0,0) delimits the area of …]</p>
      <p>
Semantics of the φ-δ space for classifiers. As for binary classifiers, their discriminant
capability is strictly related to the unbiased accuracy, which in turn can be given in
terms of the unbiased error (say e<sub>u</sub>). The following equivalences make explicit the relation
between a<sub>u</sub>, e<sub>u</sub> and δ:
<disp-formula><label>(7)</label><tex-math><![CDATA[
a_u = \frac{tn + tp}{2} = \frac{1 + \delta}{2} = 1 - \frac{1 - \delta}{2} = 1 - \frac{fp + fn}{2} = 1 - e_u
]]></tex-math></disp-formula>
It is worth pointing out that the actual discriminant capability of a classifier is not a
redefinition of accuracy (or error), as a classifier may still have high discriminant
capability also in presence of high unbiased error. Indeed, as already pointed out, a
low-performance classifier can be easily transformed into a high-performance one by simply
turning its output into its opposite. Thanks to the “turning-into-opposite” trick, the
actual discriminant capability of a classifier could in fact be made coincident with the
absolute value of δ. However, for reasons related to the informative content of φ-δ
diagrams, we still keep the discriminant capability observed from the positive side separate
from the one observed on the negative side. As for the characteristic capability, let us
preliminarily note that, in presence of statistical significance, we can write:
<disp-formula><label>(8)</label><tex-math><![CDATA[E[X_c] \;\ldots]]></tex-math></disp-formula>
<disp-formula><label>(9)</label><tex-math><![CDATA[E[\widehat{X}_c] \;\ldots]]></tex-math></disp-formula>
According to Friedman [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], it is easy to show that Equation (9) actually represents an
estimate of the bias of a classifier, measured over the confusion matrix that describes
the outcomes of the experiments performed on the test set(s). Summarizing, in a φ-δ
diagram used for assessing classifiers, the δ-axis and the φ-axis represent the unbiased
accuracy and the unbiased bias, respectively. It is worth pointing out that a high positive
value of δ means that the classifier at hand approximates the behavior of an oracle,
whereas a high negative value approximates the behavior of a classifier that is almost
always wrong (say anti-oracle when δ = −1). Conversely, a high positive value of φ
denotes a dummy classifier that almost always considers input items as belonging to the
main category, whereas a high negative value denotes a dummy classifier that almost
always considers input items as belonging to the alternate category.
      </p>
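      <p>The relation between unbiased accuracy and δ, and the effect of the “turning-into-opposite” trick, can be illustrated as follows. This is a sketch under the definitions δ = tp − fp and φ = tp + fp − 1; the helper names and the sample rates are hypothetical.</p>
      <preformat>
```python
def unbiased_accuracy(fp, tp):
    """a_u = (tn + tp) / 2 with tn = 1 - fp, i.e. (1 + delta) / 2."""
    return (1.0 - fp + tp) / 2.0


def flip(fp, tp):
    """Rates of the classifier obtained by negating every prediction."""
    return 1.0 - fp, 1.0 - tp


fp, tp = 0.9, 0.2  # hypothetical, mostly wrong classifier
delta = tp - fp
phi = tp + fp - 1.0

flipped_fp, flipped_tp = flip(fp, tp)
flipped_delta = flipped_tp - flipped_fp
flipped_phi = flipped_tp + flipped_fp - 1.0
```
      </preformat>
      <p>Negating the output changes the sign of both δ and φ, so a point in the lower half of the rhombus is mirrored through the origin into the upper half, with |δ| unchanged.</p>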
      <p>Semantics of the φ-δ space for features. As for binary features, δ measures to what
extent a feature is able to partition the given samples in accordance (δ ≃ +1) or in
discordance (δ ≃ −1) with the main class label. In either case, the feature has high
discriminant capability. As already pointed out for classifiers, instead of considering
the absolute value of δ as a measure of discriminant capability, we keep the value
observed on the positive side separate from the one observed on the negative side for reasons
related to the informative content of φ-δ diagrams. On the other hand, φ measures
to what extent the feature at hand is spread over the given dataset. A high positive
value of φ indicates that the feature is mainly true across positive and negative samples,
whereas a high negative value indicates that the feature is mainly false in the dataset,
regardless of the class label of samples.</p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Some experiments have been performed with the aim of assessing the potential of φ-δ
diagrams. In our experiments we use a collection in which each document is a webpage.
The dataset is extracted from the DMOZ taxonomy. Let us recall that DMOZ is the
collection of HTML documents referenced in a Web directory developed in the Open
Directory Project (ODP). We chose a set of 174 categories containing about 20,000
documents, organized in 36 domains.</p>
      <p>In this scenario, we expect terms important for categorization to appear at the upper
or lower corner of the φ-δ rhombus, in correspondence with high values of |δ|. As
for the characteristic capability, terms that rarely occur in documents are expected to
appear at the left-hand corner (high negative values of φ), while the so-called stopwords
are expected to appear at the right-hand corner (high values of φ).</p>
      <p>[Figure: φ-δ diagrams for the selected DMOZ categories.]</p>
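      <p>For a term treated as a binary feature, tp and fp can be read as the fractions of positive and negative documents that contain the term. The sketch below computes the resulting (φ, δ) pair on a toy, made-up corpus; it is not the DMOZ data, and the function name and documents are hypothetical.</p>
      <preformat>
```python
def term_signature(term, positive_docs, negative_docs):
    """Return (phi, delta) for the binary feature 'term occurs in document'.

    Each document is a set of terms. tp is the fraction of positive
    documents containing the term, fp the fraction of negative ones.
    """
    tp = sum(term in doc for doc in positive_docs) / len(positive_docs)
    fp = sum(term in doc for doc in negative_docs) / len(negative_docs)
    return tp + fp - 1.0, tp - fp


# Toy, made-up corpus: main category vs. sibling categories.
positive_docs = [{"the", "camera", "film"}, {"the", "film", "director"}]
negative_docs = [{"the", "spell", "wand"}, {"the", "card", "trick"}]

signatures = {t: term_signature(t, positive_docs, negative_docs)
              for t in ("the", "film", "wand", "camera")}
```
      </preformat>
      <p>In this toy setting “the” lands at the right-hand corner (φ = 1, δ = 0) and “film” at the upper corner (φ = 0, δ = 1), while “wand” and “camera”, which each occur in a single document, fall in the left half of the rhombus.</p>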
      <p>
        The experiments focused on the identification of discriminant terms and
stopwords. Figure 3 plots the “signatures” obtained for the DMOZ categories
Filmmaking, Composition, Arts, and Magic. Alternate categories have been derived considering
the corresponding siblings. Note that, in accordance with Zipf’s law [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], most of the
words are located at the left-hand corner of the constraining rhombus. Looking at the
drawings, it appears that Filmmaking and Arts are expected to be the most difficult
categories to predict, as no terms with a significant value of |δ| exist for them. On the contrary,
documents of Composition and Magic appear to be relatively easy to classify, as
several terms exist with significant discriminant value. This conjecture is confirmed after
training 50 decision trees using only terms t whose characteristic capability satisfies the
constraint |φ(t)| &lt; 0.4. For each category, test samples have been randomly extracted
at each run, whereas the remainder of the samples were used to train the classifiers.
      </p>
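      <p>The term filter used before training, |φ(t)| &lt; 0.4, can be sketched as below. Ranking the surviving terms by decreasing |δ| is an assumption added here for illustration and is not stated in the text; the function name and the (φ, δ) values are made up.</p>
      <preformat>
```python
def select_terms(signatures, phi_bound=0.4):
    """Keep terms with |phi| below phi_bound, sorted by decreasing |delta|.

    signatures maps each term to its (phi, delta) pair.
    """
    kept = [(term, delta) for term, (phi, delta) in signatures.items()
            if phi_bound > abs(phi)]
    kept.sort(key=lambda item: abs(item[1]), reverse=True)
    return [term for term, _ in kept]


# Made-up (phi, delta) values, for illustration only.
signatures = {
    "the": (0.95, 0.02),       # stopword-like: right-hand corner
    "film": (-0.10, 0.70),     # discriminant, positive side
    "wand": (-0.30, -0.55),    # discriminant, negative side
    "zyzzyva": (-0.99, 0.01),  # rare term: left-hand corner
}
selected = select_terms(signatures)
```
      </preformat>
      <p>The bound on |φ| discards both stopword-like terms and very rare terms, leaving the central band of the rhombus where high values of |δ| are possible.</p>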
    </sec>
    <sec id="sec-5">
      <title>Strengths and Weaknesses of This Proposal</title>
      <p>Apart from the analysis of existing metrics, the paper has been mainly concerned with
the definition of two novel metrics deemed useful in the task of developing and
assessing machine learning and pattern recognition algorithms and systems. All in all, there
is no magic in the given definitions. In fact, the φ-δ space is basically obtained by
rotating the fp-tp space by π/4. Although this is not a dramatic change of perspective,
the φ-δ space makes it possible to analyze at a glance the most relevant properties
of classifiers or features. In particular, the (unbiased) accuracy and the (unbiased) bias
of a classifier are immediately visible on the vertical and horizontal axis of a φ-δ
space, respectively. Moreover, an estimate of the variance of a classifier can be easily
investigated by just reporting the results of several experiments in the φ-δ space (see,
for instance, Figure 4, which clearly points out to what extent the performance of
individual classifiers changes across experiments). All the above measures are completely
independent of the imbalance of data by construction, as the φ-δ space is defined
on top of unbiased metrics (i.e., ρ and ρ̄). This aspect is very important for classifier
assessment, making it easier to compare the performance obtained on different test data,
regardless of the imbalance between negative and positive samples. Summarizing,
the φ-δ space for classifiers can actually be thought of as a bias vs. accuracy (or
error) space, whose primary uses can be: (i) assessing the accuracy of a classifier over a
single or multiple runs, looking at the δ axis; (ii) assessing the bias of a classifier over a
single or multiple runs, looking at the φ axis; (iii) assessing the variance of a classifier,
looking at the scattering of multiple runs in the φ-δ space. As for binary features, an
insight about the potential of φ-δ diagrams in the task of assessing their importance
has been given in Section 4. In particular, let us recall that the most important features
related to a given domain are expected to have high values of |δ|, whereas unimportant
ones are expected to have high values of |φ|. Moreover, in the special case of text
categorization, stopwords are expected to occur at the right-hand corner of the rhombus
that constrains the φ-δ space.</p>
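      <p>The rotation mentioned above can be made concrete. Under the definitions φ = tp + fp − 1 and δ = tp − fp, a point of the fp-tp space is mapped by a clockwise rotation of π/4 about the center (0.5, 0.5), followed by a scaling by √2. The centering and the scaling factor are details worked out here rather than stated in the text, and the function names are hypothetical.</p>
      <preformat>
```python
import math


def rotate_and_scale(fp, tp):
    """Rotate (fp, tp) clockwise by pi/4 about (0.5, 0.5), scale by sqrt(2)."""
    u, v = fp - 0.5, tp - 0.5
    angle = -math.pi / 4.0
    x = math.cos(angle) * u - math.sin(angle) * v
    y = math.sin(angle) * u + math.cos(angle) * v
    return math.sqrt(2.0) * x, math.sqrt(2.0) * y


def to_phi_delta(fp, tp):
    """Direct definitions: phi = tp + fp - 1, delta = tp - fp."""
    return tp + fp - 1.0, tp - fp


fp, tp = 0.2, 0.9  # hypothetical point in ROC space
rotated = rotate_and_scale(fp, tp)
direct = to_phi_delta(fp, tp)
```
      </preformat>
      <p>The two results agree up to floating-point rounding, which is why the unit square of ROC space becomes a rhombus with vertices at (±1, 0) and (0, ±1).</p>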
      <p>It is worth mentioning that alternative definitions could also be given in the φ-δ
space for other relevant properties, e.g., ROC curves and AUC (or Gini’s coefficient).
Although these aspects are beyond the scope of this paper, let us spend a few words on
ROC curves. It is easy to verify that random guessing for a classifier would constrain
the ROC curve to the φ axis, whereas the ROC curve of a classifier acting as an oracle
would coincide with the positive border of the surrounding rhombus.</p>
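      <p>Both statements about ROC curves can be checked by mapping ROC points through φ = tp + fp − 1 and δ = tp − fp. The sketch below uses a handful of hypothetical sample points; random guessing is modeled as the diagonal tp = fp, and the oracle as the path (0, 0) to (0, 1) to (1, 1).</p>
      <preformat>
```python
def to_phi_delta(fp, tp):
    """phi = tp + fp - 1, delta = tp - fp."""
    return tp + fp - 1.0, tp - fp


steps = [i / 4.0 for i in range(5)]

# Random guessing: ROC points on the diagonal tp = fp.
random_curve = [to_phi_delta(t, t) for t in steps]

# Oracle: ROC path (0, 0) -> (0, 1) -> (1, 1), as (fp, tp) pairs.
oracle_roc = [(0.0, t) for t in steps] + [(t, 1.0) for t in steps]
oracle_curve = [to_phi_delta(fp, tp) for fp, tp in oracle_roc]
```
      </preformat>
      <p>Every point of random_curve has δ = 0, so it lies on the φ axis, while every point of oracle_curve satisfies |φ| + δ = 1, i.e., it lies on the upper border of the rhombus.</p>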
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>After discussing and analyzing some issues related to the most acknowledged metrics
used in pattern recognition and machine learning, two novel metrics have been
proposed, i.e., δ and φ, intended to measure discriminant and characteristic capability for
binary classifiers and binary features. They are unbiased and are obtained as linear
transformations of false and true positive rates. Moreover, the corresponding
isometric curves show that they are orthogonal. The applications of φ-δ diagrams to pattern
recognition and machine learning problems are manifold, ranging from feature selection
to classifier performance assessment. Some experiments performed in a text
categorization setting confirm the usefulness of the proposal. As for future work, the properties of
terms in a scenario of hierarchical text categorization will be investigated using φ-δ
diagrams. A generalization of δ and φ to multilabel categorization problems with
multivalued features is also under study.</p>
      <p>Acknowledgments. This work has been supported by LR7 2009 - Investment funds
for basic research (funded by the local government of Sardinia).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><given-names>Andrew P.</given-names> <surname>Bradley</surname></string-name>.
          <article-title>The use of the area under the ROC curve in the evaluation of machine learning algorithms</article-title>.
          <source>Pattern Recogn.</source>,
          <volume>30</volume>(<issue>7</issue>):<fpage>1145</fpage>-<lpage>1159</lpage>,
          <year>July 1997</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>Peter A.</given-names> <surname>Flach</surname></string-name>.
          <article-title>The geometry of ROC space: understanding machine learning metrics through ROC isometrics</article-title>.
          In <source>Proceedings of the Twentieth International Conference on Machine Learning</source>, pages
          <fpage>194</fpage>-<lpage>201</lpage>. AAAI Press,
          <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>Jerome H.</given-names> <surname>Friedman</surname></string-name> and
          <string-name><given-names>Usama</given-names> <surname>Fayyad</surname></string-name>.
          <article-title>On bias, variance, 0/1-loss, and the curse-of-dimensionality</article-title>.
          <source>Data Mining and Knowledge Discovery</source>,
          <volume>1</volume>:<fpage>55</fpage>-<lpage>77</lpage>,
          <year>1997</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Matthews</surname>
          </string-name>
          .
          <article-title>Comparison of the predicted and observed secondary structure of T4 phage lysozyme</article-title>
          .
          <source>Biochim. Biophys. Acta</source>
          ,
          <volume>405</volume>
          :
          <fpage>442</fpage>
          -
          <lpage>451</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>Vijay</given-names> <surname>Raghavan</surname></string-name>,
          <string-name><given-names>Peter</given-names> <surname>Bollmann</surname></string-name>, and
          <string-name><given-names>Gwang S.</given-names> <surname>Jung</surname></string-name>.
          <article-title>A critical investigation of recall and precision as measures of retrieval system performance</article-title>.
          <source>ACM Trans. Inf. Syst.</source>,
          <volume>7</volume>(<issue>3</issue>):<fpage>205</fpage>-<lpage>229</lpage>,
          <year>July 1989</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>C. J.</given-names> <surname>Van Rijsbergen</surname></string-name>.
          <source>Information Retrieval</source>. Butterworth-Heinemann, Newton, MA, USA, 2nd edition,
          <year>1979</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>George K.</given-names> <surname>Zipf</surname></string-name>.
          <source>Human Behavior and the Principle of Least Effort</source>.
          <publisher-name>Addison-Wesley</publisher-name> (Reading MA),
          <year>1949</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>