<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Balancing Method for Training Segmentation Neural Networks ?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexey Kochkarev</string-name>
          <email>alexey.kochkarev@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Khvostikov</string-name>
          <email>khvostikov@cs.msu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Korshunov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>y Krylov</string-name>
          <email>kryl@cs.msu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>il Bogusl</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>vskiy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computational Mathematics and Cybernetics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Geology, Lomonosov Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data imbalance is a common problem in machine learning and image processing. The lack of training data for the rarest classes can lead to worse learning ability and negatively affect the quality of segmentation. In this paper we focus on the problem of data balancing for the task of image segmentation. We review major trends in handling unbalanced data and propose a new method for data balancing, based on Distance Transform. This method is designed for using in segmentation convolutional neural networks (CNNs), but it is universal and can be used with any patch-based segmentation machine learning model. The evaluation of the proposed data balancing method is performed on two datasets. The first is medical dataset LiTS, containing CT images of liver with tumor abnormalities. Second one is a geological dataset, containing of photographs of polished sections of different ores. The proposed algorithm enhances the data balance between classes and improves the overall performance of CNN model.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Segmentation</kwd>
        <kwd>Data Balancing</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Liver</kwd>
        <kwd>Tumor</kwd>
        <kwd>Geology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Data imbalance is a common issue in image segmentation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. If pixels corresponding
to a particular “majority” class are far more numerous than pixels of one or more
“minority” classes, the rarity of the “minority” class in the training data makes the training
process less effective and worses the final results, as the learned model will tend to
classify most pixels as members of the “majority” classes.
      </p>
      <p>
        The problem of data imbalance is very common in medical problems and, in
particular, detecting liver tumors. One of these problems is segmentation of CT images, since
the volume and area of different organs and abnormalities differs a lot. In a typical CT
? The work was funded by RFBR, CNPq and MOST according to the research project
19-5780014 (BRICS2019-394)
image of a liver tumor, the volume of healthy liver tissue is significantly greater than
the volume of cancerous tissue [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Data imbalance problem also occurs in geological image segmentation. Some
minerals are found in nature much less often than others. For example, photographs of
polished sections, taken from lead-zinc fields contain a larger amount of Sphalerite ore
(about 30 %) than Galena (about 7%).</p>
      <p>
        The existent methods that are used to overcome class imbalance can be divided into
two main categories [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The first category of methods is represented with algorithm-based methods. One
of the approaches is cost-sensetive learning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The idea is to assign different costs to
classification mistakes for different classes. One common scheme involves assigning to
each class a cost equal to the inverse of the proportion of this class in dataset. This leads
to higher model penalization for rarest classes.
      </p>
      <p>
        The second category of methods is represented with so-called data-based methods.
They use sampling techniques to rebalance the distribution of classes during
preprocessing. This involves either oversampling instances of the minority class or
undersampling instances of the majority class, or even both [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of the generalizations of
these approaches is Synthetic Minority Over-Sampling Technique, or SMOTE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This
technique involves generating synthetic samples of the minority class to train on, thus
reducing the class imbalance by artificially inflating the size of the minority class itself.
It also should be noticed that SMOTE and other methods based on SMOTE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are
designed for the machine learning-based classification problems, and either can not be
used at all in the segmentation problem or demonstrate mediocre results.
      </p>
      <p>In this paper we propose a data balancing method that focuses on modifying the
class distribution in the dataset. The proposed method uses only data from the
original set rather than replicating additional minority samples. The proposed method is
specially created for segmentation problems and has a wide range of applications.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed method</title>
      <p>Segmentation is defined as finding a mapping of the source color image I 2 Rh w 3
of height h and width w to the annotation S 2 Zh w, containing labels (or classes). Let
us consider a set of classes C = fc0; c1; :::; cN g.</p>
      <p>For the convenience we internally store the annotation as a 2-D array of integers
(S = S(x; y)), each value of which is just the index of the class value: S(x; y) 2 C.
The source images are stored as 3-D array of float numbers: I = I(x; y); I(x; y) 2 R3.</p>
      <p>The proposed method is created to be used with data that is fed into neural networks
during training process. On every step of each training epoch the neural network gets a
batch of square patches, taken from one of the dataset images fI1(x; y); :::; Im(x; y)g.
Conventional random choice of patches can only increase the existing data imbalance,
because there could occur objects that are smaller then a patch and for any patch
covering case, the ratio between object and non-object pixels inside the patch can not be
increased. The proposed method allows to choose patches, that contain most amount of
pixels of the certain class. Thus, it is possible to keep balance of classes in images, that
are fed into network. The proposed method consists of three main steps:</p>
      <p>Data Balancing Method for Training Segmentation Neural Networks 3
1. choose class, that was most rarely fed into the model;
2. choose image, which contains the greatest quantity of pixels of chosen class;
3. crop patch from the selected image, which contains the greatest quantity of pixels
of chosen class;
All these stages will be considered in detail below.
2.1</p>
      <sec id="sec-2-1">
        <title>Class choice</title>
        <p>A square area in image Ii(x; y) is called a patch and marked as P (x; y). The
appropriate area on corresponding annotation Si(x; y) is marked as PAnn(x; y). After patch is
chosen, we sum the amount of pixels for each class:</p>
        <p>sj = j(x; y) : PAnn(x; y) = cj j; j = 0; 1; :::; N:
On this step we should choose class Ck, for which amount of pixels is the smallest:
sk = minfs0; s1; :::; sN g:</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Image choice</title>
        <p>For every image Ii in dataset we compute weights for each class fw0; w1; :::; wN g as
the amount of pixels of this class on the image:</p>
        <p>wj = j(x; y) : Si(x; y) = cj j; j = 0; 1; :::; N:
So, the image will be chosen from the dataset of fI1; :::; Img images with probability
proportionally to its weight.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Patch choice</title>
        <p>On this step we choose patch, which contains the greatest quantity of pixels of chosen
class Ck. To perform this choice we first calculate the probability maps of choosing
upper left pixel of the patch in current pixel. First, distance to class map is built. Let,
annotation S(x; y) to be consisted of two classes: Object (c1) and Background (c0):</p>
        <p>
          S(x; y) 2 fc0; c1g:
At first, we apply Distance Transform [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and get a map Sd(x; y), where each pixel is a
distance to the nearest object pixel on the annotation S(x; y):
        </p>
        <p>Sd(x; y) =
0;
min kx
x0; y</p>
        <p>S(x; y) = c1 ;
y0kL2 ; 8S (x0; y0) = c0 ; S(x; y) = c0
where kx; ykL2 = px2 + y2:
Consider a patch P (x; y) of size p p in annotation S(x; y). The less the sum of pixels
in a relevant area on Sd(x; y), the more pixels of chosen class there are in this area. Let
us define:</p>
        <p>S~d(x; y) = 1</p>
        <p>Sd(x; y)
max (Sd(x; y))
:
(1)
(2)
(3)
(4)
(5)
(6)</p>
        <p>The more sum is inside a relevant area in S~d(x; y), the more pixels of chosen class
there are in chosen area. Summing up pixels on every rectangle area of S(x; y), we get
desired probability map P r(x; y), where each pixel is sum of distances from chosen
class to background pixels. The higher the value in current pixel of P r(x; y), the higher
the probability of getting most pixels of chosen class, if we choose upper left pixel of
patch in current pixel.</p>
        <p>
          The sum of pixels on the patch has to be calculated many times. Quick and efficient
summation of pixels can be performed using Summed-Area table [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The value at
any point (x; y) in the summed-area table is the sum of all the pixels that are above and
to the left of (x; y), inclusive:
        </p>
        <p>Sint(x; y) = X S (x0; y0) :
x0 x
y0 y
The summed-area table can be computed efficiently in a single pass over the image, as
the value in the summed-area table at (x; y) is just:</p>
        <p>Sint(x; y) = S(x; y) + Sint(x; y
1) + Sint(x
1; y)</p>
        <sec id="sec-2-3-1">
          <title>Sint(x</title>
          <p>1; y
1): (9)
Once the summed-area table is computed, evaluating the sum of intensities over any
rectangular area requires exactly four array references regardless of the area size:</p>
          <p>X
x0&lt;x x1
y0&lt;y y1</p>
          <p>S(x; y) = Sint(D) + Sint(A)</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Sint(B)</title>
          <p>Sint(C);</p>
          <p>A = (x0; y0) ; B = (x1; y0) ; C = (x0; y1) ; D = (x1; y1) :
3</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Used datasets</title>
      <p>
        In this paper we used two datasets to evaluate the proposed data balancing method.
First one consists of 2-dimensional slices of liver Computer Tomography – Liver
Tumor Segmentation Challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (LiTS). The second dataset consists of photographs of
polished sections of ores, collected by geologists. Each dataset is fully-annotated.
3.1
      </p>
      <sec id="sec-3-1">
        <title>LiTS dataset</title>
        <p>LiTS dataset contains of 130 3-dimensional CT scans (Fig. 1a) of liver. The images’
resolution is 512 512 75 pixels. We took 20 2-dimensional slices (Fig. 1b) from
every scan, that contain liver and tumor pixels. Black pixels correspond to background,
grey pixels correspond to liver, white pixels correspond to tumors (Fig. 1b).
The obtained dataset of chosen slices is highly imbalanced: total amount of pixels,
corresponding to liver is 29.4%, while tumor abnormalities are present on 4.3% of pixels.
Remaining 66.3% of pixels from LiTS dataset correspond to background.
(8)
(10)
(11)</p>
        <p>Data Balancing Method for Training Segmentation Neural Networks 5
(a) Visualization of 3D liver volume
(b) Corresponding ground truth
segmentation of one of the axial slices (black –
background, grey – liver, white – tumor)
This dataset consists of 46 images with ground truth annotation made by expert
geologist. The dimension of the images is 3396 2547. All the photos were taken by Canon
G10 with ZEISS Axioscop 40 microscope.</p>
        <p>Every image contains up to 4 classes (ores): Background (0 – Bg), Sphalerite (1 – Sh),
Pyrite and Marcasite (2 – Py/Mrc) and Galena (3 – Gl).</p>
        <p>The most common mineral in the images is Pyrite (pink color on GT) – 30.2%,
Spha(a) Polished section photo
(b) Corresponding ground truth
segmentation (black – background, pink – Pyrite,
yellow – Sphalerite, green – Galena)
lerite (orange color on GT) is less common – 29.3%, Galena (green color on GT) is the
rarest – only 6.7%, background (black on GT) is 33.7%.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and results</title>
      <p>To evaluate the proposed data balancing method we chose 3000 patches from each
dataset. First, the patches were chosen randomly, and after, the patches were chosen
using the proposed method based on probability maps (Fig. 3, Fig. 4).</p>
      <p>The ratio of pixels for different classes improved in case of LiTS dataset
significantly, but, as we see in Table 1, the quantity of pixels, corresponding to tumor is still
less than for liver and background pixels. This imbalance can be explained by the fact
that areas with tumor abnormalities are much smaller than selected patch. Even if we
take patch according to probability map for selected class, we also get a big amount
of pixels, corresponding to nearby classes. The decrease of patch size in this case is
inappropriate from the point of view of training the neural network model.</p>
      <p>On the opposite side, even the rarest ores present in areas which size is comparable
with patch size even for small pathces. In this case the proposed method demonstrates
its efficiency, resulting in well-balanced data, fed into network.</p>
      <p>In multiclass cases I oU is computed separately for each class, according to the ”one
against all” principle.</p>
      <p>The comparison of I oU value for each class before and after balancing is presented
on Fig. 5.</p>
      <p>The balancing method improved training process and hence the final results.
(12)
The developed method of data balancing was tested on biomedical and geological
datasets. It has shown its effectiveness in levelling data balance across classes for
imbalanced datasets. The method can be used to handle the data imbalance problem in image
segmentation task. It is universal and can be used with any patch-based segmentation
machine learning model, including convolutional neural networks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>He</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            <given-names>E. A.</given-names>
          </string-name>
          :
          <article-title>Learning from imbalanced data</article-title>
          .
          <source>IEEE Transactions on knowledge and data engineering 21(9)</source>
          ,
          <fpage>1263</fpage>
          <lpage>1284</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Christ P.: LiTS Liver Tumor Segmentation</surname>
          </string-name>
          <article-title>Challenge (LiTS17)</article-title>
          . URL https://competitions.codalab.org/competitions/17094 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Small</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ventura</surname>
            <given-names>J</given-names>
          </string-name>
          .:
          <article-title>Handling unbalanced data in deep image segmentation</article-title>
          . University of Colorado (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Nashnush</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadera</surname>
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Cost-sensitive Bayesian network learning using sampling</article-title>
          .
          <source>In: Recent Advances on Soft Computing and Data Mining</source>
          . pp.
          <fpage>467</fpage>
          <lpage>476</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buda</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maki</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazurowski</surname>
            <given-names>M. A.</given-names>
          </string-name>
          :
          <article-title>A systematic study of the class imbalance problem in convolutional neural networks</article-title>
          .
          <source>Neural Networks</source>
          <volume>106</volume>
          ,
          <fpage>249</fpage>
          -
          <lpage>259</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chawla</surname>
            <given-names>N. V.</given-names>
          </string-name>
          et al.:
          <article-title>SMOTE: synthetic minority over-sampling technique</article-title>
          .
          <source>Journal of artificial intelligence research 16</source>
          ,
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Han</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>W. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            <given-names>B. H.</given-names>
          </string-name>
          :
          <article-title>Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning</article-title>
          .
          <source>In: International conference on intelligent computing</source>
          . pp.
          <fpage>878</fpage>
          -
          <lpage>887</lpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Maciejewski</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanowski</surname>
            <given-names>J</given-names>
          </string-name>
          .:
          <article-title>Local neighbourhood extension of SMOTE for mining imbalanced data</article-title>
          .
          <source>In: 2011 IEEE symposium on computational intelligence and data mining (CIDM)</source>
          . pp.
          <fpage>104</fpage>
          -
          <lpage>111</lpage>
          . IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fabbri R</surname>
          </string-name>
          . et al.:
          <article-title>2D Euclidean distance transform algorithms: A comparative survey</article-title>
          .
          <source>ACM Computing Surveys (CSUR) 40(1)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>44</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Bradley</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Adaptive thresholding using the integral image</article-title>
          .
          <source>Journal of graphics tools 12(2)</source>
          ,
          <fpage>13</fpage>
          <lpage>21</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ronneberger</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brox</surname>
            <given-names>T</given-names>
          </string-name>
          .:
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          .
          <source>In: International Conference on Medical image computing and computerassisted intervention</source>
          . pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>