<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approximation Algorithm for a Quadratic Euclidean Problem of Searching a Subset with the Largest Cardinality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander A. Ageev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander V. Kel'manov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artem V. Pyatkin</string-name>
          <email>artem@math.nsc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergey A. Khamidullin</string-name>
          <email>kham@math.nsc.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir V. Shenmaier</string-name>
          <email>shenmaier@mail.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Novosibirsk State University</institution>
          ,
          <addr-line>Pirogova str. 1, 630090 Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sobolev Institute of Mathematics</institution>
          ,
          <addr-line>Acad. Koptyug avenue, 4, 630090 Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>19</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>We consider the problem of searching a subset in a finite set of points of Euclidean space. The problem is to find a cluster (subset) of the largest cardinality satisfying a given upper bound on the sum of squared distances between the cluster elements and their centroid. We prove that this problem is strongly NP-hard and present a polynomial-time 1/2-approximation algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the paper we consider the problem of finding a subset of largest cardinality in a finite set of points of the
Euclidean space for which quadratic variation of points with respect to its unknown centroid does not exceed a
given fraction of the quadratic variation of points of the input set with respect to its centroid. Our goal is to
analyze the computational complexity of this problem and construct an algorithm for its solution.</p>
      <p>The study is motivated, on the one hand, by the absence of published results on the complexity status of
the problem and on efficient algorithms with guaranteed performance for this problem. On the other hand, it is
motivated by relevance of the problem for a number of applications.</p>
      <p>
        In particular, the problem under consideration is the typical problem for Editing and Cleaning data of “litter”
in the form of inaccurate measurements of the characteristics of any objects
        <xref ref-type="bibr" rid="ref1 ref3">(see for example [Waal et al., 2011],
[Osborne, 2013], [Greco, 2015])</xref>
        . It is well-known that, in Machine learning problem, the cleaning of irrelevant
“litter” from data is a necessary element [Bishop, 2006], [James et al., 2013], [Hastie et al., 2009], [Aggarwal, 2015].
If the values of the input data set are the results of measurements of the characteristics of an object and some
of the measurements have been taken with instrumental error, the value of which exceeds some given threshold,
then we need usually to find a subset (it’s desirable, of largest cardinality), having no significant (exceeding the
threshold) errors.
      </p>
      <p>Figure 1 shows three examples (a, b, c) the sets of points on the plane (two-dimensional data). The examples
correspond to the measurements of the characteristics of three objects. In each of the examples, one can see the
dense and sparse subsets of points that correspond to the small and big instrumental errors.
(a)
(c)
(b)</p>
      <p>
        In Data mining and Big data problems, the key element is approximation of the available data by a
mathematical model, allowing adequately interpret the data and reasonably explain their origin in terms of the
model
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">(see for example, [Waal et al., 2011], [Osborne, 2013], [Greco, 2015], [Bishop, 2006], [James et al., 2013],
[Hastie et al., 2009], [Aggarwal, 2015])</xref>
        . In particular, the following statistical hypothesis on the origin of the data
could be verified: is it true that the input set is an inhomogeneous sample from several probability distributions
while a part of the elements of this set is a sample from a single probability distribution with an unknown average
(it is assumed that the correspondence between the sample elements and the distribution is unknown). To verify
this conjecture, we first need to find the sample-subset from a single probability distribution. Only after that we
can use the classical results from the field of statistical hypothesis testing and investigate the properties of the
found set.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Problem Formulation, Known and Obtained Results</title>
      <p>Everywhere below denote by R the set of real numbers and by ∥ · ∥ the Euclidean norm.</p>
      <p>Consider the following problems.</p>
      <p>Problem 1 (Subset of points with the largest cardinality).</p>
      <p>Given: a set Y = {y1; : : : ; yN } of points from Rq and number ∈ (0; 1).</p>
      <p>Find : a subset C ⊂ Y with the largest cardinality such that</p>
      <p>F (C) = ∑ ∥y − y(C)∥2 ≤</p>
      <p>∑ ∥y − y(Y)∥2;
y2C
y2Y
(1)
where y(C) = jC1j ∑y2C y is the centroid (the geometrical center) of the subset C, and y(Y) = jY1 j ∑y2Y y — is
the centroid of the input set.</p>
      <p>The problem has a simple interpretation, namely, searching the largest by cardinality subset C of points,
whose total quadratic variation from the unknown centroid y(C) doesn’t exceed the total quadratic variation of
the input set Y from its centroid y(Y) by 1= times. If the coordinates of the points of the input set Y are the
results of measuring the characteristics of some object, there could be instrumental errors; if their magnitude
(in the form of a quadratic variation) exceeds a certain predetermined threshold (the right-hand side of (1)),
then the solution of the problem 1 yields a subset C of the largest cardinality that does not contain data with a
significant (exceeding the threshold) error. The level of the error significance (threshold value) can be regulated
by changing .</p>
      <p>The closest problem related to the problem 1 is the following
Problem 2 (M -Variance).</p>
      <p>Given: a set Y = {y1; : : : ; yN } of points from Rq and positive integer M &gt; 1.</p>
      <p>Find : a subset C ⊆ Y with cardinality M such that</p>
      <p>F (C) → min :</p>
      <p>This problem, like problem 1, is also typical for the applications mentioned above. Unlike the problem 1, in
this problem it is required to find a subset of points of a given cardinality, in which the scatter of points from
the unknown centroid is minimal. For this problem, in contrast to the problem 1, some results are known.</p>
      <p>Apparently, the first formulation of this extremal problem was given in [Aggarwal, 1991], and its strong
NP-hardness is established in [Kelmanov, 2011].</p>
      <p>In [Aggarwal, 1991] and [Shenmaier, 2016] the solvability of the problem in time O(qN q+1), polynomial in the
case for the fixed (bounded from above by a constant) dimension q of the space was shown. The polynomial-time
2-approximation algorithm with O(qN 2) running time was proposed in [Kelmanov, 2012]. The polynomial-time
approximation scheme (PTAS) is developed in [Shenmaier, 2012]. This scheme solves the problem with an
arbitrary relative error " in time O(qN 2="+1(9=")3=").</p>
      <p>For the case of a fixed dimension q of space and integer coordinates of points, an exact pseudopolynomial
algorithm is constructed [Kelmanov &amp; Romanchenko, 2012] having the O(N (M D)q) time complexity, where D
is the maximum absolute value of the coordinates of the input points.</p>
      <p>In [Kelmanov &amp; Romanchenko, 2014] it is proved that for problem 2 having a numerical input there is no
fully polynomial-time approximation scheme (FPTAS), unless P = N P and in the same paper such a scheme
was developed for the case when the space dimension is bounded from above by a constant. This scheme solves
the problem with an arbitrary relative error " in time O(N 2(M=")q).</p>
      <p>In this paper we show that the problem 1 is NP-hard in the strong sense and present an 1/2-approximation
polynomial-time algorithm.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Analysis of the Problem Complexity</title>
      <p>Before analyzing the computational complexity of problem 1, note that the right-hand side (1) does not depend
on C and for a specified input is a constant. Therefore, the problem 1 in the property verification form looks as
follows.</p>
      <p>Problem 1A (Subset of points with the largest cardinality).</p>
      <p>Given: a set Y = {y1; : : : ; yN } of points from Rq, positive number A, and positive integer M .
Question: is there a subset C ⊂ Y of cardinality at least M such that
(2)
Remind that the following problem is NP-complete in the strong sense [Kelmanov, 2011].
Problem 2A (M -Variance).</p>
      <p>Given: a set Y = {y1; : : : ; yN } of points from Rq, integer M and positive number B.</p>
      <p>Question: is there a subset C ⊂ Y of cardinality M such that</p>
      <p>F (C) ≤ A ?</p>
      <p>F (C) ≤ B ?</p>
      <p>In [Kelmanov, 2011], the well-known [Papadimitriou, 1994] strongly NP-complete problem Clique in a regular
graph whose degree is not fixed was reduced to problem 2A.</p>
      <p>Note that the function F has the following property: if C1 ⊆ C2, then F (C1) ≤ F (C2). Therefore, if in the
problem 1A the answer is positive, then there is a subset of cardinality M satisfying the inequality (2). Thus,
problems 1A and 2A are equivalent and obviously we have the following</p>
      <p>Statement 1. The problem 1A is NP-complete in the strong sense.</p>
      <p>It follows from statement 1 that problem 1 is an NP-hard problem in the strong sense, that is, it is not easier
than the problem 2.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Approximation Algorithm</title>
      <p>The idea of the proposed approximation algorithm for the problem 1 is following. For each point y of the input
set, a subset consisting of the maximum number of closest (in the sense of Euclidean distance) points from the
input set is constructed such that the sum of the squares of the distances from y to the points of the subset does
not exceed a given threshold (that is the fraction of the quadratic scatter of points of the input set). Among
the found subsets the one with the largest cardinality is taken as an output. This approach is realized in the
following algorithm.
Algorithm A.</p>
      <sec id="sec-4-1">
        <title>Input : set Y and number .</title>
        <p>Step 1. Compute the value A =</p>
      </sec>
      <sec id="sec-4-2">
        <title>For each point y ∈ Y perform steps 2 and 3.</title>
        <p>∑y2Y ∥y − y(Y)∥2 (the right part of the inequality (1)).</p>
        <p>Step 2. Compute the distances from the point y to all points in Y and sort the set Y in the nondecreasing
order according to these distances. Denote this sequence by y1; : : : ; yN .</p>
        <p>Step 3. Find the subsequence y1; : : : yM of maximum length such that ∑M
i=1 ∥y − yi∥2 ≤ A. Define the subset
Cy = {y1 : : : ; yM }.</p>
        <p>Step 4. In the family {Cy| y ∈ Y} of admissible subsets constructed in step 3 choose as the output CA any
subset Cy of the largest cardinality.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Output : subset CA.</title>
        <p>The following theorem holds
Theorem 1. Algorithm A finds a 1=2-approximate solution of problem 1 in O(N 2(q + log N ))-time.
The results presented in the paper have two aspects. On the one hand, they are an investigation of a new discrete
optimization problem, and on the other hand, they give a new effective algorithmic tool for solving one of the
actual problems in data analysis.</p>
        <p>In terms of analysis of large-scale data, faster algorithms with guaranteed accuracy bounds, solving the
problem in linear and sublinear time, are of considerable interest. The construction of such algorithms, as well
as polynomial approximation schemes for the considered problem, seems to be a matter of immediate prospects.
Acknowledgements
This work was supported by the Russian Foundation for Basic Research, project nos. 15-01-00462, 15-01-00976,
and 16-07-00168, and by the grant of Presidium RAS (program 5, project 227), and by the Ministry of Science
and Education of the Russian Federation under the 5-100 Excellence Programme.
[Greco, 2015] Greco, L. (2015). Robust Methods for Data Reduction Alessio Farcomeni. Chapman and Hall/CRC.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Waal et al.,
          <year>2011</year>
          ] de Waal,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Pannekoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            , &amp;
            <surname>Scholtus</surname>
          </string-name>
          <string-name>
            <surname>S.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Handbook of Statistical Data Editing and Imputation</article-title>
          . John Wiley and Sons, Inc. Hoboken, New Jersey.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Bishop</source>
          , 2006] Bishop,
          <string-name>
            <surname>C. M.</surname>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Pattern Recognition and Machine Learning</article-title>
          . New York: Springer Science+Business Media, LLC.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [James et al.,
          <year>2013</year>
          ] James,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            , &amp;
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>An Introduction to Statistical Learning</article-title>
          . New York: Springer Science+Business Media, LLC.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Hastie et al.,
          <year>2009</year>
          ] Hastie,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            &amp;
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <source>The Elements of Statistical Learning (2nd edition)</source>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Aggarwal</source>
          , 2015] Aggarwal,
          <string-name>
            <surname>C.</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Data Mining: The Textbook</article-title>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Aggarwal</source>
          , 1991] Aggarwal,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Imai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Katoh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            , &amp;
            <surname>Suri</surname>
          </string-name>
          <string-name>
            <surname>S.</surname>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>Finding k points with minimum diameter and related problems</article-title>
          .
          <source>J. Algorithms</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ),
          <fpage>38</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Kelmanov</source>
          , 2011] Kelmanov,
          <string-name>
            <given-names>A. V.</given-names>
            , &amp;
            <surname>Pyatkin</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. V.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>NP-Completeness of Some Problems of Choosing a Vector Subset</article-title>
          .
          <source>J. Appl. Indust. Math.</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>352</fpage>
          -
          <lpage>357</lpage>
          . doi:
          <volume>10</volume>
          .1134/S1990478911030069
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>[Shenmaier</source>
          , 2016] Shenmaier,
          <string-name>
            <surname>V. V.</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Solving Some Vector Subset Problems by Voronoi Diagrams</article-title>
          .
          <source>J. Appl</source>
          . Indust. Math.,
          <volume>10</volume>
          (
          <issue>4</issue>
          ),
          <fpage>560</fpage>
          -
          <lpage>566</lpage>
          . doi:
          <volume>10</volume>
          .1134/S199047891604013X
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Kelmanov</source>
          , 2012]
          <string-name>
            <surname>Kelmanov</surname>
            <given-names>A. V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Romanchenko</surname>
            <given-names>S. M.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>An Approximation Algorithm for Solving a Problem of Search for a Vector Subset</article-title>
          .
          <source>J. Appl. Indust. Math.</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>90</fpage>
          -
          <lpage>96</lpage>
          . doi:
          <volume>10</volume>
          .1134/S1990478912010097
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[Shenmaier</source>
          , 2012] Shenmaier,
          <string-name>
            <surname>V. V.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>An Approximation Scheme for a Problem of Search for a Vector Subset</article-title>
          .
          <source>J. Appl. Indust. Math.</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ).
          <fpage>381</fpage>
          -
          <lpage>386</lpage>
          . doi:
          <volume>10</volume>
          .1134/S1990478912030131
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>[Kelmanov &amp; Romanchenko</source>
          ,
          <year>2012</year>
          ] (
          <year>2012</year>
          ). Kelmanov,
          <string-name>
            <given-names>A. V.</given-names>
            , &amp;
            <surname>Romanchenko S. M.</surname>
          </string-name>
          <article-title>Pseudopolynomial Algorithms for Certain Computationally Hard Vector Subset and Cluster Analysis Problems</article-title>
          . Automation and Remote Control,
          <volume>73</volume>
          (
          <issue>2</issue>
          ).
          <fpage>349</fpage>
          -
          <lpage>354</lpage>
          . doi:
          <volume>10</volume>
          .1134/S0005117912020129
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Kelmanov &amp; Romanchenko</source>
          , 2014]
          <article-title>Kel'manov,</article-title>
          <string-name>
            <given-names>A. V.</given-names>
            , &amp;
            <surname>Romanchenko</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. M.</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>An FPTAS for a Vector Subset Search Problem</article-title>
          . J. Appl. Indust. Math.,
          <volume>8</volume>
          (
          <issue>3</issue>
          ).
          <fpage>329</fpage>
          -
          <lpage>336</lpage>
          . doi:
          <volume>10</volume>
          .1134/S1990478914030041
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>[Papadimitriou</source>
          , 1994] Papadimitriou,
          <string-name>
            <surname>C. H.</surname>
          </string-name>
          (
          <year>1994</year>
          ).
          <source>Computational Complexity</source>
          . New-York: Addison-Wesley.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>