A Local Search for a Graph Correlation Clustering

                           Victor Il’ev1,2 and Anna Navrotskaya1,2
                               1
                                Sobolev Institute of Mathematics,
                       4 Acad. Koptyug Ave., 630090 Novosibirsk, Russia,
                                   2
                                     Omsk State University,
                             55a Mira Ave., 644077 Omsk, Russia
                               iljev@mail.ru, nawrocki@ya.ru


       Abstract. In the clustering problems one has to partition a given set of objects
       into some subsets (called clusters) taking into consideration only similarity of
       the objects. We consider a version of the clustering problem when the number
       of clusters does not exceed a positive integer k and the goal is to minimize
       the number of edges between clusters and the number of missing edges within
       clusters. This problem is NP-hard for any k ≥ 2. We propose a polynomial time
       k-approximation algorithm for this problem.

       Keywords: Graph clustering, local search, approximation algorithm.


1    Introduction
We consider only the simple graphs, i.e., the graphs without loops and multiple edges.
A graph is called cluster graph [11] if each of its connected components is a complete
graph. Let V be a finite set. Denote by M(V ) the set of all cluster graphs on the vertex
set V ; by Mk (V ), the set of all cluster graphs on V consisting of exactly k nonempty
connected components, and by M1,k (V ), the set of all cluster graphs on V consisting
of at most k connected components, 2 ≤ k ≤ |V |.
    If G1 = (V, E1 ) and G2 = (V, E2 ) are graphs on the same vertex set V , then the
distance d(G1 , G2 ) between them is defined as follows:

                               d(G1 , G2 ) = |E1 \ E2 | + |E2 \ E1 |,

i.e., d(G1 , G2 ) is the number of noncoinciding edges in G1 and G2 .
    In the clustering problems one has to partition a given set of objects (a data set) into
some subsets (called clusters) taking into consideration only similarity of the objects.
    Bansal, Blum and Chawla [3] proposed the following version of the clustering prob-
lem:
    CORRELATION CLUSTERING (CC). Given a complete graph G = (V, E) with
edges labelled +1 (similar) or −1 (diﬀerent), find an optimal clustering, i.e., a partition
of the vertex set of G into clusters minimizing disagreements (the number of −1 edges
inside the clusters plus the number of +1 edges between clusters).

Copyright ⃝
          c by the paper’s authors. Copying permitted for private and academic purposes.
In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org
                                                       Graph Clustering: Local Search   511

     The variants of CC in which the number of clusters is bounded were also studied
[3, 6].
    One of the most visual formalizations of the clustering problem is the graph cluster-
ing [10], that is grouping the vertices of a graph into clusters taking into consideration
the edge structure of the graph whose vertices are objects and edges represent similar-
ities between the objects. In this setting, the clustering can be understood as cluster
graph M whose connected components correspond to the clusters.
    Obviously, the following setting is equivalent to CC.
   GRAPH CORRELATION CLUSTERING (GCC). Given a graph G = (V, E), find
a graph M ∗ ∈ M(V ) such that

                             d(G, M ∗ ) =     min      d(G, M ).
                                            M ∈M(V )

  We focus on the following bounded variants of GRAPH CORRELATION CLUS-
TERING.
   GCCk . Given a graph G = (V, E) and an integer k, 2 ≤ k ≤ |V |, find a graph
M ∗ ∈ Mk (V ) such that

                            d(G, M ∗ ) =      min        d(G, M ).
                                            M ∈Mk (V )

   GCC1,k . Given a graph G = (V, E) and an integer k, 2 ≤ k ≤ |V |, find a graph
M ∗ ∈ M1,k (V ) such that

                            d(G, M ∗ ) =      min         d(G, M ).
                                           M ∈M1,k (V )


   In Section 2 of this paper we present a short survey of the known results involving
results on computation complexity and approximability of diﬀerent versions of the
graph clustering problem. In section 3 we propose a new k-approximation algirithm for
GCC1,k .


2    Known results
Apparently, NP-hardness of problem GCC was first proved by Křivánek and Morávek
[9] in 1986.
    At the beginning and in the middle of the 2000s, several groups of authors were
independently dealing with the diﬀerent versions of the graph clustering problems.
Chen, Jiang, and Lin [5] considered the closest phylogenetic root problems and also
proved NP-hardness of GCC. Bansal, Blum, and Chawla [3], using a reduction from
PARTITION INTO TRIANGLES, show that GCC is NP-hard even if all clusters are
of size 3 as a maximum.
    Shamir, Sharan, and Tsur [11] independently showed NP-hardness of problem GCC
by a reduction from the 3-exact 3-Cover problem. They also reduced the known NP-
complete problem of 2-coloring of 3-Uniform Hypergraph to problem GCC2 and as
512     V. Il’ev, A. Navrotskaya

a result they showed that problem GCCk is NP-hard for any fixed k ≥ 2. In both
cases Shamir, Sharan, and Tsur use rather complicated reduction. Later, Giotis and
Guruswami [7] published a more simple proof of the same result using a polynomial
reduction from the graph bisection problem.
   At the same time, Ageev, Il’ev, Kononov, and Talevnin [1] independently proved
that problems GCC2 and GCC1,2 are NP-hard on cubic (i.e., 3-regular) graphs and
deduced from this that all the above-mentioned variants of the graph clustering prob-
lems (including GCC1,k ) are NP-hard.
    In 2004, Bansal, Blum, and Chawla [3] presented a simple polynomial time 3-
approximation algorithm for GCC1,2 . For each v ∈ V their algorithm considers the fol-
lowing pair of clusters. The first cluster contains v and all neighbors of v in G = (V, E).
The second cluster contains all other vertices. The algorithm outputs the pair that min-
imizes the number of mismatched edges. More formally this algorithm may be described
as follows.
   Algorithm N1,2 (G).
   Step 1. For each vertex v ∈ V construct the cluster graph Mv ∈ M1,2 (V ) with
V1 = {v} ∪ N (G) and V2 = V \ V1 as the vertex sets of connected components of Mv .
   Step 2. Pick the graph Mv with minimal distance from G, i.e.,

                                   d(G, M ) = min d(G, Mv ).
                                              v∈V

    In 2006, Ageev, Il’ev, Kononov, and Talevnin [1] proved the existence of a ran-
domized PTAS for problem GCC1,2 by reducing this problem to the graph bisection
problem, and Giotis and Guruswami [7] presented a randomized PTAS for problem
GCCk (for any fixed k ≥ 2). In 2007, Il’ev, Navrotskaya, and Talevnin [8] considered
a local search algorithm for problem GCC1,2 . They showed that if the number of
edges in a graph is subquadratic of the number of vertices, the local search algorithm
is asymptotically exact.
    In 2008, Coleman, Saunderson, and Wirth [6] pointed out that complexity of PTAS
from [7] make this scheme practically useless. They presented a 2-approximation algo-
rithm for problem GCC1,2 applying local search to the feasible solution obtained by
the 3-approximation algorithm from [3].
    In 2005, Charicar, Guruswami, and Wirth [4] proved that problem GCC is APX-
hard. They also constructed a 4-approximation algorithm for problem GCC by round-
ing a natural LP relaxation using the region growing technique. In 2008, Ailon, Charicar,
and Newman [2] proposed a randomized 2.5-approximation algorithm for GCC.


3     An approximation algorithm for GCC1,k

As it was already said, Coleman, Saunderson, and Wirth [6] presented a 2-approxi-
mation algorithm for problem GCC1,2 applying local search to the feasible solution
obtained by the 3-approximation algorithm N1,2 from [3]. Extending this strategy
we propose a polynomial time approximation algorithm NLS1,k for problem GCC1,k
when k > 2 and use algorithm from [6] for k = 2.
                                                       Graph Clustering: Local Search     513

    First we describe the algorithm formally. Denote by N (v) the set of all neighbors
of a vertex v.
    Algorithm NLS1,k
    Input: Given graph G = (V, E) and k > 2.
    Step 1. For each vertex v ∈ V construct the cluster graph Mv ∈ M1,k (V ) as follows:
if {v} ∪ N (v) = V , then Mv = K|V | , else Mv = N1,k (G, {v} ∪ N (v), k).
    Step 2. Use procedure LS1,k to modify each graph Mv .
    Step 3. Pick the graph Mv with minimal distance from G, i.e.,

                                 d(G, M ) = min d(G, Mv ).
                                             v∈V


   Function N1,k (G, V1 , k).
   Step 0. Let i = 1.
   Step 1. Denote by Gi the subgraph of graph G induced by the vertex set V \ (V1 ∪
V2 ∪ . . . ∪ Vi ).
   Step 2. Vi+1 = N1,2 (Gi ).
   Step 3. If V \ (V1 ∪ V2 ∪ . . . ∪ Vi+1 ) ̸= ∅ and i < k − 2, then i = i + 1 and go to
step 1. If V \ (V1 ∪ V2 ∪ . . . ∪ Vi+1 ) = ∅, then Vi+2 = Vi+3 = . . . = Vk = ∅. If i = k − 2,
then Vk = V \ (V1 ∪ V2 ∪ . . . ∪ Vk−1 ).
   Step 4. The sets V1 , V2 , . . . , Vk induce the connected components of the cluster graph
M.
   Return: The cluster graph M .
   Function N1,2 (G).
   Step 1. For each vertex v ∈ V construct the cluster graph Mv ∈ M1,2 (V ) with
V1 = {v} ∪ N (G) and V2 = V \ V1 as the vertex sets of connected components of Mv .
   Step 2. Pick the graph Mv with minimal distance from G, i.e.,

                                 d(G, M ) = min d(G, Mv ).
                                             v∈V

   Return: Subset V1 of the vertex set of the graph M .
   For any vertex u ∈ V define the value

                   impu (Vi , Vj ) = Vi+ (u) + Vj− (u) − Vi− (u) − Vj+ (u),

where Vi+ (u) is the number of vertices in the set Vi adjacent to u and Vi− (u) is the
number of vertices in Vi not adjacent to u.
   Procedure LS1,k
   Input: A graph G = (V, E) and a cluster graph M ∈ M1,k (V ). The sets V1 , . . . , Vk
are the vertex sets of the connected components of M .
   Step s (s = 1, 2, ...). Pick the vertex v ∈ V such that

                      impv (Vp , Vq ) = max ( max impu (Vi , Vj )).
                                       i=1,...k, j=1,...k,
                                        u∈Vi       j̸=i
514     V. Il’ev, A. Navrotskaya

If impv (Vp , Vq ) > 0, then move v from Vp to Vq and go to step s + 1, else stop.

   Comments. Algorithm NLS1,k uses the function N1,k for constructing a cluster
graph and the local search procedure LS1,k for improving the found cluster graph. In
the end the algorithm picks a cluster graph ’nearest’ to G.
    Function N1,k makes a partition of the vertex set of a given graph G = (V, E).
Consider the subgraph of G without the subsets of the vertex set V constructed at
the previous step. Function N1,2 solves problem GCC1,2 approximately and returns
a vertex set of the first connected component of the found cluster graph. This subset
is added to already constructed ones. At the next step the algorithm considers the
subgraph without all found subsets involving the last constructed subset. At step 4 we
have k subsets (some of which may be empty) which are a partition of the vertex set
of given graph G. This partition induces a cluster graph M and the algorithm returns
M.
    Function N1,2 takes as an input a graph G and for each vertex v of G constructs
a cluster graph whose first connected component has vertex set consisting of v and all
adjacent to v vertices. The second connected component consists of all other vertices
of G. Later, cluster graph ’nearest’ to G among all the constructed cluster graphs is
selected. Function N1,2 returns the vertex set of the first connected component of the
cluster graph, i.e., vertex v and all its neighbors.
    Procedure LS1,k takes as an input a given graph G = (V, E) and a cluster graph
M ∈ M1,k (V ) constructed by algorithm N1,k . On each iteration the algorithm finds
a vertex whose moving to one of the rest connected components leads to steepest
decreasing the distance bitween graphs G and M . If such vertex is found, it is moved
to the respective corresponding connected component, else the algorithm finishes.
    The followimg theorem shows that algorithm NLS1,k is k-approximation algorithm
for problem GCC1,k .
Theorem 1. For any graph G = (V, E) and an integer k, 2 ≤ k ≤ |V |, the following
bound occurs:
                           d(G, M ) ≤ kd(G, M ∗ ),
where M is the cluster graph constucted by algorithm NLS1,k and M ∗ is an optimal
solution to GCC1,k .


Acknowledgement
This research was supported by the RSF grant 15-11-10009.


References
1. Ageev, A. A., Il’ev, V. P., Kononov, A. V., Talevnin, A. S.: Computational complexity of
  the graph approximation problem. Diskretnyi Analiz i Issledovanie Operatsii. Ser. 1. 13(1),
  3–11 (2006) (in Russian). English transl. in J. of Applied and Industrial Math. 1(1), 1–8
  (2007)
                                                     Graph Clustering: Local Search       515

2. Ailon, N., Charikar, M., Newman, A.: Aggregating inconsistent information: Ranking and
  clustering. J. ACM. 55(5), 1–27 (2008)
3. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Machine Learning. 56, 89–113
  (2004)
4. Charikar, M., Guruswami, V., Wirth, A.: Clustering with qualitative information. J. Com-
  put. Syst. Sci. 71(3), 360–383 (2005)
5. Chen, Z.-Z., Jiang, T., Lin, G.: Computing phylogenetic roots with bounded degrees and
  errors. SIAM J. Comput. 32(4), 864–879 (2003)
6. Coleman, T., Saunderson, J., Wirth, A.: A local-search 2-approximation for 2-correlation-
  clustering. In: Algorithms – ESA 2008. LNCS, vol. 5193, pp. 308–319. Springer, Heidelberg
  (2008)
7. Giotis, I., Guruswami, V.: Correlation clustering with a fixed number of clusters. Theory
  of Computing. 2(1), 249–266 (2006)
8. Il’ev, V. P., Navrotskaya, A. A., Talevnin, A. S.: Polynomial time approximation scheme
  for the graph approximation problem. Vestnik Omskogo Universiteta. 4, 24–27 (2007) (in
  Russian)
9. Křivánek, M., Morávek, J.: NP-hard problems in hierarchical-tree clustering. Acta infor-
  matica. 23, 311–323 (1986)
10. Schaeﬀer, S. E.: Graph clustering. Computer Science Review. 1(1), 27–64 (2007)
11. Shamir, R., Sharan, R., Tsur, D.: Cluster graph modification problems. Discrete Appl.
  Math. 144(1–2), 173–182 (2004)