Quantifying and Reducing Imbalance in Networks

Yoosof Mashayekhi (yoosof.mashayekhi@ugent.be), Bo Kang (bo.kang@ugent.be), Jefrey Lijffijt (jefrey.lijffijt@ugent.be), Tijl De Bie (tijl.debie@ugent.be)

Ghent University, Ghent, Belgium

October 2021

Real-world data can often be represented as a heterogeneous network relating nodes of different types. For example, envision a network of the job market, where nodes may be job seekers, skills, and jobs, and where links to skill nodes could indicate having that skill (if linked to a job seeker) or having the skill as a requirement (for jobs). It can be relevant to consider the imbalance in such a network between the nodes of different types. In the example, this imbalance could correspond to the mismatch between supply and demand of jobs due to a mismatch in skills, an effect known as 'friction'. Identifying and reducing such friction is a problem of great economic and societal significance. We introduce a quantification of the imbalance in a network between two sets of nodes (nodes of different types, attributes, etc.) based on the embedding of a network, i.e., a real-valued vector space representation of the network nodes. Moreover, we introduce an algorithm named GraB (Graph Balancing) which ranks unconnected node pairs according to how well they would reduce the imbalance in a network if an edge were added between them. For example, in the job network, GraB could be used to rank skills that job seekers do not yet have but could strive to acquire, to move them closer in the embedding towards an area where there is an abundance of jobs, and hence to reduce job market imbalance. We evaluate GraB on several datasets, including a job market network, and find that GraB outperforms baselines in reducing network imbalance.

1 INTRODUCTION

Graphs (or networks) are natural models for a wide range of real-world structures [ 4 ], arising from, e.g., social networks [ 8 ], biology (e.g., protein-protein interaction networks) [ 26 ], and linguistics (e.g., word co-occurrence networks) [ 5 ]. Network embedding provides an efficient way to solve graph analytics problems by mapping nodes into a real-valued space; the resulting vectors can later be used as input features for a machine learning model [ 11 ]. Using these vector representations, machine learning methods can be applied to graph datasets to perform graph analysis tasks such as link prediction [ 12 ], information diffusion [ 9 ], and node classification [ 25 ].

An imbalance between two sets of nodes is an undesirable phenomenon in some networks. This paper studies how to quantify network imbalance using the embedding of a network and proposes a method to reduce network imbalance by adding a limited number of links to the network.

Motivation: There are many networks for which it is desirable to minimize the imbalance between specific sets of nodes. Let us consider an example of a job market network with three sets of nodes: job vacancies, job seekers, and skills. Job vacancies are connected to the skills they require, and job seekers are connected to the skills they have and possibly to the job vacancies they have shown an interest in.

Imagine there are many Python developers seeking a job, and few vacancies requiring Python programming, while there are many vacancies requiring Java programming but few Java developers. As a result, Java jobs would remain unfilled and many Python programmers would remain unemployed. With an ever faster evolving job market, such imbalances are increasingly common and serious, harming job market efficiency and ultimately the economy. Thus, quantifying such imbalances would provide policy makers with an objective picture of the current state of affairs.

Moreover, the ability to quantify imbalance also opens up the possibility of trying to reduce it through targeted interventions and incentives. While policy makers may not be able to influence employers to shift their requirements, they can provide courses and training material for specific worker profiles lacking sought-after skills, to shift their area of expertise and better meet the demand of the job market. In network terms, it is equivalent to adding a certain number of links connecting job seekers (let us call this the set of source nodes) to skills (auxiliary nodes), to reduce the imbalance between job seekers (source nodes) and job vacancies (target nodes).

Our approach: In a job market network, a matching between job seekers and job vacancies with lowest cost appears to be the ideal situation, where cost could be defined as the required training time of employees in the company, or the effort a job seeker has to make to be suited for a job. However, matching job seekers to connected jobs only is not always desirable, because a link may only indicate a job seeker’s expressed interest in a job. We therefore let job seekers be matched with jobs even if they are not connected.

Formally, we denote an undirected network by $G = (V, E)$, where $V$ and $E$ are the sets of nodes and links respectively. Moreover, we define three sets of nodes, namely source nodes $S$ (e.g. job seekers in the job market network), target nodes $T$ (e.g. job vacancies), and auxiliary nodes $A$ (e.g. skills). Sets $S$, $T$, and $A$ are three disjoint subsets of $V$. Given such a network with sets $S$ and $T$, we let every node in $S$ be matched with every node in $T$.

Hence, in effect we model a complete bipartite network $G'$ from node sets $S$ and $T$. We define the imbalance between $S$ and $T$ in $G$ as the cost of a so-called optimal fractional perfect b-matching (see, e.g., [ 1 ]) in $G'$. Specifying the fractional perfect b-matching requires the definition of the cost of matching any pair of nodes from $S$ and $T$ in $G'$. As the (inverse of the) distance between a pair of nodes’ embeddings in a network embedding usually represents some type of affinity between these nodes, we propose to let this cost be based on the embedding distance of nodes in $G$. More details are presented in Section 2.2.

Next, we define the problem of adding a limited number of links between sets $S$ and $A$ in the original graph $G$, to reduce the imbalance between sets $S$ and $T$. Adding links will change the embedding of the network and thus modify the cost of matching node pairs in $G'$. We propose a method called Graph Balancing (GraB) to tackle this problem, which is based on a local approximation of the imbalance that we may compute analytically, thus providing the necessary scalability.

Example: To understand the relevance of the embedding, consider the sample network shown in Figure 1a. The figure shows three clusters of nodes. In the top-left cluster, source and target nodes are mixed, whereas the bottom-left cluster contains nodes from only one of these two sets and the top-right cluster contains nodes from only the other. Our goal is to quantify which links between $S$ and $A$ (the green nodes here) would reduce the imbalance between $S$ and $T$. Figure 1b shows the network embedding after adding the top 200 links from GraB to the network. Now, $S$ and $T$ are well-mixed in the network.

Related work: Previous studies on imbalance in supply and demand on the job market mostly focus on the factors influencing the imbalance (such as retirement, salary, etc.) and do not consider the imbalance between nodes individually (different jobs require people with specific skills) [ 27, 29 ]. The literature on graph matching is related to our work, as we also define the imbalance in a network between two sets of nodes using the cost of a matching between them. However, the focus in this paper is not on the computational problem of identifying this matching; we simply use the cost of the optimal fractional perfect b-matching as the imbalance measure. These two relations are further discussed in Section 5.

The main contributions of this paper are:
• We define the imbalance between two sets of nodes, $S$ and $T$, in a network and propose the measure $\mathrm{im}(C, S, T)$ for quantifying it, where we compute the costs $C$ of matching node pairs using the embedding distances.
• We introduce the novel problem of reducing the imbalance in a network by adding a limited number of links.
• We also propose a novel generic method, Graph Balancing (GraB), to select those links. Because selecting them optimally is computationally challenging, we introduce a link utility that uses a local imbalance measure as a proxy and employ a greedy algorithm to select the links.
• To better understand the merits of GraB, we propose two baselines (a naive and a more intelligent baseline) for the novel problem of reducing the imbalance in a network.
• GraB is proposed as a generic method, applicable to a wide range of network embedding methods. We also develop a concrete instantiation using conditional network embedding (CNE) [ 16 ], a state-of-the-art network embedding method.
• We perform several experiments to compare the performance of GraB and baselines in reducing the imbalance in a network embedding. The experiments show that GraB outperforms the baselines in this task.

Outline. In Section 2, we define and quantify imbalance and formalize the problem. In Section 3, we introduce our method GraB to reduce the imbalance in a network using its embedding. In Section 4, we provide an experimental evaluation of GraB. In Section 5, we discuss the related work. In Section 6, we conclude and outline avenues for future work.

2 PROBLEM DEFINITION

In this section, we first provide the relevant background and notation. Next, we provide a definition and quantification of imbalance in networks. Finally, we formulate the problem of reducing imbalance by adding links to a network.

2.1 Background

An undirected network is denoted by $G = (V, E)$, where $V$ is the set of $|V| = n$ nodes and $E \subseteq \binom{V}{2}$ is the set of links between nodes. For convenience, we will index the set of nodes by natural numbers, i.e. $V = \{1, \ldots, n\}$. Let $\mathbf{A}$ denote the adjacency matrix, and let $a_{ij}$ be the element of the adjacency matrix corresponding to the link between node $i$ and node $j$, i.e. $a_{ij} = 1$ if $\{i, j\} \in E$. Network embedding methods find a mapping from each node $i \in V$ to a $d$-dimensional real vector $x_i \in \mathbb{R}^d$. For convenience, these may be aggregated in a matrix $X = (x_1, \ldots, x_n)' \in \mathbb{R}^{n \times d}$. In this paper, we assume there is a given network embedding method to find $X$.

2.2 Network Imbalance

To generically define our proposed notion of imbalance, we develop it first for the concrete example of the job market. There, we define the imbalance as the cost of matching all job seekers to job vacancies. Here, we allow a job seeker and a job to be matched even if they are not connected by a link. Indeed, a link between a job seeker and a job vacancy might mean that the job seeker has applied for the job or has otherwise expressed an interest, not necessarily that they were employed for that job. Moreover, the absence of a link does not imply that the job seeker would not be a good candidate for the job. It is this property that distinguishes our work from the literature on combinatorial matching problems in graphs.

However, the skills to which the job seeker and job vacancy are linked, and the job vacancies they are linked to, provide information on whether the job seeker is suited for the job; and the more suited, the smaller the cost of a match between them should be. Hence, the cost of matching a job seeker and a job vacancy should be a function of the network structure, and adding or removing skills to that job seeker or job vacancy should influence the cost of matching them. The cost could therefore be defined using any model that provides link costs or link probabilities based on the network structure (so that the cost of node pairs can be computed from them). In this paper, we investigate using the embedding of the network for this, as it is a state-of-the-art approach to summarize the network structure, where proximity between node embeddings reflects the probability that both nodes should be linked.

2.2.1 Formal definition of network imbalance. The imbalance can be formalized as follows. We create a complete bipartite network (in which all node pairs between the two sets are linked), and assign equal weights to each node in a set, such that the sums of the weights in the two sets are equal. Given a cost defined for each link, $C = [c_{ij}]$ (e.g., based on the link probability or distance in the embedding space), we define the imbalance as the cost of the fractional perfect b-matching with minimum cost in the bipartite network.

A fractional perfect b-matching $M = [m_{ij}]$ is defined as an assignment of nonnegative real numbers to the links of a network such that the sum of those numbers over links incident to any node is equal to a specified weight of that node [ 1 ]. The cost of $M$ in an undirected network with $n$ nodes is then defined as $\sum_{i=1}^{n} \sum_{j=i+1}^{n} c_{ij} m_{ij}$. We thus define the imbalance in a network as follows:

Definition 1 (Imbalance). Consider a network $G = (V, E)$, two disjoint non-trivial subsets of its nodes referred to as the source nodes $\emptyset \subset S \subset V$ and the target nodes $\emptyset \subset T \subset V$ with $S \cap T = \emptyset$, and the costs $c_{ij}$ of matching the pairs of nodes $i$ and $j$ for all $i \in S$ and $j \in T$, arranged in a matrix $C = [c_{ij}]$. Moreover, consider the complete edge-weighted and node-weighted bipartite network $G' = (V' = S \cup T, E' = S \times T)$, with weight of edge $\{i, j\}$ with $i \in S$ and $j \in T$ equal to $c_{ij}$, and weight of node $i \in S$ equal to $w_i = \frac{1}{|S|}$ and weight of node $j \in T$ equal to $w_j = \frac{1}{|T|}$. Then the imbalance between nodes in sets $S$ and $T$, denoted as $\mathrm{im}(C, S, T)$, is defined as the cost of the minimum cost fractional perfect b-matching in $G'$.

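Computing the imbalance defined in this way can be done by solving a linear programming problem for finding the matching $M = [m_{ij}]$ in $G'$ that minimizes the overall cost:

$$
\begin{aligned}
\mathrm{im}(C, S, T) = \min_{M} \quad & \sum_{i \in S} \sum_{j \in T} c_{ij} m_{ij} \\
\text{s.t.} \quad & m_{ij} \geq 0 && \forall (i, j) \in S \times T, \\
& \sum_{j \in T} m_{ij} = w_i && \forall i \in S, \\
& \sum_{i \in S} m_{ij} = w_j && \forall j \in T.
\end{aligned}
$$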
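As a small illustration of this definition, the sketch below computes the imbalance for the special case in which both sets have the same size, using Euclidean distance between embeddings as the matching cost. It relies on the fact that, with uniform node weights, the linear program attains its optimum at a one-to-one assignment (Birkhoff-von Neumann theorem), so brute force over permutations is exact for tiny inputs. The function name and the toy coordinates are assumptions made for illustration; a general solution for sets of unequal size would need a proper LP or optimal-transport solver.

```python
from itertools import permutations
from math import dist


def imbalance_equal_sizes(source, target):
    """Cost of the minimum-cost fractional perfect b-matching when |S| == |T|.

    With node weights 1/n on both sides, feasible matchings are doubly
    stochastic matrices scaled by 1/n, so an optimum exists at a permutation.
    Brute force is exact here but only practical for a handful of nodes.
    """
    n = len(source)
    if n == 0 or n != len(target):
        raise ValueError("this sketch needs two non-empty sets of equal size")
    # Matching cost c_ij: Euclidean distance between the two embeddings.
    cost = [[dist(s, t) for t in target] for s in source]
    best = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n  # every matched pair carries mass 1/n


# Hypothetical 2-D embeddings of two source and two target nodes.
source = [(0.0, 0.0), (1.0, 0.0)]
target = [(0.0, 1.0), (1.0, 3.0)]
print(imbalance_equal_sizes(source, target))  # 2.0
```

Here the cheapest assignment pairs each source node with the target directly above it (costs 1 and 3), giving an imbalance of (1 + 3) / 2 = 2.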

The imbalance measure is the cost of this optimal matching. (The matching itself is not of interest to us for the purposes of this paper.)

2.2.2 The matching cost, and a relation to the earth mover’s distance. While the cost for each edge could be defined in several ways, network embeddings arguably offer a natural way to define them: our approach is to use the distance of nodes in the network embedding of $G$ as the matching cost of each node pair in $G'$. For embedding methods modeling first-order similarities in networks, these distances are directly related to the link probability between nodes. Moreover, the embedding of a node aggregates all relevant information about the network structure in relation to the node.

Interestingly, with this matching cost, the proposed definition of the imbalance is equivalent to the Earth Mover’s Distance (EMD) [ 23 ] between the source and target sets in the embedding space. The EMD computes the minimum cost to transform one distribution into another.

Proposition 1 (Network imbalance based on fractional b-matching and EMD are equivalent). Consider the two empirical distributions of the node embeddings for $S$ and $T$ in the embedding space. The EMD between these two distributions is equal to the network imbalance measure $\mathrm{im}(C, S, T)$.
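The equivalence can be checked numerically in a simple special case. The sketch below assumes one-dimensional embeddings and equally sized sets: there, the EMD between two uniform empirical distributions is obtained by pairing the sorted samples, while the matching-based imbalance can be found by brute force over assignments. Both function names and the sample values are illustrative assumptions, not part of the paper.

```python
from itertools import permutations


def emd_1d(source, target):
    """EMD between two equal-size 1-D samples with uniform weights.

    In one dimension with cost |x - y|, pairing the sorted samples is optimal.
    """
    n = len(source)
    return sum(abs(s - t) for s, t in zip(sorted(source), sorted(target))) / n


def matching_imbalance_1d(source, target):
    """Minimum-cost perfect matching cost (per unit mass 1/n), by brute force."""
    n = len(source)
    return min(
        sum(abs(source[i] - target[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    ) / n


# Hypothetical 1-D embeddings.
source = [0.0, 2.0, 5.0]
target = [1.0, 6.0, 3.0]
print(emd_1d(source, target), matching_imbalance_1d(source, target))  # 1.0 1.0
```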

We refer to Appendix A (see footnote 1) for a more precise formulation of this proposition and a proof.

2.3 Formulating the problem of reducing network imbalance

As the cost of matching node pairs depends on the structure of the network, it will change by modifying the network. Motivated by the job market example, we propose the operation of adding links as the kind of modification that can be made. We further propose to restrict the problem by allowing links to be added only between nodes from the source set and nodes from an auxiliary set of nodes. This is again motivated by the job market, where we can realistically add new links between skills and job seekers (by training job seekers), but not between job vacancies and skills. More formally, we introduce the following problem:

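Problem 1 (Imbalance Reduction). Given a network $G = (V, E)$ and three mutually disjoint sets of nodes, source nodes $\emptyset \subset S \subset V$, target nodes $\emptyset \subset T \subset V$ and auxiliary nodes $\emptyset \subset A \subset V$, the cost of matching each pair of nodes $C = [c_{ij}]$ (that depends on the network structure $G$), and imbalance measure $\mathrm{im}(C, S, T)$, find the optimal links $\mathcal{E}$ connecting nodes from set $S$ with nodes from set $A$, that reduce the imbalance between the nodes in sets $S$ and $T$. Formally,

$$
\operatorname*{argmin}_{\mathcal{E} \subseteq S \times A,\ \mathcal{E} \cap E = \emptyset} \mathrm{im}(C_{\mathcal{E}}, S, T),
$$

where $C_{\mathcal{E}}$ is computed based on $G_{\mathcal{E}} = (V, E \cup \mathcal{E})$.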

In the next section, we introduce GraB to solve this problem.

1 https://github.com/aida-ugent/GraB/blob/master/GraB_appendix.pdf

3 REDUCING NETWORK IMBALANCE: GRAB

In this section, we introduce GraB, a generic method to solve the imbalance reduction problem, i.e., how to add links to a network in order to maximally reduce the imbalance in the network, as defined in Definition 1. We first provide a sketch of the solution and then provide more details in the respective sections below.

The exact minimization problem amounts to finding a set of links that jointly minimize the imbalance. This exact approach may be computationally intractable when the number of candidate links is large, since it may require computing the imbalance reduction for every possible set of links. Besides the vast number of possible sets, to compute the reduction in imbalance we need to re-embed the network and compute the imbalance again. Recomputing the embedding is computationally demanding, and is practically impossible to do even for a moderate number of candidate link sets. Motivated by these two problems, our approach is as follows.

Firstly, rather than re-embedding the network and observing the change in imbalance, we introduce a proxy measure for the change, based on an infinitesimal change to any link $\{i, j\}$. We refer to this proxy measure as the link utility. The link utility is based on three elements. (1) Since the formulation of $\mathrm{im}$ is a linear programming problem, the derivative of $\mathrm{im}$ w.r.t. links cannot be directly computed. Hence, we introduce a local imbalance measure to be used as a proxy for $\mathrm{im}$, and quantify the infinitesimal changes in links for that measure. (2) This local imbalance measure is then used to derive a measure of the utility of links. (3) The local imbalance measure relies on the estimation of the density of nodes at any point in the embedding. For this, we employ multivariate Gaussian kernel density estimation. These elements are presented in Section 3.1.

Secondly, the problem that we cannot test the imbalance reduction for all possible sets of links remains. Link utility does not account for any interactions between links on the amount of imbalance in the network. To find a good balance between accuracy and computational tractability, GraB employs a greedy selection strategy using link utility in combination with re-embedding the network every $b$ steps (the batch size). The GraB algorithm implementing this strategy is introduced in Section 3.2.

Finally, we need to select a suitable embedding method. We briefly discuss suitable methods in Section 3.3.

3.1 The Link Utility Measure

In the embedding of a network, there are areas where the set $S$ (or $T$) is denser than the other one. Given a network with $n$ nodes and their corresponding embeddings $\{x_1, \ldots, x_n\}$, the idea behind the proxy measure is that to reduce the imbalance, we should add links connecting source nodes with auxiliary nodes, such that the source nodes are moved to areas with a higher density of target nodes and fewer source nodes. We thus introduce a local imbalance measure to quantify the density imbalance between source and target node sets at any point in the embedding. Moving source nodes to areas with higher local imbalance (more target nodes than source nodes) would reduce the local imbalance in those areas. We define the utility of adding a link using the first-order approximation of this local imbalance measure.

3.1.1 Local Imbalance Quantification. Skipping for a moment how to quantify the density of a set of nodes at a specific point in the embedding, we define the local imbalance measure, which evaluates the imbalance between the embeddings of two sets of nodes $S$ and $T$ locally at any point in the embedding space, as follows:

Definition 2 (Local Imbalance Measure $\mathrm{lim}_{S,T}$). Given a network $G = (V, E)$ with the embedding $X$ and two disjoint sets of nodes, source nodes $S$ with $\emptyset \subset S \subset V$ and target nodes $T$ with $\emptyset \subset T \subset V$, denote the density function of the target nodes as estimated based on the embedding $X$ and evaluated at point $x$ by $p_T(x; X)$, and let $p_S(x; X)$ denote the estimated density function for the source nodes evaluated at $x$. We use the log ratio of the density of the two sets of nodes as local imbalance measure $\mathrm{lim}_{S,T}$, evaluated at point $x$:

$$
\mathrm{lim}_{S,T}(x; X) = \ln \frac{p_T(x; X)}{p_S(x; X)}.
$$

If the densities are differentiable and non-zero everywhere, $\mathrm{lim}_{S,T}$ is also differentiable and suitable for optimization.
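To make the log-ratio concrete, the following sketch evaluates a local imbalance measure of this form at a few points. It assumes an isotropic Gaussian kernel with bandwidth matrix h^2 I, which is a simplification chosen for brevity (the paper computes a full bandwidth matrix with Scott's rule); the function names, the bandwidth value, and the toy coordinates are all illustrative assumptions.

```python
from math import exp, log, pi


def gaussian_kde(x, points, h):
    """Isotropic Gaussian KDE at x, with bandwidth matrix h^2 * I (assumption)."""
    d = len(x)
    norm = (2.0 * pi * h * h) ** (-d / 2.0)
    total = 0.0
    for p in points:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        total += exp(-0.5 * sq_dist / (h * h))
    return norm * total / len(points)


def local_imbalance(x, source, target, h=1.0):
    """ln(p_T(x) / p_S(x)): positive where targets are denser than sources.

    Far away from all points the densities can underflow to zero, in which
    case the logarithm is undefined; this sketch does not guard against that.
    """
    return log(gaussian_kde(x, target, h)) - log(gaussian_kde(x, source, h))


# Hypothetical 2-D embeddings: sources on the left, targets on the right.
source = [(0.0, 0.0), (0.5, 0.0)]
target = [(3.0, 0.0), (3.5, 0.0)]
for point in [(0.0, 0.0), (1.75, 0.0), (3.0, 0.0)]:
    print(point, local_imbalance(point, source, target))
```

The value is negative near the source nodes, zero at the midpoint of this symmetric layout, and positive near the target nodes.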

Example: Let us illustrate the idea behind GraB and the local imbalance as a proxy to optimize the imbalance $\mathrm{im}$. Figure 2a shows a 2D embedding of a toy network with equal number of source and target nodes. Hence, each source node should be matched with exactly one target node to compute the imbalance $\mathrm{im}$. Visually, $t_1$ should be matched with $s_3$, since they are close to each other and the cost of matching them is low. Thus, $s_1$ and $s_2$ should be matched with $t_2$ and $t_3$, although the matching cost (their distance) would be high. GraB can then be used to move source nodes $s_1$ and $s_2$ to areas with higher local imbalance $\mathrm{lim}_{S,T}$, which is the area where $t_2$ and $t_3$ reside. If we move $s_1$ and $s_2$ closer to $t_2$ and $t_3$, the matching costs between them would be reduced and the imbalance $\mathrm{im}$ would be reduced as well. GraB’s goal is to find links between source and auxiliary nodes such that adding them would move the source nodes to areas with higher local imbalance $\mathrm{lim}_{S,T}$. GraB uses a link utility measure for this purpose, which will be discussed in the next section. Figure 2b shows the 2D embedding of the network after adding two links $\{s_1, a_3\}$ and $\{s_2, a_3\}$ suggested by GraB, demonstrating that adding the links indeed reduces $\mathrm{im}$, from 3.44 to 2.71.

3.1.2 Link Utility. We can now define the utility of a link:

Definition 3 (Link Utility). The utility of adding a link $\{i, j\}$ for reducing the local imbalance at the embedding $x_i$ of source node $i \in S$ is defined as the rate at which the local imbalance evaluated at $x_i$ changes when increasing $a_{ij}$, or mathematically: $\frac{d\,\mathrm{lim}_{S,T}(x_i; X)}{d a_{ij}}$.

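Note that two effects of adding the link $\{i, j\}$ are accounted for in this definition: the fact that the embedding of node $i$ will move, possibly to a denser or sparser region, and the fact that the density functions themselves may change. Both effects can be separated by computing the total derivative:

$$
\begin{aligned}
\frac{d\,\mathrm{lim}_{S,T}(x_i; X)}{d a_{ij}} ={}& \nabla_{x} \mathrm{lim}_{S,T}(x; X)\big|_{x = x_i}^{\top} \frac{\partial x_i}{\partial a_{ij}} && (1) \\
& + \sum_{k \in T} \nabla_{x_k} \mathrm{lim}_{S,T}(x; X)\big|_{x = x_i}^{\top} \frac{\partial x_k}{\partial a_{ij}} + \sum_{l \in S} \nabla_{x_l} \mathrm{lim}_{S,T}(x; X)\big|_{x = x_i}^{\top} \frac{\partial x_l}{\partial a_{ij}}, && (2)
\end{aligned}
$$

where the first term accounts for the change in position where the densities are measured, and the second and third terms account for the changes in both estimated densities.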

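Evaluating all these terms is costly. However, since changing $a_{ij}$ has a direct effect only on the embeddings of nodes $i$ and $j$ [ 17 ], we argue that the summation terms over target nodes $k \in T$ and over source nodes $l \in S$ where $l \neq i$ can be neglected. With $x$ set equal to $x_i$, we can thus approximate the utility as follows:

$$
\frac{d\,\mathrm{lim}_{S,T}(x_i; X)}{d a_{ij}} \approx \left[ \nabla_{x} \mathrm{lim}_{S,T}(x; X)^{\top} \frac{\partial x_i}{\partial a_{ij}} + \nabla_{x_i} \mathrm{lim}_{S,T}(x; X)^{\top} \frac{\partial x_i}{\partial a_{ij}} \right]_{x = x_i}. \qquad (3)
$$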

Computing this approximation is far more efficient than a brute-force computation of the utility. It is scalable especially if the derivatives $\frac{\partial x_i}{\partial a_{ij}}$ of the embeddings can be computed analytically. An embedding method for which that is the case is presented in Sec. 3.3.

3.1.3 KDE as Density Estimation Method. Kernel density estimation (KDE), also known as Parzen window estimation [ 21 ], is a nonparametric density estimator. The flexibility arising from KDE’s non-parametric nature makes it a very popular approach for data drawn from a complicated distribution [ 6 ]. We use KDE as the density estimator in the local imbalance measure.

Definition 4 (Kernel Density Estimation [ 21 ]). Given a set of $d$-dimensional points $\{x_1, \ldots, x_m\}$ forming the rows of data matrix $X$, and a kernel function $K_H$, the KDE for an arbitrary point $x$ is defined as: $p(x; X) = \frac{1}{m} \sum_{k=1}^{m} K_H(x - x_k)$.

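For the kernel, we use multivariate Gaussian KDE:

$$
K_H(x) = (2\pi)^{-d/2} |H|^{-1/2} e^{-\frac{1}{2} x^{\top} H^{-1} x},
$$

where the so-called bandwidth matrix $H$ is computed using Scott’s rule [ 24 ]. Thus, we can quantify the utility (Definition 3) of adding link $\{i, j\}$ at node $i$ by rewriting Eq. (3) with $p$ as the density estimator. Doing this, observe that $\nabla_{x_i} K_H(x - x_i)|_{x = x_i} = 0$, such that also $\nabla_{x_i} \mathrm{lim}_{S,T}(x; X)|_{x = x_i} = 0$. Hence, with this KDE as density estimator, Eq. (3) simplifies to:

$$
\frac{d\,\mathrm{lim}_{S,T}(x_i; X)}{d a_{ij}} \approx \nabla_{x} \mathrm{lim}_{S,T}(x; X)\big|_{x = x_i}^{\top} \frac{\partial x_i}{\partial a_{ij}}. \qquad (4)
$$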
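The simplified utility is a dot product between the gradient of the local imbalance at the source node's embedding and the direction in which that embedding moves when the link is strengthened. The sketch below illustrates this under two stated assumptions: the kernel is an isotropic Gaussian with bandwidth h^2 I (not the Scott's-rule matrix used in the paper), and the embedding derivative is simply passed in as a vector `dx_i`, which here is a hypothetical stand-in rather than the CNE derivative. All names and values are illustrative.

```python
from math import exp


def grad_log_kde(x, points, h):
    """Gradient w.r.t. x of ln p(x) for an isotropic Gaussian KDE.

    The kernel normalisation constant cancels in the ratio, so it is omitted.
    """
    weights = [
        exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, p)) / h**2)
        for p in points
    ]
    total = sum(weights)
    return [
        sum(w * (p[k] - x[k]) for w, p in zip(weights, points)) / (h**2 * total)
        for k in range(len(x))
    ]


def link_utility(x_i, dx_i, source, target, h=1.0):
    """First-order utility: grad_x ln(p_T / p_S) at x_i, dotted with dx_i.

    dx_i plays the role of the derivative of the embedding of node i with
    respect to the link weight; it must be supplied by the embedding method.
    """
    grad_t = grad_log_kde(x_i, target, h)
    grad_s = grad_log_kde(x_i, source, h)
    return sum((gt - gs) * dx for gt, gs, dx in zip(grad_t, grad_s, dx_i))


# Hypothetical layout: sources on the left, targets on the right.
source = [(0.0, 0.0), (0.5, 0.0)]
target = [(3.0, 0.0), (3.5, 0.0)]
x_i = source[0]
print(link_utility(x_i, (1.0, 0.0), source, target))   # moving towards targets
print(link_utility(x_i, (-1.0, 0.0), source, target))  # moving away from them
```

A link whose effect is to move the source node towards the region dense in targets gets a positive utility, and one that moves it away gets a negative utility.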

3.2 The GraB Algorithm

Having defined the network imbalance measure and the link utility measure to be used as a proxy, we can now craft a scalable algorithm to optimize the imbalance. We designed GraB using three concepts.

1. Greedy link selection: As discussed, solving Problem 1 exactly requires computing the imbalance reduction for every possible set of links between sets $S$ and $A$, which is computationally infeasible. Instead, GraB picks links greedily based on the link utility measure. Although the link selection step aims to provide the most beneficial links to add to the network, the network embedding after adding the links might be different from what is expected, due to the following two issues.

2. Include links in batches: The first issue is that we are using the utility of adding a link, assuming that the rest of the network embedding remains the same. However, the effect of adding a link may change after another link is included (especially if they are close to each other in the network). A solution would be to re-embed the network after the inclusion of each link, but this is computationally costly. Thus, as a trade-off between cost and accuracy, we add links to the network in several batches: in each iteration, we select a batch of links from the top candidates ranked by link utility, add them to the network, and re-embed the network. Additionally, since the embedding of a node is affected more by its direct links, we add at most one link per source node in a batch.

3. Post-hoc filtering: The second issue is that adding a link may actually not reduce the imbalance, for example because the source node moves a lot in the embedding and ‘jumps over’ the area of higher local imbalance $\mathrm{lim}_{S,T}$. This is the case when the derivative from Definition 3 is not a good approximation to the finite difference of the local imbalance measure. An example is shown in Figure 3, which shows the heatmap of $\mathrm{lim}_{S,T}$ on a 2-dimensional embedding of the Weibo network [ 28 ] using conditional network embedding. In this example, a source node jumps over the area with a high $\mathrm{lim}_{S,T}$ and ends up in a worse position in terms of $\mathrm{lim}_{S,T}$.

Hence, to find the most beneficial links to add to the network in each batch, we instead select the $\lambda \cdot b$ top candidate links using the link utility measure ($\lambda \geq 1$ controls how many more links than $b$ are added to the network for re-embedding). After adding $\lambda \cdot b$ candidate links, we re-embed the network and select the first $b$ links that caused their connected source nodes to be moved to positions with higher $\mathrm{lim}_{S,T}$. The parameter $\lambda$ has to be large enough so that at least $b$ links would end up in a better position after re-embedding.

GraB: In summary, we select links to add to the network to reduce the imbalance in several batches, each of size $b$. For each batch, we select the $\lambda \cdot b$ top candidate links (at most one link per source node) using the link utility measure, add them to the network, re-embed the network, and select the top $b$ links that caused their connected source nodes to move to positions with higher $\mathrm{lim}_{S,T}$.
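The batch procedure summarised above can be sketched as a short loop. This is not the authors' pseudocode (which is given in their appendix) but a minimal illustration under stated assumptions: `utility`, `embed`, and `local_imbalance` are hypothetical callbacks supplied by the caller, the symbols `k` (total number of links), `b` (batch size), and `lam` (oversampling factor) are names chosen here, the local imbalance before and after is evaluated under the respective embeddings, and candidates that fail the post-hoc filter are discarded so that the loop always terminates.

```python
def grab_sketch(candidates, k, b, lam, utility, embed, local_imbalance):
    """Select up to k (source, auxiliary) links in batches of size b.

    utility(link, emb)        -> link utility under the current embedding
    embed(links)              -> dict node -> position after adding `links`
    local_imbalance(pos, emb) -> local imbalance at position `pos`
    """
    chosen = []
    remaining = set(candidates)
    emb = embed(chosen)
    while len(chosen) < k and remaining:
        want = min(b, k - len(chosen))
        limit = max(want, int(lam * b))
        # Rank candidates by utility (pre-sorted so ties break deterministically).
        ranked = sorted(sorted(remaining), key=lambda l: utility(l, emb), reverse=True)
        batch, used_sources = [], set()
        for link in ranked:
            if link[0] in used_sources:
                continue  # at most one link per source node in a batch
            batch.append(link)
            used_sources.add(link[0])
            if len(batch) >= limit:
                break
        # Re-embed with the whole batch, then keep only links whose source
        # node ended up at a position with higher local imbalance.
        trial = embed(chosen + batch)
        passed = [
            l for l in batch
            if local_imbalance(trial[l[0]], trial) > local_imbalance(emb[l[0]], emb)
        ]
        kept = passed[:want]
        rejected = [l for l in batch if l not in passed]
        remaining.difference_update(kept, rejected)
        chosen.extend(kept)
        emb = embed(chosen)
    return chosen


# Toy 1-D stand-ins for the embedding method and the measures (all hypothetical).
base = {"s1": 0.0, "s2": 0.2, "a1": 5.0, "a2": -5.0}

def toy_embed(links):
    pos = dict(base)
    for s, a in links:
        pos[s] = (pos[s] + base[a]) / 2.0  # a link pulls the source halfway
    return pos

def toy_utility(link, emb):
    return emb[link[1]] - emb[link[0]]  # targets are assumed to sit near 5.0

def toy_local_imbalance(pos, emb):
    return -abs(pos - 5.0)

links = [(s, a) for s in ("s1", "s2") for a in ("a1", "a2")]
result = grab_sketch(links, k=2, b=1, lam=2, utility=toy_utility,
                     embed=toy_embed, local_imbalance=toy_local_imbalance)
print(result)  # [('s1', 'a1'), ('s2', 'a1')]
```

In the toy run, the second batch also tries the link ('s1', 'a2'), but it moves s1 away from the target-dense area and is therefore dropped by the filter.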

We call this method Graph Balancing (GraB). Full pseudocode for the algorithm is given in Appendix B (see footnote 2).

3.3 Choice of Embedding Method

The generic method GraB applies to network embedding methods where the optimal embeddings are differentiable w.r.t. changes in link strength, i.e. where $\frac{\partial x_i}{\partial a_{ij}}$, for $i \in S$ and $j \in A$ and with $x_i$ the embedding of node $i$, can be evaluated. To the best of our knowledge, two methods satisfy this requirement: LINE [ 25 ] and conditional network embedding (CNE) [ 16 ]. We chose to use CNE for three reasons. First, LINE uses the inner product as the similarity measure between node embeddings, whereas KDE (and CNE) are based on the Euclidean distance. This mismatch would make the local imbalance proxy less effective. Second, re-embedding the network starting from an initial embedding is easily done with CNE, greatly speeding up GraB. And third, CNE was shown to outperform LINE.

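Let $\mathcal{H}$ denote the (analytically computable) Hessian of the objective function of CNE w.r.t. $x_i$. Then, with $\gamma$ a hyper-parameter of CNE, Kang et al. [ 17 ] showed that $\frac{\partial x_i}{\partial a_{ij}}$ is given by:

$$
\frac{\partial x_i}{\partial a_{ij}} = -\gamma \mathcal{H}^{-1} (x_i - x_j).
$$

2 https://github.com/aida-ugent/GraB/blob/master/GraB_appendix.pdf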

4 EXPERIMENTAL EVALUATION

In this section, we describe the experimental evaluation of GraB. In a qualitative experiment, we investigate question Q1: Does GraB work as expected in moving source nodes towards areas with more target nodes compared to source nodes? In the quantitative experiments, we investigate two questions. Q2: How does GraB perform in reducing the imbalance compared to the baselines? Q3: Does GraB consistently reduce the imbalance as the number of links added to the networks increases? In a hyper-parameter sensitivity experiment, we investigate question Q4: Is GraB sensitive to the batch size $b$? Finally, we investigate question Q5: How does GraB perform in terms of execution time compared to the baselines?

We first discuss the datasets, baselines, and settings. Next, we present the results of each experiment. The source code for repeating the experiments is available online (see footnote 3).

4.1 Datasets

We evaluated the methods using three datasets described below:

VDAB (see footnote 4): VDAB is the employment service of Flanders in Belgium. It provides a platform for job seekers to find jobs. The dataset contains a sample of the applications made by job seekers to available job vacancies from January 2018 until October 2020. We construct the job market network with three sets of nodes: job seekers, job vacancies, and skills. Job vacancies are connected to job seekers that have applied for them and to the skills they require. Our goal is to reduce the imbalance between job seekers (source nodes) and job vacancies (target nodes) by adding links connecting job seekers with skills. This could be seen as teaching a group of job seekers some skills, in a way that balances the job market network.

Weibo [ 28 ]: Weibo is the most popular Chinese microblogging service. The dataset contains tweets of the users and the topic distribution of each tweet. We construct the network with three sets of nodes: male users, female users, and topics. Users are connected to their friends (reciprocal follow relationships) and the top topics of their tweets. To find the top topics for each tweet, we first sort the topic probabilities (relevance of topic for the tweet) in descending order. Next, we select the top topics until the difference of the probabilities of two consecutive topics is greater than a very small threshold (1e-6). We select a sample from the dataset by only considering tweets for the first year of the data. Our goal is to reduce the imbalance between males (source nodes or target nodes) and females (target nodes or source nodes) by adding new links connecting males/females with topics (auxiliary nodes). This can be seen as recommending tweets with specific topics to the users to increase their interest in those topics. In the experiments, we call this dataset Weibo-mf if males are the source nodes and females are the target nodes. Otherwise, we call the dataset Weibo-fm.

MovieLens [ 14 ]: MovieLens is a web-based movie recommender system. The dataset contains 100,000 user ratings on movies. We construct the network with three sets of nodes: users, movies, and movie genres. There is a link between each user and movie for each rating. We also connect each movie to its genres. Our goal is to reduce the imbalance between movies (source nodes or target nodes) and users (target nodes or source nodes) by adding new links connecting movies with genres (auxiliary nodes).

3 https://github.com/aida-ugent/GraB
4 https://www.vdab.be/
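The top-topic rule described for the Weibo dataset can be read as follows: sort the topics of a tweet by probability, and keep taking topics until the gap between two consecutive probabilities exceeds the threshold. The sketch below implements that reading; the function name, the dictionary input format, and the example topics are assumptions made for illustration, and the authors' actual preprocessing may differ in detail.

```python
def select_top_topics(topic_probs, threshold=1e-6):
    """Return the leading topics whose probabilities are within `threshold`
    of the preceding one, after sorting in descending order.

    topic_probs: dict mapping topic name -> probability for a single tweet.
    """
    ranked = sorted(topic_probs.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    selected = [ranked[0][0]]
    for (_, prev_prob), (topic, prob) in zip(ranked, ranked[1:]):
        if prev_prob - prob > threshold:
            break  # first gap larger than the threshold ends the selection
        selected.append(topic)
    return selected


# Hypothetical topic distributions for two tweets.
print(select_top_topics({"sports": 0.4, "music": 0.4, "tech": 0.2}))  # ['sports', 'music']
print(select_top_topics({"news": 0.7, "tech": 0.3}))                  # ['news']
```

With such a small threshold, the rule effectively keeps the most probable topic together with any topics tied with it.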

We only used the largest connected component in each network. Table 1 shows the main statistics of each of the networks.

4.2 GraB-variants and baselines evaluated

As mentioned earlier, we are the first to introduce the concept of imbalance in a network in the way described in Section 2.2 and to reduce it by adding links to the network. However, there exist other methods that try to add links to the networks to make them more cohesive. We consider two of those methods [ 10, 20 ] for comparison.

Parotsidis et al. [ 20 ] minimize the average shortest path length in a network by adding links. Garimella et al. [ 10 ] compute the controversy between two sets of nodes using random walks that start in one set and end in the same or the other set. The main difference between imbalance and controversy is that the number of links between nodes in the same set has a major effect on the controversy, which is not necessarily the case when computing the imbalance. Moreover, we use the distance in the embedding space as the cost of matching node pairs, while they consider the actual links in the network to compute the controversy.
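To make the objective of the shortest-path baseline concrete, the sketch below computes the average shortest path length of a small connected, unweighted network and finds, by brute force, the single missing link whose addition reduces it the most. It only illustrates the objective; it is not the implementation of [ 20 ], and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def avg_shortest_path(adj):
    """Average shortest path length over all reachable ordered node pairs.

    adj: dict node -> set of neighbours (undirected, assumed connected).
    """
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def best_shortcut(adj):
    """Try every missing link and return the one that minimizes the average."""
    best_pair, best_value = None, avg_shortest_path(adj)
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue
        adj[u].add(v); adj[v].add(u)        # tentatively add the link
        value = avg_shortest_path(adj)
        adj[u].remove(v); adj[v].remove(u)  # undo the tentative addition
        if value < best_value:
            best_pair, best_value = (u, v), value
    return best_pair, best_value
```

On a path of five nodes, for example, the best single shortcut connects the two end points. Note that this objective does not distinguish source, target, and auxiliary nodes, which is one reason such a method need not reduce the imbalance.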

We also designed a random method combined with our proposed greedy algorithm, and a pure random method for comparison.

In summary, the following methods are evaluated:

GraB: The main method proposed in this paper.

S-GraB: ‘Simple Graph Balancing’ is the same as GraB without comparing the value of , of the previous and the new embedding of the source node (i.e. without post-hoc filtering). I.e., S-GraB selects links connecting source nodes with auxiliary nodes for each batch from the link selection step. To select links to add to the network, S-GraB runs iterations. S-GraB is evaluated to assess the importance of the post-hoc filtering.

ROV [ 10 ]: ‘Recommend opposing view’ adds links to the network to reduce controversy. In that work, links between high degree nodes in sets and that reduce controversy the most, are added using a greedy algorithm. We adapt this method to our problem setting by adding links between sets and using the same method.

SSW [ 20 ]: ‘Shortcuts for a smaller world’ adds links to the network to reduce the average shortest path length. In that work, links are added to the network using different strategies. We employ the greedy strategy, since it has the best performance [ 20 ].

Since the problem of graph balancing has not been studied before, we also propose two simple baselines for comparison:

S-Random: The ‘Simple random’ baseline selects random links connecting source nodes with auxiliary nodes.

I-Random: The ‘Intelligent random’ baseline is the same as GraB, except that we select random links connecting source nodes with auxiliary nodes in the link selection step. Since the link selection step is random, adding links in several batches is unlikely to improve the performance. Hence, I-Random adds links in one batch and re-embeds the network to select links that moved their source nodes to positions with higher , (post-hoc filtering). We expect that adding random links makes the graph denser, so that nodes tend to be closer to each other.
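The I-Random procedure can be sketched as below. This is an illustration only: the names `i_random`, `oversample`, `score_before`, and `reembed` are hypothetical, the oversampling factor is our assumed reading of the hyperparameter that multiplies the number of links, and the re-embedding and scoring are abstracted into a caller-supplied function rather than implemented here.

```python
import random

def i_random(candidates, n_links, oversample, score_before, reembed, seed=0):
    """Random link selection followed by post-hoc filtering (sketch).

    candidates: list of unconnected (source, auxiliary) pairs.
    n_links: number of links to return at most.
    oversample: assumed factor; n_links * oversample links are sampled.
    score_before: dict source -> score at its current embedding position.
    reembed: callable(sampled_links) -> dict source -> score after adding the
             sampled links and re-embedding (supplied by the caller).
    """
    rng = random.Random(seed)
    n_sample = min(len(candidates), n_links * oversample)
    sampled = rng.sample(candidates, n_sample)  # one batch of random links
    score_after = reembed(sampled)              # single re-embedding
    # Post-hoc filtering: keep links whose source node ended up better off.
    kept = [(s, a) for s, a in sampled if score_after[s] > score_before[s]]
    return kept[:n_links]
```

Dropping the filtering step and returning the first `n_links` sampled pairs directly would correspond to the S-Random baseline.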

In summary, I-Random selects · random links connecting source nodes with auxiliary nodes, re-embeds the network, and, like GraB, keeps only the links that moved their source nodes to a position with a higher , (post-hoc filtering).

4.3 Experimental Settings

In the quantitative experiments, we compare methods in terms of (Definition 1). We conduct the experiments using CNE with dimensions 2 and 4 and a combination of block and degree prior (see [ 16 ]). We average the results for I-Random and S-Random over 3 repetitions to smooth out random fluctuations.

In the qualitative experiment of GraB, we set = 10 and = 5. We run the experiment with a 2-dimensional embedding.

In the experiment in Sec. 4.5.1, the methods are evaluated on several datasets. We set = 100 for this experiment. We tune from values {25, 100} for GraB and S-GraB. We also tune from values {3, 5} for I-Random and GraB (S-GraB does not have hyperparameter , since it does not need to compare the value of , of the previous and the new embedding of the source node). For ROV, we use the authors' implementation for computing the controversy between two sets in a network, with its default hyper-parameters. We use 10% of the high degree source nodes and 20 high degree auxiliary nodes for the candidate selection in ROV. We also use the authors' implementation of SSW, which does not require setting any hyper-parameters.

In the experiment in Sec. 4.5.2, we analyze the behavior of GraB with different values of . We increase from 25 to 1000 by 25 and report values. We set = 25 and = 5.

In the experiment in Sec. 4.6, we analyze the behavior of GraB with different values of batch size . We set = 100 and = 5 for this experiment and vary from 1 to 100.

4.4 Qualitative Evaluation

In this section, we show that GraB moves source nodes to areas of the embedding space where a larger percentage of their neighbors are target nodes than before (Q1). We first add 10 links to each dataset. Next, we analyze two sample source nodes to which links were added and compare the percentage of target nodes in their neighborhood before and after. The percentage of target nodes among the 25 nearest neighbors (considering only source nodes and target nodes as neighbors) of the two sample source nodes is presented for each network in Table 2.
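The neighborhood statistic used in this evaluation can be computed as in the following sketch, which assumes Euclidean distance in the embedding space. The function name `target_share` and the dictionary-based inputs are hypothetical.

```python
import math

def target_share(node, positions, kind, k=25):
    """Percentage of target nodes among the k nearest neighbours of `node`.

    positions: dict node -> embedding coordinates (tuple of floats).
    kind: dict node -> "source", "target", or "auxiliary".
    Only source and target nodes are considered as neighbours.
    """
    others = [n for n in positions
              if n != node and kind[n] in ("source", "target")]
    # Rank the remaining nodes by Euclidean distance to `node`.
    nearest = sorted(
        others, key=lambda n: math.dist(positions[node], positions[n])
    )[:k]
    if not nearest:
        return 0.0
    return 100.0 * sum(kind[n] == "target" for n in nearest) / len(nearest)
```

Evaluating this function for the same source node on the embeddings before and after adding links gives the kind of comparison reported in Table 2.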

The number of target nodes among the 25 nearest neighbors of the two source nodes increased for each network. This means that GraB moved the source nodes to areas where they have more access to target nodes, and where fewer source nodes compete with them for the same target nodes.

4.5 Quantitative Evaluation

4.5.1 Baseline Comparison. Here we compare the methods in terms of on all datasets. We report on the main graph as well (Q2). Table 3 shows the result of this experiment.

GraB reduces over the main graph and outperforms the other methods on all datasets. S-GraB also outperforms the other baselines on most datasets. This shows that the link selection based on the link utility chooses suitable links to add to the network, and that the post-hoc filtering of links in GraB further improves the results.

For the baselines, I-Random improves over the main graph and S-Random. S-Random, however, is not effective and achieves the same results as the main graph. Hence, the greedy algorithm with post-hoc filtering proposed in this paper (applied in I-Random) is helpful even with a random link selection.

The other methods, SSW and ROV, do not perform particularly well (in some datasets they even increase the imbalance, and in the other datasets they reduce the imbalance less than GraB does), since they are designed for a different purpose and objective function.

As a result, we can conclude that both the link selection step based on link utility and the greedy algorithm of GraB with post-hoc filtering are crucial to solve the problem, and that both play an important role in reducing of the main graph.

4.5.2 Effect of adding a different number of links on GraB. In this experiment, we evaluate the performance of GraB for different values of , i.e., number of links added to the network (Q3). Figure 4 shows the result of this experiment for all datasets.

In most datasets, decreases as more links are added to the network. The amount of reduction in mostly decreases, or in some datasets even increases, as we add more links to the network. The reason is that in the beginning, there are more links available to add to the network. As we gradually add the most beneficial links to the network, the remaining candidate links might have smaller utilities and less certainty. Hence, GraB selects links from those candidates, which worsens the performance globally. GraB therefore does not consistently reduce the imbalance as the number of links added to the network increases.

In this section, we evaluate the sensitivity of GraB w.r.t. the batch size (Q4). Figure 5 shows the result of this experiment.

Smaller means fewer links are added to the network at a time and hence the network embedding is more stable. As a result, the algorithm finds the links with the highest utilities more accurately. The issue with small values of is that the effect of the added links on each other is neglected. For example, consider adding link {, } and {, } to a network one by one ( = 1). Since GraB adds {, } at the first step, it means that node has moved to an area with higher local imbalance , . Yet, it is possible that after adding {, } the local imbalance , at the position of node becomes less than at its original position (since moving node changes the density of the source nodes in the embedding space). On the other hand, by adding the two links at the same time ( = 2), {, } will be filtered out in the post-hoc filtering step and will not be selected, leaving the opportunity to select other links to reduce the imbalance.

On the other hand, a large is not always the best option either, because when · links are added at the same time, the densities of source nodes and target nodes are not stable and vary a lot (since by filtering some links and selecting only links, the actual densities after re-embedding will be different). Moreover, when so many links are added at the same time, the behavior of the embeddings is less predictable due to the impact of the links on each other. Hence, the method will be less accurate in selecting links to add to the network to reduce the imbalance.

Middle values of show more stable results in most datasets.

There is a blank spot for the Weibo-fm-CNE4 dataset at = 100 because GraB could not find 100 links to add to the network (after re-embedding, fewer than 100 links had moved their source nodes to a position with a higher , ). Hence, we did not report .

4.7 Execution Time

In this experiment, we compare the methods in terms of the execution time (Q5) with the same settings as experiment 4.5.1. Figure 6 shows the execution time in seconds (log-scale) for all methods, including the time for hyper-parameter tuning.

GraB and ROV have the highest execution time among all methods. GraB has a high execution time due to the number of hyperparameters to be tuned, the link selection step, and also re-embedding after adding each batch. ROV has a high execution time due to the large number of candidates and the time needed for computing the controversy after adding each candidate link to the graph.

S-GraB has a greater execution time than I-Random. To analyze this, we first count the number of times each method needs to re-embed the networks. I-Random re-embeds the networks once for each hyper-parameter setting because it does not add links in batches. Since we tune from 2 values, I-Random selects links and re-embeds the networks 2 times in total. S-GraB re-embeds the networks − 1 times for each value of . Since = 100 and we tune from values {25, 100}, S-GraB re-embeds the networks 3 times in total. In addition, S-GraB performs the link selection step times. Thus, it performs the link selection step 5 times in total (once when = 100 and 4 times when = 25). Moreover, the link selection step is more time consuming in S-GraB than in I-Random, which selects links randomly. Hence, S-GraB is slower than I-Random.
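The counts in this paragraph can be reproduced with a few lines of arithmetic. The sketch assumes that the number of link selection steps equals the number of links divided by the batch size, and that one fewer re-embedding is needed than selection steps; this reading is ours, but it matches the totals stated above. The function name is hypothetical.

```python
def s_grab_counts(n_links, batch_sizes):
    """Total link selection steps and re-embeddings over all tuned batch sizes.

    Assumption: for each batch size b, links are selected n_links // b times
    and the network is re-embedded once between consecutive batches.
    """
    selections = sum(n_links // b for b in batch_sizes)
    reembeddings = sum(n_links // b - 1 for b in batch_sizes)
    return selections, reembeddings

# 100 links, batch sizes tuned over {25, 100}
print(s_grab_counts(100, [25, 100]))  # (5, 3)
```

For a batch size of 25 this gives 4 selection steps and 3 re-embeddings, and for a batch size of 100 it gives 1 selection step and no re-embedding.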

Moreover, S-Random is a simple random method and its execution time is almost zero for all datasets.

5 RELATED WORK

Imbalance in the workforce has been studied in various systems [ 27, 29 ]. However, our work differs from these studies. Previous studies are mostly domain-specific: they analyze supply and demand based on domain-specific features such as the length of educational training programs, retirement, and salary. They also lack a global measure to quantify the imbalance between two different entities. In contrast, we approach the problem from a graph analysis perspective. We propose a quantification of the imbalance in the network using its embedding and a method to reduce the imbalance by adding links to the network.

Another line of research focuses on matching problems [ 18, 22 ]. Finding the cost of the fractional perfect b-matching [ 1 ] with minimum cost in bipartite networks is related to our work. In this problem, the goal is to find a matching between two sets of nodes in a network with minimum total cost. We define the imbalance as the cost of the minimum cost fractional perfect b-matching on a new bipartite network created from the original network. Our work differs from the studies focusing on matching, since we do not address the computational problem of how to find the matching; we only use the cost of the matching to quantify the imbalance. Moreover, we add links to the network (not directly between the two sets of nodes of interest) to change the cost of links between the two sets of nodes, and hence reduce the imbalance in the network.
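For reference, a minimum cost fractional perfect b-matching between two node sets can be written as the following linear program. This is the generic textbook form, given as a sketch; the symbols $S$, $T$, $c_{ij}$, $b_i$, and $x_{ij}$ are chosen here for illustration, and the specific bipartite network, costs, and capacities used for the imbalance are those of the definition in Section 2.2, which is not reproduced here.

```latex
\begin{align*}
\min_{x} \quad & \sum_{i \in S} \sum_{j \in T} c_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{j \in T} x_{ij} = b_i \quad \forall i \in S, \\
& \sum_{i \in S} x_{ij} = b_j \quad \forall j \in T, \\
& x_{ij} \ge 0 \quad \forall i \in S,\ j \in T .
\end{align*}
```

Here $c_{ij}$ is the cost of matching node $i$ with node $j$ and $b_i$, $b_j$ are the amounts each node must be matched; a feasible solution requires $\sum_{i \in S} b_i = \sum_{j \in T} b_j$. Only the optimal objective value is needed to quantify the imbalance, not the matching itself.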

Fairness in node embeddings is studied in various research papers [ 2, 3 ]. These studies try to learn an unbiased network embedding. The similarity between learning an unbiased embedding and reducing the imbalance in a network embedding is that both try to obtain a mix of nodes with different attributes or different types (job seeker and job vacancy, female and male, etc.). The main difference is that debiasing methods learn an unbiased embedding based on the original network, while we modify the network in order to make it more balanced. A secondary difference is in the quantification of the imbalance or unfairness. In our setting, we intend to bring two sets of nodes closer to reduce the imbalance, whereas the goal when reducing unfairness is that the two sets of nodes cannot be separated, which is not important in our case.

There are also several papers that aim to add links to a network to modify the network structure. Some of this research focuses on adding links to a network to make it more cohesive, where cohesiveness is quantified using network properties such as shortest paths [ 19, 20 ], diameter [ 7 ], information unfairness [ 15 ], controversy [ 10 ] and structural bias [ 13 ]. The paper by Garimella et al. [ 10 ] is most related to our work, since they add links to the network to reduce the controversy between two sets of nodes. However, our work differs in that we consider a different measure to compute the imbalance and use a different approach to optimize it.

6 CONCLUSIONS AND FUTURE WORK

We defined and quantified the concept of imbalance between two sets of nodes in a network, and introduced the novel problem of reducing that imbalance by introducing new links. We proposed GraB, a scalable algorithm to tackle this problem, leveraging a number of well-motivated heuristics to trade off speed against accuracy. We presented experiments applying GraB together with CNE as the network embedding method to various networks. The experimental results indicate that GraB outperforms (also newly proposed) baselines for reducing imbalance in a network embedding.

In future work, we plan to investigate the benefits of a new link for individual nodes (e.g., improving the access to a target set of nodes) instead of just for the global balance of the network, as well as other problem settings such as reducing the imbalance in a network by removing a specific number of links from the network (e.g., changing job contents and required skills, making jobs more accessible), or both adding and removing links at the same time. Moreover, a more detailed investigation of the impact of hyperparameters such as the KDE bandwidth would be useful.

ACKNOWLEDGMENTS

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC Grant Agreement no. 615517, from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme, and from the FWO (project no. G091017N, G0F9816N, 3G042220). Part of the experiments were conducted on pseudonymized HR data generously provided by VDAB.

REFERENCES

[1] Roger Behrend. 2013. Fractional perfect b-matching polytopes I: General theory. Linear Algebra Appl. 439, 12 (2013), 3822-3858.

[2] Avishek Joey Bose and William L Hamilton. 2019. Compositional fairness constraints for graph embeddings. arXiv preprint arXiv:1905.10674 (2019).

[3] Maarten Buyl and Tijl De Bie. 2020. DeBayes: a Bayesian method for debiasing network embeddings. arXiv preprint arXiv:2002.11442 (2020).

[4] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616-1637.

[5] Ramon Ferrer I Cancho and Richard V Solé. 2001. The small world of human language. Proceedings of the Royal Society of London. Series B: Biological Sciences 268, 1482 (2001), 2261-2265.

[6] Yen-Chi Chen. 2017. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology 1, 1 (2017), 161-187.

[7] Erik Demaine and Morteza Zadimoghaddam. 2010. Minimizing the diameter of a network using shortcut edges. In Scandinavian Workshop on Algorithm Theory. Springer, 420-431.

[8] Linton Freeman. 2000. Visualizing social networks. Journal of social structure 1, 1 (2000), 4.

[9] Sheng Gao, Huacan Pang, Patrick Gallinari, Jun Guo, and Nei Kato. 2017. A novel embedding method for information diffusion prediction in social network big data. IEEE Transactions on Industrial Informatics 13, 4 (2017), 2097-2105.

[10] Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis. 2017. Reducing controversy by connecting opposing views. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 81-90.

[11] Palash Goyal and Emilio Ferrara. 2018. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems 151 (2018), 78-94.

[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855-864.

[13] Shahrzad Haddadan, Cristina Menghini, Matteo Riondato, and Eli Upfal. 2021. RePBubLik: Reducing the Polarized Bubble Radius with Link Insertions. arXiv preprint arXiv:2101.04751 (2021).

[14] F Maxwell Harper and Joseph A Konstan. 2015. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2015), 1-19.

[15] Zeinab S Jalali, Weixiang Wang, Myunghwan Kim, Hema Raghavan, and Sucheta Soundarajan. 2020. On the Information Unfairness of Social Networks. In Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 613-621.

[16] Bo Kang, Jefrey Lijffijt, and Tijl De Bie. 2018. Conditional network embeddings. arXiv preprint arXiv:1805.07544 (2018).

[17] Bo Kang, Jefrey Lijffijt, and Tijl De Bie. 2019. Explaine: An approach for explaining network embedding-based link predictions. arXiv preprint arXiv:1904.12694 (2019).

[18] Vladimir Kolmogorov. 2009. Blossom V: a new implementation of a minimum cost perfect matching algorithm. Mathematical Programming Computation 1, 1 (2009), 43-67.

[19] Manos Papagelis, Francesco Bonchi, and Aristides Gionis. 2011. Suggesting ghost edges for a smaller world. In Proceedings of the 20th ACM international conference on Information and knowledge management. 2305-2308.

[20] Nikos Parotsidis, Evaggelia Pitoura, and Panayiotis Tsaparas. 2015. Selecting shortcuts for a smaller world. In Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM, 28-36.

[21] Emanuel Parzen. 1962. On estimation of a probability density function and mode. The annals of mathematical statistics 33, 3 (1962), 1065-1076.

[22] Lyle Ramshaw and Robert E Tarjan. 2012. On minimum-cost assignments in unbalanced bipartite graphs. HP Labs, Palo Alto, CA, USA, Tech. Rep. HPL-2012-40R1 (2012).

[23] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. 1998. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271). IEEE, 59-66.

[24] David W Scott. 2015. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.

[25] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web. 1067-1077.

[26] Athanasios Theocharidis, Stijn Van Dongen, Anton J Enright, and Tom C Freeman. 2009. Network visualization and analysis of gene expression data using BioLayout Express 3D. Nature protocols 4, 10 (2009), 1535.

[27] Graham Willis, Andrew Woodward, and Siôn Cave. 2013. Robust workforce planning for the English medical workforce. In Conference Proceedings, The 31st International Conference of the System Dynamics Society.

[28] Jing Zhang, Biao Liu, Jie Tang, Ting Chen, and Juanzi Li. 2013. Social influence locality for modeling retweeting behaviors. In Twenty-Third International Joint Conference on Artificial Intelligence.

[29] Pascal Zurn, Mario R Dal Poz, Barbara Stilwell, and Orvill Adams. 2004. Imbalance in the health workforce. Human resources for health 2, 1 (2004), 1-12.